One thing I wanted to be able to do easily is explore how sensitive the calculated log-likelihood ratio (LLR) scores are to the entries of the contingency matrix for each bigram. This feature has now been added to the tool.
To explore the impact of changes to a contingency matrix, you can either type specific values into one of the contingency tables or drag your mouse left or right on an entry to decrease or increase its value.
Explore the Impact of Changes in the Contingency Matrix Values on the LLR Scores (available for arbitrary text documents at https://googledrive.com/host/0B2GQktu-wcTidC01Ym1lR2h1TTA/)
Using a slightly modified version of the matrix from Dunning's post (http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html), we write the contingency matrix as
| | Starts with Word 1 | Does Not Start with Word 1 |
| Ends with Word 2 | k_11: Count of the bigram Word 1 Word 2 | k_12: Count of bigrams that end with Word 2, but do not start with Word 1 |
| Does Not End with Word 2 | k_21: Count of bigrams starting with Word 1 but not ending with Word 2 | k_22: Count of bigrams that do not start with Word 1 and do not end with Word 2 |
and the log-likelihood score is calculated from the terms k_11, k_12, k_21, k_22.
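For reference, here is a minimal sketch of one common way to compute the score from the four counts, using the entropy-based formulation from Dunning's post. The function names (`xlogx`, `entropy`, `llr`) are my own choices for illustration, and this is not necessarily the exact code the tool runs.

```python
import math


def xlogx(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts
    return xlogx(sum(counts)) - sum(xlogx(k) for k in counts)


def llr(k11, k12, k21, k22):
    # LLR = 2 * (row entropy + column entropy - matrix entropy)
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp at zero to guard against tiny negative rounding errors
    return max(0.0, 2.0 * (row_entropy + col_entropy - matrix_entropy))


print(round(llr(0, 100, 100, 0), 1))  # 277.3
```

A matrix whose rows are proportional, such as (10,10,10,10), gives a score of essentially zero, since the two words are then independent.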
The "exploring" involves watching how the scores respond to changes in the values k_ij. At the moment, you can only increase or decrease one entry at a time. In practice, other kinds of changes are also of interest: for example, keeping the total number of bigrams constant, so that a change to one k_ij forces a corresponding change in one or more of the other entries. Because the tool does not enforce this restriction, you may end up with a contingency matrix that could not occur for bigrams in an actual document.
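As a rough illustration of what such a constrained change could look like, the sketch below moves counts into k_11 while holding every row and column sum (and therefore the total number of bigrams) fixed. This is only one possible constraint, chosen here as an assumption; the helper name `shift_k11` is hypothetical and nothing like it exists in the tool.

```python
def shift_k11(k11, k12, k21, k22, delta):
    """Add `delta` to k11 while keeping all row and column sums unchanged.

    Hypothetical helper: holding the marginals fixed forces
    k12 and k21 to drop by `delta` and k22 to rise by `delta`.
    """
    new = (k11 + delta, k12 - delta, k21 - delta, k22 + delta)
    if min(new) < 0:
        # A negative count cannot correspond to any real document
        raise ValueError("shift would produce a negative count")
    return new


print(shift_k11(10, 20, 30, 40, 5))  # (15, 15, 25, 45)
```

With this kind of constraint the total stays at 100 in the example, and any shift that would push an entry below zero is rejected instead of producing an impossible matrix.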
You can also get results that look weird at first glance: very high LLR scores even though the bigram itself does not appear at all, at least based on how the entries of the matrix are interpreted. As an extreme example, the contingency matrix (k11,k12,k21,k22)=(0,100,100,0) has an LLR of 277.3, even though k11=0 means that the bigram did not occur at all. However, in this case the RootLLR is negative (at -16.7), indicating that the bigram appeared fewer times than expected (see http://s.apache.org/CGL). The RootLLR is defined as
RootLLR = signum[k11/(k11+k12) - k21/(k21+k22)] * sqrt(LLR)
= signum[0 - 1] * sqrt(277.3)
= -16.7
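The calculation above can be sketched in a few lines. This is an illustrative version under my own naming (`llr`, `root_llr`), not the tool's code. To avoid dividing by zero when a row is empty, the sign is taken from k11*k22 - k12*k21, which has the same sign as k11/(k11+k12) - k21/(k21+k22) whenever both row sums are positive.

```python
import math


def xlogx(x):
    # x * ln(x), with 0 * ln(0) taken as 0
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts
    return xlogx(sum(counts)) - sum(xlogx(k) for k in counts)


def llr(k11, k12, k21, k22):
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row_entropy + col_entropy - matrix_entropy))


def root_llr(k11, k12, k21, k22):
    # Same sign as k11/(k11+k12) - k21/(k21+k22) when both row sums are positive
    diff = k11 * k22 - k12 * k21
    sign = (diff > 0) - (diff < 0)
    return sign * math.sqrt(llr(k11, k12, k21, k22))


print(round(root_llr(0, 100, 100, 0), 1))  # -16.7
```

Swapping the diagonal, as in (100,0,0,100), gives the same LLR but a positive RootLLR, which is what makes the signed version useful for telling "more often than expected" apart from "less often than expected".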
There is much more to explore here.