Exploring Sensitivity of Log-Likelihood Scores for Bigrams

I have been playing with a simple online tool that will take a text document and calculate the log-likehood scores for all of the bigrams in the document, and generate the sorted list.

One of the things I wanted to be able to easily do is explore the sensitivity of the calculated log-likelihood scores on the entries of the contingency matrix for each bigram.  This feature has been added and the tool updated.

For exploring the impact of changes to a contingency matrix, you can either manually enter specific values in one of the contingency tables, or drag your mouse left or right on a specific entry in order to decrease/increase the values.


Explore the Impact of Changes in the
Contingency Matrix Values on the LLR Scores
(available for arbitrary text documents at 

https://googledrive.com/host/0B2GQktu-wcTidC01Ym1lR2h1TTA/)

Using a slightly modified version of the matrix from Dunning's ( http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html), we write the contingency matrix as


Starts with Word 1Not Word 1
Ends with Word 2k_11: Count of the bigram Word 1 Word 2k_12: Count of bigrams that end with Word 2, but do not start with Word 1
Does Not End with Word 2k_21: Count of bigrams starting with Word 1 but not ending with Word 2k_22: Count of bigrams that do not start with Word 1 and do not end with Word 2


and the log-likelihood score is calculated from the terms k_11, k_12, k_21, k_22.  

The "exploring" involves watching the impact of changes in the values k_ij.  At the moment, you can only increase/decrease a particular entry.  However, in reality there are additional changes that are of interest; namely, keeping the total number of bigrams constant when changing a particular value k_ij, so that there has to be a corresponding change in the other value(s) when k_ij is changed.  In fact, because of the lack of this restriction, it may be the case that you end up with a contingency matrix that could not occur for bigrams in a document.  

You can also get other on-first-glance weirdness: very high LLR scores without having the bigram itself appear at all - at least, based on how the entries of the matrix are interpreted.  As an extreme, for example, the LLR for the contingency matrix (k11,k12,k21,k22)=(0,100,100,0) has value 277.3, even though since k11=0 it would mean that the bigram itself did not occur at all.  However, in this case the RootLLR is negative (at -16.7), indicating that the bigram appeared fewer times than expected (see http://s.apache.org/CGL).  The RootLLR is defined as

RootLLR = signum[k11/(k11+k12) - k21/(k21+k22)] * sqrt(LLR)
        = signum[0 - 1] * sqrt(277.3)
        = - 16.7
                     There is much more to explore here.  

No comments:

Post a Comment

Popular Posts