To help me better understand this, I have started a small online project that calculates these scores for the bigrams of a given text document. You can use a text file from your machine, one of the preselected texts from Project Gutenberg, or the MED collection from the Classic3 dataset. The contingency table on which each score is based is also shown for each bigram. Note that a list of stopwords (based on Ken Church's ngrams tutorial) is currently applied, so bigrams that contain any of these words (e.g., "of", "they") are excluded from the analysis. I am not sure whether this list should be applied up front or not, and I am still researching a better way to handle this.
This is still very much a work-in-progress.
You can selectively add/remove words from the "top" bigrams by clicking on the words. "Noise" is an issue that has not been addressed here in any serious way yet. Lots of "nuisance" words show up (especially since the Project Gutenberg files were not modified in any way), and this is in spite of the use of a common set of stopwords. In fact, that is why I added the ability to remove additional words from the list dynamically by just clicking on them (either as the start word or the end word of a bigram).
Top Bigrams from Moby Dick (full list of 75 not included here) (from https://googledrive.com/host/0B2GQktu-wcTidC01Ym1lR2h1TTA/)
The (relatively simple) calculations of the log-likelihood scores are a straightforward translation of the LogLikelihood.java class from the Apache Mahout project. The handful of log-likelihood tests from that project also helped me find an issue with the calculations in the JavaScript version (fixed June 23). Also included are the "root log-likelihood ratios" (see this mailing list post by Ted Dunning for some background on this).
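For reference, here is a rough Python sketch of the entropy-based formulation as I understand it from the Mahout class. It is not the JavaScript from the project, the function names are my own, and the sign rule for the root version is my reading of it (negative when the bigram occurs less often than expected), so treat it as an approximation rather than the definitive version.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    if row_entropy + column_entropy < matrix_entropy:
        # Guard against tiny negative values caused by round-off.
        return 0.0
    return 2.0 * (row_entropy + column_entropy - matrix_entropy)


def root_log_likelihood_ratio(k11, k12, k21, k22):
    root = math.sqrt(log_likelihood_ratio(k11, k12, k21, k22))
    # Same comparison as k11/(k11+k12) < k21/(k21+k22), cross-multiplied
    # so that an empty row cannot cause a division by zero.
    if k11 * (k21 + k22) < k21 * (k11 + k12):
        root = -root
    return root
```

For the table (1, 0, 0, 1) the three entropies are all 2·ln 2, so the score works out to 4·ln 2 ≈ 2.7726, and multiplying every cell by 10 multiplies the score by 10. These make convenient spot checks.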
The contingency tables are included to assist with the ongoing debugging, and the plan is to expose more of the intermediate calculations and statistics for each "run".