Nowhere Near Ithaca: A Simple Online Tool for Exploring Bigrams in Text Documents

The log-likelihood ratio score described in Ted Dunning's "Surprise and Coincidence" blog post can be used for a variety of interesting applications. The score is simple to calculate, yet it appears able to pick out "anomalous" rare events, which makes it useful for filtering.

To help me understand this better, I have started a small online project that calculates these scores for the bigrams of a given text document. You can use a text file from your machine, one of the preselected texts from Project Gutenberg, or the MED collection from the Classic3 dataset. The contingency table on which the score is based is also shown for each bigram. Note that a list of stopwords (based on Ken Church's ngrams tutorial) is currently applied, so bigrams that include words such as "of" or "they" are left out of the analysis. I am not sure whether this list should be applied up front or not, and am still researching a better way to handle this kind of thing.
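As a rough illustration of how a bigram's contingency table can be built, here is a minimal sketch. It is not the tool's actual source: the function names (tokenize, countBigrams, contingencyTable), the crude letters-only tokenizer, and the choice to skip any bigram containing a stopword (rather than deleting stopwords from the text first) are all assumptions made for the example.

```javascript
// Hypothetical helpers for illustration only; not taken from the tool itself.

// Assumption: lowercase the text and keep runs of letters and apostrophes.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z']+/g) || [];
}

// Count adjacent word pairs, skipping any pair that contains a stopword.
function countBigrams(tokens, stopwords) {
  const pairCounts = new Map();   // "first second" -> count
  const firstCounts = new Map();  // word -> kept bigrams that start with it
  const secondCounts = new Map(); // word -> kept bigrams that end with it
  let total = 0;                  // number of kept bigrams
  for (let i = 0; i + 1 < tokens.length; i++) {
    const a = tokens[i];
    const b = tokens[i + 1];
    if (stopwords.has(a) || stopwords.has(b)) continue;
    const key = a + " " + b;
    pairCounts.set(key, (pairCounts.get(key) || 0) + 1);
    firstCounts.set(a, (firstCounts.get(a) || 0) + 1);
    secondCounts.set(b, (secondCounts.get(b) || 0) + 1);
    total++;
  }
  return { pairCounts, firstCounts, secondCounts, total };
}

// 2x2 table for the bigram (a, b):
//   k11 = a followed by b          k12 = a followed by something else
//   k21 = something else before b  k22 = neither a first nor b second
function contingencyTable(counts, a, b) {
  const k11 = counts.pairCounts.get(a + " " + b) || 0;
  const k12 = (counts.firstCounts.get(a) || 0) - k11;
  const k21 = (counts.secondCounts.get(b) || 0) - k11;
  const k22 = counts.total - k11 - k12 - k21;
  return { k11, k12, k21, k22 };
}

// Usage: build the table for "white whale" in a tiny sample sentence.
const sampleCounts = countBigrams(
  tokenize("The white whale and the white whale of the sea"),
  new Set(["the", "and", "of"])
);
console.log(contingencyTable(sampleCounts, "white", "whale"));
```

The four cells always sum to the total number of kept bigrams, which is a handy sanity check when debugging.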

This is still very much a work-in-progress.

Performance-wise, it seems to run fine for files up to about 1 MB (JavaScript web workers handle the raw processing of a file).

You can selectively add or remove words from the "top" bigrams by clicking on the words. "Noise" is an issue that has not yet been addressed here in any serious way. Lots of "nuisance" words show up despite the use of a common set of stopwords, especially since the Project Gutenberg files were not modified in any way. In fact, that is why I added the ability to remove additional words from the list dynamically, just by clicking on them (either as the start word or the end word of a bigram).
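The click-to-remove behavior can be sketched as a set of excluded words plus a filter over the scored bigrams. This is only an assumed shape, not the tool's code: the names toggleWord and visibleBigrams and the { first, second, score } record layout are hypothetical.

```javascript
// Hypothetical sketch of click-to-exclude filtering; names and data shape are assumptions.

// Clicking a word once excludes it; clicking it again restores it.
function toggleWord(excluded, word) {
  if (excluded.has(word)) {
    excluded.delete(word);
  } else {
    excluded.add(word);
  }
  return excluded;
}

// Keep bigrams whose start and end words are both allowed,
// then return the highest-scoring ones. The input array is not modified.
function visibleBigrams(scored, excluded, limit) {
  return scored
    .filter((b) => !excluded.has(b.first) && !excluded.has(b.second))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}
```

Because the filter checks both positions, excluding a word hides every bigram in which it appears, whether as the start word or the end word.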

Top Bigrams from Moby Dick (full list of 75 not included here)
(from https://googledrive.com/host/0B2GQktu-wcTidC01Ym1lR2h1TTA/)

The (relatively simple) log-likelihood calculations themselves are a straightforward translation of the LogLikelihood.java class from the Apache Mahout project. The handful of log-likelihood tests from that project were also used to find an issue with the calculations in the JavaScript version (fixed on June 23). The "root log-likelihood ratios" are included as well (see this mailing list post by Ted Dunning for some background).
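For readers who want to see the shape of the calculation, here is a minimal JavaScript sketch of the entropy-based formulation of the score for a 2x2 table with cells k11, k12, k21, k22. It is written from my understanding of that formulation and should be read as an approximation of the approach, not as the tool's code or a verbatim copy of the Mahout class.

```javascript
// Sketch of the log-likelihood ratio for a 2x2 contingency table.
// Function names here are illustrative.

// x * ln(x), with the usual convention that 0 * ln(0) = 0.
function xLogX(x) {
  return x === 0 ? 0 : x * Math.log(x);
}

// Unnormalized Shannon entropy of a list of counts:
// N * ln(N) - sum(k * ln(k)), where N is the sum of the counts.
function entropy(...counts) {
  let sum = 0;
  let parts = 0;
  for (const k of counts) {
    parts += xLogX(k);
    sum += k;
  }
  return xLogX(sum) - parts;
}

// LLR = 2 * (rowEntropy + columnEntropy - matrixEntropy).
function logLikelihoodRatio(k11, k12, k21, k22) {
  const rowEntropy = entropy(k11 + k12, k21 + k22);
  const columnEntropy = entropy(k11 + k21, k12 + k22);
  const matrixEntropy = entropy(k11, k12, k21, k22);
  // Guard against tiny negative values caused by floating-point round-off.
  if (rowEntropy + columnEntropy < matrixEntropy) {
    return 0;
  }
  return 2 * (rowEntropy + columnEntropy - matrixEntropy);
}

// Signed square root of the LLR: negative when the pair occurs
// less often than the rest of the table would suggest.
function rootLogLikelihoodRatio(k11, k12, k21, k22) {
  const root = Math.sqrt(logLikelihoodRatio(k11, k12, k21, k22));
  return k11 / (k11 + k12) < k21 / (k21 + k22) ? -root : root;
}
```

As a quick check, the table (1, 0, 0, 1) gives 4 ln 2, roughly 2.7726, and a perfectly independent table such as (1, 1, 1, 1) gives 0.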

The contingency tables are included to help with ongoing debugging, and the plan is to expose more of the intermediate calculations and statistics for each "run".
