A Tool for Exploring Co-occurrence Matrices and Recommenders

As part of becoming more familiar with recommender systems, I put together a simple (work-in-progress) tool to explore how recommendations are calculated using the basic methods discussed in Mahout in Action (and elsewhere).  The site is hosted on Google Drive here.

My main goal with this tool was to provide a way to explore the connections between the various entities used in the (mostly matrix) calculations - this is attempted via mouseover popups and dynamic highlighting of related quantities.

Figure 1.  A Tool for Exploring Simple Recommender Systems
https://googledrive.com/host/0B2GQktu-wcTiWHRwZFJacjlqODA/

There are three main sections in the visualization.
  • A section with the raw data of interactions between users and items, and the resulting user-item matrix of interactions
    • The raw data consists of lines of the form user,item; each line reflects that there was some kind of interaction between a user and an item: viewing a web page, clicking a link, making a purchase, watching some portion of a video.  This kind of data would typically be parsed from web log files, etc.
    • The tool is based on a simple boolean "was there an interaction?" signal, rather than on ratings.  It has been my impression that the importance of ratings is frequently overrated, given the noise that can accompany them.
    • The user-item matrix A is simply another view of the raw data, but it serves as the starting place for further analysis.  Moving your mouse over an entry of the user-item matrix will highlight the corresponding raw data, and vice versa.  I think it's nice to see this side-by-side with the raw data, as it helps to reinforce a connection that can be harder to grasp when viewing things in a static context (see Figure 1).  A short code sketch of building A from the raw data appears just after this list.
  • A section showing the calculation of the co-occurrence matrix itself
    • This is calculated by multiplying Aᵀ, the transpose of the user-item matrix, by A (this calculation is also sketched in code further below)
    • All of the intermediate matrices are shown here, and mousing over an entry of the co-occurrence matrix will highlight not only the relevant row of Aᵀ and column of A, but also the corresponding raw data itself (see Figure 2).  When you do this, you also see that the entries of the co-occurrence matrix are simply the similarities between the various columns of the user-item matrix.  While I may have basically known this, I found that seeing it materialize in front of me was a fairly powerful and effective mechanism for personal learning.  I also then saw that all of the other similarity measures (log-likelihood, Tanimoto, cosine, Pearson, etc.) can be viewed as simply alternative ways to define the matrix product, or equivalently, as replacing the dot product of columns with other functions of the column vectors of the user-item matrix.  This seemed to corral the swimming concepts a bit in my head in a surprisingly satisfying way.
    • I tried to use colors to reflect how the different pieces come together: yellow for the relevant row of Aᵀ, and blue for the relevant column of A, resulting in green in the co-occurrence matrix itself.  I am not sure how effective this is, but I think it is important that the colors are different.
  • A section showing the co-occurrence matrix, a (changeable) user-interaction vector, and the final recommendation weights that would be used for recommending new items
    • You can click the checkboxes to indicate an interaction with an item, and the recommendation vector is automatically updated
    • Putting your mouse over an entry of the recommendation vector itself will highlight the relevant columns of the co-occurrence matrix that actually contributed to the recommendation weight - these columns correspond to the entries of the user-interaction vector that are checked (see Figure 3; the recommendation calculation is also sketched in code below).
    • The popups in the recommendation vector are intended to cover a variety of cases:
      • when the entry corresponds to an item (or items) that would be recommended first
      • when the item would not be recommended because the user has already interacted with it (via the specified user interaction vector itself)
      • when the item is eligible for recommendation, but its calculated value in the recommendation vector is not the largest
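
To make this concrete, here is a minimal sketch of the first step: turning raw user,item lines into the boolean user-item matrix A.  The tool itself does not run Python; this sketch uses Python and numpy purely for illustration, and the tiny data set is made up.

    import numpy as np

    # Hypothetical raw interactions, one "user,item" pair per line.
    raw_lines = [
        "u1,item1", "u1,item3",
        "u2,item2", "u2,item3",
        "u3,item1", "u3,item2", "u3,item3",
    ]

    # Give each distinct user a row index and each distinct item a column index.
    pairs = [line.split(",") for line in raw_lines]
    users = sorted({u for u, _ in pairs})
    items = sorted({i for _, i in pairs})
    user_idx = {u: r for r, u in enumerate(users)}
    item_idx = {i: c for c, i in enumerate(items)}

    # Boolean user-item matrix A: A[u, i] = 1 if any interaction was observed,
    # no matter how many times it happened (no ratings).
    A = np.zeros((len(users), len(items)), dtype=int)
    for u, i in pairs:
        A[user_idx[u], item_idx[i]] = 1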

Figure 2.  Showing the connections between an entry of the co-occurrence matrix, the user-item matrix and its transpose, and the raw data itself
https://googledrive.com/host/0B2GQktu-wcTiWHRwZFJacjlqODA/
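
The co-occurrence matrix is then just Aᵀ multiplied by A, so each of its entries is the dot product of two columns of A.  Continuing the sketch above (again purely illustrative), with a Tanimoto function included as one example of swapping the dot product for a different column similarity:

    # Co-occurrence matrix: C[i, j] counts the users who interacted with both
    # item i and item j (the diagonal counts interactions with each item alone).
    C = A.T @ A

    # The same entry computed directly as a dot product of two columns of A;
    # this is the connection the mouseover highlighting traces back to.
    assert C[0, 1] == A[:, 0] @ A[:, 1]

    # Replacing the dot product with a different function of the two columns
    # gives the other similarity measures; Tanimoto (Jaccard) is one example.
    def tanimoto(x, y):
        both = np.sum((x > 0) & (y > 0))
        either = np.sum((x > 0) | (y > 0))
        return both / either if either else 0.0

    n_items = A.shape[1]
    S = np.array([[tanimoto(A[:, i], A[:, j]) for j in range(n_items)]
                  for i in range(n_items)])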

Figure 3.  Showing the connections between a calculated recommendation weight, the current user-interaction vector, and the relevant entries of the co-occurrence matrix
https://googledrive.com/host/0B2GQktu-wcTiWHRwZFJacjlqODA/
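
Finally, the recommendation weights are just the co-occurrence matrix multiplied by the user-interaction vector, with items the user has already interacted with excluded from consideration.  Still continuing the same illustrative sketch:

    # User-interaction vector u: 1 wherever the checkbox for that item is ticked.
    u = np.zeros(len(items), dtype=int)
    u[item_idx["item1"]] = 1

    # Recommendation weights: a sum of the columns of C that correspond to the
    # checked items (here just the item1 column).
    r = C @ u

    # Items the user has already interacted with are not eligible; among the
    # rest, the item with the largest weight would be recommended first.
    eligible = np.where(u == 0, r, -1)
    best = items[int(np.argmax(eligible))]
    print("weights:", r, "recommend first:", best)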

This has been a fun thing to put together, and for me it definitely helped highlight and reinforce various concepts related to recommender systems.  Please feel free to let me know of errors in my interpretation, places where clarification is needed, etc.  Learning is a subjective thing, and there may be additional little nuances that could be added to help better convey the details.
