Understanding Document Aboutness Step Two: Identifying Interesting Things

Michael Gamon; Patrick Pantel

MSR-TR-2014-1 | January 2014

We define the notion of an interesting nugget in a document. Such a nugget attracts a user’s attention and leads them to explore more information around it. To measure and model interestingness, we look at browsing sessions within Wikipedia and build a data set of transitions (clickthroughs) from a source Wikipedia page to a destination Wikipedia page through anchor clicks. We investigate factors that influence the probability of a click on an anchor in a Wikipedia page. We propose a topic modeling approach that jointly models the contents of the source and destination pages. We then use the estimated posterior over the latent variables as features, along with page structure and user metadata features, to build a model of interestingness. Finally, we evaluate this model using different feature sets and demonstrate its effectiveness at predicting interesting nuggets. Experimental results show that the latent semantic features are effective in predicting interestingness and can outperform baseline features.
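
To make the clickthrough data concrete, the following is a minimal sketch of how one might estimate an empirical click probability from (source page, destination page) transitions. It is not the model proposed in the report: the function name `click_probabilities`, the toy page titles, and the simple relative-frequency estimate are all assumptions made for illustration only.

```python
from collections import Counter


def click_probabilities(transitions):
    """Estimate P(destination | source) by relative frequency.

    `transitions` is a list of (source_page, destination_page) pairs,
    one per observed anchor click. This is an assumed, simplified
    estimator, not the topic-model-based approach from the report.
    """
    pair_counts = Counter(transitions)
    source_counts = Counter(src for src, _ in transitions)
    return {
        (src, dst): count / source_counts[src]
        for (src, dst), count in pair_counts.items()
    }


# Hypothetical toy transitions (invented for illustration).
transitions = [
    ("Seattle", "Space Needle"),
    ("Seattle", "Space Needle"),
    ("Seattle", "Microsoft"),
    ("Redmond", "Microsoft"),
]

probs = click_probabilities(transitions)
for (src, dst), p in sorted(probs.items()):
    print(f"P({dst!r} | {src!r}) = {p:.3f}")
```

A relative-frequency estimate like this only counts clicks that were observed; it says nothing about anchors that were shown but never clicked, which is one reason a feature-based model such as the one described above is needed.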