The Smoothed-Dirichlet Distribution: A New Building Block for Generative Topic Models

In this work, we present the Smoothed Dirichlet (SD) distribution as an alternative to the multinomial and the more recent Dirichlet-Compound-Multinomial (DCM) distributions as a basic building block for generative topic models of text. We show that this distribution is as simple to estimate as the multinomial and as effective at capturing term occurrence statistics as the DCM, thus combining the most desirable properties of these two distributions in a single model.
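To make the estimation claim concrete, the following is a minimal sketch contrasting the multinomial maximum-likelihood estimate with a smoothed-count estimate of the kind the SD distribution is built on. The function names and the Jelinek-Mercer-style interpolation with a collection model are illustrative assumptions, not the paper's exact estimator; the point is that both estimates are simple closed-form normalizations of counts.

```python
import numpy as np

def multinomial_mle(counts):
    """Multinomial MLE: normalize raw term counts.
    Assigns zero probability to unseen terms."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def smoothed_estimate(counts, collection_probs, lam=0.1):
    """Illustrative smoothed-count estimate (assumed form, not the
    paper's exact formula): interpolate the document MLE with a
    background collection distribution, then renormalize.
    Unseen terms receive nonzero probability via the collection model."""
    theta = (1 - lam) * multinomial_mle(counts) + lam * np.asarray(collection_probs)
    return theta / theta.sum()
```

Both estimators are a single pass over the counts, which is the sense in which estimation stays as cheap as for the multinomial, while the smoothing step captures the burstiness that the raw MLE misses.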

We also argue that the particular form of KL-divergence ranking function used successfully in information retrieval performs well for the simple reason that it corresponds to log-likelihood with respect to the Smoothed Dirichlet distribution, a better generative model of text. We compared various generative distributions for text on the task of text classification and found that, besides outperforming the multinomial, SD is also significantly better than the DCM and ordinary Dirichlet distributions. SD therefore deserves serious consideration as a new building block in generative models for text.
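The KL-divergence ranking function referred to above is the standard one from language-modeling retrieval: documents are ranked by the negative KL divergence between the query and document language models, which differs from the cross-entropy sum over query terms only by a query-dependent constant, so ranking by it is ranking by a log-likelihood. A minimal sketch, with Jelinek-Mercer smoothing as an assumed (and common) choice of document smoothing:

```python
import math
from collections import Counter

def smoothed_dist(text, collection_probs, lam=0.1):
    """Jelinek-Mercer smoothed unigram model over the collection vocabulary:
    p(w) = (1 - lam) * tf(w)/|text| + lam * p(w | collection)."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: (1 - lam) * counts[w] / total + lam * p_c
            for w, p_c in collection_probs.items()}

def kl_score(query_dist, doc_dist):
    """Rank-equivalent to -KL(query || doc): the cross-entropy
    sum_w p(w|q) log p(w|d), dropping the query-only entropy term."""
    return sum(q * math.log(doc_dist[w])
               for w, q in query_dist.items() if q > 0)
```

A document sharing the query's terms scores higher than one that matches only through the background collection model, which is the ranking behavior the argument above attributes to SD log-likelihood.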