Topic models arise from the need to understand and explore large text document collections and to predict their underlying structure. Latent Dirichlet Allocation (LDA) (Blei et al., 2003) has quickly become one of the most popular text modelling techniques. The idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. Unfortunately, topic models give no guarantee on the interpretability of their outputs. The topics learned from texts may be characterized by a set of irrelevant or unrelated words. Therefore, topic models require validation of the coherence of the estimated topics. However, the automatic evaluation of the latent space of a topic model is a difficult task. Formerly, the most widely used metric for evaluating the quality of a topic model was the held-out likelihood. Still, the literature has shown that this method emphasizes complexity rather than interpretability. Although many procedures have recently been proposed (Röder et al., 2015), the automatic evaluation of topic coherence remains an open research area. Our work aims to provide a new technique based on Statistically Validated Networks (Tumminello et al., 2011). Our approach consists of representing each topic as a network of its most probable words. The presence of a link between each pair of words is assessed by statistically validating their co-occurrences in sentences against the null hypothesis of random co-occurrence. We then propose a new coherence measure based on the structure of the statistically validated network. Furthermore, the new measure provides a ranking of topics and distinguishes high-quality from low-quality topics. The intuition is that the pairwise associations of words are closely related to the semantic coherence and interpretability of a topic.
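The link-validation step described above can be illustrated with a minimal sketch. It assumes a hypergeometric null model for sentence-level co-occurrence and a Bonferroni correction across word pairs; the abstract does not state either choice, and the function names, the `alpha` parameter, and the toy corpus are hypothetical. It covers only the construction of the validated links, not the coherence measure itself.

```python
from itertools import combinations
from math import comb


def hypergeom_sf(k, n_total, n_a, n_b):
    """P(X >= k), where X counts sentences containing both words
    if the two words were placed in sentences independently."""
    upper = min(n_a, n_b)
    denom = comb(n_total, n_b)
    num = sum(comb(n_a, x) * comb(n_total - n_a, n_b - x)
              for x in range(k, upper + 1))
    return num / denom


def validated_links(sentences, topic_words, alpha=0.01):
    """Return (word_a, word_b, p_value) for pairs whose co-occurrence
    is significant after a Bonferroni correction (an assumption)."""
    sents = [set(s) for s in sentences]
    n_total = len(sents)
    # Number of sentences in which each topic word appears
    count = {w: sum(w in s for s in sents) for w in topic_words}
    pairs = list(combinations(topic_words, 2))
    threshold = alpha / len(pairs)  # Bonferroni-corrected level
    links = []
    for a, b in pairs:
        k = sum(a in s and b in s for s in sents)
        if k == 0:
            continue  # no co-occurrence, nothing to validate
        p = hypergeom_sf(k, n_total, count[a], count[b])
        if p < threshold:
            links.append((a, b, p))
    return links


# Toy corpus (hypothetical): "cat" and "dog" always appear together.
corpus = [["cat", "dog"]] * 10 + [["car"]] * 10 + [["tree"]] * 20
links = validated_links(corpus, ["cat", "dog", "car"])
```

In this toy corpus only the pair ("cat", "dog") is retained as a link, since "car" never shares a sentence with the other two words.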
|Title of host publication||Book of Abstracts Third international conference on Data Science & Social Research|
|Number of pages||60|
|Publication status||Published - 2020|