Topic Model Validation

Eduardo H. Ramirez (a,∗∗), Ramon Brena (a), Davide Magatti (b), Fabio Stella (b,∗)

(a) Tecnologico de Monterrey, Campus Monterrey, Monterrey, MEXICO.
(b) DISCo, Università degli Studi di Milano-Bicocca, Viale Sarca 336, 20126 Milano, ITALY.

Abstract

In this paper the problem of performing external validation of the semantic coherence of topic models is considered. The Fowlkes-Mallows index, a known clustering validation metric, is generalized to the case of overlapping partitions and multi-labeled collections, thus making it suitable for validating topic modeling algorithms. In addition, we propose new probabilistic metrics inspired by the concepts of recall and precision. The proposed metrics have clear probabilistic interpretations and can also be applied to validate and compare other soft and overlapping clustering algorithms. The approach is exemplified by using the Reuters-21578 multi-labeled collection to validate LDA models, and Monte Carlo simulations are then used to show the convergence of the metrics to their correct values. Additional statistical evidence is provided to better understand the relations among the presented metrics.

Keywords: topic models, soft clustering, Fowlkes-Mallows index, Monte Carlo

∗ Corresponding author
∗∗ Principal corresponding author
Email addresses: eduardo.ramirez@itesm.mx (Eduardo H. Ramirez), ramon.brena@itesm.mx (Ramon Brena), magatti@disco.unimib.it (Davide Magatti), stella@disco.unimib.it (Fabio Stella)

Preprint submitted to Elsevier, April 11, 2011

1. Introduction

The continuously increasing amount of text available on the Web, news wires, forums and chat lines, business company intranets, personal computers, e-mails and elsewhere is overwhelming [1]. Information is switching from useful to troublesome. Indeed, it is becoming more and more evident that while the amount of data increases exponentially, our storing and processing capabilities do not grow at a comparable speed [2]. This trend strongly limits the extraction of valuable knowledge from text and thus drastically reduces the competitive advantage we can gain from it. Search engines have exacerbated the problem by dramatically increasing the amount of text available within a few keystrokes. Probabilistic Topic Modeling (PTM) methods comprise a new and promising family of unsupervised techniques for modeling text collections, one that goes in the direction suggested by Halevy et al. [3].

PTM is a particular form of document clustering and organization used to analyze the content of documents and the meaning of words, with the aim of discovering the topics mentioned in a document collection. Table 1 shows two topics, out of three hundred, derived from the TASA corpus, a collection of over 37,000 text passages from educational materials [4].

    Topic 247              Topic 5
    word        prob       word      prob
    Drugs       .069       Red       .202
    Drug        .060       Blue      .099
    Medicine    .027       Green     .096
    Effects     .026       Yellow    .073
    Body        .023       White     .048
    Medicines   .019       Color     .048
    Pain        .016       Bright    .030
    Person      .016       Colors    .029
    Marijuana   .014       Orange    .027
    Label       .012       Brown     .027

Table 1: Illustration of two (out of 300) topics extracted from the TASA corpus (language and arts, social studies, health, sciences). The ten most probable words (word) for each topic are listed together with the corresponding probability value (prob).

A variety of PTM models [5, 6, 7, 8] have been proposed, described and analyzed in the specialized literature. These models differ in the assumptions they make concerning the data generating process.
However, they all share the same rationale: a document is a mixture of topics. To describe how a PTM works, let P(z) be the probability distribution over topics z and P(w|z) be the probability distribution over words w given topic z. The topic-word distribution P(w|z) specifies the weight given to thematically related words. A document is assumed to be formed as follows: the i-th word w_i is generated by first extracting a sample from the topic distribution P(z), and then choosing a word from P(w|z). We let P(z_i = j) be the probability that the j-th topic was sampled for the i-th word token, while P(w_i | z_i = j) is the probability of word w_i under topic j. Therefore, the PTM induces the following distribution over the words within a document:

    P(w_i) = \sum_{j=1}^{T} P(w_i | z_i = j) P(z_i = j)    (1)

where T is the number of topics. The main idea of PTM can be summarized as follows: a document is a linear combination of multiple topics (Formula 1).
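To make the generative view concrete, the following minimal sketch samples a toy document according to this process; the two topics and their word distributions are invented illustrative values, not taken from the paper.

```python
import random

# Toy PTM with T = 2 topics over a 4-word vocabulary (invented values).
P_z = {0: 0.7, 1: 0.3}                       # topic distribution P(z)
P_w_given_z = {                              # word distributions P(w|z)
    0: {"drugs": 0.50, "medicine": 0.40, "red": 0.05, "blue": 0.05},
    1: {"drugs": 0.05, "medicine": 0.05, "red": 0.50, "blue": 0.40},
}

def sample_word():
    # Draw a topic z_i from P(z), then a word w_i from P(w | z_i).
    z = random.choices(list(P_z), weights=list(P_z.values()))[0]
    words = P_w_given_z[z]
    return random.choices(list(words), weights=list(words.values()))[0]

# A document is a bag of words drawn from the topic mixture (Formula 1).
document = [sample_word() for _ in range(20)]
print(document)

# Marginal P(w_i) = sum_j P(w_i | z_i = j) P(z_i = j):
p_red = sum(P_z[j] * P_w_given_z[j]["red"] for j in P_z)
print(f"P(red) = {p_red:.3f}")   # 0.7 * 0.05 + 0.3 * 0.50 = 0.185
```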
A topic is a probability distribution over a given vocabulary of words. Therefore, a document can be interpreted as a linear combination of probability distributions over a given vocabulary, where each probability distribution, i.e. topic, is associated with a specific argument, idea or theme.

Hofmann [6, 9] proposed the probabilistic Latent Semantic Indexing (pLSI) method, which makes no assumptions about how the mixture weights in Formula (1), i.e. P(z_i = j), are generated. Blei et al. [7] improved the generalizability of this model to new documents: they introduced a Dirichlet prior with hyperparameter α on P(z), thus creating the Latent Dirichlet Allocation (LDA) model. In 2004, Griffiths and Steyvers [4] introduced an extension of the original LDA model which associates a Dirichlet prior, with hyperparameter β, also with P(w_i | z_i = j). The authors suggested that this hyperparameter be interpreted as the prior observation count on the number of times words are sampled from a topic before any word from the corpus is observed. This choice smooths the word distribution in every topic, with the amount of smoothing determined by β. The authors showed that good choices for the hyperparameters α and β depend on the number of topics T and the vocabulary size W and that, according to the results of their empirical investigation, α = 50/T and β = 0.01 work well with many different document collections.

PTM originated in the psychology community, where the semantic coherence of topics was evaluated by focusing on replicating human performance or judgments. This strategy was implemented by performing standardized tests, comparing sense distinctions, and matching intuitions about synonymy [10].

Topic model validation is an extremely important task which aims to check whether a PTM is semantically meaningful or not. Indeed, PTMs assume that the latent space where topics are projected is semantically meaningful, and thus it can be exploited to validate the extracted topics, to summarize a given document corpus and to guide its contextual exploration. Furthermore, to ignore the analysis of the internal representation of topic models is in contrast with their rationale and development.

Several approaches have been proposed for topic model evaluation: measures based on held-out likelihood, sentiment detection or information retrieval, as discussed in Chang et al. [10]. Wallach et al. [11] summarized several evaluation techniques based on tools from language modeling. These techniques measure the fit of the learned knowledge when applied to unseen documents. The main drawback of such methods is that they only measure the probability of observations, while the internal representation of the topic models is ignored. An important exception is offered by Griffiths and Steyvers [8], where the authors have shown that the different senses of a word are correlated with the number of topics that word appears in. However, Chang et al. [10] pointed out that this is still not a deep analysis of the structure of the latent space, as it does not examine the structure of the topics themselves.

The problem of validating topic models without relying on the performance of a task was recently addressed by Chang et al. [10]. Their work, the first in that direction, shows by means of user studies that problems arise when using supervised classification predictive metrics.

In this paper we are concerned with the analysis and validation of the semantic coherence of the results obtained through PTM, and with the problem of comparing them with the results obtained through alternative models. The proposed approach consists of transforming each topic model into a "hard" overlapping partition through the discretization of the "soft" document-topic associations. Then, the validation is performed by exploiting novel probabilistic metrics, based on interpretations of widely accepted concepts such as "precision" and "recall". Also, we have generalized the Fowlkes-Mallows index (FM) [12, 13], an existing cluster validation metric, in order to make it suitable for validating overlapping clusterings. The generalization of the FM index was performed on the basis of its underlying probabilistic interpretation, and it allows us to link the Fowlkes-Mallows index to the semantic coherence of the model rather than to the mere similarity between cluster partitions. Novel metrics inspired by the widely known precision and recall concepts are also presented. The proposed validation approach has the following advantages with respect to existing metrics:

• it offers an explicit probabilistic interpretation;
• it allows the validation of overlapping partitions;
• it allows the validation of incomplete partitions.

The harmonic mean of precision and recall can be computed to obtain a combined single-metric measurement of the quality of the partition. The paper also shows how the proposed metrics allow a "drill-down" analysis into the individual clusters (topics), making it straightforward to determine:

1. which are the best/worst clusters in the partition;
2. which topics are better recalled by any given cluster.

The rest of this paper is organized as follows. In Section 2 we establish the notation and the main objects involved in our analysis. In Section 3 we propose a dissection of the Fowlkes-Mallows index. Section 4 introduces and describes the proposed metrics to evaluate topic model quality. The results of the numerical experiments performed on the Reuters-21578 data set are described in Section 5. Finally, conclusions and research directions are reported in Section 6.

2. Notation and problem statement

Once the soft-clustering solution of a multi-labeled corpus is discretized to obtain the corresponding hard clustering, the problem to be faced consists in correctly evaluating the quality of the resulting overlapping partitions. Let us first introduce the terminology and the notation used in the rest of the paper. Every "hard-clustering" problem applied to a multi-labeled document corpus involves the following elements:

• a dataset D = {d_0, ..., d_n} consisting of n documents;
• a partition of D into K clusters: U = {u_1, ..., u_K};
• a partition of D into S classes: C = {c_1, ..., c_S}.

Most of the existing validation metrics [14] can be expressed in terms of a |U| × |C| contingency table (Table 2), where the content of each cell n_ij represents the number of documents belonging to cluster u_i and class c_j.

            c_1     c_2     ...   c_S     Σ
    u_1     n_11    n_12    ...   n_1S    n_1∗
    u_2     n_21    n_22    ...   n_2S    n_2∗
    ...     ...     ...     ...   ...     ...
    u_K     n_K1    n_K2    ...   n_KS    n_K∗
    Σ       n_∗1    n_∗2    ...   n_∗S    n_∗∗

Table 2: Contingency table.

In the special case where clusters do not overlap and the document corpus is uni-labeled, the following properties hold:

1. \bigcup_{i=1}^{K} u_i = D;
2. u_i ∩ u_j = ∅ for all i, j = 1, ..., K with i ≠ j: there is no "overlap" in the cluster partition;
3. c_i ∩ c_j = ∅ for all i, j = 1, ..., S with i ≠ j: there is no "overlap" in the class partition.
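To make the notation concrete, here is a minimal sketch (ours, with invented assignments that match Example 1 of Section 3) that builds the contingency table of Table 2 from explicit cluster and class memberships; under the three properties above, the row marginals equal the cluster sizes and n_∗∗ = |D|.

```python
# Toy hard clustering of a uni-labeled, non-overlapping corpus.
clusters = {"u1": {"d1", "d2", "d3", "d4", "d5"},
            "u2": {"d6", "d7", "d8", "d9", "d10"}}
classes = {"c1": {"d1", "d2", "d3", "d6"},
           "c2": {"d4", "d5", "d7", "d8", "d9", "d10"}}

# n_ij = number of documents belonging to cluster u_i and class c_j.
n = {(u, c): len(clusters[u] & classes[c]) for u in clusters for c in classes}

# Marginal totals n_i*, n_*j and grand total n_**.
n_row = {u: sum(n[u, c] for c in classes) for u in clusters}
n_col = {c: sum(n[u, c] for u in clusters) for c in classes}
n_all = sum(n.values())

for u in sorted(clusters):
    print(u, [n[u, c] for c in sorted(classes)], "| n_i* =", n_row[u])
print("n_*j:", [n_col[c] for c in sorted(classes)], "| n_** =", n_all)
```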
In this work we consider the case where the aforementioned properties cannot be assumed to hold, which happens when:

• the thresholding procedure used to move from soft to hard clustering results in some documents being assigned to no cluster;
• the thresholding procedure used to move from soft to hard clustering results in some documents being assigned to more than one cluster;
• the document corpus is multi-labeled, and thus in principle every document can be assigned to no class, one class, or more classes.

3. Background

PTMs can be thought of as soft-clustering methods that discover probabilistic associations of documents to latent topics. Therefore, to evaluate the quality of the knowledge extracted by a PTM we can exploit metrics designed to compare heterogeneous soft-clustering methods in text collections. This strategy, based on external validation metrics, takes full advantage of multi-labeled corpora. To the best of our knowledge this is the first paper that proposes and develops such an approach. It is worthwhile to notice that most of the works on external clustering validation deal with the case of non-overlapping partitions of uni-labeled corpora. A comprehensive review of the traditional metrics used to validate non-overlapping partitions can be found in [14] and [15]. A fuzzy clustering validation approach for complete solutions is presented in [16]. In this section we focus on dissecting the Fowlkes-Mallows index, it being a probabilistic metric which can be extended to work in the overlapping and incomplete case.

Finally, it is important to clarify that clustering entropy, a probabilistic external metric commonly referred to in the literature, is not considered in detail in our analysis, despite its ability to validate overlapping clustering solutions. Indeed, a probability distribution over the class variable has to be constructed to compute the entropy of a cluster, and such a probability distribution can be constructed only in the case where the classes do not overlap. In our opinion this is an undesirable property for a topic model validation corpus such as Reuters-21578, where classes actually overlap.¹

¹ Certainly, a generalized version of the cluster entropy measure could be produced by mapping the original class labels into a new set of labels, where each distinct label combination is transformed into a new, composite class label. We consider, however, that such an approach would be excessively "strict" in the evaluation of the resulting solutions in terms of semantic coherence.

3.1. Dissecting the Fowlkes-Mallows index

Among the existing cluster validation metrics, a particularly interesting one is the Fowlkes-Mallows index [12, 13] (hereafter referred to as "FM"). Using the contingency table notation from Table 2, the FM index is defined as follows:

    FM = \frac{\sum_{ij} \binom{n_{ij}}{2}}{\sqrt{\sum_{i} \binom{n_{i*}}{2} \sum_{j} \binom{n_{*j}}{2}}}.    (2)

In order to analyze the FM index, the events associated with the experiment of randomly sampling two documents d_1 and d_2 without replacement from D are defined as follows:

• S_{u∗}: d_1 and d_2 belong to the same cluster;
• S_{∗c}: d_1 and d_2 belong to the same class;
• S_{uc}: d_1 and d_2 belong to the same cluster and class.

To denote the event of d_1 and d_2 belonging to class c_j we write S_{∗c_j}, whose probability is given by:

    P(S_{∗c_j}) = \binom{n_{*j}}{2} / \binom{n_{**}}{2} = h(n_{**}, n_{*j}, 2, 2)    (3)

where h(n_{**}, n_{*j}, 2, 2) represents the probability, according to the hypergeometric distribution, of obtaining 2 successes in a sample of size 2 drawn without replacement from a population of size n_{**} that contains n_{*j} successes.
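The identity in equation (3) is easy to check numerically; the sketch below (an illustration of ours, assuming SciPy is available) compares the pair-counting ratio with the hypergeometric probability for a class of 4 documents in a corpus of 10.

```python
from math import comb
from scipy.stats import hypergeom

def h(N, k, m, x):
    # h(N, k, m, x): probability of exactly x successes in a sample of
    # size m drawn without replacement from N objects with k successes.
    return hypergeom.pmf(x, N, k, m)

n_total, n_class = 10, 4        # n_** = 10 documents, n_*j = 4 in class c_j

p_pairs = comb(n_class, 2) / comb(n_total, 2)   # C(n_*j,2) / C(n_**,2)
p_hyper = h(n_total, n_class, 2, 2)             # h(n_**, n_*j, 2, 2)
print(p_pairs, p_hyper)                          # both 6/45 ≈ 0.1333
```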
In a similar manner, we write S_{u_i∗} to denote that d_1 and d_2 belong to cluster u_i, where the corresponding probability value is given by:

    P(S_{u_i∗}) = \binom{n_{i*}}{2} / \binom{n_{**}}{2} = h(n_{**}, n_{i*}, 2, 2).    (4)

The probability of two documents belonging to the same class can be computed from expression (3) to be:

    P(S_{∗c}) = \sum_j P(S_{∗c_j}) = \frac{1}{\binom{n_{**}}{2}} \sum_j \binom{n_{*j}}{2}    (5)

while the probability of two documents belonging to the same cluster can be computed from expression (4) to be:

    P(S_{u∗}) = \sum_i P(S_{u_i∗}) = \frac{1}{\binom{n_{**}}{2}} \sum_i \binom{n_{i*}}{2}.    (6)

Finally, the probability that two randomly sampled documents, drawn without replacement, belong to the same class and cluster is:

    P(S_{uc}) = \sum_{ij} P(S_{u_i c_j}) = \frac{1}{\binom{n_{**}}{2}} \sum_{ij} \binom{n_{ij}}{2}.    (7)

Then, the conditional probability that two randomly sampled documents belong to the same class given that they belong to the same cluster is:

    P(S_{∗c} | S_{u∗}) = \frac{P(S_{uc})}{P(S_{u∗})} = \frac{\sum_{ij} \binom{n_{ij}}{2}}{\sum_i \binom{n_{i*}}{2}}    (8)

while the conditional probability that they belong to the same cluster given that they belong to the same class is:

    P(S_{u∗} | S_{∗c}) = \frac{P(S_{uc})}{P(S_{∗c})} = \frac{\sum_{ij} \binom{n_{ij}}{2}}{\sum_j \binom{n_{*j}}{2}}.    (9)

It is worthwhile to note that the FM index (2) can be obtained by computing the geometric mean of the conditional probability that the pair of sampled documents belong to the same class given that they belong to the same cluster, P(S_{∗c} | S_{u∗}), and the conditional probability that the pair of sampled documents belong to the same cluster given that they belong to the same class, P(S_{u∗} | S_{∗c}). Therefore, expressions (8) and (9) allow us to write the following:

    FM = \sqrt{P(S_{∗c} | S_{u∗}) \, P(S_{u∗} | S_{∗c})}.    (10)

The previous formulations can also be expressed in terms of the hypergeometric distribution, defined as the probability of selecting exactly x successes in a sample of size m, obtained without replacement from a population of N objects of which k possess the characteristic of interest. Formally:

    h(N, k, m, x) = \binom{k}{x} \binom{N-k}{m-x} / \binom{N}{m}.    (11)

In practice, when N is greater than 50 and m/N ≤ 0.10, the hypergeometric distribution can be conveniently approximated by the binomial distribution [17]. So, the proposed formalization serves two goals: on one hand, it is helpful for computational purposes, as it allows the usage of cost-efficient approximations; on the other hand, it is useful to better understand the properties of the considered metric. For instance, by expressions (4) and (6), the probability of sampling two documents from the same cluster can be rewritten as P(S_{u∗}) = \sum_i h(n_{**}, n_{i*}, 2, 2), while the probability of sampling two documents from the same class becomes P(S_{∗c}) = \sum_j h(n_{**}, n_{*j}, 2, 2). In a similar fashion, we have P(S_{uc}) = \sum_{ij} h(n_{**}, n_{ij}, 2, 2). Thus, the conditional probabilities expressed above can be rewritten as:

    P(S_{∗c} | S_{u∗}) = \frac{\sum_{ij} h(n_{**}, n_{ij}, 2, 2)}{\sum_i h(n_{**}, n_{i*}, 2, 2)}    (12)

and:

    P(S_{u∗} | S_{∗c}) = \frac{\sum_{ij} h(n_{**}, n_{ij}, 2, 2)}{\sum_j h(n_{**}, n_{*j}, 2, 2)}.    (13)

Finally, by geometrically averaging (12) and (13), the new expression for (2) is:

    FM = \frac{\sum_{ij} h(n_{**}, n_{ij}, 2, 2)}{\sqrt{\sum_i h(n_{**}, n_{i*}, 2, 2) \sum_j h(n_{**}, n_{*j}, 2, 2)}}.    (14)

It is worthwhile to mention that equation (14) makes it easy to account for the effects of overlapping clusters when computing the FM index.
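As a consistency check of this section (our sketch, not the authors' code), the FM index of equation (2) and the geometric-mean form of equation (10) can be computed side by side on the contingency table of Example 1 below:

```python
from math import comb, sqrt

# Contingency table of Example 1 below: rows = clusters, columns = classes.
n = [[3, 2],
     [1, 4]]
n_row = [sum(r) for r in n]                 # n_i*
n_col = [sum(c) for c in zip(*n)]           # n_*j

pairs_uc = sum(comb(x, 2) for r in n for x in r)   # sum_ij C(n_ij, 2)
pairs_u = sum(comb(x, 2) for x in n_row)           # sum_i  C(n_i*, 2)
pairs_c = sum(comb(x, 2) for x in n_col)           # sum_j  C(n_*j, 2)

fm = pairs_uc / sqrt(pairs_u * pairs_c)            # equation (2)

# Equation (10): geometric mean of the two conditional probabilities.
p_c_given_u = pairs_uc / pairs_u                   # P(S*c | Su*)
p_u_given_c = pairs_uc / pairs_c                   # P(Su* | S*c)
print(fm, sqrt(p_c_given_u * p_u_given_c))         # both ≈ 0.488
```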
3.2. Overlapping partitions

When validating against a multiply labeled corpus such as Reuters-21578, the set of ground-truth classes results in overlapping partitions. In such a case the FM index cannot be computed by using equation (2), because the assumption of sampling without replacement does not hold. The main difficulty with overlap when computing the FM index is due to the use of the contingency table notation, which hides the probability being computed and easily results in making the wrong assumptions that n_{i*} = |u_i|, n_{*j} = |c_j| and n_{**} = |D|. The implications of such wrong assumptions are shown through the following example.

Example 1. Consider a non-overlapping partition consisting of 2 clusters and 2 classes. Let u_1 = {d_1, d_2, d_3, d_4, d_5} with {d_1, d_2, d_3} ∈ c_1 and {d_4, d_5} ∈ c_2, and let u_2 = {d_6, d_7, d_8, d_9, d_10} with {d_6} ∈ c_1 and {d_7, d_8, d_9, d_10} ∈ c_2. The situation can be conveniently summarized through the following contingency table:

            c_1        c_2        Σ
    u_1     3          2          n_1∗ = 5
    u_2     1          4          n_2∗ = 5
    Σ       n_∗1 = 4   n_∗2 = 6   n_∗∗ = 10

According to (3) and (5) we can compute P(S_{∗c}) as follows: P(S_{∗c}) = \sum_j P(S_{∗c_j}) = \sum_j h(n_{**}, n_{*j}, 2, 2) = \binom{4}{2}/\binom{10}{2} + \binom{6}{2}/\binom{10}{2} = 21/45, which is the correct probability value.

Now the following class overlapping scenario, due to multi-labeled documents, is considered. Let c_3 be such that {d_1, d_4, d_8, d_9, d_10} ∈ c_3. The corresponding contingency table is:

            c_1        c_2        c_3        Σ
    u_1     3          2          2          n_1∗ = 7
    u_2     1          4          3          n_2∗ = 8
    Σ       n_∗1 = 4   n_∗2 = 6   n_∗3 = 5   n_∗∗ = 15

Intuitively, we expect the class overlap to increase the value of P(S_{∗c}). However, equation (5) yields the incorrect result of 31/105, which is smaller than the previous correct value of 21/45. This is due to the fact that the sampling-without-replacement assumption no longer holds. Indeed, there are not \binom{15}{2} = 105 ways to select 2 documents, as that would allow sampling the same document more than once; the correct number of ways to select 2 elements is still \binom{|D|}{2} = \binom{10}{2} = 45. Moreover, the events S_{∗c_j} of sampling two documents from the same class j are no longer independent, so their probabilities cannot simply be added as in (5). Basically, when class and/or cluster overlap exists, the contingency table bins do not represent mutually exclusive events. Thus, the value of P(S_{∗c}) when class overlap exists is given by:

    P(S_{∗c}) = \sum_j h(|D|, |c_j|, 2, 2) − J(C)    (15)

where J(C) is the probability that a selected pair of documents belongs to two classes simultaneously, defined by the expression:

    J(C) = \sum_j \sum_{j′>j} P(S_{∗c_j} ∩ S_{∗c_{j′}})

or, according to the hypergeometric notation, by the following expression:

    J(C) = \sum_j \sum_{j′>j} h(|D|, |c_j ∩ c_{j′}|, 2, 2).

Formula (15) is a re-expression of (5) under the general addition rule of probability for non-independent events.² For instance, if any of the pairs {(d_4, d_8), (d_4, d_9), (d_4, d_10), (d_8, d_9), (d_8, d_10), (d_9, d_10)} is sampled, then S_{∗c_2} and S_{∗c_3} are both true, and this results in a double count. The correct value of P(S_{∗c}) is obtained by subtracting the probability of the classes' intersection (here only c_2 ∩ c_3 = {d_4, d_8, d_9, d_10} contributes pairs):

    P(S_{∗c}) = \frac{\binom{4}{2} + \binom{6}{2} + \binom{5}{2}}{\binom{10}{2}} − \frac{\binom{4}{2}}{\binom{10}{2}} = \frac{31 − 6}{45} = \frac{25}{45} ≈ 0.56.

² The general addition rule states that P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
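The overlap correction of equation (15) can be verified numerically on the example above; the following sketch (ours, valid under the pairwise-overlap assumption) recovers the value 25/45.

```python
from math import comb

D = {f"d{i}" for i in range(1, 11)}                  # |D| = 10 documents
classes = {"c1": {"d1", "d2", "d3", "d6"},
           "c2": {"d4", "d5", "d7", "d8", "d9", "d10"},
           "c3": {"d1", "d4", "d8", "d9", "d10"}}

total_pairs = comb(len(D), 2)                        # C(10, 2) = 45

# First term of (15): sum_j h(|D|, |c_j|, 2, 2) = 31/45 (overcounted).
p_naive = sum(comb(len(c), 2) for c in classes.values()) / total_pairs

# J(C): probability of drawing a pair that shares two classes at once.
names = sorted(classes)
J_C = sum(comb(len(classes[a] & classes[b]), 2)
          for i, a in enumerate(names) for b in names[i + 1:]) / total_pairs

print(p_naive - J_C)     # (31 - 6) / 45 = 25/45 ≈ 0.556
```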
However, the above formulas deal with the case where the class overlap is restricted to pairs. The case of general class overlap is more complex from both the theoretical and the computational point of view, and will be presented in a different work.

3.3. Incomplete partitions

When hardening a soft-cluster solution we potentially obtain overlapping and incomplete partitions; thus, the validation metrics should be sensitive to some form of "recall". In the FM index computation the basic assumption would be that the column marginal totals correspond to the sizes of the classes, i.e. n_{*j} = |c_j|, and that the row marginal totals equal the sizes of the clusters. As shown before, such an assumption is false when an overlap exists, and the same applies to cases where the clusters are incomplete. Measuring incomplete partitions with the FM contingency table is wrong: it incorrectly reduces the number of successes inside the population by using n_{i*} instead of |u_i|. Furthermore, the possibility of cluster overlap has to be taken into account. Therefore, the correct probability of selecting two documents from the same cluster is given by:

    P(S_{u∗}) = \sum_i h(|D|, |u_i|, 2, 2) − J(U)    (16)

where J(U) accounts for the probability of selecting a pair of documents belonging to two or more clusters, and is given by adding up the probabilities of the cluster intersections:

    J(U) = \sum_i \sum_{i′>i} P(S_{u_i∗} ∩ S_{u_{i′}∗})

or, by using the hypergeometric distribution:

    J(U) = \sum_i \sum_{i′>i} h(|D|, |u_i ∩ u_{i′}|, 2, 2).

It is worthwhile to note that formula (16) is also valid in the case where clusters do not overlap. However, although FM can be corrected to take into account some of the effects of partition incompleteness and/or overlap, we consider that its interpretation is biased toward measuring partition similarity, and thus we find it valuable to study new metrics that can serve better to estimate semantic coherence.
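Equation (16) admits the same kind of numerical check; in the sketch below (ours, with invented assignments) the clustering is both overlapping, since d4 and d5 belong to two clusters, and incomplete, since d8-d10 are left unassigned.

```python
from math import comb

D_size = 10                                    # |D| counts unclustered docs too
clusters = {"u1": {"d1", "d2", "d3", "d4", "d5"},
            "u2": {"d4", "d5", "d6", "d7"}}    # d4, d5 overlap; d8-d10 missing

total_pairs = comb(D_size, 2)                  # C(10, 2) = 45

# sum_i h(|D|, |u_i|, 2, 2): the cluster size |u_i| is used, not n_i*.
p_same_cluster = sum(comb(len(u), 2) for u in clusters.values()) / total_pairs

# J(U): discount pairs that lie in two clusters simultaneously.
names = sorted(clusters)
J_U = sum(comb(len(clusters[a] & clusters[b]), 2)
          for i, a in enumerate(names) for b in names[i + 1:]) / total_pairs

print(p_same_cluster - J_U)    # (10 + 6 - 1) / 45 = 15/45 ≈ 0.333
```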
4. Proposed metrics

In this section we introduce a version of the FM index adjusted for overlapping and incomplete clusters. Two overlapping "precision" metrics, together with their probabilistic interpretations, are given. Finally, we discuss the computation of a kind of cluster "recall" which can be combined with precision to obtain a single-metric performance.

4.1. Generalized Fowlkes-Mallows Index

As discussed in Section 3, if the FM index is expressed in terms of the contingency table it cannot be used to validate overlapping or incomplete clusters. The reason is that, while its additive terms come from the hypergeometric distribution, they would use an incorrect population size where cluster overlap is concerned. However, we have shown that, when re-expressing the FM index in terms of the hypergeometric distribution and correcting its formula to use the cluster size |u_i| and the class size |c_j|, the probabilities P(S_{∗c}) in (15) and P(S_{u∗}) in (16) are correct under the assumption that the maximum overlap equals two.

Therefore, the last step to obtain a generalized version of the FM index requires generalizing the computation of P(S_{uc}) in such a way that the non-independent events S_{u_i c_j} are correctly taken into account. This generalization requires computing the probability of the intersection of elementary events. For the whole contingency table, the sum of the probabilities of the intersections between "bins" is denoted by:

    J(U, C) = \sum_{ij} \sum_{i′, j′} P(S_{u_i c_j} ∩ S_{u_{i′} c_{j′}})

where i′ > i and j′ > j, and by using the hypergeometric probabilities we obtain:

    J(U, C) = \sum_{ij} \sum_{i′, j′} h(|D|, |S_{u_i c_j} ∩ S_{u_{i′} c_{j′}}|, 2, 2).    (17)

Note that the computation of J(U, C) requires the creation of an additional "overlap matrix" consisting of (|U| × |C|)² elements. Finally, the generalized result for P(S_{uc}) is given by:

    P(S_{uc}) = \sum_{ij} h(|D|, n_{ij}, 2, 2) − J(U, C).    (18)

Thus, the generalized version of the metric can be defined as the geometric average of:

• the probability that two randomly sampled documents belong to the same class, given that they belong to the same cluster, i.e.:

    P(S_{∗c} | S_{u∗}) = \frac{\sum_{ij} h(|D|, n_{ij}, 2, 2) − J(U, C)}{\sum_i h(|D|, |u_i|, 2, 2) − J(U)};    (19)

• the probability that two randomly sampled documents belong to the same cluster, given that they belong to the same class, i.e.:

    P(S_{u∗} | S_{∗c}) = \frac{\sum_{ij} h(|D|, n_{ij}, 2, 2) − J(U, C)}{\sum_j h(|D|, |c_j|, 2, 2) − J(C)}.    (20)

In conclusion, the generalized version of the FM index, referred to as GFM, is given by:³

    GFM = \frac{\sum_{ij} h(|D|, n_{ij}, 2, 2) − J(U, C)}{\sqrt{\left[\sum_i h(|D|, |u_i|, 2, 2) − J(U)\right] \left[\sum_j h(|D|, |c_j|, 2, 2) − J(C)\right]}}.    (21)

³ We are aware that this formulation may not be accurate in extreme cases of strongly overlapped collections; however, extending the formulation to work in such cases is straightforward. Moreover, we will show experimentally that the hypothetical error, which is in fact an underestimation of the actual probabilities, is negligible in real-world corpora such as Reuters-21578.
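Putting equations (15)-(21) together, the following sketch computes GFM exactly on the running example; this is our illustration under the pairwise-overlap assumption of footnote 3, and the inner loop mirrors the i′ > i, j′ > j convention of equation (17).

```python
from math import comb, sqrt

def pair_prob(k, n_docs):
    # h(|D|, k, 2, 2): both sampled documents fall among k marked ones.
    return comb(k, 2) / comb(n_docs, 2)

def corrected(groups, n_docs):
    # sum of h(|D|, |g|, 2, 2) minus the pairwise intersection term J(.).
    names = sorted(groups)
    p = sum(pair_prob(len(groups[a]), n_docs) for a in names)
    j = sum(pair_prob(len(groups[a] & groups[b]), n_docs)
            for i, a in enumerate(names) for b in names[i + 1:])
    return p - j

def gfm(clusters, classes, n_docs):
    u_names, c_names = sorted(clusters), sorted(classes)
    bins = {(u, c): clusters[u] & classes[c] for u in u_names for c in c_names}
    p_uc = sum(pair_prob(len(b), n_docs) for b in bins.values())
    # J(U, C) over bin pairs with i' > i and j' > j, as in equation (17).
    j_uc = sum(pair_prob(len(bins[u, c] & bins[u2, c2]), n_docs)
               for ui, u in enumerate(u_names) for u2 in u_names[ui + 1:]
               for ci, c in enumerate(c_names) for c2 in c_names[ci + 1:])
    return (p_uc - j_uc) / sqrt(corrected(clusters, n_docs) *
                                corrected(classes, n_docs))

clusters = {"u1": {"d1", "d2", "d3", "d4", "d5"},
            "u2": {"d6", "d7", "d8", "d9", "d10"}}
classes = {"c1": {"d1", "d2", "d3", "d6"},
           "c2": {"d4", "d5", "d7", "d8", "d9", "d10"},
           "c3": {"d1", "d4", "d8", "d9", "d10"}}
print(gfm(clusters, classes, 10))    # ≈ 0.63 on this toy example
```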
4.2. Partial Class Match Precision

This metric is inspired by the notion of precision utilized in the IR field. The Partial Class Match Precision (PCMP) measures the probability of randomly selecting two documents of the same class from a randomly sampled cluster. In contrast to FM, where we are concerned with the random sampling of two documents d_1 and d_2 from the document corpus, PCMP requires first randomly sampling a cluster and then randomly sampling two documents from the sampled cluster. In order to clearly differentiate the two random experiments, we use S̃_{∗c} to denote the event of selecting two documents belonging to the same class, sampled from a given cluster. Formally, the PCMP metric is defined as follows:

    P_{PM} = P(S̃_{∗c}) = \sum_i P(S̃_{∗c} | u_i) P(u_i)    (22)

where the prior probability of selecting the cluster u_i is given by P(u_i) = n_{i∗}/n_{∗∗}. PCMP thus measures the probability of the event S̃_{∗c}, i.e. of sampling two documents from the same class after having randomly selected a cluster.

However, the computation of each individual P(S̃_{∗c} | u_i) also needs to be generalized in the case of class overlap. Therefore, we need to add up the probabilities of selecting two documents from each class comprised within the cluster, P(S̃_{∗c_j} | u_i), under the general addition rule for non-independent events, which implies discounting the probability of a success in two classes simultaneously. Thus, each individual P(S̃_{∗c} | u_i) is given by:

    P(S̃_{∗c} | u_i) = \sum_j P(S̃_{∗c_j} | u_i) − J(u_i)    (23)

where J(u_i), the probability of sampling two elements belonging to two or more classes when selecting documents d_1 and d_2 from cluster u_i, is given by:

    J(u_i) = \sum_j \sum_{j′>j} P(S_{u_i c_j} ∩ S_{u_i c_{j′}}).    (24)

The previous equation represents the probability of selecting two elements from cluster u_i that simultaneously belong to two different classes. Thus, in order to obtain J(u_i) we need to compute the individual probabilities of selecting two documents that simultaneously belong to every distinct pair of classes (c_j, c_{j′}), and then add them up to obtain the probability of selecting two documents that simultaneously belong to any pair of classes. The individual probabilities can also be represented using the hypergeometric distribution, where the parameter accounting for the number of successful outcomes is the number of elements of u_i that belong to both c_j and c_{j′}, that is, the "overlap" between c_j and c_{j′} within the cluster:

    J(u_i) = \sum_j \sum_{j′>j} h(|u_i|, |u_i ∩ c_j ∩ c_{j′}|, 2, 2).    (25)

This metric is designed to work well with multi-labeled document corpora. The name "Partial" comes from the fact that in a multi-label setting the two randomly sampled elements d_1 and d_2 can be associated with many classes. As long as one of their classes matches, we consider the result to be semantically coherent, and thus a success. We consider this property of the metric a valuable feature for focusing on measuring semantic coherence rather than mere partition similarity. For instance, in contrast to similarity-oriented metrics, more than one clustering solution can achieve the maximum evaluation in terms of the PCMP metric. In fact, we can think of two clustering solutions that would obtain a PCMP value of 1, in which any pair of elements sampled from a given cluster belongs to the same class:

a) creating one cluster for every class, and assigning all the elements of c_i to u_i, so that k = |C|;
b) creating clusters of elements that share exactly the same class labels.

Finally, we should highlight that this metric can be efficiently approximated via Monte Carlo simulation. Indeed, we will exploit this method to demonstrate the correctness of the metric; a sketch of the exact computation is given below.
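The sketch below (ours) evaluates equations (22)-(25) on the running example; the cluster priors are taken proportional to the cluster sizes, one concrete reading of P(u_i) = n_{i∗}/n_{∗∗}.

```python
from math import comb

def pcmp(clusters, classes):
    # Partial Class Match Precision, equation (22):
    # P(S~*c) = sum_i P(S~*c | u_i) P(u_i).
    total = sum(len(u) for u in clusters.values())
    names = sorted(classes)
    value = 0.0
    for u in clusters.values():
        pairs = comb(len(u), 2)
        if pairs == 0:
            continue
        # sum_j P(S~*c_j | u_i): both documents share class c_j within u_i.
        p = sum(comb(len(u & classes[c]), 2) for c in names) / pairs
        # J(u_i), equation (25): discount pairs lying in two classes at once.
        j = sum(comb(len(u & classes[a] & classes[b]), 2)
                for i, a in enumerate(names) for b in names[i + 1:]) / pairs
        value += (p - j) * len(u) / total        # weight by P(u_i)
    return value

clusters = {"u1": {"d1", "d2", "d3", "d4", "d5"},
            "u2": {"d6", "d7", "d8", "d9", "d10"}}
classes = {"c1": {"d1", "d2", "d3", "d6"},
           "c2": {"d4", "d5", "d7", "d8", "d9", "d10"},
           "c3": {"d1", "d4", "d8", "d9", "d10"}}
print(pcmp(clusters, classes))    # 0.5 * 0.5 + 0.5 * 0.6 = 0.55
```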
4.3. Clustering Recall

In the IR field the "recall" measure represents the probability that a relevant document is retrieved. Therefore, for the clustering scenarios under consideration, where the completeness of the partition cannot be assumed, it is critical to provide clear ways to measure the completeness of the clustering. Let N_c be the total number of class assignments, given by the sum of the sizes of all the classes:

    N_c = \sum_j |c_j|.

In overlapping and incomplete clustering we must not rely on the values of the contingency table to compute recall values, because they can account for duplicates, and they do not consider elements not assigned to any cluster.

4.3.1. Class recall

If we are interested in measuring which classes are better captured by the clustering, it is straightforward to compute a class recall value. We define this "class recall" as the probability that a document d, randomly sampled from class c_j, is included in some cluster:

    R(c_j) = P([d ∈ \bigcup_i u_i] \,|\, c_j) = \frac{|\bigcup_{i=1}^{k} (u_i ∩ c_j)|}{|c_j|}.    (26)

In other words, equation (26) divides the number of documents labelled with class c_j that were recalled by any cluster u_i by the total number of documents labelled with class c_j.

4.3.2. Gross clustering recall

From the previous expression, and recalling that the probability of selecting a class is given by P(c_j) = |c_j|/N_c, it is possible to derive the following unconditional expression to measure the recall of the whole clustering:

    R_U = P(d ∈ \bigcup_{i}^{k} u_i) = \sum_j P(d ∈ \bigcup_{i}^{k} u_i \,|\, c_j) P(c_j)    (27)

where the probability of selecting each class is given by |c_j|/N_c. Therefore, (27) becomes:

    R_U = \frac{1}{N_c} \sum_j R(c_j) |c_j|.    (28)

4.4. Single-metric performance

In retrieval and classification it is widely known that it is trivial to achieve high recall at the expense of precision, and vice versa. Thus, traditionally they are averaged into a single metric, the F-score, which is nothing but the harmonic mean of precision and recall. Almost any two probabilities can be averaged in this way. However, for the particular case of topic-model validation we are interested in balancing the best measurement of semantic coherence with the best measure of completeness, so our proposed metric is defined by the harmonic average of equation (22) and equation (28):

    F_o = \frac{2 P_{PM} R_U}{P_{PM} + R_U}.    (29)

Notice that the selection of (22) and (28) comes at the expense of not penalizing some clustering dissimilarities. Thus, if the ultimate performance criterion is partition similarity, then the GFM may be a better metric of choice. Both components of the F_o metric are micro-averaged, so that every document has the same weight on the result. The micro-averaging effect is achieved by the marginalization steps performed in (22) and (28) in order to work with unconditional probabilities.
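To close Section 4, this short sketch (ours, with an invented incomplete clustering and an arbitrary PCMP value of 0.8) evaluates equations (26), (28) and (29):

```python
def class_recall(recalled, c):
    # Equation (26): fraction of documents of class c found in any cluster.
    return len(recalled & c) / len(c)

def gross_recall(clusters, classes):
    # Equation (28): recall of the whole clustering, weighted by class size.
    recalled = set().union(*clusters.values())
    n_c = sum(len(c) for c in classes.values())   # N_c, total class assignments
    return sum(class_recall(recalled, c) * len(c)
               for c in classes.values()) / n_c

def f_o(p_pm, r_u):
    # Equation (29): harmonic mean of PCMP and the gross clustering recall.
    return 2 * p_pm * r_u / (p_pm + r_u)

# Incomplete toy clustering: d9 and d10 are assigned to no cluster.
clusters = {"u1": {"d1", "d2", "d3", "d4", "d5"},
            "u2": {"d6", "d7", "d8"}}
classes = {"c1": {"d1", "d2", "d3", "d6"},
           "c2": {"d4", "d5", "d7", "d8", "d9", "d10"}}

r_u = gross_recall(clusters, classes)
print(r_u, f_o(0.8, r_u))    # R_U = (1.0*4 + (4/6)*6) / 10 = 0.8
```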
5. Numerical Experiments

Experimental evidence of the correctness of the theoretical formulations, together with some insights on their characteristics, is presented by using the Reuters-21578 corpus, ModApte split, including only documents labeled with topics. Documents with fewer than 10 unique words were discarded, while train and test documents were all included into the analysed document corpus. Therefore, a total of 10,468 documents were used, labelled with 117 ground-truth classes.

5.1. Topic Extraction

For demonstration purposes, the algorithm used was LDA (β = 0.01, α = 50/K) running 1,000 iterations of Gibbs sampling. The proposed measurement techniques require a discretization of the document-topic assignments. Thus, to show the effects of the discretization on the measurement, we generated models using document-topic probability thresholds t equal to 0.05, 0.1, 0.2 and 0.25, and a number of topics K equal to 10, 30, 50, 70, 90 and 117. Four topics extracted by LDA, in the case where K = 90, are shown in Table 3. Each topic is associated with its prior probability P(z_i), while each word is associated with its conditional probability P(w_i | z_i).

    Topic 0 (0.0097)            Topic 2 (0.0081)
    coffee         0.0620       price       0.0623
    brazil         0.0397       prices      0.0468
    said           0.0372       oil         0.0336
    export         0.0348       effective   0.0280
    quotas         0.0261       cts         0.0223
    quota          0.0220       crude       0.0218
    producers      0.0183       increase    0.0211
    ico            0.0166       raised      0.0209
    brazilian      0.0154       barrel      0.0201
    international  0.0153       raises      0.0186

    Topic 49 (0.0127)           Topic 56 (0.0110)
    rate           0.0945       wheat        0.0365
    rates          0.0771       agriculture  0.0340
    interest       0.0551       us           0.0330
    pct            0.0484       usda         0.0322
    cut            0.0354       corn         0.0298
    bank           0.0267       grain        0.0285
    market         0.0227       program      0.0278
    money          0.0216       farm         0.0265
    prime          0.0204       said         0.0235
    point          0.0163       farmers      0.0197

Table 3: Four out of the 90 topics extracted with the LDA algorithm running 1,000 iterations of Gibbs sampling with parameter values β = 0.01 and α = 0.56. The ten most frequent words for each topic are listed together with their corresponding conditional probabilities. Furthermore, each topic is associated with its prior probability.

5.2. Empirical approximation to the metrics

The correctness of the GFM and F_o formulations has been demonstrated by Monte Carlo simulation. The approximation Algorithm 1 for the GFM metric can be summarized as follows: i) randomly sample a pair of documents, ii) check if they belong to the same class, iii) check if they belong to the same cluster, iv) check if they belong to the same class and cluster, and v) compute estimates for P(S_{uc}), P(S_{∗c} | S_{u∗}), P(S_{u∗} | S_{∗c}) and GFM (lines 16-22).

Algorithm 1 Approximation to GFM
Require: D = {d_1, ..., d_N}, the input documents, and maxTrials, the maximum number of trials.
Ensure: Γ, the empirical approximation to the Generalized Fowlkes-Mallows index GFM. ClaSet(d) and CluSet(d) return respectively the set of classes and the set of clusters associated with document d. SampleDocsPair(D) randomly samples a pair from the set of documents D.
1:  sameClassFreq ← 0
2:  sameClustFreq ← 0
3:  sameClassAndClustFreq ← 0
4:  for trials = 1 to maxTrials do
5:    sameClass ← False
6:    sameClust ← False
7:    d_x, d_y ← SampleDocsPair(D)
8:    if {ClaSet(d_x) ∩ ClaSet(d_y)} ≠ ∅ then
9:      sameClassFreq++
10:     sameClass ← True
11:   end if
12:   if {CluSet(d_x) ∩ CluSet(d_y)} ≠ ∅ then
13:     sameClustFreq++
14:     sameClust ← True
15:   end if
16:   if (sameClass ∧ sameClust) then
17:     sameClassAndClustFreq++
18:   end if
19:   P_uc ← sameClassAndClustFreq / trials
20:   P_∗c ← sameClassFreq / trials
21:   P_u∗ ← sameClustFreq / trials
22:   Γ ← sqrt((P_uc / P_u∗) · (P_uc / P_∗c))
23: end for
24: return Γ

Algorithm 2 estimates PCMP and F_o as follows: i) sample a cluster, ii) sample two documents from it and check if they belong to the same class, iii) sample a class c_x, then sample a document d_z from c_x and check if it is included in the clustering solution (lines 15-19), and iv) compute estimates for P(S̃_{∗c} | u_i), R_U and F_o (lines 20-22).

Algorithm 2 Approximation to F_o
Require: U = {u_1, ..., u_k}, the set of clusters, C = {c_1, ..., c_s}, the set of classes, and maxTrials, the maximum number of trials.
Ensure: Φ_o, the empirical approximation to F_o. SampleClass(C) randomly samples an element from the set of classes C. SampleClust(U) randomly samples an element from the set of clusters U. SampleDocClass(c) randomly samples a document associated with the class c. SampleDocsClust(u) randomly samples a pair of documents associated with the cluster u.
1:  sameClassGivenClustFreq ← 0
2:  recDocsFreq ← 0
3:  RecalledDocs ← ∅
4:  for all u_j ∈ U do
5:    RecalledDocs ← {RecalledDocs ∪ u_j}
6:  end for
7:  for trials = 1 to maxTrials do
8:    sameClass ← False
9:    sameClust ← False
10:   u_x ← SampleClust(U)
11:   d_x, d_y ← SampleDocsClust(u_x)
12:   if {ClaSet(d_x) ∩ ClaSet(d_y)} ≠ ∅ then
13:     sameClassGivenClustFreq++
14:   end if
15:   c_x ← SampleClass(C)
16:   d_z ← SampleDocClass(c_x)
17:   if (d_z ∈ RecalledDocs) then
18:     recDocsFreq++
19:   end if
20:   P_PM ← sameClassGivenClustFreq / trials
21:   R_U ← recDocsFreq / trials
22:   Φ_o ← 2 · P_PM · R_U / (P_PM + R_U)
23: end for
24: return Φ_o
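For readers who prefer executable code, the following is our Python transliteration of Algorithm 1 (a sketch, not the authors' implementation); per-document class and cluster label sets stand in for the ClaSet/CluSet oracles, and the toy labels are invented. With non-overlapping labels, as here, the estimate converges to the classical FM value of equation (2).

```python
import random

def approximate_gfm(docs, cla_set, clu_set, max_trials=100_000, seed=7):
    # Monte Carlo estimate of GFM in the spirit of Algorithm 1: sample
    # document pairs, track same-class / same-cluster / same-both counts.
    rng = random.Random(seed)
    same_class = same_clust = same_both = 0
    for _ in range(max_trials):
        dx, dy = rng.sample(docs, 2)
        is_class = bool(cla_set[dx] & cla_set[dy])
        is_clust = bool(clu_set[dx] & clu_set[dy])
        same_class += is_class
        same_clust += is_clust
        same_both += is_class and is_clust
    p_uc = same_both / max_trials
    p_c = same_class / max_trials
    p_u = same_clust / max_trials
    return ((p_uc / p_u) * (p_uc / p_c)) ** 0.5   # line 22 of Algorithm 1

# Toy corpus with one class set and one cluster set per document.
docs = [f"d{i}" for i in range(1, 11)]
cla_set = {d: ({"c1"} if i <= 5 else {"c2"}) for i, d in enumerate(docs, 1)}
clu_set = {d: ({"u1"} if i <= 4 else {"u2"}) for i, d in enumerate(docs, 1)}
print(approximate_gfm(docs, cla_set, clu_set))
```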
The results, summarized in Table 5 show that both factors, the probability threshold (t) and the number of topics (K), have a statistically significant effect on the Fo metric with confidence of above 94%, while this effect can only be moderately noticed on the GFM metric for the probability threshold factor with a confidence of about 88%. A potentially important consequence of the higher sensitivity of the Fo metric is that it becomes the natural candidate to perform model selection analysis. 0.7 0.65 0.6 Predicted GFM Empirical GFM 0.55 Predicted Fo Empirical Fo 0.5 0.45 0.4 100 2600 5100 7600 10100 12600 Figure 1: Monte Carlo approximation of GFM and Fo . Factor Number of topics (K) Probability threshold (t) Results of an individual simulation for K = 90, t = 0.2 are shown in Figure 1 where the convergence pattern of the empirical measurements to their correct values is depicted. Fo 0.0523* 0.0004* GFM 0.5105 0.1267 Table 5: Two-Factor ANOVA of the Fo and GFM metrics, statistics and p-values associated with the Number of topics (K) and the Probability threshold (t) factors. 5.3. Relation of GFM and Overlapping Fo 6. Conclusions and Future Work We are interested to measure how sensitive the Fo , PCMP and GFM metrics are with respect to the document-topic discretization probability threshold (t) and the number of topics (K). Furthermore we are interested to measure how the presented metrics correlate to each other. Therefore, we think it is important to report two statistical measures. First, in Table 4 we present the results of a crosscorrelation tab between GFM and the components of Fo for the overall data set. In Table 4 it is possible to observe a high correlation between GFM and Fo , although not high enough to make the metrics redundant. Recall is positively correlated with Fo and GFM while it is negatively correlated with Precision; this property is widely acknowledged in the retrieval field. A twofactor analysis of variance has been performed to evaluate the impact of the following factors; num- Fo GFM PCMP Recall Fo 1 0.57 0.91 -0.04 GFM PCMP Recall 1 0.33 0.38 1 -0.41 1 In this paper we have shown that it is possible to measure the semantic coherence of topic models by considering them to be special instances of softclustering algorithms and then using multi-labeled corpora as external validation input. In order to accomplish this goal, we have generalized existing metrics designed to evaluate non-overlapping partitions like the Fowlkes-Mallows Index. We have also proposed metrics with more straightforward probabilistic interpretations and of easier implementation. In both cases we have shown the correctness of the formulations by empirically approximating the predicted values through Monte Carlo simulation. The usage of annotated collections to compute the validation metrics as proposed in this work, will result in additional methodological benefits for the research on probabilistic topic models. For instance, the approach will enable automated validation of models, therefore, producing a significant acceleration of the development-validation cycle. In addition, more robust and objective comparisons between independently developed models will be possible as objections related to the repeatability of experiments, such as judge selection or expertise, will be eliminated. All of the aforementioned benefits will be obtained while preserving the usage Table 4: Correlation matrix for the following metrics; Fo , GFM, PCMP and Recall. 
of valuable human input as the ultimate benchmark of model quality.

In future work we are interested in discussing how the different properties of a topic modeling algorithm, such as completeness, similarity between partitions or semantic coherence, are stressed by the different metrics. Moreover, although the proposed metrics are already based on human input, it would be useful to more clearly visualize the predictive power of such probabilistic metrics on the performance of machine learning tasks like classification or retrieval.

References

[1] R. Feldman, J. Sanger, The Text Mining Handbook, Cambridge University Press, New York, 2007.
[2] The Economist, A special report on managing information: Data, data everywhere, ISSN 0013-0613, URL http://www.economist.com/opinion/displaystory.cfm?story_id=15557443.
[3] A. Halevy, P. Norvig, F. Pereira, The Unreasonable Effectiveness of Data, IEEE Intelligent Systems 24 (2) (2009) 8-12.
[4] T. L. Griffiths, M. Steyvers, Finding scientific topics, Proc Natl Acad Sci U S A 101 (Suppl 1) (2004) 5228-5235.
[5] T. L. Griffiths, M. Steyvers, A probabilistic approach to semantic representation, in: W. Gray, C. Schunn (Eds.), Proceedings of the Twenty-Fourth Annual Conference of the Cognitive Science Society, 381-386, 2002.
[6] T. Hofmann, Probabilistic latent semantic analysis, in: Proceedings of Uncertainty in Artificial Intelligence, UAI 1999, 289-296, 1999.
[7] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research 3 (2003) 993-1022.
[8] T. L. Griffiths, M. Steyvers, Probabilistic Topic Models, Erlbaum, 2007.
[9] T. Hofmann, Unsupervised Learning by Probabilistic Latent Semantic Analysis, Machine Learning 42 (1-2) (2001) 177-196.
[10] J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, D. Blei, Reading Tea Leaves: How Humans Interpret Topic Models, in: Neural Information Processing Systems (NIPS), 2009.
[11] H. M. Wallach, I. Murray, R. Salakhutdinov, D. M. Mimno, Evaluation methods for topic models, in: A. P. Danyluk, L. Bottou, M. L. Littman (Eds.), ICML, vol. 382 of ACM International Conference Proceeding Series, ACM, 2009.
[12] E. B. Fowlkes, C. L. Mallows, A Method for Comparing Two Hierarchical Clusterings, Journal of the American Statistical Association 78 (383) (1983) 553-569.
[13] D. L. Wallace, A Method for Comparing Two Hierarchical Clusterings: Comment, Journal of the American Statistical Association 78 (383) (1983) 569-576.
[14] J. Wu, H. Xiong, J. Chen, Adapting the right measures for K-means clustering, in: KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 877-886, 2009.
[15] L. Denoeud, A. Guénoche, Comparison of Distance Indices Between Partitions, Studies in Classification, Data Analysis, and Knowledge Organization (2006) 21-28.
[16] A. Di Nuovo, V. Catania, On External Measures for Validation of Fuzzy Partitions, in: P. Melin, O. Castillo, L. Aguilar, J. Kacprzyk, W. Pedrycz (Eds.), Foundations of Fuzzy Logic and Soft Computing, vol. 4529 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 491-501, 2007.
[17] H. D. Brunk, J. E. Holstein, F. Williams, A Comparison of Binomial Approximations to the Hypergeometric Distribution, The American Statistician 22 (1) (1968) 24-26.