Topic Model Validation

Eduardo H. Ramirez^{a,∗∗}, Ramon Brena^{a}, Davide Magatti^{b}, Fabio Stella^{b,∗}

^{a} Tecnologico de Monterrey, Campus Monterrey, Monterrey, MEXICO.
^{b} DISCo, Università degli Studi di Milano-Bicocca, Viale Sarca 336, 20126 Milano, ITALY.
Abstract
In this paper the problem of performing external validation of the semantic coherence of topic models is
considered. The Fowlkes-Mallows index, a known clustering validation metric, is generalized for the case of
overlapping partitions and multi-labeled collections, thus making it suitable for validating topic modeling
algorithms. In addition, we propose new probabilistic metrics inspired by the concepts of recall and precision.
The proposed metrics also have clear probabilistic interpretations and can be applied to validate and compare
other soft and overlapping clustering algorithms. The approach is exemplified by using the Reuters-21578
multi-labeled collection to validate LDA models, then using Monte Carlo simulations to show the convergence
to the correct results. Additional statistical evidence is provided to better understand the relation of the
metrics presented.
Keywords: topic models, soft clustering, Fowlkes-Mallows index, Monte Carlo
1. Introduction
The continuously increasing amount of text available on the Web, news wires, forums and chat lines, business company intranets, personal computers, e-mails and elsewhere is overwhelming [1]. Information is switching from useful to troublesome. Indeed, it is becoming more and more evident that while the amount of data increases exponentially, our storing and processing capabilities do not grow with comparable speed [2]. This trend strongly limits the extraction of valuable knowledge from text and thus drastically reduces the competitive advantage we can gain. Search engines have exacerbated this problem by dramatically increasing the amount of text available with a few keystrokes.
Probabilistic Topic Modeling (PTM) methods
comprise a new and promising family of unsupervised techniques to model text collections which go
towards the direction suggested by Halevy et al. [3].
∗ Corresponding author.
∗∗ Principal corresponding author.
Email addresses: eduardo.ramirez@itesm.mx (Eduardo H. Ramirez), ramon.brena@itesm.mx (Ramon Brena), magatti@disco.unimib.it (Davide Magatti), stella@disco.unimib.it (Fabio Stella)
Preprint submitted to Elsevier
PTM is a particular form of document clustering and organization used to analyze the content
of documents and the meaning of words with the
aim to discover the topics mentioned in a document
collection. Table 1 shows two topics, out of three
hundreds, derived from the TASA corpus, a collection of over 37,000 text passages from educational
materials [4].
Topic 247             Topic 5
word        prob      word      prob
Drugs       .069      Red       .202
Drug        .060      Blue      .099
Medicine    .027      Green     .096
Effects     .026      Yellow    .073
Body        .023      White     .048
Medicines   .019      Color     .048
Pain        .016      Bright    .030
Person      .016      Colors    .029
Marijuana   .014      Orange    .027
Label       .012      Brown     .027

Table 1: Illustration of two (out of 300) topics extracted from the TASA corpus (language and arts, social studies, health, sciences). The ten most probable words (word) for each topic are listed together with the corresponding probability value (prob).
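Topics such as those in Table 1 are probability distributions P(w|z) over the vocabulary, and a document mixes them through P(z). A minimal sketch of this two-stage sampling, using an invented two-topic model (the names and probabilities below are hypothetical, not taken from TASA):

```python
import random

def word_marginal(w, p_z, p_w_z):
    """P(w) = sum_z P(w|z) P(z): the probability the mixture assigns to word w."""
    return sum(p_w_z[z].get(w, 0.0) * p_z[z] for z in p_z)

def generate_document(length, p_z, p_w_z, rng):
    """For each word slot, draw a topic from P(z), then a word from P(w|z)."""
    doc = []
    for _ in range(length):
        z = rng.choices(list(p_z), weights=list(p_z.values()))[0]
        words = p_w_z[z]
        doc.append(rng.choices(list(words), weights=list(words.values()))[0])
    return doc

# hypothetical miniature model: two topics over a tiny vocabulary
p_z = {"drugs": 0.7, "colors": 0.3}
p_w_z = {
    "drugs":  {"drug": 0.6, "pain": 0.4},
    "colors": {"red": 0.5, "blue": 0.5},
}
doc = generate_document(8, p_z, p_w_z, random.Random(0))
```

Here the marginal probability of "drug" is 0.6 × 0.7 = 0.42, an instance of the mixture formula discussed in the next section.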
April 11, 2011
A variety of PTM models [5, 6, 7, 8] have been proposed, described and analyzed in the specialized literature. These models differ from one another in terms of the assumptions they make concerning the data generating process. However, they all share the same rationale, i.e. a document is a mixture of topics.

To describe how the PTM model works, let P(z) be the probability distribution over topics z and P(w|z) be the probability distribution over words w given topic z. The topic-word distribution P(w|z) assigns the weights of thematically related words. A document is assumed to be formed as follows: the i-th word w_i is generated by first extracting a sample from the topic distribution P(z), and then by choosing a word from P(w|z). We let P(z_i = j) be the probability that the j-th topic was sampled for the i-th word token, while P(w_i | z_i = j) is the probability of word w_i under topic j. Therefore, the PTM induces the following distribution over the words within a document:

P(w_i) = \sum_{j=1}^{T} P(w_i | z_i = j) P(z_i = j)    (1)

where T is the number of topics.

The main idea of PTM can be summarized as follows: a document is a linear combination of multiple topics (Formula 1). A topic is a probability distribution over a given vocabulary of words. Therefore, a document can be interpreted as a linear combination of probability distributions over a given vocabulary, where each probability distribution, i.e. topic, is associated with a specific argument, idea or theme.

Hofmann [6, 9] proposed the probabilistic Latent Semantic Indexing (pLSI) method, which makes no assumptions about how the mixture weights in formula (1), i.e. P(z_i = j), are generated. Blei et al. [7] improved the generalizability of this model to new documents. They introduced a Dirichlet prior with hyperparameter α on P(z), thus creating the Latent Dirichlet Allocation (LDA) model. In 2004, Griffiths and Steyvers [4] introduced an extension of the original LDA model which associates a Dirichlet prior, with hyperparameter β, also to P(w_i | z_i = j). The authors suggested that this hyperparameter be interpreted as the prior observation count on the number of times words are sampled from a topic before any word from the corpus is observed. This choice smooths the word distribution in every topic, with the amount of smoothing determined by β. The authors showed that good choices for the hyperparameters α and β depend on the number of topics T and on the vocabulary size W, and that, according to the results of their empirical investigation, α = 50/T and β = 0.01 work well with many different document collections.

PTM originated in the psychology community, where the semantic coherence of topics was evaluated by focusing on replicating human performance or judgments. This strategy was implemented by performing standardized tests, comparing sense distinctions, and matching intuitions about synonymy [10]. Topic model validation is an extremely important task which aims to check whether a PTM is semantically meaningful or not. Indeed, PTMs assume that the latent space, where topics are projected, is semantically meaningful, and thus it can be exploited to validate the extracted topics, to summarize a given document corpus and to guide its contextual exploration. Furthermore, ignoring the analysis of the internal representation of topic models is in contrast with their rationale and development. Several approaches have been proposed for topic model evaluation: measures based on held-out likelihood, sentiment detection or information retrieval, as discussed in Chang et al. [10]. Wallach et al. [11] summarized several evaluation techniques based on tools from language modeling. These techniques measure the fit of the learned knowledge when applied to unseen documents. The main drawback of such methods is that they only measure the probability of observation, while the internal representation of the topic models is ignored. An important exception is offered by Griffiths and Steyvers [8], where the authors have shown that the different senses of a word are correlated with the number of topics that word appears in. However, Chang et al. [10] pointed out that this is still not a deep analysis of the structure of the latent space, as it does not examine the structure of the topics themselves. The problem of validating topic models without relying on the performance of a task was recently addressed by Chang et al. [10]. Their work, the first in that direction, shows by means of user studies that some problems exist when using supervised classification predictive metrics.

In this paper we are concerned with the analysis and validation of the semantic coherence of the results obtained through PTM and with the problem of their comparison with the results obtained through alternative models. The proposed approach consists of transforming each topic model
into a “hard” overlapping partition through the discretization of the “soft” document-topic associations. Then, the validation is performed by exploiting novel probabilistic metrics, inspired by the widely accepted concepts of “precision” and “recall” and having clear probabilistic interpretations. We have also generalized the Fowlkes-Mallows index (FM) [12, 13], an existing cluster validation metric, in order to make it suitable for validating overlapping clusterings. The generalization of the FM index was performed on the basis of its underlying probabilistic interpretation and allows us to link the Fowlkes-Mallows index to the semantic coherence of the model rather than to the mere similarity between cluster partitions. The proposed validation approach has the following advantages with respect to existing metrics:

• it offers an explicit probabilistic interpretation;
• it allows the validation of overlapping partitions;
• it allows the validation of incomplete partitions.

The harmonic mean of precision and recall can be computed to obtain a combined single-metric measurement of the quality of the partition. The paper also shows how the proposed metrics allow a “drill-down” analysis into the individual clusters (topics), making it straightforward to determine:

1. which are the best/worst clusters in the partition;
2. which topics are better recalled by any given cluster.

The rest of this paper is organized as follows. In Section 2 we establish the notation and the main objects involved in our analysis. In Section 3 we propose a dissection of the Fowlkes-Mallows index. Section 4 introduces and describes the proposed metrics to evaluate topic model quality. The results of the numerical experiments performed on the Reuters-21578 data set are described in Section 5. Finally, conclusions and research directions are reported in Section 6.

2. Notation and problem statement

Once the soft-clustering solution of a multi-labeled corpus is discretized to obtain the corresponding hard clustering, the problem to be faced consists in correctly evaluating the quality of the resulting overlapping partitions. Let us first introduce the terminology and the notation used in the rest of the paper. Every “hard-clustering” problem applied to a multi-labeled document corpus involves the following elements:

• a dataset D = {d_0, ..., d_n} consisting of n documents;
• a partition of D into K clusters: U = {u_1, ..., u_K};
• a partition of D into S classes: C = {c_1, ..., c_S}.

Most of the existing validation metrics [14] can be expressed in terms of a |U| × |C| contingency table (Table 2), where the content of each cell n_{ij} represents the number of documents belonging to cluster u_i and class c_j. In the special case where clusters do not overlap and the document corpus is uni-labeled, the following properties hold:

1. ∪_{i=1}^{K} u_i = D;
2. u_i ∩ u_j = ∅ for all i, j = 1, ..., K with i ≠ j: there is no “overlap” in the cluster partition;
3. c_i ∩ c_j = ∅ for all i, j = 1, ..., S with i ≠ j: there is no “overlap” in the class partition.

In this work we consider the case where the aforementioned properties cannot be assumed to hold, which happens when:

• the thresholding procedure used to move from soft to hard clustering results in some documents being assigned to no cluster;
• the thresholding procedure used to move from soft to hard clustering results in some documents being assigned to more than one cluster;
• the document corpus is multi-labeled, and thus in principle every document can be assigned to no, one or more classes.

                         Classes
Cluster   c_1     c_2     ...   c_S     Σ
u_1       n_11    n_12    ...   n_1S    n_1*
u_2       n_21    n_22    ...   n_2S    n_2*
...       ...     ...     ...   ...     ...
u_K       n_K1    n_K2    ...   n_KS    n_K*
Σ         n_*1    n_*2    ...   n_*S    n_**

Table 2: Contingency table.
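The contingency table of Table 2 can be computed directly from set-valued assignments; a minimal sketch (the cluster and class memberships below are hypothetical):

```python
def contingency_table(clusters, classes):
    """n[i][j] = number of documents in cluster u_i that carry class label c_j."""
    return [[len(u & c) for c in classes] for u in clusters]

# hypothetical assignments: 10 documents, 2 clusters, 2 classes
clusters = [{"d1", "d2", "d3", "d4", "d5"}, {"d6", "d7", "d8", "d9", "d10"}]
classes = [{"d1", "d2", "d3", "d6"}, {"d4", "d5", "d7", "d8", "d9", "d10"}]

n = contingency_table(clusters, classes)
rows = [sum(r) for r in n]           # marginals n_i*
cols = [sum(c) for c in zip(*n)]     # marginals n_*j
```

With non-overlapping clusters and uni-labeled documents the marginals coincide with the cluster and class sizes; the later sections show why this breaks down under overlap.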
3. Background

PTMs can be thought of as soft-clustering methods that discover probabilistic associations of documents to latent topics. Therefore, to evaluate the quality of the knowledge extracted by PTM we can exploit metrics designed to compare heterogeneous soft-clustering methods on text collections. This strategy, based on external validation metrics, takes full advantage of multi-labeled corpora. To the best of our knowledge this is the first paper that proposes and develops such an approach. It is worthwhile to notice that most of the works on external clustering validation deal with the case of non-overlapping partitions of uni-labeled corpora. A comprehensive review of the traditional metrics used to validate non-overlapping partitions can be found in [14] and [15]. A fuzzy clustering validation approach for complete solutions is presented in [16].

In this section we focus on dissecting the Fowlkes-Mallows index, as it is a probabilistic metric which can be extended to work in the overlapping and incomplete case. Finally, it is important to clarify that clustering entropy, a probabilistic external metric commonly referred to in the literature, is not considered in detail in our analysis, despite its ability to validate overlapping clustering solutions. Indeed, a probability distribution over the class variable has to be constructed to compute the entropy of a cluster. However, such a probability distribution can be constructed only in the case where the classes do not overlap. In our opinion this is an undesirable property for a topic model validation corpus such as Reuters-21578, where the classes actually overlap^1.

3.1. Dissecting the Fowlkes-Mallows index

Among the existing cluster validation metrics, a particularly interesting one is the Fowlkes-Mallows index [12, 13] (hereafter referred to as “FM”). Using the contingency table notation from Table 2, the FM index is defined as follows:

FM = \sum_{ij} \binom{n_{ij}}{2} / \sqrt{\sum_i \binom{n_{i*}}{2} \sum_j \binom{n_{*j}}{2}}    (2)

In order to analyze the FM index, the events associated with the experiment of randomly sampling two documents d_1 and d_2 without replacement from D are defined as follows:

• S_{u*}: d_1 and d_2 belong to the same cluster;
• S_{*c}: d_1 and d_2 belong to the same class;
• S_{uc}: d_1 and d_2 belong to the same cluster and class.

To denote the event of d_1 and d_2 belonging to class c_j we write S_{*c_j}, whose probability is given by:

P(S_{*c_j}) = \binom{n_{*j}}{2} / \binom{n_{**}}{2} = h(n_{**}, n_{*j}, 2, 2)    (3)

where h(n_{**}, n_{*j}, 2, 2) represents the probability, according to the hypergeometric distribution, of obtaining 2 successes in a sampling without replacement of size 2 from a population of size n_{**} that contains n_{*j} successes. In a similar manner, we write S_{u_i*} to denote that d_1 and d_2 belong to cluster u_i, where the corresponding probability value is given by:

P(S_{u_i*}) = \binom{n_{i*}}{2} / \binom{n_{**}}{2} = h(n_{**}, n_{i*}, 2, 2).    (4)

The probability of two documents belonging to the same class can be computed from expression (3) to be:

P(S_{*c}) = \sum_j P(S_{*c_j}) = \frac{1}{\binom{n_{**}}{2}} \sum_j \binom{n_{*j}}{2}    (5)

while the probability of two documents belonging to the same cluster can be computed from expression (4) to be:

P(S_{u*}) = \sum_i P(S_{u_i*}) = \frac{1}{\binom{n_{**}}{2}} \sum_i \binom{n_{i*}}{2}.    (6)

Finally, the probability of two randomly sampled documents, without replacement, belonging to the same class and cluster is:

P(S_{uc}) = \sum_{ij} P(S_{u_i c_j}) = \frac{1}{\binom{n_{**}}{2}} \sum_{ij} \binom{n_{ij}}{2}.    (7)

^1 Certainly, a generalized version of the cluster entropy measure could be produced by mapping the original class labels into a new set of labels, where each distinct label combination is transformed into a new, composite class label. We consider, however, that such an approach would be excessively “strict” in the evaluation of the resulting solutions in terms of semantic coherence.
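The pair-sampling probabilities (5), (6) and (7) can be sketched directly from a contingency table; the 2×2 table below is hypothetical:

```python
from math import comb

def h2(N, k):
    """h(N, k, 2, 2): both draws are successes when sampling 2 of N without replacement."""
    return comb(k, 2) / comb(N, 2)

# hypothetical 2-cluster, 2-class contingency table
n = [[3, 2], [1, 4]]
rows = [sum(r) for r in n]            # n_i*
cols = [sum(c) for c in zip(*n)]      # n_*j
tot = sum(rows)                       # n_**

p_same_class = sum(h2(tot, k) for k in cols)            # eq. (5)
p_same_cluster = sum(h2(tot, k) for k in rows)          # eq. (6)
p_same_both = sum(h2(tot, v) for r in n for v in r)     # eq. (7)
```

For this table the three probabilities come out to 21/45, 20/45 and 10/45 respectively.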
Then, the conditional probability that two randomly sampled documents, without replacement, belong to the same class given that they belong to the same cluster is:

P(S_{*c} | S_{u*}) = P(S_{uc}) / P(S_{u*}) = \sum_{ij} \binom{n_{ij}}{2} / \sum_i \binom{n_{i*}}{2}    (8)

while the conditional probability that they belong to the same cluster given that they belong to the same class is:

P(S_{u*} | S_{*c}) = P(S_{uc}) / P(S_{*c}) = \sum_{ij} \binom{n_{ij}}{2} / \sum_j \binom{n_{*j}}{2}.    (9)

It is worthwhile to note that the FM index (2) can be obtained by computing the geometric mean of the conditional probability that the pair of sampled documents belong to the same class given they belong to the same cluster (P(S_{*c} | S_{u*})) and the conditional probability that the pair of sampled documents belong to the same cluster given they belong to the same class (P(S_{u*} | S_{*c})). Therefore, expressions (8) and (9) allow us to write the following:

FM = \sqrt{P(S_{*c} | S_{u*}) P(S_{u*} | S_{*c})}.    (10)

The previous formulations can also be expressed in terms of the hypergeometric distribution, defined as the probability of selecting exactly x successes in a sample of size m, obtained without replacement from a population of N objects of which k possess the characteristic of interest. Formally:

h(N, k, m, x) = \binom{k}{x} \binom{N-k}{m-x} / \binom{N}{m}.    (11)

In practice, when N is greater than 50 and m/N ≤ 0.10, the hypergeometric distribution can be conveniently approximated by the binomial distribution [17]. So, the proposed formalization serves two goals. On one hand, it is helpful for computational purposes, as it allows the usage of cost-efficient approximations; on the other hand, it is useful to better understand the properties of the considered metric.

For instance, by expressions (4) and (6), the probability of sampling two documents from the same cluster can be rewritten as P(S_{u*}) = \sum_i h(n_{**}, n_{i*}, 2, 2), while the probability of sampling two documents from the same class becomes P(S_{*c}) = \sum_j h(n_{**}, n_{*j}, 2, 2). In a similar fashion, we have P(S_{uc}) = \sum_{ij} h(n_{**}, n_{ij}, 2, 2). Thus, the conditional probabilities expressed above can be rewritten as:

P(S_{*c} | S_{u*}) = \sum_{ij} h(n_{**}, n_{ij}, 2, 2) / \sum_i h(n_{**}, n_{i*}, 2, 2)    (12)

and:

P(S_{u*} | S_{*c}) = \sum_{ij} h(n_{**}, n_{ij}, 2, 2) / \sum_j h(n_{**}, n_{*j}, 2, 2).    (13)

Finally, by geometrically averaging (12) and (13), the new expression for (2) is:

FM = \sum_{ij} h(n_{**}, n_{ij}, 2, 2) / \sqrt{\sum_i h(n_{**}, n_{i*}, 2, 2) \sum_j h(n_{**}, n_{*j}, 2, 2)}.    (14)

It is worthwhile to mention that equation (14) makes it easy to account for the effects of overlapping clusters when computing the FM index.

3.2. Overlapping partitions

When validating with a multi-labeled corpus, such as Reuters-21578, the set of ground-truth classes results in overlapping partitions. In such a case the FM index cannot be computed by using equation (2), because the assumption of sampling without replacement does not hold. The main difficulty with overlap when computing the FM index is due to the use of the contingency table notation, which hides the probability being computed and easily leads to the wrong assumptions that n_{i*} = |u_i|, n_{*j} = |c_j| and n_{**} = |D|. The implications of such wrong assumptions are shown through the following example.

Example 1
Consider a non-overlapping partition consisting of 2 clusters and 2 classes. Let u_1 = {d_1, d_2, d_3, d_4, d_5} with {d_1, d_2, d_3} ∈ c_1 and {d_4, d_5} ∈ c_2, and let u_2 = {d_6, d_7, d_8, d_9, d_10} with {d_6} ∈ c_1 and {d_7, d_8, d_9, d_10} ∈ c_2. The situation can be conveniently summarized through the following contingency table:

        c_1        c_2        Σ
u_1     3          2          n_1* = 5
u_2     1          4          n_2* = 5
Σ       n_*1 = 4   n_*2 = 6   n_** = 10

According to (3) and (5) we can compute P(S_{*c}) as follows: P(S_{*c}) = \sum_j P(S_{*c_j}) = \sum_j h(n_{**}, n_{*j}, 2, 2) = \binom{4}{2}/\binom{10}{2} + \binom{6}{2}/\binom{10}{2} = 21/45, which is the correct probability value.

The following class overlapping scenario, due to multi-labeled documents, is considered. Let c_3 be such that {d_1, d_4, d_8, d_9, d_10} ∈ c_3. The corresponding contingency table is:
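For a non-overlapping case such as Example 1, the FM index computed from the definition (2) and from the geometric mean (10) must coincide; a quick numerical check (the table values are those of Example 1):

```python
from math import comb, sqrt

# contingency table of Example 1
n = [[3, 2], [1, 4]]

num = sum(comb(v, 2) for row in n for v in row)     # sum_ij C(n_ij, 2)
den_u = sum(comb(sum(row), 2) for row in n)         # sum_i  C(n_i*, 2)
den_c = sum(comb(sum(col), 2) for col in zip(*n))   # sum_j  C(n_*j, 2)

fm_def = num / sqrt(den_u * den_c)                  # definition, eq. (2)
fm_geom = sqrt((num / den_u) * (num / den_c))       # geometric mean, eq. (10)
```

Both evaluate to 10/\sqrt{420} ≈ 0.488 here, illustrating the identity behind (10).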
        c_1        c_2        c_3        Σ
u_1     3          2          2          n_1* = 7
u_2     1          4          3          n_2* = 8
Σ       n_*1 = 4   n_*2 = 6   n_*3 = 5   n_** = 15

Intuitively, we expect the intra-cluster overlap to increase the value of P(S_{*c}). However, equation (5) yields the incorrect result of 31/105, which is smaller than the correct one, 21/45. This is due to the fact that the sampling without replacement assumption no longer holds. Indeed, there are not \binom{15}{2} = 105 ways to select 2 documents, as that would allow sampling the same document more than once. The correct number of ways to select 2 elements is still 45, and it is given by \binom{|D|}{2} = \binom{10}{2}. However, the events S_{*c_j} of sampling two documents from the same class j are no longer independent. Therefore, they cannot be added as in (5). Basically, when class and/or cluster overlap exists, the contingency table bins do not represent mutually exclusive events. Thus, the value of P(S_{*c}) when class overlap exists is given by:

P(S_{*c}) = \sum_j h(|D|, |c_j|, 2, 2) − J(C)    (15)

where J(C) is the probability that a selected pair of documents belongs to two classes simultaneously, defined by the expression:

J(C) = \sum_j \sum_{j'>j} P(S_{*c_j} ∩ S_{*c_{j'}})

or, according to the hypergeometric notation, by the following expression:

J(C) = \sum_j \sum_{j'>j} h(|D|, |{S_{*c_j} ∩ S_{*c_{j'}}}|, 2, 2).

Formula (15) is a re-expression of (5) under the general addition rule of probability for non-independent events^2. For instance, if any of the pairs {(d_4, d_8), (d_4, d_9), (d_4, d_10), (d_8, d_9), (d_8, d_10), (d_9, d_10)} is sampled, then S_{*c_2} and S_{*c_3} are both true, and this results in a double count. The correct value of P(S_{*c}) is obtained by subtracting the probability of the classes' intersection:

P(S_{*c}) = \binom{4}{2}/\binom{10}{2} + \binom{6}{2}/\binom{10}{2} + \binom{5}{2}/\binom{10}{2} − \binom{4}{2}/\binom{10}{2} = 25/45 ≈ 0.55.

However, the above formulas deal with the case where the class overlap is restricted to pairs. The case of general class overlap is more complex from both the theoretical and the computational points of view and will be presented in a different work.

3.3. Incomplete partitions

When hardening a soft-cluster solution we potentially obtain overlapping and incomplete partitions; thus, the validation metrics should be sensitive to some form of “recall”. In the FM index computation the basic assumption would be that the column marginal totals correspond to the sizes of the classes, i.e. n_{*j} = |c_j|, and that the row marginal totals equal the sizes of the clusters. As shown before, such an assumption is false when overlap exists, and the same applies to cases where the clusters are incomplete. Measuring incomplete partitions with the FM contingency table is wrong. Indeed, it incorrectly reduces the number of successes inside the population by using n_{i*} instead of |u_i|. Furthermore, the possibility of cluster overlap has to be taken into account. Therefore, the correct probability of selecting two documents from the same cluster is given by:

P(S_{u*}) = \sum_i h(|D|, |u_i|, 2, 2) − J(U)    (16)

where J(U) accounts for the probability of selecting a pair of documents belonging to two or more clusters, and it is given by adding up the probabilities of the cluster intersections:

J(U) = \sum_i \sum_{i'>i} P(S_{u_i*} ∩ S_{u_{i'}*})

and, by using the hypergeometric distribution:

J(U) = \sum_i \sum_{i'>i} h(|D|, |{S_{u_i*} ∩ S_{u_{i'}*}}|, 2, 2).

It is worthwhile to note that formula (16) is also valid in the case where clusters do not overlap. However, although FM can be corrected to take into account some of the effects of the partitions' incompleteness and/or overlap, we consider that its interpretation is more biased toward measuring partition similarity, and thus we find it valuable to study new metrics that can serve better to estimate semantic coherence.

^2 Which states that P(A∪B) = P(A) + P(B) − P(A∩B).
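The pairwise correction (15) can be checked numerically against a brute-force enumeration of document pairs; the sets below reproduce the overlapping scenario of Example 1 with c_3 added, under the paper's assumption that overlap is restricted to pairs of classes:

```python
from itertools import combinations
from math import comb

def h2(N, k):
    """h(N, k, 2, 2): both draws are successes when sampling 2 of N without replacement."""
    return comb(k, 2) / comb(N, 2)

D = [f"d{i}" for i in range(1, 11)]
classes = {
    "c1": {"d1", "d2", "d3", "d6"},
    "c2": {"d4", "d5", "d7", "d8", "d9", "d10"},
    "c3": {"d1", "d4", "d8", "d9", "d10"},
}

# eq. (15): per-class probabilities minus the pairwise-intersection term J(C)
p_cls = sum(h2(len(D), len(c)) for c in classes.values())
jC = sum(h2(len(D), len(a & b)) for a, b in combinations(classes.values(), 2))
p_corrected = p_cls - jC

# brute force: fraction of document pairs sharing at least one class
share = sum(1 for d1, d2 in combinations(D, 2)
            if any(d1 in c and d2 in c for c in classes.values()))
p_direct = share / comb(len(D), 2)
```

Both routes give 25/45 ≈ 0.55, matching the value computed in the text.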
4. Proposed metrics

In this section we introduce a version of the FM index adjusted for overlapping and incomplete clusters. Two overlapping “precision” metrics, together with their probabilistic interpretations, are given. Finally, we discuss the computation of a kind of cluster “recall” which can be used to achieve a single-metric performance.

4.1. Generalized Fowlkes-Mallows Index

As discussed in Section 3, if the FM index is expressed in terms of the contingency table, it cannot be used to validate overlapping or incomplete clusters. The reason is that, while its additive terms come from the hypergeometric distribution, they would use an incorrect population size in the case where cluster overlap is concerned. However, we have shown that by re-expressing the FM index in terms of the hypergeometric distribution, and by correcting its formula to use the cluster size |u_i| and the class size |c_j|, the probabilities P(S_{*c}) in (15) and P(S_{u*}) in (16) are correct under the assumption that the maximum overlap equals two. Therefore, the last step to obtain a generalized version of the FM index is to generalize the computation of P(S_{uc}) in such a way that the non-independent events S_{u_i c_j} are correctly taken into account. This generalization requires computing the probability of the intersection of elementary events. For the whole contingency table, the sum of the probabilities of the intersections between “bins” will be denoted by:

J(U, C) = \sum_{ij} \sum_{i'j'} P(S_{u_i c_j} ∩ S_{u_{i'} c_{j'}})

where i' > i and j' > j, and where by using the hypergeometric probabilities we obtain:

J(U, C) = \sum_{ij} \sum_{i'j'} h(|D|, |{S_{u_i c_j} ∩ S_{u_{i'} c_{j'}}}|, 2, 2).    (17)

Note that the computation of J(U, C) requires the creation of an additional “overlap matrix” consisting of (|U| × |C|)^2 elements. Finally, the generalized result for P(S_{uc}) is given by:

P(S_{uc}) = \sum_{ij} h(|D|, n_{ij}, 2, 2) − J(U, C).    (18)

Thus, the generalized version of the metric can be defined as the geometric average of:

• the probability that two randomly sampled documents belong to the same class, given they belong to the same cluster, i.e.:

P(S_{*c} | S_{u*}) = [\sum_{ij} h(|D|, n_{ij}, 2, 2) − J(U, C)] / [\sum_i h(|D|, |u_i|, 2, 2) − J(U)];    (19)

• the probability that two randomly sampled documents belong to the same cluster, given they belong to the same class, i.e.:

P(S_{u*} | S_{*c}) = [\sum_{ij} h(|D|, n_{ij}, 2, 2) − J(U, C)] / [\sum_j h(|D|, |c_j|, 2, 2) − J(C)].    (20)

In conclusion, the generalized version of the FM index, referred to as GFM, is given by^3:

GFM = [\sum_{ij} h(|D|, n_{ij}, 2, 2) − J(U, C)] / \sqrt{[\sum_i h(|D|, |u_i|, 2, 2) − J(U)] [\sum_j h(|D|, |c_j|, 2, 2) − J(C)]}.    (21)

4.2. Partial Class Match Precision

This metric is inspired by the notion of precision utilized in the IR field. The Partial Class Match Precision (PCMP) measures the probability of randomly selecting two documents of the same class from a randomly sampled cluster. In contrast to FM, where we are concerned with the random sampling of two documents d_1 and d_2 from the document corpus, PCMP requires first randomly sampling a cluster and then randomly sampling two documents from the sampled cluster. In order to clearly differentiate the two random events, we use S̃_{*c} to denote the event of selecting two documents belonging to the same class sampled from a given cluster. Formally, the PCMP metric is defined as follows:

P_{PM} = P(S̃_{*c}) = \sum_i P(S̃_{*c} | u_i) P(u_i)    (22)

where the prior probability of selecting the cluster u_i is given by P(u_i) = n_{i*}/n_{**}.

^3 We are aware that this formulation may not be accurate in extreme cases of strongly overlapped collections; however, extending the formulation to work in such cases is straightforward. Moreover, we will show experimentally that the hypothetical error, which is in fact an underestimation of the actual probabilities, is negligible in real-world corpora such as Reuters-21578.
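A sketch of the GFM computation (21), reading the pairwise definition of J(U, C) in (17) literally (i' > i and j' > j); the helper names are ours, and for a non-overlapping partition the value reduces to the plain FM index:

```python
from itertools import combinations
from math import comb, sqrt

def h2(N, k):
    """h(N, k, 2, 2): both draws are successes when sampling 2 of N without replacement."""
    return comb(k, 2) / comb(N, 2)

def gfm(clusters, classes, docs):
    """Generalized FM, eq. (21): clusters and classes are lists of document sets."""
    N = len(docs)
    bins = {(i, j): u & c
            for i, u in enumerate(clusters) for j, c in enumerate(classes)}
    jU = sum(h2(N, len(a & b)) for a, b in combinations(clusters, 2))
    jC = sum(h2(N, len(a & b)) for a, b in combinations(classes, 2))
    jUC = sum(h2(N, len(bins[i, j] & bins[i2, j2]))
              for (i, j), (i2, j2) in combinations(sorted(bins), 2)
              if i2 > i and j2 > j)
    num = sum(h2(N, len(b)) for b in bins.values()) - jUC
    den_u = sum(h2(N, len(u)) for u in clusters) - jU
    den_c = sum(h2(N, len(c)) for c in classes) - jC
    return num / sqrt(den_u * den_c)

# non-overlapping sanity check with the data of Example 1
docs = {f"d{i}" for i in range(1, 11)}
clusters = [{"d1", "d2", "d3", "d4", "d5"}, {"d6", "d7", "d8", "d9", "d10"}]
classes = [{"d1", "d2", "d3", "d6"}, {"d4", "d5", "d7", "d8", "d9", "d10"}]
```

On this non-overlapping input all three J terms vanish and the result equals 10/\sqrt{420}, the FM value of Example 1.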
PCMP measures the probability of the event S̃_{*c}, i.e. of sampling two documents from the same class after having randomly selected a cluster. However, the computation of each individual P(S̃_{*c} | u_i) also needs to be generalized in the case of class overlap. Therefore, we need to add up the probabilities of selecting two documents from each class comprised within the cluster, P(S̃_{*c_j} | u_i), under the general addition rule for non-independent events, which implies discounting the probability of a success in two classes simultaneously. Thus, each individual P(S̃_{*c} | u_i) is given by:

P(S̃_{*c} | u_i) = \sum_j P(S̃_{*c_j} | u_i) − J(u_i)    (23)

where J(u_i), which represents the probability of sampling two elements from two or more classes when selecting documents d_1 and d_2 belonging to cluster u_i, is given by:

J(u_i) = \sum_j \sum_{j'>j} P(S_{u_i c_j} ∩ S_{u_i c_{j'}}).    (24)

The previous equation represents the probability of selecting two elements from cluster u_i that simultaneously belong to two different classes. Thus, in order to obtain J(u_i) we need to compute the individual probabilities of selecting two documents that simultaneously belong to every distinct pair of classes (c_j, c_{j'}) and then add them up to obtain the probability of selecting two documents that simultaneously belong to any pair of classes. The expression for the individual probabilities can also be represented using the formula of the hypergeometric distribution, where the parameter accounting for the number of successful outcomes is the number of elements in u_i that belong to both c_j and c_{j'}, that is, the “overlap” between c_j and c_{j'}:

J(u_i) = \sum_j \sum_{j'>j} h(|u_i|, |{S_{u_i c_j} ∩ S_{u_i c_{j'}}}|, 2, 2).    (25)

This metric is designed to work well with multi-labeled document corpora. The name “Partial” comes from the fact that in a multi-label setting the two randomly sampled elements d_1 and d_2 can be associated with many classes. As long as one of their classes matches, we consider the result to be semantically coherent, and thus a success. We consider this property of the metric a valuable feature for focusing on the measurement of semantic coherence rather than mere partition similarity. For instance, in contrast to similarity-oriented metrics, more than one clustering solution can achieve the maximum evaluation in terms of the PCMP metric. In fact, we can think of two clustering solutions that will obtain a PCMP value of 1, where any pair of elements sampled from a given cluster will belong to the same class:

a) creating one cluster for every class and assigning all the elements in c_i to u_i, so that k = |C|;
b) creating clusters of elements that share exactly the same class labels.

Finally, we should highlight that this metric can be efficiently approximated via a Monte Carlo simulation. Indeed, we will exploit this method to demonstrate the correctness of the metric.

4.3. Clustering Recall

In the IR field the “recall” measure represents the probability that a relevant document is retrieved. Therefore, for the clustering scenarios under consideration, when the completeness of the partition cannot be assumed, it is critical to provide clear ways to measure the completeness of the clustering. Let N_c be the total number of class assignments, given by the sum of the sizes of all the classes:

N_c = \sum_j |c_j|.

In overlapping and incomplete clustering we must not rely on the values of the contingency table to compute recall values, because they can account for duplicates. They also do not consider elements not assigned to any cluster.

4.3.1. Class recall

If we are interested in measuring which classes are better captured by the clustering, it is straightforward to compute a class recall value. We define this “class recall” as the probability that a document d, randomly sampled from the class c_j, is included in any cluster:

R(c_j) = P([d ∈ ∪_{i}^{k} u_i] | c_j) = |∪_{i}^{k} {u_i ∩ c_j}| / |c_j|.    (26)

In other words, equation (26) divides the number of documents labelled with class c_j that were recalled by any cluster u_i by the total number of documents labelled with class c_j.
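Class recall (26), and the size-weighted aggregate derived from it in the next subsection, can be sketched directly on set-valued assignments; the incomplete clustering below is hypothetical (two documents are assigned to no cluster):

```python
def class_recall(clusters, cls):
    """Eq. (26): fraction of documents of class cls captured by any cluster."""
    return len(set().union(*clusters) & cls) / len(cls)

def gross_recall(clusters, classes):
    """Class recalls weighted by class size and normalized by N_c = sum |c_j|."""
    n_c = sum(len(c) for c in classes)
    return sum(class_recall(clusters, c) * len(c) for c in classes) / n_c

# hypothetical incomplete clustering: d4 and d5 belong to no cluster
clusters = [{"d1", "d2"}, {"d3"}]
classes = [{"d1", "d2", "d3", "d4"}, {"d3", "d5"}]
```

Here the first class is recalled at 3/4, the second at 1/2, and the weighted aggregate is 4/6.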
4.3.2. Gross clustering recall

From the previous expression, and recalling that the probability of selecting a class is given by P(c_j) = |c_j|/N_c, it is possible to derive the following unconditional expression to measure the recall of the whole clustering:

R_U = P(d ∈ ∪_{i}^{k} u_i) = \sum_j P(d ∈ ∪_{i}^{k} u_i | c_j) P(c_j).    (27)

Therefore, (27) becomes:

R_U = \frac{1}{N_c} \sum_j R(c_j) |c_j|.    (28)

4.4. Single-metric performance

In retrieval and classification it is widely known that it is trivial to achieve high recall at the expense of precision, and vice versa. Thus, traditionally they are averaged into a single metric, the F-Score. The traditional F-Score is nothing but the harmonic mean of precision and recall, and almost any two probabilities can be averaged in this way. However, for the particular case of topic-model validation we are interested in balancing the best measurement of semantic coherence with the best measure of completeness, so our proposed metric is defined as the harmonic average of equation (22) and equation (28):

F_o = 2 P_{PM} R_U / (P_{PM} + R_U).    (29)

Notice that the selection of (22) and (28) comes at the expense of not penalizing some clustering dissimilarities. Thus, if the ultimate performance criterion is partition similarity, then the GFM may be the better metric of choice. Both components of the F_o metric are micro-averaged, so that every document has the same weight in the result. The micro-averaging effect is achieved by the marginalization step performed in (22) and (28) in order to work with unconditional probabilities.

Documents with fewer than 10 unique words were discarded, while train and test documents were all included in the analysed document corpus. Therefore, a total of 10,468 documents were used, labelled with 117 ground-truth classes.

5.1. Topic Extraction

For demonstration purposes, the algorithm used was LDA (β = 0.01, α = 50/K), running 1,000 iterations of Gibbs sampling. The proposed measurement techniques require a discretization of the document-topic assignments. Thus, to show the effects of the discretization on the measurement, we generated models using document-topic probability thresholds t equal to 0.05, 0.1, 0.2, 0.25 and a number of topics K equal to 10, 30, 50, 70, 90 and 117. Four topics extracted by LDA, in the case where K = 90, are shown in Table 3. Each topic is associated with its prior probability P(z_i), while each word is associated with its conditional probability P(w_i | z_i).
Topic 0
0.0097
Topic 2
0.0081
coffee
brazil
said
export
quotas
quota
producers
ico
brazilian
international
0.0620
0.0397
0.0372
0.0348
0.0261
0.0220
0.0183
0.0166
0.0154
0.0153
price
prices
oil
effective
cts
crude
increase
raised
barrel
raises
0.0623
0.0468
0.0336
0.0280
0.0223
0.0218
0.0211
0.0209
0.0201
0.0186
Topic 49
0.0127
Topic 56
0.0110
rate
rates
interest
pct
cut
bank
market
money
prime
point
0.0945
0.0771
0.0551
0.0484
0.0354
0.0267
0.0227
0.0216
0.0204
0.0163
wheat
agriculture
us
usda
corn
grain
program
farm
said
farmers
0.0365
0.0340
0.0330
0.0322
0.0298
0.0285
0.0278
0.0265
0.0235
0.0197
5. Numerical Experiments
Table 3: Four out of the 90 topics extractred with the LDA
algorithm running 1,000 iterations of Gibbs sampling with
parameters value β = 0.01 and α = 0.56. The ten most
frequent words for each topic are listed together with their
corresponding conditional probabilities. Furthermore, each
topic is associated with its prior probability.
Experimental evidence of the correctness of the
theoretical formulations together with some insights on their characteristics are presented by using
the Reuters-21578 corpus, ModApte split, including
only documents labeled with topics.
9
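The discretization of document-topic assignments described in Section 5.1 can be sketched in a few lines. The probability matrix below is hypothetical, and `t` plays the role of the document-topic probability threshold:

```python
def discretize(doc_topic, t):
    """Assign each document to every topic whose probability reaches the
    threshold t, yielding an overlapping (and possibly incomplete) clustering."""
    return [{k for k, p in enumerate(row) if p >= t} for row in doc_topic]

# Hypothetical P(topic | document) rows for three documents and three topics.
doc_topic = [
    [0.70, 0.20, 0.10],
    [0.15, 0.45, 0.40],
    [0.05, 0.05, 0.90],
]
print(discretize(doc_topic, 0.2))   # -> [{0, 1}, {1, 2}, {2}]
```

Raising `t` makes the clustering harder and less overlapping; at `t = 0.5` each document above keeps at most one topic, which is why the threshold interacts with the recall-oriented metrics.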
5.2. Empirical approximation to the metrics
The correctness of the GFM and Fo formulations
has been demonstrated by Monte Carlo simulation.
The approximation procedure for the GFM metric
(Algorithm 1) can be summarized as follows: i) randomly sample
a pair of documents, ii) check if they belong to the
same class, iii) check if they belong to the same
cluster, iv) check if they belong to the same class
and cluster, and v) compute estimates for P(Suc),
P(S∗c|Su∗), P(Su∗|S∗c) and GFM (lines 16-22).
Algorithm 2 estimates PPM and Fo as follows:
i) sample a cluster, ii) sample two documents from
it and check if they belong to the same class, iii)
sample a class cx, then sample a document dz from
cx and check if it is included in the clustering solution (lines 15-19), and iv) compute estimates for
PPM, RU and Fo (lines 20-22).
Algorithm 1 Approximation to GFM
Require: D = {d1, ..., dN}, the input documents, and maxTrials, the maximum number of trials.
Ensure: Γ, the empirical approximation to the Generalized Fowlkes-Mallows index GFM.
ClaSet(d) and CluSet(d) return, respectively, the set of classes and the set of clusters associated with document d. SampleDocsPair(D) randomly samples a pair from the set of documents D.
 1: sameClassFreq ← 0
 2: sameClustFreq ← 0
 3: sameClassAndClustFreq ← 0
 4: for trials = 1 to maxTrials do
 5:   sameClass ← False
 6:   sameClust ← False
 7:   dx, dy ← SampleDocsPair(D)
 8:   if {ClaSet(dx) ∩ ClaSet(dy)} ≠ ∅ then
 9:     sameClassFreq++
10:     sameClass ← True
11:   end if
12:   if {CluSet(dx) ∩ CluSet(dy)} ≠ ∅ then
13:     sameClustFreq++
14:     sameClust ← True
15:   end if
16:   if (sameClass ∧ sameClust) then
17:     sameClassAndClustFreq++
18:   end if
19:   Puc ← sameClassAndClustFreq/trials
20:   P∗c ← sameClassFreq/trials
21:   Pu∗ ← sameClustFreq/trials
22:   Γ ← sqrt((Puc/Pu∗) · (Puc/P∗c))
23: end for
24: return Γ

Algorithm 2 Approximation to Fo
Require: U = {u1, ..., uk}, the set of clusters, C = {c1, ..., cs}, the set of classes, and maxTrials, the maximum number of trials.
Ensure: Φ0, the empirical approximation to Fo.
SampleClass(C) randomly samples an element from the set of classes C. SampleClust(U) randomly samples an element from the set of clusters U. SampleDocClass(c) randomly samples a document associated with the class c. SampleDocsClust(u) randomly samples a pair of documents associated with the cluster u.
 1: sameClassGivenClustFreq ← 0
 2: recDocsFreq ← 0
 3: RecalledDocs ← ∅
 4: for all uj ∈ U do
 5:   RecalledDocs ← {RecalledDocs ∪ uj}
 6: end for
 7: for trials = 1 to maxTrials do
 8:   sameClass ← False
 9:   sameClust ← False
10:   ux ← SampleClust(U)
11:   dx, dy ← SampleDocsClust(ux)
12:   if {ClaSet(dx) ∩ ClaSet(dy)} ≠ ∅ then
13:     sameClassGivenClustFreq++
14:   end if
15:   cx ← SampleClass(C)
16:   dz ← SampleDocClass(cx)
17:   if (dz ∈ RecalledDocs) then
18:     recDocsFreq++
19:   end if
20:   PPM ← sameClassGivenClustFreq/trials
21:   RU ← recDocsFreq/trials
22:   Φ0 ← 2 · PPM · RU / (PPM + RU)
23: end for
24: return Φ0
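Both algorithms translate almost line-for-line into Python. The sketch below assumes a simplified representation in which each document maps to a set of class labels and a set of cluster labels; all names and data are illustrative, not the paper's actual setup:

```python
import math
import random

def approx_gfm(doc_classes, doc_clusters, max_trials, rng=None):
    """Algorithm 1: Monte Carlo approximation to the generalized
    Fowlkes-Mallows index (GFM)."""
    rng = rng or random.Random()
    docs = sorted(doc_classes)
    same_class = same_clust = same_both = 0
    for _ in range(max_trials):
        dx, dy = rng.sample(docs, 2)                     # SampleDocsPair(D)
        cls = bool(doc_classes[dx] & doc_classes[dy])    # share at least one class
        clu = bool(doc_clusters[dx] & doc_clusters[dy])  # share at least one cluster
        same_class += cls
        same_clust += clu
        same_both += cls and clu
    p_uc = same_both / max_trials                        # P(S_uc)
    p_sc = same_class / max_trials                       # P(S_*c)
    p_us = same_clust / max_trials                       # P(S_u*)
    return math.sqrt((p_uc / p_us) * (p_uc / p_sc))      # line 22 of Algorithm 1

def approx_fo(clusters, classes, doc_classes, max_trials, rng=None):
    """Algorithm 2: Monte Carlo approximation to F_o."""
    rng = rng or random.Random()
    recalled = set().union(*clusters.values())           # lines 4-6: RecalledDocs
    same_class_given_clust = rec_docs = 0
    for _ in range(max_trials):
        ux = rng.choice(sorted(clusters))                # SampleClust(U)
        dx, dy = rng.sample(sorted(clusters[ux]), 2)     # SampleDocsClust(u_x)
        same_class_given_clust += bool(doc_classes[dx] & doc_classes[dy])
        cx = rng.choice(sorted(classes))                 # SampleClass(C)
        dz = rng.choice(sorted(classes[cx]))             # SampleDocClass(c_x)
        rec_docs += dz in recalled                       # lines 17-19
    p_pm = same_class_given_clust / max_trials           # P_PM estimate
    r_u = rec_docs / max_trials                          # R_U estimate
    return 2 * p_pm * r_u / (p_pm + r_u)                 # F_o, line 22

# Sanity check: when the clustering coincides with the classes,
# both estimates should equal 1.
doc_classes = {0: {"a"}, 1: {"a"}, 2: {"a"}, 3: {"b"}, 4: {"b"}, 5: {"b"}}
doc_clusters = {0: {"u1"}, 1: {"u1"}, 2: {"u1"}, 3: {"u2"}, 4: {"u2"}, 5: {"u2"}}
clusters = {"u1": {0, 1, 2}, "u2": {3, 4, 5}}
classes = {"a": {0, 1, 2}, "b": {3, 4, 5}}
print(approx_gfm(doc_classes, doc_clusters, 500, random.Random(0)))     # -> 1.0
print(approx_fo(clusters, classes, doc_classes, 500, random.Random(0))) # -> 1.0
```

The perfect-clustering check is deterministic because every sampled pair then shares a class exactly when it shares a cluster; on real data the estimates fluctuate and converge with the number of trials, as in Figure 1.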
Results of an individual simulation for K = 90,
t = 0.2 are shown in Figure 1, where the convergence
pattern of the empirical measurements to their correct values is depicted.

Figure 1: Monte Carlo approximation of GFM and Fo. (The plot shows the predicted and empirical values of both metrics against the number of trials, from 100 to 12,600.)

5.3. Relation of GFM and Overlapping Fo
We are interested in measuring how sensitive the
Fo, PCMP and GFM metrics are with respect to the
document-topic discretization probability threshold
(t) and the number of topics (K). Furthermore, we
are interested in measuring how the presented metrics correlate with each other. Therefore, we
report two statistical measures. First, Table 4 presents the results of a cross-correlation
between GFM and the components
of Fo for the overall data set.

          Fo      GFM     PCMP    Recall
Fo        1
GFM       0.57    1
PCMP      0.91    0.33    1
Recall   -0.04    0.38   -0.41    1

Table 4: Correlation matrix for the following metrics: Fo,
GFM, PCMP and Recall.

In Table 4 it is possible to observe a high correlation between GFM and Fo, although not high
enough to make the metrics redundant. Recall is
positively correlated with GFM while it is
negatively correlated with precision; this trade-off
is widely acknowledged in the retrieval field. A two-factor analysis of variance has been performed to
evaluate the impact of two factors, the number of topics and the probability threshold, on the considered measurements.

The results, summarized in Table 5, show that
both factors, the probability threshold (t) and the
number of topics (K), have a statistically significant effect on the Fo metric with confidence
above 94%, while this effect can only be moderately noticed on the GFM metric, and only for the probability threshold factor, with a confidence of about 88%.
A potentially important consequence of the higher
sensitivity of the Fo metric is that it becomes the
natural candidate for performing model selection analysis.

Factor                       Fo        GFM
Number of topics (K)         0.0523*   0.5105
Probability threshold (t)    0.0004*   0.1267

Table 5: Two-factor ANOVA of the Fo and GFM metrics;
p-values associated with the Number of topics (K) and the
Probability threshold (t) factors.

6. Conclusions and Future Work

In this paper we have shown that it is possible
to measure the semantic coherence of topic models
by considering them to be special instances of soft-clustering algorithms and then using multi-labeled
corpora as external validation input. In order to
accomplish this goal, we have generalized existing
metrics designed to evaluate non-overlapping partitions, such as the Fowlkes-Mallows index. We have also
proposed metrics with more straightforward probabilistic interpretations that are easier to implement. In both cases we have shown the correctness
of the formulations by empirically approximating
the predicted values through Monte Carlo simulation.

The usage of annotated collections to compute
the validation metrics, as proposed in this work,
will bring additional methodological benefits to
research on probabilistic topic models. For instance, the approach will enable automated validation of models, producing a significant acceleration of the development-validation cycle. In
addition, more robust and objective comparisons
between independently developed models will be
possible, as objections related to the repeatability
of experiments, such as judge selection or expertise, will be eliminated. All of the aforementioned
benefits will be obtained while preserving the usage
of valuable human input as the ultimate benchmark
of model quality.
In future work we are interested in discussing
how the different properties of a topic modeling algorithm, such as completeness, similarity between partitions, or semantic coherence, are stressed by the
different metrics. Moreover, although the proposed metrics are
already based on human input, it would be useful to
more clearly assess the predictive power of such
probabilistic metrics on the performance of machine
learning tasks like classification or retrieval.
References
[1] R. Feldman, J. Sanger, The Text Mining Handbook,
Cambridge University Press, New York, 2007.
[2] The Economist, A special report on managing information: Data, data everywhere, The Economist, ISSN
0013-0613, URL http://www.economist.com/opinion/displaystory.cfm?story_id=15557443.
[3] A. Halevy, P. Norvig, F. Pereira, The Unreasonable Effectiveness of Data, IEEE Intelligent Systems 24 (2)
(2009) 8-12, ISSN 1541-1672, doi:10.1109/MIS.2009.36.
[4] T. L. Griffiths, M. Steyvers, Finding scientific topics,
Proc Natl Acad Sci U S A 101 Suppl 1 (2004) 5228-5235.
[5] T. L. Griffiths, M. Steyvers, A probabilistic approach to
semantic representation, in: G. W., C. Schunn (Eds.),
Proceedings of the Twenty-Fourth Annual Conference
of the Cognitive Science Society, 381-386, 2002.
[6] T. Hofmann, Probabilistic latent semantic analysis, in:
Proc. of Uncertainty in Artificial Intelligence, UAI 1999,
289-296, 1999.
[7] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research 3
(2003) 993-1022.
[8] T. L. Griffiths, M. Steyvers, Probabilistic Topic Models,
Erlbaum, 2007.
[9] T. Hofmann, Unsupervised Learning by Probabilistic
Latent Semantic Analysis, Machine Learning 42 (1-2)
(2001) 177-196, ISSN 0885-6125, URL http://portal.acm.org/citation.cfm?id=599631.
[10] J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, D. Blei,
Reading Tea Leaves: How Humans Interpret Topic
Models, in: Neural Information Processing Systems
(NIPS), 2009.
[11] H. M. Wallach, I. Murray, R. Salakhutdinov, D. M.
Mimno, Evaluation methods for topic models, in:
A. P. Danyluk, L. Bottou, M. L. Littman (Eds.),
ICML, vol. 382 of ACM International Conference
Proceeding Series, ACM, ISBN 978-1-60558-516-1,
139, URL http://dblp.uni-trier.de/db/conf/icml/icml2009.html#WallachMSM09, 2009.
[12] E. B. Fowlkes, C. L. Mallows, A Method for Comparing Two Hierarchical Clusterings, Journal of the American Statistical Association 78 (383) (1983) 553-569,
ISSN 0162-1459, URL http://www.jstor.org/stable/2288117.
[13] D. L. Wallace, A Method for Comparing Two Hierarchical Clusterings: Comment, Journal of the American Statistical Association 78 (383) (1983) 569-576,
ISSN 0162-1459, URL http://www.jstor.org/stable/2288118.
[14] J. Wu, H. Xiong, J. Chen, Adapting the right measures for K-means clustering, in: KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
ACM, New York, NY, USA, ISBN 978-1-60558-495-9,
877-886, doi:10.1145/1557019.1557115, 2009.
[15] L. Denoeud, A. Guénoche, Comparison of Distance
Indices Between Partitions, Studies in Classification,
Data Analysis, and Knowledge Organization (2006) 21-28.
[16] A. Di Nuovo, V. Catania, On External Measures for
Validation of Fuzzy Partitions, in: P. Melin, O. Castillo,
L. Aguilar, J. Kacprzyk, W. Pedrycz (Eds.), Foundations of Fuzzy Logic and Soft Computing, vol. 4529 of
Lecture Notes in Computer Science, Springer Berlin /
Heidelberg, 491-501, 2007.
[17] H. D. Brunk, J. E. Holstein, F. Williams, A Comparison of
Binomial Approximations to the Hypergeometric Distribution, The American Statistician 22 (1) (1968) 24-26.