Latent Semantic Analysis
Latent Semantic Analysis (LSA) is a technique for analyzing documents to find the underlying meaning or concepts of those documents. If each
word only meant one concept, and each concept was only described by one word, then LSA
would be easy since there is a simple mapping from words to concepts.
Unfortunately, this problem is difficult because English has different words that mean the same
thing (synonyms), words with multiple meanings, and all sorts of ambiguities that obscure the
concepts to the point where even people can have a hard time understanding.
For example, the word bank when used together with mortgage, loans, and rates probably means
a financial institution. However, the word bank when used together with lures, casting, and fish
probably means a stream or river bank.
Since authors have a wide choice of words available when they write, the concepts can be
obscured due to different word choices from different authors. This essentially random choice of
words introduces noise into the word-concept relationship. Latent Semantic Analysis filters out
some of this noise and also attempts to find the smallest set of concepts that spans all the
documents.
In order to make this difficult problem solvable, LSA introduces some dramatic simplifications.
1. Documents are represented as "bags of words", where the order of the words in a
document is not important, only how many times each word appears in a document (a
tiny code sketch of this idea follows this list).
2. Concepts are represented as patterns of words that usually appear together in documents.
For example "leash", "treat", and "obey" might usually appear in documents about dog
training.
3. Words are assumed to have only one meaning. This is clearly not the case (banks could be
river banks or financial banks) but it makes the problem tractable.
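As a tiny illustration of the bag-of-words idea, the snippet below uses Python's standard library Counter to turn a sentence into word counts. It is not part of the LSA code developed later in this article; it only shows what a "bag of words" looks like.

    from collections import Counter

    # word order is discarded; only the counts survive
    doc = "the cat sat on the mat"
    print(Counter(doc.split()))
    # -> Counter({'the': 2, 'cat': 1, ...})  (the 1-count words may appear in any order)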
A Small Example
As a small example, I searched for books using the word “investing” at Amazon.com and took
the top 10 book titles that appeared. One of these titles was dropped because it had only one index
word in common with the other titles. An index word is any word that appears in 2 or more
titles and is not a stop word. In this example we have removed the following stop words: "and",
"edition", "for", "in", "little", "of", "the", "to".
Here are the 9 remaining titles. The index words (words that appear in 2 or more titles and are not
stop words) are underlined.
Once Latent Semantic Analysis has been run on this example, we can plot the index words and
titles on an XY graph and identify clusters of titles. The 9 titles are plotted with blue circles and
the 11 index words are plotted with red squares. Not only can we spot clusters of titles, but since
index words can be plotted along with titles, we can label the clusters. For example, the blue
cluster, containing titles T7 and T9, is about real estate. The green cluster, with titles T2, T4, T5,
and T8, is about value investing, and finally the red cluster, with titles T1 and T3, is about the
stock market. The T6 title is an outlier, off on its own.
In the next few sections, we'll go through all steps needed to run Latent Semantic Analysis on this
example.
Throughout this article, we'll give Python code that implements all the steps necessary for doing
Latent Semantic Analysis. We'll go through the code section by section and explain everything.
The Python code used in this article can be downloaded here and then run in Python. You need to
have already installed the Python NumPy and SciPy libraries.
First we need to import a few functions from Python libraries to handle some of the math we need
to do. NumPy is the Python numerical library, and we'll import zeros, a function that creates a
matrix of zeros that we use when building our words by titles matrix. From the linear algebra part
of the scientific package (scipy.linalg) we import the svd function that actually does the singular
value decomposition, which is the heart of LSA.
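    from numpy import zeros
    from scipy.linalg import svd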
Next, we define the data that we are using. titles holds the 9 book titles that we have gathered,
stopwords holds the 8 common words that we are going to ignore when we count the words in
each title, and ignorechars has all the punctuation characters that we will remove from words. We
use Python's triple quoted strings, so there are actually only 4 punctuation symbols we are
removing: comma (,), colon (:), apostrophe ('), and exclamation point (!).
    titles = [
        "The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns",
        "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!",
        # ... the other six titles from the search go here ...
        "Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss"
    ]
    stopwords = ['and', 'edition', 'for', 'in', 'little', 'of', 'the', 'to']
    ignorechars = ''',:'!'''
The LSA class has methods for initialization, parsing documents, building the matrix of word
counts, and calculating the SVD. The first method is the __init__ method, which is called whenever an
instance of the LSA class is created. It stores the stopwords and ignorechars so they can be used
later, and then initializes the word dictionary and the document count variables.
    class LSA(object):
        def __init__(self, stopwords, ignorechars):
            self.stopwords = stopwords
            self.ignorechars = ignorechars
            self.wdict = {}
            self.dcount = 0
The parse method takes a document, splits it into words, removes the ignored characters and turns
everything into lowercase so the words can be compared to the stop words. If the word is a stop
word, it is ignored and we move on to the next word. If it is not a stop word, we put the word in
the dictionary, and also append the current document number to keep track of which documents
the word appears in.
The documents that each word appears in are kept in a list associated with that word in the
dictionary. For example, since the word book appears in titles 3 and 4, we would have
self.wdict['book'] = [3, 4] after all titles are parsed.
After processing all words from the current document, we increase the document count in
preparation for the next document to be parsed.
    def parse(self, doc):
        words = doc.split()
        for w in words:
            w = w.lower().translate(None, self.ignorechars)
            if w in self.stopwords:
                continue
            elif w in self.wdict:
                self.wdict[w].append(self.dcount)
            else:
                self.wdict[w] = [self.dcount]
        # all words in this document processed; move on to the next document
        self.dcount += 1
Once all documents are parsed, all the words (dictionary keys) that are in more than 1 document
are extracted and sorted, and a matrix is built with the number of rows equal to the number of
words (keys), and the number of columns equal to the document count. Finally, for each word
(key) and document pair the corresponding matrix cell is incremented.
    def build(self):
        # keep only words that appear in more than one document
        self.keys = [k for k in self.wdict.keys() if len(self.wdict[k]) > 1]
        self.keys.sort()
        self.A = zeros([len(self.keys), self.dcount])
        for i, k in enumerate(self.keys):
            for d in self.wdict[k]:
                self.A[i,d] += 1
The printA() method is very simple: it just prints out the matrix that we have built so it can be
checked.
def printA(self):
print self.A
After defining the LSA class, it's time to try it out on our 9 book titles. First we create an instance
of LSA, called mylsa, and pass it the stopwords and ignorechars that we defined. During creation,
the __init__ method is called which stores the stopwords and ignorechars and initializes the word
dictionary and document count.
Next, we call the parse method on each title. This method extracts the words in each title, strips
out punctuation characters, converts each word to lower case, throws out stop words, and stores
remaining words in a dictionary along with what title number they came from.
Finally we call the build() method to create the matrix of word by title counts. This extracts all
the words we have seen so far, throws out words that occur in fewer than 2 titles, sorts them, builds
a zero matrix of the right size, and then increments the proper cell whenever a word appears in a
title.
    mylsa = LSA(stopwords, ignorechars)
    for t in titles:
        mylsa.parse(t)
    mylsa.build()
    mylsa.printA()
Here is the raw output produced by printA(). Each row corresponds to one of the 11 index words,
each column to one of the 9 titles, and each cell counts how many times that word appears in that title.
[[ 0. 0. 1. 1. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 1. 0. 0. 1.]
[ 0. 1. 0. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 0. 0. 1. 0. 1.]
[ 1. 0. 0. 0. 0. 1. 0. 0. 0.]
[ 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[ 1. 0. 1. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 1. 0. 1.]
[ 0. 0. 0. 0. 0. 2. 0. 0. 1.]
[ 1. 0. 1. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 0. 1. 1. 0. 0. 0. 0.]]
In more sophisticated LSA systems, the raw matrix counts are usually modified with a TFIDF
(Term Frequency - Inverse Document Frequency) weighting, in which the count in each cell is
replaced by the following formula:

TFIDFi,j = ( Ni,j / N*,j ) * log( D / Di )

where:

Ni,j = the number of times word i appears in document j (the original cell count).
N*,j = the number of total words in document j (just add the counts in column j).
D = the number of documents (the number of columns).
Di = the number of documents in which word i appears (the number of non-zero columns
in row i).

In this formula, words that are concentrated in certain documents are emphasized (by the Ni,j / N*,j
ratio) and words that only appear in a few documents are also emphasized (by the log( D / Di )
term). For example, a word that appears twice in a six-word title and shows up in only three of the
nine titles would get a weight of (2/6) * log(9/3) ≈ 0.37 (using the natural log).
Since we have such a small example, we will skip this step and move on to the heart of LSA: doing
the singular value decomposition of our matrix of counts. However, if we did want to add TFIDF
to our LSA class, we could add the following two lines at the beginning of our Python file to
import the log, asarray, and sum functions.
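    from math import log
    from numpy import asarray, sum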
Then we would add the following TFIDF method to our LSA class. WordsPerDoc (N*,j) just
holds the sum of each column, which is the total number of index words in each document.
DocsPerWord (Di) uses asarray to create an array of what would be True and False values,
depending on whether the cell value is greater than 0 or not, but the 'i' argument turns it into 1's
and 0's instead. Then each row is summed up which tells us how many documents each word
appears in. Finally, we just step through each cell and apply the formula. We do have to change
cols (which is the number of documents) into a float to prevent integer division.
    def TFIDF(self):
        WordsPerDoc = sum(self.A, axis=0)
        DocsPerWord = sum(asarray(self.A > 0, 'i'), axis=1)
        rows, cols = self.A.shape
        for i in range(rows):
            for j in range(cols):
                self.A[i,j] = (self.A[i,j] / WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])
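As an aside that is not part of the original class, the same weighting could also be written without the explicit double loop by using NumPy broadcasting. A rough sketch follows; the method name and the nplog alias are introduced here purely for illustration.

    from numpy import log as nplog

    def TFIDF_vectorized(self):
        # same weighting as TFIDF() above, written with NumPy broadcasting
        WordsPerDoc = self.A.sum(axis=0)          # total index-word count in each title
        DocsPerWord = (self.A > 0).sum(axis=1)    # number of titles each word appears in
        cols = self.A.shape[1]                    # number of documents
        self.A = (self.A / WordsPerDoc) * nplog(float(cols) / DocsPerWord)[:, None]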
The reason SVD is useful is that it finds a reduced dimensional representation of our matrix that
emphasizes the strongest relationships and throws away the noise. In other words, it makes the
best possible reconstruction of the matrix with the least possible information. To do this, it throws
out noise, which does not help, and emphasizes strong patterns and trends, which do help. The
trick in using SVD is in figuring out how many dimensions or "concepts" to use when
approximating the matrix. Too few dimensions and important patterns are left out, too many and
noise caused by random word choices will creep back in.
The SVD algorithm is a little involved, but fortunately Python has a library function that makes it
simple to use. By adding the one line method below to our LSA class, we can factor our matrix
into 3 other matrices. The U matrix gives us the coordinates of each word on our “concept” space,
the Vt matrix gives us the coordinates of each document in our “concept” space, and the S matrix
of singular values gives us a clue as to how many dimensions or “concepts” we need to include.
    def calc(self):
        self.U, self.S, self.Vt = svd(self.A)
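To make the idea of "best reconstruction with less information" concrete, here is a small sketch (assuming mylsa.calc() has already been called) that rebuilds a rank-k approximation of the count matrix from only the first k singular values and vectors.

    from numpy import dot, diag

    k = 3   # number of "concepts" to keep
    # U[:, :k] * diag(S[:k]) * Vt[:k, :] is the closest rank-k matrix to A (in the least-squares sense)
    approx = dot(mylsa.U[:, :k], dot(diag(mylsa.S[:k]), mylsa.Vt[:k, :]))
    print(approx.round(1))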
In order to choose the right number of dimensions to use, we can make a histogram of the square
of the singular values. This shows how much each singular value contributes to approximating our
matrix. Here is the histogram for our example.
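A rough sketch of that plot, assuming the matplotlib library is available (it is not otherwise needed in this article):

    import matplotlib.pyplot as plt

    importance = mylsa.S ** 2   # squared singular values from calc()
    plt.bar(range(1, len(importance) + 1), importance)
    plt.xlabel('Singular value')
    plt.ylabel('Squared value')
    plt.show()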
For large collections of documents, the number of dimensions used is in the 100 to 500 range. In
our little example, since we want to graph it, we’ll use 3 dimensions, throw out the first
dimension, and graph the second and third dimensions.
The reason we throw out the first dimension is interesting. For documents, the first dimension
correlates with the length of the document. For words, it correlates with the number of times that
word has been used in all documents. If we had centered our matrix, by subtracting the average
column value from each column, then we would use the first dimension. As an analogy, consider
golf scores. We don't want to know the actual score; we want to know the score relative to par.
That tells us whether the player made a birdie, bogey, etc.
The reason we don't center the matrix when using LSA is that doing so would turn a sparse matrix
into a dense one and dramatically increase the memory and computation requirements. It's more
efficient not to center the matrix and simply throw out the first dimension instead.
Here is the complete 3 dimensional Singular Value Decomposition of our matrix. Each word has
3 numbers associated with it, one for each dimension. The first number tends to correspond to the
number of times that word appears in all titles and is not as informative as the second and third
dimensions, as we discussed. Similarly, each title also has 3 numbers associated with it, one for
each dimension. Once again, the first dimension is not very interesting because it tends to
correspond to the number of words in the title.
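Those numbers can be read straight out of the decomposition. A minimal sketch (again assuming mylsa.calc() has been run, with the index words stored in mylsa.keys):

    # three "concept" coordinates for every index word (first three columns of U)
    word_coords = mylsa.U[:, :3]
    # three "concept" coordinates for every title (first three rows of Vt, transposed)
    title_coords = mylsa.Vt[:3, :].T
    print(word_coords.round(2))
    print(title_coords.round(2))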
Using the second dimension, the titles split into two groups:

Dim2     Titles
red      6-7, 9
blue     1-5, 8

Using the third dimension, we can split each of these groups again in the same way. For example,
looking at the third dimension, title 6 is blue, but titles 7 and 9 are still red. Doing this for both
groups, we end up with these 4 groups: titles 7 and 9; title 6; titles 1 and 3; and titles 2, 4, 5, and 8.
In the graph below, words are represented by red squares and titles are represented by blue
circles. For example, the word "book" has dimension values (0.15, -0.27, 0.04). We ignore the
first dimension value 0.15 and plot "book" at position (x = -0.27, y = 0.04), as can be seen in the
graph. Titles are similarly graphed.
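A plot along these lines can be produced with matplotlib roughly as follows. This is a sketch, not part of the downloadable code; it assumes the mylsa object built earlier and that calc() has been run.

    import matplotlib.pyplot as plt

    # use the second and third dimensions only; the first is ignored as discussed
    wx, wy = mylsa.U[:, 1], mylsa.U[:, 2]
    tx, ty = mylsa.Vt[1, :], mylsa.Vt[2, :]

    plt.scatter(wx, wy, marker='s', color='red')   # words as red squares
    plt.scatter(tx, ty, marker='o', color='blue')  # titles as blue circles
    for i, word in enumerate(mylsa.keys):
        plt.annotate(word, (wx[i], wy[i]))
    for j in range(mylsa.dcount):
        plt.annotate('T%d' % (j + 1), (tx[j], ty[j]))
    plt.show()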
One advantage of this technique is that both words and titles are placed on the same graph. Not
only can we identify clusters of titles, but we can also label the clusters by looking at what words
are also in the cluster. For example, the lower left cluster has titles 1 and 3 which are both about
stock market investing. The words "stock" and "market" are conveniently located in the cluster,
making it easy to see what the cluster is about. Another example is the middle cluster which has
titles 2, 4, 5, and, to a somewhat lesser extent, title 8. Titles 2, 4, and 5 are close to the words
"value" and "investing" which summarizes those titles quite well.
Advantages, Disadvantages, and Applications of LSA
Latent Semantic Analysis has many nice properties that make it widely applicable to many
problems.
1. First, the documents and words end up being mapped to the same concept space. In this
space we can cluster documents, cluster words, and most importantly, see how these
clusters coincide so we can retrieve documents based on words and vice versa.
2. Second, the concept space has vastly fewer dimensions compared to the original matrix.
Not only that, but these dimensions have been chosen specifically because they contain
the most information and least noise. This makes the new concept space ideal for running
further algorithms such as testing different clustering algorithms.
3. Last, LSA is an inherently global algorithm that looks at trends and patterns from all
documents and all words so it can find things that may not be apparent to a more locally
based algorithm. It can also be combined with a more local algorithm such as nearest neighbors,
often becoming more useful than either algorithm by itself.
There are a few limitations that must be considered when deciding whether to use LSA. Some of
these are:
1. LSA implicitly assumes Gaussian-distributed data, since it minimizes the Frobenius norm of
the reconstruction error, and this may not fit all problems. For example, word counts in
documents seem to follow a Poisson distribution rather than a Gaussian one.
2. LSA cannot handle polysemy (words with multiple meanings) effectively. It assumes that
the same word means the same concept which causes problems for words like bank that
have multiple meanings depending on which contexts they appear in.
3. LSA depends heavily on SVD which is computationally intensive and hard to update as
new documents appear. However recent work has led to a new efficient algorithm which
can update SVD based on new documents in a theoretically exact sense.
In spite of these limitations, LSA is widely used for finding and organizing search results,
grouping documents into clusters, spam filtering, speech recognition, patent searches, automated
essay evaluation, etc.
As an example, iMetaSearch uses LSA to map search results and words to a “concept” space.
Users can then find which results are closest to which words and vice versa. The LSA results are
also used to cluster search results together so that you save time when looking for related results.