12. Classification of texts¶
Note
The script with the code for this chapter is nlpTextclass.py, which you can download with codeDowner(); see Practice 1, question 2.
12.1. From points to vectors¶
In Chapter 9 on text statistics you became accustomed to seeing a measurement plotted as in this diagram:
I explained this by saying that you count 3 across the x axis and then 3 up the y axis to arrive at the location of point A, (3,3). In this section you will learn that there is more to this manner of representation than meets the eye.
12.1.1. The vector representation of points¶
Imagine that the x and y axes are strings with weights attached to them. Actually, you don’t have to imagine at all; an apparatus built from strings and weights is a staple of physics labs called a force table:
Top view  Side view 

The idea is to position the strings so that the force of gravity pulling down on the weights is balanced. In the balanced state, the ring to which the strings are tied does not touch the center post (it is pulled equally in all three directions). Your intuition should tell you that, if the weights are the same, then the weight is equally distributed when each string is in some sense perpendicular to the others. In a 360° circle, this means that each string is 120° from the other two.
Now return your attention to Fig. 12.1 and imagine that the x and y axes are strings with weights attached to them that are pulling on the zero point or origin:
They exert a force that is balanced between them along the red arrow. Mathematically, the direction of equilibrium is their sum. Thus the point depicted in Fig. 12.1 can be understood as the force exerted by an x weight of 3 and a y weight of 3:
I will refer to a point interpreted in this fashion as a vector, whose technical definition is a directed line segment. Where typographically feasible, the name of a vector is annotated with a combining right arrow above it, \(\overrightarrow{a}\). The point where a vector ends is given in square brackets, [3,3]. Each of these numbers is a component of the vector. A vector has two properties. One is its length or magnitude \(|\overrightarrow{a}|\), calculated by finding the square root of the sum of each component squared, \(|\overrightarrow{a}|=\sqrt{x^2 + y^2}\). The other is its direction, measured as the angle that it makes with respect to the x axis, called θ, and found by the inverse tangent of y divided by x, \(∠\overrightarrow{a}=θ=\tan^{-1}\frac{y}{x}\). You don’t have to understand the trigonometry, however; all you have to know is that the two quantities are calculable. Python will do the calculating for you.
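To see how little work this is in practice, here is a minimal sketch that computes both properties for the vector [3,3] with Python's standard math module. The variable names are my own, chosen for illustration.

```python
import math

x, y = 3.0, 3.0  # the two components of the vector [3, 3]

# Magnitude: the square root of the sum of each component squared.
magnitude = math.sqrt(x ** 2 + y ** 2)

# Direction: the inverse tangent of y divided by x, converted to degrees.
# atan2 is used instead of atan(y / x) so that x = 0 does not raise an error.
theta = math.degrees(math.atan2(y, x))

print('magnitude = {:.2f}'.format(magnitude))
print('theta = {:.1f} degrees'.format(theta))
```

The vector lies halfway between the two axes, so the angle comes out at 45 degrees, while the magnitude is the square root of 18.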
12.1.2. How to construct text vectors in pattern.vector¶
The vector module of pattern converts documents to vectors and provides several means of classifying them. In this section, I walk you through the various methods for doing so. Go ahead and import the main constructs:
>>> from pattern.vector import Document, Model, TFIDF
12.1.2.1. How to make a frequency and a probability distribution with Document()¶
The single method Document() builds in all of the utilities to convert a string to a frequency distribution and from there to a vector. As you might guess, it takes many arguments to fine-tune the conversion process. You are welcome to check out their explanation in help(Document), but they are complex enough to merit a more in-depth treatment. Their default values are detailed in the following listing, copied from pattern.vector's documentation:
document = Document(string,
filter = lambda w: w.lstrip("'").isalnum(),
punctuation = '.,;:!?()[]{}\'`"@#$*+=~_',
top = None, # Filter words not in the top most frequent.
threshold = 0, # Filter words whose count falls below threshold.
exclude = [], # Filter words in the exclude list.
stemmer = None, # STEMMER | LEMMA | function | None.
stopwords = False, # Include stop words?
name = None,
type = None,
language = None,
description = None)

Tokenization is performed implicitly and the highlighted lines 2 and 3 show how the resulting tokens are required to be alphanumeric and stripped of quotation marks, as well as the characters in the string of punctuation. Line 4 lets you keep only the most frequent words, up to the number that you set. Line 5 lets you remove the words whose count falls below the minimum that you set. Line 6 lets you set a list of additional words to exclude. Line 7 lets you call a stemmer. Line 8 lets you include stop words. Line 9 lets you set a string as a name for the document, while line 10 lets you set a string as a type for the document. The logical usage would be to give each document a unique name but group them into types if necessary. Line 11 specifies a language. And finally, line 12 lets you set a string as a description of the document.
Here is an example, as well as a printout of the material that the method returns:
>>> d1 = Document('A tiger is a big yellow cat with stripes.', name='tiger',
... type='test', description="Example from Document()'s documentation.")
>>> properties = {'desc': d1.description,
... 'feat': d1.features,
... 'id': d1.id,
... 'lang': d1.language,
... 'model': d1.model,
... 'name': d1.name,
... 'terms': d1.terms,
... 'type': d1.type,
... 'vect': d1.vector,
... 'wrdcnt':d1.wordcount,
... 'words': d1.words}
>>> from pprint import pprint
>>> pprint(properties)
{'desc': "Example from Document()'s documentation.",
'feat': [u'tiger', u'stripes', u'yellow', u'cat'],
'id': 'Q36TFfV1',
'lang': None,
'model': None,
'name': 'tiger',
'terms': {u'cat': 1, u'stripes': 1, u'tiger': 1, u'yellow': 1},
'type': 'test',
'vect': {u'cat': 0.25, u'stripes': 0.25, u'tiger': 0.25, u'yellow': 0.25},
'words': {u'cat': 1, u'stripes': 1, u'tiger': 1, u'yellow': 1},
'wrdcnt': 4}

Document() returns two dictionaries, a tally of word counts called terms or words and a dictionary of word weights called vector. features lists the terms/words, which is to say, the keys which are common to both dictionaries. They are called “features” because they are the characteristics by means of which a document can be classified. Their total is given by wordcount.
The weights of vector for an isolated document – pythonically, the values of the feature/term/word keys – are simply the feature counts divided by the total word count. This is equivalent to what we called a probability distribution in Chapter 9. pattern.vector's documentation refers to this calculation as term frequency or tf. It can be retrieved with tf() or term_frequency():
>>> tigerFeatures = d1.features
>>> [d1.tf(f) for f in tigerFeatures]
[0.25, 0.25, 0.25, 0.25]
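
The same quantity can be worked out by hand, which may help to demystify it. The sketch below uses only the standard library; the list kept is my own stand-in for the four words that survive Document()'s filtering of the tiger sentence, not something pattern.vector exposes under that name.

```python
from collections import Counter

# Stand-in for the words left after stop words and punctuation are removed.
kept = ['tiger', 'yellow', 'cat', 'stripes']

counts = Counter(kept)               # word -> number of occurrences
total = float(sum(counts.values()))  # total word count, here 4.0

# Term frequency: each count divided by the total word count.
tf = {word: count / total for word, count in counts.items()}

for word in sorted(tf):
    print('{}\t{}'.format(word, tf[word]))
```

Because every word occurs exactly once, each weight is one quarter, and the weights sum to one, as a probability distribution should.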

Several descriptive properties of Document() should be apparent from the previous discussion, such as description, language, name and type. A new one is id, which is assigned by Document(). It is the preferred property to use to address the document as a whole. Another one is model, which refers to the corpus which the document is part of. Since there is only one document, there is no corpus – yet.
12.1.2.2. How to make a corpus with Model()¶
Now is the time to change that. Create two more documents, without a description:
>>> d2 = Document('A lion is a big yellow cat with a mane.', name='lion', type='test')
>>> d3 = Document('An elephant is a big gray animal with a trunk.', name='elephant', type='test')

Combine them into a corpus with Model():
>>> m = Model(documents=[d1, d2, d3], weight=TFIDF)
Check to see whether this has changed the model property of the first document:
>>> d1.model
It has, but we do not have direct access to the pattern.vector.Model object that is produced. To check for a document in the model, use:
>>> m.document('tiger')
>>> m.documents

The main usefulness of Model() is to facilitate comparison of documents across the corpus. It does so by calculating a new dictionary of feature/term/word weights for each document, in a manner which I explain in a moment. A list of the dictionary’s keys is returned by features:
>>> m.features
[u'mane', u'gray', u'elephant', u'stripes', u'yellow', u'cat', u'tiger',
u'lion', u'animal', u'trunk']

A list of document dictionaries, i.e., the new feature vectors, is returned by m.vectors:
>>> pprint(m.vectors)
[{u'cat': 0.10136634521138871,
u'stripes': 0.2746532569132872,
u'tiger': 0.2746532569132872,
u'yellow': 0.10136634521138871},
{u'cat': 0.10136634521138871,
u'lion': 0.2746532569132872,
u'mane': 0.2746532569132872,
u'yellow': 0.10136634521138871},
{u'animal': 0.2746532569132872,
u'elephant': 0.2746532569132872,
u'gray': 0.2746532569132872,
u'trunk': 0.2746532569132872}]

Oddly enough, Model()'s new weights overwrite the individual documents’ old weights:
>>> pprint(d1.vector)
{u'cat': 0.10136634521138871,
u'stripes': 0.2746532569132872,
u'tiger': 0.2746532569132872,
u'yellow': 0.10136634521138871}

From all having the weight of 0.25, they have evolved to two not-readily-interpretable quantities. So how are they calculated?
The first step is for Model() to tally up the number of documents containing a feature. This tally can be retrieved for a given feature from inverted, which is a dictionary of feature-set pairs:
>>> m.inverted['cat']
set([Document(id='Q3ByNfT1', name='tiger', type='test'),
Document(id='Q3ByNfT2', name='lion', type='test')])

Then each count is divided by the total number of documents. pattern.vector's documentation calls this quotient document frequency or df, though it can also be understood as a probability distribution. The document frequency of a feature can be retrieved from a model through document_frequency() or df():
>>> [m.df(f) for f in tigerFeatures]
[0.3333333333333333, 0.3333333333333333, 0.6666666666666666, 0.6666666666666666]

Even with the stop words excluded, a word that occurs frequently across all documents will tend to get a larger weight. Yet the supposition underlying this approach is that words that appear frequently in all documents are probably not very relevant to any of them. Thus it would be desirable to ‘invert’ the document frequency, which can be done by dividing it into one, \(\frac{1}{df}\). This just moves the problem around, though. The huge values of the most frequent words are now tiny, maintaining the large inequality among weights. To fix this problem, the logarithm of the inverted document frequency is taken, \(\log(\frac{1}{df})\). This measure is called the inverse document frequency or idf of a feature. It can be retrieved from a model through the methods inverse_document_frequency() or idf():
>>> from numpy import log
>>> [log(1/m.df(f)) for f in tigerFeatures]
[1.0986122886681098, 1.0986122886681098, 0.40546510810816438, 0.40546510810816438]
>>> [m.idf(f) for f in tigerFeatures]
[1.0986130276531487, 1.0986130276531487, 0.40546538084555483, 0.40546538084555483]

The logarithm squashes the weights into a smaller range, but what about term frequency, tf? The final step is to bring it back into the fold by multiplying it by idf, \(tf \times \log(\frac{1}{df})\). This calculation can also be retrieved from a document by means of term_frequency_inverse_document_frequency(), tf_idf() or tfidf():
>>> [d1.tf(f)*m.idf(f) for f in tigerFeatures]
[0.2746532569132872, 0.2746532569132872, 0.10136634521138871, 0.10136634521138871]
>>> [d1.tfidf(f) for f in tigerFeatures]
[0.2746532569132872, 0.2746532569132872, 0.10136634521138871, 0.10136634521138871]
>>> pprint(m.vectors[0])
{u'cat': 0.10136634521138871,
u'stripes': 0.2746532569132872,
u'tiger': 0.2746532569132872,
u'yellow': 0.10136634521138871}
>>> [round(d1.tfidf(f),2) for f in tigerFeatures]
[0.27, 0.27, 0.1, 0.1]

Line 1 performs the multiplication, while line 3 uses a document method to do it. Line 5 extracts the tiger dictionary from the model to display the features that correspond to the weights calculated in the previous lines. Finally, I am sure that you are tired of looking at such giant numbers, so line 10 rounds them to two decimal places.
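The whole chain from tf through df and idf to tf-idf can also be retraced without pattern.vector. The following is only a sketch under my own assumptions: the docs dictionary holds the features that survive filtering in each of the three sentences, and the function names merely echo pattern.vector's; they are not its implementation.

```python
import math

# Features left in each document after filtering, copied from the listings.
docs = {
    'tiger': ['tiger', 'stripes', 'yellow', 'cat'],
    'lion': ['lion', 'mane', 'yellow', 'cat'],
    'elephant': ['elephant', 'gray', 'animal', 'trunk'],
}

def tf(word, name):
    """Term frequency: count of the word divided by the document's word count."""
    words = docs[name]
    return words.count(word) / float(len(words))

def df(word):
    """Document frequency: share of documents that contain the word."""
    containing = [name for name in docs if word in docs[name]]
    return len(containing) / float(len(docs))

def idf(word):
    """Inverse document frequency: natural log of one over df."""
    return math.log(1.0 / df(word))

def tfidf(word, name):
    return tf(word, name) * idf(word)

for word in docs['tiger']:
    print('{}\t{:.2f}\t{:.2f}\t{:.2f}'.format(
        word, df(word), idf(word), tfidf(word, 'tiger')))
```

The words that tiger shares with lion, namely yellow and cat, end up with the smaller weight, while the words unique to tiger keep the larger one.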
12.1.3. Practice¶
To help sharpen your intuitions about the tf-idf calculation, bring the stop words back into each document, as illustrated in the line below, and recalculate the model:
>>> d0 = Document('A tiger is a big yellow cat with stripes.', name='tiger0', stopwords=True)
How do the weights change?
12.2. Similarity in a vector space¶
Consider the three vectors in the graph below. Do you see any similarity among them?
Most people say that \(\overrightarrow{a}\) and \(\overrightarrow{b}\) are similar whereas \(\overrightarrow{c}\) is different.
12.2.1. Similarity as the angle between two vectors¶
How could that intuition be made precise, mathematically precise? The solution that mathematicians have hit upon is to say that two vectors are more similar the closer together they are. This closeness can be measured as the angle between them, which is found by the inverse cosine or arccos of the dot product of the two vectors divided by the product of their magnitudes, \(θ(\overrightarrow{a},\overrightarrow{b})=\arccos\frac{\overrightarrow{a}•\overrightarrow{b}}{|\overrightarrow{a}||\overrightarrow{b}|}\). So now you know. OK, don’t freak out. Python will calculate this for you.
It follows that two vectors are identical if the angle between them is zero:
Magnitude plays no role. You can understand this by thinking back to the force-table interpretation of vectors. If \(\overrightarrow{a}\) is a string on the table, how would you convert it to \(\overrightarrow{b}\)? The simplest way is to just add more weight to string a. The additional weight increases the force a exerts, but it is still the same string. Now you can translate this insight back to the interpretation of vectors as directed line segments by proposing that \(\overrightarrow{b}\) is just \(\overrightarrow{a}\) times 2, \(\overrightarrow{b}=\overrightarrow{a}*2\), or \([6,6]=[3,3]*2\). In this way, the components of a directed line segment can be seen as analogous to the weights of a string on a force table. This analogy is used extensively in the textual interpretation of vectors.
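A quick check of the claim that magnitude plays no role: the sketch below computes the angle between two vectors from the dot product and the magnitudes. The helper names dot, magnitude and angle are my own, and the second comparison uses a made-up vector, [3,-3], simply to show a contrasting case.

```python
import math

def dot(a, b):
    """Dot product: the sum of the products of matching components."""
    return sum(x * y for x, y in zip(a, b))

def magnitude(a):
    return math.sqrt(dot(a, a))

def angle(a, b):
    """Angle between a and b in degrees."""
    cosine = dot(a, b) / (magnitude(a) * magnitude(b))
    # Clamp to [-1, 1] in case floating-point rounding strays just outside.
    cosine = max(-1.0, min(1.0, cosine))
    return math.degrees(math.acos(cosine))

# Same direction, double the magnitude.
print('{:.3f}'.format(angle([3, 3], [6, 6])))
# A vector at right angles to [3, 3].
print('{:.3f}'.format(angle([3, 3], [3, -3])))
```

Doubling every component leaves the angle at zero, up to floating-point rounding, whereas the perpendicular vector is a full 90 degrees away.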
12.2.2. How to calculate document similarity¶
Returning to pattern.vector.Model, since the model contains a vector representation of the three documents, there should be a way to compare their similarity through their vector angles. And there is, with similarity():
>>> m.similarity(d1, d2) # tiger vs. lion
0.11988321306398905
>>> m.similarity(d1, d3) # tiger vs. elephant
0.0
>>> m.similarity(d2, d3) # lion vs. elephant
0.0

Do these numbers make intuitive sense to you?
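If you would like to check them by hand, the sketch below computes the cosine of the angle between two tf-idf dictionaries. The weights are copied from the m.vectors listing; the function cosine_similarity is my own and only assumes that similarity() reports the cosine rather than the angle itself, which the numbers bear out.

```python
import math

A = 0.2746532569132872   # weight of a feature found in one document
B = 0.10136634521138871  # weight of a feature found in two documents

tiger = {'cat': B, 'stripes': A, 'tiger': A, 'yellow': B}
lion = {'cat': B, 'lion': A, 'mane': A, 'yellow': B}
elephant = {'animal': A, 'elephant': A, 'gray': A, 'trunk': A}

def cosine_similarity(v1, v2):
    """Dot product over shared features, divided by the product of magnitudes."""
    shared = set(v1) & set(v2)
    dot = sum(v1[f] * v2[f] for f in shared)
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

print('{:.4f}'.format(cosine_similarity(tiger, lion)))
print('{:.4f}'.format(cosine_similarity(tiger, elephant)))
```

Only the features that two documents share contribute to the dot product, which is why the elephant, sharing nothing with either cat, scores zero.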
12.3. Three operations on vector spaces¶
12.3.1. The dimensionality of a vector space¶
But before we return to the main topic of the chapter, I need to ask you something. How many dimensions does a graph like Fig. 12.5 have? I hope you said two, something like breadth and height.
How could you increase the number of dimensions of this space? You can imagine that a third dimension can be incorporated by ‘pulling’ the screen towards you and so adding depth. It is conventionally stated that a fourth dimension can be incorporated by including movement through 3D space, which is the dimension of time. Beyond these four dimensions of spacetime, there is little to nothing in human experience on which to base intuitions about additional dimensions. Yet mathematicians (and physicists, as in the 11 dimensions of string theory) routinely manage spaces of higher dimensionality. How can they do this?
They can do it because the equations can be augmented with additional components or weights or dimensions, even though we may not be able to map these onto our everyday life. As our simplest example, recall the equation for calculating the magnitude of a vector. It happily accepts more components, for instance \(|\overrightarrow{d}|=\sqrt{x^2+y^2+z^2+α^2+β^2+γ^2}\). \(\overrightarrow{d}\) has six components and so six dimensions. Even though I don’t know what the dimensions of α, β, or γ are, nor what it would feel like to live in them, I can still calculate \(|\overrightarrow{d}|\). This sort of induction to higher dimensions holds for all of the calculations that we are going to perform.
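In code, the induction amounts to looping over however many components there happen to be. A minimal sketch, with a function name and example vectors of my own choosing:

```python
import math

def magnitude(components):
    """Square root of the sum of squared components, for any number of dimensions."""
    return math.sqrt(sum(c ** 2 for c in components))

print(magnitude([3, 4]))              # two dimensions
print(magnitude([1, 1, 1, 1, 1, 1]))  # six dimensions
```

Nothing in the function mentions two, three or six; the number of dimensions is simply the length of the list.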
12.3.2. Dimensionality reduction¶
Now, how would you reduce the two dimensions of our graphs to just one? Let me pivot around the vectors of Fig. 12.4 to give you a concrete example:
Intuitively, what needs to be done is to ‘get rid of’ the y axis, which would smash \(\overrightarrow{a}\) and \(\overrightarrow{b}\) together on the negative end of the x axis, leaving \(\overrightarrow{c}\) on the positive end:
You might object that such a reduction in dimensionality destroys information, in particular, the difference between \(\overrightarrow{a}\) and \(\overrightarrow{b}\). I would counter that we concluded on the basis of Fig. 12.4 that \(\overrightarrow{a}\) and \(\overrightarrow{b}\) are similar while \(\overrightarrow{c}\) is different. Thus the dimensionality reduction can be read as tossing out some useless information (the minor difference between \(\overrightarrow{a}\) and \(\overrightarrow{b}\) carried by the y axis) and so emphasizing the important information (the contrast between those two and \(\overrightarrow{c}\) on the x axis).
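To make the idea concrete, here is a sketch that discards the y component by hand. The coordinates are hypothetical, since the text does not give the values plotted in the figure, and dropping an axis outright is only a stand-in for the more sophisticated choice of axis described next.

```python
# Hypothetical coordinates: a and b differ only slightly on y; c lies far off on x.
vectors = {
    'a': (-3.0, 1.0),
    'b': (-3.0, -1.0),
    'c': (3.0, 0.5),
}

# Keep the x component of each vector and throw the y component away.
reduced = {name: x for name, (x, y) in vectors.items()}

for name in sorted(reduced):
    print('{}\t{}'.format(name, reduced[name]))
```

After the reduction a and b are indistinguishable, while the contrast between them and c is fully preserved.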
The mathematical process to show which dimensions bear little information and so can be discarded is called singular value decomposition, SVD, which is a complex operation – so complex that I won’t even try to show an equation for it. And it doesn’t matter. Python will calculate it for you.
12.3.2.1. How to reduce dimensions with latent semantic analysis¶
Warning
need an intro
Let us calculate a new model to maintain a clear distinction between the different calculations:
>>> m2 = Model(documents=[d1, d2, d3])
I did not set the weighting scheme to TFIDF because it is the default. If you ever lose track of the weighting scheme, you can check it with the weight property:
>>> m2.weight
Now you can use reduce() to decrease the number of dimensions to as few as you please. Recall that we start with ten, one for each feature in the model, which I am going to pare down to two, in hopes of finding a coherent differentiation of the two ‘cat’ concepts from the one ‘elephant’ or ‘non-cat’ concept:
>>> m2.reduce(2)
Just to be clear, the integer 2 is the number of dimensions to wind up with.
That was simple enough; the hard part is to display the results. They are stored in a special pattern.vector.LSA object to which we do not have direct access:
>>> m2.lsa
What we do have access to is three containers, lsa.features, lsa.vectors, and lsa.concepts.
lsa.features is a list of features, just like m.features:
>>> m2.lsa.features
[u'trunk', u'gray', u'yellow', u'cat', u'tiger', u'lion', u'mane', u'animal',
u'elephant', u'stripes']

lsa.vectors is a dictionary of document keys and vector values, where each vector is in turn a dictionary of concept keys and weight values:
>>> pprint(m2.lsa.vectors)
{'Q3ByNfT1': {1: 0.707106781186548},
'Q3ByNfT2': {1: 0.707106781186547},
'Q3ByNfT3': {0: 1.0}}

From this dictionary, you can extract the document ids as the keys, line 1, and the vector key-value mappings from the values, lines 2-3:
>>> docs = m2.lsa.vectors.keys()
>>> concepts = [v.keys()[0] for v in m2.lsa.vectors.values()]
>>> weights = [v.values()[0] for v in m2.lsa.vectors.values()]
>>> triples = zip(docs, concepts, weights)
>>> for t in triples:
... print '{}\t{}\t{}'.format(t[0], t[1], round(t[2],2))
...
Q3ByNfT3 0 1.00
Q3ByNfT2 1 0.71
Q3ByNfT1 1 0.71

Line 4 zips them together into triples and the others display each triple on its own line for your edification.
lsa.concepts is a list of dictionaries of feature keys and weight values, in which each dictionary is a concept:
>>> pprint(m2.lsa.concepts)
[{u'animal': 0.5,
u'cat': 0.0,
u'elephant': 0.5,
u'gray': 0.5,
u'lion': 0.0,
u'mane': 0.0,
u'stripes': 0.0,
u'tiger': 0.0,
u'trunk': 0.5,
u'yellow': 0.0},
{u'animal': 0.0,
u'cat': 0.32718457421366,
u'elephant': 0.0,
u'gray': 0.0,
u'lion': 0.443255149093965,
u'mane': 0.443255149093965,
u'stripes': 0.443255149093965,
u'tiger': 0.443255149093965,
u'trunk': 0.0,
u'yellow': 0.32718457421366}]

As with lsa.vectors, lsa.concepts can be unpacked into triples to display the contrast between the two concepts more forcefully:
>>> features = m2.lsa.concepts[0].keys()
>>> concept1Wts = m2.lsa.concepts[0].values()
>>> concept2Wts = m2.lsa.concepts[1].values()
>>> triples = zip(features, concept1Wts, concept2Wts)
>>> for t in triples:
... print '{}\t{}\t{}'.format(t[0], round(t[1],2), round(t[2],2))
...
elephant 0.5 0.00
gray 0.5 0.00
yellow 0.0 0.33
cat 0.0 0.33
tiger 0.0 0.44
lion 0.0 0.44
mane 0.0 0.44
animal 0.5 0.00
trunk 0.5 0.00
stripes 0.0 0.44

12.3.3. Vector clustering¶
12.3.3.1. How to cluster vectors in pattern.vector¶
>>> from pattern.vector import KMEANS, COSINE, centroid, distance
>>> clusters = m2.cluster(method=KMEANS, k=2, iterations=10, distance=COSINE)
>>> pprint(clusters)
[[Document(id='Q3CksSm2', name='lion', type='test'),
Document(id='Q3CksSm1', name='tiger', type='test')],
[Document(id='Q3CksSm3', name='elephant', type='test')]]

12.3.4. Dividing lines¶
12.4. Classification¶
Up until now, our techniques have ‘learned’ about the statistical structure of a space defined by document vectors just from the distribution of the vectors in the space. This is often called unsupervised learning. Now we are going to take the information so learned and use it to apply a label to the document vectors. This is known as classification or categorization, though in the machine-learning literature, classification is the preferred term. Categorization is preferred in psychology.
To get started, we need a data set. Our current corpus of three sentences is too simple, so we will use the corpus that pattern.vector analyzes in its documentation. Download it from http://www.clips.ua.ac.be/media/reviews.csv.zip, decompress it and drag it into pyScripts. It is a csv file, for which pattern.db supplies a handy reader. Go ahead and poke around in it:
>>> from pattern.db import csv
>>> rawData = csv('reviews.csv')
>>> pprint(rawData[:10])

What is in the file?
Yes, it is a series of movie ratings, each of which consists of a ‘review’ as a single sentence and a rating as a Unicode string from 0 to 5.
The first step when confronted with a data set is to preprocess it to make it compatible with the analytical algorithm. In this case, the main thing that needs to be done is to convert the rating strings to integers, plus grouping a review and its rating into a tuple:
>>> data = [(review, int(rating)) for review, rating in rawData]
>>> pprint(data[:10])

With this prepping, the data can be converted to a list of pattern.vector documents in which a review is the text and its rating is the type:
>>> docs = [Document(review, type=rating, stopwords=True) for review, rating in data]
A classifier will be trained on this corpus to label a review with a rating. The question is, how many of the one thousand documents should be used for the training? You might answer that all of them should be used, but the drawback is that none will be left to evaluate the accuracy of the classifier. The convention in the literature on machine learning is to retain some percentage of the data as a test set and use the rest for the training set. In the upcoming examples, I use the first half for the training set and save the other half for the test set.
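The split itself is nothing more than list slicing. The sketch below uses a made-up list of one thousand (review, rating) pairs in place of the real data, and follows the later listings in training on the first 500 items and testing on the remainder.

```python
# Hypothetical stand-in for the one thousand (review, rating) pairs.
data = [('review {}'.format(i), i % 6) for i in range(1000)]

half = len(data) // 2
train = data[:half]  # first half: used to train the classifier
test = data[half:]   # second half: held back to evaluate it

print('{} training items, {} test items'.format(len(train), len(test)))
```

The essential point is that no item appears in both halves, so the classifier is always evaluated on documents that it has never seen.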
12.4.1. How to evaluate a classifier¶
Once the classifier has been trained, it can be evaluated in several ways.
Recall Will the best regex please stand up?, where you learned to evaluate a regular expression not only by the strings that it correctly applies to or excludes, but also by the strings that it incorrectly applies to or excludes. I drew up a table like the one below to help you appreciate the distinction:
Actual class  Predicted class: True  Predicted class: False
True  true positive (TP)  false negative (FN)
False  false positive (FP)  true negative (TN)
To apply this way of thinking, you have to imagine the binary case, with two labels for the data set, and the classifier tries to decide whether to apply one of them or not. In prose, if a classifier predicts the current label for a document, and that is actually the label for the document, then the prediction is called a true positive. Conversely, if the classifier predicts that the document does not bear the current label, and it in fact bears the other one, then the prediction is called a true negative. Those are the expected outcomes, but there are two more. If the classifier predicts the current label for a document, but it is labeled with the other one, then the prediction is called a false positive. Finally, if the classifier predicts that the document does not bear the current label, but it does, then the prediction is called a false negative.
These four outcomes can be quantified for a corpus, from which four metrics can be calculated:
Metric  Formula
Accuracy  \(\frac{TP + TN}{TP + TN + FP + FN}\)
Precision (P)  \(\frac{TP}{TP + FP}\)
Recall (R)  \(\frac{TP}{TP + FN}\)
F1-score  \(\frac{2 \times P \times R}{P + R}\)
See Math is Fun’s Accuracy and precision and Wikipedia’s Accuracy and precision.
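The four formulas translate directly into Python. In this sketch the function name metrics and the counts passed to it are hypothetical; note also that it makes no attempt to guard against division by zero, which would occur if, say, the label were never predicted.

```python
def metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, F1) from the four outcome counts."""
    accuracy = (tp + tn) / float(tp + tn + fp + fn)
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for one hundred documents.
accuracy, precision, recall, f1 = metrics(tp=40, tn=30, fp=10, fn=20)
print('accuracy = {:.2f}'.format(accuracy))
print('precision = {:.2f}'.format(precision))
print('recall = {:.2f}'.format(recall))
print('f1 = {:.2f}'.format(f1))
```

Notice that the true negatives enter into accuracy alone; precision, recall and F1 are unaffected by them.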
You are now ready to classify the reviews by rating, to see whether a novel text can be rated accurately.
12.4.2. K-nearest neighbor classification¶
Continuing with the theme of clustering, we start with k-nearest neighbor classification, which chooses as the label of a target the label of its nearest neighbors in a vector space. In textual terms, this means that the algorithm judges the similarity of the target document to nearby documents in the corpus space. The number of neighbors is conventionally referred to as “k”.
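To give you a feel for the mechanics, here is a toy version written from scratch. It is a sketch under my own assumptions and not pattern.vector's implementation: documents are plain dictionaries of word counts, nearness is judged by cosine similarity, the label is decided by majority vote among the k nearest neighbors, and the five training documents are invented.

```python
import math
from collections import Counter

def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse vectors stored as dictionaries."""
    dot = sum(v1[f] * v2[f] for f in set(v1) & set(v2))
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

def knn_classify(target, training, k=3):
    """Return the most common label among the k training vectors nearest to target."""
    ranked = sorted(training,
                    key=lambda pair: cosine_similarity(target, pair[0]),
                    reverse=True)
    labels = [label for vector, label in ranked[:k]]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical word-count vectors, each paired with a rating as its label.
training = [
    ({'great': 2, 'movie': 1}, 5),
    ({'great': 1, 'fun': 1}, 5),
    ({'boring': 2, 'movie': 1}, 1),
    ({'boring': 1, 'dull': 1}, 1),
    ({'fine': 1, 'movie': 1}, 3),
]

print(knn_classify({'great': 1, 'movie': 1}, training, k=3))
print(knn_classify({'boring': 1}, training, k=3))
```

The first target sits closest to the two ‘great’ documents and so inherits their rating of 5; the second sits closest to the two ‘boring’ ones and is rated 1.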
In pattern.vector, k-nearest neighbor classification is performed by KNN(), whose default settings are illustrated here:
>>> classifier = KNN(train=[], baseline=MAJORITY, k=10, distance=COSINE)
train is where the training list is inserted. Our example uses the first 500 documents in the model. baseline chooses the class to be predicted. The default is MAJORITY, the most frequent class, though it can be set to another class by the user. k is the number of neighbors to check (default is 10), and distance sets the way of measuring the distance to a neighbor, from COSINE (the default), EUCLIDEAN, MANHATTAN, or HAMMING. All of this can be put together into:
>>> from pattern.vector import KNN
>>> knn = KNN(train=docs[:500])

The resulting classifier knn has several properties to help you understand the underlying statistics of the data:
>>> print knn.classes
[0, 1, 2, 3, 4, 5]
>>> print knn.distribution
{0: 32, 1: 51, 2: 44, 3: 127, 4: 178, 5: 68}
>>> knn.baseline
4
>>> print knn.majority
4
>>> print knn.minority
None
>>> knn.skewness
0.7462122734264776

classes lists the labels used by the classifier, which were set by type in the documents. distribution records the number of documents found in each class. baseline picks out the predicted class, which is either the most frequent or the one set by the user. majority indicates the most frequent class. minority indicates the least frequent class, but appears to have been converted from “0” to None. Whooops! skewness is 0 if the classes are evenly distributed. They clearly are not, as seen in line 4.
There is also a method classify(), which prompts the classifier to predict a label for a pattern.vector document. What label do you predict that the classifier will produce? Now give it a try:
>>> print knn.classify(Document('A good movie!'))
Were you right?
There is also a trio of housekeeping methods. finalize() removes the training data from memory. save(path) saves the classifier to the hard drive at the place given by path; it can be read back into Python with load().
12.4.2.1. How to evaluate the classifier¶
As a first step towards evaluating knn, let us choose a reference value, say the label 4, and get pattern.vector to calculate the four cells of a confusion matrix:
>>> knnConfusion = knn.confusion_matrix(data[500:])
>>> print knnConfusion(4)
(109, 70, 165, 68) # (TP, TN, FP, FN)

I insert these results into Table 12.2 with label 4 as the reference value:
Actual class  Predicted class: 4  Predicted class: others
4  109 (TP)  68 (FN)
others  165 (FP)  70? (TN) (158)
To see the entire array of outcomes, use the table property:
>>> print knnConfusion.table
     0    1    2    3    4    5
0    1    3    3    4   12    3
1    1    3    1   10   28    3
2    0    2    1   12   31    3
3    2    4    0   50   48   13
4    2    7    3   29  109   27    = 68 excluding 109
5    2    4    1   17   46   15    (remainder = 158)

                     165 excluding 109

The intersection of line 7 (label 4) and column 4 holds the 109 true positives. The rest of line 7 (label 4) details the 68 false negatives, for which 4 was not predicted. Column 4 without label 4 lays out the 165 documents for which 4 was predicted incorrectly, the false positives. Finally, I would think that the true negatives should be everything else, that is, the total documents minus the ones associated with 4, which works out to 158:
>>> 500 - 109 - 68 - 165
However, knnConfusion(4) prints 70.
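Both figures can be recovered from the table itself. The sketch below copies the table into a list of lists, with rows as actual classes and columns as predicted classes, and derives the four counts for label 4. It also sums the diagonal cells of the other labels, which happens to come to 70; that this is how pattern.vector arrives at its figure is a guess on my part, not something its documentation confirms.

```python
# Rows are actual classes 0-5, columns are predicted classes 0-5.
table = [
    [1, 3, 3, 4, 12, 3],
    [1, 3, 1, 10, 28, 3],
    [0, 2, 1, 12, 31, 3],
    [2, 4, 0, 50, 48, 13],
    [2, 7, 3, 29, 109, 27],
    [2, 4, 1, 17, 46, 15],
]
label = 4

total = sum(sum(row) for row in table)
tp = table[label][label]                    # actual 4, predicted 4
fn = sum(table[label]) - tp                 # actual 4, predicted otherwise
fp = sum(row[label] for row in table) - tp  # predicted 4, actual otherwise
tn = total - tp - fn - fp                   # everything not involving 4

# Correct predictions for the other labels: the rest of the diagonal.
other_diagonal = sum(table[i][i] for i in range(len(table)) if i != label)

print('total = {}, TP = {}, FN = {}, FP = {}, TN = {}'.format(total, tp, fn, fp, tn))
print('other diagonal cells = {}'.format(other_diagonal))
```

So 158 counts every document that neither bears label 4 nor was predicted to, whereas 70 counts only those among them that were assigned their own correct label.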
To ascertain the accuracy of your classifier for label 4, the second half of the data can be run through it by way of the test() method:
>>> accuracy, precision, recall, f1 = knn.test(docs[500:], target=4)
>>> print '''
... accuracy = {},
... precision = {},
... recall = {},
... f1 = {}'''.format(round(accuracy, 2),
... round(precision, 2),
... round(recall, 2),
... round(f1, 2))
accuracy = 0.43
precision = 0.40
recall = 0.62
f1 = 0.48

These metrics are calculated by using a value of 70 for TN, which can be demonstrated by calculating them by hand from the equations in Table 12.3:
>>> TP = 109
>>> TN = 70 or 158 # evaluates to 70; 158 is the count obtained by subtraction
>>> FP = 165
>>> FN = 68
>>> (TP + 70) / float(TP + TN + FP + FN) # accuracy, TN = 70
0.4344660194174757
>>> (TP + 158) / float(TP + TN + FP + FN) # accuracy, TN = 158 in the numerator only
0.6480582524271845
>>> P = TP / float(TP + FP) # precision
>>> P
0.3978102189781022
>>> R = TP / float(TP + FN) # recall
>>> R
0.615819209039548
>>> 2 * P * R / float(P + R) # f1
0.4833702882483371

So the accuracy statistic for a single label is suspect, but average accuracy is more ‘accurate’, as I endeavor to show in the next paragraph.
Up to now, the evaluation of the classifier is based on the arbitrarily chosen value of ten for k. You could train the classifier on different values to see how they fare, but I am going to take a more systematic approach and plot all four metrics for k ranging from one to fifty:
The average accuracy for k = 10 is 0.66, which is very close to the 0.65 calculated in line 8 of Listing 12.2. So let us assume that it is reliable enough to use. How would you describe the effect of increasing k on accuracy?
Todo
describe other plots
K-fold cross-validation:
>>> from pattern.vector import kfoldcv
>>> accuracy, precision, recall, f1, stdev = kfoldcv(KNN, docs, folds=10, k=10)
>>> print '''
... accuracy = {}
... precision = {}
... recall = {}
... f1 = {}
... stdev = {}'''.format(round(accuracy, 2),
... round(precision, 2),
... round(recall, 2),
... round(f1, 2),
... round(stdev, 2))
accuracy = 0.64
precision = 0.24
recall = 0.22
f1 = 0.23
stdev = 0.04
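
The idea behind the folds can be sketched in a few lines. The generator k_folds below is my own illustration: it deals the items out like cards, so that every item serves in exactly one test set, and whether kfoldcv() partitions the documents in this particular way is an assumption that I have not verified.

```python
def k_folds(items, folds=10):
    """Yield (train, test) pairs; each item lands in exactly one test set."""
    for i in range(folds):
        test = items[i::folds]  # every folds-th item, starting at position i
        train = [x for j, x in enumerate(items) if j % folds != i]
        yield train, test

# Hypothetical data: twenty items split into four folds.
items = list(range(20))
for train, test in k_folds(items, folds=4):
    print('{} training items, test items: {}'.format(len(train), test))
```

A classifier would be trained and tested once per fold, and the resulting metrics averaged, which is why a standard deviation can be reported alongside them.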

12.4.2.2. Feature selection¶
12.4.3. How to use the genres of the Brown corpus¶
The Brown corpus contains 1,161,192 words scattered among 15 genres. If you haven’t already downloaded the entire NLTK corpus, you can get just the Brown section:
>>> import nltk
>>> nltk.download_shell()
Downloader> d brown
Downloader> q

In NLTK the various genres are called categories:
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government',
'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion',
'reviews', 'romance', 'science_fiction']

Each genre consists of a series of texts that the compilers of the corpus classified as such. For instance, to display the “news” genre or category, tokenize the corpus while setting the categories argument to “news”:
>>> brown.words(categories='news')[:50]
[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday',
u'an', u'investigation', u'of', u"Atlanta's", u'recent', u'primary',
u'election', u'produced', u'``', u'no', u'evidence', u"''", u'that',
u'any', u'irregularities', u'took', u'place', u'.', u'The', u'jury',
u'further', u'said', u'in', u'term-end', u'presentments', u'that',
u'the', u'City', u'Executive', u'Committee', u',', u'which', u'had',
u'overall', u'charge', u'of', u'the', u'election', u',', u'``',
u'deserves', u'the', u'praise']

This looks like a newspaper article.
The categories are scattered among many documents, which can be retrieved through fileids():
>>> len(brown.fileids())
>>> brown.fileids()[:5]
>>> brown.words(fileids='ca01')[:50]

A first question is to ask how the file names relate to the genres, and while we are at it, how many documents are there for each genre? The CONTENTS file (/Users/harryhow/nltk_data/corpora/brown/CONTENTS) says:
This directory contains the Brown Corpus, in Form C (tagged). Each
filename consists of a "c" (corpus form C), followed by a letter a-r
designating the genre, followed by two digits.
A. PRESS: REPORTAGE
B. PRESS: EDITORIAL
C. PRESS: REVIEWS
D. RELIGION
E. SKILL AND HOBBIES
F. POPULAR LORE
G. BELLES-LETTRES
H. MISCELLANEOUS: GOVERNMENT & HOUSE ORGANS
J. LEARNED
K. FICTION: GENERAL
L. FICTION: MYSTERY
M. FICTION: SCIENCE
N. FICTION: ADVENTURE
P. FICTION: ROMANCE
R. HUMOR

So a file name like cp01 is composed of c for corpus, p for romantic fiction, and 01 because it is the first file of its genre. And “I”, “O” and “Q” have vanished into the dustbin of history.
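The naming scheme is regular enough to take apart with a regular expression. The following is a sketch under my own assumptions: the function name `parse_fileid` is hypothetical, and the `GENRES` dictionary holds only a few entries copied from the CONTENTS listing, just enough to run the example.

```python
import re

# A few letter-to-genre entries copied from the CONTENTS listing (abbreviated).
GENRES = {'a': 'PRESS: REPORTAGE', 'j': 'LEARNED',
          'p': 'FICTION: ROMANCE', 'r': 'HUMOR'}

def parse_fileid(fileid):
    """Split a Brown file name such as 'cp01' into (genre letter, file number)."""
    match = re.match(r'^c([a-r])(\d{2})$', fileid)
    if match is None:
        raise ValueError('not a Brown file name: {!r}'.format(fileid))
    return match.group(1), int(match.group(2))

letter, number = parse_fileid('cp01')
print('cp01 -> genre {}, file number {}'.format(GENRES[letter], number))
```

The pattern accepts any letter from a to r, so it does not know that “i”, “o” and “q” are unused.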
How would you count the number of files per genre? In Python, not by hand. I hope you replied that you would use a frequency distribution:
>>> from re import search, findall
>>> from pprint import pprint
>>> from nltk.probability import FreqDist
>>> # the genre letter is the character that follows the initial 'c'
>>> prefixes = [search('^c([a-z])', fileid).group(1) for fileid in brown.fileids()]
>>> filePairs = sorted(FreqDist(prefixes).items())
>>> pprint(filePairs)
[(u'a', 44),
(u'b', 27),
(u'c', 17),
(u'd', 17),
(u'e', 36),
(u'f', 48),
(u'g', 75),
(u'h', 30),
(u'j', 80),
(u'k', 29),
(u'l', 24),
(u'm', 6),
(u'n', 29),
(u'p', 29),
(u'r', 9)]
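If NLTK is not at hand, the same kind of count can be made with the standard library. This is a sketch that swaps in collections.Counter for FreqDist; the short list of file names is invented for illustration, whereas brown.fileids() would supply the real one.

```python
from collections import Counter

# A handful of file names for illustration; brown.fileids() returns the full list.
fileids = ['ca01', 'ca02', 'cb01', 'cj01', 'cj02', 'cj03', 'cr01']

# The genre letter is always the second character of the file name.
counts = Counter(fileid[1] for fileid in fileids)
for letter, count in sorted(counts.items()):
    print('{} {}'.format(letter, count))
```

Because the letter always sits in the same position, plain indexing is enough here and no regular expression is needed.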

I still have to check the previous list to understand what the letters stand for. I would prefer to bring the two together, ‘cause I’m lazy. The following chunk of code does so, though it is overkill. It reads in the CONTENTS file, cuts out the list, uses re.findall() to extract each line of the list and zips it together with the counts from the frequency distribution. Don’t feel like you need to dwell too much on it:
>>> folder = '/Users/harryhow/nltk_data/corpora/brown/'
>>> with open(folder+'CONTENTS','r') as tempFile:
...     contentString = tempFile.read()
>>> # cut out the genre list that sits between the first 's.' and 'LIST'
>>> contentString = contentString[contentString.find('s.'):
...                               contentString.find('LIST')]
>>> # each list entry starts with a capital letter and runs to the end of its line
>>> contents = findall('([A-Z].*)\n', contentString)
>>> fileCountType = zip(contents, (count for prefix, count in filePairs))
>>> pprint(fileCountType)
[('A. PRESS: REPORTAGE', 44),
('B. PRESS: EDITORIAL', 27),
('C. PRESS: REVIEWS', 17),
('D. RELIGION', 17),
('E. SKILL AND HOBBIES', 36),
('F. POPULAR LORE', 48),
('G. BELLES-LETTRES', 75),
('H. MISCELLANEOUS: GOVERNMENT & HOUSE ORGANS', 30),
('J. LEARNED', 80),
('K: FICTION: GENERAL', 29),
('L: FICTION: MYSTERY', 24),
('M: FICTION: SCIENCE', 6),
('N: FICTION: ADVENTURE', 29),
('P. FICTION: ROMANCE', 29),
('R. HUMOR', 9)]

This is much more convenient.
12.4.3.1. How to measure similarity within the Brown genres
The Brown corpus affords us a handy labeled dataset for learning more about classification. However, a first step is to ask whether the genres are internally consistent enough to make classification possible. Thus the goal of this section is to analyze the within-genre similarity of the entire corpus. To this end, we start with two dictionaries:
>>> prefix2type = {'a':'news',
... 'b':'editorial',
... 'c':'reviews',
... 'd':'religion',
... 'e':'hobbies',
... 'f':'lore',
... 'g':'belles_lettres',
... 'h':'government',
... 'j':'learned',
... 'k':'fiction',
... 'l':'mystery',
... 'm':'science_fiction',
... 'n':'adventure',
... 'p':'romance',
... 'r':'humor'}
>>> prefix2count = FreqDist(prefixes)

prefix2type is the only construct in the entire task that is built by hand.
The next step is to read the text from each file into a pattern.vector document:
>>> from pattern.vector import Document, Model, TFIDF
>>> corpus = []
>>> for prefix in sorted(prefix2type.keys()):
...     # file numbers run from 01 up to and including the genre's count
...     for docNum in range(1, prefix2count[prefix] + 1):
...         brownDoc = 'c{}{}'.format(prefix, str(docNum).zfill(2))
...         contentString = ' '.join(brown.words(fileids=brownDoc))
...         corpus.append(Document(contentString, name=brownDoc, type=prefix2type[prefix]))
>>> brownMod = Model(corpus, weight=TFIDF)
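In case the TFIDF weighting in the last line looks like magic, here is a rough sketch of the idea in plain Python. It uses the textbook formulation, term frequency times log(N / document frequency), which I am assuming only for illustration; pattern.vector may differ in its details. The function name `tfidf` and the three toy texts are invented.

```python
from collections import Counter
from math import log

def tfidf(documents):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    counts = [Counter(doc.split()) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for count in counts for word in count)
    weighted = []
    for count in counts:
        total = float(sum(count.values()))
        weighted.append({word: (freq / total) * log(n_docs / float(df[word]))
                         for word, freq in count.items()})
    return weighted

# Toy texts, invented for illustration only.
texts = ['the jury said the election was fair',
         'the spaceship left the planet',
         'she said she loved the rain']
weights = tfidf(texts)
print(sorted(weights[0].items(), key=lambda pair: -pair[1])[:3])
```

A word such as “the”, which occurs in every text, ends up with a weight of zero, while a word confined to one text is weighted most heavily.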

We next calculate similarity within a genre by calculating the similarity of each genre document to its peers:
>>> typeNames = sorted(prefix2type.values())
>>> data = {name:[] for name in typeNames}
>>> for doc1 in corpus:
...     for doc2 in corpus:
...         # note that doc1 is also paired with itself here
...         if doc1.type == doc2.type:
...             sim = brownMod.similarity(doc1, doc2)
...             sim = round(sim, 2)
...             data[doc1.type].append(sim)
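The loop above leans on the model to do the arithmetic. As a rough sketch of what a within-genre comparison involves, the following plain-Python version uses cosine similarity over raw word counts. That is a stand-in I am assuming for illustration (no TFIDF weighting), the four toy documents are invented, and unlike the loop above it visits each pair only once and never compares a document with itself.

```python
from collections import Counter
from math import sqrt

def cosine(vec1, vec2):
    """Cosine similarity of two sparse word-count vectors (dicts)."""
    dot = sum(vec1[word] * vec2[word] for word in vec1 if word in vec2)
    norm1 = sqrt(sum(v * v for v in vec1.values()))
    norm2 = sqrt(sum(v * v for v in vec2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Toy documents, invented for illustration: (genre, text).
docs = [('news', 'the jury said the election was fair'),
        ('news', 'the committee said the election was over'),
        ('romance', 'she said she loved the rain'),
        ('romance', 'she loved him and he loved the rain')]
vectors = [(genre, Counter(text.split())) for genre, text in docs]

within = {}
for i, (genre1, vec1) in enumerate(vectors):
    for j, (genre2, vec2) in enumerate(vectors):
        # Each pair once, and never a document with itself.
        if i < j and genre1 == genre2:
            within.setdefault(genre1, []).append(round(cosine(vec1, vec2), 2))
print(within)
```

Each genre ends up with a list of pairwise similarities, which is the shape of data that a boxplot needs.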

Finally, the similarity measures are compared as boxplots:
>>> import matplotlib.pyplot as plt
>>> names = sorted(data.keys())
>>> plt.figure()
>>> # draw the boxes in the same alphabetical order as the tick labels
>>> plt.boxplot([data[name] for name in names])
>>> plt.xticks(range(1, len(names) + 1), names, rotation=45)
>>> plt.ylim(0, 0.6)
>>> plt.tight_layout()
>>> plt.show()

The plot drawn is this one: