14. Statistics and probability for text

14.1. Exploratory text analysis with NLTK text

NLTK has a sub-package called text, as shown in the diagram below:

_images/NLTK_text.png

The methods of the Text class provide a shortcut to text analysis. To make them available, you convert a text document to a PlaintextCorpusReader object, tokenize it into words, and then transform the result into NLTK text. These three steps can be coded in three, two, or even one line of code, if you don’t mind a long one:

>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> from nltk.text import Text
>>> path = '/Users/{your_user_name}/nltk_data/pytextos'
>>> name = 'Gitanilla.txt'
# long version
>>> texlector = PlaintextCorpusReader(path, name, encoding='utf8')
>>> temp = texlector.words()
>>> texto = Text(temp)
# medium version
>>> temp = PlaintextCorpusReader(path, name, encoding='utf8').words()
>>> texto = Text(temp)
# short version
>>> texto = Text(PlaintextCorpusReader(path, name, encoding='utf8').words())

14.1.1. Methods for searching text that you are already familiar with

count() returns the number of times that a word appears in the text; index() returns the index of the first occurrence of a word in the text:

>>> texto.count('gitana')
36
>>> texto.index('gitana')
75

There is a findall() method for searching for instances of a regular expression in the text, but it is so limited that it is better to use the re module. There was also a search() method for regular expressions, but it has been removed. vocab() returns the frequency distribution of the text, but there is an entire class for doing this called FreqDist that we shall review in the next section. readability() does not work, at least not on Spanish text.
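
As a quick illustration of falling back on re, here is a minimal sketch that pulls the distinct adverbs in -mente out of the raw text; it reuses the path and name variables assumed above:

>>> import re
>>> crudo = PlaintextCorpusReader(path, name, encoding='utf8').raw()
>>> # every distinct word ending in -mente, in alphabetical order
>>> sorted(set(re.findall(r'\w+mente', crudo, re.UNICODE)))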

14.1.2. A reminder about non-ASCII characters

If you want to use a word with a non-ASCII character in it as the argument of these or other methods, the safest thing to do is to convert it to Unicode:

>>> texto.count('Andrés')
0
>>> texto.count('Andr\xc3\xa9s')
0
>>> texto.count(u'Andr\xe9s')
68
>>> texto.count('Andrés'.decode('utf8'))
68

The final line, which decodes “Andrés” from UTF-8 to Unicode, only works because my Python console uses UTF-8. If yours doesn’t, you must indicate the encoding that it uses.
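
For instance, if your console encoded text as Latin-1, the hypothetical equivalent would be:

>>> texto.count('Andrés'.decode('latin1'))  # only if your console uses Latin-1
68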

14.1.3. New methods for searching text

14.1.3.1. Collocations()

A collocation is a group of words that occur together frequently in a text. Text’s collocations() method finds collocations of two words:

>>> texto.collocations()
Building collocations list
gitana vieja; los gitanos; Andrés Caballero; las gitanas; doña Clara; respondió Preciosa; doña Guiomar; sin duda; Por vida; las manos; todos los; !-- dijo; Apenas hubo; gallarda disposición; ?-- dijo; todas las; vuesa merced; los ojos; Santa Ana; había hecho

14.1.3.2. Common_contexts()

The inverse procedure is to start with two words, say gitano and gitana, and find the contexts that they share:

>>> texto.common_contexts(['gitano', 'gitana'])
de_,

This might not be many, depending on how long the text is.

14.1.3.3. Concordance()

It is often helpful to know the context of a word. The concordance view shows a certain number of characters before and after every occurrence of a given word:

>>> texto.concordance('gallarda')
Building index...
Displaying 3 of 3 matches:
. Preciosa , algo aficionada de la gallarda disposición de Andrés , ya deseaba
con amor , le miraban : tal era la gallarda disposición de Andrés , que hasta
ía lugar donde no se hablase de la gallarda disposición del gitano Andrés Caba

14.1.3.4. Similar()

A text can also be searched for words that have a distribution similar to that of a given word:

>>> texto.similar('gitana')
vieja carducha gitanilla buenaventura compañía cruz dicen hermosa mano mía noche reja verdad vuestra , abrazaban acrecentarla advertidas alquiler andrés

Since gitana is a feminine noun, similar() finds predominantly feminine nouns.

14.1.3.5. Generate()

Finally, Text includes a method for creating a random assortment of words from the text that obey its statistical patterns:

>>> texto.generate()
Building ngram index...
LA GITANILLA Parece que los trujo . En tanto que yo volveré y le diré más venturas y aventuras que las leyes con que quedaron más alegres y más , que me quería hacer , de los tinientes de la fiesta , desde luego le desnudaron un brazo , y puesto delante de la verdad que estaniña me ha renovado mi desventura !-- dijo a esta sazón la gitana , y viésedes que os habéis de considerar que en esto , estaba temerosa de alguna pequeña criatura . -- Calla , hija ?-- dijo a su partida , por

This is just for fun, but it does give you an indication of the style of the author or the genre, and perhaps of the content.

14.1.4. Text methods for plotting

We have had problems plotting on the Macintosh, which I hope will be resolved soon, so Mac users may want to hold off on actually creating these plots.

14.1.4.1. Dispersion_plot()

Text includes a quite unexpected way of understanding the distribution of a word in the text. dispersion_plot() draws a graph that marks every instance of a word at its offset from the beginning of the text:

>>> texto.dispersion_plot(['gitana', 'gitano'])
_images/LexDispersion.png

The offset measures how far an instance of the word is from the beginning of the text, counted in words.

Since I know that two of the main characters are named Preciosa and Andrés, and both are gypsies, I suspect that gitana may be used in the same passages as Preciosa and gitano in the same passages as Andrés. I can test this hypothesis by adding the names to the previous plot – but Andrés contains a non-ASCII character. A pythonic way of dealing with it is to create a list of all four words in normal Spanish spelling, and then use a list comprehension to convert them to Unicode within the argument of dispersion_plot():

>>> palabras = ['gitana', 'Preciosa', 'gitano', 'Andrés']
>>> texto.dispersion_plot([p.decode('utf8') for p in palabras])
_images/LexDispersion2.png

In the resulting graph, you can see that Preciosa is mentioned many more times than Andrés, but there is not a great overlap between their names and gitana or gitano.

Text also has a method plot() for viewing a frequency distribution, but we will take that up in the next section.

14.2. Frequency distributions from FreqDist

The concept that is crucial to quantitative text analysis is the frequency of an item, which is to say, how often it occurs in a text. Natural Language Processing with Python uses a tally like the following to illustrate this idea:

el 19
sido 11
mensaje 4
perseverar 1
nación 8

A tally is simply a way of keeping track of the count of a group of items. You will design a program for doing this in Python yourself before we look at how NLTK does it, but we cannot even get started before hitting a problem. The picture indicates that we need a data structure to store two different pieces of information, a list of words and a list of the corresponding numbers. We could try doing this with two lists, but it would get very tricky to maintain the correspondence. Fortunately, Python supplies a data structure for just this situation, called a dictionary. Let us say a few words about how dictionaries work, and then use one to perform a tally.

14.2.1. How to keep track of disparate types with a dictionary

A Python dictionary is a sequence within curly brackets of pairs of a key and a value joined by a colon, i.e. {key1:value1, key2:value2, …}. The next block creates a dictionary for a Spanish version of the tally in the figure (with nación de-accented to nacion to sidestep the encoding issue), which is a sequence of string and number pairs:

 1  >>> tally = {'el':19, 'sido':11, 'mensaje':4, 'perseverar':1, 'nacion':8}
 2  >>> tally['el']
 3  19
 4  >>> tally['la']
 5  Traceback (most recent call last):
 6    File "<stdin>", line 1, in <module>
 7  KeyError: 'la'

A key can be queried for its value using square-bracket notation to treat the key like an index into the dictionary (line 2). If the dictionary is queried for a key that is not in it, Python raises a KeyError (line 4 onward).

There are a couple of limitations on keys. The first is that keys must be immutable, so strings (as well as numbers and tuples) are allowed, but lists are not. The second is that there cannot be duplicate keys in a dictionary.
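
A quick sketch of both limitations in action:

>>> tally[['el', 'la']] = 20          # a list cannot be a key
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> {'el': 19, 'el': 20}              # a duplicate key is silently overwritten
{'el': 20}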

There are several methods for dictionaries that it is convenient to know:

 1  >>> len(tally)
 2  5
 3  >>> str(tally)
 4  "{'el': 19, 'perseverar': 1, 'mensaje': 4, 'nacion': 8, 'sido': 11}"
 5  >>> type(tally)
 6  <type 'dict'>
 7  >>> tally.has_key('el')
 8  True
 9  >>> 'el' in tally
10  True
11  >>> tally.items()
12  [('el', 19), ('perseverar', 1), ('mensaje', 4), ('nacion', 8), ('sido', 11)]
13  >>> tally.keys()
14  ['el', 'perseverar', 'mensaje', 'nacion', 'sido']
15  >>> tally.values()
16  [19, 1, 4, 8, 11]

The length of a dictionary is the number of pairs in it (line 1). str() returns a printable representation of a dictionary (line 3). A dictionary is of type dict (line 5). has_key() returns True if the dictionary has the key, otherwise it returns False (line 7). The expression key in dictionary in line 9 is faster, though, and has_key() has been removed in Python 3, so the in operator is to be preferred. Key-value pairs are called items; items() returns a list of key-value pairs (line 11). keys() returns a list of keys (line 13). values() returns a list of values (line 15).

Before moving on, let me point out that there are a couple of relationships between the keys and values that can be used to double-check results. Since a dictionary’s keys contain no duplicates, the keys of a tally are the types of the text, and so can be approximated by applying set() to the text. Likewise, the values should sum to the number of words in the text. Thus there are two equalities that should hold:

len(dictionary) == len(set(text))
sum(dictionary.values()) == len(text)

You will try these out on a real text in just a moment.

14.2.2. How to keep a tally with a dictionary

Let us now write some code for keeping a tally of the words in the first sentence of La gitanilla:

>>> muestra = texto[2:68]
>>> for p in muestra: print p.encode('utf8'),
...
Parece que los gitanos y gitanas solamente nacieron en el mundo para ser ladrones : nacen de padres ladrones , críanse con ladrones , estudian para ladrones y , finalmente , salen con ser ladrones corrientes y molientes a todo ruedo , y la gana del hurtar y el hurtar son en ellos como acidentes inseparables , que no se quitan sino con la muerte .

The procedure for doing so is something like this: create an empty dictionary; then examine every word in the text in such a way that if the word is already in the dictionary, 1 is added to its value; otherwise, the word is inserted into the dictionary with the value 1. Python follows English so closely that you can practically code this up word for word:

>>> gitdict = {}
>>> for word in muestra:
...     if gitdict.has_key(word): gitdict[word] = gitdict[word]+1
...     else: gitdict[word] = 1
...

As a quick check to see whether this worked, find the length of the text and that of the dictionary:

>>> len(muestra)
66
>>> len(gitdict)
44

Good, there are fewer words in the dictionary than in the text, so it may have worked. Now let us check the equalities:

>>> len(gitdict) == len(set(muestra))
True
>>> len(muestra) == sum(gitdict.values())
True

They are both true, so the code probably worked. Go ahead and view the entire dictionary, plus the keys and values:

>>> str(gitdict)
"{u'el': 2, u'quitan': 1, u'en': 2, u'ser': 2, u'gitanas': 1, u'muerte': 1, u'gitanos': 1, u'mundo': 1, u'son': 1, u'Parece': 1, u'gana': 1, u'nacieron': 1, u'todo': 1, u'nacen': 1, u'molientes': 1, u',': 6, u'.': 1, u'los': 1, u'hurtar': 2, u'solamente': 1, u'ellos': 1, u'no': 1, u':': 1, u'corrientes': 1, u'estudian': 1, u'acidentes': 1, u'para': 2, u'de': 1, u'sino': 1, u'que': 2, u'padres': 1, u'como': 1, u'ladrones': 5, u'a': 1, u'salen': 1, u'cr\\xedanse': 1, u'finalmente': 1, u'inseparables': 1, u'ruedo': 1, u'la': 2, u'del': 1, u'y': 5, u'se': 1, u'con': 3}"
>>> gitdict.keys()
[u'el', u'quitan', u'en', u'ser', u'gitanas', u'muerte', u'gitanos', u'mundo', u'son', u'Parece', u'gana', u'nacieron', u'todo', u'nacen', u'molientes', u',', u'.', u'los', u'hurtar', u'solamente', u'ellos', u'no', u':', u'corrientes', u'estudian', u'acidentes', u'para', u'de', u'sino', u'que', u'padres', u'como', u'ladrones', u'a', u'salen', u'cr\xedanse', u'finalmente', u'inseparables', u'ruedo', u'la', u'del', u'y', u'se', u'con']
>>> gitdict.values()
[2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 5, 1, 1, 1, 1, 1, 1, 2, 1, 5, 1, 3]

Do the counts appear accurate to you? Note that it would have been more accurate to take the lowercase form of the words, but I eschewed doing so for the sake of perspicuity.
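
For the record, here is a minimal sketch of the case-folded variant, reusing muestra from above; gitdict2 is a hypothetical name:

>>> gitdict2 = {}
>>> for word in muestra:
...     w = word.lower()               # fold case so 'Parece' and 'parece' count as one word
...     if w in gitdict2: gitdict2[w] = gitdict2[w] + 1
...     else: gitdict2[w] = 1
...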

14.2.3. Look before you leap (LBYL) vs. easier to ask forgiveness than permission (EAFP)

The usage of the if-else statement in the creation of gitdict above is an example of checking first and then doing something, a programming style referred to as “look before you leap” (LBYL). Among Python programmers it is considered just as accurate but faster to go ahead and do something and then deal with any error, a programming style referred to as “easier to ask forgiveness than permission” (EAFP). The EAFP alternative to if-else is try-except. The block below recasts the if-else statement as try-except:

>>> gitdict = {}
>>> for word in muestra:
...     try: gitdict[word] += 1
...     except KeyError: gitdict[word] = 1
...

The try statement assumes that word is a key in gitdict and increments its value by one (note the += idiom for incrementing by one). If the increment raises a KeyError because word is not yet a key in the dictionary and so has no value, the except statement catches the error – prevents it from halting execution – and inserts word into the dictionary with an initial value of one.
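
A third idiom worth knowing sidesteps the choice between LBYL and EAFP entirely: dict.get() returns a default value when the key is missing, so the sketch below needs neither a membership test nor an exception handler:

>>> gitdict = {}
>>> for word in muestra:
...     gitdict[word] = gitdict.get(word, 0) + 1   # 0 when word is not yet a key
...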

So you now have a dictionary of word frequencies. There are a lot of things that could be done with it, but we would have to code them up by hand. Let us let NLTK do some of the heavy lifting for us.

14.2.4. How to keep a tally with FreqDist()

Keeping a tally is such a basic task that NLTK has a class called FreqDist that supplies an interface to a counting procedure and several other processes that are based on it. FreqDist is part of the NLTK sub-package called probability, as shown in the diagram below:

_images/NLTK_probability.png

FreqDist does all of the work of creating a dictionary of word frequencies for us. As a first step towards appreciating how much it does, let us see how it simplifies the dictionary procedure from above. FreqDist’s method inc() does all of the incrementation of the dictionary for us, so all that has to be done is to put it in a loop:

>>> import nltk
>>> from nltk.probability import FreqDist
>>> fdist = FreqDist()
>>> for word in muestra:
...     fdist.inc(word)
...
>>> len(fdist)
44
>>> str(fdist)
"<FreqDist: u',': 6, u'ladrones': 5, u'y': 5, u'con': 3, u'el': 2, u'en': 2, u'hurtar': 2, u'la': 2, u'para': 2, u'que': 2, ...>"
>>> fdist.keys()
[u',', u'ladrones', u'y', u'con', u'el', u'en', u'hurtar', u'la', u'para', u'que', u'ser', u'.', u':', u'Parece', u'a', u'acidentes', u'como', u'corrientes', u'cr\xedanse', u'de', u'del', u'ellos', u'estudian', u'finalmente', u'gana', u'gitanas', u'gitanos', u'inseparables', u'los', u'molientes', u'muerte', u'mundo', u'nacen', u'nacieron', u'no', u'padres', u'quitan', u'ruedo', u'salen', u'se', u'sino', u'solamente', u'son', u'todo']
>>> fdist.values()
[6, 5, 5, 3, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

But even that is too much – FreqDist() can be applied to the text directly:

>>> fdist2 = FreqDist(muestra)
>>> len(fdist2)
44
>>> str(fdist2)
"<FreqDist: u',': 6, u'ladrones': 5, u'y': 5, u'con': 3, u'el': 2, u'en': 2, u'hurtar': 2, u'la': 2, u'para': 2, u'que': 2, ...>"

Nice and simple, the way we like it. By the way, how does FreqDist() order the items?
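
The output above suggests the answer; a one-line sketch confirms that the values come out in decreasing order of frequency:

>>> fdist2.values() == sorted(fdist2.values(), reverse=True)
True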

14.2.5. The rest of FreqDist()

In this section, we explore the rest of the methods revealed by FreqDist. If you haven’t already done so, convert La gitanilla to NLTK text as explained above and apply FreqDist() to it:

>>> df = FreqDist(texto)
>>> df
<FreqDist with 3056 samples and 14879 outcomes>
>>> len(texto)
14879
>>> sum(df.values())
14879
>>> len(set(texto))
3056
>>> len(df)
3056

The description of the object returned by FreqDist() is different from that of a dictionary, even though it is a dictionary. The dictionary methods work on it, and the expected equalities hold: the sum of the values is equal to the length of the text, and the length of the dictionary is equal to the length of the set representation of the text. Thus we can conclude that the “samples” of a frequency distribution correspond to the keys of a dictionary, and that its number of “outcomes” is the sum of the dictionary’s values.

14.2.5.1. A new terminology from statistics

The reason for the change in nomenclature is that the developers of NLTK have adopted the terminology of statistics in talking about how FreqDist() works. In statistics, a frequency distribution is the result of an experiment whose outcomes are categorized according to their class or sample. In text processing, the experiment is “what is the next word in the text?”, and each outcome is added to one sample or another.

14.2.5.2. The same old methods

A FreqDist() object inherits all of the methods of its underlying dictionary:

>>> df['gitana']
36
>>> len(df)
3056
>>> str(df)
"<FreqDist: u',': 1199, u'de': 646, u'que': 625, u'y': 622, u'la': 441, u'.': 311, u'a': 311, u'en': 273, u'el': 239, u';': 206, ...>"
>>> type(df)
<class 'nltk.probability.FreqDist'>
>>> df.has_key('gitana')
True
>>> df.items()[:10]
[(u',', 1199), (u'de', 646), (u'que', 625), (u'y', 622), (u'la', 441), (u'.', 311), (u'a', 311), (u'en', 273), (u'el', 239), (u';', 206)]
>>> df.keys()[:10]
[u',', u'de', u'que', u'y', u'la', u'.', u'a', u'en', u'el', u';']
>>> df.values()[:10]
[1199, 646, 625, 622, 441, 311, 311, 273, 239, 206]

14.2.5.3. Old wine in new bottles: samples(), N(), B()

Given the change in terminology, FreqDist() adds methods in statistics-speak which reproduce those of dictionary-speak:

>>> df.samples()[:10]
[u',', u'de', u'que', u'y', u'la', u'.', u'a', u'en', u'el', u';']
>>> df.B()
3056
>>> df.N()
14879

samples() is the same as keys(). B() is the number of bins or samples. N() is the number of outcomes or the sum of values, i.e. the number of words in the text.
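
A quick sketch double-checks these equivalences against the dictionary methods:

>>> df.samples() == df.keys()
True
>>> df.B() == len(df)
True
>>> df.N() == sum(df.values())
True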

More interesting are the new methods.

14.2.5.4. max(), Nr(), freq() and hapaxes()

max() returns the most frequent word or sample:

>>> df.max()
u','

Nr() returns the number of samples that have a given number of outcomes:

>>> df.Nr(1199)
1
>>> df[',']
1199

freq() returns the frequency of a sample, which is calculated as the outcomes of the sample divided by the total outcomes:

>>> df.freq(',')
0.08058337253847705
>>> df[',']/float(df.N())
0.08058337253847705

A hapax is a word that only appears once in a text:

>>> df.hapaxes()[:40]
[u'!",', u'#--', u');', u'--#', u'--(', u'---:', u'.]', u'Abraz\xe1ronse', u'Abri\xf3', u'Abri\xf3le', u'Acabado', u'Acabaron', u'Acab\xe1ronse', u'Acudi\xf3', u'Admirado', u'Admirados', u'Advierte', u'Ad\xf3nde', u'Agosto', u'All\xed', u'Andad', u'Anim\xf3las', u'Aqu\xed', u'Argos', u'Arzobispo', u'Ascensi\xf3n', u'Asom\xf3se', u'At\xf3nito', u'A\xfan', u'Belmonte', u'Bien', u'Bonita', u'B\xe1rbara', u'Caco', u'Call\xf3', u'Caro', u'Castilla', u'Cogi\xf3', u'Coheche', u'Como']
>>> len(df.hapaxes())
1974
>>> df.Nr(1)
1974

14.2.5.5. How to create a table of results with tabulate()

The tabulate() method prints a table to the console with one column per sample: the sample in the top row and its number of outcomes below it. Only about sixteen columns can be read across a console window. Supplying an integer argument stops the columns at that number. Supplying two integers starts the columns at the first and stops them at the second, minus one:

>>> df.tabulate(10)
   ,   de  que    y   la    .    a   en   el    ;
1199  646  625  622  441  311  311  273  239  206
>>> df.tabulate(10, 20)
  no   --  los  con   se  las   su  por   le Preciosa
 168  164  161  150  142  140  117  116  114  113
>>> df.tabulate(20)
   ,   de  que    y   la    .    a   en   el    ;   no   --  los  con   se  las   su  por   le Preciosa
1199  646  625  622  441  311  311  273  239  206  168  164  161  150  142  140  117  116  114  113

14.2.5.6. How to create a graph of results with plot()

plot() graphs the samples on the x axis against their outcomes on the y axis:

>>> df.plot(50)
_images/FreqDistPlot.png

Appending the argument cumulative=True to the method makes the plot add the outcomes up as it goes along:

>>> df.plot(50, cumulative=True)
_images/FreqDistPlotCum.png

14.3. How to remove the most frequent words
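
The most frequent samples in the plots above are punctuation and function words, which say little about the content. One common approach, sketched below under the assumption that the stopwords corpus has been downloaded into nltk_data, filters the text against NLTK’s Spanish stopword list before building the frequency distribution:

>>> from nltk.corpus import stopwords
>>> vacias = set(stopwords.words('spanish'))      # Spanish function words
>>> contenido = [w for w in texto if w.isalpha() and w.lower() not in vacias]
>>> FreqDist(contenido).keys()[:10]               # now dominated by content words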

14.4. ConditionalFreqDist

Some of the corpora included in nltk_data have their texts categorized. The PlaintextCorpusReader can work with these designations by means of the following methods:

PlaintextCorpusReader methods for categories

fileids([categories])         the files that have the categories listed
categories()                  the categories of the corpus
categories([fileids])         the categories in the files listed
raw(categories=[c1,c2])       the raw text (a single string) of the categories
words(categories=[c1,c2])     the words of the categories
sents(categories=[c1,c2])     the sentences of the categories
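
Since this section is about ConditionalFreqDist, here is a minimal sketch of how categories and conditional frequency distributions fit together, using the Brown corpus, a categorized corpus included in nltk_data; each category becomes a condition with its own frequency distribution:

>>> from nltk.corpus import brown
>>> from nltk.probability import ConditionalFreqDist
>>> brown.categories()                            # the fifteen Brown categories
>>> cfd = ConditionalFreqDist(
...     (categoria, palabra)
...     for categoria in brown.categories()
...     for palabra in brown.words(categories=categoria))
>>> cfd['news']                                   # the FreqDist for one condition
>>> cfd['news']['the']                            # count of 'the' in the news category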

14.5. Summary

14.6. Further practice

14.7. Further reading

14.8. Appendix


Last edited: February 10, 2015