10. Statistics and probability for text

You are going to need a text, so get one:

>>> with open('Wub.txt','r') as tempFile:
...     rawText =  tempFile.read()
>>> from nltk import word_tokenize
>>> tokens = word_tokenize(rawText)

You are already familiar with the following methods for simple text statistics:

>>> len(tokens)
>>> set(tokens)
>>> len(set(tokens))
>>> tokens.count('wub')
>>> tokens.count('the')
>>> 100 * tokens.count('wub') / float(len(tokens))
>>> 100 * tokens.count('the') / float(len(tokens))

Note

The script with the code for this chapter is nlpTextstats.py, which you can download with codeDowner(), see Practice 1, question 2.

10.1. How to keep a tally with a dictionary

While all of these methods are helpful for giving us a bird’s-eye quantitative view of a text, the most basic statistic is just to count the number of times that a word occurs in a text. The built-in count() method can do this for us. This code counts the tokens of ten unique words drawn more or less randomly from the text:

>>> sample = list(set(tokens))[:10]
>>> sample
[u'all', u'semantic', u'pardon', u'switched', u'Kindred',
u'splashing', u'excellent', u'month', u'four', u'sunk']
>>> tallyLoop = []
>>> for word in sample: tallyLoop.append(tokens.count(word))
...
>>> tallyLoop
[13, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> tallyLC = [tokens.count(word) for word in sample]
>>> tallyLC

Obviously my sample was not very interesting, but it is fine as an illustration of what to do. The problem is that the tally lists do not relate each count to the word being counted, so we will quickly get lost once we move on to texts of thousands of unique words.

What is needed is a data structure like the following table, in which each type is associated to the count of its tokens by being on the same row:

all 13
semantic 1
pardon 1
switched 1
Kindred 1

In terms of what has been said so far, a data structure is needed to store two different pieces of information, a list of words and a list of the corresponding numbers. We could try doing this with two lists, but it would get very tricky to maintain the correspondence. Fortunately, Python supplies a data structure for just this situation, called a dictionary. Let us say a few words about dictionary work, and then use one to perform a tally.

10.1.1. How to associate words and counts with a dictionary

A Python dictionary is a sequence within curly brackets of pairs of a key and a value joined by a colon, i.e. {key1:value1, key2:value2, …}. The next block creates a partial dictionary for the tally in the table, which is a sequence of string and integer pairs:

>>> tally = {'all': 13, 'semantic': 1}
>>> tally['all']
13
>>> tally['pardon']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'pardon'
>>> tally['pardon'] = 1
>>> tally
{'all': 13, 'semantic': 1, 'pardon': 1}

In line 2, a key is queried for its value using square-bracket notation to treat the key like an index into the dictionary. If the dictionary is queried for a key that is not in it, as in line 4, Python raises a key error. However, if a non-included key is assigned a value as in line 8, Python assumes that the user knows what she is doing and creates a new key-value pair.

Now try to update the dictionary with a duplicate key:

>>> tally['pardon'] = 2
>>> tally
{'all': 13, 'semantic': 1, 'pardon': 2}
>>> tally2 = {'all':13, 'semantic':1, 'semantic':5}
>>> tally2
{'all': 13, 'semantic': 5}
>>> tally3 = {'semantic':5}
>>> tally.update(tally3)
>>> tally
{'all': 13, 'semantic': 5, 'pardon': 2}

Assigning a new value to a key in line 1 does not duplicate the key but rather just updates the value. Trying to create a dictionary with duplicate keys as in line 4 is not successful either; what Python does is overwrite the preceding pair with the following one, so that only the last one remains. What other Python construct behaves this way? A set does not allow duplicates either, so dictionary keys are like set elements, which in turn are equivalent to textual types. By way of confirmation, there is a built-in method update() which merges dictionaries. Line 7 creates a new dictionary, and line 8 merges it into the old one, but the duplicate key does not survive: its current value is overwritten with the new one.

Finally, try to query a value for its key:

>>> tally[13]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 13

The statement raises a key error. The obvious reason is that square brackets are only for keys, but the real reason is that a dictionary maps a key to a value, so going through a key is the only way to reach a value. The arrows in this diagram depict the flow of processing:

_images/DictKey2Value.png

Fig. 10.1 A dictionary maps a key to a value, and not vice versa.

Author’s diagram
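
If you genuinely need to go from a value back to its keys, you have to build the reverse mapping yourself. Here is a minimal sketch using the tally built so far; since values are not guaranteed to be unique, each value maps to a list of keys:

>>> val2keys = {}
>>> for (k, v) in tally.items():
...     val2keys.setdefault(v, []).append(k)
...
>>> val2keys[13]
['all']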

Given the pivotal nature of keys, you might speculate that they must be immutable, allowing only strings (and numbers and tuples), but not lists. Values, on the other hand, should be free of this constraint. Can you think of how to test this hypothesis?

>>> tally4 = {['month', 'four']:1}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> tally5 = {'month':[2,3]}
>>> tally5['month']
[2, 3]

Trying to create a dictionary with a list as a key produces a type error in line 1. Values are not so constrained, see line 5 and following, though having multiple values does not make any sense for a tally.

There are several other methods for dictionaries that it is convenient to know:

>>> type(tally)
>>> len(tally)
>>> str(tally)
>>> tally.has_key('pardon')     # prefer next line
>>> 'pardon' in tally
>>> tally.items()
>>> tally.items()[:3]
>>> tally.keys()
>>> tally.keys()[:3]
>>> tally.values()
>>> tally.values()[:3]

In line 1, a dictionary has its own type, dict. The length of a dictionary is the number of pairs in it, as shown in line 2. str() in line 3 converts a dictionary into a string of key:value pairs. has_key() in line 4 returns True if the dictionary has the key, otherwise it returns False. key in dictionary in line 5 does the same but is faster, and has_key() has been removed in Python 3. Key-value pairs are called items. items() returns a list of key-value pairs in line 6, which can be sliced as in line 7. keys() returns a list of keys in line 8, which can be sliced as in line 9. values() returns a list of values in line 10, which can be sliced as in line 11.

Before moving on, let me point out that there are a couple of relationships between a tally dictionary and the text from which it is calculated that can be used to double-check results. The fact that a dictionary’s keys exclude duplicates means that they can be approximated by applying set() to the text from which the dictionary is extracted. In other words, the keys of a tally dictionary are the types of the text. Likewise, the values should sum to the number of words in the text. Thus there are two equalities that should hold:

len(dictionary) == len(set(tokens))
sum(dictionary.values()) == len(tokens)

You will try these out on our text in just a moment.

10.1.2. How to create a dictionary with a for loop

Let us now write some code for keeping a tally of the words in Beyond Lies the Wub. The algorithm is to create an empty dictionary; then, examine every word in the text in such a way that if the current word is already in the dictionary, add 1 to its value; otherwise, insert the word in the dictionary with the value of 1. Python follows English so closely that you can practically code this up word for word:

>>> wubDict = {}
>>> for word in tokens:
...     if word in wubDict:
...         wubDict[word] = wubDict[word] + 1
...     else:
...         wubDict[word] = 1
...

Tip

The incrementation expressed by wubDict[word] = wubDict[word] + 1 can be abbreviated to wubDict[word] += 1.
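
As an aside, and not used in the rest of the chapter, the same tally can be built without the if/else by means of the dictionary method get(), which returns a default value instead of raising a KeyError when the key is missing. A minimal sketch:

>>> wubDict2 = {}
>>> for word in tokens:
...     wubDict2[word] = wubDict2.get(word, 0) + 1
...
>>> wubDict2 == wubDict   # should be True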

As a quick check to see whether this worked, check the equalities:

>>> len(wubDict) == len(set(tokens))
>>> sum(wubDict.values()) == len(tokens)

They are both true, so the code probably worked. Go ahead and view the first thirty items:

>>> wubDict.items()[:30]
[(u'all', 13), (u'semantic', 1), (u'pardon', 1), (u'switched', 1), (u'Kindred', 1),
(u'splashing', 1), (u'excellent', 1), (u'month', 1), (u'four', 1), (u'sunk', 1),
(u'straws', 1), (u'sleep', 1), (u'skin', 1), (u'go', 8), (u'meditation', 2),
(u'shrugged', 1), (u'milk', 1), (u'issues', 1), (u'...."', 1), (u'apartment', 1),
(u'to', 57), (u'tail', 3), (u'dejectedly', 1), (u'squeezing', 1), (u'Not', 1),
(u'sorry', 2), (u'Now', 2), (u'Eat', 1), (u'fists', 1), (u'And', 5)]

Do the counts appear plausible to you? Note that it would have been more accurate to have taken the lowercase form of the words, but I eschewed doing so for the sake of perspicuity.

So you now have a dictionary of word counts. There are a lot of things that could be done with it, but we would have to code them up by hand. Let us let NLTK do some of the heavy lifting for us. But first, some practice.

10.1.3. Practice 1

I have introduced the pythonic dictionary in terms of this book’s concern with textual computation, but you are surrounded by easier-to-understand dictionaries.

  1. Take as an example a phone book. It associates a name with a number, as in the table below:
White House 202-456-1414
Homeland Security 202-282-8000
IRS 800-829-1040
Pentagon Dental Clinic 703-692-8700

Turn this table into a dictionary whose keys are offices and whose values are phone numbers. Show how to look up the IRS phone number.

  2. An even more obvious example of a dictionary is, well, a dictionary. The next table contains a few abbreviated entries from the first page that I turned to in my ancient American College Dictionary:
goof a foolish or stupid person
googly Cricket a bowled ball that serves first one way and then breaks in the other
googol a number, usually 1 followed by 100 zeros
goon U.S. Slang a stupid person

Convert this table into a pythonic dictionary whose keys are words and whose values are definitions. Show how to look up the definition of “googol”.

10.2. How to keep a tally with FreqDist

Keeping a tally is such a basic task that NLTK has a class called FreqDist that supplies an interface to a counting procedure and several other processes that are based on it. FreqDist is part of the NLTK sub-package called probability, as shown in the diagram below:

_images/NLTK-probability.png

Fig. 10.2 FreqDist methods in NLTK 3, excluding dictionary methods.

Author’s image.

FreqDist does all of the work of creating a dictionary of word frequencies for us. However, some of the methods have changed in NLTK 3. In the upcoming discussion, I rely on the more recent version.

Tip

You can check the version of a Python package or module that has already been imported with print package.__version__, e.g. print nltk.__version__. Those are two underscores in a row. You can display the contents of a module that has been imported with print dir(module); for instance, print dir(nltk.probability).

As a first step towards appreciating how much FreqDist does, let us see how it simplifies the creation of a tally. FreqDist() can build a tally directly from a generator expression (written like a list comprehension without the brackets), or indeed from any sequence of words:

>>> from nltk.probability import FreqDist
>>> wubFD = FreqDist(word for word in tokens)
>>> wubFD.items()[:30]
[(u'all', 13), (u'semantic', 1), (u'pardon', 1), (u'switched', 1), (u'Kindred', 1),
(u'splashing', 1), (u'excellent', 1), (u'month', 1), (u'four', 1), (u'sunk', 1),
(u'straws', 1), (u'sleep', 1), (u'skin', 1), (u'go', 8), (u'meditation', 2),
(u'shrugged', 1), (u'milk', 1), (u'issues', 1), (u'...."', 1), (u'apartment', 1),
(u'to', 57), (u'tail', 3), (u'dejectedly', 1), (u'squeezing', 1), (u'Not', 1),
(u'sorry', 2), (u'Now', 2), (u'Eat', 1), (u'fists', 1), (u'And', 5)]

Nice and simple, the way we like it.

10.2.1. The same old methods

The object returned by FreqDist() is of the type nltk.probability.FreqDist, but it has all the properties of a dictionary. The dictionary methods work on it, and the expected equalities hold:

>>> type(wubFD)
>>> len(wubFD)
>>> str(wubFD)
>>> 'pardon' in wubFD
>>> wubFD.items()
>>> wubFD.keys()
>>> wubFD.values()
>>> len(wubFD) == len(set(tokens))
>>> len(tokens) == sum(wubFD.values())

The length of the dictionary is equal to the number of unique words in the text, and the length of the text is equal to the sum of the values of the dictionary.

10.2.2. A new terminology from statistics: samples, outcomes, B(), and N()

Despite the shared properties of a dictionary and a frequency distribution, a frequency distribution describes itself in a different way from a dictionary. If you ask wubFD to identify itself it responds with:

>>> wubFD
<FreqDist with 929 samples and 3693 outcomes>

The reason for the change in nomenclature is that the developers of NLTK have adopted the terminology of statistics in talking about how FreqDist() works. In statistics, a frequency distribution is the result of an experiment whose outcomes are categorized according to their class or sample. In text processing, the experiment is “what is the next word in the text?”; each word token is an outcome, and it is filed under its word type, the sample. Given the change in mathematical orientation, FreqDist() adds methods in statistics-speak which reproduce those of dictionary-speak:

>>> len(wubFD.keys())
>>> wubFD.samples()
>>> wubFD.B()
>>> sum(wubFD.values())
>>> wubFD.N()

samples() is the same as keys(). B() is the number of bins or samples. N() is the number of outcomes or the sum of values, i.e. the number of words in the text. These new terms are added to the previous diagram of information flow in a dictionary to help you visualize their relation:

_images/DictSample2Outcome.png

Fig. 10.3 A dictionary and a FreqDist map a key to a value, or B sample to N outcomes, and not vice versa.

Author’s diagram.

10.2.3. How to calculate the frequency of a sample with freq()

freq() returns the frequency of a sample, which is calculated as the number of outcomes of the sample (its count) divided by the total number of outcomes:

>>> wubFD.freq('.')
0.07825616030327646
>>> wubFD['.'] / float(wubFD.N())
0.07825616030327646
>>> round(wubFD.freq('.'), 3)
0.078

10.2.4. How to find samples from outcomes with max(), Nr(), r_Nr() and hapaxes()

I know that I impressed upon you the fact that it is not possible to go from values to keys, or in stat-speak, from outcomes to samples, but FreqDist provides a few methods that overcome this limitation:

max() returns the most frequent sample:

>>> wubFD.max()

Nr() returns the number of samples that have a given number of outcomes:

>>> wubFD[wubFD.max()]
>>> wubFD.Nr(289)

r_Nr() returns a dictionary that maps each count r to the number of samples Nr that have that count:

>>> wubFD.r_Nr()
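
As a sanity check of what r_Nr() returns, multiplying each count r by the number of samples Nr that have that count and summing should give back the total number of outcomes. A quick sketch, not in the chapter’s script:

>>> rNr = wubFD.r_Nr()
>>> sum(r * n for (r, n) in rNr.items()) == wubFD.N()   # should be True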

A hapax is a word that only appears once in a text. hapaxes() returns a list of them:

>>> len(wubFD.hapaxes())
>>> wubFD.Nr(1)
>>> wubFD.hapaxes()[:30]
[u'...."', u'1952', u'://', u'Am', u'An', u'Anything', u'Apparently', u'Are',
u'Atomic', u'BEYOND', u'Back', u'Be', u'Beyond', u'Blundell', u'By', u'Cheer',
u'DICK', u'Dick', u'Distributed', u'Do', u'Earth', u'Earthmen', u'Eat', u'Eating',
u'End', u'Extensive', u'Finally', u'For', u'Good', u'Greg']

10.2.5. How to create a table of results with tabulate()

The tabulate() method prints a table whose columns are samples, with each sample’s outcome count in the row beneath it. The table is printed to the console, so only about sixteen columns can be read across. Supplying an integer argument stops the columns at that number. Supplying two integers starts the columns at the first and stops them at the second, minus one:

>>> wubFD.tabulate(10)
  .    "  the    ,    I    ' said  The   to   ."
289  164  146  141   69   66   61   59   57   56
>>> wubFD.tabulate(10, 20)
 wub   it   ,"  and   of  you   ?"   It  his    s
  54   53   48   41   39   37   34   34   34   34

10.2.6. How to create a graph of results with plot()

plot() graphs the samples on the x axis against their outcomes on the y axis:

>>> wubFD.plot(50)
_images/wubFDplot50.png

Fig. 10.4 Count of first 50 samples in “Beyond Lies the Wub”

Author’s FreqDist plot.

Appending the argument cumulative=True to the method makes the plot add the outcomes up as it goes along:

>>> wubFD.plot(50, cumulative=True)
_images/wubFDplot50cum.png

Fig. 10.5 Cumulative count of first 50 samples in “Beyond Lies the Wub”

Author’s FreqDist plot.

10.3. How to plot by hand with matplotlib.pyplot

Unfortunately, not everything that you will want to graph can be done with FreqDist.plot(). In fact, hardly any of it can. So I want to take this opportunity to introduce you to the basics of plotting with Python’s graphing toolbox, matplotlib.pyplot. Go ahead and import it:

>>> import matplotlib.pyplot as plt

You are going to use so many methods of pyplot that it is too confusing to try to import every one of them individually.

The essence of a graph is to represent visually the relationship between one list of numbers, the X axis, and another list of numbers, the Y axis. In statistics, the X axis is often called the independent variable in an experiment, and the Y axis is called the dependent variable. Y is graphed against X to reveal whether Y is influenced by X. In this textual experiment, we want to know whether counts depend on the words counted. In dictionary terms, this is whether values depend on keys, and in statistical terms, it is whether outcomes N depend on samples B. You can see why the statistical terms are as they are, because, stated this way, they make sense.

more See Wikipedia’s article on Dependent and independent variables, though there are many other explanations on the Web, such as the one at What Is the Difference Between Independent and Dependent Variables?, if you can find it amidst the ads.

10.3.1. How to graph a single plot

Thus the general approach to graphing is to collect the X and Y data and then plot them. In this section, the source of the data is wubFD, so you don’t have to go very far to collect it. But this data has a quirk: because it is stored in a dictionary, it is not ordered. Thus it has to be sorted. In fact, it has to be sorted in reverse order, because the X axis of Fig. 10.4 is ordered from largest count to smallest. The normal direction of sorting is from smallest to largest. sorted(list, reverse=True) will perform the reverse sorting for you.

But it gets worse. The Y values need to be sorted, but you do not want to lose track of their corresponding X keys/samples. There is no obvious way to do that with what you know so far, though the solution is not difficult. It is to keep X and Y together as a list of pairs (x, y) and then sort the pairs by their y member, in reverse. Recall that a frequency distribution’s items() method returns (key, value) pairs, so the code looks like this:

>>> XY = wubFD.items()
>>> # use one or the other
>>> XY = sorted(XY, key=lambda pair: pair[1], reverse=True)
>>> XY.sort(key=lambda pair: pair[1], reverse=True)

The first line you have seen before; it is the last two that deserve explication. Line 3 applies the sorting method that you are familiar with, sorted(), to the list of pairs XY. sorted() permits a key argument that tells it which part of each item to sort by. In this line, I have assigned a lambda function to the key. A lambda function is a single-line function without a name. It takes pair as input and returns its member at index 1, i.e. its second member. sorted() applies this function to every pair that it sorts, so it effectively sorts the pairs according to their second member. Line 4 illustrates the usage of an alternative sorting method, sort(), which sorts the list in place. It is considered to be faster than sorted(), because it does not create a new list. I will use it in the upcoming code blocks.

You are only halfway done, however. The data for X and Y have to be extracted from the sorted list of pairs. In effect, the list of pairs has to be ‘unpaired’ back to two separate lists. There is no dedicated ‘unzip’ method in Python, so it has to be done by hand, with two list comprehensions:

>>> limit = 50
>>> X = [x for (x,y) in XY[:limit]]
>>> Y = [y for (x,y) in XY[:limit]]
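
Incidentally, the same unpairing can be written in one line with zip() and argument unpacking; I mention it only as an aside, since it yields tuples rather than lists (which pyplot accepts just as happily):

>>> X, Y = zip(*XY[:limit])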

This sequence of pairing the Xs and Ys, sorting one of them, and then unpairing them is going to be done so many times in the upcoming code that I am going to call it the pair-sort-unpair algorithm.

There is one final step. As you may imagine, pyplot can only plot numbers, not the list of tokens contained in X. So X should be transformed into a list of numbers. Any old numbers will do, as long as they are in ascending order. Python’s range() method creates a list of integers in ascending order when given a stop integer as an argument, starting at 0 and ending one short of the stop:

>>> range(4)
[0, 1, 2, 3]

This is good enough for now, so go ahead and set it up:

>>> nX = range(len(X))

You have now assembled all the data necessary to plot. What is missing are the labels of the various aspects of the plot, but you can add them on the fly:

>>> plt.figure()
>>> plt.plot(nX, Y)
>>> plt.xlabel('Tokens')
>>> plt.ylabel('Counts')
>>> plt.title('Counts by tokens')
>>> plt.grid()
>>> plt.show()

This may look like a random jumble of lines, but there is actually a pattern to it. It always begins with the opening of a figure window. Then the plotting. Then the labeling. And finally, any overall formatting. It finishes with show(), though this is not strictly necessary: the graph is constructed and displayed as each line is run.

The labeling itself is tightly ordered, from the center of the graph to the periphery. Axis labels are closer to the plot than the title, so they come first. Within a ‘layer’ of this onion, X always comes before Y.

This block produces the figure below, parts of which are labeled with the lines of code that produced them:

_images/wubFDplot50byHand.png

Fig. 10.6 Count of first 50 samples in “Beyond Lies the Wub”, plotted by hand.

Author’s plot from code above.

10.3.2. How to change X tick labels with xticks()

This plot is not good enough, because we want to see the tokens arrayed along the X axis as the tick labels. And since tokens can take up a lot of room, they should be aligned vertically and not horizontally. pyplot enables this customization of the X axis through the xticks() method, which takes three arguments: the data list, a list of corresponding strings to be displayed as the tick labels, and an optional parameter for rotation. Since the variable X assigned above contains the token strings, it can be called upon to fill in the label information. The figure can be replotted with this method as so, with xticks() in line 3:

>>> plt.figure()
>>> plt.plot(nX, Y)
>>> plt.xticks(nX, X, rotation='vertical')
>>> plt.xlabel('Tokens')
>>> plt.ylabel('Counts')
>>> plt.title('Counts by tokens')
>>> plt.grid()
>>> plt.tight_layout()
>>> plt.show()

It gleefully spits out the figure below:

_images/wubFDplot50byHand1.png

Fig. 10.7 Count of first 50 samples in “Beyond Lies the Wub”, plotted by hand, with token labels.

Author’s plot from code above.

Two comments should be made. The first is that the grid stippling here is a bit obtrusive, but I do not omit it so that you can see how it looks with closely spaced data. The second is that line 8 turns on tight-layout formatting. Without it, the token labels run off the bottom edge of the figure, and the X label Tokens is not displayed at all. The text and label placement could be customized by hand, but it is much, much easier to have plt.tight_layout() do it for you.

10.3.3. How to add a legend with legend()

There is another piece of information that can be added to a plot to make it more informative, namely a legend. Our example only has one plot, so most of its information is apparent in the axis labels and the title, but go ahead and add one anyway, to see how it works.

Legend code is distributed between two places. The plotting method takes an optional argument label='' that holds the string to be displayed in the legend box. The box itself is turned on with legend(), which takes many optional arguments. The plots below often need the two used here:

>>> plt.figure()
>>> plt.plot(nX, Y, label='counts to tokens')
>>> plt.xticks(nX, X, rotation='vertical')
>>> plt.xlabel('Tokens')
>>> plt.ylabel('Counts')
>>> plt.title('Counts by tokens')
>>> plt.legend(loc='upper left', fontsize='small')
>>> plt.grid()
>>> plt.tight_layout()
>>> plt.show()

The new material is in lines 2 and 7: the label argument to plot() and the legend() call.

Running this block magically results in this image:

_images/wubFDplot50byHand2.png

Fig. 10.8 Count of first 50 samples in “Beyond Lies the Wub”, plotted by hand, with legend.

Author’s plot from code above.

The loc argument tells pyplot where to place the legend box; the default, loc='best', lets pyplot put it where it thinks is best, while a value like 'upper left' places it by hand, which you may want to do in some plots, so it is worth seeing in a simple example. The other argument, fontsize='small', reduces the size of the font – but you had already guessed that. This is a good trick to know for small plots.

10.3.4. How to change the numeric labeling of ticks with get_yticks()

Say you are dissatisfied with the labeling of the Y-axis ticks and would like to alter it to count by tens. Unfortunately, pyplot hides the mechanics of deciding how to apportion the ticks from you, so your task is a bit convoluted. I will break it down into parts so that you can grasp it, but in general, all you have to do is copy the code that I offer.

The goal is to divide the current ticks by ten. Yet as I said, pyplot hides this from you. All you can do is get a list containing the ticks that pyplot has decided on. So start a new plot and interrupt it as shown:

>>> plt.figure()
>>> plt.plot(nX, Y, label='counts to tokens')
>>> plt.gca().get_yticks()
array([   0.,   50.,  100.,  150.,  200.,  250.,  300.,  350.])

gca() stands for get current axes and is required for all requests to pyplot to reveal what it has calculated. get_yticks() is the method that requests the tick positions of Y. It returns an array, a kind of mathematical list, with the positions represented as floats. The array can be iterated over, so each float can be divided by 10, converted to an integer, and assigned to a new list, as in this assignment, though you shouldn’t run it just yet:

>>> newyTicks = [int(tk/10) for tk in plt.gca().get_yticks()]

Then you get the current axes again and set the Y tick labels to the new list:

>>> plt.gca().set_yticklabels(newyTicks)

This works out to the following code, with the new material in lines 4 and 5:

>>> plt.figure()
>>> plt.plot(nX, Y, label='counts to tokens')
>>> plt.xticks(nX, X, rotation='vertical')
>>> newyTicks = [int(tk/10) for tk in plt.gca().get_yticks()]
>>> plt.gca().set_yticklabels(newyTicks)
>>> plt.xlabel('Tokens')
>>> plt.ylabel('Counts x 10')
>>> plt.title('Counts by tokens')
>>> plt.legend(loc='upper left', fontsize='small')
>>> plt.grid()
>>> plt.tight_layout()
>>> plt.show()

And paints this dazzling picture:

_images/wubFDplot50byHand3.png

Fig. 10.9 Count of first 50 samples in “Beyond Lies the Wub”, plotted by hand, with customized Y ticks.

Author’s plot from code above.

Did you notice that the Y label is modified to show that it is counted by 10s?

10.3.5. How to scatter plot with scatter()

So far, your plots have joined the data points by a line, deemphasizing the points themselves. Your next challenge is to reverse this state of affairs, by calling attention to the points themselves in a scatter plot, with no line connecting them. This is accomplished by the scatter() method, which can be substituted for plot() in line 2:

>>> plt.figure()
>>> plt.scatter(nX, Y, label='counts to tokens')
>>> plt.xticks(nX, X, rotation='vertical') # omitted in bottom plot
>>> plt.xlabel('Tokens')
>>> plt.ylabel('Counts')
>>> plt.title('Counts by tokens')
>>> plt.legend(loc='upper left', fontsize='small')
>>> plt.grid()
>>> plt.tight_layout()
>>> plt.show()

It is here that pyplot falls flat on its face, because what it produces is one of these two plots, neither of which we want:

_images/wubFDplot50byHand4.png

Fig. 10.10 Count of first 50 samples in “Beyond Lies the Wub”, plotted by hand, with bad scatter plots.

Author’s plot modified from code above to produce two panes.

10.3.6. How to change axis limits with xlim() (and ylim())

The problem is that the scattered dots are slightly wider than the ticks on the X axis and so trick pyplot into overcompensating by widening the right and left margins. Turning off the text labels for the X axis in the bottom plot reveals how the X axis jumps to the next 10 units of tick spacing on either end of the X axis, even though there are no points to plot there.

We are fortunate to have an easy work-around. The limits of the X axis can be increased in each direction from [0, 50] to [-1, 51] with xlim() to fit the fat dots:

>>> plt.figure()
>>> plt.scatter(nX, Y, label='counts to tokens')
>>> plt.xlim([-1, 51])
>>> plt.xticks(nX, X, rotation='vertical')
>>> plt.xlabel('Tokens')
>>> plt.ylabel('Counts')
>>> plt.title('Counts by tokens')
>>> plt.legend(loc='upper left', fontsize='small')
>>> plt.grid()
>>> plt.tight_layout()
>>> plt.show()

The X axis now fits the data correctly:

_images/wubFDplot50byHand5.png

Fig. 10.11 Count of first 50 samples in “Beyond Lies the Wub”, plotted by hand, with scatter plot.

Author’s plot from code above.

If you wanted to change the limits of the Y axis, the corresponding method is ylim(). You can set the Y limits to be equal to the X limits by putting the latter inside the former, ylim(xlim()).

10.3.7. How to annotate points with annotate()

Now that you have a bunch of points, you can make them convey more information by labeling them. This transfers the information from the X axis to the points themselves. Pyplot’s annotate() method does this, but it is a bit complicated. Well, so complicated that I don’t even want you to check its docstring with help(plt.annotate). Instead, I will just tell you that it takes three arguments, annotate('', xy, xytext). The first is a string with which to label the point. The second, xy, gives the coordinates of the point, and the third, xytext, gives the coordinates where the string will start. Since all of this is different for each point, you have no choice but to iterate through the points after they are plotted. The label can be taken from X and the coordinates from (nX, Y), but the iteration has to coordinate the two. The best way to do this is through the explicit index given by enumerate(). The code just for this is as follows:

>>> for (pt, label) in enumerate(X):
...     plt.annotate(label,
...                  xy=(nX[pt], Y[pt]),
...                  xytext=(nX[pt], Y[pt]))

Recall that enumerate() assigns an index to each element of a list and pairs the two together, much like zip(). You do not have direct access to the contents of the enumerate object, but it is an iterator and so can be iterated over:

>>> song = ['parsley', 'sage', 'rosemary', 'thyme']
>>> enumerate(song)
<enumerate object at 0x12811b500>
>>> for (i, s) in enumerate(song):
...     print '{}\t{}'.format(i, s)
...
    0   parsley
    1   sage
    2   rosemary
    3   thyme

For the sake of explicitness, let’s break this code down into an algorithm:

  1. Choose a list to attach to the plot as labels,
  2. plug it into enumerate(),
  3. create a pair of variables that corresponds to the (index, label) pairs output by enumerate(),
  4. insert #2 and #3 into a for statement.
  5. The only line in the body of the statement is annotate(), so add it.
  6. Use the label member of the pair as the string argument of annotate().
  7. Use the index member, pt of the pair to pick out corresponding X and Y values as the xy argument, indicating the points to be annotated, and finally
  8. re-use #7 for the xytext argument, indicating where to locate the annotation.

To illustrate more perspicuously how this algorithm works, I am going to make up some data and zip it into pairs to serve as the basis for labels of the annotation:

>>> extent = 4
>>> B = range(extent)*extent
>>> A = sorted(B)
>>> AB = zip(A, B)

>>> plt.figure(figsize=(5, 5))
>>> plt.scatter(A, B, label='coordinates')
>>> for (pt, label) in enumerate(AB):
...     plt.annotate(str(label),
...                  xy=(A[pt], B[pt]),
...                  xytext=(A[pt], B[pt]))
>>> plt.xlabel('A')
>>> plt.ylabel('B')
>>> plt.title('B by A')
>>> plt.legend(loc='upper left', fontsize='small')
>>> plt.grid()
>>> plt.tight_layout()
>>> plt.show()

The problem is that putting a label at a point obscures the point:

_images/annotateEx1.png

Fig. 10.12 Scatter plot of points labeled with their coordinates.

Author’s plot from code above.

10.3.8. How to offset an annotation with get_xticks()

The general solution to this problem is to offset the label from the data point, which is the whole purpose of annotate() having the xytext argument. I find it easiest to add a little bit to the X value of the point to budge the label a bit to the right. Knowing how much to add depends on the scaling of the X axis. As mentioned above, it is hidden from you and can only be approximated through gca().get_xticks(), which returns a list of the X ticks, as in this example, which only works inside plotting code:

>>> plt.figure()
>>> plt.scatter(A, B, label='coordinates')
>>> plt.gca().get_xticks()
array([-0.5,  0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5])

Subtract the first one from the second one to get the distance between ticks, 0 - (-0.5) = 0.5, and divide by ten to get an offset of 0.05:

>>> offset = (plt.gca().get_xticks()[1] - plt.gca().get_xticks()[0]) / 10

You can now plug an offset into our example code block, at lines 3 and 7:

>>> plt.figure(figsize=(5, 5))
>>> plt.scatter(A, B, label='coordinates')
>>> offset = (plt.gca().get_xticks()[1] - plt.gca().get_xticks()[0]) / 10
>>> for (pt, label) in enumerate(AB):
...     plt.annotate(str(label),
...                  xy=(A[pt], B[pt]),
...                  xytext=(A[pt]+offset, B[pt]))
>>> plt.xlabel('A')
>>> plt.ylabel('B')
>>> plt.title('B by A')
>>> plt.legend(loc='upper left', fontsize='small')
>>> plt.grid()
>>> plt.tight_layout()
>>> plt.show()

Notice the nudge of the coordinate labels to the right:

_images/annotateEx2.png

Fig. 10.13 Scatter plot of points labeled with their coordinates.

Author’s plot from code above.

Subtracting an offset from the x coordinate nudges the label to the left. Adding an offset to the y coordinate nudges the label up; subtracting an offset from the y coordinate nudges the label down.
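
For instance, run inside the plotting block above, this hypothetical variant would push the label of the first toy point up and to the left instead of to the right:

>>> plt.annotate(str(AB[0]),
...              xy=(A[0], B[0]),
...              xytext=(A[0]-offset, B[0]+offset))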

You can now try out the annotation algorithm, reinforced with an offset, on the Wub frequency distribution. To show it at its best, cut it down to the first ten:

_images/wubFDplot50byHand6.png

Fig. 10.14 Scatter plot of counts of first 10 samples in “Beyond Lies the Wub”, with points labeled with their coordinates.

Author’s plot from code below.

The code extends the previous block with the annotation and offset, while omitting the labels from the X axis (and the tweak of the X axis limits):

>>> limit = 10
>>> X = [x for (x,y) in XY[:limit]]
>>> Y = [y for (x,y) in XY[:limit]]
>>> nX = range(len(X))

>>> plt.figure()
>>> plt.scatter(nX, Y, label='counts to tokens')
>>> offset = (plt.gca().get_xticks()[1] - plt.gca().get_xticks()[0]) / 10
>>> for (pt, label) in enumerate(X):
...     plt.annotate(label,
...                  xy=(nX[pt], Y[pt]),
...                  xytext=(nX[pt]+offset, Y[pt]))
>>> plt.xlabel('Tokens')
>>> plt.ylabel('Counts')
>>> plt.title('Wub frequency distribution')
>>> plt.legend(loc='upper left', fontsize='small')
>>> plt.grid()
>>> plt.tight_layout()
>>> plt.show()

10.3.9. How to split the window into separate panes with subplot()

Look, Ma, I shrunk the plot!

_images/wubFDplot50byHand8.png

Fig. 10.15 Line plot of counts of first 50 samples in “Beyond Lies the Wub” in the first pane of a 2x2 window.

Author’s plot from code below.

This is an example of a subplot(), which splits the figure window into panes. In the figure above, the panes are in a 2 x 2 layout, with plotting done on the first one. The code shows you how:

>>> plt.figure()
>>> plt.subplot(2, 2, 1)
>>> plt.plot(nX, Y, label='counts to tokens')
>>> plt.xticks(nX, X, rotation='vertical')
>>> plt.xlabel('Tokens')
>>> plt.ylabel('Counts')
>>> plt.title('Counts by tokens')
>>> plt.legend(loc='upper left', fontsize='small')
>>> plt.grid()
>>> plt.tight_layout()
>>> plt.show()

subplot() follows figure(), and any code following subplot() goes into the pane that it activates.

10.3.10. How to plot onto several panes

The fun part is to fill all the panes with plots. You can do that one by one, but to do it inside a loop actually reveals the regularities in your task. I recommend creating a list of data or labels to be iterated over and then using enumerate() in a for loop:

>>> plotTitles = ['First ten', 'Second ten', 'Third ten', 'Fourth ten']
>>> limit = 10
>>> plt.figure()
>>> for (pane, title) in enumerate(plotTitles):
...     XYdataSpan = XY[pane*limit:(pane+1)*limit]
...     X = [x for (x,y) in XYdataSpan]
...     Y = [y for (x,y) in XYdataSpan]
...     nX = range(limit)
...     plt.subplot(2, 2, pane+1)
...     plt.plot(nX, Y, label='counts to tokens')
...     plt.xticks(nX, X, rotation='vertical')
...     plt.xlabel('Tokens')
...     plt.ylabel('Counts')
...     plt.title(title)
...     plt.legend(loc='best', fontsize='small')
>>> plt.tight_layout()
>>> plt.show()

This code iterates over the plot titles, selecting the data to be plotted within the loop. It selects the data from XY in slices of ten, using the pane index as a multiplier: enumerate() iterates pane through [0, 1, 2, 3], and each value is multiplied by the limit of 10 to calculate the appropriate slice of XY, so pane 2, for instance, gets XY[20:30].

The only trick is that subplot() numbers its panes starting at 1, not 0, which is why in line 9 I add one to the pane index.

This results in a quite elegant figure:

_images/wubFDplot50byHand9.png

Fig. 10.16 Count of first 40 samples in “Beyond Lies the Wub”, plotted by hand in four panes of a 2x2 window.

Author’s plot from code above.

10.3.11. How to address individual panes with if

Your final task is to customize individual plots within the loop. For instance, I would like you to omit the X label Tokens in panes one and two – because you can find them underneath panes three and four. Likewise, I would like you to omit the Y label Counts in panes two and four – because you can find them to the left of panes one and three, like so:

_images/wubFDplot50byHand10.png

Fig. 10.17 Count of first 50 samples in “Beyond Lies the Wub”, plotted by hand, in first pane of 2x2 window.

Author’s plot from code below.

There is no general way to do this; it has to be done on a pane by pane basis, via if:

>>> plotTitles = ['First ten', 'Second ten', 'Third ten', 'Fourth ten']
>>> limit = 10
>>> plt.figure()
>>> for (pane, title) in enumerate(plotTitles):
...     XYdataSpan = XY[pane*limit:(pane+1)*limit]
...     X = [x for (x,y) in XYdataSpan]
...     Y = [y for (x,y) in XYdataSpan]
...     nX = range(limit)
...     plt.subplot(2, 2, pane+1)
...     plt.plot(nX, Y, label='counts to tokens')
...     plt.xticks(nX, X, rotation='vertical')
...     if pane == 0: plt.ylabel('Counts')
...     # pane 1 takes no axis labels
...     elif pane == 2:
...         plt.xlabel('Tokens')
...         plt.ylabel('Counts')
...     elif pane == 3: plt.xlabel('Tokens')
...     plt.title(title)
...     plt.legend(loc='best', fontsize='small')
>>> plt.tight_layout()
>>> plt.show()

10.3.12. Practice 2

Practice with sorting, range(), and enumerate().

  1. Here is a dictionary of state names and their populations for 2015, from U.S. Population by State, 1790 to 2015:

    >>> statePop = {'Louisiana':4649676, 'California':39144818,
                    'Florida':20271272, 'Texas':27469114}
    

Use it to plot population by state in four ways, in a single window. The four subplots should be:

  1. a line plot ordered by state with state labels on the X axis,
  2. a line plot ordered by population with state labels on the X axis,
  3. a scatter plot with the points labeled by state and no state label on the X axis (just numbers),
  4. a plot like the third, but with the Y tick labels displayed in millions.

10.4. How to create a word cloud with wordcloud

What is this an image of?

_images/wubWordCloud.png

Fig. 10.18 A word cloud generated from ‘Beyond Lies the Wub’.

Author’s wordcloud plot; see code below.

I hope you realized that it is an image of the most frequent words of the text, scaled according to their frequency. And colored and organized to look pretty.

This is known as a word cloud, though Wikipedia prefers tag cloud. The concepts introduced in this chapter are displayed in all their blinding glory.

Yes, of course there is a Python package named wordcloud for making these images. You can install it with pip:

$ pip install wordcloud

Wordcloud does tokenization, stop-word deletion and frequency calculation for you, so you are left with very little to make an image with the default settings:

>>> import matplotlib.pyplot as plt
>>> from wordcloud import WordCloud
>>> wubCloud = WordCloud().generate(rawText)
>>> plt.figure()
>>> plt.imshow(wubCloud)
>>> plt.axis("off")

You will have to look at the documentation to change the defaults. Knock yourself out.
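
For example, the package provides parameters such as max_words and background_color; a minimal sketch of overriding the defaults (check the documentation for the full list):

>>> wubCloud = WordCloud(max_words=50, background_color='white').generate(rawText)
>>> plt.figure()
>>> plt.imshow(wubCloud)
>>> plt.axis("off")
>>> plt.show()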

more See wordcloud’s documentation.

10.5. The inverse relationship between rank and frequency

Just for fun, you can plot the entire data set with the line below, which responds with Fig. 10.19:

>>> wubFD.plot()
_images/wubFDplotAll.png

Fig. 10.19 Counts of all samples in “Beyond Lies the Wub”

Author’s FreqDist plot.

What a mess! There is not enough room on the page to annotate the graph with all the words, but what is worse, the curve has practically no informative detail in its shape: it crashes from a large value to a steady small value with almost no transition.

The problem is that the largest values are too big for the scale. This is not an uncommon problem in mathematics, and mathematicians have devised a simple solution. It is to use logarithmic scales instead of linear ones. On a linear scale, each tick increases over the previous tick by a fixed amount, e.g. 0, 50, 100, 150. On a logarithmic scale, each tick is a fixed multiple of the previous tick, e.g. 1, 10, 100, 1000.

10.5.1. How to plot on logarithmic axes

Matplotlib’s pyplot can easily change from linear to logarithmic axes. In fact, it can handle all four combinations of the two axis sorts, as illustrated in this diagram:

_images/wubFD4axes.png

Fig. 10.20 Illustration of the effect of changing from linear to logarithmic axes for a rank-frequency plot of the tokens of ‘Beyond Lies the Wub’.

Author’s pyplot plot, from code in “nlp10.py”, though inspired by Wikipedia’s Logarithmic Scales.

I will only list here the code for plotting on log-log axes by themselves; the full listing is in the script for the chapter:

>>> Y = wubFD.values()
>>> Y = sorted(Y, reverse=True)
>>> X = range(len(Y))
>>> plt.figure()
>>> plt.loglog(X, Y)
>>> plt.xlabel('Rank')
>>> plt.ylabel('Frequency')
>>> plt.grid()
>>> plt.show()

The values have to be sorted in reverse, from largest to smallest.
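
The other three panes of Fig. 10.20 presumably come from the same data with the plotting call swapped out; a minimal sketch of the alternatives, each of which would go in its own figure or subplot pane:

>>> plt.plot(X, Y)       # linear X, linear Y
>>> plt.semilogx(X, Y)   # logarithmic X, linear Y
>>> plt.semilogy(X, Y)   # linear X, logarithmic Y
>>> plt.loglog(X, Y)     # logarithmic X, logarithmic Y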

10.5.2. Zipf’s law

The log-log plot of the text is an example of Zipf’s law:

Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.: the rank-frequency distribution is an inverse relation.
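
As a rough check against our own counts (a quick sketch; the most frequent samples in this text are punctuation marks, so the fit is only approximate):

>>> top = sorted(wubFD.values(), reverse=True)
>>> [round(float(top[0]) / c, 2) for c in top[:5]]   # under Zipf's law, roughly [1, 2, 3, 4, 5]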

Our discovery of Zipf’s law leads to a paradoxical conclusion about the structure of texts: the most frequent words in a text are the least informative. Thus in order to get at the essence of a text, you may want to get rid of them. This is the motivation for the stop-word deletion already reviewed in the previous chapter.

10.6. How to make a probability distribution with MLEProbDist

There is one glaring drawback to using a frequency distribution to understand the lexical structure of a text – the tally is relative to the size of the text. That is to say, whether there are a ‘lot’ of tokens of a word or not depends on the total number of words in the text. It would be more informative to know the ‘relative’ frequency of a word by dividing its count by the total number of words, N. You saw how to do this briefly above with the freq() method, but it only works on a single word. Fortunately, NLTK’s probability module provides a class for dividing an entire frequency distribution by N, called MLEProbDist. MLEProbDist() takes a frequency distribution as input to produce a probability distribution:

>>> from nltk.probability import MLEProbDist
>>> wubPD = MLEProbDist(wubFD)
>>> wubFD.freq('.')
>>> wubPD.prob('.')
>>> wubPD.samples()[:10]

This division has the effect of converting a frequency distribution into a probability distribution, as the name of MLEProbDist() suggests: MLE stands for maximum likelihood estimate, which here just means that the probability of each sample is estimated by its relative frequency in the text.

10.6.1. Small detour: How to plot a MLEProbDist

Unfortunately, NLTK’s probability distributions do not define a plotting method, so you have to do it yourself. Do not feel like you have to understand every facet of this code. To create the X and Y data, you must create a list of pairs of token and probability from the prob() and samples() methods and then sort the list by the probabilities in reverse order to start with the largest one. This is illustrated in the first two lines of code:

>>> wubProb = [(sample, wubPD.prob(sample)) for sample in wubPD.samples()]
>>> wubProb.sort(key=lambda pair: pair[1], reverse=True)
>>> limit = 50
>>> X = [t for (t, p) in wubProb[:limit]]
>>> Y = [p for (t, p) in wubProb[:limit]]
>>> nX = range(limit)

Then you choose a cut off point, say 50 as before, extract the first fifty tokens/samples to be X – but they must be assigned numbers to actually be plotted on the X axis – and finally the first fifty probabilities are extracted to be Y.

You now have enough data to plot it. You should be able to figure that out on your own:

>>> import matplotlib.pyplot as plt
>>> plt.figure()
>>> plt.plot(nX, Y)
>>> plt.xticks(nX, X, rotation='vertical')
>>> plt.xlabel('Rank')
>>> plt.ylabel('Probability')
>>> plt.grid()
>>> plt.show()

Here is what you should see appear on your computer screen:

_images/wubPDplot50.png

Fig. 10.21 Rank-probability plot of the first 50 most probable tokens of ‘Beyond Lies the Wub’.

Author’s pyplot plot, from code above.

A probability distribution varies between 0 and 1, and all the individual probabilities add up to 1. The shape of the curve should mimic the previous shapes for this sample.
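
You can verify the second property directly from wubPD (a quick check, not in the chapter’s script):

>>> sum(wubPD.prob(s) for s in wubPD.samples())   # should be (very close to) 1.0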

The next step is to plot the entire data set. You did most of the work for the previous graph, so the only innovation is to plot two graphs in different panes of the same plotting window, the first with linear axes and the second with logarithmic axes:

>>> plotTitles = ['Linear scales', 'Logarithmic scales']
>>> Y = [p for (t, p) in wubProb]
>>> X = range(len(Y))
>>> for (pane, title) in enumerate(plotTitles):
...     plt.subplot(2, 2, pane+1)
...     if pane == 0:
...         # linear scales
...         plt.plot(X, Y)
...         plt.ylabel('Probability')
...     elif pane == 1:
...         # loglog scales
...         plt.loglog(X, Y)
...     plt.xlabel('Rank')
...     plt.title(title)
...     plt.grid()
>>> plt.tight_layout()
>>> plt.show()
_images/wubPDplotAll.png

Fig. 10.22 Rank-probability plots of all the tokens of ‘Beyond Lies the Wub’.

Author’s pyplot plot, from code above.

Warning

Discuss in class

10.7. Authorship attribution and conditional frequency distribution

Imagine that you have six texts that you have never read. Five are by the same author, and one is not. How would you figure out which one was different? Well, you could just read all six, but that is no fun. Anybody can read. I want you to figure it out quantitatively, with Python. The task is known as stylometry or authorship attribution and has been studied intensively since at least 1964, if not 1887, see [Sta09]. Here are some measures that have been proposed; a sketch of the second one follows the list:

  • Lexical diversity, which is a measure of the richness of the author’s vocabulary.
  • The average number of words per sentence.
  • Average number of commas, semicolons and colons per sentence.
  • Frequencies of 10 most common words.
  • Frequencies of most common POS tags, e.g., NN, NNP, NNS, DT, IN, JJ.
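
As promised, here is a minimal sketch of the second measure, average words per sentence, assuming rawText still holds ‘Beyond Lies the Wub’ from the start of the chapter; sent_tokenize() is NLTK’s stock sentence splitter:

>>> from nltk import sent_tokenize, word_tokenize
>>> sentences = sent_tokenize(rawText)
>>> round(sum(len(word_tokenize(s)) for s in sentences) / float(len(sentences)), 2)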

So let’s get started. The first thing to do is collect a corpus. Here it is:

>>> scifiCorpus = ['21279', '28554', '28644', '30240', '30255', '31516',
...                '32032', '32154']

It consists of the numerical ids of eight short stories from Project Gutenberg, six by one author and two by another. They are all in the same genre and were all published at about the same time. I have sorted them to make the individual files a bit easier to keep track of.

Download them with gutenLoader(). But wait. Project Gutenberg restricts the “perceived use of automated tools” to download its files. To avoid getting blocked from the website for 24 hours, put your code to sleep for five seconds between downloads with time.sleep():

>>> from time import sleep
>>> from textProc import gutenLoader
>>> for g in scifiCorpus:
...     url = 'http://www.gutenberg.org/cache/epub/{}/pg{}.txt'.format(g, g)
...     trash = gutenLoader(url, g)
...     sleep(5)

more See Project Gutenberg’s policy at Information About Robot Access to our Pages.

Now take a first look at the corpus by calculating the number of characters in each file and tabulating it with the file name. While you are at it, create a way to recover a story’s id from its length:

>>> storyLen = []
>>> len2id = {}
>>> for sfid in scifiCorpus:
...     with open(sfid, 'r') as tempFile:
...         rawText =  tempFile.read()
...     tempLen = len(rawText)
...     storyLen.append(tempLen)
...     len2id[tempLen] = sfid
...     print '{}\t{}'.format(sfid, str(len(rawText)))

Why do I assign story length to its own variable in line 6 rather than just calculating it each time I need it? As for line 9, I will try to be consistent and use string formatting every time I need to build a complex string.

I include the output so we can refer back to it, with slight modifications to make it more readable:

21279    15641
28554    16023
28644    14183
30240    22331
30255    45841
31516     7060
32032    91824
32154   150832

Double check this table against the dictionary len2id that you created above, by typing the dictionary’s name in the console. It should reproduce the tabulation, from right to left.

Notice that there are considerable differences in the lengths of the members of the corpus. This will give you plenty of leeway to see the advantages and disadvantages of the methods that you are going to learn.

Another thing to appreciate is that by referring to a story by its id rather than its name, you will not be biased by your knowledge of the stories, and in particular their authors. Thus you can conduct the following computational experiments objectively, as a sort of double-blind methodology.

more See Wikipedia’s article on Blind experiment.

10.7.1. How to measure lexical diversity

One simple statistic for a text is to find the ratio of total words to unique words, or word tokens divided by word types. This is simple enough for you to calculate – with the caveat of ensuring real number (float) division – as long as you filter out the punctuation. Oh, and you should round the resulting floating point numbers to two decimal places, otherwise Python displays them with a dozen or more decimal places:

>>> lexDiver = []
>>> for sfid in scifiCorpus:
...     with open(sfid, 'r') as tempFile:
...         rawText = tempFile.read()
...     words = [w for w in word_tokenize(rawText) if w.isalpha()]
...     lD = len(words) / float(len(set(words)))
...     lexDiver.append(round(lD, 2))

I hope that you got this answer:

>>> [2.96, 3.09, 3.34, 2.85, 4.65, 2.17, 6.52, 6.57]

Now you might suspect that the lexical diversity of a text could be proportional to how long the text is. The longer the text, the greater the chance for the author to use words that she has not used yet. You can test this hypothesis by plotting lexical diversity against the length of the texts. That is, the length of the texts is the independent variable, on the X axis, against which you want to evaluate the lexical diversity of a text, the dependent variable on the Y axis. As the terminology suggests, the dependent variable depends on the independent one, which is shown quantitatively by plotting Y onto X.

You already know how to do this, except for one detail. The text lengths have to be sorted from low to high, but the diversity scores depend on them, so they have to go along with the sort. So the two have to be paired through zipping, then sorted by length, then unpaired back to lists through list comprehensions. This is another instance of the pair-sort-unpair algorithm mentioned in How to graph a single plot:

>>> lenDiverPairs = zip(storyLen, lexDiver)
>>> lenDiverPairs.sort(key=lambda pair: pair[0])
>>> X = [l for (l,d) in lenDiverPairs]
>>> Y = [d for (l,d) in lenDiverPairs]

The sort() method allows sorting on items of a tuple through the lambda expression, which is just a quick and dirty function without a name, also introduced in How to graph a single plot. It says to sort based on the first item of each pair.

Having sorted the lengths to make X, and having gathered together Y as a by-product, you can now print a scatter plot to look for the hypothesized relationship. Be sure to add some annotation code, as explained in How to annotate points with annotate():

>>> plt.figure()
>>> plt.scatter(X, Y)
>>> offset = (plt.gca().get_xticks()[1] - plt.gca().get_xticks()[0]) / 10
>>> for (pt, label) in enumerate(X):
...     plt.annotate(len2id[label],
...                   xy=(X[pt], Y[pt]),
...                   xytext=(X[pt]+offset, Y[pt]))
>>> plt.xlabel('Story length')
>>> plt.ylabel('Lexical diversity')
>>> plt.grid()
>>> plt.show()

Here are the results of your handiwork:

_images/scifiLenDiversity2.png

Fig. 10.23 Plot of lexical diversity by story length for a corpus of unknown authors, with point labels.

Author’s pyplot plot, from code above.

So does it look to you like, as the length of the stories increases, they get more lexically diverse?

more See textInspector’s explanation of Lexical Diversity.

You might say it does, and I might agree with you, but what we really need is an objective measure of the proportionality. Fortunately, this is a well-known problem in statistics, for which there is a well-known and Python-friendly solution.

Warning

Discuss in class.

10.7.1.1. How to perform linear regression

Visually, the points in the graph should approximate a straight line that goes from the bottom left corner to the top right corner. This just represents the idea that as X gets bigger, Y gets bigger, too. So what is a line? It is something that starts at a certain point and then continues at a fixed angle, called its slope. Mathematically, each point x is multiplied by the slope, conventionally referred to as m or a, to which is added the value of y at x = 0 (the intercept), conventionally referred to as b, or:

y = a*x + b

This is eminently plottable with pyplot. You have to supply a representative number of points on the X axis, which is what range() is for. Well, not exactly: range() only accepts integers, and the list it returns cannot be multiplied by a float or have one added to it. The arange() method of numpy creates a range as an array, which is a more complex mathematical object that does permit floats, multiplication and addition. In the code below I make up a and b and plot a line from them:

>>> from numpy import arange
>>> a = 0.5
>>> b = 2
>>> X = arange(0, 10, 0.1)
>>> Y = a*X + b
>>> X
# ----
>>> plt.figure(figsize=(5,5))
>>> plt.plot(X, Y)
>>> plt.ylim(plt.xlim())
>>> plt.xlabel('X')
>>> plt.ylabel('Y')
>>> plt.grid()
>>> plt.show()
_images/plotLine.png

Fig. 10.24 Plot of a line.

Author’s pyplot plot, from code above.

Warning

Discuss in class.

more See Purplemath for more on Straight-Line Equations: Slope-Intercept Form.

Returning to our topic of analysis, if lexical diversity is proportional to story length, then there should be a line connecting the points plotted in Fig. 10.23. Finding such a line is the purpose of linear regression. Python’s math packages have several ways of performing linear regression. I am going to illustrate one from scipy that does everything in one fell swoop.

As a first step, let us see how it fares on the ‘data’ that you just graphed:

1
2
3
4
5
>>> from scipy import stats
>>> pNames = ['slope', 'intercept', 'r_value', 'p_value', 'std_err']
>>> parameters = stats.linregress(X,Y)
>>> for (n,p) in zip(pNames, parameters):
...     print '{}\t{}'.format(n, round(p,2))

The last two lines display this little table:

1
2
3
4
5
slope       0.5
intercept   2.0
r_value     1.0
p_value     0.0
std_err     0.0

linregress() found the slope (a) and the start-point or y intercept (b) exactly, with an r_value of 1.0 and essentially zero error.
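As a sanity check, you might perturb Y with a little random noise and run the regression again; the fit statistics should degrade accordingly. This is a minimal sketch, and since the noise is random the exact numbers will differ from run to run:

>>> from numpy.random import normal
>>> noisyY = Y + normal(0, 0.5, len(Y))         # add Gaussian noise to the line
>>> round(stats.linregress(X, noisyY)[2], 3)    # r_value now falls a little short of 1.0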

Warning

See documentation & discuss.

So it is time to turn it loose on our data set. To get started, reinitialize X and Y and then plug them into linregress() and tabulate the resulting parameters:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
>>> X = [l for (l,d) in lenDiverPairs]
>>> Y = [d for (l,d) in lenDiverPairs]
>>> parameters = stats.linregress(X,Y)
>>> for (n,p) in zip(pNames, parameters):
...     print '{}\t{}'.format(n, round(p,6))

slope       3.1e-05
intercept   2.595858
r_value     0.929703
p_value     0.000823
std_err     5e-06

Now scatter plot the data points and add the regression line, which you have to calculate using the line equation above and the first two parameters returned by linregress():

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
>>> from numpy import array
>>> aX = array(X)
>>> rY = parameters[0]*aX + parameters[1]

>>> plt.figure()
>>> plt.scatter(X, Y)
>>> plt.plot(X, rY, 'r')
>>> offset = (plt.gca().get_xticks()[1]-plt.gca().get_xticks()[0])/10
>>> for (pt, label) in enumerate(X):
...     plt.annotate(len2id[label],
...              xy=(X[pt], Y[pt]),
...              xytext=(X[pt]+offset, Y[pt]))
>>> plt.xlabel('Story length')
>>> plt.ylabel('Lexical diversity')
>>> plt.grid()
>>> plt.show()
_images/scifiLenDiversity3.png

Fig. 10.25 Plot of lexical diversity by story length for a corpus of unknown authors, with linear regression line.

Author’s pyplot plot, from code above.

Warning

Discuss in class.

more See Scipy’s documentation for more on scipy.stats.linregress.

more For more than you want to know about linear regression in a Python framework, see A friendly introduction to linear regression (using Python).

10.7.1.2. Detour: how to unpack a tuple in an assignment

It may not have been immediately clear to you that stats.linregress() returns a tuple of values, but it does:

>>> stats.linregress(X,Y)

It would save some typing to assign the members of this tuple to variable names. Python is up to the task, either directly or as a tuple:

1
2
>>> slope, intercept, r_value, p_value, std_err = stats.linregress(X,Y)
>>> (slope, intercept, r_value, p_value, std_err) = stats.linregress(X,Y)

In fact, the output tuple in the second line can itself be assigned to a name:

1
2
>>> params = (slope, intercept, r_value, p_value, std_err) = stats.linregress(X,Y)
>>> params[0] == slope

While this particular case is known as unpacking a tuple, it falls under the more general topic of destructuring assignment, which allows you to condense a bunch of variable assignments into a single, easy-to-read line.
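Destructuring is more general than this single case; it also handles nested structures and the classic variable swap, as in these toy examples:

>>> (a, (b, c)) = (1, (2, 3))    # nested unpacking
>>> b
2
>>> a, c = c, a                  # swap two values without a temporary variable
>>> (a, c)
(3, 1)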

10.7.2. How to measure lexical density

quiz topic

more See Graham Williamson’s page on Lexical Density.

10.7.2.1. How to calculate POS densities

Knowing how to tag the words in our corpus with their part of speech opens up a variety of ways to look for stylistic patterns. The figure below summarizes a few of them.

_images/scifiVNADVbyWord.png

Fig. 10.26 Plot of POS measures for a corpus of unknown authors, with linear regression lines.

Author’s plot from code below.

I will walk you through the code.

The first step is to import NLTK’s part-of-speech tagger and define some interesting lists of tags. I am going to focus on verbs, nouns, adverbs and adjectives:

1
2
3
4
5
>>> from nltk.tag import pos_tag
>>> V = ['VB', 'VBZ', 'VBP', 'VBD', 'VBG']
>>> N = ['NN', 'NNS', 'NNP', 'NNPS']
>>> ADV = ['RB', 'RBR', 'RBS']
>>> ADJ = ['JJ', 'JJR', 'JJS']

The next step is to initialize some lists to hold the results you get from processing the tags: the number of words, verbs, nouns, adverbs and adjectives in each text. From these come the six axes of the three plots in Fig. 10.26, as well as the later ratios of adverbs to verbs, adjectives to nouns, nouns to verbs, and adjectives to adverbs. The initialization can be done in two ways, by initializing each list on its own line or through a destructuring assignment, as in line 6:

1
2
3
4
5
6
>>> wLen = []       # number of words
>>> vLen = []       # number of verbs
>>> nLen = []       # number of nouns
>>> advLen = []     # number of adverbs
>>> adjLen = []     # number of adjectives
>>> vLen, nLen, advLen, adjLen, wLen = ([] for i in range(5))

The first technique is a bit of a jumble, but it does let you explain what each list is for. The second is sleek and elegant, though it has been described as “code golf”.
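A related aside, in case you are tempted to shorten the second technique even further: multiplying a single empty list looks similar but creates several names for one and the same list, a classic pitfall. A toy demonstration:

>>> shared = [[]] * 3                    # three references to the same list
>>> shared[0].append('oops')
>>> shared
[['oops'], ['oops'], ['oops']]
>>> separate = [[] for i in range(3)]    # three independent lists
>>> separate[0].append('ok')
>>> separate
[['ok'], [], []]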

Now you collect the data and calculate the ratios. The algorithm is as before:

  • Loop through the corpus to …

    1. read a file,

    2. split it into tokens,

    3. collect the words,

    4. tag them,

    5. collect the relevant parts of speech …

      1. initialize lists for verbs, nouns, adjectives, and adverbs,
      2. loop over the tagged words, assigning each one to its POS list;
    6. calculate totals and ratios.

The code is as below. I have highlighted another destructuring assignment in line 7:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
>>> for sfid in scifiCorpus:
...     with open(sfid,'r') as tempFile:
...         rawText =  tempFile.read()
...     tokens = word_tokenize(rawText)
...     words = [t for t in tokens if t.isalpha()]
...     taggedW = pos_tag(words)
...     verbs, nouns, advs, adjs = ([] for i in range(4))
...     for (w,tag) in taggedW:
...         if tag in V: verbs.append(w)
...         elif tag in N: nouns.append(w)
...         elif tag in ADV: advs.append(w)
...         elif tag in ADJ: adjs.append(w)
...     wLen.append(len(words))
...     vLen.append(len(verbs))
...     nLen.append(len(nouns))
...     advLen.append(len(advs))
...     adjLen.append(len(adjs))

The final step is to plot the results. This is a bit of a bear because there is plenty of formatting. I will walk you through the first of the three panes. The overall approach is to:

  1. initialize a pane,
  2. calculate,
  3. plot,
  4. label.

The specific algorithm is to:

  1. initialize a pane,
  2. scatter plot the two data sets,
  3. calculate the regression parameters,
  4. plot the regression line, formatting the first two parameters to be part of the legend,
  5. annotate each point with an id,
  6. label each axis,
  7. add the legend, with appropriate font size and location, and finally
  8. add the grid stippling.

Here is the code for the entire loop:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
>>> plotData0 = [(wLen, vLen), (wLen, nLen), (wLen, advLen)]
>>> yaxisLabels = ['V x 1000', 'N x 1000', 'ADV x 1000']

>>> plt.figure(figsize=(7.5,7.5))
>>> for (pane, data) in enumerate(plotData0):
...    X, Y = data[0], data[1]
...    slope, intercept = stats.linregress(X, Y)[0:2]
...    rX = slope*array(X) + intercept
...    plt.subplot(2, 2, pane+1)
...    plt.scatter(X, Y)
...    plt.plot(X, rX, 'r',
...            label='slope={},\nintercept={}'.format(
...            round(slope,2),
...            round(intercept,2)))
...    plt.ylim(plt.xlim())
...    wTicks = [int(tk/1000) for tk in plt.gca().get_xticks()]
...    plt.gca().set_xticklabels(wTicks)
...    plt.gca().set_yticklabels(wTicks)
...    offset = (plt.gca().get_xticks()[1]-plt.gca().get_xticks()[0])/10
...    for pt in range(len(X)):
...        plt.annotate(str(pt), # scifiCorpus[i]
...                     xy=(X[pt], Y[pt]),
...                     xytext=(X[pt]+offset, Y[pt]))
...    plt.xlabel('words x 1000')
...    plt.ylabel(yaxisLabels[pane])
...    plt.legend(loc='upper left', fontsize='small')
...    plt.grid()
>>> plt.tight_layout()
>>> plt.show()

A few lines are worthy of commentary. Sizing a graph to make it legible but not take up too much space on a page requires a lot of tweaking, so I like to add the final dimensions to the figure() method in line 4 so I don’t forget what they are. Line 7 destructures the first two parameters from stats.linregress() as the slope and y intercept of the linear regression line calculated in line 8. Line 12 uses the newline character \n to split the values for slope and y intercept onto separate lines in the legend. Line 15 sets the limits of the Y axis to those of the X axis, in order to compare the slope of the regression lines across the three plots. Line 16 recalculates the numeric labeling of the axes, because the default numbering in the thousands is not legible. It divides each of the default tick values from plt.gca().get_xticks() by 1000 to get a new list of labels, wTicks, which is then applied to both axes in lines 17 and 18.
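In isolation, the relabeling is just integer division of each default tick value by 1000; with made-up tick values:

>>> defaultTicks = [0, 5000, 10000, 15000, 20000]   # made-up tick values for illustration
>>> [int(tk/1000) for tk in defaultTicks]
[0, 5, 10, 15, 20]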

The strong linear relationship among the data points of the three plots in Fig. 10.26 indicates that the count of each part of speech largely depends on the total number of words. Thus, to remove this dependency and get a truer picture of the distribution of each part of speech in the corpus, the POS counts must be divided by the number of words in each text, producing density metrics. For example, a text of 1,000 words that contains 150 verbs has a verb density of 150/1000 = 0.15. The next figure reports the results of this division:

_images/scifiVNADVDensities.png

Fig. 10.27 Densities of verbs, nouns and adverbs in a corpus of unknown authors.

Author’s plot from code below.

It is a line plot of the three densities on the same axes, for easy comparison.

Warning

Discuss in class.

As for the code that produces this figure, its core is a division such as the one in line 1 below. But that will not work, because one list cannot be divided by another, and in any event float() only applies to a single number, not to a list of numbers. Converting the lists to arrays as in line 2 does not help as written, because float(wLen) still asks float() to convert a whole list. What does work is a list comprehension over the zipped pairs, as in line 3:

1
2
3
>>> density = vLen / float(wLen)
>>> density = array(vLen) / array(float(wLen))
>>> density = [v / float(w) for (v, w) in zip(vLen, wLen)]

There is a deeper problem, however, in that the word counts of the texts should be ordered to reveal any residual linearity. We are led inevitably back to the pair-sort-unpair algorithm:

  1. zip together the two lists,
  2. order the resulting pairs by one member,
  3. convert the pairs back to lists by list comprehensions over one and then the other member of the pairs.

Fortunately, the third step can be simplified because the two output lists can be collapsed into a single list of densities by performing the division on the pairs, as in line 3 above.
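Here is the whole pair-sort-divide sequence on made-up counts, just to make the steps concrete (toyV and toyW are hypothetical, not the corpus values):

>>> toyV = [10, 40, 25]                      # hypothetical verb counts
>>> toyW = [100, 200, 150]                   # hypothetical word counts per text
>>> toyPairs = zip(toyV, toyW)               # pair
>>> toyPairs.sort(key=lambda pair: pair[1])  # sort by word count
>>> [v / float(w) for (v, w) in toyPairs]    # divide to get densities
[0.1, 0.16666666666666666, 0.2]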

The calculate & plot algorithm is repeated three times, once per part of speech, so the obvious organization is to pack it all into a loop:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
>>> tagLists = [vLen, nLen, advLen]
>>> labels = ['Verb density', 'Noun density', 'Adverb density']

>>> plt.figure(figsize=(5,5))
>>> for (line, tag) in enumerate(tagLists):
...     tempPair = zip(tag, wLen)
...     tempPair.sort(key=lambda pair: pair[1])
...     density = [t/float(w) for (t,w) in tempPair]
...     plt.plot(range(len(wLen)), density, label=labels[line])
>>> plt.ylim([0, 0.35]) # make room for legend at top
>>> plt.xticks(range(len(wLen)), sorted(wLen), rotation='vertical')
>>> plt.xlabel('Words in text')
>>> plt.ylabel('Density')
>>> plt.legend(loc='best', fontsize='small')
>>> plt.grid()
>>> plt.tight_layout()
>>> plt.show()

Lines 6 to 8 instantiate the zip, sort and unzip algorithm. enumerate() is used in line 5 to index the iterations of the loop, because the index makes it easy to pick out the correct label for each line plot in line 9.
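If enumerate() is still a little mysterious, it simply pairs a running index with each item of a sequence; for instance, applied to the labels list above:

>>> list(enumerate(['Verb density', 'Noun density', 'Adverb density']))
[(0, 'Verb density'), (1, 'Noun density'), (2, 'Adverb density')]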

10.7.2.2. Relationships between parts of speech

_images/scifiPOSrelations.png

Fig. 10.28 Relationships between verbs and nouns, verbs and adverbs, and nouns and adjectives in a corpus of unknown authors.

Author’s plot from code below.

Warning

Discuss in class

Turning to the code that produces this pretty picture, the data has already been gathered, so the linear regressions are calculated in lines 7 and 8, and the data and regression lines are plotted in lines 10 and 11; everything else is just there to make the graph look pretty:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
>>> plotDataRelations = [(vLen, nLen), (vLen, advLen), (nLen, adjLen)]
>>> axisLabels1 = [('V x 1000', 'N'), ('V x 1000', 'ADV'), ('N x 1000', 'ADJ')]

>>> plt.figure(figsize=(7.5,7.5))
>>> for (pane, data) in enumerate(plotDataRelations):
...     X, Y = data[0], data[1]
...     slope, intercept = stats.linregress(X, Y)[0:2]
...     rX = slope*array(X) + intercept
...     plt.subplot(2, 2, pane+1)
...     plt.scatter(X, Y)
...     plt.plot(X, rX, 'r',
...             label='slope={},\nintercept={}'.format(
...             round(slope,2),
...             round(intercept,2)))
...     newticks = [int(tk/1000) for tk in plt.gca().get_xticks()]
...     plt.gca().set_xticklabels(newticks)
...     offset = (plt.gca().get_xticks()[1] - plt.gca().get_xticks()[0]) / 10
...     for pt in range(len(X)):
...         plt.annotate(str(pt), # scifiCorpus[i]
...                      xy=(X[pt], Y[pt]),
...                      xytext=(X[pt]+offset, Y[pt]))
...     plt.xlabel(axisLabels1[pane][0])
...     plt.ylabel(axisLabels1[pane][1])
...     plt.legend(loc='upper left', fontsize='small')
...     plt.grid()
>>> plt.tight_layout()
>>> plt.show()

The next step is to divide the Y-axis values by the X-axis values to try to nullify the linear relationship. The next figure conveys the results; its code is underneath:

_images/scifiPOSrelationRatios.png

Fig. 10.29 Ratios or densities of nouns to verbs, adverbs to verbs, and adjectives to nouns in a corpus of unknown authors.

Author’s plot from code below.

The challenge of this data is that the lists from which the ratios are to be calculated should be sorted by the word totals of each text, which means zipping three lists into triples, sorting on the last, and then untripling the triples:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
>>> plotDataTriples = [(vLen,nLen,wLen), (vLen,advLen,wLen), (nLen,adjLen,wLen)]
>>> plotLabels = ['N/V', 'ADV/V', 'ADJ/N'] # left out: 'ADJ/ADV'

>>> plt.figure(figsize=(5,5))
>>> for (line, data) in enumerate(plotDataTriples):
...     X, Y, W = data[0], data[1], data[2]
...     tempTrio = zip(X, Y, W)
...     tempTrio.sort(key=lambda trio: trio[2])
...     ratio = [y / float(x) for (x,y,z) in tempTrio]
...     plt.plot(range(len(W)), ratio, label=plotLabels[line])
>>> plt.xticks(range(len(W)), W, rotation='vertical')
>>> plt.xlabel('Words')
>>> plt.ylabel('Ratio')
>>> plt.legend(loc='best', fontsize='small')
>>> plt.grid()
>>> plt.tight_layout()
>>> plt.show()

Lines 7 through 9 instantiate the pair-sort-unpair algorithm on the data triples. I guess I should have called it the ‘tuple-sort-untuple’ algorithm.

Warning

Discuss in class

10.7.2.3. Summary and/or practice

10.7.3. How to fit the Zipf distribution to a power law

We can now be more exact about the shape of the logarithmic plot of the Wub tokens: even though it is bent at the top and bumpy at the bottom, it approximates a straight line. Approximating a straight line on logarithmic scales is an indication that the data obeys a power law: if y is proportional to x raised to some negative power, then log y is a linear function of log x, which plots as a straight line on log-log axes. Python has a package for fitting such distributions, powerlaw, which you can install from the command line:

$ pip install powerlaw
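To see the log-log linearity for yourself, you can plot a made-up pure power law on log-log axes; this is a minimal sketch (the exponent 2 is arbitrary), reusing the plt and arange imports from above:

>>> X = arange(1, 100, 1.0)
>>> Y = X ** -2.0                  # a pure power law with an arbitrary exponent of 2
>>> plt.figure(figsize=(5,5))
>>> plt.loglog(X, Y)               # comes out as a straight line on log-log axes
>>> plt.xlabel('X (log scale)')
>>> plt.ylabel('Y (log scale)')
>>> plt.grid()
>>> plt.show()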

Plotting with powerlaw couldn’t be simpler, well, except for formatting the legend with the values for slope and y intercept:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
>>> from powerlaw import Fit, plot_pdf
>>> Y = wubFD.values()
>>> wubFit = Fit(Y, discrete=True)
>>> slope = wubFit.power_law.alpha
>>> intercept = wubFit.power_law.xmin

>>> plt.figure(figsize=(5,5))
>>> plot_pdf(Y)
>>> wubFit.power_law.plot_pdf(linestyle='--',
...                           color='g',
...                           label='slope={}, intercept={}'.format(
...                           round(slope,2),(round(intercept,2))))
>>> plt.xlabel('Rank')
>>> plt.ylabel('Frequency')
>>> plt.show()

The result is …

_images/wubFDpowerLaw.png

Fig. 10.30 Fitting of rank-frequency plot of the tokens of ‘Beyond Lies the Wub’ to a power law.

Author’s plot from code above.

Now you can apply this technique to the entire corpus:

_images/scifiFDpowerLaw.png

Fig. 10.31 Fitting of rank-frequency plot of the tokens of a corpus of unknown authors to a power law.

Author’s plot from code below.

Warning

Discuss in class.

The code has to iterate over eight panes, one for each text in the corpus:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
>>> plt.figure(figsize=(13.5,5.5))
>>> for (pane, sfid) in enumerate(scifiCorpus):
...     with open(sfid, 'r') as tempFile:
...         rawText =  tempFile.read()
...     textFD = FreqDist(t for t in word_tokenize(rawText))
...     Y = textFD.values()
...     textFit = Fit(Y, discrete=True)
...     slope = textFit.power_law.alpha
...     intercept = textFit.power_law.xmin
...     plt.subplot(2, 4, pane+1)
...     plot_pdf(Y, label='{}: {}'.format(pane+1, sfid))
...     textFit.power_law.plot_pdf(linestyle='--',
...                                color='g',
...                                label='slope={}\nintercept={}'.format(
...                                round(slope,2),(round(intercept,2))))
...     plt.xlim([0,10000])
...     if pane in [0,1,2,3]: plt.tick_params(labelbottom='off')
...     if pane not in [0,4]: plt.tick_params(labelleft='off')
...     if pane in [4,5,6,7]: plt.xlabel('Rank')
...     if pane in [0,4]:     plt.ylabel('Frequency')
...     plt.legend(fontsize='small')
...     plt.grid()
>>> plt.show()

Given that there are two parameters to a linear regression, the slope and y intercept of the line, they can be plotted against each other to look for a pattern:

_images/scifiFDPowerlawAB.png

Fig. 10.32 Scatter plot of y intercept by slope from the fitting of the rank-frequency plot of the tokens of a corpus of unknown authors to a power law.

Author’s plot from code below.

Warning

Discuss in class.

The code is a straightforward generalization from the previous two graphs. The various slopes and intercepts are calculated first, in a loop, and then the results are scatter-plotted:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
>>> slopes, intercepts = ([] for i in range(2))

>>> for (point, sfid) in enumerate(scifiCorpus):
...     with open(sfid, 'r') as tempFile:
...         rawText =  tempFile.read()
...     textFD = FreqDist(t for t in word_tokenize(rawText))
...     Y = textFD.values()
...     textFit = Fit(Y, discrete=True)
...     slope = textFit.power_law.alpha
...     slopes.append(slope)
...     intercept = textFit.power_law.xmin
...     intercepts.append(intercept)

>>> plt.figure(figsize=(5,5))
>>> X, Y = slopes, intercepts
>>> plt.scatter(X, Y)
>>> offset = (plt.gca().get_xticks()[1] - plt.gca().get_xticks()[0]) / 10
>>> for (pt, sfid) in enumerate(scifiCorpus):
...     if sfid != '21279': xytextOffset = [X[pt]+offset, Y[pt]]
...     else:               xytextOffset = [X[pt]-offset*3, Y[pt]]
...     plt.annotate(str(pt),
...                  xy=(X[pt], Y[pt]),
...                  xytext=(xytextOffset))
>>> plt.xlabel('Slope')
>>> plt.ylabel('Intercept')
>>> plt.grid()
>>> plt.show()

Since the label for text 0 (21279) prints on top of the label for text 2 (28644) with the right-side offset, I had to offset it to the left, as evidenced by the highlighting of lines 19-20.

more See powerlaw.

10.7.4. How to measure words per sentence

As your first feature, measure the number of words per sentence by first tokenizing the text into sentences and then, for each sentence, tokenizing its words and counting them. Save the results to a FreqDist. Try it out on a single text:

1
2
3
4
5
6
7
>>> sfid = '28554'
>>> with open(sfid, 'r') as tempFile:
...     rawText = tempFile.read()
>>> sentences = sent_tokenize(rawText)
>>> sLenFD = FreqDist(len(word_tokenize(s)) for s in sentences)
>>> X = [k for (k, v) in sLenFD.items()]
>>> Y = [v for (k, v) in sLenFD.items()]

Now plot the results by hand:

1
2
3
4
5
6
7
>>> plt.figure(figsize=(5,5))
>>> plt.plot(X, Y, label=sfid)
>>> plt.xlabel('Sentence length')
>>> plt.ylabel('Frequency')
>>> plt.legend(fontsize='small')
>>> plt.grid()
>>> plt.show()
_images/scifi1SentLenFreq.png

Fig. 10.33 Frequency of sentence lengths in a text of unknown authorship.

Author’s pyplot plot from code above.

Now do this for all of them:

_images/scifiSentLenFreq.png

Fig. 10.34 Frequency of sentence lengths in a corpus of unknown authorship.

Author’s pyplot plot from code below.

Warning

Discuss in class.

Here is the code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
>>> plt.figure(figsize=(13.5,5.5))
>>> for (pane, sfid) in enumerate(scifiCorpus):
...     with open(sfid, 'r') as tempFile:
...         rawText =  tempFile.read()
...     sentences = sent_tokenize(rawText)
...     sLenFD = FreqDist(len(word_tokenize(s)) for s in sentences)
...     X = [k for (k, v) in sLenFD.items()]
...     Y = [v for (k, v) in sLenFD.items()]
...     plt.subplot(2, 4, pane+1)
...     plt.plot(X, Y, label='{}: {}'.format(pane, sfid))
...     if pane in [4,5,6,7]: plt.xlabel('Sentence length')
...     if pane in [0,4]:     plt.ylabel('Frequency')
...     plt.legend(fontsize='small')
...     plt.grid()
>>> plt.show()

But maybe the different lengths of the texts influence the patterns you see in these graphs. Convert the frequency distribution to an MLE (maximum likelihood estimate) probability distribution to control for this potential confounding factor. First try it on the sample text:

_images/scifi1SentLenFreqProb.png

Fig. 10.35 Frequency and probability of sentence lengths in a single text of unknown authorship.

Author’s pyplot plot from code below.
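Under the hood, MLEProbDist simply divides each observed frequency by the total number of samples. A toy illustration with a hypothetical three-token text (output shown as Python 2.7 prints it):

>>> from nltk.probability import FreqDist, MLEProbDist
>>> toyFD = FreqDist(['the', 'the', 'wub'])
>>> toyPD = MLEProbDist(toyFD)
>>> toyPD.prob('the')
0.6666666666666666
>>> toyPD.prob('wub')
0.3333333333333333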

As before, the data is calculated, and then it is plotted by looping over the data subsets:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
>>> sfid = '28554'
>>> with open(sfid, 'r') as tempFile:
...     rawText =  tempFile.read()
>>> sentences = sent_tokenize(rawText)
>>> sLenFD = FreqDist(len(word_tokenize(s)) for s in sentences)
>>> sLenPD = MLEProbDist(sLenFD)
>>> sLenProb = [(s, sLenPD.prob(s)) for s in sLenPD.samples()]
>>> sLenProb.sort(key=lambda pair: pair[0])
>>> plotData = [sLenFD.items(), sLenProb]
>>> yaxisLabels = ['Frequency', 'Probability']

>>> plt.figure()
>>> for (pane, data) in enumerate(plotData):
...     plt.subplot(2, 2, pane+1)
...     X = [k for (k, v) in data]
...     Y = [v for (k, v) in data]
...     plt.plot(X, Y, label=sfid)
...     plt.xlabel('Sentence length')
...     plt.ylabel(yaxisLabels[pane])
...     plt.legend(fontsize='small')
...     plt.grid()
>>> plt.tight_layout()
>>> plt.show()

Now do this to the entire corpus:

_images/scifiSentLenProb.png

Fig. 10.36 Probability of sentence lengths in a corpus of unknown authorship, with axes set to same magnitudes.

Author’s pyplot plot from code below.

Warning

Discuss in class.

Both the data calculation and the plotting must be done in the same loop:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
>>> plt.figure(figsize=(13.5,5.5))
>>> for (pane, sfid) in enumerate(scifiCorpus):
...     with open(sfid, 'r') as tempFile:
...         rawText =  tempFile.read()
...     sentences = sent_tokenize(rawText)
...     sLenFD = FreqDist(len(word_tokenize(s)) for s in sentences)
...     sLenPD = MLEProbDist(sLenFD)
...     sLenProb = [(s, sLenPD.prob(s)) for s in sLenPD.samples()]
...     sLenProb.sort(key=lambda pair: pair[0])
...     X = [k for (k, v) in sLenProb]
...     Y = [v for (k, v) in sLenProb]
...     plt.subplot(2, 4, pane+1)
...     plt.plot(X, Y, label='{}: {}'.format(pane, sfid))
...     plt.xlim([0,90])
...     plt.ylim([0,0.14])
...     if pane in [0,1,2,3]: plt.tick_params(labelbottom='off')
...     if pane not in [0,4]: plt.tick_params(labelleft='off')
...     if pane in [4,5,6,7]: plt.xlabel('Sentence length')
...     if pane in [0,4]:     plt.ylabel('Probability')
...     plt.legend(fontsize='small')
...     plt.grid()
>>> plt.tight_layout()
>>> plt.show()

If there is anything new, it is that in lines 14 and 15 I set all the X axes and all the Y axes to the same limits, in order to facilitate comparison between the texts.

10.8. Summary

10.9. References


10.10. Powerpoint and podcast


Last edited: Dec 05, 2016