# 9. Basic natural language processing¶

In this chapter, you will learn the basic techniques for natural language processing, using modules drawn mainly from NLTK and from Pattern. We are going to follow the text processing work-flow laid out in the figure below:

The general idea is to proceed through linguistic units by size, from the smallest, the individual characters in a string, to the largest, an entire text or discourse. Two approaches to this work-flow can be distinguished.

One is based roughly on using regular expression pattern matching as the main sort of analysis. This is the top path, the one that ends at the star labeled “regex grammar”. It encompasses what can be called ‘practical NLP’, or perhaps more technically “information retrieval”. Almost all of it can be performed with the Pattern package.

The other approach, found by tracing the bottom path, is more general and recapitulates how theoretical linguists conceptualize the problem, which I have called “text understanding” for lack of a better term. It ends at the star labeled, in a way that reflects my own biases, “real grammar”. While it may be the object of cutting-edge research, it has not produced any polished software that you can use right out of the box.

Since you will be using Pattern a lot, go ahead and install it with pip:

```
$ pip install pattern
```

See Pattern.

Note: The script with the code for this chapter is nlp9.py, which you can download with codeDowner(); see Practice 1, question 2.

## 9.1. Tokenization, again¶

The first step, that of tokenization, you are already familiar with from Tokenization. It is the gateway to everything else.

### 9.1.1. Tokenization in NLTK¶

As you know, NLTK has a tokenization module:

```python
>>> import nltk
>>> help(nltk.tokenize)
```

You have already learned how to tokenize a string into words. NLTK can also tokenize a string into sentences, using the Punkt sentence tokenizer:

```python
>>> with open('Wub.txt','r') as tempFile:
...     rawText = tempFile.read()
>>> from nltk.tokenize import sent_tokenize
>>> sentences = sent_tokenize(rawText)
```

See NLTK Tokenizer Package for the basics of tokenization. See the nltk.tokenize package in the NLTK documentation for more information.

### 9.1.2. How to tokenize with pattern.en.parse()¶

Pattern also begins with tokenization, through its parse() method:

```python
>>> import pattern
>>> from pattern.en import parse
>>> help(parse)
>>> sentence = 'I tickled a girl with a feather.'
```

Since parse() does almost the entire information-retrieval work-flow, it has many options. Their default setting is True:

```python
parse(string,
      tokenize = True,     # Split punctuation marks from words?
      tags = True,         # Parse part-of-speech tags? (NN, JJ, ...)
      chunks = True,       # Parse chunks? (NP, VP, PNP, ...)
      relations = True,    # Parse chunk relations? (-SBJ, -OBJ, ...)
      lemmata = True,      # Parse lemmata? (ate => eat)
      encoding = 'utf-8',  # Input string encoding.
      tagset = None)       # Penn Treebank II (default) or UNIVERSAL.
```

To illustrate how to tokenize, turn off all of the options other than tokenize by setting them to False:

```python
>>> parse(sentence, tags=False, chunks=False, relations=False, lemmata=False)
u'I tickled a girl with a feather .'
```

The output is a Unicode string with every token separated by spaces.
Unfortunately, this is almost indistinguishable from the input string. The only way to know that parse() did anything is to take the next step in the work-flow and tag the output. You will learn about part-of-speech tagging below in Part-of-speech (POS) tagging, so this is just an appetizer:

```python
>>> parse(sentence, tags=True, chunks=False, relations=False, lemmata=False)
u'I/PRP tickled/VBD a/DT girl/NN with/IN a/DT feather/NN ./.'
```

In this format, the part-of-speech tag is appended to its token with a forward slash. The crucial difference is that the period is now tagged with its part-of-speech tag, a period. Thus you can conclude that the input string was actually tokenized.

## 9.2. NLTK methods for simple text processing¶

One of the reasons for using NLTK is that it relieves us of much of the effort of making a raw text amenable to computational analysis. It does so by including a module of corpus readers, which pre-process files for certain tasks or formats. Most of them are specialized for particular corpora, so we will start with the basic one, called the PlaintextCorpusReader.

### 9.2.1. How to pre-process a text with the PlaintextCorpusReader¶

To try the PlaintextCorpusReader out, import it with from nltk.corpus import PlaintextCorpusReader. It needs to know two things: where your file is and what its name is. If the current working directory is where the file is, the location argument can be left 'blank' by using the null string ''. We only have one file, 'Wub.txt'. It will also prevent problems down the line to give the method an optional third argument that relays its encoding, encoding='utf-8'. Plug these three strings into the argument slots of PlaintextCorpusReader() and send the output to a variable. Now let NLTK tokenize the text into words and punctuation with words().
Check how many there are and display the first fifty of them, as in the following snippet of code:

```python
>>> from nltk.corpus import PlaintextCorpusReader
>>> wubReader = PlaintextCorpusReader('', 'Wub.txt', encoding='utf-8')
>>> wubWords = wubReader.words()
>>> len(wubWords)  # should be 3693
>>> wubWords[:50]
[u'Produced', u'by', u'Greg', u'Weeks', u',', u'Stephen', u'Blundell', u'and',
u'the', u'Online', u'Distributed', u'Proofreading', u'Team', u'at', u'http',
u'://', u'www', u'.', u'pgdp', u'.', u'net', u'[', u'Illustration', u':', u'_',
u'"', u'The', u'wub', u',', u'sir', u',"', u'Peterson', u'said', u'.', u'"',
u'It', u'spoke', u'!"', u'_', u']', u'BEYOND', u'LIES', u'THE', u'WUB', u'By',
u'PHILIP', u'K', u'.', u'DICK', u'_The']
```

As you can see, the list of tokenized strings is in Unicode.

Note: The general syntax of PlaintextCorpusReader is:

```python
PlaintextCorpusReader(path_to_your_file_which_can_be_empty, file_name_with_extension, encoding='your_encoding_usually_utf8')
```

#### 9.2.1.1. The PlaintextCorpusReader methods¶

Besides words(), there are several other methods revealed by PlaintextCorpusReader that you may find useful. They are exemplified below:

```python
>>> wubReader.raw()[:50]
>>> wubReader.sents()[:2]
>>> wubReader.fileids()
>>> wubReader.abspath('Wub.txt')
>>> wubReader.root
>>> wubReader.encoding('Wub.txt')
>>> wubReader.readme()
```

raw() returns the single string from which the file was read. sents() tokenizes the string into a list of lists of strings, each of which is a sentence, as far as the algorithm can tell. fileids() returns the file(s) that the reader is reading. abspath() returns a FileSystemPathPointer to a given file, which is its path. root returns a FileSystemPathPointer to the current working directory. encoding() returns the encoding of the file being read. Finally, readme() returns the readme file associated with the file being read via the current working directory, which unfortunately our text does not have.
See the nltk.corpus.reader.plaintext module in NLTK's documentation for more information.

#### 9.2.1.2. Adding the methods of NLTK Text¶

NLTK has a sub-package called text, as shown in the diagram below. The methods of Text provide a shortcut to text analysis. To make them available, you transform a tokenized PlaintextCorpusReader object into NLTK text:

```python
>>> from nltk.text import Text
>>> text = Text(wubWords)
```

You can combine the three steps of NLTK text preparation into a single line, assuming that you have already imported PlaintextCorpusReader and Text:

```python
>>> text = Text(PlaintextCorpusReader('', 'Wub.txt', encoding='utf-8').words())
```

You will use these lines so often that you may want to pack them into a function and add it to textProc:

```python
def textLoader(fileid):
    from nltk.corpus import PlaintextCorpusReader
    from nltk.text import Text
    return Text(PlaintextCorpusReader('', fileid, encoding='utf-8').words())
```

Since it returns tokenized NLTK text, you can assign it to a variable:

```python
>>> text = textLoader('Wub.txt')
```

To get started with text analysis, you need a text converted to NLTK's text format, which you now know how to do with a single function:

```python
>>> from textProc import textLoader
>>> text = textLoader('Wub.txt')
```

Let us now finally learn something about this text. See NLTK's documentation for the text module.

### 9.2.2. The Text methods¶

#### 9.2.2.1. collocations()¶

A collocation is a group of words that occurs together frequently in a text. Text's collocations() method finds collocations of two words:

```python
>>> text.collocations()
Building collocations list
wub said; Captain Franco; Peterson said; fifty cents; Franco walked;
walked toward; French said; Franco said; Jones said; almost anything;
Peterson stared; done anything; came back; wub rose; Captain said;
Captain put; Peterson sat; wub stopped; Captain nodded; Captain watched
```

#### 9.2.2.2. common_contexts()¶

The inverse procedure is to start with two words, say wub and Captain, and find the contexts that they share:

```python
>>> text.common_contexts(['wub', 'Captain'])
Building word-context index...
the_. the_nodded the_said the_stood the_watched
```

There might not be many, depending on how long the text is.

#### 9.2.2.3. concordance()¶

It is often helpful to know the context of a word. The concordance view shows a certain number of characters before and after every occurrence of a given word:

```python
>>> text.concordance('wub')
Building index...
Displaying 25 of 56 matches:
. pgdp . net [ Illustration : _ " The wub , sir ," Peterson said . " It spoke !
d . " It spoke !" _ ] BEYOND LIES THE WUB By PHILIP K . DICK _The slovenly wub
WUB By PHILIP K . DICK _The slovenly wub might well have said : Many men talk
lked toward him . " What is it ?" The wub stood sagging , its great body settli
sat . There was silence . " It ' s a wub ," Peterson said . " I got it from a
o poked the great sloping side of the wub . " It ' s a pig ! A huge dirty pig !
it ' s a pig . The natives call it a wub ." " A huge pig . It must weigh four
rabbed a tuft of the rough hair . The wub gasped . Its eyes opened , small and
uth twitched . A tear rolled down the wub ' s cheek and splashed on the floor .
nd out ," Franco said . * * * * * The wub survived the take - off , sound aslee
Captain Franco bade his men fetch the wub upstairs so that he might perceive wh
ive what manner of beast it was . The wub grunted and wheezed , squeezing up th
es grated , pulling at the rope . The wub twisted , rubbing its skin off on the
hat is it ?" " Peterson says it ' s a wub ," Jones said . " It belongs to him .
It belongs to him ." He kicked at the wub . The wub stood up unsteadily , panti
to him ." He kicked at the wub . The wub stood up unsteadily , panting . " Wha
oing to be sick ?" They watched . The wub rolled its eyes mournfully . It gazed
terson came back with the water . The wub began to lap gratefully , splashing t
him here . I want to find out --" The wub stopped lapping and looked up at the
e Captain . " Really , Captain ," the wub said . " I suggest we talk of other m
?" Franco said . " Just now ." " The wub , sir ," Peterson said . " It spoke .
" It spoke ." They all looked at the wub . " What did it say ? What did it say
er things ." Franco walked toward the wub . He went all around it , examining i
have a look ." " Oh , goodness !" the wub cried . " Is that all you people can
, their faces blank , staring at the wub . The wub swished its tail . It belch
```

#### 9.2.2.4. similar()¶

A text can also be searched for words that have a distribution similar to that of a given word:

```python
>>> text.similar('wub')
captain men gangplank optus room table brain chest contents cook corner
door floor gleaming gun hall hurting jets kettle kitchen
```

The similar words are all nouns, which suggests that wub is one, too.

#### 9.2.2.5. generate()¶

Warning: This seems to have been removed in NLTK 3.

Finally, Text includes a method for creating a random assortment of words from the text that obeys its statistical patterns:

```python
>>> text.generate()
Building ngram index...
Produced by Greg Weeks , Stephen Blundell and the Online Distributed
Proofreading Team at http :// www . pgdp . net [ Illustration : _ " The
life essence is gone ." He dabbed at his watch . " I don ' t see him --
like a statue , standing there , his hands on his hips . Peterson was
walking along the path , his face red , leading _it_ by a string . " It
is difficult for me ." It stood , gasping , its great mouth twitched .
A tear rolled down the hall ,
```

This is just for fun, but it does give you an indication of the style of the author or the genre, and perhaps of the content.

#### 9.2.2.6. Others¶

There is a findall() method for searching for instances of a regular expression in the text, but it is so limited that it is better to use the re module. There was also a search() method for regular expressions, but it has been removed. vocab() returns the frequency distribution of the text, but there is an entire class for doing this, called FreqDist(), that was reviewed at the beginning of this chapter. I have not figured out how readability() works.

#### 9.2.2.7. dispersion_plot()¶

A picture is worth a thousand words, and Text supplies a few pictures. Text includes a quite unexpected way of understanding the distribution of a word in the text. dispersion_plot() draws a graph of where every instance of a word is found, offset from the beginning of the text:

```python
>>> text.dispersion_plot(['wub', 'Optus', 'Captain'])
```

The offset measures how far an instance of the word is from the beginning of the text, counted in words. Text also has a method plot() for viewing a frequency distribution, but we will take that up in the next chapter.

See NLTK's documentation for the Text class.

### 9.2.3. How to use these methods without Text¶

While NLTK Text provides a convenient interface to these handy routines, you may want to use them independently of the Text format. You can do so by importing the appropriate classes that reveal the routines. This section shows you how, but first we need some text to apply them to:

```python
>>> from nltk import word_tokenize
>>> tokens = word_tokenize(rawText)
```

#### 9.2.3.1. ContextIndex¶

The ContextIndex class reveals common contexts, similar words, and word similarity:

```python
>>> from nltk.text import ContextIndex
>>> help(ContextIndex)
>>> wubContext = ContextIndex(tokens)
```

The help utility explains that ContextIndex is ...

> A bidirectional index between words and their 'contexts' in a text. The context of a word is usually defined to be the words that occur in a fixed window around the word;

The exact window is not mentioned, but it appears to be the preceding and following token. One way to access the general index is through word_similarity_dict(), which calculates the similarity of every word in the text to a target word and returns a dictionary of such similarity scores. You will learn more about dictionaries in the next chapter. For the time being, I will just illustrate how to display the items of a dictionary:

```python
>>> wubWSD = wubContext.word_similarity_dict('wub')
>>> wubWSD.items()[:50]
```

items() returns a list of pairs in which the first member is the token and the second is its similarity to wub. Most of them have no similarity – a score of 0 – which is why I had to delve into the first fifty.

A second way to access the context index is through the similar_words() method, which returns a list of tokens that are similar to the target token:

```python
>>> wubContext.similar_words('wub')
```

Presumably this list is taken from the word similarity dictionary by culling out all those tokens that have a similarity score greater than some minimum.

The final way to access the context index is through the common_contexts() method, which returns an NLTK frequency distribution, or FreqDist, which is a dictionary of counts of tokens:

```python
>>> wubContext.common_contexts(['wub', 'Captain'])
```

You will learn a little more about frequency distributions below in Stop-word deletion and a lot more in the next chapter. For the time being, you can use the FreqDist items() method to list out the content of a frequency distribution:

```python
>>> wubcapCC = wubContext.common_contexts(['wub', 'Captain'])
>>> wubcapCC.items()
```

The method returns a list of pairs of the token before the target word and the token after it. These pairs are in turn paired with the number of times the context was found. If you think that a frequency distribution looks like a dictionary, you are right.
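In fact, the whole common-contexts computation can be sketched with ordinary dictionaries, assuming (as the help text suggests) that a context is just the immediately preceding and following token. This is a toy stand-in for illustration, not NLTK's implementation:

```python
from collections import Counter

def contexts(tokens, target):
    """Collect (previous, next) token pairs around each occurrence of target."""
    ctxs = Counter()
    for i, tok in enumerate(tokens):
        if tok == target and 0 < i < len(tokens) - 1:
            ctxs[(tokens[i - 1], tokens[i + 1])] += 1
    return ctxs

def shared_contexts(tokens, word1, word2):
    """Contexts found around both words, with their combined counts."""
    c1, c2 = contexts(tokens, word1), contexts(tokens, word2)
    return {ctx: c1[ctx] + c2[ctx] for ctx in c1.keys() & c2.keys()}

toks = 'the wub said . the Captain said . the wub slept .'.split()
shared_contexts(toks, 'wub', 'Captain')
# {('the', 'said'): 2}
```

Here both wub and Captain occur between the and said, so that pair is their one shared context.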
See NLTK's documentation for the ContextIndex class.

#### 9.2.3.2. ConcordanceIndex¶

The ConcordanceIndex class reveals offsets and prints a concordance:

```python
>>> from nltk.text import ConcordanceIndex
>>> help(ConcordanceIndex)
>>> wubConcord = ConcordanceIndex(tokens)
```

Printing the concordance is straightforward:

```python
>>> wubConcord.print_concordance('wub')
```

We would like to get control over this process, however, and the key to doing so is the offsets() method, which returns a list of indices of the target token:

```python
>>> wubConcord.offsets('wub')
```

To double-check that this works, use the first member of the list as an index to tokens. This should return 'wub', as should every other member. Did you try it? The line of code is:

```python
>>> wubOff = wubConcord.offsets('wub')
>>> tokens[wubOff[0]]
```

A lexical dispersion plot presumably plots the offsets, but ConcordanceIndex does not supply any method for creating one. You will give it a try in Practice 1.

See NLTK's documentation for the ConcordanceIndex class.

#### 9.2.3.3. TokenSearcher¶

The final gift of NLTK is a bit unexpected. The TokenSearcher class reveals a findall() tweaked to make regular expression matching a bit easier on a list of tokens:

```python
>>> from nltk.text import TokenSearcher
>>> help(TokenSearcher)
```

As the help explanation says,

> The tokenized string is converted to a string where tokens are marked with angle brackets – e.g., '<the><window><is><still><open>'. The regular expression passed to the findall() method is modified to treat angle brackets as non-capturing parentheses, in addition to matching the token boundaries; and to have '.' not match the angle brackets.

You can use this to make a quick and dirty concordance:

```python
>>> wubSearcher = TokenSearcher(tokens)
>>> wubSearcher.findall('<.*><.*>')
```

Do you know why the regex pattern above uses the dot instead of the alphabetic range?

See NLTK's documentation for the TokenSearcher class.

### 9.2.4. Summary¶

You should know what the following methods do.
1. PlaintextCorpusReader methods
    1. raw()
    2. words()
    3. sents()
    4. fileids()
    5. open(fileid)
    6. abspath(fileid)
    7. root
    8. encoding(fileid)
    9. readme()
2. Text methods
    1. collocations()
    2. common_contexts()
    3. concordance()
    4. similar()
    5. generate()
    6. dispersion_plot()
3. ContextIndex methods
    1. word_similarity_dict()
    2. similar_words()
    3. common_contexts()
4. ConcordanceIndex methods
    1. print_concordance()
    2. offsets()
5. TokenSearcher method
    1. findall()

Finally, there is one class that I have omitted, TextCollection, which facilitates text processing over a group of texts. Since you do not have a group of texts yet, I will save it until you do.

### 9.2.5. Practice 1¶

1. Format the items in the frequency distribution returned by ContextIndex.common_contexts() as given by Text.common_contexts(). That is to say, convert a pair such as ('the', '.') to the string 'the_.'. This means you have to loop over a list of pairs, which requires knowing how pairs are indexed. Well, they are indexed just like strings and lists. Thus the first pair in wubcapCC.items() is wubcapCC.items()[0], or (('the', '.'), 2). You want the first member of this pair, which would be wubcapCC.items()[0][0], or ('the', '.'). Now convert this to a string, for each item in the list returned by wubcapCC.items(). A pair is converted to a string in the same way that a list is converted to a string.
2. Choose one of the words that is similar to 'wub' (except for captain) and find the contexts that it has in common with 'wub'. Do this in two ways, the first with the NLTK Text methods and the second with the ContextIndex methods.
3. The offsets() method returns a list of indices of the occurrences of a target string in a list of tokens. Show how to create such a list yourself.
4. Try to reproduce print_concordance() yourself, on the first 10 tokens of wub, using a window of the 3 tokens before it and the 3 tokens after it. [Not with findall()]
5. Try to plot lexical dispersion.
6. Try to reproduce print_concordance() with findall(), on the tokens of wub, using a window of the 3 tokens before it and the 3 tokens after it.

## 9.3. Lexical phase¶

### 9.3.1. Stop-word deletion¶

What is the most frequent word in English? If you answered the or a(n), you would be right. NLTK can help to sharpen your intuitions about the frequency of words by plotting the counts of the fifty most frequent tokens in Dick's "Beyond lies the wub":

```python
>>> from nltk import word_tokenize
>>> with open('Wub.txt','r') as tempFile:
...     rawText = tempFile.read()
>>> tokens = word_tokenize(rawText)
>>> from nltk.probability import FreqDist
>>> wubFD = FreqDist(t for t in tokens)
>>> wubFD.plot(50)
```

Here is a graph of the most frequent strings from Dick's Beyond lies the wub that helps to visualize just how frequent the most frequent strings are:

Imagine how the various steps in the text-processing chain would have to apply again and again to the 150 or so tokens of the. This seems like a massive waste of resources, especially since the word is not particularly informative. Words that are very frequent but not particularly informative are called stop words in computational linguistics. There is no definitive list for English, but most start with a list drawn up by Martin Porter and organized by grammatical form here.

#### 9.3.1.1. Stop words in NLTK¶

NLTK has a list of English stop words in its corpora. The quick and dirty way to download just them is by using the command-line version of the downloader:

```python
>>> import nltk
>>> nltk.download_shell()
Downloader> d stopwords
Downloader> q
```

Line 3 downloads the stopwords to your nltk_data folder, and line 4 quits the downloader.
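As an aside, the FreqDist used for the plot above is at heart a dictionary of token counts. The core of the idea can be sketched with the standard library's Counter, a toy stand-in for NLTK's implementation:

```python
from collections import Counter

# Count each token, then ask for the most frequent ones.
toks = 'the wub said the wub is a wub'.split()
freq = Counter(toks)
freq.most_common(2)
# [('wub', 3), ('the', 2)]
```

FreqDist adds extras such as plot(), but the counting itself is just this.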
Now you can import and examine the English stopwords:

```python
>>> from nltk.corpus import stopwords
>>> stopwords.fileids()
[u'danish', u'dutch', u'english', u'finnish', u'french', u'german',
u'hungarian', u'italian', u'norwegian', u'portuguese', u'russian',
u'spanish', u'swedish', u'turkish']
>>> stopW = stopwords.words('english')
>>> len(stopW)
```

You can list out the 127 words in the English list:

```python
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your',
'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their',
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that',
'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an',
'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of',
'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through',
'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then',
'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any',
'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's',
't', 'can', 'will', 'just', 'don', 'should', 'now']
>>> stopW[:30]
```

They are all in lowercase. How would you remove them from the text? I will give you a chance to answer the question in Practice 2. Here is the plot of the token counts without the stop words, which is much more informative, though it is still biased towards punctuation and contractions: How would you get rid of the punctuation? That too you can figure out in Practice 2.
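Without giving the whole of Practice 2 away, the shape of the answer is a membership test over the token list. Here is a hedged sketch with a toy stop list of my own (NLTK's real list has 127 entries), which also drops punctuation-only tokens:

```python
import string

stop = {'the', 'a', 'is', 'of'}  # toy stop list, not NLTK's

def cleanse(tokens, stopset):
    """Drop stop words (case-insensitively) and punctuation-only tokens."""
    return [t for t in tokens
            if t.lower() not in stopset and t not in string.punctuation]

cleanse(['The', 'wub', ',', 'sir', 'is', 'a', 'pig', '.'], stop)
# ['wub', 'sir', 'pig']
```

Lower-casing each token before the test is what lets 'The' match the all-lowercase stop list.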
Here is the plot, sans punctuation:

For the time being, I just call your attention to the easing of the processing burden attendant on filtering out those tokens which, rather counter-intuitively, appear the most but contribute the least. We will return to the topic in more detail in the next chapter.

#### 9.3.1.2. Stop words in Pattern¶

As far as I can tell, there is no free-standing list of stop words in Pattern; rather, certain functions, such as count() for counting words, automatically exclude them.

### 9.3.2. Part-of-speech (POS) tagging¶

You are going to use NLTK's default part-of-speech tagger, pos_tag:

```python
>>> import nltk
>>> help(nltk.tag.pos_tag)
>>> from nltk.tag import pos_tag
>>> nltkPOS = pos_tag(data)
>>> nltkPOS[:5]
[('apple', 'NN'), ('apples', 'NNS'), ('cherry', 'VBP'), ('cherries', 'NNS'), ('love', 'VBP')]
```

For reasons that will become clear below, I want you to extract the tags from this list of pairs. Pairs are indexed just like strings and lists, so a simple list comprehension will do:

```python
>>> nltkTags = [pair[1] for pair in nltkPOS]
```

Now, I want you to print the results to the console, in columns, with the first column being the data and the second, the tags. The zip method will do this:

```python
>>> for row in zip(data, nltkTags):
...     print '\t'.join(row)
```

The output should be:

```
apple       NN
apples      NNS
cherry      VBP
cherries    NNS
love        VBP
loves       NNS
loving      VBG
loved       VBD
burn        NN
burned      VBN
burnt       NN
is          VBZ
was         VBD
were        VBD
can         MD
tall        VB
taller      VB
tallest     JJS
slowly      RB
ownership   NN
Mary        NNP
```

These tags are drawn from the Penn Treebank II tag set.
You can find the usage of a tag by invoking the help utility:

```python
>>> nltk.help.upenn_tagset('NN')
```

To save you some trouble, I have organized the tags returned for our data into Description of the tags in the data sample:

Table 9.1 Description of the tags in the data sample

| Tag | Description |
|-----|-------------|
| JJS | adjective, superlative |
| MD  | verb, modal auxiliary |
| NN  | noun, common, singular |
| NNS | noun, common, plural |
| NNP | noun, proper, singular |
| RB  | adverb |
| VB  | verb, base form |
| VBD | verb, past tense |
| VBG | verb, gerund or present participle |
| VBN | verb, past participle |
| VBP | verb, non-3rd person singular present |
| VBZ | verb, 3rd person singular present |

The full list of tags with examples can be found at the website mentioned above, Penn Treebank II tag set. However, for the upcoming discussion, you will thank me for converting this table to one that keys on part of speech or syntactic category:

Table 9.2 Treebank II tags grouped by part of speech

| Description | Tag |
|-------------|-----|
| predeterminer | PDT |
| determiner | DT |
| determiner, WH | WDT |
| determiner, possessive | PRP$ |
| cardinal number | CD |
| nouns, common | NN, NNS |
| nouns, proper | NNP, NNPS |
| pronoun, personal | PRP |
| pronoun, WH | WP |
| verbs, conjugated | VBZ, VBP, VBD |
| verb, modal auxiliary | MD |
| verb, base | VB |
| verb, gerund or present participle | VBG |
| to, infinitival | TO |
| preposition (+ some sub. conj.) | IN |
| preposition as particle | RP |
| conjunction, coordinating | CC |
| conjunction, subordinating | IN, WRB |
| there, existential | EX |
| interjection | UH |
| foreign word | FW |
| punctuation (. = .;?*) | . , : |
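Table 9.2's grouping can also be put to work in code as a plain dictionary from tag to coarse category. The category labels below are my own abridgements of the table's descriptions, a sketch rather than any official mapping:

```python
# Coarse groupings abridged from Table 9.2 (labels are my own).
TAG_GROUPS = {
    'PDT': 'determiner', 'DT': 'determiner', 'WDT': 'determiner',
    'NN': 'noun', 'NNS': 'noun', 'NNP': 'noun', 'NNPS': 'noun',
    'PRP': 'pronoun', 'WP': 'pronoun',
    'VB': 'verb', 'VBZ': 'verb', 'VBP': 'verb', 'VBD': 'verb',
    'VBG': 'verb', 'MD': 'verb',
    'IN': 'preposition', 'RP': 'particle', 'CC': 'conjunction',
    'JJ': 'adjective', 'JJR': 'adjective', 'JJS': 'adjective',
    'RB': 'adverb',
}

def coarse(tag):
    """Map a Treebank tag to a coarse part of speech; 'other' if unlisted."""
    return TAG_GROUPS.get(tag, 'other')

coarse('NNS')
# 'noun'
```

A dictionary lookup like this is handy when you want to count nouns or verbs as classes rather than as individual tags.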

#### 9.3.2.1. How to tag with pattern.en.parse()¶

You have already seen an example of POS tagging with Pattern. Here is the same line of code applied to our small example set, which is converted to a single string with the words separated by spaces, in accord with the input requirement of parse():

```python
>>> patternPOS = parse(' '.join(data), tags=True, chunks=False, relations=False, lemmata=False)
>>> patternPOS
```

parse() returns a Pattern text.TaggedString object. It needs to be split into a list, but split() actually returns a list of sentences, each of which is a list of [word, tag] pairs:

```python
>>> patternPOS.split()
```


So to extract the tags, you must iterate over the first (and only) member of the encompassing list, patternPOS.split()[0]:

```python
>>> patternTags = [pair[1] for pair in patternPOS.split()[0]]
```
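To see why the [0] index is needed, here is the nested shape that split() returns, mimicked with plain lists. The structure below is a toy stand-in for a one-sentence TaggedString, not Pattern's actual output:

```python
# One sentence = one inner list of [word, tag] pairs.
split_output = [[['apple', 'NN'], ['apples', 'NNS'], ['cherry', 'JJ']]]

toy_tags = [pair[1] for pair in split_output[0]]  # [0] selects the only sentence
# toy_tags == ['NN', 'NNS', 'JJ']
```

With more than one sentence there would be more than one inner list, and you would loop over them all.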


You add this list to the zip() call from above to display the results of both taggers in the console:

```python
>>> for row in zip(data, nltkTags, patternTags):
...     print '\t'.join(row)
...
apple       NN    NN
apples      NNS   NNS
cherry      VBP   JJ
cherries    NNS   NNS
love        VBP   VBP
loves       NNS   VBZ
loving      VBG   JJ
loved       VBD   VBD
burn        NN    NN
burned      VBN   VBD
burnt       NN    JJ
is          VBZ   VBZ
was         VBD   VBD
were        VBD   VBD
can         MD    MD
tall        VB    JJ
taller      VB    JJR
tallest     JJS   JJS
slowly      RB    RB
ownership   NN    NN
Mary        NNP   NNP
```

The rows to notice are the ones where the two taggers disagree: cherry, loves, loving, burned, burnt, tall, and taller. For all but loving, Pattern is more accurate. Loving is a draw between the two, since it can be either a gerund (noun) or a present participle (verb or adjective).
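The disagreements are easier to pull out programmatically than to spot by eye. A sketch, with toy lists standing in for the full columns above:

```python
def disagreements(words, tags_a, tags_b):
    """Rows where the two taggers assign different tags."""
    return [(w, a, b) for w, a, b in zip(words, tags_a, tags_b) if a != b]

words = ['cherry', 'cherries', 'love']
nltk_tags = ['VBP', 'NNS', 'VBP']
pattern_tags = ['JJ', 'NNS', 'VBP']
disagreements(words, nltk_tags, pattern_tags)
# [('cherry', 'VBP', 'JJ')]
```

Run on the full columns, this returns exactly the rows discussed above.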

### 9.3.3. Practice 2¶

1. Rid the Wub text of the stopwords and then plot its frequency distribution.
2. Then get rid of the punctuation and plot the frequency distribution of the remaining words.
3. Use similar_words() from ContextIndex on Beyond Lies the Wub to see whether (a) the words similar to Captain are nouns; (b) the words similar to good are adjectives; the words similar to me are pronouns; the words similar to to are prepositions; the words similar to spoke are verbs.
4. Now use print_concordance() from ConcordanceIndex on the target words from #3 to see whether they appear in the context appropriate for the parts of speech mentioned.

## 9.4. Morphological phase¶

You are probably aware of the fact that a word can have different forms. A noun like cat can occur as cats, cat's, or cats' (plural, possessive, and plural possessive). A verb like to love can also be used as loves, loved, or loving (third person singular, past tense/past participle, or present participle/gerund). An adjective like tall can be used in the comparative as taller and the superlative as tallest. These are all examples of inflection, in which a base form or lemma is inflected, here by suffixes. The process of removing inflectional morphology is known as lemmatization.
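To make the idea concrete before turning to WordNet, here is a deliberately naive suffix-stripping sketch of my own. Real lemmatizers consult a dictionary precisely because rules like these overreach (this one turns loving into lov, for instance):

```python
def naive_lemmatize(word):
    """Strip a few inflectional suffixes; a crude illustration only."""
    rules = (('ies', 'y'), ('es', 'e'), ('s', ''), ('ing', ''), ('ed', ''))
    for suffix, replacement in rules:
        # Require a reasonably long stem so 'is' and 'ness' survive.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)] + replacement
    return word

naive_lemmatize('cherries')
# 'cherry'
naive_lemmatize('loves')
# 'love'
```

The WordNet lemmatizer introduced next gets the hard cases right by looking words up rather than guessing from their endings.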

### 9.4.1. The WordNet lemmatizer¶

NLTK has a lemmatizer that uses the WordNet database, in particular WordNet's morphy function, so the WordNet database must be installed. You can install it from the command-line downloader like so:

```python
>>> nltk.download_shell()
Downloader> d wordnet
Downloader> q
```

Check the help file for WordNet’s lemmatizer to find out how to call it:

```python
>>> help(nltk.WordNetLemmatizer)
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('dogs')
>>> lemmatized = [wnl.lemmatize(word) for word in data]
```

#### 9.4.1.1. How to address rows of each list with zip()¶

Now add the lemmatized forms to the list of POS tags:

```python
>>> for row in zip(data, nltkTags, lemmatized):
...     print '\t'.join(row)
```

The output should be:

```
apple       NN    apple
apples      NNS   apple
cherry      VBP   cherry
cherries    NNS   cherry
love        VBP   love
loves       NNS   love
loving      VBG   loving
loved       VBD   loved
burn        NN    burn
burned      VBN   burned
burnt       NN    burnt
is          VBZ   is
was         VBD   wa
were        VBD   were
can         MD    can
tall        VB    tall
taller      VB    taller
tallest     JJS   tallest
slowly      RB    slowly
ownership   NN    ownership
Mary        NNP   Mary
```

The odd was → wa row arises because lemmatize() assumes its input is a noun unless told otherwise, so it strips the final s as if it were a plural ending; lemmatize('was', 'v') would return be.

### 9.4.2. Stemming¶

NLTK has at least four modules for stemming. For the time being, I just review the code for calling them over our data.

#### 9.4.2.1. The Porter stemmer¶

The Porter stemmer:

```python
>>> help(nltk.PorterStemmer)
>>> from nltk.stem.porter import PorterStemmer
>>> portStmr = PorterStemmer()
>>> portStmr.stem("loving")
>>> portStmd = [portStmr.stem(word) for word in data]
```

#### 9.4.2.2. The Lancaster stemmer¶

The Lancaster stemmer:

```python
>>> help(nltk.LancasterStemmer)
>>> from nltk.stem.lancaster import LancasterStemmer
>>> lanStmr = LancasterStemmer()
>>> lanStmr.stem("loving")
>>> lanStmd = [lanStmr.stem(word) for word in data]
```

#### 9.4.2.3. The Snowball stemmer¶

The Snowball stemmer:

```python
>>> help(nltk.SnowballStemmer)
>>> from nltk.stem.snowball import EnglishStemmer
>>> snoStmr = EnglishStemmer()
>>> snoStmr.stem("loving")
>>> snoStmd = [snoStmr.stem(word) for word in data]
```

#### 9.4.2.4. How to put it all together¶

Display all the results:

>>> for row in zip(data, tags, lemmatized, portStmd, lanStmd, snoStmd):
...     print '\t'.join(row)

The output should look like this:

apple	NN	apple	appl	appl	appl
apples	NNS	apple	appl	appl	appl
cherry	VBP	cherry	cherri	cherry	cherri
cherries	NNS	cherry	cherri	cherry	cherri
love	VBP	love	love	lov	love
loves	NNS	love	love	lov	love
loving	VBG	loving	love	lov	love
loved	VBD	loved	love	lov	love
burn	NN	burn	burn	burn	burn
burned	VBN	burned	burn	burn	burn
burnt	NN	burnt	burnt	burnt	burnt
is	VBZ	is	is	is	is
was	VBD	wa	wa	was	was
were	VBD	were	were	wer	were
can	MD	can	can	can	can
tall	VB	tall	tall	tal	tall
taller	VB	taller	taller	tal	taller
tallest	JJS	tallest	tallest	tallest	tallest
slowly	RB	slowly	slowli	slow	slowli
ownership	NN	ownership	ownership	own	ownership
Mary	NNP	Mary	Mari	mary	mari

This table would be easier to read if each column had a header. The simple-minded way to add one is to insert it directly at the beginning of each list, as below, but don’t do this:

>>> data.insert(0, 'data')
>>> tags.insert(0, 'pos')
>>> lemmatized.insert(0, 'WordNet')
>>> portStmd.insert(0, 'Porter')
>>> lanStmd.insert(0, 'Lancaster')
>>> snoStmd.insert(0, 'Snowball')

You can show off your programming chops by doing this programmatically, in a loop. First you need a list of column headers to loop over:

>>> headers = ['data', 'pos', 'WordNet', 'Porter', 'Lancaster', 'Snowball']


Now comes the tricky part. You have to insert the corresponding title from the header list into each data list. That is to say, you want to loop over the data lists and, for each one, insert the corresponding string from the header list. How can you establish a correspondence between the order of the data lists and the order of titles in the header list? Through the implicit indexation of both, which can be made explicit with enumerate():

>>> for (i, col) in enumerate([data, tags, lemmatized, portStmd, lanStmd, snoStmd]):
...     col.insert(0, headers[i])
>>> for row in zip(data, tags, lemmatized, portStmd, lanStmd, snoStmd):
...     print '\t'.join(row)

The output should now look like this:

data	pos	WordNet	Porter	Lancaster	Snowball
apple	NN	apple	appl	appl	appl
apples	NNS	apple	appl	appl	appl
cherry	VBP	cherry	cherri	cherry	cherri
cherries	NNS	cherry	cherri	cherry	cherri
love	VBP	love	love	lov	love
loves	NNS	love	love	lov	love
loving	VBG	loving	love	lov	love
loved	VBD	loved	love	lov	love
burn	NN	burn	burn	burn	burn
burned	VBN	burned	burn	burn	burn
burnt	NN	burnt	burnt	burnt	burnt
is	VBZ	is	is	is	is
was	VBD	wa	wa	was	was
were	VBD	were	were	wer	were
can	MD	can	can	can	can
tall	VB	tall	tall	tal	tall
taller	VB	taller	taller	tal	taller
tallest	JJS	tallest	tallest	tallest	tallest
slowly	RB	slowly	slowli	slow	slowli
ownership	NN	ownership	ownership	own	ownership
Mary	NNP	Mary	Mari	mary	mari
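As an aside, zip() can pair the headers with their columns just as well as enumerate(). Here is a stripped-down sketch with hypothetical two-item lists (the names headers and columns are just for illustration):

```python
# Pair each header with its column via zip(), then prepend the header.
headers = ['data', 'pos']
columns = [['apples', 'cherry'], ['NNS', 'VBP']]
for header, col in zip(headers, columns):
    col.insert(0, header)
print(columns)  # [['data', 'apples', 'cherry'], ['pos', 'NNS', 'VBP']]
```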

### 9.4.3. How to express morphology as feature structures¶

You may have noticed that there are many morphological facts about English that can’t be expressed by means of the POS tags at our disposal. Here is a sample, where bold face marks a morphological relationship:

1. **Einstein himself** couldn’t answer that question.
2. **several beans** vs. ?? **several rice(s)**
3. **I am** writing.
4. I **am writing**.

The first example shows agreement in grammatical gender between a proper noun and a pronoun. The second shows that a quantifier like several modifies a plural count noun like beans and not a singular or plural mass noun like rice. The third demonstrates agreement in grammatical person between the pronoun I and the form of to be, am. And the final one shows that the auxiliary usage of to be takes the present participle to form the progressive aspect.

A knee-jerk reaction is to propose expanding the set of POS tags to include this more nuanced information, but unfortunately the Penn Treebank tags are set in stone as the basis of many popular parsing algorithms, so that route must be abandoned.

The more general solution is to propose a more elaborate notation for representing grammatical information, called a feature structure or attribute-value matrix.
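To make the idea concrete, here is a minimal sketch of feature structures as plain Python dictionaries, with a toy unification function. This is only an illustration of the idea; NLTK ships a far more complete implementation in its nltk.featstruct module:

```python
# A feature structure (attribute-value matrix) as a plain dictionary,
# with a toy unification function: merge two structures, failing if
# any shared feature has conflicting values.

def unify(fs1, fs2):
    """Merge two feature structures; return None if any feature clashes."""
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature in result and result[feature] != value:
            return None  # clash, e.g. singular vs. plural
        result[feature] = value
    return result

# 'am' demands a first-person singular subject; 'I' supplies exactly that.
am = {'PER': 1, 'NUM': 'sg'}
I = {'PER': 1, 'NUM': 'sg'}
print(unify(am, I))                       # succeeds: {'PER': 1, 'NUM': 'sg'}

# 'were' wants a plural subject, so unification with 'I' fails.
were = {'NUM': 'pl'}
print(unify(I, were))                     # clash: None
```

The success and failure of unification model the agreement facts in examples (3) and (4) above: am unifies with I, while were does not.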

## 9.5. Syntactic phase¶

### 9.5.1. How to extract chunks¶

This is a listing of something I extracted from Beyond Lies the Wub. Can you identify it?

the first mate
a good bargain
A few flies
A huge dirty
A huge pig
the rough hair
the smooth chrome
the real question
a low voice
some basic issues
a long time
the semantic warehouse
the frozen food
A few odds
A nice apartment
Some Martian birds
any lasting contact
the next month
An unfortunate spoilage
a common myth
the empty universe
a temporary period
a brief journey
The only one
the green peas
the thick slab
the greatest things
a living creature

I hope you noticed that they are all sequences of determiner adjective noun, which go together to make up a noun phrase or NP in English. How do you think I extracted them?

I could have used token-based findall() as described in TokenSearcher, but in this section you are going to use a somewhat more general approach, a tag-based regular-expression matcher that implements the notion of chunking in natural language processing. As a technical term, chunking is the composition of tagged strings into larger units called chunks, which sometimes correspond to syntactic phrases. NLTK has a package for chunking that you are going to get to know in this section.

In particular, you are going to use the regexp module, so go ahead and import the two methods that you will need:

>>> from nltk.chunk.regexp import ChunkRule, RegexpChunkParser


The first step is to devise a label for the chunk. This one is easy:

>>> chunkLabel = 'NP'


This module implements a regular-expression-like language for creating tag patterns. What tags correspond to the pattern that I matched above, determiner adjective noun? You are welcome to check Treebank II tags grouped by part of speech before answering this question. And then recall that in the token/tag language, items are surrounded by angled brackets, <>.

The minimal answer is <DT> <JJ> <NN>, but the latter two have alternative forms. These can be incorporated regex-ly, as <DT> <JJ.*> <NN.*>, which can be assigned to a pattern:

>>> chunkPattern = '<DT> <JJ.*> <NN.*>'
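As an aside, if the tag-pattern notation is unfamiliar, its effect can be approximated with an ordinary regular expression over a string of hypothetical <word/TAG> tokens. This is only an illustration of the idea, not how NLTK’s chunker works internally:

```python
import re

# A toy tagged sentence in a made-up <word/TAG> format.
tagged = '<I/PRP> <tickled/VBD> <a/DT> <nice/JJ> <girl/NN> <with/IN> <a/DT> <feather/NN>'

# Mimic the tag pattern '<DT> <JJ.*> <NN.*>' with a plain regex:
# a determiner, then any adjective tag (JJ, JJR, JJS), then any noun tag.
pattern = r'<(\w+)/DT> <(\w+)/JJ\w*> <(\w+)/NN\w*>'
print(re.findall(pattern, tagged))  # [('a', 'nice', 'girl')]
```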


You now call a method that rolls the pattern up into a rule. It also takes a string for storing a description, which I will just fill in with the label assigned above:

>>> Chunker = ChunkRule(chunkPattern, chunkLabel)


And you initialize a parser that uses this rule:

>>> chunkParser = RegexpChunkParser([Chunker], chunk_label=chunkLabel)


This label will be used to mark the matching chunks in the parser’s output.

Now you have to pause to think. Chunking only makes sense when applied to sentences, so you need the text tokenized into sentences and not words. But NLTK’s sentence tokenizer leaves each sentence untouched, formatted as a string, so each sentence has to be tokenized into words. Then the words need to be tagged with their part of speech. Then the tagged words can be chunked into noun phrases. Then the user should be notified of the result, but the tricky part is that a sentence can have no NP chunks, one such chunk, or more than one. So a sentence itself has to be scanned for the proper chunks. Let us call the temporary chunks subtrees. Fortunately, they will be labeled as NP, so only those need be reported. The code looks like this:

>>> from nltk import pos_tag, word_tokenize
>>> for s in sentences:
...     tokenizedS = word_tokenize(s)
...     taggedS = pos_tag(tokenizedS)
...     chunkedS = chunkParser.parse(taggedS)
...     for subtree in chunkedS.subtrees():
...         if subtree.label() == chunkLabel:
...             print(subtree)

#### 9.5.1.1. Detour: how to write functions with a flag and a docstring¶

It is helpful to print the output of the chunker to see what it does and debug it, but you may prefer to collect it into a list for further processing. I am sure you recall that “collecting into a list” means appending in a loop. Here is one possibility:

>>> subtrees = []
>>> for s in sentences:
...     tokenizedS = word_tokenize(s)
...     taggedS = pos_tag(tokenizedS)
...     chunkedS = chunkParser.parse(taggedS)
...     for subtree in chunkedS.subtrees():
...         if subtree.label() == chunkLabel:
...             subtrees.append(subtree)

In the upcoming sections, as well as Practice 4, you will use both of these blocks of code over and over, so it will save you a lot of keystrokes to consolidate them into a function. But to write a single function for both blocks of code means that your function will have to alternate between two modes of behavior, printing the results or saving them to a list. The way to tell the function to do this is to use an argument just for this purpose, which is usually known as a flag. As usual, try to figure this out inductively, from this example, which you can paste into textProc.py:

def nltkChunker(chunkLabel, chunkPattern, sentences, out=1):
    """Chunks a list of sentences from a label & a pattern, printing or listing results."""
    from nltk import pos_tag, word_tokenize
    from nltk.chunk.regexp import ChunkRule, RegexpChunkParser
    Chunker = ChunkRule(chunkPattern, chunkLabel)
    chunkParser = RegexpChunkParser([Chunker], chunk_label=chunkLabel)
    subtrees = []
    for s in sentences:
        tokenizedS = word_tokenize(s)
        taggedS = pos_tag(tokenizedS)
        chunkedS = chunkParser.parse(taggedS)
        for subtree in chunkedS.subtrees():
            if subtree.label() == chunkLabel:
                if out == 1:
                    print(subtree)
                else:
                    subtrees.append(subtree)
    if out != 1:
        return subtrees

The function nltkChunker() is defined with four arguments. The first three are the data needed to make it work: a label, a pattern and a list of sentences. The fourth is the flag, out, which controls the mode of output. It has a default value of 1, set right there in the definition. As you read down the lines, you will see that 1 triggers the condition for printing the results. You would change it to something else, say 0, to switch the conditions to saving the results to a list that is returned to the user.

The body of the function just repeats the code from above, with a docstring for orientation.
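The flag pattern itself has nothing to do with NLTK. Here is a stripped-down sketch of the same print-or-collect behavior (the name collect_or_print is hypothetical, chosen just for this illustration):

```python
def collect_or_print(items, out=1):
    """Print each item (out=1, the default) or return them all as a list."""
    results = []
    for item in items:
        if out == 1:
            print(item)       # printing mode
        else:
            results.append(item)  # collecting mode
    if out != 1:
        return results

# Default flag value: prints each item, returns nothing.
collect_or_print(['a', 'b'])
# Any other flag value: collects silently and returns the list.
collect_or_print(['a', 'b'], out=0)  # ['a', 'b']
```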

If you just modified textProc.py, go ahead and reload it just to make sure that the changes took:

>>> import textProc
>>> reload(textProc)
>>> from textProc import nltkChunker
>>> help(nltkChunker)

You should be good to go for the rest of the text on chunking.

### 9.5.2. How to extract named entities¶

Can you guess what I was aiming for, and how it was done, from this listing?

Greg Weeks
Stephen Blundell
Online Distributed
Proofreading Team
BEYOND LIES
THE WUB
PHILIP K.
DICK _The
Captain Franco
* *
Captain Franco
Captain Franco
Captain Franco
* *
* *
Captain Franco
Captain Franco
* *
Captain Franco
_Planet Stories_
Project Gutenberg
Philip Kindred

If you guessed full names, you are right. The tag regex is '<NNP> <NNP>'. So how does a word come to be tagged NNP as a proper noun? Mainly because it begins with a capital letter. You can check the tag’s description with:

>>> nltk.help.upenn_tagset('NNP')
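To see how much work capitalization alone does, here is a hypothetical sketch that pairs adjacent capitalized tokens with plain Python, roughly mimicking the '<NNP> <NNP>' pattern (the token list is invented for illustration):

```python
# Pair each token with its successor; keep pairs where both begin
# with a capital letter, a rough stand-in for '<NNP> <NNP>'.
tokens = ['Captain', 'Franco', 'looked', 'at', 'Philip', 'K.', 'Dick']
pairs = [(a, b) for a, b in zip(tokens, tokens[1:])
         if a[0].isupper() and b[0].isupper()]
print(pairs)  # [('Captain', 'Franco'), ('Philip', 'K.'), ('K.', 'Dick')]
```

Note that overlapping pairs like ('Philip', 'K.') and ('K.', 'Dick') both survive here, whereas the chunker consumes its matches left to right.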


What about the pairs of asterisks?

NLTK’s named-entity chunker needs to be trained, cf. http://www.nltk.org/api/nltk.chunk.html#module-nltk.chunk.named_entity

### 9.5.3. How to extract compound nouns¶

I happened to read Bob Dylan wins 2016 Nobel Prize in literature while I was writing this section. (OK, I was looking for examples.) From the text below I produced the short listing that follows it:

Dylan had been mentioned in the Nobel speculation for years, but few experts expected the academy to extend the prestigious award to a genre such as pop music. The literature award was the last of this year's Nobel Prizes to be announced. The six awards will be handed out on Dec. 10, the anniversary of prize founder Alfred Nobel's death in 1896.

pop music
literature award
prize founder

What is going on?

If you guessed compound nouns, pat yourself on the back.
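Presumably the tag pattern was a sequence of two common nouns, something along the lines of '<NN.*> <NN.*>'. As a hypothetical illustration with a small hand-tagged sample (not the chunker itself):

```python
# Extract noun-noun compounds from a toy list of (word, tag) pairs by
# keeping adjacent pairs whose tags both start with 'NN'.
tagged = [('pop', 'NN'), ('music', 'NN'), ('was', 'VBD'),
          ('the', 'DT'), ('literature', 'NN'), ('award', 'NN')]
compounds = [(w1, w2) for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
             if t1.startswith('NN') and t2.startswith('NN')]
print(compounds)  # [('pop', 'music'), ('literature', 'award')]
```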

### 9.5.4. How to split at chinks¶

Chinking is the complement of chunking: instead of stating what goes into a chunk, a chink rule states what must be excluded from one. NLTK provides ChinkRule for this purpose:

>>> from nltk.chunk.regexp import ChinkRule
>>> chink_rule = ChinkRule('<VBD|IN|.>', 'Chink on verbs/prepositions')

### 9.5.5. Parsing¶

Although chunking is good for analyzing specific phrases, sometimes this is not enough. Imagine the real-world situations described by these sentences:

1. I tickled a cat with a feather.
2. I tickled a hat with a feather.
3. I tickled a girl with a feather.

For (1) you probably imagined that I held a feather with which I tickled a cat. For (2), you probably tried to imagine me tickling a hat that was decorated with a feather. For (3), well, what did you imagine? I could have (3a) held a feather with which I tickled a girl or (3b) tickled a girl who was herself holding a feather.

In linguistics, a sentence like (3) that can describe two different situations in the real world is said to be ambiguous. The particular ambiguity of (3) depends on how the prepositional phrase with a feather is interpreted, as an adverbial phrase in the (a) reading or as an adjectival phrase in the (b) reading. This contrast can be represented diagrammatically in what are known as phrase structure trees.

Fig. 9.5 is an example of a phrase structure tree. It can be read from top to bottom or bottom to top, but the former makes slightly more sense. The S is the start symbol, but it is usually read as “sentence”. A sentence consists of a noun phrase, NP, and a verb phrase, VP. The NP in this sentence is a single pronoun, I. The VP consists of the verb, V, tickled, another NP, a girl, and the prepositional phrase (PP) in question, with a feather. The PP’s attachment under VP indicates that it should be interpreted as something that modifies the verb phrase, such as describing how the action was performed.

Fig. 9.6 differs from the previous tree by the point of attachment of a single line. The PP is attached under NP, indicating that it should be interpreted as something that modifies the noun phrase, such as identifying who the person is.

The tree diagrams should help you make your intuitive knowledge about the meaning of this sentence explicit. The PP has two possible points of attachment along the right edge of the tree, to VP and to NP, which means the sentence should be interpretable in two ways, as an adverbial and as an adjectival. The other two sentences also have these double interpretations, since they have the same tree structure, but because the real-world situations are different, one or the other interpretation is less plausible.

The process of deriving an entire tree structure for a sentence is known as parsing the sentence. It relies on constructing a list of rules known as a grammar. A rule takes the form of X -> Y, read as “X rewrites as Y”, where the arrow corresponds to a line between nodes of a phrase-structure tree. The trees discussed above can be reformulated into the following grammar:

S -> NP VP
NP -> 'I' | Det N | Det N PP
VP -> V NP | V NP PP
PP -> P NP
V -> 'tickled'
Det -> 'a'
N -> 'girl' | 'feather'
P -> 'with'

The pipe | is used for disjunction, one or the other, just as in regular expressions. The words that make up the bottom-most or terminal nodes are given as strings, because you are going to use them as such in your code.

NLTK has resources for parsing from grammars designed by hand like this one. The first step is to encode the grammar as a string with the CFG.fromstring() method:

>>> sentence = 'I tickled a girl with a feather'
>>> from nltk.grammar import CFG
>>> psGrammar = CFG.fromstring("""
... S -> NP VP
... NP -> 'I' | Det N | Det N PP
... VP -> V NP | V NP PP
... PP -> P NP
... V -> 'tickled'
... Det -> 'a'
... N -> 'girl' | 'feather'
... P -> 'with'
... """)

There are a few methods for checking what you did:

>>> psGrammar.start()
>>> psGrammar.productions()

Now initialize a recursive descent parser with the grammar and use it to parse the sentence:

>>> from nltk.parse import RecursiveDescentParser
>>> rdParser = RecursiveDescentParser(psGrammar)
>>> rdParser.parse(sentence.split())
>>> for t in rdParser.parse(sentence.split()):
...     print t

The output should be:

(S (NP I) (VP (V tickled) (NP (Det a) (N girl) (PP (P with) (NP (Det a) (N feather))))))
(S (NP I) (VP (V tickled) (NP (Det a) (N girl)) (PP (P with) (NP (Det a) (N feather)))))

A chart parser works the same way:

>>> from nltk.parse.chart import BottomUpChartParser
>>> bucParser = BottomUpChartParser(psGrammar)
>>> bucParser.parse(sentence.split())
>>> for t in bucParser.parse(sentence.split()):
...     print t

From the documentation, NLTK defines the following chart parsers:

1. BottomUpChartParser
2. BottomUpLeftCornerChartParser
3. LeftCornerChartParser
4. SteppingChartParser
5. TopDownChartParser

#### 9.5.5.1. Dependency parsing¶

An alternative to parsing a sentence into phrases that contain words is to parse it into relations among words. Called dependency parsing, the idea is to state a grammar in terms of heads and their dependents. The tensed verb is considered to be the principal head. It enters into the subject relationship with the subject and the object relationship with the direct object, plus a modifier relationship with an adverbial prepositional phrase. A noun enters into a modificational relationship with its determiner and with any adjectival prepositional phrase. As a first approximation to a grammar based on these relations, I offer the following list of rules, in which A -> B can be read as “A is related to B”, without making the relationship explicit:

'tickled' -> 'I' | 'girl' | 'with'
'girl' -> 'a' | 'with'
'with' -> 'feather'
'feather' -> 'a'

These relations can be compiled into a dependency grammar as below:

>>> from nltk.grammar import DependencyGrammar
>>> dGrammar = DependencyGrammar.fromstring("""
... 'tickled' -> 'I' | 'girl' | 'with'
... 'girl' -> 'a' | 'with'
... 'with' -> 'feather'
... 'feather' -> 'a'
... """)
>>> print(dGrammar)

As before, you import a parser class and initialize it with the grammar. With the class instance in hand, you parse the sentence, which returns a generator object, which can be put into a loop and printed to display the results of the parse:

>>> from nltk.parse import ProjectiveDependencyParser
>>> pdParser = ProjectiveDependencyParser(dGrammar)
>>> pdParser.parse(sentence.split())
>>> for t in pdParser.parse(sentence.split()):
...     print t

### 9.5.7. Practice 4¶

1. Modify the chunking code in How to extract chunks

1. to include possessive determiners (pronouns).
2. to include WH determiners.
3. to make determiners optional.
4. to include an optional predeterminer.
5. to include more than one adjective.
6. to fit in an optional cardinal number.
7. so that the whole ball of spaghetti that you just designed alternates with a personal pronoun.
2. Modify the chunking code in How to extract chunks to produce a list of chunks, rather than printing them to the console.

3. The Treebank II tags split prepositions between IN and RP. Test this difference by chunking each tag following a verb and then preceding a noun phrase.

4. Chunk sequences of infinitival to followed by a base form verb.

5. Chunk sequences of modal auxiliary followed by a base form verb. Stick an optional adverb between them.

6. Is there a way to chunk progressive be + -ing or perfect have + -ed?

7. Modify the full-name pattern to match Philip K. Dick’s full name.

## 9.8. Powerpoint and podcast¶

Last edited November 21, 2016