7. Lists and tokenization

Note

The code script for this chapter is nlp7.py, which you can download with codeDowner(), see Practice 1, question 2.

7.1. Computation with lists

In working with findall(), you have seen many instances of a collection of strings held within square brackets, such as the one below:

>>> S = '''This above all: to thine own self be true,
... And it must follow, as the night the day,
... Thou canst not then be false to any man.'''
>>> import re
>>> re.findall(r'\b[a-zA-Z]{4}\b', S)
['This', 'self', 'true', 'must', 'Thou', 'then']

In this short chapter, you learn the main properties of these objects and how to manipulate them.

7.1.1. What is a list?

A list in Python is a sequence of objects delimited by square brackets, []. The objects are separated by commas. Consider this sentence from Shakespeare’s A Midsummer Night’s Dream represented as a list:

>>> L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',', 'but', 'with', 'the', 'mind', '.']
>>> type(L)
>>> type(L[0])

L is a list of strings. You may think that a string is itself a list of characters, and you would be correct in ordinary English, but in pythonic English, the word ‘list’ refers exclusively to a sequence of objects delimited by square brackets.

The following block of commands plays with two lists of numerical objects:

>>> i = 2
>>> type(i)
>>> I = [0,1,i,3]
>>> type(I)
>>> type(I[0])
>>> n = 2.3
>>> type(n)
>>> N = [2.0,2.1,2.2,n]
>>> type(N)
>>> type(N[0])

Warning

You should not use the word “list” as a name for a list, because it is already taken by the built-in function list().

7.1.2. Most of the string methods work just as well on lists

A list can be input to many of the methods that accept strings:

>>> len(L)
>>> sorted(L)
>>> set(L)
>>> sorted(set(L))
>>> len(sorted(set(L)))
>>> L+['!']
>>> len(L+['!'])
>>> L*2
>>> len(L*2)
>>> L.count('the')
>>> L.index('with')
>>> L.rindex('with')
>>> L[2:]
>>> L[:2]
>>> L[-2:]
>>> L[:-2]
>>> L[2:-2]
>>> L[:]
>>> L[:-1]+['!']

Every command succeeds except L.rindex('with'), which raises an AttributeError: rindex() is defined for strings but not for lists.
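As a quick check on the block above, here is a minimal sketch that runs a few of these operations as a script and records the results in comments. It assumes Python 3 (for print() as a function), and the variable names on the left are only illustrative.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

n_tokens = len(L)                # 12 elements, punctuation included
n_types = len(sorted(set(L)))    # 10 distinct elements
n_the = L.count('the')           # 2
first_with = L.index('with')     # 3, the position of the first 'with'
tail = L[-2:]                    # ['mind', '.']
exclaimed = L[:-1] + ['!']       # a new list; L itself is unchanged

print(n_tokens, n_types, n_the, first_with, tail)

# rindex() exists for strings but not for lists
try:
    L.rindex('with')
except AttributeError as err:
    print('AttributeError:', err)
```

None of these operations changes L; each one returns a new value.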

7.1.2.1. Strings and lists share the container class

Strings, lists and sets share the ability to hold objects like characters or strings and to access them or iterate over them, because they are all instances of Python’s container class. This ability is so fundamental to the processing of strings and lists that we do not ordinarily need to deal with it explicitly. The one case in which it is necessary to be mindful of it is in the use of the membership operator in. You will not see it until the next chapter, but to give you a taste, and to understand the error that a violation of containership produces, try to figure out how the following expressions work:

>>> 'P' in 'Python'
>>> 'Love' in L
>>> 'Love' in set(L)
>>> 1 in 123
Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
TypeError: argument of type 'int' is not iterable

I hope that you guessed that in returns True if the item on its left side is a member of the item on the right side. The right-hand item must be a container. Strings, lists, and sets are containers; an integer is not.
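The following sketch spells out the same behaviour as a script, assuming Python 3. The helper is_member() is hypothetical, invented here only to show how the TypeError can be caught; it is not part of Python or of this chapter's script.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

print('P' in 'Python')     # True: a string matches any substring
print('Lov' in 'Love')     # True, for the same reason
print('Love' in L)         # True: 'Love' is an element of the list
print('Lov' in L)          # False: a list only matches whole elements


def is_member(item, container):
    """Return item in container, or None if the right side is not a container."""
    try:
        return item in container
    except TypeError:
        return None


print(is_member(1, 123))          # None: an integer is not a container
print(is_member('1', str(123)))   # True once the integer is made a string
```

Note the difference between the second and fourth lines: membership in a string is a substring test, whereas membership in a list compares whole elements.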

7.1.2.2. Strings and lists share the sequence class, sets do not

Strings and lists share the count(), index() and [] operations because they are both instances of Python’s sequence class. In a nutshell, a sequence type has an index associated with every one of its elements. It is the indexation which endows the type with the ability to be counted, retrieved by position and sliced.

more See Sequence Types in Python’s on-line documentation.

These operations are so useful that you might think that all Python types would need them, but that is not so. Sets, for instance, are not indexed, so they do not permit counting (obviously, since all the duplicates are removed), retrieval by position or slicing. All of the following commands produce an error:

>>> setL = set(L)
>>> setL.count('the')
>>> setL.index('with')
>>> setL[0]

Thus sets have almost the ‘opposite’ properties of lists, being unordered and without duplicates.
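To see which error each of the three commands produces without stopping at the first one, you could wrap them as in the sketch below. It assumes Python 3 and borrows a loop and lambda from later chapters, so treat it as a preview rather than something you are expected to write yet.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']
setL = set(L)

errors = []
for attempt in (lambda: setL.count('the'),
                lambda: setL.index('with'),
                lambda: setL[0]):
    try:
        attempt()
    except (AttributeError, TypeError) as err:
        errors.append(type(err).__name__)

print(errors)    # ['AttributeError', 'AttributeError', 'TypeError']

# To retrieve by position, convert the set back to a list first
ordered = sorted(setL)
print(ordered[0])    # ',' sorts before the other elements
```

The first two fail because a set has no such methods; the third fails because a set cannot be subscripted at all.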

7.1.2.3. How to convert between strings and lists

The block of code below shows how to convert a string into a list:

>>> S1 = 'William Shakespeare'
>>> S2 = 'William_Shakespeare'
>>> S3 = 'William'
>>> u = '_'
>>> S1.split()
>>> S2.split()
>>> S2.split('_')
>>> S2.split(u)
>>> list(S3)

The split() method divides the string that it is attached to into substrings at whitespace. If the string contains no whitespace, as in S2, then a separator string must be supplied at which the string can be divided. This separator can be held in a variable. If there is no separator to be found at all, as in S3, then list() will separate every character into its own element of a list.

The inverse task, to convert a list into a string, is performed by join():

>>> L1 = ['William', 'Shakespeare']
>>> u = '_'
>>> ''.join(L1)
>>> ' '.join(L1)
>>> u.join(L)

join() is attached to the string that serves as the separator. If the empty string is supplied, then the strings are concatenated without interruption. The separator can be held in a variable. The diagram below visualizes the inverse relation between splitting and joining:

_images/SplitJoin.png

Fig. 7.1 The inverse relation between splitting and joining.

Author’s diagram.
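The inverse relation can also be checked in code. The sketch below is a minimal round trip, assuming Python 3; each assert passes silently if splitting followed by joining gives back the string you started with.

```python
S1 = 'William Shakespeare'
S2 = 'William_Shakespeare'

assert S1.split() == ['William', 'Shakespeare']
assert S2.split() == ['William_Shakespeare']    # no whitespace to split at
assert S2.split('_') == ['William', 'Shakespeare']

# join() undoes split() when the same separator is used
assert '_'.join(S2.split('_')) == S2
assert ' '.join(S1.split()) == S1

# list() and ''.join() are inverses at the level of characters
assert ''.join(list('William')) == 'William'

print('all round trips succeeded')
```

One caveat: split() without an argument treats any run of whitespace as a single separator, so ' '.join(S.split()) only restores S when its words are separated by exactly one space.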

7.1.2.4. Summary of operations that strings and lists share

The following table summarizes the operations that are common to strings and lists:

==================  ===========  ===========
operation           string S     list L
==================  ===========  ===========
number of elements  len(S)       len(L)
alphabetical order  sorted(S)    sorted(L)
set formation       set(S)       set(L)
concatenation       S1+S2        L1+L2
duplication         S*2          L*2
count duplicates    S.count()    L.count()
find position       S.index()    L.index()
slice               S[]          L[]
membership          'w' in S     'not' in L
conversion          S.split()    ''.join(L)
==================  ===========  ===========

Despite the fact that strings and lists share these properties, they are still different types. Python gets angry if you try to use certain operations to mix them, like concatenation:

>>> L+S1
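A minimal sketch of the error and of two ways around it, assuming Python 3: convert one operand so that both sides of + have the same type. The names as_list and as_string are illustrative.

```python
L = ['Love', 'looks', 'not']
S1 = 'William Shakespeare'

try:
    L + S1                                # list + string is refused
except TypeError as err:
    print('TypeError:', err)

as_list = L + S1.split()                  # list + list
as_string = ' '.join(L) + ' ' + S1        # string + string

print(as_list)      # ['Love', 'looks', 'not', 'William', 'Shakespeare']
print(as_string)    # Love looks not William Shakespeare
```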

7.1.3. How strings and lists differ

7.1.3.1. How to insert and remove items from a list

Strings and lists differ in one fundamental way. Try out these methods on a fresh version of our list:

>>> L2 = 'Love looks not with the eyes, but with the mind.'.split()
>>> L2.append('Awesome!')
>>> L2
>>> L2.extend("You tell'em, Will!".split())
>>> L2
>>> L2.insert(5,'bloodshot')
>>> L2
>>> L2.remove('bloodshot')
>>> L2.insert(5,'bloodshot')
>>> L2.pop(5)
>>> L2

Did you notice the difference between append() and extend()? Try switching them:

>>> L3 = 'Love looks not with the eyes, but with the mind.'.split()
>>> L3.extend('Awesome!')
>>> L3
>>> L3.append("You tell'em, Will!".split())
>>> L3

Appending in computer science has the very specific meaning of attaching a new item to the end of a sequence. append() attaches its argument as a single new element at the end of the list, whereas extend() attaches each element of its argument as a separate new element. When that argument is a string, extend() adds one element per character.
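The four possible combinations can be compared side by side. This is a small sketch assuming Python 3; it uses a shorter sentence than the one above so that the lengths are easy to verify by eye.

```python
base = 'Love looks not'

a = base.split()
a.append('Awesome!')                      # one new element
b = base.split()
b.extend('Awesome!')                      # one new element per character
c = base.split()
c.append("You tell'em, Will!".split())    # the whole list as one nested element
d = base.split()
d.extend("You tell'em, Will!".split())    # three new elements

print(len(a), len(b), len(c), len(d))     # 4 11 4 6
print(c[-1])                              # ['You', "tell'em,", 'Will!']
```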

Do these methods look like they can apply to a string? Give them a try:

>>> S4 = ' '.join(L)
>>> S4.append('Awesome!')
>>> S4.extend("You tell'em, Will!".split())
>>> S4.insert(5,'bloodshot')
>>> S4.remove('not')
>>> S4.pop(5)

Do you know why these methods fail on a string?

Recall our discussion of mutability in Mutability. We observed that, once a string is created, its constituency cannot be altered. In Pythonese, such an object is called immutable. What does the first block of code in this section say about the mutability of a list? Well, L2 is clearly mutable; every method above changes its constituents. So here we have the principal difference between the two types: strings are immutable while lists are not. Thus if you need a data type whose elements will wax and wane over the course of your program, a list is the way to go.
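The contrast can be made concrete with a short sketch, assuming Python 3. A string refuses any change to its contents, so "changing" it really means building a new string, whereas a list keeps its identity while its contents change.

```python
S4 = 'Love looks not with the eyes'

print(hasattr(S4, 'append'))    # False: strings have no append()
try:
    S4[0] = 'D'                 # item assignment is refused as well
except TypeError as err:
    print('TypeError:', err)

S5 = S4 + ' Awesome!'           # a new string; S4 is untouched

L2 = S4.split()
id_before = id(L2)
L2.append('Awesome!')           # the same list object, now one element longer
same_object = id(L2) == id_before
print(same_object)              # True
```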

Given its mutability, not only can elements be added or subtracted from a list, but they can also be moved around. This section closes out with two methods that do so, which lead to a deeper understanding of how Python organizes its memory.

7.1.3.2. How to truncate a list with del

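Python has a delete statement, del, that makes it clear that you want to delete items from a list; cutting items off the end of a list is known technically as truncation. To have something to truncate, rebuild L3 with the two methods switched, so that extend() spreads the characters of ‘Awesome!’ over eight elements and append() then attaches the list [‘You’, “tell’em,”, ‘Will!’] as a single final element. Since that nested list is now the last element, it has to go first:

>>> L3 = 'Love looks not with the eyes, but with the mind.'.split()
>>> L3.extend('Awesome!')
>>> L3.append("You tell'em, Will!".split())
>>> del L3[-1:]
>>> L3

This can be read as “delete from L3 everything from the last element onwards”. Now go ahead and truncate the listified characters of ‘Awesome!’:

>>> del L3[-len('Awesome!'):]
>>> L3

This can be read as “delete from L3 as many elements as ‘Awesome!’ has characters, counting backwards from the end”. This should return the modified list to its original shape.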
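del is not restricted to the end of a list. The sketch below, assuming Python 3, first repeats a truncation and then deletes a single element by index and a run of elements by slice. It also shows that slicing on its own, without del, leaves the list intact.

```python
L3 = 'Love looks not with the eyes, but with the mind.'.split()
n_original = len(L3)                # 10 strings

L3.extend('Awesome!')               # adds 8 one-character elements
del L3[-len('Awesome!'):]           # truncates those 8 again
restored = len(L3) == n_original    # True

kept = L3[:-2]                      # a slice alone builds a new, shorter list
del L3[0]                           # del with an index removes one element
del L3[1:3]                         # del with a slice removes a run of elements

print(L3)    # ['looks', 'the', 'eyes,', 'but', 'with', 'the', 'mind.']
```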

7.1.3.3. How to reverse the order of a list

The first, reverse(), appears straightforward:

>>> L.reverse()
>>> L
>>> L = L[::-1]
>>> L

reverse(), it comes as no surprise, reverses the order of elements in a list. What is a surprise, though, is that it does so “in place”. That is, it rearranges the existing list rather than sending the reversal as output to a new list. Thus the original order is destroyed. It is easy enough to recover, by reversing the reversal, or by using the slice trick that you are reminded of in the third line.

But what if you wanted to use the original order at the same time? It seems logical to just assign the original list to a new list and reverse it, so that you wind up with two lists, one in the original order and one reversed. Don’t just sit there, try it out:

>>> L4 = L
>>> L4.reverse()
>>> L4
>>> L

L4 gets reversed … but … but … so does L! How did that happen? You never touched L. There must be a ghost in Python.

7.1.3.4. How to copy a list to a new memory location

Actually, what happens is that Python initially creates a representation of L in its memory, and then every new variable assigned to it just points at that spot.

_images/shallowCopy.png

Fig. 7.2 The assignment of L4 to L on the left side just makes L4 point at the memory location of L, as indicated on the right.

Author’s diagram, loosely based on the images here.

So reversing one has the effect of reversing all, since there is really only one of them in existence.

Tip

Python assigns an identification number to every object that it creates, which you can retrieve with id(). Two variables point to the same location in memory if they have the same id:

>>> id(L)
>>> id(L4)
>>> id(L) == id(L4)
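Here is the whole effect in one short script, assuming Python 3 and a shortened list. Besides comparing id() values, it uses Python's is operator, which tests directly whether two names refer to the same object.

```python
L = ['Love', 'looks', 'not']
L4 = L                  # a second name for the same list, not a second list

L4.reverse()
print(L)                # ['not', 'looks', 'Love']: L changed too

print(id(L) == id(L4))  # True
print(L4 is L)          # True, the same test written with 'is'
```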

There are two solutions to this problem of multiple variables pointing to a single memory location: using the copy method or using vacuous slicing.

The official one is to make a copy of the list that creates a new representation in memory, using the copy() method from the copy module.

_images/deepCopy.png

Fig. 7.3 The content of L on the left side is copied into a new memory location assigned to L4 on the right side.

Source same as previous.

The copy module is not part of the default invocation of Python and so must be imported, but first, recover the original order by reversing what you just did:

>>> L.reverse()
>>> from copy import copy
>>> L5 = copy(L)
>>> id(L5) == id(L)
>>> L5.reverse()
>>> L5
>>> L

The output of the last two lines should have opposite orders.

The other, sneaky way is to use the ‘vacuous’ slice operation to fool Python into creating a new representation in memory:

>>> L6 = L[:]
>>> id(L6) == id(L)
>>> L6.reverse()
>>> L6
>>> L

Again, the output of the last two lines should have opposite orders.

The copy() method is easy for someone reading your code to understand, but the slice operation is much simpler to type.
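Both techniques can be compared in a single sketch, assuming Python 3 and a shortened list. Each copy is changed in a different way, and the original survives both.

```python
from copy import copy

L = ['Love', 'looks', 'not', 'with', 'the', 'eyes']

L5 = copy(L)       # the function from the copy module
L6 = L[:]          # the 'vacuous' slice

L5.reverse()
L6.append('!')

print(L)                    # unchanged
print(L5 is L, L6 is L)     # False False: two new objects in memory
```

Both are what Python's documentation calls shallow copies: the new list is a separate object, but it holds the same elements as the original. For a list of strings that makes no practical difference, since strings cannot be altered anyway.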

Now that the magic of multiple variables pointing to the same memory location has been explained, you are prepared to recognize it when it appears with a different method.

more For more on copying, see 8.17. copy — Shallow and deep copy operations in Python’s online documentation.

7.1.3.5. How to randomize the order of a list

It may seem odd to you now, but randomizing the order of a list is sometimes a very useful thing to do. Python has an amazingly simple scheme for randomization, the shuffle method from the random module. shuffle() randomizes the sequence of a list and like reverse(), does so in place, destroying the list’s original sequence. Unlike reversal, it cannot be undone, so if the order of a list is important, it is better to work on a copy:

>>> L7 = copy(L)
>>> id(L7) == id(L)
>>> from random import shuffle
>>> shuffle(L7)
>>> L7
>>> L

The main use of randomization is to make sure that the output of an algorithm does not accidentally depend on a specific order of the items in its input.
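If you need the same "random" order every time the script runs, for instance to reproduce a result, one option is a seeded generator. The sketch below assumes Python 3; the seed 2016 is arbitrary, and Random(seed).shuffle() is used here in place of the module-level shuffle() shown above.

```python
from copy import copy
from random import Random

L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

L7 = copy(L)
Random(2016).shuffle(L7)          # seeded generator: same order on every run

print(L7)
print(sorted(L7) == sorted(L))    # True: same elements, different order
print(L[0])                       # 'Love': the original is untouched
```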

7.1.4. Practice 1

  1. What are L, I, and N lists of?

  2. Can types be mixed in a single list?

  3. Convert the string S at the beginning of the chapter into a list (tokenize it) and then …

    1. sort it
    2. reverse it
    3. remove its duplicates
    4. sort it without duplicates
    5. reverse it without duplicates
    6. randomize it without duplicates

7.2. Tokenization

A string of words is too cumbersome a thing to work with; in dealing with texts, it is much easier to work with a list of words. The conversion of a string of words to a list of words is called tokenization, and is the first step in natural language processing.

You could tokenize a string yourself using the space as a separator with:

>>> S.split(' ')
>>> from re import split
>>> split(r' ', S)

But as you can see, these produce odd results with punctuation and non-printing characters. Instead of struggling with this yourself, you are going to stand on the shoulders of giants.
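To pin down what "odd" means here, the sketch below (assuming Python 3) picks out two of the problem tokens and then tries a rough regular-expression tokenizer. The pattern is only a stand-in invented for illustration; it is not what NLTK does, and it would mangle a contraction such as don't into three tokens.

```python
import re

S = '''This above all: to thine own self be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.'''

by_space = S.split(' ')
print(repr(by_space[2]))    # 'all:' keeps its punctuation
print(repr(by_space[8]))    # 'true,\nAnd' straddles a line break

# A run of word characters, or else a single character that is
# neither a word character nor whitespace
rough_tokens = re.findall(r'\w+|[^\w\s]', S)
print(rough_tokens[:4])     # ['This', 'above', 'all', ':']
print(len(rough_tokens))    # 32: 27 words plus 5 punctuation marks
```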

7.2.1. Tokenization in NLTK

Computational linguists have been working on tokenization for decades. Some of their results are collected into the Natural Language Toolkit, or NLTK. NLTK is part of Anaconda’s Python distribution, so you can start poking around with it with import nltk. NLTK has combined a couple of tokenizers into word_tokenize:

>>> from nltk import word_tokenize
>>> word_tokenize(S)

This might not work for you. To find out what is going on, look at the documentation of word_tokenize:

>>> import nltk
>>> help(nltk.word_tokenize)
Help on function word_tokenize in module nltk.tokenize:
word_tokenize(text, language='english')
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into sentences
    :param language: the model name in the Punkt corpus

The problem is that you don’t have the Punkt models installed on your computer. They are easy enough to download from NLTK’s collection of corpora, however. This was touched on in NLTK’s corpora, but the gist of it is to open NLTK’s download GUI with nltk.download(), click on the Models tab, choose punkt and wait for it to download.

Tip

If the graphical user interface fails, you can use a command-line interface. At the downloader’s own prompt, type h for help on its commands, d punkt to download punkt, and q to quit the downloader:

>>> nltk.download_shell()
Downloader> d punkt
Downloader> q

Now you can try it out:

>>> from nltk import word_tokenize
>>> word_tokenize(S)

Notice that punctuation is split off into its own string. This is the convention of the Treebank tokenizer:

  • It splits standard contractions, e.g. “don’t” -> “do”, “n’t” and “they’ll” -> “they”, “‘ll.”
  • It treats most punctuation characters as separate tokens.
  • It splits off commas and single quotes, when followed by whitespace.
  • It separates periods that appear at the end of line

more See help(nltk.tokenize.TreebankWordTokenizer) and Penn Treebank Tokenizer (http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.treebank) for more information.

Now finally do something with a real text:

>>> with open('Wub.txt','r') as tempFile:
...     rawText = tempFile.read()
...
>>> tokens = word_tokenize(rawText)

Hooray, you have performed the first step in the natural language processing of a real text!

7.3. How to read CSV as a list

You have undoubtedly seen a spreadsheet, a format that is almost universally known through Microsoft’s Excel program. In the table below I have copied in the first two lines of a spreadsheet that you are going to work on in this section:

Table 7.1 ISO languages of the world, first two lines, plus Spanish and English
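===========  ============  =============  =============  ============  ============  ==========  ======  =========================  =================  =======
SK_Language  ISO639-3Code  ISO639-2BCode  ISO639-2TCode  ISO639-1Code  LanguageName  Scope       Type    MacroLanguageISO639-3Code  MacroLanguageName  IsChild
===========  ============  =============  =============  ============  ============  ==========  ======  =========================  =================  =======
10000        aaa           (none)         (none)         (none)        Ghotuo        Individual  Living  (none)                     (none)             0
3190         spa           spa            spa            es            Spanish       Individual  Living  (none)                     (none)             0
11103        eng           eng            eng            en            English       Individual  Living  (none)                     (none)             0
===========  ============  =============  =============  ============  ============  ==========  ======  =========================  =================  =======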

As you can see, it consists of eleven columns, and starts (at the top) with a header row followed by data rows.

Since Excel’s format belongs to Microsoft, and is binary, spreadsheet-type data is usually exchanged in a simpler, non-proprietary format known as comma-separated values or CSV. A CSV file is plain text, with values optionally enclosed in quotes and separated by a character – usually the comma – and each row ending in a new line.

7.3.1. How to read CSV with csv.reader()

Python has a native module for reading CSV files, which converts each row to a list of strings – hence our interest in them in this chapter. So go out and get a CSV file to play with. In particular, I want you to download the CSV file (not the one with the Windows icon on it) from Thomas Kejser’s blog post Free Data – ISO Languages (CSV and Excel). Save it as text to your hard drive with the name “ISOlanguages”.

Now, I know you know how to do that, so I don’t have to demonstrate the code, but we do have to take a peek at the data downloaded to find out its format, so I will demonstrate the code anyway:

>>> from requests import get
>>> url2 = 'http://kejser.org/wp-content/uploads/2014/06/Language.csv'
>>> response = get(url2)
>>> rawCSV = response.text
>>> rawCSV[:300]
>>> with open('ISOlanguages.csv', 'w') as csvfile:
...     csvfile.write(rawCSV)
...

Hold that thought for a moment. Go ahead and import csv.reader() and look at its documentation:

>>> from csv import reader
>>> help(reader)

It talks a lot about iteration, which is one of the topics of the next chapter. What I am more interested in here is dialect='excel' and optional keyword args.

A CSV file has two main formatting attributes, the character which separates the strings of a row, called delimiter in csv, and the character which encloses a string, called quotechar in csv. Check what these are in the default ‘dialect’ of Excel:

>>> from csv import excel
>>> excel.delimiter
>>> excel.quotechar

They are , and ", respectively. But what are they in the file that you just downloaded? Well, the quotechar is also ", but the delimiter is |, pipe, not comma.

The consequence of this variation from the default is that the delimiter has to be stipulated in the optional keyword arguments of csv.reader().
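To see what difference the delimiter makes without touching the downloaded file, the sketch below reads a small in-memory sample with and without delimiter='|'. It assumes Python 3. The sample is a hypothetical reconstruction from the rows of Table 7.1, with every value quoted; the real file may differ in detail. io.StringIO stands in for an open file.

```python
import csv
import io

# Hypothetical sample modelled on Table 7.1: pipe-delimited, values quoted
sample = (
    '"SK_Language"|"ISO639-3Code"|"ISO639-2BCode"|"ISO639-2TCode"|'
    '"ISO639-1Code"|"LanguageName"|"Scope"|"Type"|'
    '"MacroLanguageISO639-3Code"|"MacroLanguageName"|"IsChild"\n'
    '"10000"|"aaa"|"(none)"|"(none)"|"(none)"|"Ghotuo"|"Individual"|'
    '"Living"|"(none)"|"(none)"|"0"\n'
    '"3190"|"spa"|"spa"|"spa"|"es"|"Spanish"|"Individual"|'
    '"Living"|"(none)"|"(none)"|"0"\n'
)

# With the right delimiter, every row becomes a list of eleven strings
piped = list(csv.reader(io.StringIO(sample), delimiter='|'))
print(len(piped[0]))      # 11
print(piped[2][5])        # Spanish

# With the default comma, no separator is found and each row is one field
default = list(csv.reader(io.StringIO(sample)))
print(len(default[0]))    # 1
```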

If you were really lazy and wanted Python to figure this out for you, csv can sniff it out:

>>> from csv import Sniffer
>>> with open('ISOlanguages.csv', 'r') as csvfile:
...     csvDialect = Sniffer().sniff(csvfile.read(1024))
>>> csvDialect.delimiter
>>> csvDialect.quotechar

The third line passes the first 1024 characters of the file to Sniffer().sniff(), which works out the delimiter and quotechar; the last two lines report the outcome. You can use the resulting dialect object to read the CSV file, or set the delimiter yourself:

>>> # use the sniffed dialect
>>> with open('ISOlanguages.csv', 'r') as csvfile:
...     langReader = reader(csvfile, csvDialect)
...     firstRow = next(langReader)
...
>>> print(firstRow)
>>> # or set the delimiter yourself
>>> with open('ISOlanguages.csv', 'r') as csvfile:
...     langReader = reader(csvfile, delimiter='|')
...     firstRow = next(langReader)
...
>>> print(firstRow)

The goal of both with statements is to read the file into the langReader object and then extract and print the first row, which happens to be the header of the file.
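The same sequence of sniffing, reading and extracting the header can be rehearsed on an in-memory sample, which is handy if the download is unavailable. This is a sketch assuming Python 3; the sample is a hypothetical, shortened imitation of the file (three columns, every value quoted), and io.StringIO stands in for the open file.

```python
import csv
import io

# Hypothetical stand-in for the first lines of ISOlanguages.csv
sample = (
    '"SK_Language"|"ISO639-3Code"|"LanguageName"\n'
    '"10000"|"aaa"|"Ghotuo"\n'
    '"3190"|"spa"|"Spanish"\n'
    '"11103"|"eng"|"English"\n'
)

csvDialect = csv.Sniffer().sniff(sample[:1024])
print(csvDialect.delimiter)    # |
print(csvDialect.quotechar)    # "

langReader = csv.reader(io.StringIO(sample), csvDialect)
firstRow = next(langReader)    # the header row
secondRow = next(langReader)   # each further call returns the next row
print(firstRow)
print(secondRow)
```

Calling next() again and again is one way to get at the remaining rows, though the loop you will meet in the next chapter is much more convenient.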

“All that work just to get the first line?”, you might say. “What about the rest?”

I want you to get the rest, but unfortunately you have to understand about iteration in Python first. To repeat, that is something you will learn in the next chapter.

7.3.2. How to write CSV with csv.writer()

To round out the section, there is also a method to write CSV to a file, csv.writer(). As a simple example, write the single line returned from the reader to its own file:

>>> from csv import writer
>>> with open('ISOheader.csv', 'w') as csvfile:
...     ISOwriter = writer(csvfile)
...     ISOwriter.writerow(firstRow)

Check to see that ISOheader.csv magically appeared in pyScripts, either by looking at the File Explorer tab in Spyder, or programmatically with os:

>>> import os
>>> os.path.exists('/Users/{your_user_name}/Documents/pyScripts/ISOheader.csv')
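If you would like to see exactly what the writer produces before committing anything to disk, you can point it at an in-memory buffer. The sketch below assumes Python 3 and a hypothetical three-column header in place of the real firstRow.

```python
import csv
import io

firstRow = ['SK_Language', 'ISO639-3Code', 'LanguageName']   # hypothetical header

buffer = io.StringIO()               # stands in for the open file
ISOwriter = csv.writer(buffer)       # default 'excel' dialect
ISOwriter.writerow(firstRow)

written = buffer.getvalue()
print(repr(written))    # 'SK_Language,ISO639-3Code,LanguageName\r\n'

# Reading it back with the default dialect recovers the same list
buffer.seek(0)
readBack = next(csv.reader(buffer))
print(readBack == firstRow)          # True
```

Notice that the output is delimited by commas, not pipes: the writer follows the default dialect unless you pass it delimiter='|' as well. When writing to a real file under Python 3, the csv documentation recommends opening it with newline='' so that the row endings are not altered.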

more See Wikipedia’s Comma-separated values for background.

more See 13.1. csv — CSV File Reading and Writing for more detail on CSV files in Python 2.7.

more See ISO-639-1, ISO-639-2, and ISO-639-3 for more on the International Organization for Standardization (ISO) language names.

7.3.3. Practice 2, or how to extract a file from a ZIP archive

There is a project on GitHub, World countries in JSON, CSV, XML and YAML, that collects some basic information about countries into several data formats, one of which is CSV. I want you to read the first line (header) of the CSV file. You could easily do this with what you have just learned … except … all of the data files are bundled into a single compressed archive. When we discussed requests, I mentioned in passing that it can automatically decompress (unzip) a single compressed (ZIP) file, but the ZIP file at the GitHub page contains a slew of files, which requests cannot handle directly.

Wouldn’t you know, Python has a native module for dealing with zipped files. So let’s dig right in. What you are going to do is redirect the output from requests to the zipfile module and extract the single file that you want, without saving anything to disk along the way:

>>> from StringIO import StringIO
>>> from zipfile import ZipFile
>>> url3 = 'https://github.com/mledoze/countries/zipball/master'
>>> response = get(url3)
>>> rawZip = ZipFile(StringIO(response.content))

So far so good. It looks like a normal requests response, except that you have tricked Python into sending the output to an internal variable with StringIO, instead of saving it to disk.

The hard part is that there are 536 files in this archive, and only one that ends in .csv. If you could iterate over the file list, it would be easy to find the single CSV file – but YOU CAN’T UNTIL NEXT CHAPTER. So you have to inspect the file list visually:

>>> rawZip.namelist()
>>> path = 'mledoze-countries-c212082/dist/countries.csv'
>>> rawZip.extract(path)

The first line prints out all the file names as strings in a list. Fortunately, the one you want is towards the end. I have assigned it to the variable path – did you notice that it is indeed a file path, because it had two directories along the way? Finally, the last line extracts the file to pyScripts, at the end of the sequence of folders in path. Confirm that it is indeed there.

Now you can extract its first line.
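The steps can be rehearsed offline on a tiny archive built in memory. This is a sketch under several assumptions: it targets Python 3, where io.BytesIO plays the role that StringIO plays above; the archive, its file names and its contents are all made up; and it peeks ahead to the next chapter by filtering the name list instead of inspecting it by eye.

```python
import io
import zipfile

# Build a small stand-in archive in memory (hypothetical names and contents)
archiveBytes = io.BytesIO()
with zipfile.ZipFile(archiveBytes, 'w') as newZip:
    newZip.writestr('demo-countries/README.md', 'stand-in readme\n')
    newZip.writestr('demo-countries/dist/countries.csv',
                    'name,capital\nFrance,Paris\n')

# From here on, treat the bytes as if they had come from response.content
rawZip = zipfile.ZipFile(io.BytesIO(archiveBytes.getvalue()))
names = rawZip.namelist()
csvNames = [name for name in names if name.endswith('.csv')]
path = csvNames[0]

# Read the first line straight from the archive, without extracting to disk
with rawZip.open(path) as member:
    firstLine = member.readline().decode('utf-8').rstrip('\r\n')

print(path)         # demo-countries/dist/countries.csv
print(firstLine)    # name,capital
```

Reading the member with open() avoids writing anything to pyScripts; extract(), as used above, is the better choice when you want to keep the file.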

more See 12.4. zipfile — Work with ZIP archives for more detail on zipped files in Python 2.7.

7.4. Summary

Todo

That would be nice.

7.5. Further practice

Todo

Actually, it would be nice to have the time.

7.6. Powerpoint and podcast


Last edited: October 13, 2016