5. Flat text

Now that you have gotten a taste of Python, let us gather some texts to work on. To do so, you will learn how to download a file from the Internet and save it to your hard drive.

Note

The code for this chapter can be downloaded as nlp5.py, presumably to your Downloads folder, and then moved to pyScripts, from where you can open it in Spyder and run each line one by one.

5.1. How to navigate folders with os

The first step is to figure out where to put the file. File organization is a prerogative of the operating system, so Python has a special module named ‘os’ for asking it to perform actions on files. To invoke it, use the import command with the module name, import os. Your first task is always to figure out what Python is looking at in your computer’s file hierarchy. This is known as the current working directory, and the command to ask for it is os.getcwd(), which can be read as something like “use the os module to get the current working directory”. If the current working directory is indeed the pyScripts folder, the response would be something like '/Users/{your_user_name}/Documents/pyScripts'. Note that the convention is to state the path through your computer’s folder hierarchy by separating folder names by a forward slash:

>>> import os
>>> os.getcwd()

If you set the global working directory to pyScripts as described in How to set the global working directory in Spyder, then your current working directory will always be pyScripts.

If the current working directory is some other folder, you can change it to pyScripts by putting the path to it in single or double quotes inside the parentheses of os.chdir(), to be read as “use the os module to change to the directory in parentheses”. Since a folder path can be very long, it can be hard to read within the parentheses and you might make a mistake. It is more perspicuous to assign the path to a variable and put the variable name within the parentheses. Once you have done this, you could double-check the last two by giving os.getcwd() again:

>>> path = '/Users/{your_user_name}/Documents/pyScripts'
>>> os.chdir(path)
>>> os.getcwd()

Note how assigning the path to a variable saves you the effort of retyping it: path can be handed to os.chdir() now and to other methods, such as os.listdir(), later. Always let Python do as much work for you as possible.

If you have no pyScripts folder, you can use makedirs() to create it:

>>> os.chdir('/Users/{your_user_name}/Documents/')
>>> os.makedirs('pyScripts')

Finally, there are a couple of methods that you can use to double-check whether it all worked. The first lists the contents of a folder, the second checks to see whether the path to the folder actually exists:

>>> os.listdir('.')
>>> os.path.exists('/Users/{your_user_name}/Documents/pyScripts')

Now let’s get a file into pyScripts.

more For more explanation of Python’s operating system interface, see 15.1. os — Miscellaneous operating system interfaces in Python’s online documentation.
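
The folder checks from this section can be bundled into one small helper. What follows is only a sketch: the name ensureFolder is made up for illustration, and it works inside a throwaway folder created with the tempfile module so that nothing in your Documents folder is touched:

```python
import os
import tempfile

def ensureFolder(path):
    """Create the folder at path if it is missing; return True if it was created."""
    if os.path.exists(path):
        return False          # already there, nothing to do
    os.makedirs(path)
    return True

# Practice in a temporary location instead of /Users/{your_user_name}/Documents.
base = tempfile.mkdtemp()
scripts = os.path.join(base, 'pyScripts')

created = ensureFolder(scripts)         # the first call creates the folder
createdAgain = ensureFolder(scripts)    # the second call finds it in place
contents = os.listdir(base)             # what the parent folder now holds
```

On your own machine you would pass the real path to pyScripts instead of the temporary one.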

5.2. How to download and manage a plain text file

To start off, you will learn about plain text files, those marked with the suffix .txt or .text. I will refer to their format as TXT.

5.2.1. Project Gutenberg

Point your web browser at the home page for Project Gutenberg. As you can see, it offers more than 42,000 free ebooks. That should be more than enough for our purposes. In fact, it is so many that it is hard to know where to start. So I arbitrarily choose this one to work with: Beyond Lies the Wub by Philip K. Dick. Clicking on the link should bring you to a web page that looks like this:

_images/DickProjGuten.png

Fig. 5.1 The home page for Philip K. Dick’s “Beyond Lies the Wub” at Project Gutenberg

Author’s screen capture

You want to download the file that is hiding behind the “Plain Text UTF-8” link, so click on it. Your web browser should fill up with the following text:

_images/WubUTF8.png

Fig. 5.2 First page of “Beyond Lies the Wub”

Author’s screen capture

The document starts with some introductory material from Project Gutenberg before turning to the title and beginning of Dick’s story. Note that it is ebook number 28554.

5.2.2. How to download a file with requests

There are several ways to download a file, but the recommended one is to use the requests module. You are going to use its get method, so go ahead and import it. Then you need to enter the path to the file, known as its url – see the sidebar for what “url” means. The URL for the file that you want is highlighted in the orange box in First page of “Beyond Lies the Wub”. You invoke requests to download the file with get(url), which may take a few seconds. This saves the file as a Response object into Python’s memory:

>>> from requests import get
>>> url = 'http://www.gutenberg.org/cache/epub/28554/pg28554.txt'
>>> response = get(url)

Let’s look at the response:

>>> response.headers
{'content-length': '13514', 'x-varnish': '1903523305', 'content-location':
'pg28554.txt.utf8.gzip', 'content-encoding': 'gzip', 'set-cookie':
'session_id=1063f74f21dde877ca540b628273ed4649a0cef1; Domain=.gutenberg.org;
expires=Sat, 17 Sep 2016 20:31:33 GMT; Path=/', 'age': '0', 'x-powered-by': '3',
'vary': 'negotiate,accept-encoding', 'server': 'Apache', 'tcn': 'choice',
'x-connection': 'Close', 'via': '1.1 varnish', 'x-rate-limiter': 'ratelimiter2.php57',
'date': 'Sat, 17 Sep 2016 20:01:33 GMT', 'x-frame-options': 'sameorigin',
'content-type': 'text/plain; charset=utf-8'}

The curly brackets that delimit the material mark it as a dictionary, a data type that you have not seen yet. The very first item, 'content-length': '13514', may look helpful, until you read along and see 'content-location': 'pg28554.txt.utf8.gzip', 'content-encoding': 'gzip'. ZIP and GZIP are file compression formats, so the content length of the compressed file is going to be much shorter than that of the uncompressed file. The last item, 'content-type': 'text/plain; charset=utf-8', is usually the most useful of them all, for it tells you that the content downloaded is plain text, encoded as UTF-8. The response’s text attribute extracts that text, to give us something that we can work with:

>>> rawText = response.text
>>> type(rawText)
>>> len(rawText)         # 35737?
>>> rawText[:150]

Note that requests has decoded the downloaded bytes into a Unicode string, so you may have to encode it to UTF-8 at some point, for instance when you write it to disk.
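
The relationship between a Unicode string and its UTF-8 bytes can be tried out without downloading anything. This is a minimal sketch on a made-up four-letter word; the variable names are mine, not part of requests:

```python
# A Unicode string of four characters, the last one accented.
uniText = u'caf\u00e9'

# Encoding turns characters into bytes; the accented letter takes two bytes in UTF-8.
utf8Bytes = uniText.encode('utf-8')

# Decoding goes the other way, from bytes back to the same Unicode string.
roundTrip = utf8Bytes.decode('utf-8')

print(len(uniText))     # 4 characters
print(len(utf8Bytes))   # 5 bytes
```

The two lengths differ because len() counts characters in the first case and bytes in the second, which is one reason why the length of a downloaded text need not match the length of the file on disk.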

5.2.2.1. How to start a function for recurring file operations

Project Gutenberg is so awesome that you might want to download more than one text from it. To save strain on your fingers from retyping the same code again and again, start a function for doing it for you. Put it in a new script file called “textProc.py”:

# In "textProc.py"
def gutenLoader(url, name):
    from requests import get
    response = get(url)
    rawText = response.text

Note that I did not include the Python prompt because I asked you to type this directly into a script. Be sure to save it, but there is no need to run it because it doesn’t do enough yet.

5.2.2.2. How to use try to catch errors

Recall from How to define a function that variables assigned within a function are not accessible outside of it. There is another way in which a function is opaque to the outside world, namely that an error that is produced, or ‘raised’ in Pythonese (you will also hear ‘thrown’), within a function may not be reported as such during the function’s execution. It just fails, and you don’t know why.

To avoid this problem, Python has a statement for tiptoeing around problematic methods called try. Try to figure out what it does from typing this into the console, for which you should have already imported requests and assigned a path to url:

>>> try:
...     response = get(url)
... except:
...     print 'Download failed!'
...

The syntax of try is rather peculiar, though it looks a lot like that of a function definition. It starts with the reserved word try followed by a colon, which switches Spyder’s console to the continuation prompt. What you want to try is indented. Then an un-indented except, again followed by a colon, explains what to do in case of an error. In Pythonese, error messages are called exceptions.

Downloading a file is particularly prone to error because it depends on the file’s server being in working order, as well as your Internet connection. If requests runs into a problem, it will throw an error which shuts down the attempted download without breaking anything and switches to the exception handler, which you have thoughtfully supplied with an informative message.

try admits a further line called else: for what you want to do if what you tried succeeded:

>>> try:
...     response = get(url)
... except:
...     print 'Download failed!'
... else:
...     DO SOME MORE STUFF

The difference between putting the next stuff to do under else and putting it after the whole try statement is crucial; ask yourself what happens in the following block of code if except finds an exception:

>>> try:
...     response = get(url)
... except:
...     print 'Download failed!'
... DO SOME MORE STUFF

except prints an error message and then exits the try block, and Python goes on to do the other stuff, even if it depends on the download being successful. To make sure that code is executed only when try succeeds, put the other stuff beneath else. So that is where I am going to extract the text from the response:

>>> try:
...     response = get(url)
... except:
...     print 'Download failed!'
... else:
...     rawText = response.text

Add the try block to your budding gutenLoader function like so:

def gutenLoader(url, name):
    from requests import get
    try:
        response = get(url)
    except:
        print 'Download failed!'
    else:
        rawText = response.text

Warning

Project Gutenberg keeps track of how frequently you access it and will ask you to prove that you are human with a captcha. You will know that this has happened if the text that you downloaded is actually a bunch of HTML, as illustrated in the appendix A snippet of Project Gutenberg’s captcha page. Since requests does download a sort of text, it does not throw an exception.
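
Since no exception is raised in this case, you have to inspect the text yourself. Here is one possible check, offered as a rough heuristic rather than anything Project Gutenberg prescribes: the function name looksLikeEbook is hypothetical, the marker string is the one used in this chapter, and badSample is a made-up stand-in for the real captcha page:

```python
START_MARK = '*** START OF THIS PROJECT GUTENBERG EBOOK'

def looksLikeEbook(rawText):
    """Guess whether rawText is a Project Gutenberg plain text rather than a web page."""
    head = rawText[:200].lower()
    if '<html' in head or '<!doctype' in head:
        return False                  # the beginning looks like HTML
    return START_MARK in rawText      # the usual header line is present

goodSample = 'Some header text\r\n' + START_MARK + ' BEYOND LIES THE WUB ***\r\n'
badSample = '<!DOCTYPE html>\n<html><head><title>Are you human?</title></head></html>'

print(looksLikeEbook(goodSample))
print(looksLikeEbook(badSample))
```

A check like this could be run on the downloaded text before you go on to process it.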

Note

It’s probably obvious but the format of the try statement is:

>>> try:
...     DO SOMETHING RISKY
... except:
...     NOTIFY USER
... else:
...     KEEP GOING WITH RISKY BUSINESS

more See The try statement for more on try.

more See Built-in Exceptions for more on exceptions in general.

more See Exceptions for more on the exceptions raised by requests.
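
To watch the branches of try at work without depending on a download, here is a small sketch built around int(), which raises a ValueError when it cannot convert a string. The function doubleIt is invented for this illustration, and naming the exception after except is a refinement that goes beyond the bare except used above. print is written with parentheses so that the sketch also runs on Python 3:

```python
def doubleIt(text):
    """Convert text to an integer and double it; return None if that is impossible."""
    try:
        number = int(text)             # the risky step
    except ValueError:
        print('Conversion failed!')    # runs only if int() raised a ValueError
        return None
    else:
        return number * 2              # runs only if int() succeeded

good = doubleIt('21')
bad = doubleIt('twenty-one')
```

Here good ends up holding 42, while bad holds None because the except branch ran.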

5.2.3. How to manage a file on your hard drive

Just in case something goes wrong, you should save the text that you have downloaded to your hard drive.

5.2.3.1. How to save a file to your drive with open(), write(), and close()

Getting a file out of Python and onto your hard drive involves a series of steps that has never seemed intuitive to me, so it is good for you to get familiar with it as soon as possible. It invokes three built-in methods. The first creates an empty file along with its name by assigning it to a file object that I like to call ‘tempFile’. The second is to write from memory to disk, to which an encoding must be added if necessary. The final one is to close the file object. The three steps are gathered here:

>>> tempFile = open('Wub.txt','w')
>>> tempFile.write(rawText.encode('UTF-8'))
>>> tempFile.close()

You should now look at your pyScripts folder to make sure that a new document named “Wub.txt” has appeared there. You can get the same effect in Python by listing all the files in your current working directory, which is assumed to be pyScripts. Be sure to import os if you haven’t already done so:

>>> os.listdir('.')

Note

The general syntax of the write-to-file procedure is:

>>> fileObject = open('documentName.suf','w')   # 1. create a file object, provide it with a name & set its mode to write
>>> fileObject.write(stringSource)              # 2. write the string to the disk, with an optional encoding
>>> fileObject.close()                          # 3. close the file object to save it and free-up its resources

By the way, the file object tempFile cannot be examined, neither while it is open nor after it is closed. If you try either way, Python returns messages like the second and third lines below:

>>> tempFile
<open file FileSystemPathPointer('/Users/{your_user_name}/Documents/pyScripts/Wub.txt'), mode 'w' at 0x104d58270>
<closed file 'Wub.txt', mode 'w' at 0x102109270>

Now that you have a document, let’s take a look at it.
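
One more remark on writing before you move on. The io module offers its own open(), which accepts an encoding argument so that a Unicode string can be written without calling encode() yourself; in Python 3, the built-in open() is this same function. The sketch below assumes a made-up sentence and a temporary folder, so it does not touch Wub.txt:

```python
import io
import os
import tempfile

story = u'The wub said \u201cNo.\u201d'      # curly quotes, so not plain ASCII
path = os.path.join(tempfile.mkdtemp(), 'Sample.txt')

# io.open() takes care of the encoding while writing.
fileObject = io.open(path, 'w', encoding='utf-8')
fileObject.write(story)
fileObject.close()

# Reading with the same encoding gives back the same Unicode string.
fileObject = io.open(path, 'r', encoding='utf-8')
restored = fileObject.read()
fileObject.close()
```

The three steps are the same as before; only the place where the encoding is stated has moved from write() to open().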

5.2.3.2. How to look at a file with open() and read()

Dealing with a file that has already been saved to disk is easier than saving one; all you need is the first half of the saving algorithm. As before, open() needs the location of the file and an indication of the mode in which to open it. Assuming that the current working directory is pyScripts, the location of the file can be reduced to its name, and the mode is r for ‘read’:

>>> tempFile = open('Wub.txt','r')
>>> rawText = tempFile.read()
>>> tempFile.close()

You can combine them into a single line to obviate the need for the intermediate file object, though the file is then left for Python to close on its own:

>>> rawText = open('Wub.txt', 'r').read()

You can test the result as before:

>>> type(rawText)
>>> len(rawText)         # 35739?
>>> rawText[:150]
>>> import chardet # if not already imported
>>> chardet.detect(rawText)

Since the string is of type str and not unicode, it can be run through chardet to determine its encoding. It is UTF-8, which is how it was saved to disk. This string is the jumping off point for text processing.

5.2.3.3. How to read from and write to a file

So far, you have used the open() method in two mutually exclusive modes, 'r' for “read” and 'w' for “write”. They can be used in sequence to open a file, do something to it, and save the changes. To illustrate this process, replace a word that appears early in Wub.txt, “Gutenberg” with its uppercase equivalent “GUTENBERG” and save the result:

>>> rawText = open('Wub.txt', 'r').read()
>>> tempText = rawText.replace('Gutenberg', 'GUTENBERG')
>>> tempFile = open('Wub2.txt','w')
>>> tempFile.write(tempText)
>>> tempFile.close()

Let’s test it:

>>> rawText = open('Wub2.txt', 'r').read()
>>> rawText[:150]

You should see the first token of ‘Gutenberg’ in uppercase.
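
Bear in mind that replace() changes every occurrence of the word, not just the first one that you see on screen. A short sketch on an invented sentence shows the difference, using count() and the optional third argument of replace(), which limits the number of replacements:

```python
sample = 'Project Gutenberg offers Gutenberg texts from Gutenberg.'

howMany = sample.count('Gutenberg')                        # occurrences in the string
allChanged = sample.replace('Gutenberg', 'GUTENBERG')      # every occurrence
firstOnly = sample.replace('Gutenberg', 'GUTENBERG', 1)    # only the first occurrence

print(howMany)
print(allChanged)
print(firstOnly)
```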

5.2.3.4. How to simplify file operations with the with statement

Dealing with file input and output can be made a bit less onerous by using the with statement. It takes care of closing the file, even if an error is produced along the way. It looks a lot like the def and try statements:

>>> with open('Wub.txt','r') as tempFile:
...     rawText = tempFile.read()
...
>>> rawText = rawText.replace('Gutenberg', 'GUTENBERG')
>>> with open('Wub3.txt','w') as tempFile:
...     tempFile.write(rawText)
...

You don’t have to worry about closing the file, and the extra indentation may make it easier to conceptualize what is going on.
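
If you would like proof that with really closes the file, a file object has a closed attribute that you can inspect. The sketch below uses a throwaway file in a temporary folder and raises an error on purpose inside the second block; the variable names are mine:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

with open(path, 'w') as tempFile:
    tempFile.write('wub')
closedAfterBlock = tempFile.closed      # the file was closed on leaving the block

errorSeen = False
try:
    with open(path, 'r') as tempFile:
        raise ValueError('something went wrong mid-read')
except ValueError:
    errorSeen = True
closedAfterError = tempFile.closed      # closed as well, despite the error

print(closedAfterBlock)
print(closedAfterError)
```

Note that with closes the file but does not silence the error, which is why the second block is wrapped in try.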

Let’s test it:

>>> rawText = open('Wub3.txt', 'r').read()
>>> rawText[:150]

As before, the first token of ‘Gutenberg’ should be displayed in uppercase.

You should now augment your function gutenLoader() with the code about file writing, so that it finally does something useful:

# In "textProc.py"
def gutenLoader(url, name):
    from requests import get
    try:
        response = get(url)
    except:
        print 'Download failed!'
    else:
        rawText = response.text
        with open(name,'w') as tempFile:
            tempFile.write(rawText.encode('UTF-8'))

Do you understand why I continued to append the code under else?

5.2.3.5. How to refresh your script with reload()

Python tries to husband your resources, but sometimes you just want to be a spendthrift. When Python first imports a script, it compiles it to a format called bytecode and keeps the resulting module in memory. However, if you make a change in your script and save it, importing it again in the same session does not re-read the file, because Python sees that the module is already loaded. If you run a function from it again, your changes do not appear, and you think that you have thoroughly screwed up.

You can use the reload() function on the module’s name to force Python to re-read it. The module must already have been imported, and bear in mind that this refreshes everything, so you will have to re-import anything that you need from it. To make sure that the previous change took, do this:

>>> import textProc
>>> reload(textProc)

You will test it in a few minutes.
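
Should you ever work in Python 3, be aware that reload() is no longer built in there; it lives in the importlib module. The following sketch, written for Python 3, shows the whole cycle on a throwaway module named wubDemo, a name that I made up, which is written to a temporary folder so that textProc.py is left alone:

```python
import importlib
import os
import sys
import tempfile

# Write a first version of a tiny module into a throwaway folder.
folder = tempfile.mkdtemp()
modulePath = os.path.join(folder, 'wubDemo.py')
with open(modulePath, 'w') as tempFile:
    tempFile.write("def greet():\n    return 'first version'\n")

sys.path.insert(0, folder)        # let Python find the module
importlib.invalidate_caches()     # make sure the new file is noticed
import wubDemo
before = wubDemo.greet()

# Edit the module on disk; the copy already in memory does not change.
with open(modulePath, 'w') as tempFile:
    tempFile.write("def greet():\n    return 'second version, after editing'\n")
stale = wubDemo.greet()

# reload() re-executes the file and updates the module in memory.
importlib.reload(wubDemo)
after = wubDemo.greet()
```

The value of stale is still that of the first version, which is exactly the situation described above; only after reload() does the edited function take effect.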

5.2.3.6. How to get your function to communicate with the outside world with return

There is one last property of functions that you should know about. gutenLoader() does something to your file system, but it doesn’t really interact with you. Certainly one way to do this is to send the user a message that the download was successful by printing a string to the console. Before I show you a sample, where should this line of code go?

# In "textProc.py"
def gutenLoader(url, name):
    from requests import get
    try:
        response = get(url)
    except:
        print 'Download failed!'
    else:
        rawText = response.text
        with open(name,'w') as tempFile:
            tempFile.write(rawText.encode('UTF-8'))
        print 'File was written.'

I will ask you to make this message more informative in the practice.

The other thing that a user would expect from the function is the text that it extracted. This is the job of the return statement, which sends something from the function – recall that variables in a function are hidden from the outside world – to Python’s main memory. Let’s do this to rawText:

# In "textProc.py"
def gutenLoader(url, name):
    """Download a text file from Project Gutenberg & save it to disk, given a url and name of the file."""
    from requests import get
    try:
        response = get(url)
    except:
        print 'Download failed!'
    else:
        rawText = response.text
        with open(name,'w') as tempFile:
            tempFile.write(rawText.encode('UTF-8'))
        print 'File was written.'
        return rawText

return must still be subordinate to else because you don’t want to try to return the text if it was never downloaded.
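
What does the function hand back when the download fails, then? A function that ends without reaching a return statement gives back None. The sketch below imitates the shape of gutenLoader() without touching the Internet: loadText is a hypothetical stand-in, and an empty string plays the role of a failed download:

```python
def loadText(source):
    """Return the text in source, or nothing at all if 'loading' fails."""
    try:
        if not source:
            raise IOError('nothing to load')    # stand-in for a failed download
        rawText = source
    except IOError:
        print('Download failed!')
    else:
        return rawText

good = loadText('Beyond lies the wub.')
bad = loadText('')

if bad is None:
    print('No text came back, so there is nothing to analyze.')
```

Checking the result against None in this way is one option for deciding whether it is safe to go on with the analysis.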

Note

The general syntax of a function is:

>>> def functionName(argument[s]):
...     """ Explanation. """
...     do something
...     return output
...

You will have noticed that I added a string on the second line that explains what the function does. This documentation string or docstring can be retrieved from the help utility.

more See Documentation Strings in Python 2.7 for more on docstrings.

5.2.3.7. How to call your function

gutenLoader() now does enough to test it. Pick another text from Project Gutenberg, assign its address to url and give it a name via name. Then import the function from its script, and run it by simply typing its name with the two arguments, like so:

>>> url = <yourUrl>
>>> name = <yourName>
>>> import textProc
>>> reload(textProc)
>>> from textProc import gutenLoader
>>> help(gutenLoader)
>>> rawText2 = gutenLoader(url, name)

A new file with the name you gave it should magically appear in pyScripts, and its text should be available to the console as rawText2.

Running your code from a function or a function in a script is called batch mode, rather than interactive mode.

5.2.4. Practice 1

  1. Yes, make the response more informative. In particular, make it spit out the catalog number of the Project Gutenberg text, as well as the name that you give the file.
  2. Define a function codeDowner() for downloading the script file for each chapter to pyScripts, which takes one argument, the name of the file, such as nlp5.py, which I tell you at the beginning of every chapter. Its url is fixed within the function as 'http://www.tulane.edu/~howard/NLP/_py/'+name. Insert it in textProc.

5.2.5. How to slice away what you don’t need

Beyond Lies the Wub starts with a header and ends with a footer inserted by Project Gutenberg. You don’t want either one messing up your analysis, so you should delete them. How would you?

You could use a slice, but how far from the beginning would you end? A quick glance back at the image of the first page above reveals that the text starts right after the notice, “*** START OF THIS PROJECT GUTENBERG EBOOK BEYOND LIES THE WUB ***”, or at least that is good enough for our purposes. Unfortunately, a Project Gutenberg ebook ends with a long copyright disclaimer, which follows the line “*** END OF THIS PROJECT GUTENBERG EBOOK BEYOND LIES THE WUB ***”. The text you want is sandwiched between the two:

header

*** START OF THIS PROJECT GUTENBERG EBOOK BEYOND LIES THE WUB ***

text

*** END OF THIS PROJECT GUTENBERG EBOOK BEYOND LIES THE WUB ***

footer

Can you think of how to pull it out?

What you want to do is slice out the middle.

The three asterisks are a great clue as to where to begin, but just to make sure that the string is unique, add some more to nail down the right index:

>>> rawText.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
499

This index takes you to the correct line, but you want to delete all of it. That is to say, you want to delete everything up to the first newline character '\n' that comes after index 499. You can use the index as an optional second argument to index() that tells it where to start looking. Then slice up to it, and read what you got:

>>> lineIndex = rawText.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
>>> startIndex = rawText.index('\n',lineIndex)
>>> rawText[:startIndex]
u'\ufeffThe Project Gutenberg EBook of Beyond Lies the Wub, by Philip Kindred
Dick\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost
no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under
the terms of the Project Gutenberg License included\r\nwith this eBook or online
at www.gutenberg.net\r\n\r\n\r\nTitle: Beyond Lies the Wub\r\n\r\nAuthor: Philip
Kindred Dick\r\n\r\nIllustrator: Herman Vestal\r\n\r\nRelease Date: April 11, 2009
[EBook #28554]\r\n\r\nLanguage: English\r\n\r\n\r\n*** START OF THIS PROJECT GUTENBERG
EBOOK BEYOND LIES THE WUB ***\r'

Turning to the footer, it is also marked by a unique string so you can be assured that its index marks the end of the text:

>>> rawText.index('*** END OF THIS PROJECT GUTENBERG EBOOK')

It returns 16514. You are welcome to look at some of the following text to make sure that you hit the right spot, but I am ready to move on.

Can you use the two indices you have found to extract the story from its Project Gutenberg wrapper? Think about how to do it. I hope that you came up with something like this:

>>> endIndex = rawText.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> text = rawText[startIndex:endIndex]

Just to make it clear, the slice is where the square brackets are here:

header

*** START OF THIS PROJECT GUTENBERG EBOOK BEYOND LIES THE WUB ***[

text

]*** END OF THIS PROJECT GUTENBERG EBOOK BEYOND LIES THE WUB ***

footer
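
The same three steps can be tried on a miniature document, which makes it easy to see exactly what the slice keeps. This is only a sketch: stripWrapper is a name that I made up, and sample is an invented stand-in for a real ebook:

```python
START_MARK = '*** START OF THIS PROJECT GUTENBERG EBOOK'
END_MARK = '*** END OF THIS PROJECT GUTENBERG EBOOK'

def stripWrapper(rawText):
    """Return the part of rawText between the START line and the END line."""
    lineIndex = rawText.index(START_MARK)          # where the START line begins
    startIndex = rawText.index('\n', lineIndex)    # the end of that line
    endIndex = rawText.index(END_MARK)             # where the END line begins
    return rawText[startIndex:endIndex]

sample = ('header\r\n'
          '*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\r\n'
          'The story.\r\n'
          '*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\r\n'
          'footer\r\n')

text = stripWrapper(sample)
print(repr(text))
```

The slice starts at the newline character that ends the START line, so the result begins with '\n'; strip() would remove that leftover whitespace. If either marker is missing, index() raises a ValueError.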

This chunk of text is what we want to analyze, so it is time to save it to disk. Incorporate the header and footer slicing code into gutenLoader() before saving the text to disk:

# In "textProc.py"
def gutenLoader(url, name):
    """Download a text file from Project Gutenberg, strip off its header & footer,
    and save it to disk, given a url and name of the file."""
    from requests import get
    try:
        response = get(url)
    except:
        print 'Download failed!'
    else:
        rawText = response.text
        lineIndex = rawText.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
        startIndex = rawText.index('\n',lineIndex)
        endIndex = rawText.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
        text = rawText[startIndex:endIndex]
        with open(name,'w') as tempFile:
            tempFile.write(text.encode('UTF-8'))
        message = 'File {} was written.'
        print message.format(name)
        return text

I added string formatting to make the printed message more informative.

Call the function as before, importing textProc.py if you haven’t done so:

>>> import textProc
>>> reload(textProc)
>>> from textProc import gutenLoader
>>> rawText = gutenLoader(url, name)

5.2.6. Practice 2

  1. You should be able to explain what every line of gutenLoader() does. To make sure, I want you to add a comment to at least every other line, explaining what it does. You preface a single line comment with #, which tells Python’s interpreter to ignore the rest of the line. You have seen comments several times in my sample code; I just did not explain what they were.
  2. The os module has a method os.path.isfile(path) which returns True if it finds a file at the end of the path string in the argument. You have already seen os.getcwd(), which returns the path to the current working directory. Can you think of a way to use os.path.isfile() to check whether gutenLoader() actually does create a file?

5.3. How to convert other documents to a string

Project Gutenberg’s text files are easy to handle, but TXT is not the most common format for textual data.

Question

What other document formats are you familiar with? They are usually indicated by a document’s suffix.

Nowadays, a great many text documents are distributed in the .pdf format. PDF stands for Portable Document Format and was developed by Adobe Systems. You are also undoubtedly familiar with Microsoft Word documents, .doc or .docx. There are several others for text, such as .htm or .html for web pages, as well as formats for images, video and sounds.

All of these file types include additional information about the structure and formatting of the text, often in the form of tags. One of the simplest examples is italicization in HTML, in which a word is enclosed between an opening and a closing tag, each written in angle brackets, <i>italicized</i>. The <i> tells your web browser to start italicizing the string, and the </i> tells it to stop.
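
To get a feel for what clearing away tags involves, here is a minimal sketch that uses the HTMLParser class from Python’s standard library to keep the text and drop the tags. It is written for Python 3, the class name TextOnly is my own, and it is not the converter that textract relies on for HTML:

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.pieces = []

    def handle_data(self, data):
        self.pieces.append(data)    # called for every stretch of plain text

parser = TextOnly()
parser.feed('This word is <i>italicized</i>, this one is not.')
parser.close()                      # flush any text still waiting in the parser
plain = ''.join(parser.pieces)
print(plain)
```

Real web pages are far messier than this one-line example, which is why dedicated packages exist for the job.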

5.3.1. Let’s try textract

You want to clear your text of this clutter. Over the years, as new formats became popular, new utilities were developed to clean them up. You could install each one and learn how to use it to extract text, but there is an alternative. The Python package textract gathers together some thirteen packages, as well as several modules built into Python, for which it supplies a simple interface.

Table 5.1 Python modules that textract calls on and their availability in the Continuum Anaconda distribution, CAD

Format        Converter                          Available in CAD?
.csv          built into Python
.doc          antiword
.docx         python-docx
.eml          built into Python
.epub         ebooklib
.gif          tesseract-ocr
.jpg, .jpeg   tesseract-ocr
.json         built into Python
.htm, .html   beautifulsoup4
.mp3          SpeechRecognition and sox
.msg          msg-extractor
.odt          built into Python
.ogg          SpeechRecognition and sox
.pdf          pdftotext (default) or pdfminer
.png          tesseract-ocr
.pptx         python-pptx
.ps           ps2text
.rtf          unrtf
.tiff         tesseract-ocr
.txt          built into Python
.wav          SpeechRecognition
.xlsx         xlrd
.xls          xlrd                               ❌, though CAD has openpyxl

textract is not included in Anaconda’s distribution for Python 2.7, but can be installed easily enough with pip, though it does have some quirks.

5.3.1.1. On a Mac

Textract’s developer recommends installing four of the packages that textract uses through a non-Python package manager called Homebrew. The instructions for installing Homebrew are simple enough – you have to enter this enormous line of code into the terminal:

$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

more See Homebrew’s installation page.

Once Homebrew is installed, you use it to install the four packages for textract:

$ brew install poppler antiword unrtf tesseract

Finally, textract itself can be installed with:

$ pip install textract

5.3.1.2. On Windows

My best guess is to install each package separately with pip:

> pip install poppler antiword unrtf tesseract
> pip install textract

5.3.2. How to convert EPUBs

Point your web browser at Beyond Lies the Wub by Philip K. Dick again. In the list of documents, there are two EPUBs, with and without images. You can click on either or both, and the download starts immediately. You can open the downloaded file to see what a pleasant reading experience the EPUB format provides. Did you notice that the one with images has an illustration on page 3? Vintage 1950s science fiction, but it ruins the expectation of what the story is about. I also prefer my own imagination for visualizing a wub.

By the way, if you open an EPUB as if it were a text file, for example, with the Mac’s TextEdit app, you see a hallucinogenic jumble of characters which I cannot even reproduce here. This is because the file has been compressed with the zip utility. So you are forewarned that an EPUB can’t be treated like plain text.
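
Because an EPUB is a zip archive, Python’s built-in zipfile module can look inside one. The sketch below builds a tiny stand-in archive in memory rather than using the downloaded file; the chapter file name and its contents are invented, whereas the mimetype entry mirrors what a real EPUB carries:

```python
import io
import zipfile

# Build a tiny stand-in archive in memory; a real EPUB holds many more files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('mimetype', 'application/epub+zip')
    archive.writestr('chapter1.html', '<p>Some text.</p>')

data = buffer.getvalue()
signature = data[:2]                # zip archives begin with the letters PK

# Reopen the bytes as an archive and look inside.
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    names = archive.namelist()
    mimetype = archive.read('mimetype').decode('utf-8')

print(names)
```

With a real file on disk, you would hand its name to zipfile.ZipFile() instead of the in-memory buffer.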

In any event, since you have an EPUB or two right at your fingertips, let us go ahead and use it to illustrate how to use textract.

The first step is to download the file with requests. And right off the bat there is a problem, because clicking on the file’s icon starts the download without first going through a page that tells you what its address is. In Firefox, you can recover its address by right-clicking on the (no images) icon and selecting Copy Link Location. Paste it into a string for the url variable in Spyder. When I did this it came out as:

>>> url = 'http://www.gutenberg.org/ebooks/28554.epub.noimages?session_id=2b5dedd11a639d4cfee43eece28193a387ac6b74'

Your session id will be different, but you don’t need it, so trim it away to leave just the address:

>>> url = 'http://www.gutenberg.org/ebooks/28554.epub.noimages'

Download it with requests:

>>> from requests import get
>>> response = get(url)

Now we have to talk about how requests works.

Requests has built an object from Project Gutenberg’s server’s response to your request. Requests calls it a Response object. The Response object contains several pieces of information from the server that will aid you in deciding how to process it, which you can see with the headers attribute. For ease of discussion, I have reproduced my result below:

>>> type(response)
<class 'requests.models.Response'>
>>> response.headers
{'content-length': '16922', 'x-varnish': '1988218503', 'x-powered-by': '3', 'set-cookie': 'session_id=c91e2c01ad330b816664af3600b141ed13f5be94; Domain=.gutenberg.org; expires=Thu, 15 Sep 2016 13:05:50 GMT; Path=/', 'age': '0', 'server': 'Apache', 'x-connection': 'Close', 'via': '1.1 varnish', 'x-rate-limiter': 'ratelimiter2.php57', 'date': 'Thu, 15 Sep 2016 12:35:50 GMT', 'x-frame-options': 'sameorigin', 'content-type': 'application/epub+zip'}

The very first item in the dictionary tells you the length of the content of the response. The last one tells you its type, application/epub+zip, which suggests that it can be processed as an EPUB.
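
Looking up an item in a dictionary is done with square brackets around its key. The sketch below copies two of the headers reported in this chapter into ordinary dictionaries and pulls the content type apart; the function contentInfo is a hypothetical helper of mine, not something that requests supplies:

```python
def contentInfo(headers):
    """Split a content-type header into the media type and the charset, if any."""
    parts = [part.strip() for part in headers['content-type'].split(';')]
    mediaType = parts[0]
    charset = None
    for part in parts[1:]:
        if part.startswith('charset='):
            charset = part[len('charset='):]
    return mediaType, charset

# Two of the headers shown in this chapter, as plain dictionaries.
textHeaders = {'content-length': '13514', 'content-type': 'text/plain; charset=utf-8'}
epubHeaders = {'content-length': '16922', 'content-type': 'application/epub+zip'}

print(contentInfo(textHeaders))
print(contentInfo(epubHeaders))
```

A test on the media type is one way for a script to decide whether to treat a download as plain text or to save it as a binary file.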

However, textract expects to begin with a file, so the response needs to be saved as one. The problem is that the content of the response is not text:

>>> response.text[:150]

What it holds is binary data, which can be retrieved with the content attribute:

>>> response.content[:150]

All non-text files are stored as binary data, and Python’s file processing methods have a mode to deal with it, b. So you can save the response content to an EPUB file with our with statement:

>>> with open('Wub.epub', "wb") as tempFile:
...    tempFile.write(response.content)
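
The binary mode can be tried out on a handful of bytes before you trust it with a whole book. In this sketch the payload is made up and the file lives in a temporary folder; the attempt to decode it shows why such content cannot be treated as UTF-8 text:

```python
import os
import tempfile

payload = b'PK\x03\x04\xff\xfe'          # a few bytes that are not valid UTF-8 text
path = os.path.join(tempfile.mkdtemp(), 'sample.bin')

with open(path, 'wb') as tempFile:       # 'wb' = write, binary
    tempFile.write(payload)
with open(path, 'rb') as tempFile:       # 'rb' = read, binary
    restored = tempFile.read()

try:
    payload.decode('utf-8')
except UnicodeDecodeError:
    isText = False                       # the bytes do not spell out UTF-8 text
else:
    isText = True
```

The bytes come back from disk exactly as they went in, which is the whole point of the b mode.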

Assuming that you have used pip in your terminal to install textract, fire it up and pull out the text:

>>> from textract import process
>>> rawText = process('Wub.epub')

Be sure to test it:

1
2
3
4
5
>>> type(rawText)
>>> from chardet import detect
>>> detect(rawText)
>>> len(rawText) # 34361, which is more than the content-length in the header [1]
>>> rawText[:150]

Easy peasy, though the text is messy – too messy to cut out Project Gutenberg’s header and footer – but you will learn how to clean it up in the next chapter.
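
One of the checks above deserves a second look: the extracted text (34361 characters) is longer than the content-length reported by the server (16922). That is what you would expect if an EPUB is a ZIP archive whose contents are compressed. The sketch below illustrates the effect with Python's built-in zipfile module. The sample sentence and the file name chapter.txt are invented for the illustration, and the exact sizes will differ from those of a real book.

```python
import io
import zipfile

# Hypothetical stand-in for a book's text: repetitive, so it compresses well.
text = b'The wub said nothing. ' * 500

# Build a ZIP archive in memory, compressing the text with DEFLATE.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w', zipfile.ZIP_DEFLATED) as archive:
    archive.writestr('chapter.txt', text)
archiveBytes = buffer.getvalue()

# Read the text back out of the archive.
with zipfile.ZipFile(io.BytesIO(archiveBytes)) as archive:
    extracted = archive.read('chapter.txt')

print(len(text))          # size of the text itself
print(len(archiveBytes))  # size of the archive that would be downloaded
print(extracted == text)  # nothing is lost by compressing
```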

5.3.2.1. Practice 3

Wrap the code reviewed above into a function called epub2text that takes the URL of an EPUB and a file name and returns the raw text. Be sure that it prints a message informing the user that the download was successful, but the message should not be what it returns.

5.3.3. How to convert PDFs

If you opened a PDF as if it were plain text, you would see something like:

%PDF-1.6
%‚„œ”
582 0 obj
<</Linearized 1/L 714719/O 584/E 508046/N 48/T 714031/H [ 482 406]>>
endobj

593 0 obj
<</DecodeParms<</Columns 5/Predictor 12>>/Filter/FlateDecode/ID[<D63FE8F5F5D8B94E84BB39E31EC212C3><D5D2D98678A6BE4A8AFE345CD16AD430>]
/Index[582 23]/Info 581 0 R/Length 73/Prev 714032/Root 583 0 R/
Size 605/Type/XRef/W[1 3 1]>>stream
hfibbd```b``Œ

About all that you can get out of this is that it is a PDF file. What this means is that PDFs need a certain amount of processing to extract the text that they contain, but textract is up to the task.
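
That first line is more useful than it looks, because you can check it in code before deciding how to process a file. The following is only a sketch: the function name sniff_format and the labels it returns are made up for this illustration, and it relies on two conventions, namely that PDF files begin with %PDF- and that ZIP archives (which is what an EPUB is) begin with the bytes PK\x03\x04.

```python
def sniff_format(firstBytes):
    """Guess a file's format from its first few bytes.

    Hypothetical helper for illustration; it knows only two formats.
    """
    if firstBytes.startswith(b'%PDF-'):
        return 'pdf'
    if firstBytes.startswith(b'PK\x03\x04'):
        return 'zip'  # an EPUB is a ZIP archive
    return 'unknown'


# Stand-ins for response.content[:150] from a PDF and from an EPUB.
print(sniff_format(b'%PDF-1.6\n582 0 obj'))
print(sniff_format(b'PK\x03\x04\x14\x00\x00\x00'))
print(sniff_format(b'<!DOCTYPE html>'))
```

With a real download, you would pass in response.content[:150] or the first bytes read from a file opened in "rb" mode.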

To give this a try, download The Minority Report by you-know-who and convert it to raw text:

>>> url = 'http://www.cwanderson.org/wp-content/uploads/2011/11/Philip-K-Dick-The-Minority-Report.pdf'
>>> from requests import get
>>> response = get(url)
>>> with open('MinorityReport.pdf', "wb") as tempFile:
...    tempFile.write(response.content)
>>> from textract import process
>>> rawText = process('MinorityReport.pdf')

Don’t assume that it just magically worked:

>>> type(rawText)
>>> from chardet import detect
>>> detect(rawText)
>>> len(rawText) # 91409
>>> rawText[:150]

Again, the raw text is splattered with junk that we don’t need, but we will hoover it up in the next chapter.

5.3.3.1. Practice 4

Wrap the code reviewed above into a function called pdf2text that takes the URL of a PDF and a file name and returns the raw text. Be sure that it prints a message informing the user that the download was successful, but the message should not be what it returns.

5.3.4. If all else fails, use pypandoc

You can use a free, open-source library called Pandoc, which can convert a multitude of textual formats to plain text. There is a Python wrapper for it called pypandoc. It is not part of Anaconda's distribution for Python 2.7, but it can be installed in the terminal with $ pip install pypandoc.

5.3.5. How to use tesseract-ocr to extract text from an image

5.4. The Natural Language Toolkit’s corpora

Another convenient trove of text is the corpora that were collected for the Natural Language Toolkit, which we will hereafter refer to as NLTK. However, at this point in time I am not sure whether we will use them, so I am sending the instructions off to the Appendix, NLTK’s corpora.

5.5. Summary

5.6. Appendix

5.6.1. A snippet of Project Gutenberg’s captcha page

The document starts like so:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<style type="text/css">
.icon   { background: transparent url(/pics/sprite.png?1472155906) 0 0 no-repeat; }
</style>
<link rel="stylesheet" type="text/css" href="/css/pg-desktop-one.css?1472155906" />
<script type="text/javascript">//<![CDATA[
var json_search     = "/ebooks/suggest/";
var mobile_url      = "//m.gutenberg.org/w/captcha/question/?format=mobile";
var canonical_url   = "http://www.gutenberg.org/w/captcha/question/";
var lang            = "en_US";
var fb_lang         = "en_US"; /* FB accepts only xx_XX */
var msg_load_more   = "Load More Results…";
var page_mode       = "screen";
var dialog_title    = "";
var dialog_message  = "";
//]]></script>

However, the actual message is under the body tag, <body>:

<body>
<div id="mw-head-dummy" class="noprint"></div>
<div id="content">
<div class="body">
<div id="dialog" title="Are you human?" class="hidden">
<p>You have used Project Gutenberg quite a lot today or clicked through it really fast. To make sure you are human, we ask you to resolve this captcha:</p>
<form method="post" action="/w/captcha/answer/">

We may play with this in the future.

5.6.2. pdfminer code for extracting text

PDFMiner is not part of the Anaconda distribution, nor is it available through pip. So you have to go through the drudgery of downloading the source code and compiling it yourself. I walk you through it in the Appendix to the introduction to Python on How to install a package in Anaconda.

Once you have PDFMiner installed, you can call it with the following function, which returns the text of the PDF:

def pdf2text(pdfFile):
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from cStringIO import StringIO

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(pdfFile, 'rb')  # open() is preferred over the Python 2-only file()
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text

PDFMiner is difficult to use because the structure of PDF files is complex, so you will just use it through this function. It takes a PDF file as its only argument and returns a string containing the text extracted from the PDF.

5.6.3. NLTK’s corpora

The NLTK package is included in Anaconda's distribution, but you should check by typing import nltk in Spyder's console. If Python returns the prompt to you, then nltk is installed. If you get an error message instead, you must install NLTK.
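
If you would rather get a yes-or-no answer than read an error message, the check can be wrapped in a small function. This is a sketch only: the name is_installed is made up for this illustration, and it assumes that a failed import raises ImportError, which is Python's standard behavior.

```python
import importlib


def is_installed(moduleName):
    """Return True if moduleName can be imported, False otherwise.

    Hypothetical helper for illustration.
    """
    try:
        importlib.import_module(moduleName)
        return True
    except ImportError:
        return False


print(is_installed('os'))    # part of Python itself, so always available
print(is_installed('nltk'))  # depends on your installation
```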

5.6.3.1. How to download the NLTK corpora

The instructions for Installing NLTK Data from the NLTK website are as follows:

>>> import nltk
>>> nltk.download()
showing info http://nltk.github.com/nltk_data/

This should open a window like this one (on the Mac, it appears with the rocketship icon):

_images/nltkData.png

Fig. 5.3 NLTK Data download window

Author’s screen capture

Look at the bottom line labelled Download Directory. On my Mac, the path is /Users/harryhow/nltk_data. Keep this in mind for a while.

Click on Book and then the Download button. The download can take a while, so be patient. The bar in the bottom right corner records its progress. Once it is complete, double-check that it worked by giving the following commands in Python:

>>> from nltk.corpus import brown
>>> brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

5.6.4. How to find files in the terminal

To find files in the terminal, we will just list the commands that correspond to the Python ones that we have just reviewed:

$ # just opening a terminal is equivalent to import os
$ pwd
$ path=/Users/{your_user_name}/Documents/pyScripts
$ cd $path
$ ls
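
For comparison, here is a sketch of the same steps in Python, with each terminal command noted in a comment. So that it runs on any computer, it uses a temporary folder as a stand-in for pyScripts and creates one empty file named nlp5.py in it. Both are assumptions made for the illustration, and os.listdir() is used as the counterpart of ls.

```python
import os
import tempfile

# Hypothetical stand-in for your pyScripts folder, so the sketch runs anywhere.
path = tempfile.mkdtemp()
open(os.path.join(path, 'nlp5.py'), 'w').close()  # one empty file to list

start = os.getcwd()      # pwd
os.chdir(path)           # cd $path
here = os.getcwd()       # pwd again, to confirm the move
files = os.listdir('.')  # ls
os.chdir(start)          # go back to where you started

print(here)
print(files)
```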

Much more could be said, but this is enough to get us going.

5.7. Powerpoint and podcast

Endnotes

[1] 34361 is more than the content-length in the header, 16922, because the EPUB is compressed with ZIP. It is 1378 less than the length of the text file, which you may recall is 35739. That suggests that some characters were lost in the extraction of the text from the EPUB, but I have not had time to compare the two.

Last edited: October 13, 2016