3. Introduction to natural language processing

3.1. What is natural language processing?

Wikipedia says …

Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve: natural language understanding, enabling computers to derive meaning from human or natural language input; and others involve natural language generation.

The first thing that jumps out at me about this explanation is the fact that Wikipedia feels the need to qualify “languages” as human or natural. Aren’t all languages human or natural? Well, it turns out that within computer science there are artificial or machine languages, those that mediate between what a human programmer wants a computer to do and what the computer actually does. In this course, you are going to learn one of those programming languages, called Python.

So there are human or natural languages like English and machine or artificial languages like Python. Let me give you a short example of each. Here is a sentence of English, “If it is 3:50 pm, write ‘It’s time to go.’” I hope you had no problem understanding it. This is how you could type this sentence into Python:

1  >>> from datetime import datetime, time
2  >>> rightNow = datetime.now()
3  >>> nowTime = time(rightNow.hour, rightNow.minute)
4  >>> targetTime = time(15, 50)
5  >>> if nowTime >= targetTime:
6  ...     print("It's time to go.")
7  ...

Questions

Do you understand what this program does? Could you explain it to me line by line? How is it like English? How is it different from English?

Python has a module for dealing with time called datetime, with the datetime and time classes that you gain access to in line 1. The datetime.now() method uses your computer’s clock to get the time at which the line is typed, which runs from the year all the way down to the number of microseconds in the current second. That’s too much information, so line 3 whittles it down to just the hour and minute. The next line tells Python what time you are going to compare the current time to. Did you notice that Python tells the hour in ‘military’ or ‘24-hour’ time? The fifth line sets forth the comparison as a condition, and the next one tells Python what to do if the condition turns out to be true. This action is indented with respect to the previous line, and the prompt changes from >>> to ..., which tells you that Python is waiting for you to type something else. Since the steps to take can be many, a blank line tells Python that there are no more.
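
If you are curious what those intermediate objects look like, you can inspect them at the prompt. Here is a minimal sketch; the exact values will of course depend on when you type it:

>>> from datetime import datetime, time
>>> rightNow = datetime.now()
>>> rightNow        # year, month, day, hour, minute, second, microsecond
datetime.datetime(2016, 9, 29, 15, 50, 12, 345678)
>>> time(rightNow.hour, rightNow.minute)    # whittled down to hour and minute
datetime.time(15, 50)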

I hope that you were able to puzzle out the workings of this snippet of code without the need for my explanation. This is by design. That is to say, Python was designed to read like ordinary English, which makes it a good language to learn in an introductory course. I also defined several items by means of names that tell you what they do, which helps to make the code understandable. This is a practice that I will encourage in your own coding.

But is Python really like ordinary English? Recall the exact wording of my example “If it is 3:50 pm, write ‘It’s time to go.’” and consider these three cases. What do you do for each?

  1. it’s 3:49
  2. it’s 3:50
  3. it’s 3:51

I think that we can all agree that for case 1, you don’t write anything, while for case 2, you do. What about case 3? You probably answered that it is still time to go, so you should write it. I respond that your conclusion is not the literal meaning of the English sentence, but something that you added to it based on your knowledge of how the world works. You have learned a rule along the lines of “if you do something at the end of a period, you can also do it at any moment after the end of the period”. Of course, I could have built your implicit rule into the sentence by saying “If it is 3:50 pm or later, write ‘It’s time to go.’”, but I don’t really have to, because we have all internalized the same rule. A linguist would say that my original sentence is ambiguous (it has more than one interpretation) or perhaps just vague (it leaves some information unspecified).

How does Python deal with ambiguity or vagueness? Well, in my code, I specifically indicated “at 3:50 pm or later” by means of the comparison operator >=, read “greater than or equal to”. If I had instead used ==, “equal to”, case 1 would be false, case 2 would be true, and case 3 would also be false. The larger moral to the story is that Python can’t resolve ambiguity or vagueness, and it does not know how the world works. It only knows what you tell it to do.
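
You can check this at the Python prompt. Here is a quick sketch that tests the three cases directly against targetTime; the specific times are just the ones from the list above:

>>> from datetime import time
>>> targetTime = time(15, 50)
>>> time(15, 49) >= targetTime    # case 1: 3:49 pm
False
>>> time(15, 50) >= targetTime    # case 2: 3:50 pm
True
>>> time(15, 51) >= targetTime    # case 3: 3:51 pm
True
>>> time(15, 51) == targetTime    # with ==, case 3 would be false
False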

This is why the quote from Wikipedia that started out this chapter assigns natural language processing to artificial intelligence – because any NLP system will ultimately have to know something about how the world works.

In this course, you will learn how to use Python to get your computer to perform various useful tasks with natural language, which is my definition in a nutshell of natural language processing. Wikipedia says that NLP comes in two flavors, natural language understanding and natural language generation. We do not have enough time in a single semester to deal with both of them, so we will concentrate on natural language understanding.

You may think that this is fairly simple, since you understand at least one natural language, English, without any effort whatsoever. You are not alone. Reading down to the history section of the Wikipedia article on natural language processing reveals this fun fact:

The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.

Well, it is now 2016, and machine translation is emphatically NOT a “solved problem”. Hardly any task of natural language processing is, except for maybe spell checking. I bring this up to convince you that natural language understanding is really hard – up until about the 1980s researchers thought that it might be impossible.

One reason for this is that computers are profoundly stupid. You may find this difficult to swallow, but it is true. A simple ant is much smarter than any computer that you have at your disposal.

3.2. But what is computation?

You own a computer, but do you know how it works? Well, if it’s called a computer, it must compute. But what is computation? Since the notion is fundamental to this course, it is worth understanding in at least a general sense. I will try to explain it by way of something that you already have a certain familiarity with, namely calculation. Take a look at the following image and try to guess the calculation that is hidden by the question mark:

_images/1-calculation.png

Did you guess addition? 1+2+3+4 = 10. That is simple enough, but I bring it up not because of the result but because of the process. The box that represents the addition operator takes an input, the integers one through four, and produces an output, the integer ten. All numerical calculations have this format. Thus in the broadest sense, a calculation is a mapping between numbers.
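
In Python, for instance, the box could be the built-in sum() function, which maps the input numbers to their total:

>>> sum([1, 2, 3, 4])
10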

The next figure generalizes this process to things other than numbers. What does the box with the question mark represent?

_images/1-computation.png

I hope that you answered something like the process of making an omelet. Yet the layout of the diagram attempts to paint a picture of a larger point, namely that, like a calculation, the making of an omelet is a mapping from certain inputs to a particular output. We don’t really want to call this a calculation, since that word is reserved for numerical mappings, but it is a kind of computation, understood in its most general sense as any mapping between inputs and outputs.

You may object that the making of an omelet has to follow a series of steps which allows for very little alteration – a series of steps commonly known as a recipe. My reply is that a computation can also obey a strict series of steps, one that is known as an algorithm.

In this course, the computations that you will learn invariably involve words.
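
To give you a first taste of what a computation over words looks like, here is a tiny sketch that maps a sentence (the input) to the number of words it contains (the output):

>>> sentence = "If it is 3:50 pm, write 'It's time to go.'"
>>> sentence.split()                  # break the sentence into words
['If', 'it', 'is', '3:50', 'pm,', 'write', "'It's", 'time', 'to', "go.'"]
>>> len(sentence.split())             # how many words are there?
10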

3.3. Computational culture

3.3.1. Google’s n-gram viewer

3.3.1.1. Einstein vs. Holmes vs. Frankenstein

Go to the webpage for Google’s n-gram viewer [1], which looked like this the last time I checked:

_images/1-ngram_default.png

There are two things to focus on. One is the input line which has already been populated with the names “Albert Einstein,Sherlock Holmes,Frankenstein”; the other is the graph, each line of which is labeled with one of the names. Before trying to understand the lines, you should figure out what the axes mean. The x or horizontal axis is labeled with years, from 1800 to 2000. They refer to the year of publication of each book in the vast English-language corpus that Google makes available through the n-gram viewer. The y or vertical axis is labeled with percentages – teeny tiny percentages. They measure the ratio of each name to the total number of words in the books published in a year given by the horizontal axis.

The ratio of the number of occurrences of a word or phrase to the total number of words in a corpus is called the frequency of the word or phrase. We can use the frequencies of the three names as a rough estimate of the interest or popularity of the three figures. Which one seems to be the most popular? Does it surprise you that the two imaginary figures are more popular than the real one? Do you really think that Frankenstein is more popular than Sherlock Holmes?
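
The arithmetic behind each point on such a graph is simple enough to sketch in Python. Here it is for a made-up, ten-word corpus (the real corpus is of course vastly larger):

>>> corpus = ['the', 'hound', 'stalked', 'the', 'moor', 'while',
...           'sherlock', 'holmes', 'watched', 'silently']
>>> corpus.count('sherlock') / len(corpus)    # occurrences divided by total words
0.1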

Another kind of question that can be asked comes from looking at the exact shape of the frequency lines. The line for Sherlock Holmes has at least six peaks, at 1905, 1916, 1931, 1944, 1959, and 1978, and then it flattens out. What could have happened around each of those years to pique interest in the detective? What I find even more interesting is that the peaks of 1931, 1944, and 1959 were falling in amplitude (each peak was lower than the previous one) until the line bottoms out in 1966, only to rocket up to the peak of 1978. Does this quantitative pattern reveal a renaissance of Sherlock Holmes after 1966?

Of course, there is not enough information in this graph to answer these questions. It is useful mainly for a bird’s-eye, orienting view of the relationship between the three figures. In terms of the scientific method, such a graph of word frequencies aids in the early phase of data collection: if you are interested in Sherlock Holmes, for instance, you should concentrate on the six years mentioned above.

3.3.1.2. Men vs. women

You can enter any words that you want into the box that holds “Albert Einstein,Sherlock Holmes,Frankenstein”. Delete those three and type in “man,woman,men,women”. Tick the case-insensitive checkbox to the right to count both upper- and lowercase instances of these words. Additionally, change the start year to search as far back as Google’s corpus goes, 1500. Hitting Search lots of books should produce a graph like this one:

_images/1-ngram_man_woman.png

The blue-green “man-men” lines bounce around erratically until about 1700, as do the red-orange “woman-women” lines, though at a lesser amplitude. This erratic behavior presumably has to do with the smaller number of texts from the early centuries. There is a dramatic drop in the frequency of the masculine terms from 1720 until 1765, at which point they level out until 1900, when they undergo another decline. It is curious that these changes are not mirrored by rises in the feminine terms, so it is not that the authors stopped writing about men to write about women. At least, not until 1967. That is the year that the frequency of “women” shoots up inversely to the fall in the masculine terms, as if from 1967 on, authors did indeed stop writing about men to focus on women. Can you think of anything that happened around 1967 that would account for this change?

3.3.1.3. Cheerful vs. gay vs. homosexual

It was about 1973 when I first heard “gay” used to mean “homosexual”. Does Google’s corpus confirm my recollection? Keeping the same options as the previous search, type in “cheerful,gay,homosexual” and hit Search lots of books to get the following graph:

_images/1-ngram_cheerful.png

Again, the lines bounce around in the early centuries to stabilize around 1700. “Gay” then begins a long climb upward, hitting its last high in 1822. “Cheerful” follows a similar trajectory, though starting later and peaking later, in 1859. The frequency of both words declines thereafter, as if authors fell into a century-long depression. (Can you think of anything that happened around 1859?) At least, until “gay” shoots up to its contemporary height. The bottom of the trough before this final climb is tricky to pinpoint, but 1972 seems close enough. It is not unreasonable to conclude that this is the date of the transition in the usage of “gay” from ‘cheerful’ to ‘homosexual’. With respect to the word “homosexual” itself, its frequency marches up steadily from 1900, having first been used in print in 1892. [2]

3.3.1.4. Mead vs. beer vs. wine vs. whiskey

As a final foray into the lexicographic history of English, try “mead,beer,wine,whiskey”:

_images/1-ngram_mead_wine.png

To my mind, it is odd that “wine” appears to be the most popular alcoholic beverage, since the English-speaking lands are more conducive to the cultivation of the cereals from which beer is brewed than to the grapes from which wine is fermented. “Beer” is always a distant second, though. “Whiskey” barely makes an appearance, though it too is a product of cereals. Have you ever heard of mead? It is fermented honey and was drunk by the warriors in Beowulf. It had apparently fallen out of favor by the time covered by Google’s corpus.

3.3.1.5. Love vs. sex

Sorry, I can’t resist. One more. Draw your own conclusions:

_images/1-ngram_love_sex.png

3.3.2. America vs. citizen in presidential inaugural addresses

So far, our examples are drawn from Google’s corpus, which we only have access to through the N-gram Viewer. There are dozens if not hundreds of other corpora out there that we can get our digital hands on in order to analyze in more depth, but that means that we have to build the analytical tools ourselves. There is no app for that. Learning how to build such tools is the goal of this course.

As a first example, the beginning of a US president’s term of office on January 20th is marked by a speech known as an inaugural address. Every such address from George Washington’s first term in 1789 to Barack Obama’s first term in 2009 is available in a single digitized corpus. Plotting the frequency of the words “America” and “citizen” by year in this corpus produces the following graph:

_images/1-inaugural_america_citizen.png

“Citizen” acts as a sort of control, in that it occurs at a steady rate. What is interesting is that the post-World War I, and especially post-World War II, presidents have seen fit to mention “America” at an increasing rate. The reasons for this change in the 20th century are undoubtedly worth a deeper look.
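
If you would like to reproduce something like this graph yourself, the inaugural corpus ships with the NLTK package. The following is a minimal sketch, assuming NLTK and matplotlib are installed and the corpus has been downloaded; note that it plots raw counts per address rather than the frequencies in the figure above:

>>> import nltk
>>> from nltk.corpus import inaugural
>>> # nltk.download('inaugural')          # run this once if the corpus is missing
>>> cfd = nltk.ConditionalFreqDist(
...     (target, fileid[:4])              # each fileid starts with the year, e.g. '1789-Washington.txt'
...     for fileid in inaugural.fileids()
...     for word in inaugural.words(fileid)
...     for target in ['america', 'citizen']
...     if word.lower().startswith(target))
>>> cfd.plot()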

3.4. Computational culture is culturomics

3.5. Computational culture likes big data

You may have heard the term “big data” thrown around as the next technological advance that will change the world. But don’t take my word for it, ask the N-gram Viewer:

_images/1-ngram_big_data.png

3.6. What is the difference between computational culture and computational linguistics?

Endnotes

[1] Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W., & Petrov, S. (2012, July). Syntactic annotations for the Google Books Ngram corpus. In Proceedings of the ACL 2012 System Demonstrations (pp. 169-174). Association for Computational Linguistics.
[2] For the earliest usage in print of “homosexuality” in the USA, see Kiernan: “Heterosexual,” “Homosexual,” May 1892.

Last edited: September 29, 2016