6. Computation with regular expressions

Let us say that you want to analyze the string below from Shakespeare’s Hamlet. For instance, you could extract all two-letter words:

1
2
3
>>> S = '''This above all: to thine own self be true,
... And it must follow, as the night the day,
... Thou canst not then be false to any man.'''

So far, we have no way of doing so, yet it is a basic step in textual computation.

Note

The code script for this chapter is nlp6.py, which you can download with codeDowner(), see Practice 1, question 2.

6.1. How to find all matches with findall()

It is so basic, in fact, that an entire programming language for such string analysis has been invented. The statements in this language are called regular expressions, often shortened to regex, and Python includes a module for using them, called re. Re has several methods, but the one that we will start with, called findall(). Go ahead and import it now:

>>> from re import findall

findall() searches a string for all occurrences of a pattern:

Note

findall(pattern, target string)

The essence of the regular expression language is to allow you to design a pattern appropriate for the linguistic task that you have in mind.

There is a fundamental dichotomy in the formulation of regular expressions between those that match a string whose length is known in advance and those that match a string whose length can vary. We start with the former, which is conceptually simpler.

6.2. How to create a fixed-length pattern to match

6.2.1. Match a string

Strings themselves are regular expressions, so our first try is to look for one, such as ‘ be ‘, in S:

1
2
>>> findall(' be ', S)
[' be ', ' be ']

The result is two tokens of the pattern, separated by a comma and enclosed in square brackets. This is a new data structure for us, called a list. We won’t do anything with the lists output by findall() in this chapter, so we won’t say anything more about them, except that they behave a whole lot like strings.

6.2.2. Match one item of a disjunction with |

Recall that the task with which the chapter began was just not to match ‘ be ‘, but rather to match all the two-letter words in the string. There are only four types in it, so it would be easy to mention them all, as in ‘ to ‘ or ‘ be ‘ or ‘ it ‘ or ‘ as ‘. Re includes a meta-character to restate this disjunction as a string, |, called pipe:

1
2
>>> findall(' to | be | it | as ', S)
[' to ', ' be ', ' it ', ' as ', ' be ', ' to ']

The set method removes duplicates, thereby turning the six tokens into the four types:

1
2
>>> set(findall(' to | be | it | as ', S))
set([' it ', ' as ', ' to ', ' be '])

6.2.3. Match a group of characters with capturing or non-capturing parentheses, ()

You may have gotten tired of typing so many blank spaces in the previous example, especially since you might have figured out that they are redundant – you only need one pair of spaces around the whole disjunction of two-letter words. Re supplies a meta-character to limit the disjunction to just the letter pairs, namely the parentheses, (). The previous pattern can be reduced to this, much easier to type, one:

1
2
>>> findall(' (to|be|it|as) ', S)
['to', 'be', 'it', 'as', 'be', 'to']

But it doesn’t work the way you expect – it outputs just the words, without the spaces. Now, we don’t really want the spaces, so this is a double improvement, but there may be situations in which we want to keep the material outside of the parentheses in the output.

Re lets you do so by adding ?: after the open parenthesis:

1
2
>>> findall(' (?:to|be|it|as) ', S)
[' to ', ' be ', ' it ', ' as ', ' be ', ' to ']

The default behavior of parentheses is to capture the string inside them in the output. The ?: prefix turns capturing off. For the rest of this discussion, we prefer to exclude the spaces from the output.

6.2.4. Match a range with [] and its negation with [^]

Though the current version is accurate, it is too specific in that it would not generalize to a longer text that contained another two-letter word such as at. We really want to design a regular expression that says something like any two letters. Re is up to the task. It represents any lowercase letter as [a-z], any uppercase letter as [A-Z], and any digit as [0-9]. All we need are two of the first:

1
2
>>> findall(' ([a-z][a-z]) ', S)
['to', 'be', 'it', 'as', 'be', 'to']

One of the most useful properties of square brackets is that they permit a negation operator, the caret ^. Thus the same group of six words can be matched by searching for those characters that are not digits:

1
2
>>> findall(' ([^0-9][^0-9]) ', S)
['to', 'be', 'it', 'as', 'be', 'to']

The dash representation of square brackets is not limited to just the beginning and end of the alphabet or digit sequence; it can abbreviate any sub-sequence that preserves the natural order. By way of example, a regular expression can match just the two-letter words that fall in the first five letters, [a-e], or not:

1
2
3
4
>>> findall(' ([a-e][a-e]) ', S)
['be', 'be']
>>> findall(' ([^a-e][^a-e]) ', S)
['to', 'it', 'to']

Ranges have a few more tricks up their sleeve that we will take up below.

6.2.5. Match a number of repetitions with {}

At this point, you may wonder whether you have to go through all the effort of typing [a-z] twice. Shouldn’t there be a regular expression for matching repetitions of a character? The gods of regular expressions took pity on you and created the curly bracket meta-character, {}, which surrounds an integer which sets the number of repetitions to match:

1
2
>>> findall(' ([a-z]{2}) ', S)
['to', 'be', 'it', 'as', 'be', 'to']

Curly brackets have an additional syntactic twist that we shall come back to below.

6.2.6. Match any character with .

You may think that the pattern with square brackets is the most accurate for the task, and you would be right. Nevertheless, re allows one more level of generality. By coincidence, the two-letter words in our string also happen to be the only two-character sequences in it. Re has a meta-character for matching any character, the period .

1
2
3
4
>>> findall(' (..) ', S)
['to', 'be', 'it', 'as', 'be', 'to']
>>> findall(' (.{2}) ', S)
['to', 'be', 'it', 'as', 'be', 'to']

Both versions give the right answer for our string as it is, but of course they would produce the wrong answer on any variation that contained a pair of non-letter characters like ‘ 15 ‘ or ‘ !! ‘, or a mixture of both, ‘ a1 ‘.

6.2.7. Practice 1

To summarize, we have reviewed the following six regular expressions for solving the task of finding all two-letter words in the target string:

  1. ' to | be | it | as '
  2. ' (to|be|it|as) '
  3. ' ([a-z][a-z]) '
  4. ' ([a-z]{2}) '
  5. ' (..) '
  6. ' (.{2}) '

The third and fourth are the most accurate. The first two are too specific because they would fail on a new string that contained a different two-letter word. The last two are too general because they would not fail on a new string that contained a two-character non-word. This gives us a lot of options, so let us try to narrow them down by timing them all.

Start a new script with the name regex.py. Define a function regexTimer() in which you will time all six regular expressions. That is to say, regexTimer() will itself contain six functions, one for finall() applied to the string S with each regular expression. Due to a quirk in the way timeit works, all of the information must be contained within the function to be timed. Thus you have to repeat the string within each function. After each function, write a line of timeit code to time the function, and save the result to a variable. To make sure that I have explained the task adequately, I am going to get you started:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
def regexTimer():
    from re import findall
    from timeit import timeit

    def f1():
        S = '''This above all: to thine own self be true,
        And it must follow, as the night the day,
        Thou canst not then be false to any man.'''
        findall(' to | be | it | as ', S)

    time1 = timeit(f1)

You just add the other five.

This is mostly copy and paste, so it’s not very challenging. As a first innovation, I want your function to return a list of times. Recall that a list is delimited by square brackets, and each member is set off by commas. Thus the full outline of regexTimer() looks like:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
def regexTimer():
    from re import findall
    from timeit import timeit

    def f1():
        S = '''This above all: to thine own self be true,
        And it must follow, as the night the day,
        Thou canst not then be false to any man.'''
        findall(' to | be | it | as ', S)

    time1 = timeit(f1)

    # MORE CODE HERE

    return [time1, ]

As a consequence, you have easy access to all the times recorded. What can you do with that? Rhetorical question, but … you can graph them!!

Python has an excellent though complex graphing package called matplotlib which is usually accessed through a module called pyplot, so to call one of its (many) methods you would have to prefix it with matplotlib.pyplot, which would quickly make your code unreadable. Thus everyone abbreviates this bear as plt:

>>> import matplotlib.pyplot as plt

The first step is to set up a plotting window:

>>> plt.figure()

You are going to plot the times in the basic way, as a two-dimensional line graph. The first thing is always to define the horizontal or X axis. It has to be list of numbers. Just number each function:

>>> X = [1,2,3,4,5,6]

I didn’t use spaces because I typed it with my two index fingers, the left of the number and the right for comma. The next step is to define the vertical or Y axis, which also has to be list of numbers. Do you have a list of numbers? Yes, the list of times output by regexTimer()! So assign that to Y:

>>> Y = regexTimer()

You now have enough data to plot. Call pyplot’s basic method plot() for line plots, which needs X and Y arguments:

>>> plt.plot(X,Y)

This is enough to display, but it’s not as informative as it could be. Add labels to each axis and then a title to the whole graph:

1
2
3
>>> plt.xlabel('Regex')
>>> plt.ylabel('Time')
>>> plt.title('Our Fantastic Graph')

Finally, behold your handiwork:

>>> plt.show()

Just one more thing, and then you are finished. In the instructions for pyplot I indicate that you type everything into the console so that you can watch what happens. However, typing it all into the script regex.py lets you do something cool. So go ahead and type all the plotting code after the return line of regexTimer(), but unindented.

Now, finally, hit Spyder’s Run button, the right-pointing green triangle, to see what happens. That’s what we will talk about in class.

6.2.8. Will the best regex please stand up?

6.2.8.1. Under-fitting vs. over-fitting

This challenge of finding the regular expression that is just right may remind you of the story of Goldilocks and the three bears, in which Goldilocks tried to find the bowl of porridge that was neither too hot nor too cold. Statisticians have their own version of Goldilocks, which evaluates how well a statistical analysis fits the data that it is applied to. An analysis that over-fits the data is too specific, in that it excludes data points from a larger data set that should be included. Conversely, an analysis that under-fits the data is too general, in that it includes data points from a larger data set that should be excluded. In our example, the first two regular expressions over-fit the data set (at should be included), while the last two under-fit it (19 should be excluded). [1]

6.2.8.2. False positives and false negatives

Statistical test theory provides an alternative way of conceptualizing the problem, which I unfortunately can’t figure out how to tie in to Goldilocks. Though it is usually illustrated in terms of medical tests, I believe that explaining it in terms of legal ‘tests’ is easier to understand. Imagine that a person is charged with a crime and goes through a trial. If she is guilty and the verdict is guilty, the trial has produced a true positive data point: a guilty person is found guilty. Conversely, if she is innocent and the verdict is not guilty, the trial has produced a true negative data point: a not-guilty person is found not guilty.

We expect that an accurate test only produces true positives and true negatives, but there are two more logical possibilities that leave room for a test to be nearly accurate. One is for an innocent person to be found guilty. This is called a false positive data point, because the accused should have failed the test but instead passed it. Alternatively, if a guilty person is found innocent, the legal test has produced a false negative data point, because the accused should have passed the test but instead failed it. [2]

All four possibilities are summarized in Four outcomes of a trial:

Table 6.1 Four outcomes of a trial
  true false
positive guilty found guilty innocent found guilty
negative innocent found not guilty guilty found not guilty

It is commonly held that the American legal system is designed to err on the side of false negatives, in the sense that it is better for the guilty to be judged not guilty once in a while than for the innocent to be judged guilty even once.

6.2.8.3. Summary of the two sorts of regex evaluation

The table Four outcomes of a regular expression attempts to bring the two ways of evaluating a regular expression together, along with some examples. Each cell states how the string indicated is fit by the regular expression indicated, which has been shorn of blank space. The location of the cell within the table shows how the evaluation is classified as a statistical test.

Table 6.2 Four outcomes of a regular expression
  true false
positive evaluation of ‘to’ by [a-z]{2} results in good fit evaluation of ‘at’ by (?:to|be|it|as) results in under-fit
negative evaluation of ‘the’ by [a-z]{2} results in bad fit evaluation of ‘19’ by .{2} results in over-fit

6.2.9. More on ranges and negation

Let us consider another task, that of matching the vowels in a string such as:

>>> S2 = 'otolaryngologist'

English only has five letters for vowels, so it would be easy enough list them in a disjunction:

>>> findall('a|e|i|o|u', S2)

It turns out that square brackets can also be used as a disjunction:

>>> findall('[aeiou]', S2)

While the two approaches produce identical results, remember that square brackets permit negation, so they provide a certain flexibility that disjunction with | does not. The negation of the vowel expression easily matches the consonants of the word:

>>> findall('[^aeiou]', S2)

6.2.10. A range of repetition with {}

The quirk of curly brackets that was omitted from the introduction above is that, as with slicing, {2} is an abbreviation of the full syntax for setting a minimum and maximum number of repetitions, {2,2}.

Note

character{minimum, maximum}

This elaboration is illustrated with these two strings:

1
2
>>> S3 = 'bookkeeper'
>>> S4 = 'goddessship'

First, some more examples of a fixed repetition:

1
2
>>> findall('[aeiou]{2}', S3)
>>> findall('[^aeiou]{3}', S4)

Here is a range of repetition:

>>> findall('[^aeiou]{2,3}', S4)

6.2.11. How to use multiple () and name them

A pattern can have more than one pair of parentheses:

1
2
3
>>> SK = 'Stephen Kleene'
>>> findall('([A-Z][a-z]{6}) ([A-Z][a-z]{5})', SK)
[('Stephen', 'Kleene')]

I have reproduced the match. Can you identify it? It is two strings inside parentheses – the tuple delimiter – inside square brackets. That is, it is a list of a tuple of two strings. The pattern has successfully parsed out the two words of the input string.

It would be really cool if the parentheses could be named, to allow them to be distinguished functionally. Oh wait, look:

1
2
>>> findall('(?P<first>[A-Z][a-z]{6}) (?P<last>[A-Z][a-z]{5})', SK)
[('Stephen', 'Kleene')]

The syntax is (?P<name>), where name must be a valid Python name and can only be defined once in a regular expression. Re calls the sequences named within the parentheses groups. A group name can be used again inside a pattern, as if it were a variable, with (?P=name), though I have not come up with a plausible example yet. Other methods besides findall() get more mileage from group names, so we return to the topic in How to access group names below.

6.2.12. How to match the beginning or end of a string with ^ and $

As a final observation, there are two positions on a string that always exist, its beginning and its end. Re encodes them with ^ and $, respectively. In the following expression, these meta-characters match the first and last character of Shakespeare’s quote:

>>> findall('^.|.$', S)

6.2.13. Practice 2

Note

If you ever need to brainstorm a regular expression, there is an on-line tester at Regex Pal.

In the following tasks, assume that a word is a sequence between two spaces, which excludes all, true and day from Shakespeare’s quote S. For most of the tasks, you can imagine that you are solving a crossword puzzle. Match in S …

  1. the five-letter words that end in ‘t’.
  2. the four-letter words that start with a capital letter.
  3. the six-letter words followed by a comma that have ‘o’ as the second and fourth letters.
  4. the five-letter words whose fourth letter can be ‘v’ or ‘n’. (two ways)
  5. the three-letter roots that come from removing the suffix ‘st’ from five-letter words.
  6. the words of up to five letters that end in ‘e’.
  7. the words that begin with ‘th’ in any case.
  8. the words that are at least five letters long.
  9. the three-letter words that don’t end in ‘n’.
  10. all the punctuation (which includes the newline character ‘\n’).

6.3. How to create a variable-length pattern

Imagine that you are given string S5 and told to match the prepositions at the end of each word:

>>> S5 = 'break breakup breakout breakdown breakthrough '

This is a problem, because we don’t know how many characters to match for each preposition. OK, in S5 we do, but maybe S5 isn’t representative of the whole data set. We don’t want to over-fit right from the beginning.

6.3.1. Match an unknown number of characters with + and *

Re offers two meta-characters to solve the problem of the length of unknown strings, which are illustrated below, with the help of capturing parentheses to extract just the prepositions:

1
2
>>> findall('break([a-z]+) ', S5)
>>> findall('break([a-z]*) ', S5)

The first gives the expected answer, a list of four strings – the four prepositions. The second gives an unexpected answer, a list of five strings – the four prepositions plus the empty string. Based on this difference in length, can you explain the contrast between + and *?

+ is often known as the one-or-more operator, since it matches one or more instances of the previous character. Thus it matches the four prepositions, but it does not match break all by itself, because there has to be at least one character after it to match. *, in contrast, is often known as the zero-or-more operator, since it matches zero or more instances of the previous character. Thus it matches the four prepositions, as well as break all by itself, because there are effectively zero characters after it. Note that for this case, re outputs the ‘zero-content’ character, the null string ‘’.

6.3.2. Match optional characters with ?

A similar sort of problem is to match the singular and plural forms of fish in S6:

>>> S6 = 'fish fishes fishy fisher '

In terms of string transformations, the plural suffix is an optional substring of the singular. This is the effect of ?, which is shown with non-capturing parentheses:

>>> findall('fish(?:es)? ', S6)

Optionality is a kind of disjunction, so the previous expression can be restated with |:

>>> findall('fish |fishes ', S6)

But this formulation doesn’t encode the asymmetry between the two words. Fishes is a variant of fish. That is to say, | over-fits the data.

Optionality is equivalent to matching zero or one instances of the string in question, so it can be mimicked with curly brackets:

>>> findall('fish(?:es){0,1} ', S6)

This seems to read less intuitively, though.

Optionality provides a means to do a rough morphological matching:

1
2
>>> S6 = 'fish fishes fishy fisher fishers '
>>> findall('fish(?:er)?(?:s)? ', S6)

Restating the optional substrings as a choice among entire words obfuscates the linguistic generalization that er and s are suffixes to nouns:

>>> findall('fish |fisher |fishers ', S6)

In fact, it is not clear at all why these three are thrown together, and not some other random assortment of words.

6.3.3. +, * and ? match lazily with ?

You wouldn’t want to trust the plus, star and question meta-characters with your wallet, since they take as much as they can get. Consider the problem of matching one of a sequence of quotes:

>>> S7 = "'Cool' never goes out of style, but 'gnarly' does."

You should jump to the conclusion that '.*' should match both quoted words. Is that so?

>>> findall("'.*'", S7)

This is because * (and + and ?) are greedy – they match the largest number of characters that they can. So even though you may see ‘Cool’ and gnarly as the only accurate matches, what * actually sees is something like ‘Cool never goes out of style, but gnarly’.

If such greedy matching is not desired, it can be turned off by suffixing the three meta-characters with a question mark:

>>> findall("'.*?'", S7)

Turning off the greediness of ? means typing two question marks, ??, which I have not found a good example of yet.

6.3.4. Summary table

The various regular expressions reviewed in this section are organized into a table of Simple meta-characters:

Table 6.3 Simple meta-characters
meta-character matches name notes
a|b a or b disjunction  
(ab) a and b grouping only outputs what is in (); (?:ab) for rest of pattern
[ab] a or b range [a-z] lowercase, [A-Z] uppercase, [0-9] digits
[^a] all but a negation  
a{m, n} from m to n of a repetition a{n} a number n of a
^a a at start of S    
a$ a at end of S    
a+ one or more of a   a+? lazy +
a* zero or more of a Kleene star a*? lazy *
a? with or without a optionality a?? lazy ?

6.3.5. Practice 3

You now can restate some of the fixed-length problems more efficiently and take on some new ones. As before, assume that a word is a sequence between two spaces, which excludes all, true and day from Shakespeare’s quote S. For most of the tasks, you can imagine that you are solving a crossword puzzle. Match in S …

  1. the words that end in ‘t’.
  2. the words that start with a capital letter.
  3. the words that start with ‘th’ in any case.
  4. the words whose fourth letter can be ‘v’ or ‘n’. (two ways)
  5. the words that are at least five letters long.

6.4. How to create a pattern with character classes

6.4.1. Match any alphanumeric character or not with \w and \W

You may think that constantly typing [a-zA-Z0-9] to match letters and numbers would get tiresome, and the developers of the regular expression language agree. Re therefore includes a character class for abbreviating the range of letters and numbers, \w. It has a small peculiarity in that it also includes the underscore _, so that it is equivalent to the range [a-zA-Z0-9_]. If you think back to the conventions for variable names in Python, you may recall that a name can begin with a letter followed by any combination of letters, digits, or underscores. The underscore can stand in for a space if need be and so makes reading complex names easier. That is why it is included in the letter and number, or alphanumeric class. Re defines the negation of this class as \W, for all the non-alphanumeric characters.

The two regular expressions that provide the best fit to the task of matching all the two-letter words in Shakespeare’s quote can be restated with \w as so:

1
2
3
4
>>> findall(' \w\w ', S)
[' to ', ' be ', ' it ', ' as ', ' be ', ' to ']
>>> findall(' \w{2} ', S)
[' to ', ' be ', ' it ', ' as ', ' be ', ' to ']

6.4.2. Raw string notation with r

Python interprets regular expressions just like any other expression. This can lead to unexpected results with class meta-characters, because the backslash that they incorporate is sometimes also used by Python for its own constructs. For instance, we will meet a class meta-character \b, which marks the edge of a word. It will be extremely useful for us, but it happens to overlap with Python’s own backspace operator, \b.

The way to resolve this ambiguity is to prefix an r to a regular expression. The r marks the regular expression as raw text, so Python does not process it for special characters. The previous example is augmented with the raw text notation below:

1
2
3
4
>>> findall(r' \w\w ', S)
[' to ', ' be ', ' it ', ' as ', ' be ', ' to ']
>>> findall(r' \w{2} ', S)
[' to ', ' be ', ' it ', ' as ', ' be ', ' to ']

As a further illustration, what do you think are the non-alphanumeric characters in the Shakespeare text?

1
2
3
4
>>> findall(r'\W', S)
[' ', ' ', ':', ' ', ' ', ' ', ' ', ' ', ' ', ',', '\n',
' ', ' ', ' ', ',', ' ', ' ', ' ', ' ', ' ', ',', '\n',
' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '.']

Recall that the triple-quote notation for forming multi-line strings automatically inserts newlines to maintain the original line breaks.

6.4.3. Match any digit or not with \d and \D

The developers of the regular expression language have also found it convenient to have an abbreviation for the digits [0-9] as \d, whose negation is \D, the non-digit characters. It is equivalent to the range [0-9].

6.4.4. The non-printing characters \t, \v, \n, \r, \f

There is a group of characters that do not show up on your computer screen but are there to tell the cursor to move to certain places. To the best of my knowledge, they are inherited from teletype machines by way of typewriters. Some are obsolete, rarely used, or have been adapted for other purposes. I mention them for the sake of completeness.

The horizontal spacing triggered by a tab is encoded by \t. There is also a corresponding vertical tab that skips lines down a page for filling out a form, encoded by \v, which is no longer used. [3]

There are three class characters for ending a line. The carriage return \r moves the cursor back to the beginning of the current line. The linefeed character \n, nowadays known as newline, ends the current line and moves the cursor to the beginning of the next line. The form-feed character \f moves the cursor to the next page. [4]

We have already remarked that the Shakespeare string has newlines:

1
2
>>> findall(r'\n', S)
['\n', '\n']

6.4.5. Match any whitespace or not with \s and \S

Since the non-printing characters are not visible on the screen, you may not notice that they are there. But if they are there, any regular expression applied to them will produce unexpected results. Re consequently includes a class character for matching them, \s. It is convenient to also throw into this bag the space character ‘ ‘, since it is another sort of hidden character that we usually don’t need for anything. Taken together, [ tvnrf] are called whitespace. There is also a negation character for whitespace, namely \S, as you might guess.

We can now match the non-alphanumerics of the Shakespeare string as whitespace, and the rest of the text as non-whitespace:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
>>> findall(r'\s', S)
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '\n',
' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '\n',
' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
>>> findall(r'\S', S)
['T', 'h', 'i', 's', 'a', 'b', 'o', 'v', 'e', 'a', 'l', 'l', ':', 't',
'o', 't', 'h', 'i', 'n', 'e', 'o', 'w', 'n', 's', 'e', 'l', 'f', 'b',
'e', 't', 'r', 'u', 'e', ',', 'A', 'n', 'd', 'i', 't', 'm', 'u', 's',
't', 'f', 'o', 'l', 'l', 'o', 'w', ',', 'a', 's', 't', 'h', 'e', 'n',
'i', 'g', 'h', 't', 't', 'h', 'e', 'd', 'a', 'y', ',', 'T', 'h', 'o',
'u', 'c', 'a', 'n', 's', 't', 'n', 'o', 't', 't', 'h', 'e', 'n', 'b',
'e', 'f', 'a', 'l', 's', 'e', 't', 'o', 'a', 'n', 'y', 'm', 'a', 'n', '.']

Whoops, I lied. What is on the wrong side of the cut? The punctuation, ‘:’ and ‘.’ are non-whitespace, so they get matched along with the letters.

6.4.6. Match the edge of a word or not with \b and \B

An unexpected gift from re is that it defines a character class for the boundaries of a word, \b. A word is considered to be any alphanumeric sequence, so it includes underscores. The negation of \b is the utterly predictable \B.

So you can match any word with \b[a-z]+\b:

1
2
3
4
>>> findall(r'\b[a-zA-Z]+\b', S)
['This', 'above', 'all', 'to', 'thine', 'own', 'self', 'be', 'true',
'And', 'it', 'must', 'follow', 'as', 'the', 'night', 'the', 'day',
'Thou', 'canst', 'not', 'then', 'be', 'false', 'to', 'any', 'man']

The alphabetic range can be replaced with the alphanumeric character class:

1
2
3
4
>>> findall(r'\b\w+\b', S)
['This', 'above', 'all', 'to', 'thine', 'own', 'self', 'be', 'true',
'And', 'it', 'must', 'follow', 'as', 'the', 'night', 'the', 'day',
'Thou', 'canst', 'not', 'then', 'be', 'false', 'to', 'any', 'man']

This is a great improvement over the much less perspicuous usage of spaces to delimit words that we started this chapter with.

6.4.7. Match the beginning or end of a string with \A and \Z

Why re needs to have a character class for matching the beginning and end of a string, \A and \Z, respectively, when it already has the perfectly serviceable ^ and $ is beyond me, but there you have it. I guess some people just like working with character classes.

6.4.8. Summary table

The various regular expressions reviewed in this section are organized into a table of Meta-character classes:

Table 6.4 Meta-character classes
class abbreviates name notes
\w [a-zA-Z0-9_] alphanumeric it’s really alphanumeric and underscore, but we are lazy
\W [^a-zA-Z0-9_]   not alphanumeric
\d [0-9] digit  
\D [^0-9]   not a digit
\s [ tvnrf] whitespace  
\S [^ tvnrf]   not whitespace
\t   horizontal tab  
\v   vertical tab  
\n   newline  
\r   carriage return  
\f   form-feed  
\b   word boundary  
\B     not a word boundary
\A ^ beginning  
\Z $ end  

6.5. Practice 4

Induction vs deduction. Well, imagine what the relationship is between a regular expression and the substrings that it matches. The regular expression is a general statement; the substrings that it matches are specific instances. So you can do two types of exercises, a deductive one in which you are given a regular expression and asked to infer the substrings it matches, and an inductive one in which you are given a bunch of substrings are asked to induce a regular expression that they will match.

6.5.1. Deductive

  1. Restate these with character classes. Match in S …
  2. the words that end in ‘t’.
  3. the words that start with a capital letter.
  4. the words that start with ‘th’ in any case.
  5. the words whose fourth letter can be ‘v’ or ‘n’. (two ways)
  6. the words that are at least five letters long.
  7. all the punctuation (which includes the newline character ‘\n’).
  1. Explain:

    1
    2
    3
    4
    5
    6
    findall('[^aeiou]{3}', S3+'_'+S4)
    findall('a|e|i|o|u{2}', S3)
    findall('[^aeiou]{2,}', S4)
    findall('[^aeiou]{,3}', S4)
    findall('[^aeiou]{2,2}', S4)
    findall('fish(es)? ', S6)
    

6.5.2. Inductive

Todo

make a pattern, match it, and use matches

6.6. Other methods

So far, I have concentrated on re‘s findall() method because it returns the result of what it finds. Re provides several other functions, though, the most useful of which are previewed in this section.

6.6.1. How to separate a pattern from its execution with compile()

If a pattern is used over and over in your program, it will save time to convert it to a form that re can store internally, so re can get to it quicker. This also gives you the chance to name the pattern, which can make your code more perspicuous. Here is an example:

1
2
3
>>> from re import compile
>>> twoLetter = compile(r'\b[a-z]{2}\b')
>>> findall(twoLetter, S)

Assuming as usual that the name is informative, I find this easier to understand than the version in which it is all in findall().

6.6.2. How to find the first token with search() and match()

If you only need to find the first token of a pattern, re provides the methods search() and match(). Search() scans the whole string; match() tries to find the pattern starting at the beginning of the string. In both cases, if the pattern is found, a MatchObject is returned; if the pattern is not found, nothing is returned:

1
2
3
>>> from re import search, match
>>> search(twoLetter, S)
>>> match(twoLetter, S)

The problem is that a MatchObject is opaque to you without further processing. Fortunately, it affords a lot of options, but the two that stand out right now are to return the match itself, and its index:

1
2
3
4
5
6
>>> m1 = search(twoLetter, S)
>>> m1.group(0)
>>> m1.start()
>>> m1.end()
>>> m2 = match(twoLetter, S)
>>> m2.group(0)

The group() method in line 2 returns as many hits as are matched; in our example, there is only one, whose index is 0. Start() and end() return the indices of the beginning and end of the hit, as you might imagine. None of these apply happily to the MatchObject returned by match(), because it is empty.

6.6.2.1. How to access group names

You may recall from How to use multiple () and name them that parentheses can be named. A MatchObject gives you access to the names through group()

1
2
3
4
5
6
7
>>> SK = 'Stephen Kleene'
>>> m3 = search(r'(?P<first>[A-Z][a-z]+) (?P<last>[A-Z][a-z]+)', SK)
>>> m3.group(0)
>>> m3.group(1)
>>> m3.group('first')
>>> m3.group(2)
>>> m3.group('last')

The first question is, did the pattern match the entire string? Line 3 says it did. But the pattern has parentheses, so you can look for additional matches. Did the first parenthesis match? Line 4 looks for a match through its index, and line 5, its name. Ditto for the second pair of parentheses: any match is accessed through its index, line 6, or its name, line 7.

Before moving on, you should confirm that match() also works on this pattern, because the whole string matches from the beginning.

Named parentheses give you the superhuman ability to categorize strings inside re, rather than just extracting them with re and performing a categorization with ‘normal’ code.

more See Match Objects in Python’s documentation for more information.

6.6.3. How to divide a string into substrings with split()

To do a sort of inverse of findall(), which is to say, return a list of strings that are separated by a certain character, re includes split()

1
2
>>> from re import split
>>> split(' ', S)

This should get close to returning all the words in the Shakespearean string.

6.6.4. Substitute one regex for another with sub()

Oh, no, re is haunted by the ghost of replace()!

1
2
>>> from re import sub
>>> sub(twoLetter, 'XX', S)

Use at your own risk.

6.6.5. Practice

Todo

You betcha!

6.7. Summary

Todo

yes

6.8. Further reading

Python’s documentation of regular expressions is found at 7.2. re — Regular expression operations. The most thorough on-line tutorial I have found is at regular-expressions.info. Questions about regular expressions are answered at Stack Overflow under the tag regex.

6.9. Appendix: How to create a Unicode pattern with UNICODE

To explore how re handles Unicode, let’s use a more realistic string, a famous quote from Don Quijote:

Yo sé quién soy, respondió don Quijote, y sé que puedo ser no sólo los que he dicho, sino todos los doce Pares de Francia, y aun todos los nueve de la Fama, pues a todas las hazañas que ellos todos juntos y cada uno por sí hicieron, se aventajarán las mías.

It comes out like this in UTF-8:

1
2
3
4
5
>>> C = '''Yo s\xc3\xa9 qui\xc3\xa9n soy, respondi\xc3\xb3 don Quijote,
... y s\xc3\xa9 que puedo ser no s\xc3\xb3lo los que he dicho, sino
... todos los doce Pares de Francia, y aun todos los nueve de la Fama,
... pues a todas las haza\xc3\xb1as que ellos todos juntos y cada uno
... por s\xc3\xad hicieron, se aventajar\xc3\xa1n las m\xc3\xadas.'''

The first step is to decode it from UTF8 to Unicode:

1
2
>>> uC = C.decode('utf8')
>>> uC

Now look for the two-letter words in uC, using the UNICODE flag. There are many possibilities, but these four make the point best:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
>>> from re import findall, UNICODE
>>> findall(r'\b[a-zA-z]{2}\b', uC)
[u'Yo', u'no', u'lo', u'he', u'de', u'de', u'la', u'as', u'se', u'as']
>>> findall(r'\b.{2}\b', uC)
[u'Yo', u' s', u'\xe9 ', u'\xe9n', u', ', u'\xf3 ', u', ', u'y ', u'\xe9 ', u'no',
u' s', u'lo', u'he', u', ', u'de', u', ', u'y ', u'de', u'la', u', ', u' a', u'as',
u' y', u' s', u'\xed ', u', ', u'se', u'\xe1n', u' m', u'as']
>>> findall(r'\b\w{2}\b',uC)
[u'Yo', u'no', u'lo', u'he', u'de', u'de', u'la', u'as', u'se', u'as']
>>> findall(r'\b\w{2}\b', uC, UNICODE)
[u'Yo', u's\xe9', u's\xe9', u'no', u'he', u'de', u'de', u'la', u's\xed', u'se']

The alphabetic range [a-zA-Z] of line 2 is limited to ASCII and so is useless for our text. The wild card meta-character . of line 4 matches the non-ASCII characters but also space, so it is equally useless. The alphanumeric character class \w in line 6 acts like [a-zA-Z] in line 2 by only matching ASCII characters – unless the UNICODE flag tells it to treat the non-ASCII letters like ASCII letters. This behavior of \w with UNICODE extends to the other character classes.

6.10. Powerpoint and podcast

Endnotes

[1]Wikipedia has a rudimentary page on over-fitting but none on under-fitting. A search for either term turns up some introductory material from the literature on machine learning.
[2]Wikipedia’s explanation of statistical hypothesis testing puts it in the more general framework of Type I and type II errors.
[3]See the discussion at Stack Overflow on vertical tabs.
[4]See the discussion at Stack Overflow on terminating lines.

Last edited: Dec 05, 2016