4. Computation with strings

4.1. How to create a string

Note

The code from the text for this chapter can be downloaded as nlp4.py, presumably to your Downloads folder, and then moved to pyScipts, from whence you can open it in Spyder and run each line one by one.

A string is a sequence of characters enclosed in single or double quotes. Here are some examples:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
>>> monty = 'Monty Python'
>>> monty
'Monty Python'
>>> doublemonty = "Monty Python"
>>> doublemonty
'Monty Python'
>>> circus = 'Monty Python's Flying Circus'
File "<stdin>", line 1
   circus = 'Monty Python's Flying Circus'
                          ^
    SyntaxError: invalid syntax
>>> circus = "Monty Python's Flying Circus"
>>> circus
"Monty Python's Flying Circus"
>>> circus = 'Monty Python\'s Flying Circus'
>>> circus
"Monty Python's Flying Circus"

Did you spot the reason for the error in 'Monty Python's Flying Circus' in line 7? Did you notice how the double quotes are used to avoid it in "Monty Python's Flying Circus" in line 12? Finally, did you grasp the usage of \ as an escape character in 'Monty Python\'s Flying Circus' in line 15 to tell Python to not process ' as a string delimiter, but rather as an ordinary single quote or apostrophe?
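If you would like to confirm that the two working spellings build the very same string, you can compare them. This is only a sketch, not part of the book's script: the names double_quoted and escaped are made up for illustration, and print() is written with parentheses so that it runs under both Python 2 and Python 3.

```python
# Two ways of putting an apostrophe inside a string.
double_quoted = "Monty Python's Flying Circus"   # delimited with double quotes
escaped = 'Monty Python\'s Flying Circus'        # inner single quote escaped

# The backslash is an instruction to Python, not a character in the string,
# so the two spellings compare as equal and have the same length.
same = (double_quoted == escaped)
print(same)
print(len(double_quoted))
print(len(escaped))
```

The backslash only matters while Python reads your code; it does not survive into the string itself.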

Note

Since delimiting a string with double quotes solves the problem of including a single quote in a string, many programmers prefer using double quotes to delimit all their strings. I prefer single quotes, however, because they save the effort of having to hit the shift key twice. Plus, as the doublemonty example shows, if there is no need for double quotes, Python defaults to single quotes. I find it rather disconcerting for Python to change my code.

A new string can be formed by combination or concatenation of two strings with + or by repetition of a string a number of times with *. Unfortunately, a character cannot be deleted with -. The examples below illustrate these operations as input code without the corresponding output. It is up to you to type in each line to see what it does – or run it line by line from the script mentioned at the beginning of the section. Try to guess what line 8 does before you try it:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
>>> B = 'balloon'
>>> B
>>> B+'s'
>>> B+s
>>> 'red '+B
>>> 'red '+B+'s'
>>> B*2
>>> B+'s'*2
>>> (B+'s')*2
>>> B-'n'
>>> B+2
>>> B+'2'

While the output of these commands may seem straightforward enough, it illustrates two fundamental aspects of computation that are worth highlighting with their own subsections: operator precedence and object type.

Warning

You should not use the word “string” as a name for a string, because it is already taken by the string module.

Tip

Python has a neat trick for abbreviating the concatenation and repetition operations:

>>> B += 's'    # which is equivalent to B = B+'s'
>>> B *= 2      # which is equivalent to B = B*2

This trick works with the other arithmetic operators, too, but we won’t have much to say about them in this book.

4.1.1. Operator precedence

You may think that line 8 is ambiguous in that it can be processed in two ways, illustrated in this diagram:

_images/3-PlusstarAmbiguity.png

Fig. 4.1 Ambiguity of B+’s’*2

Author’s diagram.

Yet Python produces a unique output. This is so because it adopts the convention that some operators apply before others, in order to cut down on the use of parentheses. In particular, * and / apply before + and -, so that line 8 is processed as if it were B+('s'*2), which is the right diagram above. If that is not what you want, you have to add parentheses yourself, as in line 9, which corresponds to the left diagram above.
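The two groupings can be placed side by side to check this reading. The variable names default, explicit and grouped are my own labels for this sketch, not anything from the chapter's script.

```python
B = 'balloon'

default = B + 's' * 2       # no parentheses: * applies before +
explicit = B + ('s' * 2)    # the same grouping, spelled out
grouped = (B + 's') * 2     # parentheses force + to apply first

print(default)
print(default == explicit)  # the unparenthesized form matches the right diagram
print(grouped)
```

When in doubt, adding the parentheses costs nothing and makes the intended grouping visible to the reader.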

4.1.2. Object type

The error that should have greeted you upon submitting line 11 above suggests that Python can distinguish between a number and a string. It can, because it encodes them via a difference in type. You can check this with the type() method:

>>> type(B)
>>> type(2)

The first answers with <type 'str'> for “string” and the second with <type 'int'> for “integer”. As line 11 shows, operations do not in general apply to different types. It makes no sense to try to sum together the number 2 and the word “balloon”.

Here is another way to examine an object’s type, plus a new type:

>>> type('one') == type(1)
>>> type('one') == type('1')
>>> type(1) == type(1.0)
>>> type('1') == type('1.0')
>>> type(1.0)

The new type, “float”, refers to how computers represent rational numbers, as ‘floating points’ – see for instance Wikipedia’s article Floating point. In this book you will not deal with numbers directly all that much, but you may want to brush up on the difference between integers and rational numbers, as explained in the article Integers and rational numbers from MathPlanet.

In fact, you will work with strings so much that you should know that Python has a simple method for converting a number to a string, str():

>>> str(1)
>>> str(1.0)
>>> '1' == str(1)
>>> '1.0' == str(1.0)
>>> type('1') == type(str(1))
>>> type('1.0') == type(str(1.0))

The converses of str(), which convert a string to an integer or to a rational number, are int() and float():

>>> int('1')
>>> float('1.0')
>>> 1 == int('1')
>>> 1.0 == float('1.0')
>>> type(1) == type(int('1'))
>>> type(1.0) == type(float('1.0'))
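Since operations do not in general apply across types, the practical use of these conversions is to bring both operands to the same type before combining them. A minimal sketch follows; the names count, message, total and half are invented for the illustration.

```python
count = 3

# Number -> string, so that + can concatenate.
message = str(count) + ' balloons'

# String -> integer, so that + can add.
total = int('2') + count

# String -> float, so that / can divide.
half = float('1.5') / 3

print(message)
print(total)
print(half)
```

Which conversion you pick decides what + means: with two strings it concatenates, with two numbers it adds.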

Question

What do you think the == operator does?

Spyder helps you to keep track of the type of the variables that you define by listing them in the Variable Explorer pane, one of the options of the top right pane:

_images/3-VariableExplorer.png

Fig. 4.2 The variable explorer

Author’s image.

Our single variable B is of type string and has the value ‘balloon’. It also measures size 1, which needn’t concern us now.

4.2. Basic string methods

Python supplies several methods to perform tasks on strings. Three basic ones are illustrated below. Try to figure out what they do:

1
2
3
4
5
6
7
8
>>> len(B)
>>> len(B+'s')
>>> len(B*2)
>>> sorted(B)
>>> len(sorted(B))
>>> set(B)
>>> sorted(set(B))
>>> len(set(B))

You should have concluded that len(B) gives the length of the string, that is, the number of characters that it contains. sorted(B) orders the string’s characters alphabetically. set(B) produces the set of characters in the string. One useful property of sets is that they do not contain duplicate elements. More generally, if it makes sense, the output of one method can be input to another method.

4.2.1. Nested or embedded operations

Several of the lines above contain operations within operations. It is fundamental that you be able to read (and concoct) them, so lines 2 and 5 are converted to diagrams below for those of you who are visual learners:

_images/3-Nesting.png

Fig. 4.3 Nesting of operators

Author’s diagram.

In textual format, processing proceeds from the operators that are inside the most parentheses outwards. This metaphor of one operation being ‘inside’ another is called nesting or embedding. The diagrams try to visualize these metaphors. The most nested or embedded operations are on the bottom, and the lines trace the flow of processing as operators combine with their arguments and send their results upwards.

4.2.2. Tokens vs. types

The removal of repetitions performed by set() touches on a fundamental concept in text computation, that of the distinction between a token and a type. A representation in which repetitions are allowed is said to consist of tokens, while one in which there are no repetitions is said to consist of types. Thus set() converts the tokens of a string into types. There is one type of 'o' in 'balloon', but two tokens of 'o'. In ordinary English, types are categories and tokens are instances of a category. In Python, data types are also categories, of which variables are instances.
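The token/type distinction can be counted directly with the methods already introduced. This is just a sketch under my own naming (word, tokens, types); nothing here goes beyond len(), set() and sorted().

```python
word = 'balloon'

tokens = len(word)            # every character counts, repeats included
types = len(set(word))        # repeats collapse into a single type
inventory = sorted(set(word)) # the types, in alphabetical order

print(tokens)
print(types)
print(inventory)
```

The gap between the two counts tells you how many characters of the word are repetitions of one already seen.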

4.2.3. Dot and method notation

The material aggregated to a method in parentheses is called the method’s argument. In the examples above, the argument B can be thought of linguistically as the object of a noun: the length of B, the alphabetical sorting of B, the set of B.

But what if two pieces of information are needed for a method to work, for instance, to count the number of o’s in balloon? In other words, if there is a method count(), how many pieces of information does it need to count the o’s in balloon? Well, if we continue with the idea that the argument holds the direct object, we would expect Python to accept count('o'), but where does the string to be searched for o’s go?

Python allows the additional information to be prefixed to the method with a dot:

>>> B.count('o')

The example can be read as “in B, count the o’s”, with the argument being the substring to be counted. The nomenclature for the prefixed variable is a bit more complex, because we have introduced it backwards. In pythonic reality, the dot says that the string variable B can take the method count('o') as an attribute because a string is a type of object whose elements can be counted. Putting the two concepts together produces the general format noted below:

Note

object.attribute, where an attribute can be method(argument)
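To see the object.method(argument) format at work with more than one argument, here is a small sketch. The variable names are mine, and the only method used is count().

```python
B = 'balloon'

o_count = B.count('o')     # in B, count the o's
l_count = B.count('l')     # in B, count the l's
ll_count = B.count('ll')   # the argument may be longer than one character

# The object before the dot does not have to be a variable.
literal_count = 'balloon'.count('o')

print(o_count)
print(l_count)
print(ll_count)
print(literal_count)
```

In every line the string to be searched sits before the dot and the thing to be counted sits inside the parentheses.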

Spyder has a neat way of reminding you what the attributes of an object are. Type in B. and with the cursor right after the dot, Spyder opens a small window that lists all of the methods available for B. This depends on the type of B, which is a string, so Spyder shows you all of the string methods. We have only seen a few of them so far.

4.2.4. How to clean up a string

There is a group of methods for modifying string properties, illustrated below. You can guess what they do from their names:

>>> L = 'i lOvE yOu'
>>> L.lower()
>>> L.upper()
>>> L.swapcase()
>>> L.capitalize()
>>> L.title()
>>> L.replace('O','o')

Did you notice the difference between title() and capitalize()?

There is a family of methods for stripping away leading or trailing characters:

>>> P = '*abc*bca*'
>>> P.lstrip('*')
>>> P.rstrip('*')
>>> P.strip('*')
>>> P.lstrip()
>>> P.rstrip()
>>> P.strip()

The last three lines suggest that without an argument, strip doesn’t do anything, but that is not true. The next line makes a new string with peripheral blank spaces which is then subject to the three methods:

>>> P1 = ' '+P+' '
>>> P1
>>> P1.strip()
>>> P1.lstrip()
>>> P1.rstrip()

If no substring is supplied, their default behavior is to remove any whitespace at the edges of a word.

Finally, these methods are not restricted to stripping out single characters, but their behavior is somewhat unexpected. They treat the argument as a collection of characters, not as a substring, and strip away every leading or trailing character that belongs to it:

1
2
3
4
5
6
>>> P.strip('*a')
>>> P.strip('a*')
>>> P.rstrip('*a')
>>> P.rstrip('a*')
>>> P.lstrip('*a')
>>> P.lstrip('a*')

In lines 1 and 2, despite the difference in the order of the characters in the stripping strings ‘*a’ and ‘a*’, the same string is output.
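One way to convince yourself of this is to compare the results directly. The sketch below reuses P from the text; the other names (both_orders, only_star, padded) are made up for the illustration.

```python
P = '*abc*bca*'

# The argument is a set of characters to remove, so its order is irrelevant.
both_orders = P.strip('*a') == P.strip('a*')

# Stripping stops at the first character that is not in the set,
# so anything in the interior of the string is left untouched.
star_and_a = P.strip('*a')
only_star = P.strip('*')

# With no argument, whitespace is what gets stripped.
padded = '  ' + P + '  '
unpadded = padded.strip()

print(both_orders)
print(star_and_a)
print(only_star)
print(unpadded == P)
```

Note that the asterisk in the middle of P survives every call: stripping only ever works inwards from the two edges.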

4.2.5. Practice 1

  1. What types are output by len(), sorted(), and set()?

  2. Write the code to perform the changes given below on these two strings:

    >>> S = 'ABCDEFGH'
    >>> s = 'abcdefgh'

    a. Extract the first 3 characters of S and make them lowercase.
    b. Extract the last 4 characters of s and make them uppercase.
    c. Create a string from the first 4 characters of S and the last 4 characters of s and then switch its case.

  3. Here are two real life strings to work with:

    >>> mail = 'howard@tulane.edu'
    >>> url = 'http://www.tulane.edu/~howard/NLP/'

    a. How would you strip out the user name and the server name from my email address?
    b. Internet addresses start with the transfer protocol that the site uses. For web pages, this is usually the hypertext transfer protocol, http. How would you strip this information out to leave just the address of the book?
    c. Following up on (b), how would you extract just Tulane’s server address?

  4. Now we mix types:

    >>> day = 1
    >>> month = 'Sept'
    >>> year = 2016

    Concatenate them into the string ‘Sept. 1, 2016’. Then split the string ‘Nov. 1, 1957’ into its three parts.

4.3. How to find your way around a string

Note

The code for this section is found at nlp4sec3.py.

Another useful thing to do to strings is to pick bits and pieces out of them. To do so, Python needs a means for keeping track of the sequence of characters in a string, otherwise known as its ordinality. Python represents ordinality in a string by assigning an integer, called an index, to each character to mark its position.

4.3.1. How to find characters given an index

Instead of telling you how indexation works, try to figure it out from these examples:

>>> E = 'abcde'
>>> E[0]
>>> E[1]
>>> E[4]
>>> E[5]

A single integer in square brackets marks a position on a string and so picks out the character there. Fine, but does it start at 1?

4.3.1.1. Zero-based indexation

You probably thought that the first character in a string should be given the number 1, but Python actually gives it 0, and the second character gets 1. There are some advantages to this format which do not concern us here, but we will mention a real-world example. In Europe, the floors of buildings are numbered in such a way that the ground floor is considered the zeroth one, so that the first floor up from the ground is the first floor, though in the USA, it would be called the second floor. [1]

Be that as it may, I want to fix this notion firmly in your mind, so I invite you to peruse the diagram below:

_images/3-StringIndexation.png

Fig. 4.4 String indexation

Author’s image.

Its import is that string indexation counts the intervals between characters, and not the characters themselves. Just like the fact that a child is not one year old until it actually reaches the end of its first year, the first character in a string takes up the zeroth interval. [2]

4.3.1.2. How to index in reverse

The diagram suggests that there is a reverse indexation, in the negative direction. Try it out for yourself:

>>> E[-0]
>>> E[-1]
>>> E[-2]
>>> E[-4]
>>> E[-5]
>>> E[-6]

The indexation runs in reverse, from right to left, just like in the diagram. The only tricky one is the first. Apparently Python interprets -0 as 0.
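The relation between the two directions can be stated as a rule of thumb: counting k characters back from the end lands on the same character as counting len(E) - k forward from the start. The checks below are a sketch of that rule; n and the *_same names are my own.

```python
E = 'abcde'
n = len(E)

last_same = E[-1] == E[n - 1]    # last character, both ways
first_same = E[-5] == E[n - 5]   # first character, both ways
zero_same = E[-0] == E[0]        # -0 is simply the integer 0

print(last_same)
print(first_same)
print(zero_same)
```

This is also why there is no negative counterpart of index 0: the negative positions run from -1 down to -len(E).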

4.3.1.3. How to slice a string

Square brackets do a lot more than just extract a single character. What does the following notation do?

>>> E[0:3]
>>> E[1:4]
>>> E[2:5]
>>> E[-5:-2]
>>> E[-4:-1]
>>> E[-3:-0]
>>> E[0:3] == E[-5:-2]
>>> E[1:4] == E[-4:-1]
>>> E[2:5] == E[-3:-0]
>>> E[0:5]
>>> E[-6:5]
>>> E[0:5] == E[-6:5]
>>> E[5:-6]

Two integers in square brackets separated by a colon extract the characters from the first integer up to, but not including, the second integer. In Python, this operation is called slicing. A positive integer indexes the string from left to right; a negative one indexes it from right to left. Either way, the sequence must be stated from low to high. The two polarities can be mixed, as long as the low-to-high sequencing is respected. If not, the result is the null or empty string ''.
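Slices are more forgiving than single indexes: a position beyond either edge of the string is quietly pulled back to the edge instead of raising an error. Here is a sketch with names of my own choosing.

```python
E = 'abcde'

middle = E[1:4]       # from index 1 up to, but not including, index 4
clamped = E[-6:5]     # -6 lies before the start, so the slice begins at the start
generous = E[2:100]   # 100 lies past the end, so the slice stops at the end
backwards = E[4:1]    # high-to-low order gives the empty string
past_end = E[5:]      # slicing from the very end is empty, not an error

print(middle)
print(clamped)
print(generous)
print(backwards == '')
print(past_end == '')
```

Compare this with E[5] from the earlier examples, which does raise an error: a single index must name a character that exists.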

Question

Did you notice that the positive and negative slices are not entirely equivalent? Can you explain the difference?

If no beginning or end position is mentioned for a slice, Python defaults to the beginning or end of the string:

>>> E[3:]
>>> E[3:] == E[3:5]
>>> E[-3:]
>>> E[-3:] == E[2:5]
>>> E[:3]
>>> E[:3] == E[0:3]
>>> E[:-3]
>>> E[:-3] == E[-6:-3]
>>> E[:]
>>> E[:] == E[0:5]
>>> E[:] == E[-6:]

The result of a slice is a string, so it can be concatenated with another string or repeated:

>>> type(E[2:])
>>> E[:-1] + '!'
>>> E[:2] + E[2:]
>>> E[:2] + E[2:] == E
>>> E[-2:] * 2

4.3.1.4. Extended slicing

Slice syntax allows a mysterious third argument, by appending an additional colon and integer. What do these do?

>>> K = 'abcdefghijk'
>>> K[::1]
>>> K[::2]
>>> K[::3]
>>> K[::4]

This third argument is called the step or stride argument, but you may remember it more readily by calling it the Noah argument, in reference to the story of Noah’s ark from Genesis, in which the animals went into the ark two by two. The step argument tells slicing to progress through the string n by n, where n is the Noah argument. Thus the first example slices the string one by one, and so just returns the original string. The second example slices the string two by two, selecting the letters at the even indices 0, 2, 4 and so on. The third example slices the string three by three, picking out every third letter, starting with the first. And so on.

Of course, you can still use the first two arguments to slice out a substring, which the Noah argument steps through:

>>> K[1:7:1]
>>> K[1:7:2]
>>> K[1:7:3]
>>> K[1:7:6]

Thus the overall format of a slice is:

Note

string[start:end:step]
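To make the three-part format concrete, a stepped slice can be rebuilt by hand from single indexes. The sketch reuses K from the text; stepped, by_hand and defaults are my own names.

```python
K = 'abcdefghijk'

stepped = K[1:7:2]               # start at 1, stop before 7, step by 2
by_hand = K[1] + K[3] + K[5]     # the same three characters, picked one by one

# Leaving start and end empty is the same as spelling out the whole string.
defaults = K[::2] == K[0:len(K):2]

print(stepped)
print(stepped == by_hand)
print(defaults)
```

The end position stays exclusive when a step is given, which is why index 7 is never reached here.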

4.3.1.5. How to reverse a string

You may wonder whether making the Noah argument negative does anything. Consider the inverse of the initial example set:

>>> K[::-1]
>>> K[::-2]
>>> K[::-3]
>>> K[::-4]

As before, the first one slices the string one by one, returning the entire input string. But it does so starting from the end, so that the string winds up reversed. The next three behave as expected, slicing out every second, third and fourth letter, starting from the end. This is the easiest way in Python to reverse a string. [3]
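A classic use of the reversing slice is to test whether a word reads the same in both directions. This is a sketch only; the sample word 'level' and the variable names are my own choices.

```python
K = 'abcdefghijk'
backwards = K[::-1]            # the whole string, read from the end
round_trip = backwards[::-1]   # reversing twice restores the original

word = 'level'
is_palindrome = word == word[::-1]

print(backwards)
print(round_trip == K)
print(is_palindrome)
```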

4.3.2. How to find an index given a character

You can ask Python for a character’s index with the index() or rindex() methods, which attach to the string with a dot and take the character as an argument:

>>> D = 'abcdabc'
>>> D.index('a')
>>> D.rindex('a')
>>> D.index('d')
>>> D.rindex('d')

Python also has a pair of methods find() and rfind(), which appear to do the same thing as index() and rindex():

>>> D.find('a')
>>> D.rfind('a')
>>> D.find('d')
>>> D.rfind('d')

They differ in how they handle a character that is not in the string:

>>> D.find('z')
-1
>>> D.index('z')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: substring not found

Upon not encountering the character, find() produces -1, which could be used in further processing – hey, the target isn’t here, do something else – while index() halts processing. The former is probably more useful, so it is the one that I prefer.
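Here is a sketch of what "do something else" might look like for each method. It uses if/else and try/except, which the chapter has not introduced, so treat it as a preview; outcome, raised and misleading are names invented for the example.

```python
D = 'abcdabc'

# find(): test the result against -1 before using it.
position = D.find('z')
if position == -1:
    outcome = 'not found'
else:
    outcome = 'found at ' + str(position)

# index(): the failure has to be caught, or processing halts.
try:
    D.index('z')
    raised = False
except ValueError:
    raised = True

# A caution: -1 is itself a legal index (the last character), so an
# unchecked find() result can be used by mistake without any error.
misleading = D[D.find('z')]

print(outcome)
print(raised)
print(misleading)
```

The last line is the price of find()'s convenience: if you forget the -1 test, Python will not remind you.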

These two methods can also find substrings:

>>> D.find('cda')
>>> D.index('cda')
>>> D.find('abc')
>>> D.index('abc')

They return the position of the first character of the substring.

4.3.2.1. How to limit a search to a substring

index() and find() allow optional arguments for the beginning and end positions of a substring, in order to limit searching to a substring’s confines:

>>> D.index('ab', 0, 3)
>>> D.index('ab', 3)
>>> D.find('ab', 0, 3)
>>> D.find('ab', 3)

Note

index/find(string, beginning, end)
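One use of the beginning argument is to continue a search from just past the previous hit, so as to locate a second occurrence. The following is a sketch with my own variable names, using find() so that a miss gives -1 rather than an error.

```python
D = 'abcdabc'

first = D.find('ab')               # search the whole string
second = D.find('ab', first + 1)   # resume just past the first hit
third = D.find('ab', second + 1)   # nothing left to find

# The end position is exclusive, as in a slice: the whole substring
# has to fit before it, so 'ab' is not found within D[0:1].
too_short = D.find('ab', 0, 1)

print(first)
print(second)
print(third)
print(too_short)
```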

4.3.3. Operator iteration

Most of the methods that we have reviewed take a string as input to give a string as output. This means that they should be able to be hooked together to make quite complex expressions. Returning to the string L, here are some examples:

>>> L = 'i lOvE yOu'
>>> L[2:6].capitalize().upper()
>>> L[-3:].capitalize().lower()
>>> (L[:4].upper()+L[4:].lower()).swapcase()

You will explore this a bit further in the upcoming exercises.

4.3.4. Practice 2

  1. Write the code to perform the changes given below on these two strings, but try to use slicing rather than stripping:

    >>> S = 'ABCDEFGH'
    >>> s = 'abcdefgh'

    a. Make the first 3 characters of S lowercase.
    b. Make the last 4 characters of s uppercase.
    c. Create a string from the first 4 characters of S and the last 4 characters of s and then switch its case.
    d. Concatenate both strings and slice out every even character.
    e. Concatenate both strings and reverse the order of the characters.
    f. Retrieve the index of ‘E’ and ‘h’.

  2. As in Practice 1:

    >>> mail = 'howard@tulane.edu'
    >>> url = 'http://www.tulane.edu/~howard/NLP/'

    a. How would you slice out the user name and the server name from my email address?
    b. How would you slice the hypertext transfer protocol, http://, out to leave just the address of the book?
    c. Following up on (b), how would you slice out just Tulane’s server address?

  3. Slice the string ‘Nov. 1, 1957’ into month (a string), day (an integer), and year (an integer).

  4. It may not have occurred to you, but you could use the index returned by index() or find() as an argument inside a slice, like in the following example:

    >>> E[E.index('a'):E.find('d')]

    Use this sort of embedded index/find() to slice the following strings out of E:

    a. ‘ab’
    b. ‘bc’ using the r versions
    c. ‘de’

  5. CREATIVITY ALERT! I mentioned briefly in class that find() and index() can be helpful for morphological analysis.

    a. Take the word ‘constitution’ as your sample string and show how you could slice the suffix ‘tion’ away from it to leave the root ‘constitu’.
    b. Can you use the same technique to slice the prefix ‘anti’ away from the word ‘antimatter’ to leave the root ‘matter’? If not, can you explain why?
    c. Try to get both results from the stripping methods.

  6. Let us approach the problem of slicing away a prefix from a different angle. Forget about find() and index(). What else do you know about ‘anti’ that could be used in a slice to remove it from ‘antimatter’? To inspire your creativity, let’s enlarge the data set for a moment. Imagine that you have a variable prefix that can include ‘a’ from ‘atypical’, ‘im’ from ‘impossible’ and ‘dis’ from ‘disown’, as well as ‘anti’. What is the one thing that you know about these four prefixes that can be used to get rid of them with a slice?

    If you find an answer, use it to restate your answer to (5a) above.

  7. What is the longest sequence of operators that you can make?

4.4. Where to slice

Note

The code for this and the rest of the chapter can be downloaded as this script nlpsec4ff.py, presumably to your Downloads folder, and then moved to pyScipts, from whence you can open it in Spyder and run each line one by one.

The input of string methods is a string, as is their output, so there are two places to apply a slice. Consider the two alternatives below for turning the character at index 5 of N into upper case:

1
2
3
>>> N = 'abcdefg'
>>> N.upper()[5]
>>> N[5].upper()

The output is identical in both expressions, but could there be a difference lurking below the surface? Try to explain in English how lines 2 and 3 contrast.

4.4.1. The notion of scope

To sharpen your intuition about what is going on, let us try to think about it visually, as in the image below:

_images/3-OperatorHierarchy.png

Fig. 4.5 Hierarchy of operators

Author’s diagram.

The method upper() takes a string as input, the box on the lower left, and changes it to the upper case in the top center box. A slice can be applied to either box to achieve the same result in string terms – which we see – but the processing is slightly different in either case – which we don’t see. Let us refer to the application of an operator to the smallest item possible as narrow scope. Conversely, the application of an operator to the largest item possible is known as wide scope.

To try to make the difference manifest, let us plug the result of slicing into the diagram:

_images/3-SliceAmbiguity.png

Fig. 4.6 Slice ambiguity

Author’s diagram.

In the wide-scope slice on the left, the method processes the seven characters of the input string and then its output is sliced. In the narrow-scope slice on the right, the input string is sliced, and then the method processes the single resulting character. This narrow-scope arrangement should take less processing.

It would be helpful to demonstrate the contrast overtly and objectively. One possibility is to check whether the difference in processing scope is reflected in a difference in processing time. Python has a resource for measuring how long it takes to do something, but it takes a function as input, rather than a string. Thus we need to take a slight detour to talk about how to define a function.

4.4.2. How to define a function

For our purposes, we will consider a function to be a method that you define yourself. By way of illustration, the following lines of code turn the two slices above into functions:

>>> def f1():
...     s = 'abcdefg'
...     s.upper()[5]
...

>>> def f2():
...     s = 'abcdefg'
...     s[5].upper()
...

Python does a couple of new things here. After the colon, it knows to wait for you to type in the body of the function, which it indicates by changing the prompt to the ellipsis (three dots). But you must indent each line of the body, with a tab or (more conventionally) four spaces, so that your code lines up beneath f. At the interactive prompt, a function ends with a blank line, which tells Python to return to its regular prompt.

A concluding thought: you may have wondered why the string s was assigned twice, once in each function. If assigned for one, shouldn’t it be available for the other? Check whether it is by typing it at the prompt:

>>> s

You should get a NameError saying that s is not defined. This is another instance of scope. A variable assigned within a function is only accessible to the function, and not to any code outside of it. That is to say, its scope is limited to the function. If you are starting to suspect that scope is a pervasive organizing principle of computer programs, you are right.
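The sketch below shows the same scoping effect in a form that can be run as a script. It goes one step beyond the text by using a return statement, which hands the function's result back to the caller; the names sixth_upper, inner, result and visible are all hypothetical.

```python
def sixth_upper():
    inner = 'abcdefg'          # inner exists only while the function runs
    return inner[5].upper()    # hand the result back to the caller

# The returned value can be kept under a name of its own.
result = sixth_upper()

# The function's own variable is not reachable from out here.
try:
    inner
    visible = True
except NameError:
    visible = False

print(result)
print(visible)
```

So the only thing that escapes from a function is what it explicitly returns; its working variables stay inside.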

We can now time how long it takes to process each function.

4.4.3. How to time execution with timeit

To get the resources for measuring processing time, import the timeit module and then use either function as argument to timeit():

>>> from timeit import timeit
>>> timeit(f1)
>>> timeit(f2)

How speedy are they? On most computers, the second is about a third faster than the first. [4]
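By default, timeit() calls the function one million times and reports the total number of seconds, which is why the numbers look larger than you might expect for such tiny operations. The optional number argument changes the repetition count. The sketch below repeats the two definitions so that it runs on its own; runs, t1 and t2 are my own names, and the timings you see will depend on your machine.

```python
from timeit import timeit

def f1():
    s = 'abcdefg'
    s.upper()[5]      # wide scope: uppercase everything, then slice

def f2():
    s = 'abcdefg'
    s[5].upper()      # narrow scope: slice, then uppercase one character

runs = 100000         # fewer repetitions than the default of 1000000

t1 = timeit(f1, number=runs)   # total seconds for all runs of f1
t2 = timeit(f2, number=runs)   # total seconds for all runs of f2

print(t1)
print(t2)
```

Timings fluctuate from one trial to the next, so it is worth repeating the measurement a few times before drawing a conclusion.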

4.4.4. Practice 3

  1. Show which of len(), sorted(), and set() takes the most time to process.

  2. In Operator iteration there is a line with the expression L[2:6].capitalize().upper(). Show whether it is faster for Python to process it as it is, or broken into its more-easily-readable parts:

    >>> A = L[2:6]
    >>> B = A.capitalize()
    >>> C = B.upper()
    
  3. Recall the discussion of the default precedence of * and + in Operator precedence. Perhaps * comes first because it is quicker to process than +. Test this hypothesis by writing a function for each combination of * and + and timing them to see which runs faster.

  4. Recall that stripping comes in right and left versions. Using the string ‘antidisestablishmentarianism’, compare the stripping of ‘ism’ to the right stripping of ‘ism’ to see whether the latter is faster.

4.5. How to make a string longer than one line

So far, the strings that you have worked with fit comfortably into a single line, but once we take up texts, this will no longer be the case. Below are four ways of inputting a multi-line string:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
>>> longslash = 'A wonderful bird is the pelican.'\
... 'His bill can hold more than his belican.'
>>> print longslash
A wonderful bird is the pelican.His bill can hold more than his belican.
>>> longparen = ('A wonderful bird is the pelican.'
... 'His bill can hold more than his belican.')
>>> print longparen
A wonderful bird is the pelican.His bill can hold more than his belican.
>>> longsingle = '''A wonderful bird is the pelican.
... His bill can hold more than his belican.'''
>>> print longsingle
A wonderful bird is the pelican.
His bill can hold more than his belican.
>>> longdouble = """A wonderful bird is the pelican.
... His bill can hold more than his belican."""
>>> print longdouble
A wonderful bird is the pelican.
His bill can hold more than his belican.
>>> longdouble
'A wonderful bird is the pelican.\nHis bill can hold more than his belican.'

The first concatenates the strings by means of a backslash at the end of the first line, while the second concatenates them inside parentheses. The result of both is the same, which is shown by using the print statement to format them on the screen. The last two just create one giant string that flows across lines by putting the material within triple single or double quotes. Yet they preserve the line breaks, which is shown for the last by displaying the string itself. It contains the new line character, \n. Notice that in all four cases, Python knows to wait for more material, as signaled by the ellipsis prompt.
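The difference between the two families can also be checked without print, by asking whether the new line character is in the string. This is a sketch written as a script rather than at the prompt; longtriple and the other result names are my own.

```python
longslash = 'A wonderful bird is the pelican.'\
'His bill can hold more than his belican.'

longparen = ('A wonderful bird is the pelican.'
             'His bill can hold more than his belican.')

longtriple = '''A wonderful bird is the pelican.
His bill can hold more than his belican.'''

# Backslash and parentheses glue the pieces together with nothing in between.
same = longslash == longparen
joined_has_newline = '\n' in longslash

# Triple quotes keep the line break as a \n character inside the string.
triple_has_newline = '\n' in longtriple
line_count = longtriple.count('\n') + 1

print(same)
print(joined_has_newline)
print(triple_has_newline)
print(line_count)
```

Because the first two forms insert nothing at the join, any space you want between the sentences has to be typed inside one of the quoted pieces.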

4.5.1. Practice 4

The four ways of making long strings introduced above are too easy to reproduce through cut and paste to make practicing them very challenging. So instead let us practice concatenation, with the following strings, which constitute a limerick of unknown origin cited as the first example of Wikipedia’s Limerick (poetry):

>>> L1 = 'The limerick packs laughs anatomical'
>>> L2 = 'into space that is quite economical.'
>>> L3 = "But the good ones I've seen"
>>> L4 = 'so seldom are clean'
>>> L5 = 'and the clean ones so seldom are comical.'
  1. Combine them into a single string with proper spacing for two sentences. Check it with print.
  2. Now combine them into a single string so that each one prints out on its own line. This is very hard to do, but the answer is hidden in line 20 above.
  3. Finally, modify the string you developed in #2 so it prints out to the conventional form of limericks, five lines with the third and fourth indented by two spaces.

4.6. Assignment and mutability

4.6.1. Assignment of variable names

You have by now performed an assignment of a variable to an expression several times and hopefully have internalized something like:

Note

variable = expression

This can be read as “variable is assigned to expression”. For convenience, we often refer to the variable as the name for the expression, though the Python documentation actually prefers identifier to name.

4.6.2. What’s in a name

There are several limitations on what string can be a name. The main one is that it cannot be one of the words that Python reserves for its own uses, such as:

>>> print = 'print'

A list of all such reserved words or keywords is found in the Python documentation here.

A trickier case is that of the methods built into Python, such as the string methods reviewed in this chapter:

>>> len = 'len'

Using one of them as a name is not prohibited, but is considered bad form, since it could be confused with the method len() and lead to misunderstanding. You really do not want to create the possibility of expressions like len(len) in your code.

There are also limits on what characters can be part of a name. A name must start with a letter or underscore, but not a digit:

>>> name = 'zzz'
>>> _name = 'zzz'
>>> 0name = 'zzz'

Once a name has been started correctly, it can contain any combination of letters, digits, or underscores, but nothing else:

>>> my_name = 'zzz'
>>> my_name1 = 'zzz'
>>> my name = 'zzz'
>>> my-name = 'zzz'

Upper and lower case are different:

>>> name = 'zzz'
>>> Name = 'zzz2'
>>> name == Name

Having outlined what a name cannot be, let me spend a few lines on what a name should be.

There are three main conventions in the programming world for the design of identifiers: Pascal case, camel case and underscore case (see CamelCase). Here they are exemplified side by side:

>>> PascalCase = 'justastring'
>>> camelCase = 'justastring'
>>> underscore_case = 'justastring'

The idea is to make up highly descriptive, multiword names, without spaces (which are prohibited in Python anyway), and to enhance readability by capitalization or underscoring. Pascal case capitalizes every word, like in PowerPoint; camel case capitalizes every word after the first, like in iPhone, and underscore case separates lowercase words with an underscore, which I cannot think of an example of in common usage. Notice that some technique is needed for readability – were you able to read 'justastring' when you first laid eyes on it?

In this book, I will endeavor to use camel case, because it is in wide circulation, and it saves me the effort of having to hit the shift key more than once.

4.6.3. Mutability

Try this:

>>> name[0] = 'g'

I asked you to assign 'g' to the first character of name, but Python doesn’t let you. Likewise, Python has a delete statement (with the syntax of print), but it doesn’t delete the first character of name, either:

>>> del name[0]

This is because strings are immutable, which is to say, once a string has been created, it cannot be changed by adding, replacing or deleting items in it. Even the methods that were reviewed above do not change a string; each one returns a new string and leaves the original as it was.
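A minimal sketch may make the point concrete. The names word, shouted and assignmentFailed are made up for the occasion; the try block merely catches the error so that the remaining lines can run:

```python
word = 'zzz'

# Assigning to a position of a string raises TypeError.
try:
    word[0] = 'a'
    assignmentFailed = False
except TypeError:
    assignmentFailed = True

# A method such as upper() hands back a new string; word itself is untouched.
shouted = word.upper()

print(assignmentFailed)
print(word)
print(shouted)
```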

4.6.4. Practice 5

There is not much to practice in this section.

  1. Which of these are illegitimate names?
     1. name?
     2. *name
     3. _name
     4. 1name
     5. name1
  2. Immutability prevents you from assigning a new character into a string, but there is still a sneaky way of transforming ‘name’ into ‘game’. Do you recall it?

4.7. How to deal with non-English characters

So your program is humming along, and it hits the string ‘naïve’ and chokes. For instance, it may try to find out its length:

>>> F = 'naïve'
>>> len(F)
6

Does naïve look like a six-letter word to you?

Try to see F the way Python does by asking Spyder to display it for you:

>>> F
'na\xc3\xafve'

Somehow, Python has converted ‘naïve’ to ‘na\xc3\xafve’. The backslash is used as an escape character here, which tells Python to process the next three characters (an x and two hexadecimal digits) in a special way, as a single unit. Since there are two backslashes, len() may be counting each escape sequence as a character, to give the unexpected length of six. It’s as if the dieresis (two dots over the i) were given its own weight as a character. Check this hypothesis like so:

>>> id = 'ï'
>>> id
'\xc3\xaf'

In this section, you will review this special way that Python has of dealing with non-English characters.

4.7.1. English characters and ASCII

Computers were originally designed to use the English alphabet, and in particular, an encoding of it called the American Standard Code for Information Interchange, abbreviated ASCII and pronounced /ˈæski/ or “ass-kee”, see ASCII in Wikipedia. ASCII is ultimately based on telegraph codes and represents the numbers 0-9, the English letters a-z and A-Z, the English punctuation symbols plus a blank space, along with control codes that originated with Teletype machines, some of which are now obsolete. The table below illustrates the encoding:

Table 4.1 ASCII characters
    0 1 2 3 4 5 6 7 8 9 A B C D E F
0
1
2     ! " # $ % & ' ( ) * + , - . /
3   0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4   @ A B C D E F G H I J K L M N O
5   P Q R S T U V W X Y Z [ \ ] ^ _
6   ` a b c d e f g h i j k l m n o
7   p q r s t u v w x y z { | } ~

Rows 0 and 1 are for non-printing teletype characters like the line feed, the carriage return, the escape key, or ringing a bell. The cell at row 2 and column 0 (20) holds the empty space. Cell 7F implements the delete key.

4.7.1.1. How to show the indexation of ASCII with ord() and chr()

Did you know that all ASCII characters have a place in the alphabetical order of the letters? Try this:

>>> unsorted = 'a*@A6'
>>> sorted(unsorted)
['*', '6', '@', 'A', 'a']

Can you consult the table of ASCII characters to figure out what the order is?

Hopefully you guessed that characters are indexed starting from cell 00 and reading right and then down until reaching cell 7F. Thus the string unsorted is sorted the way that it is because * comes first, at cell 2A, then 6 at cell 36, then @ at cell 40, and so on.

As in pythonic indexing, ASCII numbering begins at zero – and rows are sixteen characters long – so that the first visible character is the space at cell 20 and ordinal 32. You can see for yourself by querying the location of a character with ord():

>>> ord(' ')
>>> ord('!')
>>> ord('~')

Conversely, chr() retrieves the character from its ordinal number:

1
2
3
4
>>> chr(32)
>>> chr(33)
>>> chr(126)
>>> chr(127)

The string returned by line 4 is the non-printing character displayed as ‘\x7f’ (delete), which happens to occupy cell 7F.
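The relation between an ordinal and its cell in the table can be sketched in a few lines: since each row has sixteen columns, dividing the ordinal by sixteen gives the row, and the remainder gives the column. The function cell() is a hypothetical helper of my own, not something built into Python:

```python
def cell(ch):
    """Return the table cell (row and column in hexadecimal) of an ASCII character."""
    row, col = divmod(ord(ch), 16)  # sixteen columns per row
    return '{:X}{:X}'.format(row, col)

for ch in sorted('a*@A6'):
    print('{} has ordinal {} and sits at cell {}'.format(ch, ord(ch), cell(ch)))
```

Going the other way, chr(int('2A', 16)) turns a cell back into its character.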

Since the table has sixteen columns and eight rows, ASCII only holds one hundred and twenty-eight characters. It is incapable of representing any others, and we need go no further than the accented characters of other Western European languages, such as French 'naïve', to find a string that blows it up.

4.7.2. Unicode and UTF-8

Many alternatives to ASCII have been invented over the years, for different languages and different operating systems. They have grown organically, as needs and resources permitted, leading to a diversity of mutually unintelligible encodings that impeded the flow of information around the world. The only way to bring some kind of organization to this notational logjam was to create a unified standard that included all of the characters of the world’s writing systems, plus some room to grow. This is the purpose of Unicode, which has space for more than a million characters, though the current version only defines about 110,000.

Unicode was designed for comprehensiveness, but for reasons of efficiency, an implementation of it called the Universal Character Set Transformation Format—8-bit or UTF-8 has been adopted as the de facto standard encoding world-wide.

4.7.2.1. How to manage character encoding in Python

Despite the popularity of UTF-8, Python does not presuppose that it will be the only encoding that you have to deal with, so Python prefers to do its internal character processing in Unicode and convert its output to whatever format you might need. Thus Python adopts the work-flow traced in the diagram below:

Fig. 4.7 Flow diagram of conversion into and out of Unicode in Python

Author’s diagram, inspired by Figure 3.3: Unicode Decoding and Encoding (http://nltk.org/book/ch03.html)

In prose, what happens is that characters are decoded or translated into Unicode for Python to process them, and then they are encoded or translated out of Unicode into whatever format the user desires.
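The three stages can be sketched as follows. I am assuming, purely for illustration, that the input arrives as bytes in Latin-1 and that the output is wanted in UTF-8; the processing step in the middle (counting and reversing the characters) is an arbitrary stand-in for whatever your program does. The lines are written so that they should run in both Python 2.7 and Python 3:

```python
# The bytes of 'naïve' as they might arrive from a Latin-1 (ISO 8859-1) source.
raw = b'na\xefve'

# 1. Decode: translate the incoming bytes into Unicode.
text = raw.decode('latin-1')

# 2. Process: work on characters, not on bytes.
nChars = len(text)
backwards = text[::-1]

# 3. Encode: translate the result out of Unicode, here into UTF-8.
out = backwards.encode('utf-8')

print(nChars)
print(repr(out))
```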

4.7.2.2. What happens when you type a non-ASCII character into a Python console?

As your first step in learning how to do this, you should ask your Python console what encoding it uses by means of the getdefaultencoding() function of the sys module:

>>> import sys
>>> sys.getdefaultencoding()

On my Mac, Spyder responds with ascii.

In any event, you should check how your console deals with non-ASCII characters by trying one, such as the i with dieresis of naïve that we used above:

>>> id = 'ï'
>>> id
'\xc3\xaf'
>>> print id
ï

The sequence '\xc3\xaf' is the UTF-8 representation of ï. You can check this by googling it, though I will give you a hand and suggest that you search for it directly at UTF-8 encoding table and Unicode characters.

Note how Python uses the backslash as an escape character to display each byte that falls outside of ASCII as a hexadecimal number. The print statement tries its best to display the same bytes as readable text.

4.7.2.3. How to translate into and out of Unicode with decode() and encode()

You might expect Python to name the Unicode methods according to the nomenclature of the Unicode work-flow above, and you would be correct:

>>> F = 'na\xc3\xafve'
>>> uF = F.decode('utf8')
>>> uF
u'na\xefve'
>>> len(uF)
5
>>> utf8F = uF.encode('utf8')
>>> utf8F
'na\xc3\xafve'
>>> print utf8F
naïve
>>> F == utf8F
True

String F starts us off with the character escape sequences of naïve in UTF-8, so that we are on the same page, even if UTF-8 is not the default of your system. F.decode('utf8') translates F from UTF-8 to Unicode, as explained in the work-flow. Note that the character escape sequence for ï in Unicode is \xef, a single unit rather than two; this is how we know that decode() did something. The length of uF is now correctly five characters. Finally, uF.encode('utf8') translates the Unicode string back to UTF-8, which makes it identical to F.

More

Python 2.7’s documentation of Unicode is at Unicode HOWTO.

Decoding and encoding are general functions in Python, handled by the codecs module, see 7.8. codecs — Codec registry and base classes. An encoding of a character set is itself called a codec. The character encodings or codecs that are built into Python are listed at 7.8.3. Standard Encodings. The codecs module can be used to open a file in a specific encoding, but we are going to use one of NLTK’s corpus readers instead.
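Just to give the flavor of the codecs module, here is a minimal sketch that looks up a codec by name and then decodes and encodes through the module's own functions rather than through string methods. The names utf8Codec, text and latin1Bytes are made up for illustration:

```python
import codecs

# Look up a codec by one of its aliases; its name attribute is the canonical name.
utf8Codec = codecs.lookup('utf8')
print(utf8Codec.name)

# codecs.decode() and codecs.encode() take the name of the codec as an argument.
text = codecs.decode(b'na\xc3\xafve', 'utf8')
latin1Bytes = codecs.encode(text, 'latin-1')

print(len(text))
print(repr(latin1Bytes))
```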

You can search for the encoding of any character at Unicode Character Search.

4.7.2.4. How to find out the encoding of a string with chardet

Most of the documents that we will be working with will be in UTF-8, or at least I hope so. But what do you do if you get a document in an unknown encoding? It seems reasonable to expect there to be some simple way to find out what its encoding is, but it turns out that there isn’t. There is a Python module, chardet, for detecting the encoding of a string, but it is not part of Anaconda’s distribution of Python. It is easy enough to get with pip in the terminal, though. See How to use a command-line interface and in particular the sub-section on pip:

$> pip install chardet

chardet determines the encoding of a string probabilistically. That is, it returns its best guess, together with a degree of confidence in it:

>>> import chardet
>>> unknown = 'someString'
>>> chardet.detect(unknown)

On my Mac, Spyder’s console returns ascii, because 'someString' contains nothing but ASCII characters.

More

For more explanation of chardet, see its documentation.

4.7.2.5. The special comment for setting a default encoding of a script

Python has a special comment for declaring the default encoding of a file:

# -*- coding: utf-8 -*-

It must be the first or second line of a file, but we haven’t come to writing files yet.
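For reference, here is a minimal sketch of what such a file might look like. I am assuming that it has been saved in UTF-8; the name word is made up for illustration:

```python
# -*- coding: utf-8 -*-
# Thanks to the declaration above, Python 2 accepts the non-ASCII character below.

word = u'naïve'   # the u prefix makes this a Unicode string in Python 2
print(len(word))
```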

4.7.3. Practice 6

Again, there is not much to practice. Take a look at English terms with diacritical marks, select a (lowercase) word with a non-ASCII character in it and see if you can print it to the console as uppercase.

4.8. String formatting

4.8.1. String substitution

There may be times when you want to print a string to the console by filling it in with sub-strings drawn from variables, as in these examples:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
>>> both = '30N by 90W'
>>> 'The coordinates of New Orleans are {}.'.format(both)
>>> 'The coordinates of New Orleans are {0}.'.format(both)
>>> lat = '30N'
>>> lon = '90W'
>>> 'The coordinates of New Orleans are {} by {}.'.format(lat, lon)
>>> 'The coordinates of New Orleans are {0} by {1}.'.format(lat, lon)
>>> coord = ('30N', '90W')
>>> 'The coordinates of New Orleans are {0[0]} by {0[1]}.'.format(coord)
>>> coord[0]
>>> coord[1]
>>> nolaStr = 'The coordinates of New Orleans are {0[0]} by {0[1]}'
>>> nolaStr.format(coord)

Line 1 assigns a string to a name, which is referenced by the str.format() method in line 2 in order to fill its value into the curly brackets. The index 0 in line 3 chooses the first (and only) name, as you can appreciate by comparing it to lines 4-7, where two names are assigned and referenced in sequence. Finally, line 8 introduces a new type, called a tuple, which in this example is just a pair of strings. In line 9, the tuple is referenced through str.format() to fill in the curly brackets in sequence, using the syntax of indexing. Lines 10 and 11 just show that tuples do indeed support indexing syntax. As a final bit of housekeeping, the last two lines show that the string to be formatted can also be referenced through a name.
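As a sanity check, the sketch below confirms that the different ways of referencing the coordinates all produce the same string. The third variant, which fills the brackets by keyword, goes slightly beyond the examples above; the names byPosition, byIndex and byName are made up for illustration:

```python
lat = '30N'
lon = '90W'
coord = (lat, lon)

# Fill the brackets by the position of each argument.
byPosition = 'The coordinates of New Orleans are {0} by {1}.'.format(lat, lon)

# Fill the brackets by indexing into a single tuple argument.
byIndex = 'The coordinates of New Orleans are {0[0]} by {0[1]}.'.format(coord)

# Fill the brackets by keyword, so that each field carries a name.
byName = 'The coordinates of New Orleans are {lat} by {lon}.'.format(lat=lat, lon=lon)

print(byPosition)
print(byPosition == byIndex == byName)
```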

The argument to str.format() can be a number, which can save you the step of converting a number to a string:

>>> nowData = (6, 52, 5, 9, 2016)
>>> nowStr = 'It is {0[0]}:{0[1]} on the {0[2]}th day of the {0[3]}th month of {0[4]}.'
>>> nowStr.format(nowData)

Since the argument to format() can be a number, it can be calculated there directly:

>>> points = 19
>>> total = 22.0
>>> 'Correct answers: {}'.format(points/total)

4.8.2. Output formatting

There are many ways to format a number, such as a percentage, in base 2 or in exponential form. To invoke a specific format, the left curly bracket is followed by a colon and then a formatting symbol. Try to guess what the symbols illustrated below do:

1
2
3
>>> 'Correct answers: {:.2}'.format(points/total)
>>> 'Correct answers: {:%}'.format(points/total)
>>> 'Correct answers: {:.2%}'.format(points/total)

The default format is base 10 to some crazy precision. Line 1 reduces it to two significant digits, which amounts to two decimal places for a number like this one, which lies between 0.1 and 1. Line 2 changes it to a percentage with the default precision of six decimal places, and line 3 reduces the percentage to two decimal places.
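The other formats mentioned above, base 2 and exponential form, work in the same way. The sketch below gathers them together, along with the symbol f, which fixes the number of decimal places no matter how large the number is. The names on the left are made up for illustration:

```python
ratio = 19 / 22.0

twoSignificant = '{:.2}'.format(ratio)     # two significant digits
twoDecimals = '{:.2f}'.format(ratio)       # two digits after the decimal point
largeNumber = '{:.2}'.format(123.456)      # two significant digits of a large number
exponential = '{:e}'.format(ratio)         # exponential form
binary = '{:b}'.format(19)                 # base 2, for integers only
percentage = '{:.2%}'.format(ratio)        # percentage with two decimal places

for result in [twoSignificant, twoDecimals, largeNumber, exponential, binary, percentage]:
    print(result)
```

The third line is the telling one: with only two significant digits to work with, Python falls back on exponential notation for 123.456.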

As you may have gathered by now, string formatting can get rather complex, plus it is changing from Python 2 to Python 3. The examples above demonstrate the new syntax. I will not go into the older syntax based on the % operator, though it is still common to find it.

More

See Python’s documentation on the Format Specification Mini-Language for further information.

4.8.3. Practice 7

Assign a string or number to the italicized, camel-cased names in the textual template below and compose them with the text so that it makes sense. Do the composition both through concatenation and string formatting. Which do you think is easier to do, or to understand as code?

My name is firstName lastName. I am myAge years old. I am in my ordinalYear of undergradGrad studies at uniName University.

4.9. Date, time and calendar objects

There is another kind of object that can be formatted as a string, though it is created in a very non-string-like way.

4.9.1. Datetime

The datetime module centralizes all calculation of the current date and time in a single method, datetime.now():

>>> from datetime import datetime, date, time
>>> datetime.now()
>>> n = datetime.now()
>>> type(n)
>>> n.year
>>> type(n.year)
>>> n.month
>>> n.day
>>> n.hour
>>> n.minute
>>> n.second
>>> n.microsecond
>>> n.isoformat()

Datetime has a formatting language to convert its temporal objects into strings. The method that effects the conversion is called strftime(). Its argument is a string of formatting symbols, which are single letters prefixed with a percent sign, separated by the appropriate spaces and punctuation. Try to figure out what each symbol stands for:

>>> n.strftime('%A, %d. %B %Y %I:%M%p')
>>> n.strftime('%a. %b. %w, %Y, %H:%M')
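Because datetime.now() changes from one moment to the next, it may help to try the symbols on a fixed moment first. The sketch below uses an arbitrary one, 3 p.m. on 8 September 2016; the names of days and months depend on the language settings of your computer, so I only comment on the numerical symbols:

```python
from datetime import datetime

# A fixed moment, chosen arbitrarily, so that the results are reproducible.
moment = datetime(2016, 9, 8, 15, 0)

longForm = moment.strftime('%A, %d. %B %Y %I:%M%p')
weekdayNumber = moment.strftime('%w')   # day of the week as a number, Sunday = 0
dayOfMonth = moment.strftime('%d')      # day of the month, padded with a zero

print(longForm)
print(weekdayNumber)
print(dayOfMonth)
```

Be careful not to confuse %w with %d: for this moment, a Thursday, the former gives 4 and the latter 08.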

More

See Python’s strftime() and strptime() Behavior for the full list of formatting symbols.

4.9.2. Date

The date class lets you work with dates without time. You have to import it from datetime (from datetime import date) if you haven’t already done so:

>>> date.today()
>>> tdy = date.today()
>>> type(tdy)
>>> tdy.year
>>> type(tdy.year)
>>> tdy.month
>>> tdy.day
>>> tdy.isoformat()
>>> tdy.weekday()

It works just like you would expect, except that the weekday method returns the day of the week counted from Monday = 0.

The date class supports the same formatting options as datetime:

>>> tdy.strftime('%A, %d. %B %Y')
>>> tdy.strftime('%a. %b. %w, %Y')

4.9.3. Time

Datetime’s time class is similar to the date class, except that it has no method like today() for retrieving the current time. Thus the current time has to be culled out from datetime.now() or assembled from data, as explained below. Its formatting obeys the conventions of datetime.

4.9.4. From data to datetime

Datetime lets you convert data to a date object. This can be done directly by specifying the year, month and day:

>>> xMas = date(2016, 12, 25)
>>> xMas.isoformat()
>>> xMas.weekday()

Or it can be done by replacing them:

>>> tdy = date.today()
>>> tdy.replace(month=12, day=25)

The main advantage of a datetime object is that it is numerical, calculated from the Gregorian calendar under the assumption that there are exactly 3600*24 seconds in every day. This means that datetime objects can be subject to arithmetic:

1
2
3
>>> newYearsEve = date(tdy.year, 12, 31)
>>> time2newYearsEve = newYearsEve - tdy
>>> time2newYearsEve.days

In line 2, the difference is returned as a timedelta object, which for dates is measured in days and can be converted to an integer by the days attribute in line 3.
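Since the result of date.today() changes every day, here is a minimal sketch of the same arithmetic with a fixed, arbitrarily chosen starting date, so that you can check the answer by hand. It also shows that the arithmetic works in the other direction, by adding a timedelta to a date:

```python
from datetime import date, timedelta

start = date(2016, 9, 8)                  # an arbitrary fixed date
newYearsEve = date(start.year, 12, 31)

# Subtracting one date from another gives a timedelta.
remaining = newYearsEve - start
print(remaining.days)

# Adding a timedelta to a date gives another date.
print(start + timedelta(days=114) == newYearsEve)
```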

A time object can be constructed in a similar manner, but it does NOT support arithmetic:

>>> startTime = time(15, 0)
>>> startTime.isoformat()
>>> endTime = time(15, 50)
>>> classTime = endTime - startTime

The only way to do this is to use the entire datetime object:

1
2
3
4
5
>>> startTime = datetime(2016, 9, 8, 15, 0, 0)
>>> endTime = datetime(2016, 9, 8, 15, 50, 0)
>>> classTime = endTime - startTime
>>> classTime.seconds
>>> classTime.seconds/60

But even so, datetime returns the difference as a timedelta object in line 3, so it has to be converted to an integer number of seconds by the seconds attribute in line 4 and divided by 60 to get the actual number of minutes in line 5.
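If you already have two time objects in hand, one way to subtract them is to attach both to the same date with datetime.combine(). This is only a sketch: the date is an arbitrary choice of mine, and I use the total_seconds() method of timedelta rather than the seconds attribute, because the latter leaves out any whole days in the interval:

```python
from datetime import date, datetime, time

startTime = time(15, 0)
endTime = time(15, 50)

# Attach both times to the same (arbitrary) date so that they can be subtracted.
day = date(2016, 9, 8)
classTime = datetime.combine(day, endTime) - datetime.combine(day, startTime)

# total_seconds() measures the whole interval as a decimal number of seconds.
minutes = classTime.total_seconds() / 60
print(minutes)
```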

More

See datetime — Basic date and time types for the full documentation.

4.9.5. Calendar

Python also has a calendar module which lets you do calendrical calculations, like the ancient Maya. I have probably tired you out with dates and times, though, so I will just mention one trick:

1
2
>>> import calendar
>>> calendar.monthrange(tdy.year, tdy.month)

Line 2 returns a pair, in which the first integer is the day of the week that the month starts on and the second integer is the number of days in the month, which I can never remember.
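Here is the same trick as a sketch with a fixed month, February 2016, which I have chosen because 2016 was a leap year. The names firstWeekday and nDays are made up for illustration:

```python
import calendar

# Unpack the pair into two names.
firstWeekday, nDays = calendar.monthrange(2016, 2)

print(firstWeekday)                      # weekday of the 1st, counted from Monday = 0
print(calendar.day_name[firstWeekday])   # the same day spelled out
print(nDays)                             # number of days in the month
```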

More

See calendar — General calendar-related functions for a thorough documentation of the calendar module.

4.9.6. Practice 8

  1. How many days are left until the final exam?
  2. How many hours are left until the final exam? (Don’t just multiply by 24, you lazy scoundrel.)
  3. What day of the week was New Year’s Day this year? How many days have elapsed since then?
  4. What day of the week will New Year’s Eve be this year? How many days are left until then?
  5. How old would George Washington be? (b. 22 February 1732)
  6. How old will you be the next time Halley’s Comet passes by? (28 July 2061)
  7. What day of the week did the month that you were born in start on?

4.10. Summary

By now, you should be familiar with the following pythonic concepts: string, concatenation, token, type, method, zero-based indexation, slicing, function, scope, assignment, name or identifier, reserved word, mutability.

4.11. Further practice

4.12. Further reading

Check String Methods in Python 2.7’s documentation for more information on the methods reviewed in this chapter, plus some others.

Most of the properties of strings are due to the fact that they are sequence types, which we will cover more thoroughly when we get to lists.

String formatting has only been sketched in this chapter. We will take up more of it as the need arises.

Endnotes

[1]See Python strings - why do character positions start with ‘0’, extended slicing with string in python and Zero-based numbering in Wikipedia.
[2]The negative direction starts at -1 because there is no -0 in any commonly-used integer system.
[3]I have always found this discussion on StackOverflow about how to reverse a string in Python to be helpful.
[4]For more information on the timeit module, see timeit in the Python documentation and the tutorial Time a Python Function
[5]ISO8859-1 is also known as Latin 1 and was a precursor to UTF-8, see ISO/IEC 8859-1.

4.13. Powerpoint and podcast


Last edited: October 02, 2016