14. Housekeeping

Note

The script with the code for this chapter is nlpHousekeeping.py, which you can download with codeDowner(), see Practice 1, question 2.

14.1. How to store data in a database

Since a dictionary resides in Spyder’s memory, it will eventually grow too large to process efficiently the data that it contains. At that point one must adopt a more robust framework for holding structured information, namely, a database.

There are two main types of databases, relational and non-relational.

A relational database holds its data in a table. The columns of table represent keys and the rows represent values. In statistical terms, each row is like an experiment in which the columns define the samples and the value in each cell is an outcome. Yet it would be too simple to adopt an existing terminology. A new terminology has grown up around relational databases, in which the column is termed an attribute and the row a tuple. The columns and rows taken together define a relation, hence the name “relational database”. One of the most popular is MySQL database, for which there is a Python module.

Now imagine that you have been collecting tweets in your relational database and suddenly realize that you need to add entities.hashtags. It is easy enough to add another column to the table for them, but none of the tweets (rows) collected before the moment of adding the new column will have any entry for it, and the tweets themselves are long gone, along with their hastags. There will be a big hole in your table that you still must expend resources on. More generally, for data collection which changes quickly or which cannot be predicted well in advance, the relational format appears excessively rigid. Thus the invention of the non-relational database, which has the format of a tree, a graph or just a bunch of key:value pairs. One of the most popular is Mongo database, for which there is a Python module.

14.1.1. How to use SQLite

You are going to start out leaning about SQLite through Python’s built-in sqlite3 module. Go ahead and fire it up:

>>> from sqlite3 import connect

The connect() method opens a SQLite database with the name given; if it doesn’t already exist, it is created:

conn = connect('NLP.db')

The next step is to create an container for your data. SQLite calls the main container a “table”, which consists of rows and columns. Each row contains a data item; each column contains a field of an item. The overall appearance is that of a spreadsheet.

SQLite is not particularly picky about the type of information is in a field. It allows the following five broad five storage classes:

Table 14.1 SQLite storage classes and Python types
Storage class Python types Example
TEXT str, unicode, date, time, datetime open(‘file.txt’).read()
BLOB str (binary, from PDF or image) db.binary(open(‘img.jpg’, ‘rb’).read())
INTEGER int; bool 128; 0, 1
REAL float 6.666666
NULL None  

They should be self-explanatory, except for BLOB. We have not had much chance to talk about binary files, but just think of them as any file that cannot be read as flat text.

14.1.2. How to create a table with CREATE

To create a a table for tweets, you can follow the suggestions for making a dictionary for tweets. It should include a tweet’s id, author, text, and date. All of this comes through pattern.Twitter as Unicode strings, and SQLite prefers to keep them that way. A tweet’s id is an eighteen-character integer. Everything else is text.

SQLite’s CREATE command initializes a table with the appropriate columns. It’s syntax is CREATE TABLE tableName (colName1 STORAGE_CLASS1, colName2 STORAGE_CLASS2, …). Python communicates with SQLite by wrapping all commands in string delimiters and performing them with the execute() method:

Listing 14.1 Define fields for a SQLite table
1
2
3
4
5
>>> conn.execute('''CREATE TABLE tweets
...                 (twid INT,
...                 author TEXT,
...                 text TEXT,
...                 date TEXT)''')

I have indented the specification of each column to make it easier to read. I will indicate SQLite reserved words in uppercase, and everything else in lower case, which tends to be names that you supply.

14.1.3. How to use SQLite Manager

Firefox SQLite Manager

14.1.4. How to insert rows in a table with INSERT

Time to fill ‘er up. You are going to use the same code to download ten tweets that was developed in How to access Twitter’s RESTful API and What’s in a tweet. Each tweet is inserted into the table as a row with SQLite’s INSERT command, whose syntax is INSERT INTO tableName (colName1, colName2, …) VALUES (value1, value2, …). However, our values come from the current tweet and really should be treated as variables in string formatting. The SQLite adaptation of this process is to use question marks instead of {}, which are replaced from a tuple that follows the string:

Listing 14.2 Insert tweets into a SQLite database
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
>>> from pattern.web import Twitter, author
>>> consumerKey = 'your_info_here'
>>> consumerSecret = 'your_info_here'
>>> accessToken = 'your_info_here'
>>> tokenSecret = 'your_info_here'
>>> twLicense = (consumerKey, consumerSecret, (accessToken, tokenSecret))
>>> tw = Twitter(license=twLicense)

>>> tweetsPerPage = 5
>>> pages = 3
>>> searchTerm = 'win'
>>> for page in range(1, pages):
...     for tweet in tw.search(searchTerm, start=page, count=tweetsPerPage):
...         conn.execute('''INSERT INTO tweets (twid,author,text,date)
...                         VALUES (?,?,?,?)''', (int(tweet.id),
...                                               tweet.author,
...                                               tweet.text,
...                                               tweet.date))

Lines 15 to 17 take charge of aligning the column name with a value placeholder and the corresponding field from the current tweet.

14.1.5. How to retrieve rows from a table with SELECT

To preview the results, use SQLite’s SELECT command, with the syntax SELECT colName1, colName2, FROM tableName. To iterate through the rows, the output of execute() is assigned to an iterator object called a cursor, which for can loop over. Here are several things to try:

Listing 14.3 Preview of first tweet table
1
2
3
4
5
6
7
8
>>> from pprint import pprint
>>> cursor = conn.execute('SELECT text FROM tweets')
>>> for row in cursor: print row
>>> cursor = conn.execute('SELECT twid, author FROM tweets')
>>> for row in cursor: pprint(row)
>>> cursor = conn.execute('SELECT rowid, date FROM tweets')
>>> for row in cursor: pprint(row)
>>> pprint(cursor.fetchall())

Line 6 shows that SQLite automatically adds an index to each row called rowid, much like Python.

14.1.6. How to modify a column with a function and UPDATE

I am not at all satisfied with the date string captured from Twitter. It does not have any mathematical underpinning and so can’t be used to order or retrieve records using the natural sequencing of time. However, if it could be converted to SQLite’s datetime format, such as 2016-11-27 00:34:52, then dates could be sorted and retrieved by order. sqlite3 supplies create_function() to mediate between a function defined in Python and SQLite. With it you can define a function in Python that uses datetime.strptime() to convert the Twitter date string into the SQLite format.

There is one subtle but maddening problem. A Twitter date such as ‘Sun Nov 27 00:34:52 +0000 2016’ has an element that is unparseable by datetime.strptime() in Python 2.7. It is the sequence +0000, which represents the offset from Greenwich Mean Time, or in other words, the time zone. datetime.strptime() chokes on it without considerably more programming than I want to do. The obvious work-around is to cut it out, which is what the function does:

1
2
3
4
5
>>> def str2dtm(dateString):
...     dateString = dateString.replace('+0000 ', '')
...     dtmObject = datetime.strptime(dateString, '%a %b %d %H:%M:%S %Y')
...     return unicode(dtmObject)
...

Line 2 replaces the time zone substring with the empty string. Line 3 performs the format conversion, outputting a Datetime object. Line 4 returns it in Unicode.

Let us test this in and out of SQLite:

1
2
3
4
5
6
>>> twDate = 'Sun Nov 27 00:34:52 +0000 2016'
>>> str2dtm(twDate)
u'2016-11-27 00:34:52'
>>> conn.create_function('STR2DTM', 1, str2dtm)
>>> cursor = conn.execute('SELECT STR2DTM(date) FROM tweets')
>>> pprint(cursor.fetchall())

Line 1 lines up a sample Twitter date, which line 2 converts to SQLite’s format, as reproduced in line 3. Then we switch to sqlite3. Line 4 registers the Python function with sqlite3, with the syntax ('sqliteName', numberOfArguments, pythonName). Line 5 retrieves the date column from the table, converting each existing date into the SQLite format. The final line pretty-prints the list returned by fetchall().

The final step is to actually change the contents of the table. This is effected by changing existing rows, so the appropriate SQLite command is UPDATE, with the syntax UPDATE tableName SET columnName = whatever. In our case, whatever is the application of the function to the date column, STR2DTM(date):

1
2
>>> conn.execute('UPDATE tweets SET date = STR2DTM(date)')
>>> conn.commit()

If line 1 executes without an error, make the change permanent by entering line 2.

14.1.7. How to sort a column with ORDER BY

As one way of using date to sort the table, retrieve the built-in list of row ids, rowid, sorted by ascending date, with the help of an ORDER BY clause appended to SELECT:

1
2
>>> cursor = conn.execute('SELECT rowid FROM tweets ORDER BY date ASC')
>>> pprint(cursor.fetchall())

You may see that the rows were not inserted into the table in ascending order. Did you catch where the parameter for an ascending sort is mentioned? Its converse, a descending sort, is indicated by DESC.

14.1.8. How to add more rows to a table with INSERT

I omit the code for importing resources and setting up your API credentials:

Listing 14.4 Insert more tweets into a SQLite database
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
>>> tweetsPerPage = 5
>>> pages = 3
>>> searchTerm = 'lose'
>>> for page in range(1, pages):
...     for tweet in tw.search(searchTerm, start=page, count=tweetsPerPage):
...         conn.execute('''INSERT INTO tweets (twid,author,text,date)
...                         VALUES (?,?,?,STR2DTM(?))''', (int(tweet.id),
...                                                        tweet.author,
...                                                        tweet.text,
...                                                        tweet.date))
>>> conn.commit()

Commit the changes if there was no error and scan the text column to see what you got:

1
2
3
4
>>> cursor = conn.execute('SELECT text FROM tweets')
>>> for row in cursor: print row
>>> cursor = conn.execute('SELECT date FROM tweets')
>>> pprint(cursor.fetchall())

14.2. Appendix

14.2.1. How get and set up MySQL

Getting a database onto your computer is fraught with problems. In this section, I walk you through the installation and basic usage of MySQL Community Server, v. 5.7.

more See MySQL 5.7 Reference Manual for more than you ever wanted to know about mySQL.

14.2.1.1. How to get MySQL Community Server

Point your web broswer at Download MySQL Community Server. Under Select Platform:, the operating system of your computer should already be choosen, and there should be a list of downloads differentiated by compression scheme. Download one for which you have the corresponding decompression utility on your computer. If you don’t know, just try one. Ignore the “Test Suite”.

A page opens called “Begin Your Download” and asks you sign up for an account. Ignore it and click on No thanks, just start my download.. Once it downloads to your computer, run the installer, and it should do everything for you.

Warning

If a window opens during the installation along the lines of “[Note] A temporary password is generated for root@localhost: …”, copy the password to a safe place. Don’t let it finish without having a record of your password.

Once MySQL is installed, on the Mac, you must add the path to MySQL’s folder to your default PATH by typing this line in the Terminal:

1
$ echo 'export PATH="/usr/local/mysql/bin:$PATH"' >> ~/.bash_profile

14.2.1.2. How to start MySQL Community Server

To start the MySQL server:

The Mac prference window looks like this when the server is running:

_images/mysqlPref.png

Fig. 14.1 Mac Preference Pane for MySQL, with the server running.

Author’s screen shot.

The server must be running to use a database in Python.

On the Mac, an alternative to the MySQL preference pane is to type the following commands in the Terminal:

1
2
3
$ sudo /usr/local/mysql/support-files/mysql.server start
$ YOUR_ROOT_PASSWORD
$ sudo /usr/local/mysql/support-files/mysql.server stop

14.2.1.3. How to change your temporary root password

If you got a password during the installation, change it as soon as possbile. In the Terminal, enter this line:

1
$ mysqladmin -u root -p'YOUR_TEMP_PSWD' password 'YOUR_NEW_PSWD'

Make YOUR_NEW_PSWD something easy to remember like ‘123’. I doubt that anyone is going to try to break into your database.

If this does not work, point your browser at How to Reset the Root Password and scroll down to “Resetting the Root Password: Generic Instructions” and follow them.

14.2.1.4. Getting started with MySQL Monitor

MySQL Server has a semi-graphical user interface called MySQL Monitor that you can play with through the Terminal. To start it, log in as the root user with:

1
2
$ mysql -u root -p
$ Enter password:

You type line 1, and mysql responds by asking you for your root password, which you just changed above. Type it in and hit Return. mysql should respond with this opening message:

_images/mysqlWelcome.png

Fig. 14.2 MySQL Monitor welcome message.

Author’s screen shot.

Did you notice that the prompt has changed to mysql?

If you get a response like ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2), it means that you have not started the server. Go back to How to start MySQL Community Server.

Note

To exit MySQL Monitor, type QUIT: at the prompt. This does not shut off the server, though.

The first thing to do is to create a user account. It is in general considered bad form to constantly use the root account, since it is easier for you to break something. To create a user account, enter these lines in the Terminal. Fill in YOUR_NAME with your name; fill in ANOTHER_PASSWORD with something that is easy for you to remember:

1
2
3
4
mysql> CREATE USER 'YOUR_NAME'@'localhost' IDENTIFIED BY 'ANOTHER_PASSWORD';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'YOUR_NAME'@'localhost'
    -> WITH GRANT OPTION;
mysql> QUIT;

After the response, quit, which returns you to the nomal prompt. Now try to log on to your user account:

1
2
$ mysql -u YOUR_NAME -p
Enter password: ANOTHER_PASSWORD

If it works, you should get the welcome message from MySQL Monitor again, and the prompt turns back into mysql>.

Check the initial state of your databases with line 1; the response is the other lines:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
mysql> SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
4 rows in set (0.01 sec)
mysql> QUIT;

This is what I mean by a “semi-graphical user interface”; MySQL Monitor uses a sort of ‘ASCII art’ to display complex responses in the Terminal. It’s far from a real GUI, but it gets the job done.

I asked you to quit because you need the Terminal for something else for a moment.

14.2.1.5. How to interact with MySQL from Python with mysql.connector

The next step is to download and install the Python package for communicating with MySQL, called mysql-connector-python, which is authored and maintained by Oracle, the owner of MySQL. It is not part of Anaconda’s default distribution, but is available from Anaconda’s conda repository:

1
$ conda install mysql-connector-python

Accept anything it asks you for. It takes a minute or so for the package to be downloaded and installed.

Once it is, in Spyder’s console imoprt the relevant mysql-connector-python module and try to make a connection to your user account:

1
2
3
4
5
>>> from mysql import connector
>>> cnx = connector.connect(user='YOUR_NAME', password='ANOTHER_PASSWORD',
...                         host='127.0.0.1',
...                         database='mysql')
>>> cnx.close()

If there is no problem, it just returns the Python prompt.

Lines 2 to 4 illustrate the full form of making a connection, but mysql-connector-python knows what the default host is, and you can connect to the server without asking for a specific database as so:

>>> cnx = connector.connect(user='YOUR_NAME', password='ANOTHER_PASSWORD')

more For the full documentation of mysql-connector-python, see MySQL Connector/Python Developer Guide.

14.2.1.6. How to create a new database

You are going to create a new database named “tweets”:

Listing 14.5 Create a MySQL database with mysql-connector-python.
1
2
3
4
5
6
7
>>> dbName = 'tweets'
>>> cursor = cnx.cursor()
>>> try:
...     cursor.execute("CREATE DATABASE {} DEFAULT CHARACTER SET 'utf8'".format(dbName))
... except connector.Error as err:
...     print("Failed creating database: {}".format(err))
...     exit(1)

This block of code is not very optimistic about the operation succeeding, which is why it is wrapped in a try-except statement. The core of it is line 4, which formats a string to use as a command for MySQL.

You can check for the presence of the database in the Terminal by repeating SHOW DATABASES;:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
mysql> SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sys                |
| tweets             |
+--------------------+
5 rows in set (0.01 sec)

The new database takes its place in the list.

14.2.2. How to to send tweets to a database

14.2.2.1. How to get data into a database

14.2.2.2. How to get data out of a database

14.2.2.3. How to tell StreamListener() to send tweets to a database

14.3. Summary

14.4. References

Footnotes

[1]

Last edited: November 27, 2016