12. How to get text from Twitter feeds

12.1. Getting started with Twitter

12.1.1. What is an API?

Twitter is a Mississippi of tweets sloshing back and forth all over the world. We want to stick a bucket in that torrent to see what is on people’s minds. But what commands would we use? All a Twitter account lets us do is send and receive tweets, not look at the whole global flow.

Well, if Twitter’s developers do not tell us what commands to use, we are stuck. (Netflix, for one, used to tell us what commands to use, but no longer does.) Fortunately, Twitter welcomes outsiders into its code, by means of its API. API stands for Application Programming Interface.

12.1.2. How to get authorization for Twitter’s API

The first thing to do is to sign up for a Twitter account at twitter.com, if you don’t already have one. Then point your browser at Twitter Developers and log in with your new account credentials. At the top right corner, click on the triangle, choose My Applications, and sign in again. Hit the Create New App button. Give it any name you want, describe it as “computational culture with Twitter”, use the course website “http://www.tulane.edu/~howard/SPAN-NLP/” as the website, click the button to agree with the Developer Rules of the Road, and click on Create your Twitter application.

On the next page, select API Keys from the menu. On the Application settings page, for the time being, you can keep the access level at Read-only. If you want to change it to Read, Write and Access direct messages by means of the modify app permissions link, you will have to give Twitter your cell phone number. Scroll down the page and click on create my access token. You will get a confirmation message at the top of the page. You may want to click the reset link to hurry the process along.

There are now four crucial pieces of information that you will need to make note of: API key, API secret, Access token and Access token secret. Since these are long and unwieldy strings, you should copy and paste them into some handy place immediately. In fact, since you are going to use them in several scripts, go ahead and open a new Spyder file, paste these lines into it, fill in your information, and save it as something like “tweepyLogon.py”:

API_KEY = 'your_info_here'
API_SECRET = 'your_info_here'
ACCESS_TOKEN = 'your_info_here'
ACCESS_TOKEN_SECRET = 'your_info_here'

12.1.3. API protocols and Python packages

Twitter’s API reveals two ways of interacting with it: a request and response protocol known as representational state transfer or REST, and a streaming protocol. Each one has a Python package associated with it, python-twitter and tweepy, respectively. The streaming protocol is much more exciting, so we will concentrate on it. Thus the first step is for you to install tweepy in your Python installation. If it is not part of the installation (and it is not with Anaconda or Canopy), you have to download and install it yourself – recall the instructions ???.
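
The difference between the two protocols can be pictured in plain Python, without touching Twitter at all. In this rough sketch, fetch_recent() and stream_tweets() are hypothetical stand-ins invented for illustration (they belong to neither python-twitter nor tweepy): the first answers one request with a finished list, while the second keeps handing over items one at a time until the consumer stops asking.

```python
# Hypothetical stand-ins for illustration only; no network access is involved.
import itertools

FAKE_TWEETS = ['uno', 'dos', 'tres']

def fetch_recent(count):
    """Request/response style: one request, one complete response."""
    return FAKE_TWEETS[:count]

def stream_tweets():
    """Streaming style: an open-ended flow that the consumer reads from."""
    for n in itertools.count(1):
        yield 'tweet %d' % n

response = fetch_recent(2)                                 # a finished list
first_three = list(itertools.islice(stream_tweets(), 3))   # the consumer decides when to stop
```

The second style is why a streaming script needs its own way of deciding when it has had enough.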

12.1.4. How to log on to the Twitter API with tweepy

The unavoidable steps for logging on to the Twitter API with your application credentials are as follows:

import tweepy
API_KEY = 'your_info_here'
API_SECRET = 'your_info_here'
ACCESS_TOKEN = 'your_info_here'
ACCESS_TOKEN_SECRET = 'your_info_here'
key = tweepy.OAuthHandler(API_KEY, API_SECRET)
key.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

Add the missing lines to “tweepyLogon.py”. Then, to test that it works, add the following lines, which ask the API for the name of the authenticated account:

api = tweepy.API(key)
print api.me().name

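When you run the script, Spyder’s console should respond with the name attached to the account that owns the credentials (api.me() returns the authenticated user, not the application). If it doesn’t, you probably didn’t enter your authentication credentials correctly.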
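
Before blaming Twitter, it can be worth checking the four strings themselves. The helper below is a hypothetical convenience, not part of tweepy, and the credential values are made up; it merely flags entries that still hold the placeholder text or that picked up stray spaces when pasted.

```python
# Hypothetical helper, not part of tweepy: flags credentials that still hold
# the placeholder text, are empty, or carry stray whitespace from pasting.
PLACEHOLDER = 'your_info_here'

def suspicious_credentials(credentials):
    """Return the names of entries that look unfilled or badly pasted."""
    problems = []
    for name, value in sorted(credentials.items()):
        if not value or value == PLACEHOLDER or value != value.strip():
            problems.append(name)
    return problems

example = {
    'API_KEY': 'abc123',                # made-up value
    'API_SECRET': 'your_info_here',     # never filled in
    'ACCESS_TOKEN': ' 42-xyz ',         # pasted with surrounding spaces
    'ACCESS_TOKEN_SECRET': 'def456',    # made-up value
}
print(suspicious_credentials(example))
```

An empty list means only that the strings look plausible; whether Twitter accepts them is a separate matter.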

12.2. How to pull tweets from the Twitter stream with StreamListener()

Twitter’s streaming protocol is often called a fire hose, because it just blasts every tweet in the world at you. Tweepy has a class called StreamListener() that listens to the fire hose – sorry for the mixed metaphor! – and picks out just the tweets that you want. StreamListener() was designed to be very flexible, so we can adapt it to our purposes, but this adaptability comes at a price – StreamListener() can’t do anything right out of the box!

There are two crucial things that StreamListener() cannot do: it cannot send the tweets it selects somewhere, like to your computer screen, a text file, or a database, and it cannot stop listening for tweets when you have enough. But it is easy enough to modify it to perform these tasks, and in fact that is what the developer of tweepy intended its users to do.

12.2.1. Classes in Python

To effect this modification, you are going to create a new class based on StreamListener() that fills in the missing functionality. A “class” in Python is a group of functions, together with the data they share, that work together. The StreamListener() class has two functions that you are going to modify. The first is the initialization function __init__(), which runs once, when the class is called to create a listener, and stores the information that the listener starts out with. The second is the on_status() function. In Twitter, a tweet is known technically as a status update. on_status() tells StreamListener() what to do when it receives a new status update. This is where the most important processing takes place. The code below lays out a skeleton for the new class, which you will add to as you go along:

class NewStreamListener(tweepy.StreamListener):
        def __init__(self, api=None):
                self.api = api or API()

        def on_status(self, status):
                pass    # placeholder; the body is filled in below

NewStreamListener() takes StreamListener() as its argument and thereby inherits everything that is in StreamListener(). We have to be careful to only ‘fill in the blanks’ – that is, to only add material that is not in StreamListener() and not contradict anything that is already there, which would break it. For instance, the code having to do with api in __init__() is part of the endowment of StreamListener(). We gingerly tiptoe around it, leaving it untouched.
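
If inheritance is new to you, it may help to watch it at work on a toy example first. In the sketch below, BaseListener is a hypothetical stand-in for tweepy.StreamListener, cut down to the two functions under discussion; EchoListener overrides only on_status() and gets __init__() for free.

```python
# BaseListener is a hypothetical stand-in for tweepy.StreamListener,
# reduced to the two functions discussed above.
class BaseListener(object):
    def __init__(self, api=None):
        self.api = api or 'default api'   # inherited setup that we leave alone

    def on_status(self, status):
        return True                       # default: do nothing and keep going

class EchoListener(BaseListener):
    # Only on_status() is overridden; __init__() is inherited unchanged.
    def on_status(self, status):
        self.last_seen = status           # the 'blank' that we fill in
        return True

listener = EchoListener()
listener.on_status('hola')
```

After these lines run, listener.api holds the value set by the inherited __init__(), even though EchoListener never mentions it.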

12.2.2. How to tell StreamListener() to stop listening for tweets

For the sake of discussion, imagine that you only want twenty tweets. How would you tell StreamListener() to turn itself off when it hits twenty? The standard computational solution is to create a counter that increments itself by one every time that StreamListener() finds a relevant tweet, up to twenty. In pseudo-code, the algorithm looks like this:

upon initializing StreamListener()
        set the counter to 0
        set the maximum to 20

when a relevant tweet is found
        if the counter is less than 20, add 1 to it and keep going
        otherwise, exit StreamListener()

Seeing these steps implemented in their tweepy context may help you to understand them better, so we add them to the previous snippet of code:

class StopStreamListener(tweepy.StreamListener):
        def __init__(self, api=None):
                self.api = api or API()
                self.n = 0
                self.m = 20

        def on_status(self, status):
                self.n = self.n+1
                if self.n < self.m: return True
                else: return False

On line 4 of __init__(), self.n is the counter, which is initialized to 0. On line 5, self.m is the maximum number of tweets wanted, which is initialized to 20. In on_status(), line 8 increments the counter and the if statement checks to see whether the new count is less than the maximum. If it is, on_status() returns True, which tells StopStreamListener() to keep listening for more tweets. Otherwise, on_status() returns False, which tells StopStreamListener() to stop listening and ultimately turn tweepy off.
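
The counting logic can be tried out without a Twitter connection. The sketch below is only an imitation of the control flow: CountingListener and run_fake_stream() are hypothetical names, and the loop assumes, as described above, that the stream stops as soon as on_status() returns False.

```python
# CountingListener and run_fake_stream are hypothetical stand-ins:
# they imitate the control flow only and never contact Twitter.
class CountingListener(object):
    def __init__(self, maximum=20):
        self.n = 0            # the counter
        self.m = maximum      # the maximum number of tweets wanted

    def on_status(self, status):
        self.n = self.n + 1
        if self.n < self.m:
            return True       # keep listening
        else:
            return False      # stop listening

def run_fake_stream(listener, statuses):
    """Feed statuses to the listener until it returns False."""
    delivered = 0
    for status in statuses:
        delivered += 1
        if listener.on_status(status) is False:
            break
    return delivered

listener = CountingListener(maximum=20)
delivered = run_fake_stream(listener, range(1000))
```

Even though a thousand fake statuses are on offer, the listener takes exactly twenty and then shuts the loop down.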

12.3. Four places to send tweets

In the rest of this chapter you will learn four places to send tweets, summarized in this image:

_images/TwitterTransCont.png

They are the Spyder console, a text file, a Python dictionary, and a database.

12.4. How to tell StreamListener() to print tweets to the screen

The other bit of functionality that you have to fill in is to send the tweets somewhere. The easiest place to start with is your computer screen, so that you can see how cool Twitter is. The block below adds the requisite lines to the previous block:

class Stream2Screen(tweepy.StreamListener):
    def __init__(self, api=None):
        self.api = api or API()
        self.n = 0
        self.m = 20

    def on_status(self, status):
        print status.text.encode('utf8')
        self.n = self.n+1
        if self.n < self.m: return True
        else:
            print 'tweets = '+str(self.n)
            return False

In on_status(), line 8 now prints the text of a status update to the screen, encoding it in UTF-8 just to be on the safe side. The text of a status update is what most people would consider to be a tweet. The else condition has been expanded on line 12 to print a sign-off message, a string with the number of tweets found. It should be the same as the tweet maximum, which is a great way of checking whether the code has worked right.
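
What encode('utf8') does can be checked in isolation. The text below is made up, and the accented character is written as an escape so that the snippet does not depend on the encoding of your file.

```python
# A made-up status text containing one accented character (o with acute).
text = u'inundaci\xf3n'

encoded = text.encode('utf8')      # a byte string, safe for consoles and files
decoded = encoded.decode('utf8')   # back to the original Unicode string

n_characters = len(text)           # counts characters
n_bytes = len(encoded)             # counts bytes; the accented letter takes two
```

The round trip through decode() gives back exactly the string you started with, so nothing is lost by encoding.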

12.4.1. How to invoke StreamListener()

Now that you have added the functionality missing from StreamListener(), you have to call it. That only takes two lines of code:

stream = tweepy.streaming.Stream(key, Stream2Screen())
stream.filter(track=['de'], languages=['es'])

The first line creates a Stream object from your login credentials, loaded above into key, and an instance of the Stream2Screen() class that you just finished. But that just gets you a place-holder for a stream. The second line starts the stream by giving it two parameters to listen for: keywords, as a list of strings assigned to track, and languages, here Spanish. The filter needs at least one criterion such as track (you can’t listen for everything, apparently). The languages argument is optional. If it is omitted, tweets in any language are returned. If you want to ask for English explicitly, its abbreviation is en.
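
To get a feel for what the two parameters select, here is a rough local imitation. It is only a sketch: matches() is a hypothetical function, the sample tweets are partly made up, and Twitter’s own matching rules are more elaborate and are applied on Twitter’s servers rather than in your script.

```python
# Hypothetical local imitation of filtering; Twitter's real matching rules
# are more elaborate and are applied on Twitter's side.
def matches(tweet, track, languages):
    """True if a tracked keyword is a word of the text and the language fits."""
    words = tweet['text'].lower().split()
    has_keyword = any(keyword.lower() in words for keyword in track)
    return has_keyword and tweet['lang'] in languages

sample = [
    {'text': 'Haciendo desayuno para el jefe de la casa.', 'lang': 'es'},
    {'text': 'Necesito un abrazo', 'lang': 'es'},      # no keyword
    {'text': 'de nada', 'lang': 'pt'},                 # wrong language
]
kept = [t['text'] for t in sample if matches(t, track=['de'], languages=['es'])]
```

Only the first sample survives, since it is the only one that both contains the word de and is marked as Spanish.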

12.4.2. Putting it together into a single script

The block below combines Stream2Screen() with the authentication and calling code to produce a working script:

# -*- coding: utf-8 -*-
# suggested name: tweepyFlujoMonitor
import tweepy
from tweepy.api import API

API_KEY = 'your_info_here'
API_SECRET = 'your_info_here'
ACCESS_TOKEN = 'your_info_here'
ACCESS_TOKEN_SECRET = 'your_info_here'
key = tweepy.OAuthHandler(API_KEY, API_SECRET)
key.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

class Stream2Screen(tweepy.StreamListener):
    def __init__(self, api=None):
        self.api = api or API()
        self.n = 0
        self.m = 20

    def on_status(self, status):
        print status.text.encode('utf8')
        self.n = self.n+1
        if self.n < self.m: return True
        else:
            print 'tweets = '+str(self.n)
            return False

stream = tweepy.streaming.Stream(key, Stream2Screen())
stream.filter(track=['de'], languages=['es'])

There is one new line here, from tweepy.api import API. It imports the API class needed by the first line of __init__(). Again, this is a built-in part of StreamListener() that we do not touch.
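
The expression api or API() in __init__() is a common Python idiom for supplying a default. A small sketch may make it less mysterious; DefaultAPI and choose_api() are hypothetical names standing in for tweepy’s API class and for the first line of __init__().

```python
# DefaultAPI is a hypothetical stand-in for tweepy's API class.
class DefaultAPI(object):
    pass

def choose_api(api=None):
    # `or` returns its left operand if that is truthy, otherwise the right one.
    return api or DefaultAPI()

explicit = object()
used_explicit = choose_api(explicit) is explicit       # the caller's object is kept
used_default = isinstance(choose_api(), DefaultAPI)    # a fresh default is created
```

Since Stream2Screen() is called with no arguments in the script, it is the default branch that is taken there.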

Running this script should produce something like the following:

@UltraRadioTol Hola, por favor pasen #ILoveYouTeQuiero, nuevo single de @Belindapop feat @pitbull. Gracias!
#Ahora especial sobre la inundaciñon en La Plata ¿qué pasó? Un recorrido de ayer a hoy miralo en http://t.co/ts2WgCYD4o
RT @fernandezpm: Los que van al Lollapalooza saben que hay una app para ver los horarios? De nada, miren: #twitteresservicio http://t.co/3h…
RT @Paula_257: Necesito un abrazo,de esos que me aseguren que todo saldrá bien.
RT @iIusionOptica: ¡Esto es genial! Más de una vez hemos necesitado un cable más largo, ahora con este invento ya es posible. http://t.co/e…
RT @formula1tv: Además, los comisarios no vieron indicios de penalización en el incidente entre Kvyat y Alonso en #Q2 http://t.co/sNGe4CXat…
Arq Eduardo Ferrareso Subsecretario de Tierras de la provincia de Neuquén: http://t.co/KEq6HFKJuX via @YouTube
@BalletSCali tiene para la venta hermosos obsequios para nuestras bailarinas, pregúntalos en la oficina de dirección http://t.co/AeFeHsRFB9
RT @Mindeporte: #MomentoStgo2014 Jesús Aguilar doble dorado de Venezuela http://t.co/YCAsjhuIFL
@Niall_Boswell -Pone los ojos en blanco, cruzando los brazos y sin dejarse amedrantar.- No sé si te has dado cuenta de que de que formo &gt;
RT @Continental_: A San Lorenzo se le escapó el triunfo en el final y peligra su clasificación: Ganaba con gol de Blandi, pero, ... http://…
Haciendo desayuno para el jefe de la casa.
RT @AlaanUlloa: El mejor lugar para dormir es el banco de la escuela, ami no me jodan.
RT @MartaBarrera: @mmariaagomez pues mira, renovación de abrigos jajaja ;)
Joder, estaba en ask y alguien le ha dado a MG a la respuesta de un nazi, y soy tan lista que le he dado MG sin querer. :-).
RT @Foto_Historia: Plaza de la Independencia, en Ucrania, antes y después de las protestas. http://t.co/l2Hd6FvdPI
Estoy descubriendo que tengo un 'type' de celebrity crushes. Kay Murray y Candice Swanepoel se parecen bastante
RT @ermorochito: @reinaldoprofeta QUITAR TODA PROPAGANDA , AFICHE, PANCARTA ROJA COMUNISTA  DE TODAS LAS CALLES DE VZLA, EN HONOR A LOS CAI…
INCREIBLE!! VOY A VENDER MI PERFIL ME PAGAN 69 EUROS POR MI CUENTA DE TWITTER!! CALCULA EL PRECIO DE LA TUYA EN:

http://t.co/QBjfOn9wZh

.
Unete todas las madrugadas a nuestros Tiempos de Tefila (Oracion) en Yeshua. Ver Horarios: http://t.co/k8yOUWm1Ny
tweets = 20

12.5. How to tell StreamListener() to save tweets to a file

Streaming tweets to your computer screen is cool and useful for testing and trouble-shooting, but we would rather collect them into a text file that we can run through with our regular tools for text analysis.

To recall the discussion from ???, Python’s algorithm for creating a file invokes the open(), write(), and close() methods. open() creates an empty file. write() fills it with text, and close() saves it to disk and recycles all the resources that were used.
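
As a quick refresher, the three calls can be tried on a throwaway file. The temporary directory and the name demo.txt are chosen here only so that the sketch can run anywhere; they are not part of the course materials.

```python
import os
import tempfile

# A throwaway location, chosen only so that the sketch can run anywhere.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

output = open(path, 'w')      # open() creates the empty file
output.write('primera\n')     # write() adds text to it
output.write('segunda\n')
output.close()                # close() saves to disk and frees the resources

check = open(path)            # reopen for reading to see what was saved
lines = check.read().splitlines()
check.close()
```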

With respect to listening to a tweet stream, open() is only invoked once at the beginning and so can be put into __init__(). write() must be invoked every time a valid status update is received in order to add it to the file and so substitutes for print in on_status(). close() is also only invoked once, after the maximum number of updates has been received, and so should reside in the else condition of on_status(). Putting all of this together with Stream2Screen() produces the following new class, Stream2File():

class Stream2File(tweepy.StreamListener):
    def __init__(self, api=None):
        self.api = api or API()
        self.n = 0
        self.m = 20
        self.output = open('/Users/{user}/nltk_data/pytextos/tweepy_text.txt', 'w')

    def on_status(self, status):
        self.output.write(status.text.encode('utf8') + "\n")
        self.n = self.n+1
        if self.n < self.m: return True
        else:
            self.output.close()
            print 'tweets = '+str(self.n)
            return False

In line 6, the variable output is assigned to be the temporary place-holder for the file, and the entire path to the new file tweepy_text.txt is given, just in case. In line 9, a new status update is written to output and ended with a newline, to keep tweets separate. In line 13, the file is closed when the maximum number of tweets has been counted.
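
One wrinkle is worth knowing about: the text of a status update can itself contain line breaks, as one tweet in the sample run for Stream2Screen() does, in which case a single tweet occupies several lines of the file. A possible remedy, sketched here with a hypothetical helper one_line() that is not part of tweepy, is to collapse every run of whitespace before writing.

```python
def one_line(text):
    """Collapse every run of whitespace, including newlines, to one space."""
    return u' '.join(text.split())

# A made-up status text with embedded line breaks.
raw = u'PRIMERA PARTE:\n\nhttp://example.com\n\n.'
flat = one_line(raw)
```

Under that assumption, the write in on_status() would become self.output.write(one_line(status.text).encode('utf8') + "\n"), which keeps the file at one tweet per line.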

12.5.1. Putting it together into a single script

The code block below plugs Stream2File() into the previous script:

# -*- coding: utf-8 -*-
# suggested name: tweepyFlujoArchivo
import tweepy
from tweepy.api import API

API_KEY = 'your_info_here'
API_SECRET = 'your_info_here'
ACCESS_TOKEN = 'your_info_here'
ACCESS_TOKEN_SECRET = 'your_info_here'
key = tweepy.OAuthHandler(API_KEY, API_SECRET)
key.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

class Stream2File(tweepy.StreamListener):
    def __init__(self, api=None):
        self.api = api or API()
        self.n = 0
        self.m = 20
        self.output = open('/Users/{user}/nltk_data/pytextos/tweepy_text.txt', 'w')

    def on_status(self, status):
        self.output.write(status.text.encode('utf8') + "\n")
        self.n = self.n+1
        if self.n < self.m: return True
        else:
            self.output.close()
            print 'tweets = '+str(self.n)
            return False

stream = tweepy.streaming.Stream(key, Stream2File())
stream.filter(track=['de'], languages=['es'])

There aren’t any new lines here.

Opening the file produced by this script should reveal many lines of tweets, just like the sample run for Stream2Screen(). The sign-off message “tweets = 20” is not in the file; it still goes to the console, since it is sent there by print.

12.5.2. A reminder of how to get the tweet file into NLTK Text

Now that you have a text file of tweets, you should convert it to NLTK text, so that it can be analyzed. The following reviews the steps to go through:

import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.text import Text
path = '/Users/{your_user_name}/nltk_data/pytextos'
name = 'tweepy_text.txt'
texlector = PlaintextCorpusReader(path, name, encoding='utf8')
texto = Text(texlector.words())

Test that it worked with our old standbys:

len(texto)
texto[:50]

12.6. What’s in a tweet

So far, we have assumed that a tweet is a string of 140 characters, but this is really just the tip of the iceberg. A Twitter status update contains an enormous amount of additional information. To see it all in its full glory, you are going to alter tweepyFlujoMonitor to display all of the data in a status update.

12.6.1. How to use json to translate JSON to a dictionary in on_data

Twitter uses a format called JSON (JavaScript Object Notation) for transmitting its data. In StreamListener(), the function that deals with JSON objects is not on_status() but rather on_data(). Python has a module, json, for converting a JSON data object to a Python dictionary. You will load and call json and print its output:

# -*- coding: utf-8 -*-
# suggested name: tweepyFlujoMonitorData.py
import tweepy, json
from tweepy.api import API

# LOGON_INFO_HERE

class Stream2Screen(tweepy.StreamListener):
    def __init__(self, api=None):
        self.api = api or API()
        self.n = 0
        self.m = 1

    def on_data(self, data):
        datadict = json.loads(data)
        print str(datadict)
        self.n = self.n+1
        if self.n < self.m: return True
        else:
            print 'tweets = '+str(self.n)
            return False

stream = tweepy.streaming.Stream(key, Stream2Screen())
stream.filter(track=['de'], languages=['es'])

In line 3, the json module is imported. Line 14 switches from on_status() to on_data(). The difference is the conversion of the JSON object on line 15, and the display of the result on line 16. The rest is just on_status().
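
The conversion itself is easy to try without a stream. The JSON string below is hand-written for illustration and imitates only a tiny fragment of a status update.

```python
import json

# A hand-written JSON string imitating a tiny fragment of a status update.
data = '{"id": 1, "text": "hola", "user": {"screen_name": "ejemplo"}, "geo": null}'

datadict = json.loads(data)
screen_name = datadict['user']['screen_name']   # nested objects become nested dictionaries
geo = datadict['geo']                           # JSON null becomes Python None
```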

12.6.2. How to use pprint to pretty up the print out

The standard printing of a status update produces a jumbled ball of indecipherable notation. If you think that I exaggerate, tell me whose screen name is replied to in this update:

{u'contributors': None, u'truncated': False, u'text': u'@NaachoTotaro Todo
lleno de liqui nada que ver -.- nose que me ensucian el banco gilessss',
u'in_reply_to_status_id': 450802384705683456, u'id': 450802805780267009,
u'favorite_count': 0, u'source': u'web', u'retweeted': False, u'coordinates':
None, u'entities': {u'symbols': [], u'user_mentions': [{u'id': 713363136,
u'indices': [0, 13], u'id_str': u'713363136', u'screen_name': u'NaachoTotaro',
u'name': u'NachoTotaro/BRC\xb414\u2665'}], u'hashtags': [], u'urls': []},
u'in_reply_to_screen_name': u'NaachoTotaro', u'id_str': u'450802805780267009',
u'retweet_count': 0, u'in_reply_to_user_id': 713363136, u'favorited': False,
u'user': {u'follow_request_sent': None, u'profile_use_background_image': True,
u'default_profile_image': False, u'id': 1326116929, u'profile_background_image_url_https':
u'https://pbs.twimg.com/profile_background_images/450008721792303104/KybvZYnK.jpeg',
u'verified': False, u'profile_image_url_https':
u'https://pbs.twimg.com/profile_images/450373575497625601/VLLOdmDc_normal.jpeg',
u'profile_sidebar_fill_color': u'E5507E', u'profile_text_color': u'362720', u'followers_count': 322,
u'profile_sidebar_border_color': u'000000', u'id_str': u'1326116929',
u'profile_background_color': u'72BBE0', u'listed_count': 0, u'is_translation_enabled':
False, u'utc_offset': -10800, u'statuses_count': 12383, u'description':
u'El mundo est\xe1 cambiando \ny ya no tiene explicacion.( No apto para perseguidas)',
u'friends_count': 280, u'location': u'', u'profile_link_color': u'DB46CA',
u'profile_image_url': u'http://pbs.twimg.com/profile_images/450373575497625601/VLLOdmDc_normal.
jpeg', u'following': None, u'geo_enabled': True, u'profile_banner_url':
u'https://pbs.twimg.com/profile_banners/1326116929/1395525742', u'profile_background_image_url':
u'http://pbs.twimg.com/profile_background_images/450008721792303104/KybvZYnK.jpeg',
u'name': u'\u0418egrita \u2661', u'lang': u'es', u'profile_background_tile': True,
u'favourites_count': 652, u'screen_name': u'BaarbiChamorro', u'notifications':
None, u'url': None, u'created_at': u'Thu Apr 04 04:07:34 +0000 2013',
u'contributors_enabled': False, u'time_zone': u'Brasilia', u'protected':
False, u'default_profile': False, u'is_translator': False}, u'geo': None,
u'in_reply_to_user_id_str': u'713363136', u'lang': u'es', u'created_at':
u'Tue Apr 01 01:12:19 +0000 2014', u'filter_level': u'medium', u'in_reply_to_status_id_str':
u'450802384705683456', u'place': None}

To unwind this ball, you have recourse to the pprint module, which, among other things, places each attribute:value pair of the JSON hierarchy on its own line and indents it according to its place in the hierarchy. To invoke pprint, add pprint to the import line, i.e. import tweepy, json, pprint and change the print statement to the basic pprint method pprint.pprint(datadict). Now tell me whose screen name is replied to:

{u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Tue Apr 01 01:12:19 +0000 2014',
 u'entities': {u'hashtags': [],
               u'symbols': [],
               u'urls': [],
               u'user_mentions': [{u'id': 713363136,
                                   u'id_str': u'713363136',
                                   u'indices': [0, 13],
                                   u'name': u'NachoTotaro/BRC\xb414\u2665',
                                   u'screen_name': u'NaachoTotaro'}]},
 u'favorite_count': 0,
 u'favorited': False,
 u'filter_level': u'medium',
 u'geo': None,
 u'id': 450802805780267009,
 u'id_str': u'450802805780267009',
 u'in_reply_to_screen_name': u'NaachoTotaro',
 u'in_reply_to_status_id': 450802384705683456,
 u'in_reply_to_status_id_str': u'450802384705683456',
 u'in_reply_to_user_id': 713363136,
 u'in_reply_to_user_id_str': u'713363136',
 u'lang': u'es',
 u'place': None,
 u'retweet_count': 0,
 u'retweeted': False,
 u'source': u'web',
 u'text': u'@NaachoTotaro Todo lleno de liqui nada que ver -.- nose que me ensucian el banco gilessss',
 u'truncated': False,
 u'user': {u'contributors_enabled': False,
           u'created_at': u'Thu Apr 04 04:07:34 +0000 2013',
           u'default_profile': False,
           u'default_profile_image': False,
           u'description': u'El mundo est\xe1 cambiando \ny ya no tiene explicacion.( No apto para perseguidas)',
           u'favourites_count': 652,
           u'follow_request_sent': None,
           u'followers_count': 322,
           u'following': None,
           u'friends_count': 280,
           u'geo_enabled': True,
           u'id': 1326116929,
           u'id_str': u'1326116929',
           u'is_translation_enabled': False,
           u'is_translator': False,
           u'lang': u'es',
           u'listed_count': 0,
           u'location': u'',
           u'name': u'\u0418egrita \u2661',
           u'notifications': None,
           u'profile_background_color': u'72BBE0',
           u'profile_background_image_url': u'http://pbs.twimg.com/profile_background_images/450008721792303104/KybvZYnK.jpeg',
           u'profile_background_image_url_https': u'https://pbs.twimg.com/profile_background_images/450008721792303104/KybvZYnK.jpeg',
           u'profile_background_tile': True,
           u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/1326116929/1395525742',
           u'profile_image_url': u'http://pbs.twimg.com/profile_images/450373575497625601/VLLOdmDc_normal.jpeg',
           u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/450373575497625601/VLLOdmDc_normal.jpeg',
           u'profile_link_color': u'DB46CA',
           u'profile_sidebar_border_color': u'000000',
           u'profile_sidebar_fill_color': u'E5507E',
           u'profile_text_color': u'362720',
           u'profile_use_background_image': True,
           u'protected': False,
           u'screen_name': u'BaarbiChamorro',
           u'statuses_count': 12383,
           u'time_zone': u'Brasilia',
           u'url': None,
           u'utc_offset': -10800,
           u'verified': False}}

It wasn’t too hard to pick out NaachoTotaro, was it?
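
The effect of pprint is easier to study on a dictionary that fits on the screen. The miniature status update below is made up, and the narrow width is chosen only to force line breaks on such a small example.

```python
import pprint

# A made-up miniature of a status update.
small = {'id': 1,
         'text': 'hola',
         'user': {'screen_name': 'ejemplo', 'lang': 'es', 'location': ''}}

flat = str(small)                           # everything on a single line
pretty = pprint.pformat(small, width=40)    # same content, spread over several lines
print(pretty)
```

pprint.pformat() returns the formatted string instead of printing it, which is handy if you want to write the pretty version to a file.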

There are several status update keys that I am interested in, in addition to text: the screen name, to keep track of who tweets, and all of the information about location, such as “geo” and “place” and, under “user”, “time_zone” and “location”. I will also need a unique identifier for each tweet, which is the purpose of “id”. I may also want the time of the tweet, from “created_at”.
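
Picking these keys out of the dictionary is a matter of ordinary indexing, with one extra step for the keys nested under “user”. The helper extract_fields() below is a hypothetical name, and the status update is an abbreviated copy of the one printed above with its text shortened.

```python
# extract_fields() is a hypothetical helper; the key names are those of the
# status update printed above.
def extract_fields(datadict):
    user = datadict.get('user') or {}     # tolerate a missing 'user' entry
    return {
        'id': datadict.get('id'),
        'text': datadict.get('text'),
        'created_at': datadict.get('created_at'),
        'geo': datadict.get('geo'),
        'place': datadict.get('place'),
        'screen_name': user.get('screen_name'),
        'time_zone': user.get('time_zone'),
        'location': user.get('location'),
    }

# An abbreviated copy of the status update above (text shortened).
status = {'id': 450802805780267009,
          'text': u'@NaachoTotaro Todo lleno de liqui',
          'created_at': u'Tue Apr 01 01:12:19 +0000 2014',
          'geo': None,
          'place': None,
          'user': {'screen_name': u'BaarbiChamorro',
                   'time_zone': u'Brasilia',
                   'location': u''}}
fields = extract_fields(status)
```

Using get() rather than square brackets means that a missing key yields None instead of stopping the script with a KeyError.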

12.7. How to tell StreamListener() to save tweets to a dictionary

The problem is how to save this information. I could put it into a list, but it would be extremely complicated and unintuitive to keep up with the correspondences among values: every seventh member of the list would be the value of the same key. Fortunately, Python has a data structure for maintaining correspondences across data, known as a dictionary.

12.7.1. What is a Python dictionary?

A Python dictionary is a collection of key:value pairs enclosed in curly brackets. You just saw one, the status update above. Here are the main methods for working with dictionaries:

>>> nomgen = {'hombre':'m','mujer':'f','policia':'f'}
>>> nomgen['mujer']
'f'
>>> len(nomgen)
3
>>> str(nomgen)
"{'policia': 'f', 'mujer': 'f', 'hombre': 'm'}"
>>> type(nomgen)
<type 'dict'>
>>> nomgen.has_key('hombre')
True
>>> 'hombre' in nomgen
True
>>> nomgen.items()
[('policia', 'f'), ('mujer', 'f'), ('hombre', 'm')]
>>> nomgen.keys()
['policia', 'mujer', 'hombre']
>>> nomgen.values()
['f', 'f', 'm']
>>> nomgen['policia'] = 'm'
>>> nomgen
{'policia': 'm', 'mujer': 'f', 'hombre': 'm'}

Line 1 creates a dictionary. Line 2 queries it for the value of the key 'mujer'. Note that this is just like querying a list for the value at an index. Line 4 shows that a dictionary has a length, which is the number of keys. Line 6 shows how to convert a dictionary to a string for display. Line 8 returns the dictionary’s type. Line 10 shows how to check a dictionary for the existence of a key, though has_key() is considered obsolete. Line 12 shows the preferred alternative, which generalizes the in operator. Line 14 returns a list of the contents of the dictionary as a sequence of ordered pairs. Lines 16 and 18 return the keys and values, respectively.

Policía is one of those rare words that has both a feminine and a masculine usage. La policía means police force, while el policía means policeman. Line 20 tries to add an additional key for policia with the “m” value. What happens? It overwrites the previous “f” value. This demonstrates that dictionary keys must be unique, and in many ways act like the set() function for removing duplicates from a list.
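
The same points can be checked in a script rather than at the prompt. This sketch leaves has_key() out, since Python 3 dropped it, and sorts the keys, since the order in which a dictionary displays its contents should not be relied on. The extra entry artista is added purely for illustration.

```python
nomgen = {'hombre': 'm', 'mujer': 'f', 'policia': 'f'}

before = len(nomgen)
nomgen['policia'] = 'm'          # same key: the old value is overwritten
after = len(nomgen)              # so the length does not change

nomgen['artista'] = 'm'          # new key: the dictionary grows by one
keys = sorted(nomgen.keys())
has_hombre = 'hombre' in nomgen  # membership test on the keys
has_perro = 'perro' in nomgen
```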

12.7.2. Saving status update values to a dictionary

To collect the information wanted into a dictionary, it is now clear that some unique datum must serve as key. You could make one up by concatenating screen name and time, which has the advantage of being meaningful, but since there is a unique identifier in “id”, you may as well start with it. The data structure to be created for each tweet is {id: {text, screen_name, geo, time_zone}}, which is to say, a dictionary of dictionaries. Here is the modification of tweepyFlujoArchivo.py which sends output to a dictionary:

# -*- coding: utf-8 -*-
# suggested name: tweepyFlujoDic1.py
import tweepy, pprint
from tweepy.api import API

# YOUR LOGON HERE

class Flujo2Diccionario(tweepy.StreamListener):

    def __init__(self, api=None):
        self.api = api or API()
        self.n = 0
        self.m = 1
        self.output = {}

    def on_status(self, status):
        self.output[status.id] = {
            'tweet':status.text.encode('utf8'),
            'usuario':status.user.screen_name,
            'geo':status.geo,
            'lugar':status.place,
            'localizacion':status.user.location,
            'zona':status.user.time_zone}
        self.n = self.n+1
        if self.n < self.m: return True
        else:
            pprint.pprint(self.output)
            print 'tweets = '+str(self.n)
            return False

flujo = tweepy.streaming.Stream(key, Flujo2Diccionario())
flujo.filter(track=['de'], languages=['es'])

Line 3 adds pprint to the modules imported, which is used to format the printing of the dictionary in line 27. In lines 8 and 31, the name of the new class is changed to Flujo2Diccionario. Lines 17 through 23 insert the relevant values into the dictionary defined as output on line 14.

Test this code with just one tweet (line 13) until you are satisfied that it works, and then up the number. Here is a typical run for three tweets:

tweets = 3
{450975400555212801: {'geo': None,
                      'localizacion': u'Castell\xf3n Comunidad Valenciana',
                      'lugar': None,
                      'tweet': 'RT @GaviotaPopular: El @PSOE cree que el dinero p\xc3\xbablico no es de "nadie".Si nos arruinaron dos veces, porque no una tercera? http://t.co/mg\xe2\x80\xa6',
                      'usuario': u'julio_tena',
                      'zona': u'Madrid'},
 450975400601321472: {'geo': None,
                      'localizacion': u'',
                      'lugar': None,
                      'tweet': 'RT @MariaCorinaYA: Ante la evidencia de la confabulaci\xc3\xb3n institucional contra la Soberan\xc3\xada Popular y el respeto a Constituci\xc3\xb3n,el pueblo se\xe2\x80\xa6',
                      'usuario': u'BertlemVL',
                      'zona': u'Caracas'},
 450975400794263552: {'geo': None,
                      'localizacion': u'\xdcT: 10.478302,-66.88222',
                      'lugar': None,
                      'tweet': 'RT @ruedaveloz: #350AsumeTuDerecho RT @pecimaria53 Merida izan labandera d Guerra a Muerte .1er estado en desobediencia Total + FOTO http:/\xe2\x80\xa6',
                      'usuario': u'Totiestaba',
                      'zona': u'Quito'}}

Note that tweeters rarely include all the bits of location information, which will make our job a little harder.
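
Once the tweets are in a dictionary of dictionaries, questions like this one can be put to the data directly. The sketch below uses an abbreviated copy of the run above, with the tweet texts left out, and asks which location fields are actually filled in for each tweet.

```python
# An abbreviated copy of the run above; the tweet texts are left out.
tweets = {
    450975400555212801: {'geo': None, 'lugar': None,
                         'localizacion': u'Castell\xf3n Comunidad Valenciana',
                         'usuario': u'julio_tena', 'zona': u'Madrid'},
    450975400601321472: {'geo': None, 'lugar': None,
                         'localizacion': u'',
                         'usuario': u'BertlemVL', 'zona': u'Caracas'},
    450975400794263552: {'geo': None, 'lugar': None,
                         'localizacion': u'\xdcT: 10.478302,-66.88222',
                         'usuario': u'Totiestaba', 'zona': u'Quito'},
}

# Which location fields are filled in, tweet by tweet?
location_keys = ['geo', 'lugar', 'localizacion', 'zona']
filled = {}
for tweet_id, info in tweets.items():
    filled[tweet_id] = [k for k in location_keys if info[k]]

# Tweets that carry explicit coordinates.
with_geo = [i for i in tweets if tweets[i]['geo'] is not None]
```

In this small sample none of the tweets has geo filled in, and one of them offers nothing beyond a time zone.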

12.7.3. Instance variables vs. class variables

The script does exactly what I want it to do, but it has a significant limitation – Spyder’s console has no access to the dictionary. Try this:

>>> len(output)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'output' is not defined

Moreover, if you open the Variable explorer in the top right window of Spyder, you will not see any entry for “output”. This means that we cannot perform any additional work with the dictionary, which sort of defeats the purpose of having created it.

In more general terms, the problem is that output does not exist as a name on its own. It is an attribute of one particular Flujo2Diccionario() object, the one created inside the call to Stream(), and the script never gives that object a name by which the console could reach it. Variables that belong to a single object in this way are called instance variables. Keeping data inside the object is tidy, but it frustrates our desire to make further use of a variable that by rights we should have access to.

There is a simple solution. The idea is to ‘raise’ the scope of output to that of the entire class Flujo2Diccionario() and then call Flujo2Diccionario() specifically to get a copy of it. This is accomplished in the next block:

# -*- coding: utf-8 -*-
# suggested name: tweepyFlujoDic2.py
import tweepy, pprint
from tweepy.api import API

# YOUR LOGON HERE (paste the lines you saved in tweepyLogon.py)

class Flujo2Diccionario(tweepy.StreamListener):
        output = {}
        def __init__(self, api=None):
                self.api = api or API()
                self.n = 0
                self.m = 1

        def on_status(self, status):
                self.output[status.id] = {
                        'tweet':status.text.encode('utf8'),
                        'usuario':status.user.screen_name,
                        'geo':status.geo,
                        'lugar':status.place,
                        'localizacion':status.user.location,
                        'zona':status.user.time_zone}
                self.n = self.n+1
                if self.n < self.m: return True
                else:
                        print 'tweets = '+str(self.n)
                        return False

flujo = tweepy.streaming.Stream(clave, Flujo2Diccionario())
flujo.filter(track=['de'], languages=['es'])
tweetdic = Flujo2Diccionario().output

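Only two lines are different. The initialization of output has been moved up to line 9 (output = {}, directly under the class statement), which gives it a scope that is global within Flujo2Diccionario(). Such a variable is known as a class variable: it is shared by every instance of the class and available to every function within it, as well as to code outside of it if called appropriately. Line 31 (tweetdic = Flujo2Diccionario().output) has been added to make the call. Because output belongs to the class rather than to any one instance, the fresh instance created on line 31 sees the same dictionary that the listener filled, and tweetdic puts it into your hands: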

>>> len(tweetdic)
1

The Variable explorer in the top right window of Spyder now has an entry for tweetdic:

_images/SpyderVarExplorer.png

You have finally achieved the goal of this section, to be able to stream status updates into a dictionary.
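
The difference between the two kinds of variable can be seen in isolation, without Twitter. The following is a minimal sketch; the class names SharedListener and PerInstanceListener are hypothetical and stand in for Flujo2Diccionario with output defined at class level and inside __init__, respectively.

```python
class SharedListener(object):
    output = {}  # class variable: one dictionary shared by every instance

    def on_status(self, key, value):
        self.output[key] = value  # changes the shared dictionary in place


class PerInstanceListener(object):
    def __init__(self):
        self.output = {}  # instance variable: a fresh dictionary per instance

    def on_status(self, key, value):
        self.output[key] = value


# Fill one instance of each class, as the stream would do.
SharedListener().on_status(1, 'hola')
PerInstanceListener().on_status(1, 'hola')

# Now ask a brand-new instance of each class for its dictionary.
shared_len = len(SharedListener().output)         # 1: same dictionary as before
separate_len = len(PerInstanceListener().output)  # 0: the new instance starts empty
```

Note that what comes back from SharedListener().output is the shared dictionary itself rather than a copy, so anything added to it later is visible through every reference to it.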

12.8. How to tell StreamListener() to send tweets to a database

Since a dictionary resides in Spyder’s memory, at some point it will grow too large for the data that it contains to be processed efficiently. At that point one must adopt a more robust framework for holding structured information, namely, a database.

There are two main types of databases, relational and non-relational.

A relational database holds its data in a table. The columns of the table represent keys and the rows represent values. In statistical terms, each row is like an experiment in which the columns define the samples and the value in each cell is an outcome. Yet it would be too simple to adopt an existing terminology. A new terminology has grown up around relational databases, in which the column is termed an attribute and the row a tuple. The columns and rows taken together define a relation, hence the name “relational database”. One of the most popular is the MySQL database, for which there is a Python module.

Now imagine that you have been collecting tweets in your relational database and suddenly realize that you need to add entities.hashtags. It is easy enough to add another column to the table for them, but none of the tweets (rows) collected before the moment of adding the new column will have any entry for it, and the tweets themselves are long gone, along with their hashtags. There will be a big hole in your table that you still must expend resources on. More generally, for data collection which changes quickly or which cannot be predicted well in advance, the relational format appears excessively rigid. Thus the invention of the non-relational database, which has the format of a tree, a graph or just a bunch of key:value pairs. One of the most popular is the Mongo database, for which there is a Python module.
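
The “hole in the table” can be made concrete with a small sketch. It makes two assumptions: it uses sqlite3, a relational database that ships with Python, purely as a stand-in for MySQL, and the table layout and the two rows are invented for illustration. The list of dictionaries at the end only mimics the key:value style of a non-relational store; it is not Mongo itself.

```python
import sqlite3

# Relational version: every row has the same columns.
conn = sqlite3.connect(':memory:')  # throwaway in-memory database
cur = conn.cursor()
cur.execute('CREATE TABLE tweets (id INTEGER PRIMARY KEY, usuario TEXT, tweet TEXT)')
cur.execute('INSERT INTO tweets VALUES (?, ?, ?)',
            (1, 'usuario_a', 'primer tweet'))

# Later we decide that we want hashtags as well.
cur.execute('ALTER TABLE tweets ADD COLUMN hashtags TEXT')
cur.execute('INSERT INTO tweets VALUES (?, ?, ?, ?)',
            (2, 'usuario_b', 'segundo tweet', '#ejemplo'))
conn.commit()

# The row collected before the change has nothing in the new column.
cur.execute('SELECT id, hashtags FROM tweets ORDER BY id')
rows = cur.fetchall()
conn.close()

# Key:value version: each record carries only the keys that it has.
documents = [
    {'id': 1, 'usuario': 'usuario_a', 'tweet': 'primer tweet'},
    {'id': 2, 'usuario': 'usuario_b', 'tweet': 'segundo tweet',
     'hashtags': ['#ejemplo']},
]
```

In the table, the first row comes back with None in the hashtags column, whereas the first dictionary simply lacks the key.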

12.8.1. How to get and set up MySQL

Getting a database onto your computer is fraught with problems.

12.8.1.1. How to get MySQL Community Server

Point your web browser at MySQL Community Server (http://dev.mysql.com/downloads/mysql/). In the Generally Available (GA) Releases tab, under Select Platform: choose your platform, Microsoft Windows or Mac OS X:

  • For Microsoft Windows, click on the MySQL Installer.
  • For Mac OS X, click on the 64-bit DMG Archive for your version of the OS, probably 10.7.

Oracle then forces you to sign up for an account, which is a huge pain. Then they send you an e-mail that has a link for you to click on to start the download. Once you get it, run the installer, and it should do everything for you. On the Mac, also click on the Preference pane to install it.

12.8.1.2. How to start MySQL Community Server

To start the MySQL server on the Mac, open Apple > System Preferences > MySQL and turn the server on.

The server must be running to use a database in Python.

12.8.1.3. How to get MySQL-python

The next step is to download and install the Python package for communicating with MySQL, called MySQL-python. Ideally, you can use pip:

  • On Windows, open the Command Prompt under Tools, or select Start -> Run and type cmd in the box. Then type in pip install -U MySQL-python.
  • On the Mac, open Terminal and type in pip install -U MySQL-python.

It takes a minute or so for the package to be downloaded and installed. Once it is, in Spyder’s console type import MySQLdb and see what happens.
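
If you would rather see a short message than a full traceback, the import can be wrapped in a try block. This is only a sketch of one way to run the check; the wording of the messages is made up.

```python
try:
    import MySQLdb
except ImportError as err:
    # Covers both a missing package and a package that cannot be loaded.
    MySQLdb = None
    print('MySQLdb could not be imported: %s' % err)
else:
    print('MySQLdb imported, version %s'
          % getattr(MySQLdb, '__version__', 'unknown'))
```

If the first message appears, the next two sections describe the failures that I ran into.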

12.8.1.4. Error with mysql_config

The installation process may fail if it cannot find “mysql_config”. The error message should indicate the file where the error was encountered. On my Mac, this was “setup_posix.py” in a temporary build folder, /Users/your_user_name/build/MySQL-python/setup_posix.py. You can add the path to “mysql_config” by opening “setup_posix.py” in Spyder and changing mysql_config.path = mysql_config to:

mysql_config.path = "/usr/local/mysql-5.6.17-osx10.7-x86_64/bin/mysql_config"

This is the path on a Mac, at least. Unfortunately, the error has broken the installation process and you have to continue it by hand. In Terminal, enter the following:

$ cd /Users/{your_user_name}/build/MySQL-python
$ python setup.py build
$ python setup.py install

12.8.1.5. Error “image not found” with _mysql.so

Once installed, MySQL-python can fail to be imported with an error like this:

>>> import MySQLdb
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/{your_user_name}/anaconda/lib/python2.7/site-packages/MySQLdb/__init__.py", line 19, in <module>
    import _mysql
ImportError: dlopen(/Users/{your_user_name}/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/MySQL_python-1.2.5-py2.7-macosx-10.6-x86_64.egg/_mysql.so, 2): Library not loaded: libmysqlclient.18.dylib
  Referenced from: /Users/{your_user_name}/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/MySQL_python-1.2.5-py2.7-macosx-10.6-x86_64.egg/_mysql.so
  Reason: image not found

The problem is that the file _mysql.so cannot find the file libmysqlclient.18.dylib. To double-check that this is the case, in the Terminal type the following command:

$ otool -L /Users/{your_user_name}/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/MySQL_python-1.2.5-py2.7-macosx-10.6-x86_64.egg/_mysql.so

This should give a response like this one:

/Users/{your_user_name}/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/MySQL_python-1.2.5-py2.7-macosx-10.6-x86_64.egg/_mysql.so:
        libmysqlclient.18.dylib (compatibility version 18.0.0, current version 18.0.0)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 159.1.0)

libmysqlclient.18.dylib is the file that caused the error, and there is no path to it. It is in the folder that was created to hold MySQL, which is normally hidden from you. To find it, type the following command in the Terminal:

$ defaults write com.apple.finder AppleShowAllFiles YES

Then press the option or alt key and click on the Finder icon to select Relaunch when it pops up. This should make all the hidden folders and files magically appear on your Macintosh. Open a Finder window and click on your hard drive icon (under DEVICES?) and follow this path: hard drive > usr > local > mysql-5.6.17-{your_OS} > lib > libmysqlclient.18.dylib. Now you need to incorporate this into a command to fix _mysql.so. The command takes the following form:

$ install_name_tool -change libmysqlclient.18.dylib {path_to_libmysqlclient.18.dylib} {path_to__mysql.so}

With the blanks filled in, it will look like this:

$ install_name_tool -change libmysqlclient.18.dylib /usr/local/mysql/lib/libmysqlclient.18.dylib /Users/{your_user_name}/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/MySQL_python-1.2.5-py2.7-macosx-10.6-x86_64.egg/_mysql.so

And now re-hide the hidden files with this command in Terminal:

$ defaults write com.apple.finder AppleShowAllFiles NO

and press the option or alt key and click on the Finder icon to select Relaunch when it pops up. This should make all the hidden folders and files magically disappear on your Macintosh.

12.8.1.6. How to create a new database

Turn on the MySQL server (Apple > System Preferences > MySQL) if it is not already on. In the Terminal, try to log on to the server with:

$ mysql -u root -p

If you get an error like mysql: command not found, use the full path:

$ /usr/local/mysql/bin/mysql -u root -p

root is the superuser, the one with full privileges. Its password is initially empty, so just hit return and you should enter MySQL with a response like:

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 21
Server version: 5.6.17 MySQL Community Server (GPL)

Copyright (c) 2000, 2014, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

You are now going to create a new database named “twitter” and a new user to go along with it, so you don’t have to log in as root. You only need to do this once. Enter the commands below at the mysql prompt, hitting return after each one:

mysql> CREATE DATABASE twitter;
Query OK, 1 row affected (0.04 sec)

mysql> CREATE USER '{your_user_name}'@'localhost' IDENTIFIED BY '{your_password}';
Query OK, 0 rows affected (0.10 sec)

mysql> USE twitter;
Database changed

mysql> GRANT ALL ON twitter.* to '{your_user_name}'@'localhost';
Query OK, 0 rows affected (0.03 sec)

mysql> quit;
Bye

To test that this worked, open Spyder and try to access the database with:

>>> import MySQLdb as mdb
>>> mdb.connect('localhost', '{your_user_name}', '{your_password}', 'twitter')
<_mysql.connection open to 'localhost' at 100bf2420>

The response means that the database is ready to go to work. It is probably safe to quit the Terminal now.
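
For a slightly more thorough check, you can ask the server a question and close the connection afterwards. The following is a minimal sketch: the function name check_twitter_db is hypothetical, and it assumes the database and user created above. MySQLdb is imported inside the function so that the definition can be loaded even where the package is missing.

```python
def check_twitter_db(user, password, host='localhost', db='twitter'):
    """Connect to the database, ask the server for its version, and disconnect."""
    import MySQLdb as mdb  # imported here so that defining the function never fails
    conn = mdb.connect(host, user, password, db)
    try:
        cur = conn.cursor()
        cur.execute('SELECT VERSION()')
        return cur.fetchone()[0]  # the version string reported by the server
    finally:
        conn.close()  # always release the connection
```

Calling check_twitter_db('{your_user_name}', '{your_password}') in Spyder’s console should return the same server version that the MySQL monitor reported when you logged in.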

12.8.1.7. How to get data into a database

12.8.1.8. How to get data out of a database

12.9. Summary

12.10. Further practice

12.11. Further reading

StackOverflow answers a question about Instance variables vs. class variables in Python

StackOverflow answers the question What is better in python, a dictionary or Mysql?

12.12. Appendix

Footnotes

[1]

Last edited: April 17, 2014
