11. How to get text from RESTful APIs

I know you know how to do this:

_images/googleDeepLearning.png

Fig. 11.1 Search Google for “deep learning”.

Author’s screen shot.

But look at all that juicy text! I would love to get it to play with, but I can’t get it out of my web browser. However, Google publishes a set of instructions called an application programming interface or API that Python can use to download the hits from a key-word search.

It is not just Google that has a publicly available API for its services; most popular sites have one, too. In this chapter, you will learn how to access them. You will start with those APIs for which Pattern supplies a uniform interface, and then move on to dealing with the APIs themselves.

Note

The script with the code for this chapter is nlpAPI.py, which you can download with codeDowner(), see Practice 1, question 2.

11.1. The six constraints of REST

Nowadays, much of the information exchanged over the Internet is in a style known as Representational State Transfer or REST. REST consists of six constraints on the ‘architecture’ of information transfer, originally set forth in 2000 by Roy Fielding in his doctoral dissertation.

While you need not understand the theory of REST to use it, you will find it easier going if you have a passing familiarity with the six constraints. In the explanation below I follow the exposition of What Is REST?, though I have reordered the constraints to facilitate exposition:

  1. Client-server
  2. Uniform interface
  3. Stateless
  4. Cacheable
  5. Layered
  6. Code on demand

There is a separation between servers, which store data, and clients, which access it. You are a client. A uniform interface regiments the communication between you and the servers that you access. The most frequent form of this interface is a request for resources stored on a server. The server does not fulfill your request with the actual resource (usually records in a database) but rather with some representation of it in HTML, XML, or JSON. Moreover, your request has to be complete enough to allow the server to fulfill it. In particular, it should not depend on what you (or the server, for that matter) have done previously. In other words, the request does not care about the current state of the client or server and so is said to be stateless. Nevertheless, a client may store or cache previous responses from the server to aid in processing. Likewise, on the server’s side, a client does not ordinarily know whether it is connected to the end server or not, so that intermediary servers can be ‘layered in’ to aid in processing and enforce security policies. Finally, a server can temporarily extend or customize the functionality of a client, though this is optional.

An architecture that complies with these constraints, or at least the first five, is said to be RESTful.
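You do not need any code to follow the rest of the chapter, but a minimal sketch of a single stateless request may make the constraints more concrete. It uses the requests library, which reappears later in the chapter, and a made-up placeholder URL rather than a real service:

>>> import requests
>>> response = requests.get('https://api.example.com/articles/42')  # one complete, self-contained request
>>> response.status_code   # 200 if the server could fulfill it
>>> response.json()        # a representation of the resource in JSON, not the resource itself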

11.2. How to retrieve text from Facebook

You may have already thought that the myriad posts and comments on Facebook would be a marvelous source of text. You are right, but as usual, the devil is in the details. Facebook separates its users into two distinct sorts: individuals like you or me, who have profiles with strict limits on what can be shared with others, and groups, which have public pages with essentially no limits on what can be shared. You will look at the former briefly, but concentrate on the latter.

11.2.1. How to get a user Access Token

Facebook has published an API for developers called GraphAPI. You cannot interact with it without validating yourself. Validation is a simple process, but it helps to start by logging into your Facebook profile. Then point your web browser at https://developers.facebook.com/tools/explorer/. Now for a bunch of clicking:

  1. Click on Get Token → Get user Access Token.
  2. Click as many of the boxes as you have information for in your profile.
  3. Then click on Get Access Token.

You should be looking at a window similar to this one:

_images/fbGraphAPIExToken.png

Fig. 11.2 Facebook’s GraphAPI Explorer with a user Access Token.

Author’s screen shot.

Let us now try to replicate this in Python. The first step is to copy your giant user Access Token from its line in the GraphAPI Explorer to a string assigned to a name in Spyder:

>>> myAccToken = 'paste_your_token_here'

11.2.2. How to access GraphAPI with Explorer and facebook-sdk

The next step is to turn to the Python wrapper that communicates with GraphAPI over the Internet. You can get it from pip:

$> pip install facebook-sdk

SDK stands for software development kit. Import it as facebook:

>>> import facebook

You are welcome to check its help page with help(facebook), but it is rather long. Get started by initializing it with your access token:

>>> graph = facebook.GraphAPI(access_token=myAccToken) # version='2.7'

I will walk you through the usage of the methods for downloading objects.

11.2.2.1. How to access your profile with Explorer and facebook-sdk

Now place your cursor on the + sign underneath Node: me and press it to reveal a long menu of options. Do not press Search for a field. From fields → birthday, release the cursor, and then continue with fields → gender, fields → id, fields → locale and finally fields → name. To submit these selections to GraphAPI Explorer, click on the blue Submit button. Your page should wind up looking something like this:

_images/fbGraphAPIExProfile.png

Fig. 11.3 A personal profile in Facebook’s GraphAPI Explorer with five fields selected.

Author’s screen shot.

Turn to Spyder and try to explain what these lines do:

>>> graph.get_object(id='me')
>>> graph.get_object(id='me', fields='id,name,birthday,gender,locale')

Line 1 returns my entire profile as a dictionary. Line 2 trims it down to the five fields that pattern.web returns. Did you notice the peculiar syntax of the fields argument? It is a string with each item separated by a comma, rather than a list or tuple of strings. Does that remind you of anything?
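Incidentally, if you keep the field names in a Python list, you can build the comma-separated string that GraphAPI expects with the string method join():

>>> fields = ['id', 'name', 'birthday', 'gender', 'locale']
>>> ','.join(fields)
'id,name,birthday,gender,locale'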

It will help to pretty print the result to have an easier-to-read representation that approximates that of GraphAPI Explorer:

>>> from pprint import pprint
>>> pprint(graph.get_object(id='me', fields='birthday,gender,id,locale,name'))

{u'birthday': u'11/01/1957',
 u'gender': u'male',
 u'id': u'2811272',
 u'locale': u'en_US',
 u'name': u'Harry Howard'}

You can retrieve other fields from your Facebook profile in a similar way, such as the one for education:

>>> graph.get_object(id='me', fields='education')
>>> pprint(graph.get_object(id='me', fields='education'))

You are welcome to check out any other fields in your profile that you may have filled out that I have not, but I want to take up Facebook connections.

Going back to Explorer, uncheck the five fields that you checked, click on the + sign again and scroll down past the heading connections to feed. The Explorer window should fill up with items like this:

_images/fbGraphAPIExFeed.png

Fig. 11.4 A personal profile in Facebook’s GraphAPI Explorer with “feed” selected.

Author’s screen shot.

There are a lot of posts in this window. Twenty-five, to be exact, which is so many that they obscure the overall structure. You can reduce them to some other number, say one, by pressing on the + sign that is indented to the right of the check box for feed, and selecting modifiers → limit and releasing. Set 10 → 1. Update Explorer with your selections by clicking on the blue Submit button. Your window should now hold a single feed item, with two enormous URLs beneath it, like this image:

_images/fbGraphAPIExFeed1.png

Fig. 11.5 A personal profile in Facebook’s GraphAPI Explorer with “feed” limited to 1.

Author’s screen shot.

Let’s think about this for a moment.

The entire item is encased in curly brackets, so Python would represent it as a dictionary. Following the level of indentation, it has two keys, feed and id. feed takes a dictionary as its value, which itself consists of two keys, data and paging. data takes a list for its value, which contains but a single element, a dictionary. Since you limited feed to 1, data would appear to be the key that holds the posts in your feed. This is what we are interested in. paging, on the other hand, takes a dictionary as its value, which has the two keys previous and next. next presumably tells you where to get the rest of the posts in your feed, so we will need it, too, in the upcoming code. To emphasize the structure I have just explained, I reproduce below the keys without the content of their values:

{
  "feed": {
    "data": [
    ],
    "paging": {
      "previous": ,
      "next":
    }
  },
  "id":
}

To repeat, data holds the items with text and next points to the next ones.
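In Python terms, once such a response is decoded into a dictionary – call it feedObject, a name made up just for this illustration – the two pieces of interest are reached by running down the keys:

>>> posts = feedObject['feed']['data']               # the list of posts
>>> nextPage = feedObject['feed']['paging']['next']  # the URL of the next batch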

Turning to facebook-sdk, to retrieve the posts in your feed, you can enter graph.get_connections(id='me', connection_name='feed'). But WAIT! That could choke Spyder! You should play it safe and add a limit:

>>> graph.get_connections(id='me',
...                       connection_name='feed',
...                       limit=1)

You still get a bunch of stuff. Assign it to a name and pretty print it to a depth of one and then two levels of indentation:

>>> postInfo = graph.get_connections(id='me',
...                                  connection_name='feed',
...                                  limit=1)
>>> pprint(postInfo, depth=1)
{u'data': [...], u'paging': {...}}
>>> pprint(postInfo, depth=2)

Pretty printing the connection object to a depth of one in line 4 returns the short, lovely line 5, which shows you that the object returned is … is … can you guess? Yes, it is a dictionary made up of two keys, data and paging. Pretty printing it to a depth of two reveals that paging in turn is a dictionary with the keys next and previous. So the layout of information in GraphAPI Explorer that I took pains to impress upon you now bears fruit: the text you want will be in data and the address for getting more text will be in next. Confirm this by selecting the first (and only) element of the data list and pretty print it to a depth of one:

>>> pprint(postInfo['data'][0], depth=1)

In my example, the text is the value of the key story; in yours, it could be the value of the key message, depending on the type of the post. We will make use of this observation in a moment.

Now it is time to gather enough posts to process. The algorithm is:

  1. Get the feed connection,
  2. loop over its data list to free each post,
  3. extract its text via its story or message keys.

Go ahead and do step one; we will experiment a bit for the other two:

>>> postInfo = graph.get_connections(id='me',
...                                  connection_name='feed',
...                                  limit=10)

I would like to collect all the text with a single list comprehension:

>>> postText = [p['story'] or p['message'] for p in postInfo['data']]

This raises a KeyError exception on the first post that has message but not story. The alternative is to deconstruct the list comprehension into a for loop with a try-except statement to check for KeyErrors:

>>> postText = []
>>> for p in postInfo['data']:
...     try:
...         # fall back to 'message' when the post has no 'story'
...         postText.append(p.get('story') or p['message'])
...     except KeyError:
...         continue
>>> pprint(postText)

That should fill postText with ten strings that look like the text of each post.

Let’s return to Explorer to look at a few more connections. Going down the column alphabetically, there are + → connections → friends, + → connections → likes and + → connections → posts, among others. Feel free to explore them. I am going to illustrate likes, limited to a single item, just to underscore the structure of the response:

_images/fbGraphAPIExLikes1.png

Fig. 11.6 A personal profile in Facebook’s GraphAPI Explorer with “likes” limited to 1.

Author’s screen shot.

Shorn of its values, the structure of this object is:

{
  "likes": {
    "data": [
    ],
    "paging": {
      "cursors": {
        "before": ,
        "after":
      },
      "next":
    }
  },
  "id":
}

As with feed, data holds the items of interest and next points to the next ones.

In facebook-sdk, you should be able to come up with this:

>>> likeInfo = graph.get_connections(id='me',
...                                  connection_name='likes',
...                                  limit=10)
>>> pprint(likeInfo, depth=2)

Finally, you may have noticed a connection for comments, but comments are attached to posts. To retrieve the comments on a post, deselect whatever you have selected, and do + → connections → posts; then, in the + that is indented underneath and to the right of posts, do + → connections → comments, so that a configuration like the following appears:

_images/fbGraphAPIExPostsComments.png

Fig. 11.7 A personal profile in Facebook’s GraphAPI Explorer with “comments” on “posts”.

Author’s screen shot.

Post 2811272_218115921634101 is the first one with comments, so I can paste that id directly into GraphAPI Explorer and then + → connections → comments. Yet I want to limit it to one, in order to clarify the overall organization. Once again, select modifiers → limit and release to set 10 → 1:

_images/fbGraphAPIExPost1Comments.png

Fig. 11.8 A personal profile in Facebook’s GraphAPI Explorer with “comments” on a single post.

Author’s screen shot.

I hope you are able to tease out this:

{
  "comments": {
    "data": [
    ],
    "paging": {
      "cursors": {
        "before": ,
        "after":
      },
      "next":
    }
  },
  "id":
}

This gives me the opportunity to repeat one more time that data holds the items with text and next points to the next ones.

This should give you enough background to plan how to extract the text of the comments from the posts in your feed. You could glom them all together in one giant sticky ball of text, for which this block is a first stab:

>>> postComments = []
>>> for p in postInfo['data']:
...     try:
...         postComments.append(p['comments']['data'])
...     except KeyError:
...         continue
>>> pprint(postComments)

But as pretty printing lays bare, the comment data is itself a list which includes all of the ancillary fields along with the text field of message. To extract just the message, there has to be a list comprehension over the comment data:

>>> postComments = []
>>> for p in postInfo['data']:
...     try:
...         postComments.append([c['message'] for c in p['comments']['data']])
...     except KeyError:
...         continue
>>> pprint(postComments)

This does the trick!

Well, yes, for my limited definition of the trick. Saving just the comments loses track of what they are comments on. Thus at the very least, each bundle of comments should be associated with its post’s id. How would you do that? think … think … maybe like this:

>>> postComments = {}
>>> for p in postInfo['data']:
...     try:
...         postComments.update({p['id']:[c['message'] for c in p['comments']['data']]})
...     except KeyError:
...         continue
>>> pprint(postComments)

This alternative initializes postComments as a dictionary and then updates it with a new dictionary formed by using the post’s id as key and a list of the text of all the comments as its value.
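As a quick check of the design, you can look up the comments on a single post by its id; the name somePostID below is just a stand-in for one of the keys that pretty printing displayed:

>>> somePostID = list(postComments.keys())[0]   # or paste in an id from the output above
>>> pprint(postComments[somePostID])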

11.2.2.2. How to access a public page with Explorer and facebook-sdk

You have learned a lot from your profile, but you already know what you have to say; let’s find out what someone else has to say. For this second part, I have chosen USA Today as an example, though any large media organization would do just as well. The first step is to figure out what USA Today’s Facebook name or id is. You can do this by navigating to its home page, scrolling down, down, down to the bottom of the page, and right-clicking on the Facebook icon to open in a new tab. This should open USA Today’s Facebook page, with the address www.facebook.com/usatoday. The string after “www.facebook.com/”, usatoday, is USA Today’s Facebook name. Plug it into the input line of Explorer and hit Submit to render a screen like this:

_images/fbGraphAPIUSAToday.png

Fig. 11.9 USAToday in Facebook’s GraphAPI Explorer.

Author’s screen shot.

If you rummage around in + → fields, you will see that they are different from those of a personal profile. For instance, there is a description:

_images/fbGraphAPIUSATodayDescripton.png

Fig. 11.10 USAToday’s description in Facebook’s GraphAPI Explorer.

Author’s screen shot.

I am sure that you can guess how to reproduce these two screens in facebook-sdk:

>>> graph.get_object(id='usatoday')
>>> graph.get_object(id='usatoday', fields='description')

Now let’s see whether you can pick out a single post in usatoday’s feed and then return the same item from facebook-sdk:

_images/fbGraphAPIUSATodayFeed1.png

Fig. 11.11 One post in USAToday’s feed in Facebook’s GraphAPI Explorer.

Author’s screen shot.
>>> postInfo = graph.get_connections(id='usatoday',
...                                  connection_name='feed',
...                                  limit=1)
>>> pprint(postInfo, depth=4)
{u'data': [{u'actions': [{...}, {...}],
            u'caption': u'reviewed.com',
            u'comments': {u'data': [...], u'paging': {...}},
            u'created_time': u'2016-11-11T20:00:00+0000',
            u'description': u'Start off National Sundae Day right with these boozy treats.',
            u'from': {u'category': u'Media/News/Publishing',
                      u'id': u'13652355666',
                      u'name': u'USA TODAY'},
            u'icon': u'https://www.facebook.com/images/icons/post.gif',
            u'id': u'13652355666_10154037012080667',
            u'is_expired': False,
            u'is_hidden': False,
            u'likes': {u'data': [...], u'paging': {...}},
            u'link': u'http://www.reviewed.com/home-outdoors/news/bourbon-caramel-sundaes-exist-and-heres-how-you-can-make-it',
            u'message': u"Is it 5 o'clock yet?!",
            u'name': u'This boozy sundae is just what the internet needs right now',
            u'picture': u'https://external.xx.fbcdn.net/safe_image.php?d=AQAYf4mtcgPcCQcE&w=130&h=130&url=https%3A%2F%2Freviewed-production.s3.amazonaws.com%2F1478877108000%2FIG.jpg&cfs=1',
            u'privacy': {u'allow': u'',
                         u'deny': u'',
                         u'description': u'',
                         u'friends': u'',
                         u'value': u''},
            u'shares': {u'count': 1},
            u'status_type': u'shared_story',
            u'type': u'link',
            u'updated_time': u'2016-11-11T20:08:50+0000'}],
 u'paging': {u'next': …,
             u'previous': …}}

Pretty-printing postInfo to a depth of four manifests all of the information that is folded into a post, with more detail than is revealed in my own posts or by Explorer. The top level of data and paging is clear as the least indented. All of the fields of the data dictionary are indented to the next level. Of particular note is the comments field, which holds the comments. created_time and link could also be helpful.

It is a relief that the loop developed above cuts through all of this. I repeat it here:

>>> postComments = {}
>>> for p in postInfo['data']:
...     try:
...         postComments.update({p['id']:[c['message'] for c in p['comments']['data']]})
...     except KeyError:
...         continue
>>> pprint(postComments)

There is one final task, namely to get more than a single page of 25 responses. facebook-sdk does not have any code for paginating through additional responses, so we are on our own. Fortunately, we suspect that it has to do with the next field, but this field will have to be used ‘by hand’ with requests. Go ahead and import it:

>>> import requests
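If you have not used requests before, the pattern that the upcoming code relies on is simply to GET a URL and decode the JSON in the response. The URL below is a placeholder standing in for one of the next URLs that GraphAPI returns:

>>> url = 'https://graph.example.com/next-page'   # placeholder; a real 'next' URL goes here
>>> nextBatch = requests.get(url).json()          # decode the JSON body into a dictionary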

As a first step, make a name for the number of posts that are downloaded from a single call to GraphAPI and plug it into the limit argument of graph.get_connections():

>>> postsPerDownload = 1
>>> postInfo = graph.get_connections(id='usatoday',
...                                  connection_name='feed',
...                                  limit=postsPerDownload)

Having this number available will simplify several decisions that the code must make.

Now you need a second name, say downloads, for the number of times to download. I am going to initialize it with a distinctive number, like 11, to preclude making a mistake without realizing it. In addition, an empty list posts needs to be set up to hold the posts downloaded by the code. After all that, a while loop can iterate until it reaches the number of downloaded posts set by the two names. Within this loop, the algorithm is to extract the posts from postInfo, add them to the list of posts, and then get the next batch of posts from the next key within the paging key:

Listing 11.1 Download from GraphAPI a number of posts set by the user.
>>> downloads = 11
>>> posts = []
>>> while len(posts) < postsPerDownload*downloads:
...     try:
...         posts.extend([p for p in postInfo['data']])
...         postInfo = requests.get(postInfo['paging']['next']).json()
...     except KeyError:
...         continue
>>> pprint([p['message'] for p in posts])

The listing introduces the while loop, which you have not seen yet, but which works the way you imagine it to. It repeats the block of code indented underneath it until the numerical condition fails, at which point it halts and the final line is executed, that of pretty-printing a list of the resultant messages. The call to requests.get() shows how requests is used. You don’t need to understand it all, just how the URL for the next batch of posts is extracted by running down the keys in postInfo.
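If the while loop is new to you, a minimal standalone example may help; the names here are made up just for the illustration:

>>> n = 0
>>> while n < 3:
...     n += 1        # the loop body repeats as long as the condition holds
...     print n
1
2
3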

What I really want are the comments for each post, as I have reiterated tirelessly. To get them, I will save them into a dictionary as a list of values to a key consisting of the post’s id. The entire code is gathered together here, even though the variables are repeated from the previous block:

>>> postsPerDownload = 1
>>> postInfo = graph.get_connections(id='usatoday',
...                                  connection_name='feed',
...                                  limit=postsPerDownload)
>>> downloads = 11
>>> postComments = {}
>>> while len(postComments) < postsPerDownload*downloads:
...     try:
...         for p in postInfo['data']:
...             postComments.update({p['id']:[c['message'] for c in p['comments']['data']]})
...         postInfo = requests.get(postInfo['paging']['next']).json()
...     except KeyError:
...         continue
>>> pprint(['{}, {}'.format(k, len(v)) for (k, v) in postComments.items()])

The drawback of this approach is the default limit on the number of comments per post, twenty-five.

To overcome this limit requires a conceptual innovation, that of a loop that never ends. In Listing 11.2 below, it is achieved by an inner while True: loop containing the code for downloading comments. True is a constant in Python that means, well, “true”. Thus the while loop never stops of its own accord, because its condition is permanently true. Yet any loop can be interrupted by the break statement. So what condition would you want to be fulfilled to stop processing comments? Running out of comments, which should happen when the next key disappears from the paging value. If the code expects to find a key but does not, it throws a KeyError, which an except statement can pick up and use to execute the break statement. This is what the inner except clause in the listing does.
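Before seeing the pattern in context, here is a toy example of while True: interrupted by break when a key goes missing; the dictionary d is made up just for the illustration:

>>> d = {'next': 'more comments here'}
>>> while True:
...     try:
...         print d['next']      # raises KeyError once 'next' is gone
...         del d['next']
...     except KeyError:
...         break
more comments here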

The only other innovation is to download the comments themselves. I create a new name nextCommentBundle to hold the download, extract the comments in a list comprehension over the data key, save them as individual items to the current post id in the postComments dictionary, and then try to download the next batch of comments with requests, relying on finding the next key in the value of paging. This is what the body of the inner while loop does, and it is set up by the first call to requests.get() inside the for loop. Here is the whole shebang, including the context:

Listing 11.2 Download from GraphAPI all the comments to 3 posts.
>>> postsPerDownload = 1
>>> postInfo = graph.get_connections(id='usatoday',
...                                  connection_name='feed',
...                                  limit=postsPerDownload)
>>> downloads = 3
>>> postComments = {}
>>> while len(postComments) < postsPerDownload*downloads:
...     try:
...         for p in postInfo['data']:
...             postComments.update({p['id']:[c['message'] for c in p['comments']['data']]})
...             nextCommentBundle = requests.get(p['comments']['paging']['next']).json()
...             while True:
...                 try:
...                     nextComments = [c['message'] for c in nextCommentBundle['data']]
...                     postComments[p['id']].extend(nextComments)
...                     nextCommentBundle = requests.get(nextCommentBundle['paging']['next']).json()
...                 except KeyError:
...                     break
...         postInfo = requests.get(postInfo['paging']['next']).json()
...     except KeyError:
...         continue
>>> pprint(['{}, {}'.format(k, len(v)) for (k, v) in postComments.items()])

I assume that you realized that the number of downloads is reduced to three (downloads = 3), just in case you hit a post with hundreds if not thousands of comments.

Having gone through all of this, you may want to have a look at the fruits of your labor. A quick and easy way to do so is through Spyder’s Variable explorer. In the image below, I clicked on postComments, which opens the small window in the middle that holds the three posts. I clicked on its top value, with size 358, which opens up the window on the right, with values in pink:

_images/fbGraphAPIUSATodayCommentsVarExplor.png

Fig. 11.12 Selection of comments displayed through Spyder’s Variable explorer.

Author’s screen capture.

As you can see, there is a lot of stuff that we are not interested in. In fact, most of it is comment spam that does not address the post at all. I don’t know how representative this is of USAToday comments, but it may motivate you to use a different site.

11.2.3. How to access Facebook’s API with pattern.web

Now start up Pattern:

>>> from pattern.web import Facebook, SEARCH, NEWS, COMMENTS, LIKES, FRIENDS
>>> fb = Facebook(license=myAccToken)

11.2.3.1. How to access your profile with pattern.web

>>> me = fb.profile()
>>> me
(u'2811272', u'Harry Howard', u'11/01/1957', u'm', u'en_US', 0)

As you can see, me is a sextuple, the first member of which is my user ID. The rest are my name, birthday, gender, locale, and number of likes (likes are only tallied for pages, not personal profiles, so mine is 0). However, this is the only information that pattern.web retrieves from the fields in your profile, so you will have to use some other means if you want more.
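Since me is just a Python tuple, you can index it or unpack it into separate names; the names on the left are my own choice:

>>> myID, myName, myBirthday, myGender, myLocale, myLikes = me
>>> myID
u'2811272'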

As for your connections, a pattern.web search of type NEWS returns “posts for a given author” according to the documentation, but what it most closely resembles in GraphAPI Explorer is the connection feed:

>>> fb.search(me[0], type=NEWS, count=5)
>>> [post.text for post in fb.search(me[0], type=NEWS, count=5)]

As usual, a pattern.web search as in line 1 returns a Result object, which can be iterated over to extract what you are interested in, as in line 2.

Besides text, there are six other parameters for a Result object. The next block of code prints them out for the first post in my feed:

>>> first = fb.search(me[0], type=NEWS, count=1)
>>> p = first[0]
>>> postFields = ['id', 'url', 'text', 'date', 'likes', 'comments',
...                       'author']
>>> fieldValues = [p.id, p.url, p.text, p.date, p.likes, p.comments,
...                        p.author]
>>> for pair in zip(postFields, fieldValues):
...     print '{}\t{}'.format(pair[0], pair[1])
...

id                      2811272_10101192505121999
url
text            John Damanti and 20 others wrote on your Timeline.
date            2016-11-02T14:19:34+0000
likes           0
comments        0
author          (u'10204359945191203', u'John Damanti')

This post has no URL, nor likes nor comments, but it is good to know that you could get them if there were any.

COMMENTS returns “comments for the given post id”, which should access GraphAPI Explorer’s connection comments. From the directions to Fig. 11.8, I chose this one:

>>> fb.search("2811272_218115921634101", type=COMMENTS, count=5)
>>> [comment.text for comment in fb.search("2811272_218115921634101",
...                                                     type=COMMENTS,
...                                                     count=5)]

Given that comments depend on posts, it would be efficient to retrieve them all at once. This is the genius of the comments parameter introduced above. It tells you whether a post has comments or not, so all you have to do is loop over posts, and for each post with a non-zero comments count retrieve them with a search of type COMMENTS. Here is one way to do it, by storing the post id as a key in a dictionary and a list of the post’s comments as its value:

>>> commentDict = {}
>>> for post in fb.search(me[0], type=NEWS, count=50):
...     if post.comments > 0:
...         comments = [comment.text for comment in fb.search(post.id, type=COMMENTS, count=20)]
...         commentDict.update({post.id:comments})

This comes close to reproducing the effect illustrated in Fig. 11.7 for GraphAPI Explorer.

LIKES returns “authors for the given author, post or comments”, which accesses GraphAPI Explorer’s connection likes:

>>> fb.search(me[0], type=LIKES, count=5)
>>> [like.text for like in fb.search(me[0], type=LIKES, count=5)]

Since likes can be attached to a comment, the previous code can be recycled to retrieve all the likes to my posts:

>>> likeDict = {}
>>> for post in fb.search(me[0], type=NEWS, count=50):
...     if post.comments > 0:
...         likes = [like.author for like in fb.search(post.id, type=LIKES, count=20)]
...         likeDict.update({post.id:likes})

Likes don’t have any text, so I saved each author.

FRIENDS returns “authors for the given author, post or comments”, which should access GraphAPI Explorer’s connection friends. However, I get the empty list:

>>> fb.search(me[0], type=FRIENDS, count=5)

SEARCH returns “public posts + author”, which should access GraphAPI Explorer’s connection posts, but it fails with an authentication error:

>>>  fb.search(me[0], type=SEARCH, count=5)

To recap, pattern.web affords limited access to your profile and easy access to your most important connections.

11.2.3.2. How to access a public page with pattern.web

>>> usaToday = fb.profile(id='usatoday')
>>> usaToday
(u'13652355666', u'USA TODAY', u'09/15/1982', u'', u'', 6892370)

This is the same sextuple as was returned from my profile.

Its first member, USA Today’s id, can be used in a pattern.web search method to retrieve the same three sorts of information (NEWS, COMMENTS and LIKES) that were retrieved from my profile.
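For instance, a search of type NEWS pulls down the text of USA Today’s most recent posts, just as it did for my feed; usaToday[0] is the page’s id from the sextuple above:

>>> usaTodayPosts = [post.text for post in fb.search(usaToday[0], type=NEWS, count=10)]
>>> pprint(usaTodayPosts[:3])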

11.3. How to get text from Twitter

Twitter is a Mississippi of tweets sloshing back and forth all over the world. We want to stick a bucket in that torrent to see what is on people’s minds. But what commands would we use? All a Twitter account lets us do is send and receive tweets, not look at the whole global flow.

Well, if Twitter’s developers did not tell us what commands to use, we would be stuck. (Netflix, for one, used to tell us what commands to use, but no longer does.) Fortunately, Twitter welcomes outsiders into its code, by means of its API.

11.3.1. How to get authorization for Twitter’s API

The first thing to do is to sign up for a Twitter account at twitter.com, if you don’t already have one. Then point your browser at Twitter Apps and log in with your (new) account credentials. At the top right corner, click on the Create New App button. In the form that opens up, give your new app any name you want, describe it as “NLP with Twitter”, use Tulane’s URL “http://www.tulane.edu/” as the website, click the button to agree with the Developer Agreement, and click on Create your Twitter application.

On the next page, select API Keys from the menu. On the Application settings page, for the time being, you can keep the access level at Read-only. If you want to change it to Read, Write and Access direct messages by means of the modify app permissions link, you will have to give Twitter your cell phone number. Scroll down the page and click on create my access token. You will get a confirmation message at the top of the page. You may want to click the reset link to hurry the process along.

There are now four crucial pieces of information that you will need to make note of: API key, API secret, Access token and Access token secret. Since these are long and unwieldy strings, you should copy and paste them into some handy place immediately. In fact, since you are going to use them in several scripts, go ahead and open a new Spyder file, paste these lines into it, and fill in your information:

>>> consumerKey = 'your_info_here'
>>> consumerSecret = 'your_info_here'
>>> accessToken = 'your_info_here'
>>> tokenSecret = 'your_info_here'

Twitter’s API reveals two ways of interacting with it, a RESTful request and response protocol, and a streaming protocol.

11.3.2. How to get text from Twitter with pattern.web

Go ahead and get started with pattern.web's Twitter module:

>>> from pattern.web import Twitter, hashtags, retweets, author
>>> twLicense = (consumerKey, consumerSecret, (accessToken, tokenSecret))
>>> tw = Twitter(license=twLicense)

11.3.2.1. How to access Twitter’s RESTful API

Twitter’s API, like Facebook’s, does not send you the tweets that it finds all at once, but rather in dribs and drabs. I am going to continue to refer to them as “pages”. Thus there are two variables to be set, the number of tweets per page and the number of pages. Let us start out small:

>>> tweetsPerPage = 5
>>> pages = 3
>>> searchTerm = 'win'

You also need a search term, preferably a word that is relatively frequent.

Initialize an empty list to hold the tweets that you are going to collect, loop through the pages, and within each page, loop through the search() results. From each tweet, extract its text and append it to the list. Pretty-print what you got:

>>> tweets = []
>>> for page in range(1, pages):
...     for tweet in tw.search(searchTerm, start=page, count=tweetsPerPage):
...         tweets.append(tweet.text)
>>> pprint(tweets)

This is a good start, and if all you want is some tweets, it is enough. However, it throws away all the other information in a tweet. Just in case you might want that, collect each tweet along with its identification number into a dictionary:

>>> tweetID = {}
>>> for page in range(1, pages):
...     for tweet in tw.search(searchTerm, start=page, count=tweetsPerPage):
...         tweetID.update({tweet.id:tweet.text})
>>> pprint(tweetID)

pattern.web throws in a couple more goodies, namely methods for getting an author’s tweets and for getting Twitter trends:

>>> authorTweets = tw.search(author('llStrafeJump'))
>>> pprint(authorTweets)

>>> twTrends = tw.trends(cached=False)
>>> pprint(twTrends[:10])

11.3.2.2. What’s in a tweet

So far, we have assumed that a tweet is a string of 140 characters, but this is really just the tip of the iceberg. A Twitter status update contains an enormous amount of additional information. Below is a sample:

>>> json2screenpretty(1,['of,the,a'])
{u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Sun Nov 09 22:21:06 +0000 2014',
 u'entities': {u'hashtags': [],
               u'symbols': [],
               u'trends': [],
               u'urls': [],
               u'user_mentions': []},
 u'favorite_count': 0,
 u'favorited': False,
 u'filter_level': u'medium',
 u'geo': None,
 u'id': 531572213959237632,
 u'id_str': u'531572213959237632',
 u'in_reply_to_screen_name': None,
 u'in_reply_to_status_id': None,
 u'in_reply_to_status_id_str': None,
 u'in_reply_to_user_id': None,
 u'in_reply_to_user_id_str': None,
 u'lang': u'en',
 u'place': None,
 u'possibly_sensitive': False,
 u'retweet_count': 0,
 u'retweeted': False,
 u'source': u'<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 u'text': u'She said she believe in the glo \U0001f60f',
 u'timestamp_ms': u'1415571666605',
 u'truncated': False,
 u'user': {u'contributors_enabled': False,
           u'created_at': u'Wed Jul 04 04:36:33 +0000 2012',
           u'default_profile': False,
           u'default_profile_image': False,
           u'description': u'Side hoes are smarter than you',
           u'favourites_count': 1074,
           u'follow_request_sent': None,
           u'followers_count': 4094,
           u'following': None,
           u'friends_count': 2748,
           u'geo_enabled': False,
           u'id': 626238009,
           u'id_str': u'626238009',
           u'is_translator': False,
           u'lang': u'en',
           u'listed_count': 15,
           u'location': u'Houstatlantavegas',
           u'name': u'6 Gramz\u2600\ufe0f',
           u'notifications': None,
           u'profile_background_color': u'FF6699',
           u'profile_background_image_url': u'http://pbs.twimg.com/profile_background__png/378800000101732299/1772b43d3dfdb4f02fc3addbcd0d5e7a.jpeg',
           u'profile_background_image_url_https': u'https://pbs.twimg.com/profile_background__png/378800000101732299/1772b43d3dfdb4f02fc3addbcd0d5e7a.jpeg',
           u'profile_background_tile': True,
           u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/626238009/1413348528',
           u'profile_image_url': u'http://pbs.twimg.com/profile__png/531549676239986688/rJy1slia_normal.jpeg',
           u'profile_image_url_https': u'https://pbs.twimg.com/profile__png/531549676239986688/rJy1slia_normal.jpeg',
           u'profile_link_color': u'B40B43',
           u'profile_sidebar_border_color': u'FFFFFF',
           u'profile_sidebar_fill_color': u'DDEEF6',
           u'profile_text_color': u'333333',
           u'profile_use_background_image': True,
           u'protected': False,
           u'screen_name': u'Camnesssss',
           u'statuses_count': 29619,
           u'time_zone': u'Eastern Time (US & Canada)',
           u'url': None,
           u'utc_offset': -18000,
           u'verified': False}}
tweets = 1

It wasn’t too hard to pick out Camnesssss, was it?

pattern.web pulls out a few of these keys, which is illustrated by stocking a dictionary with them:

>>> tweetID = {}
>>> for page in range(1, 2):
...     for tweet in tw.search(searchTerm, start=page, count=1):
...         tweetID.update({'id':tweet.id,
...                         'text':tweet.text,
...                         'url':tweet.url,
...                         'date':tweet.date,
...                         'author':tweet.author,
...                         'profile':tweet.profile,
...                         'lang':tweet.language,
...                         'hashtags':hashtags(tweet.text),
...                         'retweets':retweets(tweet.text)})
...
>>> pprint(tweetID)

This omits some interesting metadata, such as the location information in geo and place, the time_zone and location fields under user, and the time of the tweet in created_at.

11.3.2.3. How to access Twitter’s streaming API with pattern.web

Twitter’s streaming protocol is often called a fire hose, because it just blasts every tweet in the world at you. pattern.web uses a slightly different syntax to dip a bucket into Twitter’s ‘fire hose’:

>>> twStream = tw.stream('the')
>>> updateLimit = 50
>>> tweets = []
>>> for updates in range(updateLimit):
...     for tweet in twStream.update():
...         tweets.append(tweet.text)
>>> twStream.clear()

The next block saves a tweet and its id to a dictionary:

>>> tweetID = {}
>>> for updates in range(updateLimit):
...     for tweet in twStream.update():
...         tweetID.update({tweet.id:tweet.text})
>>> twStream.clear()

11.3.2.4. How to access Twitter’s streaming API with tweepy

tweepy is a third-party Python wrapper for Twitter’s API. Install it with pip:

$> pip install tweepy

Tweepy has a class called StreamListener() that listens to the fire hose – sorry for the mixed metaphor! – and picks out just the tweets that you want. StreamListener() was designed to be very flexible, so we can adapt it to our purposes, but this adaptability comes at a price – StreamListener() can’t do anything right out of the box!

There are two crucial things that StreamListener() cannot do: it cannot send the tweets it selects somewhere for you to see them, like to your computer screen, a text file, or a database, and it cannot stop listening for tweets when you have enough. But it is easy enough to modify it to perform these tasks, and in fact that is what the developer envisioned for the tweepy user to do.

11.3.2.5. Classes in Python

To effect this modification, you are going to create a new class based on StreamListener() that fills in the missing functionality. A “class” in Python is a group of functions that work together. The StreamListener() class has two methods that you are going to modify. The first is the initialization method __init__(), which stores information when an instance of the class is created. The second is the on_status() method. In Twitter, a tweet is known technically as a status update. on_status() tells StreamListener() what to do when it receives a new status update. This is where the most important processing takes place. The code below lays out a skeleton for the new class, which you will add to as you go along:

import tweepy

class NewStreamListener(tweepy.StreamListener):
        def __init__(self):
                self.api = tweepy.API()

        def on_status(self, status):
                pass    # the processing of each status update will go here

NewStreamListener() takes tweepy.StreamListener as its parent class and thereby inherits everything that is in StreamListener(). We have to be careful to only ‘fill in the blanks’ – that is, to only add material that is not in StreamListener() and not contradict anything that is already there, which would break it. For instance, the code having to do with api in __init__() is part of the endowment of StreamListener(). We gingerly tiptoe around it, leaving it untouched.

11.3.2.6. How to tell StreamListener() to stop listening for tweets

For the sake of discussion, imagine that you only want twenty tweets. How would you tell StreamListener() to turn itself off when it hits twenty? The standard computational solution is to create a counter that increments itself by one every time that StreamListener() finds a relevant tweet, up to twenty. In pseudo-code, the algorithm looks like this:

upon initializing StreamListener():
    set the counter to 0
    set the maximum to 20
when a relevant tweet is found:
    if the counter is less than 20, add 1 to it and keep going
    otherwise, exit StreamListener()

Seeing these steps implemented in their tweepy context may help you to understand them better, so we add them to the previous snippet of code:

class StopStreamListener(tweepy.StreamListener):
        def __init__(self):
                self.api = tweepy.API()
                self.n = 0
                self.m = 20

        def on_status(self, status):
                self.n += 1
                if self.n < self.m:
                        return True
                else:
                        return False

On line 4 of __init__(), self.n is the counter, which is initialized to 0. On line 5, self.m is the maximum number of tweets wanted, which is initialized to 20. In on_status(), line 8 increments the counter and the if statement checks to see whether the new count is less than the maximum. If it is, on_status() returns True, which tells StopStreamListener() to keep listening for more tweets. Otherwise, on_status() returns False, which tells StopStreamListener() to stop listening and ultimately turn tweepy off.

The other bit of functionality that you have to fill in is to send the tweets somewhere. The easiest place to start with is your computer screen, so that you can see how cool Twitter is. The block below adds the requisite lines to the previous block:

class Stream2Screen(tweepy.StreamListener):
    def __init__(self):
        self.api = tweepy.API()
        self.n = 0
        self.m = 20

    def on_status(self, status):
        print status.text.encode('utf8')
        self.n += 1
        if self.n < self.m:
            return True
        else:
            print 'tweets = {}'.format(self.n)
            return False

In on_status(), line 8 now prints the text of a status update to the screen, encoding it in UTF-8 just to be on the safe side. The text of a status update is what most people would consider to be a tweet. The else condition has been expanded to print a sign-off message, which is a string with the number of tweets found. It should be the same as the tweet maximum, which is a great way of checking whether the code has worked right.

Now that you have added the functionality missing from StreamListener(), you have to call it. That only takes a few lines of code: build an authentication handler from your four credentials and hand it, along with your listener, to a Stream:

auth = tweepy.OAuthHandler(consumerKey, consumerSecret)
auth.set_access_token(accessToken, tokenSecret)
stream = tweepy.streaming.Stream(auth, Stream2Screen())
stream.filter(track=['of,the,a'], languages=['en'])

To make this more flexible, I have wrapped it in a function, stream2screen(), in the script tweepies.py, and given it two arguments, num and terms: num is the number of tweets to collect, and terms is the list of strings to listen for. The function is called like so:

>>> from tweepies import stream2screen
>>> stream2screen(20, ['of,the,a'])
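I leave tweepies.py itself for you to download, but a minimal sketch of such a wrapper could look like the code below. It is only an approximation, not necessarily what tweepies.py actually contains: the listener class Stream2ScreenListener is a name made up here, and the four credential names are assumed to be defined at the top of the same file.

import tweepy

class Stream2ScreenListener(tweepy.StreamListener):
    def __init__(self, num):
        self.api = tweepy.API()
        self.n = 0
        self.m = num                        # maximum number of tweets to collect

    def on_status(self, status):
        print status.text.encode('utf8')
        self.n += 1
        return self.n < self.m              # returning False stops the stream

def stream2screen(num, terms):
    # build the authentication handler from the four credentials defined earlier
    auth = tweepy.OAuthHandler(consumerKey, consumerSecret)
    auth.set_access_token(accessToken, tokenSecret)
    stream = tweepy.streaming.Stream(auth, Stream2ScreenListener(num))
    stream.filter(track=terms, languages=['en'])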

11.4. How to search with Pattern’s search engines

pattern.web reveals an interface to twelve sites, Google, Yahoo, Bing, DuckDuckGo, Twitter, Facebook, Wikipedia, Wiktionary, Wikia, DBPedia, Flickr and Newsfeed, to which you can submit a key-word query.

Let’s get started with Google.

11.4.1. How to search with pattern.web.Google

You must import it by name, and while you are at it, also get the SEARCH and plaintext utilities:

>>> from pattern.web import Google, SEARCH, plaintext

Google allows you 100 free queries a day, and then you have to pay $1 for every 200 more. I think that we can squeeze by under the free limit, though we have to keep it in mind. To submit a query, the first step is to construct an instance of the Google search engine like so:

>>> google = Google(license=None, throttle=0.5, language=None)

There are three arguments, a license, which you don’t have (yet), a half-second pause between search requests called the throttle, and a language setting, which defaults to English if set to None. For an overview of the languages accepted, check out Google Custom Search’s Language Values page.

A query is formulated as a parameter to a search-engine instance:

>>> result = google.search('deep learning', type=SEARCH, start=1, count=10)

search() takes an obligatory string as the search term, and several optional settings which have defaults. The Google search engine admits but a single type, that of SEARCH, so it is really the default and can be omitted. start is a counter that enables search() to be used in a loop. count sets the number of hits to be returned from a single call to search(); the maximum is ten, which is also the default, so it too can be omitted. Thus the previous line can be stripped down to the bare bones of:

>>> result = google.search('deep learning')

Yes, yes, type it into the Console. result is returned as a Result object. In the Variable Explorer you can see that it is of size 10, which is confirmed by len(result). When I executed this search, my first hit was from Wikipedia. This was my second hit, which I indented by hand:

>>> result[1]
Result({u'url': u'http://deeplearning.net/',
                u'text': u'... <b>Deep Learning</b> is a new area of Machine Learning research, which has been <br>\nintroduced with the objective of moving Machine Learning closer&nbsp;...',
                u'date': u'Dec 1, 2015',
                u'language': u'',
                u'title': u'Deep Learning'})

That is, each hit is a dictionary with five keys: url, text, date, language, and title. There is also a sixth key, author, for news items and images, which does not appear in this hit. Both keys and values are Unicode strings. To extract any key, you can just create a list comprehension for it:

>>> titles = [hit['title'] for hit in result]

You can type titles in the Console to view it, but a neat trick for a nicer display is to have pprint pretty print it:

>>> from pprint import pprint
>>> pprint(titles)

pprint() will try to display lists and such in a more readable layout.

What we really want is the text, though, yet as you can see, it is full of strange symbols. This is because the ‘text’ is actually hypertext mark-up language or HTML, which is the formatting of web pages. To get usable text, the HTML has to be scrubbed out of it. Pattern makes this child’s play through plaintext:

>>> googleTexts = [plaintext(hit.text) for hit in result]

You can pretty print the output to confirm that most of the non-text stuff has been filtered out.

By now, I hope that I have whetted your appetite for MORE. But the count setting limits you to ten hits a call. The workaround is to use the start setting as a counter in a loop to get MORE:

>>> googleText = []
>>> for page in range(1, 3):
...     result = google.search('deep learning', start=page)
...     googleText.extend([plaintext(hit.text) for hit in result])

The first line initializes a list to hold the product of the rest of the code. The for loop runs through a list of “pages” – how many? – during which one search is executed, cleaned up, and stuck on to the end of googleText as individual items. Pretty print a few to see that it worked.

Ah, but there is a catch. Google limits the number of hits to a thousand per query. So with a count of ten, how many pages can you loop through before Google cuts you off?

11.4.2. How to perform pattern.web searches on Bing, Yahoo and DuckDuckGo

pattern.web lends its interface to three other search engines, Bing, Yahoo and DuckDuckGo. In theory, you can submit the following:

>>> from pattern.web import Bing, Yahoo, DuckDuckGo
>>> bing = Bing(license=None, throttle=0.5, language=None)
>>> bingResult = bing.search('deep learning', type=SEARCH)

>>> yahoo = Yahoo(license=None, throttle=0.5, language=None)
>>> yahooResult = yahoo.search('deep learning', type=SEARCH)

>>> ddg = DuckDuckGo(license=None, throttle=0.5, language=None)
>>> ddgResult = ddg.search('deep learning', type=SEARCH)

In practice, the Bing search fails (even with a license key) due to a URLError; the Yahoo search asks for authentication credentials, and the DuckDuckGo search returns 6 hits.

Yahoo’s Boss search API was discontinued on March 31, 2016, and its replacement, Yahoo Partners Ads, does not do search in the sense that we are interested in, see BOSS Search API.

11.4.3. How to perform pattern.web searches on Wikipedia, Wiktionary and Wikia

A search of Wikipedia follows the same form, though there is really nothing to set for the search engine itself:

>>> from pattern.web import Wikipedia, Wikia, Wiktionary
>>> wikipedia = Wikipedia()
>>> wikiArticle = wikipedia.search('deep learning')

This returns a WikipediaArticle object. It has several parameters, for which I refer you to Pattern’s explanation at Wikipedia articles. You can use the plaintext() method to scrub out the HTML:

>>> wikiText = wikiArticle.plaintext()

A Wikipedia article is organized into sections. The sections attribute opens them up for you. Our sample article has 52. You might not want to display all of them:

>>> len(wikiArticle.sections)
>>> pprint(wikiArticle.sections[:10])

You can apply plaintext() to a section to extract its text free from HTML:

>>> wikiSecText = wikiArticle.sections[0].plaintext()

Wikia now appears to be a site called Fandom, which appears to have broken Pattern’s code:

>>> wikiaSite = Wikia(domain='One_Wiki_to_Rule_Them_All')
>>> for (i, title) in enumerate(wikiaSite.index(start='a', throttle=1.0, cached=True)):
...     if i >= 3: break
...     article = wikiaSite.search(title)
...     print article.title

This breaks at the point of trying to receive data.

pattern.web contains a module for searching ProductWiki, which no longer exists.

11.5. Summary

11.6. Powerpoint and podcast

  • 07 nov (M), day 28: We started talking about APIs.
  • 07 nov (M), day 28: Podcast of APIs 1.
  • 09 nov (W), day 29: We continued talking about APIs.
  • 09 nov (W), day 29: Podcast of APIs 2.
  • 11 nov (V), day 30: We continued talking about APIs.
  • 11 nov (V), day 30: Podcast of APIs 3.

Last edited Dec 05, 2016