# 13. How to get text from web pages¶

A marketer named Seth Godin writes Seth’s Blog which opened up to this page on Nov 22, 2016:

There is a nice chunk of text in it that we would like to get our hands on, but it is hidden in a field of images, links and other stuff. How do we winnow the textual wheat from the non-textual chaff?

Note

The script with the code for this chapter is nlpWebpages.py, which you can download with codeDowner(), see Practice 1, question 2.

## 13.1. What HTML looks like¶

### 13.1.1. Use Firefox, plus Firebug and FirePath¶

So that everyone will be on the same page, I suggest that you use Firefox as your web browser. You can install it from here, if you haven’t already done so. Note that one Python module has to find it, so it should be in its default location, at the top level of your applications folder.

Warning

You don’t have to do this.

The advantage of using Firefox is that it has two extensions that will make it a lot easier to find the tags that mark text. The first is called Firebug. Add it to Firefox by opening its page in the Firefox Add-ons site Firebug and clicking on the green button. The second is called FirePath. Add it to Firefox by likewise opening its page in the Firefox Add-ons site FirePath and clicking on the same green button.

### 13.1.2. The general layout of an HTML page¶

The first step is to understand how web pages are encoded. They are written in what is called hypertext mark-up language, or HTML. The layout of an HTML page looks like the following, where successive indentation represents successive depth in a tag hierarchy:

```html
<html>
  <head>
    <title>Woman finds tongue-eating parasite in fish</title>
  </head>
  <body>
    <div class="post">
      <header class="post-header">
        <h1>Woman finds tongue-eating parasite in fish</h1>
      </header>
      <section class="post-body">
        <p>A retired telephone switchboard operator who bought fish sticks
        from <b>Burger Doodle</b> found a 3cm long tongue-eating parasite
        in the fish just as she sat down to eat it.</p>
      </section>
      <footer class="post-comments">
        <div class="comment">
          <a href="http://acme.com/user/johnlennon.html">johnlennon</a>
          <p>Good thing she wasn't having tongue!</p>
        </div>
        <div class="comment">
          <a href="http://acme.com/user/georgeharrison.html">georgeharrison</a>
          <p>I love Burger Doodle's flame-broiled tongue.</p>
        </div>
      </footer>
    </div>
  </body>
</html>
```

The tags are enclosed in angled brackets, <>. They come in pairs, with an end or closing tag </tag> at the same level of the hierarchy as its start or opening tag <tag>. A pair of tags plus the content between them is known as an element.

The first or highest tag is <html>, which is split into <head> and <body>. The body is further divided into sections with <div>. The contents themselves are structured into <header>, <section> and <footer>. It is the middle one that contains the text. Text in an HTML document is usually marked with <p>, for paragraph, and can be formatted with tags like <b> for bold face. I hope it is obvious to you that the text that you see is just a small part of the markup. You might want to look at HTML Introduction for further explanation and pretty pictures.

An HTML element can have any number of attributes, which specify what the tag does. In the made-up example above, class is used extensively to differentiate tags that would otherwise be the same and so identify them for individual formatting. Attributes come in pairs of name and value, just like pythonic dictionaries. In fact, attributes are represented in Python as dictionaries. The values in the sample are delimited by double quotes, which is a stylistic choice recommended by many. It makes them easy to parse into strings in Python.
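To make the attribute-to-dictionary correspondence concrete, here is a minimal sketch using the standard library's html.parser (a Python 3 module; in Python 2 it lives in HTMLParser). It is only a stand-in for BeautifulSoup, which does this work for you later in the chapter; the sample markup reuses a comment link from the made-up page above:

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Record each opening tag together with its attribute dictionary."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []
    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; dict() converts it
        self.tags.append((tag, dict(attrs)))

parser = AttrCollector()
parser.feed('<div class="comment">'
            '<a href="http://acme.com/user/johnlennon.html">John</a></div>')
print(parser.tags)
# [('div', {'class': 'comment'}),
#  ('a', {'href': 'http://acme.com/user/johnlennon.html'})]
```

Each tag's attributes come back as an ordinary dictionary, exactly the shape you will hand to BeautifulSoup's search methods below.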

## 13.2. How to get text from a web page¶

While the task of finding tags in a page seems simple enough, so many things can go wrong that an entire Python module has been developed to facilitate the process. Called BeautifulSoup, you will use it to extract the text that we want, rather than trying to write the code to do so yourself.

The algorithm is:

1. Gather resources, including the URL of a page and the tags on it that you need.
2. Initialize resources.
3. Download the web page with requests.
4. Parse it with BeautifulSoup.
5. Extract the tags that have text.
6. Loop through the tags to extract their text.
7. Print a summary.
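Steps 4 through 6 can be sketched with nothing but the standard library, just to show there is no magic involved. This is a rough stand-in for BeautifulSoup, written with Python 3's html.parser; the sample page and the entry-body class anticipate Godin's blog, which is examined below:

```python
from html.parser import HTMLParser

class EntryBodyParser(HTMLParser):
    """Gather the text inside every <div class="entry-body"> element."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0    # how deep we are inside a matching div
        self.texts = []   # one string per matching div

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            if tag == 'div':          # a nested div: go one level deeper
                self.depth += 1
        elif tag == 'div' and dict(attrs).get('class') == 'entry-body':
            self.depth = 1            # entering a matching div
            self.texts.append('')

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == 'div':
            self.depth -= 1           # leaving a (possibly nested) div

    def handle_data(self, data):
        if self.depth > 0:            # only keep text inside a matching div
            self.texts[-1] += data

page = ('<html><body>'
        '<div class="entry-body"><p>Opposite of the naysayer, of course.</p></div>'
        '<div class="sidebar">ARCHIVES</div>'
        '</body></html>')
parser = EntryBodyParser()
parser.feed(page)
print(parser.texts)   # ['Opposite of the naysayer, of course.']
```

Notice how much bookkeeping even this toy version needs; BeautifulSoup hides all of it behind a couple of method calls.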

### 13.2.1. What is a uniform resource locator (URL)?¶

There is one last thing before you see the code. You need to know the address for the page that we are looking at. The general address for the blog is http://sethgodin.typepad.com/. The specific address for the post is http://sethgodin.typepad.com/seths_blog/2016/11/the-yeasayer.html, which will be shown in your web browser’s address bar if you click on the title of the post, “The yeasayer”.

The technical name of any Internet address is a uniform resource locator or URL. You will see this in the upcoming code. By the way, the “http” that starts every web URL stands for hypertext transfer protocol.
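As a brief aside, Python's standard library can take a URL apart, which makes the pieces of an address easier to see. This sketch uses the Python 3 locations of the functions (urllib.parse; in the Python 2 of this chapter they live in the urlparse and urllib modules):

```python
from urllib.parse import urlparse, quote, unquote  # Python 3 locations

url = 'http://sethgodin.typepad.com/seths_blog/2016/11/the-yeasayer.html'
parts = urlparse(url)
print(parts.scheme)   # 'http', the hypertext transfer protocol
print(parts.netloc)   # 'sethgodin.typepad.com', the host
print(parts.path)     # '/seths_blog/2016/11/the-yeasayer.html'

# Reserved characters can also be percent-encoded for safe transmission,
# something you will run into when hunting for feed URLs later on:
encoded = quote('http://feeds.feedblitz.com/SethsBlog', safe='')
print(encoded)        # 'http%3A%2F%2Ffeeds.feedblitz.com%2FSethsBlog'
print(unquote(encoded) == 'http://feeds.feedblitz.com/SethsBlog')  # True
```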

### 13.2.2. How to take a short cut with pattern.web.plaintext¶

pattern 2.6 comes bundled with Leonard Richardson’s BeautifulSoup. You can import its plaintext() function from pattern.web to simplify the stripping away of HTML tags to reveal the hidden text. Or at least try to:

```python
>>> from requests import get
>>> from pattern.web import plaintext
>>> url = 'http://sethgodin.typepad.com/seths_blog/2016/11/the-yeasayer.html'
>>> htmlString = get(url).text
>>> webText = plaintext(htmlString)
>>> len(webText)
7550
```

Does this page look like it has 7550 characters on it? Well, the post is so short that I can copy it into a name and have Python count it for me:

```python
>>> yeasayer = """Opposite of the naysayer, of course.
... This is the person who will find ten reasons why you should try something.
... The one who will embrace the possibility of better.
... The colleague to turn to when a reality check is necessary,
... because the reality is, it might work. Are you up for it?"""
>>> len(yeasayer)
284
```

Whoops. requests downloaded a bunch of other junk along with the text of the post. You can double check by looking at webText in Spyder’s Variable explorer. So we have no choice but to look carefully at the HTML to find the tag that contains text and extract it directly with BeautifulSoup.

### 13.2.3. How to look at HTML source code to find text tags¶

Firefox can show you the hypertext mark-up of the material that you are interested in. Right-click on a word in the blog post and choose Inspect Element with FireBug from the pop-up menu. The window that opens up should look something like this:

The designer of Seth Godin’s blog appears to have marked the start of the text of a post with the tag <div class="entry-body">, which means that with any luck it will end with </div>, which indeed it does. This is all that is needed to extract the text with Python.

### 13.2.4. How to install BeautifulSoup¶

pattern 2.6 comes bundled with version 3.2.1 of Leonard Richardson’s BeautifulSoup. As of this moment, 2016-11-22, BeautifulSoup is up to v. 4.5.1, so there may be cases in which the more recent version is preferable anyway.

BeautifulSoup is part of Anaconda’s default installation, so you should just be able to import it. If it is not, download and install it from the Python Package Index by means of pip in the Terminal/Command Prompt:

```
$ pip install BeautifulSoup4
```

Be sure to check that this worked by trying to import it in Spyder:

```python
>>> from bs4 import BeautifulSoup
```

See Beautiful Soup Documentation for more about what BeautifulSoup can do.

### 13.2.5. How to find text with BeautifulSoup¶

BeautifulSoup requires that the HTML format of <tag attribute="value"> be translated into Python as 'tag', {'attribute':'value'}. The tag which marks text in Godin’s website, <div class="entry-body">, becomes 'div', {'class':'entry-body'}.

The code is short and sweet. Assuming that you have already imported requests:

Listing 13.1 Scrape text from a blog post.

```python
>>> from bs4 import BeautifulSoup
>>> url = 'http://sethgodin.typepad.com/seths_blog/2016/11/the-yeasayer.html'
>>> htmlString = get(url).text
>>> html = BeautifulSoup(htmlString, 'lxml')
>>> entries = html.find_all('div', {'class':'entry-body'})
>>> text = [e.get_text() for e in entries]
>>> print '{} posts were found.'.format(len(text))
>>> print text[0]
```

Success! Most of the textual processing is done by BeautifulSoup’s find_all() method, which, like re’s findall(), scans the document looking for the appropriate tags. find_all() returns a tag object for the contents of each tag that it finds. The comprehension loops through them so that get_text() can extract from each a string in UTF-8. The outcome is text, a list of strings, one for each post or blog entry on the page.

As a final comment, don’t take for granted that print displays the true status of the text. In reality, it has some crud that you might want to filter out in later processing:

```python
>>> text[0]
```

### 13.2.6. How to get a page of posts¶

For your next trick, take a look at the main page of Godin’s blog by clicking on Main. I’ll treat you to a screen shot:

Scrolling down the page reveals a lot more posts; it is hard to tell how many. I would like you to grab the text for all of them. There are several techniques for doing so.

#### 13.2.6.1. How to get the text with requests and BeautifulSoup¶

The previous block of code, Listing 13.1, has the ability to collect all the text on this page through the magic of find_all(). All you have to do is change the URL:

```python
>>> url = 'http://sethgodin.typepad.com/seths_blog/'
>>> htmlString = get(url).text
>>> html = BeautifulSoup(htmlString, 'lxml')
>>> entries = html.find_all('div', {'class':'entry-body'})
>>> text = [e.get_text() for e in entries]
>>> print '{} posts were found.'.format(len(text))
>>> from pprint import pprint
>>> pprint(text[-1])
```

After running it, I can tell you with confidence that there are thirty-three posts on this page.

#### 13.2.6.2. How to get text from an Atom or RSS feed¶

A web page that is updated regularly, like a blog, often has a mechanism for distributing a new post when it is published. This can be by e-mail, Twitter, Facebook, or some other social media, but one of the oldest and most widely used is RSS. RSS originally stood for Rich Site Summary, but nowadays Really Simple Syndication is more perspicuous. I can’t describe what it does any better than Wikipedia’s entry on RSS:

> An RSS document (called “feed”, “web feed”, or “channel”) includes full or summarized text, and metadata, like publishing date and author’s name. RSS feeds enable publishers to syndicate data automatically. A standard XML file format ensures compatibility with many different machines/programs. RSS feeds also benefit users who want to receive timely updates from favorite websites or to aggregate data from many sites. Subscribing to a website RSS removes the need for the user to manually check the website for new content. Instead, their browser constantly monitors the site and informs the user of any updates. The browser can also be commanded to automatically download the new data for the user.

An RSS feed can contain all of the posts available on a blog’s webpage, and maybe more, depending on how it is configured.
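Concretely, a feed is nothing more than an XML document. Here is a tiny, invented RSS 2.0 feed (real feeds carry many more tags) parsed with the standard library's xml.etree.ElementTree, just to show the channel/item structure before handing the real work over to feedparser below:

```python
import xml.etree.ElementTree as ET

# A minimal, made-up RSS 2.0 feed with a single item.
rss = """<rss version="2.0">
  <channel>
    <title>Seth's Blog</title>
    <link>http://sethgodin.typepad.com/seths_blog/</link>
    <item>
      <title>The yeasayer</title>
      <description>Opposite of the naysayer, of course.</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(rss)
channel = root.find('channel')
print(channel.findtext('title'))   # the channel describes the whole feed
# Each <item> holds one post; collect them as a list of dictionaries.
items = [{'title': item.findtext('title'),
          'description': item.findtext('description')}
         for item in channel.findall('item')]
print(items)
```

The channel/item division sketched here is exactly the channel/entry split that feedparser exposes.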
It is important to point out that RSS formats a feed in Extensible Markup Language, XML, for an explanation of which I appeal to Wikipedia’s elegance:

> In computing, Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The W3C’s XML 1.0 Specification and several other related specifications – all of them free open standards – define XML. The design goals of XML emphasize simplicity, generality, and usability across the Internet. It is a textual data format with strong support via Unicode for different human languages. Although the design of XML focuses on documents, the language is widely used for the representation of arbitrary data structures such as those used in web services.

Python will render XML into lists and dictionaries, but it is not a trivial transformation. You will need the help of a parser, one called feedparser. The way to download a feed follows an algorithm that should be familiar to you by now:

1. Specify a URL.
2. Specify a limit of items to download.
3. Initialize a list to hold them.
4. Loop through the downloaded object, appending each item to the list.
5. Print (a sample of) the results.

#### 13.2.6.3. How to find a feed’s URL¶

The first step in downloading a feed that you are interested in is to find its URL. All feeds have a URL similar to the address of the website that hosts them. Let us continue with the example of Godin’s blog. In the left margin, under his photo, there is a heading entitled “RSS FEEDS”. If you put your cursor on top of the SUBSCRIBE button, a long URL appears, either http://www.addthis.com/feed.php?h1=http://feeds.feedblitz.com/SethsBlog&pub=sethgodin or http://www.addthis.com/feed.php?h1=http%3A%2F%2Ffeeds.feedblitz.com%2FSethsBlog&pub=sethgodin, with the non-alphanumeric characters percent-encoded: ‘:’ becomes ‘%3A’, and ‘/’ becomes ‘%2F’. The encoding makes the URL safe to transmit but hard to read.

By a process of elimination, the feed URL must be http://feeds.feedblitz.com/SethsBlog. It has to start with http:// and contain some indication of the blog title. You can double-check by actually clicking on the button. In the page that opens, click on view xml, and the page opens up to the feed itself, which has the URL http://feeds.feedblitz.com/SethsBlog.

This struggle to find the right URL may seem the result of poor design, but it is unavoidable given the lack of a standard naming convention for feeds, or even a standard way to display them on a webpage. But that is how you learn to become a text hacker, by guessing and experimenting to see what works.

#### 13.2.6.4. How to download a feed with pattern.web¶

pattern.web's module for downloading RSS feeds is called Newsfeed. Go ahead and import it (and plaintext, if you no longer have it available):

```python
>>> from pattern.web import Newsfeed, plaintext
```

Adapting the feed download algorithm to pattern.web's common interface goes like this:

Listing 13.2 How to download a RSS post with pattern.web.Newsfeed.

```python
>>> url = 'http://feeds.feedblitz.com/SethsBlog'
>>> postText = []
>>> for post in Newsfeed().search(url):
...     postText.append(plaintext(post.text))
>>> print '{} posts were found.'.format(len(postText))
>>> pprint(postText[-1])
```

Only 10 posts were found. This is apparently the maximum held by the feed. Thus whether you download posts from a feed or from the blog that hosts them will depend on which one retains more.

#### 13.2.6.5. What’s in a feed¶

An RSS feed can be divided into two parts, a channel and the entries that it contains. The channel describes the entire feed, while each entry holds a single post or other piece of information. There are several tags which are supposed to be part of every channel and entry.
However, pattern.web only instantiates the following ones:

```python
>>> pprint(post.items())
[(u'language', u'en-US'),
 (u'author', u'Seth Godin'),
 (u'url', u'http://feeds.feedblitz.com/~/230783464/0/sethsblog~The-yeasayer.html'),
 (u'text', *LOTS OF STUFF*),
 (u'title', u'The yeasayer'),
 (u'date', u'2016-11-22T04:56:00-05:00'),
 (u'id', u'tag:typepad.com,2003:post-6a00d83451b31569e201b8d2017dc9970c')]
```

They can be shaken loose from the dictionary that holds them by using them as keys:

```python
>>> rssKeys = {'language':post.language,
...            'author':post.author,
...            'url':post.url,
...            'text':post.text,
...            'title':post.title,
...            'date':post.date,
...            'id':post.id}
>>> pprint(rssKeys)
```

See Common RSS Elements in feedparser's documentation for more on RSS tags.

#### 13.2.6.6. How to install feedparser¶

The RSS module of pattern 2.6 is a wrapper for version 5.1.2 of Mark Pilgrim’s feedparser. As of this moment, 2016-11-22, feedparser is up to v. 5.2.1. If pattern's bundled package were ever to fail you, or if you want access to all the tags in a feed or post, you can install a stand-alone package from Anaconda’s repository by entering the line below in the Terminal/Command Prompt:

```
$ conda install feedparser
```

If for some reason that does not work, it can also be installed from the Python Package Index by means of pip in the Terminal/Command Prompt:

```
$ pip install -U feedparser
```

If not even that works, its source code can be downloaded from the Python Package Index at feedparser. Click on the green download button, decompress the file that you get, and in Terminal do the following:

```
$ cd {path to feedparser's folder}
$ python setup.py install
```

Be sure to check that you got it by trying to import it in Spyder:

```python
>>> import feedparser
```

#### 13.2.6.7. How to process a feed with feedparser¶

The final algorithm is similar to that of the webpage task:

1. Gather resources, including the URL of the feed that has text.
2. Download the feed with feedparser.
3. Parse it into tags with feedparser.
4. Extract the tags that have text.
5. Loop through the tag objects to extract their text with the help of BeautifulSoup.
6. Print a sample result.

You can flesh this out in Python as so, assuming that you have imported feedparser:

```python
>>> from feedparser import parse
>>> url = 'http://feeds.feedblitz.com/SethsBlog'
>>> tags = parse(url)
>>> print 'Fetched {} entries from {}.'.format(len(tags.entries), tags.feed.title)
```

To outline what you got, pretty-print the highest level:

```python
>>> pprint(tags, depth=1)
{'bozo': 0,
 'encoding': u'UTF-8',
 'entries': [...],
 'etag': u'"7d48e328625cc589991d35cc87a49911"',
 'feed': {...},
 'headers': {...},
 'href': u'http://feeds.feedblitz.com/SethsBlog',
 'namespaces': {...},
 'status': 200,
 'version': u'atom10'}
```

entries contains a list of posts, while feed contains the information about the feed.
Let’s extract the first post and pretty-print its highest level:

```python
>>> pprint(tags.entries[0], depth=1)
{'author': u'Seth Godin',
 'author_detail': {...},
 'authors': [...],
 'content': [...],
 'feedburner_origlink': u'http://sethgodin.typepad.com/seths_blog/2016/11/the-yeasayer.html',
 'guidislink': False,
 'id': u'tag:typepad.com,2003:post-6a00d83451b31569e201b8d2017dc9970c',
 'link': u'http://feeds.feedblitz.com/~/230783464/0/sethsblog~The-yeasayer.html',
 'links': [...],
 'published': u'2016-11-22T04:56:00-05:00',
 'published_parsed': time.struct_time(tm_year=2016, tm_mon=11, tm_mday=22, tm_hour=9, tm_min=56, tm_sec=0, tm_wday=1, tm_yday=327, tm_isdst=0),
 'summary': *LOTS OF STUFF*,
 'summary_detail': {...},
 'title': u'The yeasayer',
 'title_detail': {...},
 'updated': u'2016-11-22T04:56:00-05:00',
 'updated_parsed': time.struct_time(tm_year=2016, tm_mon=11, tm_mday=22, tm_hour=9, tm_min=56, tm_sec=0, tm_wday=1, tm_yday=327, tm_isdst=0)}
```

The text must be hidden in content, so let’s extract it and pretty-print its highest level. Ok, I did that and found that content is a list of dictionaries, though it only contains one. So the code cuts out the first element of the content list:

```python
>>> pprint(tags.entries[0].content[0], depth=1)
{'base': u'http://sethgodin.typepad.com/seths_blog/',
 'language': u'en-US',
 'type': u'text/html',
 'value': *LOTS OF HTML*}
```

So the text is actually hidden in value, as HTML. It was nice of pattern.web to fish it out and clean it up for us. Oh, wait, maybe we can still call on pattern.web to squeeze the text out of the HTML:

```python
>>> text = [plaintext(e.content[0].value) for e in tags.entries]
>>> pprint(text[0])
```

Eureka, it works! The alternative with BeautifulSoup is not much different:

```python
>>> text = [BeautifulSoup(e.content[0].value, 'lxml').get_text() for e in tags.entries]
>>> pprint(text[0])
```

See feedparser's documentation for an explanation of what else it can do.

### 13.2.7. How to download more than one page of posts¶

So how would you get another page of posts from Godin’s blog?

#### 13.2.7.1. With pattern.web.Newsfeed¶

I would really like for this algorithm to work: initialize an empty list to hold the posts that you are going to collect, loop through the pages, and within each page, loop through the search() results. From each post, extract its text and append it to the list. Print a message informing the user of the results:

```python
1  >>> postsPerPage = 10
2  >>> pages = 3
3  >>> text = []
4  >>> for page in range(1, pages):
5  ...     for post in Newsfeed().search(url, start=page, count=postsPerPage):
6  ...         text.append(plaintext(post.text))
7  >>> print 'Fetched {} entries by {}.'.format(len(text), post.title)
8  >>> print '{} are unique.'.format(len(set(text)))
```

Line 7 says that 20 posts were retrieved, but line 8 says that only 10 are unique. What happened? Well, if you open text in Spyder’s Variable explorer, you will see that the first ten posts are duplicated once. This brings us to a brick wall in RSS – the number of posts in a feed is set to some (small) maximum, and there just aren’t any more. The only remaining alternative is to page through the blog by hand, as it were.

#### 13.2.7.3. How to gather repeated code into a function¶

While the block of code above performs quite well, there are several lines that are repeated. Such repetition gives rise to two problems. The first is that it is hard to maintain, since if you change one copy you have to remember to change the other. In addition, it leaves the reader of your code perplexed: is there a reason for the repetition? Is the repeated material somehow different from the original, even though it looks identical? To solve both problems, it is best to pack the duplicate code into its own function:

```python
>>> def blogText(url):
...     htmlString = get(url).text
...     html = BeautifulSoup(htmlString, 'lxml')
...     entries = html.find_all('div', {'class':'entry-body'})
...     print 'Fetched {} entries.'.format(len(entries))
...     text = [e.get_text() for e in entries]
...     return text
```

This makes the body of the program much more understandable:

```python
>>> url = 'http://sethgodin.typepad.com/seths_blog/'
>>> text = []
>>> # get first page
>>> text.extend(blogText(url))
>>> # get second page
>>> page = 2
>>> nextPageURL = '{}page/{}/'.format(url, page)
>>> text.extend(blogText(nextPageURL))
>>> print 'Fetched {} texts from {} pages.'.format(len(text), page)
>>> print '{} texts are unique.'.format(len(set(text)))
```

You should have gotten the same text as before – unless Godin’s blog has been updated with a new post in the meantime.

#### 13.2.7.4. How to get even more pages with a for loop with a known outcome¶

If you click on the Older Posts » link you are taken to the next page of posts, at the bottom of which is another link for Older Posts ». Inspecting its element shows the same anchor tag as in the first link – reproduced below – with a tiny difference:

```html
<a href="http://sethgodin.typepad.com/seths_blog/page/3/">
  <span>Older Posts</span>
  <span>»</span>
</a>
```

Did you spot it? The link has changed the integer 2 to 3. This has a certain logic since, if you click on the link, you will be whisked away to the third page of the blog. But now you have to ask yourself the question, how far into the history of the blog do you want to go? Well, if you click on the More… link under ARCHIVES in the left sidebar, the list that unfurls itself extends all the way back to … January 2002 … month by month. Clicking on that last link and scrolling to the bottom of the page shows that this initial page leads to February 2002. But for the sake of simplicity, you only need to go back, say, three pages.
To do so, the easiest technique is to use a for loop that ranges over the maximum number of pages:

```python
>>> url = 'http://sethgodin.typepad.com/seths_blog/'
>>> maxPages = 5
>>> text = []
>>> # get first page
>>> text.extend(blogText(url))
>>> # get following pages
>>> for page in range(2, maxPages):
...     nextPageURL = '{}page/{}/'.format(url, page)
...     text.extend(blogText(nextPageURL))
>>> print 'Fetched {} texts from {} pages.'.format(len(text), page)
>>> print '{} texts are unique.'.format(len(set(text)))
```

When I executed this block, it returned 183 texts from 4 pages.

#### 13.2.7.5. How to get even more pages with a while loop with an unknown outcome¶

But we want MORE!! We want it all, actually. But we don’t know where it all ends, so a for loop is out of the question. The only alternative is a while loop, which requires careful planning of the condition that makes while run until it fails. When you used it before, the loop ended when a dictionary key was no longer found, raising a KeyError exception. The current code is so simple that it does not look for keys, or tags. I would rather not complicate the code just to find an easy trigger, so some other ‘absence’ must be sought.

One approach is to examine a page which is ‘beyond’ the end of the blog. Right now, I don’t know how many pages there are, so let’s just try a crazy number, like 180. What’s on http://sethgodin.typepad.com/seths_blog/page/180/? Well, you don’t get an error; just an empty page. The conclusion is that if the function blogText tries to scan this page for text, it will come up empty. Python has a round-about way of checking for list emptiness, which is that an empty list evaluates to False in a condition:

```python
>>> empty = []
>>> if not empty: print 'This list is empty!'
```
Since empty evaluates to False, negating it with not makes the condition True and induces it to execute its action of printing the response string. To get access to the list returned by blogText(), the line text.extend(blogText(url)) must be unpacked into newText = blogText(url) – which manifests the list – and text.extend(newText), which incorporates it into the existing list. Here is the code.

Warning

The last time I ran this block, I got 6097 texts from 125 pages. Handle with care!

Ok, so now you really get the code. Sit back and sip on a refreshing beverage:

Listing 13.3 Extract text from all the posts on a blog.

```python
 1  >>> url = 'http://sethgodin.typepad.com/seths_blog/'
 2  >>> page = 2
 3  >>> text = []
 4  >>> # get first page
 5  >>> newText = blogText(url)
 6  >>> text.extend(newText)
 7  >>> # get following pages
 8  >>> while newText:
 9  ...     nextPageURL = '{}page/{}/'.format(url, page)
10  ...     newText = blogText(nextPageURL)
11  ...     text.extend(newText)
12  ...     page += 1
13  >>> print 'Fetched {} texts from {} pages.'.format(len(text), page-1)
14  >>> print '{} texts are unique.'.format(len(set(text)))
```

Did you reach the end? The trick is in line 8, where newText is True as long as the list has text in it. Once the end of the archive is reached, newText comes back empty, while evaluates to False, and the loop shuts down. Oh, did you notice the modification in line 13? Why is it necessary? Think about what happens during the execution of the code within the loop when it reaches the end.

### 13.2.8. Practice¶

Recall that you looked at the last page in Godin’s blog. Look at it again and inspect the link for February 2002 with FireBug. There has been a change in the HTML organization, roughly like this:

```html
<a href="http://sethgodin.typepad.com/seths_blog/">Main</a> |
<a href="http://sethgodin.typepad.com/seths_blog/2002/02/index.html">February 2002 »</a>
```

So at some point, the tags that you rely on now will be replaced by others, which makes the code fail. How would you change the code to adapt it to the older environment?
When you looked at the URL of Godin’s blog, http://sethgodin.typepad.com, did you wonder what “typepad” was? Well, it is a blog-hosting service, as you can see for yourself by clicking on its home page. If Typepad’s blog layout is consistent (which makes life easier for its engineers) and our code does not overfit, then the script that you just developed should generalize to any other Typepad blog. Try to figure this out by looking at the HTML of a few Typepad blogs and then running their URLs through the script to see whether all the posts are captured.

Googling “best blogging platforms” brings up a multitude of hits, some of which are out of date. Here is a list from 2013 that I have updated slightly:

1. WordPress, see Showcase
2. Blogger, see Google’s Official Blog
3. Tumblr, see search
4. Medium, see Staff picks
5. Svbtle, see Dustin Curtis
6. Quora, see What are some good blogs to follow on Quora?
7. Postach.io, see Team Blog
8. Google+, see List of Recommended Bloggers to Follow on Google+
9. SETT, see Top-ranked posts on Sett
10. Ghost, see Ghost blog
11. Squarespace, see Big Picture
12. Typepad, you have already seen it
13. Posthaven, see The Official Posthaven
14. LinkedIn, see 6 Underrated Bloggers To Follow on LinkedIn
15. LiveJournal, see Communities

Check out some of the examples. Which ones could you scrape text from?

### 13.2.9. Appendix: how to find the exact tags¶

The URL is a daughter of the <span class="pager-right"> tag, which is easy enough to find with BeautifulSoup using html.find('span', {'class':'pager-right'}). The problem is that it contains the two span tags. You could design a regular expression that would extract the URL, but I would rather use BeautifulSoup as much as possible, since it has a long list of methods for finding tags. The hacker way is to look at BeautifulSoup’s documentation and try every plausible find() method, but I want to save you some effort.
In particular, I want you to learn to think a problem through before giving up and resorting to the shotgun approach. Imagine you are a parser. You have parsed <span class="pager-right"> in the tag block above. The tag pair that you want next is the anchor, <a …> </a>. BeautifulSoup has a method for finding the next tag, find_next(), which you can use like this to isolate the anchor tag:

```python
>>> nextPageTag = html.find('span', {'class':'pager-right'})
>>> nextPageATag = nextPageTag.find_next('a')
>>> nextPageATag
<a href="http://sethgodin.typepad.com/seths_blog/page/2/"><span>Older Posts</span> <span>»</span></a>
```

It includes the two span tags that are in the way. Looking closer at the <a> tag, the link that you want is part of the tag’s href attribute. BeautifulSoup tag objects store their attributes in attrs, which is a dictionary:

```python
>>> nextPageATag.attrs
{u'href': u'http://sethgodin.typepad.com/seths_blog/page/2/'}
```

Since the value of a dictionary entry is returned from its key, it is easy to flush out the URL with the href key:

```python
>>> nextPageATag.attrs['href']
u'http://sethgodin.typepad.com/seths_blog/page/2/'
```

You have now retrieved the URL with nary a trace of a regular expression. What is more, the algorithm relies on basic building blocks of the web page, so it should be robust to changes across pages and even across blogs of the same blogging service. The text from the second page can be appended to the first.
Here is the entire block of old and new commands:

```python
>>> url = 'http://sethgodin.typepad.com/seths_blog/'
>>> text = []
>>> # get first page
>>> htmlString = get(url).text
>>> html = BeautifulSoup(htmlString, 'lxml')
>>> entries = html.find_all('div', {'class':'entry-body'})
>>> text = [e.get_text() for e in entries]
>>> # get next page
>>> nextPageTag = html.find('span', {'class':'pager-right'})
>>> nextPageATag = nextPageTag.find_next('a')
>>> nextPageURL = nextPageATag.attrs['href']
>>> htmlString = get(nextPageURL).text
>>> html = BeautifulSoup(htmlString, 'lxml')
>>> entries = html.find_all('div', {'class':'entry-body'})
>>> text = text + [e.get_text() for e in entries]
>>> print str(len(text))+' posts were found'
>>> print '{} are unique.'.format(len(set(text)))
```

Nice! Given the necessity of a while statement, the algorithm should be revised to something like:

- gather resources, including the URL of a page and the tags on it that you need
- initialize resources
- download the web page with requests
- parse it with BeautifulSoup
- extract the tags that have text
- loop through the tags to extract their text
- extract the tag that has the next page URL
- while the next page number is less than the cut-off number:
  - download the next web page with requests
  - parse it with BeautifulSoup
  - extract the tags that have text
  - loop through the tags to extract their text
  - extract the tag that has the next page URL
- print a summary

This seems simple enough, but the integers for tracking page numbers really should come from the page URL. Yet BeautifulSoup only parses down to the level of the tag, and not its components. A regular expression is unavoidable.

#### 13.2.9.1. How to compile and search with a regular expression¶

The regex that you need should return the digit(s) from the URL – recall the \d meta-character for digits – but deployment of a regular expression in the middle of a program, where it could be called frequently, is slightly different from what you have seen so far. In such a case, the regex is created or compiled first, and then called with a regex method like search(). The regex for the next page digit(s) can be compiled as in the first line:

```python
>>> nextPageNumRE = compile(r'page/(\d+?)/')
>>> nextPageNum = nextPageNumRE.search(nextPageURL).group(1)
```

The regex surrounds the page number with its context, page/ and /. The parentheses non-greedily capture any sequence of digits in this context. I have preferred to include the digits' context so that the regex will fail if the context changes, as a way of indicating that the structure of the page has probably changed and you should stop and check the HTML.

The next line is tricky. search() scans nextPageURL for the regex pattern and stores any match in a match object. The actual strings matched have to be returned with the group() method: group(0) returns the entire match, and group(1) returns the first captured group, which is hopefully the page number. Note that this is a string, not an integer, so it will have to be converted to an integer at some point with int().
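Before wiring the regex into a program, it is worth rehearsing the compile–search–group workflow on a saved URL string. A self-contained sketch (in Python 3 syntax; the URL is the page-2 link retrieved earlier):

```python
from re import compile

# Compile once, search many times
nextPageNumRE = compile(r'page/(\d+?)/')
nextPageURL = 'http://sethgodin.typepad.com/seths_blog/page/2/'
match = nextPageNumRE.search(nextPageURL)
print(match.group(0))           # the entire match: 'page/2/'
print(match.group(1))           # the captured digits: '2'
print(int(match.group(1)) + 1)  # int() makes the string arithmetic-ready
```

If the page structure ever changes and the pattern no longer matches, search() returns None, and calling group() on it raises an error – exactly the loud failure that including the digits' context was meant to produce.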
#### 13.2.9.2. How to streamline duplicate code with functions¶

The whole program is presented here in script format:

```python
from requests import get
from bs4 import BeautifulSoup
from re import compile, search

url = 'http://sethgodin.typepad.com'
nextPageNumRE = compile(r'page/(\d+?)/')
maxPage = 4

htmlString = get(url).text
html = BeautifulSoup(htmlString, 'html5lib')
tags = html.find_all('div', {'class':'entry-body'})
text = [e.get_text() for e in tags]
print str(len(text))+' posts were found'
nextPageTag = html.find('span', {'class':'pager-right'})
nextPageATag = nextPageTag.find_next('a')
nextPageURL = nextPageATag.attrs['href']
nextPageNum = nextPageNumRE.search(nextPageURL).group(1)

while int(nextPageNum) <= maxPage:
    htmlString = get(nextPageURL).text
    html = BeautifulSoup(htmlString, 'html5lib')
    tags = html.find_all('div', {'class':'entry-body'})
    text = text + [e.get_text() for e in tags]
    print str(len(text))+' posts were found'
    nextPageTag = html.find('span', {'class':'pager-right'})
    nextPageATag = nextPageTag.find_next('a')
    nextPageURL = nextPageATag.attrs['href']
    nextPageNum = nextPageNumRE.search(nextPageURL).group(1)
```

If you paste it into a script and run it, the console outputs:

```python
>>> runfile('/Users/harryhow/Documents/pyScripts/web2.py', wdir=r'/Users/harryhow/Documents/pyScripts')
25 posts were found
35 posts were found
45 posts were found
55 posts were found
```

It successfully processes the four pages of posts that it was designed to. After the first page of twenty-five posts, each successive page contains ten. The program is quite accurate, but it is ugly in that two chunks of code are duplicated: the extraction of the text and the extraction of the next page number. This duplication creates two problems. The first is that it obscures to the reader that the two chunks of code were intended to be the same.
This leads to the second problem, that another coder may not grasp this and change a line in one chunk without changing its duplicate, which would lead to unexpected behavior. It is much preferable to reduce duplication with functions. The next block reorganizes the previous one with the help of two functions that encapsulate the repeated code:

```python
from requests import get
from bs4 import BeautifulSoup
from re import compile, search

url = 'http://sethgodin.typepad.com'
nextPageNumRE = compile(r'page/(\d+?)/')
nextPageNum = '1'
maxPage = 4
totText = []

def getText(url):
    htmlString = get(url).text
    html = BeautifulSoup(htmlString, 'html5lib')
    tags = html.find_all('div', {'class':'entry-body'})
    text = [e.get_text() for e in tags]
    return (html, text)

def getPage(html, regex):
    nextPageTag = html.find('span', {'class':'pager-right'})
    nextPageATag = nextPageTag.find_next('a')
    nextPageURL = nextPageATag.attrs['href']
    nextPageNum = regex.search(nextPageURL).group(1)
    return (nextPageURL, nextPageNum)

while int(nextPageNum) <= maxPage:
    (html, newText) = getText(url)
    totText = totText + newText
    print str(len(totText))+' posts were found'
    (url, nextPageNum) = getPage(html, nextPageNumRE)
```

The flow of control is more intricate, because it passes from the body of the loop to the functions and back again, but the code itself is shorter because all of the duplicate lines have been removed. What is perhaps more complex is that certain variables need to be initialized at the beginning that were initialized during the course of the program in its previous incarnation. What you have not seen before is that the functions return tuples of output, and that these tuples have to be assigned to variable tuples. Python is smart enough to align the corresponding members of each tuple.
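Tuple return and tuple assignment are worth a tiny standalone illustration. The function below is invented for the demonstration (in Python 3 syntax):

```python
def divide(dividend, divisor):
    """Return two results at once, packed into a tuple."""
    return (dividend // divisor, dividend % divisor)

# Python aligns the members of the returned tuple with the variable tuple,
# so quotient gets the first member and remainder gets the second
(quotient, remainder) = divide(17, 5)
print('{} remainder {}'.format(quotient, remainder))  # 3 remainder 2
```

This is exactly what happens in `(html, text) = getText(url)`: the parsed page lands in the first variable and the list of post texts in the second.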
The reorganized program works just as well as the original:

```python
>>> runfile('/Users/harryhow/Documents/pyScripts/web2.py', wdir=r'/Users/harryhow/Documents/pyScripts')
25 posts were found
35 posts were found
45 posts were found
55 posts were found
```

You are now free to play with the code, though the next section asks you to play with it, too.

## 13.4. How to scrape text with selenium¶

There are many websites with comments for which the techniques introduced above will fail. It is frustrating to see the text that you desire right in front of you – you could reach out and grab it – and yet it might as well be in Antarctica. There is one final technique that you can try, which is to open a web browser within Python and extract the HTML that it gets from the web site's server.

To demonstrate how this works, you should install the Python package called selenium, which is not part of the Anaconda installation. Install it in the Terminal/Command Prompt with pip:

```bash
pip install selenium
```

Selenium can drive Firefox, Chrome, Opera, and Internet Explorer. Here I will use Firefox, so you should install it on your computer, if you haven't already done so. Note that selenium has to find it, so it should be in its default location, at the top level of your applications folder.

### 13.4.1. Introduction to the task¶

The algorithm is similar to that of the web page task:

1. Find a sample URL & use it to find the tags that mark text and pagination.
2. Gather resources, including URL and tags.
3. Open a faux Firefox browser with selenium.
4. Extract the text with selenium.
5. Print a summary response.

The only tricky part is determining what tags to use, which has to be done by inspection of the HTML specific to a web service and so varies from service to service.

#### 13.4.1.1. How to locate elements with selenium¶

Selenium provides the following methods to locate web page elements, according to 4. Locating Elements, which has further explanation. I have reordered them slightly to aid in exposition:

| single element | multiple elements |
| --- | --- |
| find_element_by_id | |
| find_element_by_name | find_elements_by_name |
| find_element_by_tag_name | find_elements_by_tag_name |
| find_element_by_class_name | find_elements_by_class_name |
| find_element_by_xpath | find_elements_by_xpath |
| find_element_by_css_selector | find_elements_by_css_selector |

All but the last two use HTML tags to locate elements.

To overcome the current limitation of one hundred comments retrieved from YouTube, you can scrape them all (at least in theory) with selenium. To demonstrate this task, you will use the video Broca’s Aphasia, which has a manageable number of comments – 200 at the time of writing.

This code gets you started:

```python
>>> from selenium import webdriver
>>> video = 'f2IiMEbMnPM'
>>> url = 'https://www.youtube.com/watch?v='+video
>>> driver = webdriver.Firefox()
>>> driver.get(url)
```

After you have typed the last line, something magic happens: Firefox opens to the video that you have selected. Scroll down to one of the comments, select any word in it and right-click on it, and then select Inspect element from the pop-up window. In the Inspector window that opens up, the post that you selected will be in the tag <div class="Ct">, but notice that the values of all of the other tags are nonsensical. This is called software obfuscation, which, as Wikipedia explains, is the “deliberate act of creating … source or machine code that is difficult for humans to understand”.

Even though the text tag is not obfuscated, the fact that most of the others are means that your code may not generalize from the sample page to another. Fortunately, if you click on ALL COMMENTS (200) above the comments, you are taken to another page, https://www.youtube.com/all_comments?v=f2IiMEbMnPM, which only contains comments and whose tags are not obfuscated. So go ahead and close the current instance of Firefox with driver.close() and open a new one to the comments page by changing the URL:

```python
>>> from selenium import webdriver
>>> video = 'f2IiMEbMnPM'
>>> url = 'https://www.youtube.com/all_comments?v='+video
>>> driver = webdriver.Firefox()
>>> driver.get(url)
```

Then highlight one word of a comment, right-click on it, and select Inspect element from the pop-up window. It shows that the tag for text is <div class="comment-text-content">. This is enough information to extract all the comments on the page, but if you scroll down to the bottom, you see that there are more comments that are not on the page. You make them appear by clicking on the Show more button, BUT DON’T CLICK ON IT! Instead, inspect its HTML element by right-clicking on it V E R Y C A R E F U L L Y. It is contained in the tag <div id="yt-comments-paginator" class="paginator load-comments" data-token="…">. You will need both of the tags to successfully scrape all the comments.

The first tag, <div class="comment-text-content">, contains a class name and marks a single comment, so you want to locate all of them with find_elements_by_class_name. The second tag, <div id="yt-comments-paginator" class="paginator load-comments" data-token="…">, contains an id, and you only need to find one, so you can locate it with find_element_by_id. As a first approximation, try this block of code:

```python
>>> from selenium import webdriver
>>> from time import sleep
>>> video = 'f2IiMEbMnPM'
>>> url = 'https://www.youtube.com/all_comments?v='+video
>>> driver = webdriver.Firefox()
>>> driver.get(url)
>>> sleep(2)
>>> commentHTML = driver.find_elements_by_class_name('comment-text-content')
>>> text = [comment.text for comment in commentHTML]
```

The last three lines are new. Line 7 invokes the sleep() function from the time module to make Python do nothing for two seconds while the web page loads in Firefox. After that pause, selenium searches the page for the comment tag and saves each match to the variable commentHTML. In the final line, a list comprehension loops over each comment object to extract its text. Clicking on text in the Variable explorer reveals a list of 101 comments, from the beginning to the end of the page.
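If you want to convince yourself of what sleep() actually does, you can time it with the standard library alone. A quick sketch, independent of selenium (in Python 3 syntax):

```python
from time import sleep, monotonic

start = monotonic()   # a clock suited to measuring durations
sleep(0.1)            # Python does nothing for a tenth of a second
elapsed = monotonic() - start
print('paused for about {:.2f} seconds'.format(elapsed))
```

sleep() guarantees at least the requested pause, which is why a fixed two-second wait usually, but not always, suffices for a page load; a slow connection can still beat it.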

Now use selenium to click on the button to load the next page of comments with:

```python
>>> driver.find_element_by_id("yt-comments-paginator").click()
>>> sleep(3)
>>> commentHTML = driver.find_elements_by_class_name('comment-text-content')
>>> text = [comment.text for comment in commentHTML]
```

The first line sends the click, the second line makes Python wait three seconds for the new comments to load, and the last two lines collect all of the comments (even the old ones) and extract their text. The count of text increases to 151.
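Note that each pass re-collects the old comments along with the new ones. Here the whole list is simply overwritten, so nothing is double-counted, but if you ever accumulate passes by appending instead, duplicates can be dropped while preserving posting order. A sketch with invented comments (in Python 3 syntax):

```python
comments = ['first!', 'great video', 'first!', 'so interesting', 'great video']

seen = set()
unique = []
for comment in comments:
    if comment not in seen:    # keep only the first occurrence
        seen.add(comment)
        unique.append(comment)
print(str(len(unique)) + ' unique comments')
```

A plain set(comments) would also deduplicate, as in the blog example earlier, but a set does not preserve the order in which the comments were posted.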

There is still another button at the bottom of the page waiting to be clicked. You could have Python do it, but the fact that there is more to go suggests that the process should be recoded as a loop.

#### 13.4.2.1. How to break a loop with try-except¶

Since it is not known in advance how many comments are on a page, it is unlikely that a for loop is adequate to the task. A while loop should do fine, if a condition for its stopping can be identified. Actually, there is an obvious condition: stop looping when the button disappears. Selenium supplies a page_source representation in which to look for the button’s tag, so the loop should take the form:

```python
>>> while "yt-comments-paginator" in driver.page_source:
...     # do something
```

However, it turns out that this tag may actually hang around even though the button has disappeared. The only sure-fire condition is for the button to fail to be clicked. You could code this into an if statement, but there is an alternative that is slightly more efficient and pythonic, called a try-except statement. The idea is to try to click the button in the while loop, and when it fails, break the loop. A first draft of the code is:

```python
>>> while "yt-comments-paginator" in driver.page_source:
...     try:
...         driver.find_element_by_id("yt-comments-paginator").click()
...         sleep(3)
...     except:
...         commentHTML = driver.find_elements_by_class_name('comment-text-content')
...         text = [comment.text for comment in commentHTML]
...         print str(len(text))+' comments found'
...         break
```

As mentioned, the try statement clicks the button and waits three seconds. After the final comment, this fails because the button disappears. Selenium sends an error message, which in pythonese is called an exception – whence the name of the except statement. Since the exception marks the end of the comments, the except statement contains the final bit of processing, namely locate the text tags, extract their text, print a summary message and break from the loop.
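The try-except-break pattern can be rehearsed without a browser at all. In this sketch (Python 3 syntax), click_button() is an invented stand-in for selenium's .click() that raises an exception once the button is "gone"; naming the exception in the except clause, rather than using a bare except, is generally the safer habit:

```python
clicks_left = 3  # pretend the button works three more times

def click_button():
    """Stand-in for selenium's .click(); raises when the button is gone."""
    global clicks_left
    if clicks_left == 0:
        raise RuntimeError('no such element')
    clicks_left -= 1

pages = 1                    # the first page is already loaded
while True:
    try:
        click_button()       # try to load another page of comments
        pages += 1
    except RuntimeError:     # the button has disappeared
        break
print(str(pages) + ' pages of comments loaded')
```

The structure mirrors the selenium loop exactly: the try block does the optimistic work, and the except block does the one-time wrap-up before breaking out.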

This is enough code to perform the task, but I want to add a couple more lines to make the summary message more informative. You have already seen that the comment page begins with the total number of comments. It would be helpful to compare that number to the number that the code extracts, so inspect the HTML of that heading in the same way as before.

Look back at the table of elements that selenium can extract. There is the tag named <strong> but the number is outside of it. The only other one is <a href="…">. You may not be familiar with it, but it is a link tag, and the stuff inside it is considered its text. You can use find_element_by_partial_link_text to locate it, and in particular the open parenthesis. The next block of code expands the while loop with this method, extracts its text, and adds to the summary string:

```python
>>> while "yt-comments-paginator" in driver.page_source:
...     try:
...         driver.find_element_by_id("yt-comments-paginator").click()
...         sleep(3)
...     except:
...         commentHTML = driver.find_elements_by_class_name('comment-text-content')
...         text = [comment.text for comment in commentHTML]
...         totalHTML = driver.find_element_by_partial_link_text('(')
...         total = totalHTML.text
...         print str(len(text))+' comments found out of '+total
...         break
```
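The text of that link on this page reads ALL COMMENTS (200), so total arrives as a whole string. If you would rather have the bare number as an integer – say, to compare it against len(text) – a small regex digs it out of the parentheses. A sketch on the literal string (in Python 3 syntax):

```python
from re import search

linkText = 'ALL COMMENTS (200)'  # what totalHTML.text returns on this page
# \( and \) match literal parentheses; (\d+) captures the digits between them
total = int(search(r'\((\d+)\)', linkText).group(1))
print(total)  # 200
```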

The final version of the code is:

```python
from selenium import webdriver
from time import sleep

video = 'f2IiMEbMnPM'
url = 'https://www.youtube.com/all_comments?v='+video
driver = webdriver.Firefox()
driver.get(url)
sleep(2)
while "yt-comments-paginator" in driver.page_source:
    try:
        driver.find_element_by_id("yt-comments-paginator").click()
        sleep(3)
    except:
        commentHTML = driver.find_elements_by_class_name('comment-text-content')
        text = [comment.text for comment in commentHTML]
        totalHTML = driver.find_element_by_partial_link_text('(')
        total = totalHTML.text
        print str(len(text))+' comments found out of '+total
        break
# driver.quit()
```

It prints 196 comments found out of ALL COMMENTS (200) for this video. The code will be turned into a function and included in a script that I will email you.

### 13.4.3. How to get the text from LiveFyre comments¶

As a second example, let us look at the comments posted to New Orleans’ Times-Picayune’s website, in particular, the comments to this story, Oak Street Po-Boy Festival: And the winning po-boys are .... At the moment of writing, it has sixteen comments. Select a word in any of them and inspect its web element (right-click and select Inspect element). You will see that the comment is contained in the tag <div class="fyre-comment" itemprop="text">. “fyre” refers to a web service called LiveFyre for managing comments and other cross-site material. The company explains its API at Developer Documentation, but as far as I can tell, it is not meant for third-party usage. Thus we are forced to scrape comments with selenium.

In any event, the <div class="fyre-comment" itemprop="text"> tag should be all that selenium needs, so give it a try:

```python
>>> from selenium import webdriver
>>> url = 'http://www.nola.com/festivals/index.ssf/2014/11/the_winning_po-boys_at_the_oak.html#comments'
>>> driver = webdriver.Firefox()
>>> driver.get(url)
>>> # Wait for Firefox to load the page. When it is done, it returns control to Spyder's console.
>>> commentHTML = driver.find_elements_by_class_name('frye-comment')
```

If you have the Variable explorer open, you see that commentHTML appears as a list with zero elements. That is to say, selenium failed to find the tag even once. You could try other tags, but they fail, too.

#### 13.4.3.1. How to find a tag with XPath¶

Fortunately, selenium is bursting with alternative methods. The most precise is find_element(s)_by_xpath. I haven’t mentioned XPath yet, but selenium’s 4.3. Locating by XPath has a handy introduction:

XPath is the language used for locating nodes in an XML document. As HTML can be an implementation of XML (XHTML), Selenium users can leverage this powerful language to target elements in their web applications. XPath extends beyond (as well as supporting) the simple methods of locating by id or name attributes, and opens up all sorts of new possibilities such as locating the third checkbox on the page.

One of the main reasons for using XPath is when you don’t have a suitable id or name attribute for the element you wish to locate. You can use XPath to either locate the element in absolute terms (not advised), or relative to an element that does have an id or name attribute. XPath locators can also be used to specify elements via attributes other than id and name.

Absolute XPaths contain the location of all elements from the root (html) and as a result are likely to fail with only the slightest adjustment to the application. By finding a nearby element with an id or name attribute (ideally a parent element) you can locate your target element based on the relationship. This is much less likely to change and can make your tests more robust.

The Firebug and FirePath extensions to Firefox that you installed at the beginning of this section make finding XPaths quite simple. In the Times-Picayune article, close the Inspect element window if it is still open (click on the x at the top left corner), select a word in the first comment, right-click it, and select Inspect Element with Firebug. A window opens up in the bottom half of the page, filled with HTML tags, with the tag of the text that you selected highlighted in blue. Put the cursor on it to select it, right-click, and choose Inspect in FirePath Panel. The Firebug window should switch from the HTML tab to the FirePath tab; <div class="fyre-comment" itemprop="text"> should be selected, and the XPath window at the top should show an xpath like line 1 below:

Now repeat this process on the last comment on the page. I have pasted the xpath of the one I got in line 2. A single difference between them stands out: “article” is numbered differently. I suspect that this is what causes driver.find_elements_by_class_name('frye-comment') to fail. The numbering is not required for an xpath, so if you still have selenium’s webdriver open, you can give it a shot with:

```python
>>> xpathComment = "//*[@id='rtb-comments']/div/div/div[7]/div[1]/article/div[1]/section/div"  # delete the dot at the beginning of (a) & (b)
>>> commentHTML = driver.find_elements_by_xpath(xpathComment)
>>> text = [comment.text for comment in commentHTML]
```

text should appear in the Variable explorer as a list with some non-zero number of items, each of which is a string of text.
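XPath is not selenium-specific, and you can rehearse an expression on a toy document before handing it to find_elements_by_xpath: Python's standard xml.etree.ElementTree module evaluates a useful subset of XPath. The markup below is invented for the demonstration, modeled on the fyre-comment tag described above:

```python
import xml.etree.ElementTree as ET

# Invented XHTML fragment, shaped like the LiveFyre comment markup
page = ET.fromstring(
    "<div id='rtb-comments'>"
    "<article><div class='fyre-comment'>Best shrimp po-boy ever</div></article>"
    "<article><div class='fyre-comment'>The oyster one won?</div></article>"
    "</div>"
)
# A relative XPath: any div, at any depth, with a matching class attribute
comments = page.findall(".//div[@class='fyre-comment']")
text = [c.text for c in comments]
print(text)
```

Because the expression keys on a stable attribute rather than on numbered positions like div[7], it keeps matching even if the surrounding layout shifts – which is exactly the robustness argument the selenium documentation makes above.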

#### 13.4.3.2. Dynamic injection of HTML through JavaScript¶

Scroll through the comments, the most recent of which contains the infrequent word “whine”. You want to find out what tag marks such text, so you open the page to its source (right click on View Page Source) and search for “whine” … but the search comes up empty. Despite the fact that you can see the word on the page, it is not included in the page source. What do you think that means?

Well, it does not mean that it is magic. It means that the comments are inserted dynamically into the web page from a third party.

Since comments are displayed on a web page in real time, as they are posted, it is not unexpected that they be treated separately from the unchanging text of an article. In addition, most commenting requires the author to log in, so log-in credentials must be managed. And some comments are moderated, which necessitates an additional layer of oversight. It is therefore understandable that an entity like a newspaper might want to avoid the hassle of dealing with all this and sub-contract it to someone else.

In the case of the Times-Picayune, it is from a company called LiveFyre, though this is so difficult to figure out that I will skip it for the time being. But in any event, it means that the algorithm for extracting text from static web pages explained above will not work. Fortunately, Python provides a work-around which is one of the coolest things discussed in this book.

Imagine the process: the entire web page is hosted by the Times-Picayune, but the comments are hosted by LiveFyre and inserted into the web page in their proper place, as they appear. Thus the two entities are constantly talking to one another, but in a way that does not involve the static HTML of the story. This communication is effected through a programming language called JavaScript, which is far beyond the purview of this book. How can you eavesdrop on this conversation to scoop up some text?

The solution in Python is to open a faux web browser that is really a front for Python and so makes Python privy to the JavaScript conversation.

The following is a block of basic code for performing this task:

```python
>>> from re import findall
>>> commentHTML = driver.find_element_by_id('rtb-comments')
>>> html = commentHTML.get_attribute('innerHTML')
>>> tags = findall(r'<div class="fyre-comment" itemprop="text">(.+?)</div>', html)
>>> print tags[0]
```
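You can see what the findall() step does on a small invented string. The markup below mimics the fyre-comment tag described earlier; the DOTALL flag (an addition here) lets the dot match newlines, in case a comment spans several lines inside the div:

```python
from re import findall, DOTALL

# Invented innerHTML, mimicking the LiveFyre comment markup
html = ('<article><div class="fyre-comment" itemprop="text">'
        '<p>Great list of winners!</p></div></article>'
        '<article><div class="fyre-comment" itemprop="text">'
        '<p>I always whine about the lines.</p></div></article>')
# (.+?) captures non-greedily, stopping at the first closing </div>
tags = findall(r'<div class="fyre-comment" itemprop="text">(.+?)</div>', html, DOTALL)
print(tags[0])  # <p>Great list of winners!</p>
```

Note that the captured strings still contain inner tags like `<p>`, so a second pass of cleaning is usually needed before the text is analysis-ready.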

Many websites use Disqus to manage their comments, such as The Atlantic. While I was composing this section, [There’s a Dog in This Story, So More People Will Pay Attention to It](http://www.theatlantic.com/business/archive/2014/11/will-people-pay-more-attention-to-this-just-because-it-has-a-dog-in-it/382819/#disqus_thread) popped up on the magazine’s site. At the moment, it only has five comments. The fourth mentions the infrequent word “vilest”, so open up the page’s HTML layout (right click on View Page Source) and search for “vilest”. Oh, it’s not there. That suggests that the comments are injected dynamically via JavaScript, but at least in this case, you can find out who does the injecting. Clicking on the Login symbol pops open a menu whose first choice is Disqus. Disqus is perhaps the web’s pre-eminent comment-hosting service, and, bless their hearts, they provide a Python package for interacting with their service.

## 13.6. Appendix¶

### 13.6.1. Do not try to process the XML from an RSS feed¶

RSS sends data in an XML representation, the tags of which can be considerably different from the HTML tags that you learned about above. The code below is a cumbersome but transparent way to try to extract the text from a single entry in an RSS feed. You will notice at the end that the ‘text’ is full of non-textual junk. Therefore I do not recommend that you try to extract text ‘by hand’ from an RSS feed but rather let feedparser do it for you:

One thing you\'ll discover when you start pan roasting brussel sprouts or tomatoes (or running a theater or an airline, or just about anything for that matter) is that more is not always better.

\n

Sure, I know that you have three uncooked sprouts left, and it would be a shame to not serve them, but if you add those three to the pan with the others, the entire batch will suffer.

\n

Adding one more is just fine, until adding one more ruins everything.

\n

Greed costs.

\n\n\n\n

\n'

Last edited: Dec 05, 2016