11. How to get text from web pages and blog feeds

A journalist named Enrique Dans writes a blog, El blog de Enrique Dans, which opened up to this page on April 4, 2014:


There is a nice chunk of text in it that we would like to get our hands on, but it is hidden in a field of images, links and other stuff. How do we winnow the textual wheat from the non-textual chaff?

11.1. How to look at html source code to find text tags

The first step is to understand how web pages are encoded. They are written in what is called hypertext mark-up language, or html. Your web browser can show you the hypertext mark-up of the page that you are looking at:

  • On Windows, just right-click on the page and choose Show page source (the exact wording varies from browser to browser).
  • In Firefox, Tools > Web Developer > Page Source.
  • In Safari, first turn on the Develop menu: go to Safari > Preferences > Advanced and check Show Develop menu in menu bar. Then choose Develop > Show Page Source.

The window that opens up begins like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" lang="es-ES" xmlns:fb="http://www.facebook.com/2008/fbml"  xmlns:og="http://opengraphprotocol.org/schema/" >

<head profile="http://gmpg.org/xfn/11">
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

        <title>  Columna en Expansión: clientes y sostenibilidad &raquo; El Blog de Enrique Dans</title>

        <meta name="generator" content="WordPress 3.8.1" /> <!-- leave this for stats -->

        <link rel="alternate" type="application/rss+xml" title="Feed RSS de El Blog de Enrique Dans" href="http://feeds.feedburner.com/ElBlogDeEnriqueDans" />
        <link rel="pingback" href="http://www.enriquedans.com/xmlrpc.php" />

        <link rel="shortcut icon" href="http://www.enriquedans.com/wp-content/themes/enriquedans_20_v3/favicon.ico" />

        <link rel="openid.server" href="http://openid.blogs.es/index.php/serve" />
        <link rel="openid.delegate" href="http://openid.blogs.es/edans" />

        <link rel="stylesheet" href="http://www.enriquedans.com/wp-content/themes/enriquedans_20_v3/style.css?v=14" type="text/css" media="screen" />
        <!--[if  IE 6]>
                <link rel="STYLESHEET" type="text/css" href="http://www.enriquedans.com/wp-content/themes/enriquedans_20_v3/style-ie.css?v=13">

There is no trace of the text that we want. To find it, search for its first words, “Mi columna de”. After three false starts, you will find them on line 206. The context of this line is copied here:

<div itemprop="description" class="post-entry">
<p><a href="http://www.enriquedans.com/wp-content/uploads/2014/04/customersustainability-expansion.pdf" target="_blank"><img class=" wp-image-23355 alignright" style="margin-bottom: 10px; margin-left: 30px;" title="Clientes y sostenibilidad - Expansión (pdf, haz clic si no quieres dejarte los ojos)" alt="Clientes y sostenibilidad - Expansión (pdf, haz clic si no quieres dejarte los ojos)" src="http://www.enriquedans.com/wp-content/uploads/2014/04/customersustainability-expansion.jpg" width="248" height="429" /></a>Mi columna de esta semana en Expansión, titulada &#8220;<a title="Clientes y sostenibilidad - Expansión (pdf)" href="http://www.enriquedans.com/wp-content/uploads/2014/04/customersustainability-expansion.pdf" target="_blank">Clientes y sostenibilidad</a>&#8221; (pdf), trata sobre un concepto al que estoy dando cada vez más desarrollo: el de sostenibilidad aplicada a la relación con el cliente.</p>

In html, a piece of text is ‘marked up’ by surrounding it with tags, such as <p> text </p>, which marks “text” as a paragraph. The person who designed Enrique Dans’ blog appears to have marked the start of the text of a post with the tag <div itemprop="description" class="post-entry">, which means that with any luck it will end with some sort of </div>.
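To make the idea of opening and closing tags concrete, here is a minimal sketch that collects the text between a <div class="post-entry"> and its matching </div>. It assumes Python 3 and uses only the html.parser module from the standard library. The class name ExtractorDeEntrada, its attributes and the shortened html string are made up for illustration; only the class name post-entry comes from the page source.

```python
from html.parser import HTMLParser


class ExtractorDeEntrada(HTMLParser):
    """Collect the text between <div class="post-entry"> and its </div>."""

    def __init__(self):
        super().__init__()
        self.profundidad = 0   # how many <div>s deep inside the post we are
        self.trozos = []       # pieces of text found inside the post

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        clases = (dict(attrs).get("class") or "").split()
        if self.profundidad > 0:
            self.profundidad += 1      # a nested <div> inside the post
        elif "post-entry" in clases:
            self.profundidad = 1       # the post starts here

    def handle_endtag(self, tag):
        if tag == "div" and self.profundidad > 0:
            self.profundidad -= 1      # reaching 0 means the post has ended

    def handle_data(self, data):
        if self.profundidad > 0:
            self.trozos.append(data)


# Hypothetical, much shortened stand-in for the blog page
html = ('<div class="menu"><p>Inicio</p></div>'
        '<div itemprop="description" class="post-entry">'
        '<p>Mi columna de <a href="#">esta semana</a></p>'
        '<div class="nota">en Expansión</div>'
        '</div>'
        '<p>Comentarios</p>')

extractor = ExtractorDeEntrada()
extractor.feed(html)
extractor.close()
# join the pieces and squeeze runs of whitespace into single spaces
texto = " ".join(" ".join(extractor.trozos).split())
print(texto)   # Mi columna de esta semana en Expansión
```

Notice that the sketch has to count nested <div> tags so that it does not stop at the first </div> it meets. Real pages add further complications, such as tags that are never closed.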

11.1.1. BeautifulSoup

This seems simple enough, but so many things can go wrong that an entire Python module has been developed to extract information from marked-up text. It is called BeautifulSoup, and you will use it to extract the text that we want rather than trying to write the code yourself. BeautifulSoup is not part of the default packages of Canopy (or Anaconda), so download and install it with pip. In Terminal:

$ pip install BeautifulSoup4
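Once pip has finished, you may want to check that Python can actually see the new package. The following is a small sketch, assuming Python 3; the function name esta_instalado is our own invention. Keep in mind that the package installed as BeautifulSoup4 is imported under the name bs4.

```python
import importlib.util


def esta_instalado(nombre):
    """Return True if Python can find a top-level module called `nombre`."""
    return importlib.util.find_spec(nombre) is not None


# BeautifulSoup4 is imported as bs4; requests is used to download pages
for modulo in ("bs4", "requests"):
    estado = "installed" if esta_instalado(modulo) else "missing"
    print(modulo, estado)
```

If either module is reported as missing, install it with pip before going on.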

11.1.2. Uniform resource locators (URLs)

There is one last thing before you see the code. You need to know the address of the page that we are looking at. The general address for the blog is http://www.enriquedans.com. The specific address for the post is http://www.enriquedans.com/2014/04/columna-en-expansion-clientes-y-sostenibilidad.html, which will be shown in your web browser’s address bar if you click on the title of the post, “Columna en Expansión: clientes y sostenibilidad”.

The technical name for a web address is a uniform resource locator, or url. You will see this in the upcoming code. By the way, the “http” that starts every web url stands for hypertext transfer protocol.
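If you are curious about how a url is put together, Python's standard library can take one apart for you. This is a minimal sketch, assuming Python 3 (where the function lives in urllib.parse); the variable name partes is ours.

```python
from urllib.parse import urlparse

url = 'http://www.enriquedans.com/2014/04/columna-en-expansion-clientes-y-sostenibilidad.html'
partes = urlparse(url)

print(partes.scheme)   # http
print(partes.netloc)   # www.enriquedans.com
print(partes.path)     # /2014/04/columna-en-expansion-clientes-y-sostenibilidad.html
```

The scheme is the protocol, the netloc is the general address of the blog, and the path picks out the specific post.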

11.1.3. The script for capturing text from a web page

The script is short and sweet:

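# -*- coding: utf-8 -*-
# suggested name: capturar_texto_web.py
import requests
from bs4 import BeautifulSoup

# paste the address here:
url = 'http://www.enriquedans.com/2014/04/columna-en-expansion-clientes-y-sostenibilidad.html'
# download the page; .text holds the html as a unicode string
html = requests.get(url).text
# name the parser explicitly so that every installation builds the same tree
sopa = BeautifulSoup(html, 'html.parser')
# both attributes go in a single dictionary; a second dictionary would be
# taken as find()'s 'recursive' argument and silently ignored
entrada = sopa.find('div', {'itemprop': 'description', 'class': 'post-entry'})
# get_text() strips the tags and keeps only the text
print(entrada.get_text())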
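Because the script needs a network connection, and the page may have changed since 2014, it can help to try the same lookup on a small piece of html first. The sketch below assumes Python 3 with BeautifulSoup4 installed. The html string is a shortened, made-up stand-in for the blog page, and the names entrada and texto are ours; only the attributes of the div come from the real source shown earlier.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the blog page
html = """
<html><body>
<div class="sidebar"><p>Enlaces</p></div>
<div itemprop="description" class="post-entry">
<p>Mi columna de esta semana en Expansión, titulada
&#8220;<a href="#">Clientes y sostenibilidad</a>&#8221; (pdf).</p>
</div>
</body></html>
"""

sopa = BeautifulSoup(html, "html.parser")
# both attributes are passed in a single dictionary
entrada = sopa.find("div", {"itemprop": "description", "class": "post-entry"})

if entrada is None:
    # find() returns None when no tag matches
    texto = ""
else:
    # squeeze line breaks and repeated spaces into single spaces
    texto = " ".join(entrada.get_text().split())

print(texto)
```

Checking for None before asking for the text is a sensible habit: if the blog's designer ever renames the div, find() will not raise an error by itself, but the attempt to read text from None will.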

11.2. How to get the text from a blog feed

11.3. Summary

11.4. Further practice

11.5. Further reading

11.6. Appendix



Last edited: April 23, 2014