Scrapy is a very powerful web scraping library, and it is easy to use as well. If you are new to it, you can follow the available tutorial on using the Scrapy library.
This tutorial covers the use of XPath selectors. XPath uses a path-like syntax to navigate the nodes of XML documents, and it is just as useful for navigating HTML tags.
Unlike the Scrapy tutorial, we are going to do all of our operations in the terminal for simplicity’s sake. This doesn’t mean that XPath can’t be used in a proper Scrapy program, though: the same expressions can be used in a spider’s parse method, on the response parameter.
We are going to work with the example.webscraping.com site, as it is very simple and will help in understanding the concepts.
To use Scrapy in our terminal, run the scrapy shell command followed by the URL of the site.
Scrapy visits the site and downloads the page, then leaves us with an interactive shell to work with, where a prompt waits for our input.
In the interactive session, we are going to work with the response object.
For the majority of this article, our commands take one of two forms, both starting from response.xpath() with the XPath expression as its argument.
Calling extract() on the result extracts all of the tags matched by the XPath expression and stores them in a list.
Calling extract_first() on the result extracts only the first matched tag and returns it as a single string rather than a list.
We can now start working on the XPath syntax.
Navigating tags in XPath is very easy: all that is needed is the forward slash “/” followed by the name of the tag.
For example, the expression /html returns the html tag and everything it contains as a single item in a list.
If we want to get the body of the web page, we extend the path to /html/body.
XPath also allows the wildcard character “*”, which matches every tag at the level where it is used.
For example, /* matches everything at the top level of the document. Since the html tag is the only tag at that level, the same thing happens when we use ‘/html’.
Aside from navigating tags one level at a time, we can get all the descendant tags of a particular tag by using “//”.
For example, /html//a returns all the anchor tags under the html tag, i.e. a list of all the descendant anchor tags, however deeply they are nested.
TAGS BY ATTRIBUTES AND THEIR VALUES
Sometimes, navigating through HTML tags to get to the required tag can be troublesome. This trouble can be avoided by simply finding the needed tag by its attribute: the attribute name, prefixed with @, and its value go in square brackets after the tag name.
For example, an expression such as //div[@id='pagination'] returns all the div tags under the html tag that have the id attribute with a value of pagination.
Similarly, //div[@class='span12'] returns a list of all the div tags under the html tag, but only if they have the class attribute with a value of span12.
What if you do not know the value of the attribute, and all you want is to get tags with a particular attribute, with no concern about its value? Doing this is simple as well: all you need to do is use only the @ symbol and the attribute name.
For example, //div[@class] returns a list of all the div tags that have the class attribute, regardless of what value that class attribute holds.
What if you know only a few of the characters contained in the value of an attribute? It’s also possible to get those types of tags, using the contains function.
For example, //div[contains(@id, 'ion')] returns all the div tags under the html tag that have an id attribute whose value contains “ion”, even though we do not know the full value.
The page we are parsing has only one tag in this category, and its value is “pagination”, so that tag is returned.
TAGS BY THEIR TEXT
Remember we matched tags by their attributes earlier. We can also match tags by their text.
For example, //a[text()=' Algeria'] gets all the anchor tags that have the “ Algeria” text in them. NB: the tag must have exactly that text content, including the leading space.
What if we do not know the exact text content, and only know part of it? We can handle that as well, using the contains function.
For example, //a[contains(text(), 'A')] gets the tags that have the letter “A” in their text content.
EXTRACTING TAG CONTENT
All along, we have been talking about finding the right tags. It’s time to extract the content of a tag once we find it.
It’s pretty simple. All we need to do is add “/text()” to the expression, and the contents of the tag are extracted.
For example, //a/text() gets all the anchor tags in the HTML document and then extracts their text content.
EXTRACTING THE LINKS
Now that we know how to extract the text in tags, we should also know how to extract the values of attributes. Most of the time, the attribute values that matter most to us are links.
Doing this is almost the same as extracting the text values; however, instead of “/text()” we use the “/@” symbol followed by the name of the attribute.
For example, //a/@href extracts all of the links in the anchor tags, since the links are the values of the href attribute.
NAVIGATING SIBLING TAGS
If you noticed, we have been navigating tags all this while. However, there’s one situation we haven’t tackled.
How do we select a particular tag when tags with the same name are on the same level?
<img src="/places/static/images/flags/af.png"> Afghanistan</a>
<img src="/places/static/images/flags/ax.png"> Aland Islands</a>
In a case like the one above, we might say we’d use extract_first() to get the first match.
However, what if we want to match the second one? What if there are more than ten options and we want the fifth one? We are going to answer that right now.
Here is the solution: when we write our XPath expression, we put the position of the tag we want in square brackets, just as if we were indexing, except that the index starts at 1.
Looking at the HTML of the web page we are dealing with, you’d notice that there are a lot of <tr> tags on the same level. To get the third <tr> tag, we’d use an expression such as //tr[3].
You’d also notice that the <td> tags come in twos; if we want only the second <td> tag from each of the <tr> rows, we’d use //tr/td[2].
XPath is a very powerful way to parse HTML files, and it can help minimize the use of regular expressions in parsing them, considering it has the contains function in its syntax.
There are other libraries that allow parsing with XPath, such as Selenium for web automation. XPath gives us a lot of options while parsing HTML, but what has been covered in this article should be able to carry you through common HTML parsing operations.