Python Web Scraping

Finding Children Nodes With Beautiful Soup

Web scraping requires an understanding of how web pages are structured. To extract the information you need from a page, you have to identify the tags that hold it and then the attributes of those tags.

For beginners in web scraping with BeautifulSoup, an article discussing the concepts of web scraping with this powerful library can be found here.

This article is for programmers, data analysts, scientists or engineers who already know how to extract content from web pages using BeautifulSoup. If you are not familiar with this library, I advise you to go through the BeautifulSoup tutorial for beginners first.

Now we can proceed. I assume that you already have this library installed. If not, you can install it with the command below. The lxml package is included because the examples in this article use the "lxml" parser:

pip install beautifulsoup4 lxml

Since we are extracting data from HTML, we need a basic HTML page to practice these concepts on. For this article, we will use the HTML snippet below, assigned to a variable using Python's triple quotes.

sample_content = """<html>
<head>
<title>LinuxHint</title>
</head>
<body>
<p>

To make an unordered list, the ul tag is used:
 
<ul>
Here's an unordered list
 
<li>First option</li>
<li>Second option</li>
</ul>
</p>
<p>

To make an ordered list, the ol tag is used:
 
<ol>

Here's an ordered list

<li>Number One</li>
<li>Number Two</li>
</ol>
</p>
<p>Linux Hint, 2018</p>
</body>
</html>"""

Now that we have sorted that, let’s move right into working with the BeautifulSoup library.

We are going to use a couple of methods and attributes on our BeautifulSoup object. First, however, we need to parse our string with BeautifulSoup and assign the result to an “our_soup” variable.

from bs4 import BeautifulSoup as bso
our_soup = bso(sample_content, "lxml")

From here on, we will work with the “our_soup” variable and call all of our attributes and methods on it.
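
If the lxml package is not installed, asking for the "lxml" parser raises an error. The sketch below shows one possible way to fall back to Python's built-in "html.parser". The short sample string and the fallback logic are assumptions made for illustration and are not part of the snippet above. Keep in mind that the two parsers can build slightly different trees from imperfect HTML.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, shortened sample used only for this sketch
sample = (
    "<html><head><title>LinuxHint</title></head>"
    "<body><p>Linux Hint, 2018</p></body></html>"
)

try:
    # Preferred parser, requires the separate lxml package
    soup = BeautifulSoup(sample, "lxml")
except FeatureNotFound:
    # Fallback that ships with Python itself
    soup = BeautifulSoup(sample, "html.parser")

print(soup.title.string)
```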

On a quick note, if you do not already know what a child node is, it is a node (tag) that exists directly inside another node. In our HTML snippet, for example, the li tags are child nodes of the “ul” and “ol” tags that contain them.
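
The parent and child relationship can be checked from the child's side with the parent attribute. This is a minimal sketch on a hypothetical one-line list (not the full snippet above), parsed with the built-in "html.parser" so that it runs without extra packages.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document used only for this sketch
html = "<ol><li>Number One</li><li>Number Two</li></ol>"
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li")
# .parent is the tag that directly contains this li
parent_name = first_li.parent.name
print(parent_name)  # ol
```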

Here are the methods and attributes we will be taking a look at:

  • findChild
  • findChildren
  • contents
  • children
  • descendants

findChild():

The findChild method returns the first tag found inside an HTML element. For example, our “ol” and “ul” tags each contain two child tags, but findChild returns only the first of them. It is a legacy alias of the find method, so it takes the same arguments and, by default, also searches below the first level; pass recursive=False to restrict it to direct children.

This method is useful when we only want the first child node of an HTML element, as it returns the required result right away.

The returned object is of the type bs4.element.Tag. We can extract the text from it by reading its text attribute.

Here’s an example:

first_child = our_soup.find("body").find("ol")
print(first_child.findChild())

The code above prints the following:

<li>Number One</li>

To get the text from the tag, we read its text attribute:

print(first_child.findChild().text)

To get the following result:

Number One
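
Since findChild is a legacy alias of find, the same lookups can be written with find, which is the name the current documentation uses. The sketch below runs on a hypothetical snippet with one level of nesting and shows how the recursive argument changes the result. The variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: <b> is a grandchild of <div>, <span> is a direct child
html = "<div><p><b>bold</b></p><span>plain</span></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

# True matches any tag, so this is the first tag inside the div
first_tag = div.find(True)

# The default search is recursive, so the nested <b> is found
nested_b = div.find("b")

# Restricted to direct children, <b> is no longer found
direct_b = div.find("b", recursive=False)

print(first_tag.name)  # p
print(nested_b.text)   # bold
print(direct_b)        # None
```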

findChildren():

We have taken a look at the findChild method and seen how it works. The findChildren method works in a similar way, but as the name implies, it does not stop at the first match: it returns every tag inside the element. It is a legacy alias of the find_all method, so by default it also includes tags nested more deeply than the first level. For our “ol” tag this makes no difference, because its li tags contain only text.

When you need all of the tags inside an element, the findChildren method is the way to go. It returns them in a list, so you can access the tag of your choice using its index number.

Here’s an example:

first_child = our_soup.find("body").find("ol")
print(first_child.findChildren())

This would return the children nodes in a list:

[<li>Number One</li>, <li>Number Two</li>]

To get the second child node in the list, the following code would do the job:

print(first_child.findChildren()[1])

To get the following result:

<li>Number Two</li>
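
The difference between all nested tags and direct children only shows up once a list is nested inside another list. The following sketch uses find_all (the method findChildren is an alias of) on a hypothetical nested list; the snippet and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical nested list: the first item contains a second list
html = "<ul><li>A<ul><li>A1</li></ul></li><li>B</li></ul>"
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("ul")

# Default: every li below the outer list, at any depth
all_li = outer.find_all("li")

# recursive=False: only the li tags directly inside the outer list
direct_li = outer.find_all("li", recursive=False)

print(len(all_li))          # 3
print(len(direct_li))       # 2
print(direct_li[1].text)    # B
```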

Those are the two methods covered in this article. However, it doesn’t end there. Attributes can also be used on our BeautifulSoup objects to get the child, children or descendant nodes of an HTML element.

contents:

While the findChildren method did the straightforward job of extracting the children nodes, the contents attribute does something a bit different.

The contents attribute returns a list of all the content directly inside an HTML element, including the children nodes. So when you read the contents attribute of a tag, you get the text as strings (bs4.element.NavigableString objects) and the nested tags as bs4.element.Tag objects.

Here’s an example:

first_child = our_soup.find("body").find("ol")
print(first_child.contents)

This returns the following:

["\n   Here's an ordered list\n   ", <li>Number One</li>,
'\n', <li>Number Two</li>, '\n']

As you can see, the list contains the text that comes before a child node, the child node and the text that comes after the child node.

To access the second child node, all we need to do is to make use of its index number as shown below:

print(first_child.contents[3])

This would return the following:

<li>Number Two</li>
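
Because the whitespace between tags also appears in the list, index numbers into contents depend on how the HTML is formatted. One possible way around this, sketched here on a hypothetical cut-down version of the ordered list, is to keep only the items that are Tag objects before indexing.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical cut-down version of the ordered list
html = "<ol>\nHere's an ordered list\n<li>Number One</li>\n<li>Number Two</li>\n</ol>"
soup = BeautifulSoup(html, "html.parser")
ol = soup.find("ol")

# contents mixes strings and tags; keep only the tags
tags_only = [node for node in ol.contents if isinstance(node, Tag)]

print(len(ol.contents))   # 5 (three strings and two tags)
print(tags_only[1].text)  # Number Two
```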

children:

Here is one attribute that does almost the same thing as the contents attribute. However, it has one small difference in the kind of object it hands back.

The children attribute also returns the text that comes before a child node, the child node itself and the text that comes after the child node. The difference here is that it returns them as an iterator instead of a list, so you loop over the items rather than index into them.

Let’s take a look at the following example:

first_child = our_soup.find("body").find("ol")
print(first_child.children)

The code above gives a result like the following (the address, and possibly the exact type name, may differ on your machine):

<list_iterator object at 0x7f9c14b99908>

As you can see, printing it only shows the iterator object and its address, not the nodes. To see the nodes, we can convert this iterator into a list.

We can see this in the example below:

first_child = our_soup.find("body").find("ol")
print(list(first_child.children))

This gives the same items that the contents attribute returned:

["\n   Here's an ordered list\n   ", <li>Number One</li>,
'\n', <li>Number Two</li>, '\n']
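
In practice there is often no need to build a list at all, since the iterator can be used directly in a for loop. Here is a minimal sketch on a hypothetical two-item list that collects the text of each child tag and skips the whitespace strings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical two-item list used only for this sketch
html = "<ol>\n<li>Number One</li>\n<li>Number Two</li>\n</ol>"
soup = BeautifulSoup(html, "html.parser")
ol = soup.find("ol")

item_texts = []
for child in ol.children:
    # Skip the newline strings that sit between the li tags
    if isinstance(child, Tag):
        item_texts.append(child.get_text())

print(item_texts)  # ['Number One', 'Number Two']
```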

descendants:

While the children attribute returns only the content directly inside a tag, i.e. the text and the nodes on the first level, the descendants attribute goes deeper.

The descendants attribute walks through everything nested inside the element. So it doesn’t return only children nodes, it returns grandchildren nodes and anything below them as well.

Besides the tags themselves, it also returns the text inside each tag as a separate string.

Just like the children attribute, descendants does not return a list. In this case the result is a generator.

We can see this below:

first_child = our_soup.find("body").find("ol")
print(first_child.descendants)

This gives the following result:

<generator object descendants at 0x7f9c14b6d8e0>

As seen earlier, we can then convert this generator object into a list:

first_child = our_soup.find("body").find("ol")
print(list(first_child.descendants))

We would get the list below:

["\n   Here's an ordered list\n   ", <li>Number One</li>,
'Number One', '\n', <li>Number Two</li>, 'Number Two', '\n']
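
The difference between children and descendants is easier to see when tags are nested more than one level deep. The sketch below compares the two attributes on a hypothetical nested list; the snippet and variable names are assumptions made for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical nested list: ul > li > ul > li
html = "<ul><li>Fruit<ul><li>Apple</li></ul></li></ul>"
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("ul")

# children stops at the first level
child_tags = [n.name for n in outer.children if isinstance(n, Tag)]

# descendants keeps going through every level, in document order
descendant_tags = [n.name for n in outer.descendants if isinstance(n, Tag)]
descendant_strings = [str(n) for n in outer.descendants if not isinstance(n, Tag)]

print(child_tags)          # ['li']
print(descendant_tags)     # ['li', 'ul', 'li']
print(descendant_strings)  # ['Fruit', 'Apple']
```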

Conclusion

There you have it: five different ways to access child nodes in HTML elements. There may be more ways, but with the methods and attributes discussed in this article you should be able to reach the child nodes of any HTML element.

About the author

Habeeb Kenny Shopeju

I love building software and am very proficient with Python and JavaScript. I'm very comfortable with the Linux terminal and interested in machine learning. In my spare time, I write prose, poetry and tech articles.