How to Parse and Scrape HTML Using Pyquery

“Pyquery” is a third-party Python module that allows you to parse and extract data from XML and HTML documents. It is inspired by the jQuery JavaScript library and features a nearly identical syntax, allowing you to use many helper functions and shorthand code to parse and manipulate the document tree. This article is a short guide to help you get started with the module.

Pyquery Installation

To install Pyquery in Ubuntu, use the command specified below:

$ sudo apt install python3-pyquery

You can also install the latest version of Pyquery using the “pip” package manager by running the following two commands in succession:

$ sudo apt install python3-pip
$ pip3 install pyquery

To install Pyquery in other Linux distributions, install “pip3” from the package manager and run the second command mentioned above.
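
To confirm that the installation worked, you can ask Python for the installed package version. The snippet below is a minimal sketch that uses the standard library “importlib.metadata” module; the helper name “installed_pyquery_version” is made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_pyquery_version():
    """Return the installed pyquery version string, or None if it is missing."""
    try:
        return version("pyquery")
    except PackageNotFoundError:
        # The package metadata was not found, so pyquery is likely not installed
        return None


print(installed_pyquery_version())
```

If the snippet prints “None”, the package was most likely installed for a different Python interpreter than the one you are running.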

Creating a Parsable Document Tree

Before you can parse and extract data from an HTML document, you need to create a document tree. You can create a document tree from a simple HTML markup using the code sample below:

from pyquery import PyQuery as pq
document = pq("<html>Hello World !!</html>")
print (document)
print (type(document))

The first statement imports the “PyQuery” class from the “pyquery” module under the short alias “pq”. Next, a new instance of the “PyQuery” class is created from the markup string. After running the code sample above, you should get the following output:

<html>Hello World !!</html>
<class 'pyquery.pyquery.PyQuery'>

Notice the second line in the output. Here “document” is an instance of the “PyQuery” class, not a string object. You can list all the methods available on the “document” instance by replacing the print statements in the code sample above with a call to “help”:

from pyquery import PyQuery as pq
document = pq("<html>Hello World !!</html>")
# help() prints the documentation itself and returns None, so no print is needed
help(document)

You can also browse the API for the “PyQuery” class online.

To create a document tree from a URL, use the following code instead (replace the value of “url” with your own desired address):

from pyquery import PyQuery as pq
document = pq(url='https://example.com')
print (document)

To create a document tree from a local HTML file, use the code below (replace the value of “filename” according to your needs):

from pyquery import PyQuery as pq
document = pq(filename='index.html')
print (document)

Now that you have a document tree, you can start parsing it.

Manipulating the Document Tree

You can extract data and manipulate document trees using a variety of methods. Some of the most common methods are listed below with samples. For all usable methods, refer to the PyQuery API documentation.

You can use the “text” method to get the text content of an element:

from pyquery import PyQuery as pq
document = pq('''<html><p id="hw">Hello World !!</p></html>''')
p = document('p')
print (p.text())

You can choose a specific tag / element by supplying its name as argument to the “document” instance. After running the above code sample, you should get the following output:

Hello World !!

You can get attributes of a tag by using the “attr” method. To do so, pick a tag you want to parse (‘p’ in this case) and supply the attribute name as an argument (‘id’ in this case) or use dot notation.

from pyquery import PyQuery as pq
document = pq('''<html><p id="hw">Hello World !!</p></html>''')
p = document('p')
print (document)
print (p.attr("id"), p.attr.id)

After running the above code sample, you should get the following output:

<html><p id="hw">Hello World !!</p></html>
hw hw

You can manipulate CSS using the “css” method. To add CSS styles to a “<p>” tag or any other tag, you can use the following code:

from pyquery import PyQuery as pq
document = pq('''<html><p id="hw">Hello World !!</p></html>''')
p = document('p')
p.css({"color": "red"})
print (document)
print (p.attr("style"))

Replace the “{"color": "red"}” part with your own custom styles. After running the above code sample, you should get the following output, which confirms that the CSS has been applied:

<html><p id="hw" style="color: red">Hello World !!</p></html>
color: red

If you have a pre-styled class, you can just use the “addClass” method to apply existing styles.

from pyquery import PyQuery as pq
document = pq('''<html><p id="hw">Hello World !!</p></html>''')
p = document('p')
p.addClass("mystyle")

You can append and prepend your own custom markup using the code sample below:

from pyquery import PyQuery as pq
document = pq('''<p id="hw">Hello World !!</p>''')
p = document('p')
p.prepend("<p>Hi</p>")
p.append("<p>Bye</p>")
print (document)

Replace the arguments of the “prepend” and “append” methods with your own values. After running the above code sample, you should get the following output:

<p id="hw"><p>Hi</p>Hello World !!<p>Bye</p></p>

To remove contents of an element, use the “empty” method.

from pyquery import PyQuery as pq
document = pq('''<p id="hw">Hello World !!</p>''')
p = document('p')
p.empty()
print (document)

After running the above code sample, you should get the following output:

<p id="hw"/>

You can use the “filter” method to narrow a selection when there are multiple tags of the same type. For instance, the code below selects the “<p>” tag whose “id” is “hello”:

from pyquery import PyQuery as pq
document = pq('''<p id="hello">Hello</p><p id="world">World !!</p>''')
p = document('p')
print (p.filter("#hello"))

After running the above code sample, you should get the following output:

<p id="hello">Hello</p>

You can find multiple tags / elements at once using the “find” method:

from pyquery import PyQuery as pq
document = pq('''<p id="hello">Hello</p><p id="world">World !!</p>''')
print (document.find('p'))

Supply the tag / element name as argument to the “find” method. After running the above code sample, you should get the following output:

<p id="hello">Hello</p><p id="world">World !!</p>

You can switch between “xml” and “html” parsers using an additional “parser” argument:

from pyquery import PyQuery as pq
document = pq('''<p id="hello">Hello</p><p id="world">World !!</p>''', parser="html")
print (document)

If you need further help with Pyquery, refer to its official documentation and examples.

Conclusion

PyQuery lets you parse HTML documents with very little code, as its many helper functions reduce the need for custom parsing logic. Its jQuery-like syntax and structure also help you select elements and nodes without manually walking the document tree, which is especially useful when there is a lot of nested markup.

About the author

Nitesh Kumar

I am a freelance software developer and content writer who loves Linux, open source software and the free software community.