Python

Python Urlparse()

URLs frequently include essential data that could be exploited when evaluating a website, a participant’s search, or the distribution of the material in each area. Although, they sometimes appear to be quite complex, Python comes with a variety of helpful libraries that let you parse URLs and retrieve their constituent parts.

In Python 3, the urllib package enables users to explore websites from within their script. The urllib contains several modules for managing different URL functions. When opening a URL in Python programming, the urllib library is crucial. It allows you to visit and interact with websites by utilizing their Universal Resource Locator. This library provides us with packages like: urllib.request, urllib.error, urllib.parse, and urllib.robotparser.

In this snippet, despite this being a large topic to comprehend all at once, we will simply pay attention to the urllib.parse module. Most particularly, the urlparse() method.

The urllib.parse module is utilized for parsing the URLs of the websites. It implies that by dividing a URL, we may obtain its various parts. Additionally, it may be used to get any URL from a source URL and reference path.

Loading the urllib:

Python includes urllib as a standard library. To use it, we must first import this library. For this, we will open the Spyder tool and write the following command:

Import urllib

Urlparse() Module:

The urlparse() module offers a defined method for parsing a uniform resource locator (URL) into distinct sections. To put it simply, this module enables us to easily separate URLs into different components and filter out any particular part from URLs. It just simply split the URL into 6 components which relate to the overall syntax of a

URL: scheme:/netloc/path;parameters?query#fragment.

Let’s now begin our tutorial with a practical example.

from urllib.parse import urlparse, urlunparse

In this code snippet, the first thing we did is importing the urlparse and urlunparse from the urllib.parse. This will enable all the required features of the urlparse() method in our tool.

from urllib.parse import urlparse
exampleurl = urlparse('https://linuxhint.com/')
print("Url Components:",exampleurl)

Now, as we can use the urlparse() method. We have defined a variable named “exampleurl” which will store the string values. Then, we used the assignment operator “=” to assign values. Next to it, we have called the “urlparse()” method. Inside the braces of the urlparse() method, between the inverted commas, we have defined a URL of a particular website on which we want to perform the parsing. The braces of the print() statement contain a quoted text and the variable name, separated by a comma.

The image below shows us the following output.

You can see that the provided URL is divided into 6 components. Now, before we dip into learning these components, we will first learn how to put these components back to the original URL.

For this purpose, the method being used is “urlunparse()”.

from urllib.parse import urlparse, urlunparse
exampleurl = urlparse('https://linuxhint.com/')
print("Url Components:",exampleurl)
unpar_url = urlunparse(exampleurl)
print("Original URL:" ,unpar_url)

We have already imported the urlunparse from the urllib.parse in the above snippet. Now, we will create a variable named “unpar_url”. Invoking the “urlunparse()” method and writing the name of the variable, we allocate the URL opening for the urlparse() method i.e. “exampleurl”. In the last step, use the “print()” statement to display a text and the variable name for unparsing the URL.

The parsed URL is displayed in the image attached below.

The usage of the urlparse() and urlunparse() functions has been demonstrated. Now, let us explore the significance of every element of the ParseResult that was returned.

Urlparse() Components:

The urlparse() method splits the provided URL into 6 chunks which are scheme, netloc, path, params, query, and fragment.

The first component is the scheme. The scheme is utilized to specify the protocol that is to be used to acquire the online resources which could be HTTP or HTTPS. The next component is netloc: net refers to network while loc means location. So, it tells us about the provided URLs network location. The component path contains the precise pathway that a web browser has to take to acquire the provided resource. The params are the path elements’ parameters. The query adheres to the path component and offers a stream of data that the resource can utilize. The last component fragment simply classifies a part.

As previously mentioned, each of these elements contains some data on the URL. Since the returned object is provided as a tuple, all these components may also be retrieved utilizing the index position.

from urllib.parse import urlparse
exampleurl = urlparse('https://linuxhint.com/')
print(exampleurl.scheme, "==",exampleurl[0])
print(exampleurl.netloc, "==",exampleurl[1])
print(exampleurl.path, "==",exampleurl[2])
print(exampleurl.params, "==",exampleurl[3])
print(exampleurl.query, "==",exampleurl[4])
print(exampleurl.fragment, "==",exampleurl[5])

In this code snippet, we defined indexes for each component separately inside the print() statement. We used the name of the variable with the component name against which we mentioned the variable name with the index position at which it lies in the stream. We will continue to use this sequence until we have mentioned all the components with corresponding index positions.

Resultant strings can be seen in the image here.

Even though these make up the majority of the indexed content, more keywords can also be used to retrieve certain additional functionalities such as hostname, username, password, and port. The hostname identifies the hostname of the specified URL, the username holds the name of the user, the password keeps the password user has provided, while the port tells the port number.g\

Conclusion

In today’s topic, we have discussed the urlparse() module provided by the urllib.parse. We explained the purpose and usability of the urlparse() method. We elaborated on different components of the urlparse() method and also how we make access. By implementing the practical example codes on the URL of any specified website employing the Spyder tool, we tried to make it simple, understandable yet beneficial learning for you.

About the author

Kalsoom Bibi

Hello, I am a freelance writer and usually write for Linux and other technology related content