
Python Extract Substring Using Regex

There are many situations in which you need to extract a substring from a string in Python. For instance, while working on large datasets, you may need to pull specific data out of text fields or match a particular pattern in a string, such as an email address or phone number. Substring extraction is also a common step in text processing and analysis.

This post will cover the following approaches:

  • Method 1: Extract a substring using “re.search()”
  • Method 2: Extract a substring using “re.match()”
  • Method 3: Extract a substring using “re.findall()”
  • Method 4: Extract a substring using “re.finditer()”

Method 1: Python Extract Substring Using Regex in “re.search()” Method

The Python “re.search()” method scans a string for the first location where the pattern matches and returns a “Match” object, or “None” if there is no match. Use it when you want to locate a specific substring inside a longer string but do not know where or how often it occurs.

Syntax

To use the re.search() method, follow the given syntax:

re.search(pattern, string, flags)

Here:

  • “pattern” represents the regex that you want to search for.
  • “string” refers to the string in which you want to search.
  • “flags” represents optional settings, such as multi-line mode or case-insensitive matching.
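As a minimal sketch of the optional “flags” argument, the following example runs the same pattern with and without “re.IGNORECASE”. The sample text is made up for illustration and is not part of the examples in this post:

```python
import re

# Hypothetical sample text with the target word in uppercase.
text = 'Linuxhint is the BEST tutorial website'

# Without a flag the search is case-sensitive, so 'best' is not found.
print(re.search(r'best', text))  # None

# With re.IGNORECASE the same pattern matches 'BEST'.
match = re.search(r'best', text, flags=re.IGNORECASE)
if match:
    print(match.group())  # BEST
```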

Example 1: Extracting Text-based Substring Using “re.search()” Method

To extract a substring with the “re.search()” method, first import the “re” module, which provides regex support:

import re

Define the string from which you want to retrieve a substring:

string = 'Linuxhint is the best tutorial website'

Then, specify the regex. Here, the “r” prefix marks a raw string, so backslashes are treated as literal characters, and “best” is the regular expression:

regex = r'best'

Pass the created “regex” and “string” to the re.search() method and store the resulting object in the “match” variable:

match = re.search(regex, string)

Now, add a condition that extracts the matched substring from the “match” object returned by the re.search() method and prints it to the console:

if match:
    sub_string = match.group()
    print(sub_string)

The “group()” method of the Match object returns the matched substring, so the code prints:

best
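Besides the matched text, the Match object also records where the match was found. The sketch below reuses the string from this example and reads the position with “span()”; the variable names “start” and “end” are chosen here for illustration:

```python
import re

string = 'Linuxhint is the best tutorial website'
match = re.search(r'best', string)

if match:
    # span() returns the (start, end) indexes of the match in the string.
    start, end = match.span()
    print(match.group())      # best
    print(start, end)         # 17 21
    # Slicing the original string with those indexes gives the same substring.
    print(string[start:end])  # best
```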

Example 2: Extracting Numeric Substring Using “re.search()” Method

Now, define a numeric string and search for the first occurrence of one or more digits in it by passing the “\d+” as the regex to “re.search()” method:

string = '039-6546-0987'
print(re.search(r'\d+', string))

In the specified regex:

  • “\d” is the character class that matches any digit; the backslash gives the letter “d” this special meaning.
  • “+” matches one or more of the preceding digits in a row.

The “re.search()” method returns a Match object for the first run of digits, “039”. On recent Python 3 versions, the printed result is:

<re.Match object; span=(0, 3), match='039'>
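Printing the Match object shows its details rather than the substring itself. To get the digits as a string, call “group()”. The second half of this sketch goes one step further and uses capture groups to pull out each part of the number; that grouped pattern is an illustrative addition, not part of the original example:

```python
import re

string = '039-6546-0987'

# group() returns the text of the first run of digits.
first = re.search(r'\d+', string)
first_digits = first.group() if first else None
print(first_digits)  # 039

# Parentheses create capture groups, one per block of digits.
parts = re.search(r'(\d+)-(\d+)-(\d+)', string)
if parts:
    print(parts.group(2))  # 6546
    print(parts.groups())  # ('039', '6546', '0987')
```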

Method 2: Python Extract Substring Using Regex in “re.match()” Method

“re.match()” only checks for the regex at the start of the string and returns a Match object if the match succeeds, or “None” otherwise. This method can be utilized when you know that the substring occurs at the start of the given string.

Syntax

To invoke the re.match() method, follow the given syntax:

re.match(pattern, string, flags)
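The difference from “re.search()” is easiest to see on a single string. In this short sketch, the sentence is borrowed from Method 1 and the variable names are illustrative:

```python
import re

string = 'Linuxhint is the best tutorial website'

# 'best' is not at the start of the string, so re.match() returns None.
at_start = re.match(r'best', string)
print(at_start)  # None

# re.search() scans the whole string and finds it.
anywhere = re.search(r'best', string)
print(anywhere.group())  # best

# re.match() succeeds when the pattern matches from the first character.
prefix = re.match(r'Linuxhint', string)
print(prefix.group())  # Linuxhint
```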

Example

Firstly, define the regular expression as “^l.......t$”. This regex matches strings that begin with “l”, end with “t”, and have exactly 9 characters (“l”, any seven characters, then “t”).

regex = '^l.......t$'

Then, declare the string. Pass it to the re.match() method, along with the regex as arguments:

string = 'linuxhint'
result = re.match(regex, string)

Add an “if-else” condition with a print statement for each case, depending on whether a “Match” object has been returned or not:

if result:
    print("Search has been done successfully", result)
else:
    print("Search was unsuccessful.")

Output

Search has been done successfully <re.Match object; span=(0, 9), match='linuxhint'>
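To see which strings this pattern accepts, the following sketch tests a few candidates. The list of candidate strings is hypothetical and only serves to illustrate the length and case requirements:

```python
import re

regex = r'^l.......t$'

# Hypothetical test strings: only the first one is 9 characters long,
# starts with a lowercase 'l', and ends with 't'.
candidates = ['linuxhint', 'lint', 'linuxhints', 'Linuxhint']

results = {}
for candidate in candidates:
    results[candidate] = bool(re.match(regex, candidate))
    print(candidate, len(candidate), results[candidate])
```

Only “linuxhint” matches: “lint” is too short, “linuxhints” is too long and ends with “s”, and “Linuxhint” starts with an uppercase letter.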

Method 3: Python Extract Substring Using Regex in “re.findall()” Method

The “re.findall()” method searches for every non-overlapping occurrence of a pattern within the given string and returns a list of the extracted substrings, in the order in which they were found. Use it when you need to retrieve all matching substrings at once.

Syntax

To invoke the re.findall() method, check out the given syntax:

re.findall(pattern, string, flags)

Example

Define a string comprising numeric values. Then, specify the regex pattern as “r'\d+'” to match one or more digits:

string = '4 Hour Boot camp Linuxhint course for $14.99'
regex = r'\d+'

Then, call the “re.findall()” method and pass the defined regex and the string as arguments:

matches = re.findall(regex, string)

Now, iterate over the list of strings stored in the “matches” variable and print each element to the console:

for match in matches:
    print(match)

Output

4
14
99
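Note that “\d+” splits the price “14.99” into “14” and “99”, because the dot is not a digit. As a hedged sketch, the patterns below show two possible ways to keep the decimal number together; both are illustrative additions rather than part of the original example:

```python
import re

string = '4 Hour Boot camp Linuxhint course for $14.99'

# Plain digit runs: the decimal point splits the price in two.
integers = re.findall(r'\d+', string)
print(integers)  # ['4', '14', '99']

# An optional fractional part keeps '14.99' in one piece.
decimals = re.findall(r'\d+(?:\.\d+)?', string)
print(decimals)  # ['4', '14.99']

# With one capture group, findall() returns only the captured text.
prices = re.findall(r'\$(\d+\.\d+)', string)
print(prices)  # ['14.99']
```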

Method 4: Python Extract Substring Using Regex in “re.finditer()” Method

The “re.finditer()” method finds the same matches as the re.findall() method. However, it returns an iterator that yields Match objects rather than a list of substrings. This is useful when working with large inputs, where you do not need to store all matches at once: the re.finditer() method produces the matches one at a time.

Syntax

To invoke the re.finditer() method, follow the given syntax:

re.finditer(pattern, string, flags)
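Because the result is an iterator, matches can be consumed one by one. This small sketch uses a made-up input string and the built-in “next()” function to take the first match, then collects the rest:

```python
import re

# Hypothetical input; any string with several digit runs works.
matches = re.finditer(r'\d+', '10 20 30')

# next() pulls a single Match object from the iterator.
first = next(matches)
print(first.group())  # 10

# The iterator continues from where it stopped.
rest = [m.group() for m in matches]
print(rest)  # ['20', '30']
```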

Example

First, create a string. Then, define a regex pattern as “r'[A-Z]+'” that matches one or more uppercase letters:

string = 'Linuxhint is the Best Tutorial Website'
regex = r'[A-Z]+'

Pass the regex and the string as arguments to the “re.finditer()” method and store the resulting iterator of Match objects in “matches”:

matches = re.finditer(regex, string)

Lastly, iterate over the Match objects in “matches”, extract each substring with the help of the “group()” method, and print it to the console:

for match in matches:
    sub_string = match.group()
    print(sub_string)

Output

L
B
T
W

We have compiled the essential approaches for extracting substrings in Python.
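Since each item is a full Match object, the position of every substring is available as well. The sketch below prints each capital letter with its span; the second pattern, which captures whole capitalized words, is an illustrative variation that is not in the original example:

```python
import re

string = 'Linuxhint is the Best Tutorial Website'

# Collect each uppercase letter together with its (start, end) position.
positions = [(m.group(), m.span()) for m in re.finditer(r'[A-Z]+', string)]
for letter, span in positions:
    print(letter, span)

# Variation: an uppercase letter followed by lowercase letters.
words = [m.group() for m in re.finditer(r'[A-Z][a-z]+', string)]
print(words)  # ['Linuxhint', 'Best', 'Tutorial', 'Website']
```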

Conclusion

To extract a substring using regex in Python, use the “re.search()”, “re.match()”, “re.findall()”, or “re.finditer()” method. Choose according to your requirements: “re.search()” extracts only the first occurrence of the regex, “re.match()” extracts a substring located at the start of the string, “re.findall()” retrieves all matching substrings as a list, and “re.finditer()” yields the matches one at a time. This blog covered these methods for extracting substrings in Python.
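As a quick recap, this sketch applies all four functions to the same pattern and string. The sample text is invented for illustration:

```python
import re

# Hypothetical sample text containing two numbers.
text = 'Order 66 shipped on day 7'
pattern = r'\d+'

first = re.search(pattern, text)    # first match anywhere in the string
at_start = re.match(pattern, text)  # None: the string does not start with a digit
all_found = re.findall(pattern, text)
spans = [m.span() for m in re.finditer(pattern, text)]

print(first.group())  # 66
print(at_start)       # None
print(all_found)      # ['66', '7']
print(spans)          # [(6, 8), (24, 25)]
```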

About the author

Abdul Mannan
