Python Regular Expression

In this topic, we will learn Python Regular Expressions.

Definition: A regular expression, also called a regex, regexp, or RE, is a sequence of characters that defines a pattern to match in text. Python provides the built-in re module for working with regular expressions.

The common uses of a regular expression are:

  1. Search a string (search and match)
  2. Find all matching substrings (findall)
  3. Split a string into substrings (split)
  4. Replace part of a string (sub)

A regular expression is a combination of ordinary characters and metacharacters. The following metacharacters are available:

  • \ Drops the special meaning of the character that follows it (escapes it)
  • [] Indicates a character class. Ex: [a-z], [a-zA-Z0-9]
  • ^ Matches the beginning of the text
  • $ Matches the end of the text
  • . Matches any character except a newline
  • ? Matches zero or one occurrence of the preceding RE
  • | Means OR (matches any one of the alternatives separated by it)
  • * Matches any number of occurrences of the preceding RE (including zero)
  • + Matches one or more occurrences of the preceding RE
  • {} Indicates a specific number of occurrences of the preceding RE. Ex: {3}, {2,4}
  • () Encloses a group of regexp
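The metacharacters listed above can be tried out one at a time with re.findall(). The following is a minimal sketch; every sample string and variable name in it is made up for illustration.

```python
import re

# All sample strings below are illustrative assumptions, not from the article.
char_class = re.findall(r"[a-c]", "abcdef")           # any one of a, b, c
at_start = re.findall(r"^Hello", "Hello world")       # only at the beginning
at_end = re.findall(r"world$", "Hello world")         # only at the end
any_char = re.findall(r"h.t", "hat hot h\nt")         # . does not match a newline
optional = re.findall(r"colou?r", "color colour")     # zero or one "u"
either = re.findall(r"cat|dog", "cat dog bird")       # alternation
star = re.findall(r"ab*", "a ab abbb")                # "a" then zero or more "b"
plus = re.findall(r"ab+", "a ab abbb")                # "a" then one or more "b"
exactly3 = re.findall(r"\d{3}", "12 345 6789")        # exactly three digits
grouped = re.findall(r"(\w+)@(\w+)", "user@example")  # two groups
unescaped = re.findall(r"1.5", "1.5 and 105")         # . matches any character
escaped = re.findall(r"1\.5", "1.5 and 105")          # \. matches a literal dot

for name, result in [("[]", char_class), ("^", at_start), ("$", at_end),
                     (".", any_char), ("?", optional), ("|", either),
                     ("*", star), ("+", plus), ("{}", exactly3),
                     ("()", grouped), ("\\", escaped)]:
    print(name, result)
```

Comparing unescaped and escaped shows why the backslash matters: without it, the dot also matches the "0" in "105".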

A backslash ‘\’ followed by certain characters forms a special sequence. To match a literal backslash without its special meaning, use ‘\\’.

  • \d Matches any decimal digit; for ASCII text this is the same as the class [0-9]
  • \D Matches any non-digit character
  • \s Matches any whitespace character
  • \S Matches any non-whitespace character
  • \w Matches any word character (letter, digit, or underscore); with the re.ASCII flag this is the same as the class [a-zA-Z0-9_]
  • \W Matches any non-word character
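As a quick illustration of the special sequences above, here is a small sketch. The sample text and variable names are assumptions chosen for the example. The last two lines show that, for str patterns in Python 3, \d also matches non-ASCII decimal digits unless the re.ASCII flag is given.

```python
import re

# Made-up sample text containing spaces, a comma, an underscore and a tab.
text = "Room 42, floor_3\tis open"

digits = re.findall(r"\d+", text)      # runs of digits
words = re.findall(r"\w+", text)       # runs of letters, digits, underscore
non_words = re.findall(r"\W+", text)   # everything between the words
spaces = re.findall(r"\s", text)       # single whitespace characters

print(digits)
print(words)
print(non_words)
print(spaces)

# U+0663 is ARABIC-INDIC DIGIT THREE, a decimal digit outside ASCII.
unicode_digit = re.findall(r"\d", "\u0663")
ascii_only = re.findall(r"\d", "\u0663", flags=re.ASCII)
print(unicode_digit, ascii_only)
```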

The following functions are available in the re module:

re.search() :

This function scans the string for the pattern and stops at the first match, wherever in the string it occurs. It returns a match object, which can be used both to test whether the pattern matches and to extract the matched text and its groups.

Syntax: re.search(pattern, string)
Return value:
None: the pattern does not match
Match object: the pattern matched

Ex: This example searches for a month and a day

import re

regexp = r"([a-zA-Z]+) (\d+)"
match = re.search(regexp, "My son birthday is on July 20")
if match is not None:
    # start() and end() give the index range of the matched text
    print("Match at index %s, %s" % (match.start(), match.end()))
    print("Full match: %s" % (match.group(0)))
    print("Month: %s" % (match.group(1)))
    print("Day: %s" % (match.group(2)))
else:
    print("The given regex pattern does not match")

re.match() :

This function checks for a match only at the beginning of the string. If the pattern occurs only later in the string, re.match() reports no match.

Syntax: re.match(pattern, string)
Return value:
None: the pattern does not match
Match object: the pattern matched

Ex: This example shows the pattern matching at the beginning of the string

import re

regexp = r"([a-zA-Z]+) (\d+)"
match = re.match(regexp, "July 20")
if match is None:
    print("Not a valid date")
else:
    print("Given string: %s" % (match.group()))
    print("Month: %s" % (match.group(1)))
    print("Day: %s" % (match.group(2)))

Ex: This example shows the pattern not matching at the beginning of the string

import re

regexp = r"([a-zA-Z]+) (\d+)"
# "July 20" is not at the start of the string, so re.match() returns None
match = re.match(regexp, "My son birthday is on July 20")
if match is None:
    print("Not a valid date")
else:
    print("Given string: %s" % (match.group()))
    print("Month: %s" % (match.group(1)))
    print("Day: %s" % (match.group(2)))
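To put re.search() and re.match() side by side, the sketch below runs the same pattern through both, plus re.fullmatch(), which is not covered in this article but is part of the standard re module and requires the whole string to match. The variable names and the second sample string are assumptions for illustration.

```python
import re

pattern = r"([a-zA-Z]+) (\d+)"
text = "My son birthday is on July 20"

at_start = re.match(pattern, text)    # pattern must match at index 0
anywhere = re.search(pattern, text)   # pattern may match at any index

print(at_start)                                          # None
print(anywhere.group(0), anywhere.start(), anywhere.end())

# re.fullmatch() succeeds only if the entire string matches the pattern.
longer = "July 20 is the date"        # made-up sample
prefix_ok = re.match(pattern, longer)
whole = re.fullmatch(pattern, longer)
exact = re.fullmatch(pattern, "July 20")

print(prefix_ok.group(0), whole, exact.groups())
```

A rough rule of thumb: use re.match() to validate how a string begins, re.fullmatch() to validate the whole string, and re.search() to locate a pattern somewhere inside it.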

re.findall() :

This method returns all matches of pattern in a string. The string is searched from start to end, and matches are returned in the order found.

Syntax: re.findall(pattern, string)
Return value:
Empty list ([]): the pattern does not match
List of strings: the pattern matched

Ex: Regular expression to find digits

import re

string = """Bangalore pincode is 560066 and
gulbarga pincode is 585101"""

# raw string so that the backslash reaches the regex engine unchanged
regexp = r'\d+'
match = re.findall(regexp, string)
print(match)

Ex: Find mobile numbers (exactly 10 digits) in the given text

import re

string = """Bangalore office number 1234567891,
My number is 8884278690, emergency contact 3456789123
invalid number 898883456"""

# \d{10} matches 10 digits; the \b word boundaries on both sides
# prevent matching the first 10 digits of a longer number
regexp = r'\b\d{10}\b'
match = re.findall(regexp, string)
print(match)
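One detail worth knowing about re.findall() is that its return value changes when the pattern contains groups: with two or more groups it returns a list of tuples instead of a list of strings. The sketch below shows this, along with re.finditer(), a standard re function not covered in this article that yields match objects and so also exposes positions. The pattern and variable names are assumptions for illustration.

```python
import re

text = "Bangalore pincode is 560066 and gulbarga pincode is 585101"

# No groups: a list of the matched strings.
codes = re.findall(r"\d+", text)

# Two groups: a list of (group 1, group 2) tuples.
pairs = re.findall(r"(\w+) pincode is (\d+)", text)

# finditer() yields match objects, so start/end indexes are available.
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", text)]

print(codes)
print(pairs)
print(spans)
```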

re.compile():

re.compile() compiles a regular expression into a pattern object. The pattern object has methods such as search(), findall(), and sub(), so the same expression can be reused for pattern matching and string substitution.

Ex:

import re

# character class: any one of the letters a to e
e = re.compile(r'[a-e]')
print(e.findall("I born at 11 A.M. on 20th July 1989"))
# \d matches a single digit (same as [0-9] for ASCII text)
e = re.compile(r'\d')
print(e.findall("I born at 11 A.M. on 20th July 1989"))
# \d+ matches a group of one or more digits
p = re.compile(r'\d+')
print(p.findall("I born at 11 A.M. on 20th July 1989"))
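A compiled pattern is most useful when the same expression is applied several times. The sketch below compiles one pattern and reuses it for searching and substituting. The pattern, the "<date>" placeholder, and the variable names are assumptions chosen for this example.

```python
import re

# Hypothetical pattern: day with an ordinal suffix, month name, 4-digit year.
date_re = re.compile(r"(\d+)(?:st|nd|rd|th) ([A-Za-z]+) (\d{4})")

sentence = "I born at 11 A.M. on 20th July 1989"

# The same pattern object serves both search() and sub().
found = date_re.search(sentence)
day, month, year = found.groups()
masked = date_re.sub("<date>", sentence)

print(day, month, year)
print(masked)
print(date_re.pattern)   # the original expression is kept on the object
```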

re.split():

Splits a string at each occurrence of a pattern and returns the pieces as a list. The maxsplit parameter limits the number of splits; once that many splits have been made, the remainder of the string is returned as the final element of the list. The default, maxsplit=0, means no limit.

Syntax: re.split(pattern, string, maxsplit=0)
Return values:
List containing only the original string: the pattern does not match
List of substrings: the pattern matched

Ex:

import re

# '\W+' matches one or more non-alphanumeric characters
# split upon finding ',' or whitespace ' '
print(re.split(r'\W+', 'Good, better , Best'))
print(re.split(r'\W+', "Book's books Books"))
# Here ':', ' ' and ',' are not alphanumeric, so splitting occurs there
print(re.split(r'\W+', 'Born On 20th July 1989, at 11:00 AM'))
# '\d+' matches one or more numeric characters
# Splitting occurs at '20', '1989', '11', '00'
print(re.split(r'\d+', 'Born On 20th July 1989, at 11:00 AM'))
# Specified maximum split as 1
print(re.split(r'\d+', 'Born On 20th July 1989, at 11:00 AM', maxsplit=1))
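A few edge cases of re.split() are easy to check directly: what comes back when the pattern never matches, what happens when the separator sits at the very start or end of the string, and how a capturing group in the pattern keeps the separators in the result. This is a minimal sketch with made-up sample strings and variable names.

```python
import re

# Pattern never matches: the list holds just the original string.
no_match = re.split(r"\d+", "no digits here")

# Separator at the start and end produces empty strings at both ends.
edges = re.split(r",", ",a,b,")

# A capturing group keeps the separators in the result.
with_seps = re.split(r"(\W+)", "Good, better")

print(no_match)
print(edges)
print(with_seps)
```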

re.sub():

Here ‘sub’ stands for substitute. This function searches the given string (string parameter) for the regular expression (pattern parameter); each match found is replaced by the repl parameter.
The count parameter specifies the maximum number of replacements; the default, 0, replaces all matches.
The flags parameter accepts regex flags (ex: re.IGNORECASE).

Syntax: re.sub(pattern, repl, string, count=0, flags=0)
Return value:
A new string with the matches replaced; if the pattern does not match, the original string is returned unchanged.

Ex:

import re

# Ex: the pattern 'lly' is case-sensitive, so it matches in "successfully" but not in "DELLY"
print(re.sub('lly', '#$', 'doctor appointment booked successfully in DELLY'))
# Ex: case is ignored using the flag, so 'lly' matches twice in the string
# After matching, 'lly' is replaced by '#$' in "successfully" and "DELLY"
print(re.sub('lly', '#$', 'doctor appointment booked successfully in DELLY', flags=re.IGNORECASE))
# Ex: case sensitivity, 'lLY' does not match, so nothing is replaced
print(re.sub('lLY', '#$', 'doctor appointment booked successfully in DELLY'))
# Ex: as count=1, at most one replacement occurs
print(re.sub('lly', '#$', 'doctor appointment booked successfully in DELLY', count=1, flags=re.IGNORECASE))
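The repl argument of re.sub() does not have to be a fixed string. It can refer back to groups of the match with \1, \2, and so on, or it can be a function that receives each match object and returns the replacement text. The sketch below shows both forms; the sample strings and the helper name double are assumptions for illustration.

```python
import re

# Backreferences: \1 and \2 stand for the first and second group.
swapped = re.sub(r"([a-zA-Z]+) (\d+)", r"\2 \1", "July 20")

# Function as repl: called once per match, must return a string.
def double(match):
    """Hypothetical helper: doubles the number that was matched."""
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

print(swapped)
print(doubled)
```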

re.subn():

subn() works the same as sub() in all ways; the only difference is the output. It returns a tuple that contains the new string followed by the total number of replacements made.
Syntax: re.subn(pattern, repl, string, count=0, flags=0)

Ex:

import re

print(re.subn('lly', '#$', 'doctor appointment booked successfully in DELLY'))
t = re.subn('lly', '#$', 'doctor appointment booked successfully in DELLY', flags=re.IGNORECASE)
print(t)
# the tuple always has two elements: (new string, number of replacements)
print(len(t))
# t[0] is the new string, the same output as sub()
print(t[0])
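Because subn() returns a two-element tuple, it can be unpacked directly, and the count can be used to tell whether anything changed. A small sketch, with a made-up sample string and variable names:

```python
import re

# Collapse runs of whitespace into a single space and count the replacements.
new_text, replaced = re.subn(r"\s+", " ", "too   many    spaces")

print(new_text)
print(replaced)

# A count of 0 means the pattern did not match and the text is unchanged.
same_text, none_replaced = re.subn(r"\d+", "#", "no digits here")
print(same_text, none_replaced)
```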

re.escape() :

This returns the string with a backslash ‘\’ before every character that has a special meaning in a regular expression (Python versions before 3.7 escaped every non-alphanumeric character). This is helpful if we want to match an arbitrary literal string that may have regular expression metacharacters in it.
Syntax: re.escape(string)
Ex:

import re

# in the case below, the only characters that get escaped are the spaces
print(re.escape("doctor appointment booked successfully at 1PM"))
# the case below contains ' ', '[', ']', '-', '\' and caret '^', which are all escaped
# (a raw string is used so that \t stays a backslash followed by 't')
print(re.escape(r"He asked what is this [0-9], I said \t ^Numeric class"))
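A typical use of re.escape() is to search for a literal piece of text that happens to contain metacharacters. In the sketch below, the sample text and variable names are assumptions for illustration; the literal "$5.00" contains both ‘$’ and ‘.’, so it fails as a raw pattern but works once escaped.

```python
import re

text = "The price is $5.00 (approx)"   # made-up sample
literal = "$5.00"

# Unescaped: '$' means end of text, so this pattern cannot match here.
unescaped_hits = re.findall(literal, text)

# Escaped: every metacharacter in the literal is matched as itself.
safe_pattern = re.escape(literal)
escaped_hits = re.findall(safe_pattern, text)

print(safe_pattern)
print(unescaped_hits)
print(escaped_hits)
```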

Conclusion:

This article covered the basics needed to use regular expressions in an application. We learned the main functions of the re module and the metacharacters available in Python regular expressions, with examples.

About the author

Bamdeb Ghosh

Bamdeb Ghosh has hands-on experience in the wireless networking domain. He's an expert in Wireshark capture analysis on wireless and wired networks, along with knowledge of Android, Bluetooth, Linux commands and Python. Follow his site: wifisharks.com