Regular Expressions using Python 3

Regular Expressions are often seen as this really obscure series of hieroglyphs that one typically copies from the Internet and pastes into their code. This mysterious spell then shows magical capabilities of finding patterns inside strings of text, and if we ask it nicely, it will even do us the favor of replacing a given pattern within a string with something nicer.

For example, when you are writing handlers for URLs (and God help you if you are writing one from scratch), you often want to display the same result regardless of the trailing ‘/’ in the URL. E.g., https://example.com/user/settings/ and https://example.com/user/settings should both point to the same page despite the trailing ‘/’.

However, you can’t ignore every forward slash. For instance, you must keep:

  1. The forward slash between ‘user’ and ‘settings’, i.e., ‘user/settings’.
  2. The ‘//’ that follows ‘https:’ and comes before your FQDN.

So, you come up with a rule like, “Ignore only a forward slash that has nothing after it,” and if you want, you can encode that rule with a series of if-else statements. But that will get cumbersome quite quickly. You can write a function, say cleanUrl(), which encapsulates this for you. But the universe will soon start throwing more curveballs at you. You will soon find yourself writing functions like cleanHeaders(), processLog(), etc. Or you can use a regular expression whenever any kind of pattern matching is required.
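To make the trailing-slash rule concrete, here is a minimal sketch in Python. The function name clean_url is a hypothetical stand-in for the cleanUrl() idea above, and the pattern is just one way to express “a slash with nothing after it”. Treat it as an illustration, not a complete URL handler.

```python
import re

def clean_url(url):
    """Hypothetical helper: drop any slashes at the very end of the URL."""
    # "/+" means one or more slashes, "$" anchors them to the end of the string,
    # so the slashes in "https://" and "user/settings" are left alone.
    return re.sub(r"/+$", "", url)

with_slash = clean_url("https://example.com/user/settings/")
without_slash = clean_url("https://example.com/user/settings")
print(with_slash)     # https://example.com/user/settings
print(without_slash)  # https://example.com/user/settings
```

Both inputs end up as the same string, which is exactly the behaviour the rule asks for.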

Standard IO and Files

Before we get into the details of regular expressions it is worth mentioning the model which most systems have for streams of text. Here is a short (incomplete) summary of it:

  1. Text is processed as a (single) stream of characters.
  2. This stream can originate from a file of Unicode or ASCII text, from standard input (the keyboard), or from a remote network connection. After processing, say by a regex script, the output goes to a file, a network stream, or the standard output (e.g., the console).
  3. The stream consists of one or more lines. Each line has zero or more characters followed by a newline.

For the sake of simplicity, I want you to picture a file as being composed of lines, each ending with a newline character. We break this file into individual lines (or strings), each ending either with a newline or with a normal character (for the last line).
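The picture above can be sketched in a few lines of Python. The example uses io.StringIO as a stand-in for a real file, and the sample text is made up for illustration.

```python
import io

# A fake "file" with three lines; the last one has no trailing newline.
stream = io.StringIO("first line\nsecond line\nlast line")

# Iterating over a text stream yields one line (one string) at a time.
lines = list(stream)
for line in lines:
    print(repr(line))
# 'first line\n'
# 'second line\n'
# 'last line'
```

Each of these strings is the kind of input a regex would then be asked to accept or reject.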

Regexes and Strings

A regex has nothing in particular to do with files. Imagine it as a black box that can take as input any arbitrary string of any (finite) length, and once it reaches the end of this string it can either:

  1. Accept the string. In other words, the string matches the regular expression (regex).
  2. Reject the string, i.e., the string doesn’t match the regular expression (regex).

Despite its black box-y nature, I will add a few more constraints to this machinery. A regex reads a string sequentially, from left to right, and it reads only one character at a time. So a string “LinuxHint” will be read as:

‘L’ ‘i’ ‘n’ ‘u’ ‘x’ ‘H’ ‘i’ ‘n’ ‘t’ [Left to right]

Let’s start simple

The simplest type of regex would be one that searches for and matches the string ‘C’. The regular expression for it is just ‘C’. Quite trivial. To do this in Python, you first need to import the re module for regular expressions.

>>> import re

We then use the function re.search(pattern, string), where pattern is our regular expression and string is the input string within which we search for the pattern.

>>> re.search('C', 'This sentence has a deliberate C in it')
<re.Match object; span=(31, 32), match='C'>

The function takes the pattern ‘C’, looks for it in the input string, and returns a match object that records the location (span) where the pattern was found. This part of the string, this substring, is what matches our regular expression. If no such match is found, the result is None.
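Outside the interactive prompt, you would normally store the result and inspect it. The following is a small sketch of that; the variable names are my own, and the second search uses a letter (‘Z’) that simply does not occur in the sentence.

```python
import re

sentence = "This sentence has a deliberate C in it"

match = re.search("C", sentence)
if match is not None:
    print(match.span())   # (31, 32)
    print(match.group())  # C

# Searching for something that is not there gives None, not an error.
no_match = re.search("Z", sentence)
print(no_match)  # None
```

Checking for None before using the result avoids an AttributeError when the pattern is absent.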

Similarly, you can search for the pattern ‘regular expression’ as follows:

>>> re.search("regular expression", "We can use regular expressions for searching patterns.")
<re.Match object; span=(11, 29), match='regular expression'>

re.search(), re.match() and re.fullmatch()

Three useful functions from the re module include:

1.  re.search(pattern, string)

This returns a match object describing the substring which matches the pattern, as we have seen above. If no match is found then None is returned. If multiple substrings conform to a given pattern, only the first occurrence is reported.

2.  re.match(pattern, string)

This function tries to match the supplied pattern starting from the beginning of the string. If the pattern does not match right from the first character, it returns None, even if the pattern occurs later in the string.

For example,

>>> re.match("Joh", "John Doe")
<re.Match object; span=(0, 3), match='Joh'>

Whereas the string “My name is John Doe” is not a match, and hence None is returned.

>>> print(re.match("Joh", "My name is John Doe"))
None

3.  re.fullmatch(pattern, string)

This is stricter than both of the above: the entire string must match the pattern, otherwise None is returned.

>>> print(re.fullmatch("Joh", "Joh"))
<re.Match object; span=(0, 3), match='Joh'>
# Anything else will not be a match
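To see the difference between the three functions side by side, here is a short sketch that runs the same pattern through each of them. The variable names are only for illustration.

```python
import re

text = "My name is John Doe"

found_anywhere = re.search("Joh", text)   # looks anywhere in the string
found_at_start = re.match("Joh", text)    # must match at position 0
found_whole = re.fullmatch("Joh", text)   # must match the whole string

print(found_anywhere.span())  # (11, 14)
print(found_at_start)         # None
print(found_whole)            # None

# fullmatch rejects even one extra character.
print(re.fullmatch("Joh", "John"))  # None

# search reports only the first occurrence when there are several.
first_o = re.search("o", "John Doe")
print(first_o.span())  # (1, 2)
```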

I will be using just the re.search() function in the rest of this article. Whenever I say that the regex accepts a string, it means that the re.search() function has found a matching substring in the input string and returned a match object instead of None.

Special characters

Regular expressions like ‘John’ and ‘C’ are not of much use. We need special characters which have a specific meaning in the context of regular expressions. Here are a few examples:

    1. ^ matches the beginning of a string. For example, ‘^C’ will match all the strings that begin with the letter C.
    2. $ matches the end of the string (or, in multiline mode, the end of each line).
    3. . (the dot) matches any single character except the newline.
    4. * matches zero or more repetitions of whatever precedes it. So b* matches 0 or more occurrences of b, and ab* matches a, ab, abb, and so on.
    5. + matches one or more repetitions of whatever precedes it. So b+ matches 1 or more occurrences of b, and ab+ matches ab, abb, and so on, but not a on its own.
    6. \ (the backslash) is used as the escape character in regexes. So if you want a regular expression to search for the literal dollar symbol ‘$’ instead of the end of the line, you can write \$ in the regular expression.
    7. Curly braces can be used to specify the number of repetitions that you want to see. For example, the pattern ab{10} matches the letter a followed by exactly 10 b’s. You can specify a range of numbers as well: b{4,6} matches b repeated 4 to 6 times consecutively. The pattern for 4 or more repetitions requires just a trailing comma, like so: b{4,}
    8. Square brackets define a set or range of characters. A regex like [0-9] acts as a placeholder for any digit between 0 and 9. Similarly, [1-5] matches any digit between one and five, [A-Z] matches any uppercase letter, and [A-Za-z] matches any letter of the alphabet regardless of it being upper or lowercase. (Avoid the shorthand [A-z]: it also matches the punctuation characters that sit between Z and a in ASCII.)
      For example, any run of exactly ten digits matches the regular expression [0-9]{10}, which is quite useful when you are looking for phone numbers in a given string.
    9. You can create an OR-like statement using the | character, where a regular expression is made up of two or more regular expressions, say, A and B. The regex A|B is a match if the input string is a match for either regular expression A or regular expression B.
    10. You can group different regexes together using parentheses. For example, the regex (A|B)C will match the strings AC and BC.
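The special characters listed above are easier to remember once you have tried them. Below is a small sketch that exercises each one. The helper accepts() is hypothetical: it just wraps re.search() so that “the regex accepts the string” becomes True or False, in the sense used in this article. The sample strings are made up.

```python
import re

def accepts(pattern, string):
    """Hypothetical helper: True if re.search finds the pattern in the string."""
    return re.search(pattern, string) is not None

# ^ and $ anchor the pattern to the start and end of the string.
print(accepts(r"^C", "Cat"))          # True
print(accepts(r"^C", "A Cat"))        # False
print(accepts(r"t$", "Cat"))          # True

# The dot stands for exactly one character, and never for a newline.
print(accepts(r"^C.t$", "Cat"))       # True
print(accepts(r"^C.t$", "Ct"))        # False
print(accepts(r"^C.t$", "C\nt"))      # False

# * allows zero repetitions, + needs at least one.
print(accepts(r"^ab*$", "a"))         # True
print(accepts(r"^ab+$", "a"))         # False
print(accepts(r"^ab+$", "abbb"))      # True

# A backslash makes $ a literal dollar sign.
print(accepts(r"\$", "Price: $5"))    # True
print(accepts(r"\$", "Price: 5"))     # False

# Curly braces count repetitions.
print(accepts(r"^b{4,6}$", "bbbbb"))  # True
print(accepts(r"^b{4,6}$", "bbb"))    # False
print(accepts(r"^b{4,}$", "bbbbbbbb"))  # True

# Character ranges; note that [A-z] lets the underscore through.
print(accepts(r"[0-9]{10}", "Call 5551234567 now"))  # True
print(accepts(r"^[A-Za-z]+$", "under_score"))        # False
print(accepts(r"^[A-z]+$", "under_score"))           # True

# Alternation inside a group.
print(accepts(r"^(A|B)C$", "AC"))     # True
print(accepts(r"^(A|B)C$", "BC"))     # True
print(accepts(r"^(A|B)C$", "CC"))     # False
```

The patterns are written as raw strings (the r prefix) so that Python does not interpret the backslash itself before the regex engine sees it.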

There’s a lot more to cover, but I would recommend learning as you go instead of overloading your brain with a lot of obscure symbols and edge cases. When in doubt, the Python Docs are a great help, and now you know enough to follow the docs easily.

Hands on Experience and References

If you want to see a visual interpretation of your regex, you can visit Debuggex. This site generates a view of your regex in real time and lets you test it against various input strings.

To know more about the theoretical aspect of Regular Expressions, you may want to look at the first couple of chapters of Introduction to the Theory of Computation by Michael Sipser. It’s very easy to follow and shows the importance of regular expressions as a core concept of computation itself!

About the author

Ranvir Singh

I am a tech and science writer with quite a diverse range of interests. A strong believer in the Unix philosophy. A few of the things I am passionate about include system administration, computer hardware and physics.