
How to Use the Difflib Module in Python

This article is a guide to using the “difflib” module in Python. The difflib module can be used to compare two sequences, such as strings or lists of lines, and to view the similarities and differences between them. All code samples in this article were tested with Python 3.9.5 on Ubuntu 21.04.

About Difflib Module

The difflib module, as the name suggests, can be used to find the differences, or “diff”, between the contents of files or other sequences whose elements are hashable Python objects. It can also be used to compute a ratio that measures how similar two sequences are. The module and its functions are best understood through examples, some of which are shown below.

About Hashable Python Objects

In Python, an object is hashable if it has a hash value that never changes during its lifetime and it can be compared for equality with other objects. Most built-in immutable types, such as integers, strings, and tuples containing only hashable items, are hashable, while mutable containers such as lists and dictionaries are not. A hashable object has a working “__hash__” method that returns its hash value. Have a look at the code sample below:

number = 6
print(type(number))
print(number.__hash__())

word = "something"
print(type(word))
print(word.__hash__())

dictionary = {"a" : 1, "b": 2}
print(type(dictionary))
print(dictionary.__hash__())

After running the above code sample, you should get output similar to the following (the hash of the string differs between runs because Python randomizes string hashes by default):

<class 'int'>
6
<class 'str'>
2168059999105608551
<class 'dict'>
Traceback (most recent call last):
  File "/main.py", line 13, in <module>
    print(dictionary.__hash__())
TypeError: 'NoneType' object is not callable

The code sample includes three Python types: an integer, a string, and a dictionary. The output shows that calling the “__hash__” method returns a value for the integer and the string, while the same call on the dictionary raises a TypeError. The dict type sets “__hash__” to None, so there is nothing to call, which is why the message mentions 'NoneType'. Hence integers and strings are hashable in Python while dictionaries are not. The glossary in the official Python documentation has more detail on hashable objects.
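In everyday code, the built-in “hash” function is normally used instead of calling “__hash__” directly. The following is a minimal sketch of a hashability check; the helper name “is_hashable” and the sample values are illustrative assumptions, not part of the difflib module or of the original samples.

```python
from collections.abc import Hashable


def is_hashable(obj):
    """Return True if hash(obj) succeeds (illustrative helper, not a built-in)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


samples = [6, "something", (1, 2), [1, 2], {"a": 1, "b": 2}]
results = {repr(sample): is_hashable(sample) for sample in samples}
for text, ok in results.items():
    print(text, "->", ok)

# isinstance() against Hashable only checks that __hash__ is defined and not None,
# so a tuple that contains a list passes this check even though hash() fails on it
tricky = (1, [2])
print(isinstance(tricky, Hashable), is_hashable(tricky))
```

Calling “hash” inside a try block is the more reliable of the two checks, because hashability of a container such as a tuple depends on its contents.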

Comparing Two Hashable Python Objects

You can compare two sequences of hashable items, such as two strings or two lists of lines, using the “Differ” class available in the difflib module. Have a look at the code sample below.

from difflib import Differ

line1 = "abcd"
line2 = "cdef"
d = Differ()
# compare() returns a generator, so wrap it in list() to see all entries
difference = list(d.compare(line1, line2))
print(difference)

The first statement imports the Differ class from the difflib module. Next, two string variables are defined. A new instance of the Differ class is then created as “d”, and its “compare” method is called with “line1” and “line2” as arguments. Because strings are sequences of characters, the comparison is made character by character. After running the above code sample, you should get the following output:

['- a', '- b', '  c', '  d', '+ e', '+ f']

Characters prefixed with a minus sign are present in “line1” but not in “line2”. Characters with no sign, only leading whitespace, are common to both variables. Characters prefixed with a plus sign are present in “line2” only. For better readability, you can use the newline character and the “join” method to view the output line by line:

from difflib import Differ

line1 = "abcd"
line2 = "cdef"
d = Differ()
difference = list(d.compare(line1, line2))
difference = '\n'.join(difference)
print(difference)

After running the above code sample, you should get the following output:

- a
- b
  c
  d
+ e
+ f
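The two-character prefixes can also be read programmatically. The sketch below, which is only one possible way to do it, sorts the entries returned by “compare” into three lists; the list names “removed”, “added”, and “common” are illustrative choices.

```python
from difflib import Differ

line1 = "abcd"
line2 = "cdef"

# Each entry produced by compare() starts with a two-character code:
# "- " (only in the first sequence), "+ " (only in the second), "  " (in both)
removed, added, common = [], [], []
for entry in Differ().compare(line1, line2):
    code, item = entry[:2], entry[2:]
    if code == "- ":
        removed.append(item)
    elif code == "+ ":
        added.append(item)
    elif code == "  ":
        common.append(item)
    # entries starting with "? " are hint lines and are skipped here

print("removed:", removed)
print("added:", added)
print("common:", common)
```

For the two sample strings, this places “a” and “b” in the removed list, “e” and “f” in the added list, and “c” and “d” in the common list.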

Instead of the Differ class, you can also use the “HtmlDiff” class to produce colored output in HTML format.

from difflib import HtmlDiff

line1 = "abcd"
line2 = "cdef"
d = HtmlDiff()
difference = d.make_file(line1, line2)
print(difference)

The code sample is the same as above, except that the Differ instance has been replaced by an instance of the HtmlDiff class, and the “make_file” method is called instead of the compare method. After running the above code sample, you will get a complete HTML document as output in the terminal. You can redirect the output to a file using the “>” symbol in bash, or you can use the code sample below to write the output to a “diff.html” file from Python itself.

from difflib import HtmlDiff

line1 = "abcd"
line2 = "cdef"
d = HtmlDiff()
difference = d.make_file(line1, line2)
with open("diff.html", "w") as f:
    for line in difference.splitlines():
        print(line, file=f)

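The “with open” statement in “w” mode creates a new “diff.html” file and saves the entire contents of the “difference” variable to it. When you open the diff.html file in a browser, you should see a side-by-side table with the first sequence on the left and the second on the right, where added, changed, and deleted entries are highlighted in different colors.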
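The “make_file” method also accepts optional “fromdesc” and “todesc” arguments that label the two columns of the generated table. The sketch below assumes some made-up sample lines, column labels, and an output file name (“diff_labeled.html”) purely for illustration.

```python
from difflib import HtmlDiff

# Hypothetical sample data: two versions of a short text, one string per line
old_lines = ["apple", "banana", "cherry"]
new_lines = ["apple", "blueberry", "cherry", "date"]

html = HtmlDiff().make_file(
    old_lines,
    new_lines,
    fromdesc="version_1",  # column heading for the first sequence
    todesc="version_2",    # column heading for the second sequence
)

# make_file() returns a single string, so it can be written in one call
with open("diff_labeled.html", "w", encoding="utf-8") as f:
    f.write(html)
```

Passing lists of lines rather than plain strings makes each table row correspond to one line of text instead of one character.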

Getting Differences Between Contents of Two Files

If you want to produce diff data from the contents of two files using the Differ.compare() method, you can use the “with open” statement and the “readlines” method to read the contents of the files as lists of lines. The example below illustrates this by reading “file1.txt” and “file2.txt”. The “with open” statements ensure that each file is closed properly after it has been read.

from difflib import Differ

with open("file1.txt") as f:
    file1_lines = f.readlines()
with open("file2.txt") as f:
    file2_lines = f.readlines()
d = Differ()
difference = list(d.compare(file1_lines, file2_lines))
difference = '\n'.join(difference)
print(difference)

The code is pretty straightforward and nearly the same as the example shown above. Assuming that “file1.txt” contains “a”, “b”, “c”, and “d” characters each on a new line and “file2.txt” contains “c”, “d”, “e”, and “f” characters each on a new line, the code sample above will produce the following output:

- a

- b

  c

- d
+ d

+ e

+ f

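The output is almost the same as before. The “-” sign marks lines present only in the first file, and the “+” sign marks lines present only in the second file. Lines without any sign are common to both files. The “d” line appears with both signs because the two copies are not identical: this output is what you get when “d” is the last line of “file1.txt” with no trailing newline character, while in “file2.txt” it is followed by one. The blank lines in the output appear because each line returned by “readlines” already ends with a newline character and the “join” call adds another.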
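If the trailing newline characters are not of interest, one option is to strip them before comparing. The sketch below uses “read().splitlines()” for that purpose and, so that it can run on its own, first writes the two sample files into a temporary directory; the temporary directory and the file contents are assumptions made for this illustration.

```python
import os
import tempfile
from difflib import Differ

# Create two small sample files in a temporary directory so the sketch
# runs on its own; the file names mirror the ones used in the article
tmp_dir = tempfile.mkdtemp()
path1 = os.path.join(tmp_dir, "file1.txt")
path2 = os.path.join(tmp_dir, "file2.txt")
with open(path1, "w") as f:
    f.write("a\nb\nc\nd")
with open(path2, "w") as f:
    f.write("c\nd\ne\nf")

# read().splitlines() drops the newline at the end of every line
with open(path1) as f:
    file1_lines = f.read().splitlines()
with open(path2) as f:
    file2_lines = f.read().splitlines()

difference = list(Differ().compare(file1_lines, file2_lines))
print("\n".join(difference))
```

With the newlines removed, the “d” line is reported as common to both files and no blank lines appear between entries.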

Finding Similarity Ratio

You can use the “SequenceMatcher” class from the difflib module to find the similarity ratio between two sequences. The similarity ratio lies between 0 and 1, where a value of 1 indicates an exact match and a value of 0 indicates that the sequences have nothing in common. Have a look at the code sample below:

from difflib import SequenceMatcher

line1 = "abcd"
line2 = "cdef"
sm = SequenceMatcher(a=line1, b=line2)
print(sm.ratio())

A SequenceMatcher instance has been created with objects to be compared supplied as “a” and “b” arguments. The “ratio” method is then called upon the instance to get the similarity ratio. After running the above code sample, you should get the following output:

0.5
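The value 0.5 can be checked by hand. The difflib documentation defines the ratio as 2.0 * M / T, where M is the number of matching elements and T is the total number of elements in both sequences. The sketch below recomputes it from the matching blocks; the variable names “matches” and “manual_ratio” are illustrative.

```python
from difflib import SequenceMatcher

line1 = "abcd"
line2 = "cdef"
sm = SequenceMatcher(a=line1, b=line2)

# get_matching_blocks() lists the matching runs; the final entry is a
# zero-length sentinel, so it adds nothing to the total
blocks = sm.get_matching_blocks()
matches = sum(block.size for block in blocks)

# ratio = 2.0 * M / T, with M matching elements and T total elements
manual_ratio = 2.0 * matches / (len(line1) + len(line2))

print(blocks)
print(manual_ratio, sm.ratio())
```

Here the only matching run is “cd”, so M is 2 and T is 8, which gives 2.0 * 2 / 8 = 0.5.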

Conclusion

The difflib module in Python can be used in a variety of ways to compare sequences of hashable objects, such as strings or lines read from files. The “ratio” method of its SequenceMatcher class is also useful if you just want a single number that measures the similarity between two sequences.

About the author

Nitesh Kumar

I am a freelance software developer and content writer who loves Linux, open source software and the free software community.