What Is MD5 Hash in Python?
MD5 is one of the hash algorithms provided by Python’s hashlib module. It belongs to the family of cryptographic hash functions. Hashes in general can be used to create cache keys for massive data sets, verify passwords, fingerprint data, check the integrity of files, etc. The MD5 function takes a bytes object as input and produces a 128-bit hash value, which can be returned either as raw bytes or as a 32-character hexadecimal string. Because hashing algorithms work with binary data rather than textual data, you should be careful when choosing the character encoding that converts the text to bytes before hashing: the same text encoded in a different way produces a different hash.
The functions that are most often used together with an MD5 hash are:
encode(): A string method that creates bytes from the string so that the hash function can use them.
digest(): It returns the hash value in the form of bytes.
hexdigest(): It returns the hash value in hexadecimal format, as a string.
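As a quick illustration of how these three calls fit together, here is a minimal sketch. The sample text and the variable names are chosen for illustration only.

```python
import hashlib

text = "hello world"              # sample input, chosen for illustration

data = text.encode("utf-8")       # str -> bytes
md5_hash = hashlib.md5(data)      # create the MD5 hash object

raw = md5_hash.digest()           # hash value as a bytes object
hex_value = md5_hash.hexdigest()  # same hash value as a hexadecimal string

print(len(raw))        # 16 bytes, that is, 128 bits
print(len(hex_value))  # 32 hexadecimal characters
print(hex_value)       # 5eb63bbbe01eeed093cb22bb8f5acdc3

# A different encoding produces different bytes, and therefore a different hash.
print(hashlib.md5(text.encode("utf-16")).hexdigest() == hex_value)  # False
```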
Advantages of MD5
- Fixed-size MD5 hashes can be compared and stored more easily than the larger texts of varying lengths from which they are computed.
- Passwords can be stored as 128-bit MD5 hashes instead of plain text, although MD5 is no longer considered secure for this purpose.
- A hash value can be sent along with the data before transmission. Once the data is received by the server, you can check for file corruption: if the hash computed from the received data matches the hash that was sent, the file was not corrupted in transit.
- Using MD5, a message digest can be generated easily from an original message.
How to Use MD5 Hash in Python?
The following sections demonstrate how to get the MD5 hash of string objects and files.
Calculating the MD5 Hash Value of a String
Calculating the hash value of a string object in Python generally involves four steps:
- Creating or loading a string value.
- Converting the string into bytes.
- Hashing the bytes to get the MD5 hash value.
- Displaying or returning the hash in the form of bytes (using digest()) or in the form of hexadecimal (using hexdigest()).
When the string is already defined as a bytes literal (with the “b” prefix), it can be passed to the “hashlib.md5” function directly, and the resulting hash can be displayed using the digest() function. A bytes object may not always be available as input. In such scenarios, you must first convert the data into bytes before passing it to the MD5 hash algorithm.
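The four steps, including the conversion that a plain text string needs, can be sketched as follows. The helper name md5_of_text and the sample value are hypothetical and used only for illustration.

```python
import hashlib

def md5_of_text(text, encoding="utf-8"):
    """Return the MD5 hash of a str as a hexadecimal string."""
    data = text.encode(encoding)   # step 2: convert the string into bytes
    md5_hash = hashlib.md5(data)   # step 3: hash the bytes
    return md5_hash.hexdigest()    # step 4: return the hash as hexadecimal

message = "abc"                    # step 1: create or load a string value
print(md5_of_text(message))        # 900150983cd24fb0d6963f7d28e17f72

# Passing a str directly is rejected, because hashlib works on bytes.
try:
    hashlib.md5(message)
except TypeError as error:
    print("TypeError:", error)
```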
We now look at a few examples to calculate the MD5 hash of a string object.
Example 1: Printing the String Data in Bytes Equivalent to MD5 Hash
To use the MD5 function, we have to import the hashlib module first. We pass the string data to the hashlib.md5() function as bytes. Then, we print the MD5 hash value in the form of bytes.
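A minimal sketch of this example could look like the following. The sample text is an assumption, since any bytes value works the same way.

```python
import hashlib

# The b prefix makes this a bytes literal, so no conversion is needed.
md5_hash = hashlib.md5(b"hello world")

# digest() returns the 16-byte MD5 hash value as a bytes object.
print(md5_hash.digest())
```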
The hash function accepts bytes as input. Thus, we pass the string as bytes to the hashlib.md5() function. The MD5 hash method then hashes the supplied data. Finally, we use the digest() function to get the MD5 hash value in the form of bytes.
Example 2: Printing the String Data in Hexadecimal Equivalent to MD5 Hash
Now, we print the MD5 hash of the data in hexadecimal. In the previous example, we used “b” just before the string value to define it as bytes. Here, we apply the encode() function on the string to convert it into bytes. Both approaches yield identical results when the same encoding is used. However, with the encode() function we can specify the encoding format of our choice.
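The sketch below shows one way to write this, using an illustrative sample text. It also checks that the “b” prefix and encode() give the same hash for plain ASCII text with the default UTF-8 encoding.

```python
import hashlib

text = "hello world"

# encode() without arguments uses UTF-8.
from_encode = hashlib.md5(text.encode()).hexdigest()
from_literal = hashlib.md5(b"hello world").hexdigest()

print(from_encode)                  # 5eb63bbbe01eeed093cb22bb8f5acdc3
print(from_encode == from_literal)  # True
```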
Here, we use the encode() function to transform the specified string data into bytes so that it can be passed to the hash function. Then, it is hashed with the MD5 function. Finally, its hexadecimal value is returned using the hexdigest() method.
Calculating the MD5 Hash Value of a File
Python’s built-in hashlib module can also be used to create the MD5 hash of a file.
Example 1: The MD5 Hash Value of a Small File in Python
You should be aware that simply passing a file name to the hashlib.md5() function does not return the hash value of the file.
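The pitfall can be reproduced with a short sketch. Note that in Python 3 the name has to be encoded first, because hashlib.md5() rejects a plain str.

```python
import hashlib

file_name = "python.txt"

# This hashes the characters of the name; the file itself is never opened.
name_hash = hashlib.md5(file_name.encode()).hexdigest()

print(name_hash)
print(name_hash == hashlib.md5(b"python.txt").hexdigest())  # True
```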
When the file name “python.txt” is passed this way, the returned value is not the MD5 hash of our file. It is the MD5 hash value of the “python.txt” string itself.
To get the correct MD5 hash value of the file, you must first read the file’s content as bytes. This is easy: all we have to do is open the file in binary mode and read its content. The bytes are then passed to hashlib.md5() to get the MD5 hash value.
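A minimal sketch of this approach is shown below. The helper name md5_of_small_file is hypothetical, and the sample file is created in a temporary folder only so that the code runs on its own.

```python
import hashlib
import os
import tempfile

def md5_of_small_file(path):
    """Read the whole file into memory and return its MD5 hash in hexadecimal."""
    with open(path, "rb") as file:   # "rb" returns bytes, not str
        content = file.read()
    return hashlib.md5(content).hexdigest()

# Create a small sample file so that the sketch is self-contained.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "python.txt")
    with open(sample_path, "wb") as file:
        file.write(b"hello world")
    file_md5 = md5_of_small_file(sample_path)

print(file_md5)  # 5eb63bbbe01eeed093cb22bb8f5acdc3
```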
Because the hash is now computed from the file’s content rather than from its name, the function calculates the MD5 hash value of the file successfully.
Example 2: The MD5 Hash Value of a Large File in Python
If the file is large, let’s say a 10 GB log file, traffic dump, video game, etc., reading it all at once to create its MD5 hash would probably use all of your memory. In such a case, we can read the file in chunks of bytes, which is a memory-efficient way of computing MD5 hashes. The size of the chunks depends on your requirements, the size of your file, the memory of your system, etc. In this procedure, we process the chunks sequentially and update the hash with each one. As a result, if there are 100 such file chunks, the MD5 hash is updated 100 times during this process.
We read the data in chunks with the help of a while loop, and the MD5 hash value is updated with each chunk using the update() function.
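The chunked approach can be sketched as follows. The helper name md5_of_large_file and the chunk size of 8192 bytes are assumptions, and a sample file of exactly 100 chunks stands in for a genuinely large file.

```python
import hashlib
import os
import tempfile

def md5_of_large_file(path, chunk_size=8192):
    """Return the MD5 hash of a file, reading it one chunk at a time."""
    md5_hash = hashlib.md5()
    with open(path, "rb") as file:
        while True:
            chunk = file.read(chunk_size)
            if not chunk:              # an empty bytes object means end of file
                break
            md5_hash.update(chunk)     # feed the next chunk into the hash
    return md5_hash.hexdigest()

with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "large_file.bin")
    data = b"x" * (100 * 8192)         # exactly 100 chunks of 8192 bytes
    with open(sample_path, "wb") as file:
        file.write(data)
    chunked_md5 = md5_of_large_file(sample_path)

# Updating chunk by chunk gives the same result as hashing all bytes at once.
print(chunked_md5 == hashlib.md5(data).hexdigest())  # True
```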
Compare and Validate a File’s MD5 Hash
The MD5 hash of the data or file should be validated, either at the server or with logic in your own code. To verify the hash, we compute the MD5 hash of the file again ourselves. After that, we compare our value with the MD5 value that is provided by the source.
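A minimal sketch of such a check is shown below. The helper name md5_of_file, the file name, and the sample content are hypothetical. Here, source_md5 holds the known MD5 hash of the sample content, standing in for the value that a real source would publish.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=8192):
    """Return the MD5 hash of a file in hexadecimal, read in chunks."""
    md5_hash = hashlib.md5()
    with open(path, "rb") as file:
        while True:
            chunk = file.read(chunk_size)
            if not chunk:
                break
            md5_hash.update(chunk)
    return md5_hash.hexdigest()

# Hash provided by the source (the MD5 of b"hello world").
source_md5 = "5eb63bbbe01eeed093cb22bb8f5acdc3"

# A temporary file stands in for the file that was received.
with tempfile.TemporaryDirectory() as folder:
    received_path = os.path.join(folder, "received.txt")
    with open(received_path, "wb") as file:
        file.write(b"hello world")
    our_md5 = md5_of_file(received_path)

# hexdigest() returns lowercase, so normalize the published value first.
if our_md5 == source_md5.lower():
    print("MD5 verified.")
else:
    print("MD5 verification failed.")
```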
If both values, the value of source_md5 and the MD5 hash value of our file, match, the MD5 hash value is verified.
Conclusion
In this tutorial, we first learned about hash functions. We explained what the hashlib.md5() function is and which functions are used together with it. We discussed some advantages and applications of MD5 hash functions. We learned how to use the hashlib.md5() function to calculate a string’s MD5 hash value. We also implemented a couple of examples to teach you how to calculate and verify the MD5 hash value of small and large files in Python.