PySpark like() and ilike() Functions

PySpark provides the like() and ilike() methods for filtering the rows of a DataFrame by a string match on a column. Each method takes a string or a SQL LIKE pattern and keeps the rows whose column value matches it.

They can be used with either filter() or where(), which behave the same way. We will look at each method with different examples.

like() Function

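The like() function in PySpark checks whether the values in a column of a PySpark DataFrame match a given string or SQL LIKE pattern. The pattern may contain the wildcards % (any sequence of characters, including none) and _ (exactly one character). A pattern without wildcards must match the entire column value. The matching rows are returned. If nothing matches, an empty DataFrame is returned. The comparison is case-sensitive.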

Syntax

dataframe_obj.filter(dataframe_obj.column.like(pattern))
dataframe_obj.where(dataframe_obj.column.like(pattern))

Where,
dataframe_obj is the PySpark DataFrame and column is the name of the column to match against.

Parameter:
The like() function takes one parameter.

It is the string or SQL LIKE pattern that each value in the DataFrame column is compared against.

Return:
like() returns a boolean column expression. When it is passed to filter() or where(), the entire row is returned for every column value that matches.
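To make the matching rules concrete, here is a minimal plain-Python sketch of how a SQL LIKE pattern can be read: % stands for any run of characters, _ stands for exactly one character, and everything else is literal. The helper name sql_like is hypothetical and this is not how Spark implements like(); it also ignores escape characters. It is only meant to show why a pattern without wildcards has to match the whole value.

```python
import re

def sql_like(value: str, pattern: str) -> bool:
    """Hypothetical helper: True if `value` matches the LIKE `pattern` as a whole."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # % matches any run of characters, including none
        elif ch == "_":
            parts.append(".")            # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # every other character is literal
    # fullmatch: the pattern must cover the entire value, not just a part of it
    return re.fullmatch("".join(parts), value, re.DOTALL) is not None

names = ["sravan", "mounika", "deepika", "chandrika"]

print([n for n in names if sql_like(n, "sravan")])  # no wildcards: whole-value match
print([n for n in names if sql_like(n, "srav")])    # a partial string does not match
print([n for n in names if sql_like(n, "%ika")])    # values ending in "ika"
print(sql_like("PHP", "pHP"))                       # case matters, so this is False
```

Under these rules, 'sravan' matches only the value sravan, while '%ika' matches mounika, deepika, and chandrika.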

First, we will create the PySpark DataFrame with 10 rows and 5 columns.

import pyspark
from pyspark.sql import SparkSession

# Create (or reuse) the Spark session used by all examples.
spark_app = SparkSession.builder.appName('_').getOrCreate()

# Each tuple holds: subject_id, name, age, technology1, technology2.
students = [(4,'sravan',23,'PHP','Testing'),
            (2,'sravan',23,'Oracle','Testing'),
            (46,'mounika',22,'.NET','HTML'),
            (12,'deepika',21,'Oracle','HTML'),
            (46,'mounika',22,'Oracle','Testing'),
            (12,'chandrika',23,'Hadoop','C#'),
            (12,'chandrika',22,'Oracle','Testing'),
            (45,'sravan',23,'Oracle','C#'),
            (4,'deepika',21,'PHP','C#'),
            (46,'mounika',22,'.NET','Testing')]

# Build the DataFrame and name its five columns.
dataframe_obj = spark_app.createDataFrame(students, ['subject_id','name','age','technology1','technology2'])

dataframe_obj.show()

Output:

Now, let’s apply the like() function on the PySpark DataFrame to return the results.

Example 1
We will pass the string ‘sravan’ to the like() method on the name column and return all rows whose name matches it.

# Return the rows whose name is exactly 'sravan', using where().
print("--------Using where() clause--------")
dataframe_obj.where(dataframe_obj.name.like('sravan')).show()

# Apply the same condition with filter(); the result is identical.
print("--------Using filter() clause--------")
dataframe_obj.filter(dataframe_obj.name.like('sravan')).show()

Output:

You can see that sravan appears in three rows of the name column, so those three rows are returned.

Example 2
We will pass the string ‘pHP’ to the like() method on the technology1 column and return all rows whose value matches it.

# Look for 'pHP' in the technology1 column, using where().
print("--------Using where() clause--------")
dataframe_obj.where(dataframe_obj.technology1.like('pHP')).show()

# Apply the same condition with filter(); the result is identical.
print("--------Using filter() clause--------")
dataframe_obj.filter(dataframe_obj.technology1.like('pHP')).show()

Output:

You can see that pHP is not found in the technology1 column: the stored value is PHP, and like() is case-sensitive. Hence, 0 rows are returned.

ilike() Function

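The ilike() function in PySpark checks whether the values in a column of a PySpark DataFrame match a given string or SQL LIKE pattern, using the same % and _ wildcards as like(). The matching rows are returned. If nothing matches, an empty DataFrame is returned. The comparison is case-insensitive. ilike() is available in PySpark 3.3.0 and later.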

Syntax

dataframe_obj.filter(dataframe_obj.column.ilike(pattern))
dataframe_obj.where(dataframe_obj.column.ilike(pattern))

Where,
dataframe_obj is the PySpark DataFrame and column is the name of the column to match against.

Parameter:
The ilike() function takes one parameter.

It is the string or SQL LIKE pattern that each value in the DataFrame column is compared against, ignoring case.

Return:
ilike() returns a boolean column expression. When it is passed to filter() or where(), the entire row is returned for every column value that matches.

Example 1
We will pass the string ‘sravan’ to the ilike() method on the name column and return all rows whose name matches it.

# Return the rows whose name is 'sravan' in any letter case, using where().
print("--------Using where() clause--------")
dataframe_obj.where(dataframe_obj.name.ilike('sravan')).show()

# Apply the same condition with filter(); the result is identical.
print("--------Using filter() clause--------")
dataframe_obj.filter(dataframe_obj.name.ilike('sravan')).show()

Output:

Explanation
You can see that sravan appears in three rows of the name column, so those three rows are returned.

Example 2
We will pass the string ‘pHP’ to the ilike() method on the technology1 column and return all rows whose value matches it.

# Look for 'pHP' in the technology1 column in any letter case, using where().
print("--------Using where() clause--------")
dataframe_obj.where(dataframe_obj.technology1.ilike('pHP')).show()

# Apply the same condition with filter(); the result is identical.
print("--------Using filter() clause--------")
dataframe_obj.filter(dataframe_obj.technology1.ilike('pHP')).show()

Output:

Explanation
The exact string pHP does not occur in the technology1 column, but ilike() is case-insensitive, so it treats PHP and pHP as the same. The two rows whose technology1 value is PHP are therefore returned.
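One way to picture the difference between the two functions is to treat a case-insensitive match as a case-sensitive match on lower-cased text. The sketch below applies that idea in plain Python to the same student records used in this tutorial. The helpers like_match and ilike_match are hypothetical stand-ins, not PySpark methods, and the lower-casing equivalence is an assumption made for illustration.

```python
import re

# Same records as the tutorial's DataFrame:
# subject_id, name, age, technology1, technology2.
students = [(4, 'sravan', 23, 'PHP', 'Testing'),
            (2, 'sravan', 23, 'Oracle', 'Testing'),
            (46, 'mounika', 22, '.NET', 'HTML'),
            (12, 'deepika', 21, 'Oracle', 'HTML'),
            (46, 'mounika', 22, 'Oracle', 'Testing'),
            (12, 'chandrika', 23, 'Hadoop', 'C#'),
            (12, 'chandrika', 22, 'Oracle', 'Testing'),
            (45, 'sravan', 23, 'Oracle', 'C#'),
            (4, 'deepika', 21, 'PHP', 'C#'),
            (46, 'mounika', 22, '.NET', 'Testing')]

def like_match(value, pattern):
    """Hypothetical case-sensitive LIKE: % is any run of characters, _ is one character."""
    regex = "".join(".*" if c == "%" else "." if c == "_" else re.escape(c)
                    for c in pattern)
    return re.fullmatch(regex, value, re.DOTALL) is not None

def ilike_match(value, pattern):
    """Hypothetical case-insensitive LIKE: lower-case both sides, then compare."""
    return like_match(value.lower(), pattern.lower())

# technology1 is the fourth field (index 3) of each record.
like_rows = [row for row in students if like_match(row[3], 'pHP')]
ilike_rows = [row for row in students if ilike_match(row[3], 'pHP')]

print(len(like_rows))   # 0: no technology1 value is spelled exactly 'pHP'
print(ilike_rows)       # the two records whose technology1 is 'PHP'
```

The case-sensitive check finds nothing, while the case-insensitive check returns the two PHP records, which mirrors the results of the like() and ilike() examples above.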

Conclusion

In this PySpark tutorial, we saw two functions that return rows based on a string match in a PySpark DataFrame column. The like() and ilike() functions check whether the values in a column match a given string or pattern, and both can be used with filter() or where(). The difference is that like() is case-sensitive and ilike() is case-insensitive.

About the author

Gottumukkala Sravan Kumar

B.Tech (Hons) in Information Technology; known programming languages: Python, R, PHP, MySQL; published 500+ articles in the computer science domain