Apache Spark

PySpark – Pandas DataFrame: Comparison Operators

In Python, PySpark is a Spark module that provides pandas-like processing using a DataFrame, which stores the given data in row and column format. A PySpark pandas DataFrame behaves like a pandas DataFrame, but it holds a PySpark DataFrame internally. This pandas API is imported from the pyspark module.

In this article, we will demonstrate the pandas DataFrame comparison operators and how they can be used in PySpark. Before that, you have to install the pyspark module as shown below:

Command

pip install pyspark

Syntax to import

from pyspark import pandas

After that, we can create or use a dataframe through the pandas module.

Syntax to create pandas DataFrame

pyspark.pandas.DataFrame()

We can pass a dictionary or a list of lists of values. Let’s create a pandas DataFrame through pyspark with three columns and five rows.

#import pandas from the pyspark module
from pyspark import pandas

#create dataframe from pandas pyspark
pyspark_pandas = pandas.DataFrame({'mark1': [90, 78, 90, 54, 67], 'mark2': [100, 67, 96, 89, 77], 'mark3': [91, 92, 98, 97, 87]})

#display
print(pyspark_pandas)

Output

Now, we will go into our tutorial.

Comparison operators compare every value in the pyspark pandas dataframe with a given value. They return True where the condition is satisfied; otherwise, False is returned for that value.

Let’s see them one by one.

pyspark.pandas.DataFrame.lt (less than operator) 

This comparison operator checks whether each value in the given pyspark pandas dataframe is less than the given value. If yes, it returns True for that value; otherwise, False is returned.

It is also possible to use the ‘<’ (less than) operator.

Syntax 

pyspark_pandas.lt(value)
pyspark_pandas<value

Where pyspark_pandas is the pyspark pandas dataframe.

Parameter
It takes a numeric value as the parameter to compare against.

Example
In this example, we will compare the above created dataframe with the value 75 using the lt() method and the < operator.

#import pandas from the pyspark module
from pyspark import pandas

#create dataframe from pandas pyspark
pyspark_pandas = pandas.DataFrame({'mark1': [90, 78, 90, 54, 67], 'mark2': [100, 67, 96, 89, 77], 'mark3': [91, 92, 98, 97, 87]})

#check whether each value in the above dataframe is less than 75
print(pyspark_pandas.lt(75))

print()

#check whether each value in the above dataframe is less than 75
print(pyspark_pandas < 75)

Output

Both operators returned the same result: per the condition, values less than 75 returned True, and the remaining values returned False.

pyspark.pandas.DataFrame.le (less than or equal operator)

le is the comparison operator that checks whether each value in the given pyspark pandas dataframe is less than or equal to the given value. If yes, it returns True for that value; otherwise, False is returned.

It is also possible to use the ‘<=’ (less than or equal to) operator.

Syntax

pyspark_pandas.le(value)
pyspark_pandas<=value

Where pyspark_pandas is the pyspark pandas dataframe.

Parameter
It takes a numeric value as the parameter to compare against.

Example
In this example, we will compare the above created dataframe with the value 75 using the le() method and the <= operator.

#import pandas from the pyspark module
from pyspark import pandas

#create dataframe from pandas pyspark
pyspark_pandas = pandas.DataFrame({'mark1': [90, 78, 90, 54, 67], 'mark2': [100, 67, 96, 89, 77], 'mark3': [91, 92, 98, 97, 87]})

#check whether each value in the above dataframe is less than or equal to 75
print(pyspark_pandas.le(75))

print()

#check whether each value in the above dataframe is less than or equal to 75
print(pyspark_pandas <= 75)

Output

Both operators returned the same result: per the condition, values less than or equal to 75 returned True, and the remaining values returned False.

pyspark.pandas.DataFrame.gt (greater than operator)

This comparison operator checks whether each value in the given pyspark pandas dataframe is greater than the given value. If yes, it returns True for that value; otherwise, False is returned.

It is also possible to use the ‘>’ (greater than) operator.

Syntax 

pyspark_pandas.gt(value)
pyspark_pandas>value

Where pyspark_pandas is the pyspark pandas dataframe.

Parameter
It takes a numeric value as the parameter to compare against.

Example
In this example, we will compare the above created dataframe with the value 75 using the gt() method and the > operator.

#import pandas from the pyspark module
from pyspark import pandas

#create dataframe from pandas pyspark
pyspark_pandas = pandas.DataFrame({'mark1': [90, 78, 90, 54, 67], 'mark2': [100, 67, 96, 89, 77], 'mark3': [91, 92, 98, 97, 87]})

#check whether each value in the above dataframe is greater than 75
print(pyspark_pandas.gt(75))

print()

#check whether each value in the above dataframe is greater than 75
print(pyspark_pandas > 75)

Output

Both operators returned the same result: per the condition, values greater than 75 returned True, and the remaining values returned False.

pyspark.pandas.DataFrame.ge (greater than or equal operator)

ge is the comparison operator that checks whether each value in the given pyspark pandas dataframe is greater than or equal to the given value. If yes, it returns True for that value; otherwise, False is returned.

It is also possible to use the ‘>=’ (greater than or equal to) operator.

Syntax

pyspark_pandas.ge(value)
pyspark_pandas>=value

Where pyspark_pandas is the pyspark pandas dataframe.

Parameter
It takes a numeric value as the parameter to compare against.

Example
In this example, we will compare the above created dataframe with the value 75 using the ge() method and the >= operator.

#import pandas from the pyspark module
from pyspark import pandas

#create dataframe from pandas pyspark
pyspark_pandas = pandas.DataFrame({'mark1': [90, 78, 90, 54, 67], 'mark2': [100, 67, 96, 89, 77], 'mark3': [91, 92, 98, 97, 87]})

#check whether each value in the above dataframe is greater than or equal to 75
print(pyspark_pandas.ge(75))

print()

#check whether each value in the above dataframe is greater than or equal to 75
print(pyspark_pandas >= 75)

Output

Both operators returned the same result: per the condition, values greater than or equal to 75 returned True, and the remaining values returned False.

pyspark.pandas.DataFrame.eq (equal to operator)

eq is the comparison operator that checks whether each value in the given pyspark pandas dataframe is equal to the given value. If yes, it returns True for that value; otherwise, False is returned.

It is also possible to use the ‘==’ (equal to) operator.

Syntax

pyspark_pandas.eq(value)
pyspark_pandas==value

Where pyspark_pandas is the pyspark pandas dataframe.

Parameter
It takes a numeric value as the parameter to compare against.

Example
In this example, we will compare the above created dataframe with the value 97 using the eq() method and the == operator.

#import pandas from the pyspark module
from pyspark import pandas

#create dataframe from pandas pyspark
pyspark_pandas = pandas.DataFrame({'mark1': [90, 78, 90, 54, 67], 'mark2': [100, 67, 96, 89, 77], 'mark3': [91, 92, 98, 97, 87]})

#check whether each value in the above dataframe is equal to 97
print(pyspark_pandas.eq(97))

print()

#check whether each value in the above dataframe is equal to 97
print(pyspark_pandas == 97)

Output

Both operators returned the same result: per the condition, values equal to 97 returned True, and the remaining values returned False.

pyspark.pandas.DataFrame.ne (not equal to operator)

ne is the comparison operator that checks whether each value in the given pyspark pandas dataframe is not equal to the given value. If yes, it returns True for that value; otherwise, False is returned.

It is also possible to use the ‘!=’ (not equal to) operator.

Syntax 

pyspark_pandas.ne(value)
pyspark_pandas!=value

Where pyspark_pandas is the pyspark pandas dataframe.

Parameter
It takes a numeric value as the parameter to compare against.

Example
In this example, we will compare the above created dataframe with the value 97 using the ne() method and the != operator.

#import pandas from the pyspark module
from pyspark import pandas

#create dataframe from pandas pyspark
pyspark_pandas = pandas.DataFrame({'mark1': [90, 78, 90, 54, 67], 'mark2': [100, 67, 96, 89, 77], 'mark3': [91, 92, 98, 97, 87]})

#check whether each value in the above dataframe is not equal to 97
print(pyspark_pandas.ne(97))

print()

#check whether each value in the above dataframe is not equal to 97
print(pyspark_pandas != 97)

Output

Both operators returned the same result: per the condition, values not equal to 97 returned True, and the remaining values returned False.

Conclusion

In this PySpark pandas article, we saw how to apply different comparison operators on a DataFrame through built-in methods and normal operators. Each operator returns a Boolean value for every element in the pyspark pandas DataFrame. The comparison operators that we used are eq(), ne(), lt(), gt(), le() and ge().

About the author

Gottumukkala Sravan Kumar

B.Tech (Hons.) in Information Technology; known programming languages: Python, R, PHP, MySQL; published 500+ articles in the computer science domain.