This article demonstrates the pandas DataFrame comparison operators and how they can be used in PySpark. Before that, you have to install the pyspark module as shown below:
Command
pip install pyspark
Syntax to import
from pyspark import pandas
After that, we can create or use a DataFrame from the pandas module.
Syntax to create pandas DataFrame
pyspark.pandas.DataFrame()
We can pass a dictionary or list of lists with values. Let’s create a pandas DataFrame through pyspark with three columns and five rows.
from pyspark import pandas

# create a pyspark pandas dataframe with three columns and five rows
pyspark_pandas = pandas.DataFrame({'mark1': [90, 78, 90, 54, 67],
                                   'mark2': [100, 67, 96, 89, 77],
                                   'mark3': [91, 92, 98, 97, 87]})

# display the dataframe
print(pyspark_pandas)
Output
Now, we will go into our tutorial.
Comparison operators are used to compare every value in the pyspark pandas dataframe with a given value. The result is a dataframe of the same shape that holds True where the condition is satisfied and False where it is not.
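As a minimal sketch of this element-wise behaviour, the snippet below compares a dataframe with a single value and inspects the result. It uses plain pandas as a stand-in so that it runs without a Spark session; that substitution is an assumption made for illustration, and the name marks is hypothetical. With PySpark installed, the same comparison should work on a pyspark pandas dataframe.

```python
import pandas as pd  # plain pandas as a stand-in for pyspark.pandas (assumption)

# hypothetical name; same data as the dataframe used in this article
marks = pd.DataFrame({'mark1': [90, 78, 90, 54, 67],
                      'mark2': [100, 67, 96, 89, 77],
                      'mark3': [91, 92, 98, 97, 87]})

# one boolean per cell, so the result keeps the shape and labels of the input
mask = marks > 75
print(mask.shape == marks.shape)
print(list(mask.columns))
print(all(dtype == bool for dtype in mask.dtypes))
```

The comparison never collapses to a single True or False; every cell gets its own answer.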
Let’s see them one by one.
pyspark.pandas.DataFrame.lt (less than operator)
This comparison operator checks, element by element, whether each value in the given pyspark pandas dataframe is less than the given value. It returns True at every position where the condition holds and False everywhere else.
It is also possible to use the ‘<’ (less than) operator.
Syntax
pyspark_pandas.lt(value)
pyspark_pandas < value
Where pyspark_pandas is the pyspark pandas dataframe.
Parameter
It takes one parameter, value, which is the numeric value to compare against.
Example
In this example, we will compare the dataframe created above with the value 75 using the lt() method and the < operator.
from pyspark import pandas

# create a pyspark pandas dataframe
pyspark_pandas = pandas.DataFrame({'mark1': [90, 78, 90, 54, 67],
                                   'mark2': [100, 67, 96, 89, 77],
                                   'mark3': [91, 92, 98, 97, 87]})

# element-wise check with the method: is each value less than 75?
print(pyspark_pandas.lt(75))
print()

# the same check with the < operator
print(pyspark_pandas < 75)
Output
Both forms return the same result: values less than 75 give True, and all other values give False.
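Because the result is a dataframe of booleans, it can be summed to count how many values meet the condition. The sketch below does this for the threshold 75. It uses plain pandas as a stand-in so that it runs without a Spark session (an assumption for illustration), and the names marks, below and counts are hypothetical.

```python
import pandas as pd  # plain pandas as a stand-in for pyspark.pandas (assumption)

# hypothetical name; same data as the dataframe used in this article
marks = pd.DataFrame({'mark1': [90, 78, 90, 54, 67],
                      'mark2': [100, 67, 96, 89, 77],
                      'mark3': [91, 92, 98, 97, 87]})

below = marks.lt(75)                      # True where a mark is under 75
same_as_operator = below.equals(marks < 75)

# True counts as 1 when summed, giving the number of marks under 75 per column
counts = below.sum()
print(same_as_operator)
print(counts)
```

For this data, mark1 has two values under 75, mark2 has one, and mark3 has none.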
pyspark.pandas.DataFrame.le (less than or equal operator)
le is the comparison operator that checks, element by element, whether each value in the given pyspark pandas dataframe is less than or equal to the given value. It returns True at every position where the condition holds and False everywhere else.
It is also possible to use the ‘<=’ (less than or equal to) operator.
Syntax
pyspark_pandas.le(value)
pyspark_pandas <= value
Where pyspark_pandas is the pyspark pandas dataframe.
Parameter
It takes one parameter, value, which is the numeric value to compare against.
Example
In this example, we will compare the dataframe created above with the value 75 using the le() method and the <= operator.
from pyspark import pandas

# create a pyspark pandas dataframe
pyspark_pandas = pandas.DataFrame({'mark1': [90, 78, 90, 54, 67],
                                   'mark2': [100, 67, 96, 89, 77],
                                   'mark3': [91, 92, 98, 97, 87]})

# element-wise check with the method: is each value less than or equal to 75?
print(pyspark_pandas.le(75))
print()

# the same check with the <= operator
print(pyspark_pandas <= 75)
Output
Both forms return the same result: values less than or equal to 75 give True, and all other values give False.
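No mark in this data is exactly 75, so lt and le happen to give identical results for that threshold. The sketch below shows where the two differ by using 67, a value that does occur in the data. It uses plain pandas as a stand-in so that it runs without a Spark session (an assumption for illustration), and the variable names are hypothetical.

```python
import pandas as pd  # plain pandas as a stand-in for pyspark.pandas (assumption)

# hypothetical name; same data as the dataframe used in this article
marks = pd.DataFrame({'mark1': [90, 78, 90, 54, 67],
                      'mark2': [100, 67, 96, 89, 77],
                      'mark3': [91, 92, 98, 97, 87]})

# with 75 as the threshold, no value sits on the boundary
same_at_75 = marks.lt(75).equals(marks.le(75))

# with 67, le is True for the boundary values and lt is not
only_le = marks.le(67) & ~marks.lt(67)
boundary_hits = int(only_le.sum().sum())
print(same_at_75, boundary_hits)
```

The cells where only le is True are exactly the cells equal to 67, one in mark1 and one in mark2.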
pyspark.pandas.DataFrame.gt (greater than operator)
This comparison operator checks, element by element, whether each value in the given pyspark pandas dataframe is greater than the given value. It returns True at every position where the condition holds and False everywhere else.
It is also possible to use the ‘>’ (greater than) operator.
Syntax
pyspark_pandas.gt(value)
pyspark_pandas > value
Where pyspark_pandas is the pyspark pandas dataframe.
Parameter
It takes one parameter, value, which is the numeric value to compare against.
Example
In this example, we will compare the dataframe created above with the value 75 using the gt() method and the > operator.
from pyspark import pandas

# create a pyspark pandas dataframe
pyspark_pandas = pandas.DataFrame({'mark1': [90, 78, 90, 54, 67],
                                   'mark2': [100, 67, 96, 89, 77],
                                   'mark3': [91, 92, 98, 97, 87]})

# element-wise check with the method: is each value greater than 75?
print(pyspark_pandas.gt(75))
print()

# the same check with the > operator
print(pyspark_pandas > 75)
Output
Both forms return the same result: values greater than 75 give True, and all other values give False.
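A common use of such a boolean result is to keep only the rows that satisfy a condition on one column. The sketch below selects the rows whose mark1 is greater than 75. It uses plain pandas as a stand-in so that it runs without a Spark session (an assumption for illustration), and the names marks and passed are hypothetical.

```python
import pandas as pd  # plain pandas as a stand-in for pyspark.pandas (assumption)

# hypothetical name; same data as the dataframe used in this article
marks = pd.DataFrame({'mark1': [90, 78, 90, 54, 67],
                      'mark2': [100, 67, 96, 89, 77],
                      'mark3': [91, 92, 98, 97, 87]})

# comparing a single column gives one boolean per row
row_mask = marks['mark1'] > 75

# indexing with that mask keeps only the rows where it is True
passed = marks[row_mask]
print(passed)
```

Here the first three rows remain, since their mark1 values are 90, 78 and 90.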
pyspark.pandas.DataFrame.ge (greater than or equal operator)
ge is the comparison operator that checks, element by element, whether each value in the given pyspark pandas dataframe is greater than or equal to the given value. It returns True at every position where the condition holds and False everywhere else.
It is also possible to use the ‘>=’ (greater than or equal to) operator.
Syntax
pyspark_pandas.ge(value)
pyspark_pandas >= value
Where pyspark_pandas is the pyspark pandas dataframe.
Parameter
It takes one parameter, value, which is the numeric value to compare against.
Example
In this example, we will compare the dataframe created above with the value 75 using the ge() method and the >= operator.
from pyspark import pandas

# create a pyspark pandas dataframe
pyspark_pandas = pandas.DataFrame({'mark1': [90, 78, 90, 54, 67],
                                   'mark2': [100, 67, 96, 89, 77],
                                   'mark3': [91, 92, 98, 97, 87]})

# element-wise check with the method: is each value greater than or equal to 75?
print(pyspark_pandas.ge(75))
print()

# the same check with the >= operator
print(pyspark_pandas >= 75)
Output
Both forms return the same result: values greater than or equal to 75 give True, and all other values give False.
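For data without missing values, ge is the logical opposite of lt: a value that is not less than 75 must be greater than or equal to 75. The sketch below checks this relationship. It uses plain pandas as a stand-in so that it runs without a Spark session (an assumption for illustration), and the variable names are hypothetical.

```python
import pandas as pd  # plain pandas as a stand-in for pyspark.pandas (assumption)

# hypothetical name; same data as the dataframe used in this article
marks = pd.DataFrame({'mark1': [90, 78, 90, 54, 67],
                      'mark2': [100, 67, 96, 89, 77],
                      'mark3': [91, 92, 98, 97, 87]})

at_least = marks.ge(75)        # True where a mark is 75 or more
not_below = ~marks.lt(75)      # ~ flips every boolean in the lt result

# the two dataframes match cell for cell on this data (no missing values)
print(at_least.equals(not_below))
```

This identity is only claimed for complete data; cells with missing values may compare differently and should be checked separately.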
pyspark.pandas.DataFrame.eq (equality logical operator)
eq is the comparison operator that checks, element by element, whether each value in the given pyspark pandas dataframe is equal to the given value. It returns True at every position where the condition holds and False everywhere else.
It is also possible to use the ‘==’ (equal to) operator.
Syntax
pyspark_pandas.eq(value)
pyspark_pandas == value
Where pyspark_pandas is the pyspark pandas dataframe.
Parameter
It takes one parameter, value, which is the numeric value to compare against.
Example
In this example, we will compare the dataframe created above with the value 97 using the eq() method and the == operator.
from pyspark import pandas

# create a pyspark pandas dataframe
pyspark_pandas = pandas.DataFrame({'mark1': [90, 78, 90, 54, 67],
                                   'mark2': [100, 67, 96, 89, 77],
                                   'mark3': [91, 92, 98, 97, 87]})

# element-wise check with the method: is each value equal to 97?
print(pyspark_pandas.eq(97))
print()

# the same check with the == operator
print(pyspark_pandas == 97)
Output
Both forms return the same result: values equal to 97 give True, and all other values give False.
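The eq result can also answer whether a value occurs at all, and in which column. The sketch below looks for 97 in this way. It uses plain pandas as a stand-in so that it runs without a Spark session (an assumption for illustration), and the variable names are hypothetical.

```python
import pandas as pd  # plain pandas as a stand-in for pyspark.pandas (assumption)

# hypothetical name; same data as the dataframe used in this article
marks = pd.DataFrame({'mark1': [90, 78, 90, 54, 67],
                      'mark2': [100, 67, 96, 89, 77],
                      'mark3': [91, 92, 98, 97, 87]})

hits = marks.eq(97)                    # True only where a mark equals 97

# any() reduces each column to a single boolean: does it contain a match?
columns_with_hit = hits.any()
total_hits = int(hits.sum().sum())     # number of matching cells overall
print(columns_with_hit)
print(total_hits)
```

In this data, 97 appears once, in the mark3 column.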
pyspark.pandas.DataFrame.ne (not equal to operator)
ne is the comparison operator that checks, element by element, whether each value in the given pyspark pandas dataframe is not equal to the given value. It returns True at every position where the condition holds and False everywhere else.
It is also possible to use the ‘!=’ (not equal to) operator.
Syntax
pyspark_pandas.ne(value)
pyspark_pandas != value
Where pyspark_pandas is the pyspark pandas dataframe.
Parameter
It takes one parameter, value, which is the numeric value to compare against.
Example
In this example, we will compare the dataframe created above with the value 97 using the ne() method and the != operator.
from pyspark import pandas

# create a pyspark pandas dataframe
pyspark_pandas = pandas.DataFrame({'mark1': [90, 78, 90, 54, 67],
                                   'mark2': [100, 67, 96, 89, 77],
                                   'mark3': [91, 92, 98, 97, 87]})

# element-wise check with the method: is each value different from 97?
print(pyspark_pandas.ne(97))
print()

# the same check with the != operator
print(pyspark_pandas != 97)
Output
Both forms return the same result: values not equal to 97 give True, and all other values give False.
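Since almost every mark differs from 97, the ne result here is nearly all True. The sketch below counts the True cells per column and overall. It uses plain pandas as a stand-in so that it runs without a Spark session (an assumption for illustration), and the variable names are hypothetical.

```python
import pandas as pd  # plain pandas as a stand-in for pyspark.pandas (assumption)

# hypothetical name; same data as the dataframe used in this article
marks = pd.DataFrame({'mark1': [90, 78, 90, 54, 67],
                      'mark2': [100, 67, 96, 89, 77],
                      'mark3': [91, 92, 98, 97, 87]})

differs = marks.ne(97)                  # True where a mark is not 97

per_column = differs.sum()              # count of differing marks per column
n_differs = int(per_column.sum())       # count over the whole dataframe
print(per_column)
print(n_differs, marks.size)
```

Fourteen of the fifteen cells differ from 97; the one remaining cell is the 97 in mark3.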
Conclusion
In this PySpark pandas article, we saw how to apply the comparison operators to a DataFrame, both through the built-in methods and through the standard operator symbols. Each one compares element by element and returns a DataFrame of boolean values. The methods covered are eq(), ne(), lt(), gt(), le(), and ge().
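The pairing between the six methods and their operator symbols can be checked in one loop. The sketch below does so with Python's standard operator module. It uses plain pandas as a stand-in so that it runs without a Spark session (an assumption for illustration), and the names marks, pairs and results are hypothetical.

```python
import operator

import pandas as pd  # plain pandas as a stand-in for pyspark.pandas (assumption)

# hypothetical name; same data as the dataframe used in this article
marks = pd.DataFrame({'mark1': [90, 78, 90, 54, 67],
                      'mark2': [100, 67, 96, 89, 77],
                      'mark3': [91, 92, 98, 97, 87]})

# method name -> function that applies the matching operator symbol
pairs = {'lt': operator.lt, 'le': operator.le,
         'gt': operator.gt, 'ge': operator.ge,
         'eq': operator.eq, 'ne': operator.ne}

# call the method by name and compare its result with the operator's result
results = {name: getattr(marks, name)(75).equals(op(marks, 75))
           for name, op in pairs.items()}
print(results)
```

Every entry is True for this data, which matches the observation in each section that the method and the operator give the same result.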