
PySpark – Pandas DataFrame: Arithmetic Operations

PySpark is the Python module for Apache Spark, and it processes data with DataFrames that store the given data in row and column format. A PySpark pandas DataFrame offers the familiar pandas DataFrame interface, but it holds a PySpark DataFrame internally. The pandas API is imported from the pyspark module.

In this article, we will discuss the arithmetic functions available for pandas DataFrames in PySpark and how to use them. Before that, you have to install the pyspark module.

Command

pip install pyspark

Syntax to import:

from pyspark import pandas

After that, we can create a dataframe with the pandas module imported from pyspark.

Syntax to create pandas DataFrame

pyspark.pandas.DataFrame()

We can pass the values as a dictionary or as a list of lists.
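
For example, here is a minimal sketch that builds a small dataframe from a list of lists instead of a dictionary; the column names passed to the columns parameter are only for illustration.

#import pandas from the pyspark module
from pyspark import pandas

#create dataframe from a list of lists, where each inner list is one row
pyspark_pandas = pandas.DataFrame([[90, 100, 91], [78, 67, 92]], columns=['mark1', 'mark2', 'mark3'])

#display
print(pyspark_pandas)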

Let’s create a pandas DataFrame through pyspark that has three columns and five rows.

#import pandas from the pyspark module
from pyspark import pandas

#create dataframe from pandas pyspark
pyspark_pandas=pandas.DataFrame({'mark1':[90,78,90,54,67],'mark2':[100,67,96,89,77],'mark3':[91,92,98,97,87]})

#display
print(pyspark_pandas)

Output

Now, let's get into the tutorial.

Arithmetic operations perform addition, subtraction, multiplication, division, and modulus. The pyspark pandas dataframe supports built-in functions for these operations.
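
As a side note, pyspark pandas dataframes also support the standard arithmetic operators (+, -, *, /, %), which behave like the corresponding methods; the snippet below is a minimal sketch of that equivalence for addition.

#import pandas from the pyspark module
from pyspark import pandas

#create dataframe from pandas pyspark
pyspark_pandas=pandas.DataFrame({'mark1':[90,78,90,54,67],'mark2':[100,67,96,89,77],'mark3':[91,92,98,97,87]})

#the + operator gives the same result as add()
print(pyspark_pandas + 5)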

Let’s see them one by one.

pyspark.pandas.DataFrame.add()

add() in the pyspark pandas dataframe is used to add a value to every element in the entire dataframe.

It is also possible to add the value to a single column. It takes the value as a parameter.

Syntax

For the entire pyspark pandas dataframe

pyspark_pandas.add(value)

For a particular column

pyspark_pandas.column_name.add(value)

Where,

  1. pyspark_pandas is the pyspark pandas dataframe
  2. value is the numeric value to be added to pyspark_pandas.

Example 1
In this example, we will add 5 to the mark1 column.

#import pandas from the pyspark module
from pyspark import pandas

#create dataframe from pandas pyspark
pyspark_pandas=pandas.DataFrame({'mark1':[90,78,90,54,67],'mark2':[100,67,96,89,77],'mark3':[91,92,98,97,87]})

#add 5 to each value in the mark1 column
print(pyspark_pandas.mark1.add(5))

Output

We can see that 5 is added to each value in the mark1 column.

Example 2
In this example, we will add 5 to the entire pyspark pandas dataframe.

#import pandas from the pyspark module
from pyspark import pandas
 
#create dataframe from pandas pyspark
pyspark_pandas=pandas.DataFrame({'mark1':[90,78,90,54,67],'mark2':[100,67,96,89,77],'mark3':[91,92,98,97,87]})
 
#add 5 to the entire dataframe
print(pyspark_pandas.add(5))

Output

We can see that 5 is added to the entire pyspark pandas dataframe.

pyspark.pandas.DataFrame.sub()

sub() in the pyspark pandas dataframe is used to subtract a value from every element in the entire dataframe.

It is also possible to subtract the value from a single column. It takes the value as a parameter.

Syntax

For the entire pyspark pandas dataframe

pyspark_pandas.sub(value)

For a particular column

pyspark_pandas.column_name.sub(value)

Where,

  1. pyspark_pandas is the pyspark pandas dataframe
  2. value is the numeric value to be subtracted from pyspark_pandas.

Example 1
In this example, we will subtract 5 from the mark1 column.

#import pandas from the pyspark module
from pyspark import pandas

#create dataframe from pandas pyspark
pyspark_pandas=pandas.DataFrame({'mark1':[90,78,90,54,67],'mark2':[100,67,96,89,77],'mark3':[91,92,98,97,87]})

#subtract 5 from each value in the mark1 column
print(pyspark_pandas.mark1.sub(5))

Output

We can see that 5 is subtracted from each value in the mark1 column.

Example 2
In this example, we will subtract 5 from the entire pyspark pandas dataframe.

#import pandas from the pyspark module
from pyspark import pandas

#create dataframe from pandas pyspark
pyspark_pandas=pandas.DataFrame({'mark1':[90,78,90,54,67],'mark2':[100,67,96,89,77],'mark3':[91,92,98,97,87]})

#subtract 5 from the entire dataframe
print(pyspark_pandas.sub(5))

Output

We can see that 5 is subtracted from the entire pyspark pandas dataframe.

pyspark.pandas.DataFrame.mul()

mul() in the pyspark pandas dataframe is used to multiply every element in the entire dataframe by a value.

It is also possible to multiply a single column by the value. It takes the value as a parameter.

Syntax

For the entire pyspark pandas dataframe

pyspark_pandas.mul(value)

For a particular column

pyspark_pandas.column_name.mul(value)

Where,

  1. pyspark_pandas is the pyspark pandas dataframe
  2. value is the numeric value by which pyspark_pandas is multiplied.

Example 1
In this example, we will multiply all values in the mark1 column by 5.

#import pandas from the pyspark module
from pyspark import pandas

#create dataframe from pandas pyspark
pyspark_pandas=pandas.DataFrame({'mark1':[90,78,90,54,67],'mark2':[100,67,96,89,77],'mark3':[91,92,98,97,87]})

#multiply values in the mark1 column by 5
print(pyspark_pandas.mark1.mul(5))

Output

We can see that each value in the mark1 column is multiplied by 5.

Example 2
In this example, we will multiply the entire pyspark pandas dataframe by 5.

#import pandas from the pyspark module
from pyspark import pandas
 
#create dataframe from pandas pyspark
pyspark_pandas=pandas.DataFrame({'mark1':[90,78,90,54,67],'mark2':[100,67,96,89,77],'mark3':[91,92,98,97,87]})
 
#multiply entire dataframe with 5
print(pyspark_pandas.mul(5))

Output

We can see that the entire pyspark pandas dataframe is multiplied by 5.

pyspark.pandas.DataFrame.div()

div() in the pyspark pandas dataframe is used to divide every element in the entire dataframe by a value.

It is also possible to divide a single column by the value. It takes the value as a parameter and returns the quotient.

Syntax

For the entire pyspark pandas dataframe

pyspark_pandas.div(value)

For a particular column

pyspark_pandas.column_name.div(value)

Where,

  1. pyspark_pandas is the pyspark pandas dataframe
  2. value is the numeric value by which pyspark_pandas is divided.

Example 1
In this example, we will divide all values in the mark1 column by 5.

#import pandas from the pyspark module
from pyspark import pandas

#create dataframe from pandas pyspark
pyspark_pandas=pandas.DataFrame({'mark1':[90,78,90,54,67],'mark2':[100,67,96,89,77],'mark3':[91,92,98,97,87]})

#divide the mark1 column by 5
print(pyspark_pandas.mark1.div(5))

Output

We can see that each value in the mark1 column is divided by 5.

Example 2
In this example, we will divide the entire pyspark pandas dataframe by 5.

#import pandas from the pyspark module
from pyspark import pandas
 
#create dataframe from pandas pyspark
pyspark_pandas=pandas.DataFrame({'mark1':[90,78,90,54,67],'mark2':[100,67,96,89,77],'mark3':[91,92,98,97,87]})
 
#divide entire dataframe by 5
print(pyspark_pandas.div(5))

Output

We can see that the entire pyspark pandas dataframe is divided by 5.

pyspark.pandas.DataFrame.mod()

mod() in the pyspark pandas dataframe is used to divide every element in the entire dataframe by a value and return the remainder.

It is also possible to apply it to a single column. It takes the value as a parameter.

Syntax

For the entire pyspark pandas dataframe

pyspark_pandas.mod(value)

For a particular column

pyspark_pandas.column_name.mod(value)

Where,

  1. pyspark_pandas is the pyspark pandas dataframe
  2. value is the numeric value by which pyspark_pandas is divided to get the remainder.

Example 1
In this example, we will divide all values in the mark1 column by 5 and get the remainder.

#import pandas from the pyspark module
from pyspark import pandas

#create dataframe from pandas pyspark
pyspark_pandas=pandas.DataFrame({'mark1':[90,78,90,54,67],'mark2':[100,67,96,89,77],'mark3':[91,92,98,97,87]})

#get the remainder of dividing the mark1 column by 5
print(pyspark_pandas.mark1.mod(5))

Output

We can see that each value in the mark1 column is divided by 5 and the remainder is returned.

Example 2
In this example, we will divide the entire pyspark pandas dataframe by 5 and get the remainders.

#import pandas from the pyspark module
from pyspark import pandas

#create dataframe from pandas pyspark
pyspark_pandas=pandas.DataFrame({'mark1':[90,78,90,54,67],'mark2':[100,67,96,89,77],'mark3':[91,92,98,97,87]})

#get the remainder of dividing the entire dataframe by 5
print(pyspark_pandas.mod(5))

Output

We can see that the entire pyspark pandas dataframe is divided by 5 and the remainders are returned.
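
Since div() and mod() are closely related, here is a minimal sketch that applies both to the same column, so the quotient and the remainder can be compared side by side.

#import pandas from the pyspark module
from pyspark import pandas

#create dataframe with a single mark1 column
pyspark_pandas=pandas.DataFrame({'mark1':[90,78,90,54,67]})

#div() divides each value by 5, while mod() keeps only the remainder of that division
print(pyspark_pandas.mark1.div(5))
print(pyspark_pandas.mark1.mod(5))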

Conclusion

In this pyspark pandas tutorial, we discussed the arithmetic operations performed on the pyspark pandas dataframe. add() is used to add a value to all the values in the entire dataframe, and sub() is used to subtract a value from the entire pyspark pandas dataframe. mul() is used to multiply all the values in the entire dataframe by a value, and div() is used to divide all the values in the pyspark pandas dataframe by a value and return the quotient. mod() is used to divide all the values in the pyspark pandas dataframe by a value and return the remainder. The difference between mod() and div() is that mod() returns the remainder, while div() returns the quotient.
