First, we have to create a PySpark DataFrame for demonstration.
Example:
We will create a dataframe with 5 rows and 6 columns and display it using the show() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
#display dataframe
df.show()
Output:
Method 1: Using select() method
We can get the total value of a column in the dataframe using the select() method together with the sum() function. To use sum(), we have to import it from the pyspark.sql.functions module. Finally, we use the collect() method to retrieve the sum from the column.
Syntax:
df.select(sum('column_name')).collect()
Where,
- df is the input PySpark DataFrame
- column_name is the column to get the sum value
If we want to return the total value from multiple columns, we must use the sum() function inside the select() method, specifying the column names separated by commas.
Syntax:
df.select(sum('column_name1'),sum('column_name2'),...,sum('column_nameN')).collect()
Where,
- df is the input PySpark DataFrame
- column_name is the column to get the sum value
Example 1: Single Column
This example will get the total value from the height column in the PySpark dataframe.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the sum function (note: this shadows Python's built-in sum)
from pyspark.sql.functions import sum
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
#return the sum from the height column
df.select(sum('height')).collect()
Output:
The total value (sum) from the height column is returned in the above example.
Example 2: Multiple Columns
This example will get the total value from the height, age, and weight columns in the PySpark dataframe.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the sum function
from pyspark.sql.functions import sum
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
#return the sum from the height ,age and weight columns
df.select(sum('height'),sum('age'),sum('weight')).collect()
Output:
The total values (sums) from the height, age, and weight columns are returned in the above example.
Method 2: Using agg() method
We can get the total value of a column in the dataframe using the agg() method. This method performs an aggregation over the values in a column. It takes a dictionary as a parameter, in which each key is a column name and the corresponding value is the aggregate function, i.e., sum. Finally, we use the collect() method to retrieve the sum from the column.
Syntax:
df.agg({'column_name': 'sum'}).collect()
Where,
- df is the input PySpark DataFrame
- column_name is the column to get the sum value
- sum is the aggregate function used to return the total value
If we want to return the total value from multiple columns, we must specify each column name with the sum function, separated by commas.
Syntax:
df.agg({'column_name1': 'sum','column_name2': 'sum',...,'column_nameN': 'sum'}).collect()
Where,
- df is the input PySpark DataFrame
- column_name is the column to get the sum value
- sum is the aggregate function used to return the total value
Example 1: Single Column
This example will get the total value from the height column in the PySpark dataframe.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
#return the sum from the height column
df.agg({'height': 'sum'}).collect()
Output:
In the above example, the total value (sum) from the height column is returned.
Example 2: Multiple Columns
This example will get the total value from the height, age, and weight columns in the PySpark dataframe.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
#return the sum from the height,age and weight columns
df.agg({'height': 'sum','age': 'sum','weight': 'sum'}).collect()
Output:
In the above example, the total value (sum) from the height, age, and weight columns is returned.
Method 3: Using groupBy() method
We can get the total value of a column in the dataframe using the groupBy() method. This method returns the totals by grouping similar values in another column. We have to use the sum() function after calling groupBy().
Syntax:
df.groupBy('group_column').sum('column_name').collect()
Where,
- df is the input PySpark DataFrame
- group_column is the column whose values are used to form the groups
- column_name is the column to get the sum value
- sum is the aggregate function used to return the total value
Example 1:
In this example, we will group by the address column and return the total (sum) of the height column for each address value.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
#return the sum from the height column grouping with address column
df.groupBy('address').sum('height').collect()
Output:
There are three unique values in the address column – hyd, guntur, and patna – so the sums are computed per address group.
Row(address='guntur', sum(height)=5.79),
Row(address='patna', sum(height)=2.79)]
Example 2:
In this example, we will group by the address column and return the total (sum) of the weight column for each address value.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame(students)
#return the sum from the weight column grouping with address column
df.groupBy('address').sum('weight').collect()
Output:
There are three unique values in the address column – hyd, guntur, and patna – so the sums are computed per address group.
[Row(address='hyd', sum(weight)=116),
Row(address='guntur', sum(weight)=67),
Row(address='patna', sum(weight)=17)]
Conclusion:
We discussed how to get the sum (total) value from a PySpark DataFrame using the select() and agg() methods. To get the sum grouped by another column, we used groupBy() along with the sum() function.