PySpark – stddev()
stddev() in PySpark returns the standard deviation of the values in a particular column of the DataFrame. It is an alias for stddev_samp(), so it computes the sample standard deviation.
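To make the definition concrete, here is a minimal plain-Python sketch (not PySpark itself) of the arithmetic behind a sample standard deviation. It assumes the five height values from the demonstration data used in this article and divides by n - 1, as a sample standard deviation does. The variable names are illustrative only.

```python
import math
import statistics

# Height values from the student data used in this article
heights = [5.79, 3.79, 2.79, 3.69, 5.59]

n = len(heights)
mean = sum(heights) / n

# Sum of squared deviations from the mean
squared_deviations = sum((h - mean) ** 2 for h in heights)

# Sample standard deviation: divide by n - 1 (Bessel's correction)
manual_stddev = math.sqrt(squared_deviations / (n - 1))

# The standard library computes the same quantity
library_stddev = statistics.stdev(heights)

print(round(manual_stddev, 4))
print(round(library_stddev, 4))
```

Both values come out at roughly 1.3031, which is the figure the stddev() calls on the height column should agree with.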
First, we create a PySpark DataFrame for demonstration.
Example:
We will create a dataframe with 5 rows and 6 columns and display it using the show() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame( students)
#display dataframe
df.show()
Output:
Method 1: Using the select() method
We can get the standard deviation of a column by calling the stddev() function inside the select() method. To use stddev(), we have to import it from the pyspark.sql.functions module. Finally, we can call the collect() method to return the result as a list of Row objects.
Syntax:
df.select(stddev('column_name'))
Where,
- df is the input PySpark DataFrame
- column_name is the column to get the standard deviation
To return the standard deviation of multiple columns, we call stddev() once per column inside the select() method, separating the calls with commas.
Syntax:
df.select(stddev('column_name'), stddev('column_name'), ..., stddev('column_name'))
Where,
- df is the input PySpark DataFrame
- column_name is the column to get the standard deviation
Example 1: Single Column
This example will get the standard deviation from the height column in the PySpark dataframe.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the standard deviation - stddev function
from pyspark.sql.functions import stddev
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame( students)
#return the standard deviation from the height column
df.select(stddev('height')).collect()
Output:
In the above example, the standard deviation from the height column is returned.
Example 2: Multiple Columns
This example will get the standard deviation from the height, age, and weight columns in the PySpark dataframe.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the standard deviation - stddev function
from pyspark.sql.functions import stddev
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame( students)
#return the standard deviation from the height, age and weight columns
df.select(stddev('height'),stddev('age'),stddev('weight')).collect()
Output:
The standard deviation from the height, age, and weight columns is returned in the above example.
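Because the result depends on a running Spark session, it can be handy to cross-check the three values in plain Python. The sketch below is only a sanity check under the assumption that stddev() computes the sample standard deviation; it copies the numeric fields of the student data and uses the standard library's statistics.stdev(). The names columns and expected are illustrative.

```python
import statistics

# Numeric fields copied from the student data in this article
students = [
    {'age': 23, 'height': 5.79, 'weight': 67},
    {'age': 16, 'height': 3.79, 'weight': 34},
    {'age': 7, 'height': 2.79, 'weight': 17},
    {'age': 9, 'height': 3.69, 'weight': 28},
    {'age': 37, 'height': 5.59, 'weight': 54},
]

columns = ['height', 'age', 'weight']

# One sample standard deviation per column, mirroring one stddev() call per column
expected = {
    column: statistics.stdev([row[column] for row in students])
    for column in columns
}

for column, value in expected.items():
    print(column, round(value, 4))
```

The values are roughly 1.3031 for height, 12.1573 for age, and 20.2114 for weight.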
Method 2: Using the agg() method
We can also get the standard deviation of a column using the agg() method, which aggregates the values of a column over the entire DataFrame. It takes a dictionary as a parameter, in which the key is the column name and the value is the name of the aggregate function, i.e., stddev. Finally, we can call the collect() method to return the result.
Syntax:
df.agg({'column_name': 'stddev'})
Where,
- df is the input PySpark DataFrame
- column_name is the column to get the standard deviation
- stddev is an aggregation function used to return the standard deviation
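To return the standard deviation of multiple columns, we add one pair of column name and stddev per column to the dictionary, separating the pairs with commas.
Syntax:
df.agg({'column_name': 'stddev', 'column_name': 'stddev', ..., 'column_name': 'stddev'})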
Where,
- df is the input PySpark DataFrame
- column_name is the column to get the standard deviation
- stddev is an aggregation function used to return the standard deviation
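When many columns share the same aggregate, the dictionary described above can be generated rather than typed by hand. The following is a small sketch; build_agg_spec is a hypothetical helper introduced here for illustration and is not part of PySpark.

```python
def build_agg_spec(columns, function_name='stddev'):
    """Map every column name to the same aggregate function name."""
    return {column: function_name for column in columns}


# Hypothetical usage: the three numeric columns of the student data
spec = build_agg_spec(['height', 'age', 'weight'])
print(spec)
# With a PySpark DataFrame named df, this dictionary would be passed as df.agg(spec)
```

The printed dictionary has the same shape as the one written out by hand in the multiple-column syntax.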
Example 1: Single Column
This example will get the standard deviation from the height column in the PySpark dataframe.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame( students)
#return the standard deviation from the height column
df.agg({'height': 'stddev'}).collect()
Output:
In the above example, the standard deviation from the height column is returned.
Example 2: Multiple Columns
This example will get the standard deviation from the height, age, and weight columns in the PySpark dataframe.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame( students)
#return the standard deviation from the height, age and weight columns
df.agg({'height': 'stddev','age': 'stddev','weight': 'stddev'}).collect()
Output:
The standard deviation from the height, age, and weight columns is returned in the above example.
PySpark – stddev_samp()
stddev_samp() in PySpark returns the sample standard deviation of the values in a particular column of the DataFrame. It returns the same result as stddev(), which is an alias for it.
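For reference, the sample standard deviation of n values with mean x-bar is conventionally defined as below. This is the standard textbook formula, given here as background rather than quoted from the PySpark documentation.

```latex
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```

The divisor n - 1 is what distinguishes the sample form from the population form, which divides by n.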
First, we create a PySpark DataFrame for demonstration.
Example:
We will create a dataframe with 5 rows and 6 columns and display it using the show() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame( students)
#display dataframe
df.show()
Output:
Method 1: Using the select() method
We can get the sample standard deviation of a column by calling the stddev_samp() function inside the select() method. To use stddev_samp(), we have to import it from the pyspark.sql.functions module. Finally, we can call the collect() method to return the result as a list of Row objects.
Syntax:
df.select(stddev_samp('column_name'))
Where,
- df is the input PySpark DataFrame
- column_name is the column to get the standard deviation in a sample
To return the sample standard deviation of multiple columns, we call stddev_samp() once per column inside the select() method, separating the calls with commas.
Syntax:
df.select(stddev_samp('column_name'), stddev_samp('column_name'), ..., stddev_samp('column_name'))
Where,
- df is the input PySpark DataFrame
- column_name is the column to get the standard deviation for the given sample
Example 1: Single Column
In this example, we will get the standard deviation of a sample from the height column in the PySpark dataframe.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the standard deviation - stddev_samp function
from pyspark.sql.functions import stddev_samp
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame( students)
#return the standard deviation from the height column
df.select(stddev_samp('height')).collect()
Output:
In the above example, the standard deviation from the height column is returned.
Example 2: Multiple Columns
In this example, we will get the standard deviation of the sample from the height, age, and weight columns in the PySpark dataframe.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the standard deviation - stddev_samp function
from pyspark.sql.functions import stddev_samp
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame( students)
#return the standard deviation from the height, age and weight columns
df.select(stddev_samp('height'),stddev_samp('age'),stddev_samp('weight')).collect()
Output:
In the above example, the standard deviation from the height, age, and weight columns is returned.
Method 2: Using the agg() method
We can also get the sample standard deviation of a column using the agg() method, which aggregates the values of a column over the entire DataFrame. It takes a dictionary as a parameter, in which the key is the column name and the value is the name of the aggregate function, i.e., stddev_samp. Finally, we can call the collect() method to return the result.
Syntax:
df.agg({'column_name': 'stddev_samp'})
Where,
- df is the input PySpark DataFrame
- column_name is the column to get the standard deviation of a sample
- stddev_samp is an aggregation function used to return the standard deviation of a sample
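To return the sample standard deviation of multiple columns, we add one pair of column name and stddev_samp per column to the dictionary, separating the pairs with commas.
Syntax:
df.agg({'column_name': 'stddev_samp', 'column_name': 'stddev_samp', ..., 'column_name': 'stddev_samp'})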
Where,
- df is the input PySpark DataFrame
- column_name is the column to get the standard deviation of a sample
- stddev_samp is an aggregation function used to return the standard deviation of a sample
Example 1: Single Column
This example will get the standard deviation from the height column in the PySpark dataframe.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame( students)
#return the standard deviation from the height column
df.agg({'height': 'stddev_samp'}).collect()
Output:
In the above example, the standard deviation of a sample from the height column is returned.
Example 2: Multiple Columns
In this example, we will get the standard deviation of a sample from the height, age, and weight columns in the PySpark dataframe.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame( students)
#return the standard deviation from the height, age and weight columns
df.agg({'height': 'stddev_samp','age': 'stddev_samp','weight': 'stddev_samp'}).collect()
Output:
In the above example, the standard deviation from the height, age and weight columns is returned.
PySpark – stddev_pop()
stddev_pop() in PySpark is used to return the standard deviation of a population from a particular column in the DataFrame.
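The difference between the population and sample forms can be seen on the article's height values with a short plain-Python sketch. It uses the standard library rather than PySpark, on the assumption that stddev_pop() divides by n while stddev_samp() divides by n - 1. The variable names are illustrative.

```python
import math
import statistics

# Height values from the student data used in this article
heights = [5.79, 3.79, 2.79, 3.69, 5.59]
n = len(heights)

population_stddev = statistics.pstdev(heights)  # divides by n
sample_stddev = statistics.stdev(heights)       # divides by n - 1

# The two differ only by a constant factor that depends on n
rescaled_sample = sample_stddev * math.sqrt((n - 1) / n)

print(round(population_stddev, 4))
print(round(sample_stddev, 4))
```

The population value (about 1.1655) is smaller than the sample value (about 1.3031), and rescaling the sample value by sqrt((n - 1) / n) recovers the population value.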
First, we create a PySpark DataFrame for demonstration.
Example:
We will create a dataframe with 5 rows and 6 columns and display it using the show() method.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame( students)
#display dataframe
df.show()
Output:
Method 1: Using the select() method
We can get the population standard deviation of a column by calling the stddev_pop() function inside the select() method. To use stddev_pop(), we have to import it from the pyspark.sql.functions module. Finally, we can call the collect() method to return the result as a list of Row objects.
Syntax:
df.select(stddev_pop('column_name'))
Where,
- df is the input PySpark DataFrame
- column_name is the column to get the standard deviation of a population
To return the population standard deviation of multiple columns, we call stddev_pop() once per column inside the select() method, separating the calls with commas.
Syntax:
df.select(stddev_pop('column_name'), stddev_pop('column_name'), ..., stddev_pop('column_name'))
Where,
- df is the input PySpark DataFrame
- column_name is the column to get the standard deviation for the given population
Example 1: Single Column
In this example, we will get the standard deviation of a population from the height column in the PySpark dataframe.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the standard deviation - stddev_pop function
from pyspark.sql.functions import stddev_pop
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame( students)
#return the standard deviation from the height column
df.select(stddev_pop('height')).collect()
Output:
In the above example, the standard deviation from the height column is returned.
Example 2: Multiple Columns
In this example, we will get the standard deviation of population from the height, age, and weight columns in the PySpark dataframe.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the standard deviation - stddev_pop function
from pyspark.sql.functions import stddev_pop
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame( students)
#return the standard deviation from the height, age and weight columns
df.select(stddev_pop('height'),stddev_pop('age'),stddev_pop('weight')).collect()
Output:
In the above example, the standard deviation from the height, age, and weight columns is returned.
Method 2: Using the agg() method
We can also get the population standard deviation of a column using the agg() method, which aggregates the values of a column over the entire DataFrame. It takes a dictionary as a parameter, in which the key is the column name and the value is the name of the aggregate function, i.e., stddev_pop. Finally, we can call the collect() method to return the result.
Syntax:
df.agg({'column_name': 'stddev_pop'})
Where,
- df is the input PySpark DataFrame
- column_name is the column to get the standard deviation of a population
- stddev_pop is an aggregation function used to return the standard deviation of a population
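To return the population standard deviation of multiple columns, we add one pair of column name and stddev_pop per column to the dictionary, separating the pairs with commas.
Syntax:
df.agg({'column_name': 'stddev_pop', 'column_name': 'stddev_pop', ..., 'column_name': 'stddev_pop'})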
Where,
- df is the input PySpark DataFrame
- column_name is the column to get the standard deviation of a population
- stddev_pop is an aggregation function used to return the standard deviation of a population
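One practical consequence of the dictionary form is worth noting. Python dictionary keys are unique, so a single dictionary can name only one aggregate function per column. The sketch below shows this in plain Python; it says nothing about Spark itself, only about what the dictionary contains before it would be handed to agg().

```python
# Asking for two different aggregates of the same column in one dictionary literal
spec = {'height': 'stddev_samp', 'height': 'stddev_pop'}

# Only the last value for the repeated key survives
print(spec)
print(len(spec))
```

To get both the sample and the population standard deviation of the same column, the select() form with one function call per statistic is an alternative.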
Example 1: Single Column
This example will get the standard deviation from the height column in the PySpark dataframe.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame( students)
#return the standard deviation from the height column
df.agg({'height': 'stddev_pop'}).collect()
Output:
In the above example, the standard deviation of a population from the height column is returned.
Example 2: Multiple Columns
In this example, we will get the standard deviation of a population from the height, age, and weight columns in the PySpark dataframe.
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()
# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
'height':5.59,'weight':54,'address':'hyd'}]
# create the dataframe
df = spark_app.createDataFrame( students)
#return the standard deviation from the height, age and weight columns
df.agg({'height': 'stddev_pop','age': 'stddev_pop','weight': 'stddev_pop'}).collect()
Output:
In the above example, the standard deviation from the height, age, and weight columns is returned.
Conclusion
We discussed how to get the standard deviation from the PySpark DataFrame using stddev(), stddev_samp() and stddev_pop() functions through the select() and agg() methods.