
Standard Deviation in PySpark

PySpark is the Python module for Apache Spark and provides DataFrame-based data processing. Standard deviation is a statistical measure of how spread out a set of numbers is: a value may be described as lying X standard deviations away from the average, or all of the values in a set may fall within Y standard deviations of it. In this article, we will demonstrate three PySpark functions for computing the standard deviation; for each of them, we will provide examples with the select() and agg() methods. A short plain-Python sketch after the list below illustrates the calculation behind them.

  1. PySpark – stddev()
  2. PySpark – stddev_samp()
  3. PySpark – stddev_pop()
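
Before looking at the PySpark functions, here is a minimal plain-Python sketch (using the standard library's statistics module, not PySpark) of the calculation behind them, applied to the height values from the student data used throughout this article. stddev() and stddev_samp() correspond to the sample standard deviation, and stddev_pop() to the population standard deviation.

#plain-Python illustration of sample vs. population standard deviation
import statistics

#height values from the student data used in the examples below
heights = [5.79, 3.79, 2.79, 3.69, 5.59]

#sample standard deviation (divides by n - 1), like stddev()/stddev_samp()
print(statistics.stdev(heights))    # 1.3030732903409539

#population standard deviation (divides by n), like stddev_pop()
print(statistics.pstdev(heights))   # 1.1655041827466772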

PySpark – stddev()

stddev() in PySpark is used to return the standard deviation from a particular column in the DataFrame.

Before that, we have to create PySpark DataFrame for demonstration.

Example:

We will create a dataframe with 5 rows and 6 columns and display it using the show() method.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
 'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
 'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
 'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
 'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
 'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame( students)

#display dataframe
df.show()

Output:

The DataFrame with the five student records (columns address, age, height, name, rollno, and weight) is displayed.

Method 1: Using the select() method

We can get the standard deviation of a column in the DataFrame using the select() method together with the stddev() function. To use this function, we have to import it from the pyspark.sql.functions module, and finally, we can use the collect() method to retrieve the standard deviation from the column.

Syntax:

df.select(stddev('column_name'))

Where,

  1. df is the input PySpark DataFrame
  2. column_name is the column to get the standard deviation

If we want to return the standard deviation from multiple columns, we have to use the stddev() method inside the select() method, specifying the column names separated by commas.

Syntax:

df.select(stddev('column_name'), stddev('column_name'), …, stddev('column_name'))

Where,

  1. df is the input PySpark DataFrame
  2. column_name is the column to get the standard deviation

Example 1: Single Column

This example will get the standard deviation from the height column in the PySpark dataframe.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the standard deviation - stddev function
from pyspark.sql.functions import stddev

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
 'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
 'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
 'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
 'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
 'height':5.59,'weight':54,'address':'hyd'}]


# create the dataframe
df = spark_app.createDataFrame( students)

#return the standard deviation from the height column
df.select(stddev('height')).collect()

Output:

[Row(stddev_samp(height)=1.3030732903409539)]

In the above example, the standard deviation from the height column is returned.
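
Note that collect() returns a list of Row objects. If only the numeric value is needed, it can be indexed out of the result; a minimal sketch, assuming the same df and import as above:

#collect() returns a list of Rows; index into the first Row for the value
std_height = df.select(stddev('height')).collect()[0][0]
print(std_height)   # 1.3030732903409539

#first() is an equivalent shortcut that returns only the first Row
print(df.select(stddev('height')).first()[0])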

Example 2: Multiple Columns

This example will get the standard deviation from the height, age, and weight columns in the PySpark dataframe.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the standard deviation - stddev function
from pyspark.sql.functions import stddev

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
 'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
 'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
 'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
 'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
 'height':5.59,'weight':54,'address':'hyd'}]


# create the dataframe
df = spark_app.createDataFrame( students)

#return the standard deviation from the height, age and weight columns
df.select(stddev('height'),stddev('age'),stddev('weight')).collect()

Output:

[Row(stddev_samp(height)=1.3030732903409539, stddev_samp(age)=12.157302332343306, stddev_samp(weight)=20.211382931407737)]

The standard deviation from the height, age, and weight columns is returned in the above example.

Method 2: Using the agg() method

We can get the standard deviation of a column in the DataFrame using the agg() method, which performs an aggregation over the values of the column. It takes a dictionary as a parameter, in which each key is a column name and the corresponding value is the aggregate function, i.e., 'stddev'. Finally, we can use the collect() method to retrieve the standard deviation from the column.

Syntax:

df.agg({'column_name': 'stddev'})

Where,

  1. df is the input PySpark DataFrame
  2. column_name is the column to get the standard deviation
  3. stddev is an aggregation function used to return the standard deviation

If we want to return the standard deviation from multiple columns, we have to specify each column name with the stddev function, separated by commas.

Syntax:

df.agg({'column_name': 'stddev', 'column_name': 'stddev', …, 'column_name': 'stddev'})

Where,

  1. df is the input PySpark DataFrame
  2. column_name is the column to get the standard deviation
  3. stddev is an aggregation function used to return the standard deviation

Example 1: Single Column

This example will get the standard deviation from the height column in the PySpark dataframe.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
 'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
 'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
 'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
 'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
 'height':5.59,'weight':54,'address':'hyd'}]


# create the dataframe
df = spark_app.createDataFrame( students)

#return the standard deviation from the height column
df.agg({'height': 'stddev'}).collect()

Output:

[Row(stddev(height)=1.3030732903409539)]

In the above example, the standard deviation from the height column is returned.

Example 2: Multiple Columns

This example will get the standard deviation from the height, age, and weight columns in the PySpark dataframe.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
 'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
 'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
 'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
 'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
 'height':5.59,'weight':54,'address':'hyd'}]


# create the dataframe
df = spark_app.createDataFrame( students)

#return the standard deviation from the height, age and weight columns
df.agg({'height': 'stddev','age': 'stddev','weight': 'stddev'}).collect()

Output:

[Row(stddev(weight)=20.211382931407737, stddev(age)=12.157302332343306, stddev(height)=1.3030732903409539)]

The standard deviation from the height, age, and weight columns is returned in the above example.
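
Besides the dictionary form, agg() also accepts column expressions from pyspark.sql.functions, which makes it possible to rename the result column with alias(). A minimal sketch, assuming the same df as above:

#pass column expressions to agg() and rename the results with alias()
from pyspark.sql.functions import stddev

df.agg(stddev('height').alias('height_stddev'),
       stddev('weight').alias('weight_stddev')).show()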

PySpark – stddev_samp()

stddev_samp() in PySpark is used to return the sample standard deviation from a particular column in the DataFrame. It is the same as the stddev() function, which is an alias for stddev_samp().

Before that, we have to create PySpark DataFrame for demonstration.

Example:

We will create a dataframe with 5 rows and 6 columns and display it using the show() method.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
 'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
 'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
 'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
 'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
 'height':5.59,'weight':54,'address':'hyd'}]


# create the dataframe
df = spark_app.createDataFrame( students)

#display dataframe
df.show()

Output:

The DataFrame with the five student records (columns address, age, height, name, rollno, and weight) is displayed.

Method 1: Using the select() method

We can get the standard deviation of a column in the DataFrame using the select() method together with the stddev_samp() function. To use this function, we have to import it from the pyspark.sql.functions module, and finally, we can use the collect() method to retrieve the standard deviation from the column.

Syntax:

df.select(stddev_samp('column_name'))

Where,

  1. df is the input PySpark DataFrame
  2. column_name is the column to get the standard deviation in a sample

If we want to return the sample standard deviation from multiple columns, we have to use the stddev_samp() method inside the select() method, specifying the column names separated by commas.

Syntax:

df.select(stddev_samp('column_name'), stddev_samp('column_name'), …, stddev_samp('column_name'))

Where,

  1. df is the input PySpark DataFrame
  2. column_name is the column to get the standard deviation for the given sample

Example 1: Single Column

In this example, we will get the standard deviation of a sample from the height column in the PySpark dataframe.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the standard deviation - stddev_samp function
from pyspark.sql.functions import stddev_samp

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
 'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
 'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
 'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
 'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
 'height':5.59,'weight':54,'address':'hyd'}]


# create the dataframe
df = spark_app.createDataFrame( students)

#return the standard deviation from the height column
df.select(stddev_samp('height')).collect()

Output:

[Row(stddev_samp(height)=1.3030732903409539)]

In the above example, the standard deviation from the height column is returned.
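
The value here is identical to the one returned by stddev() earlier, because stddev() is an alias for stddev_samp(); both compute the sample standard deviation. A minimal sketch comparing them side by side, assuming the same df:

#stddev() and stddev_samp() return the same sample standard deviation
from pyspark.sql.functions import stddev, stddev_samp

df.select(stddev('height'), stddev_samp('height')).show()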

Example 2: Multiple Columns

In this example, we will get the standard deviation of the sample from the height, age, and weight columns in the PySpark dataframe.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the standard deviation - stddev_samp function
from pyspark.sql.functions import stddev_samp

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
 'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
 'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
 'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
 'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
 'height':5.59,'weight':54,'address':'hyd'}]


# create the dataframe
df = spark_app.createDataFrame( students)

#return the standard deviation from the height, age and weight columns
df.select(stddev_samp('height'),stddev_samp('age'),stddev_samp('weight')).collect()

Output:

[Row(stddev_samp(height)=1.3030732903409539, stddev_samp(age)=12.157302332343306, stddev_samp(weight)=20.211382931407737)]

In the above example, the standard deviation from the height, age, and weight columns is returned.

Method 2: Using the agg() method

We can get the sample standard deviation of a column in the DataFrame using the agg() method, which performs an aggregation over the values of the column. It takes a dictionary as a parameter, in which each key is a column name and the corresponding value is the aggregate function, i.e., 'stddev_samp'. Finally, we can use the collect() method to retrieve the sample standard deviation from the column.

Syntax:

df.agg({'column_name': 'stddev_samp'})

Where,

  1. df is the input PySpark DataFrame
  2. column_name is the column to get the standard deviation of a sample
  3. stddev_samp is an aggregation function used to return the standard deviation of a sample

If we want to return the standard deviation from multiple columns, we have to specify each column name with the stddev_samp function, separated by commas.

Syntax:

df.agg({'column_name': 'stddev_samp', 'column_name': 'stddev_samp', …, 'column_name': 'stddev_samp'})

Where,

  1. df is the input PySpark DataFrame
  2. column_name is the column to get the standard deviation of a sample
  3. stddev_samp is an aggregation function used to return the standard deviation of a sample

Example 1: Single Column

This example will get the standard deviation from the height column in the PySpark dataframe.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
 'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
 'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
 'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
 'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
 'height':5.59,'weight':54,'address':'hyd'}]


# create the dataframe
df = spark_app.createDataFrame( students)

#return the standard deviation from the height column
df.agg({'height': 'stddev_samp'}).collect()

Output:

[Row(stddev_samp(height)=1.3030732903409539)]

In the above example, the standard deviation of a sample from the height column is returned.

Example 2: Multiple Columns

In this example, we will get the standard deviation of a sample from the height, age, and weight columns in the PySpark dataframe.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
 'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
 'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
 'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
 'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
 'height':5.59,'weight':54,'address':'hyd'}]


# create the dataframe
df = spark_app.createDataFrame( students)

#return the standard deviation from the height, age and weight columns
df.agg({'height': 'stddev_samp','age': 'stddev_samp','weight': 'stddev_samp'}).collect()

Output:

[Row(stddev_samp(weight)=20.211382931407737, stddev_samp(age)=12.157302332343306, stddev_samp(height)=1.3030732903409539)]

In the above example, the standard deviation from the height, age and weight columns is returned.
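
Because agg() is an aggregation, it also combines naturally with groupBy() when a per-group standard deviation is needed. A minimal sketch, assuming the same df, that computes the sample standard deviation of height for each address:

#per-group sample standard deviation of height, grouped by address
df.groupBy('address').agg({'height': 'stddev_samp'}).show()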

PySpark – stddev_pop()

stddev_pop() in PySpark is used to return the standard deviation of a population from a particular column in the DataFrame.

Before that, we have to create PySpark DataFrame for demonstration.

Example:

We will create a dataframe with 5 rows and 6 columns and display it using the show() method.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
 'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
 'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
 'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
 'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
 'height':5.59,'weight':54,'address':'hyd'}]


# create the dataframe
df = spark_app.createDataFrame( students)

#display dataframe
df.show()

Output:

The DataFrame with the five student records (columns address, age, height, name, rollno, and weight) is displayed.

Method 1: Using the select() method

We can get the standard deviation of a column in the DataFrame using the select() method together with the stddev_pop() function, which returns the population standard deviation of the column. To use this function, we have to import it from the pyspark.sql.functions module, and finally, we can use the collect() method to retrieve the standard deviation from the column.

Syntax:

df.select(stddev_pop('column_name'))

Where,

  1. df is the input PySpark DataFrame
  2. column_name is the column to get the standard deviation of a population

If we want to return the population standard deviation from multiple columns, we have to use the stddev_pop() method inside the select() method, specifying the column names separated by commas.

Syntax:

df.select(stddev_pop('column_name'), stddev_pop('column_name'), …, stddev_pop('column_name'))

Where,

  1. df is the input PySpark DataFrame
  2. column_name is the column to get the standard deviation for the given population

Example 1: Single Column

In this example, we will get the standard deviation of a population from the height column in the PySpark dataframe.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the standard deviation - stddev_pop function
from pyspark.sql.functions import stddev_pop

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
 'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
 'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
 'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
 'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
 'height':5.59,'weight':54,'address':'hyd'}]


# create the dataframe
df = spark_app.createDataFrame( students)

#return the standard deviation from the height column
df.select(stddev_pop('height')).collect()

Output:

[Row(stddev_pop(height)=1.1655041827466772)]

In the above example, the standard deviation from the height column is returned.

Example 2: Multiple Columns

In this example, we will get the standard deviation of the population from the height, age, and weight columns in the PySpark dataframe.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#import the standard deviation - stddev_pop function
from pyspark.sql.functions import stddev_pop

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
 'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
 'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
 'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
 'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
 'height':5.59,'weight':54,'address':'hyd'}]


# create the dataframe
df = spark_app.createDataFrame( students)

#return the standard deviation from the height, age and weight columns
df.select(stddev_pop('height'),stddev_pop('age'),stddev_pop('weight')).collect()

Output:

[Row(stddev_pop(height)=1.1655041827466772, stddev_pop(age)=10.87382177525455, stddev_pop(weight)=18.077610461562667)]

In the above example, the standard deviation from the height, age, and weight columns is returned.
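
The same population standard deviations can also be written as SQL expressions through the selectExpr() method; a minimal sketch, assuming the same df:

#equivalent SQL-expression form using selectExpr()
df.selectExpr('stddev_pop(height)', 'stddev_pop(age)', 'stddev_pop(weight)').show()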

Method 2: Using the agg() method

We can get the population standard deviation of a column in the DataFrame using the agg() method, which performs an aggregation over the values of the column. It takes a dictionary as a parameter, in which each key is a column name and the corresponding value is the aggregate function, i.e., 'stddev_pop'. Finally, we can use the collect() method to retrieve the population standard deviation from the column.

Syntax:

df.agg({'column_name': 'stddev_pop'})

Where,

  1. df is the input PySpark DataFrame
  2. column_name is the column to get the standard deviation of a population
  3. stddev_pop is an aggregation function used to return the standard deviation of a population

If we want to return the standard deviation from multiple columns, we have to specify each column name with the stddev_pop function, separated by commas.

Syntax:

df.agg({'column_name': 'stddev_pop', 'column_name': 'stddev_pop', …, 'column_name': 'stddev_pop'})

Where,

  1. df is the input PySpark DataFrame
  2. column_name is the column to get the standard deviation of a population
  3. stddev_pop is an aggregation function used to return the standard deviation of a population

Example 1: Single Column

This example will get the standard deviation from the height column in the PySpark dataframe.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
 'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
 'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
 'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
 'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
 'height':5.59,'weight':54,'address':'hyd'}]


# create the dataframe
df = spark_app.createDataFrame( students)

#return the standard deviation from the height column
df.agg({'height': 'stddev_pop'}).collect()

Output:

[Row(stddev_pop(height)=1.1655041827466772)]

In the above example, the standard deviation of the population from the height column is returned.

Example 2: Multiple Columns

In this example, we will get the standard deviation of the population from the height, age, and weight columns in the PySpark dataframe.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[
{'rollno':'001','name':'sravan','age':23,
 'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
 'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
 'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
 'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
 'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame( students)

#return the standard deviation from the height, age and weight columns
df.agg({'height': 'stddev_pop','age': 'stddev_pop','weight': 'stddev_pop'}).collect()

Output:

[Row(stddev_pop(weight)=18.077610461562667, stddev_pop(age)=10.87382177525455, stddev_pop(height)=1.1655041827466772)]

In the above example, the standard deviation from the height, age, and weight columns is returned.
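
To see all three functions side by side, they can be combined in a single select(); a minimal sketch, assuming the same df. stddev() and stddev_samp() agree, while stddev_pop() is slightly smaller because it divides by n instead of n - 1:

#compare the three standard deviation functions on the height column
from pyspark.sql.functions import stddev, stddev_samp, stddev_pop

df.select(stddev('height'), stddev_samp('height'), stddev_pop('height')).show()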

Conclusion

We discussed how to get the standard deviation from a PySpark DataFrame using the stddev(), stddev_samp(), and stddev_pop() functions through the select() and agg() methods.
