
Get PySpark DataFrame Information

In Python, PySpark is a Spark module that provides DataFrame-based processing similar to Spark. We can get PySpark DataFrame information such as the total number of rows and columns, DataFrame statistics, and the size of the DataFrame. Let's create a PySpark DataFrame for demonstration.

Example:
In this example, we are going to create a PySpark DataFrame with 5 rows and 6 columns and display it using the show() method.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students1 = [
{'rollno':'001','name':'sravan','age':23,
  'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
  'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
  'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
  'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
  'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students1)

# display dataframe
df.show()

Output:

+-------+---+------+----------------+------+------+
|address|age|height|            name|rollno|weight|
+-------+---+------+----------------+------+------+
| guntur| 23|  5.79|          sravan|   001|    67|
|    hyd| 16|  3.79|          ojaswi|   002|    34|
|  patna|  7|  2.79|gnanesh chowdary|   003|    17|
|    hyd|  9|  3.69|          rohith|   004|    28|
|    hyd| 37|  5.59|         sridevi|   005|    54|
+-------+---+------+----------------+------+------+

Scenario 1: Get the total number of rows

We can get the total number of rows in the PySpark DataFrame using the count() function.

Syntax:
dataframe.count()

where dataframe is the input PySpark DataFrame.

Example:
In this example, we will use the count() function to get the total number of rows.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students1 = [
{'rollno':'001','name':'sravan','age':23,
  'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
  'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
  'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
  'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
  'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students1)

# display the row count
print(df.count())

Output:
5
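
As a side note (not part of the original example), count() also works on the result of a transformation such as filter(); a minimal sketch, assuming the df defined above:

# count only the rows where age is greater than 10
print(df.filter(df.age > 10).count())

This prints 3, since three of the five students are older than 10.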

Scenario 2: Get the total number of columns

We can get the total number of columns in the PySpark DataFrame using the len() function with the columns attribute.

The columns attribute returns all the column names in a list, so we can apply the len() function to it to get the number of columns.

Syntax:
len(dataframe.columns)

where dataframe is the input PySpark DataFrame.

Example:
In this example, we will use the len() function to get the total number of columns and display the column names using the columns attribute.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students1 = [
{'rollno':'001','name':'sravan','age':23,
  'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
  'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
  'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
  'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
  'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students1)

# display the column count
print(len(df.columns))

# display the columns
print(df.columns)

Output:

6

['address', 'age', 'height', 'name', 'rollno', 'weight']
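
The introduction also mentioned the size of the DataFrame; one simple way to express it is the total number of cells, which we can compute by combining the two methods above. A minimal sketch, assuming the df defined above:

# size of the DataFrame as the total number of cells (rows * columns)
print(df.count() * len(df.columns))

This prints 30, since the DataFrame has 5 rows and 6 columns.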

Scenario 3: Get the statistics

We can get statistics such as the count, mean, standard deviation, minimum value, and maximum value from the PySpark DataFrame using the describe() method.

Syntax:
dataframe.describe()

where dataframe is the input PySpark DataFrame.

Note – There is no mean or standard deviation for string type values; in that case, the result is null.
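
If those null entries are not wanted, describe() also accepts column names, so we can restrict it to the numeric columns; a minimal sketch, assuming the df defined above:

# get statistics only for the numeric columns
df.describe('age', 'height', 'weight').show()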

Example:
In this example, we will use the describe() method to get the statistics.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students1 = [
{'rollno':'001','name':'sravan','age':23,
  'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
  'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
  'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
  'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
  'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students1)

df.describe().show()

Output:

From the above output, we can see that name is of string type, so null is shown for its mean and standard deviation.

We can also use summary() to return the statistics. It is similar to the describe() method, but it additionally returns the 25%, 50%, and 75% percentile values.
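
summary() also accepts the names of the statistics to compute, so we can, for example, request only the percentiles; a minimal sketch, assuming the df defined above:

# get only the 25%, 50% and 75% percentile values
df.summary("25%", "50%", "75%").show()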

Example:
In this example, we will use the summary() function to get the statistics.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students1 = [
{'rollno':'001','name':'sravan','age':23,
  'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
  'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
  'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
  'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
  'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students1)

# get the summary
df.summary().show()

Output:

Conclusion

In this article, we discussed the use of the describe() and summary() functions, which return the statistics of the input PySpark DataFrame. We have also seen that by using the len() function with the columns attribute we can get the total number of columns, and by using the count() method we can get the total number of rows of a PySpark DataFrame.
