
Get PySpark DataFrame Information

In Python, PySpark is a Spark module that provides DataFrame-based processing similar to Spark. We can get PySpark DataFrame information such as the total number of rows and columns, DataFrame statistics, and the size of the DataFrame. Let's create a PySpark DataFrame for demonstration.

Example:
In this example, we are going to create a PySpark DataFrame with 5 rows and 6 columns and display it using the show() method.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students1 = [
{'rollno':'001','name':'sravan','age':23,
  'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
  'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
  'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
  'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
  'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students1)

# display dataframe
df.show()

Output:

+-------+---+------+----------------+------+------+
|address|age|height|            name|rollno|weight|
+-------+---+------+----------------+------+------+
| guntur| 23|  5.79|          sravan|   001|    67|
|    hyd| 16|  3.79|          ojaswi|   002|    34|
|  patna|  7|  2.79|gnanesh chowdary|   003|    17|
|    hyd|  9|  3.69|          rohith|   004|    28|
|    hyd| 37|  5.59|         sridevi|   005|    54|
+-------+---+------+----------------+------+------+

Scenario 1: Get the total number of rows

We can get the total number of rows in the PySpark DataFrame using the count() function.

Syntax:
dataframe.count()

where dataframe is the input PySpark DataFrame.

Example:
In this example, we will use the count() function to get the total number of rows.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students1 = [
{'rollno':'001','name':'sravan','age':23,
  'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
  'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
  'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
  'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
  'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students1)

# display the row count
print(df.count())

Output:
5
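
As a side note (not part of the original example), count() also works on the result of a transformation such as filter(); a minimal sketch, assuming the df defined above:

# count only the rows where age is greater than 10
print(df.filter(df.age > 10).count())

This prints 3, since three of the five students are older than 10.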

Scenario 2: Get the total number of columns

We can get the total number of columns in the PySpark DataFrame using the len() function with the columns attribute.

The columns attribute returns all the column names in a list, so we can apply the len() function to it to get the number of columns.

Syntax:
len(dataframe.columns)

where dataframe is the input PySpark DataFrame.

Example:
In this example, we will use the len() function to get the total number of columns and display the column names using the columns attribute.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students1 = [
{'rollno':'001','name':'sravan','age':23,
  'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
  'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
  'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
  'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
  'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students1)

# display the column count
print(len(df.columns))

# display the columns
print(df.columns)

Output:

6

['address', 'age', 'height', 'name', 'rollno', 'weight']
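
The introduction also mentioned the size of the DataFrame; one simple way to express it is the total number of cells, which we can compute by combining the two methods above. A minimal sketch, assuming the df defined above:

# size of the DataFrame as the total number of cells (rows * columns)
print(df.count() * len(df.columns))

This prints 30, since the DataFrame has 5 rows and 6 columns.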

Scenario 3: Get the statistics

We can get statistics such as the count, mean, standard deviation, minimum value, and maximum value from the PySpark DataFrame using the describe() method.

Syntax:
dataframe.describe()

where dataframe is the input PySpark DataFrame.

Note – There is no mean or standard deviation for string type values; in that case, the result is null.
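
If those null entries are not wanted, describe() also accepts column names, so we can restrict it to the numeric columns; a minimal sketch, assuming the df defined above:

# get statistics only for the numeric columns
df.describe('age', 'height', 'weight').show()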

Example:
In this example, we will use the describe() method to get the statistics.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students1 = [
{'rollno':'001','name':'sravan','age':23,
  'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
  'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
  'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
  'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
  'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students1)

df.describe().show()

Output:

From the above output, we can see that name is of string type, so null is shown for its mean and standard deviation.

We can also use summary() to return the statistics. It is similar to the describe() method, but it additionally returns the 25%, 50%, and 75% percentile values.
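
summary() also accepts the names of the statistics to compute, so we can, for example, request only the percentiles; a minimal sketch, assuming the df defined above:

# get only the 25%, 50% and 75% percentile values
df.summary("25%", "50%", "75%").show()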

Example:
In this example, we will use the summary() function to get the statistics.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students1 = [
{'rollno':'001','name':'sravan','age':23,
  'height':5.79,'weight':67,'address':'guntur'},
{'rollno':'002','name':'ojaswi','age':16,
  'height':3.79,'weight':34,'address':'hyd'},
{'rollno':'003','name':'gnanesh chowdary','age':7,
  'height':2.79,'weight':17,'address':'patna'},
{'rollno':'004','name':'rohith','age':9,
  'height':3.69,'weight':28,'address':'hyd'},
{'rollno':'005','name':'sridevi','age':37,
  'height':5.59,'weight':54,'address':'hyd'}]

# create the dataframe
df = spark_app.createDataFrame(students1)

# get the summary
df.summary().show()

Output:

Conclusion

In this article, we discussed the use of the describe() and summary() functions, which return the statistics of the input PySpark DataFrame. We have also seen that by using the len() function with the columns attribute we can get the total number of columns, and by using the count() method we can get the total number of rows of a PySpark DataFrame.
