Apache Spark

PySpark – StructType & StructField

PySpark is the Python module for Apache Spark, which provides Spark-like data processing through DataFrames.

It provides the StructType() and StructField() classes, which are used to define the columns of a PySpark DataFrame.

Using these, we can define the column names and the data types of those columns.

Let’s discuss them one by one.

StructType()

This class is used to define the structure of the PySpark dataframe. It accepts a list of StructFields, each carrying a column name and data type, for the given dataframe. This is known as the schema of the dataframe; it stores a collection of fields.

StructField()

This class is used inside the StructType() of the PySpark dataframe. It accepts a column name along with its datatype and a nullable flag.

Syntax:

schema = StructType([
    StructField("column 1", datatype, True/False),
    StructField("column 2", datatype, True/False),
    ...,
    StructField("column n", datatype, True/False)
])

Here, schema is the schema object that is passed to the dataframe when it is created.

Parameters:

1. StructType() accepts a list of StructFields separated by commas.

2. StructField() is used to add columns to the dataframe; it takes the column name as the first parameter and the datatype of that column as the second parameter.

We have to use the data types imported from the pyspark.sql.types module.

The data types supported are:

  • StringType() – Used to store string values
  • IntegerType() – Used to store integer values (use LongType() for long integers)
  • FloatType() – Used to store Float values
  • DoubleType() – Used to store Double values

3. The Boolean value as the third parameter is the nullable flag: if it is True, the column is allowed to contain null values; if it is False, nulls are not allowed in that column.

We have to pass this schema to the DataFrame method along with data.

Syntax:

createDataFrame(data, schema=schema)

Example 1:

In this example, we create data as a list that contains 5 rows and 6 columns, and we assign the column names: rollno with the string data type, name with the string data type, age with the integer type, height with the float type, weight with the integer type, and address with the string data type.

Finally, we are going to display the dataframe using the show() method.

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#and import struct types and data types
from pyspark.sql.types import StructType,StructField,StringType,IntegerType,FloatType

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[['001','sravan',23,5.79,67,'guntur'],
               ['002','ojaswi',16,3.79,34,'hyd'],
               ['003','gnanesh chowdary',7,2.79,17,'patna'],
               ['004','rohith',9,3.69,28,'hyd'],
               ['005','sridevi',37,5.59,54,'hyd']]

#define the StructType and StructFields
#for the below column names
schema=StructType([
    StructField("rollno",StringType(),True),
    StructField("name",StringType(),True),
    StructField("age",IntegerType(),True),
    StructField("height", FloatType(), True),
    StructField("weight", IntegerType(), True),
    StructField("address", StringType(), True)
  ])
 
#create the dataframe and add schema to the dataframe
df = spark_app.createDataFrame(students, schema=schema)

#display the dataframe
df.show()

Output:

+------+----------------+---+------+------+-------+
|rollno|            name|age|height|weight|address|
+------+----------------+---+------+------+-------+
|   001|          sravan| 23|  5.79|    67| guntur|
|   002|          ojaswi| 16|  3.79|    34|    hyd|
|   003|gnanesh chowdary|  7|  2.79|    17|  patna|
|   004|          rohith|  9|  3.69|    28|    hyd|
|   005|         sridevi| 37|  5.59|    54|    hyd|
+------+----------------+---+------+------+-------+

If we want to display the dataframe schema, then we have to use the schema attribute.

This returns the schema as a StructType along with its columns.

Syntax:

DataFrame.schema

If we want to display the fields, then we have to use the fields attribute of the schema.

Syntax:

DataFrame.schema.fields

Example 2:

In this example, we are going to display the schema of the dataframe

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#and import struct types and data types
from pyspark.sql.types import StructType,StructField,StringType,IntegerType,FloatType

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[['001','sravan',23,5.79,67,'guntur'],
               ['002','ojaswi',16,3.79,34,'hyd'],
               ['003','gnanesh chowdary',7,2.79,17,'patna'],
               ['004','rohith',9,3.69,28,'hyd'],
               ['005','sridevi',37,5.59,54,'hyd']]

#define the StructType and StructFields
#for the below column names
schema=StructType([
    StructField("rollno",StringType(),True),
    StructField("name",StringType(),True),
    StructField("age",IntegerType(),True),
    StructField("height", FloatType(), True),
    StructField("weight", IntegerType(), True),
    StructField("address", StringType(), True)
  ])
 
#create the dataframe and add schema to the dataframe
df = spark_app.createDataFrame(students, schema=schema)

# display the schema
print(df.schema)

Output:

StructType(List(StructField(rollno,StringType,true), StructField(name,StringType,true), StructField(age,IntegerType,true), StructField(height,FloatType,true), StructField(weight,IntegerType,true), StructField(address,StringType,true)))

Example 3:

In this example, we are going to display the schema fields of the dataframe using schema.fields

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#and import struct types and data types
from pyspark.sql.types import StructType,StructField,StringType,IntegerType,FloatType

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[['001','sravan',23,5.79,67,'guntur'],
               ['002','ojaswi',16,3.79,34,'hyd'],
               ['003','gnanesh chowdary',7,2.79,17,'patna'],
               ['004','rohith',9,3.69,28,'hyd'],
               ['005','sridevi',37,5.59,54,'hyd']]

#define the StructType and StructFields
#for the below column names
schema=StructType([
    StructField("rollno",StringType(),True),
    StructField("name",StringType(),True),
    StructField("age",IntegerType(),True),
    StructField("height", FloatType(), True),
    StructField("weight", IntegerType(), True),
    StructField("address", StringType(), True)
  ])
 
#create the dataframe and add schema to the dataframe
df = spark_app.createDataFrame(students, schema=schema)

# display the schema fields
print(df.schema.fields)

Output:

[StructField(rollno,StringType,true), StructField(name,StringType,true), StructField(age,IntegerType,true), StructField(height,FloatType,true), StructField(weight,IntegerType,true), StructField(address,StringType,true)]

We can also use the printSchema() method to display the schema in a tree format.

Syntax:

DataFrame.printSchema()

Example 4:

Display the schema in tree format with printSchema() method

#import the pyspark module
import pyspark
#import SparkSession for creating a session
from pyspark.sql import SparkSession
#and import struct types and data types
from pyspark.sql.types import StructType,StructField,StringType,IntegerType,FloatType

#create an app named linuxhint
spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student data with 5 rows and 6 attributes
students =[['001','sravan',23,5.79,67,'guntur'],
               ['002','ojaswi',16,3.79,34,'hyd'],
               ['003','gnanesh chowdary',7,2.79,17,'patna'],
               ['004','rohith',9,3.69,28,'hyd'],
               ['005','sridevi',37,5.59,54,'hyd']]

#define the StructType and StructFields
#for the below column names
schema=StructType([
    StructField("rollno",StringType(),True),
    StructField("name",StringType(),True),
    StructField("age",IntegerType(),True),
    StructField("height", FloatType(), True),
    StructField("weight", IntegerType(), True),
    StructField("address", StringType(), True)
  ])
 
#create the dataframe and add schema to the dataframe
df = spark_app.createDataFrame(students, schema=schema)

# display the schema in tree format
df.printSchema()

Output:

root
 |-- rollno: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- height: float (nullable = true)
 |-- weight: integer (nullable = true)
 |-- address: string (nullable = true)

About the author

Gottumukkala Sravan Kumar

B.Tech (Hons) in Information Technology; known programming languages – Python, R, PHP, MySQL; published 500+ articles in the computer science domain.