
PySpark read.csv()

PySpark can construct a DataFrame from CSV data using the read.csv() function. If you want to load external data into a PySpark DataFrame, PySpark supports many formats like JSON, CSV, etc. In this tutorial, we will see how to read CSV data and load it into a PySpark DataFrame. We will also discuss loading multiple CSV files into a single DataFrame at a time, with examples.

pyspark.sql.DataFrameReader.csv()

This method is used to read the data from one or more CSV files and store it in a PySpark DataFrame. It accepts options while reading the CSV into the DataFrame; we will discuss the different options with examples in detail. When passing more than one CSV file, it is important to pass the file names (with extension) in a comma-separated list. If you are reading only one CSV file, there is no need to provide the file name in a list.

Syntax:

Single file - spark_app.read.csv('file.csv', options…)

Multiple files - spark_app.read.csv(['file1.csv','file2.csv',…], options…)

It is also possible to separate the options from the file names:

Single file - spark_app.read.options(options…).csv('file.csv')

Multiple files - spark_app.read.options(options…).csv(['file1.csv','file2.csv',…])

Install the PySpark library before implementing the following examples.

pip install pyspark

After the successful installation, you can verify it as shown below:
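A quick way to confirm the installation is to print the installed version (a minimal check; the exact version number on your machine may differ):

python -c "import pyspark; print(pyspark.__version__)"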

Scenario 1: Reading the CSV File Header

Let’s create a CSV file named “person_skill.csv” with 6 records and load it into the PySpark DataFrame:
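If you want to follow along, the following minimal sketch creates the file; the column names (id, person, and skill) and the values are assumptions made for illustration:

# Create the sample person_skill.csv; the column names and values
# here are assumed for illustration.
sample_csv = """id,person,skill
1,Sravan,Python
2,Bobby,Java
3,Meena,PHP
4,Rohith,HTML
5,Devi,SQL
6,Arun,R
"""

with open('person_skill.csv', 'w') as file:
    file.write(sample_csv)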

The header parameter specifies whether the first row of the CSV file holds the column names for the PySpark DataFrame. It takes a Boolean value. If it is “True”, the actual column names that exist in the CSV file are used in the DataFrame; otherwise, default names like _c0, _c1, _c2… are assigned and the actual column names become a data row. It is best practice to set the header parameter to “True”.

Example 1: Header = True

import pyspark
from pyspark.sql import SparkSession

linuxhint_spark_app = SparkSession.builder.appName('Linux Hint').getOrCreate()

# Load the CSV named person_skill.csv into skills, taking the column labels from the header
skills = linuxhint_spark_app.read.csv('person_skill.csv', header=True)

# Display the DataFrame
skills.show()

Output:
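With the hypothetical sample data created above, skills.show() prints a table similar to the following:

+--+------+------+
|id|person| skill|
+--+------+------+
| 1|Sravan|Python|
| 2| Bobby|  Java|
| 3| Meena|   PHP|
| 4|Rohith|  HTML|
| 5|  Devi|   SQL|
| 6|  Arun|     R|
+--+------+------+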

Explanation:

We can see that the PySpark DataFrame is created from the CSV file, with the column names taken from the file’s header row.

Use the following command to check the columns:

skills.columns
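With the assumed sample file, this returns the header row as the column names:

['id', 'person', 'skill']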

Example 2: Header = False

import pyspark
from pyspark.sql import SparkSession

linuxhint_spark_app = SparkSession.builder.appName('Linux Hint').getOrCreate()

# Load the CSV named person_skill.csv into skills without taking the column labels from the header
skills = linuxhint_spark_app.read.csv('person_skill.csv', header=False)

# Display the DataFrame
skills.show()

Output:
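With the same hypothetical sample data, the header row is now treated as an ordinary data row and default column names are assigned:

+---+------+------+
|_c0|   _c1|   _c2|
+---+------+------+
| id|person| skill|
|  1|Sravan|Python|
|  2| Bobby|  Java|
|  3| Meena|   PHP|
|  4|Rohith|  HTML|
|  5|  Devi|   SQL|
|  6|  Arun|     R|
+---+------+------+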

Explanation:

We can see that the PySpark DataFrame is created from the CSV file without the existing column names; default names are assigned instead.

Also, the existing column names are stored as a row in the PySpark DataFrame.

Check the columns again:

skills.columns
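With the assumed sample file, the default names are returned:

['_c0', '_c1', '_c2']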

Using the read.options().csv() Method

Now, we read the CSV file using the read.options().csv() method. Here, we pass the options like the delimiter, header, etc., as arguments to options() and the file name to csv(). Let’s pass the header parameter by setting it to “True”.

Scenario 1: Passing the Header Option

import pyspark
from pyspark.sql import SparkSession

linuxhint_spark_app = SparkSession.builder.appName('Linux Hint').getOrCreate()

# Using read.options().csv() to pass the header option
skills = linuxhint_spark_app.read.options(header=True).csv('person_skill.csv')

# Display the DataFrame
skills.show()

Output:

Scenario 2: Reading the CSV File Delimiter

The delimiter parameter takes the character that is used to separate each field. By default, it is a comma (,). PySpark also accepts sep as an equivalent option, which is used in a later example. Let’s use the same CSV file that is used in the first scenario and pass the comma (',') as the delimiter.

import pyspark
from pyspark.sql import SparkSession

linuxhint_spark_app = SparkSession.builder.appName('Linux Hint').getOrCreate()

# Using read.options().csv() with the delimiter along with the header
skills = linuxhint_spark_app.read.options(header=True, delimiter=',').csv('person_skill.csv')

# Display the DataFrame
skills.show()

Output:

Reading Multiple Files

Until now, we have been reading a single CSV file. Let’s see how to read more than one CSV file. In this scenario, the rows from multiple files are appended to a single PySpark DataFrame. We just need to pass the file names in a list to the method.

Example:

Let’s have two CSV files named “person_skill.csv” and “person_skill2.csv”. The first is the six-record file created earlier; the second holds three records, as sketched below.
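Here is a minimal sketch that creates the second file, assuming the same three columns and three made-up records:

# Create the sample person_skill2.csv; the values are assumed for illustration.
sample_csv2 = """id,person,skill
7,Kavya,C++
8,Vikram,Go
9,Anil,Rust
"""

with open('person_skill2.csv', 'w') as file:
    file.write(sample_csv2)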


Read these two CSV files and store them in a single PySpark DataFrame.

import pyspark
from pyspark.sql import SparkSession

linuxhint_spark_app = SparkSession.builder.appName('Linux Hint').getOrCreate()

# Load the two CSVs named person_skill.csv and person_skill2.csv into skills, taking the column labels from the header
skills = linuxhint_spark_app.read.csv(['person_skill.csv', 'person_skill2.csv'], sep=',', header=True)

skills.show()

Output:

Explanation:

The first CSV holds 6 records and the second CSV holds 3 records. We can see that the first CSV is loaded into the DataFrame first, followed by the second CSV. Finally, the PySpark DataFrame holds 9 records.
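To confirm the total, count the rows; with the sample files above, this returns 9:

skills.count()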

Conclusion

Reading a CSV into a PySpark DataFrame is quite simple with the pyspark.sql.DataFrameReader.csv() method. The header and delimiter parameters can be passed to this method to specify the column names and the field separator. PySpark also supports reading multiple CSV files at a time with the provided methods, along with their options. In this article, we went through examples with different options and saw two ways of passing the options to the method.
