Python Pandas

Pandas Read_Parquet

A parquet file is a column-oriented data file format that is designed to store the data efficiently. If you want to load the parquet file into the Pandas DataFrame, the pandas.read_parquet() function is used. In this guide, we will discuss how to read the parquet file into the Pandas DataFrame by considering different parameters that are passed to this function.

Pandas.Read_Parquet()

 The pandas.read_parquet() function is used to load the parquet from the path (file/URL). It returns the Pandas DataFrame.

Syntax:

Let’s see the syntax of the pandas.read_parquet() function with the parameters in detail:

pandas.read_parquet(file/path location to the file, engine, columns, storage_options, use_nullable_dtypes, dtype_backend)
  1. file – We can directly pass the parquet from the location.
  2. columns – If you want to load the parquet file into DataFrame with specific columns, the columns are specified in a list and assigned to the “columns” parameter. All columns are loaded if it is not specified.
  3. engine – The parquet is an open source and there are many different Python libraries and engines that can be used to read the parquet. The supported engines are “auto” (by default), “pyarrow”, and “fastparquet”.
  4. storage_options – (by default = None) We can specify the storage connection with the localhost, username, password, etc. as additional options.

Parquet File Creation from Pandas DataFrame

Before reading the parquet file into the DataFrame object, we need a parquet file. So, let’s create a DataFrame and generate a “camps.parquet” parquet file from the DataFrame.

import pandas

camps = pandas.DataFrame([['Marketing','Conference','Completed',1200],
                          ['Sales','Trade show','Completed',1500],
                          ['Service','Conference','Planned',2500],
                          ['Technical','Conference','Aborted',5000],
                          ['Others','Public Relations','Planned',6000],
                          ],columns=['Campaign_Name', 'Campaign_Type', 'Status', 'Budget'])

# Generate parquet file from the above DataFrame
camps.to_parquet('camps.parquet')

The parquet file is generated. Let’s utilize this to read into the Pandas DataFrame.

Example 1: With the Default Parameters

Let’s read the “camps.parquet” file to the DataFrame by passing the file name to the read.parquet() function. By default, it reads all the columns and the engine is “auto”.

import pandas

# Read camps.parquet file
DataFrame_from_parquet = pandas.read_parquet('camps.parquet')
print(DataFrame_from_parquet)

Output:

All the columns are loaded into the DataFrame from the parquet file.

Example 2: With the “Columns” Parameter

Now, we read the parquet file into the DataFrame by passing the “columns” parameter.

  1. Read the camps.parquet file with the “Status” and “Budget” columns.
  2. Read the camps.parquet file with the “Campaign_Name” column.
import pandas

# Read camps.parquet file with Status and Budget
DataFrame_from_parquet = pandas.read_parquet('camps.parquet',columns=['Status','Budget'])
print(DataFrame_from_parquet,"\n")

# Read camps.parquet file with Campaign_Name column
DataFrame_from_parquet = pandas.read_parquet('camps.parquet',columns=['Campaign_Name'])
print(DataFrame_from_parquet)

Output:

In the first output, the DataFrame is created with two columns – “Status” and “Budget”. In the second output, the DataFrame is created with one column – “Campaign_Name”.

Example 3: With the Fastparquet Engine

The fastparquet is a Python implementation of the parquet format which integrates big-data workflows in Python. In your Python Env, install the fastparquet module first.

Command: pip install fastparquet

Let’s set the engine to “fastparquet”.

import pandas
import fastparquet

# Read camps.parquet file with engine as 'fastparquet'
DataFrame_from_parquet = pandas.read_parquet('camps.parquet',engine='fastparquet')
print(DataFrame_from_parquet,"\n")

Output:

Example 4: With the Pyarrow and PyArrow Engines Utility Function

Basically, Pyarrow allows data sharing between the data science tools and languages in an efficient way. Pandas uses the PyArrow to improve the API performance. Let’s read the parquet file by setting the engine to “pyarrow”.

import pandas

# Read camps.parquet file with engine as 'pyarrow'
DataFrame_from_parquet = pandas.read_parquet('camps.parquet',engine='pyarrow')

print(DataFrame_from_parquet,"\n")

Output:

While working with larger datasets, the processing speed is an important factor. Using the PyArrow utility function is the best option which speeds up the load.

  1. First, we need to load the file using the read_table() function that is available in the pyarrow.parquet module.
  2. Using the to_pandas() function, we convert the table data into the Pandas DataFrame.
import pandas
import pyarrow.parquet

# Read Table from camps.parquet
table = pyarrow.parquet.read_table('camps.parquet')
print(table,"\n")

# Convert Table to pandas DataFrame
DataFrame_from_parquet = table.to_pandas()
print(DataFrame_from_parquet)

Output:

The first output is the pyarrow table and the second output is the Pandas DataFrame is created from the pyarrow table.

Conclusion

Now, we are able to load the parquet file into the Pandas DataFrame using the pandas.read_parquet() function. In this guide, we learned the different examples by considering the parameters with the output screenshots. To improve the data processing speed while loading the parquet file into the Pandas DataFrame, we utilized the PyArrow utility function.  Specific columns can be loaded into the Pandas DataFrame by specifying the “columns” parameter.

About the author

Gottumukkala Sravan Kumar

B tech-hon's in Information Technology; Known programming languages - Python, R , PHP MySQL; Published 500+ articles on computer science domain