Python Pandas

Outer Join Pandas

In this guide, we will join the pandas DataFrames so that all the rows (matched/unmatched) are joined based on parameters such as columns, indices, etc. This type of join is known as ‘Outer Join.’ Python supports three functions to perform any kind of join. In all these functions, we need to provide the type of join (Outer) such that Outer Join functionality is performed. For each method, examples are discussed with code snippets and outputs.

Using pandas.DataFrame.join()

This function will join the columns from the first DataFrame with the second DataFrame using either the index or specified column. You can refer to this tutorial for pandas.DataFrame.join().

Example 1: Join Using Index Labels

Let’s create two DataFrames – marketing_camps (First DataFrame) and sales_camps (Second DataFrame). First DataFrame holds three columns – [‘Name’, ‘Budget’, ‘Type’] with three records and second DataFrame holds two columns – [‘Name’, ‘Saled?’, ‘Status’] with three records.

Use the join() function to perform outer join based on index labels, and specify the lsuffix and rsuffix parameters. This will differentiate the resulting DataFrame columns by adding the specified suffix to the existing DataFrame columns that are similar. Here, Name is the column that exists in both the DataFrames.

import pandas

# Create DataFrame - marketing_camps with 3 columns
marketing_camps = pandas.DataFrame({'Name':["Company A","Company B","Company C"],
                    'Budget' :[24000,35000,12000],
                    'Type':['Conference','Trade Show','Webinar']},index=['camp1','camp2','camp3'])
print(marketing_camps,"\n")

# Create DataFrame - sales_camps with 2 columns
sales_camps = pandas.DataFrame({'Name':["Company A","Company N","Company C"],
                    'Saled?' :['Yes','No','Yes'],
                    'Status':['Aborted','Completed','In Progress']},index=['camp1','camp4','camp3'])
print(sales_camps,"\n")

# Using join()
marketing_camps.join(sales_camps, lsuffix="_from_Marketing", rsuffix="_from_Sales", how='outer')

Output

Explanation

In the resulting DataFrame, we can observe that ‘Name’ from the first DataFrame has been replaced with ‘Name_from_Marketing’ and ‘Name,’ from the second DataFrame has been replaced with ‘Name_from_Sales’ in the joined DataFrame.

  1. Index – ‘camp1’ exists in both Dataframes, so all the values are not null.
  2. Index – ‘camp2’ exists in marketing_camps but is not present in the sales_camps. so NaN values are filled under columns for the second DataFrame.
  3. Index – ‘camp3’ exists in both the Dataframes, so all the values are not null.
  4. Index – ‘camp2’ exists in sales_camps but is not present in the marketing_camps, so NaN values are filled under columns for the sales_camps.

Example 2: Join Using Column

We will use the above DataFrames in this example but not having indices. We will use the join() function to perform outer join based on the ‘Name’ column.

import pandas

# Create DataFrame - marketing_camps with 3 columns
marketing_camps = pandas.DataFrame({'Name':["Company A","Company B","Company C"],
                    'Budget' :[24000,35000,12000],
                    'Type':['Conference','Trade Show','Webinar']})
print(marketing_camps,"\n")

# Create DataFrame - sales_camps with 2 columns
sales_camps = pandas.DataFrame({'Name':["Company A","Company N","Company C"],
                    'Saled?' :['Yes','No','Yes'],
                    'Status':['Aborted','Completed','In Progress']})
print(sales_camps,"\n")

# Using join() - Join two DataFrames based on Name column
marketing_camps.join(sales_camps.set_index('Name'), on='Name', how='outer')

Output

Explanation

  1. ‘Company A’ exists in both DataFrames so there are no missing values in this row.
  2. ‘Company B’ exists in marketing_camps but not in the second DataFrame, so the records under the sales_camps are filled with NaN.
  3. ‘Company C’ exists in both the DataFrames, so there are no missing values in this row.
  4. ‘Company N’ exists in sales_camps but not in marketing_camps, so records under the marketing_camps are filled with NaN.

1. Using pandas.merge()

pandas.merge() is used to merge the DataFrames based on columns or indices. A detailed tutorial on this function is available. In this scenario, make sure that you specify how = ‘outer’ is specified in all the examples.

Example 1

Let’s create two DataFrames – marketing_camps (First DataFrame) and sales_camps (Second DataFrame). First DataFrame holds three columns – [‘Name’, ‘Budget’, ‘Type’] with three records and second DataFrame holds two columns – [‘Name’, ‘Sales?’, ‘Status’] with three records.

Use merge() to perform outer join having the same column name, ‘Name,’ as the key in both DataFrames. To achieve this, specify the ‘on’ parameter with ‘Name’ and set ‘how’ to ‘outer’.

import pandas

# Create DataFrame - marketing_camps with 3 columns
marketing_camps = pandas.DataFrame({'Name':["Company A","Company B","Company C"],
                    'Budget' :[24000,35000,12000],
                    'Type':['Conference','Trade Show','Webinar']},index=['camp1','camp2','camp3'])
print(marketing_camps,"\n")

# Create DataFrame - sales_camps with 2 columns
sales_camps = pandas.DataFrame({'Name':["Company A","Company N","Company C"],
                    'Saled?' :['Yes','No','Yes'],
                    'Status':['Aborted','Completed','In Progress']},index=['camp1','camp4','camp3'])
print(sales_camps,"\n")

# Use merge() to perform outer join having same column name as Key in
# both the DataFrames
marketing_camps.merge(sales_camps, how='outer',on='Name')

Output

Explanation

  1. ‘Company A’ exists in both the DataFrames, so there are no missing values in this row.
  2. ‘Company B’ exists in the marketing_camps but not in the sales_camps, so the records under the sales_camps are filled with NaN.
  3. ‘Company C’ exists in both the DataFrames, so there are no missing values in this row.
  4. ‘Company N’ exists in the sales_camps but not in the marketing_camps, so the records under the marketing_camps are filled with NaN.

Example 2

Consider two DataFrames (marketing_camps & sales_camps) with different column labels, so we need to pass the left_on and right_on parameters along with the how parameter. The DataFrames are merged based on the ‘Name’ column from the first DataFrame and ‘Company_Name’ column from the second DataFrame. So, pass the left_on parameter with ‘Name’ and right_on parameter with ‘Company_Name.’

import pandas

# Create DataFrame - marketing_camps with 3 columns
marketing_camps = pandas.DataFrame({'Name':["Company A","Company B","Company C"],
                    'Budget' :[24000,35000,12000],
                    'Type':['Conference','Trade Show','Webinar']},index=['camp1','camp2','camp3'])
print(marketing_camps,"\n")

# Create DataFrame - sales_camps with 2 columns
sales_camps = pandas.DataFrame({'Company_Name':["Company A","Company N","Company C"],
                    'Saled?' :['Yes','No','Yes'],
                    'Status':['Aborted','Completed','In Progress']},index=['camp1','camp4','camp3'])
print(sales_camps,"\n")

# Use merge() to perform outer join having different column name as Key in
# both the DataFrames
marketing_camps.merge(sales_camps, how='outer',left_on='Name',right_on='Company_Name')

Output

Explanation

  1. ‘Company A’ exists in both the DataFrames, so there are no missing values in this row.
  2. ‘Company B’ exists in the marketing_camps but not in the sales_camps, so records under the sales_camps are filled with NaN.
  3. ‘Company C’ exists in both the DataFrames, so there are no missing values in this row.
  4. ‘Company N’ exists in the sales_camps but not in the marketing_camps, so records under the marketing_camps are filled with NaN.

2. Using pandas.concat()

pandas.concat() concatenates the DataFrames along rows or columns. This is controlled by specifying the axis parameter. join=’outer’ is passed as a parameter to this function to perform the outer join.

  1. Set the axis parameter to 0 to concatenate two DataFrames along the rows.
  2. Set the axis parameter to 1 to concatenate DataFrames along the columns.

Example 1: Concatenate across Rows

Let’s consider two DataFrames such that both DataFrames hold three records each. Set the axis parameter to 0 to concatenate two DataFrames along the rows.

import pandas

# Create DataFrame - marketing_camps with 3 columns
marketing_camps = pandas.DataFrame({'Name':["Company A","Company B","Company C"],
                    'Budget' :[24000,35000,12000],
                    'Type':['Conference','Trade Show','Webinar']})
print(marketing_camps,"\n")

# Create DataFrame - sales_camps with 2 columns
sales_camps = pandas.DataFrame({'Name':["Company A","Company N","Company C"],
                    'Status':['Aborted','Completed','In Progress'],
                                'Budget' :[24000,35000,12000],})
print(sales_camps,"\n")

# Use concat() to append the rows
pandas.concat([marketing_camps,sales_camps],axis=0,join='outer',ignore_index = True)

Output

Explanation

For the first three records, status is NaN because the column does not exist in the first DataFrame and similarly for the last three records, type is NaN because the column does not exist in the second DataFrame.

Example 2: Concatenate across Column

Let’s consider two DataFrames with index labels. Set the axis parameter to 1 to concatenate two DataFrames along the columns.

import pandas

# Create DataFrame - marketing_camps with 3 columns
marketing_camps = pandas.DataFrame({'Name':["Company A","Company B","Company C"],
                    'Budget' :[24000,35000,12000],
                    'Type':['Conference','Trade Show','Webinar']},index=['camp1','camp2','camp3'])
print(marketing_camps,"\n")

# Create DataFrame - sales_camps with 2 columns
sales_camps = pandas.DataFrame({'Name':["Company A","Company B","Company C"],
                    'Budget' :[24000,35000,12000],
                    'Type':['Conference','Trade Show','Webinar']},index=['camp1','camp2','camp4'])
print(sales_camps,"\n")

# Use concat() to concatenate the DataFrames horizontally
pandas.concat([marketing_camps,sales_camps],axis=1,join='outer')

Output

Explanation

  1. ‘camp1’ row exists in both the columns, so there are no missing values in the first row.
  2. ‘camp2’ row exists in both the columns, so there are no missing values in the second row.
  3. ‘camp3’ row exists in the marketing_camps but not in the sales_camps. so NaN is filled under the columns of the sales_camps.
  4. ‘camp4’ row exists in the sales_camps but not in the marketing_camps. so NaN is filled under the columns of the marketing_camps.

Conclusion

Now, we are able to perform the outer join on the two pandas DataFrames using pandas.DataFrame.join(), pandas.merge() and pandas.concat() functions. All these functions accept the outer join as a parameter that will result in the DataFrame by performing Outer join. Mostly, we utilized the same DataFrames in all the examples with or without indices.

About the author

Gottumukkala Sravan Kumar

B tech-hon's in Information Technology; Known programming languages - Python, R , PHP MySQL; Published 500+ articles on computer science domain