How To Extract Unique Values From the Pandas Column?
There are several ways to find unique values in pandas. The two most common ways to extract unique values from a column are the unique() function and the drop_duplicates() function. Before using these functions, let’s look at their syntax.
Syntax of unique() Function: Series.unique( )
Returns: ndarray or ExtensionArray
Syntax of drop_duplicates() Function: DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
subset: A column label or a list of column labels. The default value is None, which means all columns are used. When columns are passed, only those columns are compared when identifying duplicates.
keep: Controls which of the duplicate rows is kept. It accepts three distinct values and is ‘first’ by default.
- If ‘first’, the first occurrence is kept as the unique value, and every later repetition is treated as a duplicate and dropped.
- If ‘last’, the last occurrence is kept as the unique value, and every earlier repetition is treated as a duplicate and dropped.
- If False, every occurrence of a repeated value is treated as a duplicate, so all of them are dropped.
inplace: Boolean value, False by default. If True, the duplicate rows are removed from the original DataFrame itself instead of returning a new one.
Returns: A DataFrame with the duplicate rows eliminated, or None if inplace is True.
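The effect of the three keep options is easiest to see side by side. The following is a minimal sketch; the small “Salary” DataFrame is made up purely for illustration and is not one of the article’s examples.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the keep parameter.
df = pd.DataFrame({"Salary": [1000, 1000, 1300, 1100, 1300]})

# keep="first": the first occurrence of each value survives.
first = df.drop_duplicates(subset="Salary", keep="first")

# keep="last": the last occurrence of each value survives.
last = df.drop_duplicates(subset="Salary", keep="last")

# keep=False: every value that occurs more than once is dropped entirely.
no_repeats = df.drop_duplicates(subset="Salary", keep=False)

print(first.index.tolist())       # row labels kept with keep="first"
print(last.index.tolist())        # row labels kept with keep="last"
print(no_repeats.index.tolist())  # row labels kept with keep=False
```

Printing the index labels rather than the whole DataFrame makes it clear which rows were kept in each case.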
Now that we have seen the syntax, let’s move on to the examples to learn how to extract unique values from a pandas column.
Example # 01: Get Unique Values From Pandas Columns by Using the unique() Method
When working with a single column of a DataFrame, the “pandas.Series.unique()” method is used, because a single column of a DataFrame is a Series object. It returns all the unique elements of the column. The method returns an array that holds the distinct elements in the order in which they first appear, without their index labels. Let’s create a DataFrame first, so we can use the unique() function to extract unique values from its columns.
After importing the pandas module, we created our DataFrame from a Python dictionary. We defined the keys of our dictionary as “Name” and “Courses” and assigned this dictionary to the variable “dic”. The “dic” variable is then passed to the pd.DataFrame() method as an argument to create the “df” DataFrame. We can view our DataFrame by using the print() function.
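The steps described above can be sketched as follows. The course list is the one used in this article, but the student names are placeholders assumed for illustration, since the original names are not given in the text.

```python
import pandas as pd

# The student names are placeholders (an assumption);
# the course list follows the article.
dic = {
    "Name": ["Alex", "Bella", "Chris", "Dana", "Eli", "Fay", "Gus", "Hana"],
    "Courses": ["English", "Maths", "Chemistry", "Maths",
                "Statistics", "Maths", "English", "Datascience"],
}

# Build the DataFrame from the dictionary and display it.
df = pd.DataFrame(dic)
print(df)
```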
Let’s suppose our DataFrame consists of student names and the courses in which they are enrolled. In such a situation, it is tedious to go through the DataFrame row by row to identify each distinct course and determine the overall number of courses studied. In the previous DataFrame, the column “Courses” contains the names of the courses (‘English’, ‘Maths’, ‘Chemistry’, ‘Maths’, ‘Statistics’, ‘Maths’, ‘English’, ‘Datascience’). Some courses are studied by more than one student. So, to get the unique courses from the “Courses” column, we will use the unique() function.
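A minimal sketch of this step is shown below. It rebuilds only the “Courses” column so that the snippet runs on its own; the variable name unique_courses is chosen here for illustration.

```python
import pandas as pd

# Only the "Courses" column is needed for this step.
df = pd.DataFrame({
    "Courses": ["English", "Maths", "Chemistry", "Maths",
                "Statistics", "Maths", "English", "Datascience"],
})

# Selecting one column gives a Series; unique() returns its
# distinct values in order of first appearance.
unique_courses = df["Courses"].unique()
print(unique_courses)
```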
In the output, we get an array containing the unique courses in our DataFrame. Suppose you want to count the total number of distinct elements rather than list the unique values in a column. For this purpose, we can use the nunique() function. Called on a single column, nunique() returns the number of distinct values as one integer; called on the whole DataFrame, it returns the count for each column.
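Counting the distinct courses could look like the sketch below, again with the “Courses” column rebuilt so the snippet is self-contained. The variable name course_count is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["English", "Maths", "Chemistry", "Maths",
                "Statistics", "Maths", "English", "Datascience"],
})

# nunique() on a single column returns the number of distinct values.
course_count = df["Courses"].nunique()
print(course_count)
```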
The nunique() function has returned “5”, which means there are a total of 5 unique values in the ‘Courses’ column of the ‘df’ DataFrame.
Example # 02: Get Unique Values From Numeric Columns by Using the unique() Method
To create a DataFrame, we will import the pandas module first. Then, we will create our DataFrame using the pd.DataFrame() function.
As seen above, we have created the DataFrame by passing a dictionary inside the DataFrame() function. To visualize the newly created DataFrame, we will use the print() function.
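For reference, a DataFrame with the “Age” and “Salary” values used in this example can be built as in the following sketch.

```python
import pandas as pd

# Numeric data for the example: ages and salaries of eight individuals.
df = pd.DataFrame({
    "Age": [20, 24, 20, 22, 21, 28, 31, 25],
    "Salary": [1000, 1000, 1300, 1100, 1400, 1000, 1100, 1400],
})

print(df)
```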
In this DataFrame, we have two columns, “Age” and “Salary”, both holding numeric data. In the column “Age”, we have the ages of individuals (20, 24, 20, 22, 21, 28, 31, 25), while the “Salary” column stores the salaries of the individuals (1000, 1000, 1300, 1100, 1400, 1000, 1100, 1400). Now, we will use the unique() function to get the distinct values from the columns of the DataFrame.
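Applied to the “Salary” column, this step might look like the sketch below. The DataFrame is rebuilt so the snippet runs on its own, and the variable name unique_salaries is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [20, 24, 20, 22, 21, 28, 31, 25],
    "Salary": [1000, 1000, 1300, 1100, 1400, 1000, 1100, 1400],
})

# Distinct salaries, in the order in which they first appear.
unique_salaries = df["Salary"].unique()
print(unique_salaries)
```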
As the previous script shows, we used the unique() function to get distinct values from the “Salary” column. The function has returned the output in the form of an array [1000, 1300, 1100, 1400] containing all the unique values from the “Salary” column in the order in which they first appear. We can also use the sort() function to sort the result into ascending order.
To sort the output array (with unique values from the Salary column), we assigned the array to a variable ‘u’. The sort() function is then applied to the array; it sorts the values in place, in ascending order, and does not return a new array.
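The sorting step can be sketched as follows, assuming the same “Salary” values as above. Because sort() works in place, the array is printed afterwards rather than printing the result of the call.

```python
import pandas as pd

df = pd.DataFrame({
    "Salary": [1000, 1000, 1300, 1100, 1400, 1000, 1100, 1400],
})

# Store the unique values, then sort the array in place (ascending).
u = df["Salary"].unique()
u.sort()
print(u)
```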
Example # 03: Get Unique Values From Multiple Columns by Using the unique() Method
We have learned how to extract a set of distinct values from a single column of a DataFrame. In some situations, however, you may need to find distinct values across multiple columns. In such circumstances, before using the unique() function on the Series (column) object, we will combine the values of the columns from which we want to get the unique values. We will use the same DataFrame that we created in example # 2.
Suppose we want to get the distinct values from the ‘Age’ and ‘Salary’ columns. First, we will merge the data of both columns using the following script.
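The sketch below shows one way to do this in current pandas. Note that it uses pd.concat() rather than the Series append() method, because append() was removed in pandas 2.0; the variable names combined and unique_values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [20, 24, 20, 22, 21, 28, 31, 25],
    "Salary": [1000, 1000, 1300, 1100, 1400, 1000, 1100, 1400],
})

# Stack the two columns into one Series. On pandas versions before 2.0,
# df["Age"].append(df["Salary"]) produced the same values.
combined = pd.concat([df["Age"], df["Salary"]], ignore_index=True)

# Distinct values across both columns, in order of first appearance.
unique_values = combined.unique()
print(unique_values)
```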
In the previous code, we selected the data from the ‘Age’ column and then used append() to merge the data of the ‘Salary’ column with the data of the ‘Age’ column. Note that the Series append() method was removed in pandas 2.0; on current versions, pd.concat() combines the two columns in the same way. After merging the data, we used the unique() function to get the distinct values from both columns.
As can be seen, we have successfully extracted the unique values from both columns.
Example # 04: Using the drop_duplicates() Function To Get Unique Values From Pandas Columns
The drop_duplicates() function is a built-in method of pandas DataFrame and Series objects. It can be used to remove repeating values or duplicate data from a DataFrame’s column. The rows with duplicate values are removed, while the data type of the object or its subset is preserved. The drop_duplicates() method is a convenient option for eliminating duplicate values when working with a large amount of data.
Now, we will use the drop_duplicates() function to eliminate the rows that have duplicate values in the “Salary” column.
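A minimal sketch of this step is given below, assuming the DataFrame from example # 2 and the default keep=‘first’. The variable names deduplicated and salary_values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [20, 24, 20, 22, 21, 28, 31, 25],
    "Salary": [1000, 1000, 1300, 1100, 1400, 1000, 1100, 1400],
})

# Drop every row whose "Salary" value has already appeared;
# the first occurrence of each salary is kept (keep="first").
deduplicated = df.drop_duplicates(subset="Salary")
print(deduplicated)

# To get only the unique values of the column, call the method
# on the column (a Series) instead of the whole DataFrame.
salary_values = df["Salary"].drop_duplicates()
print(salary_values.tolist())
```

Unlike unique(), which returns a plain array, drop_duplicates() returns a pandas object that keeps the original index labels of the surviving rows.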
As you can see, every row whose “Salary” value had already appeared earlier in the column has been removed. Only the first occurrence of each salary is left in the DataFrame.
In this article, we discussed how to get unique values from the columns of a DataFrame in pandas. After going through this tutorial, you should be able to extract unique values from a pandas column on your own. We worked through a few examples that show how to get unique values from text and numeric pandas columns by using the unique() function and the drop_duplicates() function.