There are two scenarios to consider when calculating the median of a dataset:
- You have an odd number of values in your dataset
- You have an even number of values in your dataset
If you have an odd number of values, the solution is straightforward: sort the values and take the one in the middle.
If you have an even number of values, there is no single middle value, so you take the mean of the two middle numbers. In other words, you sort the values, add the two that lie in the middle of the dataset, and divide their sum by two. For example, the median of 3, 5, 9 is 5, while the median of 3, 5, 7, 9 is (5 + 7) / 2 = 6.
In this article, we are going to focus on how you can find the median of a dataset in your Amazon Redshift database. Redshift is a widely used AWS data warehousing service for running complex queries and big data analysis jobs.
Syntax to use the MEDIAN function
If you are working with Redshift, you can find the median of a dataset using the following syntax:
MEDIAN ( median_expression )
Here, median_expression is the column name or expression for which you want to find the median.
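To try the syntax on something small, the following sketch builds a throwaway table and computes one median per group. The table median_demo and its columns grp and val are hypothetical names made up for this illustration; they are not part of the schema used elsewhere in this article.

```sql
-- Hypothetical scratch table for illustration only.
create temp table median_demo (grp varchar(8), val integer);

insert into median_demo values
    ('odd', 3), ('odd', 9), ('odd', 5),
    ('even', 3), ('even', 9), ('even', 5), ('even', 7);

-- One median per group:
--   'odd'  has a single middle value after sorting (3, 5, 9).
--   'even' has two middle values after sorting (3, 5, 7, 9), which are averaged.
select grp, median(val) as median_val
from median_demo
group by grp
order by grp;
```

Assuming the rows load as written, the odd group should report 5 and the even group 6, although the exact numeric formatting depends on the column type and your SQL client.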
Examples of using the MEDIAN function
Now, let’s take an example where you want to find the median age of the students in a class. You have a table class_data with two columns, name and age, in the Redshift cluster.
The data is unordered, and you want its median. To find the median for such a dataset, you would write the following query in Redshift:
select median(age)
from organization.class_data;
The query is short, yet it returns the median of whatever column or expression you pass to it. In practice, the MEDIAN function is often just one small part of a larger query in a data analysis or statistical job.
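As a small illustration of MEDIAN sitting inside a slightly larger query, the sketch below reports it next to a few other aggregates over the same class_data table. The column aliases are made up for readability, and the query assumes that age is stored as a numeric column.

```sql
-- Sketch only: summary statistics for the age column of class_data.
-- Aliases (students, youngest, oldest, median_age) are illustrative names.
select count(*)    as students,
       min(age)    as youngest,
       max(age)    as oldest,
       median(age) as median_age
from organization.class_data;
```

Comparing the median with the minimum and maximum gives a quick sense of whether the ages cluster toward one end of the range.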
You have seen how to find the median of a column belonging to a certain database table. Let’s move on to a higher level and see how the MEDIAN function can be used in complex Redshift queries.
Conditional Case
Here, you are going to see how you can add a condition while finding the median of a dataset. Suppose you are working as an IT specialist in the sales and revenue department of your organization. Your management asks you to find the median cost of the projects costing over one thousand dollars, but right now all the projects are listed in a single database table without any segregation by cost.
You already know how to find the median for the whole cost column. Here the requirement is a bit different, because only some of the rows should count, so you will use the following query to get the desired result:
select median(cost)
from organization.projects
where cost > 1000;
This query returns a single value: the median calculated after excluding every project whose cost is 1000 or less.
This is how you can combine the MEDIAN function with a WHERE condition to get your required results in the Redshift database.
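To see the effect of the filter on concrete numbers, here is a minimal, self-contained sketch. The table projects_demo, its columns, and the five rows are hypothetical sample data invented for this illustration; they are not the contents of organization.projects.

```sql
-- Hypothetical sample data for illustration only.
create temp table projects_demo (project_name varchar(32), cost integer);

insert into projects_demo values
    ('alpha', 400), ('beta', 800), ('gamma', 1200),
    ('delta', 1500), ('epsilon', 3000);

-- Only gamma, delta and epsilon pass the filter (1200, 1500, 3000),
-- so the median is taken over those three costs.
select median(cost) as median_cost_over_1000
from projects_demo
where cost > 1000;
```

With these rows the filtered median should be 1500, whereas the median over all five rows would be 1200, which shows that the WHERE clause is applied before the median is computed.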
MEDIAN Window function
In case you are not familiar with window functions, they are used when you don’t want to collapse the whole table or column into a single result. A window function lets you apply a function over a partition, that is, a specific group of rows, and every row in the output carries the result computed for its own partition. You can find corresponding window functions for many SQL functions in Amazon Redshift.
Suppose you are going to start a new project, but your company lacks some of the skills needed to complete it. For this reason, you want to outsource part of the project, and you have asked several organizations to provide quotations. Each client has come up with three different execution plans, out of which you need to choose one.
Now, you want to go with the moderate plan for each client. To find it, you first need the median quotation for each client separately. You will use the MEDIAN window function for this task.
over (partition by client_name)
from organization.clients
order by client_name;
In the output, the median for each client is calculated and displayed separately, because the OVER clause partitions the rows by client_name.
This is how you can use the MEDIAN window function in Amazon Redshift. The same approach works for more complex and much bigger datasets as well.
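The partitioning behaviour can be sketched on a small, self-contained example. The table quotes_demo, the columns plan_name and quotation, and all the rows below are hypothetical; the article's own clients table may use different column names.

```sql
-- Hypothetical sample data: two clients with three plans each.
create temp table quotes_demo (
    client_name varchar(32),
    plan_name   varchar(16),
    quotation   integer
);

insert into quotes_demo values
    ('Acme',   'basic', 1000), ('Acme',   'standard', 1800), ('Acme',   'premium', 3200),
    ('Globex', 'basic', 1200), ('Globex', 'standard', 1500), ('Globex', 'premium', 4000);

-- Every row keeps its own columns and also carries the median
-- quotation of the client (partition) it belongs to.
select client_name,
       plan_name,
       quotation,
       median(quotation) over (partition by client_name) as median_quotation
from quotes_demo
order by client_name, quotation;
```

With these rows, every Acme row should show a median of 1800 and every Globex row a median of 1500. The plan whose quotation equals the median is the moderate option for that client.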
Conclusion
If you want to find the median of a dataset in Amazon Redshift, you can do so with the Redshift MEDIAN function. It calculates the median for a complete column, for a filtered subset of rows, or, as a window function, for each group of values separately. This blog walked through each of these cases to make the usage of the MEDIAN function clear.