A database is a collection of data that is stored in a computer system. SQL is one of the most universal programming language that can perform thousands of functions including reading, writing, and manipulating data.
It is the standard programming language for the management of relational databases and is used by computer scientists, data analysts, and data scientists throughout the world.
In this guide, we will go over SQL statements or queries that are used extensively in data science.
This guide uses PostgreSQL software to run the queries. You can download it from the official website by clicking here.
With that being said, let’s begin!1
1. Selecting All Columns
The SELECT statement is one of the most fundamental SQL Query. This statement helps you to select some or all columns from the database. You will be using this function throughout your data science projects.
To select all the columns, use ‘*’ as shown in the following example:
“Product” is a table that is consists of 3 columns. After running the previous query, we can look into the columns that are present inside the Product table.
2. FROM Clause
FROM clause comes after SELECT and helps to identify where the column is located in the table by name.
When specifying multiple tables, use commas and space between the table names. This is shown in the following example:
The output is more specific with only one column. This is shown in the following:
3. WHERE Clause
The next statement after FROM is WHERE. It helps us to filter the data from the database based on certain conditions.
Output:
4. GROUP BY Statement
This clause is used to select and group the rows that have the same values.
In the given SQL query, we will group two rows where the Product_Price is equal to 6.
The output will have a new column named count. It is followed by the Product_Name column as shown in the following:
5. HAVING Statement
Aggregation functions such as SUM, AVG, MAX, MIN, etc. can’t be used within the WHERE clause. In order to use them, you have to use the HAVING statement. This is demonstrated with the help of the following code:
Output:
6. ORDER BY Statement
As from the name, ORDER BY command is used to list the records in ascending or descending order. For ascending, use ORDER BY alone or add the keyword “ASC” along it. Similarly, to sort the values in descending order, use the keyword “DESC” along with ORDER BY.
For descending order:
Output:
Similarly, for ascending order:
Output:
7. LIMIT Statement
LIMIT is an important SQL statement that helps you to specify the number of records to return from the database. For example, to return only two rows from our database, use the LIMIT Command as follows:
Output:
8. Pre-Defined DATA Functions
Pre-defined functions are very important in SQL as they help the Data Scientists save time. These functions are also called aggregate functions which work on a set of rows instead of a single row and return a single value. There are many different aggregate functions. Some important ones are listed in the following:
COUNT(*) –> Returns the number of rows.
MIN(<Column Name>) –> Finds the minimum value in the column.
MAX(<Column Name>) –> Finds the maximum value in the column.
SUM(<Column Name>) –> The sum of all the values inside the column name.
What’s the max price in our data table?
Output:
9. INNER, LEFT, and RIGHT JOINS
There are many types of JOINs but the three main ones are discussed here.
INNER JOIN is the simplest and most common JOIN that is used to create a new table that has matching rows in both left and right columns.
Output:
The LEFT JOIN returns all the rows in the left column and the matching rows in the right column.
Output:
Here, we joined the Product_profit column from table2 with the Product_Name column of table 1.
The RIGHT JOIN is the opposite of the LEFT JOIN. It returns all the rows present in the right column and the matching rows in the left column.
10. Subqueries
Last but not the least, we have Subquery which is also known as an inner query. It is nested inside a larger query which can occur inside either the SELECT, FROM, or WHERE clause.
A subquery is not limited to these clauses. It can also occur in other clauses such as DELETE, INSERT, or UPDATE.
Another important feature of the subquery is that you can use the logical operators like >, <, or =. A subquery is also executed first before its parent query.
For example:
Output:
Conclusion
In this article, we went over the important SQL queries for Data Scientists. These queries are very common and can be used by professionals such as Data Engineers, Data Analysts, etc. The best way to learn all these queries is to create your own database and then practice as much as you can. With this guide as the foundation, you can now learn about more advanced queries for Data Science.