Databases and SQL — Intermediate

Tanja Adžić
17 min readNov 17, 2023

--

Introduction

Let’s take a look at more advanced SQL which are imprtant for navigating the querying landscape, including some example of these statements being used.

Photo by Growtika on Unsplash

Contents

  1. Refining results using String Patterns, Sorting, and Grouping
  2. Functions, Multiple Tables, and Subqueries
  3. Views, Stored Procedures, and Transactions
  4. JOIN Statements

Refining results using String Patterns, Sorting and Grouping

String Patterns and Ranges

The main purpose of a database management system is not just to store the data, but also facilitate retrieval of the data. In its simplest form, a SELECT statement has this syntax:

SELECT * FROM tablename;

In this case, we retrieve all the rows from the selected table. We can restrict the results by selecting specifin columns:

SELECT column1, column2 FROM tablename;

We can also restrict results by using the WHERE clause:

SELECT column1, column2 FROM tablename
WHERE condition;

String Patterns

What if we find ourselves uncertain about the exact value to specify in a WHERE clause? The WHERE clause always requires a predicate, which is a condition that evaluates to True, False, or Unknown. However, there are situations where we might not have precise information for the predicate. In a relational database, we can employ string patterns to search for data rows that match a specific condition.

For instance, imagine we can’t recall the author’s name but remember that it begins with the letter ‘R’. In such cases, we can use the WHERE clause with the LIKE predicate. The LIKE predicate is utilized in a WHERE clause to search for a pattern within a column:

SELECT firstname FROM Author
WHERE firstname LIKE 'R%';

Here, the percent sign, known as a wildcard character, is used to represent missing letters. It can be placed before the pattern, after the pattern, or both. Wildcard characters serve as placeholders for other characters. In this example, we employ the percent sign after the pattern ‘R’, which retrieves all rows in the author table where the author’s first name begins with ‘R’.

Ranges

When our aim is to retrieve a list of books with a specific range of page numbers — say, more than 290 but less than 300 — we can efficiently specify this condition using a numeric range. Instead of employing comparison operators like “greater than” or “equal to,” we utilize the BETWEEN AND comparison operator:

SELECT title, pages FROM Book
WHERE pages BETWEEN 290 AND 300;

Sets of Values

If we want to determine the countries from which the authors originate, we could write a SELECT statement to retrieve authors from Australia or Brazil by using the WHERE clause and repeating the two country values. However, what if our objective is to retrieve authors from Canada, India, and China? In such cases, the WHERE clause would become cumbersome with repeated listings of the required country conditions. Fortunately, we have a more efficient alternative in the form of the IN operator. The IN operator enables us to specify a set of values within a WHERE clause, taking a list of expressions for comparison.

SELECT firstname, lastname, country FROM Author
WHERE country IN ('CAN', 'IN', 'CHI') ;

Sorting Result Sets

To display the result set in alphabetical order, we incorporate the ORDER BY”clause into the SELECT statement. The ORDER BY clause is employed within a query to arrange the result set based on a designated column.

SELECT title FROM Book
ORDER BY title;

In this instance, we have utilized ORDER BY with the “title” column to arrange the result set. By default, the result set is sorted in ascending order, which in this example, organizes the result set alphabetically by book title. To achieve a descending order, we can use the keyword DESC. Consequently, the result set is now arranged according to the specified “title” column, and it is sorted in a descending order.

Another approach to specify the sorting column is by indicating the column’s sequence number.

SELECT title, pages FROM book 
ORDER BY 2;

In this example, the query utilizes the column sequence number within the query to determine the sorting order. Instead of explicitly stating the column name “pages,” we use the number “two.” Within the SELECT statement, the second column specified in the column list is “pages,” and therefore, the sorting order is determined by the values within the “pages” column. In this context, the “pages” column signifies the number of pages in each book. As a result, the displayed result set is organized in ascending order based on the number of pages.

Grouping Result Sets

Eliminating duplicates — the DISTINCT clause

In the example with the author table, the “country” column contains two-letter country codes indicating each author’s country of origin. When we select only the “country” column, we obtain a comprehensive list of all the countries. In this scenario, the result set comprises 20 rows, one for each of the 20 authors. Nevertheless, some authors hail from the same country, resulting in duplicate entries within the result set. However, our objective is simply to compile a list of the unique countries the authors represent, making duplicates redundant. To eliminate these duplicates and streamline the result set, we use the DISTINCT keyword, resulting in a condensed result set with just six distinct rows.

SELECT DISTINCT (country) FROM Author;

GROUP BY

But what if our interest extends beyond simply knowing the unique countries of origin for the 20 authors? We might also want to know how many authors originate from the same country, gaining insights into the distribution of authors across these countries. To obtain this comprehensive view, listing the countries alongside the count of authors from each country, we introduce the GROUP BY clause into the SELECT statement.

The GROUP BY clause allows us to categorize the result set into subsets characterized by matching values in one or more columns.

SELECT country, COUNT(country) FROM Author
GROUP BY country;

In this particular case, the countries are grouped together, and then we employ the COUNT function to tally the number of authors within each group, providing a clear perspective on the distribution of authors by country.

In the result set, the second column is derived through the use of the COUNT function. Rather than relying on the default column name “2,” we have the flexibility to assign a more meaningful name to this derived column using the AS keyword. In this instance, we transform the column name from “2” to “Count” using the “AS Count” keyword, enhancing the clarity and interpretability of the result set.

SELECT country, COUNT(country) AS Count FROM Author
GROUP BY country;

HAVING Clause

Now that we have the count of authors originating from various countries, we can refine our result set by setting specific conditions. For instance, we might want to determine if there are more than four authors coming from the same country. To establish conditions within a GROUP BY clause, we use the HAVING keyword.

SELECT country, COUNT(country) AS Count FROM Author
GROUP BY country
HAVING COUNT(country) > 4;

The HAVING clause is utilized in conjunction with the GROUP BY clause and allows us to filter and analyze aggregated data based on specific criteria. It’s crucial to recognize that while the WHERE clause applies to the entire result set, the HAVING clause exclusively functions in conjunction with the GROUP BY clause, making it a powerful tool for conditionally selecting grouped data.

Functions, Multiple Tables, and Subqueries

Built-in Database Functions

Built-in functions come with most databases, even if we are able to first fetch data from the database and use an application to perform operations on it. These functions:

  • can be included as part of SQL statements
  • can significantly reduce the amount of data that needs to be retrieved
  • can speed up data processing

Aggregate or Column Functions

An aggregate function takes a collection of like values, such as all of the values in a column, as input, and returns a single value or null. Here are most common aggregate functions:

  • SUM() adds up all the values in a column which can be renamed
SELECT SUM(cost) AS sum_of_cost
FROM House;
  • MIN() returns the minimum value
  • MAX() returns the maximum value
SELECT MAX(quantity) FROM House;
SELECT MIN(id) FROM House
WHERE id < 100;
  • AVG() returns the average — the mean value
SELECT AVG(cost) FROM House;

We can also perform mathematical operations between columns, for example:

SELECT AVG(cost / quantity) FROM House
WHERE id < 100;

Scalar and String Functions

Scalar functions perform operations on individual values. Some examples include:

  • ROUND() which rounds a number, LENGTH(), UCASE, LCASE
SELECT ROUND(cost) FROM House;
  • LENGTH() is used on strings, and it retrieves the length of each value in a column
SELECT LENGTH(street_name) FROM House;
  • UCASE, LCASE — Uppercase and lowercase functions can be used to return uppercase or lowercase values of strings
SELECT UCASE(street_name) FROM House;
SELECT * FROM House
WHERE LCASE(street_name) = 'Elm Street';

Date and Time Built-in Functions

Most databases include specialized data types for handling dates and times, for example DB2, offering date, time, and timestamp types.

  • DATE consists of eight digits representing the year, month, and day: YYYYMMDD
  • TIME employs six digits for hours, minutes, and seconds: HHMMSS
  • TIME, on the other hand, utilizes 20 digits, encompassing year, month, day, hour, minute, seconds, and microseconds: YYYYXXDDHHMMSSZZZZZZ

Various functions are available to extract specific date and time components, such as DAY(), MONTH(), YEAR(), DAYOFMONTH(), DAYOFWEEK(), DAYOFYEAR(), WEEK(), HOUR(), MINUTE(), and SECOND().

Let’s explore some practical examples of queries involving date and time functions. The DAY function, for instance, allows us to extract the day portion from a date. For instance, to retrieve the day portion for each rescue date involving cats, we can execute the following query:

SELECT DAY(RESCUEDATE) FROM PETRESCUE WHERE ANIMAL = 'cat';

Date and time functions can also be employed within the WHERE clause to filter data. For instance, to ascertain the number of rescues occurring in the month of May (Month 5), we can execute:

SELECT COUNT(*) FROM PETRESCUE WHERE MONTH(RESCUEDATE) = 05;

Furthermore, you can perform date or time arithmetic. For instance, if you need to determine the date that falls three days after each rescue date, perhaps for processing purposes within a three-day window, you can use:

SELECT (RESCUEDATE + 3 DAYS) FROM PETRESCUE;

Additionally, special registers for CURRENT_TIME and CURRENT_DATE are available. To calculate the number of days that have elapsed between each rescue date and the current date, you can use:

SELECT (CURRENT_DATE - RESCUEDATE) FROM PETRESCUE;

The result will be provided in terms of years, months, and days, offering valuable insights into the temporal aspects of your data.

Sub-queries and Nested SELECT’s

Sub-queries, also known as sub-selects, function similarly to regular queries but are enclosed within parentheses and nested inside another query. This approach allows us to construct more robust and intricate queries that would have been possible otherwise. Here is an example:

SELECT colum1 FROM tablename
WHERE column2 = (SELECT MAX(column2) FROM tablename);

In this example, the sub-query is inside the where clause of another query.

Why do we use Sub-queries?

Let’s say we want to retrieve the list of employees who earn more than the average salary. An approach that would be considered here is this:

SELECT * FROM employees
WHERE salary > AVG(salary);

RESULT:
ERROR-INVALID USE OF AGGREGATE FUNCTION

One of the limitations of built-in aggregate functions, like the AVG() function, is that they cannot always be evaluated in the WHERE clause. Therefore, we use a sub-query:

SELECT firstname, lastname, salary FROM employees
WHERE salary < (SELECT AVG(salary) FROM employees);

Column Expressions

Sub-selects are not limited to the WHERE clause alone; they can be incorporated into various sections of the query, including the list of columns to be selected. These sub-queries, when used in this context, are referred to as column expressions.

As an example, if we wanted to compare the salary of each employee with the average salary, we could use the GROUP BY function, or use a subquery placed in the list of columns:

SELECT emp_id, salary, 
(SELECT AVG(salary) FROM employees) AS avg_salary
FROM employees;

Sub-queries in FROM clause

Another approach is to integrate the sub-query into the FROM clause, a technique often referred to as derived tables or table expressions. In such cases, the outer query utilizes the outcomes of the sub-query as a data source. Let’s see an example where we create a table expression containing non-sensitive employee information:

SELECT * FROM
(SELECT emp_id, firstname, lastname, dep_id FROM employees)
AS employee_for_all;

The derived table within a sub-query deliberately excludes sensitive fields such as date of birth or salary. While this example may appear simplistic, it showcases the concept. In practice, utilizing derived tables becomes especially potent in more complicated scenarios, such as when dealing with multiple tables and performing joins. In such situations, they provide a structured and secure means of managing and extracting specific subsets of data.

Working with Multiple Tables

There are several ways to access multiple tables in the same query:

  • Sub-queries
  • Implicit JOINS
  • JOIN queries (INNER JOIN, OUTER JOIN…)

Accessing multiple tables using Sub-queries

If we want to retrieve only the employee records from the employees table for which a department ID exists in the departments table, we can use a sub-query as follows:

SELECT * FROM employees
WHERE dep_id IN (SELECT dep_id_dep FROM departments);

Here the outer query accesses the employees table and the sub-query on the departments table is used for filtering the result set of the outer query.

Suppose our objective is to obtain a list of employees from a particular location. In the employees table, there is no direct location information available. However, the departments table includes a column named “location ID” that can provide us with the necessary location data. Therefore, we can use a sub-query from the departments table and utilize it as input for the employee table query:

SELECT * FROM employees
WHERE dep_id IN (SELECT dep_id_dep FROM departments
WHERE loc_id = 'L0002');

Now, let’s retrieve the department ID and department name for employees who earn more than $70,000. To do so, we will need a sub-query on the employees table to satisfy the salary criteria, and then feed it as input to an outer query on the departments table in order to get the matching department info:

SELECT dep_id_dep, dep_name FROM departments
WHERE dep_id IN
(SELECT dep_id FROM employees
WHERE salary > 70000);

Accessing multiple tables with Implicit JOIN

We can also access multiple tables by specifying them in the FROM clause of the query:

SELECT * FROM employees, departments;

Here, we specify two tables within the FROM clause, leading to a table join. It’s important to note that we’re not explicitly using the join operator. The resulting join in this example is called a full join or Cartesian join, as every row from the first table is paired with every row from the second table. Consequently, when examining the result set, there will be a greater number of rows compared to the individual row counts of both tables.

We can use additional operands to limit the result set. Let’s look at an example where we limit the result set to only rows with matching department ID’s:

SELECT * FROM employees, departments
WHERE employees.dep_id = departments.dep_id_dep;

Since the table names can sometimes be long, we can use shorter aliases for table names as shown here:

SELECT * FROM employees E, departments D
WHERE E.dep_id = D.dep_id_dep;

The column names within the SELECT clause can also be preceded by aliases for clarity and differentiation:

SELECT E.emp_id, D.dep_ID
FROM employees E, departments D
WHERE E.dep_id = D.dep_id_dep;

Views, Stored Procedures, and Transactions

Views

A view is an alternative way of representing data that exists in one or more tables or views.

  1. A view can include all or some of the columns from one or more base tables or existing views.
  2. Creating a view creates a named specification of a results table, which can be queried in the same way as a table.
  3. You can also change the data in the base table by running insert, update, and delete queries against the view.
  4. When you define a view, the definition of the view is stored. The data that the view represents is stored in the base tables, not by the view itself.

We can use Views to:

  • Show a selection of data for a given table (useful for ommiting sensitive data)
  • Combine two or more tables in meaningful ways
  • Simplify access to data by granting access to a View and not the table
  • Show only portions of data in the table relevant to the process that uses the View

Syntax of a View:

CREATE VIEW viewname (column1, column2)
AS SELECT column1, column2 FROM tablename
WHERE condition;

For example, if we would want to create a View that does not contain sensitive infromation about employees, it would look like this:

CREATE VIEW empinfo (emp_id, firstname, lastname, address, job_id, manager_id,
dep_id)
AS SELECT emp_id, f_name, l_name, address, job_id, manager_id, dep_id
FROM employees
WHERE manager_id = '3002';


to view:
SELECT * FROM empinfo;

To delete a View, we simply use the DROP statement:

DROP VIEW empinfo;

Stored Procedures

A stored procedure is a set of SQL statements that are stored and executed on the database server. Instead of sending multiple SQL statements from the client to server, we encapsulate them in a stored procedure on the server and send one statement from the client to execute them.

Stored procedures can be written in many different languages: SQL PL, Java, C, and other. They can accept information as parameters, perform CRUD operations, and return results to the clients application.

Benefits of stored procedures:

  • Reduction in network traffic (one call for multiple statements)
  • Improvement in performance (processing happens on the server)
  • Reuse of code (multiple applications use the same stored procedure)
  • Increase in security (no need to expose table information, and use server to validate data before it enters the system)

CREATE PROCEDURE Statement:

CREATE PROCEDURE UPDATE_SAL (IN empNum CHAR(6), IN rating SMALLINT)
LANGUAGE SQL
BEGIN
IF rating = 1 THEN
UPDATE employees
SET salary = salary * 1.10
WHERE emp_id = empNum;
ELSE
UPDATE employee
SET salary = salary * 1.05
WHERE emp_id = empNum;
END IF;
END

Firstly, we use the CREATE PROCEDURE statement, specifying the procedure’s name and any parameters it will accept. In this example, the UPDATE_SAL procedure is designed to take an employee number and a rating, which we will use to update an employee’s salary based on their rating. Then, we declare the programming language we intend to use. Subsequently, we encapsulate our procedural logic within the BEGIN and END statements. In this particular scenario, we provide employees with a rating of 1 a 10% pay increase and all others receive a 5% raise. It’s worth noting that we can directly utilize the information passed as parameters in our procedural logic.

We have the capability to invoke stored procedures from external applications or within dynamic SQL statements. To invoke the UPDATE_SAL stored procedure that we have just created, we use the CALL statement, specifying the procedure’s name and passing the required parameters — in this case, the employee ID and the employee’s rating.

ACID Transactions

A transaction is an indivisible unit of work. It can consist of one or more SQL statements, but to be considered successful, either all of those SQL statements must complete successfully, leaving the database in a new stable state, or none must complete, leaving the database as it was before the transaction began.

For example, if you make a purchase using your bank card, many things must happen:

  • The product must be added to your cart
  • Your payment must be processed
  • Your account must be debited the correct amount and the store’s account credited
  • The inventory for that product must be reduced by the number purchased

ACID Transactions

The types of transactions in the example are referred to as ACID transactions:

  1. Atomic — All changes must be executed successfully or not at all.
  2. Consistent — Data must maintain a consistent state before and after the transaction.
  3. Isolated — No other process can alter the data while the transaction is in progress.
  4. Durable — The changes made by the transaction must endure and persist.

Here are the steps for managing an ACID transaction:

  1. BEGIN: To start an ACID transaction, use the BEGIN command.
  2. COMMIT: If all the commands complete successfully, issue a COMMIT command to save everything in the database to a consistent, stable state.
  3. ROLLBACK: If any of the commands fail, such as in the case where a bank account lacks sufficient funds for a payment, you can issue a ROLLBACK command to undo all the changes and return the database to its previous consistent and stable state.

JOIN Statements

A simple SELECT statement retrieves data from one or more columns from a single table. The next level of complexity is retrieving data from two or more tables. This leads to multiple possibilities of how the result set can be generated.

To combine data from two tables, we use the JOIN operator. A JOIN combines the rows from two or more tables based on a relationship between certain columns in these tables.

First we need to identify the relationship between these tables. That is, the column or columns in each table to use as a link between the tables. A primary key uniquely identifies each row in a table. A foreign key refers to to a primary key of another table.

If we wanted to combine 3 or more tables, we simply add them to the JOIN.

Types of joins

SQL offers you several different types of JOINs:

  • INNER JOIN — displays only the rows from two tables that have matching value in a common column, usually the primary key of one table that exists as a foreign key in the second table.
  • OUTER JOIN — which return matching rows, and even the rows from one or the other table that don’t match. There are many varieties of outer join that you can use to refine your result set.

INNER JOIN

In an inner join, only the rows from two tables that have matching values in a common column are displayed. Typically, this common column is the primary key of one table, which exists as a foreign key in the second table. Here’s the syntax for the SELECT statement in an inner join:

SELECT B.borrower_id, B.lastname, B.country, L.borrower_id, L.loan_date
FROM borrower B
INNER JOIN loan L ON B.borrower_id = L.borrower_id;

Here, we want to retrieve a list of all people who are borrowing books, along with the date of the loan, requiring data from the borrower table and the loan table. In the FROM clause, we specify the join between the borrower table and the loan table as “BORROWER INNER JOIN LOAN.” We identify the borrower table as “B,” and the loan table as “L.” The table specified on the left of the JOIN clause is known as the left table — in this case, the borrower table is the left table.

For this join, we select borrower ID, last name, and country from the borrower table, and the borrower ID and the loan date from the loan table. In the ON clause, we specify the JOIN predicate, which is the condition that the borrower ID in the borrower table is equal to the borrower ID in the loan table.

Notice that in this join, each column name is prefixed with either the letter “B” or “L.” In SQL, this is referred to as an alias. Using an alias is much easier than rewriting the whole table name. The result set displays only the rows from both tables that have the same borrower ID.

OUTER JOIN

Outer joins, unlike inner joins, return the rows from each table that have matching values in the join columns. In contrast to inner joins, outer joins also return the rows that do not have a match between the tables. SQL provides three types of outer joins:

  • left outer join
  • right outer join
  • full outer join
OUTER JOINs

LEFT JOIN

In a left outer join, all the rows from the first table (on the left side of the join predicate) are included, and only the matching rows from the second table (on the right side of the join predicate).

In this diagram, a Left Join matches all the rows from the left table and combines the information with rows from the right table that match the criteria specified in the query.

An example of a LEFT JOIN syntaxt is given below:

SELECT B.borrower_id, B.lastname, B.country, L.borrower_id, L.loan_date
FROM borrower B
LEFT JOIN loan L ON B.borrower_id = L.borrower_id;

RIGHT JOIN

In a right outer join, all the rows from the first table (on the left side of the join predicate) are included, and only the matching rows from the second table (on the right side of the join predicate).

In the above diagram, a Right Join matches all the rows from the right table and combines the information with rows from the left table that match the criteria specified in the query.

Here is an example of a RIGHT JOIN:

SELECT B.borrower_id, B.lastname, B.country, L.borrower_id, L.loan_date
FROM borrower B
RIGHT JOIN loan L ON B.borrower_id = L.borrower_id;

FULL JOIN

A full join returns all rows from both the right table and the left table. So, the FULL JOIN can return a very large result set.

In the above diagram, the result set of a RIGHT JOIN is all rows from both tables matching the criteria specified in the query, plus all non-matching rows from the RIGHT table.

Here is the syntax of a FULL JOIN:

SELECT B.borrower_id, B.lastname, B.country, L.borrower_id, L.loan_date
FROM borrower B
FULL JOIN loan L ON B.borrower_id = L.borrower_id;

Disclaimer: The notes and information presented in this blog post were compiled during the course “Databases and SQL for Data Science with Python” and are intended to provide an educational overview of the subject matter for personal use.

--

--

Tanja Adžić
Tanja Adžić

Written by Tanja Adžić

Data Scientist and aspiring Data Engineer, skilled in Python, SQL. I love to solve problems.

No responses yet