
Gentle introduction to relational databases

For many years, relational databases were the only serious way to keep relevant data stored safely and retrievably, whenever needed.

Some would say they still are the only serious and safe way.

Relational databases emerged victorious over alternative ways to store large volumes of enterprise data back in the 80's; there is a fine documentary about it.

They are worth using instead of just shoving all the data into a plain file because most of them offer the ACID properties (a small transaction sketch follows the list):

  1. Atomic operations
  2. Consistency of data
  3. Isolation in concurrent scenarios
  4. Durability of stored data
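
To get a feeling for atomicity, here is a minimal SQL transaction sketch (SQL itself is covered later in this chapter; the accounts table is hypothetical, used only for illustration): either both updates take effect or neither does.

sql
-- hypothetical accounts table, only to illustrate an atomic transaction
begin transaction;
update accounts set balance = balance - 100 where id = 1;
update accounts set balance = balance + 100 where id = 2;
commit; -- or rollback; to discard both changes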

We'll look at several relevant details in this chapter, but let's start with the sample code. In fact, I will present things first and explain later: dealing with database management in most enterprise scenarios is more of a business-related job than a purely technical challenge. What really matters is what the data means, not how it is stored.

Write a program to query data

This is a small example of how to query stored data.

Kotlin inherits from Java an API called JDBC, and you can use it to query the database:

kotlin
// ...
// "connection" is a java.sql.Connection opened elsewhere in the sample project
fun list() {
    println("|#\t|description\t|done\t|updated\t")
    connection.prepareStatement(
        """
            select * from todos
        """.trimIndent()
    ).use {
        val rs = it.executeQuery()
        while (rs.next()) {
            println(
                "|${rs.getLong("id")}\t" +
                        "|${rs.getString("description")}\t" +
                        "|${rs.getBoolean("done")}\t" +
                        "|${rs.getTimestamp("updated")}"
            )
        }
    }
}
// ...

JDBC isn't famous for a concise, easy-to-use API, but it's a good foundation for people writing higher-level database libraries.

You can check the complete sample code here.

Cool database runtimes

When people talk about databases, what they really mean is the collection of software needed to provide database management. RDBMS - Relational Database Management System - is the term to search for to find out more.

Some of them are quite humble, just one program saving into a file, while others are huge, state-of-the-art beasts, intended for brutally large data volume processing.

There are many RDBMS around, but let's focus on some that will help you to deliver things fast.

SQLite

Let's start with the tiny guy. They say it's not meant to be used for serious applications, yet it is easily the most ubiquitous database engine in the entire world.

There are several ways to install the command line tool and the library, but we'll focus on how to handle the database using the command line, since we already sampled how a kotlin program could consume the data.

Also, we'll make use of SQL - Structured Query Language - to create the sample todo list database. More about it, in detail, later in this chapter.

Open the command line terminal, call the sqlite3 command you installed in the previous step and:

bash
sombriks@zagreus Documents % sqlite3 todos.db
SQLite version 3.43.2 2023-10-10 13:08:14
Enter ".help" for usage hints.
sqlite> create table todos(id integer, description text, updated timestamp);
sqlite> insert into todos(id,description,updated) values(1, 'do the dishes', '2024-05-26');
sqlite>

Hit CTRL+D to exit; the file todos.db is now there.

You can go back into your database with the same previous command sqlite3 todos.db. Now try to query the data:

bash
sombriks@zagreus Documents % sqlite3 todos.db
SQLite version 3.43.2 2023-10-10 13:08:14
Enter ".help" for usage hints.
sqlite> select * from todos;
1|do the dishes|2024-05-26
sqlite>

There is a lot to explain, yet look how simple that is!

SQLite figures in this list because it is by far the easiest one to use when we need a small and simple database for testing purposes. As you can see, you don't even need to set up a real project to test things with it.

H2

This is another tiny, embeddable database runtime that you can use in your kotlin/spring projects.

In fact, spring boot will provision an embedded H2 instance if your tests need a DataSource but you haven't configured one.

Instead of a command line tool, H2 offers a feature-complete web admin console, so you get a better visualization of the data.

MySQL

MySQL is a popular, robust RDBMS widely used on internet solutions.

It's quite feature-complete (I still don't forgive the lack of the RETURNING clause, I don't care if it's not ANSI SQL), and it's easy to find years of documentation about it.

It has a notable fork, MariaDB (which does have RETURNING).

PostgreSQL, the open-source and enterprise one

Now, this is the one you pick when you want to do big, serious business. It's also one of the oldest RDBMS still active, dating from the era when relational databases were invented (I mean WARS!!).

With postgres you can handle internet-scale operations; your hardware is the limit. It is in another league, competing with systems like Oracle, IBM DB2 and SQL Server. And winning.

Pick postgres whenever possible.

Cool database clients

Most RDBMS offer tools like the sqlite command line tool we used earlier in this chapter, or something like the H2 embedded web console.

But if we keep using only the official tool for each RDBMS, we'll end up overwhelmed with so much tooling to learn for something that is an ISO standard.

There are tools able to deal with several distinct database runtimes, so you can learn one single tool and get the job done even if you don't completely understand the database in use.

Dbeaver

dbeaver supports a wide range of RDBMS. Everything the jvm touches it touches too.

Intellij Ultimate or DataGrip

datagrip supports a wide range of RDBMS. Everything the jvm touches it touches too.

DbBrowser for SQLite

dbbrowser is a dedicated client for SQLite, and it deserves a place here because SQLite is the one we pick for fast, expendable test databases.

Sure, the command line is cool, but sometimes visual tools allow you to go home early.

The SQL language

Now let's talk about the way we interact with databases.

The Structured Query Language is the standard way to retrieve, model, modify and evolve data stored into a RDBMS.

Since the language existed for years before the standardization appeared (and for decades before that standardization started to be taken seriously), a few SQL dialects exist across different RDBMS. We'll face them up ahead.

It can be divided into several categories, but the principal ones are DDL and DML.

DDL

DDL stands for Data Definition Language. You use it to describe the data schema, tables and relations between them. We saw it in action in the initDb function back in the sample project.

Create table, alter table, drop table

We talk SQL, we talk tables.

A table is the foundational structure used to model our data. It holds columns; each column represents a characteristic of the data we want stored.

Let's say we want to keep a list of friends. We need their names.

sql
create table friends
(
    name text
);

This table represents exactly what we requested, although it is a horribly poor design.

  • If we have two friends with the same name, we end up with no easy way to differentiate one from another.
  • I don't know when I inserted that friend in my database.
  • Extra information about the friend (phone, email, address, for example) would be nice to have.

We can recreate our table to fix those problems I just invented:

sql
-- throw away previous table definition
drop table friends;
-- do it again
create table friends
(
    id      integer primary key,
    name    text      not null,
    created timestamp not null default current_timestamp
);

Or, instead of throwing it away, we can just alter the table definition:

sql
alter table friends
    add column phone text;
alter table friends
    add column email text;
alter table friends
    add column address text;

Some database engines (SQLite, for example!) won't let you add primary keys via alter table; that is why the example above is the way it is.

What?? What is a primary key, and why do 9 out of 10 DBAs call it id?

What is a primary key

Whenever you insert data into a table, a record is created. A record is a 'line' in the table.

If you insert the same data twice, how do you tell the records apart? They are the same, yet they're not.

The concept of identity, from psychology, helps to solve this. You can read about the Ship of Theseus paradox to make things even clearer: data is subject to change over time, yet each record is supposed to remain equal to itself no matter the changes.
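
As an illustrative sketch (SQLite semantics, where an integer primary key receives an automatic value; the friends_demo table exists only for this example), the two records below hold identical data and can only be told apart by their id:

sql
-- illustration only: two 'equal' records, distinguishable only by the primary key
create table friends_demo
(
    id   integer primary key,
    name text not null
);

insert into friends_demo(name)
values ('Joe'),
       ('Joe');

select * from friends_demo;
-- id | name
-- 1  | Joe
-- 2  | Joe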

'Natural' vs 'artificial' keys

Another common debate is whether to adopt 'natural' or 'artificial' keys.

By natural, understand some trait that already exists among the characteristics of the entity being represented by the table.

A social security number could be considered a natural key.

The id column is an artificial key. It makes sense in the table context but is completely alien to a person in the real world.

In most cases, prefer artificial keys. Natural ones are subject to change due to real-world events, which are outside the RDBMS's control.

If for some reason you want to use special names for your table, use double quotes:

Double quotes
sql
create table "never tell me the odds"
(
    id             integer primary key,
    "odd col name" text not null
);

insert into "never tell me the odds" ("odd col name")
values ('han shoots first');

select *
from "never tell me the odds";

Resulting in:

id | odd col name
1 | han shoots first

Create index, create view

Indexes are a way to inform the RDBMS that a certain column will be requested by queries pretty often.

Primary keys are usually indexed automatically.
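
You can also create an index explicitly. Here is a minimal sketch, assuming the friends table above; the index name is arbitrary, and this standard create index syntax is accepted by SQLite, H2 and PostgreSQL:

sql
-- ask the engine to index the name column, since we expect to search friends by name often
create index friends_idx_name on friends (name);

-- and drop it when it's no longer needed
-- drop index friends_idx_name;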

You can add a uniqueness constraint to your column definition and it will also be indexed.

The table below defines unique indexes on email and phone: no two friends share that information, but it's ok to have more than one friend with the same address.

sql
-- drop table friends;
create table friends
(
    id      integer primary key,
    name    text        not null,
    email   text unique not null,
    phone   text unique not null,
    address text        not null,
    created timestamp   not null default current_timestamp
);

This is an alternative syntax; the advantage is that we know the constraint name:

sql
-- drop table friends;
create table friends
(
    id      integer primary key,
    name    text      not null,
    email   text      not null,
    phone   text      not null,
    address text      not null,
    created timestamp not null default current_timestamp,
    constraint friends_uq_email unique (email),
    constraint friends_uq_phone unique (phone)
);

This also works in some dialects (H2 and the rest of the world yes, sqlite no):

sql
-- drop table friends;
create table friends
(
    id      integer primary key,
    name    text      not null,
    email   text      not null,
    phone   text      not null,
    address text      not null,
    created timestamp not null default current_timestamp
);

alter table friends
    add constraint friends_uq_email unique (email);
alter table friends
    add constraint friends_uq_phone unique (phone);

On the other hand, sometimes we want just a subset of results, or there are two or more tables that we join very often (see what a join is a little further in this chapter). For those scenarios we can create views:

sql
create view friend_emails as
select id, email
from friends;

Those two concepts will make more sense in the upcoming topics.

Foreign keys

Let's talk about the "Relational" in Relational Database Management System.

There are relations between tables. Let's evolve our friends database: one table for friends, another for addresses, because yes.

sql
drop table friends;

create table addresses
(
    id      integer primary key,
    name    text      not null,
    number  text,
    created timestamp not null default current_timestamp,
    constraint addresses_uq_name_number unique (name, number)
);

create table friends
(
    id           integer primary key,
    name         text        not null,
    email        text unique not null,
    phone        text unique not null,
    addresses_id integer     not null,
    created      timestamp   not null default current_timestamp,
    foreign key (addresses_id) references addresses (id)
);

The way addresses is defined allows us to share the same address among more than one friend. The relation is represented in the friends table by a foreign key pointing to the addresses primary key.

But what happens if I delete an address?

Depending on the database runtime, the delete operation is either prevented or it happens and leaves you with a database in an invalid state.

Databases honoring the ACID properties will prevent you from doing that.

But you can solve it by telling the database what to do in such cases:

sql
create table friends
(
    id           integer primary key,
    name         text        not null,
    email        text unique not null,
    phone        text unique not null,
    addresses_id integer, -- nullable so that "on delete set null" can apply
    created      timestamp   not null default current_timestamp,
    foreign key (addresses_id) references addresses (id) on delete set null
);

We'll come back to this cascade thing in upcoming topics.

DML

DML stands for Data Manipulation Language. It differs from what we were just seeing: DDL describes how our data should look, while DML feeds and changes the data itself.

Insert

To add data to your database you must use insert statements.

For this table (tested on sqlite, trust me):

sql
-- drop table if exists friends;
create table friends
(
    id      integer primary key,
    name    text      not null,
    email   text,
    created timestamp not null default current_timestamp
);

You can insert data like this:

sql
insert into friends(name)
values ('Joe');

The output should be something like this:

bash
1 row affected in 5 ms

And you can check your data with a select (more on that later):

sql
select *
from friends;

And the result would be:

id | name | email | created
1 | Joe | null | 2024-05-28 11:46:18

Note that the insert didn't have any info about the id or created columns, but there are values in them. This is how column definitions work: you can define keys, default values, restrictions, you name it (as long as they're supported by the database engine in use).

For the following table:

sql
-- drop table if exists friends;
create table friends
(
    id      integer primary key,
    name    text      not null,
    email   text      not null,
    created timestamp not null default current_timestamp
);

The same insert that served us well will fail:

sql
insert into friends(name)
values ('Joe');

With an error message more or less like this:

console
[2024-05-28 09:02:09] [19] [SQLITE_CONSTRAINT_NOTNULL] A NOT NULL constraint failed (NOT NULL constraint failed: friends.email)

What does it mean? It means, according to our table definition, that I don't accept friends who don't have an email. Uncivilized people.

You fix that insert by simply providing an email:

sql
insert into friends(name, email)
values ('Joe', 'email@joe.com');

It's possible to provide several friends in a single insert:

sql
insert into friends(name, email)
values ('Joe', 'email@joe.com'),
       ('Anne', 'anne@ggg.co');

You can specify values for columns with default values if you want to:

sql
insert into friends(name, email, created)
values ('Joe', 'email@joe.com', '2024-05-28');

But beware! Primary and unique keys will complain about duplicate values and your insert will fail. The id thing from psychology, remember?

sql
insert into friends(id, name, email)
values (1, 'Joe', 'email@joe.com'),
       (1, 'Anne', 'anne@ggg.co');

Expect something like this:

console
[2024-05-28 09:13:28] [19] [SQLITE_CONSTRAINT_PRIMARYKEY] A PRIMARY KEY constraint failed (UNIQUE constraint failed: friends.id)

A funny one, if every column has a default value:

sql
create table my_messages
(
    id      integer primary key,
    message varchar(255) not null default 'yo!',
    created timestamp    not null default current_timestamp
);

You can insert data like this:

sql
insert into my_messages default values;

Tables with foreign keys marked with a not null restriction will perform extra checks on your insert attempts.

For this table schema:

sql
create table orders
(
    id      integer   not null primary key,
    created timestamp not null default current_timestamp
);

create table items
(
    id        integer   not null primary key,
    product   text      not null,
    amount    integer   not null default 1,
    orders_id integer   not null references orders (id),
    created   timestamp not null default current_timestamp
);

This insert alone will fail (except in sqlite, unless you enable the foreign key check):

sql
-- pragma foreign_keys = 1;
insert into items (product, orders_id)
values ('cell phone', 10);

Resulting error looks like this:

console
[2024-05-28 10:00:57] [19] [SQLITE_CONSTRAINT_FOREIGNKEY] A foreign key constraint failed (FOREIGN KEY constraint failed)

Avoid that by inserting an order first:

sql
insert into orders default values
returning *;

The returning part (which MySQL doesn't have but every other database runtime does, even sqlite) helps you to know what the newly created record looks like. Here's a sample output:

id | created
1 | 2024-05-28 13:05:48

Now the item insert will pass -- as long as the orders_id exists as an id in the orders table:

sql
-- pragma foreign_keys = 1;
insert into items (product, orders_id)
values ('bicycle', 1);

Update, delete, cascade

Similar to the table schema itself, stored data must evolve too.

Either by modifying or removing records. For example, given this table:

sql
create table foods
(
    id          integer   not null primary key,
    description text      not null,
    is_sweet    boolean   not null default false,
    created     timestamp not null default current_timestamp
);

And this data:

sql
insert into foods(description)
values ('noodles'),
       ('milky ''n honey chocolate'),
       ('lemon'),
       ('steak');

By checking the content we'll notice an error:

sql
select *
from foods;

id | description | is_sweet | created
5 | noodles | false | 2024-05-28 23:42:08.814539
6 | milky 'n honey chocolate | false | 2024-05-28 23:42:08.814539
7 | lemon | false | 2024-05-28 23:42:08.814539
8 | steak | false | 2024-05-28 23:42:08.814539

The chocolate thing is supposed to be sweet.

To fix that, update the row:

sql
update foods
set is_sweet = true
where description = 'milky ''n honey chocolate';

But the description is too long to type correctly, and what is that '' thing? (It's how you escape a single quote inside a SQL string.)

You can use an alternative update:

sql
update foods
set is_sweet = true
where description like '%honey%';

The like operator matches values against a pattern: % stands for any sequence of characters, so '%honey%' matches any description containing 'honey'.

But the safest, most straightforward way to update a record is by knowing its id:

sql
update foods
set is_sweet = true
where id = 6;

We've started to use the where clause in our queries; more on that up next.

If we need to get rid of some data, a delete can be used:

sql
delete
from foods
where id = 6;

Note that, like update, a where clause should be used to specify which records you want to affect. In both cases, if no where clause is provided, the statement affects all records in the table, which is undesirable in most cases.
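
As a cautionary sketch against the foods table above (don't run it on data you care about): with no where clause, every record is touched.

sql
-- marks every single food as sweet
update foods
set is_sweet = true;

-- and this commented-out one would wipe the whole table
-- delete from foods;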

Deletes also take relations into account. Given this schema:

sql
create table team
(
    id          identity not null primary key,
    description text     not null
);

create table member
(
    id      identity not null primary key,
    team_id integer  not null,
    name    text     not null,
    constraint fk_member_team_id foreign key (team_id) references team (id)
);

insert into team (description)
values ('team 1'),
       ('team 2');

insert into member(name, team_id)
values ('Joe', 1),
       ('Anna', 1),
       ('Lucas', 1),
       ('Ana', 2),
       ('Kelly', 2);

An attempt to delete team 2 will likely fail (tested on H2):

sql
delete
from team
where id = 2;

With an output like this:

console
[2024-05-28 21:12:15] [23503][23503] Referential integrity constraint violation: "FK_MEMBER_TEAM_ID: PUBLIC.MEMBER FOREIGN KEY(TEAM_ID) REFERENCES PUBLIC.TEAM(ID) (2)"; SQL statement:
[2024-05-28 21:12:15] delete from team where id = 2 [23503-220]

Yes, yes, you can delete all members where team_id = 2 and then perform the deletion on team, but you can define a cascade on the team_id column:

sql
alter table member
    drop constraint fk_member_team_id;
alter table member
    add constraint fk_member_team_id foreign key (team_id) references team (id) on delete cascade;

Now your deletion goes even better than expected:

console
TODOS.H2.DB.PUBLIC> delete from team where id = 2
[2024-05-28 21:26:35] 1 row affected in 2 ms

Check what's left in the database:

sql
select *
from team
         join member on member.team_id = team.id;

team.id | description | member.id | team_id | name
1 | team 1 | 1 | 1 | Joe
1 | team 1 | 2 | 1 | Anna
1 | team 1 | 3 | 1 | Lucas

Select

We've been using selects for a while in this chapter, but let's look at them in more detail now.

  • We already saw how to define our database (create table, alter, etc.)
  • We saw how to feed and modify data into our database (insert, update, delete)
  • Now we will see how to ask things to the database and get answers

To perform a select, you need very little, really. Just an RDBMS running and accepting connections (SQLite is a file; H2, MySQL, PostgreSQL, Oracle and so on usually work as client/server applications). This is a valid select:

sql
select 1;

No tables. Just a number. Works.

You can select arbitrary values:

sql
select 'Hi mom',
       123,
       current_timestamp,
       lower('DON''T YELL AT ME!'),
       true;

When selecting, you can mix literal values with data from your tables:

sql
select concat('the team member ', name, ' from team ', description) as headline
from team
         join member on member.team_id = team.id;

Select to insert

It is possible to use the select output as the values to an insert statement. It is a powerful move when transforming data already present inside the database.

For example, I want to know all possible color combinations from my color database. First I create a table to save my colors and add some; then I create a second table to store my combinations and, instead of feeding combinations one by one, I can do an insert into / select operation:

sql
-- drop table if exists colors;
create table colors
(
    id   integer primary key,
    name varchar(255) not null
);

insert into colors (name)
values ('red'),
       ('green'),
       ('blue'),
       ('cyan'),
       ('purple');

-- drop table if exists color_combinations;
create table color_combinations
(
    color1 integer not null,
    color2 integer not null,
    constraint color_combinations_fk_colors_id_1 foreign key (color1) references colors (id) on delete cascade,
    constraint color_combinations_fk_colors_id_2 foreign key (color2) references colors (id) on delete cascade,
    constraint color_combinations_pk primary key (color1, color2)
);

-- feed the combinations
insert into color_combinations
select c1.id, c2.id
from colors c1,
     colors c2
where c1.id <> c2.id;

select *
from color_combinations cc
         join colors c1 on cc.color1 = c1.id
         join colors c2 on cc.color2 = c2.id

This results in this remarkable dataset:

color1 | color2 | id | name | id | name
1 | 2 | 1 | red | 2 | green
1 | 3 | 1 | red | 3 | blue
1 | 4 | 1 | red | 4 | cyan
1 | 5 | 1 | red | 5 | purple
2 | 1 | 2 | green | 1 | red
2 | 3 | 2 | green | 3 | blue
2 | 4 | 2 | green | 4 | cyan
2 | 5 | 2 | green | 5 | purple
3 | 1 | 3 | blue | 1 | red
3 | 2 | 3 | blue | 2 | green
3 | 4 | 3 | blue | 4 | cyan
3 | 5 | 3 | blue | 5 | purple
4 | 1 | 4 | cyan | 1 | red
4 | 2 | 4 | cyan | 2 | green
4 | 3 | 4 | cyan | 3 | blue
4 | 5 | 4 | cyan | 5 | purple
5 | 1 | 5 | purple | 1 | red
5 | 2 | 5 | purple | 2 | green
5 | 3 | 5 | purple | 3 | blue
5 | 4 | 5 | purple | 4 | cyan

How to make it a combination instead of a permutation is left to you as an exercise.

Select to create table

In a similar fashion, you can use a select to create a table:

sql
create table ccombs as
select c1.name as name1, c2.name as name2
from color_combinations cc
         join colors c1 on cc.color1 = c1.id
         join colors c2 on cc.color2 = c2.id
where c1.id < c2.id;

select *
from ccombs;

name1 | name2
red | green
red | blue
red | cyan
red | purple
green | blue
green | cyan
green | purple
blue | cyan
blue | purple
cyan | purple

Note, however, that the resulting table is poorly designed: it has no constraints, no primary key, just raw, synthetic data. Sure, you can fix some of those issues with alter table statements, but the winning approach here is to create the table first and then go with the insert/select strategy.

Join and union

Join operations (which we've been using for a while now) combine the records of two or more tables. Keep that in mind: it involves multiple tables, but the operation happens at the record level. Why is that important?

sql
select c1.*, c2.*
from colors c1,
     colors c2;

id | name | id | name
1 | red | 1 | red
1 | red | 2 | green
1 | red | 3 | blue
1 | red | 4 | cyan
1 | red | 5 | purple
2 | green | 1 | red
2 | green | 2 | green
2 | green | 3 | blue
2 | green | 4 | cyan
2 | green | 5 | purple
3 | blue | 1 | red
3 | blue | 2 | green
3 | blue | 3 | blue
3 | blue | 4 | cyan
3 | blue | 5 | purple
4 | cyan | 1 | red
4 | cyan | 2 | green
4 | cyan | 3 | blue
4 | cyan | 4 | cyan
4 | cyan | 5 | purple
5 | purple | 1 | red
5 | purple | 2 | green
5 | purple | 3 | blue
5 | purple | 4 | cyan
5 | purple | 5 | purple

Each record from the colors table aliased as c1 was combined with each record from the same table, but aliased as c2. This is called a cartesian product, and it's not always desirable.

As we saw previously, you avoid that by properly defining either a where clause or conditions in the join. Both join forms below solve the issue:

sql
select c1.*, c2.*
from colors c1,
     colors c2
where c1.id < c2.id;

select c1.*, c2.*
from colors c1
         join colors c2 on c1.id < c2.id;

Joins have variations:

  • left join keeps every record from the left (first) table even when there is no match; the columns coming from the right table show up as null (see the sketch after this list).
  • right join is the mirror image: every record from the right table is kept, and unmatched columns from the left table show up as null.
  • (inner) join returns only the records that match on both sides. On most RDBMS the inner keyword is optional, and the comma-separated syntax with a proper where condition is equivalent to it, like the example above.
  • full outer join keeps unmatched records from both sides, while a plain cross join (no condition at all) produces the cartesian product we just saw.
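
Here is an illustrative left join sketch, reusing the team and member tables from the cascade example above and assuming we add one more team with no members: the empty team still shows up, with null where the member name would be.

sql
insert into team (description)
values ('team 3');

select t.description, m.name
from team t
         left join member m on m.team_id = t.id;

The result would look more or less like:

description | name
team 1 | Joe
team 1 | Anna
team 1 | Lucas
team 3 | null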

Now, if you want to append distinct yet similarly shaped results, you need a union. For example, for this schema and data:

sql
create table colonels
(
    id   integer primary key,
    name text not null
);
create table captains
(
    id   integer primary key,
    name text not null
);
create table soldiers
(
    id   integer primary key,
    name text not null
);

insert into colonels(name)
values ('Clark'),
       ('Mark'),
       ('Oliver');
insert into captains(name)
values ('Barry'),
       ('Larry');
insert into soldiers(name)
values ('Joe');

-- i can query each table, one by one, of course, but i can union them and even
-- polish results a little so no information is left out:

select '1 - colonel' as rank, name
from colonels
union
select '2 - captain' as rank, name
from captains
union
select '3 - soldier' as rank, name
from soldiers
order by rank;

rank | name
1 - colonel | Clark
1 - colonel | Mark
1 - colonel | Oliver
2 - captain | Barry
2 - captain | Larry
3 - soldier | Joe

Pretty nice, right?

CTE - Common Table Expression

Queries can grow more and more complex depending on the kind of information we must extract. Joins, subqueries, views, things can grow in complexity pretty fast.

One approach to ease that pain is to identify the duplication in your subqueries and promote it to CTEs.

Let's exemplify the issue and then how to ease things with this database schema:

sql
-- pragma foreign_keys = 1;

-- a table for products. key characteristics, hardly changing, goes here. 
create table products
(
    id          integer primary key,
    description text      not null,
    created     timestamp not null default current_timestamp
);

-- a table for history price. prices will vary over time. 
create table prices_history
(
    id          integer primary key,
    value       decimal(10, 4) not null,
    created     timestamp      not null default current_timestamp,
    products_id integer        not null,
    foreign key (products_id) references products (id),
    unique (products_id, created)
);

-- a table to know the product stock position in a given moment in time.
create table stock_history
(
    id          integer primary key,
    amount      integer   not null default 0,
    created     timestamp not null default current_timestamp,
    products_id integer   not null,
    foreign key (products_id) references products (id),
    unique (products_id, created)
);

-- order table to identify a transaction
create table orders
(
    id          integer primary key,
    -- this should be calculated from all items in this order
    total_price decimal(10, 4) not null,
    created     timestamp      not null default current_timestamp
);

-- order item to know how many products were involved in the transaction, how
-- much they cost and what was the stock position when this transaction
-- happened.
create table order_items
(
    id                integer primary key,
    amount            integer   not null default 1,
    created           timestamp not null default current_timestamp,
    orders_id         integer   not null,
    products_id       integer   not null,
    prices_history_id integer   not null,
    stock_history_id  integer   not null,
    foreign key (orders_id) references orders (id),
    foreign key (products_id) references products (id),
    foreign key (prices_history_id) references prices_history (id),
    foreign key (stock_history_id) references stock_history (id),
    -- product appears a single time per transaction, 
    -- with a stock and a price properly trackable in time
    unique (orders_id, products_id),
    unique (orders_id, products_id, prices_history_id),
    unique (orders_id, products_id, stock_history_id)
);

I want to sell things and keep track of exactly how each sale affects my product stock and how much the product cost at the time I sold it. It's not that complex but, believe me, selling things is hard.

Let's feed some data:

sql
-- pragma foreign_keys = 1;
-- some base products
insert into products(id, description)
values (1, 'Apple'),
       (2, 'Banana'),
       (3, 'Milk'),
       (4, 'Toy Airplane');

-- initial prices
insert into prices_history (id, products_id, value)
VALUES (1, 1, 5),
       (2, 2, 3),
       (3, 3, 5.5),
       (4, 4, 25);

-- initial stock
insert into stock_history (id, products_id, amount)
values (1, 1, 100),
       (2, 2, 500),
       (3, 3, 100),
       (4, 4, 30);

-- now create an order
insert into orders(id, total_price)
values (1, 100);

-- and some items to it
insert into order_items(orders_id, products_id, prices_history_id, stock_history_id, amount)
values (1, 1, 1, 1, 10), -- 10 apples at $5 each
       (1, 4, 4, 4, 2);
-- 2 toys at $25 each

-- let's update our stocks
insert into stock_history (id, products_id, amount)
values (5, 1, 90), -- i just sold 10 apples
       (6, 4, 28);
-- just sold 2 airplane toys

-- now let's change some prices just because.
insert into prices_history (id, products_id, value)
VALUES (5, 1, 10),
       (6, 2, 9);

-- and place a new order
insert into orders(id, total_price)
values (2, 100);

insert into order_items(orders_id, products_id, prices_history_id, stock_history_id, amount)
values (2, 1, 5, 5, 20), -- 20 apples at $10 each
       (2, 2, 6, 2, 100);
-- 100 bananas at $9 each

-- now update the stock, keep the stock up to date
insert into stock_history (id, products_id, amount)
values (7, 1, 70),
       (8, 2, 400);

-- oof! boy that reminds me my days as a salesman.

Now, let's ask some questions to this database:

  1. How many different items have I sold so far?
  2. What was the paid amount for the cheapest order item?
  3. What is my latest stock position for each product?
  4. What is the initial and the current price for each product?

The following queries can answer that:

sql
-- Answer #1
select count(distinct products_id) as different_items
from order_items;

-- Answer #2
select min(paid_price) as paid_amount
from (select amount * value as paid_price
      from order_items oi
               join prices_history ph on oi.prices_history_id = ph.id);

-- Answer #3
select latest.latest_id, latest.products_id, sh.amount
from (select max(id) as latest_id, products_id
      from stock_history
      group by products_id) as latest
         join stock_history sh on sh.id = latest.latest_id;

-- Answer #4
select mm.products_id, phmin.value, phmax.value
from (select min(ph.id) first_id, max(ph.id) last_id, ph.products_id
      from prices_history ph
      group by products_id) mm
         join prices_history phmin on phmin.id = mm.first_id
         join prices_history phmax on phmax.id = mm.last_id

But boy that is a lot of queries and subqueries. Subqueries are hard to read. We can improve that a little with CTE's:

sql
-- Answer #2
with minimum as (select oi.amount * ph.value as paid_price
                 from order_items oi
                          join prices_history ph on oi.prices_history_id = ph.id)
select min(paid_price) as paid_amount
from minimum;

Now the subquery is more readable and also reusable.

Let's check another one:

sql
-- Answer #3
with latest_history as (select max(id) as latest_id, products_id
                        from stock_history
                        group by products_id),
     latest_products as (select lh.latest_id, lh.products_id, sh.amount
                         from latest_history lh
                                  join stock_history sh on sh.id = lh.latest_id)
select latest_products.latest_id, latest_products.products_id, latest_products.amount
from latest_products;

By composing CTEs you can more easily drill down into what your query really does. Sequencing is better than nesting.

One more:

sql
with mm as (select min(ph.id) first_id, max(ph.id) last_id, ph.products_id
            from prices_history ph
            group by products_id)
select mm.products_id, phmin.value, phmax.value
from mm
         join prices_history phmin on phmin.id = mm.first_id
         join prices_history phmax on phmax.id = mm.last_id;

You can think of a reusable-query hierarchy like this:

  • A local, in-place subquery in the current query for one single use.
  • A CTE when a subquery needs better readability or will be used more than once in the same statement.
  • A view when the subquery will be reused across many queries.

Sum, avg, count, group by

We can use aggregate functions (like we've been using!) to extract even more cool information from our database.

You can figure out total sold items:

sql
select sum(amount)
from order_items;

The average historic price of a product:

sql
select avg(value)
from prices_history
where products_id = 1;

How many transactions we've done so far:

sql
select count(id)
from orders;

By grouping, you can narrow that information down to a specific aspect.

For example, this is how to figure out how many price changes each product had:

sql
select count(id) price_changes, products_id
from prices_history
group by products_id;

Every column present in the select clause and not being aggregated (count, sum, avg, etc.) must be present in the group by clause.
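
A quick sketch of that rule, reusing the order_items table from the CTE section: both non-aggregated columns appear in the group by, giving the amount sold per product per order.

sql
select products_id, orders_id, sum(amount) as total_amount
from order_items
group by products_id, orders_id;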

Order by, limit, offset

Order by sorts the results. For example:

sql
select *
from products
order by description;

Results in:

id | description | created
1 | Apple | 2024-05-30 20:29:20
2 | Banana | 2024-05-30 20:29:20
3 | Milk | 2024-05-30 20:29:20
4 | Toy Airplane | 2024-05-30 20:29:20

While:

sql
select *
from products
order by description desc;

id | description | created
4 | Toy Airplane | 2024-05-30 20:29:20
3 | Milk | 2024-05-30 20:29:20
2 | Banana | 2024-05-30 20:29:20
1 | Apple | 2024-05-30 20:29:20

You can get creative with your order clauses:

sql
select *
from products
order by length(description) desc;

id | description | created
4 | Toy Airplane | 2024-05-30 20:29:20
2 | Banana | 2024-05-30 20:29:20
1 | Apple | 2024-05-30 20:29:20
3 | Milk | 2024-05-30 20:29:20

Limit and offset help to trim down the total results. Very useful when your database grows or when you don't want to waste bandwidth:

sql
select *
from products
order by length(description) desc limit 2;

id | description | created
4 | Toy Airplane | 2024-05-30 20:29:20
2 | Banana | 2024-05-30 20:29:20

sql
select *
from products
order by length(description) desc limit 2 offset 1;

id | description | created
2 | Banana | 2024-05-30 20:29:20
1 | Apple | 2024-05-30 20:29:20

Never paginate without ordering

I mean, you can do that, but it will give you unstable results. Without the order by clause, it's up to the database to decide what goes at each offset. Not good if you want to retrieve precise chunks of information.

Remember that. Keep your data in order.
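
A minimal sketch of a stable page request over the products table, ordering by the primary key so every page is deterministic:

sql
-- second page of two products each, always in the same order
select *
from products
order by id
limit 2 offset 2;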

Window functions

Last but not least, let's talk about window functions.

Regular queries are like unrolling a scroll. Sometimes it's a very long scroll, so you use limit, offset and order by to get just the part you want. Row by row your data will be presented. Some RDBMS have a concept called a cursor; it's not quite pagination, but it serves a similar purpose.

Aggregation queries are like taking that scroll, folding it over itself, and performing the aggregation operation you want over the grouped results. You can still order, limit and offset the result, but it's a different dataset now.

Window functions are what happens when you decide you need a small piece of aggregate data alongside your regular row information.

So let's mix things up and see what it looks like:

sql
select id, description, count(id) over ()
from products;

id | description | count(id) over ()
1 | Apple | 4
2 | Banana | 4
3 | Milk | 4
4 | Toy Airplane | 4

The first two columns are regular ones, from existing rows in the products table, but the last one is an aggregation result.

The query tells us each product we have in the database alongside the total count of products created.

What if we want to know each product, its stock position and how it compares with the total of all items in stock?

sql
with latest_history as (select distinct p.id,
                                        p.description,
                                        max(sh.id) over (partition by sh.products_id) sh_id
                        from products p
                               join stock_history sh on sh.products_id = p.id)
select lh.id,
       lh.description,
       sh2.amount              product_amount,
       sum(sh2.amount) over () all_items_stock
from latest_history lh
       join stock_history sh2 on sh2.id = lh.sh_id;

id | description | product_amount | all_items_stock
1 | Apple | 70 | 598
2 | Banana | 400 | 598
3 | Milk | 100 | 598
4 | Toy Airplane | 28 | 598

Yes, by combining record data and aggregate data in a single result set you start to get better analytical insights.

The form of a windowed column is:

  1. An aggregate function (min, max, sum, avg) or a ranking function (rank, dense_rank, row_number).
  2. The window, the over() part.
  3. Inside the window: partitioning, ordering and the frame definition (a frame sketch follows this list).
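
The frame definition is the one item in that list we haven't used yet. Here is a minimal sketch (window frames are supported by PostgreSQL, H2 and recent SQLite versions) restricting the window to the current row and the row right before it:

sql
-- a running sum limited to the last two rows
select id,
       products_id,
       amount,
       sum(amount) over (order by id
                         rows between 1 preceding and current row) as last_two_sum
from order_items;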

Ranking offers information about who's your top thing.

For example, what if I want to know which product is the most sold?

sql
select oi.products_id,
       oi.orders_id,
       oi.id                              as order_items_id,
       p.description,
       oi.amount,
       rank() over (order by amount desc) as most_sold
from products p
         left join order_items oi on oi.products_id = p.id
order by most_sold, description

products_id | orders_id | order_items_id | description | amount | most_sold
2 | 2 | 4 | Banana | 100 | 1
1 | 2 | 3 | Apple | 20 | 2
1 | 1 | 1 | Apple | 10 | 3
4 | 1 | 2 | Toy Airplane | 2 | 4
null | null | null | Milk | null | 5

Another important thing to know: an order by inside the over() works differently from a regular order by.

It's a sliding window, remember that. Because of it, the following query produces cumulative sums instead of a single total for all rows:

sql
select oi.id,
       oi.products_id,
       sum(oi.amount) over (order by oi.id) as acc_amount
--        sum(oi.amount) over () as total_amount
from order_items oi
order by oi.id

id | products_id | acc_amount
1 | 1 | 10
2 | 4 | 12
3 | 1 | 32
4 | 2 | 132

This is quite useful when plotting charts.

Next steps

What a ride, huh?

This was a long chapter, but if you're going enterprise then you're going databases and queries.

Check the links, try to model things around you and get good at it.

In the next chapter we get back to our normal kotlin programming.