Avatar

SQL patterns, tips and tricks

← Back to list
Posted on 07.08.2023
Image by derekb on Flickr
Refill!

I don't face SQL challenges every day. Sometimes I find a good solution for a case I had to crack, but then as time passes by, I tend to forget such a valuable solution I've found earlier. So I either have to google again, or re-discover the approach.

Well, not any longer. Here I will post a list of cases I sometimes work on, along with appropriate solutions.

Before it starts

When it comes to SQL before even putting hands on development, some research must be conducted. Connect to the staging database (or the load testing env database) and make sure, that the SQL queries you made are performant.

Before you proceed, make sure that you:

  • have clear understanding whether the solution is optimised for reading or for writing,
  • understand what kind of delays in query execution are tolerable,
  • know what amount of data you are looking at,
  • understand the trade-offs of the solution.

Generally speaking, performance should be treated as a first-class citizen in SQL. You made a beautiful, elegant and modular SQL query, but it's not performant? You may as well just trash it right now.

Also, an engineer should always mind concurrency and thread safety when making data alterations.

Joining data that was prepared beforehand

The task is: we have a table of the main entity, and we also have another table that holds some auxiliary entity that references the first one. We need to pre-calculate some info and join it with the main entity. A good example would be a product and users saving that product to favourites.

So the following query gives us the information we need:

SELECT
product_id,
bool_or(is_favourite) AS is_favourite,
bool_or(is_popular) AS is_popular
FROM customer_preferences
GROUP BY product_id;
The code is licensed under the MIT license

And later on we can make use of JOIN (SELECT ...) syntax, but that would mean the sub-query forcing Postgres to create a temporary table for immediate join.

To prevent that, a materialised view can be used. A materialised view is the way to store the data selection beforehand and avoid unnecessarily repetitive data selection.

Materialized views must only be used with clear understanding of the approach of keeping it up-to-date. Failing to do so may cause serious disruption in the business logic. Never use Cron jobs to keep materialized views up to date, it is a recipe for a trouble.

Before committing to materialized views, give de-normalization a try.

CREATE MATERIALIZED VIEW products_user_info IF NOT EXISTS AS
SELECT
product_id,
bool_or(is_favourite) AS is_favourite,
bool_or(is_popular) AS is_popular
FROM customer_preferences
GROUP BY product_id
WITH DATA;
The code is licensed under the MIT license

Then the view can be joined as a regular table:

SELECT * FROM products p
INNER JOIN products_user_info pui ON pui.product_id = p.id;
The code is licensed under the MIT license

Note two things:

  1. A materialised view consumes space on a hard drive just like a regular table does.
  2. A materialised view does not automatically refresh itself on every change of the related data, it needs to be re-populated manually or on a trigger, using the following SQL command:
    REFRESH MATERIALIZED VIEW products_user_info;
    The code is licensed under the MIT license
  3. To maintain accessibility, a materialized view should be better refreshed concurrently. In this case you'll need a covering index on such MV.
    REFRESH MATERIALIZED VIEW CONCURRENTLY products_user_info;
    The code is licensed under the MIT license

Making use of triggers

So, the materialized view needs to be updated. One can do it on the application level, but sometimes when there is no bulk imports/updates expected as a business need, it makes sense to run the refresher after every SQL statement execution that modifies the source data.

Before creating a trigger, please make sure the table is not heavily written! If that is the case, the CPU will spike to the clouds.

CREATE OR REPLACE FUNCTION refresh_products_user_info()
RETURNS TRIGGER LANGUAGE plpgsql
AS $$
BEGIN
REFRESH MATERIALIZED VIEW products_user_info;
RAISE LOG 'refresh_products_user_info() was executed';
RETURN NULL;
END $$;
CREATE TRIGGER refresh_products_user_info
AFTER INSERT OR UPDATE OR DELETE OR TRUNCATE
ON products
FOR EACH STATEMENT
EXECUTE PROCEDURE refresh_products_user_info();
The code is licensed under the MIT license

Note the logging statement, it can be used to check if the trigger was actually executed on prod later on.

How to disable a trigger and make sure it is disabled:

ALTER TABLE products DISABLE TRIGGER refresh_products_user_info;
SELECT tgname, (case when tgenabled = 'D' then 'N' else 'Y' end) as enabled FROM pg_trigger WHERE tgrelid = 'public.products'::regclass;
The code is licensed under the MIT license

Implementing counters

The task: I have an entity that can have different states. I need to query total count of that entity in different states. A good example is a task that can be "open", "in progress", "done", etc. I need to know how many tasks I have in each of the states.

So there are three approaches:

  1. Calculate the counts on every query. That solution is resource and time ineffective.
  2. Store the counters in a separate table and update the counters on every task insertion, update or deletion. This approach is read-optimal, however it is not concurrent-safe.
  3. Use the audit table pattern I have discussed before in details. The approach is concurrent-safe, however it may be not that read-performant and it has other limitations as well.

There is no right or wrong approach, choose one that suits your case better.

Storing nested data structures, such as trees

There are again three ways to do implement trees

  1. A straight-forward, naive approach would be to just keep a reference to the left and right node by storing primary keys of the corresponding records. This way is far from being read-optimal, as it requires a recursive algorithm to be in place in order to get a sub-set of items, and moreover the approach can only allow storing binary trees.
  2. The other solution is the Nested Sets pattern. This method is suitable for keeping tree-only structures and optimised for reading tree sub-branches.
  3. The close competitor of the previous one is the Adjacency List pattern. The adjacency list pattern can work with graphs of a general topology, but more space-consuming.

Note that there are more suitable databases on the market to work with such data structures, called Graph Databases.

Migrations

A few words about migrations. You should be mindful when writing queries, and always check the potential execution time. Be extra careful with so-called back-fill queries, such as this one:

UPDATE products SET created_at = updated_at;
The code is licensed under the MIT license

This is an un-conditional update that back-fills the created_at attribute, and if executed on a table with 3M+ records, it can easily take up to several minutes to complete. The migration will most likely fail by timeout, and then the whole deployment will fail.

That's why it is also recommended to only change the database schema in migrations, and for the back-filling use a dedicated script that runs the same query across desired environments.

Replica identity

If you have a read replica, then the database should know how to map every record from each table into a corresponding table of the replica. That is why every table should either have a primary key, or it should have a unique index that can be used as a replica identity:

ALTER TABLE some_table REPLICA IDENTITY USING INDEX ix_some_index;
The code is licensed under the MIT license

Addressing performance issues

When having an SQL query that shows poor performance, don't rush in with a solution. Instead, escalate it using the following rank:

  • Try running explain select and put some indexes. Maybe it does great deal of help.
  • Try de-normalizing data. Moving a couple of fields here and there can help cutting the amount of joins, especially when a table is joined exclusively to obtain the value of that column X.
  • Rearrange the query basing on input parameters. Maybe you find some parameter combinations where you could cut some corners and make the query more performant? The approach when an endpoint has different performance depending on the input parameters is common, because different combinations are served by distinct SQL queries.
  • If eventual consistency is tolerable, consider using materialized views. However, be careful and have clear understanding of how and when the views are going to be renewed.
  • If write operations are slow and need performance boost, and eventual consistency is not a problem, then why not moving the operations from sync to async mode? Say, when an entity is changed, add this event to a queue, and on a separate go routine consume that queue and do stuff needed to be done. This will keep user hits fast as before and do the job later.
  • Use in-memory cache and insert/update batching.
  • If nothing above helps, re-think the architecture. Maybe you don't need such complexity? Maybe you could re-structure the data in a different manner? Maybe you don't have to show those counters to a user on the list view? Maybe you could avoid sorting by that attribute, because it's computationally expensive? Talk you your PM.

As you can see, every step up introduces changes of a different level of magnitude, so start small, try to get the low hanging fruit, and only then call for heavy artillery.

De-normalization

Sometimes the performance of a Select query is deeply impacted by the necessity of doing heavy joins. And the joins themselves may be difficult to optimise: you may not be able to put more or less effective indexes there, because due to low selectivity they simply won't be utilized. Also, you can't just simply shrug off these joins, because in this case the logic of the application totally goes south. Sometimes the only way to mitigate this is to admit that something went terribly wrong when the database schema was designed in the first place, and then pull off a costly refactoring.

But sometimes there are ways to shift the application towards being more read-optimised by preparing as much data as possible in advance. This is called "data de-normalisation".

So, instead of having a select with many static joins, it is worth creating an extra column and run a back-filling query whenever the data changes. Here is the example of such UPDATE FROM query:

UPDATE products
SET has_prices = products_with_prices.has_active_prices AND products_with_prices.has_active_stores
FROM
(
SELECT
products.id AS id,
bool_or(prices.id IS NOT NULL AND prices.deleted_by IS NULL) AS has_active_prices,
bool_or(store.id IS NOT NULL AND store.deleted_by IS NULL) AS has_active_stores
FROM products
LEFT JOIN prices
ON prices.product_id = products.id
LEFT JOIN stores
ON stores.id = prices.store_id
WHERE products.id IN (123, 456) -- <<<- Here you put any statement that returns a list of products to update. Use an array or a sub-query.
GROUP BY products.id
) AS products_with_prices
WHERE products.id = products_with_prices.id
The code is licensed under the MIT license

Run this query for particular subset of products whenever the price or a store is changed, and just use the has_prices column in the select instead of several heavy joins.

Insert-select and update-select versus select and C-UD in a cycle

Most of the time a naive approach is to make a selection first (as a separate query), then iterate over the result (on the client side) by doing a series of insert-update-delete operations. This approach is quite pro-intuitive, but it's also concurrency-unsafe, error-prone and most of the time ineffective.

Whenever possible, use the insert-select and update-select syntax. For example:

INSERT
INTO order_history (product_id, amount, updated_at, requested_by)
SELECT
s.id as product_id,
1 as amount,
now() as updated_at,
'backfill-foo' as requested_by
FROM
products p
WHERE
not exists(
SELECT * from order_history oh where oh.product_id = p.id
)
ON CONFLICT DO NOTHING
RETURNING id
The code is licensed under the MIT license

This request is atomic, and also it is more error-tolerant since it has the ON CONFLICT clause.

It's a good practice to leave a way for making possible future amends by marking a newly inserted records in a special way. For instance, in the case above the requested_by field has a special value.

Getting the list of affected rows

Postgres offers an amazing option of getting the list of affected rows when doing the UPDATE query. This is extremely helpful when it's crucial to know the list of rows that were actually updated. Just append the RETURNING clause to the end of the query.

UPDATE products
SET title = 'Title missing'
WHERE title is null
RETURNING *;
The code is licensed under the MIT license

You can use the asterisk to get all rows, or just ids.

Dealing with arrays

Filtering by JSON array

Not that often, but sometimes it's required to filter by the content of a JSON filed in a table. Well, for that matter there is a snippet. If there is an JSON field attributes that contains an array of objects, then to filter by objects where the key value contains foo:

SELECT id FROM products
CROSS JOIN LATERAL json_array_elements ( attributes ) AS j
WHERE j->>'value' = 'foo';
The code is licensed under the MIT license

Selecting arrays when aggregating

There is a way to put values into an array when grouping records. This is useful for visual debugging as well as for a bunch of other cases. This is possible thanks to the ARRAY_AGG() function. Take a look:

SELECT
a.id,
a.name,
ARRAY_AGG(distinct b.title) AS books
FROM authors a
INNER JOIN books b ON a.id = b.author_id
GROUP BY
a.id,
a.name
;
The code is licensed under the MIT license

Database-level constraints

We usually apply constraints on the business logic level. But, if an SQL database is used, some of such constraints can be moved to the database level to increase consistency and data safety.

For example:

ALTER TABLE price ADD CONSTRAINT price_restrict_value CHECK ((status <> 'ACTIVE' AND value = 0) OR (status = 'INACTIVE'));
The code is licensed under the MIT license

More intricate constraints can be established by using triggers.

Managing enums

One of the most common tasks that pops up every now and then is to add a new member to an existing enum. With modern Postgres it's easy to do using the ALTER TYPE operator. However, this operator does not support the IF (NOT) EXISTS clause, so to make the code re-iterable and suitable for migrations, a script is needed.

Note: this only works since Postgres v12

DO $$
BEGIN
IF NOT EXISTS (
SELECT 1 FROM pg_enum
WHERE enumtypid = 'your_enum_name'::regtype
AND enumlabel = 'NEW_MEMBER'
) THEN
ALTER TYPE your_enum_name ADD VALUE 'NEW_MEMBER';
END IF;
END$$;
The code is licensed under the MIT license

Concurrency

Managing concurrency is highly important when developing a high-loaded cloud native application. Making a selection, then analysing the result and updating the database is a naive, yet quite often unavoidable pattern. In this case, locking should be used.

The easiest way to manage concurrent requests would be by using advisory locks in Postgres. Here is an example:

BEGIN;
SELECT pg_advisory_lock(12345);
SElECT * from foo;
UPDATE foo ... ;
COMMIT;
The code is licensed under the MIT license

Where 12345 is a lock id. When another thread (hit) reaches this place and tries to execute the same transaction, it will have to wait until the previous one is done. When the transaction is committed or rolled back, the lock is released.

Beware of deadlocks!

Besides, there is also a way to serialize transactions or do per table row locks, which can decrease the possibility of making a potential performance bottleneck.

Here is an amazing article I've found on the subject: Amazing overview of locking mechanisms in Postgres.

Backfilling vs migrations

There is a migration and there is backfilling. Migrations are often quite fast, as they typically only change the database schema. Backfilling, on the contrary, changes the content of the data in the database (and also indexes), and hence can be resource- and time-consuming. Beware of this, and feel safe to run migrations within the CICD pipeline, but always make dedicated scripts for executing backfilling.

Well, this article is a work in progress, just like the life of an engineer itself :) I will add more things as I keep discovering.


Avatar

Sergei Gannochenko

Business-oriented fullstack engineer, in ❤️ with Tech.
Golang, React, TypeScript, Docker, AWS, Jamstack.
15+ years in dev.