SQL patterns, tips and tricks
I don't face SQL challenges every day. Sometimes I find a good solution for a case I had to crack, but then as time passes by, I tend to forget such a valuable solution I've found earlier. So I either have to google again, or re-discover the approach.
Well, not any longer. Here I will post a list of cases I sometimes work on, along with appropriate solutions.
Before it starts
When it comes to SQL before even putting hands on development, some research must be conducted. Connect to the staging database (or the load testing env database) and make sure, that the SQL queries you made are performant.
Before you proceed, make sure that you:
- have clear understanding whether the solution is optimised for reading or for writing,
- understand what kind of delays in query execution are tolerable,
- know what amount of data you are looking at,
- understand the trade-offs of the solution.
Generally speaking, performance should be treated as a first-class citizen in SQL. You made a beautiful, elegant and modular SQL query, but it's not performant? You may as well just trash it right now.
Also, an engineer should always mind concurrency and thread safety when making data alterations.
Joining data that was prepared beforehand
The task is: we have a table of the main entity, and we also have another table that holds some auxiliary entity that references the first one. We need to pre-calculate some info and join it with the main entity. A good example would be a product and users saving that product to favourites.
So the following query gives us the information we need:
SELECTproduct_id,bool_or(is_favourite) AS is_favourite,bool_or(is_popular) AS is_popularFROM customer_preferencesGROUP BY product_id;
And later on we can make use of JOIN (SELECT ...) syntax, but that would mean the sub-query forcing Postgres to create a temporary table for immediate join.
To prevent that, a materialised view can be used. A materialised view is the way to store the data selection beforehand and avoid unnecessarily repetitive data selection.
Materialized views must only be used with clear understanding of the approach of keeping it up-to-date. Failing to do so may cause serious disruption in the business logic. Never use Cron jobs to keep materialized views up to date, it is a recipe for a trouble.
Before committing to materialized views, give de-normalization a try.
CREATE MATERIALIZED VIEW products_user_info IF NOT EXISTS ASSELECTproduct_id,bool_or(is_favourite) AS is_favourite,bool_or(is_popular) AS is_popularFROM customer_preferencesGROUP BY product_idWITH DATA;
Then the view can be joined as a regular table:
SELECT * FROM products pINNER JOIN products_user_info pui ON pui.product_id = p.id;
Note two things:
- A materialised view consumes space on a hard drive just like a regular table does.
- A materialised view does not automatically refresh itself on every change of the related data, it needs to be re-populated manually or on a trigger, using the following SQL command:REFRESH MATERIALIZED VIEW products_user_info;The code is licensed under the MIT license
- To maintain accessibility, a materialized view should be better refreshed concurrently. In this case you'll need a covering index on such MV.REFRESH MATERIALIZED VIEW CONCURRENTLY products_user_info;The code is licensed under the MIT license
Making use of triggers
So, the materialized view needs to be updated. One can do it on the application level, but sometimes when there is no bulk imports/updates expected as a business need, it makes sense to run the refresher after every SQL statement execution that modifies the source data.
Before creating a trigger, please make sure the table is not heavily written! If that is the case, the CPU will spike to the clouds.
CREATE OR REPLACE FUNCTION refresh_products_user_info()RETURNS TRIGGER LANGUAGE plpgsqlAS $$BEGINREFRESH MATERIALIZED VIEW products_user_info;RAISE LOG 'refresh_products_user_info() was executed';RETURN NULL;END $$;CREATE TRIGGER refresh_products_user_infoAFTER INSERT OR UPDATE OR DELETE OR TRUNCATEON productsFOR EACH STATEMENTEXECUTE PROCEDURE refresh_products_user_info();
Note the logging statement, it can be used to check if the trigger was actually executed on prod later on.
How to disable a trigger and make sure it is disabled:
ALTER TABLE products DISABLE TRIGGER refresh_products_user_info;SELECT tgname, (case when tgenabled = 'D' then 'N' else 'Y' end) as enabled FROM pg_trigger WHERE tgrelid = 'public.products'::regclass;
Implementing counters
The task: I have an entity that can have different states. I need to query total count of that entity in different states. A good example is a task that can be "open", "in progress", "done", etc. I need to know how many tasks I have in each of the states.
So there are three approaches:
- Calculate the counts on every query. That solution is resource and time ineffective.
- Store the counters in a separate table and update the counters on every task insertion, update or deletion. This approach is read-optimal, however it is not concurrent-safe.
- Use the audit table pattern I have discussed before in details. The approach is concurrent-safe, however it may be not that read-performant and it has other limitations as well.
There is no right or wrong approach, choose one that suits your case better.
Storing nested data structures, such as trees
There are again three ways to do implement trees
- A straight-forward, naive approach would be to just keep a reference to the left and right node by storing primary keys of the corresponding records. This way is far from being read-optimal, as it requires a recursive algorithm to be in place in order to get a sub-set of items, and moreover the approach can only allow storing binary trees.
- The other solution is the Nested Sets pattern. This method is suitable for keeping tree-only structures and optimised for reading tree sub-branches.
- The close competitor of the previous one is the Adjacency List pattern. The adjacency list pattern can work with graphs of a general topology, but more space-consuming.
Note that there are more suitable databases on the market to work with such data structures, called Graph Databases.
Migrations
A few words about migrations. You should be mindful when writing queries, and always check the potential execution time. Be extra careful with so-called back-fill queries, such as this one:
UPDATE products SET created_at = updated_at;
This is an un-conditional update that back-fills the created_at attribute, and if executed on a table with 3M+ records, it can easily take up to several minutes to complete. The migration will most likely fail by timeout, and then the whole deployment will fail.
That's why it is also recommended to only change the database schema in migrations, and for the back-filling use a dedicated script that runs the same query across desired environments.
Replica identity
If you have a read replica, then the database should know how to map every record from each table into a corresponding table of the replica. That is why every table should either have a primary key, or it should have a unique index that can be used as a replica identity:
ALTER TABLE some_table REPLICA IDENTITY USING INDEX ix_some_index;
Addressing performance issues
When having an SQL query that shows poor performance, don't rush in with a solution. Instead, escalate it using the following rank:
- Try running explain select and put some indexes. Maybe it does great deal of help.
- Try de-normalizing data. Moving a couple of fields here and there can help cutting the amount of joins, especially when a table is joined exclusively to obtain the value of that column X.
- Rearrange the query basing on input parameters. Maybe you find some parameter combinations where you could cut some corners and make the query more performant? The approach when an endpoint has different performance depending on the input parameters is common, because different combinations are served by distinct SQL queries.
- If eventual consistency is tolerable, consider using materialized views. However, be careful and have clear understanding of how and when the views are going to be renewed.
- If write operations are slow and need performance boost, and eventual consistency is not a problem, then why not moving the operations from sync to async mode? Say, when an entity is changed, add this event to a queue, and on a separate go routine consume that queue and do stuff needed to be done. This will keep user hits fast as before and do the job later.
- Use in-memory cache and insert/update batching.
- If nothing above helps, re-think the architecture. Maybe you don't need such complexity? Maybe you could re-structure the data in a different manner? Maybe you don't have to show those counters to a user on the list view? Maybe you could avoid sorting by that attribute, because it's computationally expensive? Talk you your PM.
As you can see, every step up introduces changes of a different level of magnitude, so start small, try to get the low hanging fruit, and only then call for heavy artillery.
De-normalization
Sometimes the performance of a Select query is deeply impacted by the necessity of doing heavy joins. And the joins themselves may be difficult to optimise: you may not be able to put more or less effective indexes there, because due to low selectivity they simply won't be utilized. Also, you can't just simply shrug off these joins, because in this case the logic of the application totally goes south. Sometimes the only way to mitigate this is to admit that something went terribly wrong when the database schema was designed in the first place, and then pull off a costly refactoring.
But sometimes there are ways to shift the application towards being more read-optimised by preparing as much data as possible in advance. This is called "data de-normalisation".
So, instead of having a select with many static joins, it is worth creating an extra column and run a back-filling query whenever the data changes. Here is the example of such UPDATE FROM query:
UPDATE productsSET has_prices = products_with_prices.has_active_prices AND products_with_prices.has_active_storesFROM(SELECTproducts.id AS id,bool_or(prices.id IS NOT NULL AND prices.deleted_by IS NULL) AS has_active_prices,bool_or(store.id IS NOT NULL AND store.deleted_by IS NULL) AS has_active_storesFROM productsLEFT JOIN pricesON prices.product_id = products.idLEFT JOIN storesON stores.id = prices.store_idWHERE products.id IN (123, 456) -- <<<- Here you put any statement that returns a list of products to update. Use an array or a sub-query.GROUP BY products.id) AS products_with_pricesWHERE products.id = products_with_prices.id
Run this query for particular subset of products whenever the price or a store is changed, and just use the has_prices column in the select instead of several heavy joins.
Upsert
Classic upsert in Postgres looks like this:
INSERT INTOsettings (key, value)VALUES('USE_BRAIN', 'true')ON CONFLICT (key) DOUPDATE SETvalue='true';
Insert-select and update-select versus select and C-UD in a cycle
Most of the time a naive approach is to make a selection first (as a separate query), then iterate over the result (on the client side) by doing a series of insert-update-delete operations. This approach is quite pro-intuitive, but it's also concurrency-unsafe, error-prone and most of the time ineffective.
Whenever possible, use the insert-select and update-select syntax. For example:
INSERTINTO order_history (product_id, amount, updated_at, requested_by)SELECTs.id as product_id,1 as amount,now() as updated_at,'backfill-foo' as requested_byFROMproducts pWHEREnot exists(SELECT * from order_history oh where oh.product_id = p.id)ON CONFLICT DO NOTHINGRETURNING id
This request is atomic, and also it is more error-tolerant since it has the ON CONFLICT clause.
It's a good practice to leave a way for making possible future amends by marking a newly inserted records in a special way. For instance, in the case above the requested_by field has a special value.
Getting the list of affected rows
Postgres offers an amazing option of getting the list of affected rows when doing the UPDATE query. This is extremely helpful when it's crucial to know the list of rows that were actually updated. Just append the RETURNING clause to the end of the query.
UPDATE productsSET title = 'Title missing'WHERE title is nullRETURNING *;
You can use the asterisk to get all rows, or just ids.
Dealing with arrays
Filtering by JSON array
Not that often, but sometimes it's required to filter by the content of a JSON filed in a table. Well, for that matter there is a snippet. If there is an JSON field attributes that contains an array of objects, then to filter by objects where the key value contains foo:
SELECT id FROM productsCROSS JOIN LATERAL json_array_elements ( attributes ) AS jWHERE j->>'value' = 'foo';
Selecting arrays when aggregating
There is a way to put values into an array when grouping records. This is useful for visual debugging as well as for a bunch of other cases. This is possible thanks to the ARRAY_AGG() function. Take a look:
SELECTa.id,a.name,ARRAY_AGG(distinct b.title) AS booksFROM authors aINNER JOIN books b ON a.id = b.author_idGROUP BYa.id,a.name;
Database-level constraints
We usually apply constraints on the business logic level. But, if an SQL database is used, some of such constraints can be moved to the database level to increase consistency and data safety.
For example:
ALTER TABLE price ADD CONSTRAINT price_restrict_value CHECK ((status <> 'ACTIVE' AND value = 0) OR (status = 'INACTIVE'));
More intricate constraints can be established by using triggers.
Partial indexes
Partial indexes can be useful to build an index over just a subset of records. A good example could be excluding soft-deleted records:
CREATE UNIQUE INDEX IF NOT EXISTS ix_first_name_last_name ON users WHERE deleted_at IS NULL;
Managing enums
One of the most common tasks that pops up every now and then is to add a new member to an existing enum. With modern Postgres it's easy to do using the ALTER TYPE operator. However, this operator does not support the IF (NOT) EXISTS clause, so to make the code re-iterable and suitable for migrations, a script is needed.
Note: this only works since Postgres v12
DO $$BEGINIF NOT EXISTS (SELECT 1 FROM pg_enumWHERE enumtypid = 'your_enum_name'::regtypeAND enumlabel = 'NEW_MEMBER') THENALTER TYPE your_enum_name ADD VALUE 'NEW_MEMBER';END IF;END$$;
Concurrency
Managing concurrency is highly important when developing a high-loaded cloud native application. Making a selection, then analysing the result and updating the database is a naive, yet quite often unavoidable pattern. In this case, locking should be used.
The easiest way to manage concurrent requests would be by using advisory locks in Postgres. Here is an example:
BEGIN;SELECT pg_advisory_lock(12345);SElECT * from foo;UPDATE foo ... ;COMMIT;
Where 12345 is a lock id. When another thread (hit) reaches this place and tries to execute the same transaction, it will have to wait until the previous one is done. When the transaction is committed or rolled back, the lock is released.
Beware of deadlocks!
Besides, there is also a way to serialize transactions or do per table row locks, which can decrease the possibility of making a potential performance bottleneck.
Here is an amazing article I've found on the subject: Amazing overview of locking mechanisms in Postgres.
Backfilling vs migrations
There is a migration and there is backfilling. Migrations are often quite fast, as they typically only change the database schema. Backfilling, on the contrary, changes the content of the data in the database (and also indexes), and hence can be resource- and time-consuming. Beware of this, and feel safe to run migrations within the CICD pipeline, but always make dedicated scripts for executing backfilling.
Organising complex requests
Sometimes the SELECT requests can be really complex. To avoid problems understanding the IN (SELECT ...) or JOIN (SELECT ...) syntax, there is an alternative one, that uses the WITH operator.
Here is an example:
WITHauthor AS (SELECT count(*) as cnt, genre_id FROM books b WHERE b.author_id IS NOT NULL GROUP BY genre_id),missing_author AS (SELECT count(*) as cnt, genre_id FROM books b WHERE b.author_id IS NULL GROUP BY genre_id)SELECT*, a.cnt, ma.cntFROMgenres gLEFT JOIN author a ON a.genre_id = g.idLEFT JOIN missing_author ma ON ma.genre_id = g.id
Well, this article is a work in progress, just like the life of an engineer itself :) I will add more things as I keep discovering.
Using Postgres as pub/sub
Even though Postgres isn't tailored for such a task, it is still possible to a certain degree.
This will dispatch a message:
NOTIFY channel_name some_payload;-- or for use inside of triggers:SELECT pg_notify('channel_name', 'some_payload');
And this will listen to the messages:
LISTEN channel_name;
Beware however, that the size of the payload is limited. If your payload is potentially of unknown size (e.g. when you serialise a JSON there), instead of NOTIFY/LISTEN you may consider using un-logged table + on-insert trigger + listen.
An example of calling pg_notify() from a trigger that allows tracking the insert/update operations on a table:
CREATE OR REPLACE FUNCTION notify_price_table_change() RETURNS trigger AS $$DECLAREBEGINPERFORM pg_notify('price_table_change',json_build_object('operation', TG_OP,'data', json_build_object('id', NEW.id,'value', NEW.value))::text);RETURN NEW;END;$$ LANGUAGE plpgsql;
As mentioned above, don't put all fields from NEW to the payload, otherwise it may start failing when the table structure is changed.
Putting rows into buckets
Sometimes it's required to lay out rows into different buckets using certain criteria. For that, partitioning can be used.
In this example, we remove rows which are duplicates for some other row on fields first_name, last_name and salary:
WITH CTE AS (SELECT*,ROW_NUMBER() OVER (PARTITION BY first_name, last_name, salary ORDER BY id) AS rnFROMemployees)DELETE FROM employeesWHERE id IN (SELECT idFROM CTEWHERE rn > 1);
Sergei Gannochenko
Golang, React, TypeScript, Docker, AWS, Jamstack.
19+ years in dev.