PostgreSQL Optimization Techniques

Created on Oct 25, 2022

PostgreSQL has features that you can use to speed up query execution. Learning about PostgreSQL performance tuning helps you diagnose and fix slowness in your database later on. It's also worth thinking about performance before you even start writing queries.

This post covers PostgreSQL techniques to optimize queries: temporary tables, CTEs (Common Table Expressions), views, and materialized views, and when to reach for each one depending on your case.

Temp tables, CTEs, views, and materialized views are all discussed here from the perspective of decomposition in computer science. Applied in a database context (also called factoring), decomposition breaks a complex query or database system into smaller parts that are easier to understand, program, and maintain.

Download the database and set up PostgreSQL

This is a practical tutorial in which we experiment with SQL queries on real data from StackExchange. To follow along, clone the stackexchange-dump-to-postgres GitHub repo, then create a new Python virtual environment and activate it:

git clone https://github.com/Networks-Learning/stackexchange-dump-to-postgres
cd stackexchange-dump-to-postgres
python3 -m venv venv; . venv/bin/activate; pip install --upgrade pip

Then install the requirements with pip: pip install -r requirements.txt

Finally, create the PostgreSQL database using psql and then run the load_into_pg.py with options as shown below:

sudo -u postgres psql -c "CREATE DATABASE webappsse;"
python load_into_pg.py -s webapps -d webappsse -H localhost -u <username> -p <pass>

Note: The load_into_pg.py script downloads the webapps StackExchange dump and loads all of its tables into the database you just created (webappsse), using the username and password you pass in. The host here is localhost, but change it to match your setup.

The data is a compressed dump of the webapps StackExchange site from archive.org. The schema is explained in this StackExchange answer.

Clean up the database

The data coming from StackExchange ships with indexes on each table. Drop them all first so you can test queries without indexes; later you will recreate them and compare execution times for each technique this tutorial covers.

To list all indexes in these tables, run the following query:

SELECT idx.indexname
FROM pg_class c
JOIN pg_indexes idx ON idx.tablename = c.relname
WHERE c.relkind = 'r'
  AND c.relname NOT LIKE 'pg_%'
  AND c.relnamespace = 2200;

where relnamespace = 2200 is the OID of the public schema, which all the webapps tables share. The indexname column lists every index on these tables.
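If 2200 isn't the OID of the public schema on your installation, you can look it up first; a minimal sketch:

-- Find the OID of the public schema to use in the listing query above.
SELECT oid
FROM pg_namespace
WHERE nspname = 'public';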

I’ve combined all drop index statements (including dropping the constraints) that you need for the webapps tables here:

DROP INDEX tags_name_idx;
DROP INDEX tags_count_idx;
ALTER TABLE tags DROP CONSTRAINT tags_pkey;
DROP INDEX cmnts_userid_idx;
DROP INDEX cmnts_creation_date_idx;
DROP INDEX cmnts_postid_idx;
DROP INDEX cmnts_score_idx;
ALTER TABLE comments DROP CONSTRAINT comments_pkey;
DROP INDEX user_created_at_idx;
DROP INDEX user_down_votes_idx;
DROP INDEX user_up_votes_idx;
DROP INDEX user_display_idx;
DROP INDEX user_acc_id_idx;
ALTER TABLE users DROP CONSTRAINT users_pkey;
DROP INDEX badges_date_idx;
DROP INDEX badges_name_idx;
DROP INDEX badges_user_id_idx;
ALTER TABLE badges DROP CONSTRAINT badges_pkey;
DROP INDEX votes_creation_date_idx;
DROP INDEX votes_type_idx;
DROP INDEX votes_post_id_idx;
ALTER TABLE votes DROP CONSTRAINT votes_pkey;
DROP INDEX posts_parent_id_idx;
DROP INDEX posts_accepted_answer_id_idx;
DROP INDEX posts_viewcount_idx;
DROP INDEX posts_favorite_count_idx;
DROP INDEX posts_comment_count_idx;
DROP INDEX posts_answer_count_idx;
DROP INDEX posts_owner_user_id_idx;
DROP INDEX posts_creation_date_idx;
DROP INDEX posts_score_idx;
DROP INDEX posts_post_type_id_idx;
ALTER TABLE posts DROP CONSTRAINT posts_pkey;
DROP INDEX postlinks_related_post_id_idx;
DROP INDEX postlinks_post_id_idx;
ALTER TABLE postlinks DROP CONSTRAINT postlinks_pkey;
DROP INDEX ph_userid_idx;
DROP INDEX ph_creation_date_idx;
DROP INDEX ph_revguid_idx;
DROP INDEX ph_postid_idx;
DROP INDEX ph_post_type_id_idx;
ALTER TABLE posthistory DROP CONSTRAINT posthistory_pkey;

To make sure all indexes are now dropped, rerun the listing query from above; it should return no rows.

Now, you’re ready to start using the PostgreSQL query optimization techniques.

Temporary tables

Temporary tables are tables that exist only for the duration of a database session. Once you end the session, these tables are dropped automatically.
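A temporary table's lifetime can be narrowed even further with the ON COMMIT DROP clause, which drops it at the end of the current transaction instead of the session. A minimal sketch (the table name and data are illustrative):

BEGIN;
CREATE TEMP TABLE scratch (id int) ON COMMIT DROP;  -- dropped when the transaction ends
INSERT INTO scratch VALUES (1);
COMMIT;  -- scratch no longer exists after this point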

To follow a useful example, consider the following query:

SELECT a.owneruserid AS AcceptedAnswerUserID,
	   u.reputation,
	   q.acceptedanswerid,
	   q.id AS QuestionID,
	   q.owneruserid,
	   q.title,
	   q.tags,
	   concat(
	   'https://webapps.stackexchange.com/questions/',
	   q.id,
	   '/',
	   REPLACE(lower(q.title), ' ', '-')
	   ) AS url
FROM posts q
JOIN posts a ON q.acceptedanswerid = a.id
JOIN users u ON a.owneruserid = u.id
WHERE q.tags LIKE '%<gmail>%'
ORDER BY u.reputation DESC;

This query lists all users with accepted answers for a given web app, gmail in our case. The first part of the query selects from the posts table, aliased as q to indicate it plays the role of a questions table.

To get information about the users who wrote the accepted answers, we need to join a table that holds the answers. If you look carefully at the StackExchange schema and the PostTypeId column, you'll find that each row in the posts table can be a question, an answer, or any other StackExchange post type. So we join the posts table against itself, aliased as a, to represent the answers.

You then join that answers table with the users table to get user information. The WHERE clause filters the tags column for rows that contain <gmail> anywhere in the string. Finally, the output is ordered from the highest user reputation to the lowest.

I ran the above query on my machine and it executed in around 100 ms. Run it on your machine and note the timing of the second run (after the data has been cached). Compare this result with the following technique (also after caching).
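To take those measurements yourself, psql's \timing toggle reports the elapsed time of every statement, and EXPLAIN ANALYZE reports the server-side execution time. A minimal sketch (the query here is just a stand-in):

-- Toggle client-side timing in psql.
\timing on

-- EXPLAIN ANALYZE actually executes the statement and reports measured times.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM posts
WHERE tags LIKE '%<gmail>%';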

Now, a temporary table is a good fit here, especially if you want to repeat the filter for other web apps (e.g. youtube) and see their top users with accepted answers.

Creating a temp table is done by CREATE TEMP TABLE table_name AS followed by the SELECT statement:

CREATE TEMP TABLE webapps_elite AS
SELECT a.owneruserid AS AcceptedAnswerUserID,
	   u.reputation,
	   q.acceptedanswerid,
	   q.id AS QuestionID,
	   q.owneruserid,
	   q.title,
	   q.tags,
	   concat(
	   'https://webapps.stackexchange.com/questions/',
	   q.id,
	   '/',
	   REPLACE(lower(q.title), ' ', '-')
	   ) AS url
FROM posts q
JOIN posts a ON q.acceptedanswerid = a.id
JOIN users u ON a.owneruserid = u.id;

SELECT AcceptedAnswerUserID, url
FROM webapps_elite
WHERE tags LIKE '%<gmail>%'
ORDER BY reputation DESC;

So webapps_elite is the temporary table you use to get the accepted answer's user ID and the URL of each accepted answer. You then filter for the gmail web app and order the output by reputation in descending order.

Running this query takes around 7 ms, roughly an order of magnitude faster than the query without temporary tables.

Using temporary tables can also reduce maintenance headaches: restructuring the query into simpler steps makes it more readable and easier to understand.

Temporary tables can make query execution faster because they force the planner into a fixed order: the intermediate result is materialized first, and the final query runs against it. In the previous example, run EXPLAIN on the query against the temp table:

webappsse=# EXPLAIN SELECT AcceptedAnswerUserID, url
FROM webapps_elite
WHERE tags LIKE '%<gmail>%'
ORDER BY reputation DESC;
                              QUERY PLAN                              
----------------------------------------------------------------------
 Sort  (cost=620.51..620.51 rows=2 width=40)
   Sort Key: reputation DESC
   ->  Seq Scan on webapps_elite  (cost=0.00..620.50 rows=2 width=40)
         Filter: (tags ~~ '%<gmail>%'::text)
(4 rows)

As you can see, filtering the tags column happens directly on the temp table webapps_elite (the join work was already done when the table was created), and then the output is sorted by reputation. In this case, the execution time was around 7 ms.

The query without temporary tables, however, had to perform the intermediate JOIN steps itself and had no forced execution order. Take a look at the execution plan the PostgreSQL optimizer chose:

webappsse=# EXPLAIN SELECT a.owneruserid AS AcceptedAnswerUserID,
       concat(
       'https://webapps.stackexchange.com/questions/',
       q.id,
       '/',
       REPLACE(lower(q.title), ' ', '-')
       ) AS url
FROM posts q
JOIN posts a ON q.acceptedanswerid = a.id
JOIN users u ON a.owneruserid = u.id
WHERE q.tags LIKE '%<gmail>%'
ORDER BY u.reputation DESC;
                                            QUERY PLAN                                            
--------------------------------------------------------------------------------------------------
 Gather Merge  (cost=27660.49..27718.12 rows=494 width=40)
   Workers Planned: 2
   ->  Sort  (cost=26660.46..26661.08 rows=247 width=40)
         Sort Key: u.reputation DESC
         ->  Hash Join  (cost=17335.63..26650.65 rows=247 width=40)
               Hash Cond: (u.id = a.owneruserid)
               ->  Parallel Seq Scan on users u  (cost=0.00..6907.50 rows=91550 width=8)
               ->  Hash  (cost=17327.91..17327.91 rows=618 width=66)
                     ->  Hash Join  (cost=8321.69..17327.91 rows=618 width=66)
                           Hash Cond: (a.id = q.acceptedanswerid)
                           ->  Seq Scan on posts a  (cost=0.00..8062.96 rows=83296 width=8)
                           ->  Hash  (cost=8271.20..8271.20 rows=4039 width=66)
                                 ->  Seq Scan on posts q  (cost=0.00..8271.20 rows=4039 width=66)
                                       Filter: (tags ~~ '%<gmail>%'::text)
(14 rows)

Reading from the bottom up, you'll see the filter from the WHERE clause, then the intermediate step of each join: from the posts table (representing the answers) to the users table, and finally the sort on the reputation column.

Another benefit of temporary tables is that you can create indexes on them. So if execution is slow due to the forced order of your query, you can create an index on specific columns and the query optimizer will use index scans, just as it does on ordinary indexed tables.
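One caveat: a plain btree index can't serve a leading-wildcard filter like tags LIKE '%<gmail>%'; a trigram GIN index from the pg_trgm extension can. A minimal sketch, assuming the extension is available (the index name is illustrative):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A trigram GIN index can serve LIKE patterns with leading wildcards.
CREATE INDEX webapps_elite_tags_trgm_idx
ON webapps_elite USING gin (tags gin_trgm_ops);

-- Refresh statistics so the planner considers the new index.
ANALYZE webapps_elite;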

When you use temporary tables, keep an eye on disk space. Although these tables are dropped at the end of the session, they can bloat your database in the meantime. To check how much space your temporary tables take, run the following query:

webappsse=# SELECT schemaname, relname, pg_total_relation_size(relid) AS relsize
 FROM pg_stat_all_tables
 WHERE schemaname LIKE 'pg_%temp%';
   schemaname    |    relname     | relsize 
-----------------+----------------+---------
 pg_toast_temp_7 | pg_toast_96754 |    8192
 pg_temp_7       | webapps_elite  | 2998272
(2 rows)

The pg_toast_* table is autogenerated by PostgreSQL because some field values in our temp table exceeded the TOAST threshold (2 kB by default). The bigger share of the disk usage comes from the webapps_elite table itself, which takes around 2998272 bytes.

To sum the sizes and print them in a human-readable format instead of bytes, use the following:

webappsse=# SELECT pg_size_pretty(relsize)
FROM (
	SELECT sum(pg_total_relation_size(relid)) AS relsize
	FROM pg_stat_all_tables
	WHERE schemaname LIKE 'pg_%temp%'
) x;
 pg_size_pretty 
----------------
 2936 kB
(1 row)

So the temporary tables have an on-disk size of about 2.9 MB.

This can become a pain when you deal with large tables. To avoid the extra disk usage, have a look at CTEs.

Common Table Expressions (CTEs)

A CTE (Common Table Expression) is an auxiliary statement defined for use within a single query. Use a CTE when you have a complex query that benefits from being broken into readable chunks. A CTE lives only for that one query: you define it in a WITH clause that wraps a SQL statement.

webappsse=# EXPLAIN WITH webapps_elite_cte AS
(
    SELECT a.owneruserid AS AcceptedAnswerUserID,
       q.tags,
       u.reputation,
       concat(
       'https://webapps.stackexchange.com/questions/',
       q.id,
       '/',
       REPLACE(lower(q.title), ' ', '-')
       ) AS url
    FROM posts q
    JOIN posts a ON q.acceptedanswerid = a.id
    JOIN users u ON a.owneruserid = u.id
)
SELECT AcceptedAnswerUserID, url
FROM webapps_elite_cte w
WHERE tags LIKE '%<gmail>%'
ORDER BY reputation DESC;
                                       QUERY PLAN                                       
----------------------------------------------------------------------------------------
 Sort  (cost=34265.40..34265.41 rows=1 width=40)
   Sort Key: w.reputation DESC
   CTE webapps_elite_cte
     ->  Hash Join  (cost=21224.88..33990.38 rows=12223 width=65)
           Hash Cond: (a.owneruserid = u.id)
           ->  Hash Join  (cost=9430.16..20700.95 rows=12747 width=91)
                 Hash Cond: (q.acceptedanswerid = a.id)
                 ->  Seq Scan on posts q  (cost=0.00..8062.96 rows=83296 width=91)
                 ->  Hash  (cost=8062.96..8062.96 rows=83296 width=8)
                       ->  Seq Scan on posts a  (cost=0.00..8062.96 rows=83296 width=8)
           ->  Hash  (cost=8189.21..8189.21 rows=219721 width=8)
                 ->  Seq Scan on users u  (cost=0.00..8189.21 rows=219721 width=8)
   ->  CTE Scan on webapps_elite_cte w  (cost=0.00..275.02 rows=1 width=40)
         Filter: (tags ~~ '%<gmail>%'::text)
(14 rows)

The CTE took around 200 ms on my machine, slower than both previous cases. The only benefit in this implementation, on the PostgreSQL 10 instance I'm using, is that there is no additional disk usage, unlike temporary tables. Note that PostgreSQL 10 always materializes CTEs as an optimization fence; since PostgreSQL 12, the planner may inline a CTE into the parent query unless you mark it AS MATERIALIZED, so your results may differ on newer versions.
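On PostgreSQL 12 and later you can steer that behavior explicitly. A minimal sketch (this syntax will not run on the version 10 instance used here):

-- PostgreSQL 12+: ask the planner to inline the CTE into the parent
-- query instead of materializing it as an optimization fence.
WITH webapps_elite_cte AS NOT MATERIALIZED
(
    SELECT a.owneruserid AS AcceptedAnswerUserID,
           q.tags,
           u.reputation,
           concat(
           'https://webapps.stackexchange.com/questions/',
           q.id,
           '/',
           REPLACE(lower(q.title), ' ', '-')
           ) AS url
    FROM posts q
    JOIN posts a ON q.acceptedanswerid = a.id
    JOIN users u ON a.owneruserid = u.id
)
SELECT AcceptedAnswerUserID, url
FROM webapps_elite_cte
WHERE tags LIKE '%<gmail>%'
ORDER BY reputation DESC;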

In the following two sections, you’ll learn two techniques that can have better use cases than temporary tables and CTEs.

Views

A view is a database object that stores a query and presents it as a virtual table. The query is re-executed every time you select from the view, which means a view stores only the query, not the data itself. So if new records are inserted into a table referenced by that query, selecting from the view fetches the new data along with the old.

See views in action in the following example:

CREATE VIEW webapps_elite_view AS 
(
    SELECT a.owneruserid AS AcceptedAnswerUserID,
           q.tags,
           u.reputation,
           concat(
                'https://webapps.stackexchange.com/questions/',
                q.id,
                '/',
                REPLACE(lower(q.title), ' ', '-')
           ) AS url
    FROM posts q
    JOIN posts a ON q.acceptedanswerid = a.id
    JOIN users u ON a.owneruserid = u.id
);

SELECT * FROM webapps_elite_view
WHERE tags LIKE '%<gmail>%'
ORDER BY reputation DESC;

So webapps_elite_view is a view containing the query you used earlier to join the posts and users tables.

Selecting from that view took around 100 ms, just like the first query (without views). You will see how indexes improve this result later in this tutorial.

Materialized views

A materialized view is a database object that is similar to a view in that it stores a query, and similar to a table in that it also stores that query's results as of the time it runs.

A materialized view differs from a view in that it stores the query results. That means a materialized view does not reflect the current data, but the data as of the last time the materialized view was refreshed.

It also differs from a table in that you can't modify (update or delete) data in a materialized view. You can only refresh it to reflect the current data using REFRESH MATERIALIZED VIEW materialized_view_obj.

See the following materialized view below:

CREATE MATERIALIZED VIEW webapps_elite_mv AS 
(
    SELECT a.owneruserid AS AcceptedAnswerUserID,
           q.tags,
           u.reputation,
           concat(
                'https://webapps.stackexchange.com/questions/',
                q.id,
                '/',
                REPLACE(lower(q.title), ' ', '-')
           ) AS url
    FROM posts q
    JOIN posts a ON q.acceptedanswerid = a.id
    JOIN users u ON a.owneruserid = u.id
);

SELECT * FROM webapps_elite_mv
WHERE tags LIKE '%<gmail>%'
ORDER BY reputation DESC;

The query was much faster. It took around 4 ms.
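When the underlying tables change, the stored results go stale until you refresh them. A minimal sketch of refreshing (the CONCURRENTLY variant allows reads during the refresh but requires a unique index on the materialized view; the url column is assumed to be unique here):

-- Re-run the stored query and replace the stored results.
REFRESH MATERIALIZED VIEW webapps_elite_mv;

-- Optional: allow reads during the refresh. This needs a unique index first.
CREATE UNIQUE INDEX webapps_elite_mv_uidx ON webapps_elite_mv (url);
REFRESH MATERIALIZED VIEW CONCURRENTLY webapps_elite_mv;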

Creating indexes

Now, bring back the indexes that we dropped earlier and see the performance difference. Create all indexes with the following statements:

CREATE INDEX tags_name_idx ON public.tags USING hash (tagname) WITH (fillfactor='100');
CREATE INDEX tags_count_idx ON public.tags USING btree (count) WITH (fillfactor='100');
CREATE UNIQUE INDEX tags_pkey ON public.tags USING btree (id);
CREATE INDEX cmnts_userid_idx ON public.comments USING btree (userid) WITH (fillfactor='100');
CREATE INDEX cmnts_creation_date_idx ON public.comments USING btree (creationdate) WITH (fillfactor='100');
CREATE INDEX cmnts_postid_idx ON public.comments USING hash (postid) WITH (fillfactor='100');
CREATE INDEX cmnts_score_idx ON public.comments USING btree (score) WITH (fillfactor='100');
CREATE UNIQUE INDEX comments_pkey ON public.comments USING btree (id);
CREATE INDEX user_created_at_idx ON public.users USING btree (creationdate) WITH (fillfactor='100');
CREATE INDEX user_down_votes_idx ON public.users USING btree (downvotes) WITH (fillfactor='100');
CREATE INDEX user_up_votes_idx ON public.users USING btree (upvotes) WITH (fillfactor='100');
CREATE INDEX user_display_idx ON public.users USING hash (displayname) WITH (fillfactor='100');
CREATE INDEX user_acc_id_idx ON public.users USING hash (accountid) WITH (fillfactor='100');
CREATE UNIQUE INDEX users_pkey ON public.users USING btree (id);
CREATE INDEX badges_date_idx ON public.badges USING btree (date) WITH (fillfactor='100');
CREATE INDEX badges_name_idx ON public.badges USING btree (name) WITH (fillfactor='100');
CREATE INDEX badges_user_id_idx ON public.badges USING btree (userid) WITH (fillfactor='100');
CREATE UNIQUE INDEX badges_pkey ON public.badges USING btree (id);
CREATE INDEX votes_creation_date_idx ON public.votes USING btree (creationdate) WITH (fillfactor='100');
CREATE INDEX votes_type_idx ON public.votes USING btree (votetypeid) WITH (fillfactor='100');
CREATE INDEX votes_post_id_idx ON public.votes USING hash (postid) WITH (fillfactor='100');
CREATE UNIQUE INDEX votes_pkey ON public.votes USING btree (id);
CREATE INDEX posts_parent_id_idx ON public.posts USING btree (parentid) WITH (fillfactor='100');
CREATE INDEX posts_accepted_answer_id_idx ON public.posts USING btree (acceptedanswerid) WITH (fillfactor='100');
CREATE INDEX posts_viewcount_idx ON public.posts USING btree (viewcount) WITH (fillfactor='100');
CREATE INDEX posts_favorite_count_idx ON public.posts USING btree (favoritecount) WITH (fillfactor='100');
CREATE INDEX posts_comment_count_idx ON public.posts USING btree (commentcount) WITH (fillfactor='100');
CREATE INDEX posts_answer_count_idx ON public.posts USING btree (answercount) WITH (fillfactor='100');
CREATE INDEX posts_owner_user_id_idx ON public.posts USING hash (owneruserid) WITH (fillfactor='100');
CREATE INDEX posts_creation_date_idx ON public.posts USING btree (creationdate) WITH (fillfactor='100');
CREATE INDEX posts_score_idx ON public.posts USING btree (score) WITH (fillfactor='100');
CREATE INDEX posts_post_type_id_idx ON public.posts USING btree (posttypeid) WITH (fillfactor='100');
CREATE UNIQUE INDEX posts_pkey ON public.posts USING btree (id);
CREATE INDEX postlinks_related_post_id_idx ON public.postlinks USING btree (relatedpostid) WITH (fillfactor='100');
CREATE INDEX postlinks_post_id_idx ON public.postlinks USING btree (postid) WITH (fillfactor='100');
CREATE UNIQUE INDEX postlinks_pkey ON public.postlinks USING btree (id);
CREATE INDEX ph_userid_idx ON public.posthistory USING btree (userid) WITH (fillfactor='100');
CREATE INDEX ph_creation_date_idx ON public.posthistory USING btree (creationdate) WITH (fillfactor='100');
CREATE INDEX ph_revguid_idx ON public.posthistory USING btree (revisionguid) WITH (fillfactor='100');
CREATE INDEX ph_postid_idx ON public.posthistory USING hash (postid) WITH (fillfactor='100');
CREATE INDEX ph_post_type_id_idx ON public.posthistory USING btree (posthistorytypeid) WITH (fillfactor='100');
CREATE UNIQUE INDEX posthistory_pkey ON public.posthistory USING btree (id);

Now, run the first query again with all indexes in place. On my machine it ran in 60 ms, compared with more than 100 ms without indexes.

Temporary tables with indexes

Create a new temporary table named webapps_elite_idx and compare the execution time with the earlier webapps_elite temp table. I found no big difference: it executed in around 6 ms, while the earlier one without indexes ran in around 7 ms.
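For reference, a minimal sketch of that comparison (a trimmed version of the earlier webapps_elite definition, keeping only the columns the final SELECT needs; only the name differs):

CREATE TEMP TABLE webapps_elite_idx AS
SELECT a.owneruserid AS AcceptedAnswerUserID,
       u.reputation,
       q.tags,
       concat(
       'https://webapps.stackexchange.com/questions/',
       q.id,
       '/',
       REPLACE(lower(q.title), ' ', '-')
       ) AS url
FROM posts q
JOIN posts a ON q.acceptedanswerid = a.id
JOIN users u ON a.owneruserid = u.id;

SELECT AcceptedAnswerUserID, url
FROM webapps_elite_idx
WHERE tags LIKE '%<gmail>%'
ORDER BY reputation DESC;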

CTEs with indexes

Remember how long the CTE took? It was 200 ms earlier; run it again now and you'll see an improvement. It took around 110 ms on my machine.

Views with indexes

Create a new view named webapps_elite_view_idx and compare the result with the earlier webapps_elite_view. This view took around 40 ms, while the previous one took around 100 ms.

Materialized views with indexes

Create a new materialized view named webapps_elite_mv_idx and compare the result with the earlier webapps_elite_mv. This query was already fast, so it still took around the same 4 ms on my machine.

Conclusion

This tutorial has covered PostgreSQL techniques that you can use in your project to optimize your queries and make them shorter and/or faster. We've discussed temporary tables, common table expressions (CTEs), views, and materialized views.

To wrap up, there is no single best technique; it always depends on your data and your situation. Sometimes you need intermediate data that persists for a whole session, so a CTE (which lives for a single query) isn't the right fit. Sometimes you care more about readability than performance, so you pick whatever reads best. Sometimes the data doesn't change often and you care about performance, so you use a materialized view. And sometimes you need to store a frequently used query without caring much about performance, so you use a view.

All in all, choose the optimization technique based on your needs and your own experiments.

This article was originally published on Datasentinel; check it out.