PostgreSQL performance starts at the data modeling phase. The decisions you make while designing your tables shape the plans the PostgreSQL optimizer can choose, and can either boost your database performance or hurt it.
In this PostgreSQL tutorial, you will learn how database design matters in PostgreSQL performance tuning. You’ll gain insights into the benefits of primary keys, when to normalize, and when to de-normalize.
Setting up the database
As in the PostgreSQL Optimization Techniques article, we will set up the StackExchange database. In this tutorial, we will work with the unix project, i.e. the data of the Unix & Linux site on StackExchange.
To set up the database, clone the Stackexchange Postgresql dump GitHub repo and create a new database first:
git clone https://github.com/Networks-Learning/stackexchange-dump-to-postgres.git
sudo -u postgres psql -c "CREATE DATABASE unix;"
and then use the load_into_pg.py script to load the data from the archive website:
python load_into_pg.py -s unix -d unix -H localhost -u <username> -p <pass>
Once the database is set up, it’s time to clean it up, as you’ll see in the following section.
Cleaning up data
To get good insights from our data, you first need to strip the database down by removing all primary keys and indexes. This leaves a barebones database that you can experiment with freely.
To drop all primary keys, run the following SQL statements:
ALTER TABLE badges DROP CONSTRAINT badges_pkey CASCADE;
ALTER TABLE comments DROP CONSTRAINT comments_pkey CASCADE;
ALTER TABLE posthistory DROP CONSTRAINT posthistory_pkey CASCADE;
ALTER TABLE postlinks DROP CONSTRAINT postlinks_pkey CASCADE;
ALTER TABLE posts DROP CONSTRAINT posts_pkey CASCADE;
ALTER TABLE tags DROP CONSTRAINT tags_pkey CASCADE;
ALTER TABLE users DROP CONSTRAINT users_pkey CASCADE;
ALTER TABLE votes DROP CONSTRAINT votes_pkey CASCADE;
And to drop all indexes for this database:
DROP INDEX tags_count_idx;
DROP INDEX tags_name_idx;
DROP INDEX badges_user_id_idx;
DROP INDEX badges_name_idx;
DROP INDEX badges_date_idx;
DROP INDEX user_acc_id_idx;
DROP INDEX user_display_idx;
DROP INDEX user_up_votes_idx;
DROP INDEX user_down_votes_idx;
DROP INDEX user_created_at_idx;
DROP INDEX posts_post_type_id_idx;
DROP INDEX posts_score_idx;
DROP INDEX posts_creation_date_idx;
DROP INDEX posts_owner_user_id_idx;
DROP INDEX posts_answer_count_idx;
DROP INDEX posts_comment_count_idx;
DROP INDEX posts_favorite_count_idx;
DROP INDEX posts_viewcount_idx;
DROP INDEX posts_accepted_answer_id_idx;
DROP INDEX posts_parent_id_idx;
DROP INDEX cmnts_score_idx;
DROP INDEX cmnts_postid_idx;
DROP INDEX cmnts_creation_date_idx;
DROP INDEX cmnts_userid_idx;
DROP INDEX votes_post_id_idx;
DROP INDEX votes_type_idx;
DROP INDEX votes_creation_date_idx;
DROP INDEX postlinks_post_id_idx;
DROP INDEX postlinks_related_post_id_idx;
DROP INDEX ph_post_type_id_idx;
DROP INDEX ph_postid_idx;
DROP INDEX ph_revguid_idx;
DROP INDEX ph_creation_date_idx;
DROP INDEX ph_userid_idx;
Now our Postgres tables are bare, without any performance improvements. Let’s dive into how performance starts with database design.
Specifying primary keys in every table
When you start designing tables, make sure you set a primary key for each Postgres table. Even a column with unique values does not speed up SQL queries by itself the way declaring it as a primary key does. That’s because when you add a primary key, PostgreSQL automatically creates a unique B-tree index on that column.
See that in action in the following example:
SELECT concat(
    'https://unix.stackexchange.com/questions/',
    p.id, '/',
    REPLACE(lower(p.title), ' ', '-')
) AS url
FROM posts p
JOIN posthistory ph ON p.id = ph.postid
WHERE p.tags LIKE '%<python>%'
AND ph.posthistorytypeid = 16;
This query returns the StackExchange URLs of posts tagged with python that are community owned. It does that by joining the posts table with the posthistory table. The filters pick out tags containing the python keyword, and a posthistorytypeid value of 16, which means the post is community owned. You can consult the database schema for more information on the attributes of each table.
Running this query took around 3m 2s on my machine. Remember, this result is without specifying any primary key or having any index on either table.
Here is the execution plan when you run EXPLAIN on that Postgres query:
Nested Loop  (cost=2000.00..193078.96 rows=1 width=32)
  Join Filter: (p.id = ph.postid)
  ->  Gather  (cost=1000.00..128415.64 rows=2190 width=4)
        Workers Planned: 2
        ->  Parallel Seq Scan on posthistory ph  (cost=0.00..127196.64 rows=912 width=4)
              Filter: (posthistorytypeid = 16)
  ->  Materialize  (cost=1000.00..64006.37 rows=20 width=58)
        ->  Gather  (cost=1000.00..64006.27 rows=20 width=58)
              Workers Planned: 2
              ->  Parallel Seq Scan on posts p  (cost=0.00..63004.27 rows=8 width=58)
                    Filter: (tags ~~ '%<python>%'::text)
Notice the sequential scan used on both tables. Every record of both tables has been scanned, with no performance gain other than the Postgres optimizer using parallel workers.
Let’s see what effect adding a primary key constraint to both tables has on this query:
-- Add primary keys to posts and posthistory tables
ALTER TABLE posthistory ADD CONSTRAINT "posthistory_pkey" PRIMARY KEY (id);
ALTER TABLE posts ADD CONSTRAINT "posts_pkey" PRIMARY KEY (id);
Each table now has a primary key specified on the id column.
Note: We used the ALTER TABLE syntax here to add the constraint to an existing column. If you’re creating the posts table from scratch, you would declare the column as a primary key like this:
CREATE TABLE posts (
    id int PRIMARY KEY,
    ...
);
Now, running the previous query takes 1m 5s on my machine, around a third of the time it took without the primary keys.
That’s because an index is used. Here is the execution plan:
Gather  (cost=1000.42..134310.06 rows=1 width=32)
  Workers Planned: 2
  ->  Nested Loop  (cost=0.42..133309.96 rows=1 width=32)
        ->  Parallel Seq Scan on posthistory ph  (cost=0.00..127196.64 rows=912 width=4)
              Filter: (posthistorytypeid = 16)
        ->  Index Scan using posts_pkey on posts p  (cost=0.42..6.70 rows=1 width=58)
              Index Cond: (id = ph.postid)
              Filter: (tags ~~ '%<python>%'::text)
Notice the index scan here instead of the earlier sequential scan.
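The same seq-scan-versus-index-scan effect is easy to reproduce in miniature. Here is a sketch using Python's built-in sqlite3 module as a stand-in for Postgres (an assumption for illustration only: SQLite's EXPLAIN QUERY PLAN plays the role of Postgres's EXPLAIN, and the table names are invented), showing that declaring a primary key is what turns a full scan into an index lookup:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# A "posts"-like table with no primary key and no index.
cur.execute("CREATE TABLE posts_nopk (id INT, title TEXT)")
cur.executemany("INSERT INTO posts_nopk VALUES (?, ?)",
                [(i, f"title {i}") for i in range(1000)])

# The same table with a primary key; like Postgres, SQLite backs
# the key with an index it can use for lookups.
cur.execute("CREATE TABLE posts_pk (id INT PRIMARY KEY, title TEXT)")
cur.executemany("INSERT INTO posts_pk VALUES (?, ?)",
                [(i, f"title {i}") for i in range(1000)])

plan_nopk = cur.execute(
    "EXPLAIN QUERY PLAN SELECT title FROM posts_nopk WHERE id = 42").fetchall()
plan_pk = cur.execute(
    "EXPLAIN QUERY PLAN SELECT title FROM posts_pk WHERE id = 42").fetchall()

print(plan_nopk)  # a full SCAN of posts_nopk
print(plan_pk)    # a SEARCH using the auto-created primary-key index
```

The plan text differs between SQLite versions, but the shape of the result mirrors the Postgres plans above: a scan without the key, an index search with it.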
Database normalization
Database normalization structures the tables and columns in your relational database so that their dependencies are formed properly. While normalization’s goal is to reduce data redundancy and improve data integrity, it also improves performance in specific cases.
When we searched for the python keyword using the LIKE operator, we hit a performance bottleneck. That’s because an ordinary index can’t help with that kind of pattern matching, and maintaining a full-text search index on every tag would be expensive. A better way than pattern matching to search for a tag is to normalize and break up the relationship between the posts and the tags tables.
To do such normalization, you need to know the relationship between tags and posts: each post can have multiple tags, and each tag can be used on multiple posts. That means you have a many-to-many relationship between the posts and tags tables. To model such a relationship, you need one more table (called posttags). This table exists in the schema diagram of StackExchange, but it’s not included in the data dump. I have already created that table for you and included the data dump for that specific table.
To design such a table, you’d have two columns: postid and tagid. Each combination of the two columns should be uniquely identified, so you’d better create a compound primary key on both columns to make Postgres automatically create an index on the combination of both fields:
CREATE TABLE posttags (
    postid int,
    tagid int,
    PRIMARY KEY (postid, tagid)
);
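As a quick sanity check of this junction-table design, here is a sketch using Python's built-in sqlite3 module as a stand-in for Postgres (the table and column names follow the article; the sample rows are invented for illustration). It shows the tag lookup going through the junction table instead of LIKE, and the compound primary key rejecting duplicate pairs:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Minimal versions of the three normalized tables.
cur.execute("CREATE TABLE posts (id INT PRIMARY KEY, title TEXT)")
cur.execute("CREATE TABLE tags (id INT PRIMARY KEY, tagname TEXT)")
cur.execute("""CREATE TABLE posttags (
    postid int,
    tagid int,
    PRIMARY KEY (postid, tagid)  -- compound key -> index on the pair
)""")

cur.executemany("INSERT INTO posts VALUES (?, ?)",
                [(1, "grep colors"), (2, "python question")])
cur.executemany("INSERT INTO tags VALUES (?, ?)",
                [(10, "grep"), (11, "colors"), (12, "python")])
cur.executemany("INSERT INTO posttags VALUES (?, ?)",
                [(1, 10), (1, 11), (2, 12)])

# Tag lookup through the junction table instead of LIKE on a text column.
rows = cur.execute("""
    SELECT p.title
    FROM posts p
    JOIN posttags pt ON pt.postid = p.id
    JOIN tags t ON t.id = pt.tagid
    WHERE t.tagname = 'python'
""").fetchall()
print(rows)  # [('python question',)]

# The compound primary key rejects duplicate (postid, tagid) pairs.
duplicate_rejected = False
try:
    cur.execute("INSERT INTO posttags VALUES (1, 10)")
except sqlite3.IntegrityError:
    duplicate_rejected = True
print("duplicate rejected:", duplicate_rejected)
```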
Now the posttags table is ready and you can insert data into it. I’ve created a dump in this repo so you can restore that specific table to your database.
Download the file and place it in your current working directory.
Assuming your database name is the same as before (unix) and your Postgres username is postgres, here is the command to restore the posttags table into your database:
sudo -u postgres pg_restore -d unix posttags.dump
If you list this table, you’ll see unique combinations of the postid and tagid columns. Let’s get the same information as the earlier query that filters on the python tag to get the community-owned posts:
-- Using the normalized tables
SELECT concat(
    'https://unix.stackexchange.com/questions/',
    p.id, '/',
    REPLACE(lower(p.title), ' ', '-')
) AS url
FROM posts p
JOIN posttags pt ON pt.postid = p.id
JOIN tags t ON t.id = pt.tagid
JOIN posthistory ph ON p.id = ph.postid
WHERE t.tagname = 'python'
AND ph.posthistorytypeid = 16;
Running this query took around 26.7s, less than half the time it took without normalization. That takes us to the next section, which covers de-normalization.
De-normalization
Normalization doesn’t always lead to a performance boost. In fact, it’s mostly useful for structuring your tables so you query more consistent data. De-normalization is the opposite of normalization: a table groups related facts together without breaking them up into other tables. This improves performance in cases where you avoid the extra JOINs that normalization requires.
Let’s take a look at the following denormalized example:
SELECT concat(
    'https://unix.stackexchange.com/questions/',
    p.id, '/',
    REPLACE(lower(p.title), ' ', '-')
) AS url
FROM posts p
WHERE p.tags IN ('<colors><grep>', '<grep><colors>');
Here we want all posts from the unix StackExchange community that have both the colors and grep tags. We just filter on the tags column, which holds all of a post’s tags in one text value. The query returns posts with both tags, whether the user tagged colors first or grep first.
This is a simple query in the denormalized version where we didn’t use a single JOIN statement.
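The order-dependence of this de-normalized filter can be made concrete with a small sketch using Python's built-in sqlite3 module as a stand-in for Postgres (the titles and rows are invented for illustration): because all tags live in one text column, every tag ordering has to be enumerated by hand in the IN list.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# De-normalized: all tags of a post live in a single text column.
cur.execute("CREATE TABLE posts (id INT PRIMARY KEY, title TEXT, tags TEXT)")
cur.executemany("INSERT INTO posts VALUES (?, ?, ?)", [
    (1, "highlight matches",  "<colors><grep>"),
    (2, "grep colour output", "<grep><colors>"),
    (3, "grep basics",        "<grep>"),
])

# Both tag orderings must be listed explicitly; no JOIN is needed.
rows = cur.execute("""
    SELECT title FROM posts
    WHERE tags IN ('<colors><grep>', '<grep><colors>')
""").fetchall()
print(rows)  # the two posts carrying both tags, in either order
```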
Let’s now see what a normalized version would look like to get the same result. We can first create a view to get all posts with the grep tag:
CREATE VIEW grep_view AS (
    SELECT p.id AS post_id, t.id AS tag_id, p.title
    FROM posts p
    JOIN posttags pt ON pt.postid = p.id
    JOIN tags t ON pt.tagid = t.id AND t.tagname = 'grep'
);
Then we would create a second view to get all posts with the colors tag:
CREATE VIEW colors_view AS (
    SELECT p.id, pt.tagid
    FROM posts p
    JOIN posttags pt ON pt.postid = p.id
    JOIN tags t ON pt.tagid = t.id AND t.tagname = 'colors'
);
A third view would take the intersection of the two views to select the common posts:
CREATE VIEW grep_colors_view AS (
    SELECT g.post_id, g.tag_id AS grep_tagid, c.tagid AS colors_tagid, g.title
    FROM grep_view g
    JOIN colors_view c ON g.post_id = c.id
);
Finally, we would filter on the third view to make sure we don’t get posts that carry any tags other than grep and colors:
SELECT concat(
    'https://unix.stackexchange.com/questions/',
    gc.post_id, '/',
    REPLACE(lower(gc.title), ' ', '-')
) AS url
FROM grep_colors_view gc
WHERE gc.post_id NOT IN (
    SELECT pt.postid
    FROM posttags pt
    JOIN grep_colors_view gc ON gc.post_id = pt.postid
    WHERE pt.tagid != gc.grep_tagid
    AND pt.tagid != gc.colors_tagid
);
If you execute both queries, you’ll see that the query on the denormalized table is about 1 second faster, and it’s much less hassle. The gap grows if you want to filter on more tags: the normalized version needs more and more joins, and the more joins there are, the harder it is for the PostgreSQL optimizer to choose the best execution plan, which can degrade performance for read loads.
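For comparison, the multi-view intersection above can also be expressed as a single aggregate query over the normalized tables. This is a sketch of that alternative formulation, not a query from the article, again using Python's built-in sqlite3 module as a stand-in for Postgres with invented sample rows: count how many of the wanted tags each post carries, and require the post's total tag count to equal that number.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE posts (id INT PRIMARY KEY, title TEXT)")
cur.execute("CREATE TABLE tags (id INT PRIMARY KEY, tagname TEXT)")
cur.execute("CREATE TABLE posttags (postid int, tagid int, "
            "PRIMARY KEY (postid, tagid))")
cur.executemany("INSERT INTO posts VALUES (?, ?)",
                [(1, "exactly grep+colors"), (2, "grep colors awk"),
                 (3, "grep only")])
cur.executemany("INSERT INTO tags VALUES (?, ?)",
                [(10, "grep"), (11, "colors"), (12, "awk")])
cur.executemany("INSERT INTO posttags VALUES (?, ?)",
                [(1, 10), (1, 11), (2, 10), (2, 11), (2, 12), (3, 10)])

# Posts tagged grep and colors and nothing else: per post, count the
# matching tags and require the total tag count to be exactly 2.
rows = cur.execute("""
    SELECT p.title
    FROM posts p
    JOIN posttags pt ON pt.postid = p.id
    GROUP BY p.id
    HAVING SUM(CASE WHEN pt.tagid IN (
                 SELECT id FROM tags WHERE tagname IN ('grep', 'colors')
               ) THEN 1 ELSE 0 END) = 2
       AND COUNT(*) = 2
""").fetchall()
print(rows)  # only the post tagged with exactly grep and colors
```

One GROUP BY replaces the three views, but the de-normalized IN filter remains the simplest option of all.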
Conclusion
This tutorial covered how database design matters for PostgreSQL performance. We saw how performance starts with table design, looked at the importance of primary keys and how PostgreSQL turns them into indexes for your benefit, and went through examples of when normalization is beneficial and when de-normalization can boost Postgres performance.
To wrap up, improving database performance starts from the user experience: look at what data users need to retrieve, and focus on making that retrieval faster. Whether you normalize or de-normalize, the decision should be driven by actual user needs.
This article was originally published on Datasentinel, check it out.