SQL is a powerful tool that can help us to extract valuable insights from large datasets. However, writing SQL queries that run efficiently can be a challenging task, even for experienced developers.
When debugging slow running queries, you might be tempted to follow the first route that comes on your mind. However, most of the time, we come across bad advice that can lead to bigger problems and performance issues in the future.
This tutorial will cover seven of the most common mistakes that people make when writing SQL queries for PostgreSQL. By learning about these mistakes and how to avoid them, you can make your PostgreSQL queries run much faster and save time and resources.
Mistake #1: Not using indexes properly
Indexes is one of the most powerful features of PostgreSQL, as they can significantly speed up query execution times for large datasets. An index is essentially a data structure that stores a sorted copy of selected columns from a table, which allows the database system to quickly look up and return the requested data.
However, not using indexes properly can actually slow down query performance. Some common mistakes include:
- Not defining indexes on columns that are frequently used in search conditions or join clauses
- Defining too many indexes, which can increase the overhead of insert, update, and delete operations
- Using the wrong type of index for the query at hand, such as a B-tree index for full-text search queries
To avoid these mistakes, you should carefully analyze the types of queries that your application needs to perform and create indexes that are optimized for those queries. You should also periodically review your index usage and remove any indexes that are no longer needed. In general, less is more when it comes to indexes: focus on creating the fewest number of indexes that provide the greatest benefit to your queries.
Here are some examples of queries that can benefit from indexes:
- Queries that involve equality conditions on a single column, such as SELECT * FROM users WHERE email = ‘firstname.lastname@example.org’
- Queries that involve inequality conditions on a single column, such as SELECT * FROM orders WHERE total_price > 100
- Queries that involve a combination of multiple columns, such as SELECT * FROM employees WHERE department = ‘sales’ AND salary > 50000
- Queries that involve sorting or grouping on a column, such as SELECT department, AVG(salary) FROM employees GROUP BY department ORDER BY AVG(salary) DESC
- Queries that involve joining two or more tables on a common column, such as SELECT * FROM customers JOIN orders ON customers.id = orders.customer_id
By defining indexes on the relevant columns for these types of queries, you can significantly improve their performance. However, it’s important to remember that there are tradeoffs involved in index creation, so you should carefully balance the benefits of improved query performance with the costs of increased storage space and slower insert, update, and delete operations.
Mistake #2: Writing SQL queries that expose tables or indexes to scanning
One of the most common SQL mistakes that can lead to reduced PostgreSQL performance is writing queries that expose tables or indexes to scanning. This can happen when a query fails to use a WHERE clause, uses a poorly optimized WHERE clause, or performs a full table scan instead of utilizing an index.
Table scans can be extremely resource-intensive, particularly for large datasets, and can significantly slow down query execution times.
Take a look at the following example:
SELECT (SELECT count(*) FROM users WHERE reputation <= 50) AS inactive, (SELECT count(*) FROM users WHERE reputation > 50 AND reputation <= 1000) AS intermediate, (SELECT count(*) FROM users WHERE reputation > 1000) AS nerd; -- (took 309ms)
This query counts the number of users in different reputation ranges and returns the counts as separate columns. The following is the execution plan for it:
Result (cost=50944.03..50944.04 rows=1 width=24) InitPlan 1 (returns $1) -> Finalize Aggregate (cost=16954.89..16954.90 rows=1 width=8) -> Gather (cost=16954.67..16954.88 rows=2 width=8) Workers Planned: 2 -> Partial Aggregate (cost=15954.67..15954.68 rows=1 width=8) -> Parallel Seq Scan on users (cost=0.00..15639.10 rows=126228 width=0) Filter: (reputation <= 50) InitPlan 2 (returns $3) -> Finalize Aggregate (cost=17346.75..17346.76 rows=1 width=8) -> Gather (cost=17346.53..17346.74 rows=2 width=8) Workers Planned: 2 -> Partial Aggregate (cost=16346.53..16346.54 rows=1 width=8) -> Parallel Seq Scan on users users_1 (cost=0.00..16152.12 rows=77763 width=0) Filter: ((reputation > 50) AND (reputation <= 1000)) InitPlan 3 (returns $5) -> Finalize Aggregate (cost=16642.36..16642.37 rows=1 width=8) -> Gather (cost=16642.15..16642.36 rows=2 width=8) Workers Planned: 2 -> Partial Aggregate (cost=15642.15..15642.16 rows=1 width=8) -> Parallel Seq Scan on users users_2 (cost=0.00..15639.10 rows=1217 width=0) Filter: (reputation > 1000)
As you can see, the query uses three separate subqueries, each of which calculates a count of the number of users falling into a different reputation range. The subqueries are executed as “init plans” that are executed in parallel with the main query. The results of the subqueries are returned as columns in the main query result.
Each of the subqueries performs a parallel sequential scan of the
users table, filtering by
reputation ranges using a simple filter predicate. Each subquery then performs a partial aggregation, which is then gathered and finalized to produce the total count for each reputation range.
The overall cost of the query is relatively high, indicating that the query may be resource-intensive and may benefit from performance optimization.
So you can use
FILTER to get rid of these three subqueries like the following:
SELECT count(*) FILTER (WHERE reputation <= 50) AS inactive, count(*) FILTER (WHERE reputation > 50 AND reputation <= 1000) AS intermediate, count(*) FILTER (WHERE reputation > 1000) AS nerd FROM users; -- (took 110ms)
EXPLAIN command to this query produces the following execution plan:
Finalize Aggregate (cost=19717.45..19717.46 rows=1 width=24) -> Gather (cost=19717.22..19717.43 rows=2 width=24) Workers Planned: 2 -> Partial Aggregate (cost=18717.22..18717.23 rows=1 width=24) -> Parallel Seq Scan on users (cost=0.00..15126.08 rows=205208 width=4)
As you can see, the query calculates a single result that is the sum of all the reputation values in the table.
The reason why this single query is better in performance than the query with 3 subqueries is because it only needs to scan the table once. In contrast, the subquery approach requires scanning the table 3 times, which can be expensive in terms of I/O operations and can have a negative impact on performance.
Note: You can use
CASE instead of
FILTER in the following way:
SELECT COUNT(CASE WHEN reputation <= 50 THEN 1 END) AS inactive, COUNT(CASE WHEN reputation > 50 AND reputation <= 1000 THEN 1 END) AS intermediate, COUNT(CASE WHEN reputation > 1000 THEN 1 END) AS nerd FROM users;
CASE expression evaluates each row and returns either 1 or NULL, depending on whether the condition is true or false. The
COUNT function counts the non-null values, so the result is the same as using
Similarly, scanning an index can also be costly, particularly when the index is not optimized for the query at hand.
To avoid this mistake, it’s important to carefully design your SQL queries with performance in mind. This includes optimizing your WHERE clauses and ensuring that indexes are used whenever possible. You can also use the EXPLAIN statement to see how PostgreSQL is executing your queries, and to identify any table or index scans that may be slowing down your queries.
You can use the
pg_stat_all_indexes system catalog tables to get information about table and index scans on your PostgreSQL database. Here’s an example SQL query that you can use to get information about table and index scans for all tables in your database:
SELECT relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch FROM pg_stat_all_tables;
This query will return a result set with the following columns:
relname: the name of the table
seq_scan: the number of sequential table scans that have been performed on the table
seq_tup_read: the number of rows that have been read during sequential table scans
idx_scan: the number of index scans that have been performed on the table
idx_tup_fetch: the number of rows that have been fetched from the table during index scans
Similarly, you can use the following SQL query to get information about index scans for all indexes in your database:
SELECT relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch FROM pg_stat_all_indexes;
This query will return a result set with the following columns:
relname: the name of the table that the index belongs to
indexrelname: the name of the index
idx_scan: the number of index scans that have been performed on the index
idx_tup_read: the number of rows that have been read from the table during index scans
idx_tup_fetch: the number of rows that have been fetched from the table during index scans
By running these queries, you can get a better understanding of how your tables and indexes are being scanned, and use this information to optimize your queries for better performance.
Mistake #3: Creating too many temporary tables
Temporary tables are a common tool in SQL for storing intermediate results during query execution. They are useful for breaking complex queries into smaller, more manageable parts and can significantly improve query performance in some cases. However, creating too many temporary tables can also have a negative impact on PostgreSQL performance.
When you create a temporary table, PostgreSQL must allocate memory to store the table and its associated data. If you create too many temporary tables, this can put a strain on the system’s memory resources and lead to slower query execution times.
To avoid creating too many temporary tables, it’s important to carefully consider whether they are necessary for your queries. In some cases, you may be able to rewrite your queries to avoid the need for temporary tables. For example, you may be able to use subqueries or CTEs instead of creating a temporary table.
If you do need to use temporary tables, it’s important to use them efficiently. This includes limiting the number of temporary tables you create and ensuring that you drop them when they are no longer needed. You should also optimize the design of your temporary tables to minimize their memory usage.
In general, it’s important to use temporary tables judiciously and to carefully consider their impact on PostgreSQL performance. By avoiding the creation of unnecessary temporary tables and optimizing their design, you can ensure that your queries execute as efficiently as possible.
Mistake #4: Performing sorting operations when not required
Another common mistake that can reduce PostgreSQL performance is performing sorting operations when they are not required. Sorting can be a resource-intensive operation, particularly for large datasets, and can significantly slow down query execution times.
Sorting is often required for certain types of queries, such as when you need to order query results by a specific column. However, in some cases, sorting may not be necessary. For example, you don’t need to use
ORDER BY clause in a view definition. See the following example.
You have created the following view to have users ordered by their reputation in an ascending order:
CREATE VIEW unix_users_view_ordered AS ( SELECT * FROM users ORDER BY reputation ASC );
And you want to get data from this view in a descending order:
EXPLAIN SELECT * FROM unix_elite_view_ordered ORDER BY reputation DESC;
The query plan would have:
Sort (cost=347541.30..348567.34 rows=410416 width=394) Sort Key: users.reputation DESC -> Gather Merge (cost=108586.36..156471.57 rows=410416 width=394) Workers Planned: 2 -> Sort (cost=107586.33..108099.35 rows=205208 width=394) Sort Key: users.reputation -> Parallel Seq Scan on users (cost=0.00..15126.08 rows=205208 width=394)
As you can see from this execution plan, the query first performs a Gather Merge operation to combine the results from the parallel workers and then performs a Sort operation to sort the combined result set.
However, if you omit the sorting operation in the view like the following view:
CREATE VIEW unix_users_view AS ( SELECT * FROM users );
and measure the execution plan this time with the same query above:
EXPLAIN SELECT * FROM unix_users_view ORDER BY reputation DESC;
you would see a query plan like the following:
Gather Merge (cost=108586.36..156471.57 rows=410416 width=394) Workers Planned: 2 -> Sort (cost=107586.33..108099.35 rows=205208 width=394) Sort Key: users.reputation DESC -> Parallel Seq Scan on users (cost=0.00..15126.08 rows=205208 width=394)
The query performs a Sort operation on each worker’s local result set and then combines the sorted result sets using Gather Merge.
The cost of the ordered view (
unix_users_view_ordered) query plan is higher, specifically the Sort node cost. In the first plan, the cost of the Sort node is 347541.30..348567.34, while in the second plan, the cost of the Sort node is 107586.33..108099.35. This is almost a 3 order of magnitude lower in cost. The lower cost indicates that the second plan is expected to perform better in terms of query execution time. However, it’s important to note that the cost is just an estimate and the actual performance can vary depending on many factors, such as the size of the data and the hardware configuration.
To avoid this unnecessary sorting mistake, it’s important to carefully consider whether sorting is necessary for your queries.
If you do need to perform sorting operations, it’s important to optimize your query for performance. This includes using indexes to speed up sorting, and avoiding unnecessary sorting by limiting the number of rows that need to be sorted.
Mistake #5: Using
DISTINCT keyword is a powerful tool in SQL that allows you to remove duplicate rows from the results of a query. However, using
DISTINCT can also have a negative impact on PostgreSQL performance, especially when it is used unnecessarily.
Take this simple example:
SELECT * FROM posts;
The previous query took 987ms on my machine. Compare it with this the following that uses
SELECT DISTINCT * FROM posts;
It took 2 minutes and 45 seconds!
That’s because when you use
DISTINCT, PostgreSQL must sort the result set to identify and remove duplicate rows. This sorting operation can be quite expensive, especially if the result set is large (like the
posts table which contains more than 500k rows) or if the query is complex.
To avoid using
DISTINCT unnecessarily, it’s important to carefully consider whether it is needed for your queries. In some cases, you may be able to rewrite your queries to avoid the need for
DISTINCT. For example, you may be able to use a
GROUP BY clause to group the results and eliminate duplicates.
Here’s an example use case to show how a
GROUP BY clause can be an alternative to
DISTINCT to eliminate duplicates. In the StackExchange data dump, the
unix database has a
users table with a column:
displayname. We want to retrieve a list of unique display names from the table. One way to do this is to use the
SELECT DISTINCT displayname FROM users; -- (took 2.3s)
However, this can be quite slow if the
users table is large or if the query is complex. An alternative approach is to use a
GROUP BY clause to group the results by email address:
SELECT displayname FROM users GROUP BY displayname; -- (took 1.9s)
This query produces the same result as the previous query, but it can be much faster since it eliminates duplicates by grouping the results. The
GROUP BY clause groups the results by display names, and since each group contains only one display name, the results are effectively deduplicated.
If you do need to use
DISTINCT, it’s important to use it efficiently. This includes limiting the number of columns you use
DISTINCT on, since using
DISTINCT on many columns can significantly increase the cost of the sorting operation. It’s also important to ensure that you are only using
DISTINCT when it is truly necessary.
Mistake #6: Sending big batches of data through network
When you are working with large datasets in PostgreSQL, it can be tempting to send large batches of data over the network to your client application. While this can be an efficient way to transfer data, it can also be a performance bottleneck if the network is slow or if the client application cannot process the data quickly enough.
To avoid this problem, it’s important to carefully consider the size of the batches you are sending over the network. If you are sending large batches of data, you may want to consider breaking them up into smaller, more manageable chunks.
One way to do this is to use pagination to retrieve data in smaller batches. For example, you might use the LIMIT and OFFSET clauses to retrieve data in batches of 100 rows at a time:
SELECT * FROM posthistory p LIMIT 100 OFFSET 0; SELECT * FROM posthistory p LIMIT 100 OFFSET 100; SELECT * FROM posthistory p LIMIT 100 OFFSET 200; ...
By using pagination to retrieve data in smaller batches, you can avoid sending large batches of data over the network, which can be a performance bottleneck. This can also make your client application more responsive and efficient, since it doesn’t need to process large batches of data all at once.
Mistake #7: Using complex queries when simpler ones would suffice
Using complex queries when simpler ones would suffice can lead to decreased performance in database systems. While it can be tempting to write more complex queries to perform multiple operations at once or to combine data from different tables, it is important to keep in mind that each added complexity can increase query time and resource usage. In some cases, a simpler query may provide the same results more efficiently.
Here are a few examples of cases where simpler queries may be more efficient than complex ones:
Unnecessary Joins: Joins can be costly in terms of processing time and memory usage. In cases where a join is not needed to obtain the desired result, using a simpler query without a join can lead to faster execution.
Unnecessary Subqueries: Similarly, subqueries can also be costly in terms of processing time and memory usage. In some cases, it may be possible to use a simpler query without a subquery to achieve the same result.
Redundant Operations: When writing complex queries, it is important to review the query to ensure that each operation is necessary. Redundant operations can add unnecessary processing time and memory usage, leading to slower execution.
Overuse of GROUP BY: GROUP BY is another operation that can be costly in terms of processing time and memory usage. Overuse of GROUP BY, or grouping by too many columns, can lead to slower execution. In some cases, a simpler query without a GROUP BY may provide the same result more efficiently.
In summary, it is important to keep in mind that simpler queries can often provide the same result more efficiently than more complex queries. When writing queries, it is important to carefully consider the necessary operations and to avoid unnecessary complexity.
In conclusion, when it comes to writing efficient SQL queries, it is important to avoid some common mistakes. These include not using indexes properly, exposing full scans for tables and indexes when writing queries, creating too many temporary tables, using
ORDER BY when not required, using
DISTINCT unnecessarily, sending big batches of data through the network, and using complex queries when simpler ones would suffice.
To avoid these mistakes, it is important to carefully analyze the data and the requirements of the query, and choose the most appropriate approach. This may involve using indexing, optimizing queries for performance, and minimizing data transfers. By taking these steps, it is possible to write SQL queries that are more efficient, faster, and less prone to errors.
This article was originally published on Datasentinel.