#1
Always order your JOINs from largest tables to smallest tables
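For example, in a hypothetical schema where events is a huge fact table and users a much smaller dimension table:
-- Hypothetical tables: events (billions of rows) joined before users (millions)
SELECT u.signup_channel, count(*) AS event_count
FROM events e                       -- largest table first
JOIN users u ON e.user_id = u.id    -- smaller table second
GROUP BY u.signup_channel;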
#2
Always "GROUP BY" by the attribute/column with the largest number of
unique entities/values
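For example, with a hypothetical events table where user_id has far more distinct values than country:
-- High-cardinality column first in the GROUP BY list
SELECT user_id, country, count(*) AS events
FROM events
GROUP BY user_id, country;  -- rather than GROUP BY country, user_id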
#3
Avoid subqueries in the WHERE clause
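For example, a membership subquery can usually be rewritten as a join (hypothetical tables):
-- Instead of:
--   SELECT * FROM orders WHERE user_id IN (SELECT id FROM users WHERE active)
SELECT o.*
FROM orders o
JOIN users u ON o.user_id = u.id
WHERE u.active;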
#4
Use MAX instead of RANK when you only need the top value
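For example, to find the latest partition of a hypothetical events table, a plain aggregate avoids ranking every row through a window function:
-- Instead of:
--   SELECT ds FROM (
--     SELECT ds, rank() OVER (ORDER BY ds DESC) AS r FROM events
--   ) WHERE r = 1
SELECT max(ds) AS latest_ds
FROM events;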
#5
Use ‘regexp_like’ to replace ‘LIKE’ clauses
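For example, several OR'ed LIKE clauses collapse into a single pattern (hypothetical visits table):
-- Instead of:
--   WHERE referrer LIKE '%google%' OR referrer LIKE '%bing%' OR referrer LIKE '%yahoo%'
SELECT *
FROM visits
WHERE regexp_like(referrer, 'google|bing|yahoo');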
#6
Use ‘regexp_extract’ to replace ‘Case-when Like’
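For example, deriving a traffic source from a hypothetical url column:
-- Instead of:
--   CASE WHEN url LIKE '%google%' THEN 'google'
--        WHEN url LIKE '%bing%' THEN 'bing' END
SELECT regexp_extract(url, 'google|bing') AS source
FROM visits;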
#7
Convert long list of IN clause into a temporary table
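For example, staging a long literal list into a table and joining against it (hypothetical names, list truncated for brevity):
-- Stage the IN-list values once...
CREATE TABLE tmp_user_ids AS
SELECT * FROM (VALUES (1), (2), (3)) AS t(user_id);

-- ...then join instead of scanning a huge IN clause
SELECT e.*
FROM events e
JOIN tmp_user_ids t ON e.user_id = t.user_id;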
#8
Other Tips
● Use approx_distinct() instead of count(distinct) for very large datasets
● Use approx_percentile(metric, 0.5) for median
● Avoid UNIONs where possible
● Use WITH statements vs. nested subqueries
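A minimal sketch combining a few of these tips, with hypothetical events columns:
-- Approximate aggregates trade a small error for large speedups
WITH daily AS (
    SELECT
        ds,
        approx_distinct(user_id) AS users,                    -- instead of count(DISTINCT user_id)
        approx_percentile(latency_ms, 0.5) AS median_latency  -- median
    FROM events
    GROUP BY ds
)
SELECT * FROM daily
ORDER BY ds;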
Every few months we get an alert from our database monitoring to warn us that we are running high on disk usage. Usually, we provision more storage and forget about it. This time, we thought it was an excellent opportunity to do some cleanups that would otherwise be much more challenging.
The Usual Suspects
Provisioning storage is something we do from time to time, but before we throw money at the problem we like to make sure we make good use of the storage we already have. To do that, we start with the usual suspects.
Unused Indexes
Unused indexes are double-edged swords; you create them to make things faster, but they end up taking space and slowing down inserts and updates. Unused indexes are the first thing we always check when we need to clear up storage.
To find unused indexes we use the following query:
SELECT
    relname,
    indexrelname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch,
    pg_size_pretty(pg_relation_size(indexrelname::regclass)) as size
FROM
    pg_stat_all_indexes
WHERE
    schemaname = 'public'
    AND indexrelname NOT LIKE 'pg_toast_%'
    AND idx_scan = 0
    AND idx_tup_read = 0
    AND idx_tup_fetch = 0
ORDER BY
    pg_relation_size(indexrelname::regclass) DESC;
To find the unused indexes you can actually drop, you usually have to go over the list one by one and make a decision. This can be time-consuming the first couple of times, but after you get rid of most unused indexes it becomes easier.
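Once you decide an index is safe to remove, it can be dropped without blocking concurrent reads and writes (index_name is a placeholder):
-- Drop the unused index without locking the table against queries
DROP INDEX CONCURRENTLY index_name;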
Index and Table Bloat
The next suspect is bloat. When you update rows in a table, PostgreSQL marks the old tuple as dead and writes the updated tuple into the next available space. This process creates what's called "bloat", which can cause tables to consume more space than they really need. Bloat also affects indexes, so to free up space, bloat is a good place to look.
Estimating bloat in tables and indexes is apparently not a simple task. After running the bloat-estimation queries, you will most likely find some bloat, so the next thing to do is clear up that space.
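The full estimation queries are too long to reproduce here, but as a rough first signal, the dead-tuple counters PostgreSQL already tracks hint at which tables are worth a closer look (a heuristic, not a precise bloat estimate):
-- Tables with many dead tuples are candidates for bloat investigation
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;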
Clearing bloat in indexes
To clear bloat in an index, you need to rebuild it. There are several ways to rebuild an index:
Re-create the index: if you drop the index and create it from scratch, it will be built in an optimal, compact way.
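A minimal sketch with placeholder names (note that a plain CREATE INDEX blocks writes to the table while it builds):
-- Placeholder names; the new index is written compactly, without bloat
DROP INDEX index_name;
CREATE INDEX index_name ON table_name (column_name);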
Rebuild the index: Instead of dropping and creating the index yourself, PostgreSQL provides a way to re-build an existing index in place using the REINDEX command:
REINDEX INDEX index_name;
Rebuild the index concurrently: The previous methods will obtain a lock on the table and prevent it from being changed while the operation is in progress, which is usually unacceptable. To rebuild the index without locking it for updates, you can rebuild the index concurrently.
REINDEX INDEX CONCURRENTLY index_name;
When using REINDEX CONCURRENTLY, PostgreSQL creates a new index with a name suffixed with _ccnew and syncs any changes made to the table in the meantime. When the rebuild is done, it swaps the new index in for the old one and drops the old index.
If for some reason you had to stop the rebuild in the middle, the new index will not be dropped. Instead, it will be left in an invalid state and consume space. To identify invalid indexes that were created during REINDEX, we use the following query:
-- Identify invalid indexes that were created during index rebuild
SELECT
    c.relname as index_name,
    pg_size_pretty(pg_relation_size(c.oid))
FROM
    pg_index i
    JOIN pg_class c ON i.indexrelid = c.oid
WHERE
    -- New index built using REINDEX CONCURRENTLY
    c.relname LIKE '%_ccnew'
    -- In INVALID state
    AND NOT indisvalid
LIMIT 10;
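Once found, a leftover invalid index can either be rebuilt again or simply dropped to reclaim its space (placeholder name):
-- Remove the leftover of an interrupted REINDEX CONCURRENTLY
DROP INDEX CONCURRENTLY index_name_ccnew;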
Utilizing Partial Indexes
To find suitable candidates for partial indexes, we wrote a query that searches for indexes on fields with a high null_frac, the fraction of the column's values that PostgreSQL estimates to be NULL:
SELECT
    c.oid,
    c.relname AS index,
    pg_size_pretty(pg_relation_size(c.oid)) AS index_size,
    i.indisunique AS unique,
    a.attname AS indexed_column,
    CASE s.null_frac
        WHEN 0 THEN ''
        ELSE to_char(s.null_frac * 100, '999.00%')
    END AS null_frac,
    pg_size_pretty((pg_relation_size(c.oid) * s.null_frac)::bigint) AS expected_saving
    -- Uncomment to include the index definition
    --, ixs.indexdef

FROM
    pg_class c
    JOIN pg_index i ON i.indexrelid = c.oid
    JOIN pg_attribute a ON a.attrelid = c.oid
    JOIN pg_class c_table ON c_table.oid = i.indrelid
    JOIN pg_indexes ixs ON c.relname = ixs.indexname
    LEFT JOIN pg_stats s ON s.tablename = c_table.relname AND a.attname = s.attname

WHERE
    -- Primary key cannot be partial
    NOT i.indisprimary

    -- Exclude already partial indexes
    AND i.indpred IS NULL

    -- Exclude composite indexes
    AND array_length(i.indkey, 1) = 1

    -- Larger than 10MB
    AND pg_relation_size(c.oid) > 10 * 1024 ^ 2

ORDER BY
    pg_relation_size(c.oid) * s.null_frac DESC;
Is it always beneficial to exclude nulls from indexes?
No. NULL is as meaningful as any other value. If your queries search for NULL values using IS NULL, these queries might benefit from an index on NULL values.
So is this method beneficial only for null values?
Using partial indexes to exclude values that are queried rarely or not at all can be beneficial for any value, not just NULLs. NULL usually indicates a lack of value, and in our case, not many queries were searching for NULL values, so it made sense to exclude them from the index.
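As a sketch of the resulting change, with hypothetical table and column names, an index on a mostly-NULL column can be rebuilt as a partial index that skips the NULLs:
-- Hypothetical names: index only the rows that actually carry a value
DROP INDEX CONCURRENTLY orders_cancellation_reason_ix;
CREATE INDEX CONCURRENTLY orders_cancellation_reason_ix
    ON orders (cancellation_reason)
    WHERE cancellation_reason IS NOT NULL;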
Conclusion
Optimizing disks, storage parameters, and configuration can only affect performance so much. At some point, to squeeze out that final drop of performance, you need to make changes to the underlying objects. In this case, it was the index definitions.
To sum up the process we took to clear as much storage as we could:
● Remove unused indexes
● Rebuild indexes to clear bloat
● Utilize partial indexes to index only what's necessary
Hopefully, after applying these techniques you can gain a few more days before you need to reach into your pocket and provision more storage.