Tuesday, 31 January 2023

Optimizing SQL Performance: Expert Tips & Tricks


#1
Always order your JOINs from largest tables to smallest tables
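On engines without cost-based join reordering (this tip fits Presto/Hive-style engines; table names below are illustrative), the hash tables are built on the smaller, later-listed sides, so the largest table goes first:

```sql
-- Largest fact table first, smaller dimension tables after it,
-- so the engine builds its hash tables on the small sides.
SELECT c.region, COUNT(*) AS views
FROM page_views pv                          -- billions of rows
JOIN users u ON u.id = pv.user_id           -- millions of rows
JOIN countries c ON c.code = u.country_code -- hundreds of rows
GROUP BY c.region;
```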

#2
Always GROUP BY the attribute/column with the largest number of unique entities/values first
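As a sketch with an illustrative schema, this means listing the highest-cardinality column first in the GROUP BY clause:

```sql
-- user_id has far more unique values than country,
-- so it goes first in the GROUP BY list.
SELECT user_id, country, COUNT(*) AS events
FROM activity_log
GROUP BY user_id, country;
```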

#3
Avoid subqueries in the WHERE clause
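A common rewrite, sketched with hypothetical table names, replaces an IN subquery in the WHERE clause with a join:

```sql
-- Instead of:
--   SELECT * FROM orders
--   WHERE user_id IN (SELECT id FROM users WHERE plan = 'pro');
-- rewrite the subquery as a join:
SELECT o.*
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE u.plan = 'pro';
```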

#4
Use MAX instead of RANK
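When you only need the top value per group, a plain aggregate avoids the window-function sort entirely. A sketch with hypothetical tables:

```sql
-- Instead of RANK() OVER (PARTITION BY user_id ORDER BY created_at DESC)
-- and then filtering on rank = 1, aggregate directly:
SELECT user_id, MAX(created_at) AS last_login
FROM logins
GROUP BY user_id;
```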

#5
Use ‘regexp_like’ to replace ‘LIKE’ clauses
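regexp_like (a Presto/Trino function; names below are illustrative) collapses a chain of OR'ed LIKE conditions into a single scan of the string:

```sql
-- Instead of:
--   WHERE browser LIKE '%Chrome%'
--      OR browser LIKE '%Firefox%'
--      OR browser LIKE '%Safari%'
SELECT *
FROM sessions
WHERE regexp_like(browser, 'Chrome|Firefox|Safari');
```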

#6
Use ‘regexp_extract’ to replace ‘Case-when Like’
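Similarly, regexp_extract can replace a CASE ... WHEN ... LIKE ladder that pulls a token out of a string (illustrative example, Presto/Trino syntax):

```sql
-- Instead of:
--   CASE WHEN url LIKE '%utm_source=google%' THEN 'google'
--        WHEN url LIKE '%utm_source=bing%'   THEN 'bing'
--        ... END
SELECT regexp_extract(url, 'utm_source=([a-z]+)', 1) AS utm_source
FROM visits;
```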


#7
Convert long list of IN clause into a temporary table
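A sketch of the idea (table names are illustrative; depending on the engine, the staging table may be a temporary table or a regular table you drop afterwards):

```sql
-- Load the long list of IDs into a table once...
CREATE TABLE tmp_target_ids AS
SELECT id FROM (VALUES (101), (102), (103)) AS t(id); -- the full list goes here

-- ...then join against it instead of using a huge IN (...) clause.
SELECT o.*
FROM orders o
JOIN tmp_target_ids t ON t.id = o.id;
```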



#8 Other Tips

● Use approx_distinct() instead of count(distinct) for very large datasets
● Use approx_percentile(metric, 0.5) for median
● Avoid UNIONs where possible
● Use WITH statements vs. nested subqueries
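The approximate aggregates above are Presto/Trino functions; a combined sketch on a hypothetical requests table:

```sql
SELECT
    approx_distinct(user_id)           AS unique_users,   -- instead of COUNT(DISTINCT user_id)
    approx_percentile(latency_ms, 0.5) AS median_latency  -- approximate median
FROM requests;
```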

Friday, 27 January 2023

Reclaiming Disk Space: Efficient Techniques for Freeing Storage without Dropping Indexes or Deleting Data

Every few months we get an alert from our database monitoring warning us that we are running high on disk usage. Usually, we provision more storage and forget about it, but this time we thought it was an excellent opportunity to do some cleanups that would otherwise be much more challenging.

The Usual Suspects

Provisioning storage is something we do from time to time, but before we throw money at the problem we like to make sure we make good use of the storage we already have. To do that, we start with the usual suspects.

Unused Indexes

Indexes are a double-edged sword: you create them to make things faster, but they take up space and slow down inserts and updates. Unused indexes are the first thing we check when we need to clear up storage.

To find unused indexes we use the following query:

SELECT
    relname,
    indexrelname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch,
    pg_size_pretty(pg_relation_size(indexrelname::regclass)) AS size
FROM
    pg_stat_all_indexes
WHERE
    schemaname = 'public'
    AND indexrelname NOT LIKE 'pg_toast_%'
    AND idx_scan = 0
    AND idx_tup_read = 0
    AND idx_tup_fetch = 0
ORDER BY
    pg_relation_size(indexrelname::regclass) DESC;

To find the unused indexes you can actually drop, you usually have to go over the list one by one and make a decision. This can be time-consuming the first couple of times, but after you get rid of most unused indexes it becomes easier.

Index and Table Bloat

The next suspect is bloat. When you update rows in a table, PostgreSQL marks the tuple as dead and adds the updated tuple in the next available space. This process creates what's called "bloat", which can cause tables to consume more space than they really need. Bloat also affects indexes, so to free up space, bloat is a good place to look.

Estimating bloat in tables and indexes is apparently not a simple task. After running bloat estimation queries, you will most likely find some bloat, so the next thing to do is clear up that space.

Clearing bloat in indexes

To clear bloat in an index, you need to rebuild it. There are several ways to rebuild an index:

  1. Re-create the index: if you drop the index and create it again, it will be built in an optimal way, without bloat.

  2. Rebuild the index: Instead of dropping and creating the index yourself, PostgreSQL provides a way to re-build an existing index in place using the REINDEX command:

REINDEX INDEX index_name;


  3. Rebuild the index concurrently: The previous methods will obtain a lock on the table and prevent it from being changed while the operation is in progress, which is usually unacceptable. To rebuild the index without locking it for updates, you can rebuild the index concurrently.

REINDEX INDEX CONCURRENTLY index_name;

When using REINDEX CONCURRENTLY, PostgreSQL creates a new index with a name suffixed with _ccnew and syncs any changes made to the table in the meantime. When the rebuild is done, it will switch the old index with the new index, and drop the old one.


If for some reason you had to stop the rebuild in the middle, the new index will not be dropped. Instead, it will be left in an invalid state and consume space. To identify invalid indexes that were created during REINDEX, we use the following query:


-- Identify invalid indexes that were created during index rebuild
SELECT
    c.relname AS index_name,
    pg_size_pretty(pg_relation_size(c.oid))
FROM
    pg_index i
    JOIN pg_class c ON i.indexrelid = c.oid
WHERE
    -- New index built using REINDEX CONCURRENTLY
    c.relname LIKE '%_ccnew'
    -- In INVALID state
    AND NOT indisvalid
LIMIT 10;
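Once such leftovers are identified, dropping them reclaims the space; the index name here is hypothetical:

```sql
-- Drop an invalid _ccnew index left behind by an interrupted rebuild,
-- without taking a lock that blocks writes to the table
DROP INDEX CONCURRENTLY users_email_ix_ccnew;
```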


Utilizing Partial Indexes

A partial index covers only the rows that satisfy a predicate, so excluding values that are rarely queried can save space. To find suitable candidates for partial indexes, we wrote a query to search for indexes on fields with a high null_frac, the fraction of the column's values that PostgreSQL estimates are NULL:


SELECT
    c.oid,
    c.relname AS index,
    pg_size_pretty(pg_relation_size(c.oid)) AS index_size,
    i.indisunique AS unique,
    a.attname AS indexed_column,
    CASE s.null_frac
        WHEN 0 THEN ''
        ELSE to_char(s.null_frac * 100, '999.00%')
    END AS null_frac,
    pg_size_pretty((pg_relation_size(c.oid) * s.null_frac)::bigint) AS expected_saving
    -- Uncomment to include the index definition
    --, ixs.indexdef

FROM
    pg_class c
    JOIN pg_index i ON i.indexrelid = c.oid
    JOIN pg_attribute a ON a.attrelid = c.oid
    JOIN pg_class c_table ON c_table.oid = i.indrelid
    JOIN pg_indexes ixs ON c.relname = ixs.indexname
    LEFT JOIN pg_stats s ON s.tablename = c_table.relname AND a.attname = s.attname

WHERE
    -- Primary key cannot be partial
    NOT i.indisprimary

    -- Exclude already partial indexes
    AND i.indpred IS NULL

    -- Exclude composite indexes
    AND array_length(i.indkey, 1) = 1

    -- Larger than 10MB
    AND pg_relation_size(c.oid) > 10 * 1024 ^ 2

ORDER BY
    pg_relation_size(c.oid) * s.null_frac DESC;

On our staging schema, the results of this query pointed us to the most promising candidates.


Is it always beneficial to exclude nulls from indexes?

No. NULL is as meaningful as any other value. If your queries search for null values using IS NULL, these queries might benefit from an index that includes NULLs.


So is this method beneficial only for null values?

Using partial indexes to exclude values that are not queried very often, or not at all, can be beneficial for any value, not just null values. NULL usually indicates the lack of a value, and in our case, not many queries were searching for null values, so it made sense to exclude them from the index.
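As a sketch, replacing a full index with a partial one that skips NULLs might look like this (table, column, and index names are hypothetical):

```sql
-- Index only the rows that actually have a value;
-- rows where cancellation_reason IS NULL are skipped entirely.
CREATE INDEX CONCURRENTLY orders_cancellation_reason_partial_ix
    ON orders (cancellation_reason)
    WHERE cancellation_reason IS NOT NULL;

-- Once the new index is in place, the old full index can go.
DROP INDEX CONCURRENTLY orders_cancellation_reason_ix;
```

Creating the new index with CONCURRENTLY before dropping the old one keeps the column indexed throughout and avoids blocking writes.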

Conclusion

Optimizing disks, storage parameters, and configuration can only affect performance so much. At some point, to squeeze out that final drop of performance, you need to make changes to the underlying objects. In our case, that meant the index definitions.

To sum up the process we took to clear as much storage as we could:

  • Remove unused indexes

  • Utilize partial indexes to index only what's necessary

Hopefully, after applying these techniques you can gain a few more days before you need to reach into your pocket and provision more storage.
