
PostgreSQL hash indexes

Indexing that is trivial to create has been the killer-app feature of relational database systems for as long as they have been around. Almost everybody who touches a database writes something like create index on table users during their first course or tutorial on how databases work. Magic! One simple line and you have access to an extremely powerful data structure that I’d guess most folks would not be thrilled to have to implement themselves for each application they build.

The B-Tree is by far the most important data structure for application engineers to know in a traditional RDBMS. I’m going to nominate it for “Greatest of All Time”, given how well it performs across a wide variety of tasks. It builds pretty quickly, has a reasonable size, works well both on disk and in memory, and we can run a ton of different types of queries against it efficiently. Even SQL skeptics like Mongo trust the B-Tree to efficiently search their data. Sure, sometimes you can eke out more performance by using a different indexing strategy, but you’re rarely wrong to start with the B-Tree and see how it does.

What’s this hash index anyway?

Well, this is one of those times we’re going to leave the comfort of old-reliable and try out an alternative data structure that could potentially lay the smackdown on the B-Tree. Fortunately, this data structure isn’t super esoteric for most people! The subject of this post is the Hash index, and you’ve already used something like it in your favorite programming language if it contains a data structure called any of: dictionary, hash map, hash table, hash array, or a map.

The idea behind a Hash index is to use a specialized data structure that can do single-value lookups in constant time, or O(1) for you big-oh nerds. So, yes, this data structure is only useful when your query looks exactly like select * from table where my_column = exactly_this_value. If you will ever need ranges, greater or less than, LIKE, or even not-equal-to, this is not the hot new data structure for you. That’s a bummer, but where I work I often end up querying large tables with columns where the only thing I want is exact equality. Even better, those values also occasionally end up being close to or completely unique! These are the exact conditions where PostgreSQL’s hash index thrives. I’ve put together a simple demonstration below to show where you could consider implementing one.
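To make the equality-only rule concrete, here’s a rough sketch of which query shapes a hash index can and cannot serve (my_table and my_column are placeholders, not part of the demo below):

-- eligible: a plain equality check on the indexed column
select * from my_table where my_column = 'some_value';

-- not eligible: ranges, pattern matching, and ordering all need something else
select * from my_table where my_column > 'some_value';
select * from my_table where my_column like 'some_%';
select * from my_table order by my_column;

Only the first query can use a hash index; the rest fall back to a B-Tree or a sequential scan.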

Show me some code

Let’s create a dummy PostgreSQL table for our new application’s users. They each have a name and a source, the source being the medium we used to recruit them to our service.

create table users (
    id serial,
    name text not null,
    source text not null
);

Now that we have our schema, let’s generate some random users who just love our product:

insert into
    users (name, source)
select
    md5(random()::text),
    (ARRAY['newspaper', 'magazine', 'online', 'billboard'])[trunc(random() * 4) + 1]
from
    generate_series(1, 1000000);

> 1000000 rows affected in 3 s 748 ms

Neat! One million users - we’re quite the high-growth startup. I want to look up my users by their username, which I have not guaranteed to be unique. The first thing I’ll do is test out our trusty B-Tree at this task. Let’s create the index:

create index
    users_name_btree_index
on
    users
using 
    btree (name);

> completed in 2 s 64 ms

With that done, we can run a query for a user I know is in the database:

explain (analyze, timing off, costs off)
select
    id,
    name,
    source
from
    users
where
    name = 'e02f855cc00604b4f35337cf1b62ae2e';
> Index Scan using users_name_btree_index on users (actual rows=0 loops=1)
>   Index Cond: (name = 'e02f855cc00604b4f35337cf1b62ae2e'::text)
> Planning Time: 0.068 ms
> Execution Time: 0.041 ms

Four hundredths of a millisecond of execution time is great! We searched through 1,000,000 records for the one we wanted without even requiring a page indirection. Now, let’s see how the hash index fares.

create index
    users_name_hash_index
on
    users
using 
    hash (name);

> completed in 1 s 559 ms

It built about 25% faster than the B-Tree, neat! Running the same select query from above gives us the following plan:

> Index Scan using users_name_hash_index on users (actual rows=0 loops=1)
>   Index Cond: (name = 'e02f855cc00604b4f35337cf1b62ae2e'::text)
> Planning Time: 0.064 ms
> Execution Time: 0.024 ms

Whoa, check it out. Almost half the time compared to the B-Tree. I ran this query one hundred times and took the mean, but I will say that the standard deviation was about 6 microseconds and grew the more data was missing from shared buffers, so there is definitely some variation here I’m not capturing. Still, the hash index can clearly produce respectable results in situations well-suited to its properties - unique or mostly unique columns, and strict equality lookups. If your B-Tree is huge, you may notice an even larger gap between the two, as the hash index doesn’t need to descend through intermediate index pages.
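If you want to eyeball that variation yourself without leaving psql, here’s a rough sketch of a repeated-measurement loop. It times the lookup from inside PL/pgSQL, so the numbers include some procedural overhead and won’t line up exactly with EXPLAIN ANALYZE - treat it as a ballpark check rather than how the figures above were produced.

do $$
declare
    started timestamptz;
    total   interval := interval '0';
begin
    for i in 1..100 loop
        started := clock_timestamp();
        -- same lookup as above; perform runs the query and discards the rows
        perform id, name, source
        from users
        where name = 'e02f855cc00604b4f35337cf1b62ae2e';
        total := total + (clock_timestamp() - started);
    end loop;
    raise notice 'mean lookup time over 100 runs: %', total / 100;
end
$$;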

With that done, let’s check out how queries on our source field look.

create index
    users_source_btree_index
on
    users
using 
    btree (source);

> completed in 698 ms

Look at that creation speed - those clustered values really help the B-Tree here.

explain (analyze, timing off, costs off)
select
    id,
    name,
    source
from
    users
where
    source = 'newspaper';
> Index Scan using users_source_btree_index on users (actual rows=249163 loops=1)
>   Index Cond: (source = 'newspaper'::text)
> Planning Time: 0.067 ms
> Execution Time: 39.936 ms

This was a much heavier query than before, but the B-Tree can look up almost 250,000 rows out of 1,000,000 in a reasonable 39 ms. And now for the hash index…

create index
    users_source_hash_index
on
    users
using 
    hash (source);

> completed in 1 m 13 s 264 ms

Um, what? Did I copy that into this post correctly? Yep, the hash index really struggles with clustered values like this. Heavily clustered columns will destroy index creation and update times, and I highly recommend avoiding hash indexes on them for large tables. Running the same select as above yields:

> Index Scan using users_source_hash_index on users (actual rows=249163 loops=1)
>   Index Cond: (source = 'newspaper'::text)
> Planning Time: 0.063 ms
> Execution Time: 50.953 ms

All of that effort and we don’t even get better read performance! With a match on roughly 25% of the table, the hash index is about 25% slower than the B-Tree. For situations where your data is pretty heavily clustered, the hash index is not buying you much and you’re better served choosing the B-Tree or another of PostgreSQL’s fantastic indexes.
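If you’re curious how these indexes compare on disk, you can check their sizes with pg_relation_size - a quick sketch, assuming the four index names created above:

-- report the on-disk size of every index built in this post
select
    relname,
    pg_size_pretty(pg_relation_size(oid)) as size
from
    pg_class
where
    relname in (
        'users_name_btree_index',
        'users_name_hash_index',
        'users_source_btree_index',
        'users_source_hash_index'
    )
order by
    pg_relation_size(oid);

I’m not reproducing the numbers here since they’ll vary from build to build, but it’s a cheap way to see whether an index is earning its keep.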

So, should I use this thing?

This hasn’t been an exhaustive exploration of the hash index, but I hope it has given an easy-to-understand overview of where it may be useful, along with a few rough performance benchmarks. The tl;dr here is as follows:

  • Consider a hash index if you only ever need strict equality lookups and your column consists of mostly unique values
  • When values are heavily clustered, the hash index provides slightly worse reads than a B-Tree and significantly worse index writes

There are hundreds of parameters that go into tuning a database system for optimal performance, so don’t be surprised if you get conflicting results in your own situation - always measure micro-optimizations like this in context before applying them blindly. I encourage you to page through the official PostgreSQL documentation on indexes for a fuller picture of the limitations than I have covered here.

Test configuration

I ran these tests on my personal desktop with a Ryzen 2700X CPU, a Samsung 860 EVO SSD, and 32 GB of RAM at 2333 MHz, using PostgreSQL 12 on Fedora Linux 31 with kernel 5.5.10-200. There’s no guarantee that you’ll see the same values that I do - for the execution times above, I ran every query 100 times and took the mean.