Kuba's Journal - Postgress progress

I think I’ve finally made some good progress with my Postgres database. Things are working pretty well now, and I still have a few fixes in the queue. It’s not all finished yet, but I feel like I’m coming out of the woods now.

For those out of the loop: I have a server which streams and saves all posts from the Bluesky firehose, and uses this data to run several feeds on Bluesky used by 1k+ people, plus generates daily post statistics, the directory of custom handles and things like that.

This all has been running on SQLite so far (200+ GB file now), and it has served me well far longer than I would ever have expected, but in the end I’m hitting its limits. The main problem is lack of support for concurrent writes, which makes it unrealistic to e.g. split the work into multiple workers, which I’m gonna have to do sooner or later.

I’ve only really used MySQL before on servers, but everyone was telling me I need to use Postgres, so I did what any normal person would have done (right?): I’ve set up two test servers, one on MySQL and one on Postgres, running from two modified branches of the code, and I made the SQLite-based production server proxy some requests to one or the other in order to test them on real traffic. A sort of database A/B testing 🤪

It has been working like this for a few months now, and I’ve been slowly tweaking things, mostly on the Postgres side. Since I got it to work at all, they’ve both been working ok, but MySQL has generally been doing more write I/O doing the same things, and Postgres has been doing comparably more read I/O.

The key part are the Replies feeds (Follows & Replies and Only Replies, it’s basically the same code). These two are kind of variants of the built-in Following feed, showing the connecting user their own chronological timeline, generated live, filtered in some way. These feeds work by fetching the requesting user’s follows list, either from the AppView or from a local cache, and then making a query that’s basically: “give me the most recent 100 posts from any of these 50/200/1000/5000 users”.

There are two relevant indexes in that table: one on (repo, time) (repo = user’s DID), and one on just (time). Roughly speaking, for those users who follow e.g. 80 or 200 people, it makes more sense to scan the (repo, time) index those 80-200 times and collect the 100 most recent posts from all of those found, and for those who follow e.g. 9000 (yes, that happens 😛), it’s faster to scan the single (time) index until you find 100 relevant posts. But I’ve been struggling to make Postgres always use the right index.

At first it tried to use the (time) index way too much. I tried to find a way to force it to use a specific index, but I was told it’s not possible (it might be though!). With the help of some people on Bluesky, I managed to rewrite the query to use the other index, but then it worked bad for those people with a lot of follows. I interpreted the EXPLAINs as the query planner not having a good enough info about the structure of the data (it estimated n_distinct, the total number of distinct values in a column, as much lower than the actual number of users in the table), so I started messing with the “statistics” settings. I ended up bumping up the statistics target to the maximum 10,000 (making ANALYZE take really long) and hardcoded n_distinct to 4.5M, which was closer to the total count. I also increased shared_buffers to 1 GB. This improved things somewhat, to the “not great, not terrible” level, but I was still not happy with the performance.

But eventually, I was able to improve things a lot over the last few weeks:

First, I was re-reading one of the blog posts about the differences in how MySQL and Postgres organize data internally, and I had a bit of an epiphany: unlike MySQL, Postgres doesn’t store the primary key (id) in secondary indexes, since they reference the physical location of the data instead. In MySQL I had a version of the query which first did a subquery that collected just post IDs, and then used the final 100 IDs to fetch posts, but this couldn’t work here if the indexes don’t include id! This was something unexpected for me. So I replaced the larger index with (repo, time DESC, id) and rewrote the query again with a subquery, and the EXPLAIN showed “Index Only Scan” now, meaning it could use just the index for the subquery.
Next, the Index Only Scan was showing something like: “Heap Fetches: 31362”. This apparently means that even though it tries to only use the index in that step, it still has to go fetch 30k post rows from the actual table. This is because as records are updated or deleted (in my case only really deleted), Postgres just marks them as deleted, but doesn’t free up the space or update the index until a VACUUM is done. A separate structure called “visibility map” tracks which table pages have some outdated records in them, and if an index entry points to such page, it has to go check if that record is still there. So I started tracking the state of this visibility map, and set up a cron job to do the VACUUM twice a day to keep it relatively “clean”. (I wish it was as easy to do with my apartment…)
Now, it was working great for people with short lists, but still tried to use the (repo, time, id) index for most very long lists. The EXPLAIN showed that it was vastly overestimating the cost of using (time). I came to the conclusion that… the hardcoded n_distinct was actually working against it now, even though it was technically more precise. In my practical use cases, what actually matters is not how many different users have at least one post anywhere in the table, but how many are actively posting - whose posts you can find in the last few hours or a day. And that’s much less than 4.5M. So I cleared the hardcoded n_distinct and let it find its own estimate, and reset the statistics target to the default 100 - and this improved things quite a bit! It was now almost perfectly using index_posts_on_repo_time_id to around ~1200 followed accounts, and index_posts_on_time above that.
Not related to the query/Postgres directly, but I also added a “prefetching” task to the follows list cache: it runs every 15 minutes and checks which follows lists are approaching the point where the cached data becomes stale (but was accessed recently) and reloads the list from the AppView in advance. This saves a couple to several seconds of time every time the list had to be refreshed before returning the feed, so it made quite a lot of difference.
One last thing I need to fix is that for long lists, significant time is spent on query planning (e.g. 1724 ms planning time and then 12 ms execution time). I’m gonna try to restructure the query to pass the list of user IDs differently somehow and see if that helps, or perhaps I could move the cache to a new database table instead of keeping it in .json files.
And finally, I have some more config tweaking to do - e.g. it seems shared_buffers could be bumped up to at least 2 GB here (with 8 GB total RAM) - but I wanted to first make sure it’s doing the right things before I give it more space to cache it.

I’m really looking forward to when I can finally set up a production instance with Postgres and move everything there and drop the other branches - but I feel like this moment is getting close 😅