Firehose go brrr

A couple of months ago, I wrote about the tests I’d done on how to speed up my code for processing the Bluesky firehose using my Skyfall Ruby gem. The limits I reached then were around:

  • 4k events/s with the full post processing that it normally does (saving all posts, matching posts to feeds etc.)
  • 5k events/s if using Jetstream – but I noted that it seemed suspiciously like that was a fixed rate limit (which others have confirmed in the comments)
  • 6k events/s just reading packets from the firehose, without processing

Since then, I’ve done two more things. The first one was that I ran Jetstream locally for testing and configured it to have a much higher rate limit (the --max-sub-rate option). I’ve confirmed that indeed, with the rate limit not getting in the way, Skyfall using Jetstream can go much faster on the same server – up to about 10-12k events/s doing full processing.

The second thing is that I started doing some profiling to find out where else I could save some processing time, and in the process, I managed to massively speed up the underlying faye-websocket library 🙃

The Faye speedup fix

So, I was playing with ruby-prof to find where else I could shave off a few microseconds. I ran the scan on the version with my processing turned off, expecting the remaining work to be mostly in some boring internals of Faye or Ruby core libs – reading and writing bytes from the socket, adding them together and waiting.

And I found something… quite interesting: the majority of the time was spent in two places:

  1. Some code deep inside Faye’s helper library websocket-driver, which takes a filled string buffer and converts it to a byte array to be dispatched to a handler:
           4.519    130695/130695   WebSocket::Driver::Hybi#emit_message
13.83%     4.519           130695   String#bytes
  2. And my code in Skyfall, which takes a byte array received from Faye and converts it to a binary string:
           0.000         1/130696   WebSocket::HTTP::Response#body
          12.549    130695/130696   Skyfall::Firehose#handle_message
38.42%    12.549           130696   Array#pack

… wait a minute… 🤔🤔💡

Yes, for a binary websocket (which is used here), Faye prepares the received data in a binary String, but then sends it out as an Array of bytes:

def emit_message
  message  = @extensions.process_incoming_message(@message)
  @message = nil

  payload = message.data

  case message.opcode
    when OPCODES[:text] then
      payload = Driver.encode(payload, Encoding::UTF_8)
      payload = nil unless payload.valid_encoding?
    when OPCODES[:binary]
      payload = payload.bytes.to_a    # <===
  end

  # ... (rest of the method omitted)
end

And since I want a binary string at the end, to pass it to the CBOR library for decoding, I need to take that byte array and convert it back into a string just like the one we had before:

def handle_message(msg)
  data = msg.data.pack('C*')    # <===
  @handlers[:raw_message]&.call(data)
  # ...
end

So could we just… not do that? 🫠
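A quick sanity check in plain Ruby shows that the two conversions cancel out exactly (the payload below is arbitrary test data, not a real firehose frame):

```ruby
require 'benchmark'

# arbitrary binary payload standing in for a CBOR frame from the socket
data = ((0..255).map(&:chr).join * 100).b

bytes = data.bytes        # what websocket-driver did: String -> Array
back  = bytes.pack('C*')  # what Skyfall then did: Array -> String

back == data              # => true – the round trip is an expensive no-op

# both steps allocate a ~25k-element array / a new string on every message:
puts Benchmark.measure { 1_000.times { data.bytes.pack('C*') } }
```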

Turns out, yes, although not without some hacking, since the library didn’t have an option to emit a string instead for binary websocket messages.

I made some monkey-patches to Faye & websocket-driver first, and eventually turned them into a pull request, which I submitted to the author – adding a :binary_data_format option to the Faye::WebSocket::Client initializer, where you can ask to have the data returned as a string instead, defaulting to the original method of returning a byte array. The author actually said that he thinks it makes sense to change the default to a binary string (while adding an option to revert it), and version 0.12.0 was released last month with this changed behavior. (To use this in Skyfall, you need the updated version 0.6, which I’m going to release soon.)
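On the Skyfall side, this means the `pack` call can be skipped when the payload already arrives as a string. Here’s a minimal sketch of handling both formats – my illustration, not Skyfall’s actual code; `FakeMessage` and `payload_string` are hypothetical stand-ins:

```ruby
# stand-in for the event object Faye passes to the :message handler
FakeMessage = Struct.new(:data)

def payload_string(msg)
  # faye-websocket >= 0.12.0 delivers binary frames as a String already;
  # older versions deliver an Array of bytes that needs packing
  msg.data.is_a?(String) ? msg.data : msg.data.pack('C*')
end

payload_string(FakeMessage.new("\x01\xFF".b))   # => "\x01\xFF"
payload_string(FakeMessage.new([1, 255]))       # => "\x01\xFF"
```

Either way, the handler ends up with a binary string ready to hand to the CBOR decoder.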

New benchmarks

I’ve run the benchmarks again, and the results look very encouraging. This is for the CBOR firehose, with and without the fix, and I also rechecked the async-websocket library for comparison:

Library                 With processing   Processing + Rust lib   Just reading
Faye Firehose (old)     3,200-3,350       3,300-3,450             6,800-6,900
Faye Firehose (+ fix)   5,300-5,600       6,300-6,800             33,000-34,500
Async Firehose          5,700-5,900       5,700-5,900 (?)         5,800-5,950 (??)

(All numbers in events/s.)

And this is for Jetstream without a rate limit (the Faye fix doesn’t affect Jetstream, because that stream is text-based, not binary, so the problematic code path wasn’t used):

Library           With processing   Processing + Rust lib   Just reading
Faye Jetstream    9,400-9,600       11,000-12,000           180,000-200,000
Async Jetstream   10,000-11,000     12,000-15,000           250,000-280,000
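As an aside, the reason the fix can’t help Jetstream is that its frames go through the text branch of emit_message above: they’re JSON strings, not binary data. A simplified illustration – the frame below is a made-up, heavily abbreviated event; real Jetstream events contain more fields:

```ruby
require 'json'

# a made-up, abbreviated Jetstream-style event frame (text, not binary)
frame = '{"did":"did:plc:abc123","kind":"commit"}'

# text frames arrive as a UTF-8 String, so there's no bytes/pack round trip
event = JSON.parse(frame)
event["kind"]   # => "commit"
```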

As you can see:

  • the fix gives me basically a 50% speedup for free in the existing live code
  • with the Rust module for regexp matching turned on, when a higher portion of the whole time is spent inside Faye + Skyfall, that speedup becomes more like 100%
  • and when you skip all processing, and almost all of the time is spent inside Faye + Skyfall, the speedup gets as high as 4-5x (!)
  • this gives me a possible ceiling of as much as 33-34k events/s if I manage to further optimize or parallelize the event processing parts (parsing CBOR/CAR, feed matching, building models, calling Postgres etc.)
  • Jetstream mode, without the rate limit, can be around 2x faster than the Firehose version in practice
  • with the data saving part optimized further, we could possibly go into even 6-digit numbers with Jetstream, but at this point it’s kind of theoretical, because I would likely run into many other bottlenecks before I get there (disk speed, VPS bandwidth limits etc.)
  • the Async library can be a bit faster than the EventMachine based one, but not dramatically so

For some unknown reason, I wasn’t able to make the Async version go faster than 6k evt/s on the CBOR (binary) stream (while it was using much less than 100% CPU), even though it worked fine with Jetstream. I’m not sure why – maybe it was some issue on my side, but I don’t really want to spend time digging into this. EM/Faye works fine (especially now), is very battle-tested (even if not updated much anymore), works with older Rubies, and switching would be a big API change for apparently not that much gain. So I think I’m going to keep it as is and maybe reconsider for Skyfall 2.0 one day…

Overall, I think all of this gives me more than enough space to not worry about this again until Bluesky becomes much bigger :) And it looks like I’m not even going to need any parallel workers + Redis queue setup anytime soon.

Kuba Suder @mackuba