Heads Up: This post is quite old. Friends don't let friends make business decisions based on four-year-old blog posts.
We've been using Amazon SQS where I work for a while. We have a fairly heavy (though that's relative: we're a small company) cloud application that makes use of a bunch of the Amazon services (SimpleDB, S3, EC2) and, when we needed a message queue, SQS was just there for us. It was convenient, simple, and reasonably quick to code against.
Our application has now grown to a point where 'convenient' doesn't quite cut it anymore. Performance and cost are starting to seriously matter - and SQS is a pretty serious point of pain. So, after some searching for alternatives: RabbitMQ to the rescue.
Here's the core problem: with Amazon SQS, the message consumer is forced to poll the queue in order to determine whether or not a message is available. If a message is available, that's great, the consumer is happy and receives the message. If not, the consumer must idle for a while, and then poll again.
In our use case, we face relatively long periods in which queues are idle (no messages), followed by bursts of activity that are time-critical (well, OK, not critical, but time-sensitive). Worse, we have a chain of services with queues sitting in between them: one service forwards a message to another, which produces new work for other services, and so on. Obviously, the amount of time that a message spends on the queue waiting for a consumer to wake up and poll matters. The latencies add up when you have a chain of services, like we do. So, fine: poll like crazy, right? Constantly, and with a very low delay?
Well, no: SQS charges for every request made. 1 million requests = $1. Sounds OK, but multiply out the number of queues you're polling and the number of consumers doing the polling, and you can get some serious $$$ by polling like crazy on empty queues. 10 consumers polling once every 250ms for 1 month (31 days) = 107 million requests = $107. The solution that we had been relying on was to use a simple back-off algorithm: if a consumer is receiving messages, continue polling like crazy, but every time we don't receive a message from the queue, back off a bit - slow the polling down, up to a certain threshold (10 seconds for us). That's fine and all, but then you're introducing up to 10 seconds of latency at each point in the chain.
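To make the trade-off concrete, here is a rough sketch of the arithmetic and of the back-off idea. The function names, the doubling factor, the 250ms floor, and the three-hop chain are all assumptions made for illustration; the post only says that the delay grows "a bit" after each empty poll, up to a 10 second cap.

```python
SQS_PRICE_PER_MILLION = 1.00  # USD per million requests, the rate quoted above


def monthly_requests(consumers: int, poll_interval_s: float, days: int = 31) -> int:
    """Polls issued in one month when every consumer polls at a fixed interval."""
    seconds_in_month = days * 24 * 60 * 60
    return round(consumers * seconds_in_month / poll_interval_s)


def monthly_cost(requests: int) -> float:
    """Request charge in USD at the quoted price."""
    return requests / 1_000_000 * SQS_PRICE_PER_MILLION


def next_delay(current_s: float, got_message: bool,
               min_s: float = 0.25, max_s: float = 10.0,
               factor: float = 2.0) -> float:
    """Back off after an empty poll; return to fast polling after a hit.

    The doubling factor and the 0.25s floor are assumptions, not the
    values used in the application described in this post.
    """
    if got_message:
        return min_s
    return min(current_s * factor, max_s)


def worst_case_chain_latency(hops: int, max_delay_s: float = 10.0) -> float:
    """Upper bound on added latency if every consumer in the chain is fully backed off."""
    return hops * max_delay_s


requests = monthly_requests(consumers=10, poll_interval_s=0.25)
print(requests)                           # 107136000
print(f"${monthly_cost(requests):.2f}")   # $107.14

# An idle queue: the delay climbs from 0.25s to the 10s cap.
delay, schedule = 0.25, []
for _ in range(7):
    delay = next_delay(delay, got_message=False)
    schedule.append(delay)
print(schedule)                           # [0.5, 1.0, 2.0, 4.0, 8.0, 10.0, 10.0]

print(worst_case_chain_latency(hops=3))   # 30.0
```

With a hypothetical chain of three queues, a burst arriving after a quiet period could wait up to 30 seconds in total before the last service sees it, which is the latency problem described above.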
So, fine then, we need something new: RabbitMQ.
RabbitMQ is a message queue system written in Erlang that implements AMQP (a standard and heavily used message queue protocol). I've really only scratched the surface with it (we're still in the process of running tests to make sure that RabbitMQ really does what we need), but I like what I've seen so far.
- It's fast
- It's built for telcos and SMS switching and all kinds of serious heavy-load craziness that there's just no way SQS could handle. It supports clustering. It supports subscriptions (RabbitMQ will deliver the message to your consumer immediately instead of forcing consumers to poll). The underlying resources are under your control (no competing with noisy neighbours, as with Amazon).
- It's free
- RabbitMQ is open source. All you're paying for is the instance that runs it. There's no request charge, so you can poll as frequently as you want. If your clients subscribe to queues, then there's not even a need to poll. RabbitMQ will just tell the client when a message is available!
- It supports all kinds of crazy message patterns
- For right now, this isn't particularly important to us, but RabbitMQ supports a bunch of different message passing patterns: one-to-one, one-to-many, many-to-many, RPC, complex routing, topics, etc.
- It supports 'durable' messaging
- It was particularly important to us that a crash of RabbitMQ (or the EC2 instance) not lose the contents of the message queues. RabbitMQ supports both durable and transient messaging. If all you care about is lightning-fast performance, set durable=false on the queue. If you care about crash recovery, set durable=true and publish your messages as persistent (a durable queue survives a restart on its own, but only the persistent messages in it do). There's a performance penalty there, though, because each queued message needs to be written to disk.
- It supports true FIFO message ordering
- I wrote a ton of code to deal with SQS's 'more or less in order, sort of' message ordering. With RabbitMQ, the only time a message will be received out-of-order is if the message is requeued for one reason or another (of course, it depends on your configuration).
- It is consistent
- Eventual consistency can be a real problem with the Amazon services. It's not unusual for the same message to be delivered multiple times to the same client (or multiple times to multiple clients). With RabbitMQ, most of that pain is gone. RabbitMQ does not guarantee exactly-once delivery, but more-than-once delivery will really only occur if the first message delivery fails (is not acknowledged by the client), or if the message is resubmitted.
- Easy to set up, easy to swap in
- It's taken me about a day to set up RabbitMQ, swap out our old SQS code and replace it with RabbitMQ code. It's a hack job for now, but it's 100% functional and gives us an environment with which to test out RabbitMQ.
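For a feel of what subscribing (instead of polling) and durable messaging look like in code, here is a minimal sketch. It assumes the third-party pika client (1.x API) and a broker on localhost; the post doesn't say which client library was actually used, and the queue name, payload, and process() function are made up for illustration.

```python
import json

QUEUE = "work"  # hypothetical queue name


def process(payload: dict) -> None:
    """Stand-in for the real application logic (hypothetical)."""
    print("processing", payload)


def on_message(channel, method, properties, body) -> None:
    """Called by the client library each time the broker pushes a message to us."""
    process(json.loads(body))
    # Ack only after the work is done. If process() raises, the message stays
    # unacknowledged and the broker requeues it once the channel closes.
    channel.basic_ack(delivery_tag=method.delivery_tag)


def main() -> None:
    # Imported here so that on_message can be exercised without pika or a broker.
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()

    # durable=True keeps the queue definition across a broker restart.
    channel.queue_declare(queue=QUEUE, durable=True)

    # delivery_mode=2 marks the message as persistent, so it is written to disk.
    channel.basic_publish(
        exchange="",
        routing_key=QUEUE,
        body=json.dumps({"job": 1}).encode("utf-8"),
        properties=pika.BasicProperties(delivery_mode=2),
    )

    # Subscribe: there is no polling loop, the broker delivers to on_message.
    channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
    channel.start_consuming()
```

Calling main() with a broker running publishes one message and then blocks in start_consuming(), handing each delivery to on_message as it arrives.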
It's not all good news, though:
- We haven't quite gotten there yet, but: making RabbitMQ highly available is up to you. With SQS, high availability comes along for free.
- Additionally, making RabbitMQ highly available seems a bit tricky. RabbitMQ's clustering is built for performance, not availability. There is, from what I've seen, no built-in support for redundancy and failover. You kind of have to glue it all together yourself.
- Worryingly large number of cases in which a message can disappear into vapor.
- Watch out: there seem to be a number of ways to configure RabbitMQ where, if a consumer is not present or a queue is not present or an exchange is not present or whatever, your message will just vanish. Vaporize. I'm just in the process of testing out our migration, and this is one thing that I still need to spend some time on: it's not acceptable for a message to vaporize, ever, at all, for us. With SQS, when a message is published, Amazon guarantees that it will be delivered at least once (provided a client eventually polls the queue).
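One of the vanishing cases can be shown with a toy model. In AMQP 0-9-1, an exchange discards a message that matches no queue binding unless the publisher sets the mandatory flag, in which case the broker hands the message back. The class below is an in-memory illustration of that one rule only, not RabbitMQ code: every name in it is hypothetical, and it raises an exception where a real broker would return the message to the publisher.

```python
from collections import defaultdict, deque


class UnroutableMessage(Exception):
    """Raised by this toy model when a mandatory message matches no queue."""


class ToyDirectExchange:
    """In-memory stand-in for a direct exchange (illustration only)."""

    def __init__(self):
        self.bindings = defaultdict(list)  # routing key -> list of bound queues

    def bind(self, routing_key: str, queue: deque) -> None:
        self.bindings[routing_key].append(queue)

    def publish(self, routing_key: str, body: str, mandatory: bool = False) -> int:
        """Route body to every bound queue; return how many queues received it."""
        queues = self.bindings.get(routing_key, [])
        if not queues:
            if mandatory:
                # A real broker would return the message to the publisher here.
                raise UnroutableMessage(routing_key)
            return 0  # no binding and not mandatory: the message is dropped
        for queue in queues:
            queue.append(body)
        return len(queues)


exchange = ToyDirectExchange()

# Nothing is bound yet, so this message vanishes without any error.
print(exchange.publish("jobs", "lost"))            # 0

# With mandatory=True the publisher finds out instead.
try:
    exchange.publish("jobs", "noticed", mandatory=True)
except UnroutableMessage as err:
    print("unroutable:", err)                      # unroutable: jobs

# Once a queue is bound, messages are kept.
jobs = deque()
exchange.bind("jobs", jobs)
print(exchange.publish("jobs", "kept"), list(jobs))  # 1 ['kept']
```

The practical takeaway, under that assumption, is to declare queues and bindings before anything publishes and to have publishers ask to be told when a message could not be routed.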