Replacing AWS Step Functions with SQS FIFO queues and cutting the cost in half

One of the core services for SMG Real Estate is Search Alerts – a service that allows users to be notified about new properties published on ImmoScout24 and Homegate. 

For example, if a user is looking for a new apartment with N rooms costing less than X, but there are no matches at the moment, they can create a search alert with those search criteria. When a new property matching those criteria is published, the user is notified via email or a mobile push notification.

About the Search Alerts system

The Original Implementation

Users can choose the frequency of notifications – either every 5 minutes or every 4 hours (assuming there are any to be delivered). The vast majority of search alerts are configured with the 5-minute frequency. 

Just to get an idea of the scale, we have: 

    • Millions of published properties match saved search alerts every day
    • More than a million emails and push notifications sent daily

Check out our “Homegate’s fast and modern search experience helps users find their dream home” blog post for an overview of how Search Alerts work.

In the original implementation we used an AWS Step Function to achieve the 5-minute batching of notifications:

  • When a new property matching a search alert arrived, the StartSendNotificationProcess lambda started a Step Function execution specific to that search alert.
  • All the execution did was wait for 5 minutes while the matches accumulated in the matches-{searchAlertId} SQS queue, which was created programmatically for that specific search alert.
  • If more matching listings arrived during the 5-minute waiting period, the StartSendNotificationProcess lambda attempted to start a Step Function execution with the same name. When that failed with the ExecutionAlreadyExists error, we knew an execution was already waiting for that search alert. That way the deduplication of notification processes was guaranteed.
  • After the 5-minute wait, the Step Function execution triggered the LoadMatches lambda, which received the accumulated matches and generated and sent a notification (either an email or a mobile push).
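The name-based deduplication described above can be sketched roughly as follows. This is not the real code – the function and execution names are assumptions, and the Step Functions client is injected so the logic can be shown (and tested) without real AWS calls:

```javascript
// Hypothetical sketch of name-based execution deduplication.
// startNotificationProcess and the `notify-<id>` naming are illustrative.
async function startNotificationProcess(sfnClient, searchAlertId) {
  try {
    // One execution per search alert: the execution name acts as the dedup key.
    await sfnClient.startExecution({
      stateMachineArn: process.env.STATE_MACHINE_ARN,
      name: `notify-${searchAlertId}`, // same alert => same name
    });
    return 'started';
  } catch (err) {
    if (err.name === 'ExecutionAlreadyExists') {
      // An execution is already waiting for this alert; matches keep
      // accumulating in its matches-{searchAlertId} queue meanwhile.
      return 'already-waiting';
    }
    throw err;
  }
}
```

The interesting property is that the error path is the expected, common path: every match after the first one lands in the `already-waiting` branch.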

The Problem?

The Step Function’s state machine was defined as follows:
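The original diagram is not reproduced here, but the definition boiled down to a Wait state followed by a Task. A rough reconstruction (state names and the placeholder Lambda ARN are assumptions), written as the Amazon States Language expressed as a JS object:

```javascript
// Hypothetical reconstruction of the state machine definition.
const definition = {
  StartAt: 'WaitFiveMinutes',
  States: {
    WaitFiveMinutes: {
      Type: 'Wait',
      Seconds: 300, // the whole point: let matches accumulate for 5 minutes
      Next: 'LoadMatches',
    },
    LoadMatches: {
      Type: 'Task',
      Resource: 'arn:aws:lambda:...:function:LoadMatches', // placeholder ARN
      End: true,
    },
  },
};
```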

This workflow corresponds to 4 state transitions. Considering we had over 5M state transitions per day and that Step Functions are billed $0.025 per 1K state transitions, this ended up generating daily costs of hundreds of dollars, which added up to tens of thousands per year. 
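The cost claim follows from simple arithmetic on the figures above:

```javascript
// Back-of-the-envelope check of the Step Functions bill quoted above.
const transitionsPerDay = 5_000_000; // "over 5M state transitions per day"
const pricePer1kTransitions = 0.025; // USD per 1K state transitions

const dailyCost = (transitionsPerDay / 1000) * pricePer1kTransitions;
const yearlyCost = dailyCost * 365;

console.log(dailyCost);  // → 125 (USD per day)
console.log(yearlyCost); // → 45625 (USD per year)
```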

So it was only reasonable to try to find a more cost-efficient solution for scheduling. 

FIFO Queues to the Rescue

While brainstorming possible cheaper alternative implementations of scheduling, we considered a number of possibilities, including implementing our own scheduling service. But at some point we also realised we could use a combination of properties that AWS SQS FIFO queues offer to achieve exactly the same result at a fraction of the cost.

FIFO queues guarantee message order preservation and exactly-once processing. The second property was particularly interesting for us – FIFO queues have a 5-minute deduplication interval, during which trying to send an identical message will not introduce duplicates into the queue. Because the LoadMatches lambda only needs a search alert ID as input, multiple matches for the same search alert effectively generated equivalent messages that could be deduplicated. This, in combination with a 5-minute message delay, would allow us to achieve the same 5-minute batching as with Step Functions:
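A sketch of what such a scheduling message could look like (names are illustrative, not the production code). One detail worth noting: FIFO queues only support a queue-level delay (DelaySeconds configured on the queue, here 300 seconds), not per-message delays, so the message itself only carries the grouping and deduplication keys:

```javascript
// Hypothetical FIFO scheduling message for one search alert.
function buildScheduleMessage(queueUrl, searchAlertId) {
  return {
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify({ searchAlertId }),
    // All messages for one alert share a group, preserving their order.
    MessageGroupId: searchAlertId,
    // Identical dedup IDs within the 5-minute interval are dropped by SQS,
    // so N matches for one alert collapse into a single scheduled trigger.
    MessageDeduplicationId: searchAlertId,
  };
}
```

The same effect could also be achieved with content-based deduplication on the queue, since equivalent matches produce byte-identical bodies here.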

On the pricing side, FIFO queues promised lots of savings: $0.48 per million messages vs. $25 per million state transitions.
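The raw per-unit gap is roughly 52x (the actual savings also depend on how many SQS requests replace how many state transitions per notification process):

```javascript
// Per-million price comparison from the paragraph above.
const sqsFifoPerMillion = 0.48; // USD per million FIFO requests
const sfnPerMillion = 25;       // USD per million state transitions

const ratio = sfnPerMillion / sqsFifoPerMillion; // ≈ 52x cheaper per unit
```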

It’s worth mentioning that we didn’t consider FIFO queues for the original implementation because they were introduced years after our system was first deployed. 

Shadowing and Analysis

Because it seemed too good to be true, we wanted to make sure the new scheduling approach would be equivalent to the old one and that we wouldn’t lose any notifications along the way. 

To do that we implemented the following setup: 

We started logging scheduling-related info on every execution of the StartSendNotificationProcess lambda. That included a randomly generated scheduleId as well as the searchAlertId and a processStart timestamp.

Every time we got a new match, we attempted to start a new Step Function execution, but also sent a message to the newly introduced FIFO queue. The scheduling log info was passed in the payload to both channels.
Every time LoadMatches or FakeLoadMatches was invoked, we logged some more scheduling-related info, such as scheduleId, searchAlertId and processEnd.

FakeLoadMatches did nothing but log – the idea was to shadow the existing production flow, send the same events through both the old and the new flow, and compare the results.
We used the scheduling IDs and search alert IDs for correlation and to prove that the new shadow flow with the FIFO queue would produce trigger events for the LoadMatches lambda equivalent to those of the Step Function flow.

We needed to be able to stitch the logs together and answer the main question – given a search alert, does the FIFO flow produce the same trigger events with the same payload at roughly the same time compared to the existing Step Function flow? To do that, we resorted to AWS Athena. That involved:

  • Exporting CloudWatch logs from each lambda into an S3 bucket
  • Creating an Athena table from the logs of each lambda
  • Using SQL to join, filter and aggregate the data from those tables to get the quantity, time difference and distribution of output trigger events

One challenge we faced was building a query to create an Athena table from the mix of JSON and space-separated info produced by the logger utility we had been using. In the end we went for the simplest solution possible – using vanilla console.log() from the Node runtime to log space-separated values. This made them trivial to parse and ingest into Athena.
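The resulting log lines could be as simple as the following (the marker and field order are assumptions for illustration, not the real format):

```javascript
// One space-separated line per scheduling event – trivial for Athena to
// split on whitespace, unlike mixed JSON/text logger output.
function logSchedule(event, scheduleId, searchAlertId, timestamp) {
  const line = `SCHED ${event} ${scheduleId} ${searchAlertId} ${timestamp}`;
  console.log(line);
  return line;
}
```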

After running multiple queries we confirmed that the deduplication and delay in the new flow worked as expected, and that LoadMatches and FakeLoadMatches were triggered with events that would result in equivalent notifications being delivered to the users.
With the new architecture validated, we were ready to perform the switch.

Gradual Roll-Out

We still wanted to be cautious and roll out the new scheduling solution gradually – starting with 1% of the traffic and then increasing it all the way to 100% if the system behaved well. This was achieved by routing the events in StartSendNotificationProcess to either the Step Function or the FIFO flow, and connecting the FIFO queue directly to the LoadMatches lambda, so that the latter could be triggered with the same result by both flows.

For testing purposes, we wanted to guarantee that a given search alert would always go through the same flow – either the FIFO queue or the Step Function. This meant that simply rolling a dice wouldn't cut it – so we decided to match a RegExp against search alert IDs to decide which way to route notifications:
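Assuming hex-based search alert IDs (e.g. UUIDs), a character-class match on the last character gives a deterministic, roughly proportional split. The pattern below is illustrative, not the production one:

```javascript
// Deterministic routing: a given ID always matches (or never does), so the
// same search alert always takes the same flow. With hex IDs, each extra
// character in the class routes roughly another 1/16 of alerts to FIFO;
// matching the last two characters gives ~1/256 steps for a ~1% start.
const fifoRolloutPattern = /0$/; // ~6% of hex IDs end in "0"

function useFifoFlow(searchAlertId) {
  return fifoRolloutPattern.test(searchAlertId);
}
```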

We created some identical test search alerts and manipulated their IDs so that they ended up using different flows.

This allowed us to do one last manual test before increasing the rollout percentage – we verified that both search alerts resulted in notification emails containing the same properties. As the system was stable and didn't produce any errors, we adjusted the RegExp to progressively reach 100% rollout. After that, the Step Function flow was completely removed from the system, and all notifications for 5-minute search alerts now go through the one and only FIFO queue flow.

Conclusion

Using SQS FIFO queues to debounce match events for 5 minutes while batching the notifications turned out to be a success. It's a simple solution that achieves the same result at a fraction of the cost. Speaking of costs, this improvement alone cut the AWS costs in half:

Besides the cost cut, the stability of the service also improved – previously, we occasionally got throttled by the Step Functions service when listing and starting new executions during huge spikes of matches. SQS, on the other hand, has no problem absorbing such spikes.

While this worked great for the 5-minute notifications, the approach wouldn't work for longer batching times, as the 5-minute deduplication interval is hardcoded in SQS FIFO queues. We kept the original Step Function-based implementation for those cases, as they amount to a tiny fraction of traffic compared to the 5-minute notifications and their cost impact is negligible.

However, if one day we decide to get rid of the remaining Step Functions, we'll have to find a way of scheduling intervals longer than 5 minutes. One idea could be using scheduled EventBridge events as recurring triggers to fetch and send the accumulated matches every X minutes.

Author

Alexei Liulin

Senior Engineer

Real Estate
