Replacing AWS Step Functions with SQS FIFO queues and cutting the cost in half

05.03.2025

One of the core services for SMG Real Estate is Search Alerts – a service that allows users to be notified about new properties published on ImmoScout24 and Homegate. 

For example, if a user is looking for a new apartment having N rooms and costing less than X, and not finding any matches now, they can create a search alert with those search criteria. When a new property matching those criteria is published, the user will be notified either via email or a mobile push notification.

About the Search Alerts system

Users can choose the frequency of notifications – either every 5 minutes or every 4 hours (assuming there are any to be delivered). The vast majority of search alerts are configured with the 5-minute frequency.  Just to get an idea of the scale, we have: 
    • Millions of matches between the published properties and saved search alerts every day
    • Millions of published properties match saved search alerts every day
    • More than a million emails and push notifications sent daily
Check out our « Homegate’s fast and modern search experience helps users find their dream home » blog post for an overview of how Search Alerts work.

The Original Implementation

In the original implementation we used an AWS Step Function to achieve the 5- minute batching of notifications:
  • When a new property matching a search alert arrived, the StartSendNotificationProcess lambda would start a Step Function execution specific to that search alert.
  • All it did was waiting for 5 minutes while the matches accumulated in the matches-{searchAlertId} SQS queue, which was programmatically created for that specific search alert
  • If more matching listings arrived during the 5-minute waiting period, the StartSendNotificationProcess lambda attempted to start the step function execution with the same name. When it failed with the ExecutionAlreadyExists error, we knew there was already a SF for waiting. That way the deduplication of notification processes was guaranteed.
  • After 5 minutes of waiting time, the Step Function execution proceeded with triggering the LoadMatches lambda that received the accumulated matches, generating and sending a notification (either an email or a mobile push)

The Problem?

The Step Function’s state machine was defined as follows:

This workflow corresponds to 4 state transitions. Considering we had over 5M state transitions per day and that Step Functions are billed $0.025 per 1K state transitions, this ended up generating daily costs of hundreds of dollars, which added up to tens of thousands per year. 

So it was only reasonable to try to find a more cost-efficient solution for scheduling. 

FIFO Queues to the Rescue

While brainstorming possible cheaper alternative implementations of  scheduling, we considered a number of possibilities including implementing our own scheduling service. But at some point we also realised we could use a combination of properties that AWS SQS FIFO queues offer to achieve exactly the same result at a fraction of the cost.  FIFO queues guarantee the message order preservation and exactly-once processing. The second property was particularly interesting for us – FIFO queues have a 5-minute deduplication interval, during which trying to send an identical message will not introduce duplicates into the queue. Because the   LoadMatches lambda only needs a search alert ID as input, multiple matches for the same search alerts effectively generated equivalent messages that could be deduplicated. This, in combination with a 5-minute message delay, would allow us to achieve the same 5-minute batching as with Step Functions:

On the pricing side, FIFO queues promised lots of savings: $0.48/Million messages VS $25/Million state transitions. 

It’s worth mentioning that we didn’t consider FIFO queues for the original implementation because they were introduced years after our system was first deployed. 

Shadowing and Analysis

Because it seemed too good to be true, we wanted to make sure the new scheduling approach would be equivalent to the old one and that we wouldn’t lose any notifications along the way. 

To do that we implemented the following setup: 

We started to log the scheduling-related info on every execution in the StartSendNotificationProcess lambda. That included a randomly generated scheduleId as well as searchAlertId and processStart timestamp. Every time we got a new match we attempted to start a new Step Function execution, but also sent a message to the newly introduced FIFO queue. The scheduling log info was passed in the payload to both channels Every time the LoadMatches or FakeLoadMatches were invoked, we logged some more scheduling-related info such as scheduleId , searchAlertId and processEnd. FakeLoadMatches did nothing but logging – the idea was to shadow the existing production flow and send the same events through both the old and the new flows and compare the results. We used the scheduling IDs and search alert IDs for correlation and to prove that the new shadow flow with FIFO queue would result in equivalent trigger events for the LoadMatches lambda as the Step Function flow.
We needed to be able to stitch the logs together and answer the main question – given a search alert, does the FIFO flow produce the same trigger events with the same payload at roughly the same time compared to the existing Step Function flow? To do that, we resorted to AWS Athena. That involved:
  • Exporting CloudWatch logs from each lambda into a S3 bucket
  • Creating an Athena table from the logs of each lambda
  • Using SQL to join, filter and aggregate the data from those tables to get the quantity, time difference and distribution of output trigger events
One challenge we faced was to build a query for creating an Athena table from the mix of JSON and space-separated info coming from the logger utility we’ve been using. In the end we just went for the simplest solution possible – using the vanilla console.log() from the Node runtime and log the spaces-separated values. This way it was trivial to parse and ingest them into Athena. After running multiple queries we proved that the deduplication and delay in the new flow worked as expected, and the LoadMatches and FakeLoadMatches were triggered with events that would result in equivalent notifications delivered to the users. Now that the new architecture has been validated, we were ready to perform the switch.

Gradual Roll-Out

We still wanted to be cautious and roll out the new scheduling solution gradually – starting with 1% of the traffic and then increasing it all the way to 100% if the system behaved well. This could be achieved by routing the events in StartSendNotificationProcess to either the Step Function or FIFO flow, and connecting the FIFO queue directly to the LoadMatches lambda, so the latter can be triggered with the same result by both the Step Function and the FIFO flows.  For testing purposes, we wanted to guarantee that a given search alert will always go through the same flow – either the FIFO queue or the Step Function. This meant that simply rolling a dice wouldn’t cut it – so we decided to use a RegExp to match against search alert IDs to decide which was to route notifications:

We created some identical test search alerts and manipulated their IDs in such a way that they end up using different flows.

This allowed us to do one last manual test before increasing the rollout percentage – we verified that both search alerts resulted in notification emails containing the same properties. As the system was stable and didn’t produce any errors, we manipulated the RegExp to progressively reach the 100% rollout. After that the Step Function flow was completely removed from the system, and now all the notifications for 5-minute search alerts are going through the one and only FIFO queue flow.

Conclusion

Using SQS FIFO queues to basically debounce match events for 5 minutes while batching the notifications turned out to be a success. It’s a simple solution that allowed us to achieve the same result at a fraction of cost. Speaking of costs, this improvement alone cut the AWS costs in half:

Besides the cost cut, the stability of the service also improved – previously, we occasionally got some throttling from the Step Function service when listing and starting new executions in correlation with huge spikes of matches. SQS, on the other hand, has no problem digesting such spikes. 

While this worked great for 5-minute notifications, this approach wouldn’t work for longer batching times, as the 5-minute deduplication interval is hardcoded in SQS FIFO queues. We kept the original Step Function based implementation for those cases, as they really are a tiny amount compared to the 5-minute notifications and the cost impact is negligible. 

However, if one day we decide to get rid of the remaining step functions, we’ll have to find a way of scheduling intervals longer than 5 minutes. One idea could be using scheduled Event Bridge events as recurrent triggers to fetch and send the accumulated matches every X minutes.

Author

Alexei Liulin
Senior Engineer

Real Estate

Marta Andreoni

Head of Design for Automotive

Introduce yourself and your role at SMG

I’m Marta Andreoni, Head of Design at SMG Automotive. I lead the design and UX writing team shaping AutoScout24 user experience. 

In my role, I wear many hats. My main focus is ensuring we stay true to our vision “simplifying people’s lives and connecting humans through innovative digital platforms” and our brand promise, “make it happen”. I challenge my team to think user-first, push for innovation, ease of use for our customers and make forward-thinking decisions, even within business and technological constraints.

 A big part of my role is supporting each designer’s growth, motivation, and career development. Through one-on-one coaching, mentoring, group work, and projects, I help my colleagues set and achieve their goals while fostering new learning opportunities.

What helps you feel empowered and confident in your role?

If I had to mention one thing I would say “being proactive” has been key to feeling more empowered. I enjoy solving problems, so when issues or opportunities arise, be it in the product, market or the team, I get curious and I proactively investigate the reasons and try to bring inputs to be discussed with others, this makes me feel I can be part of the process or solution and my point of view is going to be taken seriously. My optimism also plays a role, giving me confidence that even the most complex challenges can be solved. 

Besides, having trust from other managers and colleagues makes me feel in a safe environment where I can take ownership on topics I’m passionate about. 

What’s one thing SMG does well in fostering an inclusive workplace? What more can be done to amplify and support different perspectives in the workplace?

In my experience, we strive for balancing top-down and bottom-up inputs, ensuring employees can influence product directions, processes, and culture. People are approachable, and our strong feedback culture helps voices be heard. Across SMG, initiatives like regular People & Culture Surveys, topic guilds, and events in our locations across the world foster open exchange and mutual learning.

That said, I’ve noticed that quieter voices sometimes get less space, or interacting with top management can feel intimidating, especially when giving critical feedback. To make participation more inclusive, we could apply more facilitation and group work techniques like structured turn-taking, written input, and smaller group discussions – ensuring everyone, regardless of confidence level, seniority or personality, feels comfortable contributing. 

Design is often about seeing the world differently. How do unique perspectives contribute to more innovative, inclusive, or impactful design?

Design is about understanding diverse user personas and perspectives to create solutions that truly meet their needs or create new opportunities. I believe in the power of collaboration to shape user experiences – bringing together different disciplines, backgrounds, and lived experiences helps challenge assumptions, uncover blind spots, and drive more inclusive, innovative, and impactful solutions.

Looking back on your career, what’s one lesson or piece of advice you wish you had known earlier as a leader in design?

There are three things no one really prepares you for as a design leader: dealing with constant change, facing failure and handling emotions at work. These topics aren’t talked about much until you face them. I was lucky to learn from others’ experiences, but much of it came through my own.

One thing I wish I had understood earlier is the power of emotional intelligence, my job is no longer about the content and the design, it is about people. Self-awareness, not just of your own emotions, but also how others feel and react, can be the difference between conflict and harmony, frustration and clarity. The more I grow as a leader and designer, the more I realise that design isn’t just about doing the design job, delivering solutions on the market: it’s about navigating people, their emotions, and making change more acceptable and transforming issues into opportunities, both within the organisation and through great products.

 

Photos de la Direction avec et sans couleur de fond en fichier ZIP

Logo à télécharger dans toutes les versions