One of the core services for SMG Real Estate is Search Alerts – a service that allows users to be notified about new properties published on ImmoScout24 and Homegate.
For example, if a user is looking for an apartment with N rooms that costs less than X and finds no matches right now, they can create a search alert with those criteria. When a new property matching the criteria is published, the user is notified either by email or by a mobile push notification.
About the Search Alerts system
- Millions of matches between the published properties and saved search alerts every day
- Millions of published properties match saved search alerts every day
- More than a million emails and push notifications sent daily
The Original Implementation
- When a new property matching a search alert arrived, the StartSendNotificationProcess lambda started a Step Function execution specific to that search alert.
- All the execution did was wait for 5 minutes while the matches accumulated in the matches-{searchAlertId} SQS queue, which was created programmatically for that specific search alert.
- If more matching listings arrived during the 5-minute waiting period, the StartSendNotificationProcess lambda attempted to start a Step Function execution with the same name. When that attempt failed with the ExecutionAlreadyExists error, we knew a waiting execution already existed for the search alert. This guaranteed that notification processes were deduplicated.
- After the 5 minutes had passed, the Step Function execution triggered the LoadMatches lambda, which received the accumulated matches and then generated and sent a notification (either an email or a mobile push).
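
The deduplication step described above can be sketched roughly as follows. This is a minimal illustration rather than the original lambda code: the function name, the `execution_name` parameter and the input payload are assumptions. The only real API used is the Step Functions `start_execution` call as exposed by a boto3-style client, together with its `ExecutionAlreadyExists` exception. How the execution name was actually derived is not stated here, so the caller supplies it.

```python
import json


def start_notification_process(sfn_client, state_machine_arn, execution_name, match):
    """Try to start the 5-minute wait for a search alert.

    Returns True if a new execution was started and False if an execution
    with the same name already exists (a wait is already in progress).
    """
    try:
        sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            # Assumption: the name identifies the search alert (and, in a real
            # system, probably the batching window, because Step Functions
            # does not allow the name of a closed execution to be reused
            # right away).
            name=execution_name,
            # Hypothetical payload shape, used only for illustration.
            input=json.dumps(match),
        )
        return True
    except sfn_client.exceptions.ExecutionAlreadyExists:
        # Another match already started the wait. This match simply stays in
        # the per-alert queue until the running execution picks it up.
        return False
```

In this sketch the lambda never needs to look up whether a wait is in progress: the uniqueness of the execution name does the bookkeeping.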

The Problem?
The Step Function’s state machine was defined as follows:

This workflow corresponds to 4 state transitions. We had over 5M state transitions per day, and Step Functions are billed at $0.025 per 1K state transitions, so this generated daily costs of hundreds of dollars, which added up to tens of thousands per year.
So it was only reasonable to try to find a more cost-efficient solution for scheduling.
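
As a quick sanity check of the numbers above, the cost can be worked out directly. The sketch uses exactly 5M transitions per day, so the result is a lower bound for the "over 5M" mentioned in the text. The constant and function names are illustrative.

```python
PRICE_PER_1K_TRANSITIONS_USD = 0.025  # price quoted in the article
TRANSITIONS_PER_DAY = 5_000_000       # "over 5M", so treat this as a lower bound


def step_functions_cost(transitions, price_per_1k=PRICE_PER_1K_TRANSITIONS_USD):
    """Cost in USD for a given number of state transitions."""
    return transitions / 1000 * price_per_1k


daily = step_functions_cost(TRANSITIONS_PER_DAY)
yearly = daily * 365

print(f"daily:  at least ${daily:,.2f}")
print(f"yearly: at least ${yearly:,.2f}")
```

At this lower bound the bill is already about $125 per day and roughly $45,600 per year, and it grows linearly with every additional transition.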
FIFO Queues to the Rescue

On the pricing side, FIFO queues promised substantial savings: $0.48 per million messages versus $25 per million state transitions.
It’s worth mentioning that we didn’t consider FIFO queues for the original implementation because they were introduced years after our system was first deployed.
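
To illustrate why a FIFO queue can take over the scheduling role, here is a toy in-memory model of FIFO deduplication, plus a helper that builds the parameters for a trigger message. Both are sketches under assumptions. The model is not the SQS API, and the exact queue configuration is not described here. In particular, using the search alert ID as the deduplication ID and relying on a queue-level delivery delay for the 5-minute wait are guesses. The parameter names `QueueUrl`, `MessageBody`, `MessageGroupId` and `MessageDeduplicationId` are the real ones accepted by the SQS `SendMessage` operation.

```python
import json

DEDUP_WINDOW_SECONDS = 300  # SQS FIFO deduplication interval (5 minutes)


class FifoDedupModel:
    """Toy model: drops messages whose deduplication ID was accepted recently."""

    def __init__(self):
        self._accepted_at = {}  # deduplication ID -> time of last accepted message
        self.accepted = []      # (deduplication ID, time) of every accepted message

    def send(self, dedup_id, now):
        last = self._accepted_at.get(dedup_id)
        if last is not None and now - last < DEDUP_WINDOW_SECONDS:
            return False  # duplicate inside the window: not enqueued again
        self._accepted_at[dedup_id] = now
        self.accepted.append((dedup_id, now))
        return True


def build_trigger_message(queue_url, search_alert_id):
    """Build SendMessage parameters for one trigger (hypothetical payload)."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps({"searchAlertId": search_alert_id}),
        # Assumption: one message group and one deduplication ID per alert.
        "MessageGroupId": search_alert_id,
        "MessageDeduplicationId": search_alert_id,
    }
```

With this model, the first match for an alert enqueues a trigger, and every further match within the next 5 minutes is dropped as a duplicate. That is the same guarantee the ExecutionAlreadyExists error provided in the Step Function flow.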
Shadowing and Analysis
Because it seemed too good to be true, we wanted to make sure the new scheduling approach would be equivalent to the old one and that we wouldn’t lose any notifications along the way.
To do that we implemented the following setup:
- Exporting CloudWatch logs from each lambda into an S3 bucket
- Creating an Athena table from the logs of each lambda
- Using SQL to join, filter and aggregate the data from those tables to get the quantity, time difference and distribution of output trigger events
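
The kind of comparison described in the steps above can be sketched in a few lines. This is plain Python over in-memory records rather than the Athena SQL that was actually used, and the event shape, a `(search_alert_id, trigger_time_seconds)` tuple, is a hypothetical simplification of the log data.

```python
from collections import defaultdict
from statistics import mean


def group_by_alert(events):
    """Map each search alert ID to its sorted list of trigger times."""
    grouped = defaultdict(list)
    for alert_id, trigger_time in events:
        grouped[alert_id].append(trigger_time)
    for times in grouped.values():
        times.sort()
    return grouped


def compare_flows(old_events, new_events):
    """Compare trigger events emitted by the old and the new scheduling flow."""
    old, new = group_by_alert(old_events), group_by_alert(new_events)
    # Alerts for which the two flows produced a different number of triggers.
    count_mismatches = sorted(
        alert_id
        for alert_id in old.keys() | new.keys()
        if len(old.get(alert_id, [])) != len(new.get(alert_id, []))
    )
    # Time difference (new minus old) for triggers paired up in order.
    deltas = [
        new_time - old_time
        for alert_id in old.keys() & new.keys()
        for old_time, new_time in zip(old[alert_id], new[alert_id])
    ]
    return {
        "old_total": len(old_events),
        "new_total": len(new_events),
        "count_mismatches": count_mismatches,
        "mean_delta_seconds": mean(deltas) if deltas else None,
        "max_abs_delta_seconds": max(map(abs, deltas), default=None),
    }
```

An empty `count_mismatches` list together with small time deltas would suggest that the two flows trigger the same notifications at about the same time.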
Gradual Roll-Out

We created some identical test search alerts and manipulated their IDs so that they ended up using different flows.
This allowed us to do one last manual test before increasing the rollout percentage: we verified that both search alerts resulted in notification emails containing the same properties. As the system was stable and didn’t produce any errors, we adjusted the RegExp step by step until the rollout reached 100%. After that, the Step Function flow was completely removed from the system, and now all the notifications for 5-minute search alerts go through the one and only FIFO queue flow.
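
A RegExp-based rollout of this kind could look roughly like the sketch below. The ID format is an assumption (IDs ending in a hexadecimal digit), and the pattern and function name are hypothetical. The actual expression used in the system is not shown here.

```python
import re

# Hypothetical pattern: IDs whose last hex digit is 0-3 use the FIFO flow,
# which is 4 out of 16 possible digits, i.e. about 25% of evenly spread IDs.
FIFO_ROLLOUT_PATTERN = re.compile(r"[0-3]$")


def uses_fifo_flow(search_alert_id, pattern=FIFO_ROLLOUT_PATTERN):
    """Return True if the search alert should be scheduled via the FIFO queue."""
    return pattern.search(search_alert_id) is not None


# Widening the character class raises the rollout percentage:
HALF_ROLLOUT = re.compile(r"[0-7]$")     # about 50%
FULL_ROLLOUT = re.compile(r"[0-9a-f]$")  # 100% of hex-terminated IDs
```

Because the decision depends only on the ID, a given search alert always takes the same flow. That also makes it possible to craft test alerts whose IDs land on either side of the split.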
Conclusion


Besides the cost cut, the stability of the service also improved. Previously, we were occasionally throttled by the Step Functions service when listing and starting new executions during huge spikes of matches. SQS, on the other hand, has no problem digesting such spikes.
While this worked great for 5-minute notifications, the approach wouldn’t work for longer batching times, because the 5-minute deduplication interval is hardcoded in SQS FIFO queues. We kept the original Step Function-based implementation for those cases, as they are a tiny fraction compared to the 5-minute notifications and their cost impact is negligible.
However, if one day we decide to get rid of the remaining Step Functions, we’ll have to find a way of scheduling intervals longer than 5 minutes. One idea could be to use scheduled EventBridge events as recurring triggers that fetch and send the accumulated matches every X minutes.

Author
Alexei Liulin
Senior Engineer
Real Estate