Introduction and motivation
At SMG Swiss Marketplace Group, we operate multiple online platforms used daily by millions of people across Switzerland. One of them, Ricardo, is a second-hand marketplace where buyers and sellers interact in four different languages: French, German, Italian, and English.
To deliver a seamless experience across languages, we had been relying on the DeepL API, one of the most accurate translation APIs on the market. While DeepL served us well in terms of quality and response time, it came with significant downsides:
- A monthly cost of over 7,000 CHF
- A black-box approach, offering no insight into how translations were produced
- Data privacy risks
As usage grew, so did the bill, and the urgency to reduce it.
Around that time, generative AI had made enormous improvements, and open-source large language models (LLMs) had become more accessible than ever. This sparked a question:
Can we replace DeepL API with a self-hosted open-source LLM, without sacrificing translation quality, speed, and user experience?
The answer turned out to be yes. Over the course of my master’s thesis, I built a production-grade, GPU-accelerated translation service using modern LLM tooling, hosted entirely on Google Cloud Platform (GCP), and fully integrated into Ricardo’s infrastructure. The result:
- Over 85% cost savings
- Sub-3-second latency
- No measurable impact on buyer conversion rate
And everything is reproducible, transparent, and extensible.
In this article, I’ll walk you through exactly how I built it, from selecting the right model, optimizing inference time, and wrapping it in an API, to deploying and monitoring it in production. My goal is simple: enable you to do the same.
Whether you’re a backend developer, an ML engineer, or just AI-curious, this guide will show you how to turn open-source LLMs into real-world, cost-efficient infrastructure.
DeepL alternatives and model benchmarking
Once we made the decision to move away from DeepL, the next step was to figure out what could realistically replace it. That meant surveying the landscape of translation solutions, from commercial APIs to open-source solutions, and running a thorough benchmark to find the best fit for our use case.
Commercial APIs: a costly choice at scale
I started by evaluating some of the major commercial alternatives: Google Translate, Amazon Translate, Microsoft Translator, and others. These services offer decent translation capabilities and easy-to-use APIs, but the quality simply didn’t match DeepL, especially for languages other than English, such as French, German, or Italian. The translations often lacked nuance, handled HTML formatting poorly, or failed to interpret user-generated content correctly. And even though they were slightly cheaper than DeepL, they were still costly at scale. In the end, none of them provided enough of an edge to justify a switch.
Open-source systems: promising, but underwhelming
Next, I explored open-source projects like LibreTranslate, Apertium, and OpenNMT. These are fully self-hostable and appealing from a privacy and cost perspective. However, they turned out to be too limited in practice. Most had outdated models, poor multilingual coverage, or unacceptable latency. Some couldn’t handle non-standard input like emojis, bullet points, or HTML tags, all of which are common on Ricardo, where users write their own listings. The performance gap with DeepL was simply too wide.
Shifting gears: benchmarking modern AI models
It became clear that we needed to look beyond traditional machine translation systems and focus on open-source AI models, which had made incredible progress thanks to projects like Hugging Face. The goal was to find a model that was both accurate and fast enough to run inference on a single small GPU.
I shortlisted models like Facebook’s NLLB-200, mBART, Google’s T5 variants, the Helsinki-NLP opus-mt family, and finally, a newer contender: GemmaX2-28-2B, developed by Xiaomi researchers and based on Google’s Gemma-2 architecture.
To compare them, I ran a series of translation benchmarks using sacreBLEU, a standardized implementation of the BLEU (Bilingual Evaluation Understudy) score that ensures reproducibility. I used multilingual datasets such as WMT14, Europarl, and Opus-100, testing across key language pairs like English-German, French-English, and German-French. While some of the classic models like Opus-MT performed decently, GemmaX2-28-2B consistently came out on top, in some cases even outperforming DeepL. Most importantly, it did this while remaining small enough (2B parameters) to run inference on an NVIDIA T4 GPU with reasonable latency.
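# Save as e.g. bleu_example.py (a file named sacrebleu.py would shadow the package)
from sacrebleu import corpus_bleu
# One hypothesis per source segment
hypotheses = [
    "This is a translated sentence.",
    "Here is another translation.",
]
# sacreBLEU expects a list of reference streams: each inner list holds
# one reference per hypothesis, in the same order as the hypotheses
references = [
    [
        "This is the reference sentence.",
        "Here is another reference.",
    ]
]
score = corpus_bleu(hypotheses, references)
print(f"sacreBLEU score: {score.score:.2f}")
❯ python bleu_example.py
sacreBLEU score: 26.07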
Run this code for each candidate model using parallel reference/test sets (e.g. from Hugging Face datasets like WMT14 or Europarl).

I started researching and benchmarking open-source AI models for translation in November 2024, at the start of my master’s thesis project. Over the following months, I compared multiple candidates and concluded the research phase in late March 2025, just after the release of GemmaX2-28-2B.
The AI landscape moves fast, so if you’re reading this later in 2025 or beyond, newer models may outperform the ones tested here, but the benchmarking approach, latency tuning, and infrastructure lessons will still apply.
Manual testing against DeepL
Since DeepL doesn’t publish BLEU scores and is essentially a black box, I also included manual testing of the same texts translated via DeepL and GemmaX2. I included HTML content, emojis, incorrect casing, long sentences, and typos to simulate real-life inputs. The results were surprisingly close. In many cases, the LLM-generated translations were indistinguishable from DeepL.
Performance matters: latency testing
Quality alone isn’t enough. We also needed speed. In initial tests, most open-source models took over 25 seconds to translate a 5,000-character text (the maximum size of an article description on Ricardo), which is clearly unacceptable for production. But with GemmaX2, once we optimized the inference pipeline using vLLM, sentence tokenization, and FP16 precision, we brought the average response time down to under 3 seconds. That put us within the latency budget we needed for a smooth user experience.
In the end, GemmaX2-28-2B stood out as the only model that offered both great translation quality and production-grade performance. It’s not the most well-known model out there, but it proved itself where it mattered: translating real, messy, multilingual content quickly and reliably. That made it our clear winner, and the foundation for everything that came next.
Optimizations and API design
Once the GemmaX2-28-2B model was selected, the real challenge began: making it fast and robust enough for a real-time user-facing service.
Out of the box, even the best models were too slow. Translating a single 5,000-character text could take over 25 seconds on a T4 GPU. That kind of latency would ruin the user experience. So the next phase of the project was all about optimizing for speed and stability.
Reducing latency: from 25s to under 3s
The first major win came from running the model in FP16 (half precision), which reduced memory usage and sped up inference significantly. This was a straightforward change, but surprisingly effective.
By default, most models run with 32-bit floating point (FP32) precision, which offers high numerical accuracy but also consumes a lot of memory and compute. FP16, or 16-bit floating point, cuts that in half, using less GPU memory and allowing more operations to fit in parallel. For inference tasks like translation, FP16 precision is typically more than sufficient, with little to no loss in quality. On GPUs like the NVIDIA T4, using FP16 allows faster throughput, lower latency, and the ability to process larger batches, all without retraining the model.
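To make the memory argument concrete, here is a back-of-the-envelope sketch. It assumes roughly 2 billion parameters (the "2B" in the model name taken at face value) and counts the weights only; activations and the KV cache come on top, so real GPU usage is higher than these figures.

```python
# Rough estimate of the memory needed for model weights alone.
# NUM_PARAMS is an assumption ("2B" taken at face value), not a measured value.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}
NUM_PARAMS = 2e9


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the size of the weights in GiB for the given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in ("fp32", "fp16"):
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB")
```

Halving the bytes per parameter halves the weight footprint, which leaves more of the T4’s 16 GB for batching concurrent requests.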

Next, I integrated the model with vLLM, an inference engine designed specifically for large language models. Unlike the default Hugging Face transformers pipelines, vLLM is optimized for high-throughput, low-latency serving. It supports batched execution, streaming responses, and asynchronous inference, which together allow us to fully utilize the GPU.
But the real breakthrough came from handling long texts more intelligently.
Instead of feeding entire blocks of HTML or full descriptions to the model, I built a preprocessing pipeline using BeautifulSoup and NLTK. The system first extracts all the text nodes from the HTML, then uses NLTK to split the content into individual sentences, based on the source language detected by the pycld2 library. Each sentence is translated independently and in parallel using Python’s asyncio.gather, and then the results are reassembled into the original HTML structure. This approach not only preserves formatting, but also significantly speeds up translation, since short sentences processed in parallel complete much faster than long, unstructured blobs.
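The core idea, split then translate concurrently then rejoin in order, can be sketched without any model at all. In the snippet below, fake_translate is a hypothetical stand-in for the model call, and a naive regex splitter replaces NLTK purely to keep the example dependency-free; neither is part of the production service.

```python
import asyncio
import re


async def fake_translate(sentence: str) -> str:
    """Hypothetical stand-in for the model call; sleeps to mimic inference latency."""
    await asyncio.sleep(0.05)
    return sentence.upper()


def split_sentences(text: str) -> list[str]:
    """Naive splitter on whitespace that follows ., ! or ? (NLTK is used in production)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


async def translate_concurrently(text: str) -> str:
    sentences = split_sentences(text)
    # gather() runs all coroutines concurrently and returns results in input order
    translated = await asyncio.gather(*(fake_translate(s) for s in sentences))
    return " ".join(translated)


result = asyncio.run(translate_concurrently("Very good condition. Barely used! Pickup in Zug?"))
print(result)
```

Because asyncio.gather returns results in the order the coroutines were passed in, the sentences can be rejoined without any extra bookkeeping, and the total wait is close to the slowest single sentence rather than the sum of all of them.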
On top of that, I set the model’s temperature to 0, which removes sampling randomness from generation and makes outputs deterministic and repeatable. Since we don’t need creativity in translations, just accuracy, this was a safe and effective tweak.
With all these optimizations in place, the average response time for long texts dropped from over 25 seconds to under 3 seconds, with 95% of requests completing in less than 2.7 seconds. That made the system viable for production use.
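For readers who want to reproduce this kind of figure from their own measurements, here is a minimal sketch of a nearest-rank percentile. The latency samples are made up for illustration and are not production data.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Illustrative latencies in seconds (invented for this example)
latencies = [1.2, 1.4, 1.5, 1.7, 1.8, 1.9, 2.0, 2.1, 2.3, 2.4,
             2.4, 2.5, 2.5, 2.6, 2.6, 2.6, 2.7, 2.7, 2.7, 4.1]

print(f"p50={percentile(latencies, 50)}s p95={percentile(latencies, 95)}s")
```

A p95 is more informative than an average here because a single slow outlier (the 4.1 s sample above) barely moves it, while it still reflects what almost all users experience.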
Wrapping it in a FastAPI microservice
With the model optimized, I needed a clean, scalable interface to expose it internally. For that, I built a lightweight microservice using FastAPI.
The main endpoint is a simple POST route, /translate, which accepts raw HTML content and a target language. The service automatically detects the source language using pycld2, applies the preprocessing pipeline, and returns the translated HTML. In case of unexpected errors, like an unsupported language, a malformed input, or a timeout from the model, the service responds with detailed error messages and appropriate HTTP status codes (400, 500, or 503). These errors are logged and exposed through Prometheus for monitoring.
Check out the code here
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import AsyncOpenAI
import re
import html
import asyncio
from bs4 import BeautifulSoup, NavigableString, Tag
from nltk.tokenize import sent_tokenize
import httpx
import requests
from prometheus_client import make_asgi_app
from app.instrumenting import prometheus_middleware
import nltk
import pycld2 as cld2
# Download NLTK models for sentence tokenization
nltk.download("punkt")
nltk.download("punkt_tab")
app = FastAPI()
VLLM_SERVER_URL = "http://localhost:8000/v1"
API_KEY = "token-123"
vllm_running_status = False
SUPPORTED_LANGUAGES = {
    "en": "english",
    "fr": "french",
    "de": "german",
    "it": "italian",
    "un": "unknown",
}

client = AsyncOpenAI(
    base_url=VLLM_SERVER_URL,
    api_key=API_KEY,
)

class TranslateRequest(BaseModel):
    text: str
    target_lang: str

app.middleware("http")(prometheus_middleware)
app.mount("/metrics", make_asgi_app())

@app.get("/health")
async def health_check():
    # Liveness: once vLLM has answered successfully, skip further upstream checks
    global vllm_running_status
    if vllm_running_status:
        return {"status": "ok"}
    try:
        async with httpx.AsyncClient(timeout=3.0) as http_client:
            resp = await http_client.get("http://localhost:8000/health")
    except httpx.RequestError:
        # httpx raises its own exception types, not requests.RequestException
        raise HTTPException(status_code=503, detail="vLLM server is down")
    if resp.status_code == 200:
        vllm_running_status = True
        return {"status": "ok"}
    raise HTTPException(status_code=503, detail="vLLM server is not healthy")

@app.get("/ready")
async def ready_check():
    # Readiness: always ask vLLM, so traffic is only routed to a ready instance
    try:
        async with httpx.AsyncClient(timeout=3.0) as http_client:
            resp = await http_client.get("http://localhost:8000/health")
    except httpx.RequestError:
        raise HTTPException(status_code=503, detail="vLLM server is down")
    if resp.status_code == 200:
        return {"status": "ok"}
    raise HTTPException(status_code=503, detail="vLLM server is not ready")

@app.post("/translate")
async def translate_text(data: TranslateRequest):
    # Clean a copy of the input; it is only used for language detection
    t = safe_utf8(data.text)
    t = strip_control_and_bidi(t)
    t = unescape_entities(t)
    # Detect source language
    try:
        _, _, details = cld2.detect(t)
        lang_code = details[0][1].lower()  # use ISO code (e.g., 'de')
    except cld2.error:
        raise HTTPException(status_code=400, detail="Could not detect source language")
    # Map to NLTK language name
    nltk_lang = SUPPORTED_LANGUAGES.get(lang_code)
    if nltk_lang is None:
        nltk_lang = "unknown"
    # Validate target language ("un" is a detection result, not a valid target)
    requested_lang = data.target_lang.lower()
    target_lang = SUPPORTED_LANGUAGES.get(requested_lang)
    if not target_lang or requested_lang == "un":
        # Report the value the caller sent; target_lang is None at this point
        raise HTTPException(status_code=400, detail=f"Unsupported target language: {data.target_lang}")
    # Perform HTML-preserving translation with sentence splitting
    translated = await translate_html(data.text, target_lang, nltk_lang)
    return {"translation": translated}

def safe_utf8(text: str) -> str:
    return text.encode("utf-8", errors="ignore").decode("utf-8", errors="ignore")

def strip_control_and_bidi(text: str) -> str:
    return re.sub(r'[\x00-\x1F\u202A-\u202E]', '', text)

def unescape_entities(text: str) -> str:
    return html.unescape(text)

# HTML-preserving translation
async def translate_text_chunk(text: str, source_lang: str, target_lang: str) -> str:
    # Use a different prompt if the source language is unknown
    if source_lang == "unknown":
        prompt = f"Translate the following text to {target_lang}:\n{text}\n{target_lang}:"
    else:
        prompt = f"Translate this from {source_lang} to {target_lang}:\n{source_lang}: {text}\n{target_lang}:"
    response = await client.chat.completions.create(
        model="/model",
        messages=[
            {"role": "user", "content": prompt},
        ],
        temperature=0.0,
        top_p=0.5,
        stream=True,
    )
    output = ""
    async for chunk in response:
        output += chunk.choices[0].delta.content or ""
    return output.strip().replace("\n", " ")

async def translate_html(html_text: str, target_lang: str, nltk_lang: str) -> str:
    # Use German as the default language for sentence tokenization if the language is unknown
    if nltk_lang == "unknown":
        nltk_lang = "german"
    # Parse the HTML input
    soup = BeautifulSoup(html_text, "html.parser")
    # Collect text nodes
    nodes = []

    def collect(node):
        if isinstance(node, NavigableString):
            if node.strip():
                nodes.append(node)
        elif isinstance(node, Tag):
            for c in node.contents:
                collect(c)

    collect(soup)

    # Create jobs: split each node into sentences based on source language
    async def job(node):
        text = str(node)
        sentences = sent_tokenize(text, language=nltk_lang)
        # Translate sentences in parallel
        tasks = [asyncio.create_task(translate_text_chunk(s, nltk_lang, target_lang)) for s in sentences]
        trans = await asyncio.gather(*tasks)
        return node, " ".join(trans)

    # Run all jobs concurrently
    results = await asyncio.gather(*(job(n) for n in nodes))
    # Patch the tree in place
    for node, txt in results:
        node.replace_with(NavigableString(txt))
    # Serialize back to HTML
    return str(soup)
At this point, the service was fast, reliable, and API-ready. It could translate large, messy texts in real-time. It could handle HTML, emojis, and non-standard unicode.

Productionizing the translation-service
Having a fast and functional translation API is one thing. But making it reliable under real-world conditions, deployed, monitored, fault-tolerant, and scalable, is a whole different challenge.
Here’s how I took the service from “it works on my machine” to full production readiness.
Docker, Kubernetes, and GPU scaling
The first step was containerization. I built a minimal Docker image that bundles the FastAPI app, the vLLM server, and the GemmaX2 model (preloaded into /model) to avoid cold-start delays.
The service is deployed using Kubernetes on Google Kubernetes Engine (GKE), with each pod running on a dedicated GPU-enabled node. The model is loaded at container startup so the GPU is ready to serve requests as soon as the pod becomes active.
FROM nvidia/cuda:12.8.0-base-ubuntu20.04
# Set environment variables to avoid interactive prompts
ENV DEBIAN_FRONTEND=noninteractive
ENV TORCHDYNAMO_DISABLE=1
# Install Python 3.9 and build tools
RUN apt update && apt install -y \
    python3.9 python3.9-distutils python3.9-venv \
    curl git build-essential wget cmake ninja-build \
    && rm -rf /var/lib/apt/lists/*
# Set python3.9 as default
RUN ln -sf /usr/bin/python3.9 /usr/bin/python && \
    curl -sS https://bootstrap.pypa.io/get-pip.py | python
# Install Python deps
COPY requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
# Copy app code
COPY app /app
# Download the model (or mount it at runtime instead)
RUN python -c "from huggingface_hub import snapshot_download; snapshot_download('ModelSpace/GemmaX2-28-2B-v0.1', local_dir='/model')"
WORKDIR /
EXPOSE 8080 8000
# Start the vLLM server in the background, then the FastAPI app in the foreground
CMD ["bash", "-c", "vllm serve /model --dtype half --api-key token-123 --disable-log-requests --max-model-len 8192 --max-num-seqs 64 --max-num-batched-tokens 8192 & uvicorn app.main:app --host 0.0.0.0 --port 8080 --workers 1"]
This Dockerfile bundles both the FastAPI server and the vLLM inference engine into a single container, optimized for inference on a T4 GPU.
For deployment, I used Google Kubernetes Engine (GKE) with a dedicated GPU node pool running NVIDIA T4 GPUs. Each node hosts a single translations-service pod to ensure GPU isolation, and Kubernetes probes (/health, /ready) were added to guarantee that each instance is healthy before receiving traffic.
The NVIDIA T4 is particularly well-suited for inference workloads like this. It offers a cost-effective balance between performance and energy efficiency:
- It supports FP16 and INT8 precision, both ideal for high-throughput inference tasks.
- It has a thermal design power (TDP) of just 70W, making it far more energy-efficient than larger datacenter GPUs like A100 or H100.
- It’s widely available on cloud platforms (including GCP) at a low hourly rate, which helped us stay well under budget.
Here’s the full isopod.yml file used to deploy the service to GKE with NVIDIA T4 GPU support, Kubernetes probes, and per-environment config.
apiVersion: isopod/v1
default:
  sreCompliant: false
  timeoutMinutes: 25
  gpu:
    enabled: true
  appPort: 8080
  env:
    HTTP_TIMEOUT_S: 5
    HTTP_GRACEFUL_SHUTDOWN_WAIT_S: 30
    LOGGING_LEVEL: WARN
  probes:
    startupProbe:
      initialDelaySeconds: 500
      periodSeconds: 10
      failureThreshold: 30
      successThreshold: 1
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      periodSeconds: 10
      failureThreshold: 30
      timeoutSeconds: 30
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 10
      failureThreshold: 30
      timeoutSeconds: 30
dev:
  replicaCount: 1
  env:
    LOGGING_LEVEL: INFO
  istioProxyCPURequest: 50m
  resources:
    limits:
      gpu: 1
      memory: 10Gi
    requests:
      memory: 10Gi
metadata:
  labels:
    app: translations-service
    language: python
    tiers: backend
  language: python
  maintainer: purchasing
  name: translations-service
  role: api
prod:
  replicaCount: 2
  env:
    LOGGING_LEVEL: INFO
  istioProxyCPURequest: 25m
  resources:
    limits:
      gpu: 1
      memory: 10Gi
    requests:
      memory: 10Gi
With this configuration, I was able to deploy a highly performant LLM-backed service at a fraction of the cost and energy usage compared to more powerful GPUs, while still meeting latency and scalability targets for production use.
Here’s the GitHub Actions workflow that runs the tests, builds the image, and deploys the translations-service to GKE.
name: Release API
run-name: Test, Build and Deploy from ${{ github.ref_name }} by @${{ github.actor }}
on:
  push:
    paths:
      - '**'
      - '!**/README.md'
permissions:
  id-token: write
  contents: read
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
jobs:
  quality-gate:
    runs-on: blacksmith-2vcpu-ubuntu-2204
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: 3.9
      - uses: useblacksmith/setup-python@v6
        with:
          python-version: 3.9
          cache: 'pip'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run tests
        run: pytest
  docker:
    needs: quality-gate
    runs-on: blacksmith-2vcpu-ubuntu-2204
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: GCP Auth
        uses: ricardo-ch/auth-action@v1
      - name: Build and push Docker image
        uses: ricardo-ch/isopod-action@v1
        with:
          command: build -f ./isopod.yml
  deploy-dev-branch:
    if: ${{ github.ref != 'refs/heads/main' }}
    needs: docker
    runs-on: blacksmith-2vcpu-ubuntu-2204
    environment: development
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: GCP Auth
        uses: ricardo-ch/auth-action@v1
        with:
          gke_cluster: dev-banana
      - name: Deploy to dev
        uses: ricardo-ch/isopod-action@v1
        id: deploy
        with:
          command: deploy --environment dev -f ./isopod.yml
  deploy-dev-main:
    if: ${{ github.ref == 'refs/heads/main' }}
    needs: docker
    runs-on: blacksmith-2vcpu-ubuntu-2204
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: GCP Auth
        uses: ricardo-ch/auth-action@v1
        with:
          gke_cluster: dev-banana
      - name: Deploy to dev
        uses: ricardo-ch/isopod-action@v1
        id: deploy
        with:
          command: deploy --environment dev -f ./isopod.yml
  deploy-prod-main:
    if: ${{ github.ref == 'refs/heads/main' }}
    needs: deploy-dev-main
    runs-on: blacksmith-2vcpu-ubuntu-2204
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: GCP Auth
        uses: ricardo-ch/auth-action@v1
        with:
          gke_cluster: prod-tomato
      - name: Deploy to prod
        uses: ricardo-ch/isopod-action@v1
        id: deploy
        with:
          command: deploy --environment prod -f ./isopod.yml
At this stage, we didn’t use pod-level autoscaling because traditional metrics like CPU or memory weren’t useful, and a single request can fully saturate the GPU. Instead, we relied on node-level autoscaling, with plans to later integrate a custom metric (like vLLM queue length) for smarter horizontal scaling.
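As a rough illustration of what queue-based scaling could look like, the sketch below applies the proportional rule used by the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × metric / target)) to an average queue length per pod. The function name, the target value, and the replica bounds are assumptions for the example, not settings from our cluster.

```python
import math


def desired_replicas(current_replicas: int, queue_per_pod: float,
                     target_queue_per_pod: float,
                     min_replicas: int = 1, max_replicas: int = 4) -> int:
    """Proportional scaling rule, clamped to [min_replicas, max_replicas]."""
    if target_queue_per_pod <= 0:
        raise ValueError("target_queue_per_pod must be positive")
    raw = math.ceil(current_replicas * queue_per_pod / target_queue_per_pod)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical readings: 2 pods, target of 4 queued requests per pod
print(desired_replicas(2, queue_per_pod=6, target_queue_per_pod=4))   # scale up
print(desired_replicas(2, queue_per_pod=1, target_queue_per_pod=4))   # scale down
```

Unlike CPU or memory, a queue-length signal rises as soon as requests start waiting for the GPU, which is the situation that actually hurts latency.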
Caching for speed and cost savings
To reduce inference load and improve response times, the translations-api checks a Redis cache before calling the model (translations-service). If a translation for the same source text and target language exists, it’s returned instantly. Otherwise, the request goes to the model, and the result is cached with a TTL (Time To Live) of 90 days.
This caching layer proved essential, not just for speed, but for keeping cloud GPU costs under control. With this in place, only new or changed texts trigger inference, while most traffic is served in milliseconds.
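The cache-aside pattern described above can be sketched as follows. To keep the example runnable without a Redis server, a small in-memory TTLCache class stands in for Redis; that class, the key format, and the function names are all assumptions made for illustration.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple

TTL_SECONDS = 90 * 24 * 3600  # 90 days, as described in the article


class TTLCache:
    """In-memory stand-in for Redis (assumption) with per-entry expiry."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._store: Dict[str, Tuple[float, str]] = {}
        self._clock = clock

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if expires_at <= self._clock():
            del self._store[key]  # expired: behave like a miss
            return None
        return value

    def set(self, key: str, value: str, ttl: int) -> None:
        self._store[key] = (self._clock() + ttl, value)


def cache_key(text: str, target_lang: str) -> str:
    """Hash the source text so keys stay short even for 5,000-character inputs."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"translation:{target_lang}:{digest}"


def translate_cached(text: str, target_lang: str, cache: TTLCache,
                     translate: Callable[[str, str], str]) -> str:
    key = cache_key(text, target_lang)
    hit = cache.get(key)
    if hit is not None:
        return hit  # served from cache, no GPU work
    result = translate(text, target_lang)
    cache.set(key, result, TTL_SECONDS)
    return result
```

Keying on a hash of the exact source text means any edit to a listing produces a new key, so stale translations are never served for changed content.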
Monitoring and observability
No production service is complete without observability. I instrumented the FastAPI app with Prometheus, exposing custom metrics through the /metrics endpoint. These include:
- Request duration histograms, broken down by endpoint and status code
- Client disconnects (e.g. user cancels the request mid-stream)
from prometheus_client import Histogram, Counter, REGISTRY
from fastapi import Request
from fastapi.exceptions import HTTPException
import time

# Prometheus metrics
REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['component', 'handler', 'method', 'code', 'endpoint_severity', 'uri'],
    buckets=[.01, .1, .25, .5, .75, 1, 5, 10]
)
CONNECTION_LOST_TOTAL = Counter(
    'http_server_requests_connection_lost_total',
    'Number of client connections that have been closed before the server was able to respond',
    ['source_app', 'handler', 'method']
)

# Register Prometheus metrics if not already registered
if 'http_request_duration_seconds' not in REGISTRY._names_to_collectors:
    REGISTRY.register(REQUEST_DURATION)
if 'http_server_requests_connection_lost_total' not in REGISTRY._names_to_collectors:
    REGISTRY.register(CONNECTION_LOST_TOTAL)

async def prometheus_middleware(request: Request, call_next):
    if request.method == "POST" and request.url.path == "/translate":
        start_time = time.time()
        try:
            response = await call_next(request)
            process_time = time.time() - start_time
            REQUEST_DURATION.labels(
                component="translations-service",
                handler=request.url.path,
                method=request.method,
                code=response.status_code,
                endpoint_severity="low",
                uri=request.url.path
            ).observe(process_time)
            return response
        except Exception as e:
            if isinstance(e, HTTPException) and e.status_code == 503:
                CONNECTION_LOST_TOTAL.labels(
                    source_app=request.client.host,
                    handler=request.url.path,
                    method=request.method
                ).inc()
            raise
    else:
        return await call_next(request)
All of these metrics are visualized in Grafana dashboards, which allow the team to monitor the system’s health in real time. For example, we track:
- Success vs error rates
- Latency distributions (p50, p95, p99)
- Request throughput
- Pod restarts and uptime
We also set up Slack alerts via Alertmanager for anomalies, such as spikes in 5xx or 4xx errors. This made the system production-grade not just in performance, but in reliability.
Load testing with Grafana k6
Before releasing the new service to users, I ran stress tests using Grafana k6. The test scenario simulated 20 concurrent users, each sending a 5,000-character text every second for 40 seconds, a load far higher than our real-world peak (around 3 requests/s).
import http from 'k6/http';
import { sleep, check } from 'k6';

export let options = {
  stages: [
    { duration: '10s', target: 20 }, // ramp up to 20 users
    { duration: '30s', target: 20 }, // stay at 20 users
    { duration: '1s', target: 0 }, // ramp down to 0 users
  ],
};

const longDescription = "Translate this 5000-char text...";

export default function () {
  const url = 'http://localhost:8080/translate';
  const payload = JSON.stringify({
    "text": longDescription,
    "target_lang": "fr"
  });
  const params = {
    headers: {
      'Content-Type': 'application/json',
    },
  };
  let res = http.post(url, payload, params);
  check(res, {
    'is status 200': (r) => r.status === 200,
  });
  sleep(1);
}
The system handled it well: no pod crashes, no GPU saturation, and all requests completed within an acceptable latency window (about 30s). These results gave us confidence to move forward with an A/B test in production.
Validating the system with A/B testing
We had optimized the model, built a fast API, deployed it to production (with no traffic redirected to it yet), and set up monitoring. But none of that would matter if our users noticed a drop in translation experience. So before replacing DeepL entirely, we ran a large-scale A/B test to validate the solution in the real world.
The hypothesis
Our goal was simple: prove that users couldn’t tell the difference between DeepL API and our homemade translation system.
We weren’t just comparing BLEU scores or response times anymore; we were comparing user behavior. Would users continue using the “Translate” feature on Ricardo? Would it still help them complete purchases or place bids? And would the quality be good enough to support the same outcomes?
If we saw no drop in usage or conversion, we’d consider the system validated.
The setup
The A/B test was conducted on the Web platform of Ricardo, using LaunchDarkly to split users into two buckets:
- 50% of logged-in users received translations from the DeepL API
- 50% received translations from the new GemmaX2-28-2B based service
The split was transparent: no visual difference, no banners, and no indication of which system was used. We only analyzed sessions where the user actively clicked the “Translate” button on the product detail page. Those were our test population.
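In production, LaunchDarkly handled the bucketing for us. For readers without a feature-flag service, a stable 50/50 split can be approximated by hashing the user ID, as in the sketch below. This is not LaunchDarkly’s algorithm; the function name, experiment key, and variant labels are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str = "translation-backend") -> str:
    """Deterministically map a user to one of two variants (roughly 50/50)."""
    # Salting with the experiment name keeps buckets independent across experiments
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # integer in 0..99
    return "deepl" if bucket < 50 else "gemmax2"


# The same user always lands in the same bucket
print(assign_variant("user-42"), assign_variant("user-42"))
```

Deterministic assignment matters for this kind of test: a user who sees DeepL translations on one visit and the new service on the next would blur the comparison.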
We tracked user behavior using Google Analytics and Tableau, focusing on session counts and conversion metrics.
The results
After several weeks in June 2025, the results of the A/B test were remarkably close: the difference in conversion rate between DeepL and our translation system was just 0.22 percentage points, which was not statistically significant.
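One common way to check whether a gap between two conversion rates is significant is a two-proportion z-test, sketched below. The article’s analysis was done in Google Analytics and Tableau, so this is an illustration of the idea rather than the exact method we used, and the session and conversion counts are invented to produce a 0.22-point gap.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all: nothing to distinguish
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts (NOT our real data): 12.00% vs 11.78% conversion
z, p = two_proportion_z_test(conv_a=600, n_a=5000, conv_b=589, n_b=5000)
print(f"z={z:.2f}, p={p:.2f}, significant at 5%: {p < 0.05}")
```

With samples of this size, a 0.22-point difference is well within what random variation alone would produce, which is what "not statistically significant" means in practice.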
This confirmed that users behaved the same regardless of which system provided the translation, and that our in-house model delivered a comparable user experience without any noticeable regression.
Even better, no negative feedback was reported to the Customer Care team, and no bugs or regressions were observed on our monitoring dashboards.
The decision
With the A/B test complete and the data in hand, we met with our Product Manager, Engineering Manager, and Data team. The consensus was clear: the system was ready.
We rolled out the internal translation service to 100% of traffic on Web, and shortly after, to the iOS and Android apps as well. DeepL was officially removed from our stack, and from our monthly expenses.
Results, limitations, and future improvements
The rollout of the new translation system wasn’t just a technical win, it delivered tangible business value. What started as a cost-reduction initiative turned into a broader transformation of how we handle multilingual content at SMG.
Real-world impact
The numbers speak for themselves. By replacing DeepL with a self-hosted LLM, we brought our monthly translation infrastructure costs down from over 7,000 CHF to just under 1,000 CHF, a savings of nearly 87%. These costs include GPU compute on Google Cloud (3 T4 instances) and Redis cache, which remained unchanged.

In July 2025, our internal translation system processed 2,024,574 translation requests with about 8,000 failures, an error rate of roughly 0.4%. By contrast, in April 2025, the DeepL API handled 1,766,161 translation requests with a significantly higher error rate of 2.83% (about 50,000 failures). While DeepL provides excellent translation quality, these numbers show that our in-house system is not only more cost-efficient, but also more robust and reliable under our real-world production load.
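The percentages above follow directly from the raw figures. The short check below uses the rounded numbers quoted in this article (7,000 and 1,000 CHF as bounds, and the approximate failure counts), so the results are approximations of the exact values.

```python
def savings_pct(before: float, after: float) -> float:
    """Relative cost reduction in percent."""
    return (before - after) / before * 100


def error_rate_pct(failures: int, requests: int) -> float:
    """Share of failed requests in percent."""
    return failures / requests * 100


print(f"cost savings:        {savings_pct(7000, 1000):.1f}%")
print(f"in-house error rate: {error_rate_pct(8_000, 2_024_574):.2f}%")
print(f"DeepL error rate:    {error_rate_pct(50_000, 1_766_161):.2f}%")
```

Since the real costs were "over 7,000" before and "just under 1,000" after, the true saving is slightly higher than the figure computed from these bounds.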
Even more important: user experience didn’t suffer. The system maintained a sub-3-second latency for 95% of requests, scored similarly on average BLEU evaluations, and showed no statistically significant drop in user conversions during the A/B test. It also translated text reliably, even when faced with emojis, HTML tags, or typos, all common traits of user-generated content.
A note on sustainability
Sustainability was also a consideration in the design of this system.
By choosing NVIDIA T4 GPUs, we relied on hardware with a 70W TDP, significantly lower than datacenter-grade alternatives like the H100 (350W for the PCIe variant). T4s offer enough power for optimized LLM inference while remaining energy-efficient, a solid tradeoff between throughput and environmental impact.
The service was deployed on Google Cloud’s europe-west1 region, which is certified as a low-carbon location. Combined with efficient GPU usage, caching to avoid redundant inference, and no overprovisioning, the system remains relatively light from a resource consumption perspective.
While it’s not possible to fully quantify the emissions saved compared to a traditional cloud API model like DeepL, every watt counts, and this setup was built with that mindset.
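A rough upper bound on the GPU side of the energy budget can still be computed from the TDP. The sketch below assumes three T4s running around the clock at full TDP for a 30-day month, which overstates real consumption, and it ignores CPUs, memory, networking, and cooling.

```python
def monthly_gpu_energy_kwh(num_gpus: int, tdp_watts: float,
                           hours: float = 24 * 30) -> float:
    """Upper-bound GPU energy per month in kWh, assuming constant draw at TDP."""
    return num_gpus * tdp_watts * hours / 1000


# 3 T4 GPUs at 70 W each (figures from this article), 30-day month (assumption)
print(f"{monthly_gpu_energy_kwh(3, 70):.1f} kWh/month")
```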
Known limitations
Of course, the system isn’t perfect, and that’s okay. It’s built to evolve.
One current limitation is autoscaling. While the node pools in GKE are configured to scale based on demand (manually for now), the individual pods don’t yet autoscale intelligently. GPU usage isn’t an ideal metric for scale decisions, and we’re working on exposing better ones, like the number of queued requests in the vLLM server, to drive more responsive scaling.
Another limitation is language coverage. The model currently supports 28 languages, but we’ve only activated four (French, German, Italian and English), the ones used by Ricardo. Expanding to other platforms at SMG, like AutoScout24 or Homegate, will require validating support for additional languages and adapting caching logic per product.
We also haven’t yet enabled a fallback mechanism to the free tier of DeepL. The logic is ready, but since the free plan only allows 500,000 characters per month, we’re holding off until we define a clear emergency strategy. In the meantime, the service is monitored closely to ensure high availability.
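One piece such an emergency strategy would need is a guard that stops the fallback before the monthly character quota runs out. Here is a minimal sketch of that idea; the class name is hypothetical, and counting characters with len() is an approximation, since the provider’s own accounting may differ.

```python
from datetime import date
from typing import Optional, Tuple


class MonthlyCharBudget:
    """Tracks characters sent to a fallback provider and resets every calendar month."""

    def __init__(self, limit: int = 500_000) -> None:
        self.limit = limit
        self.used = 0
        self._month: Optional[Tuple[int, int]] = None

    def try_consume(self, text: str, today: date) -> bool:
        """Reserve budget for text; return False if the request would exceed the limit."""
        current = (today.year, today.month)
        if current != self._month:
            # New month: start counting from zero again
            self._month, self.used = current, 0
        if self.used + len(text) > self.limit:
            return False
        self.used += len(text)
        return True


budget = MonthlyCharBudget()
allowed = budget.try_consume("x" * 5000, date(2025, 7, 1))
print(allowed, budget.used)
```

At 5,000 characters per maximum-size description, 500,000 characters cover only about 100 such texts per month, which is why the fallback can only ever be a short-lived emergency measure.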
Another important note: while the system matches DeepL in most cases, it’s still not as robust across all edge cases. Some complex or noisy texts may take longer to translate or occasionally hit timeouts. In rare cases, the model can hallucinate content or make inconsistent formatting decisions, a known limitation of LLMs that don’t rely on strict alignment training like specialized NMT models. It’s difficult to match a company like DeepL, which has over 1,000 employees, with a homemade project built during a single master’s thesis. But the goal was never to replicate every feature. It was to build a practical, cost-efficient, and customizable alternative, and within that scope, the system delivers.
Lastly, there’s no user-facing feedback loop in place. While conversion metrics help us track success at a high level, we’re considering adding a small 👍 / 👎 component below translations to collect qualitative feedback. That would help us detect subtle regressions and improve over time.
What’s next?
The success of this project opened doors beyond Ricardo. Other teams at SMG are already exploring how they can reuse the same service. Because the system is modular, containerized, and language-agnostic, it can be reused with minimal changes.
Looking forward, we’re also exploring:
- Nighttime autoscaling or shutdown for dev environments to reduce idle GPU costs
- Committed-use contracts with GCP to lower GPU pricing by 30 to 50%
- Continuous benchmarking against newer open-source models as they emerge
The foundation is solid. Now it’s about refining, expanding, and continuing to prove that high-performance machine translation doesn’t have to come from a commercial API.
Conclusion
This project began as a practical challenge: reduce the cost of automatic translations without compromising on quality and response time. But it ended up being much more than that.
By building and deploying a production-grade translation system based on an open-source LLM, we proved that it’s possible to match the performance of top-tier commercial APIs like DeepL. And we did it with full control over latency, reliability, privacy, and infrastructure cost.
The final system translates multilingual, user-generated HTML content in under three seconds, handles real production traffic, supports mobile and web clients, and saves thousands of CHF every month. All with no noticeable difference for our users.
More importantly, this approach is replicable.
With the right model, tooling (vLLM, FastAPI, Prometheus, etc.), and cloud infrastructure, anyone can build their own internal translation service, whether for a marketplace, any other commercial project, or even a personal project.
The era of treating LLMs as experimental or “nice to have” is behind us. With the right optimizations and architecture, they’re now production-ready building blocks for real, mission-critical services.
I hope this article gave you a concrete roadmap to follow if you’re considering replacing a commercial API with something open, customizable, and cost-efficient. And if you’ve already started down that path: good luck. You’ll learn a lot, and you might even surprise yourself with what you can build.

Author
Mateo Fernandez Martinez
Backend Software Engineer
General Marketplaces

