Real systems.
Measured results.

Every engagement starts with a broken system and ends with numbers. Here is what that looks like.

Enterprise observability platform on Prometheus/Mimir

£50K/month in cloud waste. Nobody tracking why.

  • £50K/mo in cloud costs removed
  • 4x faster queries
  • 60% less storage

Before

High-cardinality metrics, inefficient queries, over-provisioned infrastructure. Costs escalating. Performance degrading under load.

After

£50K/month cut from cloud costs. Queries 4x faster. Storage down 60%. Reliable observability under load.

What I changed

  • Identified high-cardinality sources across Prometheus (see the sketch after this list)
  • Redesigned metric structure to reduce unnecessary label dimensions
  • Optimised query patterns and retention policies
  • Removed over-provisioned infrastructure and unused workloads
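
As a rough illustration, the cardinality hunt can start with Prometheus's TSDB stats endpoint, which reports the metrics and label=value pairs holding the most series. A minimal sketch; the Prometheus URL is a placeholder:

    import requests

    # Prometheus base URL is a placeholder.
    PROM_URL = "http://prometheus:9090"

    def top_cardinality(prom_url: str = PROM_URL) -> None:
        """Print the metrics and label=value pairs holding the most series."""
        stats = requests.get(f"{prom_url}/api/v1/status/tsdb", timeout=10).json()["data"]

        print("Top metrics by series count:")
        for entry in stats["seriesCountByMetricName"]:
            print(f"  {entry['name']}: {entry['value']} series")

        print("Top label=value pairs by series count:")
        for entry in stats["seriesCountByLabelValuePair"]:
            print(f"  {entry['name']}: {entry['value']} series")

    if __name__ == "__main__":
        top_cardinality()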

Why it mattered

Observability shifted from a cost problem to a reliable foundation. Financially sustainable at scale.

Distributed platform, retail media

Failing at 20k req/s. Stable at 80k.

  • ~80k req/s stable
  • 4x throughput increase
  • Zero traffic-spike fires

Before

Latency spikes under traffic. Deployment instability. No autoscaling. Every traffic spike meant firefighting.

After

Stable throughput at ~80k req/s. Lower latency. Safer deployments. No more firefighting.

What I changed

  • Redesigned architecture using event-driven patterns
  • Introduced Kafka to decouple data flows between services (sketched below)
  • Implemented autoscaling using KEDA
  • Improved observability with Prometheus and tracing
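
A minimal sketch of the decoupling pattern: services publish events to Kafka instead of calling each other directly, so producers and consumers scale and fail independently. Broker address, topic, and field names here are illustrative:

    import json

    from confluent_kafka import Producer

    # Broker address and topic name are placeholders.
    producer = Producer({"bootstrap.servers": "kafka:9092"})

    def publish_event(event: dict) -> None:
        """Publish an event rather than calling the downstream service directly."""
        producer.produce(
            "ad-events",
            key=str(event.get("campaign_id", "")),
            value=json.dumps(event).encode("utf-8"),
        )
        producer.poll(0)  # serve delivery callbacks without blocking

    if __name__ == "__main__":
        publish_event({"campaign_id": 42, "action": "impression"})
        producer.flush()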

Why it mattered

The platform scales confidently with demand. Traffic spikes are handled, not fought.

Support operations, LLM pipeline

59% manual accuracy. 96% fully automated.

  • 96% accuracy
  • 59% → 96% improvement
  • Full automation

Before

Tickets classified manually. Inconsistent results. No trend visibility. Growing backlog with no path to automation.

After

Fully automated LLM-based classification. 96% accuracy. Real-time trend analysis enabled.

What I changed

  • Designed and built an LLM-based classification pipeline
  • Integrated the full flow: Slack → backend → LLM → database
  • Implemented concurrency controls and crash recovery
  • Introduced prompt versioning using SHA256
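
The prompt-versioning idea in miniature: hash the exact prompt text and store the digest with every result, so any classification can be traced to the prompt that produced it. The template and the stubbed LLM call below are placeholders, not the production pipeline:

    import hashlib

    # Template and stub are placeholders for the real pipeline pieces.
    PROMPT_TEMPLATE = """You are a support-ticket classifier.
    Classify the ticket below into exactly one category.

    Ticket:
    {ticket_text}
    """

    def call_llm(prompt: str) -> str:
        """Stand-in for the real LLM call."""
        return "billing"

    def prompt_version(template: str) -> str:
        """Deterministic version tag: SHA256 of the exact prompt text."""
        return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]

    def classify(ticket_text: str) -> dict:
        category = call_llm(PROMPT_TEMPLATE.format(ticket_text=ticket_text))
        # The digest travels with every result, so any classification can be
        # traced back to the prompt that produced it.
        return {"category": category, "prompt_version": prompt_version(PROMPT_TEMPLATE)}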

Why it mattered

Support became scalable, consistent, and insight-driven. AI applied in a way that actually works in production.

Analytics platform, data engineering

1-hour pipelines. Cut to 20 minutes.

  • 3x faster
  • 20 min, down from 1 hour
  • Real-time monitoring

Before

Pipelines running ~1 hour with inefficient queries and poor indexing. Teams waiting hours for data that should take minutes.

After

20-minute pipeline runtime. Reliable processing. Faster access to insights.

What I changed

  • Redesigned ETL pipelines using event-driven architecture
  • Optimised queries and indexing strategies
  • Introduced monitoring and rollback mechanisms
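
One way the rollback piece can look: each load step runs inside a single transaction, so a failed batch rolls back instead of half-loading, and runtime is recorded for monitoring. This sketch assumes Postgres via psycopg2; connection string and table are placeholders:

    import time

    import psycopg2

    # Connection string and table are placeholders; assumes Postgres.
    def load_batch(rows: list[tuple]) -> None:
        """One pipeline step: load a batch inside a single transaction so a
        failure rolls the whole batch back instead of half-loading it."""
        started = time.monotonic()
        conn = psycopg2.connect("dbname=analytics")
        try:
            with conn:  # commits on success, rolls back on any exception
                with conn.cursor() as cur:
                    cur.executemany(
                        "INSERT INTO events_clean (event_id, payload) VALUES (%s, %s)",
                        rows,
                    )
            print(f"loaded {len(rows)} rows in {time.monotonic() - started:.1f}s")
        finally:
            conn.close()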

Why it mattered

Teams got faster access to insights, more reliable data, and less time fighting pipelines.

Enterprise LGTM stack (Mimir, Loki, Tempo)

28 issues found before they caused outages.

  • 28 issues found
  • Zero resulting outages
  • Clear remediation plan

Before

Single-replica components in critical paths. Misconfigured scaling. Hidden failure points. One bad deploy from a full outage.

After

All 28 issues remediated. Resilient scaling. Production-ready observability.

What I changed

  • Full audit across observability stack
  • Identified 28 issues across severity levels
  • Fixed replication and scaling gaps
  • Improved autoscaling using KEDA

Why it mattered

Platform became reliable under real conditions. Resilient to failures.

Compliance automation, AWS

Manual compliance. Now fully automated.

  • Real-time visibility
  • Zero duplicates
  • Full automation

Before

Findings tracked manually. Duplicates everywhere. No central visibility. Scaling compliance was impossible.

After

Automated pipeline. Zero duplicates. Real-time visibility. Scalable compliance.

What I changed

  • Built pipeline using AWS CodePipeline and CodeBuild
  • Implemented deterministic finding IDs using SHA256
  • Integrated with AWS Security Hub
  • Handled API limits using batching
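
The two mechanics that keep this duplicate-free and inside API limits, sketched: deterministic SHA256 finding IDs and batched BatchImportFindings calls (Security Hub accepts up to 100 findings per call). Field choices here are illustrative; real findings are full ASFF documents:

    import hashlib

    import boto3

    securityhub = boto3.client("securityhub")

    def finding_id(account_id: str, resource_arn: str, check_id: str) -> str:
        """Deterministic ID: the same (account, resource, check) always hashes to
        the same value, so re-running the pipeline cannot create duplicates."""
        raw = f"{account_id}|{resource_arn}|{check_id}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def import_findings(findings: list[dict]) -> None:
        """Send findings in batches of 100, the BatchImportFindings limit."""
        for i in range(0, len(findings), 100):
            batch = findings[i : i + 100]
            resp = securityhub.batch_import_findings(Findings=batch)
            if resp["FailedCount"]:
                raise RuntimeError(f"{resp['FailedCount']} findings failed to import")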

Why it mattered

Compliance became automated, reliable, and scalable. A controlled system instead of reactive tracking.

You already know something is wrong.
Let's find exactly what.

Bring your architecture diagram, cloud bill, or last incident summary. I will tell you what is actually breaking.

30 minutes. No pitch. Ranked risks and a clear next step.