Real systems.
Measured results.

Every number here comes from a real engagement. No vanity metrics. No vague improvements. Just what changed and by how much.

Distributed platform, event-driven microservices

Platform failing at 20k req/s. Now handles 80k.

80k req/s
stable throughput
4x
capacity increase
<200ms
p99 latency at peak

Before

Latency spikes under load, no autoscaling, deployment instability, constant firefighting during traffic peaks.

After

Event-driven architecture, dynamic autoscaling, stable throughput at 4x the original ceiling.

What I changed

  • Redesigned architecture using event-driven patterns with Kafka
  • Implemented autoscaling using KEDA for dynamic resource allocation
  • Improved observability with Prometheus and distributed tracing
  • Optimised resource allocation across all services
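
A simplified model of the lag-based scaling decision behind the KEDA bullet above. This is a sketch, not KEDA's code and not the configuration from this engagement: the function name, the threshold of 100 messages per replica, and the replica bounds are all assumptions chosen for illustration.

```python
import math


def desired_replicas(total_lag: int, lag_threshold: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Illustrative only: one replica per `lag_threshold` messages of
    consumer lag, clamped to the allowed replica range."""
    if lag_threshold <= 0:
        raise ValueError("lag_threshold must be positive")
    wanted = math.ceil(total_lag / lag_threshold)
    return max(min_replicas, min(wanted, max_replicas))


# Hypothetical numbers: 950 messages of lag, 100 per replica, bounds 1..20.
replicas = desired_replicas(950, 100, 1, 20)
```

The point of scaling on lag rather than CPU is that the consumer count follows the backlog directly, so capacity grows before latency does.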

Why this mattered

Platform scales with demand. No more firefighting during traffic spikes.

Support operations, LLM-powered automation

Manual classification at 59% accuracy. Automated to 96%.

96%
classification accuracy
59% → 96%
accuracy improvement
100%
automated end-to-end

Before

Tickets classified manually with inconsistent results. No trend visibility. Growing backlog.

After

LLM-based classification pipeline with crash recovery, prompt versioning, and real-time processing.

What I changed

  • Designed and built an LLM-based classification pipeline
  • Integrated the end-to-end flow: Slack → backend → LLM → database
  • Implemented concurrency controls and crash recovery
  • Introduced prompt versioning using SHA256 hashing
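
One way prompt versioning with SHA256 can look, as a minimal sketch. The function name, the whitespace normalisation, and the 12-character truncation are assumptions for illustration, not details from this engagement.

```python
import hashlib


def prompt_version(template: str, length: int = 12) -> str:
    """Return a short, stable version tag derived from the prompt text.

    Line endings and surrounding whitespace are normalised first so that
    cosmetic edits do not produce a new version (an assumed design choice).
    """
    normalized = template.replace("\r\n", "\n").strip()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return digest[:length]


# Hypothetical prompt; the tag would be stored alongside each classification.
PROMPT = "Classify the support ticket into one of: billing, bug, access."
version_tag = prompt_version(PROMPT)
```

Storing the tag with every result makes it possible to tell which prompt produced which classification when accuracy shifts.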

Why this mattered

Support operations became scalable and insight-driven. AI actually worked in production.

Enterprise observability platform, Prometheus/Mimir stack

£50K/month in observability waste. No one tracking why.

£50K/mo
cloud cost removed
3x
faster queries
60%
less storage overhead

Before

High-cardinality metrics, over-provisioned infrastructure, costs rising every quarter with no attribution.

After

Right-sized retention, cleaned metric cardinality, aligned architecture with actual usage.

What I changed

  • Identified high-cardinality metric sources across Prometheus
  • Redesigned metric structure to reduce unnecessary label dimensions
  • Optimised query patterns and retention policies
  • Removed over-provisioned infrastructure and unused workloads
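
A rough sketch of the first step above, finding which labels drive cardinality. It assumes the series have already been exported as plain label dictionaries; the function name and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def label_cardinality(series: Iterable[Dict[str, str]]) -> List[Tuple[str, int]]:
    """Count distinct values per label name, highest cardinality first."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return sorted(((name, len(vals)) for name, vals in values.items()),
                  key=lambda pair: pair[1], reverse=True)


# Hypothetical sample: a per-user label inflates the series count.
sample = [
    {"__name__": "http_requests_total", "method": "GET", "user_id": "u1"},
    {"__name__": "http_requests_total", "method": "GET", "user_id": "u2"},
    {"__name__": "http_requests_total", "method": "GET", "user_id": "u3"},
]
ranking = label_cardinality(sample)
```

Labels at the top of the ranking with unbounded values (user IDs, request IDs) are the usual candidates for removal or aggregation.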

Why this mattered

Observability went from a cost problem to a reliable, financially sustainable foundation.

Analytics platform, ETL pipeline optimisation

Data pipeline taking 1 hour. Cut to 20 minutes.

20 min
pipeline runtime
3x
faster data refresh
70%
compute cost cut

Before

Pipelines running for ~1 hour with inefficient queries, poor indexing, delayed analytics.

After

Event-driven pipeline with optimised queries, monitoring, and rollback mechanisms.

What I changed

  • Redesigned ETL pipelines using event-driven architecture
  • Optimised queries and indexing strategies
  • Introduced monitoring and rollback mechanisms
  • Added alerting so pipelines were production-ready

Why this mattered

Teams got faster access to insights. Engineering stopped babysitting pipelines.

Enterprise LGTM stack (Mimir, Loki, Tempo)

28 hidden issues found before they caused outages.

28
issues caught pre-outage
0
production outages since
99.9%
observability uptime

Before

Single-replica components in critical paths, misconfigured scaling, hidden failure points across the stack.

After

Full audit, replication fixed, autoscaling aligned, system resilient under real conditions.

What I changed

  • Performed full audit across observability stack
  • Fixed replication and scaling gaps in critical paths
  • Improved autoscaling using KEDA
  • Aligned system with production best practices

Why this mattered

Platform became reliable under real conditions. It was one bad deploy away from a full outage. Now it is resilient.

Security compliance, AWS Security Hub integration

Compliance was manual and fragmented. Now fully automated.

Real-time
compliance visibility
0
duplicate findings
100%
automated lifecycle

Before

Findings tracked manually with duplicates, no central visibility, no lifecycle management.

After

Automated pipeline with deterministic finding IDs, lifecycle management, and Security Hub integration.

What I changed

  • Built pipeline using AWS CodePipeline and CodeBuild
  • Implemented deterministic finding IDs using SHA256
  • Introduced lifecycle management (create + resolve)
  • Handled API limits using intelligent batching
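
A minimal sketch of the two ideas above: deterministic finding IDs and batching. The fields hashed into the ID, the helper names, and the batch size of 100 are assumptions for illustration; check the current Security Hub limits before relying on that number.

```python
import hashlib
from typing import Iterator, List, Sequence


def finding_id(account_id: str, resource_id: str, rule_id: str) -> str:
    """Derive the same ID for the same account, resource and rule, so a
    re-run updates the existing finding instead of creating a duplicate."""
    key = "|".join((account_id, resource_id, rule_id))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def batches(items: Sequence[dict], size: int = 100) -> Iterator[List[dict]]:
    """Yield consecutive chunks of at most `size` items (assumed limit)."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Hypothetical data: 250 findings split into requests of at most 100.
findings = [
    {"Id": finding_id("111122223333", f"resource-{i}", "rule-1")}
    for i in range(250)
]
batch_sizes = [len(batch) for batch in batches(findings)]
```

Because the ID depends only on what the finding is about, resolving it later is a lookup by the same hash rather than a search for a matching record.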

Why this mattered

Compliance became a controlled, automated system instead of a reactive manual process.

See a pattern that looks like your system?

Bring your architecture diagram or last incident. I will tell you what is actually breaking.