Real systems.
Measured results.
Every number here comes from a real engagement. No vanity metrics. No vague improvements. Just what changed and by how much.
Distributed platform, event-driven microservices
Platform failing at 20k req/s. Now handles 80k.
Before
Latency spikes under load, no autoscaling, deployment instability, constant firefighting during traffic peaks.
After
Event-driven architecture, dynamic autoscaling, stable throughput at 4x the original ceiling.
What I changed
- Redesigned architecture using event-driven patterns with Kafka
- Implemented autoscaling using KEDA for dynamic resource allocation
- Improved observability with Prometheus and distributed tracing
- Optimised resource allocation across all services
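To make the autoscaling step concrete, here is a minimal sketch of lag-based scaling, the idea behind KEDA's Kafka scaler, which sizes a consumer pool from consumer lag and a lag threshold. The function name, the threshold, and the replica bounds are illustrative assumptions, not values from this engagement. The real scaler has more options than this simplified model.

```python
import math

def desired_replicas(partition_lags, lag_threshold=100, min_replicas=1, max_replicas=50):
    """Simplified lag-based scaling: roughly one replica per `lag_threshold` messages of lag.

    partition_lags: consumer lag per Kafka partition (hypothetical input shape).
    """
    total_lag = sum(partition_lags)
    wanted = math.ceil(total_lag / lag_threshold)
    # Consumers beyond the partition count would sit idle in a consumer group,
    # so cap the target at the number of partitions.
    wanted = min(wanted, len(partition_lags))
    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(wanted, max_replicas))
```

Scaling on lag rather than CPU ties replica count to the backlog itself, which is what lets throughput follow traffic peaks.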
Why this mattered
Platform scales with demand. No more firefighting during traffic spikes.
Support operations, LLM-powered automation
Manual classification at 59% accuracy. Automated, it reaches 96%.
Before
Tickets classified manually with inconsistent results. No trend visibility. Growing backlog.
After
LLM-based classification pipeline with crash recovery, prompt versioning, and real-time processing.
What I changed
- Designed and built an LLM-based classification pipeline
- Integrated the end-to-end flow: Slack to backend to LLM to database
- Implemented concurrency controls and crash recovery
- Introduced prompt versioning using SHA256 hashing
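As an illustration of prompt versioning with SHA256, the sketch below derives a version ID from the prompt text itself, so any edit to the prompt produces a new ID that can be stored with each classification. The function name, the whitespace normalisation, and the 12-character truncation are assumptions for the example, not details of the production pipeline.

```python
import hashlib

def prompt_version(template: str, length: int = 12) -> str:
    """Return a short, content-derived version ID for a prompt template."""
    # Normalise line endings and outer whitespace so cosmetic differences
    # do not create a new version (an assumed policy for this sketch).
    normalized = template.strip().replace("\r\n", "\n")
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return digest[:length]

if __name__ == "__main__":
    v1 = prompt_version("Classify this ticket into one of: billing, bug, other.")
    v2 = prompt_version("Classify this ticket into one of: billing, bug, feature, other.")
    print(v1, v2, v1 != v2)
```

Storing the ID next to every result makes it possible to attribute an accuracy change to a specific prompt revision.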
Why this mattered
Support operations became scalable and insight-driven. AI actually worked in production.
Enterprise observability platform, Prometheus/Mimir stack
£50K/month in observability waste. No one tracking why.
Before
High-cardinality metrics, over-provisioned infrastructure, costs rising every quarter with no attribution.
After
Right-sized retention, cleaned metric cardinality, aligned architecture with actual usage.
What I changed
- Identified high-cardinality metric sources across Prometheus
- Redesigned metric structure to reduce unnecessary label dimensions
- Optimised query patterns and retention policies
- Removed over-provisioned infrastructure and unused workloads
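A rough sketch of how high-cardinality sources can be spotted: count distinct label sets per metric, then see which labels contribute the most distinct values. It assumes series are available as plain dictionaries of labels, with the metric name under Prometheus's `__name__` label. The function names and sample data are hypothetical.

```python
from collections import defaultdict

def series_per_metric(series):
    """Return (metric, distinct_series_count) pairs, largest first."""
    seen = defaultdict(set)
    for labels in series:
        name = labels["__name__"]
        # A series is identified by its full label set, excluding the metric name.
        key = frozenset((k, v) for k, v in labels.items() if k != "__name__")
        seen[name].add(key)
    return sorted(((n, len(s)) for n, s in seen.items()), key=lambda item: -item[1])

def label_value_counts(series, metric):
    """Count distinct values per label for one metric."""
    values = defaultdict(set)
    for labels in series:
        if labels.get("__name__") == metric:
            for k, v in labels.items():
                if k != "__name__":
                    values[k].add(v)
    return {k: len(v) for k, v in values.items()}

if __name__ == "__main__":
    sample = [
        {"__name__": "http_requests_total", "path": "/a", "user_id": "1"},
        {"__name__": "http_requests_total", "path": "/a", "user_id": "2"},
        {"__name__": "http_requests_total", "path": "/b", "user_id": "3"},
        {"__name__": "up", "job": "api"},
    ]
    print(series_per_metric(sample))
    print(label_value_counts(sample, "http_requests_total"))
```

A label such as the hypothetical `user_id` above, whose value count grows with traffic, is the usual candidate for removal or aggregation.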
Why this mattered
Observability went from a cost problem to a reliable, financially sustainable foundation.
Analytics platform, ETL pipeline optimisation
Data pipeline taking 1 hour. Cut to 20 minutes.
Before
Pipelines running for ~1 hour with inefficient queries, poor indexing, delayed analytics.
After
Event-driven pipeline with optimised queries, monitoring, and rollback mechanisms.
What I changed
- Redesigned ETL pipelines using event-driven architecture
- Optimised queries and indexing strategies
- Introduced monitoring and rollback mechanisms
- Ensured pipelines were production-ready with alerting
Why this mattered
Teams got faster access to insights. Engineering stopped babysitting pipelines.
Enterprise LGTM stack (Mimir, Loki, Tempo)
28 hidden issues found before they caused outages.
Before
Single-replica components in critical paths, misconfigured scaling, hidden failure points across the stack.
After
Full audit, replication fixed, autoscaling aligned, system resilient under real conditions.
What I changed
- Performed full audit across observability stack
- Fixed replication and scaling gaps in critical paths
- Improved autoscaling using KEDA
- Aligned system with production best practices
Why this mattered
Platform became reliable under real conditions. It had been one bad deploy away from a full outage. Now it is resilient.
Security compliance, AWS Security Hub integration
Compliance was manual and fragmented. Now fully automated.
Before
Findings tracked manually with duplicates, no central visibility, no lifecycle management.
After
Automated pipeline with deterministic finding IDs, lifecycle management, and Security Hub integration.
What I changed
- Built pipeline using AWS CodePipeline and CodeBuild
- Implemented deterministic finding IDs using SHA256
- Introduced lifecycle management (create + resolve)
- Handled API limits using intelligent batching
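The three ideas above can be sketched together: a SHA256 hash over stable fields gives each finding the same ID on every run, a set difference between runs yields what to create and what to resolve, and findings are sent in chunks that respect the 100-finding limit of Security Hub's BatchImportFindings call. The choice of key fields and all function names are assumptions for illustration. The actual API call is left out.

```python
import hashlib

MAX_BATCH = 100  # BatchImportFindings accepts at most 100 findings per request

def finding_id(account_id: str, resource_arn: str, rule_id: str) -> str:
    """Deterministic ID: the same inputs always hash to the same finding."""
    key = "|".join([account_id, resource_arn, rule_id])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def lifecycle_diff(previous_ids, current_ids):
    """Return (to_create, to_resolve) between two scan runs."""
    previous, current = set(previous_ids), set(current_ids)
    return current - previous, previous - current

def batches(items, size=MAX_BATCH):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Because the ID depends only on what the finding is about, re-running the pipeline updates existing findings instead of duplicating them.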
Why this mattered
Compliance became a controlled, automated system instead of a reactive manual process.