Real systems.
Measured results.
Every engagement starts with a broken system and ends with numbers. Here is what that looks like.
£50K/month in cloud waste. Nobody tracking why.
Before
High-cardinality metrics, inefficient queries, over-provisioned infrastructure. Costs escalating. Performance degrading under load.
After
£50K/month reduction in cloud costs. Faster queries. Lower storage. Reliable observability under load.
What I changed
- Identified high-cardinality sources across Prometheus
- Redesigned metric structure to reduce unnecessary label dimensions
- Optimised query patterns and retention policies
- Removed over-provisioned infrastructure and unused workloads
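The cardinality hunt above can be sketched in a few lines. This is a simplified stand-in for inspecting the Prometheus TSDB: series are modelled as label dicts, and the threshold is illustrative, not the one used in the engagement.

```python
from collections import defaultdict

def label_cardinality(series):
    """Count distinct values per label across a set of time series."""
    values = defaultdict(set)
    for labels in series:
        for key, val in labels.items():
            values[key].add(val)
    return {key: len(vals) for key, vals in values.items()}

def drop_candidates(series, threshold=100):
    """Labels with so many distinct values they should not be labels at all."""
    return sorted(k for k, n in label_cardinality(series).items() if n > threshold)
```

A `user_id` label with thousands of values multiplies series count; a `method` label with a handful does not. That difference is what drives the cost.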
Why it mattered
Observability shifted from a cost problem to a reliable foundation. Financially sustainable at scale.
Failing at 20k req/s. Stable at 80k.
Before
Latency spikes under traffic. Deployment instability. No autoscaling. Every traffic spike meant firefighting.
After
Stable throughput at ~80k req/s. Lower latency. Safer deployments. No more firefighting.
What I changed
- Redesigned architecture using event-driven patterns
- Introduced Kafka to decouple data flows between services
- Implemented autoscaling using KEDA
- Improved observability with Prometheus and tracing
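The decoupling pattern behind the first two changes, sketched minimally: a producer publishes events instead of calling downstream services synchronously, and a consumer drains them at its own pace. A Python queue stands in for a Kafka topic here; event names are illustrative.

```python
import json
import queue
import threading

def publish(topic, event_type, payload):
    """Producer: emit an event rather than blocking on the downstream call."""
    topic.put(json.dumps({"type": event_type, "data": payload}))

def consume(topic, out):
    """Consumer: process events at its own pace; None signals shutdown."""
    while (msg := topic.get()) is not None:
        out.append(json.loads(msg))

topic, processed = queue.Queue(), []
worker = threading.Thread(target=consume, args=(topic, processed))
worker.start()
for i in range(3):
    publish(topic, "request.received", {"id": i})
topic.put(None)
worker.join()
```

Because producers never wait on consumers, a traffic spike fills the topic instead of stalling request handling, and consumers (scaled by KEDA on queue depth) catch up.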
Why it mattered
The platform scales confidently with demand. Traffic spikes are handled, not fought.
59% manual accuracy. 96% fully automated.
Before
Tickets classified manually. Inconsistent results. No trend visibility. Growing backlog with no path to automation.
After
Fully automated LLM-based classification. 96% accuracy. Real-time trend analysis enabled.
What I changed
- Designed and built an LLM-based classification pipeline
- Integrated the full Slack → backend → LLM → database flow
- Implemented concurrency controls and crash recovery
- Introduced prompt versioning using SHA256
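Prompt versioning in one function: hash the normalised prompt text so every stored classification can be traced back to the exact prompt that produced it. A sketch of the idea; the normalisation rules and ID length here are illustrative.

```python
import hashlib

def prompt_version(prompt: str) -> str:
    """Deterministic prompt version: SHA-256 of the normalised text."""
    normalised = "\n".join(line.rstrip() for line in prompt.strip().splitlines())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()[:12]
```

Edit the prompt and the version changes; accuracy regressions become attributable to a specific prompt revision instead of a mystery.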
Why it mattered
Support became scalable, consistent, and insight-driven. AI applied in a way that actually works in production.
1-hour pipelines. Cut to 20 minutes.
Before
Pipelines running ~1 hour with inefficient queries and poor indexing. Teams waiting on data that should take minutes.
After
20-minute pipeline runtime. Reliable processing. Faster access to insights.
What I changed
- Redesigned ETL pipelines using event-driven architecture
- Optimised queries and indexing strategies
- Introduced monitoring and rollback mechanisms
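The rollback mechanism can be sketched as a staging-table swap: load into a staging table, then swap it in atomically, so a failed load leaves the old data untouched. SQLite stands in for the warehouse here, and the table schema is illustrative.

```python
import sqlite3

def load_atomically(db, rows):
    """Load into a staging table, then swap it in. Any failure rolls the
    whole transaction back, leaving the previous table intact."""
    with db:  # commits on success, rolls back on exception
        db.execute("DROP TABLE IF EXISTS events_staging")
        db.execute("CREATE TABLE events_staging (id INTEGER PRIMARY KEY, value TEXT)")
        db.executemany("INSERT INTO events_staging VALUES (?, ?)", rows)
        db.execute("DROP TABLE IF EXISTS events")
        db.execute("ALTER TABLE events_staging RENAME TO events")
```

Consumers only ever see the old complete table or the new complete table, never a half-loaded one.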
Why it mattered
Teams got faster access to insights, more reliable data, and less time fighting pipelines.
28 issues found before they caused outages.
Before
Single-replica components in critical paths. Misconfigured scaling. Hidden failure points. One bad deploy from a full outage.
After
All 28 issues remediated. Resilient scaling. Production-ready observability.
What I changed
- Full audit across observability stack
- Identified 28 issues across severity levels
- Fixed replication and scaling gaps
- Improved autoscaling using KEDA
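One class of finding from the audit, sketched as a check: critical-path workloads running a single replica, where one eviction or bad node takes the component down. In practice this reads Deployment specs from the cluster; the dicts and names below are illustrative stand-ins.

```python
def audit_replicas(workloads, critical):
    """Flag critical workloads running a single replica."""
    return sorted(
        w["name"]
        for w in workloads
        if w["name"] in critical and w.get("replicas", 1) < 2
    )
```

A dev tool on one replica is fine; an ingester on one replica is an outage waiting for a reason.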
Why it mattered
Platform became reliable under real conditions. Resilient to failures.
Manual compliance. Now fully automated.
Before
Findings tracked manually. Duplicates everywhere. No central visibility. Scaling compliance was impossible.
After
Automated pipeline. Zero duplicates. Real-time visibility. Scalable compliance.
What I changed
- Built pipeline using AWS CodePipeline and CodeBuild
- Implemented deterministic finding IDs using SHA256
- Integrated with AWS Security Hub
- Handled API limits using batching
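The deduplication and batching ideas together, in sketch form. The identity fields are illustrative; the point is that the same finding always hashes to the same ID, so re-ingesting it can never create a duplicate. The batch size reflects Security Hub's BatchImportFindings limit of 100 findings per call.

```python
import hashlib
import json

def finding_id(finding):
    """Deterministic ID from the fields that define a finding's identity."""
    key = json.dumps(
        {k: finding[k] for k in ("account", "resource", "check")},
        sort_keys=True,
    )
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def batched(findings, size=100):
    """Yield fixed-size batches to stay under per-call API limits."""
    for i in range(0, len(findings), size):
        yield findings[i : i + size]
```

Fields outside the identity set (timestamps, scan IDs) can change freely without generating a new finding.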
Why it mattered
Compliance became automated, reliable, and scalable. A controlled system instead of reactive tracking.