Real systems.
Measured results.

Every engagement starts with a broken system and ends with numbers. Here is what that looks like.

Enterprise observability platform on Prometheus/Mimir

£50K/month in cloud waste. Nobody tracking why.

  • £50K/mo in cloud costs removed
  • 4x faster queries
  • 60% less storage

Before

High-cardinality metrics, inefficient queries, over-provisioned infrastructure. Costs escalating. Performance degrading under load.

After

£50K/month cut from cloud costs. Queries 4x faster. Storage down 60%. Reliable observability under load.

What I changed

  • Identified high-cardinality sources across Prometheus (see the sketch after this list)
  • Redesigned metric structure to reduce unnecessary label dimensions
  • Optimised query patterns and retention policies
  • Removed over-provisioned infrastructure and unused workloads
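
As a rough illustration, the cardinality hunt can start with Prometheus's TSDB stats endpoint, which reports the metrics and label=value pairs holding the most series. A minimal sketch; the Prometheus URL is a placeholder:

    import requests

    # Prometheus base URL is a placeholder.
    PROM_URL = "http://prometheus:9090"

    def top_cardinality(prom_url: str = PROM_URL) -> None:
        """Print the metrics and label=value pairs holding the most series."""
        stats = requests.get(f"{prom_url}/api/v1/status/tsdb", timeout=10).json()["data"]

        print("Top metrics by series count:")
        for entry in stats["seriesCountByMetricName"]:
            print(f"  {entry['name']}: {entry['value']} series")

        print("Top label=value pairs by series count:")
        for entry in stats["seriesCountByLabelValuePair"]:
            print(f"  {entry['name']}: {entry['value']} series")

    if __name__ == "__main__":
        top_cardinality()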

Why it mattered

Observability shifted from a cost problem to a reliable foundation. Financially sustainable at scale.

Distributed platform, retail media

Failing at 20k req/s. Stable at 80k.

  • ~80k req/s stable
  • 4x throughput increase
  • Zero traffic-spike fires

Before

Latency spikes under traffic. Deployment instability. No autoscaling. Every traffic spike meant firefighting.

After

Stable throughput at ~80k req/s. Lower latency. Safer deployments. No more firefighting.

What I changed

  • Redesigned architecture using event-driven patterns
  • Introduced Kafka to decouple data flows between services (sketched below)
  • Implemented autoscaling using KEDA
  • Improved observability with Prometheus and tracing
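
A minimal sketch of the decoupling pattern: services publish events to Kafka instead of calling each other directly, so producers and consumers scale and fail independently. Broker address, topic, and field names here are illustrative:

    import json

    from confluent_kafka import Producer

    # Broker address and topic name are placeholders.
    producer = Producer({"bootstrap.servers": "kafka:9092"})

    def publish_event(event: dict) -> None:
        """Publish an event rather than calling the downstream service directly."""
        producer.produce(
            "ad-events",
            key=str(event.get("campaign_id", "")),
            value=json.dumps(event).encode("utf-8"),
        )
        producer.poll(0)  # serve delivery callbacks without blocking

    if __name__ == "__main__":
        publish_event({"campaign_id": 42, "action": "impression"})
        producer.flush()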

Why it mattered

The platform scales confidently with demand. Traffic spikes are handled, not fought.

Support operations, LLM pipeline

59% manual accuracy. 96% fully automated.

  • 96% accuracy
  • 59% → 96% improvement
  • Full automation

Before

Tickets classified manually. Inconsistent results. No trend visibility. Growing backlog with no path to automation.

After

Fully automated LLM-based classification. 96% accuracy. Real-time trend analysis enabled.

What I changed

  • Designed and built an LLM-based classification pipeline
  • Integrated the full flow: Slack → backend → LLM → database
  • Implemented concurrency controls and crash recovery
  • Introduced prompt versioning using SHA256
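
The prompt-versioning idea in miniature: hash the exact prompt text and store the digest with every result, so any classification can be traced to the prompt that produced it. The template and the stubbed LLM call below are placeholders, not the production pipeline:

    import hashlib

    # Template and stub are placeholders for the real pipeline pieces.
    PROMPT_TEMPLATE = """You are a support-ticket classifier.
    Classify the ticket below into exactly one category.

    Ticket:
    {ticket_text}
    """

    def call_llm(prompt: str) -> str:
        """Stand-in for the real LLM call."""
        return "billing"

    def prompt_version(template: str) -> str:
        """Deterministic version tag: SHA256 of the exact prompt text."""
        return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]

    def classify(ticket_text: str) -> dict:
        category = call_llm(PROMPT_TEMPLATE.format(ticket_text=ticket_text))
        # The digest travels with every result, so any classification can be
        # traced back to the prompt that produced it.
        return {"category": category, "prompt_version": prompt_version(PROMPT_TEMPLATE)}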

Why it mattered

Support became scalable, consistent, and insight-driven. AI applied in a way that actually works in production.

Analytics platform, data engineering

1-hour pipelines. Cut to 20 minutes.

  • 3x faster
  • 20 min, down from 1 hour
  • Real-time monitoring

Before

Pipelines running ~1 hour with inefficient queries and poor indexing. Teams waiting hours for data that should take minutes.

After

20-minute pipeline runtime. Reliable processing. Faster access to insights.

What I changed

  • Redesigned ETL pipelines using event-driven architecture
  • Optimised queries and indexing strategies
  • Introduced monitoring and rollback mechanisms
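
One way the rollback piece can look: each load step runs inside a single transaction, so a failed batch rolls back instead of half-loading, and runtime is recorded for monitoring. This sketch assumes Postgres via psycopg2; connection string and table are placeholders:

    import time

    import psycopg2

    # Connection string and table are placeholders; assumes Postgres.
    def load_batch(rows: list[tuple]) -> None:
        """One pipeline step: load a batch inside a single transaction so a
        failure rolls the whole batch back instead of half-loading it."""
        started = time.monotonic()
        conn = psycopg2.connect("dbname=analytics")
        try:
            with conn:  # commits on success, rolls back on any exception
                with conn.cursor() as cur:
                    cur.executemany(
                        "INSERT INTO events_clean (event_id, payload) VALUES (%s, %s)",
                        rows,
                    )
            print(f"loaded {len(rows)} rows in {time.monotonic() - started:.1f}s")
        finally:
            conn.close()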

Why it mattered

Teams got faster access to insights, more reliable data, and less time fighting pipelines.

Enterprise LGTM stack (Mimir, Loki, Tempo)

28 issues found before they caused outages.

  • 28 issues found
  • Zero resulting outages
  • Clear remediation plan

Before

Single-replica components in critical paths. Misconfigured scaling. Hidden failure points. One bad deploy from a full outage.

After

All 28 issues remediated. Resilient scaling. Production-ready observability.

What I changed

  • Full audit across observability stack
  • Identified 28 issues across severity levels
  • Fixed replication and scaling gaps
  • Improved autoscaling using KEDA

Why it mattered

Platform became reliable under real conditions. Resilient to failures.

Compliance automation, AWS

Manual compliance. Now fully automated.

  • Real-time visibility
  • Zero duplicates
  • Full automation

Before

Findings tracked manually. Duplicates everywhere. No central visibility. Scaling compliance was impossible.

After

Automated pipeline. Zero duplicates. Real-time visibility. Scalable compliance.

What I changed

  • Built pipeline using AWS CodePipeline and CodeBuild
  • Implemented deterministic finding IDs using SHA256
  • Integrated with AWS Security Hub
  • Handled API limits using batching
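
The two mechanics that keep this duplicate-free and inside API limits, sketched: deterministic SHA256 finding IDs and batched BatchImportFindings calls (Security Hub accepts up to 100 findings per call). Field choices here are illustrative; real findings are full ASFF documents:

    import hashlib

    import boto3

    securityhub = boto3.client("securityhub")

    def finding_id(account_id: str, resource_arn: str, check_id: str) -> str:
        """Deterministic ID: the same (account, resource, check) always hashes to
        the same value, so re-running the pipeline cannot create duplicates."""
        raw = f"{account_id}|{resource_arn}|{check_id}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def import_findings(findings: list[dict]) -> None:
        """Send findings in batches of 100, the BatchImportFindings limit."""
        for i in range(0, len(findings), 100):
            batch = findings[i : i + 100]
            resp = securityhub.batch_import_findings(Findings=batch)
            if resp["FailedCount"]:
                raise RuntimeError(f"{resp['FailedCount']} findings failed to import")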

Why it mattered

Compliance became automated, reliable, and scalable. A controlled system instead of reactive tracking.

You already know something is wrong.
Let's find exactly what.

Bring your architecture diagram, cloud bill, or last incident summary. I will tell you what is actually breaking.

30 minutes. No pitch. Ranked risks and a clear next step.