Failure Stories

Real engineering mistakes, their consequences, and what I learned. I write these because the engineers I trust most are the ones who can talk honestly about what went wrong.

The Microservice That Became a Monolith

Medium impact

Problem

Split a perfectly fine service into 6 microservices because 'that's how Netflix does it'.

Mistake

Applied enterprise patterns to a system with 3 users and no scaling requirements.

Consequence

Doubled deployment complexity, increased latency by 40ms per request, and took 3x longer to add features.

Fix

Merged all 6 back into 2 services with clear domain boundaries. Much simpler, just as fast.

Lesson

Microservices solve an organizational scaling problem, not a technical one. At small scale they're almost always a net negative.

Architecture, Distributed Systems

Elasticsearch Index Design Mistake

High impact

Problem

Designed an Elasticsearch (ES) index optimised purely for reads, without thinking about update patterns.

Mistake

Didn't account for partial document updates. Every update required re-indexing the full document including nested arrays.

Consequence

During bulk catalog updates, the index update rate hit the ES write throughput limit and a 15-minute indexing lag appeared.

Fix

Denormalized the nested data into separate indices. Introduced a Kafka topic as a buffer to smooth out write spikes.

Lesson

Design indices for your update patterns first, query patterns second. Measure write amplification before going live.
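One rough way to put a number on write amplification before going live is to compare the size of the document that gets re-indexed with the size of the fields that actually changed. The sketch below is only an approximation: it uses serialized JSON size as a proxy for indexing cost, and the `product` document, its field names, and the `write_amplification` helper are hypothetical, not taken from the real catalog.

```python
import json


def write_amplification(full_doc: dict, changed_fields: dict) -> float:
    """Ratio of bytes re-indexed to bytes that actually changed.

    Assumption: JSON size is a rough stand-in for indexing work.
    """
    full_bytes = len(json.dumps(full_doc).encode("utf-8"))
    changed_bytes = len(json.dumps(changed_fields).encode("utf-8"))
    return full_bytes / changed_bytes


# Hypothetical catalog document: one product with a large nested array.
product = {
    "id": "sku-1",
    "title": "Example product",
    "variants": [
        {"variant_id": i, "price": 10.0, "stock": 5} for i in range(200)
    ],
}

# A bulk update that only touches the stock of a single variant.
change = {"variant_id": 17, "stock": 4}

ratio = write_amplification(product, change)
print(f"write amplification is roughly {ratio:.0f}x")
```

A ratio in the hundreds for a common update is a hint that the nested data may belong in its own index, where a small change rewrites only a small document.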

Elasticsearch, Search, Performance

Wrong Scaling Decision: Hardware Instead of Profiling

Medium impact

Problem

PostgreSQL queries getting slow. Added more RAM and CPU to the instance.

Mistake

The real problem was missing indices and an N+1 query from the ORM. Vertical scaling masked it.

Consequence

Spent budget on hardware for only marginally faster queries, and the N+1 problem came back 2 months later at higher scale.

Fix

Profiled queries with pg_stat_statements, added composite indices, rewrote the hot ORM paths to use raw SQL joins.
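To make the N+1 pattern concrete, here is a minimal sketch that counts queries for the lazy-loading style an ORM tends to produce versus a single join. It uses an in-memory SQLite database as a stand-in for PostgreSQL so it runs anywhere, and the `authors`/`posts` schema and the `CountingRunner` helper are hypothetical, not the real application's tables.

```python
import sqlite3


class CountingRunner:
    """Runs queries on a connection and counts how many were issued."""

    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def __call__(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params).fetchall()


def n_plus_one(run):
    """One query for the parents, then one more query per parent."""
    result = {}
    for author_id, name in run("SELECT id, name FROM authors ORDER BY id"):
        rows = run(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id",
            (author_id,),
        )
        result[name] = [title for (title,) in rows]
    return result


def single_join(run):
    """Same data in one round trip (authors without posts are omitted)."""
    result = {}
    rows = run(
        "SELECT a.name, p.title FROM authors a "
        "JOIN posts p ON p.author_id = a.id ORDER BY a.id, p.id"
    )
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    CREATE INDEX idx_posts_author ON posts (author_id);
    """
)
conn.executemany(
    "INSERT INTO authors VALUES (?, ?)",
    [(i, f"author-{i}") for i in range(1, 51)],
)
conn.executemany(
    "INSERT INTO posts (author_id, title) VALUES (?, ?)",
    [(i, f"post-{i}-{j}") for i in range(1, 51) for j in range(3)],
)

slow = CountingRunner(conn)
fast = CountingRunner(conn)
slow_result = n_plus_one(slow)
fast_result = single_join(fast)
print(f"N+1: {slow.count} queries, join: {fast.count} query")
```

The query count grows with the number of parent rows in the first version and stays constant in the second, which is why extra RAM and CPU only delay the problem.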

Lesson

Profile before scaling. Hardware masks problems; it doesn't solve them.

PostgreSQL, Performance, Backend

Shipping a Feature Nobody Used

Low impact

Problem

Spent 3 weeks building a complex filtering system for a dashboard.

Mistake

Never talked to users. Assumed the feature was needed based on a single offhand comment.

Consequence

The feature went live and zero users clicked it in 4 weeks. The PM eventually removed it.

Fix

Introduced a "question before ticket" rule: every feature needs a user interview or analytics data to justify it.

Lesson

The most expensive code is code nobody uses. Time spent on discovery is always worth it.

Product, Process