Harjot Singh Rana
8 min read

Distributed Systems in Practice

Distributed Systems, Architecture, Go

I've spent the better part of the last decade building systems that span regions, data centers, and cloud providers. Distributed systems aren't a design choice — they're a requirement the moment your users are spread across continents and expect the same sub-second response.

The first thing every engineer learns is that the network is unreliable. The second thing, the one that takes years to internalize, is that partial failure is the default state, not an exception. A service that responds in 2 ms for six months straight will fail in the seventh month at the worst possible moment.

Idempotency is your best friend. Every operation that mutates state should be safe to retry. We built our payment processing system around idempotency keys: a UUID the client generates once and sends with every retry. The server checks whether it has already seen that key and, if so, returns the cached result instead of processing the payment again. This single pattern eliminated an entire class of double-charge bugs.
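
To make the idempotency-key pattern concrete, here is a minimal in-memory sketch in Go. It is not our production code: `IdempotentCharger`, `ChargeResult`, and the injected `charge` function are hypothetical names, the cache is a plain map rather than a durable store, and the key is a fixed string where a real client would send a UUID.

```go
package main

import (
	"fmt"
	"sync"
)

// ChargeResult is a hypothetical response type for a payment request.
type ChargeResult struct {
	ChargeID    string
	AmountCents int64
}

// IdempotentCharger caches the result of each charge by idempotency key, so a
// retried request gets the original result instead of triggering a new charge.
type IdempotentCharger struct {
	mu      sync.Mutex
	results map[string]ChargeResult
	charge  func(amountCents int64) (ChargeResult, error) // the real side effect
}

func NewIdempotentCharger(charge func(int64) (ChargeResult, error)) *IdempotentCharger {
	return &IdempotentCharger{results: make(map[string]ChargeResult), charge: charge}
}

// Charge runs the side effect at most once per successful key. Holding the
// lock across the call serializes all charges; that keeps the sketch simple
// and stops two concurrent retries with the same key from both executing.
func (c *IdempotentCharger) Charge(key string, amountCents int64) (ChargeResult, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if res, seen := c.results[key]; seen {
		return res, nil // replayed request: return the cached result
	}
	res, err := c.charge(amountCents)
	if err != nil {
		return ChargeResult{}, err // failures are not cached, so the client may retry
	}
	c.results[key] = res
	return res, nil
}

func main() {
	calls := 0
	charger := NewIdempotentCharger(func(amount int64) (ChargeResult, error) {
		calls++
		return ChargeResult{ChargeID: fmt.Sprintf("ch_%d", calls), AmountCents: amount}, nil
	})
	first, _ := charger.Charge("key-123", 500)
	retry, _ := charger.Charge("key-123", 500) // same key: no second charge
	fmt.Println(first == retry, calls)         // prints: true 1
}
```

In a real service the key-to-result mapping would need to survive restarts and be shared across instances, so it would live in a database or similar store, written in the same transaction as the charge where possible.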

Circuit breakers prevent cascading failures. When service A depends on service B, and B starts timing out, A's retries will eventually exhaust its connection pool and take down everything else that depends on A. A circuit breaker watches failure rates and trips open when they exceed a threshold — failing fast instead of failing slowly.
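
A rough sketch of the idea in Go follows. For brevity it trips on a count of consecutive failures rather than a true failure rate, and the names (`Breaker`, `ErrOpen`, `maxFailures`, `cooldown`) are illustrative assumptions, not taken from any particular library.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrOpen is returned without calling the dependency while the breaker is open.
var ErrOpen = errors.New("circuit breaker is open")

// errTimeout stands in for a failure from the downstream service.
var errTimeout = errors.New("service B timed out")

// Breaker trips open after maxFailures consecutive failures and rejects calls
// until cooldown has elapsed, after which calls are allowed through again.
type Breaker struct {
	mu          sync.Mutex
	maxFailures int
	cooldown    time.Duration
	failures    int
	open        bool
	openedAt    time.Time
}

func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Call runs fn unless the breaker is open and still cooling down.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.open && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return ErrOpen // fail fast: do not touch the struggling dependency
	}
	b.mu.Unlock()

	err := fn() // run outside the lock so slow calls do not block other callers

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.open = true
			b.openedAt = time.Now() // trip, or re-trip after a failed trial call
		}
		return err
	}
	b.failures = 0 // any success closes the breaker and resets the count
	b.open = false
	return nil
}

func main() {
	b := NewBreaker(3, 30*time.Second)
	failing := func() error { return errTimeout }
	for i := 1; i <= 5; i++ {
		fmt.Printf("call %d: %v\n", i, b.Call(failing))
	}
}
```

With these settings the first three calls reach the dependency and return its error; calls four and five are rejected immediately with `ErrOpen`. One simplification to be aware of: once the cooldown expires, this sketch lets every concurrent caller through, whereas production breakers usually admit a single trial request in a half-open state.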

Observability is not optional. You cannot debug a distributed system by SSHing into a box. Structured logging, distributed tracing, and metrics dashboards are the minimum. We use OpenTelemetry for tracing and ClickHouse for log aggregation — being able to follow a single request across 15 services is the difference between a 10-minute fix and a 3-day outage.
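
As a small illustration of what "following a single request" relies on, the sketch below tags every structured log line with a trace ID carried in a `context.Context`. It uses only Go's standard library (`log/slog`, Go 1.21+) as a stand-in; it is not the OpenTelemetry API, and `NewRequestContext`, `WithTraceID`, `TraceID`, and `LogEvent` are hypothetical helper names.

```go
package main

import (
	"context"
	"log/slog"
	"os"
)

// traceKey is an unexported context key type, which avoids collisions with
// keys defined by other packages.
type traceKey struct{}

// WithTraceID returns a copy of ctx that carries the given trace ID.
func WithTraceID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, traceKey{}, id)
}

// NewRequestContext starts a fresh context for an incoming request.
func NewRequestContext(id string) context.Context {
	return WithTraceID(context.Background(), id)
}

// TraceID returns the trace ID stored in ctx, or "" if there is none.
func TraceID(ctx context.Context) string {
	id, _ := ctx.Value(traceKey{}).(string)
	return id
}

// LogEvent writes one structured log record tagged with the request's trace
// ID, so records from different services can be joined on that field.
func LogEvent(ctx context.Context, logger *slog.Logger, service, msg string) {
	logger.InfoContext(ctx, msg, "service", service, "trace_id", TraceID(ctx))
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	ctx := NewRequestContext("trace-abc123")
	LogEvent(ctx, logger, "checkout", "order received")
	LogEvent(ctx, logger, "payments", "charge authorized")
}
```

Both JSON lines carry the same `trace_id`, which is what lets a log store filter one request out of the noise. Across real service boundaries the ID has to travel in request headers, which is the part a tracing library handles for you.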

The hardest lesson? You can't test your way to reliability. Chaos engineering, deliberately injecting failures into production, is the only way to know your system actually handles them. We run weekly chaos experiments: kill a random pod, throttle a network link, expire a TLS certificate. The weaknesses those experiments expose get fixed before the same failures happen for real.
