observability = the ability to understand the internal state of a complex system based solely on its external outputs (telemetry). in practice: being able to ask arbitrary questions about your production systems without having to write new code
there are, broadly speaking, three flavors of observability data
(1) logs: time-stamped, discrete records of application events
(2) traces: a record of a single request’s path across multiple services. decomposed into spans, one span per service call/operation. shows the sequence of things that happened (service A called service B, which called service C, which queried the database)
(3) metrics: numeric measurements aggregated over time intervals
counter - a number that only goes up (total errors, total requests served). a rate like error rate is derived from counters, it is not a counter itself
gauge - a number that can go up and down (current cpu usage, active connections right now)
histogram - distribution of values across buckets (of the last 10k requests, 8k took 0-100ms, the rest took 500+ ms)
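rough sketch of the three metric types above, in plain python. this is not any real metrics library's api: the class names, method names, and bucket bounds are all made up for illustration, just to make the "only goes up" / "up and down" / "buckets" distinction concrete.

```python
import bisect


class Counter:
    """Hypothetical counter: the value only ever increases."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


class Gauge:
    """Hypothetical gauge: a point-in-time value that can rise or fall."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Hypothetical histogram: counts observations per bucket."""

    def __init__(self, upper_bounds):
        # Each bound is the inclusive upper edge of a bucket;
        # one extra slot at the end catches everything above the last bound.
        self.upper_bounds = sorted(upper_bounds)
        self.counts = [0] * (len(self.upper_bounds) + 1)

    def observe(self, value):
        # bisect_left finds the first bound >= value, i.e. the bucket it falls in.
        idx = bisect.bisect_left(self.upper_bounds, value)
        self.counts[idx] += 1


# Assumed example data, not real measurements.
requests_served = Counter()
active_connections = Gauge()
latency_ms = Histogram([100, 500])

for duration in [20, 100, 300, 800]:
    requests_served.inc()
    latency_ms.observe(duration)

active_connections.set(12)
active_connections.set(7)  # gauges can go back down
```

note the counter holds raw totals only. something like error rate would be computed at query time by dividing one counter by another over a time window.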
traditional monitoring vs. observability
traditional: predict what might break, write a dashboard/alert for it. not super robust to unexpected breaks that you have no visibility into
observability: emit rich, high-dimensionality events from your code at all times (let’s say every request records 50+ attributes). then when something weird happens you query that data interactively, so you don’t have to anticipate the exact failure paths ahead of time
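minimal sketch of the "wide event + ad hoc query" idea, assuming events are just dicts held in memory. the attribute names (route, status, build_id, region, ...) and the group_by helper are hypothetical. a real setup would send events to an event store and query them there, but the shape of the workflow is the same: record everything per request, decide what to slice by later.

```python
from collections import defaultdict


def group_by(events, key, where=lambda e: True):
    """Count matching events per value of an arbitrary attribute."""
    counts = defaultdict(int)
    for event in events:
        if where(event):
            counts[event.get(key)] += 1
    return dict(counts)


# One wide event per request. Only 5 attributes here to keep it short;
# the notes above assume 50+ in practice. All values are made up.
events = [
    {"route": "/checkout", "status": 200, "duration_ms": 80, "build_id": "b41", "region": "us"},
    {"route": "/checkout", "status": 500, "duration_ms": 900, "build_id": "b42", "region": "eu"},
    {"route": "/cart", "status": 500, "duration_ms": 850, "build_id": "b42", "region": "us"},
    {"route": "/cart", "status": 200, "duration_ms": 60, "build_id": "b41", "region": "eu"},
    {"route": "/checkout", "status": 200, "duration_ms": 70, "build_id": "b42", "region": "us"},
]


def is_error(event):
    return event["status"] >= 500


# Questions nobody planned for: slice the same errors by different attributes.
errors_by_region = group_by(events, "region", where=is_error)
errors_by_build = group_by(events, "build_id", where=is_error)
```

here slicing by region shows the errors spread evenly, while slicing by build_id puts them all on one build. no dashboard for "errors per build" had to exist beforehand, the attribute just had to be on the event.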