good observability

  • observability = the ability to understand the internal state of a complex system based solely on its external outputs (telemetry). in practice: being able to ask arbitrary questions about your production systems without having to write new code
  • there are, broadly speaking, three flavors of observability data
    • (1) logs: time-stamped, discrete records of application events
    • (2) traces: a record of a single request’s path across multiple services, decomposed into spans (one span per service/operation). shows the sequence of what happened (service A called service B, which called service C, which queried the database)
    • (3) metrics: numeric measurements aggregated over time intervals
      • counter - a number that only goes up (total errors, total requests served)
      • gauge - a number that can go up and down (current cpu usage, active connections right now)
      • histogram - distribution of values across buckets (of the last 10k requests, 8k took 0-100ms and the rest took 500+ms)
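
a minimal sketch of the three metric types above, in plain Python with no metrics library. the class names, the `bounds` parameter, and the sample latencies are all made up for illustration; real clients (Prometheus, OpenTelemetry, etc.) have their own APIs.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Monotonic total: only goes up (e.g. total requests served)."""
    value: int = 0

    def inc(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


@dataclass
class Gauge:
    """Point-in-time value that can go up or down (e.g. active connections)."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations per bucket; `bounds` are inclusive upper edges (ms)."""
    bounds: list[float]
    counts: list[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        # One extra bucket catches everything above the last bound.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        # bisect_left returns the index of the first bound >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1


# Hypothetical metrics for one service.
requests_total = Counter()
active_connections = Gauge()
latency_ms = Histogram(bounds=[100, 500])  # buckets: <=100, <=500, >500

for latency in [12, 80, 95, 650, 40, 900]:  # made-up request latencies
    requests_total.inc()
    latency_ms.observe(latency)
active_connections.set(3)

print(requests_total.value)  # 6
print(latency_ms.counts)     # [4, 0, 2]
```

note that the histogram keeps only per-bucket counts, not the raw values: cheap to store, but you can't recover an individual request from it.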
  • traditional monitoring vs. observability
    • traditional: predict what might break, then write a dashboard/alert for it. not robust to unexpected failures you never thought to instrument
    • observability: emit rich, high-dimensionality events from your code at all times (let’s say every request records 50+ attributes). then when something weird happens you query that data interactively, so you don’t have to anticipate the exact failure paths
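      • high cardinality: attributes with many distinct values (user id, request id), so you can narrow down to a single user or request
      • high dimensionality: many attributes per event, so you can slice by any combination of them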
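
a rough sketch of the "emit wide events, query later" idea, assuming events are just dicts held in memory. the attribute names, the sample data, and the `group_stats` helper are hypothetical; a real setup would send events to an event store and query them there.

```python
from collections import defaultdict
from statistics import median

# Hypothetical wide events: one dict per request. Real events would carry
# 50+ attributes; only a handful are shown here.
events = [
    {"user_id": "u1", "endpoint": "/cart", "region": "eu", "build": "a1f", "status": 200, "duration_ms": 40},
    {"user_id": "u2", "endpoint": "/cart", "region": "us", "build": "a1f", "status": 200, "duration_ms": 55},
    {"user_id": "u3", "endpoint": "/checkout", "region": "eu", "build": "b7c", "status": 200, "duration_ms": 900},
    {"user_id": "u4", "endpoint": "/checkout", "region": "us", "build": "b7c", "status": 500, "duration_ms": 1200},
    {"user_id": "u5", "endpoint": "/cart", "region": "eu", "build": "b7c", "status": 200, "duration_ms": 700},
    {"user_id": "u1", "endpoint": "/checkout", "region": "eu", "build": "a1f", "status": 200, "duration_ms": 60},
]


def group_stats(events, key, where=lambda e: True):
    """Group matching events by any attribute; return (count, median latency) per group."""
    groups = defaultdict(list)
    for event in events:
        if where(event):
            groups[event.get(key)].append(event["duration_ms"])
    return {k: (len(v), median(v)) for k, v in groups.items()}


# A question nobody planned a dashboard for: are slow requests tied to one build?
slow_by_build = group_stats(events, "build", where=lambda e: e["duration_ms"] > 500)
print(slow_by_build)  # {'b7c': (3, 900)}

# Same data, different question, no new instrumentation.
by_endpoint = group_stats(events, "endpoint")
print(by_endpoint)  # {'/cart': (3, 55), '/checkout': (3, 900)}
```

grouping by endpoint alone makes /checkout look like the culprit, while grouping by build shows every slow request came from build b7c. having many attributes per event, including high-cardinality ones like `build` or `user_id`, is what makes that second cut possible.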