• aeronmelon@lemmy.world
    link
    fedilink
    English
    arrow-up
    15
    ·
    2 days ago

    As with most IT troubleshooting,

    Time spent applying the fix: 15 minutes.

    Time spent identifying the problem then discovering where it is in the system: 15 hours.

    • pageflight@piefed.social
      link
      fedilink
      English
      arrow-up
      11
      ·
      2 days ago

      There’s a full postmortem from AWS. One piece that stands out to me:

      due to the large number of droplets, efforts to establish new droplet leases took long enough that the work could not be completed before they timed out. Additional work was queued to reattempt establishing the droplet lease. At this point, DWFM had entered a state of congestive collapse and was unable to make forward progress in recovering droplet leases.

      That is, the load that resulted from the initial failure was not something the system was designed to handle, so it had cascading effects / required manual cleanup.