due to the large number of droplets, efforts to establish new droplet leases took long enough that the work could not be completed before they timed out. Additional work was queued to reattempt establishing the droplet lease. At this point, DWFM had entered a state of congestive collapse and was unable to make forward progress in recovering droplet leases.
That is, the load that resulted from the initial failure was not something the system was designed to handle, so it had cascading effects / required manual cleanup.
As with most IT troubleshooting,
Time spent applying the fix: 15 minutes.
Time spent identifying the problem then discovering where it is in the system: 15 hours.
There’s a full postmortem from AWS. One piece that stands out to me:
That is, the load that resulted from the initial failure was not something the system was designed to handle, so it had cascading effects / required manual cleanup.