A few years ago I designed a way to detect bit-flips in Firefox crash reports and last year we deployed an actual memory tester that runs on user machines after the browser crashes. Today I was looking at the data that comes out of these tests and now I'm 100% positive that the heuristic is sound and a lot of the crashes we see are from users with bad memory or similarly flaky hardware. Here's a few numbers to give you an idea of how large the problem is. 🧵 1/5
No, the exact % depends on how stable everything else is.
Like a trivial example, if you have 3 programs, one that sets a pointer to a random address and tries to dereference it, one that does this but only if the last two digits of a timer it checks are “69”, and one that never sets a pointer to an invalid address, based on the programs themselves, the first one will crash almost all the time, the second one will crash about 1% of the time, and the third one won’t crash at all.
If you had a mechanism to perfectly detect bit flips (honestly, that part has me the most curious about the OP), and you ran each program until you had detected 5 bit flip crashes (let’s say they happen 1 out of each 10k runs), then the first program will have something like a 0.01% chance of any given crash being due to bit flip, about 1% for the 2nd one, and 100% for the 3rd one (assuming no other issues like OS stability causing other crashes).
Going with those numbers I made up, every 10k “runs”, you’d see 1 crash from bit flips and 9 crashes from other reasons. Or for every crash report they receive, 1 of 10 are bit flips, and 9 of 10 are “other”. Well, more accurately, 1 of 20 for bit flip and 19 of 20 for other, due to the assumption that the detector only detects half of them, because they actually only measured 5%.
No, the exact % depends on how stable everything else is.
Like a trivial example, if you have 3 programs, one that sets a pointer to a random address and tries to dereference it, one that does this but only if the last two digits of a timer it checks are “69”, and one that never sets a pointer to an invalid address, based on the programs themselves, the first one will crash almost all the time, the second one will crash about 1% of the time, and the third one won’t crash at all.
If you had a mechanism to perfectly detect bit flips (honestly, that part has me the most curious about the OP), and you ran each program until you had detected 5 bit flip crashes (let’s say they happen 1 out of each 10k runs), then the first program will have something like a 0.01% chance of any given crash being due to bit flip, about 1% for the 2nd one, and 100% for the 3rd one (assuming no other issues like OS stability causing other crashes).
Going with those numbers I made up, every 10k “runs”, you’d see 1 crash from bit flips and 9 crashes from other reasons. Or for every crash report they receive, 1 of 10 are bit flips, and 9 of 10 are “other”. Well, more accurately, 1 of 20 for bit flip and 19 of 20 for other, due to the assumption that the detector only detects half of them, because they actually only measured 5%.