This Monday, Linux kernel creator Linus Torvalds went on a frustrated rant about the lack of Error-Correcting Code (ECC) RAM in consumer PCs and laptops.
… the misguided and arse-backwards policy of “consumers don’t need ECC”, [made] the market for ECC memory go away.
The arguments against ECC were always complete and utter garbage. Now even the memory manufacturers are starting to do ECC internally because they finally owned up to the fact that they absolutely have to.
If you’re not familiar with ECC RAM, it’s probably because you don’t build or spec dedicated servers using server-grade CPUs and motherboards—which, unfortunately, is about the only place you actually find ECC. In a nutshell, ECC RAM includes a tiny amount of extra memory used for detection and correction of errors.
Memory errors and probability
In most modern implementations, this means that every 64-bit word stored in RAM is accompanied by eight check bits. A single-bit error (a 0 flipped to 1, or a 1 flipped to 0) can be both detected and corrected automatically. Two bits flipped in the same word can be detected but not corrected. Three or more bits flipped in the same word will probably be detected, but detection is not guaranteed.
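To make the 64-plus-8 arithmetic concrete, here is a minimal Python sketch of a textbook extended Hamming code, which has exactly the properties described above: single-bit errors are corrected and double-bit errors are detected. It is an illustration only. Real memory controllers typically use other 72-bit codes with the same guarantees, and the names `encode`, `decode`, `PARITY_POSITIONS`, and `DATA_POSITIONS` are hypothetical, chosen for this example.

```python
# Sketch of a (72,64) extended Hamming code: 64 data bits + 8 check bits.
# Assumption: this is the textbook construction, not any vendor's actual layout.

PARITY_POSITIONS = (1, 2, 4, 8, 16, 32, 64)           # 7 Hamming check bits
DATA_POSITIONS = [p for p in range(1, 72)
                  if p not in PARITY_POSITIONS]       # 64 data-bit slots
# Position 0 holds the 8th check bit: parity over the whole codeword.


def encode(word):
    """Return a 72-element list of bits encoding a 64-bit integer."""
    bits = [0] * 72
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (word >> i) & 1
    for p in PARITY_POSITIONS:
        # Each check bit makes the parity of its group of positions even.
        parity = 0
        for pos in range(1, 72):
            if pos & p:
                parity ^= bits[pos]
        bits[p] = parity
    bits[0] = sum(bits[1:]) % 2                       # overall parity bit
    return bits


def decode(bits):
    """Return (status, word); word is None if the error is uncorrectable."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, 72):
        if bits[pos]:
            syndrome ^= pos                           # 0 for a valid codeword
    odd_parity = sum(bits) % 2 == 1

    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity and syndrome < 72:
        # Odd number of flips: assume exactly one, located at `syndrome`
        # (a syndrome of 0 means the overall parity bit itself flipped).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a nonzero syndrome (a double-bit error),
        # or a syndrome that points outside the codeword.
        return "uncorrectable", None

    word = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return status, word
```

Flipping any one of the 72 bits of `encode(word)` and passing the result to `decode` yields `"corrected"` along with the original word, while flipping any two bits yields `"uncorrectable"`. With three flips, the decoder sees odd parity and treats the damage as a single-bit error, so it may "correct" the wrong bit and return bad data, which is why detection of three or more errors is not guaranteed.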