Orangutan said:
dilbert said:
My experience is somewhat different to that... RAM boards that report ECC errors usually fail to pass testing at next reboot. YMMV.
http://www.lexar.com/content/how-long-does-ram-last
On ECC errors:
dilbert, the reason that you're seeing RAM modules reporting ECC errors/NMIs (aka "the engine seized") is that you haven't paid attention to the Event Log in the first place (aka "the oil light is on"). By far the most common ECC code is a Single Error Correction, Double Error Detection (SECDED) code, which means that you will get "Correctable ECC Errors Encountered" in your System Event Log long before getting Uncorrectable ECC Errors.
On RAM lifetime:
RAM failure modes are divided into two groups: "Soft" errors and "Hard" errors.
Soft errors are caused by background radiation that flips the content of one or more memory cells. This can happen because there are only around 20 electrons in the memory cell (i.e. the capacitor) that stores the information. With that few electrons, the number of electron/hole pairs generated by a background radiation hit is comparable. To avoid this happening too often, DRAM chips employ ECC internally on the rows in order to keep the refresh times up and the error rates down (a DRAM row-read destroys the information read, so every read needs to be internally followed up by a row-write. This is why changing rows takes longer than reading another address within the same row).
Adding the extra 8 bits of RAM to a 64-bit RAM word basically adds another layer of ECC, which is orthogonal to the internal row-oriented ECC. Not having ECC is kind of the equivalent of unplugging that oil light in the car: you have no warning mechanism to tell you that you're heading towards trouble.
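To make the "64 data bits + 8 check bits" idea concrete, here is a minimal sketch of a SECDED code built as an extended Hamming code. It is only an illustration: real memory controllers use their own (often different) 72/64 check-bit layouts, and the function names `encode` and `decode` are made up for this example.

```python
def encode(data_bits):
    """Encode a list of 0/1 data bits as an extended Hamming (SECDED) codeword.

    Layout (an assumption of this sketch): index 0 holds the overall parity
    bit, power-of-two indices hold Hamming check bits, the rest hold data.
    """
    r = 0
    while (1 << r) < len(data_bits) + r + 1:
        r += 1                      # number of Hamming check bits needed
    n = len(data_bits) + r          # Hamming positions are 1..n
    code = [0] * (n + 1)
    bits = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):         # not a power of two: data position
            code[pos] = next(bits)
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos         # XOR of the indices of all set bits
    for i in range(r):
        if syndrome & (1 << i):
            code[1 << i] = 1        # check bits force the XOR to zero
    code[0] = sum(code) % 2         # overall parity: total number of 1s is even
    return code


def decode(code):
    """Return (status, data_bits); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syndrome ^= pos
    parity_ok = sum(code) % 2 == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok and syndrome < len(code):
        code[syndrome] ^= 1         # odd parity: treat as a single flip, fix it
        status = "corrected"        # syndrome 0 means the parity bit itself flipped
    else:
        status = "uncorrectable"    # even parity but nonzero syndrome: double flip
    data = [code[pos] for pos in range(1, len(code)) if pos & (pos - 1)]
    return status, data


word = [1, 0] * 32                  # a 64-bit data word
stored = encode(word)               # 72 bits: 64 data + 7 Hamming + 1 parity
stored[37] ^= 1                     # simulate a single bit flip
print(len(stored), decode(stored)[0])
```

A single flipped bit comes back as "corrected" with the data intact (that is your Event Log entry), while two flipped bits in the same word can only be detected, not repaired (that is your NMI).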
Hard errors are caused by electromigration, where physical damage happens to wires or contacts. This is driven by heat, current density, and time (in that order of significance). See Black's formula on Wikipedia.
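For reference, Black's formula is usually written in the form below. The symbols are the conventional ones, not anything specific to a particular RAM vendor: A is an empirical constant for the interconnect, J the current density, n a model exponent (often quoted as roughly 2), E_a the activation energy, k Boltzmann's constant, and T the absolute temperature.

```latex
\mathrm{MTTF} = \frac{A}{J^{n}} \, \exp\!\left(\frac{E_a}{k\,T}\right)
```

Temperature sits in the exponent while current density only enters as a power law, which is consistent with heat being listed as the most significant driver.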
In an ECC memory, errors (both soft and hard) are masked by the inherent redundancy of the Error Correcting Code. So that they don't go unnoticed, an Event Log is often used to record these events and notify the system administrator, who can then decide on further action. Which brings me back to: if you have a system with ECC-RAM, check your bleeping Event Log.
As an illustration, I just checked up on my own server's Event Log. It showed six "Correctable ECC - Asserted" events over the course of 14 months. Without ECC, I could have gotten silent corruption of the data on my server, including files on the disks (think in-RAM data corrupted before getting written to the disk. This data could be filesystem meta-data or file contents [that golden photo of your first-born]). With ECC, all I got was an Event Log entry.
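If you'd rather not eyeball the log by hand, counting the ECC entries is easy to script. The sketch below assumes you have already dumped the Event Log to text with whatever tool your BMC or vendor provides. The pipe-separated sample lines are made up for illustration and your log format will probably differ; `count_ecc_events` is a hypothetical helper, not part of any tool.

```python
def count_ecc_events(log_text):
    """Count correctable and uncorrectable ECC entries in a text log dump."""
    counts = {"correctable": 0, "uncorrectable": 0}
    for line in log_text.splitlines():
        lowered = line.lower()
        if "ecc" not in lowered:
            continue
        # Test "uncorrectable" first: it contains "correctable" as a substring.
        if "uncorrectable" in lowered:
            counts["uncorrectable"] += 1
        elif "correctable" in lowered:
            counts["correctable"] += 1
    return counts


# Made-up sample data in an assumed format, NOT output from a real system.
SAMPLE_LOG = """\
1 | Memory | Correctable ECC | Asserted
2 | Fan | Lower Critical going low | Asserted
3 | Memory | Correctable ECC | Asserted
4 | Memory | Uncorrectable ECC | Asserted
"""

counts = count_ecc_events(SAMPLE_LOG)
print(counts)
if counts["correctable"] or counts["uncorrectable"]:
    print("ECC events logged: time to look at that DIMM")
```

Run something like this from a periodic job and have it mail you whenever the correctable count goes up, and the oil light becomes hard to ignore.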