ECC memory


Error correction code memory is a type of computer data storage that uses an error correction code to detect and correct n-bit data corruption which occurs in memory.
Typically, ECC memory maintains a memory system immune to single-bit errors: the data that is read from each word is always the same as the data that had been written to it, even if one of the bits actually stored has been flipped to the wrong state. Most non-ECC memory cannot detect errors, although some non-ECC memory with parity support allows detection but not correction.
ECC memory is used in most computers where data corruption cannot be tolerated, like industrial control applications, critical databases, and infrastructural memory caches.

Background: memory errors

Concept

Error correction codes protect against undetected data corruption and are used in computers where such corruption is unacceptable, such as those running scientific and financial computing applications or serving as database and file servers. ECC can also reduce the number of crashes in multi-user server applications and maximum-availability systems.
Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory to spontaneously flip to the opposite state. It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research has shown that the majority of one-off soft errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read or write to them. Hence, the error rates increase rapidly with rising altitude; for example, compared to sea level, the rate of neutron flux is 3.5 times higher at 1.5 km and 300 times higher at 10–12 km. As a result, systems operating at high altitudes require special provisions for reliability.
As an example, the spacecraft Cassini–Huygens, launched in 1997, contained two identical flight recorders, each with 2.5 gigabits of memory in the form of arrays of commercial DRAM chips. Thanks to built-in error detection and correction (EDAC) functionality, the spacecraft's engineering telemetry reported the number of single-bit-per-word errors and double-bit-per-word errors. During the first 2.5 years of flight, the spacecraft reported a nearly constant single-bit error rate of about 280 errors per day. However, on November 6, 1997, during the first month in space, the number of errors increased by more than a factor of four for that single day. This was attributed to a solar particle event that had been detected by the satellite GOES 9.
There was some concern that, as DRAM density increases further and the components on chips get smaller while operating voltages continue to fall, DRAM chips would be affected by such radiation more frequently, since lower-energy particles would be able to change a memory cell's state. On the other hand, smaller cells make smaller targets, and moves to technologies such as silicon on insulator (SOI) may make individual cells less susceptible and so counteract, or even reverse, this trend. Recent studies show that single-event upsets due to cosmic radiation have been dropping dramatically with process geometry, and that previous concerns over increasing bit cell error rates are unfounded.

Real-world error rates and consequences

Work published between 2007 and 2009 showed widely varying error rates spanning more than 7 orders of magnitude, from roughly one bit error per hour per gigabyte of memory to roughly one bit error per millennium per gigabyte of memory. A large-scale study based on Google's very large number of servers was presented at the SIGMETRICS/Performance '09 conference. The actual error rate found was several orders of magnitude higher than in previous small-scale or laboratory studies, with between 25,000 and 70,000 errors per billion device hours per megabit. More than 8% of DIMM memory modules were affected by errors per year.
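The figures above are quoted in different units, so a back-of-the-envelope conversion helps put them on one scale. The sketch below assumes decimal units (1 gigabyte = 8 × 10^9 bits, 1 megabit = 10^6 bits) and a 365.25-day year; the function names are illustrative and the results are only as precise as the round figures they start from.

```python
import math

BITS_PER_GB = 8 * 10**9                    # assumption: decimal gigabyte
BITS_PER_MBIT = 10**6                      # assumption: decimal megabit
HOURS_PER_MILLENNIUM = 1000 * 365.25 * 24  # assumption: 365.25-day year


def per_bit_hour_from_gb_interval(hours_between_errors):
    """Errors per bit-hour if one gigabyte sees one error every N hours."""
    return 1.0 / (BITS_PER_GB * hours_between_errors)


def per_bit_hour_from_device_hours(errors_per_1e9_hours_per_mbit):
    """Convert errors per billion device hours per megabit to errors per bit-hour."""
    return errors_per_1e9_hours_per_mbit / (1e9 * BITS_PER_MBIT)


def errors_per_gb_hour(rate_per_bit_hour):
    """Scale a per-bit-hour rate up to one gigabyte of memory."""
    return rate_per_bit_hour * BITS_PER_GB


high = per_bit_hour_from_gb_interval(1)                    # one error per hour per GB
low = per_bit_hour_from_gb_interval(HOURS_PER_MILLENNIUM)  # one per millennium per GB
orders_of_magnitude = math.log10(high / low)

print(f"high: {high:.2e} errors per bit-hour")
print(f"low:  {low:.2e} errors per bit-hour")
print(f"spread: about {orders_of_magnitude:.1f} orders of magnitude")

for reported in (25_000, 70_000):
    rate = per_bit_hour_from_device_hours(reported)
    print(f"{reported} per 1e9 device hours per Mbit "
          f"= {errors_per_gb_hour(rate):.2f} errors per GB per hour")
```

Under these assumptions the two "roughly" figures land about seven orders of magnitude apart, and the large-scale study's range corresponds to a fleet-wide average of a fraction of an error per gigabyte per hour. An average like this says nothing about how errors are distributed across individual modules.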
The consequence of a memory error is system-dependent. In systems without ECC, an error can lead either to a crash or to corruption of data; in large-scale production sites, memory errors are one of the most common hardware causes of machine crashes. Memory errors can cause security vulnerabilities. A memory error can have no consequences if it changes a bit which neither causes observable malfunctioning nor affects data used in calculations or saved. A 2010 simulation study showed that, for a web browser, only a small fraction of memory errors caused data corruption, although, as many memory errors are intermittent and correlated, the effects of memory errors were greater than would be expected for independent soft errors.
Some tests conclude that the isolation of DRAM memory cells can be circumvented by unintended side effects of specially crafted accesses to adjacent cells. Thus, accessing data stored in DRAM causes memory cells to leak their charges and interact electrically, as a result of high cell density in modern memory, altering the content of nearby memory rows that actually were not addressed in the original memory access. This effect is known as row hammer, and it has also been used in some privilege escalation computer security exploits.
Consider a single-bit error that would be ignored by a system with no error checking, would halt a machine with parity checking, or would be invisibly corrected by ECC. A single bit is stuck at 1 due to a faulty chip, or is changed to 1 by background or cosmic radiation. A spreadsheet storing numbers in ASCII format is loaded, and the character "8" (binary 0011 1000) is stored in the byte that contains the stuck bit at its lowest bit position. A change is then made to the spreadsheet and it is saved. As a result, the "8" has silently become a "9" (binary 0011 1001).
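The spreadsheet scenario can be reproduced in a few lines. This is only an illustration of the bit arithmetic; the helper name read_with_stuck_bit is made up for the example and models a cell whose lowest bit always reads as 1.

```python
STUCK_BIT = 0  # lowest bit position of the byte, as in the scenario above


def read_with_stuck_bit(byte, bit=STUCK_BIT):
    """Model a memory cell in which the given bit always reads back as 1."""
    return byte | (1 << bit)


original = ord("8")                         # ASCII 0x38
corrupted = read_with_stuck_bit(original)   # lowest bit forced to 1

print(f"{original:08b} -> {corrupted:08b}: "
      f"{chr(original)!r} became {chr(corrupted)!r}")
# prints: 00111000 -> 00111001: '8' became '9'
```

A character whose lowest bit is already 1, such as "9", passes through such a cell unchanged, which is one reason a stuck bit can go unnoticed for a long time.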

Solutions

Several approaches have been developed to deal with unwanted bit-flips, including immunity-aware programming, RAM parity memory, and ECC memory.
This problem can be mitigated by using DRAM modules that include extra memory bits and memory controllers that exploit these bits. The extra bits store either parity or an error-correcting code. Parity allows the detection of all single-bit errors (and, more generally, of any odd number of flipped bits) but not their correction, so the system has to either carry on with possibly corrupted data or halt. Error-correcting codes additionally allow errors to be corrected; how many depends on the exact type of memory used.
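As a minimal sketch of why parity detects but cannot correct, the following example stores one even-parity bit per byte. The one-bit-per-byte layout and the function names are assumptions made for illustration, not a description of any particular memory module.

```python
def parity_bit(byte):
    """Even parity: the extra bit makes the total number of 1 bits even."""
    return bin(byte).count("1") % 2


def store(byte):
    """Return the byte together with the parity bit kept alongside it."""
    return byte, parity_bit(byte)


def check(byte, stored_parity):
    """True if the byte read back is consistent with the stored parity bit."""
    return parity_bit(byte) == stored_parity


data, p = store(ord("8"))
one_flip = data ^ 0b00000001    # a single flipped bit
two_flips = data ^ 0b00000011   # two flipped bits

print(check(data, p), check(one_flip, p), check(two_flips, p))
# prints: True False True
```

The single flip is caught, but the check only reports that something is wrong, not which bit. The double flip passes unnoticed because the number of 1 bits is even again.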
DRAM memory may provide increased protection against soft errors by relying on error-correcting codes. Such error-correcting memory, known as ECC or EDAC-protected memory, is particularly desirable for highly fault-tolerant applications, such as servers, as well as deep-space applications due to increased radiation.
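To make the idea of correction concrete, here is a small sketch of a single-error-correcting, double-error-detecting (SECDED) code: an extended Hamming code protecting 4 data bits with 4 check bits. The 4-bit word size is chosen only to keep the example short; real ECC memory uses much wider words, and the exact code depends on the memory and controller.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Positions 1, 2 and 4 hold Hamming check bits, positions 3, 5, 6 and 7 hold
    data, and position 0 holds an overall parity bit over positions 1..7.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = sum(c[1:]) % 2
    return c


def decode(codeword):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    c = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos          # XOR of the positions holding a 1
    overall = sum(c) % 2             # 0 when the total parity is intact
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single-bit error. The syndrome is
        # the position of the flipped bit (0 means the overall parity bit).
        c[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits have flipped.
        return None, "uncorrectable"
    nibble = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return nibble, status


word = encode(0b1011)
word[5] ^= 1                         # inject a single-bit error
print(decode(word))                  # prints: (11, 'corrected')

word[2] ^= 1                         # inject a second error in the same word
print(decode(word))                  # prints: (None, 'uncorrectable')
```

The second result shows the limit of such a code: two errors in the same word are detected but cannot be repaired.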
Some systems also "scrub" the memory, by periodically reading all addresses and writing back corrected versions if necessary to remove accumulated soft errors.
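A scrubbing pass can be sketched as a loop over all addresses. To keep the example self-contained it uses a deliberately simple stand-in code (three copies of each word with a bitwise majority vote) rather than the codes used in real ECC memory; the function names and the list-based memory model are likewise assumptions for illustration.

```python
def encode(word):
    """Stand-in code: keep three copies of the word."""
    return [word, word, word]


def decode(copies):
    """Bitwise majority vote across the three stored copies."""
    a, b, c = copies
    return (a & b) | (a & c) | (b & c)


def scrub(memory):
    """Read every address and write back a corrected version where needed.

    Returns the number of locations that had to be rewritten.
    """
    repaired = 0
    for addr, stored in enumerate(memory):
        clean = encode(decode(stored))
        if stored != clean:
            memory[addr] = clean
            repaired += 1
    return repaired


memory = [encode(w) for w in (0x38, 0x39, 0xFF, 0x00)]
memory[0][1] ^= 0b00000001   # soft error in one copy at address 0
memory[2][0] ^= 0b10000000   # soft error in one copy at address 2

print(scrub(memory))         # prints: 2
print(scrub(memory))         # prints: 0
```

The point of scrubbing is timing: each word here tolerates one damaged copy, so repairing single errors periodically keeps a second error from accumulating in the same word before the first has been removed.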

Schemes

Modern memory subsystems may deliver data integrity through one or more of the following schemes:
  • By memory controller: These schemes have the memory controller send or receive extra data to or from the chip.
      • Side-band ECC is the traditional server approach. ECCs are stored in separate DRAM chips and transmitted with the data through additional channels. The memory controller computes ECCs when writing, corrects errors when reading, and reports error corrections and detections to the operating system or firmware.
      • Inline ECC, or in-band ECC, does not use extra channel width and is therefore compatible with "non-ECC" memory modules. The memory controller partitions the physical space.
          • In one style of implementation, represented by Intel's IBECC and TI's RTOS processor, the physical address space is partitioned so that there is a chunk of reserved memory. Each write command must be accompanied by an additional write command, and the same applies to read commands. This results in an approximate doubling of memory latency. Specifically, Intel's implementation has minimal performance impact on web browsing and productivity applications, but can reduce performance by up to 25% in gaming and video editing workloads.
          • It is theoretically possible to simply partition the existing channel to provide an analogue of side-band ECC. A cursory read of Synopsys's description of "inline ECC", which mentions a partitioning of the 16-bit channel-per-chip, would lead to this understanding, but this approach is not very common in commercial products.
  • By memory chip: On-die ECC, also called in-DRAM ECC or integrated ECC, is mandatory in all DDR5 and LPDDR6 memory modules to mitigate the higher error rates associated with smaller memory cells. Additional ECC storage and error correction circuitry are embedded in the DRAM chips and are invisible to the memory controller. Transmission errors are not corrected, since ECCs are not sent with the data, and error corrections and detections are not reported. Additional latency is introduced only when error correction is needed.
  • By both:
      • Link ECC adds error correction to the data link but not to the underlying storage. The memory controller computes and transmits ECCs with the data when writing to the DRAM, which verifies them and corrects errors. When reading, the DRAM computes ECCs that the memory controller then verifies. It is part of LPDDR5. While side-band ECC automatically provides link-level redundancy, in-band/inline ECC that reserves physical address space and on-die ECC do not; they need a layer of link ECC to protect against corruption in transmission.
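The address-space-reserving style of inline ECC described in the list above can be sketched as follows, to show why every data access is paired with a second access. The layout is entirely hypothetical: one check byte per 8 data bytes, with the check bytes placed in a reserved region at the top of memory. It is not taken from Intel's or TI's documentation, and real controllers may organize and access the reserved region differently.

```python
DATA_BYTES_PER_ECC_BYTE = 8  # hypothetical ratio, chosen only for illustration


def partition(total_bytes):
    """Split physical memory into (usable data bytes, reserved ECC bytes)."""
    data_bytes = (total_bytes * DATA_BYTES_PER_ECC_BYTE) // (DATA_BYTES_PER_ECC_BYTE + 1)
    data_bytes -= data_bytes % DATA_BYTES_PER_ECC_BYTE  # keep whole 8-byte groups
    return data_bytes, total_bytes - data_bytes


def commands_for_write(addr, total_bytes):
    """List the memory commands issued for one write to a data address."""
    data_bytes, _ = partition(total_bytes)
    if not 0 <= addr < data_bytes:
        raise ValueError("address outside the usable data region")
    # Hypothetical mapping: check bytes live just above the data region.
    ecc_addr = data_bytes + addr // DATA_BYTES_PER_ECC_BYTE
    return [("WRITE", addr), ("WRITE", ecc_addr)]


print(partition(9216))                  # prints: (8192, 1024)
print(commands_for_write(4096, 9216))   # prints: [('WRITE', 4096), ('WRITE', 8704)]
```

In this model the module needs no extra chips or channel width, but part of its capacity is given up and each access turns into two commands, which is the source of the latency cost noted in the list.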

    Reporting of error

Many early implementations of ECC memory, as well as on-die ECC, mask correctable errors, acting "as if" the error never occurred, and report only uncorrectable errors. Modern implementations log both correctable and uncorrectable errors. Some operators proactively replace memory modules that exhibit high error rates in order to reduce the likelihood of uncorrectable error events.