29 December 2013

Crunching on big data? Make sure you have ECC RAM in your machine ...

Executive summary (aka abstract aka TL;DR):

Buying a consumer-grade computer with more than just 2 or 4 gigabytes of non-ECC RAM is madness. Especially if you want to work on big data in scientific computing. You have been warned.

Long version

There are a couple of things to be said about defensive programming: some think it's utter rubbish, others take it to previously unseen levels of paranoia. Keeping things in balance can be hard, but for all my projects - and especially for MIRA, the nucleotide sequence assembler and mapper I have written and maintained since 1997 - I try to stay just paranoid enough.

MIRA is not a system-critical piece of software and no one will die if it stops unexpectedly. The rule of thumb I use is: if some assertion or data check within MIRA fails, the program tries to dump as much useful information as possible to locate or even reproduce the error, and then bails out. That has made the code extremely robust, and while users of the development version need to report bugs from time to time, I hardly get any error reports for the stable versions of the program. Cool.
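
To make this concrete, here is a minimal C++ sketch of such a check (hypothetical code with made-up names, not MIRA's actual source): when a condition fails, dump whatever context the caller can provide, then bail out instead of limping on with corrupt data.

        #include <cstddef>
        #include <cstdlib>
        #include <iostream>

        // Hypothetical sketch, not MIRA's actual code: on a failed
        // sanity check, print what failed and where, run a
        // caller-provided context dump for later reproduction, bail out.
        #define SANITY_CHECK(cond, dumpContext) \
            do { \
                if (!(cond)) { \
                    std::cerr << "check failed: " #cond << " at " \
                              << __FILE__ << ':' << __LINE__ << '\n'; \
                    dumpContext();  /* state dump for the bug report */ \
                    std::abort();   /* bail out, never limp onward */ \
                } \
            } while (0)

        // Example use with a made-up function:
        void placeRead(std::size_t readIndex, std::size_t numReads) {
            SANITY_CHECK(readIndex < numReads, [&] {
                std::cerr << "readIndex=" << readIndex
                          << " numReads=" << numReads << '\n';
            });
            // ... continue, now knowing the index is sane ...
        }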

But every now and then I get bug reports which leave me scratching my head.

Two of those arrived just before this Christmas. I was unable to reproduce the error with the data that came along, which left me completely perplexed at first. After digging deeper into the logs and doing manual walkthroughs of the expected program paths, I found no reason why the code should fail. But while looking at the second case for the n-th time, I suddenly spotted something mysterious: a value had been logged and, a short time later, a copy of the very same value had also been logged. But the two values differed! I know, I know, there can be zillions of reasons for this to happen (bugs, anyone?), but in this particular code path there was no way in hell a bug could strike, as it was really too simple: single-threaded, no pointer operations and especially no wild memory writes. Just calling a subroutine with a given value.

What was even more telling was that the two values did not differ wildly, but only slightly: 1669278 logged first, 1669276 logged later. That's a difference of 2. As the variable in question was not a counter but simply stored a value, I immediately made the connection to something I had seen earlier in my life (read more about it here) on a different machine: bit rot (bit decay or whatever you want to call it). Indeed, the values written in binary look like this:

        110010111100010011110   (that's 1669278)
        110010111100010011100   (that's 1669276)

See the difference? The second bit from the right somehow decayed and became a 0.
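
You don't have to eyeball it, either: XORing the two values keeps only the bits that differ, and a power-of-two result means exactly one flipped bit. A quick C++ sketch:

        #include <cstdint>
        #include <iostream>

        int main() {
            std::uint32_t first  = 1669278;  // value as logged first
            std::uint32_t second = 1669276;  // the "copy" logged later
            std::uint32_t diff = first ^ second;  // differing bits only
            // a power of two has exactly one bit set
            bool oneBitFlip = diff != 0 && (diff & (diff - 1)) == 0;
            std::cout << "xor = " << diff << ", single bit flip: "
                      << std::boolalpha << oneBitFlip << '\n';
            // prints: xor = 2, single bit flip: true
        }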

So ... a memory location not used as a pointer, not overwritten with some random value but differing by a single bit, a bit that is 0 where it should be 1, and no plausible way for a program bug to strike ... taken together, this points pretty strongly to faulty RAM.

Ouch, another two hours spent on a wild goose chase. I hate that, as if I had nothing better to do with my time (especially over Christmas).

This is now the fourth time in the last 18 months that MIRA has caught a hardware memory error, and the third time on other people's hardware. I myself was struck in the summer of 2011 and spent *weeks* hunting down an elusive bug that I only later recognised to be a memory problem. Best thing about it? The memory error had been impossible to detect with memtest86 even after 48 hours of continuous running, but was triggered frequently enough by MIRA to make me despair. At the time, this delayed the release of a stable MIRA version by one or two months, led to short nights and wrecked my nerves.

I think that part of the reason for the increased frequency of this kind of error report is that RAM has become dirt cheap nowadays, and it is not infrequent to see consumer-grade machines loaded with 8 to 64 gigabytes of RAM doing some serious number crunching. However, one thing has not caught up yet: the industry's attention to providing consumer-grade mainboards with reasonably priced support for ECC RAM. While I can understand that in former times the hefty price difference between ECC and non-ECC RAM made people wince, that difference has nowadays diminished to a point where it should be bearable by almost everyone. However, the overall hardware needed to run with ECC is still a bit pricier than "normal" consumer-grade hardware. Probably because the industry thinks that only institutions running big, fat-ass servers want ECC, and that they should milk them where they can. The price difference is enough for many "normal" people to shy away from it. Bad idea, if you ask me.

So what is ECC RAM, you may ask? Let me cite the first paragraph of the corresponding Wikipedia article on ECC RAM (emphasis mine):

        Error-correcting code memory (ECC memory) is a type of computer data storage that can detect and correct the more common kinds of internal data corruption. ECC memory is used in most computers where data corruption cannot be tolerated under any circumstances, such as for *scientific* or financial computing.

There you have it: no tolerance for memory corruption in scientific computing! Especially not in these times of #DataDeluge (or #BigData or whatever you want to call it), where a single sequencing machine can produce dozens if not hundreds of gigabytes of data. Every. Single. Day. And where more and more people work on this data not on servers - which most of the time do have ECC RAM - but on consumer-grade machines.
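
How can memory detect, let alone correct, its own corruption? The classic mechanism is a Hamming code: a few extra parity bits are stored alongside each word, and a single flipped bit then betrays its own position. Real ECC modules use a wider variant (typically 72 stored bits per 64-bit word); the toy Hamming(7,4) sketch below, in C++, shows the principle:

        #include <cstdint>
        #include <iostream>

        // Encode a 4-bit value into a 7-bit Hamming(7,4) codeword.
        // Bit i of the returned byte holds codeword position i+1.
        std::uint8_t hammingEncode(std::uint8_t nibble) {
            std::uint8_t d1 = (nibble >> 0) & 1, d2 = (nibble >> 1) & 1;
            std::uint8_t d3 = (nibble >> 2) & 1, d4 = (nibble >> 3) & 1;
            std::uint8_t p1 = d1 ^ d2 ^ d4;  // parity of positions 1,3,5,7
            std::uint8_t p2 = d1 ^ d3 ^ d4;  // parity of positions 2,3,6,7
            std::uint8_t p4 = d2 ^ d3 ^ d4;  // parity of positions 4,5,6,7
            return p1 | (p2 << 1) | (d1 << 2) | (p4 << 3)
                      | (d2 << 4) | (d3 << 5) | (d4 << 6);
        }

        // Decode: a non-zero syndrome names the flipped position.
        std::uint8_t hammingDecode(std::uint8_t cw) {
            auto bit = [&](int pos) { return (cw >> (pos - 1)) & 1; };
            int s = (bit(1) ^ bit(3) ^ bit(5) ^ bit(7))
                  | (bit(2) ^ bit(3) ^ bit(6) ^ bit(7)) << 1
                  | (bit(4) ^ bit(5) ^ bit(6) ^ bit(7)) << 2;
            if (s != 0) cw ^= 1 << (s - 1);  // correct the rotten bit
            return ((cw >> 2) & 1) | (((cw >> 4) & 1) << 1)
                 | (((cw >> 5) & 1) << 2) | (((cw >> 6) & 1) << 3);
        }

        int main() {
            std::uint8_t stored = hammingEncode(11);  // 11 = 0b1011
            stored ^= 1 << 4;  // one bit rots away in "RAM"
            std::cout << int(hammingDecode(stored)) << '\n';  // prints 11
        }

With ECC RAM doing exactly this kind of bookkeeping in hardware, the single-bit error from my logs above would most likely have been corrected on the fly instead of costing me hours of debugging.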

I, for one, know that my next machine at home will have >= 32 GiB RAM ... ECC RAM.

BaCh. We're done here.
