25 July 2018

Slow memory allocation due to Transparent Huge Pages (THP)

Keywords: Linux; unusually long runtime; large contiguous memory allocation; RAM fragmentation; Transparent Huge Pages.

Executive summary (aka abstract aka TL;DR):

On the stock kernel used at least in Ubuntu 14.04, turn off Transparent Huge Pages on machines with a decent uptime and larger amounts of memory (>= 32 GiB).
This can result in a speedup of 60x - 100x for programs which need lots of memory in large contiguous chunks. Whether the THP kernel routines have been sufficiently improved in later Linux distributions remains to be seen.

Longer version.

For a number of tasks at work, some bioinformatics programs I use need quite a bit of memory. Lots of it, actually. The machine I am using has 512 GiB of RAM but, as is frequently seen in production environments, the OS is a bit older: an Ubuntu 14.04 LTS with a 4.2.x kernel.

Symptoms

I noticed a very unusual, non-linear increase in run time for some programs -- e.g. multiple hours instead of the expected minutes -- and started to look for the cause of this performance issue. After a while I began to suspect that memory allocation in Linux was responsible.

A small test program

The following C++ program, once compiled, allocates 160 GiB of RAM in one contiguous chunk, initialises it to all zero and then returns:

#include <vector>
#include <cstdint>

int main(int argc, char **argv) {
  std::vector<uint8_t> v(1073741824LL * 160, 0);   // 2^30 bytes * 160 = 160 GiB, zero-initialised
  return v[1234];
}

So far, so innocuous, as allocating roughly a third of the available RAM should be a no-brainer. However, the behaviour of that program was -- on the otherwise empty machine from above with 190 days of uptime and no swap -- really odd:
  1. The virtual memory allocation part took, as expected, just a couple of microseconds: the VIRT column in top showed the expected 160 GiB right after program start.
  2. Within ~20 seconds, the RES column in top climbed to ~50 GiB. This showed the progress in zeroing out the memory while -- at the same time -- the kernel actually committed RAM pages to the process. This, too, was within expected boundaries.
  3. But after these 20 seconds, it took more than 1 hour(!) for the RES column to climb to the full 160 GiB and for the program to finally exit.
That felt very wrong.
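
For reference, the waiting can also be timed from within the program instead of watching top. A minimal variant of the test program with a std::chrono stopwatch around the allocation -- just a sketch, using the same illustrative 160 GiB as above -- would look like this:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  const auto t0 = std::chrono::steady_clock::now();
  // constructing the vector allocates the 160 GiB range and zero-fills it,
  // which is what forces the kernel to commit the physical pages
  std::vector<uint8_t> v(1073741824LL * 160, 0);
  const auto t1 = std::chrono::steady_clock::now();
  std::printf("allocate + zero: %.1f s\n",
              std::chrono::duration<double>(t1 - t0).count());
  return v[1234];
}

On a healthy system this prints something in the range of tens of seconds for 160 GiB; anything in the range of an hour points to the problem described here.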

Cause and cure

Poking around the internet, I quickly came to suspect a memory allocation mechanism called Transparent Huge Pages; you can read more about it on the kernel's THP documentation pages. The symptoms of THP not performing "as expected" are manifold, ranging from lags in memory allocation of several seconds or even minutes (as in my case) to outright system freezes and reboots. Accounts and recommendations can be found at MemSQL, ElasticSearch, NuoDB, Couchbase, Oracle and many more.

THP is indeed turned on by default on Ubuntu 14.04 and a couple of other distributions. So, I turned it off (as root) via

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

and, lo and behold, the small test program from above then finished within just a minute.
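
As an aside -- and I have not benchmarked this variant myself -- the kernel also offers a per-mapping opt-out: a program can ask for its own memory region to be exempted from THP via madvise() with MADV_NOHUGEPAGE. A minimal sketch (Linux only; using mmap instead of std::vector so the advice can be given before the pages are first touched, and again with the illustrative 160 GiB):

#include <sys/mman.h>   // mmap, munmap, madvise, MADV_NOHUGEPAGE (Linux)
#include <cstdio>
#include <cstring>

int main() {
  const size_t len = 1073741824ULL * 160;   // the same illustrative 160 GiB
  void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) { std::perror("mmap"); return 1; }
  // ask the kernel not to back this mapping with transparent huge pages;
  // this has to happen before the pages are first touched
  if (madvise(p, len, MADV_NOHUGEPAGE) != 0) std::perror("madvise");
  std::memset(p, 0, len);                   // now commit the pages
  munmap(p, len);
  return 0;
}

Whether that helps in a given setup depends on how your program's allocator hands out memory, so the system-wide switch above remains the simpler cure.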


Probable underlying cause: memory fragmentation


At least it was in my case. With the knowledge from the articles cited above, I rebooted the server, as I suspected memory fragmentation to be the root cause of the THP problem; the expectation being that a freshly booted system would have no memory fragmentation. Lo and behold, even with THP turned on, the memory allocation program from above finished in about 30 seconds after the reboot.

Conclusions

For users / system administrators: either turn off THP or, if THP is absolutely needed, reboot the machine regularly to overcome memory fragmentation.
For authors of (bioinformatics or other) software which needs huge contiguous chunks of RAM: at the start of your program, check whether THP is enabled and warn the user (see the sketch below).
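
For the second point, such a check is cheap: on my Ubuntu 14.04 machine the current THP mode can be read from /sys/kernel/mm/transparent_hugepage/enabled, where the active setting is shown in square brackets (path and wording may differ on other distributions). A minimal sketch of a start-up warning:

#include <cstdio>
#include <fstream>
#include <string>

int main() {
  std::ifstream thp("/sys/kernel/mm/transparent_hugepage/enabled");
  std::string line;
  // the file looks like e.g. "[always] madvise never"; brackets mark the active mode
  if (std::getline(thp, line) && line.find("[always]") != std::string::npos) {
    std::fprintf(stderr,
                 "Warning: Transparent Huge Pages are set to 'always'. Large\n"
                 "contiguous allocations may become very slow on long-running hosts.\n");
  }
  return 0;
}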

BaCh, we're done here.

29 December 2013

Crunching on big data? Make sure you have ECC RAM in your machine ...

Executive summary (aka abstract aka TL;DR):

Buying a consumer grade computer with more than just 2 or 4 gigabytes of non-ECC RAM is madness. Especially if you want to work on big data in scientific computing. You have been warned.

Long version

There are a couple of things to be said about defensive programming: some think it's utter rubbish, others take it to previously unseen levels of paranoia. Keeping things in balance can be hard, but for all my projects - and especially for MIRA, the nucleotide sequence assembler and mapper I have written and maintained since 1997 - I try to keep a healthy balance of being just paranoid enough.

MIRA is not a system-critical piece of software and no one will die if it stops unexpectedly. The rule of thumb I'm using is that if some assertion or data check within MIRA fails, the program tries to dump as much useful information as possible to locate or even reproduce the error and then bails out. That has made the code extremely robust, and while users of the development version need to report bugs from time to time, I hardly get any error reports for the stable versions of the program. Cool.

But every now and then I get bug reports which leave me scratching my head.

Two of those arrived just before this Christmas. I was unable to reproduce the error with the data which came along, which left me completely perplexed at first. After digging deeper into the logs and doing manual walk-throughs of the expected code paths I found no reason why the code should fail. But while looking for the n-th time at the second case, I suddenly spotted something mysterious: a value had been logged and, a short while later, a copy of the very same value had also been logged. But the two values differed! I know, I know, there can be zillions of reasons for this to happen (bugs, anyone?), but in this particular code path there was no way in hell a bug could strike as it was really too simple: single-threaded, no pointer operations and especially no wild memory writes. Just calling a subroutine with a given value.

What was even more telling was that the two values did not differ wildly, but only slightly: the first value was 1669278 and the later one 1669276, a difference of 2. As the variable in question was not a counter but simply stored a value, I immediately made the connection to something I had seen earlier in my life (read more about it here) on a different machine: bit rot (bit decay or whatever you want to call it). Indeed, the values written in binary look like this:

        110010111100010011110   (that's 1669278)
        110010111100010011100   (that's 1669276)

See the difference? The second bit from the right somehow decayed and became a 0.
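
For the sceptics: XOR-ing the two logged values keeps exactly the bits in which they differ, and here that is a single one. A tiny throw-away check with the values hard-coded from the log:

#include <bitset>
#include <cstdio>

int main() {
  const unsigned first = 1669278;          // value as originally logged
  const unsigned later = 1669276;          // the "copy" logged a short while later
  const unsigned diff  = first ^ later;    // XOR keeps only the differing bits
  std::printf("%s\n", std::bitset<21>(first).to_string().c_str());
  std::printf("%s\n", std::bitset<21>(later).to_string().c_str());
  std::printf("%s  <- a single set bit: bit 1, i.e. the value 2\n",
              std::bitset<21>(diff).to_string().c_str());
  return 0;
}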

So ... a memory location not used as pointer, not overwritten with some random value but the value differing by one bit, a bit being zero where it should be 1, not a chance for a program bug ... everything taken together, this points pretty strongly to faulty RAM.

Ouch, another two hours spent on a wild goose chase. I hate that, as if I had nothing better to do with my time (especially over Christmas).

This is now the fourth time in the last 18 months that MIRA has caught a hardware memory error, and the third time on other people's hardware. I myself was struck in the summer of 2011, when I spent *weeks* hunting down an elusive bug which I only later recognised to be a memory problem. Best thing about it? The memory error had been impossible to detect with memtest86 even after 48 hours of continuous running, but was triggered frequently enough by MIRA to make me despair. At the time this delayed the release of a stable MIRA version by one or two months, led to short nights and wrecked my nerves.

I think that part of the reason for the increased frequency of this kind of error report is that RAM has become dirt cheap nowadays, and it is not infrequent to see consumer grade machines loaded with 8 to 64 gigabytes of RAM doing some serious number crunching. However, one thing has not caught up yet: the attention of the industry to the need to provide consumer grade main boards with affordable support for ECC RAM. While I can understand that in former times the hefty price difference between ECC and non-ECC RAM made people wince, that difference has nowadays diminished to a point where it should be bearable by almost everyone. However, the overall hardware needed to run with ECC is still a bit pricier than "normal" consumer-grade hardware. Probably because the industry thinks that only institutions running big, fat-ass servers want ECC, and that they should milk them where they can. The price difference is enough for many "normal" people to shy away from it. Bad idea, if you ask me.

So what is ECC RAM you may ask? Let me cite the first paragraph of the corresponding Wikipedia article on ECC RAM (emphasis mine):
Error-correcting code memory (ECC memory) is a type of computer data storage that can detect and correct the more common kinds of internal data corruption. ECC memory is used in most computers where data corruption cannot be tolerated under any circumstances, such as for scientific or financial computing.
There you have it: no tolerance for memory corruption in scientific computing! Especially not in the times of #DataDeluge (or #BigData or whatever you want to call it) we're living in now, where a single sequencing machine can produce dozens if not hundreds of gigabytes of data. Every. Single. Day. And where more and more people work on this data not on servers - which most of the time do have ECC RAM - but on consumer grade machines.

I, for one, know that my next machine at home will have >= 32 GiB RAM ... ECC RAM.

BaCh. We're done here.

31 December 2012

Death of the ATINSEQ "bug"

I first published this text in August 2011 on the MIRA talk mailing list after having spent weeks and months searching for what I thought to be a bug in my program but which finally turned out to be ... oh well, you'll find out. As a totally fair and unbiased survey (3 bug reports to me and/or the MIRA talk list in the last 18 months) seems to suggest, the underlying problem is on the rise. In preparation for a second post later this week, I'm putting up a very slightly redacted (typos, grammar, links) version of the story here.