31 December 2012

Death of the ATINSEQ "bug"

I first published this text in August 2011 on the MIRA talk mailing list after having spent weeks and months searching for what I thought to be a bug in my program but finally turned out to be ... oh well, you'll find out. As a totally fair and unbiased survey (3 bug reports to me and/or the MIRA talk list in the last 18 months) seems to suggest, the underlying problem is on the rise. In preparation for a second post later this week, I'm putting up a very slightly redacted (typos, grammar, links) version of the story here.



Preface

I am most grateful to Bob Bruccoleri, Lionel Guy, Arun Rawat, Nestor Zaburannyi and numerous other people - who either declined being thanked publicly or could not respond before I wrote this (you still can, just drop me a mail) - who all donated time and computing power to hunt down a "bug" (see http://www.freelists.org/post/mira_talk/Call-for-help-bughunting) which, in the end, turned out to be something entirely different.


A sunny Friday morning in August

Life's a rollercoaster and there are days - or weeks - where moral is on a pretty hefty ride: ups and downs in fast succession ... and the occasional looping here and there.

Today was a day where I had - the first time ever - ups and downs occurring absolutely simultaneously. Something which is physically impossible, I know, but don't tell any physicist or astronomer about that or else they'll embark you on a lengthy discussion on how isochronicity is a myth by telling you stories on lightning, thunder and two poor sobs at the ends of a 300,000 km long train. But I digress ... 

So, my lowest low and highest high today were at 09:17 this morning when I prepared leaving for work (hey, it's vacation time, almost everyone else is out and I can go a bit later than usual, right?). A few minutes earlier I had just told MIRA to run on the very same PacBio test set she had successfully worked on the night before to see how stable assemblies with this kind of data are (quite well so far, thank you for asking). 

Reaching out to switch off the monitor and leave, MIRA suddenly came back with a warm and cosy little error message which she's taken the habit lately to have a mischievous pleasure to present. This time, she claimed there had been an illegal base in the FASTQ file. 

Hey, MIRA, wait a minute! I thought. Yesterday and tonight you ran on the very same data file with the very same parameters for two times three hours and even gave me back some nice assembly results. And now you claim that the INPUT data has errors?! Come on, you're not serious, are you?

  As a side note: she then just gave me back that look, you know, the one with those big open eyes almost hidden behind by long, dark lashes ... and slightly flushed cheeks accompanied by pointed lips. As if she wanted to say I am innocent and I did no nothing wrong, you disbeliever!

This usually announces a major pouting round of hers, something which I'm not looking forward to, I can tell you.



Two restarts later with the same negative result (MIRA can be quite stubborn at times) I had to give in and decided to sit down again and investigate the problem.
 
So ... read number 317301 at base position 246, eh? Let's have a look."

*clickedyclick*
Read 317299, 317300 ... 317301 ... there we are.

*hackedyhack*
Base position 239, 240 ... now: C G G G T C F A A ... wait! What? 'F' ... 'F'?!? That's not even an IUPAC code. What's a frakking 'F' doing in the FASTQ input file?! (No plan what a frak is? See the following clip which is conditionally safe for work)


Indeed, 'F' is not a valid nucleotide IUPAC code. Even more mysterious to me was the fact that just the night before it apparently had not been there. Or had it? I now was pretty unsure where this path would lead me, as if I had unlocked a door with the key of imagination. Beyond it: another dimension - a dimension of sound, a dimension of sight, a dimension of mind. I was moving into a land of both shadow and substance, of things and ideas. I just crossed over into ... the Twilight Zone: "G#-A-G#-E-G#-A-G#-E" at 128 bpm, for more info see:


But I digress again. Where was I? Ah, yes, the 'F'.

So, how did that 'F' appear in the FASTQ, and where had it been the night before? Out to town, ashamed of not being a nucleotide and getting a hangover without telling anyone up-front? Or did it surreptitiously sneak in from the outside, murdering an innocent base and taking its place in hope no one would note? I didn't have the slightest clue, but I was determined to find that out.

First thing to check: the log files of the successful runs the previous night. MIRA's very chatty at times and tidying up after her has always been a chore (there, I'm feeling that look again in my back), but now was one of those occasions where not gagging her paid out as poking around the files she left behind proved to be interesting. Read 317301 showed the following at the position in doubt: "C G G G T C ___G___ A A" Without question: a 'G', and no 'F' in sight!

So MIRA had been right and the 'G' in the sequence of the file mysteriously mutated into an 'F' overnight. I must admit that I had grown suspicious of her in the past few weeks as she had seemed to become uncooperative at times. In particular she had been screaming at me a couple of times during rehearsal of combined 454 and Illumina assemblies for the premiere of her new 3.4.0 show. She claimed that some uninvited spider monkeys (see http://dict.leo.org/ende?search=Klammeraffe) had frightened her so much she refused to continue to work and simply scribbled the '@' sign all over her error messages. I had not been able to find out how those critters entered MIRA's data and had even enrolled a few volunteers to rehearse different assemblies with MIRA ... to no avail as she'd performed without flaws there.

While reconsidering all these things, something suddenly made *click*.

The character 'G' has the hexadecimal ASCII table code 0x47 (or in 8-bit binary: 01000111). 'F', as preceding character of 'G' and the table having some logic behind it, has the hex code 0x46, which is 01000110 in 8-bit binary.

The ATINSEQ-bug (@-in-seq) I had been desperately hunting in the past few weeks (and which had held up the release of MIRA 3.4.0) was due to the "@" character sometimes mysteriously appearing in sequences during the assembly of MIRA. The '@' sign in the ASCII table has the hex code 0x40 (binary: 01000000). In the ASCII table, there is one important character for DNA assembly which is very near to the '@' character ..., so near that it is the successor of it: the 'A' character. Hexadecimal 0x41, binary 01000001. 

I had always thought that a bug in MIRA somehow corrupted the sequence, but what if ... what if MIRA was actually really innocent?! I had never taken this possibility into account as any other explanation attempt would have seemed too far stretched.

But now I had a similar effect outside of MIRA, in the Linux file system!


      File system    MIRA
before      G01000111      A01000001
after      F01000110      @01000000

The difference between the characters is in both cases exactly 1 bit which changes, and it's even at the same position (last one in a byte) and changing into the same direction (from '1' to '0'). 

I was now sure I was on to something: bit decay.

But how could I prove it? Well, elementary my dear Watson: When you have eliminated the impossible, whatever remains, however improbable, must be the truth. 

Suspects:
  • the problem is caused either by MIRA or one of the components of the computer: CPU, disk, disk/dma controller, RAM.
Facts:
  • an artefact was very sporadically observed during MIRA runs where sequences (containing lot's of 'A') suddenly contained at least one '@'. This occurred after several passes, i.e., not on loading.
  • an artefact was observed in the Linux file system where a 'G' mutated suddenly and overnight to a 'F'.
  • both artefacts are based on one bit flipping, perhaps even to the same direction all the time.
  • when loading data, MIRA does not use mmap() to mirror data from disk, but physically creates a copy of that data.
  • MIRA loaded the data twice flawlessly before the artefact in the file system occurred.
Deduction 1:
  • MIRA is innocent. The artefact in the file system happened outside of the address space of MIRA and therefore outside her control. MIRA cannot be responsible as the Linux kernel would have prevented her from writing to some memory she was not allowed to.
Further facts:
  • the system MIRA ran on had 24 GiB RAM
  • even with a KDE desktop, KMail, Firefox, Emacs and a bunch of terminals open, there is still a lot of free RAM (some 22 to 23 GiB).
  • Linux uses free RAM to cache files
Deduction 2:
  • when loading the small FASTQ input file in the morning, Linux put it into the file cache in RAM. As MIRA almost immediately stopped without taking much memory, the file stayed in cache.
Further facts:
  • the drive with the FASTQ file is run in udma6 mode. That is, when loading data the controller moves the data directly from disk to RAM without going via the processor
  • subsequent "loading" of the same FASTQ into MIRA or text viewer like 'less' showed the 'F' character always appearing at the same place.
Deduction 3:
  • the CPU is innocent! It did not touch the data while it was transferred from disk to RAM and it afterwards shows always the same data.
  • the disk controllers and UDMA controller are innocent! Some of the glitches observed in previous weeks occurred during runs of MIRA, inside the MIRA address space, long after initial loading, when UDMA had already finished their job.
From deductions 1, 2 & 3 follows:
  • it's not MIRA, not the CPU, nor the disk & UDMA controller
Suspects left:
  • RAM
  • Disk
Well, that can be easily tested: shut down the computer, restart it and subsequently look at the file again. No file cache in RAM can survive that procedure. Yes, I know, there are some magic incantations one can chant to force Linux to flush all buffers and clear all caches, but in that situation I was somehow feeling conservative. 

Lo and behold, after the above procedure the FASTQ file showed an all regular, good old nucleic acid 'G' in the file again. No 'F' to be seen anywhere. 

Deduction 4:
  • the disk is innocent.
Deduction 5:
  • as all other components have been ruled out, the RAM is faulty.

Ups 'n downs

As I wrote: life's a roller coaster.

Up: MIRA is innocent! There, she's giving me that look again and one would have to be blind to oversee the "told you so" she's sending over with it. 
Down: My RAM's broken and I need to replace it. Bought it only last March, should still be under guarantee, but still ... time and effort.
Up: I did not sell my old RAMs, so I can continue to work
Down: 12 GiB feels soooooo tight after having had 24.
Up: I can wrap up MIRA 3.4.0 end of this week with good conscience!
Down: How the hell am I gonna tie all loose bits and pieces in the documentation in the next 24 to 48 hours?
Looping: later that morning MIRA again helped me at work to locate in a couple of minutes a mutation in an eukaryotic strain important for one of our Biotech groups. Oh boy, do I love sequencing and MIRA.


Have a nice Friday and a good week-end,
  Bastien

PS: while celebrating with MIRA tonight, I expressed my fear that some people might find it strange that I anthropomorphise her. They could think I went totally nuts or that I needed an extended vacation (which I do btw). She assured me that no one would dare thinking I were insane ... and if so, she would come over to their place and give them that look.

How utterly reassuring.

No comments:

Post a comment