2298: Coronavirus Genome

Explain xkcd: It's 'cause you're dumb.
Title text: Spellcheck has been great, but whoever figures out how to get grammar check to work is guaranteed a Nobel.


This comic is another in a series of comics related to the COVID-19 pandemic.

It was also the first in a new series, followed in the next comic by 2299: Coronavirus Genome 2.

Megan is a geneticist doing research on the SARS-CoV-2 virus. She is analyzing the virus's genome, its genetic material composed of RNA. The genomic sequence can be represented as a list of nucleotide bases (guanine, adenine, cytosine, thymine and uracil - often abbreviated as G, A, C, T, and U).

The nucleotide sequence displayed is a 100% match to six SARS-CoV-2 sequences in public databases, all of them originating from the East Coast of the United States. The sequence is from nucleotides 26202-26280 of the virus genome and overlaps an unknown open reading frame/gene named ORF3a. One of the matching sequences is [1]. However, SARS-CoV-2 is an RNA virus, so its genetic material (which contains no DNA) would not include thymine (T) but would use uracil (U) instead. The sequence is written in DNA codes because RNA sequencing involves copying the genome into DNA, and the DNA code is more familiar anyway.
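The T-versus-U point can be illustrated with a minimal sketch. The function names below are hypothetical, chosen only for this example; the conversion is a plain letter substitution and says nothing about how sequencing itself works.

```python
def dna_to_rna_notation(sequence: str) -> str:
    """Rewrite a sequence given in DNA letters (A, C, G, T) using RNA letters (A, C, G, U)."""
    return sequence.upper().replace("T", "U")


def rna_to_dna_notation(sequence: str) -> str:
    """Inverse mapping: the convention used when storing RNA genomes in DNA letters."""
    return sequence.upper().replace("U", "T")


# First nine bases of the sequence shown in the comic.
comic_fragment = "TACTAGCGT"
print(dna_to_rna_notation(comic_fragment))
```

Because the mapping is one-to-one, no information is lost by storing an RNA genome in DNA letters.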

Cueball is surprised that Megan and her colleagues actually use Microsoft Notepad, a simple text editor, to look at the genome, instead of more modern technology. She explains that better research institutions use Microsoft Word, a more advanced editor, to allow additional formatting (such as bolding and italics), and humorously calls this "epigenetics". In the real world, epigenetics is the study of changes that are not caused by changes in nucleotides, but by chemical modifications of DNA or chromosomes that cause changes in patterns of gene expression and activation, sometimes several generations down. This might be considered analogous to altering the meaning of a text by changing its formatting rather than the content; for example, content can be moved into parentheses or footnotes to be de-emphasized, or rendered in boldface or enlarged to attract attention and emphasize key points. Much as text can be wrapped in HTML tags or similar markup to change its formatting, nucleotides can be methylated to prevent transcription, and the histones around which DNA is wound can also be modified to promote or repress gene expression. During DNA replication, these modifications are often also reproduced.

The real punchline comes when Megan explains that they detect mutations with spellcheck: each identified virus genome is added to the spellcheck dictionary, so any later sequence that differs gets a red underline. Overall, Megan uses ridiculously and humorously crude methods to analyze a genome of major scientific importance. The genome of SARS-CoV-2 is almost 30,000 nucleotides long, which exceeds the longest words of any natural language by two orders of magnitude (the longest words ever used in literature -- i.e. not constructed in isolation simply for the purpose of being a long word, or chemical formulas -- approach 200 letters), and may exceed the capabilities of any available spell-checking program. Furthermore, a spellcheck program underlines the whole word if a single letter is wrong, not just the letter itself. Thus, it would not be able to highlight individual mutated bases. Megan might be better served by using a diff tool, but most scientists use dedicated software designed to view, annotate, and edit DNA sequences (e.g. Snapgene, Geneious, DNAstrider, ApE).
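The difference between a spellcheck-style lookup and a diff-style comparison can be sketched as follows. This is only an illustration: the function names are hypothetical, the "sample" sequence is an invented variant rather than a real mutation, and the comparison handles substitutions only (real tools also handle insertions and deletions).

```python
def spellcheck_flags(word, dictionary):
    """Mimic a spellchecker: the whole word is flagged if it is not in the dictionary."""
    return word not in dictionary


def find_substitutions(reference, sample):
    """Compare two equal-length sequences base by base.

    Returns a list of (1-based position, reference base, sample base).
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles sequences of equal length")
    return [
        (index + 1, ref_base, sample_base)
        for index, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


reference = "TACTAGCGTGCC"  # first 12 bases of the sequence shown in the comic
sample = "TACTAGCATGCC"     # hypothetical variant, invented for illustration

# The spellcheck approach only says that something, somewhere, is different.
print(spellcheck_flags(sample, {reference}))

# The base-by-base comparison reports where the difference is.
print(find_substitutions(reference, sample))
```

The first check reports only that the sample is not in the "dictionary", while the second pinpoints the changed base, which is what a diff tool adds over spellcheck.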

The title text mentions grammar checking and claims that whoever discovers how to use that to compare genomic material should be awarded a Nobel Prize. Spell-checking is analogous to comparing sequences against ones previously known, an activity that is the bread and butter of bioinformatics nowadays. Grammar checking would be analogous to having some sort of sense as to how well all the sequences generally cooperate and interact to create possibly viable functionality in an organism, something we are unable to do at the moment except in very limited ways and only in a few simple cases. It may also be a snarky commentary on the untrustworthy nature of grammar-check programs in general, which often follow grammatical rules far more strictly than is practical; it's not uncommon for an author to follow a grammar-check recommended correction only to find the corrected portion is now part of a longer portion that the checker deems "incorrect".

Amusingly, this and the title text foreshadowed the usage of an MIT language learning algorithm to predict mutations in SARS-CoV-2.


[Megan sits at a desk, working on a laptop. A genome sequence is displayed on her laptop screen, shown with a jagged line in a text bubble.]
Cueball (off-screen): So that's the coronavirus genome, huh?
Megan: It is!
Laptop: <A long string of unintelligible letters, presumably the genome>
[Cueball walks up and stands behind Megan, still working on the laptop.]
Cueball: It's weird that you can just look at it in a text editor.
Megan: It's essential!
Megan: We geneticists do most of our work in Notepad.
[A frameless panel, Cueball still standing behind Megan. Megan rests her arm on the chair.]
Cueball: Notepad?
Megan: Yup! Nicer labs use Word, which lets you change the genome font size and make nucleotides bold or italic.
Cueball: Ah, okay.
Megan: That extra formatting is called "epigenetics".
[A regular panel. Cueball still stands behind Megan, this time with his hand on his chin.]
Cueball: Hey, why does that one have a red underline?
Megan: When we identify a virus, we add its genome to spellcheck. That's how we spot mutations.
Cueball: Clever!



Epigenetics is a pun, right? I think it's a pun but I don't know what and it's maddening. That's right, Jacky720 just signed this (talk | contribs) 23:03, 24 April 2020 (UTC)

...Epigenetics is a real thing—the study of how changes in things other than the genome itself can be passed down between generations. An example is conditioning a mouse to be scared of the smell of oranges/cherries/almonds by having them associate the scent of acetophenone with an electric shock, then testing whether its pups also have the same fear of that smell: they do, but this obviously can't be by the genome itself changing (no component of this has a lot of ionizing radiation[citation needed]). Whatever causes this is the topic of actual epigenetics. --Volleo6144 (talk) 00:12, 25 April 2020 (UTC)
I know that, I added the link to the article. But afaik that has nothing to do with how the genome is formatted in Word, and I think it's a pun. That's right, Jacky720 just signed this (talk | contribs) 00:31, 25 April 2020 (UTC)

since when does notepad have spellcheck? 23:05, 24 April 2020 (UTC)

Neither Notepad nor WordPad has spellcheck. I suspect he combined two jokes and the link from spellcheck to Word was not well established. Quinoje (talk) 19:35, 27 April 2020 (UTC)
Word does, so maybe she is using Word instead? Kind of contradictory. 23:14, 24 April 2020 (UTC)
I assumed Randall meant Wordpad, which iirc is an upgrade from notepad but has a really thinned out set of Word's features. Maybe there's a spellcheck in there? (haven't used it in ~10 years) Xseo (talk) 07:47, 27 April 2020 (UTC)

Very disappointed that she's using Notepad and not Notepad++. I mean, really... Cellocgw (talk) 15:38, 27 April 2020 (UTC)

When Dr. Theall first scanned Finnegans Wake, he had to tell Microsoft the language was Old Icelandic.

The OCR kept trying to spellcheck Finnegans Wake. 15:11, 26 April 2020 (UTC)

True Story: In the 1980s, as part of the Work Experience initiative at my school, I was assigned to one of my local council's offices (I'd applied for their computer department, but someone else got that). I don't think the word-processor I used at home (Psion Exchange) had spellcheck, but the one the office used (Lotus? Can't actually recall, but it, like most things, was DOS-based) definitely had, and it was very easy to edit in new words. Inspired by the chemistry lessons I'd recently had, and some 'reports' I was asked to write (keeping the kid busy, more like!) that dealt with chemical degradation of concrete under the action of salt and suchlike, I of course added "NaCl" then absolutely any other chemical formulae I could think of. "H2SO4" was an early one (partial subscript formatting wasn't relevant to the spill-chucker) but I eventually got round to CH4, C2H6, C3H8, etc, and then as many of the derived alcohols, alkenes, alkynes, etc that I could be bothered to type in. Which were a lot. By the end I was 'confident' that nobody would ever type any correct chemical formula into that machine (no network-shared resources!) and have to worry about false-positive typo alerts. Yeah, well, I was still at school and thought I knew everything. 23:37, 24 April 2020 (UTC)

Can confirm: virus genomes are looked at in notepad. I worked at one of the national laboratories for a summer, experimenting with ways to check for the length of a gene and strength of genetic expression in various circumstances in E. coli. We used notepad because even old computers can open very large files without difficulty, and all our scripts were in Perl, which can easily output to .rtf or .txt file formats. These files are huge, by the way. If you hold down on the scroll bar so it's zooming to the bottom, you could be waiting 20 minutes to reach the end depending on the number of kilobase pairs in your microbe. And epigenetics is not a pun. It's a real word. 00:15, 25 April 2020 (UTC)

even old computers can open very large files without difficulty - Depending on what you mean by "old" and "very large" that may well not be true. In Windows 3.x, Notepad could open files as large as 54 KB, increasing to 64 KB in Windows 95, 512 MB in Windows 8 and 1 GB in Windows 10. I don't know which of those would fit a typical virus genome, but I'm guessing it's not all of them. 13:43, 27 April 2020 (UTC)
Well, SARS-CoV-2 has around 30 kb, and that's considered big already. Since a base is a letter and thus a byte, a viral genome usually fits in the old notepad. But here is the catch: when people align things you get the number multiplied by however many genomes they are looking at. And don't even talk about the Nucleocytoviricota-whatsoever-twats.-- 06:11, 5 May 2020 (UTC)
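The arithmetic in the two comments above can be sketched roughly as below. The Notepad limits are the figures claimed in the comment, not independently verified; one byte per base is an assumption (plain ASCII, ignoring headers and line breaks), and the function names are hypothetical.

```python
GENOME_BASES = 30_000  # approximate SARS-CoV-2 genome length quoted above
BYTES_PER_BASE = 1     # assumption: plain ASCII text, one letter per base

# Notepad size limits as claimed in the comment above (not independently verified).
NOTEPAD_LIMITS_BYTES = {
    "Windows 3.x": 54 * 1024,
    "Windows 95": 64 * 1024,
    "Windows 8": 512 * 1024**2,
    "Windows 10": 1024**3,
}


def alignment_size_bytes(genome_count, bases=GENOME_BASES):
    """Approximate size of a file holding genome_count genomes of equal length."""
    return genome_count * bases * BYTES_PER_BASE


def max_genomes(limit_bytes, bases=GENOME_BASES):
    """How many whole genomes of the given length fit under a size limit."""
    return limit_bytes // (bases * BYTES_PER_BASE)


for version, limit in NOTEPAD_LIMITS_BYTES.items():
    print(f"{version}: room for about {max_genomes(limit)} genome(s)")
```

Under these assumptions a single genome fits even under the oldest limit, but an alignment of more than a couple of genomes would not.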

Concurrent to the work in the medical community, work is underway in various open source software communities to fix bugs and other issues with software (eg genome analysis tools) that is useful to the scientists combatting COVID-19. These include the Debian "biohackathon" (https://lwn.net/Articles/816280/) as well as support from Mozilla (https://lwn.net/Articles/816386/). Parallel to these efforts, the FSF (Free Software Foundation) has focused on the shortage of medical equipment: https://lwn.net/Articles/816392/ 00:34, 25 April 2020 (UTC)

I’m suddenly inspired to write a DNA-edit-mode for Emacs (if it doesn’t have it already) which would allow for the virus spell check as described in this comic. 04:16, 25 April 2020 (UTC)

the dna-mode for emacs does exist. Google for it. It is not very useful for real work, though. Heikkil (talk) 04:40, 26 April 2020 (UTC)

Derek Lowe has some insights about actual coronavirus mutations here, if you are interested.

Given coronavirus has an RNA genome, shouldn't all the 'T's be replaced by 'U's?

It is standard practice not to use U's in public sequence databases. It simplifies things. Heikkil (talk) 04:40, 26 April 2020 (UTC)

The sequence in the transcript does not actually appear on the site mentioned in the explanation. In fact, when I google for 'TACTAGCGTGCCTTTGTAAGCACAAGCTGATTAGTACGAACTTATGTACTCATTCGTTTCGGAAGAGACAGGTACGTTA' I only get this particular site. (talk) 07:00, April 25, 2020 (please sign your comments with ~~~~)

To find this (or any) sequence go to [Blast] and paste the query into the box. You will receive a list of a number of best matches (10, 50 or 100 in standard search), this should look like [[2]]

Interestingly, this is a US-specific strain of the virus (top result currently is "Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/NC_0025/2020"). Tier666 (talk) 23:21, 25 April 2020 (UTC)

Well, obviously it's a new variant, yet unknown to other clinical studies. Of RNA that has switched to looking like DNA, so this is a hot discovery! 12:05, 25 April 2020 (UTC)
The site shows several views into the public database entry that are easier for humans to understand than the raw sequence. Click the link at 'View: TEXT' and scroll down. The relevant lines look like this:
     aatccagtaa tggaaccaat ttatgatgaa ccgacgacga ctactagcgt gcctttgtaa     26220
     gcacaagctg attagtacga acttatgtac tcattcgttt cggaagagac aggtacgtta     26280
As you can see, these are not meant to be searched for and compared in "a notepad". For the same reason, Google does not index DNA sequence database entries. There are specialised tools for that.
The sequences were published this month, so they are available only in the most recent sequence database updates. Heikkil (talk) 04:40, 26 April 2020 (UTC)
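As a rough illustration of why a plain text search fails on those lines, the sketch below strips the spacing and the trailing position numbers from the two quoted lines and then looks for the comic's sequence. It assumes the number at the end of each line is the position of that line's last base, as the layout above suggests; the function names are hypothetical and this is not how BLAST works.

```python
# The two database lines quoted above: blocks of ten bases, then a running position.
RECORD_LINES = [
    "     aatccagtaa tggaaccaat ttatgatgaa ccgacgacga ctactagcgt gcctttgtaa     26220",
    "     gcacaagctg attagtacga acttatgtac tcattcgttt cggaagagac aggtacgtta     26280",
]

# The sequence from the comic, as quoted earlier in this discussion.
COMIC_SEQUENCE = (
    "TACTAGCGTGCCTTTGTAAGCACAAGCTGATTAGTACGAACTTATGTACTCATTCGTTTCGGAAGAGACAGGTACGTTA"
)


def parse_record_lines(lines):
    """Join the base blocks and work out the genome position of the first base."""
    blocks = []
    last_position = 0
    for line in lines:
        fields = line.split()
        last_position = int(fields[-1])      # assumed: position of the line's last base
        blocks.append("".join(fields[:-1]))  # everything before it is sequence
    sequence = "".join(blocks).upper()
    first_position = last_position - len(sequence) + 1
    return sequence, first_position


def locate(query, lines):
    """Return the (start, end) genome positions of query, or None if it is absent."""
    sequence, first_position = parse_record_lines(lines)
    offset = sequence.find(query.upper())
    if offset < 0:
        return None
    start = first_position + offset
    return start, start + len(query) - 1


print(locate(COMIC_SEQUENCE, RECORD_LINES))
```

Once the spacing, case, and position numbers are normalized, the comic's sequence is found at the positions given in the explanation (26202-26280).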

I have had trouble opening .txt files of even a hundred KB in Notepad! Sometimes it even crashes... It's one of the reasons I started using Notepad++. Notepad++ also happens to have a very extensible spellcheck, & language-specific formatting options. Since I often need to use Windows machines, it's one of my most frequently installed apps, after 7Zip. ProphetZarquon (talk) 18:03, 25 April 2020 (UTC)

The Grammar Checker concept only has a limited analytical sophistication, though I don't doubt it'd still be enough to get a Nobel given the complexity of the task of deriving trivially feasible sequences from total codswallop. I also added the "next step" (probably much more than a single step), when I revised things, but that might actually be overstepping the explanation of the comic and removable. 20:32, 25 April 2020 (UTC)

Thanks for mentioning this in the discussion area, as I wondered what that "next step" line meant when I read it a little while ago, let alone how it related to the comic. I'll go ahead and trim that last "next step" sentence off the end, as I think it is unnecessary. Ianrbibtitlht (talk) 03:36, 26 April 2020 (UTC)

Is using Notepad to analyse RNA sequences more or less sane than using a spam filter to play chess? - Angel (talk) 00:43, 27 April 2020 (UTC)

Is that filter used to prevent emails pretending to be from Czech mates looking to give you a knight to remember in a message full of pawn images? 15:10, 27 April 2020 (UTC)

Just stumbled on this. I wonder if Japanese spell checker tech (like many logographic scripts, words aren't separated by whitespace) would work for strings of nucleotide letters. Normally, you try to match the longest possible strings with algorithms like BLAST, but maybe the spellcheckers get so much optimization that they're more efficient. Or maybe spellcheckers should use BLAST. Ericprud (talk) 18:04, 23 November 2022 (UTC)