1737: Datacenter Scale

Title text: Asimov's Cosmic AC was created by linking all datacenters through hyperspace, which explains a lot. It didn't reverse entropy--it just discarded the universe when it reached end-of-life and ordered a new one.

Explanation

This comic takes to its limit the strategy that letting cheap hardware fail and simply replacing it costs less overall than buying robust but much more expensive systems to begin with. The technique was made famous by Google circa 1999, when its successful, cost-effective server designs used sub-consumer, nearly junk hardware.

RAID ("redundant array of independent disks") is a technology that combines several hard drives so that they behave as one. RAID comes in several levels (varieties) with different applications, but one of its major uses is creating mirrored hard disks that back each other up. If one disk drive in such a RAID fails, no data is lost.
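The mirroring idea can be illustrated with a minimal Python sketch. This is a toy model, not how any real RAID controller or software RAID is implemented: the class name `MirroredArray`, its methods, and the use of in-memory dictionaries as "disks" are all assumptions made for illustration.

```python
class MirroredArray:
    """Toy model of a mirrored (RAID 1 style) array. Hypothetical, for illustration only."""

    def __init__(self, disk_count=2):
        # Each "disk" is a dict mapping block number -> data.
        # A failed disk is represented by None.
        self.disks = [{} for _ in range(disk_count)]

    def write(self, block, data):
        # Every write is copied to every disk that is still working.
        for disk in self.disks:
            if disk is not None:
                disk[block] = data

    def fail(self, index):
        # Simulate a drive crash: the disk and its contents are gone.
        self.disks[index] = None

    def read(self, block):
        # Any surviving mirror can serve the read.
        for disk in self.disks:
            if disk is not None:
                return disk[block]
        raise OSError("all mirrors have failed")


array = MirroredArray(disk_count=2)
array.write(0, b"important data")
array.fail(0)                      # one drive dies
recovered = array.read(0)          # the data is still readable from the mirror
```

Data is lost only once every mirror has failed, which is why a single drive failure in such an array is survivable.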

However, RAID is complicated to configure, so you don't want to be constantly setting it up. An alternative technique for data centers is, therefore, to simply send the data to several servers at once. This makes maintenance easier, but without RAID, one hard disk crash basically breaks the server. This is nevertheless what Hairbun's company does: their scale is so large that fixing an individual server is more expensive than simply buying a replacement, so instead of fixing the drive they throw away the whole machine. (This approach is discussed further below.)

From here, the comic starts to exaggerate. Nowadays, servers can be made extremely small ("blade servers"), and dozens of them can be mounted in one 19-inch rack in a data center. Rather than going to the effort of unplugging and unscrewing one blade from the rack, when a blade fails at Cueball's data center they just throw away the rack. Ponytail agrees and mildly mocks Hairbun for replacing a single server.

Hairy's data center goes one step further - they have so many servers that they would constantly have to be throwing away and replacing racks, so instead they just build a new room when one rack fails. This is currently possible with small modular data centers that are built in shipping containers for easy transport and can be linked together to expand capacity. Here the cargo-container "room" with the failure would be quickly swapped for a fresh one. Cueball adds "like Google!" - Randall previously mentioned Google's approach to hard drive failures in the what if? article Google's Datacenters on Punch Cards. Back in 2007 they had one failure every few minutes, a rate that may have increased hugely since then.

Finally Megan appears, and her company, of course, takes the exaggeration off the scale. She says that they don't have any fire suppression (neither regular sprinklers nor the systems that deploy gases like FM-200, which alter the room air's ability to sustain a fire). Rather, they just rope the center off, thus letting the data center burn down. Then they simply move a town over and build a new one. This may indicate that they are so big that the entire town will burn down if their center catches fire; otherwise they would not have to skip town. Alternatively, they just leave the center burning, and since this may cause problems in that town, they simply flee the premises.

Most big internet companies do have multiple redundant data centers around the world, in order to increase speeds for users in different countries, but Megan's idea would be very expensive, result in increased latency, possibly kill people (either in their company, or other people in the town, since they do not try to put out the fire), and cause severe destruction of properties in addition to their own. These last two items would result in additional litigation and fines, and potentially jail sentences for the people charged with implementing the policy. They may also result in other towns being unwilling to take their business, out of fear they will wind up burning too.

Hairy still thinks that it makes sense, while Cueball wonders what difference the roping off makes. This could again be a reference to the fact that they just let the buildings burn without bothering about the local consequences, so that dropping the rope is just one more step towards the extreme of the title text. Or he is contemplating that they are just wasting more time by roping it off.

This comic references how, as data requirements expand, the cost of time eventually outweighs the cost of hardware at ever increasing scales (drive, rack, room, building). While this comic takes this to the extreme, with whole buildings being destroyed for simple flaws, the concept is not as far-fetched as it seems if "thrown out" is taken to include being sold to equipment refurbishers. It could indeed be cost effective for a large data services provider to resell racks or even whole data center modules at some significant fraction of their "as new" price as opposed to expending the time and effort to attempt a repair. The equipment refurbisher would then rely on a cost advantage like cheaper labor to repair the flaw and sell it back to Google or another company with less demanding requirements. Equipment rental firms already operate on this model, and with the added incentive of customers preferring to rent newer models, this means that the equipment is often preemptively replaced before failures even occur.
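The trade-off can be written down as a rough comparison, sketched here in Python. The function names, the cost model, and all of the figures are hypothetical assumptions chosen for illustration; they do not come from the comic or from any real provider.

```python
def net_replace_cost(new_price, resale_fraction):
    """Cost of buying new hardware after reselling the broken unit.

    resale_fraction is the share of the "as new" price a refurbisher pays
    (a hypothetical figure between 0 and 1).
    """
    return new_price * (1.0 - resale_fraction)


def repair_cost(technician_hours, hourly_rate, parts_cost=0.0):
    """Cost of having a technician diagnose and fix the unit in place."""
    return technician_hours * hourly_rate + parts_cost


def should_replace(new_price, resale_fraction, technician_hours, hourly_rate, parts_cost=0.0):
    """True when swapping the unit is cheaper than repairing it."""
    return net_replace_cost(new_price, resale_fraction) < repair_cost(
        technician_hours, hourly_rate, parts_cost
    )


# Made-up numbers: a 1000-unit server that resells at 70% of its price,
# versus four technician hours at 100 per hour.
swap_it = should_replace(1000, 0.7, technician_hours=4, hourly_rate=100)

# With a poor resale value the same repair is worth doing.
fix_it = not should_replace(1000, 0.1, technician_hours=4, hourly_rate=100)
```

Under this simple model, the higher the resale fraction and the more expensive the technician's time, the larger the unit (drive, server, rack, room) that is worth swapping out whole.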

The title text refers to Isaac Asimov's science-fiction short story The Last Question (comic version), where humanity asks, at different stages of its spatial and technological development, the same question to increasingly advanced computers: "How can the net amount of entropy of the universe be massively decreased?". At each point, the computer's answer is that it does not yet have sufficient data for a meaningful answer. Ultimately, the computers are all linked through hyperspace, outside the physical boundaries of the universe, and make up a single computing entity named AC which keeps pondering the question even as the heat death of the universe occurs and time and space cease to exist. When AC finally discovers the answer, since there is nobody left to report it to, it decides to demonstrate it and says "LET THERE BE LIGHT!", which are the first words said by God during the Creation, according to the Book of Genesis. Here, the title text implies that, as the universe died, AC no longer had a use for it as a physical support and, taking the comic's logic to the next extreme, chose to discard it and get a brand-new one instead of bothering to "fix" it by reversing its entropy. This short story was also referenced in 1448: Question.

This comic's concept of taking a real-world phenomenon and exaggerating it to levels currently considered implausible for comic effect closely mimics an earlier comic which describes progressively more "hardcore" programmers in 378: Real Programmers. This comic might be related to 1567: Kitchen Tips which suggests not throwing away your dishes but washing them, and 2033: Repair or Replace, which is also about discarding servers instead of fixing them.

Transcript

[Zoom in on Hairbun holding her hand palm up in front of her, talking to people off-panel right.]
Hairbun: RAID controllers don't make sense at our scale; everything is redundant at higher levels. When a drive fails, we just throw away the whole machine.
[In this frame-less panel it is revealed that Hairbun was talking to Cueball and Ponytail, who are looking her way.]
Cueball: Machine? We throw away whole racks at a time.
Ponytail: Yeah, who replaces one server?
[Hairy has appeared from the left and holds one hand palm up towards the other three; Hairbun has also turned towards him.]
Hairy: We just replace whole rooms at once. At our scale, messing with racks isn't economical.
Hairbun: Wow.
Cueball: Like Google!
[Megan walks in from the left, and everyone including Hairy now looks towards her. Cueball has raised a hand to his chin. The replies to Megan are written in a clearly smaller font.]
Megan: We don't have sprinklers or inert gas systems. When a datacenter catches fire, we just rope it off and rebuild one town over.
Hairy: Makes sense.
Cueball: I wonder if the rope is really necessary.



Discussion

While the comic is obviously exaggerating, there are situations where this could make a certain amount of sense. If you can design a server so that most or all of the components reach end-of-life at about the same time, then if a hard drive fails on one server, every other component of that server is likely to fail soon as well.

If you install entire server racks or server rooms at the same time, where every machine contains components with the same basic life cycle...

then in theory, once the first component fails, you can ignore it until mass component failures cause the entire rack/room to fall below a certain readiness level.

At that point, there's no reason to pay a technician to spend several days removing and replacing half the individual components throughout that rack/room, when the other half are just going to fail in the next few months anyway. In theory, it might be economically more efficient just to scrap everything at once, bring in brand-new server replacements, and re-sync the needed data from a networked backup.

In real life, it's very hard to build a server that will reliably degrade on schedule... but with the right tradeoffs, and enough long-term performance data, it might eventually become possible to do so. 162.158.74.101 04:48, 23 September 2016 (UTC)

Or give the equipment to someone with a different time/ROI equation. I've seen a lot of time/expense burned on a transient failure that turned out to be a cheap data cable. A kid/disadvantaged would have time to tinker this out with a potentially significant payoff. Elvenivle (talk) 17:27, 23 September 2016 (UTC)

The title text is referring to The Last Question by Isaac Asimov. EpicWolverine (talk) 04:56, 23 September 2016 (UTC)


I cannot help but read this in a fake Yorkshire accent. https://en.wikipedia.org/wiki/Four_Yorkshiremen_sketch 141.101.98.113 09:55, 23 September 2016 (UTC)

I wonder how closely the AC and Douglas Adams' Deep Thought are related? 188.114.102.167 (talk) (please sign your comments with ~~~~)

Not that close, as Deep Thought was built inside this universe and also finished its job and was recommissioned. They built a new computer (Earth) instead to calculate what the ultimate question was, now that they knew the answer was 42. But maybe Adams was aware of AC and based the idea of solving a question with computers on that...? --Kynde (talk) 13:56, 23 September 2016 (UTC)

I think the character in Panel 1 is Science Girl and not Hairbun. PoconoChuck (talk) 12:20, 23 September 2016 (UTC)

Agreed, it fits with her style and she has appeared as an adult before. She also seems smaller than the other people, so it could indicate she is still young. I created the Science Girl and the Hairbun categories, so I should know ;-) When a character fails I just throw it out and create a new one... :p --Kynde (talk) 13:52, 23 September 2016 (UTC)
It's clearly not Science Girl, because, as the linked page says "She became the first child to have its own character category. She is distinguished by being clearly a girl (compared to adults around her or her behavior)". You may create a page called "Datacenter Woman". 108.162.221.139 14:35, 23 September 2016 (UTC)
I dunno - she's drawn exactly like Science Girl - right down to the frizzy hair below the bun that's never been seen on Hairbun (who doesn't have black hair either). There are plenty of other instances of "child" characters being seen as young adults - and of people acting out of character when "bit parts" are needed in cartoons. The resemblance is too close to be a coincidence. SteveBaker (talk) 20:41, 23 September 2016 (UTC)
No need to invoke blade servers

There's no need to refer to blade servers in the explanation. You can fit many "normal" servers into a 19 inch rack. It could just say:

From here, the comic starts to exaggerate. Many servers can be mounted in one 19-inch rack in a data center. Rather than going to the effort of unplugging and unscrewing one server from the rack, when a disk fails at Cueball's data center they just throw away the rack, and Ponytail agrees and kind of mocks the woman with a bun for replacing a single server.

162.158.83.66 14:51, 23 September 2016 (UTC)

RAID is not complicated

Simple RAID 1 is not complicated to configure, unless you have some exotic HW RAID controllers. RAID 5 would be more complicated AND requires HW, but RAID 1 will usually be simple in HW OR possible to do in SW completely automatically. What is costly is replacing discs as they fail, because that must be done by a human; in bigger systems, it makes more sense to start with RAID 1, then when one disc fails simply ignore it - neither repair it nor throw it out, just let the machine operate without the RAID. -- Hkmaly (talk) 15:41, 23 September 2016 (UTC)

Actually, depending on OS and software other RAID levels can be done in software, too. I've done RAID levels 5 and 6 fully in software using mdraid on Linux. Neither of them are really that much more complicated than RAID-1. ZFS can do even more complicated "RAID" types fully in software, too. Iguanabob (talk) 16:55, 23 September 2016 (UTC)
Gas-based data center fire extinguishers are not lethal

There's an explanation in this article which claims that gas-based data center fire suppression systems are lethal to anyone in the room when they are deployed. This is a myth, based on some very ancient systems which used halon. For decades now, agents like FM-200 have been used, which alter the air's ability to sustain a fire without removing oxygen from the room. See this video [ https://youtu.be/-ub8gwgcOns ] for an example. The camera crew is IN THE ROOM when the system is deployed. Trust me on this, I've worked in data centers for 20 years and know this stuff inside and out. IGnatius T Foobar (talk) 16:17, 25 September 2016 (UTC)