2054: Data Pipeline

Explain xkcd: It's 'cause you're dumb.
Revision as of 01:00, 4 October 2018 by 162.158.79.149 (talk) (Explanation)
Title text: "Is the pipeline literally running from your laptop?" "Don't be silly, my laptop disconnects far too often to host a service we rely on. It's running on my phone."

Explanation

This explanation may be incomplete or incorrect: Please direct all data pipelines to the explanation below and only mention here why it isn't complete. Do NOT delete this tag too soon.
If you can address this issue, please edit the page! Thanks.

In the first panel, Cueball shows Ponytail and White Hat a data pipeline he has constructed that, as he puts it, "collects and processes all the information we need". This implies that the three are working on some sort of project that requires data processing. Ponytail suspects that the pipeline is an unstable mess of scripts that will cease to function correctly as soon as any unexpected input is received. Cueball hesitantly replies that it might not be, but his hesitation suggests that she is right. Ponytail begins to concede that this is at least something, but Cueball interrupts her to report that the system has just collapsed. He adds that he can patch it, which suggests that this cycle of patching and collapsing could repeat indefinitely, or until every problem has been patched. Knowing Cueball's code, though, it seems more likely that he can't patch it.

In the title text, Ponytail or White Hat proceeds to question how such an important system can run on such a small computer. However, Cueball makes it worse by saying he uses his phone due to the better connection. While this might make the pipeline functional, it also makes it far more fragile.

This comic is a logical continuation of the Code Quality series (1513: Code Quality, 1695: Code Quality 2 and 1833: Code Quality 3), further highlighting Cueball's coding ineptitude and Ponytail's exasperation with it.

It's quite common for somebody who spends most of their time coding for enjoyment to attempt to automate absolutely everything. Whenever such a programmer sees a rote task, they think, "why is a human doing this when the time could be spent making a computer do it automatically, forever?" Unfortunately, in the absence of strong artificial intelligence, one of the places where this begins to break down is in aggregating information from multiple sources.

People tend to publish their data via a variety of different channels, and as they are not programmers and don't share the value of consistency and computer-processability, it is all in completely different formats. Some data is only available in print. Some data is only available as photographs. Some data is only available as written reports. A certain kind of nerd will see this situation and become excited, seeing the opportunity to automate something that nobody else thinks is reasonable to put the energy into. They begin writing scripts that process all the different formats that all the data is in, and eventually get the whole thing working! They can then, in theory, make a number of mind-numbing data-processing jobs obsolete.

Google put a lot of energy into conquering this challenge on many fronts during the 2000s, making data more processable everywhere and possibly hastening the advent of those strong artificial intelligences, which would thrive on the availability of already-digitized information. A notable project was Google Books, in which libraries were scoured for non-digital information, all of which was painstakingly scanned. Additionally, organizations have been increasingly pressured to offer their information in standardized formats that can all be processed the same way. This continued pressure is producing more and more results, but because the standards must be implemented by humans who gain little immediately from the process, adherence to the guidelines is rarely universal.

The workaround of building many small programs that handle all the quirks is the domain of "scraping": downloading information intended to be presented to a human, running it through software that has been pre-programmed with the patterns to expect, and normalizing and making use of the data. Anybody who has attempted this as a mere individual quickly realizes that as soon as the data source changes in the smallest way, the data becomes garbage. Often it becomes garbage in a way that is laborious to hunt down and understand, and that may not even be noticed. This would be tragic for a corporation that was relying on the results, and would act like a Trojan horse, destroying it from the inside.
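The fragility described above can be illustrated with a minimal Python sketch. Everything in it is hypothetical and invented for illustration only: the page snippets, the "Price:" label and the function name `scrape_price` are assumptions, not taken from the comic or from any real site. The scraper is written against one exact format, and two small changes at the source make it fail in two different ways, one of them silently.

```python
import re
from typing import Optional

# Hypothetical pattern: the scraper assumes prices look like "Price: $1,250".
PRICE_PATTERN = re.compile(r"Price:\s*\$([\d,]+)")


def scrape_price(html: str) -> Optional[int]:
    """Extract a whole-dollar price from a page snippet, or None if not found."""
    match = PRICE_PATTERN.search(html)
    if match is None:
        return None
    # Strip the thousands separator the scraper was written to expect.
    return int(match.group(1).replace(",", ""))


# Hypothetical snippets standing in for a downloaded page.
OLD_PAGE = "<span>Price: $1,250</span>"      # the format the scraper expects
NEW_PAGE = "<span>Price: $1.250</span>"      # source switched the thousands separator
RENAMED_PAGE = "<span>Cost: $1,250</span>"   # source renamed the label

print(scrape_price(OLD_PAGE))      # 1250
print(scrape_price(NEW_PAGE))      # 1 (wrong, and no error is raised)
print(scrape_price(RENAMED_PAGE))  # None
```

The renamed label at least produces a visibly missing value. The changed separator is the more dangerous case, because the scraper keeps returning a plausible-looking number that is simply wrong.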

Privately, though, many hobbyists can make money by collecting data that is useful to them and processing it more effectively, using computer code, than their peers do. Because they are consistently personally invested in their data pipeline, they just hack up fixes as problems occur. One website that has been maintained this way for many years is piperka.net, which is run by a single hobbyist and provides a centralized place for tracking most webcomics, including xkcd itself, using this very method of data scraping. Additionally, many companies have refined data scraping to be quite effective and reliable, such as mint.com which allows customers to consolidate their financial information all in one place.

Artificial intelligence is developing to the point that some computers are scraping data very effectively and reliably, but usually creating advanced, robust, scraping algorithms is not worth the effort -- it is more efficient to get the providers of the data to just offer it in a normalized way, hire a human to do the task, or make money through something that doesn't require as diverse processing. Hence there has been this niche filled by those hobbyists willing to put the heuristic effort in.

Transcript

This transcript is incomplete. Please help editing it! Thanks.
[Cueball is standing with an open laptop, showing it to Ponytail and White Hat.]
Cueball: Check it out - I made a fully automated data pipeline that collects and processes all the information we need.
[Ponytail is looking down at Cueball's laptop.]
Ponytail: Is it a giant house of cards built from random scripts that will all completely collapse the moment any input does anything weird?
[Borderless beat panel]
[Cueball looks at his laptop.]
Cueball: It... might not be.
Ponytail: I guess that's someth-
Cueball: Whoops, just collapsed. Hang on, I can patch it.



Discussion

tried my hand at transcripts again, hope i did ok. Nintendo Mc (talk) 15:32, 3 October 2018 (UTC)

Oddly prescient, as always. I've just finished writing a fully automated data pipeline that ingests multiple data sources (both manual and automated input), has API support, a frontend, and email dispatch capabilities entirely in Google Sheets. It was about 3x faster to code than doing it right. 172.68.65.6 16:48, 3 October 2018 (UTC)

That's so awesome ! Would you come back and let us know if it ever collapses because one of the data sources changes slightly? (or alternatively, that it _doesn't_ collapse and cueball needs to get his shit together?) 162.158.78.130 01:22, 4 October 2018 (UTC)

Just added a line about how this is a logical continuation of the Code Quality series - given it's the same two people, this should be uncontroversial. Is it worth adding a new category for "Code Quality" to group these (and likely subsequent comics) together? Grimreaperwithalawnmower (talk) 17:20, 3 October 2018 (UTC)

Quite controversial, in fact. I actually found that statement quite questionable and that it should probably be removed (only in part because the title isn't grouping it in with them). Related? Certainly. But a full part of the group? THIS Cueball seems like he's far more capable than Code Quality Cueball. THIS Cueball managed to construct a highly useful piece of software that - until the final panel - did the job they needed. The issue here is the Bobby Tables issue, that he neglected to sanitize the input, i.e. to at least write the program in a way that it could handle variety. The program relies heavily on the exact format of the data it's gathering (a format that he has no control over, it's set by the source). Okay, this suggests he's using prewritten code and connecting it together, but getting code pieces together into a cohesive whole is a considerable feat, showing some programming prowess (far better ability than CQ Cueball). NiceGuy1 (talk) 05:05, 5 October 2018 (UTC)

What could we still add to the transcript? I don't think it really needs any more transcripting so maybe we should remove the marker. Kwonunn (talk) 18:50, 3 October 2018 (UTC)

No comment about the "roll over" text (excuse me if I have the name wrong). I think this is a comment about the sheer computing power, battery life and superior connectivity of modern mobile phones compared to laptops. RIIW - Ponder it (talk) 19:05, 3 October 2018 (UTC)

IIRC, it's generally called "hover text." 162.158.74.57 (talk) (please sign your comments with ~~~~)
pretty sure it's actually "title text" Halo422 (talk) 01:09, 4 October 2018 (UTC)
It is actually "title text", though Randall calls it "alt-text" & contrary to W3C recommendations, he seems to use the same text for both. ProphetZarquon (talk) 05:13, 4 October 2018 (UTC)
Here on the Wiki people usually go with "Title text", occasionally "Mouse-over text". Which I like, partially because it's clear what it means, even to the casual visitor, and partially because it highlights my issue: I use these sites on a tablet, don't have a mouse, I can't see the text until I come here. :) RIIW has a point, this needs a paragraph about the Title Text. But no, it isn't saying phones are more reliable, it's a joke that neither should be hosting anything, neither is meant to be online 24/7. NiceGuy1 (talk) 05:32, 5 October 2018 (UTC)
Public service announcement: If you can't read the alt-text because you're on a mobile device, you should try using the mobile version of the website: https://m.xkcd.com/2054 . Its main features include the ability to tap/click the image to make the alt-text show up, and the whole thing is distraction-free. 172.68.189.181 16:59, 5 October 2018 (UTC)

Re: superior connectivity of mobile phones, see https://xkcd.com/1865/ 162.158.78.166 (talk) (please sign your comments with ~~~~)

This is exactly why I assert that anything hosted from a laptop should at least be considered for hosting from a mobile device instead. It's annoying to me that so many developers still consider a mobile device which has more connected uptime than a laptop to be unsuitable for hosting, say, a text-based game server. It's got a faster connection & more idle processing power than the PCs that used to run some of those game servers; I think my tablet could handle running a BBS Door game, for example. ProphetZarquon (talk) 05:14, 4 October 2018 (UTC)
Just to clarify in the comic, it is generally preposterous to host a server process like this from a mobile device. Mobile devices aggressively suspend computing time when processes are in the background, or when the screen is off, and their battery life is much shorter than that of a laptop. It is far more normal to modify the power settings of a laptop to not suspend when closed, which is doable on all laptops running linux, than to run a central data pipeline on a phone, and any meaningful server process has a dedicated server environment. The comic is a joke, like all of them. Cueball has a history of coding for his personal life that he is trying to apply in an environment where more resilience is needed than he is used to. 162.158.78.70 (talk) (please sign your comments with ~~~~)
Sorry for the rude tone of this comment. You do have a point that mobile phones have more connectivity, and many systems use them this way. I deleted my comment when I actually came to agree with you, but somebody undeleted it. 172.68.50.136 21:45, 5 October 2018 (UTC)
No offense taken; The comic is obviously a joke, that portable devices should not be used as hosting devices. I'm just saying it's not half as ridiculous as many devs seem to think: Certainly no services which many people rely on should be hosted from a mobile cellular device, but for personal purposes it at least makes more sense than hosting from a notebook computer in many cases. Phones have a longer battery run time than laptops (much longer) & communication apps are commonly designed not to have their connections suspended when the screen is off. Background processes stopping was a common issue a few years ago, when battery management was undergoing frequent change in the OS; but there are a ton of apps which would not work at all if this were an insoluble issue. I run a private messaging server & a torrent client from my mobile cellular device & it has more connected uptime than my desktop computer, because the computer shut off when not in use, whereas the phone never gets shut off for more than a few minutes for resets. Linux still has issues with the lid switch not doing "nothing" even when the switch is explicitly set to "do nothing", & a mobile cellular device of today has more idle resources than a server built not so many years ago. A lot of things done via the cloud would be more efficient & reliable done locally, but that's less monetizable so most apps don't try, preferring to offer a service where usage metadata = profits. The data connections of 4G cellular are faster than many of the connections used for home server setups just a decade ago. Overall, mobile cellular devices are actually a pretty good candidate for a latency tolerant server, & in many cases a user's phone is the device of theirs which remains connected most; Sometimes it's their only device. Large DBs & compute-heavy tasks obviously are not ideal, but there's a ton of low resource services which could be run on mobile devices but aren't. 
Thinking it's a bad server platform is largely erroneous & based on outdated assessments. ProphetZarquon (talk) 17:32, 10 October 2018 (UTC)
While all of the comments about connectivity between a laptop and a mobile device may be valid, I think the joke here is that any serious data processing application should not be running on either - it should instead be operated in a fixed-connectivity server-type environment instead. Ianrbibtitlht (talk) 16:37, 4 October 2018 (UTC)
Agreed. ProphetZarquon (talk) 17:32, 10 October 2018 (UTC)
I concur, the point is that nothing mobile should be HOSTING, hosted files should be on a system designed to be never off and never disconnected. Both a laptop and a cell could have their batteries die, or they have to be permanently plugged in, defeating the purpose of them being "mobile". NiceGuy1 (talk) 05:21, 5 October 2018 (UTC)
Except that many personal servers do get run on laptops & home computers with a lot of downtime, whereas phones tend to be on & connected much more of the time. Most people don't have a machine set up for server duty, but they do have a phone that runs all day every day. Battery is a non-issue: A mobile device can plug in when needed to maintain uptime; A fixed device cannot run on battery at all. (Unless you buy a battery backup, & you could buy a lot of portable batteries instead, for what one of those UPSs costs.)
Mobile devices should be hosting a lot more than they currently do. Not for business-critical enterprise-wide usage scenarios like shown in this comic, but for personal use cases where the loads are not high it's silly to think another machine is necessary. ProphetZarquon (talk) 17:32, 10 October 2018 (UTC)
Oh, very much NO, LOL! Just because people do it doesn't mean they SHOULD. :) I actually expect the title text is specifically to take a swipe at such people. Getting a cheap computer, or keeping an old computer, to act as a server at home is easy. My mother bought a new desktop computer that came with Windows Vista (back when that's what new computers came with), hasn't used it directly in about 5 years, since getting an iPad and simply entertaining herself with that. But we've got this computer set up with a nice iTunes library it shares over her home network and it just stays on 24/7. Found out the hard way that the stock power supply couldn't handle that so I replaced it with a stronger 500W one, works great. It would probably cost $40 to buy that computer second-hand now, keeping the power supply upgrade for later (worked fine as stock for years). It's up as long as there isn't a power outage, far longer than any mobile device (remembering that under normal circumstances a plugged in mobile device is stuck where it is, it can't be carried around, so it's unintuitive to keep it plugged in permanently, unlike a desktop computer which is designed to always be plugged in). Also, I don't think this comic is any kind of business thing, this seems like some kind of group project they're doing (maybe a business they're trying to LAUNCH, but not something that's available outside of the three of them), hence them experimenting. NiceGuy1 (talk) 04:54, 12 October 2018 (UTC)

Had to fix the description, it stated that Cueball reluctantly agreed with Ponytail's statement when he actually did the opposite, but his hesitation suggests she's correct. NiceGuy1 (talk) 05:21, 5 October 2018 (UTC)

The presence of White Hat is a little mysterious here because he doesn't have any lines. What could be going on? 172.69.226.143 07:41, 5 October 2018 (UTC)

He was just waiting for the next comic to start :) Hawthorn (talk) 15:46, 5 October 2018 (UTC)

Why is there a six paragraph diversion at the end of this explanation? This may be tangentially relevant, but not enough for an explanation that eclipses the size of the rest of the actual comic explanation. Consider removing it or boiling it down to one paragraph on the general topic with a link. I prefer removing it because the comic doesn't make its own connection to a wider issue. 162.158.75.142 (talk) (please sign your comments with ~~~~)

I'm not the most organized thinker in the world. A lot of the explanations on this site read like the person who wrote the explanation does not actually have experience with the topic of the comic, which is generally written as if it is something that Randall does have experience with. I tried to fix that on this comic, by sharing background from a place of experience, but I wasn't really sure what the most relevant bits were or how to integrate it into the existing work well. The existing explanation read as if Cueball was simply a horrible coder, when in reality these data pipelines are common things among programming hobbyists, and it takes experience to recognize that they are inevitably a house of cards. They're not inherently bad though: liberal input validation can be used to notify a dev when something goes wrong, so that they can fix it fast, but that needs more foresight than Cueball may have if he is running it off his phone. This data pipeline approach is used in live sites still up today. I'm sorry I'm expressing myself so verbosely; it's hard for me to be concise. 172.68.50.136 21:43, 5 October 2018 (UTC)
As someone who is overly verbose with the written word, especially online, let me say: I feel you. LOL! With me it seems I'm easily misunderstood, people I know seeming to strive to find alternate meanings I didn't detect. Which leads to writing so much nobody reads it all and misunderstandings increase. /sigh/. The thing is, here the point is to explain, providing an explanation that allows someone who completely missed the joke to now get it. Anything further should be avoided, except maybe extremely relevant trivia (which should be listed as such). That's it. In this case, Cueball shares that he put together a useful program that gathers information for them, Ponytail suspects (correctly) that he neglected to account for the source changing how they do things, and that his program will fail as soon as they do. Which promptly happens. There, comic explained. A little fleshing out for people who REALLY don't understand programming, add something about the Title Text, and we're done.
(Paragraph 2): I'm reminded of the comic about The Princess Bride. I was the perfect audience for that explanation, I had never seen the movie (but intended to, I finally saw it a few months ago). After the comic was explained - what Wesley must have done to become the Dread Pirate Roberts and remain so - I understood the comic, but like here it went on and on, providing what I guess were a synopsis of the different characters. Unnecessary and unquestionably would have provided spoilers for the movie, so I stopped reading there. That stuff should never have been there. NiceGuy1 (talk) 19:38, 10 October 2018 (UTC)
On the upside, with it running off his phone, at least he's more likely to be nearby when it goes down.  ;D
ProphetZarquon (talk) 17:32, 10 October 2018 (UTC)

Recommend completely removing everything after the paragraph mentioning the connection to the Code Quality series (which I had to fix, it stated the connection too strongly). With the title text having been just explained - usually the last part of the explanation - the rest seems unnecessary and extra. NiceGuy1 (talk) 19:51, 10 October 2018 (UTC)


Actually I expected Cueball to have found a way to crash the company mainframe. If he limited the crash to a device that not even his laptop's connectivity relies on, this is not so bad after all...--Gunterkoenigsmann (talk) 01:07, 19 January 2022 (UTC)