2054: Data Pipeline

{{comic
| number    = 2054
| date      = October 3, 2018
| title     = Data Pipeline
| image     = data_pipeline.png
| titletext = "Is the pipeline literally running from your laptop?" "Don't be silly, my laptop disconnects far too often to host a service we rely on. It's running on my phone."
}}

==Explanation==
In the first panel, [[Cueball]] shows [[Ponytail]] and [[White Hat]] a data pipeline he has built that, as he puts it, <nowiki>'collects and processes all the information we need'</nowiki>. This implies that the three are working on some project that requires data processing. Ponytail asks whether the pipeline is an unstable pile of scripts that will stop working as soon as it receives any unexpected input. After a pause, Cueball says only that it "''might'' not be"; his hesitation and the word "might" all but concede that it is, even if he hopes otherwise. Ponytail starts to grudgingly accept this ("I guess that's someth-"), but Cueball interrupts to report that the system has just collapsed. He says he can patch it, which suggests a cycle of collapsing and patching that could repeat indefinitely, or until every problem has been fixed. [[:Category:Code Quality|Knowing Cueball's code, though,]] it seems more likely that he can't patch it.

In the title text, Ponytail or White Hat asks whether a pipeline this important is literally running from Cueball's laptop. Cueball makes matters worse by replying that his laptop disconnects too often to host a service they rely on, so the pipeline runs on his phone instead. While this might give the pipeline more uptime, a phone typically has far more limited system resources than a laptop.

This comic is thematically related to the Code Quality series ([[1513: Code Quality]], [[1695: Code Quality 2]] and [[1833: Code Quality 3]]), which similarly shows Cueball's coding ineptitude and Ponytail's exasperation with it. Here, though, Cueball shows a higher level of competence by having produced something useful, albeit fragile. Ponytail also doesn't see the actual code in this case, and there are no issues with or comments on coding syntax as there are in the Code Quality series.

It's quite common for people who spend much of their time coding for enjoyment to try to automate everything they can. Whenever such a programmer sees a rote task, they think, "Why is a human doing this when the time could be spent making a computer do it automatically, forever?" Unfortunately, in the absence of strong artificial intelligence, one of the places where this approach begins to break down is in aggregating information from multiple sources.

People tend to publish their data through a variety of channels, and since most of them are not programmers and do not place the same value on consistency and machine-readability, the data ends up in completely different formats. Some data is only available in print. Some data is only available as photographs. Some data is only available as written reports. A certain kind of nerd will see this situation and become excited by the opportunity to automate something that nobody else thinks is worth the energy. They begin writing scripts to process each of the different formats, and eventually get the whole thing working! They can then, in theory, make a number of mind-numbing data-processing jobs obsolete.

Google put a lot of energy into this challenge on many fronts during the 2000s, making data more processable everywhere and possibly hastening the advent of those strong artificial intelligences, which would thrive on the availability of already-digitized information. A notable project was Google Books, in which libraries were scoured for non-digital material that was then painstakingly scanned. Additionally, organizations have come under increasing pressure to offer their information in standardized formats that can all be processed the same way. This continued pressure is producing more and more results, but because the standards must be implemented by humans who gain little immediate benefit from doing so, adherence to the guidelines is rarely universal.

The workaround of building many small programs that handle all the quirks is the domain of "scraping": downloading information intended to be presented to a human, running it through software that has been pre-programmed with the patterns to expect, and then normalizing and making use of the data.
Anybody who has attempted this as a lone individual quickly realizes that as soon as the data source changes in even the smallest way, the output becomes garbage. Often it becomes garbage in a way that is laborious to hunt down and understand, and the problem may not even be noticed. This would be tragic for a corporation that was relying on the results, as it would act like a Trojan horse, destroying the company from the inside.
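
The failure mode described above can be illustrated with a minimal Python sketch. Everything in it is hypothetical and invented for illustration (the page snippets, the <code>price</code> class, and the function names); it is not taken from the comic or from any real site. It assumes a scraper that pulls a price out of a table cell, and a source that later switches its number format.

```python
import re

# Hypothetical page snippets, invented for this illustration.
OLD_PAGE = "<td class='price'>$1,234.50</td>"
NEW_PAGE = "<td class='price'>1.234,50 USD</td>"  # the source changed its format

# Loose pattern: grabs whatever digits, commas and dots follow the tag.
LOOSE_PRICE = re.compile(r"<td class='price'>\$?([\d,.]+)")

# Strict pattern: accepts only the exact layout the script was written for.
STRICT_PRICE = re.compile(r"<td class='price'>\$(\d{1,3}(?:,\d{3})*\.\d{2})</td>")


def brittle_parse(html):
    """Assume the old layout and never check the result."""
    match = LOOSE_PRICE.search(html)
    return float(match.group(1).replace(",", ""))


def careful_parse(html):
    """Fail loudly when the input does not look as expected."""
    match = STRICT_PRICE.search(html)
    if match is None:
        raise ValueError("unexpected price format: " + html)
    return float(match.group(1).replace(",", ""))


if __name__ == "__main__":
    print(brittle_parse(OLD_PAGE))  # correct value for the old layout
    print(brittle_parse(NEW_PAGE))  # wrong by a factor of 1000, with no error
    try:
        careful_parse(NEW_PAGE)
    except ValueError as err:
        print("caught:", err)
```

In this sketch, <code>brittle_parse</code> keeps returning a number after the format change, just the wrong one, which is the kind of silent garbage that may go unnoticed. <code>careful_parse</code> collapses immediately instead, which is at least a failure somebody can patch.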

Privately, though, many hobbyists can make money by collecting data that is useful to them and processing it with code more effectively than their peers do. Because they are personally invested in their data pipeline, they simply hack up fixes as problems occur. One website that has been maintained this way for many years is piperka.net, which is run by a single hobbyist and provides a centralized place for tracking most webcomics, including xkcd itself, using this very method of data scraping. Additionally, many companies have refined data scraping to be quite effective and reliable, such as mint.com, which allows customers to consolidate their financial information in one place.

Artificial intelligence is developing to the point that some computers can scrape data very effectively and reliably, but creating advanced, robust scraping algorithms is usually not worth the effort. It is more efficient to get the providers of the data to offer it in a normalized way, to hire a human to do the task, or to make money through something that doesn't require such diverse processing. This leaves a niche for those hobbyists who are willing to put in the heuristic effort.

Cueball's hesitant response in this comic has some similarities to [[410: Math Paper]].

==Transcript==
:[Cueball is standing with an open laptop, showing it to Ponytail and White Hat.]
:Cueball: Check it out - I made a fully automated data pipeline that collects and processes all the information we need.

:[Ponytail is looking down at Cueball's laptop.]
:Ponytail: Is it a giant house of cards built from random scripts that will all completely collapse the moment any input does anything weird?

:[Borderless beat panel]

:[Cueball looks at his laptop.]
:Cueball: It... ''might'' not be.
:Ponytail: I guess that's someth-
:Cueball: Whoops, just collapsed. Hang on, I can patch it.

{{comic discussion}}

[[Category:Comics featuring Cueball]]
[[Category:Comics featuring Ponytail]]
[[Category:Comics featuring White Hat]]
[[Category:Cueball Computer Problems]]