2054: Data Pipeline

This comic can be logically connected to the Code Quality series ([[1513: Code Quality]], [[1695: Code Quality 2]] and [[1833: Code Quality 3]]), which similarly shows Cueball's coding ineptitude and Ponytail's exasperation with it, though here Cueball shows a higher level of competence by having produced something useful, albeit fragile. However, Ponytail doesn't see the actual code in this case, and there are no issues with or comments on coding syntax as there are in the Code Quality series.
 
It's quite common for people who spend most of their free time coding for enjoyment to try to automate absolutely everything.  Whenever such a programmer sees a rote task, the thought is, "Why is a human doing this when the time could be spent making a computer do it automatically, forever?"  Unfortunately, in the absence of strong artificial intelligence, one of the places where this approach begins to break down is in aggregating information from multiple sources.

People tend to publish their data through a variety of channels, and because most publishers are not programmers and do not place the same value on consistency and machine-readability, it all ends up in completely different formats.  Some data is only available in print.  Some data is only available as photographs.  Some data is only available as written reports.  A certain kind of nerd will see this situation and become excited by the opportunity to automate something that nobody else thinks is worth the effort.  They begin writing scripts that process each of the different formats, and eventually get the whole thing working!  They can then, in theory, make a number of mind-numbing data-processing jobs obsolete.

Google put a lot of energy into this challenge on many fronts during the 2000s, making data more processable everywhere and possibly hastening the advent of those strong artificial intelligences, which would thrive on the availability of already-digitized information.  A notable project was Google Books, in which libraries were scoured for non-digital information that was then painstakingly scanned.  Additionally, organizations have been increasingly pressured to offer their information in standardized formats that can all be processed the same way.  This continued pressure is producing more and more results, but because the standards must be implemented by humans who gain little immediately from the process, adherence to the guidelines is rarely universal.

The workaround of building many small programs that handle all the quirks is the domain of "scraping": downloading information intended to be presented to a human, running it through software that has been pre-programmed with the patterns to expect, and normalizing and making use of the data.
Anybody who has attempted this as a mere individual quickly realizes that as soon as the data source changes in the smallest way, the data becomes garbage.  Often it becomes garbage in a way that is laborious to hunt down and understand, and the problem may not even be noticed.  This would be tragic for a corporation that relied on the results, and would act like a Trojan horse, destroying it from the inside.

Privately, though, many hobbyists can make money by collecting data that is useful to them and using code to process it more effectively than their peers do.  Because they are consistently personally invested in their data pipeline, they just hack up fixes as problems occur.  One website that has been maintained this way for many years is piperka.net, which is run by a single hobbyist and provides a centralized place for tracking most webcomics, including xkcd itself, using this very method of data scraping.  Additionally, many companies have refined data scraping to be quite effective and reliable, such as mint.com, which allows customers to consolidate their financial information in one place.

Artificial intelligence is developing to the point that some computers scrape data very effectively and reliably, but creating advanced, robust scraping algorithms is usually not worth the effort.  It is more efficient to get the providers of the data to offer it in a normalized way, to hire a human to do the task, or to make money through something that doesn't require such diverse processing.  Hence this niche has been filled by hobbyists willing to put in the heuristic effort.
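
The scraping workflow described above (take text meant for a human, match it against a pre-programmed pattern, and normalize the result) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two page snippets, the pattern, and the function name `scrape_latest` are hypothetical and do not come from any real site or from the comic. It also shows how a cosmetic change in the source is enough to break the scraper.

```python
import re

# Hypothetical page snippets. A real scraper would download these first.
PAGE_V1 = "<li>Latest comic: <b>2054</b> (Data Pipeline)</li>"
# The same information after a small, purely cosmetic redesign.
PAGE_V2 = "<li>Latest comic: <strong>#2054</strong> - Data Pipeline</li>"

# The pattern hard-codes the layout the scraper's author originally saw.
PATTERN = re.compile(r"Latest comic: <b>(\d+)</b> \((.+?)\)")


def scrape_latest(html):
    """Return a normalized record, or None if the expected layout is absent."""
    match = PATTERN.search(html)
    if match is None:
        # The layout changed; report failure instead of guessing.
        return None
    return {"number": int(match.group(1)), "title": match.group(2)}


if __name__ == "__main__":
    print(scrape_latest(PAGE_V1))  # a record with the number and title
    print(scrape_latest(PAGE_V2))  # None, because the markup changed
```

Returning `None` at least makes the breakage visible. A looser pattern that still matched the redesigned page could instead yield wrong values without any error, which is the hard-to-notice kind of garbage described above.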
  
 
Cueball's hesitant response in this comic has some similarities to [[410: Math Paper]].
 
