1909: Digital Resource Lifespan

Revision as of 05:36, 24 April 2023 by Xurkitree10 (talk | contribs) (Transcript)
Title text: I spent a long time thinking about how to design a system for long-term organization and storage of subject-specific informational resources without needing ongoing work from the experts who created them, only to realized I'd just reinvented libraries.


In this chart, Randall laments the tendency of digital resources to quickly become obsolete or non-functional. Taking any general subject (such as one of xkcd's core subjects of "romance, sarcasm, math, and language"), one can see that a useful tool such as a smartphone app, a desktop program or an interactive CD-ROM (essentially, software) does not have the staying power of printed books (e.g. textbooks, for many general subjects) or microfilm/microfiche. The physical resources, which do not rely on a computerized platform, are far more reliable despite being less portable and taking up physical space. The only digital resource still working is the Portable Document Format (PDF) file. PDF encapsulates fixed-layout flat documents, has been supported by Adobe Systems for years and is an ISO standard, so it enjoys widespread support and should remain viewable for the foreseeable future.

Archive.org refers to the Internet Archive, a non-profit organization that maintains the Wayback Machine, one of the largest archives of the World Wide Web. When a website is taken offline, copies of its content can often be found backed up on the Wayback Machine. The Wayback Machine is primarily designed to back up web pages, however, and will often not be able to save information stored in a site's databases, as alluded to in the comic. The Internet Archive also hosts collections of non-website material, but copyright problems keep it from holding recent databases either.
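As a rough illustration of what "on archive.org" means in practice, the sketch below queries the Wayback Machine's availability API for the snapshot closest to a given date. The endpoint and the response layout are written from memory of the public API and should be treated as assumptions; the helper names `availability_url` and `closest_snapshot` are made up for this example. Note that even a successful lookup only returns a captured page, not the backend data behind it.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the query URL asking for the snapshot closest to `timestamp`."""
    params = {"url": page_url}
    if timestamp is not None:
        params["timestamp"] = timestamp  # e.g. "20060101" (YYYYMMDD)
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the archived snapshot URL from an API response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None  # nothing archived for this page
    return closest.get("url")
```

To use it against the live service, one would fetch `availability_url(...)` (for instance with `urllib.request.urlopen`) and pass the decoded body to `closest_snapshot`.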

The title text makes a statement that libraries do not require the support of original authors/experts to organize and store vast resources for any subject imaginable. This is true, but omits the fact that ongoing efforts are required by experts in information organization and storage -- namely, librarians. Physical books and microfilm/microfiche need controlled storage environments, manual handling for storage, retrieval, distribution (in library terms, "circulation"), and the like. Thus, a library can require significant resources in personnel and facilities, but is usually seen as a "public good" for the benefit of society; thus, many communities and educational institutions invest in creating and maintaining a library despite the costs.


Caption Type of Resource Explanation
Book on Subject Physical Books This is the most familiar physical resource and is used as the baseline for the other (digital) resources.

Under optimal conditions, a book can last indefinitely for future generations; there are books from ancient times that are still readable today.

[Subject].pdf Portable Document Format This is the most familiar digital resource, with the probable exception of the internet as a whole. The format was originally developed by Adobe, and most of it is now an ISO standard, which means a compliant reader or writer can be built independently (avoiding most of the pitfalls described for the later resources).

A PDF file is designed to be portable (it is even in the name), which means that unless the creator of the PDF uses a non-standard feature (such as a web-only one), it can be opened anywhere a PDF reader is available. Authors may also opt for a stricter "archival" version (PDF/A), which requires that everything needed to render the document, such as fonts, is embedded in the PDF file itself and that only documented formats are used, so the file does not rely on anything non-standardized.

[Subject] Web Database Database Another type of digital resource which is, in itself, like a digital library. Unlike a physical library, however, it is usually stored on a single server or in a single file (distributed databases exist, but they are rare), so a failure of that server can wipe out the whole database. Databases can also take up a huge amount of space, which is one reason why entire databases are not stored in a digital archive like the Internet Archive.

Additionally, unlike with PDFs, there are almost infinitely many ways of storing and retrieving data in a database, which means that when the access method becomes unsupported (like the Java frontend in the comic; Java applets no longer run in current web browsers), the data is effectively lost, whether or not it is still physically on the server.

[Subject] Mobile App

(Local University Project)

Mobile App A type of digital resource that expands upon the idea of a web database by allowing easy access on a mobile device. However, since this one is stated to be a local university project, support for it is likely to last a few years at most, which is not enough to keep an application maintained.

Additionally, operating systems can become obsolete (like the Symbian platform used on older Nokia phones), or critical changes to them can break older applications (as on Apple's iOS).

[Subject] Analysis Software Desktop Application A type of executable program that runs on desktop systems. It allows reliable access on a desktop system, which means that (assuming the program works offline) it can survive on its own. However, operating systems can become obsolete (like the Classic Mac OS platform used on older Mac computers), or critical changes to them can break older applications (like newer Windows security features that break older non-compliant programs).
Interactive [Subject] CD-ROM Desktop Application, CD-ROMs A CD can hold anything from music to videos to applications. CD-ROMs also allowed better offline access and were widely used in the 1990s and the early 2000s. The content is still a fancy desktop application, so everything said about the analysis software applies here as well, not to mention that a new invention can make an older one obsolete (for example, Microsoft Encarta was discontinued in 2009 as Wikipedia offered easier access).

Additional issues mentioned are that CDs can become scratched, in which case the data becomes corrupted or unreadable, and that many modern laptops no longer have CD-ROM drives, making it difficult to use CDs as a storage medium. The latter also illustrates how physical media change over time: floppy disks were used in the 1980s and were replaced by CDs and DVDs in the 1990s, which in turn were replaced by thumb drives in the 2000s, which are now supplemented (and in some cases replaced entirely) by wireless device-to-device transfers (like Bluetooth) and internet file transfers using online storage (like Dropbox and Google Drive).

Library Microfilm [Subject] Collection Microfilm This is a physical resource used by libraries to preserve (or to make a copy of) a collection, usually items that are rare or whose damage would cause a social or political issue. Although careful preservation is needed to prevent damage to the film, the system is standardized, and the knowledge needed to build a microfilm reader or printer is widely available, much as with PDF readers. In this sense microfilm looks like a physical version of PDF: a standardized, common format (whereas books can be of any size imaginable).


My access to resources on [SUBJECT] over time:
[Below, a timeline and a graph with gray bars is shown:]
[1980s-past 2020:]
Book on subject
[Early 2000s-past 2020:]
[SUBJECT] web database
Site goes down, backend data not on archive.org
[Small bar, 2000-2016/17:]
Java frontend no longer runs
[SUBJECT] mobile app (Local university project)
Broken on new OS, not updated
[SUBJECT] analysis software
Broken on new OS, not updated
[Late 1990s-late 2000s:]
Interactive [SUBJECT] CD-ROM
CD scratched; new computer has no CD drive anyway.
[1980s-past 2020:]
Library microfilm [SUBJECT] collection
[Caption below the panel:]
It's unsettling to realize how quickly digital resources can disappear without ongoing work to maintain them.



Even PDFs can be broken, which is why we have PDF/A (archive) - a subset of PDF that has no external dependencies and thus should last forever. JakubNarebski

To clarify: .PDF files are *frequently* created with content such as fonts (or really anything other than the actual text) referenced within the document but not *embedded* within the document. This is usually done to reduce file size, but it's usually not advisable. Whether it's a .pdf or a .ppt or a .exe it is best to keep your dependencies embedded whenever possible!
.PDF files (or any files) can of course also suffer from hash failure (CRC errors, etc) and PDF/A does not provide redundancy tables; Always make an extra copy on another drive (ideally both off-site & locally). 06:07, 1 November 2017 (UTC)
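A minimal sketch of the "make an extra copy and check it" advice above, using only the Python standard library. SHA-256 is used here in place of the CRC mentioned in the comment, and the function names `sha256_of` and `copy_and_verify` are made up for this illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, backup_dir):
    """Copy `source` into `backup_dir` and confirm the copy is bit-identical."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / Path(source).name
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"checksum mismatch for {target}")
    return target
```

Storing the hex digest next to each copy lets a later check detect silent corruption, although recovering from it still requires a second good copy.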

CD scratched, new computer has no CD drive anyway. - First, you can still buy an external CD-ROM drive, for example one connected via USB cable. Second, you can try to recover data from a scratched CD with tools such as ddrescue (free and OSS) or IsoBuster (shareware). --JakubNarebski (talk) 17:51, 30 October 2017 (UTC)

Scratches on the DATA layer of any optical disk destroy that DATA. There is also the consideration that the plastics of the majority of optical disks degrade with time and heat. There are some optical media that are designed to prevent such scratching or corruption, like the commercially available M-Disk, or laser etching in a micro format into a crystal, as in a 5D disk. Even then, the DATA stored must be in an ISO format to be readable, and the equipment to read the media needs to be maintained. I have often told people that their data is never safe unless there is a constant effort to copy, check for quality, and make multiple backups using multiple modern mediums as often as humanly possible. All forms of digital media can fail; even the extended warranty on a high end HDD will not cover the data lost, and most EULAs for cloud storage will say the same.
Pressed commercial CD-ROMs carry their information between two 0.6 mm thick plastic discs which are glued together, which makes them pretty resilient against scratches on either side – just remove some material with abrasive methods like toothpaste. Often the glue is the bigger issue with low-quality pressings in the long run. This is in contrast to recordable CDs, which are coated with the reflective layer on top of a single disc. –TisTheAlmondTavern (talk) 12:24, 31 October 2017 (UTC)
CD-ROMs *always* carry their information directly below the reflective layer on the "printed" side of the 1.2mm disk (so around 0.1mm below the print). That's necessary because otherwise the laser would not correctly focus on the data. The recordable CDs have the problem that the reflective layer may not be 100% air resistant and so the data layer, which is a dye, may oxidise. It's DVDs that have two 0.6mm disks, so in theory you could flip the DVDs like an LP and use both sides - but then you don't have a surface to print information on (except the few square cm just around the hole). Blu-ray Discs inverted the CD: now the data layer is behind a 0.1mm "thick" coating on the data side. -- 11:23, 8 November 2017 (UTC)

Or cheaper than an external drive, borrow a friend's computer and copy the CD onto the cloud somewhere. --Angel (talk) 18:39, 30 October 2017 (UTC)
What if you don't have any friends? (or what if none of your friends has a CD drive) --
You can still buy external friends that have CD drives.-- 13:12, 3 November 2017 (UTC)
Yet something affected by that would just as likely be affected by "Broken on new OS, not updated". For example, I've got a multimedia encyclopedia which runs on Win 3.11, and thus can't run on 64-bit windows.
Ehrm... You do realise the limitation is the other way around right? You can't run 64-bit application on 32-bit Windows, but 64-bit windows can perfectly well run 32-bit apps. Though Win 3.11 is far enough back it might actually be a fun challenge to see if it runs :D 10:57, 31 October 2017 (UTC)
You can not – Win 3.1(.1) was a 16-bit operating system – and Microsoft dropped the 16-bit layer in win7. --DaB. (talk) 19:18, 31 October 2017 (UTC)
Most obsolete software can be quite easily run using various emulators or virtual machines. A lot of Win3.1 software runs without problems on modern Linuxes via Wine, and if it doesn't, there are always emulators such as DOSBox - copies of Win3.1 can be easily found on various abandonware sites and even archive.org (even though their legality may be questioned). 22:14, 11 November 2017 (UTC)

Interestingly, static .PDF files are intended to be electronic equivalents of printed books - an electronic microfiche if you will RIIW - Ponder it (talk) 18:57, 30 October 2017 (UTC)

I'm wondering if data on an older, static, website would still be readable. Would likely still be there (or on archive.org), but might be suffering progressive link rot. Also a little surprised that the start of microfilm is so recent; I remember the library having microfilm readers (that nobody ever used) when I was young enough to spend ages staring at a machine, trying to determine its purpose. Guess it depends on the subject, when it was put into that format. --Angel (talk) 18:39, 30 October 2017 (UTC)

archive.org returns contemporary pages for links on archived pages, so that should still be safe. The worst nightmare with archive.org is a newly-set robots.txt file: the Wayback Machine will just pretend to know nothing about the page even if it has been archived in the past. It sometimes crawls pages, after all. -- 07:22, 3 November 2017 (UTC)

Angel, note both the My in the title and the left arrow implying that the resources (like books) were around before Randall had access. RIIW - Ponder it (talk) 18:57, 30 October 2017 (UTC)

Should those white left arrows be noted in the transcript? The gray right arrows are implied by "past", perhaps something like "Before 1980-past 2020" 17:39, 1 November 2017 (UTC)

"Only to realized"? - 23:08, 30 October 2017 (UTC)

This is probably a typoed.-- 13:19, 3 November 2017 (UTC)

[Subject] wiki, anyone? Wikis have rather detailed analyses of even obscure topics in my line of work/study. --Nialpxe, 2017. (Arguments welcome) (P.S. just to be clear I mean wikis maintained by researchers and professionals in [Subject] field, not Wikipedia)

There's a wealth of thought about exactly this problem by librarians; the Library of Congress has some recommendations along with a database evaluating over a hundred formats along a variety of axes: is the format documented openly? Is it widely used? Is it inherently transparent to inspection even if the specification is lost? Can it contain its own metadata? What sort of external dependencies does it have? Is it patent-encumbered, and are there technical access restrictions like DRM? (tl;dr, images as TIFF, text as EPUB or PDF/A, sound as WAV. They're very conservative.) 05:07, 31 October 2017 (UTC)

Note that digital data have a big advantage over books when dealing with larger quantities. The amount of work needed to preserve a printed book is the same no matter how many books you have, so it is a thousand times more when you have a thousand books. Meanwhile, the amount of work needed to preserve, for example, a collection of digital images doesn't really depend on the collection size. Let's say that the format used is going out of use: you can automatically convert all the images fairly quickly. Of course, it is harder with applications ... -- Hkmaly (talk) 08:23, 31 October 2017 (UTC)
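The point that migrating a digital collection takes roughly the same human effort regardless of its size can be sketched as a single loop. Everything here is illustrative: `migrate_collection` is a made-up helper, and the `convert` function (old-format bytes in, new-format bytes out) is a placeholder for whatever real converter the format in question would need.

```python
from pathlib import Path


def migrate_collection(root, old_suffix, new_suffix, convert):
    """Convert every file under `root` that ends in `old_suffix`.

    `convert` is a caller-supplied (hypothetical) function mapping the old
    file's bytes to bytes in the new format. Originals are kept in place.
    Returns the list of newly written paths.
    """
    written = []
    # sorted() materializes the listing before any new files are created.
    for old_path in sorted(Path(root).rglob("*" + old_suffix)):
        if not old_path.is_file():
            continue
        new_path = old_path.with_suffix(new_suffix)
        if new_path.exists():
            continue  # already migrated on an earlier run
        new_path.write_bytes(convert(old_path.read_bytes()))
        written.append(new_path)
    return written
```

The same call handles ten files or ten million; only the running time grows, not the amount of manual work.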

The software not running after OS update is such a Mac problem. Linux updates would break if closed software was commonly available, but open source can be recompiled, and Windows maintains a scary amount of backwards compatibility, and only system-admin or DRM-crippled software ever stops working. 10:54, 31 October 2017 (UTC)

I must strongly disagree there; Networking features have been known to break following Windows updates, & Android is *terribly* prone to breaking apps or even removing what may be considered core system features with an OS update. Search "kitkat sd", for just one good example. Even Linux can turn into dependency hell when repositories change their branch structure. Then there's the incredible variety of different hardware which only a specific version of Windows with specific hardware once supported: I still can't get an affordable analog serial port adapter that will work with my favorite flight controller. 06:37, 1 November 2017 (UTC)

Here in the UK, the library access would also have ended some time in the last few years... 11:33, 31 October 2017 (UTC)

Nothing lasts forever (or at least that's what seems to be true for anything observed by humanity). Data becomes corrupted and lost over time and usage, and books become damaged and lost over time and usage. Not to mention, thousands of books were burned during the Nazi regime. Human minds are inevitably subject to corrupted memories as well. We lose information all the time, and we try to recover what remains. However, it is also worth mentioning that our digital technology is still pretty young compared to books and other sources of information. Information used to be recorded on papyrus, tablets (I understand that this contradicts my point as some tablets have stood the test of time), etc. Some of the earliest Chinese inks were created with soot and animal glue. The first (attempts of) photographs required hours of light exposure and would fade away quickly. Over time, we discovered ways to improve upon these sources of information. The same could apply to our digital information today. We are essentially in the "papyrus" phase of electronic technology (one could argue with other descriptions, but this isn't significant to my statements). In time, we may achieve more successful long-term solutions to maintaining original data. There are so many avenues for the advancement of technology, and those avenues continue to multiply with each step. At this time, we just need to continue to work on our projects and experiments for the progress of humanity. NAE (talk) 14:29, 31 October 2017 (UTC)

Randall did a good job frightening me this Halloween... 02:10, 1 November 2017 (UTC)

I wonder if Randall is aware of digital archiving solutions such as those provided by Preservica (https://preservica.com/), formerly part of Tessella plc. Their solutions are aimed at precisely this problem. Their library/museum clients include "the MoMA, the Frick Collection, the Museum of Fine Arts Houston, Yale Library, The National Library of Australia, The Royal Danish Library, The Philadelphia Museum of Art, McNay Art Museum, DC Public Library and the University of Manchester" and their archive clients include "15 leading pan-national and national archives, 18 US state archives, major corporate archives at BT, HSBC, Unilever and the Associated Press". 03:32, 1 November 2017 (UTC)

Randall forgot my personal favorite: UTF-8 formatted .txt files. Since 1993 & counting, never had an issue opening one. I still have my first copy of The Anarchist's Cookbook, copied from a Kaypro II running CP/M on a 5-1/4" floppy to an 8088XT running MS-DOS on a 30mb hard drive to an IBM PS/2 286 on 20mb hard drive to an Asus 486 on a 3.5" floppy to a 1.2gHz Pentium on a 100mb Zip drive to a Core 2 Duo on a CD-R to an i7 system on a 128gb solid state drive, which was finally backed up to a 1tb hard drive & archived, as there's a newer copy to carry around. That original file still opens just fine on any PC I've ever used (including mobile).

I love TXT files. They're great for reading without the need for pictures and formatting. I also occasionally prefer writing TXT files instead of creating DOC or DOCX files because there are less interface distractions. NAE (talk) 18:25, 10 November 2017 (UTC)

Also, I believe Linus Torvalds once said (talking about code, but it applies to anything sufficiently desirable) "Only wimps use tape backup, real men just upload their important stuff on ftp, and let the rest of the world mirror it ;)" I can certainly attest to that. I once made a torrent of all the Star Trek I'd accumulated (IE, all the Star Trek ever) & uploaded that. Two years later an old hard drive died & I was able to recover all 200+ gb in a little over 6 hours, simply by downloading my own torrent from other seeds. Thanks Trekkies! 07:22, 1 November 2017 (UTC)

The transcripts for the books and microfilm should state the date range as "before 1980's" to represent the arrows on the chart, just as "past ..." is used for the bars that have arrows at their right ends. I wonder if there is any significance to the fact that the arrows are done differently for "before" and "after". 14:58, 7 November 2017 (UTC)

The most reliable form of long-term backup may very well be HD-Rosetta, which is like microfilm except etched into non-reactive metal. Unfortunately it's super-duper expensive, and only a very limited number of facilities capable of minting them even exist. On the upside, they're immune to data format changes because they can be read with an ordinary light microscope. 02:34, 16 March 2020 (UTC)