2109: Invisible Formatting

Explain xkcd: It's 'cause you're dumb.
Revision as of 04:19, 11 February 2019 by 162.158.79.143 (talk) (Explanation: Re-added bold marks in the first paragraph, thereby making the third point in the example section correct.)
Invisible Formatting
Title text: To avoid errors like this, we render all text and pipe it through OCR before processing, fixing a handful of irregular bugs by burying them beneath a smooth, uniform layer of bugs.

Explanation

In some word processors, selecting text, whether by clicking and dragging or by double-clicking on a passage, can easily capture an extra character, such as a trailing space, that shows no visible change when italicized or boldfaced. Since in most fonts the word space looks identical in bold, italic, and regular, this has no effect on how the end user will read the document, but it could theoretically cause a problem later, particularly if the text cursor does not reflect formatting when placed over formatted characters. Randall worries about this.

In the pictured case, he does not appear to have selected the word by double-clicking, since the cursor is depicted past the end of the word instead of on top of it. It appears instead that he has clicked and dragged the mouse cursor to select it, a method which also makes it easy to accidentally select a trailing space. The word space is a relatively thin character, which makes it hard to avoid and to notice, and most people don’t worry about whether they selected it. Therefore, selecting a trailing space is a common behavior, regardless of the method used.

If later the same word is highlighted to have the bold removed, but this time without including the space, the space would retain its bold formatting. Since it is an invisible character, there is no easy way to tell it is still bold—even if it is slightly longer in the bold font, this may be hard to notice. This is the situation the comic is highlighting—no pun intended.
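
The sequence described above can be sketched in a few lines of Python. This is only an illustrative model, assuming a document stores one bold flag per character; real word processors use more elaborate structures, and the helper names `set_bold` and `hidden_bold_spaces` are made up for this sketch.

```python
def set_bold(flags, start, end, value):
    """Set the bold flag for characters in the half-open range [start, end)."""
    for i in range(start, end):
        flags[i] = value


def hidden_bold_spaces(text, flags):
    """Return indices of whitespace characters that still carry bold formatting."""
    return [i for i, ch in enumerate(text) if ch.isspace() and flags[i]]


text = "but would not have to"
bold = [False] * len(text)      # assumed model: one flag per character

start = text.index("not")       # first character of the word
end = start + len("not")        # index of the space right after the word

set_bold(bold, start, end + 1, True)   # select "not " (with the space) and bold it
set_bold(bold, start, end, False)      # select "not" only and remove the bold

leftover = hidden_bold_spaces(text, bold)
print(leftover)  # index of the space that is still bold
```

A check like `hidden_bold_spaces` is the kind of scan one would have to run to find such spaces, since nothing on screen gives them away.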

Occasions where a hidden bold space may be a problem include:

  • Exporting to plain text files. If for example a markdown style is used, there will be characters in the output that do not make sense.
  • Scraping, data mining, and linguistics processing by computer algorithms. Often (although not always) these algorithms are written based on samples of training or testing text that may not have spurious formatting present, and may misprocess something when encountering the spurious formatting.
  • Wikis. In the first paragraph of this article, every space is a hidden bold space. From the editing view, all the spaces look like''' '''this. This will annoy all future editors of this article, due to the hidden apostrophes which are formatting the spaces. They may also accidentally introduce bold words.
    • By default, MediaWiki attempts to prevent this by not including the trailing spaces in the bold formatting when you click the “bold” button, so someone has to manually type the formatting apostrophes to do this.
  • Editing that adds some text at the location of the space will make this text bold.
  • A situation where formatted text is not allowed, and is rejected, but the user failed to strip formatting from the spaces, and this is noticed.
  • If a font has the word space look different between the bold and the regular, perhaps to make it so bold words are spaced closer to each other, the spacing will look inconsistent if there is a hidden bold space.
  • Unnecessary extra formatting will usually unnecessarily increase file size, which may put the document above some maximum file size threshold.
  • It can later be revealed that Randall considered formatting parts of the text in bold. As the title text suggests, it is really important to Randall to control all the information he publishes. Real-world examples include governments changing the impact of reports for political reasons; attempted tampering of this kind can be revealed by bold spaces. Another example would be a casual, short one-sentence reply, e.g. to a romantic interest, which one takes an hour to formulate so that it sounds as natural as possible.
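
The plain-text export problem from the list above can be illustrated with a small sketch. It assumes the document is available as a list of `(text, is_bold)` runs, which is a simplification invented for this example, and both function names are hypothetical. The naive exporter wraps every bold run in `**` as-is; the more careful one moves leading and trailing whitespace outside the markers and drops bold from whitespace-only runs.

```python
def naive_markdown(runs):
    """Wrap every bold run in ** exactly as stored, whitespace included."""
    return "".join(f"**{text}**" if is_bold else text for text, is_bold in runs)


def careful_markdown(runs):
    """Keep edge whitespace outside the ** markers; ignore bold on pure whitespace."""
    out = []
    for text, is_bold in runs:
        core = text.strip()
        if not is_bold or not core:
            # Plain runs and whitespace-only bold runs are emitted unformatted.
            out.append(text)
            continue
        lead = text[: len(text) - len(text.lstrip())]
        trail = text[len(text.rstrip()):]
        out.append(f"{lead}**{core}**{trail}")
    return "".join(out)


bold_word_and_space = [("would ", False), ("not ", True), ("have", False)]
hidden_bold_space = [("would not", False), (" ", True), ("have", False)]

print(naive_markdown(bold_word_and_space))    # would **not **have
print(careful_markdown(bold_word_and_space))  # would **not** have
print(naive_markdown(hidden_bold_space))      # would not** **have
print(careful_markdown(hidden_bold_space))    # would not have
```

The third line of output is the case from the comic: the only bold character is a space, so the naive export produces stray asterisks that do not make sense to a reader.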

Randall’s background in computer programming could be what makes him more attentive to these types of technical problems, and therefore the reason for his worries.

Popular modern word processing programs have features which may make it easier to notice improperly formatted invisible characters. In the tutorials linked here, one may learn how to view invisible characters in Microsoft Word, Pages and LibreOffice Writer. However, even with this feature turned on, it would be difficult to spot a bolded space (which appears as a bolded dot: now visible, but so small that it is still hard to tell whether it is bold or not). In the older word processor WordPerfect, one could do this with the “Reveal Codes” feature, which displayed the formatting codes, separate from the characters themselves, around the characters they affected. For example, a bolded space would look something like "[BOLD≻≺BOLD]".
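
The idea behind a "reveal codes" view can be approximated as follows. This is a rough sketch, not a reproduction of WordPerfect's actual display: it assumes one bold flag per character, and the marker strings `[BOLD]` and `[bold]` are placeholder defaults chosen for this example.

```python
def reveal_codes(text, bold_flags, on="[BOLD]", off="[bold]"):
    """Insert visible markers wherever the bold state switches on or off."""
    out = []
    in_bold = False
    for ch, is_bold in zip(text, bold_flags):
        if is_bold and not in_bold:
            out.append(on)
        elif not is_bold and in_bold:
            out.append(off)
        in_bold = is_bold
        out.append(ch)
    if in_bold:
        # Close a bold span that runs to the end of the text.
        out.append(off)
    return "".join(out)


text = "not have"
# Only the space at index 3 is bold: the "hidden bold space" case.
flags = [False, False, False, True, False, False, False, False]

revealed = reveal_codes(text, flags)
print(revealed)  # not[BOLD] [bold]have
```

With the markers shown inline, the bold space is impossible to miss, which is exactly what the dot-for-space view in modern word processors fails to achieve.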

Web sites which allow content to be edited by users but generate the formatting code automatically often have versions of the invisible formatting problem; for example, eBay listings which use anything other than the default font rapidly accumulate hard spaces, font end and begin transitions, and other invisible formatting if they are subsequently edited, which can slow page loading and cause other problems. This is also seen in blogs etc.

In the title text, Randall says that he “fixes” this by running the text through OCR, which turns physical copies or images into text. This would usually ruin even more formatting, and add inaccuracies to the text. This way, no one can tell which bugs were introduced by him and which ones by the OCR, which he facetiously suggests is better somehow.

Transcript

[A text editor, with some options. They are superscript in one section, bold, italic and underscore in another section and alignments in the third section. The word "not ", including the following space, is highlighted in blue. There is a cursor below it.]
Text: ...ere, but would not have to mo...
Action: Select
[The cursor is on the "bold" option and the selected word is bolded.]
Text: ...ere, but would not have to mo...
Action: Click
[The cursor is next to the "to". No text is highlighted.]
Text: ...ere, but would not have to mo...
Thought bubble: ...Nah, the bold is too much.
[The word "not" is now highlighted in blue again, but the following space is not.]
Text: ...ere, but would not have to mo...
Action: Select
[The cursor is on the "bold" option and the selected word is not bolded.]
Text: ...ere, but would not have to mo...
Action: Click
[The cursor and the blue highlighting are gone. The space after "not" has a dashed box around it, and an arrow points to it.]
Text: ...ere, but would not have to mo...
Arrow: Hidden bold space
[Caption below the panels:]
When editing text, in the back of my mind I always worry that I'm adding invisible formatting that will somehow cause a problem in the distant future.



Discussion

This reminds me of the person who used l (lower-case "L") instead of 1 for data entry at some business. Amazingly, the computer accepted it (BAD programming!) and it wasn't found out until the end of the tax year, when all heck broke loose! 162.158.75.136 14:50, 8 February 2019 (UTC)

Some programming puzzles are often solved with stuff like this: AΑ Fabian42 (talk) 15:19, 8 February 2019 (UTC)
"l" (lower-case "L") is a valid suffix to integer literals in C and derived languages. It indicates the number is of the "long int" type as opposed to a plain "int". Because C automatically upconverts the "int" type into "long int" when needed, the "l" suffix is rarely used. The result: "long int a = 1;" and "long int a = 1l;" mean exactly the same thing, and both statements are perfectly standard and won't raise any warning from compilers. "ll" (double el) is also a valid suffix, this time for the "long long int" type. GuB (talk) 15:39, 8 February 2019 (UTC)
Typing lowercase L instead of 1 is a common thing for people of a certain age. Old manual typewriters usually don't have a "1" key, so people learned to use lowercase L instead -- and sometimes slip back into that habit on newer technology. --Aaron of Mpls (talk) 02:03, 9 February 2019 (UTC)
That's exactly what happened in my example. I blame the programmer, though, for allowing a letter where a numeral was required, or for not converting the l to a 1 if the programmer knew such a thing ever happened. In either case, it shouldn't have allowed the l to just sit there like a bomb waiting to blow apart the post-tax-year processing. 172.68.58.83 15:22, 9 February 2019 (UTC)

I went to this page, expecting it to be self-referential. Was not disappointed. Fabian42 (talk) 15:19, 8 February 2019 (UTC)

Some markup conversion tools don't handle hidden bold spaces correctly. This HTML to Markdown converter is an example: https://anthonychu.github.io/to-markdown/ It converts <b>a </b> to **a ** instead of **a** . 172.69.62.10 15:40, 8 February 2019 (UTC)

Hah, this comment is not mine! Somehow I have your IP now. 172.69.62.10 17:47, 8 February 2019 (UTC)

Were the periods in the beginning there for a specific reason? Netherin5 (talk) 17:42, 8 February 2019 (UTC)

The user 108.162.245.16 thought it was a good idea for some reason. Glad you fixed it. I finished the job 172.69.62.10 17:46, 8 February 2019 (UTC)

I've had this happen when writing papers. Bold. Unbold. Later backspace into the hidden bold space and everything typed after gets put in bold. If a professor gives you a page count instead of a word count, you can make the punctuation in your paper bold (or increase the font) to add some extra padding that might go unnoticed. Don't actually do this if you can't convey your thesis in fewer words. 172.69.210.52 18:11, 8 February 2019 (UTC)

I hated when Microsoft Word took over and lacked a real "Reveal Codes" like WordPerfect used to have. I'm kind of like Randall, I think about those behind-the-scenes things that lots of companies like to try to hide from the user, and I like the power to do something about them. -boB (talk) 18:58, 8 February 2019 (UTC)

When I saw the strip, I immediately thought of WordPerfect because of its brain-dead way of inserting formatting as special codes inline with the text. Hit "reveal codes" and it would reveal a string of bold on / bold off codes because it wasn't clever enough to optimise them away. I assume Word does it differently, perhaps with attributed strings, and so doesn't need the reveal codes function so you can manually fix the mess the program has made.

In Microsoft Word, where the majority of people would have experience with selecting and bolding text, the cursor appears as an "I-beam" when positioned over text and not as the "mouse pointer arrow" shown by Randall. Also, in Word double-clicking a word does select the following space(s), but when bold is applied it is applied only to the selected word, NOT to the trailing space (even though the space was selected when the bold was applied). So selecting just the word and un-bolding would not leave a bolded space behind, since the space was never bolded. Clearly Randall's example is in some editor other than Word. Since Word is where most people have familiarity with selecting and bolding text, something should be added to the explanation noting this and speculating on which text editor Randall is actually showing. - 108.162.246.215 20:35, 8 February 2019 (UTC)

Agreed. Most text editors do not select the trailing space when double-clicking. Microsoft Word is one of the few that does it. But in that case, the space is not formatted as bold. But in most word processors including Word, if you do select the word with the trailing space and apply the bold formatting, the space retains the formatting even if the word is un-bolded. So the first sentence of the explanation is incorrect.
Do they not? Notepad does it. Notepad++ does it. Your browser does it. Where is the wealth of programs that don't? I reckon this is the default system-wide behavior for double-clicking in Windows, regardless of program. 172.68.65.228 11:46, 9 February 2019 (UTC)
It does indeed seem to be a Windows issue, as everything I tried there highlighted the extra space (except Notepad++), but nothing I tried on Linux did. 162.158.90.36 13:59, 9 February 2019 (UTC)

Hidden formatting annoys translators greatly. Sometimes, the formatting of the word processor used and the formatting recognized by the CAT program (such as SDL Trados Studio or MemoQ) do not line up very well, which causes the formatting to appear as tags within the text (purple colored in the most widely used CAT software, Trados). If there is sloppy or hidden formatting all through the document, this turns into what most people call a "wall of purple", with tags everywhere within the document. Since tags need to be accounted for (otherwise the document does not save properly), and the formatting capability of most CAT tools is a lot more limited compared to any word processors, this is a colossal waste of time for any translator to wade through. Thus, if you leave any hidden formatting in a document and you know it will be translated somewhere down the line, you know there is a translator out there that curses the day you were born. (A note though - PDF conversion is responsible for a lot more wall of purple incidents than sloppy formatting. Seriously - if you expect a document to be translated at some point, never bring it anywhere close to the PDF format. That format is evil, I tell you. Pure evil.) 162.158.89.61 05:47, 9 February 2019 (UTC)

In WordPerfect for DOS, the codes were [BOLD] to turn bold on and [bold] to turn it off again. --162.158.38.40 11:30, 9 February 2019 (UTC)

The whole idea of invisible formatting is being used by some websites, including Facebook, to make it much harder for ad blockers to block ads. For example, https://twitter.com/themikepan/status/1093035372186034176 Of course, the same can also be used to defeat swear filters on forums, as well (which, for some words like "bastard sword," *the moderators* themselves suggest doing). Draco18s (talk) 19:43, 9 February 2019 (UTC)

We have a category for comics with colour... can we have a category for comics with lowercase letters? :) Undergroundmonorail (talk) 02:33, 10 February 2019 (UTC)

I frequently see a similar, related problem. In preparing a weekly newsletter (consisting mostly of links to articles from various news sources), people submitting articles to me usually send me Microsoft Word files into which they have used copy/paste to insert the headline, URL and a few lines of text for context. On far too many articles, I find that the resulting text has embedded UNICODE Left-to-right mark characters (U+200E) in it. These don't affect display and printing at all (since all of the text is already left-to-right), but it creates broken links if one appears in a URL and I copy/paste it into a web browser's location bar. There doesn't seem to be any way to make these characters visible in Word. If manually cursoring over the text (with left/right keys), you will see the cursor change shape without moving when stepping over the left-to-right mark, but that's the only indication. It's quite annoying to have to work around. (If anyone knows of a good workaround, please let me know.) Shamino (talk) 19:32, 10 February 2019 (UTC)
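
One possible workaround, offered as a sketch rather than a tested recipe for Word itself: paste the text or URL into a small Python script that reports and strips Unicode "format" characters (general category `Cf`), which is the category U+200E belongs to. The function names and the example URL are made up for illustration.

```python
import unicodedata


def find_invisible(text):
    """List (index, code point, name) for every format (Cf) character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def strip_invisible(text):
    """Return the text with all format (Cf) characters removed."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# Hypothetical URL with a left-to-right mark pasted into the middle of it.
url = "https://example.com/\u200earticle"

found = find_invisible(url)
cleaned = strip_invisible(url)
print(found)
print(cleaned)
```

Stripping every `Cf` character is reasonable for URLs, but it is too blunt for general prose: the same category includes the zero-width joiner and the soft hyphen, which some scripts and emoji sequences rely on.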

I frequently cut-and-paste text into Notepad (or gedit, or some other text-only editor etc.), then cut-and-paste it back to Word or whatever other "rich text" capable destination I am using -- this removes all hidden junk, formatting, font changes, bold, etc. and the pasted text takes on the characteristics of wherever it's pasted into rather than where it came from. This is basically taking the text down to the bare minimum, and then I can reintroduce whatever formatting I want it to have. -boB (talk) 16:47, 11 February 2019 (UTC)

GIMP is really bad about this when trying to add text to an image. You either end up with the formatting not wanting to stick, or you end up with invisible formatting all over the place. Dark talk 00:15, 11 February 2019 (UTC)


Seems to me that everybody here misses the point of the comic, which is not the problems that hidden leftover formatting could cause for later text. The joke here is that Randall is about to write something where he really means that NOT, but then regrets it, as he is afraid that the reader of his text/message would take offense at having this "not" shouted out in bold! So he reverts the bold, but because he misses the space, he has left proof that he actually did mean NOT. The receiver can now find this out anyway, and might then take offense anyway, or take offense that Randall felt he had to delete the bold, as if the receiver could not handle it (of course, if he took offense at this, Randall would have proved his point, but nevertheless he tries to avoid it). All this is now mentioned at the very end of a long list of indifferent problems such a bold space could create. I will move this up to the top now, as the main explanation. --Kynde (talk) 10:06, 13 February 2019 (UTC)

I found (and find) the typography in this comic troubling, because while it is clearly a proportionally spaced font ("l" is 5px wide, "w" is 23px), the boldfaced and roman "not"s are the same size (49px wide). In a normal proportionally spaced situation, the boldfaced letters would be wider. JohnHawkinson (talk) 03:23, 23 February 2019 (UTC)

In an edit last week I removed the claims that "Randall bolds text via clicking" and that it "could indicate that Randall is not familiar with using word processors." 172.68.144.145 just reverted my removal, and I wanted to explain here why '.145 is wrong, in a little more space than the edit summary allows. I said originally, "An iconbutton is used for bold in comics for illustrative purposes, because you can't see the keyboard. It does not reflect the author-artist's knowlege." That is, we cannot draw conclusions about Randall's knowledge based on the fact that he didn't illustrate in this comic using a keyboard.
'.145 asks, perhaps rhetorically, "Then why not just write "Ctrl+B"? You can't see the mouse either, but you know what "click" and "select" are referring to."
First of all, it doesn't matter. The comic could also have illustrated use of a menu, but that wouldn't tell us anything about Randall's knowledge of the iconbutton or the keyboard shortcut. Without any information about this, it's not possible to make reasonable inferences about this, and so the explanation shouldn't even go there. Secondly, there are good reasons why an iconbutton makes more sense (not that I'm required to supply them); because keyboard shortcuts are not as discoverable as iconbuttons or menus (and menus take a lot of space that make them hard in a comic of small compact multiples like this one) that means more people are familiar with the menu or button than the keyboard shortcut, and indeed those who know the keyboard shortcut are generally a subset of those who know another method; and further still, "Ctrl+B" is not platform-independent (e.g. Mac users need Cmd+B) or software-independent (InDesign users need Cmd+Shift+B). Thirdly, you can indeed see the mouse pointer, so I'm not sure what '.145 is trying to suggest. And finally, it's utterly ridiculous and kind of offensive to suggest (without any real basis) that Randall doesn't know how to use a word processor. That a person chooses to use one method, even if it's not the most efficient method, doesn't mean they are "not familiar with using word processors." We don't even know what Randall's UI preferences are here, but even if we did that wouldn't be enough to suggest a lack of familiarity rather than a personal preference. The text from this edit is not encyclopedic and should stay out. JohnHawkinson (talk) 14:48, 4 March 2019 (UTC)

In LibreOffice Writer on Linux if I select a word with double-click it doesn't include the space, but if I select it with the keyboard using Ctrl+Shift+RightArrow it does include the space. In the comic it looks like the selection was made with the mouse, but it's not explicit. 172.68.189.193 00:15, 11 July 2019 (UTC)

I do this. 162.158.146.41 (talk) 02:05, 25 August 2022 (please sign your comments with ~~~~)

I work with some autogenerated documents that are over 90% formatting (by file size), despite being pretty uniform to the eye. Still trying to figure out how to edit the autogenerator to not do that. 172.68.245.24 10:06, 1 August 2024 (UTC)