lederhosen | The Purity Of The English Language (Reply)

Prompted by a now-deleted post on a snark* community complaining that 'grok' is a fictional term and is not proper language, regardless of whether it's in the OED...

Since everybody else seems to have strongly-held views on the subject of New Words, I thought I'd have a go at coming up with a compromise that will annoy everyone equally. And just to make sure of that, I'll start by calling on mathematics - specifically, information theory - to dictate a few rules of language. The math-phobic can skip the cut; it's just there to justify some of the principles presented after it.

Information theory is the branch of mathematics related to quantifying and conveying information accurately and efficiently.

To understand the underlying principles of information theory - indeed, to grok it - imagine you're commanding a naval fleet back in the days of semaphore: You need to be able to communicate with the other ships in your fleet, and the only way you have to do this is by waving coloured flags at them. What does this involve?

1. You need to be able to communicate urgent concepts quickly. If it takes ten minutes to say "The Frenchies are coming up behind us, have your men ready for battle!", you can kiss your command goodbye.

2. You need to be able to communicate *common* concepts quickly. If it takes ten minutes to say "Everything is in order", it's not likely to cause any immediate disasters, but your semaphorist's arms will get very tired, and that's ten minutes when he could've been doing something useful like scrubbing the decks.

2.1 You need to be able to communicate urgent-and-common concepts *very* quickly. From here on, I'm just going to lump urgency and frequency together as 'importance', noting that this doesn't quite match the standard meaning of that word - some things that are important in one sense or another may not be important from a language-construction angle.

3. You need to be able to communicate even 'unimportant' concepts - "Happy birthday Captain Hornblower, also do you happen to have any spare lime juice? Ours has leaked." But speed isn't as important here. (This is, BTW, an example of where my usage of 'important' doesn't match the common one - prevention of scurvy is very important by the regular meaning, but it's not urgent.)

4. You need to be able to communicate unambiguously. Confusing "Hard a-port!" with "Hard a-starboard!" is not a good thing.

IRL, semaphorists use different positions and motions as well as different colours & patterns to convey meaning, but let's simplify things by just considering colour, and suppose each flag-wave takes one second.

Obviously, the more colours you have, the faster you can send messages. If you have just two flags, black and white, you can send at most 2046 different messages in ten seconds. Some of them don't take the full ten seconds either - there are two single-flag messages that can be sent in just one second, four two-flag messages, and so on. (This is why it's not just 2¹⁰ = 1024 messages. In practice, you might want some sort of 'end of message' code, or a separate flag reserved solely for that purpose, but I'll ignore that issue for now.)

If you add just one more flag, say red, you can send 88572 different messages in that same time. (Unfortunately, most of these extra messages are at the long end, so this isn't quite as good as a 43-fold improvement.)
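For anyone who'd like to check those counts, here's a minimal Python sketch. The function names are just illustrative, and it makes the same simplification as above: no 'end of message' code, one second per flag.

```python
def count_messages(colours: int, max_flags: int) -> int:
    """Count the distinct flag sequences that are 1..max_flags flags long."""
    return sum(colours ** length for length in range(1, max_flags + 1))


def share_at_full_length(colours: int, max_flags: int) -> float:
    """Fraction of all those sequences that need the full max_flags seconds."""
    return colours ** max_flags / count_messages(colours, max_flags)


print(count_messages(2, 10))   # two flags, up to ten seconds
print(count_messages(3, 10))   # three flags, up to ten seconds
print(f"{share_at_full_length(3, 10):.0%} of three-flag messages take all ten seconds")
```

The last line is the 'long end' problem in numbers: about two-thirds of the three-flag messages are the full ten flags long.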

By adding more and more flags, you can get faster and faster communications. But this will only take you so far - the more flags you add, the easier it is to mix them up and garble the message. (See requirement 4.) Once you've got as many flags as you can support without an intolerable amount of confusion, you need to look at ways to make the best use of the flags you've got. For now, let's just stick with red, black, and white.

At this point, a pure information theorist would take all possible flag combinations, rank them in order of length, and allocate them each a corresponding message, starting with the shortest combinations for the most important. "Hard a-port!", "Hard a-starboard!", and "Fire!" are all very urgent and quite common, so we give them each a one-flag code: red for 'fire', black for 'port', white for 'starboard'.

At the other end of the scale, we decide that there are precisely 88571 messages more imoprtant than "Happy birthday Captain Hornblower, also do you happen to have any spare lime juice? Ours has leaked." So we assign this the last of the ten-flag codes: WWWWWWWWWW. Meanwhile, there are eight Captain Sparrows in the same fleet, so "Happy birthday Captain Sparrow, also do you happen to have any spare lime juice? Ours has leaked" is the 11071st most important message in the queue; this qualifies it for a nine-flag code, which works out at RRBWRRBWR. (Probably better if you *don't* check my base-3 calculation there, since it's probably buggy.)
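Here's a sketch of that allocation scheme in Python, for the curious. It assumes the flags are ordered red, black, white (matching the one-flag codes above) and that codes of the same length are handed out in ordinary base-3 counting order; `code_for_rank` is a made-up name, not anything standard.

```python
FLAGS = "RBW"  # assumed ordering: red < black < white


def code_for_rank(rank: int, flags: str = FLAGS) -> str:
    """Return the flag sequence for the rank-th most important message (1-based)."""
    if rank < 1:
        raise ValueError("rank must be at least 1")
    base = len(flags)
    index = rank - 1  # 0-based position among all codes
    length = 1
    # Skip past every block of shorter codes (3 one-flag codes, 9 two-flag codes, ...).
    while index >= base ** length:
        index -= base ** length
        length += 1
    # Write what is left as a fixed-width numeral in base len(flags).
    digits = []
    for _ in range(length):
        index, digit = divmod(index, base)
        digits.append(flags[digit])
    return "".join(reversed(digits))


for rank in (1, 2, 3, 11071, 88572):
    print(rank, code_for_rank(rank))
```

Under those assumptions, ranks 1 to 3 come out as R, B, and W, rank 11071 as RRBWRRBWR, and rank 88572 as WWWWWWWWWW.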

In real life, though, there are two problems with this approach. One is that it leaves no margin for error. Because every code corresponds to a message, a single mistake can lead to disaster - if my semaphorist has had a bad day and sends RRBWBRBWR instead of RRBWRRBWR, he might be telling the other captains "We must flee, throw all your ammunition overboard to lighten your load, and Mr. Turner is a nancing elf-boy." And two months later everybody dies of scurvy.

The other (somewhat related) problem is that while it allows messages to be very efficiently *carried*, it's cumbersome at both ends. Each ship has to carry around a code-book with nearly a hundred thousand different sequences; presumably your semaphorists will soon learn the most important ones - which are also the shortest - but often they'll have to look through the book before they can figure out what you're saying. Note that WWWWWWWWWW and RRBWRRBWR carry very similar messages, but don't look anything alike; this system would be great for reliable computers with plenty of storage space working over a slow connection, but it just isn't human-friendly.

As long as every possible code stands for a legitimate message, the first problem is inevitable; you can only eliminate it by making your code less compact, and not using every possible combination, so garbled messages can be identified as illegitimate. (You could get the recipient to resend the message to compare against the original - but that at least doubles the time involved in the communication, before you get started on figuring out whether it was the original message or the return message that got garbled.)
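As one illustration of "less compact, but garbles get caught", here's a small Python sketch of a check flag. This is my own toy scheme, not anything a real navy used: treat red, black, and white as 0, 1, and 2, and tack on one extra flag so the values add up to a multiple of three. Any single wrong flag then breaks the sum, at the cost of one extra second per message.

```python
FLAGS = "RBW"  # assumed values: R = 0, B = 1, W = 2


def add_check_flag(code: str) -> str:
    """Append one flag so that the flag values sum to a multiple of 3."""
    total = sum(FLAGS.index(flag) for flag in code)
    return code + FLAGS[(-total) % 3]


def looks_valid(received: str) -> bool:
    """True if the received sequence (including its check flag) sums to 0 mod 3."""
    return sum(FLAGS.index(flag) for flag in received) % 3 == 0


sent = add_check_flag("RRBWRRBWR")   # the lime-juice request, plus its check flag
garbled = "RRBWBRBWR" + sent[-1]     # the fifth flag waved wrongly
print(sent, looks_valid(sent))
print(garbled, looks_valid(garbled))
```

It only *detects* the error, and only some errors at that: swap two neighbouring flags and the sum is unchanged, so the mistake sails straight through.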

A standard solution to the second one is to break messages up into smaller concepts - for instance, RWRRB might always stand for 'lime juice'. This makes it more human-friendly, and less compact, which goes a long way to addressing the first problem; it has the added benefit that it makes it easier to repair a garbled message by context. (Above, you might have noticed that I typoed 'important' as 'imoprtant'; that was a genuine mistake, but I left it in because it illustrates this point nicely.)
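The repair-by-context idea can be sketched in a few lines of Python using the standard library's `difflib`. The vocabulary list here is a made-up stand-in for "words the reader already knows", and the similarity cutoff is an arbitrary choice; the point is only that a garbled word which isn't itself a legitimate word can be snapped back to its nearest legitimate neighbour.

```python
import difflib

# Hypothetical stand-in for the reader's known vocabulary.
VOCABULARY = ["important", "imported", "impotent", "urgent", "common"]


def repair(word, vocabulary=VOCABULARY):
    """Return the known word closest to `word`, or None if nothing is close enough."""
    if word in vocabulary:
        return word  # already legitimate, nothing to repair
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else None


print(repair("imoprtant"))  # the typo from earlier
print(repair("xyzzy"))      # nothing in the vocabulary is close
```

Note that this only works because most random letter-strings are *not* words; in the maximally compact flag code, every garble would already be in the vocabulary and there'd be nothing to snap back to.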

There's a caution there for those looking to make the English language too efficient - many of the existing 'inefficiencies' and 'wastage' in the language actually serve as a form of error-detection and correction.

lederhosen's principles of vocabulary. Note that some of these conflict, and have to be weighed against one another, except for rule 9 which is non-negotiable.

1. Important (in particular, common) concepts should have compact expressions. The more it's used, the shorter it ought to be.

2. As our world and context change, so does the importance of various concepts. It's no longer as important to be able to say "bear" in a hurry as it used to be; "computer", OTOH, has become ubiquitous.

3. To keep language effective, it needs to be able to change to reflect these facts. Where there's a need for a new word, or a shortening of an existing word, we should be willing to accept such novelties.

4. If a lot of people adopt a neologism, this is evidence that such a word was needed, and can be taken as grounds for its acceptance.

5. Exception to #4: if there's already a perfectly good & compact word for this purpose, use that one instead. Neologisms should be created due to need, not ignorance and laziness.

6. Exception to #4: Where possible, neologisms should be user-friendly. As far as possible, this means following existing patterns of language. Adapting existing English is great; borrowing from other languages is good. Words derived from Latin etc. are more likely to be readily understood and accepted than words made up from scratch.

7. Exception to #6: Sometimes, insistence on following existing patterns may get in the way of #1 and #3. Latin constructions tend to become fairly long; as such, they're admirably suited for necessary but uncommon pieces of vocabulary - for instance, many academic terms - but less so for things like "blog".

8. User-friendliness also means avoiding ambiguity. English already has more than enough homophones, thank you very much.

9. Numbers are not letters and should not be used phonetically, EVER, with a possible exception for Sinead O'Connor when covering Prince.

I quite like 'grok' because it satisfies almost all of the above principles. It offers a compact and unambiguous word for an important nuance that isn't adequately conveyed by any other short form - 'understand' and 'comprehend' are longer, and as with 'know' they lack the connotations of fully absorbing and coming to terms with the concept. (Indeed, the fact that it's hard to explain 'grok' except by example is a proof that the niche exists.) The only one it doesn't satisfy is relationship to pre-existing language, and I think the others greatly outweigh this.

*Carrollites will no doubt appreciate the irony.
