So, cerebrate mentioned that 'mb' is the abbreviation for 'millibit', and the question came up: what possible use is there for such a unit? How can you have a fraction of a bit?
This is much like asking "how can you have a shape with a fractional number of dimensions?" At first glance, it makes no sense. But in trying to extend familiar notions to new applications, it becomes a very useful idea indeed, with applications in everything from data transmission to investment theory.
(Before I start, a caveat: It's been ten years since I did this stuff. Some of the details and/or terminology may be a little off. The gist of it should be correct, though.)
I'm about to generate two random events. First off, I'm going to toss a coin. Maybe it'll come up heads, and maybe it'll come up tails.
Now, let's consider another random event: tomorrow morning I'm going to open my front door, and either there'll be a guy in a Santa suit standing there, or there won't. I don't have any reason to expect such a thing, but it's not impossible.
In both cases, as stock-market enthusiast Paul Marsh would say, 'there are two possibilities'. Either the coin comes up heads, or it comes up tails; either Santa's there, or he isn't. But one of those feels a lot more random than the other - I can much more accurately predict whether Santa will show up, simply by saying "no, he won't", than I can call the outcome of a coin toss - and information entropy helps to quantify that difference.
Let's suppose I was to toss that coin ten thousand times, and I wanted to communicate all the results to you. Every toss of the coin has a 50/50 chance of coming up heads or tails, with no particular pattern to it. I need to transmit a string of ten thousand 'H's and 'T's (or equivalently, '0's and '1's), which means a minimum of ten thousand bits; there's no way around this. There are 2^10000 equally likely strings to transmit.
Now let's suppose I was to open my front door every morning for thirty-odd years, a total of ten thousand times, and I wanted to communicate to you which of those times I found a guy in a Santa suit waiting there. My transmission to you would most likely look something like this: "Opened door ten thousand times, no Santa." Or maybe "Opened door ten thousand times, Santa only showed on June 12th 2026." That's something like 500 bits worth of message, and if we know in advance that Santa showing up is an unlikely event we could pre-arrange a code that compresses it even further: "0-0-00" for no-show, "12-6-26" for one appearance on June 12th 2026, "12-6-26, 30-8-29" for two appearances, and so on. The chances of him showing up often enough to make this take more than 1000 bits - let alone 10 000 - are negligible.
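To get a feel for how cheap that pre-arranged code can be, here's a toy version in Python (the byte format and function name are my own invention, nothing standardised):

```python
import struct

def encode_sightings(day_indices):
    """Toy pre-arranged code: a 1-byte sighting count (up to 255),
    then each sighting as a 2-byte day index (0..9999)."""
    assert all(0 <= d < 10_000 for d in day_indices)
    return struct.pack(">B", len(day_indices)) + b"".join(
        struct.pack(">H", d) for d in day_indices)

print(len(encode_sightings([])) * 8)      # 8 bits: ten thousand days, no Santa
print(len(encode_sightings([4181])) * 8)  # 24 bits: one sighting
```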
Going back to single events, then, information entropy describes how 'random' such an event is. And here I dredge out a definition, due to Claude Shannon:
When considering a random event x with n possible mutually exclusive outcomes - each with probability p_i, where the p_i have to add up to 1 between them - the entropy H(x) of that event is given by
H(x) = - sum (over i = 1 to n) of p_i * log2(p_i).
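In Python, that definition is only a few lines. Here's a minimal sketch (the function name is my own):

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits: H = -sum of p * log2(p)."""
    assert abs(sum(probs) - 1) < 1e-9, "probabilities must sum to 1"
    # Outcomes with p = 0 contribute nothing (x*log2(x) -> 0 as x -> 0).
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # a fair coin toss: 1.0 bit
```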
For the coin-toss, there are 2 possible outcomes, each with probability 0.5, which gives us H(x) = -2*(0.5*log2(0.5)) = 1 bit.
If I toss the coin twice, and treat that as a single random event with four possible outcomes (HH, HT, TH, TT), all equally likely, H(x) = -4*(1/4 * log2(1/4)) = 2 bits.
If I toss the coin ten thousand times, and treat that as a single random event with 2^10000 equally likely outcomes, H(x) = -2^10000 * (1/2^10000 * log2(1/2^10000)) = 10 000 bits. And so on - when you're combining a bunch of events with independent outcomes (coin-toss #3 doesn't influence #4, and so on) you can simply add the entropy of these events together to find the entropy of their combination.
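Here's that additivity in action, treating two independent tosses as one four-outcome event:

```python
from math import log2
from itertools import product

def entropy(probs):  # as defined above
    return -sum(p * log2(p) for p in probs if p > 0)

one_toss = [0.5, 0.5]
# Joint distribution of two independent tosses: HH, HT, TH, TT at 1/4 each.
two_tosses = [p * q for p, q in product(one_toss, one_toss)]
print(entropy(two_tosses))  # 2.0 bits: the two single-toss entropies add
```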
This means, incidentally, that if you take k bits of computer memory and completely randomise their contents, the entropy of that event is equal to k bits. That's not a coincidence; Shannon set the constants in that formula to make sure it would turn out that way. In general, if the entropy of an event is k, it will take an optimal communication code an average of k bits to communicate the outcome of that event (rounding up, but combining multiple events into a single event reduces the contribution of rounding.)
Now, if Santa has a 0.001% chance of showing up on any particular day, the entropy of that random door-opening is -(0.00001 * log2(0.00001) + 0.99999 * log2(0.99999)) = 0.0002 bits. If I'm only communicating the results of one day, I still have to round that up and use one whole bit to tell you whether he did or didn't come. But if I'm reporting on ten thousand days, the total entropy there is 10 000 * 0.0002 = 2 bits. If I use an optimal data-coding scheme - note that picking the optimal scheme requires estimating the probability in advance - I can communicate ten thousand days' worth of Santa-watching in as little space as it takes for a couple of coin-tosses. (This is an average; shorter messages for the most typical outcomes, where Santa doesn't show or only shows once, are balanced by longer ones in the unlikely event that he shows repeatedly.)
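Running the entropy sketch on those Santa numbers confirms the figures:

```python
from math import log2

def entropy(probs):  # as defined above
    return -sum(p * log2(p) for p in probs if p > 0)

p = 0.00001                # a 0.001% chance on any given day
per_day = entropy([p, 1 - p])
print(per_day)             # ~0.00018 bits per door-opening
print(per_day * 10_000)    # ~1.8 bits for ten thousand days of Santa-watching
```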
For a more complicated example, let's suppose we go to a ten-horse race, and all we're interested in is which horse wins. Without any information on the quality of the different horses, there are ten equally likely outcomes, which gives us a total entropy of log2(10) = 3.32 bits.
But as I mentioned in this post and also this one, statements of probability are always based on some prior knowledge. Suppose we acquire better knowledge - for instance, suppose we look at the bookmakers' odds and gauge from those that Archer has a 25% chance of winning, Boffin and Charlemagne each have a 20% chance, the next two runners 10% each, Drongo and Edgar each have 5%, Fandango and Gollum 2% each, and Huckleberry has a 1% chance. Then plugging those numbers into Shannon's formula gives an entropy of 2.82 bits. We've gained half a bit's worth of information.
Suppose now that Reliable Eddie tells us that Fandango has been nobbled, dropping his chances to 1%, and Gollum has been injected with something that'll boost his chances to 3%. Plugging this in drops the entropy to 2.81 bits - we've improved our knowledge by 0.01 of a bit, or about 10 millibits. This isn't very *much* information, but it's something.
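Checking those race numbers with the same sketch:

```python
from math import log2

def entropy(probs):  # as defined above
    return -sum(p * log2(p) for p in probs if p > 0)

no_form   = [0.10] * 10                       # ten horses, nothing known
bookies   = [0.25, 0.20, 0.20, 0.10, 0.10,
             0.05, 0.05, 0.02, 0.02, 0.01]    # odds-based estimates
after_tip = [0.25, 0.20, 0.20, 0.10, 0.10,
             0.05, 0.05, 0.01, 0.03, 0.01]    # Fandango down, Gollum up

print(entropy(no_form))    # 3.32 bits
print(entropy(bookies))    # 2.82 bits
print(entropy(after_tip))  # 2.81 bits
print((entropy(bookies) - entropy(after_tip)) * 1000)  # ~7.5 millibits gained
```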
A major branch of financial mathematics, known as portfolio analysis, deals with the question of how to best use such knowledge discrepancies to turn a profit on investments - if everybody else thinks somebody's venture has a 5% chance of yielding a million-dollar profit, but your analysis suggests it's more like 10%, how much should you invest in it? About ten years back, I did some vacation work with some mathematicians from the University of Sydney who were working on this sort of problem.
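The textbook answer to the 'how much should you invest' question is the Kelly criterion - I can't say it's what the Sydney group actually used, but it's the classic starting point, and fittingly it's information-theoretic at heart (it maximises the expected logarithm of your wealth). The numbers below are hypothetical:

```python
def kelly_fraction(p, b):
    """Fraction of bankroll to stake on a bet paying b-to-1 net odds,
    if you believe it wins with probability p; never bet a negative edge."""
    return max(0.0, (b * p - (1 - p)) / b)

# The market prices the venture as a 5% shot (fair net odds of 19-to-1),
# but your own analysis says 10%.
print(kelly_fraction(p=0.10, b=19))  # ~0.053: stake about 5% of your float
```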
But because financial markets are slow-moving things and getting that sort of information is very difficult, they'd chosen instead to use horse racing as a testbed for their methods.
A brief aside: In Australia, you can bet either with private bookmakers or the TAB (Totalisator Agency Board). Bookies set their own odds, according to their assessment of each horse's chances: if you buy a ticket at ten-to-one, that's what it pays off at. If you bet with the TAB, OTOH, you don't get fixed odds at the time of betting. Rather, they add up the total money bet, take off their percentage, and redistribute the rest according to the size of the winners' bets. (It's a bit more complicated than that, because it also has to allow for place, quinella, trifecta etc. bets, but that's the basic principle.) As bets come in, the TAB displays the current odds, but until betting stops and the race begins those are subject to change.
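As a toy illustration of the totalisator principle (the takeout rate and pool sizes here are invented for the example):

```python
def tote_dividend(pools, winner, takeout=0.15):
    """Pool all the win bets, remove the house takeout, and share what's
    left among the money bet on the winner; returns the payout per $1 staked."""
    total = sum(pools.values())
    return total * (1 - takeout) / pools[winner]

pools = {"Archer": 5000, "Boffin": 3000, "Huckleberry": 200}
print(tote_dividend(pools, "Huckleberry"))  # ~34.85: long shots pay big
```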
The result of this is that bookies' odds are determined by professionals who have a strong interest in accurately predicting the odds, while TAB odds are determined by the cumulative effect of millions of mostly-amateur gamblers. As such, bookies' odds are probably a slightly better estimate of the horses' true chances. So the Uni of Sydney guys would use the bookies' odds as an "informed estimate" and try to exploit discrepancies between this and the TAB's odds. (This meant waiting until as late as possible, just before the close of betting, to get the best idea of what those odds would actually be - a fast laptop is useful to have here.)
IIRC, they had a float of something like $5000 to bet with. In the course of the day I spent with them, they made a small profit, about enough to cover drinks. I tried gambling with a bit of my own money, not according to any particular scheme, just for the fun of it, won a few, lost a few more, and came out about $10 behind on the gambling altogether. But looking down at the floor I noticed a $50 bill somebody had dropped; I asked the punters in the area whether anybody had dropped one, but none of them claimed it, so I got to keep it and came out $40 ahead on the day :-)
Footnotes:
The theory above can be generalised to infinite n, and to non-discrete probability distributions. Especially in the latter, it's closely related to the 'entropy' of physics and chemistry, with high-entropy states corresponding roughly to the 'most random' arrangements of atoms and molecules, although I'm not really competent to go into the detail of that relationship.
I posted a while back about applying information theory to language; a lot of the concepts invoked there are also present here.
Further reading: Wikipedia: Information entropy and Shannon's seminal paper on the subject.