How Much Information is in a Message?

Trying to tackle concepts of entropy and redundancy in data

  1. Colin Weaver

“How much information is in a message?”

Huh???

That sentence, in the context of typical use of those words (information & message), doesn’t immediately make sense to most people. Well, I know it didn’t make sense to me, at least.

So let me try asking it in a seemingly more complicated but different way:

“If you wanted to communicate to me one of a range of possible values and you wanted to do it using only binary bits, how many binary bits would you have to use to be able to communicate any of the possible values?”

For example, let us suppose you want to communicate one of three (3) possible values:

  • Yes
  • No
  • Maybe

We could agree on the following ‘meaning’-to-binary mappings:

  • Yes: 00
  • No: 01
  • Maybe: 10
  • <unused>: 11

If you want to say “Yes” in response to something I ask you, you can communicate “00” to me. If you want to say “No”, you communicate “01” and a “Maybe” is communicated by you sending me a “10”. If you were to send you a “11”, it would not mean anything to me (unless we agreed later on that it now meant something (i.e. the ’11’ was information rather than being undefined)).

So, the MINIMUM number of binary bits we would have to use to communicate the different meanings is 2, with one bit combination (11) not used for anything.

Using 2 binary bits you can communicate 3 different meanings (Four, really. But we have only defined three meanings in our example).

Now let’s say that instead of using two binary bits to communicate Yes|No|Maybe we decide to instead use the characters “Y”, “N”, and “B” instead. In a computer each ASCII character is represented using an 8-bit value. For simplicity, let’s say that these are our mappings (Note: in real life they are different):

Y = 00000000
N = 00000001
B - 00000010

Now when I want to say ‘Yes’, ‘No’ or ‘Maybe’ to you I will have to communicate 8 bits rather than the original two bits. Here’s one point to take away from this (so far): I am using MORE bits but my message DOES NOT contain any more information. I still have only 3 possible meanings (Yes, No, and Maybe).

Imagine what would happen if I wanted to actually communicate the actual words, “Yes”, “No”, or “Maybe”. The longest word is 5 ASCII characters so the field we would have to use for all three would be 5 bytes (40 bits) long.

Here are the binary encodings of the words "YES", "NO", and "MAYBE":
------------------------------------------------
YES =    0000000000000000010110010100010101010011
NO =     0000000000000000000000000100111001001111
MAYBE =  0100110101000001010110010100001001000101

In those 40 bits that are used, only 3 options are defined (only 3 ‘meanings’). That means there are 1,099,511,627,773 other combinations of 0 and 1 that are not used (i.e. 240)

The “amount of information in each of these message options is the same: a little less than 2 bits (because “11” doesn’t mean anything in our example).”

The word used to describe the amount of information contained in a message (M) is “ENTROPY” and is represented in math as an “H”. So, mathematically, this is written as: H(M)

(Note: The word entropy is used in a lot of situations (cryptography, thermodynamics, physics, etc.) to describe different, yet still closely related things.)

You calculate the entropy by doing log2 n, where ‘n‘ is the number of possible meanings in the message.

In our example, there are 3 meanings so log2 3 = 1.58496
Note: This is the same as 21.58496 = 3

Let’s create some more meanings that we might want to communicate. Now let us assume you want to communicate your emotions and the only way you can do so is using binary bits across a copper cable. Here are the emotions you want to communicate:

- Happy
- Sad
- Mad
- Scared
- Envious
- Disgusted
- Surprised
- Confused

Then the entropy of a message would be:
log2 8 = 3

So you can say, “I can communicate eight different meanings with 3 binary bits.

This means it would take no less than 3 bits to communicate your meaning. Like this:

Meaning     | Bit pattern
------------|---------------
- Happy     | 000
- Sad       | 001
- Mad       | 010
- Scared    | 011
- Envious   | 100
- Disgusted | 101
- Surprised | 110
- Confused  | 111

This entropy value of 3 is also considered a measure of UNCERTAINTY. Put more plainly that is effectively answering the question, “How many questions would I have to ask you in order to figure out your meaning?”

The answer is: 3

But let’s look at why:

For me to determine your meaning I would start by asking:

Question #1:

  • “Is your meaning in the range [Happy, Sad, Mad, Scared] or [Envious, Disgusted, Surprised, Confused]?”
  • If you say it is in range [Happy, Sad, Mad, Scared] I will then ask:

Question #2:

  • “Is your meaning in the range [Happy, Sad] or [Mad, Scared]?”
  • If you answer that it is in range [Mad, Scared] I will then ask:

Question #3:

  • “Is your meaning ‘Mad'”?
  • If you answer ‘yes’, I have determined your meaning and if you say ‘no’ I have also determined your meaning because only ‘Scared’ would be left. When you have only 8 meanings it will never take me more than 3 guesses/questions (bits) to figure out which meaning it is.

So when you have 8 possible meanings we say that the ENTROPY of your message is 3 bits; you can communicate your 8 possible meanings using a minimum of 3 bits.

The is also a measure of UNCERTAINTY because it also means that I would have to figure out/recover 3 bits in order to determine your message.

Look back at the table of meanings (emotions) and their binary mappings for a moment. Question #1 as written above isn’t really what I was asking you. What I really asked you was, “Is the leftmost bit a zero (0)?” If you answered ‘yes’ I knew your word must be one of [Happy, Sad, Mad, Scared] because the leftmost bit for each of those is a zero. Similarly, if the leftmost bit is not a zero then it must be a one (1). That would mean that your meaning must be one of [Envious, Disgusted, Surprised, Confused] because each of those has a leftmost bit of one (1).

Question #2 was really asking you if the middle bit was a one (1) or a zero (0).

Question #3 was really asking you if the rightmost bit was a one (1) or a zero (0).

But let’s suppose for a moment that you use a 1 byte value to store your meaning in a database.

Meaning       | Bit pattern (1 byte)
--------------|---------------------
- Happy       | 00000000
- Sad         | 00000001
- Mad         | 00000010
- Scared      | 00000011
- Envious     | 00000100
- Disgusted   | 00000101
- Surprised   | 00000110
- Confused    | 00000111

In this situation there are now 256 possible bit patterns (00000000 –> 11111111) but you are only using 8 of them. The first 5 bits (from left to right) of your possible meanings is always 00000.  No point in my trying to figure those out; I already know what they are.

The 6th bit, being a 0 tells me your meaning is in the range [Happy, Sad, Mad, Scared] (000000xx), and the 6th bit being a 1 tells me your meaning is in the range [Envious, Disgusted, Surprised, Confused] (000001xx).
If I then recover the 7th bit I will know you are in the range:

  • [Happy, Sad] (0000000x)
  • [Mad, Scared] (0000001x)
  • [Envious, Disgusted] (0000010x)
  • [Surprised, Confused] (0000011x)

Now all I need to do is recover one more bit to determine your actual meaning (your emotion). Notice how it still only took me 3 guesses (or the recovery of 3 bits) to learn your message. From a security perspective, the first 5 bits did not contribute to your security so they did not increase the entropy. If you had 256 possible meanings defined then your entropy would be 8.0 (log2 256 = 8) rather than 3.0 (log2 8 = 3).

Because you never use the first 5 bits it dramatically reduces the number of bits I would have to recover to figure out your message. In other words, there is less UNCERTAINTY in your message (i.e. it has lower ENTROPY)

Let’s build on that idea some more…

The English language uses a 26 letter alphabet. If you do the math to calculate the ENTROPY it would be log2 26 = 4.700.

What does this mean? A few things:

  1. If you wanted to represent each of the 26 letters using binary bits you would need a MINIMUM of 5 bits (4.7 has to be rounded up to 5.0 because you can’t have 0.7 of a bit in practice; a bit has to be a 1 or a 0. There is no 0.7th of a bit). If you have 5 bits you actually have 32 possible values (25 = 32) but you would only use 26 of them. You could actually use the remaining bit patterns to represent things like punctuation (for example: a space, a period, exclamation point, etc.)
  2. It also means that if I wanted to guess a letter I would have to make, on average, 4.7 guesses to figure out your letter. This would be done by breaking the alphabet into groups.Let’s say your letter is “H”. I would ask:
  • Question 1. Is your letter in the range A-M?
    You answer ‘yes’ so I know that it is not in the range N-Z.
  • Question 2. Is your letter in the range A-G?
    You answer ‘no’ so I know the letter is in the range H-M.
  • Question 3. Is your letter in the range H-J?
    You answer ‘yes’ so I know the letter is not in the range K-M.
  • Question 4. Is your letter H?
    You answer ‘yes’. 4 guesses.

Let’s do this again but this time your letter is “G”. I would ask:

  • Question 1. Is your letter in the range A-M?
    You answer ‘yes’ so I know that it is not in the range N-Z.
  • Question 2. Is your letter in the range A-G?
    You answer ‘yes’ so I know the letter is not in the range H-M.
  • Question 3. Is your letter in the range A-D?
    You answer ‘no’ so I know the letter is in the range E-G.
  • Question 4. Is your letter E?
    You answer ‘no’ so I know the letter is in the range F-G.
  • Question 5. Is your letter F?
    You answer ‘no’ so I know your letter is G.
    5 guesses.

Regardless of the letter it will always take a minimum of 4 but never more than 5 guesses to correctly guess your letter. When you do the math on this you will see that, on average, it will take 4.7 guesses to correctly guess your letter (i.e. repeat this 26 times, using a different letter each time and average the number of guesses).

This means that if I am trying to recover your message I will have to recover, on average, 4.7 bits to determine your meaning.

However, this assumes that you are just as likely to choose the letter ‘Z’ as you are the letter ‘E’ or ‘I’. And when it comes to use in everyday language, this IS NOT TRUE. Some letters get used more than others. It’s a function of our language and how we communicate. This means that STATISTICALLY there is a greater likelihood that your letter will be ‘E’ or ‘I’ before it will be ‘Z’

In math the term ABSOLUTE RATE of a language is log2 n were ‘n’ is the total number of letters in the language. So, for English and its 26 letter alphabet it would be log2 26 = 4.7, which I have already written about above. But this calculation assumes that each letter in our language is equally likely to be used and anyone who communicates in English knows that just is not true.

Because letters like R, S, T, L, N and E are used more often than letters like Z, X and Q and because some letters have a tendency to follow other letters (like H after T and U after Q) we can even further increase the likelihood that we can guess a letter in fewer guesses. This also means that you can communicate a letter with fewer bits than expected.

To a cryptanalyst, this means that if they can determine one particular bit (or bits) in a message they can infer/guess what the meaning will be. They don’t have to recover all the bits, they just have to get a particular bit (or bits) and can infer/guess the rest of the bits.

The term used to describe this characteristic that, in language, some letters are more likely to be used than others is REDUNDANCY. The math calculation for it is:

Redundancy (D) = Absolute Rate of a Language (R) – Practical Language Rate (r)

D = R - r

Claude Shannon suggested that the practical language rate (entropy) of English was closer to 2.3 than 4.7 and others have since suggested that it can be even lower, with 1.3 being a commonly referenced value.

So, using the lower value:

D = 4.7 - 1.3 = 3.4

This equation says that the English language carries 3.4 bits of redundant information (on average).

According to Bruce Schneier (in his book, Applied Cryptography):
If English (ASCII, a-z only, 1-byte values) has 1.3 bits of information per byte (an entropy of 1.3) then there are 6.7 bits of redundant information in each byte (8 bits – 1.3 bits = 6.7). This means that the redundancy of a single bit is .84 (6.7s bit of information / 8 bits). This means the entropy of a single bit is .16 (1 bit – .84).

Now I need to put all of that into words that I can understand. What does all of that mean in practical application?

Redundancies in data mean that information can often be communicated by using less than what the information entropy suggests. This is the basis for compression algorithms (removing redundant bits) and can also be the basis for being able to guess/recover a value in fewer than anticipated guesses. It can also serve as as tool for cryptanalysts.

So when it comes to security (encryption) we need to try to get rid of redundancies as much as possible so that attackers have a harder time using them against us. In other words, we have to obscure the redundancies.

How?

Confusion & Diffusion

CONFUSION

CONFUSION obscures the relationship between the plaintext and the ciphertext. This makes taking advantage of redundancies more difficult for an attacker.

How do you create ‘confusion’?  Answer: Substitution.

With substitution, plaintext bits are substituted for ciphertext bits. How the confusion mechanism works in an algorithm will change for each bit in the plaintext or the key.

STREAM CIPHERS depend almost exclusively on CONFUSION for their security.
Note: A stream cipher using ‘cipher feedback mode’ is also adding DIFFUSION so it’s not safe to say that STREAM CIPHERS ONLY use CONFUSION.

DIFFUSION

DIFFUSION “dissapates the redundancy of the plaintext” by spreading it out over the ciphertext, making them more difficult to find by cryptanalysis.

How do you create ‘diffusion’?  Answer: Transpositon (also called Permutation).

Diffusion by itself is fairly easy to crack/unravel. This is true because, by itself, all it does is rearrange the plaintext characters. A a simple example.

If I take the plaintext word:

FANTASTIC

I can transpose the letters like so:

NTATFSIAC

With a length of 9 and a character set of 7 (fantsic), my MacBook can try every possible combination in less than 2 seconds.
Note: There are 79 combinations (40,353,607). Using a tool like crunch can make generating them very easy.

Ex:

crunch 9 9 fantsic

BLOCK CIPHERS use BOTH CONFUSION and DIFFUSION to achieve security.

Cheers,

Colin Weaver

About the Author

Colin Weaver

Colin Weaver is co-owner and lead instructor at ITdojo, Inc., a network security and information assurance training center and consulting firm located in Virginia Beach, VA. His passion for technology, networks, and security has led him to become enthralled with the idea of IPv6 and its implementation. In this blog he will share with you glimpses of what he has learned and a hint at what you’ll learn in his classes.