Saturday, April 18, 2009

Benford's Law

When a number is randomly chosen from a large data set, such as stock quotes, census statistics, or scientific data, what is the probability that the first digit is a 1? Excluding the possibility of a leading 0, the probability would logically seem to be 1 in 9, or about 11.1%.

If you test this hypothesis on real world data, though, you find that the probability that the first digit is a 1 is actually about 30.1%, the probability that the first digit is a 2 is about 17.6%, the probability that the first digit is a 3 is 12.4%, and the probabilities keep dropping so the probability that the leading digit is a 9 is only 4.5%, as the following graph illustrates.



This distribution fits the rule that the probability that the first digit is d is

pd = log10(1 + 1/d)


Cover of Henry Briggs' original table of logarithms.  Source: Eric's computer museum.This distribution is called Benford's Law after physicist Frank Benford who discovered it in 1938. Benford wasn't the first to notice the distribution. Astronomer and mathematician Simon Newcomb made the same discovery 57 years earlier when he noticed that the earlier pages of logarithm tables were dirtier and more worn than the later pages.

Benford tested thousands of data sets for the distribution of leading digits, including geographic data, physical properties of chemicals, baseball statistics, and street addresses. He found the same pattern repeated in all of these seemingly unrelated sets of data.

The following graph shows that the first digits of recent stock prices also closely resemble Benford's distribution.



Properties

Distributions that cover several orders of magnitude are likely to satisfy Benford's Law very closely. This is the case for sets of numbers that are allowed to grow exponentially, such as individual incomes and stock prices.

Another property of Benford's Law is that it is scale invariant. This means that the leading digits of the stock data above, for example, would follow Benford's Law even if it were converted to other currencies, such as Euros, or Japanese Yen.

In addition to being scale invariant, Benford's Law can be shown to be base invariant as well. If you convert the values from a set of data that adheres to Benford's Law to any other base system, the new data set will continue to adhere to the law, with one slight modification. The probability distribution of the digits of the new base can be calculated with the following adjusted formula:

pd = logbase(1 + 1/d)

where d takes the value of each non-zero digit of the new base.

Limitations

Distributions that have a built in minimum or maximum value are unlikely to satisfy Benford's Law. For example, one might expect a set of numbers representing "small insurance claims" to obey the law. However, if "small" in this instance is defined to be between $50 and $100, then some leading digits are excluded from the range by definition.

Distributions that cover only one or two orders of magnitude, or even fewer, are also unlikely to satisfy Benford's Law. Adult IQ scores are an example of a set of data that covers a relatively narrow range, despite having no theoretical maximum.

Explanation

A simple explanation of Benford's Law can be found in inflationary price increases. The price of an item that starts out at a cost of $1.00 with a steady 3% rate of annual inflation will have a leading digit of 1 for 24 years, until the price reaches $2.03 in the 25th year. The leading digit will be 2 for the following 14 years, 3 for the next 9 years, 4 for 8 years, 5 for 6 years...and 9 for only 3 years before the price reaches $10.03 in the 79th year. After that, the leading digit will be a 1 again for another 24 years.

When you take into account the fact that inflation affects a wide range of consumer goods, you can see why, if you take a sample of all of the prices in your local grocery store at one specific moment in time, the prices as a group adhere to Benford's Law. Every item in the store has been going through its own exponential price increase over time, so the probability of a randomly selected item's price having a leading digit of 1 at the particular moment that you sample those prices will be roughly 30.1%.

Applications

Benford's Law may seem like a mere mathematical curiosity, but it has some surprisingly pragmatic applications. Based on the assumption that people who attempt to falsify data will tend to distribute digits uniformly, Benford's Law can be used to expose potential fraud in accounting, insurance claims, and tax forms. Other uses, such as analyzing the results of clinical trials and election results, have also been proposed.

For more information on applications of Benford's Law, see I've Got Your Number: How a mathematical phenomenon can help CPAs uncover fraud and other irregularities.

Friday, April 10, 2009

Two Puzzles

First scenario: You are presented with four cards and are told that the cards follow two simple rules.

1. On one side of each card is a letter and on the other side is a number.
2. Cards with vowels on one side must always have even numbers on the other.

Rule #1 is always true. On the sides of the cards you can see are printed:

A 18 C 57

First question: How many cards do you need to turn over in order to check rule #2 above?

Take a moment to think about your answer before moving on to the next scenario. The immediate answer that many people give is the wrong one.

Second scenario: You work as a manager at a popular bar/restaurant. Because you have an attached game room, the bar has become a popular hangout spot for students from the local community college. One of your duties as manager is to periodically make a sweep through the game room to make sure none of the students under the age of 21 (the legal drinking age in the U.S.) have procured beer from the bar. For the sake of simplicity, the bar only serves beer and cola (no mixed drinks), so you can tell by simply looking in a patron's glass what they are drinking.

On one such sweep through the game room you see only four customers.
  • A man of indeterminate age holding mug of beer.
  • A woman drinking a glass of cola.
  • A man with a long gray beard whose glass you cannot see.
  • A young teenage girl whose glass you also cannot see.
Second question: For which of the patrons above do you need to check identification and/or beverage choice?

The second question isn't intended to be as tricky as the first. Your first instinctive answer is probably the correct one.

Note that the second question is simply a more concrete version of the first. I've replaced even numbers with ages, but that's really the only difference. Now that you know this, go back and read the first question again. If you thought you needed to check all four cards in the first question, would you like to take another guess after reading a concrete analogue? Now what about if you thought you needed to check three cards?

Spoilers Ahead!

Don't continue reading unless you want to see the solutions.
.
.
.
.
.

Solution to the first scenario

Many people mistakenly think, when presented with the first scenario, that you have to turn over all four cards to check the reverse. The problem is in the details of the wording of the second rule.

2. Cards with vowels on one side must always have even numbers on the other.

This in no way implies that a card with an even number on one side must have a vowel on the other, but this is the mistaken assumption that many people make. The fact is, if a card has an even number on one side, you don't need to care what kind of letter is on the reverse. A consonant printed on the other side doesn't break the second rule.

Based solely on rule #2, you shouldn't care what's printed on the other side of the card marked '18' or the card marked with the letter 'C'. You do need to flip over the cards marked 'A' and '57'. If the card marked 'A' has an odd number on its reverse, then rule #2 is broken. Similarly, rule #2 will be broken if the card marked '57' has a vowel on its reverse side.

Solution to the second scenario

The more concrete (but still somewhat contrived) second scenario is answered correctly much more often than the first. Let's go through each patron and decide if we need to check their age or their beverage.
  • A man of indeterminate age holding mug of beer. Clearly you need to check his age to see if he's above the legal drinking age.
  • A woman drinking a glass of cola. You don't need to check her age because she's drinking cola, which patrons of any age are permitted.
  • A man with a long gray beard whose glass you cannot see. If a man has a long gray beard then he is presumably of a legal drinking age, so you don't need to see what's in his glass.
  • A young teenage girl whose glass you also cannot see. You need to check the girl's glass to see what she is drinking.
So, just as in the first scenario where we had two cards we needed to check, here we have the same number of patrons who we need to check.

As I said before, the two puzzles are virtually the same in every way, except one is a more concrete (and for some puzzling reason easier) version of the other. When you have an abstract problem that you just can't seem to get to the heart of, sometimes it will help to assign concrete values to the abstract variables of the problem. Once you've solved a concrete instance of the problem, go back and see if this solution has shed some light on the original problem.

If you like logic puzzles of the sort I presented here, I highly recommend the book Conned Again, Watson! Cautionary Tales of Logic, Math, and Probability by Colin Bruce, from which I adapted these puzzles.

Tuesday, April 7, 2009

Bits and Pieces

I found a link to an interesting article indicating that Most Britons have lied about the books they read on reddit today. Before anyone accuses me of Brit bashing, let me just say that I'm sure most Americans are just as quick to inflate their own literary tastes. I only mention this because it expands on my own prior observations about the books that programmers don't really read.



On 24 last night, an FBI analyst broke an AES block cipher in under 30 seconds. The really strange part was that the cipher was used to encrypt the 5-digit combination to an electronic door lock.

Someone needs to tell Hollywood writers that if I'm cracking an N-digit code, and I already know N-1 digits, it won't take my computer a few more seconds to figure out the last digit, no matter how much tension it builds. Most 10 year olds could figure it out in a few seconds.



John Cook posted a couple of good articles on floating point numbers on his blog, The Endeavor, recently. Floating point numbers are a leaky abstraction and Anatomy of a floating point number are both good reads for anyone interested in math or programming.

John's regular and high-quality posts have made The Endeavor one of my favorite blogs on the 'Net. Anyone who reads this blog should give it a try. I think you'll really enjoy it.