Benford's law
Benford's law, also known as the Newcomb–Benford law, the law of anomalous numbers, or the first-digit law, is an observation that in many real-life sets of numerical data, the leading digit is likely to be small.
In sets that obey the law, the number 1 appears as the leading significant digit about 30% of the time, while 9 appears as the leading significant digit less than 5% of the time. Uniformly distributed digits would each occur about 11.1% of the time. Benford's law also makes predictions about the distribution of second digits, third digits, digit combinations, and so on.
Benford's law may be derived by assuming the dataset values are uniformly distributed on a logarithmic scale. The graph to the right shows Benford's law for base 10. Although a decimal base is most common, the result generalizes to any integer base greater than 2. Further generalizations published in 1995 included analogous statements for both the nth leading digit and the joint distribution of the leading n digits, the latter of which leads to a corollary wherein the significant digits are shown to be a statistically dependent quantity.
It has been shown that this result applies to a wide variety of data sets, including electricity bills, street addresses, stock prices, house prices, population numbers, death rates, lengths of rivers, and physical and mathematical constants. Like other general principles about natural data—for example, the fact that many data sets are well approximated by a normal distribution—there are illustrative examples and explanations that cover many of the cases where Benford's law applies, though there are many other cases where Benford's law applies that resist simple explanations. Benford's law tends to be most accurate when values are distributed across multiple orders of magnitude, especially if the process generating the numbers is described by a power law.
The law is named after physicist Frank Benford, who stated it in 1938 in an article titled "The Law of Anomalous Numbers", although it had been previously stated by Simon Newcomb in 1881.
The law is similar in concept, though not identical in distribution, to Zipf's law.
Definition
A set of numbers is said to satisfy Benford's law if the leading digit d (d ∈ {1, ..., 9}) occurs with probability

P(d) = log₁₀(d + 1) − log₁₀(d) = log₁₀(1 + 1/d).

The leading digits in such a set thus have the following distribution:
| d | P(d) |
| --- | ------ |
| 1 | 30.1% |
| 2 | 17.6% |
| 3 | 12.5% |
| 4 | 9.7% |
| 5 | 7.9% |
| 6 | 6.7% |
| 7 | 5.8% |
| 8 | 5.1% |
| 9 | 4.6% |
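The tabulated percentages follow directly from the first-digit formula; a minimal Python sketch (the function name is illustrative) that reproduces them:

```python
import math

def benford_pmf(d: int) -> float:
    """Probability that the leading significant digit equals d (base 10)."""
    return math.log10(1 + 1 / d)

# Reproduce the first-digit probabilities tabulated above.
for d in range(1, 10):
    print(f"{d}: {benford_pmf(d):.1%}")
```

The nine probabilities sum to exactly 1, since the intervals [log₁₀ d, log₁₀(d + 1)) tile [0, 1).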
The quantity P(d) is proportional to the space between d and d + 1 on a logarithmic scale. Therefore, this is the distribution expected if the logarithms of the numbers are uniformly and randomly distributed.
For example, a number x, constrained to lie between 1 and 10, starts with the digit 1 if 1 ≤ x < 2, and starts with the digit 9 if 9 ≤ x < 10. Therefore, x starts with the digit 1 if log 1 ≤ log x < log 2, or starts with 9 if log 9 ≤ log x < log 10. The interval [log 1, log 2] is much wider than the interval [log 9, log 10] (0.301 versus 0.046); therefore if log x is uniformly and randomly distributed, it is much more likely to fall into the wider interval than the narrower interval, i.e. more likely to start with 1 than with 9; the probabilities are proportional to the interval widths, giving the equation above.
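This geometric argument can be checked numerically: sampling numbers whose base-10 logarithm is uniform on [0, 1) should reproduce the Benford frequencies. A small simulation sketch (the sample size is an arbitrary choice):

```python
import math
import random

random.seed(0)

# Sample numbers between 1 and 10 whose base-10 logarithm is uniform on [0, 1),
# i.e. numbers uniformly distributed on a logarithmic scale.
samples = [10 ** random.random() for _ in range(100_000)]

# For x in [1, 10) the leading digit is simply int(x).
counts = [0] * 10
for x in samples:
    counts[int(x)] += 1

for d in range(1, 10):
    print(f"{d}: observed {counts[d] / len(samples):.3f}, Benford {math.log10(1 + 1 / d):.3f}")
```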
Benford's law is sometimes stated in a stronger form, asserting that the fractional part of the logarithm of data is typically close to uniformly distributed between 0 and 1; from this, the main claim about the distribution of first digits can be derived.
In other bases
An extension of Benford's law predicts the distribution of first digits in other bases besides decimal; in fact, any base. The general form is

P(d) = log_b(1 + 1/d), for d ∈ {1, ..., b − 1}.

For the binary and unary number systems (b = 2 and b = 1), Benford's law is true but trivial: all binary and unary numbers start with the digit 1.
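The base-b form can be sketched as follows (the function name is illustrative; log_b is computed via the change-of-base identity):

```python
import math

def benford_pmf_base(d: int, base: int) -> float:
    """P(first digit = d) under Benford's law in a given base (1 <= d < base)."""
    # log_b(1 + 1/d) written via the change-of-base identity.
    return math.log(1 + 1 / d) / math.log(base)

print(benford_pmf_base(1, 10))  # ≈ 0.3010, the familiar decimal value
print(benford_pmf_base(1, 2))   # 1.0: every binary number starts with 1
```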
Examples
Examining a list of the heights of the 58 tallest structures in the world by category shows that 1 is by far the most common leading digit, irrespective of the unit of measurement. Another example is the leading digit of 2^n: the sequence of the first 96 leading digits exhibits closer adherence to Benford's law than is expected for random sequences of the same length, because it is derived from a geometric sequence.
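The geometric-sequence case can be illustrated by tallying the leading digits of the powers of 2, a canonical geometric sequence, against the Benford expectation:

```python
import math
from collections import Counter

# Leading digits of 2^1, 2^2, ..., 2^96 (a geometric sequence).
leading = Counter(int(str(2 ** n)[0]) for n in range(1, 97))

for d in range(1, 10):
    expected = 96 * math.log10(1 + 1 / d)
    print(f"{d}: observed {leading[d]} times, Benford expects ~{expected:.1f}")
```

Because exact integer arithmetic is used, the tally is reproducible; digit 1 comes out as the most common leading digit, as Benford's law predicts.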
History
The discovery of Benford's law goes back to 1881, when the Canadian-American astronomer Simon Newcomb noticed that in logarithm tables the earlier pages were much more worn than the other pages. Newcomb's published result is the first known instance of this observation and includes a distribution on the second digit as well. Newcomb proposed a law that the probability of a single number N being the first digit of a number was equal to log(N + 1) − log(N).

The phenomenon was again noted in 1938 by the physicist Frank Benford, who tested it on data from 20 different domains and was credited for it. His data set included the surface areas of 335 rivers, the sizes of 3259 US populations, 104 physical constants, 1800 molecular weights, 5000 entries from a mathematical handbook, 308 numbers contained in an issue of Reader's Digest, the street addresses of the first 342 persons listed in American Men of Science and 418 death rates. The total number of observations used in the paper was 20,229. This discovery was later named after Benford.
In 1995, Ted Hill proved the result about mixed distributions mentioned below.
Explanations
Benford's law tends to apply most accurately to data that span several orders of magnitude. As a rule of thumb, the more orders of magnitude that the data evenly covers, the more accurately Benford's law applies. For instance, one can expect that Benford's law would apply to a list of numbers representing the populations of United Kingdom settlements. But if a "settlement" is defined as a village with population between 300 and 999, then Benford's law will not apply.

Consider the probability distributions shown below, referenced to a log scale. In each case, the total area in red is the relative probability that the first digit is 1, and the total area in blue is the relative probability that the first digit is 8. For the first distribution, the size of the areas of red and blue are approximately proportional to the widths of each red and blue bar. Therefore, the numbers drawn from this distribution will approximately follow Benford's law. On the other hand, for the second distribution, the ratio of the areas of red and blue is very different from the ratio of the widths of each red and blue bar. Rather, the relative areas of red and blue are determined more by the heights of the bars than the widths. Accordingly, the first digits in this distribution do not satisfy Benford's law at all.
Thus, real-world distributions that span several orders of magnitude rather uniformly are likely to satisfy Benford's law very accurately. On the other hand, a distribution mostly or entirely within one order of magnitude is unlikely to satisfy Benford's law very accurately, if at all. However, the difference between applicable and inapplicable regimes is not a sharp cut-off: as the distribution gets narrower, the deviations from Benford's law increase gradually.
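This contrast can be illustrated with a small simulation: a log-normal distribution with wide dispersion (spanning several orders of magnitude) should sit close to Benford's law, while a narrow one should not. The sample size and σ values below are illustrative choices:

```python
import math
import random

random.seed(1)

def first_digit(x: float) -> int:
    """Leading significant digit of a positive number, via the fractional part of log10(x)."""
    return int(10 ** (math.log10(x) % 1))

def benford_distance(samples) -> float:
    """Total variation distance between observed first-digit frequencies and Benford's law."""
    counts = [0] * 10
    for x in samples:
        counts[first_digit(x)] += 1
    n = len(samples)
    return 0.5 * sum(abs(counts[d] / n - math.log10(1 + 1 / d)) for d in range(1, 10))

n = 50_000
wide = [random.lognormvariate(0, 3) for _ in range(n)]      # spans many orders of magnitude
narrow = [random.lognormvariate(0, 0.1) for _ in range(n)]  # mostly within one order of magnitude

print(f"wide (sigma=3):     distance {benford_distance(wide):.3f}")
print(f"narrow (sigma=0.1): distance {benford_distance(narrow):.3f}")
```

The wide distribution lands very close to Benford's law, while the narrow one, whose values crowd around 1, deviates badly, mirroring the red/blue-area argument above.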
One possible explanation comes from the structure of positional notation used to write numbers.
Krieger–Kafri entropy explanation
In 1970 Wolfgang Krieger proved what is now called the Krieger generator theorem. The Krieger generator theorem might be viewed as a justification for the assumption in the Kafri ball-and-box model that, in a given base b with digits 0, 1, ..., n, ..., b − 1, digit n is equivalent to a Kafri box containing n non-interacting balls. Other scientists and statisticians have suggested entropy-related explanations for Benford's law.

Multiplicative fluctuations
Many real-world examples of Benford's law arise from multiplicative fluctuations. For example, if a stock price starts at $100, and then each day it gets multiplied by a randomly chosen factor between 0.99 and 1.01, then over an extended period the probability distribution of its price satisfies Benford's law with higher and higher accuracy.

The reason is that the logarithm of the stock price is undergoing a random walk, so over time its probability distribution will get more and more broad and smooth. To be sure of approximate agreement with Benford's law, the distribution has to be approximately invariant when scaled up by any factor up to 10; a log-normally distributed data set with wide dispersion would have this approximate property.
Unlike multiplicative fluctuations, additive fluctuations do not lead to Benford's law: They lead instead to normal probability distributions, which do not satisfy Benford's law. By contrast, that hypothetical stock price described above can be written as the product of many random variables, so is likely to follow Benford's law quite well.
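The multiplicative random-walk picture can be simulated directly over an ensemble of prices. A wider daily factor range (0.9–1.1) than the 0.99–1.01 quoted above is used here purely so the logarithm spreads out in fewer simulated days; the stock count and day count are likewise illustrative:

```python
import math
import random

random.seed(2)

# Ensemble of simulated prices: each starts at $100 and is multiplied each "day"
# by a random factor. A wider range (0.9-1.1) than the 0.99-1.01 in the text is
# used so the logarithm spreads out in fewer simulated days.
n_stocks, n_days = 2_000, 1_000
digit_counts = [0] * 10
for _ in range(n_stocks):
    log_price = math.log10(100.0)  # the log of the price performs a random walk
    for _ in range(n_days):
        log_price += math.log10(random.uniform(0.9, 1.1))
    # The leading digit depends only on the fractional part of log10(price).
    digit_counts[int(10 ** (log_price % 1))] += 1

for d in range(1, 10):
    print(f"{d}: observed {digit_counts[d] / n_stocks:.3f}, Benford {math.log10(1 + 1 / d):.3f}")
```

After enough multiplicative steps the fractional part of the log is nearly uniform, so the observed first-digit frequencies approach the Benford values; an additive walk on the price itself would not produce this behavior.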