Benford’s Law

Introduction

Let’s speculate on a distribution involving the population of cities. We will look at only the first digit of the population size. For example, Greater London has about 9 million people living in it, so its first digit is 9.

If we took the first digit of the population of each capital city, how would we expect those digits to be distributed? We know cities vary in size and the possible leading digits are 1-9, so surely the distribution would be fairly even across each leading digit.

Shouldn’t we expect each digit to account for about 11% (one ninth) of the cities, as shown below?

This would mean that 11% of the cities would have a population starting with the number 1, 11% with the number 2 and so on, so the leading digits would be evenly distributed across cities.

That is our expectation. Let’s see if it is met.

Example – Capital Cities

We can obtain data showing the population of each capital city from Wikipedia (List of national capitals by population – Wikipedia). If we save it to a file we can perform some basic analysis in Python.
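A minimal sketch of that analysis, using only the standard library, might look like the following. The function names, the file name capitals.csv and the small inline sample are my own assumptions for illustration; only the London, Beijing and King Edward Point figures come from this post.

```python
from collections import Counter


def leading_digit(value):
    """Return the first non-zero digit of a positive number."""
    for ch in str(value):
        if ch in "123456789":
            return int(ch)
    raise ValueError(f"no leading digit in {value!r}")


def leading_digit_table(values):
    """Return rows of (digit, count, percent of total) for digits 1-9."""
    counts = Counter(leading_digit(v) for v in values)
    total = sum(counts.values())
    return [(d, counts[d], 100 * counts[d] / total) for d in range(1, 10)]


# Hypothetical sample; in practice the populations would be read from the
# saved file, e.g. with csv.DictReader on "capitals.csv" (an assumed name).
sample_populations = [21_500_000, 9_000_000, 1_300_000, 22, 2_100_000, 140_000]

for digit, count, percent in leading_digit_table(sample_populations):
    print(f"{digit}  {count:>3}  {percent:5.1f}%")
```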

What we are doing here is producing a table that shows for each leading digit the count of cities and percentage of the total number of cities that have a population starting with that leading digit. The table produced is shown below.

Leading Digit    No. of Cities    % of Capital Cities
1                76               32%
2                55               23%
3                23               10%
4                16               7%
5                17               7%
6                14               6%
7                13               5%
8                11               5%
9                15               6%

We can also plot this as a graph with a few lines of code.
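As a dependency-free stand-in for a proper plot, this sketch draws a text bar chart from the counts in the table above. A plotting library such as matplotlib would be the more usual choice; the function name and bar width here are just illustrative assumptions.

```python
# Counts of capital cities per leading digit, taken from the table above.
city_counts = {1: 76, 2: 55, 3: 23, 4: 16, 5: 17, 6: 14, 7: 13, 8: 11, 9: 15}


def text_bar_chart(counts, width=40):
    """Return one line per digit with a bar scaled to the largest count."""
    largest = max(counts.values())
    total = sum(counts.values())
    lines = []
    for digit in sorted(counts):
        bar = "#" * round(width * counts[digit] / largest)
        share = 100 * counts[digit] / total
        lines.append(f"{digit} | {bar} {share:.0f}%")
    return lines


print("\n".join(text_bar_chart(city_counts)))
```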

The resulting graph is as follows.

What? Why?

This is very different from our original prediction. Initially you might think all the capital cities of the world are of a similar size, which would skew this example, but the data ranges from King Edward Point (population 22) to Beijing (population 21.5m). Perhaps this is just an odd example.

Let’s map another distribution of leading digits to see what happens.

Example 2 – YouTube Views

Perhaps capital cities aren’t a good example, so let’s look at something totally different: the number of views YouTube videos receive from the UK. Using the following dataset (https://www.kaggle.com/datasets/datasnaek/youtube-new) we can obtain figures on this for 38,917 videos.

Again we can use very similar code, and it produces a similar-looking graph.

This suggests our initial theory about leading digits of numbers being evenly distributed is totally wrong, but why?

What Is Going on Here? – Benford’s Law

This phenomenon is known as Benford’s Law. Its discovery goes back to 1881, when the Canadian-American astronomer Simon Newcomb noticed that in logarithm tables the earlier pages (those starting with 1) were much more worn than the other pages.

However, the law is named after the physicist Frank Benford, who tested it on data from 20 different domains and was credited with it. His data set included the surface areas of 335 rivers, the sizes of 3259 US populations, 104 physical constants, 1800 molecular weights, 5000 entries from a mathematical handbook, 308 numbers contained in an issue of Reader’s Digest, the street addresses of the first 342 persons listed in American Men of Science and 418 death rates. The total number of observations used in the paper was 20,229.

Benford established a theory that applies to leading digits of data that meets the following conditions;

  • Data is measured rather than assigned (it wouldn’t apply to staff IDs, phone numbers, postcodes etc.)
  • Ranges over several orders of magnitude (tens, hundreds, thousands, tens of thousands etc.)
  • Not artificially restricted by minimums or maximums (heights, weights and ages wouldn’t apply)
  • Mixed populations can apply (data pulled from different sources)
  • Larger datasets are better

His theory predicts that the proportion of values with a given leading digit can be calculated as

P(d) = log10(1 + 1/d)

where d is the leading digit. So, for example, if we wanted to know the percentage of values we expect to have a leading digit of 3, we would calculate

log10(1 + 1/3) = log10(1.33333), which equals approximately 0.125 or 12.5%. Using this you can calculate an expected distribution for leading digits 1-9.

Leading Digit    Expected Benford’s %
1                30.1%
2                17.6%
3                12.5%
4                9.7%
5                7.9%
6                6.7%
7                5.8%
8                5.1%
9                4.6%
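The expected percentages in this table can be reproduced in a few lines of Python. This is only a sketch: the function name benford_expected is my own, and it assumes the formula log10(1 + 1/d) implied by the worked example for a leading digit of 3.

```python
import math


def benford_expected(digit):
    """Expected share of values whose leading digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)


for d in range(1, 10):
    print(f"{d}  {100 * benford_expected(d):.1f}%")
```

The nine shares sum to 1, because the terms log10((d + 1) / d) telescope to log10(10).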

If we add this expected distribution to the graph of YouTube views, we can compare the two directly.

This leads to the following graph.

You can see the prediction is near perfect, which is quite incredible really.

Why does this happen?

There are several theories behind this phenomenon, and some get very complex. I have highlighted a couple.

The Rarity of Large Items/Orders of Magnitude

I can remember my statistics teacher at high school stating “There are more small things in the universe than there are large ones” when discussing Benford’s law. Indeed this does explain it for many distributions. Let’s take house numbers on a street, for example: there will always be a house number 1, often a house 10, sometimes a house 90 or 100, but after this point house number 900 is the next time you will see a leading digit of 9. This will not happen that frequently.

We can take this example and make it more general by considering orders of magnitude. As you count, every time you reach a new order of magnitude (10, 100, 1,000, … 1,000,000) you encounter a block of numbers starting with 1 that is as large as the entire block of numbers you have already counted through. In other words, to get past the numbers starting with 1 you have to double the count so far. Then you get to the twos, and the block of 2s is 1/2 the size of the count so far. The 3s are 1/3 the size, the 4s 1/4 the size, and eventually the 9s are 1/9 the size of the count so far. So to get past the 9s you only have to increase the count by 11%, and then you are back to a block of 1s that requires a 100% increase to get past.
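The block sizes described above can be checked numerically. In this sketch the function name and the choice of the 1,000 to 9,999 range are illustrative assumptions; it compares each block of leading digits with the count of numbers that came before it.

```python
def block_growth(power=3):
    """For each leading digit d, compare the block of numbers starting with d
    in [10**power, 10**(power + 1)) with the count of numbers before it."""
    base = 10 ** power
    rows = []
    for d in range(1, 10):
        counted_so_far = d * base - 1  # the integers 1 .. d*base - 1
        block_size = base              # the integers d*base .. (d + 1)*base - 1
        rows.append((d, block_size / counted_so_far))
    return rows


for d, growth in block_growth():
    print(f"passing the {d}s grows the count by {100 * growth:.0f}%")
```

The ratio comes out very close to 1/d for every digit, which is the pattern the counting argument relies on.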

Multiplicative fluctuations

Many real-world examples of Benford’s law arise from multiplicative fluctuations. For example, if a stock price starts at $100, and then each day it gets multiplied by a randomly chosen factor between 0.99 and 1.01, then over an extended period the probability distribution of its price satisfies Benford’s law with higher and higher accuracy.

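The reason is that the logarithm of the stock price is undergoing a random walk, so over time its probability distribution gets broader and smoother. A broad distribution can be graphed as follows,

where the red sections signify the leading digit 1 and the blue sections show leading digit 9. On a logarithmic axis each red band covers log10(2) - log10(1), or about 0.301 of a decade, while each blue band covers only log10(10) - log10(9), or about 0.046. A broad, smooth distribution therefore puts roughly 30% of its area in the red bands and under 5% in the blue ones.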
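A rough simulation of this idea is sketched below. To keep the run short it uses a wider daily factor (0.9 to 1.1) than the 0.99 to 1.01 in the text, and looks at the final price of 2,000 independent paths of 1,000 steps each. Those numbers, the fixed seed and the function names are all assumptions made for illustration.

```python
import random
from collections import Counter


def first_digit(x):
    """Leading digit of a positive float, read from scientific notation."""
    return int(f"{x:e}"[0])


def simulate_final_digits(paths=2000, steps=1000, low=0.9, high=1.1, seed=1):
    """Leading-digit shares of the final price of many multiplicative walks."""
    rng = random.Random(seed)
    digits = Counter()
    for _ in range(paths):
        price = 100.0
        for _ in range(steps):
            price *= rng.uniform(low, high)  # one random multiplicative step
        digits[first_digit(price)] += 1
    return {d: digits[d] / paths for d in range(1, 10)}


shares = simulate_final_digits()
for d in range(1, 10):
    print(f"{d}  {100 * shares[d]:.1f}%")
```

With the narrower 0.99 to 1.01 factor the same thing should happen, but far more steps are needed before the prices spread across several orders of magnitude.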

Testing this yourself – Pick up a magazine

Benford’s law applies to all kinds of datasets from the natural world (tree sizes, populations, river lengths, volumes of lakes, street numbers). It also doesn’t matter what units you use to measure these values (miles, km, feet); the same rule applies.
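This independence from units is easy to try out. The sketch below uses synthetic data, not real measurements: values spread evenly on a log scale across six orders of magnitude, an assumption chosen because such data follows Benford’s law closely. It converts them from miles to kilometres to show that the leading-digit shares barely move.

```python
import random
from collections import Counter


def digit_shares(values):
    """Share of values starting with each digit 1-9."""
    counts = Counter(int(f"{v:e}"[0]) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}


rng = random.Random(0)
# Synthetic "distances in miles": log-uniform between 1 and 1,000,000.
miles = [10 ** rng.uniform(0, 6) for _ in range(20_000)]
kilometres = [m * 1.609344 for m in miles]

shares_miles = digit_shares(miles)
shares_km = digit_shares(kilometres)
for d in range(1, 10):
    print(f"{d}  miles {100 * shares_miles[d]:5.1f}%   km {100 * shares_km[d]:5.1f}%")
```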

It isn’t just things in the natural world that fit this distribution. Other examples include stock prices, financial statements, transactions, numbers used in the written word and, as demonstrated, even YouTube views.

Try it out yourself if you want: get a newspaper or magazine and write down the first digit of every number you see in a page or two. I’d be very surprised if the frequency of the leading digits wasn’t similar to the expected Benford’s values.

Ok, that is interesting, but is it useful?

So this is somewhat of a quirk of nature, but is it actually of any use? The answer is very much yes. As with all scientific methods, the value is that the law can be used to make predictions. We know that some types of data should follow this pattern, and if they don’t it suggests potential issues. Some common uses for Benford’s law are;

  1. Auditing financial records for erroneous transactions
  2. Analysing scientific data for fabrication
  3. Analysing election results for fraudulent voting patterns
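One common way to check a dataset against Benford’s law is a chi-square goodness-of-fit test. The sketch below is an illustration rather than a complete audit procedure. It reuses the capital-city counts from the table earlier in this post, and the 5% critical value of 15.507 for 8 degrees of freedom is taken from standard chi-square tables.

```python
import math

# Capital-city counts per leading digit, from the table earlier in this post.
observed = {1: 76, 2: 55, 3: 23, 4: 16, 5: 17, 6: 14, 7: 13, 8: 11, 9: 15}

CRITICAL_5_PERCENT_DF8 = 15.507  # chi-square table value, 8 degrees of freedom


def benford_chi_square(counts):
    """Chi-square statistic of observed digit counts against Benford's law."""
    total = sum(counts.values())
    statistic = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        statistic += (counts[d] - expected) ** 2 / expected
    return statistic


statistic = benford_chi_square(observed)
verdict = "deviates from" if statistic > CRITICAL_5_PERCENT_DF8 else "is consistent with"
print(f"chi-square = {statistic:.2f}; the data {verdict} Benford's law at the 5% level")
```

A statistic above the critical value is only a prompt to look closer, not proof that anything is wrong.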

There are genuine examples of this law being used to highlight significant events. Below are a couple.

Greece Joining the Eurozone

There is a body of evidence to suggest the Greek Government manipulated economic data to gain eurozone membership. The full journal article highlighting this can be accessed at Fact and Fiction in EU‐Governmental Economic Data – Rauch – 2011

This paper analysed official statistics of the EU member states from the last eleven years by counting the first digits. They looked at 130 different values per country and year. Among other things, they looked at the total level of debt, the cash reserves of the government and the pensions of retired civil servants.

The result is straightforward. Judged against Benford’s Law, the data Greece produced deviates significantly from the expected distribution. This is a strong indication of creative accounting by the Greek government.

The Madoff Ponzi Scheme

A paper showing how Benford’s Law flags the Ponzi scheme run by Bernie Madoff as suspicious is available here (Madoff and Other Ponzi Schemes – Benford’s Law – Wiley Online Library)

This paper and several others perform hypothesis tests combined with Benford’s Law analysis on Madoff’s investment returns. The results provide statistically significant evidence that the data between December 1990 and December 1999 do not follow Benford’s Law. This demonstrates that if these statistical tests had been performed on Madoff’s data in early 2000, the deviation from Benford’s Law could have been detected nearly nine years prior to the scheme’s discovery.

Conclusion

Benford’s law is an intriguing, counterintuitive distribution. On initial evaluation of leading digits you would expect an even distribution, but unless bounds are put in place this is rarely the case in reality. The chief use of this law is to help identify red flags suggesting fraudulent activity, specifically where false data is used.

The idea behind why this works is straightforward. When people manipulate numbers, they don’t track the frequencies of their fake leading digits, producing an unnatural distribution of leading digits. In some cases, they might systematically adjust the leading digits to be below a particular threshold value. For example, if there is a £100,000 limit on a transaction type, fraudsters might start many numbers with a 9 for £99,000.
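A per-digit check can make such a spike visible. In this sketch the transaction amounts are entirely made up (200 log-uniform amounts plus 40 hypothetical entries just under a £100,000 limit), and flagging digits whose share exceeds the Benford expectation by more than three standard errors is just one simple rule of thumb, not an established audit standard.

```python
import math
import random
from collections import Counter


def flag_excess_digits(values, z_threshold=3.0):
    """Digits whose observed share exceeds Benford's by > z_threshold std errors."""
    n = len(values)
    counts = Counter(int(f"{v:e}"[0]) for v in values)
    flagged = []
    for d in range(1, 10):
        p = math.log10(1 + 1 / d)  # Benford's expected share for digit d
        z = (counts[d] / n - p) / math.sqrt(p * (1 - p) / n)
        if z > z_threshold:
            flagged.append(d)
    return flagged


rng = random.Random(42)
# Made-up data: ordinary amounts plus a cluster just below a £100,000 limit.
ordinary = [10 ** rng.uniform(2, 5) for _ in range(200)]
just_under_limit = [rng.uniform(95_000, 99_999) for _ in range(40)]

print(flag_excess_digits(ordinary + just_under_limit))
```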

It is worth stating that something not fitting Benford’s law doesn’t automatically mean fraud. I know there are hundreds of conspiracy theorists out there who will say “this election was rigged” and so on; a poor fit simply shows something unexpected that requires further investigation.

That being said, I highlighted two very large contemporary examples where Benford’s law would have highlighted significant discrepancies in data that almost certainly should have been acted upon. Given how quick and simple these calculations are to perform, I find it hard to imagine this analysis wasn’t performed. Most likely it was presented and ignored (this 100% was the case with Madoff): the maths highlighted fraudulent activity, but the people looking at the numbers didn’t seem to care.

Benford’s Law is used by audit teams, the Inland Revenue, transaction monitoring specialists, forensic accountants and statisticians the world over. What is quite charming is that all of this activity resulted from someone noticing that the early pages of their logarithm tables were more worn than the later ones!