Is the Lottery Fair?

We all dream of winning the lottery and the life-changing opportunities it would bring: the houses, cars and holidays. But how likely are you to win if every number drawn is random? And how do we know the draw really is random? Are there exploits where some numbers are more likely than others? Could we hack our way to the big time?

Probability of winning the lottery

The probability of winning the jackpot of a 6-ball lottery with 49 possible choices can be calculated as follows.

When the draw starts, any of our six numbers will do for the first ball, so the probability of a match is 6 in 49

When the second ball is to be drawn there are 48 balls left in the machine and 5 unmatched numbers on our ticket, so the odds of a second match are 5 in 48

The third is 4 in 47 and so on…

So the probability of winning the lottery is

6/49 * 5/48 * 4/47 * 3/46 * 2/45 * 1/44 which as a fraction is 720/10,068,347,520

or 1 in 13,983,816
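As a quick cross-check of the 1 in 13,983,816 figure, here is a minimal Python sketch. It multiplies the chance, ball by ball, that the next ball drawn is one of the numbers still unmatched on our ticket, and compares the result with the number of ways to choose 6 balls from 49. The variable names are only illustrative.

```python
from fractions import Fraction
from math import comb

# Ball by ball: (numbers still unmatched on our ticket) / (balls left in the machine)
p_sequential = Fraction(1)
for already_matched in range(6):
    p_sequential *= Fraction(6 - already_matched, 49 - already_matched)

# Shortcut: exactly one winning combination out of all ways to pick 6 from 49
total_combinations = comb(49, 6)

print(p_sequential)        # 1/13983816
print(total_combinations)  # 13983816
```

Both routes agree, which is reassuring: the order in which the balls come out does not matter, only which six are drawn.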

Not great. A horse with odds like this would have no legs! While it is not impossible, it doesn’t seem likely. What could improve these odds?

Buy More Tickets?

In a truly random lottery game, the only way to increase the probability of winning is to have more tickets. How does this work? Well… if we buy 10 tickets, each with a different combination, it’s simply 10/13,983,816, or roughly 1 in 1,398,382
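To see how slowly the odds improve, here is a small sketch that scales the ticket count. It assumes every ticket carries a different combination (duplicates would not help), and the helper name jackpot_odds is made up for this example.

```python
from math import comb

TOTAL_COMBINATIONS = comb(49, 6)  # 13,983,816 possible tickets


def jackpot_odds(tickets: int) -> float:
    """Return N in '1 in N' for a given number of distinct tickets."""
    return TOTAL_COMBINATIONS / tickets


for n in (1, 10, 100, 1000):
    print(f"{n:>5} tickets: 1 in {jackpot_odds(n):,.0f}")
```

Even a thousand tickets only gets us to about 1 in 14,000.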

No… if we are to win, we need to find an exploit – but how could we know if one existed?

Obtaining Data

The first thing we will need is a history of lottery draws. Thankfully, the Canadian 6/49 lottery has a history of all its draws from 1982 to June 2018 available online:

https://www.kaggle.com/datasets/datascienceai/lottery-dataset

The data is presented with one row per draw, for example:

DRAW DATE    NUMBER DRAWN 1    NUMBER DRAWN 2    NUMBER DRAWN 3    NUMBER DRAWN 4    NUMBER DRAWN 5    NUMBER DRAWN 6
6/13/2018    6                 22                24                31                32                34
6/16/2018    2                 15                21                31                38                49
6/20/2018    14                24                31                35                37                48

What we want to look at is how often each individual ball is drawn out of the machine, to see if some balls are more likely to be drawn than others. The data has the balls spread across six columns, so we will need to restructure it to get a single count per ball.

Calculating Frequency

We have options on how to restructure this. My usual approach would be a database and SQL, but we can also use Python/Pandas. One way is to run a grouping aggregation on each of the six number columns, combine the results together, and then sum the totals per ball.
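Here is one possible sketch of the restructuring in Pandas. Note that it swaps the per-column grouping described above for a single melt (unpivot) followed by a count, which gives the same totals. To keep it self-contained it uses only the three sample draws shown earlier; the real history would be loaded from the downloaded file instead, whose name is not given here.

```python
import pandas as pd

# The three sample draws from the post; the full history would come from
# pd.read_csv(...) on the downloaded file instead.
draws = pd.DataFrame(
    {
        "DRAW DATE": ["6/13/2018", "6/16/2018", "6/20/2018"],
        "NUMBER DRAWN 1": [6, 2, 14],
        "NUMBER DRAWN 2": [22, 15, 24],
        "NUMBER DRAWN 3": [24, 21, 31],
        "NUMBER DRAWN 4": [31, 31, 35],
        "NUMBER DRAWN 5": [32, 38, 37],
        "NUMBER DRAWN 6": [34, 49, 48],
    }
)

ball_columns = [f"NUMBER DRAWN {i}" for i in range(1, 7)]

# Unpivot the six ball columns into one long column: one row per ball drawn
long_format = draws.melt(
    id_vars="DRAW DATE", value_vars=ball_columns, value_name="ball"
)

# Count appearances per ball, keeping balls that never appeared (count 0)
frequency = long_format["ball"].value_counts().reindex(range(1, 50), fill_value=0)

print(frequency[frequency > 0])
```

On these three draws ball 31 already shows up three times, which is a nice reminder of how misleading a small sample can be.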

At this point we have a table containing each ball’s number and its total number of appearances in the dataset. We can graph this using the matplotlib library.
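A bar chart of appearances per ball could be produced along these lines. This is only a sketch: the counts are simulated fair draws used as a placeholder, not the real 6/49 history, and the output file name is made up.

```python
import random
from collections import Counter

import matplotlib

matplotlib.use("Agg")  # render off-screen; remove to use an interactive window
import matplotlib.pyplot as plt

# Placeholder data: simulated fair draws, NOT the real lottery history
rng = random.Random(649)
counts = Counter()
for _ in range(3665):
    counts.update(rng.sample(range(1, 50), 6))  # 6 distinct balls per draw

balls = list(range(1, 50))
appearances = [counts[ball] for ball in balls]

fig, ax = plt.subplots(figsize=(12, 4))
ax.bar(balls, appearances)
# Dashed line at the average, to make deviations easier to spot
ax.axhline(sum(appearances) / len(balls), color="red", linestyle="--", label="average")
ax.set_xlabel("Ball number")
ax.set_ylabel("Appearances")
ax.set_title("Appearances per ball (simulated draws)")
ax.legend()
fig.savefig("ball_frequency.png")
```

Swapping the simulated counts for the real per-ball totals gives the chart discussed next.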

Instantly this is not looking great for our genius lottery hack. If the draw is truly random, each ball should appear roughly the same number of times (a perfectly even probability), and that is approximately what we see. There is a slight glimmer of hope: No 31 appears more than the other numbers, with 499 appearances in the dataset. Could this ball be our lucky one and give us an edge? And is there any way of proving this?

Coin Tossing

Let’s step away from the Lottery for a second and consider a simpler example – the humble coin toss.

Imagine you are tossing a coin. There are 2 outcomes, heads and tails, and the probability of each is 50%. If you threw the coin 50 times you would expect the following:

Heads    Tails
25       25

However, what if you got the following?

Heads    Tails
49       1

You would instantly know something was funny about your coin (or possibly your throwing ability?!), as this is so far from expectations. However, what if you got this outcome?

Heads    Tails
30       20

Is this difference enough to conclude that the coin is rigged? Or is it just that the number of coin throws wasn’t big enough? Is this within an appropriate tolerance?

This is where we can use a statistical test known as Chi squared to help answer this question.

What is the Chi Squared Test?

A Chi-squared test (goodness of fit) measures how far the observed data is from the expected data, and whether that gap is bigger than chance alone would plausibly produce. In our coin toss example, we are asking whether the 30 heads we obtained is reasonable given the perfect expectation of 25.

BUT WE DON’T CARE ABOUT THE COIN – WE WANT TO WIN THE LOTTERY

Ok… let’s create some hypotheses:
H0 – (The Null Hypothesis) The lottery balls are fair
H1 – (The Alternative Hypothesis) The lottery balls are not fair

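The formula to calculate the chi squared is:

\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}

Oh god… I can hear the sigh, but it’s not as bad as you think. Let’s reword this:

Chi squared = the sum, over every ball, of (Observed - Expected)^2 / Expected

Well… that looks a bit better, but what is expected and observed?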

  • Observed is our actual outcomes
  • Expected is our theoretical outcomes
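To make observed vs expected concrete, here is a small sketch that applies the chi squared calculation to the coin tosses from earlier. The function name chi_squared is just a label for this example.

```python
def chi_squared(observed, expected):
    """Sum (observed - expected)^2 / expected over every category."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


fair_coin = [25, 25]  # expected heads and tails from 50 throws

print(chi_squared([30, 20], fair_coin))  # 2.0
print(chi_squared([49, 1], fair_coin))   # 46.08
```

For reference, the 95% critical value for a two-outcome test (1 degree of freedom) is 3.841. The 30/20 result sits below it, so it is not enough to call the coin rigged, while the 49/1 result is far above it.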

We know what the observed outcomes are – that’s the actual appearances of the lottery balls, we have that from our code. But what are the expectations?

Calculating the expectations

In the same way we knew a coin should be heads 50% of the time – we can state how often a lottery ball should be drawn if we know the number of draws that have taken place. The table below shows the calculations.

                                    Value            Description
 No of Draws                        3,665            Number of draws in our dataset
 No of Balls                        49               Number of distinct balls in the population
 Balls Per Draw                     6                Number of balls per lottery draw
 Total Balls Drawn                  21,990           Balls Per Draw * No of Draws
 Total Expected Drawings Per Ball   449 (rounded)    Total Balls Drawn / No of Balls
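The same arithmetic as a short Python sketch, using the figures from the table (the variable names are illustrative):

```python
draws = 3665          # number of draws in the dataset
balls_per_draw = 6    # balls drawn each time
distinct_balls = 49   # balls in the machine

total_balls_drawn = draws * balls_per_draw
expected_per_ball = total_balls_drawn / distinct_balls

print(total_balls_drawn)            # 21990
print(round(expected_per_ball, 2))  # 448.78
print(round(expected_per_ball))     # 449
```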

Let’s get some values into Excel. We can make a table of our expectations vs observations, with one row per ball.

From this we can apply the chi squared formula to each row.

We continue this down to ball 49 and sum the results.

Our Chi Squared value is 47.909 (using rounding)
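As an illustration of a single row of that calculation, here is ball 31 worked through in Python, using its 499 appearances and the rounded expectation of 449. The other 48 rows follow the same pattern; their counts are not reproduced here.

```python
expected = 449           # rounded expected appearances per ball
observed_ball_31 = 499   # appearances of ball 31 in the dataset

# One term of the chi squared sum: (Observed - Expected)^2 / Expected
contribution = (observed_ball_31 - expected) ** 2 / expected

print(round(contribution, 3))  # 5.568
```

So our "lucky" ball accounts for roughly 5.6 of the total of 47.909 on its own.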

In this test we have 48 degrees of freedom. A simple way of getting this for a goodness-of-fit test is the number of categories minus one, here 49 balls - 1, but please google for a more detailed explanation. Using a confidence level of 95%, we can then look up a value in a chi squared table (Chi Square Table & Chi Square Calculator).

Our critical value is 65.171
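If you would rather not use a printed table, the same lookup can be sketched with scipy.stats.chi2, whose ppf method returns the value below which a given share of the distribution falls:

```python
from scipy.stats import chi2

degrees_of_freedom = 49 - 1  # number of balls minus one
confidence_level = 0.95

# Value that a chi squared statistic exceeds only 5% of the time under H0
critical_value = chi2.ppf(confidence_level, degrees_of_freedom)

print(round(critical_value, 3))  # 65.171
```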

As our calculated value is less than the critical value, we cannot reject the null hypothesis.

Based on the evidence, we have no reason to think the lottery balls are anything other than fairly distributed. We fail to reject H0.

Conclusion

Sadly, it looks like we can’t exploit the Canadian 6/49 lottery, so it’s back to work for us all. However, this was a nice example of how to use some data manipulation and statistical analysis. Obviously, I’m aware of some of the flaws in this example. It’s very likely the lottery company changed the sets of balls and machines over time, meaning that we aren’t testing a level playing field. For example, ball no 31 in 2018 is going to be a separate entity to ball no 31 in 1990.

I suspect another question would be: why did I use Excel and Chi Squared tables, isn’t that a bit old school? This was simply to explain what is going on, so it is more easily understood. In the GitHub repository I have shown how to use the scipy.stats library in Python to run this test. It produces a different output, as the function used returns the p value rather than the chi table value, but I’ve put a comment in the code explaining how to use this. A good example is provided by Statology.
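For a flavour of the p value approach, here is a minimal sketch. It assumes scipy.stats.chisquare is a suitable function (the repository may use a different one), and the observed counts are simulated fair draws used as a placeholder, not the real history. The last lines convert the 47.909 statistic from this post into a p value.

```python
import random
from collections import Counter

from scipy.stats import chi2, chisquare

# Placeholder counts from simulated fair draws, NOT the real 6/49 history
rng = random.Random(649)
counts = Counter()
for _ in range(3665):
    counts.update(rng.sample(range(1, 50), 6))
observed = [counts[ball] for ball in range(1, 50)]

# With no expected counts supplied, chisquare assumes all categories are equally likely
statistic, p_value = chisquare(observed)

alpha = 0.05  # matches the 95% confidence level used above
verdict = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"chi squared = {statistic:.3f}, p = {p_value:.3f}: {verdict}")

# p value for the statistic calculated in this post (48 degrees of freedom)
p_post = chi2.sf(47.909, df=48)
print(f"p value for 47.909: {p_post:.3f}")
```

The rule is the mirror image of the table lookup: instead of checking whether the statistic exceeds the critical value, we check whether the p value falls below 0.05.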

The data and code used are in the following repo: 649Lottery (github.com)