We all dream of winning the lottery and the life-changing opportunities it would bring: the houses, cars and holidays. But how likely are you to win the lottery if every number drawn is random? And how do we know the draw really is random? Are there exploits where some numbers are more likely than others? Could we hack our way to the big time?
Probability of winning the lottery
The probability of winning the jackpot of a 6-ball lottery with 49 possible choices can be calculated as follows.
When the draw starts, any of our six numbers can match the first ball, so the probability of a match is 6 in 49
When the second ball is drawn there are 48 balls left in the machine and five of our numbers still to match, so the odds for the second ball are 5 in 48
The third is 4 in 47 and so on…
So the probability of winning the lottery is
6/49 * 5/48 * 4/47 * 3/46 * 2/45 * 1/44 which as a fraction is 720/10,068,347,520
or 1 in 13,983,816
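If you want to check the 1 in 13,983,816 figure yourself, here is a minimal Python sketch using only the standard library. The function name `jackpot_odds` is my own choice for illustration. It multiplies the chance of matching each ball in turn, then confirms the result against the number of ways to choose 6 numbers from 49.

```python
from fractions import Fraction
from math import comb


def jackpot_odds(pool: int = 49, picks: int = 6) -> Fraction:
    """Probability that all `picks` chosen numbers match the draw."""
    probability = Fraction(1)
    for drawn in range(picks):
        # (picks - drawn) of our numbers are still unmatched,
        # and (pool - drawn) balls are left in the machine.
        probability *= Fraction(picks - drawn, pool - drawn)
    return probability


odds = jackpot_odds()
print(odds)         # 1/13983816
print(comb(49, 6))  # 13983816, the number of possible tickets
```

Using `Fraction` keeps the arithmetic exact, so there is no floating point rounding to worry about.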
Not great. A horse with odds like this would have no legs! While it is not impossible, it doesn’t seem likely. What could improve these odds?
Buy More Tickets?
In a truly random lottery game, the only way to increase the probability of winning is to have more tickets. How does this work? Well… if we buy 10 tickets, each with a different combination, it’s simply 10/13,983,816 or roughly 1 in 1,398,382
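To see how slowly the odds improve, here is a small sketch of the same sum for a few ticket counts. It assumes every ticket carries a different combination, and the helper name `win_probability` is hypothetical.

```python
from fractions import Fraction

TOTAL_COMBINATIONS = 13_983_816  # possible 6-from-49 tickets


def win_probability(tickets: int) -> Fraction:
    """Chance of the jackpot when holding `tickets` distinct combinations."""
    return Fraction(tickets, TOTAL_COMBINATIONS)


for n in (1, 10, 100):
    p = win_probability(n)
    print(f"{n} tickets: roughly 1 in {round(1 / p):,}")
```

Even 100 tickets only gets us to about 1 in 140,000, which is why buying more tickets is not much of a strategy.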
No… if we are to win, we need to find an exploit – but how could we know if one existed?
Obtaining Data
The first thing we will need is a history of lottery draws. Thankfully, the Canadian 6/49 lottery has a history of all its draws from 1982 to June 2018 available online:
https://www.kaggle.com/datasets/datascienceai/lottery-dataset
The data is presented with one row per draw, for example:
DRAW DATE | NUMBER DRAWN 1 | NUMBER DRAWN 2 | NUMBER DRAWN 3 | NUMBER DRAWN 4 | NUMBER DRAWN 5 | NUMBER DRAWN 6 |
6/13/2018 | 6 | 22 | 24 | 31 | 32 | 34 |
6/16/2018 | 2 | 15 | 21 | 31 | 38 | 49 |
6/20/2018 | 14 | 24 | 31 | 35 | 37 | 48 |
What we want to look at is how often each individual ball is drawn out of the machine, to see if some balls are more likely than others to be drawn. To do that we will need to restructure the data.
Calculating Frequency
We have options for restructuring this. My usual approach would be a database and SQL, but we can also use Python/Pandas. One way is to run a grouping aggregation on each ball column, combine the results, and then sum the totals per ball.
At this point we have a table containing each ball's number and its total number of appearances in the data set. We can graph this using the matplotlib library.
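As a rough illustration of that restructuring, here is a minimal pandas sketch. It is not necessarily the code from the repository: it uses only the three sample rows shown earlier, and it assumes the column names in the CSV match the table header. For the real analysis you would load the full file instead of building the frame by hand.

```python
import pandas as pd

# Three sample rows copied from the table above; the real file has one row per draw.
draws = pd.DataFrame(
    {
        "DRAW DATE": ["6/13/2018", "6/16/2018", "6/20/2018"],
        "NUMBER DRAWN 1": [6, 2, 14],
        "NUMBER DRAWN 2": [22, 15, 24],
        "NUMBER DRAWN 3": [24, 21, 31],
        "NUMBER DRAWN 4": [31, 31, 35],
        "NUMBER DRAWN 5": [32, 38, 37],
        "NUMBER DRAWN 6": [34, 49, 48],
    }
)

ball_columns = [c for c in draws.columns if c.startswith("NUMBER DRAWN")]

# Count appearances within each ball column separately...
per_column = [draws[c].value_counts() for c in ball_columns]

# ...then stack the six results and sum the counts for each ball number.
frequency = pd.concat(per_column).groupby(level=0).sum().sort_index()
frequency.index.name = "ball"
frequency.name = "appearances"

print(frequency)
```

In this tiny sample ball 31 shows up in all three draws, so it gets a count of 3. Balls that never appear are simply absent from the result, which does not matter once the full history is loaded.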
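A bar chart is probably the simplest way to view the counts. The sketch below uses a handful of made-up counts purely to show the plotting calls, so substitute the real per-ball totals. The output file name is also just a placeholder.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line for an interactive window
import matplotlib.pyplot as plt

# Hypothetical counts for illustration only, not values from the dataset.
appearances = {1: 3, 2: 1, 3: 2, 4: 4, 5: 2}

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(list(appearances.keys()), list(appearances.values()))
ax.set_xlabel("Ball number")
ax.set_ylabel("Appearances")
ax.set_title("Appearances per ball")
fig.savefig("ball_frequency.png")
```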
Instantly this is not looking great for our genius lottery hack. If a draw was truly random, each ball would appear roughly the same number of times (a perfectly even probability), and this looks approximately like that. There is a slight glimmer of hope: No 31 appears more than the other numbers, with 499 appearances in the dataset. Could this ball be our lucky one and give us an edge? And is there any way of proving this?
Coin Tossing
Let’s step away from the Lottery for a second and consider a simpler example – the humble coin toss.
Imagine you are tossing a coin. There are two outcomes, heads and tails, and the probability of each is 50%. If you threw the coin 50 times you would expect the following:
Heads | Tails |
25 | 25 |
However, what if you got the following?
Heads | Tails |
49 | 1 |
You would instantly know something was funny about your coin (or possibly your throwing ability?!), because this is so far from expectations. However, what if you got this outcome?
Heads | Tails |
30 | 20 |
Is this difference enough to conclude that the coin is rigged? Or was the number of throws simply not big enough, so that the result is within an appropriate tolerance?
This is where a statistical test known as Chi squared can help answer the question.
What is the Chi Squared Test?
A Chi-squared test (goodness of fit) measures how far observed data departs from expected data, and whether that gap is bigger than chance alone would explain. In our coin toss example we are asking whether the 30 heads we obtained is plausible given the perfect expectation of 25.
BUT WE DON'T CARE ABOUT THE COIN – WE WANT TO WIN THE LOTTERY
Ok.. let's create some hypotheses:
H0 – (The Null Hypothesis) The lottery balls are fair
H1 – (The Alternative Hypothesis) The lottery balls are not fair
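The formula to calculate the chi squared statistic is:
$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$
Oh god… I can hear the sigh, but it's not as bad as you think. Let's reword this:
$$\chi^2 = \text{sum over every outcome of } \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
Well… that looks a bit better, but what is expected and observed?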
- Observed is our Actual outcomes
- Expected is our theoretical outcomes
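To make the formula concrete, here is a minimal sketch that applies it to the three coin toss results from earlier. The function name `chi_squared` is my own. The 3.841 threshold is the standard chi squared table value for one degree of freedom at the 5% level, which is the right row for a test with two outcomes.

```python
def chi_squared(observed, expected):
    """Sum of (observed - expected)^2 / expected over every category."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


expected = [25, 25]             # a fair coin over 50 tosses
CRITICAL_5_PERCENT_DF1 = 3.841  # chi squared table value, 1 degree of freedom

for observed in ([25, 25], [30, 20], [49, 1]):
    statistic = chi_squared(observed, expected)
    verdict = "suspicious" if statistic > CRITICAL_5_PERCENT_DF1 else "plausibly fair"
    print(observed, round(statistic, 2), verdict)
```

The 30/20 split scores 2.0, which is under the threshold, so on its own it is not enough to call the coin rigged. The 49/1 split scores 46.08, which is far over it.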
We know what the observed outcomes are – that’s the actual appearances of the lottery balls, we have that from our code. But what are the expectations?
Calculating the expectations
In the same way we knew a coin should be heads 50% of the time – we can state how often a lottery ball should be drawn if we know the number of draws that have taken place. The table below shows the calculations.
Measure | Value | Description |
No of Draws | 3,665 | Number of Draws in our Dataset |
No of Balls | 49 | Number of Distinct Balls in the Population |
Balls Per Draw | 6 | Number of Balls per lottery Draw |
Total Balls Drawn | 21,990 | Balls Per Draw * no of Draws |
Total Expected Drawings Per Ball | 449 (rounded) | Total Balls Drawn / Number of Distinct Balls in the Population |
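The same calculation in Python, as a quick sanity check of the table. The variable names are mine and the input figures are the ones listed above.

```python
draws = 3_665        # draws in the dataset
balls_per_draw = 6   # balls per lottery draw
distinct_balls = 49  # distinct balls in the population

total_balls_drawn = draws * balls_per_draw
expected_per_ball = total_balls_drawn / distinct_balls

print(total_balls_drawn)            # 21990
print(round(expected_per_ball, 2))  # 448.78
```

The expectation is not a whole number, which is fine for the test. It is only rounded to 449 in the table for readability.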
Let’s get some values into Excel. We can make a table with our expectations vs observations, one row per ball,
and continue this to ball 49.
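In this test we have 48 degrees of freedom. For a goodness of fit test this is simply the number of categories minus one, here 49 balls - 1 (please google for a more detailed explanation). Using a significance level of 5% (a confidence level of 95%), we can then look up the critical value in a chi squared table (Chi Square Table & Chi Square Calculator), which for 48 degrees of freedom is approximately 65.17. If our calculated chi squared value is above this threshold we reject H0; if it is below, we have no evidence that the balls are unfair.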
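If you would rather not read the threshold off a printed table, scipy can produce the same number. This is a small sketch that assumes scipy is installed. The helper `looks_unfair` is a hypothetical name for the decision rule, not something from the repository.

```python
from scipy.stats import chi2

degrees_of_freedom = 49 - 1  # number of balls minus one
significance_level = 0.05

# Value that a fair lottery would exceed only 5% of the time.
critical_value = chi2.ppf(1 - significance_level, degrees_of_freedom)
print(round(critical_value, 2))  # 65.17


def looks_unfair(statistic: float) -> bool:
    """True when the chi squared statistic is beyond the 5% threshold."""
    return statistic > critical_value
```

You would pass in the chi squared total from the expectations vs observations table. Anything above the threshold counts as evidence against H0.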
Conclusion
Sadly, it looks like we can’t exploit the Canadian 6/49 lottery. Back to work for us all. However, this was a nice example of how to use some data manipulation and statistical analysis. Obviously, I’m aware of some of the flaws in this example. It’s very likely the lottery company changed the sets of balls and machines over time, meaning that we aren’t testing a level playing field. For example, ball no 31 in 2018 is going to be a separate entity to ball no 31 in 1990.
I suspect another question would be: why did I use Excel and Chi Squared tables, isn’t that a bit old school? This was simply to explain what is going on, so it is more easily understood. In the GitHub repository I have shown how to use the scipy.stats library in Python to run this test. Its output looks different because the function reports a p-value rather than the chi table value, but I’ve put a comment in the code explaining how to use this. A good example is provided by Statology.
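For a flavour of the scipy route, here is a minimal sketch using `scipy.stats.chisquare`. It is not the repository code, and the counts are simulated from a fair lottery rather than taken from the real dataset, so the printed numbers say nothing about the Canadian 6/49.

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(seed=649)

# Simulated stand-in for the real counts: 3,665 fair draws of 6 balls from 49.
counts = np.zeros(49, dtype=int)
for _ in range(3_665):
    drawn = rng.choice(49, size=6, replace=False)  # indices 0..48 map to balls 1..49
    counts[drawn] += 1

# With no f_exp given, chisquare assumes every category is equally likely.
statistic, p_value = chisquare(counts)
print(f"chi squared = {statistic:.2f}, p = {p_value:.3f}")

if p_value < 0.05:
    print("Reject H0: the counts look uneven")
else:
    print("No evidence against H0 at the 5% level")
```

Comparing the p-value to 0.05 is the same decision as comparing the statistic to the table value, just read from the other direction.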
The data and code used is in the following repo: 649Lottery (github.com)