Introduction
It's cold, very cold. There's little better to do in early January than sit at home and watch TV – except no one really watches TV live now (sport aside); we all use streaming services such as Netflix or Amazon Prime. Last night I found myself wondering: how does the recommendation system in these services work? My recommendations are obviously tailored somewhat to me and my tastes. It seemed like an ideal topic to delve into and look at possible strategies.
The Movies Dataset
The data I am going to use in this blog is a dataset containing users' reviews of movies. The data is available here: The Movies Dataset (kaggle.com)
A Basic Recommender – Select Popular Movies
A basic starting point for recommending movies is to have some code that filters a population of movies to only those which score highly in reviews. However, that in itself might not be as simple as you think – for example, is a movie with a 10 out of 10 rating but only one review actually any good?
Luckily IMDB has a formula for handling this exact situation – the weighted rating, WR = (v / (v + m)) × R + (m / (v + m)) × C, where:
- v is the number of votes the movie has
- m is the minimum votes required for a movie to be selected (a setting we can decide)
- R is the average rating of the movie
- C is the average rating across the whole population
Let's write a bit of code to apply this to the dataset.
First, let's import the metadata file from the dataset and select three columns from it;
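A sketch of that import step, assuming the Kaggle file name `movies_metadata.csv`; the small inline frame stands in for the real file so the snippet runs on its own:

```python
import pandas as pd

# In the real notebook the metadata comes from the Kaggle file:
# metadata = pd.read_csv("movies_metadata.csv", low_memory=False)

# Inline stand-in with the same three columns, so the snippet is self-contained:
metadata = pd.DataFrame({
    "original_title": ["Toy Story", "Jumanji", "Grumpier Old Men", "Waiting to Exhale"],
    "vote_average": [7.7, 6.9, 6.5, 6.1],
    "vote_count": [5415, 2413, 92, 34],
})

# Select just the three columns we need
movies = metadata[["original_title", "vote_average", "vote_count"]]
print(movies.head())
```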
This dataset is kind enough to have done a bit of work for us already: for each movie there is a vote average (the average score given per user) and a count of how many users have voted for this movie.
The dataframe on import looks like this;
original_title | vote_average | vote_count |
Toy Story | 7.7 | 5415 |
Jumanji | 6.9 | 2413 |
Grumpier Old Men | 6.5 | 92 |
Waiting to Exhale | 6.1 | 34 |
… | … | … |
You can see that Waiting to Exhale has only received 34 reviews; we might consider that number too small to warrant trusting its score, whereas Toy Story has received 5415 reviews – a large enough sample to trust the score.
The next line in the code calculates the C and m figures. m (the minimum number of votes to be “trusted”) is taken as the 85th percentile of vote counts, which in the data used is 82. So we will not be including Waiting to Exhale as a movie we can recommend.
We can filter our dataframe to only movies with greater than or equal to 82 reviews as follows;
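A sketch of both steps – computing C and m, then filtering – again on a small stand-in frame (on the full dataset the 85th percentile works out at 82):

```python
import pandas as pd

# Stand-in for the metadata loaded earlier
movies = pd.DataFrame({
    "original_title": ["Toy Story", "Jumanji", "Grumpier Old Men", "Waiting to Exhale"],
    "vote_average": [7.7, 6.9, 6.5, 6.1],
    "vote_count": [5415, 2413, 92, 34],
})

C = movies["vote_average"].mean()        # average rating across the whole population
m = movies["vote_count"].quantile(0.85)  # 85th-percentile vote count (82 on the full dataset)

# Keep only movies with enough votes to be trusted
qualified = movies[movies["vote_count"] >= m]
```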
The next task is to write a function to calculate the weighted rating. The final step is then to apply this function to our dataframe and look at which movies are recommended;
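The function and its application might look like this (a sketch; the inline frame again stands in for the full metadata file):

```python
import pandas as pd

movies = pd.DataFrame({
    "original_title": ["Toy Story", "Jumanji", "Grumpier Old Men", "Waiting to Exhale"],
    "vote_average": [7.7, 6.9, 6.5, 6.1],
    "vote_count": [5415, 2413, 92, 34],
})
C = movies["vote_average"].mean()
m = movies["vote_count"].quantile(0.85)

def weighted_rating(row, m=m, C=C):
    """IMDB weighted rating: (v / (v + m)) * R + (m / (v + m)) * C."""
    v, R = row["vote_count"], row["vote_average"]
    return (v / (v + m)) * R + (m / (v + m)) * C

qualified = movies[movies["vote_count"] >= m].copy()
qualified["score"] = qualified.apply(weighted_rating, axis=1)
top10 = qualified.sort_values("score", ascending=False).head(10)
print(top10)
```

Note how the score sits between the movie's own average R and the population average C – the fewer votes a movie has, the more it is pulled towards C.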
This gives us the following top 10 movie recommendations.
The output isn’t quite what I expected – but perhaps it does show the global nature of this data set – there are some movies I thought would be recommended (The Godfather, Shawshank) but also a number of foreign films that I must admit I have not seen (maybe I should watch them based on this!)
Conclusion on this approach
This is a very quick and easy way of making a recommendation – you are displaying what is popular given a fair number of reviews. I am sure all of these movies are good movies, but there are aspects of this recommendation approach which are limiting:
- Everyone will be recommended the same movies
- There is no accounting for personal taste (you can see there are a number of foreign language movies in this list – which some people may not like)
What we need is an approach that is more personalised. This is where a content based recommender could be used.
Content Based Recommender
We need a way of tailoring recommendations better. One approach would be to program something that works along the lines of “because you liked X you might also like Y”. To do this we will need a piece of code that suggests movies similar to ones someone liked. This approach is known as content-based filtering.
In our data we have a field that gives us an overview of a movie in a text string, some examples of this are shown below;
original_title | overview |
Toy Story | Led by Woody, Andy’s toys live happily in his room until Andy’s birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy’s heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences. |
GoldenEye | James Bond must unmask the mysterious head of the Janus Syndicate and prevent the leader from utilizing the GoldenEye weapons system to inflict devastating revenge on Britain. |
Ace Ventura: When Nature Calls | Summoned from an ashram in Tibet, Ace finds himself on a perilous journey into the jungles of Africa to find Shikaka, the missing sacred animal of the friendly Wachati tribe. He must accomplish this before the wedding of the Wachati’s Princess to the prince of the warrior Wachootoos. If Ace fails, the result will be a vicious tribal war. |
What I want to do is to use this overview to compare similarities across movies and have a function that receives a movie title and returns a list of movies that are deemed similar to it.
First, let's make a dataframe containing the title and overview of each movie;
I am going to use a Cosine Similarity method to produce a numeric quantity denoting the similarity between the descriptions of movies.
Secondly, for performance reasons (I'm running this on an old laptop) I am going to significantly reduce the size of the dataset to only 5000 records. This will obviously affect my results, as I am omitting a lot of movies!
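Both steps might look like this – a sketch assuming the column names `original_title` and `overview` from the metadata file, with an inline frame standing in for the real data:

```python
import pandas as pd

# In the real notebook:
# metadata = pd.read_csv("movies_metadata.csv", low_memory=False)
# overviews = metadata[["original_title", "overview"]].dropna()
# overviews = overviews.head(5000).reset_index(drop=True)  # cap for an old laptop

# Inline stand-in:
overviews = pd.DataFrame({
    "original_title": ["Toy Story", "GoldenEye", "Ace Ventura: When Nature Calls"],
    "overview": [
        "Led by Woody, Andy's toys live happily in his room ...",
        "James Bond must unmask the mysterious head of the Janus Syndicate ...",
        "Ace finds himself on a perilous journey into the jungles of Africa ...",
    ],
})
```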
Cosine similarity is a formula used to evaluate the similarity of objects, and it is commonly used in data mining to evaluate the similarity of documents. It works by representing each object as a vector of values and looking at the angle between the vectors: the closer the angle, the more similar the objects are deemed. For non-negative data such as ours, the output, cos(θ) = (A · B) / (‖A‖ ‖B‖), is bounded from 0 to 1 – the closer to 1, the more similar.
Luckily for us, this cosine similarity approach is frequently used in data analysis and machine learning, and pre-built methods are available in most frameworks. I will use the sklearn library (scikit-learn) for this example.
First I want to convert the overview text of the movies into numbers which can easily be understood by a machine learning algorithm. The TF-IDF approach is a good fit for this.
TF-IDF stands for Term Frequency–Inverse Document Frequency. It is a very common algorithm for transforming text into numbers, which can then be used to fit a machine learning algorithm for prediction. Let's see how this works with a matrix.
Imagine we have two text values we want to compare:
Text1 | The sun in the sky is bright |
Text2 | We can see the shining sun, the bright sun |
We can use a TF-IDF object to produce a matrix of keywords with a score for each. With stop words removed, a matrix along the following lines could be obtained from these two pieces of text;
 | bright | shining | sky | sun |
Text1 | 0.50 | 0 | 0.70 | 0.50 |
Text2 | 0.38 | 0.53 | 0 | 0.76 |
We can declare this object as follows;
Here we are declaring a TF-IDF vectorizer – removing stop words (“and”, “a”, “the”, etc.) – which turns every overview in our dataframe into a vector of numeric values ready for comparison.
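Declared against our overview column, that might look like the following (a sketch over the small stand-in frame used earlier):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for the real title/overview dataframe
overviews = pd.DataFrame({
    "original_title": ["Toy Story", "GoldenEye", "Ace Ventura: When Nature Calls"],
    "overview": [
        "Led by Woody, Andy's toys live happily in his room ...",
        "James Bond must unmask the mysterious head of the Janus Syndicate ...",
        "Ace finds himself on a perilous journey into the jungles of Africa ...",
    ],
})

# Remove English stop words and turn each overview into a TF-IDF vector
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(overviews["overview"])
```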
We can now perform a cosine comparison of these values against each other. I also create a transition table to give me the title of a movie for its index location in this dataframe;
At this point I have everything I need to write a recommendation function – I have the cosine scores for each item and a way of connecting the movie title to each item under analysis. The final step is to write a function that recommends movies similar to the one I pass as a parameter;
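A sketch of those last steps together – the cosine scores, the title-to-index lookup table, and the recommendation function (the function and variable names here are my own):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

overviews = pd.DataFrame({
    "original_title": ["Toy Story", "GoldenEye", "Ace Ventura: When Nature Calls"],
    "overview": [
        "Led by Woody, Andy's toys live happily in his room ...",
        "James Bond must unmask the mysterious head of the Janus Syndicate ...",
        "Ace finds himself on a perilous journey into the jungles of Africa ...",
    ],
})
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(overviews["overview"])

# Pairwise cosine similarity between every pair of overviews
cosine_sim = cosine_similarity(tfidf_matrix)

# Transition table: movie title -> row index in the similarity matrix
indices = pd.Series(overviews.index, index=overviews["original_title"])

def recommend(title, n=10):
    idx = indices[title]
    # Rank every movie by similarity to the requested one, dropping the movie itself
    scores = sorted(enumerate(cosine_sim[idx]), key=lambda pair: pair[1], reverse=True)
    top = [i for i, _ in scores if i != idx][:n]
    return overviews["original_title"].iloc[top]
```

On the full 5000-movie frame, the same call pattern – e.g. `recommend("Goodfellas")` – produces lists like the one discussed next.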
I’m a fan of Mafia-type movies – so let's see what movies it recommends if I say I like Goodfellas;
which returns;
Rather unsurprisingly, the two Godfather movies are deemed most similar and are recommended to me.
Improving our model
This approach has worked up to a point – the two Godfather movies being recommended makes sense, and Made is a movie with mafia elements – but Soft Fruit is a comedy; how did this get picked up?
Well, Soft Fruit's description leans heavily on “family” and “prison”, so while the text keywords in the movie overviews are the same, the actual content of the two movies is quite different.
I’ve also got to ask myself – are these movies any good? I could be recommending some awful films here simply because they are similar to ones I've liked.
One way to counter this is to make a hybrid of the simple approach and this one, forcing recommendations to only be given for movies of a certain quality – however, I'm not too sure that's what a user would actually want. If you're really into Van Damme movies, you probably want more crap movies recommended to you rather than filtering to only those the public deems high quality.
I think the best way to improve this model would be not to use the whole overview of the movie; this is likely throwing in a lot of false positives – we have seen examples of this, where “family” could refer to a wholesome family movie or the mafia “family”.
We could also add metadata into the comparison, such as comparing directors of movies, actors and keywords relating to the movie (rather than the whole description).
For example, we could create a “word soup” to be compared in the vectorizer: a concatenation of all these values;
Original_Title | Main Actors | Director | Keywords |
Avatar | [Sam Worthington, Zoe Saldana, Sigourney Weaver] | James Cameron | [culture clash, future, space war] |
Spectre | [Daniel Craig, Christoph Waltz, Léa Seydoux] | Sam Mendes | [spy, based on novel, James bond, secret agent] |
By taking this approach we would recommend movies with similar actors, movies by the same directors, and movies based on a much more focused set of keywords (reducing false positives). I believe this would give a much better quality of recommendations, and it really shows how data preparation and standardization are paramount to predictive analytics.
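A sketch of building such a “word soup” column – the lower-casing and space-stripping are my own choices, so that, say, “Sam Mendes” and “Sam Worthington” don't partially match on “sam”:

```python
import pandas as pd

movies = pd.DataFrame({
    "original_title": ["Avatar", "Spectre"],
    "actors": [
        ["Sam Worthington", "Zoe Saldana", "Sigourney Weaver"],
        ["Daniel Craig", "Christoph Waltz", "Léa Seydoux"],
    ],
    "director": ["James Cameron", "Sam Mendes"],
    "keywords": [
        ["culture clash", "future", "space war"],
        ["spy", "based on novel", "james bond", "secret agent"],
    ],
})

def make_soup(row):
    # Collapse each name/keyword to a single lower-case token
    tokens = [a.replace(" ", "").lower() for a in row["actors"]]
    tokens.append(row["director"].replace(" ", "").lower())
    tokens += [k.replace(" ", "").lower() for k in row["keywords"]]
    return " ".join(tokens)

movies["soup"] = movies.apply(make_soup, axis=1)
print(movies["soup"].iloc[0])
```

The `soup` column can then be fed to the same vectorizer-plus-cosine pipeline in place of the overview text.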
Collaborative Filtering
The code so far has one key limitation – it can only choose movies that are close to a given movie (it recommends one from another) – because of this it can't easily capture the tastes of a user across genres.
One way we can try to solve this is by using a latent factor model to capture the similarity between users and items. Using another data science library, we can use the SVD (singular value decomposition) algorithm to map each user and item into a latent space. By mapping values like this we can try to understand the relationship between users and items, as they become directly comparable.
For example, we might be able to ascertain that a user's reviews suggest they like serious movies geared towards males. Without knowing this directly at the start of the calculation, the correlations in the latent space will surface other movies of this nature.
The maths behind this is vastly beyond my ability to explain – the Wikipedia page for SVD is linked here (Singular value decomposition – Wikipedia) – but luckily for me there are libraries that allow me to ignorantly run this model. I will be using the surprise library (Surprise · A Python scikit for recommender systems. (surpriselib.com))
First, let's import the ratings (here we are using the ratings_small file, as the full file is too computationally expensive for my laptop!)
This file contains ratings 1-5 (inclusive) for movies from thousands of users.
After this we need to set up a series of objects from the surprise library;
We have now built and trained our model on the data we have, and are in a position to predict a review score for a user.
User 1 has reviewed some movies; these are shown below;
User_id | Movie Title | Rating (1-5) |
1 | Rocky III | 2.5 |
1 | Greed | 1 |
1 | American Pie | 4 |
1 | My Tutor | 2 |
1 | Jay and Silent Bob Strike Back | 2 |
1 | Vivement dimanche! | 2.5 |
We can now use our model to make predictions – for example, if we want to see a forecast of what score user 1 would give movie ID 8844 (Jumanji), we can run the following:
Which returns a predicted review score of 2.69
Conclusion
This post shows three different ways of making movie recommendations for a user (in increasing complexity). In reality a model might well be a hybrid taking aspects from each of these: removing movies that are unpopular, looking at those which are similar to ones you like, and forecasting based on other people's reviews.
This post showed a couple of machine learning approaches to this problem, both content-based and collaborative filtering. This was an interesting challenge for me, well outside of my usual comfort zone, and I hope to post more on machine learning topics in the future.