Introduction
In the last post I discussed how certain recommendation algorithms work – with the examples being various ways in which you could code a movie recommendation. I thought what would be interesting is to take another approach to this kind of problem and demonstrate an approach for how a social network identifies people who could possibly be your friend.
It is quite a lonely weekend I am having! – I could do with finding more friends!!
Storing Graphs
When someone says the term network – you think of a kind of connection of entities, this could be a computer network, whereby several machines are connected together – or it could be a train network such as the London underground system. Regardless each network can be drawn out showing constituent elements
- Nodes
- Edges
Using the example of the London underground network – our nodes would be the stations and one of the edges could be the different lines that connect them to each other.
We could then draw out a graph type diagram showing the relationships between train stations as follows;
Here you can see the “graph” from the node of Liverpool street. This node connects to various other nodes via different train lines (these lines are the edges). This is how this data would be stored in a graph database.
Graph Databases
A graph database – is a means of storing data between nodes and edges, instead of a relational model. Think of storing connections between things showing inference rather than clearly defined enforced relationships (such as transactions).
A classic physical embodiment of a graph database is the detectives diagram (often in movies);
This diagram seeks to produce inferred connections between entities as opposed to a traditional database which shows clearly defined relationships between things.
Some common examples of the use of graph databases are;
- Fraud Detection – Unusual connections between entities
- Recommendation algorithms – users who bought similar items/watched similar movies etc..
- Social networks – showing friends/followers and their interactions
- Geospatial computing – modelling of locations, routes, transport networks
Advantages and Disadvantages of Graph Database
Graph databases are great at highlighting and traversing relationships between entities – they might allow operators to discover potential relations they were ignorant of. They also allow for a natural representation of data – networks in graph databases provide a more intuitive and efficient representation than their relational equivalents
The disadvantages of graph databases are the complexity of querying – the languages are fairly standardised (there is no SQL equivalent) and each software offering seems to have its own query language currently. However the potential biggest disadvantages of this approach to analysis in my opinion is that inferences are not actually fact. Imagine a graph database showing teachers in a university – if one professor taught several other professors and one of those professors had a class with terrible results – can this be attributed to the original professor? inference doesn’t necessarily mean fact and incorrect interpretations cna easily be created with this appraoch.
In general graph queries should be viewed as the starting point to answer questions – rather than answering questions themselves. This is a significant difference to say a relational model – where you are usually calculating facts.
Overall, the choice of using a graph database depends on the specific requirements of the application and the nature of the data, some tasks will be very well suited to this technique, others will not be. A graph database should be one component of a much wider data analysis platform.
Using Graph in Databricks
Right… Enough about all that, lets do some coding to show one way that graphs can be created and queried in Databricks.
Lets go back to our social network example and try to see how we can find me some friends!
Firstly I need to set up a ML Cluster in databricks (at the time of writing I believe an ML cluster has to be used rather than a general purpose one).
We then need to create our fictional social network, we will have nodes (the people in the network) and edges (their relationships with each other). We can create 2 dataframes containing the information stored in our network (names and ages of individuals and their relationships to each other)
In graphframes some fields are mandatory – in the nodes data the ID field is mandatory and acts as a unique identifier for each node. In the edges dataframe source (src) and destination (dst) are mandatory to show the connections.
We can then graph this data into a GraphFrame object;
So what does this mean? well it means that data is stored in this object showing the relationships between the people in our fictional network – we can query this similar to how we can query a dataframe. Some basic queries are shown below;
Who is the youngest person in our network
This can be shown by the following query;
and we can see the youngest age in the data is 17 years old (Anna)
Filtering the Graph
you can also filter the graph to make a sub-graph. This is demonstrated below;
The newly created graphFrame of filteredGraph shows only those records with a relationship of friend that are greater than 30 years of age. This filtering is particularly useful if you are dealing with complex networks to reduce the volume of data dealt with. if we consider our London underground example earlier – we could filter to particular lines of the tube network for example.
I can hear you saying at this point “but this isnt really very impressive – I can run basic queries like this in a dataframe – why do I need graph?!”
well lets look at some inbuilt functionality that isn’t so simple.
Calculating The Page-Rank Algorithm
The page rank algorithm (PageRank) is one (of likely many) used by Google to rank webpages in it search results – it effectively seeks to rank the perceived importance of websites from a search. This algorithm could apply to our social network to show us the “best connected” or “most important” individual. using graph frames this can be easily calculated as follows;
we can see here in our network that Charlie appears to be the best connected individual, followed closely by Bob.
Here we have implemented a complex algorithm very simply via using Graphframes. Its not hard to think of genuine use cases for this kind of data analysis – an obvious one is using LinkedIn data to find salespeople with the most highest quality connections to a company we wish to sell to or finding the most influential specific promoters to use on say Instagram to help sell a product.
Whats that I hear – a slight changing of opinion “Ok.. Sure thats quite useful – what else have you got?” – well there are several inbuild algorithms that can be very useful not just PageRank – lets look at another classic graph type algorithm.
Shortest Path
This is a classic graph type problem (Shortest_path_problem) whereby you want to find the shortest connection between two points of a network. In our example we will want to find how close (the number of relationship away from) one user is to another. This can be easily implemented as follows;
Here we are asking the question for each user in our network – what is the shortest distance to user a and d (Anna and Don).
Here we can see Anna is 2 steps away from user Don;
Anna -> Elena –> Don
We can also see that don is one step from Anna (Don is friends with Anna, but Anna does not is not friends with Don)
We can also see that Elena is 1 step away from Don (friends) and 2 steps away from Anna (Elena –> Don –> Anna) (Don is friends with Anna).
Again – its easy to see how this algorithm could be incredibly useful – obvious examples include – finding the shortest path in a transport network (think of our London underground example) or various permutations in relationship management / started sales / recommendation algorithms.
Find me a Friend!
These pathing algorithms such as shortest path and variants such as BFS (also are supported in GraphFrames) are what social networks use to recommend friends to you, the process is to ifer relationships that are close to you which you do not already have.
In our example we can see that Anna is one step away from Don – and Don has added Anna as his friend. A good recommendation would be perhaps to ask Donna if she knows Don (she almost certainly would).
Conclusion
I hope this post has shown how data can be stored in a graph structure and how using the GraphFrames library you can perform complex analysis on data of this type easily. I must admit I’ve not really used graph databases much, their use cases are usually very specific but I can see their value in these specific situations. The main weakness in this analytic approach is the ease of ability to infer incorrect or irrelevant relationship. Lets go back to our previous example of a police detective making a graph database on his noticeboard, with a few critical relationships – you can possibly infer sensible lines of enquiry – but the more relationships you add onto this noticeboard the greater the chance of inferring nonsense. If the victims cousin and the owner both bought their pet dog from the same breeder 2 years apart – it is a relationship, but does it help you solve the crime? much like the detectives noticeboard – I’m of the belief that most of the output of graphing queries should be the start of much wider questions rather than the answer.