We do a lot of work with massive graphs at LiveRamp. In this post I’ll share the story of how we analyze one such massive graph and discuss the Hadoop-based technology we use to efficiently perform this analysis in real-time.
To provide some background, we’re analyzing a graph consisting of the connections between different identifiers for consumers. The edges correspond to the raw identity match data that we use to power our identity resolution technology. For example, consider the graph in the following figure.
This example depicts a type of mistake that can exist in the raw data whereby identifiers for two different people are incorrectly tied together. To refine our algorithms and models used to detect bad edges it is useful to inspect and analyze sampled parts of the graph.
It is far from trivial to sample parts of the graph due to the scale of the data. The graph consists of over 600 billion edges and they occupy 20 TB of Hadoop disk space before replication. Further, we want to be able to sample parts of the graph in real-time so that we can analyze specific parts of the graph. This therefore requires an indexed database of graph edges.
We build an indexed database of graph edges using our seeking map-side join framework (SeekMSJ). MapReduce is used to partition the edges by their source node into 8192 files residing on our Hadoop filesystem (HDFS). Each file is further sorted by the source node of each edge and SeekMSJ builds an index for each file. We can efficiently read the edges for a given node by determining the partition file that contains the node and using the index to determine the location within that file where the node is stored.
Since our SeekMSJ codebase is in Java and we analyze data in Python, we need a cross-language bridge between the two. This is accomplished by developing a RESTful webserver in Java that exposes an endpoint for fetching the edges of a given node. Additional endpoints are provided for sampling nodes that meet certain criteria and for looking up the metadata associated with each node. Each request is satisfied in sub-second response time, which provides a great environment for exploring the graph in real-time.
For price comparison, we also could have accomplished this using a managed key value store such as Amazon DynamoDB. Hosting our 20TB dataset on DynamoDB would cost $6000 a month even at very low read throughput. In contrast, it only cost us a fraction of that to provision 20 TB of disk space on our 94 PB Hadoop cluster.
Here are two anonymized examples of interesting graphs that we analyzed with these tools. The edge width scales with the number of different connections we have between each pair of identifiers.
Overall, SeekMSJ has provided an efficient way to access large datasets in real-time and that has allowed to analyze samples of this massive graph.