LiveRamp processes petabytes of data on a daily basis. One challenge that we face in the United States is how to shuffle large amounts of data in a cost-effective and reliable way. Uber’s Remote Shuffle Service (RSS) allows us to shuffle data outside of our Big Data infrastructure and run the same jobs on cheaper hardware. While many data shuffling solutions exist, we use Uber’s RSS due to its flexibility and customisability.
Let’s explore how a remote shuffle manager works, the benefits and limitations of Uber’s Remote Shuffle Service, and the business impact of using it at LiveRamp.
What is a remote shuffle manager?
Remote shuffle managers extend the way a framework handles shuffling. We use Spark and extend its ShuffleManager interface to customise how shuffle data is read and written within our network.
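To make that concrete, here is a deliberately simplified sketch of the responsibilities such a plug-in covers. This is not Spark's actual org.apache.spark.shuffle.ShuffleManager trait (its method signatures vary between Spark versions); it only illustrates the hooks a remote implementation would route over the network instead of to local disk.

```scala
// Conceptual sketch only, not Spark's real API: the actual
// org.apache.spark.shuffle.ShuffleManager trait has the same responsibilities
// but different, version-specific method signatures.
trait SimpleShuffleManager {
  // Driver side: called when a shuffle stage is planned.
  def registerShuffle(shuffleId: Int, numPartitions: Int): Unit

  // Map side: returns a sink that accepts serialised records for one partition.
  def writerFor(shuffleId: Int, mapId: Int, partitionId: Int): Array[Byte] => Unit

  // Reduce side: returns every mapper's output for one partition.
  def readerFor(shuffleId: Int, partitionId: Int): Iterator[Array[Byte]]

  // Cleanup once the shuffle is no longer referenced.
  def unregisterShuffle(shuffleId: Int): Unit
}

// A remote shuffle manager implements hooks like these so that writes and
// reads talk to shuffle servers over the network rather than local disk.
```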
A custom shuffle manager is not required in most scenarios, but it is very useful when you want to tailor your hardware. When shuffling data, there needs to be a mechanism for tracking where the shuffle data lives. If features like auto-scaling or node preemption are enabled, you can quickly find yourself in a loop of constantly reshuffling files. During this process, nodes can sit idle while they wait for the entire shuffle stage to finish, which becomes more wasteful the more you scale. You could redefine partitioning strategies and scale up your infrastructure, but this becomes costly very quickly.
LiveRamp runs two optimised clusters: one for shuffling data and the other for compute. This lets us be more cost effective, because we can use cheaper hardware and scale the two clusters independently of each other, sizing one for its disk requirements and the other for its compute requirements.
How does Uber’s RSS work?
Uber’s RSS addresses the significant challenge of scaling Apache Spark’s shuffle operations, which are critical for efficient ETL workflows and large-scale data processing. Here’s a detailed look at how this service works:
1. Architecture Overview
The architecture of Uber’s RSS involves the following key components:
- Shuffle Servers: These are remote servers responsible for managing data shuffle operations.
- Service Registry: This component keeps track of shuffle server instances and their statuses.
- Zookeeper: A coordination service for managing distributed applications. Zookeeper helps identify unique shuffle server instances for each partition (see the configuration sketch after this list).
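As a rough illustration, wiring a Spark job to an RSS cluster backed by a Zookeeper registry looks something like the following. The spark.shuffle.rss.* property names follow Uber's RSS documentation and may differ between releases, so verify them against your own build; the Zookeeper addresses are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch of pointing a Spark job at an RSS cluster. The spark.shuffle.rss.*
// property names are taken from Uber's RSS README and may vary by release;
// the Zookeeper addresses are placeholders for your own quorum.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "org.apache.spark.shuffle.RssShuffleManager")
  .set("spark.shuffle.rss.serviceRegistry.type", "zookeeper")
  .set("spark.shuffle.rss.serviceRegistry.zookeeper.servers",
       "zk-1:2181,zk-2:2181,zk-3:2181")

val spark = SparkSession.builder()
  .appName("rss-enabled-job")
  .config(conf)
  .getOrCreate()
```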
2. Data Partitioning and Distribution
- Shuffle Clients: Spark executors run RSS clients that communicate with the service registry and the shuffle servers.
- Spark Driver Role: The Spark driver identifies the shuffle server instances for each partition using Zookeeper. This information is then passed to all mapper and reducer tasks.
- Mapper and Reducer Tasks: All mappers writing records for the same partition send them to the same shuffle server. Once all mappers have finished writing, the reducer tasks fetch that partition's data from its designated shuffle server (as sketched below this list).
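The detail that makes this work is that the partition-to-server assignment is computed once, on the driver, and shared with every task. A minimal sketch of that idea (conceptual only, not RSS's actual assignment code):

```scala
// Conceptual sketch, not RSS code: every task resolves a partition to the
// same shuffle server, so mappers all write partition p to one place and the
// reducer for p knows exactly where to fetch it from.
case class ShuffleServer(host: String, port: Int)

def assignPartitions(servers: IndexedSeq[ShuffleServer],
                     numPartitions: Int): Map[Int, ShuffleServer] =
  (0 until numPartitions).map(p => p -> servers(p % servers.size)).toMap

val servers    = IndexedSeq(ShuffleServer("rss-1", 19999), ShuffleServer("rss-2", 19999))
val assignment = assignPartitions(servers, numPartitions = 8)
// Every mapper sends partition 3's records to assignment(3);
// the reducer for partition 3 later reads from that same server.
```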
3. Efficiency and Scalability
- Decentralised Data Management: By using remote shuffle servers, the reliance on local disk I/O is significantly reduced, leading to better performance and scalability.
- Optimised Communication: The use of a service registry and Zookeeper ensures that the communication between Spark executors and shuffle servers is streamlined and efficient.
4. Performance Improvements
- Enhanced Speed and Reliability: The distributed nature of the shuffle service allows for faster data processing and increased reliability, reducing bottlenecks associated with traditional shuffle operations.
5. Operational Workflow
- Shuffle Process: During the shuffle phase, data is redistributed across the network to balance the load and optimise processing. The remote shuffle servers handle the data shuffle independently, ensuring minimal disruption and maximal throughput.
- Data Fetching: Reducers fetch the shuffled data from the specific shuffle server partitions, allowing for efficient data retrieval and processing.
The Business Impact of Uber RSS at LiveRamp
Using Uber’s RSS has changed how we scale Big Data jobs at LiveRamp. By decoupling shuffle operations from compute nodes and introducing a dedicated shuffle cluster, we saved $2.4 million a year – simply by using the basic configuration out of the box.
Uber’s RSS Limitations and Learnings
While the process of installing a remote shuffle service cluster within the network is straightforward, it does come with some pitfalls to be aware of:
Disk Cleanup
When a job completes, it is important to make sure the shuffle disks are cleaned up; otherwise, stale data will take up space and trigger unnecessary scaling. Cleaning up too soon can cause shuffle errors on read, and cleaning up too late can increase cost. Uber's RSS does allow for customisability here, but it takes trial and error to find the right balance.
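To illustrate the balance being tuned (the property names below are hypothetical placeholders, not real RSS settings; check your RSS release for its actual cleanup and retention options):

```scala
// Hypothetical property names, for illustration only. The tradeoff is real:
// retain shuffle files long enough for every reducer to finish reading,
// but short enough that stale files do not force the disk cluster to scale.
val cleanupSettings = Map(
  "rss.example.shuffleFileRetentionMillis"   -> (6L * 60 * 60 * 1000).toString, // 6 hours
  "rss.example.staleFileCheckIntervalMillis" -> (30L * 60 * 1000).toString      // 30 minutes
)
```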
Node Preemption Rate
With shuffling offloaded, you can increase the number of preemptible nodes in your cluster. If a node is preempted, Spark will try to rewrite the data. This is fine in isolation, but without proper configuration Uber's RSS can kill the job, thinking there is a problem in processing. Increasing fault tolerance through configuration is necessary to allow for more preemptible nodes, but that can come at the cost of speed.
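For example, Spark's own retry settings help ride out preemptions; the values below are illustrative rather than recommendations, and the right numbers depend on your preemption rate and how much extra latency you can accept.

```scala
import org.apache.spark.SparkConf

// Illustrative Spark-side fault-tolerance knobs for preemptible nodes;
// values are examples, not recommendations.
val conf = new SparkConf()
  .set("spark.task.maxFailures", "8")               // tolerate more lost tasks per stage
  .set("spark.stage.maxConsecutiveAttempts", "8")   // allow more stage retries
  .set("spark.shuffle.io.maxRetries", "10")         // keep retrying failed shuffle fetches
  .set("spark.shuffle.io.retryWait", "15s")         // back off between fetch retries
```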
Maintenance
This added piece of infrastructure requires more maintenance, which can be cumbersome, and it introduces another point of failure. Google offers an out-of-the-box alternative, Enhanced Flexibility Mode (EFM); however, it trades customisability for lower maintenance.
Future Improvements
Moving forward, there are a few questions and areas we are looking to improve when it comes to using Uber’s RSS at LiveRamp.
Improved Disk I/O: Is it possible to extend writes to object storage or some other storage model to reduce costs even further? Is there a better way to improve fault tolerance and allow for more preemptible nodes?
Support for Spark 3.0: Uber's RSS does support Spark 3.0, but that support currently lives in a development branch.
Monitoring and Automation: How can we easily roll this out to other parts of the company without requiring teams to go through a significant set-up and tuning process?