Each year, 8,000+ developers, engineers, software architects, dev teams, managers and executives from 70+ countries gather in the SF Bay Area — the world’s epicenter of developer innovation — for DeveloperWeek (Feb 12-16, 2020) to discover the latest in developer technologies, languages, platforms, and tools.
DeveloperWeek 2020 is San Francisco’s largest developer conference & event series, with dozens of week-long events including the DeveloperWeek 2020 Conference & Expo, a 1,000+ attendee hackathon, a 1,000+ attendee tech hiring expo, and a series of workshops, open houses, drink-ups, and city-wide events across San Francisco.
This year, Andrew McVeigh, LiveRamp’s Chief Architect, was a keynote speaker on the topic: “The Secret to Scaling a Team? Strategic Architecture.” Andrew shared insights on how technical leaders can structure their system and architecture to make rapid progress that can scale, attract, and keep top talent.
You can watch the video below and read the transcript beneath it. You can also download the slides from the presentation (which appear throughout the video and transcript) here.
Hi everyone! My name is Andrew McVeigh and I’m the Chief Architect of LiveRamp. Today I’m talking to you about the secrets of scaling a team, and I’m going to come at it from a slightly novel perspective. I know that scaling teams is not always about technology, but I’m going to come in from an architectural angle and explain how architecture does affect scaling. If you can get the architecture right, then the team will scale well, and people will be able to do what they want to do; they develop mastery. And I’m going to present three significant use cases: situations I’ve been in at companies where we’ve introduced significant changes to team health as a result.
I’ve been developing for way too long now, probably 35 years. I sold my first software when I was in high school… and these are the key things that tend to drive me when I’m developing. The first thing that I care about is mastery of craft. And as a corollary, I’ve noticed that a lot of developers care about these things: how much mastery do you have over your craft? Are you growing? Are you learning new techniques? The second thing that is really important is getting momentum; getting stuff done, getting stuff out the door. I often say that a little momentum helps solve a lot of problems. I’ve seen very well-rounded teams that weren’t getting stuff out the door become very unhappy. I’ve seen slightly toxic teams that were getting a lot of stuff out the door and were still very happy. So a little momentum solves a lot of problems.
Another thing developers really enjoy is participating in a team. Everyone remembers when they were in college and they played on a team and won a big game, and everyone celebrated. They all felt a part of it, and they felt they played their part. People love to participate in teams. Software is a bit like that for me. Some of the most significant moments in my life have been when we’ve developed a serious software system and achieved a major goal under tight pressure.
I’m also going to say that architecture and development are a little bit like working in a gourmet kitchen. My sister is a gourmet chef. She’s a Cordon Bleu chef – she’s trained for many years – and I visited her once in a very large kitchen in a very high-end restaurant. They had 20 people working in a very small space, coordinating and collaborating between them. They weren’t having any problems at all; it was a bit like a ballet. Well-organized architectures allow developers to participate in that ballet, so they don’t get in each other’s way and don’t step on each other’s toes. And what I’m going to be doing is talking a lot about team sizes in this presentation, and how developers can get that ballet even in the largest of systems with the largest number of developers.
So a little bit about me: I’ve been programming for many, many years. I was the lead architect back in ‘98 or so for the smart card issuance system behind the American Express Blue Card. I did a lot of work in investment banking and trading systems for about 15 years in London and got my PhD while I was over there as well. When I came to the States I was brought over by a company called Riot Games. They make the enormously popular game, “League of Legends.” It’s played by about 100 million monthly active users, who play it more than two and a half billion hours a month. I worked a lot on the server side and was also the architect behind the newly released game client a couple of years ago, which is on 190 million desktops.
After being at Riot for about five years I left and joined as the Chief Architect at Hulu. I joined when we were just starting the live TV push. So we started development on creating the live TV channels, in addition to the video on demand. And I’ll talk about my experiences there. After Hulu, I left and became the Chief Architect at LiveRamp. LiveRamp is an identity resolution company that is effectively democratizing identity resolution, so publishers and content providers are not beholden to the walled gardens which couple the content and the advertising.
The first architecture I’m going to talk about is at Riot Games. And the subtheme here is: a modular architecture gives you a lot of independence, which avoids stepping on each other’s toes and gives teams the ability to excel, even as you get to very large team sizes. So this is the League client update: the game client for League of Legends, on 180 million desktops. It’s a very, very popular game around the world, particularly in China. As you can see it’s a complex system; there’s a lot of real-time interaction going on: there’s chatting, there’s matchmaking… there’s a lot of intense graphics. And we built this system… it was the largest undertaking Riot Games had taken on in seven years, so there was a lot riding on this. We were trying to replace a very old client. This was easily the most popular game in the world at this point, and Riot had tried and failed twice before. This was our third attempt, so we had to get it right. So we built this system, but we initially started with a team of maybe 15 developers. We inherited an architecture, and this architecture had major bottlenecks; it was not designed for teams to scale. There were only 15 developers working on this thing.
So this is all on your desktop. On the left-hand side you can see the client. The foundation layer is where we did the communications over the internet; there’s a huge amount of communication back from your desktop to the game servers. Above that we’ve got the UI, and this is running inside the browser. We were using the Chromium Embedded Framework, which is a product that embeds Chrome on your desktop. It’s a little bit like Atom for game clients. Then we had a data layer which was striped across the whole system. We put domain models above that, a bit of Ember-related code above that, and then we built features. And the problem with this is it didn’t scale beyond 15 developers: when developers working on the data layer made a change, it would affect everyone who was building a feature.
The upshot is that when we had 15 developers working on it, they were very unhappy. I’ve never seen such unhappy developers. They were talking about how they weren’t able to do this or do that. The big problem was there was no momentum, because everyone was coupled to each other. We did a survey across the development team and it came back all red on all the traffic lights. People were extremely unhappy. Some of them were even talking about leaving. And that’s the problem when the architecture is not well mapped onto the way the organization works.
We took a very difficult decision to pivot the architecture. We pivoted it initially in two weeks; we did a vertical slice as a prototype. Reading from the bottom, we kept the WAN communication layer, but we allowed C++ plugins to be registered with the desktop app, each exposing an internal REST service. So we allowed microservices to live inside your desktop application. Then, in the UI, we wrapped the JavaScript code lexically, meaning that each JavaScript plugin got its own namespace even though it’s all running in the same JavaScript VM. That meant that teams could come in and own a microservice, a REST service, and their own plugin. Then they were no longer stepping on each other’s toes. We managed to scale up the development team from 15 developers to 150 and throw them at the thing. What we realized is that the knowledge of how the previous game client was built was actually spread out over the entire org. At this point, Riot Games was around 600 engineers; it scaled from 80 engineers up to 800 by the time I left. This allowed us to throw all of those people across the company who knew the system at this thing, and within six months we got it delivered.
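To make the plugin model concrete, here is a minimal sketch in TypeScript with entirely hypothetical names (the real client registered C++ plugins exposing internal REST services, alongside per-plugin JavaScript namespaces): each team registers routes under its own prefix, so no team writes into a shared layer.

```typescript
// Minimal sketch of the plugin idea (hypothetical names; the real client
// registered C++ plugins exposing internal REST services inside the app).

type Handler = (body: unknown) => unknown;

// Each plugin owns its own route prefix, so teams never write into a
// shared data layer or a global namespace.
interface Plugin {
  name: string;                       // e.g. "matchmaking", "chat"
  routes: Record<string, Handler>;    // endpoints relative to the plugin
}

class PluginHost {
  private routes = new Map<string, Handler>();

  register(plugin: Plugin): void {
    for (const [path, handler] of Object.entries(plugin.routes)) {
      // Every route is namespaced by plugin name: /<plugin>/<path>
      this.routes.set(`/${plugin.name}${path}`, handler);
    }
  }

  // Stand-in for the in-process REST dispatch.
  call(path: string, body: unknown): unknown {
    const handler = this.routes.get(path);
    if (!handler) throw new Error(`no plugin owns ${path}`);
    return handler(body);
  }
}

// Usage: two teams ship independent plugins without touching shared layers.
const host = new PluginHost();
host.register({ name: "matchmaking", routes: { "/queue": (b) => ({ queued: true, req: b }) } });
host.register({ name: "chat",        routes: { "/send":  () => ({ delivered: true }) } });

console.log(host.call("/matchmaking/queue", { playerId: 42 }));
```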
And we did another survey after the pivot on the architecture, and we got this back. This is actual text from the survey. You can see a lot of it’s green on the architecture side. Developers love the League Client Update’s modularity. On the far right, you can see someone said: “I spent so much time talking about how much I like it that my team gets annoyed.” We’ve gone from a very unhappy small team to a very happy big team, simply by allowing the architecture to support that ballet… allowing the developers to be involved in that ballet where they don’t get in each other’s way and they’re carefully choreographed. From the slides, and some of the things you can see later on in the promotional material, you can see people are very happy. And that’s the type of thing we’re looking for with developers: they feel like they’re growing in their craft, they feel like they’re gaining additional mastery, they’re learning new techniques, and they’re part of an important team. They’re getting stuff out.
Horses for courses: basically, I’ve learned over time to make the architecture appropriate for the size of the company and the product phase of the company. And I find it’s very, very important that we take the organization into account. When we made the League client update initially, we maybe didn’t appreciate that the knowledge of the old game client was spread out through the entire company. We had to create an architecture that recognized that.
So lessons learned on this one: make the architecture appropriate for the problem, the project phase, and the organizational structure. We pivoted the architecture for fast creation and got major development satisfaction as a result. What happened later on is they simplified the architecture as the team size grew smaller and the product aspiration was reduced. They were basically able to pull out a lot of the modularity features, which were adding overhead but had been very necessary for the initial architectural creation.
When I left Riot Games I’d been there for about five years. I joined Hulu as the Chief Architect just as we were starting the live streaming architecture. And I’ve subtitled this: “A Thousand Tiny Pieces.” We had around 800 engineers, and we needed to find an architectural style that would allow all 800 of them to work productively without stepping on each other’s toes. We ended up creating, or rewriting, 400 microservices within the 15 months it took to deliver it. And those were in addition to the 400 microservices we already had. As everyone knows, Hulu is a very prominent video streamer, particularly in North America. It’s probably going international in the near future as a result of the Disney acquisition.
“The Handmaid’s Tale” was one of the key video series that caused it to be popular. We added over 1,000 terrestrial channels as part of introducing live TV. So it was a big effort, with a lot of work to make sure that we knew about the regional restrictions for content. That’s a massive undertaking, because when you look at the channels available to you, you might see 50 to 60; they’re filtered out of a set of 1,000 or more based on what you’re allowed to see. With a thousand microservices, one of the challenges we had is that developers had lost track of where they fit into the architecture. A developer might own two or three microservices and say, “Well, I don’t really understand where these fit in.” So one of the first things I did was build an architectural dashboard which allowed them to overlay the health of the system directly on top of an architectural view of the system.
Each of these boxes here represents a service or a collection of services, and the tabs at the top represent different functions. This is the playback function. What this means is that Hulu has smoke tests running against its system tens of thousands of times every hour to assess its health. The dashboard allows teams to self-service each of those tests and light an element up red if there’s a problem with it. That in and of itself gave us an immense amount of control and explanatory power in saying, “Well, this is how the architecture works.” So again, another thing giving the developers context… allowing them to see, and that level of confidence then builds on itself. They feel like, “Oh yeah, we’re contributing to a large, significant system that actually makes sense.”
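As a rough illustration of the dashboard idea (this is not Hulu’s actual tooling, and all names below are hypothetical), the core of it is simply rolling smoke-test results up onto the boxes of an architectural view:

```typescript
// Sketch of the dashboard idea: roll smoke-test results up onto the boxes
// of an architecture diagram (all names are hypothetical).

type TestResult = { service: string; passed: boolean };

// Which services sit inside which architecture box on the "playback" tab.
const playbackView: Record<string, string[]> = {
  "Entitlements": ["geo-check", "subscription"],
  "Manifest":     ["cdn-selector", "drm-license"],
};

// A box goes red if any smoke test for any of its services failed.
function overlay(view: Record<string, string[]>, results: TestResult[]) {
  const failed = new Set(results.filter(r => !r.passed).map(r => r.service));
  return Object.fromEntries(
    Object.entries(view).map(([box, services]) =>
      [box, services.some(s => failed.has(s)) ? "red" : "green"])
  );
}

console.log(overlay(playbackView, [
  { service: "geo-check",   passed: true },
  { service: "drm-license", passed: false },
])); // => { Entitlements: "green", Manifest: "red" }
```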
As part of managing the 800 to 1,000 or so microservices, the Hulu infrastructure team created a platform-as-a-service called Donki. It’s very similar to an in-house version of Heroku. It basically allows you to deploy a microservice into production and manage it within a few hours. Just having that level of support for these microservices took the burden off the individual engineers and allowed them to own their microservices and manage them very, very quickly. You can see on the left-hand side we would feed from a GitHub repo directly into Jenkins to build it. It would then go into the Donki platform-as-a-service and sit on our provisioning system. The engineers were able to make it span on-premises and cloud seamlessly, giving us the ability to spill over into AWS at the time. This infrastructure was vitally important so that the engineers could concentrate on doing their jobs of building the microservices.
But with 800 or so microservices, it doesn’t matter what you do, there’s going to be a cost to something that granular. We biased towards being able to create things in a granular way and own them, which meant that each team could own maybe two or three microservices and not step on anyone’s toes. They were building out their own APIs and had full control over that. The cost of the architecture, though, in this particular case, is that operating 800 microservices is a challenge in its own right. And we didn’t quite have enough infrastructure. In particular, one of the problems with such a granular architecture is the thundering herd. Basically, right at the bottom, a small service like a geo-location service will go wrong, and all of a sudden requests to it will just get locked up. Then the services above that – the next layer up – will call into it and also get locked up. The services above those will get locked up. So a small service starts malfunctioning, and all of a sudden you have an operational issue because playback has stopped for half the country.
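For illustration, the kind of per-call protection that limits this cascade, and that a service mesh now gives you largely off the shelf, can be sketched as a timeout plus a simple circuit breaker. This is a hypothetical example, not the infrastructure Hulu had at the time:

```typescript
// Sketch of one common mitigation for the cascade described above: wrap
// downstream calls in a timeout plus a simple circuit breaker, so a slow
// geo-location-style service fails fast instead of locking up its callers.
// (Illustrative only; names and thresholds are made up.)

class CircuitBreaker {
  private failures = 0;
  private openUntil = 0;

  constructor(private maxFailures = 5, private cooldownMs = 10_000, private timeoutMs = 200) {}

  async call<T>(fn: () => Promise<T>, fallback: T): Promise<T> {
    if (Date.now() < this.openUntil) return fallback;   // circuit open: fail fast
    try {
      const result = await Promise.race([
        fn(),
        new Promise<never>((_, rej) => setTimeout(() => rej(new Error("timeout")), this.timeoutMs)),
      ]);
      this.failures = 0;
      return result;
    } catch {
      if (++this.failures >= this.maxFailures) this.openUntil = Date.now() + this.cooldownMs;
      return fallback;                                   // degrade instead of blocking callers
    }
  }
}

// Usage: playback keeps working with a default region if the geo lookup is sick.
const geoBreaker = new CircuitBreaker();
async function lookupRegion(ip: string): Promise<string> {
  // "geo.internal" is a hypothetical endpoint for the sketch.
  return geoBreaker.call(() => fetch(`https://geo.internal/lookup?ip=${ip}`).then(r => r.text()), "US");
}
```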
In retrospect, I wish we had put in more support for things like service meshes, but to be honest, service meshes weren’t mature at the time. So we learned over time… infrastructure has basically evolved to the point where it’s feasible to manage this number of microservices. Scaling this is also a challenge; we had to scale for the Super Bowl and other events like the Olympics and March Madness. Another problem is that some changes are cross-cutting; they could literally go through 50 microservices, which meant we then needed to coordinate 10 different teams and sit down and say, “OK, how do we coordinate now?” And that, essentially, is the cost of this architecture.
So, lessons learned: we had 800 developers in four major offices crunching to make sure we got this finished before the deadline. We basically had a deadline – I think it was set by Rupert Murdoch, who said he wanted it out on that day – and we were replacing all of the server side and all of the client side. It took us 15 months to go from the start to being a significant OTT player; it’s now the largest internet-TV-based service in the United States. But the microservices allowed us to have fast creation. Those 800 developers were very happy; they were participating as a team, and providing them with architectural dashboards allowed them to see where they were going. Suddenly the team’s health got very good and they were delivering, getting that momentum. The cost of the fine-grained services came in later, when we realized that cross-cutting changes were affecting 50-plus services, and in the operational overhead.
We operated the system for about a year, and then I left Hulu to take up an opportunity with LiveRamp; the theme here is massive data architecture. LiveRamp is an identity resolution company which competes with the largest of the walled gardens. We’re the only independent identity graph that can do that, and we’re essentially democratizing the ability of firms to not be locked into the walled gardens which control content and advertising. We have a massive data architecture: we process tens of trillions of graph edges, and we do that many tens of thousands of times a day. This is by far the largest amount of data I’ve seen collected in a system. And we’re doing a lot of work on it; we routinely max out between 80,000 and 100,000 cores in Google Cloud. It’s a large system. The underlying theme here is that we’ve got 250 engineers working on what is effectively a massive data pipeline, and I’m going to show you the architectural changes we made to make it possible to scale these teams, so they can start working in a way where they don’t step on each other’s toes.
So this is a marketing view of LiveRamp. You can see LiveRamp has an offline side and an online side. Let me make it concrete for you: Macy’s comes in and says, we’ve got a collection of people who signed up to our loyalty program and who we know bought a pair of trousers in the last six months. And Macy’s says, we want to market t-shirts to them. We want to market t-shirts to them on connected TV, on the web, and in applications and games on mobile. They send their data in to LiveRamp. Now, LiveRamp’s secret sauce is the ethical treatment of this data. We take out any PII and we completely anonymize the data, so you’re given a random ID; that ID identifies a person, but no one can connect it back to you. We then take that information, which has been processed and anonymized, and combine it with online signals – signals such as authenticated traffic, web impressions, device IDs, and so on. We can then send it out to the walled gardens if we want to, or to any of 500 advertising destinations and platforms, allowing brands to effectively own their own pipeline and not have to worry about locking themselves into any one of these advertising platforms.
So we’re operating, as I said, at an immense scale. The offline side is very large, but the online side is truly huge. As I said, we’ve got in the region of many tens of trillions of graph edges, which we process every time we pass a segment through the platform. In addition, because LiveRamp anonymizes in a deterministic way, Macy’s can send in their data, say these are all the people who bought trousers, and then intersect that with all the people who own a luxury car and who live in an urban center. You can do that with Boolean logic and send it out. We effectively create a marketplace for these anonymous segments as well.
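To illustrate why deterministic anonymization matters (a toy sketch, not LiveRamp’s actual identity pipeline): if the same input always maps to the same opaque ID, two segments that were anonymized independently can still be intersected with Boolean logic.

```typescript
import { createHmac } from "crypto";

// Illustration only: LiveRamp's real identity pipeline is far more involved.
// The point is that a *deterministic* pseudonym lets two independently
// anonymized segments still be intersected with Boolean logic.

function pseudonym(pii: string, secret: string): string {
  // Same input + same key => same opaque ID, with no way back to the PII.
  return createHmac("sha256", secret).update(pii.trim().toLowerCase()).digest("hex");
}

const KEY = "example-key"; // hypothetical; real keys would be managed securely

// Two segments anonymized separately, from different source files.
const boughtTrousers  = new Set(["a@example.com", "b@example.com"].map(e => pseudonym(e, KEY)));
const luxuryCarOwners = new Set(["b@example.com", "c@example.com"].map(e => pseudonym(e, KEY)));

// Boolean AND over anonymous IDs: people who appear in both segments.
const target = [...boughtTrousers].filter(id => luxuryCarOwners.has(id));
console.log(target.length); // 1
```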
I mentioned data volume and transaction scale. This slide is unfortunately old; apologies for that. Every one of the companies on here is tracking roughly 240 to 250 million people in the US. What we’re trying to say in this slide is that we’re the only independent identity graph company that can compete with these major established players. LiveRamp has around 240 million active consumers that we can target. That puts people on a par with Google and Facebook, but without having to be locked into their walled gardens. So we’re allowing people to avoid that lock-in between content and advertising, which effectively means that we’re promoting higher-quality content. So we’ve got a massively successful product.
This is a schematic of the data pipeline through our system. It’s a very crude schematic, but it illustrates the point. We’ve effectively dominated the data onboarding market; that’s the workflow I just described with Macy’s. It’s underpinned by offline and online identity. Above this we have ingestion, where we take the data in and anonymize it in conjunction with the offline graph; we then form marketing data segments from it, activate it by combining it with the online graph, and then distribute it out to 500 or so different platforms.
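As a very rough sketch of the stages in that schematic, with hypothetical types and toy data (the real pipeline operates over tens of trillions of graph edges, not in-memory arrays):

```typescript
import { createHash } from "crypto";

// Stand-in types; the real system works over identity graphs, not arrays.
type AnonRecord = { anonId: string; traits: string[] };
type Activated  = { anonId: string; deviceIds: string[] };

// Toy stand-in for offline identity resolution.
const anonId = (email: string): string =>
  createHash("sha256").update(email).digest("hex");

// ingest + anonymize: raw loyalty rows in, PII out, anonymous IDs and traits kept.
const ingest = (rows: [string, string[]][]): AnonRecord[] =>
  rows.map(([email, traits]) => ({ anonId: anonId(email), traits }));

// segment: Boolean selection over traits (e.g. "bought_trousers").
const segment = (recs: AnonRecord[], trait: string): AnonRecord[] =>
  recs.filter(r => r.traits.includes(trait));

// activate: join the segment against the online identity graph (anonId -> device IDs).
const activate = (segs: AnonRecord[], onlineGraph: Map<string, string[]>): Activated[] =>
  segs.map(r => ({ anonId: r.anonId, deviceIds: onlineGraph.get(r.anonId) ?? [] }));

// distribute: fan out to one of the ~500 destination platforms.
const distribute = (batch: Activated[], destination: string): void =>
  console.log(`sending ${batch.length} records to ${destination}`);

// One toy run of the pipeline, end to end.
const recs = ingest([["a@example.com", ["bought_trousers"]], ["b@example.com", ["bought_shoes"]]]);
distribute(activate(segment(recs, "bought_trousers"), new Map()), "connected-tv-destination");
```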
But there’s a problem with this, which is that we baked our core product into our platform. How do we separate them out? To get to market and become successful, LiveRamp essentially joined the two together around its core use case, which is data onboarding. Now we want to separate them, because we realized we want to make identity resolution available to more use cases. To make that possible, we essentially decided to turn it back into a platform. We want to get to that really nice place where identity resolution is the core of the platform, and above that we can build applications. That means we can scale the teams; we can put them in places where each team is building applications without stepping on each other’s toes.
People were already using our product as a platform, but they were using it in very clunky ways, simply because we didn’t let them use it in a more granular way. We listened to those signals, and we constructed a set of primitives above the platform that allowed us to truly create applications on top of it. This is still a work in progress, but we’re getting there in a very strong way. The little bubbles – I, E, S, D, D2 – are exposing those primitives. And to give you an idea of the power of this, the primitive D2, which is called direct-to-distribution, reduced the SLA for the delivery of segments to connected TV from an average time of 24 hours to 1.5 hours. So this is also making the platform very much easier to use as an application.
So this is where we’re getting to: a situation where individual developers, or small teams, can create applications on our platform. We can put people on the platform who are working on this very connected data pipeline, while others build applications above it. So we’re able to scale the team, and we anticipate it will allow us to scale much, much larger.
So, the lessons learned: complex products with massive data processing requirements are challenging for team scale. It’s very challenging because people need to understand how the whole thing works. By separating into a platform and applications, we’re able to partition some of that knowledge and allow teams to be more isolated and decoupled, which is a great thing for team scaling. We’ve used our current systems and signals to form a platform. This is opening up an ecosystem, which at the moment is internal, allowing us to grow and keep people happy. But we anticipate that over time we can open it up externally, so we get that virtuous cycle: making the system available to other people who want to build on the identity resolution capabilities.
In summary, a carefully crafted architecture allows developers to excel… it allows them to demonstrate mastery, keeps them happy, and creates that virtuous cycle. Unhappy developers are a warning sign that an exodus isn’t far behind. My message to you is: don’t be afraid to pivot, and think very carefully about your architecture to keep your teams scaling, keep developers happy, and keep them developing their mastery. Thank you.