
RampUp 2020: RampUp for Developers Recap – The Systems behind Lookalike Modeling



  • Ope Banwo – Software Engineer, Data Science
  • Joe Hsy – Head of Engineering, Data Science and Engineering

The third RampUp for Developers session focused on LiveRamp’s ability to expand audiences and reach through lookalike modeling. Built on LiveRamp’s RampID, a consistent and anonymous person-based ID, our lookalike modeling solution is channel- and device-agnostic, meaning we can ingest virtually any ID type and produce modeled audiences for a different ID space, such as display channels using cookies. This opens up inventory, such as TV, that is often inaccessible to other modeling providers. Learn the mechanics behind LiveRamp’s RampID and how it can enable lookalike modeling.

You can watch the video below, and can read the transcript below that. You can also download the slides from the presentation (which also appear throughout the video) here.


Joe Hsy: All right, great. So as Randall mentioned, we’re going to talk about the Lookalike Modeling API. I’ll just get started. So the Lookalike Modeling API from LiveRamp, also known as LLAMA. We’re actually pretty proud of the acronym. Sometimes acronyms you have to really convolute your terms but it kind of just flowed naturally. And that’s actually a llama I put together and drew at one of our off sites. I’m kind of proud of that.
Joe Hsy: So what is lookalike modeling? So essentially what lookalike modeling lets you do is you start with a known seed audience. Let’s say you want to do a marketing campaign against… for your new state-of-the-art vacuum cleaner, right? So you want to find people interested in a vacuum cleaner. You’d start with a seed audience of people that you know have bought vacuum cleaners in the past.
Joe Hsy: So that’s a set of people that you know for sure would be interested in your marketing campaign. And you bring it over to LiveRamp and then we expand the segment with people who actually look similar to the people that you bring in as the seed audience, except you may start with a few thousand people in your seed audience and we can expand it to hundreds of thousands, or even millions of people. All right? And then once you have that custom created lookalike audience, you can then activate, as some of you saw in Andrew’s session this morning, you can activate that in multiple mediums and advertising spaces. So that’s lookalike modeling.
Joe Hsy: So when we created LLAMA, we had four goals in place. I want to kind of just talk through them. First of all, we want it to be adjustable. So you don’t want to actually just have one set of criteria. When you create a seed audience, say we only find, based on a thousand people, 10,000 people. But if you want to target a lot more people, you might want to be able to tune the trade-off between reach and similarity, right? You may be willing to give up similarity to some extent so that you can reach a larger audience. So we have a slider bar that lets you tweak the size of the audience you want. It’s accurate, in terms of we want to deterministically link the two together, your seed audience and the reference data, through RampID.
Joe Hsy: Flexible, with RampID we can pull in both offline as well as online audiences so that you don’t… your customer base could come from your customer records or it could come from online transactions. And then we can basically, as we mentioned, model through different channels. And customizable, you can bring your own reference dataset. So essentially a lookalike model audience comes from a set of potential people that you want to target, right? And that can come from LiveRamp itself or you can bring your own audience. Let’s say you want to do a marketing campaign against your own customer database. So you can start with a seed and then just expand within your own customer database. So you can bring that in as well, as well as use the reference dataset. Okay?
Joe Hsy: So before I hand it off to Ope to talk about actually how lookalike modeling works, I want to talk about how we actually build a data science product. So I think some of you probably have been in data science before. In the early days of data science typically you have academics, PhDs, came from research environments and they sit in an ivory tower and then figure out, “Okay, what are the things that, what’s the interesting insights we can do?” And once they come up with the interesting model, what they think could work, they throw it over the wall to developers and then develop. Yeah, that’s early days. The problem with this is when you find out there’s drawbacks to your model, you have to go back to the data scientists. And the data scientists usually use a different language, you have to translate. Early days they use R or Python and developers, might be people with Java.
Joe Hsy: And so there’s a translation cost. And so the approach that we’re taking, which more and more companies are actually taking as well, is to actually have scrum teams with data scientists as well as software engineers. So we cross train, so data scientists need to learn a little bit about how to develop products. And our software engineers on the team also need to learn about how data science works so that we have a holistic approach to data science as opposed to just thinking about what could possibly work. And the key is, as Andrew mentioned this morning about APIs, the data scientists and software engineers as a team build APIs that can be leveraged by application developers. So instead of creating models that have to be embedded in source code, we create APIs and then you can layer on top of the APIs any type of applications and create additional UIs.

Joe Hsy: So in terms of how we want to productize the data science, number one is APIs, right? We don’t want to create just models that sit in a particular source code you have to pull in. So we want to actually have APIs that you can call. We also automate the data engineering process. In order to actually continuously improve the product, we always get new data to build models and we don’t want to have data scientists manually process the data every time. So automating the ingestion of new datasets is very important in terms of productization. Also plug and play data science models. We want to be able to continuously innovate and figure out new models and be able to plug them into the systems so that you don’t have to rebuild everything. Of course, this is software engineering as it applies to data science, we want to continuously build improvement and reliability. Okay.

Joe Hsy: A couple of slides before I hand it off to Ope. In terms of data science challenges. So you may think that it’s relatively easy to build a reference dataset, right? So a reference dataset is what we use to find lookalikes. Like you bring your own seed data, so what other people look like my seed data? Where is that pool of people that you can find your lookalikes in? This is an example where we have multiple different sources of data. All right. So dataset A is in orange and let’s say there’s a B, that’s in blue, right? Ideally you want to have enough overlap between dataset A and dataset B, so that they’re not disparate datasets, right? But when there’s overlap, you want to make sure that the columns and the values are standardized, right? And that they match. So you don’t want to have dataset A saying that this set of people is interested in cars and dataset B saying no, this set of people is actually not interested in cars.
Joe Hsy: So we have to resolve those conflicts. At the same time, you want to have the ability to impute missing values, right? So with two datasets you may actually have audiences that overlap, which is the middle part, and then you have people that don’t overlap, and so you are missing some attributes from one dataset versus another. And we use data models and predicted values from the overlapping datasets in order to impute the missing values. Okay. And finally, our model building approach. I don’t know if there are any data scientists in the room, this is probably more interesting to you, but essentially we use a version of linear regression for our model building. It’s called ridge regression, with cross validation for hyperparameter tuning. So this enables us to do variable selection because we have thousands of variables.
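To make the imputation idea above concrete, here is a minimal Python sketch. It is not LiveRamp's method: the talk only says that models trained on the overlapping part of the datasets predict the missing values. The sketch assumes the simplest possible predictor, the most common value of the missing attribute for each value of a known attribute, and the column names (`age_band`, `interested_in_cars`) and helper names are hypothetical.

```python
from collections import Counter, defaultdict

def fit_conditional_mode(overlap_rows, known, missing):
    """From people present in both datasets, learn the most common value of
    `missing` for each value of `known` (a stand-in for a real model)."""
    counts = defaultdict(Counter)
    for row in overlap_rows:
        counts[row[known]][row[missing]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def impute(rows, known, missing, lookup, default=None):
    """Fill in `missing` where it is None, using the learned lookup."""
    filled = []
    for row in rows:
        row = dict(row)  # do not mutate the caller's data
        if row.get(missing) is None:
            row[missing] = lookup.get(row[known], default)
        filled.append(row)
    return filled

# People found in both dataset A and dataset B (hypothetical data).
overlap = [
    {"id": 1, "age_band": "18-24", "interested_in_cars": False},
    {"id": 2, "age_band": "18-24", "interested_in_cars": False},
    {"id": 3, "age_band": "25-40", "interested_in_cars": True},
    {"id": 4, "age_band": "25-40", "interested_in_cars": True},
    {"id": 5, "age_band": "25-40", "interested_in_cars": False},
]
# People found only in dataset A, so the dataset B attribute is missing.
only_in_a = [
    {"id": 6, "age_band": "25-40", "interested_in_cars": None},
    {"id": 7, "age_band": "18-24", "interested_in_cars": None},
]

lookup = fit_conditional_mode(overlap, "age_band", "interested_in_cars")
imputed = impute(only_in_a, "age_band", "interested_in_cars", lookup)
```

A production system would presumably use a richer predictor over many attributes; the structure (fit on the overlap, predict for the non-overlap) is the part this sketch is meant to show.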
Joe Hsy: We also sample data, as ridge regression only allows you to sample; it works within a certain limit as well. And as I was saying earlier on, we also continuously evaluate other approaches. We tried random forests, which turned out to be too slow; boosted trees, which overfit too easily; and neural networks, which worked well but didn’t offer a meaningful improvement over ridge regression. But we’re always continuously evaluating new models to see if we can get better results. Okay? All right, so I’m going to hand it over to Ope now who will talk about how the system actually works and flows.
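For readers who have not used it, ridge regression with cross validation can be sketched in a few dozen lines of plain Python. This is an illustration under stated assumptions, not the LLAMA implementation: it uses the closed-form solution w = (XᵀX + λI)⁻¹Xᵀy with no intercept, labels of +1 (seed) and -1 (non-seed), a hypothetical penalty grid, and synthetic data in which two features are nearly copies of each other.

```python
import random

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ridge(xs, ys, lam):
    """Closed-form ridge weights for penalty `lam` (no intercept term)."""
    d = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) + (lam if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(d)]
    return solve(xtx, xty)

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def cv_error(xs, ys, lam, k=5):
    """Mean squared error for penalty `lam`, averaged over k held-out folds."""
    total = 0.0
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        w = fit_ridge([xs[i] for i in train], [ys[i] for i in train], lam)
        total += sum((predict(w, xs[i]) - ys[i]) ** 2 for i in test) / len(test)
    return total / k

def choose_lambda(xs, ys, grid):
    """Hyperparameter tuning: keep the penalty with the lowest CV error."""
    return min(grid, key=lambda lam: cv_error(xs, ys, lam))

# Synthetic data: feature b is almost a copy of feature a.
random.seed(0)
xs, ys = [], []
for _ in range(200):
    a = random.gauss(0, 1)
    b = a + random.gauss(0, 0.05)
    c = random.gauss(0, 1)
    xs.append([a, b, c])
    ys.append(1.0 if a + 0.5 * c + random.gauss(0, 0.5) > 0 else -1.0)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]  # hypothetical candidate penalties
best_lambda = choose_lambda(xs, ys, grid)
weights = fit_ridge(xs, ys, best_lambda)
```

The penalty term shrinks the weights, which keeps them stable when features are strongly correlated; cross validation picks how much shrinkage to apply.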

Ope Banwo: Thank you Joe. So I’m just going to walk you guys through the technical bits of LLAMA and how we currently have implemented LLAMA on Google Cloud. So basically this slide just shows what the workflow is like, and it tries to address two main questions. The first one is how do we create these reference datasets that Joe already described? And then the second one is how do our clients interact with these reference datasets so that they can find the lookalikes to their customers? Obviously the nonfunctional requirement is how do we address these two questions at scale. So the top part of this diagram is what addresses the first question and we do that with two pipelines. We have the profiler and then we have the onboarding pipeline. And then the second question we address with the lower part of this slide, which is the sampling and transformation pipeline, the model training pipeline and the predictor.
Ope Banwo: So I’m just going to walk you through these pipelines and what they do essentially. So for the profiler, the data providers – the people who actually give us these reference datasets – will send us data representing more than 300 million people, with about a thousand features peculiar to these people. So as you would imagine, it’s really difficult for a data scientist to look at this huge dataset and figure out questions like which features are irrelevant. Or sometimes our providers shard a feature across multiple columns. So you could have features like aged between 18 and 24, aged between 24 and 40 and so on. So the profiler needs to be able to look into these features and recommend that they should be merged into one thing. So essentially those are examples of what the profiler does; it does a few other things. But basically it also kind of gives you an insight into what you have in your dataset before you do any other thing.
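The two profiler recommendations mentioned above (drop irrelevant features, combine a feature that has been split across several columns) can be sketched as follows. The rules used here are assumptions chosen for illustration, not LiveRamp's actual heuristics: a column is flagged as irrelevant if it has a single distinct value, and columns are flagged for combining if they share a name prefix before a hypothetical `_between_` separator and at most one of them is set per person.

```python
from collections import defaultdict

def profile(rows, shard_separator="_between_"):
    """Return drop and merge recommendations for a list of dict rows."""
    columns = list(rows[0].keys())
    # A column with one distinct value cannot help tell people apart.
    drop = [c for c in columns if len({r[c] for r in rows}) <= 1]
    # Group columns that share a prefix, e.g. "age_between_18_24".
    groups = defaultdict(list)
    for c in columns:
        if shard_separator in c and c not in drop:
            groups[c.split(shard_separator)[0]].append(c)
    merge = {}
    for prefix, cols in groups.items():
        # Only recommend combining mutually exclusive flags.
        exclusive = all(sum(bool(r[c]) for c in cols) <= 1 for r in rows)
        if len(cols) > 1 and exclusive:
            merge[prefix] = cols
    return {"drop": drop, "merge": merge}

rows = [
    {"id": 1, "age_between_18_24": 1, "age_between_25_40": 0, "country": "US", "has_pets": 1},
    {"id": 2, "age_between_18_24": 0, "age_between_25_40": 1, "country": "US", "has_pets": 0},
    {"id": 3, "age_between_18_24": 0, "age_between_25_40": 1, "country": "US", "has_pets": 1},
]
recommendations = profile(rows)
```

The output is only a recommendation; applying it to the data is a separate step, which keeps the profiling pass cheap and lets a person review the result first.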
Ope Banwo: So now we have the recommendations from the profiler. The next thing is to apply those recommendations on the actual dataset. And that’s the job of the onboarding pipeline. So what it does is it basically looks into the recommendations and says, “Okay, I’m going to drop features A, B, and C because they appear to be irrelevant to whatever thing you want to do.” Or, “I’m going to merge features A, B, and C together because they appear to be the same feature.” And then it also does things like trying to remove duplicate people it finds in that dataset, so that at the end of the day you have a cleaned up dataset that ensures that we can deliver the best models to our clients to find their lookalikes. So those two pipelines that I just described essentially ensure that we have this cleaned up reference dataset.
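A minimal sketch of the onboarding step described above, assuming recommendations arrive as a dictionary with a `drop` list and a `merge` mapping (a hypothetical format), and assuming duplicates are detected by a shared `id` value. The real pipeline runs at a much larger scale; this only shows the three operations named in the talk.

```python
def onboard(rows, recommendations, key="id"):
    """Apply drop/merge recommendations and remove duplicate people."""
    drop = set(recommendations["drop"])
    merge = recommendations["merge"]
    sharded = {c for cols in merge.values() for c in cols}
    cleaned, seen = [], set()
    for row in rows:
        if row[key] in seen:  # keep only the first record per person
            continue
        seen.add(row[key])
        out = {c: v for c, v in row.items()
               if c not in drop and c not in sharded}
        for merged_name, cols in merge.items():
            # The combined value is the name of whichever flag is set, if any.
            out[merged_name] = next((c for c in cols if row[c]), None)
        cleaned.append(out)
    return cleaned

recommendations = {
    "drop": ["country"],
    "merge": {"age": ["age_between_18_24", "age_between_25_40"]},
}
raw = [
    {"id": 1, "age_between_18_24": 1, "age_between_25_40": 0, "country": "US", "has_pets": 1},
    {"id": 2, "age_between_18_24": 0, "age_between_25_40": 1, "country": "US", "has_pets": 0},
    {"id": 2, "age_between_18_24": 0, "age_between_25_40": 1, "country": "US", "has_pets": 0},
]
cleaned = onboard(raw, recommendations)
```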
Ope Banwo: So now the thing is how do we address the second question? How do the clients interact with these datasets to find the lookalikes? So actually before we even get to the sampling and transformation pipeline, we usually go back to the profiling pipeline again. This time it’s not going to be recommending things like what features to drop or anything, it’s actually going to be recommending how you should engineer your features. So in machine learning, basically feature engineering means a series of transformations you apply on your features to ensure that your model can learn the best things from your dataset. So the profiler does that. It knows that this data has already been onboarded, so now it’s just going to recommend how you should do feature engineering. So we get that from the profiler.

Ope Banwo: The next thing is how do we match? We try to match the client’s existing customers to the reference dataset to produce something called the training and test sets. So the training set is what the data scientists actually use to build their models. And then obviously the test set is what they use to evaluate the performance of the model. So the way this works in principle is that we deploy these two datasets to the data scientists’ systems and they perform their magic on them. Then they give us back the models, tell us that these models are really good and then we plug them into the next pipeline, which is the predict pipeline. So the predict pipeline is fairly simple. Really, it looks into that model we’ve plugged into it and says, “Okay, based on this model I’m going to rank everyone in the reference dataset based on their similarity to your existing customers.” So it’s now left for our clients to decide how they want to deploy it. So for instance, they may want to target only those people that are highly similar to their customers, or they may want to extend their reach and go for more people. So basically this kind of summarizes what the LLAMA workflow is.
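The matching step above can be illustrated with a short sketch: people in the reference dataset who also appear in the client's seed audience get label 1, everyone else gets label 0, and each person is assigned to the training or test set. The 80/20 split, the hash-based assignment, and all names below are assumptions made for the example; the talk does not describe how the split is actually performed.

```python
import hashlib

def in_test_set(person_id, test_fraction=0.2):
    """Assign an ID to the test set deterministically from a hash of the ID,
    so the same person always lands in the same set across runs."""
    digest = hashlib.sha256(str(person_id).encode()).hexdigest()
    return int(digest, 16) % 100 < test_fraction * 100

def build_train_test(reference, seed_ids, test_fraction=0.2):
    """Label reference records by seed membership and split them."""
    seed_ids = set(seed_ids)
    train, test = [], []
    for person_id, features in reference.items():
        example = {
            "id": person_id,
            "features": features,
            "label": 1 if person_id in seed_ids else 0,
        }
        target = test if in_test_set(person_id, test_fraction) else train
        target.append(example)
    return train, test

# Hypothetical reference dataset keyed by an anonymous ID.
reference = {f"id{i}": {"income_band": i % 5, "children": i % 3} for i in range(50)}
# Seed IDs that are not in the reference dataset simply do not match.
seed_ids = ["id3", "id7", "id11", "not_in_reference"]
train, test = build_train_test(reference, seed_ids)
```

In practice the non-seed population is far larger than the seed, so a real pipeline would also sample it, which is presumably part of what the sampling and transformation pipeline does.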
Ope Banwo: So when we decided to migrate to GCP, we sat down together and made some decisions on what technology stack to use. I’m just going to explain the rationale behind that. So we are using Dataflow to build these pipelines because it allows us to focus our software engineers on implementing the MapReduce algorithms as code, rather than spending time fussing about infrastructure setup. So it was really good for us. It meant we could scale easily. We are using TensorFlow and TensorFlow Transform for building these models because if you’re a data scientist here you know one of the biggest challenges is how do you actually scale horizontally when it comes to building models. So this ensured that we are able to solve that bottleneck, so we were able to build models in a distributed fashion. Obviously we’re using Airflow to connect these pipelines together. So we used the Google managed one, which is Composer. And then finally we are storing these reference datasets in BigQuery to ensure that we can respond to regulatory demands. So for instance we’d always get deletions or subject access requests, and we want to be able to respond to those as fast as possible. So that pretty much wraps up the technical bits of LLAMA. I’ll let Joe walk you through a typical use case. Thank you.

Joe Hsy: Thanks Ope. So this is an example of a UI that is built using the Lookalike Model API. So essentially the user is going to create a new campaign and he would basically choose the seed audience and make a request. And so basically the Request Lookalike Model UI comes up, and you choose the reference dataset. So in this case it’s the demo reference, but you can choose from the ones that LiveRamp supplies. Or you can bring in your own custom dataset. So this is what I was referring to when I talked about reach versus similarity or accuracy. There’s a slider bar at the top and you can basically see how many users you get based on what you want. And then there is a precision level, right? So the higher the precision you want, the fewer people you’re going to be able to find from your reference set that look like your seed. And so depending on what you need, you move your slider around. Okay?
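The reach versus precision trade-off behind the slider can be shown with a small sketch. It assumes every person in the reference dataset has already been placed in a similarity bucket, with bucket 0 the most similar to the seed, and that a slider position selects everyone at or below that bucket. The function name and the data are hypothetical.

```python
from bisect import bisect_right

def audience_size_by_slider(buckets, positions):
    """For each slider position, count people whose bucket is <= position."""
    ordered = sorted(buckets)
    return {p: bisect_right(ordered, p) for p in positions}

# One bucket number per person in a tiny, made-up reference dataset.
buckets = [0, 0, 1, 3, 3, 7, 12, 40, 40, 95]
sizes = audience_size_by_slider(buckets, [0, 5, 50, 100])
# sizes maps slider position -> audience size: {0: 2, 5: 5, 50: 9, 100: 10}
```

Moving the slider toward higher bucket numbers only ever adds people, so reach grows while average similarity to the seed falls.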

Joe Hsy: Finally, you want to activate Lookalike Audiences. And so once you create the lookalike audience, you can, it’s just like any other segment, right? It’s… you can save it for later references, later activity or you can just activate it immediately for certain campaigns and it’s usable for mobile or offline use cases. Any questions? Okay.
Joe Hsy: I’ll talk a little bit about future directions. So Andrew, earlier this morning, talked about platformization of LiveRamp and this is an example of where we want to actually offer the API externally to our customers. Right now this API is used internally to integrate with our own workflows, but we want to enable customers or partners actually to use the Lookalike APIs as part of their workflow. In fact, we’re working in healthcare right now. We have a number of healthcare partners who are interested in lookalike modeling; one example is social determinants of health. So they can bring in an audience of people who have a certain chronic disease, for example, obesity or diabetes. And being able to take, “Okay, so who else could potentially look like that, that we want to actually target and figure out potentially whether there is a campaign. Or people that we want to actually offer services to, who have that risk for diabetes.”
Joe Hsy: So that’s an example of additional external customers and partners interested in using our API. And we are looking for beta partners as well, beta customers. So if you’re interested, come talk to me. Additionally, future directions, we want to enhance our audience expansion flexibility. Right now the lookalike model is really binary. Audience members are ranked on their similarity to the seed. So… but we want to actually expand that to allow multiple expansion capabilities. So, for example, if you want to segment your audiences into low loyalty, medium loyalty or high loyalty, you also want to add those dimensions to your expansion criteria. In that case, you would, for example, see a UI with multiple sliders and say, “Okay, I want to actually expand on my audience in multiple ways.”
Joe Hsy: So. All right. So now we’re open to questions.
Audience Question: You said, offline examples, specifically television. How would you apply this lookalike modeling to… broadcast only?
Joe Hsy: Yeah, so for broadcast television it’s much more difficult, right? But the question is how do we apply it to broadcast television from in terms of choosing where we want to actually apply this lookalike, right?
Joe Hsy: And so in TV we actually did it. Broadcast TV, you broadcast the same advertising or content to everyone. Whereas, what you’re probably referring to is addressable TV. And LiveRamp has the ability to actually provide content or advertising to specific users that we think are watching on a particular connected TV device. For example Roku or Apple TV, we have integrations so that we can tie segment information to specific television over-the-top boxes. And so for that case, we can actually use our IDL, RampID, to say, “Okay, which television boxes, right? Apple TV, Roku boxes, et cetera, have audiences watching on them that fall within the segment? That match the IDL?” And so we can actually activate against a connected TV. Does that help answer?
Audience Question: How about non-addressable, mass broadcast?
Joe Hsy: Yeah. Okay. So that’s the challenge of broadcast TV is you don’t get to choose your audience, right? You don’t know who’s watching. And so yeah, I don’t think we have an answer. Maybe we have TV LiveRamp people who can help answer that.
Audience Question: How are data partners compensated when using our data partners? And then how are buyers charged?
Joe Hsy: Yeah, that’s a great question. So we from the reference data, first of all, if the customer brings their own dataset, then basically they are responsible for getting the license for that. For reference dataset, we have multiple sources and we have a formula for figuring out based on the lookalike audience and then the reference set, which one contributed the most to that. And then we basically prorate the follow through of the, yeah.
Audience Question: How are buyers charged?
Joe Hsy: Buyers are charged for segments essentially. Yeah. Yeah, exactly. It’s the same as any other segment information. That’s a good question actually. We struggle with how to find a fair way to compensate the reference dataset.
Audience Question: I want to ask about finding a lookalike audience…if you have more of a binary custom buyer, how do you determine, so I guess your predicted in cutting on the lines of a lookalike audience and a normal audience? So how do you determine, when the professionals are determining whether to use one or another?
Joe Hsy: That’s Ope.
Ope Banwo: So basically how do we come up with the percentiles at which we cut people when we rank them. So basically when we get these models from the data scientists, we would usually take a sample from the reference dataset and then based on that sample we come up with our percentile cutoffs. So when we actually run the model on the entire reference dataset, you will get a score that just kind of gives an indication of how similar you are to our client’s customers. So based on that score, we kind of put you in a bucket. So we group you between zero and a hundred, so those people in bucket zero are very, very similar to your customers and so on.
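The approach described above (derive cutoffs from a sample, then bucket everyone by score) might look like the following sketch. The details are assumptions: it uses 100 buckets numbered 0 to 99, with bucket 0 holding the highest scores, and takes each cutoff as the score at an evenly spaced rank in the sorted sample.

```python
def percentile_cutoffs(sample_scores, n_buckets=100):
    """Score thresholds from a sample, in descending order."""
    ordered = sorted(sample_scores, reverse=True)
    cutoffs = []
    for b in range(1, n_buckets):
        idx = min(len(ordered) - 1, b * len(ordered) // n_buckets)
        cutoffs.append(ordered[idx])
    return cutoffs

def bucket_for(score, cutoffs):
    """Return the first bucket whose cutoff the score exceeds.
    Bucket 0 holds the most similar people."""
    for bucket, cutoff in enumerate(cutoffs):
        if score > cutoff:
            return bucket
    return len(cutoffs)

# A made-up sample of model scores between 0.000 and 0.999.
sample_scores = [i / 1000 for i in range(1000)]
cutoffs = percentile_cutoffs(sample_scores)

top = bucket_for(0.995, cutoffs)     # among the highest scores
middle = bucket_for(0.5, cutoffs)
bottom = bucket_for(0.0, cutoffs)    # lowest score
```

Computing cutoffs on a sample avoids sorting the full reference dataset; scoring the full dataset then only needs a lookup against the stored thresholds.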
Audience Question: You can give some examples of what are some of the features?
Ope Banwo: Am I allowed? Are we allowed that? I’m not sure how much our data providers will allow.
Ope Banwo: So you’d find features like what income, how many children, age, things like that. And we try to give a score of how much of these features contributed to the modeling. So it kind of gives us an idea on pricing as well.
Joe Hsy: Yeah. I’ll add a little bit to that. So we use ridge regression because essentially we found that there’s a lot of multicollinearity in our dataset. Meaning that a lot of the segments actually are very, very correlated. And so we want to make sure that the signals are applied correctly. We’re not just really biased towards a set of signals that helps us.
Audience Question: Do you use visitation to publish your partners? Like ACS Installations as a feature? So user visited X website. Or are all your features on fine data?
Joe Hsy: So, the features that, for the dataset are the segment information. So anything that can be put into a feature – a segment – can be treated as a feature. Is that what you’re asking or?
Audience Question: No.
Joe Hsy: Okay. Sorry.
Audience Question: So LiveRamp has a huge match pixel network across the internet. It’s on many thousands of websites. Is user visitation to one of those websites a feature?
Joe Hsy: Okay, got it. Got it. Yeah. Yeah. Not right now. So the reason I mentioned that, you can use data science to convert that into segmentation. And once you convert it into segmentation, the current lookalike can handle that. There are ways potentially, and that could be a feature request or additional enhancement, that in addition to the segment information for the audiences we also have behavior or click transactional information. But I think we can get similar results by breaking it into two different parts right? Basically taking that behavior type information, applying data science, creating the segment information and then feed it into Lookalike.
Ope Banwo: Yeah, that pretty much.
Joe Hsy: But that’s a great question actually. It is something that we have thought about. Because we want to use all the information we have about our customers to figure out how to find lookalikes.
Audience Question: Do you guys retrain and how do you deal with state changes? So, for example, a user, you have a model that is purchasers versus non-purchasers, but if someone purchases during the middle of the campaign, or you have users changing their state.
Ope Banwo: So normally we keep whatever model we’ve trained in a particular run for 60 days. So basically what we try to do is we try to persist the percentile cutoffs that we calculated the first time you ran it on a particular seed audience, and then the next time you come, we kind of use those percentiles if you want to, in case the dataset has changed. The dataset can change in a number of ways. In fact, one of the things we find now is that regulatory requirements change the state of the data we use a lot. Yeah.
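Persisting cutoffs per seed audience with a retention window can be sketched as a small keyed store. The 60-day figure comes from the talk; everything else (the class, the in-memory dictionary, the seed audience ID) is hypothetical, and a real system would use durable storage.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=60)  # retention window mentioned in the talk

class CutoffStore:
    """Keeps percentile cutoffs per seed audience until they expire."""

    def __init__(self):
        self._entries = {}

    def save(self, seed_audience_id, cutoffs, now):
        self._entries[seed_audience_id] = (now, list(cutoffs))

    def load(self, seed_audience_id, now):
        """Return stored cutoffs, or None if absent or older than RETENTION."""
        entry = self._entries.get(seed_audience_id)
        if entry is None:
            return None
        saved_at, cutoffs = entry
        if now - saved_at > RETENTION:
            del self._entries[seed_audience_id]
            return None
        return cutoffs

store = CutoffStore()
first_run = datetime(2020, 1, 1)
store.save("vacuum-buyers", [0.9, 0.7, 0.4], now=first_run)

reused = store.load("vacuum-buyers", now=first_run + timedelta(days=30))
expired = store.load("vacuum-buyers", now=first_run + timedelta(days=61))
```

Reusing the original cutoffs keeps bucket boundaries stable between runs on the same seed, even if the reference dataset has changed in the meantime.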
Joe Hsy: So yeah. No, just yeah, that’s also potentially, so right now, lookalike modeling is not near real time. Right? So there’s a possibility that if we can do the scoring much more frequently, you can then react to changes in your population. That’s a potential enhancement as well.
Ope Banwo: A question.
Randall Grilli: Oh, unfortunately, we’re out of time for questions, but so we are doing a happy hour from 4:30 to 6:30. It’s for developer speakers, and attendees. I’m seeing a lot of new people so I’m just making sure people are aware of that. It’s a great opportunity to talk to these guys or anyone else that you see here.
Joe Hsy: And we’ll be around for a little bit.
Randall Grilli: So yeah-
Joe Hsy: So you can come up.
Randall Grilli: So we’ll be starting up again at 11 with GDPR and CCPA.


Interested in more content from RampUp?

Clicking on the links below (to be posted and updated on an ongoing basis) will take you to the individual posts for each of the sessions where you can watch videos, read the full transcript of each session, as well as download the slides presented.

RampUp for Developers’ inaugural run was a great success and was well attended by a variety of attendees. Many interactions and open discussions were spurred from the conference tracks and discussions, and we are looking forward to making a greater impact with engineers and developers at future events, including during our RampUp on the Road series (which takes place throughout the year virtually and at a variety of locations), as well as during next year’s RampUp 2021 in San Francisco. If you are interested in more information or would like to get involved as a sponsor or speaker at a future event, please reach out to Randall Grilli, Tech Evangelist at LiveRamp, by email: [email protected].