[We are] very thoughtful about the tools we build, because they can encode bias
Amy and her team are using the power of AI and data to fill in missing Wikipedia pages with the Quicksilver Project.
Ben: Welcome to the Masters of Data Podcast, the podcast that brings you human data, and I'm your host, Ben Newton. One of the most interesting areas of innovation today in machine learning and artificial intelligence is natural language processing, or NLP. It's based [inaudible 00:00:22] about teaching machines to both understand human language and then produce convincing human language in return. Our guest on this episode, and her company, build what they call machines that can read and write. Amy Heineike is the VP of Product Engineering at Primer, and I talked to her about how she and her team are using applied machine learning techniques to build software that tells stories from data. In particular, we talked about the Quicksilver project, which fills some gaps in Wikipedia by producing new content for people who didn't have pages but probably should have. So without any further ado, let's dig in.
Welcome everybody to the Masters of Data Podcast, and I'm really excited to be here with Amy Heineike today, in her downtown San Francisco office, a very cool office. Thank you for letting me come here and talk to you.
Amy: Thank you so much, Ben.
Ben: We were talking about this a little bit before, but I always love to do this on these podcasts. I just want to find out more about you, what led you to where you are today at Primer. You head up product engineering here, in the data science realm, but what led you in that direction? What's your story?
Amy: Yeah. So I've been very driven by curiosity and had a little bit of a strange, winding path to get here. I started out actually as a mathematician and got super interested in complexity theory and ideas for how you could tell stories about what's going on in the world, thinking about different mathematical patterns. I spent a while doing transportation economics, modeling cities, and then wondered where you could keep getting more data from. When I moved to the US, I got the opportunity to work with a startup company, and then got into NLP and modeling data to try and figure out the stories from it. So, yeah. It's been an interesting winding path. Then, for Primer, I have been with the company since it was founded, and the CEO and I had a bunch of long conversations about what we thought was possible with NLP now and how we could [crosstalk 00:02:27]
Ben: This is where ... Not everybody listening might know what NLP is.
Amy: NLP ... Yeah. Sorry. So Natural Language Processing: the idea of having algorithms that can extract information from text, free text that people write. What we were seeing is that there was a lot that was possible with those algorithms, lots that we could actually learn from text, and we wanted to build tools that would let people understand vastly more content quickly. I came in ... I've been at Primer since the beginning and have been on this great ride of building out the technology and then building out the team. We're now 70 people, series B [inaudible 00:03:05]. It's like action stations.
Ben: Yeah, wild ride.
Amy: Yeah. Wild ride.
Ben: I've gotten the chance to interview a bunch of different data scientists, and it's ... well, it's a broad term, but it's really interesting to see the different backgrounds that people come from. Because I've noticed a few people come from mathematics, and some ... surprisingly [inaudible 00:03:24] come from physics, and some people from just random places. You just enjoyed math and basically started liking to apply it? Is that how it happened, in your mind?
Amy: Yeah, that's right. So, when I first started working in Silicon Valley, at my first startup, it was weird to have a mathematician on the team, and there was kind of a, "Why are you here?" That was 2008, and I think it was around that time that DJ Patil [inaudible 00:03:57] coined the data science phrase, and so it became something that was a bit easier to talk about. There are a bunch of different paths people have taken in. Certainly, on our team, we've got people from computational science backgrounds: computational chemistry, biology, astrophysics. And we've got people who actually came from humanities backgrounds and then switched into computer science.
But now there are actually tracks, right? There are clear tracks for ML and AI data science programs. So we've got people who come through those. But I think that's some of what's really fun about it: there's a ton of hard problems to solve that aren't necessarily clearly from one domain. You need people who are super curious, who have a bag of tricks that they can throw at it, and who aren't afraid to learn new things.
Ben: That makes a lot of sense. I mean, one of the themes on the podcast has been connecting the humanities to science, and part of the way I interpret a lot of what I've heard is that you have to bring a diverse set of people to a problem, because if you don't, then your solution is going to be skewed. It's going to be biased in one way versus another. That makes a lot of sense.
Amy: Yeah, that's right. I think what we've also been very aware of is that there are just a lot of different kinds of problems you have to solve to build data-driven products. There's this obvious piece in the middle, which is, "I'm going to make an ML algorithm," say: if I feed in the data, how can I predict the labels? But actually there's a whole universe of stuff that goes around that. For us, we want to figure out ways to condense information and present something that might look like the work of a first-year analyst. Normally, if you hear that, you think, "Okay ... Is that possible? What does that even mean?"
But there are ways of breaking that down into little steps, and then every one of those steps may be a problem that you can figure out how to solve. First, start out by figuring out, in a big corpus of data, who are the people being talked about, say. That becomes a more tractable problem. Figuring out how to frame it, that's one thing, and then there's figuring out how you present that data ... If I showed you information from text, what other things would you need to see to believe it? Do you need to see the original text? Do you need to have a confidence score? Or do you need to see other information that makes the case? There are some really interesting user experience questions of, "Yeah, if I'm building a data-driven product, how is a person going to interact with that, what do they need to see, and how are they going to work with it?"
Then there are all of these deployment questions and system-building questions. When you look across it, you're like, "There's actually this really wide range of skillsets." So it's useful when you have people who bridge unexpected combinations of them, like a bit of design and a bit of algorithms, or some DevOps and ... I don't know, infrastructure engineering. That makes it easier to figure out some interesting solutions.
Ben: So it's basically the maturing of data science, I guess. Moving from these things that are done on a university campus with a small team to actually-
Amy: That's right.
Ben: ... turning into a real thing in the real world. You have to ... Yeah, that makes a lot of sense. One of the things ... Amy, when you and I got introduced and I was finding out about some of the really cool stuff that you work on, one of the projects that came up was this Quicksilver project, one of the things you've done at Primer. So let's talk a little bit about that. What is Quicksilver?
Amy: We had this idea that it would be cool to generate Wikipedia pages automatically. What's the input data for that? Well, we started thinking about scientists, to start with. There are a lot of scientists out there. There are a lot who are doing really interesting, impactful science, and of those, only a portion have Wikipedia pages. What we were wondering was, could we make Wikipedia pages for the ones who didn't have them?
Ben: How did you find out that they didn't have them in the first place?
Amy: How did you ... Yeah. Actually, I mean, Wikipedia is really interesting, right? It's this huge resource that's really wonderful when you interact with it. There's loads we can learn from it. We all go and look stuff up all the time, right? Maybe more than we admit. But there are actually big recall problems. There are big holes in it. There's actually a lot of missing content. One example that's pretty stark about this: there's a woman called Donna Strickland, who won the Nobel Prize in physics. A woman won it for the first time in 55 years. The morning she got the Nobel Prize call, she didn't have a Wikipedia page.
Ben: Seriously? Wow.
Amy: Completely missing.
Amy: There are some interesting stories there about what they've been finding out about why that was the case, and how they're trying to address it, but in general, there are actually a lot more people missing than makes sense. There's been some interesting research on the biases behind that too. It turns out, in particular, that women and other underrepresented minorities may be under-represented compared to other groups. There are a lot of people who think very carefully about this. But that got us thinking. For scientists, what we were able to do is start out by getting this enormous list of scientists from our friends over at the Allen Institute for AI, the people that make Semantic Scholar.
They had a list of 200,000 scientists, where they'd collected publication information and citation records about them, so we could start with that big list, find people with a decent publication record, and join those to news data. We have an archive now of about half a billion news articles in-
Amy: ... English from the last three and a half years. So we had to build some pretty smart disambiguation algorithms to make sure we get the right person. That's not always obvious. My favorite example of a disambiguation problem is the Michael Jordan of AI, who is actually a computer science professor called Michael Jordan. He's over at UC Berkeley. He wrote a really nice piece about the future of AI, a blog post, this summer. If you go and try and find it, if you just Google Michael Jordan-
Ben: Michael Jordan.
Amy: ... you might be out of luck. So you've got to add other stuff, right? The fact that he's at Berkeley, that may do it. The fact that he's interested in ML and AI, that might do it. But you need to have some kind of embedding of other terminology that's about him, and then you can make sure you get the right papers, or the right documents. So the first thing we did is build these models so we could join scientists to the news data. Then we had a really cool list, right? Oh, sorry, and we also joined that to Wikipedia, so we could say, "Do they have Wikipedia content about them or not?" So then we had a really interesting list. You can scan down it and see news coverage, citations, whether they have a Wikipedia page or not, and we can find people who are missing. Yeah, as I said, there are some people who are super interesting who don't have those pages.
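For readers curious what this disambiguation step might look like, here is a minimal sketch. It is not Primer's actual system, and all of the names, profile terms, and example text are made up for illustration. The idea is the one Amy describes: represent the words around a mention, and the terminology already known about each candidate (institution, field), as vectors, then pick the candidate whose profile is most similar to the mention's context.

```python
import math
from collections import Counter

def bow(text):
    """Lowercase bag-of-words count vector for a piece of text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy entity profiles: terms we already know about each candidate.
PROFILES = {
    "Michael Jordan (computer scientist)":
        bow("berkeley professor machine learning statistics ai research"),
    "Michael Jordan (basketball player)":
        bow("nba chicago bulls basketball championship dunk"),
}

def disambiguate(mention_context):
    """Pick the candidate whose profile best matches the mention's context."""
    ctx = bow(mention_context)
    return max(PROFILES, key=lambda name: cosine(ctx, PROFILES[name]))

article = "Jordan, a Berkeley professor, spoke about machine learning research"
print(disambiguate(article))  # -> Michael Jordan (computer scientist)
```

A production system would use learned embeddings rather than raw word counts, but the shape of the problem, context in, best-matching entity out, is the same.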
Ben: I remember when I was looking at the data, I mean, it is pretty amazing that there are such significant holes in it. Part of what I understood is that you guys basically were able to use what you built to start filling some of those holes. Is that right? So basically, you were-
Amy: Yeah. Well, we had a training set from this data, right?
Ben: Yeah, yeah.
Amy: So, for the people who have Wikipedia pages and have news, we could look at what content represented in the news makes it into Wikipedia pages. That tends to be biographical information. So, where did they come from? Who are they? What are their interests? What are they researching, for scientists? What are the major themes of their research? What have they been discovering? Yeah. We could build a model. We have a really fun model that can take the sentences from news articles and figure out whether they look like sentences from Wikipedia pages. We also have models that can structure content out of those sentences: figure out fields of research, institutions, awards they've won, and that kind of stuff. So we were able to build a model of that mapping: from news, what would we extract to put into a Wikipedia page?
Ben: I mean, how much of that was already in existence? Because, I guess when I listen to it, it might sound like it's easy to compare content, but I'm guessing that it's pretty hard to say that this is talking about the same thing as this-
Amy: As this.
Ben: This is text [crosstalk 00:12:05]
Amy: Yeah, and it's interesting, right? Because it's not enough to just look at the words and be like, "Oh, these words look like words that are in the page." It's also the way that the sentences are structured, the grammatical structures that are in the text. There's been a lot of really interesting recent progress in NLP, the natural language processing field. I think when we started out, we were able to play with LSTM models, sequence-to-sequence models. So you feed the sequence of words into the model, and it can build up a representation and then inform this classification: does this look like that kind of content? It can learn from a lot of different grammatical features.
Yeah. We're definitely doing this at a very interesting time in this field too. There are major papers, major new models coming out very, very rapidly, and that gives us some great foundations to work from. So, really interesting inputs to play around with. Yeah, it's definitely not an easy thing to do, but it's increasingly possible. What we found, when we applied those models, is we actually got surprisingly good summaries. There's a blog post on our website where we outline this methodology in more detail, and we've actually put up data for this training set. So, examples of the sentences from news that we found went into the pages; you can play around with that. Then we've also got a page up where we show a bunch of these generated profiles.
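As a rough illustration of the classification task described here, deciding whether a news sentence "looks like" a Wikipedia sentence, here is a toy Naive Bayes version. It is deliberately much simpler than the LSTM models mentioned above (a bag of words rather than a learned sequence representation), and the training sentences are invented, but it shows the shape of the task: labeled sentences in, a "wiki-like or not" prediction out.

```python
import math
from collections import Counter

class NaiveBayesSentenceClassifier:
    """Tiny word-level Naive Bayes: does a sentence look 'wiki-like'?"""

    def __init__(self):
        self.word_counts = {"wiki": Counter(), "other": Counter()}
        self.label_counts = Counter()

    def train(self, sentence, label):
        self.label_counts[label] += 1
        self.word_counts[label].update(sentence.lower().split())

    def score(self, sentence, label):
        # Log prior plus log likelihood, with add-one smoothing.
        counts = self.word_counts[label]
        total = sum(counts.values())
        vocab = len(set(self.word_counts["wiki"]) | set(self.word_counts["other"]))
        logp = math.log(self.label_counts[label] / sum(self.label_counts.values()))
        for w in sentence.lower().split():
            logp += math.log((counts[w] + 1) / (total + vocab))
        return logp

    def predict(self, sentence):
        return max(("wiki", "other"), key=lambda lab: self.score(sentence, lab))

clf = NaiveBayesSentenceClassifier()
# Made-up training sentences: biographical style vs. news chatter.
clf.train("she received her phd in physics from the university of toronto", "wiki")
clf.train("he is a professor of chemistry at stanford university", "wiki")
clf.train("shares of the company fell sharply in early trading", "other")
clf.train("click here to subscribe to our daily newsletter", "other")

print(clf.predict("she is a professor of physics at the university of waterloo"))
```

An LSTM replaces the word counts with a learned representation of the whole word sequence, which is what lets it pick up on grammatical structure as well as vocabulary, but the training setup, sentences paired with labels, is the same.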
But yeah, as I said, it's surprisingly good, right? It's surprisingly good. What I should say that I think is very interesting about this project is that our goal was not to just post these straight to Wikipedia-
Amy: ... and we haven't done that. We haven't posted them to Wikipedia. Instead, what we've done is look for people who are working to add a lot of pages to Wikipedia. There's a researcher, for example, at Imperial College London, Jessica Wade, who, in the last year, added 270 Wikipedia profiles for female physicists and scientists.
Amy: Basically, she got into her research program and said, "Oh, there's not many women around, and the ones that are around don't have very much content about them." So she just went on a mission to, whenever she found someone cool, write the page for them and post it. Then there are other groups who organize edit-a-thons. They get a bunch of people together and say, "This is a list of people that we wish had pages, or people that have pages that are out of date; let's go edit them." People like that, what's great is they can use tools like this so that it's a much, much faster job. The machine does all of this heavy lifting of searching for all this content, pulling out the things that you probably want to see.
The machine can go through all of this hard work of scanning through lots of content, finding the information that you probably want put in a page, and bringing it all together. It also actually gets all the references together, the links back to your original content, puts that in the Wikipedia format, and assembles it all. So, if you're going in to make the page, what you're doing is looking and saying, "Does this make sense? Is this cohesive? Is this story well told, or are there major things missing that we should include? Is this the right neutral tone of voice? Is it worded the way that Wikipedia wants it to be worded?" And you can do that final pass before you go and post it. I think ... One of the things that we've thought a lot about is that kind of interface between people and the algorithms that you build for them, right? You don't just want to build an algorithm. Well, I don't know. There are definitely some use cases where it's really great to just build an algorithm to automate [crosstalk 00:15:47]
Ben: I had imagined, when I first started reading about what you guys have done, that there would be algorithms with differing opinions that would argue on Wikipedia. You know, automate some of the drama-
Amy: The comment wars, right?
Ben: Yeah, yes.
Amy: Yeah, yeah.
Ben: [inaudible 00:16:00] that's it. Correct.
Ben: But basically, it does seem to be a bit of a common theme I'm seeing in talking to people. I would have expected more of this automated "AI" where we're just like, "We'll just go do it and figure it out." There is quite a bit of that, but it does seem, like what you guys are doing, there's a recognition that it's really good to insert some sort of ... You're actually enhancing the human, not just ... Because basically now, what you're doing is allowing these people to fill these really important holes in Wikipedia much faster, but you're not replacing them. You're just making them much more effective.
Amy: Yeah, that's right. I think it makes sense. If we're going into intelligence, we're going into systems where ... these are important and thoughtful tasks, right? I think for Wikipedia, they have whole systems of process. An individual makes a page, and there are editors it's forwarded to. There are comment processes. There's a whole world that they've built around those pages. It's not just that you write a page and it's there. There's actually a whole process, and those processes are what make the quality of the content what it is. But that's true, actually, of all human intelligence systems. We often work with analysts in different industries. There are people who are making very important and high-stakes decisions, for example, in the financial industry.
We work with people who are making long-term bets about the futures of companies. Who do we think is going to succeed, or where are industries going? What are the big trends? Does it make sense to try and build a system that tries to simplify this and just give you an answer?
Amy: If you can be correct all the time, fine. But if you're not quite correct, then what? Right? What you need is to be a useful input into the systems that people have built: to empower analysts, or empower people who are making decisions, to have the tools to be way smarter or maybe much faster in ingesting the information, so they can then go do a much better job than they would have done without it. Actually, I think that goes back to the question of the diversity of teams, right? That's why some of this is complicated. If we were just automating a system, it would be much more straightforward to know what to do, because you could say, "Here's the input and the output. Let's match these up as well as we can." But instead we're saying, "Here's a person doing a job that's really hard. How do we superpower that person? What is it that they need?" That's less obvious. "What is the thing that most speeds them up?"
For those Wikipedia editors who would work with Quicksilver, what is the most useful thing that you can have in front of you to help you write those pages that you want? What do you need to see to make that work well?
Ben: One thing that comes to mind there as well ... It's come up a couple of times: what the human does in the process. I'll run this past you to see if it resonates with you. It's probably not just having a human to have a human, but also because there's a sense of basically caring, of ... essentially giving a damn about what's going to happen, particularly in this case. I mean, probably the reason why there are holes here is because there weren't enough people caring about it, and now that you have people that are really passionate about this, that's partly what the human provides in the process: they actually care about the outcome, and that's what produces a better outcome. I mean, does that resonate with you, based on what you were saying?
Amy: Yeah, that's right. Having people who care in the loop is really important. I think one of the interesting challenges with machine learning and AI, right, is that if you have systems that are trained on previous data, they predict what was seen in the past.
Amy: That's kind of an inherent part of the system, right? Models don't necessarily generalize beyond the inputs-
Ben: What they haven't seen.
Amy: ... that they had.
Amy: What they had. But we live in a world that's changing very rapidly. New things are happening that don't fit the models of the past, and some of the things that happened in the past are not things we want to keep happening in the future. Maybe ... Amazon did this interesting project where they tried to build systems that would figure out which resumes they should give offers to.
Ben: Oh, yeah. Yeah.
Amy: Given the resumes-
Ben: I haven't seen that.
Amy: ... they'd figure out, yeah, whether to give an offer. Then they found out, when they examined the model that they built, that it basically screened women out. If you were on a women's rugby team, that would count against you. The model may have done a very good job of saying, "Okay, here's actually the relationship between the resume and who we gave an offer to." But that isn't what Amazon wants to do in the future. Good on them for realizing this and then pulling the plug on the project. It's difficult, right? I think if you build a machine and just fully automate, then what you're going to get is something that does what it's seen in the past. I think people are able to be very thoughtful about intent. They're able to be reflective about, "How should I interpret this? Is this what we want? Do we believe that these trends will continue?"
Having people in the loop, there are actually a lot of reasons why that might be useful. If you can be building algorithms that are telling you about the data, or kind of flagging, "These are the patterns we're seeing here," they don't have to be self-fulfilling prophecies, right? We can be thoughtful about building tools that are empowering and informative. But this is a huge question for our field, right?
Amy: When do you want the history to be embedded in the future?
Ben: Well, there's so many questions about the bias, and then we had Cathy O'Neil on earlier.
Amy: Oh, great.
Ben: Just ... I remember reading her book. It kind of blew my mind. But part of that is, I think society seems to be waking up to the idea that some of this stuff, "AI running loose," doesn't actually produce the outcomes a lot of people would expect. I think having people that are really being thoughtful about it, like what you're talking about now, is super important, because otherwise, there's going to be ... I mean, essentially, there's going to be a backlash to it if we don't-
Amy: Yeah. But I think also, we're already in a situation where there's so much content in the world that we can't interact with all of it. We have ways of filtering it down. Some of them come from other algorithms, right? We read what's on our Twitter feed, or we read what's shared with us on Facebook, or we have some idea of how we narrow down and select stuff. You see this with ... We work with analyst teams, who basically tend to have huge inboxes, where there's way more than they can read-
Ben: Correct, right.
Amy: ... and they'll do things like skim reading. We're already in a situation where it's hard to grapple with the amount of information that we have, and we do have to have some kind of algorithm to help us figure out how to navigate it, even if the algorithm is just us randomly picking things. Yeah. There are some hard problems here, where it's not necessarily just that if we have AI, we introduce bias, and if we don't have it, we're fine.
Ben: Yeah. You're right. Yeah. We already got bias.
Amy: We're already in trouble. We're already in a biased world. We're already-
Ben: We have thousands of years of practice.
Amy: Yeah. We've ... Even when there wasn't a lot of data, we were pretty biased, right? Yeah. I mean, we have to be very thoughtful about the tools we build, because they can encode that bias. I think some of the promise of why people have gone into some of these algorithms ... I mean, you look at the sentencing algorithms for parole. The promise is that maybe they reduce the bias, but the trouble is that it puts it into a black box sometimes, so-
Ben: Yeah. You don't know why it's making this decision.
Amy: You don't know why it's making the decision, and if it wasn't very, very carefully designed, maybe it is making these self-fulfilling prophecies, right?
Ben: Even if it was-
Amy: [crosstalk 00:24:04]
Ben: ... carefully designed, sometimes it's hard to predict.
Amy: It's hard to predict. Yeah. So, there's a really tough question here: if we were already biased, we want to be better than that. Maybe these tools are going to help us, but what do they need to be to actually make them helpful? I think the good thing is ... there is a growing awareness of this in our field, and even in the press and the public conversation, there's an awareness of how important this is. Hopefully that means that as more and more people work on it and care about it, we'll get some better answers about what these tools should look like.
Ben: Yeah. I mean, definitely with what you guys are doing, I think that's a really positive sign of a thoughtful approach to this. I mean, putting a bow on all of this: you've done some really great work here with natural language processing, and with what you guys do at Primer in general. What's next?
Amy: Yeah. I guess it's hard to choose which one to talk about. I think one area that's more on the research side ... One thing that's very exciting, with the recent models that have come out and the possibilities there, is we've been looking more and more at abstractive text generation. This is using models that can actually generate sentences and phrases and do summarization. Historically, in this area, there are some super fun AI-generated novels and poetry out there that are super wacky and slightly bonkers, and then there are some research papers where you can see, "Hey, there's a little bit of promise here, but it's pretty noisy."
Ben: Yeah, yeah.
Amy: I think we've been trying to play around with: are there places where we could actually start to commercialize pieces of this tech? I think actually we're getting to a point where we're feeling a little bullish that there may be something there. Yeah, I think we're super excited about what's going on on the natural language generation side. So, going back to the conversation we were just having, there are a ton of reasons to be very, very careful and to think about how to use this well, but I think there's a lot of opportunity too.
Then, that's on the research side. For me, more broadly, when I think of some of our biggest social problems at the moment ... we're in a time of very challenging politics and a lot of division, and-
Ben: To put it lightly.
Amy: Yeah, to put it lightly. I think, as concerned and curious individuals, we live in a time when it's very hard to get a broad view of what's going on, and to have a perspective on the world where I understand why my cousins on the other side of the country have such a different perspective on political issues than I do. Or even ... I'm English, although my accent hides it slightly, but in England there's an incredible division over the Brexit question.
Ben: What's that? I haven't heard of that yet.
Amy: You haven't heard of that. Yeah. It's very topical as we're talking about this. Who knows what could have happened in, like, three weeks?
Ben: Right, right.
Amy: Yeah. Yeah. Huge divisions. It's often incredibly hard to know why people could have such different views than we do.
Amy: These are very, very complicated questions. I think for us, one of the things that's been interesting is that we're getting to work with understanding large [inaudible 00:27:27] of news data, for example: being able to use algorithms to contextualize and compare and contrast different content and stories. My hope is we can figure out more tools that can at least contextualize, help you understand, how does my view compare to others'? I feel like there has to be some kind of tool, some kind of framing here, to help us as individuals cope with this onslaught of perspective and opinion and division. But, that's what I find myself musing about and wondering about the most.
Ben: AI powered bubble popper?
Amy: Bubble popper, that would be cool. I like that. I like that. Yeah.
Ben: All right. I'll let you have that. But Amy, thank you so much for taking this time. I'm super excited about what you guys are doing here. I think it's super important work, and I look forward to talking to you in the future to see what you guys do.
Amy: Thank you so much, Ben. It's been a pleasure.
Ben: Thanks, everybody, for listening. Rate us on your favorite podcast app so other people can find us, and we'll talk to you next time.
Speaker 3: Masters of Data is brought to you by Sumo Logic. Sumo Logic is a cloud native machine data analytics platform, delivering real time continuous intelligence as a service to build, run, and secure modern applications. Sumo Logic empowers the people who power modern business. For more information, go to sumologic.com. For more on Masters of Data, go to mastersofdata.com and subscribe, and spread the word by rating us on iTunes or your favorite podcast app.