Christian Beedgen: Intro to Data Bias
"The discussion that has to happen is about the ethics of this."
Welcome to the Masters of Data podcast where we talk to the people on the front lines of the data revolution about how data affects our businesses and our lives.
Christian Beedgen, the co-founder and Chief Technology Officer at Sumo Logic, has a long history in the world of data. Christian has spent a lot of time lately thinking about bias in data and data analytics, and that’s what we’re going to talk about.
Christian, how did you get to where you are today? How did you get into technology and how did you get to Sumo Logic?
Christian Beedgen’s Start in IT
My interest in technology started when I was much younger, a couple of “centuries” ago. I was maybe 10, 11, or 12 years old, and I got into computers by reading books that had programs in them, which is really funny! We had this library bus that would come out to our village, and once I went through all the comics, I ended up with these dot-matrix printouts, early computer books about Basic, and somehow it just caught my attention.
Some friends got Commodore 64s and played games on them. I always liked to read, and this technical reading stuff somehow came naturally to me. I convinced my parents to get me a computer, so I had an Atari, and I started getting into programming. I’m basically self-taught. I didn’t really learn it from anybody telling me.
What did you start out programming?
Basic. That’s what you could get. I always tried to get a Modula-2 compiler, if anybody still remembers those, but you needed more RAM in your box and that would have cost more money, so that didn’t happen.
What did you start off writing?
Graphic demos and just typing stuff, copying out of magazines and books. Then the Atari had an early version of a point-and-click graphical interface. I think it was called GEM, and there was a German guy who built a Basic version. The programming language allowed you to do menus, dialog boxes, dropdowns, and that type of stuff, and I was just endlessly fascinated with that.
When I turned 16 or so, other things obviously started becoming a little bit more interesting, and then it was a long detour. I ended up moving to Berlin after school. I’m originally from Germany.
I started studying social sciences, sociology, and politics. When I started social sciences, it became clear that I’d need to write lots of papers, so I made a deal with my parents and got a brand new Mac out of it, a Performa 630 if I remember correctly. That was awesome and I ended up writing a bunch of papers but I could not get away from that Mac.
This was the time when computer magazines would include CD-Roms or shareware. I got back into buying books on various programming, Mac-related topics, and Linux. I always thought it was a lot of fun.
At some point, I realized that I should probably move more in that direction. So in ’96, I worked construction in the summer and I took the money and bought a modem. The internet had just arrived in Germany and I had a Supra Fax 28.8k baud, so I dialed into Germany’s version of AOL. That was not interesting to me so I went straight to the internet and Netscape and never looked back.
I ended up studying a combination of digital media and computer science programs. I wasn’t really that good at the digital media part. I had a copy of Photoshop and I applied all the filters and then I walked away thinking I was gonna be an artist, I was gonna make all these psychedelic pictures. But I ended up being the guy that somehow figured out how to write programs. This was a school of applied science; in Germany, that means in the seventh semester they kick you out and you need to go into an internship.
After an incredible sequence of random coincidences, I ended up at Amazon in Seattle in late ’98. I arrived just when everything came crashing down. After we tanked our own company, I went to ArcSight. I was an early engineer there. That’s where I got introduced to this concept of log management. I was there for almost 10 years, and that’s where a lot of the initial ideas behind Sumo Logic came from: observing what works. It’s a really interesting space, doing data analytics on data that’s usually not being analyzed, because the formats are hard to wrap your head around and it’s not relational data that you can just load into a warehouse.
The idea for Sumo came together with Kumar, who was also an ArcSight guy. We started postulating that this type of tool is very interesting and should be extended not just for security but also for operational analytics use cases and, most importantly, that on-premise software was going to be a bad way of selling it. It had to be a service, and that’s what got us to start Sumo Logic in 2010.
That’s a great background. It’s actually some things I hadn’t heard.
The access to information is really interesting. It has always fascinated me, and I still think that this is one of the most formative things about the internet as well, because you can pretty much access everything.
The access to information has always fascinated me.
Making Sense Out of Data Bias
We’ve been talking for a while now, prepping for this. What’s been rolling around in your head lately about data specifically?
After almost 20 years of working with data, I think we are solving many interesting use cases here at Sumo, making systems more observable so that the people who run the applications that power their businesses become more efficient. I have a little bit of background in the humanities, so sometimes there is a discussion about this dichotomy between qualitative and quantitative approaches.
Just to be clear, when we did social sciences or sociology in particular, the most feared course was in statistics because they had this hardcore math statistics guy in that department that was full of a bunch of social scientists and guys who were into politics, who really liked reading Marx and all this type of stuff. Then there was that one math guy, he was tough and everybody hated statistics.
I was a math major and I didn’t like statistics. I get that.
His personality didn’t really help. Anyway, I’m not trying to say that if you’re into the humanities, you’re never going to use data, but this has always been in the back of my head. Some things are qualitative, some things are quantitative, and often people prefer one over the other. I can’t decide whether I’m on one side or the other, so I feel like there’s an interesting aspect to keeping a balance between these things.
One of the things that I always look out for is book recommendations. I happened across this book by Christian Madsbjerg called Sensemaking where he’s fundamentally questioning whether decisions should be based purely on data. Christian brings up concepts like instinct, intuition, and context. That totally resonated with me because it was an interesting stance that went against the prevailing wisdom around data-based decision making and the rise of big data.
Should decisions be based purely on data?
There’s been a lot of hype around how work can be improved by using more data, more sensors, and more observations. You can do all kinds of really interesting things. There might be another way to look at everything, coming much more from a qualitative side, from the ethnographic side. So I found that interesting because it’s good if you can be open to opinions.
That’s generally something that people do, they pick a side: conservative versus liberal, qualitative versus quantitative, left versus right, up versus down, black versus white. It feels like that’s actually a comfortable spot to be in because there’s a lot of other people there and the world’s fairly easy. To me, that’s never really worked. I think that’s kind of betraying your own intelligence.
The concept of context is something that I’ve always tried to embrace and have talked a lot about. We’ve talked about it all the way down to building product features: all of these data streams coming in, logs and metrics, tell you things, but often they don’t actually include the context in which stuff happens. In Sensemaking, it’s not that he’s saying data is crap or anything; he’s basically trying to say data is not necessarily truth, and you need to ask about the context the data was gathered in. He calls that the difference between the savanna and the zoo.
You can put animals in a zoo and you observe something but you have to actually observe them in the savanna to see their actual, real context. You’re gonna see something very different.
Here’s the other thing: I’m a big dog person; the company’s named after my dog. We have all these dogs in the office, which everybody seems to be really happy with. I read something about that recently. A lot of the ideas around the term “alpha dog,” domination, and canine society are derived from observing wolves in captivity. It turns out, they behave completely differently when they are behind bars than when they’re in the wild.
It’s interesting because I don’t think you need to spend more than half a second thinking about it to realize it’s very true. It actually applies to people as well. You would probably behave, or I know I would behave, very differently behind bars.
So you’re saying reality TV isn’t true?
What is true? That is the huge question here. One thing I know is that I’m actually a fairly intuitive person. There are different personality types, and you can say that all of this psychometric stuff is made up, but I’ve found that it can help you observe yourself. Knowing my personality type helps me see why I’m feeling comfortable in this space and why I’m not feeling comfortable in that other space. One of the things I learned is that I’m fairly intuitive, so I question whether data really tells you the truth.
That Sensemaking book was great. I think I connected the most with the book when he got to that point where he talked about intuition and about experts being able to act on instinct because of all the experience they’ve accumulated over time. It’s not a bad thing.
The Human Side of Data
It seems the discussion happening is: where do you use data and how do you use data?
Not discounting the intuition that people have built over time, but also realizing your own context, and that your intuition is not necessarily going to be correct. Since data is driving so much of the technology we use today, that’s an incredibly important topic.
I agree, but here’s the kicker: if I wasn’t able to trust my intuition, this company would not exist.
If I wasn’t able to trust my intuition, this company would not exist. The data would have said you’re f-ing crazy and have no chance to compete in this market.
Because the data would have probably said you’re f-ing crazy. You have no chance to compete against the people that are already in the market.
To make bold decisions, sometimes you have to trust your intuition, but it’s very tricky because you can’t get too full of yourself. Then I realized that maybe there is a self-serving aspect to the argument in the book, which basically says: I can make better decisions than the people who bring data. Then I got really worried, because I like to serve myself as much as anybody!
The internet is just a crazy thing. It’s like when you hear about a new type of car, and then suddenly you’re driving down the freeway and you see that car everywhere. I find myself using the internet in my daily habit of ingesting information in a very similar way: once I’m aware of a topic, related things start popping up. It’s all serendipitous.
It’s making those connections.
I came across this other book and I think a lot of people have used it to start a discussion. The book has a fantastic title, it’s called Weapons of Math Destruction.
It’s written by a math Ph.D., and one of the underlying points is that the idea that data in itself speaks to any kind of actual truth is essentially complete horse crap. Data gets collected and analyzed by people and those people bring their intuition and biases to the task. This is when I got into the loop about the self-serving aspect of trusting your intuition. You have to somehow find a balance because biases are a real thing.
If you look it up on Google, it basically says bias is prejudice in favor of or against one thing, person, or group. Prejudice is a pretty harsh word.
Bias is prejudice in favor of or against one thing, person, or group.
Speaking of bias, this is just too funny not to share. So, if you look at the Wikipedia article on bias, it has the neutrality flag turned on. So basically, right under the headline, it has this box that’s flagged for neutrality. This is the article on bias and there’s this big box that says the neutrality of this article is disputed.
You can’t make that stuff up.
No, you really can’t. When you start wrapping your head around things like bias, it feels like you’re almost powerless. For example, you want to buy a car and the used car sales guy gives you an initial price that’s too high. He goes down two grand and you agree to buy it, but if you actually were to do your research, you’re still paying two grand too much, because the original anchor price was set too high.
Anchoring is the first thing that you hear about a particular topic and how that usually forms your opinion. It’s very hard to get past that initial opinion. I think that’s probably why these political attack ads work so well. Because if you have any sort of topic, ballot, or a particular person, they try to basically anchor you on a particular opinion. Generally, it’s almost impossible to train yourself out of anchoring.
They’re not changing opinions, they’re picking something that’s probably already there and trying to pull it out to a certain extent, right?
They’re setting the opinion. Anchoring is a cognitive bias, and I think people generally accept that it’s real: you rely on the first bit of information you get to make a decision.
So it’s the first thing that you hear and you build off of that?
Yeah. Say I’m a conservative guy and I come in and I say something really nasty about this liberal person or about liberal policy ideas without really knowing them.
Then everything works off of that.
Yeah, it’s actually quite interesting. So there are lots of biases and confirmation bias is another one.
Well, tell me a bit more about that. So, what does confirmation bias actually mean in this context?
Understanding Confirmation Bias
Confirmation bias is essentially defined as focusing on information that supports our beliefs and paying less attention to information that contradicts those beliefs. If you have ambiguous information, you will just assume that it supports your point. Often we believe the person who has the data wins, but it’s not that easy because often it’s your piece of data versus my piece of data. If we have different biases, I’m going to take my piece of data supporting my opinion much more seriously.
There’s something very natural about that because if you want to make decisions, you can’t just view all the data all at once. It’s always a balance, but recognizing that it’s an issue is the first step.
Exactly, and I think that’s probably the best we can do in all of this. I’m not an expert on any of this stuff, but it doesn’t look like there’s an actual solution here. You just have to train yourself to stay aware of the things that are influencing you. They are natural, and you can’t just shake them.
You have to train yourself to stay aware of the things that are influencing you.
You can’t turn yourself into a purely rational machine, but the trick is that on some level you trust your intuition. Then you learn about all these horrible biases that seem to be expressed by people, which sends you for quite a loop. Other biases are prejudice and classism, which is having opinions about people based on social class. I think that’s something that’s very common, unfortunately. The rich usually look down on the poor, often framing poverty as some sort of moral failure.
I think it happens in this country and a lot of countries. It’s pretty tough. Another bias is lookism.
I haven’t heard of that.
Lookism is when you judge people by their look. For example, news anchors are usually really good looking because, generally, people seem to trust good looking people more. That really works in my favor because I’m extremely good looking, after all!
There is a classic trick I do to figure out what people are biased against. I use Google’s type-ahead prediction. People type “biased against” and it completes it. So, here’s what it completed as of yesterday: action in Congress, religion, and then bias against introverts is like the fourth hit.
I was like, “What did I ever do to you?” Bias against conservatives, okay. Bias against LGBT, unfortunate. Bias against Israel, a huge conflict there. Bias against homeless and then the last one, I’m not gonna say whether it’s my favorite or not, but, bias against conservative students.
That’s very specific.
I have the screenshot. It’s pretty funny. So, these things are like mirrors, sometimes, but I think the reality is that when you hear about this stuff, you might not actually internalize it to the same degree because I think most of us consider ourselves to be fairly intelligent and aware.
I’m not walking around admitting to everybody on the street that I’m a victim of my biases, but it does come through at times. There are all kinds of little games that you can play to try to observe yourself, to try to put that second voice in your head that observes.
For example, we’re in startup land here, so we go to a lot of companies and we look at the about us pages and we look at the executive team. We just keep looking at those pages for a couple companies until we hit one that has an African American CEO, then see your reaction. Or see what your reaction is to a female CEO, or if all the executives are from India. My prediction is that what you will find in observing yourself is going to be surprising.
We want to rely on our intuition but biases are a real thing, so we have a bit of a problem here. Then people go and say, “Well, but that’s why we want to use data, in order to iron out all of these things.” I’m not the first person to talk about bias, it’s been around forever.
This is where the Weapons of Math Destruction book comes in and where it gets really interesting because big data analysis and mathematical modeling end up being introduced in order to overcome these biases.
There’s one example from the book where she’s talking through a number of scenarios where people went from thinking that qualitative assessments were subject to bias and tried to replace them with more quantitative approaches. The entire book is essentially chapter-by-chapter, a dissection of good intentions. I don’t think you can always claim good intentions, but there are a lot of examples when it comes to ranking teachers.
The Danger of Bias
Recidivism is the other example that’s talked about. The approach was to initially go review how people are doing, whether it’s judging teachers on how efficient they are, how good they are, or how much they contribute to their students’ advancements.
Right, I mean that’s one way to define efficiency for a teacher, if you want to use a cold word like that, and then recidivism is about the chances of somebody who has committed a crime going back to prison.
To basically commit another crime.
When you look at teachers’ evaluations, it’s all based on peer feedback, and you start looking at other things that might be happening, like maybe the teacher bought a new car for somebody. That’s a stupid example, but at the end of the day, it’s these kinds of soft, social things. People try to quantify that, and then what happens is they focus on test scores.
Specifically, there’s this one example in the book that is really kind of interesting. They thought the school system in Washington DC was underperforming. They brought some reformer in and built a teacher assessment tool, which is a mathematical model of some sort and an algorithm. I don’t think it was super complex.
They called this the value-added model and tracked the test scores of students over a year. If the scores went up, that meant the teacher was good. If the score went down, the teacher was bad. Then they would cut the bottom 5% of teachers based on this evaluation of the delta between the scores of the students. This became a national thing, there are Washington Post articles on this. They had a bunch of teachers who ended up getting fired and they couldn’t really figure out why.
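The mechanism described here can be sketched in a few lines. This is a hypothetical illustration only, not the actual DC model (which, as discussed, was never disclosed); the scoring by mean year-over-year delta and the 5% cut are taken from the description above, and all names and numbers are invented.

```python
def value_added_cut(before, after, cut_fraction=0.05):
    """before/after: dicts mapping teacher -> list of student test scores
    at the start and end of the year. Returns the teachers falling in the
    bottom cut_fraction by mean score delta (at least one teacher)."""
    def mean(xs):
        return sum(xs) / len(xs)

    deltas = {t: mean(after[t]) - mean(before[t]) for t in before}
    ranked = sorted(deltas, key=deltas.get)  # worst delta first
    n_cut = max(1, int(len(ranked) * cut_fraction))
    return ranked[:n_cut]

# A teacher who inherits artificially inflated "before" scores gets a
# large negative delta through no fault of her own:
before = {"honest": [90, 92], "inflated_prior": [95, 96], "typical": [70, 72]}
after = {"honest": [93, 94], "inflated_prior": [65, 68], "typical": [74, 75]}
fired = value_added_cut(before, after)  # ["inflated_prior"]
```

The sketch makes the failure mode easy to see: the model scores a trajectory, not teaching quality, so anything that distorts the baseline, such as inflated prior-year scores, lands directly on the new teacher.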
The first problem that happened was that the score wasn’t actually explained to them because, of course, the people who built this tool didn’t want to reveal the algorithm.
So, you could game it.
Right. What had happened was this one teacher had a first-grade class for the first time. The students had really high scores with their previous teacher, but when they showed up in this new teacher’s class, they could barely read, and so, of course, the scores dropped, because she didn’t inflate them artificially. What they reverse engineered was that, because of this value-added model, the trajectory of the students’ scores went down, and that put her on the bottom rung.
Then she got fired and I think she had to raise hell to get any kind of explanation out of the model. One thing that makes these things very complicated is that when you are being judged by an algorithm, the ecosystem around the algorithm believes the algorithm is correct, and algorithms often can’t explain their results.
You get that in machine learning all the time. We went through this in our own product with anomaly detection. If you have a product and it can’t explain an anomaly, that’s one thing because people can just ignore it. But if this type of machinery is being deployed against people, then it becomes potentially life-altering. One of the examples you run into is this paradox because the algorithm is essentially opaque. When people complain about the actions that are being done based on the evaluation that’s coming from the algorithm, they are expected to bring perfect evidence, but if you don’t know how you’re being judged, how are you going to have perfect evidence? Then you end up in this endless loop, which is very unfortunate for the individuals.
Then the recidivism example goes in the same direction in the design of these instruments, models, and tools. It’s fairly easy to observe, when you take a step back, that there are clear biases. Algorithms can reproduce the biases that were present before, the biases that people wanted to build these models to eradicate. But in the end, they’ll just get the same biases but now they don’t come with an explanation because a biased human at least has a mouth and, if you put enough pressure on them, they can at least try to explain themselves.
If you have an algorithm and a bunch of data that basically spits out a recommendation to a judge in terms of what your risk of recidivism is, you might just end up getting a longer sentence. The longer sentence is typically going to increase your risk because you’re in prison longer. There’s a lot of criminal stuff happening in prison, so it’s like another endless loop.
The way the recidivism model that is being talked about in the book worked was that they basically ask people questions. It was a survey and then they computed a score. The questions for recidivism were: how many priors do you have? Do you take drugs or alcohol? Where did you grow up? When were you involved for the first time with the police? Those are basically all systemic questions to figure out by and large whether or not you’re African American or Hispanic and if you come from a poor area. It’s sort of asking for it.
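A survey-based score of the kind described above is, at its core, just a weighted sum over answers. The following is a hypothetical sketch: the real instruments keep their questions and weights proprietary, so every question id and weight here is invented purely to illustrate how proxy questions carry bias into a "neutral" formula.

```python
def risk_score(answers, weights):
    """answers, weights: dicts keyed by question id; higher = 'riskier'.
    The score is a plain weighted sum of survey answers."""
    return sum(weights[q] * answers[q] for q in weights)

# Questions like "age at first police contact" or "neighborhood" act as
# proxies for race and poverty, so a formally neutral score re-encodes
# exactly the biases it was meant to replace. (Illustrative values only.)
answers = {"priors": 2, "age_first_contact": 1, "neighborhood_risk": 1}
weights = {"priors": 3.0, "age_first_contact": 2.0, "neighborhood_risk": 4.0}
score = risk_score(answers, weights)  # 2*3.0 + 1*2.0 + 1*4.0 = 12.0
```

The math is trivially "sound", which is the point made in the conversation: the bias lives in which questions get asked and how they are weighted, not in the arithmetic.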
If you then look at the statistics for New York City stop-and-frisk, and how it breaks down by race relative to the population, those biases exist in reality. Then you re-encode the biases into the models, but now the outcome is that the model is presented as mathematically sound, and questioning the model becomes much harder, if not impossible. So the argument to take away from this, if you follow that line of reasoning, is that things are getting worse than they were before, because of unintended consequences and the blind belief that as soon as I have data, I’m right.
Now you’re coming full circle. What do you do? Intuition is prone to biases, so okay, let’s do data but how is the data being collected? How is the data being interpreted? It’s not that the data or the algorithm are right themselves, there are always humans involved and the fact is that humans are just messy.
It’s not that the data or the algorithms are right themselves, there are always humans involved and, the fact is, humans are just messy.
Bias may be harder to uncover in that instance.
Data is Messy
I think there are a lot of really interesting things going on there. It’s an interesting discussion looking at some of these really amazing things that you can do with data. For example, detecting potential infections in prematurely born babies in a hospital two weeks before they actually show signs that humans can observe. Then, day after day, being able to analyze seismic activity to predict earthquakes. That is pretty cool and it really helps people.
But the issue is that people have now realized that there’s potentially a cost to some of these applications of data. I think that’s really what it’s all about: awareness, and trying to figure out, given the context, how you should interpret the data.
You’re a CTO in Silicon Valley, where a lot of these algorithms and programs are being written. When you think about it from that vantage point, how do you see the responsibility of these companies that are writing this code and creating these algorithms?
Oh, it’s very tough. We have commercial interests. I can’t be a hypocrite about this and say it’s other people’s problem. Generally, the discussion that has to happen is about the ethics around all of this. I don’t really have a perfect answer, but I find that following this train of thought and looking at references on the types of topics we just discussed makes it clear to me that folks need to reflect on these things. There are a couple of books on this, and nobody really has an algorithmic solution, because it is messy. The Weapons of Math Destruction author suggests some sort of Hippocratic pledge for data scientists, like the one doctors have.
There’s also this incredibly accomplished professor, Kate Crawford, who’s been writing at a very high academic level about data bias and fairness. One of the recent things that she wrote about is ten things that you need to keep in mind when you’re dealing with data.
Crawford talks about points like always assuming that data are people. Don’t assume that just because it’s a public dataset, it’s properly anonymized. When you link public data sets, you can often identify individuals. Don’t trust your own anonymization because that can often be reverse engineered very easily. It’s an intellectual exercise on some level. I think you have to want to try to solve this problem and, step-by-step, become aware, potentially saying we could build this feature, but we won’t.
Yeah, what does it mean? Well, this has been super interesting, Christian. It’s nice to see how you’re thinking about all of this.
As I said, humans are messy and so am I.
That’s what makes life interesting. Christian, thank you for coming on.