Injecting fairness into machine-learning models | MIT News ...

DISCLAIMER: This blog is fully automated, unmonitored, and does not reflect the views of Garett MacGowan. The ML model may produce content that is offensive to some readers.

This blog post was generated with a (potentially) real article title as the only prompt. A link to the original article is below.

Photo by Boitumelo Phetla on Unsplash

Generated: 3/7/2022

Artificial Intelligence: It’s Not What You Think

November 19, 2019. You look at machine learning algorithms and you see things like facial recognition and face detection, all working incredibly well. Look closer, though, and these algorithms are actually learning biases from their datasets, either through the over-representation of one group or the omission of others. In the past few years, research has also pointed to machine learning tools being deployed for uses they were never trained for, creating even more problems when their biases affect individuals.

These biases are learned, not explicit or deliberate. A machine learning algorithm may learn to classify faces as men or women by learning which faces belong to men and which to women. In that case, it has learned exactly what we expect: patterns it will generalize across faces. But imagine that, as a data scientist, you’ve fed a machine learning model a dataset built from crime reports and you’ve accidentally omitted the crime data that applies to women. The algorithm will have learned a male bias with no awareness that it’s happening. At that point, the bias is no longer just the model’s problem; society and individuals have to deal with it.
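
To make that omitted-group scenario concrete, here is a minimal Python sketch. Every number, label, and threshold below is invented for illustration; it is not any particular study’s data or method. A classifier trained only on “group A” learns a rule that quietly fails for the group left out of the training data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: one feature predicts the outcome, but the relationship
# is shifted for group B. Group B is "accidentally" dropped before training.
n = 5_000
x_a = rng.normal(0.0, 1.0, n)                              # group A feature values
y_a = (x_a + rng.normal(0.0, 0.3, n) > 0.0).astype(int)    # A: positive near x > 0
x_b = rng.normal(0.0, 1.0, n)                              # group B feature values
y_b = (x_b + rng.normal(0.0, 0.3, n) > 1.0).astype(int)    # B: positive near x > 1

model = LogisticRegression()
model.fit(x_a.reshape(-1, 1), y_a)                         # trained on group A only

print(f"accuracy on group A: {model.score(x_a.reshape(-1, 1), y_a):.2f}")
print(f"accuracy on group B: {model.score(x_b.reshape(-1, 1), y_b):.2f}")
# Group A scores well; group B scores noticeably worse, because the model's
# learned threshold reflects only the group it actually saw during training.
```

Nothing looks wrong during training; the gap only appears if you think to evaluate each group separately.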

What we have in this case is an example of statistical bias, as opposed to the more insidious forms of explicit or deliberate discrimination, such as racial or gender discrimination. Statistical bias is essentially the result of poor data. When we apply statistical analysis (and machine learning algorithms too, but we’ll get to that) without adequate data, we find patterns that look like statistical trends but may not be real. I’ll talk more about machine learning biases in a little bit, but first, let’s talk about statistical biases in general.

Why statistical bias?

Data points in a real dataset are not randomly scattered about the space in which they live, the way pins dropped at random onto a map of the United States would be. They’re just not! If they were, we wouldn’t be able to use statistical analysis and machine learning algorithms to figure out which part of the map to focus on; we’d be looking at the whole map at random, without any pattern.

That said, we can learn something by taking that scatter and organizing it in a meaningful way to understand our world. We can, for example, count how many states have a coastline (or how many cities are coastal), or how many people live in each country. That’s already useful, but imagine using statistical analysis to break those counts down further, by region or subcontinent, and even to predict what we’d expect the distribution of people around the world to look like.

Now imagine that, in the real world, people were scattered just as evenly across one space as across any other, and we had no basis on which to organize any of those points. The reason we, as a society, have a framework for organizing a world that is mostly not habitable dry land is that we have data for those other areas too: the sea in particular.

But while this is certainly useful, it becomes problematic if we use a dataset that doesn’t reflect the space being analyzed. In other words, we can end up with data that doesn’t accurately reflect reality, and the worst part is that sometimes we don’t even realize it, and don’t look again, until we’ve invested a lot of money. This can happen when our dataset isn’t a random sample of the real world, like, for example, a record of only the places where crime happens to be reported.

For example, if we used data to predict crime across the United States based only on where crimes happened to be recorded, we would be working with data that is neither random nor representative. A recorded incident, say a robbery tied to a victim’s address, tells us where a report was filed, not where crime does or doesn’t occur; you can’t expect an address on the map, or any single real-world location, to stand in for crime everywhere else. It would be like taking a map of US cities and trying to predict crime in, say, the Grand Canyon. The problem with this type of analysis is called sampling bias.

Sampling bias occurs when we use a dataset that doesn’t reflect the space we’re analyzing, which produces what’s called out-of-sample error. It’s the same problem you get when geographic data collected in one place is used to draw conclusions about a different geography. For example, it’s the same problem as building a risk model from the cases that occur within your company’s territory and expecting it to reflect the risk, and therefore the company’s exposure, outside that territory.
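
As a rough, hypothetical illustration of that out-of-sample problem, the sketch below uses invented county counts and incident rates to estimate a nationwide rate from a sample restricted to the urban counties where data is already being collected.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical counties: 20% "urban" with a high incident rate, 80% "rural"
# with a low one. All numbers are invented purely for illustration.
n_counties = 1_000
urban = rng.random(n_counties) < 0.20
true_rate = np.where(urban, 50.0, 5.0)            # incidents per 10k people
observed = rng.poisson(true_rate)

nationwide_mean = observed.mean()

# Biased sample: we only look where reporting already happens (urban counties).
urban_only_mean = observed[urban].mean()

print(f"mean over all counties:   {nationwide_mean:5.1f}")
print(f"mean over urban counties: {urban_only_mean:5.1f}")
# The urban-only estimate is several times higher than the nationwide figure,
# and any model fit to it will be wrong outside the sampled territory.
```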

Even worse, if we, as data scientists, decide that the lack of data is the problem and that the only way to fill the gap is to make the data up, we’re wrong. By making assumptions not only about the space we’re analyzing but also about the data itself, we introduce bias.
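
Here is a tiny sketch, again with invented numbers, of how made-up data bakes in bias: if group B’s outcomes were never observed and we fill the gap with the mean of the data we do have (all of it from group A), group B’s “data” simply inherits group A’s behaviour.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical outcomes: group A is observed, group B never is.
y_a = rng.normal(10.0, 1.0, 1_000)        # observed values (group A)
y_b_true = rng.normal(20.0, 1.0, 1_000)   # what group B actually looks like

# "Fixing" the missing data by imputing the overall mean of what we observed.
y_b_imputed = np.full_like(y_b_true, y_a.mean())

print(f"true group B mean:    {y_b_true.mean():6.2f}")
print(f"imputed group B mean: {y_b_imputed.mean():6.2f}")
# The imputation quietly replaces group B's reality with group A's average.
```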

We, as a society, have to constantly evaluate our assumptions and our data against what we observe. It’s not enough to assume that geographic data is real; if it’s real, there should be a way to measure it. And if geographic data is measured in a certain geography, it should only be taken to reflect that geography.

If you’ve thought all this through, I’m sure you can see how issues arise when we combine data with machine learning algorithms. So, let’s take a look at an example to see how we manage that.

Measuring something that is not there

If you’ve ever bought a house, you probably learned that square footage matters. The details depend on size, of course, but in the United States a home typically falls into one of two categories: “single-family” and “multi-family.” A single-family home is the traditional house with a yard of its own (and a yardstick to measure how large it is), and a multi-family home is, well, multi-family. Multi-family buildings are usually larger, and the private yard matters less because the units sit on a common yard area.

As an example, let’s say you buy a single-family house in, say, Boston. You measure it by your own yardstick. If you take on a roommate, though, that changes. You’ll have someone to help with the yard work, since the yard becomes a shared common area, but even then your measuring standard is shaped by that common yard.

If you measure square footage in these households, you get the size of the house plus a share of the common yard, and the result depends entirely on which yard you’re measuring. If the measurement includes a shared common area in addition to your home, like the front and back yards of an attached house, it won’t necessarily reflect the size of your home at all. And it’s this measured, yard-inclusive square footage that would be reported for a multi-family property, not the figure you would get for an isolated house.
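
Here is a minimal numeric sketch of that square-footage problem, with all figures invented for illustration: measuring each unit through the whole lot, shared yard included, systematically overstates its size.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical multi-family building: 4 units of roughly 900 sq ft each,
# plus a 2,000 sq ft shared yard.
unit_sqft = rng.normal(900, 50, size=4)
shared_yard_sqft = 2_000.0

# Measurement A: the space each unit actually contains.
true_mean_unit = unit_sqft.mean()

# Measurement B: divide the whole lot (units + shared yard) evenly per unit,
# the kind of proxy an analyst might get from parcel-level records.
lot_based_unit = (unit_sqft.sum() + shared_yard_sqft) / len(unit_sqft)

print(f"mean unit size (direct):      {true_mean_unit:7.1f} sq ft")
print(f"per-unit size from lot total: {lot_based_unit:7.1f} sq ft")
# The proxy overstates every unit by its share of the yard: measurement bias.
```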

And, unfortunately, this measurement bias is a very common issue. Data analysts use geographic locations to measure almost anything imaginable, based not only on a person’s address but also on their zip code, county, or even state. Using the geographic areas in which we know people reside as our measurement basis creates measurement biases, especially in small geographic areas.
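
One way to see why small geographic areas are especially troublesome, assuming the concern is the reliability of area-level estimates: in the invented example below, every area has the same true rate, yet the rate estimated from a small area swings far more than the rate estimated from a large one.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical: every area has a true rate of 10%, but we estimate it from
# reported cases in areas of very different population sizes.
true_rate = 0.10
populations = [100, 1_000, 100_000]      # small town -> mid-size -> big county

for pop in populations:
    cases = rng.binomial(pop, true_rate, size=5)   # five repeated "surveys"
    estimates = np.round(cases / pop, 3)
    print(f"population {pop:>7}: estimated rates {estimates}")
# The small area's estimates scatter widely around 10%, so any proxy measured
# there is the least trustworthy, even though nothing about the place differs.
```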

Let’s look at a map to see why. Picture a map of where crimes in the United States were recorded, set next to a county map showing cities and townships and their actual number of reported crime incidents. Townships, counties, and cities were all used in the analysis.

Such a map shows crime incidents concentrated in the United States’ larger cities, many of which sit along the coasts. Now, we don’t know whether the analysis was done using the crime data at the locations where it was recorded, or over the much larger space of the counties in which crime is reported. But even if the analysis covered a much bigger area, it would show that analyses of past crime carry far more of an urban bias than a county-level bias.

So, what do you think the chances are that, if you got a dataset collected solely for that analysis, over half of your counties would be made up of ocean?

So now we have a sampling bias. We know that, for statistical analysis, the geography we’re sampling over matters a great deal: we can’t really take a sample of any place unless all places are represented in that sample.
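
As a closing sketch, assuming “all places represented” means sampling each kind of place in proportion to its real share of the population (all shares and rates below are invented): a convenience sample skewed toward one kind of place misses the overall rate, while a stratified sample recovers it.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical population split across three kinds of places.
places = ["urban", "suburban", "rural"]
shares = np.array([0.30, 0.30, 0.40])      # fraction of the population
rates = np.array([0.40, 0.20, 0.05])       # outcome rate in each kind of place
true_overall = float(shares @ rates)

# Convenience sample: mostly the places that are easiest to reach.
convenience_weights = np.array([0.80, 0.15, 0.05])
convenience_estimate = float(convenience_weights @ rates)

# Stratified sample: draw from each place in proportion to its real share.
stratum_sizes = (shares * 10_000).astype(int)
stratum_hits = rng.binomial(stratum_sizes, rates)
stratified_estimate = stratum_hits.sum() / stratum_sizes.sum()

print(f"true overall rate:    {true_overall:.3f}")
print(f"convenience estimate: {convenience_estimate:.3f}")  # skewed urban
print(f"stratified estimate:  {stratified_estimate:.3f}")   # close to the truth
```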