Like many services available online, Google's search bar offers completions that suggest the rest of a user's query before they have typed it all out. Potentially, you could enter a 25-letter query in just a few keystrokes, saving a little time – and who doesn't love saving time?
Sometimes, though, you see something like this:
In this case, I deliberately "baited" Google by starting out with a query that I knew would elicit a result like this, and it certainly took the bait. Maybe someone is wondering why black people are more likely than people of other races to have sickle-cell anemia, but Google guesses instead that they have a racist question they want to research.
So Google guessed wrong, and failed to save the person a couple of seconds. But it did something more than that. It exposed the user to a couple of stereotypes that range from unflattering to extremely offensive. And by modifying the query a little, you can find a treasure trove of other stereotypes in Google's completions:
Californians are fake, stupid, and weird. Texans are stupid idiots. New Yorkers are rude and arrogant. (Actually, most groups of people are rude, if you go by the Google completions.) Americans are obese and ignorant. Chinese people are smart. Jews are rich. Asians are bad drivers. At least, this is what Google completions suggest, in response to a partially-typed query.
Why do these completions exist? Who is actually putting these stereotypes forth as true? You'd have to know Google's backend architecture in detail to answer that exactly, but some experimentation indicates the following:
1) People who type queries into Google. Common queries are more likely to appear as completions.
2) People who create web pages. Some completions are oddly-worded and unlikely to have originated as queries, but appear verbatim in various web pages.
3) People who see these completions and then select them. This is where the algorithm becomes insidious. The intention of a completion is to save someone the effort of typing. But an interesting completion might sidetrack someone from their original purpose, prompting them to click on it just to see what it's about. For example:
Benjamin Franklin had syphilis? Maybe some student had to write a term paper on the great thinker, and now they've been given the lurid suggestion that the man had a sexually contracted disease! If true, it's certainly not why Franklin became famous, but there it is as one (in fact, two) of the top four relevant facts about the man. Never mind his efforts in founding the United States, publishing newspapers, discovering the electrical nature of lightning, and so on: he had V.D.! That would be bad enough if it were true, but in fact, it appears not to be! If you follow the links these completions lead to, none of them offer any evidence that Franklin had syphilis – just people asking if he did. But I'll admit, the completion made me want to click on it. And that's exactly the problem. I just "voted" for the Franklin–syphilis link, elevating it a little higher than the other possibilities. If syphilis had started off as the #4 completion, people like me would vote it up to a higher position on the list, so more sensational queries tend to "win".
As a consultant for a news aggregator startup, I once had access to the data of which headlines people clicked on. An unmistakable trend was that headlines with exciting, sensational words in them were clicked on more often. This was true even if the word was simply being used as a metaphor ("Interest Rates Explode", "President Attacks His Critics").
It doesn't require that a lot of people believe Franklin had syphilis, or that any of the stereotypes listed above are true. It only requires that enough people type that query (or click on webpages that assert, or even just ask, the question) for it to get somewhere on the list of completions (maybe #10), and then other people see that completion, get intrigued, and vote it up. In fact, many of the people who vote up the racial stereotypes could be people who are deeply offended by them.
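This feedback loop is easy to simulate. The sketch below is a toy model, not Google's actual algorithm: the completion phrases, starting scores, ranking rule, and click probabilities are all assumptions for illustration. The only ingredient that matters is that a "sensational" completion draws extra clicks from curious users, and each click bumps its score.

```python
import random

random.seed(0)

# Hypothetical completion scores, seeded by raw query frequency.
# The sensational completion starts in last place among the top four.
scores = {
    "benjamin franklin inventions": 100.0,
    "benjamin franklin quotes": 90.0,
    "benjamin franklin kite": 80.0,
    "benjamin franklin syphilis": 70.0,
}

# Assumed click model: a lurid completion is clicked more often,
# even by users who came to search for something else.
click_prob = {c: (0.30 if "syphilis" in c else 0.10) for c in scores}

def show_completions(scores, k=4):
    """Rank completions by score, highest first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Each simulated user scans the list top to bottom and may click one
# completion; every click "votes" that completion up.
for _ in range(5000):
    for completion in show_completions(scores):
        if random.random() < click_prob[completion]:
            scores[completion] += 1.0
            break  # at most one click per user

print(show_completions(scores))
```

Under these (made-up) numbers, the syphilis completion climbs from #4 to #1, even though no simulated user believes it or originally searched for it – curiosity alone is enough.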
Benjamin Franklin is dead, but these sorts of completions exist for living celebrities, too. I did a search for 3 former NFL quarterbacks, and one of them had a completion for "gay". The man in question has denied being gay. How many people, every day, see this completion? Google is in effect spreading rumors about the man's personal life, just as they flash before our eyes a number of racial stereotypes.
So what is Google's role in this? Surely no one at Google decided that these racial stereotypes are useful suggestions. Google only built the system; the data voted these completions to the top. If Google's algorithms work so well in general, then Google can claim neutrality on these questions and say, "Sorry, but these are the completions lots of people type and choose."
Except Google isn't neutral. Google does sanitize their completions in many cases.
If you type "scarlett johansson photos..." into Google's search bar, and then add any letter of the alphabet, it will show you completions. Any letter, that is, besides "n". "scarlett johansson photos n" produces no completions at all. Why? And so what? The reason why is that Google has specifically censored the completion that would allow the word "nude" to appear. I've had access to the logs of queries from search engines, and I guarantee you that the completions Google shows for "scarlett johansson photos" are not more common than "scarlett johansson nude" or "scarlett johansson photos nude". In fact, Google shows no completions for the word "nude" all by itself, even though it shows completions for "Mohorovičić discontinuity"... which one do you think people are searching for more?
So Google isn't completely neutral. They let the data vote for itself sometimes, even usually, but they censor some completions on, apparently, the suspicion that the results would be offensive or non-family-friendly.
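One crude way such censorship could be implemented is a term blocklist applied after ranking. This is a guess at the mechanism, and the blocked term and sample completions below are illustrative, not taken from any real system:

```python
# Hypothetical post-ranking filter: drop any ranked completion that
# contains a blocked term. The blocklist contents are an assumption.
BLOCKED_TERMS = {"nude"}

def filter_completions(ranked):
    """Remove completions containing any blocked term."""
    return [c for c in ranked
            if not any(term in c.lower() for term in BLOCKED_TERMS)]

ranked = [
    "scarlett johansson photos new movie",
    "scarlett johansson photos nude",   # would be suppressed
    "scarlett johansson photos net worth",
]
print(filter_completions(ranked))
```

A filter like this would explain the observed behavior: the suppressed completion simply vanishes from the list no matter how many people type or click it, while everything around it ranks on the data alone.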
Given that, there's not much excuse for letting these racial slurs show up. If it's offensive to suggest that an actress has taken her clothes off, it's certainly more offensive to allow the data to promote the racial stereotypes listed above. "It's just data" is a valid excuse for a person or company who uses big data as a tool. But once human hands go to work in the system, selecting what does and doesn't show up, those hands start to take some of the blame for the whole system. One imagines that these stereotypes have simply been below Google's radar, and that the "Don't Be Evil" company would want to censor those completions once they're aware of them.