Sunday, February 23, 2014

Murder and NLP: The Taman Shud Case, Part 1

Computational Linguistics Murder Mystery Theatre

Cracking the code found near the body of a dead man could help reveal the identity of his killer.

On December 1, 1948, the body of a dead man was found on Somerton Beach near Adelaide, South Australia. The identity of the man (henceforth, Somerton Man) was nowhere to be found among his belongings, and no one ever came forward to identify him. He appeared to have been otherwise healthy, and the cause of death was possibly a deliberate poisoning. The identity of his killer, if there was one, has never been established. It is possible that he was a Soviet spy, and possible that he was killed for reasons pertaining to espionage, but that is speculative. The case remains one of the strangest unsolved crimes in Australia's history.

A slip of paper was found in the dead man's pocket, containing only the words "Tamam Shud." This was identified as the final line of Omar Khayyam's Rubaiyat, and a man came forward saying that he had found a copy of the book in his car near the beach the same day the body was found. It turned out to be the very copy the slip of paper had been torn from. Written in the book was the phone number of a nurse who lived nearby. She denied knowing the man, but various claims have been made that she did know him and was lying.

What does this strange murder story have to do with Natural Language Processing?! Also written in the book was a sequence of letters, in five lines. There is some ambiguity in the handwriting, but one reasonable reading of four of them is as follows:

WRGOABABD
WTBIMPANETP
MLIABOAIAQC
ITTMTSAMSTGAB

In addition, a line which begins like the third line above was written and crossed out between the first and second.

These letters are not comprehensible in any obvious way, and it has been supposed that they might represent a cipher that, if broken, would shed some light on the case. From here forward, I will call this the Tamam Shud Cipher, or TSC, although it is not clear that it actually is a cipher, an intentionally coded message, in the literal sense of the term. The TSC has not been convincingly decoded, in whole or in part.

Many hypotheses have been offered regarding the nature of the cipher, but previous work, including that of Derek Abbott's students at the University of Adelaide, has suggested that the letters may be initials of words from an English text.

In previous work, the frequency of letters in the TSC was compared to letter frequencies in other collections of text: first with samples of several languages, and then with the initial letters of words in those languages. The two comparisons are not the same, because letters occur with different frequencies in different positions within words; for example, 'e' is the most common letter in English, but 's' and 't' begin more words than 'e' does. Abbott's students found that the TSC letter frequencies match those of English initials significantly better than those of English text overall, and better than initials or text in any of several other languages. I have performed similar tests using different source texts and reached the same conclusion.
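The distinction between overall letter frequencies and word-initial letter frequencies is easy to compute. Here is a minimal sketch in Python; the corpus filename is a placeholder, and this is not the code used in any of the studies mentioned above.

from collections import Counter

def letter_freqs(text):
    """Relative frequency of each letter in a text, ignoring case and non-letters."""
    letters = [c for c in text.lower() if c.isalpha()]
    return {c: n / len(letters) for c, n in Counter(letters).items()}

def initial_freqs(text):
    """Relative frequency of each word-initial letter in a text."""
    initials = [w[0].lower() for w in text.split() if w[0].isalpha()]
    return {c: n / len(initials) for c, n in Counter(initials).items()}

def top5(freqs):
    """The five most frequent letters in a frequency table."""
    return sorted(freqs.items(), key=lambda kv: -kv[1])[:5]

# The four uncrossed lines of the TSC, as read above.
tsc = "WRGOABABD WTBIMPANETP MLIABOAIAQC ITTMTSAMSTGAB"

# Any large English sample will do; the filename is just a placeholder.
corpus = open("english_sample.txt", encoding="utf-8").read()

print(top5(letter_freqs(corpus)))   # 'e' should be at or near the top
print(top5(initial_freqs(corpus)))  # 't' and 's' rank much higher here
print(top5(letter_freqs(tsc)))      # compare the TSC against both distributions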

This provides evidence that the TSC is, in fact, an initialism, a sequence of initials from some specific text – in this case, a short English text. However, this evidence falls short of proof. It shows that an English initialism is the best of the possibilities that were tested, and quite a few were tested, but it leaves open the possibility that an untested alternative would fit the TSC letter frequencies just as well or better. It also leaves unexamined the possibility that the letters in the TSC are initials taken from English text but are no longer in their original order.

A more definitive test of whether the TSC is an initialism from a specific English text is to examine short sequences of letters within the TSC and measure how they rank compared to the sequences of initials found in English text in general. Grammatical patterns in English make some sequences of initials more common than the same initials in another order, so by performing this test on the TSC and on variations of the TSC with its letters scrambled into random order, we can see how likely it is that the TSC preserves the expected sequences.

We can use an arbitrarily large corpus of English to generate the initial-letter ngrams for English, but the TSC itself is short, and therefore it samples the space of ngrams very sparsely. This means that many kinds of statistical metrics will show a mismatch between TSC ngrams and corpus ngrams even if both are initialisms from the same language. For example, the Pearson correlation between the bigrams of a known, but short, English initialism and those of the corpus will come out negative due to the sparseness of the short string's ngram matrix.
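To make the sparseness point concrete, here is one way such a correlation could be set up (a sketch of my own, not the original analysis): each string becomes a 676-dimensional vector of initial-bigram counts, and the two vectors are correlated. For a string of a few dozen letters, almost every cell is zero, which is what drags the correlation down.

import numpy as np
from collections import Counter
from itertools import product
from string import ascii_lowercase

ALL_BIGRAMS = ["".join(p) for p in product(ascii_lowercase, repeat=2)]

def bigram_vector(letter_sequence):
    """676-dimensional vector of bigram counts over a sequence of letters."""
    s = letter_sequence.lower()
    counts = Counter(s[i:i + 2] for i in range(len(s) - 1))
    return np.array([counts[b] for b in ALL_BIGRAMS], dtype=float)

# `corpus_initials` would be the word-initial letters of a large English corpus,
# and `short_initialism` the initials of any known short English text.
# r = np.corrcoef(bigram_vector(corpus_initials), bigram_vector(short_initialism))[0, 1]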

A useful metric that remains sensitive when the string being tested against a corpus is short is to generate the ngrams within the string and calculate the mean of how highly those ngrams rank among the corpus ngrams. This, in effect, gives the string credit for containing common initial ngrams, but doesn't punish it for lacking other initial ngrams that it is simply too short to "get around to."
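Here is a minimal sketch of such a mean-rank metric. The details, such as how ties are ordered and how ngrams absent from the corpus are scored, are my own assumptions rather than the exact procedure used in the work described above.

from collections import Counter

def ngrams(s, n):
    """All contiguous n-letter substrings of s."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def corpus_initial_ngram_counts(corpus_text, n):
    """Ngram counts over the sequence of word-initial letters of a corpus."""
    initials = "".join(w[0].lower() for w in corpus_text.split() if w[0].isalpha())
    return Counter(ngrams(initials, n))

def mean_rank(candidate, corpus_counts, n):
    """Mean rank of the candidate's ngrams among the corpus ngrams.

    Rank 1 is the most frequent corpus ngram; ngrams never seen in the corpus
    all get the worst possible rank. A lower mean rank means the candidate is
    built from more typical initial ngrams.
    """
    rank_of = {g: r for r, (g, _) in enumerate(corpus_counts.most_common(), start=1)}
    worst = len(rank_of) + 1
    grams = ngrams(candidate.lower(), n)
    return sum(rank_of.get(g, worst) for g in grams) / len(grams)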

I generated 1,000 random shuffles of the TSC, and for the TSC and each shuffle, I calculated the mean rank of the string's initial ngrams among those generated from a corpus of 5 million words of English literature, for n from 2 to 5. If the letters of the TSC are initials selected at random from English text, or if they were generated by some other means altogether, then the mean ranks for the TSC should fall at about the 50th percentile of the mean ngram ranks of the 1,000 random shuffles. If, however, the TSC was generated as an initialism from a specific English text, then its mean ngram ranks should be significantly higher than the 50th percentile. The results follow:

N         TSC Percentile
2          85.6
3          92.3
4          96.4
5          93.6

We see that the results are convincing, particularly when n=4 (the matrices begin to become sparse for n=5, weakening that result). For many scientific purposes, the 95th percentile is offered as a standard of proof, and these results make a fairly convincing case that the TSC is an initialism, in its original order, of English text.
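For completeness, the shuffle comparison itself might look something like this, reusing mean_rank and corpus_initial_ngram_counts from the sketch above. The corpus filename is a placeholder, and the numbers it produces will of course vary with the corpus and with the reading of the TSC.

import random

def shuffle_percentile(candidate, corpus_counts, n, trials=1000, seed=0):
    """Percentage of random shuffles of `candidate` whose mean ngram rank is
    worse (higher) than the candidate's own; that is, how many shuffles it beats."""
    rng = random.Random(seed)
    target = mean_rank(candidate, corpus_counts, n)
    letters = list(candidate)
    worse = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if mean_rank("".join(letters), corpus_counts, n) > target:
            worse += 1
    return 100.0 * worse / trials

# Example usage (the corpus file is a placeholder):
# corpus_text = open("english_literature.txt", encoding="utf-8").read()
# tsc_letters = "WRGOABABDWTBIMPANETPMLIABOAIAQCITTMTSAMSTGAB"
# for n in range(2, 6):
#     counts = corpus_initial_ngram_counts(corpus_text, n)
#     print(n, shuffle_percentile(tsc_letters, counts, n))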

Note, however, that the TSC is written as a series of lines which may be linguistically unrelated to one another. If the four lines of the TSC are lines of poetry, separate sentences, or in any way excerpts of a longer text, then the ngrams generated across the boundaries of lines potentially introduce noise. The idea that the lines are separate units is further supported by the fact that the crossed-out line occurs in a different position than the similar line which is not crossed out.

So we can repeat the analysis, comparing the TSC's ngrams to those generated from random shuffles of the TSC, but excluding the ngrams of the TSC that start on one line and end on another. If the TSC is not derived from sequences of English initials, we would, again, expect to see it rank at about the 50th percentile in mean ngram rank among the random shuffles. What we see instead greatly strengthens the previous result.

N         TSC Percentile
2          96.9
3          99.2
4          99.2
5          99.2

It is exceedingly unlikely that lines of letters generated by any other method would show this regularity. These lines are almost certainly initialisms corresponding to one or more short English texts.
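The line-aware variant of the analysis only requires generating ngrams within each line and, when shuffling, preserving the original line lengths. Here is one way that might be set up; again, this is a sketch under my own assumptions, not the exact procedure used.

import random

def within_line_ngrams(lines, n):
    """Ngrams generated inside each line only, never spanning a line break."""
    grams = []
    for line in lines:
        grams.extend(line[i:i + n] for i in range(len(line) - n + 1))
    return grams

def shuffled_lines(lines, rng):
    """Shuffle all the letters together, then re-cut them into lines of the original lengths."""
    letters = list("".join(lines))
    rng.shuffle(letters)
    out, start = [], 0
    for line in lines:
        out.append("".join(letters[start:start + len(line)]))
        start += len(line)
    return out

def mean_rank_of_grams(grams, corpus_counts):
    """Mean rank of a list of ngrams among the corpus ngrams (cf. mean_rank above)."""
    rank_of = {g: r for r, (g, _) in enumerate(corpus_counts.most_common(), start=1)}
    worst = len(rank_of) + 1
    return sum(rank_of.get(g.lower(), worst) for g in grams) / len(grams)

tsc_lines = ["WRGOABABD", "WTBIMPANETP", "MLIABOAIAQC", "ITTMTSAMSTGAB"]
# The percentile test then proceeds as before, scoring within_line_ngrams(tsc_lines, n)
# against within_line_ngrams(shuffled_lines(tsc_lines, random.Random(0)), n), and so on.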


And so the focus turns deeper: given that the Tamam Shud cipher very likely is an initialism of some short English text (or texts), what does it say, and what does that say about the case? The story continues in my next post.

Saturday, February 15, 2014

APIs for Multilingual NLP

Coming soon, I will be able to announce the availability of APIs that provide text analytics in the world's major business languages.

A placeholder for that site is at gistology.com, showing on a map the countries whose languages will be covered. There'll be more to say about this soon.

Is Google Racist?

Like many services available online, Google's search bar offers completions that suggest the rest of a user's query before they have typed it all out. Potentially, you could enter a 25-letter query in just a few keystrokes, saving a little time – and who doesn't love saving time?

Sometimes, though, you see something like this:

In this case, I deliberately "baited" Google by starting out with a query that I knew would elicit a result like this, but it certainly took the bait. Maybe someone is wondering why black people are more likely than other races to have sickle-cell anemia, but Google guesses instead that they have a racist question they want to research.

So Google was wrong, and failed to save the person a couple of seconds. But it does something more than that. It exposes the user to a couple of stereotypes that range from unflattering to extremely offensive. And by modifying the query a little, you can find a treasure trove of other stereotypes according to Google's completions:

Californians are fake, stupid, and weird. Texans are stupid idiots. New Yorkers are rude and arrogant. (Actually, most groups of people are rude, if you go by the Google completions.) Americans are obese and ignorant. Chinese people are smart. Jews are rich. Asians are bad drivers. At least, this is what Google completions suggest, in response to a partially-typed query.

Why do these completions exist? Who is actually putting these stereotypes forth as true? You'd have to know Google's backend architecture in detail to answer that exactly, but some experimentation indicates the following:

1) People who type queries into Google. Common queries are more likely to appear as completions.

2) People who create web pages. Some completions are oddly-worded and unlikely to have originated as queries, but appear verbatim in various web pages.

3) People who see these queries and then select them. This is where the algorithm becomes insidious. The intention of a completion is to help someone save the effort of typing. But an interesting completion might sidetrack someone from their original purpose, prompting them to click on it just to see what it's about. For example:

Benjamin Franklin had syphilis? Maybe some student had to write a term paper on the great thinker and now they've been given the lurid suggestion that the man had a sexually-contracted disease! If true, it's certainly not why Franklin became famous, but there it is as one (in fact, two) of the top four relevant facts about the man. Never mind his efforts in founding the United States, publishing newspapers, discovering the electrical nature of lightning and so on: He had V.D.! That would be bad enough if it were true, but in fact, it appears not to be! If you follow the links these completions lead to, none of them offer any evidence that Franklin had syphilis – just people asking if he did. But I'll admit, the completion made me want to click on it. And that's exactly the problem. I just "voted" for the Franklin - syphilis link, elevating it a little bit higher than the other possibilities. If syphilis had started off as the #4 completion, people like me would vote it up to a higher position on the list, so more sensational queries tend to "win".

As a consultant for a news aggregator startup, I once had access to the data of which headlines people clicked on. An unmistakable trend was that headlines with exciting, sensational words in them were clicked on more often. This was true even if the word was simply being used as a metaphor ("Interest Rates Explode", "President Attacks His Critics").

It doesn't require that a lot of people believe that Franklin had syphilis or that any of the stereotypes listed above are true. It only requires that enough people type that query (or click on web pages that assert, or even just ask, the question) for it to get somewhere on the list of completions (maybe #10); then other people see that completion, get intrigued, and vote it up. In fact, a lot of the people who vote up the racial stereotypes could even be people who are incredibly offended by them.
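To see how little it takes, here is a toy simulation of that feedback loop. It is emphatically not Google's algorithm; the completion strings, rates, and "curiosity boost" are all invented for illustration. A completion that starts at the bottom of the list, but attracts extra clicks whenever it is shown, tends to climb.

import random

def simulate_completions(base_rates, curiosity_boost, rounds=100_000, top_k=4, seed=0):
    """Toy click-feedback loop: displayed completions attract clicks, and
    clicks raise a completion's score, which keeps it displayed."""
    rng = random.Random(seed)
    scores = dict(base_rates)
    for _ in range(rounds):
        shown = sorted(scores, key=scores.get, reverse=True)[:top_k]
        # A user picks among what is shown; sensational items get extra pull.
        weights = [base_rates[c] + curiosity_boost.get(c, 0.0) for c in shown]
        choice = rng.choices(shown, weights=weights)[0]
        scores[choice] += 1
    return sorted(scores, key=scores.get, reverse=True)

# Entirely hypothetical numbers: the lurid completion is typed least often
# organically, but draws disproportionate clicks once people see it.
base = {"franklin inventions": 5.0, "franklin biography": 4.0,
        "franklin quotes": 3.0, "franklin syphilis": 1.0}
boost = {"franklin syphilis": 6.0}
print(simulate_completions(base, boost))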

Benjamin Franklin is dead, but these sorts of completions exist for living celebrities, too. I did a search for 3 former NFL quarterbacks, and one of them had a completion for "gay". The man in question has denied being gay. How many people, every day, see this completion? Google is in effect spreading rumors about the man's personal life, just as it flashes a number of racial stereotypes before our eyes.

So what is Google's role in this? Surely not that someone at Google decided that these racial stereotypes are useful suggestions. Google only built the system. The data voted for these completions to rank so high. If Google's algorithms work so well in general, then they can claim neutrality on these questions and say, Sorry, but these are the completions lots of people type and choose.

Except Google isn't neutral. Google does sanitize their completions in many cases.

If you type "scarlett johansson photos..." into Google's search bar, and then add any letter of the alphabet, it will show you completions. Any letter, that is, besides "n". "scarlett johansson photos n" produces no completions at all. Why? And so what? The reason why is that Google has specifically censored the completion that would allow the word "nude" to appear. I've had access to the logs of queries from search engines, and I guarantee you that the completions Google shows for "scarlett johansson photos" are not more common than "scarlett johansson nude" or "scarlett johansson photos nude". In fact, Google shows no completions for the word "nude" all by itself, even though it shows completions for "Mohorovičić discontinuity"... which one do you think people are searching for more?

So Google isn't completely neutral. They let the data vote for itself sometimes, even usually, but they censor some completions on, apparently, the suspicion that the results would be offensive or non-family-friendly.

Given that, there's not much excuse for letting these racial slurs show up. If it's offensive to suggest that an actress has taken her clothes off, it's certainly more offensive to allow the data to promote the racial stereotypes listed above. "It's just data" is a valid excuse for a person or company who uses big data as a tool. But once human hands go to work in the system, selecting what does and doesn't show up, those hands start to take some of the blame for the whole system. One imagines that these stereotypes have simply been below Google's radar, and that the "Don't Be Evil" company would want to censor those completions once they're aware of them.