Computational Linguistics Murder
Mystery Theatre
Cracking the code found near the body of a dead man could help
reveal the identity of his killer.
On December 1, 1948, the body of a man was found on Somerton
Beach near Adelaide, South Australia. Nothing among his belongings identified
the man (henceforth, the Somerton Man), and no one ever came forward to
identify him. He seemed otherwise healthy, and the cause of death was possibly
a deliberate poisoning. The identity of his killer, if there was one, has not
been established. It is possible that he was a Soviet spy, and possible that he
was killed for reasons pertaining to espionage, but that is speculative. The
case remains one of the strangest unsolved crimes in Australia's history.
A slip of paper was found in the dead man's pocket,
containing only the words "Tamam Shud." This was identified as the
final line of Omar Khayyam's Rubaiyat,
and a man came forward saying that he found a copy of the book in his car near
the beach the same day the body was found. It turned out to be the same copy
the slip of paper was torn from. Written in the book was the phone number of a
nurse who lived nearby. She denied knowing the man, but various claims have
been made that she did know him and was lying.
What does this strange murder story have to do with Natural
Language Processing?! Also written in the book was a sequence of letters, in
five lines. There is some ambiguity in the handwriting, but one reasonable
reading of four of them is as follows:
WRGOABABD
WTBIMPANETP
MLIABOAIAQC
ITTMTSAMSTGAB
In addition, a line which begins like the third line above
was written and crossed out between the first and second.
These letters are not in any obvious way comprehensible, and
it has been supposed that the letters might represent a cipher that, if broken,
would shed some light on the case. From here forward, I will call this the
Tamam Shud Cipher, or TSC, although it is not clear that it actually is a
cipher, an intentionally coded message, in the literal sense of the term. The
TSC has not been decoded in a highly convincing way in whole or in part.
Many hypotheses have been offered regarding the nature of the
cipher; previous work, including that of Derek Abbott's students at the
University of Adelaide, has suggested that the letters may be initials from an
English text.
In previous work, the frequency of letters in the TSC was
compared to letter frequencies in other collections of text: first with
samples of several languages, and then with the initial letters of words in
those languages. The two comparisons are not the same, because letters occur
with different frequencies in different positions in words; for example, 'e' is
the most common letter in English, but 's' and 't' begin more words than 'e'.
Abbott’s students found that the TSC letter frequencies match those of English
initials significantly better than those of English text overall, and also
better than initials or text in any of several languages. I have performed
similar tests using different source texts and reached the same conclusion.
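
As a rough illustration of the difference between the two distributions, the following Python sketch computes a text's overall letter frequencies and its word-initial letter frequencies, and compares both against the TSC letters. The file name english_corpus.txt is a placeholder for whatever English sample you have on hand, not a specific corpus used in the work described above.

from collections import Counter
import re

# The four legible lines of the TSC, as read above.
TSC_LINES = ["WRGOABABD", "WTBIMPANETP", "MLIABOAIAQC", "ITTMTSAMSTGAB"]
TSC = "".join(TSC_LINES)

def letter_freqs(text):
    """Relative frequency of each letter anywhere in the text."""
    letters = [c for c in text.upper() if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def initial_freqs(text):
    """Relative frequency of each letter as a word-initial letter."""
    initials = [w[0].upper() for w in re.findall(r"[A-Za-z]+", text)]
    counts = Counter(initials)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def top5(freqs):
    return sorted(freqs.items(), key=lambda kv: -kv[1])[:5]

# Placeholder corpus file; any large plain-text English sample will do.
with open("english_corpus.txt", encoding="utf-8") as f:
    corpus = f.read()

print("TSC letters:    ", top5(letter_freqs(TSC)))
print("corpus letters: ", top5(letter_freqs(corpus)))
print("corpus initials:", top5(initial_freqs(corpus)))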
This provides evidence that the TSC is, in fact, an
initialism, a sequence of initials from some specific text – in this case, a
short English text. However, this evidence falls short of proof. It shows that
an English initialism is the best of the possibilities that were tested, and
quite a few were tested, but it leaves open the possibility that an untested
alternative would fit the TSC letter frequencies just as well or better. It
also leaves unexamined the possibility that the letters in the TSC are initials
taken from English text but arranged in some other order.
And so, a more definitive test of whether the TSC is an initialism
from a specific English text is to examine short sequences of letters (ngrams)
within the TSC and measure how they rank among the sequences of initials found
in English text in general. Grammatical patterns in English make some sequences
of initials more common than the same initials in another order, so by
performing this comparison for the TSC and for versions of the TSC with its
letters scrambled into random order, we can see how likely it is that the TSC
preserves the expected sequences.
We can use an arbitrarily large corpus of English to
generate the initial-letter ngrams for English, but the TSC itself is short,
and therefore it samples the space of ngrams very sparsely. This means that
many kinds of statistical metrics will show a mismatch between TSC ngrams and
corpus ngrams even if they are initialisms from the same language. For example,
Pearson correlations between the bigram counts of a known, but short, English
initialism and those of the corpus will come up negative due to the sparseness
of the short string's ngram matrix.
A useful metric that is more sensitive when the string we
are testing against a corpus is short is to generate the ngrams within the
string, and calculate the mean of how high those ngrams rank among the corpus
ngrams. This, in effect, gives the string credit for containing common initial
ngrams, but doesn’t punish it for lacking other initial ngrams because it is
simply too short to “get around to” them.
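
A minimal sketch of this metric in Python follows. The exact scoring details (how ranks are scaled, and how ngrams missing from the corpus are handled) are not spelled out above, so the choices here, such as scoring an unseen ngram as zero, are my own assumptions.

from collections import Counter
import re

def initial_ngrams(words, n):
    """Ngrams over the sequence of word-initial letters of a word list."""
    initials = "".join(w[0].upper() for w in words)
    return [initials[i:i + n] for i in range(len(initials) - n + 1)]

def corpus_ngram_ranks(corpus_text, n):
    """Map each initial-letter ngram in the corpus to a rank score,
    scaled so the most common ngram scores 1.0 (assumed scaling)."""
    words = re.findall(r"[A-Za-z]+", corpus_text)
    counts = Counter(initial_ngrams(words, n))
    ordered = [gram for gram, _ in counts.most_common()]
    total = len(ordered)
    return {gram: 1.0 - i / total for i, gram in enumerate(ordered)}

def mean_ngram_rank(string, ranks, n):
    """Mean rank score of the ngrams within `string`; ngrams that never
    occur in the corpus contribute 0 (assumed handling)."""
    grams = [string[i:i + n] for i in range(len(string) - n + 1)]
    return sum(ranks.get(g, 0.0) for g in grams) / len(grams)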
I generated 1,000 random shuffles of the TSC, and for the
TSC and each shuffle, I calculated the mean rank of the string’s initial ngrams
in terms of those generated from a corpus of 5 million words of English
literature, for n from 2 to 5. If the letters in the TSC are initials selected
at random from English texts, or if they were generated by some other means
altogether, then the mean ranks for the TSC should fall at about the 50th
percentile of the mean ngram ranks for the 1,000 random shuffles of
TSC. If, however, the TSC was generated as an initialism from a specific
English text, then its mean ngram ranks should be significantly higher than 50th
percentile. The results follow:
N    TSC Percentile
2    85.6
3    92.3
4    96.4
5    93.6
We see that the results are convincing, particularly when n=4
(the matrices begin to become sparse for n=5, weakening the result). For many
scientific purposes, the 95th percentile is offered as a standard of
proof, and these results are fairly convincing evidence that the TSC is an
initialism, in correct order, of English text.
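
For concreteness, the shuffle test might look something like the sketch below, reusing the corpus_ngram_ranks and mean_ngram_rank helpers sketched earlier; the fixed random seed and the strict less-than comparison are my own choices, not details taken from the analysis above.

import random

def shuffle_percentile(string, ranks, n, num_shuffles=1000, seed=0):
    """Percentile of `string`'s mean ngram rank among random shuffles
    of its own letters."""
    rng = random.Random(seed)
    target = mean_ngram_rank(string, ranks, n)
    letters = list(string)
    below = 0
    for _ in range(num_shuffles):
        rng.shuffle(letters)
        if mean_ngram_rank("".join(letters), ranks, n) < target:
            below += 1
    return 100.0 * below / num_shuffles

# Example usage, assuming `corpus` holds a large English text sample:
# for n in range(2, 6):
#     ranks = corpus_ngram_ranks(corpus, n)
#     print(n, shuffle_percentile(TSC, ranks, n))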
However, note also that the TSC is written as a series of
lines which may be linguistically unrelated to one another. If the four lines
of TSC are lines of poetry, separate sentences, or in any way excerpts of a
longer text, then the ngrams that are generated across the boundaries of lines
potentially introduce noise. The idea that the lines are separate entities is
further supported by the fact that the crossed-out line occurs in a different
position than the similar line which is not crossed out.
So we can repeat the analysis, comparing TSC’s ngrams to
those generated from random shuffles of TSC, but excluding the ngrams of TSC
that start on one line and end on another. If TSC is not derived from sequences
of English initials, we would, again, expect to see it rank at about the 50th
percentile in mean ngram rank among the random shuffles. What we see, instead,
greatly strengthens the previous result.
N    TSC Percentile
2    96.9
3    99.2
4    99.2
5    99.2
It is exceedingly unlikely that any other method of
generating lines of letters would show this regularity if these lines were not
initialisms corresponding to one or more short English texts.
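
The within-line variant only changes how the ngrams are generated and how the shuffles are scored. A sketch under the same assumptions as before follows; shuffling all of the letters and then re-cutting them into the original line lengths is my own reading of how the shuffles are handled.

import random

def line_ngrams(lines, n):
    """Ngrams generated within each line only, never across line breaks."""
    grams = []
    for line in lines:
        grams.extend(line[i:i + n] for i in range(len(line) - n + 1))
    return grams

def mean_line_ngram_rank(lines, ranks, n):
    """Mean corpus rank score of the within-line ngrams."""
    grams = line_ngrams(lines, n)
    return sum(ranks.get(g, 0.0) for g in grams) / len(grams)

def shuffled_lines(lines, rng):
    """Shuffle all letters, then re-cut them into the original line lengths."""
    letters = list("".join(lines))
    rng.shuffle(letters)
    out, pos = [], 0
    for line in lines:
        out.append("".join(letters[pos:pos + len(line)]))
        pos += len(line)
    return out

def line_shuffle_percentile(lines, ranks, n, num_shuffles=1000, seed=0):
    """Percentile of the original lines' score among shuffled versions."""
    rng = random.Random(seed)
    target = mean_line_ngram_rank(lines, ranks, n)
    below = sum(
        mean_line_ngram_rank(shuffled_lines(lines, rng), ranks, n) < target
        for _ in range(num_shuffles)
    )
    return 100.0 * below / num_shuffles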
And so, the focus turns deeper: given that the Tamam
Shud cipher very likely is an initialism of some short English text(s), what
does it say, and what does that say about the case? The story continues in my
next post.