Saturday, March 8, 2014
Murder and NLP: The Taman Shud Case, Part 2
In my last post, I showed that the Tamam Shud Cipher (TSC) is almost certainly an initialism, a sequence of initials taken from English text. So what does it say, and why was it written?
First, note two hypotheses about the other facts of the case, which are not mutually exclusive:
H1) That Somerton Man was a Cold War spy who was operating for the Soviets or perhaps the Americans.
H2) That the nurse whose phone number was found in the book was a potential lover of the Somerton Man, and the Rubaiyat was given to him by her to share the romantic aspects of the poetry. She admitted having given another copy of the Rubaiyat to another man in 1945. That man and book both turned up during the investigation.
In that light, some possibilities for the source text behind the Tamam Shud Cipher:
P1) Passage(s) from existing literature.
P1a) Written out to help the writer memorize that literature.
P2) Something written by a person who once had the book.
P2a) A message sent by a Cold War spy.
P2b) A message written out for a friend or lover to read.
P2c) A rough draft for original poetry or prose using initials as shorthand.
P2d) Written out to help the writer memorize their own composition.
P3) Scratchwork for solving a puzzle.
We can investigate each of these possibilities, although definitive answers may prove elusive. I'll devote the rest of this post to beginning an examination of possibility (P1).
If the TSC can be found, in its entirety, in the initials of any previously-existing work of literature, then we can be sure we've found the answer. Why? The probability of two texts sharing the same initials is a function of their length, L, in words, and is approximately:
pMatch = 1 / 10^L
The number 10 occurs here because the probability of two randomly chosen initials from English text being equal is about 1/10, not the 1/26 you would see if all letters were equally likely. The exact parameter may be derived experimentally and certainly isn't exactly 10.0, but it's so close that I use 10 to make the math easier to follow. The shortest line in the TSC has 9 letters and the longest has 13. A random passage of 9 words matches the shortest line with a probability of about one in a billion (10^-9), while a random passage of 13 words matches the longest line with a probability of about one in ten trillion (10^-13). Meanwhile, the probability of an accidental match for the entire 44-letter TSC is a minuscule 10^-44, which is essentially zero. If we find any existing text that matches the entire TSC, then it is definitely the text used to produce the TSC.
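The 1/10 figure and the match probabilities above can be checked with a short script. What follows is a minimal Python sketch, not the code used for this post: the function names (initials, collision_probability, match_probability) and the simple word-splitting rule are assumptions made for illustration. The collision probability is the sum of the squared relative frequencies of each initial, which is the chance that two initials drawn at random from the same text are equal.

```python
import re
from collections import Counter


def initials(text):
    """Return the upper-cased first letter of every word in text."""
    # Assumption: a "word" is a run of letters, optionally containing apostrophes.
    words = re.findall(r"[A-Za-z][A-Za-z']*", text)
    return "".join(word[0].upper() for word in words)


def collision_probability(text):
    """Chance that two initials drawn at random from text are equal."""
    counts = Counter(initials(text))
    total = sum(counts.values())
    return sum((count / total) ** 2 for count in counts.values())


def match_probability(length, base=10.0):
    """Approximate chance that a random passage of `length` words
    has the same initials as a given string of `length` letters."""
    return base ** -length


if __name__ == "__main__":
    for length in (9, 13, 44):
        print(length, match_probability(length))
```

Running collision_probability on a large sample of English text and taking the reciprocal gives an experimental value to use in place of the round figure of 10.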
Note: This does not mean that a person who succeeds in writing their own original text to match the TSC has found the answer. In fact, it's not that hard to write text to match a given set of initials, which produces the illusion that someone can make quick and easy progress towards figuring it out. I will show in a later post how easy it is to concoct bogus matches after the fact.
Of course, finding matches for substrings is easy if the substrings are short enough. Obviously, matches for substrings of length 1 and 2 are trivial to find, and by searching larger volumes of text, one finds matches of length 3, 4 and so on.
Work has been done in the past on trying to find exact matches for the TSC in a handful of major literary works, including the King James Bible. After an initial, unsuccessful search through digital copies of 40 books and 18 collections of poetry, I conducted a much larger search, as follows:
Project Gutenberg creates digital transcriptions of literary works. For my search of TSC matches, I downloaded by torrent the April 2010 collection of 29,500 books. Of these, 22,353 books are in text format. I preprocessed these to create an index of their initialisms, which amounts to 1.3 billion words of original text. I searched this for the longest matches that exist for TSC substrings, to arrive at these results:
R1) There is no exact match for the entire TSC, or any of its individual lines.
R2) The longest matches are of length 8, and there were twelve of these. It is apparent that these are entirely coincidental for the following reasons:
a) Many of them contradict one another, by matching the exact same subsequence of TSC.
b) They each begin in the middle of a sentence, and continue on into the middle of the next sentence.
c) We would expect, from the aforementioned formula, to find about ten matches of length L=8 and one match of length L=9 by sheer coincidence.
d) One of the matches is from 1963, many years after TSC was written down.
The matches were from nine works of literature, and one work each from science, economics, and reference. The matches are posted separately here.
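The search described above (reduce each text to its initials, then look for the longest substrings of the cipher within them) can be sketched in a few lines of Python. This is an illustrative sketch only, not the program used for the Gutenberg search: the function names, the word-splitting rule, and the toy cipher string in the example are all assumptions, and a brute-force scan like this would need a proper index to handle 1.3 billion words.

```python
import re


def initials(text):
    """Return the upper-cased first letter of every word in text."""
    # Assumption: a "word" is a run of letters, optionally containing apostrophes.
    words = re.findall(r"[A-Za-z][A-Za-z']*", text)
    return "".join(word[0].upper() for word in words)


def longest_matches(cipher, corpus_initials):
    """Find the longest substrings of cipher that occur in corpus_initials.

    Returns (length, hits), where each hit is a pair
    (start position in cipher, start position in corpus_initials).
    Returns (0, []) if not even a single letter matches.
    """
    # Try the longest candidate substrings first and stop at the first length that hits.
    for length in range(len(cipher), 0, -1):
        hits = []
        for start in range(len(cipher) - length + 1):
            piece = cipher[start:start + length]
            pos = corpus_initials.find(piece)
            while pos != -1:
                hits.append((start, pos))
                pos = corpus_initials.find(piece, pos + 1)
        if hits:
            return length, hits
    return 0, []


if __name__ == "__main__":
    # Toy example: a made-up cipher string, not the TSC itself.
    book = "It was the best of times, it was the worst of times"
    print(longest_matches("XXTBOTQ", initials(book)))
```

Each hit records where the match starts in the corpus, so a coincidental match can be recognized by looking up the surrounding words, for example to see whether it begins in the middle of a sentence.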
In a nutshell, the Project Gutenberg corpus does not contain a match, and this should set the tone for any future searches. The works in this corpus are not only large in number but particularly central in literary importance. It is hard to characterize "literary importance" formally, but the reach of this corpus is impressive. When one thinks of authors predating 1950 who are likely to be taught in university literature courses, the number of works this corpus has is vast, although not comprehensive. For many famous works, it includes more than one edition. There are also translations of English works into other languages (although, recall, TSC is almost certainly an initialism of English text) and translations of literature in other languages into English.
Until a match is found, we can never prove that there is no match. Whatever corpus of n books we search, it's always possible that the exact match will be found in book n+1. It remains possible that TSC matches a book, poem, newspaper article, or other text not in this Gutenberg corpus, but a lack of matches in 22 thousand books certainly suggests that searching any additional one – or one hundred – books chosen arbitrarily has a very low probability of yielding a full, exact TSC match.
In an upcoming post, I'll discuss possibility (P2), and how to use a large corpus (now that we have one) as a resource for trying to decode the TSC. We can perhaps find information about the words or genre that are encoded within it. And that will also open up paths to searching for exact matches.