Monday, March 17, 2014
Murder and NLP: The Taman Shud Case, Part 3
In previous posts, I've shown that the Tamam Shud Cipher (TSC) is almost certainly an initialism, a sequence of initials corresponding to English text(s), and that a large corpus of prominent literature does not contain the text from which the TSC was derived. This leads to the next question: Given the strong possibility that the text was written by someone (presumably non-famous) who handled the book in which it was found, is it possible for us to decode it? Are initialisms based on English text in general decodable? In my discussion here, I interpret the potentially ambiguous handwriting in a certain common way. Other interpretations may be valid, and we can consider that statistically, but I focus for now on this interpretation of the handwriting:
Example TSC Readings
Consider the following:
We rarely go onto Australia’s beaches and bed down.
Wade the beaches in majestic peace and nervously enter the Pacific.
My love, I am blessedly opened, and I am quite certain
In the truth. Mercifully, the sleeper awakens me, stirs the girl, and blossoms.
Western radar groups operate Australian bases and base defenses.
Weapons testing base is militarily prepared. American nationals electrified the perimeter. Military liaisons in Adelaide boarded on an international aircraft. Queensland considering increases to territorial military trainees. South American mercenaries sent to Guam and Borneo.
Both of these texts are lucid (if not sparkling) English that corresponds to the TSC (give or take alternate interpretations of the handwriting). They took me about 15 minutes each to write. One is a love poem (somewhat like the text of the Rubaiyat in which the TSC was found) and one looks like something a Soviet spy might send to his superiors. A group of writers could surely come up with endless TSC solutions on these or practically any other themes and probably never duplicate each other's work. The simple fact demonstrated here is that the TSC, like most initialisms, cannot be decoded into the original text because it has virtually limitless solutions. Therefore, if the TSC is not found in some previously existing text, the source text that generated it will never be known.
It is easy to lay out, analytically, why most initialisms from English text have endless numbers of solutions. Most of what I say here will apply to a great number of other languages, but I discuss English specifically.
English vocabulary falls into two rough subsets. There are function words, generally drawn from closed classes of words, and content words, generally drawn from open classes of words. Nouns, verbs, adjectives, and adverbs are content words. They comprise the vast majority of all of the words in English. Pronouns, conjunctions, prepositions, and articles are function words. Those parts of speech have only about 3 to 90 words each.
For all but the rarest letters in English, you can come up with a very long list of words in any of the content word classes that begin with that letter. For the function words, this is not possible. For example, there are only a handful of coordinating conjunctions, chiefly and, or, and but. There are no coordinating conjunctions in English that begin with ‘z’, ‘t’, or ‘e’.
Take any typical passage in English text, and write down the initialism. You may then freely generate almost endless alternative readings of the initialism by changing the content words to other examples of the same part of speech.
"The platypus is one of the strangest animals in Australian fauna."
"The pioneer is one of the strongest archetypes in American fiction."
"The pastegh is one of the sweetest aliments in Armenian food."
It's easy to do this with the content words in virtually any sentence. So even if we kept the function words the same (as in the previous examples), we can still generate many sentences with the same initialism and warp the meaning entirely.
It is relatively difficult to play the same trick with function words, because there are so few options to swap in. However, we could choose different function words in different locations, in effect moving the pivot of function words to another location, and then manipulate the content words in their new positions.
"Tommy, play in our old toyroom since Annie is acting funny."
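A minimal Python sketch can confirm that the four example sentences above really do share one initialism. The function name initialism and its tokenizing rule (a word is a run of letters and apostrophes, so a hyphenated word counts as two words) are assumptions made for illustration, not a description of how the TSC analysis itself was carried out.

```python
import re

def initialism(text: str) -> str:
    """Return the upper-cased first letter of every word in `text`.

    Assumption: a "word" is a run of letters and apostrophes, so
    punctuation is ignored and hyphenated words split into two words.
    """
    words = re.findall(r"[A-Za-z][A-Za-z'’]*", text)
    return "".join(word[0].upper() for word in words)

sentences = [
    "The platypus is one of the strangest animals in Australian fauna.",
    "The pioneer is one of the strongest archetypes in American fiction.",
    "The pastegh is one of the sweetest aliments in Armenian food.",
    "Tommy, play in our old toyroom since Annie is acting funny.",
]

for sentence in sentences:
    print(initialism(sentence), "<-", sentence)
```

Every line prints the same eleven letters, even though the last sentence puts its function words in different positions from the first three.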
Most function words begin with relatively common letters in English, which is true almost by definition. These letters can be used in other words, content or function, and so the location of function words in an initialism cannot be pinned down.
There are rare letters that could greatly restrict the freedom of recombination shown above. A sequence like XXQXX in an initialism might actually have no valid readings in English. However, the TSC has only one rare letter, Q. Does the Q, or any other pattern in the TSC, significantly constrain the range of possibilities for TSC readings? If we can't determine the reading of the TSC with absolute certainty, can we meaningfully narrow it down?
Learning from Examples
Using the same Project Gutenberg corpus which was searched for matches of TSC substrings, we can search for shorter substrings to see which phrases might match them. Substrings of length 6 are useful for providing multiple matches (at least 7 unique readings) for each position in the TSC. If the particular letters in the TSC constrain the possible readings significantly, then we should see repeated patterns in the Gutenberg matches.
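The kind of search described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function find_readings, its tokenizing rule, and the two-sentence toy_corpus are all invented for the example and stand in for the real Gutenberg corpus and whatever tooling was actually used.

```python
import re

def find_readings(corpus_text: str, pattern: str) -> list[str]:
    """Return every run of consecutive words whose initials spell `pattern`.

    Assumption: words are runs of letters and apostrophes, and matches
    may cross sentence boundaries.
    """
    words = re.findall(r"[A-Za-z][A-Za-z']*", corpus_text)
    # One character per word, so an index into `initials` is also a word index.
    initials = "".join(word[0].upper() for word in words)
    pattern = pattern.upper()
    readings = []
    start = initials.find(pattern)
    while start != -1:
        phrase = " ".join(words[start:start + len(pattern)]).lower()
        readings.append(phrase)
        start = initials.find(pattern, start + 1)  # allow overlapping matches
    return readings

# Hypothetical toy corpus; NOT taken from Project Gutenberg.
toy_corpus = (
    "You would do well to bear in mind that the road is long. "
    "Dogs will try barking in moonlight, or so they say."
)

print(find_readings(toy_corpus, "DWTBIM"))
```

On the toy corpus this finds two different phrases for the same six letters. Run over a large corpus, the same loop would produce the list of hits per substring that the rest of this section tallies.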
It should be noted first that the Gutenberg corpus has multiple copies of some texts within it, which inflates the counts unnaturally. This observation notwithstanding, there is no substring of length 6 in the TSC for which any single reading accounts for the majority of its Gutenberg hits. In other words, whichever reading we guess to be correct, it is wrong in the majority of cases (over 60%, in fact). Therefore, the would-be sleuth who writes a reading of the TSC and feels that their match is sure to be right is being seduced by the fallacy that few other texts could match the cipher as well as the solution they have in mind. The reading that achieves the best match is "do well to bear in mind", which still covers only 38% of the matches for DWTBIM and is one of 59 different readings found in the Gutenberg corpus. For any 6-gram in the TSC, whichever guess you offer for the correct reading, you will probably guess wrong.
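The share of hits covered by the most common reading is a simple tally. The sketch below shows the calculation on a handful of invented hits for DWTBIM. The function top_reading_share and the toy list are assumptions for illustration, and the numbers they produce are not the real Gutenberg figures.

```python
from collections import Counter

def top_reading_share(hits: list[str]) -> tuple[str, float, int]:
    """Return (most common reading, its share of all hits, number of unique readings)."""
    if not hits:
        raise ValueError("need at least one hit")
    counts = Counter(hits)
    reading, count = counts.most_common(1)[0]
    return reading, count / len(hits), len(counts)

# Invented hits for DWTBIM, with duplicates left in (as in a corpus
# that contains multiple copies of some texts).
toy_hits = (
    ["do well to bear in mind"] * 3
    + ["did we then believe in miracles"] * 2
    + [
        "dogs will try barking in moonlight",
        "dear william took bread in milk",
        "down went the boat in minutes",
    ]
)

reading, share, n_unique = top_reading_share(toy_hits)
print(f"best reading: {reading!r}")
print(f"share of hits: {share:.1%}; chance a guess of it is wrong: {1 - share:.1%}")
print(f"unique readings: {n_unique}")
```

Whenever the best share is below one half, guessing even the single most popular reading is wrong more often than it is right, which is the situation reported for every 6-gram in the TSC.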
Can we do better by trying to pin down exact words? Using the 6-gram readings from the Gutenberg corpus, we can tally (counting each reading just once, even if it appeared multiple times) how often particular words fill each position in the TSC, and call the share that each word has for that position its derived probability. In these values, we see the same inherent ambiguity as indicated above. Every position allows at least 7 different words to stand for that letter, and in very few cases does the most common word have a share above 20%, meaning that whatever word we guess in that position, our guess will probably differ from the original text. There is just one case where the most common word rises slightly above 50%: the A in position 24 is filled 53% of the time by the article “a”, a poor, and in any case ambiguous, starting point for interpreting the text. Almost all of the 33 derived probabilities that exceed 20% belong to extremely common function words: “a”, “the”, “and”, “in”, “of”, etc. These give not even the slightest indication of topic, genre, or even tense or person.
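As a rough sketch of the tally just described, the following Python computes per-position word shares for a single 6-letter window, counting each distinct reading once. The function derived_probabilities and the toy readings are assumptions for illustration. In particular, how shares from overlapping windows were combined into one value per TSC position is not specified here, so the sketch does not attempt it.

```python
from collections import Counter

def derived_probabilities(readings: list[str]) -> list[dict[str, float]]:
    """For one window, return a {word: share} dict for each position.

    Each distinct reading is counted once, however often it was found.
    Assumption: every reading has the same number of words.
    """
    unique = sorted(set(readings))
    if not unique:
        raise ValueError("need at least one reading")
    width = len(unique[0].split())
    tallies = [Counter() for _ in range(width)]
    for reading in unique:
        for position, word in enumerate(reading.split()):
            tallies[position][word] += 1
    return [
        {word: count / len(unique) for word, count in tally.items()}
        for tally in tallies
    ]

# Invented readings for DWTBIM; the repeated entry is deliberately deduplicated.
toy_readings = [
    "do well to bear in mind",
    "do well to bear in mind",
    "did we then believe in miracles",
    "dogs will try barking in moonlight",
    "dear william took bread in milk",
    "down went the boat in minutes",
]

for letter, shares in zip("DWTBIM", derived_probabilities(toy_readings)):
    best_word, best_share = max(shares.items(), key=lambda item: item[1])
    print(f"{letter}: {len(shares)} candidate words, top is {best_word!r} at {best_share:.0%}")
```

In this toy data the only position with a dominant word is the I, filled by the function word "in". Even on invented data, that echoes the pattern in which the high-share words carry almost no information about content.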
Four words have a derived probability between 20% and 34% and provide a slight indication of tense and person: There are two such occurrences of “my” (positions 14 and 21), and one each of “is” (position 23) and “am” (position 29). These effectively indicate three votes for the correct TSC reading being written in the first person and two votes for the present tense. These are difficult to interpret as probabilities, however, since each of these votes is, in any case, less than 50% probable, and there’s no clear prior probability of tense and person for a random unknown text. It should be noted that a first person text still contains many third person references, and a text that is primarily in past or present tense may still contain many instances of the other. Therefore, we have a glimmering of an indication that the TSC may be an initialism of a first person text, but this is far from conclusive, and in no other way helpful regarding the content. One related observation: There are no occurrences of Y in the TSC, so there are no second person pronouns (“you”, “your”, “yours”), although the second person can be spoken of through circumlocution without those words.
Finally, the derived probability of “quite” for the Q in position 30 is 43%, high among the derived probabilities, but still short of 50%, and utterly ambiguous regarding genre, topic, or content.
Cumulatively: the evidence that the TSC is an initialism is conclusive, no source text has been identified in literature, and unless the source text turns up in an older source, the TSC cannot be decoded into the original text.
This is, most importantly, a strongly negative result for those who have hoped that the TSC could be deciphered, helping to solve the mystery of the Somerton Man, which is still quite a bizarre story even if the TSC is left aside. It still leaves an intriguing situation: the TSC, which came to light because of the Somerton Man, is effectively a second mystery, one that might have been found and glossed over had it not been associated with a dead body. There are doubtless countless books in the world’s libraries that have mysterious scribbles inside, and no one pays them any particular attention. (I have found some in my older books, written in my own hand, mysterious to me years after they were written.) However, since the case has gotten so much attention, I’ll devote one more post to examining what the TSC might represent: why an initialism might have been written, and what possibilities, however slight, remain for obtaining a definite reading of its content.