Saturday, March 8, 2014

Murder and NLP: The Taman Shud Case, Part 2

In my last post, I showed that the Tamam Shud Cipher (TSC) is almost certainly an initialism, a sequence of initials taken from English text. So what does it say, and why was it written?

First, note two hypotheses about the other facts of the case, which are not mutually exclusive:

H1) That Somerton Man was a Cold War spy who was operating for the Soviets or perhaps the Americans.
H2) That the nurse whose phone number was found in the book was a potential lover of the Somerton Man, and the Rubaiyat was given to him by her to share the romantic aspects of the poetry. She admitted having given another copy of the Rubaiyat to another man in 1945. That man and book both turned up during the investigation.

In that light, some possibilities for the source text behind the Tamam Shud Cipher:

P1) Passage(s) from existing literature.
P1a) Written out to help the writer memorize that literature.

P2) Something written by a person who once had the book.
P2a) A message sent by a Cold War spy.
P2b) A message written out for a friend or lover to read.
P2c) A rough draft for original poetry or prose using initials as shorthand.
P2d) Written out to help the writer memorize their own composition.

P3) Scratchwork for solving a puzzle.

We can investigate each of these possibilities, although definitive answers may prove elusive. I'll devote the rest of this post to beginning an examination of possibility (P1).

If the TSC can be found, in its entirety, in the initials of any previously-existing work of literature, then we can be sure we've found the answer. Why? The probability of two texts sharing the same initials is a function of their length, L, in words, and is approximately:

pMatch = 1 / 10^L

The number 10 appears here because the probability that two randomly chosen initials from English text are equal is about 1/10, not the 1/26 you would see if all letters were equally likely. The exact parameter can be derived experimentally and certainly isn't exactly 10.0, but it's close enough that I use 10 to make the math easier to follow. The shortest line in the TSC has 9 letters and the longest has 13. A random passage of 9 words matches the shortest line with a probability of about one in a billion, while a passage matching the longest line has a probability of about one in ten trillion. Meanwhile, the probability of an accidental match for the entire 44-letter TSC is a dicey 10^-44, which is essentially zero. If we find any existing text that matches the entire TSC, then it is definitely the text used to produce the TSC.
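As a rough check on this arithmetic, here is a minimal Python sketch. The function names (initials, collision_probability, p_match) are hypothetical and chosen only for illustration. The collision probability is the sum of the squared letter frequencies, so a real estimate of the 1/10 figure would need a large sample of English text, which is not included here.

```python
from collections import Counter
from fractions import Fraction
import re


def initials(text):
    """Return the uppercase first letter of every word in text."""
    return [w[0].upper() for w in re.findall(r"[A-Za-z][A-Za-z']*", text)]


def collision_probability(letters):
    """Chance that two independently drawn initials are equal: sum of p_i^2."""
    counts = Counter(letters)
    total = sum(counts.values())
    return sum(Fraction(c, total) ** 2 for c in counts.values())


def p_match(length, per_letter=Fraction(1, 10)):
    """Chance that a random passage of `length` words matches given initials.

    per_letter defaults to the post's rounded figure of 1/10 (an approximation).
    """
    return per_letter ** length


# Shortest line, longest line, and the whole cipher.
for n in (9, 13, 44):
    print(n, float(p_match(n)))
```

To estimate the per-letter figure from real text, one would pass collision_probability(initials(sample)) as per_letter, where sample is a large body of English prose.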

Note: This does not mean that if a person tries to write out their own original match for the TSC and succeeds that they have found the answer. In fact, it's not that hard to write text to match a given set of initials, which produces the illusion that someone can make quick and easy progress towards figuring it out. I will show in a later post how easy it is to concoct bogus matches after the fact.

Of course, finding matches for sufficiently short substrings is easy. Matches for substrings of length 1 and 2 are trivial to find, and by searching larger volumes of text, one finds matches of length 3, 4 and so on.

Others have previously searched for exact matches for the TSC in a handful of major literary works, including the King James Bible. After an initial search through digital copies of 40 books and 18 collections of poetry turned up nothing, I conducted a much larger search, as follows:

Project Gutenberg creates digital transcriptions of literary works. For my search of TSC matches, I downloaded by torrent the April 2010 collection of 29,500 books. Of these, 22,353 books are in text format. I preprocessed these to create an index of their initialisms, which amounts to 1.3 billion words of original text. I searched this for the longest matches that exist for TSC substrings, to arrive at these results:
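The search described above can be sketched roughly as follows. This is not the code used for the actual search. It is a minimal illustration that assumes each book has already been read into a string, that a "word" is any run of letters and apostrophes, and that the corpus is small enough for plain substring tests. The names initialism and longest_matches are hypothetical.

```python
import re


def initialism(text):
    """Reduce a text to the uppercase first letters of its words."""
    return "".join(w[0].upper() for w in re.findall(r"[A-Za-z][A-Za-z']*", text))


def longest_matches(code, corpus, min_len=3):
    """Find the longest substrings of `code` that occur in any indexed book.

    corpus maps a book title to its initialism string.
    Returns (length, [(substring, title), ...]), or (0, []) if nothing
    of at least min_len letters matches.
    """
    for length in range(len(code), min_len - 1, -1):
        hits = []
        for start in range(len(code) - length + 1):
            sub = code[start:start + length]
            for title, inits in corpus.items():
                if sub in inits:
                    hits.append((sub, title))
        if hits:
            # Stop at the first (longest) length that has any match.
            return length, hits
    return 0, []
```

Because the initials are taken across the whole text, matches can run over sentence boundaries, which is one reason to treat short matches as coincidental.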

R1) There is no exact match for the entire TSC, or any of its individual lines.
R2) The longest matches are of length 8, and there were twelve of these. It is apparent that these are entirely coincidental for the following reasons:

a) Many of them contradict one another, by matching the exact same subsequence of TSC.
b) They each begin in the middle of a sentence, and continue on into the middle of the next sentence.
c) We would expect, from the aforementioned formula, to find about ten matches of length L=8 and one match of length L=9 by sheer coincidence.
d) One of the matches is from 1963, many years after TSC was written down.

The matches were from nine works of literature, and one work each from science, economics, and reference. The matches are posted separately here.

In a nutshell, the Project Gutenberg corpus does not contain a match, and this should set the tone for any future searches. The works in this corpus are not only large in number but particularly central in literary importance. It is hard to characterize "literary importance" formally, but the reach of this corpus is impressive. When one thinks of authors predating 1950 who are likely to be taught in university literature courses, the number of works this corpus has is vast, although not comprehensive. For many famous works, it includes more than one edition. There are also translations of English works into other languages (although, recall, TSC is almost certainly an initialism of English text) and translations of literature in other languages into English.

Until a match is found, we can never prove that there is no match. Whatever corpus of n books we search, it's always possible that the exact match will be found in book n+1. It remains possible that TSC matches a book, poem, newspaper article, or other text not in this Gutenberg Corpus, but a lack of matches in 22 thousand books certainly suggests that searching any additional one – or one hundred – books chosen arbitrarily will yield a very low probability of finding a full, exact TSC match.

In an upcoming post, I'll discuss possibility (P2), and how to use a large corpus (now that we have one) as a resource for trying to decode the TSC. We can perhaps find information about the words or genre that are encoded within it. And that will also open up paths to searching for exact matches.


Barry Traish said...

I too used Project Gutenberg to look for initialisms with the SM code. I did find some nine character matches. My results are below.

I downloaded the 45,000 out of copyright etexts created by Project Gutenberg, and converted them to a file of initials of the first letter of each word. I then looked for matches with the SM code. I searched for all of the eight, nine and ten character strings within the code (e.g. MRGOABAB, RGOABABD, GOABABDM, etc). There were 41 matches (all listed below).

I have assumed that the code is a continuous string, not four lines. I have ignored the crossed out half line. Some of the code letters are ambiguous (first letters of first and second lines, and the G/C third from the end), so all eight permutations were looked for.
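A minimal sketch of how those search strings could be generated, in Python. The names expand, windows and TEMPLATE are hypothetical. TEMPLATE is the commonly cited transcription of the code with the three ambiguous letters written in brackets; it agrees with the substrings quoted in this comment, but treat it as an assumption rather than an authoritative reading.

```python
import re
from itertools import product


def expand(template):
    """Expand bracketed alternatives like [MW] into every concrete reading."""
    parts = re.findall(r"\[([A-Z]+)\]|([A-Z])", template)
    choices = [alts if alts else single for alts, single in parts]
    return ["".join(combo) for combo in product(*choices)]


def windows(template, lengths=(8, 9, 10)):
    """Collect every substring of the given lengths across all readings."""
    found = set()
    for code in expand(template):
        for n in lengths:
            for i in range(len(code) - n + 1):
                found.add(code[i:i + n])
    return found


# Assumed transcription: continuous string, crossed-out half line ignored.
TEMPLATE = "[MW]RGOABABD[MW]TBIMPANETPMLIABOAIAQCITTMTSAMST[GC]AB"
```

Each string returned by windows(TEMPLATE) would then be looked up in the file of initials.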

There were no matches to strings of 10 characters, and three matches to nine characters (shown with a star). The rest were eight character matches. I have excluded the matches with eight characters where these are part of a nine. No etext had more than one string in it.

Most of the Project Gutenberg texts are English, but a percentage are other European languages. There were no matches to non-English texts.

Despite quite a lot of poetry contained within Project Gutenberg, none of the matches was to a poem. All of the original texts can be easily found with Google.

Of the matches, 80% use at least one character from the last line of the code, and 66% are entirely on the last line. This is similar to the match against Wikipedia, implying that characters in the earlier lines are much harder to match to English text. In fact, statistically the first two lines are extremely difficult to match.

Which of the ambiguous characters matches best?
M1 vs W1 0:0
M2 vs W2 0:2
G vs C 5:7

The matches:
OABABDWT of a brighter and better day, when the
DWTBIMPA dynasty. When these became inevitable, M. Perier attached
TPMLIABO that point. My life is a bad one
LIABOAIA lad is a brave one, and I am
LIABOAIA literal inflicting a blow on an individual, and
LIABOAIA looked into a book of any importance, as
IABOAIAQ is a beautiful one, and I am quite
IABOAIAQ is as badly off as I am," quivered
CITTMTSA care I took to make their stay at
CITTMTSA care is taken to make the strokes as
CITTMTSA castes. In the Tanjore Manual, the Shanans are
CITTMTSA Church in this town, Mr. Thomas Smallwood, an
CITTMTSA contemplating in turn the marshes, the sea, and
CITTMTSA conveying it to their master. The Sultan asked
ITTMTSAM I thought to myself that such a man
ITTMTSAM In talking to men--to such a man
ITTMTSAM in the textile, metal, transport, shipping, and machine
ITTMTSAM is that the men that stand around Me
ITTMTSAM it together, that Miss Thorpe should accompany Miss
ITTMTSAM itching to take me to see a man
TTMTSAMS tend to make them soft and mushy. Strawberries
TTMTSAMS than twenty miles.... There soon after midnight.... Steal
TTMTSAMS that transported me: To see a mind so
TTMTSAMS to the metropolis, to seize, at Maunsell's shop
TTMTSAMS treat the matter too seriously, and merely said
TTMTSAMS Tshaka the Mighty, the swift and merciful stroke
TTMTSAMST* the tetragonal minerals tapiolite (= skogbolite) and mossite, so that
TMTSAMST that makes the sun and moon seem to
TMTSAMST to make their saloon a market, so that
MTSAMSTCA* me to stay; and, merely stopping to cast a
MTSAMSTG motioned the stenographer and Miss Snow to go
TSAMSTCA the sideboard; ask my sister to come and
TSAMSTCA the soldiers any more." So the child and
TSAMSTCA the stronger, and more slimy) the Cores and
TSAMSTCA their 'speech,' and 'made strange their counsel.' All
TSAMSTCA to seeke a more safe, then commodious abode
TSAMSTCAB* the scene. After mutual salutations the commissioners asked: "By
TSAMSTGA the same. All men seek to get as
TSAMSTGA the sincere among My servants to gain admittance
TSAMSTGA then summoned all my strength to gaze and
SAMSTGAB Street and Main Street, the grassy area between

John Rehling said...

That's great work, Barry! By pushing the corpus a bit larger, you got matches of length 9, which equal the length of the shortest TSC line.

This large corpus should be characterized for how well it captures the most prominent works from which TSC might have been borrowed. I've seen that some authors are extremely well covered, while others are covered very spottily.

You make a great point about the earlier lines being more poorly covered: Some initialism ngrams are certainly rarer than others. However, I feel that with a large enough corpus, we would be doomed to succeed in finding matches for each line of the TSC, although a match of the whole 44 is unlikely to occur in the history of written English unless it was linked to the TSC deliberately. (In my next post, I offer a couple that I wrote just today.)

Barry Traish said...

Creating a text to match to the code deliberately (like Gerry Feltus' "It's Time To Move To South Australia Moseley Street") is like publishing a list of googlewhacks - the act of publication changes the results. I've created my own initialism for the code, using only words used in the Rubaiyat:

My road goes on, and by and by divides,
Now two branches, into morning, past a new evening that provides,
My love is a barren oblivion, and itself alone quite certain,
It's time to move the soul among magic stars, then gently asleep besides.