Saturday, March 29, 2014

Murder and NLP: The Taman Shud Case, Part 4

My three previous posts on the Tamam Shud Cipher (TSC) have supported, to varying degrees of certitude, the following conclusions:

Given these conclusions, what can we say about why the TSC was written, and what paths remain to investigate further?


P1) It has been supposed by some that the Somerton Man was a spy and the TSC was a message he intended to send to his superiors. This is certainly not possible, because initialisms are not decodable and a professional spy would know this. There are many methods of encoding a message so the intended audience can read it but someone who intercepts it cannot. Initialisms aren't such a method.

P2) Was the TSC a personal message shown, as it was written, to the intended reader?

In Tolstoy's Anna Karenina there is a scene where two lovers communicate through initialisms. A central character, Levin, writes 14 initials on a table using a piece of chalk, showing them to Kitty, the young woman who has earlier rejected his marriage proposal. After a few moments, she interprets the message correctly, as the initialism (in Russian) of a question he would like to ask her about her rejection of his proposal. She responds to him by writing a shorter initialism as her reply, and with snippets of conversation between them, the two communicate briefly by showing initialisms to one another. A third character walks in and seems to recognize what they are doing as a game called "Secrétaire." In this scene, Levin and Kitty show the ability to read initialisms, aided by the questions, hints, and glances they share in person. Is this possible? Was the TSC shown to a lover by the person who was writing it, line by line, as it was written?

Of course, Anna Karenina is fiction. Tolstoy could make anything he wanted happen in the scene. The reader is meant to understand, in fact, that this feat of understanding is incredible, and that it shows how wonderful a match Levin and Kitty are, that they can understand each other in this seemingly impossible situation. Moreover, there is extraordinary context at play: Kitty is easily able to guess the general topic of Levin's question, which is about her feelings, from the context of their previous conversations and the very fact that Levin seems restrained from speaking freely. He makes a direct reference to an earlier conversation of obvious importance to their relationship, so she can more easily guess which ideas, and therefore which words, are in his message. Of course, with enough verbal or nonverbal clues provided in person by the writer to the reader (even some or most of the exact words), an initialism could be read, but such an event is not recorded in any way and we cannot know whether this took place.

P3) Was the TSC scratch work, using the book as blank paper, for someone's efforts to solve a puzzle? It's common to find scratch work written in the margins where someone has tried to solve newspaper puzzles. Is the TSC such a remnant, for some puzzle that appeared in another book or newspaper?

This is possible in principle, but it's not clear what sort of puzzle would lead someone to write so many initials of a long text. There is a syndicated puzzle called Jumble where the solution of several word scrambles provides a few letters to decode a message of several words. The TSC is much longer and much more orderly than the scratch work for Jumble is likely to be, but could a puzzle like Jumble, only longer, have been in the hands of someone in Australia in 1948?

Many Australian newspapers have been stored in digital archives online. A search of Australian papers from November 1948 reveals primarily crossword puzzles, which were regular features in the Adelaide Chronicle, Sydney Morning Herald, Sydney World News, Perth West Australian, and Perth Western Mail. The Longreach Leader had riddles and word scrambles. The Murray Pioneer had weekly word "diamond" and "square" puzzles, similar to crossword puzzles with four or five clues and no unused spaces. None of these would seem to lead a person to write initialisms of nine or more letters as their scratch work.

We might extend the puzzle search arbitrarily far from Australian newspapers, by searching newspapers from the United States (where one of Somerton Man's garments seemed to originate) or other countries, going backwards in time from 1948, and considering books as well as newspapers. The possibility of the TSC as puzzle scratch work may have no definite resolution, as some of these publications are unlikely to survive online or otherwise. However, it would be interesting if any puzzle could be found that would require the solver to write an initialism of nine or more letters as the answer or as scratch work.

P4) Is the TSC a rough draft of original prose or poetry? Literary writing invites the writer to create rough drafts. Conceivably, an initialism could be used as a form of shorthand, which would bypass the problem of a reader decoding the message, because the only intended reader would be the writer. If a writer had in mind a tentative version of a passage, and was considering small revisions of a word or two, they might be able to write initialisms as possible second, third, etc., drafts, and have no confusion in their mind what the exact intended reading was of each version.

At right, we see a handwritten draft of "Here Comes The Sun" by the Beatles' George Harrison. We see that, like the TSC, Harrison's work is written in a book rather than on blank paper. We also see that six times, phrases which have occurred earlier are written out only as initials, to spare Harrison the effort of longhand. Could the TSC be the sixth, or tenth, or twentieth draft of a poem the author had begun and completed on other sheets of paper?

I find this to be the single most compelling explanation for the TSC. It does not assume that the writer was attempting the impossible task of communicating with another person using initialisms. It explains the crossed-out line and the "X" written over another letter as the continuing process of editing. A literary aspiration on the part of the writer explains why the four-line format of the TSC is similar to that of quatrains in the Rubaiyat's printed text, without matching any of them exactly. Unless we find a match for the TSC in the belongings of a person close to the case, we can never verify whether this is the correct answer. I find it the most likely explanation because it seems to require the fewest unproven assumptions.

P5) Did the writer write the initialism to help memorize or recall a piece of famous literature, to get it straight, or to drill it into their memory? The apparent edits in the TSC could be corrections as someone was trying to get their imperfect memory of a literary passage to align with the verbatim text. This, like (P4), is consistent with the fact that the undecodable nature of initialisms is irrelevant if the writer is the only intended reader. As the search for matches has come up dry in prominent literature, this possibility is diminished, although not eliminated.

P6) Was the TSC a futile, misguided attempt to communicate? Although initialisms are not decodable, it is not clear that the writer of the TSC knew and appreciated that fact. The book was very likely held by a dying man in his final hours. Could it be a suicide note, or the last message from a man who knew that he'd been poisoned by someone else? It's hard to call this impossible, particularly if he was not in a clear state of mind, but it seems extremely odd that a person with that goal would not at least start writing conventional prose rather than initials. Just as the reader of Anna Karenina is supposed to appreciate that initialisms are nearly impossible to read, a literate person might intuit this fact quickly as soon as they began to ask themselves the question.

Paths Ahead

As I see it, the analysis of the TSC is at something approaching an endgame, or, one might say, a stalemate. There are a few possible explanations for why it was written, and although I think (P4) is the most likely, the others cannot be ruled out. Moreover, most of the possible explanations give us no reason to hope that the TSC's underlying message can ever be known. A modest consolation is that the TSC's underlying message may be utterly unrelated to the Somerton Man's death and of no intrinsic interest to us, even if it were known.

I see at least three possible paths for follow-up:

F1) Further Literature Search

While the Project Gutenberg corpus I searched, expanded to additional Gutenberg texts in an excellent effort by Barry Traish, is quite vast, it is not infinite. In an effort to determine how comprehensive that collection is, I examined how many of the Modern Library's editors' list of the 100 best English language novels of the 20th Century are present in the Project Gutenberg corpus. Simply put, all of them published before 1923 were present, and none after that year, a divide that is undoubtedly due to copyright restrictions. It may be that the 25-year window from 1923-1948 is relatively uncovered.

I extended the search to include a collection of poetry by Rumi (which is similar in origin and nature to the Rubaiyat), and also added five novels, including post-1923 titles: Brave New World, Darkness at Noon, The Great Gatsby, A Portrait of the Artist as a Young Man, and Ulysses. These additions to the search produced no interesting results.
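A search of this kind can be sketched roughly as follows. This is only a minimal illustration, not the code used for the search described in these posts: the function names word_initials and longest_matching_run are hypothetical, and the sample text and pattern are toy placeholders rather than the TSC transcription.

```python
import re


def word_initials(text):
    """Return the uppercase first letter of every alphabetic word in text."""
    # Treat apostrophe-joined forms such as "don't" as a single word.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    return "".join(word[0].upper() for word in words)


def longest_matching_run(initials, pattern):
    """Return the longest substring of pattern found contiguously in initials.

    Tries the longest candidates first and returns "" if nothing matches.
    """
    for length in range(len(pattern), 0, -1):
        for start in range(len(pattern) - length + 1):
            piece = pattern[start:start + length]
            if piece in initials:
                return piece
    return ""


# Toy example: a placeholder pattern checked against one sentence.
sample = "It was the best of times, it was the worst of times"
initials = word_initials(sample)
run = longest_matching_run(initials, "XTBOTQ")
```

Adding a new text to the search then amounts to computing its initials once and testing each line of the cipher against them.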

It should be noted that a vast increase to the search is "doomed to succeed": given enough text searched, we will inevitably find longer substring matches by random chance, so we might eventually, with a truly vast collection of text, find coincidental matches to all of the TSC's individual lines. However, no collection of text will produce by mere chance a match to all 44 letters.
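The scale of this effect can be estimated with a rough back-of-the-envelope model. The sketch below assumes, unrealistically, that the initials of successive words are independent, so the expected number of chance matches for a pattern is the number of starting positions in the corpus multiplied by the product of the individual letter frequencies. The function names are hypothetical, and the frequencies would have to be measured from a real corpus.

```python
from collections import Counter
from math import prod


def initial_frequencies(initials):
    """Estimate the relative frequency of each word-initial letter."""
    counts = Counter(initials)
    total = len(initials)
    return {letter: n / total for letter, n in counts.items()}


def expected_chance_matches(pattern, freqs, corpus_words):
    """Expected number of chance matches of pattern in a corpus.

    Assumes independent word initials (a simplification): the result is
    (number of starting positions) * (product of letter frequencies).
    """
    positions = max(corpus_words - len(pattern) + 1, 0)
    probability = prod(freqs.get(letter, 0.0) for letter in pattern)
    return positions * probability
```

Because the probability shrinks geometrically with pattern length, enlarging the corpus raises the expected number of short coincidental matches far faster than it raises the chance of a coincidental match to a long sequence.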

F2) Statistical triage

I made an effort to identify the person and tense of the TSC text. There is potentially more work to do in exploring statistical methods to determine general properties of the TSC, although no useful directions are immediately apparent to me. It is perhaps possible to use statistical methods to "spell check" the ambiguous handwriting, if one doesn't fall into the naive trap of assuming that the highest-probability letters and n-grams are most likely to be correct.

F3) Google search

A radical broadening of the search for an exact match could exploit Google or another web search engine. A practical algorithm for this could be as follows:

A1) Pick two or more substrings from the TSC of length 3 to 5.
A2) Using a large corpus, find the commonly recurring phrases in English whose initials match those substrings.
A3) Query the search engine for combinations of those phrases.

It is possible, in principle, that a tractable permutation of phrases could cover most of the probable readings of those substrings. The parameters of an efficient search of this kind will be left as an exercise for the reader.
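As a starting point, steps A1 to A3 might be sketched as below. This is a minimal illustration under stated assumptions: the function names phrases_matching and build_queries are hypothetical, the corpus is a toy string, the substrings are placeholders rather than actual pieces of the TSC, and the queries are only assembled as quoted-phrase strings, with no search engine interface assumed.

```python
import re
from collections import Counter
from itertools import product


def phrases_matching(text, initials, min_count=2):
    """A2: recurring phrases in text whose word initials spell `initials`.

    Returns phrases seen at least min_count times, most frequent first.
    """
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    target = initials.lower()
    k = len(target)
    counts = Counter()
    for i in range(len(words) - k + 1):
        window = words[i:i + k]
        if all(word[0] == letter for word, letter in zip(window, target)):
            counts[" ".join(window)] += 1
    return [phrase for phrase, n in counts.most_common() if n >= min_count]


def build_queries(phrase_lists):
    """A3: one query per combination, each phrase quoted for exact match."""
    return [
        " ".join(f'"{phrase}"' for phrase in combo)
        for combo in product(*phrase_lists)
    ]


# A1: two placeholder substrings (illustrative only).
corpus = (
    "it was the best of times. it was the worst of times. "
    "i went to bed at the end of the day. it was the end of the road."
)
candidates = [phrases_matching(corpus, s) for s in ("IWT", "TEOT")]
queries = build_queries(candidates)
```

The number of queries is the product of the candidate-list sizes, so the min_count threshold and the choice of substrings largely determine whether the search stays tractable.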


For now, my investigation of the TSC is at a halt or a pause, as the main questions I started out with have been answered, shown to be unanswerable, or brought to a point where further investigation seems to offer diminishing returns.

I would like to thank Derek Abbott and his students at the University of Adelaide, who have laid a foundation of TSC research and who offered helpful comments through personal communication in the previous weeks. I'd also like to credit Barry Traish, who expanded the Project Gutenberg search considerably and also offered insightful comments.

