NLP Confidential

Thursday, July 24, 2014

Murder and NLP: The Taman Shud Case Splash Page

This year, I have posted several analyses of the mysterious text found in association with the Taman Shud case, concerning the body of the Somerton Man found on a beach near Adelaide, Australia in 1948. This post serves as an index and brief summary of those four posts.

Taman Shud 1: Ngram frequency indicates that the Taman Shud "cipher" (TSC) is almost certainly an initialism (initial letters from the words in a text, in the same order as the original words) drawn from one or more short English texts.

Taman Shud 2: A search of over 20,000 books in the Project Gutenberg collection finds no complete or nearly-complete passages of which the TSC is an initialism. In a comment, Barry Traish reports on a similar search, also finding no matches, that doubles the number of books considered. This includes a plurality of prominent literary works dated 1925 or earlier, but does not exclude the possibility of a match with some more obscure text.

Taman Shud 3: A simple existence proof and discussion showing that initialisms are not, in general, decodable. By extension, initialisms are not useful as a code for espionage or any interpersonal communication in general.

Taman Shud 4: A discussion of six possible reasons why someone would write down an initialism. While it is impossible to choose among these and other possible reasons why the TSC was written, two or three seem more plausible than others. Three possible follow-up investigations are described, although these do not seem especially likely to provide definitive answers.

Overall, I believe that the TSC was written by a person for the purpose of quickly jotting down some idea, using initials as a sort of shorthand readable only by the person while they had those words fresh in their memory. This may have been an original composition or something by another writer/speaker that they were trying to remember/memorize. An initialism is not useful for a person who is trying to send a message to another person, so the TSC is not a code sent by a professional spy and is unlikely to be any attempt to communicate. Therefore, it is unlikely that anyone will ever decode the TSC and thereby learn more about the mysterious case of the Somerton Man's death.

Saturday, March 29, 2014

Murder and NLP: The Taman Shud Case, Part 4

My three previous posts on the Tamam Shud Cipher (TSC) have supported, to varying degrees of certitude, the following conclusions:

C1) The letters are an initialism – the initials, in sequence, of words in some English text(s).

C2) The text which the TSC encodes is not present in a large collection of prominent literature.

C3) Initialisms in general, and the TSC in particular, cannot be decoded to find the text that created them.

Given these conclusions, what can we say about why the TSC was written, and what paths remain to investigate further?

Possibilities

P1) It has been supposed by some that the Somerton Man was a spy and the TSC was a message he intended to send to his superiors. This is certainly not possible, because initialisms are not decodable and a professional spy would know this. There are many methods of encoding a message so the intended audience can read it but someone who intercepts it cannot. Initialisms aren't such a method.

P2) Was the TSC a personal message shown, as it was written, to the intended reader?

In Tolstoy's Anna Karenina there is a scene where two lovers communicate through initialisms. A central character, Levin, writes 14 initials on a table using a piece of chalk, showing it to Kitty, the young woman who has earlier rejected his marriage proposal. After a few moments, she interprets the message correctly, as the initialism (in Russian) of a question he would like to ask her about her rejection of his proposal. She responds to him by writing a shorter initialism as her reply, and with snippets of conversation between them, the two communicate briefly by showing initialisms to one another. A third character walks in and seems to recognize what they are doing as a game called "Secrétaire." In this scene, Levin and Kitty show the ability to read initialisms, aided by the questions, hints, and glances they share in person. Is this possible? Was the TSC shown to a lover by the person who was writing it, line by line, as it was written?

Of course, Anna Karenina is fiction. Tolstoy could make anything he wanted happen in the scene. The reader is meant to understand, in fact, that this feat of understanding is incredible, and that it shows how wonderful a match Levin and Kitty are, that they can understand each other in this seemingly impossible situation. Moreover, there is extraordinary context at play: Kitty is easily able to guess the general topic of Levin's question, which is about her feelings, from the context of their previous conversations and the very fact that Levin seems restrained from speaking freely. He made a direct reference to an earlier conversation of obvious importance to their relationship, so she might more easily guess which ideas, and therefore which words, were in his message. Of course, with enough verbal or nonverbal clues provided in person by the writer to the reader (even some or most of the exact words), an initialism could be read, but such an event is not recorded in any way and we cannot guess if this took place.

P3) Was the TSC scratch work, using the book as blank paper, for someone's efforts to solve a puzzle? It's common to find scratch work written in the margins where someone has tried to solve newspaper puzzles. Is the TSC such a remnant, for some puzzle that appeared in another book or newspaper?

This is possible in principle, but it's not clear what sort of puzzle would lead someone to write so many initials of a long text. There is a syndicated puzzle called Jumble where the solution of several word scrambles provides a few letters to decode a message of several words. The TSC is much longer and much more orderly than the scratch work for Jumble is likely to be, but could a puzzle like Jumble, but longer, have been in the hands of someone in Australia in 1948?

Many Australian newspapers have been stored in digital archives online. A search of Australian papers from November, 1948 reveals primarily crossword puzzles, which were regular features in the Adelaide Chronicle, Sydney Morning Herald, Sydney World News, Perth West Australian, and Perth Western Mail. The Longreach Leader had riddles and word scrambles. The Murray Pioneer had weekly word "diamond" and "square" puzzles, similar to crossword puzzles with four or five clues and no unused spaces. None of these would seem to lead a person to write initialisms of 9 or more letters as their scratchwork.

We might extend the puzzle search arbitrarily far from Australian newspapers, by searching newspapers from the United States (where one of Somerton Man's garments seemed to originate) or other countries, going backwards in time from 1948, and considering books as well as newspapers. The possibility of TSC as puzzle scratchwork may have no definite resolution, as some of these publications are unlikely to survive online or otherwise. However, it would be interesting if any puzzle can be found that would require the solver to write an initialism of nine or more letters as the answer or as scratchwork.

P4) Is the TSC a rough draft of original prose or poetry? Literary writing invites the writer to create rough drafts. Conceivably, an initialism could be used as a form of shorthand, which would bypass the problem of a reader decoding the message, because the only intended reader would be the writer. If a writer had in mind a tentative version of a passage, and was considering small revisions of a word or two, they might be able to write initialisms as possible second, third, etc., drafts, and have no confusion in their mind what the exact intended reading was of each version.

At right, we see a handwritten draft of "Here Comes The Sun" by the Beatles' George Harrison. We see that, like the TSC, Harrison's work is written in a book rather than on blank paper. We also see that six times, phrases which have occurred earlier are written out only as initials, to spare Harrison the effort of longhand. Could the TSC be the sixth, or tenth, or twentieth draft of a poem the author had begun and completed on other sheets of paper?

I find this to be the single most compelling explanation for the TSC. It does not assume that the writer was attempting the impossible task of communicating with another person using initialisms. It explains the crossed-out line and the "X" written over another letter as the continuing process of editing. A literary aspiration on the part of the writer explains why the four-line format of the TSC is similar to that of quatrains in the Rubaiyat's printed text, without matching any of them exactly. Unless we find a match for the TSC in the belongings of a person close to the case, we can never verify if this is the correct answer. I find it the most likely explanation because it seems to stipulate the fewest unproven assumptions.

P5) Did the writer write the initialism to help memorize or recall a piece of famous literature, to get it straight, or to drill it into their memory? The apparent edits in the TSC could be corrections as someone was trying to get their imperfect memory of a literary passage to align with the verbatim text. This, like (P4), is consistent with the fact that the undecodable nature of initialisms is irrelevant if the writer is the only intended reader. As the search for matches has come up dry in prominent literature, this possibility is diminished, although not eliminated.

P6) Was the TSC a futile, misguided attempt to communicate? Although initialisms are not decodable, it is not clear that the writer of the TSC knew and appreciated that fact. The book was very likely held by a dying man in his final hours. Could it be a suicide note, or the last message from a man who knew that he'd been poisoned by someone else? It's hard to call this impossible, particularly if he was not in a clear state of mind, but it seems extremely odd that a person with that goal would not at least start writing conventional prose rather than initials. Just as the reader of Anna Karenina is supposed to appreciate that initialisms are nearly impossible to read, a literate person might intuit this fact quickly as soon as they began to ask themselves the question.

Paths Ahead

As I see it, the analysis of the TSC is at something approaching an endgame, or, one might say, a stalemate. There are a few possible explanations for why it was written, and although I think (P4) is the most likely, the others cannot be ruled out. Moreover, most of the possible explanations give us no reason to hope that the TSC's underlying message can ever be known. A modest consolation is that the TSC's underlying message may be utterly unrelated to the Somerton Man's death and of no intrinsic interest to us, even if it were known.

I see at least three possible paths for follow-up:

F1) Further Literature Search

While the Project Gutenberg corpus I searched, expanded to additional Gutenberg texts in an excellent effort by Barry Traish, is quite vast, it is not infinite. In an effort to determine how comprehensive that collection is, I examined how many of the Modern Library's editors' list of the 100 best English language novels of the 20th Century are present in the Project Gutenberg corpus. Simply put, all of them published before 1923 were present, and none after that year, a divide that is undoubtedly due to copyright restrictions. It may be that the 25-year window from 1923-1948 is relatively uncovered.

I extended the search to include a collection of poetry by Rumi (which is similar in origin and nature to the Rubaiyat), and also added the post-1923 novels Brave New World, Darkness at Noon, The Great Gatsby, Portrait of the Artist as a Young Man, and Ulysses. These additions to the search produced no interesting results.

It should be noted that a vast increase to the search is "doomed to succeed"… given enough text searched, we will inevitably find longer substring matches by random chance, so we might eventually, with a truly vast collection of text, find coincidental matches to all of the TSC's individual lines. However, no collection of text will produce by mere chance a match to all 44 letters.

F2) Statistical triage

I made an effort to identify the person and tense of the TSC text. There is potentially more work to do in exploring statistical methods to determine general properties of the TSC, although no useful directions are immediately apparent to me. It is perhaps possible to use statistical methods to "spell check" the ambiguous handwriting, if one doesn't fall into the naive trap of assuming that the highest-probability letters and ngrams are most likely to be correct.

F3) Google search

A radical broadening of the search for an exact match could exploit Google or another web search engine. A practical algorithm for this could be as follows:

A1) Pick two or more substrings from the TSC of length 3 to 5.

A2) Using a large corpus, find the commonly recurring phrases in English whose initials match those substrings.

A3) Query the search engine for combinations of those phrases.

It is possible, in principle, that a tractable permutation of phrases could cover most of the probable readings of those substrings. The parameters of an efficient search of this kind will be left as an exercise for the reader.
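Steps A1 to A3 can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy word list stands in for a large corpus, the function names are made up for this example, and the queries are merely built as strings (assuming a search engine that treats double-quoted phrases as exact matches); nothing is actually sent to any search service.

```python
from collections import Counter
from itertools import product

def phrases_by_initials(words, substring, top_k=3):
    """A2: the top_k most frequent word runs whose initials spell `substring`."""
    n = len(substring)
    target = substring.lower()
    counts = Counter()
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        if "".join(w[0].lower() for w in window) == target:
            counts[" ".join(window).lower()] += 1
    return [phrase for phrase, _ in counts.most_common(top_k)]

def build_queries(words, substrings, top_k=3):
    """A3: one query string per combination of candidate phrases."""
    candidates = [phrases_by_initials(words, s, top_k) for s in substrings]
    return [" ".join(f'"{p}"' for p in combo) for combo in product(*candidates)]

# Toy stand-in for a large corpus (invented text, not from any real book).
toy_corpus = ("my life is a bad one and I am quite sure my life is a burden "
              "my love is a blessing and it all quiets").split()

# A1: two substrings of the TSC, of length 5 and 4.
queries = build_queries(toy_corpus, ["MLIAB", "AIAQ"])
for q in queries:
    print(q)
```

The number of queries is the product of the candidate counts per substring, so `top_k` and the number of substrings chosen in A1 are the parameters that keep the search tractable.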

Acknowledgements

For now, my investigation of the TSC is at a halt or a pause, as the main questions I started out with have been answered, shown to be unanswerable, or brought to a point where further investigation seems to offer diminishing returns.

I would like to thank Derek Abbott and his students at the University of Adelaide, who have laid a foundation of TSC research and offered helpful comments through personal communication in the previous weeks. I'd also like to credit Barry Traish, who expanded the Project Gutenberg search considerably and also offered insightful comments.

Monday, March 17, 2014

Murder and NLP: The Taman Shud Case, Part 3

In previous posts, I've shown that the Tamam Shud Cipher (TSC) is almost certainly an initialism, a sequence of initials corresponding to English text(s), and that a large corpus of prominent literature does not contain the text from which the TSC was derived. This leads to the next question: Given the strong possibility that the text was written by someone (presumably non-famous) who handled the book in which it was found, is it possible for us to decode it? Are initialisms based on English text in general decodable? In my discussion here, I interpret the potentially ambiguous handwriting in a certain common way. Other interpretations may be valid, and we can consider that statistically, but I focus for now on this interpretation of the handwriting:

WRGOABABD

WTBIMPANETP

MLIABOAIAQC

ITTMTSAMSTGAB

Example TSC Readings

Consider the following:

(1)

We rarely go onto Australia’s beaches and bed down.

Wade the beaches in majestic peace and nervously enter the Pacific.

My love, I am blessedly opened, and I am quite certain

In the truth. Mercifully, the sleeper awakens me, stirs the girl, and blossoms.

(2)

Western radar groups operate Australian bases and base defenses.

Weapons testing base is militarily prepared. American nationals electrified the perimeter. Military liaisons in Adelaide boarded on an international aircraft. Queensland considering increases to territorial military trainees. South American mercenaries sent to Guam and Borneo.

Both of these are lucid (if not sparkling) English texts that correspond to the TSC (give or take alternate interpretations of the handwriting). They took me about 15 minutes each to write. One is a love poem (somewhat like the text of the Rubaiyat in which the TSC was found) and one looks like something a Soviet spy might send to his superiors. A group of writers could surely come up with endless TSC solutions on these or practically any other themes and probably never duplicate each other's work. The simple fact demonstrated here is: The TSC, like most initialisms, is undecodable into the original text because it has virtually limitless solutions. Therefore, if the TSC is not found in some previously existing text, the source text that generated it will never be known.
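The claim that both readings correspond to the TSC is easy to check mechanically. The short sketch below takes the first letter of every word and compares the result with the 44 letters given above; the function name and the simple word-splitting rule are choices made for this illustration.

```python
import re

TSC_LINES = ["WRGOABABD", "WTBIMPANETP", "MLIABOAIAQC", "ITTMTSAMSTGAB"]
TSC = "".join(TSC_LINES)

def initialism(text):
    """Upper-cased first letter of every word (apostrophes stay inside a word)."""
    words = re.findall(r"[A-Za-z][A-Za-z'’]*", text)
    return "".join(w[0].upper() for w in words)

READING_1 = (
    "We rarely go onto Australia's beaches and bed down. "
    "Wade the beaches in majestic peace and nervously enter the Pacific. "
    "My love, I am blessedly opened, and I am quite certain "
    "In the truth. Mercifully, the sleeper awakens me, stirs the girl, and blossoms."
)
READING_2 = (
    "Western radar groups operate Australian bases and base defenses. "
    "Weapons testing base is militarily prepared. American nationals electrified "
    "the perimeter. Military liaisons in Adelaide boarded on an international "
    "aircraft. Queensland considering increases to territorial military trainees. "
    "South American mercenaries sent to Guam and Borneo."
)

for name, reading in (("love poem", READING_1), ("spy report", READING_2)):
    print(name, initialism(reading) == TSC)
```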

Grammatical Analysis

It is easy to lay out, analytically, why most initialisms from English text have endless numbers of solutions. Most of what I say here will apply to a great number of other languages, but I discuss English specifically.

English vocabulary falls into two rough subsets. There are function words, generally drawn from closed classes of words, and content words, generally drawn from open classes of words. Nouns, verbs, adjectives, and adverbs are content words. They comprise the vast majority of all of the words in English. Pronouns, conjunctions, prepositions, and articles are function words. Those parts of speech have only about 3 to 90 words each.

For all but the rarest letters in English, you can come up with a very long list of words in any of the content word classes that begin with that letter. For the function words, this is not possible. For example, there are only a handful of coordinating conjunctions, the most common being and, or, and but. There are no coordinating conjunctions in English that begin with 'z', 't', or 'e'.

Take any typical passage in English text, and write down the initialism. You may then freely generate almost endless alternative readings of the initialism by changing the content words to other examples of the same part of speech.

"The platypus is one of the strangest animals in Australian fauna."

"The pioneer is one of the strongest archetypes in American fiction."

"The pastegh is one of the sweetest aliments in Armenian food."

It's easy to do this with the content words in virtually any sentence. So even if we kept the function words the same (as in the previous examples), we can still generate many sentences with the same initialism and warp the meaning entirely.
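To get a feel for how quickly the alternatives multiply, the three example sentences can be treated as one template with five content-word slots. The sketch below enumerates every combination; the word lists are just the words from the examples above, whereas a real lexicon would offer hundreds or thousands of choices per slot. Many combinations are grammatical but odd in meaning, which is beside the point here: they all share one initialism.

```python
from itertools import product

# Fixed function words are plain strings; content-word slots are lists.
TEMPLATE = [
    "The", ["platypus", "pioneer", "pastegh"], "is", "one", "of", "the",
    ["strangest", "strongest", "sweetest"],
    ["animals", "archetypes", "aliments"],
    "in", ["Australian", "American", "Armenian"],
    ["fauna", "fiction", "food"],
]

def expand(template):
    """Yield every sentence obtainable by picking one word per slot."""
    slots = [s if isinstance(s, list) else [s] for s in template]
    for words in product(*slots):
        yield " ".join(words)

def initials(sentence):
    return "".join(w[0].upper() for w in sentence.split())

readings = list(expand(TEMPLATE))
print(len(readings), {initials(r) for r in readings})
```

With only three choices in each of five slots there are already 3^5 = 243 readings of the same eleven initials.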

It is relatively difficult to play the same trick with function words, because there are so few options to swap in. However, we could choose different function words in different locations, in effect moving the pivot of function words to another location and then manipulating the content words in their new positions.

"Tommy, play in our old toyroom since Annie is acting funny."

Most function words begin with relatively common letters in English, which is true almost by definition. These letters can be used in other words, content or function, and so the location of function words in an initialism cannot be pinned down.

There are rare letters that could greatly restrict the freedom of recombination shown above. A sequence like XXQXX in an initialism might actually have no valid readings in English. However, the TSC has only one rare letter, Q. Does the Q, or any other pattern in the TSC, significantly constrain the range of possibilities for TSC readings? If we can't determine the reading of the TSC with absolute certainty, can we meaningfully narrow it down?

Learning from Examples

Using the same Project Gutenberg corpus which was searched for matches of TSC substrings, we can search for shorter substrings to see which phrases might match them. Substrings of length 6 are useful for providing multiple matches (at least 7 unique readings) for each position in the TSC. If the particular letters in the TSC constrain the possible readings significantly, then we should see repeated patterns in the Gutenberg matches.

It should be noted first that the Gutenberg corpus has multiple copies of some texts within it, which inflates the counts unnaturally. This observation notwithstanding, there is no substring of length 6 in the TSC that has any single reading which comprises the majority of its Gutenberg hits. In other words, whichever reading we guess to be correct, it is wrong in the majority of cases – over 60%, in fact. Therefore, the would-be sleuth who writes a reading of the TSC and feels that their match is sure to be right is being seduced by the fallacy that the solution they have in mind is rare in matching the text. The sequence that achieves the best match is "do well to bear in mind", which still only covers 38% of the matches for DWTBIM, and is one of 59 different readings found in the Gutenberg corpus. For any 6gram in TSC, whichever guess you offer for the correct reading, you will probably guess wrong.
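The "share of the best reading" measurement can be sketched as follows. The text here is an invented toy, not the Gutenberg corpus, and the function names are illustrative; the point is only to show how the readings of one 6gram are collected and how the share of the most common one is computed.

```python
from collections import Counter

def readings_for(ngram, words):
    """Count every run of consecutive words whose initials spell `ngram`."""
    initials = "".join(w[0].lower() for w in words)
    target = ngram.lower()
    counts = Counter()
    start = initials.find(target)
    while start != -1:
        counts[" ".join(words[start:start + len(target)]).lower()] += 1
        start = initials.find(target, start + 1)
    return counts

def top_share(counts):
    """Most common reading and the fraction of all hits it accounts for."""
    reading, n = counts.most_common(1)[0]
    return reading, n / sum(counts.values())

# Invented toy text containing five hits for DWTBIM.
toy = ("you would do well to bear in mind that dogs will try biting idle men "
       "and did we think birds in Malta do well to bear in mind "
       "dad was too busy in May").split()

counts = readings_for("DWTBIM", toy)
reading, share = top_share(counts)
print(reading, share, len(counts))
```

Even in this tiny example the most common reading accounts for fewer than half of the hits, so guessing it would be wrong more often than right.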

Can we do better trying to pin down exact words? Suppose we take the 6gram readings from the Gutenberg corpus, tally (counting each reading just once, even if it appeared multiple times) how often particular words are used to fill the specific positions in TSC, and call the share that each word has for that position the derived probability. In these values, we see the same inherent ambiguity as indicated above. Every position allows at least 7 different words to stand for that letter, and in very few cases is the most common word more than 20% frequent, meaning that whatever word we guess in that position, we will probably guess differently than the original text. There is just one case where the most common word rises slightly above 50%: the A in position 24 is filled 53% of the time by the article “a”, a poor, and in any case ambiguous, starting point for interpreting the text. Almost all of the 33 derived probabilities that exceed 20% belong to exceedingly common function words: “a”, “the”, “and”, “in”, “of”, etc. These give not even the slightest indication of topic, genre, or even tense or person.
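One way to compute such per-position shares is sketched below. The readings are invented for illustration (they are not taken from the Gutenberg results), positions are numbered from 1 as in the text, and each 6gram is keyed by the position of its first letter in the TSC.

```python
from collections import Counter, defaultdict

def derived_probabilities(readings_by_start):
    """Share of each word at each TSC position, counting each unique reading once."""
    tallies = defaultdict(Counter)
    for start, readings in readings_by_start.items():
        for reading in set(readings):
            for k, word in enumerate(reading.lower().split()):
                tallies[start + k][word] += 1
    return {
        pos: {word: n / sum(c.values()) for word, n in c.items()}
        for pos, c in tallies.items()
    }

# Hypothetical readings for DWTBIM (starting at position 9)
# and WTBIMP (starting at position 10).
toy_readings = {
    9: ["do well to bear in mind", "dogs will try biting idle men",
        "did we think birds in Malta", "dad was too busy in May"],
    10: ["was to be in my power", "went to bed in mortal pain"],
}

probs = derived_probabilities(toy_readings)
for pos in sorted(probs):
    best = max(probs[pos], key=probs[pos].get)
    print(pos, best, round(probs[pos][best], 2))
```

In this toy data the only position with a dominant word is the I at position 13, filled by "in", which mirrors the observation that the confident guesses tend to be uninformative function words.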

Four words have a derived probability between 20% and 34% and provide a slight indication of tense and person: There are two such occurrences of “my” (positions 14 and 21), and one each of “is” (position 23) and “am” (position 29). These effectively indicate three votes for the correct TSC reading being written in the first person and two votes for the present tense. These are difficult to interpret as probabilities, however, since each of these votes is, in any case, less than 50% probable, and there’s no clear prior probability of tense and person for a random unknown text. It should be noted that a first person text still contains many third person references, and a text that is primarily in past or present tense still may contain many instances of the other. Therefore, we have a glimmering of an indication that TSC may be an initialism of a first person text, but this is far from conclusive, and in no other way helpful regarding the content. One related observation: There are no occurrences of Y in the TSC, so there are no second person pronouns (“you”, “your”, “yours”) although the second person can be spoken of through circumlocution without those words.

Finally, the derived probability of “quite” for the Q in position 30 is 43%, high among the derived probabilities, but still short of 50%, and utterly ambiguous regarding genre, topic, or content.

Summary

Cumulatively, we have conclusive evidence that the TSC is an initialism; no source text has been identified in literature; and if the source text is not found in an older source, the TSC cannot be decoded into the original text.

This is, most importantly, a strongly negative result for those who have hoped that the TSC could be deciphered, helping to solve the mystery of the Somerton Man, which is still quite a bizarre story even if the TSC is left aside. It still leaves an intriguing situation in which the TSC, which came to light because of the Somerton Man, is effectively a second mystery, which one might have found and glossed over if it were not associated with a dead body, but has been elevated in importance because of the body. There are doubtlessly countless books in the world’s libraries that have mysterious scribbles inside, and no one pays them any particular attention. (I have found some in my older books, written in my own hand, mysterious to me years after they were written.) However, since the case has gotten so much attention, I’ll devote one more post to examining what the TSC might represent – why an initialism might have been written, and what remaining, however slight, possibilities exist for obtaining a definite reading of its content.

Saturday, March 8, 2014

Murder and NLP: The Taman Shud Case, Part 2

In my last post, I showed that the Tamam Shud Cipher (TSC) is almost certainly an initialism, a sequence of initials taken from English text. So what does it say, and why was it written?

It should be noted first that two hypotheses, not exclusive of one another, regarding the other facts of the case are:

H1) That Somerton Man was a Cold War spy who was operating for the Soviets or perhaps the Americans.

H2) That the nurse whose phone number was found in the book was a potential lover of the Somerton Man, and the Rubaiyat was given to him by her to share the romantic aspects of the poetry. She admitted having given another copy of the Rubaiyat to another man in 1945. That man and book both turned up during the investigation.

In that light, some possibilities for the source text behind the Tamam Shud Cipher:

P1) Passage(s) from existing literature.

P1a) Written out to help the writer memorize that literature.

P2) Something written by a person who once had the book.

P2a) A message sent by a Cold War spy.

P2b) A message written out for a friend or lover to read.

P2c) A rough draft for original poetry or prose using initials as shorthand.

P2d) Written out to help the writer memorize their own composition.

P3) Scratchwork for solving a puzzle.

We can investigate each of these possibilities, although definitive answers may prove elusive. I'll devote the rest of this post to beginning an examination of possibility (P1).

If the TSC can be found, in its entirety, in the initials of any previously-existing work of literature, then we can be sure we've found the answer. Why? The probability of two texts sharing the same initials is a function of their length, L, in words, and is approximately:

p_Match = 1 / 10^L

The number 10 occurs here because the probability of two randomly-chosen initials from English text being equal is about 1/10, not the 1/26 you would see if all letters were equally likely. The exact parameter may be experimentally derived and certainly isn't exactly 10.0, but it's so close that I use 10 to make the math easier to comprehend. The shortest line in the TSC has 9 letters and the longest has 13. Finding a random passage of 9 words to match the shortest line has a probability of about one in a billion, while a passage matching the longest line has a probability of about one in ten trillion. Meanwhile, the probability of an accidental match for the entire 44-letter TSC is a dicey 10^-44, which is essentially zero. If we find any existing text that matches the entire TSC, then it is definitely the text used to produce the TSC.
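For readers who want to derive the parameter themselves, here is a minimal sketch. It assumes that the chance of two random initials agreeing is the sum of the squared relative frequencies of each initial, and that successive words are independent (as the formula above implicitly does); both function names are invented for this example.

```python
from collections import Counter
import math

def initial_collision_probability(words):
    """Chance that two initials drawn at random from `words` are the same letter."""
    counts = Counter(w[0].lower() for w in words if w and w[0].isalpha())
    total = sum(counts.values())
    return sum((n / total) ** 2 for n in counts.values())

def p_match(length, p_same=0.1):
    """Chance that two unrelated passages of `length` words share every initial."""
    return p_same ** length

# Sanity check: 26 equally likely initials give the 1/26 baseline.
uniform = [chr(ord("a") + i) for i in range(26)]
assert math.isclose(initial_collision_probability(uniform), 1 / 26)

# Shortest line, longest line, and the whole TSC.
for length in (9, 13, 44):
    print(length, p_match(length))
```

Running `initial_collision_probability` over the words of a large English text would give the experimentally derived value that the post rounds to 1/10.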

Note: This does not mean that if a person tries to write out their own original match for the TSC and succeeds that they have found the answer. In fact, it's not that hard to write text to match a given set of initials, which produces the illusion that someone can make quick and easy progress towards figuring it out. I will show in a later post how easy it is to concoct bogus matches after the fact.

Of course, finding matches for shorter substrings is easy, for sufficiently short length. Obviously, matches for substrings of length 1 and 2 are trivial to find, and by searching larger volumes of text, one finds matches of length 3, 4 and so on.

Work has been done in the past on trying to find exact matches for the TSC in a handful of major literary works, including the King James Bible. After an initial, failing search through digital copies of 40 books and 18 collections of poetry, I conducted a rather massive search which is as follows:

Project Gutenberg creates digital transcriptions of literary works. For my search of TSC matches, I downloaded by torrent the April 2010 collection of 29,500 books. Of these, 22,353 books are in text format. I preprocessed these to create an index of their initialisms, which amounts to 1.3 billion words of original text. I searched this for the longest matches that exist for TSC substrings, to arrive at these results:
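The search itself can be sketched in a few lines. This is not the code used for the search described here, only a small illustration of the idea: reduce the text to a string of initials, then look for the longest substrings of the TSC that occur in it. The names are invented, and a corpus of over a billion words would call for a more efficient index than plain string search.

```python
import re

TSC = "WRGOABABD" "WTBIMPANETP" "MLIABOAIAQC" "ITTMTSAMSTGAB"

def initials_of(text):
    """Lower-cased initials of every word in `text`, as one string."""
    return "".join(w[0].lower() for w in re.findall(r"[A-Za-z][A-Za-z']*", text))

def longest_matches(cipher, index, min_len=3):
    """Return (length, hits) for the longest cipher substrings found in `index`.

    Each hit is (substring, word offset in the indexed text).
    """
    cipher = cipher.lower()
    for length in range(len(cipher), min_len - 1, -1):
        hits = []
        for start in range(len(cipher) - length + 1):
            sub = cipher[start:start + length]
            pos = index.find(sub)
            while pos != -1:
                hits.append((sub, pos))
                pos = index.find(sub, pos + 1)
        if hits:
            return length, hits
    return 0, []

# Toy "corpus": a short sentence built around one of the length-8 passages.
index = initials_of("At that point my life is a bad one to tell")
print(longest_matches(TSC, index))
```

Because the index stores one character per word, the offset of a hit is also the word number at which the matching passage starts.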

R1) There is no exact match for the entire TSC, or any of its individual lines.

R2) The longest matches are of length 8, and there were twelve of these. It is apparent that these are entirely coincidental for the following reasons:

a) Many of them contradict one another, by matching the exact same subsequence of TSC.

b) They each begin in the middle of a sentence, and continue on into the middle of the next sentence.

c) We would expect, from the aforementioned formula, to find about ten matches of length L=8 and one match of length L=9 by sheer coincidence.

d) One of the matches is from 1963, many years after TSC was written down.
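The estimate in (c) can be reproduced roughly as follows. This is one simple reading of the formula, stated here as an assumption: every word in the corpus starts one candidate passage, and each candidate matches one fixed run of L initials with probability 1/10^L. The function name and defaults are made up for the sketch.

```python
CORPUS_WORDS = 1.3e9  # size of the indexed corpus, as stated above

def expected_chance_matches(length, corpus_words=CORPUS_WORDS, p_same=0.1):
    """Expected coincidental hits for one fixed run of `length` initials."""
    return corpus_words * p_same ** length

for length in (8, 9, 10):
    print(length, round(expected_chance_matches(length), 2))
```

Under these assumptions the expectation is about 13 for L=8 and about 1.3 for L=9, the same order of magnitude as the figures quoted in (c), and it drops by a factor of ten for each additional letter.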

The matches were from nine works of literature, and one work each from science, economics, and reference. The matches are posted separately here.

In a nutshell, the Project Gutenberg corpus does not contain a match, and this should set the tone for any future searches. The works in this corpus are not only large in number but particularly central in literary importance. It is hard to characterize "literary importance" formally, but the reach of this corpus is impressive. When one thinks of authors predating 1950 who are likely to be taught in university literature courses, the number of works this corpus has is vast, although not comprehensive. For many famous works, it includes more than one edition. There are also translations of English works into other languages (although, recall, TSC is almost certainly an initialism of English text) and translations of literature in other languages into English.

Until a match is found, we can never prove that there is no match. Whatever corpus of n books we search, it's always possible that the exact match will be found in book n+1. It remains possible that TSC matches a book, poem, newspaper article, or other text not in this Gutenberg Corpus, but a lack of matches in 22 thousand books certainly suggests that searching any additional one – or one hundred – books chosen arbitrarily will yield a very low probability of finding a full, exact TSC match.

In an upcoming post, I'll discuss possibility (P2), and how to use a large corpus (now that we have one) as a resource for trying to decode the TSC. We can perhaps find information about the words or genre that are encoded within it. And that will also open up paths to searching for exact matches.

Murder and NLP: The Taman Shud Case, Gutenberg Matches

Appendix:

Longest matches of substrings of the Tamam Shud Cipher among the initials in a corpus of 22,353 Project Gutenberg books. These are all of length 8. There were many shorter matches and no longer matches.

Title, Author, Substring, Passage

"The Iceberg Express", David Magie Cory

cittmtsa: cake i think the mermaid took somewhat after

"On Laboratory Arts", Richard Threlfall

cittmtsa: care is taken to make the strokes as

"Showell's Dictionary of Birmingham", Thomas T. Harman and Walter Showell

cittmtsa: Church in this town, Mr. Thomas Smallwood, an

"Beatrix", Honore de Balzac

cittmtsa: contemplating in turn the marshes the sea and

"The Nail", Pedro de Alarcón

iaboaiaq: is a beautiful one, and I am quite

"The Evolution of Modern Capitalism", John Atkinson Hobson

ittmtsam: in the textile metal transport shipping and machine

"Northanger Abbey", Jane Austen

ittmtsam: it together that miss thorpe should accompany miss

"The Trouble with Telstar", John Berryman

ittmtsam: itching to take me to see a man

"Pensées", Blaise Pascal

mtsamstg: make them saint augustine montaigne sébond the genealogy

"A Woman for Mayor", Helen M. Winslow

mtsamstg: motioned the stenographer and miss snow to go

"Ernest Maltravers", Edward Bulwer-Lytton

tpmliabo: that point my life is a bad one

"Carette of Sark", John Oxenham

ttmtsams: than twenty miles there soon after midnight steal