Sunday, February 28, 2010

Speech To Speech

When Google talks, people listen. Recently, Google has been saying that sometime in the near future, when you talk, people will be able to listen in other languages, thanks to your phone and their software. Franz Och, the head of Google's translation services, has said "We think speech-to-speech translation should be possible and work reasonably well in a few years' time." I'm not the only one who's skeptical.

In 2007, I evaluated a state-of-the-art speech-to-speech machine translation (StS MT) system that was built for military applications (think of the Somali pirate crisis from 2009 or any scene from The Hurt Locker). I don't think it delivered. The issue is quality. StS MT can "work" if:

a) Both parties are extremely cooperative and are willing to put some work into the exchange. (Eg, training the system on their voices; repeating things, maybe more than once, if the result is unclear the first time; tolerating the inherent delays.) This could mean a life-or-death situation where the alternative -- no translation -- is worse than a bad translation. (Although the stakes could be higher, even deadly, if the translation is faulty.)

b) The comprehension of the content is held to a low standard. Eg, two businesspeople or recreational chatters who want to feel like they're getting to know each other, not parties trying to hammer out the fine details of a legal agreement.

c) There is no alternative, because having the parties type content into an online text-based MT system immediately removes one source of error: the speech recognition.

As long as the sources of error are as great as they are now, I have trouble thinking of many contexts where people would be willing to tolerate the flaws. Maybe chatters who are only looking for entertainment and have no bottom line regarding accuracy. In war / emergency contexts, perhaps. In business, I think the problems just about doom the effort, unless a cultural adjustment makes people value "meeting" someone in this way even when the comprehension is shaky.

I was, however, impressed by how well the speech recognition worked with my voice once I trained the system to recognize me. Perhaps if Google trains the system on enough people, just about any new speaker will sound close enough to someone already in the training set. The speech-to-text part of the problem just might be solvable for a large segment of people.

But that leaves the machine translation link in the chain, and there is little reason to expect a quantum leap there anytime soon. I took Mr. Och's own quotation above and translated it to Spanish and then back to English using Google's text-to-text machine translation. It did a pretty good job, but it changed his use of the word "should" to "must". That's the sort of thing you don't want to have happen when a Somali pirate has his gun trained on you.
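That experiment is easy to reproduce. Here is a minimal sketch; translate() is a hypothetical stand-in for whatever MT backend you can call, not a real Google API:

    def round_trip(text, translate, pivot="es"):
        """Translate text to a pivot language and back to expose drift.

        `translate` is a hypothetical callable (text, source, target) -> str;
        plug in whatever MT service you have access to.
        """
        forward = translate(text, "en", pivot)
        back = translate(forward, pivot, "en")
        return forward, back

    # quote = "We think speech-to-speech translation should be possible ..."
    # spanish, english_again = round_trip(quote, some_backend)
    # Compare english_again to the original: watch for "should" becoming "must".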

Thursday, February 25, 2010

Inflection Predilection

Here's a lightweight, factoid-level look at a lot of different languages, just for fun.

We often hear that some languages are "heavily inflected". You can glance at tables of verb conjugations and noun declensions to see the brutal details: Russian surely does have a lot of declensions to memorize, just as Spanish does with verb conjugations. But that's all theory and not practice -- lots of those forms are hardly ever used. I lived in Italy and started making a mental note of how often I heard (or had to use) the second-person plural form of the future tense: Not often!

One way of quantifying how "heavily" inflected a language is emerged from some statistical work I was doing for a more practical purpose. For each of sixteen languages, I processed corpora of news articles, with total token counts per language ranging from about 100,000 to 3 million. From each corpus I generated two counts: first, the absolute count of nonunique tokens (in other words, the total size of the corpus); second, the sum over all articles of each article's unique token count (in other words, counting each word form just once per article it occurs in).

If each word occurred at most once per article, these two counts would be identical. The ratio between the two counts thus expresses how much a language tends to repeat words. There are a number of factors that can determine the degree of repetition, but most of them should be more or less equally true of all languages across the news corpora. Inflectional morphology, however, varies greatly from one language to another. English uses the form "said" in the past tense regardless of the person and number of the subject, and only "say" or "says" in the present tense. Spanish uses "dijeron", "dijo", "dije", etc. in the past and "dicen", "dice", "digo", etc. in the present. Chinese, meanwhile, uses the character "说" regardless of person, number, or tense. In this example, Spanish would show a relatively small discrepancy between the number of SAY tokens in an article and the number of distinct SAY forms in that article. English would show a larger discrepancy, with only three forms likely to occur. Chinese would have just one form no matter what, so the ratio of total SAY tokens to that single form could be larger still. So "more inflected" languages should have lower ratios between the two counts, and "uninflected" languages higher ratios.
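To make the bookkeeping concrete, here is a minimal sketch of the two counts and their ratio (a naive whitespace tokenizer stands in for whatever tokenization each language actually needs):

    def repetition_ratio(articles):
        """Ratio of total tokens to summed per-article unique tokens.

        `articles` is a list of article texts. A higher ratio means more
        repetition of identical word forms -- which, by the argument above,
        means less inflection.
        """
        total_tokens = 0
        unique_tokens = 0
        for article in articles:
            tokens = article.lower().split()  # naive whitespace tokenizer
            total_tokens += len(tokens)
            unique_tokens += len(set(tokens))
        return total_tokens / unique_tokens

By this measure, an English news corpus, repeating "said" over and over, should come out with a higher ratio than a Spanish one that spreads the same mentions across "dijo", "dijeron", "dije", and the rest.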

Behold, we see the basic truths upheld. Finnish is the most inflected language in this sample, and Chinese the least. English is the least inflected European language. Among the branches of the Indo-European family in the sample, the Slavic branch is the most inflected, with Germanic (besides the mongrel English) next-most, and Romance least.

A bit of analysis: Finnish and Turkish are agglutinative, allowing combinatorial stacking of affixes and generating vast numbers of forms. Arabic manifests similar complexity in a slightly different way, with active derivational morphology combining with a respectable number of verb inflections.

Why do the Romance languages, with their vast verb conjugation tables, look so uninflected? Almost certainly, the fact that this is a news corpus plays a big role. News heavily favors the past tense and the third person, restricting verb conjugation more than, say, a corpus of spoken conversations would. The main complication of the Slavic languages, noun declension, is no more restricted in news than it would be in other contexts: nouns can still be subjects, objects, instruments, etc. The Germanic languages, with their modest declension schemes, rank between the Slavic languages and the undeclined Romance languages. English, which came out of the Norman Conquest on a path to shedding much of the complexity of its Anglo-Saxon and Old French origins, ended up the least-inflected language in the group. Chinese is famously "uninflected", although I have to admit a procedural bias here: tokenizing Chinese into single-character tokens guarantees that it occupies the extreme position; if I were tokenizing it into words (taking some side or another in the debate over what counts as a word in Chinese), the counts would be different, although its rank as the least-inflected language in the bunch would not.
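For what it's worth, the single-character treatment amounts to nothing more than this (the example sentence is mine, not from the corpus):

    # Single-character tokenization: every character is its own token.
    sentence = "他说他不知道"   # "He said he doesn't know"
    tokens = list(sentence)    # ['他', '说', '他', '不', '知', '道']
    # A word-level tokenizer would presumably keep 知道 ("know") together
    # as a single token -- which is exactly the debate I'm sidestepping.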

One could have quite a bit of fun explaining all of the nuances in the ranking: Why are Danish and Norwegian so far apart? Because Norwegian has more zero-ending plurals? What about Spanish and Italian? Is it some nonmorphological reason?

Overall, what was interesting here was seeing a numeric basis for an observation that is so readily made about which languages are "heavily" inflected. Where qualitative truths exist, quantitative breakdowns can reveal them.

Parsing Twitter

The Internet has not re-invented language as such, but it has created many new registers, each of which has to be parsed on its own terms. Out-of-the-box NLP tools that were developed to parse the Wall Street Journal and other well-behaved text will fall flat if they are used to process other niches around the Internet.

I have seen some of these phenomena going back a quarter of a century, in online chat. In a nutshell, people write on the Internet in ways more colloquial than formal writing tends (and tended) to be. But even beyond that broad generalization, there are many sub-niches of usage -- some determined by medium, and some determined by user population. (Besides the obvious segmentation into different national languages like English, German, and Chinese.)

Interest in parsing Twitter is suddenly getting hot, and while a lot of the linguistic behavior there resembles linguistic behavior in other online locales like chat rooms, email, and instant messaging, every niche ends up with its own rules (and lack thereof).

Here are some phenomena I've seen as I build a parser that is robust enough to handle Twitter:

1) Pro drop. On Twitter in particular, the first-person singular pronoun is often left implicit. Many tweets look like English sentences with a leading "I" implied. In other cases, "I am" is implied.

2) Nonsentential statements. Sometimes a noun phrase stands alone, with an implied existential quantification out front. "Party tonight" means "There will be a party tonight."

3) A register that resembles Black English Vernacular has arisen. I would suggest that this new written form deliberately deviates from formal written standards. At the same time, it is economical, using shorter forms as rebuses for bulkier forms whenever the shorter form would be pronounced the same way. For example, rewriting "You know" as "u no" (4 characters instead of 8). One can feel William Safire quaking, but those of us writing parsers must accept and embrace it.

The first time I noticed this was in the titles of songs written by Prince. The titles of songs on his first three albums never did this, but on albums released in 1981 and 1982, three of his songs had these elements in their titles (eg, "I would die 4 u"). You can see the deliberately contrary nature of his language by 1988, when he titled a song "Eye No", using a longer form in place of a standard shorter one. I don't know whether Prince was significantly responsible for this phenomenon, but it has certainly caught on by now.

Incidentally, detecting a user's register is potentially quite valuable, since many business reasons for parsing Twitter involve market analysis and market segmentation.

4) Acronyms and emoticons. These are so common in computer-mediated communication that it is impossible to be unaware of them. LOL.

5) Novel contractions and phonetic respellings, like "hella", "tryna", "weneva".

6) Repeating characters to establish emphasis. Eg, "welcomeeeeeeeeee". This is sometimes a challenge to parse (in principle, "good" is "god" with the "o" repeated). In other cases it's easy to convert to standard usage, but it still defeats literal search mechanisms. (See the normalization sketch after this list.)

Notice that the aforementioned devices can occur in combination. For example, "lmaaooo" = "laughing my ass off" with the "a" and "o" repeated for emphasis.

7) Unique medium-specific entities like URLs and the Twitter features for directing a tweet to a user (eg, @FakeSteveJobs) or a topic (eg, #lost).

8) Substituting look-alike characters for one another. "3" can stand in for "E", "0" for "o", "q" for "g".

9) Deliberate swapping of character order. For example, "teh" as a playful misspelling of "the". This can also combine with aforementioned devices. Eg, "pr0n" is a way to rewrite "porn".
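To give a flavor of what handling these devices looks like, here is a toy normalization pass; the rebus and look-alike tables are tiny made-up excerpts, nowhere near the lexicon a real system needs:

    import re

    # Toy lookup tables for illustration only.
    REBUS = {"u": "you", "no": "know", "4": "for", "2": "to", "r": "are"}
    LOOKALIKE = str.maketrans({"3": "e", "0": "o"})

    def normalize_tweet(text):
        """One rough normalization pass over a tweet."""
        out = []
        for word in text.split():
            # 7) Pass medium-specific entities through untouched.
            if word.startswith(("@", "#", "http")):
                out.append(word)
                continue
            # 8) Undo look-alike character substitutions.
            word = word.translate(LOOKALIKE)
            # 6) Collapse runs of three or more repeated characters to one.
            #    This is where "gooood" ambiguously becomes "god" or "good";
            #    a dictionary check would be needed to choose.
            word = re.sub(r"(.)\1{2,}", r"\1", word)
            # 3) Expand rebus spellings like "u no" -> "you know".
            out.append(REBUS.get(word.lower(), word))
        return " ".join(out)

    print(normalize_tweet("u no u r welcomeeeeeeeeee @FakeSteveJobs #lost"))
    # -> "you know you are welcome @FakeSteveJobs #lost"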

Not every user partakes of these new linguistic devices, but a parser that is intended to wring meaning (and market intelligence) out of Twitter (or blogs or email or other electronic communication) ignores them at its peril. The more of these you miss, the more information you miss. And the people who have embraced these nonstandard devices represent a nontrivial amount of spending power.