Thursday, February 25, 2010

Parsing Twitter

The Internet has not re-invented language as such, but it has created many new registers that have to be parsed on their own terms. Out-of-the-box NLP tools that were developed to parse the Wall Street Journal and other well-behaved text will fall flat if they are used to process other niches around the Internet.

I have seen some of these phenomena going back a quarter of a century, in online chat. In a nutshell, people write online in ways more colloquial than formal writing tends (and tended) to be. But even without that broad sweep, there are many sub-niches of usage -- some determined by medium, and some determined by user population. (That is besides the obvious segmentation into different national languages like English, German, and Chinese.)

Interest in parsing Twitter is suddenly heating up. A lot of the linguistic behavior there resembles linguistic behavior in other online locales like chat rooms, email, and instant messaging, but every niche ends up with its own rules (and lack thereof).

Here are some phenomena I've seen as I build a parser that is robust enough to handle Twitter:

1) Pro drop. Twitter in particular makes the first-person singular pronoun implicit. Many tweets look like English sentences that have the leading word "I" implied. In other cases, "I am" is implied.

2) Nonsentential statements. Sometimes a noun phrase stands alone, with an implied existential quantification out front. "Party tonight" means "There will be a party tonight."

3) A register that resembles Black English Vernacular has arisen. I would suggest that this new written form deliberately deviates from formal written standards. At the same time, it is economical, using shorter forms as rebuses for bulkier forms whenever the shorter form would be pronounced the same way. For example, "You know" is rewritten as "u no" (4 characters instead of 8). One can feel William Safire quaking, but those of us writing parsers must accept and embrace it.

The first place I noticed this was in the titles of songs written by Prince. The titles of songs on his first three albums never did this, but in albums released in 1981 and 1982, three of his songs had these elements in their titles (eg, "I would die 4 u"). You can see the deliberately contrary nature of his language by 1988, when he titled a song "Eye No", thus using a longer form instead of a standard shorter form. I don't know if Prince was significantly responsible for this phenomenon or not, but it has certainly caught on by now.

Incidentally, detecting a user's register is potentially quite valuable, since many business purposes for parsing Twitter involve market analysis and market segmentation.
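As a rough illustration of the rebus forms in item 3, here is a minimal Python sketch that maps each token to its possible standard readings. The table REBUS_CANDIDATES and the function expand_rebus are hypothetical names invented for this example, and the handful of entries is a toy stand-in for what would need to be a much larger list. Note that "no" and "4" stay ambiguous; choosing among the readings is left to the parser.

```python
# Toy table of rebus forms (an assumption for illustration, not a real resource).
# Each short form maps to every standard word it might stand for.
REBUS_CANDIDATES = {
    "u": ["you"],
    "r": ["are"],
    "no": ["no", "know"],
    "4": ["for", "four"],
    "2": ["to", "too", "two"],
}


def expand_rebus(text):
    """Return, for each whitespace-separated token, its possible readings.

    Tokens are lowercased first; tokens not in the table map to themselves.
    """
    readings = []
    for token in text.lower().split():
        readings.append(REBUS_CANDIDATES.get(token, [token]))
    return readings


if __name__ == "__main__":
    print(expand_rebus("u no"))
    print(expand_rebus("I would die 4 u"))
```

Returning a list of candidates rather than a single rewrite keeps the ambiguity visible, so a later stage with more context can decide between "no" and "know".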

4) Acronyms and emoticons. These are so common in computer-mediated communication that it is impossible to be unaware of them. LOL.

5) Novel contractions, like "hella", "tryna", "weneva".

6) Repeating characters to establish emphasis. Eg, "welcomeeeeeeeeee". This is in some cases a challenge to parse (in principle, "good" is "god" with the "o" repeated). In other cases, it's easy to convert to standard usage, but it does defeat literal search mechanisms.

Notice that the aforementioned devices can occur in combination. For example, "lmaaooo" = "laughing my ass off" with the "a" and "o" repeated for emphasis.
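One way to approach repeated characters is sketched below in Python, under stated assumptions: shrink every run of a repeated letter to either one or two copies, then keep whichever spellings appear in a word list or acronym table. LEXICON, ACRONYMS, candidates, and normalize are all hypothetical names, and the tiny tables are toy data. The sketch reproduces the ambiguity noted above: "good" comes back as both "god" and "good".

```python
import itertools
import re

# Toy word list and acronym table (assumptions for illustration only).
LEXICON = {"welcome", "good", "god"}
ACRONYMS = {"lmao": "laughing my ass off", "lol": "laughing out loud"}


def candidates(token):
    """Yield every spelling obtained by shrinking each run of a
    repeated character to length 1 or length 2."""
    options = []
    for match in re.finditer(r"(.)\1*", token):
        char, run_length = match.group(1), len(match.group(0))
        # A single character has one option; a longer run has two.
        options.append([char] if run_length == 1 else [char, char * 2])
    for combo in itertools.product(*options):
        yield "".join(combo)


def normalize(token):
    """Return the sorted list of known words or acronyms the token could be."""
    known = LEXICON | set(ACRONYMS)
    return sorted({c for c in candidates(token.lower()) if c in known})


if __name__ == "__main__":
    print(normalize("welcomeeeeeeeeee"))
    print(normalize("good"))
    print([ACRONYMS[a] for a in normalize("lmaaooo")])
```

The number of candidates doubles with each repeated run, which is tolerable for single tokens but would need pruning for anything longer.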

7) Unique medium-specific entities like URLs and the Twitter features for directing a tweet to a user (eg, @FakeSteveJobs) or a topic (eg, #lost).

8) Substitution of look-alike characters for one another. "3" can be used for "E", "0" for "o", "q" for "g".

9) Deliberate swapping of character order. For example, "teh" as a playful misspelling of "the". This can also combine with aforementioned devices. Eg, "pr0n" is a way to rewrite "porn".
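Items 7 through 9 interact, so here is a hedged Python sketch that handles them together. It first sets aside URLs, @mentions, and #hashtags so they are not mangled, then tries look-alike substitution and a single adjacent-character swap on the remaining words, accepting a rewrite only if it lands in a word list. Every name here (LEXICON, LOOKALIKES, classify, normalize_word, normalize_tweet) is hypothetical, the word list is toy data, and the example.com URL is a placeholder.

```python
import re

# Toy word list (an assumption for illustration only).
LEXICON = {"the", "porn", "good", "lost"}

# Look-alike substitutions taken from item 8.
LOOKALIKES = str.maketrans({"3": "e", "0": "o", "q": "g"})


def classify(token):
    """Label a token as a url, mention, hashtag, or ordinary word."""
    if re.match(r"https?://", token):
        return "url"
    if re.fullmatch(r"@\w+", token):
        return "mention"
    if re.fullmatch(r"#\w+", token):
        return "hashtag"
    return "word"


def adjacent_swaps(word):
    """Yield every spelling made by swapping one pair of neighboring characters."""
    for i in range(len(word) - 1):
        yield word[:i] + word[i + 1] + word[i] + word[i + 2:]


def normalize_word(word):
    """Try substitution, then one adjacent swap; fall back to the original."""
    lowered = word.lower()
    if lowered in LEXICON:
        return lowered
    substituted = lowered.translate(LOOKALIKES)
    if substituted in LEXICON:
        return substituted
    for candidate in adjacent_swaps(substituted):
        if candidate in LEXICON:
            return candidate
    return word  # unknown: leave untouched rather than guess


def normalize_tweet(text):
    """Return (kind, text) pairs, normalizing only ordinary words."""
    result = []
    for token in text.split():
        kind = classify(token)
        result.append((kind, normalize_word(token) if kind == "word" else token))
    return result


if __name__ == "__main__":
    print(normalize_tweet("@FakeSteveJobs teh pr0n #lost http://example.com/l33t"))
```

Classifying entities before normalizing matters: without that step, the "3" characters in the URL would be rewritten. Words that match nothing in the list, such as "quick" with this toy lexicon, pass through unchanged even though "q" is in the substitution table.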

Not every user partakes of these new linguistic devices, but a parser that is intended to wring meaning (and market intelligence) out of Twitter (or blogs or email or other electronic communication) ignores them at its peril. The more of these you miss, the more information you miss. And the people who have embraced these nonstandard devices represent a nontrivial amount of spending power.

1 comment:

Fredrik said...

Great blog, and many good points here!

As for the more grammatical aspects of tweeting, 1 & 2, pro drop or ellipses, and non-sentential statements, many lessons could be learned from analyzing transcribed spoken language corpora, where the same phenomena are highly frequent, for instance the Switchboard Corpus (part of the Penn Treebank) or the Christine Corpus (part of the Susanne Corpus).