Monday, August 29, 2011

Social Media: Linguistic Anarchy?

Your schoolteacher would be horrified. As technology opens up new channels for people to communicate via the written word, the use of language in those channels becomes increasingly ill-formed and deviant. Social critics may look at this as a relaxation in standards, a harbinger of the decay of reason and civilization.

However, it's really not that bad. People feel differently about standards in language, and one might observe that if language use did not vary over time, we would still be speaking Latin, Anglo-Saxon, Proto-Indo-European, or some more primeval language. Whether you identify more with prescriptive linguistics (the use of instruction to make students keep in line with existing standards) or descriptive linguistics (the laissez-faire study of language to understand it, without concern for changing how people use it), the truth is that people still adhere to standards, and in many ways, those standards are not so different in the era of electronic media from what they were in the long-lost era of pens and inkwells.

How does English on Twitter, say, differ from English in the news? Certainly one sees slang, profanity, misspelling, and neologisms. But the single greatest set of differences comes from the different kind of interaction. The news is meant to sound like the voice of God, detached, objective, hovering over the topic and the reader alike in the Third Person. In contrast, many interactions in social media are person-to-person, openly subjective, inherently spoken by the First Person to the Second Person... often on the topic of the First Person and/or the Second Person.

For ten major European languages, I have computed word frequencies for a corpus of news and of Twitter posts, and here I focus on the words which are the most prevalent on Twitter as compared to the news (frequency in Twitter minus frequency in the news). And for English, the word at the top of that list is nothing that would give schoolteachers and the clergy a stroke. It is "I." In fact, of the 39 words topping that list, only the acronym "LOL" and the abbreviation "u" are the stuff of which schoolteacher nightmares are made. The other 37 are almost exclusively words that are very common in The Queen's English and American Standard English, but happen to be more prevalent in first person narration than in the third person. They are common words that are natural descriptors of situated language, where the speaker's and listener's identity, time, place, attitude, and -- more generally -- their context, are part of the discussion.
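For readers who want to try the comparison themselves, here is a minimal sketch of a Twitter-minus-news ranking in Python. It is not the code behind the lists in this post. The tokenizer (lowercased runs of letters), the function names, and the tiny sample texts are all assumptions made up for illustration; in particular, this tokenizer splits "I'm" into "i" and "m" and will not produce tokens like "'s" or "'d".

```python
import re
from collections import Counter


def relative_frequencies(texts):
    """Map each token to its share of all tokens in the given texts."""
    counts = Counter()
    for text in texts:
        # Assumed tokenizer: lowercase, then take runs of Unicode letters.
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


def twitter_minus_news(tweets, news, top_n=25):
    """Rank words by (relative frequency in tweets) - (relative frequency in news)."""
    tw = relative_frequencies(tweets)
    nw = relative_frequencies(news)
    diff = {w: tw.get(w, 0.0) - nw.get(w, 0.0) for w in set(tw) | set(nw)}
    # Largest difference first; ties are broken alphabetically.
    return sorted(diff.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Made-up sample texts, only to show the call; not real corpus data.
sample_tweets = ["I am so tired today", "I just got home"]
sample_news = ["The minister said the budget was approved today"]

for word, score in twitter_minus_news(sample_tweets, sample_news, top_n=3):
    print(f"{word}\t{score:+.3f}")
```

Using relative rather than raw counts keeps the subtraction meaningful when the two corpora differ in size, which is presumably the case for any real news and Twitter collection.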

And so, with an eye towards the top 25 words on the (Twitter-minus-news) frequency list, we see the following categories:

First-and-Second Person forms: I, am, me, my, we...
Deixis (words referring to the speaker's or listener's situation): today, just...
Simple, plainspoken vocabulary, words that are common in the news, but still more common for writers who are not consulting the thesaurus to make their language more flowery: not, was, do, had, did, have, got...

Far lower on the list come the shock words: "shit", "fuck", "alot", "alright", "ima", etc. And even the abbreviations are understandable when writers are constrained to squeeze their idea into the 140-character limit and must deal with keypads that make extra effort a true burden. A drowning person is likely to yell for help in something other than complete sentences, and a person straining to fit an idea into Twitter's constraints has a legitimate motive for abbreviating more than they otherwise might. All told, we see that people have a strong tendency to hold to convention -- not the universal adherence to convention that William Safire would have liked, but it is still the most common case.

This is true in other languages as well, and to much the same degree. Here are the top 25 words, ranked by Twitter-minus-news frequency, for the five leading Western European languages. They reflect more or less the same tendencies.

English: I, 's, not, am, me, my, was, he, do, we, had, lol, news, did, u, have, new, today, just, think, haha, got, 'd, game, she.

French: je, j, c, pas, est, ai, tu, mais, moi, me, que, ça, a, t, mon, suis, on, ma, si, y, fait, il, te, quand, m.

German: ich, du, ja, d, nicht, aber, mir, mal, hab, was, jetzt, so, ist, noch, mich, da, dann, bin, es, schon, das, war, wenn, auch, dir.

Spanish: no, me, q, te, es, ya, si, yo, lo, jajaja, mi, tu, a, d, mas, pero, XD, jaja, jajajaja, eso, México, son, hay, solo, x.

Italian: non, mi, ho, io, ma, che, d, XD, è, ti, se, u, a, me, sono, lo, o, no, ci, l, ora, çç, sei, mia, poi.

Interestingly, the negative adverb appears quite high on each list: "not", "pas", "nicht", "no", and "non" all fall within the top five. This seems to indicate that journalists exercise a discipline to express things in terms of positives, while people generally use the negative a larger proportion of the time.

The big cross-language difference that is evident from the above is in the tendency for Twitter users to type out a "laugh", and this particularly stands out on the Spanish list. This merits a fun follow-up post on how much people type out laughs in different languages. The results will probably not surprise you.

1 comment:

dan said...

nice post and nice computation! I wonder what happens if u look at 2-grams, or at the distribution of "right co-occurrences" (e.g the distribution of x's for linguistic patterns of the form "I x", or "we x" etc...), is there a further level of differentiation between twitter-talk and news-talk?