NLP Confidential: 2011

Sunday, September 25, 2011

Laughing Out Loud

This may make you laugh.

And laughs are something people like to share. When people communicate via social media, they type "laughs." In a sample of a million words of Twitter messages in ten different languages, I found that about 0.5% of all "words" are laughs – "haha", "LOL", or other ways of typing out a chuckle.
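For readers who want to try a count like this on their own data, here is a minimal sketch in Python. The laugh pattern and the tokenizer are assumptions made for illustration; they are not the exact rules used for the numbers in this post, and the names `LAUGH_RE`, `is_laugh`, and `laugh_rate` are hypothetical.

```python
import re

# Assumed laugh pattern (illustrative only): "lol" variants, two or more
# "ha"/"ja" syllables (optionally wrapped in a leading "a" or trailing "h"),
# or two or more "ah" syllables.
LAUGH_RE = re.compile(r"lo+l+|a?(?:[hj]a){2,}h?|(?:ah){2,}", re.IGNORECASE)


def is_laugh(token):
    """Return True if the whole token looks like a typed-out laugh."""
    return LAUGH_RE.fullmatch(token) is not None


def laugh_rate(text):
    """Fraction of word tokens in `text` that are laughs."""
    tokens = re.findall(r"\w+", text)  # crude tokenizer, an assumption
    if not tokens:
        return 0.0
    return sum(is_laugh(t) for t in tokens) / len(tokens)
```

Run over a large sample of tweets per language, `laugh_rate` would give the kind of percentage quoted here, though a different pattern will give somewhat different numbers.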

Do people everywhere laugh equally?

Not on your life.

In a study of ten Western languages (English, German, Dutch, Norwegian, Swedish, Danish, French, Spanish, Italian, and Portuguese), I found enormous differences in the frequency of Twitter-laughs.

The Germans laugh least, with Twitter laughs making up under 0.1% of all words.

Other languages of Northern Europe were somewhat more prone to laughs than German. In increasing order of laugh frequency, Norwegian, French, Swedish, English, and Danish all came in below 0.4%.

And then there are the happy Latins. Laughing just a little more than the Danes, Portuguese speakers have 0.5% laughs, and that's nothing compared to the Italians, who Twitter-laugh in 0.9% of words. But the runaway laugh champions are Spanish speakers, who type Twitter laughs for 1.4% of words.

The North-South pattern is noteworthy, but is broken by the Dutch, who out-laugh their neighbors like they're misplaced Latins, finishing way up at 0.8%.

The Dutch notwithstanding, the North-South trend is sharp and undeniable, as this color-coded map makes clear.

What's even funnier: in the languages where people laugh more often, they also type longer laughs.

While three "ha"s are the preferred laugh in Spanish and Italian, they are not afraid to laugh longer. The five-ha laugh ("jajajajaja") is more common in Spanish than the two-ha laugh is in German. This graph shows how laugh length occurs in the five most-spoken languages. While Spanish runs away with the championship here at every length, notice that English is actually the runner-up for the two-ha laugh ("haha") with Italian strongly preferring a three-ha approach ("ahahah").

When you take into account the length as well as frequency of laughs, Spanish Twitter has 24 times more laughing than German, as measured in character count. This is not a subtle difference!
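A measure that weighs length as well as frequency could be sketched roughly as follows. This is an illustration under assumptions, not the procedure behind the 24x figure: the patterns `LAUGH` and `UNIT` and the helper names are hypothetical, and "lol"-style laughs are left out because they have no "ha" count.

```python
import re
from collections import Counter

# Assumed patterns (illustrative only).
LAUGH = re.compile(r"a?(?:[hj]a){2,}h?|(?:ah){2,}", re.IGNORECASE)
UNIT = re.compile(r"[hj]a|ah", re.IGNORECASE)


def ha_count(token):
    """Number of "ha"-like syllables in a laugh token, or 0 if not a laugh."""
    if LAUGH.fullmatch(token) is None:
        return 0
    return len(UNIT.findall(token))


def laugh_profile(tokens):
    """Return (histogram of laugh lengths, share of characters spent laughing)."""
    lengths = Counter()
    laugh_chars = 0
    total_chars = 0
    for tok in tokens:
        total_chars += len(tok)
        n = ha_count(tok)
        if n:
            lengths[n] += 1
            laugh_chars += len(tok)
    share = laugh_chars / total_chars if total_chars else 0.0
    return lengths, share
```

Dividing the character share for one language by the share for another gives a ratio of the kind quoted above.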

So, why is all of this happening? It's clear that more Twitter laughs come from the warmer and sunnier countries. This is true not only in Europe but also in the Americas, where most speakers of English, Spanish, and Portuguese live. Statistically speaking, the laugh statistic is highly correlated with the latitude of the corresponding European capital (farther south: r=0.66), how sunny that city is (more sun: r=0.74), and inversely with the suicide rate (r=-0.74; this is the same if you choose the U.S., Mexico, and Brazil instead of the U.K., Spain, and Portugal).
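The r values here are Pearson correlation coefficients. For anyone who wants to check a correlation like this against their own figures, a minimal sketch of the calculation follows. The function name `pearson` is hypothetical and no data from this study is reproduced; you would pass in one list of laugh rates and one list of latitudes, sunshine hours, or suicide rates, in the same language order.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient r between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

With only ten languages in the sample, a coefficient computed this way should be read as suggestive rather than conclusive.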

So is it as simple as this: Warm, sunny weather makes people laugh a lot and makes them immune to depression?

That may be part of it. But another idea to consider is that in Germany and Scandinavia Twitter is used comparatively more often for business and relatively less often for chatting. When one subtracts the social chat, then naturally less laughter remains.

Overall, it's not clear how much Twitter reflects life as a whole. Until we plant microphones everywhere and monitor all human communication, studies like this will just be suggestive of larger truths. But insofar as it goes, this study of Twitter laughs serves to support a lot of existing cultural stereotypes.

Monday, August 29, 2011

Social Media: Linguistic Anarchy?

Your schoolteacher would be horrified. As technology opens up new channels for people to communicate via the written word, the use of language in those channels becomes increasingly ill-formed and deviant. Social critics may look at this as a relaxation in standards, a harbinger of the decay of reason and civilization.

However, it's really not that bad. People feel differently about standards in language, and one might observe that if language use did not vary over time, we would still be speaking Latin, Anglo-Saxon, Proto-Indo-European, or some more primeval language. Whether you identify more with prescriptive linguistics (the use of instruction to make students keep in line with existing standards) or descriptive linguistics (the laissez-faire study of language to understand it, without concern for changing how people use it), the truth is that people still adhere to standards, and in many ways, those standards aren't so different in the era of electronic media than they were in the long-lost era of pens and inkwells.

How does English, say, on Twitter, differ from English in the news? Certainly one sees slang, profanity, misspelling, and neologisms. But the single greatest set of differences comes from the different kind of interaction. The news is meant to sound like the voice of God, detached, objective, hovering over the topic and the reader alike in the Third Person. In contrast, many interactions in social media are person-to-person, openly subjective, inherently spoken by the First Person to the Second Person... often on the topic of the First Person and/or the Second Person.

For ten major European languages, I have computed word frequencies for a corpus of news and of Twitter posts, and here I focus on the words which are the most prevalent on Twitter as compared to the news (frequency in Twitter minus frequency in the news). And for English, the word at the top of that list is nothing that would give schoolteachers and the clergy a stroke. It is "I." In fact, of the 39 words topping that list, only the acronym "LOL" and the abbreviation "u" are the stuff of which schoolteacher nightmares are made. The other 37 are almost exclusively words that are very common in The Queen's English and American Standard English, but happen to be more prevalent in first person narration than in the third person. They are common words that are natural descriptors of situated language, where the speaker's and listener's identity, time, place, attitude, and -- more generally -- their context, are part of the discussion.
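The ranking described here (frequency in Twitter minus frequency in the news) can be sketched in a few lines of Python. This is a minimal illustration that assumes both corpora are already tokenized; the function names are hypothetical, and choices such as case folding are left to the reader.

```python
from collections import Counter


def relative_freqs(tokens):
    """Map each word to its share of all tokens in the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def twitter_minus_news(twitter_tokens, news_tokens, top_n=25):
    """Words ranked by (Twitter frequency - news frequency), largest first."""
    tw = relative_freqs(twitter_tokens)
    nw = relative_freqs(news_tokens)
    # Words absent from the news corpus count as frequency 0 there.
    diffs = {w: f - nw.get(w, 0.0) for w, f in tw.items()}
    ranked = sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]
```

Because the score is a difference rather than a ratio, it favors words that are common in absolute terms, which is why everyday words like "I" can top the list ahead of rarer slang.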

And so, with an eye towards the top 25 words on the (Twitter-minus-news) frequency list, we see the following categories:

First-and-Second Person forms: I, am, me, my, we...

Deixis (words referring to the speaker's or listener's situation): today, just...

Simple, plainspoken vocabulary, words that are common in the news, but still more common for writers who are not consulting the thesaurus to make their language more flowery: not, was, do, had, did, have, got...

Far lower on the list come the shock words: "shit", "fuck", "alot", "alright", "ima", etc. And even the abbreviations are understandable when writers are constrained to squeeze their idea into the 140-character limit and deal with keypads that make extra effort a true burden. A drowning person is likely to yell for help in something other than complete sentences, and a person straining to fit an idea into Twitter's constraints has a legitimate motive for abbreviating more than they otherwise might. All told, we see that people have a strong tendency to hold to convention -- not the universal adherence to convention that William Safire would have liked, but it is still the most common case.

This is true in other languages as well, and to much the same degree. Here are, according to Twitter-minus-news frequency, the top 25 words for the five leading Western European languages. They reflect more or less the same tendencies.

English: I, 's, not, am, me, my, was, he, do, we, had, lol, news, did, u, have, new, today, just, think, haha, got, 'd, game, she.

French: je, j, c, pas, est, ai, tu, mais, moi, me, que, ça, a, t, mon, suis, on, ma, si, y, fait, il, te, quand, m.

German: ich, du, ja, d, nicht, aber, mir, mal, hab, was, jetzt, so, ist, noch, mich, da, dann, bin, es, schon, das, war, wenn, auch, dir.

Spanish: no, me, q, te, es, ya, si, yo, lo, jajaja, mi, tu, a, d, mas, pero, XD, jaja, jajajaja, eso, México, son, hay, solo, x.

Italian: non, mi, ho, io, ma, che, d, XD, è, ti, se, u, a, me, sono, lo, o, no, ci, l, ora, çç, sei, mia, poi.

Interestingly, the negative adverb in each language appears quite high. This seems to indicate that journalists exercise a discipline to express things in terms of positives while people generally use the negative a larger proportion of the time.

The big cross-language difference that is evident from the above is in the tendency for Twitter users to type out a "laugh", and this particularly stands out on the Spanish list. This merits a fun, and funny, follow-up post on how much people type out laughs in different languages. The results will probably not surprise you.

Sunday, June 26, 2011

Empirical Measure of Reliability: Part Four

This fourth installment on Twitter predictions for the NBA Finals will wrap up, for now, my pilot study on measuring reliability as a function of qualities that are detectable in the writings of the individuals who make the predictions. I've made some observations in the three preceding posts, and here I will make a few more and sum up the overall results.

First, the lay of the land: I collected predictions that individuals made on Twitter regarding the 2011 NBA Finals. The predictions were of which team would win and how many games the best-of-seven series would last. This gave eight possible outcomes, which ranged from one team winning in a four-game sweep to the other winning in a four-game sweep. Besides the individuals making predictions, there were other ways to quantify the soundness of each possible outcome, and the odds posted by Las Vegas sports books offer a possible "rationalist" position. At least, if there is a systematic error in the sports books, then there is a way for someone who knows more to get rich.

The sports books predicted a fairly close series, with Miami favored, but only slightly, with Miami-in-seven as the most likely outcome. As such, the least-favored outcomes were blowouts -- Miami-in-four or Dallas-in-four. Even before the Finals began, those seemed to be the brashest/wrongest predictions. And that proved to be true very soon, as each team won one of the first two games, making both of those predictions already wrong. Note that in some other year, where one team was dramatically superior to the other, predicting a four-game series might have been the smart bet -- but that was not true this year.

The heart of this study is to analyze the text that these 165 Twitter users have offered in other tweets (not the specific single prediction itself) and see if there is a fingerprint for reliability in the text that people generate.

Here are the observations that I have previously noted:

1) The people who made the rashest and most-incorrect predictions, a four-game series, differed from the crowd in being less prone to punctuate their tweets. (Note: There were only six such individuals, so this finding may not be significant.) They did not stand out in other obvious ways, such as the use of capitalization.

2) The people who made more extreme predictions (a four- or five-game series; including the "5s" raised the sample size to 32) were more likely to use the modal verb "must" than the modal verb "might." This suggests that some people carry a black-and-white worldview around while others see things in shades of gray; the "must" people predicted a lopsided outcome even though the judgment of Las Vegas (and the actual outcome of the series) was less extreme.

3) While the frequency of the eight possible predictions roughly matched the Las Vegas probabilities (Pearson correlation r=0.56), the crowd deviated from this primarily in over-predicting Dallas-in-six. Interestingly, this ended up being the actual outcome of the series! Is there a significant minority out there (about 25% of the individuals) who know better than Vegas?!

Now, two observations that I have not reported previously:

4) Do those Dallas-in-six predictors show some sign of brilliance in their vocabulary? The words that those people were most apt to use more often than the other 75% of individuals were: {awesome, follow, through, morning, maybe}. The words they used less often than the others were: {watching, fun, free}. If there's anything plausible about a worldview here (as with the "must" vs. "might" observation) I don't see it.

5) We can rate the full set of predictions according to correctness (Dallas-in-six being exactly right, but every other prediction being a certain number of games away from this, ranging from one game off to five games off). And then we can correlate correctness with the gross properties of the individuals' tweets and we see that correctness correlated positively with:

A) People who type longer tweets were more accurate than those who type shorter tweets. Pearson correlation, r=0.48.

B) People who use more mixed capitalization (capitalizing the first letters of words, vs. leaving the whole word lowercase or all-uppercase) were more accurate: Pearson correlation, r=0.46.

C) An overall measure of "fluency", using the correct English function words like "and" and "be" correlated only slightly with correctness: r=0.08.
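One way to read the "games off" rating in observation 5 is to line up the eight outcomes from most favorable to Dallas to most favorable to Miami and count steps from the actual result. The sketch below follows that reading, which does give a range of zero to five games off; the ordering and the names `OUTCOMES` and `games_off` are assumptions for illustration, not a record of the exact scoring used.

```python
# Eight possible outcomes, ordered from best-for-Dallas to best-for-Miami
# (assumed ordering, following the axis of the graph in Part Three).
OUTCOMES = [
    ("Dallas", 4), ("Dallas", 5), ("Dallas", 6), ("Dallas", 7),
    ("Miami", 7), ("Miami", 6), ("Miami", 5), ("Miami", 4),
]


def games_off(prediction, actual=("Dallas", 6)):
    """Distance in steps between a predicted outcome and the actual one."""
    return abs(OUTCOMES.index(prediction) - OUTCOMES.index(actual))


# A "correctness" score where higher is better could simply negate this.
def correctness(prediction, actual=("Dallas", 6)):
    return -games_off(prediction, actual)
```

Each user's correctness score can then be paired with a property of their tweets (average length, share of mixed-case words, and so on) and fed into a Pearson correlation.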

I stress again that this study was not painstakingly scientific, but I would like to use it as a pathfinder towards more informative studies in the coming months. An ambitious-enough goal would be to assess which individuals write in such a way as to seem to be systematically deluded, showing the world that they dispense with facts and wisdom and draw their own conclusions anyway. A yet-more ambitious goal would be to distinguish those individuals of exceptional reliability from those of average reliability, and perhaps to use the crowd as a predictor of the future that is better than anyone has yet systematically recognized (imagine if this led to predictions that were smarter than Las Vegas tends to posit).

While it will be fun and informative to continue this work with other sports events (a ready source of quantitative predictions that can be graded objectively), it would be even more rewarding to evaluate the soundness of predictions regarding politics, policy, technology, and science. It would of course be useful to analyze qualities that are deeper and more meaningful than punctuation. And I will confess to an ultimate goal of collecting empirical statistics on the soundness of various kinds of higher reasoning. Can we do the same as I've done here with arguments that intelligent astronomers made in the Twenties through Fifties, grading those predictions with the correct answers that we now in many cases have? Can we have a sort of truth-o-meter based on this sort of empirical work? Or, at least, can we show people that if they write badly they will seem less reliable? Watch this space in the months to come.

Friday, June 17, 2011

Empirical Measure of Reliability: Part Three

The NBA Finals ended last weekend with the Dallas Mavericks beating the Miami Heat in six games. This is an outcome that a significant minority of the Twitter users predicted -- specifically, it was the second-most common prediction (out of eight logical possibilities) with 24.8% of users choosing it.

What's interesting is to see how the predictions of the crowd selectively followed the probabilities implied by the odds posted by Las Vegas sports books. In the graph (click to enlarge), we see the possible outcomes along the bottom, with those most favorable to Miami on the right and those most favorable to Dallas on the left. The house (red line) gave Miami a modest edge in the series, and the probabilities hint at a Gaussian with a peak centered on "Miami winning in seven games." If we ignore the "Dallas in six" position, it looks like the crowd largely went with the gambling odds, choosing a peak that was nearby (Miami in six instead of seven) and then following a course somewhere between Response Matching and a Winner Takes All preference for the favored outcome.

However, the crowd deviated from that in one big way by giving far more credence to the "Dallas in six" outcome (and slightly more to "Miami in six" and slightly less to "Miami in seven"). It is specious to read too much into this single instance, but it looks like the crowd -- a significant minority of them -- got smart in a way that Las Vegas underestimated. Do those Dallas-in-six people have some special talent, or did they just get lucky? I'll take a look next time at how the Dallas-in-six predictors differed -- or didn't -- from the other predictors. Is there a smart gene somewhere in their writing?

Sunday, June 5, 2011

Empirical Measure of Reliability: Part Two

My pilot study on the use of text analytics to determine reliability is underway. The idea was to log a lot of predictions regarding the NBA Finals before the event begins, then to analyze the text generated by the predictors and look for correlations between a person's language and their accuracy in predicting the future event.

This is not a proper experiment -- there's a lot of hackery on my part. My hope is to learn from this pilot study to gear up for a proper experiment in the coming months. As of this writing, two games in the best-of-seven series have been completed, with the Miami Heat and the Dallas Mavericks each having one win. At this point, I'll give a quick overview of things I've seen in the data, which I collected right up to the last hour before the series began.

First, the predictions themselves: I collected 245 predictions from users on Twitter. I chose only predictions in which the user specified the outcome in terms of the series winner and the total number of games. (A best-of-seven series ends whenever one team has four wins. This could happen in as few as four or as many as seven games.) The object of interest to me is to look at the text these users produce in posts besides the prediction post itself, so I cut the study to those 165 posters for whom I was able to collect at least 10 other Twitter posts (excluding those which are retweets, containing the posts of other users).

The users tended (59%) to prefer Miami and the single most common prediction was Miami to win in six games. This is not far from the predicted outcome implied by the Las Vegas odds, which favored Miami in seven games, with Miami in six as a close second. However, the overall profile of the Twitter predictors in some ways deviated quite a bit from the Las Vegas view.

The Twitter predictors have a strange overestimate of the likelihood of Dallas (the underdog) winning the series in six games. This is strange in that the odds predict that the most likely duration of the series is seven games (34% probability). However, 59% of Twitter users predict a six-game series (most favoring Miami, some Dallas). Why? Maybe the answer lies in history: Six games is historically the most likely duration of an NBA Finals (41.5% of the Finals since 1980). Maybe people begin by accepting the likely duration of the series, and impose upon that their selection of which team will be the beneficiary. If so, there is an illogical bias: If Dallas is to perform better than the odds predict, then it is more likely that Dallas will lose in seven games, or win in seven games. It seems that users who favor Dallas, however, "go big" in their favoritism, concluding that if Dallas is to do well, they will do really well, winning by the comparatively large margin of four games to two which the odds say is fairly unlikely (10.2% probability).

Looking at the gross breakdown of predictions, it is also interesting to consider those who predicted the series to end in just four games. At a glance, it looks like the predictors are conservative, with only 4% having predicted a four-game series while the odds give a 13% probability of that outcome and history telling us that 17% of Finals end in four games. However, seen another way, the predictors seem irrationally exuberant in predicting a four-game outcome. Given an objective determination of the probabilities, it is irrational for anyone to choose the least-likely outcome. Let's say that we had a six-sided die with one side marked A, two sides marked B, and three sides marked C. If we ask a million rational gamblers to bet on the outcome of a single roll, it's not that 1/6 should say "A"; rather, absolutely nobody should say "A" if they want to maximize their probability of being correct. That 4% of predictors predicted a four-game series sweep (four choosing Miami, two Dallas) indicates that those predictors (and to a lesser extent, those choosing a five-game series or Dallas winning in six) are responding irrationally. Maybe for one of these reasons:

1) They incorrectly believe they know something that the rest of the world doesn't.

2) They actually do know something that the rest of the world doesn't.

3) The payoff for these predictors may favor winning and losing unequally. They may get great prestige or psychological reward from making a rare, perfect prediction, whereas they can simply ignore or walk away from their incorrect prediction should it not come true.

Interpretation (2), I should add, is not very persuasive. If a class of individuals had the ability to predict more accurately than the rest of the world, then some of those individuals would have the incentive to bet large sums of money on the event, which would shift the odds to the correct value. The odds should reflect the "smart money" quite closely unless someone has access to information that the world does not. (E.g., if the series is "fixed" by a point-shaving scheme.) It is unlikely, to say the least, that people posting on Twitter have rigged the series or are in on such a scheme and have decided to act on that information by posting a prediction to Twitter.
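The die example above can be checked with a little arithmetic. The sketch below compares a gambler who always picks the most likely face with a crowd that picks each face in proportion to its probability. The function names are hypothetical, and the only numbers used are the ones from the die example itself.

```python
from fractions import Fraction

# The die from the example: one side A, two sides B, three sides C.
die = {"A": Fraction(1, 6), "B": Fraction(2, 6), "C": Fraction(3, 6)}


def accuracy_always_pick(probs, choice):
    """Chance of being right if you always bet on `choice`."""
    return probs[choice]


def accuracy_matching(probs):
    """Chance of being right if you bet on each face as often as it comes up."""
    return sum(p * p for p in probs.values())


best = max(die, key=die.get)                 # "C"
print(accuracy_always_pick(die, best))       # 1/2
print(accuracy_matching(die))                # 7/18
print(accuracy_always_pick(die, "A"))        # 1/6
```

Always betting on C is right half the time, matching the probabilities is right only 7/18 of the time, and betting on A is right just 1/6 of the time, which is the sense in which nobody who wants to maximize accuracy should say "A".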

It is premature to call the study complete, but I will quickly note some characteristics of the data. First, I noted how much punctuation (particularly, periods, commas, or apostrophes/single quotes) per tweet each user produces. The average over all predictors is 2.34 punctuation marks (of those kinds) per tweet, with 60% of predictors using more than 2.0 punctuation marks per tweet. An interesting, if not significant observation: Of those users who predicted a series duration of five or more games, only 38% used fewer than 2.0 punctuation marks per tweet. Of those (only six) users who predicted a four-game series, 100% used fewer than 2.0 punctuation marks per tweet! Are the people who ignored their English teachers walking through life ignoring all sorts of wisdom, making basketball predictions as badly as they punctuate? There's no statistical significance in this result, but note that the four-game predictors are already wrong: Because each team has already won a game, the series will last at least five games.
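The punctuation measure is simple enough to sketch. The version below counts periods, commas, and apostrophes or single quotes per tweet, as described above. Treating the curly apostrophe as equivalent to the straight one is an assumption, as are the names `PUNCT` and `punctuation_per_tweet`.

```python
# Marks counted: period, comma, straight apostrophe/single quote, and
# (as an assumption) the curly apostrophe U+2019.
PUNCT = set(".,'\u2019")


def punctuation_per_tweet(tweets):
    """Average number of counted punctuation marks per tweet for one user."""
    if not tweets:
        return 0.0
    total = sum(sum(ch in PUNCT for ch in tweet) for tweet in tweets)
    return total / len(tweets)
```

A user whose average falls below 2.0 would land in the low-punctuation group discussed above.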

Performing a similar analysis of how people capitalize has not shown any such effect, however. If people who punctuate poorly are ignoring reality, the same is not obviously happening with Twitter capitalization.

It is also going to be interesting to note the vocabulary biases in the various groups of predictors. Those who used the word "must" in their non-prediction posts had a 32% probability of predicting an extreme, and unlikely, outcome (four games or five). Only 12% of users who used the word "might" in non-prediction posts predicted such an outcome. Does this mean that there are people who see the world in black-and-white, who ignore the more likely median cases, the shades of gray? Perhaps! In future studies, it will be useful to collect more data to look at a wider range of terms that indicate a black-and-white worldview or a shades-of-gray outlook.
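The "must" versus "might" comparison amounts to a conditional share: among users who used a given word, what fraction predicted a four- or five-game series? A minimal sketch follows. The data layout (a list of dicts with `tweets` and `games` keys) and the function names are hypothetical.

```python
import re


def uses_word(tweets, word):
    """True if any tweet contains `word` as a whole word, ignoring case."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return any(pattern.search(t) for t in tweets)


def extreme_share(users, word, extreme_lengths=(4, 5)):
    """Share of users of `word` whose predicted series length was extreme.

    Each user is a dict with "tweets" (non-prediction posts) and "games"
    (predicted series length). Returns None if nobody used the word.
    """
    group = [u for u in users if uses_word(u["tweets"], word)]
    if not group:
        return None
    return sum(u["games"] in extreme_lengths for u in group) / len(group)
```

Calling `extreme_share(users, "must")` and `extreme_share(users, "might")` on the same sample gives the two percentages being compared. With small groups the difference can easily arise by chance.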

That's the mid-course report. I will provide more analysis as the results come in. And the tangible (if not statistically significant) determination of how accurate the predictions were will come courtesy of the Miami Heat and the Dallas Mavericks. Tonight: Game Three!

Tuesday, May 31, 2011

Empirical Measure of Reliability: Part One

Students writing essays. Contributors to Wikipedia. Economists, Wall Street traders, technologists, scientists, and political pundits predicting the future. A general enterprise for thinking persons is to understand the world, sometimes vying against the adversity of uncertainty. In various ways, those thinkers who venture an opinion are "graded" for the quality of their work, and those who do well are -- perhaps -- accorded more credibility in the future.

Separately, text analytics is a technology that is coming of age. Low-level properties of text are used to assess the meaning in interesting ways. I have worked to build business-to-business solutions in text analytics: I did netnography at Netbase, and I am currently building sentiment analysis classifiers in many languages for Meltwater News.

When one reads analysis (as when I graded research papers as a teacher at Western Reserve Academy), one inevitably feels that some writers show a high degree of insight while others analyze poorly and show this with writing of lower quality. If an analyst cannot construct a proper sentence, then doesn't that say something damning about the quality of the analysis? Is a tongue-tied politician inevitably a hack in determining policy? Can someone who confuses "there" and "their" have something worth saying?

Here, I announce a pilot study to examine if the low-level properties of text have a bearing on the quality of the analysis therein -- more specifically, the accuracy of predictions made by the analyst. Choosing an objective proposition that is close at hand, I will use the 2011 NBA Championship Series, which begins a few hours from now, featuring the Dallas Mavericks vs. the Miami Heat, as a test of the many people out there who are staking their reputation upon predictions of the outcome. In the era of social media, such predictions are not scarce. My experiment is to collect the identities of many Twitter users who are predicting the outcome of the series. Then we can analyze the low-level qualities of the text that these users produce and have an objective measure (albeit with very low N -- just one yes/no "grade") of the reliability of each user's predictions.

The provocative promise of this type of study is that we can find empirical evidence that people who express themselves in certain types of ways are better predictors than people who express themselves in other ways. Imagine the range of possible discoveries. Imagine the gleeful English teacher who can point to empirical fact to conclude once and for all, "People who do not capitalize and punctuate correctly think poorly and what they say is factually incorrect." Imagine the shock in academia if the reverse proves to be true!

To be clear, this pilot is far from a proper experiment. The "low N" problem (only one result) means that there cannot be much meaning in the result. Maybe the smart prediction will turn out to be wrong (say, if a player on the team that should have won suffers an unexpected injury). I am not carefully picking my variables in advance -- I will analyze them as the series takes place. And there is plenty of subjectivity even in the determination of what prediction is being made (the language of these tweets is very vague). Also, the independent variables will necessarily be low-level (like use of punctuation) and preclude the more interesting high-level variables like reasonableness-of-argument. I think a proper scientific analysis could find so many flaws in my methodology as to disgrace me thoroughly.

But the topic is, for me, compelling, and this can serve as a pilot for better studies in the future that address many of those shortcomings. Call this observation instead of science.

I have been collecting prediction posts from Twitter over the last few days, and I continue to do so at the present moment. It seems that I will have roughly 200 users in my sample, with about 50 posts per user as the text that can be analyzed as a sample of their writing. I will separate the users into groups after the series begins this afternoon, and then we can see how the best-of-seven series speaks to the quality of the predictions.

I hope that this is the beginning of future studies, extending the scope to other domains (e.g., election outcomes, public policy, the success of Internet startups), and more subtle qualities of the analysis (e.g., use of causality to explain one's points; proper use of logical reasoning). And maybe one day, we will be able to take much of what might be written and say, "That kind of thought is invalid." And maybe the world will raise the quality of its thought by a notch or two. Isn't it pretty to think so?

My data collection continues. Predictors are typing away. And over the coming days, the Miami Heat and the Dallas Mavericks will help us determine what is right and what is not right.
