Sunday, June 26, 2011

Empirical Measure of Reliability: Part Four

This fourth installment on Twitter predictions for the NBA Finals will wrap up, for now, my pilot study on measuring reliability as a function of qualities that are detectable in the writings of the individuals who make the predictions. I've made some observations in the three preceding posts, and here I will make a few more and sum up the overall results.

First, the lay of the land: I collected predictions that individuals made on Twitter regarding the 2011 NBA Finals. The predictions were of which team would win and how many games the best-of-seven series would last. This gave eight possible outcomes, which ranged from one team winning in a four-game sweep to the other winning in a four-game sweep. Besides the individuals making predictions, there were other ways to quantify the soundness of each possible outcome, and the odds posted by Las Vegas sports books offer a possible "rationalist" position. At least, if there is a systematic error in the sports books, then there is a way for someone who knows more to get rich.
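For readers who want to follow along, here is a minimal sketch of how posted odds can be turned into the "rationalist" probabilities I refer to. The moneylines below are invented for illustration, not the actual 2011 lines:

```python
def implied_prob(moneyline):
    """Convert an American moneyline to its raw implied probability."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)

def remove_vig(raw_probs):
    """Rescale so the implied probabilities sum to 1 (removes the house's overround)."""
    total = sum(raw_probs)
    return [p / total for p in raw_probs]

# Hypothetical moneylines for the eight outcomes, Miami in 4 ... Dallas in 4
# (invented numbers, NOT the real 2011 lines):
lines = [900, 450, 350, 300, 320, 380, 500, 1100]
probs = remove_vig([implied_prob(m) for m in lines])
```

Raw implied probabilities from a sports book sum to more than 1 (that surplus is the book's margin), which is why the renormalization step is needed before comparing to the crowd.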

The sports books predicted a fairly close series, with Miami favored, but only slightly, with Miami-in-seven as the most likely outcome. As such, the least-favored outcomes were blowouts -- Miami-in-four or Dallas-in-four. Even before the Finals began, those seemed to be the brashest/wrongest predictions. And that proved to be true very soon, as each team won one of the first two games, making both of those predictions already wrong. Note that in some other year, where one team was dramatically superior to the other, predicting a four-game series might have been the smart bet -- but that was not true this year.

The heart of this study is to analyze the text that these 165 Twitter users have offered in other tweets (not the specific single prediction itself) and see if there is a fingerprint for reliability in the text that people generate.

Here are the observations that I have previously noted:

1) The people who made the rashest and most-incorrect predictions, a four-game series, differed from the crowd in being less prone to punctuate their tweets. (Note: There were only six such individuals, so this finding may not be significant.) They did not stand out in other obvious ways, such as the use of capitalization.

2) The people who made more extreme predictions (a four- or five-game series; including the "5s" raised the sample size to 32) were more likely to use the modal verb "must" than the modal verb "might." This suggests that some people carry around a black-and-white worldview while others see things in shades of gray; the "must" people predicted a lopsided outcome despite the less extreme judgment of Las Vegas (and the less extreme actual outcome of the series).

3) While the frequency of the eight possible predictions roughly matched the Las Vegas probabilities (Pearson correlation r=0.56), the crowd deviated from this primarily in over-predicting Dallas-in-six. Interestingly, this ended up being the actual outcome of the series! Is there a significant minority out there (about 25% of the individuals) who know better than Vegas?!
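For the curious, the Pearson correlation above was computed in the usual way over the eight outcomes; a small sketch, with invented stand-in numbers rather than my actual tallies:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Placeholder distributions over the eight outcomes, Miami-in-4 ... Dallas-in-4
# (illustrative values only):
vegas = [0.05, 0.10, 0.18, 0.22, 0.19, 0.14, 0.08, 0.04]
crowd = [0.02, 0.05, 0.35, 0.10, 0.08, 0.25, 0.10, 0.05]
r = pearson(vegas, crowd)
```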

Now, two observations that I have not reported previously:

4) Do those Dallas-in-six predictors show some sign of brilliance in their vocabulary? The words most apt for those people to use more often than the other 75% of individuals were: {awesome, follow, through, morning, maybe}. The words they used less often than the others were: {watching, fun, free}. If there's anything plausible about a worldview here (as with the "must" vs. "might" observation) I don't see it.

5) We can rate the full set of predictions according to correctness (Dallas-in-six being exactly right, and every other prediction being a certain number of games away from this, ranging from one game off to five games off). We can then correlate correctness with gross properties of the individuals' tweets, with these results:

A) People who type longer tweets were more accurate than those who type shorter tweets. Pearson correlation, r=0.48.

B) People who use more mixed capitalization (capitalizing the first letters of words, vs. leaving the whole word lowercase or all-uppercase) were more accurate: Pearson correlation, r=0.46.

C) An overall measure of "fluency" (the use of common English function words like "and" and "be") correlated only slightly with correctness: r=0.08.
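For concreteness, here is one way to compute the "games off" error described above: lay the eight outcomes along a single axis from a Miami sweep to a Dallas sweep and take the distance from the true outcome. The outcome labels are my own shorthand, and this ordering is one plausible reading of the metric:

```python
# The eight outcomes on one axis, from a Miami sweep to a Dallas sweep:
OUTCOMES = ["MIA in 4", "MIA in 5", "MIA in 6", "MIA in 7",
            "DAL in 7", "DAL in 6", "DAL in 5", "DAL in 4"]

def games_off(prediction, actual="DAL in 6"):
    """Error as distance along the outcome axis (0 = exactly right)."""
    return abs(OUTCOMES.index(prediction) - OUTCOMES.index(actual))
```

Under this scheme the wrong predictions range from one game off (Dallas in seven or Dallas in five) to five games off (a Miami sweep), matching the description above.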

I stress again that this study was not painstakingly scientific, but I would like to use it as a pathfinder towards more informative studies in the coming months. An ambitious-enough goal would be to assess which individuals write in such a way as to seem to be systematically deluded, showing the world that they dispense with facts and wisdom and draw their own conclusions anyway. A yet-more ambitious goal would be to distinguish those individuals of exceptional reliability from those of average reliability, and perhaps to use the crowd as a predictor of the future that is better than anyone has yet systematically recognized (imagine if this led to predictions that were smarter than Las Vegas tends to posit).

While it will be fun and informative to continue this work with other sports events (a ready source of quantitative predictions that can be graded objectively), it would be even more rewarding to evaluate the soundness of predictions regarding politics, policy, technology, and science. It would of course be useful to analyze qualities that are deeper and more meaningful than punctuation. And I will confess to an ultimate goal of collecting empirical statistics on the soundness of various kinds of higher reasoning. Can we do the same as I've done here with arguments that intelligent astronomers made in the Twenties through Fifties, grading those predictions with the correct answers that we now in many cases have? Can we have a sort of truth-o-meter based on this sort of empirical work? Or, at least, can we show people that if they write badly they will seem less reliable? Watch this space in the months to come.

Friday, June 17, 2011

Empirical Measure of Reliability: Part Three

The NBA Finals ended last weekend with the Dallas Mavericks beating the Miami Heat in six games. This is an outcome that a significant minority of the Twitter users predicted -- specifically, it was the second-most common prediction (out of eight logical possibilities) with 24.8% of users choosing it.

What's interesting is to see how the predictions of the crowd selectively followed the probabilities implied by the odds posted by Las Vegas sports books. In the graph (click to enlarge), we see the possible outcomes along the bottom, with those most favorable to Miami on the right and those most favorable to Dallas on the left. The house (red line) gave Miami a modest edge in the series, and the probabilities hint at a Gaussian with a peak centered on "Miami winning in seven games." If we ignore the "Dallas in six" position, it looks like the crowd largely went with the gambling odds, choosing a peak that was nearby (Miami in six instead of seven) and then following a course somewhere between Response Matching and a Winner Takes All preference for the favored outcome.
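One way to formalize "somewhere between Response Matching and a Winner Takes All preference" is a one-parameter family: raise the house probabilities to a power and renormalize. An exponent of 1 reproduces the odds exactly (pure matching), while a large exponent piles everyone onto the favorite. A sketch, with a simple grid search for the best-fitting exponent (the fitting approach is my own illustration, not something used in the study):

```python
def sharpen(probs, alpha):
    """Raise probabilities to a power and renormalize.
    alpha = 1 reproduces the input (Response Matching);
    large alpha concentrates mass on the favorite (Winner Takes All)."""
    raised = [p ** alpha for p in probs]
    total = sum(raised)
    return [p / total for p in raised]

def best_alpha(vegas, crowd):
    """Grid-search the exponent that best explains the crowd distribution."""
    grid = [0.5 + 0.1 * i for i in range(40)]  # alpha from 0.5 to 4.4
    def sse(a):
        return sum((c - s) ** 2 for c, s in zip(crowd, sharpen(vegas, a)))
    return min(grid, key=sse)
```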

However, the crowd deviated from that in one big way by giving far more credence to the "Dallas in six" outcome (and slightly more to "Miami in six" and slightly less to "Miami in seven"). It would be hasty to read too much into this single instance, but it looks like the crowd -- a significant minority of them -- got smart in a way that Las Vegas underestimated. Do those Dallas-in-six people have some special talent, or did they just get lucky? I'll take a look next time at how the Dallas-in-six predictors differed -- or didn't -- from the other predictors. Is there a smart gene somewhere in their writing?

Sunday, June 5, 2011

Empirical Measure of Reliability: Part Two

My pilot study on the use of text analytics to determine reliability is underway. The idea was to log a lot of predictions regarding the NBA Finals before the event began, then to analyze the text generated by the predictors and look for correlations between a person's language and their accuracy in predicting the future event.

This is not a proper experiment -- there's a lot of hackery on my part. My hope is to learn from this pilot study to gear up for a proper experiment in the coming months. As of this writing, two games in the best-of-seven series have been completed, with the Miami Heat and the Dallas Mavericks each having one win. At this point, I'll give a quick overview of things I've seen in the data, which I collected right up to the last hour before the series began.

First, the predictions themselves: I collected 245 predictions from users on Twitter. I chose only predictions in which the user specified the outcome in terms of the series winner and the total number of games. (A best-of-seven series ends whenever one team has four wins. This could happen in as few as four or as many as seven games.) My object of interest is the text these users produce in posts besides the prediction post itself, so I cut the study to the 165 posters for whom I was able to collect at least 10 other Twitter posts (excluding retweets, which merely repeat the posts of other users).
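The filtering step can be sketched in a few lines. The assumption that retweets begin with an "RT " prefix is mine (it matches the common convention of the time), and the sample data are invented:

```python
def non_retweets(tweets):
    """Drop retweets, assuming the era's 'RT @user' prefix convention."""
    return [t for t in tweets if not t.startswith("RT ")]

def eligible_users(users, min_posts=10):
    """Keep only predictors with at least min_posts non-retweet posts on file."""
    return {name: non_retweets(tweets) for name, tweets in users.items()
            if len(non_retweets(tweets)) >= min_posts}

# Toy corpus: user "a" has ten original posts; user "b" mostly retweets.
users = {"a": ["post %d" % i for i in range(10)],
         "b": ["RT @x nice take"] * 12 + ["my own post"]}
kept = eligible_users(users)
```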

The users tended (59%) to prefer Miami and the single most common prediction was Miami to win in six games. This is not far from the predicted outcome implied by the Las Vegas odds, which favored Miami in seven games, with Miami in six as a close second. However, the overall profile of the Twitter predictors in some ways deviated quite a bit from the Las Vegas view.

The Twitter predictors show a strange overestimate of the likelihood of Dallas (the underdog) winning the series in six games. This is strange in that the odds predict that the most likely duration of the series is seven games (34% probability). However, 59% of Twitter users predicted a six-game series (most favoring Miami, some Dallas). Why? Maybe the answer lies in history: Six games is historically the most likely duration of an NBA Finals (41.5% of the Finals since 1980). Maybe people begin by accepting the likely duration of the series, and impose upon that their selection of which team will be the beneficiary. If so, there is an illogical bias: If Dallas is to perform better than the odds predict, then it is more likely that Dallas will win or lose in seven games. It seems that users who favor Dallas, however, "go big" in their favoritism, concluding that if Dallas is to do well, they will do really well, winning by the comparatively large margin of four games to two, which the odds say is fairly unlikely (10.2% probability).

Looking at the gross breakdown of predictions, it is also interesting to consider those who predicted the series to end in just four games. At a glance, the predictors look conservative: only 4% predicted a four-game series, while the odds give a 13% probability of that outcome and history tells us that 17% of Finals end in four games. However, seen another way, the predictors seem irrationally exuberant in predicting a four-game outcome. Given an objective determination of the probabilities, it is irrational for anyone to choose the least-likely outcome. Let's say that we had a six-sided die with one side marked A, two sides marked B, and three sides marked C. If we ask a million rational gamblers to bet on the outcome of a single roll, it's not that 1/6 should say "A"; rather, absolutely nobody should say "A" if they want to maximize their probability of being correct. That 4% of predictors predicted a four-game sweep (four choosing Miami, two Dallas) indicates that those predictors (and to a lesser extent, those choosing a five-game series or Dallas winning in six) are responding irrationally. Maybe for one of these reasons:

1) They incorrectly believe they know something that the rest of the world doesn't.
2) They actually do know something that the rest of the world doesn't.
3) The payoff for these predictors may favor winning and losing unequally. They may get great prestige or psychological reward from making a rare, perfect prediction, whereas an incorrect prediction can simply be ignored or walked away from.

Interpretation (2), I should add, is not very persuasive. If a class of individuals had the ability to predict more accurately than the rest of the world, then some of those individuals would have the incentive to bet large sums of money on the event, which would shift the odds to the correct value. The odds should reflect the "smart money" quite closely unless someone has access to information that the world does not. (E.g., if the series is "fixed" by a point-shaving scheme.) It is unlikely, to say the least, that people posting on Twitter have rigged the series or are in on such a scheme and have decided to act on that information by posting a prediction to Twitter.
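The die thought experiment above is easy to make quantitative. If gamblers "response match" (pick each outcome in proportion to its true probability), the expected fraction correct is the sum of the squared probabilities, which is always at most what everyone gets by picking the favorite:

```python
def matching_accuracy(probs):
    """Expected fraction correct when gamblers pick outcomes in proportion
    to the true probabilities (Response Matching): sum of p_i squared."""
    return sum(p * p for p in probs)

# Die with one A side, two B sides, three C sides:
die = [1/6, 2/6, 3/6]
match_acc = matching_accuracy(die)  # 14/36, about 0.39
best_acc = max(die)                 # 0.50: everyone picks C
```

So even the 1/6 of matchers who say "A" drag the group's accuracy below the all-pick-C strategy, which is the sense in which betting on the least-likely outcome is irrational.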

It is premature to call the study complete, but I will quickly note some characteristics of the data. First, I noted how much punctuation (particularly periods, commas, and apostrophes/single quotes) each user produces per tweet. The average over all predictors is 2.34 punctuation marks (of those kinds) per tweet, with 60% of predictors using more than 2.0 punctuation marks per tweet. An interesting, if not statistically significant, observation: Of those users who predicted a series duration of five or more games, only 38% used fewer than 2.0 punctuation marks per tweet. Of those (only six) users who predicted a four-game series, 100% used fewer than 2.0 punctuation marks per tweet! Are the people who ignored their English teachers walking through life ignoring all sorts of wisdom, making basketball predictions as badly as they punctuate? There's no statistical significance in this result, but note that the four-game predictors are already wrong: Because each team has already won a game, the series will last at least five games.
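The punctuation tally is simple to reproduce. A sketch (the sample tweets are invented):

```python
def punct_per_tweet(tweets):
    """Average number of periods, commas, and apostrophes per tweet."""
    return sum(ch in ".,'" for t in tweets for ch in t) / len(tweets)

# Toy tweets (invented): one well-punctuated, one not.
tweets = ["Heat in six, no doubt.", "dallas aint got a chance"]
avg = punct_per_tweet(tweets)  # 1.0 for this pair
```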

Performing a similar analysis of how people capitalize has not shown any such effect, however. If people who punctuate poorly are ignoring reality, the same is not obviously happening with Twitter capitalization.

It is also going to be interesting to note the vocabulary biases in the various groups of predictors. Those who used the word "must" in their non-prediction posts had a 32% probability of predicting an extreme, and unlikely, outcome (four games or five). Only 12% of users who used the word "might" in non-prediction posts predicted such an outcome. Does this mean that there are people who see the world in black-and-white, who ignore the more likely median cases, the shades of gray? Perhaps! In future studies, it will be useful to collect more data to look at a wider range of terms that indicate a black-and-white worldview or a shades-of-gray outlook.
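The "must"/"might" comparison amounts to a conditional probability over users. A sketch with toy data (the tweets and predictions below are invented, not drawn from my sample):

```python
def extreme_rate(users, word):
    """P(extreme prediction | user's other tweets contain the word).
    Each user is (tweets, predicted_games); 'extreme' means 4 or 5 games."""
    using = [(tweets, games) for tweets, games in users
             if any(word in t.lower().split() for t in tweets)]
    if not using:
        return 0.0
    return sum(games <= 5 for _, games in using) / len(using)

# Toy users (invented): (other tweets, predicted series length)
users = [
    (["they must win tonight"], 4),
    (["it must happen"], 7),
    (["maybe they might win"], 6),
    (["we might see a close one"], 7),
]
```

With real data the same function would simply be called once with "must" and once with "might" to get the 32% and 12% figures quoted above.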

That's the mid-course report. I will provide more analysis as the results come in. And the tangible (if not statistically significant) determination of how accurate the predictions were will come courtesy of the Miami Heat and the Dallas Mavericks. Tonight: Game Three!