NLP Confidential: Empirical Measure of Reliability: Part Two

My pilot study on the use of text analytics to determine reliability is underway. The idea was to log a lot of predictions regarding the NBA Finals before the event begins, then to analyze the text generated by the predictors and look for correlations between a person's language and their accuracy in predicting the future event.

This is not a proper experiment -- there's a lot of hackery on my part. My hope is to learn from this pilot study to gear up for a proper experiment in the coming months. As of this writing, two games in the best-of-seven series have been completed, with the Miami Heat and the Dallas Mavericks each having one win. At this point, I'll give a quick overview of things I've seen in the data, which I collected right up to the last hour before the series began.

First, the predictions themselves: I collected 245 predictions from users on Twitter. I chose only predictions in which the user specified the outcome in terms of the series winner and the total number of games. (A best-of-seven series ends whenever one team has four wins. This could happen in as few as four or as many as seven games.) What interests me is the text these users produce in posts other than the prediction post itself, so I cut the study to the 165 posters for whom I was able to collect at least 10 other Twitter posts (excluding retweets, which contain the posts of other users).

A majority of users (59%) preferred Miami, and the single most common prediction was Miami to win in six games. This is not far from the predicted outcome implied by the Las Vegas odds, which favored Miami in seven games, with Miami in six as a close second. However, the overall profile of the Twitter predictors in some ways deviated quite a bit from the Las Vegas view.

The Twitter predictors strangely overestimate the likelihood of Dallas (the underdog) winning the series in six games. This is strange in that the odds imply that the most likely duration of the series is seven games (34% probability). However, 59% of Twitter users predict a six-game series (most favoring Miami, some Dallas). Why? Maybe the answer lies in history: Six games is historically the most likely duration of an NBA Finals (41.5% of the Finals since 1980). Maybe people begin by accepting the likely duration of the series, and then impose upon that their selection of which team will be the beneficiary. If so, there is an illogical bias: If Dallas is to perform better than the odds predict, then the more likely ways for that to happen are that Dallas loses in seven games or wins in seven games. It seems that users who favor Dallas, however, "go big" in their favoritism, concluding that if Dallas is to do well, they will do really well, winning by the comparatively large margin of four games to two, which the odds say is fairly unlikely (10.2% probability).

Looking at the gross breakdown of predictions, it is also interesting to consider those who predicted the series to end in just four games. At a glance, it looks like the predictors are conservative: only 4% predicted a four-game series, while the odds give a 13% probability of that outcome and history tells us that 17% of Finals end in four games. However, seen another way, the predictors seem irrationally exuberant in predicting a four-game outcome. Given an objective determination of the probabilities, it is irrational for anyone to choose the least-likely outcome. Let's say that we had a six-sided die with one side marked A, two sides marked B, and three sides marked C. If we ask a million rational gamblers to bet on the outcome of a single roll, it's not that 1/6 should say "A"; rather, absolutely nobody should say "A" if they want to maximize their probability of being correct. That 4% of predictors predicted a four-game series sweep (four choosing Miami, two Dallas) indicates that those predictors (and, to a lesser extent, those choosing a five-game series or Dallas winning in six) are either responding irrationally or responding to something other than the public odds. Maybe for one of these reasons:

1) They incorrectly believe they know something that the rest of the world doesn't.

2) They actually do know something that the rest of the world doesn't.

3) The payoff for these predictors may favor winning and losing unequally. They may get great prestige or psychological reward from making a rare, perfect prediction, whereas they can simply ignore or walk away from their incorrect prediction should it not come true.

Interpretation (2), I should add, is not very persuasive. If a class of individuals had the ability to predict more accurately than the rest of the world, then some of those individuals would have the incentive to bet large sums of money on the event, which would shift the odds to the correct value. The odds should reflect the "smart money" quite closely unless someone has access to information that the world does not. (E.g., if the series is "fixed" by a point-shaving scheme.) It is unlikely, to say the least, that people posting on Twitter have rigged the series or are in on such a scheme and have decided to act on that information by posting a prediction to Twitter.
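The die argument above can be checked with a few lines of arithmetic. What follows is only a sketch of that toy example, not part of the study itself: it compares always guessing the most likely face with "probability matching" (guessing each face as often as it comes up). The function names are made up for illustration.

```python
from fractions import Fraction

# The die from the example: one side A, two sides B, three sides C.
faces = ["A", "B", "B", "C", "C", "C"]
probs = {face: Fraction(faces.count(face), len(faces)) for face in set(faces)}


def best_guess(probabilities):
    """Return the outcome a gambler should name to maximize the chance of being right."""
    return max(probabilities, key=probabilities.get)


def matching_accuracy(probabilities):
    """Chance of being right when guessing each outcome in proportion to its probability."""
    return sum(p * p for p in probabilities.values())


print(best_guess(probs))          # C
print(probs[best_guess(probs)])   # 1/2
print(matching_accuracy(probs))   # 7/18
```

Always saying "C" is right half the time, while a crowd whose guesses mirror the die's probabilities is right only 7/18 of the time, so nobody in that crowd gains by saying "A".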

It is premature to call the study complete, but I will quickly note some characteristics of the data. First, I noted how much punctuation (specifically periods, commas, and apostrophes/single quotes) each user produces per tweet. The average over all predictors is 2.34 punctuation marks (of those kinds) per tweet, with 60% of predictors using more than 2.0 punctuation marks per tweet. An interesting, if not significant, observation: Of those users who predicted a series duration of five or more games, only 38% used fewer than 2.0 punctuation marks per tweet. Of those (only six) users who predicted a four-game series, 100% used fewer than 2.0 punctuation marks per tweet! Are the people who ignored their English teachers walking through life ignoring all sorts of wisdom, making basketball predictions as badly as they punctuate? There's no statistical significance in this result, but note that the four-game predictors are already wrong: Because each team has already won a game, the series will last at least five games.
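For readers who want to try the punctuation measure on their own data, here is a minimal sketch of one way to compute it. It is an assumption about the counting, not the exact script used for the pilot: it counts periods, commas, and apostrophes (including the curly apostrophe, which is my addition) and divides by the number of tweets. The sample tweets are invented.

```python
# Punctuation marks counted in this sketch. The curly apostrophe is an
# assumption added here; the post only names periods, commas, and
# apostrophes/single quotes.
PUNCTUATION = {".", ",", "'", "\u2019"}


def punctuation_per_tweet(tweets):
    """Average number of counted punctuation marks per tweet for one user."""
    if not tweets:
        return 0.0
    total = sum(1 for tweet in tweets for ch in tweet if ch in PUNCTUATION)
    return total / len(tweets)


# Invented example tweets, for illustration only.
sample = ["Heat in six, easy.", "lets go mavs"]
print(punctuation_per_tweet(sample))  # 1.0
```

A user would then fall on one side or the other of the 2.0 marks-per-tweet threshold mentioned above.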

Performing a similar analysis of how people capitalize has not shown any such effect, however. If people who punctuate poorly are ignoring reality, the same is not obviously happening with Twitter capitalization.

It is also going to be interesting to note the vocabulary biases in the various groups of predictors. Those who used the word "must" in their non-prediction posts had a 32% probability of predicting an extreme, and unlikely, outcome (four games or five). Only 12% of users who used the word "might" in non-prediction posts predicted such an outcome. Does this mean that there are people who see the world in black-and-white, who ignore the more likely median cases, the shades of gray? Perhaps! In future studies, it will be useful to collect more data to look at a wider range of terms that indicate a black-and-white worldview or a shades-of-gray outlook.
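The "must" versus "might" comparison amounts to a conditional rate: among users whose other posts contain a given word, what share predicted a four- or five-game series? The sketch below shows one way to compute that. The record layout (`tweets`, `games`) and the function names are hypothetical, and the three users are invented.

```python
import re


def uses_word(tweets, word):
    """True if any tweet contains `word` as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return any(pattern.search(tweet) for tweet in tweets)


def extreme_rate(users, word, extreme_lengths=(4, 5)):
    """Share of users who use `word` and predicted a series of extreme length.

    Returns None when no user in the data uses the word.
    """
    group = [u for u in users if uses_word(u["tweets"], word)]
    if not group:
        return None
    extreme = sum(1 for u in group if u["games"] in extreme_lengths)
    return extreme / len(group)


# Invented users, for illustration only. "games" is the predicted series length.
users = [
    {"tweets": ["We MUST win tonight"], "games": 4},
    {"tweets": ["You must see this"], "games": 6},
    {"tweets": ["It might rain", "pass the mustard"], "games": 6},
]
print(extreme_rate(users, "must"))   # 0.5
print(extreme_rate(users, "might"))  # 0.0
```

Matching on whole words keeps "mustard" from counting as "must"; with a larger vocabulary of black-and-white and shades-of-gray terms, the same function could be run once per term.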

That's the mid-course report. I will provide more analysis as the results come in. And the tangible (if not statistically significant) determination of how accurate the predictions were will come courtesy of the Miami Heat and the Dallas Mavericks. Tonight: Game Three!

Sunday, June 5, 2011
