Tuesday, May 31, 2011

Empirical Measure of Reliability: Part One

Students writing essays. Contributors to Wikipedia. Economists, Wall Street traders, technologists, scientists, and political pundits predicting the future. A general enterprise for thinking persons is to understand the world, often in the face of uncertainty. In various ways, those thinkers who venture an opinion are "graded" on the quality of their work, and those who do well are -- perhaps -- accorded more credibility in the future.

Separately, text analytics is a technology coming of age, in which low-level properties of text are used to infer meaning in interesting ways. I have built business-to-business solutions in text analytics -- netnography at Netbase -- and I am currently building sentiment analysis classifiers in many languages for Meltwater News.

When one reads analysis (as when I graded research papers as a teacher at Western Reserve Academy), one inevitably feels that some writers show a high degree of insight, while others analyze poorly and betray it with writing of lower quality. If an analyst cannot construct a proper sentence, then doesn't that say something damning about the quality of the analysis? Is a tongue-tied politician inevitably a hack at determining policy? Can someone who confuses "there" and "their" have something worth saying?

Here, I announce a pilot study to examine whether the low-level properties of text have a bearing on the quality of the analysis therein -- more specifically, on the accuracy of predictions made by the analyst. For an objective proposition close at hand, I will use the 2011 NBA Championship Series, which begins a few hours from now and features the Dallas Mavericks against the Miami Heat, as a test of the many people out there staking their reputations on predictions of the outcome. In the era of social media, such predictions are not scarce. My experiment is to collect the identities of many Twitter users who are predicting the outcome of the series. Then we can analyze the low-level qualities of the text these users produce and obtain an objective measure (albeit with very low N -- just one yes/no "grade") of the reliability of each user's predictions.
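For concreteness, here is a sketch of the crude keyword matching that turning a tweet into a yes/no prediction involves. The keyword lists and the tie-breaking rule below are my own illustration, not a settled method; ambiguous tweets get set aside for manual review.

```python
# Sketch: guess which team a tweet predicts will win the series.
# The keyword lists and tie-breaking rule are illustrative
# assumptions, not a settled method.

MAVS_WORDS = {"mavs", "mavericks", "dallas", "dirk"}
HEAT_WORDS = {"heat", "miami", "lebron", "wade"}
WIN_WORDS = {"win", "champs", "championship", "in 6", "in 7", "sweep"}

def predicted_winner(tweet_text):
    """Return 'DAL', 'MIA', or None when the prediction is ambiguous."""
    text = tweet_text.lower()
    if not any(w in text for w in WIN_WORDS):
        return None  # no prediction language at all
    mentions_mavs = any(w in text for w in MAVS_WORDS)
    mentions_heat = any(w in text for w in HEAT_WORDS)
    if mentions_mavs and not mentions_heat:
        return "DAL"
    if mentions_heat and not mentions_mavs:
        return "MIA"
    return None  # both or neither team named: set aside for manual review

if __name__ == "__main__":
    print(predicted_winner("Mavs in 6. Dirk is unstoppable."))         # DAL
    print(predicted_winner("The Heat will win it all this year."))     # MIA
    print(predicted_winner("Heat vs Mavs, should be a great series"))  # None
```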

The provocative promise of this type of study is that we may find empirical evidence that people who express themselves in certain ways are better predictors than people who express themselves in other ways. Imagine the range of possible discoveries. Imagine the gleeful English teacher who can point to empirical fact and conclude once and for all, "People who do not capitalize and punctuate correctly think poorly, and what they say is factually incorrect." Imagine the shock in academia if the reverse proves true!

To be clear, this pilot is far from a proper experiment. The "low N" problem (only one result) means that there cannot be much meaning in the outcome. Maybe the smart prediction will turn out to be wrong (say, if a player on the team that should have won suffers an unexpected injury). I am not carefully picking my variables in advance -- I will choose and analyze them as the series takes place. There is plenty of subjectivity even in determining what prediction is being made (the language of these tweets is very vague). And the independent variables will necessarily be low-level (like use of punctuation), precluding more interesting high-level variables like reasonableness-of-argument. I think a proper scientific analysis could find enough flaws in my methodology to disgrace me thoroughly.
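To make "low-level" concrete, here is a sketch of the kind of surface features I have in mind. These particular features are examples of the species, not a committed variable list:

```python
import re

def low_level_features(text):
    """A few surface-level writing features for a single post.
    These particular features are illustrative, not a final list."""
    tokens = text.split()
    n_tokens = max(len(tokens), 1)
    n_chars = max(len(text), 1)
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    capitalized = sum(1 for s in sentences if s[0].isupper())
    return {
        # punctuation density per character
        "punct_per_char": sum(c in ".,;:!?'\"" for c in text) / n_chars,
        # fraction of sentences that begin with a capital letter
        "caps_sentence_rate": capitalized / max(len(sentences), 1),
        # shouting: multi-letter all-caps tokens
        "all_caps_token_rate": sum(t.isupper() and len(t) > 1 for t in tokens) / n_tokens,
        # crude proxy for vocabulary weight
        "avg_token_len": sum(len(t) for t in tokens) / n_tokens,
    }
```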

But the topic is, for me, compelling, and it can serve as a pilot for better studies that address many of those shortcomings in future work. Call this observation rather than science.

I have been collecting prediction posts from Twitter over the last few days, and I continue to do so at this moment. It looks like I will have roughly 200 users in my sample, with about 50 posts per user serving as the writing sample to be analyzed. I will separate the users into groups after the series begins this afternoon, and then we can see how the best-of-seven series speaks to the quality of the predictions.
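The bookkeeping after that point is simple. Assuming each user record carries a predicted team and a bag of posts (a structure I am inventing here for illustration), and reusing low_level_features from the earlier sketch, the group comparison reduces to something like:

```python
from statistics import mean

def compare_groups(users, series_winner):
    """users: list of records like {"prediction": "DAL", "posts": [...]};
    series_winner: "DAL" or "MIA" once the series is decided.
    Returns mean feature values for correct vs. incorrect predictors."""
    groups = {"correct": [], "incorrect": []}
    for user in users:
        if not user["posts"]:
            continue  # nothing to analyze for this user
        label = "correct" if user["prediction"] == series_winner else "incorrect"
        feats = [low_level_features(post) for post in user["posts"]]
        # average each feature over the user's ~50 posts
        user_means = {k: mean(f[k] for f in feats) for k in feats[0]}
        groups[label].append(user_means)
    return {
        label: {k: mean(u[k] for u in members) for k in members[0]}
        for label, members in groups.items()
        if members
    }
```

With only one yes/no outcome, these group means are observations rather than statistics -- the low-N caveat above, in code form.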

I hope that this is the beginning of future studies, extending the scope to other domains (e.g., election outcomes, public policy, the success of Internet startups) and to more subtle qualities of the analysis (e.g., use of causality to explain one's points; proper use of logical reasoning). And maybe one day we will be able to take much of what is written and say, "That kind of thought is invalid." And maybe the world will raise the quality of its thought by a notch or two. Isn't it pretty to think so?

My data collection continues. Predictors are typing away. And over the coming days, the Miami Heat and the Dallas Mavericks will help us determine what is right and what is not right.