<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-3989794568827688223</id><updated>2011-11-27T16:00:31.712-08:00</updated><category term='nlp'/><category term='machine translation'/><category term='classifiers'/><category term='linguistics'/><category term='information retrieval'/><category term='cl'/><category term='twitter'/><category term='reliability'/><category term='sports'/><category term='history'/><category term='language'/><category term='parsing'/><category term='social media'/><category term='ir'/><category term='comparative linguistics'/><category term='prediction'/><category term='google'/><title type='text'>NLP Confidential</title><subtitle type='html'>An insider's guide to the technologies known collectively as Natural Language Processing -- also known as Computational Linguistics -- and its many subfields. 
This is the story from past to present to future, from Socrates to Chomsky to Brin and Page, from academia to industry, from MIT to Silicon Valley.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://nlpconfidential.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://nlpconfidential.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>John Rehling</name><uri>http://www.blogger.com/profile/16282519946871219302</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://3.bp.blogspot.com/_A0kVC-Xc3xM/Sqhrpr8KYqI/AAAAAAAAABU/q5yyeMyYhDE/S220/jrehling.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>15</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-3989794568827688223.post-5107989062232478319</id><published>2011-09-25T00:36:00.000-07:00</published><updated>2011-09-25T23:18:34.935-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='social media'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><category scheme='http://www.blogger.com/atom/ns#' term='language'/><title type='text'>Laughing Out Loud</title><content type='html'>&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-v28bzbmQ6zU/Tn7ZVfmPZbI/AAAAAAAAAC8/xa_tU7Vcrbs/s1600/haha.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" height="196" src="http://1.bp.blogspot.com/-v28bzbmQ6zU/Tn7ZVfmPZbI/AAAAAAAAAC8/xa_tU7Vcrbs/s200/haha.jpg" width="200" 
/&gt;&lt;/a&gt;&lt;/div&gt;This may make you laugh.&lt;br /&gt;&lt;br /&gt;And laughs are something people like to share. When people communicate via social media, they type "laughs." In a sample of a million words of Twitter messages in ten different languages, I found that about 0.5% of all "words" are laughs – "haha", "LOL", or other ways of typing out a chuckle.&lt;br /&gt;&lt;br /&gt;Do people everywhere laugh equally?&lt;br /&gt;&lt;br /&gt;Not on your life.&lt;br /&gt;&lt;br /&gt;In a study of ten Western languages (English, German, Dutch, Norwegian, Swedish, Danish, French, Spanish, Italian, and Portuguese), I found enormous differences in the frequency of Twitter-laughs.&lt;br /&gt;&lt;br /&gt;The Germans laugh least, with Twitter laughs making up under 0.1% of all words.&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/-q7SmGhqXYDA/TmojYdkDupI/AAAAAAAAAC0/yKEljbE0ODQ/s1600/laughMapFaces2.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://4.bp.blogspot.com/-q7SmGhqXYDA/TmojYdkDupI/AAAAAAAAAC0/yKEljbE0ODQ/s320/laughMapFaces2.jpg" width="236" /&gt;&lt;/a&gt;&lt;br /&gt;Other languages of Northern Europe were somewhat more prone to laughs than German. In increasing order of laugh frequency, Norwegian, French, Swedish, English, and Danish all came in below 0.4%.&lt;br /&gt;&lt;br /&gt;And then there are the happy Latins. Laughing just a little more than the Danes, Portuguese speakers come in at 0.5% laughs, and that's nothing compared to the Italians, who Twitter-laugh in 0.9% of words. 
But the runaway laugh champions are Spanish speakers who type Twitter laughs for 1.4% of words.&lt;br /&gt;&lt;br /&gt;The North-South pattern is noteworthy, but is broken by the Dutch, who out-laugh their neighbors like they're misplaced Latins, finishing way up at 0.8%.&lt;br /&gt;&lt;br /&gt;The Dutch notwithstanding, the North-South trend is sharp and undeniable, as this color-coded map makes clear.&lt;br /&gt;&lt;br /&gt;What's even funnier, in the languages where people laugh more often, people also type longer laughs.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-OonV4RDC_IU/ToAZJqIFseI/AAAAAAAAADA/GMii6J2GLQ8/s1600/laughChart.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-OonV4RDC_IU/ToAZJqIFseI/AAAAAAAAADA/GMii6J2GLQ8/s1600/laughChart.png" /&gt;&lt;/a&gt;&lt;/div&gt;While three "ha"s are the preferred laugh in Spanish and Italian, speakers of those languages are not afraid to laugh longer. The five-ha laugh ("jajajajaja") is more common in Spanish than the two-ha laugh is in German. This graph shows how laugh lengths are distributed in the five most-spoken languages. 
While Spanish runs away with the championship here at every length, notice that English is actually the runner-up for the two-ha laugh ("haha") with Italian strongly preferring a three-ha approach ("ahahah").&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-9pJoPlQNlv4/TmolJBdfSmI/AAAAAAAAAC4/D9Ei5wh3DUg/s1600/laughGraph.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" height="199" src="http://1.bp.blogspot.com/-9pJoPlQNlv4/TmolJBdfSmI/AAAAAAAAAC4/D9Ei5wh3DUg/s320/laughGraph.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;When you take into account the length as well as frequency of laughs, Spanish Twitter has 24 times more laughing than German, as measured in character count. This is not a subtle difference!&lt;br /&gt;&lt;br /&gt;So, why is all of this happening? It's clear that more Twitter laughs come from the warmer and sunnier countries. This is true not only in Europe but also in the Americas, where most speakers of English, Spanish, and Portuguese live. Statistically speaking, the laugh frequency is highly correlated with the latitude of the corresponding European capital (farther south: r=0.66), how sunny that city is (more sun: r=0.74), and inversely with the suicide rate (r=-0.74; this is the same if you choose the U.S., Mexico, and Brazil instead of the U.K., Spain, and Portugal).&lt;br /&gt;&lt;br /&gt;So is it as simple as this: Warm, sunny weather makes people laugh a lot and leaves them immune to depression?&lt;br /&gt;&lt;br /&gt;That may be part of it. But another idea to consider is that in Germany and Scandinavia Twitter is used comparatively more often for business and relatively less often for chatting. When one subtracts the social chat, then naturally less laughter remains.&lt;br /&gt;&lt;br /&gt;Overall, it's not clear how much Twitter reflects life as a whole. 
Until we plant microphones everywhere and monitor all human communication, studies like this will just be suggestive of larger truths. But insofar as it goes, this study of Twitter laughs serves to support a lot of existing cultural stereotypes.</content><link rel='replies' type='application/atom+xml' href='http://nlpconfidential.blogspot.com/feeds/5107989062232478319/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3989794568827688223&amp;postID=5107989062232478319' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/5107989062232478319'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/5107989062232478319'/><link rel='alternate' type='text/html' href='http://nlpconfidential.blogspot.com/2011/09/laughing-out-loud.html' title='Laughing Out Loud'/><author><name>John Rehling</name><uri>http://www.blogger.com/profile/16282519946871219302</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://3.bp.blogspot.com/_A0kVC-Xc3xM/Sqhrpr8KYqI/AAAAAAAAABU/q5yyeMyYhDE/S220/jrehling.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-v28bzbmQ6zU/Tn7ZVfmPZbI/AAAAAAAAAC8/xa_tU7Vcrbs/s72-c/haha.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3989794568827688223.post-5948051459777658923</id><published>2011-08-29T10:10:00.000-07:00</published><updated>2011-08-29T11:30:36.008-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='social media'/><category 
scheme='http://www.blogger.com/atom/ns#' term='twitter'/><category scheme='http://www.blogger.com/atom/ns#' term='language'/><title type='text'>Social Media: Linguistic Anarchy?</title><content type='html'>Your schoolteacher would be horrified. As technology opens up new channels for people to communicate via the written word, the use of language in those channels becomes increasingly ill-formed and deviant. Social critics may look at this as a relaxation in standards, a harbinger of the decay of reason and civilization.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, it's really not that bad. People feel differently about standards in language, and one might observe that if language use did not vary over time, we would still be speaking Latin, Anglo-Saxon, Proto-Indo-European, or some more primeval language. Whether you identify more with prescriptive linguistics (the use of instruction to make students keep in line with existing standards) or descriptive linguistics (the &lt;i&gt;laissez-faire&lt;/i&gt; study of language to understand it, without concern for changing how people use it), the truth is that people still adhere to standards, and in many ways, those standards aren't so different in the era of electronic media from what they were in the long-lost era of pens and inkwells.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;How does English, say, on Twitter, differ from English in the news? Certainly one sees slang, profanity, misspelling, and neologisms. But the single greatest set of differences comes from the different kind of interaction. The news is meant to sound like the voice of God, detached, objective, hovering over the topic and the reader alike in the Third Person. In contrast, many interactions in social media are person-to-person, openly subjective, inherently spoken by the First Person to the Second Person... 
often &lt;b&gt;on the topic of&lt;/b&gt; the First Person and/or the Second Person.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For ten major European languages, I have computed word frequencies for a corpus of news and of Twitter posts, and here I focus on the words which are the most prevalent on Twitter as compared to the news (frequency in Twitter minus frequency in the news). And for English, the word at the top of that list is nothing that would give schoolteachers and the clergy a stroke. It is "I." In fact, of the 39 words topping that list, only the acronym "LOL" and the abbreviation "u" are the stuff of which schoolteacher nightmares are made. The other 37 are almost exclusively words that are very common in The Queen's English and American Standard English, but happen to be more prevalent in first person narration than in the third person. They are common words that are natural descriptors of &lt;b&gt;situated language&lt;/b&gt;, where the speaker's and listener's identity, time, place, attitude, and -- more generally -- their context, are part of the discussion.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;And so, with an eye towards the top 25 words on the (Twitter-minus-news) frequency list, we see the following categories:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;First-and-Second Person forms&lt;/b&gt;: I, am, me, my, we...&lt;/div&gt;&lt;div&gt;&lt;b&gt;Deixis&lt;/b&gt; (words referring to the speaker's or listener's situation): today, just...&lt;/div&gt;&lt;div&gt;&lt;b&gt;Simple, plainspoken vocabulary&lt;/b&gt;, words that are common in the news, but still more common for writers who are not consulting the thesaurus to make their language more flowery: not, was, do, had, did, have, got...&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Far lower on the list come the shock words: "shit", "fuck", "alot", "alright", "ima", etc. 
And even the abbreviations are understandable, when writers are constrained to squeeze their ideas into the 140-character limit and deal with keypads that make extra effort a true burden. A drowning person is likely to yell for help in something other than complete sentences, and a person straining to fit an idea into Twitter's constraints has a legitimate motive for abbreviating more than they otherwise might. All told, we see that people have a strong tendency to hold to convention -- not the universal adherence to convention that William Safire would have liked, but it is still the most common case.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This is true in other languages as well, and to much the same degree. Here are, according to Twitter-minus-news frequency, the top 25 words for the five leading Western European languages. They reflect more or less the same tendencies.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;English&lt;/b&gt;: I, 's, not, am, me, my, was, he, do, we, had, lol, news, did, u, have, new, today, just, think, haha, got, 'd, game, she.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;French&lt;/b&gt;: je, j, c, pas, est, ai, tu, mais, moi, me, que, ça, a, t, mon, suis, on, ma, si, y, fait, il, te, quand, m.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;German&lt;/b&gt;: ich, du, ja, d, nicht, aber, mir, mal, hab, was, jetzt, so, ist, noch, mich, da, dann, bin, es, schon, das, war, wenn, auch, dir.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Spanish&lt;/b&gt;: no, me, q, te, es, ya, si, yo, lo, jajaja, mi, tu, a, d, mas, pero, XD, jaja, jajajaja, eso, México, son, hay, solo, x.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Italian&lt;/b&gt;: non, mi, ho, io, ma, che, d, XD, è, ti, se, u, a, me, sono, lo, o, no, ci, l, ora, çç, sei, mia, poi.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;Interestingly, 
the negative adverb in each language appears quite high. This seems to indicate that journalists exercise a discipline to express things in terms of positives while people generally use the negative a larger proportion of the time.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The big cross-language difference that is evident from the above is in the tendency for Twitter users to type out a "laugh", and this particularly stands out on the Spanish list. This merits a fun, and funny, follow-up post on how much people type out laughs in different languages. The results will probably not surprise you.&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpconfidential.blogspot.com/feeds/5948051459777658923/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3989794568827688223&amp;postID=5948051459777658923' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/5948051459777658923'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/5948051459777658923'/><link rel='alternate' type='text/html' href='http://nlpconfidential.blogspot.com/2011/08/social-media-linguistic-anarchy.html' title='Social Media: Linguistic Anarchy?'/><author><name>John Rehling</name><uri>http://www.blogger.com/profile/16282519946871219302</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' 
src='http://3.bp.blogspot.com/_A0kVC-Xc3xM/Sqhrpr8KYqI/AAAAAAAAABU/q5yyeMyYhDE/S220/jrehling.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3989794568827688223.post-1292739860549532383</id><published>2011-06-26T14:00:00.000-07:00</published><updated>2011-06-26T15:00:29.518-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='classifiers'/><category scheme='http://www.blogger.com/atom/ns#' term='reliability'/><category scheme='http://www.blogger.com/atom/ns#' term='sports'/><category scheme='http://www.blogger.com/atom/ns#' term='nlp'/><category scheme='http://www.blogger.com/atom/ns#' term='prediction'/><title type='text'>Empirical Measure of Reliability: Part Four</title><content type='html'>This fourth installment on Twitter predictions for the NBA Finals will wrap up, for now, my pilot study on measuring reliability as a function of qualities that are detectable in the writings of the individuals who make the predictions. I've made some observations in the three preceding posts, and here I will make a few more and sum up the overall results.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;First, the lay of the land: I collected predictions that individuals made on Twitter regarding the 2011 NBA Finals. The predictions were of which team would win and how many games the best-of-seven series would last. This gave eight possible outcomes, which ranged from one team winning in a four-game sweep to the other winning in a four-game sweep. Besides the individuals making predictions, there were other ways to quantify the soundness of each possible outcome, and the odds posted by Las Vegas sports books offer a possible "rationalist" position. 
At least, if there is a systematic error in the sports books, then there is a way for someone who knows more to get rich.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The sports books predicted a fairly close series, with Miami favored, but only slightly, with Miami-in-seven as the most likely outcome. As such, the least-favored outcomes were blowouts -- Miami-in-four or Dallas-in-four. Even before the Finals began, those seemed to be the brashest/wrongest predictions. And that proved to be true very soon, as each team won one of the first two games, making both of those predictions already wrong. Note that in some other year, where one team was dramatically superior to the other, predicting a four-game series might have been the smart bet -- but that was not true this year.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The heart of this study is to analyze the text that these 165 Twitter users have offered in other tweets (not the specific single prediction itself) and see if there is a fingerprint for reliability in the text that people generate.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here are the observations that I have previously noted:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1) The people who made the rashest and most-incorrect predictions, a four-game series, differed from the crowd in being less prone to punctuate their tweets. (Note: There were only six such individuals, so this finding may not be significant.) They did not stand out in other obvious ways, such as the use of capitalization.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2) The people who made more extreme predictions (four or five-game series; including the "5s" raised the sample size to 32) were more likely to use the modal verb "must" than the modal verb "might." 
This suggests that some people carry around a black-and-white worldview while others see things in shades of gray; the "must" people predicted a lopsided outcome even though the judgment of Las Vegas (and the actual outcome of the series) was less extreme.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;3) While the frequency of the eight possible predictions roughly matched the Las Vegas probabilities (Pearson correlation r=0.56), the crowd deviated from this primarily in over-predicting Dallas-in-six. Interestingly, this ended up being the actual outcome of the series! Is there a significant minority out there (about 25% of the individuals) who know better than Vegas?!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now, two observations that I have not reported previously:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;4) Do those Dallas-in-six predictors show some sign of brilliance in their vocabulary? The words those people were most apt to use more often than the other 75% of individuals were: {awesome, follow, through, morning, maybe}. The words they used less often than the others were: {watching, fun, free}. If there's anything plausible about a worldview here (as with the "must" vs. "might" observation) I don't see it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;5) We can rate the full set of predictions according to correctness (Dallas-in-six being exactly right, but every other prediction being a certain number of games away from this, ranging from one game off to five games off). And then we can correlate correctness with the gross properties of the individuals' tweets and we see the following positive correlations:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A) People who type longer tweets were more accurate than those who type shorter tweets. 
Pearson correlation, r=0.48.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;B) People who use more mixed capitalization (capitalizing the first letters of words, vs. leaving the whole word lowercase or all-uppercase) were more accurate: Pearson correlation, r=0.46.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;C) An overall measure of "fluency", using the correct English function words like "and" and "be" correlated only slightly with correctness: r=0.08.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I stress again that this study was not painstakingly scientific, but I would like to use it as a pathfinder towards more informative studies in the coming months. An ambitious-enough goal would be to assess which individuals write in such a way as to seem to be systematically deluded, showing the world that they dispense with facts and wisdom and draw their own conclusions anyway. A yet-more ambitious goal would be to distinguish those individuals of exceptional reliability from those of average reliability, and perhaps to use the crowd as a predictor of the future that is better than anyone has yet systematically recognized (imagine if this led to predictions that were smarter than Las Vegas tends to posit).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;While it will be fun and informative to continue this work with other sports events (a ready source of quantitative predictions that can be graded objectively), it would be even more rewarding to evaluate the soundness of predictions regarding politics, policy, technology, and science. It would of course be useful to analyze qualities that are deeper and more meaningful than punctuation. And I will confess to an ultimate goal of collecting empirical statistics on the soundness of various kinds of higher reasoning. 
Can we do the same as I've done here with arguments that intelligent astronomers made in the Twenties through Fifties, grading those predictions with the correct answers that we now in many cases have? Can we have a sort of truth-o-meter based on this sort of empirical work? Or, at least, can we show people that if they write badly they will seem less reliable? Watch this space in the months to come.&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpconfidential.blogspot.com/feeds/1292739860549532383/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3989794568827688223&amp;postID=1292739860549532383' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/1292739860549532383'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/1292739860549532383'/><link rel='alternate' type='text/html' href='http://nlpconfidential.blogspot.com/2011/06/empirical-measure-of-reliability-part_26.html' title='Empirical Measure of Reliability: Part Four'/><author><name>John Rehling</name><uri>http://www.blogger.com/profile/16282519946871219302</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://3.bp.blogspot.com/_A0kVC-Xc3xM/Sqhrpr8KYqI/AAAAAAAAABU/q5yyeMyYhDE/S220/jrehling.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3989794568827688223.post-8107402628230110528</id><published>2011-06-17T12:48:00.000-07:00</published><updated>2011-06-17T13:07:09.127-07:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='classifiers'/><category scheme='http://www.blogger.com/atom/ns#' term='reliability'/><category scheme='http://www.blogger.com/atom/ns#' term='sports'/><category scheme='http://www.blogger.com/atom/ns#' term='nlp'/><category scheme='http://www.blogger.com/atom/ns#' term='prediction'/><title type='text'>Empirical Measure of Reliability: Part Three</title><content type='html'>&lt;a href="http://4.bp.blogspot.com/--KauRPKs0MY/TfuxWJGykpI/AAAAAAAAACw/4YS6L-oIYwo/s1600/NBA_Predict_Odds.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 200px; height: 136px;" src="http://4.bp.blogspot.com/--KauRPKs0MY/TfuxWJGykpI/AAAAAAAAACw/4YS6L-oIYwo/s200/NBA_Predict_Odds.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5619279954034463378" /&gt;&lt;/a&gt;&lt;br /&gt;The NBA Finals ended last weekend with the Dallas Mavericks beating the Miami Heat in six games. This is an outcome that a significant minority of the Twitter users predicted -- specifically, it was the second-most common prediction (out of eight logical possibilities) with 24.8% of users choosing it.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What's interesting is to see how the predictions of the crowd selectively followed the probabilities implied by the odds posted by Las Vegas sports books. In the graph (click to enlarge), we see the possible outcomes along the bottom, with those most favorable to Miami on the right and those most favorable to Dallas on the left. The house (red line) gave Miami a modest edge in the series, and the probabilities hint at a Gaussian with a peak centered on "Miami winning in seven games." 
If we ignore the "Dallas in six" position, it looks like the crowd largely went with the gambling odds, choosing a peak that was nearby (Miami in six instead of seven) and then following a course somewhere between Response Matching and a Winner Takes All preference for the favored outcome.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, the crowd deviated from that in one big way by giving far more credence to the "Dallas in six" outcome (and slightly more to "Miami in six" and slightly less to "Miami in seven"). It would be rash to read too much into this single instance, but it looks like the crowd -- a significant minority of them -- got smart in a way that Las Vegas underestimated. Do those Dallas-in-six people have some special talent, or did they just get lucky? I'll take a look next time at how the Dallas-in-six predictors differed -- or didn't -- from the other predictors. Is there a smart gene somewhere in their writing?&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpconfidential.blogspot.com/feeds/8107402628230110528/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3989794568827688223&amp;postID=8107402628230110528' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/8107402628230110528'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/8107402628230110528'/><link rel='alternate' type='text/html' href='http://nlpconfidential.blogspot.com/2011/06/empirical-measure-of-reliability-part_17.html' title='Empirical Measure of Reliability: Part Three'/><author><name>John 
Rehling</name><uri>http://www.blogger.com/profile/16282519946871219302</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://3.bp.blogspot.com/_A0kVC-Xc3xM/Sqhrpr8KYqI/AAAAAAAAABU/q5yyeMyYhDE/S220/jrehling.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/--KauRPKs0MY/TfuxWJGykpI/AAAAAAAAACw/4YS6L-oIYwo/s72-c/NBA_Predict_Odds.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3989794568827688223.post-7657029037228131562</id><published>2011-06-05T13:10:00.000-07:00</published><updated>2011-06-05T14:08:25.641-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='classifiers'/><category scheme='http://www.blogger.com/atom/ns#' term='reliability'/><category scheme='http://www.blogger.com/atom/ns#' term='sports'/><category scheme='http://www.blogger.com/atom/ns#' term='nlp'/><category scheme='http://www.blogger.com/atom/ns#' term='prediction'/><title type='text'>Empirical Measure of Reliability: Part Two</title><content type='html'>&lt;a href="http://nlpconfidential.blogspot.com/2011/05/empirical-measure-of-reliability-part.html"&gt;My pilot study&lt;/a&gt; on the use of text analytics to determine reliability is underway. The idea was to log a lot of predictions regarding the NBA Finals before the event begins, then to analyze the text generated by the predictors and look for correlations between a person's language and their accuracy in predicting the future event.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This is not a proper experiment -- there's a lot of hackery on my part. My hope is to learn from this pilot study to gear up for a proper experiment in the coming months. As of this writing, two games in the best-of-seven series have been completed, with the Miami Heat and the Dallas Mavericks each having one win. 
At this point, I'll give a quick overview of things I've seen in the data, which I collected right up to the last hour before the series began.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;First, the predictions themselves: I collected 245 predictions from users on Twitter. I chose only predictions in which the user specified the outcome in terms of the series winner and the total number of games. (A best-of-seven series ends whenever one team has four wins. This could happen in as few as four or as many as seven games.) The object of interest to me is to look at the text these users produce in posts besides the prediction post itself, so I cut the study to those 165 posters for whom I was able to collect at least 10 other Twitter posts (excluding those which are retweets, containing the posts of other users).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The users tended (59%) to prefer Miami and the single most common prediction was Miami to win in six games. This is not far from the predicted outcome implied by the Las Vegas odds, which favored Miami in seven games, with Miami in six as a close second. However, the overall profile of the Twitter predictors in some ways deviated quite a bit from the Las Vegas view. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 200px; height: 136px;" src="http://1.bp.blogspot.com/-EiwKp1XFjF0/TevoGRcK87I/AAAAAAAAACo/JJh7XKSx4PM/s200/NBA_Predict_Odds.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5614836554905875378" /&gt;&lt;/div&gt;&lt;div&gt;The Twitter predictors have a strange overestimate of the likelihood of Dallas (the underdog) winning the series in six games. This is strange in that the odds predict that the most likely duration of the series is seven games (34% probability). However, 59% of Twitter users predict a six-game series (most favoring Miami, some Dallas). Why? 
Maybe the answer lies in history: Six games is historically the most likely duration of an NBA Finals (41.5% of the Finals since 1980). Maybe people begin by accepting the likely duration of the series, and impose upon that their selection of which team will be the beneficiary. If so, there is an illogical bias: If Dallas is to perform better than the odds predict, then it is more likely that Dallas will lose in seven games, or win in seven games. It seems that users who favor Dallas, however, "go big" in their favoritism, concluding that if Dallas is to do well, they will do really well, winning by the comparatively large margin of four games to two, which the odds say is fairly unlikely (10.2% probability).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Looking at the gross breakdown of predictions, it is also interesting to consider those who predicted the series to end in just four games. At a glance, it looks like the predictors are conservative, with only 4% having predicted a four-game series while the odds give a 13% probability of that outcome and history tells us that 17% of Finals end in four games. However, seen another way, the predictors seem irrationally exuberant in predicting a four-game outcome. Given an objective determination of the probabilities, it is irrational for &lt;b&gt;anyone&lt;/b&gt; to choose the least-likely outcome. Let's say that we had a six-sided die with one side marked A, two sides marked B, and three sides marked C. If we ask a million rational gamblers to bet on the outcome of a single roll, it's not that 1/6 should say "A"; rather, absolutely nobody should say "A" if they want to maximize their probability of being correct. That 4% of predictors predicted a four-game series sweep (four choosing Miami, two Dallas) indicates that those predictors (and to a lesser extent, those choosing a five-game series or Dallas winning in six) are responding irrationally. 
Maybe for one of these reasons:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1) They incorrectly believe they know something that the rest of the world doesn't.&lt;/div&gt;&lt;div&gt;2) They actually do know something that the rest of the world doesn't.&lt;/div&gt;&lt;div&gt;3) The payoff for these predictors may favor winning and losing unequally. They may get great prestige or psychological reward from making a rare, perfect prediction, whereas they can simply ignore or walk away from their incorrect prediction should it not come true.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Interpretation (2), I should add, is not very persuasive. If a class of individuals had the ability to predict more accurately than the rest of the world, then some of those individuals would have the incentive to bet large sums of money on the event, which would shift the odds to the correct value. The odds should reflect the "smart money" quite closely unless someone has access to information that the world does not. (E.g., if the series is "fixed" by a point-shaving scheme.) It is unlikely, to say the least, that people posting on Twitter have rigged the series or are in on such a scheme and have decided to act on that information by posting a prediction to Twitter.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It is premature to call the study complete, but I will quickly note some characteristics of the data. First, I noted how much punctuation (particularly, periods, commas, or apostrophes/single quotes) per tweet each user produces. The average over all predictors is 2.34 punctuation marks (of those kinds) per tweet, with 60% of predictors using more than 2.0 punctuation marks per tweet. An interesting, if not significant observation: Of those users who predicted a series duration of five or more games, only 38% used fewer than 2.0 punctuation marks per tweet. 
Of those (only six) users who predicted a four-game series, 100% used fewer than 2.0 punctuation marks per tweet! Are the people who ignored their English teachers walking through life ignoring all sorts of wisdom, making basketball predictions as badly as they punctuate? There's no statistical significance in this result, but note that the four-game predictors are already wrong: Because each team has already won a game, the series will last at least five games.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Performing a similar analysis of how people capitalize has not shown any such effect, however. If people who punctuate poorly are ignoring reality, the same is not obviously happening with Twitter capitalization.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It is also going to be interesting to note the vocabulary biases in the various groups of predictors. Those who used the word "must" in their non-prediction posts had a 32% probability of predicting an extreme, and unlikely, outcome (four games or five). Only 12% of users who used the word "might" in non-prediction posts predicted such an outcome. Does this mean that there are people who see the world in black-and-white, who ignore the more likely median cases, the shades of gray? Perhaps! In future studies, it will be useful to collect more data to look at a wider range of terms that indicate a black-and-white worldview or a shades-of-gray outlook.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;That's the mid-course report. I will provide more analysis as the results come in. And the tangible (if not statistically significant) determination of how accurate the predictions were will come courtesy of the Miami Heat and the Dallas Mavericks. 
Tonight: Game Three!&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3989794568827688223-7657029037228131562?l=nlpconfidential.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpconfidential.blogspot.com/feeds/7657029037228131562/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3989794568827688223&amp;postID=7657029037228131562' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/7657029037228131562'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/7657029037228131562'/><link rel='alternate' type='text/html' href='http://nlpconfidential.blogspot.com/2011/06/empirical-measure-of-reliability-part.html' title='Empirical Measure of Reliability: Part Two'/><author><name>John Rehling</name><uri>http://www.blogger.com/profile/16282519946871219302</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://3.bp.blogspot.com/_A0kVC-Xc3xM/Sqhrpr8KYqI/AAAAAAAAABU/q5yyeMyYhDE/S220/jrehling.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-EiwKp1XFjF0/TevoGRcK87I/AAAAAAAAACo/JJh7XKSx4PM/s72-c/NBA_Predict_Odds.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3989794568827688223.post-6750373146870703446</id><published>2011-05-31T12:48:00.000-07:00</published><updated>2011-05-31T15:39:19.277-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='classifiers'/><category scheme='http://www.blogger.com/atom/ns#' term='reliability'/><category scheme='http://www.blogger.com/atom/ns#' term='sports'/><category 
scheme='http://www.blogger.com/atom/ns#' term='nlp'/><category scheme='http://www.blogger.com/atom/ns#' term='prediction'/><title type='text'>Empirical Measure of Reliability: Part One</title><content type='html'>Students writing essays. Contributors to Wikipedia. Economists, Wall Street traders, technologists, scientists, and political pundits predicting the future. A general enterprise for thinking persons is to understand the world, sometimes vying against the adversity of uncertainty. In various ways, those thinkers who venture an opinion are "graded" for the quality of their work, and those who do well are -- perhaps -- accorded more credibility in the future.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Separately, text analytics is a technology that is coming of age. Low-level properties of text are used to assess the meaning in interesting ways. I have worked to build business-to-business solutions in text analytics -- netnography at Netbase -- and I am currently building &lt;a href="http://thenextweb.com/media/2011/03/09/can-social-media-really-measure-market-demand/"&gt;sentiment analysis classifiers&lt;/a&gt; in many languages for Meltwater News.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;When one reads analysis (as when I graded research papers as a teacher at Western Reserve Academy), one inevitably feels that some writers show a high degree of insight while others analyze poorly and show this with writing of lower quality. If an analyst cannot construct a proper sentence, then doesn't that say something damning about the quality of the analysis? Is a tongue-tied politician inevitably a hack in determining policy? Can someone who confuses "there" and "their" have something worth saying?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here, I announce a pilot study to examine whether the low-level properties of text have a bearing on the quality of the analysis therein -- more specifically, the accuracy of predictions made by the analyst. 
Choosing an objective proposition that is close at hand, I will use the 2011 NBA Championship Series, which begins a few hours from now, featuring the Dallas Mavericks vs. the Miami Heat, as a test of the many people out there who are staking their reputation upon predictions of the outcome. In the era of social media, such predictions are not scarce. My experiment is to collect the identities of many Twitter users who are predicting the outcome of the series. Then we can analyze the low-level qualities of the text that these users produce and have an objective measure (albeit with very low N -- just one yes/no "grade") of the reliability of each user's predictions.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The provocative promise of this type of study is that we can find empirical evidence that people who express themselves in certain types of ways are better predictors than people who express themselves in other ways. Imagine the range of possible discoveries. Imagine the gleeful English teacher who can point to empirical fact to conclude once and for all, "People who do not capitalize and punctuate correctly think poorly and what they say is factually incorrect." Imagine the shock in academia if the reverse proves to be true!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To be clear, this pilot is far from a proper experiment. The "low N" problem (only one result) means that there cannot be much meaning in the result. Maybe the smart prediction will turn out to be wrong (say, if a player on the team that should have won suffers an unexpected injury). I am not carefully picking my variables in advance -- I will analyze them as the series takes place. And there is plenty of subjectivity even in the determination of what prediction is being made (the language of these tweets is very vague). 
Also, the independent variables will necessarily be low-level (like use of punctuation) and preclude the more interesting high-level variables like reasonableness-of-argument. I think a proper scientific analysis could find enough flaws in my methodology to disgrace me thoroughly.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;But the topic is, for me, compelling, and this effort can serve as a pilot for better studies that address many of those shortcomings. Call this observation instead of science.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I have been collecting prediction posts from Twitter over the last few days, and I continue to do so at the present moment. It seems that I will have roughly 200 users in my sample, with about 50 posts per user as the text that can be analyzed as a sample of their writing. I will separate the users into groups after the series begins this afternoon, and then we can see how the best-of-seven series speaks to the quality of the predictions.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I hope that this is the beginning of future studies, extending the scope to other domains (e.g., election outcomes, public policy, the success of Internet startups) and to more subtle qualities of the analysis (e.g., use of causality to explain one's points; proper use of logical reasoning). And maybe one day, we will be able to take much of what might be written and say, "That kind of thought is invalid." And maybe the world will raise the quality of its thought by a notch or two. Isn't it pretty to think so?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;My data collection continues. Predictors are typing away. 
And over the coming days, the Miami Heat and the Dallas Mavericks will help us determine what is right and what is not right.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3989794568827688223-6750373146870703446?l=nlpconfidential.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpconfidential.blogspot.com/feeds/6750373146870703446/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3989794568827688223&amp;postID=6750373146870703446' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/6750373146870703446'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/6750373146870703446'/><link rel='alternate' type='text/html' href='http://nlpconfidential.blogspot.com/2011/05/empirical-measure-of-reliability-part.html' title='Empirical Measure of Reliability: Part One'/><author><name>John Rehling</name><uri>http://www.blogger.com/profile/16282519946871219302</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://3.bp.blogspot.com/_A0kVC-Xc3xM/Sqhrpr8KYqI/AAAAAAAAABU/q5yyeMyYhDE/S220/jrehling.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3989794568827688223.post-8942148739136524936</id><published>2010-02-28T20:23:00.000-08:00</published><updated>2010-02-28T20:47:25.473-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='machine translation'/><category scheme='http://www.blogger.com/atom/ns#' term='google'/><title type='text'>Speech To Speech</title><content type='html'>&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;When Google talks, people listen. 
Recently, Google has been saying that sometime in the near future, when you talk, people can listen in other languages, thanks to your phone and their software. Franz Och, the head of Google's translation services, has said, "We think speech-to-speech translation should be possible and work reasonably well in a few years' time." I'm not the only one who's skeptical.&lt;p&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;In 2007, I evaluated a state-of-the-art speech-to-speech machine translation (StS MT) system that was built for military applications (think of the Somali pirate crisis from 2009 or any scene from The Hurt Locker). I don't think it delivered. The issue is quality. StS MT can "work" if:&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;a) Both parties are extremely cooperative and are willing to put some work into the exchange. (Eg, training the system on their voices; repeating things, maybe more than once, if the result is unclear the first time; tolerating the inherent delays.) This could mean a life-or-death situation where the alternative -- no translation -- is worse than a bad translation. (Although the stakes could be higher, even deadly, if the translation is faulty.)&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;b) The comprehension of the content is held to a low standard. Eg, if two businesspeople or recreational chatters want to feel like they're getting to know each other, not trying to hammer out the fine details of a legal agreement.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;c) There is no alternative. 
Whenever the parties can instead type their content into an online text-based MT system, doing so immediately removes one source of error.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;As long as the sources of error are as great as they are now, I have trouble thinking of many contexts where people would be willing to tolerate the flaws. Maybe chatters who are only looking for entertainment and have no bottom line regarding accuracy. In war / emergency contexts, perhaps. In business, I think the problems just about doom the effort, unless a cultural adjustment makes people value "meeting" someone in this way even when the comprehension is shaky.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;I was, however, impressed by how well the speech recognition worked with my voice when I trained the system to recognize me. Perhaps if Google trains the system on enough people, just about any new speaker who comes along will sound close enough to one of them. The speech-to-text part of the problem just might be solvable for a large segment of people.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;But that leaves the machine translation link in the chain, and there is little reason to suspect that a quantum leap is about to happen there. I took Mr. Och's own quotation above and translated it to Spanish and then back to English using Google's text-to-text machine translation. It did a pretty good job, but it changed his use of the word "should" to "must". 
That's the sort of thing you don't want to have happen when a Somali pirate has his gun trained on you.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/p&gt;  &lt;!--EndFragment--&gt;   &lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3989794568827688223-8942148739136524936?l=nlpconfidential.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpconfidential.blogspot.com/feeds/8942148739136524936/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3989794568827688223&amp;postID=8942148739136524936' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/8942148739136524936'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/8942148739136524936'/><link rel='alternate' type='text/html' href='http://nlpconfidential.blogspot.com/2010/02/speech-to-speech.html' title='Speech To Speech'/><author><name>John Rehling</name><uri>http://www.blogger.com/profile/16282519946871219302</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://3.bp.blogspot.com/_A0kVC-Xc3xM/Sqhrpr8KYqI/AAAAAAAAABU/q5yyeMyYhDE/S220/jrehling.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3989794568827688223.post-7106176052378940827</id><published>2010-02-25T12:10:00.000-08:00</published><updated>2010-02-25T14:28:31.834-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='cl'/><category scheme='http://www.blogger.com/atom/ns#' term='comparative linguistics'/><title type='text'>Inflection Predilection</title><content type='html'>&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Here's a 
lightweight, factoid-level look at a lot of different languages in a fun way.&lt;/span&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;We often hear that some languages are "heavily inflected". You can glance at tables of verb conjugations and noun declensions to see the brutal details: Russian surely does have a lot of declensions to memorize, just as Spanish does with verb conjugations. But that's all theory and not practice -- lots of those forms are hardly ever used. I lived in Italy and started making a mental note of how often I heard (or had to use) the second-person plural form of the future tense: Not often!&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;One way of quantifying how "heavily" inflected a language is can be seen from some statistical work I was doing for a more practical purpose. For each of sixteen languages, I processed corpora of news articles with total token counts for each language ranging from about 100,000 to 3 million. Based on this, I generated two ways of counting tokens: First, the absolute count of nonunique tokens (in other words, the total size of the corpus); Second, the sum of the unique token count of each article (in other words, counting each word just once for each article it occurs in).&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;If each word occurred at most once per article, these two counts would be identical. The ratio between the two counts thus expresses how much a language tends to repeat words. 
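&lt;br /&gt;&lt;br /&gt;To make the two counts concrete, here is a minimal Python sketch. It is only an illustration, not the code behind the numbers in this post: the function name repetition_ratio, the whitespace tokenizer, and the two-sentence sample corpus are all assumptions made up for the example.
&lt;pre&gt;
```python
def repetition_ratio(articles):
    """Return total token count divided by the sum of per-article unique token counts."""
    total_tokens = 0          # first count: every token in the corpus
    unique_per_article = 0    # second count: each word once per article it occurs in
    for text in articles:
        tokens = text.lower().split()  # assumption: naive whitespace tokenization
        total_tokens += len(tokens)
        unique_per_article += len(set(tokens))
    return total_tokens / unique_per_article


# Hypothetical two-article corpus, invented purely for illustration.
sample_articles = [
    "he said that she said no",
    "they say he says no",
]
ratio = repetition_ratio(sample_articles)
print(round(ratio, 3))  # prints 1.1
```
&lt;/pre&gt;
On this toy input the first count is 11 and the second is 10, because "said" is the only word repeated within an article. A ratio of exactly 1 would mean that no word repeats within any article, and more repetition pushes the ratio higher.&lt;br /&gt;&lt;br /&gt;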
There are a number of factors that can determine the degree of repetition, but many of these should be more or less equally true of all languages across the news corpora. Inflectional morphology, however, varies greatly from one language to another. English would use the form "said" in the past tense regardless of the person and number of the verb, but would use "say" or "says" for the present tense. Spanish could use "dijeron", "dijo", "dije", etc. for the past, and "dicen", "dice", "digo", etc. for the present. Meanwhile, Chinese would use the character "&lt;/span&gt;&lt;span class="Apple-style-span"  style="  line-height: 25px; font-family:Arial, sans-serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;说&lt;/span&gt;&lt;span class="Apple-style-span"  style="  line-height: normal; font-family:Georgia, serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;" regardless of person, number, or tense. In this example, Spanish could have a relatively small discrepancy between the number of "SAY" tokens in an article and the number of "SAY" forms in the article. English would have a larger discrepancy, with only three forms likely to occur. Chinese, meanwhile, would have only one form regardless of any other factors, so the ratio of total "SAY" tokens to the count of one for that single form could be quite a bit larger. 
So "more inflected" languages should have lower ratios between the two counts, while "uninflected" languages should have higher ratios.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 169px; height: 200px;" src="http://2.bp.blogspot.com/_A0kVC-Xc3xM/S4bwdRrMg5I/AAAAAAAAACE/DaZ7cueDbs0/s200/ratios.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5442301585475273618" /&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;Behold, we see the basic truths upheld. Finnish is the most inflected language in this sample, and Chinese the least. English is the least inflected European language. Among the branches of the Indo-European family in the sample, the Slavic branch is the most inflected, with Germanic (besides the mongrel English) next-most, and Romance least.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A bit of analysis: Finnish and Turkish are agglutinative, allowing combinatorial addition of affixes and generating vast numbers of forms. Arabic manifests similar complexity in a slightly different way, with active derivational morphology combining with a respectable number of verb inflections.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Why do the Romance languages, with their vast verb conjugation tables, look so uninflected? Almost certainly, the fact that this is a &lt;b&gt;news&lt;/b&gt; corpus plays a big role. News favors the past tense and the third person, restricting verb conjugation more than, say, a corpus of spoken conversations would. The main complication of the Slavic languages, noun declension, is no more restricted in news than it would be in other contexts: nouns can still be subjects, objects, instruments, etc. Germanic languages, with their modest declension schemes, rank between the Slavic languages and the undeclined Romance languages. 
English, which came out of the Norman conquest on a path to eliminate much of the complexity of its Anglo-Saxon and Old French origins, ended up the least-inflected language in the region. Chinese is famously "uninflected", although I have to admit a procedural bias here: Tokenizing Chinese into single characters guarantees that it occupies the extreme position; if I were tokenizing it into words (taking some side or another in the debate concerning what is a word in Chinese), the counts would be different, although its rank as the least-inflected language in the bunch would not.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;One could have quite a bit of fun explaining all of the nuances in the ranking: Why are Danish and Norwegian so far apart? Because Norwegian has more zero-ending plurals? What about Spanish and Italian? Is it some nonmorphological reason?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Overall, the interesting thing here was seeing a numeric basis for an observation that is made so readily about which languages are "heavily" inflected. 
Where qualitative truths exist, quantitative breakdowns can show them.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3989794568827688223-7106176052378940827?l=nlpconfidential.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpconfidential.blogspot.com/feeds/7106176052378940827/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3989794568827688223&amp;postID=7106176052378940827' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/7106176052378940827'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/7106176052378940827'/><link rel='alternate' type='text/html' href='http://nlpconfidential.blogspot.com/2010/02/inflection-predilection.html' title='Inflection Predilection'/><author><name>John Rehling</name><uri>http://www.blogger.com/profile/16282519946871219302</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://3.bp.blogspot.com/_A0kVC-Xc3xM/Sqhrpr8KYqI/AAAAAAAAABU/q5yyeMyYhDE/S220/jrehling.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_A0kVC-Xc3xM/S4bwdRrMg5I/AAAAAAAAACE/DaZ7cueDbs0/s72-c/ratios.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3989794568827688223.post-16801602499788693</id><published>2010-02-25T00:52:00.001-08:00</published><updated>2010-02-25T00:52:43.199-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='linguistics'/><category scheme='http://www.blogger.com/atom/ns#' term='social media'/><category scheme='http://www.blogger.com/atom/ns#' term='parsing'/><category 
scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>Parsing Twitter</title><content type='html'>The Internet has not re-invented language as such, but it has created many new &lt;a href="http://en.wikipedia.org/wiki/Register_(linguistics)"&gt;registers&lt;/a&gt; that have to be parsed as such. Out-of-the-box NLP tools that were developed to parse the Wall Street Journal and other well-behaved text will fall down flat if they are used to process other niches around the Internet.&lt;br /&gt;&lt;br /&gt;I have seen some of these phenomena going back a quarter of a century, in online chat. In a nutshell, people use Internet means of writing in ways more colloquial than formal writing tends (and tended) to be. But even without that broad sweep, there are many sub-niches of usage -- some determined by medium, and some determined by user population. (Besides the obvious segmentation into different national languages like English, German, and Chinese.)&lt;br /&gt;&lt;br /&gt;Interest in parsing Twitter is suddenly getting hot, and while a lot of the linguistic behavior there resembles linguistic behavior in other online locales like chat rooms, email, and instant messaging, every niche ends up with its own rules (and lack thereof).&lt;br /&gt;&lt;br /&gt;Here are some phenomena I've seen as I build a parser that is robust enough to handle Twitter:&lt;br /&gt;&lt;br /&gt;1) Pro drop. Twitter in particular makes the first-person singular pronoun implicit. Many tweets look like English sentences that have the leading word "I" implied. In other cases, "I am" is implied.&lt;br /&gt;&lt;br /&gt;2) Nonsentential statements. Sometimes a noun phrase stands alone, with an implied existential quantification out front. "Party tonight" means "There will be a party tonight."&lt;br /&gt;&lt;br /&gt;3) A register that resembles &lt;a href="http://en.wikipedia.org/wiki/African_American_Vernacular_English"&gt;Black English Vernacular&lt;/a&gt; has arisen. 
I would suggest that this new written form deliberately deviates from formal written standards. At the same time, it is economical, using shorter forms as rebuses for bulkier forms whenever the shorter form would be pronounced the same way. For example, "You know" is rewritten as "u no" (4 characters instead of 8). One can feel William Safire quaking, but those of us writing parsers must accept and embrace it.&lt;br /&gt;&lt;br /&gt;The first time I noticed this was in the titles of songs written by Prince. The titles of songs on his first three albums never did this, but in albums released in 1981 and 1982, three of his songs had these elements in their titles (eg, "I would die 4 u"). You can see the deliberately contrary nature of his language by 1988, when he titled a song "Eye No", thus using a longer form instead of a standard shorter form. I don't know if Prince was significantly responsible for this phenomenon or not, but it has certainly caught on by now.&lt;br /&gt;&lt;br /&gt;Incidentally, detecting a user's register is potentially quite valuable, since many business purposes for parsing Twitter would involve market analysis and market segmentation.&lt;br /&gt;&lt;br /&gt;4) Acronyms and emoticons. These are so common in computer-mediated communication that it is impossible to be unaware of them. LOL.&lt;br /&gt;&lt;br /&gt;5) Novel contractions, like "hella", "tryna", "weneva".&lt;br /&gt;&lt;br /&gt;6) Repeating characters to establish emphasis. Eg, "welcomeeeeeeeeee". This is in some cases a challenge to parse (in principle, "good" is "god" with the "o" repeated). In other cases, it's easy to convert to standard usage, but it does defeat literal search mechanisms.&lt;br /&gt;&lt;br /&gt;Notice that the aforementioned devices can occur in combination. 
For example, "lmaaooo" = "laughing my ass off" with the "a" and "o" repeated for emphasis.&lt;br /&gt;&lt;br /&gt;7) Unique medium-specific entities like URLs and the Twitter features for directing a tweet to a user (eg, @FakeSteveJobs) or a topic (eg, #lost).&lt;br /&gt;&lt;br /&gt;8) The substitution of characters that resemble other characters for one another. "3" can be used for "E", "0" for "o", "q" for "g".&lt;br /&gt;&lt;br /&gt;9) Deliberate swapping of character order. For example, "teh" as a playful misspelling of "the". This can also combine with aforementioned devices. Eg, "pr0n" is a way to rewrite "porn".&lt;br /&gt;&lt;br /&gt;Not every user partakes of these new linguistic devices, but a parser that is intended to wring meaning (and market intelligence) out of Twitter (or blogs or email or other electronic communication) ignores them at its peril. The more of these you miss, the more information you miss. And the people who have embraced these nonstandard devices represent a nontrivial amount of spending power.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3989794568827688223-16801602499788693?l=nlpconfidential.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpconfidential.blogspot.com/feeds/16801602499788693/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3989794568827688223&amp;postID=16801602499788693' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/16801602499788693'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/16801602499788693'/><link rel='alternate' type='text/html' href='http://nlpconfidential.blogspot.com/2010/02/parsing-twitter.html' title='Parsing Twitter'/><author><name>John 
Rehling</name><uri>http://www.blogger.com/profile/16282519946871219302</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://3.bp.blogspot.com/_A0kVC-Xc3xM/Sqhrpr8KYqI/AAAAAAAAABU/q5yyeMyYhDE/S220/jrehling.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3989794568827688223.post-4363890799320237699</id><published>2009-09-09T20:23:00.000-07:00</published><updated>2009-09-15T10:52:30.832-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='nlp'/><category scheme='http://www.blogger.com/atom/ns#' term='information retrieval'/><category scheme='http://www.blogger.com/atom/ns#' term='cl'/><category scheme='http://www.blogger.com/atom/ns#' term='ir'/><title type='text'>Medicine for healthBase</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_A0kVC-Xc3xM/Sqh4rnYXlXI/AAAAAAAAAB8/mB7i8rRyUdE/s1600-h/healthbase-bad.jpg"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 200px; height: 158px;" src="http://4.bp.blogspot.com/_A0kVC-Xc3xM/Sqh4rnYXlXI/AAAAAAAAAB8/mB7i8rRyUdE/s200/healthbase-bad.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5379682445594957170" /&gt;&lt;/a&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Public Showcase Gone Wrong&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;Last week, TechCrunch reviewed &lt;/span&gt;&lt;/span&gt;&lt;a href="http://healthbase.netbase.com/"&gt;&lt;span class="Apple-style-span"  
style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;healthBase&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;, a public showcase of the Natural Language technology coming out of NetBase Solutions. In a rapidly-developing turn of events, TC published Leena Rao's &lt;/span&gt;&lt;/span&gt;&lt;a href="http://www.techcrunch.com/2009/09/02/healthbase-is-the-ultimate-medical-content-search-engine/"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;brief and largely glowing review&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;. Then the comments came and absolutely destroyed it, with phrases like "total fail" prompted by search results that were alternately terrible, hilarious, and if some posts were taken at face value, offensive. 
Hours later, Rao posted &lt;/span&gt;&lt;/span&gt;&lt;a href="http://www.techcrunch.com/2009/09/02/netbase-thinks-you-can-get-rid-of-jews-with-alcohol-and-salt/"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;a second review&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; picking up on the criticism.&lt;/span&gt;&lt;/span&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Most of the criticism is to some extent fair in that healthBase does readily yield lots of bad results. However, some of the critics go on to hypothesize how the system works and draw incorrect conclusions. 
Worse yet, &lt;/span&gt;&lt;/span&gt;&lt;a href="http://marklogic.blogspot.com/2009/09/netbase-tragicomedy-perils-of-magic-and.html"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;some bloggers&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; have used this as an indictment of the state of Natural Language Processing in general.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;I'm in an unusually good situation to comment, since I was a founding engineer of this technology, building the original Natural Language system at NetBase back when it was called Accelovation. I'm disheartened to see the public rollout of the technology turn out like this, particularly in that some of the fixes to the evident problems were on the agenda when I left the company two years ago. There really is a strong technology at the heart of this system, and with a couple of fixes, this rollout could have been much stronger. 
I don't know why the low-hanging fruit that could have fixed these problems wasn't plucked in the last two years, but the solutions are clearly identifiable, so let me describe here the two specific technological fixes that are needed, plus one other crucial bit of wisdom.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Treatments for Bad Results&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;TechCrunch's second review made much of a bad set of results for "Causes of aids" (meaning, of course, Acquired Immune Deficiency Syndrome, not the verb to &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;aid&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;). 
In the initial set of results (the company has worked rapidly to clean up the kinds of results that drew all the criticism), the top two results were good, although loosely referring to the same thing: &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;sexual contact with an infected partner&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;. The third result, &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;virus&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;, was also quite valid. But the next seven were all downright bad. To an outsider, they ranged from the bizarre (&lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;strong magnetic field&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;) to the equally bizarre and arguably offensive (&lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Jew&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;). 
But as an insider, I can tell you that there were two causes of bad results, and the system can get a lot better if these are fixed.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Tell Me Something I Don't Already Know&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;One of the results for the "Causes of aids" search was the singularly unilluminating &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Feature&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;. When you perform searches on healthBase now, &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;feature&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; never comes up, indicating to me that they placed it on a blocked list of possible results. This was an initiative that we knew was necessary back in 2007, and was something I was working on at the time. 
Somehow, this work stopped far short of the g&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;oal after I left. Adding &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;feature&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; in 2009 is not the necessary general solution, because scads of similar terms are still coming up tonight: A cause of measles is &lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;characteristic&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;. A cause of blindness is &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;disorder&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;. A cause of malaria is &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;objective&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;. A cause of leukemia is &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;defect&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;. 
Terms like &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;feature&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; and &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;characteristic&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; are too general to make sense in any circumstances. Terms like &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;disorder&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; and &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;defect&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; carry just one bit of information: They are something unfortunate, and of course, you could say that most bad things like AIDS are caused by some defect or other. 
Most things, good or bad, can be said to be caused by a characteristic -- the thing someone would want to know is -- &lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;which&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; characteristic of &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;what&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;? In the case of measles, the extracted sentence tells us that it's a characteristic of "immune priming" -- something the system should have and could have extracted instead of &lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;characteristic&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;The simple logic to fix these problems is to have a list of such terms and never show them. That's not a project for an all-nighter or even a couple of months' work, but in two years, they should have come up with an exhaustive list -- the top such terms identify themselves pretty easily by being pervasive; they're vacuous because they're omnipresent -- but didn't. 
These vacuous terms are the less numerous and less glaring of the bugs that have lit up the blogosphere, but they stem from a clearly identifiable source of bad results that is easily fixed.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Safety in Numbers&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;The worst errors are results that defy common sense -- statements that &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Jew&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; and &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;strong magnetic field&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; are causes of AIDS, or that &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Rancho del Arroyo mares&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span 
class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; cause hookers. NetBase's very talented Jens Tellefsen correctly identified -- in part -- the root cause of one of these errors, a single sentence on Wikipedia (and echoed elsewhere on the web) that juxtaposed "Jews" with "aiding". To wit:&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Hispano-Visigothic king &lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;em style="font-style: normal; "&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Egica accuses the Jews&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/em&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; of &lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;em style="font-style: normal; "&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;aiding&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/em&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; the Muslims, and sentences all Jews to slavery.&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  
style="font-size:small;"&gt;Clearly, there was an error in the parsing. The obvious (to us) use of &lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;AIDS&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; as a noun was confounded with the system parsing the verb &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;aiding&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; to its stem &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;aid&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;, and somewhere along the line, the system failed to distinguish the two. Accusation indicates that the thing described is bad, so the parser concluded that Jews did something bad involving &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;aid&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;. We could go into greater detail, but the gist is that the noun-verb error was made between the parsing of that sentence and the interpretation of the pithy search term AIDS. 
Jens called the problem out, and drew an unfair backlash, including:&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Personally, I think such basic distinctions should have been ironed out before launching the site.&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;and&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 19px; "&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;I am sorry, but if you are purporting to be an intelligent search engine, you need to be show basic intelligence like being able to disambiguate from different meanings and tenses of words. 
You need to be able to identify parts of speech, especially if you are trying to find references to causes.&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;and&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=" line-height: 18px; "&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;I hate to be pedestrian, but isn't that just a fancy way of saying it doesn't work?&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;All three of those rebuttals are misplaced: It's a simple fact that no Natural Language system will get the meanings and tenses (and, most to the point here: part of speech) correct 100% of the time. You don't iron out those distinctions, get a perfect parser, and then launch your product. 
Jens is quite right in asking the critics to excuse the occasional misparse.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;However, this result is obviously a failure in the rubber-meets-the-road sense, and the critics miss the real problem: It's not that the parser could make such a mistake; it's that such a mistake was allowed to produce a result which was ranked fourth! And there's an easy fix. Whatever formula you use to rank results (sheer number of occurrences being a likely but mistaken candidate), the system should only allow a single specific sentence to count once no matter how many times it is repeated across the web. Once you accept that principle, don't display any results based on so few occurrences that the inevitable imperfections in parsing could allow such a result. And then -- problem solved!&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;The "Causes of aids" results claimed that there were 116 records retrieved, and the top twenty causes were displayed. 
If "top" had been expressed in terms of number of specific sentences that provided that particular result, then &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Jew&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; would have gotten a score of 1. If a threshold of, say, 2, were applied as a minimum number of evidentiary sentences required for a result to be displayed, the single most lampooned result of this launch would have been avoided -- along with most of the other bad results. In several cases where I currently find unintentionally humorous results, a Google search reveals the single sentence that caused the problem. Now you'd only run into trouble if multiple writers had produced the same anomalous result with alternative phrasing -- set your threshold according to the precision of your parser and the size of the database, and you can make the probability of such errors as low as you'd like -- a very desirable choice of precision over recall.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Again, a simple fix, and one that I'd mentioned back in 2007, but that was never put into production at the time. But it's totally crucial to filtering automated Information Retrieval results and achieving reasonable quality. 
If a system allows results based on one misparsed sentence to go public, it is going to show crazy results.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Pick Your Battles&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;When I found out that the domain of health and medicine was being used as a showcase for Netbase technology, I was very surprised. Because back in 2005 and 2007, when we were seeking out venture capital, a key point was that the technology was meant to be generic and not about "vertical", single-domain searches. 
When &lt;/span&gt;&lt;/span&gt;&lt;a href="http://www.medstory.com/"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Medstory&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; launched, we had some internal discussion about it, and I made clear the point that medicine is an area where an ontology is almost sufficient for structuring searches in the way that Accelovation (as it was still called) used NLP to do from scratch.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;To be more specific, the entities in medicine are usually confined to one semantic category from the following list: "Drugs and Substances", "Conditions", "Procedures", "People", etc. -- the main result "silos" for Medstory results. As a result, they don't need to understand the meanings of sentences -- if a drug is mentioned in a result on the search topic, it is automatically placed in the "Drugs and Substances" column with an almost 100% chance that it is not being misclassified (have you ever met a person named "Thiamine"?). As a result, information retrieval in medicine can be done quite well without semantic search, and in fact, semantic search is more likely to surface errors unless the results are filtered through the two face-saving mechanisms I mentioned above. If you wanted to trip up the ontology-based approach, you could do it by being clever: Back in 2007, I typed "vitamin B poisoning" into Medstory and it listed "Vitamin B" as a drug -- be careful with that advice. 
But the exceptions are few, so Medstory already had the problem solved better than healthBase really had a chance to do in 2009. (And the problem I mentioned has since been fixed, showing that Medstory's team has not been complacent.)&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial, serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;If you picked a problem in a mechanical domain, you would immediately find that an ontology is not enough to perform that context-sensitive classification. For example, if you search healthBase now for "white noise", you find both pros ("calm baby") and cons ("corrupt measurement") -- impossible to do with an ontology alone.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial, serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;A related problem with the choice of a specific domain is that the healthBase implementation has not narrowed the field of indexed data to medicine, so you may (as jokers have done) produce results that are amusing simply because they are so flagrantly not medical: I just searched for complications of &lt;/span&gt;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Senator&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt; and the top result was "raise issue". 
This may actually point to some flaws in the IR besides those I've mentioned before, but stands out even more so for being non-medical.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Baby with the Bathwater&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;An earlier post on this blog was about the unfortunate recurrence of &lt;/span&gt;&lt;/span&gt;&lt;a href="http://nlpconfidential.blogspot.com/2008/02/winter-or-spring.html"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;AI Winter&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;, something which had hit the world pretty hard before I had made it out of my undergraduate work. To my dismay, the basic dynamic feeding it keeps on going, which is for the peddler of an AI-style technology (and NLP is certainly that) to over-market their own work, then live with the black mark on their reputation, and by extension, get people to write off the whole field. 
This is particularly damaging for those of us who make a living from it.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial, serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial, serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;The reality of the healthBase technology is that it provides lots of useful results, and if you are earnestly seeking some background on a medical topic -- no, not a replacement for actual medical advice -- you can get that information there. But, because at launch they had a system with far too many bad results, combined with the blogosphere's understandable joy at having a laugh, it turned into bad PR for the field as a whole. My dismay is all the greater for knowing that the company had already identified every one of these problems in 2007.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial, serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:arial, serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;I'm completely certain that Information Retrieval will be a useful public-facing tool in the near future. I think NetBase had a window of opportunity to get there before anyone else, but in 2009, you can sense the Wolframs and the Bings and the Google Squareds circling the goal. 
The surest bet now is that whoever gets there first, they'll have an extremely short wait before there's company.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3989794568827688223-4363890799320237699?l=nlpconfidential.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpconfidential.blogspot.com/feeds/4363890799320237699/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3989794568827688223&amp;postID=4363890799320237699' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/4363890799320237699'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/4363890799320237699'/><link rel='alternate' type='text/html' href='http://nlpconfidential.blogspot.com/2009/09/medicine-for-healthbase.html' title='Medicine for healthBase'/><author><name>John Rehling</name><uri>http://www.blogger.com/profile/16282519946871219302</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://3.bp.blogspot.com/_A0kVC-Xc3xM/Sqhrpr8KYqI/AAAAAAAAABU/q5yyeMyYhDE/S220/jrehling.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_A0kVC-Xc3xM/Sqh4rnYXlXI/AAAAAAAAAB8/mB7i8rRyUdE/s72-c/healthbase-bad.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3989794568827688223.post-4766622252122768218</id><published>2009-06-01T00:59:00.001-07:00</published><updated>2009-09-09T20:19:46.259-07:00</updated><title type='text'>Platform</title><content type='html'>Language is for the most part sequential because, in a nutshell, we only have 
one mouth. (Sign language gets around this a bit, as do speakers of language, with changes in tone, facial expressions, gestures and so on.)&lt;br /&gt;&lt;br /&gt;I've put together some of the basics for a platform for work in NLP. If anyone would like to collaborate on this as a true open-source project, I'd love to hear from you. Admittedly, there are some great packages out there like &lt;a href="http://gate.ac.uk/"&gt;GATE&lt;/a&gt; and &lt;a href="http://alias-i.com/lingpipe/"&gt;LingPipe&lt;/a&gt;, really nice guys who know their stuff and -- your NL Pundit loves nothing so much as this -- are upfront about the limitations as well as the strengths of their software. But I think there are some niches left unfilled for open source NLP. If you agree, drop me a line and maybe we can get a ball rolling.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3989794568827688223-4766622252122768218?l=nlpconfidential.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpconfidential.blogspot.com/feeds/4766622252122768218/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3989794568827688223&amp;postID=4766622252122768218' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/4766622252122768218'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/4766622252122768218'/><link rel='alternate' type='text/html' href='http://nlpconfidential.blogspot.com/2009/06/platform.html' title='Platform'/><author><name>John Rehling</name><uri>http://www.blogger.com/profile/16282519946871219302</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' 
src='http://3.bp.blogspot.com/_A0kVC-Xc3xM/Sqhrpr8KYqI/AAAAAAAAABU/q5yyeMyYhDE/S220/jrehling.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3989794568827688223.post-3615878288403486570</id><published>2009-04-13T10:34:00.000-07:00</published><updated>2009-04-13T11:00:33.874-07:00</updated><title type='text'>Spell Check for POS Tagging</title><content type='html'>Language is inherently hierarchical. Letters make words; words make phrases; phrases make sentences; and so on. NL processing tends to take place in stages corresponding to the levels of that hierarchy. It's a good paradigm; you can get good results by plugging state-of-the-art modules together, squeezing semantics out of natural language input.&lt;br /&gt;&lt;br /&gt;It's also the case that NL processing is inherently error-ridden. Every step in the process can stumble or trip on the rampant ambiguities in language. Notice that I did not restrict that comment to discussing machine NL processing... people make mistakes, too. We say "um", we introduce sounds we didn't mean to, misspell words, mangle sentences and ideas midway. And we mishear, misread, and get caught walking down (usually unintended) &lt;a href="http://en.wikipedia.org/wiki/Garden_path_sentence"&gt;garden paths&lt;/a&gt;. Many mechanical text-processing systems are feed-forward, making errors at each step and then accumulating the errors into a growing number until by the end, perhaps 25% to 75% of the sentences are misunderstood.&lt;br /&gt;&lt;br /&gt;People, however, often recover from misunderstandings, re-reading the text and rejumbling their thoughts until a correct parsing is found. The key is to use realizations (of error) at one level in processing to revise the previous level's work, which was flawed. Once an error is noted, the key is to determine where it was.&lt;br /&gt;&lt;br /&gt;This weekend, I ran a POS-tagger on the output of an unrelated phrase chunker. 
I noticed such patterns as (and these are some of the bad ones only):&lt;br /&gt;&lt;br /&gt;NP: DT VB&lt;br /&gt;VP: JJ IN&lt;br /&gt;&lt;br /&gt;Now obviously what had happened was that POS-tagging errors in the second tagger took place in processing phrases that resulted when the other system's chunker (and thus, its POS-tagger) had categorized the POS correctly. "JJ IN" was a case where the real underlying form was "VB RP", and the first tagger got it right (allowing the chunker to get it right) but the second tagger got it wrong.&lt;br /&gt;&lt;br /&gt;The key takeaway here is not that the first tagger is better! That may be true, or may not be true. The key observation, rather, is that when a POS tagger makes an error (and they all do), the prospect of chunking that sentence correctly is thereby doomed. GIGO = "Garbage in, garbage out."&lt;br /&gt;&lt;br /&gt;But consider the opportunity to recover. The fact of the matter is, "JJ IN" is a red flag that the tagger may have screwed up and that before committing to a chunking of those tokens, the system may want to reconsider the probabilities of those particular tags and see if a more agreeable chunking can result from different tagging.&lt;br /&gt;&lt;br /&gt;This is the heart of the top-down/bottom-up manner of processing which is known to be powerful in pattern recognition. &lt;a href="http://www.itee.uq.edu.au/~cogs2010/cmc/chapters/IAC/"&gt;Interactive Activation&lt;/a&gt; in particular is a demonstration of just how right this kind of processing is.&lt;br /&gt;&lt;br /&gt;This amounts to incorporating context in the tagging of a particular word, which is something that any competent tagger already does. (A Hidden Markov Model tagger does so by looking at the context to the left of a token; a transformation-based tagger, by rules with conditions that look at neighboring tokens.) But what people do, and NL systems need to do, is to revisit the processing done by one module when the next module detects anomalies. 
A simple corollary to GIGO... when you find garbage coming out, you know that garbage came in. Find it and fix it, and you've got a better, more robust system.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3989794568827688223-3615878288403486570?l=nlpconfidential.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpconfidential.blogspot.com/feeds/3615878288403486570/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3989794568827688223&amp;postID=3615878288403486570' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/3615878288403486570'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/3615878288403486570'/><link rel='alternate' type='text/html' href='http://nlpconfidential.blogspot.com/2009/04/spell-check-for-pos-tagging.html' title='Spell Check for POS Tagging'/><author><name>John Rehling</name><uri>http://www.blogger.com/profile/16282519946871219302</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://3.bp.blogspot.com/_A0kVC-Xc3xM/Sqhrpr8KYqI/AAAAAAAAABU/q5yyeMyYhDE/S220/jrehling.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3989794568827688223.post-6153400184449295591</id><published>2008-02-25T15:34:00.000-08:00</published><updated>2008-02-25T16:26:02.379-08:00</updated><title type='text'>I Fought the Law...</title><content type='html'>Historically, many of the major advances in science have involved the coining of a "Law". 
A good law is pithy, describes the world in a way that enables applied use, and tells people who are strong in mathematics something about the nature of the world that would make the law fit the equation. For example, the inverse square law governing the apparent brightness of a star follows neatly from the fact that the discrete packets of light streaming outward from it spread out through successively larger spheres. Einstein's famous E=mc&lt;sup&gt;2&lt;/sup&gt; tells you that there is a universal speed limit equal to lightspeed.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_A0kVC-Xc3xM/R8NVJCYWoSI/AAAAAAAAAAM/0NXrqBvmekQ/s1600-h/vertical.jpg"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;" src="http://4.bp.blogspot.com/_A0kVC-Xc3xM/R8NVJCYWoSI/AAAAAAAAAAM/0NXrqBvmekQ/s320/vertical.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5171070410897662242" /&gt;&lt;/a&gt; Stepping back from the methods and techniques that have been used in NLP, we can hope for a law that describes the progress that has been made as researchers in industry and academia pursue better solutions to the hard problems. If you listen to Ray Kurzweil, you'd conclude that progress is going to approach a &lt;a href="http://www.wisegeek.com/what-is-kurzweils-law.htm"&gt;vertical asymptote&lt;/a&gt;. In other words, things are going to improve so much that the future will be off the charts -- things, sooner or later, are going to become infinitely good. Maybe &lt;a href="http://news.bbc.co.uk/2/hi/americas/7248875.stm"&gt;quite soon.&lt;/a&gt; If you think that's true, you should invest heavily in AI-style technologies now!&lt;br /&gt;&lt;br /&gt;The history of NLP, however, as well as that of some other fields of AI, has produced repeated evidence for a different law governing progress. A lot of the cycle of "hype, then winter" comes from people failing to recognize this law. 
The law that the history of NLP describes is quite different from Kurzweil's rosy-colored vision. In fact, it's at right angles to it. Rather than the state of the art approaching a vertical asymptote (infinite goodness soon!), it has been approaching a &lt;span style="font-style:italic;"&gt;horizontal&lt;/span&gt; asymptote (progress is all done!). &lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_A0kVC-Xc3xM/R8NVhiYWoTI/AAAAAAAAAAU/Q7uNr2T_NDc/s1600-h/horizontal.jpg"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;" src="http://2.bp.blogspot.com/_A0kVC-Xc3xM/R8NVhiYWoTI/AAAAAAAAAAU/Q7uNr2T_NDc/s320/horizontal.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5171070831804457266" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Obviously, these two worldviews could not be more opposed. But the facts clearly point to which one is valid. Let's take some examples.&lt;br /&gt;&lt;br /&gt;An important task in NLP is called Part of Speech Tagging. It consists of marking all of the words in a piece of text with their grammatical category. Not many people want to do this for its own sake, but it's a highly useful initial step in analyzing text more deeply. The first effort to do this electronically was the work on the &lt;a href="http://en.wikipedia.org/wiki/Brown_Corpus"&gt;Brown Corpus&lt;/a&gt; by Greene and Rubin in 1971. They achieved, with a very simple approach, performance of 70%. Not too bad for a first try. By the early 1980s, researchers had pushed this number way up, with a system called &lt;a href="http://ucrel.lancs.ac.uk/claws/"&gt;CLAWS&lt;/a&gt; achieving about 94%, meaning that we'd done away with about four fifths of the errors made by Greene and Rubin's approach. Big progress! But in the 25 years since then, progress has been extremely minor, perhaps to 96%. 
In fact, there is good reason to believe that tagging better than 97% is impossible, because even human annotators tagging a text do not agree more than that often. Moreover, some very different approaches to tagging all yield similar accuracy rates, just shy of that theoretical maximum. When much more progress happens in the first decade than in the succeeding quarter century, you have found yourself a horizontal asymptote.&lt;br /&gt;&lt;br /&gt;As it happens, 96% is pretty good, and a tagger that performs that well is highly useful, errors be damned. And of course, no one would argue that a horizontal asymptote for &lt;span style="font-weight:bold;"&gt;accuracy&lt;/span&gt; must exist for anything, because you can't beat 100%. So in principle, this isn't a problem.&lt;br /&gt;&lt;br /&gt;But in practice, it is! Because while Part of Speech Tagging is ready for application, some other very important tasks in NLP are not. Suppose the asymptote were not at 96%, but much lower. That's what you have with another highly interesting problem called Word Sense Disambiguation -- the determination of which meaning of a word a writer had in mind. (For example, "bank" like the side of a river, or "bank" like a financial institution.) Here, though the numbers vary considerably according to how you measure accuracy, it is clear that there &lt;a href="http://sites.univ-provence.fr/veronis/pdf/1998wsd.pdf"&gt;are no excellent methods available&lt;/a&gt;. WSD has not advanced dramatically in the half century since it became an area of interest, and the approaches that exist are tailor-made to disappoint the people who understandably would like a good solution to this problem.&lt;br /&gt;&lt;br /&gt;A very big problem for NLP as an enterprise is the &lt;a href="http://nlpconfidential.blogspot.com/2008_02_01_archive.html"&gt;cycle of hype, disappointment, and mistrust&lt;/a&gt; that I identified a few weeks ago. 
And a very big part of the hype end of the problem is that people who are the recipients of hype (customer bases, venture capitalists, managers, etc.) don't know which law has been in effect. Innovation does exist. Progress does happen. But before an idea becomes a software project, anyone reviewing it should determine if the NLP needed to make the approach successful has already hit a horizontal asymptote.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3989794568827688223-6153400184449295591?l=nlpconfidential.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpconfidential.blogspot.com/feeds/6153400184449295591/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3989794568827688223&amp;postID=6153400184449295591' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/6153400184449295591'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/6153400184449295591'/><link rel='alternate' type='text/html' href='http://nlpconfidential.blogspot.com/2008/02/i-fought-law.html' title='I Fought the Law...'/><author><name>John Rehling</name><uri>http://www.blogger.com/profile/16282519946871219302</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://3.bp.blogspot.com/_A0kVC-Xc3xM/Sqhrpr8KYqI/AAAAAAAAABU/q5yyeMyYhDE/S220/jrehling.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_A0kVC-Xc3xM/R8NVJCYWoSI/AAAAAAAAAAM/0NXrqBvmekQ/s72-c/vertical.jpg' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3989794568827688223.post-5431461536833493220</id><published>2008-02-01T22:49:00.000-08:00</published><updated>2008-02-01T22:57:33.464-08:00</updated><title type='text'>Winter or Spring?</title><content type='html'>NLP is a sub-field of Artificial Intelligence. Many sub-fields of AI share with NLP a history in which periods of optimism give way to periods of pessimism. The leading term for this is "&lt;a href="http://en.wikipedia.org/wiki/AI_winter"&gt;AI Winter&lt;/a&gt;". Expectations are set high, results are promised, funding comes from all corners, work begins, results fall short of goals, disappointment reigns, and funding goes away. Then, after enough time has passed, new expectations are formed and the cycle begins anew.&lt;br /&gt;&lt;br /&gt;Of course, the cycles of failure are just part of the story. Speech-understanding has eventually found worthwhile (though limited) applications after surviving through cycles of hype and disappointment. Over time, the technology got better, the hardware got better and cheaper, and now it's rare to call a major corporation's customer support without being routed by a system that uses speech-understanding software.&lt;br /&gt;&lt;br /&gt;But the cycles of failure keep coming, and they are continuing to the present day.  Is NLP in Winter now or in Spring? Probably both. There has been a lot of funding for NLP-based ventures in the last few years. Some of it will lead to particular successes and some to particular failures. One beneficiary of high expectations of late has been San Francisco-based &lt;a href="http://www.powerset.com/"&gt;Powerset&lt;/a&gt;. 
Whether or not Powerset eventually delivers on the high &lt;a href="http://www.techcrunch.com/2007/02/09/powerset-hype-to-boiling-point/"&gt;expectations&lt;/a&gt; it has generated may have a lot to do with whether the field as a whole spends the next few years in spring or in winter.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3989794568827688223-5431461536833493220?l=nlpconfidential.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpconfidential.blogspot.com/feeds/5431461536833493220/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3989794568827688223&amp;postID=5431461536833493220' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/5431461536833493220'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/5431461536833493220'/><link rel='alternate' type='text/html' href='http://nlpconfidential.blogspot.com/2008/02/winter-or-spring.html' title='Winter or Spring?'/><author><name>John Rehling</name><uri>http://www.blogger.com/profile/16282519946871219302</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://3.bp.blogspot.com/_A0kVC-Xc3xM/Sqhrpr8KYqI/AAAAAAAAABU/q5yyeMyYhDE/S220/jrehling.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3989794568827688223.post-8746457110135434983</id><published>2008-01-01T14:54:00.000-08:00</published><updated>2008-01-01T15:11:09.254-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='nlp'/><category scheme='http://www.blogger.com/atom/ns#' term='linguistics'/><category scheme='http://www.blogger.com/atom/ns#' term='cl'/><category 
scheme='http://www.blogger.com/atom/ns#' term='history'/><category scheme='http://www.blogger.com/atom/ns#' term='language'/><title type='text'>Natural Language Processing</title><content type='html'>Sometime long ago, the grunts of some species of ape began to take on a new character, enabling the animals to communicate in ways more powerful than any species had before.  This surely took place gradually, and surely changed the apes as they changed this new thing they had, which we now call language. It was born around campfires and on hunts, in reference to children, prey, crops, predators, friends, enemies, and lovers. Over a span of hundreds of thousands of years, it grew up on six continents, in many different forms.  It existed in two forms: In the mind, and as a spoken medium, formed in the mouth of a speaker, carried through the air as sound, and received in the eardrum of a listener.&lt;br /&gt;&lt;br /&gt;Just a few thousand years ago, and in particular locations, humans began to write language. Just a few decades ago, humans began to encode written language so it could be stored and manipulated by digital computers. It didn't take long for people to realize that the kinds of tricks we can all perform, almost effortlessly, in crafting and understanding language are magnificently difficult to engineer in a machine.&lt;br /&gt;&lt;br /&gt;Workers in Natural Language Processing (which also goes by the alias Computational Linguistics, among other names) have grappled with the difficulty of this subject matter. Theoretical and intellectual curiosity has driven research, as have governmental and commercial enterprises. The history of the field has been marked by optimism, setbacks, hype, successes, more optimism, more setbacks, more hype, and the occasional white lie.&lt;br /&gt;&lt;br /&gt;The Internet, wireless communication, and portable computing are going to stimulate more interest in and need for NLP. 
If the past is any guide, there'll be a steady stream of optimism, setbacks, hype, and hopefully some successes. This blog will track NLP's ups and downs through 2008 and in the years to come.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3989794568827688223-8746457110135434983?l=nlpconfidential.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nlpconfidential.blogspot.com/feeds/8746457110135434983/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3989794568827688223&amp;postID=8746457110135434983' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/8746457110135434983'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3989794568827688223/posts/default/8746457110135434983'/><link rel='alternate' type='text/html' href='http://nlpconfidential.blogspot.com/2008/01/natural-language-processing.html' title='Natural Language Processing'/><author><name>John Rehling</name><uri>http://www.blogger.com/profile/16282519946871219302</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://3.bp.blogspot.com/_A0kVC-Xc3xM/Sqhrpr8KYqI/AAAAAAAAABU/q5yyeMyYhDE/S220/jrehling.jpg'/></author><thr:total>0</thr:total></entry></feed>
