Wednesday, September 9, 2009

Medicine for healthBase

Public Showcase Gone Wrong

Last week, TechCrunch reviewed healthBase, a public showcase of the Natural Language technology coming out of NetBase Solutions. In a rapidly-developing turn of events, TC published Leena Rao's brief and largely glowing review. Then the comments came and absolutely destroyed it, with phrases like "total fail" prompted by search results that were alternately terrible, hilarious, and, if some posts were to be taken at face value, offensive. Hours later, Rao posted a second review picking up on the criticism.

Most of the criticism is to some extent fair, in that healthBase does readily yield lots of bad results. However, some of the critics go on to hypothesize about how the system works and draw incorrect conclusions. Worse yet, some bloggers have used this as an indictment of the state of Natural Language Processing in general.

I'm in an unusually good position to comment, since I was a founding engineer of this technology, building the original Natural Language system at NetBase back when it was called Accelovation. I'm disheartened to see the public rollout of the technology turn out like this, particularly because some of the fixes to the evident problems were already on the agenda when I left the company two years ago. There really is a strong technology at the heart of this system, and with a couple of fixes, this rollout could have been much stronger. I don't know why the low-hanging fruit that could have fixed these problems wasn't plucked in the last two years, but the solutions are clearly identifiable, so let me describe here the two specific technological fixes that are needed, plus one other crucial bit of wisdom.

Treatments for Bad Results

TechCrunch's second review made much of a bad set of results for "Causes of aids" (meaning, of course, Acquired Immune Deficiency Syndrome, not the verb to aid). In the initial set of results (the company has worked rapidly to clean up the kinds of results that drew all the criticism), the top two results were good, although loosely referring to the same thing: sexual contact with an infected partner. The third result, virus, was also quite valid. But the next seven were all downright bad. To an outsider, they ranged from the bizarre (strong magnetic field) to the equally bizarre and arguably offensive (Jew). But as an insider, I can tell you that there were two causes of bad results, and the system can get a lot better if these are fixed.

Tell Me Something I Don't Already Know

One of the results for the "Causes of aids" search was the singularly unilluminating Feature. When you perform searches on healthBase now, feature never comes up, indicating to me that they placed it on a blocked list of possible results. This was an initiative that we knew was necessary back in 2007, and it was something I was working on at the time. Somehow, this work stopped far short of the goal after I left. Adding feature in 2009 is not the general solution that's needed, because scads of similar terms are still coming up tonight: A cause of measles is characteristic. A cause of blindness is disorder. A cause of malaria is objective. A cause of leukemia is defect. Terms like feature and characteristic are too general to make sense in any circumstances. Terms like disorder and defect carry just one bit of information: they name something unfortunate, and of course, you could say that most bad things like AIDS are caused by some defect or other. Most things, good or bad, can be said to be caused by a characteristic; the thing someone would actually want to know is: which characteristic of what? In the case of measles, the extracted sentence tells us that it's a characteristic of "immune priming" -- something the system should have and could have extracted instead of characteristic.

The simple logic to fix these problems is to have a list of such terms and never show them -- something like the sketch below. That's not a project for an all-nighter, or even a couple of months' work, but in two years they should have come up with an exhaustive list -- the top such terms identify themselves pretty easily by being pervasive; they're vacuous precisely because they're omnipresent -- and they didn't. These vacuous terms are the less numerous and less glaring of the bugs that have lit up the blogosphere, but they stem from a clearly identifiable source of bad results that is easily fixed.
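As a rough sketch of what such a filter might look like -- the term list, threshold, and data shapes below are my own illustration, not anything from NetBase's actual code -- the fix is a stoplist consulted before results are displayed, and candidate stoplist entries largely flag themselves by how many different topics they turn up under:

# Illustrative sketch only: the term list, threshold, and data shapes
# are my own guesses, not the actual healthBase implementation.
from collections import Counter

VACUOUS_TERMS = {"feature", "characteristic", "disorder", "objective", "defect"}

def filter_results(results):
    """Drop result terms too general to tell the reader anything."""
    return [r for r in results if r["term"].lower() not in VACUOUS_TERMS]

def candidate_vacuous_terms(results_by_topic, min_topic_fraction=0.5):
    """Flag terms that appear as a 'cause' for a large share of all topics;
    a term that causes everything tells you nothing about anything."""
    counts = Counter()
    for results in results_by_topic.values():
        for term in {r["term"].lower() for r in results}:
            counts[term] += 1
    n_topics = len(results_by_topic)
    return sorted(t for t, c in counts.items() if c / n_topics >= min_topic_fraction)

Run candidate_vacuous_terms over a few hundred topics' worth of results and the feature/characteristic/disorder crowd should rise to the top on its own; a human just has to approve the list.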

Safety in Numbers

The worst errors are results that defy common sense -- statements that Jew and strong magnetic field are causes of AIDS, or that Rancho del Arroyo mares cause hookers. NetBase's very talented Jens Tellefsen correctly identified -- in part -- the root cause of one of these errors: a single sentence on Wikipedia (and echoed elsewhere on the web) that juxtaposed "Jews" with "aiding". To wit:

Hispano-Visigothic king Egica accuses the Jews of aiding the Muslims, and sentences all Jews to slavery.

Clearly, there was an error in the parsing. The system stemmed the verb aiding to aid and, somewhere along the line, failed to distinguish it from the (to us) obvious noun AIDS. Accusation indicates that the thing described is bad, so the system concluded that Jews did something bad involving aid. We could go into greater detail, but the gist is that the noun-verb confusion arose between the parsing of that sentence and the interpretation of the pithy search term AIDS. Jens called the problem out, and drew an unfair backlash, including:

Personally, I think such basic distinctions should have been ironed out before launching the site.

and

I am sorry, but if you are purporting to be an intelligent search engine, you need to be show basic intelligence like being able to disambiguate from different meanings and tenses of words. You need to be able to identify parts of speech, especially if you are trying to find references to causes.

and

I hate to be pedestrian, but isn't that just a fancy way of saying it doesn't work?

All three of those rebuttals are misplaced: it's a simple fact that no Natural Language system will get the meanings and tenses (and, most to the point here, parts of speech) correct 100% of the time. You don't iron out those distinctions, get a perfect parser, and then launch your product. Jens is quite right in asking the critics to excuse the occasional misparse.

However, this result is obviously a failure in the rubber-meets-the-road sense, and the critics miss the real problem. It's not that the parser could make such a mistake; it's that such a mistake was allowed to produce a result that was ranked fourth! And there's an easy fix. Whatever formula you use to rank results (sheer number of occurrences being a likely but mistaken candidate), the system should only allow a single specific sentence to count once, no matter how many times it is repeated across the web. Once you accept that principle, don't display any result based on so few occurrences that the inevitable imperfections in parsing could let it through. And then -- problem solved!

The "Causes of aids" results claimed that there were 116 records retrieved, and the top twenty causes were displayed. If "top" had been expressed in terms of number of specific sentences that provided that particular result, then Jew would have gotten a score of 1. If a threshold of, say, 2, were applied as a minimum number of evidentiary sentences required for a result to be displayed, the single most lampooned result of this launch would have been avoided -- along with most of the other bad results. In several cases where I currently find unintentionally humorous results, a Google search reveals the single sentence that caused the problem. Now you'd only run into trouble if multiple writers had produced the same anomalous result with alternative phrasing -- set your threshold according to the precision of your parser and the size of the database, and you can make the probability of such errors as low as you'd like -- a very desirable choice of precision over recall.

Again, a simple fix, and one that I'd mentioned back in 2007, but that was never put into production at the time. But it's totally crucial to filtering automated Information Retrieval results and achieving reasonable quality. If a system allows results based on one misparsed sentence to go public, it is going to show crazy results.

Pick Your Battles

When I found out that the domain of health and medicine was being used as a showcase for NetBase technology, I was very surprised, because back in 2005 and 2007, when we were seeking out venture capital, a key point was that the technology was meant to be generic and not about "vertical", single-domain searches. When Medstory launched, we had some internal discussion about it, and I made the point that medicine is an area where an ontology is almost sufficient for structuring searches in the way that Accelovation (as it was still called) used NLP to do from scratch.

To be more specific, the entities in medicine are usually confined to one semantic category from the following list: "Drugs and Substances", "Conditions", "Procedures", "People", etc. -- the main result "silos" for Medstory results. As a result, Medstory doesn't need to understand the meanings of sentences: if a drug is mentioned in a result on the search topic, it is automatically placed in the "Drugs and Substances" column with an almost 100% chance that it is not being misclassified (have you ever met a person named "Thiamine"?). So information retrieval in medicine can be done quite well without semantic search, and in fact, semantic search is more likely to surface errors unless the results are filtered through the two face-saving mechanisms I mentioned above. If you wanted to trip up the ontology-based approach, you could do it by being clever: back in 2007, I typed "vitamin B poisoning" into Medstory and it listed "Vitamin B" as a drug -- be careful with that advice. But the exceptions are few, so Medstory already had the problem solved better than healthBase really had a chance to do in 2009. (And the problem I mentioned has since been fixed, showing that Medstory's team has not been complacent.)
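Here is a toy illustration of what "an ontology is almost sufficient" means in this domain -- the entries are placeholders of my own, not Medstory's actual data: classification is little more than a dictionary lookup, with no sentence understanding required.

# Toy illustration of ontology-driven siloing; the entries are placeholders,
# not drawn from Medstory or any real medical ontology.
ONTOLOGY = {
    "thiamine": "Drugs and Substances",
    "ibuprofen": "Drugs and Substances",
    "measles": "Conditions",
    "appendectomy": "Procedures",
}

def silo(entity):
    """Assign an entity mentioned in a retrieved document to a result column.
    No parsing needed: in medicine, most terms belong to exactly one category."""
    return ONTOLOGY.get(entity.lower(), "Other")

# silo("Thiamine") -> "Drugs and Substances", regardless of the sentence it came from.

That last comment is also the weakness: the lookup has no way to notice when context should override the dictionary, which is exactly the "vitamin B poisoning" trap.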

If you picked a problem in a mechanical domain, you would immediately find that an ontology is not enough to perform that context-sensitive classification. For example, if you search healthBase now for "white noise", you find both pros ("calm baby") and cons ("corrupt measurement") -- impossible to do with an ontology alone.

A related problem with the choice of a specific domain is that the healthBase implementation has not narrowed the field of indexed data to medicine, so you may (as jokers have done) produce results that are amusing simply because they are so flagrantly not medical: I just searched for complications of Senator and the top result was "raise issue". This may actually point to some flaws in the IR besides those I've mentioned above, but it stands out all the more for being non-medical.

Baby with the Bathwater

An earlier post on this blog was about the unfortunate recurrence of AI Winter, something that had hit the world pretty hard before I had made it out of my undergraduate work. To my dismay, the basic dynamic feeding it keeps on going: the peddler of an AI-style technology (and NLP is certainly that) over-markets their own work, then lives with the black mark on their reputation, and, by extension, gets people to write off the whole field. This is particularly damaging for those of us who make a living from it.

The reality of the healthBase technology is that it provides lots of useful results, and if you earnestly seek some background on a medical topic -- no, not a replacement for actual medical advice -- you can get that information there. But because they launched with a system showing far too many bad results, combined with the blogosphere's understandable joy at having a laugh, it turned into bad PR for the field as a whole. My dismay is all the greater for knowing that the company had already identified every one of these problems in 2007.

I'm completely certain that Information Retrieval will be a useful public-facing tool in the near future. I think NetBase had a window of opportunity to get there before anyone else, but in 2009, you can sense the Wolframs and the Bings and the Google Squareds circling the goal. The surest bet now is that whoever gets there first, they'll have an extremely short wait before there's company.

Monday, June 1, 2009

Platform

Language is for the most part sequential because, in a nutshell, we only have one mouth. (Sign language gets around this a bit, as do speakers of spoken languages, with changes in tone, facial expressions, gestures, and so on.)

I've put together some of the basics for a platform for work in NLP. If anyone would like to collaborate on this as a true open-source project, I'd love to hear from you. Admittedly, there are some great packages out there like GATE and LingPipe, built by really nice guys who know their stuff and -- your NL Pundit loves nothing so much as this -- are upfront about the limitations as well as the strengths of their software. But I think there are some niches left unfilled for open-source NLP. If you agree, drop me a line and maybe we can get a ball rolling.

Monday, April 13, 2009

Spell Check for POS Tagging

Language is inherently hierarchical. Letters make words; words make phrases; phrases make sentences; and so on. NL processing tends to take place in stages corresponding to the levels of that hierarchy. It's a good paradigm; you can get good results by plugging state-of-the-art modules together, squeezing semantics out of natural language input.
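A minimal example of that plug-the-modules-together paradigm, using NLTK purely as a convenient stand-in for whatever state-of-the-art components you prefer (and assuming its tokenizer and tagger models are installed):

# Sketch of a staged pipeline: characters -> tokens -> POS tags -> phrase chunks.
# NLTK is a stand-in here; any comparable modules could be plugged in instead.
import nltk

sentence = "The tagger labeled the short sentence correctly."
tokens = nltk.word_tokenize(sentence)               # characters -> words
tagged = nltk.pos_tag(tokens)                       # words -> parts of speech
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"                 # a toy noun-phrase grammar
chunks = nltk.RegexpParser(grammar).parse(tagged)   # tags -> phrases
print(chunks)

Each stage consumes the previous stage's output, which is exactly the structure that makes the error propagation discussed next so punishing.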

It's also the case that NL processing is inherently error-ridden. Every step in the process can stumble or trip on the rampant ambiguities in language. Notice that I did not restrict that comment to machine NL processing... people make mistakes, too. We say "um", we introduce sounds we didn't mean to, misspell words, mangle sentences and ideas midway. And we mishear, misread, get caught walking down (usually unintended) garden paths. Many mechanical text-processing systems are feed-forward, making errors at each step and compounding them until, by the end, perhaps 25% to 75% of the sentences are misunderstood. (If each of four stages gets 90% of sentences right, only about two-thirds emerge from all four unscathed.)

People, however, often recover from misunderstandings, re-reading the text and rejumbling their thoughts until a correct parsing is found. The key is to use realizations of error at one level of processing to revise the previous level's flawed work. Once an error is noticed, the task is to determine where it was introduced.

This weekend, I ran a POS-tagger on the output of an unrelated phrase chunker. I noticed such patterns as (and these are some of the bad ones only):

NP: DT VB
VP: JJ IN

Now obviously what had happened was that the second tagger made POS-tagging errors while processing phrases that the other system's chunker (and thus its POS-tagger) had categorized correctly. "JJ IN" was a case where the real underlying form was "VB RP": the first tagger got it right (allowing the chunker to get it right), but the second tagger got it wrong.

The key takeaway here is not that the first tagger is better! That may or may not be true. The key observation, rather, is that when a POS tagger makes an error (and they all do), the prospect of chunking that sentence correctly is doomed. GIGO = "Garbage in, garbage out."

But consider the opportunity to recover. The fact of the matter is, "JJ IN" is a red flag that the tagger may have screwed up and that before committing to a chunking of those tokens, the system may want to reconsider the probabilities of those particular tags and see if a more agreeable chunking can result from different tagging.
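Here's a sketch of that recovery step -- the red-flag patterns and the retag() hook are my own illustration, not code from either tagger: when a chunk's tag sequence matches a known red flag, hand the tokens back to the tagger for its next-best hypotheses before committing to the chunk.

# Illustration only: the red-flag patterns and the retag() callback are hypothetical.
RED_FLAGS = {
    ("DT", "VB"),    # a determiner directly followed by a base-form verb
    ("JJ", "IN"),    # often really VB RP, as in the example above
}

def chunk_with_review(chunks, retag):
    """chunks: list of lists of (token, tag) pairs from the first pass.
    retag: callback that re-tags a token sequence using next-best tag hypotheses."""
    reviewed = []
    for chunk in chunks:
        tags = tuple(tag for _, tag in chunk)
        if tags in RED_FLAGS:
            tokens = [tok for tok, _ in chunk]
            chunk = retag(tokens)    # revisit the tagger's decision before committing
        reviewed.append(chunk)
    return reviewed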

This is the heart of the top-down/bottom-up manner of processing which is known to be powerful in pattern recognition. Interactive Activation in particular is a demonstration of just how right this kind of processing is.

This amounts to incorporating context into the tagging of a particular word, which is something any competent tagger already does. (A hidden Markov model tagger does it by looking at the context to the left of a token; a transformation-based tagger, by rules whose conditions look at neighboring tokens.) But what people do, and NL systems need to do, is revisit the processing done by one module when the next module detects anomalies. A simple corollary to GIGO: when you find garbage coming out, you know that garbage came in. Find it and fix it, and you've got a better, more robust system.