Sunday, February 28, 2010

Speech To Speech

When Google talks, people listen. Recently, Google has been saying that sometime in the near future, when you talk, people can listen in other languages, thanks to your phone and their software. Franz Och, the head of Google's translation services, has said "We think speech-to-speech translation should be possible and work reasonably well in a few years' time." I'm not the only one who's skeptical.

In 2007, I evaluated a state-of-the-art StS MT system that was built for military applications (think of the Somali pirate crisis from 2009 or any scene from The Hurt Locker). I don't think it delivered. The issue is quality. StS MT can "work" if:

a) Both parties are extremely cooperative and are willing to put some work into the exchange. (Eg, training the system on their voices; repeating things, maybe more than once, if the result is unclear the first time; tolerating the inherent delays.) This could mean a life-or-death situation where the alternative -- no translation -- is worse than a bad translation. (Although the stakes could be higher, even deadly, if the translation is faulty.)

b) The comprehension of the content is held to a low standard. Eg, if two businesspeople or recreational chatters want to feel like they're getting to know each other; not trying to hammer out the fine details of a legal agreement.

c) There is no alternative. Because having the parties type content into an online text-based MT system immediately removes one source of error.

As long as the sources of error are as great as they are now, I have trouble thinking of many contexts where people would be willing to tolerate the flaws. Maybe chatters who are only looking for entertainment and have no bottom line regarding accuracy. In war / emergency contexts, perhaps. In business, I think the problems just about doom the effort, unless a cultural adjustment makes people value "meeting" someone in this way even when the comprehension is shaky.

I was, however, impressed by how well the speech recognition worked with my voice when I trained the system to recognize me. Perhaps if Google trains the system on enough people, just about anyone new to come along would sound close enough to one of those. The speech-to-text part of the problem just might be solvable for a large segment of people.

But that leaves the machine translation link in the chain, and that's something where there's little reason to suspect that a quantum leap is about to happen. I took Mr. Och's own quotation above and translated it to Spanish and then back to English using Google's text-to-text machine translation. It did a pretty good job, but it changed his use of the word "should" to "must". That's the sort of thing you don't want to have happen when a Somali pirate has his gun trained on you.

Thursday, February 25, 2010

Inflection Predilection

Here's a lightweight, factoid-level look at a lot of different languages in a fun way.

We often hear that some languages are "heavily inflected". You can glance at tables of verb conjugations and noun declensions to see the brutal details: Russian surely does have a lot of declensions to memorize, just as Spanish does with verb conjugations. But that's all theory and not practice -- lots of those forms are hardly ever used. I lived in Italy and started making a mental note of how often I heard (or had to use) the second-person plural form of the future tense: Not often!

One way of quantifying how "heavily" inflected a language is can be seen from some statistical work I was doing for a more practical purpose. For each of sixteen languages, I processed corpora of news articles with total token counts for each language ranging from about 100,000 to 3 million. Based on this, I generated two ways of counting tokens: First, the absolute count of nonunique tokens (in other words, the total size of the corpus); Second, the sum of the unique token count of each article (in other words, counting each word just once for each article it occurs in).

If each word occurred at most once per article, these two counts would be identical. The ratio between the two counts thus expresses how much a language tends to repeat words. There are a number of factors that can determine the degree of repetition, but many of these should be more or less equally true of all languages across the news corpora. Inflectional morphology, however, varies greatly from one language to another. English would use the form "said" in the past tense regardless of the person and number of the verb, but would use "say" or "says" for the present tense. Spanish could use "dijeron", "dijo", "dije", etc. for the past, and "dicen", "dice", "digo", etc. for the present. Meanwhile, Chinese would use a single character regardless of person, number, or tense. In this example, Spanish could have a relatively small discrepancy between the number of "SAY" tokens in an article and the number of "SAY" forms in the article. English would have a larger discrepancy, with only three forms likely to occur. Chinese, meanwhile, would have only one form regardless of any other factors, so the ratio of total "SAY" tokens to the count of one for that one form could be quite a bit larger. So "more inflected" languages should have lower ratios between the two kinds of counts, while "uninflected" languages should have higher ratios.
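The two counts and their ratio can be sketched in a few lines. This is a minimal illustration, not the code used for the original study: the function name, the whitespace tokenizer, and the two tiny "corpora" are all assumptions made up for the example.

```python
def repetition_ratio(articles):
    """Total token count divided by the sum of per-article unique token counts.

    `articles` is an iterable of token lists, one list per article.
    A ratio of 1.0 means no word is ever repeated within an article.
    """
    total_tokens = 0
    unique_per_article = 0
    for tokens in articles:
        total_tokens += len(tokens)             # every occurrence counts
        unique_per_article += len(set(tokens))  # each form counts once per article
    if unique_per_article == 0:
        raise ValueError("corpus contains no tokens")
    return total_tokens / unique_per_article


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


# Toy data, invented for illustration only.
english = [tokenize("he said she said they said"), tokenize("we say he says")]
spanish = [tokenize("el dijo ella dijo ellos dijeron"), tokenize("decimos el dice")]

print(repetition_ratio(english))  # 10 tokens / 8 per-article forms
print(repetition_ratio(spanish))  # 9 tokens / 8 per-article forms
```

Even in this toy case the language that reuses one form ("said") comes out with the higher ratio, which is the direction the argument above predicts.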

Behold, we see the basic truths upheld. Finnish is the most inflected language in this sample, and Chinese the least. English is the least inflected European language. Among the branches of the Indo-European family in the sample, the Slavic branch is the most inflected, with Germanic (besides the mongrel English) next-most, and Romance least.

A bit of analysis: Finnish and Turkish are agglutinative, allowing combinatorial addition of affixes and generating vast numbers of forms. Arabic manifests similar complexity in a slightly different way, with active derivational morphology combining with a respectable number of verb inflections.

Why do the Romance languages, with their vast verb conjugation tables, look so uninflected? Almost certainly, the fact that this is a news corpus plays a big role. News favors the past tense and the third person, restricting verb conjugation more than, say, a corpus of spoken conversations would. The main complication of the Slavic languages, noun declension, is no more restricted in news than it would be in other contexts: nouns can still be subjects, objects, instruments, etc. Germanic languages, with their modest declension schemes, rank between the Slavic languages and the undeclined Romance languages. English, which came out of the Norman conquest on a path to eliminate as much as possible of the complexity of its Anglo-Saxon and Old French origins, ended up the least-inflected language in the region. Chinese is famously "uninflected", although I have to admit a procedural bias here: Tokenizing Chinese into single-character tokens guarantees that it occupies the extreme position; if I were tokenizing it into words (taking some side or another in the debate concerning what is a word in Chinese), the counts would be different, although its rank as the least-inflected language in the bunch would not.

One could have quite a bit of fun explaining all of the nuances in the ranking: Why are Danish and Norwegian so far apart? Because Norwegian has more zero-ending plurals? What about Spanish and Italian? Is it some nonmorphological reason?

Overall, the thing that was interesting here was to see the numeric basis to an observation that is made so readily about which languages are "heavily" inflected. Where qualitative truths exist, quantitative breakdowns can show them.

Parsing Twitter

The Internet has not re-invented language as such, but it has created many new registers that have to be parsed as such. Out-of-the-box NLP tools that were developed to parse the Wall Street Journal and other well-behaved text will fall down flat if they are used to process other niches around the Internet.

I have seen some of these phenomena going back a quarter of a century, in online chat. In a nutshell, people use Internet means of writing in ways more colloquial than formal writing tends (and tended) to be. But even without that broad sweep, there are many sub-niches of usage -- some determined by medium, and some determined by user population. (Besides the obvious segmentation into different national languages like English, German, and Chinese.)

Interest in parsing Twitter is suddenly getting hot, and while a lot of the linguistic behavior there resembles linguistic behavior in other online locales like chat rooms, email, and instant messaging, every niche ends up with its own rules (and lack thereof).

Here are some phenomena I've seen as I build a parser that is robust enough to handle Twitter:

1) Pro drop. Twitter in particular makes the first-person singular pronoun implicit. Many tweets look like English sentences that have the leading word "I" implied. In other cases, "I am" is implied.

2) Nonsentential statements. Sometimes a noun phrase stands alone, with an implied existential quantification out front. "Party tonight" means "There will be a party tonight."

3) A register that resembles Black English Vernacular has arisen. I would suggest that this new written form deliberately deviates from formal written standards. At the same time, it is economical, using shorter forms as rebuses for bulkier forms whenever the shorter form would be pronounced the same way. For example, rewriting "You know" as "u no" (4 characters instead of 8). One can feel William Safire quaking, but for those of us writing parsers, we must accept and embrace.

The first place I noticed this was in the titles of songs written by Prince. The titles of songs on his first three albums never did this, but in albums released in 1981 and 1982, three of his songs had these elements in their titles (eg, "I would die 4 u"). You can see the deliberately contrary nature of his language by 1988, when he titled a song "Eye No", thus using a longer form instead of a standard shorter form. I don't know if Prince was significantly responsible for this phenomenon or not, but it has certainly caught on by now.

Incidentally, detecting a user's register is potentially quite valuable, since many business purposes for parsing Twitter would be involved with market analysis and market segmentation.

4) Acronyms and emoticons. These are so common in computer-mediated communication that it is impossible to be unaware of them. LOL.

5) Novel contractions, like "hella", "tryna", "weneva".

6) Repeating characters to establish emphasis. Eg, "welcomeeeeeeeeee". This is in some cases a challenge to parse (in principle, "good" is "god" with the "o" repeated). In other cases, it's easy to convert to standard usage, but it does defeat literal search mechanisms.

Notice that the aforementioned devices can occur in combination. For example, "lmaaooo" = "laughing my ass off" with the "a" and "o" repeated for emphasis.

7) Unique medium-specific entities like URLs and the Twitter features for directing a tweet to a user (eg, @FakeSteveJobs) or a topic (eg, #lost).

8) The substitution of characters that resemble other characters for one another. "3" can be used for "E", "0" for "o", "q" for "g".

9) Deliberate swapping of character order. For example, "teh" as a playful misspelling of "the". This can also combine with aforementioned devices. Eg, "pr0n" is a way to rewrite "porn".
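Two of the devices above, medium-specific entities (7) and repeated characters (6), lend themselves to a short sketch. This is only an illustration under stated assumptions: the regular expressions, the function names, and the tiny LEXICON are all hypothetical, and a real system would need a full dictionary and far more care.

```python
import re

# One named alternative per token type; the first alternative that matches wins.
TOKEN_RE = re.compile(
    r"""
    (?P<url>https?://\S+)      # URLs
  | (?P<mention>@\w+)          # tweets directed at a user
  | (?P<hashtag>\#\w+)         # topics
  | (?P<word>\w+(?:'\w+)?)     # ordinary words, with an optional apostrophe part
  | (?P<other>\S)              # any other non-space character
    """,
    re.VERBOSE,
)


def tokenize(tweet):
    """Return (token_type, text) pairs for a tweet."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(tweet)]


def elongation_candidates(word):
    """Candidate spellings: as written, runs of 3+ cut to 2, all runs cut to 1."""
    raw = [
        word,
        re.sub(r"(.)\1{2,}", r"\1\1", word),
        re.sub(r"(.)\1+", r"\1", word),
    ]
    unique = []
    for candidate in raw:
        if candidate not in unique:
            unique.append(candidate)
    return unique


def normalize(word, lexicon):
    """Pick the first candidate found in the lexicon, else keep the word as is."""
    for candidate in elongation_candidates(word):
        if candidate in lexicon:
            return candidate
    return word


# Toy lexicon, invented for illustration.
LEXICON = {"welcome", "good", "god", "lmao"}

print(tokenize("@FakeSteveJobs welcomeeee 2 #lost http://example.com/x"))
print(normalize("welcomeeeeeeeeee", LEXICON))
print(elongation_candidates("good"))
```

Trying the spelling as written first is what keeps "good" from collapsing to "god"; the ambiguity only has to be resolved when the written form is not itself a known word.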

Not every user partakes of these new linguistic devices, but a parser that is intended to wring meaning (and market intelligence) out of Twitter (or blogs or email or other electronic communication) ignores them at its peril. The more of these you miss, the more information you miss. And the people who have embraced these nonstandard devices represent a nontrivial amount of spending power.

Wednesday, September 9, 2009

Medicine for healthBase

Public Showcase Gone Wrong

Last week, TechCrunch reviewed healthBase, a public showcase of the Natural Language technology coming out of NetBase Solutions. In a rapidly-developing turn of events, TC published Leena Rao's brief and largely glowing review. Then the comments came and absolutely destroyed it, with phrases like "total fail" prompted by search results that were alternately terrible, hilarious, and, if some posts were taken at face value, offensive. Hours later, Rao posted a second review picking up on the criticism.

Most of the criticism is to some extent fair in that healthBase does readily yield lots of bad results. However, some of the critics go on to hypothesize how the system works and make incorrect conclusions. Worse yet, some bloggers have used this as an indictment of the state of Natural Language Processing in general.

I'm in an unusually good situation to comment, since I was a founding engineer of this technology, building the original Natural Language system at NetBase back when it was called Accelovation. I'm disheartened to see the public rollout of the technology turn out like this, particularly in that some of the fixes to the evident problems were on the agenda when I left the company two years ago. There really is a strong technology at the heart of this system, and with a couple of fixes, this rollout could have been much stronger. I don't know why the low-hanging fruit that could have fixed these problems wasn't plucked in the last two years, but the solutions are clearly identifiable, so let me describe here the two specific technological fixes that are needed, plus one other crucial bit of wisdom.

Treatments for Bad Results

TechCrunch's second review made much of a bad set of results for "Causes of aids" (meaning, of course, Acquired Immune Deficiency Syndrome, not the verb to aid). In the initial set of results (the company has worked rapidly to clean up the kinds of results that drew all the criticism), the top two results were good, although loosely referring to the same thing: sexual contact with an infected partner. The third result, virus, was also quite valid. But the next seven were all downright bad. To an outsider, they ranged from the bizarre (strong magnetic field) to the equally bizarre and arguably offensive (Jew). But as an insider, I can tell you that there were two causes of bad results, and the system can get a lot better if these are fixed.

Tell Me Something I Don't Already Know

One of the results for the "Causes of aids" search was the singularly unilluminating Feature. When you perform searches on healthBase now, feature never comes up, indicating to me that they placed it on a blocked list of possible results. This was an initiative that we knew was necessary back in 2007, and was something I was working on at the time. Somehow, this work stopped far short of the goal after I left. Adding feature in 2009 is not the necessary general solution, because scads of similar terms are still coming up tonight: A cause of measles is characteristic. A cause of blindness is disorder. A cause of malaria is objective. A cause of leukemia is defect. Terms like feature and characteristic are too general to make sense in any circumstances. Terms like disorder and defect carry just one bit of information: They are something unfortunate, and of course, you could say that most bad things like AIDS are caused by some defect or other. Most things, good or bad, can be said to be caused by a characteristic -- the thing someone would want to know is -- which characteristic of what? In the case of measles, the extracted sentence tells us that it's a characteristic of "immune priming" -- something the system should have and could have extracted instead of characteristic.

The simple logic to fix these problems is to have a list of such terms and never show them. That's not a project for an all-nighter or even a couple of months' work, but in two years, they should have come up with an exhaustive list -- the top such terms identify themselves pretty easily by being pervasive; they're vacuous because they're omnipresent -- but didn't. These vacuous terms are the less-numerous and less-glaring of the bugs that have lit up the blogosphere, but they stem from a clearly identifiable source of bad result that is easily fixed.
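The observation that vacuous terms "identify themselves" by being pervasive can be sketched as follows. This is not how healthBase works or worked; the function names, the 50% cutoff, and the toy result sets are assumptions invented for the example.

```python
from collections import Counter


def pervasive_terms(results_by_query, max_share=0.5):
    """Return result terms that show up for more than `max_share` of all queries.

    `results_by_query` maps each query string to the list of result terms
    returned for it. A term that is a "cause" of nearly everything is
    treated as vacuous.
    """
    query_count = len(results_by_query)
    if query_count == 0:
        return set()
    seen_in = Counter()
    for terms in results_by_query.values():
        # Count each term once per query, ignoring case.
        seen_in.update({term.lower() for term in terms})
    return {term for term, n in seen_in.items() if n / query_count > max_share}


def filter_results(terms, blocked):
    """Drop blocked terms while keeping the original ranking order."""
    return [term for term in terms if term.lower() not in blocked]


# Toy result sets, invented for illustration.
toy_results = {
    "causes of measles": ["characteristic", "virus"],
    "causes of blindness": ["disorder", "characteristic", "diabetes"],
    "causes of malaria": ["characteristic", "mosquito", "disorder"],
    "causes of leukemia": ["defect", "disorder", "radiation"],
}

blocked = pervasive_terms(toy_results)
print(sorted(blocked))
print(filter_results(toy_results["causes of measles"], blocked))
```

A statistical pass like this would only produce candidates; someone would still want to review the list by hand before blocking anything, since a rarer vacuous term like "defect" slips under the cutoff here.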

Safety in Numbers

The worst kind of errors are results that defy common sense -- statements that Jew and strong magnetic field are causes of AIDS, or that Rancho del Arroyo mares cause hookers. NetBase's very talented Jens Tellefsen correctly identified -- in part -- the root cause of one of these errors, a single sentence on Wikipedia (and echoed elsewhere on the web) that juxtaposed "Jews" with "aiding". To wit:

Hispano-Visigothic king Egica accuses the Jews of aiding the Muslims, and sentences all Jews to slavery.

Clearly, there was an error in the parsing. The obvious (to us) use of AIDS as a noun was confounded with the system parsing the verb aiding to its stem aid, and somewhere along the line, not seeing the difference between the two. Accusation indicates that the thing described is bad, so the parser concluded that Jews did something bad involving aid. We could go into greater detail, but the gist is that a noun-verb error was made between the parsing of that sentence and the interpretation of the pithy search term AIDS. Jens called the problem out, and drew an unfair backlash, including:

Personally, I think such basic distinctions should have been ironed out before launching the site.

and

I am sorry, but if you are purporting to be an intelligent search engine, you need to be show basic intelligence like being able to disambiguate from different meanings and tenses of words. You need to be able to identify parts of speech, especially if you are trying to find references to causes.

and

I hate to be pedestrian, but isn't that just a fancy way of saying it doesn't work?

All three of those rebuttals are misplaced: It's a simple fact that no Natural Language system will get the meanings and tenses (and, most to the point here: part of speech) correct 100% of the time. You don't iron out those distinctions, get a perfect parser, and then launch your product. Jens is quite right in asking the critics to excuse the occasional misparse.

However, this result is obviously a failure in the rubber-meets-the-road sense, and the critics miss the real problem: It's not that the parser could make such a mistake; it's that such a mistake was allowed to produce a result which was ranked fourth! And there's an easy fix. Whatever formula you use to rank results (sheer number of occurrences being a likely but mistaken candidate), the system should only allow a single specific sentence to count once no matter how many times it is repeated across the web. Once you accept that principle, don't display any results based on so few occurrences that the inevitable imperfections in parsing could allow such a result. And then -- problem solved!

The "Causes of aids" results claimed that there were 116 records retrieved, and the top twenty causes were displayed. If "top" had been expressed in terms of number of specific sentences that provided that particular result, then Jew would have gotten a score of 1. If a threshold of, say, 2, were applied as a minimum number of evidentiary sentences required for a result to be displayed, the single most lampooned result of this launch would have been avoided -- along with most of the other bad results. In several cases where I currently find unintentionally humorous results, a Google search reveals the single sentence that caused the problem. Now you'd only run into trouble if multiple writers had produced the same anomalous result with alternative phrasing -- set your threshold according to the precision of your parser and the size of the database, and you can make the probability of such errors as low as you'd like -- a very desirable choice of precision over recall.

Again, a simple fix, and one that I'd mentioned back in 2007, but that was never put into production at the time. But it's totally crucial to filtering automated Information Retrieval results and achieving reasonable quality. If a system allows results based on one misparsed sentence to go public, it is going to show crazy results.
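The two rules, count each distinct sentence once and require a minimum number of distinct sentences, can be sketched like this. It is a minimal illustration and not NetBase's ranking code; the function names, the normalization step, and the toy extractions are assumptions.

```python
import re
from collections import defaultdict


def normalize_sentence(sentence):
    """Lowercase and strip punctuation so copies of one sentence compare equal."""
    return re.sub(r"\W+", " ", sentence.lower()).strip()


def rank_results(extractions, min_evidence=2):
    """Rank result terms by the number of distinct supporting sentences.

    `extractions` is an iterable of (result_term, source_sentence) pairs,
    one pair per place on the web where the parser found the term.
    Terms with fewer than `min_evidence` distinct sentences are dropped.
    """
    evidence = defaultdict(set)
    for term, sentence in extractions:
        evidence[term].add(normalize_sentence(sentence))
    scored = [
        (term, len(sentences))
        for term, sentences in evidence.items()
        if len(sentences) >= min_evidence
    ]
    # Most evidence first; ties broken alphabetically for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy extractions, invented for illustration: one misparsed sentence that
# has been copied to three sites, and one result with two real sources.
toy = [
    ("virus", "HIV is the virus that causes AIDS."),
    ("virus", "AIDS is caused by a virus called HIV."),
    ("virus", "HIV is the virus that causes AIDS."),
    ("misparse", "One odd sentence that the parser got wrong."),
    ("misparse", "One odd sentence that the parser got wrong."),
    ("misparse", "one odd sentence that the parser got wrong"),
]

print(rank_results(toy))
print(rank_results(toy, min_evidence=1))
```

Raising `min_evidence` trades recall for precision, which is the choice argued for above.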

Pick Your Battles

When I found out that the domain of health and medicine was being used as a showcase for NetBase technology, I was very surprised, because back in 2005 and 2007, when we were seeking out venture capital, a key point was that the technology was meant to be generic and not about "vertical", single-domain searches. When Medstory launched, we had some internal discussion about it, and I made clear the point that medicine is an area where an ontology is almost sufficient for structuring searches in the way that Accelovation (as it was still called) used NLP to do from scratch.

To be more specific, the entities in medicine are usually confined to one semantic category from the following list: "Drugs and Substances", "Conditions", "Procedures", "People", etc. -- the main result "silos" for Medstory results. As a result, they don't need to understand the meanings of sentences -- if a drug is mentioned in a result on the search topic, it is automatically placed in the "Drugs and Substances" column with an almost 100% chance that it is not being misclassified (have you ever met a person named "Thiamine"?). As a result, information retrieval in medicine can be done quite well without semantic search, and in fact, semantic search is more likely to surface errors unless the results are filtered through the two face-saving mechanisms I mentioned above. If you wanted to trip up the ontology-based approach, you could do it by being clever: Back in 2007, I typed "vitamin B poisoning" into Medstory and it listed "Vitamin B" as a drug -- be careful with that advice. But the exceptions are few, so Medstory already had the problem solved better than healthBase really had a chance to do in 2009. (And the problem I mentioned has since been fixed, showing that Medstory's team has not been complacent.)
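For contrast, here is roughly what the ontology-only approach amounts to. This is a sketch under loud assumptions: it is not Medstory's implementation, and the ONTOLOGY table and function name are invented for the example. The silo names are the ones quoted above.

```python
import re

# Hypothetical miniature ontology: term -> result silo.
ONTOLOGY = {
    "thiamine": "Drugs and Substances",
    "vitamin b": "Drugs and Substances",
    "measles": "Conditions",
    "biopsy": "Procedures",
}


def classify_mentions(text, ontology=ONTOLOGY):
    """Place every ontology term found in the text into its silo.

    No sentence meaning is consulted: a term's silo depends only on
    the term itself, never on what the sentence says about it.
    """
    lowered = text.lower()
    found = {}
    for term, silo in ontology.items():
        # Whole-word match so short terms do not fire inside longer words.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found[term] = silo
    return found


print(classify_mentions("Thiamine was given before the biopsy."))
print(classify_mentions("vitamin B poisoning"))
```

The second call shows the blind spot described above: the lookup files "vitamin b" under drugs even though the query is about poisoning, because nothing in a pure lookup reads the rest of the phrase.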

If you picked a problem in a mechanical domain, you would immediately find that an ontology is not enough to perform that context-sensitive classification. For example, if you search healthBase now for "white noise", you find both pros ("calm baby") and cons ("corrupt measurement") -- impossible to do with an ontology alone.

A related problem with the choice of a specific domain is that the healthBase implementation has not narrowed the field of indexed data to medicine, so you may (as jokers have done) produce results that are amusing simply because they are so flagrantly not medical: I just searched for complications of Senator and the top result was "raise issue". This may actually point to some flaws in the IR besides those I've mentioned before, but stands out even more so for being non-medical.

Baby with the Bathwater

An earlier post on this blog was about the unfortunate recurrence of AI Winter, something which had hit the world pretty hard before I had made it out of my undergraduate work. To my dismay, the basic dynamic feeding it keeps on going, which is for the peddler of an AI-style technology (and NLP is certainly that) to over-market their own work, then live with the black mark on their reputation, and by extension, get people to write off the whole field. This is particularly damaging for those of us who make a living from it.

The reality of the healthBase technology is that it provides lots of useful results, and if you were earnestly seeking some background on a medical-related search -- no, not a replacement for actual medical advice -- you can get that information there. But, because at launch they had a system with far too many bad results, combined with the blogosphere's understandable joy at having a laugh, it turned into bad PR for the field as a whole. My dismay is all the greater for knowing that the company had already identified every one of these problems in 2007.

I'm completely certain that Information Retrieval will be a useful public-facing tool in the near future. I think NetBase had a window of opportunity to get there before anyone else, but in 2009, you can sense the Wolframs and the Bings and the Google Squareds circling the goal. The surest bet now is that whoever gets there first, they'll have an extremely short wait before there's company.

Monday, June 1, 2009

Platform

Language is for the most part sequential because, in a nutshell, we only have one mouth. (Sign language gets around this a bit, as do speakers of language, with changes in tone, facial expressions, gestures and so on.)

I've put together some of the basics for a platform for work in NLP. If anyone would like to collaborate on this as a true open-source project, I'd love to hear from you. Admittedly, there are some great packages out there like GATE and LingPipe, really nice guys who know their stuff and -- your NL Pundit loves nothing so much as this -- are upfront about the limitations as well as the strengths of their software. But I think there are some niches left unfilled for open source NLP. If you agree, drop me a line and maybe we can get a ball rolling.

Monday, April 13, 2009

Spell Check for POS Tagging

Language is inherently hierarchical. Letters make words; words make phrases; phrases make sentences; and, so on. NL processing tends to take place in stages corresponding to the levels of that hierarchy. It's a good paradigm; you can get good results by plugging state-of-the-art modules together, squeezing semantics out of natural language input.

It's also the case that NL processing is inherently error-ridden. Every step in the process can stumble or trip on the rampant ambiguities in language. Notice that I did not restrict that comment to discussing machine NL processing... people make mistakes, too. We say "um", we introduce sounds we didn't mean to, misspell words, mangle sentences and ideas midway. And we mishear, misread, get caught walking down (usually unintended) garden paths. Many mechanical text-processing systems are feed-forward, making errors at each step and then accumulating the errors into a growing number until by the end, perhaps 25% to 75% of the sentences are misunderstood.
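A back-of-the-envelope calculation shows how quickly feed-forward errors pile up. The sketch below assumes that errors are independent from token to token and from stage to stage, which is a simplification, and the accuracy figures are illustrative rather than measured.

```python
def sentence_accuracy(token_accuracy, sentence_length):
    """Chance that every token in a sentence is handled correctly,
    assuming each token is an independent trial."""
    return token_accuracy ** sentence_length


def pipeline_accuracy(stage_accuracies):
    """Chance that a sentence survives every stage of a feed-forward
    pipeline, assuming the stages fail independently."""
    result = 1.0
    for accuracy in stage_accuracies:
        result *= accuracy
    return result


# A tagger that is right on 96% of tokens, applied to a 20-token sentence.
print(round(sentence_accuracy(0.96, 20), 3))

# Three stages with hypothetical per-sentence accuracies.
print(round(pipeline_accuracy([0.96, 0.90, 0.90]), 4))
```

Under these assumptions, a per-token accuracy that sounds excellent still leaves more than half of 20-token sentences with at least one tagging error, which lands inside the 25% to 75% range mentioned above.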

People, however, often recover from misunderstandings, re-reading the text and rejumbling their thoughts until a correct parsing is found. The key is to use realizations (of error) at one level in processing to revise the previous level's work, which was flawed. Once an error is noted, the key is to determine where it was.

This weekend, I ran a POS-tagger on the output of an unrelated phrase chunker. I noticed such patterns as (and these are some of the bad ones only):

NP: DT VB
VP: JJ IN

Now obviously what had happened was that the second tagger made POS-tagging errors on phrases that the other system's chunker (and thus, its POS-tagger) had categorized correctly. "JJ IN" was a case where the real underlying form was "VB RP", and the first tagger got it right (allowing the chunker to get it right) but the second tagger got it wrong.

The key takeaway here is not that the first tagger is better! That may be true, or may not be true. The key observation, rather, is that when a POS tagger makes an error (and they all do), the prospect of chunking the sentence correctly is thereby doomed on that sentence. GIGO = "Garbage in, garbage out."

But consider the opportunity to recover. The fact of the matter is, "JJ IN" is a red flag that the tagger may have screwed up and that before committing to a chunking of those tokens, the system may want to reconsider the probabilities of those particular tags and see if a more agreeable chunking can result from different tagging.
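Here is a minimal sketch of that recovery step. It assumes the tagger can hand back an n-best list of tag sequences with probabilities for the tokens in a chunk, which not every tagger does; the function name, the SUSPICIOUS table, and the probabilities are all invented for the example.

```python
# (chunk label, tag sequence) pairs that should raise a red flag.
SUSPICIOUS = {
    ("NP", ("DT", "VB")),
    ("VP", ("JJ", "IN")),
}


def retag_chunk(label, nbest, suspicious=SUSPICIOUS):
    """Choose the most probable tag sequence that does not look anomalous.

    `nbest` is a list of (tag_sequence, probability) pairs for the tokens
    inside one chunk. If every alternative is flagged, fall back to the
    tagger's first choice rather than inventing a tagging.
    """
    if not nbest:
        raise ValueError("nbest must contain at least one tagging")
    ranked = sorted(nbest, key=lambda item: item[1], reverse=True)
    for tags, _probability in ranked:
        if (label, tuple(tags)) not in suspicious:
            return tuple(tags)
    return tuple(ranked[0][0])


# Hypothetical n-best output for a two-token verb phrase.
nbest = [
    (("JJ", "IN"), 0.40),
    (("VB", "RP"), 0.35),
    (("VB", "IN"), 0.25),
]

print(retag_chunk("VP", nbest))
```

The tagger's first choice is only overridden when the chunk-level evidence objects to it, so a tagging that raises no flag passes through untouched.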

This is the heart of the top-down/bottom-up manner of processing which is known to be powerful in pattern recognition. Interactive Activation in particular is a demonstration of just how right this kind of processing is.

This amounts to incorporating context in the tagging of a particular word, which is something that any competent tagger already does. (Hidden Markov, by looking at the context to the left of a token; a transformation-based tagger by rules with conditions that look at neighboring tokens.) But what people do, and NL systems need to do, is to revisit the processing done by one module when the next module detects anomalies. A simple corollary to GIGO... when you find garbage coming out, you know that garbage came in. Find it and fix it, and you've got a better, more robust system.

Monday, February 25, 2008

I Fought the Law...

Historically, many of the major advances in science have involved the coining of a "Law". A good law is pithy, describes the world in a way that enables applied use, and tells people who are strong in mathematics something about the nature of the world that would make the law fit the equation. For example, the inverse square law governing the apparent brightness of a star follows neatly from the fact that the discrete packets of light streaming outward from it spread out through successively larger spheres. Einstein's famous E=mc² tells you that there is a universal speed limit equal to lightspeed.

Stepping back from the methods and techniques that have been used in NLP, we can hope for a law that describes the progress that has been made as researchers from industry and academia pursue better solutions to the hard problems. If you listen to Ray Kurzweil, you'd conclude that progress is going to approach a vertical asymptote. In other words, things are going to improve so much that the future will be off the charts -- things, sooner or later, are going to become infinitely good. Maybe quite soon. If you think that's true, you should invest heavily in AI-style technologies now!

The history of NLP, however, as well as that of some other fields of AI, has produced repeated evidence for a different law governing progress. A lot of the cycle of "hype, then winter" comes from people failing to recognize this law. The law that the history of NLP describes is quite different from Kurzweil's rose-colored vision. In fact, it's at right angles to it. Rather than the state of the art approaching a vertical asymptote (infinite goodness soon!), it has been approaching a horizontal asymptote (progress is all done!).

Obviously, these two worldviews could not be more opposed. But the facts clearly point to which one is valid. Let's take some examples.

An important task in NLP is called Part of Speech Tagging. It consists of marking all of the words in a piece of text with their grammatical category. Not many people want to do this for its own sake, but it's a highly useful initial step in analyzing text more deeply. The first effort to do this electronically was the work on the Brown Corpus by Greene and Rubin in 1971. They achieved, with a very simple approach, performance of 70%. Not too bad for a first try. By the early 1980s, researchers had pushed this number way up, with a system called CLAWS achieving about 94%, meaning that we'd done away with about four fifths of the errors made by Greene and Rubin's approach. Big progress! But in the 25 years since then, progress has been extremely minor, perhaps to 96%. In fact, there is good reason to believe that tagging better than 97% is impossible, because even human annotators tagging a text do not agree more than that often. Moreover, some very different approaches to tagging all yield similar accuracy rates, just shy of that theoretical maximum. When much more progress happens in the first decade than in the succeeding quarter century, you have found yourself a horizontal asymptote.
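The "four fifths" figure is easy to check, and the same arithmetic makes the flattening visible. The function name is made up for the example; the accuracy figures are the approximate ones quoted above.

```python
def error_reduction(old_accuracy, new_accuracy):
    """Fraction of the old system's errors that the new system removes."""
    old_error = 1.0 - old_accuracy
    new_error = 1.0 - new_accuracy
    if old_error == 0:
        raise ValueError("the old system makes no errors to reduce")
    return (old_error - new_error) / old_error


# From 70% to about 94%: the first decade or so of tagging work.
print(round(error_reduction(0.70, 0.94), 2))

# From about 94% to about 96%: the following quarter century.
print(round(error_reduction(0.94, 0.96), 2))
```

The first step removes about 80% of the errors and the second only about a third of what remained, over a much longer span of time.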

As it happens, 96% is pretty good, and a tagger that performs that well is highly useful, errors be damned. And of course, no one would dispute that a horizontal asymptote for accuracy must exist for any task, because you can't beat 100%. So in principle, this isn't a problem.

But in practice, it is! Because while Part of Speech Tagging is ready for application, some other very important tasks in NLP are not. Suppose the asymptote were not at 96%, but much lower. That's what you have with another highly interesting problem called Word Sense Disambiguation -- the determination of which meaning of a word a writer had in mind. (For example, "bank" like the side of a river, or "bank" like a financial institution.) Here, though the numbers vary considerably according to how you measure accuracy, it is clear that there are no excellent methods available. WSD has not advanced dramatically in the half century since it became an area of interest, and the approaches that exist are tailor-made to disappoint the people who understandably would like a good solution to this problem.

A very big problem for NLP as an enterprise is the cycle of hype, disappointment, and mistrust that I identified a few weeks ago. And a very big part of the hype end of the problem is that people who are the recipients of hype (customer bases, venture capitalists, managers, etc.) don't know which law has been in effect. Innovation does exist. Progress does happen. But before an idea becomes a software project, anyone reviewing it should determine if the NLP needed to make the approach successful has already hit a horizontal asymptote.