NLP Confidential: Inflection Predilection

Here's a lightweight, fun, factoid-level look at a lot of different languages.

We often hear that some languages are "heavily inflected". You can glance at tables of verb conjugations and noun declensions to see the brutal details: Russian surely does have a lot of declensions to memorize, just as Spanish does with verb conjugations. But that's all theory and not practice -- lots of those forms are hardly ever used. I lived in Italy and started making a mental note of how often I heard (or had to use) the second-person plural form of the future tense: Not often!

One way of quantifying how "heavily" inflected a language is comes out of some statistical work I was doing for a more practical purpose. For each of sixteen languages, I processed corpora of news articles, with total token counts for each language ranging from about 100,000 to 3 million. Based on this, I generated two ways of counting tokens: first, the absolute count of all tokens, repeats included (in other words, the total size of the corpus); second, the sum of the unique token counts of each article (in other words, counting each word just once for each article it occurs in).
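A minimal sketch of the two counts, assuming each article has already been split into a list of tokens. The function name `token_counts`, the whitespace tokenization, and the two sample sentences are illustrative assumptions, not the pipeline actually used for the corpora.

```python
def token_counts(articles):
    """Return (total tokens, sum of per-article unique tokens).

    `articles` is a list of token lists, one list per article.
    """
    # Count 1: every token, repeats included (the size of the corpus).
    total_tokens = sum(len(tokens) for tokens in articles)
    # Count 2: each distinct token counted once per article it occurs in.
    per_article_unique = sum(len(set(tokens)) for tokens in articles)
    return total_tokens, per_article_unique


# Two made-up "articles", tokenized on whitespace for simplicity.
articles = [
    "the minister said the budget is final".split(),
    "the court said the ruling said nothing new".split(),
]
total, unique_sum = token_counts(articles)
print(total, unique_sum)  # 15 12
```

Note that "the" and "said" are counted once in each article for the second count, so a word that appears in many articles still contributes once per article.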

If each word occurred at most once per article, these two counts would be identical. The ratio between the two counts thus expresses how much a language tends to repeat words. There are a number of factors that can determine the degree of repetition, but many of these should be more or less equally true of all languages across the news corpora. Inflectional morphology, however, varies greatly from one language to another.

English would use the form "said" in the past tense regardless of the person and number of the verb, but would use "say" or "says" for the present tense. Spanish could use "dijeron", "dijo", "dije", etc. for the past, and "dicen", "dice", "digo", etc. for the present. Meanwhile, Chinese would use the character "说" regardless of person, number, or tense. In this example, Spanish could have a relatively small discrepancy between the number of "SAY" tokens in an article and the number of distinct "SAY" forms in the article. English would have a larger discrepancy, with only three forms likely to occur. Chinese would have only one form regardless of any other factors, so the ratio of total "SAY" tokens to the count of one for that single form could be quite a bit larger. So "more inflected" languages should have lower ratios between the two counts, while "uninflected" languages should have higher ratios.
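To make the "SAY" example concrete, here is a small sketch that takes the ratio as total tokens divided by the per-article unique count, which is the direction the reasoning above implies. The six-token samples are invented for illustration and `repetition_ratio` is a hypothetical helper name; nothing here comes from the actual corpora.

```python
def repetition_ratio(articles):
    """Total tokens divided by the sum of per-article unique tokens."""
    total = sum(len(tokens) for tokens in articles)
    unique_sum = sum(len(set(tokens)) for tokens in articles)
    return total / unique_sum


# One imaginary article per language in which SAY occurs six times.
say_tokens = {
    "Spanish": ["dijo", "dijeron", "dijo", "dije", "dicen", "dice"],
    "English": ["said", "said", "said", "said", "say", "says"],
    "Chinese": ["说"] * 6,
}

ratios = {lang: repetition_ratio([tokens]) for lang, tokens in say_tokens.items()}
for lang, ratio in ratios.items():
    print(f"{lang}: {ratio:.2f}")
```

With these toy inputs, Spanish has five distinct forms among six tokens, English three, and Chinese one, so the ratio rises in that order.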

Behold, we see the basic truths upheld. Finnish is the most inflected language in this sample, and Chinese the least. English is the least inflected European language. Among the branches of the Indo-European family in the sample, the Slavic branch is the most inflected, with Germanic (besides the mongrel English) next-most, and Romance least.

A bit of analysis: Finnish and Turkish are agglutinative, allowing affixes to be stacked combinatorially and generating vast numbers of forms. Arabic manifests similar complexity in a slightly different way, with active derivational morphology combining with a respectable number of verb inflections.

Why do the Romance languages, with their vast verb conjugation tables, look so uninflected? Almost certainly, the fact that this is a news corpus plays a big role. News favors the past tense and the third person, restricting verb conjugation more than, say, a corpus of spoken conversations would. The main complication of the Slavic languages, noun declension, is no more restricted in news than it would be in other contexts: nouns can still be subjects, objects, instruments, etc. Germanic languages, with their modest declension schemes, rank between the Slavic languages and the undeclined Romance languages. English, which came out of the Norman conquest on a path to eliminate as much as possible of the complexity of its Anglo-Saxon and Old French origins, ended up the least-inflected language in the region. Chinese is famously "uninflected", although I have to admit a procedural bias here: tokenizing Chinese into single characters guarantees that it occupies the extreme position; if I were tokenizing it into words (taking some side or another in the debate concerning what is a word in Chinese), the counts would be different, although its rank as the least-inflected language in the bunch would not.
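The tokenization caveat for Chinese can be illustrated with a toy comparison. The hand-segmented word list below is a hypothetical example (and takes one particular side in the what-is-a-word debate), and `repetition_ratio` is an illustrative helper, so this shows only that the two tokenizations yield different counts, not what the corpus figures would be.

```python
def repetition_ratio(articles):
    """Total tokens divided by the sum of per-article unique tokens."""
    total = sum(len(tokens) for tokens in articles)
    unique_sum = sum(len(set(tokens)) for tokens in articles)
    return total / unique_sum


# Hypothetical hand-segmented sample; real word segmentation is debatable.
words = ["中国", "政府", "说", "中国", "经济"]
# Character tokenization: every character becomes its own token.
chars = [ch for word in words for ch in word]

word_ratio = repetition_ratio([words])  # 5 tokens, 4 distinct words
char_ratio = repetition_ratio([chars])  # 9 tokens, 7 distinct characters
print(f"words: {word_ratio:.3f}  characters: {char_ratio:.3f}")
```

In this tiny sample the character-level ratio comes out slightly higher than the word-level one, but a single sentence says nothing about how large the gap would be across a full corpus.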

One could have quite a bit of fun explaining all of the nuances in the ranking: Why are Danish and Norwegian so far apart? Because Norwegian has more zero-ending plurals? What about Spanish and Italian? Is it some nonmorphological reason?

Overall, the interesting thing here was seeing a numeric basis for an observation that is made so readily about which languages are "heavily" inflected. Where qualitative truths exist, quantitative breakdowns can show them.

Thursday, February 25, 2010
