2013/07/09

Building a corpus for classification: size isn't always the only problem

Building a meaningful corpus for classifying documents is a tedious task. Building a few is even more tedious.

That's an obvious fact.

The trouble is that, as with anything involving language, ambiguity eventually takes over and corners you.

Let's say, for instance, that we are trying to classify newspaper articles into three categories: "business", "energy" and "environment".
That seems simple enough. The next article, about Lady Gaga, is out. The next article, on coal prices, is in. It's in. Right. It falls in the business section, as it's going to influence energy prices. Wait, energy prices: that should also be in energy then, right? But won't coal prices influence the environment too, as renewables will be less competitive?
Wait, once again, wouldn't classifying this in environment be biased? What about global warming naysayers?
What began as a simple, trivial classification ends up in a mess. The multiple-section problem is solved quickly: use that article as a reference for all three categories, never mind the confusion this may cause the computer, which has to decide which category each evaluated article falls into. But what about the implicit point of view in this simple example, and its potential bias?
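
To make the "reference for all three categories" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the article titles, the labels, and the helper names (`references_by_category`, `shared_references`) are hypothetical, not taken from a real corpus or library. It only shows how one article can be stored with several labels, and how quickly the categories start sharing the same references.

```python
from collections import defaultdict

# Hypothetical, hand-labeled examples: titles and labels are illustrative only.
CATEGORIES = ("business", "energy", "environment")

corpus = [
    ("Coal prices fall to a three-year low", {"business", "energy", "environment"}),
    ("Quarterly earnings beat expectations", {"business"}),
    ("New wind farm connects to the grid", {"energy", "environment"}),
    ("Lady Gaga announces a new tour", set()),  # out of scope: no category at all
]


def references_by_category(labeled_docs):
    """Group reference texts by category; a text with several labels is repeated."""
    refs = defaultdict(list)
    for text, labels in labeled_docs:
        for label in labels:
            refs[label].append(text)
    return refs


def shared_references(labeled_docs):
    """Return the texts that serve as a reference for more than one category."""
    return [text for text, labels in labeled_docs if len(labels) > 1]


refs = references_by_category(corpus)
for category in CATEGORIES:
    print(category, len(refs[category]))

# These are the references most likely to blur the boundaries between categories.
print(shared_references(corpus))
```

Keeping the labels as a set per article, rather than copying the article into three separate folders, at least makes the overlap visible and countable. It does nothing, of course, about the bias hidden in the choice of labels.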

If that seems too obvious (and in some ways, it is), let's try another one: identifying whether a search string refers to a person or to something else. It's easy to define a person, right?
Are Greek gods persons, for instance? One can find pictures of them, they were treated as beings, and they interacted quite intimately with human beings. They could be persons, then.
Let's say gods are people. Then what about Allah, who isn't really a person, but is definitely a God to many people?
There's also the more trivial case of movie and TV characters: should they be classified as beings, or...?

Once again, the brain, with its seemingly endless capacity for confusion, easily puzzles anyone trying to teach the computer a thing or two.
