2013/08/19

Tips to build a corpus for statistical classification faster

Building a corpus for statistical classification is usually a tedious task. I will share a few ideas here to help you get started and speed things up.

As is usual with any optimization work: the fastest way to do something is not to do it at all.
Most of the time, the task has already been tackled by someone else, and with a bit of luck, they built a corpus and made it available. The odds may seem slim, but it happens more often than you would expect.
So the first step is to fire up your favorite search engine, try different keywords, and look for an existing corpus. If you can't find one, look harder and change your keywords.

A ready-made corpus is not always found where you would first think to look. One classic example, used for training statistical machine translation engines, is the United Nations speeches: every speech is translated into many languages (and a lot of speeches are made there). A variant is the proceedings of the European Parliament, which are also translated into several languages. You get the idea.
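
If you just want to poke at one of these ready-made corpora, NLTK happens to ship a small sample of the European Parliament proceedings. Here is a minimal sketch, assuming NLTK is installed and its europarl_raw data package is available for download:

    # Peek at the Europarl sample distributed with NLTK's data packages
    # (assumes `pip install nltk`; the download below is a one-off).
    import nltk

    nltk.download("europarl_raw")
    from nltk.corpus import europarl_raw

    # First few tokens of the English side of the proceedings.
    print(europarl_raw.english.words()[:20])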

A "new" trove of human knowledge is Wikipedia. This may seem obvious, but until recently it was barely usable, as it was poorly structured, if structured at all. This has changed: though still minimalistic in many ways, the "infobox" tag, for instance, is gathering more and more useful information.
It is reserved for those with some experience in coping with size, though (or a lot of patience), as the English Wikipedia dump is, as of this writing, 8.3G bzipped (something above 40G uncompressed). (Mahout has a nice ready-made splitter for the Wikipedia dump; here is a quick introduction to it. Having a smaller set is really useful for debugging, for instance.)
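
If Mahout is not part of your toolbox, a few lines of Python are enough to carve a small debugging sample out of the dump without decompressing it to disk. A minimal sketch; the file names are hypothetical and the <mediawiki> wrapper is just enough to keep XML parsers happy:

    # Stream the bzipped Wikipedia XML dump and keep only the first N <page>
    # elements, producing a small file to debug against. The header (siteinfo)
    # lines are dropped, which is fine for a quick test set.
    import bz2

    DUMP = "enwiki-latest-pages-articles.xml.bz2"   # hypothetical path to your dump
    SAMPLE = "enwiki-sample.xml"
    N_PAGES = 1000

    def take_first_pages(dump_path, out_path, n_pages):
        kept = 0
        inside_page = False
        with bz2.open(dump_path, "rt", encoding="utf-8") as src, \
             open(out_path, "w", encoding="utf-8") as dst:
            dst.write("<mediawiki>\n")              # minimal wrapper element
            for line in src:
                if "<page>" in line:
                    inside_page = True
                if inside_page:
                    dst.write(line)
                if "</page>" in line:
                    inside_page = False
                    kept += 1
                    if kept >= n_pages:
                        break
            dst.write("</mediawiki>\n")

    take_first_pages(DUMP, SAMPLE, N_PAGES)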

Now, if that's not enough to collect the bulk of human-generated data you need, things get tricky.
The usual, not-so-subtle path then looks something like this:
    - write a quick viewer that lets you classify content easily
    - befriend grep and pcre
    - build a set of "good" keywords (a pre-filter along these lines is sketched after this list)
    - be even more creative
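
To make the grep/keyword step concrete, here is a minimal sketch of such a pre-filter; the keywords and directory names are made up, so adapt them to your own data:

    # Keyword/regex pre-filter: copy every text file matching a set of "good"
    # keywords into a candidates folder, to be reviewed and labelled by hand.
    import re
    import shutil
    from pathlib import Path

    # Hypothetical keywords for a hypothetical "invoices" class.
    KEYWORDS = re.compile(r"\b(invoice|receipt|total due)\b", re.IGNORECASE)

    def prefilter(src_dir, candidates_dir):
        out = Path(candidates_dir)
        out.mkdir(parents=True, exist_ok=True)
        kept = 0
        for path in Path(src_dir).glob("*.txt"):
            text = path.read_text(encoding="utf-8", errors="ignore")
            if KEYWORDS.search(text):
                shutil.copy(path, out / path.name)
                kept += 1
        return kept

    print(prefilter("raw_documents", "candidates"), "candidate files kept")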

There's also the possibility of seeding your classification engine with a little data, running it on some more data, checking which positives are not in your corpus yet, and adding them.
Used naively, that is _not_ a good idea, as it will bias your corpus. However, this can be mitigated by ignoring the best-matching data and targeting the "more or less" matching data, according to your classifier. You'd better check the maths though!
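
To make that mitigation concrete, here is a minimal sketch of the idea, assuming scikit-learn: train on the hand-labelled seed, then surface only the documents whose score sits near the decision boundary rather than the top matches. The seed/pool data and the 0.4-0.6 band are placeholders:

    # Bootstrap carefully: score the unlabelled pool with a classifier trained
    # on the seed set, skip the near-certain matches (they mostly mirror the
    # seed and would reinforce its bias) and review the ambiguous ones by hand.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    seed_texts = ["first labelled document", "second labelled document"]    # placeholder
    seed_labels = [1, 0]                                                    # placeholder
    pool_texts = ["unlabelled document one", "unlabelled document two"]     # placeholder

    vectorizer = TfidfVectorizer()
    clf = LogisticRegression().fit(vectorizer.fit_transform(seed_texts), seed_labels)

    # Probability of the positive class for each unlabelled document.
    proba = clf.predict_proba(vectorizer.transform(pool_texts))[:, 1]

    for text, p in zip(pool_texts, proba):
        if 0.4 <= p <= 0.6:                 # arbitrary "uncertain" band
            print("review by hand:", text[:80])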

Have more tips? Share them in the comments :)