2007/08/15

Choosing a library

Each time I want to add a new feature to the search engine, I try to look at what already exists out there.

I usually find a few libraries close enough to my needs, and I tell myself it will take less time to learn their APIs and integrate them than to write my own.

So far, contrary to popular belief, I'm not sure this strategy has really paid off. I will give a few examples of where and how it failed to "work for me". The time it took me to realize a library wouldn't fit my needs varied greatly, as did the reasons I finally did not use it (or did).

- wget: Although not strictly speaking a library, I considered wget for a short while to avoid writing my own spider. The prospect of having millions of directories stopped me early enough. The time it took me to evaluate that solution was the time it took me to shut out bad advice.
I could not see a proper way to make it reasonably useful.

- RobotFileParser in Python: Prototype 1 had a spider written in Python; reading the documentation did not put me at ease: the robots class has no cache. I could pickle it, true, but that would soon have trouble fitting in memory. I could also fetch robots.txt every time, but obviously the fewer files to fetch, the happier one is, considering that the robots file is supposed to be read only once a week.
This was not the behaviour I expected, and it would soon enough have created memory trouble. What I wanted is roughly the per-host cache sketched below.
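
For the record, here is a minimal sketch in C of the caching behaviour I had in mind: one robots.txt per host, refreshed roughly weekly. Everything here is hypothetical and for illustration only; fetch_robots_txt() stands in for the real HTTP fetch, and a real crawler would swap the linear list for a hash table with an eviction policy to keep memory bounded.

    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define ROBOTS_TTL (7 * 24 * 3600)  /* re-read robots.txt about once a week */

    struct robots_entry {
        char *host;
        char *rules;                    /* raw robots.txt body; parsing left out */
        time_t fetched;
        struct robots_entry *next;
    };

    static struct robots_entry *robots_cache;

    /* placeholder for the real fetch of http://host/robots.txt */
    extern char *fetch_robots_txt(const char *host);

    const char *robots_rules_for(const char *host)
    {
        struct robots_entry *e;

        for (e = robots_cache; e != NULL; e = e->next) {
            if (strcmp(e->host, host) != 0)
                continue;
            if (time(NULL) - e->fetched >= ROBOTS_TTL) {
                free(e->rules);         /* stale: refresh in place */
                e->rules = fetch_robots_txt(host);
                e->fetched = time(NULL);
            }
            return e->rules;            /* fresh: no network hit */
        }

        e = malloc(sizeof *e);          /* first visit to this host */
        e->host = strdup(host);
        e->rules = fetch_robots_txt(host);
        e->fetched = time(NULL);
        e->next = robots_cache;
        robots_cache = e;
        return e->rules;
    }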

- libots: I found this library while browsing freshmeat one day. I tried it from the command line; it seemed like a good idea and appeared to work well. Then I wrote a module using this library. After a few thousand pages, bang, a segfault somewhere in glib. Now, given what I think of glib, and given that this libots thing wanted to impose its lousy license on my software, I forgot about it.
True, it might have been me doing something wrong with the library. But given that it crashed while loading its own dictionary, after parsing a thousand pages, I didn't feel that bad about myself.
My conclusion there would be that a third-party library has to live up to certain standards, and the library you find round the corner won't necessarily meet them.

- libcurl: With libcurl, things get more subtle. It is a broadly used and very well tested library. But it has its occasional segfaults in various contexts. To quote one: threads. This library doesn't seem to have been designed from the ground up for threads. You have to carefully read the little lines at the end of the documentation to find out that, if you wish to handle SSL from several threads, there is some weird code to add (a sketch follows this entry). I'm saying this because I cannot really see a reason why it's not already in the lib (be it libcurl or OpenSSL). And no, it does not prevent libcurl from segfaulting.
True, writing such a library is a huge amount of work; but I spent so much time trying to find the reasons for its segfaults that I wonder whether I wouldn't have been better off writing my own.
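
For the curious, the "weird code" is the OpenSSL thread setup the libcurl documentation points to: a multithreaded program must install locking and thread-id callbacks before any thread touches SSL. A minimal sketch, error handling omitted:

    #include <stdlib.h>
    #include <pthread.h>
    #include <openssl/crypto.h>

    static pthread_mutex_t *ssl_locks;

    /* OpenSSL calls this to take or release lock number n */
    static void ssl_lock_cb(int mode, int n, const char *file, int line)
    {
        (void)file;
        (void)line;
        if (mode & CRYPTO_LOCK)
            pthread_mutex_lock(&ssl_locks[n]);
        else
            pthread_mutex_unlock(&ssl_locks[n]);
    }

    /* OpenSSL calls this to identify the current thread */
    static unsigned long ssl_id_cb(void)
    {
        return (unsigned long)pthread_self();
    }

    /* call once, before spawning any thread that uses libcurl with SSL */
    void ssl_locks_init(void)
    {
        int i, n = CRYPTO_num_locks();

        ssl_locks = malloc(n * sizeof(pthread_mutex_t));
        for (i = 0; i < n; i++)
            pthread_mutex_init(&ssl_locks[i], NULL);
        CRYPTO_set_id_callback(ssl_id_cb);
        CRYPTO_set_locking_callback(ssl_lock_cb);
    }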

- heritrix: This is a crawler I discovered recently, and I said to myself, well, these guys must know what they are doing. It's in Java. It has a nice interface, though a rather cryptic one if you didn't write the software yourself. I gave it a try. I couldn't make it work (maybe Java on FreeBSD/amd64 is the reason, but still...), and it doesn't seem to fit my needs anyway.
This is a good answer to the common question: "but how on earth did you not know about that thing?".

- nspr: This might be the best piece of software I've stumbled upon. True, I don't use it very much, but it always does its job properly.
However, fcgi did not like being linked to it at all (a piece of information that might save you a few hours of wondering why you can only have one fcgi process running).

- fcgi: This might be the worst code I've ever seen. I discovered things I couldn't imagine could be done in C.
It seems to work, though, rather efficiently (I wonder how), and it gets the job done (modulo side effects of the rather poor coding mentioned above).

So what seems to be the pattern here? Cutting-edge libraries should usually be avoided; a widely used library may still have its bugs; a library that has been around for a long time may well have its uses; mammoths are mammoths; and some guys will always know about the perfect-library-you-did-not-know-about (which is, most of the time, useless).

The trick is being able to evaluate, fast enough, the time it will take to integrate a library and to fix the bugs you hit along the way, against the time it would take to write your own.

The best strategy I've found so far is to keep these examples in mind, work out which category a new library falls into, and act accordingly.
