2007/08/26

building a spider in python

This was the first set of keywords that someone in the US typed to reach my blog. He must have been disappointed: there isn't much here about writing a spider in Python, except that there is a useful class which wasn't that useful to me. Maybe he didn't care to know more than that, maybe he did.

Prototype 1 had its spider written in Python.

It is quite easy to do: the httplib module is easy to use, it has all the options one might need (or at least all those I needed), I couldn't find any bugs in it, and its use is straightforward.

It gets the job done and leaves the developer free to focus on what to do with the data.
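For the curious, a minimal fetch with httplib looks roughly like this (Python 2 era; the host, path and user-agent string are placeholders, and a real spider would add timeouts, redirects and error handling):

    import httplib

    def fetch(host, path="/"):
        # Open a plain HTTP connection and issue a GET request.
        conn = httplib.HTTPConnection(host)
        conn.request("GET", path, headers={"User-Agent": "my-spider/0.1"})
        resp = conn.getresponse()
        # Status, headers and body are each one call away.
        body = resp.read()
        conn.close()
        return resp.status, resp.getheader("Content-Type"), body

    status, content_type, body = fetch("example.com", "/")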

If you are considering Python for writing a spider, go for it. And if you don't trust my word (yet), you will find a good number of major search engines using Python for that component.

2007/08/24

Even fortune said so

43rd Law of Computing:
Anything that can go wr
fortune: Segmentation violation -- Core dumped

2007/08/22

Garbage collectors and memory

Is a garbage collector, as usually found in interpreted languages, enough to let you stop worrying about memory?

The tempting and common answer is yes: the garbage collector is there to take charge of this; it exists exactly to solve that problem.

However, when writing software that needs to be speed and memory efficient (and it's really easy to write a program that quickly needs a bit of tuning; no rocket launch has to be involved), it is usually a good idea to keep memory allocation in mind: how many scattered objects, small and big, are being dealt with during the program's execution, how often new/delete happens, and so on. It's rather easy to write a beast that needs gigabytes of memory, or one that allocates and frees gigabytes of memory every few seconds, or code where the garbage collector never gets a chance to work.

Here are a few examples of memory going wrong:
- Functions that are not tail recursive in Erlang will never give the garbage collector a chance to free the memory
- Creating and destroying objects in Java to process, say, high-frequency requests will make the garbage collector allocate and free huge chunks of memory
- Arrays that can get huge in a PHP command-line script, that language being usually used to handle short-lived requests
- SQL result sets that get bigger over time (and eventually don't even fit in memory); a sketch follows below
- Add your own...

Java programmers beware: I've seen that example happen more than once.
I had a good laugh. The author of the program did not.
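As a small illustration of the result-set case from the list above, here is the difference, in Python, between materializing everything and streaming it. The database file and table are made up for the example; the same idea applies to any language and any database driver.

    import sqlite3

    conn = sqlite3.connect("urls.db")  # made-up database, for illustration only
    cur = conn.cursor()

    # Memory-hungry: the whole result set is built as one big list in RAM,
    # and that list only gets bigger as the table grows.
    rows = cur.execute("SELECT url FROM documents").fetchall()
    for (url,) in rows:
        pass  # process url

    # Friendlier: rows are pulled a few at a time, so memory use stays flat
    # no matter how large the table becomes.
    for (url,) in cur.execute("SELECT url FROM documents"):
        pass  # process url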

Memory leaks

Let's consider memory leaks in the document parser.

They are usually easy to spot: the memory behaviour goes wrong, and much more memory is used than there should be. The good thing is that this usually happens fast, for at least two reasons: time and code coverage. Whereas some software will only show its memory leaks after a few days, a parser shows them within minutes: the number of documents to process is high over a short span of time. Leaks are also easy to see because of code coverage: if anything can trigger a memory leak, it will.

When these memory leaks happen, they are either obvious or really hard to track down. Either way, it consumes time.

What could be a solution to this problem, then?

The algorithm/methodology I've used, and which has worked quite well so far, I will call the "door algorithm": when you open a door, you close it behind yourself.
With memory, it's the same reflex: allocate/free. Be a good boy, or a good girl. Also, "standardize" where memory gets allocated and freed: allocate as much as possible in the "constructor", free it all in the "destructor" (quotes applying if you are not using an object-oriented language).
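As a rough Python sketch of the shape this takes (the parser and its dictionary file are invented for the example; in C the same discipline maps to one allocation site and one free site per resource):

    class DocumentParser(object):
        """Every resource is acquired in one place and released in one place."""

        def __init__(self, dictionary_path):
            # The "constructor": all the doors we open.
            self.dictionary = open(dictionary_path)   # hypothetical resource
            self.tokens = []

        def parse(self, raw_document):
            # Work only with what was set up above; no allocations hidden
            # in helper functions scattered across the code.
            del self.tokens[:]
            self.tokens.extend(raw_document.split())
            return len(self.tokens)

        def close(self):
            # The "destructor": every door opened above is closed here.
            self.dictionary.close()
            self.tokens = None

    parser = DocumentParser("words.txt")   # hypothetical dictionary file
    try:
        parser.parse("<html><body>hello world</body></html>")
    finally:
        parser.close()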

Wouldn't it be more efficient and less error-prone to rely only on tools such as valgrind?
Well, the two methods, I think, must walk hand in hand: these tools do a good job at tracking leaks, but they can give false positives and miss leaks too. We talked about Murphy's law before.
And, above all, I think these tools must be used as a parachute, not as the only programming strategy (with respect to debugging time, to mention the least).

2007/08/15

Choosing a library

Each time I want to add a new feature to the search engine, I try to look at what already exists out there.

I usually find a few libraries close enough to my needs, thinking it will take less time to learn their API and integrate them than to write my own.

So far, contrary to popular belief, I'm not sure this strategy has really paid off. I will give a few examples of where and how it failed to "work for me". The time it took me to realize a library wouldn't fit my needs varied greatly, as did the reasons I finally did not use it, or did.

- wget: Although not strictly speaking a library, I thought about wget for a short time to avoid writing my own spider. The prospect of having millions of directories stopped me early enough. The time it took me to evaluate that solution was the time it took me to shut out bad advice.
I could not see a proper way to make it reasonably useful.

- RobotFileParser in Python: Prototype 1 had a spider written in Python; reading the documentation did not make me feel easy: the robots class has no cache. I could pickle it, true, but that would soon have trouble fitting in memory. I could also fetch robots.txt every time, but, obviously, the fewer files to fetch, the happier one is, considering that the robots file is supposed to be read only once a week.
This was not the behaviour I expected, and it would soon enough have created memory troubles (a cache wrapper along the lines of what I expected is sketched after this list).

- libots: I found this library while browsing freshmeat one day. I tried it using the command line. It seemed a good idea, and seemed to work well. Then I wrote a module using this library. After a few thousand pages, bang, a segfault somewhere in glib. Now, given what I think of glib, and that this libots thing wanted to impose its lousy license on my software, I forgot about it.
True, it might have been me doing something wrong with the library. Given that it crashed when loading its dictionary, after parsing a thousand pages, I didn't feel that bad about myself.
My conclusion there would be that a third-party library has to live up to certain standards, standards that aren't necessarily met by the library you find around the corner.

- libcurl: With libcurl, things get more subtle. It is a broadly used and very well tested library. But it has its occasional segfaults in various contexts. To name one: threads. This library doesn't seem to have been designed from the ground up for threads. One has to carefully read the little lines at the end of the documentation to find that, if one wishes to handle SSL, there is some weird code to add. I'm saying this because I cannot really see a reason why it's not already in the lib (be it libcurl or OpenSSL). And no, it does not prevent libcurl from segfaulting.
Although writing such a library is a huge amount of work, I spent so much time trying to find the reason for a segfault that it makes me wonder whether I wouldn't have been better off writing my own library.

- heritrix: This is a crawler I discovered recently, and I said to myself, well, these guys must know what they are doing. It's in Java. It has a nice interface, rather cryptic if you didn't write the software. I gave it a try. I couldn't make it work (maybe Java on FreeBSD/amd64 is the reason, but still...), and it doesn't seem to fit my needs.
This is a good answer to the common question "but how on earth didn't you know about that thing?".

- nspr: This might be the best piece of software I've stumbled upon. True, I don't use it very much, but it always does its job properly.
However, fcgi did not like being linked to it at all (in case that information saves you a few hours of wondering why you can only have one fcgi process running).

- fcgi: This might be the worst code I've ever seen. I discovered things I couldn't imagine could be done in C.
Seems to work though, rather efficiently (I wonder how), and gets the job done (modulo side effects due to its rather poor coding mentioned above).
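Back to the RobotFileParser point above, the cache wrapper I had in mind is roughly this (a sketch only: the one-week expiry mirrors the once-a-week rule, and it still needs an eviction policy to keep the cache itself from growing without bound):

    import time
    import robotparser

    class RobotsCache(object):
        """One parsed robots.txt per host, refreshed at most once a week."""

        def __init__(self, max_age=7 * 24 * 3600):
            self.max_age = max_age
            self.cache = {}   # host -> (fetch_time, RobotFileParser)

        def can_fetch(self, useragent, host, url):
            now = time.time()
            entry = self.cache.get(host)
            if entry is None or now - entry[0] > self.max_age:
                rp = robotparser.RobotFileParser()
                rp.set_url("http://%s/robots.txt" % host)
                rp.read()
                entry = (now, rp)
                self.cache[host] = entry
            return entry[1].can_fetch(useragent, url)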

What seems to be the pattern here? Cutting-edge libraries should most of the time be avoided, a widely used library might still have its bugs, a library that has been around for a long time might have its uses, mammoths are mammoths, and some guys will always know about the perfect-library-you-did-not-know-about (which is most of the time useless).

The trick is being able to evaluate, fast enough, the time it will take to integrate a library and to solve bugs with it, as well as the time it would take to write it yourself.

The best strategy I've found so far to deal with this issue is to keep these examples in mind, evaluate which category the library falls into, and act accordingly.

Code coverage

One quite challenging issue with a search engine is code coverage.

At the internet company I currently work for, any piece of software will go through the hands of a few million unique users each month. So, right, I'm used to something called code coverage.
However, there is one small thing that eases the challenge: the user's data is rather easy to process. If it's an email address, it's easily checked against a pattern; if it's some free text, let's strip the HTML tags, and so on, or warn the user that we can't accept the data as it is. A few tricks will be involved, but since the final data is narrowly defined, one usually gets it right rather easily.
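To make that concrete, the kind of check I have in mind is nothing more than this (the email pattern is a deliberately naive illustration, not a full validator):

    import re

    EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")   # naive, for illustration
    TAG_RE = re.compile(r"<[^>]+>")

    def accept_email(value):
        # Either the data matches the narrow definition, or we warn the user.
        return EMAIL_RE.match(value) is not None

    def clean_free_text(value):
        # Strip the HTML tags and keep the rest as plain text.
        return TAG_RE.sub("", value)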

And then comes HTML. And those who write it. Two main factors, to me, change the rules of the game:
1) it doesn't really matter whether the page is well formed or not; we want to get some useful data out of it
2) any error during the parsing will potentially corrupt gigabytes of data

The first point is interesting: one could say, if this XHTML page is not well formed, then let's forget about it. Right. That would mean ignoring a large part of the web. The same goes for pages supposedly formatted as HTML.
I won't rant about encodings, such as a page encoded in UTF-16 with a meta tag indicating UTF-8.
The point is, you are going to get anything, really. And it's a much better test than any random data one would feed one's parser: people seem to know exactly what will bother you when writing your search engine: bugs, "impossible" code paths, or syntax you wouldn't dare imagine. You cannot name it yet; you will soon.
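To give an idea of what not giving up looks like, here is a lenient text extractor in Python: the standard HTMLParser raises on broken markup, so this sketch simply keeps whatever was extracted before the error instead of dropping the whole page (a real parser needs far more than this):

    from HTMLParser import HTMLParser, HTMLParseError

    class TextExtractor(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.chunks = []

        def handle_data(self, data):
            # Collect every piece of text seen between the tags.
            self.chunks.append(data)

    def extract_text(page):
        extractor = TextExtractor()
        try:
            extractor.feed(page)
            extractor.close()
        except HTMLParseError:
            # Malformed markup: keep what we got so far rather than
            # throwing the whole document away.
            pass
        return " ".join(extractor.chunks)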

Playing with the parser would be an amusing masochistic game if its "only" impact were making one lose data. It also feeds corrupt data to the rest of the chain.
Then the URL database is corrupt, and the spider frantically tries to fetch pages that don't exist, making you lose time, taking up useless space and corrupting the final database.
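One cheap line of defence, for instance, is to refuse to let anything that doesn't look like a URL into the URL database in the first place (a sketch using nothing more than urlparse; real checks go further):

    from urlparse import urlparse

    def looks_like_url(candidate):
        # Reject anything mangled badly enough that it no longer
        # has a scheme and a host.
        parts = urlparse(candidate)
        return parts.scheme in ("http", "https") and bool(parts.netloc)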

Murphy's law being the only law that holds, the corrupt data will end up on the very first results page an alpha user sees (even though you already spent nights trying to fix bugs and finally couldn't find any corrupt data).

2007/08/09

Prototype 1

Hi,

I will try to post on this blog, as often as I can, my views and day-to-day experience of building a search engine.

Expect this blog to talk mostly about parallelism, distribution, I/O troubles, data integrity strategies, code coverage and much more.

Right, the simplest things, just one click away, are very likely the most complex things to implement.