2007/10/09

Update

It's been a while since my last post.
Things are going more slowly these days, although rather smoothly.
So far I have implemented the three main user functionalities I wished to have: text search, image search and video search.

Right now I'm starting to fill up my disks. About 5 million URLs should do to start with, and should be enough to reveal the parts that don't cope adequately with that number of pages. I worked hard to be safe on all grounds, but years of working with computers have taught me to expect (at least) the unexpected.

Let me give you a few numbers about the raw data. These are figures from an inaccurate (emphasis intended) sample of 200 000 pages from the web.

Mean page size: 14.5 KB
Mean number of words per page: 449
Mean number of unique words per page: 216
Mean number of links per page: 35
Mean number of JPEGs per page: 19

What do these numbers mean ? I don't think they mean a lot. I expect them to vary greatly with time.
The number of links per page, for instance, is suspiciously high. I will refine the data into three categories: the total number of links in a page, the number of links pointing inside the website, and the number of links pointing outside the website. Note that here, two links pointing to the same page count as two links instead of one. The same applies to the JPEGs.
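
To make the inside/outside distinction concrete, here is a minimal sketch of how a link could be classified against the host of the page it was found on. The function name and the checks are made up for the example, and deliberately naive (no https, no ports, no subdomain handling):

/* A link is "internal" if its host matches the host of the page it
 * was found on; relative links are counted as internal too. */
#include <string.h>

int
is_internal_link(const char *page_host, const char *url)
{
  const char *prefix = "http://";
  size_t plen = strlen(prefix);
  size_t hlen = strlen(page_host);
  char next;

  if (strncmp(url, prefix, plen) != 0)
    return 1;                     /* relative link */
  if (strncmp(url + plen, page_host, hlen) != 0)
    return 0;                     /* different host */
  next = url[plen + hlen];
  return (next == '/' || next == ':' || next == '\0');
}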

I will try to update the stats from time to time to let you know how crude or accurate this first sample was, as well as other interesting stats.

2007/09/19

Optimizing: things that matter

In my previous post, I explained how to optimize (broadly speaking) the initialization of a block of data.
I also mentioned that it was useless. There are at least two reasons for that:

- Initializing data can't (and shouldn't) be the costliest part of your program:
If you are doing some string processing, or some computations on an array, initializing the array is unlikely to be the part that requires the most instructions. You should instead focus on optimizing your algorithms.

- Do your best to never have to initialize an array to a uniform value more than once in a program's lifetime:
This may sound surprising, even unfeasible. However, if initialization happens very often in your program, then truth #1 doesn't hold anymore, and optimizing your algorithms now includes optimizing the initialization: choosing an algorithm where there is no need, or only a very minimal need, to initialize data (see the sketch below).
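
One classic way to get there, sketched here as an illustration rather than taken from any particular codebase, is a generation counter: the array is cleared once for the whole lifetime of the program, and "starting over" becomes a single increment instead of a memset.

/* 'seen_gen' is zeroed once by the C runtime; a slot counts as set
 * only if it carries the generation of the current document. */
#include <string.h>

#define TABLE_SIZE 65536

static unsigned int seen_gen[TABLE_SIZE];  /* generation stamp per slot */
static unsigned int current_gen;           /* generation of the current document */

void
new_document(void)
{
  current_gen++;
  if (current_gen == 0) {                  /* counter wrapped: clear once */
    memset(seen_gen, 0, sizeof(seen_gen));
    current_gen = 1;
  }
}

int
mark_seen(unsigned int slot)               /* returns 1 if already seen */
{
  if (seen_gen[slot] == current_gen)
    return 1;
  seen_gen[slot] = current_gen;
  return 0;
}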

This will lead us to our next article on optimizing scripting languages execution time.

Optimizing: bzero/memset, loops and beyond

Something nice about open source is that one gets to see how others do things and learn from it: keep the good ideas, think about why something was written that way, improve one's own programs. I could go on about licences, but that will be for another time.

Let's say we have a reasonably small block of data, say an array of characters whose every byte we wish to initialize to 0, and that we are using a lower-level language such as C.

To do this, we could write it this way:

for (i=0; i<ArraySize; i++)
  data[i] = '\0';

Rather straightforward and efficient, gets the job done, you say ?
Well, first, there is nothing really obvious when reading the code about what it's actually doing. Out of context, it's not necessarily self-explanatory.
It's also easy to see that if your processor can handle words bigger than 8 bits, you could be using two, four or eight times more instructions than needed.


Let's write it this way instead:

// Initialize the array to the null character
memset (data, '\0', ArraySize * sizeof (char));

What have we gained ? Well, maybe we only added the cost of a function call. Let's take a look at OpenBSD's implementation of memset:

$ less /usr/src/lib/libc/string/memset.c

/* $OpenBSD: memset.c,v 1.5 2005/08/08 08:05:37 espie Exp $ */
/* (full file and licence header at
   http://www.openbsd.org/cgi-bin/cvsweb/src/lib/libc/string/memset.c) */

#include <string.h>

void *
memset(void *dst, int c, size_t n)
{
  if (n != 0) {
    char *d = dst;

    do
      *d++ = c;
    while (--n != 0);
  }
  return (dst);
}

OK, that's even more cryptic than our own version. In some ways it's better though, as we aren't incrementing a 'useless' variable. But this is disappointing optimization-wise. That was useless; or was it ? Let's look at the FreeBSD implementation of memset. Now things get interesting: memset is adapted to the machine's native word size. No more byte-per-byte initialization.
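
I won't paste the FreeBSD code here, but the word-at-a-time idea can be sketched roughly like this. This is only an illustration of the principle, not the actual implementation, and it leaves aside the subtleties a real libc has to handle:

/* Fill leading bytes until the pointer is word-aligned, then store
 * whole words, then finish with the trailing bytes. */
#include <stddef.h>
#include <stdint.h>

void *
word_memset(void *dst, int c, size_t n)
{
  unsigned char *d = dst;
  unsigned char byte = (unsigned char)c;
  size_t word;
  size_t i;

  /* build a word whose bytes are all 'byte' */
  word = byte;
  for (i = 1; i < sizeof(size_t); i++)
    word = (word << 8) | byte;

  /* leading bytes, until 'd' sits on a word boundary */
  while (n > 0 && ((uintptr_t)d % sizeof(size_t)) != 0) {
    *d++ = byte;
    n--;
  }

  /* whole words, now that 'd' is aligned */
  while (n >= sizeof(size_t)) {
    *(size_t *)(void *)d = word;
    d += sizeof(size_t);
    n -= sizeof(size_t);
  }

  /* trailing bytes */
  while (n > 0) {
    *d++ = byte;
    n--;
  }
  return dst;
}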

That's one lesson: if it's something common, there is probably someone who already tried to write it right. There is no need to take the risk of introducing new bugs or security holes yourself.

Let's say we are the proud owners of an amd64 processor. Wait, there is a special implementation in assembly for that. Is it just the C version rewritten in assembly ? The answer is no: the magic assembly instruction shrq does the right job.

Another lesson: if the action is common enough, the processor will know how to do it natively. And the guys who wrote your OS' libc will likely have noticed that (your mileage may vary).

The good thing is, by using these "high level" functions, you can keep your code readable, you do not need to spend as much time debugging it, it remains portable, you don't need to speak assembly fluently or have in-depth knowledge of the processor you are targeting your application at, and depending on the OS' libc you get the opportunity to benefit from clever improvements made in the libc code and at the assembly level.

Will you ever think about bypassing the libc functions again ? :)

Next, why this optimization was useless....

2007/09/11

Reading

I stumbled upon a very interesting article by Anna Patterson. It's one of the best articles I've read on the subject. It's not really technical, and doesn't go into details, but it is very encouraging, and when writing a search engine, there are high walls to break down.

Incidentally, I discovered this article because their crawler visited my websites.
It seems their company is the new buzz in town.
I have my doubts (probably disgruntled employees and a "revolutionizing" app do not sound like a new Google to me).

But, well, as her article points out, work, technique and perseverance are key.

2007/09/08

Being out of URLs

A friend of mine asked me one day if I wasn't afraid of running out of URLs: finally reaching dead ends.

After all, the question could be pertinent: one could imagine that, if everyone's interests were identical, this could happen: many sites all linking to a handful of well-known sites.

I would be glad if that were true: that would give me a stable pool of pages to work on.

The truth is, one of the first issues is coping with huge quantities of URLs, internal and external. Take this blog for instance: it's linking to at least ten websites.

I will post a few statistics later on.

For now, the answer is "no, I'm far from being out of URLs".

2007/08/26

building a spider in python

This was the first set of keywords that someone in the US typed to reach my blog. He must have been disappointed: there isn't much here about writing a spider in Python, except that there is a useful class which wasn't that useful to me. Maybe he didn't care to know more than that, maybe he did.

Prototype 1 had its spider written in Python.

It is quite easy to do: the httplib module is easy to use, it has all the useful options one might need (or at least all that I needed), I couldn't find any bug in it, and its use is straightforward.

It gets the job done, leaving to the developer the opportunity to focus on what to do with the data.

If you are considering Python to write a spider, go for it. And if you don't trust my word (yet), you will see a good number of major search engines using Python to write that component.

2007/08/24

Even fortune said so

43rd Law of Computing:
Anything that can go wr
fortune: Segmentation violation -- Core dumped

2007/08/22

Garbage collectors and memory

Is a garbage collector, as usually found in interpreted languages, enough to not have to worry about memory ?

The tempting and common answer is yes: the garbage collector is there to take charge of this, it exists exactly to solve that problem.

However, when writing software that needs to be speed- and memory-efficient (and it's really easy to write a program that rapidly needs a bit of tuning, no rocket launch has to be involved), it is usually a good idea to keep memory allocation in mind: how many scattered objects, small and big, are being dealt with during the program's execution, how often new/delete happens, and so on. It's usually rather easy to write a beast which needs gigabytes of memory, or one which allocates and frees gigabytes of memory every few seconds, or code where the garbage collector never gets a chance to work.

Here are a few examples of memory going wrong:
- Functions that are not tail recursive in Erlang will never give the garbage collector a chance to free the memory
- Creating and destroying objects in Java for, say, processing high-frequency requests will make the garbage collector allocate and free huge chunks of memory
- Arrays that grow huge in a PHP command-line script, this language usually being used to handle short-lived requests
- SQL result sets that get bigger with time (and eventually don't even fit in memory)
- Add your own...

Java programmers beware: I've seen that example happen more than once.
I had a good laugh. The author of the program did not.

Memory leaks

Let's consider memory leaks in the document parser.

They are usually easy to spot: the memory behavior goes wrong, more memory is used than should be. The good thing is that this usually happens fast, for at least two reasons: time and code coverage. As opposed to some software which will only show its memory leaks after a few days, a parser shows them within minutes: the number of documents to process is high in a short span of time. They are also easy to see because of code coverage: if anything can trigger a memory leak, it will.

When these memory leaks happen, they are either obvious, or really hard to track down. Either way, it consumes time.

What could be a solution to this problem then ?

The algorithm/methodology I've used, and which has worked quite well so far, I will call the "door algorithm": when you open a door, you close it behind yourself.
With memory, it's the same reflex: allocate/free. Be a good boy, or a good girl. Also, "standardize" where memory gets allocated and freed: allocate as much as possible in the "constructor", free it all in the "destructor" (the quotes applying if you are not using an object-oriented language).
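
In C, that shape typically boils down to a pair of _create()/_destroy() functions, and a rule that nothing else is allowed to allocate or free. A minimal sketch of what I mean (the struct and the names are made up for the example):

#include <stdlib.h>

struct parser {
  char  *buffer;       /* work buffer, allocated once */
  size_t buffer_size;
};

/* The door is opened here, and only here. */
struct parser *
parser_create(size_t buffer_size)
{
  struct parser *p = malloc(sizeof(*p));

  if (p == NULL)
    return NULL;
  p->buffer = malloc(buffer_size);
  if (p->buffer == NULL) {
    free(p);
    return NULL;
  }
  p->buffer_size = buffer_size;
  return p;
}

/* And closed here, and only here. */
void
parser_destroy(struct parser *p)
{
  if (p == NULL)
    return;
  free(p->buffer);
  free(p);
}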

Wouldn't it be more efficient and less error-prone to rely only on tools such as valgrind ?
Well, the two methods, I think, must walk hand in hand: these tools can give false positives and miss memory leaks. They do a good job at tracking them, but they can miss them too. We talked about Murphy's law before.
And, above all, I think these tools must be used as a parachute, not as the only programming strategy (referring to debugging time, to mention the least).

2007/08/15

Choosing a library

Each time I want to add a new feature to the search engine, I try to look at what already exists out there.

I usually find a few libraries close enough to my needs, and I think it will take less time to learn their API and integrate them than to write my own.

So far, contrary to popular belief, I'm not sure that this strategy has really paid off. I will give a few examples of where and how it failed to "work for me". The time it took me to realize a library wouldn't fit my needs varied greatly, as did the reasons I finally did not use it, or did.

- wget: Although not strictly speaking a library, I thought about wget for a short time to avoid writing my own spider. The prospect of having millions of directories stopped me early enough. The time it took me to evaluate that solution was the time it took me to shut out bad advice.
I could not see a proper way to make it reasonably useful.

- RobotFileParser in Python: Prototype 1 had a spider written in Python; reading the documentation did not make me feel easy: the robots class has no cache. I could pickle it, true, but that would soon have trouble fitting in memory. I could also fetch the robots.txt every time, but, obviously, the fewer files to fetch, the happier one is, considering that the robots file is only supposed to be re-read about once a week.
This was not the behaviour I expected, and it would soon enough have created memory troubles.

- libots: I found this library while browsing freshmeat one day. I tried it from the command line. It seemed a good idea, and seemed to work well. Then I wrote a module using this library. After a few thousand pages, bang, a segfault somewhere in glib. Now, given what I think of glib, and that this libots thing wanted to impose its lousy license on my software, I forgot about it.
True, it might have been me doing something wrong with the library. Given that it crashed while loading its own dictionary, after parsing a thousand pages, I didn't feel that bad about myself.
My conclusion would be that a third-party library has to live up to some standards that aren't necessarily met by the library you find around the corner.

- libcurl: With libcurl, things get more subtle. It is a broadly used and very well tested library. But it has its occasional segfaults in various contexts. To name one: threads. This library doesn't seem to have been designed from the ground up for threads. One has to carefully read the little lines at the end of the documentation to find that, if one wishes to handle SSL, there is some weird code to add. I'm saying this because I cannot really see a reason why it's not already in the lib (be it libcurl or openssl). And, no, it does not prevent libcurl from segfaulting.
Although writing such a library is a huge amount of work, I spent so much time trying to find the reason why it would segfault that it makes me wonder whether I wouldn't have been better off writing my own library.

- heritrix: This is a crawler I discovered recently, and I said to myself, well, these guys must know what they are doing. It's in Java. It has a nice interface, though rather cryptic if you didn't write the software. I gave it a try. I couldn't make it work (maybe Java on FreeBSD/amd64 is the reason, but still...), and it doesn't seem to fit my needs.
This is a good answer to the common question "but how on earth didn't you know about that thing ?".

- nspr: This might be the best stuff I've stumbled upon. True, I don't use it very much, but it always does its job properly.
However, fcgi did not like being linked to it at all (in case that information saves you a few hours of wondering why you can only have one fcgi process running).

- fcgi: This might be the worst code I've ever seen. I discovered things that I couldn't imagine could be done in C.
It seems to work though, rather efficiently (I wonder how), and gets the job done (modulo side effects due to the rather poor coding mentioned above).

What seems to be the pattern here ? Cutting-edge libraries should most of the time be avoided, a widely used library might still have its bugs, a library that has been around for a long time might have its uses, mammoths are mammoths, and some guys will always know about the perfect-library-you-did-not-know-about (which is most of the time useless).

The trick is being able to evaluate fast enough the time it will take to integrate a library and solve bugs with it, as well as the time it would take to write it yourself.

The best strategy I've found so far to deal with this issue is to keep in mind these examples, and to evaluate the category the library falls into. And to act accordingly.

Code coverage

One quite challenging issue with a search engine is code coverage.

At the internet company I currently work for, any piece of software will go through the hands of a few million unique users each month. So, right, I'm used to something called code coverage.
However, there is one small thing that eases the challenge: the user's data is rather easily processed. If it's an email address, it's easily checked against a pattern; if it's some free text, let's strip html tags, and so on, or warn the user we can't accept the data as it is. A few tricks are involved, but since the final data is narrowly defined, one usually gets it right rather easily.

And then comes html. And those who write it. Two main factors, to me, change the rules of the game:
1) it doesn't really matter whether the page is well formatted or not, we want to get some useful data out of it
2) any error during the parsing will potentially corrupt gigabytes of data

The first point is interesting: one could say, if this xhtml page is not well-formatted, then let's forget about it. Right. That would mean ignoring a large part of the web. The same goes for pages supposedly formatted as html.
I won't rant about encodings, such as a page encoded in UTF-16 with a meta indicating a UTF-8 encoding.
The point is, you are going to get anything, really. And it's a much better test than any random data one would feed one's parser: people seem to know exactly what will bother you when writing your search engine: bugs, "impossible" code paths, or syntax you wouldn't dare to imagine. You cannot name it yet; you will soon.

Playing with the parser would be an amusing masochistic game if it "only" had the impact of making one lose data. It also feeds corrupt data to the rest of the chain.
Then the url database is corrupt, and the spider is frantically trying to fetch pages that don't exist, making you lose time, taking up useless space and corrupting the final database.
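
One cheap safety net at that point in the chain is to refuse anything that does not even look like a url before it reaches the database. The sketch below is deliberately simplistic (the function name and the thresholds are made up for the example); it does not validate a url, it only filters out obvious garbage coming from a parse gone wrong:

#include <ctype.h>
#include <string.h>

/* Reject obviously corrupt strings before they enter the url database. */
int
url_looks_sane(const char *url)
{
  size_t i, len = strlen(url);

  if (len < 8 || len > 2048)            /* "http://" plus at least one char */
    return 0;
  if (strncmp(url, "http://", 7) != 0)
    return 0;
  for (i = 0; i < len; i++) {
    if (!isprint((unsigned char)url[i]) || isspace((unsigned char)url[i]))
      return 0;
  }
  return 1;
}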

Murphy's law being the only law that stands, the corrupt data will end up on the very first results page an alpha user sees (even though you already spent nights chasing bugs and finally couldn't find any corrupt data).

2007/08/09

Prototype 1

Hi,

I will try to post on this blog as often as I can about my views and day-to-day experience of building a search engine.

Expect this blog to talk mostly about parallelism, distribution, I/O troubles, data integrity strategies, code coverage and much more.

Right, the simplest things, just one click away, are very likely the most complex things to implement.