2007/09/19

Optimizing: things that matter

In my previous post, I explained how to optimize (broadly speaking) the initialization of a block of data.
I also mentioned that it was useless. There are at least two reasons for that:

- Initializing data can't (and shouldn't) be the costliest part of your program:
If you are doing some string processing or some computations on an array, initializing the array is unlikely to be the part that requires the most computer instructions. You should instead focus on optimizing your algorithms.

- Do your best to never initialize an array to a uniform value more than once in a program's lifetime:
This may sound surprising, even unfeasible. However, if initialization happens very often in your program, then truth #1 no longer holds, and optimizing your algorithms now includes optimizing the initialization: choosing an algorithm where there is no need, or only a minimal need, to initialize data (see the sketch below).
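
To make the second point concrete, here is one classic trick (a sketch of my own, not something from the previous post): stamp every slot with a generation counter, so that "re-initializing" the whole array costs a single increment.

#include <stddef.h>
#include <stdio.h>

#define N 1024

static int value[N];
static unsigned stamp[N];       /* generation each slot was last written in */
static unsigned generation = 1; /* stamp[] starts at 0: "never written" */

/* O(1) "initialization" of every slot (a real version would
   handle the counter wrapping around). */
static void
reset_all(void)
{
	generation++;
}

static void
set(size_t i, int v)
{
	value[i] = v;
	stamp[i] = generation;
}

/* Slots not written in the current generation read as 0. */
static int
get(size_t i)
{
	return (stamp[i] == generation) ? value[i] : 0;
}

int
main(void)
{
	set(3, 42);
	printf("%d %d\n", get(3), get(4));  /* 42 0 */
	reset_all();
	printf("%d\n", get(3));             /* 0 again, no memset needed */
	return 0;
}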

This will lead us to our next article on optimizing scripting languages execution time.

Optimizing: bzero/memset, loops and beyond

Something nice about open source is that one gets to see how others do things and learn from it: sometimes keeping the good ideas, sometimes wondering why something was written that way, always improving one's own programs. I could go on about licences, but that will be for another time.

Let's say we have a reasonably small block of data, say an array of characters of which we wish to initialize every byte to 0, and that we are using a lower-level language such as C.

To do this, we could write it this way:

size_t i;

for (i = 0; i < ArraySize; i++)
  data[i] = '\0';

Rather straightforward and efficient, gets the job done, you say?
Well, first, there is nothing really obvious when reading the code about what it's actually doing. Out of context, it's not necessarily self-explanatory.
It's also easy to see that if your processor can handle words bigger than 8 bits, you could be executing two, four, or eight times more instructions than needed.


Let's write it this way:

// Initialize the array to the null character
memset(data, '\0', ArraySize * sizeof(char));

What have we gained? Well, maybe we only added the cost of a function call. Let's take a look at OpenBSD's implementation of memset:

$ less /usr/src/lib/libc/string/memset.c
/* $OpenBSD: memset.c,v 1.5 2005/08/08 08:05:37 espie Exp $ */
/* full file and licence header at:
   http://www.openbsd.org/cgi-bin/cvsweb/src/lib/libc/string/memset.c */

#include <string.h>

void *
memset(void *dst, int c, size_t n)
{
	if (n != 0) {
		char *d = dst;

		do
			*d++ = c;
		while (--n != 0);
	}
	return (dst);
}

OK, that's even more cryptic than our own version. In some ways better, though, as we aren't incrementing a 'useless' variable. But this is disappointing optimization-wise. That was useless; or was it? Let's look at the FreeBSD implementation of memset. Now things get interesting: memset is adapted to the machine's native word size. No more byte-by-byte initialization.
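
The idea is easy to sketch in plain C (my own illustration of the technique, not FreeBSD's actual code, which also deals with subtleties such as aliasing): align the pointer, replicate the byte into a full machine word, store whole words, then finish the tail.

#include <stddef.h>
#include <stdint.h>

void *
word_memset(void *dst, int c, size_t n)
{
	unsigned char *d = dst;
	unsigned char b = (unsigned char)c;
	uintptr_t word;
	size_t i;

	/* Write single bytes until d is word-aligned. */
	while (n > 0 && (uintptr_t)d % sizeof(word) != 0) {
		*d++ = b;
		n--;
	}

	/* Replicate the byte into every byte of a native word. */
	word = b;
	for (i = 1; i < sizeof(word); i++)
		word = (word << 8) | b;

	/* One store now initializes 4 or 8 bytes at once. */
	while (n >= sizeof(word)) {
		*(uintptr_t *)(void *)d = word;
		d += sizeof(word);
		n -= sizeof(word);
	}

	/* Finish the tail byte by byte. */
	while (n > 0) {
		*d++ = b;
		n--;
	}

	return (dst);
}

On a 64-bit machine, the main loop touches eight bytes per store instead of one.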

That's one lesson: if it's something common, someone has most likely already tried to write it right. There is no need to take the risk of introducing new bugs or security holes yourself.

Let's say we are the proud owners of an amd64 processor. Wait, there is a special implementation in assembly for that. Is it just the C version rewritten in assembly? The answer is no: the magic shrq instruction does the right job, shifting the byte count into a word count so the routine can store whole words at a time.
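
As a rough illustration (a sketch of mine using GCC-style inline assembly, not FreeBSD's actual memset.S), the word-storing core of such a routine boils down to a single string instruction:

#include <stddef.h>

/* Fill nwords machine words starting at d with the word w.
   The direction flag is assumed clear, as the ABI guarantees
   on function entry. */
static inline void
fill_words(void *d, unsigned long w, size_t nwords)
{
	__asm__ volatile ("rep stosq"
	    : "+D" (d), "+c" (nwords)
	    : "a" (w)
	    : "memory");
}

Each rep stosq iteration stores eight bytes in one go; the surrounding code only has to replicate the byte into w and handle the unaligned head and tail, much as in the C sketch above.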

Another lesson: if the action is common enough, the processor will know how to do it natively. And the guys who wrote your OS' libc will likely have noticed that (your mileage may vary).

The good thing is, by using these "high level" functions, you can keep your code readable, you do not need to spend as much time debugging it, it remains portable, you don't need to speak assembly fluently or have in-depth knowledge of the processor you are targeting, and depending on the OS' libc you get the opportunity to benefit from clever improvements made in the libc code and at the assembly level.

Will you ever think about bypassing the libc functions again? :)

Next: why this optimization was useless...

2007/09/11

Reading

I stumbled upon a very interesting article by Anna Patterson. It's one of the best articles I've read on the subject. It's not really technical and doesn't go into details, but it is very encouraging, and when writing a search engine, there are high walls to break down.

Incidentally, I discovered this article because their crawler visited my websites.
It seems their company is the new buzz in town.
I have my doubts (probably disgruntled former employees and a "revolutionizing" app do not sound like a new Google to me).

But, well, as her article points out, work, technique, and perseverance are key.

2007/09/08

Being out of URLs

A friend of mine asked me one day whether I wasn't afraid of running out of URLs: finally reaching nothing but dead ends.

After all, the question could be pertinent: one could imagine that, if everyone shared the same interests, this could happen: many sites all linking to a handful of well-known sites.

I would be glad if that were true: that would give me a stable pool of pages to work on.

The truth is, one of the first issues is coping with huge quantities of URLs, internal and external. Take this blog for instance: it links to at least ten websites.

I will post a few statistics later on.

For now, the answer is "no, I'm far from being out of URLs".