2007/09/19

Optimizing: bzero/memset, loops and beyond

One nice thing about open source is that you get to see how others do things and learn from it: keep the good ideas, wonder why something was written a certain way, and improve your own programs. I could go on about licences, but that will be for another time.

Let's say we have a reasonably small block of data, for instance an array of characters whose every byte we want to initialize to 0, and that we are using a lower-level language such as C.

To do this, we could write it that way:

size_t i;

for (i = 0; i < ArraySize; i++)
  data[i] = '\0';

Rather straightforward and efficient; it gets the job done, you say?
Well, first, nothing in the code makes it obvious what it is actually doing. Out of context, it is not necessarily self-explanatory.
It is also easy to see that if your processor can handle words bigger than 8 bits, you could be executing two, four or eight times more instructions than needed.


Let's write it that way:

// Initialize the array to the null character
memset (data, '\0', ArraySize * sizeof (char));
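
For the zero-filling case in particular (the bzero from the title), the traditional BSD call does the same job. A possible equivalent, assuming the same data and ArraySize as above:

#include <strings.h>

// Zero ArraySize bytes starting at data; bzero is the older BSD spelling
// of this particular use of memset.
bzero (data, ArraySize);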

What have we gained? Well, maybe we only added the cost of a function call. Let's take a look at OpenBSD's implementation of memset:

$ less /usr/src/lib/libc/string/memset.c
/* $OpenBSD: memset.c,v 1.5 2005/08/08 08:05:37 espie Exp $ */
/* (full file and licence header at:
   http://www.openbsd.org/cgi-bin/cvsweb/src/lib/libc/string/memset.c) */

#include <string.h>

void *
memset(void *dst, int c, size_t n)
{

	if (n != 0) {
		char *d = dst;

		do
			*d++ = c;
		while (--n != 0);
	}
	return (dst);
}

OK, that's even more cryptic than our own version. In a way it is better, though, since we are not incrementing a 'useless' variable. But it is disappointing optimization-wise. So that was useless; or was it? Let's look at the FreeBSD implementation of memset. Now things get interesting: memset is adapted to the machine's native word size. No more byte-per-byte initialization.
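
To get a feel for the idea, here is a rough sketch of the word-at-a-time technique. This is not the actual FreeBSD code; the name sketch_memset and all the details are mine:

#include <stddef.h>
#include <stdint.h>

// Sketch of word-at-a-time filling: align first, then store whole
// machine words, then finish the tail byte by byte.
void *
sketch_memset(void *dst, int c, size_t n)
{
  unsigned char *d = dst;
  unsigned char byte = (unsigned char)c;
  size_t word = 0;
  size_t i;

  // Build a word with the byte repeated in every position.
  for (i = 0; i < sizeof(size_t); i++)
    word = (word << 8) | byte;

  // Fill byte by byte until the pointer is word-aligned.
  while (n > 0 && ((uintptr_t)d % sizeof(size_t)) != 0) {
    *d++ = byte;
    n--;
  }

  // Store whole words as long as a full word still fits.
  for (i = 0; i + sizeof(size_t) <= n; i += sizeof(size_t))
    *(size_t *)(d + i) = word;
  d += i;
  n -= i;

  // Finish whatever bytes remain.
  while (n-- > 0)
    *d++ = byte;

  return (dst);
}

A real libc version also has to care about aliasing rules, small sizes and so on; the sketch only shows the shape of the optimization.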

That's one lesson: if something is common, there most likely exists someone who has already tried to write it right. There is no need on your part to take the risk of introducing new bugs or security breaches.

Now let's consider we are the proud owners of an amd64 processor. Wait, there is a special implementation in assembly for that. Is it just the C version rewritten in assembly? The answer is no: the magic assembly instruction shrq does the right job.
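
shrq is a 64-bit shift right; in this context it presumably splits the byte count into whole 8-byte quadwords plus a small tail, which in C terms (my illustration, not the libc source) amounts to:

size_t quads = n >> 3;  // n / 8: how many 8-byte quadwords to store
size_t tail  = n & 7;   // n % 8: leftover bytes to finish individually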

Another lesson: if the action is common enough, the processor will know how to do it natively. And the guys who wrote your OS' libc will likely have noticed that (your mileage may vary).

The good thing is that by using these "high-level" functions you can keep your code readable, you do not need to spend as much time debugging it, it remains portable, you do not need to speak assembly fluently or have in-depth knowledge of the processor you are targeting, and depending on your OS's libc you get the benefit of clever improvements made in the libc code and at the assembly level.

Will you ever think about bypassing the libc functions again ? :)

Next, why this optimization was useless...
