2009/10/04

Design patterns

A while ago in the company I worked for, I was assigned a task.

Me being given that task was an error: a guy with the same first name as me was supposed to get it.
I could go on and on about how that situation was managed, but this blog is about serious stuff and not about politics and egos in the workplace.

So, I took a look at the application I was supposed to maintain and make evolve. A very nice Java application, quite well written, hundreds of lines of code, readable and so on. Nicely done from a design patterns point of view.

What was this application doing with its hundreds of lines of code?
$ xsltproc sheet.xsl document.xml > output.xml

Yes, those hundreds of lines of code fit into one line. What about the data source, you ask? Well, everything is a file in Unix...
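As an illustration (a minimal sketch; the URL and file names are placeholders, not the ones from that project), the data source is simply whatever you hand to the command: a local file, a pipe, or even a URL, since, if I remember correctly, xsltproc reads standard input when the document argument is "-" and can fetch documents over HTTP on its own:

$ curl -s http://example.com/document.xml | xsltproc sheet.xsl - > output.xml
$ xsltproc sheet.xsl http://example.com/document.xml > output.xml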

Don't get me wrong, there are a lot of things I hate about the Unix way, but...

Please, let's keep things as simple as they are... And you know what? They are usually more efficient, and just as powerful, when they are kept simple...

2009/08/13

Ubuntu and "large files" (files greater than 2 G)

Recently, for whatever reason, I decided that I should have Linux on a few machines. It might have been to be sure that my tools run on Linux.

It happened that Linux ended up on one of the key machines.
That was pure over-confidence and trust.

Since FreeBSD has no problem with "big files", that is, files greater than 2 gigabytes (something one could expect in 2009), one could expect the same from Linux.

Well, not quite so. The spider got stuck around 2 G of data.

The two magic options to add at compile time, before banging your head against the wall (it does hurt, trust my experience):

-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64

Why does one need to add these defines, which trigger an ugly cascade of other defines? (It seems a lot of people decided this needed to be frightening.)
I have no idea why.

Anyway, that's two flags one needs to add.
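For completeness, a minimal sketch of passing them, assuming gcc on a glibc-based Linux (the source and binary names, spider.c and spider, are just placeholders); on most 32-bit Linux systems, getconf should also be able to report the recommended flags:

$ gcc -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -o spider spider.c
$ getconf LFS_CFLAGS

(As far as I understand, _FILE_OFFSET_BITS=64 makes off_t a 64-bit type and transparently maps the usual open/lseek/stat calls to their 64-bit variants.)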

2009/01/18

Error

An interesting thing to feed a search engine is the word "error".

It's a very simple, single word.

You'd say that a good search engine will know how to handle all those "mysql error", "php error", "error page not found", and so on.

Well, it's pretty interesting to see how well each one handles it.

Some just give you, among their top results, "Error! Reason: File 'index.html' was not found!" (right, this is the second result on a popular French engine). Some even return that kind of page as their top result ("Error. Reason: File 'menu.asp' was not found!").

Others play it on the safe side, displaying news items that mention errors and the Wikipedia entry for "error"; then the algorithm, seeing that the remaining thousands of pages are wrong, shows only a handful of results.

I found one search engine that seemed less prone to this kind of trouble than the major search engines we know.

On the other hand, my own search engine seems to be affected much later than the others. That's suspicious, considering it's unlikely I'm the only one running into that trouble and trying to fix it. The real answer is... well, it simply doesn't show the content of the page very well, nor where that "error" comes from.

Is this question relevant in any way? It's the traditional problem: the written words don't really mean what the website is about.
Even if we use the link text from one web page to another, there are going to be people writing "this page yields an error" or something like that.
One could say: well, if enough people say something different about the page, use what the majority is saying; true, but if you are looking for fast-rising pages for a given word...

I don't have an answer to that; it looks like I'm not the only one, and that's not really a comforting idea as to the future of search.

2009/01/06

Security concepts and an open source search engine

I was reading tonight a very long list of comments on what an ideal distributed open source search engine could look like.

The interesting thing, reading the comments, is how much of the discussion relates to security. Let me explain.

The main argument for why an open source search engine (even more so a distributed one) can't work in practice is that when you know how the thing works, you can easily influence the results (i.e. spam it). And then people begin to praise the "security through obscurity" of the major search engines: according to them, it's the best way to preserve security.
Needless to say, this is wrong. If it were so, companies wouldn't be spending money optimizing their ranking, especially if that optimization wasn't working at all. Even if you allow for a "moron factor", it's too easy to check whether it's effective: run a search and see if you are on the first page.

So, obviously, even for ranking, security through obscurity doesn't work.

As a reminder, the most widely used library for secure communication, openssl, whose source code is widely available and whose encryption algorithms are known, isn't (officially at least) easily cracked. True, there's a lot of money involved in being on the major search engines' first page, and people are desperate to get there. But it's also true that brilliant guys spend their days trying to break that openssl thing.

So, maybe that's one of the real goals for the next big search engine: a ranking algorithm that can't be gamed, even if you know precisely what the algorithm is.