2009/01/18

Error

An interesting thing to feed a search engine is the word "error".

It's a very simple, single word.

You'd think a good search engine would know how to handle all those "mysql error", "php error", "error page not found" pages, and so on.

Well, it's pretty interesting to see how each one actually handles it.

Some give you, right among their top results, "Error! Reason: File 'index.html' was not found!" (really: this is the second result on a popular French engine). Some even serve one as their very top result ("Error. Reason: File 'menu.asp' was not found!").

Others play it on the safe side, displaying news items that mention "error", then the Wikipedia entry for error; after that, the algorithm, seeing that the remaining thousands of pages are wrong, shows only a handful of results.

I found one search engine that seemed less prone to this kind of trouble than the major search engines we all know.

On the other hand, my own search engine seems to surface these error pages much later than the other engines do. That's suspicious, considering it's unlikely I'm the only one running into this problem and trying to fix it. The real answer is... well, it doesn't display the content of the page very well, nor where that "error" comes from.
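If I were to fix it for real, the first line of defence would be a soft-404 filter at indexing time. A minimal sketch, assuming nothing about any engine's real code (the patterns and names below are my own, purely illustrative):

    import re

    # Illustrative patterns only: real error templates vary wildly.
    ERROR_PATTERNS = [
        re.compile(r"error[.!]?\s*reason\s*:", re.I),
        re.compile(r"file\s+['\"][^'\"]+['\"]\s+was\s+not\s+found", re.I),
        re.compile(r"\b404\b.{0,20}not\s+found", re.I),
    ]

    def looks_like_error_page(title, body, probe_len=2000):
        # Served error templates tend to be short and to open with a
        # stock phrase, so probing the title plus the start of the body
        # catches most of them.
        probe = title + " " + body[:probe_len]
        return any(p.search(probe) for p in ERROR_PATTERNS)

Crude, but it would at least keep "File 'menu.asp' was not found!" off the first results page.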

Is this question relevant in any way? It's the traditional problem: the words written on a page don't necessarily reflect what the page is about.
Even if we use the link text pointing from page to page, there will always be someone writing "this page yields an error" or something like that.
One can say that, well, if enough people say something different about the page, use what the majority is saying; true, but if you are looking for fast-rising pages on given words, there is no majority yet to rely on...
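To make the trade-off concrete, here is a toy version of that majority vote over anchor texts; the thresholds are arbitrary assumptions of mine:

    from collections import Counter

    def consensus_anchor_text(anchor_texts, min_votes=3, min_share=0.5):
        # Keep the dominant description only when enough sources agree.
        if not anchor_texts:
            return None
        counts = Counter(t.strip().lower() for t in anchor_texts)
        text, votes = counts.most_common(1)[0]
        if votes >= min_votes and votes / len(anchor_texts) >= min_share:
            return text
        return None  # no consensus yet

With three inbound links, one of which says "this page yields an error", you have no consensus to work with; that is exactly the fast-rising page problem.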

I don't have an answer to that; it looks like I'm not the only one, and that's not really a comforting thought for the future of search.

2009/01/06

Security concepts and an open source search engine

I was reading tonight a very long list of comments about what an ideal distributed open source search engine could look like.

The interesting thing, reading the comments, is how much it relates to security. Let me explain.

The main argument for why an open source (even more so a distributed) search engine can't work in practice is that once you know how the thing works, you can easily influence the results (i.e., spam). And then people begin to praise the "security through obscurity" of the major search engines: it is, according to them, the best way to preserve security.
Needless to say, this is wrong. If obscurity worked, big companies wouldn't be spending money optimizing their ranking, because it would have no effect. Even allowing for the "moron factor", it's too easy to check whether the spending is effective: run a search and see if you are on the first page.

So, obviously, even for ranking, security through obscurity doesn't work.

As a reminder, the most widely used library for secure communication, openssl, whose source code is freely available and whose encryption algorithms are known, isn't (officially, at least) easily cracked. True, there's a lot of money involved in being on the major search engines' first page, and people are desperate to get there. But it's true too that brilliant people spend their days trying to break that openssl thing.

So maybe that's one of the right goals for the next big search engine: a ranking algorithm that can't be gamed, even when you know precisely what the algorithm is.

2008/11/29

Encoding

A friend of mine reminded me tonight about something obvious I had to deal with: encodings. Let's recap.

First, the Computer made ASCII and ANSI. And the Computer saw it was good.
And the Computer said "Let there be KOI", and there was KOI.
And the Computer saw KOI was good. And the Computer separated KOI from the others. (...) And the Computer said "Let there be BIG-5 in the midst of the sea of encodings, and let it separate the encodings from the dark encodings". And the Computer made the firmament and separated the encodings which were under the firmament from the encodings which were above the firmament. (...)
And the Computer saw everything he had made, and behold, it was very good. And there was evening, and there was morning, a sixth day. (...)

To make a long story short, at one point, thanks to the Computer's in-depth look and long-term view, we get to the tale of the Tower of Babel.

And then, well, we get to the search engine. It's bad enough already that websites all around the earth use different encodings but, to make matters worse, everyone seems to pretend their encoding is obviously the right one.
And that's where the "obvious" seriously gets in the way. How is one supposed to know that a Russian website hosted in the US is using the latin1 encoding? Or that a Korean website hosted in Japan is using the iso-8859-1 encoding?
And in case you think that's easy, consider that the page itself may be advertising yet another encoding, an impossible one.
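For the record, the obvious layered guess goes something like this. This is a sketch, not my actual pipeline, and chardet stands in here for whatever statistical detector you prefer:

    import chardet  # any statistical charset detector will do

    def decode_page(raw_bytes, declared_encodings):
        # declared_encodings: the charset from the HTTP header, then the
        # one from the <meta> tag. Either may be unknown, or a plain lie.
        for enc in declared_encodings:
            try:
                return raw_bytes.decode(enc), enc
            except (LookupError, UnicodeDecodeError):
                continue
        guess = chardet.detect(raw_bytes).get("encoding")
        enc = guess or "latin1"  # latin1 decodes any byte sequence
        return raw_bytes.decode(enc, errors="replace"), enc

The trap is that a successful decode proves nothing: a Russian page decodes perfectly well as latin1 and comes out as gibberish, which is exactly the case above.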

Do you think I'm overdoing it?
I have 7 million pages for you.
Anyone who has a good generic answer to that one gets a free beer on me. (International shipping is OK.) And, no, dropping the pages that are that crazy is not a satisfactory answer.

2008/11/15

Thank you !

I was very surprised by the very warm welcome our presentation received, by how many questions it sparked, and by the encouragements.

Thank you all for this :)

The other presentations were really interesting, and we were nervous to be the ones speaking right after a great presentation about The New York Times and how fully they embrace the web.

You will see me at the next Ignite session for sure :)

2008/11/13

OsO @ Ignite Paris #3

Nicolas Toper and I will be giving a short presentation on OsO in Paris for the third Ignite event there (more information here).

The rules are to give a presentation of 20 slides in exactly 5 minutes. The presentation can be about anything (geeky), so we chose to do ours on my search engine and some of the hard lessons learned.
You can find the slides here to get a feel for what we are going to talk about.

2008/07/23

By the Book (of law)

A few weeks ago, I was quite excited about a "good idea" I had.
I spent some time on Alexa, looking at what the most popular sites are. It seemed that anything that had to do with news drew a top audience.

So I said to myself, this looks like a good idea: something of interest, a "small" corpus, and plenty to do with natural language processing.

My tools being quite modular, I rapidly had a news search engine at my fingertips, with about 40 different sources, and I even toyed with graphics comparing term frequencies across articles, to be used for anything from brands to politicians.
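The counting behind those graphics is the easy part; a toy version, with a deliberately naive tokenizer and names of my own invention, fits in a few lines:

    import re
    from collections import Counter

    def term_frequencies(article_texts, terms):
        # Count occurrences of the chosen terms across a batch of
        # articles; plotted per day and per source, the counts let you
        # compare, say, two brands or two politicians.
        wanted = {t.lower() for t in terms}
        counts = Counter()
        for text in article_texts:
            for word in re.findall(r"\w+", text.lower()):
                if word in wanted:
                    counts[word] += 1
        return counts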

At first I thought to myself, well, I've got a nice idea (something a bit more elaborate than the usual news search engine), and it's going to be some sort of win/win: the tool would point readers to articles they might have missed and be interested in, the newspapers would generate more revenue, and my idea would bring me traffic.
Remembering a few articles I had read a while ago, I did a quick search on ongoing trials about this kind of tool. There are a few, over large sums of money. The Belgian press syndicate seems to refuse any link to its newspapers. It seemed quite ridiculous.

Then I took a look at the Berne Convention. A news search engine could fall into the category of "fair use", the result being "quotations from newspaper articles and periodicals in the form of press summaries" (Article 10). But it might not.
French law, for instance, can be even more restrictive: displaying the number of words in an article can be considered a "transformation" (Article L122-4 of the Code de la Propriété Intellectuelle) and thus forbidden. So can showing the size of a document, as just about any search engine does. And it goes on and on. Any f(document) could be illegal.

True, I could just contact every newspaper and wait for their answers. For whatever reason, I don't expect any.

If someone is interested in developing that project, that's great, the code is ready. I'm off to other territories for the time being, at least until a few trials come to an end.

Next time, things will be technical again.

2008/06/02

Over five million pages

The search engine now handles more than five million pages, and overall performance is OK.
Stay tuned; I hope to have exciting news for you in a few weeks.