2007/08/15

Code coverage

One quite challenging issue with a search engine is code coverage.

At the internet company I currently work for, any piece of software will go through the hands of a few million unique users each month. So, right, I'm used to something called code coverage.
However, one small thing eases the challenge: the user's data is rather easy to process. If it's an email address, it's easily checked against a pattern; if it's some free text, we strip the HTML tags, and so on, warning the user when we can't accept the data as it is. A few tricks are involved, but since the final data is narrowly defined, one usually gets it right rather easily.
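To make the idea concrete, here is a minimal sketch of that kind of narrow validation. The patterns and function names are hypothetical and deliberately simple; the point is only that user-submitted fields are well-defined enough that you can either accept them against a pattern or reject them outright:

```python
import re

# Hypothetical, deliberately crude validators: user-submitted fields
# are narrowly defined, so a pattern either matches or we reject the
# input and warn the user.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
TAG_RE = re.compile(r"<[^>]+>")

def check_email(value):
    """Return the address if it looks like an email, else None."""
    return value if EMAIL_RE.match(value) else None

def strip_tags(text):
    """Strip anything that looks like an HTML tag from free text."""
    return TAG_RE.sub("", text)
```

Nothing subtle here, and that is exactly the point: when you control the shape of the input, the checks stay trivial.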

And then comes html. And those who write it. Two main factors, to me, change the rules of the game:
1) it doesn't really matter whether the page is well-formed or not; we want to get some useful data out of it
2) any parsing error can potentially corrupt gigabytes of data

The first point is interesting: one could say, if this XHTML page is not well-formed, then let's forget about it. Right. That would mean ignoring a large part of the web. The same goes for pages that are supposedly plain HTML.
I won't even rant about encodings, such as a page encoded in UTF-16 whose meta tag declares UTF-8.
The point is, you are going to get anything, really. And it's a much better test than any random data one could feed one's parser: people out there seem to know exactly what will bother you when you write a search engine: bugs, "impossible" code paths, syntax you wouldn't dare to imagine. You cannot name it yet, but you soon will.
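The UTF-16-with-a-UTF-8-meta case above can be handled by trusting the bytes over the declaration. This is a sketch under my own assumptions (the function name and the fallback are mine, and real-world sniffing is far hairier), but it shows the principle: a byte-order mark wins over whatever the page claims about itself:

```python
import codecs
import re

# Look for a charset declaration in the first KB of the raw bytes.
META_CHARSET_RE = re.compile(rb'charset=["\']?([\w-]+)', re.I)

def sniff_encoding(raw):
    """Guess the real encoding of a fetched page, trusting bytes over
    the <meta> declaration: a UTF-16 BOM wins even if the page claims
    to be UTF-8."""
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8"
    m = META_CHARSET_RE.search(raw[:1024])
    if m:
        return m.group(1).decode("ascii", "replace").lower()
    return "utf-8"  # assumption: a permissive default when nothing matches
```

A real crawler would also fall back to statistical detection for BOM-less pages, but even this much avoids decoding UTF-16 bytes as UTF-8 garbage.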

Playing with the parser would be an amusing bit of masochism if it "only" meant losing data. It also feeds corrupt data to the rest of the chain.
Then the URL database is corrupt, and the spider frantically tries to fetch pages that don't exist, losing time, taking up useless space, and corrupting the final database.
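One cheap defense against that cascade is a sanity check at the door of the URL database, so a bad parse can't send the spider chasing garbage. The rules below are hypothetical (scheme must be http(s), host must contain a dot, no whitespace), a sketch rather than what any real crawler does:

```python
from urllib.parse import urlsplit

def looks_like_url(candidate):
    """Cheap sanity check before a string enters the crawl queue.
    Hypothetical rules: http(s) scheme, a host with a dot in it,
    and no whitespace anywhere."""
    if any(c.isspace() for c in candidate):
        return False
    parts = urlsplit(candidate)
    return parts.scheme in ("http", "https") and "." in parts.netloc
```

It rejects obvious parser shrapnel (stray tags, truncated fragments) while costing almost nothing per candidate, which matters when gigabytes flow through the chain.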

Murphy's law being the only law that stands, the corrupt data will land on the very first results page an alpha user sees (even though you already spent nights hunting for bugs and finally couldn't find any corrupt data).
