2007/10/09

Update

It's been a while since my last post.
Things are going more slowly these days, although rather smoothly.
So far I have implemented the three main user functionalities I wished to have: text search, image search and video search.

Right now I'm starting to fill up my disks. About 5 million URLs should do to start with; at that scale I expect to see a few things that don't cope adequately with that number of pages. I worked hard to be safe on all grounds, but working for years with computers taught me to expect (at least) the unexpected.

Let me give you a few numbers about the raw data. These are figures from an inaccurate (emphasis intended) sample of 200,000 pages from the web.

Mean page size: 14.5 KB
Mean number of words per page: 449
Mean unique words per page: 216
Mean number of links per page: 35
Mean number of JPEG images per page: 19
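As a rough sketch of how figures like these might be collected per page, here is some hypothetical Python using only the standard library's html.parser; this is an illustration, not the project's actual code, and the word/link/image definitions are my own assumptions:

```python
from html.parser import HTMLParser
import re

class PageStats(HTMLParser):
    """Collects rough per-page figures: link count and JPEG image count."""
    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = 0
        self.jpegs = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links += 1  # duplicates count twice, as in the stats above
        if tag == "img" and attrs.get("src", "").lower().endswith((".jpg", ".jpeg")):
            self.jpegs += 1

    def handle_data(self, data):
        self.text_parts.append(data)

def page_stats(html: str) -> dict:
    """Return the raw figures for one page; averaging over a sample
    of pages then gives the means quoted above."""
    parser = PageStats()
    parser.feed(html)
    # crude word definition: runs of alphanumeric characters, case-folded
    words = re.findall(r"\w+", " ".join(parser.text_parts).lower())
    return {
        "size_bytes": len(html.encode("utf-8")),
        "words": len(words),
        "unique_words": len(set(words)),
        "links": parser.links,
        "jpegs": parser.jpegs,
    }
```

Averaging each field over the sampled pages yields the kind of table shown above.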

What do these numbers mean? I don't think they mean a lot, and I expect them to vary greatly with time.
The number of links per page, for instance, is suspiciously high. I plan to refine the data into three categories of unique links: the number of links in a page, the number of links pointing inside the website, and the number of links pointing outside the website. In the current figures, two links pointing to the same page count as two links instead of one; the same applies to the JPEGs.
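The refinement described above can be sketched in a few lines. This is a hypothetical helper (the function name and the decision to compare hosts case-insensitively are my assumptions), deduplicating links before splitting them into inside/outside the site:

```python
from urllib.parse import urljoin, urlparse

def classify_links(page_url: str, hrefs: list) -> dict:
    """Split a page's links into unique internal/external links,
    where 'internal' means the target host matches the page's host."""
    page_host = urlparse(page_url).netloc.lower()
    # resolve relative links and deduplicate, so two links to the
    # same page count only once
    unique = {urljoin(page_url, href) for href in hrefs}
    internal = sum(1 for url in unique
                   if urlparse(url).netloc.lower() == page_host)
    return {
        "unique": len(unique),
        "internal": internal,
        "external": len(unique) - internal,
    }
```

A stricter notion of "same page" (ignoring fragments or query strings, say) would shrink the counts further; host comparison is just the simplest cut.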

I will try to update the stats from time to time to let you know how crude or accurate this first sample was, and to share other interesting figures.