<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-7083275890674905250</id><updated>2012-01-29T10:51:07.359+01:00</updated><title type='text'>Building A Search Engine</title><subtitle type='html'>This blog is about the experience of trying to put the pieces together to get a search engine that scales and actually can yield correct query results on a reasonable number of web pages</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>22</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-5369030781043756776</id><published>2009-10-04T02:05:00.002+02:00</published><updated>2009-10-04T02:12:24.095+02:00</updated><title type='text'>Design patterns</title><content type='html'>A while ago in the company I worked for, I was assigned a task.&lt;br /&gt;&lt;br 
/&gt;Giving me that task was a mistake: a guy with the same first name as me was supposed to get it.&lt;br /&gt;I could go on and on about how that situation was handled, but this blog is about serious stuff and not about politics and egos in the workplace.&lt;br /&gt;&lt;br /&gt;So, I took a look at the application I was supposed to maintain and evolve. A very nice Java application, quite well written, hundreds of lines of code, readable and so on. Nicely written from a design patterns point of view.&lt;br /&gt;&lt;br /&gt;What was this application doing with that nice design pattern and its hundreds of lines of code ?&lt;br /&gt;$ xsltproc sheet.xsl document.xml &gt; output.xml&lt;br /&gt;&lt;br /&gt;Yes, these hundreds of lines of code fit in one line. What about the data source, you ask ? Well, everything is a file in Unix...&lt;br /&gt;&lt;br /&gt;Don't get me wrong, there are a lot of things I hate about the Unix way, but...&lt;br /&gt;&lt;br /&gt;Please, keep things as simple as they are... And you know what ? 
They are usually more efficient when they are simple...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-5369030781043756776?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/5369030781043756776/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=5369030781043756776' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/5369030781043756776'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/5369030781043756776'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2009/10/design-patterns.html' title='Design patterns'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-8861393047489781769</id><published>2009-08-13T00:26:00.002+02:00</published><updated>2009-08-13T00:41:15.114+02:00</updated><title type='text'>Ubuntu and "large files" (files greater than 2 G)</title><content type='html'>Recently, for whatever reason, I decided that I should have Linux on a few machines. 
It might have been to make sure that my tools run on Linux.&lt;br /&gt;&lt;br /&gt;It so happened that Linux ended up on one of the key machines.&lt;br /&gt;That was pure over-confidence and trust.&lt;br /&gt;&lt;br /&gt;Since FreeBSD has no problem with "big files", that is, files larger than 2 gigabytes, which is something one could expect in 2009, one could expect the same from Linux.&lt;br /&gt;&lt;br /&gt;Well, not quite. The spider got stuck at around 2 gigabytes of data.&lt;br /&gt;&lt;br /&gt;The two magic options to add at compile time before banging your head against the wall (it does hurt, trust my experience):&lt;br /&gt;&lt;br /&gt;-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64&lt;br /&gt;&lt;br /&gt;Why does one need to add these defines ? (which trigger an ugly cascade of other defines; it seems a lot of people decided this needs to be frightening)&lt;br /&gt;I have no idea why.&lt;br /&gt;&lt;br /&gt;Anyway, those are the two flags one needs to add.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-8861393047489781769?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/8861393047489781769/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=8861393047489781769' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/8861393047489781769'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/8861393047489781769'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2009/08/ubuntu-and-large-files-files-greater.html' title='Ubuntu and &quot;large files&quot; (files greater than 2 
G)'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-7624448831563465652</id><published>2009-01-18T21:08:00.002+01:00</published><updated>2009-01-18T21:32:44.091+01:00</updated><title type='text'>Error</title><content type='html'>An interesting word to feed a search engine is "error".&lt;br /&gt;&lt;br /&gt;It's very simple, a single word.&lt;br /&gt;&lt;br /&gt;You might say that a good search engine will know how to handle all those "mysql error", "php error", "error page not found", etc.&lt;br /&gt;&lt;br /&gt;Well, it's pretty interesting to see how well each one copes with that.&lt;br /&gt;&lt;br /&gt;Some just give you in their top results "Error! Reason: File 'index.html was not found!" (right, this is the second result on a popular French engine). Some do it as their top result (&lt;span class="s"&gt;Error. Reason: File "menu.asp" was not found!)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Others play it safe, displaying news that mentions errors and the Wikipedia entry for "error"; then the algorithm, seeing that the remaining thousands of pages are wrong, shows only a handful of results.&lt;br /&gt;&lt;br /&gt;I found one search engine that seemed less prone to this kind of trouble than the major search engines we know.&lt;br /&gt;&lt;br /&gt;On the other hand, my search engine seems to be affected a lot later than the other search engines. That's suspicious, considering it's unlikely I'm the only one to run into that problem and try to fix it. The real answer is... well, it doesn't show the content of the page very well, nor where that "error" comes from.&lt;br /&gt;&lt;br /&gt;Is this question relevant in any way ? 
It's the traditional "the written words don't really mean what the website is about".&lt;br /&gt;Even if we are using the link text from web page to web page, there are going to be guys saying "this page yields an error" or something like that.&lt;br /&gt;One can say that, well, if enough guys say something different about the page, use what the majority is saying; true, but if you are looking for fast-rising pages per word...&lt;br /&gt;&lt;br /&gt;I don't have an answer to that; it looks like I'm not the only one, and that's not really a comforting thought as to the future of search.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-7624448831563465652?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/7624448831563465652/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=7624448831563465652' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/7624448831563465652'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/7624448831563465652'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2009/01/error.html' title='Error'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-7538500454046856500</id><published>2009-01-06T23:33:00.003+01:00</published><updated>2009-01-06T23:59:52.739+01:00</updated><title type='text'>Security concepts and an open source search engine</title><content type='html'>Tonight I was reading a very long list of comments on what &lt;a href="http://www.readwriteweb.com/archives/how_to_build_an_open_source_google.php"&gt;an ideal distributed open source search engine&lt;/a&gt; could be.&lt;br /&gt;&lt;br /&gt;The interesting thing, reading the comments, is how it relates to security. Let me explain.&lt;br /&gt;&lt;br /&gt;The main argument for why an open source search engine (even more so a distributed one) can't work in practice is that when you know how the thing works, you can easily influence the results (i.e. spam). And then people begin to praise the "security through obscurity" that the major search engines have: it is, according to them, the best way to preserve security.&lt;br /&gt;Needless to say, this is wrong. If it were right, big companies wouldn't be spending money optimizing their ranking, especially if that didn't work at all. Even if you consider the "moron factor", it's too easy to see whether it's effective: run a search and see if you are on the first page.&lt;br /&gt;&lt;br /&gt;So, obviously, even for ranking, security through obscurity doesn't work.&lt;br /&gt;&lt;br /&gt;As a reminder, the most widely used library for secure communication, OpenSSL, whose source code is widely available and whose encryption algorithms are known, isn't (officially at least) easily cracked. True, there's a lot of money involved in being on the major search engines' first page, and people are desperate to get there. 
It's also true that brilliant guys spend their days trying to break that OpenSSL thing.&lt;br /&gt;&lt;br /&gt;So, maybe that's one of the right goals for the next big search engine: a ranking algorithm that can't be subverted, even if you know precisely what the algorithm is.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-7538500454046856500?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/7538500454046856500/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=7538500454046856500' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/7538500454046856500'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/7538500454046856500'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2009/01/security-concepts-and-open-source.html' title='Security concepts and an open source search engine'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-8526922545572687127</id><published>2008-11-29T04:05:00.003+01:00</published><updated>2009-01-14T01:20:20.268+01:00</updated><title type='text'>Encoding</title><content type='html'>A friend of mine reminded me tonight about something obvious I had to deal with: encodings. 
Let's recap.&lt;br /&gt;&lt;br /&gt;First, the Computer made ASCII and ANSI. And the Computer saw it was good.&lt;br /&gt;And the Computer said "Let there be KOI", and there was KOI.&lt;br /&gt;And the Computer saw KOI was good. And the Computer separated KOI from the others. (...) And the Computer said "Let there be BIG-5 in the midst of the sea of encodings and let it separate the encodings from the dark encodings". And the Computer made the firmament and separated the encodings which were under the firmament from the encodings which were above the firmament.(...)&lt;br /&gt;And the Computer saw everything he had made, and behold, it was very good. And there was evening, and there was morning, a sixth day. (...)&lt;br /&gt;&lt;br /&gt;To make a long story short, at one point, thanks to the Computer's in-depth look and long-term view, we get to the tale of the tower of Babel.&lt;br /&gt;&lt;br /&gt;And then, well, we get to the search engine. It's bad enough already that websites all around the earth use different encodings, but, to make matters worse, everyone seems to claim their encoding is obviously the right one.&lt;br /&gt;And that's where the "obvious" seriously gets in the way. How is one supposed to know that a Russian website hosted in the US is using the latin1 encoding ? Or that a Korean website hosted in Japan is using the iso-8859-1 encoding ?&lt;br /&gt;In case you think that it's easy, consider that the page itself is advertising yet another, impossible encoding.&lt;br /&gt;&lt;br /&gt;Do you think I'm overdoing it ?&lt;br /&gt;I have 7 million pages for you.&lt;br /&gt;Anyone who has a good generic answer for that one gets a free beer on me. (International shipping is ok). 
And, no, dropping the pages which are that crazy is not a satisfactory answer.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-8526922545572687127?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/8526922545572687127/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=8526922545572687127' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/8526922545572687127'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/8526922545572687127'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2008/11/encoding.html' title='Encoding'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-3851360721258331312</id><published>2008-11-15T16:16:00.003+01:00</published><updated>2009-01-14T01:21:18.814+01:00</updated><title type='text'>Thank you !</title><content type='html'>I was very surprised by the very warm welcome our presentation received, by how many questions it sparked, and by the encouragement.&lt;br /&gt;&lt;br /&gt;Thank you all for this :)&lt;br /&gt;&lt;br /&gt;The other presentations were really interesting, and we were nervous to be the ones speaking after a great presentation about The New York Times and how much they embrace the web.&lt;br /&gt;&lt;br /&gt;You 
will see me at the next Ignite session for sure :)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-3851360721258331312?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/3851360721258331312/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=3851360721258331312' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/3851360721258331312'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/3851360721258331312'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2008/11/thank-you.html' title='Thank you !'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-6368575465354286217</id><published>2008-11-13T23:19:00.003+01:00</published><updated>2008-11-13T23:50:45.784+01:00</updated><title type='text'>OsO @ Ignite Paris #3</title><content type='html'>&lt;a href="http://www.m--x--m.net/"&gt;Nicolas Toper&lt;/a&gt; and I will be giving a short presentation on &lt;a href="http://oso.tikuts.com/"&gt;OsO&lt;/a&gt; in Paris for the third &lt;a href="http://ignite.oreilly.com/"&gt;Ignite&lt;/a&gt; event there (more information &lt;a href="http://ignite.oreilly.com/2008/11/ignite-is-back-in-paris-1.html"&gt;here&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;The rules are to give a 
presentation with 20 slides in exactly 5 minutes. The presentation can be about anything (geeky), so we chose to do ours on my search engine and some of the hard lessons learned.&lt;br /&gt;You can find the slides &lt;a href="http://oso.tikuts.com/ignite3.pdf"&gt;here&lt;/a&gt; to get a feel for what we are going to talk about.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-6368575465354286217?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/6368575465354286217/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=6368575465354286217' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/6368575465354286217'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/6368575465354286217'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2008/11/oso-ignite-paris-3.html' title='OsO @ Ignite Paris #3'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-3839834564207361837</id><published>2008-07-23T00:08:00.005+02:00</published><updated>2008-11-12T21:40:52.214+01:00</updated><title type='text'>By the Book (of law)</title><content type='html'>A few weeks ago, I was quite excited about a "good idea" I had.&lt;br /&gt;I spent some time on &lt;a 
href="http://www.alexa.com/"&gt;alexa&lt;/a&gt;, and looked at what the most popular sites are. It seemed that anything that had to do with news got the biggest audience.&lt;br /&gt;&lt;br /&gt;So I said to myself, this looks like a good idea: something of interest, a "small" corpus and a lot to do with natural language processing.&lt;br /&gt;&lt;br /&gt;My tools being quite modular, I rapidly had a news search engine at my fingertips with about 40 different sources, and even toyed with graphs comparing term frequency in articles, to be used with anything from brands to politicians.&lt;br /&gt;&lt;br /&gt;I thought to myself at first, well, I've got a nice idea (something a bit more elaborate than the usual news search engine) and it's going to be some sort of win/win strategy: the newspapers will attract readers to articles they might have missed and are interested in, the newspapers will generate more revenue, and my idea will bring me traffic.&lt;br /&gt;Remembering a few articles I read some time ago, I did a quick search on ongoing trials about this kind of tool. There are a few, for large sums of money. The Belgian press syndicate seems to refuse any link to their newspapers. It seemed quite ridiculous.&lt;br /&gt;&lt;br /&gt;Then I took a look at the &lt;a href="http://www.law.cornell.edu/treaties/berne/overview.html"&gt;Berne Convention&lt;/a&gt;. A news search engine could fall in the category of "fair use", the result being "quotations from newspaper articles and periodicals in the form of press summaries" (&lt;a href="http://www.law.cornell.edu/treaties/berne/10.html"&gt;Article 10&lt;/a&gt;). But it might not.&lt;br /&gt;French law, for instance, can be even more restrictive: displaying the number of words of an article can be considered a "transformation" (Article L122-4 of the Code de la Propriété Intellectuelle) and thus forbidden. Or showing the size of a document, as just about any search engine does. And it goes on and on. 
Any &lt;span style="font-style: italic;"&gt;f(document)&lt;/span&gt; could be illegal.&lt;br /&gt;&lt;br /&gt;True, I could just contact every newspaper and wait for their answer. For whatever reason, I don't expect any answer.&lt;br /&gt;&lt;br /&gt;If someone is interested in developing that project, that's great, the code is ready. I'm off to other territories for the time being, at least until a few trials come to an end.&lt;br /&gt;&lt;br /&gt;Next time, things will be technical again.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-3839834564207361837?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/3839834564207361837/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=3839834564207361837' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/3839834564207361837'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/3839834564207361837'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2008/07/by-book-of-law.html' title='By the Book (of law)'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-4853438650614817806</id><published>2008-06-02T00:51:00.001+02:00</published><updated>2008-06-02T00:54:05.102+02:00</updated><title type='text'>Over five million 
pages</title><content type='html'>The search engine now handles more than five million pages. The global performance is ok.&lt;br /&gt;Stay tuned, I wish to give you exciting news in a few weeks.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-4853438650614817806?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/4853438650614817806/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=4853438650614817806' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/4853438650614817806'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/4853438650614817806'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2008/06/over-five-million-pages.html' title='Over five million pages'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-5585469489911896489</id><published>2008-02-18T00:46:00.003+01:00</published><updated>2008-06-02T00:51:05.088+02:00</updated><title type='text'>Close to three million</title><content type='html'>Despite being quite busy with other projects lately, the last index update now has about 3 million pages, and response time is ok when doing a query (about half a second), but far from great. 
Multi-word search is now available.&lt;br /&gt;&lt;br /&gt;A lot of work is still ahead.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-5585469489911896489?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/5585469489911896489/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=5585469489911896489' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/5585469489911896489'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/5585469489911896489'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2008/02/close-to-three-million.html' title='Close to three million'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-5082501102744706766</id><published>2007-10-09T21:10:00.001+02:00</published><updated>2008-07-22T23:39:51.319+02:00</updated><title type='text'>Update</title><content type='html'>It's been a while since my last post.&lt;br /&gt;Things are going more slowly these days, although rather smoothly.&lt;br /&gt;So far I have implemented the three main user features I wished to have: text search, image search and video search.&lt;br /&gt;&lt;br /&gt;Right now I'm starting to fill up my disks. 
About 5 million URLs should do to start with, and should reveal a few things that don't cope adequately with that number of pages. I worked hard to be safe on all fronts, but working for years with computers taught me to expect (at least) the unexpected.&lt;br /&gt;&lt;br /&gt;Let me give you a few numbers about the raw data. These are figures from an &lt;span style="font-style: italic;"&gt;inaccurate&lt;/span&gt; (emphasis intended) sample of 200 000 pages from the web.&lt;br /&gt;&lt;br /&gt;Mean page size: 14.5 KB&lt;br /&gt;Mean number of words per page: 449&lt;br /&gt;Mean unique words per page: 216&lt;br /&gt;Mean number of links per page: 35&lt;br /&gt;Mean number of JPEGs per page: 19&lt;br /&gt;&lt;br /&gt;What do these numbers mean ? I don't think they mean a lot. I expect them to vary greatly with time.&lt;br /&gt;The number of links per page, for instance, is suspiciously high. The data should be refined into three categories of unique links: the number of links in a page, the number of links pointing inside the website, and the number of links pointing outside the website. Here, two links pointing to the same page count as two links instead of one. 
The same applies to the JPEG.&lt;br /&gt;&lt;br /&gt;I will try to update the stats from time to time to let you know how crude or accurate this first sample was, as well as other interesting stats.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-5082501102744706766?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/5082501102744706766/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=5082501102744706766' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/5082501102744706766'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/5082501102744706766'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2007/10/update.html' title='Update'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-7594930713018017038</id><published>2007-09-19T23:46:00.000+02:00</published><updated>2007-09-22T19:44:21.543+02:00</updated><title type='text'>Optimizing: things that matter</title><content type='html'>In my previous post, I explained how to optimize (broadly speaking) the initialization of a block of data.&lt;br /&gt;I also mentioned that it was useless. 
There are at least two reasons for that:&lt;br /&gt;&lt;br /&gt;- Initializing data can't (and shouldn't) be the costliest part of your program:&lt;br /&gt;If you are doing some string processing or some computations on an array, initializing the array is unlikely to be the part that requires the most instructions. You should instead focus on optimizing your algorithms.&lt;br /&gt;&lt;br /&gt;- Do your best to never have to initialize an array to a uniform value more than once in a program's lifetime:&lt;br /&gt;This may sound surprising, even unfeasible. However, if initializing happens very often in your program, then truth #1 doesn't hold anymore, and optimizing your algorithms now includes optimizing the initialization: choosing an algorithm where there is no need, or only a very minimal need, to initialize data.&lt;br /&gt;&lt;br /&gt;This will lead us to our next article, on optimizing the execution time of scripting languages.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-7594930713018017038?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/7594930713018017038/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=7594930713018017038' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/7594930713018017038'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/7594930713018017038'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2007/09/optimizing-things-that-matter.html' title='Optimizing: things that 
matter'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-8628376011389103276</id><published>2007-09-19T22:38:00.007+02:00</published><updated>2008-07-23T01:48:24.130+02:00</updated><title type='text'>Optimizing: bzero/memset, loops and beyond</title><content type='html'>Something nice about open source is that one gets to see how others do things and learn from it: sometimes to keep good ideas, sometimes to think about why it was written that way, and to improve one's own programs. I could go on about licences, but that will be for another time.&lt;br /&gt;&lt;br /&gt;Let's consider we have a reasonably small block of data, say an array of characters in which we wish to initialize every byte to 0, and that we are using a lower level language such as C.&lt;br /&gt;&lt;br /&gt;To do this, we could write it that way:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:verdana;"&gt;for (i=0; i&amp;lt;ArraySize; i++)&lt;br /&gt;&lt;span style="font-family:verdana;"&gt;&amp;nbsp;&amp;nbsp;data[i] = '\0';&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Rather straightforward and efficient, gets the job done you say ?&lt;br /&gt;Well, first, there is nothing really obvious when reading the code about what it's actually doing. Out of context, it's not necessarily self-explanatory.&lt;br /&gt;It's also easy to see that if your processor can handle words bigger than 8 bits, you could be using two, four or eight times more instructions than needed.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Let's write it that way:&lt;br /&gt;&lt;br /&gt;// Initialize the array to the null character&lt;br /&gt;memset (data, '\0', ArraySize * sizeof (char));&lt;br /&gt;&lt;br /&gt;What have we gained ? 
Well, maybe we only added the cost of a function call. Let's take a look at OpenBSD's implementation of memset:&lt;br /&gt;&lt;br /&gt;$ less /usr/src/lib/libc/string/memset.c&lt;br /&gt;/*      $OpenBSD: memset.c,v 1.5 2005/08/08 08:05:37 espie Exp $ */&lt;br /&gt;(full file and licence header at:&lt;br /&gt;&lt;a href="http://www.openbsd.org/cgi-bin/cvsweb/src/lib/libc/string/memset.c"&gt;http://www.openbsd.org/cgi-bin/cvsweb/src/lib/libc/string/memset.c&lt;/a&gt;&lt;br /&gt;*/&lt;br /&gt;&lt;br /&gt;#include &amp;lt;string.h&amp;gt;&lt;br /&gt;&lt;br /&gt;void *&lt;br /&gt;memset(void *dst, int c, size_t n)&lt;br /&gt;{&lt;br /&gt;&lt;br /&gt;if (n != 0) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;char *d = dst;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;do&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;*d++ = c;&lt;br /&gt;&amp;nbsp;&amp;nbsp;while (--n != 0);&lt;br /&gt;}&lt;br /&gt;return (dst);&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;OK, that's even more cryptic than our own version. In some way better though, as we aren't incrementing a 'useless' variable. But this is disappointing optimization-wise. That was useless; or was it ? Let's look at the &lt;a href="http://www.freebsd.org/cgi/cvsweb.cgi/src/lib/libc/string/memset.c"&gt;FreeBSD implementation of memset&lt;/a&gt;. Now things get interesting: memset is adapted to the machine's native word size. No more byte-by-byte initialization.&lt;br /&gt;&lt;br /&gt;That's one lesson: if it's something common, someone has probably already tried to write it right. There is no need on your part to take the risk of introducing new bugs or security holes.&lt;br /&gt;&lt;br /&gt;Let's consider we are the proud owner of an amd64 processor. Wait, there is a special &lt;a href="http://www.freebsd.org/cgi/cvsweb.cgi/src/lib/libc/amd64/string/memset.S"&gt;implementation in assembly&lt;/a&gt; for that. Is it just C written in assembly ? 
The answer is no: the assembly uses &lt;a href="http://docs.sun.com/app/docs/doc/817-5477/6mkuavhri?a=view"&gt;shrq&lt;/a&gt; to divide the byte count by eight, then fills most of the block eight bytes at a time.&lt;br /&gt;&lt;br /&gt;Another lesson: if the action is common enough, the processor will know how to do it natively. And the guys who wrote your OS' libc will likely have noticed that (your mileage may vary).&lt;br /&gt;&lt;br /&gt;The good thing is, by using these "high level" functions, you can keep your code readable, you do not need to spend as much time debugging it, it remains portable, you don't need to speak assembly fluently or have in-depth knowledge of the processor you are targeting, and depending on the OS' libc you get the opportunity to benefit from clever improvements made in the libc code and at the assembly level.&lt;br /&gt;&lt;br /&gt;Will you ever think about bypassing the libc functions again ? :)&lt;br /&gt;&lt;br /&gt;Next, why this optimization was useless...&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-8628376011389103276?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/8628376011389103276/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=8628376011389103276' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/8628376011389103276'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/8628376011389103276'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2007/09/optimizing-bzeromemset-loops-and-beyond.html' title='Optimizing: bzero/memset, loops and 
beyond'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-2285184499223757777</id><published>2007-09-11T23:20:00.000+02:00</published><updated>2007-09-11T23:32:28.695+02:00</updated><title type='text'>Reading</title><content type='html'>I stumbled upon a very interesting &lt;a href="http://acmqueue.com/modules.php?name=Content&amp;pa=showpage&amp;amp;pid=143"&gt;article by Anna Patterson&lt;/a&gt;. It's one of the best articles I've read on the subject. It's not really technical and doesn't go into details, but it is very encouraging, and when writing a search engine, there are high walls to break down.&lt;br /&gt;&lt;br /&gt;Incidentally, I discovered this article because their crawler visited my websites.&lt;br /&gt;It seems their &lt;a href="http://www.cuill.com"&gt;company&lt;/a&gt; is the new buzz in town.&lt;br /&gt;I have my doubts (probably disgruntled employees and a revolutionary app do not sound like a new Google to me).&lt;br /&gt;&lt;br /&gt;But, well, as her article points out, work, technique and perseverance are key.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-2285184499223757777?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/2285184499223757777/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=2285184499223757777' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/7083275890674905250/posts/default/2285184499223757777'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/2285184499223757777'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2007/09/reading.html' title='Reading'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-3603900530594913440</id><published>2007-09-08T20:06:00.001+02:00</published><updated>2008-07-22T23:50:52.163+02:00</updated><title type='text'>Being out of url</title><content type='html'>A friend of mine asked me one day if I wasn't afraid of running out of URLs: finally reaching dead ends.&lt;br /&gt;&lt;br /&gt;After all, the question could be pertinent: one could imagine that if everyone shared the same interests, this could happen: many sites, all linking to a handful of well-known sites.&lt;br /&gt;&lt;br /&gt;I would be glad if that were true: that would give me a stable pool of pages to work on.&lt;br /&gt;&lt;br /&gt;The truth is, one of the first issues is coping with huge quantities of URLs, internal and external. 
Check this blog for instance, it's linking to at least ten websites.&lt;br /&gt;&lt;br /&gt;I will post a few statistics later on.&lt;br /&gt;&lt;br /&gt;For now, the answer is "no, I'm far from being out of URLs".&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-3603900530594913440?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/3603900530594913440/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=3603900530594913440' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/3603900530594913440'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/3603900530594913440'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2007/09/being-out-of-url.html' title='Being out of url'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-3973169130222626840</id><published>2007-08-26T23:40:00.001+02:00</published><updated>2008-07-23T01:36:19.630+02:00</updated><title type='text'>building a spider in python</title><content type='html'>This was the first set of keywords that someone in the US typed to reach my blog. He must have been disappointed: there isn't much here about writing a spider in Python, except that there is a useful class which wasn't that useful to me. 
Maybe he didn't care to know more than that, maybe he did.&lt;br /&gt;&lt;br /&gt;Prototype 1 had its spider written in Python.&lt;br /&gt;&lt;br /&gt;It is quite easy to do so: the &lt;a href="http://docs.python.org/lib/module-httplib.html"&gt;httplib&lt;/a&gt; module is easy to use, it has all the useful options one might need (or at least that I needed), I couldn't find any bug in it, and its use is straightforward.&lt;br /&gt;&lt;br /&gt;It gets the job done, leaving the developer free to focus on what to do with the data.&lt;br /&gt;&lt;br /&gt;If you are considering Python to write a spider, go for it. And if you don't trust my word (yet), you will see a good number of major search engines using Python to write that component.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-3973169130222626840?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/3973169130222626840/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=3973169130222626840' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/3973169130222626840'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/3973169130222626840'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2007/08/building-spider-in-python.html' title='building a spider in python'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-9099603782134281479</id><published>2007-08-24T01:00:00.000+02:00</published><updated>2007-08-24T01:01:00.349+02:00</updated><title type='text'>Even fortune said so</title><content type='html'>43rd Law of Computing:&lt;br /&gt;        Anything that can go wr&lt;br /&gt;fortune: Segmentation violation -- Core dumped&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-9099603782134281479?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/9099603782134281479/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=9099603782134281479' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/9099603782134281479'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/9099603782134281479'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2007/08/even-fortune-said-so.html' title='Even fortune said so'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-8411424184280110657</id><published>2007-08-22T19:33:00.000+02:00</published><updated>2007-08-23T20:53:37.444+02:00</updated><title type='text'>Garbage collectors and 
memory</title><content type='html'>Is a garbage collector as usually found in interpreted languages enough not to worry about memory ?&lt;br /&gt;&lt;br /&gt;The tempting and common answer is yes, the garbage collector is here to take charge of this, and is here exactly to solve that problem.&lt;br /&gt;&lt;br /&gt;However, when writing software that needs to be speed and memory efficient (and it's really easy to write a program which rapidly needs a bit of tuning, no rocket launch has to be involved), it is usually a good thing to keep memory allocation in mind: how many scattered objects, small and big, are being dealt with during the program execution, how often new/delete happens, and so on. It's usually rather easy to write a beast which will need gigabytes of memory, or one which allocates and frees gigabytes of memory every few seconds, or code where the garbage collector will never get a chance to work.&lt;br /&gt;&lt;br /&gt;Here are a few examples of memory going wrong:&lt;br /&gt;- Long-running functions that are not tail recursive in &lt;a href="http://www.erlang.org/"&gt;Erlang&lt;/a&gt; keep their stack frames alive and never give the garbage collector a chance to free the memory&lt;br /&gt;- Creating and destroying objects in &lt;a href="http://www.java.com/"&gt;Java&lt;/a&gt; for, say, processing high frequency requests, will make the garbage collector allocate and free huge chunks of memory&lt;br /&gt;- Arrays that can get huge in a &lt;a href="http://www.php.net/"&gt;php&lt;/a&gt; command line script, this language being usually used to handle short-lived requests&lt;br /&gt;- SQL result sets that get bigger with time (and eventually don't even fit in memory)&lt;br /&gt;- Add your own...&lt;br /&gt;&lt;br /&gt;Java programmers beware: I've seen that example happen more than once.&lt;br /&gt;I had a good laugh. 
The author of the program did not.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-8411424184280110657?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/8411424184280110657/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=8411424184280110657' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/8411424184280110657'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/8411424184280110657'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2007/08/garbage-collectors-and-memory.html' title='Garbage collectors and memory'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-3233005263078655941</id><published>2007-08-22T19:00:00.000+02:00</published><updated>2007-08-23T20:56:52.401+02:00</updated><title type='text'>Memory leaks</title><content type='html'>Let's consider memory leaks in the document parser.&lt;br /&gt;&lt;br /&gt;They are usually easy to track: the memory behavior goes wrong, with more memory being used than there should be. This good thing usually happens fast for at least two reasons: time and code coverage. 
As opposed to some software which will show it has memory leaks only after a few days, a parser shows it in a few minutes: the number of documents to process will be high in a short span of time. They are also easy to see because of code coverage: if anything could trigger a memory leak, it will.&lt;br /&gt;&lt;br /&gt;When these memory leaks happen, they are either obvious, or really hard to track. Whichever way, it consumes time.&lt;br /&gt;&lt;br /&gt;What could be a solution to this problem then ?&lt;br /&gt;&lt;br /&gt;The methodology I've used, and which has worked quite well so far, I will call the "door algorithm". When you open a door, you close it after yourself.&lt;br /&gt;With memory, it's the same reflex, Allocate/Free. Be a good boy, or a good girl. Also, "standardize" where memory gets allocated and freed. Allocate as much as possible in the "constructor", free it all in the "destructor" (quotes applying if you are not using an object oriented language).&lt;br /&gt;&lt;br /&gt;Wouldn't it be more efficient and less error-prone to rely only on tools such as &lt;a href="http://valgrind.org/"&gt;valgrind&lt;/a&gt; ?&lt;br /&gt;Well, the two methods, I think, must walk hand in hand: these tools do a good job at tracking memory leaks, but they can give false positives, and they can miss leaks too. 
We talked about Murphy's law before.&lt;br /&gt;And, above all, I think these tools must be used as a parachute, not as the only programming strategy (referring to debug time, to say the least).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-3233005263078655941?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/3233005263078655941/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=3233005263078655941' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/3233005263078655941'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/3233005263078655941'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2007/08/memory-leaks.html' title='Memory leaks'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-6470233048446706408</id><published>2007-08-15T23:23:00.001+02:00</published><updated>2007-08-26T23:55:06.426+02:00</updated><title type='text'>Choosing a library</title><content type='html'>Each time I want to add a new feature to the search engine, I try to look at what already exists out there.&lt;br /&gt;&lt;br /&gt;I usually find a few libraries close enough to my needs, thinking it will take less time to learn their API and integrate them than writing my 
own libraries.&lt;br /&gt;&lt;br /&gt;So far, contrary to popular belief, I'm not sure that this strategy really paid off. I will give a few examples of where and how it failed to "work for me". The time it took me to realize a library wouldn't fit my needs varied greatly, as did the reason I finally did not use it, or did.&lt;br /&gt;&lt;br /&gt;- &lt;a href="http://www.gnu.org/software/wget/wget.html"&gt;wget&lt;/a&gt;: Although not strictly speaking a library, I thought about wget for a short time to avoid writing my own spider. The prospect of having millions of directories stopped me early enough. The time it took me to evaluate that solution was the time it took me to shut out bad advice.&lt;br /&gt;I could not see a proper way to make it reasonably useful.&lt;br /&gt;&lt;br /&gt;- &lt;a href="http://docs.python.org/lib/module-robotparser.html"&gt;&lt;b&gt;&lt;tt id="l2h-2072" class="class"&gt;RobotFileParser&lt;/tt&gt;&lt;/b&gt;&lt;/a&gt; in &lt;a href="http://www.python.org/"&gt;python&lt;/a&gt;: Prototype 1 had a spider written in Python; reading the documentation did not put me at ease: the robots class has no cache. I could pickle it, true, but that would soon have trouble fitting in memory. I could also fetch the robots.txt every time, but, obviously, the fewer files to fetch, the happier one is, considering that the robots file is supposed to be read only &lt;a href="http://www.robotstxt.org/wc/norobots-rfc.html"&gt;once a week&lt;/a&gt;.&lt;br /&gt;This was not the behaviour I expected, and it would have soon enough created memory troubles.&lt;br /&gt;&lt;br /&gt;- &lt;a href="http://libots.sourceforge.net/"&gt;libots&lt;/a&gt;: I found this library browsing &lt;a href="http://www.freshmeat.net/"&gt;freshmeat&lt;/a&gt; one day. I tried it using the command line. It seemed a good idea, and seemed to work well. Then, I wrote a module using this library. 
After a few thousand pages, bang, segfault somewhere in the &lt;a href="http://developer.gnome.org/doc/API/glib/"&gt;glib&lt;/a&gt;. Now, given what I think of the glib, and that this libots thing wanted to impose its lousy license on my software, I forgot about it.&lt;br /&gt;True, it might have been me doing something wrong with the library. Given that it crashed when loading its dictionary, after parsing a thousand pages, I didn't feel that bad about myself.&lt;br /&gt;My conclusion there would be that a third-party library has to live up to some standards that aren't necessarily achieved by the library you will find round the corner.&lt;br /&gt;&lt;br /&gt;- &lt;a href="http://curl.haxx.se/libcurl/"&gt;libcurl&lt;/a&gt;: With libcurl, things get more subtle. It is a broadly used and very well tested library. But it has its occasional segfault in various contexts. To quote one: threads. This library doesn't seem to be designed from the ground up for threads. One has to carefully read the small print at the end of the documentation pointing &lt;a href="http://www.openssl.org/docs/crypto/threads.html"&gt;there&lt;/a&gt; to find that if one wishes to handle ssl, there is some weird code to add. I'm saying this because I cannot really see a reason why it's not already in the lib (be it libcurl or openssl). And no, it does not prevent libcurl from segfaulting.&lt;br /&gt;Although writing such a library is a huge amount of work, I spent so much time trying to find the reason why it would segfault that it makes me wonder if I wouldn't have been better off writing my own library.&lt;br /&gt;&lt;br /&gt;- &lt;a href="http://crawler.archive.org/"&gt;heritrix&lt;/a&gt;: This is a crawler I discovered recently, and I said to myself, well, these guys must know what they are doing. It's in Java. It has a nice interface, rather cryptic if you didn't write the software. I gave it a try. 
I couldn't make it work (maybe java on a freebsd/amd64 is the reason, but still...), and it doesn't seem to fit my needs.&lt;br /&gt;This is a good answer to the common question "but how on earth didn't you know about that thing ?".&lt;br /&gt;&lt;br /&gt;- &lt;a href="http://www.mozilla.org/projects/nspr/"&gt;nspr&lt;/a&gt;: This might be the best stuff I stumbled upon. True, I don't use it very much, but it always does its job properly.&lt;br /&gt;However, fcgi did not like at all being linked to it (if that information might save you a few hours wondering why you can only have one fcgi process running)&lt;br /&gt;&lt;br /&gt;- &lt;a href="http://www.fastcgi.com/"&gt;fcgi&lt;/a&gt;: This might be the worst code I've ever seen. I discovered things that I couldn't imagine could be done in C.&lt;br /&gt;Seems to work though, rather efficiently (I wonder how), and gets the job done (modulo side effects due to its rather poor coding mentioned above).&lt;br /&gt;&lt;br /&gt;What seems to be the pattern here ? Cutting-edge libraries should most of the time be avoided, a widely used library might have its bugs, a library that has been around for a long time might have its uses, mammoths are mammoths, and some guys will always know the perfect-library-you-did-not-know-about (which is most of the time useless).&lt;br /&gt;&lt;br /&gt;The trick is being able to evaluate fast enough the time it will take to integrate that library and solve bugs with it, as well as the time it would take to write it yourself.&lt;br /&gt;&lt;br /&gt;The best strategy I've found so far to deal with this issue is to keep in mind these examples, and to evaluate the category the library falls into. 
And to act accordingly.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-6470233048446706408?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/6470233048446706408/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=6470233048446706408' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/6470233048446706408'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/6470233048446706408'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2007/08/choosing-library-code-coverage.html' title='Choosing a library'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-593426154633959452</id><published>2007-08-15T22:49:00.000+02:00</published><updated>2007-08-16T21:13:31.212+02:00</updated><title type='text'>Code coverage</title><content type='html'>One quite challenging issue with a search engine is code coverage.&lt;br /&gt;&lt;br /&gt;At the internet company I currently work for, any piece of software will go through the hands of a few million unique users each month. So, right, I'm used to something called code coverage.&lt;br /&gt;However, there is one small thing that eases the challenges: the user's data is rather easily processed. 
If it's an email, it's easily checked against a pattern, if it's some free text, let's strip html tags and so on. Let's warn the user we can't accept the data as it is. A few tricks will be involved, but the final data being narrowly defined, one usually gets it right rather easily.&lt;br /&gt;&lt;br /&gt;And then comes html. And those who write it. Two main factors, to me, change the rules of the game:&lt;br /&gt;1) it doesn't really matter whether the page is well formed or not, we want to get some useful data out of it&lt;br /&gt;2) any error during the parsing will potentially corrupt gigabytes of data&lt;br /&gt;&lt;br /&gt;The first point is interesting: one could say, if this xhtml page is not well-formed, then let's forget about it. Right. This would mean ignoring a large part of the web. Same goes for pages supposedly html formatted.&lt;br /&gt;I won't rant about encodings, such as a page encoded in UTF-16 with a meta indicating a UTF-8 encoding.&lt;br /&gt;Point is, you are going to get anything really. And it's a much better test than some random data one would feed to one's parser: people seem to know what will bother you when writing your search engine: bugs, "impossible" code paths, or syntax you wouldn't dare to imagine. You cannot name it yet, you will soon.&lt;br /&gt;&lt;br /&gt;Playing with the parser would be an amusing masochistic game if it "only" had the impact of making one lose data. 
It also feeds corrupt data to the rest of the chain.&lt;br /&gt;Then the URL database is corrupt and the spider frantically tries to fetch pages that don't exist, making you lose time, wasting space and corrupting the final database.&lt;br /&gt;&lt;br /&gt;Murphy's law being the only law that stands, the corrupt data will show up on the first results page an alpha user sees (even though you had already spent nights hunting for bugs and, in the end, couldn't find any corrupt data).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-593426154633959452?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/593426154633959452/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=593426154633959452' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/593426154633959452'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/593426154633959452'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2007/08/code-coverage.html' title='Code coverage'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7083275890674905250.post-8678754000127781342</id><published>2007-08-09T22:51:00.000+02:00</published><updated>2007-08-09T22:55:10.801+02:00</updated><title type='text'>Prototype 1</title><content 
type='html'>Hi,&lt;br /&gt;&lt;br /&gt;I will try to post my views on, and day-to-day experience of, building a search engine on this blog as often as I can.&lt;br /&gt;&lt;br /&gt;Expect this blog to talk mostly about parallelism, distribution, I/O troubles, data integrity strategies, code coverage and much more.&lt;br /&gt;&lt;br /&gt;Right, the simplest things, just one click away, are very likely the most complex things to implement.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7083275890674905250-8678754000127781342?l=buildingasearchengine.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://buildingasearchengine.blogspot.com/feeds/8678754000127781342/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7083275890674905250&amp;postID=8678754000127781342' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/8678754000127781342'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7083275890674905250/posts/default/8678754000127781342'/><link rel='alternate' type='text/html' href='http://buildingasearchengine.blogspot.com/2007/08/prototype-1.html' title='Prototype 1'/><author><name>buildingasearchengine</name><uri>http://www.blogger.com/profile/14808785305435773279</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
