2013/10/22

The boss issue with distributing (computational) work

Here in France (your mileage may vary), the boss gives jobs to subordinates. That's what subordinates are for, and that's what a boss is here for.

Now, as some of you might have noticed, there are a few glitches in that scenario. Most of the time, the boss does not correctly estimate the time needed for a task when he does it on his own. And if he asks, it means hours of meetings. Not very efficient.

From the boss's point of view, there is no doubt that the task must take the estimated time: if it takes more time, it's because the worker is lazy; if it takes less time than estimated, good, time to drown the guy in work. End of story.

So far, this all looks like the usual ranting about a pyramidal work organization. But here comes the real stuff: when distributing computational workload across machines, the same approach is, amazingly, actually used. And that does not make sense. Let me explain where the trouble is.

When doing computations on a set of computers (let's call that a computing farm for big data, to follow the hype), it's highly unlikely that each and every machine will have exactly the same computational power, memory or storage. It just does not happen in real life.
And even if it did, there would be different hop times between machines, network congestion, etc.
In a few words: the platform is bound to be heterogeneous to a noticeable degree.

Yet, most distributed systems fail to take that into account. There is always a grand master (or a boss, to fall back on the teamwork metaphor) which decides everything.
Not to mention that most will assume equal RAM, link speed, etc.
That model has a major hidden flaw at its root: the boss needs to know the status of each and every machine and to estimate, for each one, where to push jobs. In other words, it has to do a lot of tracking, guessing, ordering, keeping track of dead machines, etc.

What about reversing the model: the workers ask for work. What happens then ?
The only things the boss has to handle are a job queue and keeping track of dead tasks. Workers keep asking for jobs.
And that's it. The work is consumed at the full available computing power with very little management overhead. The fine tuning boils down to finding the right slice size of work.


This model is resilient and simple. Works well with multithreading too.

So, pile up tasks, and have your workers ask for the next task when they're done.
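
A minimal sketch of that model with POSIX threads, assuming the boss is nothing more than a shared counter protected by a mutex and the tasks are just numbered slices of work:

/* Worker-pull model: workers ask for the next task, the boss only holds a queue.
 * Build with: gcc -pthread workers.c -o workers
 */
#include <pthread.h>
#include <stdio.h>

#define N_TASKS   100
#define N_WORKERS   4

/* The "boss" is nothing more than a counter and a lock. */
static int next_task = 0;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* Workers pull tasks themselves instead of being assigned work. */
static void *worker (void *arg)
{
    long id = (long) arg;

    for (;;) {
        pthread_mutex_lock (&queue_lock);
        int task = next_task < N_TASKS ? next_task++ : -1;
        pthread_mutex_unlock (&queue_lock);

        if (task < 0)           /* queue exhausted, the worker retires */
            break;

        /* Do the actual work here; fast workers simply come back sooner. */
        printf ("worker %ld handles task %d\n", id, task);
    }
    return NULL;
}

int main (void)
{
    pthread_t workers[N_WORKERS];
    long i;

    for (i = 0; i < N_WORKERS; i++)
        pthread_create (&workers[i], NULL, worker, (void *) i);
    for (i = 0; i < N_WORKERS; i++)
        pthread_join (workers[i], NULL);

    return 0;
}

The exact same structure works across machines: replace the mutex-protected counter with a small job server handing out task ids, and keep a list of tasks to reissue when a worker dies.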

A simple libevent http get server example

I recently tried to serve json queries in the "lightest" way, with a simple C program, with as little overhead as possible (no fcgi module, no scripting, standalone).

It happens that libevent already has all one needs to do just that: answer http queries with as little overhead as possible, on top of an efficient networking library.

The trouble is, there is some documentation, but not that much, and a few pieces are missing when it comes to writing a simple GET http server.

So, here is a skeleton for a simple GET http server with libevent.
It's handy for a very simple GET json server for instance.
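
Something along these lines (a minimal sketch against libevent 2.x and its evhttp API; the port, the "term" parameter and the JSON payload are just placeholders):

/* A skeleton HTTP GET server with libevent 2.x (evhttp).
 * Build with: gcc server.c -levent -o server
 */
#include <stdio.h>
#include <event2/event.h>
#include <event2/http.h>
#include <event2/buffer.h>
#include <event2/keyvalq_struct.h>

/* Called for every request; we only care about GET and its query string. */
static void handle_request (struct evhttp_request *req, void *arg)
{
    const char *uri = evhttp_request_get_uri (req);

    /* Parse the query string, e.g. /search?term=foo */
    struct evkeyvalq params;
    evhttp_parse_query (uri, &params);
    const char *term = evhttp_find_header (&params, "term");

    /* Build the JSON answer. */
    struct evbuffer *buf = evbuffer_new ();
    evbuffer_add_printf (buf, "{\"term\": \"%s\"}", term ? term : "");

    evhttp_add_header (evhttp_request_get_output_headers (req),
                       "Content-Type", "application/json");
    evhttp_send_reply (req, HTTP_OK, "OK", buf);

    evbuffer_free (buf);
    evhttp_clear_headers (&params);
}

int main (void)
{
    struct event_base *base = event_base_new ();
    struct evhttp *http = evhttp_new (base);

    if (evhttp_bind_socket (http, "0.0.0.0", 8080) != 0) {
        fprintf (stderr, "could not bind to port 8080\n");
        return 1;
    }

    /* One generic callback answers every URI. */
    evhttp_set_gencb (http, handle_request, NULL);

    event_base_dispatch (base);
    return 0;
}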

Just in case that's of any help.

2013/08/19

Tips to build a corpus for statistical classification faster

Building a corpus for statistical classification is usually a tedious task. I will try to give you a few ideas to help speed that up.

As is usual with any optimization work: the fastest way to do something is not to do it at all.
Most of the time, the task has already been tackled, and with some luck, someone has already built a corpus and made it available. The chances may seem small, but it's more common than one could expect.
So, the first step is to use your favorite search engine, try different keywords, and look for it. If you can't find it, look harder, change keywords.

A ready-made corpus is not necessarily found where one would look in the first place. One such example is a classic used for training statistical machine translation engines: the United Nations speeches. All the speeches are translated into many languages (and a lot of speeches are made there). A variant could be the proceedings of the European Parliament, which are also translated into different languages. You get the idea.

A "new" trove of human knowledge is Wikipedia. This may seem obvious, but  until recently it was quite unusable, as it was poorly structured, if at all. This has changed, and, though minimalistic in many ways, the "infobox" tag for instance is gathering more and more useful information.
It is for those who are already a bit experienced with coping with size though (or have a lot of patience), as the English Wikipedia dump is, as of this writing, 8.3G bzipped (that's something above 40G). (Mahout has a nice ready-made splitter for the Wikipedia dump, here is a quick introduction to the stuff, it's really useful to have a smaller set for debugging for instance)

Now, if that's not enough to get the bulk of human information you need, things get tricky.
The usual, not so subtle, path then goes something like this:
    - write a quick viewer which allows you to easily classify content
    - befriend grep and pcre
    - build a set of "good" keywords
    - be even more creative
There's also the possibility of seeding your classification engine with a little data, running it on some more data, checking what is a positive that is not in your corpus yet, and then adding it.
That is _not_ a good idea as such, as it will bias your corpus. However, this can be mitigated by ignoring the best matching data and targeting the "more or less" matching data according to your classifier, as sketched below. You better check the maths though !
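
To make that concrete, here is a tiny sketch of the selection step (the document names, scores and thresholds are made up; the point is only to skip the near-certain matches and queue the borderline ones for manual review):

#include <stdio.h>

/* Hypothetical candidate documents with the score your classifier gave them. */
struct candidate { const char *id; double score; };

int main (void)
{
    struct candidate docs[] = {
        { "doc-1", 0.97 },  /* near-certain match: skip, it only reinforces the bias */
        { "doc-2", 0.62 },  /* borderline match: worth reviewing and adding          */
        { "doc-3", 0.15 },  /* probable negative: skip                               */
    };

    /* Arbitrary thresholds: only queue the "more or less" matching documents. */
    const double low = 0.5, high = 0.85;
    int i;

    for (i = 0; i < (int) (sizeof docs / sizeof docs[0]); i++)
        if (docs[i].score >= low && docs[i].score < high)
            printf ("review and maybe add to the corpus: %s (%.2f)\n",
                    docs[i].id, docs[i].score);

    return 0;
}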

Have more tips ? Share them in the comments :)

2013/07/24

How to get zendx_jquery_form_element_autocomplete to respond in real time

Where I work for a living now, they use Zend Framework and zendx_jquery_form_element_autocomplete. It's slow, and definitely not responding in real time. A quick test with Firebug shows a median response time of 300 ms. Far from the 200 ms needed to make it feel instantaneous.
Everyone says the database is slow, so I didn't take it further at first.

But, yes, the whole thing is slow. Something was missing an index, which I added, but the time gains were very slim. I found that suspicious. It's not a big deal, as it's an application with few users, so I concentrated on the auto-complete part, which is worth optimizing.

The index was right, the code was ok, and it still took 300 ms. It had to be the database. Or a slow machine executing the code. I blindly tried xcache, which saved about 100 ms. So, well, there was something else going on.

I decided to bypass everything and wrote a very simple script, doing the bare query and printing the result array as json, along the lines of this:


<?php
// Connection and database selection go here (mysql_connect, mysql_select_db)

$query = "SELECT name FROM table WHERE name LIKE '" . mysql_real_escape_string ($_REQUEST['term']) . "%'";
$my_res = mysql_query ($query);

$result = array ();
while ($item = mysql_fetch_array ($my_res))
    $result[] = $item['name'];

header ('Content-Type: application/json');
print json_encode ($result);

And I ran the thing in Firebug. Ta-da... It suddenly took only 6 ms. Right, that's a 50-fold time gain.

Is it clean ? No, not really. Is Zend Framework slowing your applications down ? Well, now you know how to deal with that one issue which matters.

2013/07/09

Building a corpus for classification: size isn't always the only problem

Building a meaningful corpus for classifying documents is a tedious task. Building a few is even more tedious.

That's an obvious fact.

The trouble is, as always with anything involving language: ambiguity eventually rules and corners you.

Let's say, for instance, that we are trying to classify newspaper articles into three categories: "business", "energy" and "environment".
That seems simple enough. The next article about Lady Gaga is out. The next article on coal prices is in. Right, it falls in the business section, as it's going to influence energy prices. Wait, energy prices, then it should also be in energy, right ? But won't coal prices influence the environment too, as renewables will be less competitive ?
Wait, once again, wouldn't classifying this under environment be biased ? What about global warming naysayers ?
What began as a simple and trivial classification finally ends up in a mess. The multiple-section facet is solved fast: use that article as a reference for all three categories, never mind the confusion it will potentially give the computer which has to decide which category the evaluated articles fall into. But what about the implicit point of view in this simple example, and its potential bias ?

If that seems too obvious (and in some ways, it is), let's try another one: identifying whether a search string refers to a person or to anything else. It's easy to define a person, right ?
Are Greek gods persons, for instance ? One can find pictures of them, they were treated as beings and interacted quite intimately with human beings. They could be persons then.
Let's say gods are people. Then, what about Allah, who isn't quite a person, but definitely a God to many people ?
There's also the more trivial case of movie and TV characters: should they be classified as beings, or ?...

Once again, the brain, with its seemingly endless capacity for confusion, easily puzzles the man trying to teach a thing or two to the computer.

2013/06/25

Searching images for portraits

Building my search engine drew me to try to index other things than raw text: images, videos, news, etc. It has been a while now since I worked on that, and I recently told myself that it's a shame not to do something useful and creative with the bunch of images I gathered years ago.
I had forgotten that I have more than 10 million images from all over the web. It would indeed be a shame not to do something with that.

Following what was done at the time, and not having so much time to look back at all the possibilities, I thought about what everyone was already doing: search by color tone, size, etc. Things that are (more or less) technically easy to do, but most of the time with very little value for the user.

And then, a simple thought came to mind: what about allowing search by portraits ? As a first step towards search by shapes and objects.

It's rather simple to do nowadays: you have probably stumbled upon one of those webcam programs which track faces: that's essentially the kind of software we need to find a portrait in an image.
If there's a face, and it's big enough, it must be a portrait.

So, the first big step is being able to recognize a face in an image. That's simple enough thanks to a great library which can (among a lot of other things) do just that: OpenCV. For that task, cvHaarDetectObjects is what we are looking for (that's the C name; the exact function/method name will vary according to your programming language of choice).
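
To give an idea, here is a minimal sketch using OpenCV's C API (the cascade file path, the detection parameters and the "big enough" threshold are arbitrary choices, and the trailing max_size argument only exists in recent OpenCV versions):

/* Minimal portrait detector with OpenCV's C API.
 * Build with something like: gcc portrait.c `pkg-config --cflags --libs opencv`
 */
#include <stdio.h>
#include <opencv/cv.h>
#include <opencv/highgui.h>

int main (int argc, char **argv)
{
    if (argc < 2) {
        fprintf (stderr, "usage: %s image.jpg\n", argv[0]);
        return 1;
    }

    IplImage *img = cvLoadImage (argv[1], CV_LOAD_IMAGE_COLOR);
    if (!img) {
        fprintf (stderr, "could not load %s\n", argv[1]);
        return 1;
    }

    /* The frontal face cascade shipped with OpenCV; the path may differ on your system. */
    CvHaarClassifierCascade *cascade = (CvHaarClassifierCascade *)
        cvLoad ("haarcascade_frontalface_alt.xml", NULL, NULL, NULL);
    if (!cascade) {
        fprintf (stderr, "could not load the cascade file\n");
        return 1;
    }
    CvMemStorage *storage = cvCreateMemStorage (0);

    /* Detect faces; older OpenCV versions take 7 arguments (no trailing max_size). */
    CvSeq *faces = cvHaarDetectObjects (img, cascade, storage,
                                        1.1, 3, CV_HAAR_DO_CANNY_PRUNING,
                                        cvSize (40, 40), cvSize (0, 0));

    int i;
    for (i = 0; i < (faces ? faces->total : 0); i++) {
        CvRect *r = (CvRect *) cvGetSeqElem (faces, i);
        /* "Big enough" heuristic: the face covers at least a third of the image width. */
        if (r->width * 3 >= img->width)
            printf ("probable portrait: face at (%d,%d) %dx%d\n",
                    r->x, r->y, r->width, r->height);
    }

    cvReleaseMemStorage (&storage);
    cvReleaseImage (&img);
    return 0;
}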

And to make things even simpler for a quick and dirty evaluation, there's for example a very basic PECL extension for PHP, PHP-Facedetect, which does exactly what we need.
The configure script being broken at the time of this writing and the git pull request still waiting to be accepted, you can check out the patched extension I wrote.

That should be enough ground and ideas to get you started ! Share your ideas ! :)