Building A Search Engine

2014/04/10

Is that bug really important?

Of course a bug is important and should be dealt with, shouldn't it? Every bug is important and affects the overall quality of a product.

Just one second... for whom is that bug important? (And by the way, is it really a bug, or a user complaining about a product he doesn't understand? More on this in a later post.)

Bugs are important, but that importance lies along different axes: the customer's, the manager's, the developer's, the marketer's (add your own).
In short, different interests will collide over how important that bug is.

There will be a great deal of confusion and tension around that issue, soon and repeatedly. Even the tools designed to help with bugs will not help most of the time: very few of them take this conflict of interest into account. To make things worse, the default priority for a reported bug is "high". Guess what priority all bugs get.

This situation obviously ends up generating a lot of tension all around. Developers are frustrated at being told that a spelling mistake gets a higher priority than a major refactoring which will never get done, marketing is frantic about having corrupt user data, and users do not understand why it takes more than a month for a (simple, obvious!) bug to get fixed. And management tries to fix things by spending longer hours in meetings and making everyone work longer hours "to get people to work right".

The real issue with handling known bugs is about setting the priorities right.

And that means at least two things: evaluating what the priority is on each axis, and balancing those priorities. It's not an easy task.

Is that very abusive user being just... abusive? Is that technical bug creating a crash, but for a very rare use case? Is it slowly corrupting data in a non-recoverable way? Is that bug a spelling mistake on the welcome screen? Is the bug reported by a check-signing customer? And so on...

At the end of the day, a developer will have to fix the bug.
For those assigning the bug, being able to say "hey, we know this is crap for you", and then "but please fix it right now", really helps to get the developers going.

Avoiding bug reporting tools which do not allow you to use at least two different axes is a very good first step. Being able to say "hey, this is futile", but then "it's important for our paying users", will make developers laugh at first, that's for sure. It will introduce a very important concept to all, though: there are different priorities, and those need to be evaluated.

It will also be a kind reminder to developers that software is meant for end users.
Things will get done more easily, and according to a consistent set of priorities understood and agreed upon by all.
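
To make the two-axis idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the `Bug` class, the 1 to 5 scales, and the weighted sum are not taken from any real bug tracker, and a real team would have to agree on its own scales and weights.

```python
from dataclasses import dataclass


@dataclass
class Bug:
    title: str
    technical: int  # developer axis: 1 (cosmetic) to 5 (data loss), hypothetical scale
    business: int   # customer/marketing axis: 1 (nobody notices) to 5 (paying users hurt)


def rank(bugs, w_technical=1.0, w_business=1.0):
    """Order bugs by a weighted sum of both axes, most pressing first."""
    return sorted(
        bugs,
        key=lambda bug: w_technical * bug.technical + w_business * bug.business,
        reverse=True,
    )


bugs = [
    Bug("Typo on welcome screen", technical=1, business=4),
    Bug("Rare crash on export", technical=3, business=1),
    Bug("Silent data corruption", technical=5, business=5),
]
ordered = [bug.title for bug in rank(bugs)]
```

With equal weights, the typo that annoys paying users ends up ahead of the rare crash, which is exactly the kind of decision that should be visible and discussed rather than hidden behind a single "high".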

2014/01/09

Platform sizing 101

Today at work, introducing a new database and application server was on the menu.

This was happening totally against my better judgement, but somewhere, somehow, I came across an article on management which said "get things done, don't fuss too much about how", and it is indeed probably a better way: we are getting things done in the spirit I had in mind, so let's not mind the fussy details. Things are evolving, and we are not speeding towards a wall.
Enough about politics, to the real stuff now.

Sizing. We have n users. The literature, I was told, says that the application server can handle n/2 users. Straightforward answer: we need two instances. Right. Wrong?

Yes, that's the wrong answer. We care about user experience: if one server dies, or, more mundanely, one server is off for an update, the whole load goes to... well, that one remaining server, which can only handle n/2 users. Oops.

Right, there's one way out which management likes, but users (and workers) don't: take your service off and work at odd hours.

Or just set up three servers.

And there's a bonus that comes with this alternative: when your load increases, you get more time to add another server before user experience deteriorates.
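
The arithmetic can be written down as a small Python sketch. The function name and the `spares` parameter are my own, and it assumes that load spreads evenly and that the per-instance capacity figure can be trusted, which is worth measuring rather than taking from the literature.

```python
import math


def instances_needed(users, users_per_instance, spares=1):
    """Instances required to carry the full load with `spares` of them down."""
    if users_per_instance <= 0:
        raise ValueError("users_per_instance must be positive")
    return math.ceil(users / users_per_instance) + spares


n = 1000                                   # hypothetical user count
naive = instances_needed(n, n // 2, spares=0)
safe = instances_needed(n, n // 2)
```

With n users and n/2 users per instance, the naive answer is two and the one that survives an update or a crash is three.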

2013/10/22

The boss issue with distributing (computational) work

Here in France (your mileage may vary), the boss gives jobs to subordinates. That's what subordinates are for, and that's what a boss is here for.

Now, as some of you might have noticed, there are a few glitches in that scenario. Most of the time, the boss does not correctly estimate the time needed for a task if he does it on his own. If he asks, it means hours of meetings. Not very efficient.

From the boss's point of view, there is no doubt that the task must take the estimated time; if it's taking more time, it's because the worker is lazy; if it's taking less time than estimated, it's good, time to drown the guy with work. End of story.

So far, this all looks like the usual ranting about a pyramidal work organization. But here comes the real stuff: when dealing with distributing computational workload between machines, the same approach is, amazingly, actually used. And that does not make sense. Let me explain where the trouble is.

When doing some computations on a set of computers (let's call that a computing farm for big data, to follow the hype), it is highly unlikely that each and every machine will have exactly the same computational power, memory or disk storage.
And even if they do, there will be different hop times between machines, network congestion, etc.
In a few words: the platform is bound to be heterogeneous to a noticeable degree.

Yet, most distributed systems fail to take that into account. There is always a grand master (or a boss, to fall back on the teamwork metaphor) which decides everything.
Not to mention that most will assume equal RAM, link speed, etc.
That model has a major hidden flaw at its root: the boss needs to know the status of each and every machine, and make estimates for each one to decide where to push jobs. In other words, it has to do a lot of tracking, guessing, ordering, keeping track of dead machines, etc.

Now, what about reversing the model: the workers ask for work. What happens then?
The only things left for the boss to handle are a job queue and keeping track of dead tasks. Workers keep on asking for jobs.
And that's it. The work is consumed at the full available computing power with little managerial tasks. The fine tuning boils down to finding the right slice size of work.

This model is resilient and simple. Works well with multithreading too.

So, pile up tasks, and have your workers ask for the next task when they're done.
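
Here is a minimal sketch of that pull model in Python, using threads as workers. The function and variable names are mine, and it deliberately leaves out the tracking of dead tasks mentioned above: a task whose handler raises is simply lost here.

```python
import queue
import threading


def run_pull_workers(tasks, handler, worker_count=4):
    """Fill a queue once, then let each worker pull the next task when it is free."""
    jobs = queue.Queue()
    for task in tasks:
        jobs.put(task)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                task = jobs.get_nowait()   # the worker asks for work
            except queue.Empty:
                return                     # nothing left, the worker stops
            outcome = handler(task)
            with results_lock:
                results.append((task, outcome))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


squares = run_pull_workers(range(10), lambda n: n * n)
```

A fast worker simply comes back to the queue more often than a slow one, so nobody has to estimate anything. The results come back in completion order, not in submission order.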

A simple libevent http get server example

I recently tried to serve json queries in the "lightest" way, with a simple C program, with as little overhead as possible (no fcgi module, no scripting, standalone).

It happens that libevent already has all one needs to do just that: answer http queries with as little overhead as possible, built on an efficient networking library.

The trouble is, there is some documentation, but not that much, and a few pieces are missing to write a simple GET http server.

So, here is a skeleton for a simple GET http server with libevent.
It's handy for a very simple GET json server for instance.

Just in case that's of any help.

2013/08/19

Tips to build a corpus for statistical classification faster

Building a corpus for statistical classification is usually a tedious task. I will try to give you a few ideas here to help speed that up.

As is usual with any optimization work: the fastest way to do something is not to do it at all.
Most of the time, the task has already been tackled, and with some luck, someone has already built a corpus and made it available. The chances may seem small, but it's more common than one would expect.
So, the first step is to use your favorite search engine, try different keywords, and look for that. If you can't find it, look harder, change keywords.

A ready-made corpus is not necessarily where one would look in the first place. One such example is a classic used for training statistical machine translation engines: the United Nations speeches. It has all the speeches translated into many languages (and there are a lot of speeches made there). A variant could be the proceedings of the European Parliament, which are also translated into different languages. You get the idea.

A "new" trove of human knowledge is Wikipedia. This may seem obvious, but until recently it was quite unusable, as it was poorly structured, if at all. This has changed, and, though minimalistic in many ways, the "infobox" tag for instance is gathering more and more useful information.
It is for those who already have some experience coping with size though (or have a lot of patience), as the English Wikipedia dump is, as of this writing, 8.3G bzipped (that's something above 40G uncompressed). (Mahout has a nice ready-made splitter for the Wikipedia dump, here is a quick introduction to the stuff, it's really useful to have a smaller set for debugging for instance)

Now, if that's not enough to get the bulk of human-written information you need, things get trickier.
The usual, not so subtle, way of going about it could then be:
    - write a quick viewer which will allow you to easily classify content
    - befriend grep and pcre
    - build a set of "good" keywords
    - be even more creative
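
As an illustration of the keyword step, here is a rough Python equivalent of a case-insensitive grep over candidate lines. The function name, the keywords and the sample lines are made up for the example; whole-word matching is a choice of mine which you may want to relax.

```python
import re


def keyword_filter(lines, keywords):
    """Keep the lines containing at least one keyword as a whole word."""
    if not keywords:
        return []
    alternatives = "|".join(re.escape(keyword) for keyword in keywords)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return [line for line in lines if pattern.search(line)]


candidates = [
    "Coal prices rise again",
    "Lady Gaga announces a tour",
    "Solar energy boom in the south",
]
kept = keyword_filter(candidates, ["coal", "energy"])
```

The lines that come out still need a human eye, which is where the quick viewer comes in.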

There's also the possibility to seed your classification engine with a little data, then run it on some more data, check what is a positive and what is not in your corpus yet, and then add it.
That is _not_ a good idea, as it will give a bias to your corpus. However, this can be mitigated by ignoring the best matching data, and targeting the "more or less" matching data according to your classifier. You'd better check the maths though!
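
A possible sketch of that selection step in Python follows. It assumes your classifier gives a score between 0 and 1 for each document; the function name and the 0.4 to 0.6 band are arbitrary choices of mine, and this does not by itself remove the bias, it only avoids feeding the classifier what it already agrees with.

```python
def pick_borderline(scored, low=0.4, high=0.6):
    """Return the documents the classifier is unsure about, for manual review.

    `scored` is an iterable of (document, score) pairs, score in [0, 1].
    """
    return [document for document, score in scored if low <= score <= high]


scored = [
    ("doc-a", 0.97),  # best match: ignored on purpose
    ("doc-b", 0.55),
    ("doc-c", 0.42),
    ("doc-d", 0.03),  # clear non-match: ignored as well
]
to_review = pick_borderline(scored)
```

Whatever comes out of `pick_borderline` is then classified by hand before being added to the corpus.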

Have more tips? Share them in the comments :)

2013/07/24

How to get zendx_jquery_form_element_autocomplete to respond in real time

Where I work for a living now, they use Zend Framework and zendx_jquery_form_element_autocomplete. It's slow, and definitely not responding in real time. A quick test with Firebug shows a median response time of 300 ms. Far from the 200 ms needed to make it appear instantaneous.
Everyone says the database is slow, so I didn't take it further at first.

But, yes, the whole thing is slow. Something was missing an index, which I added, but the time gains were very slim. I found that suspicious. It's not a big deal, it's an application with few users, so I concentrated on the auto-complete part, which is worth optimizing.

The index was right, the code was ok, and it took 300 ms. It had to be the database. Or a slow machine executing the code. I blindly tried xcache, and it saved about 100 ms. Or, well, there was something else going on.

I decided to bypass everything and wrote a very simple script, doing the bare query and printing the result array in json, along these lines:

// Connection and database selection

// Escape the typed term, then match the names starting with it
$term = mysql_real_escape_string ($_REQUEST['term']);
$query = "SELECT name FROM table WHERE name LIKE '" . $term . "%'";

$my_res = mysql_query ($query);

// Collect the matching names in a flat array
$result = array ();
while ($item = mysql_fetch_array ($my_res))
    $result[] = $item['name'];

// Send the array back as json
print json_encode ($result);

And I ran the thing in Firebug. Tadam... It suddenly took only 6 ms. Right, that's a 50-fold time gain.

Is it clean? No, not really. Is Zend Framework slowing down your applications? Well, now you know how to deal with that one issue which matters.

2013/07/09

Building a corpus for classification: size isn't always the only problem

Building a meaningful corpus for classifying documents is a tedious task. Building a few is even more tedious.

That's an obvious fact.

The trouble is, as is the rule with anything involving language, that ambiguity eventually takes over and corners you.

Let's say, for instance, that we are trying to classify articles from newspapers, in three categories: "business", "energy" and "environment".
That seems simple enough. The next article about Lady Gaga is out. The next article on coal prices is in. It's in. Right. It falls in the business section, as it's going to influence energy prices. Wait, energy prices, that should also be in energy then, right? But won't coal prices influence the environment too, as renewables will be less competitive?
Wait, once again, wouldn't classifying this in environment be biased? What about global warming naysayers?
What began as a simple and trivial classification finally ends up in a mess. The multiple section facet is solved fast: use that article as a reference for all three categories, never mind the confusion it will potentially cause the computer, which has to decide which category the evaluated articles fall into. But what about the implicit point of view in this simple example, and its potential bias?
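
For what it's worth, the "reference for all three categories" part can be sketched in a few lines of Python. The corpus layout (a text with a set of labels) and the function name are assumptions of mine; the sketch only shows the bookkeeping and does nothing about the bias question.

```python
from collections import defaultdict


def by_category(corpus):
    """Turn (text, labels) pairs into one list of reference texts per category."""
    index = defaultdict(list)
    for text, labels in corpus:
        for label in sorted(labels):
            index[label].append(text)  # the same text may land in several lists
    return dict(index)


corpus = [
    ("Coal prices climb again", {"business", "energy", "environment"}),
    ("Central bank holds rates", {"business"}),
]
references = by_category(corpus)
```

The coal article ends up as a reference in all three lists, which is precisely what may confuse the classifier later on.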

If that seems too obvious (and in some ways, it is), let's try another one: identifying whether a search string refers to a person or to anything else. It's easy to define a person, right?
Are Greek gods persons, for instance? One can find pictures of them, they were treated as beings and interacted quite intimately with human beings. They could be persons then.
Let's say gods are people. Then, what about Allah, who isn't really a person, but is definitely a God to many people?
There's also the more trivial case of movie and tv characters: should they be classified as beings, or?...

Once again, the brain, with its seemingly endless capacity for confusion, easily puzzles the man trying to teach a thing or two to the computer.