Building A Search Engine: 07.2013

2013/07/24

How to get zendx_jquery_form_element_autocomplete to respond in real time

Where I work now, they use Zend Framework and zendx_jquery_form_element_autocomplete. It's slow, and definitely not responding in real time. A quick test with Firebug shows a median response time of 300 ms, far from the 200 ms needed to make it appear instantaneous.
Everyone says the database is slow, so I didn't take it further at first.

But, yes, the whole thing is slow. An index was missing, which I added, but the time gain was very slim. I found that suspicious. It's not a big deal overall, since it's an application with few users, so I concentrated on the auto-complete part, which is worth optimizing.

The index was right, the code was OK, and it still took 300 ms. It had to be the database, or a slow machine executing the code. I blindly tried XCache, which saved about 100 ms. Or, well, maybe there was something else going on.

I decided to bypass everything and wrote a very simple script that runs the bare query and prints the result array as JSON, along these lines:



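// Connection and database selection

// "table" and "name" stand in for the real table and column names.
// Escape the user input before putting it in the query.
$query = "SELECT name FROM table WHERE name LIKE '" . mysql_real_escape_string ($_REQUEST['term']) . "%'";
$my_res = mysql_query ($query);

// Collect the matching names into a flat array.
$result = array ();
while ($item = mysql_fetch_array ($my_res))
    $result[] = $item['name'];

print json_encode ($result);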

And I ran the thing in Firebug. Tadam... it suddenly took only 6 ms. Right, that's a 50-fold gain over the original 300 ms.
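For anyone trying this on a recent PHP: the mysql_* functions were removed in PHP 7, so here is a rough sketch of the same bare endpoint using PDO and a prepared statement. This is not the script I ran. The table name items, the ORDER BY, the LIMIT and the '!' escape character are all assumptions added for the example, and an in-memory SQLite database stands in for the real MySQL connection so the sketch can run on its own.

```php
<?php
// Sketch only: "items" and "name" are placeholder table/column names,
// and SQLite in memory replaces the real MySQL connection.

function autocomplete(PDO $pdo, string $term, int $limit = 10): array
{
    // Escape LIKE wildcards so a typed "%" or "_" is matched literally.
    // "!" is handled first so the escapes added afterwards are not doubled.
    $escaped = str_replace(['!', '%', '_'], ['!!', '!%', '!_'], $term);

    // The limit is cast to int and concatenated; the term is bound as a parameter.
    $stmt = $pdo->prepare(
        "SELECT name FROM items WHERE name LIKE ? ESCAPE '!' ORDER BY name LIMIT " . (int) $limit
    );
    $stmt->execute([$escaped . '%']);

    // FETCH_COLUMN returns a flat list of the first column's values.
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

// Demo data (hypothetical).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE items (name TEXT)');
$insert = $pdo->prepare('INSERT INTO items (name) VALUES (?)');
foreach (['zend', 'zebra', 'apple', '100% pure'] as $name) {
    $insert->execute([$name]);
}

// Same contract as the original script: read "term", print a JSON array.
$term = isset($_REQUEST['term']) ? (string) $_REQUEST['term'] : 'ze';
print json_encode(autocomplete($pdo, $term));
```

With a real MySQL connection, only the PDO constructor and the table name should need to change; the query itself uses nothing SQLite-specific.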

Is it clean? No, not really. Is Zend Framework slowing your applications? Well, now you know how to deal with the one issue that matters.

2013/07/09

Building a corpus for classification: size isn't always the only problem

Building a meaningful corpus for classifying documents is a tedious task. Building a few is even more tedious.

That's an obvious fact.

The trouble is, as always with anything involving language, that ambiguity ends up ruling and corners you.

Let's say, for instance, that we are trying to classify newspaper articles into three categories: "business", "energy" and "environment".
That seems simple enough. The next article about Lady Gaga is out. The next article on coal prices is in. It's in, right. It falls in the business section, as it's going to influence energy prices. Wait, energy prices: that should also be in energy then, right? But won't coal prices influence the environment too, as renewables will be less competitive?
Wait, once again, wouldn't classifying this in environment be biased? What about global warming naysayers?
What began as a simple and trivial classification ends up in a mess. The multiple-section facet is solved fast: use that article as a reference for all three categories, never mind the confusion it may cause the computer, which has to decide which category the evaluated articles fall into. But what about the implicit point of view in this simple example, and its potential bias?
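To make the "reference for all three categories" idea concrete, here is a minimal sketch of what such a corpus could look like. The article texts, the array layout and the function name referencesByCategory are invented for the illustration; a real corpus would be far larger and would not necessarily be organized this way.

```php
<?php
// Hypothetical labelled corpus: each article lists every category
// it serves as a reference for. An empty list means "out of scope".
$corpus = [
    ['text' => 'Coal prices fall as supply grows',  'labels' => ['business', 'energy', 'environment']],
    ['text' => 'Quarterly earnings beat forecasts', 'labels' => ['business']],
    ['text' => 'Lady Gaga announces a new tour',    'labels' => []],
];

// Build one list of reference articles per category.
// An article with several labels appears in several lists.
function referencesByCategory(array $corpus, array $categories): array
{
    $byCategory = array_fill_keys($categories, []);
    foreach ($corpus as $doc) {
        foreach ($doc['labels'] as $label) {
            if (array_key_exists($label, $byCategory)) {
                $byCategory[$label][] = $doc['text'];
            }
        }
    }
    return $byCategory;
}

$references = referencesByCategory($corpus, ['business', 'energy', 'environment']);
```

The coal article ends up in all three lists, which is exactly the overlap that may confuse the classifier later on.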

If that seems too obvious (and in some ways, it is), let's try another one: identifying whether a search string refers to a person or to anything else. It's easy to define a person, right?
Are Greek gods persons, for instance? One can find pictures of them, they were treated as beings, and they interacted quite intimately with human beings. They could be persons, then.
Let's say gods are people. Then what about Allah, who isn't really a person, but definitely a God to many people?
There's also the more trivial case of movie and TV characters: should they be classified as beings, or...?

Once again, the brain, with its seemingly endless capacity for confusion, easily puzzles whoever is trying to teach a thing or two to the computer.
