2008/11/29

Encoding

A friend of mine reminded me tonight about something obvious I had to deal with: encodings. Let's recap.

First, the Computer made ASCII and ANSI. And the Computer saw it was good.
And the Computer said "Let there be KOI", and there was KOI.
And the Computer saw KOI was good. And the Computer separated KOI from the others. (...) And the Computer said "Let there be BIG-5 in the midst of the sea of encodings and let it separate the encoding from the dark encodings". And the Computer made the firmament and separated the encodings which were under the firmament from the encodings which were above the firmament. (...)
And the Computer saw everything he had made, and behold, it was very good. And there was evening, and there was morning, a sixth day. (...)

To make a long story short, at some point, thanks to the Computer's in-depth look and long-term view, we get to the tale of the Tower of Babel.

And then, well, we get to the search engine. It's bad enough already that websites all around the earth use different encodings, but, to make matters worse, everyone seems to assume their encoding is obviously the right one.
And that's where the "obvious" seriously gets in the way. How is one supposed to know that a Russian website hosted in the US is using the latin1 encoding? Or that a Korean website hosted in Japan is using the iso-8859-1 encoding?
In case you think that's easy, consider that the page itself is advertising yet another, impossible encoding.
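To show why this is harder than it looks, here is a minimal Python sketch of the kind of heuristic one might try first. It is not the generic answer, and it is not what my crawler does; the function name decode_page and the order of the attempts are assumptions made up for this example. The idea: strict UTF-8 rarely validates by accident, so try it first, then try whatever the page advertises, then give up and fall back to latin-1.

```python
from typing import Optional, Tuple


def decode_page(raw: bytes, declared: Optional[str]) -> Tuple[str, str]:
    """Return (text, encoding_used) for the raw bytes of a page.

    Hypothetical helper, for illustration only.
    """
    # Strict UTF-8 first: non-UTF-8 text containing high bytes almost
    # never happens to be valid UTF-8, so a success here is a strong hint.
    candidates = ["utf-8"]
    if declared:
        candidates.append(declared)

    for name in candidates:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are not valid in this encoding.
            # LookupError: the advertised encoding does not even exist.
            continue

    # Last resort: latin-1 maps every byte to a character, so this never
    # fails. It also never tells us whether the result is right.
    return raw.decode("latin-1"), "latin-1"
```

The catch is in that last comment. Single-byte encodings such as latin-1 or KOI8-R accept any byte sequence, so a page in KOI8-R that advertises latin1 decodes "successfully" into garbage, and nothing in this sketch notices. Telling those apart takes statistics about the text itself, which is where the real trouble starts.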

Do you think I'm overdoing it?
I have 7 million pages for you.
Anyone who has a good generic answer for that one gets a free beer on me. (International shipping is OK.) And no, dropping the pages that are that crazy is not a satisfactory answer.
