[SWLUG] Encoding problem

Thu Jul 13 14:38:05 UTC 2006

On Thu Jul 13 09:38:15 2006, Justin Mitchell wrote:
> One word 'charsets'  a major source of headaches.
> 
> 
Every time I say "character sets", someone more skilled in the arts 
of pedantry tells me it's "character set encodings".

> Your linux system is most likely to be using unicode UTF-8 as its
> default charset, this allows it to handle all possible international
> characters correctly.
> 
> 
Well, as long as we gloss over the minor and trivial issues 
surrounding Korean, Chinese, and Japanese. Like they all use the same 
code positions, yet have slightly different glyphs. So it's actually 
impossible to handle strings containing those codepoints unless you 
*also* know the language - and if the string contains multiple 
languages, then you're doooomed, laddie, dooooomed. (There's MLSF, 
which was abandoned by the IETF because the Unicode Consortium 
insisted on using Plane 9 (from outer space, I suppose), except they 
abandoned that too, so nothing understands either).

But loosely, yes, modern Linux uses UTF-8 predominantly, which is one 
of the many "extended ASCII" encodings, and uses the high octets to 
carry variable-width encoded Unicode codepoints. That covers all of 
the Basic Multilingual Plane, and the planes beyond it as well.
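
To make the variable-width point concrete, here is a minimal Python 
sketch that prints the octets UTF-8 uses for a few sample codepoints. 
The helper name utf8_octets and the choice of sample characters are 
mine, purely for illustration.

```python
# Sample code points: U+0041 (ASCII), U+00E9 and U+20AC (both in the BMP),
# and U+1D11E (outside the BMP).
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001D11E"]


def utf8_octets(ch):
    """Return the UTF-8 encoding of one character as a list of hex octets."""
    return [f"{b:02x}" for b in ch.encode("utf-8")]


for ch in SAMPLES:
    # ASCII stays as a single octet below 0x80; everything else is
    # carried in two to four octets, all with the high bit set.
    print(f"U+{ord(ch):04X} -> {' '.join(utf8_octets(ch))}")
```

The ASCII character keeps its single low octet, which is why plain 
ASCII files are already valid UTF-8.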

> If a file displays properly in windows but nowhere else then it's
> very likely using the brain-dead microsoft specific charsets like
> windows-1252
> 
> 
Whereas windows-1252 (which is one of the many ISO-8859-1 variants, 
basically) uses the high octets to provide a local selection of 
characters, just to contrast. Windows is not alone in having OS 
specific variants on ISO-8859-1, mind - Apple does it too, and it 
wouldn't surprise me at all if Linux's ISO-8859-1 was not quite 
correct. Windows *is* unique in marking it differently, however.

Windows' variant does drift more from ISO-8859-1 than Apple's. I 
think some visible glyphs in ISO-8859-1 are not present in 
windows-1252, although IIRC all the alphanumeric character glyphs are 
identical, so you have to hunt about a bit to find stuff that causes 
a problem if it's mismarked.
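
One way to do the hunting is to ask a codec library which octets 
decode differently. The sketch below uses Python's built-in 
"iso-8859-1" and "windows-1252" codecs. It reflects Python's mapping 
tables only, which I am assuming are representative, and the function 
name differing_octets is made up for the example.

```python
def differing_octets():
    """List the octet values that decode differently in the two charsets."""
    diffs = []
    for value in range(256):
        octet = bytes([value])
        latin1 = octet.decode("iso-8859-1")  # defined for every octet
        try:
            cp1252 = octet.decode("windows-1252")
        except UnicodeDecodeError:
            cp1252 = None  # octet is unassigned in Python's windows-1252 table
        if cp1252 != latin1:
            diffs.append(value)
    return diffs


diffs = differing_octets()
print(f"{len(diffs)} octets differ, from 0x{diffs[0]:02X} to 0x{diffs[-1]:02X}")

# A typical symptom: "smart quotes" written on Windows.
print(repr(b"\x93quoted\x94".decode("windows-1252")))
```

In Python's tables, at least, every difference falls in the range 
0x80 to 0x9F, so text mismarked between the two only goes wrong on 
things like curly quotes, dashes, and the euro sign.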

Apple's drifts because some control codes in Latin-1 aren't present 
in Apple's variant - normally not a problem in that direction, but 
Apple replace them with printable characters, so they come out as 
control codes when translating them back to characters with a 
"genuine" ISO-8859-1 mapping. The problem is that Apple mark this as 
ISO-8859-1, so there's no easy way of figuring out what it is you're 
dealing with.

> On Thu, 2006-07-13 at 09:23 +0100, Huw Lynes wrote:
> > try
> > export LANG=fr_FR.UTF-8
> 
> No, this won't help at all. If the charset was utf-8 already it 
> would have displayed fine.

(Probably)

> all this command is adding is telling the programs that they should 
> talk to you in french instead of english, but still in utf-8.

A better bet might be:

LANG=fr_FR.ISO-8859-1

Although it's more likely to be windows-1252.

In practice, the "fr" doesn't matter *here*. If you were dealing with 
Korean, Chinese, or Japanese, however, then it would.
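
If the file does turn out to be windows-1252 or ISO-8859-1, an 
alternative to changing LANG is to recode the file to UTF-8. Here is 
a rough Python sketch of that, assuming those two are the only 
candidates. The function name to_utf8 and the fallback order are my 
own guesses at something sensible, not an established recipe.

```python
def to_utf8(raw: bytes) -> bytes:
    """Re-encode raw bytes as UTF-8, guessing the source encoding."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, leave it alone
    except UnicodeDecodeError:
        pass
    try:
        # The more likely culprit for a file that looks fine on Windows.
        text = raw.decode("windows-1252")
    except UnicodeDecodeError:
        # Octets unassigned in windows-1252; ISO-8859-1 accepts anything.
        text = raw.decode("iso-8859-1")
    return text.encode("utf-8")


# "café" as a single-octet Latin-1 style string, then as UTF-8.
print(to_utf8(b"caf\xe9"))
print(to_utf8(b"caf\xc3\xa9"))
```

This is a guess, not a detection: a windows-1252 file whose high 
octets happen to form valid UTF-8 sequences would be passed through 
untouched.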

Dave.
-- 
Dave Cridland - mailto:dave at cridland.net - xmpp:dwd at jabber.org
  - acap://acap.dave.cridland.net/byowner/user/dwd/bookmarks/
  - http://dave.cridland.net/
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade