[SWLUG] Encoding problem
Dave Cridland
dave at cridland.net
Thu Jul 13 14:38:05 UTC 2006
On Thu Jul 13 09:38:15 2006, Justin Mitchell wrote:
> One word: 'charsets', a major source of headaches.
>
>
Every time I say "character sets", someone more skilled in the arts
of pedantry tells me it's "character set encodings".
> Your linux system is most likely to be using unicode UTF-8 as its
> default charset, this allows it to handle all possible international
> characters correctly.
>
>
Well, as long as we gloss over the minor and trivial issues
surrounding Korean, Chinese, and Japanese. Like they all use the same
code positions, yet have slightly different glyphs. So it's actually
impossible to handle strings containing those codepoints unless you
*also* know the language - and if the string contains multiple
languages, then you're doooomed, laddie, dooooomed. (There's MLSF,
which was abandoned by the IETF because the Unicode Consortium
insisted on using language tag characters in Plane 14 instead, except
they deprecated those too, so nothing understands either).
But loosely, yes, modern Linux uses UTF-8 predominantly, which is one
of the many "extended ASCII" encodings, and uses the high octets for
variable-width encoded Unicode codepoints, covering all of the Basic
Multilingual Plane and the planes beyond it.
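To make the variable-width point concrete, here is a minimal Python sketch that prints how many octets UTF-8 spends on a few sample codepoints. The helper name utf8_octets and the choice of samples are illustrative assumptions, not anything from the thread.

```python
# Illustrative sketch: UTF-8 keeps ASCII as single octets and spends
# two to four high octets (>= 0x80) on everything else.

# Sample code points, written as escapes so the source stays ASCII.
SAMPLES = [
    ("\u0041", "U+0041, plain ASCII"),
    ("\u00e9", "U+00E9, in the Latin-1 range"),
    ("\u20ac", "U+20AC, elsewhere in the BMP"),
    ("\U0001d11e", "U+1D11E, beyond the BMP"),
]


def utf8_octets(ch: str) -> bytes:
    """Return the UTF-8 encoding of a single character (hypothetical helper)."""
    return ch.encode("utf-8")


for ch, note in SAMPLES:
    octets = utf8_octets(ch)
    dump = " ".join(f"{o:02x}" for o in octets)
    print(f"{note}: {len(octets)} octet(s): {dump}")
```

Only the ASCII sample comes out as one octet below 0x80; the others use two, three and four octets respectively, all of them high.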
> If a file displays properly in Windows but nowhere else then it's
> very likely using the brain-dead Microsoft-specific charsets like
> windows-1252
>
>
Whereas windows-1252 (which is one of the many ISO-8859-1 variants,
basically) uses the high octets to provide a local selection of
characters, just to contrast. Windows is not alone in having OS
specific variants on ISO-8859-1, mind - Apple does it too, and it
wouldn't surprise me at all if Linux's ISO-8859-1 was not quite
correct. Windows *is* unique in marking it differently, however.
Windows' variant does drift more from ISO-8859-1 than Apple's. IIRC,
the difference sits in the 0x80-0x9F octets, where ISO-8859-1 has
control codes and windows-1252 has printable characters (curly quotes,
the euro sign, and so on). All the alphanumeric character glyphs are
identical, so you have to hunt about a bit to find stuff that causes
a problem if it's mismarked.
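A rough way to see where a mismarked file would bite is to decode the same octet under both labels. This sketch uses Python's built-in "iso-8859-1" and "cp1252" codecs; the helper decode_both and the sample octets are assumptions chosen for illustration.

```python
# Illustrative sketch: decode one octet as ISO-8859-1 and as windows-1252
# and compare the resulting code points.

def decode_both(octet: int) -> tuple:
    """Return (iso-8859-1 char, windows-1252 char) for one octet.

    Hypothetical helper. A handful of octets (0x81 for instance) are
    undefined in Python's cp1252 codec and would raise
    UnicodeDecodeError, so they are avoided in the samples below.
    """
    raw = bytes([octet])
    return raw.decode("iso-8859-1"), raw.decode("cp1252")


for octet in (0x41, 0xE9, 0x80, 0x93):
    latin1, cp1252 = decode_both(octet)
    verdict = "same" if latin1 == cp1252 else "DIFFERENT"
    print(
        f"0x{octet:02x}: iso-8859-1 -> U+{ord(latin1):04X}, "
        f"windows-1252 -> U+{ord(cp1252):04X} ({verdict})"
    )
```

The letters agree under both labels; only octets such as 0x80 and 0x93 differ, which is why mismarked text often looks fine until it hits a euro sign or a curly quote.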
Apple's drifts because some control codes in Latin-1 aren't present
in Apple's variant - normally not a problem in that direction, but
Apple replace them with printable characters, so they come out as
control codes when translating them back to characters with a
"genuine" ISO-8859-1 mapping. The problem is that Apple mark this as
ISO-8859-1, so there's no easy way of figuring out what it is you're
dealing with.
> On Thu, 2006-07-13 at 09:23 +0100, Huw Lynes wrote:
> > try
> > export LANG=fr_FR.UTF-8
>
> No, this won't help at all. If the charset was utf-8 already it would
> have displayed fine.
(Probably)
> all this command is adding is telling the programs that they should
> talk to you in french instead of english, but still in utf-8.
A better bet might be:
LANG=fr_FR.ISO-8859-1
Although it's more likely to be windows-1252.
In practice, the "fr" doesn't matter *here*. If you were dealing with
Korean, Chinese, or Japanese, however, then it would.
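If changing the locale is awkward, another option is to transcode the file itself. The following is only a sketch under assumptions: the function name to_text and the guess order (UTF-8 first, then windows-1252, then ISO-8859-1 as a fallback that cannot fail) are mine, not something established in this thread.

```python
# Illustrative sketch: guess the encoding of some octets, then re-encode
# the text as UTF-8. The guess order is an assumption: strict UTF-8 is
# tried first because text in a legacy 8-bit encoding rarely happens to
# be valid UTF-8.

def to_text(data: bytes, guesses=("utf-8", "cp1252")) -> str:
    """Decode data with the first guess that works (hypothetical helper)."""
    for name in guesses:
        try:
            return data.decode(name)
        except UnicodeDecodeError:
            continue
    # ISO-8859-1 maps every octet to a code point, so this never raises.
    return data.decode("iso-8859-1")


legacy = b"caf\xe9"               # "cafe" with e-acute as one high octet
text = to_text(legacy)            # not valid UTF-8, so cp1252 is used
converted = text.encode("utf-8")  # the e-acute is now two octets
print(legacy, "->", converted)
```

This cannot tell windows-1252 from ISO-8859-1 with certainty, which is the same mislabelling problem described above; it only picks a plausible reading.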
Dave.
--
Dave Cridland - mailto:dave at cridland.net - xmpp:dwd at jabber.org
- acap://acap.dave.cridland.net/byowner/user/dwd/bookmarks/
- http://dave.cridland.net/
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade