[SWLUG] Encoding problem

neil at nwjones.demon.co.uk neil at nwjones.demon.co.uk
Thu Jul 13 15:39:35 UTC 2006


dave at cridland.net wrote:
> On Thu Jul 13 09:38:15 2006, Justin Mitchell wrote:
> > One word: 'charsets', a major source of headaches.
> > 
> > 
> Every time I say "character sets", someone more skilled in the arts 
> of pedantry tells me it's "character set encodings".
> 
> 
> > Your Linux system is most likely using Unicode (UTF-8) as its
> > default charset; this allows it to handle all possible international
> > characters correctly.
> > 
> > 
> Well, as long as we gloss over the minor and trivial issues 
> surrounding Korean, Chinese, and Japanese. Like they all use the same 
> code positions, yet have slightly different glyphs. So it's actually 
> impossible to handle strings containing those codepoints unless you 
> *also* know the language - and if the string contains multiple 
> languages, then you're doooomed, laddie, dooooomed. (There's MLSF, 
> which was abandoned by the IETF because the Unicode Consortium 
> insisted on using Plane 9 (from outer space, I suppose), except they 
> abandoned that too, so nothing understands either).
> 
> But loosely, yes, modern Linux uses UTF-8 predominately, which is one 
> of the many "extended ASCII" encodings, and uses the high octets as 
> variable-width encoded unicode codepoints, encoding all the Basic 
> Multilingual Plane, and typically extended to cover more than that.
> 
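As a rough illustration of the "variable-width" point above, here is a minimal Python sketch. The sample characters and the helper name utf8_length are my own choices for illustration, not anything from the thread. It shows that ASCII stays at one byte (which is why UTF-8 can be called an "extended ASCII" encoding), while other code points take two to four bytes, including one outside the Basic Multilingual Plane.

```python
# Sketch: how many bytes UTF-8 needs for a few sample characters.
# The samples are arbitrary illustrations, written as escapes so that
# this snippet itself cannot be mangled by a mail client.
samples = {
    "A": "\u0041",           # plain ASCII letter
    "e-acute": "\u00e9",     # Latin-1 range
    "euro": "\u20ac",        # elsewhere in the Basic Multilingual Plane
    "g-clef": "\U0001d11e",  # outside the BMP
}


def utf8_length(ch: str) -> int:
    """Return the number of octets UTF-8 uses for one character."""
    return len(ch.encode("utf-8"))


for name, ch in samples.items():
    # Print the name, the byte count and the raw bytes in hex.
    print(name, utf8_length(ch), ch.encode("utf-8").hex())
```

Only the octets with the high bit set take part in the multi-byte sequences; a pure ASCII file is byte-for-byte identical in ASCII and UTF-8.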
> 
> > If a file displays properly in Windows but nowhere else then it's
> > very likely using one of the brain-dead Microsoft-specific charsets
> > like windows-1252.
> > 
> > 
> Whereas windows-1252 (which is one of the many ISO-8859-1 variants, 
> basically) uses the high octets to provide a local selection of 
> characters, just to contrast. Windows is not alone in having OS 
> specific variants on ISO-8859-1, mind - Apple does it too, and it 
> wouldn't surprise me at all if Linux's ISO-8859-1 was not quite 
> correct. Windows *is* unique in marking it differently, however.
> 
> Windows' variant does drift more from ISO-8859-1 than Apple's; I think some 
> visible glyphs in ISO-8859-1 are not present in windows-1252, 
> although IIRC, all the alphanumeric character glyphs are identical, 
> so you have to hunt about a bit to find stuff that causes a problem 
> if it's mismarked.
> 
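For anyone who wants to hunt for the bytes that cause trouble when windows-1252 is mismarked as ISO-8859-1, a small Python sketch follows. It relies only on Python's built-in "latin-1" and "cp1252" codecs; the function name differing_bytes is made up for this example. As far as I know, the two encodings agree on 0xA0-0xFF and differ only in the 0x80-0x9F range, where ISO-8859-1 has control codes and windows-1252 has printable characters such as curly quotes and the euro sign.

```python
def differing_bytes():
    """List the high-octet values that decode differently in the two codecs."""
    diffs = []
    for value in range(0x80, 0x100):
        raw = bytes([value])
        latin1 = raw.decode("latin-1")
        # A few byte values are unassigned in Python's cp1252 table, so
        # replace them with U+FFFD rather than raising an exception.
        cp1252 = raw.decode("cp1252", errors="replace")
        if latin1 != cp1252:
            diffs.append(value)
    return diffs


diffs = differing_bytes()
print(f"{len(diffs)} bytes differ, from {min(diffs):#x} to {max(diffs):#x}")

# Example: byte 0x93 is a left curly quote in windows-1252 but a
# control code in ISO-8859-1.
print(repr(b"\x93".decode("cp1252")), repr(b"\x93".decode("latin-1")))
```

So text containing only accented letters survives a mislabel between the two, whereas "smart quotes" and similar punctuation do not.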
> Apple's drifts because some control codes in Latin-1 aren't present 
> in Apple's variant - normally not a problem in that direction, but 
> Apple replace them with printable characters, so they come out as 
> control codes when translating them back to characters with a 
> "genuine" ISO-8859-1 mapping. The problem is that Apple mark this as 
> ISO-8859-1, so there's no easy way of figuring out what it is you're 
> dealing with.
> 
> > On Thu, 2006-07-13 at 09:23 +0100, Huw Lynes wrote:
> > > try
> > > export LANG=fr_FR.UTF-8
> > 
> > No, this won't help at all. If the charset was UTF-8 already it would 
> > have displayed fine.
> 
> (Probably)
> 
> 
> > All this command does is tell the programs that they should talk
> > to you in French instead of English, but still in UTF-8.
> 
> A better bet might be:
> 
> LANG=fr_FR.ISO-8859-1
> 
> Although it's more likely to be windows-1252.
> 
> In practice, the "fr" doesn't matter *here*. If you were dealing with 
> Korean, Chinese, or Japanese, however, then it would.
> 
> Dave.
> -- 
> Dave Cridland - mailto:dave at cridland.net - xmpp:dwd at jabber.org
>   - acap://acap.dave.cridland.net/byowner/user/dwd/bookmarks/
>   - http://dave.cridland.net/
> Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade
> _______________________________________________
> SWLUG Discussion List - Discuss at swlug.org



I think I should clarify a point: I haven't tried it under Windows at all.
It won't be a Windows-generated file; I think it was generated in a *nix environment.
The file command says that it is

 UTF-8 Unicode text, with very long lines

Here are a few random problem words snipped out.

 général 
génération  linéaire

As you can see, they look very odd, provided of course that the encoding system in your email doesn't correct them.
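One possible explanation, and it is only a guess without the file itself to inspect, is that the text was encoded to UTF-8 twice: the UTF-8 bytes were at some point read as ISO-8859-1 (or windows-1252) and then written out as UTF-8 again. That would account for file reporting valid UTF-8 while each accented letter shows up as two odd characters. A minimal Python sketch of reversing that, assuming exactly this kind of double encoding (the function name undo_double_encoding is hypothetical):

```python
def undo_double_encoding(text: str) -> str:
    """Reverse one round of 'UTF-8 bytes misread as Latin-1'.

    Assumes every character in text fits in Latin-1; if it does not,
    encode() raises UnicodeEncodeError and the text was probably not
    damaged in this particular way.
    """
    return text.encode("latin-1").decode("utf-8")


# One of the problem words, written with escapes so that the mail system
# cannot alter it: U+00C3 U+00A9 is how a UTF-8 e-acute looks when its
# two bytes are read as Latin-1.
garbled = "g\u00c3\u00a9n\u00c3\u00a9ral"

# Going the other way reproduces the damage, which supports the guess.
assert "g\u00e9n\u00e9ral".encode("utf-8").decode("latin-1") == garbled

print(undo_double_encoding(garbled))
```

If the misreading step used windows-1252 rather than ISO-8859-1, swapping "latin-1" for "cp1252" in the encode call may be needed for characters such as curly quotes; for the accented letters shown here the two behave the same.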


Neil Jones
Neil at nwjones.demon.co.uk


