[SWLUG] Encoding problem

Justin Mitchell justin at discordia.org.uk
Thu Jul 13 08:38:15 UTC 2006


On Thu, 2006-07-13 at 09:25 +0100, Neil Jones wrote:
> I have a very large file that I am attempting to carry out some
> processing on.
> 
> There is a problem in that the file is in French and contains accented
> characters. When I do 
> 
> more filename
> 
> They display as odd characters.
> 
> I need to be able to divide the file into chunks (easy)
> 
> then I need to be able to send the file chunks to a windows system where
> they display properly and operations can be performed on the file.
> 
> 
> When I take a part of a French web page and do a cut and paste into a
> simple editing program like Gedit and save the words, and then I do a
> more on that file, the accents are displayed fine.
> 
> I would like to do something to the other file to make it display these
> characters properly.

One word: 'charsets', a major source of headaches.

Your Linux system is most likely using Unicode (UTF-8) as its default
charset, which allows it to handle all possible international
characters correctly.

If a file displays properly in Windows but nowhere else then it's very
likely using one of the brain-dead Microsoft-specific charsets like
windows-1252.

Most commands have no way of knowing what charset a file is in; they
have to assume that it's the same as the current locale (e.g. UTF-8) and
interpret it accordingly.
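
To see why the accents come out as odd characters, here is a small
Python sketch (Python is only my choice for illustration, and the sample
word is made up). It stores a French word as windows-1252 and then reads
the bytes back as if they were UTF-8, which is roughly what a UTF-8
terminal ends up doing.

```python
# A French word as it would be stored in a windows-1252 file.
word = "café"
raw = word.encode("cp1252")      # one byte (0xE9) for the e-acute

# The same word in UTF-8 needs two bytes (0xC3 0xA9) for the e-acute.
utf8 = word.encode("utf-8")

# Read the windows-1252 bytes as UTF-8: a lone 0xE9 is not a valid
# UTF-8 sequence, so it is swapped for U+FFFD, the replacement character.
misread = raw.decode("utf-8", errors="replace")

print(raw, utf8, repr(misread))
```

The bytes in the file are fine; they are just being interpreted under
the wrong charset.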

You can convert a file from one charset to another using the 'iconv'
command, but you need to know which charset it was in to begin with.
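
As a rough sketch of the same conversion in Python, assuming the file
really is windows-1252 (that is only a guess on my part, and the function
name and defaults are made up for illustration):

```python
def convert_charset(src_path, dst_path, src_charset="cp1252", dst_charset="utf-8"):
    """Re-encode a text file, much like: iconv -f SRC -t DST < in > out"""
    # newline="" stops Python translating line endings, so only the
    # character encoding changes.
    with open(src_path, "r", encoding=src_charset, newline="") as src, \
         open(dst_path, "w", encoding=dst_charset, newline="") as dst:
        for line in src:  # line by line, so a very large file is fine
            dst.write(line)
```

Something like convert_charset("french.txt", "french-utf8.txt") would
then do the job (both filenames are hypothetical). If the source charset
guess is wrong you will either get a UnicodeDecodeError or, worse, a file
that converts without complaint but contains the wrong characters.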

You can -usually- (but not always, and with no guarantee) find out what
charset a file is in by using the 'file' command on it, but there's no
way to tell for sure.
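
One check that is fairly reliable is the negative one: if a file does
not decode as strict UTF-8, then it is not UTF-8. A minimal sketch (the
function name and chunk size are my own choices):

```python
import codecs

def looks_like_utf8(path, chunk_size=1 << 20):
    """Return True if the whole file decodes as strict UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")()  # strict by default
    try:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)  # read in pieces for big files
                if not chunk:
                    break
                decoder.decode(chunk)
        decoder.decode(b"", final=True)  # catches a sequence cut off at EOF
    except UnicodeDecodeError:
        return False
    return True
```

Note that a plain ASCII file also passes, since ASCII is valid UTF-8.
When the check fails it only tells you the file is in some other charset,
not which one; windows-1252 and ISO-8859-1 files cannot be told apart
with certainty this way.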

Your web server will most likely assume that all the documents it hands
out are in UTF-8 unless you specially tweak the config. It also tells
the browser (in the Content-Type header) what charset the file is
supposed to be in, so the browser can convert and display it properly.
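
For illustration, this is roughly how a client could pull the declared
charset out of a Content-Type header value. It is a sketch using Python's
standard email.message module; the helper name and the sample header
values are made up.

```python
from email.message import Message

def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-Type value, lower-cased."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the default if no charset is declared.
    return msg.get_content_charset(default)

print(charset_from_content_type("text/html; charset=windows-1252"))
print(charset_from_content_type("text/html"))
```

If the header declares the wrong charset for the file, the browser will
garble the accents in just the same way as 'more' does.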


On Thu, 2006-07-13 at 09:23 +0100, Huw Lynes wrote: 
> try
> export LANG=fr_FR.UTF-8

No, this won't help at all.
If the charset were UTF-8 already it would have displayed fine.
All this command does is tell programs that they should talk
to you in French instead of English, but still in UTF-8.


This has been a crude introduction to the problems of charsets; I'm sure
someone who deals with languages regularly can tell you more.




