[Gllug] Find non-7-bit characters in files

Thu Jun 16 18:41:50 UTC 2005

On 6/16/05, Richard Jones <rich at annexia.org> wrote:
> Here's a small Thursday afternoon puzzler for everyone.
> 
> I have a large number of files (HTML files in fact, not that it
> matters).  A clueless^Wevil web monkey^Wdesigner has hidden bytes in
> them that are in the range 0x80 - 0xff, so the files aren't valid
> UTF-8.
> 
> I want to find those characters.  Preferably quickly from the command
> line.
> 
> I tried various combinations of egrep with the [:print:] character,
> but to no avail.

tr -d '\200-\240' <bad_code > good_code

...will remove bytes in the range octal 200-240 (0x80-0xA0).  Covering
the whole 0x80-0xff range you specified would need '\200-\377' instead.
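
Since the question was about finding the bytes rather than deleting them, the same tr idea can be turned around. The following is only a rough sketch: the function names count_high_bytes and list_files_with_high_bytes are made up for illustration, and it assumes a POSIX shell with tr and wc. It deletes every 7-bit byte and counts whatever is left over.

```shell
# Count the bytes in the 0x80-0xFF range in one file: delete every
# 7-bit byte (octal 000-177) and count what remains.
# LC_ALL=C makes tr work on raw bytes rather than multibyte characters.
count_high_bytes() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# Print "file: count" for each argument that contains at least one
# such byte; files that are pure 7-bit ASCII are skipped.
list_files_with_high_bytes() {
    for f in "$@"; do
        n=$(count_high_bytes "$f")
        if [ "$n" -gt 0 ]; then
            printf '%s: %s\n' "$f" "$n"
        fi
    done
}
```

Something like `list_files_with_high_bytes *.html` would then name the files worth a closer look. It reports only counts, not positions within the file.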

However, your question leads me to trawl through memory and
mailing-lists - I think you may find that these evil characters are a
known Microsoftism.

From the ISO 8859-1 National Character Set FAQ:

18.4 MS-Windows
Microsoft Windows uses an ISO 8859-1 compatible character set (Code
Page 1252), as delivered in the US, Europe (except Eastern Europe) and
Latin America.  In Windows 3.1, Microsoft has added additional characters
in the 0x80-0x9F range.

Hence, I think your 80-FF range may be wider than it needs to be; the
offending bytes are probably confined to 0x80-0x9F.

$ echo 'obase=8;ibase=16;80;9F' | bc
200
237

So, simply substitute 237 for 240 above.
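
As a cross-check on the bc conversion, printf can do the same hex-to-octal step, and tr -cd can count only the bytes in the narrower range. This is a sketch under assumptions: hex_to_octal and count_c1_bytes are illustrative names, and a POSIX shell with printf, tr and wc is assumed.

```shell
# Convert a hex byte value (given without the 0x prefix) to octal,
# which is the notation tr expects in its escapes.
hex_to_octal() {
    printf '%o\n' "0x$1"
}

# Count only the bytes in 0x80-0x9F (octal 200-237): -c complements
# the set, so -cd deletes everything outside that range.
count_c1_bytes() {
    LC_ALL=C tr -cd '\200-\237' < "$1" | wc -c | tr -d ' '
}
```

For example, `hex_to_octal 9F` prints 237, matching the bc output.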

> Rich.

S.