[Gllug] Find non-7-bit characters in files
Steve Nelson
sanelson at gmail.com
Thu Jun 16 18:41:50 UTC 2005
On 6/16/05, Richard Jones <rich at annexia.org> wrote:
> Here's a small Thursday afternoon puzzler for everyone.
>
> I have a large number of files (HTML files in fact, not that it
> matters). A clueless^Wevil web monkey^Wdesigner has hidden bytes in
> them that are in the range 0x80 - 0xff, so the files aren't valid
> UTF-8.
>
> I want to find those characters. Preferably quickly from the command
> line.
>
> I tried various combinations of egrep with the [:print:] character,
> but to no avail.
tr -d '\200-\240' <bad_code > good_code
...will remove bytes in the range 0x80 - 0xa0 (octal 200 - 240), the
low end of the range you specified.
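Since the question was about finding the bytes rather than deleting them, here is a minimal sketch using only tr, wc and awk. The function names count_high_bytes and list_high_byte_lines are made up for illustration, and LC_ALL=C is assumed so that every byte is treated as a single character.

```shell
# Count the bytes in the range 0x80-0xff (octal 200-377) in a file.
# tr -cd deletes everything EXCEPT the listed range.
count_high_bytes() {
    LC_ALL=C tr -cd '\200-\377' < "$1" | wc -c | tr -d ' '
}

# Print the line numbers of lines holding at least one such byte.
# All 7-bit bytes except newline (octal 012) are deleted, so any
# line that is still non-empty must have contained a high byte.
list_high_byte_lines() {
    LC_ALL=C tr -d '\000-\011\013-\177' < "$1" |
        LC_ALL=C awk 'length($0) > 0 { print NR }'
}
```

Looping over the files and calling count_high_bytes on each one should show quickly which files are affected; list_high_byte_lines then narrows it down to lines.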
However, your question leads me to trawl through memory and
mailing-lists - I think you may find that these evil characters are a
known Microsoftism.
From the ISO 8859-1 National Character Set FAQ:
18.4 MS-Windows
Microsoft Windows uses an ISO 8859-1 compatible character set (Code
Page 1252), as delivered in the US, Europe (except Eastern Europe) and
Latin America. In Windows 3.1, Microsoft has added additional characters
in the 0x80-0x9F range.
Hence, I think your 80-FF range may be broader than necessary: the
offending bytes are probably confined to 80-9F.
$ echo 'obase=8;ibase=16;80;9F' | bc
200
237
So, simply substitute 237 for 240 above.
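With that substitution the command might look like the sketch below. The function name strip_c1_bytes is hypothetical, and the printf line is just an alternative to bc that relies on printf accepting C-style hex constants, which POSIX shells should do.

```shell
# Delete only the bytes 0x80-0x9f (octal 200-237), which is where
# Code Page 1252 adds characters; 0xa0 and above are left alone.
strip_c1_bytes() {
    LC_ALL=C tr -d '\200-\237'
}

# Hex to octal without bc.
printf '%o %o\n' 0x80 0x9F    # prints: 200 237
```

Used as a filter, for example: strip_c1_bytes < bad_code > good_code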
> Rich.
S.
--
Gllug mailing list - Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug