[sclug] URL encoding/decoding question

Tue Feb 21 21:35:15 UTC 2006

On Mon, 2006-02-20 at 13:27 +0100, pieter claassen wrote:

> The problem was (and still is, even though database mangling and
> URLEncoding/decoding mangling has been eliminated) that if I edit some
> text in a <textarea> that contains HTML (specifically encoded HTML)
> then
> it promptly decodes it as HTML and renders it in the textarea and then
> encodes the submissions into something more evil.
> 
> So, when I place &euro; in my text area and submit, it ends up as
> &euro; in the db but is rendered as € in the textarea the second time
> round.
> 
> If I submit € the second time round, it is stored as € in the db and
> rendered as ???.

OK, this is bad. Something somewhere is using 8859-1 interpretation
instead of UTF-8 interpretation.

- Assuming that the '€' in the db is being stored as a Unicode character
(how are you checking what's in the database, exactly?) and assuming
UTF-8 encoding, this is E2 82 AC. In 8859-1, E2 is 'â', 82 is undefined
and AC is '¬'.

If you are using inconsistent interpretations, you're most unlikely to
ever come up with a reliable solution.

> The browser interprets € &#x20AC; and € in the <TEXTAREA>

OK, a couple of things:

- You're confusing SGML entity expansion with application (HTML)
semantics. Entities are expanded _very_ early in the parsing process.

- TEXTAREA does not interpret HTML tags (they appear as "source" on
screen) but, for the reason given above, cannot prevent the expansion of
SGML entities because that expansion has already occurred by the time
TEXTAREA gets its hands on it. (e.g. '&lt;' and '<' don't merely have
the same on-screen appearance, they are, by the time TEXTAREA (or any
other HTML tag) gets its hands on them, indistinguishable: '&lt;' has
already been replaced by '<'; likewise '&euro;' has already been
replaced by '€'.)

- You cannot expect any HTML-specific stuff to block the expanding of
SGML entities.

- Consequently, if you want the entity presented in its "source" form,
you'll have to escape the '&'. This is pretty ordinary stuff:

   <TEXTAREA>foo costs &amp;euro;25.29</TEXTAREA>
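
If the page is generated from Java, the escaping step might look
something like the sketch below. The class and method names
(TextareaEscape, escapeForTextarea) are made up for illustration; the
only assumption is that the stored text is written into the TEXTAREA as
element content.

```java
final class TextareaEscape {
    // Hypothetical helper: escape the characters that the parser would
    // otherwise treat as markup or as the start of an entity reference.
    static String escapeForTextarea(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: <TEXTAREA>foo costs &amp;euro;25.29</TEXTAREA>
        System.out.println("<TEXTAREA>"
                + escapeForTextarea("foo costs &euro;25.29")
                + "</TEXTAREA>");
    }
}
```

The browser expands '&amp;' back to '&' before TEXTAREA sees it, so the
user is shown (and resubmits) the six characters of the entity source.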

> The field in the db is a TEXT field.

This is fine, as long as it understands Unicode.

(If you were using a subset like 8859-1, then the above would read "This
is fine, as long as it understands 8859-1". The important thing is that
the database field can hold whatever your user might cause to be put
into it. It is difficult to predict what might happen if your database
only accepts 8859-1 but your user submits ???????? characters; maybe
they'll be dropped, maybe an exception will be thrown, maybe a remote
code execution vulnerability will obtain. There's more, but what I'm
really saying is "check!".)

> It strikes me that if there was a way to inform text areas that the
> content they contain is NOT HTML and therefore should not be rendered
> as
> such, then the problem might go away. BTW. This is Firefox 1.0.7.

As above; your "problem" isn't HTML, it's SGML(/XML). If you want to
avoid having SGML entities expanded, you'll have to present the '&'
itself as an entity.

> Ok, looks like mozilla is broken in regards to this matter.

I don't believe so.

> So, I thought this should work to intercept all inserts into the
> db and just convert them back to something civil?
>
>        static public String decode (String data){
>                data.replace(String.valueOf('\u20AC'),"&euro;");
>                return data;
>        }
>
> However, no luck.

Hmm? You're saying that if you have '&' in the TEXTAREA and submit it
through the above snippet, what goes into the database is _not_
'&euro;'? Or just that the round-trip still isn't quite working?
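
One thing worth checking in that snippet regardless: Java strings are
immutable, so String.replace returns a new string and leaves 'data'
untouched, and the snippet throws the result away. A minimal sketch,
assuming the intent is to turn U+20AC into the text of the entity (the
class and method names are mine):

```java
final class EuroEntity {
    // Returns a copy of data with each euro sign (U+20AC) replaced by
    // the six characters of the entity reference.
    static String encodeEuro(String data) {
        // String.replace returns a new String; the result must be kept.
        return data.replace("\u20AC", "&euro;");
    }

    public static void main(String[] args) {
        String in = "foo costs \u20AC25.29";
        in.replace("\u20AC", "&euro;");                 // result discarded, 'in' is unchanged
        System.out.println(in.indexOf('\u20AC') >= 0);  // true
        System.out.println(encodeEuro(in));             // foo costs &euro;25.29
    }
}
```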

A side question, but perhaps an important one: why are you using the
entity form in the first place? Why not store the '€' literally in the
database and then chase down and remove the 8859-1 interpretation
problem? Is it possible that you have your browser hard-wired to apply
8859-1 interpretation despite an alternate specification in the
HTTP/HTML headers? Are you sure that both the HTTP and HTML headers do,
in fact, specify UTF-8? (I'm not asking about what you've set up, I'm
asking about what's on the wire: How about posting an ethereal hex dump
(Analyze/'Follow TCP Stream'/'Hex Dump' then 'Save As')?)

> My question. How do I print the "data" string in the above code in UTF
> format so that I can see what is being submitted to the DB?

Umm,

- UTF is not a format

- UTF-1, UTF-7, UTF-8 and UTF-16 are encodings of Unicode. (UCS-4 is
also an encoding in the trivial sense of "no encoding"; all characters
are represented directly as the four byte representation of their
Unicode codepoint.)

- I take it that what you are actually asking for is the hex
representation of the UTF-8 encoding of the characters that you're
processing.

- Will's code looks about right.

> I assume that & 0xFF is to convert UTF-8 to UTF-16 by masking out
> the higher order bits?

UTF-16 doesn't come into it here. The '& 0xff' is to coerce the -128 to
127 range of a Java byte to a more useful 0 to 255 range: the byte is
sign-extended when it is widened to an int, and the mask discards the
extended bits. (Bear in mind that all Java integers (byte, short, int,
long) are signed.)
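
For reference, a minimal sketch of that kind of dump. This is my own
illustration rather than Will's code, and the names (HexDump, utf8Hex)
are made up:

```java
import java.nio.charset.StandardCharsets;

final class HexDump {
    // Hex representation of the UTF-8 encoding of s, two digits per byte.
    static String utf8Hex(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder out = new StringBuilder();
        for (byte b : bytes) {
            if (out.length() > 0) out.append(' ');
            // b is signed (-128..127); masking after promotion to int gives 0..255.
            out.append(String.format("%02x", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u20AC"));                        // e2 82 ac
        System.out.println(Integer.toHexString((byte) 0xE2));         // ffffffe2 (sign-extended)
        System.out.println(Integer.toHexString((byte) 0xE2 & 0xFF));  // e2
    }
}
```

Calling utf8Hex on 'data' just before the insert shows exactly which
bytes the application believes it is handing to the database driver.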

> So, I think this is what is happening
> 
> old		new		oldchar	newchar
> 
> 26		c3		&  	?	diff 157(0x9D)
> 65		a2		e 	?	diff 61(0x3D)
> 75		c2		u  	?	diff 77 (0x4D)
> 72		82		r  	?	diff 16(0x10)
> 6f		c2 		o  	?	diff 83(0x53)
> 3b		ac		;  	?	diff 113(0x71)
> 
> How is it possible that 0x75 and 0x6F both encode to C2?

It's not. As you've noticed, in the round trip the encoded string is
getting longer, so attempting one-to-one byte mappings is a
non-starter. (If it were reasonable to assume that one byte was becoming
only one byte, how could the string be getting longer?)

> I don't see a pattern in the encoding lengths?

I suspect pretty strongly that you've got an 8859-1 interpretation or
encoding going on somewhere. Actually teasing out the possible
conversion paths is more work than I have patience for (and, lacking
accurate measurement (byte-level recordings of the HTTP exchanges), is
a little academic).
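
For what it's worth, one candidate path does reproduce the six bytes in
your "new" column: encode '€' as UTF-8, misread those three bytes as
8859-1, then encode the result as UTF-8 again. The sketch below only
demonstrates that arithmetic; it is a guess at the mechanism, not a
diagnosis, and the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;

final class DoubleEncoding {
    // Two lowercase hex digits per byte, space separated.
    static String hex(byte[] bytes) {
        StringBuilder out = new StringBuilder();
        for (byte b : bytes) {
            if (out.length() > 0) out.append(' ');
            out.append(String.format("%02x", b & 0xFF));
        }
        return out.toString();
    }

    // Encode as UTF-8, misread the bytes as ISO-8859-1, encode as UTF-8 again.
    static byte[] misreadAndReencode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(hex("\u20AC".getBytes(StandardCharsets.UTF_8)));  // e2 82 ac
        System.out.println(hex(misreadAndReencode("\u20AC")));               // c3 a2 c2 82 c2 ac
    }
}
```

Each of the three original bytes is 0x80 or above, so each becomes two
bytes on re-encoding, which is why the string grows from three bytes to
six.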

Let's see that hex dump from ethereal for both the presentation of the
form and for the processing of the POST.

- Raz