[sclug] URL encoding/decoding question

Sun Feb 19 18:00:44 UTC 2006

pieter claassen wrote:
> Hi,
> 
> I am trying to edit some database stored fields containing HTML with a
> JSP page.
> 
> I store the data URL encoded in UTF-8 in the db and decode it before
> rendering with the exception of URL parameter strings that contain HTML
> that are re-encoded so that they don't break the rendering of links in
> the browser.
> 
> My questions:

> 1. I assume it makes no difference whether I store data encoded or not
> in the DB? The reason I went for encoding was in case there were some
> values that would screw the SQL insertion up (like "). 

I am very ignorant on the subject of SQL. However, I would've thought 
that the allowed repertoire for URLs would be a subset of the allowed
repertoire for SQL: the former is deliberately quite restrictive, so 
anything which is a correct URL (as opposed to the superset thereof 
which browsers sometimes tolerate, and broken websystems create) should 
be OK.

> Encoding and
> decoding a string should result in exactly the same value?

Yes.
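
A minimal sketch of that round trip, assuming the standard java.net.URLEncoder and java.net.URLDecoder with UTF-8 on both sides. The class and method names are made up for illustration, not taken from the original application.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper class, for illustration only.
class RoundTrip {
    // URL-encode using UTF-8, as the original post describes.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to exist on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Reverse of encode(); must use the same charset.
    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = " % ";
        String stored = encode(original);
        System.out.println(stored);                           // +%25+
        System.out.println(original.equals(decode(stored)));  // true
    }
}
```

The round trip only holds if each value is decoded exactly once and with the same charset it was encoded with.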

> 2. For some reason when I try to encode the " % " characters (space%
> space), I get an encoded value of "+%25+" in the database but when I try
> to decode this value, I get:
> 
> java.lang.IllegalArgumentException: URLDecoder: Incomplete trailing
> escape (%) pattern

Works for me! Which version of Java are you using? Can you get a dump of 
the numerical values of the chars in the string you're feeding the decoder?
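
A rough sketch of such a dump, with made-up class and method names. It also shows one way (a guess, not a diagnosis) to provoke that exception: handing the decoder a string that has already been decoded, so a bare "%" is no longer followed by two hex digits.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical debugging helper, for illustration only.
class DumpChars {
    // List the numeric value of every char in the string.
    static String dump(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return out.toString();
    }

    // Decode, reporting a rejection instead of propagating the exception.
    static String tryDecode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (IllegalArgumentException e) {
            return "rejected: " + e.getMessage();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(dump("+%25+"));      // U+002B U+0025 U+0032 U+0035 U+002B
        System.out.println(tryDecode("+%25+")); // " % ", decodes cleanly
        System.out.println(tryDecode(" % "));   // rejected: the % is not followed by two hex digits
    }
}
```

If the dump of the string reaching the decoder shows U+0020 U+0025 U+0020 rather than the five chars above, something upstream has already decoded it once.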
> 
> 3. A big problem is the encoding of &euro; strings which give me 
> "%26euro%3B" in the database and is then rendered by the browser inside
> a textarea block as € (the euro sign). The problem is that if I encode
> this symbol then I get "%C3%A2%C2%82%C2%AC" which in return encodes to
> "%C3%83%C2%A2%C3%82%C2%82%C3%82%C2%AC".

The entity encoding is not reversible without effort. What's happening 
here is that, when the renderer parses your HTML, it converts alphabetic 
entities such as &euro; into numeric codepoints; in this case, the 
codepoint happens to be 0x20AC (8364 decimal). It then renders these as 
it would any other codepoint (character). However, when the browser 
sends you the result back, it's using UTF-8, so the number above gets 
converted into a 3-byte UTF-8 sequence: For euro this is supposed to be 
0xE2, 0x82, 0xAC (so you seem to have some other issues as well; I 
suspect the browser is sending your data back as ISO-8859-1 rather than 
UTF-8).
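
The two strings quoted in the question can be reproduced by taking the correct UTF-8 bytes, misreading them as ISO-8859-1, and URL-encoding the result as UTF-8 again. The sketch below does exactly that; the class and method names are invented for illustration, and where in the real application the misreading happens is not known from the post.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class, for illustration only.
class EuroMojibake {
    static String urlEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Simulate the suspected fault: UTF-8 bytes interpreted as ISO-8859-1.
    static String misreadAsLatin1(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String euro = "\u20AC";
        System.out.println(urlEncode(euro));   // %E2%82%AC (the expected form)

        String once = misreadAsLatin1(euro);
        System.out.println(urlEncode(once));   // %C3%A2%C2%82%C2%AC

        String twice = misreadAsLatin1(once);
        System.out.println(urlEncode(twice));  // %C3%83%C2%A2%C3%82%C2%82%C3%82%C2%AC
    }
}
```

Each pass through the fault turns every byte above 0x7F into two bytes, which is why the stored value keeps growing on every edit.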

Before URL-encoding the returned data for storage, you could parse it to 
find chars that aren't in a suitably defined "safe" set, and convert 
them to numeric entities; eg. &#x20AC; for euro. You can convert any 
character in this way which isn't part of markup, and it'll render.
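
One possible sketch of that conversion, assuming the "safe" set is printable ASCII plus tab, CR and LF. That choice and the names below are assumptions for illustration. Markup characters such as < and & are left alone because the stored text is itself HTML.

```java
// Hypothetical helper class, for illustration only.
class NumericEntities {
    // Assumed "safe" set: printable ASCII plus common whitespace.
    static boolean isSafe(int codePoint) {
        return codePoint == '\t' || codePoint == '\n' || codePoint == '\r'
                || (codePoint >= 0x20 && codePoint <= 0x7E);
    }

    // Replace every unsafe code point with a hexadecimal numeric entity.
    static String toNumericEntities(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);       // handles surrogate pairs
            i += Character.charCount(cp);
            if (isSafe(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNumericEntities("<b>5 \u20AC</b>")); // <b>5 &#x20AC;</b>
    }
}
```

The output is pure ASCII, so it survives URL-encoding and storage regardless of which charset the later stages assume.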

If you want to keep the alphabetic entities (ie. &euro; not &#x20AC;) 
then you have a harder job. The normative list of alphabetic entities 
can be found in <http://www.w3.org/TR/xhtml1/xhtml1.zip> in the files 
/DTD/*.ent. Working out how to parse that lot is left as an exercise.
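
As a starting point only, here is a sketch that pulls name/codepoint pairs out of declarations shaped like <!ENTITY euro "&#8364;">. It assumes the .ent files use that simple decimal form. Declarations written differently (such as the doubly escaped ones for amp and lt) are skipped. The class name and sample text are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser sketch, for illustration only.
class EntityTable {
    // Matches e.g.  <!ENTITY euro   "&#8364;">  (assumed declaration shape).
    private static final Pattern DECL =
            Pattern.compile("<!ENTITY\\s+(\\w+)\\s+\"&#(\\d+);\"\\s*>");

    // Build a map from codepoint to entity name from the text of an .ent file.
    static Map<Integer, String> parse(String entFileText) {
        Map<Integer, String> names = new LinkedHashMap<Integer, String>();
        Matcher m = DECL.matcher(entFileText);
        while (m.find()) {
            names.put(Integer.valueOf(m.group(2)), m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Sample lines in the assumed format, not the full files.
        String sample = "<!ENTITY nbsp   \"&#160;\"> <!-- no-break space -->\n"
                      + "<!ENTITY euro   \"&#8364;\"> <!-- euro sign -->\n";
        Map<Integer, String> names = parse(sample);
        System.out.println("&" + names.get(0x20AC) + ";"); // &euro;
    }
}
```

With such a table, a character that has a name can be written as its alphabetic entity, and anything else can fall back to the numeric form.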

HTH

Will.