[sclug] URL encoding/decoding question

Sun Feb 19 17:32:30 UTC 2006

On Sun, 2006-02-19 at 16:58 +0100, pieter claassen wrote:

> I am trying to edit some database stored fields containing HTML with a
> JSP page.
> 
> I store the data URL encoded in UTF-8 in the db and decode it before

(boggle)

You are URL encoding _entire_ HTML documents? How long are the resulting
URLs? Or are you URL-encoding for some other reason?

> rendering with the exception of URL parameter strings that contain HTML
> that are re-encoded so that they don't break the rendering of links in
> the browser.
> 
> My questions:
> 1. I assume it makes no difference whether I store data encoded or not
> in the DB? The reason I went for encoding was in case there were some
> values that would screw the SQL insertion up (like "). Encoding and
> decoding a string should result in exactly the same value?

That rather depends upon the field type that you're using in the
database. PostgreSQL's BYTEA provides a raw byte bucket in which to
stick things, safe in the knowledge that, upon retrieval, they'll be
identical to what was stored. Other databases have similar types, but
beware the distinction between byte buckets and character buckets
(PostgreSQL BYTEA vs. TEXT, Oracle BLOB vs. CLOB, etc.); if your data is
not character oriented (yours is) or uses characters that the database
server doesn't understand (e.g. Unicode on a database that only
understands ASCII) then character buckets may be inappropriate.

> 2. For some reason when I try to encode the " % " characters (space%
> space), I get an encoded value of "+%25+" in the database but when I try
> to decode this value, I get:

This is correct URL encoding, although as above, I'm not sure that you
really want to do this.

> java.lang.IllegalArgumentException: URLDecoder: Incomplete trailing
> escape (%) pattern

You're passing exactly "+%25+"? What implementation of URLDecoder are you
using?
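
For reference, a minimal sketch of the round trip, assuming the stock
java.net.URLEncoder and java.net.URLDecoder (the class and method names
below are made up for illustration). It also shows one way to arrive at
that exception, which is only a guess at your situation: decoding a value
that has already been decoded once, for instance by the servlet container.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper class, for illustration only.
final class UrlRoundTrip {

    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // True if URLDecoder rejects the input with IllegalArgumentException.
    static boolean decodeFails(String s) {
        try {
            decode(s);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String stored = encode(" % ");          // space, percent, space
        String once = decode(stored);           // back to the original
        System.out.println(stored);
        System.out.println("[" + once + "]");
        // Decoding a second time sees a bare '%' with no two hex digits
        // after it, which URLDecoder refuses.
        System.out.println(decodeFails(once));
    }
}
```

If a single decode of exactly "+%25+" fails for you, the problem is
somewhere other than the stored value.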

> 3. A big problem is the encoding of &euro; strings which give me
> "%26euro%3B" in the database and is then rendered by the browser inside
> a textarea block as € (the euro sign). The problem is that if I encode
> this symbol then I get "%C3%A2%C2%82%C2%AC" which in return encodes to

Wild. Part of the answer may be that the € itself is Unicode 0x20AC
which UTF-8 encodes as E2 82 AC, but even this doesn't quite line up.

You appear to have forced enough stuff to UTF-8 to allay any 8859-15
concerns.
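
One reading that would produce exactly those bytes, offered as a guess
rather than a diagnosis: the UTF-8 bytes E2 82 AC get read back somewhere
as three ISO-8859-1 characters and are then encoded as UTF-8 a second
time. A small Java sketch of that (class and method names are
hypothetical):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class EuroDoubleEncoding {

    static String urlEncodeUtf8(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    // Simulates the suspected mistake: take the UTF-8 bytes of s and
    // interpret each byte as one ISO-8859-1 character.
    static String misreadAsLatin1(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String euro = "\u20AC";
        // Encoded once: the three UTF-8 bytes of the euro sign.
        System.out.println(urlEncodeUtf8(euro));
        // Misread as Latin-1 first, then encoded: each byte becomes two.
        System.out.println(urlEncodeUtf8(misreadAsLatin1(euro)));
    }
}
```

The first line printed is %E2%82%AC and the second is
%C3%A2%C2%82%C2%AC, so it would be worth checking what character set the
JSP page and the request are being read with.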

Back to question one though, why use _any_ encoding beyond UTF-8 when
storing in the database? Yes, you'll need to encode stuff on the way to
the web-browser and decode it on the way back, but this stuff should
never appear in the database.

> How do I get text in the text area to be decoded to HTML values and then
> re-encoded before insertion in the DB to the same UTF-8 value? Why does
> this happen?

You want the text area to show HTML source, right? Just push everything
that isn't in the printing ASCII range (32 <= n < 127) plus '<' and '&'
to HTML "numeric character references":

    "&#" + (int) c + ";"

or thereabouts.
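
Spelled out a little more, as a sketch only (the class and method names
are invented for illustration, and it follows the rule above literally,
so line breaks and tabs are turned into references too):

```java
// Hypothetical helper class, for illustration only.
final class HtmlSourceEscaper {

    // Replaces every code point outside printing ASCII (32 <= n < 127),
    // plus '<' and '&', with an HTML numeric character reference.
    static String escapeForTextarea(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        int i = 0;
        while (i < s.length()) {
            // Work in code points so surrogate pairs become one reference.
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            boolean printable = cp >= 32 && cp < 127;
            if (!printable || cp == '<' || cp == '&') {
                // The reference number is the code point in decimal.
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeForTextarea("<b>5 \u20AC & up</b>"));
    }
}
```

The browser turns the references back into characters when it displays
the textarea, so the user sees and edits the HTML source itself.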

On the way back, you should be able to fish it out unencoded, as long as
the form's encoding is set to multipart/form-data instead of
application/x-www-form-urlencoded. (Again, the question, why are you
url-encoding?)

- Raz