[sclug] URL encoding/decoding question

Thu Feb 23 20:25:31 UTC 2006

On Thu, 2006-02-23 at 19:41 +0100, pieter claassen wrote:

> the issue regarding mysql not interpreting prepared statements:
> 
> [23 Feb 17:28] Mark Matthews
> 
> This is a known issue with the debian-provided builds of MySQL-4.1 (see
> the JDBC forums at http://forums.mysql.com/ for more reports).

OK, this is ... terrifying.

> The issue is listed in the bug tracker for Ubuntu
> https://launchpad.net/distros/ubuntu/+source/mysql-dfsg-4.1/+bug/3448
> but it appears upstream as well in Debian.
> 
> The only workaround (until the Debian folks figure out what's wrong
> with their build) is to add "useServerPrepStmts=false" to your JDBC URL
> to tell the JDBC driver not to use server-side prepared statements.

This is even more terrifying; the JDBC driver actually contains a
complete emulator for PreparedStatement! While this conveniently means
that the "useServerPrepStmts=false" workaround is available (the
PreparedStatement API will still work, it's just being emulated in the
JDBC driver), it also means that there's an entire extra layer of code
for things to go wrong in.

> I did two round trip submissions and this is what was logged by the db :
...
>                       6 Query       UPDATE content set data='€'
...
>                       6 Query       UPDATE content set data='?' where
...
> decoder started
> 26
> 65
> 75
> 72
> 6f
> 3b
> decoder started
> c3
> a2
> c2
> 82
> c2
> ac

OK, this is still this odd encoding problem. Note that you're still
inviting the database and driver to participate in character
encoding/decoding. Try this (after checking that the column type really
is BLOB or equivalent):

String inwardValue = ...
PreparedStatement ps = ...

ps.setBytes(n, s.getBytes("UTF-8"));

...

ResultSet rs = ...

String outwardValue = new String(rs.getBytes(n), "UTF-8");

In this way, the conversion between characters and UTF-8-encoded byte
array is handled by the Java library, and I am _certain_ that it is
correct, at least with respect to the Euro symbol.

Note that there remains a possibility that the database and/or driver
_are_ attempting to "interpret" an array of bytes en route to/from a
BLOB. Note that this is different from en/decoding a String in transit
(which the database really does need to do); if it's modifying a byte[]
in transit, then it'll break pretty well anything that anyone would use
a BLOB for. If this is so - I doubt it, but then I was surprised to hear
about the brokenness of PreparedStatements - then you do need to switch
databases.

> AND I AM STILL USING THE FOLLOWING CODE TO PRINT TO THE BROWSER:
> while (rs.next()){
> 			data=rs.getString(1);
> 			count=count+1;
> 		}

As above, calling ResultSet.getString() invites (in fact, requires) the
database to perform an interpretation. You would do well to take control
of that interpretation in the way that I propose above.

> > > OK, this doesn't make sense. If you had already switched to BLOB, then
> > > why did you care what interpretation the database was applying? What
> > > would be the point in switching the character encoding in the on-disk
> > > files and in playing with session character set/encoding parameters if
> > > you were already using BLOB anyway?
> 
> It struck me that the java connector deducted a lot of things automatically based on db settings so I 
> thought it might (and still might be) getting the returned data encoding wrong, because the DB is telling it
> that the data is latin1.

This is now clearer to me. Because you are using character-oriented
interfaces, the database's character codec is still involved. Don't do
that.

> No, this is still something I have to try but I expect we might have
> enough data in the ethereal dumps now to identify exactly where the
> problem is occurring.

We may have more than one problem. It is appropriate to control the
actual byte representations in use at all places (in the db, on the
wire, ...)

> I swear not to upgrade the bloody database again (not even to postgresql) while we are trying to sort the character
> encoding thing out ;-)

Given what you've just told me about MySQL, Debian and
PreparedStatement, we may yet have to do this. Hopefully the client-side
emulator in the driver can deal smoothly with byte[].

> POST /content/index.jsp HTTP/1.1
...
> Content-Type: application/x-www-form-urlencoded
...
> id=hardware.product2.name&lang=en&data=%26euro%3B&submit

OK, here's another problem. You're still URL-encoding. URL-encoding is
safe, when correctly used, but at present it's just compounding your
difficulties. In the HTML source, replace:

  <form action="/content/index.jsp" method="post">

with

  <form action="/content/index.jsp" method="post" enctype="
multipart/form-data">

See http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13.4.2
and note in particular:

  The content type "application/x-www-form-urlencoded" is
  inefficient for sending large quantities of binary data or text
  containing non-ASCII characters. The content type
  "multipart/form-data" should be used for submitting forms that
  contain files, non-ASCII data, and binary data.

The Euro character is non-ASCII.

This may be enough to clear the remainder of the Java->browser->Java
problem. Note that there is no particular need for you to use the €
entity; as you've already noticed it gets converted to the Euro symbol
in the round trip anyway. If you stick _soley_ to the Euro symbol in
UTF-8 (you'll see E2 82 AC if you are looking at a hex dump from
Ethereal or the contents of a BLOB), you should see that it goes to and
from the browser as those three bytes, and to and from the database that
way too.

- Raz