[sclug] URL encoding/decoding question
Graham
lists at Information-Cascade.co.uk
Fri Feb 24 01:51:05 UTC 2006
> From: "Pieter Claassen" <pieter at claassen.co.uk>
> Subject: Re: [sclug] URL encoding/decoding question
> To: "Roland Turner (SCLUG)" <raz.fpyht.bet.hx at raz.cx>
> Cc: SCLUG <sclug at sclug.org.uk>
> Message-ID: <33257.131.174.32.246.1140697622.squirrel at claassen.co.uk>
> To tell you the truth, I have long ago switched to a BLOB with MySQL 4.0
> to get rid of the utf8 issues but didn't have any luck with that (i.e.
> still had the character mangling on the browser->db->browser round trip).
all this wrapping of layers on layers is confusing from the start.
If I was you, I would immediately remove every outer layer ASAP.
In a bug free world, each layer has an apply/unapply operation.
You can write test code that shows that this is happening as expected.
You dont have to test all layers in one go, you can break it down.
Separate the webclient-to-server stage, from the SQLclient-to-server stage.
Add the hex_dump_print() functions, and singlestep your data over each line.
Each encoding/decoding step gets a name, or phrase, for example,
putting_text_into_the_middle_of_an_sgml_doc
< --becomes--> <
& --becomes--> &
ANY --becomes--> {
others optional
getting_text_out_of_the_middle_of_an_SGML_doc
un-encoding the above encoding
allow for other named sdata characters (©right; €)
what about (named) characters that wont fit into a byte?
putting_text_into_URL_encoding
SPACE --becomes--> +
+ --becomes-->
VARIOUS --becomes--> %AB hexadecimal
getting_text_from_URL_encoding
reverse of above + RTFM
URL encoding is not a good design, what will your logs look like?
encoding_8859_1_to_UNICODE
if its ASCII it goes straight
if its 8859-1 it goes straight through - except high control area
if its 8859-2 it gets mapped to values above 256
decoding_UNICODE_to_8859_1
if it was originally 8859_1 ONLY - no problem
if it contained OTHER chars, you must handle each (throw exception)
most people dont think about it, and the unknown happens
encoding_UNICODE_to_utf8 (unicode transport format 8bit)
There isnt a value that cant be encoded (upto 31 bits)
Its all too easy to double encode, 1 becomes 2 becomes 4 ...
NUL is encoded as NUL - its the one value utf8 cant handle without a NUL
(there isnt a legal sequence to generate a NUL, I want 1100-0000-1000-0000)
7-bit-values go straight through (ASCII is easy)
8 bits: abcd-efgh becomes 16 bits: 1100-00ab 10cd-efgh
(there is a simple pattern for 11, 16, 21, 26, 31 bit unicode)
The first byte (apart from ASCII) has n initial 1 bits, then a 0, never 10
The second byte always starts with 10, plus 6 data bits
(its a bit like base64 but completely different)
decoding_unicode_transport_format_back_to_unicode
systematic code that un-does the above
except many chars are illegal, and may throw an exception
illegal - first byte is 10xx-xxxx
illegal - subsequent bytes dont have 10xx-xxxx
You must have a strategy to handle chars bigger than 8 bits.
SGML uses &name; (or throw away)
encoding_a_blob_into_an_SQL_request
It is safe to put any arbitrary BLOB into a contructed SQL request string
provided you quote-away any dangerous characters.
REPLACE table1(key1,blob1) VALUES("CleanedUpKey","BLOBSTR")
VITAL: BLOBSTR is the output from: mysql_escape_string()
or (RTFM) mysql_real_escape_string() which converts backslash, NUL,
quotes, (etc) into \xxx
you then send that request using mysql_query( mysql, requestStringBuffer )
JAVA will be doing that too, or equivolent, get the jar source
IIRC - parametised requests did something similar in V3.
IIRC - calling escape on the key, leaves simple word keys unchanged
decoding_the_SQL_request_to_get_the_blob
MySQL server extracts the BLOB from the request string
encoding_a_blob_into_an_SQL_result
MySQL server does that, back to the client library
extracting_a_blob_from_the_SQL_result
this is really easy, as the client library does it for you, something like
select key1,blob1 from table1 where key1="CleanedUpKey"
int t = mysql_query( & mysql, prepared_request_buff );
if(t) { ...error happenned...}
results = mysql_use_result( & mysql ); // see also mysql_store_result()
... zero, one or more rows might be returned ... loop foreach ...
MYSQL_ROW row = mysql_fetch_row( results );
if(!row) break; // nothing returned or end of loop
unsigned long * lens = mysql_fetch_lengths( results );
// the current rows memory is managed by MySQL client
// row[0] is key1 which you already have
// len[1] is byte length of key1
void * pointer_to_blob = (Bytef*) row[1]
int nbytes_in_blob = len[1]
encoding <DATATYPE> into the DBMS_BACKEND
MySQL has a choice of engines.
Its unlikely that these errors are coming from there,
but possible that you are expected to configure it correctly.
decoding is also invisible, outside of server
So with a bit of patience, test code, test data, and by matching JAVA with C,
you should get back what was originally there. (And dump to screen).
You will be pushing data back-and-forth, through a matching transparent filter pair.
Its worth mentioning remembering that LANG-uage is different from
CHAR-set-encoding. Different (LANG) countries sort their dictionaries
in different ways. "utf8" is not the name of a language-culture.
Graham
More information about the Sclug
mailing list