[sclug] URL encoding/decoding question

Fri Feb 24 01:51:05 UTC 2006

> From: "Pieter Claassen" <pieter at claassen.co.uk>
> Subject: Re: [sclug] URL encoding/decoding question
> To: "Roland Turner (SCLUG)" <raz.fpyht.bet.hx at raz.cx>
> Cc: SCLUG <sclug at sclug.org.uk>
> Message-ID: <33257.131.174.32.246.1140697622.squirrel at claassen.co.uk>

> To tell you the truth, I have long ago switched to a BLOB with MySQL 4.0
> to get rid of the utf8 issues but didn't have any luck with that (i.e.
> still had the character mangling on the browser->db->browser round trip).

	all this wrapping of layers on layers is confusing from the start.
	If I was you, I would immediately remove every outer layer ASAP.

	In a bug free world, each layer has an apply/unapply operation.
	You can write test code that shows that this is happening as expected.

	You dont have to test all layers in one go, you can break it down.
	Separate the webclient-to-server stage, from the SQLclient-to-server stage.
	Add the hex_dump_print() functions, and singlestep your data over each line.
	Each encoding/decoding step gets a name, or phrase, for example,

		putting_text_into_the_middle_of_an_sgml_doc

			<	--becomes-->	<
			&	--becomes-->	&
			ANY	--becomes-->	{
			others optional

		getting_text_out_of_the_middle_of_an_SGML_doc

			un-encoding the above encoding
			allow for other named sdata characters (©right; €)
			what about (named) characters that wont fit into a byte?

		putting_text_into_URL_encoding

			SPACE	--becomes-->	+
			+		--becomes-->
			VARIOUS	--becomes-->	%AB		hexadecimal

		getting_text_from_URL_encoding

			reverse of above + RTFM
			URL encoding is not a good design, what will your logs look like?

		encoding_8859_1_to_UNICODE

			if its ASCII it goes straight
			if its 8859-1 it goes straight through - except high control area
			if its 8859-2 it gets mapped to values above 256

		decoding_UNICODE_to_8859_1

			if it was originally 8859_1 ONLY - no problem
			if it contained OTHER chars, you must handle each (throw exception)
			most people dont think about it, and the unknown happens

		encoding_UNICODE_to_utf8 (unicode transport format 8bit)

			There isnt a value that cant be encoded (upto 31 bits)
			Its all too easy to double encode, 1 becomes 2 becomes 4 ...

			NUL is encoded as NUL - its the one value utf8 cant handle without a NUL
			(there isnt a legal sequence to generate a NUL, I want 1100-0000-1000-0000)

			7-bit-values go straight through (ASCII is easy)
			8 bits: abcd-efgh becomes 16 bits: 1100-00ab 10cd-efgh 
			(there is a simple pattern for 11, 16, 21, 26, 31 bit unicode)
			The first byte (apart from ASCII) has n initial 1 bits, then a 0, never 10
			The second byte always starts with 10, plus 6 data bits
			(its a bit like base64 but completely different)

		decoding_unicode_transport_format_back_to_unicode

			systematic code that un-does the above
			except many chars are illegal, and may throw an exception
				illegal - first byte is 10xx-xxxx
				illegal - subsequent bytes dont have 10xx-xxxx

			You must have a strategy to handle chars bigger than 8 bits.
			SGML uses &name; (or throw away)

		encoding_a_blob_into_an_SQL_request

			It is safe to put any arbitrary BLOB into a contructed SQL request string
			provided you quote-away any dangerous characters. 

				REPLACE table1(key1,blob1) VALUES("CleanedUpKey","BLOBSTR")

			VITAL: BLOBSTR is the output from:  mysql_escape_string() 
			or (RTFM) mysql_real_escape_string() which converts backslash, NUL,
			quotes, (etc) into \xxx

			you then send that request using mysql_query( mysql, requestStringBuffer )
			JAVA will be doing that too, or equivolent, get the jar source

			IIRC - parametised requests did something similar in V3.
			IIRC - calling escape on the key, leaves simple word keys unchanged

		decoding_the_SQL_request_to_get_the_blob
			MySQL server extracts the BLOB from the request string

		encoding_a_blob_into_an_SQL_result
			MySQL server does that, back to the client library

		extracting_a_blob_from_the_SQL_result
			this is really easy, as the client library does it for you, something like 

				select key1,blob1 from table1 where key1="CleanedUpKey"

			int t = mysql_query( & mysql, prepared_request_buff );
			if(t) { ...error happenned...}

			results = mysql_use_result( & mysql ); // see also mysql_store_result()
			 ... zero, one or more rows might be returned ... loop foreach ...

			MYSQL_ROW row = mysql_fetch_row( results );
			if(!row) break; // nothing returned or end of loop
			unsigned long * lens = mysql_fetch_lengths( results );
			// the current rows memory is managed by MySQL client

			// row[0] is key1 which you already have
			// len[1] is byte length of key1

			void * pointer_to_blob = (Bytef*) row[1]
			int nbytes_in_blob = len[1]

		encoding <DATATYPE> into the DBMS_BACKEND

			MySQL has a choice of engines.
			Its unlikely that these errors are coming from there,
			but possible that you are expected to configure it correctly.
			decoding is also invisible, outside of server

	So with a bit of patience, test code, test data, and by matching JAVA with C,
	you should get back what was originally there. (And dump to screen).
	You will be pushing data back-and-forth, through a matching transparent filter pair.

	Its worth mentioning remembering that LANG-uage is different from
	CHAR-set-encoding. Different (LANG) countries sort their dictionaries
	in different ways. "utf8" is not the name of a language-culture.

Graham