[Wolves] TXT to XML

Andy Wootton andy.wootton at wyrley.demon.co.uk
Thu Nov 4 00:22:54 GMT 2004

sparkes wrote:

> Simon Burke wrote:
>> I can put my own stuff in whatever, not a problem, but it's the RFCs et
>> al. that I need formatting. They are in plain ASCII format, and I want
>> to make them look all nice and shiny with not necessarily XML but
>> something that can be easily read in a browser and looks nice.
> I think the point Aq was making is that you are attempting to add
> formatting where it currently does not exist. This is one of the
> hardest jobs you can be asked to do but one that customers seem to
> think is trivial ;-)
> It would only be a guess what the title is, who the author is, which
> bits should be highlighted. Are paragraphs separated by one newline
> (or two), or is it something to do with \t?
> Unless the ASCII text files have some standard structure it's
> impossible to make all these guesses.
> ...

> sparkes


Something I played with recently might be of interest. I was looking for 
a way to build documents out of components (as you might do with 
conditional compilation of source code for different customers) so that 
I could create versions of documents for different audiences from 
component libraries of 'statements'. LaTeX was my first thought but I 
started looking at DocBook XML, having been impressed by the Fedora Core 
Linux documents. The tools aren't really there yet for the 'general 
writer' but if you don't mind hacking XML in Emacs then it looked like a 
good way to go and clearly has a future. While doing some evaluation 
work I realised that I could build a DocBook XML 'wrapper' that used 
include statements to pull in chunks of text from individual flat text 
files. I was happy for the text to be reformatted but you might want it 
to be included with existing layout preserved. Using these techniques 
you can put pretty headers and footers onto existing documents without 
much effort and format them for the web or printing using one of the 
DocBook tool chains. Documents could be created or edited by hand later 
to make the necessary separation of structure, style and content that 
Sparkes refers to (human beings are quite good at that, particularly 
programmers).
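
A minimal sketch of such a wrapper, assuming XInclude with parse="text" 
is the include mechanism (the message does not say which one was used; 
external entities would be another option). The function name 
build_wrapper and the file name rfc2616.txt are hypothetical, chosen only 
for illustration. Each text file is pulled into a <literallayout> element 
so that its existing line breaks are preserved.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C XInclude specification.
XI_NS = "http://www.w3.org/2001/XInclude"


def build_wrapper(title, text_files):
    """Return a DocBook-style article that includes each flat text file.

    Assumption: the files are pulled in with XInclude (parse="text"),
    and the article uses un-namespaced DocBook 4-style element names.
    """
    ET.register_namespace("xi", XI_NS)
    article = ET.Element("article")
    ET.SubElement(article, "title").text = title
    for path in text_files:
        section = ET.SubElement(article, "section")
        # Use the file name as a placeholder title; a human can improve it later.
        ET.SubElement(section, "title").text = path
        # literallayout keeps the original line breaks of the text file.
        literal = ET.SubElement(section, "literallayout")
        ET.SubElement(
            literal,
            "{%s}include" % XI_NS,
            {"href": path, "parse": "text"},
        )
    return ET.tostring(article, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical input file; replace with real plain-text documents.
    print(build_wrapper("Collected RFCs", ["rfc2616.txt"]))
```

The generated file only describes the includes; an XInclude-aware 
processor (for example xmllint or xsltproc run with their --xinclude 
option) has to resolve them before the document is styled for the web or 
for print.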

I didn't succeed in selling the idea at work because, well ... you 
couldn't do it in Microsoft Office. I have a sneaking suspicion that 
no one else could see that the benefits justified the additional 
complexity of the method (they weren't programmers).

I think document processing like this is far more in keeping with the 
Unix philosophy than the pseudo-Microsoft approach of suites like 
OpenOffice.org. Yes, OOo is nearly as good as MS Office [ ducks under 
desk to avoid the flames ] but I'm looking for a much better tool than 
that, built on fundamentally stronger principles.

