[Wolves] TXT to XML
Andy Wootton
andy.wootton at wyrley.demon.co.uk
Thu Nov 4 00:22:54 GMT 2004
sparkes wrote:
> Simon Burke wrote:
>
>>
>> I can put my own stuff in whatever, not a problem, but it's the RFCs
>> et al. that I need formatting. They are in plain ASCII format, and I
>> want to make them look all nice and shiny, not necessarily with XML
>> but with something that can be easily read in a browser and looks nice.
>
> I think the point Aq was making is that you are attempting to add
> formatting where it currently does not exist. This is one of the
> hardest jobs you can be asked to do but one that customers seem to
> think is trivial ;-)
>
> It would only be a guess what the title is, who the author is, and
> which bits should be highlighted. Do paragraphs have a newline (or
> two) between them, or is it something to do with \t?
>
> Unless the ASCII text files have some standard structure it's
> impossible to make all these guesses.
> ...
> sparkes
Simon,
Something I played with recently might be of interest. I was looking for
a way to build documents out of components (as you might do with
conditional compilation of source code for different customers) so that
I could create versions of documents for different audiences from
component libraries of 'statements'. LaTeX was my first thought but I
started looking at DocBook XML, having been impressed by the Fedora Core
Linux documents. The tools aren't really there yet for the 'general
writer' but if you don't mind hacking XML in Emacs then it looked like a
good way to go and clearly has a future. While doing some evaluation
work I realised that I could build a DocBook XML 'wrapper' that used
include statements to pull in chunks of text from individual flat text
files. I was happy for the text to be reformatted but you might want it
to be included with existing layout preserved. Using these techniques
you can put pretty headers and footers onto existing documents without
much effort and format them for the web or printing using one of the
DocBook tool chains. Documents could be created or edited later to
manually make the necessary separation of structure, style and content
that Sparkes refers to (because human beings are quite good at it,
particularly programmers).
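A minimal sketch of such a wrapper, not the actual files described above.
It assumes XInclude with parse="text" as the include mechanism and wraps
each chunk in a DocBook literallayout element so the existing layout is
preserved. The function name, titles and file names are hypothetical. The
generated file would then need an XInclude-aware processor, for example
xmllint --xinclude or xsltproc --xinclude, before going through a DocBook
tool chain.

```python
from xml.sax.saxutils import escape, quoteattr

# Namespace defined by the W3C XInclude 1.0 recommendation.
XI_NS = "http://www.w3.org/2001/XInclude"


def build_wrapper(title, chunks):
    """Return DocBook XML text for an article that includes flat text files.

    chunks is a list of (section_title, text_file_path) pairs; both the
    titles and the paths are illustrative, not taken from a real project.
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<article xmlns:xi="%s">' % XI_NS,
        "  <title>%s</title>" % escape(title),
    ]
    for section_title, path in chunks:
        lines += [
            "  <section>",
            "    <title>%s</title>" % escape(section_title),
            # parse="text" pulls the file in as character data, so the
            # plain text needs no XML escaping of its own.
            "    <literallayout><xi:include href=%s parse=\"text\"/>"
            "</literallayout>" % quoteattr(path),
            "  </section>",
        ]
    lines.append("</article>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical input file; replace with real flat text files.
    print(build_wrapper("RFC collection", [("RFC 2822", "rfc2822.txt")]))
```

Dropping the literallayout element and including the text inside a para
instead would let the tool chain reflow the text rather than preserve its
layout.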
I didn't succeed in selling the idea at work because, well ... you
couldn't do it in Microsoft Office. I have a sneaking suspicion that
no one else could see that the benefits justified the additional
complexity of the method (they weren't programmers).
I think document processing like this is far more in keeping with the
Unix philosophy than the pseudo-Microsoft approach of suites like
OpenOffice.org. Yes, OOo is nearly as good as MS Office [ ducks under
desk to avoid the flames ] but I'm looking for a much better tool than
that, built on fundamentally stronger principles.
Woo