On 11/24/2010 7:04 PM, Peter Flynn wrote:
>> Basically, the parser has to look at every character in the input XML
>> data, to determine where "<" and ">" occur. So in the process of doing
>> that, it could also be identifying pre-determined strings, like element
>> names. Otherwise, whether it's SAX or DOM, all it's telling the calling
>> application is that there are elements.
> I don't think that's correct. At least I really hope not.
Well, at some level, the software has to examine the actual string data
to identify the characters. I'm not sure how much overhead there is in
re-examining the same strings again and again, but there has to be
some. If my XML consists just of "<blah/>," then a parser has to read
this to determine that it's a single empty element whose name is
"blah." It then either gives me a tree containing the element (DOM), or
calls my 'beginElement' and 'endElement' callbacks (SAX). Either way, I
now have to do string comparisons against the name of this element in
order to figure out that it's a "blah," as opposed to a "whosis" or a
"whatsit."
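To illustrate what I mean, here's a toy sketch using Python's xml.sax
(the element names are made up, not anyone's real schema):

import xml.sax

class Handler(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        # The parser already scanned the bytes to find this name, but my
        # code still has to compare the string all over again.
        if name == "blah":
            print("it's a blah")
        elif name == "whosis":
            print("it's a whosis")

xml.sax.parseString(b"<blah/>", Handler())

So every element name gets examined at least twice: once by the parser
and once by my callback.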
>> Yes, XSLT is looking good. I've also found some example XSLT files for
>> generating TeX/LaTeX, so that's a good head start.
> I have acres of those if you need them. There's a big one downloadable
> from my online LaTeX book (see production note at
> http://latex.silmaril.ie/formattinginformation/intro.html#prodnote for
> the link).
Wow! Thank you. Obviously I found the right guy to be talking to about
this.
>> Unfortunately, I can't pre-process the images in this application. I
>> don't know what or where they are until I process the XML. I could
>> process the XML twice, but that's ugly. It would mean processing the
>> XML to make a list of image URIs, then running through the list with
>> some other process to get all the image info, and finally processing the
>> XML again to generate the markup.
> That's exactly what I would do: the script fragment I posted does that.
> Passing over the document to grab the image file names is very fast: I
> just tested lxgrep on a very complex 1Mb DocBook document and it took
> under a second:
> $ date;lxgrep -w pix '//graphic' thesis.xml>pix.xml 2>pix.err;date
> Wed Nov 24 23:32:43 GMT 2010
> Wed Nov 24 23:32:44 GMT 2010
> Grabbing the info from the images is going to take about the same amount
> of time no matter what way you do it. I wouldn't be afraid to preprocess
> a document in order to get the data: it's a perfectly normal way to work
> in a pipeline.
I'll take a closer look at it, but my preference would be for a one-pass
solution. I may be able to push some of the actual image reading to
TeX/LaTeX/DVI, since those files will have to be retrieved and opened
again there anyway.
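Roughly what I have in mind for the one-pass version, as a sketch in
Python with lxml and Pillow (the "doc.xml" file name and the
<graphic fileref="..."> shape are just assumptions based on the DocBook
examples above):

from lxml import etree
from PIL import Image   # Pillow, just to read pixel dimensions

# Single pass: whenever a <graphic> element ends, open its image there
# and then, so the XML only gets walked once.
for event, elem in etree.iterparse("doc.xml", tag="graphic"):
    fileref = elem.get("fileref")
    with Image.open(fileref) as img:
        width, height = img.size     # pixels; unit conversion elided
    print(fileref, width, height)    # here I'd emit \includegraphics instead
    elem.clear()                     # keep memory flat on big documents

And if graphicx can be left to work out each image's natural size at TeX
time, the Image.open step disappears entirely, which is what I meant by
pushing the image reading downstream.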
>> Yes, so do I. As I said, XSLT seems to be *almost* there, but I need to
>> solve the image problem.
> I bit the bullet and finished the script. This is rough and ready, but
> it reads the file above and gets the image data. There are a few
> MB-sized EPS files which take up most of the 19 seconds: omitting them
> brought it down to 8 secs. (Not all my @filerefs have the images/ path
> so some of the script rationalises that, and needs to be omitted elsewhere.)
Thanks so much! I'll definitely look into this approach. I'm
anticipating documents using tens of thousands of images, though many of
them may recur. I'll have to weigh the alternatives.
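If many of those tens of thousands of references point at the same files,
a small cache keyed on the fileref ought to mean each distinct image is
only opened once, along these lines (again just a sketch):

from functools import lru_cache
from PIL import Image

@lru_cache(maxsize=None)
def image_size(fileref):
    # Each distinct file is read exactly once; repeats hit the cache.
    with Image.open(fileref) as img:
        return img.size

That would make the cost scale with the number of distinct images rather
than the number of references.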
The Tech Curmudgeon - http://www.techcurmudgeon.com
Ideas Great and Dumb - http://www.ideasgreatanddumb.com