XML-L Archives

Subject: Re: Somewhere between parsing and data binding?
From: Peter Flynn
Reply-To: General discussion of Extensible Markup Language
Date: Thu, 25 Nov 2010 00:04:31 +0000

On 24/11/10 23:01, Peter Davis wrote:
> I hope you're at least enjoying this exchange, as I am.  I'd hate to be
> wasting your time.

It's on-topic anyway :-)

[...]
> Basically, the parser has to look at every character in the input XML
> data, to determine where "<" and ">" occur.  So in the process of doing
> that, it could also be identifying pre-determined strings, like element
> names.  Otherwise, whether it's SAX or DOM, all it tells the calling
> application is that there are elements.

I don't think that's correct. At least I really hope not.

> The application then has to scan
> the strings again to figure out what the element names are.  That's the
> primary inefficiency that troubles me.

I would be very worried if it had to do this. AFAIK the result of a
parse is an InfoSet (PSVI or Grove) which contains the whole caboodle,
names and all. I think I should go check with someone who writes this
stuff internally, just to make sure.
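
A minimal sketch of that point, assuming Python's standard xml.sax
module as a stand-in for any SAX-style parser (Expat's C API differs in
detail). The handler class, the helper function and the sample document
are all made up for illustration. It only shows that the element name
arrives in the callback as a ready-made string, so the application has
no reason to rescan the raw input for it; whether comparing those names
is itself a cost worth worrying about is a separate question.

```python
import xml.sax


class NameCollector(xml.sax.ContentHandler):
    """Record each element name exactly as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        # The parser has already tokenised the start tag: `name` is the
        # element name as a string, `attrs` holds the attributes.
        self.names.append(name)


def element_names(xml_text):
    """Parse xml_text and return the element names in document order."""
    handler = NameCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.names


# Hypothetical sample document, not taken from thesis.xml.
SAX_SAMPLE = '<article><title>T</title><graphic fileref="images/tree"/></article>'

if __name__ == "__main__":
    print(element_names(SAX_SAMPLE))
```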

> I'm not sure what you mean by "procedural" here.  This is pretty much
> the approach that Expat and other SAX parsers require you to take.  For
> DOM parsers, the code is similar, but it's in the context of traversing
> the tree.

I'm sorry, I didn't mean procedural at the level of the parser. I meant
at the level of handling the document (eg in XSLT).

[...]
> Yes, XSLT is looking good.  I've also found some example XSLT files for
> generating TeX/LaTeX, so that's a good head start.

I have acres of those if you need them. There's a big one downloadable
from my online LaTeX book (see production note at
http://latex.silmaril.ie/formattinginformation/intro.html#prodnote for
the link).

[...]
> Unfortunately, I can't pre-process the images in this application.  I
> don't know what or where they are until I process the XML.  I could
> process the XML twice, but that's ugly.  It would mean processing the
> XML to make a list of image URIs, then running through the list with
> some other process to get all the image info, and finally processing the
> XML again to generate the markup.

That's exactly what I would do: the script fragment I posted does that.
Passing over the document to grab the image file names is very fast: I
just tested lxgrep on a very complex 1 MB DocBook document and it took
under a second:

$ date;lxgrep -w pix '//graphic' thesis.xml >pix.xml 2>pix.err;date
Wed Nov 24 23:32:43 GMT 2010
Wed Nov 24 23:32:44 GMT 2010

Grabbing the info from the images is going to take about the same amount
of time no matter what way you do it. I wouldn't be afraid to preprocess
a document in order to get the data: it's a perfectly normal way to work
in a pipeline.
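
For anyone without lxgrep to hand, the first pass can be sketched with
Python's standard xml.etree.ElementTree instead. This is an assumed
equivalent, not what lxgrep does internally: the function name and the
sample document are invented, and the code handles only un-namespaced
graphic elements carrying a fileref attribute.

```python
import xml.etree.ElementTree as ET


def graphic_filerefs(xml_text):
    """Return the fileref of every <graphic> element, in document order."""
    root = ET.fromstring(xml_text)
    # root.iter("graphic") visits every graphic element at any depth,
    # roughly what the XPath //graphic selects.
    return [g.get("fileref") for g in root.iter("graphic") if g.get("fileref")]


# Hypothetical sample document, not taken from thesis.xml.
PIPELINE_SAMPLE = """<book>
  <chapter><graphic fileref="images/tree"/></chapter>
  <chapter><para>text</para><graphic fileref="distrib"/><graphic/></chapter>
</book>"""

if __name__ == "__main__":
    for name in graphic_filerefs(PIPELINE_SAMPLE):
        print(name)
```

The resulting list is what the second stage of the pipeline would hand
to whatever tool inspects the image files.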

> Yes, so do I.  As I said, XSLT seems to be *almost* there, but I need to
> solve the image problem. 

I bit the bullet and finished the script. This is rough and ready, but
it reads the file above and gets the image data. A few EPS files of a
megabyte or so account for most of the 19 seconds: omitting them
brought it down to 8 secs. (Not all my @filerefs include the images/
path, so part of the script normalises that; leave that part out if your
file references are already consistent.)

> $ date;echo "<images>";for f in `lxgrep '//graphic/@fileref' thesis.xml 2>/dev/null | awk -F\" '{print $2}' | grep -v '^$' | sed -e "s+images/++"`; do if [ -s images/$f.png ]; then F=$f.png; elif [ -s images/$f.pdf ]; then F=$f.pdf; else F=$f.eps; fi; identify images/$F; done | awk '{n=split($3,xy,"x");print "<image name=\"" $1 "\" type=\"" $2 "\" x=\"" xy[1] "\" y=\"" xy[2] "\" depth=\"" $5 "\" size=\"" $(NF-2) "\"/>"}';echo "</images>";date
> Wed Nov 24 23:57:20 GMT 2010
> <images>
> <image name="images/ptx104.eps" type="PS" x="828" y="1134" depth="16-bit" size="996KiB"/>
> <image name="images/vin155.eps" type="PS" x="1069" y="967" depth="16-bit" size="767KiB"/>
> <image name="images/legal-mono.eps" type="PS" x="360" y="504" depth="16-bit" size="34.8KiB"/>
> <image name="images/vesalius-text.png" type="PNG" x="1311" y="2010" depth="8-bit" size="325KiB"/>
> <image name="images/floatingpara.pdf" type="PDF" x="595" y="842" depth="16-bit" size="61.7KiB"/>
> <image name="images/floatingpara-edit.png" type="PNG" x="674" y="479" depth="8-bit" size="948KiB"/>
> <image name="images/tree.png" type="PNG" x="1009" y="689" depth="8-bit" size="36KiB"/>
> <image name="images/distrib.png" type="PNG" x="396" y="357" depth="8-bit" size="23.2KiB"/>
> <image name="images/exp-q2-crop.pdf" type="PDF" x="290" y="288" depth="16-bit" size="10.5KiB"/>
> <image name="images/editors-crop.pdf" type="PDF" x="765" y="493" depth="16-bit" size="46.3KiB"/>
> <image name="images/survey-oocalc-encode.png" type="PNG" x="1023" y="689" depth="8-bit" size="73.3KiB"/>
> <image name="images/framemap-crop.pdf" type="PDF" x="606" y="237" depth="16-bit" size="17.7KiB"/>
> <image name="images/model-dynamics-crop.pdf" type="PDF" x="818" y="514" depth="16-bit" size="51.8KiB"/>
> <image name="images/example-screen-crop.pdf" type="PDF" x="598" y="451" depth="16-bit" size="33.1KiB"/>
> <image name="images/criticalturns-edit.png" type="PNG" x="1024" y="690" depth="8-bit" size="195KiB"/>
> <image name="images/criticalturns-typeset.png" type="PNG" x="642" y="687" depth="8-bit" size="102KiB"/>
> <image name="images/modal-ws-crop.pdf" type="PDF" x="169" y="234" depth="16-bit" size="5.09KiB"/>
> <image name="images/tabex-crop.pdf" type="PDF" x="228" y="98" depth="16-bit" size="2.84KiB"/>
> <image name="images/modal-tab-crop.pdf" type="PDF" x="169" y="238" depth="16-bit" size="5.18KiB"/>
> <image name="images/bs-del-example-before.png" type="PNG" x="601" y="171" depth="8-bit" size="30.1KiB"/>
> <image name="images/bs-del-example-after.png" type="PNG" x="601" y="138" depth="8-bit" size="28.7KiB"/>
> <image name="images/modal-sp-crop.pdf" type="PDF" x="169" y="198" depth="16-bit" size="4.32KiB"/>
> <image name="images/lyxtoolbar.png" type="PNG" x="874" y="91" depth="8-bit" size="26.9KiB"/>
> <image name="images/ibutton-crop.pdf" type="PDF" x="139" y="149" depth="16-bit" size="2.68KiB"/>
> <image name="images/fontdropdown-crop.pdf" type="PDF" x="253" y="164" depth="16-bit" size="5.19KiB"/>
> <image name="images/desclistbutton-crop.pdf" type="PDF" x="390" y="28" depth="16-bit" size="1.4KiB"/>
> <image name="images/cursor1-crop.pdf" type="PDF" x="352" y="138" depth="16-bit" size="5.99KiB"/>
> <image name="images/cursor2-crop.pdf" type="PDF" x="352" y="138" depth="16-bit" size="5.99KiB"/>
> <image name="images/cursor3-crop.pdf" type="PDF" x="352" y="138" depth="16-bit" size="5.99KiB"/>
> <image name="images/cursor4-crop.pdf" type="PDF" x="352" y="138" depth="16-bit" size="5.99KiB"/>
> <image name="images/cursor5-crop.pdf" type="PDF" x="313" y="144" depth="16-bit" size="5.69KiB"/>
> <image name="images/cursor6-crop.pdf" type="PDF" x="313" y="147" depth="16-bit" size="5.8KiB"/>
> <image name="images/cursor7-crop.pdf" type="PDF" x="312" y="144" depth="16-bit" size="5.55KiB"/>
> <image name="images/cursor8-crop.pdf" type="PDF" x="313" y="243" depth="16-bit" size="9.55KiB"/>
> <image name="images/dndtoolbar.png" type="PNG" x="619" y="402" depth="8-bit" size="731KiB"/>
> <image name="images/expert-pilot1.pdf" type="PDF" x="595" y="842" depth="16-bit" size="61.7KiB"/>
> <image name="images/expert-pilot2.pdf" type="PDF" x="595" y="842" depth="16-bit" size="61.7KiB"/>
> </images>
> Wed Nov 24 23:57:39 GMT 2010
> $ 
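
The awk stage at the end of that one-liner is the hardest part to read,
so here is a sketch of the same field-picking in Python. The function
name is invented, and the sample line is an assumption about what
ImageMagick's identify printed in this setup (name, type, dimensions,
page geometry, depth, class, size, then two timing fields); other
versions may lay the fields out differently, in which case the indexes
need adjusting.

```python
from xml.sax.saxutils import quoteattr


def identify_to_image(line):
    """Turn one line of `identify` output into an <image/> element.

    Mirrors the awk fields used in the script: $1 name, $2 type,
    $3 WIDTHxHEIGHT, $5 depth, $(NF-2) size.
    """
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("unexpected identify output: %r" % line)
    width, _, height = fields[2].partition("x")
    size = fields[-3]  # third field from the end, i.e. awk's $(NF-2)
    return "<image name=%s type=%s x=%s y=%s depth=%s size=%s/>" % (
        quoteattr(fields[0]), quoteattr(fields[1]),
        quoteattr(width), quoteattr(height),
        quoteattr(fields[4]), quoteattr(size),
    )


# Assumed example of an identify output line, for illustration only.
IDENTIFY_SAMPLE = ("images/tree.png PNG 1009x689 1009x689+0+0 8-bit "
                   "DirectClass 36KiB 0.000u 0:00.000")

if __name__ == "__main__":
    print("<images>")
    print(identify_to_image(IDENTIFY_SAMPLE))
    print("</images>")
```

Using quoteattr rather than pasting the values between literal quotes
means a file name containing & or < still yields well-formed XML, which
the awk version does not guard against.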

///Peter
