Quoting Peter Davis <[log in to unmask]>:
> I'm a new member, so I hope this hasn't already been beaten to death.
Welcome aboard, and no, I don't think this topic has ever arisen before.
> I'm trying to build a prototype as a proof of concept for a workflow
> that requires parsing two XML files. In coding this (C++), it occurred
> to me that having to conditionally test an element's name *after* the
> parser has already scanned it is redundant.
I'm not clear what the difficulty is here: this would be the normal
way of doing it. A parser checks for well-formedness, and then hands
the resulting tree to the application so that the application can do
whatever it's supposed to do.
> There's no reason I should
> have to perform a sequence of string comparisons on strings the parser
> has already scanned.
Ah. I am probably being dense here. Are you trying to locate or
isolate one specific element from within each file (ie as opposed to
needing to handle all the elements)?
> On the other hand, I don't have schemas for these files, so I can't use
> a data binding tool like CodeSynthesis.
> I thought a worthwhile improvement over existing SAX(2) parsers would
> be to simply allow registering element name specific callbacks. Instead
> of just calling me when any element begins or ends, I want to register
> a function for when <document> begins or ends, and another function for
> when <page> begins or ends, etc. This would make the code *much*
> cleaner, and eliminate multiple passes over the same strings.
I think this is outside the scope of a parser itself, which by
definition has to read the entire document before it can be satisfied
that it is usable.
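That said, the callback registration you describe is easy to layer on top of any SAX-style parser in application code, so that the string comparison happens exactly once, as a dictionary lookup. A sketch in Python using the stdlib xml.sax, purely as an illustration (the same pattern works with expat from C++):

```python
# Sketch: layering element-name-specific callbacks on top of a
# generic SAX handler in application code. Not a feature of any
# existing parser API; the dispatch table is the application's.
import xml.sax

class DispatchingHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.start_handlers = {}   # element name -> callable
        self.end_handlers = {}

    def on_start(self, name, func):
        self.start_handlers[name] = func

    def on_end(self, name, func):
        self.end_handlers[name] = func

    # The one name lookup happens here, as a dict access.
    def startElement(self, name, attrs):
        handler = self.start_handlers.get(name)
        if handler:
            handler(attrs)

    def endElement(self, name):
        handler = self.end_handlers.get(name)
        if handler:
            handler()

seen = []
h = DispatchingHandler()
h.on_start("page", lambda attrs: seen.append("page start"))
h.on_end("page", lambda: seen.append("page end"))
xml.sax.parseString(b"<document><page/><other/></document>", h)
print(seen)   # -> ['page start', 'page end']
```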
Can you explain a little more about what you are actually trying to
do? I tend to build workflows at the scripting level rather than in a
single language because the facilities that can be plugged into the
pipeline are much more extensive.
> I have not come across this yet, though admittedly there are a *lot* of
> XML parsers out there, and I haven't seen them all. It seems like a
> fairly obvious enhancement to an otherwise simple SAX parser.
There are a number of tools which can be used to extract specific
individual elements, but by design they are mostly limited to handing
you the whole element, not the isolated start-tag or end-tag.
XSLT is probably the most common, but I assume you have already looked
at this. XQuery may also be useful if you are looking to identify an
individual element.
There are also lxgrep and lxprintf (part of the ltxml2 package at
http://www.cogsci.ed.ac.uk/~richard/) which are very fast at pulling
stuff out of XML documents.
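If a scripting language is an option, the same whole-element extraction can be done with a limited XPath expression in Python's stdlib ElementTree (a sketch, not a substitute for those tools):

```python
# Sketch: pulling out whole <page> elements with an XPath-style
# expression, using only the Python standard library. Like lxgrep,
# this hands you the complete element, not the bare tags.
import xml.etree.ElementTree as ET

doc = "<document><page n='1'>one</page><page n='2'>two</page></document>"
root = ET.fromstring(doc)
for page in root.findall(".//page"):
    print(page.get("n"), page.text)
# -> 1 one
#    2 two
```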
The old onsgmls parser (part of OpenSP at
http://sourceforge.net/projects/openjade/files/opensp/) can still
produce an ESIS stream, which is a line-by-line decomposition of the
document, and is very useful for creating trigger conditions because
it *does* expose the start-tags and end-tags as separate objects.
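Consuming such an ESIS stream to fire per-element triggers is then a few lines in any language. A sketch in Python, assuming the `(GI` / `)GI` / `-data` line conventions that onsgmls emits (the sample input here is invented for illustration):

```python
# Sketch: turning an ESIS stream (as produced by onsgmls) into
# start-tag/end-tag trigger calls. ESIS prefixes each line with
# '(' for a start-tag, ')' for an end-tag, '-' for character data.
def dispatch_esis(lines, on_start, on_end):
    for line in lines:
        if line.startswith("("):
            on_start(line[1:])
        elif line.startswith(")"):
            on_end(line[1:])
        # '-' (data), 'A' (attribute), etc. are ignored here

events = []
esis = ["(DOCUMENT", "(PAGE", "-some text", ")PAGE", ")DOCUMENT"]
dispatch_esis(esis,
              lambda gi: events.append("start " + gi),
              lambda gi: events.append("end " + gi))
print(events)
# -> ['start DOCUMENT', 'start PAGE', 'end PAGE', 'end DOCUMENT']
```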
There are many libraries for Perl, Python, Tcl, and other scripting
languages which may do the same, but I have no in-depth experience of
them. Ditto for C, C++, Java, etc.
I try to avoid using non-XML solutions, because they are prone to
misinterpret the markup in certain conditions (inside a CDATA marked
section, for example), but depending on the nature of your documents,
it may be possible to use nasty tricks like turning all newlines into
spaces and then turning all < characters into newlines and all >
characters into spaces; this leaves the element type name at the start
of every line, followed by a space, which can then be isolated, eg
tr '\012<>' '\040\012\040' < myfile.xml | grep '^page '
but without knowing what you want to do with the result it isn't
possible to say whether that approach is useful.
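For completeness, a streaming, markup-aware alternative to the tr trick, which does report start and end events separately. This sketch uses Python's stdlib iterparse, but any pull/stream parser would do:

```python
# Sketch: a markup-aware version of the tr trick. iterparse streams
# the document and reports 'start' and 'end' events separately,
# without being fooled by CDATA sections or attribute values.
import io
import xml.etree.ElementTree as ET

doc = b"<document><page>first</page><page>second</page></document>"
hits = []
for event, elem in ET.iterparse(io.BytesIO(doc), events=("start", "end")):
    if elem.tag == "page":
        hits.append((event, elem.text if event == "end" else None))
print(hits)
# -> [('start', None), ('end', 'first'), ('start', None), ('end', 'second')]
```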