Getting an index from a parser

Shane Burgel
Ranch Hand

Joined: Sep 09, 2003
Posts: 47
Is it possible to get the index of a tag when you break on a START_ELEMENT event using StAX (or some other parser)?

For instance, if I had a file that was 1000 characters long and the first <record> tag began at character 12, is there a way for StAX (or some other parser) to tell me that? I can't use DOM because my files are too big, and right now I'm having to manually read the file a byte at a time so that I can get these indexes.

Thanks

Shane
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18563

Yes, that's what a Locator is for. org.xml.sax.Locator
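
For reference, a minimal sketch of how a Locator is typically used with SAX. The class name RecordLines, the helper linesOf, and the sample document are made up for illustration. Note that a Locator reports line and column numbers, not an offset into the file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of any library.
public class RecordLines {

    /** Records the line number the Locator reports for each matching start tag. */
    static class LineHandler extends DefaultHandler {
        private final String elementName;
        private final List<Integer> lines = new ArrayList<>();
        private Locator locator;

        LineHandler(String elementName) {
            this.elementName = elementName;
        }

        @Override
        public void setDocumentLocator(Locator locator) {
            // The parser hands over its Locator before the first event, if it has one.
            this.locator = locator;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if (locator != null && elementName.equals(qName)) {
                // SAX defines this as the position where the current event ends.
                lines.add(locator.getLineNumber());
            }
        }
    }

    static List<Integer> linesOf(InputStream in, String elementName) throws Exception {
        LineHandler handler = new LineHandler(elementName);
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.lines;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<records>\n<record/>\n<record/>\n</records>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(linesOf(in, "record"));
    }
}
```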
Shane Burgel
Ranch Hand

Joined: Sep 09, 2003
Posts: 47
Cool, but can I use that with a StAX parser, or do I have to use SAX?
Shane Burgel
Ranch Hand

Joined: Sep 09, 2003
Posts: 47
On second glance that won't work for me. I need the index, not the line number and column number.

Any other ideas?
Karthik Shiraly
Ranch Hand

Joined: Apr 04, 2009
Posts: 500
XMLStreamReader.getLocation() returns a javax.xml.stream.Location. Use its getCharacterOffset() to get the current offset into the stream (a byte offset when the input is a file or byte stream).

Also, XMLEventReader.nextEvent() returns XMLEvent, which has a getLocation() to get a Location object.
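
A minimal sketch of that approach, assuming whatever StAX implementation XMLInputFactory.newInstance() resolves to. The class name RecordOffsets, the helper offsetsOf, and the sample document are made up for illustration. Whether the reported offset points at the start or the end of the tag depends on the implementation, so no expected values are shown.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; not part of any library.
public class RecordOffsets {

    /** Collects the offset reported at each START_ELEMENT whose local name matches. */
    static List<Integer> offsetsOf(InputStream in, String elementName) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        List<Integer> offsets = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    // The Location is only valid until the next call to next(),
                    // so read the offset immediately. -1 means "unknown".
                    offsets.add(reader.getLocation().getCharacterOffset());
                }
            }
        } finally {
            reader.close();
        }
        return offsets;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<records><record>a</record><record>b</record></records>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(offsetsOf(in, "record"));
    }
}
```

The streaming loop never holds more than the current event, so it should cope with files that are too big for DOM.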
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12778
So why not count the elements as the events come through to get your "index"?

Bill
Shane Burgel
Ranch Hand

Joined: Sep 09, 2003
Posts: 47
I found that after my last post but I can't seem to get the correct index from it.

With my byte-by-byte parser I get this as my first eleven offsets:
[3235, 6467, 9699, 12931, 16163, 19395, 22627, 25859, 29091, 32323, 35555]

The Location object gives me this. They aren't even in order, and I know for a fact that there are none before 3235, so why are there numbers lower than that?
[3235, 6467, 1513, 4745, 7977, 3025, 6257, 1302, 4534, 7766, 2810]

Here is my code:

Karthik Shiraly
Ranch Hand

Joined: Apr 04, 2009
Posts: 500
Are you using a BufferedInputStream around a FileInputStream?
Shane Burgel
Ranch Hand

Joined: Sep 09, 2003
Posts: 47
Karthik Shiraly wrote: Are you using a BufferedInputStream around a FileInputStream?


Shane Burgel
Ranch Hand

Joined: Sep 09, 2003
Posts: 47
My Location object is an MXParser, and the correct number can be obtained by adding the bufAbsoluteStart value to the number that is returned.

How do I get the Location instance to return THAT number?
Karthik Shiraly
Ranch Hand

Joined: Apr 04, 2009
Posts: 500
Was browsing the MXParser source code. You're right, it maintains its own temporary character buffer with a size limit, and the offset returned is an offset into that buffer. I feel this isn't a correct implementation by MXParser. I guess you'll have to do a dirty workaround: subclass MXParser and add bufAbsoluteStart to super.getCharacterOffset().
Sun's implementation works fine.
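
The arithmetic behind that workaround can be sketched without touching MXParser itself. The wrapper below is hypothetical (the class AbsoluteLocation and the helper fixedOffset are not MXParser or StAX API), and how to read bufAbsoluteStart out of MXParser is left open. The value 8186 is simply the difference between the third entries of the two lists posted above (9699 - 1513), used as an illustrative buffer start.

```java
import javax.xml.stream.Location;

// Hypothetical wrapper: turns a buffer-relative offset into an absolute one.
public class AbsoluteLocation implements Location {
    private final Location delegate;
    private final int bufferStart; // assumed to be the parser's absolute buffer start

    public AbsoluteLocation(Location delegate, int bufferStart) {
        this.delegate = delegate;
        this.bufferStart = bufferStart;
    }

    @Override
    public int getCharacterOffset() {
        int relative = delegate.getCharacterOffset();
        // Keep negative values as they are: -1 means the offset is unknown.
        return relative < 0 ? relative : bufferStart + relative;
    }

    @Override public int getLineNumber() { return delegate.getLineNumber(); }
    @Override public int getColumnNumber() { return delegate.getColumnNumber(); }
    @Override public String getPublicId() { return delegate.getPublicId(); }
    @Override public String getSystemId() { return delegate.getSystemId(); }

    /** Stand-in Location for the demo; reports only a fixed offset. */
    static Location fixedOffset(final int offset) {
        return new Location() {
            @Override public int getCharacterOffset() { return offset; }
            @Override public int getLineNumber() { return -1; }
            @Override public int getColumnNumber() { return -1; }
            @Override public String getPublicId() { return null; }
            @Override public String getSystemId() { return null; }
        };
    }

    public static void main(String[] args) {
        Location fixed = new AbsoluteLocation(fixedOffset(1513), 8186);
        System.out.println(fixed.getCharacterOffset()); // prints 9699
    }
}
```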
Shane Burgel
Ranch Hand

Joined: Sep 09, 2003
Posts: 47
What is Sun's implementation of this?

Thanks for your help btw
Karthik Shiraly
Ranch Hand

Joined: Apr 04, 2009
Posts: 500
You're welcome!
Sun JRE comes with a default implementation for StAX. I get the correct file positions in ascending order with getCharacterOffset() for a 14 MB XML file.
Your app launcher must be overriding Sun's implementation with the Codehaus one by setting -Djavax.xml.stream.XMLInputFactory.
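
One way to check which implementation is actually in use is to print the class names the factory hands back. This is only a sketch; the class name WhichStax and the helper implementationNames are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; not part of any library.
public class WhichStax {

    /** Returns the class names of the factory, the reader, and the reader's Location. */
    static String[] implementationNames() throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader("<a/>"));
        try {
            return new String[] {
                factory.getClass().getName(),
                reader.getClass().getName(),
                reader.getLocation().getClass().getName()
            };
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // null here means the system property was not set; an implementation
        // found on the classpath can still replace the default one.
        System.out.println("property: " + System.getProperty("javax.xml.stream.XMLInputFactory"));
        for (String name : implementationNames()) {
            System.out.println(name);
        }
    }
}
```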
Shane Burgel
Ranch Hand

Joined: Sep 09, 2003
Posts: 47
I'm using Groovy; I guess I should have said that up front. That is why I'm getting the Codehaus implementation.
Shane Burgel
Ranch Hand

Joined: Sep 09, 2003
Posts: 47
What implementation of XMLStreamReader do you get? Which Location implementation?

Shane Burgel
Ranch Hand

Joined: Sep 09, 2003
Posts: 47
I tried using Woodstox as the StAX implementation and it seems to be working correctly.
Karthik Shiraly
Ranch Hand

Joined: Apr 04, 2009
Posts: 500
I get "com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl". So Sun's implementation is Xerces-based.
 