How to read XML without SAX or DOM parser

Sumit Patil
Ranch Hand

Joined: May 25, 2009
Posts: 296

Hi,

I was just curious to know: how can I read an XML file without using a SAX or DOM parser, using only core Java APIs?

Also, how do the SAX and DOM parsers do this? What APIs are they using?
How are they getting data from the different nodes?

Please share your knowledge about this.

Thanks & Regards, Sumeet
Marco Ehrentreich
best scout
Bartender

Joined: Mar 07, 2007
Posts: 1282

Hi Sumit,

basically you need a parser for any data processing of this kind. Of course you can write a parser by hand (someone has written DOM and SAX implementations, too ;-)).

Depending on the complexity of your data, a hand-made parser can become a big mess quickly. For this reason there are parser generator tools like Yacc, Bison, ANTLR etc. which help you by automatically generating a parser implementation for a specific grammar. This requires some understanding of the theory behind formal languages.

Unfortunately I don't know how typical DOM and SAX parsers work internally, but I suspect that they use some of the said generator tools. Anyway, once you have a working parser you're free to do almost anything you can imagine with the parsed data. For example, you can create Java objects representing the parsed data structure, or you can fire events whenever specific language elements (like XML nodes) are parsed, and so on. These principles are the same for any kind of computer language. Every compiler needs a parser, too. The Java compiler, for example, generates bytecode from the parsed Java source files.
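
To make the "fire events" idea concrete, here is a minimal sketch that uses the SAX API shipped with the JRE. The names SaxEventsDemo and events are made up for this illustration. It only shows what the events look like from the outside and says nothing about how the parser is implemented internally.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsDemo {
    /** Parses the XML text and records one log entry per start/end element event. */
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.add("start " + qName);   // called by the parser for every opening tag
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);     // called by the parser for every closing tag
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(events("<a><b/></a>"));   // prints [start a, start b, end b, end a]
    }
}
```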

Marco


Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42061
    
What do you mean by "core Java"? The JRE -which is the "core" of Java- contains the JAXP API, which has SAX and DOM parsers.
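
As a small illustration of that point, the following sketch parses a string with the DOM parser that comes with the JRE through JAXP, without any extra library. The class name JaxpDomDemo and the helper rootName are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class JaxpDomDemo {
    /** Parses the XML text with the JRE's built-in DOM parser and returns the root element name. */
    static String rootName(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTagName();
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootName("<ranch><moose/></ranch>"));   // prints ranch
    }
}
```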

It's certainly possible to write your own parser. That would involve using a lot of the classes in the java.io package, plus a good deal of string processing (possibly using regexps).
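
A rough sketch of the string-processing route, using only java.util.regex: it collects the names of start tags from a string. NaiveTagFinder and startTagNames are made-up names, and the pattern is a deliberate simplification. It ignores comments, CDATA sections and a ">" inside attribute values, so treat it as a toy rather than as a parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveTagFinder {
    // "<", a name, then anything up to the next ">"; closing tags ("</...") do not match
    private static final Pattern START_TAG = Pattern.compile("<([A-Za-z_][\\w.-]*)[^<>]*>");

    /** Returns the names of all start tags (and empty-element tags) in document order. */
    static List<String> startTagNames(String xml) {
        List<String> names = new ArrayList<>();
        Matcher m = START_TAG.matcher(xml);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(startTagNames("<a><b x=\"1\"/>text</a>"));   // prints [a, b]
    }
}
```

Note that this finds tags but knows nothing about nesting, so it cannot tell whether the document is well-formed.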

If you want to study how it might work, check out the Crimson parser: http://xml.apache.org/crimson/


Marco Ehrentreich
best scout
Bartender

Joined: Mar 07, 2007
Posts: 1282

Of course Ulf is right, the popular XML-related APIs are already part of core Java.

But I think you wanted to know what's the "magic" behind parsers?!?
Sumit Patil
Ranch Hand

Joined: May 25, 2009
Posts: 296

Marco Ehrentreich wrote:But I think you wanted to know what's the "magic" behind parsers?!?


Yes Marco, you are right. I want to understand the process behind various parsers.

Well thanks Ulf and Marco for the inputs.
Marco Ehrentreich
best scout
Bartender

Joined: Mar 07, 2007
Posts: 1282

Does that mean you have concrete questions about how you can create a parser yourself?
Sumit Patil
Ranch Hand

Joined: May 25, 2009
Posts: 296

Yes.
Marco Ehrentreich
best scout
Bartender

Joined: Mar 07, 2007
Posts: 1282

OK, I can surely give you some ideas, but as formal languages and the whole theory behind them are a big topic in computer science, it will be hard to explain everything here in the forum.

First you should think about the language you want to parse and define a grammar that specifies its syntax. This requires some knowledge and planning, but I think the naive approach of not defining the language exactly doesn't scale well. For any non-trivial language it will end up in a mess, in particular if you want to extend or change your language. Another issue is the "Chomsky type" of the language, which has an important impact on how hard it is to write a program that parses it. In this respect XML is not the easiest starting point, because arbitrarily nested, balanced elements put it in the class of context-free languages rather than regular languages. Regular languages, which can be defined with regular expressions, are the easier ones.

To avoid problems in general you should separate lexical analysis from syntactic analysis (you can easily find all these buzzwords with Google). The lexical analysis is done by a so-called "lexer" or "scanner"; the syntactic analysis is the work of the actual parser.

A scanner is a kind of pre-processor for the parser and can be implemented as a finite state machine. It takes an input stream and just splits it up into "tokens". In XML, for example, tokens are special characters like <, > or ", identifiers for element and attribute names, etc. In particular, the scanner doesn't care whether the input makes sense as long as it consists of allowed tokens, i.e. invalid and not well-formed XML is nevertheless valid input for the scanner.
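
A minimal sketch of such a scanner for a small XML subset (elements, attributes and text; no comments, CDATA, entities or DOCTYPE). The class name XmlScanner and the token spelling ("NAME:...", "STRING:...", "TEXT:...") are assumptions made for this example. The only state is a boolean that says whether we are inside a tag.

```java
import java.util.ArrayList;
import java.util.List;

class XmlScanner {
    /** Splits the input into tokens; it does not check that tags are balanced. */
    static List<String> scan(String in) {
        List<String> tokens = new ArrayList<>();
        boolean inTag = false;   // scanner state: inside <...> or in character data
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (!inTag) {
                if (c == '<') {
                    boolean closing = in.startsWith("</", i);
                    tokens.add(closing ? "</" : "<");
                    i += closing ? 2 : 1;
                    inTag = true;
                } else {                       // character data up to the next tag
                    int end = in.indexOf('<', i);
                    if (end < 0) end = in.length();
                    String text = in.substring(i, end).trim();
                    if (!text.isEmpty()) tokens.add("TEXT:" + text);
                    i = end;
                }
            } else if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '>') {
                tokens.add(">"); inTag = false; i++;
            } else if (in.startsWith("/>", i)) {
                tokens.add("/>"); inTag = false; i += 2;
            } else if (c == '=') {
                tokens.add("="); i++;
            } else if (c == '"') {             // quoted attribute value
                int end = in.indexOf('"', i + 1);
                if (end < 0) throw new IllegalArgumentException("unterminated string at " + i);
                tokens.add("STRING:" + in.substring(i + 1, end));
                i = end + 1;
            } else if (Character.isLetter(c) || c == '_') {   // element or attribute name
                int start = i;
                while (i < in.length() && (Character.isLetterOrDigit(in.charAt(i))
                        || "_-.:".indexOf(in.charAt(i)) >= 0)) i++;
                tokens.add("NAME:" + in.substring(start, i));
            } else {
                throw new IllegalArgumentException("illegal character '" + c + "' at " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Not well-formed, but still a perfectly valid token stream:
        System.out.println(scan("<a></b>"));   // prints [<, NAME:a, >, </, NAME:b, >]
    }
}
```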

The parser takes the stream of (allowed) tokens from the scanner as its input. So at this step of processing you no longer have to deal with low-level errors (like illegal characters, typos and so on), which is why I said you should split up these two steps. I think the easiest kind of parser to write by hand is a recursive descent parser. The job of the parser is to make sense of the tokens. For the XML example this means that the parser checks whether the elements are well-formed, i.e. balanced, and reports errors as necessary. During the parsing process you will usually want to create some kind of tree data structure in memory, which is called an "abstract syntax tree" or "AST" for short.
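
A sketch of a recursive descent parser for the same small XML subset. It assumes the tokens arrive as a list of strings spelled "<", "</", ">", "/>", "=", "NAME:...", "STRING:..." and "TEXT:..."; that spelling, like the names XmlParser and Node, is a convention invented for this example. Each grammar rule becomes one method, and nesting is handled by the method calling itself.

```java
import java.util.ArrayList;
import java.util.List;

class XmlParser {
    /** Tree node: an element name plus children; text becomes a leaf whose name starts with '#'. */
    static class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
        @Override public String toString() {
            return children.isEmpty() ? name : name + children;
        }
    }

    private final List<String> tokens;
    private int pos = 0;

    private XmlParser(List<String> tokens) { this.tokens = tokens; }

    static Node parse(List<String> tokens) {
        XmlParser p = new XmlParser(tokens);
        Node root = p.element();
        if (p.pos != tokens.size()) throw new IllegalStateException("content after root element");
        return root;
    }

    // element := "<" NAME attribute* ( "/>" | ">" content "</" NAME ">" )
    private Node element() {
        expect("<");
        Node node = new Node(name());
        while (peek().startsWith("NAME:")) {   // attribute := NAME "=" STRING (values ignored here)
            name();
            expect("=");
            if (!next().startsWith("STRING:")) throw new IllegalStateException("attribute value expected");
        }
        if (peek().equals("/>")) { next(); return node; }
        expect(">");
        while (!peek().equals("</")) {         // content := ( TEXT | element )*
            if (peek().startsWith("TEXT:")) node.children.add(new Node("#" + next().substring(5)));
            else node.children.add(element()); // recursion handles nesting
        }
        expect("</");
        String closing = name();
        if (!closing.equals(node.name))
            throw new IllegalStateException("expected </" + node.name + "> but found </" + closing + ">");
        expect(">");
        return node;
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : "EOF"; }
    private String next() { String t = peek(); pos++; return t; }
    private void expect(String t) {
        String got = next();
        if (!got.equals(t)) throw new IllegalStateException("expected " + t + " but found " + got);
    }
    private String name() {
        String t = next();
        if (!t.startsWith("NAME:")) throw new IllegalStateException("name expected but found " + t);
        return t.substring(5);
    }

    public static void main(String[] args) {
        // Tokens for <a><b/>hi</a>
        List<String> tokens = List.of("<", "NAME:a", ">", "<", "NAME:b", "/>", "TEXT:hi", "</", "NAME:a", ">");
        System.out.println(parse(tokens));   // prints a[b, #hi]
    }
}
```

An unbalanced input such as the tokens for <a></b> is rejected with an IllegalStateException, which is the well-formedness check described above.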

By traversing this AST (like any tree data structure) you can now decide what you actually want "to do" with your parsed language content. This could mean anything from creating a graphic visualization to transforming Java source files into bytecode or directly interpreting the AST of a scripting language source file.
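
As a small sketch of such a traversal: the tree below is built by hand, and two different "actions" walk it depth-first. TreeWalk, Node, outline and count are hypothetical names for this example, and the code assumes Java 11 or later because of String.repeat.

```java
import java.util.ArrayList;
import java.util.List;

class TreeWalk {
    static class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name, Node... kids) {
            this.name = name;
            children.addAll(List.of(kids));
        }
    }

    /** Depth-first traversal that renders the tree as an indented outline. */
    static String outline(Node node, int depth) {
        StringBuilder sb = new StringBuilder("  ".repeat(depth)).append(node.name).append('\n');
        for (Node child : node.children) {
            sb.append(outline(child, depth + 1));
        }
        return sb.toString();
    }

    /** A second action on the same tree: count all nodes. */
    static int count(Node node) {
        int n = 1;
        for (Node child : node.children) {
            n += count(child);
        }
        return n;
    }

    public static void main(String[] args) {
        Node tree = new Node("a", new Node("b"), new Node("c", new Node("d")));
        System.out.print(outline(tree, 0));
        System.out.println(count(tree));   // prints 4
    }
}
```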

I know this is not the answer you would have liked to hear, but without knowing how much background knowledge you already have, I couldn't come up with a better one. For any non-trivial example the problem is complex and so will be the solution. Sorry :-) Some may argue that the "naive" approach with simple pattern matching in strings etc. will work, too, but I suspect it stops working as soon as the content/language to parse gets even slightly more complex and requires systematic parsing. To make a long story short, in my opinion, if you don't have the required knowledge already, you should read up on all the things in this post which you don't know. Otherwise I seriously doubt that you will be able to create an XML parser by hand which works correctly. Of course feel free to ask any question, but the topic is really, really too big to explain every detail here. The lengthy text above only scratches the surface.

Marco
Sumit Patil
Ranch Hand

Joined: May 25, 2009
Posts: 296

Hey Marco,

Thanks for the reply and the concepts.

I am totally new to all the concepts you mentioned, so I guess I have to study all of this first.

Anyways thanks for the reply.

Marco Ehrentreich
best scout
Bartender

Joined: Mar 07, 2007
Posts: 1282

This topic is really strong meat, in particular because there's a lot of theory involved if you want to fully understand it. But nevertheless it's an interesting topic and you will know how compilers work inside after learning the ideas behind it. It's definitely not necessary to know every detail (as this would probably require several years of research), but you should have at least a basic understanding of the most essential concepts.

But to be realistic, to get something done quickly it would be wise to use the said generator tools to create a lexer and parser. The downside is that all these tools usually require special formats for their input files, which define the tokens and the grammar. My last project of this kind was a PL/0 interpreter (PL/0 is a minimalistic programming language often used to study compiler design). There I used JFlex as a generator tool for the lexer and Beaver as a parser generator. Unfortunately I've not yet used it, but I guess ANTLR is a more comfortable alternative.

Of course the easiest approach would be to use existing parsers, especially for a common language like XML.

Marco
santosh kimothi
Ranch Hand

Joined: Jun 10, 2009
Posts: 32
First of all, without parsing you cannot read any XML document.

If you want to use a Java API to parse an XML document, I think Sun already provides APIs for reading XML documents:
JAXP and JAXB.

So with these APIs you can parse an XML document, and both come directly from Sun Microsystems.


Santosh Kimothi,
Java programmer
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42061
    
While lexers and parsers are a good technology to know, I'd say they're overkill for XML. One of the design goals of XML was that it should be so simple that everybody would be able to write a parser for it in a relatively short amount of time. It does have a syntax with few rules, after all. (It does get more involved once DOCTYPEs are considered, but that's probably beyond what one would write for fun anyway.)
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42061
    
santosh kimothi wrote:First of all, without parsing you cannot read any XML document.

How do you think XML parsers read XML - by calling a parser?
Sumit Patil
Ranch Hand

Joined: May 25, 2009
Posts: 296

santosh kimothi wrote:
If you want to use a Java API to parse an XML document, I think Sun already provides APIs for reading XML documents:
JAXP and JAXB.

So with these APIs you can parse an XML document, and both come directly from Sun Microsystems.


OK... can you tell me how exactly they work? And how can I make my own custom parser for reading XML files?

Thanks
Marco Ehrentreich
best scout
Bartender

Joined: Mar 07, 2007
Posts: 1282

Hi Sumit,

a good start would be to have a look at the source code of an implementation for your favourite XML API ;-) Unfortunately I don't know any details about popular implementations myself.

Besides, if you are only interested in processing XML then my ideas are definitely overkill, as Ulf pointed out. You may consider reading up on this topic if you're interested, but it surely isn't required to know the theoretical concepts as long as you're concerned only with XML.

Marco
Sumit Patil
Ranch Hand

Joined: May 25, 2009
Posts: 296

Hi All,

I googled for an implementation of a SAX parser.

Here is the link:

http://www.docjar.com/html/api/javax/xml/parsers/SAXParser.java.html
Marco Ehrentreich
best scout
Bartender

Joined: Mar 07, 2007
Posts: 1282

Obviously this is only an abstract base class for an implementation, so you don't really see how it works.
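
If you want to see which concrete class actually stands behind that abstract base class in your JRE, a small sketch like the following prints its name. WhichSaxParser and implementationClass are made-up names, and the output depends on the JDK and on any parser libraries on the classpath, so none is shown here.

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

class WhichSaxParser {
    /** Returns the class name of the concrete SAXParser the factory hands out. */
    static String implementationClass() {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            return parser.getClass().getName();
        } catch (Exception e) {
            throw new IllegalStateException("no SAX parser available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(implementationClass());
    }
}
```

The source code of the printed class is the place to look for the real parsing logic.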

Fully functional implementations of the popular XML APIs are for example Xerces, Xalan or Crimson as Ulf suggested. You may simply download the source code for one of them!

Marco
Jimmy Clark
Ranch Hand

Joined: Apr 16, 2008
Posts: 2187
Sumit Patil wrote:Ok....can you tell me how exactly they work? and how can i make my own custom parser for reading XML files?


The book below has the whole shebang....enjoy!

Building Parsers With Java by Steven John Metsker
Paperback: 371 pages
Publisher: Addison-Wesley Professional
Language: English
ISBN-10: 0201719622
ISBN-13: 978-0201719628


* Here is a hint, the SAX and DOM APIs have nothing to do with creating a parser.
 