Basically you need a parser for any data processing of this kind. Of course you can write a parser by hand (someone wrote the DOM and SAX implementations by hand, too ;-)).
Depending on the complexity of your data, a hand-written parser can become a big mess quickly. For this reason there are parser generator tools like Yacc, Bison, ANTLR etc. which help you by automatically generating a parser implementation for a specific grammar. This requires some understanding of the theory behind formal languages.
Unfortunately I don't know how typical DOM and SAX parsers work internally, but I suspect they use some of the said generator tools. Anyway, once you have a working parser you're free to do almost anything you can imagine with the parsed data. For example, you can create Java objects representing the parsed data structure, fire events whenever specific language elements (like XML nodes) are parsed, and so on. These principles are the same for any kind of computer language. Every compiler needs a parser, too. The Java compiler, for example, generates bytecode from the parsed Java source files.
OK, I can surely give you some ideas, but as formal languages and the whole theory behind them are a big topic of computer science, it will be hard to explain everything here in the forum.
First you should think about the language you want to parse and define a syntax and grammar specifying this language. This requires some knowledge and planning, but I think the naive approach without exactly defining the language doesn't scale well. For any non-trivial language it will end up in a mess, in particular if you want to extend or change your language later. Another consideration is the "Chomsky type" of a language, which has an important impact on how difficult it is for a program to parse such language data. Regarding this difficulty level, XML languages are not the easiest starting point because they belong to the context-free languages. Regular languages, which can be defined with regular expressions, are the easier ones.
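To make the Chomsky-type point concrete, here is a small sketch (method names and patterns are my own illustration, not from any particular library): a regular expression can recognize a flat token such as a single tag, but checking that tags are balanced at arbitrary nesting depth is a context-free property and needs extra memory, e.g. a counter or stack.

```java
import java.util.regex.Pattern;

// Illustration: regular vs. context-free recognition for XML-ish input.
public class LanguageLevels {
    // Regular: a single opening tag with a simple name (hypothetical pattern)
    static final Pattern TAG = Pattern.compile("<[a-zA-Z][a-zA-Z0-9]*>");

    public static boolean isTag(String s) {
        return TAG.matcher(s).matches();
    }

    // Context-free: balanced nesting like "<a><b></b></a>", checked with a
    // depth counter -- something a plain regular expression cannot track.
    public static boolean isBalanced(String s) {
        int depth = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.startsWith("</", i)) { depth--; i++; }
            else if (s.charAt(i) == '<') depth++;
            if (depth < 0) return false;   // closing tag without an opener
        }
        return depth == 0;
    }
}
```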
To avoid problems you should generally separate the process of lexical analysis from syntactic analysis (you can easily find all these buzzwords with Google). The lexical analysis is done by a so-called "lexer" or "scanner"; the syntactic analysis is the work of the actual parser.
A scanner is a kind of pre-processor for the parser. A scanner can be implemented as a finite state machine. It takes an input stream and just splits it up into "tokens". For XML, tokens are things like the special characters <, > or ", identifiers for element and attribute names, etc. In particular, the scanner doesn't care whether the input makes sense as long as it consists of allowed tokens, i.e. invalid and not well-formed XML is nevertheless valid input for the scanner.
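A minimal sketch of such a scanner (class and method names are my own, and the token set is deliberately tiny): it walks the input character by character, switching between a "special character" case and an "identifier/text" state, and happily tokenizes input that is not well-formed XML.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical scanner sketch: splits an XML-ish input string into tokens.
public class SimpleScanner {
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if ("<>/=\"".indexOf(c) >= 0) {
                tokens.add(String.valueOf(c));   // special character token
                i++;
            } else if (Character.isWhitespace(c)) {
                i++;                             // skip whitespace
            } else {
                // identifier/text state: consume until the next special char
                int start = i;
                while (i < input.length() && "<>/=\"".indexOf(input.charAt(i)) < 0
                        && !Character.isWhitespace(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            }
        }
        return tokens;
    }
}
```

Note that `tokenize("</a>hi<")` would succeed just as happily as well-formed input — deciding whether the token stream makes sense is the parser's job.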
The parser takes the stream of (allowed) tokens from the scanner as its input. So in this step of processing you no longer have to care about lexical errors (like illegal characters, typos and so on), which is why I said you should split up these two steps. I think the easiest way to create a parser manually is a recursive descent parser. The job of the parser is to make sense of the tokens. For the XML example this means that the parser recognizes whether the elements are well-formed, i.e. balanced, and reports errors as necessary. During the parsing process you will usually want to create some kind of tree data structure in memory, which is called an "abstract syntax tree", or "AST" for short.
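To illustrate, here is a hypothetical recursive descent parser for a tiny XML subset (elements without attributes or text, e.g. `<a><b></b></a>`); all names are my own invention. Each grammar rule becomes a method, nesting falls out of the recursion, and mismatched closing tags are reported as errors — exactly the well-formedness check mentioned above. The result is a small AST of `Node` objects.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: recursive descent parser for "<name>...</name>" nesting only.
public class TinyXmlParser {
    // AST node: element name plus child elements
    public static class Node {
        public final String name;
        public final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    private final String input;
    private int pos = 0;

    public TinyXmlParser(String input) { this.input = input; }

    public Node parse() {
        Node root = parseElement();
        if (pos != input.length()) throw new RuntimeException("trailing input at " + pos);
        return root;
    }

    // element ::= '<' NAME '>' element* '</' NAME '>'
    private Node parseElement() {
        expect('<');
        String name = parseName();
        expect('>');
        Node node = new Node(name);
        while (input.startsWith("<", pos) && !input.startsWith("</", pos)) {
            node.children.add(parseElement());   // recursion handles nesting
        }
        expect('<'); expect('/');
        String closing = parseName();
        if (!closing.equals(name))               // well-formedness check
            throw new RuntimeException("expected </" + name + "> but found </" + closing + ">");
        expect('>');
        return node;
    }

    private String parseName() {
        int start = pos;
        while (pos < input.length() && Character.isLetterOrDigit(input.charAt(pos))) pos++;
        if (pos == start) throw new RuntimeException("name expected at " + pos);
        return input.substring(start, pos);
    }

    private void expect(char c) {
        if (pos >= input.length() || input.charAt(pos) != c)
            throw new RuntimeException("'" + c + "' expected at " + pos);
        pos++;
    }
}
```

(For brevity this sketch works directly on the string rather than on a separate token stream, but the structure of the element rule is the same either way.)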
By traversing this AST (like any tree data structure) you can now decide what you actually want "to do" with your parsed language content. This could mean anything from creating a graphical visualization to transforming Java source files into bytecode or directly interpreting the AST of a scripting language source file.
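As a minimal sketch of such a traversal (the `Node` class here just stands in for whatever node type your parser produces): a depth-first walk that collects element names in document order — the same recursion pattern works for visualization, code generation or interpretation.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: walking an AST once parsing is done.
public class AstWalker {
    public static class Node {
        public final String name;
        public final List<Node> children = new ArrayList<>();
        public Node(String name, Node... kids) {
            this.name = name;
            for (Node k : kids) children.add(k);
        }
    }

    // Depth-first traversal: visit a node, then recurse into its children.
    public static void collectNames(Node node, List<String> out) {
        out.add(node.name);
        for (Node child : node.children) collectNames(child, out);
    }
}
```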
I know this is not the answer you hoped to hear, but without knowing how much background knowledge you already have, I couldn't come up with a better one. For any non-trivial example the problem is complex, and so will be the solution. Sorry :-) Some may argue that the "naive" approach with simple pattern matching on strings etc. will work, too, but I suspect this stops working as soon as the content/language to parse gets even slightly more complex and requires systematic parsing. To make a long story short: in my opinion, if you don't have the required knowledge already, you could read up on all the things in this post which you don't know. Otherwise I seriously doubt that you will be able to create an XML parser by hand that works correctly. Of course, feel free to ask any question, but the topic is really, really too big to explain every detail here. The lengthy text above only scratches the surface.
This topic is really heavy stuff, in particular because there's a lot of theory involved if you want to fully understand it. But it's nevertheless an interesting topic, and after learning the ideas behind it you will know how compilers work inside. It's definitely not necessary to know every detail (that would probably require several years of study), but you should have at least a basic understanding of the most essential concepts.
But to be realistic: to get something done quickly, it would be wise to use the said generator tools to create a lexer and parser. The downside is that all these tools usually require special formats for their input files, in which you define the syntax tokens and a grammar. My last project of this kind was a PL/0 interpreter (PL/0 is a minimalistic programming language often used to study compiler design). There I used JFlex as the generator tool for the lexer and Beaver as the parser generator. Unfortunately I haven't used it myself yet, but I guess ANTLR is a more comfortable alternative.
Of course the easiest approach would be to use an existing parser - especially for a common language like XML.
First of all, without parsing you cannot read any XML document.
And if you want to use a Java API to parse XML documents: I think Sun has already provided APIs to read XML documents,
with JAXP and JAXB.
So with these APIs you can parse XML documents, and both come directly from Sun Microsystems.
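For example, parsing an XML string into a DOM tree with the standard JAXP API takes only a few lines (the `rootName` helper and the sample document are my own illustration):

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

// Sketch: using the JAXP DOM API instead of a hand-written parser.
public class JaxpExample {
    public static String rootName(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement().getTagName();
    }
}
```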
While lexers and parsers are a good technology to know, I'd say they're overkill for XML. One of the design goals of XML was that it should be so simple that everybody would be able to write a parser for it in a relatively short amount of time. It does have a syntax with few rules, after all. (It does get more involved once DOCTYPEs are considered, but that's probably beyond what one would write for fun anyway.)
santosh kimothi wrote: First of all, without parsing you cannot read any XML document.
How do you think XML parsers read XML - by calling a parser?
A good start would be to have a look at the source code of an implementation of your favourite XML API ;-) Unfortunately I don't know any details about the popular implementations myself.
Besides, if you are only interested in processing XML, then my ideas are definitely overkill, as Ulf pointed out. You may consider reading up on this topic if you're interested, but knowing the theoretical concepts surely isn't required as long as you're only concerned with XML.