
Parsing data out of an XML document

Sagar Suraj
Greenhorn

Joined: Apr 19, 2011
Posts: 4

I want to parse XML content like the sample below. It is formatted like HTML. How can I parse it?
I want values such as name, employee number, age, etc.
But they are not defined in dedicated tags.
Kindly help me extract the content from this HTML-formatted XML.


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ListForEmployee_1_0 SYSTEM "c:/file/hello.dtd">
<List suppressFolio="n" xmlProviderInfo="test Server" Strategy="normal">
<Wrapper>
<doc>
<docBody>
<displayGroup lineSeparator="n" leftIndent="10" fontFamily="Verdana" fontSize="11">
Service: <startStyle fontEmphasis="b"/>Employee File<endStyle/> <startStyle fontEmphasis="b"/>10 records<endStyle/>
<newLine n="1"/>
Company: <startStyle fontEmphasis="b"/>A2B company<endStyle/>
</displayGroup>
<displayGroup lineSeparator="y">
<table>
<cellWidth numSpaces="10"/>
<cellWidth numSpaces="2"/>
<cellWidth numSpaces="15"/>
<cellWidth numSpaces="20"/>
<cellWidth numSpaces="20"/>
<cellWidth numSpaces="13"/>
<tableBody>
<row>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Name<endStyle/></cell>
<cell topBorder="y" bottomBorder="y">Employee number</cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Sex<endStyle/></cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Age<endStyle/></cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Designation<endStyle/></cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Date<endStyle/></cell>
</row>

<row>
<cell topBorder="y" bottomBorder="y" justification="left">Mark</cell>
<cell topBorder="y" bottomBorder="y" justification="left">1001</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Male</cell>
<cell topBorder="y" bottomBorder="y" justification="left">25</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Analyst</cell>
<cell topBorder="y" bottomBorder="y" justification="left">2005-02-01</cell>
</row>


<row>
<cell topBorder="y" bottomBorder="y" justification="left">ricky</cell>
<cell topBorder="y" bottomBorder="y" justification="left">1005</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Male</cell>
<cell topBorder="y" bottomBorder="y" justification="left">28</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Analyst</cell>
<cell topBorder="y" bottomBorder="y" justification="left">2008-12-01</cell>
</row>


<row>
<cell topBorder="y" bottomBorder="y" justification="left">David</cell>
<cell topBorder="y" bottomBorder="y" justification="left">1007</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Male</cell>
<cell topBorder="y" bottomBorder="y" justification="left">35</cell>
<cell topBorder="y" bottomBorder="y" justification="left">SeniorAnalyst</cell>
<cell topBorder="y" bottomBorder="y" justification="left">2005-08-11</cell>
</row>


<row>
<cell topBorder="y" bottomBorder="y" justification="left">hilary</cell>
<cell topBorder="y" bottomBorder="y" justification="left">1008</cell>
<cell topBorder="y" bottomBorder="y" justification="left">female</cell>
<cell topBorder="y" bottomBorder="y" justification="left">28</cell>
<cell topBorder="y" bottomBorder="y" justification="left">maketing</cell>
<cell topBorder="y" bottomBorder="y" justification="left">2001-02-01</cell>
</row>

</tableBody>
</table>
</displayGroup>
</docBody>
</doc>
</Wrapper>
</List>
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12789
First things first!

Have you been able to parse this document into a DOM using the standard Java library parser?

If you can get a DOM, you will have to locate each of the table "row" Elements then extract the NodeList of "cell" elements inside each row.

These NodeList collections will maintain the order of the "cell" elements so you can extract the values in each column of the table.
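
As a rough illustration of the row/cell walk described above, here is a minimal sketch using the JDK's built-in parser. The class name RowCellSketch and the method readRows are made up for this example, and the sample input leaves out the DOCTYPE line so the parser does not go looking for the DTD file on disk. It reads each cell with getTextContent(), which assumes a parser with DOM Level 3 support.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original post.
class RowCellSketch {

    /** Returns one list of cell strings per "row" element, in document order. */
    static List<List<String>> readRows(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        List<List<String>> table = new ArrayList<>();
        NodeList rows = doc.getElementsByTagName("row");
        for (int i = 0; i < rows.getLength(); i++) {
            Element row = (Element) rows.item(i);
            // The NodeList keeps the cells in column order.
            NodeList cells = row.getElementsByTagName("cell");
            List<String> values = new ArrayList<>();
            for (int j = 0; j < cells.getLength(); j++) {
                values.add(cells.item(j).getTextContent().trim());
            }
            table.add(values);
        }
        return table;
    }

    public static void main(String[] args) throws Exception {
        // Cut-down sample based on the table in the question.
        String xml = "<table><tableBody>"
                + "<row><cell>Mark</cell><cell>1001</cell><cell>Male</cell></row>"
                + "<row><cell>ricky</cell><cell>1005</cell><cell>Male</cell></row>"
                + "</tableBody></table>";
        for (List<String> row : readRows(xml)) {
            System.out.println(row);
        }
    }
}
```

Because the cell order is preserved, index 0 of each inner list is the name column, index 1 the employee number, and so on.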

Bill


Sagar Suraj
Greenhorn

Joined: Apr 19, 2011
Posts: 4

I am able to parse the document using the DOM parser, and I can retrieve the values from the tags below.
<row>
<cell topBorder="y" bottomBorder="y" justification="left">ricky</cell>
<cell topBorder="y" bottomBorder="y" justification="left">1005</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Male</cell>
<cell topBorder="y" bottomBorder="y" justification="left">28</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Analyst</cell>
<cell topBorder="y" bottomBorder="y" justification="left">2008-12-01</cell>
</row>

Below is the code I have used.

// Needs javax.xml.parsers.*, org.w3c.dom.*, org.xml.sax.InputSource,
// java.io.ByteArrayInputStream and java.util.*
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
// Use the factory to create a builder
DocumentBuilder builder;

try {
    builder = factory.newDocumentBuilder();

    Document doc;

    //doc = builder.parse(response);
    // here xmlResponse is the xml to be parsed
    doc = builder.parse(new InputSource(
            new ByteArrayInputStream(xmlResponse.toString().getBytes("utf-8"))));

    // Every table row, header row included
    NodeList nodes = doc.getElementsByTagName("row");

    System.err.println("in nodes is " + nodes.getLength());

    List ls = new ArrayList();

    for (int i = 0; i < nodes.getLength(); i++) {
        Element element = (Element) nodes.item(i);
        //List ls1 =new ArrayList();
        LmlPrinterFriendlyResponseParsed lmlTextOnly = new LmlPrinterFriendlyResponseParsed();
        // The cells of this row, in column order
        NodeList nTitle = element.getElementsByTagName("cell");
        for (int j = 0; j < nTitle.getLength(); j++) {
            Element line = (Element) nTitle.item(j);

            //System.err.println("line is "+line);
            String title = getCharacterDataFromElement(line);
        }
    }
} catch (Exception e) {
    // The original snippet had no catch block; added so the try compiles
    e.printStackTrace();
}


But I couldn't retrieve the values from the tags below. I want the values Name, Sex, Age, Designation, Date, etc.
Since they contain a <startStyle .../> tag, I couldn't proceed.
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Name<endStyle/></cell>
<cell topBorder="y" bottomBorder="y">Employee number</cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Sex<endStyle/></cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Age<endStyle/></cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Designation<endStyle/></cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Date<endStyle/></cell>

NodeList nodes = doc.getElementsByTagName("row");

System.err.println("in nodes is " + nodes.getLength());

List ls = new ArrayList();

for (int i = 0; i < nodes.getLength(); i++) {
    Element element = (Element) nodes.item(i);
    //List ls1 =new ArrayList();
    LmlPrinterFriendlyResponseParsed lmlTextOnly = new LmlPrinterFriendlyResponseParsed();
    NodeList nTitle = element.getElementsByTagName("cell");
    for (int j = 0; j < nTitle.getLength(); j++) {
        Element line = (Element) nTitle.item(j);

        NodeList nStyle = line.getElementsByTagName("startStyle");
        for (int k = 0; k < nStyle.getLength(); k++) {
            Element elemStyle = (Element) nStyle.item(k);
            String title = getCharacterDataFromElement(line);
        }
    }
}


William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12789
But I couldn't retrieve the values from the tags below. I want the values Name, Sex, Age, Designation, Date, etc.
Since they contain a <startStyle .../> tag, I couldn't proceed.
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Name<endStyle/></cell>
<cell topBorder="y" bottomBorder="y">Employee number</cell> .....


Well, of course you can't: that row is being used as a header, not data. You need to skip that row and find the rows with real data.
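
One way to act on this, sketched under the assumption that the first "row" element is always the header: read it once for the column labels, then start the data loop at index 1. The class HeaderSkipSketch and its methods are invented for this illustration, and the sample input is a cut-down version of the table in the question without the DOCTYPE line.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original thread.
class HeaderSkipSketch {

    /** Treats the first row as the header and maps each later row to label -> value. */
    static List<Map<String, String>> readRecords(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList rows = doc.getElementsByTagName("row");
        List<Map<String, String>> records = new ArrayList<>();
        if (rows.getLength() == 0) {
            return records;
        }
        // Assumption: row 0 holds the column labels.
        List<String> headers = cellTexts((Element) rows.item(0));
        // Start at 1 to skip the header row.
        for (int i = 1; i < rows.getLength(); i++) {
            List<String> values = cellTexts((Element) rows.item(i));
            Map<String, String> record = new LinkedHashMap<>();
            for (int j = 0; j < values.size() && j < headers.size(); j++) {
                record.put(headers.get(j), values.get(j));
            }
            records.add(record);
        }
        return records;
    }

    /** Collects the trimmed text of every "cell" in a row, in column order. */
    static List<String> cellTexts(Element row) {
        NodeList cells = row.getElementsByTagName("cell");
        List<String> texts = new ArrayList<>();
        for (int j = 0; j < cells.getLength(); j++) {
            texts.add(cells.item(j).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<tableBody>"
                + "<row><cell><startStyle fontEmphasis=\"b\"/>Name<endStyle/></cell>"
                + "<cell>Employee number</cell></row>"
                + "<row><cell>Mark</cell><cell>1001</cell></row>"
                + "</tableBody>";
        System.out.println(readRecords(xml));
    }
}
```

If the header is not guaranteed to come first, another option would be to skip any row whose cells contain a startStyle child, but that is a guess about the format rather than something the DTD confirms here.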

Incidentally, your post would be more readable if you used the "Code" annotation.

Bill
g tsuji
Ranch Hand

Joined: Jan 18, 2011
Posts: 517
If you are not very fluent in traversing nodes, you can simply call getTextContent() on each cell element.

That depends on DOM Level 3 support. Most DOM parsers that are not too archaic, even those with only partial Level 3 support, should have getTextContent() in place.
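
To make that concrete, here is a small sketch on one of the header cells from the question. It shows getTextContent() next to a hand-written fallback that concatenates text nodes, for the case where a parser lacks that method. The class CellTextSketch and the methods textOf and parseRoot are names invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original thread.
class CellTextSketch {

    /** Manual fallback: concatenates all text below a node, depth first. */
    static String textOf(Node node) {
        StringBuilder sb = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                sb.append(child.getNodeValue());
            } else if (type == Node.ELEMENT_NODE) {
                // Empty elements such as startStyle contribute nothing.
                sb.append(textOf(child));
            }
        }
        return sb.toString();
    }

    /** Parses a small XML string and returns its root element. */
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element cell = parseRoot(
                "<cell topBorder=\"y\" bottomBorder=\"y\" justification=\"left\">"
                + "<startStyle fontEmphasis=\"b\"/>Name<endStyle/></cell>");
        System.out.println(cell.getTextContent()); // DOM Level 3 route
        System.out.println(textOf(cell));          // manual traversal
    }
}
```

Both calls should give the same string for this cell, since startStyle and endStyle are empty elements and the label is a plain text node sitting between them.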

PS: Your DOCTYPE line is actually incorrect. I wonder how that came about!
 