Parsing data out of an XML document

 
Sagar Suraj
Greenhorn
Posts: 4
I want to parse XML content something like the example below. It is formatted like HTML. How can I parse the content?
I want values like name, employee number, age, etc.
But they are not defined in particular tags.
Kindly help me extract the content from this HTML-formatted XML.


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ListForEmployee_1_0 SYSTEM "c:/file/hello.dtd">
<List suppressFolio="n" xmlProviderInfo="test Server" Strategy="normal">
<Wrapper>
<doc>
<docBody>
<displayGroup lineSeparator="n" leftIndent="10" fontFamily="Verdana" fontSize="11">
Service: <startStyle fontEmphasis="b"/>Employee File<endStyle/> <startStyle fontEmphasis="b"/>10 records<endStyle/>
<newLine n="1"/>
Company: <startStyle fontEmphasis="b"/>A2B company<endStyle/>
</displayGroup>
<displayGroup lineSeparator="y">
<table>
<cellWidth numSpaces="10"/>
<cellWidth numSpaces="2"/>
<cellWidth numSpaces="15"/>
<cellWidth numSpaces="20"/>
<cellWidth numSpaces="20"/>
<cellWidth numSpaces="13"/>
<tableBody>
<row>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Name<endStyle/></cell>
<cell topBorder="y" bottomBorder="y">Employee number</cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Sex<endStyle/></cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Age<endStyle/></cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Designation<endStyle/></cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Date<endStyle/></cell>
</row>

<row>
<cell topBorder="y" bottomBorder="y" justification="left">Mark</cell>
<cell topBorder="y" bottomBorder="y" justification="left">1001</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Male</cell>
<cell topBorder="y" bottomBorder="y" justification="left">25</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Analyst</cell>
<cell topBorder="y" bottomBorder="y" justification="left">2005-02-01</cell>
</row>


<row>
<cell topBorder="y" bottomBorder="y" justification="left">ricky</cell>
<cell topBorder="y" bottomBorder="y" justification="left">1005</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Male</cell>
<cell topBorder="y" bottomBorder="y" justification="left">28</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Analyst</cell>
<cell topBorder="y" bottomBorder="y" justification="left">2008-12-01</cell>
</row>


<row>
<cell topBorder="y" bottomBorder="y" justification="left">David</cell>
<cell topBorder="y" bottomBorder="y" justification="left">1007</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Male</cell>
<cell topBorder="y" bottomBorder="y" justification="left">35</cell>
<cell topBorder="y" bottomBorder="y" justification="left">SeniorAnalyst</cell>
<cell topBorder="y" bottomBorder="y" justification="left">2005-08-11</cell>
</row>


<row>
<cell topBorder="y" bottomBorder="y" justification="left">hilary</cell>
<cell topBorder="y" bottomBorder="y" justification="left">1008</cell>
<cell topBorder="y" bottomBorder="y" justification="left">female</cell>
<cell topBorder="y" bottomBorder="y" justification="left">28</cell>
<cell topBorder="y" bottomBorder="y" justification="left">maketing</cell>
<cell topBorder="y" bottomBorder="y" justification="left">2001-02-01</cell>
</row>

</tableBody>
</table>
</displayGroup>
</docBody>
</doc>
</Wrapper>
</List>
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13064
First things first!

Have you been able to parse this document into a DOM using the standard Java library parser?

If you can get a DOM, you will have to locate each of the table "row" Elements then extract the NodeList of "cell" elements inside each row.

These NodeList collections will maintain the order of the "cell" elements so you can extract the values in each column of the table.
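
A minimal sketch of the traversal described above, assuming the XML is already available as a String and that each data cell holds only plain text. The class name `RowCellExtractor` and its helper methods are hypothetical, and the sample input in `main` is a cut-down version of the posted table without the DOCTYPE line, so the parser does not go looking for the DTD file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

final class RowCellExtractor {

    // Returns one list of cell values per <row>, in document order.
    static List<List<String>> extractRows(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }

        List<List<String>> table = new ArrayList<>();
        NodeList rows = doc.getElementsByTagName("row");
        for (int i = 0; i < rows.getLength(); i++) {
            Element row = (Element) rows.item(i);
            // The NodeList keeps the cells in document order, so index j is the column.
            NodeList cells = row.getElementsByTagName("cell");
            List<String> values = new ArrayList<>();
            for (int j = 0; j < cells.getLength(); j++) {
                values.add(firstText((Element) cells.item(j)));
            }
            table.add(values);
        }
        return table;
    }

    // Reads the cell's first child if it is a text node; otherwise returns "".
    static String firstText(Element cell) {
        Node child = cell.getFirstChild();
        return (child instanceof Text) ? child.getNodeValue().trim() : "";
    }

    public static void main(String[] args) {
        String xml = "<tableBody>"
                + "<row><cell>Mark</cell><cell>1001</cell><cell>Male</cell></row>"
                + "<row><cell>ricky</cell><cell>1005</cell><cell>Male</cell></row>"
                + "</tableBody>";
        for (List<String> row : extractRows(xml)) {
            System.out.println(row);
        }
    }
}
```

Because `firstText` only looks at the first child node, a cell that starts with a nested element (such as the styled header cells) yields an empty string with this approach.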

Bill


 
Sagar Suraj
Greenhorn
Posts: 4
I am able to parse the document using a DOM parser, and I can retrieve the values from the tags below.
<row>
<cell topBorder="y" bottomBorder="y" justification="left">ricky</cell>
<cell topBorder="y" bottomBorder="y" justification="left">1005</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Male</cell>
<cell topBorder="y" bottomBorder="y" justification="left">28</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Analyst</cell>
<cell topBorder="y" bottomBorder="y" justification="left">2008-12-01</cell>
</row>

Below is the piece of code I have used.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
// Use the factory to create a builder
DocumentBuilder builder;

try {
    builder = factory.newDocumentBuilder();

    // Here xmlResponse is the XML to be parsed
    Document doc = builder.parse(new InputSource(
            new ByteArrayInputStream(xmlResponse.toString().getBytes("utf-8"))));

    NodeList nodes = doc.getElementsByTagName("row");

    System.err.println("in nodes is " + nodes.getLength());

    List ls = new ArrayList();

    for (int i = 0; i < nodes.getLength(); i++) {
        Element element = (Element) nodes.item(i);
        LmlPrinterFriendlyResponseParsed lmlTextOnly = new LmlPrinterFriendlyResponseParsed();
        NodeList nTitle = element.getElementsByTagName("cell");
        for (int j = 0; j < nTitle.getLength(); j++) {
            Element line = (Element) nTitle.item(j);
            String title = getCharacterDataFromElement(line);
        }
    }
} catch (Exception e) {
    // newDocumentBuilder() and parse() throw checked exceptions
    e.printStackTrace();
}


But I couldn't retrieve the values from the tags below. I want the values Name, Sex, Age, Designation, Date, etc.
Since they contain the <startStyle ***> tag, I couldn't proceed.
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Name<endStyle/></cell>
<cell topBorder="y" bottomBorder="y">Employee number</cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Sex<endStyle/></cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Age<endStyle/></cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Designation<endStyle/></cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Date<endStyle/></cell>

NodeList nodes = doc.getElementsByTagName("row");

System.err.println("in nodes is " + nodes.getLength());

List ls = new ArrayList();

for (int i = 0; i < nodes.getLength(); i++) {
    Element element = (Element) nodes.item(i);
    LmlPrinterFriendlyResponseParsed lmlTextOnly = new LmlPrinterFriendlyResponseParsed();
    NodeList nTitle = element.getElementsByTagName("cell");
    for (int j = 0; j < nTitle.getLength(); j++) {
        Element line = (Element) nTitle.item(j);

        NodeList nStyle = line.getElementsByTagName("startStyle");
        for (int k = 0; k < nStyle.getLength(); k++) {
            Element elemStyle = (Element) nStyle.item(k);
            String title = getCharacterDataFromElement(line);
        }
    }
}


 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13064
But I couldn't retrieve the values from the tags below. I want the values Name, Sex, Age, Designation, Date, etc.
Since they contain the <startStyle ***> tag, I couldn't proceed.
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Name<endStyle/></cell>
<cell topBorder="y" bottomBorder="y">Employee number</cell> .....


Well, of course you can't: that row is being used as a header, not data. You need to skip that row and find the rows with real data.
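
One way to skip the header might look like the sketch below. It assumes the header is always the first "row" element and that the columns always appear in the order shown in the posted table; the class name `EmployeeRows` and the `COLUMNS` array are hypothetical names introduced for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

final class EmployeeRows {

    // Column order copied from the header row of the posted table (an assumption
    // that it never changes).
    static final String[] COLUMNS =
            {"Name", "Employee number", "Sex", "Age", "Designation", "Date"};

    // Returns one column-name-to-value map per data row, skipping the header row.
    static List<Map<String, String>> parse(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }

        List<Map<String, String>> employees = new ArrayList<>();
        NodeList rows = doc.getElementsByTagName("row");
        // Start at index 1: row 0 is assumed to be the header.
        for (int i = 1; i < rows.getLength(); i++) {
            NodeList cells = ((Element) rows.item(i)).getElementsByTagName("cell");
            Map<String, String> employee = new LinkedHashMap<>();
            for (int j = 0; j < cells.getLength() && j < COLUMNS.length; j++) {
                Node first = cells.item(j).getFirstChild();
                String value = (first instanceof Text) ? first.getNodeValue().trim() : "";
                employee.put(COLUMNS[j], value);
            }
            employees.add(employee);
        }
        return employees;
    }

    public static void main(String[] args) {
        String xml = "<tableBody>"
                + "<row><cell><startStyle fontEmphasis='b'/>Name<endStyle/></cell>"
                + "<cell>Employee number</cell></row>"
                + "<row><cell>Mark</cell><cell>1001</cell><cell>Male</cell>"
                + "<cell>25</cell><cell>Analyst</cell><cell>2005-02-01</cell></row>"
                + "</tableBody>";
        for (Map<String, String> employee : parse(xml)) {
            System.out.println(employee);
        }
    }
}
```

Keying each value by column name keeps the calling code from depending on bare indexes, at the cost of hard-coding the column order.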

Incidentally, your post would be more readable if you used the "Code" annotation.

Bill
 
g tsuji
Ranch Hand
Posts: 666
You can simply do this, if you are not very fluent in traversing nodes.

That depends on DOM Level 3 support. Most DOM parsers that are not too archaic, even those with only partial Level 3 support, should have getTextContent() in place.
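
A minimal sketch of how getTextContent() could be used here, assuming a parser with DOM Level 3 support (the one bundled with current JDKs has it). The class name `CellTextReader` and method `cellTexts` are hypothetical. The call concatenates the text of all descendants, so the empty startStyle and endStyle elements inside a header cell contribute nothing and only the label remains.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class CellTextReader {

    // Returns the text of every <cell>, whether or not it contains nested elements.
    static List<String> cellTexts(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }

        List<String> texts = new ArrayList<>();
        NodeList cells = doc.getElementsByTagName("cell");
        for (int i = 0; i < cells.getLength(); i++) {
            // getTextContent() gathers the text of the cell and all its descendants.
            texts.add(cells.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) {
        String xml = "<row>"
                + "<cell><startStyle fontEmphasis='b'/>Name<endStyle/></cell>"
                + "<cell>Employee number</cell>"
                + "<cell><startStyle fontEmphasis='b'/>Sex<endStyle/></cell>"
                + "</row>";
        System.out.println(cellTexts(xml));
    }
}
```

The same method works unchanged on the plain data cells, so header and data rows can share one code path.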

PS: Your DOCTYPE line is actually incorrect. I wonder how that came about!
 