| Author |
Parsing data out of an XML document
|
Sagar Suraj
Greenhorn
Joined: Apr 19, 2011
Posts: 4
|
|
I want to parse XML content something like the sample below. It is formatted like HTML. How can I parse the content?
I want values such as name, employee number, age, etc.
But they are not defined in dedicated tags.
Kindly help me extract the content from this HTML-formatted XML.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ListForEmployee_1_0 SYSTEM "c:/file/hello.dtd">
<List suppressFolio="n" xmlProviderInfo="test Server" Strategy="normal">
<Wrapper>
<doc>
<docBody>
<displayGroup lineSeparator="n" leftIndent="10" fontFamily="Verdana" fontSize="11">
Service: <startStyle fontEmphasis="b"/>Employee File<endStyle/> <startStyle fontEmphasis="b"/>10 records<endStyle/>
<newLine n="1"/>
Company: <startStyle fontEmphasis="b"/>A2B company<endStyle/>
</displayGroup>
<displayGroup lineSeparator="y">
<table>
<cellWidth numSpaces="10"/>
<cellWidth numSpaces="2"/>
<cellWidth numSpaces="15"/>
<cellWidth numSpaces="20"/>
<cellWidth numSpaces="20"/>
<cellWidth numSpaces="13"/>
<tableBody>
<row>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Name<endStyle/></cell>
<cell topBorder="y" bottomBorder="y">Employee number</cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Sex<endStyle/></cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Age<endStyle/></cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Designation<endStyle/></cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Date<endStyle/></cell>
</row>
<row>
<cell topBorder="y" bottomBorder="y" justification="left">Mark</cell>
<cell topBorder="y" bottomBorder="y" justification="left">1001</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Male</cell>
<cell topBorder="y" bottomBorder="y" justification="left">25</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Analyst</cell>
<cell topBorder="y" bottomBorder="y" justification="left">2005-02-01</cell>
</row>
<row>
<cell topBorder="y" bottomBorder="y" justification="left">ricky</cell>
<cell topBorder="y" bottomBorder="y" justification="left">1005</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Male</cell>
<cell topBorder="y" bottomBorder="y" justification="left">28</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Analyst</cell>
<cell topBorder="y" bottomBorder="y" justification="left">2008-12-01</cell>
</row>
<row>
<cell topBorder="y" bottomBorder="y" justification="left">David</cell>
<cell topBorder="y" bottomBorder="y" justification="left">1007</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Male</cell>
<cell topBorder="y" bottomBorder="y" justification="left">35</cell>
<cell topBorder="y" bottomBorder="y" justification="left">SeniorAnalyst</cell>
<cell topBorder="y" bottomBorder="y" justification="left">2005-08-11</cell>
</row>
<row>
<cell topBorder="y" bottomBorder="y" justification="left">hilary</cell>
<cell topBorder="y" bottomBorder="y" justification="left">1008</cell>
<cell topBorder="y" bottomBorder="y" justification="left">female</cell>
<cell topBorder="y" bottomBorder="y" justification="left">28</cell>
<cell topBorder="y" bottomBorder="y" justification="left">maketing</cell>
<cell topBorder="y" bottomBorder="y" justification="left">2001-02-01</cell>
</row>
</tableBody>
</table>
</displayGroup>
</docBody>
</doc>
</Wrapper>
</List>
|
 |
William Brogden
Author and all-around good cowpoke
Rancher
Joined: Mar 22, 2000
Posts: 12271
|
|
First things first!
Have you been able to parse this document into a DOM using the standard Java library parser?
If you can get a DOM, you will have to locate each of the table "row" Elements then extract the NodeList of "cell" elements inside each row.
These NodeList collections will maintain the order of the "cell" elements so you can extract the values in each column of the table.
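The steps described above can be sketched roughly as follows. This is only an illustration, not the poster's own code: the class name `RowCellReader`, the method `readRows`, and the trimmed-down sample XML are assumptions made for the example, and the cell text is read with `getTextContent()`, which needs DOM Level 3 support.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, named only for this sketch.
class RowCellReader {

    // Returns one list per <row>, holding the text of each <cell> in document order.
    static List<List<String>> readRows(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            List<List<String>> table = new ArrayList<>();
            NodeList rows = doc.getElementsByTagName("row");
            for (int i = 0; i < rows.getLength(); i++) {
                Element row = (Element) rows.item(i);
                // The NodeList keeps the cells in the order they appear in the row.
                NodeList cells = row.getElementsByTagName("cell");
                List<String> values = new ArrayList<>();
                for (int j = 0; j < cells.getLength(); j++) {
                    values.add(cells.item(j).getTextContent().trim());
                }
                table.add(values);
            }
            return table;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample: two data rows cut down from the posted document.
        String xml = "<tableBody>"
                + "<row><cell>Mark</cell><cell>1001</cell><cell>Male</cell></row>"
                + "<row><cell>ricky</cell><cell>1005</cell><cell>Male</cell></row>"
                + "</tableBody>";
        for (List<String> row : readRows(xml)) {
            System.out.println(row);
        }
    }
}
```

Because each inner list preserves cell order, index 0 is the name column, index 1 the employee number, and so on.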
Bill
|
Java Resources at www.wbrogden.com
|
 |
Sagar Suraj
Greenhorn
Joined: Apr 19, 2011
Posts: 4
|
|
I am able to parse the document using the DOM parser, and I can retrieve the values from the tags below.
<row>
<cell topBorder="y" bottomBorder="y" justification="left">ricky</cell>
<cell topBorder="y" bottomBorder="y" justification="left">1005</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Male</cell>
<cell topBorder="y" bottomBorder="y" justification="left">28</cell>
<cell topBorder="y" bottomBorder="y" justification="left">Analyst</cell>
<cell topBorder="y" bottomBorder="y" justification="left">2008-12-01</cell>
</row>
Below is the piece of code I have used.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
try {
    // Use the factory to create a builder
    DocumentBuilder builder = factory.newDocumentBuilder();
    // xmlResponse is the XML to be parsed
    Document doc = builder.parse(new InputSource(
            new ByteArrayInputStream(xmlResponse.toString().getBytes("utf-8"))));
    NodeList nodes = doc.getElementsByTagName("row");
    System.err.println("Number of rows: " + nodes.getLength());
    List ls = new ArrayList();
    for (int i = 0; i < nodes.getLength(); i++) {
        Element element = (Element) nodes.item(i);
        LmlPrinterFriendlyResponseParsed lmlTextOnly = new LmlPrinterFriendlyResponseParsed();
        NodeList nTitle = element.getElementsByTagName("cell");
        for (int j = 0; j < nTitle.getLength(); j++) {
            Element line = (Element) nTitle.item(j);
            // getCharacterDataFromElement is a helper method that is not shown in this post
            String title = getCharacterDataFromElement(line);
        }
    }
} catch (ParserConfigurationException | SAXException | IOException e) {
    // The original snippet had no catch block, so it would not compile
    e.printStackTrace();
}
But I couldn't retrieve the values from the tags below. I want the values Name, Sex, Age, Designation, Date, etc.
Since they contain the <startStyle ***> tag, I couldn't proceed.
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Name<endStyle/></cell>
<cell topBorder="y" bottomBorder="y">Employee number</cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Sex<endStyle/></cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Age<endStyle/></cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Designation<endStyle/></cell>
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Date<endStyle/></cell>
NodeList nodes = doc.getElementsByTagName("row");
System.err.println("in nodes is " + nodes.getLength());
List ls = new ArrayList();
for (int i = 0; i < nodes.getLength(); i++) {
    Element element = (Element) nodes.item(i);
    //List ls1 = new ArrayList();
    LmlPrinterFriendlyResponseParsed lmlTextOnly = new LmlPrinterFriendlyResponseParsed();
    NodeList nTitle = element.getElementsByTagName("cell");
    for (int j = 0; j < nTitle.getLength(); j++) {
        Element line = (Element) nTitle.item(j);
        NodeList nStyle = line.getElementsByTagName("startStyle");
        for (int k = 0; k < nStyle.getLength(); k++) {
            Element elemStyle = (Element) nStyle.item(k);
            String title = getCharacterDataFromElement(line);
        }
    }
}
|
 |
William Brogden
Author and all-around good cowpoke
Rancher
Joined: Mar 22, 2000
Posts: 12271
|
|
But I couldn't retrieve the values from the tags below. I want the values Name, Sex, Age, Designation, Date, etc.
Since they contain the <startStyle ***> tag, I couldn't proceed.
<cell topBorder="y" bottomBorder="y" justification="left"><startStyle fontEmphasis="b"/>Name<endStyle/></cell>
<cell topBorder="y" bottomBorder="y">Employee number</cell> .....
Well, of course you can't: that row is being used as a header, not data. You need to skip that row and find the rows with real data.
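One way to act on that advice, sketched under the assumption that the first "row" element is always the header: read its cell texts as column names, then map every later row onto those names. The class `EmployeeTableReader`, its methods, and the shortened sample XML are made up for this illustration and are not from the original post.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, named only for this sketch.
class EmployeeTableReader {

    // Collects the trimmed text of every <cell> inside one <row>, in order.
    static List<String> cellTexts(Element row) {
        NodeList cells = row.getElementsByTagName("cell");
        List<String> texts = new ArrayList<>();
        for (int j = 0; j < cells.getLength(); j++) {
            texts.add(cells.item(j).getTextContent().trim());
        }
        return texts;
    }

    // Assumes row 0 is the header; every later row becomes a name-to-value map.
    static List<Map<String, String>> readRecords(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
        List<Map<String, String>> records = new ArrayList<>();
        NodeList rows = doc.getElementsByTagName("row");
        if (rows.getLength() == 0) {
            return records;
        }
        List<String> header = cellTexts((Element) rows.item(0));
        for (int i = 1; i < rows.getLength(); i++) { // start at 1 to skip the header row
            List<String> values = cellTexts((Element) rows.item(i));
            Map<String, String> record = new LinkedHashMap<>();
            for (int j = 0; j < values.size() && j < header.size(); j++) {
                record.put(header.get(j), values.get(j));
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        // Assumed sample: header row plus one data row, cut down from the posted document.
        String xml = "<tableBody>"
                + "<row><cell><startStyle fontEmphasis=\"b\"/>Name<endStyle/></cell>"
                + "<cell>Employee number</cell></row>"
                + "<row><cell>Mark</cell><cell>1001</cell></row>"
                + "</tableBody>";
        for (Map<String, String> record : readRecords(xml)) {
            System.out.println(record.get("Name") + " / " + record.get("Employee number"));
        }
    }
}
```

If the real feed can contain several tables, or a table without a header row, the "first row is the header" assumption would need to be replaced with a proper check.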
Incidentally, your post would be more readable if you used the "Code" annotation.
Bill
|
 |
g tsuji
Ranch Hand
Joined: Jan 18, 2011
Posts: 368
|
|
You can simply do this, if you are not very fluent in traversing nodes.
That depends on DOM Level 3 support. Most DOM parsers that are not too archaic, even those with only partial Level 3 support, should have getTextContent() in place.
P.S. Your DOCTYPE line is actually incorrect. I wonder how that came about!
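A minimal sketch of the getTextContent() idea, not necessarily the exact code the poster had in mind. It contrasts `getTextContent()` with a helper that only looks at the first child node, which is an assumption about how a method like getCharacterDataFromElement might behave. The class and method names here are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, named only for this sketch.
class CellTextDemo {

    // Assumed behaviour of a "first child only" helper: it returns text only
    // when the very first child of the element is a text node.
    static String firstChildText(Element e) {
        Node child = e.getFirstChild();
        return (child instanceof Text) ? ((Text) child).getData() : "";
    }

    // DOM Level 3: concatenates the text of all descendant text nodes.
    static String fullText(Element e) {
        return e.getTextContent().trim();
    }

    // Parses a small XML fragment and returns its root element.
    static Element parseCell(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        Element header = parseCell(
                "<cell><startStyle fontEmphasis=\"b\"/>Name<endStyle/></cell>");
        // The first child is the empty <startStyle/> element, so this finds no text.
        System.out.println("first child only: [" + firstChildText(header) + "]");
        // getTextContent() ignores the empty style elements and returns the text.
        System.out.println("getTextContent:   [" + fullText(header) + "]");
    }
}
```

In a header cell the text node is a sibling of the empty startStyle and endStyle elements, so looking inside startStyle finds nothing, while getTextContent() on the cell itself returns the label.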
|