JavaRanch » Java Forums » Engineering » XML and Related Technologies
Parsing HTML, not XHTML or XML.....

Madhav Lakkapragada
Ranch Hand

Joined: Jun 03, 2000
Posts: 5040
OK, one of my friends asked me this, and my answer was no, it can't be done. But I wanted to double-check with folks here in case anyone else has solved this use case:

The input file is an HTML file. She wanted to parse it using the JDK 1.4.xx (or any other) APIs. The problem is that the HTML file contains a SCRIPT start tag and a SCRIPT end tag.

So the question is: 'Am I right in saying that there's no parser available to parse such HTML files?'

BTW, the input file is not in our control to modify.
I would like to know.
Thanks.

- m


Balaji Loganathan
author and deputy
Bartender

Joined: Jul 13, 2001
Posts: 3150
I haven't used it, but the release news for this HTML parser tool says it supports script blocks.
Madhav Lakkapragada
Ranch Hand

Joined: Jun 03, 2000
Posts: 5040
Thanks for that link, Balaji. I will look into this.

You see, as if the requirements I mentioned before were not challenging enough, I can't rely on third-party software.
So my options are narrowed down to the standard APIs: JDK, Xerces, Xalan, things of that nature. If I could achieve this with the standard APIs, I would like to investigate further. If push comes to shove, then a non-standard third-party library is acceptable.

My interest here is more academic in nature, something that I want to learn and see if I am missing something.

- m
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12754
    
Several years ago we used JTidy to parse poorly written HTML.
JTidy project at SourceForge
I wonder if you might find JTidy helpful.
Bill
Balaji Loganathan
author and deputy
Bartender

Joined: Jul 13, 2001
Posts: 3150
One more link I found: http://java-source.net/open-source/html-parsers

By the way, I am very curious to know the idea behind parsing the HTML.
Asanga Pradeep
Greenhorn

Joined: Apr 01, 2005
Posts: 5
I don't know if this is what you're looking for:
http://www.javaalmanac.com/egs/javax.swing.text.html/GetText.html

It parses an HTML file. This is a basic example, but it can be expanded.
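
For anyone who wants to try the JDK-only route, here is a minimal sketch of that approach. It uses the callback parser in javax.swing.text.html.parser, which ships with the JDK and does not require the input to be well-formed XML. The class name HtmlTextExtractor and its fields are hypothetical, made up for illustration, and the code uses newer Java syntax (generics, annotations) than JDK 1.4 allows. How the parser reports the body of a SCRIPT element is something to check against the JDK version in use; this sketch only collects start-tag names and text.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name and members are illustrative only.
class HtmlTextExtractor {
    // Names of the start tags the parser reported, in document order.
    final List<String> tags = new ArrayList<>();
    // Text chunks the parser reported, separated by single spaces.
    final StringBuilder text = new StringBuilder();

    static HtmlTextExtractor parse(Reader in) {
        final HtmlTextExtractor result = new HtmlTextExtractor();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                result.tags.add(tag.toString());
            }

            @Override
            public void handleText(char[] data, int pos) {
                result.text.append(data).append(' ');
            }
        };

        try {
            // The third argument (ignoreCharSet = true) tells the parser not to
            // abort when a META tag declares a different character set.
            new ParserDelegator().parse(in, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

To run it against a file, pass a java.io.FileReader (or an InputStreamReader with an explicit charset) to HtmlTextExtractor.parse and then read the tags and text fields. Overriding handleSimpleTag, handleEndTag, or handleComment in the same callback would be the place to start if more of the document structure is needed.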
 