
Parsing HTML, not XHTML or XML.....

 
Madhav Lakkapragada
Ranch Hand
Posts: 5040
OK, one of my friends asked me this, and my answer was "nope, can't be done." But I wanted to double-check with folks here in case anyone has already solved this use case.

The input is an HTML file. She wanted to parse it using the JDK 1.4.xx (or any other) APIs. The problem was that the HTML file contained a SCRIPT element, with a SCRIPT start tag and a SCRIPT end tag.

So, the question is: "Am I right in saying that there's no parser available to parse such HTML files?"

BTW, the input file is not in our control to modify.
I would like to know.
Thanks.

- m
 
Balaji Loganathan
author and deputy
Bartender
Posts: 3150
I haven't used it, but the release news for this HTML parser tool says it supports script blocks.
 
Madhav Lakkapragada
Ranch Hand
Posts: 5040
Thanks for that link, Balaji. I will look into this.

You see, as if the requirements I mentioned before were not challenging enough, I also can't rely on third-party software.
So my options are narrowed down to the standard APIs: JDK, Xerces, Xalan, things of that nature. If I could achieve this with the standard APIs, I would like to investigate more. If push comes to shove, then a non-standard third-party library is acceptable.
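
For context, here is a minimal sketch of why the strict XML side of the standard APIs tends to choke on a file like this. It assumes the script body contains an unescaped `<` or `&`, which is only a guess, since the actual failure isn't described above. The class name `XmlStrictnessDemo` and the sample markup are made up for illustration; the JAXP calls are the standard ones shipped with the JDK.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class: checks whether markup is acceptable to the JDK's XML parser.
final class XmlStrictnessDemo {

    /** Returns true if the markup parses as well-formed XML, false if the parser rejects it. */
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Well-formedness errors surface as SAXException (usually SAXParseException).
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input: plain HTML with raw '<' and '&&' inside the script body.
        String html = "<html><body><script>if (a < b && c) { go(); }</script></body></html>";
        // The same script wrapped in a CDATA section, as XHTML would require.
        String xhtml = "<html><body><script><![CDATA[if (a < b && c) { go(); }]]></script></body></html>";

        System.out.println("plain HTML accepted as XML:  " + parsesAsXml(html));
        System.out.println("CDATA-wrapped accepted as XML: " + parsesAsXml(xhtml));
    }
}
```

If that assumption holds, the limitation is in treating HTML as XML rather than in parsing HTML as such, so an HTML-aware parser would be the thing to look for.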

My interest here is more academic in nature, something that I want to learn and see if I am missing something.

- m
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13056
Several years ago we used JTidy to parse poorly written HTML:
JTidy project at SourceForge
I wonder if you might find JTidy helpful.
Bill
 
Balaji Loganathan
author and deputy
Bartender
Posts: 3150
Here is one more link I found: http://java-source.net/open-source/html-parsers

By the way, I am very curious to know what the idea is behind parsing the HTML.
 
Asanga Pradeep
Greenhorn
Posts: 5
I don't know if this is what you're looking for:
http://www.javaalmanac.com/egs/javax.swing.text.html/GetText.html

It parses an HTML file. This is a basic example, but it can be expanded.
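
Roughly, the approach looks like the sketch below. It uses only the HTML parser that ships with the JDK in `javax.swing.text.html`, so no third-party jar is needed. This is my own minimal version, not a copy of the linked example; the class name `HtmlTextExtractor` and the sample markup are made up for illustration. How the body of a SCRIPT element gets reported to the callback is something to check against the real file rather than take from this sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text content of an HTML document
// using the JDK's built-in (lenient) HTML parser.
final class HtmlTextExtractor {

    /** Parses the given HTML and returns each text chunk on its own line. */
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // The parser reports what it finds through callback methods;
        // only handleText is overridden here, tags are ignored.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append('\n');
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Assumed sample input containing a SCRIPT element, as in the original question.
        String html = "<html><body><p>Hello</p>"
                + "<script>var x = 1;</script>"
                + "<p>World</p></body></html>";
        System.out.print(extractText(html));
    }
}
```

To expand it, override `handleStartTag`, `handleEndTag`, or `handleComment` in the same callback to see the tags and other content the parser reports.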
 