
Parsing HTML, not XHTML or XML.....

 
Ranch Hand
Posts: 5040
OK, one of my friends asked me this, and my answer was: nope, can't be done. But I wanted to double-check with folks here in case anyone has solved this use case.

The input is an HTML file. She wanted to parse it using the JDK 1.4.xx (or any other) APIs. The problem was that the HTML file contained a SCRIPT start tag and a SCRIPT end tag.

So, the question is: 'Am I right in saying that there's no parser available to parse such HTML files?'

BTW, we have no control over the input file and cannot modify it.
I would like to know.
Thanks.

- m
 
author and deputy
Posts: 3150
I haven't used it, but the release news for this HTML parser tool says it supports script blocks.
 
Madhav Lakkapragada
Ranch Hand
Posts: 5040
Thanks for that link, Balaji. I will look into this.

You see, as if the requirements I described before were not challenging enough, I also can't rely on third-party software.
So my options are narrowed down to the standard APIs: JDK, Xerces, Xalan, things of that nature. If I could achieve this with the standard APIs, I would like to investigate further. If push comes to shove, a non-standard third-party library is acceptable.

My interest here is more academic in nature, something that I want to learn and see if I am missing something.

- m
 
Author and all-around good cowpoke
Posts: 13078
Several years ago we used JTidy to parse poorly written HTML.
JTidy project at SourceForge
I wonder if you might find JTidy helpful.
Bill
 
Balaji Loganathan
author and deputy
Posts: 3150
One more link I found: http://java-source.net/open-source/html-parsers

By the way, I am very curious to know the idea behind parsing the HTML.
 
Greenhorn
Posts: 5
I don't know if this is what you're looking for:
http://www.javaalmanac.com/egs/javax.swing.text.html/GetText.html

It parses an HTML file. This is a basic example, but it can be expanded.
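
For anyone who can't follow the link, here is a minimal sketch of that kind of approach using only the JDK's own Swing HTML parser (`javax.swing.text.html.parser.ParserDelegator` with an `HTMLEditorKit.ParserCallback`). The class name `HtmlTextExtractor` and the method `extractText` are made up for illustration and are not taken from the linked page. The parser is lenient and follows an old HTML DTD, so treat this as a starting point rather than a complete solution. As far as I know, the contents of SCRIPT elements are not reported through `handleText`, so a page containing script blocks should still parse; that is worth checking against your own input.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text content of an HTML document
// using only classes that ship with the JDK.
class HtmlTextExtractor {

    static String extractText(Reader html) {
        final StringBuilder text = new StringBuilder();

        // The callback receives parse events; here only text chunks are kept.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' '); // separate chunks with a single space
                }
                text.append(data);
            }
        };

        try {
            // The third argument tells the parser to ignore any charset
            // declaration inside the document instead of failing on it.
            new ParserDelegator().parse(html, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String html = "<html><head><script>var x = 1;</script></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(extractText(new StringReader(html)));
    }
}
```

To read from a file instead, pass a `java.io.FileReader` (or any other `Reader`) to `extractText`. Overriding `handleStartTag`, `handleEndTag`, or `handleComment` in the same callback gives access to the tags and comments as well.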
 