Java HTML parser

 
Ranch Hand
Posts: 539
Hi all,
I'm posting this here rather than in the XML or HTML forums because it's about Java parsing libraries, not XML or HTML structure or syntax.
My question is: what is a good free (ideally open source) HTML parsing library? I want to fetch pages off the net and search them for specific data. I've considered:
  • Apache Crimson (behind SAX). This is no good because it's too strict - if tags don't match up (as they so often don't in HTML), it barfs.
  • The HTML parser in javax.swing.text.html.parser, but this isn't suitable: for example, it can't handle lowercase letters in tags (i.e. <a> rather than <A>).
  • Using regexps rather than a parser to find my data. But this quickly becomes absurdly difficult if what I'm looking for is remotely complex.


    So, does anyone have tips on either (1) how I can make Crimson more error-tolerant, or (2) what other library I should be using?
    Cheers all,
    --Tim
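
    To illustrate the regexp point above, here is a minimal sketch (not from the original post; the class name, method name and sample markup are made up for illustration). It pulls href values out of anchor tags with a deliberately naive pattern, and shows how quickly real-world variations such as single-quoted or unquoted attributes slip past it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLinkSketch {
    // Naive pattern (illustrative only): an <a ...> tag with a double-quoted href value.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns every double-quoted href value the naive pattern finds. */
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<p><A HREF=\"one.html\">one</A> <a class=x href=\"two.html\">two</a>"
                + " <a href='three.html'>three</a> <a href=four.html>four</a></p>";
        // The single-quoted and unquoted links are silently missed.
        System.out.println(extractHrefs(html)); // prints [one.html, two.html]
    }
}
```

    Each missed case can be patched with a more elaborate pattern, but comments, scripts and nested markup keep adding new ones, which is roughly why a tolerant parser tends to be the better tool here.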
     
    Tim West
    Ranch Hand
    Posts: 539
    An addition...I've just found Xerces-J. Anyone know if this is appropriate? I'm guessing it's good quality given that it's an Apache project...
    [Update - 5 mins later]
    It's no good (for what I want to do)...to quote their "common problems" section...


    Unfortunately, HTML does not, in general, follow the XML grammar rules. Most HTML files do not meet the XML style guidelines. Therefore, the XML parser generates XML well-formedness errors.
    (...)
    HTML must match the XHTML standard for well-formedness before it can be parsed by Xerces-J or any other XML parser. You can find the XHTML standard on the W3C web site.
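
    The behaviour described in that quote can be reproduced with whatever SAX parser the JDK ships through JAXP. The following is only a sketch under that assumption (the class name, helper method and sample strings are invented for illustration, and the bundled parser is not necessarily Crimson or Xerces-J as distributed by Apache): well-formed XHTML-style markup is accepted, while ordinary HTML with an unclosed <br> is rejected.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StrictParserSketch {
    /** Returns true if the default JAXP SAX parser accepts the markup as well-formed XML. */
    static boolean isWellFormed(String markup) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Well-formedness violations are reported as fatal errors and end up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtmlStyle = "<html><body><p>one<br/>two</p></body></html>";
        String typicalHtml = "<html><body><p>one<br>two</body></html>";
        System.out.println(isWellFormed(xhtmlStyle));  // prints true
        System.out.println(isWellFormed(typicalHtml)); // prints false
    }
}
```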


    Now I'm looking at Jericho (http://sourceforge.net/projects/jerichohtml/)...sounds like there's some potential there.
    -Tim
    [ April 02, 2004: Message edited by: Tim West ]
     
    (instanceof Sidekick)
    Posts: 8791
    I use the Quiotix Parser. There are others out there, probably better supported. I liked this one because it supports the Visitor Pattern in a way that made my life very easy.
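
    For anyone unfamiliar with the Visitor Pattern in this setting, here is a rough sketch of the idea. Every type in it (HtmlVisitor, Node, Tag, Text, TextCollector) is hypothetical and written just for this example; none of it is the API of the parser mentioned above or of any other library. The tree is built by hand where a real parser would produce it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class VisitorSketch {
    /** Hypothetical visitor: one callback per node type. */
    interface HtmlVisitor {
        void visitTag(Tag tag);
        void visitText(Text text);
    }

    /** Hypothetical node: each node dispatches itself to the visitor. */
    interface Node {
        void accept(HtmlVisitor visitor);
    }

    static final class Tag implements Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Tag(String name, Node... kids) {
            this.name = name;
            children.addAll(Arrays.asList(kids));
        }

        public void accept(HtmlVisitor visitor) {
            visitor.visitTag(this);
            for (Node child : children) {
                child.accept(visitor); // depth-first walk
            }
        }
    }

    static final class Text implements Node {
        final String content;

        Text(String content) {
            this.content = content;
        }

        public void accept(HtmlVisitor visitor) {
            visitor.visitText(this);
        }
    }

    /** Example visitor: ignores tags and concatenates all text content. */
    static final class TextCollector implements HtmlVisitor {
        final StringBuilder out = new StringBuilder();

        public void visitTag(Tag tag) { /* tags are not interesting here */ }

        public void visitText(Text text) {
            out.append(text.content);
        }
    }

    static String collectText(Node root) {
        TextCollector collector = new TextCollector();
        root.accept(collector);
        return collector.out.toString();
    }

    public static void main(String[] args) {
        // Hand-built stand-in for <p>Hello <b>world</b></p>
        Node tree = new Tag("p", new Text("Hello "), new Tag("b", new Text("world")));
        System.out.println(collectText(tree)); // prints Hello world
    }
}
```

    The appeal is that the tree-walking logic lives in one place, and each extraction task becomes a small visitor class that overrides only the callbacks it cares about.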