Extracting html from webpages?

 
Ranch Hand
Posts: 40
Is there a built-in function to extract HTML from web pages? Say, for instance, I wanted to extract all of the "plain text" from the javaranch.com website. Is there a simple way to go about this?
Thank You...
Nick Ueda
 
Ranch Hand
Posts: 1873
By 'text', you mean removing all the HTML tags and keeping what remains, right?
Well, as far as I know, there is no simple way of doing that.
regards
maulin
 
Nick Ueda
Ranch Hand
Posts: 40
Well, what about just getting the HTML file off of a webpage?
 
Greenhorn
Posts: 21
I'm not 100% sure what you are looking for. However, if you want to remove all the HTML tags from the source file, I suggest using regular expressions.
I wrote a very simple (and not all-inclusive) Perl script that removes tags.
<code> ~s/<(?:[^>""]*|([""]).*?\1)*>//g; </code>
With JDK 1.4, you can use regular expressions in Java. I was able to replicate the script above. Check out java.util.regex and build off the snippet above.
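For anyone who wants a starting point, here is a minimal sketch of the same idea with java.util.regex. It is written in the spirit of the Perl snippet rather than as a literal port: the class name `TagStripper` and the method `stripTags` are made up for illustration, and treating single-quoted attribute values the same as double-quoted ones is my assumption. Like the Perl version, it is not all-inclusive (comments, scripts, and entities are not handled).

```java
import java.util.regex.Pattern;

final class TagStripper {
    // A tag is '<', then any mix of ordinary characters and complete
    // quoted strings, then '>'. Quoted strings are matched as a unit so
    // that a '>' inside an attribute value does not end the tag early.
    private static final Pattern TAG =
            Pattern.compile("<(?:[^>'\"]|\"[^\"]*\"|'[^']*')*>");

    // Returns the input with everything that looks like a tag removed.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p class=\"a>b\">Hello, <b>ranch</b>!</p>";
        System.out.println(stripTags(html));
    }
}
```

A lone `<` that is never closed (as in `a < b`) is left alone, because the pattern only matches when a closing `>` is found.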
 
Ranch Hand
Posts: 51
This solution is not simple, but it is precise, and it is a Java solution.
While this is a lovely site, its markup is not well-formed. However, if you download JTidy, you can run this tool from Java and it will give you a well-formed XHTML representation of a page. Then you can use a simple XPath expression in an XSL stylesheet that selects all the text from the <body> tag downwards: <xsl:value-of select="//body" /> (the string value of an element is the text of all its descendants, whereas //body/text() would only reach text sitting directly inside <body>).
I know, it's a complicated option, but still an option.
peter
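The XSL step can be tried on its own with the JDK's built-in javax.xml.transform classes. This is only a sketch under a few assumptions: the page has already been turned into well-formed XHTML (the JTidy step is not shown), the input carries no external DTD reference, and the names `BodyTextExtractor` and `bodyText` are hypothetical. It matches `body` by local name so that XHTML with a default namespace is found too, and it takes the string value of the element, which covers text at every depth.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class BodyTextExtractor {
    // Stylesheet that outputs plain text: the string value of the first
    // element whose local name is "body", i.e. all text below it.
    private static final String XSL =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:value-of select=\"//*[local-name()='body']\"/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Input must already be well-formed XML (for example, tidied XHTML).
    static String bodyText(String wellFormedXhtml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        transformer.transform(
                new StreamSource(new StringReader(wellFormedXhtml)),
                new StreamResult(out));
        return out.toString();
    }
}
```

Anything in `<head>`, such as the page title, is left out because only the body element is selected.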
 
(instanceof Sidekick)
Posts: 8791
JTidy sounds cool. I have also used the Quiotix HTML Parser. It builds a DOM and provides a Visitor interface for walking the DOM and some sample visitors.
Was that the original question, or were you trying to get the HTML from a server in the first place? Here's an example of doing that with URL:
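A minimal sketch along those lines, assuming the page is UTF-8 text and that the address is one you are allowed to fetch. The class name `PageFetcher` and the method `fetch` are hypothetical; `URL.openStream()` is the standard JDK call doing the work.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class PageFetcher {
    // Reads everything the URL serves and returns it as one string.
    // Assumption: the content is UTF-8; real pages may declare otherwise.
    static String fetch(String address) throws IOException {
        URL url = URI.create(address).toURL();
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PageFetcher <url>");
            return;
        }
        System.out.print(fetch(args[0]));
    }
}
```

The same method works for `file:` URLs, which is handy for trying it out without touching anyone's server.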

You have to know the URL you're after, so it won't automatically grab all the content of a site. You could grab a page, parse it, look for links, grab linked pages, parse them, etc. Watch for circular links, and watch for a ticked-off webmaster who doesn't appreciate you taking expensive MIPS and bandwidth from the regular customers while copying copyrighted material.
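The grab, parse, follow loop can be sketched roughly like this. Everything here is illustrative: `TinyCrawler`, `crawl`, and the `fetcher` parameter are made-up names, links are found with a crude `href` regex rather than a real HTML parser, and the crawl stays on the starting host and stops after `maxPages` pages so it does not hammer a server. The visited set is what guards against circular links.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TinyCrawler {
    // Crude link finder: captures the value of href="..." or href='...',
    // stopping at a fragment marker. A real parser would be more robust.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    // Breadth-first crawl from 'start'. 'fetcher' maps a URL to its HTML,
    // so the download strategy (and any politeness delay) is up to the caller.
    static Set<String> crawl(String start, int maxPages, Function<String, String> fetcher) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        String host = URI.create(start).getHost();
        queue.add(start);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String page = queue.remove();
            if (!visited.add(page)) {
                continue; // already fetched: this is how circular links are cut
            }
            Matcher m = HREF.matcher(fetcher.apply(page));
            while (m.find()) {
                URI link;
                try {
                    link = URI.create(page).resolve(m.group(1).trim());
                } catch (IllegalArgumentException e) {
                    continue; // skip hrefs that are not valid URIs
                }
                if (host != null && host.equals(link.getHost())) {
                    queue.add(link.toString()); // same site only
                }
            }
        }
        return visited;
    }
}
```

Passing the fetcher in as a function keeps the loop testable with canned pages, and leaves room to add a delay between requests or a robots.txt check before pointing it at a live site.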
Some sites that WANT you to do this use RSS publishing. Neat trend.
[ July 03, 2003: Message edited by: Stan James ]
 
Sheriff
Posts: 67746

Well, what about just getting the HTML file off of a webpage?


Check out the URL.openConnection() method.
hth,
bear
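For reference, a rough sketch of what that can look like. `URL.openConnection()`, `setConnectTimeout`, `setReadTimeout`, `getContentType`, and `getInputStream` are standard JDK methods; the class `ConnectionFetcher`, the helper `charsetOf`, the ten-second timeouts, and the fallback to UTF-8 when no charset is declared are all assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ConnectionFetcher {
    // Pulls the charset out of a Content-Type value such as
    // "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption)
    // when the header is missing or names no usable charset.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    try {
                        return Charset.forName(p.substring(8).replace("\"", "").trim());
                    } catch (IllegalArgumentException e) {
                        break; // illegal or unsupported charset name
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Opens a connection, sets timeouts, and decodes the body using the
    // charset the server declares.
    static String fetch(String address) throws IOException {
        URLConnection conn = URI.create(address).toURL().openConnection();
        conn.setConnectTimeout(10_000); // milliseconds
        conn.setReadTimeout(10_000);
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                body.write(buffer, 0, n);
            }
            return new String(body.toByteArray(), charsetOf(conn.getContentType()));
        }
    }
}
```

Compared with calling `openStream()` directly, going through the connection object gives access to the response headers and lets you set timeouts before any data is read.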
 
Nick Ueda
Ranch Hand
Posts: 40

Originally posted by Bear Bibeault:

Check out the URL.openConnection() method.


Thanks, I will do that.

[ July 03, 2003: Message edited by: Nick Ueda ]
 