remove javascript from html web page

asit dhal
Greenhorn

Joined: May 05, 2009
Posts: 13

I need to remove all tags (HTML tags and JavaScript code) from a web page.

Can somebody tell me how to do this?


http://kodeyard.blogspot.com/
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8427
    

asit dhal wrote: I need to remove all tags (HTML tags and JavaScript code) from a web page.

Can somebody tell me how to do this?

I suggest you look at a SAX or DOM parser. Java has implementations of both. SAX is generally the easier of the two to use, and I'm pretty sure it will do what you want; however, you may need to convert the HTML to XHTML first, because these parsers expect well-formed XML. For that, there is a utility called JTidy, which I believe has its own SAX-like parser built in; but I've never used it, so I have no idea how easy it is.
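To make the SAX suggestion concrete, here is a minimal sketch using the parser that ships with the JDK. It assumes the input is already well-formed XHTML (for example, after a cleanup pass with a tool such as JTidy) and has no DOCTYPE that would make the parser fetch an external DTD. The class name TagStripper and the method extractText are made up for illustration; treat this as one possible approach rather than a finished solution.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: the class and method names are illustrative only.
final class TagStripper {

    /** Collects character data, ignoring anything inside script or style elements. */
    private static final class TextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int skipDepth = 0; // > 0 while inside <script> or <style>

        private static boolean isSkipped(String name) {
            return "script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if (isSkipped(qName)) {
                skipDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (isSkipped(qName)) {
                skipDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The tags themselves never reach this callback; only the text between them does.
            if (skipDepth == 0) {
                text.append(ch, start, length);
            }
        }

        String text() {
            // Collapse the runs of whitespace left behind by the removed markup.
            return text.toString().replaceAll("\\s+", " ").trim();
        }
    }

    /** Assumes well-formed XHTML; malformed input makes the parser throw a SAXException. */
    static String extractText(String xhtml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TextHandler handler = new TextHandler();
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><script>alert('hi');</script></head>"
                + "<body><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(page));
    }
}
```

Note that script bodies containing a raw < or & are not well-formed XML unless they are wrapped in a CDATA section, which is one reason the XHTML conversion step matters before feeding a real page to a parser like this.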

Tip: DON'T think about a regex-based solution if there is any "awareness" required. Regexes are very powerful, but not well suited to hierarchical logic.

Winston


Isn't it funny how there's always time and money enough to do it WRONG?