remove javascript from html web page

asit dhal
Greenhorn

Joined: May 05, 2009
Posts: 13

I need to remove all tags (HTML tags and JavaScript code) from a web page.

Can somebody tell me how to do this?


http://kodeyard.blogspot.com/
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8661
    
  23

asit dhal wrote: I need to remove all tags (HTML tags and JavaScript code) from a web page.

Can somebody tell me how to do this?

I suggest you look at a SAX or DOM parser; Java has implementations of both. SAX is generally the easier of the two to use, and I'm pretty sure it will do what you want; however, you may need to convert the HTML to XHTML first. For that, there is a utility called JTidy, which I believe has its own SAX-like parser built in; but I've never used it, so I have no idea how easy it is.
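
As a rough illustration of the SAX approach, here is a minimal sketch using the JDK's built-in SAX parser. It assumes the page is already well-formed XHTML (for example after a tidy-up step) and has no DOCTYPE that the parser would try to fetch. The names TextExtractor, TextHandler and extractText are made up for this example, and skipping style blocks as well as script blocks is my own addition:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: keeps only the text of a well-formed XHTML page.
class TextExtractor {

    /** Collects character data, ignoring anything inside script or style elements. */
    static class TextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int skipDepth = 0; // > 0 while inside <script> or <style>

        private static boolean isSkipped(String qName) {
            return "script".equalsIgnoreCase(qName) || "style".equalsIgnoreCase(qName);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (isSkipped(qName)) skipDepth++;
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (isSkipped(qName)) skipDepth--;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Tags never reach this callback, so appending here drops them for free.
            if (skipDepth == 0) text.append(ch, start, length);
        }

        String getText() {
            // Collapse runs of whitespace left behind by the removed markup.
            return text.toString().replaceAll("\\s+", " ").trim();
        }
    }

    /** Assumes xhtml is well-formed XML; throws IllegalArgumentException otherwise. */
    static String extractText(String xhtml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            TextHandler handler = new TextHandler();
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getText();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><script>if (a &lt; b) { alert('x'); }</script></head>"
                    + "<body><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(page)); // prints "Hello world"
    }
}
```

Real-world HTML is rarely well-formed, so this will throw on most pages unless they are cleaned up first, which is where something like JTidy would come in.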

Tip: DON'T think about a regex-based solution if any "awareness" of context or nesting is required. Regexes are very powerful, but not well suited to hierarchical logic.
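
To show the sort of thing that goes wrong, here is a small sketch of a naive tag-stripping regex (the class and method names are made up for the example). It has no idea that it is inside a script element, so part of the JavaScript leaks into the output and the rest is swallowed as if it were a tag:

```java
// Hypothetical example of the approach the tip warns against.
class NaiveRegexStrip {

    /** Removes anything that looks like <...>, with no notion of nesting or context. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String page = "<p>Hi</p><script>if (a < b) { go(); }</script>";
        // The "<" in the script is treated as the start of a tag, so the result
        // is "Hiif (a " rather than the "Hi" we actually wanted.
        System.out.println(stripTags(page));
    }
}
```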

Winston


Isn't it funny how there's always time and money enough to do it WRONG?