regular expression to remove javascript from html

Richard Hands
Greenhorn

Joined: Feb 25, 2004
Posts: 12
Hi,

I'm writing a servlet to parse XML and XSL together,
and then do some bits and pieces with the resultant HTML.

The first thing I need to do with the output is remove any and
all JavaScript content from the HTML, as it's to be used in
a different fashion to the original page.

I'm sure I'm not far off with the regex, but I can't quite get
it to work. (Don't you just hate it when you get to that stage :roll: )

Here's an example of the regex, and a code snippet that it
fails to parse even though I think it should (note there's no
CR/LF between the end of the first script tag and the start of the second one):



[ May 26, 2004: Message edited by: Richard Hands ]

[ May 27, 2004: Message edited by: Max Habibi ]
Alan Moore
Ranch Hand

Joined: May 06, 2004
Posts: 262
Use the Pattern.DOTALL flag instead of Pattern.MULTILINE. The latter causes the start and end anchors (^ and $) to match at line boundaries as well as at the beginning and end of the input. DOTALL allows the dot to match line terminator characters (\r, \n, etc.), which it doesn't normally do.
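
As a rough illustration of the difference, here is a minimal sketch of stripping script elements with DOTALL. The pattern and the class name ScriptStripper are assumptions made up for this example, since the original regex isn't shown in the thread. It is not a robust HTML parser: it won't cope with a self-closing <script/> tag or with the text "</script>" inside a string literal, for instance.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the original poster's code.
class ScriptStripper {

    // DOTALL lets '.' match line terminators, so a script body that spans
    // several lines can still be matched. The reluctant quantifier ".*?"
    // stops at the first closing tag, so two adjacent script elements are
    // removed separately instead of as one greedy match.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the input with every matched script element removed.
    static String strip(String html) {
        return SCRIPT.matcher(html).replaceAll("");
    }
}
```

Calling ScriptStripper.strip(html) on a string with two back-to-back script elements should remove both and leave the surrounding markup untouched. With Pattern.MULTILINE alone, the dot would still refuse to cross a line break, so a multi-line script body would not match.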
Richard Hands
Greenhorn

Joined: Feb 25, 2004
Posts: 12
Thanks for the tip, but it still didn't work.
Paul Sturrock
Bartender

Joined: Apr 14, 2004
Posts: 10336

If your original source is XML, why use regular expressions at all? You have a well-formed document, so you can guarantee that everything inside <script></script> is JavaScript. Handle it when you parse your XML, or when you transform it with your XSLT. Regular expressions seem unnecessary here, and just an overcomplication. But then again, perhaps I'm not quite following what you are trying to do?


Richard Hands
Greenhorn

Joined: Feb 25, 2004
Posts: 12
Basically, I have an abstract class that instantiates an XML transformer, reads the basic data-centric XML from the database and transforms it with an HTML-centric XSL stylesheet (which XSL it uses is determined by various bits of data in the database).

This is being extended to a variety of things, and in the one I'm currently working on, I want to strip out all the JavaScript that may be in it.

Does this make a bit more sense?
Paul Sturrock
Bartender

Joined: Apr 14, 2004
Posts: 10336

If all you want to do is remove the JS, then do it at transform time. It's far easier to write a template which matches <script /> elements and replaces them with nothing than to transform XML into HTML and then strip out the script tags. Otherwise you are parsing the document twice and exposing your app to more possible sources of error, both in the transform and in the regular expression matching. If you only get the document after it has been transformed and can't change that part of the application, remember that you can make the result of an XSLT transformation valid XHTML, in which case you could quite easily transform it again with a simple XSLT.
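
To make the second suggestion concrete, here is a minimal sketch of a stylesheet that copies everything except script elements, run through the standard javax.xml.transform API. The class name ScriptRemovingTransform and the stylesheet are assumptions for illustration, not code from this thread, and the input is assumed to be well-formed XHTML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper for illustration; assumes the input is well-formed XHTML.
class ScriptRemovingTransform {

    // Identity transform plus one empty template. The empty template has a
    // higher default priority than the node() rule, so script elements
    // (in any namespace) produce no output while everything else is copied.
    private static final String XSL =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"@*|node()\">"
        + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match=\"*[local-name()='script']\"/>"
        + "</xsl:stylesheet>";

    // Returns the document serialized again with all script elements dropped.
    static String removeScripts(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(xhtml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }
}
```

If you control the original stylesheet, the same empty template can simply be added there, which avoids the second pass altogether.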
Richard Hands
Greenhorn

Joined: Feb 25, 2004
Posts: 12
A very interesting point, and one I will go and try out.

Thanks for the help.
 