
Regular expression to remove JavaScript from HTML

 
Richard Hands
Greenhorn
Posts: 12
Hi,

I'm writing a servlet to process XML and XSL together,
and then do some bits and pieces with the resulting HTML.

The first thing I need to do with the output is remove any and
all JavaScript content from the HTML, as it's to be used in
a different fashion from the original page.

I'm sure I'm not far off with the regex, but I can't quite get
it to work. (Don't you just hate it when you get to that stage?)

Here's an example of the regex, and a code snippet that it
fails to match even though I think it should (note there's no
CR/LF between the end of the first script tag and the start of the second one):



[ May 26, 2004: Message edited by: Richard Hands ]

[ May 27, 2004: Message edited by: Max Habibi ]
 
Alan Moore
Ranch Hand
Posts: 262
Use the Pattern.DOTALL flag instead of Pattern.MULTILINE. The latter causes the start and end anchors (^ and $) to match at line boundaries as well as at the beginning and end of the input. DOTALL allows the dot to match line terminator characters (\r, \n, etc.), which it doesn't normally do.
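The poster's original pattern did not survive in this thread, so the following is only a minimal sketch of the DOTALL advice, not a reconstruction of it. The class name `ScriptStripper`, the method `stripScripts`, and the pattern itself are assumptions made for illustration. It handles paired `<script>...</script>` elements only and is not a general HTML parser.

```java
import java.util.regex.Pattern;

public class ScriptStripper {
    // Hypothetical pattern, not the one from the original post.
    // DOTALL lets '.' match line terminators, so a script body may span lines.
    // The reluctant quantifier .*? stops at the first closing tag, so two
    // adjacent script elements are matched separately rather than as one.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    public static String stripScripts(String html) {
        return SCRIPT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>a</p><script type=\"text/javascript\">\nalert(1);\n"
                + "</script><script>var x;</script><p>b</p>";
        System.out.println(stripScripts(html)); // prints <p>a</p><p>b</p>
    }
}
```

Without DOTALL, the first script element in this input would not match, because its body contains `\n`. MULTILINE would not help, since it only changes the behaviour of `^` and `$`.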
 
Richard Hands
Greenhorn
Posts: 12
Thanks for the tip, but it still didn't work.
 
Paul Sturrock
Bartender
Posts: 10336
Eclipse IDE Hibernate Java
If your original source is XML, why use regular expressions at all? You have a well-formed document, so you can guarantee that everything inside <script></script> is JavaScript, so handle it when you parse your XML, or transform it with your XSLT. Regular expressions seem unnecessary here, and just an overcomplication. But then again, perhaps I'm not quite following what you are trying to do?
 
Richard Hands
Greenhorn
Posts: 12
Basically, I have an abstract class that instantiates an XML transformer, reads the basic data-centric XML from the database, and transforms it with an HTML-centric XSL stylesheet (which XSL it uses is determined by various bits of data in the database).

This class is being extended for a variety of things, and in the one I'm currently working on, I want to strip out all the JavaScript that may be in the output.

Does this make a bit more sense?
 
Paul Sturrock
Bartender
Posts: 10336
Eclipse IDE Hibernate Java
If all you want to do is remove the JS, then do it at transform time. It's far easier to write a template which matches <script /> elements and replaces them with nothing than to transform XML into HTML and then strip out the script tags. Otherwise you are parsing the document twice and exposing your app to more possible sources of error, both in the transform and in the regular expression matching. If you only get the document after it has been transformed and can't change that part of the application, remember that you can make the result of an XSLT transformation valid XHTML, in which case you could quite easily transform it again with a simple XSLT.
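As a rough sketch of the second-pass idea described above: an identity transform copies everything through, and an empty template swallows script elements. The class name `ScriptRemovingTransform`, the method `removeScripts`, and the inline stylesheet are assumptions for illustration, not the poster's actual stylesheet. It uses the standard `javax.xml.transform` API and assumes the input is well-formed XML or XHTML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ScriptRemovingTransform {
    // Hypothetical stylesheet: an identity template plus an empty template
    // for any element whose local name is "script" (with or without the
    // XHTML namespace). The more specific template takes priority.
    private static final String XSL =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match=\"*[local-name()='script']\"/>"
        + "</xsl:stylesheet>";

    public static String removeScripts(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Could not transform input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><script type=\"text/javascript\">var x = 1;</script>"
                + "</head><body><p>a</p></body></html>";
        System.out.println(removeScripts(xhtml));
    }
}
```

If the first stylesheet can be changed, the same empty template could simply be added there, which avoids the second pass entirely.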
 
Richard Hands
Greenhorn
Posts: 12
A very interesting point, and one I will go and try out.

Thanks for the help.
 