Stripping out HTML from String

Maksim Ustinov
Greenhorn

Joined: Sep 15, 2008
Posts: 26
Hello,

I'm writing a web app in Java. I have code that generates HTML from a template and a database. The function returns this HTML as a String, and I need to remove the <head> tag from it.

My code looks like this:



As you can see, I need to remove <head>.....</head> and the <body ...> tag from my HTML, and keep everything that is inside the body.
Eric Daly
Ranch Hand

Joined: Jul 11, 2006
Posts: 143
Do you know how to search a file? Basically search through the file, looking for the stuff you want to remove (or the first line you want to keep). You'll need to create a temporary file to copy the contents you want to keep from the original, and then when you're done, write the new stuff to the original file (or write to a new file).
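
A minimal sketch of the file-based approach described above, assuming the <head> and </head> tags each sit alone on a line and carry no attributes. The names HeadLineFilter, dropHeadLines and rewrite are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

final class HeadLineFilter {

    // Keeps every line that lies outside the <head> ... </head> block.
    // Assumption: each tag is on its own line and has no attributes.
    static List<String> dropHeadLines(List<String> lines) {
        List<String> kept = new ArrayList<>();
        boolean insideHead = false;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.equalsIgnoreCase("<head>")) {
                insideHead = true;
            }
            if (!insideHead) {
                kept.add(line);
            }
            if (trimmed.equalsIgnoreCase("</head>")) {
                insideHead = false;
            }
        }
        return kept;
    }

    // Writes the kept lines to a temporary file next to the original,
    // then replaces the original with it.
    static void rewrite(Path original) throws IOException {
        List<String> kept = dropHeadLines(Files.readAllLines(original));
        Path dir = original.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "stripped", ".html");
        Files.write(temp, kept);
        Files.move(temp, original, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) {
        List<String> page = List.of("<html>", "<head>", "<title>t</title>",
                "</head>", "<body>", "hi", "</body>", "</html>");
        dropHeadLines(page).forEach(System.out::println);
    }
}
```

This only works line by line, so it would miss a <head> tag that shares a line with other markup. A regular expression over the whole string avoids that limitation.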


Maksim Ustinov
Greenhorn

Joined: Sep 15, 2008
Posts: 26
That's not a problem. I already have the file, and its content is in the string. Now I just need to create a regular expression to remove it using the .replaceAll() function, but I don't know how to write that regex.
Jeanne Boyarsky
internet detective
Marshal

Joined: May 26, 2003
Posts: 29274
    

Maksim,
You are correct that using a regular expression is the best way to approach this. Whenever I use regular expressions, I start out small and make sure my regular expression does what I expect at each step.

For example, can you write a regular expression to:
1) Remove <head>?
2) Remove <head>...</head>?
3) Remove <body withABunchOfAttributes>?
4) Remove </body>?
5) Combine steps 2-4? (hint - you need to use grouping parens for this one if you want to do it in one regular expression)

This sounds like a strange requirement. Do you really want to remove all the HTML rather than just the head and body tags? In particular do you want the <html> and <table> tags present?

Also, take a look at the Pattern.DOTALL flag, since you are matching across multiple lines. I know about this flag, use it frequently, and still manage to forget it on my first shot most of the time.
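
One way the steps above might be built up and then combined, as a sketch rather than a definitive answer. It assumes attribute values never contain a literal '>' and that the tags are not nested inside comments or scripts. The class name StepwiseStrip and its constants are hypothetical. It uses plain alternation instead of grouping parentheses, which is enough here because each alternative is a complete pattern.

```java
import java.util.regex.Pattern;

final class StepwiseStrip {

    // Step 2: the opening <head> tag, everything inside it, and </head>.
    // The reluctant .*? stops at the first </head>.
    static final String HEAD_BLOCK = "<head\\b[^>]*>.*?</head\\s*>";

    // Step 3: an opening <body> tag with any number of attributes.
    static final String BODY_OPEN = "<body\\b[^>]*>";

    // Step 4: the closing </body> tag.
    static final String BODY_CLOSE = "</body\\s*>";

    // Combined: match any one of the three pieces.
    // DOTALL lets '.' match line terminators; CASE_INSENSITIVE covers <HEAD>.
    static final Pattern COMBINED = Pattern.compile(
            HEAD_BLOCK + "|" + BODY_OPEN + "|" + BODY_CLOSE,
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static String strip(String html) {
        return COMBINED.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<html>\n<HEAD>\n<title>t</title>\n</HEAD>\n"
                + "<body bgcolor=\"white\">\n<table></table>\n</body>\n</html>";
        System.out.println(strip(html));

        // Without DOTALL the head block is not found when it spans lines.
        boolean found = Pattern.compile(HEAD_BLOCK).matcher("<head>\n</head>").find();
        System.out.println("found without DOTALL: " + found);
    }
}
```

Testing each constant on its own before combining them follows the "start small" advice above. For HTML that is not under your control, a real HTML parser is usually more robust than a regular expression.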


Maksim Ustinov
Greenhorn

Joined: Sep 15, 2008
Posts: 26
Thanks Jeanne for your response.
Yes, I do need to delete the <html> and </html> tags, but that's not a problem. The problem is with the <HEAD> tags.

Here is what I came up with to take out those tags but I'm not sure if this is correct.



Please let me know how it can be optimized so that it handles any number of spaces and newlines and ignores everything in between.
Maksim Ustinov
Greenhorn

Joined: Sep 15, 2008
Posts: 26
I just made a few modifications to my regex, and here is what I've got:



One small question: how do I modify the <head> part?
Jeanne Boyarsky
internet detective
Marshal

Joined: May 26, 2003
Posts: 29274
    

Maksim,
Are you trying to delete everything between the head tags? (I think that's what you are trying to accomplish, but the reg exp is way too complicated for that. So then I second guessed my understanding.)

This matches everything between the head tags regardless of what is in between:
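
The pattern that followed this sentence in the original post was not preserved. A minimal sketch of one that fits the description, assuming a <head> tag without attributes, could look like the following. The class name HeadContent is hypothetical. The embedded flags (?is) turn on case-insensitive and DOTALL matching, so the pattern can be passed directly to String.replaceAll.

```java
final class HeadContent {

    // (?i) ignores case, (?s) is the embedded form of Pattern.DOTALL.
    // .*? is reluctant, so the match ends at the first </head>.
    static final String HEAD_REGEX = "(?is)<head>.*?</head>";

    static String removeHead(String html) {
        return html.replaceAll(HEAD_REGEX, "");
    }

    public static void main(String[] args) {
        String html = "<html><HEAD>\n<title>a</title>\n</HEAD><body>x</body></html>";
        System.out.println(removeHead(html));
    }
}
```

A greedy .* would also work when the document has a single head section, but the reluctant form is the safer default if the closing tag could appear more than once.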
 