JavaRanch » Java Forums » Java » Java in General
I need help to remove HTML tags

Agustin Perez
Greenhorn

Joined: Jan 26, 2012
Posts: 7
Hello, I need help with a piece of Java code for a JMeter plugin,

The code has to remove the first HTML tag and the last.

For example:

If the code is originally
<div class='ads'> <img src='test.jpg' /> <img src='test2.jpg'/> </div>

the result should be
<img src='test.jpg' /> <img src='test2.jpg'/>

Anyone know how to do this? ^^

Please ask for any detail you need.
Randall Fairman
Greenhorn

Joined: Apr 18, 2011
Posts: 29

There are a zillion ways to do this. What have you tried?

Maybe your question is really an issue with JMeter. Sorry, but I can't help you there.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39773
. . . and welcome to the Ranch
Agustin Perez
Greenhorn

Joined: Jan 26, 2012
Posts: 7
Randall Fairman wrote:
There are a zillion ways to do this. What have you tried?

Maybe your question is really an issue with JMeter. Sorry, but I can't help you there.


It's not a JMeter question because I know the problem is with me.. I haven't learned very well how regex works and now I'm paying the price.

Here is some of the code I tried to use to remove the first and last tags.

// Create the pattern
Pattern pattern = Pattern.compile("</?\w+(\s*([a-zA-Z]+=".+")*)*\s*/?>");
// Create a Matcher from the pattern
Matcher fit = pattern.matcher(cont);

String result = fit.replaceFirst("");
result = replaceLast(result, '</?\w+(\s*([a-zA-Z]+=".+")*)*\s*/?>', "");
System.out.println("result!!!" + result);
this.content = cont;

I found the regex while browsing the web, but it doesn't work for me... or I couldn't make it work.

I only need to remove the first and last tag, like the example in the previous post, just that.

Please shed some light on this.

Campbell Ritchie wrote:
. . . and welcome to the Ranch


Forgive my manners,

Let me introduce myself: my name is Agustín Perez and I am an aspiring programmer ^^U
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8176

Agustin Perez wrote:It's not a JMeter question because I know the problem is with me.. I haven't learned very well how regex works and now I'm paying the price.

To be honest, unless your requirement is very simple (eg, to remove only the first and last tag in a line or a file, regardless of the situation), I doubt whether regex is what you want. HTML is hierarchical, and regexes are lousy for hierarchical structures.

My suggestion would be to look at SAX or DOM, but you may have to convert the HTML to XHTML first. I haven't used it much, but JTidy is supposed to be quite good, and it comes with its own parser.
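
A minimal sketch of the DOM route, assuming the input is already a well-formed XHTML fragment with a single root element. It uses only the JDK's built-in XML parser and serializer rather than JTidy, and the class and method names (InnerMarkupDom, innerMarkup) are made up for illustration. The idea is to parse the fragment, then write out the children of the root element instead of the root itself.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: returns the markup inside the root element.
final class InnerMarkupDom {

    static String innerMarkup(String xhtmlFragment) {
        try {
            // Parse the fragment; this fails if the markup is not well-formed.
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Element root = builder
                    .parse(new InputSource(new StringReader(xhtmlFragment)))
                    .getDocumentElement();

            // Identity transformer used purely as a serializer.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            // Serialize each child of the root, skipping the root tag itself.
            StringWriter out = new StringWriter();
            for (Node child = root.getFirstChild(); child != null; child = child.getNextSibling()) {
                serializer.transform(new DOMSource(child), new StreamResult(out));
            }
            return out.toString().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process fragment: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        String cont = "<div class='ads'> <img src='test.jpg' /> <img src='test2.jpg'/> </div>";
        System.out.println(innerMarkup(cont));
    }
}
```

Note that the serializer rewrites the markup in its own style (for example, attribute values come out in double quotes), so the result is equivalent to the original content but not character-for-character identical.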

Winston


Isn't it funny how there's always time and money enough to do it WRONG?
Agustin Perez
Greenhorn

Joined: Jan 26, 2012
Posts: 7
Winston Gutkowski wrote:
Agustin Perez wrote:It's not a JMeter question because I know the problem is with me.. I haven't learned very well how regex works and now I'm paying the price.

To be honest, unless your requirement is very simple (eg, to remove only the first and last tag in a line or a file, regardless of the situation), I doubt whether regex is what you want. HTML is hierarchical, and regexes are lousy for hierarchical structures.


That's my requirement: I only need to remove the opening tag (<div class="whatever">) and the closing tag (</div>).
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8176

Agustin Perez wrote:That's my requirement: I only need to remove the opening tag (<div class="whatever">) and the closing tag (</div>).

Right, but what if the tag isn't a <div>? Do you still want to remove it? And what if the first and last tags don't match? I think you need to get ALL the rules sorted out before you try anything.
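
One way to pin down those rules in code, as a rough sketch: accept any tag name, but only strip the outer tags when the first tag is an opening tag and the last tag is a closing tag with the same name, and otherwise return the input untouched. The class and method names (MatchingOuterTags, unwrap) are invented for this example, and the "leave it unchanged" policy is an assumption, not something stated in the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: strips the outer tag pair only if the names match.
final class MatchingOuterTags {

    // Group 1: name of the opening tag. Group 2: everything in between.
    // The backreference \1 forces the final closing tag to use the same name.
    // Assumption: no attribute value contains a literal '>'.
    private static final Pattern WRAPPED = Pattern.compile(
            "^\\s*<(\\w+)\\b[^>]*>(.*)</\\s*\\1\\s*>\\s*$", Pattern.DOTALL);

    static String unwrap(String html) {
        Matcher m = WRAPPED.matcher(html);
        if (!m.matches()) {
            return html; // first and last tags do not match: leave input alone
        }
        return m.group(2).trim();
    }

    public static void main(String[] args) {
        System.out.println(unwrap("<div class='ads'> <img src='test.jpg' /> <img src='test2.jpg'/> </div>"));
        System.out.println(unwrap("<div>text</span>")); // unchanged
    }
}
```

This only compares the names of the first and last tags. It does not check that they actually belong to each other: "<div>a</div><div>b</div>" would still be unwrapped, which is exactly the kind of case a regex cannot see and a parser can.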

Winston
Agustin Perez
Greenhorn

Joined: Jan 26, 2012
Posts: 7
Winston Gutkowski wrote:
Agustin Perez wrote:That's my requirement: I only need to remove the opening tag (<div class="whatever">) and the closing tag (</div>).

Right, but what if the tag isn't a <div>? Do you still want to remove it? And what if the first and last tags don't match? I think you need to get ALL the rules sorted out before you try anything.

Winston


The tag can be something like <div>, <span>, <p>, <a>, or any other element that can contain child elements.

It will be <div> almost every time, because it's the most common way to divide content on a modern website.


Thanks 4 the help ;)

Agustín

Edit: I forgot to say that I am finding the target tag (example: <div class='ads'>) using HtmlUnit.

Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18965

Agustin Perez wrote:
Here is some of the code I tried to use to remove the first and last tags.

// Create the pattern
Pattern pattern = Pattern.compile("</?\w+(\s*([a-zA-Z]+=".+")*)*\s*/?>");
// Create a Matcher from the pattern
Matcher fit = pattern.matcher(cont);

String result = fit.replaceFirst("");
result = replaceLast(result, '</?\w+(\s*([a-zA-Z]+=".+")*)*\s*/?>', "");
System.out.println("result!!!" + result);
this.content = cont;

I found the regex while browsing the web, but it doesn't work for me... or I couldn't make it work.

I only need to remove the first and last tag, like the example in the previous post, just that.

Please shed some light on this.


How do you know if the regex works or not? The code snippet that you provided doesn't compile -- regardless of whether the regex is correct or not.


Regardless, if you want to use regexes, let's back up a bit... Step one, how do you match a single tag? Let's start there, and we can later get you to match (and extract) the component that you want.
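
As a starting point, here is a sketch of a version that compiles. In a Java string literal every regex backslash has to be doubled (\\w, \\s) and inner double quotes escaped, single quotes delimit a char rather than a String, and replaceLast is not a method of java.lang.String, so the last tag is removed here by anchoring a second pattern to the end of the input. The tag pattern is deliberately simpler than the one quoted above and assumes no attribute value contains a '>'. The class and method names (OuterTagRegex, stripOuterTags) are made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes the first and the last tag of a string.
final class OuterTagRegex {

    // Step one, a single tag: '<', optional '/', a name, then anything up to '>'.
    // Assumption: attribute values never contain a literal '>'.
    private static final String TAG = "</?\\s*\\w+[^>]*>";

    // The first tag must sit at the start, the last tag at the end
    // (surrounding whitespace is tolerated).
    private static final Pattern FIRST = Pattern.compile("^\\s*" + TAG);
    private static final Pattern LAST = Pattern.compile(TAG + "\\s*$");

    static String stripOuterTags(String html) {
        String withoutFirst = FIRST.matcher(html).replaceFirst("");
        String withoutLast = LAST.matcher(withoutFirst).replaceFirst("");
        return withoutLast.trim();
    }

    public static void main(String[] args) {
        String cont = "<div class='ads'> <img src='test.jpg' /> <img src='test2.jpg'/> </div>";
        // prints: <img src='test.jpg' /> <img src='test2.jpg'/>
        System.out.println(stripOuterTags(cont));
    }
}
```

This removes whatever tag comes first and whatever tag comes last without checking that they belong together, so it only fits the simplest form of the requirement.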

Henry


Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
Agustin Perez
Greenhorn

Joined: Jan 26, 2012
Posts: 7
I found the target tags using the HtmlUnit library with XPath.

For example, to find <div class='ads'> I write the XPath //div[@class='ads'] and use this:



It prints all the div tags with class='ads' (verified).


Agustin Perez
Greenhorn

Joined: Jan 26, 2012
Posts: 7
Agustin Perez wrote:I found the target tags using the HtmlUnit library with XPath.

For example, to find <div class='ads'> I write the XPath //div[@class='ads'] and use this:



It prints all the div tags with class='ads' (verified).



For example, with something like getHhtmlElement I could extract only the content, without the enclosing tags. Is there anything like that in HtmlUnit?
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8176

Agustin Perez wrote:I found the target tags using the library htmlunit with XPaths.

I'm not familiar with that lib, but I suspect you will still want to convert to XHTML before you run your check (I'm pretty sure it's a requirement for XPath, so perhaps htmlunit already does). The problem with regular HTML is that
(a) It doesn't require all tags to be closed.
(b) It allows overlapping tags.
Conversion to XHTML will solve both of those issues.
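
To illustrate points (a) and (b), here is a small sketch that checks whether a fragment is well-formed by handing it to the JDK's XML parser, which rejects both unclosed and overlapping tags. The class and method names (WellFormedCheck, isWellFormed) are invented for the example, and this checks XML well-formedness only, not validity against the XHTML DTD.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: true if the markup parses as well-formed XML.
final class WellFormedCheck {

    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // unclosed, overlapping or otherwise malformed tags
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p>one<br>two</p>"));              // (a) unclosed tag
        System.out.println(isWellFormed("<b><i>text</b></i>"));             // (b) overlapping tags
        System.out.println(isWellFormed("<div><img src='a.jpg'/></div>"));  // well-formed
    }
}
```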

Winston
Agustin Perez
Greenhorn

Joined: Jan 26, 2012
Posts: 7
Winston Gutkowski wrote:
Agustin Perez wrote:I found the target tags using the library htmlunit with XPaths.

I'm not familiar with that lib, but I suspect you will still want to convert to XHTML before you run your check (I'm pretty sure it's a requirement for XPath, so perhaps htmlunit already does). The problem with regular HTML is that
(a) It doesn't require all tags to be closed.
(b) It allows overlapping tags.
Conversion to XHTML will solve both of those issues.

Winston



XHTML is not a problem, because the plugin requires the web page to be XHTML.
 