
I need help to remove HTML tags

 
Agustin Perez
Greenhorn
Posts: 7
Hello, I need help with a piece of Java code for a JMeter plugin.

The code has to remove the first HTML tag and the last.

For example:

If the original markup is
<div class='ads'> <img src='test.jpg' /> <img src='test2.jpg'/> </div>

the result should be
<img src='test.jpg' /> <img src='test2.jpg'/>

Does anyone know how to do this? ^^

Please ask for any detail you need.
 
Randall Fairman
Greenhorn
Posts: 29

There are a zillion ways to do this. What have you tried?

Maybe your question is really an issue with JMeter. Sorry, but I can't help you there.
 
Campbell Ritchie
Sheriff
Posts: 48652
56
. . . and welcome to the Ranch
 
Agustin Perez
Greenhorn
Posts: 7
Randall Fairman wrote:
There are a zillion ways to do this. What have you tried?

Maybe your question is really an issue with JMeter. Sorry, but I can't help you there.


It's not a JMeter question, because I know the problem is with me. I never properly learned how regex works, and now I'm paying the price.

Here is some of the code I tried to use to remove the first and last tags.

// Create the pattern
Pattern pattern = Pattern.compile("</?\w+(\s*([a-zA-Z]+=".+")*)*\s*/?>");
// Create the Matcher from the pattern
Matcher fit = pattern.matcher(cont);

String result = fit.replaceFirst("");
result = replaceLast(result, '</?\w+(\s*([a-zA-Z]+=".+")*)*\s*/?>', "");
System.out.println("result!!!" + result);
this.content = cont;

I found the regex while browsing the web, but it doesn't work for me... or I couldn't make it work.

I only need to remove the first and last tag, as in the example in my previous post, nothing more.

Please shed some light on this.

Campbell Ritchie wrote:
. . . and welcome to the Ranch


Forgive my manners,

Let me introduce myself: my name is Agustín Perez and I am an aspiring programmer ^^U
 
Winston Gutkowski
Bartender
Pie
Posts: 10257
59
Eclipse IDE Hibernate Ubuntu
Agustin Perez wrote: It's not a JMeter question, because I know the problem is with me. I never properly learned how regex works, and now I'm paying the price.

To be honest, unless your requirement is very simple (e.g., to remove only the first and last tag in a line or a file, regardless of the situation), I doubt whether regex is what you want. HTML is hierarchical, and regexes are lousy for hierarchical structures.

My suggestion would be to look at SAX or DOM, but you may have to convert the HTML to XHTML first. I haven't used it much, but JTidy is supposed to be quite good, and it comes with its own parser.

Winston
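
A small sketch of why a regex struggles with nesting. The pattern below is a hypothetical "content of a div" regex, not one taken from this thread, and the class and method names are made up for illustration. It behaves on a flat element, but stops at the first closing tag as soon as divs are nested.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedTagPitfall {
    // Hypothetical naive pattern: lazily capture everything up to the first </div>.
    private static final Pattern NAIVE_DIV =
            Pattern.compile("<div[^>]*>(.*?)</div>", Pattern.DOTALL);

    // Returns what the naive pattern believes is the content of the first div,
    // or null if the pattern does not match at all.
    static String naiveInner(String html) {
        Matcher m = NAIVE_DIV.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(naiveInner("<div>flat</div>"));          // prints: flat
        System.out.println(naiveInner("<div><div>a</div>b</div>")); // prints: <div>a
    }
}
```

In the nested case the regex has no notion of depth, so the inner closing tag is mistaken for the outer one. A parser tracks that depth for you.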
 
Agustin Perez
Greenhorn
Posts: 7
Winston Gutkowski wrote:
Agustin Perez wrote: It's not a JMeter question, because I know the problem is with me. I never properly learned how regex works, and now I'm paying the price.

To be honest, unless your requirement is very simple (e.g., to remove only the first and last tag in a line or a file, regardless of the situation), I doubt whether regex is what you want. HTML is hierarchical, and regexes are lousy for hierarchical structures.


That is exactly my requirement: I only need to remove the opening tag (<div class="whatever">) and the closing tag (</div>).
 
Winston Gutkowski
Bartender
Pie
Posts: 10257
59
Eclipse IDE Hibernate Ubuntu
Agustin Perez wrote: That is exactly my requirement: I only need to remove the opening tag (<div class="whatever">) and the closing tag (</div>).

Right, but what if the tag isn't a <div>? Do you still want to remove it? And what if the first and last tags don't match? I think you need to get ALL the rules sorted out before you try anything.

Winston
 
Agustin Perez
Greenhorn
Posts: 7
Winston Gutkowski wrote:
Agustin Perez wrote: That is exactly my requirement: I only need to remove the opening tag (<div class="whatever">) and the closing tag (</div>).

Right, but what if the tag isn't a <div>? Do you still want to remove it? And what if the first and last tags don't match? I think you need to get ALL the rules sorted out before you try anything.

Winston


The tag can be <div>, <span>, <p>, <a>, or any other element that can contain child elements.

It will be a <div> almost every time, because that is the most common way to divide content on a modern website.


Thanks for the help ;)

Agustín

Edit: I forgot to say that I am finding the target tag (for example, <div class='ads'>) using HtmlUnit.
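
The rules discussed so far (the outer tag is one of a few container elements, and the first and last tags must match) can be written down as a check that runs before anything is stripped. This is only a sketch: the class name, the method name, and the two patterns are assumptions for illustration, and the list of containers is just the four tags named above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OuterTagRules {
    // Container tags named in the post above; extend as needed.
    private static final List<String> CONTAINERS = Arrays.asList("div", "span", "p", "a");

    // An opening tag at the very start and a closing tag at the very end
    // (surrounding whitespace is allowed).
    private static final Pattern OPENING = Pattern.compile("^\\s*<(\\w+)[^>]*>");
    private static final Pattern CLOSING = Pattern.compile("</\\s*(\\w+)\\s*>\\s*$");

    // True only if the markup starts with an allowed container tag and ends
    // with the closing tag of the same name.
    static boolean canStripOuterTags(String markup) {
        Matcher open = OPENING.matcher(markup);
        Matcher close = CLOSING.matcher(markup);
        if (!open.find() || !close.find()) {
            return false;
        }
        String name = open.group(1).toLowerCase(Locale.ROOT);
        return CONTAINERS.contains(name) && name.equalsIgnoreCase(close.group(1));
    }

    public static void main(String[] args) {
        System.out.println(canStripOuterTags("<div class='ads'> <img src='test.jpg' /> </div>")); // true
        System.out.println(canStripOuterTags("<div> text </span>"));   // false: tags do not match
        System.out.println(canStripOuterTags("<table><tr/></table>")); // false: not in the list
    }
}
```

The check only looks at the two ends of the string, so it says nothing about whether the markup in between is well formed.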

 
Henry Wong
author
Marshal
Pie
Posts: 21004
77
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
Agustin Perez wrote:
Here is some of the code I tried to use to remove the first and last tags.

// Create the pattern
Pattern pattern = Pattern.compile("</?\w+(\s*([a-zA-Z]+=".+")*)*\s*/?>");
// Create the Matcher from the pattern
Matcher fit = pattern.matcher(cont);

String result = fit.replaceFirst("");
result = replaceLast(result, '</?\w+(\s*([a-zA-Z]+=".+")*)*\s*/?>', "");
System.out.println("result!!!" + result);
this.content = cont;

I found the regex while browsing the web, but it doesn't work for me... or I couldn't make it work.

I only need to remove the first and last tag, as in the example in my previous post, nothing more.

Please shed some light on this.


How do you know if the regex works or not? The code snippet that you provided doesn't compile -- regardless of whether the regex is correct or not.


Regardless, if you want to use regexes, let's back up a bit... Step one, how do you match a single tag? Let's start there, and we can later get you to match (and extract) the component that you want.

Henry
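
For reference, the posted snippet fails to compile for reasons that have nothing to do with the regex itself: inside a Java string literal the backslashes must be doubled (\\w, \\s) and the inner double quotes escaped, single quotes delimit a char rather than a String, and replaceLast is not a JDK method, so it would have to be a helper defined elsewhere. Below is a minimal sketch of one way to match a single tag and then drop only the first and the last match. The simplified pattern and the names OuterTagStripper and stripFirstAndLastTag are assumptions for illustration, and the pattern does not cope with a '>' inside a quoted attribute value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OuterTagStripper {
    // One opening, closing, or self-closing tag: "<", an optional "/",
    // a tag name, then everything up to the next ">".
    private static final Pattern TAG = Pattern.compile("</?\\s*\\w+[^>]*>");

    // Removes the first and the last tag found in the input and trims the result.
    // Input without any tag is returned unchanged.
    static String stripFirstAndLastTag(String html) {
        Matcher m = TAG.matcher(html);
        if (!m.find()) {
            return html;
        }
        int firstStart = m.start();
        int firstEnd = m.end();
        int lastStart = firstStart;
        int lastEnd = firstEnd;
        while (m.find()) { // keep walking; the final match is the last tag
            lastStart = m.start();
            lastEnd = m.end();
        }
        if (lastStart == firstStart) {
            // Only one tag in the input: remove just that one.
            return (html.substring(0, firstStart) + html.substring(firstEnd)).trim();
        }
        return (html.substring(0, firstStart)
                + html.substring(firstEnd, lastStart)
                + html.substring(lastEnd)).trim();
    }

    public static void main(String[] args) {
        String cont = "<div class='ads'> <img src='test.jpg' /> <img src='test2.jpg'/> </div>";
        // prints: <img src='test.jpg' /> <img src='test2.jpg'/>
        System.out.println(stripFirstAndLastTag(cont));
    }
}
```

The sketch removes whatever the first and last tags happen to be; it does not check that they belong together.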
 
Agustin Perez
Greenhorn
Posts: 7
I find the target tags using the HtmlUnit library with XPath expressions.

For example, to find the <div class='ads'> I write the XPath ( //div[@class='ads'] ) and use this:



It prints all the div tags with class='ads' (verified).


 
Agustin Perez
Greenhorn
Posts: 7
Agustin Perez wrote: I find the target tags using the HtmlUnit library with XPath expressions.

For example, to find the <div class='ads'> I write the XPath ( //div[@class='ads'] ) and use this:



It prints all the div tags with class='ads' (verified).



What I am looking for is something like a getHhtmlElement method that extracts only the content, without the enclosing tags. Is there anything like that in HtmlUnit?
 
Winston Gutkowski
Bartender
Pie
Posts: 10257
59
Eclipse IDE Hibernate Ubuntu
Agustin Perez wrote: I find the target tags using the HtmlUnit library with XPath expressions.

I'm not familiar with that lib, but I suspect you will still want to convert to XHTML before you run your check (I'm pretty sure it's a requirement for XPath, so perhaps htmlunit already does). The problem with regular HTML is that
(a) It doesn't require all tags to be closed.
(b) It allows overlapping tags.
Conversion to XHTML will solve both of those issues.

Winston
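
The two points above can be checked mechanically: an XML parser rejects markup with unclosed or overlapping tags, even though browsers usually tolerate both. The sketch below uses only the JDK's built-in parser; the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    // Returns true if the markup parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // not well formed
        } catch (Exception e) {
            throw new IllegalStateException("XML parser could not be set up or read", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p>one<br>two</p>"));          // false: <br> is never closed
        System.out.println(isWellFormed("<p><b><i>text</b></i></p>"));  // false: tags overlap
        System.out.println(isWellFormed("<p>one<br/>two</p>"));         // true
    }
}
```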
 
Agustin Perez
Greenhorn
Posts: 7
Winston Gutkowski wrote:
Agustin Perez wrote: I find the target tags using the HtmlUnit library with XPath expressions.

I'm not familiar with that lib, but I suspect you will still want to convert to XHTML before you run your check (I'm pretty sure it's a requirement for XPath, so perhaps htmlunit already does). The problem with regular HTML is that
(a) It doesn't require all tags to be closed.
(b) It allows overlapping tags.
Conversion to XHTML will solve both of those issues.

Winston



XHTML is not a problem: the plugin's requirements state that the web page must be XHTML.
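
Since the page is guaranteed to be XHTML, one option that needs no regex at all is to let a parser find the element and then serialize only its children. The sketch below does this with the JDK's own DOM, XPath, and Transformer classes rather than HtmlUnit, so it makes no claim about HtmlUnit's API. The class and method names are hypothetical, and the sample assumes a page without a DOCTYPE or namespace declaration; real XHTML pages usually have both and would need extra handling.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathInnerContent {
    // For every element matched by the XPath, returns the serialized children
    // of that element, i.e. its content without the enclosing tags.
    static List<String> innerContent(String xhtml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList matches = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            List<String> results = new ArrayList<>();
            for (int i = 0; i < matches.getLength(); i++) {
                StringWriter out = new StringWriter();
                // Serialize each child node in turn, skipping the matched element itself.
                for (Node child = matches.item(i).getFirstChild();
                        child != null; child = child.getNextSibling()) {
                    serializer.transform(new DOMSource(child), new StreamResult(out));
                }
                results.add(out.toString().trim());
            }
            return results;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process the page", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div class='ads'> <img src='test.jpg' /> <img src='test2.jpg'/> </div>"
                + "<div class='other'><p>keep</p></div>"
                + "</body></html>";
        for (String inner : innerContent(page, "//div[@class='ads']")) {
            System.out.println(inner);
        }
    }
}
```

The serializer normalizes the markup, so quoting and spacing inside the tags may differ from the original text even though the elements are the same.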
 