
I need help to remove HTML tags

 
Agustin Perez
Greenhorn
Hello, I need help with a piece of Java code for a JMeter plugin.

The code has to remove the first and the last HTML tag.

For example:

If the original code is
<div class='ads'> <img src='test.jpg' /> <img src='test2.jpg'/> </div>

the result should be
<img src='test.jpg' /> <img src='test2.jpg'/>

Does anyone know how to do this? ^^

Please ask for any detail you need.
 
Randall Fairman
Greenhorn

There are a zillion ways to do this. What have you tried?

Maybe your question is really an issue with JMeter. Sorry, but I can't help you there.
 
Campbell Ritchie
Marshal
. . . and welcome to the Ranch
 
Agustin Perez
Greenhorn

Randall Fairman wrote:
There are a zillion ways to do this. What have you tried?

Maybe your question is really an issue with JMeter. Sorry, but I can't help you there.



It's not a JMeter question, because I know the problem is with me. I haven't learned very well how regex works, and now I'm paying the price.

Here is some of the code I tried to use to remove the first and last tags.

// Create the pattern
Pattern pattern = Pattern.compile("</?\w+(\s*([a-zA-Z]+=".+")*)*\s*/?>");
// Create a Matcher from the pattern
Matcher fit = pattern.matcher(cont);

String result = fit.replaceFirst("");
result = replaceLast(result, '</?\w+(\s*([a-zA-Z]+=".+")*)*\s*/?>', "");
System.out.println("result!!!" + result);
this.content = cont;

I found the regex while browsing the web, but it doesn't work for me... or I couldn't make it work.

I only need to remove the first and last tag, as in the example in my previous post, just that.

Please shed some light on this.

Campbell Ritchie wrote:
. . . and welcome to the Ranch



Forgive my manners.

Let me introduce myself: my name is Agustín Perez and I am an aspiring programmer ^^U
 
Winston Gutkowski
Bartender

Agustin Perez wrote: It's not a JMeter question, because I know the problem is with me. I haven't learned very well how regex works, and now I'm paying the price.


To be honest, unless your requirement is very simple (e.g., to remove only the first and last tag in a line or a file, regardless of the situation), I doubt whether regex is what you want. HTML is hierarchical, and regexes are lousy for hierarchical structures.

My suggestion would be to look at SAX or DOM, but you may have to convert the HTML to XHTML first. I haven't used it much, but JTidy is supposed to be quite good, and it comes with its own parser.

Winston
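
A rough sketch of the DOM route suggested above, using only the JDK's own XML APIs instead of JTidy (a swap made here purely to keep the example self-contained). It assumes the input is already well-formed XHTML, and the names InnerXml and innerXml are made up for illustration. The idea is to parse the fragment, take its root element, and serialize only the children, which drops the enclosing tag pair.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class InnerXml {
    /** Parses a well-formed fragment and returns the markup inside its root element. */
    static String innerXml(String xhtml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getDocumentElement();

            // Identity transformer used only as a serializer; no XML declaration wanted.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            // Write each child of the root in document order, skipping the root itself.
            StringWriter out = new StringWriter();
            for (Node child = root.getFirstChild(); child != null; child = child.getNextSibling()) {
                serializer.transform(new DOMSource(child), new StreamResult(out));
            }
            return out.toString().trim();
        } catch (ParserConfigurationException | SAXException | IOException | TransformerException e) {
            throw new IllegalArgumentException("Could not process input as XML: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        String cont = "<div class='ads'> <img src='test.jpg' /> <img src='test2.jpg'/> </div>";
        System.out.println(innerXml(cont));
    }
}
```

Note that the serializer normalizes the markup (quote style and spacing may differ from the input), so this suits cases where equivalent XML is acceptable rather than a byte-for-byte copy.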
 
Agustin Perez
Greenhorn

Winston Gutkowski wrote:

Agustin Perez wrote: It's not a JMeter question, because I know the problem is with me. I haven't learned very well how regex works, and now I'm paying the price.


To be honest, unless your requirement is very simple (e.g., to remove only the first and last tag in a line or a file, regardless of the situation), I doubt whether regex is what you want. HTML is hierarchical, and regexes are lousy for hierarchical structures.



That's my requirement: I only need to remove the opening tag (<div class="whatever">) and the closing tag (</div>).
 
Winston Gutkowski
Bartender

Agustin Perez wrote: That's my requirement: I only need to remove the opening tag (<div class="whatever">) and the closing tag (</div>).


Right, but what if the tag isn't a <div>? Do you still want to remove it? And what if the first and last tags don't match? I think you need to get ALL the rules sorted out before you try anything.

Winston
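
One way to pin down the "what if the first and last tags don't match?" rule is to make the pattern itself demand a matching pair. The sketch below is only an illustration (OuterTagCheck and unwrap are invented names): a backreference forces the closing tag to carry the same name as the opening one, and the method returns nothing when that does not hold. It assumes attribute values never contain '>' and it only compares tag names, so it cannot tell whether the two tags really belong to the same element (for example, two sibling <div> blocks would still pass).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OuterTagCheck {
    // Group 1: name of the opening tag. Group 2: everything between the outer pair.
    // \1 requires the final closing tag to use the same name as the opening tag;
    // \b stops the name group from matching only a prefix of the real tag name.
    private static final Pattern WRAPPED = Pattern.compile(
            "^\\s*<(\\w+)\\b[^>]*>(.*)</\\s*\\1\\s*>\\s*$", Pattern.DOTALL);

    /** Returns the inner content if the text is wrapped in one same-named tag pair. */
    static Optional<String> unwrap(String html) {
        Matcher m = WRAPPED.matcher(html);
        return m.matches() ? Optional.of(m.group(2).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String matching = "<div class='ads'> <img src='test.jpg' /> </div>";
        String mismatched = "<div class='ads'> <img src='test.jpg' /> </span>";
        System.out.println(unwrap(matching).orElse("(outer tags do not match)"));
        System.out.println(unwrap(mismatched).orElse("(outer tags do not match)"));
    }
}
```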
 
Agustin Perez
Greenhorn

Winston Gutkowski wrote:

Agustin Perez wrote: That's my requirement: I only need to remove the opening tag (<div class="whatever">) and the closing tag (</div>).


Right, but what if the tag isn't a <div>? Do you still want to remove it? And what if the first and last tags don't match? I think you need to get ALL the rules sorted out before you try anything.

Winston



The tag can be something like <div>, <span>, <p>, <a>, or anything else that can contain child elements.

It will be <div> almost every time, because it's the most common way to divide content on a modern website.


Thanks for the help ;)

Agustín

Edit: I forgot to say that I am finding the target tag (example: <div class='ads'>) using HtmlUnit.

 
author

Agustin Perez wrote:
Here is some of the code I tried to use to remove the first and last tags.

// Create the pattern
Pattern pattern = Pattern.compile("</?\w+(\s*([a-zA-Z]+=".+")*)*\s*/?>");
// Create a Matcher from the pattern
Matcher fit = pattern.matcher(cont);

String result = fit.replaceFirst("");
result = replaceLast(result, '</?\w+(\s*([a-zA-Z]+=".+")*)*\s*/?>', "");
System.out.println("result!!!" + result);
this.content = cont;

I found the regex while browsing the web, but it doesn't work for me... or I couldn't make it work.

I only need to remove the first and last tag, as in the example in my previous post, just that.

Please shed some light on this.



How do you know if the regex works or not? The code snippet that you provided doesn't compile -- regardless of whether the regex is correct or not.


Regardless, if you want to use regexes, let's back up a bit... Step one, how do you match a single tag? Let's start there, and we can later get you to match (and extract) the component that you want.

Henry
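
As a possible starting point for "step one", here is a minimal sketch of a pattern that matches a single tag, plus a helper that removes only the first and the last match. It is not the pattern from the post above; it is a deliberately simpler one that assumes attribute values never contain a literal '>' and that no stray '<' appears in text content. The names StripOuterTags and stripFirstAndLastTag are invented for the example, and it stands in for the replaceLast call in the quoted code, which String does not provide.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StripOuterTags {
    // One tag: '<', an optional '/', a name, then anything up to the next '>'.
    // In Java source every regex backslash must be doubled inside the string literal.
    private static final Pattern TAG = Pattern.compile("</?\\s*\\w+[^>]*>");

    /** Removes the first and the last tag found in the input, whatever they are. */
    static String stripFirstAndLastTag(String html) {
        Matcher m = TAG.matcher(html);
        int firstStart = -1, firstEnd = -1, lastStart = -1, lastEnd = -1;
        while (m.find()) {
            if (firstStart < 0) {
                firstStart = m.start();
                firstEnd = m.end();
            }
            lastStart = m.start();
            lastEnd = m.end();
        }
        // Fewer than two tags: leave the input untouched.
        if (firstStart < 0 || firstStart == lastStart) {
            return html;
        }
        return (html.substring(0, firstStart)
                + html.substring(firstEnd, lastStart)
                + html.substring(lastEnd)).trim();
    }

    public static void main(String[] args) {
        String cont = "<div class='ads'> <img src='test.jpg' /> <img src='test2.jpg'/> </div>";
        // prints: <img src='test.jpg' /> <img src='test2.jpg'/>
        System.out.println(stripFirstAndLastTag(cont));
    }
}
```

Scanning all matches once and remembering the first and last positions avoids having to search the string backwards for the final tag.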
 
Agustin Perez
Greenhorn
I find the target tags using the HtmlUnit library with XPath.

For example: to find the <div class='ads'> I write the XPath ( //div[@class='ads'] ) and use this:



It prints all the div tags with class='ads' (verified).
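
For illustration only (this is not the HtmlUnit snippet from the post), the same XPath expression can be evaluated with the JDK's javax.xml.xpath API, assuming the page is well-formed XHTML. FindAds and select are hypothetical names. The sample page below declares no default namespace; a real XHTML document that declares the XHTML namespace would most likely need a NamespaceContext and a prefixed expression, which this sketch does not cover.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FindAds {
    /** Returns every node matched by the XPath expression in the given XHTML text. */
    static NodeList select(String xhtml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xhtml)), XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div class='ads'><img src='test.jpg'/></div>"
                + "<div class='news'>text</div>"
                + "<div class='ads'><img src='test2.jpg'/></div>"
                + "</body></html>";
        NodeList ads = select(page, "//div[@class='ads']");
        System.out.println(ads.getLength() + " matching div(s)");
    }
}
```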


 
Agustin Perez
Greenhorn

Agustin Perez wrote: I find the target tags using the HtmlUnit library with XPath.

For example: to find the <div class='ads'> I write the XPath ( //div[@class='ads'] ) and use this:



It prints all the div tags with class='ads' (verified).



For example, something like getHhtmlElement that would let me extract only the content without the enclosing tags. Is there anything like that in HtmlUnit?
 
Winston Gutkowski
Bartender

Agustin Perez wrote: I find the target tags using the HtmlUnit library with XPath.


I'm not familiar with that lib, but I suspect you will still want to convert to XHTML before you run your check (I'm pretty sure it's a requirement for XPath, so perhaps HtmlUnit already does that). The problem with regular HTML is that:
(a) It doesn't require all tags to be closed.
(b) It allows overlapping tags.
Conversion to XHTML will solve both of those issues.

Winston
 
Agustin Perez
Greenhorn

Winston Gutkowski wrote:

Agustin Perez wrote: I find the target tags using the HtmlUnit library with XPath.


I'm not familiar with that lib, but I suspect you will still want to convert to XHTML before you run your check (I'm pretty sure it's a requirement for XPath, so perhaps HtmlUnit already does that). The problem with regular HTML is that:
(a) It doesn't require all tags to be closed.
(b) It allows overlapping tags.
Conversion to XHTML will solve both of those issues.

Winston




XHTML is not a problem: one of the plugin's requirements is that the web page must be XHTML.