
Help needed for a regex expression

Keshav Khedkar
Greenhorn

Joined: Jun 23, 2007
Posts: 4
Hi All,
This is the regex I am using: (<div)\s+id="article_body">(.*?((\1((.*?(\5|\8).*?|.*?)|.*?)\9)|(\1[^>]*?/>))){1,}(</div>)
to extract the complete <div id='article_body'> element. Note that this element can contain other <div> tags as well as other tags, and there can be other <div> tags before or after it. My expression does not match it accurately.
The contents are as follows:

// Snip

Please help me get the right regex.
Thanks in advance.
Regards,
kk.
Wouter Oet
Saloon Keeper

Joined: Oct 25, 2008
Posts: 2700

Hi Keshav,

Please don't post huge amounts of code. Try to give a small example that explains your problem. Also tell us what you expected to happen and what actually happened.


"Any fool can write code that a computer can understand. Good programmers write code that humans can understand." --- Martin Fowler
Please correct my English.
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18108
    


Also, the regex provided doesn't seem to make much sense. I can't figure out the purpose of all the backreferences.

Henry


Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
Keshav Khedkar
Greenhorn

Joined: Jun 23, 2007
Posts: 4
Hi all,
What I want is the whole <div id='article_body'> element from the contents of the attached file. The regex I provided tries to account for nesting: this element can be nested within other <div> tags, and other <div> tags can be nested inside it.
My expression gives wrong results: it extracts the contents from article_body up to either the first </div> tag or the last </div> tag. Both results are wrong. The extracted contents should end at the </div> tag that closes <div id='article_body'>.
I have numbered the groups in the regex from left to right (I don't know whether that is the right order).
The cases may be:
1) There are no tags inside the article_body element.
2) Nested tags, like <div id='parent'><div id='article-body'><div>sss</div>ssdd<div><div />sfdfd</div></div></div>

To handle the nesting I have used backreferences to groups.
Alternative solutions, such as a good open source HTML parser, are also welcome. Please suggest an HTML parser.
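
For reference, java.util.regex has no recursive patterns, so a single expression cannot pair an opening <div> with its own closing </div> at arbitrary nesting depth. One possible workaround is to use two simple patterns and count depth by hand. The following is only a sketch under assumptions: the class name DivExtractor and the method extract are made up for illustration, the target id is assumed to be article_body, and the input is assumed to contain no "<div" text inside comments, scripts, or attribute values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the thread: finds <div id="article_body">
// and returns it up to its own matching </div> by counting nesting depth.
public class DivExtractor {
    // Opening tag of the target div; accepts single or double quotes around the id.
    private static final Pattern START = Pattern.compile(
        "<div\\s+id=[\"']article_body[\"'][^>]*>", Pattern.CASE_INSENSITIVE);

    // Any opening <div ...> (group 1 is "/" when self-closing) or any closing </div>.
    private static final Pattern DIV_TAG = Pattern.compile(
        "<div\\b[^>]*?(/?)>|</div\\s*>", Pattern.CASE_INSENSITIVE);

    /** Returns the full target element, or null if it is absent or never closed. */
    public static String extract(String html) {
        Matcher start = START.matcher(html);
        if (!start.find()) {
            return null;
        }
        int depth = 1;            // we are inside the target div
        int pos = start.end();    // continue scanning right after its opening tag
        Matcher tag = DIV_TAG.matcher(html);
        while (tag.find(pos)) {
            pos = tag.end();
            if (tag.group().startsWith("</")) {
                depth--;          // a closing </div>
                if (depth == 0) {
                    return html.substring(start.start(), tag.end());
                }
            } else if (tag.group(1).isEmpty()) {
                depth++;          // a normal opening <div ...>
            }
            // a self-closing <div /> leaves the depth unchanged
        }
        return null;              // unbalanced input
    }
}
```

Calling DivExtractor.extract(html) on the nested case above (with the id spelled article_body) should stop at the </div> that closes the target element rather than at the first or last </div>. For real-world HTML, a proper HTML parser is likely more robust than this kind of tag counting.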
 