Help needed for an regex expression

Keshav Khedkar

Joined: Jun 23, 2007
Posts: 4
Hi all,
This is the regular expression I am using: (<div)\s+id="article_body">(.*?((\1((.*?(\5|\8).*?|.*?)|.*?)\9)|(\1[^>]*?/>))){1,}(</div>)
to extract the complete <div id='article_body'> element. Note that this element can contain other <div> tags as well as other tags, and there can be other <div> tags before or after it. My expression does not match accurately.
Following are the contents:

// Snip

Please help me find the right regular expression.
Thanks in advance.
Wouter Oet
Saloon Keeper

Joined: Oct 25, 2008
Posts: 2700

Hi Keshav,

Please don't post huge amounts of code. Try to give a small example that explains your problem. Also tell us what you expected to happen and what actually happened.

"Any fool can write code that a computer can understand. Good programmers write code that humans can understand." --- Martin Fowler
Please correct my English.
Henry Wong

Joined: Sep 28, 2004
Posts: 20030

Also, the regex provided doesn't seem to make much sense. I can't figure out the purpose of all the backreferences.
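
For readers wondering why the backreferences look out of place: in java.util.regex a backreference such as \1 matches the exact text its group captured. It does not re-apply the group's pattern, and it cannot count nesting. A minimal sketch of that behaviour follows; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class BackreferenceDemo {
    // Returns true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Group 1 captures the literal text "<div", so \1 only matches "<div" again.
        System.out.println(matches("(<div)x\\1", "<divx<div"));
        // \1 repeats the captured text, not the pattern (a|b): "ab" is rejected.
        System.out.println(matches("(a|b)\\1", "aa"));
        System.out.println(matches("(a|b)\\1", "ab"));
    }
}
```

So \1 in the posted expression is just another way of writing <div, which may be why the intent of the other backreferences is hard to follow.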


Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
Keshav Khedkar

Joined: Jun 23, 2007
Posts: 4
Hi all,
What I want is the whole <div id='article_body'> element from the contents of the attached file. The regular expression I provided tries to account for nesting: this element can be nested within other <div> tags, and other <div> tags can be nested inside it.
My expression gives wrong results: it extracts the contents from article_body up to either the first </div> tag or the last </div> tag. Both results are wrong. The extracted contents should end at the </div> tag that closes <div id='article_body'>.
I numbered the groups in the expression from left to right (I don't know whether that is the right order).
The cases may be:
1) There are no tags inside the article_body element.
2) Nested tags, like <div id='parent'><div id='article-body'><div>sss</div>ssdd<div><div />sfdfd</div></div></div>

To handle the nesting I used backreferences to groups.
Alternative solutions, such as a good open source HTML parser, are also welcome. Please suggest an HTML parser.
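
One possible alternative to a single large expression is to use a simple regex only to find <div> and </div> tags, and to track the nesting depth in ordinary Java code. The sketch below is not a full HTML parser. It assumes that <div ... /> is treated as self-closing (as in the example above) and that no <div> text appears inside comments, scripts, or attribute values. The names DivExtractor and extractDiv are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DivExtractor {
    // An opening <div ...> tag (group 1 = attribute text) or a closing </div>.
    private static final Pattern DIV_TAG =
            Pattern.compile("<div\\b([^>]*)>|</div\\s*>", Pattern.CASE_INSENSITIVE);

    // id="..." or id='...' inside the attribute text; group 2 = the id value.
    private static final Pattern ID_ATTR =
            Pattern.compile("(?:^|\\s)id\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    private static boolean hasId(String attrs, String id) {
        Matcher a = ID_ATTR.matcher(attrs);
        return a.find() && a.group(2).equals(id);
    }

    // Returns the full <div id="..."> ... </div> element, or null if not found or unbalanced.
    static String extractDiv(String html, String id) {
        Matcher m = DIV_TAG.matcher(html);
        int start = -1; // index where the target div opens, -1 until found
        int depth = 0;  // number of divs currently open inside (and including) the target
        while (m.find()) {
            String attrs = m.group(1);            // null for a closing tag
            boolean closing = attrs == null;
            boolean selfClosing = !closing && attrs.trim().endsWith("/");
            if (start < 0) {
                // Still searching for the target div; ignore every other tag.
                if (!closing && !selfClosing && hasId(attrs, id)) {
                    start = m.start();
                    depth = 1;
                }
                continue;
            }
            if (closing) {
                depth--;
                if (depth == 0) {
                    return html.substring(start, m.end());
                }
            } else if (!selfClosing) {
                depth++;
            }
        }
        return null;
    }
}
```

Calling DivExtractor.extractDiv(html, "article_body") on the page text would return the element together with its nested <div> tags. For real-world pages a proper HTML parser (jsoup is one open source option) is likely to be more robust than any regex-based approach.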