RegEx ! operator help

Brian M Smith
Ranch Hand

Joined: Aug 13, 2009
Posts: 35
I'm rather close to a solution, but I'm struggling with some RegEx.

I'm attempting to read a text file and parse its contents. The file contains HTML. The objective is to find all href attributes on anchor tags and do some processing (find/replace) on those values. I wrote up a quick regex to accomplish that.

href=[\"|\'](.+?)[?|\"|\']

The problem I'm facing is that I need to exclude href attributes that contain the string ".do". I have been experimenting with the following regex pattern, which finds all href attributes that contain ".do", but when I attempt to use the not operator I do not get any matches.

href=[\"|\'](.+?)\.do[?|\"|\']

Here is the text that I'm testing with

<a href="blahdo">This is the en-us version of this spot<br /><br />aaa This would represent a content spot pull from a file.<br /><br /> <a href="/about">About US</a><br /> <a href="/about.do?id=1a">blah</a><br /><br ><a href="something?id=1111">asdfasdf</a> <b>Testing</b> <table> <tr> <td>1</td> <td>2</td> </tr> <tr> <td colspan="2"> . . 3 </td> </tr> </table> <br /><br /> <a href="custom-cable"></a>


I have been using this testing tool, which will let you see the groups that the regex returns: http://www.regexplanet.com/simple/index.html

I'm interested to know why the pattern href=[\"|\'](.+?)!(\.do)[?|\"|\'] does not return the results I expect, when href=[\"|\'](.+?)\.do[?|\"|\'] does.

Looking to be taught how to fish here, so please don't just give me an answer without an explanation!

Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19670

Because ! is not a valid regex operator in Java except in negative lookaheads / negative lookbehinds. In your example you are looking for the literal !
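
A minimal sketch of this point, using the pattern from the question as a Java string literal. The class name, method name, and the second input (with a literal '!' in it) are made up for illustration and are not from the thread.

```java
import java.util.regex.Pattern;

// Illustrative class name; not from the thread.
class BangIsLiteral {
    // The pattern from the question, escaped for a Java string literal.
    // Outside (?!...) and (?<!...), '!' simply matches a '!' character.
    static final Pattern WITH_BANG =
            Pattern.compile("href=[\"|'](.+?)!(\\.do)[?|\"|']");

    // True if the pattern matches anywhere in the given text.
    static boolean found(String html) {
        return WITH_BANG.matcher(html).find();
    }

    public static void main(String[] args) {
        // No '!' in the input, so the pattern cannot match.
        System.out.println(found("<a href=\"/about.do?id=1a\">blah</a>"));  // false
        // Made-up input containing a literal '!' right before ".do".
        System.out.println(found("<a href=\"/about!.do?id=1a\">blah</a>")); // true
    }
}
```

So the pattern only matches when the href value actually contains "!.do", which none of the sample links do.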


Ireneusz Kordal
Ranch Hand

Joined: Jun 21, 2008
Posts: 423
I am not absolutely sure I understand your requirements correctly, but this is probably what you want.

href=[\"|\']((?![^?|\"|\']+?\.do).+?)[?|\"|\']

Please let me know if this fits your needs.

Regards.


Edit:
And a little explanation:
(?!regexp) - is a construct known as 'negative lookahead'.
Here you can find a nice explanation of how regexp lookaheads work: http://www.regular-expressions.info/lookaround.html
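
For reference, a small sketch of how the pattern above behaves when run through java.util.regex. The class name, the helper method, and the shortened sample string (only the anchors from the test text) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name; not from the thread.
class HrefLookaheadDemo {
    // The suggested pattern, escaped for a Java string literal.
    // The negative lookahead sits right after the opening quote, so it
    // inspects the whole attribute value before anything is consumed.
    static final Pattern HREF = Pattern.compile(
            "href=[\"|']((?![^?|\"|']+?\\.do).+?)[?|\"|']");

    // Shortened version of the sample text: just the anchors.
    static final String SAMPLE =
            "<a href=\"blahdo\">x</a> <a href=\"/about\">About US</a> "
            + "<a href=\"/about.do?id=1a\">blah</a> "
            + "<a href=\"something?id=1111\">asdfasdf</a> "
            + "<a href=\"custom-cable\"></a>";

    // Collects group 1 of every match.
    static List<String> hrefs(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [blahdo, /about, something, custom-cable]
        System.out.println(hrefs(SAMPLE));
    }
}
```

Note that group 1 stops at the first '?', so the query string of "something?id=1111" is not captured, and "blahdo" is kept because it contains no literal ".do".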
Brian M Smith
Ranch Hand

Joined: Aug 13, 2009
Posts: 35
Rob Prime wrote: Because ! is not a valid regex operator in Java except in negative lookaheads / negative lookbehinds. In your example you are looking for the literal !


Rob, thanks for the response. I took a look at negative lookaheads, but I'm still struggling with the syntax and how a regex is put together. Here is my new statement.

href=[\"|\'](.+?)(?!\.do)[?|\"|\']

I think that I can probably figure this out if someone could validate what I think is going on. Breaking this down into sections, I read this statement as follows.

href=[\"|\'] - Look for href= followed by a " or ' character
(.+?) - Match one or more instances of any character.
(?!\.do) - Read ahead, but do not match strings that contain the characters .do
[?|\"|\'] - Match the ?, ", or ' character

I guess I don't understand how the read ahead special characters work.
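
One way to check the breakdown above is to run both patterns from this thread against a single link. A lookahead is zero-width: it only tests the text starting at the position where it sits, and consumes nothing. In the pattern above it sits immediately before [?|\"|\'], so it only asks "does '.do' start right here?", and whenever the next character is ?, " or ' the answer is trivially no. A sketch, with illustrative class and method names that are not from the thread:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name; not from the thread.
class LookaheadPositionDemo {
    // Lookahead placed after the group: it is checked only at the
    // position just before the closing ?, " or ' character.
    static final Pattern TRAILING = Pattern.compile(
            "href=[\"|'](.+?)(?!\\.do)[?|\"|']");

    // Lookahead placed right after the opening quote: it scans ahead
    // through the attribute value before anything is consumed.
    static final Pattern LEADING = Pattern.compile(
            "href=[\"|']((?![^?|\"|']+?\\.do).+?)[?|\"|']");

    // Returns group 1 of the first match, or null if there is no match.
    static String firstGroup(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String link = "<a href=\"/about.do?id=1a\">blah</a>";
        System.out.println(firstGroup(TRAILING, link)); // /about.do
        System.out.println(firstGroup(LEADING, link));  // null
    }
}
```

With the trailing lookahead, the lazy group simply grows past ".do" until the lookahead is satisfied at the '?', so the link is still matched.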
Brian M Smith
Ranch Hand

Joined: Aug 13, 2009
Posts: 35
Ireneusz Kordal wrote: I am not absolutely sure I understand your requirements correctly, but this is probably what you want.

href=[\"|\']((?![^?|\"|\']+?\.do).+?)[?|\"|\']

Please let me know if this fits your needs.

Regards.


Edit:
And a little explanation:
(?!regexp) - is a construct known as 'negative lookahead'.
Here you can find a nice explanation of how regexp lookaheads work: http://www.regular-expressions.info/lookaround.html
Ireneusz,

Thank you for the response. This didn't seem to do what I needed it to do. Sorry if I wasn't clear in explaining what I'm trying to accomplish with regex. I'll take a look at that link you provided to see if I can make any headway.
 