Regex to find URL from anchor tag

shwetank singh
Greenhorn

Joined: Apr 02, 2007
Posts: 26
Hi Ranchers!

Please help me understand what may be wrong with the following:
error:
Thanks in advance!
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19552

Check all occurrences of "(?". You must either escape the "(" (which gives an optional literal "(" in the regex), escape the "?", or add a ":" to make it a non-capturing group: "(?:assdsad)".

However, I would change your regex slightly:
- use a group for the opening quote
- use a reluctant (non-greedy) capture all: .*?
- require the closing quote to be the same as the opening quote, using a back reference: \\1

That leaves just one small problem: what if there are no quote characters around the value? In HTML it's perfectly valid to write <a href=http://www.google.com>.
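To make the back reference idea concrete, here is a minimal sketch. The class name, method name and input string are made up for illustration and are not from the original post. Group 1 captures the opening quote, group 2 reluctantly captures the value, and \\1 insists that the closing quote is the same character as the opening one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedHrefDemo {
    // Group 1: the opening quote. Group 2: the value (reluctant).
    // \\1 must match whatever group 1 matched, i.e. the same quote character.
    private static final Pattern HREF = Pattern.compile("href=('|\")(.*?)\\1");

    /** Returns the first quoted href value in the input, or null if there is none. */
    static String firstHref(String html) {
        Matcher m = HREF.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input, chosen only to show the behaviour.
        String html = "<a href=\"http://www.google.com\" title='x'>Google</a>";
        System.out.println(firstHref(html)); // prints http://www.google.com
    }
}
```

Because the closing quote has to equal the opening quote, a value such as 'it"s.html' is captured whole. An unquoted value is not matched at all, which is the remaining problem mentioned above.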


Raymond Tong
Ranch Hand

Joined: Aug 15, 2010
Posts: 230

What is the use of <link> and <text>?
It looks like you want to put the matched part of the pattern into a named group?
shwetank singh
Greenhorn

Joined: Apr 02, 2007
Posts: 26
Thanks, Rob, for the useful insight. I tried everything you suggested but can't crack it. I tried:
- removing the optional "(" or escaping it
- using ?: (I had already tried this but can't get it to work)
- requiring the opening quote using a back reference \\1 (I didn't actually understand this one)

Could you please tell me whether the approach is correct, or whether there is a simpler one?

@Raymond: yes, that is what I am trying to do. Any suggestions?

Thanks!
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19552

java.util.regex.Pattern has no support for named groups before Java 7. Only numbered.
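For readers on Java 7 or later: Pattern there does accept named groups written as (?<name>...), and Matcher.group(String) reads them back. The sketch below assumes that is what <link> and <text> were meant to do. The original code is not shown in the thread, so the pattern, class name and input here are illustrative guesses, not the poster's regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    // Requires Java 7+. Group 1 is the opening quote; "link" and "text" are named groups.
    // Named groups are still numbered too, so \\1 continues to refer to the quote.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*('|\")(?<link>.*?)\\1[^>]*>(?<text>.*?)</a>",
            Pattern.CASE_INSENSITIVE);

    /** Returns {link, text} for the first anchor found, or null if there is none. */
    static String[] parse(String html) {
        Matcher m = ANCHOR.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group("link"), m.group("text") };
    }

    public static void main(String[] args) {
        // Hypothetical input.
        String[] parts = parse("hello <a href='google.com'>link</a>");
        System.out.println(parts[0] + " / " + parts[1]); // prints google.com / link
    }
}
```

On Java 6 and earlier the same pattern fails to compile, because "(?<" is only valid there as the start of a lookbehind. On those versions use numbered groups only.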

Here's how I would have built this regex:
- make the entire thing case insensitive. That allows you to find <A as well as <a
- start with <a
- anything that doesn't close the tag, as a reluctant quantifier†: [^>]*?
- href
- any amount of whitespace: \s*
- =
- any amount of whitespace: \s*
- a capturing group for the opening quote: ('|")
- a capturing group with anything, as a reluctant quantifier, for the URL: (.*?)
- the closing quote, equal to the opening quote: \1
- again, anything that doesn't close the tag, as a reluctant quantifier†: [^>]*?
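- a negative lookbehind for /, to prevent a case of <a xxxxx/>: (?<!/) (a lookahead (?!/) in this position would always succeed, because the next character has to be the closing >)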
- the closing >

If you paste all that together you get a regex that should do what you need. In the future, I would build regexes the same way if I were you: write down in words what you think you need, bit by bit, then translate each bit into its own little regex, then combine these into one larger regex.

†These two "anything that doesn't close the tag" parts are for any other attributes, like target, name, id, etc.
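Pasted together, the steps above could look like the following sketch. The class name, method name and sample input are invented for illustration. One deliberate deviation: the check against <a xxxxx/> is written as a negative lookbehind (?<!/) instead of a lookahead, because a lookahead placed directly before the closing > only ever sees that >, so it cannot notice a / in front of it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorHrefFinder {
    private static final Pattern ANCHOR = Pattern.compile(
            "<a"                  // start of the tag
            + "[^>]*?"            // other attributes before href (reluctant)
            + "href\\s*=\\s*"     // href, =, optional whitespace on both sides
            + "('|\")"            // group 1: the opening quote
            + "(.*?)"             // group 2: the URL (reluctant)
            + "\\1"               // closing quote, same as the opening quote
            + "[^>]*?"            // other attributes after href (reluctant)
            + "(?<!/)"            // assumption: lookbehind, rejects a "/" right before ">"
            + ">",                // the closing >
            Pattern.CASE_INSENSITIVE);

    /** Collects the quoted href values of all anchor tags in the input. */
    static List<String> findUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            urls.add(m.group(2));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical input: upper-case tag, extra attributes, and a self-closing tag.
        String html = "<A target=\"_blank\" HREF = \"http://a.example/x\">a</A> "
                + "<a href='b.html' id=q>b</a> <a href=\"c.html\"/>";
        System.out.println(findUrls(html)); // prints [http://a.example/x, b.html]
    }
}
```

Keeping each step on its own commented line mirrors the "write it down bit by bit" approach and makes it easy to see which piece is responsible when a match goes wrong.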


As for the non-quoted values, I ended up using a second regex for that. It looked the same, except that the opening quote, reluctant anything and closing quote were replaced by a negative lookahead to prevent quotes, followed by any run of non-whitespace. After that came either > or whitespace followed by the last three parts of the above regex.
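The second regex is only described in words, so the following is one possible reading of it, not the regex actually used. Assumptions: "any non-whitespace" is written as [^\s>]+ so that it also stops at the closing >, the slash check is a lookbehind rather than a lookahead, and all names and inputs are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnquotedHrefFinder {
    private static final Pattern UNQUOTED = Pattern.compile(
            "<a[^>]*?href\\s*=\\s*"          // same start as the quoted version
            + "(?!['\"])"                    // negative lookahead: value must not start with a quote
            + "([^\\s>]+)"                   // group 1: the URL, up to whitespace or >
            + "(?:>|\\s[^>]*?(?<!/)>)",      // either > directly, or whitespace, other attributes, >
            Pattern.CASE_INSENSITIVE);

    /** Collects the unquoted href values of all anchor tags in the input. */
    static List<String> findUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = UNQUOTED.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical input: two unquoted values and one quoted value.
        String html = "<a href=http://www.google.com>Google</a> "
                + "<a class=x href=b.html target=_blank>b</a> <a href=\"c.html\">c</a>";
        System.out.println(findUrls(html)); // prints [http://www.google.com, b.html]
    }
}
```

Running the quoted and the unquoted regex separately keeps both readable; folding them into a single alternation is possible but makes the group numbering harder to follow.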
shwetank singh
Greenhorn

Joined: Apr 02, 2007
Posts: 26
Thanks, Rob!
Got it:

String pattern = "<a[^>]*?href\\s*=\\s*((\'|\")(.*?)(\'|\"))[^>]*?(?!/)>";
System.out.println(m.group(1) + "<-- -->"+ m.start() + "<-- -->" + ss);

output :

'google.com'<-- -->6<-- -->hello link

I will take those quotes out too, and handle the case when there are no quotes. Will post back when done.

Thanks. I did think it through step by step, but did not write the steps down! Nice learning from you!
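A note on taking the quotes out: with the pattern posted above no further regex work should be needed, since group 1 is the quoted value, group 2 the opening quote, group 3 the bare URL and group 4 the closing quote. The sketch below reuses that pattern unchanged. The input string and the class and method names are assumptions, chosen only to be consistent with the posted output.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StripQuotesDemo {
    // The pattern exactly as posted above.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a[^>]*?href\\s*=\\s*((\'|\")(.*?)(\'|\"))[^>]*?(?!/)>");

    /** Returns "url at start" for the first match, or null if nothing matches. */
    static String describe(String html) {
        Matcher m = ANCHOR.matcher(html);
        if (!m.find()) {
            return null;
        }
        // group(1) would be 'google.com' with quotes; group(3) is the URL alone.
        return m.group(3) + " at " + m.start();
    }

    public static void main(String[] args) {
        // Hypothetical input: the thread does not show the original string.
        System.out.println(describe("hello <a href='google.com'>link</a>")); // prints google.com at 6
    }
}
```

Note that group 4 accepts either quote character, so this pattern also matches mismatched quotes such as href='x". Replacing (\'|\") at the end with the back reference \\2 would tie the closing quote to the opening one.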
 