Regex to find URL from anchor tag

 
shwetank singh
Greenhorn
Posts: 26
Hi Ranchers!

Please help me understand what might be wrong with the following:
error:
Thanks in advance!
 
Rob Spoor
Sheriff
Posts: 20372
Check all occurrences of "(?". You must either escape the "(" (which makes it an optional literal "(" in the regex), escape the "?", or add a ":" to make it a non-capturing group: "(?:assdsad)".
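
To make the three options concrete, here is a minimal sketch. The class name GroupSyntaxDemo, the helper groups, and the sample string "abc" are made up for illustration; they are not taken from the original regex.

```java
import java.util.regex.Pattern;

final class GroupSyntaxDemo {
    // Hypothetical helper: number of capturing groups a regex declares.
    static int groups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // "(abc)" captures; "(?:abc)" groups without capturing.
        System.out.println(groups("(abc)"));    // prints 1
        System.out.println(groups("(?:abc)"));  // prints 0

        // Escaping "(" gives an optional literal parenthesis before "abc".
        System.out.println(Pattern.matches("\\(?abc", "(abc")); // prints true
        System.out.println(Pattern.matches("\\(?abc", "abc"));  // prints true

        // Escaping "?" gives a capturing group that starts with a literal "?".
        System.out.println(Pattern.matches("(\\?abc)", "?abc")); // prints true
    }
}
```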

However, I would change your regex slightly:
- use a group for the opening quote
- use a reluctant (non-greedy) capture all: .*?
- require the closing quote to be the same as your opening quote by using a back reference: \\1

That leaves you with just one small problem: what if there are no quote characters around the value? In HTML it's perfectly valid to write <a href=http://www.google.com>.
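
A small sketch of the quote group, the reluctant capture and the back reference working together. It only covers a quoted value, not a whole anchor tag; the class name BackReferenceDemo, the method firstQuoted and the sample inputs are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BackReferenceDemo {
    // Group 1: opening quote. Group 2: reluctant "anything".
    // \1 (written "\\1" in a Java string) must repeat whatever group 1 matched.
    static final Pattern QUOTED = Pattern.compile("(['\"])(.*?)\\1");

    // Returns the first quoted value in the input, or null if there is none.
    static String firstQuoted(String input) {
        Matcher m = QUOTED.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // The apostrophe inside the value does not end it,
        // because the value was opened with a double quote.
        System.out.println(firstQuoted("href=\"it's here\" id='x'")); // prints it's here

        // The reluctant .*? stops at the first matching closing quote.
        System.out.println(firstQuoted("a=\"1\" b=\"2\""));           // prints 1
    }
}
```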
 
Raymond Tong
Ranch Hand
Posts: 240
What is the purpose of <link> and <text>?
It looks like you want to put the matched parts of the pattern into named groups?
 
shwetank singh
Greenhorn
Posts: 26
Thanks Rob for the useful insight. I tried everything you suggested but can't crack it. I tried:
- removing the optional "(" or escaping it
- using ?: (I had already tried this, but can't get it to work)
- requiring the opening quote using a back reference \\1 (I didn't actually get this one)

Could you please tell me whether the approach is correct, or whether there is a simpler one?

@Raymond: yes, i am trying to do the same. suggestions?

thanks!
 
Rob Spoor
Sheriff
Posts: 20372
java.util.regex.Pattern has no support for named groups before Java 7, only numbered ones; Java 7 added the (?<name>X) syntax.
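
For readers on Java 7 or later, where Pattern accepts the (?<name>X) syntax, the named-group idea could look roughly like the sketch below. The group names link and text come from the question above; the rest of the pattern, the class name NamedGroupDemo and the sample input are assumptions, since the original regex is not shown in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedGroupDemo {
    // Named groups are still numbered: group 1 is the quote,
    // group 2 is "link", group 3 is "text". \1 refers to the quote.
    static final Pattern ANCHOR = Pattern.compile(
        "<a[^>]*?href\\s*=\\s*(['\"])(?<link>.*?)\\1[^>]*>(?<text>.*?)</a>",
        Pattern.CASE_INSENSITIVE);

    // Returns the value of the named group for the first anchor, or null.
    static String first(String html, String groupName) {
        Matcher m = ANCHOR.matcher(html);
        return m.find() ? m.group(groupName) : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.google.com\">Google</a>";
        System.out.println(first(html, "link")); // prints http://www.google.com
        System.out.println(first(html, "text")); // prints Google
    }
}
```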

Here's how I would have built this regex:
- make the entire thing case insensitive. That allows you to find <A as well as <a
- start with <a
- anything that doesn't close the tag, as a reluctant quantifier†: [^>]*?
- href
- any amount of whitespace: \s*
- =
- any amount of whitespace: \s*
- a capturing group for the opening quote: ('|")
- a capturing group with anything, as a reluctant quantifier, for the URL: (.*?)
- the closing quote, equal to the opening quote: \1
- again, anything that doesn't close the tag, as a reluctant quantifier†: [^>]*?
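- a negative lookbehind for /, to prevent a case of <a xxxxx/>: (?<!/) (a negative lookahead (?!/) placed directly before the > always succeeds, because the next character is the > itself)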
- the closing >

If you paste all that together you get a regex that should do what you need. In the future, I would build regexes the same way if I were you: write down what you think you need in words, bit by bit, then translate each of these bits into a separate little regex, then combine these into one larger regex.

†These two "anything that doesn't close the tag" parts are for any other attributes, like target, name, id, etc.
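
Pasted together, the steps above might look like the following sketch, with one string piece per step. The class name AnchorHrefRegex, the method hrefs and the sample HTML are made up for illustration. One choice is worth flagging: the self-closing check is written as a negative lookbehind, (?<!/), because a lookahead placed directly before the > only ever sees the > itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchorHrefRegex {
    static final Pattern QUOTED_HREF = Pattern.compile(
          "<a"         // start of the tag
        + "[^>]*?"     // any other attributes, reluctantly
        + "href"
        + "\\s*=\\s*"  // "=" with optional whitespace around it
        + "('|\")"     // group 1: opening quote
        + "(.*?)"      // group 2: the URL, reluctantly
        + "\\1"        // closing quote, same as the opening quote
        + "[^>]*?"     // any remaining attributes
        + "(?<!/)"     // lookbehind: reject a "/" right before ">"
        + ">",         // end of the tag
        Pattern.CASE_INSENSITIVE);

    // Collects the URL (group 2) of every match in the input.
    static List<String> hrefs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED_HREF.matcher(html);
        while (m.find()) {
            result.add(m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<A HREF='a.html' target=\"_blank\">x</A> "
                    + "<a id=\"y\" href = \"b.html\">z</a>";
        System.out.println(hrefs(html)); // prints [a.html, b.html]
    }
}
```

This is still a regex over HTML, so it only handles simple markup; for example, the reluctant (.*?) can run past the end of a tag when searching for its closing quote.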


As for the non-quoted values, I ended up using a second regex for that. It looked the same, except that the opening quote, reluctant anything and closing quote were replaced by a negative lookahead to prevent quotes, followed by any non-whitespace. After it came either > or whitespace followed by the last three parts of the above regex.
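
The second regex is only described in words, so the following is a guess at its shape rather than the actual one. Assumptions: "any non-whitespace" is narrowed to [^\s>]+ so the value cannot swallow the closing >, the self-closing check is a lookbehind, and the names UnquotedHrefRegex and first are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnquotedHrefRegex {
    static final Pattern UNQUOTED_HREF = Pattern.compile(
          "<a[^>]*?href\\s*=\\s*"       // same start as the quoted version
        + "(?!['\"])"                   // negative lookahead: value must not start with a quote
        + "([^\\s>]+)"                  // group 1: the value (assumption: also excludes ">")
        + "(?:>|\\s[^>]*?(?<!/)>)",     // either ">" or whitespace plus the rest of the tag
        Pattern.CASE_INSENSITIVE);

    // Returns the first unquoted href value, or null if there is none.
    static String first(String html) {
        Matcher m = UNQUOTED_HREF.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(first("<a href=http://www.google.com>"));
        // prints http://www.google.com
        System.out.println(first("<a href=http://www.google.com target=_blank>"));
        // prints http://www.google.com
        System.out.println(first("<a href=\"http://www.google.com\">"));
        // prints null (quoted values are left to the first regex)
    }
}
```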
 
shwetank singh
Greenhorn
Posts: 26
Thanks Rob!
got it:

String pattern = "<a[^>]*?href\\s*=\\s*((\'|\")(.*?)(\'|\"))[^>]*?(?!/)>";
System.out.println(m.group(1) + "<-- -->"+ m.start() + "<-- -->" + ss);

output :

'google.com'<-- -->6<-- -->hello link

I will take those quotes out too, and handle the case when there are no quotes. Will post back when done.

thanks..i did take the thought step by step..but did not write them down!..nice learning from you!
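
For anyone following along, one way to drop the quotes without touching the pattern: in the regex as posted, group 1 is the quoted value, group 2 the opening quote, group 3 the bare URL and group 4 the closing quote, so reading group 3 is enough. The sketch below assumes an input of "hello <a href='google.com'>link</a>" (the real input is not shown in the post); the class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StripQuotesDemo {
    // The pattern exactly as posted above.
    static final Pattern ANCHOR = Pattern.compile(
        "<a[^>]*?href\\s*=\\s*((\'|\")(.*?)(\'|\"))[^>]*?(?!/)>");

    // Returns the URL without its quotes (group 3), or null if nothing matches.
    static String url(String html) {
        Matcher m = ANCHOR.matcher(html);
        return m.find() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        String html = "hello <a href='google.com'>link</a>"; // assumed input
        Matcher m = ANCHOR.matcher(html);
        if (m.find()) {
            System.out.println(m.group(1)); // prints 'google.com'
            System.out.println(m.group(3)); // prints google.com
            System.out.println(m.start());  // prints 6
        }
    }
}
```

Note that group 4 accepts either quote character, so a value opened with ' could be closed by "; replacing group 4 with a back reference to the opening quote avoids that.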
 