Reg Exp

 
author
I am working with VB in ASP.NET, but that does not really matter. I want to parse out URLs. I can match URLs with the following code, but I cannot make it stop after the last character of the URL. It keeps going.

Anyone have any insight?
Eric
 
Eric Pascarello
author
The things that are causing the trouble are:
http://url.com.
http://url.com<tag

I see the forum has the same problem!
[ November 14, 2003: Message edited by: Eric Pascarello ]
 
Ranch Hand
Try....

This regexp will capture anything in an <a> tag that is between quotes and after 'href='.
Of course, this is the basic idea. You'll probably have to tweak it a bit to work with VB's regexp engine, which is probably different from Java's engine, which is definitely different from PHP's engine, etc...
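The pattern from this post did not survive on the page, so the following is only a rough sketch of the idea it describes (grab whatever sits between quotes after href= inside an <a> tag). It uses Python's re module rather than VB's engine, and the names HREF_RE and extract_hrefs are made up for illustration.

```python
import re

# Hypothetical pattern, not the one originally posted:
# capture the quoted value that follows href= inside an <a ...> tag.
HREF_RE = re.compile(r'<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1', re.IGNORECASE)

def extract_hrefs(html):
    """Return every quoted href value found in <a> tags."""
    return [m.group(2) for m in HREF_RE.finditer(html)]

sample = '<p><a href="http://url.com">one</a> <A class="x" HREF=\'/local/page\'>two</A></p>'
print(extract_hrefs(sample))
```

This only finds links that are already marked up, so it would not help with bare URLs in plain text, and unquoted href values are ignored.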
 
Sheriff
Try something like "(\bhttp[s]{0,1}://.+?\b)" as a first test. The '?' indicates a "reluctant" match, which stops at the first whitespace. Note that this may still cause problems, depending on where you use it. I find that trying to find URLs in plain text is complicated by people's tendency to add punctuation at the end. To that end I use the following somewhat more complex regex in my development Friki for this task:
([Hh][Tt][Tt][Pp]|[Ff][Tt][Pp]|[Mm][Aa][Ii][Ll][Tt][Oo]):([^\s\<\>\[\]\"'\(\)\?])*[^\s\<\>\[\]\"'\(\)\?\,\.]([\?][^\s\<\>\[\]\"'\(\)\?]*[^\s\<\>\[\]\"'\(\)\?\,\.])?
It passes all my unit tests at the moment, but if you want to suggest any tests that it won't pass ...
Coincidentally, this also matches upper- and lower-case http: as well as the common ftp: and mailto: protocols. You probably need to add [Ss]{0,1} if you want to handle https: too. There's an argument for allowing any protocol prefix, just using [A-Za-z]+:, to allow for the whole range of possible URLs, including weirdos like gopher.
Have you tried your regex with the URL at the start of the file/stream? I found I had to add something like (?:^|\b) (non-capturing, start of stream or whitespace) as a prefix at the start, rather than including the \b in the captured group, to make this work. Unit tests are your friend.
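For anyone who wants to poke at the longer pattern, here is a minimal sketch of a few checks in Python. It assumes the protocol group is followed by a literal ':' and an opening parenthesis, and it assumes Python's re engine behaves like the one the pattern was written for; URL_RE and find_urls are illustrative names only.

```python
import re

# The longer pattern from this post, split over three lines for readability.
# Assumption: ':' plus an opening parenthesis follows the protocol group.
URL_RE = re.compile(
    r"([Hh][Tt][Tt][Pp]|[Ff][Tt][Pp]|[Mm][Aa][Ii][Ll][Tt][Oo]):"
    r"([^\s\<\>\[\]\"'\(\)\?])*[^\s\<\>\[\]\"'\(\)\?\,\.]"
    r"([\?][^\s\<\>\[\]\"'\(\)\?]*[^\s\<\>\[\]\"'\(\)\?\,\.])?"
)

def find_urls(text):
    """Return the full text of every match, in order of appearance."""
    return [m.group(0) for m in URL_RE.finditer(text)]

print(find_urls("See http://url.com."))                    # trailing period dropped
print(find_urls("http://url.com<tag"))                     # stops before the tag
print(find_urls("HTTP://example.org/a?b=1&c=2, then"))     # query string kept
print(find_urls("https://secure.example.com"))             # no match without [Ss]{0,1}
```

The last case is the https: gap mentioned above: the protocol group matches "http", but the next character is "s" rather than ":", so nothing is found.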
Hope this helps.
curse those smilies. grr..
[ November 14, 2003: Message edited by: Frank Carver ]
 
Wanderer
First a minor note: You can replace
http[s]{0,1}
with
https?
which should mean the same thing, but is both shorter and faster in most engines.
As for the rest: I gather the main problems are when the URL is terminated by something other than a space - e.g. '<', '>', or '.'. The first two are easy - just include <> in the exclusion list for the character class. The . is a bit more complex, since it's perfectly OK within a URL, but it can't be the final character. You can make a separate class for that final char:
(\bhttps?://[^ <>]*[^ <>.]\b)
I think those word boundaries (\b) could be a problem too, particularly the last one. You might have something like
foo,http://www.yahoo.com/,bar
The final / should be part of the URL, but the , is not, and there's no word boundary between those two chars - they're both non-word. How about something like this:
\bhttps?://[^ <>,]*[^ <>,.]
I added ',' to the list of forbidden chars. I'm sure you could find more...
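To see the difference between the two patterns in this post, here is a small sketch using Python's re module (an assumption; VB's engine may differ in details). The names WITH_BOUNDARY, NO_BOUNDARY and first_match are just for illustration.

```python
import re

# Pattern with a trailing word boundary, as first suggested in this post.
WITH_BOUNDARY = re.compile(r"(\bhttps?://[^ <>]*[^ <>.]\b)")
# Pattern without the trailing boundary and with ',' excluded.
NO_BOUNDARY = re.compile(r"\bhttps?://[^ <>,]*[^ <>,.]")

def first_match(pattern, text):
    """Return the first match as a string, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None

text = "foo,http://www.yahoo.com/,bar"
print(first_match(WITH_BOUNDARY, text))  # runs on through ",bar"
print(first_match(NO_BOUNDARY, text))    # stops after the final "/"
print(first_match(NO_BOUNDARY, "Visit http://url.com."))
print(first_match(NO_BOUNDARY, "http://url.com<tag"))
```

With the trailing \b the match has to end on a word character, so it swallows ",bar". Without it, the final "/" is kept and the comma is left out.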
[ November 14, 2003: Message edited by: Jim Yingst ]
 
Eric Pascarello
author
I used \bhttps?://[^ <>,]*[^ <>,.]
but it is still ignoring < and .
I have tried some other versions of it and am still coming up short.
I am playing with Frank's now....
 
Frank Carver
Sheriff
Ah. That'll teach me to rely on my memory. \b is indeed a word break, not a whitespace. Not much use for URLs, as they contain too many non-word characters. I reckon you'll definitely need to be explicit about which characters you allow and which you don't.
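A quick sketch of what goes wrong, using Python's re module as a stand-in for whatever engine is in use. The second pattern is only one hypothetical way of being explicit about the allowed characters, not a recommendation from this thread.

```python
import re

# The "first test" pattern: reluctant .+? followed by a word boundary.
FIRST_TRY = re.compile(r"(\bhttp[s]{0,1}://.+?\b)")
# Hypothetical explicit alternative: stop at whitespace or angle brackets.
EXPLICIT = re.compile(r"https?://[^\s<>]+")

text = "see http://www.yahoo.com/ now"
print(FIRST_TRY.search(text).group(1))  # stops at the first '.', after "www"
print(EXPLICIT.search(text).group(0))   # runs up to the space
```

The reluctant match gives up at the first word boundary it meets, which is the spot between "www" and the following dot, long before any whitespace.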
 
Phil Chuang
Ranch Hand
what if the links are local? then there will be no "http://" ... keep that in mind.
 
Eric Pascarello
author
WOHOO I GOT IT.....I am a dumb
I was converting \n to <br> after I was doing the URL matching, so that is why it was not working!!!
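One plausible reading of that fix, sketched in Python (the sample text and the name URL_RE are made up, and the real code was VB): the character class [^ <>,] excludes a literal space but not a newline, so a URL at the end of a line runs into the next line unless the newline has already been turned into <br>.

```python
import re

# Same pattern as used earlier in the thread.
URL_RE = re.compile(r"\bhttps?://[^ <>,]*[^ <>,.]")

raw = "go to http://url.com\nnext line"

# Matching before the newline conversion: the match crosses the line break.
before = URL_RE.search(raw).group(0)
print(repr(before))

# Converting \n to <br> first: the '<' now terminates the URL.
converted = raw.replace("\n", "<br>")
after = URL_RE.search(converted).group(0)
print(repr(after))
```

Excluding all whitespace with \s instead of a literal space would be another way to get the same effect, regardless of the order of the two steps.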
 