Regular expressions: finding but displaying first email two times?

Farakh khan
Ranch Hand

Joined: Mar 22, 2008
Posts: 732
Hello,



The above code output is:


The fileData string contains many words but only two email addresses. It's reading the emails but printing the first one two times.

Thanks in anticipation
Joe Areeda
Ranch Hand

Joined: Apr 15, 2011
Posts: 318
    

I tried your regex in my test harness and I only got one match for each email looking string in the input.

Why do you remove the spaces after an @?

Have you tried grep on that file data to make sure there is only 1 of the first email?

Sorry, but I don't see anything wrong with your code snippet, except maybe that you're accepting characters that are not legal in a real email address.

Joe


It's not what your program can do, it's what your users do with the program.
Farakh khan
Ranch Hand

Joined: Mar 22, 2008
Posts: 732
Sorry for edit:

I checked repeatedly with a manually written string and it works fine, but when I read from a file it prints each email twice.



Oh, I found the cause: the *.rtf file contains each email two times, first as mailto:aa@ymail.com and then as plain aa@ymail.com, and that creates the problem. Please check the following:

HYPERLINK "mailto:aa@ymail.com"}{\rtlch\fcs1 \af37 \ltrch\fcs0 \f37\insrsid5575354 {\*\datafield 00d0c9ea79f9bace118c8200aa004ba90b0200000003000000e0c9ea79f9bace118c8200aa004ba90b5c0000006d00610069006c0074006f003a006a0065007200680065006d0069006e0070006800610072006d00610063007900400079006d00610069006c002e0063006f006d000000795881f43b1d7f48af2c825dc485 276300000000a5ab00000000}}}{\fldrslt {\rtlch\fcs1 \af36\afs20 \ltrch\fcs0 \f36\fs20\ul\cf2\insrsid5575354 \hich\af36\dbch\af31505\loch\f36 aa@ymail.com}}}\sectd \


How can I fix it?

Thanks again
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18896
    

Farakh khan wrote:oh, I got a point the *.rtf file output printing one email two times as emailto: aa@yahoo.com and then aa@yahoo.com that creates problem. Please check the following:

HYPERLINK "mailto:aa@ymail.com"}{\rtlch\fcs1 \af37 \ltrch\fcs0 \f37\insrsid5575354 {\*\datafield 00d0c9ea79f9bace118c8200aa004ba90b0200000003000000e0c9ea79f9bace118c8200aa004ba90b5c0000006d00610069006c0074006f003a006a0065007200680065006d0069006e0070006800610072006d00610063007900400079006d00610069006c002e0063006f006d000000795881f43b1d7f48af2c825dc485 276300000000a5ab00000000}}}{\fldrslt {\rtlch\fcs1 \af36\afs20 \ltrch\fcs0 \f36\fs20\ul\cf2\insrsid5575354 \hich\af36\dbch\af31505\loch\f36 aa@ymail.com}}}\sectd \


How can I fix it?


One way is to modify the regex so that only one will succeed -- for example, if you require the "mailto:" as part of the match, then only the first will succeed.

Henry


Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
Farakh khan
Ranch Hand

Joined: Mar 22, 2008
Posts: 732
Thanks, Henry, for your helpful reply. I am looking for any hint from your end that could fix it.

I tried but in vain. The following code removes duplicate words but not email addresses. Please check this also: http://www.rubular.com/r/kn4ZMtBnny

Thanks again
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18896
    

Farakh khan wrote:Thanks Henry for your favorable reply. I am looking for any clue/gesture from your end that can fix it.

I tried but in vain. The following code removes duplicate words but not email addresses. Please check this also: http://www.rubular.com/r/kn4ZMtBnny

Thanks again


You decided to use regular expressions, on the results generated by your previous regular expression solution, to remove the duplicates that it incorrectly produced? Would it not be a lot more efficient to just fix the first regular expression so that it does not generate the duplicates?


BTW, as a side note, do you understand what your email regex (i.e. "[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?") does, and how it works? The reason I am asking is that your second regex is much simpler, yet you are struggling with it. IMO, you should never ever use something that you don't completely understand. And you have a pretty ugly regex to deal with.

Henry
Farakh khan
Ranch Hand

Joined: Mar 22, 2008
Posts: 732
Thanks again for your reply

Henry Wong wrote:
BTW, as a side note, do you understand what your email regex (ie.... "[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?") does? and how it works?


Frankly speaking, not at all. I copied it from an article on the web and found that it works perfectly.

Henry Wong wrote:
The reason I am asking is because your second regex is much simpler, yet you are struggling with it.... IMO, you should never ever use something that you don't completely understand. And you have a pretty ugly regex to deal with.
Henry


You are right, because I am a newbie with regex. Can you please suggest something to fix it?

Thanks again

Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18896
    

Farakh khan wrote:
Frankly speaking not at all but copied from an article on the web and I found it works perfect

You are right because I am newbie with regex topic. Can you please suggest something to fix it?


First, I very highly recommend that you stop coding and start learning regular expressions first. It is really *not* a good idea to use something that you don't understand. And as you have probably already figured out, when you don't understand something and it goes wrong, you can't really fix it. You need to understand something before you can fix it.

Second, I also highly recommend that you start over with the regex too. It is way too complicated. It validates email addresses, which is a good idea in itself, but since you don't understand it, you can't do anything about it when validation fails.

I recommend matching on this field.... HYPERLINK "mailto:aa@ymail.com" ... you just need the value between the quotes (after the mailto: tag), and it is probably much easier to resolve. At this point, you don't need to validate the address, as I think it is safe to assume that the value is an email address. Also, the "mailto" tag isn't on the other email address, so it won't match as a duplicate.
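As a rough sketch of that idea, assuming the file contents have already been read into a string: the pattern captures whatever sits between `"mailto:` and the closing quote. The class and method names here are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MailtoExtractor {
    // Group 1 is everything between "mailto: and the next double quote.
    // No validation is attempted; the value is assumed to be an address.
    private static final Pattern MAILTO = Pattern.compile("\"mailto:([^\"]+)\"");

    // Hypothetical helper: returns the mailto targets in order of appearance.
    static List<String> extract(String fileData) {
        List<String> emails = new ArrayList<>();
        Matcher m = MAILTO.matcher(fileData);
        while (m.find()) {
            emails.add(m.group(1));
        }
        return emails;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the RTF fragment posted above.
        String rtf = "HYPERLINK \"mailto:aa@ymail.com\"}{\\fldrslt {\\f36 aa@ymail.com}}}";
        System.out.println(extract(rtf)); // prints [aa@ymail.com]
    }
}
```

The plain-text copy of the address has no `"mailto:` prefix, so it is skipped and each linked address is reported once.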


The combination of learning regular expressions, and working on a much simpler regex, would really help here.

Henry
Farakh khan
Ranch Hand

Joined: Mar 22, 2008
Posts: 732
Hello,

I agree with you that I should stop and learn regular expressions. I am working on it.


Secondly, your suggestion to take the email after mailto: does not work, as some files contain email addresses that are not linked with a mailto: tag. Finally, I fixed it as follows:


Thanks for your hard yet positive advice
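
For files that mix linked and plain addresses, one possible approach (only a sketch, and not necessarily the fix described in this post) is to keep the original search and collect the matches in a `LinkedHashSet`, which drops repeats while preserving the order in which addresses were found. The class and method names are illustrative; the pattern is the one quoted earlier in this thread.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UniqueEmailFinder {
    // Email pattern quoted earlier in this thread, unchanged (lowercase only).
    private static final Pattern EMAIL = Pattern.compile(
        "[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
        + "@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?");

    // Hypothetical helper: every distinct match, in first-seen order.
    static Set<String> findUnique(String fileData) {
        Set<String> emails = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(fileData);
        while (m.find()) {
            emails.add(m.group()); // a repeated address is silently ignored
        }
        return emails;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the RTF fragment posted above.
        String rtf = "HYPERLINK \"mailto:aa@ymail.com\"}{\\fldrslt {\\f36 aa@ymail.com}}}";
        System.out.println(findUnique(rtf)); // prints [aa@ymail.com]
    }
}
```

This works the same whether or not an address also appears inside a mailto: link, at the cost of removing genuine repeats elsewhere in the file.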
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18896
    

Farakh khan wrote:
I agree with you to stop and learn regular expression. I am working on it


To give you some incentive, let me give you an example of the power of regular expressions...


Farakh khan wrote:The following code removes duplicate words but not email addresses.



There are actually a few reasons why it doesn't work. First, you are right, this is for words, and you have email addresses. But second, your delimiters are wrong.

This regex is for words separated by a space, but your emails are separated by space comma space. So you need to change the regex to ... "\\b(\\w+)\\b , \\b\\1\\b".

As for handling words or email addresses, it actually doesn't matter. You know they are email addresses, so there is no reason to validate them again. You can just say that it is a group of stuff that is not a space or comma... so, you can change the regex to... "\\b([^,\\s]+)\\b , \\b\\1\\b".

Also, the "\\b" is for word boundaries, which you don't need... so ... "([^,\\s]+) , \\1"

Finally, there are many modes of regexes -- one mode is for word replacement. And the string class actually has a convenience method for it... so ...



Basically, you can remove the duplicate emails with a single method call.
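
A minimal sketch of such a call, assuming the found addresses have been joined with space-comma-space as described above. The class and method names are illustrative only.

```java
class DuplicateRemover {
    // Collapses "X , X" into "X" using the back-reference pattern built up above.
    static String removeAdjacentDuplicates(String emails) {
        return emails.replaceAll("([^,\\s]+) , \\1", "$1");
    }

    public static void main(String[] args) {
        String found = "aa@ymail.com , aa@ymail.com , bb@ymail.com , bb@ymail.com";
        System.out.println(removeAdjacentDuplicates(found));
        // prints aa@ymail.com , bb@ymail.com
    }
}
```

Note that this only collapses side-by-side pairs, and since the pattern has no boundaries it can also match when one entry is merely the tail of the previous one or the start of the next. It is safest on lists like the one here, where each duplicate is exact and directly follows the original.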

Hope this helps,
Henry
dennis deems
Ranch Hand

Joined: Mar 12, 2011
Posts: 808
I have found this site rather helpful: http://www.regular-expressions.info/
 