JavaRanch » Java Forums » Java » Beginning Java
Pattern matching an email address read from a file

Iain Emsley
Ranch Hand

Joined: Oct 11, 2007
Posts: 60
I'm trying to create a programme which will search a file (eventually a series of files) and extract email addresses from it, to be put into a database so that another programme can verify that a user exists. (It is definitely NOT for spam purposes - just in case somebody asks).

I'm trying to do it in a series of classes and have started with the one that opens a single file. It reads the file correctly, but when I pass a pattern to hasNext(), the programme produces no output.

Am I using Scanner correctly, or is there a better way of getting the output (though it may be the regex that is shot!)? What I am trying to do is read possible email addresses out of the file and, for now, print them.

My next thing will be to write them to a database but I'm assuming that I can put this into a different class to deal with the db.

I'd be grateful for some help on the Scanner though to begin to understand what I need to do to fix it. Thanks.
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
You're not using the Scanner correctly. The problem is that your hasNext() and next() do not match. The hasNext(Pattern) is looking for a token that matches the Pattern, but the next() is looking for a token that is delimited by the delimiter (which is whitespace by default). If you use hasNext(Pattern), you should also use next(Pattern) to match.

Also, there's no need to compile the Pattern each time you use it. It never changes, so just compile it once, and reuse it.
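
To make the two points above concrete, here is a minimal sketch (not the original poster's code, which isn't shown in the thread). The class name MatchingTokens, the method emailTokens and the sample text are made up for illustration. The Pattern is compiled once as a constant, and hasNext(Pattern) is paired with next(Pattern). The regex is the simple one suggested later in this post, not a complete email validator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class MatchingTokens {
    // Compiled once and reused; deliberately simple, for illustration only.
    private static final Pattern EMAIL = Pattern.compile("\\w+@\\w+\\.\\w{2,4}");

    // Collects every whitespace-delimited token that matches EMAIL in full.
    static List<String> emailTokens(String text) {
        List<String> found = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                if (scanner.hasNext(EMAIL)) {
                    // hasNext(Pattern) and next(Pattern) use the same pattern
                    found.add(scanner.next(EMAIL));
                } else {
                    scanner.next(); // not an address: consume the token and move on
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(emailTokens("write to bob@example.com or alice@example.org today"));
    }
}
```

Note that hasNext(Pattern) tests the whole token, so a token with trailing punctuation such as "bob@example.com," would be skipped by this sketch.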

Another problem is your regex doesn't work. Characters inside character classes (the []) in general don't mean the same thing they would outside a character class. '+' just means a literal '+', not "one or more". And "\\w" means a literal \ and a literal w. Instead, \\w is intended to represent a word character, when used outside character class braces. So just drop those braces. Your pattern

"[\\w+]+@[\\w+]\\.\\w{2,4}"

will probably work better as

"\\w+@\\w+\\.\\w{2,4}"

I haven't tested that - in general I would recommend testing your regexes independently of the rest of the program. There's enough that can go wrong in a regex, without involving the rest of the program.
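
One way to test a regex on its own, as suggested above, is a tiny standalone class that runs each candidate pattern against a few sample strings. This is only a sketch; the class name RegexCheck, the helper matchesWhole and the sample strings are assumptions made up for illustration.

```java
import java.util.regex.Pattern;

class RegexCheck {
    // The two patterns quoted in this post.
    static final Pattern ORIGINAL = Pattern.compile("[\\w+]+@[\\w+]\\.\\w{2,4}");
    static final Pattern SUGGESTED = Pattern.compile("\\w+@\\w+\\.\\w{2,4}");

    // True only if the whole string matches the pattern.
    static boolean matchesWhole(Pattern pattern, String candidate) {
        return pattern.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"bob@example.com", "a@b.co", "not-an-address"};
        for (String s : samples) {
            System.out.printf("%-16s original=%-5b suggested=%b%n",
                    s, matchesWhole(ORIGINAL, s), matchesWhole(SUGGESTED, s));
        }
    }
}
```

With these samples the original pattern rejects bob@example.com, because the second [\\w+] has no quantifier and so allows only a single character between the @ and the dot.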


"I'm not back." - Bill Harding, Twister
Iain Emsley
Ranch Hand

Joined: Oct 11, 2007
Posts: 60
D'oh, thanks for the hasNext() and next(), I thought I'd got something wrong. I'll sort those out before getting back to work on the regex.
Iain Emsley
Ranch Hand

Joined: Oct 11, 2007
Posts: 60
Sorted it out and got the code working with findWithinHorizon
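
The code from this post did not survive in the thread, so the following is only a rough sketch of what a findWithinHorizon approach can look like, not the poster's actual solution. The class name HorizonSearch, the method findEmails and the sample text are hypothetical, and the pattern is the simple one suggested earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class HorizonSearch {
    // Simple illustrative pattern, compiled once.
    private static final Pattern EMAIL = Pattern.compile("\\w+@\\w+\\.\\w{2,4}");

    // Returns every match of EMAIL in the text, ignoring token delimiters.
    static List<String> findEmails(String text) {
        List<String> found = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            String match;
            // A horizon of 0 means "no limit": search to the end of the input.
            // findWithinHorizon returns null once nothing more is found.
            while ((match = scanner.findWithinHorizon(EMAIL, 0)) != null) {
                found.add(match);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findEmails("Contact: bob@example.com, or <alice@example.org>."));
    }
}
```

Unlike hasNext(Pattern), findWithinHorizon does not work token by token, so addresses surrounded by punctuation are still picked up. To read from a file, the Scanner could be constructed from a java.io.File instead of a String.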

Many thanks.
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
OK, that looks much better. However that pattern looks strangely overcomplex, and I don't think it does what you think it does. In particular:

([\\w+|\\.?]+)

Earlier I said:

Characters inside character classes (the []) in general don't mean the same thing they would outside a character class. '+' just means a literal '+', not "one or more". And "\\w" means a literal \ and a literal w. Instead, \\w is intended to represent a word character, when used outside character class braces.

This was partly incorrect, as it turns out that \\w does still get interpreted as a word character, even when used inside []. But several other special characters are not interpreted the way they would be outside braces. Let's look at each part of ([\\w+|\\.?]+):

\\w - a word character (this still works inside [])
+ - a literal +
| - a literal |
\\. - a literal .
? - a literal ?

In other words, because they're being used inside the [], + does not mean "one or more", | does not mean "or", and ? does not mean "zero or one". Because you include \\w, the expression does end up matching most of what you want it to, but it also matches many strange things which are not part of e-mail addresses, like +|?.
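
A quick way to see this for yourself is to run the character class on its own against a few strings. This is a minimal sketch; the class name CharClassDemo, the helper matchesWhole and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // The character class from the pattern under discussion, followed by "one or more".
    static final Pattern CLASS = Pattern.compile("[\\w+|\\.?]+");

    // True only if the whole string consists of characters from the class.
    static boolean matchesWhole(String candidate) {
        return CLASS.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"bob", "a.b", "+|?", "a-b"};
        for (String s : samples) {
            System.out.println(s + " -> " + matchesWhole(s));
        }
    }
}
```

Here "bob", "a.b" and "+|?" all match, because +, | and ? are literal members of the class, while "a-b" does not, because - is not in the class.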

And there's really no apparent point to having this expression at all, because it's followed by the much simpler

\\w+

which matches one or more word characters. Which is what you actually want, isn't it?

Hm, actually an email can have a . in this section too. So you probably want something like this:

[\\w\\.]+
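
As a small sketch of the difference this makes, the following compares a plain \\w+ in front of the @ with [\\w\\.]+ in the same place. The rest of each pattern, the class name LocalPartDemo and the sample addresses are assumptions for illustration only.

```java
import java.util.regex.Pattern;

class LocalPartDemo {
    // Same pattern twice, differing only in the part before the @.
    static final Pattern WORD_ONLY = Pattern.compile("\\w+@\\w+\\.\\w{2,4}");
    static final Pattern WITH_DOTS = Pattern.compile("[\\w\\.]+@\\w+\\.\\w{2,4}");

    // True only if the whole string matches the pattern.
    static boolean matchesWhole(Pattern pattern, String candidate) {
        return pattern.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String address = "first.last@example.com";
        System.out.println("word only: " + matchesWhole(WORD_ONLY, address));
        System.out.println("with dots: " + matchesWhole(WITH_DOTS, address));
    }
}
```

Only the second pattern accepts first.last@example.com. Inside [] the backslash before the dot is optional, so [\\w.]+ behaves the same way. Bear in mind that the class also accepts odd input such as a run of dots, so this is still a loose match and not a validator.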

Later on, you have

[\\w+|\\.?]

where, again, the [] mean that the subsequent +, | and ? will be interpreted as literal +, |, and ?, which is probably not what you want.

And lastly,

\\w{2,8}\\w?

This means 2-8 word characters, followed by 0 or 1 word character. Isn't that the same as saying 2-9 word characters? Wouldn't it be simpler to just say that?

\\w{2,9}
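
If you want to convince yourself that the two forms are equivalent, a short check like the one below compares them on strings of different lengths. The single quantifier is written with curly braces, \\w{2,9}. The class name QuantifierDemo and its helper methods are hypothetical, and this is only a sketch.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    static final Pattern TWO_STEP = Pattern.compile("\\w{2,8}\\w?");
    static final Pattern ONE_STEP = Pattern.compile("\\w{2,9}");

    // Builds a string of n word characters, e.g. wordOfLength(3) is "aaa".
    static String wordOfLength(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('a');
        }
        return sb.toString();
    }

    // True if both patterns give the same whole-string verdict for length n.
    static boolean agreeOnLength(int n) {
        String s = wordOfLength(n);
        return TWO_STEP.matcher(s).matches() == ONE_STEP.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 12; n++) {
            String s = wordOfLength(n);
            System.out.println(n + " chars: two-step=" + TWO_STEP.matcher(s).matches()
                    + " one-step=" + ONE_STEP.matcher(s).matches());
        }
    }
}
```

Both patterns accept lengths 2 to 9 and reject everything else, so the shorter form loses nothing.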

It takes some study and practice to get good with regular expressions, but it's worth the effort in the long run. You may want to check out this site as a good resource.
abhi jitnag
Greenhorn

Joined: Oct 17, 2010
Posts: 13
Very nice Iain Emsley .
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7807

abhi jitnag wrote:Very nice Iain Emsley .

Erm, you do realize you've revived a 5-year-old thread?

If you want the real McCoy, go to the horse's mouth.

Winston


Isn't it funny how there's always time and money enough to do it WRONG?
Articles by Winston can be found here
 