I'm trying to create a programme which will search a file (eventually a series of files) and extract email addresses from it, to be put into a database so that another programme can verify that a user exists. (It is definitely NOT for spam purposes, just in case somebody asks.)
I'm trying to do it in a series of classes and have started out with the one to open one file as a starter. It is reading the file correctly but when I put a pattern into the hasNext(), the programme does not create an output.
Am I using Scanner correctly, or is there a better way of getting the output (though it may be the regex that is shot!)? What I am trying to do is read possible emails out of the file (and, for now, print them out).
My next thing will be to write them to a database but I'm assuming that I can put this into a different class to deal with the db.
I'd be grateful for some help on the Scanner though to begin to understand what I need to do to fix it. Thanks.
You're not using the Scanner correctly. The problem is that your hasNext() and next() do not match. The hasNext(Pattern) is looking for a token that matches the Pattern, but the next() is looking for a token that is delimited by the delimiter (which is whitespace by default). If you use hasNext(Pattern), you should also use next(Pattern) to match.
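To make the token behaviour concrete, here is a minimal sketch, assuming the default whitespace delimiter. As far as I can tell from the Scanner docs, hasNext(Pattern) tests the whole next token against the pattern, so a loop driven only by hasNext(Pattern) ends at the first token that isn't a match, which would explain getting no output. The class name, the method names and the deliberately simple EMAIL pattern are made up for illustration; they are not the original poster's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class ScannerTokenDemo {
    // Hypothetical, deliberately loose pattern; for illustration only.
    static final Pattern EMAIL = Pattern.compile("[\\w.]+@[\\w.]+");

    // Loop driven by hasNext(Pattern): it stops at the first token that
    // does not match, so it returns nothing if the text starts with a word.
    static List<String> stopAtFirstMismatch(String text) {
        List<String> found = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            while (sc.hasNext(EMAIL)) {
                found.add(sc.next(EMAIL));
            }
        }
        return found;
    }

    // Loop driven by hasNext(): every token is visited, matching tokens
    // are kept and the rest are consumed and discarded.
    static List<String> skipMismatches(String text) {
        List<String> found = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            while (sc.hasNext()) {
                if (sc.hasNext(EMAIL)) {
                    found.add(sc.next(EMAIL));
                } else {
                    sc.next(); // not an email: skip this token
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Contact bob@example.com or alice@example.org today";
        System.out.println(stopAtFirstMismatch(text)); // prints []
        System.out.println(skipMismatches(text));      // prints [bob@example.com, alice@example.org]
    }
}
```

Note that this only finds addresses that stand alone as whitespace-delimited tokens; an address wrapped in punctuation such as <bob@example.com> would be missed by this approach.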
Also, there's no need to compile the Pattern each time you use it. It never changes, so just compile it once, and reuse it.
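A rough sketch of what "compile it once" could look like: keep the compiled Pattern in a static final field and reuse it for every token. The class name, the method name and the placeholder regex are assumptions of mine, not taken from the original code.

```java
import java.util.regex.Pattern;

class EmailPatternHolder {
    // Compiled once when the class is loaded, then shared by every call.
    // The regex itself is only a placeholder for illustration.
    private static final Pattern EMAIL = Pattern.compile("[\\w.]+@[\\w.]+");

    // Reuses the precompiled pattern instead of calling Pattern.compile each time.
    static boolean looksLikeEmail(String token) {
        return EMAIL.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("bob@example.com")); // prints true
        System.out.println(looksLikeEmail("bob"));             // prints false
    }
}
```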
Another problem is your regex doesn't work. Characters inside character classes (the [ ]) in general don't mean the same thing they would outside a character class. '+' just means a literal '+', not "one or more". And "\\w" means a literal \ and a literal w. Instead, \\w is intended to represent a word character, when used outside character class braces. So just drop those braces. Your pattern
will probably work better as
I haven't tested that - in general I would recommend testing your regexes independently of the rest of the program. There's enough that can go wrong in a regex, without involving the rest of the program.
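One way to test a regex on its own is a tiny throwaway class that runs a candidate pattern against a few hand-picked strings, with no Scanner or file handling involved. This is only a sketch; the class name, the candidate regex and the sample strings are placeholders I picked, not the patterns from the thread.

```java
import java.util.regex.Pattern;

class RegexCheck {
    // True only if the whole input matches the regex.
    static boolean fullyMatches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String candidate = "[\\w.]+@[\\w.]+"; // placeholder regex under test
        String[] samples = { "bob@example.com", "a.b@c.d", "no-at-sign", "" };
        for (String s : samples) {
            // Print each sample next to the verdict so surprises stand out.
            System.out.println("\"" + s + "\" -> " + fullyMatches(candidate, s));
        }
    }
}
```

Swapping in a different candidate string and adding awkward samples (trailing dots, plus signs, missing domains) is usually quicker than debugging the regex through the full programme.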
"I'm not back." - Bill Harding, Twister
Joined: Oct 11, 2007
D'oh, thanks for the hasNext() and next(), I thought I'd got something wrong. I'll sort those out before getting back to work on the regex.
Joined: Oct 11, 2007
Sorted it out and got the code working with findWithinHorizon
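For anyone reading along, a minimal sketch of how findWithinHorizon can be used for this kind of extraction. It ignores the delimiter and searches the input directly, and a horizon of 0 means "no limit". The class name, method name and the simplified EMAIL pattern are my own stand-ins, not the code that was posted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class HorizonDemo {
    // Simplified stand-in pattern: dot-separated words, '@', then a dotted domain.
    static final Pattern EMAIL = Pattern.compile("\\w+(\\.\\w+)*@\\w+(\\.\\w+)+");

    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            String hit;
            // findWithinHorizon returns null once no further match exists;
            // each successful call advances the scanner past that match.
            while ((hit = sc.findWithinHorizon(EMAIL, 0)) != null) {
                found.add(hit);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Contact <bob@example.com>, or alice@example.org.";
        System.out.println(extract(text)); // prints [bob@example.com, alice@example.org]
    }
}
```

Because the search is not token-based, addresses surrounded by punctuation are still picked up, which a hasNext(Pattern) loop would miss.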
Joined: Jan 30, 2000
OK, that looks much better. However that pattern looks strangely overcomplex, and I don't think it does what you think it does. In particular:
Earlier I said:
Characters inside character classes (the [ ]) in general don't mean the same thing they would outside a character class. '+' just means a literal '+', not "one or more". And "\\w" means a literal \ and a literal w. Instead, \\w is intended to represent a word character, when used outside character class braces.
This was partly incorrect, as it turns out that \\w does still get interpreted as a word character, even when used inside [ ]. But several other special characters are not interpreted the way they would be outside the brackets. Let's look at each part of ([\\w+|\\.?]+):
In other words, because they're being used inside the [ ], + does not mean "one or more", | does not mean "or", and ? does not mean "zero or one". Because you include \\w, the expression does end up matching most of what you want it to, but it also matches many strange things which are not part of e-mail addresses, like +|?.
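A quick check of that claim, as a sketch: the character class from the pattern above accepts the string +|? as a whole, whereas a class containing only \\w and a dot does not. The class name and method name are made up for the example.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Thin wrapper so the checks below read naturally.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Inside [ ], the characters +, | and ? are literals.
        System.out.println(matches("[\\w+|\\.?]+", "+|?"));       // prints true
        System.out.println(matches("[\\w+|\\.?]+", "bob.smith")); // prints true

        // A class with only word characters and '.' rejects the odd input.
        System.out.println(matches("[\\w.]+", "+|?"));            // prints false
        System.out.println(matches("[\\w.]+", "bob.smith"));      // prints true
    }
}
```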
And there's really no apparent point to having this expression at all, because it's followed by the much simpler
which matches one or more word characters. Which is what you actually want, isn't it?
Hm, actually an email can have a . in this section too. So you probably want something like this:
Later on, you have
where, again, the [ ] mean that the subsequent +, | and ? will be interpreted as literal +, |, and ?, which is probably not what you want.
This means 2-8 word characters, followed by 0 or 1 word character. Isn't that the same as saying 2-9 word characters? Wouldn't it be simpler to just say that?
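If you want to convince yourself of that, here is a small sanity check. I'm assuming the fragment in question is equivalent to \\w{2,8}\\w? (that is how the description above reads), and comparing it with \\w{2,9} on runs of word characters of every length from 0 to 12. The class and method names are invented for the example, and this is a spot check rather than a proof.

```java
import java.util.regex.Pattern;

class QuantifierEquivalence {
    // Assumed form of the fragment under discussion.
    static final Pattern TWO_STEP = Pattern.compile("\\w{2,8}\\w?");
    // The simpler form suggested above.
    static final Pattern ONE_STEP = Pattern.compile("\\w{2,9}");

    // Builds a string of n word characters, e.g. "aaa" for n = 3.
    static String wordOfLength(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('a');
        }
        return sb.toString();
    }

    // True if both patterns give the same full-match verdict for every length 0..maxLength.
    static boolean agreeUpTo(int maxLength) {
        for (int n = 0; n <= maxLength; n++) {
            String s = wordOfLength(n);
            if (TWO_STEP.matcher(s).matches() != ONE_STEP.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreeUpTo(12)); // prints true
    }
}
```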
It takes some study and practice to get good with regular expressions, but it's worth the effort in the long run. You may want to check out this site as a good resource.