regex versus tokenizer

ilteris kaplan
Ranch Hand

Joined: Jan 21, 2006
Posts: 38
Hello guys,

I am trying to decide whether to use regex or a tokenizer object, and I would like your feedback. Here is my scenario: I send basic queries to Google with keywords like "red is the color" or "red is associated", put the result URLs in a LinkedList, and then crawl those pages.

I am looking for these keywords in those HTML pages. For example, if one sentence is "Red is the color bla bla bla bla." I want to grab that sentence and put it in an array to use later.

I have stripped the HTML tags without problems. The problem is that the keywords sometimes come at the beginning of the sentence, sometimes in the middle, and sometimes at the end, so when I tried to match them with a regex I couldn't figure out how to make the text around them match optionally. I haven't tried a tokenizer, but it has been suggested to me and I am interested. I have heard it is deprecated, though. Is that true?

I hope I am making sense. What do you guys think? Which path should I follow?

best
ilteris kaplan
Rusty Shackleford
Ranch Hand

Joined: Jan 03, 2006
Posts: 490
StringTokenizer is not formally deprecated, but its Javadoc describes it as a legacy class that is retained for compatibility, and its use is discouraged in new code. It will likely always be available, but better options exist (regex and String.split()).
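
As a rough illustration of the String.split() option, here is a minimal sketch, assuming sentences end with a period followed by whitespace. The class name SplitExample and the method sentencesContaining are made up for this example and are not from the thread. Note that it does a plain substring check, so "red" would also hit "bored".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SplitExample {

    // Hypothetical helper: returns the sentences of text that contain keyword,
    // ignoring case. Assumes a sentence ends with '.' followed by whitespace.
    static List<String> sentencesContaining(String text, String keyword) {
        List<String> hits = new ArrayList<>();
        String needle = keyword.toLowerCase(Locale.ROOT);
        // Split on whitespace that directly follows a period, so each
        // piece keeps its own trailing period.
        for (String sentence : text.split("(?<=\\.)\\s+")) {
            if (sentence.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(sentence.trim());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Red is the color of fire. Blue is calm. I like red.";
        for (String s : sentencesContaining(text, "red")) {
            System.out.println(s);
        }
    }
}
```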


"Computer science is no more about computers than astronomy is about telescopes" - Edsger Dijkstra
Garrett Rowe
Ranch Hand

Joined: Jan 17, 2006
Posts: 1296
...when I try to match them through regex I couldn't figure out how to make them match optionally.


What kind of problems are you running into?


Some problems are so complex that you have to be highly intelligent and well informed just to be undecided about them. - Laurence J. Peter
ilteris kaplan
Ranch Hand

Joined: Jan 21, 2006
Posts: 38
Thanks for the response. I couldn't figure out how to match the type of sentence I want. For example, let's say I supply the word "pink" as my variable and I am looking for sentences that have pink in them, as in the source text below (basically, I want to get the sentences that have pink somewhere):

Pink is a combination of red and white. The quality of energy in pink is determined by how much red is present. White is the potential for fullness, while red helps you to achieve that potential. Pink combines these energies. Shades of deep pink, such as magenta, are effective in neutralizing disorder and violence. Some prisons use limited deep pink tones to diffuse aggressive behaviour.


and I want a regex that matches this. This is how far I have come:


So I am trying to get the words before pink, if there are any, then pink itself, and then any words after pink up to the period.
Garrett Rowe
Ranch Hand

Joined: Jan 17, 2006
Posts: 1296
See if this fits your needs:


1) turn on flags for case insensitive and multi-line matching
2) Start with (but don't capture) the beginning of the input or a period followed by a space, via a look-behind
3) Start of capture (group 1)
4) match anything but a period, 0 or more times
5) match the word
6) same as 4
7) match the end of the input or a period
8) end of capture (group 1)
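
The eight steps above can be sketched roughly as follows. This is a reconstruction from the step list, not necessarily the exact pattern that was posted. The class name SentenceFinder and the method findSentences are hypothetical, and the \b word boundaries and Pattern.quote() are additions that assume the keyword is a plain word such as "pink".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SentenceFinder {

    // Hypothetical helper: returns every sentence in text that contains word.
    static List<String> findSentences(String text, String word) {
        String regex =
              "(?<=^|\\. )"                        // 2) look-behind: start of input/line, or ". "
            + "("                                  // 3) start of capture (group 1)
            + "[^.]*"                              // 4) anything but a period, 0 or more times
            + "\\b" + Pattern.quote(word) + "\\b"  // 5) the word (boundaries are an assumption)
            + "[^.]*"                              // 6) same as 4
            + "(?:$|\\.)"                          // 7) end of input/line, or a period
            + ")";                                 // 8) end of capture (group 1)

        // 1) case-insensitive and multi-line flags
        Pattern pattern = Pattern.compile(
                regex, Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

        List<String> sentences = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            sentences.add(matcher.group(1));
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Pink is a combination of red and white. "
                + "White is the potential for fullness. "
                + "Some prisons use limited deep pink tones.";
        for (String s : findSentences(text, "pink")) {
            System.out.println(s);
        }
    }
}
```

Because [^.]* cannot cross a period, a sentence without the keyword simply fails to match and the search moves on to the next sentence start.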
[ April 12, 2006: Message edited by: Garrett Rowe ]