regex versus tokenizer

 
Ranch Hand
Posts: 38
Hello guys,

I am at a point where I need to decide whether to go with regex or a tokenizer object, and I would like your feedback. Here is my scenario: I am sending basic queries to Google with keywords like "red is the color" or "red is associated", putting the result URLs in a LinkedList, and then crawling those pages.

I am looking for these keywords in those HTML pages. For example, if one sentence is "Red is the color bla bla bla bla." I want to grab that sentence and put it in an array to use later.

I have successfully stripped the HTML tags without problems, but the problem I am having is that those keywords sometimes come at the beginning of the sentence, sometimes in the middle, and sometimes at the end. So when I try to match them through regex I couldn't figure out how to make them match optionally. I haven't tried using a tokenizer, but someone suggested it to me and I am interested. I have heard it is deprecated though, true?

I hope I am making sense. What do you guys think? What kind of path should I follow?

best
ilteris kaplan
 
Ranch Hand
Posts: 490
StringTokenizer is not formally deprecated, but its Javadoc describes it as a legacy class whose use is discouraged in new code, so it is best avoided. Granted, it will likely always be available, but better options exist (regex and String.split()).
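
For what it's worth, here is a minimal sketch of the String.split() route, assuming sentences end with a period followed by whitespace. The class and method names (SplitSentences, sentencesWith) are made up for illustration, and the sentence-boundary rule is a simplification that will trip on abbreviations like "e.g. this".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: not from the original thread.
class SplitSentences {

    // Returns every sentence of text that contains keyword, ignoring case.
    static List<String> sentencesWith(String text, String keyword) {
        List<String> hits = new ArrayList<>();
        String needle = keyword.toLowerCase(Locale.ROOT);

        // Split on whitespace that follows a period. The look-behind keeps
        // the period attached to the sentence it ends.
        for (String sentence : text.split("(?<=\\.)\\s+")) {
            if (sentence.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(sentence.trim());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Red is the color of fire. Blue is calm. "
                    + "Some say red is associated with anger.";
        for (String s : sentencesWith(text, "red is")) {
            System.out.println(s);
        }
    }
}
```

Because the keyword test is a plain contains() on each sentence, it does not matter whether the keyword sits at the start, middle, or end of the sentence.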
 
Ranch Hand
Posts: 1296

...when I try to match them through regex I couldn't figure out how to make them match optionally.



What kind of problems are you running into?
 
ilteris kaplan
Ranch Hand
Posts: 38
Thanks for the response. I couldn't figure out how to match the type of sentence I want. For example, let's say I am supplying the word pink as my variable and I am looking for sentences that have pink in them, as in the source text below. (Basically I want to get every sentence that has pink somewhere in it.)

Pink is a combination of red and white. The quality of energy in pink is determined by how much red is present. White is the potential for fullness, while red helps you to achieve that potential. Pink combines these energies. Shades of deep pink, such as magenta, are effective in neutralizing disorder and violence. Some prisons use limited deep pink tones to diffuse aggressive behaviour.


I want to have a regex that matches this. This is how far I have come:


So I am trying to get the words before pink, if there are any, then pink itself, and then any words after pink up to the period.
 
Garrett Rowe
Ranch Hand
Posts: 1296
See if this fits your needs:


1) turn on flags for case insensitive and multi-line matching
2) Start with (but don't capture) the beginning of the input or a period followed by a space, via a look-behind
3) Start of capture (group 1)
4) match anything but a period, 0 or more times
5) match the word
6) same as 4
7) match the end of the input or a period
8) end of capture (group 1)
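
The pattern these steps describe is not shown above, so here is one way the eight steps could be put together in Java. Treat it as a reconstruction from the list, not necessarily the pattern originally posted. The class and method names (SentenceFinder, sentencesContaining) are hypothetical, and the \b word boundaries around the keyword are an extra assumption so that "pink" does not also match "pinkish".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper reconstructed from the numbered steps.
class SentenceFinder {

    static List<String> sentencesContaining(String text, String word) {
        Pattern pattern = Pattern.compile(
              "(?im)"                                // 1) case-insensitive, multi-line
            + "(?<=^|\\. )"                          // 2) look-behind: start, or ". "
            + "("                                    // 3) start of group 1
            + "[^.]*"                                // 4) anything but a period
            + "\\b" + Pattern.quote(word) + "\\b"    // 5) the word (boundaries assumed)
            + "[^.]*"                                // 6) same as 4
            + "(?:$|\\.)"                            // 7) end of input/line or a period
            + ")");                                  // 8) end of group 1

        List<String> sentences = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            sentences.add(matcher.group(1).trim());
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Pink is a combination of red and white. "
                    + "White is the potential for fullness. "
                    + "Pink combines these energies.";
        for (String s : sentencesContaining(text, "pink")) {
            System.out.println(s);
        }
    }
}
```

Pattern.quote() keeps the supplied word from being read as regex syntax, which matters once the keyword comes from a variable. Since [^.] stops at every period, abbreviations and decimal numbers will cut sentences short.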
[ April 12, 2006: Message edited by: Garrett Rowe ]
 