JavaRanch » Java Forums » Java » Java in General
Filtering string

Mat Banik
Ranch Hand

Joined: Jan 16, 2004
Posts: 57

Hello fellow ranchers,

This is the scenario:

I have a string:

"In this game you play as a young man coming to Leaf Valley to start a farm, there you find the"

I have a file with a list of keywords:

play
man
bolt
hula

boolean detectWords(String toBeFiltered, File fileOfKeywords);

So basically I just want to find out whether any of the keywords are present in the string, and return true on the first occurrence.
I imagine writing this algorithm would be easy enough, but I don't trust myself to write an efficient one.

I had in mind this:



I want to know if there is a library that would let me run a method like that very efficiently, better yet with native methods.
I know about com.eaio.stringsearch.StringSearch, but that is far past my level of understanding.

Any and all help would be greatly appreciated.

mat

SCJP 1.4 - 83%, SCJD 360-90%,
ITeezy.com
Wouter Oet
Saloon Keeper

Joined: Oct 25, 2008
Posts: 2700

Look at List.retainAll.


"Any fool can write code that a computer can understand. Good programmers write code that humans can understand." --- Martin Fowler
Please correct my English.
Mat Banik
Ranch Hand

Joined: Jan 16, 2004
Posts: 57

If I have to break up the string into a list, then I would be better off working with StringBuffer or StringBuilder.
List-to-List operations are a nice thing, but they don't apply to String objects. I know collections have a ton of stuff to use. What I'm looking for is more in the range of text handling.
Wouter Oet
Saloon Keeper

Joined: Oct 25, 2008
Posts: 2700

About the StringBuffer: a sentence with N words gives you N Strings. That is not really a lot of Strings.
Can you then explain why List<String> would not work?
Mat Banik
Ranch Hand

Joined: Jan 16, 2004
Posts: 57

This is really intended to filter 20,000-character-long strings on an application server against a dictionary with a couple of thousand words. Therefore performance is an issue. The sentence is just an example.
It doesn't feel efficient the way I envision it, but could you give an example of pseudocode that describes how to break up the "sentence" into a List<String>, assuming there are multiple delimiters between words, not just whitespace?
Also, I had in mind something more complex that would be able to memorize words that already passed the test and would use pre-fetching to improve performance.
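
For the "return true on first occurrence" requirement, one possible shape is a single pass over the text that checks each word against a hash set and stops at the first hit. This is only a sketch under assumptions that are not in the thread: keywords are already loaded from the file into a lower-cased Set<String>, a "word" is any run of letters or digits (so every other character counts as a delimiter), and matching is whole-word and case-insensitive. The class name KeywordScan is made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class KeywordScan {

    // Returns true as soon as one word of the text is found in the keyword set.
    // Assumption: keywords are stored in lower case.
    static boolean detectWords(String toBeFiltered, Set<String> keywords) {
        int start = -1; // index where the current word began, -1 while between words
        int length = toBeFiltered.length();
        for (int i = 0; i <= length; i++) {
            boolean inWord = i < length
                    && Character.isLetterOrDigit(toBeFiltered.charAt(i));
            if (inWord) {
                if (start < 0) {
                    start = i; // a new word starts here
                }
            } else if (start >= 0) {
                // A word just ended: look it up and stop early on a match.
                String word = toBeFiltered.substring(start, i).toLowerCase(Locale.ROOT);
                if (keywords.contains(word)) {
                    return true;
                }
                start = -1;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("play", "man", "bolt", "hula"));
        String text = "In this game you play as a young man coming to Leaf Valley";
        System.out.println(detectWords(text, keywords)); // prints "true"
    }
}
```

Because a HashSet lookup takes roughly constant time, remembering words that already passed the test would probably save little here; the keyword set itself does most of the work.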
Wouter Oet
Saloon Keeper

Joined: Oct 25, 2008
Posts: 2700

Something like this: Substring reuses the string value thus memory consumption is reduced.
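
The code originally posted here is not preserved. A minimal sketch of the List.retainAll idea could look like the following; it is an illustration, not the original sample. Assumptions: the text is split on runs of non-word characters with the regex \W+, keywords are lower-cased, and the class and method names are hypothetical. Note that on current JDKs substring copies its characters instead of sharing the parent string's array, so the memory remark above applies to older JVMs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;

class RetainAllSketch {

    // Returns the words of the text that also appear in the keyword collection,
    // in the order (and with the repetitions) in which they occur in the text.
    static List<String> matchedKeywords(String text, Collection<String> keywords) {
        // Arrays.asList is fixed-size, so copy into an ArrayList before removing.
        List<String> words = new ArrayList<>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+")));
        words.retainAll(keywords); // drops every word that is not a keyword
        return words;
    }

    public static void main(String[] args) {
        Collection<String> keywords = new HashSet<>(Arrays.asList("play", "man", "bolt", "hula"));
        String text = "In this game you play as a young man coming to Leaf Valley";
        System.out.println(matchedKeywords(text, keywords)); // prints "[play, man]"
    }
}
```

Passing the keywords as a HashSet rather than a List keeps each contains check inside retainAll close to constant time, which matters with a dictionary of a few thousand words. A non-empty result means at least one keyword was found.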
Mat Banik
Ranch Hand

Joined: Jan 16, 2004
Posts: 57

I will give this a try. I didn't realize that only the words from the file would stay in the collection. I guess I'm too tired today. This is even better than I thought. I'm working on a scheduled profanity filter that will notify the admin via email that there is something somewhere that needs attention.
Thanks for the sample code. I'll see if I can find some faster native libraries to do the job you sampled, since I'm talking about several gigabytes of MySQL database data that will be processed every night.
Wouter Oet
Saloon Keeper

Joined: Oct 25, 2008
Posts: 2700

If you already have a database, why don't you let your database solve it for you? It's just a join on words. Add an index and it will be pretty fast.
Mat Banik
Ranch Hand

Joined: Jan 16, 2004
Posts: 57

I use Hibernate and I'm not an experienced programmer with it either. I pretty much run select and update requests.
I have no idea how to do a join and an index on TEXT fields that are 20 KB each. Do you have a link where I could take a look at an example?
Thank you so much for sticking around and helping me.
Wouter Oet
Saloon Keeper

Joined: Oct 25, 2008
Posts: 2700

I'm not either, and I don't know how to do it with Hibernate (I don't think Hibernate supports a string split function, because not all databases support that feature).
But here is an example for PostgreSQL:


But if that is not going to work out you could try a batch processing approach based on the example I gave you earlier.
Mat Banik
Ranch Hand

Joined: Jan 16, 2004
Posts: 57

I tried some native stuff with regexp_split_to_table and it seems to be very slow with MySQL. With in-memory handling and a few tweaks to the JVM memory size, though, it is pretty fast.

Thank you for the tips Wouter.
Mat Banik
Ranch Hand

Joined: Jan 16, 2004
Posts: 57

Does anybody know where to get a 64-bit JVM 6.0 for 64-bit CentOS 5?

How do you configure Tomcat to run on a 64-bit JVM 6.0?
 