
Filtering string

 
Mat Banik
Ranch Hand
Posts: 57
Google Web Toolkit Hibernate Java
Hello fellow ranchers,

This is the scenario:

I have a string:

"In this game you play as a young man coming to Leaf Valley to start a farm, there you find the"

I have a file with a list of keywords:

play
man
bolt
hula

boolean detectWords(String toBeFiltered, File fileOfKeywords);

So basically I just want to find out whether any of the keywords from the file are present within the string, and return true on the first occurrence.
I imagine writing this algorithm would be easy enough, but I don't trust myself to write an efficient one.

I had in mind this:



I want to know if there is a library that would let me run a method like that very efficiently, better yet with native methods.
I know about com.eaio.stringsearch.StringSearch, but that is far past my level of understanding.

Any and all help would be greatly appreciated.

mat
 
Wouter Oet
Saloon Keeper
Posts: 2700
IntelliJ IDE Opera
Look at List.retainAll.
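
For readers who have not used it: retainAll removes from a collection every element that is not also in the other collection. A minimal sketch, assuming the sentence has already been split into words (the class and method names here are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RetainAllDemo {
    // Hypothetical helper: returns the words that also appear in keywords.
    static List<String> matches(List<String> words, List<String> keywords) {
        List<String> found = new ArrayList<>(words); // copy, so the caller's list is untouched
        found.retainAll(keywords);                   // keep only elements also present in keywords
        return found;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("you", "play", "as", "a", "young", "man");
        List<String> keywords = Arrays.asList("play", "man", "bolt", "hula");
        System.out.println(matches(words, keywords)); // prints [play, man]
    }
}
```

If the result is non-empty, at least one keyword occurred in the text. For a dictionary of a couple of thousand words, passing a HashSet as the argument to retainAll would probably be the better choice, since each lookup is then a hash lookup rather than a list scan.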
 
Mat Banik
Ranch Hand
Posts: 57
Google Web Toolkit Hibernate Java
If I had to break up the string into a list, then I would be better off working with StringBuffer or StringBuilder.
A List-to-List operation is a nice thing, but it doesn't apply to String objects. I know collections have a ton of stuff to use. What I'm looking for is more in the range of text processing.
 
Wouter Oet
Saloon Keeper
Posts: 2700
IntelliJ IDE Opera
About the StringBuffer: you will get N words from a sentence with N words. That is not really a lot of Strings.
Can you then explain why List<String> would not work?
 
Mat Banik
Ranch Hand
Posts: 57
Google Web Toolkit Hibernate Java
This is really intended to filter 20,000-character-long strings on an application server against a dictionary with a couple of thousand words. Therefore performance is an issue. The sentence is just an example.
It doesn't feel efficient the way I envision it, but could you give an example of pseudocode that would describe how to break up the "sentence" into a List<String>, assuming there are multiple delimiters between words, not just whitespace?
Also, I had in mind something more complex that would be able to memorize words that already passed the test and would use pre-fetching to improve performance.
 
Wouter Oet
Saloon Keeper
Posts: 2700
IntelliJ IDE Opera
Something like this:

Substring reuses the string value, thus memory consumption is reduced.
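
A minimal sketch of this kind of approach, not the code originally posted: walk the text once, treat every non-letter character as a delimiter, cut each word out with substring, then use retainAll against the keywords. The names WordSplitter, words and detectWords(String, Set) are assumptions for illustration, as are whole-word, case-insensitive matching and lower-case keywords.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class WordSplitter {
    // Collects the distinct lower-cased words of text.
    // Assumption: any character that is not a letter separates words.
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        int start = -1; // index where the current word began, or -1 when between words
        for (int i = 0; i <= text.length(); i++) {
            boolean letter = i < text.length() && Character.isLetter(text.charAt(i));
            if (letter && start < 0) {
                start = i;                                   // a new word begins here
            } else if (!letter && start >= 0) {
                result.add(text.substring(start, i).toLowerCase(Locale.ROOT));
                start = -1;                                  // the word ended at i
            }
        }
        return result;
    }

    // True if at least one keyword occurs in text as a whole word.
    // Assumption: keywords are already lower-case.
    static boolean detectWords(String text, Set<String> keywords) {
        Set<String> found = words(text);
        found.retainAll(keywords); // only the words from the keyword set stay
        return !found.isEmpty();
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("play", "man", "bolt", "hula"));
        String text = "In this game you play as a young man coming to Leaf Valley";
        System.out.println(detectWords(text, keywords)); // prints true
    }
}
```

Because matching is on whole words, a text containing only "manual" would not be flagged by the keyword "man" in this sketch.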
 
Mat Banik
Ranch Hand
Posts: 57
Google Web Toolkit Hibernate Java
I will give this a try. I didn't realize that only the words from the file would stay in the collection. I guess I'm too tired today. This is even better than I thought. I'm working on a scheduled profanity filter that will notify the admin via email that there is something somewhere that needs attention.
Thanks for the sample code. I'll see if I can find some faster native libraries to do the job you sampled, since I'm talking about several gigabytes of MySQL database data that will be processed every night.
 
Wouter Oet
Saloon Keeper
Posts: 2700
IntelliJ IDE Opera
If you already have a database, why don't you let your database solve it for you? It's just a join on words. Add an index and it will be pretty fast.
 
Mat Banik
Ranch Hand
Posts: 57
Google Web Toolkit Hibernate Java
I use Hibernate, and I'm not an experienced programmer with it either. I pretty much just run select and update requests.
I have no idea how to do a join and an index on TEXT fields that are 20 KB each. Do you have a link where I could take a look at an example?
Thank you so much for sticking around and helping me.
 
Wouter Oet
Saloon Keeper
Posts: 2700
IntelliJ IDE Opera
I'm not either, and I don't know how to do it with Hibernate (I don't think Hibernate supports a string-split function, because not all databases support that feature).
But here is an example for PostgreSQL:


But if that is not going to work out, you could try a batch processing approach based on the example I gave you earlier.
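
A rough sketch of what such a batch pass might look like, under stated assumptions: the keyword file has one word per line, the texts of one batch have already been fetched from the database as strings (the fetching itself is left out), and matching is whole-word and case-insensitive. BatchFilter, loadKeywords, containsKeyword and flag are hypothetical names, not part of any library.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class BatchFilter {
    // Reads the keyword file once; assumes one keyword per line, blank lines ignored.
    static Set<String> loadKeywords(Path file) throws IOException {
        Set<String> keywords = new HashSet<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                keywords.add(word);
            }
        }
        return keywords;
    }

    // Returns true on the first word of text that is a keyword.
    static boolean containsKeyword(String text, Set<String> keywords) {
        // Any run of non-letter characters counts as a delimiter.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (keywords.contains(word)) {
                return true;
            }
        }
        return false;
    }

    // Returns the positions within the batch of the texts that need attention.
    static List<Integer> flag(List<String> batch, Set<String> keywords) {
        List<Integer> flagged = new ArrayList<>();
        for (int i = 0; i < batch.size(); i++) {
            if (containsKeyword(batch.get(i), keywords)) {
                flagged.add(i);
            }
        }
        return flagged;
    }

    public static void main(String[] args) throws IOException {
        // With a file argument the keywords come from that file, otherwise from a sample set.
        Set<String> keywords = args.length > 0
                ? loadKeywords(Paths.get(args[0]))
                : new HashSet<>(Arrays.asList("play", "man", "bolt", "hula"));
        List<String> batch = Arrays.asList("a nice farm", "you play as a young man", "!!!");
        System.out.println(flag(batch, keywords));
    }
}
```

The keyword set is built once and reused for every text, so the cost per text is one pass over its words with a hash lookup each, stopping at the first hit.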
 
Mat Banik
Ranch Hand
Posts: 57
Google Web Toolkit Hibernate Java
I tried some native stuff with regexp_split_to_table and it seems to be very slow with MySQL. However, when I use in-memory handling, with a few tweaks to the JVM memory size, it is pretty fast.

Thank you for the tips Wouter.
 
Mat Banik
Ranch Hand
Posts: 57
Google Web Toolkit Hibernate Java
Does anybody know where to get a 64-bit JVM 6.0 for 64-bit CentOS 5?

How do you configure Tomcat to run on a 64-bit JVM 6.0?
 