File APIs for Java Developers
Manipulate DOC, XLS, PPT, PDF and many others from your application.
http://aspose.com/file-tools
The moose likes Performance and the fly likes Fast text searching Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of OCM Java EE 6 Enterprise Architect Exam Guide this week in the OCMJEA forum!
JavaRanch » Java Forums » Java » Performance
Bookmark "Fast text searching " Watch "Fast text searching " New topic
Author

Fast text searching

Nicholas Jordan
Ranch Hand

Joined: Sep 17, 2006
Posts: 1282
In topic: Topic: How to search for index quickly

William Brogden states: I keep hoping the original poster will come back and explain more about the problem.

Okay, I am not the original poster, but I can sure tell you more about my coding challenge, and am approaching the point where I will begin tweaking time-based responsiveness.

As several people expressed an interest in what we know to be productive area of work - expressing the wish to continue - I here post my design needs for the nearly complete first draft of the data model while I begin work on a GUI.

1. Does it start small and get added to as your program runs or does you program get handed a big initial list?
Lists are not especially big - often ten to one hundred words, usually but not always not likely to be modifiyed except at state change. The results are of two distinct types, a single line - ideally returning a *COPY* of a find from a LineNumberReader containing the file name, line number and a copy (strict copy semantics) of the line in which the find was obtained. As soon as a line is found to contain a keyword, the search may be abandoned for that file.

Later evolutions of the program will contain a more exhaustive search with scoring by word count, find value of keyword, and context sensitivity such as proximity to other keywords.

2. Does it get only deletions or are there also additions to the list?

At construction, lists are fixed. The operator of the program should be provided the ability to 'drop' a copy-paste operation onto the center area of a

which would then be de-composed into wordlists - which thus become new wordlists available for operator use. (persist from one program load to the next)

3. Does ordering matter ?

Ordering may be strictly subordinated to fast retrival. Retrieval may be ordered for presentation at the request of operator. In general fast search of massive directory listings for a moderate keyword list is an almost monolithic overshadowing of all other design goals. The first-draft uses a Hashtable, ignoring ordering totally during prototyping and that design decision has been expressly approved by floor managers. Ordering for presentation may be somewhat sluggish if it can be separated cleanly into a user-generated button push.

4. How frequent are changes to the list compared to queries for list position? In other words, will there be periods when the list is stable?

Lists, generally, do not change much.

In general, lists are constructed once. We may chose a different list for subsequent searches based on previous searches, but this "state transition" is user-generated and reasonble construction time may be alloted for the construction of new lists. Once a list is generated, they *MAY* be saved to a file. It may be necessary to subject such generated lists to a single round of 32-bit RSA to thwart intrusion by routine curiosity seekers and normal human error.

Queries for list position are nearly irrelevant, except for which list the keyword (or other search term, such as a phrase) generated the hit.

5. Are queries completely random or are there some lookups repeated frequently, justifying some sort of cache approach?

Full buffered lookahead reads on target text, tweaked to default page size for underlying platform - except that the words and phrases will likely be text from normal human talk. The queries are known at program load time, or are re-constructed and program re-load may be done. This phase of the project is a picker, and may achieve design sucess under the 80/20 rule if if catches 80% of the candidates. I call it Neutron Drag Retarders in Lightwave Tunnels. Imagine a sheath on a light-wave fiber: It must not hamper the primary feed. A business owner (the intended client) has a legitimate interest in spotting unwanted traffic on machines which are there to serve the business' regular duties. An 80% + pickoff rate would spot candidates for review by operators, thus allowing human intervention to override the impulse to try to do it all in code.

First prototype uses Regex's. I downloaded A Fast String Scanning Algorithm with Small Startup Overhead by Thomas Wang and The Boyer-Moore Fast String Searching Algorithm by J Strother Moore at U.T. Austin ( the Boyer-Moore algorithm ) in an attempt to speed up the central loop. I am not effective at reading these nor do I currently know how to incorporate jar files into my program as pre-written libraries.

Wordlists are generally known at program load-time, or are constructed during initialization. Re-configuration / re-loads may consume reasonble times, invoke the gc on previous runs (yes: suggest, not invoke the gc), and the list size may vary in unknown ways. We can make it a stipulation that wild variations in list size may require program re-write, and this has been expressly approved by shop managers. Once begun, the program works it's way through a folder and either exits or awaits a new search configuration by the operator. Folders may contain any number of files, which may be of any size. They (the files to be searched) should (are expected) yield to character conversion conventions of Java for searching capabilities. Binary files have been purposely put off into some unknown fog of hope.

[ for the OO / MVC proponents, I got my first taste of disentangling the GUI from the controller from the data model this morning when I wrote a hundered or so lines of code while I was waking up (which may explain some of my earlier posts         )]

[message edit: Tuesday, August 07, 2007, as follows ....]

The following code sample reflects my first attempt to work out the problem - though it may compile, reducing time-wasters for reviewers, it was written while I was waking up and only stabalizes in the sense of making a place to begin disussions.

Three approaches come to mind:
  • hash the string split from a readline and see if key.exists() in map
  • construct an ArrayList of Regex's from word lists and search each readline() with regexes in the Vector or ArrayList
  • do foreach on list, building up SortedList or SortedSet at construction and then do boolean exists = SortedCollection.contains();


  • There remains quite a bit of thought to go before coming up with a hashing re-write, but this emerged as a primary, immediate issue in this first coding attempt.


    [ August 07, 2007: Message edited by: Nicholas Jordan ]

    After contemplation this week, the design approach which seemed the best candidate would be to construct the hash codes for the words in a SortedSet in the constructor.

    [ This afternoon's work - Sunday, August 12, 2007 ]

    [ major portions of keytable deleted for readability - Saturday, September 15, 2007 ]
    [ September 15, 2007: Message edited by: Nicholas Jordan ]

    "The differential equations that describe dynamic interactions of power generators are similar to that of the gravitational interplay among celestial bodies, which is chaotic in nature."
     
    I agree. Here's the link: http://aspose.com/file-tools
     
    subject: Fast text searching