File APIs for Java Developers
Manipulate DOC, XLS, PPT, PDF and many others from your application.
http://aspose.com/file-tools
The moose likes I/O and Streams and the fly likes implementation ideas Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of Java 8 in Action this week in the Java 8 forum!
JavaRanch » Java Forums » Java » I/O and Streams
Bookmark "implementation ideas" Watch "implementation ideas" New topic
Author

implementation ideas

Ciri Bhoy
Greenhorn

Joined: Oct 20, 2011
Posts: 16
Hi all,

I'm working on a project where I have to search for word synonyms in a large thesaurus file, in the form of:

cat
bobcat, cheetah, cougar, grimalkin, jaguar, kitten,
kitty, leopard, lion, lynx, malkin, mouser, ocelot,
panther, puma, puss, pussy, tabby, tiger, tom, tomcat

dog
bitch, bowwow, cur, doggy, fido, flea bag, hound,
man's best friend, mongrel, mutt, pooch, pup, puppy,
stray, tail-wagger, tyke


At the moment I'm just searching the file using a Scanner and that works fine, albeit slightly slow. I'm just
wondering if anyone could suggest a more efficient method. Would creating 'word' objects containing their
corresponding values be a good option?? should I use serialization???

This is a small project that I'm working on just to gain some experience so I have complete freedom in my
implementation method and all suggestions are very welcome. I'm just hoping for a few pointers really.

Thanks in advance
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

It's not clear exactly what you're trying to do, or what's slow.

What I think you're saying is that you have to look up a word in a thesaurus file ("dog"), and then provide all that word's synonyms ("cur", "puppy", etc.), which, presumably, are stored with the word. Furthermore, I think that your performance problem is not finding the synonyms once you've found the word, but rather, just finding the word in the first place. Presumably you're doing a linear search from the start of the file each time.

And presumably you've also measured or at least observed that it is too slow to meet your requirements, yes?

There are many different ways to address that problem. Here are a a couple. I know nothing about your requirements, constraints, assumptions, or use cases, so I'm not advocating either of these. Just throwing them out there as ideas of thought processes and approaches.

  • Read the entire file in at startup, and define a Map that relates each lookup word with its synonyms. Simple, brute force, fast. May chew up an unacceptable amount of memory, although it really shouldn't be that much. Say, 100,000 words in the English language, average length 6 letters, each one has 10 synonyms. That's 600,000 * 11 letters = 6,600,000 letters, and since each letter is 2 bytes, that's around 13 MB for all the characters. Even if we double that for overhead, that's still only 26 MB.


  • Let's say your program runs in a mode where it starts up, looks up 1 or 2 or 3 words, and then shuts down. Now the above is not worth the overhead of reading the entire file each time we start up, since we're only going to look up a small number of words for that one read. So instead we create a separate file that has the byte indices of each word in the main file. We open the index file, scan aardvark, apple, bee, cat, dog! and find that this is at byte position 884 in the main file. We're stil doing a linear file search for each word, but we're only scanning the preceding lookup words themselves, not their synonyms, which is the bulk of the linear search time (if my assumptions about what you're doing and how the file is structure are correct).



  • Would creating 'word' objects containing their
    corresponding values be a good option?? should I use serialization???


    While either or both of those may be appropriate for some part of this project, you would choose them for design reasons. We don't go, "Oh, this is too slow, I'll define a class and use serialization and that will help."
    Ciri Bhoy
    Greenhorn

    Joined: Oct 20, 2011
    Posts: 16
    Thanks Jeff, despite my best attempts to muddy the waters you answered my question for me.

    Typically my app will startup, read one word and shut down again, so I think the second approach best suits what I'm trying to do.


    In regards to designing a class and using serialization, I think this was just something I 'wanted' to do more than something I 'should' do, as I
    said, this project is only to gain a little coding experience, so I'll probably implement it in a few different ways, but it's good to get ideas as to the
    most appropriate methods.

    Thanks again.
    Tony Docherty
    Bartender

    Joined: Aug 07, 2007
    Posts: 1950
        
      28
    In regards to designing a class and using serialization, I think this was just something I 'wanted' to do more than something I 'should' do, as I
    said, this project is only to gain a little coding experience, so I'll probably implement it in a few different ways, but it's good to get ideas as to the
    most appropriate methods.

    You will need to write some code to generate the index file which can either contain every word as Jeff suggested or just the index of the first word beginning with A, B, C etc. so you can step into the thesaurus file and search from that point onwards.
    You probably should have the program auto-generate this file if it is missing when the program is started. You may also want to include an option in the program to allow you to force a re-index in case you add/delete or edit any words in the thesaurus file.
    If you are looking coding experience you could do a combination of the 2 options Jeff gave by reading the index file into memory on start-up.
    Ciri Bhoy
    Greenhorn

    Joined: Oct 20, 2011
    Posts: 16
    Thanks a million Tony, much appreciated.
    Jeff Verdegan
    Bartender

    Joined: Jan 03, 2004
    Posts: 6109
        
        6

    One more possible approach: Binary search. Start at the middle byte of the file. Search forward until you find the next entry. If it's less than the one you're searching for, split the second half of the file in half and repeat. If it's after the one you're searching for, search backward to the previous entry, and if that's not what you're looking for, split the first half in half and repeat.

    This only works if the file is structured such that you can start in an arbitrary location and detect an entry when searching forward or backward. (Or at least forward, but then you might have to handle some special fence-post logic where you keep hopping back and forth around the word you're looking for.)
    Ciri Bhoy
    Greenhorn

    Joined: Oct 20, 2011
    Posts: 16
    thanks again fellas, it's great to get a few ideas to try out.
     
    I agree. Here's the link: http://aspose.com/file-tools
     
    subject: implementation ideas
     
    Similar Threads
    why the debug goes to a different project in my struts project?
    distributed entity beans and caching
    Singleton or Not?
    Maven 1.0.2 & eclipse: "build-all does not exist"
    Finally, I'm also another SCJEA