As an intro, I am working on a project for a 2nd year data structures class, and we are not permitted to use any libraries other than the Java API.
For this part of the project, I am building a word frequency tree for, basically, my school's whole domain, in order to create a search engine for it. I wrote a class that spiders through the HTML looking for hrefs and generates a list of all pages reachable from a seed page (the home page). For each page, it then builds a binary search tree whose nodes hold a word from the page and how often that word appears. A separate class, URLContent, holds the URL as a String together with the word frequency tree that goes with it.
Anyway, we're required to use a min-heap of URLContent objects (generated after a search for one or more keywords) in order to return the most relevant pages. However, I cannot, for the life of me, think of a good solution for the URLContent key. Essentially, the more relevant the page is to the search, the lower the key should be.
My brute-force idea is to bake a class-level integer variable into the URLContent class, and then subtract how often each of the search words appears from the initial value (say 100). However, this does not lend itself well to caching (the next part of my project).
1st question: Can anyone think of a good reason to use MinHeapPriorityQueue over a MaxHeapPriorityQueue here?
2nd question: Any supplemental ideas with key generation?
Why limit yourself to integers? Real numbers seem like an excellent key to look up elements by relevance.
Personally I would go for the average Levenshtein distance between each word on the page and the keyword. But I guess that's mostly an implementation detail. You can play around to see what would give you the best results.
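To make the Levenshtein idea concrete, here is a rough sketch using only the Java API. The class name LevenshteinKey and the method averageDistance are made up for illustration, and it assumes the page's words are already available as a list. A smaller average means the page's words are closer to the keyword, so the value could be used directly as a min-heap key.

```java
import java.util.List;

// Hypothetical helper, not part of the original project.
final class LevenshteinKey {

    // Edit distance between a and b, computed with two rolling rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Average distance from every word on the page to the keyword.
    // Assumption: an empty page is treated as maximally irrelevant.
    static double averageDistance(List<String> pageWords, String keyword) {
        if (pageWords.isEmpty()) {
            return Double.POSITIVE_INFINITY;
        }
        long total = 0;
        for (String word : pageWords) {
            total += distance(word, keyword);
        }
        return (double) total / pageWords.size();
    }
}
```

Note that this walks every word on every page for each search, so on a whole domain it may be slow compared with a plain frequency lookup.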
Regardless, you can start out with a Catalog class to which you can add URLs. When a URL is added, the catalog counts each word on the page and adds the result to your list of URLContents. Instead of that list, you could consider using a Map<URL, Map<String, Integer>>, or a Map<URL, WordTree> if you really must use your own binary search tree.
This Catalog class could then have a search(String keyword) method which returns a Heap<URL>, which prioritizes its elements according to the map structure inside the catalog.
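Roughly what I have in mind, as a sketch rather than a finished design. Some assumptions here: URLs are plain Strings, java.util.PriorityQueue (a min-heap) stands in for whatever heap class your assignment requires, the page text is passed in directly instead of being fetched, and the key is simply the negated total count of the search words. The method names add, key, and search are just placeholders.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch of the Catalog idea.
final class Catalog {

    // URL -> (word -> number of occurrences on that page)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // Counts the words in pageText and stores them under url.
    void add(String url, String pageText) {
        Map<String, Integer> freq = new HashMap<>();
        for (String word : pageText.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                freq.merge(word, 1, Integer::sum);
            }
        }
        counts.put(url, freq);
    }

    // Lower key means more relevant: the negated number of keyword hits.
    double key(String url, String... keywords) {
        Map<String, Integer> freq =
                counts.getOrDefault(url, Collections.<String, Integer>emptyMap());
        int hits = 0;
        for (String keyword : keywords) {
            hits += freq.getOrDefault(keyword.toLowerCase(), 0);
        }
        return -hits;
    }

    // Min-heap of URLs with at least one hit, most relevant first.
    PriorityQueue<String> search(String... keywords) {
        PriorityQueue<String> heap = new PriorityQueue<>(
                Comparator.comparingDouble((String url) -> key(url, keywords)));
        for (String url : counts.keySet()) {
            if (key(url, keywords) < 0) {
                heap.add(url);
            }
        }
        return heap;
    }
}
```

Because the key is computed per query from the stored counts, nothing query-specific has to live inside the page objects themselves, which might make the caching part easier. The comparator recomputes the key on every comparison, so you would probably want to compute it once per URL in a real version.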
I'm just spitballing here though.