File APIs for Java Developers
Manipulate DOC, XLS, PPT, PDF and many others from your application.
http://aspose.com/file-tools
The moose likes Java in General and the fly likes MinHeap Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of Soft Skills this week in the Jobs Discussion forum!
JavaRanch » Java Forums » Java » Java in General
Bookmark "MinHeap" Watch "MinHeap" New topic
Author

MinHeap

Brendan Rhoads
Greenhorn

Joined: Dec 05, 2011
Posts: 4
As an intro, I am working on a project for a 2nd year data structures class, and we are not permitted to use any libraries other than the Java API.

For my project-- this part of it anyway-- I am creating a word frequency tree of, basically, my school's whole domain, in order to create a search engine for it. I created a class to spider through and look for hrefs in html and generate a list of all reachable sites from a seed site (the home page) and then create a binary search tree with objects composed of a word from the site and how frequently it appears. Then, I have a separate class that contains the String of the URL and the word frequency tree that goes with it-- URLContent.

Anyway, we're required to use a minheap of URLContent objects (generated after the search of a keyword/words) in order to return the most relevant sites. However, I can not, for the life of me, think of a good solution for the URLContents' key. Essentially, the more relevant the search is, the lower the key should be.

My brute force idea is to bake a class level integer variable into the URLContent class-- and then subtract how often each of the search words appear from the initialized number (say 100). However, this does not lend itself well to caching(the next part of my project).

1st question: Can anyone think of a good reason to use MinHeapPriorityQueue over a MaxHeapPriorityQueue here?
2nd question: Any supplemental ideas with key generation?

Thanks!



Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3649
    
  17

Brendan Rhoads wrote:1st question: Can anyone think of a good reason to use MinHeapPriorityQueue over a MaxHeapPriorityQueue here?


Well, you presented it as a requirement. If that's what they want you to do, isn't it reason enough?

2nd question: Any supplemental ideas with key generation?


How do you want to measure relevance? The amount of words on a page that matches the keyword? The percentage of words? Maybe the average of the difference between each word on a page and the keyword?

I think you first need to decide when a page is relevant.
Brendan Rhoads
Greenhorn

Joined: Dec 05, 2011
Posts: 4
I guess the actual algorithm of relevance is sort of left open ended. I intended a pretty basic implementation of how often a word appears on the site.

I do like the idea of implementing the % based idea, but most of the time the actual percent would be below 1%, and I wanted to use integers as keys (although, multiplying by 100.0 could solve that).
Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3649
    
  17

Why limit yourself to integers? Real numbers seem like an excellent key to look up elements by relevance.

Personally I would go for the average Levenshtein distance between each word on the page and the keyword. But I guess that's mostly an implementation detail. You can play around to see what would give you the best results.

Regardless, you can start out with a Catalog class to which you can add URLs, at which point it will determine the count of each word on the page and add the results to your list of URLContents. Instead of this, you can consider using a Map<URL, Map<String, Integer>>, or Map<URL, WordTree> if you really must use your own binary search tree.

This Catalog class could then have a search(String keyword) method which returns a Heap<URL>, which prioritizes its elements according to the map structure inside the catalog.

I'm just spitballing here though.
 
I agree. Here's the link: http://aspose.com/file-tools
 
subject: MinHeap