As an intro, I am working on a project for a second-year data structures class, and we are not permitted to use any libraries other than the Java API.
For my project (this part of it, anyway) I am building a word frequency tree of, basically, my school's whole domain, in order to create a search engine for it. I wrote a class that spiders through the HTML looking for hrefs and generates a list of all sites reachable from a seed site (the home page), then builds a binary search tree of objects, each composed of a word from the site and how frequently it appears. I have not had much trouble with this so far. However, I have run into an issue with web pages that are in PDF format, for example http://pvcc.edu/docs/aac_services_resources.pdf. My HTML parser just returns byte codes (I'm guessing) along with other gobbledygook.
Is there a way I can write the .pdf to some sort of parse-able format?
I would be willing to PM my parser/spider/data structure classes upon request. I feel uneasy posting them in plain sight without absolute need (at risk of unintentionally showing a classmate my final project's code).
PDF already is a parsable format. (Parseable? Parsable? Anyway...) And there does exist Java code which parses PDF, just not code which is in the standard API. So you aren't allowed to use it.
The good news is that the spec for PDF is publicly available, and you should be able to track it down on Adobe's site somewhere. The bad news is that it is very large and complicated, and you probably don't have the (my rough guess) six months that it would take to implement even a useful subset of the spec. So probably your best bet is to write off the PDFs as unusable (okay, so "parsable" is the correct spelling) and stick to the HTML.
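If you want to recognise a PDF from its content rather than its address, here is a minimal sketch that uses only the standard API. It relies on the fact that a PDF file normally begins with the ASCII header "%PDF-". The class and method names (PdfSniffer, looksLikePdf) are made up for illustration, and it assumes you already have the first few bytes of the response in a byte array.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class PdfSniffer {

    // The header a PDF file normally starts with.
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true when the data starts with the "%PDF-" header.
    // Assumption: the header sits at offset 0; files with leading
    // junk before the header are not detected by this simple check.
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfStart = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] htmlStart = "<html><head>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(pdfStart));
        System.out.println(looksLikePdf(htmlStart));
    }
}
```

A check like this only tells you to skip the document. It does not make the content readable.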
PDF-Renderer is a Java library that can display PDFs (or at least some subset of them; I'm not sure about its scope). It's huge. So Paul was spot on about the effort it would take to tackle this. Maybe you can negotiate something more manageable with your professor.
Thanks a lot for the responses. I'll briefly glance over the stuff you've linked, but I think I'll take your advice and just stick to the HTML. I suppose the best way to do that is simply to skip any URL whose address contains pdf.
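For what it's worth, a rough sketch of that filter using only the Java API might look like the following. The names (PdfLinkFilter, isPdfLink, withoutPdfLinks) are hypothetical and it assumes the spider hands over its links as plain strings. It tests whether the path ends in ".pdf" rather than whether the address contains "pdf" anywhere, since a plain contains check would also throw away an ordinary page such as pdf-guide.html.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for the spider, not taken from the original project.
class PdfLinkFilter {

    // True when the path part of the URL ends in ".pdf", ignoring case.
    // Any query string or fragment is ignored because only the path is examined.
    static boolean isPdfLink(String url) {
        String path;
        try {
            path = new URI(url).getPath();
        } catch (URISyntaxException e) {
            path = null; // malformed URL: fall back to the raw string below
        }
        if (path == null) {
            path = url;
        }
        return path.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    // Returns a new list holding only the links that are not PDFs.
    static List<String> withoutPdfLinks(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (!isPdfLink(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> found = new ArrayList<>();
        found.add("http://pvcc.edu/docs/aac_services_resources.pdf");
        found.add("http://pvcc.edu/index.html");
        System.out.println(withoutPdfLinks(found));
    }
}
```

This only catches PDFs whose address says so. A link that serves a PDF from an address without the extension would still get through.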