
Converting PDF to HTML or Parse PDF

Brendan Rhoads

Joined: Dec 05, 2011
Posts: 4
As an intro, I am working on a project for a 2nd year data structures class, and we are not permitted to use any libraries other than the Java API.

For this part of my project, I am creating a word frequency tree of, basically, my school's whole domain, in order to build a search engine for it. I wrote a class that spiders through HTML looking for hrefs and generates a list of all pages reachable from a seed page (the home page), then builds a binary search tree of objects, each holding a word from the site and how often it appears. I have not had much trouble with this so far. However, I have run into an issue with pages that are in PDF format, for example http://pvcc.edu/docs/aac_services_resources.pdf. My HTML parser just returns raw bytes (I'm guessing) along with other gobbledygook.
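For readers following along, here is a minimal sketch of the kind of word frequency tree described above, using only the Java API. The class and method names (`WordFrequencyTree`, `add`, `addText`, `countOf`) are hypothetical and are not taken from the poster's project; the tree is unbalanced and the tokenizer is deliberately crude.

```java
import java.util.Locale;

// Hypothetical sketch of a word frequency tree; not the original poster's code.
final class WordFrequencyTree {

    // One node per distinct word, ordered alphabetically.
    private static final class Node {
        final String word;
        int count = 1;
        Node left;
        Node right;

        Node(String word) {
            this.word = word;
        }
    }

    private Node root;

    // Inserts the word, or increments its count if it is already in the tree.
    void add(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        if (root == null) {
            root = new Node(key);
            return;
        }
        Node current = root;
        while (true) {
            int cmp = key.compareTo(current.word);
            if (cmp == 0) {
                current.count++;
                return;
            }
            if (cmp < 0) {
                if (current.left == null) {
                    current.left = new Node(key);
                    return;
                }
                current = current.left;
            } else {
                if (current.right == null) {
                    current.right = new Node(key);
                    return;
                }
                current = current.right;
            }
        }
    }

    // Splits text on anything that is not an ASCII letter or digit and adds each token.
    void addText(String text) {
        for (String token : text.split("[^A-Za-z0-9]+")) {
            if (!token.isEmpty()) {
                add(token);
            }
        }
    }

    // Returns how often the word was added, or 0 if it never was.
    int countOf(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        Node current = root;
        while (current != null) {
            int cmp = key.compareTo(current.word);
            if (cmp == 0) {
                return current.count;
            }
            current = (cmp < 0) ? current.left : current.right;
        }
        return 0;
    }
}
```

Feeding this tree the raw bytes of a PDF would fill it with meaningless tokens, which is why the PDF pages are a problem for the index.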

Is there a way I can write the .pdf to some sort of parse-able format?

I would be willing to PM my parser/spider/data structure classes upon request. I feel uneasy posting them in plain sight without absolute need (at risk of unintentionally showing a classmate my final project's code).


Paul Clapham

Joined: Oct 14, 2005
Posts: 19973

PDF already is a parsable format. (Parseable? Parsable? Anyway...) And there does exist Java code which parses PDF, just not code which is in the standard API. So you aren't allowed to use it.

The good news is that the spec for PDF is publicly available, and you should be able to track it down on Adobe's site somewhere. The bad news is that it is very large and complicated, and you probably don't have the (my rough guess) six months that it would take to implement even a useful subset of the spec. So probably your best bet is to write off the PDFs as unusable (okay, so "parsable" is the correct spelling) and stick to the HTML.
Tim Moores

Joined: Sep 21, 2011
Posts: 2413
PDF-Renderer is a Java library that can display PDFs (or at least some subset of them; I'm not sure about its scope). It's huge. So Paul was spot on about the effort it would take to tackle this. Maybe you can negotiate something more manageable with your professor.
Brendan Rhoads

Joined: Dec 05, 2011
Posts: 4
Thanks a lot for the responses. I'll briefly glance over the stuff you've linked, but I think I'll take your advice and just stick to the HTML. I suppose the best way to do that is simply not to include the URL if the address contains "pdf".
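A minimal sketch of that filter, using only the Java API. The names `PdfFilter`, `looksLikePdf`, and `withoutPdfs` are hypothetical. It assumes the PDFs on the site can be recognised by a path ending in ".pdf"; matching on the path suffix rather than on "contains pdf" avoids dropping ordinary pages whose address merely includes those letters.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for the spider; not part of the original project.
final class PdfFilter {

    // True when the URL's path ends in ".pdf", ignoring case, query string, and fragment.
    static boolean looksLikePdf(String url) {
        try {
            String path = new URI(url).getPath();
            return path != null && path.toLowerCase(Locale.ROOT).endsWith(".pdf");
        } catch (URISyntaxException e) {
            // Address could not be parsed: fall back to a plain suffix check.
            return url.toLowerCase(Locale.ROOT).endsWith(".pdf");
        }
    }

    // Returns a new list containing only the URLs that do not look like PDFs.
    static List<String> withoutPdfs(List<String> urls) {
        List<String> kept = new ArrayList<String>();
        for (String url : urls) {
            if (!looksLikePdf(url)) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

A PDF served from an address without the ".pdf" suffix would slip through this check. If that turns out to matter, the spider could also look at the response's content type (`URLConnection.getContentType()`) and skip anything reported as `application/pdf`.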