Converting PDF to HTML or Parse PDF

Brendan Rhoads
Greenhorn

Joined: Dec 05, 2011
Posts: 4
As an intro, I am working on a project for a 2nd year data structures class, and we are not permitted to use any libraries other than the Java API.

For my project (this part of it, anyway) I am creating a word frequency tree of, basically, my school's whole domain, in order to create a search engine for it. I created a class that spiders through HTML looking for hrefs and generates a list of all sites reachable from a seed site (the home page). It then builds a binary search tree of objects, each composed of a word from the site and how frequently it appears. I have not had much trouble with this thus far. However, I have run into an issue with web pages that are in PDF format, for example http://pvcc.edu/docs/aac_services_resources.pdf. My HTML parser just returns what I'm guessing are raw bytes, along with other gobbledygook.

Is there a way I can write the .pdf to some sort of parse-able format?

I would be willing to PM my parser/spider/data structure classes upon request. I feel uneasy posting them in plain sight without absolute need (at risk of unintentionally showing a classmate my final project's code).

Thanks
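
For readers following along, the word frequency tree described in this post could be sketched roughly as below. This is not the poster's code (which was not shared); the class name `WordFrequencyTree`, its methods, and the tokenizing rule (split on anything that is not an ASCII letter or digit, lowercase everything) are all assumptions made for illustration. It uses only the Java API, as the assignment requires.

```java
import java.util.Locale;

// Hypothetical sketch: an unbalanced binary search tree keyed by word,
// where each node stores how many times that word has been seen.
class WordFrequencyTree {

    // One node per distinct word, ordered by String.compareTo.
    private static final class Node {
        final String word;
        int count = 1;
        Node left;
        Node right;

        Node(String word) {
            this.word = word;
        }
    }

    private Node root;

    // Insert a word, or increment its count if it is already in the tree.
    void add(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        if (root == null) {
            root = new Node(key);
            return;
        }
        Node cur = root;
        while (true) {
            int cmp = key.compareTo(cur.word);
            if (cmp == 0) {
                cur.count++;
                return;
            }
            if (cmp < 0) {
                if (cur.left == null) {
                    cur.left = new Node(key);
                    return;
                }
                cur = cur.left;
            } else {
                if (cur.right == null) {
                    cur.right = new Node(key);
                    return;
                }
                cur = cur.right;
            }
        }
    }

    // Assumed tokenizer: split page text on runs of non-alphanumeric characters.
    void addText(String text) {
        for (String token : text.split("[^A-Za-z0-9]+")) {
            if (!token.isEmpty()) {
                add(token);
            }
        }
    }

    // Number of times the word was added; 0 if it was never seen.
    int count(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        Node cur = root;
        while (cur != null) {
            int cmp = key.compareTo(cur.word);
            if (cmp == 0) {
                return cur.count;
            }
            cur = cmp < 0 ? cur.left : cur.right;
        }
        return 0;
    }
}
```

Feeding raw PDF bytes into something like `addText` is what produces the junk entries described above, since a PDF's text is usually stored in compressed streams rather than as plain words.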


Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18135
    

PDF already is a parsable format. (Parseable? Parsable? Anyway...) And there does exist Java code which parses PDF, just not code which is in the standard API. So you aren't allowed to use it.

The good news is that the spec for PDF is publicly available, and you should be able to track it down on Adobe's site somewhere. The bad news is that it is very large and complicated, and you probably don't have the (my rough guess) six months that it would take to implement even a useful subset of the spec. So probably your best bet is to write off the PDFs as unusable (okay, so "parsable" is the correct spelling) and stick to the HTML.
Tim Moores
Rancher

Joined: Sep 21, 2011
Posts: 2408
PDF-Renderer is a Java library that can display PDFs (or at least some subset of them; I'm not sure about its scope). It's huge. So Paul was spot on about the effort it would take to tackle this. Maybe you can negotiate something more manageable with your professor.
Brendan Rhoads
Greenhorn

Joined: Dec 05, 2011
Posts: 4
Thanks a lot for the responses. I'll briefly glance over the stuff you've linked, but I think I'll take your advice and just stick to the HTML. I suppose the best way to do that is just to not include the URL if the address contains pdf.
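
A minimal sketch of that kind of filter, using only the Java API, might look like the following. The class and method names (`PdfFilter`, `looksLikePdfUrl`, `startsWithPdfHeader`) are hypothetical. Instead of testing whether the address contains "pdf" anywhere, the first check looks only at the end of the URL's path, so a page such as `/pdf-help.html` would still be crawled. The second check is an optional fallback for PDFs served from URLs without a `.pdf` suffix: PDF files start with the bytes `%PDF-`, so the first few bytes of a download can be inspected before handing it to the HTML parser.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for a spider that wants to skip PDF documents.
class PdfFilter {

    // True if the path component of the URL ends in ".pdf" (case-insensitive).
    // Query strings and fragments are ignored because only getPath() is checked.
    static boolean looksLikePdfUrl(String url) {
        try {
            String path = URI.create(url).getPath();
            return path != null && path.toLowerCase(Locale.ROOT).endsWith(".pdf");
        } catch (IllegalArgumentException e) {
            // Malformed address: not recognizably a PDF link.
            return false;
        }
    }

    // True if the given leading bytes of a response begin with "%PDF-".
    static boolean startsWithPdfHeader(byte[] head) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (head.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (head[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In the spider, a link would be added to the crawl list only when `looksLikePdfUrl` returns false, and a fetched page would be parsed only when `startsWithPdfHeader` returns false for its first bytes.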
 
 