Converting PDF to HTML or Parse PDF

Brendan Rhoads
Greenhorn

Joined: Dec 05, 2011
Posts: 4
As an intro, I am working on a project for a 2nd year data structures class, and we are not permitted to use any libraries other than the Java API.

For this part of my project, I am building a word frequency tree of, basically, my school's whole domain, in order to create a search engine for it. I wrote a class that spiders through HTML looking for hrefs and generates a list of all sites reachable from a seed site (the home page). It then builds a binary search tree of objects, each holding a word from a site and how often that word appears. I have not had much trouble with this so far. However, I have run into an issue with web pages that are in PDF format, for example http://pvcc.edu/docs/aac_services_resources.pdf. My HTML parser just returns what I'm guessing are raw bytes, along with other gobbledygook.
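For readers following along, here is a minimal sketch of the kind of word frequency tree described above, using only the standard Java API. It is not the poster's code: the class name `WordFrequencyTree` and the methods `add`, `addText`, and `count` are hypothetical, and splitting on non-alphanumeric characters is an assumed tokenization rule.

```java
import java.util.Locale;

// Hypothetical sketch: an unbalanced binary search tree keyed by word,
// where each node stores how many times its word has been added.
public class WordFrequencyTree {
    private static final class Node {
        final String word;
        int count = 1;
        Node left, right;
        Node(String word) { this.word = word; }
    }

    private Node root;

    // Insert a word, or increment its count if it is already in the tree.
    public void add(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        if (root == null) { root = new Node(key); return; }
        Node cur = root;
        while (true) {
            int cmp = key.compareTo(cur.word);
            if (cmp == 0) { cur.count++; return; }
            if (cmp < 0) {
                if (cur.left == null) { cur.left = new Node(key); return; }
                cur = cur.left;
            } else {
                if (cur.right == null) { cur.right = new Node(key); return; }
                cur = cur.right;
            }
        }
    }

    // Assumed tokenization: split on runs of non-alphanumeric characters.
    public void addText(String text) {
        for (String token : text.split("[^A-Za-z0-9]+")) {
            if (!token.isEmpty()) add(token);
        }
    }

    // Return the stored frequency of a word, or 0 if it was never added.
    public int count(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        Node cur = root;
        while (cur != null) {
            int cmp = key.compareTo(cur.word);
            if (cmp == 0) return cur.count;
            cur = cmp < 0 ? cur.left : cur.right;
        }
        return 0;
    }
}
```

Feeding the text of each crawled page to `addText` and then calling `count("word")` gives the frequency lookup a search engine would need. Because the tree is unbalanced, words that arrive in sorted order degrade it toward a linked list.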

Is there a way I can write the .pdf to some sort of parse-able format?

I would be willing to PM my parser/spider/data structure classes upon request. I feel uneasy posting them in plain sight without absolute need (at risk of unintentionally showing a classmate my final project's code).

Thanks


--
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 13842

PDF already is a parsable format. (Parseable? Parsable? Anyway...) And there does exist Java code which parses PDF, just not code which is in the standard API. So you aren't allowed to use it.

The good news is that the spec for PDF is publicly available, and you should be able to track it down on Adobe's site somewhere. The bad news is that it is very large and complicated, and you probably don't have the (my rough guess) six months that it would take to implement even a useful subset of the spec. So probably your best bet is to write off the PDFs as unusable (okay, so "parsable" is the correct spelling) and stick to the HTML.
Tim Moores
Rancher

Joined: Sep 21, 2011
Posts: 2329
PDF-Renderer is a Java library that can display PDFs (or at least some subset of them; I'm not sure about its scope). It's huge, so Paul was spot on about the effort it would take to tackle this. Maybe you can negotiate something more manageable with your professor.
Brendan Rhoads
Greenhorn

Joined: Dec 05, 2011
Posts: 4
Thanks a lot for the responses. I'll briefly glance over the stuff you've linked, but I think I'll take your advice and just stick to the HTML. I suppose the best way to do that is simply to skip any URL whose address contains "pdf".
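One possible way to do that skip with only the standard API is sketched below. The class `PdfLinkFilter` and the method `looksLikePdf` are hypothetical names. The sketch also assumes that testing whether the URL's path ends in `.pdf` is close enough; that is slightly narrower than "contains pdf", so a page such as `pdf-guide.html` would still be crawled.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper: decide from the URL alone whether a link
// probably points at a PDF and should be skipped by the spider.
public class PdfLinkFilter {
    public static boolean looksLikePdf(String url) {
        String path;
        try {
            // getPath() drops the query string and fragment, so
            // "file.pdf?download=1" is still recognized.
            path = new URI(url).getPath();
        } catch (URISyntaxException e) {
            path = url; // fall back to the raw string for malformed URLs
        }
        if (path == null) {
            path = url; // opaque URIs such as mailto: have no path
        }
        return path.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }
}
```

The spider would call `looksLikePdf(href)` before adding a link to its list. A server can also serve a PDF from a URL without a `.pdf` suffix; checking the `Content-Type` header (for example via `URLConnection.getContentType()`) would catch those cases, at the cost of an extra request.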
 