Identifying readable PDF files in Java

Joe Vahabzadeh
Ranch Hand

Joined: Jan 05, 2005
Posts: 140
Ok, I know this isn't entirely a Java question but . . .

I have a Java program that, among other things, has to look at a file and determine if it's a PDF. I have, basically, a byte array to work with.

I know I can convert the byte array to a String and look at the first 5 characters for "%PDF-".
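
For what it's worth, the prefix check described above can be done directly on the bytes, without building a String first, which avoids any charset surprises. A minimal sketch; the class and method names are made up for illustration and are not from any library:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the name is illustrative only.
final class PdfHeaderCheck {
    // The five ASCII bytes "%PDF-" that a spec-conforming PDF starts with.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfHeaderCheck() {}

    /** Returns true only if data begins with the bytes "%PDF-" at offset 0. */
    static boolean startsWithPdfHeader(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

This is the strict test: anything in front of the header, even a single byte, makes it return false.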

However, this apparently isn't a hard-and-fast rule. In fact, I'm seeing now some files coming into my program which have something like:

Now, the code I've got already obviously doesn't recognize this - yet Adobe Reader handles it just fine!

So, what's the "correct" way to see if a file that comes in as a .pdf is actually legitimately a .pdf that is readable by Adobe Reader? I think trying to look at the string and yank out all of what looks like HTML while NOT pulling out anything that is legitimately part of the PDF would be an exercise in self-induced insanity.

Is there some sort of library or class available that does this? Is the file I'm seeing, despite being readable in Adobe Reader, really an "improper" .pdf? When I right-click it in Windows, and choose Properties, then the PDF tab, it says it was created by Adobe Acrobat 6.0 and is PDF Version 1.5.

Any guidance would be appreciated.
Ulf Dittmer

Joined: Mar 22, 2005
Posts: 42965
I don't think there's a "good" answer to this. What is "proper" PDF is determined by the PDF specification, and I'm fairly certain that doesn't allow HTML tags. What is readable by Adobe Reader (or some other PDF reader like Preview on OS X) is certainly a wider range of files which allows for some file corruption or extra characters. That's simply the Robustness Principle in action.
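
One documented instance of this leniency: the implementation notes in Adobe's PDF Reference say that Acrobat viewers only require the "%PDF-" header to appear somewhere within the first 1024 bytes of the file, not necessarily at offset 0. Assuming that is the behaviour at play here, a looser check could scan that window. This is only a sketch with hypothetical names. It approximates Adobe Reader's tolerance for leading junk and says nothing about whether the rest of the file is valid:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the name and the constant are illustrative only.
final class LenientPdfHeaderScan {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // Window size taken from the Acrobat implementation note (first 1024 bytes).
    private static final int SCAN_WINDOW = 1024;

    private LenientPdfHeaderScan() {}

    /**
     * Returns the offset of the first "%PDF-" that lies entirely within the
     * first 1024 bytes of data, or -1 if there is none.
     */
    static int findPdfHeader(byte[] data) {
        if (data == null) {
            return -1;
        }
        // Last start offset at which the whole marker still fits in the window.
        int lastStart = Math.min(data.length, SCAN_WINDOW) - PDF_MAGIC.length;
        for (int start = 0; start <= lastStart; start++) {
            boolean match = true;
            for (int i = 0; i < PDF_MAGIC.length; i++) {
                if (data[start + i] != PDF_MAGIC[i]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return start;
            }
        }
        return -1;
    }
}
```

A return value of 0 means the file starts the way the spec expects; a positive value means there are extra bytes in front of the header that a tolerant reader would skip.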

You could try opening the file with a PDF library like iText or PDFBox and, if it succeeds, use that as an approximation of "PDF-ness", but that's almost guaranteed to be a different set of files than either the PDF spec or Adobe Reader would accept.

YMMV :-)
Joe Vahabzadeh
Ranch Hand

Joined: Jan 05, 2005
Posts: 140
Yikes . . I was afraid that might be the case...