Ok, I know this isn't entirely a Java question but . . .
I have a Java program that, among other things, has to look at a file and determine if it's a PDF. I have, basically, a byte array to work with.
I know I can convert the byte array to a String and look at the first 5 characters for "%PDF-".
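For reference, that strict check can also be done on the bytes directly, without building a String first. This is only a minimal sketch: the class name `PdfHeaderCheck` and the method `startsWithPdfHeader` are made up for illustration, and it assumes the whole file (or at least its first few bytes) is already in the array.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class PdfHeaderCheck {
    // The ASCII bytes of the marker a well-formed PDF starts with.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true only if the array begins with the bytes "%PDF-". */
    static boolean startsWithPdfHeader(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Comparing bytes avoids any dependence on the platform default charset when decoding arbitrary binary data.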
However, this apparently isn't a hard-and-fast rule. In fact, I'm seeing now some files coming into my program which have something like:
Now, the code I've got already obviously doesn't recognize this - yet Adobe Reader handles it just fine!
So, what's the "correct" way to see if a file that comes in as a .pdf is actually legitimately a .pdf that is readable by Adobe Reader? I think trying to look at the string and yank out all of what looks like HTML while NOT pulling out anything that is legitimately part of the PDF would be an exercise in self-induced insanity.
Is there some sort of library or class available that does this? Is the file I'm seeing, despite being readable in Adobe Reader, really an "improper" .pdf? When I right-click it in Windows, and choose Properties, then the PDF tab, it says it was created by Adobe Acrobat 6.0 and is PDF Version 1.5.
I don't think there's a "good" answer to this. What counts as a "proper" PDF is determined by the PDF specification, and I'm fairly certain that doesn't allow HTML tags. What is readable by Adobe Reader (or some other PDF reader like Preview on OS X) is certainly a wider range of files, one that allows for some file corruption or extra characters. That's simply the Robustness Principle in action.
You could try opening the file with a PDF library like iText or PDFBox and, if it succeeds, use that as an approximation of "PDF-ness", but that's almost guaranteed to be a different set of files than either the PDF spec or Adobe Reader would accept.
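If a full parse is too heavy, a cheaper heuristic is to tolerate leading junk and search for the "%PDF-" marker near the start of the data instead of insisting on offset 0. As far as I recall, the implementation notes in Adobe's PDF Reference say Acrobat viewers only require the header to appear somewhere within the first 1024 bytes; treat that window size as an assumption here, not as something the spec guarantees. The names `LenientPdfSniffer`, `findPdfHeader` and `looksLikePdf` are hypothetical, and a hit only means "might be a PDF", not "Adobe Reader will open it".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of iText or PDFBox.
final class LenientPdfSniffer {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Assumed search window, mirroring what Acrobat viewers reportedly tolerate.
    private static final int SEARCH_WINDOW = 1024;

    /**
     * Returns the offset of the first "%PDF-" marker that lies entirely
     * within the first SEARCH_WINDOW bytes, or -1 if there is none.
     */
    static int findPdfHeader(byte[] data) {
        if (data == null) {
            return -1;
        }
        // Last offset at which the whole marker still fits inside the window.
        int limit = Math.min(data.length, SEARCH_WINDOW) - PDF_MAGIC.length;
        for (int start = 0; start <= limit; start++) {
            int matched = 0;
            while (matched < PDF_MAGIC.length
                    && data[start + matched] == PDF_MAGIC[matched]) {
                matched++;
            }
            if (matched == PDF_MAGIC.length) {
                return start;
            }
        }
        return -1;
    }

    /** Heuristic only: true if a header marker was found in the window. */
    static boolean looksLikePdf(byte[] data) {
        return findPdfHeader(data) >= 0;
    }
}
```

If the returned offset is greater than zero, the bytes before it are the extra leading content, which you could log or strip before handing the rest to a PDF library.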