
Identifying readable PDF files in Java

 
Ranch Hand
Posts: 140
Ok, I know this isn't entirely a Java question but . . .

I have a Java program that, among other things, has to look at a file and determine if it's a PDF. I have, basically, a byte array to work with.

I know I can convert the byte array to a String and check whether the first 5 characters are "%PDF-".
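
For reference, the same check can be done on the raw bytes without building a String at all. This is only a minimal sketch: the class and method names (`PdfHeaderCheck`, `startsWithPdfHeader`) are made up for illustration, and it tests nothing beyond the five-byte "%PDF-" marker at offset 0.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not from any library: strict "%PDF-" check at offset 0.
public class PdfHeaderCheck {

    // The five ASCII bytes of the PDF header marker.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true only if the array begins with the bytes of "%PDF-".
    public static boolean startsWithPdfHeader(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Comparing bytes directly avoids any dependence on the platform's default charset when the array is turned into a String.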

However, this apparently isn't a hard-and-fast rule. In fact, I'm seeing now some files coming into my program which have something like:



Now, the code I've got already obviously doesn't recognize this - yet Adobe Reader handles it just fine!

So, what's the "correct" way to see if a file that comes in as a .pdf is actually legitimately a .pdf that is readable by Adobe Reader? I think trying to look at the string and yank out all of what looks like HTML while NOT pulling out anything that is legitimately part of the PDF would be an exercise in self-induced insanity.

Is there some sort of library or class available that does this? Is the file I'm seeing, despite being readable in Adobe Reader, really an "improper" .pdf? When I right-click it in Windows, and choose Properties, then the PDF tab, it says it was created by Adobe Acrobat 6.0 and is PDF Version 1.5.

Any guidance would be appreciated.
 
Rancher
Posts: 43081
77
I don't think there's a "good" answer to this. What counts as a "proper" PDF is determined by the PDF specification, and I'm fairly certain that doesn't allow HTML tags. What Adobe Reader (or some other PDF reader, like Preview on OS X) can actually open is certainly a wider range of files, one that tolerates some file corruption or extra characters. That's simply the Robustness Principle in action.
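
A more forgiving header check is one way to approximate that tolerance. The sketch below assumes, based on my recollection of the implementation notes in Adobe's PDF Reference, that Acrobat viewers only need the "%PDF-" header to appear somewhere within the first 1024 bytes; treat that window as an assumption rather than a guarantee. The names `LenientPdfSniffer` and `findPdfHeader` are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: looks for "%PDF-" near the start of the data, not only at offset 0.
public class LenientPdfSniffer {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // ASSUMPTION: 1024 bytes mirrors the leniency attributed to Acrobat viewers.
    private static final int SEARCH_WINDOW = 1024;

    // Returns the offset of the first "%PDF-" that lies entirely within the
    // search window, or -1 if there is none.
    public static int findPdfHeader(byte[] data) {
        if (data == null) {
            return -1;
        }
        int lastStart = Math.min(data.length, SEARCH_WINDOW) - PDF_MAGIC.length;
        for (int start = 0; start <= lastStart; start++) {
            if (matchesAt(data, start)) {
                return start;
            }
        }
        return -1;
    }

    // True if the marker bytes occur in data beginning at the given offset.
    private static boolean matchesAt(byte[] data, int offset) {
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[offset + i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A non-negative result only says the marker is present; it does not show that the rest of the file is well-formed or that any particular reader will open it.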

You could try opening the file with a PDF library like iText or PDFBox and, if it succeeds, use that as an approximation of "PDF-ness", but that's almost guaranteed to be a different set of files than either the PDF spec or Adobe Reader would accept.

YMMV :-)
 
Joe Vahabzadeh
Ranch Hand
Posts: 140
Yikes . . . I was afraid that might be the case...
 