Hi, I'm trying to write a simple program that will read and write PDF documents - however, I'm having a few problems with the code.
My program seems to read the document fine, but will not write it out again properly. Comparing the two files side by side before and after writing, it appears that a small number of control characters are missing from the output file. Any clues as to why this is happening?
What's more, you can't treat a binary file such as a PDF like a plain text file. A PDF file doesn't have "lines". It has some text data, but it also contains a ton of other binary data to describe what to do with that text. If you try to read the binary data in as text, Java decodes the bytes into Unicode characters using a character encoding. Since many byte values and sequences aren't valid in a given encoding, they get replaced or altered during decoding, and you lose information.
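To see the effect in isolation, here is a minimal sketch (the class and method names are made up for illustration, and UTF-8 is just an assumed encoding; your program may be using a different one). It decodes a few bytes as text and encodes them back, the way a Reader/Writer pair would:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TextRoundTripDemo {
    // Decode the bytes as UTF-8 text, then encode that text back to bytes.
    // This mimics what happens when binary data passes through a Reader and a Writer.
    static byte[] roundTripAsText(byte[] input) {
        String asText = new String(input, StandardCharsets.UTF_8);
        return asText.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "%PDF" followed by four high-bit bytes that do not form valid UTF-8
        byte[] original = {0x25, 0x50, 0x44, 0x46,
                           (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3};
        byte[] copy = roundTripAsText(original);
        System.out.println("identical after round trip: " + Arrays.equals(original, copy));
    }
}
```

The plain ASCII part survives, but the high-bit bytes are swapped for replacement characters during decoding, so the bytes that come back out no longer match the input.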
Yes, I've noticed that myself now. I've now converted the program to read the files on a character-by-character basis, and while it's converting a lot more of the characters properly, there are still certain ones that are getting changed.
I'm having to go through my output files with a hex editor and a fine-tooth comb to find exactly where it's going wrong.
You just want to copy the file from one place to another? Then do not read the files one character at a time. What Joe said (about binary data versus Unicode characters) still applies no matter how many characters at a time you read. To copy any file, PDF or otherwise, just read bytes (not characters) from the input and write them to the output.
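Something along these lines should do it. This is only a sketch: the class name, the buffer size, and taking the two paths from the command line are my own choices, not anything from your program.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ByteCopy {
    // Copy a file as raw bytes. No Reader or Writer is involved,
    // so no character decoding can alter the data.
    static void copy(Path source, Path target) throws IOException {
        try (InputStream in = Files.newInputStream(source);
             OutputStream out = Files.newOutputStream(target)) {
            byte[] buffer = new byte[8192];
            int n;
            // read() returns the number of bytes read, or -1 at end of file
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ByteCopy <input file> <output file>
        copy(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

If all you need is the copy itself, `Files.copy(source, target)` from `java.nio.file` does the same job in one call. The explicit loop is mainly useful if you want to look at the bytes on the way through.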
There is an open source library called iText which enables you to create, manipulate and also copy PDF files. It may be overkill for what you are trying to do, but it's worth knowing about, as it enables you to copy only certain pages, etc.