The characters could (at least partly) be unprintable characters. If you're using DocumentInputStream to get the raw bytes, then there is not much point in using POI, is there? You could just as well use the java.io package. The way to extract text from a DOC file using POI is documented here.
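To see why the raw bytes look like garbage, here is a minimal sketch using only the standard library (no POI). The class name RawBytesDemo, both helper methods, and the sample bytes are made up for illustration; the sample starts with the first four bytes of the OLE2 container signature, followed by a short piece of text and two control bytes. It only shows that a byte-level read mixes binary structure with text, which is the reason to use POI's text extraction rather than reading the stream yourself.

```java
class RawBytesDemo {

    /** Counts bytes that are neither printable ASCII (0x20..0x7E) nor tab, LF, or CR. */
    static int countUnprintable(byte[] data) {
        int count = 0;
        for (byte b : data) {
            int c = b & 0xFF; // treat the byte as unsigned
            boolean printable = (c >= 0x20 && c <= 0x7E)
                    || c == '\t' || c == '\n' || c == '\r';
            if (!printable) {
                count++;
            }
        }
        return count;
    }

    /** Keeps only printable ASCII bytes; a crude filter, not a real DOC parser. */
    static String printableOnly(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF;
            if (c >= 0x20 && c <= 0x7E) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: part of the OLE2 signature, then "Hi", then two control bytes.
        byte[] sample = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, 'H', 'i', 0x00, 0x07};
        System.out.println(countUnprintable(sample)); // prints 6
        System.out.println(printableOnly(sample));    // prints Hi
    }
}
```

Filtering like this will still leave stray fragments from the file's internal structures in real documents, so treat it as a diagnostic, not as a way to extract text.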
HWPF is still in early development. It is in the scratchpad section of the SVN. You will need to ensure you either have a recent SVN checkout, or a recent SVN nightly build (including the scratchpad jar!)
So there is no JAR to download. You have to get the source and build it.