
Capturing UTF-8 from an exec statement

 
Greenhorn
Posts: 2
I've tried about everything I could think of, and Google is no help either. I should perhaps also mention that I posted this on java-forums.org as well, but didn't get a response at all. I have some files that include special characters in UTF-8 in their names. What I want to do is run a 'dir' command on the system command line, capture the output, and return the resulting list of names. What I have is:
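The code itself didn't survive the forum formatting. A minimal sketch of what is being described (running 'dir' through cmd.exe and decoding its output as codepage 850; class name and the /b option are illustrative, not the poster's exact code):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class DirCapture {
    public static void main(String[] args) throws Exception {
        // Run 'dir' through the Windows command interpreter.
        Process p = Runtime.getRuntime().exec(new String[] { "cmd", "/c", "dir", "/b" });

        // Decode the process output as codepage 850, the default console codepage.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(p.getInputStream(), "Cp850"));

        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        p.waitFor();
    }
}
```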


(Obviously, I'm running on Windows, but I expect the same issues on *n*x.) This works as expected but, as you may have noticed, it decodes the output as codepage 850. Anything that's not in that codepage but is in Unicode, like the ellipsis, is changed to a period. If I switch this to:
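Presumably the change described here is swapping the reader's charset from Cp850 to UTF-8; again a hedged reconstruction rather than the original code:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class DirCaptureUtf8 {
    public static void main(String[] args) throws Exception {
        Process p = Runtime.getRuntime().exec(new String[] { "cmd", "/c", "dir", "/b" });

        // Same as before, but decode the process output as UTF-8 instead of Cp850.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(p.getInputStream(), "UTF-8"));

        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        p.waitFor();
    }
}
```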


It no longer works correctly: all special characters are replaced with other characters. If I then explicitly tell the command interpreter to use the UTF-8 codepage, like this:
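That is, switching the console to codepage 65001 (UTF-8) with chcp before running dir; a sketch under the assumption that this is roughly what was tried:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class DirCaptureChcp {
    public static void main(String[] args) throws Exception {
        // Switch the console to codepage 65001 (UTF-8) before running 'dir'.
        // The '&&' must be interpreted by cmd.exe, so the whole command
        // is passed as a single argument to /c.
        Process p = Runtime.getRuntime().exec(
                new String[] { "cmd", "/c", "chcp 65001 && dir /b" });

        BufferedReader reader = new BufferedReader(
                new InputStreamReader(p.getInputStream(), "UTF-8"));

        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        p.waitFor();
    }
}
```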


it still doesn't work, even though this outputs Unicode just fine in a Windows command window (provided I use a Unicode font like Lucida Console for the command window).

So, I know the cmd.exe command outputs UTF-8 correctly, and I know the stream reading from the process is configured for reading UTF-8, so what am I doing wrong? I suspect Java somehow knows the command processor runs Cp850 by default and interprets the output as such, but how would I go about changing this? Does anyone know what is going on?

Greetings,
Grismar.
 
Marshal
Posts: 28193
Well, no. The files don't "include special characters in UTF-8 in their names". It may be that their names include non-ASCII characters, but UTF-8 has nothing to do with it. Windows, and no doubt other operating systems as well, nowadays just treats file names as sequences of Unicode characters.

However, I don't know what to suggest to fix your problem. If you're going to use the 65001 code page, then perhaps using "Cp65001" as the charset in your InputStreamReader would produce the correct data. I would try that first (I don't know whether it's even a charset supported by Java, but it wouldn't take long to find out).
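Checking whether a charset name is recognized is a one-liner; note that whether "Cp65001" is an accepted alias depends on the JDK in use:

```java
import java.nio.charset.Charset;

public class CharsetCheck {
    public static void main(String[] args) {
        // Prints whether the running JDK recognizes each charset name.
        System.out.println("Cp65001 supported: " + Charset.isSupported("Cp65001"));
        System.out.println("UTF-8 supported:   " + Charset.isSupported("UTF-8"));
    }
}
```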

Then if that failed, I would read the InputStream directly and dump the bytes to the console. Perhaps looking at the byte representation of various file names would indicate how Java was dealing with them.
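Dumping the raw bytes might look like this (a sketch; the command and hex formatting are illustrative):

```java
import java.io.InputStream;

public class ByteDump {
    public static void main(String[] args) throws Exception {
        Process p = Runtime.getRuntime().exec(new String[] { "cmd", "/c", "dir", "/b" });

        // Read the raw bytes, bypassing any Reader/charset decoding,
        // and print them as hex so the actual encoding can be inspected.
        InputStream in = p.getInputStream();
        int b;
        while ((b = in.read()) != -1) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
        p.waitFor();
    }
}
```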

Or better still, dispense with the whole chewing-gum-and-string idea of running the "dir" command through Runtime.exec() and just use the built-in features of java.io.File to get that list.
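With java.io.File that could look like the following (the ".txt" filter is just an example):

```java
import java.io.File;
import java.io.FilenameFilter;

public class ListFiles {
    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");

        // Filter by extension while listing, instead of parsing 'dir' output.
        String[] names = dir.list(new FilenameFilter() {
            public boolean accept(File d, String name) {
                return name.toLowerCase().endsWith(".txt");
            }
        });

        if (names != null) {
            for (String name : names) {
                System.out.println(name);
            }
        }
    }
}
```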
 
Jaap van der Velde
Greenhorn
Posts: 2
Thanks for the quick response. I still have a few questions and reactions to your answer though.

I agree that the files don't "include special characters in UTF-8 in their names"; that was poorly worded. The filenames include special characters that are not only non-ASCII but also outside the default codepage of my Windows installation, which is apparently 850. The example I gave, the ellipsis, is one of those characters: it is part of Unicode (obviously) but is not in the codepage 850 set. I also agree that Windows in general works well with filenames that contain any Unicode characters, but how a DOS prompt outputs them is of course another matter (generally controlled with the chcp command).

I tried using "Cp65001" for the InputStreamReader, but Java doesn't support that directly. However, according to Microsoft documentation, codepage 65001 -is- UTF-8, and switching the console to a full Unicode font seems to confirm that: the console shows erroneous characters with the codepage set to 850, but displays them properly when set to codepage 65001 and configured with a Unicode font.

Your suggestion to just read the InputStream as a sequence of bytes is a useful one. I started out doing that, but once I thought I knew what the solution would be, I moved beyond it. I'll revisit it and post back if that yields anything. If I remember correctly, I initially got U+FFFD characters (the Unicode replacement character for undecodable input) in the location of the ellipsis; it should be U+2026 (the ellipsis). This is why I initially suggested that the problem might be in an underlying class performing some transformation outside of my control.

The reason I don't just use the built-in features is simple, though I may be at fault. I need a -fast- way to get a filtered list of thousands of files from a very large network share (several terabytes, hundreds of millions of files). The filtering is simple, by file extension, but I found that using the built-in features took up to twenty times as long as running the command against the OS, with the best code I could muster. Perhaps I'm overlooking a specific built-in function I could best use to get a filtered list of files? I am new to Java, but then that's why I'm posting here.
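If Java 7's NIO.2 API is available, a lazily streamed, glob-filtered listing avoids materializing the whole directory at once, which can matter at this scale (a sketch; the "*.txt" glob is illustrative):

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FastList {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");

        // newDirectoryStream iterates entries lazily and filters by glob,
        // instead of building one huge in-memory array like File.list().
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path entry : stream) {
                System.out.println(entry.getFileName());
            }
        }
    }
}
```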

Thanks for any feedback you care to offer,
JAAP.

 
Paul Clapham
Marshal
Posts: 28193
I suppose one other thing you could try would be changing Java's default charset. I believe on Windows it's windows-1252, so perhaps if you changed it to (say) UTF-8, that underlying code might not classify non-ASCII characters as undefined? That's just speculation.
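The default charset is picked up at JVM startup and can be overridden on the command line with the file.encoding system property (e.g. java -Dfile.encoding=UTF-8 ...). A small check of what's in effect:

```java
import java.nio.charset.Charset;

public class DefaultCharset {
    public static void main(String[] args) {
        // Prints the charset the JVM is using as its default.
        // Launch with:  java -Dfile.encoding=UTF-8 DefaultCharset  to change it.
        System.out.println(System.getProperty("file.encoding"));
        System.out.println(Charset.defaultCharset());
    }
}
```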
 