Problems while parsing files in different encoding

Jan Kwiatkowski
Greenhorn

Joined: Nov 20, 2008
Posts: 12
Hello

Could you please help me with the following?

I have a test.txt file with a single line:

Caption = &Rückwärt Caption = &Vorwärts Caption = &Vorwärts

As you can see there are some Unicode characters like "ü"

With the following program:



I would like to display the pairs (using a regexp) in the form:
Caption = something

but the output is:

Caption = &R
Caption = &Vorw
Caption = &Vorw

Those Unicode characters are not printed, and neither are the letters after them.

However this program (with no regexp, only printing this one line from the file):



prints fine:

Caption = &Rückwärt Caption = &Vorwärts Caption = &Vorwärts

There seems to be a problem with the regexp mechanism.

Could you please tell me what I'm doing wrong?
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18570
    

Jan Kwiatkowski wrote: As you can see there are some Unicode characters like "ü"

... which are obviously not in either of the ranges "a-z" or "A-Z". But those ranges are what you selected, so you excluded those characters.
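Here is a minimal sketch of that effect. The original program isn't shown in this thread, so the pattern "Caption = &[a-zA-Z]+", the class name AsciiRangeDemo and the method findCaptions are only my guesses at something similar, not the actual code. The sample line is the one the second program printed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AsciiRangeDemo {
    // Hypothetical pattern: an assumption about what the original regex looked like.
    // [a-zA-Z] covers only the 52 unaccented ASCII letters.
    static final Pattern ASCII_ONLY = Pattern.compile("Caption = &[a-zA-Z]+");

    // Collects every match of the pattern in the given line.
    static List<String> findCaptions(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = ASCII_ONLY.matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // U+00FC (u-umlaut) and U+00E4 (a-umlaut) are written as escapes
        // so the source compiles regardless of the compiler's file encoding.
        String line = "Caption = &R\u00fcckw\u00e4rt Caption = &Vorw\u00e4rts Caption = &Vorw\u00e4rts";
        for (String caption : findCaptions(line)) {
            System.out.println(caption);
        }
    }
}
```

Each match stops at the first character outside the two ranges, so this prints "Caption = &R", "Caption = &Vorw" and "Caption = &Vorw", the same truncated output as in the question.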

There's a special code in regex which means "any Unicode letter". I don't remember what it is but you could look it up in your regex reference.
Jan Kwiatkowski
Greenhorn

Joined: Nov 20, 2008
Posts: 12
OK, thanks. I'm such a dummy.

The character class is: \p{L}
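For anyone finding this later, a small sketch of how \p{L} can be used. The pattern "Caption = &\p{L}+" and the names UnicodeLetterDemo and findCaptions are made up for illustration and are not the code from my real program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeLetterDemo {
    // \p{L} matches any character in the Unicode "Letter" category.
    // The backslash is doubled because this is a Java string literal.
    static final Pattern ANY_LETTER = Pattern.compile("Caption = &\\p{L}+");

    // Collects every match of the pattern in the given line.
    static List<String> findCaptions(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = ANY_LETTER.matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // U+00FC (u-umlaut) and U+00E4 (a-umlaut) written as escapes,
        // so the source does not depend on the compiler's file encoding.
        String line = "Caption = &R\u00fcckw\u00e4rt Caption = &Vorw\u00e4rts Caption = &Vorw\u00e4rts";
        for (String caption : findCaptions(line)) {
            System.out.println(caption);
        }
    }
}
```

With this pattern the three matches are the complete words. One caveat: this assumes the umlauts are stored as single precomposed characters. If the file has them as a base letter plus a combining mark, \p{L} alone would not match the mark.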

Thanks for the help.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18570
    

Right, that looks familiar.
 