Java Internationalization

Kodo Tan
Ranch Hand

Joined: Aug 14, 2001
Posts: 105
Hi all,
I was using some Java internationalization packages and found that the method I wrote to count words gives different results depending on whether there is white space in my Unicode string.
Basically, my program is as follows:
import java.text.BreakIterator;
import java.util.Locale;

public class ChineseWordLength {
    // Counts the segments reported by the iterator that start with a letter or digit.
    public static int countWords(String source, BreakIterator bi) {
        int count = 0;
        bi.setText(source);
        int start = bi.first();
        int end = bi.next();
        while (end != BreakIterator.DONE) {
            String word = source.substring(start, end);
            // Skip segments that are only white space or punctuation.
            if (Character.isLetterOrDigit(word.charAt(0))) {
                ++count;
                System.out.println(word);
            }
            start = end;
            end = bi.next();
        }
        return count;
    }

    public static void main(String[] args) {
        String str = "\u9700 \u8981 \u5132 \u5b58 \u7d00 \u9304";
        BreakIterator wi = BreakIterator.getWordInstance(Locale.CHINESE);
        System.out.println("No of words: " + countWords(str, wi));
    }
}

When the string is "\u9700 \u8981 \u5132 \u5b58 \u7d00 \u9304", the program counts 6 words. But when the string is "\u9700\u8981\u5132\u5b58\u7d00\u9304" (without white space), it counts only 1 word.
I thought the Java internationalization package handled the white space automatically?
Thomas Paul
mister krabs
Ranch Hand

Joined: May 05, 2000
Posts: 13974
I believe the reason has to do with the way BreakIterator was meant to be used. It is intended to help people writing word-processing logic, so that they can skip to the next character, word, sentence, etc. For the character instance and the word instance to have separate meanings for Chinese characters, the word instance looks for white space. The word instance was designed for double-click selection, which requires everything between white spaces to be selected. This was reported as a bug for the Katakana character set, and Sun rejected the report, saying this is the correct behavior of BreakIterator.
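
To make the difference between the two instances concrete, here is a minimal sketch that runs the same counting loop with a character instance and with a word instance. The class name BreakGranularityDemo and the helpers countSegments, countCharacters and countWords are made up for this illustration; only the BreakIterator factory methods come from the JDK. I have left out any expected number for the word instance on the unspaced string, since that is exactly the case discussed above and the result may differ between JDK versions.

```java
import java.text.BreakIterator;
import java.util.Locale;

class BreakGranularityDemo {
    // Counts the segments produced by the iterator whose first character
    // is a letter or digit (the same filter as in the original post).
    static int countSegments(String source, BreakIterator bi) {
        bi.setText(source);
        int count = 0;
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            if (Character.isLetterOrDigit(source.charAt(start))) {
                count++;
            }
        }
        return count;
    }

    // Character granularity: one segment per user-visible character.
    static int countCharacters(String source) {
        return countSegments(source, BreakIterator.getCharacterInstance(Locale.CHINESE));
    }

    // Word granularity: what the original program measures.
    static int countWords(String source) {
        return countSegments(source, BreakIterator.getWordInstance(Locale.CHINESE));
    }

    public static void main(String[] args) {
        String spaced = "\u9700 \u8981 \u5132 \u5b58 \u7d00 \u9304";
        String unspaced = "\u9700\u8981\u5132\u5b58\u7d00\u9304";

        System.out.println("characters (unspaced): " + countCharacters(unspaced));
        System.out.println("words (spaced): " + countWords(spaced));
        // Result depends on the word-break rules of the JDK in use.
        System.out.println("words (unspaced): " + countWords(unspaced));
    }
}
```

If what you really need is a count of ideographs rather than words, the character instance is probably the safer choice, because it does not depend on white space at all.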


Associate Instructor - Hofstra University
Amazon Top 750 reviewer