I have a somewhat hacky solution for Russian Cyrillic, but I'm still looking for a more general algorithm / solution in Java. Here is the scenario that worked for me.
I crawled a bunch of Russian documents and wrote them to disk as UTF-8 files. I then tokenized on whitespace into a single UTF-8 file containing a long list of words, most of them presumably Russian. I also created another file with the entire Russian Cyrillic alphabet, upper and lower case. For some reason, my word file had 3 bytes prepended to each line (quite possibly the UTF-8 byte-order mark, which is the 3-byte sequence EF BB BF), followed very predictably by 2 bytes per Cyrillic character. I know not every alphabet encodes to 2 bytes per character in UTF-8, but this is the case for Cyrillic. So an 8-character Russian word is 19 bytes: I ignore the first 3, then compare each of the eight 2-byte pairs against the alphabet file, which I read in as a single String. Based on this, I was able to successfully pick out all Russian words, or at least all words containing only Cyrillic characters: the check is whether every character of a word is contained in the String representing the Russian Cyrillic alphabet (equivalently, whether the word contains no non-Cyrillic character).
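The containment check described above can be done directly on Java Strings, sidestepping the byte counting entirely, since Java decodes the UTF-8 for you when you read with the right charset. A minimal sketch of that idea, with the alphabet string inlined rather than read from a file (the class and method names here are just illustrative):

```java
public class CyrillicCheck {
    // Upper- and lower-case Russian Cyrillic alphabet, as a single String.
    static final String RUSSIAN_ALPHABET =
        "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"
      + "абвгдеёжзийклмнопрстуфхцчшщъыьэюя";

    // If the mysterious 3-byte prefix is a UTF-8 BOM, it decodes to the
    // single character U+FEFF, which can be stripped like this.
    static String stripBom(String s) {
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    // True if every character of the word appears in the alphabet string.
    static boolean isAllCyrillic(String word) {
        for (int i = 0; i < word.length(); i++) {
            if (RUSSIAN_ALPHABET.indexOf(word.charAt(i)) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAllCyrillic(stripBom("\uFEFFслово"))); // true
        System.out.println(isAllCyrillic("word"));                  // false
    }
}
```

Reading the word file with `new InputStreamReader(in, StandardCharsets.UTF_8)` (or `Files.readAllLines(path, StandardCharsets.UTF_8)`) gives you Strings to feed this directly, with no 2-bytes-per-character assumption.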
This solution fails in a number of other language situations, and again, I would like a more general solution that can work for any language, or perhaps a library out there that attempts to do such a thing.
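For the script-detection part of this, one more general option (not what I used above) is that the JDK itself ships Unicode script data: `Character.UnicodeScript` (Java 7+) classifies any code point, so the same containment idea works for any script without hand-building alphabet files. Note this only identifies the script, not the language; for actual language identification, libraries such as ICU4J exist. A sketch of the script-based check:

```java
public class ScriptCheck {
    // True if every code point in the word belongs to the given Unicode script.
    static boolean isScript(String word, Character.UnicodeScript script) {
        return word.codePoints()
                   .allMatch(cp -> Character.UnicodeScript.of(cp) == script);
    }

    public static void main(String[] args) {
        System.out.println(isScript("слово", Character.UnicodeScript.CYRILLIC)); // true
        System.out.println(isScript("word",  Character.UnicodeScript.CYRILLIC)); // false
        System.out.println(isScript("word",  Character.UnicodeScript.LATIN));    // true
    }
}
```

This also handles scripts whose characters are not 2 bytes in UTF-8, since it works on decoded code points rather than raw bytes.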