"Contains" with UTF-8

 
Ranch Hand
Posts: 54
I don't see an obvious way to do this. How would I find a UTF-8 "substring" in a UTF-8 word? (I am trying to state this without Java objects and primitives clouding my thinking.)

Example: Москва. Does this word contain the character sequence скв? Answer: Yes.

Since we are dealing with byte[] when talking about character encoding, the problem does not seem so straightforward.
 
Greg Werner
Ranch Hand
Posts: 54
I have a hack of a solution for Russian Cyrillic, but I am still looking for a more general algorithm or solution in Java. Here is the scenario that worked for me.

I crawled a bunch of Russian documents and wrote them to disk as UTF-8 files. I then tokenized on spaces into a single UTF-8 file containing a long list of words, most of them presumably Russian. I then created another file with the entire Russian Cyrillic alphabet, upper and lower case. For some reason, each line in my word file had 3 bytes prepended, followed very predictably by 2 bytes per Cyrillic character. I know not everything will be 2 bytes across alphabets, but this is the case for Cyrillic. So an 8-character Russian word will be 19 bytes: I ignore the first 3, and then compare each of the 8 byte pairs with the alphabet file I created, which I read in as a single String. With this, I was able to pick out all Russian words, or at least all words containing only Cyrillic characters. That is equivalent to checking that a word contains no non-Cyrillic character, i.e. that each character in the word is contained in the String representing the Russian Cyrillic alphabet.

This solution fails for a number of other languages, and again I would like a more general solution that works for any language, or perhaps a library that attempts to do such a thing.
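
One possible sketch of a more general check, which is not the byte-pair approach described above: decode the text into a String and ask the JDK which script each code point belongs to. The names `ScriptCheck`, `isAllInScript` and `stripBom` are made up for illustration. Treating the three leading bytes as a UTF-8 byte-order mark is an assumption; the post does not say what they are.

```java
import java.lang.Character.UnicodeScript;

class ScriptCheck {

    /** Returns true if the word is non-empty and every code point belongs to the given script. */
    static boolean isAllInScript(String word, UnicodeScript script) {
        if (word.isEmpty()) {
            return false;
        }
        // codePoints() also handles characters outside the Basic Multilingual Plane.
        return word.codePoints().allMatch(cp -> UnicodeScript.of(cp) == script);
    }

    /** Assumption: the extra leading bytes are a byte-order mark, which decodes to U+FEFF. */
    static String stripBom(String line) {
        return line.startsWith("\uFEFF") ? line.substring(1) : line;
    }

    public static void main(String[] args) {
        // "Moskva" in Cyrillic, written with escapes so the source file encoding does not matter.
        String moskva = "\u041C\u043E\u0441\u043A\u0432\u0430";
        // Same word with a Latin "M" as its first letter.
        String mixed = "M\u043E\u0441\u043A\u0432\u0430";

        System.out.println(isAllInScript(moskva, UnicodeScript.CYRILLIC)); // true
        System.out.println(isAllInScript(mixed, UnicodeScript.CYRILLIC));  // false
        System.out.println(isAllInScript(stripBom("\uFEFF" + moskva), UnicodeScript.CYRILLIC)); // true
    }
}
```

Passing a different constant, such as `UnicodeScript.GREEK`, checks another alphabet without an alphabet file. Digits, hyphens and other punctuation are reported as the `COMMON` script, so a word like that would be rejected by this sketch unless those characters are handled separately.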

 
Rancher
Posts: 43081

Since we are dealing with byte[] when talking about character encoding


I don't understand this - why would you want to deal with byte[] when the data is actually text? What prevents you from keeping two strings and then using the methods of the String class?
 
Greg Werner
Ranch Hand
Posts: 54

Ulf Dittmer wrote:

Since we are dealing with byte[] when talking about character encoding


I don't understand this - why would you want to deal with byte[] when the data is actually text? What prevents you from keeping two strings and then using the methods of the String class?



Ah yes, as long as I keep the text as a CharSequence throughout my execution path, I can use contains(). So if I need a single character from a string, I need to do something like substring(pos, pos+1), where pos is an arbitrary position.
 
Java Cowboy
Posts: 16084

Greg Werner wrote: So if I need a single character from a string, I need to do something like substring(pos, pos+1), where pos is an arbitrary position.


It is better to just call charAt(pos) to get the character at the specified position.

But you don't need to do complicated things with separate characters; just use the contains() method of class String.
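
A minimal sketch of that call, using the Москва / скв example from the first post. The class name `ContainsDemo` is made up for illustration, and the Cyrillic text is written as Unicode escapes only so that the result does not depend on how the source file is encoded.

```java
class ContainsDemo {
    public static void main(String[] args) {
        // The word from the first post (M, o, s, k, v, a in Cyrillic).
        String word = "\u041C\u043E\u0441\u043A\u0432\u0430";
        // The three-letter sequence being searched for (s, k, v in Cyrillic).
        String part = "\u0441\u043A\u0432";

        System.out.println(word.contains(part));          // true
        System.out.println(word.indexOf(part));           // 2
        System.out.println(word.charAt(2) == '\u0441');   // true

        // Checking one char against an alphabet String works the same way.
        String someLetters = "\u0430\u0432\u043A\u043E\u0441"; // a, v, k, o, s in Cyrillic
        System.out.println(someLetters.indexOf(word.charAt(3)) >= 0); // true
    }
}
```

Once the text is a String, the comparison works on chars, so the number of bytes each letter occupied in the UTF-8 file no longer matters.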
 
Greg Werner
Ranch Hand
Posts: 54


It is better to just call charAt(pos) to get the character at the specified position.



No, that was the approach that was not working for me. I was trying to get the word character by character and compare each one with a String containing the entire alphabet. The char returned did not match my String.
 
Jesper de Jong
Java Cowboy
Posts: 16084
Can you post your non-working code? Because it really sounds like you're looking for a complicated way to do something simple...
 
author and iconoclast
Posts: 24207

Greg Werner wrote:
No, that was the approach that was not working for me. I was trying to get the word character by character and compare each one with a String containing the entire alphabet. The char returned did not match my String.



Keep in mind the difference between bytes and chars. In a UTF-8 file, Cyrillic characters are going to be (mostly) 3 bytes each(*) -- but in a Java String, they're still one char each.

(*) UTF-8 encodes Unicode chars using one byte for ASCII, and 3 for most of the rest of the world's alphabets. Works great for the US, lousy for everyone else!
 
Greg Werner
Ranch Hand
Posts: 54

In a UTF-8 file, Cyrillic characters are going to be (mostly) 3 bytes each(*) -- but in a Java String, they're still one char each.



No, they are 2 bytes; I saw this by doing .toByteArray() as mentioned. Let us just close this one: we are going in circles, and I am able to do what I need to do.
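
The byte counts under discussion can be checked directly. The following is a small sketch, with the hypothetical class name `ByteCounts` and helper `utf8Length`, that encodes a few strings with `String.getBytes(Charset)` and prints the lengths. The strings are written as Unicode escapes so the source file encoding does not affect the result.

```java
import java.nio.charset.StandardCharsets;

class ByteCounts {

    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "Moskva" in Cyrillic: six letters.
        String word = "\u041C\u043E\u0441\u043A\u0432\u0430";

        System.out.println(word.length());       // 6 chars in the String
        System.out.println(utf8Length(word));    // 12 bytes in UTF-8, 2 per letter
        System.out.println(word.getBytes(StandardCharsets.UTF_16BE).length); // 12 bytes in UTF-16

        System.out.println(utf8Length("A"));      // 1 byte for ASCII
        System.out.println(utf8Length("\u20AC")); // 3 bytes for the euro sign
    }
}
```

For Cyrillic letters the UTF-8 and UTF-16 byte counts happen to be the same, 2 per letter, which may be part of why the thread goes in circles on this point.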
 
Marshal
Posts: 79151
That is because Java uses UTF-16 as its default encoding for Strings.
 