
Converting character sets

Brian Danson
Greenhorn

Joined: May 15, 2007
Posts: 4
I am trying to create a duplicate String detection function, part of which involves replacing certain special characters with a more common letter in the current character set.

For example:
'é' is replaced by 'e'
'ý' is replaced by 'y' etc.

While this can be handled programmatically for the core Latin character set,
at some point this functionality may be called using different ones.

Is there any function available which would handle an alphabet (such as Cyrillic, for example) without having to be specifically hardcoded in advance?

Any thoughts would be much appreciated!


Thanks.
Paul Sturrock
Bartender

Joined: Apr 14, 2004
Posts: 10336

My first thought is: why are you doing this? How do you swap characters into a "more common" character set and produce anything other than garbage data?

All Java Strings are Unicode. Why not just use the language the way it is intended to be used? What are you trying to achieve (I assume you have a purpose not supported by the core functionality of the Java language, since you posted this in the advanced forum)?

Also, can I ask you to have a glance at our naming policy? Just to make sure you are complying with it?


JavaRanch FAQ HowToAskQuestionsOnJavaRanch
Brian Danson
Greenhorn

Joined: May 15, 2007
Posts: 4
Basically, we are determining whether something like 'My Awesome Surname' is considered to be a potential duplicate of 'my awesomé Sirname', based on user-configurable rules and a result scoring system when a potential match is found.

I can assume a pair of strings being compared are in the same character set, but my point is: how do I know whether special characters such as 'é' need to be replaced (and with what character), since different character sets have different 'special' characters?

Also, can I ask you to have a glance at our naming policy? Just to make sure you are complying with it?

Done. Sorry!
Rahul Bhattacharjee
Ranch Hand

Joined: Nov 29, 2005
Posts: 2308
Originally posted by Brian Damage:

'é' is replaced by 'e'
'ý' is replaced by 'y' etc.



I do not think that anything is available in java for this.

I found this article very useful.
[ May 15, 2007: Message edited by: Rahul Bhattacharjee ]

Rahul Bhattacharjee
LinkedIn - Blog
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
[Brian]: we are determining whether something like 'My Awesome Surname' is considered to be a potential duplicate of 'my awesomé Sirname', based on user configurable rules and a result scoring system when a potential match is found.

For the problem you describe, a Collator may possibly be useful. I don't think you can get it to tell you, given 'ü', replace that with 'u'. But you can use it to determine whether a string like "Jüngst" is equivalent to "Jungst" or (more correctly) "Juengst". (Probably not "Yingst" though unless there's a locale for Pennsylvania Dutch.) You can configure it only as far as choosing a Locale and a strength level. You could for example store a bunch of surnames in a TreeSet using a Comparator that compares using CollationKey values. If that sounds useful, it could be done fairly easily using Collators. Customizing the rules beyond that could take some effort though, if it's possible at all. It really depends how much the users need to be able to configure the system.
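A minimal sketch of the Collator idea, assuming PRIMARY strength is loose enough for the matching you want. The class and method names here are made up for illustration, and the choice of Locale is up to the caller:

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;
import java.util.TreeSet;

final class SurnameCollation {
    // PRIMARY strength compares base letters only, so case and accent
    // differences are ignored under the default collation rules.
    static Collator looseCollator(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator;
    }

    // True when the collator treats the two strings as equivalent.
    static boolean looksEqual(String a, String b, Locale locale) {
        return looseCollator(locale).compare(a, b) == 0;
    }

    // Collator is itself a Comparator, so it can order a TreeSet directly;
    // names that compare as equal collapse into a single entry.
    static TreeSet<String> distinctNames(Locale locale, String... names) {
        TreeSet<String> set = new TreeSet<>(looseCollator(locale));
        set.addAll(Arrays.asList(names));
        return set;
    }
}
```

For a large set of names, precomputing CollationKey values with collator.getCollationKey(name) and comparing those would avoid re-parsing each string on every comparison.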


"I'm not back." - Bill Harding, Twister
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Come to think of it, you may also benefit from looking into the Soundex and Metaphone algorithms. They've also been discussed a few times here on the Ranch, so try putting those words into Search for more discussion.
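To give a feel for what Soundex does, here is a simplified sketch of the classic American variant in plain Java. It is not the Commons Codec implementation, the class name is made up, and it silently skips anything outside a-z (so accented letters are simply dropped):

```java
final class SimpleSoundex {
    // Soundex digit for each letter a..z; '0' marks vowels and other ignored letters.
    private static final String CODES = "01230120022455012623010202";

    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char raw : word.toCharArray()) {
            char c = Character.toLowerCase(raw);
            if (c < 'a' || c > 'z') {
                continue; // skip digits, punctuation, accented letters
            }
            char code = CODES.charAt(c - 'a');
            if (out.length() == 0) {
                // The first letter is kept as-is; remember its code for collapsing.
                out.append(Character.toUpperCase(c));
                last = code;
                continue;
            }
            if (c == 'h' || c == 'w') {
                continue; // h and w do not separate letters with the same code
            }
            if (code != '0' && code != last) {
                out.append(code);
            }
            last = code; // a vowel resets this, so a repeated code after it counts again
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }
}
```

With this sketch, "Surname" and "Sirname" from the earlier example encode to the same four-character code, which is the kind of spelling variation Soundex is meant to catch.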
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42908
    
You may want to run the string through the following code, which produces an ASCII equivalent:

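import sun.text.Normalizer; // internal Sun class (pre-Java 6), not a public API

// Decompose accented characters into a base letter plus combining marks
String temp = Normalizer.normalize(inputString, Normalizer.DECOMP, 0);
// Drop everything outside ASCII, which removes the combining marks
temp = temp.replaceAll("[^\\p{ASCII}]", "");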


I've found this to be very useful in getting text to disregard variations in accents, umlauts and the like, which are wont to happen if people enter text.

Starting with Java 6, the Normalizer class is part of the java.text package, by the way, so this class isn't going away.

Soundex and Metaphone won't help in this case, as they don't have a concept of accents and umlauts. I just did some research on this subject, and was quite happy to discover the above code. Combining this with DoubleMetaphone (which still can't deal with accents, but at least knows about Slavo/Germanic languages in addition to English) makes it possible to identify a broad range of spellings and misspellings of the same word/phrase. All these phonetic algorithms are implemented by the Jakarta Commons Codec library.
[ May 15, 2007: Message edited by: Ulf Dittmer ]
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18985
    

Well, whether accented letters are equivalent in some way to non-accented versions of those same letters is a much more subtle question than what you seem to be considering. However here's what I think is a relevant blog entry:

http://weblogs.java.net/blog/joconner/archive/2007/02/normalization_c.html

and here's a link to Unicode's article on the subject:

http://unicode.org/reports/tr15/

You may be interested in compatibility decomposition rather than canonical decomposition. And then you might want to go through the result and drop any characters whose Unicode type is MODIFIER_SYMBOL and probably a lot of other things.
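A rough sketch of that approach, assuming java.text.Normalizer (Java 6 and later) is available. The set of character types dropped here is only a starting guess (combining marks plus MODIFIER_SYMBOL), and the class name is hypothetical:

```java
import java.text.Normalizer;

final class MarkStripper {
    // Compatibility-decompose the input (NFKD), then drop combining marks
    // and modifier symbols, keeping base letters from any alphabet.
    static String stripMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);
            int type = Character.getType(cp);
            boolean drop = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK
                    || type == Character.MODIFIER_SYMBOL;
            if (!drop) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

Unlike stripping everything outside ASCII, this keeps non-Latin base letters, so a Cyrillic string stays Cyrillic and only loses its diacritics. Whether that loss is acceptable is still the application's call.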
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Ah, cool - I hadn't seen that.

[Ulf]: Starting with Java 6, the Normalizer class is part of some java.* package, by the way, so this class isn't going away.

It's in java.text, right next to Collator. While sun.text.Normalizer hasn't gone away yet, it has changed enough that your code does not work under JDK 6, as is always the risk with sun.* packages. The new version would use java.text.Normalizer instead, which takes a Normalizer.Form constant rather than the old mode and options arguments.
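A minimal sketch of the java.text.Normalizer version, assuming canonical decomposition (Normalizer.Form.NFD) is the right counterpart to the old DECOMP mode. The class and method names are made up for illustration:

```java
import java.text.Normalizer;

final class AsciiFolder {
    // Decompose accented characters into base letter + combining mark,
    // then remove every character that is not plain ASCII.
    static String toAscii(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }
}
```

Note that this deletes any character with no ASCII base letter, so text in a non-Latin alphabet comes out empty rather than transliterated.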


[Paul C]: Well, whether accented letters are equivalent in some way to non-accented versions of those same letters is a much more subtle question than what you seem to be considering.

Agreed, but this might at least be useful as a first stab at it, and might be good enough for his needs.
[ May 15, 2007: Message edited by: Jim Yingst ]
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42908
    
Originally posted by Paul Clapham:
Well, whether accented letters are equivalent in some way to non-accented versions of those same letters is a much more subtle question than what you seem to be considering.


Quite true. It depends very much on the application at hand, and the amount of accuracy required. All these phonetic transformations and algorithms should be applied to a sizeable body of text representative of what the live scenario might entail, and the result compared to the alternatives.
 
 