Need alternative to .toUpperCase(), messes up some characters

Ryan Stille
Greenhorn

Joined: Aug 09, 2008
Posts: 10
I need to uppercase my part numbers before sending them to SAP. We were using .toUpperCase(), but recently ran into a part that contained a μ (Latin Mu, micro) character. The .toUpperCase() method turns this into 'M'! I know this is kind of technically correct, but I doubt it's ever what the developer wants.

Anyway, does anyone have a good alternative idea? In Perl I would just say "$foo =~ tr/a-z/A-Z/". Is there anything like that in Java?

I'm thinking about just looping over the string and uppercasing each letter as long as it's in the normal ASCII range.

Thanks.
Ryan Stille
Greenhorn

Joined: Aug 09, 2008
Posts: 10
Turns out looping over the string works pretty well. I get the ASCII code of the character, and if it's between 97 and 122 I toUpperCase() it. Benchmarking shows it takes 0 ms even with a 50-character string.

I'm not sure this will work perfectly though. If someone pastes in a 'd', for example, could that 'd' be Unicode and end up not matching my 97 to 122 check? From looking at the Unicode/ASCII charts, it looks like there is no way 'd' can be represented with a higher number, so it should work OK?
Greg Charles
Sheriff

Joined: Oct 01, 2001
Posts: 2833
Yes, that should work fine. String.toUpperCase() makes letters from all languages uppercase. (Μ is an uppercase mu in Greek.) There's a version of toUpperCase() that takes a Locale, but I just took a look at the source code, and it seems like the only thing the Locale is used for is to handle a special case for Turkish. The looping should be nice and speedy, and I can pretty much guarantee that's what Perl is doing under the covers anyway. You might want to benchmark 10,000 50-character strings. Getting sub-millisecond times for a single String is no great shakes!

By the way, I assume you build up a character array in your loop and create a String from that. You want to be careful that you aren't building up a bunch of intermediate Strings.
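A minimal sketch of the loop described above, assuming the goal is to uppercase only a-z and leave every other character alone. The class and method names are made up for illustration; Ryan's actual code was not posted. It works on one char array and builds a single String at the end, so no intermediate Strings are created.

```java
class AsciiUpper {
    /** Uppercases only the ASCII letters a-z; every other char is copied unchanged. */
    static String toAsciiUpperCase(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c >= 'a' && c <= 'z') {
                // 'a' - 'A' is 32, the fixed offset between the two ASCII ranges
                chars[i] = (char) (c - ('a' - 'A'));
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // \u00B5 is the micro sign; it should survive untouched
        String result = toAsciiUpperCase("ab-10\u00B5f");
        System.out.println(result.equals("AB-10\u00B5F")); // prints "true"
    }
}
```

This is roughly the Java equivalent of Perl's tr/a-z/A-Z/.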
Hauke Ingmar Schmidt
Rancher

Joined: Nov 18, 2008
Posts: 433
This will only work for the simplest use case; ASCII and the letter range a-z are not even sufficient to write all English words. Please see this article about encodings. In short: relying on ASCII only is a no-go.

If you have certain characters that should be treated specially, like not turning a Greek lowercase m (µ) into the uppercase version (Μ, which is a Greek capital M even if it looks like a Latin capital M; it has a different Unicode code point), then you have to define your exceptions and, for example, split the string around them. The one use case I can think of is mathematical terms, which should not be touched by toUpperCase. They should be "marked" or masked in your source string (e.g. by XML tags) so you can split them out and spare them.

0 ms is quite fast, even for a string as short as 50 characters. This is a sign of a faulty benchmark. The method per se may be fast enough for your needs though.
Ryan Stille
Greenhorn

Joined: Aug 09, 2008
Posts: 10
I have read that article before, but it was a good refresher to read again anyway, thanks.

Even if the user pastes in some UTF-8 text, I'm pretty sure my code that returns me the code for the character will return 63 for 'c', not 'U+0063'. If I am misunderstanding how that works - in the end I'm only talking about doing this to part numbers, so I think this will be pretty safe.
Greg Charles
Sheriff

Joined: Oct 01, 2001
Posts: 2833
In theory, a user could paste a lower-case Omicron into your text thinking it's the same as a lower-case O, and then you'd miss it in your conversion. I think Hauke was more concerned about non-ASCII characters that you might still want to capitalize, like c with a cedilla, n with a tilde, and a with an acute accent, for example. He may be right that excluding exceptions is the way to go. I suppose the μ in your part number stands for micro. What other special characters are there, which are also letters? If there are just a few, you could skip them while looping through the string and calling Character.toUpperCase().
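The skip-list approach could look something like the sketch below. The exception set here (the micro sign U+00B5 and Greek small mu U+03BC) is only an assumed example; the real list would have to come from the actual part numbers. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class UpperWithExceptions {
    // Assumed exception list for illustration: micro sign and Greek small mu
    private static final Set<Character> KEEP_AS_IS =
            new HashSet<>(Arrays.asList('\u00B5', '\u03BC'));

    /** Uppercases every char via Character.toUpperCase() except those in KEEP_AS_IS. */
    static String toUpperCaseExcept(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            if (!KEEP_AS_IS.contains(chars[i])) {
                // One-to-one mapping only: a char such as \u00DF (sharp s) is returned
                // unchanged here, and characters outside the BMP are not handled.
                chars[i] = Character.toUpperCase(chars[i]);
            }
        }
        return new String(chars);
    }
}
```

With this, an n with a tilde is capitalized but the mu is left alone, which is the opposite default from the a-z-only loop.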
Ryan Stille
Greenhorn

Joined: Aug 09, 2008
Posts: 10
Yes, that would be another way to go about it. Certainly, if I run into issues this way, I could compile an exceptions list and capitalize everything except what's in the list. But I think my users will run into fewer issues if I do it this way (capitalize only a-z). After all, this is for part numbers, not a paragraph of text. If there is an n with a tilde in the part number, it probably needs to stay that way and NOT be capitalized. That is the issue I'm running into with this part number with a micro/Mu in it, anyway.
Hauke Ingmar Schmidt
Rancher

Joined: Nov 18, 2008
Posts: 433
Ryan Stille wrote: After all, this is for part numbers, not a paragraph of text.


Oh, part numbers, my bad, I didn't pay too much attention to that (no blush emoticon here?). Sure, for a part number you may need specific rules for changing case.

But I am a little concerned that part numbers allow arbitrary input. The important part here is not technical; it is letting users know why some parts of the input change and others do not ("Maßgüldner" -> "MAßGüLDNER" instead of "MASSGÜLDNER").

Even if the user pastes in some UTF-8 text, I'm pretty sure my code that returns me the code for the character will return 63 for 'c', not 'U+0063'.


Sure, you get the numeric value, not the literal. And for most letters from Latin alphabets the codes are the same in ASCII and other encodings.
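For reference, here is a small sketch of what the built-in String.toUpperCase() does with the two examples from this thread. The class and helper names are made up; Locale.ROOT is passed only to keep the result independent of the machine's default locale.

```java
import java.util.Locale;

class UpperCaseDemo {
    /** Thin wrapper so the behaviour is easy to try out on other inputs. */
    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Micro sign U+00B5 becomes Greek capital mu U+039C, not Latin 'M'
        System.out.println(upper("\u00B5").equals("\u039C")); // prints "true"
        System.out.println(upper("\u00B5").equals("M"));      // prints "false"

        // Sharp s expands to "SS", so the result is one char longer than the input
        System.out.println(upper("Ma\u00DFg\u00FCldner").equals("MASSG\u00DCLDNER")); // prints "true"
    }
}
```

The second case shows why a per-character loop and String.toUpperCase() can give results of different lengths.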
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 2982
Hmmm, my feeling is that whoever decided it was OK to put a "μ" into a field called "part number" needs to be beaten with a tire iron. Actually I'm tempted to apply that to anyone who puts non-numeric characters into a field called a "number" of some sort - but that's far too common, and I would promptly be arrested for homicide at my current job. Might be worth it, though.

Perhaps the most useful thing for you to do here would be to analyze the "part numbers" you actually have. Write a program to look at all the "part numbers" you can find, from wherever your input comes from. Have the program report all instances of non-US-ASCII values that it finds. Then you and/or your users look at those examples and figure out how those exceptions need to be handled. (Do not disregard the tire iron approach here; it may yet apply.) Which is more common: using toUpperCase(), or not? Either way, you will want to create a general policy (use toUpperCase(), or don't) and then create a list of exceptions to that policy.
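A rough sketch of such an analysis program, assuming the part numbers have already been loaded into a List<String> (where they come from is not covered here). The class and method names are hypothetical. It reports each non-US-ASCII code point together with the part numbers that contain it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class NonAsciiReport {
    /** Maps each non-ASCII code point (formatted as U+XXXX) to the part numbers containing it. */
    static Map<String, List<String>> findNonAscii(List<String> partNumbers) {
        Map<String, List<String>> report = new TreeMap<>();
        for (String partNumber : partNumbers) {
            partNumber.codePoints()
                    .filter(cp -> cp > 0x7F)   // US-ASCII ends at 0x7F
                    .distinct()                // list each part number once per code point
                    .forEach(cp -> report
                            .computeIfAbsent(String.format("U+%04X", cp), k -> new ArrayList<>())
                            .add(partNumber));
        }
        return report;
    }
}
```

The resulting map can be printed and handed to the users to decide which characters need special handling.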
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 37879
Fascinating discussion, particularly what Mike Simmons says, but I think too advanced for "beginning". Moving thread.
Greg Charles
Sheriff

Joined: Oct 01, 2001
Posts: 2833
I think Mike needs to write an article, "There are no silver bullets, but tire irons are plentiful". Maybe he already has!

I like his idea about cataloging the special characters in a broad sample of part "numbers" and presenting the findings to the users. Often developers need to drive requirements this way. Wasn't it JavaRanch's own Kathy Sierra who said not to just give the users what they ask for, but to give them what they actually want?
Pat Farrell
Rancher

Joined: Aug 11, 2007
Posts: 4646
Mike Simmons wrote: Actually I'm tempted to apply that to anyone who puts non-numeric characters into a field called a "number" of some sort - but that's far too common, and I would promptly be arrested for homicide at my current job.

I completely agree with @mike, and have held that belief for decades. But more than 30 years ago, I learned that the ISBN (International Standard Book Number) has an X in it. Sigh.
Greg Charles
Sheriff

Joined: Oct 01, 2001
Posts: 2833
Pat Farrell wrote:But more than 30 years ago, I learned that the ISBN, International Standard Book Number has an X in it. Sigh.


Hey, X is a number in Roman numerals!

Actually, I'm not totally joking there. The last character of a 10-digit ISBN is a check digit. They apply a formula to the other digits, take the result modulo 11, and represent 10 as X. (Prime moduli generally work better for check digits.) So, X is a base 11 number the same way CAFEBABE is a base 16 number.
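To make the check digit concrete, here is a sketch of the standard ISBN-10 calculation: the first nine digits are weighted 10 down to 2, and the check digit is whatever brings the total to a multiple of 11. The class and method names are invented for this example.

```java
class Isbn10 {
    /** Computes the ISBN-10 check character ('0'-'9' or 'X') from the first nine digits. */
    static char checkDigit(String firstNine) {
        if (firstNine.length() != 9) {
            throw new IllegalArgumentException("expected exactly 9 digits");
        }
        int sum = 0;
        for (int i = 0; i < 9; i++) {
            char c = firstNine.charAt(i);
            if (c < '0' || c > '9') {
                throw new IllegalArgumentException("not a digit: " + c);
            }
            sum += (c - '0') * (10 - i); // weights run 10, 9, ..., 2
        }
        int check = (11 - sum % 11) % 11; // value in 0..10
        return check == 10 ? 'X' : (char) ('0' + check);
    }

    public static void main(String[] args) {
        System.out.println(checkDigit("030640615")); // prints "2" (ISBN 0-306-40615-2)
        System.out.println(checkDigit("080442957")); // prints "X" (ISBN 0-8044-2957-X)
    }
}
```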
Hauke Ingmar Schmidt
Rancher

Joined: Nov 18, 2008
Posts: 433
So we are nearly back on topic. a..z are not all letters, 0..9 not all digits (and not all symbols needed to write Java numeric literals).
Pat Farrell
Rancher

Joined: Aug 11, 2007
Posts: 4646
Hauke Ingmar Schmidt wrote: a..z are not all letters, 0..9 not all digits (and not all symbols needed to write Java numeric literals).


Yeah, I mean, what about A..Z?
Even American's sometimes use capital letters.
Greg Charles
Sheriff

Joined: Oct 01, 2001
Posts: 2833
Pat Farrell wrote:
Yeah, I mean, what about A..Z?
Even American's sometimes use capital letters.


Don't you mean letter's?

Americans don't need to capitalize capital letters, generally speaking of course.
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 2982
Greg Charles wrote: Americans don't need to capitalize capital letters, generally speaking of course.

Some Texans might, I suppose.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 37879
I'm sure there's lots of capital to be made from this thread.
Hauke Ingmar Schmidt
Rancher

Joined: Nov 18, 2008
Posts: 433
Pat Farrell wrote:
Hauke Ingmar Schmidt wrote: a..z are not all letters, 0..9 not all digits (and not all symbols needed to write Java numeric literals).


Yeah, I mean, what about A..Z?


Hm... in my view "a" and "A" are the same letter, expressed by different symbols with different contextual semantics. Majuscule and minuscule versions are just different representations of the letter.

But I digress.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 37879
Hauke Ingmar Schmidt wrote: . . . "a" and "A" are the same letter . . .
. . . until somebody calls you HAuke IngmAr Schmidt
Hauke Ingmar Schmidt
Rancher

Joined: Nov 18, 2008
Posts: 433
Campbell Ritchie wrote:
Hauke Ingmar Schmidt wrote: . . . "a" and "A" are the same letter . . .
. . . until somebody calls you HAuke IngmAr Schmidt


 
 