JavaRanch » Java Forums » Java » Beginning Java
Sun String IndexOf method API Question

catherine powell
Greenhorn

Joined: Oct 07, 2006
Posts: 26
Hi Everyone,

I just have a small question that's been bugging me about the API documentation on Sun's web site for the indexOf method of the String class that takes a char argument... why does the documentation list int as the argument's data type?


indexOf
public int indexOf(int ch)
Returns the index within this string of the first occurrence of the specified character. If a character with value ch occurs in the character sequence represented by this String object, then the index (in Unicode code units) of the first such occurrence is returned. For values of ch in the range from 0 to 0xFFFF (inclusive),...

Parameters:
ch - a character (Unicode code point).


Thanks very much in advance for your help...Catherine
Jeanne Boyarsky
author & internet detective
Marshal

Joined: May 26, 2003
Posts: 30929
Catherine,
In Java a char is one byte. However, some Unicode characters are encoded in two bytes. Allowing an int as input allows these characters to be passed to indexOf.


[Blog] [JavaRanch FAQ] [How To Ask Questions The Smart Way] [Book Promos]
Blogging on Certs: SCEA Part 1, Part 2 & 3, Core Spring 3, OCAJP, OCPJP beta, TOGAF part 1 and part 2
catherine powell
Greenhorn

Joined: Oct 07, 2006
Posts: 26
thank you so much!

I believe I found the documentation in the Character Class API...


The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value...

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters.
A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points...

The methods that accept an int value support all Unicode characters, including supplementary characters.
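To see what that means in practice, here's a little sketch I put together (not from the docs): U+1D11E, the musical G clef, is a supplementary character, so it takes two chars (a surrogate pair) in a Java String, but indexOf(int) can still find it by code point.

```java
public class SupplementaryDemo {
    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) is a supplementary character:
        // in a String it is stored as the surrogate pair \uD834\uDD1E.
        String s = "abc\uD834\uDD1Edef";

        System.out.println(s.length());                   // 8 chars, not 7
        System.out.println(s.indexOf(0x1D11E));           // 3 - found by code point
        System.out.println(Character.charCount(0x1D11E)); // 2 - needs two chars
    }
}
```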


Thanks again for your help...Catherine
[ November 16, 2006: Message edited by: catherine powell ]
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
[Jeanne]: In Java a char is one byte. However, some Unicode characters are encoded in two bytes.

Ummm... I can't really see how this is true. In Java, a char has a range from 0 to 65535, which requires (at minimum) two bytes. It's true that if you encode a group of chars as bytes using most common encoding schemes (ASCII, ISO-8859-1, Cp1252, UTF-8), the most common (English-language) characters can be encoded in 1 byte per character. But that's not an absolute rule, and I think it's dangerously misleading to say that a char is a byte.

So, why do they use int rather than char as the parameter type here? (For String.indexOf() as well as various other methods scattered through the standard API.) I think the reason is convenience, given that int is the "default" type for most integral expressions, unless something in the expression forces it to be long, float, or double instead. Anytime you perform simple arithmetic, or even just write a plain literal like 1 or 42, Java assumes you mean an int. And if a method expects a char rather than an int, Java gets pissy and balks until you fix it. I think the decision to define indexOf(int) rather than indexOf(char) was motivated by nothing more than the desire to save users the mild annoyance of having to cast an int expression down to char. Which is not a particularly compelling reason, in my opinion, but it's the best one I can think of.

So: chars in Java require two bytes. But Java often tends to assume that computations will need four bytes, and to accommodate this, indexOf() and other methods accept parameters of int type.
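To illustrate the convenience point, here's a quick sketch of my own: a char widens to int implicitly, so indexOf(int) accepts both a char literal and the result of char arithmetic (which is already an int) without any cast.

```java
public class WideningDemo {
    public static void main(String[] args) {
        String s = "banana";

        // A char argument widens to int implicitly - no cast needed.
        System.out.println(s.indexOf('n'));     // 2

        // Arithmetic on chars produces an int: 'a' + 1 is the int 98,
        // i.e. 'b'. If indexOf took a char, this call would need an
        // explicit (char) cast to compile.
        System.out.println(s.indexOf('a' + 1)); // 0
    }
}
```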
[ November 17, 2006: Message edited by: Jim Yingst ]

"I'm not back." - Bill Harding, Twister
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42608
Starting with Java 5, a 16-bit char is no longer big enough to identify all possible Unicode characters, because Unicode 4.0 (which is what Java 5 supports) has more than 65536 characters. Some explanation about this can be found in these two blog entries.
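A quick check of the limits (my own snippet, not from those blog entries): the widest value a char can hold versus the highest legal Unicode code point.

```java
public class UnicodeRangeDemo {
    public static void main(String[] args) {
        // The largest value a 16-bit char can hold.
        System.out.println((int) Character.MAX_VALUE);  // 65535 (0xFFFF)

        // The highest legal Unicode code point (constant added in Java 5).
        System.out.println(Character.MAX_CODE_POINT);   // 1114111 (0x10FFFF)

        // Code points above 0xFFFF are "supplementary" and need two chars.
        System.out.println(Character.isSupplementaryCodePoint(0x10400)); // true
    }
}
```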


Ping & DNS - my free Android networking tools app
Jeanne Boyarsky
author & internet detective
Marshal

Joined: May 26, 2003
Posts: 30929

Oops. Thanks for catching that Jim.

I remembered the idea (some characters don't fit in a char), but misremembered the number of bytes involved. You and Ulf described it much better.
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
[Ulf]: Starting with Java 5, a 16-bit char is no longer big enough to identify all possible Unicode characters, because Unicode 4.0 (which is what Java 5 supports) has more than 65536 characters.

True. A char is still in the range 0-65535, but characters (code points) can have higher values. And in fact indexOf() does now make use of this, allowing you to pass in code point values larger than 65535. The interesting thing though is that indexOf(int) was the signature for this method from way back at JDK 1.0 (and probably earlier). I'm pretty sure that back then they weren't thinking about the possibility that Unicode might need to be expanded to wider ranges - if they were, then there are several other methods that should have been defined differently. Which is why String has now added methods like codePointAt() to supplement charAt().
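A small sketch of my own showing the charAt()/codePointAt() distinction, using U+10400 (DESERET CAPITAL LETTER LONG I), which is stored as a surrogate pair:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // U+10400 is stored in UTF-16 as the surrogate pair \uD801\uDC00.
        String s = "x\uD801\uDC00";

        // charAt() sees only 16-bit code units - here, half a character.
        System.out.println((int) s.charAt(1)); // 55297 (0xD801, high surrogate)

        // codePointAt() combines the pair into the real code point.
        System.out.println(s.codePointAt(1));  // 66560 (0x10400)
    }
}
```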

So, it looks to me like the original reason to use indexOf(int) rather than indexOf(char) had nothing to do with the possibility of code points larger than 65535. But now, it turns out that they got lucky, because in fact it is possible to have higher values for characters - though not for chars.
 