Sun String indexOf method API Question

 
catherine powell
Greenhorn
Posts: 26
Hi Everyone,

I have a small question that's been bugging me about the API documentation on Sun's web site for the version of the String class's indexOf method that searches for a single character. Why does the documentation list int as the argument's data type?


indexOf
public int indexOf(int ch)
Returns the index within this string of the first occurrence of the specified character. If a character with value ch occurs in the character sequence represented by this String object, then the index (in Unicode code units) of the first such occurrence is returned. For values of ch in the range from 0 to 0xFFFF (inclusive),...

Parameters:
ch - a character (Unicode code point).
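
For concreteness, a minimal sketch of what that signature accepts; the sample string is made up for illustration, and both calls go to the same standard String.indexOf(int):

public class IndexOfDemo {
    public static void main(String[] args) {
        String s = "cat";  // made-up sample string
        // A char argument compiles because 'a' is widened to the int 97
        System.out.println(s.indexOf('a'));  // prints 1
        // Passing the code point directly is the same call
        System.out.println(s.indexOf(97));   // prints 1
    }
}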



Thanks very much in advance for your help...Catherine
 
Jeanne Boyarsky
author & internet detective
Posts: 41860
Catherine,
In Java a char is one byte. However, some Unicode characters are encoded in two bytes. Allowing an int as input allows these characters to be passed to indexOf.
 
catherine powell
Greenhorn
Posts: 26
Thank you so much!

I believe I found the documentation in the Character class API...


The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value...

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters.
A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points...

The methods that accept an int value support all Unicode characters, including supplementary characters.
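
To see the distinction from the quoted docs in running code, here is a small sketch; U+10400 (DESERET CAPITAL LETTER LONG I) is just an arbitrary supplementary code point chosen for illustration, and the Character methods used exist as of Java 5:

public class CodePointDemo {
    public static void main(String[] args) {
        int bmp = 0x0041;            // 'A', a BMP code point
        int supplementary = 0x10400; // arbitrary supplementary code point

        // charCount() reports how many chars (UTF-16 code units) a code point needs
        System.out.println(Character.charCount(bmp));            // 1
        System.out.println(Character.charCount(supplementary));  // 2 (a surrogate pair)

        // toChars() converts a code point to its UTF-16 representation
        String s = new String(Character.toChars(supplementary));
        System.out.println(s.length());                      // 2 chars...
        System.out.println(s.codePointCount(0, s.length())); // ...but only 1 code point
    }
}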



Thanks again for your help...Catherine
[ November 16, 2006: Message edited by: catherine powell ]
 
Jim Yingst
Wanderer
Posts: 18671
[Jeanne]: In Java a char is one byte. However, some Unicode characters are encoded in two bytes.

Ummm... I can't really see how this is true. In Java, a char has a range from 0 to 65535, which requires (at minimum) two bytes. It's true that if you encode a group of chars as bytes using most common encoding schemes (ASCII, ISO-8859-1, Cp1252, UTF-8), the most common (English-language) characters can be encoded in 1 byte per character. But that's not an absolute rule, and I think it's dangerously misleading to say that a char is a byte.
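
A quick sketch of that encoding point; the sample string is made up, and the charset names are standard ones Java ships with:

import java.io.UnsupportedEncodingException;

public class EncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "héllo";  // 5 chars, one of them outside ASCII
        System.out.println(s.getBytes("ISO-8859-1").length);  // 5 bytes: 1 per char
        System.out.println(s.getBytes("UTF-8").length);       // 6 bytes: 'é' takes 2
        System.out.println(s.getBytes("UTF-16BE").length);    // 10 bytes: 2 per char
    }
}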

So, why do they use int rather than char as the parameter type here? (For String.indexOf() as well as various other methods scattered around the standard API.) I think the reason is convenience, given that int is the "default" type for most expressions, unless something in the expression forces it to be float or double instead. Any time you perform simple arithmetic, or even just write a plain literal like 1 or 42, Java assumes you mean an int. And if a method is expecting a char rather than an int, Java gets pissy and balks until you fix it. I think the decision to define indexOf(int) rather than indexOf(char) was motivated by nothing more than the desire to save users from the mild annoyance of having to cast an int expression down to char. Which is not a particularly compelling reason, in my opinion, but it's the best one I can think of.
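
A small sketch of that annoyance; the expression below is a made-up example:

public class WideningDemo {
    public static void main(String[] args) {
        String s = "abc";

        // 'a' + 1 is an int expression (value 98), so it fits indexOf(int) directly
        System.out.println(s.indexOf('a' + 1));  // prints 1 (finds 'b')

        // With a hypothetical indexOf(char), the caller would need a narrowing cast:
        char next = (char) ('a' + 1);
        System.out.println(s.indexOf(next));     // also prints 1
    }
}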

So: chars in Java require two bytes. But Java tends to assume that computations will be done in four-byte ints, and to accommodate this, indexOf() and other methods accept parameters of int type.
[ November 17, 2006: Message edited by: Jim Yingst ]
 
Ulf Dittmer
Rancher
Posts: 43081
Starting with Java 5, a 16-bit char is no longer big enough to identify all possible Unicode characters, because Unicode 4.0 (which is what Java 5 supports) has more than 65536 characters. Some explanation about this can be found in these two blog entries.
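
A minimal sketch of that limit, using constants that have existed since Java 5:

public class RangeDemo {
    public static void main(String[] args) {
        // A char tops out at U+FFFF, but Unicode code points go up to U+10FFFF
        System.out.printf("max char value: U+%04X%n", (int) Character.MAX_VALUE);
        System.out.printf("max code point: U+%04X%n", Character.MAX_CODE_POINT);
    }
}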
 
Jeanne Boyarsky
author & internet detective
Posts: 41860
Oops. Thanks for catching that, Jim.

I remembered the idea (some characters don't fit in a char), but misremembered the number of bytes involved. You and Ulf described it much better.
 
Jim Yingst
Wanderer
Posts: 18671
[Ulf]: Starting with Java 5, a 16-bit char is no longer big enough to identify all possible Unicode characters, because Unicode 4.0 (which is what Java 5 supports) has more than 65536 characters.

True. A char is still in the range 0-65535, but characters (code points) can have higher values. And in fact indexOf() does now make use of this, allowing you to pass in code point values larger than 65535. The interesting thing, though, is that indexOf(int) was the signature for this method from way back at JDK 1.0 (and probably earlier). I'm pretty sure that back then they weren't thinking about the possibility that Unicode might need to be expanded to wider ranges - if they had been, several other methods would have been defined differently. That's why String has since added methods like codePointAt() to supplement charAt().
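
A sketch of both points; U+10400 is an arbitrary supplementary code point chosen for illustration:

public class CodePointIndexOfDemo {
    public static void main(String[] args) {
        int cp = 0x10400;  // arbitrary supplementary code point
        String s = "ab" + new String(Character.toChars(cp)) + "cd";

        // indexOf(int) accepts a value too large for any char...
        System.out.println(s.indexOf(cp));  // prints 2

        // ...and at that index, charAt() sees only the high surrogate,
        // while codePointAt() reconstructs the full code point
        System.out.printf("charAt:      U+%04X%n", (int) s.charAt(2));  // U+D801
        System.out.printf("codePointAt: U+%04X%n", s.codePointAt(2));   // U+10400
    }
}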

So, it looks to me like the original reason to use indexOf(int) rather than indexOf(char) had nothing to do with the possibility of code points larger than 65535. But now, it turns out that they got lucky, because in fact it is possible to have higher values for characters - though not for chars.
 