
How does the String constructor work exactly when passed an array?

 
Greenhorn
I have a question about the String constructor.
Normally when I write this it compiles successfully, but when I write this it shows a compilation error. Please explain this weird behaviour.
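Presumably the two snippets were along these lines:

    char[] chars = {'a', 'b', 'c'};
    String fromChars = new String(chars);   // compiles: String(char[]) exists

    int[] ints = {97, 98, 99};
    String fromInts = new String(ints);     // compile-time error: there is no String(int[]) constructor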

       
 
Campbell Ritchie
Marshal
Welcome to the Ranch

Please go through the documentation and see how many of the constructors you can get to match your code.
 
Stephan van Hulst
Saloon Keeper
Welcome to CodeRanch!

A string is composed of characters, not of numbers. There are ways to convert an array of numbers to a String, but that's not the job of a constructor.
 
Paul Clapham
Marshal
Actually, there is this constructor (since 1.5):
It works based on the idea that chars in a String make up Unicode code points. So conceivably in an alternate Java universe there could be the constructor
but in our universe there isn't.
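For example, the constructor that does exist is String(int[] codePoints, int offset, int count), so something like this works:

    int[] codePoints = {72, 105, 0x1F310};                     // 'H', 'i', GLOBE WITH MERIDIANS
    String s = new String(codePoints, 0, codePoints.length);   // "Hi🌐"

    // The imaginary single-argument version would be new String(codePoints),
    // but no such constructor exists, hence the compile-time error.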
 
Stephan van Hulst
Saloon Keeper
It has always bothered me that they didn't take the time to make a proper CodePoint class. They probably did it for performance reasons. I suppose Valhalla might remedy that, but by then it's (code) pointless (haha) because ints are already ubiquitous.
 
Tim Holloway
Saloon Keeper
I don't know what a CodePoint class would do, since a "code point" is defined as "any of the numerical values that make up the codespace". And until we find a way to use floating-point as a codespace index, cardinal numbers (Java ints) will have to do.

Using codepoints to assemble a string requires you to know specific encoding values for a specific character encoding. Unlike many languages, Java doesn't just allow you to shovel bytes into a character string because A) that would defeat I18n and B) a byte is most definitely not the same thing as a character, and Unicode is a prime example of that. That Java supports codepoints at all is because it understands that sometimes you have to work with interfaces that don't deal directly with characters so you need something that can translate from raw bits. For internal use, Java's character and string classes can generally deal with codepage translations without resorting to direct codepoint manipulation.
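For instance, decoding raw bytes into a String always means naming the charset explicitly:

    byte[] raw = {72, 101, 108, 108, 111};                                      // raw bits from some external interface
    String s = new String(raw, java.nio.charset.StandardCharsets.US_ASCII);     // "Hello": the charset must be stated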
 
Campbell Ritchie
Marshal
Sorry, this post got delayed between writing it and pushing the submit button.

Paul Clapham wrote:. . . in an alternate Java universe there could be the constructor
but in our universe there isn't.

Which is why OP has been getting the compile-time error. Of course, new String(new int[]{1, 2, 3}, 0, 3); would be invisible, since those are non‑printing characters. Who knows what would happen if you included 4 or 26, since those are control characters like end‑of‑transmission.

I think the changes to Unicode took Sun by surprise. They had a platform supporting Unicode all the way from 0x1 to 0xffff, and then they potentially had another 1,000,000 characters to handle. Would you regard a CodePoint class as a wrapper for an int or similar, Stephan?
 
Stephan van Hulst
Saloon Keeper

Campbell Ritchie wrote:Would you regard a CodePoint class as a wrapper for an int or similar, Stephan?


Not an int but a byte sequence that encodes a single Unicode codepoint in UTF-8. This of course is an implementation detail.

It would act like char, except that it wouldn't allow partial codepoints (a char can hold a lone code unit that is only half of the surrogate pair making up a supplementary character).

int is a really poor data type to represent a Unicode codepoint. Most of the time it's very wasteful, and if we ever manage to accrue more than 2^32 codepoints (which admittedly probably won't be in my or my children's, or their children's lifetimes), we have the same problem we had with char. The strongest case against int is that some values may not even represent existing or legal codepoints.
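Just to sketch the idea (purely hypothetical, nothing like this exists in the JDK):

    // Hypothetical CodePoint type; internally a UTF-8 byte sequence, but that's never exposed.
    public final class CodePoint {
        private final byte[] utf8;

        private CodePoint(byte[] utf8) {
            this.utf8 = utf8;
        }

        public static CodePoint valueOf(int ordinal) {
            if (!Character.isValidCodePoint(ordinal)) {           // no negative or out-of-range values
                throw new IllegalArgumentException("Not a Unicode code point: " + ordinal);
            }
            byte[] bytes = new String(new int[] {ordinal}, 0, 1)
                    .getBytes(java.nio.charset.StandardCharsets.UTF_8);
            return new CodePoint(bytes);
        }

        public int toInt() {                                      // ordinal value, computed on demand
            return new String(utf8, java.nio.charset.StandardCharsets.UTF_8).codePointAt(0);
        }
    }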
 
Paul Clapham
Marshal

Stephan van Hulst wrote:The strongest case against int is that some values may not even represent existing or legal codepoints.



This problem was already there when a String was basically an array of chars, before code points came along in Java 5. Many char values don't represent existing Unicode characters either, although if I remember right the documentation back then equated chars with Unicode characters.

But back then (prior to 2004), trying to build infrastructure to make chars really be Unicode characters would have been scorned as over-architecting. Look at how long it took for the proponents of architecture to prevail over the cheap code which java.util.Date was, and to produce an architecturally sound version of date handling.
 
Tim Holloway
Saloon Keeper

Stephan van Hulst wrote:Not an int but a byte sequence that encodes a single Unicode codepoint in UTF-8. This of course is an implementation detail.



No.

A Codepoint is:

Wikipedia wrote:any of the numerical values that make up the codespace



Assuming that it's comprised of bytes, or indeed of any concrete quantity defeats the purpose. It not only constrains the size of the codepoint space, it assumes the ordering of the bits. And the means by which they are ordered (endian-ness). And even today, there are still machines that enumerate bytes in continuous order rather than bytewise-discontinuous order. To say nothing of when you serialize them into bitstreams.

Java's codepage translation does not limit itself so straitly. It's write-once/run-anywhere. It is reasonable to accept that a codepoint is a numerical value in the domain 0..infinity, although practical needs lead us to subset that into the Java integer space. Bytes are not integers, and only for legacy reasons does Java allow them to be interchanged in any way, or indeed to be arbitrarily defined as 8 bits, to say nothing of signedness, the direction in which bit magnitude increases, and nybble order.
 
Stephan van Hulst
Saloon Keeper

Tim Holloway wrote:Using codepoints to assemble a string requires you to know specific encoding values for a specific character encoding.


No it doesn't. By codepoint I mean a Unicode codepoint, which represents an abstract character, without assuming any particular encoding. A string is just a sequence of abstract characters.

I'm not saying it's practical, but just to illustrate my point you could envision a special syntax for "codepoint literals" that uses the name that Unicode assigns to the abstract character:

Unlike many languages, Java doesn't just allow you to shovel bytes into a character string


Yes, I agree that this is a good thing. If I were to design a new String class, its constructors would only accept sequences of codepoints, where a single codepoint does NOT represent an encoded character, but rather an abstract character. Of course, one could add a limited number of factory methods:

That Java supports codepoints at all is because it understands that sometimes you have to work with interfaces that don't deal directly with characters so you need something that can translate from raw bits.


This is incorrect. Java treats the word "codepoint" the same way Unicode does: an abstract character without an associated encoding. When you pass an int to one of Character's static methods, that int represents the unsigned ordinal position of the character in Unicode's list of abstract characters. Of course, you could argue that the ordinal value IS a form of encoding, but it's a far cry from the concept of "raw bits".

For internal use, Java's character and string classes can generally deal with codepage translations without resorting to direct codepoint manipulation.


Unicode codepoints are the most practical thing for a programmer to use when they're dealing with the individual units that a string is composed of:

Bytes require the programmer to know about an encoding, and an individual byte might not mean anything on its own.

Chars are deficient because they are really just UTF-16 code units. Besides, the word "character" is poorly defined and means different things to different people.

Ints are deficient because not every int value maps to a valid Unicode character, and they are wasteful in most cases.

A Glyph type is mostly only useful in the context of a text viewer/editor. It could represent a unit of selectable text, but is usually much more complex than we need in most applications. It might also not be appropriate for control characters.
 
Tim Holloway
Saloon Keeper
Unicode codepoints are codepoints into the Unicode character set. Java may internally work with Unicode, but it does not otherwise prefer one character set over any other. And again, the ASCII family is not the only codepage out there. EBCDIC is still very much alive and well, for example. That's not even allowing for legacy support where "bytes" could be 12 bits or similar weirdness.

The Unicode nbsp character designation isn't a codepoint. It's an Entity Name for the Unicode non-break space character. Entity names can resolve to different codepoint values depending on the code page they target.
 
Stephan van Hulst
Saloon Keeper

Tim Holloway wrote:Assuming that it's comprised of bytes, or indeed of any concrete quantity defeats the purpose.


How do you figure? Have you forgotten that Java is based on the principle of encapsulation, and that even if a dedicated Unicode codepoint datatype is internally represented by a UTF-8 encoded byte sequence, this is just an implementation detail that is not communicated to the outside world? You could even swap it out with an int if you prefer, although this makes little sense.

It not only constrains the size of the codepoint space, it assumes the ordering of the bits.


How so? An array of bytes hardly constrains the size of the codespace, unless you advertise a fixed encoding. Why would you? I'm not saying that a codepoint type must ALWAYS use the same internal encoding, or even disclose what encoding it uses in the first place.

And the means by which they are ordered (endian-ness). And even today, there are still machines that enumerate bytes in continuous order rather than bytewise-discontinuous order. To say nothing of when you serialize them into bitstreams.


How does any of this matter? A codepoint is an abstract data type, and the programmer is not privy to its internal representation. If you want to serialize a codepoint that's fine, but you have to specify an encoding. The mechanism of converting a codepoint to a bitstream is the business of the encoding.

It is reasonable to accept that a codepoint is a numerical value in the domain 0..infinity, although practicality needs allow us to subset that into the Java integer space.


You accept that a codepoint might be represented by an int, a wasteful AND limiting data type, but not that it might be represented by a byte array and an implicit encoding? In either case, the representation would be hidden from the programmer. If you want to expose the ordinal value as a public property, you can just add a method that converts the UTF-8 to big-endian UTF-32, but again, the conversion would be hidden (and seldom used, I might add).
 
Tim Holloway
Saloon Keeper
Imprimus, a byte is not - popular opinion to the contrary - a character data type. A Byte is technically supposed to be the smallest directly-addressable unit of memory for a given machine, although 16-bit word machines often used "byte" to mean halfword.
A byte is nominally supposed to be considered - when it's thought of as a number - as an unsigned set of bits. Ask Campbell about bytes and Java. Or meditate on the fact that while ASCII was a 7-bit code, ASCIIZ required a full 8 bits, including the "sign" bit. Which among other things makes it a perilous entity to use for a character lookup table index.

One of the really annoying things about Fortran prior to about 1980 was that, since it was a numerical language, you had to store characters in integers and - considering the limited resources of the day - often packed multiple characters per integer. It was also extremely non-portable, as I can attest from experience, having bounced between mainframe and minicomputer.

Secundus, the term "numerical value" wasn't made up by me. It is a direct quote from the Wikipedia article which, while hardly the Word of God, gets enough review that I think one may assume the definition has not been seriously challenged.

A Byte is not a number. It's a unit of storage. That, yes, in Java represents an integral signed (whole number) value whose range is -128 to 127 (did I get that right? Boundaries are my bane). One distinguishing characteristic of integers in computer terms is that an "int" is normally the "natural" computing value size. Subject to negotiation. The original Macintosh worked with 16-bit integers, the Amiga worked with 32-bit integers if you were using the Lattice or Green Hills compilers. All ran on the same Motorola CPU.

Tertius, a Codepoint class is almost guaranteed to occupy more resources in a JVM. Furthermore, since Java is neither C++ nor Ada, you cannot define an object of class Codepoint to fulfill the primary purpose of being, and again I repeat: "any of the numerical values that make up the codespace". Which is to say, a cardinal integer index into the codespace.

In actual practice, I don't expect to be leaving codepoints just lying around. They are usually going to be a transient intermediate stage between some sort of external media and/or foreign character set and Java's own Unicode objects. I'm very much into type safety, but attempting to make codepoints distinct object types strikes me as way more trouble than it's worth. You could argue for making codepoints a primitive - and in fact, you can actually do so in Ada - but Java hasn't gone that route and I don't expect it to.
 
Stephan van Hulst
Saloon Keeper

Tim Holloway wrote:Unicode codepoints are codepoints into the Unicode character set. Java may internally work with Unicode, but it does not otherwise prefer one character set over any other.


It absolutely does. The string constructor takes a char[], which whether you like it or not represents a sequence of UTF-16 code units. Even if in more recent versions of the language it sometimes internally uses a byte array rather than a character array, most of its outward facing methods still use char. When a method takes or returns an int instead, that method is still tightly coupled to the Unicode character set.

I'm absolutely fine with coupling your string type to a fixed character set. In the end you always need to agree on a particular character set, unless you require the programmer to pass a character set identifier with each and every method call, which isn't practical. Unicode is a great choice.

If you've agreed on a fixed character set for your language's string data type, then why not use an abstract codepoint data type as the atom of that string data type, without specifying how that codepoint is represented internally?

And again, the ASCII family is not the only codepage out there. EBCDIC is still very much alive and well, for example. That's not even allowing for legacy support where "bytes" could be 12 bits or similar weirdness.


I feel like you are saying that the atoms of the String data type should be whatever comes natural to the native platform. This will land you back in C country very quickly. If it's not what you're saying then I don't know how it has any bearing on the discussion.

The Unicode nbsp character designation isn't a codepoint. It's an Entity Name for the Unicode non-break space character. Entity names can resolve to different codepoint values depending on the code page they target.


Okay, and what I am saying is that I propose an abstract data type that targets the "Unicode" code page, and nothing else. If you want a codepoint that targets EBCDIC, then it's YOUR responsibility to convert from the language's default to a custom data type.

Would you be more at ease if I called it StringAtom, rather than Codepoint?
 
Stephan van Hulst
Saloon Keeper

Tim Holloway wrote:They are usually going to be a transient intermediate stage between some sort of external media and/or foreign character set and Java's own Unicode objects.


I think we're talking about different things. All the while I have been talking about Java's own Unicode objects and what I want them to be like. I'm talking about the thing I would like charAt(index) to have returned.

I'm very much into type safety, but attempting to make codepoints distinct object types strikes me as way more trouble than it's worth. You could argue about making codepoints be a primitive - and in fact, you could actually do so in Ada - but Java hasn't gone that route and I don't expect it to.


This is why I mentioned project Valhalla. I don't expect the language to actually do this, but it is nice to fantasize.

Thinking back on it, for it to work as a primitive it would have to be a fixed-size data type, which precludes an internal representation of UTF-8. Still, an unsigned 32-bit data type that is range-checked by the compiler would have been preferable to int.
 
Tim Holloway
Saloon Keeper
Well, as I said, Java prefers Unicode internally, but the ecosystem is geared towards being equal to all character sets.

I missed what you meant about the charAt() function, though. To me, charAt() is supposed to return the character at position "index" in the String, and that's what Java does, unless I missed something. Same thing in C, although the actual operation would depend on whether you compiled with wide-character mode or legacy character mode. Since a java.lang.String is always, by definition, a Unicode string, charAt's job seems pretty simple, even if String did some internal space-packing, as long as that was transparent to API users.

I'll split the difference on codepoint values. Putting a range limit on them means potentially limiting the size of codespaces. Unicode doesn't even take the full potential 16-bit range itself, as I recall, but there's always the chance that visitors might descend from Alpha Centauri with a 3-million-glyph character set, and if so, the current setup could easily adapt, whereas a hard-limited space could not. But since unsigned integers are not supported in Java, you can also easily define illegal (negative) codepoint values, and that's unfortunate.
 
Mike Simmons
Master Rancher

Tim Holloway wrote:
I'll split the difference on codepoint values. Putting a range limit on them means potentially limiting the size of codespaces. Unicode doesn't even take the full potential 16-bit range itself as I recall



Tim, you might want to check out BabelStone : How many Unicode characters are there ?

Unicode 2.0 exceeded the 16-bit range back in 1996.  Under that standard there were 1,114,112 possible code points, with 178,500 actually defined.  By now with Unicode 13.0 they're up to 283,506 code points defined.
 
Tim Holloway
Saloon Keeper
I'm not surprised. But that does make the argument for range-checking more cogent. If even Unicode itself cannot stay finitely bounded, how can we realistically set an arbitrary limit on codepoint values? Where are the Klingon code points, anyway? I went looking for them the other day and the closest I ended up with was Linear A.
 
Paul Clapham
Marshal

Wikipedia wrote:The Unicode Technical Committee rejected the Klingon proposal in May 2001 on the grounds that research showed almost no use of the script for communication, and the vast majority of the people who did use Klingon employed the Latin alphabet by preference.

 
Mike Simmons
Master Rancher
Party poopers. :p. More info on Klingon from unicode.org
 
Mike Simmons
Master Rancher
Unicode has previously revised its own standard to allow for more possible code points, in Unicode 2.0.  I'm not sure if they have a backup plan for what to do if they grow past the current limit of 1,114,112.

I do agree that setting an upper limit in Java seems unnecessary and potentially counterproductive.  I wouldn't mind excluding negative numbers though.

Ideally (for me), a proper modern String class would have charAt() return a code point as an unsigned int, and there would be no legacy method that returned a char.  There would be no char type.  The existing chars() method would be gone and the existing codePoints() method would be renamed chars() for brevity.  Unfortunately though, we still have methods like the current charAt() and chars() that encourage people to mishandle extended-range code points.  But, that's where we are.
 
Campbell Ritchie
Marshal

Mike Simmons wrote:. . . previously revised its own standard . . .  I'm not sure if they have a backup plan . . . .  I wouldn't mind excluding negative numbers though.

They seemed to have a backup plan for extending Unicode 1, but that took 0x0800 codes out of use (0xd800…0xdfff inclusive). Since there are unused codes in many of the tables, that reduces the available total even more. That Klingon proposal, for example, would have used 0x30 (48 decimal) codes to store 0x25 (37 decimal) characters.

. . . we still have methods like the current charAt() . . . But, that's where we are.

Backwards compatibility. Why do we have datatypes like float, char, and short in the first place? Do they have any use nowadays? Are they only there for completeness' sake, or because they reflect the gamut of datatypes in C? And why didn't they ever implement the unsigned keyword? Is there anything you can do with chars that you can't do just as well with ints, if you confine yourself to the codePointAt(int) method when you use Strings? Backwards compatibility to the days when 16‑bit computers were still in use and a 1GB hard drive cost 50× what a 1TB hard drive costs nowadays means they have to maintain charAt() and similar.
Isn't 0x0010_ffff a pretty number in decimal! If you ever need code points > 0x10_ffff, won't that mess up the UTF‑16 encoding? But they managed to alter the format of the backing array for Strings in Java 8, so it should be possible to make analogous changes if Unicode ever strays beyond U+10FFFF. Excluding negative numbers would still only allow for about 2,000,000,000 different glyphs, and we shall doubtless need space for more soon enough.
 
Stephan van Hulst
Saloon Keeper

Campbell Ritchie wrote:Backwards compatibility. Why do we have datatypes like float, char, and short in the first place? Do they have any use nowadays?


To be fair, these data types probably get a lot of use in embedded systems and devices that run the Java ME platform. Some of these devices have limited storage or memory capacity, or have processors that don't deal well with 32-bit integers or double-precision floating-point values. This is not just backwards compatibility; some of these devices are newly developed.

Sometimes I long for the Pascal days where you could just specify a range of numbers that is valid for your variable, and not worry about the rest.
 
Tim Holloway
Saloon Keeper

Mike Simmons wrote:Ideally (for me), a proper modern String class would have charAt() return a code point as an unsigned int, and there would be no legacy method that returned a char.  There would be no char type.


Why? The whole point of characters is to avoid all the messiness that I recounted - from personal experience - from FORTRAN. What you're advocating isn't charAt(), but codePointAt(), which is trivially done in Java by casting the charAt() return value to an int.

As I've hinted more than once in recent days, the legacy interchangeability between "byte" (signed), integer, and character in Java is actually more free than I would like for strict type accountability anyway.

Incidentally, I used to tend the Java ME forum. It's been essentially dead for probably a decade. Most devices these days are either like the Raspberry Pi, which has more than enough resources to run full Java SE/EE, or they use AVRs, which are too minimalist to run a JVM at all. Parts pricing being what it is at the moment, there's just no call for anything in between.

If you loved Pascal, you really should try Ada. It's built on Pascal-style syntax, but it's the ultimate language for defining custom types and type safety. It was designed to provide a common fool-proof language for the bulk of USA IT services, especially the military (where each branch had its own set of custom languages for years), with special emphasis on bug-proofing. Its big problem, I think, was that it was ahead of its time and would utterly trash even a mid-sized IBM mainframe just to compile - though in actuality it probably needed not much more in the way of resources than Java, which came out shortly afterwards, but on more powerful processors. There's an open-source Ada implementation called GNAT. You can even write GUI apps in Ada using GTK.
 
Mike Simmons
Master Rancher

Tim Holloway wrote:

Mike Simmons wrote:Ideally (for me), a proper modern String class would have charAt() return a code point as an unsigned int, and there would be no legacy method that returned a char.  There would be no char type.


Why? The whole point of characters is to avoid all the messiness that I recounted - from personal experience - from FORTRAN. What you're advocating isn't charAt(), but codePointAt(), which is trivially done in Java by casting the charAt() return value to an int.



Wait, what?  No, those aren't the same.  The codePointAt() method can return something completely different than charAt().  And when that happens, it's usually an indication that charAt() is not what you should be using to understand the data.  What you propose is like looking at UTF-8 data and treating each byte as a character.  It works fine until you encounter more exotic characters outside the range you expected.


Consider strings like "Hello 🌐" and "👨🏼‍👩🏽‍👧🏾‍👦🏿".  Hopefully your system renders those nicely by default; my Mac shows the first with a nice blue-and-white "globe" icon, and the second is a series of four faces.  We can look at them using Java methods:  
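Roughly like this (the original listing isn't shown, but it was presumably along these lines):

    String hello = "Hello 🌐";
    System.out.println((int) hello.charAt(6));                        // 55356  (0xD83C, a high surrogate)
    System.out.println(hello.codePointAt(6));                         // 127760 (0x1F310, GLOBE WITH MERIDIANS)

    String family = "👨🏼‍👩🏽‍👧🏾‍👦🏿";
    System.out.println(family.length());                              // 19 chars
    System.out.println(family.codePointCount(0, family.length()));    // 11 code points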

   
Running this, we can see that for "Hello 🌐", charAt(6) returns 55356, while codePointAt(6) returns 127760.  That's not a cast to int or char - that's because codePointAt(6) is actually looking at the pair of chars at positions 6 and 7, and resolving them to a single code point, 127760.

The second example is much more complex - four face glyphs are represented as 19 chars, which represent 11 code points.  So here, even code points do not exactly match our notion of "characters" as displayed.  But they're at least closer to what we mean, I think.  I stole the four faces from the beginning of the previously-cited Babelstone article; there's more discussion there.
 
Mike Simmons
Master Rancher
@Campbell, re: "backwards compatibility" - yes, that certainly is the case.  I think Stephan and I are envisioning how we would like things to be if they were designed today, while recognizing that, sure, there are historical reasons why that's not the case.  As well as performance issues not yet resolved, though they could be in the future.
 
Campbell Ritchie
Marshal
Yes. Doesn't it look strange when you try to combine backwards‑compatible code and new‑fangled code to cope with today's situation?
Do you get 19 chars for 11 code points in UTF‑8 or UTF‑16? What are the hex values of those two numbers? I can't read decimal.
 
Tim Holloway
Saloon Keeper
OK, I gotcha. I thought that they'd done a better job than that. Here I've been talking about transparently packing Unicode characters into memory-efficient storage and in fact, Sun did the exact opposite.

Yuck.
 
Campbell Ritchie
Marshal
Maybe if I read MS' code, it will have the hex values in…
 
Tim Holloway
Saloon Keeper

Campbell Ritchie wrote:Maybe if I read MS' code, it will have the hex values in…



\uD83C/55356 puts the character into the surrogate space. Meaning that java.lang.String is not truly a string; it's a Unicode UTF-16 sequence, and charAt() is returning the value at the indicated index in an array of 16-bit elements. False advertising, and not at all in keeping with Java's general focus on abstraction.

Apparently JavaScript has taken steps to address that, but Java is stuck with what we've got.
 
Mike Simmons
Master Rancher

Campbell Ritchie wrote:Do you get 19 chars for 11 code points in UTF‑8 or UTF‑16? What are the hex values of those two numbers? I can't read decimal


It's 19 chars as Java chars.  As bytes, that apparently converts to 40 bytes under UTF-16 (I was expecting 38; not sure why 2 more) or 41 bytes under UTF-8.
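Something along these lines, at any rate:

    String family = "👨🏼‍👩🏽‍👧🏾‍👦🏿";
    System.out.println(family.length());                                                    // 19 chars
    System.out.println(family.getBytes(java.nio.charset.StandardCharsets.UTF_16).length);   // 40 bytes
    System.out.println(family.getBytes(java.nio.charset.StandardCharsets.UTF_8).length);    // 41 bytes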

 
Mike Simmons
Master Rancher
@Tim, yeah it's an annoying situation.  I think Java's approach made sense when they were creating it, under the assumption that all of Unicode fit nicely in 16 bits, and no one would ever need more.  Which proved wrong within a few years of Java's release, and Java's String/char design immediately started showing its warts.  Which is a good example of your point that it's a bad idea to put any maximum on the code point space.  Of course, Integer.MAX_VALUE is a maximum too... but hopefully we'll be long gone before that creates a problem.
 
Tim Holloway
Saloon Keeper
It's actually worse than that. It implies a brute-force method of storage.

Consider: the bulk of Western text strings can be represented by the bottom 256 codepoints of Unicode. I could pack them into bytes, rather than halfwords, typically saving a lot of space and only a small penalty for access time.

I could extend this mechanism to have a "run header", based on the concept that in many cases you might have primarily one segment, with rare outliers. Here I can continue the original space optimisation, but trade some access time for the ability to directly access the outliers, even if they required 16 or even 24 bits, because I wouldn't have to scan from the beginning to randomly find a character without being thrown off by surrogates. Variants on this theme should also be doable for the ideographic segments and alternative alphabets (such as Cyrillic). You can do all this with relatively little metadata in most cases.

Except that Java can't. Because they didn't. If Java had been invented back when storage was more expensive, maybe history would have been different. Then again, back then, Unicode wasn't a thing.

In the Real World, I guess you could invent a MetaString class that did all this, but there are limits to how tightly it could integrate into the existing universe.
 
Paul Clapham
Marshal

Tim Holloway wrote:Consider: the bulk of Western text strings can be represented by the bottom 256 codepoints of Unicode. I could pack them into bytes, rather than halfwords, typically saving a lot of space and only a small penalty for access time.



The String class now does that. When it sees that the string value can be encoded with only 8-bit characters ("Latin1") then it uses a byte array internally. Otherwise ("UTF-16") it uses a char array internally.

Looks like that happened in Java 9; I tracked down the JEP here... and I see I didn't quite describe it correctly. For Latin-1 it uses a byte array, and for UTF-16 it uses a byte array with twice as many bytes.
 
Stephan van Hulst
Saloon Keeper

Mike Simmons wrote:It's 19 chars as Java chars.  As bytes, that apparently converts to 40 bytes under UTF-16 (I was expecting 38; not sure why 2 more) or 41 bytes under UTF-8.


It probably emits a BOM at the start for UTF_16. If you want to omit the BOM, you have to use UTF_16BE or UTF_16LE.
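Easy enough to check:

    String s = "🌐";                                                                      // one code point, two chars
    System.out.println(s.getBytes(java.nio.charset.StandardCharsets.UTF_16).length);     // 6: two-byte BOM + surrogate pair
    System.out.println(s.getBytes(java.nio.charset.StandardCharsets.UTF_16BE).length);   // 4: no BOM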
 
Stephan van Hulst
Saloon Keeper

Tim Holloway wrote:Consider: the bulk of Western text strings can be represented by the bottom 256 codepoints of Unicode. I could pack them into bytes, rather than halfwords, typically saving a lot of space and only a small penalty for access time.

I could extend this mechanism to have a "run header", based on the concept that in many cases you might have primarily one segment, with rare outliers. Here I can continue the original space optimisation, but trade some access time for the ability to directly access the outliers, even if they required 16 or even 24 bits. Because I wouldn't have to scan from the beginning to randomly find a character without being thrown off by surrogates.


Soooo, you're talking about UTF-8?

Wait, you mean that your data structure would consist of multiple runs of text, with different subsequent encodings? I guess that would work, but I think that would add a layer of complexity that is not needed. In general, you don't want to perform random access on characters in a string.

If you have some really weird use case where you do, you're probably better off converting the string to an array of characters first. By characters I mean proper characters, not chars or ints.
 
Campbell Ritchie
Marshal

Tim Holloway wrote:. . . the bulk of Western text strings can be represented by the bottom 256 codepoints of Unicode. I could pack them into bytes . . . you could invent a MetaString class . . .

I think they did something like that in Java 8. I also notice Paul C has already told us that more accurately.

Yes, we write in English and usually confine ourselves to the bottom 256 (or really 128) code points. But English isn't necessarily the most widely used language in the world; Chinese is the most widely used first language. Most Asian writing can't be coded with ASCII or similar. Indeed, most European writing uses characters outwith the range of ASCII. It is fortuitous that ASCII and Unicode were developed in an English‑speaking country. It might seem unlucky that Java® chose UTF‑16 as a default encoding, but Gosling didn't expect Unicode to expand. UTF‑32 would have been wasteful of space and UTF‑8 would have required encoding and decoding, so UTF‑16 sounded like a good compromise. It also made charAt() and toCharArray() very easy to implement. Maybe those latter two methods should be deprecated and new code should use code point methods throughout. Don't know.
 
Stephan van Hulst
Saloon Keeper

Campbell Ritchie wrote:But English isn't necessarily the most widely used language in the World. Chinese is the most widely used first language. Most Asian writing can't be coded with ASCII or similar.


This is a flawed argument though. While ASCII is completely out, encodings that favor the Latin alphabet over Han ideographs (like UTF-8) are still more useful in countries like China because a sizable portion of information exchange is performed using the Latin alphabet. Consider a Chinese Wikipedia article. While the flat text consists of ideographs, the HTML is Latin, JavaScript is Latin, CSS is Latin, XML is Latin, the HTTP protocol is Latin, etc.

I also like the argument that an ideograph carries more information than a Latin letter, so it is justified that it requires more bytes.

If your goal is to store a large chunk of Chinese text without markup, don't worry about the encoding to use. Worry about a good compression algorithm.
 