SPLIT METHOD IN STRING CLASS

anil kumar
Ranch Hand

Joined: Feb 23, 2007
Posts: 447
I could not understand the split method in String properly. I have looked through this forum and found this example, but I could not understand the explanation.
1) String str = " apples";
String s[] = str.split("\\w*");
for (String i:s)
System.out.println("Token" + i + "Token");

Output is :
TokenToken
Token Token

2) String str = "apples";
String s[] = str.split("\\w*");
for (String i:s)
System.out.println("Token" + i + "Token");
No Output

3) String str = "apples ";
String s[] = str.split("\\w*");
for (String i:s)
System.out.println("Token" + i + "Token");
Output is :
TokenToken
TokenToken
Token Token

What I have understood is this.

For the first case:
space a p p l e s
I expected the split result to contain space, space,
but the output is not like that.


Please explain this to me.
gianni ipez
Ranch Hand

Joined: Jan 02, 2007
Posts: 65
I have no idea.
I thought I understood the split method, but apparently I don't either.
Ciao,
Gianni
Chandra Bhatt
Ranch Hand

Joined: Feb 28, 2007
Posts: 1707
Hi Anil,

Let us see one by one:


Output:
>< //beginning blank string
> < //space after that

Use of second argument of split() method:

Output:
>< //beginning blank string
> < //space after that
>< //blank string after space

Let us modify the code to understand it much better:

Output:
>< //beginning blank string
> < //space after that
>< //blank string after space
>< //blank string after "s" in apples


Your next doubt:


Output:
Nothing !!!


To get the concept quickly read the second point below!

1- Remember that by default the second argument (the limit) of the split method is 0.
2- "*" is a greedy quantifier: it matches 0 or more.
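
A minimal sketch of both points, assuming a plain comma delimiter so that only the limit changes the result. The class and method names are illustrative and not from the original post.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LimitAndGreedyDemo {
    // split(regex) behaves like split(regex, 0): trailing empty strings are dropped.
    static String[] defaultSplit(String input) {
        return input.split(",");
    }

    // A negative limit keeps every fragment, including trailing empty strings.
    static String[] keepAllSplit(String input) {
        return input.split(",", -1);
    }

    // Returns the first match of \w*; being greedy, it takes as many word characters as it can.
    static String firstGreedyMatch(String input) {
        Matcher m = Pattern.compile("\\w*").matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(defaultSplit("a,b,,")));   // [a, b]
        System.out.println(Arrays.toString(keepAllSplit("a,b,,")));   // [a, b, , ]
        System.out.println(">" + firstGreedyMatch("apples") + "<");   // >apples<
    }
}
```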



Regards,
cmbhatt
[ April 28, 2007: Message edited by: Chandra Bhatt ]

cmbhatt
anil kumar
Ranch Hand

Joined: Feb 23, 2007
Posts: 447
hi
chandra

>< //beginning blank string
> < //space after that

What is the difference between those two comments?

Please explain that.

Thanks
Anil Kumar
Chandra Bhatt
Ranch Hand

Joined: Feb 28, 2007
Posts: 1707
Originally posted by anil kumar:
hi
chandra

>< //beginning blank string
> < //space after that

What is the difference between those two comments?

Please explain that.

Thanks
Anil Kumar


"" is called a blank (empty) string; its length() is 0.


" " is a space; one space means length() is 1.

Got that?

>< //beginning blank string
> < //space after the previous blank string
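
The difference can be checked directly. This is only a small sketch; the class and the helper `wrap` are made up for illustration.

```java
class BlankVersusSpace {
    // Wraps a string in > and < so that an empty string and a space look different when printed.
    static String wrap(String s) {
        return ">" + s + "<";
    }

    public static void main(String[] args) {
        String blank = "";
        String space = " ";
        System.out.println(wrap(blank) + " length = " + blank.length()); // >< length = 0
        System.out.println(wrap(space) + " length = " + space.length()); // > < length = 1
    }
}
```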


Regards,
cmbhatt
[ April 28, 2007: Message edited by: Chandra Bhatt ]
anil kumar
Ranch Hand

Joined: Feb 23, 2007
Posts: 447
Ok
i got that but see here
3) String str = "apples ";
String s[] = str.split("\\w*");
for (String i:s)
System.out.println("Token" + i + "Token");
Output is :
TokenToken
TokenToken
Token Token

I expected the output to be
TokenToken//blank after apples
Token Token//space
TokenToken//blank
but the actual output is different.
Why?

Thanks
Anil Kumar
Matt Russell
Ranch Hand

Joined: Aug 15, 2006
Posts: 165
It's really worth having a browse of the source code of java.util.regex.Pattern to get a clear understanding of what's going on here (String.split() calls Pattern.split()). I'll try and explain what's going on in words, but looking at the code is probably more helpful at this point -- it's attached at the end of this post (limit = 0 for the default String.split() call).

In the case of " apples".split("\\w*"), the regular expression matches three times ("" at the start of the string, "apples", and "" at the end of the string -- Chandra Bhatt's analysis above isn't quite correct); so you get three fragments added to the output: "", " ", and "". The split() algorithm then adds an extra fragment to account for the remainder from the end of the last match to the end of the string: in our case, that's just the empty string again. Finally, the algorithm prunes those two empty strings ("") from the end of the array of results -- the pruning removes all the empty strings from the end of the results up to the first non-empty string. (Note: this pruning doesn't happen if you pass in a non-zero limit parameter to the split method).

In the case of "apples".split("\\w*"), the regular expression matches twice ("apples" and "" at the end) to give fragments "" and "". Another empty string is added to account for the remainder, but all three ""'s are pruned at the end, resulting in an empty array as output.

Finally, "apples ".split("\\w*"): the regular expression matches three times, "apples", "" and "", to give fragments "", "" and " ". The empty string is again added to the end of the outputs for the remainder, but is pruned off at the final stage (and that's the only one that's pruned).
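
The three cases can be replayed with a hand-written sketch of the steps described in this post (fragments between matches, then the remainder, then pruning). This is not the JDK source; `splitAsDescribed` is a hypothetical helper. Note that Java 8 and later document that a zero-width match at the very start of the string never produces a leading empty substring, so calling `" apples".split("\\w*")` on a current JDK can differ from the output quoted in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitSketch {
    // Hypothetical helper that follows the description in the post, not the real implementation.
    static List<String> splitAsDescribed(String input, String regex) {
        List<String> fragments = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        int index = 0; // end of the previous match
        while (m.find()) {
            // Fragment between the previous match and this one.
            fragments.add(input.substring(index, m.start()));
            index = m.end();
        }
        // Remainder from the end of the last match to the end of the string.
        fragments.add(input.substring(index));
        // Prune empty strings from the end (what happens when the limit is 0).
        while (!fragments.isEmpty() && fragments.get(fragments.size() - 1).isEmpty()) {
            fragments.remove(fragments.size() - 1);
        }
        return fragments;
    }

    public static void main(String[] args) {
        for (String input : new String[] {" apples", "apples", "apples "}) {
            System.out.println("\"" + input + "\" -> " + splitAsDescribed(input, "\\w*"));
        }
    }
}
```

With this sketch, " apples" yields two fragments ("" and " "), "apples" yields none, and "apples " yields three ("", "" and " "), which is the output reported in the first post.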



Matt
Inquisition: open-source mock exam simulator for SCJP and SCWCD
Chandra Bhatt
Ranch Hand

Joined: Feb 28, 2007
Posts: 1707
What do you get from this pattern: "\\w*"?

It says 0 or more word characters (letter, digit or "_" underscore).
The metacharacter "*" is known as greedy (it says "I WANT MORE, COME ON").

CharSequence "apple ":
Can you guess how many blank strings there are in it?

BLANK STRING FINDER CODE
Try the following code:



In the same way when the CharSequence is "apple "
and the pattern is "\\w*"

1- The first point will be beginning of the "apple ", that is before "a"
2- The second point will be blank before space (" ")
3- The third point will be blank string after space
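
One way to check these points is to ask a Matcher where the pattern actually matches. A minimal sketch, with illustrative class and method names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchSpans {
    // Lists every match as "[start,end) "text"" so zero-length matches are visible.
    static List<String> spans(String input, String regex) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add("[" + m.start() + "," + m.end() + ") \"" + m.group() + "\"");
        }
        return out;
    }

    public static void main(String[] args) {
        for (String span : spans("apple ", "\\w*")) {
            System.out.println(span);
        }
    }
}
```

For "apple " this reports the match "apple" at [0,5), then an empty match at index 5 (just before the space) and another empty match at index 6 (after the space).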



----------
cmbhatt
----------
[ April 28, 2007: Message edited by: Chandra Bhatt ]
anil kumar
Ranch Hand

Joined: Feb 23, 2007
Posts: 447
Hi

Matt
--------------------------------------------------------------------------
("" at the start of the string, "apples", and "" at the end of the string -- Chandra Bhatt's analysis above isn't quite correct);
--------------------------------------------------------------------------
Can you elaborate on this, please?
At the start of " apples" there is a space, so how does this "" come about? I think it should be " ".

Thanks
Anil Kumar
Chandra Bhatt
Ranch Hand

Joined: Feb 28, 2007
Posts: 1707

Matt says,

In the case of " apples".split("\\w*"), the regular expression matches three times ("" at the start of the string, "apples", and "" at the end of the string


I find the above lines missing something...

IMHO, for " apples".split("\\w*"), the regular expression matches 0 occurrences at the very beginning of " apples" and then the space. By default split() skips the last blank string "", as the API says.

The second argument of the split() is helpful to tell the "limit".





Thanks,
cmbhatt
anil kumar
Ranch Hand

Joined: Feb 23, 2007
Posts: 447
Hi
Chandra

Trailing empty strings are removed, but that does not seem to happen here. Why?

I am talking about this case:
3) String str = "apples ";
String s[] = str.split("\\w*");
for (String i:s)
System.out.println("Token" + i + "Token");
Output is :
TokenToken
TokenToken
Token Token
[ April 28, 2007: Message edited by: anil kumar ]
Chandra Bhatt
Ranch Hand

Joined: Feb 28, 2007
Posts: 1707
Pattern and Matcher classes will tell the truth!!!



In the case of str = "apple",
the output of the above code will be
>apple<
><
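
A minimal sketch of the kind of Pattern/Matcher loop that produces this output, assuming the pattern is "\\w*" as elsewhere in the thread. The names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShowMatches {
    // Collects every match of the regex, wrapped in > and < so empty matches stay visible.
    static List<String> wrappedMatches(String input, String regex) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(">" + m.group() + "<");
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : wrappedMatches("apple", "\\w*")) {
            System.out.println(line);
        }
    }
}
```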


Got any idea???


|--------|
|cmbhatt |
|--------
Matt Russell
Ranch Hand

Joined: Aug 15, 2006
Posts: 165
Anil, the following code shows where the regular expression matches. Remember: split() outputs the bits between the matches (and before and after the first and last matches respectively), but trims empty strings from the end of the output.
anil kumar
Ranch Hand

Joined: Feb 23, 2007
Posts: 447
Hi
Matt
I have tried your program and understood it, but when I try the same thing I get different output (SEE THE SPACE BETWEEN THE TWO TOKENS).
Why?
See line1 below.
This is the only thing I have not been able to understand since morning.

3) String str = "apples ";
String s[] = str.split("\\w*");
for (String i:s)
System.out.println("Token" + i + "Token");
Output is :
TokenToken
TokenToken
Token Token ////line1

[ April 28, 2007: Message edited by: anil kumar ]
Matt Russell
Ranch Hand

Joined: Aug 15, 2006
Posts: 165
Originally posted by anil kumar:
Hi
This is the only thing i could not understood since morning

3) String str = "apples ";
String s[] = str.split("\\w*");
for (String i:s)
System.out.println("Token" + i + "Token");
Output is :
TokenToken
TokenToken
Token Token ////line1


OK, step 1: where does the regular expression match? The program I pasted above shows you:

Let's call them matches 1, 2 & 3.

Step 2: What are all bits before, between and after the matches? Well, before match 1 (i.e. "apples"), we have nothing, so output 1 = "". Between match 1 & match 2 we also have nothing, so output 2 = "". Between match 2 & match 3 we have a space, so output 3 = " ". Finally, after match 3 we have nothing, so output 4 = "". OK, so far we have:

Outputs: 1 = "", 2 = "", 3 = " " and 4 = "".

Step 3: Pruning: when called with no limit argument, split() removes all the empty strings at the end of the output, so this becomes:

Outputs: 1 = "", 2 = "" and 3 = " ".

(If you'd used str.split("\\w*", -1) instead, you'd get all of the strings without any pruning.)
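
The effect of the pruning step can be seen by calling split on "apples " with and without a limit of -1. A small sketch with illustrative names:

```java
import java.util.Arrays;

class PruningDemo {
    // Default call: limit 0, so trailing empty strings are removed.
    static String[] pruned(String s) {
        return s.split("\\w*");
    }

    // Limit -1: every fragment is kept.
    static String[] unpruned(String s) {
        return s.split("\\w*", -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(pruned("apples ")));   // three elements: "", "", " "
        System.out.println(Arrays.toString(unpruned("apples "))); // four elements: "", "", " ", ""
    }
}
```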

-- Matt
anil kumar
Ranch Hand

Joined: Feb 23, 2007
Posts: 447
Now I have understood.
Thank you, Matt and Chandra, for your valuable time and responses.

And Chandra, the first week of May starts on Tuesday, so I don't know your exam date.

But thanks, and all the best for your exam.
Meena R. Krishnan
Ranch Hand

Joined: Aug 13, 2006
Posts: 178


Results:

Test6's results:
Looking for a word char \w, greedy quantified (0 or more).

>< --> Blank before 'This'
>< --> Blank after 'This'
> < --> Space between 'This' and 'is'
>< --> Blank after 'is'
> < --> Space between 'is' and 'to'
>< --> Blank after 'to'
> < --> Space between 'to' and 'test'
>< --> Blank after 'test'
>< --> ???
Matt Russell
Ranch Hand

Joined: Aug 15, 2006
Posts: 165
Originally posted by M Krishnan:

I find it helps to view this in terms of the regular expression matches first:

If you then work out what bits of the string are before, between and after the matches, you get the same output as split(..., -1):
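
A sketch of that two-step view, assuming the input string is "This is to test" and the pattern is "\\w*". The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesThenFragments {
    // Step 1: the text of every match of the regex.
    static List<String> matches(String input, String regex) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Step 2: everything before, between and after the matches, with no pruning.
    static String[] fragments(String input, String regex) {
        return input.split(regex, -1);
    }

    public static void main(String[] args) {
        String input = "This is to test"; // assumed input
        System.out.println(matches(input, "\\w*"));
        System.out.println(Arrays.toString(fragments(input, "\\w*")));
    }
}
```

There are eight matches (each word followed by an empty match) and therefore nine fragments. The last fragment, the one marked "???" above, is the empty remainder after the final empty match at the end of the string.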

[ April 29, 2007: Message edited by: Matt Russell ]
Sasha Ruehmkorf
Ranch Hand

Joined: Mar 29, 2007
Posts: 115
Matt, thanks for your explanations, they made things much clearer for me. Still there is one very special case that I do not understand:

gives output:
><

I thought trailing empty strings are discarded, so the output should be nothing... ?
[ May 07, 2007: Message edited by: Sasha Ruehmkorf ]
sharan vasandani
Ranch Hand

Joined: Feb 22, 2007
Posts: 100
In the case of " apples".split("\\w*"), the regular expression matches three times ("" at the start of the string, "apples", and "" at the end of the string -- Chandra Bhatt's analysis above isn't quite correct); so you get three fragments added to the output: "", " ", and "".


I am unable to understand how the bolded part is matching. Please explain.

Also,
in the case of "this is to test":
"this" is the first match,
"" is the second match; I do not understand where this comes from.
sharan vasandani
Ranch Hand

Joined: Feb 22, 2007
Posts: 100
System.out.println("------Test3------");
tokens = s.split("\\S",-1); //Non-White space char
for(String ss :tokens)
{
System.out.println(">"+ss+"<");
}

System.out.println("------Test4------");
tokens = s.split("\\W",-1); //Non-word char - in this input only the spaces
for(String ss :tokens)
{
System.out.println(">"+ss+"<");
}


What are these non-whitespace and non-word characters?
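
For reference, \S matches any character that is not whitespace, and \W matches any character that is not a word character (letter, digit or underscore). A small sketch that tests single characters; the class and method names are made up for illustration.

```java
class CharClassDemo {
    // True if the one-character string is NOT whitespace.
    static boolean isNonWhitespace(String oneChar) {
        return oneChar.matches("\\S");
    }

    // True if the one-character string is NOT a letter, digit or underscore.
    static boolean isNonWord(String oneChar) {
        return oneChar.matches("\\W");
    }

    public static void main(String[] args) {
        System.out.println(isNonWhitespace("a")); // true
        System.out.println(isNonWhitespace(" ")); // false
        System.out.println(isNonWord(" "));       // true
        System.out.println(isNonWord("-"));       // true
        System.out.println(isNonWord("_"));       // false
    }
}
```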
Chandra Bhatt
Ranch Hand

Joined: Feb 28, 2007
Posts: 1707
Hi Sharan,

Blank string and space are two different things.
See this:

"apple" :- in this String literal "apple" there are six blank strings ""
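
A minimal blank-string finder, assuming an empty pattern is used to locate every zero-length position. The names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BlankStringFinder {
    // Counts the zero-length matches of the empty pattern, one per position in the input.
    static int countBlankStrings(String input) {
        Matcher m = Pattern.compile("").matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countBlankStrings("apple")); // 6: before each letter and after the last one
    }
}
```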


The discussion above concluded that non-matching trailing blank strings are chopped off by the split method unless you pass a non-zero limit as the second argument to the split method.

The latest question was regarding
"".split("x*"); that returns ><, I mean one blank string.

It is only the non-matching trailing blank string that is chopped off by the split method. What is returned here is just the leading blank string. What the pattern says is: find 0 or more occurrences of x.

I think, I may confirm this by this example:

Example #1:


Output:
><
> <

Trailing blank is chopped off.

Example #2:


Output:
><
> <
><

This is because of the second argument (Limit) we have passed to the split(...) method.


String [] sarr = " ".split("s*",-1);
Output:
><
> <
> <
><



Thanks,
[ May 08, 2007: Message edited by: Chandra Bhatt ]
sharan vasandani
Ranch Hand

Joined: Feb 22, 2007
Posts: 100
--------------------------------------------------------------------------------

String [] tokens = "".split("x*");for (String s : tokens) System.out.print(">" + s + "<");

--------------------------------------------------------------------------------


gives output:
><

I thought trailing empty strings are discarded, so the output should be nothing... ?
[ May 07, 2007: Message edited by: Sasha Ruehmkorf ]


what about this issue?
[ May 08, 2007: Message edited by: sharan vasandani ]
Chandra Bhatt
Ranch Hand

Joined: Feb 28, 2007
Posts: 1707
Hi,

I think if you carefully read the post I have just posted above, you will get that. The couple of examples I gave are for exactly that case.


Thanks,
sharan vasandani
Ranch Hand

Joined: Feb 22, 2007
Posts: 100
I am sorry, but it is not clear to me what you mean by this line:

It is only the non-matching trailing blank string that is chopped off by the split method. What is returned by this is just leading blank string.

In a previous post Matt said that all empty strings are pruned until a non-empty string is encountered. In our case there is no non-empty string, so why is it still printing "><"?
Chandra Bhatt
Ranch Hand

Joined: Feb 28, 2007
Posts: 1707


We have the pattern "x*", which says 0 or more occurrences of x. Remember that 0 occurrences will do too. Therefore split() has to return the tokens, using the pattern as a sort of delimiting sequence. I can see why the confusion arises: there is only a blank string, but it can't be discarded by split; what split returns here is, we can say, the leading string (although it is the trailing one too, which is the source of confusion).

If that blank were followed by any other characters of the string being split, only then would split have chopped the unmatched trailing blanks, as I showed in the couple of examples in my previous post.

To get all the unmatched trailing strings, pass the second parameter (the limit): negative to keep everything, or a positive n to apply the pattern at most n - 1 times.
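
A short sketch of the three kinds of limit, assuming a plain comma delimiter to keep the result easy to read. The class and method names are illustrative.

```java
import java.util.Arrays;

class LimitDemo {
    // Thin wrapper so the limit is the only thing that varies between calls.
    static String[] splitOnComma(String input, int limit) {
        return input.split(",", limit);
    }

    public static void main(String[] args) {
        String input = "a,b,,";
        System.out.println(Arrays.toString(splitOnComma(input, 0)));  // [a, b]       trailing blanks pruned
        System.out.println(Arrays.toString(splitOnComma(input, -1))); // [a, b, , ]   everything kept
        System.out.println(Arrays.toString(splitOnComma(input, 2)));  // [a, b,,]     pattern applied once
    }
}
```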



Thanks,
sharan vasandani
Ranch Hand

Joined: Feb 22, 2007
Posts: 100
Still not clear.

According to Matt:

Finally, the algorithm prunes those two empty strings ("") from the end of the array of results -- the pruning removes all the empty strings from the end of the results up to the first non-empty string

all empty strings are removed until a non-empty string is encountered.
Chandra Bhatt
Ranch Hand

Joined: Feb 28, 2007
Posts: 1707
Hi Sharan,

No issue to worry about.

Anyway, what do you think about this issue; how is this done? I think you should just experiment with the code: try several modifications, call split with a second argument, with some positive values, with -1 and so on. Then you tell me how things are happening there.

I think this is a far better way.


Keep it up!

Thanks,
sharan vasandani
Ranch Hand

Joined: Feb 22, 2007
Posts: 100
I know that passing -1 will not prune any empty strings but will print them all.

But I am confused between these two:

In the case of "apples".split("\\w*"), the regular expression matches twice
("apples" and "" at the end) to give fragments "" and "". Another empty string is added to account for the remainder, but all three ""'s are pruned at the end, resulting in an empty array as output.



String [] tokens = "".split("x*");for (String s : tokens) System.out.print(">" + s + "<");

--------------------------------------------------------------------------------


gives output:
><

I thought trailing empty strings are discarded, so the output should be nothing... ?
[ May 07, 2007: Message edited by: Sasha Ruehmkorf ]


what about this issue?
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18123
    


String [] tokens = "".split("x*");
for (String s : tokens) System.out.print(">" + s + "<");

----------------------------------------------------------


gives output:
><

I thought trailing empty strings are discarded, so the output should be nothing... ?


It looks like you may have found a bug -- or at least, an undocumented exception condition. From the source code, it looks like if there are *no* matches for the delimiter, it will just return the original string as an array of size one.

It doesn't even bother to check the limit parameter, or call the part that removes the trailing blanks.

Henry


Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18123
    

Oops, I was wrong. This exception condition is documented in the JavaDoc...

If this pattern does not match any subsequence of the input then the resulting array has just one element, namely the input sequence in string form.


It looks like if there are no matches for the split delimiter, then the limit part of split (and any side effects) is not even applied.

Henry
Matt Russell
Ranch Hand

Joined: Aug 15, 2006
Posts: 165
Hmm. With regards to "".split("x*"), this may be either a bug or just undefined behaviour -- interestingly, I get different results with Sun's libraries than with GNU Java.

This is probably not something that's tested on SCJP ;-)



With Sun's JDK, I get:


With GNU Java, I get


What I think is happening is as follows: the split() JavaDoc says that, "If this pattern does not match any subsequence of the input then the resulting array has just one element, namely the input sequence in string form." However, this is implemented in Sun's code (pasted a few messages back) by testing if the index variable == 0. That would normally indicate no matches had occurred, however, it's also the case where the string itself is empty and there is a zero-length match.

My suspicion is that this is a Sun bug, in that the spec states that trailing empty strings will be discarded.

-- Matt
[ May 08, 2007: Message edited by: Matt Russell ]
Matt Russell
Ranch Hand

Joined: Aug 15, 2006
Posts: 165
Originally posted by Henry Wong:
Oops, I was wrong. This exception condition is documented in the JavaDoc...
It looks like if there are no matches for the split delimiter, then the limit part of split (and any side effects) is not even applied.
Henry

Sure...but in the case of "".split("x*"), there is one match.
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18123
    

Originally posted by Matt Russell:

Sure...but in the case of "".split("x*"), there is one match.


Actually, no. I was referring to the matching of the delimiter, not the results. There are no x's to match.

Henry
Matt Russell
Ranch Hand

Joined: Aug 15, 2006
Posts: 165
Originally posted by Henry Wong:
Actually, no. I was referring to the matching of the delimiter, not the results. There are no x's to match.
Henry

I was referring to the matching of the delimiter too -- * matches 0 or more: so x* matches even though there are no x's to match. It's quite possible I'm being dense and missing something, though ;-)
Sasha Ruehmkorf
Ranch Hand

Joined: Mar 29, 2007
Posts: 115

It's quite possible I'm being dense and missing something, though ;-)

Don't think so. My Test-Program gives:
Pattern = x*
Matcher = ""
I found the text "" starting at index 0 and ending at index 0.

So, thank you very much for the fruitful discussion. Finally I feel like being able to predict the output of the split-method() in absolutely every case. :-)
Unfortunately I am not able to state the same for all these parse-Methods around. Lots of work still to be done...
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18123
    

Originally posted by Matt Russell:

I was referring to the matching of the delimiter too -- * matches 0 or more: so x* matches even though there are no x's to match. It's quite possible I'm being dense and missing something, though ;-)


Interesting. You are absolutely correct.

From the source code, it does look like a bug. Apparently, it is checking to see if an internal variable (index) is not changed (to determine no matches). This variable starts off as zero, and ends up as the end of the last match -- which in this case is still zero.
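
The ambiguity can be illustrated with a small sketch that tracks the same two pieces of information: how many matches occurred and where the last one ended. This is not the JDK code; `matchCountAndLastEnd` is a hypothetical helper.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndexCheckSketch {
    // Returns {number of matches, end offset of the last match (0 if there was none)}.
    static int[] matchCountAndLastEnd(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        int index = 0;
        while (m.find()) {
            count++;
            index = m.end();
        }
        return new int[] {count, index};
    }

    public static void main(String[] args) {
        int[] emptyInput = matchCountAndLastEnd("", "x*"); // one zero-length match, index stays 0
        int[] noMatch = matchCountAndLastEnd("abc", "z");  // no match at all, index stays 0
        System.out.println(emptyInput[0] + " match(es), index " + emptyInput[1]);
        System.out.println(noMatch[0] + " match(es), index " + noMatch[1]);
        System.out.println("".split("x*").length);         // 1: the input itself is returned
    }
}
```

Both inputs leave the index at 0, so a test on the index alone cannot tell "one zero-length match" apart from "no match".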

Henry
 
 