regular expressions on Strings

Andrew Carney
Ranch Hand

Joined: Oct 17, 2006
Posts: 96
Hello,

Suppose I have a String that always looks like this:
<SomeText>,<SomeText>,<SomeText>,<SomeText>,....
I would like to get all the text from the 4th "," from the end up to the end of the String.
I know how to do it using String manipulation (substring, replace, etc.), but I am looking for a more elegant way using regular expressions. Can anyone suggest a solution?

[ December 26, 2006: Message edited by: Roy Cohen ]
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 16695
    

Well, here is one way to do it by Regex. There are probably many more...



Henry


Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
Andrew Carney
Ranch Hand

Joined: Oct 17, 2006
Posts: 96
Thank you Henry, but I think you gave me an expression which counts the "," from the beginning of the String. I need to count from the end backward. The reason is that I don't know the length of the String when I get it, but I do know that I need to take the substring that starts at the 4th "," from the end and ends at the end of the String.
Can you please help me with that?
[ December 26, 2006: Message edited by: Roy Cohen ]
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 16695
    

Originally posted by Roy Cohen:
Thank you Henry, but I think you gave me an expression which counts the "," from the beginning of the String. I need to count from the end backward.
Can you please help me with that?


Not sure what you are asking for... Please provide a *before* and *after* string as an example.

Henry
Andrew Carney
Ranch Hand

Joined: Oct 17, 2006
Posts: 96
Here is before: ...-1,0,1,2,3,4,5,6,7,8,9,10
Here is after: 6,7,8,9,10
[ December 26, 2006: Message edited by: Roy Cohen ]
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 16695
    

Oh, I see...



BTW, hope this topic will give you some incentive to learn regex. As you can see, it is quite powerful.

Henry
Andrew Carney
Ranch Hand

Joined: Oct 17, 2006
Posts: 96
It is working, thanks a lot!
I know regexp is very powerful, which is why I wanted to use it, but I am only familiar with its basics.
Can you please explain to me the logic behind what you did? It was too fast to grasp.

[ December 26, 2006: Message edited by: Roy Cohen ]
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 16695
    

Originally posted by Roy Cohen:
It is working, thanks a lot!
I know regexp is very powerful, which is why I wanted to use it, but I am only familiar with its basics.
Can you please explain to me the logic behind what you did? It was too fast to grasp.


Hmmm... not really. Regex is too complex to explain in a few paragraphs. You definitely need to pick up a good book (or website) on the topic. But the main points here to learn are ...

1. The "{0,4}" is a quantifier that greedily matches at most 4 CSV fields. The pattern for a single CSV field is defined just before it. The "$" anchors the match to the end of the string.

2. The "()" are regex groupings. The grouping is there so that the last 4 CSV fields can be extracted.

Now that I think about it, I don't believe the first grouping was even necessary, as it will be thrown away anyway.

Henry
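
For readers following along, here is a minimal sketch of the points above. It assumes the pattern is the one quoted later in this thread (`.*?((?:[^,]*,?){0,4})$`); the class name `LastFields`, the method name `lastFour`, and the sample input are made up for illustration.

```java
public class LastFields {
    // Pattern as quoted later in the thread: a reluctant prefix, then
    // capture group 1 holding at most 4 comma-separated fields, anchored by "$".
    static final String LAST_FOUR = ".*?((?:[^,]*,?){0,4})$";

    static String lastFour(String csv) {
        // The match spans the whole input and is replaced by group 1 only.
        return csv.replaceFirst(LAST_FOUR, "$1");
    }

    public static void main(String[] args) {
        System.out.println(lastFour("a,b,c,d,e,f")); // prints c,d,e,f
    }
}
```

The "$1" in the replacement refers to the outer pair of parentheses, which is why only the trailing fields survive.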
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
What is the last question mark (the one after the colon) good for?


The soul is dyed the color of its thoughts. Think only on those things that are in line with your principles and can bear the light of day. The content of your character is your choice. Day by day, what you do is who you become. Your integrity is your destiny - it is the light that guides your way. - Heraclitus
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 16695
    

Originally posted by Ilja Preuss:
What is the last question mark (the one after the colon) good for?


Not sure which question mark you are referring to... but...



The two "?:" are used to disable capturing for those groups, as they won't be extracted.

The ".*?" is for reluctant matching. This is needed because this part is for the left-overs. We want to make sure that the last four fields go to group one.

The ",?" is so that the comma is optional. Since the matcher before it is greedy, this is to take care of the last field, which doesn't end in a comma.

And BTW, I believe this can be further optimized to...



Henry
[ December 26, 2006: Message edited by: Henry Wong ]
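
To see why the reluctant ".*?" matters, the following sketch compares it with a greedy ".*" prefix. It assumes the optimized pattern is the one quoted later in the thread; the class and method names are hypothetical.

```java
public class ReluctantDemo {
    static String reluctant(String csv) {
        // ".*?" takes as little as it can, so group 1 can claim the trailing fields.
        return csv.replaceFirst(".*?((?:[^,]*,?){0,4})$", "$1");
    }

    static String greedy(String csv) {
        // ".*" consumes the whole input first; group 1 is then satisfied by an empty match.
        return csv.replaceFirst(".*((?:[^,]*,?){0,4})$", "$1");
    }

    public static void main(String[] args) {
        String input = "a,b,c,d,e,f";
        System.out.println("reluctant: [" + reluctant(input) + "]"); // prints reluctant: [c,d,e,f]
        System.out.println("greedy:    [" + greedy(input) + "]");    // prints greedy:    []
    }
}
```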
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18670
One thing to consider here: Roy, is it possible that the "SomeText" sections might contain additional commas, which are not to be counted as they are within < > sections? If this is possible, then it will complicate the answer somewhat. If not, then it's preferable to avoid this complexity, and the current answer is fine. Only Roy knows what the input is really like here.


"I'm not back." - Bill Harding, Twister
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Originally posted by Henry Wong:
The ",?" is so that the comma is optional. Since the matcher before it is greedy, this is to take care of the last field, which doesn't end in a comma.


That's the one I meant - comma, not colon...

Thanks for the explanation, now I understand...
Andrew Carney
Ranch Hand

Joined: Oct 17, 2006
Posts: 96
Hey Jim,

The <SomeText> never includes a comma inside it, so Henry's solution is a good one.
Andrew Carney
Ranch Hand

Joined: Oct 17, 2006
Posts: 96
By the way, the class Pattern has a good explanation of regular expressions but I agree with Henry that it needs to be further studied with a good comprehensive book.
Henry - Thank you for taking the time not only to answer but to explain as well. Well done!
Andrew Carney
Ranch Hand

Joined: Oct 17, 2006
Posts: 96
BTW Henry,

I think instead of:
String result = str.replaceFirst(".*?((?:[^,]*,?){0,4})$", "$1");

I can use:
String result = str.replaceFirst(".*?((?:[^,]*,?){4})$", "$1");

Since I need exactly 4 there.

And even if I remove the grouping:
String result = str.replaceFirst(".*?(([^,]*,?){4})$", "$1");

The results are the same...


Roy
[ December 27, 2006: Message edited by: Roy Cohen ]
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 16695
    

There is a very subtle difference between "{4}" and "{0,4}". If you are sure that you will never pass a string with fewer than 4 fields, then fine, there is no difference.

If you do pass a string with fewer than 4 fields, the first case will return nothing, while the second case will return the original string. My thinking was that returning the original string was better than returning nothing.


And even if I remove the grouping:
String result = str.replaceFirst(".*?(([^,]*,?){4})$", "$1");

The results are the same...


True... But why would you want to do that? This will cause the regex to create a group number 2 for you to extract. This group exists only so that you can apply the "{}" quantifier to it -- there is no reason to remove the non-capturing nature of the group.

Henry
[ December 27, 2006: Message edited by: Henry Wong ]
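
A small sketch for comparing the two quantifiers side by side on long and short inputs. The class and method names are made up for illustration. Because the per-field sub-pattern `[^,]*,?` can also match an empty string, the behaviour on short inputs is worth checking by running the code rather than assuming it.

```java
public class QuantifierCompare {
    static String atMostFour(String csv) {
        // "{0,4}": between zero and four repetitions of the field sub-pattern.
        return csv.replaceFirst(".*?((?:[^,]*,?){0,4})$", "$1");
    }

    static String exactlyFour(String csv) {
        // "{4}": exactly four repetitions of the field sub-pattern.
        return csv.replaceFirst(".*?((?:[^,]*,?){4})$", "$1");
    }

    public static void main(String[] args) {
        // One input with more than four fields, one with fewer.
        for (String input : new String[] {"a,b,c,d,e,f", "a,b"}) {
            System.out.println(input + " {0,4} -> [" + atMostFour(input) + "]");
            System.out.println(input + " {4}   -> [" + exactlyFour(input) + "]");
        }
    }
}
```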
Andrew Carney
Ranch Hand

Joined: Oct 17, 2006
Posts: 96
Henry,

Thank you again for the detailed response.
I didn't quite understand the explanation regarding the second point, about "?:".
Can you please elaborate on this and tell me which part of the regexp it applies to?

Roy
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 16695
    

Originally posted by Roy Cohen:

Thank you again for the detailed response.
I didn't quite understand the explanation regarding the second point, about "?:".
Can you please elaborate on this and tell me which part of the regexp it applies to?


When you use regex groups, besides letting you apply quantifiers to the whole group, they also capture what they match. If you look at the second parameter, "$1", this specifies group 1 -- meaning replace the match with group one, the last four fields.

The other group is used to specify a single field for the "{}" quantifier. There is *no* need to capture this group, as we don't use "$2" in the second parameter (or \\2 in the regex itself), so we turn off capturing for this group with "?:".

Basically, when you removed the "?:", you did not remove the group; you removed the part that tells the regex not to capture for the group. (Just because the regex is shorter doesn't mean it does less work. Some constructs are there to help the regex engine work more efficiently.)

Henry
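
The difference between the two variants can be made visible with `Matcher.groupCount()`, which reports how many capturing groups a pattern defines. This is only an illustrative sketch; the class name `GroupCountDemo` and the helper `groups` are hypothetical.

```java
import java.util.regex.Pattern;

public class GroupCountDemo {
    static int groups(String regex) {
        // groupCount() counts capturing groups only; "(?:...)" groups are not included.
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groups(".*?((?:[^,]*,?){4})$")); // prints 1
        System.out.println(groups(".*?(([^,]*,?){4})$"));   // prints 2
    }
}
```

Both patterns yield the same "$1", but the second one also records a group 2 that is never used.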
Andrew Carney
Ranch Hand

Joined: Oct 17, 2006
Posts: 96
Now that you are putting it this way, I totally agree with you.
Per your advice, I have returned the ?: to my code, thank you!
 