Need to tokenize a String, but I need to keep what comes between " and "

Renato Bobbio Calogero
Greenhorn

Joined: Apr 20, 2011
Posts: 18
Dear users,

I want to tokenize a String that has the following pattern:

list(groups("A")),list(groups("B")),list(groups("C"))

I need to extract only the values between the " characters.

Is there a smart way to do this without implementing extra logic to recognize the Strings after tokenization?

Thanks a lot!
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19693

Let me get this straight. You have a String list(groups("A")),list(groups("B")),list(groups("C")). You want to retrieve A, B and C. Right? Sounds like something a regular expression could easily do, with a capturing group.


SCJP 1.4 - SCJP 6 - SCWCD 5 - OCEEJBD 6
Renato Bobbio Calogero
Greenhorn

Joined: Apr 20, 2011
Posts: 18
Yes, you got the point. Regular expressions, you say... I have used them in JavaScript but never in Java. Could you give me a hint on
which API to use, and some best practices for doing that?

Thanks a lot!
Renato Bobbio Calogero
Greenhorn

Joined: Apr 20, 2011
Posts: 18
I tried with the following :



but it returns just the string ". Do you have any suggestion on which regex I should use? The strings between the " characters vary, and they do not follow a
standard pattern. I just need to get rid of the

list(groups( ...

stuff, getting only the contents and not the keywords. I would rather not implement any business logic to decide whether the values that populate the list are useless or the contents I need. Do you have any hints? Thanks ...
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19693

Your regular expression almost works. You just need to make it reluctant instead of greedy. Check out the Javadoc page of java.util.regex.Pattern for more information.
Renato Bobbio Calogero
Greenhorn

Joined: Apr 20, 2011
Posts: 18
I checked your link, but now that I have recognized a pattern in my content strings I got a bit confused.
I can always assume that my A, B, C from before begin with the string PI.

But I can't see any pattern that applies to my case: I need to catch any string starting with PI, and reject any other.

Any help is appreciated. Thanks a lot!
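
One possible way to express this requirement is sketched below. It is only an illustration under assumptions: the class name PiPrefixDemo, the method extractPiValues and the sample input are hypothetical, and the regex assumes that PI appears immediately after the opening " and that the values never contain a " themselves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PiPrefixDemo {
    // Opening quote, then a captured value that starts with PI and
    // contains no quote characters, then the closing quote.
    private static final Pattern PI_VALUE = Pattern.compile("\"(PI[^\"]*)\"");

    // Returns the quoted values that start with PI, without the quotes.
    static List<String> extractPiValues(String input) {
        List<String> values = new ArrayList<>();
        Matcher matcher = PI_VALUE.matcher(input);
        while (matcher.find()) {
            values.add(matcher.group(1)); // group 1 is the part inside ( )
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical input: two values start with PI, one does not.
        String input = "list(groups(\"PI123\")),list(groups(\"other\")),list(groups(\"PIabc\"))";
        for (String value : extractPiValues(input)) {
            System.out.println(value);
        }
    }
}
```

With this sample input, the quoted value that does not start with PI is skipped, so no separate filtering step is needed after matching.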
Renato Bobbio Calogero
Greenhorn

Joined: Apr 20, 2011
Posts: 18
Or maybe something more like this

Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19693

Let's take your original regular expression. I just now see one flaw: "\"*\"" means zero or more occurrences of " followed by a single ". That's because in regular expressions * is a meta character that applies to the previous entity, in this case the ". It doesn't work like command line wild cards where * means "any character any number of times". A quick (and incorrect) fix: "\".*\"". That dot makes the * bind to that, so the regular expression becomes a single " followed by any character any number of times followed by a single ". That looks more like it. So we test it:
Output: "A")),list(groups("B")),list(groups("C"
Not quite what we want, and the reason is simple: .* is greedy. It takes everything between the first " and the last ". What we want is to take everything between each " and the next ".
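
The greedy behaviour described above can be reproduced with a small test like the following. This is a sketch and not the code originally posted in the thread; the class name GreedyQuoteDemo and the helper findAll are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyQuoteDemo {
    // Returns every substring of input matched by regex, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group()); // group() is the entire match
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "list(groups(\"A\")),list(groups(\"B\")),list(groups(\"C\"))";
        // Greedy .* runs from the first quote to the last one,
        // so the loop prints a single long match.
        for (String match : findAll("\".*\"", input)) {
            System.out.println(match);
        }
    }
}
```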

There are two ways:
1) make the matching reluctant instead of greedy. We do this by appending a ? after the *; see also the Javadoc I pointed you to. The regex becomes "\".*?\"".
2) do not match everything, but only everything that's not a ". We can use a negated character class for that: [^"]. The regex becomes "\"[^\"]*\"".
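
A sketch that tries both variants side by side, assuming the same input string as before (the class name ReluctantQuoteDemo and the helper findAll are illustrative, not from the thread):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantQuoteDemo {
    // Returns every substring of input matched by regex, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "list(groups(\"A\")),list(groups(\"B\")),list(groups(\"C\"))";
        String reluctant = "\".*?\"";   // option 1: reluctant quantifier
        String negated = "\"[^\"]*\"";  // option 2: negated character class
        // Each regex stops at the next quote, so each finds three short matches.
        System.out.println(findAll(reluctant, input));
        System.out.println(findAll(negated, input));
    }
}
```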

Both now result in this output:
"A"
"B"
"C"

Now all you need to do is use substring to strip the quotes and get the values. Another option is to use a capturing group, by wrapping part of the regex in ( and ). You will then get a group 1 inside the matcher. Group 0 (the entire match, which is what group() returns; group() and group(0) are equivalent) is no longer relevant:
Output:
A
B
C


Note that it's important to use a loop for find(), and not a single if. That's because you can have multiple matches, and with if you only check the first one. The loop will let you check them all.
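
The capturing-group approach with a find() loop might look like the sketch below. It uses the negated character class variant; the class name CapturingGroupDemo and the method extractQuoted are illustrative and not taken from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CapturingGroupDemo {
    // The parentheses capture only the text between the quotes.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Returns the contents of every quoted value, without the quotes.
    static List<String> extractQuoted(String input) {
        List<String> values = new ArrayList<>();
        Matcher matcher = QUOTED.matcher(input);
        // A while loop visits every match; a single if would stop after the first.
        while (matcher.find()) {
            values.add(matcher.group(1)); // group(0) would still include the quotes
        }
        return values;
    }

    public static void main(String[] args) {
        String input = "list(groups(\"A\")),list(groups(\"B\")),list(groups(\"C\"))";
        for (String value : extractQuoted(input)) {
            System.out.println(value);
        }
    }
}
```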
Renato Bobbio Calogero
Greenhorn

Joined: Apr 20, 2011
Posts: 18
This is a very elegant way to do it, and it is what I wanted to achieve, but I thought there might be something more of a "no-brainer",
so I did this:



but I really appreciate having learnt how to use java.util.regex, which will be very useful in the future! Thanks a lot!
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19693

You're welcome.