
Regex to parse arguments

Pat Farrell

I'm working on parsing a string from an RFC, and I can't get my regex to work. So I've written a small Java program to test. I don't understand the results, so I can't figure out what I'm doing wrong.

The applicable section deals with a "type=" string.

The regex that I'm using is:

The specs are that there can be either a series of type=X separated by semicolons,
type=X;type=Y;type=Z
or you can have a series of arguments,
type=X,Y,Z
where the X values are keywords

It seems to work fine for the "type=X;type=Y" model.
However, the output doesn't show a proper greedy match for a series of keywords separated by commas, such as



Thanks
pat
Henry Wong


Unfortunately, I think you are confusing how regex groups work. Group 1 is always the first parenthesis, group 2 is always the second parenthesis, and so on.

For example, let's say your pattern is "(hello)*". You can match a long string of 100 hello strings, but in terms of the number of groups, there will only be one group, for the one parenthesis. And its value will be whatever the last repetition of the subgroup matched.

Henry


Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
Henry Wong

type=CELL,pref,msg:(703) 555-8914
gc: 1 = CELL
gc: 2 = ,msg
gc: 3 = msg
gc: 4 = null
gc: 5 = null
gc: 6 = null
gc: 7 = null


So, the first match is CELL, which is the first paren. The second is ",msg", which is the latest match using the second paren (the earlier match of ",pref" is lost). The third is "msg", which is the latest match using the third paren (the earlier match of "pref" is lost). All the rest are null because there were no successful sub-matches for parens 4 through 7.
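
The original pattern is not shown in the thread, so the following is only a sketch using a hypothetical, much simpler pattern with three groups. Under that assumption it reproduces the first three lines of the output above: the repeated groups keep only their last repetition.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroupDemo {
    // Hypothetical pattern (NOT the one from the original post):
    //   group 1 = first keyword
    //   group 2 = ",keyword", repeated zero or more times
    //   group 3 = the keyword inside group 2
    static final Pattern TYPE = Pattern.compile("type=(\\w+)(,(\\w+))*");

    // Returns groups 0..groupCount() for a match at the start of input,
    // or an empty array if the input does not start with a match.
    static String[] groups(String input) {
        Matcher m = TYPE.matcher(input);
        if (!m.lookingAt()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups("type=CELL,pref,msg:(703) 555-8914");
        for (int i = 0; i < g.length; i++) {
            System.out.println("gc: " + i + " = " + g[i]);
        }
    }
}
```

Even though the second paren matched twice (",pref" and then ",msg"), the Matcher still reports exactly three groups, and ",pref" is not retrievable afterwards.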

Henry
Pat Farrell

[quote=Henry Wong]Unfortunately, I think you are confusing how regex groups work. Group 1 is always the first parenthesis, group 2 is always the second parenthesis, and so on.[/quote]

Wouldn't be the first time. My understanding comes from my 40-year-old study of BNF and formal languages; I've not done much serious pattern matching with regex in any language.

[quote=Henry Wong]For example, let's say your pattern is "(hello)*". You can match a long string of 100 hello strings, but in terms of the number of groups, there will only be one group, for the one parenthesis. And its value will be whatever the last repetition of the subgroup matched.[/quote]

Do you not get any indication that you matched "hello" vs "hellohellohello"? Both meet the rule.

Do extra parens help?

So if the term is (foo|baz)*, is my understanding correct that foobazbazfoo is not matched?
I.e., foo or baz, repeated as many times as you want?

Henry Wong

Pat Farrell wrote:So if the term is (foo|baz)*, is my understanding correct that foobazbazfoo is not matched?
I.e., foo or baz, repeated as many times as you want?


In this case, it does match, but the result is probably not what you are expecting.

Group zero (which hasn't been discussed yet) is the true match of the whole regex, and will match "foobazbazfoo". Group 1 is the first subgroup (whatever the first paren matches). It matches 4 times during this match, and ends up holding the last submatch, which is "foo".
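
A minimal sketch of the difference between group zero and group 1 for this pattern. The class and method names are made up for illustration; the start offset is included to show that group 1 refers to the last "foo", not the first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupZeroDemo {
    // Returns {group 0, group 1, start offset of group 1} when the whole
    // input matches (foo|baz)*, or null when it does not match.
    static String[] describe(String input) {
        Matcher m = Pattern.compile("(foo|baz)*").matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1), String.valueOf(m.start(1)) };
    }

    public static void main(String[] args) {
        String[] r = describe("foobazbazfoo");
        System.out.println("group 0 = " + r[0]);
        System.out.println("group 1 = " + r[1] + " (starts at offset " + r[2] + ")");
    }
}
```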

Pat Farrell wrote:Do you not get any indication that you matched "hello" vs "hellohellohello"? Both meet the rule.


Well, group zero is different. But you probably mean how would you deal with each "hello". In general, the regex is changed so that find() will return the smaller portion -- probably just a "hello" with a lookbehind or lookahead, to make sure that it is attached to the previous hello, etc. (EDIT: it's probably easier to extract the chain of hellos first, and then use regex again on the chain)
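
As a rough illustration of making find() return one "hello" at a time while still requiring each one to be attached to the previous one: instead of a lookbehind, this sketch swaps in Java's \G boundary matcher, which matches at the end of the previous match (or at the start of the input on the first attempt). It assumes the chain begins at the start of the input; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChainedFindDemo {
    // Collects the start offset of every "hello" in the unbroken chain
    // at the beginning of the input.
    static List<Integer> helloOffsets(String input) {
        // \G pins each find() attempt to where the previous match ended,
        // so a "hello" that is separated from the chain is not returned.
        Matcher m = Pattern.compile("\\Ghello").matcher(input);
        List<Integer> offsets = new ArrayList<>();
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        // The trailing "hello" after " world " is not part of the chain.
        System.out.println(helloOffsets("hellohellohello world hello"));
    }
}
```

Each call to find() yields one repetition, so nothing is overwritten the way a repeated capturing group is.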

Henry
Pat Farrell

Henry Wong wrote:Group zero (which hasn't been discussed yet) is the true match of the whole regex

Thanks Henry.
I've been playing around with it, and there seems to be no way to get the individual values of the earlier parts matched by the
(foo|baz)*
Getting the last one is easy.

Looks like I'll need to use one regex to identify the substring that matches the final BNF, and then use another to parse/split it into pieces.
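
A sketch of that two-step approach for the "type=" forms described at the top of the thread. The keyword character set ([A-Za-z0-9-]) and all class and method names are assumptions for illustration; the RFC's actual grammar is not quoted in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TypeParamParser {
    // Assumed keyword syntax; adjust to the real grammar.
    private static final String KEYWORD = "[A-Za-z0-9-]+";
    // One "type=X" or "type=X,Y,Z" item.
    private static final String ITEM = "type=" + KEYWORD + "(?:," + KEYWORD + ")*";
    // Step 1 pattern: one or more items separated by semicolons.
    private static final Pattern SECTION =
            Pattern.compile(ITEM + "(?:;" + ITEM + ")*", Pattern.CASE_INSENSITIVE);

    // Returns every keyword from the first type section found in the line.
    static List<String> parseTypes(String line) {
        List<String> types = new ArrayList<>();
        Matcher m = SECTION.matcher(line);
        if (!m.find()) {
            return types; // no type section present
        }
        // Step 2: split the matched substring into items, then into keywords.
        for (String item : m.group().split(";")) {
            String values = item.substring(item.indexOf('=') + 1);
            for (String value : values.split(",")) {
                types.add(value);
            }
        }
        return types;
    }

    public static void main(String[] args) {
        System.out.println(parseTypes("type=CELL,pref,msg:(703) 555-8914"));
        System.out.println(parseTypes("type=X;type=Y;type=Z"));
    }
}
```

The first regex only decides where the section starts and ends, and uses non-capturing groups because their values would be overwritten anyway; the plain split() calls then recover every keyword.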

Where is snobol when we need it?
 