
Question regarding regular expression

Samantha O'Neill
Greenhorn

Joined: Apr 15, 2003
Posts: 26
Hi all
Peter put the following regular expression in a mail a while back

(\w+)='([^'])'(?,\s*(\w+)='([^'])')*

and I had a question regarding the ? that appears in this sequence. According to the API, anything with a ? after it means that this char should appear either 0 or 1 times in the sequence, but I can't make sense of it within the context of the sequence above.
If anyone can shed any light on this for me I would really appreciate it.
Many thanks Sam
Andrew Monkhouse
author and jackaroo
Marshal Commander

Joined: Mar 28, 2003
Posts: 11460
    

Hi Sam,
I look forward to a full explanation from someone who is more knowledgeable. But my reading of the section:

is that this is the section describing what we don't want in this particular match. From the API:
Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total.

So if we are looking at an input like: A=B, C=D
The section I have cut out will match on ", C=D"
(ignore commas followed by white space followed by words followed by equals signs ....)
Regards, Andrew


The Sun Certified Java Developer Exam with J2SE 5: paper version from Amazon, PDF from Apress, Online reference: Books 24x7 Personal blog
Samantha O'Neill
Greenhorn

Joined: Apr 15, 2003
Posts: 26
Thanks Andrew - I have found the correct section in the API now.
Although I am still not clear as to the purpose of these non-capturing groups.
Are there any experts out there? I don't want to use an expression whose meaning I don't truly understand.

Many thanks Sam
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Looks like there's some sort of typo here. If you try to compile a Pattern for this (after escaping each \ to \\) you get:

Looking at the API there's no use for this syntax - the ( must be part of a group, since it's not escaped, but there's no group syntax for "(?," - it's just not there. My guess is that Peter might have been aiming for one of the other constructs like "(?<!", "(?<=", or "(?>". (Note that ',' is an un-shifted '<' on most keyboards.) Can't say much more without knowing what the pattern was intended to do...
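For anyone who wants to reproduce this, here is a minimal sketch that tries to compile the expression exactly as first posted. The class and method names are made up for illustration; the only library behaviour relied on is that Pattern.compile throws PatternSyntaxException for a construct it does not recognise.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class TypoPatternCheck {
    // The expression as first posted, with each \ escaped to \\ for a Java string literal.
    static final String AS_POSTED = "(\\w+)='([^'])'(?,\\s*(\\w+)='([^'])')*";

    /** Returns true if the regex compiles, false if Pattern rejects it. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // getDescription() gives the reason, getIndex() the approximate position.
            System.out.println("Rejected near index " + e.getIndex() + ": " + e.getDescription());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("compiles: " + compiles(AS_POSTED));
    }
}
```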
[ May 06, 2003: Message edited by: Jim Yingst ]

"I'm not back." - Bill Harding, Twister
Peter den Haan
author
Ranch Hand

Joined: Apr 20, 2000
Posts: 3252
Originally posted by Samantha O'Neill:
[...] I had a question regarding the ? that appears in this sequence. According to the API, anything with a ? after it means that this char should appear either 0 or 1 times in the sequence, but I can't make sense of it within the context of the sequence above.
No, that wouldn't make sense, would it? What I wanted was a non-capturing group (?: .... ). Switching briefly to (not-so-)formal grammar mode, a criteria string has the following syntactical structure
criteria := criterion | criterion ',' criteria
criterion := field '=' ''' value '''
field := regexp(\w+)
value := regexp([^']*)
You need a regexp (?: .... )* to implement the "criterion ',' criteria" construct; because you're not interested in what's being grouped, a non-capturing group is best here. In addition, you want capturing groups for "field" and "value", since they provide the information you're interested in.
The regexp can be compiled directly from the grammar and other information above:
(\w+)='([^']*)'(?:,\s*(\w+)='([^']*)')*
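As a rough illustration of the difference, the sketch below compiles the expression twice: once with the non-capturing (?: ... ) group and once with a plain ( ... ) group in the same place. The class name, the helper methods and the sample input are assumptions made for the example; only Pattern and Matcher come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // The corrected expression, using a non-capturing group for the repeated part.
    static final String NON_CAPTURING = "(\\w+)='([^']*)'(?:,\\s*(\\w+)='([^']*)')*";
    // Same expression with an ordinary capturing group instead (for comparison only).
    static final String CAPTURING = "(\\w+)='([^']*)'(,\\s*(\\w+)='([^']*)')*";

    /** Number of capturing groups declared in the regex. */
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    /** Groups 1..n for a full match of input, or null if the input does not match. */
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] result = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            result[i - 1] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(groupCount(NON_CAPTURING)); // field/value pairs only
        System.out.println(groupCount(CAPTURING));     // one extra group for the wrapper
        for (String g : groups(NON_CAPTURING, "Carrier='SpeedyAir', Origin='SFO'")) {
            System.out.println(g);
        }
    }
}
```

With the non-capturing form, the group numbers line up with the field and value positions and nothing else.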
Does that answer the question?
- Peter
[ May 07, 2003: Message edited by: Peter den Haan ]
Samantha O'Neill
Greenhorn

Joined: Apr 15, 2003
Posts: 26
Thanks for that Peter
I had worked that one out and have now found another problem which you may be able to shed some light on but it isn't a fault with your regexp.
Well actually there is a bit of a hiccup in that using \w here does not include whitespace and the name of two of the fields does include a space e.g. "Origin airport" and "Destination airport".
Instead I could use a sequence like
[a-zA-Z_0-9[ ]]+
but this would allow a space to appear at the beginning or the end of the value so I then need to trim() it.
The real issue however is that once I have parsed my regexp and all is well (bar the space issue) I then call matcher.groupCount() so that I can cycle through the matched groups calling matcher.group(int num) to store the values returned.
This works fine for a string such as:
"Origin airport='SFO', Carrier='SpeedyAir'"
but as soon as I have more than two criteria such as
"Origin airport='SFO', Destination airport='DEN', Carrier='SpeedyAir'"
I find that the group count remains at 4 instead of increasing to 6, and I only get the details for two of the criteria. I presume this is because, although the group that starts with ?: is followed by a *, there are just 4 capturing groups defined in the regexp at compile time.
At the moment I can't see a way around this but to use this expression to validate the whole criteria string at the start and then have a sub-regexp to use with the matcher.find() and matcher.group() methods in a loop to get the individual values. Somehow this doesn't seem right. Am I doing something obviously wrong?
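The behaviour described here can be reproduced with a small sketch. It assumes the field part of the expression has been widened to [\w ]+ so that names containing a space match; that tweak, the class name and the helper method are illustrative assumptions, not the code actually used in the assignment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Corrected expression from earlier in the thread, with [\w ]+ assumed for field names.
    static final Pattern CRITERIA =
            Pattern.compile("([\\w ]+)='([^']*)'(?:,\\s*([\\w ]+)='([^']*)')*");

    /** Returns groups 1..groupCount() after matching the whole criteria string. */
    static String[] groupsAfterMatch(String criteria) {
        Matcher m = CRITERIA.matcher(criteria);
        if (!m.matches()) {
            throw new IllegalArgumentException("No match: " + criteria);
        }
        // groupCount() reflects the groups written in the pattern, not the repetitions matched.
        String[] result = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            result[i - 1] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String three = "Origin airport='SFO', Destination airport='DEN', Carrier='SpeedyAir'";
        for (String g : groupsAfterMatch(three)) {
            System.out.println(g);
        }
    }
}
```

Groups 3 and 4 sit inside the repeated part, so each repetition overwrites them and only the last pair is still available once the match completes.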
I would really like to understand the problem, so any help offered will be much appreciated.
Many thanks Sam
Arun Bommannavar
Ranch Hand

Joined: Jan 11, 2003
Posts: 53
Originally posted by Samantha O'Neill:
At the moment I can't see a way around this but to use this expression to validate the whole criteria string at the start and then have a sub-regexp to use with the matcher.find() and matcher.group() methods in a loop to get the individual values. Somehow this doesn't seem right. Am I doing something obviously wrong?
I would really like to understand the problem, so any help offered will be much appreciated.
Many thanks Sam

Lasse Koskela posted a cute technique using Map and Set. Part of his posting is as follows:

In other words, this Map/Set approach relies on the implementation of java.util.Map and java.util.Set to perform the comparison. What my code is doing is
1) creating a Map ("A") based on a given query string, and
2) creating another Map ("B") based on a DataInfo, a record read from the database.
3) saying, "Map B, do you contain Map A?"
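The three steps above might be sketched as follows. This is not Lasse's code: the class and method names are invented, and it assumes exact (equals-based) comparison of field values, which may not be what the assignment requires.

```java
import java.util.HashMap;
import java.util.Map;

class MapContainment {
    /** True when every field/value pair in the query also appears in the record. */
    static boolean recordMatches(Map<String, String> query, Map<String, String> record) {
        // Set.containsAll on the entry sets compares both key and value of each entry.
        return record.entrySet().containsAll(query.entrySet());
    }

    public static void main(String[] args) {
        // Map "A": built from the query string (parsing not shown here).
        Map<String, String> query = new HashMap<>();
        query.put("Origin airport", "SFO");
        query.put("Carrier", "SpeedyAir");

        // Map "B": built from one database record.
        Map<String, String> record = new HashMap<>();
        record.put("Origin airport", "SFO");
        record.put("Destination airport", "DEN");
        record.put("Carrier", "SpeedyAir");

        System.out.println(recordMatches(query, record));
    }
}
```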


I have tried it and it works. Real cute.
Regards
Arun
Samantha O'Neill
Greenhorn

Joined: Apr 15, 2003
Posts: 26
Ah yes Arun my approach is much the same but the problem I am having here is with the part

1) creating a Map ("A") based on a given query string,

Here I would like to use Peter's regular expression, with a couple of tweaks, to validate the criteria string and then break it down into groups
but the matcher.groupCount() method is not working as I had hoped.
I think most people have done this using a StringTokenizer, but I'd like to get to know regular expressions now if I can. So I wanted to know if anyone had got this to work and could give me some hints as to how.
Thanks Arun
Sam
Leslie Chaim
Ranch Hand

Joined: May 22, 2002
Posts: 336
Sam,
A few points:
About the \w hiccup: you can simply include \w in a character class, as in [\w ].
The Map/Set thing may be a very cute idea, but first some parsing must be done.
You did not list the final regex which you are using. Can you please post your code using UBB tags? I'll try to get back at the end of the day.


Normal is in the eye of the beholder
Samantha O'Neill
Greenhorn

Joined: Apr 15, 2003
Posts: 26
Hi Leslie
Have been off doing other things, so sorry for the slow response. I have got it working nicely now. First I check the whole criteria string against the following regex
<code>
"\\w+(\\s?\\w+)*='[^',=]+'"
+ "(,\\s*\\w+(\\s?\\w+)*='[^',=]+')*";
</code>
Looks more frightening than Peter's I know

This does some extra work to make sure the column name always starts with at least one of the following [a-zA-Z_0-9] and then may be followed by more of the same, or a single space and more of the same, 0 or many times - if that makes sense.
Then I use the split() method of the Pattern class to break up the string into its various parts, because there is a problem using the matcher.group() method. The number of groups is set at compile time, so in Peter's original string, although he declared his last two groups as repeating (declared within the (...)*), they were just overwritten each time with each column name=column value pair and the group count never increased above 4.
For this reason I added the , and = characters to the list of chars not allowed in the column value i.e. '[^',=]' as it would mess up the split() method calls.
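A rough sketch of this validate-then-split approach is below. The validation regex is the one posted above; everything else (class name, the two separator patterns, returning a LinkedHashMap) is an assumption about how the pieces could fit together, not the actual assignment code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class ValidateThenSplit {
    // Whole-string validation regex as posted above.
    static final Pattern VALID = Pattern.compile(
            "\\w+(\\s?\\w+)*='[^',=]+'"
            + "(,\\s*\\w+(\\s?\\w+)*='[^',=]+')*");
    // Assumed separators: a comma plus optional whitespace between pairs, '=' inside a pair.
    static final Pattern PAIR_SEPARATOR = Pattern.compile(",\\s*");
    static final Pattern NAME_VALUE_SEPARATOR = Pattern.compile("=");

    static Map<String, String> parse(String criteria) {
        if (!VALID.matcher(criteria).matches()) {
            throw new IllegalArgumentException("Malformed criteria: " + criteria);
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (String pair : PAIR_SEPARATOR.split(criteria)) {
            // Validation guarantees exactly one '=' per pair and a quoted, non-empty value.
            String[] parts = NAME_VALUE_SEPARATOR.split(pair);
            String quoted = parts[1];
            result.put(parts[0], quoted.substring(1, quoted.length() - 1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse(
                "Origin airport='SFO', Destination airport='DEN', Carrier='SpeedyAir'"));
    }
}
```

Splitting is only safe here because the validation step has already ruled out commas and equals signs inside the quoted values.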
Anyway it all works really well but thanks for your reply in any case.
Sam
Leslie Chaim
Ranch Hand

Joined: May 22, 2002
Posts: 336
Hi Sam,
So you are using regex to validate your string and then some other tools to glean out the wanted values. Although you're doing this for the SCJD exam, you may want to study the regex part a bit more on its own. I think your approach is a bit complicated; you may want to review it for the sake of KIS, not to mention KISS.
In asking for your code I really wanted to see not just the regex but your whole parsing solution, so I could comment accordingly. Also, you did not provide enough sample data, so I had to make do with whatever you posted. It's possible my solution may need some fine-tuning accordingly.
I created a file called sam_input:

Origin airport='SFO', Carrier='SpeedyAir'
Origin airport='SFO', Destination airport='DEN', Carrier='SpeedyAir'

And the following code to parse em:

Then I run 'java Parser < sam_input' and Output is:

Key=|Origin airport| Value=|SFO|
Key=|Carrier| Value=|SpeedyAir|
Key=|Origin airport| Value=|SFO|
Key=|Destination airport| Value=|DEN|
Key=|Carrier| Value=|SpeedyAir|
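The parsing code itself is missing from this copy of the thread. As a stand-in, here is a minimal sketch that yields the same Key/Value lines for the two sample inputs, assuming a Matcher.find() loop over a single-criterion pattern. The pattern, class name and method name are guesses for illustration and may differ from what was originally posted; it also takes each line as a method argument rather than reading standard input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindLoopParser {
    // One criterion: a name starting with a word character, then word characters
    // or spaces, followed by ='value'. (Assumed pattern.)
    static final Pattern CRITERION = Pattern.compile("(\\w[\\w ]*)='([^']*)'");

    /** Returns one "Key=|...| Value=|...|" string per criterion found in the line. */
    static List<String> parseLine(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = CRITERION.matcher(line);
        // Each call to find() moves on to the next criterion, so groups 1 and 2
        // are read once per pair instead of being overwritten.
        while (m.find()) {
            out.add("Key=|" + m.group(1) + "| Value=|" + m.group(2) + "|");
        }
        return out;
    }

    public static void main(String[] args) {
        String[] lines = {
            "Origin airport='SFO', Carrier='SpeedyAir'",
            "Origin airport='SFO', Destination airport='DEN', Carrier='SpeedyAir'"
        };
        for (String line : lines) {
            for (String s : parseLine(line)) {
                System.out.println(s);
            }
        }
    }
}
```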

Cheers,
Leslie
[ May 16, 2003: Message edited by: Leslie Chaim ]
Samantha O'Neill
Greenhorn

Joined: Apr 15, 2003
Posts: 26
Hi Leslie
Sorry it's taken me so long to get back to this.
What I have is:
The top level looking like the following:

Now in my getCriteriaMap() method I have the following:

The REGEX is a little more complicated but it means I can validate the entire criteria String. If I do it your way, and I agree it works, I am only picking out matching parts of the criteria String when there could be all sorts of extraneous characters in it. For example I could have a String:
"Origin airport='SFO',,,,, ??Destination airport='DEN', Carrier='SpeedyAir'"
and your group matching REGEX will correctly decipher it into its <column name>=<column value> pairs and ignore the extra unwanted characters.
Do you think it's OK not to parse the criteria string as a whole and check that the entire format is OK?
Maybe the best solution is somewhere between yours and mine, with two regexes: one for validating the whole string and one for extracting the groups.
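One way the two-regex idea could look, as a sketch only: a single criterion fragment is reused to build both the whole-string validator and the extractor. The fragment, the class name and the method names are assumptions for illustration and are simpler than the validation regex posted earlier (for instance, this fragment allows empty values).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStageCriteriaParser {
    // Assumed shape of one criterion: name of word characters and spaces, then a quoted value.
    static final String CRITERION = "(\\w[\\w ]*)='([^']*)'";
    // Extractor: used with find() to pull out one pair at a time.
    static final Pattern ONE = Pattern.compile(CRITERION);
    // Validator: the whole string must be criteria separated by commas and optional whitespace.
    static final Pattern WHOLE = Pattern.compile(CRITERION + "(?:,\\s*" + CRITERION + ")*");

    static boolean isValid(String criteria) {
        return WHOLE.matcher(criteria).matches();
    }

    /** Extraction only: picks out every pair it can find and skips anything in between. */
    static Map<String, String> extract(String criteria) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = ONE.matcher(criteria);
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    /** Validation first, then extraction. */
    static Map<String, String> parse(String criteria) {
        if (!isValid(criteria)) {
            throw new IllegalArgumentException("Malformed criteria: " + criteria);
        }
        return extract(criteria);
    }

    public static void main(String[] args) {
        String junk = "Origin airport='SFO',,,,, ??Destination airport='DEN', Carrier='SpeedyAir'";
        System.out.println(isValid(junk));
        System.out.println(extract(junk));
    }
}
```

Run against the string with the stray commas and question marks, extraction on its own still returns all three pairs, while the validator rejects the string as a whole.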
Let me know what you think. Your input has been much appreciated and got me thinking!
Many thanks Sam
Samantha O'Neill
Greenhorn

Joined: Apr 15, 2003
Posts: 26
PS Sorry about the code formatting it seems to have got a bit screwed up so not as readable as it could be!
Sam
 