Problem with Regex

Taco Fleur
Greenhorn

Joined: Jul 11, 2005
Posts: 21
I was under the impression that \\ would escape characters in a character class, but it doesn't seem to work for me.

java.util.regex.PatternSyntaxException: Illegal octal escape sequence near index 4
[^\0-9A-Z\p{Blank}
^_`~\p{Punct}]

I need to allow the following characters
\
0-9
A-Z
\p{Blank}
\r
\n
^_`~
\p{Punct}
[]{}|

How would I format the regex?
I tried:
[^\\0-9A-Z\\p{Blank}\r\n^_`~\\p{Punct}\\[\\]\\{\\}\\|]
Stefan Evans
Bartender

Joined: Jul 06, 2005
Posts: 1018
To escape backslash \ in a regex string you need FOUR backslashes.
You need two backslashes in the regex.
You need to escape each backslash in the string = total of four.

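Also it is clearer to escape the backslashes in \r and \n (a literal carriage return or newline inside the character class also works, but it makes the pattern hard to read in error messages).
And do you mean to start with a ^? Because that indicates a logical not, i.e. all characters EXCEPT what you include in the square brackets.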

String escapePattern = "[^\\\\0-9A-Z\\p{Blank}\\r\\n^_`~\\p{Punct}\\[\\]\\{\\}\\|]";
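
As a rough illustration of the escaping rules above, here is a minimal sketch. The class name EscapeDemo and its helper methods are made up for this example, and the character class is a shortened, non-negated version of the one in the thread. It shows that four backslashes in Java source reach the regex engine as one literal backslash, and that a lone \0 followed by a non-octal character is rejected.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class, not part of the original thread.
final class EscapeDemo {
    // Java source "\\\\" -> string content \\ -> regex: one literal backslash.
    // Java source "\\p" -> string content \p -> regex: start of \p{...}.
    static final String ALLOWED = "[\\\\0-9A-Z\\p{Blank}\\r\\n\\p{Punct}]";

    // True if the given one-character string is in the allowed set.
    static boolean isAllowedChar(String s) {
        return Pattern.matches(ALLOWED, s);
    }

    // True if the regex is rejected by Pattern.compile.
    static boolean failsToCompile(String regex) {
        try {
            Pattern.compile(regex);
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Print the pattern the way the regex engine sees it.
        System.out.println(ALLOWED);
        System.out.println(isAllowedChar("\\"));  // a single backslash
        System.out.println(isAllowedChar("a"));   // lower case, not allowed
        // Only two backslashes: the engine sees \0 followed by '-',
        // which it treats as a broken octal escape.
        System.out.println(failsToCompile("[^\\0-9A-Z]"));
    }
}
```

Printing the pattern string before compiling it is a cheap way to see what survives Java's own string escaping.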

Hope this helps some,
evnafets
[ July 12, 2005: Message edited by: Stefan Evans ]
Stefan Evans
Bartender

Joined: Jul 06, 2005
Posts: 1018
Oh, and \p{Punct} includes all of these characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Thus you don't need to include them in your regex explicitly.

Check out java.util.regex.Pattern for full pattern syntax.

Actually there are quite a few patterns you could use.
The character class [\p{Upper}\p{Digit}\p{Blank}\p{Punct}\r\n] should do it.

What is it you are trying to filter out? Just lower case letters?
In that case [^\p{Lower}] would do it.
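
To check the claim about \p{Punct} and the shorter patterns, a small sketch along these lines may help. The class name PunctDemo and its method names are invented for the example; the sample message is the one posted later in this thread, assumed to contain real newline characters.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original thread.
final class PunctDemo {
    static final Pattern PUNCT = Pattern.compile("\\p{Punct}");
    // Upper case, digits, space/tab, ASCII punctuation, CR and LF, one or more times.
    static final Pattern ALLOWED =
        Pattern.compile("[\\p{Upper}\\p{Digit}\\p{Blank}\\p{Punct}\\r\\n]+");
    // Anything except lower-case letters, one or more times.
    static final Pattern NOT_LOWER = Pattern.compile("[^\\p{Lower}]+");

    // True if every character of chars is matched by \p{Punct} on its own.
    static boolean allPunct(String chars) {
        for (char c : chars.toCharArray()) {
            if (!PUNCT.matcher(String.valueOf(c)).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String sample = "ZHDASCTXID0400\nZTX777Y20050711777571\nZTRENDTXID3\n";
        // The characters listed separately in the question are all punctuation.
        System.out.println(allPunct("\\^_`~[]{}|"));
        System.out.println(ALLOWED.matcher(sample).matches());
        System.out.println(NOT_LOWER.matcher(sample).matches());
        System.out.println(NOT_LOWER.matcher("Hello").matches());
    }
}
```

Note that [^\p{Lower}] is much more permissive than the explicit list: it also lets through control characters and non-ASCII text, so it only fits if lower-case letters really are the only thing to reject.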
Taco Fleur
Greenhorn

Joined: Jul 11, 2005
Posts: 21
Hi thanks,

At least it doesn't throw an error this time, but it doesn't work as it should either ;-)

Pattern myValidation = Pattern.compile( "[^" + VALID_MESSAGE_PATTERN + "]", Pattern.DOTALL );
myMatch = myValidation.matcher( messagePart );
isValid = myMatch.matches();
if ( !isValid )
{
setError( "Invalid message format, message: " + messagePart + " Invalid characters are: " + messagePart.replaceAll( "[" + VALID_MESSAGE_PATTERN + "]", "" ) );
}

This is the string I run it on: ZHDASCTXID0400\nZTX777Y20050711777571\nZTRENDTXID3\n

The output is:
Invalid message format, message: ZHDASCTXID0400
ZTX777Y20050711777571
ZTRENDTXID3
Invalid characters are:

which I do not understand, because it says there are invalid characters, but when I remove all valid characters from the string nothing is left.
Stefan Evans
Bartender

Joined: Jul 06, 2005
Posts: 1018
The pattern given will match ONE character only.
To make it match one or more characters, put a + on the end of it,
i.e. [A-Z]+ will match 1 or more characters from A-Z.

Rather than trying to get the regex right in one fell swoop, I would recommend starting small and building it up a bit at a time,
i.e. [A-Z]+ and then [A-Z0-9]+ and then add punctuation...

Also be aware that [^abc] will match any character EXCEPT abc which seems a bit different from what you wanted. I don't know why you are putting it around your pattern as a whole when constructing it.
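
The build-it-up approach could look something like the sketch below. BuildUpDemo and its whole() helper are names invented for the example; the sample message is the one posted earlier, assumed to contain real newline characters.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original thread.
final class BuildUpDemo {
    static final String SAMPLE =
        "ZHDASCTXID0400\nZTX777Y20050711777571\nZTRENDTXID3\n";

    // True only if the regex matches the whole input, like Matcher.matches().
    static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Without + the class stands for exactly one character.
        System.out.println(whole("[A-Z]", "Z"));         // one letter: matches
        System.out.println(whole("[A-Z]", "ZH"));        // two letters: does not
        System.out.println(whole("[A-Z]+", "ZHDASC"));   // letters only: matches

        // Grow the class step by step against the real message.
        System.out.println(whole("[A-Z]+", SAMPLE));           // digits and newlines left over
        System.out.println(whole("[A-Z0-9]+", SAMPLE));        // newlines left over
        System.out.println(whole("[A-Z0-9\\r\\n]+", SAMPLE));  // everything covered

        // A leading ^ inverts the class: anything EXCEPT a, b or c.
        System.out.println(whole("[^abc]+", "xyz"));
        System.out.println(whole("[^abc]+", "xay"));
    }
}
```

Each step that still fails shows which kind of character the class does not cover yet.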

Good luck,
evnafets
[ July 12, 2005: Message edited by: Stefan Evans ]
Taco Fleur
Greenhorn

Joined: Jul 11, 2005
Posts: 21
Hi,

First I want to make sure there are no characters other than the allowed ones, i.e. the ones in the regex. Then, if there are characters in the message outside the allowed range, I want to display the invalid characters. Removing all valid characters should leave all the invalid characters, right?

The removing works like a charm, it removes all the valid chars, but the check to see if there are any characters outside the range is giving me trouble.
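
One possible way to do the check described above is to search for a single invalid character with find() instead of testing the whole string with matches(). This is only a sketch: the class name MessageValidator and its methods are hypothetical, and the value given to VALID_MESSAGE_PATTERN is an assumption, since the real constant was never posted.

```java
import java.util.regex.Pattern;

// Hypothetical validator, not the original poster's code.
final class MessageValidator {
    // ASSUMED value: backslash, digits, upper case, blanks, CR, LF, punctuation.
    static final String VALID_MESSAGE_PATTERN =
        "\\\\0-9A-Z\\p{Blank}\\r\\n\\p{Punct}";

    // Matches one character that is NOT in the allowed set.
    static final Pattern INVALID_CHAR =
        Pattern.compile("[^" + VALID_MESSAGE_PATTERN + "]");
    // Matches one character that IS in the allowed set.
    static final Pattern VALID_CHAR =
        Pattern.compile("[" + VALID_MESSAGE_PATTERN + "]");

    // Valid means: no invalid character can be found anywhere.
    static boolean isValid(String message) {
        return !INVALID_CHAR.matcher(message).find();
    }

    // Strip every valid character; whatever remains is invalid.
    static String invalidChars(String message) {
        return VALID_CHAR.matcher(message).replaceAll("");
    }

    public static void main(String[] args) {
        String good = "ZHDASCTXID0400\nZTX777Y20050711777571\nZTRENDTXID3\n";
        String bad = "ZHDabc1";
        System.out.println(isValid(good));
        System.out.println(isValid(bad));
        System.out.println(invalidChars(bad));
    }
}
```

With this definition an empty message counts as valid, because it contains no invalid character; add a length check if that is not wanted.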
Taco Fleur
Greenhorn

Joined: Jul 11, 2005
Posts: 21
I added the + as you suggested. I come from another language where you can specify "ALL", i.e. remove all chars in the set ;-)

Pattern myValidation = Pattern.compile( "[^" + VALID_MESSAGE_PATTERN + "]+", Pattern.DOTALL );
myMatch = myValidation.matcher( messagePart );
isValid = myMatch.matches();
if ( !isValid )
{

However it still complains saying the message is invalid.
Taco Fleur
Greenhorn

Joined: Jul 11, 2005
Posts: 21
Even if I only do
Pattern myValidation = Pattern.compile( "[A-Z]+", Pattern.DOTALL );
myMatch = myValidation.matcher( messagePart );
isValid = myMatch.matches();
it still tells me the match is false.
Maybe I am just missing something vital here?
The string has characters between A-Z, so it should match right?
I am just trying to take it one step at a time as you suggested.
Stefan Evans
Bartender

Joined: Jul 06, 2005
Posts: 1018
There is a subtle difference between the two methods matches() and replaceAll().

matches() returns true only if the entire String provided is matched by the pattern. replaceAll(), on the other hand, replaces every match it finds anywhere in the String. Currently I think it is failing because you didn't have the + sign, so the pattern would fail on anything longer than one character.

They are both working as expected.
See if this example code explains it.
Note the difference between pattern and pattern2 is the + sign.
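
The example code from this post is not shown here. As a stand-in, here is a minimal sketch of what such a comparison might look like. Only the names pattern and pattern2 come from the post; the class name, the helper methods and the test strings are assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical reconstruction, not the code originally posted.
final class MatchVsReplaceDemo {
    static final Pattern pattern = Pattern.compile("[A-Z]");    // exactly one character
    static final Pattern pattern2 = Pattern.compile("[A-Z]+");  // one or more characters

    // matches() succeeds only if the pattern covers the whole input.
    static boolean matchesWhole(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    // replaceAll() removes every match found anywhere in the input.
    static String strip(Pattern p, String input) {
        return p.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String letters = "ABC";
        System.out.println(matchesWhole(pattern, letters));   // false: more than one char
        System.out.println(matchesWhole(pattern2, letters));  // true
        // Both patterns remove all upper-case letters, one by one or in runs.
        System.out.println(strip(pattern, letters).isEmpty());
        System.out.println(strip(pattern2, letters).isEmpty());

        // A digit breaks matches() even with the +, but replaceAll() still
        // strips the letters and leaves the digit behind.
        String mixed = "AB1";
        System.out.println(matchesWhole(pattern2, mixed));
        System.out.println(strip(pattern2, mixed));
    }
}
```

This is also why [A-Z]+ still reports false on the full message: the message contains digits and newlines, so the pattern cannot cover the whole string.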

[ July 12, 2005: Message edited by: Stefan Evans ]