Doubt regarding Pattern

Mansukhdeep Thind
Ranch Hand

Joined: Jul 27, 2010
Posts: 1157

Hi

Have a look at the following code:



The pattern printed is:

\p{javaWhitespace}+

How is this happening? What does the line tell the JVM to do with the scanned input?


~ Mansukh
Rommel Sharma
Greenhorn

Joined: Oct 31, 2003
Posts: 18
It's printing the delimiter currently in use. The Pattern documentation describes the character class inside it as:

\p{javaWhitespace}    Equivalent to java.lang.Character.isWhitespace()

In other words, \p{javaWhitespace} is the regular-expression construct that matches any single character Java considers whitespace.

The Java documentation for Character.isWhitespace() gives the following definition:

A character is a Java whitespace character if and only if it satisfies one of the following criteria:
It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
It is '\u0009', HORIZONTAL TABULATION.
It is '\u000A', LINE FEED.
It is '\u000B', VERTICAL TABULATION.
It is '\u000C', FORM FEED.
It is '\u000D', CARRIAGE RETURN.
It is '\u001C', FILE SEPARATOR.
It is '\u001D', GROUP SEPARATOR.
It is '\u001E', RECORD SEPARATOR.
It is '\u001F', UNIT SEPARATOR.
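
To see the criteria above in action, here is a minimal sketch that compares Character.isWhitespace() with the regex class \p{javaWhitespace} on a few sample characters. The class name WhitespaceCheck and the helper matchesJavaWhitespace are made up for illustration; only Character.isWhitespace() and String.matches() are real API calls.

```java
public class WhitespaceCheck {
    // Hypothetical helper: true when the regex class \p{javaWhitespace}
    // matches the single character c.
    public static boolean matchesJavaWhitespace(char c) {
        return String.valueOf(c).matches("\\p{javaWhitespace}");
    }

    public static void main(String[] args) {
        // Space, tab, line feed, FILE SEPARATOR, non-breaking space, a letter.
        char[] samples = {' ', '\t', '\n', '\u001C', '\u00A0', 'x'};
        for (char c : samples) {
            System.out.printf("U+%04X isWhitespace=%b regex=%b%n",
                    (int) c, Character.isWhitespace(c), matchesJavaWhitespace(c));
        }
    }
}
```

Both columns should agree on every row. Note that the non-breaking space (U+00A0) is reported as not whitespace, as the documentation says.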

Thanks,
Rommel.

Matthew Brown
Bartender

Joined: Apr 06, 2010
Posts: 4391
    

Mansukhdeep Thind wrote:What does the line tell the JVM to do with the scanned input?

It doesn't tell it to do anything. It just retrieves the Pattern that the Scanner is currently using as its delimiter when it processes input (which you then print out). Since your code doesn't set the pattern anywhere, this must be Scanner's default. That matches what the documentation for java.util.Scanner says:
A Scanner breaks its input into tokens using a delimiter pattern, which by default matches whitespace.
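
The snippet from the first post is not reproduced in this thread, so the following is only a guess at its shape: a minimal sketch, assuming the code called Scanner.delimiter() on a Scanner that was never given a custom delimiter. The class and method names are made up for illustration.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

public class DefaultDelimiterDemo {
    // Returns the delimiter regex of a Scanner that was never given a custom one.
    public static String defaultDelimiter() {
        try (Scanner scanner = new Scanner("any input")) {
            // delimiter() only reads the current setting; it consumes no input.
            Pattern delimiter = scanner.delimiter();
            return delimiter.pattern();
        }
    }

    public static void main(String[] args) {
        System.out.println(defaultDelimiter()); // prints \p{javaWhitespace}+
    }
}
```

Calling scanner.useDelimiter(...) before delimiter() would change what gets printed, which is another way to confirm that the method is a plain getter.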
Mansukhdeep Thind
Ranch Hand

Joined: Jul 27, 2010
Posts: 1157

I understand that the default delimiter is a whitespace character. Then what is "\p" for? And why does it print "+", which is a greedy quantifier matching one or more whitespace characters? It should simply use {javaWhitespace}. Where does the rest of the regex come from (the \p and +)?
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7795
    

Mansukhdeep Thind wrote: Where does the rest of the regex come from (the \p and +)?

Have a look at the docs for java.util.regex.Pattern. It explains all this stuff.

Winston

Mansukhdeep Thind
Ranch Hand

Joined: Jul 27, 2010
Posts: 1157

That is too much information to digest in one go, Winston. It would be like searching for a needle in a haystack. Could you be more specific about which heading I should read?
Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3647
    

\p{javaWhitespace} simply means "one whitespace character". The \p is part of the character class syntax. Without it, {...} would be interpreted as a quantifier, and since "javaWhitespace" is not a number, compiling the pattern would fail with a PatternSyntaxException.

The + is needed because you want an entire run of whitespace to be treated as one single delimiter. Without the quantifier, the Scanner could also return empty string tokens between two adjacent spaces.

Note that it's worth reading and understanding the entire Pattern Javadoc.
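
The effect of the + quantifier can be sketched with String.split, which is used here instead of a Scanner only to keep the example short (that substitution is my assumption, not something from the original code). The class name QuantifierDemo and the helper split are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class QuantifierDemo {
    // Hypothetical helper: splits input on the given delimiter regex.
    public static List<String> split(String input, String regex) {
        return Arrays.asList(input.split(regex));
    }

    public static void main(String[] args) {
        String input = "a  b"; // two spaces between the letters

        // Without +, each space is its own delimiter, leaving an empty token between them.
        System.out.println(split(input, "\\p{javaWhitespace}"));  // [a, , b]

        // With +, the whole run of spaces counts as one delimiter.
        System.out.println(split(input, "\\p{javaWhitespace}+")); // [a, b]
    }
}
```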
Matthew Brown
Bartender

Joined: Apr 06, 2010
Posts: 4391
    

Look for "character classes". Basically, {javaWhitespace} isn't a valid regular expression. \p{javaWhitespace} is. And it uses + because by default it treats multiple spaces as if they were a single space.
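
A quick way to check which of the two forms is a valid regular expression is to try compiling both. This is a minimal sketch; the class name PropertyClassDemo and the helper compiles are made up for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class PropertyClassDemo {
    // Hypothetical helper: true when Pattern.compile accepts the regex.
    public static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // The braces were read as a malformed quantifier.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("\\p{javaWhitespace}+")); // true
        System.out.println(compiles("{javaWhitespace}"));     // false
    }
}
```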
Mansukhdeep Thind
Ranch Hand

Joined: Jul 27, 2010
Posts: 1157

Point noted, Stephan. I will devote time to reading through the Pattern documentation and trying things out.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7795
    

Mansukhdeep Thind wrote: That is too much information to digest in one go, Winston. It would be like searching for a needle in a haystack. Could you be more specific about which heading I should read?

Personally, I just use Ctrl+F.

The fact is that at some point you will have to digest all that information, so why not now, when you actually need it?

Winston
 