JavaRanch » Java Forums » Java » Java in General
Suggest one regex to match all the following CharSequence

a sarkar
Ranch Hand

Joined: Aug 05, 2010
Posts: 92
Hi,
I am looking for a regex to match all of the following CharSequences and extract 2 groups out of 'em. The first group would be the name without the 4-digit year and parentheses. The second group, if present, would be the 4-digit year without the parentheses.
abc.avi
Ab-C.mkv
abc def.mkv
AbC DeF.divx
abc (2010).avi
ABC-DEF (2010).mkv

One I came up with does not work as expected:


Any help would be appreciated.
--
Abhi
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19784
    

I see that you're using [ and ] to match the extension. That won't work. You should use ( and ), or if you don't want it as a capturing group use (?: and )

When creating regexes it's best to build them in little blocks. You first write down in normal words what you want to do, then translate each part into a sub-regex, and then paste these together.

So let's break it down:
- letters, dashes or spaces
- optionally: opening parenthesis, 4-digit year (grouped), closing parenthesis
- a dot
- mkv, avi, mp4, etc

I see you're using \u2212\u0020 for dash and space. You don't need those; you can add the characters as they are, as long as you put the dash at the start of the character class. Otherwise it gets a special meaning (it denotes a range): [-\w ]

See if you can put all of this together to form one regex. Make sure to check your year group against null. This will occur if it's not present.
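The block-by-block approach described above can be sketched as follows. This is only one possible assembly, not the regex from the original post; the class name, constant names and the parse helper are made up for illustration, and the extension list is limited to the ones mentioned in the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; none of these names come from the thread.
class FileNameBlocks {
    // One sub-regex per bullet in the breakdown.
    static final String NAME = "([-\\w ]+)";           // word chars, dashes or spaces (dash first in the class)
    static final String YEAR = "(?:\\((\\d{4})\\))?";  // optional "(yyyy)", only the digits are captured
    static final String DOT  = "\\.";                  // a literal dot
    static final String EXT  = "(?:avi|mkv|mp4|divx)"; // one of the known extensions

    static final Pattern FILE_NAME = Pattern.compile(NAME + YEAR + DOT + EXT);

    // Returns {name, year}; year is null when the file name has no "(yyyy)" part.
    // Returns null when the file name does not match at all.
    static String[] parse(String fileName) {
        Matcher m = FILE_NAME.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        String name = m.group(1).trim(); // drop the space before "(yyyy)"
        String year = m.group(2);        // null if the optional group did not take part
        return new String[] { name, year };
    }

    public static void main(String[] args) {
        String[] samples = { "abc.avi", "Ab-C.mkv", "abc def.mkv",
                             "AbC DeF.divx", "abc (2010).avi", "ABC-DEF (2010).mkv" };
        for (String s : samples) {
            String[] parts = parse(s);
            String year = (parts[1] == null) ? "(none)" : parts[1];
            System.out.println(s + " -> name=" + parts[0] + ", year=" + year);
        }
    }
}
```

Note the explicit null check on the year: Matcher.group(2) returns null, not an empty string, when the optional group is absent.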
a sarkar
Ranch Hand

Joined: Aug 05, 2010
Posts: 92
Rob Prime wrote:
See if you can put all of this together to form one regex. Make sure to check your year group against null. This will occur if it's not present.

Thank you Rob for your input. I will chew on that and post back with the results.
--
Abhi
"Old user, new username"
Darryl Burke
Bartender

Joined: May 03, 2008
Posts: 4664
    

a sarkar wrote:"Old user, new username"

Why, were you taken up for cross posting under the old username?
http://forums.oracle.com/forums/thread.jspa?threadID=1255349
a sarkar
Ranch Hand

Joined: Aug 05, 2010
Posts: 92
Darryl Burke wrote:
a sarkar wrote:"Old user, new username"

Why, were you taken up for cross posting under the old username?
http://forums.oracle.com/forums/thread.jspa?threadID=1255349

Nops, I didn't like the old username.

cross posting by Darryl -
http://forums.sun.com/thread.jspa?threadID=5441460
http://www.coderanch.com/t/498351/GUI/java/Bug-IconUIResource-doesn-paint-animated

--
Abhi
"Old user, new username"
Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3649
    

Except he notified everyone he was cross posting?
a sarkar
Ranch Hand

Joined: Aug 05, 2010
Posts: 92
Stephan van Hulst wrote:Except he notified everyone he was cross posting?

Cross posting, as I understand it, applies to forums within a single website. Including all sites on the Web is a pretty big scope, I would say.
For argument's sake, even if we consider this as cross posting, notifying everyone is hardly a justification, don't you think? What if you commit a murder and notify all that you did it? Does that make it any less of a crime?
--
Abhi
"Old user, new username"
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19784
    

Let's not discuss this here any further. If you guys want to continue, go to the Ranch Office forum.

Abhi, please read our BeForthrightWhenCrossPostingToOtherSites FAQ entry. We have this policy to prevent people from spending much time answering a question that may have been answered days ago on another forum. People on other forums will also like it if you notify them of posts here.
The issue you mentioned is our UseOneThreadPerQuestion policy.
a sarkar
Ranch Hand

Joined: Aug 05, 2010
Posts: 92
This is what I finally got working...with help from this forum and the Oracle forum.

Unless someone wants to suggest a better regex, this is good for me.
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19784
    

Let's break it down:
- ([-\\w\\s]++) : I used greedy quantification in my little test, so one +. However, since the ++ is applied to characters that are not part of the remainder (parentheses or dot) this won't matter. All is in a capturing group which is fine.
- (?:\\((\\d{4})\\))?+ : non capturing like my test. A (, followed by a capturing group of 4 digits, followed by another ). The entire thing is optional. Exactly like my test.
- \\. : a dot. Can't be simpler.
- (?:avi|mkv|mp4|divx){1}+ : this is where I have some questions. First of all, {1} is never needed. It means "exactly one time", which is the same as what you get if you don't add any quantifier. But that trailing + is odd. Do you want things like "avimp3divx" to be allowed? Surely not?

In the end, if you remove that {1}+ part you get almost what I had. I had no capturing group for the first part, and turned the extension into a capturing group, but apart from that and the ++ vs + it was the same.
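For readers following along: the regex itself does not appear in the thread text, so the sketch below reassembles it from the four parts quoted in the breakdown above. Treat it as a reconstruction, not a verbatim copy of what was posted. The class and method names are made up for illustration. It runs both the reassembled pattern and a variant without {1}+ over the sample file names from the first post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern strings are pieced together from the breakdown.
class ReconstructedRegex {
    // The four quoted parts concatenated in order.
    static final Pattern AS_POSTED = Pattern.compile(
            "([-\\w\\s]++)(?:\\((\\d{4})\\))?+\\.(?:avi|mkv|mp4|divx){1}+");

    // The same pattern with the redundant {1}+ dropped.
    static final Pattern SIMPLIFIED = Pattern.compile(
            "([-\\w\\s]++)(?:\\((\\d{4})\\))?+\\.(?:avi|mkv|mp4|divx)");

    // Formats the two groups, or reports that the whole input did not match.
    static String describe(Pattern p, String fileName) {
        Matcher m = p.matcher(fileName);
        if (!m.matches()) {
            return "no match";
        }
        // group(2) is null when the year is absent; concatenation prints "null".
        return "name=[" + m.group(1).trim() + "] year=" + m.group(2);
    }

    public static void main(String[] args) {
        String[] samples = { "abc.avi", "Ab-C.mkv", "abc def.mkv",
                             "AbC DeF.divx", "abc (2010).avi", "ABC-DEF (2010).mkv" };
        for (String s : samples) {
            System.out.println(s + " -> " + describe(AS_POSTED, s)
                    + " | " + describe(SIMPLIFIED, s));
        }
    }
}
```

On these six samples both patterns should report the same name and year, which is consistent with the point that {1} adds nothing.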
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3018
    
Rob Prime wrote:- (?:avi|mkv|mp4|divx){1}+ : this is where I have some questions. First of all, {1} is never needed. It means "exactly one time", which is the same as what you get if you don't add any quantifier. But that trailing + is odd. Do you want things like "avimp3divx" to be allowed? Surely not?

That's not what the + does here. Instead it makes the preceding quantifier possessive. In this case that's {1}, which means exactly once - but now {1}+ means exactly once in possessive mode, disabling backtracking if the first attempt for this part of the expression fails.

Abhi is using possessive quantifiers throughout his regex here. I like using possessive quantifiers in many cases - but I'm not sure they're very helpful here. I suspect there are some not-yet-considered corner cases where using possessives will prevent a match from happening, even in cases where we might expect a match to happen. Not sure though.
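A small sketch of the difference, using java.util.regex only. The class name, the helper and the inputs are made up for illustration. The first three checks show that {1}+ still means "exactly once" and does not allow repetition; the last two show what "no backtracking" means in practice.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the thread.
class PossessiveDemo {
    // True only if the whole input matches the regex.
    static boolean full(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // {1}+ is "exactly once, possessively": a single extension is accepted...
        System.out.println(full("(?:avi|mkv|mp4|divx){1}+", "avi"));
        // ...but two extensions in a row are not.
        System.out.println(full("(?:avi|mkv|mp4|divx){1}+", "avimkv"));
        // A plain + on the group is what would allow repetition.
        System.out.println(full("(?:avi|mkv|mp4|divx)+", "avimkv"));

        // Greedy \w+ first takes "abc1", then gives back the "1" so \d can match.
        System.out.println(full("\\w+\\d", "abc1"));
        // Possessive \w++ takes "abc1" and never gives anything back, so \d fails.
        System.out.println(full("\\w++\\d", "abc1"));
    }
}
```

The last pair is the kind of corner case to watch for: a possessive quantifier only behaves like its greedy counterpart when the characters it consumes cannot also be needed by the rest of the pattern.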

Abhi, your examples don't show any punctuation characters besides '-' and '.'. Do you know for sure that they won't occur in the files you need to handle? You might want to include more test cases to make sure you're handling things well. For example, these look like video titles - movies? Picking a few movie titles semi-randomly out of imdb.com lists, I see

Wall Street: Money Never Sleeps
Legend of the Guardians: The Owls of Ga'Hoole
Crouching Tiger, Hidden Dragon
Kill Bill: Vol. 1
9 1/2 Weeks

If these were converted into file names by adding an optional (2010) (or other year) and a file extension, would your regex successfully parse them? Or, can you guarantee that filenames like that will not occur? Worth considering, I think.
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19784
    

Mike Simmons wrote:
Rob Prime wrote:- (?:avi|mkv|mp4|divx){1}+ : this is where I have some questions. First of all, {1} is never needed. It means "exactly one time", which is the same as what you get if you don't add any quantifier. But that trailing + is odd. Do you want things like "avimp3divx" to be allowed? Surely not?

That's not what the + does here. Instead it makes the preceding quantifier possessive. In this case that's {1}, which means exactly once - but now {1}+ means exactly once in possessive mode, disabling backtracking if the first attempt for this part of the expression fails.

Ah ok. I didn't know that possessive quantifiers also applied to {}.
a sarkar
Ranch Hand

Joined: Aug 05, 2010
Posts: 92
Rob Prime wrote:I see that you're using [ and ] to match the extension. That won't work. You should use ( and ), or if you don't want it as a capturing group use (?: and )

I see that this is true but I don't understand the logic behind it. Could you explain this statement or point to some documentation that does?
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19784
    

How about the Javadoc of java.util.regex.Pattern? 90% of what you need for regexes can be found there.
a sarkar
Ranch Hand

Joined: Aug 05, 2010
Posts: 92
Rob Prime wrote:How about the Javadoc of java.util.regex.Pattern? 90% of what you need for regexes can be found there.

It seems to me that [avi|mkv|mp4|divx] is following straight from the Character class [abc] which means "a, b, or c (simple class)". Why would [avi|mkv|mp4|divx] not work and (avi|mkv|mp4|divx) would?
a sarkar
Ranch Hand

Joined: Aug 05, 2010
Posts: 92
Mike Simmons wrote:
Picking a few movie titles semi-randomly out of imdb.com lists, I see
Wall Street: Money Never Sleeps
Legend of the Guardians: The Owls of Ga'Hoole
Crouching Tiger, Hidden Dragon
Kill Bill: Vol. 1
9 1/2 Weeks

If these were converted into file names by adding an optional (2010) (or other year) and a file extension, would your regex successfully parse them?

Thanks for pointing this out Mike. From your example, let me see what I missed in my regex:
- Colon (:) as in "Wall Street: Money Never Sleeps" - Not allowed in a physical file name on Windows. As long as I am reading files from a Windows directory, I can guarantee this will not appear. A colon is permitted in a filename on Unix, though I have yet to see a practical example of such a file.
- Apostrophe (') as in "The Owls of Ga'Hoole" - Should be added to the regex.
- Comma (,) as in "Crouching Tiger, Hidden Dragon" - Should be added to the regex.
- Full stop (.) as in "Kill Bill: Vol. 1" - Should be added to the regex.
- Slash (/) as in "9 1/2 Weeks" - Not allowed in a filename on either Windows or Unix.
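A hedged sketch of what adding these characters could look like, and of one corner case it creates. The two patterns below are assumptions built by extending the character class from earlier in the thread with apostrophe, comma and full stop; they are not the regex that was eventually posted. The file names are made up from Mike's titles with the colon dropped and (2010) used as a placeholder year.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; both patterns are illustrative assumptions.
class ExtendedTitles {
    // Greedy name part: may give characters back to the rest of the pattern.
    static final Pattern GREEDY = Pattern.compile(
            "([-\\w\\s',.]+)(?:\\((\\d{4})\\))?\\.(?:avi|mkv|mp4|divx)");

    // Possessive name part: keeps everything it has consumed.
    static final Pattern POSSESSIVE = Pattern.compile(
            "([-\\w\\s',.]++)(?:\\((\\d{4})\\))?+\\.(?:avi|mkv|mp4|divx)");

    // Returns the trimmed name group, or null when the file name does not match.
    static String name(Pattern p, String fileName) {
        Matcher m = p.matcher(fileName);
        return m.matches() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "Crouching Tiger, Hidden Dragon (2010).mkv",
            "Legend of the Guardians The Owls of Ga'Hoole (2010).avi",
            "Kill Bill Vol. 1 (2010).avi",
            "Kill Bill Vol. 1.avi"   // no year: the class now also matches ".avi"
        };
        for (String s : samples) {
            System.out.println(s + " -> greedy=" + name(GREEDY, s)
                    + ", possessive=" + name(POSSESSIVE, s));
        }
    }
}
```

Once the full stop is inside the class, the possessive version swallows the ".avi" of a file name without a year and cannot give it back, so that name no longer matches. The greedy version backtracks and still works. This looks like one instance of the corner cases Mike suspected.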
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19784
    

a sarkar wrote:
Rob Prime wrote:How about the Javadoc of java.util.regex.Pattern? 90% of what you need for regexes can be found there.

It seems to me that [avi|mkv|mp4|divx] is following straight from the Character class [abc] which means "a, b, or c (simple class)". Why would [avi|mkv|mp4|divx] not work and (avi|mkv|mp4|divx) would?

Because a character class matches one single character. You want to match one of a few substrings, and that's what | is for. The () - which can be replaced by (?:) - is used to limit the | to only the things inside them.
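To make the difference concrete, here is a minimal sketch. The class name and the helper are made up for illustration; the two regexes are the ones quoted in the question.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the thread.
class ClassVsGroup {
    // True only if the whole input matches the regex.
    static boolean full(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String charClass = "[avi|mkv|mp4|divx]";   // a set of single characters
        String group     = "(?:avi|mkv|mp4|divx)"; // one of four substrings

        // The class matches exactly one character out of a, v, i, |, m, k, p, 4, d, x.
        System.out.println(full(charClass, "a"));   // one listed character
        System.out.println(full(charClass, "|"));   // inside [], | is a literal character
        System.out.println(full(charClass, "avi")); // three characters: too long for one class

        // The group matches a whole alternative.
        System.out.println(full(group, "avi"));
        System.out.println(full(group, "a"));       // "a" alone is not an alternative
    }
}
```

Inside square brackets the | loses its "or" meaning and becomes just another member of the set, which is why the bracket version appears to half-work on some inputs.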
a sarkar
Ranch Hand

Joined: Aug 05, 2010
Posts: 92
Rob Prime wrote:
... a character class matches one single character. You want to match one of a few substrings, and that's what | is for. The () - which can be replaced by (?:) - is used to limit the | to only the things inside them.

Thank you Rob for the explanation - it definitely cleared up my misconception. Incorporating Mike's suggestion, and with a little update, the following is the latest regex: I will hence mark this thread as resolved.
 