
Regular Expressions in String's split() method.

 
Siju Odeyemi
Greenhorn
Posts: 10

I've got a String variable expFile with the following value in it:



Then I split the string using the following method:



I'm trying to write a regular expression to split the file after every 10 paragraphs OR at every 1000 characters at most. Unfortunately, I can't seem to get the regular expression right. Can someone with regex skills please show me the light? I'm quite desperate.

Thanks in advance.
 
prem pillai
Ranch Hand
Posts: 87
Java Opera
Have a look at java.util.regex.Matcher.
 
Wouter Oet
Saloon Keeper
Posts: 2700
IntelliJ IDE Opera
I think that he knows about the Matcher class since he is asking for someone with regex skills. However, what have you tried so far? The regex you're looking for isn't very complicated.
 
Siju Odeyemi
Greenhorn
Posts: 10
Prem and Wouter, thanks for the responses.

I don't know regexp syntax at all. I know that the split method breaks the string up every time it encounters the tag, but I need an expression that does what I explained in my opening post.

Cheers guys.

 
prem pillai
Ranch Hand
Posts: 87
Java Opera
but I need an expression that does what I explained in my opening post.


Why are you insisting that it should be done using a regex? If you are not comfortable with regexes, why don't you have a look at other options to break up your string? There are options available in the java.lang.String class itself. Why don't you give it a try the simple way first?

 
Henry Wong
author
Marshal
Pie
Posts: 21190
80
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
Siju Odeyemi wrote:
I'm trying to write a regular expression to split the file after every 10 paragraphs OR at every 1000 characters at most. Unfortunately, I can't seem to get the regular expression right. Can someone with regex skills please show me the light? I'm quite desperate.


Generally, split() is good when you can describe what you want in terms of its delimiters. Descriptions like "10 paragraphs" say more about the content you actually want than about how that content is separated. In those cases, it is probably better to use the find() method instead of the split() method.
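
To make the find() suggestion concrete, here is a minimal sketch rather than a definitive implementation. It assumes each paragraph ends with a literal </p> tag (the tag from the opening post isn't shown, so that is a guess), and the names ChunkSplitter and chunk are made up for illustration. find() pulls out one paragraph at a time, and ordinary Java code enforces the paragraph and character limits:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChunkSplitter {
    // Assumption: a paragraph is everything up to and including "</p>",
    // or whatever is left at the end of the input.
    private static final Pattern PARAGRAPH = Pattern.compile("(?s).+?(?:</p>|\\z)");

    static List<String> chunk(String text, int maxParagraphs, int maxChars) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int paragraphs = 0;
        Matcher m = PARAGRAPH.matcher(text);
        while (m.find()) {
            String para = m.group();
            // Close the current chunk if this paragraph would push it past the cap.
            if (current.length() > 0 && current.length() + para.length() > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
                paragraphs = 0;
            }
            // A single paragraph longer than the cap is cut into maxChars-sized pieces.
            while (para.length() > maxChars) {
                chunks.add(para.substring(0, maxChars));
                para = para.substring(maxChars);
            }
            current.append(para);
            paragraphs++;
            // Close the chunk once it holds the maximum number of paragraphs.
            if (paragraphs == maxParagraphs) {
                chunks.add(current.toString());
                current.setLength(0);
                paragraphs = 0;
            }
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

Under those assumptions, ChunkSplitter.chunk(expFile, 10, 1000) would return the pieces. Keeping the counting in plain Java is a deliberate choice here, so that the pattern itself stays short enough to read.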

Henry
 
Henry Wong
author
Marshal
Pie
Posts: 21190
80
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
Siju Odeyemi wrote: I don't know regexp syntax at all, I know that the split method ....


I seriously recommend against using regexes if you don't know how they work (or their syntax). With regex, it is very easy to write code that you don't understand, even with some experience; trying it with no experience at all is sure to leave you with code that you don't understand and that is completely unmaintainable.

Henry
 
Vinoth Kumar Kannan
Ranch Hand
Posts: 276
Chrome Java Netbeans IDE
Siju Odeyemi wrote:
I don't know regexp syntax at all....


Regex is no big deal. It's easy, yes. A few tutorials and trying out a few code samples would get you going.
I suggest you try reading this: http://www.regular-expressions.info/tutorial.html
This one is really good and easy to understand.
 
Campbell Ritchie
Sheriff
Pie
Posts: 49379
62
Vinoth Kumar Kannan wrote: . . . Regex is no big deal. It's easy, yes. . . .
. . . and,

I'm from the Government; I'm here to help.
The cheque's in the post.
etc etc
 
James Sabre
Ranch Hand
Posts: 781
Java Netbeans IDE Ubuntu
Henry Wong wrote:
I seriously recommend against using regexes if you don't know how they work (or their syntax). With regex, it is very easy to write code that you don't understand, even with some experience; trying it with no experience at all is sure to leave you with code that you don't understand and that is completely unmaintainable.


++

But if you do know how they work then
 
Joanne Neal
Rancher
Posts: 3742
16
James Sabre wrote:

As Vinoth said. Easy
 
James Sabre
Ranch Hand
Posts: 781
Java Netbeans IDE Ubuntu
Joanne Neal wrote:
James Sabre wrote:

As Vinoth said. Easy


Certainly not difficult, and I would normally write it with comments to make it obvious, something along these lines:


Regexes don't have to be difficult. The biggest problem I see with regex is people trying to write them as one long string. Yes, one can write very complex regexes that are incomprehensible, probably even to their author, but the same applies to any computer language; it just happens to be easier to do with regex.
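
For anyone wondering what "not one long string" can look like in Java, here is a small sketch. It is not the regex from the post above (that one isn't shown); the pattern, the </p> tag it looks for, and the class name CommentedRegex are all invented for illustration. Each piece of the pattern sits on its own line with a comment:

```java
import java.util.regex.Pattern;

class CommentedRegex {
    // Hypothetical pattern: one paragraph ending in an assumed "</p>" tag.
    static final Pattern PARAGRAPH = Pattern.compile(
          "(?s)"          // DOTALL: let '.' match line terminators too
        + ".+?"           // the paragraph body, as short as possible
        + "(?:</p>|\\z)"  // up to the closing tag, or the end of the input
    );
}
```

Plain string concatenation is used here; Pattern.COMMENTS is another option, at the cost of having to escape any literal whitespace and '#' in the pattern.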

If you want to see really incomprehensible syntax then take a look at APL. I spent several years teaching APL and learned to both love and hate the mathematical notation.

Edit: :-( It must be a complex regex, since nobody has pointed out that my regex is actually rubbish, so I have added weight to the arguments of those who are against regex. I can't correct the regex at this time. It's funny, really, since my initial approach would have been to use Pattern with Matcher.find(), and that is easy to code correctly. Using StringTokenizer would follow the same approach as Pattern and Matcher.find(), so it would probably be easier still.
 
Harivittal Atreya Hk
Greenhorn
Posts: 1
Why don't you try solving it with the StringTokenizer class? You can specify the common occurrences at the end of each 1000 chars, as it's a static doc.
 
Henry Wong
author
Marshal
Pie
Posts: 21190
80
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
James Sabre wrote: Edit: :-( It must be a complex regex, since nobody has pointed out that my regex is actually rubbish, so I have added weight to the arguments of those who are against regex. I can't correct the regex at this time.


That's the other thing about regexes: a complex regex is just a mess of characters....

I won't try to fix this, but if you want to, I would first recommend adding matches for the characters in between the paragraph markers. As written, it will only match if the markers are back to back.

Second, you will likely run into the issue that unbounded regexes are not allowed in look-behinds. To fix that, you can't use "*" or "+", which isn't a problem because the maximum match is 1000 characters anyway. You can cap each paragraph match at 1000 characters, which bounds the look-behind at no more than 10,000 characters; anything that long will trigger the other part of the pattern anyway.

Third, there may be some issues with the start and end boundaries.

And at this point, I am sure that I missed something...
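
As a small illustration of the bounded look-behind point only (this is not a repair of the regex discussed above, which isn't shown), here is a sketch that handles just the character cap. The look-behind uses a fixed count instead of "*" or "+", and \G anchors each piece to the end of the previous match. The class and method names are invented, and this relies on how the standard JDK regex engine treats \G inside a look-behind, so treat it as an assumption to verify on your own runtime:

```java
class FixedWidthSplit {
    // Splits text into pieces of at most 'width' characters.
    // (?s) lets '.' match line terminators; {width} keeps the look-behind bounded.
    static String[] split(String text, int width) {
        return text.split("(?s)(?<=\\G.{" + width + "})");
    }
}
```

With a width of 1000 this would cover the "1000 characters at most" half of the requirement; the "10 paragraphs" half still needs the extra work described above.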

Henry
 