
Regex - Delimiter question

 
Nakataa Kokuyo
Ranch Hand
Posts: 191
Good day,

I need a delimiter to parse the records in a file. Can anyone guide me? Thanks in advance!

The whole file has the format below:

The file contains many records like the one below:


Result should be :

I'm trying the code below, but I'm not sure how to skip the tab split when the tab falls between "()".
 
Winston Gutkowski
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu

Nakataa Kokuyo wrote: I'm trying the code below, but I'm not sure how to skip the tab split when the tab falls between "()".


I don't understand what that last sentence means. Could you provide a precise example of each situation?

Thanks.

Winston

BTW: I'm pretty sure you don't have to escape '|' when it's inside square brackets; but you do need to escape the TAB, so
Pattern.compile("[|\\t]")
would be what you want for "'|' or TAB".
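A minimal sketch of that character class in use. The records are hypothetical stand-ins, and the class and method names are invented for illustration. In a Java string literal, a plain "\t" inside the brackets also matches a TAB, so both spellings behave the same here.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PipeOrTabSplit {
    // Character class matching either a pipe or a TAB; '|' needs no escape inside [].
    static final Pattern DELIMITER = Pattern.compile("[|\\t]");

    static String[] split(String line) {
        return DELIMITER.split(line);
    }

    public static void main(String[] args) {
        // Hypothetical record: key, text, TAB, number.
        System.out.println(Arrays.toString(split("apple|fruit\t42")));
        // prints [apple, fruit, 42]

        // A TAB inside the parentheses is treated as a delimiter too,
        // which is the problem described in the first post.
        System.out.println(Arrays.toString(split("apple|fruit (red\tsweet)\t42")));
        // prints [apple, fruit (red, sweet), 42]
    }
}
```

The second call yields four pieces instead of three, because the character class cannot tell a TAB inside "()" from one outside.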
 
Nakataa Kokuyo
Ranch Hand
Posts: 191
Hey Winston,

Maybe it is easier to explain with the result. Please read the first post, which I have updated with the expected result.

Sorry for the poor explanation :(
 
Nakataa Kokuyo
Ranch Hand
Posts: 191
It should not split on a tab when the words are surrounded by ().
 
Winston Gutkowski
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu

Nakataa Kokuyo wrote: Maybe it is easier to explain with the result. Please read the first post, which I have updated with the expected result.


Better, but I'm still a bit mystified.

In both cases, is the TAB immediately before the number? Or are you saying that if there are brackets, the delimiter will be a TAB rather than a space?

Winston
 
Winston Gutkowski
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu

Nakataa Kokuyo wrote: It should not split on a tab when the words are surrounded by ().


It seems an odd way to do things. Why not just always use a TAB? That way, you don't have to worry whether the brackets are there or not. It's the standard method used for many Unix delimited files.

Winston
 
Nakataa Kokuyo
Ranch Hand
Posts: 191
From the input:


If I use just TAB, my values will be:




What I need is:

 
Winston Gutkowski
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu

Nakataa Kokuyo wrote: From the input, if I use just TAB, my values will be


Not if you do it properly. In such a situation, I would make the input where '{TAB}' denotes the '\t' character.

And THEN, your split regex will do exactly what you want.

Winston
 
Stephan van Hulst
Saloon Keeper
Posts: 15510
363
You shouldn't be using delimiters here.

Make a regex that describes the entire record, and use capturing groups to get individual parts of the record. Then use the findWithinHorizon() method to read all the records.
I have not compiled this, so it might be completely off. The point is that the pattern describes the records and then finds them within the file, regardless of delimiters. It consists of three capturing groups: the text before the pipe, the text after the pipe, and the final number. The text before and after the pipe may consist of any number of x characters, where x is any character except whitespace or pipes, but including tabs. The three groups may be separated by any amount of whitespace.
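A minimal sketch of the record-matching approach, using Scanner.findWithinHorizon() and Scanner.match() from java.util. The pattern here is an illustrative assumption and not the regex from the post above: it treats a record as key, pipe, free text that may contain TABs, a final TAB, then digits. The sample records and the names RecordScanner and readRecords are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

class RecordScanner {
    // Assumed record shape: <key>|<text, may contain TABs><TAB><number>
    // Group 1: text before the pipe; group 2: text after it; group 3: final number.
    static final Pattern RECORD =
            Pattern.compile("([^|\\r\\n]+)\\|([^\\r\\n]*)\\t(\\d+)");

    static List<String[]> readRecords(String input) {
        List<String[]> records = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            // A horizon of 0 means "search without bound".
            while (scanner.findWithinHorizon(RECORD, 0) != null) {
                MatchResult m = scanner.match();
                records.add(new String[] { m.group(1), m.group(2), m.group(3) });
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical input with a TAB inside the parentheses of the first record.
        String input = "apple|fruit (red\tsweet)\t42\nkiwi|fruit\t7\n";
        for (String[] r : readRecords(input)) {
            System.out.println(String.join(" / ", r));
        }
    }
}
```

Because group 2 is greedy and the number must follow a TAB, the regex backtracks to the last TAB on the line, so a TAB inside "()" stays part of the text.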
 
Nakataa Kokuyo
Ranch Hand
Posts: 191
Hi Stephan,

I'm confused by the regex you used in the sample below. Could you explain what the code below is trying to achieve?


 
Winston Gutkowski
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu

Nakataa Kokuyo wrote: I'm confused by the regex you used in the sample below. Could you explain what the code below is trying to achieve?


Basically, it's a regex that contains a duplicate substring, which he defined separately. I suggest you look at the String.format() method documentation for more details.
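To illustrate the idea of defining a repeated piece of a regex once and inserting it with String.format(), here is a small sketch. The fragment, the overall pattern, and the names ComposedPattern and parse are assumptions made for this example, not the code discussed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ComposedPattern {
    // Reusable fragment (assumed field shape): text without pipes or line breaks.
    static final String FIELD = "[^|\\r\\n]+";

    // %1$s refers to the first argument, so FIELD is inserted twice.
    // Resulting regex: ([^|\r\n]+)\|([^|\r\n]+)\t(\d+)
    static final Pattern RECORD =
            Pattern.compile(String.format("(%1$s)\\|(%1$s)\\t(\\d+)", FIELD));

    // Returns the three captured parts, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = RECORD.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Hypothetical record with a TAB inside the parentheses.
        String[] parts = parse("apple|fruit (red\tsweet)\t42");
        System.out.println(parts[0] + " / " + parts[1] + " / " + parts[2]);
    }
}
```

Keeping the fragment in one constant means a change to the field definition only has to be made in one place.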

I guess my question is: why is your input so confusing? Why not just TAB-delimit (or pipe-delimit) the whole darn thing?
It looks suspiciously to me like there might be "layers" to this input, in which case regexes may not be the best solution anyway; but if not, and it's just columnar data, pick ONE delimiter and just use it everywhere.

Winston
 
Nakataa Kokuyo
Ranch Hand
Posts: 191
Thanks Winston for the reply!

The input file was generated by a Hadoop application in the format I mentioned, and I was wondering whether I could use a single delimiter to handle the whole format...

To help me understand better: are you suggesting that I use a tab as the delimiter and then split the remaining part? It would probably look like the following steps:

My input:


1. Split on the tab delimiter, with the result:


2. Split again on the "|" delimiter, with the result:


3. Then merge to get the final result:
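The three steps above can be sketched roughly as follows. This is one possible reading of them: the last tab-separated piece is taken as the number, the earlier pieces are merged back together, and the merged text is split once on the pipe. The record and the names TwoStepSplit and parse are hypothetical, and the sketch assumes every line contains at least one TAB and one pipe.

```java
import java.util.Arrays;

class TwoStepSplit {
    static String[] parse(String line) {
        // Step 1: split on TAB; the last piece is assumed to be the number.
        String[] byTab = line.split("\t");
        String number = byTab[byTab.length - 1];

        // Step 3 (merge): rejoin the earlier pieces so a TAB inside "()" is restored.
        String text = String.join("\t", Arrays.copyOf(byTab, byTab.length - 1));

        // Step 2: split on the first pipe only (limit 2).
        String[] byPipe = text.split("\\|", 2);
        return new String[] { byPipe[0], byPipe[1], number };
    }

    public static void main(String[] args) {
        // Hypothetical record with a TAB inside the parentheses.
        System.out.println(Arrays.toString(parse("apple|fruit (red\tsweet)\t42")));
    }
}
```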

 
Nakataa Kokuyo
Ranch Hand
Posts: 191
I think the above should work, but I'm afraid there is a price to pay in performance, as the input text file contains many records (200k).
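One way to check the performance concern rather than guess is to time a parse over 200,000 synthetic records. The sketch below uses a swapped-in technique, indexOf() and lastIndexOf() instead of regex splitting, and generates hypothetical records; the names ParseTiming, sample, and sumNumbers are invented here. A single run like this gives only a rough impression, not a proper benchmark.

```java
import java.util.ArrayList;
import java.util.List;

class ParseTiming {
    // Builds n hypothetical records of the assumed shape <key>|<text><TAB><number>.
    static List<String> sample(int n) {
        List<String> lines = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            lines.add("key" + i + "|text (a\tb)\t" + i);
        }
        return lines;
    }

    // Parses every line without regexes and returns the sum of the trailing numbers.
    static long sumNumbers(List<String> lines) {
        long sum = 0;
        for (String line : lines) {
            int firstPipe = line.indexOf('|');
            int lastTab = line.lastIndexOf('\t');
            String key = line.substring(0, firstPipe);
            String text = line.substring(firstPipe + 1, lastTab);
            String number = line.substring(lastTab + 1);
            if (!key.isEmpty() && !text.isEmpty()) {
                sum += Long.parseLong(number);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> lines = sample(200_000);
        long start = System.nanoTime();
        long sum = sumNumbers(lines);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("sum=" + sum + ", elapsed ms=" + millis);
    }
}
```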
 