urgent help--- fileformatting

Deepika menon
Greenhorn

Joined: May 07, 2003
Posts: 1
I have a file with the following output
=================================================
ID: SUPER.SHIBU NODE: LOCAL UID: 065 GID: 255 DEFAULTSEC: CCCC
PHONE: DATEFMT: MMDDYYYY DECPOINT: . CASE: M NLSCODE: US
EFFDATE: 06/30/2003 EXPDATE: 06/30/2003 CALENDAR:
LOGINCNT: 4 LLDATE: 01/07/2003 LLTIME: 04:12:02 LLPID: 0,0
LLSOURCE: \NONAME.$ZTN29.#PTVWABB LLNODE: \NONAME
PWDDATE: 01/06/2003 PWDTIME: 11:56:12 PWDUSER: SUPER.SUPER
PWDSOURCE: \NONAME.$ZTN29.#PTVWAB7 PWDPID: 0,1195
PWDNODE: \NONAME PWDEXPDATE: PWDCHANGE: YES
PWDQSIZE: 4 PWDCHGMINDAYS: 0 PWDCHGMAXDAYS: 0 PWDVIOCOUNT: 0
PWDVIODATE: PWDVIOTIME: 00:00:00 PWDVIONODE:
PWDVIOSOURCE:
PWDVIOUSER: VIOLMODE: SYSTEM VIOLACTION: CANUSER
VIOCOUNT: 0000 VIODATE: VIOTIME: 00:00:00
VIOSOURCE: VIOPID: 0,0
VIONODE: VIOUSER: STATUS: ACTIVE
SUSDATE: 05/09/2002 SUSTIME: 14:13:52 SUSLID:
SUSSOURCE: \NONAME.$ZTN29.#PTHGUJT SUSPID: 0,216
SUSNODE: \NONAME SUSUSER: SUPER.SUPER SUSCOUNT:
UPDATEDATE: 04/02/2003 UPDATETIME: 02:28:23 UPDATEPID: 0,254
UPDATESOURCE: \NONAME.$ZTN29.#PTDP1QJ
UPDATENODE: \NONAME UPDATEUSER: SUPER.SUPER
CREATEDATE: 05/09/2002 CREATETIME: 14:13:52 CREATEPID: 0,216
CREATESOURCE: \NONAME.$ZTN29.#PTHGUJT
CREATEUSER: SUPER.SUPER CREATENODE: \NONAME
DEFAULTSVS: $DATA01.SHIBU DATESEPARATOR: -
NAME: SHIBU GOPI
INITDIR:
INITPROGNAME:
INITPROGTYPE: PROGRAM
UPSSTAT: PWDEXPGRACE:
CIPROG: CINAME:
CILIB : CICPU:
CISWAP: CIPRI:
CIPARAM:
================================================
What I want is to get the file formatted as
key = value pairs, for example
ID: SUPER.SHIBU
NODE: LOCAL
UID: 065
and so on........
Please provide a code snippet to do this
karl koch
Ranch Hand

Joined: May 25, 2001
Posts: 388
hi
you can use StringBuffer to break the lines apart, then concat them the way you want and write them back to the file.
k
Bear Bibeault
Author and ninkuma
Marshal

Joined: Jan 10, 2002
Posts: 61420
    

StringBuffer won't help you break anything apart (though it will help with the concatenating). Use StringTokenizer to break the string up into its tokens.
hth,
bear
P.S. Welcome to The Ranch! You'll find a lot of great Java knowledge here. But please refrain from using "urgent" in your posts. Your post is no more or less important than any other member's, and the word only uses up space that could be used to create a meaningful topic subject.
[ May 08, 2003: Message edited by: Bear Bibeault ]

Leslie Chaim
Ranch Hand

Joined: May 22, 2002
Posts: 336
You can simply use the String.split() method with a space as the argument; then you can loop over the returned array, printing two elements at a time.
The problem (or difficulty) with your data is that not every key has a value, and as a result the file is not very well structured.
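A minimal sketch of the split-and-pair idea, assuming each input line is handled on its own. The names `SplitPairs` and `pairUp` are made up for illustration. It lines up while every key has a value, and it shows how a single missing value throws the pairing off:

```java
import java.util.ArrayList;
import java.util.List;

class SplitPairs {
    // Splits on runs of spaces and joins the words two at a time.
    // Hypothetical helper: it assumes keys and values strictly alternate.
    static List<String> pairUp(String line) {
        String[] words = line.trim().split(" +");
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < words.length; i += 2) {
            String value = (i + 1 < words.length) ? words[i + 1] : "";
            pairs.add((words[i] + " " + value).trim());
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Every key has a value here, so the pairing lines up.
        pairUp("ID: SUPER.SHIBU NODE: LOCAL UID: 065").forEach(System.out::println);
        // PHONE has no value, so DATEFMT: is wrongly taken as its value.
        pairUp("PHONE: DATEFMT: MMDDYYYY DECPOINT: .").forEach(System.out::println);
    }
}
```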
And, once again when it comes to text processing Perl is best:

produces this:

ID: SUPER.SHIBU
NODE: LOCAL
UID: 065
GID: 255
DEFAULTSEC: CCCC
PHONE:
DATEFMT: MMDDYYYY
DECPOINT: .
CASE: M
NLSCODE: US
EFFDATE: 06/30/2003
EXPDATE: 06/30/2003
CALENDAR:
LOGINCNT: 4
LLDATE: 01/07/2003
LLTIME: 04:12:02
LLPID: 0,0
LLSOURCE: \NONAME.$ZTN29.#PTVWABB
LLNODE: \NONAME
PWDDATE: 01/06/2003
PWDTIME: 11:56:12
PWDUSER: SUPER.SUPER
PWDSOURCE: \NONAME.$ZTN29.#PTVWAB7
PWDPID: 0,1195
PWDNODE: \NONAME
PWDEXPDATE:
PWDCHANGE: YES
PWDQSIZE: 4
PWDCHGMINDAYS: 0
PWDCHGMAXDAYS: 0
PWDVIOCOUNT: 0
PWDVIODATE:
PWDVIOTIME: 00:00:00
PWDVIONODE:
PWDVIOSOURCE:
PWDVIOUSER:
VIOLMODE: SYSTEM
VIOLACTION: CANUSER
VIOCOUNT: 0000
VIODATE:
VIOTIME: 00:00:00
VIOSOURCE:
VIOPID: 0,0
VIONODE:
VIOUSER:
STATUS: ACTIVE
SUSDATE: 05/09/2002
SUSTIME: 14:13:52
SUSLID:
SUSSOURCE: \NONAME.$ZTN29.#PTHGUJT
SUSPID: 0,216
SUSNODE: \NONAME
SUSUSER: SUPER.SUPER
SUSCOUNT:
UPDATEDATE: 04/02/2003
UPDATETIME: 02:28:23
UPDATEPID: 0,254
UPDATESOURCE: \NONAME.$ZTN29.#PTDP1QJ
UPDATENODE: \NONAME
UPDATEUSER: SUPER.SUPER
CREATEDATE: 05/09/2002
CREATETIME: 14:13:52
CREATEPID: 0,216
CREATESOURCE: \NONAME.$ZTN29.#PTHGUJT
CREATEUSER: SUPER.SUPER
CREATENODE: \NONAME
DEFAULTSVS: $DATA01.SHIBU
DATESEPARATOR: -
NAME: SHIBU
GOPI
INITDIR:
INITPROGNAME:
INITPROGTYPE: PROGRAM
UPSSTAT:
PWDEXPGRACE:
CIPROG:
CINAME:
CILIB
:
CICPU:
CISWAP:
CIPRI:
CIPARAM:

Just to give you an idea.
Cheers,


Normal is in the eye of the beholder
Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791
For some reason, I've done an awful lot of file processing over the years at a "word" level, where a word is string of characters separated by white space. If you take this file a word at a time you might have an algorithm like this:
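A word-at-a-time pass could be sketched roughly as follows. This is not the listing from this post but an illustration under one assumption: any word ending in a colon starts a new key, and every other word belongs to the current value. `WordAtATime` and `reformat` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class WordAtATime {
    // Walks the line one whitespace-separated word at a time.
    // Assumption: a word ending in ':' is a key; anything else is value text.
    static List<String> reformat(String line) {
        List<String> out = new ArrayList<>();
        StringBuilder current = null;
        StringTokenizer words = new StringTokenizer(line);
        while (words.hasMoreTokens()) {
            String word = words.nextToken();
            if (word.endsWith(":")) {
                if (current != null) {
                    out.add(current.toString()); // flush the previous pair
                }
                current = new StringBuilder(word);
            } else if (current != null) {
                current.append(' ').append(word); // extend the current value
            }
            // Words seen before the first key are ignored in this sketch.
        }
        if (current != null) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        reformat("PHONE: DATEFMT: MMDDYYYY DECPOINT: . CASE: M NLSCODE: US")
                .forEach(System.out::println);
        reformat("NAME: SHIBU GOPI").forEach(System.out::println);
    }
}
```

Under that assumption NAME: SHIBU GOPI stays on one line and values such as 04:12:02 are left alone, but a key written with a space before the colon, such as CILIB :, is not recognised.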

The Perl version was shorter, wasn't it! I don't claim to be good at Java RegEx - anybody do it that way?


A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi
Leslie Chaim
Ranch Hand

Joined: May 22, 2002
Posts: 336
For some reason, this text-processing question has me over my head. I was thinking about a regex solution (just for the exercise) but I was preoccupied with work and kept pushing it off. Finally, after many tries, I have some time on a Sunday to post the solution.
The Perl version was shorter, wasn't it!
It sure was! BTW it follows exactly (well, almost) your algorithm. If you wish, I can do a step-by-step illustration.
I don't claim to be good at Java RegEx - anybody do it that way?
Yes, sure! But first let me clarify a small detail, and of course with regex it's all about details. :roll:
Regular expressions are just that: regex! There is no such thing as Java RegEx. You probably meant to say that you would like to see a solution in Java using regex. In fact, the Java API doc for the regex package specifically states the following:
Originally from Sun documentation:

An instance of the Pattern class represents a regular expression that is specified in string form in a syntax similiar to that used by Perl.

Too bad about that typo (did you see it?), but that's not really a problem for the human brain.
Getting back to the topic, since our whole point is regex, I will stick with Perl so we can focus on the regex. Furthermore, I think you have to appreciate the conciseness that Perl offers and you should have no problem converting from Perl to Java. The core regex will be the same, except for some annoying differences such as the LTS problem when having to backslash the backslash and ending each comment with a newline when using Pattern.COMMENTS. If you wish, I can post the Java solution later.

So here we go:
Talking about short, let me begin with something shorter... All you need is this!

Which produces what we ultimately want:

ID: SUPER.SHIBU
NODE: LOCAL
UID: 065
GID: 255
DEFAULTSEC: CCCC
PHONE:
DATEFMT: MMDDYYYY
DECPOINT: .
CASE: M
NLSCODE: US
EFFDATE: 06/30/2003
EXPDATE: 06/30/2003
CALENDAR:
LOGINCNT: 4
LLDATE: 01/07/2003
LLTIME: 04:12:02
LLPID: 0,0




Now, I will attempt to explain this obfuscation, and do so with the spiral approach.
The basic idea behind solving this problem the regex way is to repeatedly apply the regex to each input line, capturing the key/value pairs (with capturing parentheses) in $1 and $2.
With the lazy quantifiers, that would be easy:

The regex (.*?: ?) *([^ ]*) * as well as the overall concept of matching this is pretty simple:
  • First there's the key in .*?:, capture to $1
  • Followed by * (there's one space there) for any leading spaces (of the value)
  • The value is [^ ]* which is any number of non-spaces, capture to $2
  • Followed by * (there's one space there) for any trailing spaces (of the value)

  • Then, the regex is applied across the line (as needed) with the /g modifier (In java that would be in the loop of m.find()).
    As you can see above, there's some confusion with the spaces: it is hard to tell whether a space is part of the regex or not. Since details are important, I will be specific and use the octal escape, which is \040. I could have also used \s, but \040 is more precise so I'll use that. Furthermore, since we are dealing with ':' and '(' and ')', UBB is acting up in some unwanted ways! In fact, if you closely examine the previous regexes you'll notice that what's in bold is a 'space' followed by a '?', which based on our data has no effect at all. This works as a regex and avoids the UBB smartness. I will continue from here and use the /x modifier (which is like Java's Pattern.COMMENTS) so that whitespace will be insignificant to the regex itself.
    Now, there is just one problem (for now) with the lazy quantifiers and that is the case of missing values. Not all keys have associated values and we need to leave them blank (not grabbing the next KEY!). With the previous regex, the next key will be displayed as part of the value:
    The second line:

    PHONE: DATEFMT: MMDDYYYY DECPOINT: . CASE: M NLSCODE: US

    Results in:

    PHONE: DATEFMT:
    MMDDYYYY DECPOINT: .
    CASE: M
    NLSCODE: US
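The missing-value problem described above can be reproduced in Java with the same lazy regex and a `find()` loop. This is only a sketch; the class name `LazyPairs` and the `extract` helper are illustrative names, not part of the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyPairs {
    // The lazy-quantifier regex quoted in the text: key in group 1, value in group 2.
    private static final Pattern PAIR = Pattern.compile("(.*?: ?) *([^ ]*) *");

    static List<String> extract(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) { // plays the role of Perl's /g modifier
            out.add((m.group(1).trim() + " " + m.group(2)).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // PHONE has no value, so the next key is swallowed as its value.
        extract("PHONE: DATEFMT: MMDDYYYY DECPOINT: . CASE: M NLSCODE: US")
                .forEach(System.out::println);
    }
}
```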

    So, we need something more intelligent, with some control within the regex. Perl has a special syntax which allows if|then|else logic within the regex. In our case we just need an anchor within the regex, after we have matched the key, asserting that the following word is not a key.
    You would think that it would be as simple as using the regex we defined for the key and wrap it around with the negative lookahead construct and be done with it:

    Does this work? NO! In fact, there's practically no effect on the output at all! With this regex $1 gobbles up everything until the last colon! I hope someone with qualities of Jeff and the MRE book would come up with the right words explaining why the above fails. But in a nutshell, the problem is with the 'dot' which can match any character including a space.
    Let's modify the part of our regex that matches a key, from .*?: to [^\040:]*:. With the negated character class, greediness is not an issue:

    Using [^\040:]*: as the key, we are more specific about what a key is made of: any sequence of characters that are neither spaces nor colons, followed by a colon.
    Going forward with our lookahead strategy, let's give it a shot:

    And here is this little obfuscation explained:

    Are we done yet? No! There's another problem!!!
    The third line:

    LOGINCNT: 4 LLDATE: 01/07/2003 LLTIME: 04:12:02 LLPID: 0,0

    Results in:

    LOGINCNT: 4
    LLDATE: 01/07/2003
    LLTIME:
    12: 02
    LLPID: 0,0
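A Java sketch that behaves the same way on this line is shown below. The pattern is a reconstruction from the description (negated-class key plus a negative lookahead), not the post's original one-liner, and `NegatedKeyPairs` and `extract` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedKeyPairs {
    // Assumed pattern: key = non-space, non-colon characters plus a colon;
    // the value must not itself look like a key (negative lookahead).
    private static final Pattern PAIR = Pattern.compile(
            "([^\\040:]*:)\\040*(?![^\\040:]*:)([^\\040]*)\\040*");

    static List<String> extract(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            out.add((m.group(1) + " " + m.group(2)).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // The colons inside 04:12:02 are mistaken for key terminators,
        // so LLTIME: comes out empty and "12: 02" appears as a bogus pair.
        extract("LOGINCNT: 4 LLDATE: 01/07/2003 LLTIME: 04:12:02 LLPID: 0,0")
                .forEach(System.out::println);
    }
}
```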

    The value for LLTIME: is empty when it should have been 04:12:02. This can get very annoying, but it's worth the journey.
    The problem is with the key, which we defined as [^\040:]*:, and in our case there are values such as for LLTIME which have colons within the value. Our regex haphazardly grabs the values returning them as keys.
    Before going into the details of what a key should be, let's review the chunks of the overall regex. This way we'll stay in touch with the overall strategy.
    Considering a line of input, key/value pairs will be gleaned with a regex as follows:
  • Get the key; capture to $1
  • Followed by any leading spaces (of the value)
  • Anchoring that the following word is NOT a key
  • Followed by the value; any number of non-spaces, capture to $2
  • Followed by any trailing spaces (of the value) which can really be ignored with the /g modifier


    Maybe there are "other" ways to think about this problem, but I think this is good enough.
    This leaves us with the KEY problem, or more precisely the VALUE problem, since a value can have an embedded colon. After meticulous analysis, I concluded that a value is made up of the following:
  • Any number of non-spaces and/or non-colons or
  • A colon immediately followed by a non-space char and must
  • End with a NON colon followed by
  • A space (or end of line for missing values)


    In short, a value may contain colons but must end in a NON-colon followed by a space (or end of line).
    On the other hand, a KEY is much simpler:
  • Any number of non-spaces and/or non-colons, which must
  • End with a colon, after which you can peek ahead and look for
  • A space (or end of line for missing values)


    As you can imagine, there are many ways you can turn from here to construct the regex. While I was writing this, I considered at least (twenty) seven different approaches. I will only list the simplest and ignore efficiency. After all, the goal was to come up with a regex.
    Let's go with the overall strategy and focus on the key which we will change from [^\040:]*: to [^\040:]*: (?=\040|$) following the above reasoning.
    Once we have the key defined, the rest is up to the strategy, which yields:
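In Java, this strategy might look like the sketch below, using Pattern.COMMENTS so the pieces can be annotated. It is an assumption-based reconstruction from the steps listed earlier, not the post's original listing; `FinalPairs` and `extract` are hypothetical names. Note that each comment inside the pattern has to end with a newline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FinalPairs {
    private static final Pattern PAIR = Pattern.compile(
            "( [^\\040:]* : (?= \\040 | $ ) )    # group 1: a key, its colon followed by a space or end of line\n"
          + "\\040*                              # leading spaces of the value\n"
          + "(?! [^\\040:]* : (?= \\040 | $ ) )  # the next word must not itself be a key\n"
          + "( [^\\040]* )                       # group 2: the value, any run of non-spaces\n"
          + "\\040*                              # trailing spaces of the value\n",
            Pattern.COMMENTS);

    static List<String> extract(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            out.add((m.group(1) + " " + m.group(2)).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Colons inside a value no longer end a key.
        extract("LOGINCNT: 4 LLDATE: 01/07/2003 LLTIME: 04:12:02 LLPID: 0,0")
                .forEach(System.out::println);
        // A key with no value stays on its own line.
        extract("PHONE: DATEFMT: MMDDYYYY DECPOINT: . CASE: M NLSCODE: US")
                .forEach(System.out::println);
    }
}
```

Because this sketch collects matches rather than substituting in place, text that never matches (a second value word such as GOPI in NAME: SHIBU GOPI) is simply dropped.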

    And of course here's the obfuscation explained:

    And here are some notes:
    1) Our final regex without \040 is:

    You can remove the bold 'space' and '?'; again, the regex and UBB are clashing here.
    2) It's interesting to note that within the negative lookahead, I am not using positive lookahead since it's just an assertion and nothing is consumed from the target string. (I hope this made sense to you.)
    3) Oh, I just thought I would show you one of the seven ways that I mentioned above.
    [code ]
    perl -pe 's/([^ :]+(?:[^ :]*)*: ?(?= |$)) *(?!(?:[^ :]+(?:[^ :]*)*: ?(?= |$)))([^\n ]+)? */$1 $2\n/g' data
    [/code ]
    This works as well and uses a technique called unrolling the loop. As I tried to solve this problem and going through the MRE book I applied this technique. Once I started writing this post I realized I would have a rough time to explain this not to mention it's an overkill. I was able to narrow it down to the above.
    4) There are still some problems with the specific data from this post if you examine the very end of the file. However, for that I would put on the classical programmer's hat and proclaim that it's just bad data!
    5) While working on this problem, I found a very nice tool called The Regex Coach; it was very helpful. There are a few cosmetic issues, but in general I can't find enough praise for the idea of such a tool.
    All of this writing would have been impossible if not for my best technical book ever, the aforementioned MRE book. You have seen here the fruits reaped from this book, which really taught me how to think regex, and how to think in general. Jeffrey Friedl is the best! Just read the reviews of his book on Amazon. Unfortunately, he is so obsessed with regex that he just keeps on printing new editions in all kinds of languages. I hope that someone can convince Mr. Friedl to write another technical book!
    I hope you enjoyed the above,
    Cheers,
    Leslie Chaim
    [ September 22, 2003: Message edited by: Leslie Chaim ]
    [ September 23, 2003: Message edited by: Leslie Chaim ]
    [ September 25, 2003: Message edited by: Leslie Chaim ]
    Leslie Chaim
    Ranch Hand

    Joined: May 22, 2002
    Posts: 336
    Originally posted by Stan James:

    The Perl version was shorter, wasn't it! I don't claim to be good at Java RegEx - anybody do it that way?


    Now this was quite a while ago, Stan... I think by now there should have been some feedback.
    Just looking for opinions
    Jim Yingst
    Wanderer
    Sheriff

    Joined: Jan 30, 2000
    Posts: 18671
    Hmmm, I don't think I even saw this earlier. For fun I tried a regex/Java solution without looking at what you had done, to see how similar they might be. Of course with regexes there can be many different strategies depending on what rules you believe the data obey. In this case the data is extremely crappy and the rules are not obvious. Here's what I came up with:

    Or a somewhat deobfuscated form:

    I came across a few points which don't seem to be addressed in Leslie's treatise above. There's one line in the original data:
    NAME: SHIBU GOPI
    From context and the fact that GOPI is not followed by a :, I believe that this is evidence that the value field needs to allow spaces. There are also the lines
    CIPROG: CINAME:
    CILIB : CICPU:
    The problem is the space after CILIB. I believe from context that CIPROG, CINAME, CILIB, and CICPU are all keys, with no values. This seems to mean that there may be space before the colon.
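One way to accommodate both observations is to locate the keys first and treat everything between two keys as the value. The sketch below is not the code from this post; it rests on the assumption that a key is a run of uppercase letters, an optional space, and a colon followed by a space or end of line. `KeyBoundarySplit` and `parseLine` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyBoundarySplit {
    // Assumed key rule: uppercase letters, optional space, colon, then space or end of line.
    private static final Pattern KEY = Pattern.compile("\\b([A-Z]+) ?:(?= |$)");

    static Map<String, String> parseLine(String line) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = KEY.matcher(line);
        String key = null;
        int valueStart = 0;
        while (m.find()) {
            if (key != null) {
                // The value is whatever lies between the previous key and this one.
                pairs.put(key, line.substring(valueStart, m.start()).trim());
            }
            key = m.group(1);
            valueStart = m.end();
        }
        if (key != null) {
            pairs.put(key, line.substring(valueStart).trim());
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("NAME: SHIBU GOPI"));
        System.out.println(parseLine("CILIB : CICPU:"));
        System.out.println(parseLine("LLTIME: 04:12:02 LLPID: 0,0"));
    }
}
```

The uppercase-only key rule fits the sample data in this thread; other data may need a different rule.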
    There is no such thing as Java RegEx
    Well, except when there is. There are different flavors of regex out there; they're not entirely interchangeable. E.g. the java.util.regex package includes possessive quantifiers, which AFAIK are not available yet in Perl. Which of course is why I make a habit of using them wherever possible. Well, that and the fact that they make things easier for both the regex compiler and myself, as they cut down on possible backtracking effects.
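A small illustration of what "cutting down on backtracking" means, using a toy pattern rather than anything from this thread: a possessive quantifier keeps whatever it matched and never gives characters back. The class and method names are made up for the example.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // Greedy: a* first takes every 'a', then backs off one so the final 'a' can match.
    static boolean greedyMatches(String s) {
        return Pattern.matches("a*a", s);
    }

    // Possessive: a*+ takes every 'a' and refuses to back off, so the final 'a' fails.
    static boolean possessiveMatches(String s) {
        return Pattern.matches("a*+a", s);
    }

    public static void main(String[] args) {
        System.out.println(greedyMatches("aaa"));     // true
        System.out.println(possessiveMatches("aaa")); // false
    }
}
```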
    3) Oh, I just thought I would show you one of the seven ways that I mentioned above.
    [code ]
    perl -pe 's/([^ :]+(?:[^ :]*)*: ?(?= |$)) *(?!(?:[^ :]+(?:[^ :]*)*: ?(?= |$)))([^\n ]+)? */$1 $2\n/g' data
    [/code ]

    What do those (?:[^ :]*)* do? Wouldn't [^ :]* by itself do the same thing?
    [ December 09, 2003: Message edited by: Jim Yingst ]

    "I'm not back." - Bill Harding, Twister
    Rene Larsen
    Ranch Hand

    Joined: Oct 12, 2001
    Posts: 1179

    Hi Jim,
    When I try to run your sample code I'm getting the time formatted with line breaks.
    e.g.

    becomes:

    René
    [ December 09, 2003: Message edited by: Rene Larsen ]

    Regards, Rene Larsen
    Jim Yingst
    Wanderer
    Sheriff

    Joined: Jan 30, 2000
    Posts: 18671
    When I try to run your sample code I'm getting the time formatted with line breaks.
    Oops, I seem to have introduced a new error during my last few modifications which went unnoticed. I've fixed the code above; should work now. Thanks for the catch.
    Leslie Chaim
    Ranch Hand

    Joined: May 22, 2002
    Posts: 336
    Hmmm, I don't think I even saw this earlier.
    And I sheepishly thought that you followed every link recursively (if I can only point with UBB to the 19th post )
    depending on what rules you believe the data obey
    Good regex line; I will adapt this one!
    a few points which don't seem to be addressed in Leslie's treatise above.
    The crappiness of the data is addressed in point (4); check it out.
    LC: There is no such thing as Java RegEx
    JYY: Well, except when there is. There are different flavors of regex out there; they're not entirely interchangeable.
    Correct! Unfortunately, using the term flavor can be pretty bitter when it comes to regex. Do you mean flavor by what a given tool or language supports or does the flavor term distinguish between BRE or ERE? What about NFA vs. DFA?
    In my view, regex should be treated as its own science where you simply string up a pattern to match some target data (text). (Warning! *nix specific stuff ahead.) You've got grep and egrep with different regex engines and rules, and what about POSIX regex? Consider parentheses, which grep (and sed, expr) treat as regular characters, matching themselves, while egrep (and awk) treat them as meta-characters. I refer you to Chapter 3 of the MRE book: "Overview of Regular Expression Features and Flavors" for a complete discussion.
    In the case of Java (and some other tools and languages) the regex component is just an external package or library of functions (or routines). Compare this to sed and even Perl, where regex is at the very heart of the tool, with special operator support for handling regexes. Yes, Java's regex package may have some features not available in Perl, but I still proclaim:
    There is no such thing as Java RegEx
    E.g. the java.util.regex package includes possessive quantifiers, which AFAIK are not available yet in Perl.
    Yes! I knew exactly what I was missing. I should have mentioned it. Nevertheless, to add to your delight there is also:
    Class set operations: such as [[a-z]&&[^aeiou]], which is only supported by Sun's regex package and not by Perl (yet).
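As a quick illustration of that intersection syntax, the following sketch checks whether a word consists only of lowercase letters that are not vowels. The class and method names are invented for the example.

```java
import java.util.regex.Pattern;

class ClassIntersectionDemo {
    // [a-z] intersected with "anything except a, e, i, o, u": lowercase consonants only.
    private static final Pattern CONSONANTS = Pattern.compile("[[a-z]&&[^aeiou]]+");

    static boolean onlyConsonants(String word) {
        return CONSONANTS.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(onlyConsonants("rhythm")); // true
        System.out.println(onlyConsonants("rhyme"));  // false: contains 'e'
    }
}
```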
    Which of course is why I make a habit of using them wherever possible.
    Using them? How about overusing them? In this case, the only need for possessiveness is when dealing with values (or sub-values). Once you've matched a key you want to 'keep' the value (if there was any), not allowing backtracking to treat it (the value) as the next key.
    In the spirit of deobfuscation and nitpicking, I hope you'll agree that the following is more precise:

    What do those (?:[^ :]*)* do? Wouldn't [^ :]* by itself do the same thing?
    Yep, it does. I guess you're also finding joy in nitpicking. Gee, I am in good company.
    In any case, thanks Jim, for making this a lively discussion. I surely learned something.
    Cheers,
    Leslie
     