Parsing

Chris Cairns
Ranch Hand

Joined: Jan 31, 2003
Posts: 48
Hi Folks,
I'm hoping that someone could help me out with the following problem. I'm trying to parse the following text.

As you can see, there is a sort of column and row heading. For example, RES/FUEL means reserved fuel. I'm a little clueless as to how to approach this. I just started this job a few days ago (right out of college) and the company has absolutely no documentation on similar solutions to help me out. If I took the time to try to figure out the other developer's solution, it would take me forever because it's too complex for me at this point.
One idea I had is to create a multidimensional array. That way, for example, when I try to extract the value for the reserved fuel attribute, I could just use indices. Another issue is dealing with the heading inside the text document, which is MAX POSS LOAD.
I don't know. Any help would truly be appreciated.
[ April 01, 2003: Message edited by: Jim Yingst ]
Chris Cairns
Ranch Hand

Joined: Jan 31, 2003
Posts: 48
As you can see, the text didn't properly format when I pasted it.
Chris Cairns
Ranch Hand

Joined: Jan 31, 2003
Posts: 48
Actually, it did come out all right. Just make sure your text size is lowered.
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
I added [code] tags to preserve the indentation.
First, you can break this into separate lines by reading it with a BufferedReader, which has a readLine() method. Then, for each line, you need to separate the different fields. It looks like you can identify fields just by counting characters, e.g. the DIST field seems to start in the 12th or 13th column and end in the 15th. You can extract this with String's substring() method. (Read the API for this carefully: the begin index is zero-based and inclusive, and the end index is exclusive.) Then you can convert the data to a non-String format using methods like trim() and Integer.parseInt(), e.g.:
String line = buffReader.readLine();
...
String distStr = line.substring(11, 15);
int dist = Integer.parseInt(distStr.trim());
This approach should work if the column numbers are consistent throughout your input file. If they're not, well, you'll have to study the file structure more to look for consistent patterns which you can use.
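A minimal, self-contained sketch of this fixed-column approach might look like the following. The class name, the sample line, and the choice to return -1 for a short or blank field are assumptions made for illustration; only the substring(11, 15) range comes from the snippet above.

```java
final class FixedWidthSketch {
    // Column range of the DIST field (zero-based, end exclusive),
    // taken from the substring(11, 15) call in the post.
    static final int DIST_START = 11;
    static final int DIST_END = 15;

    /**
     * Returns the DIST value of one line, or -1 if the line is too short
     * or the field is blank. The -1 sentinel is an assumption of this sketch.
     */
    static int parseDist(String line) {
        if (line == null || line.length() < DIST_END) {
            return -1;
        }
        String field = line.substring(DIST_START, DIST_END).trim();
        if (field.isEmpty()) {
            return -1;
        }
        // Throws NumberFormatException if the columns hold non-digits.
        return Integer.parseInt(field);
    }

    public static void main(String[] args) {
        // Hypothetical line: a 9-character label, 3 spaces, then "670".
        String line = "TRIP-SAEZ   670   03500";
        System.out.println(parseDist(line));
    }
}
```

Keeping the column positions in named constants means a change in the layout only has to be corrected in one place.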


"I'm not back." - Bill Harding, Twister
Jason Menard
Sheriff

Joined: Nov 09, 2000
Posts: 6450
Let's begin by defining the problem set. What are the different fields each line contains, and what are the possible permutations thereof? Are there any field delimiters? It looks to me like it's just spacing things out, but it's still something to think about. Can we guarantee that each field occupies a certain range of character positions in all cases? For example, can we guarantee that the first field will always be found in the first nine characters?
Chris Cairns
Ranch Hand

Joined: Jan 31, 2003
Posts: 48
I like your approach, Jason. To answer your first question, neither I nor anyone else at my company actually knows what the possible permutations are. We're using this particular text file as an example of a possible text that would be sent to us from an external data provider. When we begin to test the classes "live", we'll have to adapt the code to any permutations. (A shitty approach if you ask me, but not much I can do.) The delimiters, as Jim suggested, are the spaces in between. Yes, at this point, I have to assume each field occupies an exact range.
[ April 01, 2003: Message edited by: Chris Cairns ]
Chris Cairns
Ranch Hand

Joined: Jan 31, 2003
Posts: 48
Jim,
I think your approach would be good, but I need to treat the text file as a String. For example, here is a larger portion of the text file that's being parsed.

Ignore most of that. Point I'm trying to get at is I'm breaking the text into blocks. Then parsing the blocks. So the block is actually a String, so I have to operate on that. The block I have pasted in my previous post is the one I have to work on.
[ April 01, 2003: Message edited by: Chris Cairns ]
Jason Menard
Sheriff

Joined: Nov 09, 2000
Posts: 6450
Just so I'm sure I understand you, are you saying you have that block as one String? So it would roughly be the equivalent of this?

[ April 01, 2003: Message edited by: Jason Menard ]
Chris Cairns
Ranch Hand

Joined: Jan 31, 2003
Posts: 48
Exactly, Jason.
Chris Cairns
Ranch Hand

Joined: Jan 31, 2003
Posts: 48
I'm thinking that one way to do this is to assume that the fields hold a fixed position. However, let's just say that the distance for TRIP-SAEZ under DIST is 5,000 instead of 670. I could grab a substring starting a few character positions earlier, then just trim it. That way, if one of the fields does change, it would be compensated for.
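A rough sketch of that widened-window idea follows. The class name, the slack parameter, and the sample lines (including the five-digit value 15000) are hypothetical. The sketch also assumes the slack never reaches back into the previous field.

```java
final class WidenedWindowSketch {
    /**
     * Parses an integer field that nominally occupies [start, end) but may
     * grow to the left by up to `slack` characters.
     */
    static int parseField(String line, int start, int end, int slack) {
        int from = Math.max(0, start - slack);
        int to = Math.min(line.length(), end);
        if (from >= to) {
            throw new IllegalArgumentException("line too short for field");
        }
        // trim() discards the padding picked up by the wider window.
        return Integer.parseInt(line.substring(from, to).trim());
    }

    public static void main(String[] args) {
        String narrow = "TRIP-SAEZ   670";   // value fits the nominal columns
        String wide   = "TRIP-SAEZ 15000";   // value spills one column to the left
        System.out.println(parseField(narrow, 11, 15, 2));
        System.out.println(parseField(wide, 11, 15, 2));
        // With no slack the wide value is silently truncated:
        System.out.println(parseField(wide, 11, 15, 0));
    }
}
```

The last call shows the risk of a strict window: a value that outgrows its columns still parses, just to the wrong number.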
Jason Menard
Sheriff

Joined: Nov 09, 2000
Posts: 6450
I'm thinking regexes. See if this code points you in the right direction. I might have a couple of different patterns to handle different permutations if it gets too messy with one regex, but hopefully you get the idea.

I only worked up to the first three groups, but hopefully it will get you started.
Jason Menard
Sheriff

Joined: Nov 09, 2000
Posts: 6450
Let me break the regex down a little better in case you are unfamiliar with them:
1: ^
The beginning of the String...
2: ([A-Z-]{1,9})
...followed by 1-9 characters which may each be either capital A through Z, or the '-' character. Capture this sequence to group 1.
3: \\s{2,8}
Followed by 2 - 8 whitespace characters...
4: (\\d{1,4})?
...followed by zero or one occurrences of a 1-4 digit sequence, which is captured to group 2.
5: \\s{3,11}
Followed by 3 - 11 whitespace characters...
6: (\\d{1,5})
followed by a 1-5 digit sequence which is captured to group 3.
7: .*
Followed by zero or more occurrences of any other character.
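Putting the seven pieces above together, a sketch of the assembled pattern in use could look like this. The backslashes are doubled because the regex sits inside a Java string literal. The class name, the String[] return shape, and the sample lines are assumptions for illustration, not the code originally posted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexLineSketch {
    // Pieces 1-7 from the breakdown, concatenated in order.
    static final Pattern LINE = Pattern.compile(
            "^([A-Z-]{1,9})\\s{2,8}(\\d{1,4})?\\s{3,11}(\\d{1,5}).*");

    /**
     * Returns {group 1, group 2, group 3}, or null if the line does not match.
     * Group 2 is optional, so its slot is null when that field is absent.
     */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Hypothetical sample lines.
        String[] full = parse("TRIP-SAEZ   670     03500 X");
        System.out.println(full[0] + " | " + full[1] + " | " + full[2]);

        String[] partial = parse("TAXI          00500");
        System.out.println(partial[0] + " | " + partial[1] + " | " + partial[2]);
    }
}
```

Note that a label such as RES/FUEL would not match as written, because '/' is not in the [A-Z-] character class. The class would need extending for labels like that.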
HTH
[ April 01, 2003: Message edited by: Jason Menard ]
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
The fact that all your input is in one big string doesn't prevent you from using a BufferedReader. You can construct it like this:

You could also use regular expressions for this (as Jason is doing) - if you're familiar with them and/or have time to learn, they're extremely powerful and flexible. But I'm pretty confident you can do this with BufferedReader and StringReader too, if you can understand the format properly.
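As a minimal sketch of the BufferedReader and StringReader combination, the following splits a block held in a single String into lines. The class and method names are hypothetical, and wrapping IOException in UncheckedIOException is a convenience choice of this sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class StringLinesSketch {
    /** Splits a block of text into lines using readLine(). */
    static List<String> lines(String block) {
        List<String> result = new ArrayList<>();
        // StringReader lets a Reader-based API consume an in-memory String.
        try (BufferedReader reader = new BufferedReader(new StringReader(block))) {
            String line;
            while ((line = reader.readLine()) != null) {
                result.add(line);
            }
        } catch (IOException e) {
            // Not expected for an in-memory reader, but readLine() declares it.
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // readLine() treats \n, \r, and \r\n all as line terminators.
        System.out.println(lines("first\nsecond\r\nthird"));
    }
}
```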
From your longer file example, it seems as if the biggest problem is not figuring out how to parse the individual lines that have the data you want, but rather, how do you parse just those lines, ignoring the other stuff (which I assume for now that you don't need)? Based on what you've said so far, I might suggest: read lines until you find one that says

That indicates the start of data, as far as you're concerned. Now read each subsequent line and try to parse it. If it's null or blank, that indicates the end of the table of useful data, so you can stop reading.
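The skip-to-header, read-until-blank loop described above could be sketched as follows. The header text is passed in as a parameter because the actual header line is not reproduced here. The class name, method name, and sample text are likewise hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class TableBlockSketch {
    /**
     * Returns the lines that follow the first line containing headerMarker,
     * up to (not including) the first blank line or the end of the input.
     */
    static List<String> dataLines(String text, String headerMarker) {
        List<String> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line = reader.readLine();
            // Skip everything before the header line.
            while (line != null && !line.contains(headerMarker)) {
                line = reader.readLine();
            }
            if (line == null) {
                return rows; // header never found
            }
            // Collect rows until a null or blank line ends the table.
            while ((line = reader.readLine()) != null && !line.trim().isEmpty()) {
                rows.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        String text = "preamble\nHYPOTHETICAL HEADER\nrow one\nrow two\n\ntrailer";
        System.out.println(dataLines(text, "HYPOTHETICAL HEADER"));
    }
}
```

Each returned row can then be handed to whichever per-line parser is chosen, fixed-column or regex.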
Of course, if the other parts of the file also contain useful data that you need to understand, but in a different format, then you will have to study the format more to decide how to approach it.
Good luck...
[ April 01, 2003: Message edited by: Jim Yingst ]
Jason Menard
Sheriff

Joined: Nov 09, 2000
Posts: 6450
Even if you are using regexes, my preference would be to do as Jim suggested and read in one line at a time.
Chris Cairns
Ranch Hand

Joined: Jan 31, 2003
Posts: 48
Okay, thanks you guys. I really appreciate it. I have a lot to learn!
William Barnes
Ranch Hand

Joined: Mar 16, 2001
Posts: 984

I say do it in perl!


Please ignore post, I have no idea what I am talking about.
 