Parsing text

Layne Lund
Ranch Hand

Joined: Dec 06, 2001
Posts: 3061
Okay, maybe this isn't an advanced question, but I'm not quite sure where to put it. In fact, my questions don't have to do with Java directly, but more on that later.

I'm working on a project that reads in a decent-sized RTF file and parses the content for information. I already figured out how to use the javax.swing.text package (and related subpackages) to obtain the content as a plain String. Now I'm trying to figure out how to parse this String to obtain the information that I want. Basically, the text contains multiple records that are all in a fairly uniform format:


Let me describe my meta-language before I go any further:
< > encloses a description of the text
[ ] encloses optional text
... means that the pattern can repeat

My first thought is to write a lexer and parser similar to what I learned about in my compilers class. The above description would work fairly well as a grammar, I think. If I need to, I can change it into BNF even. My first question is would it be worth my time to download JavaCC or something similar to help write the lexer and parser? Ultimately, the program I am writing should be able to parse multiple documents with varying formats. Does JavaCC support multiple grammars?

Whether I use a tool like JavaCC or roll my own parser, there are a few complications:

1) In some situations, newlines are significant. This is especially true for addresses. The only way I know to tell where the "City, State ZIP" line starts is by looking for a newline character. Usually a parser ignores whitespace, though, so I'm not sure how to deal with this. I would like to obtain the first two address lines, city, state, and zip separately, but I'm not entirely sure how. The optional second line in the address also complicates things. Does anyone have a suggestion here?

2) Should I view labels like "Debtor Address:" as a single token or as two tokens? I'm not sure which would be best/easiest to implement. It might not matter if I'm using a compiler-compiler, but I would like to know if anyone has suggestions here.
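
For complication 1, here is a rough sketch of one way to keep newlines significant: split the address block into lines first and only then pick the lines apart. The class name AddressBlockSketch, the Address holder, and the assumed "City, ST 12345" shape of the last line are all made up for illustration; the real documents may well differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddressBlockSketch {

    // Assumed shape of the last line: "City, ST 12345" or "City, ST 12345-6789".
    private static final Pattern CITY_STATE_ZIP =
            Pattern.compile("^(.+),\\s*([A-Za-z]{2})\\s+(\\d{5}(?:-\\d{4})?)$");

    /** Hypothetical holder for the parsed pieces; line2 is null when absent. */
    public static final class Address {
        public final String line1, line2, city, state, zip;

        Address(String line1, String line2, String city, String state, String zip) {
            this.line1 = line1;
            this.line2 = line2;
            this.city = city;
            this.state = state;
            this.zip = zip;
        }
    }

    /** Parses a block of two or three non-blank lines into an Address. */
    public static Address parse(String block) {
        // The newline is the delimiter here, so split on it before anything else.
        List<String> lines = new ArrayList<>();
        for (String raw : block.split("\\r?\\n")) {
            String line = raw.trim();
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        if (lines.size() < 2 || lines.size() > 3) {
            throw new IllegalArgumentException("Expected 2 or 3 address lines, got " + lines.size());
        }
        // The last line is always "City, State ZIP" under the assumption above.
        Matcher m = CITY_STATE_ZIP.matcher(lines.get(lines.size() - 1));
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized city/state/zip line");
        }
        // A second address line exists only when there are three lines in total.
        String line2 = (lines.size() == 3) ? lines.get(1) : null;
        return new Address(lines.get(0), line2, m.group(1).trim(), m.group(2), m.group(3));
    }

    public static void main(String[] args) {
        Address a = parse("123 Main St\nSuite 4\nSpringfield, IL 62704");
        System.out.println(a.line1 + " / " + a.line2 + " / " + a.city + " / " + a.state + " / " + a.zip);
    }
}
```

Counting lines from the end is what makes the optional second line easy: the last line is always the city line, the first is always line one, and anything in between is line two.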

I will greatly appreciate any input anyone has.

Thanks,

Layne


Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791
I learned text parsing (and programming) in a line-oriented world so I'd be tempted to read a line at a time, test for the easy tags like "Creditor:" or "Creditor Address:" and rely on position in a sequence for the others like the first few lines. That would probably be lots of code and fragile relative to changes or variations in format. I'll look forward to more sophisticated ideas!


A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12805
I think I would read the whole thing into a String[] - then locate the groups of 1 or more lines that go with a particular marker such as "Debtor Address:" and send them to a method specific to each data type. In other words, separate the tasks of locating the data from interpreting the data.
The specific methods would not have to worry about detecting the end of the data type.
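
A minimal sketch of that separation, assuming a single record and using only the labels mentioned in this thread ("Creditor:", "Creditor Address:", "Debtor Address:"). The class and method names are hypothetical, and the sample text is invented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MarkerBlockSketch {

    // Labels mentioned in the thread; a real document may use others.
    private static final String[] MARKERS = {
        "Creditor:", "Creditor Address:", "Debtor Address:"
    };

    // Returns the marker that the line starts with, or null if there is none.
    private static String markerOf(String line) {
        for (String marker : MARKERS) {
            if (line.startsWith(marker)) {
                return marker;
            }
        }
        return null;
    }

    /** Step 1: locate the lines that belong to each marker (no interpretation yet). */
    public static Map<String, List<String>> locate(String[] lines) {
        Map<String, List<String>> blocks = new LinkedHashMap<>();
        List<String> current = null;
        for (String raw : lines) {
            String line = raw.trim();
            String marker = markerOf(line);
            if (marker != null) {
                // A new marker closes the previous block and opens a new one.
                current = new ArrayList<>();
                blocks.put(marker, current);
                line = line.substring(marker.length()).trim();
            }
            // Lines before the first marker are ignored; blank lines are skipped.
            if (current != null && !line.isEmpty()) {
                current.add(line);
            }
        }
        return blocks;
    }

    /** Step 2: one interpreter per data type; it never looks for the block's end. */
    public static String interpretAddress(List<String> addressLines) {
        return String.join(" | ", addressLines);
    }

    public static void main(String[] args) {
        String content = "1 of 2 DOCUMENTS\nCreditor: Acme Corp\n"
                + "Debtor Address:\n12 Oak Ave\nDenver, CO 80202";
        Map<String, List<String>> blocks = locate(content.split("\\r?\\n"));
        System.out.println(blocks.get("Creditor:"));
        System.out.println(interpretAddress(blocks.get("Debtor Address:")));
    }
}
```

With several records in one file, a repeated marker would overwrite the earlier block in this sketch, so the text would need to be cut into records first.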
Bill
Layne Lund
Ranch Hand

Joined: Dec 06, 2001
Posts: 3061
Originally posted by William Brogden:
I think I would read the whole thing into a String[] - then locate the groups of 1 or more lines that go with a particular marker such as "Debtor Address:" and send them to a method specific to each data type. In other words, separate the tasks of locating the data from interpreting the data.
The specific methods would not have to worry about detecting the end of the data type.
Bill


That's actually very similar to what I've decided to try next. At the moment, I'm putting all the text into a single String rather than using a String[]. Then I'm trying to use regular expressions and the java.util.regex package to locate the data. I plan on writing individual methods to parse the individual blocks of data from there.

But now I'm running into trouble with the regular expressions. I am starting by just trying to match the "<###> OF <###> DOCUMENTS" line that occurs at the beginning of each record. Here is the method that does most of the work:

The readRTFFile() method uses the javax.swing.text.rtf package to read the RTF file and return the contents as a String. I could easily split() this into a String[] if I want to. This method is actually in an abstract base class because there are at least two slightly different file formats. (This might end up being unnecessary because I may be able to deal with the slight differences if I write my regular expression just the right way. But I'll figure that out later after I figure out what's wrong with my current regex.) At the moment, the subclass that I'm writing for testing returns the following regular expression:

The two SOPs before the while loop print out (including the regex as I expect it). However, m.find() must be returning false because the SOP at the beginning of the loop doesn't print. This is where I'm stumped. I also printed the content String and it looks fine. So why am I not getting a match for my regex? Any ideas?

Thanks for your time to read my questions. I sure hope someone has some suggestions that can help me fix this latest problem.

Regards,

Layne
Layne Lund
Ranch Hand

Joined: Dec 06, 2001
Posts: 3061
Okay, that was a silly mistake on my part. After some experimenting, I found that the document contains the word "of" in the first line, but my regex was looking for "OF" instead. Once I fixed that, I was able to incrementally build a regex that matches the whole record!
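
For anyone who hits the same thing, here is a small sketch of the case mismatch. The original method and regex are not reproduced here; the class name, the two patterns, and the sample text are assumptions based on the "<###> OF <###> DOCUMENTS" description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RecordHeaderSketch {

    // Case-sensitive by default: requires an upper-case "OF".
    public static final Pattern STRICT =
            Pattern.compile("(\\d+) OF (\\d+) DOCUMENTS");

    // CASE_INSENSITIVE accepts "of", "OF", "Of", and so on.
    public static final Pattern LENIENT =
            Pattern.compile("(\\d+)\\s+of\\s+(\\d+)\\s+documents", Pattern.CASE_INSENSITIVE);

    /** Collects every header the pattern finds in the content. */
    public static List<String> findHeaders(Pattern pattern, String content) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(content);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Invented sample text with a lower-case "of", as in the real document.
        String content = "1 of 2 DOCUMENTS\nfirst record\n2 of 2 DOCUMENTS\nsecond record\n";
        System.out.println("strict:  " + findHeaders(STRICT, content).size());
        System.out.println("lenient: " + findHeaders(LENIENT, content).size());
    }
}
```

When find() never returns true, testing the pattern against one short line copied from the real content is usually the quickest way to spot this kind of mismatch.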

Thanks for your comments Stan and William. Even though I didn't use them directly, they helped me start thinking about other ways to do it.

Regards,

Layne
Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791
For grins, look at how FitNesse parses Wiki markup. I'll see if I can say this so it makes sense:

I copied this scheme for my Wiki and it works pretty slick. Since each regex finds the smallest possible match, it handles nested tags nicely from the inside out. You're not replacing (tho you could delete text) so this might not work, but it's worth an evening to read FitNesse no matter what.
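
To illustrate the inside-out idea only, here is a sketch that is not FitNesse's code. It assumes a made-up markup with [b]...[/b] and [i]...[/i] tags and a regex whose body may not contain brackets, so each pass can only match the innermost pairs.

```java
import java.util.regex.Pattern;

public class InsideOutSketch {

    // Matches a tag pair whose body contains no brackets, i.e. an innermost pair.
    private static final Pattern INNERMOST =
            Pattern.compile("\\[(b|i)\\]([^\\[\\]]*)\\[/\\1\\]");

    /** Replaces innermost pairs repeatedly until a pass changes nothing. */
    public static String render(String text) {
        String previous;
        String current = text;
        do {
            previous = current;
            // $1 is the tag name, $2 is the bracket-free body.
            current = INNERMOST.matcher(previous).replaceAll("<$1>$2</$1>");
        } while (!current.equals(previous));
        return current;
    }

    public static void main(String[] args) {
        System.out.println(render("[b]bold [i]both[/i][/b]"));
    }
}
```

The first pass rewrites only the inner [i] pair; once it is gone, the outer [b] pair has a bracket-free body and matches on the next pass.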
 