RegularExpression.java

Nicholas Jordan
Ranch Hand

Joined: Sep 17, 2006
Posts: 1282
I'm trying to write the second phase of my program, which reads in a standard one-word-per-line dictionary (built in the first phase of the program) and searches a second file, of unknown length and line length, for matches.

Obviously, this problem domain is well researched in Regular Expressions, but A:\RegularExpression.java is 129,152 characters, 11,152 words, 3,189 lines, which I don't mind tearing into if it will do me some good.

I found this file by going to the java.sun domain and looking for Regular Expressions, then opening Src.zip in my newly downloaded JDK-5

I just want to make sure I am reading the right file as this is quite a deep-well of information just to do first-draft, get-it-sputtering coding.

Java site gives package name of [java.util.regex]
File A:\RegularExpression.java gives:
[com.sun.org.apache.xerces.internal.impl.xpath.regex]

as package name, do I have the right file ?

http://www.docdubya.com/belvedere/statement/Denial.html
[docdubya.com expired on 12/02/2006 and is pending renewal or deletion. ]
"....we can't guarantee a valid email hasn't been tossed, but the alternative is nothing would get done." - Greg Comeau

anybody laughing ?

[ December 09, 2006: Message edited by: Nicholas Jordan ]
[ December 10, 2006: Message edited by: Nicholas Jordan ]

"The differential equations that describe dynamic interactions of power generators are similar to that of the gravitational interplay among celestial bodies, which is chaotic in nature."
marc weber
Sheriff

Joined: Aug 31, 2004
Posts: 11343

Originally posted by Nicholas Jordan:
...this is quite a deep-well of information just to do first-draft, get-it-sputtering coding...

Indeed, it is!

Rather than dissecting API source code (which exists, after all, so that we don't need to concern ourselves with those inner workings "just to do first-draft, get-it sputtering coding"), I would start with this Sun Tutorial - Regular Expressions.

And, of course, refer to the API documentation (especially for the Pattern and Matcher classes).


"We're kind of on the level of crossword puzzle writers... And no one ever goes to them and gives them an award." ~Joe Strummer
sscce.org
Alan Moore
Ranch Hand

Joined: May 06, 2004
Posts: 262
There are at least three complete regex implementations in the JDK, but the one that's meant for us to use is the java.util.regex package. Why are you looking at the source code, anyway? If you want to learn how to use regexes, take a look at this site:

http://www.regular-expressions.info/

And if that leaves you hungry for more, there's The Book:

http://regex.info/
Nicholas Jordan
Ranch Hand

Joined: Sep 17, 2006
Posts: 1282
Originally posted by Alan Moore:
Why are you looking at the source code, anyway?


I would LOVE to discuss this; Duntemann gives a really human to human answer in the preface and introduction to Assembly Language, Step by Step.

When I watched Silence of the Lambs, I thought the guy was portraying a Clown, be though it may a good one.

You live in a different world from the one I do, or you wouldn't ask the question.
Alan Moore
Ranch Hand

Joined: May 06, 2004
Posts: 262
Originally posted by Nicholas Jordan:
You live in a different world from the one I do,


That's a relief.
Nicholas Jordan
Ranch Hand

Joined: Sep 17, 2006
Posts: 1282
Originally posted by Alan Moore:
And if that leaves you hungry for more, there's The Book:


I was working on my second reading of Mastering Regular Expressions last night.

I am in the Lesson: Regular Expressions (The Java™ Tutorials > Essential Classes) tutorial right now.

Program (this phase) is simple, in student's concept.


I am sure this program has been written and studied thousands of times. I am open to suggestions, as the ultimate intended user is not a bench technician and I have to foresee all reasonable failure modes, trapping them to an error log for the sys-admin.


[ December 09, 2006: Message edited by: Nicholas Jordan ]
marc weber
Sheriff

Joined: Aug 31, 2004
Posts: 11343

There are a number of compelling reasons for taking apart source code, but usually not among those is "just to do first-draft, get-it-sputtering coding."

In the long term, if you really want to understand how these classes work (and sometimes don't work), that can be a valuable and worthwhile approach. But in the short term (e.g., under a deadline to write working code today), the meandering scenic route might not be the best plan.

Understand that I'm not advocating undue shortcuts. I'm just pointing out that there are different levels of understanding, and different approaches to meet different goals.

Nicholas Jordan
Ranch Hand

Joined: Sep 17, 2006
Posts: 1282
Originally posted by marc weber:

Indeed, it is!
... of course, refer to the API documentation (especially for the Pattern and Matcher classes).

I did; here is what I got (since last post).
I know this is a large post, but there's an awful lot of masters here - this follow up adheres strictly to the posting guidelines - I have coded my question, compiler says clean build on this code.

[ December 09, 2006: Message edited by: Nicholas Jordan ]
Alan Moore
Ranch Hand

Joined: May 06, 2004
Posts: 262
The java.util.regex package doesn't recognize \< and \> as word-start and word-end boundaries. Use \b to match either the start or end of a word.
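
As a minimal sketch of that advice (the class and method names here are made up for illustration, not taken from the posted code), a dictionary word can be wrapped in \b on both sides so that only whole-word hits count:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds whole-word occurrences of one dictionary word.
class WordBoundaryDemo {

    // Returns the start offset of every whole-word occurrence of word in text.
    static List<Integer> wholeWordOffsets(String word, String text) {
        // Pattern.quote keeps any regex metacharacters in the word literal.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        Matcher m = p.matcher(text);
        List<Integer> offsets = new ArrayList<>();
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        // "concat" and "category" are skipped; "cat" and the "cat" in "cat's" match.
        System.out.println(wholeWordOffsets("cat", "cat concat cat's category")); // prints [0, 11]
    }
}
```

Note that the apostrophe is not a word character, so "cat's" counts as a hit for "cat"; whether that is wanted depends on the application.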
Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24166

That's a lot of characters. You're gonna get carpal tunnel, man.


[Jess in Action][AskingGoodQuestions]
Nicholas Jordan
Ranch Hand

Joined: Sep 17, 2006
Posts: 1282
Originally posted by Alan Moore:
The java.util.regex package doesn't recognize \< and \> as word-start and word-end boundaries. Use \b to match either the start or end of a word.

Stupid question time: Efficiency ? - not for me to examine at this point, just get it working, correct ?

How do you differentiate the beginning of a word from the end of a word, to make sure you really have a word in the buffer?
[ December 09, 2006: Message edited by: Nicholas Jordan ]
Nicholas Jordan
Ranch Hand

Joined: Sep 17, 2006
Posts: 1282
Originally posted by Ernest Friedman-Hill:
That's a lot of characters. You're gonna get carpal tunnel, man.


Use ice, copiously. Take scheduled walk-around breaks.

There is no mercy where professionals are concerned.
Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791
Is RegEx a part of the problem definition or only one possible solution among many? I always like to drag out Ternary Search Trees as a very fast way to look stuff up. You could load the word tree first, then read the text file a character at a time (buffered) and work your way through the tree for every word. I'd be interested to see how speed compares.
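
For anyone who hasn't met the structure, here is a bare-bones sketch of a ternary search tree. It is written from the general description of the data structure, not from any particular library, and the class and field names are my own. Each node holds one character and three links: lower, equal (next character of the word), and higher.

```java
// Minimal ternary search tree for whole-word lookup (illustrative sketch).
class TernarySearchTree {

    private static final class Node {
        final char ch;
        boolean endOfWord;   // true if a stored word ends at this node
        Node lo, eq, hi;     // smaller char, next char of word, larger char
        Node(char ch) { this.ch = ch; }
    }

    private Node root;

    void insert(String word) {
        if (word == null || word.isEmpty()) return;
        root = insert(root, word, 0);
    }

    private Node insert(Node node, String word, int i) {
        char c = word.charAt(i);
        if (node == null) node = new Node(c);
        if (c < node.ch) node.lo = insert(node.lo, word, i);
        else if (c > node.ch) node.hi = insert(node.hi, word, i);
        else if (i < word.length() - 1) node.eq = insert(node.eq, word, i + 1);
        else node.endOfWord = true;
        return node;
    }

    boolean contains(String word) {
        if (word == null || word.isEmpty()) return false;
        Node node = root;
        int i = 0;
        while (node != null) {
            char c = word.charAt(i);
            if (c < node.ch) node = node.lo;
            else if (c > node.ch) node = node.hi;
            else if (i == word.length() - 1) return node.endOfWord;
            else { node = node.eq; i++; }
        }
        return false;
    }

    public static void main(String[] args) {
        TernarySearchTree tree = new TernarySearchTree();
        tree.insert("cat");
        tree.insert("car");
        System.out.println(tree.contains("cat") + " " + tree.contains("ca"));
    }
}
```

Loading the dictionary is one insert per line. The character-at-a-time walk through the text file would follow the same lo/eq/hi steps as contains, restarting at the root whenever a word ends.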


A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi
Alan Moore
Ranch Hand

Joined: May 06, 2004
Posts: 262
Originally posted by Nicholas Jordan:
How do you differentiate the beginning of a word from the end of a word, to make sure you really have a word in the buffer?


or
You have to put something in there to match the actual word or words anyway.
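
One way to tell the two apart, offered as a sketch and not as the patterns originally posted here: \b itself is symmetric, but pairing it with a lookahead or lookbehind for \w pins down which side the word is on. The class and constant names below are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: separates word-start boundaries from word-end boundaries.
class WordEdges {

    // A boundary with a word character to its right is the start of a word.
    static final Pattern WORD_START = Pattern.compile("\\b(?=\\w)");

    // A boundary with a word character to its left is the end of a word.
    static final Pattern WORD_END = Pattern.compile("(?<=\\w)\\b");

    // Collects the offsets at which the (zero-width) pattern matches.
    static List<Integer> positions(Pattern p, String text) {
        List<Integer> result = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(positions(WORD_START, "one two")); // prints [0, 4]
        System.out.println(positions(WORD_END, "one two"));   // prints [3, 7]
    }
}
```

In practice the lookarounds are rarely needed: as soon as the pattern contains the word itself, as in \bword\b, the characters next to each \b already decide which kind of boundary it is.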
Nicholas Jordan
Ranch Hand

Joined: Sep 17, 2006
Posts: 1282
Originally posted by Stan James:
Is RegEx a part of the problem definition or only one possible solution among many? ... I'd be interested to see how speed compares.


One possible solution, for one strictly defined problem domain that is recurrent throughout the application.

Do you want millisecond inner-loop times or overall responsiveness from the shell ?

RegEx is/was/will be first thought of answer - it is widely encountered throughout computer science, therefore will have been tweaked by more computer science workers than the Sargasso Sea has waves.

I recoded the loop this morning using the String class's methods, as an expedient to get farther along in prototyping. JDK 1.2 does not compile regex, so I commented it out. I will not be able to do critical loop timing until I figure out some threading issues.
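
The recoded loop itself is not shown here. For readers following along, a regex-free scan of roughly this kind could look like the sketch below. It is an assumption-laden stand-in: it is written for a current JDK (generics would not compile on 1.2), it assumes the dictionary is already loaded into a Set of lowercase words, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical regex-free scan using only String and Character methods.
class PlainStringScan {

    // Returns the words of line that appear in dictionary, in order of appearance.
    static List<String> matches(String line, Set<String> dictionary) {
        List<String> found = new ArrayList<>();
        int i = 0;
        int n = line.length();
        while (i < n) {
            // Skip anything that is not a letter.
            while (i < n && !Character.isLetter(line.charAt(i))) i++;
            int start = i;
            // A run of letters is one candidate word.
            while (i < n && Character.isLetter(line.charAt(i))) i++;
            if (i > start) {
                String word = line.substring(start, i).toLowerCase();
                if (dictionary.contains(word)) found.add(word);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("cat", "dog"));
        System.out.println(matches("The cat's dog, concat!", dictionary));
    }
}
```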

Tried it this morning, let me know by pm if you want the runlog.

[ December 09, 2006: Message edited by: Nicholas Jordan ]
Nicholas Jordan
Ranch Hand

Joined: Sep 17, 2006
Posts: 1282
Originally posted by Alan Moore:
You have to put something in there to match the actual word or words anyway.

I figured it out, no matter how coded, one has to find a boundary, then lookahead to see if the next char is [a-zA-Z]; and so on.

I assume several approaches to trailing apostrophes, s's in plurals, and so on will come within range of my coding skills after some use.
Alan Moore
Ranch Hand

Joined: May 06, 2004
Posts: 262
I think you're putting too much emphasis on word boundaries. Their main purpose is to make sure that any match you find is a whole word (or words), not a substring of some longer sequence. If you have a regex like this:
...and it matches something, there's no ambiguity about which \b matched where, and you don't need to do any extra lookaheads.

By the way, the regex above is a first cut at a regex to match names that might include some non-word characters.
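
To make the point about unambiguous boundaries concrete, here is a sketch with a hypothetical pattern of my own choosing; it is a stand-in of the same general kind and not the regex discussed above. It accepts names containing an internal apostrophe or hyphen. The leading \b can only be a word start because a letter must follow it, and the trailing \b can only be a word end because a letter must precede it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative pattern for names with internal apostrophes or hyphens.
class NameMatcher {

    // Letters, optionally followed by groups of (' or -) plus more letters.
    static final Pattern NAME =
            Pattern.compile("\\b[A-Za-z]+(?:['-][A-Za-z]+)*\\b");

    // Returns every substring of text matched by NAME, left to right.
    static List<String> findAll(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = NAME.matcher(text);
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(findAll("Ernest Friedman-Hill met O'Brien."));
    }
}
```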

Why do you always use fully-qualified class names? That's a waste of time and space, and it makes the code harder to read. Just import the appropriate classes or packages and use the classes' simple names. But you won't need to import the java.lang package; it's imported by default.
Nicholas Jordan
Ranch Hand

Joined: Sep 17, 2006
Posts: 1282
Exactly what I match, or try to match, is something I need to give deep thought to. I will take your example and make it my first example. With any luck, I will come in tomorrow and tell you what I think it does. This is good for beginners: even if they fail, they give the matter some thought.

As for the fully qualified names, I work in crisis-intervention and blowout-control in multi-million dollar projects. When it is so bad that no one with any sense will take it, I take over.

When things are back to S.N.A.F.U., I hand it back to the people who are trained to do the job.

I have to know where everything is to the bazillionth, within about 30 ms.

It is hard to be effective in a house of mirrors with Clowns all around, I have developed a coding style that uses variable names from outside of computer science because of a diagnostic that I do not understand being issued on my C++ compiler when I use namespaces - it may be irritating, but even then I use variable names chosen for their memorability, and that will not under any reasonable test be in any build file supplied by compilers.

If I have to, I will reduce these for posting. I need the assistance.

My build directory clocks in at well over a quarter of a million bytes, I really have to know every line of code, every statement being exclusionary + concise.

There is no mercy between professionals.
Nicholas Jordan
Ranch Hand

Joined: Sep 17, 2006
Posts: 1282
Originally posted by Stan James:
Ternary Search Trees as a very fast way to look stuff up.


As soon as I saw the concept, I unzipped it. Because it so closely models what I intend to do in the next phase of the program, it is a real home run if you like accolades, I dreaded trying to re-invent this wheel.

If and when something is found, there is a split decision on the basis of some information gleaned elsewhere, such as whether this is an operator with authority or just a casual user who does not want to know. It shouldn't take deep contemplation to come up with some other 'split-decisions', but once branched, we stay on that side of the main trunk, so the tool effectively models my thinking.

I took a really short peek at your page, I am sure you can understand how this will adjuavate coding later in the project.
 