File APIs for Java Developers
Manipulate DOC, XLS, PPT, PDF and many others from your application.
http://aspose.com/file-tools
The moose likes Performance and the fly likes Regex - Java or Perl Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


JavaRanch » Java Forums » Java » Performance
Bookmark "Regex - Java or Perl" Watch "Regex - Java or Perl" New topic
Author

Regex - Java or Perl

bob connolly
Ranch Hand

Joined: Mar 10, 2004
Posts: 204
Hello!

I just ran this JAVA program to validate 75 million records using the REGEX patterns and it took around 3 hours! and i have another 25 files to go to go, so that's 26 x 3 hrs!!! OUCH!

Initially, i used the String split commmand but that produced some unpredictable results, so i just parsed the record into an array, but i think that may have also improved performance somewhat, but i'm now wondering if i should try to create my own REGEX array routines instead of creating objects in order to improve the performance?

Also, the HashSets only contain a max of about 20 records if that's a potential performance issue?

So, at this point, i'm wondering if i should continue to try an improve the Java code somehow or consider writing this in C or Perl, to do the validation, does anyone have any thoughts on this predicament please?

Thanks very much for any insights, references or pointers to other articles or websites!

Beginning Mon Jan 24 16:51:11 EST 2005
ERROR3: Pattern doesnt match HFR_UPB -.01 ([0-9]|(\.|)){1,10}
ERROR3: Pattern doesnt match HFR_UPB -.01 ([0-9]|(\.|)){1,10}
........
Ending Mon Jan 24 19:25:17 EST 2005



[added code tags - Ilja]

[ January 26, 2005: Message edited by: Ilja Preuss ]
[ February 08, 2005: Message edited by: Lasse Koskela ]
Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24187
    
  34

Well, right off the bat I see a number of small inefficiencies (using "new String(ResultSet.getString())" -- the "new String()" is unnecessary, as is the initialization of the variable to a new String() right before that; elsewhere you say "new String("NULL")" which is again just an expensive way to say "NULL") and at least one big one. Compiling a regular expression is a relatively expensive operation; you're expected to compile the expression once, and use it for matching many times. But you've written this program to compiler every expression over and over again, many times, never reusing a compiled pattern twice.

What you should do is compile all the Pattern objects once, and store those in some arrays, then reuse them as needed. You should see a substantial speedup!

Another one: your "getCategory" routine opens a new connection to the database every time it runs, and builds a new Statement. If you opened only a single Connection, and used a single PreparedStatement and just plugged in the changing values using prepareStatement(), again, you'd likely notice an improvement.


[Jess in Action][AskingGoodQuestions]
bob connolly
Ranch Hand

Joined: Mar 10, 2004
Posts: 204
Thanks alot Ernest! great points! i'm going to make those changes immediately and try again!

Sure appreciate your quick response and expert recommendations!

Have a great week Ernest!

bc
bob connolly
Ranch Hand

Joined: Mar 10, 2004
Posts: 204
Hello again Ernest!

I noticed that your home page is the JESS! website and i've been interested in JESS and CLIPS for years!

Hat's off to you for working in that area! i can't imagine what a huge effort that must have been to have worked on that incredibly complex software!

Lately i've been thinking about CLIPS & JESS even more, in particular, what a perfect fit XML may be in defining the FACTS that might be floating through a system and how those FACTS could activate navigational and other types of RULES ect!

Thanks! again Enrest and I'm honored to have conversed with you and will
be taking another look see at your website!

Have a great week Ernest!
David Harkness
Ranch Hand

Joined: Aug 07, 2003
Posts: 1646
Well, Ernest nailed the goodies, but I'll take the leftovers.

I see you're caching the list of categories, so the getCategory() method looks to be called only three times (once for each "cat" field type). You may want to take Ernest's suggestion about using a single Connection and PreparedStatement further and preload the three categories up front and close the Connection.

First a handy shortcut for array initialization, though not really a performance helper. You can apply this to the other arrays and even keep the embedded comments.Now for some low-hanging fruit.sa is created to have NUM_COLS elements, and i stops before NUM_COLS, thus "i>(sa.length-1)" is impossible. As well, you may save garbage collection if you eat up enough memory to kick it in by checking for an empty field while splitting and set the sa[i] to null instead of "". You'd need to change a bunch of code to handle it, so it may be more trouble than it's worth. But you're processing 75 million records. If each one has an empty string, that's roughly 1.2GB of objects (assuming an empty String is 16 bytes, but I haven't looked this up in years) you're no longer creating.

75 million records? Wow. With that many, every little bit will help. Let's micro-optimize chkMinMax().How should chkMinMax() handle a "NULL" string?

I haven't used Java's regexp yet, but I've used plenty of others. These look a little suspect to me; perhaps they cause innefficient Patterns to be created?I read that as "1 to 14 of ( [0-9] or ( / ) )". This simplifies toSimilarly, this oneonly differs from the above by the extra |, but (/|) is "/ or nothing" so the whole thing is the same as above with the difference that you need to allow for 0 in the range since 3 "nothings" is the same as 0 "somethings."As for the three that start withIf I understand prepPT(), you replace the "#&" with "\\." for a literal period (you could have put it in the original String as such, by the way). Like the ones above, these can all be simplified toA "." inside a character class "[]" is always taken as a literal period.

Okay, now we tredge into egregious optimization territory. Ilja, if you read this, it's not my fault!

For the min-max calculations, you'll have to try this to see if it is even faster, but since the range is so small you could get away with regexps here. [See below; I realized later that I misunderstood your code but left this in case you need it elsewhere.] For the first one, ACT_DTE, that goes from 0 to 14, the following would even handle leading zeros if needed (remove "0*" otherwise).The reason i suggest this is to avoid needless parsing. Regexps compile down to a few simple character comparisons.

However, my real motivation is to turn the entire record checking into a single, giant, messy regular expression. Don't split the record into fields at all. It'll take some head-scratching and lots of coffee, but with enough brain power applied you could turn it into a regexp. You said the list of values for each category is about 20 or so, so that can be included as well.

Hmm, as I'm writing this out of order, I just wrote the part about chkMinMax() and now realize those only apply to the length of the string. Oops. Well I'll leave the above in case it's useful elsewhere and readdress it now.

To do string lengths you can still use a regexp. For a string with at least 1 character and no more than 14, useNote also that the fields with regexps do this already so the min-max check is redundant.

So to roll this all together with an example, you would build a final regexp with the following.That gives you an example of all three types of fields. I'll leave the rest to you.

Armed with this mega-regexp, you're processing loop is reduced toThe main problem with this approach is that this only tells you that some field is bad -- not which nor why. However, once a record is found to be bad, you could hand it off to be split and processed field-by-field to find all the errors.

Given so many records to process, your main issue is garbage collection churn from creating so many tiny objects. You can help by giving the JVM more RAM to play with. Also, set the initial heap size to be equal to the max heap size to avoid having it reallocate the heap multiple times to get to the max because clearly you're going to hit the max anyway.

Okay, I'm done for now. That was pretty fun and good practice. Thanks for posting it, and let us know how it turns out!

[ Fixed some typos. ]
[ January 26, 2005: Message edited by: David Harkness ]
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Already some very good suggestions here, especially those about pre-compiling patterns and using a single connection and PreparedStatement.

Regarding other optimizations, I'd be rather wary implementing them before running the code through a profiler and observing where the code actually spends most of its time.


The soul is dyed the color of its thoughts. Think only on those things that are in line with your principles and can bear the light of day. The content of your character is your choice. Day by day, what you do is who you become. Your integrity is your destiny - it is the light that guides your way. - Heraclitus
bob connolly
Ranch Hand

Joined: Mar 10, 2004
Posts: 204
Wow!!!

David! thanks so much for the additional help, that's alot of effort on your part! going to try your suggestions asap!

Just finished a PERL pgm to do the same thing, compare run times for an upcoming meeting, and it took about the same time as the JAVA pgm, 1.5 hours for 75 million records, and with the additional enhancements, it looks like JAVA will quite a bit better! And since we need to process over 1 billion records, every second will count, so as a final test, i'm going to write a C program, just to get a final comparison, but i'm sure that's not going to be an easy task!

And Ilja, thanks for tool reference, didn't know such a tool existed, should be invaluable! looking forward to giving it a try!

Got to say, it's been another one of those amazingly helpfull visits to the Ranch!

Have a great week everyone!
bob connolly
Ranch Hand

Joined: Mar 10, 2004
Posts: 204
Hello again Ernest!

Forgot to mention in the previous post that i made those changes you recommended, moving the compiled pattern to an array, very cool idea! and it lopped off 1 hour off the processing of those 75 million records, making it 1.5 hours from 2.75 hours!

Thanks again Ernest!

Good one!
Bill Nelsen
Greenhorn

Joined: Aug 11, 2004
Posts: 27
You might to also take a look at Jeffrey Friedl's "Mastering Regular Expressions" book. He provides lots of examples for regex optimizations, to limit the number of different combinations that the generated regular expression has to check. Since these regular expressions are being run so many times, it seems that even a small optimization could potentially result in a significant performance increase.
bob connolly
Ranch Hand

Joined: Mar 10, 2004
Posts: 204
Hi David!

Just finished that C program and it ran about 30% longer than the JAVA program, WOW what a surprise!

Could it be that the C REGEX package is not as efficient as the Java REGEX package?

I'm sure there are some inefficiencies in the way i've programmed the C code, but i think it's the pretty much the same way i programmed the Java code, so at least it's an apples to apples approach more or less!?

Anyway here is the C program and i'm going to spend the rest of today looking for anything obvious, but for now, i think it's starting to look like Java is the way to go, not only because it's seems to be on par with C, time wise, but because it's so much easier to maintain and program!

So starting tomorrow and into next week, i'm going to get back to the Java program and make those fine tuning recommendations you've pointed out!

Thanks again David and fellow Rachero's!

#include <stdio.h>
#include <stdlib.h>
#include <regex.h>
#include <strings.h>
#include <errno.h>
#include <time.h>
#define SIZE 256

int main(int argc, char **argv) {

char buffer[SIZE]; time_t curtime; struct tm *loctime;curtime=time(NULL);loctime=localtime(&curtime);fputs (asctime (loctime), stdout);
int c; char *filename=NULL;extern char *optarg; extern int optind, opterr, optopt; int err_no;char *pattern;
if(argc==1) { return(EXIT_SUCCESS); }

while((c=getopt(argc, argv, "f :"))!= EOF) { switch(c) {
case 'f': filename=optarg; break; case 'p': pattern=optarg; break;
case '?': fprintf (stderr, "Unknown option `-%c'.\n", optopt); return EXIT_FAILURE; default: return EXIT_SUCCESS; } }

int c2;int a1=0;int a2=0;char aHS[30][20]={"\0"};FILE *HS; HS=fopen("HSTATUS.in","r");
while((c2=fgetc(HS))!=EOF){if (c2=='\n'){;a1++;a2=0;}else{aHS[a1][a2]=c2;a2++;}}fclose(HS);
char (*pHS)[20]=aHS;

c2=0; a1=0; a2=0;char aHF[30][20]={"\0"};FILE *HF; HF=fopen("HFATYPCD.in","r");
while((c2=fgetc(HF))!=EOF){if (c2=='\n'){;a1++;a2=0;}else{aHF[a1][a2]=c2;a2++;}}fclose(HF);
char (*pHF)[20]=aHF;

c2=0; a1=0; a2=0;char aDQ[30][20]={"\0"};FILE *DQ; DQ=fopen("DQIND.in","r");
while((c2=fgetc(DQ))!=EOF){if (c2=='\n'){;a1++;a2=0;}else{aDQ[a1][a2]=c2;a2++;}}fclose(DQ);
char(*pDQ)[20]=aDQ;

int NUM_COLS=10;

struct parms_grp { int type; char *nam; char *pat; int l; int m; } par[10];
par[0].type='p';par[0].nam="ACT_DTE" ;par[0].pat="([0-9]|(/|)){1,14}"; par[0].l=0;par[0].m=14;
par[1].type='b';par[1].nam="FNMA_LN" ;par[1].pat=" "; par[1].l=1;par[1].m=10;
par[2].type='c';par[2].nam="DQIND" ;par[2].pat="cat"; par[2].l=0;par[2].m=0;
par[3].type='c';par[3].nam="HFATYPCD";par[3].pat="cat"; par[3].l=0;par[3].m=0;
par[4].type='c';par[4].nam="HSTATUS" ;par[4].pat="cat"; par[4].l=0;par[4].m=0;
par[5].type='p';par[5].nam="LPI_DTE" ;par[5].pat="([0-9]|(/|)){1,14}"; par[5].l=0;par[5].m=14;
par[6].type='p';par[6].nam="ACT_UPB" ;par[6].pat="([0-9]|(/|)){1,09}"; par[6].l=1;par[6].m=9;
par[7].type='p';par[7].nam="HFR_UPB" ;par[7].pat="([0-9]|(/|)){1,10}"; par[7].l=0;par[7].m=10;
par[8].type='p';par[8].nam="REMLIFE" ;par[8].pat="([0-9]|(/|)){1,04}"; par[8].l=0;par[8].m=4;
par[9].type='b';par[9].nam="HPOOL_NO";par[9].pat=" "; par[9].l=0;par[9].m=6;

regex_t *regx[10];

int i;for(i=0;i<NUM_COLS;i++) {
if (par[i].type=='p') {
regx[i]=(regex_t *) malloc(sizeof(regex_t)); memset(regx[i],0,sizeof(regex_t));
if((err_no=do_regex(regx[i], par[i].pat))!=EXIT_SUCCESS) { return EXIT_FAILURE; }
}
}
FILE *FH=NULL;if((FH=fopen(filename, "r"))==NULL){fprintf(stderr,"x open file %s;%s\n",filename,strerror(errno));exit(EXIT_FAILURE);}
char line[1024]; int line_no=1;char *cp;

while((cp=fgets(line, 1023, FH))!=NULL) {
size_t i; char field [10][20]; char *ptr = line; int n = 0;
for (i=0; i<sizeof field/sizeof *field; ++i )
{
if ( sscanf(ptr, "%19[^|]%n%*c", field[i], &n) == 1 )
{
if (par[i].type=='p')
{
if (regexec(regx[i],field[i],0,NULL,0)==0)
{
chklen(par[i].l,par[i].m,field[i]);
}
else { errp(i,field[i]); }
}
else
if (par[i].type=='c')
{
if (par[i].nam=="HSTATUS") {chkcat(pHS,field[i]);}else
if (par[i].nam=="HFATYPCD") {chkcat(pHF,field[i]);}else
if (par[i].nam=="DQIND") {chkcat(pDQ,field[i]);}
chklen (par[i].l,par[i].m,field[i]);
}
else
if (par[i].type=='b')
{
chklen(par[i].l,par[i].m,field[i]);
}
else { /*printf("ERROR1: type not found");*/ }

line_no++; ptr += n + 1;
}
else
{
field[i][0] = '\0'; ++ptr;
}
}
}

int i2;for(i2=0;i2<NUM_COLS;i2++)
{
if (par[i].type=='p') { regfree(regx[i2]); free(regx[i2]); }
}

curtime=time(NULL);loctime=localtime(&curtime);fputs (asctime (loctime), stdout);
fclose(FH);return EXIT_SUCCESS;

} /* end of MAIN */

chkcat(char (*ar)[20],char *fld)
{
int i=0;for (i=0;i<30;i++) {
if (strcmp(fld,ar[i])==0) { /*printf("ERROR2 match yes %s",fld);*/}
}
}
int do_regex(regex_t *r, char *p)
{
int err_no=0;
if((err_no=regcomp(r, p,REG_EXTENDED))!=0) {
size_t length; char *buffer; length=regerror(err_no, r, NULL, 0);buffer = malloc(length);regerror(err_no, r, buffer, length);
fprintf(stderr, "%s\n", buffer); free(buffer); regfree(r); return EXIT_FAILURE; }
return EXIT_SUCCESS;
}
chklen(int min,int max,char *s)
{
if (min>0 && max>0)
{
if ((strlen(s)==min|strlen(s)>min) && (strlen(s)<max|strlen(s)==max)) {}
else {/*printf("ERROR3 MinMax: %s %d",s,strlen(s));*/}
}
else
if (min==0 && max>0)
{
if ( (strlen(s)<max | strlen(s)==max) || strlen(s)==0) {}
else {/*printf("ERROR4 MinMax: %s %d",s,strlen(s));*/}
}
}
errp(int i,char *s)
{
/*printf("ERROR5 with PATTERN: %d %s",i,s);*/
}
David Harkness
Ranch Hand

Joined: Aug 07, 2003
Posts: 1646
Excellent! Those are very surprising results indeed. I must confess it's been quite a long time since I've done C programming, and I never did it outside school beyond trivial programs for work. Thus, I'll simply have to say that your C version looks absolutely fantastic!

My first guess as to the speed difference is that the Java version uses buffering for the disk I/O. I don't know if the C version does, but if it doesn't, for the file sizes you're talking about it would have a big impact.

Hmm, in do_regex I see you're freeing the "r" parameter (compiled regex pattern?) but not if it matches. Is this correct? Is it a memory leak for each successful match? Again, my C is very rusty, so I'm probably off base.

I look forward to the results you get. Be forewarned that the suggestions I made will likely make the program highly unmaintainable. Then again, maybe not if you build the regex with lots of comments in the code. But of course, you always do that anyway, right?
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Bob, please use code tags. You can also add them to your already posted code above by using the edit icon at the top right of the post.

Thanks!
bob connolly
Ranch Hand

Joined: Mar 10, 2004
Posts: 204
Hi David thanks for reviewing the C code!

I'll be getting back to the Java program next week, so i'll get back with another update on that!

Also Ilja, i found this list of PROFILERS, is there one in particular that you would recommend for analyzing Java performance, on this list or other?

http://www.manageability.org/blog/stuff/open-source-profilers-for-java/view

Thanks again all!
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Originally posted by bob connolly:
I found this list of PROFILERS, is there one in particular that you would recommend for analyzing Java performance, on this list or other?

http://www.manageability.org/blog/stuff/open-source-profilers-for-java/view


For Java profiling, we are using JProfiler at work, which isn't open source. When I tried the Eclipse Profiler, it didn't work very well, but that was some time ago.

We quite successfully used P6Spy for JDBC profiling, though there might be more comfortable tools out there.
 
GeeCON Prague 2014
 
subject: Regex - Java or Perl