
Text processing in Java with regex

Alan Smith
Ranch Hand

Joined: Oct 19, 2011
Posts: 152

Hi,

I just did an interview coding test that required me to read in a line of text from a file and do the following:

- Get the frequency of each character
- Get the sum of all the numbers

The text string looked something like "7ghksh @4ndng754jndv= *&wbd234 Kner75>< wfs093".

A number was taken to be any run of consecutive digits, i.e. 7, 4, 754, 234, 75, 093.

I failed to finish the test in time because I got stuck extracting the numbers from the string correctly.

My question is: would this have been possible, and easier, using a regular expression to find sequences of digits? It's one of the things I have never looked at, but after this test I plan on doing so. In the test I was using nested loops to go through the string and try to match each character against a separate array I had containing the digits 0 - 9. It was a lot trickier than I thought it would be! I know some of you think this would be a breeze, but text processing is something I never really came across or learned with Java.

Thanks,
Alan


Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541

Possible? Yes, I'm pretty sure it would be possible to extract all of the numeric sequences from a string using a regex.

Easy? Well, it would be easy for somebody who had enough experience with regexes. For me it wouldn't be easy, but there are plenty of people who could tell you the correct expression right away.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41180
This is a really bad interview question, as it requires you to know and recall some very arcane details. During a normal workday, you'd rarely need those, and could easily google them if you actually did. I'd say either that company interviews badly, or they have a bad engineering culture. Too bad (for them and for you) if it's the former, but lucky you if it's the latter.


Alan Smith
Ranch Hand

Joined: Oct 19, 2011
Posts: 152

Ulf Dittmer wrote:This is a really bad interview question, as it requires you to know and recall some very arcane details. During a normal workday, you'd rarely need those, and could easily google them if you actually did. I'd say either that company interviews badly, or they have a bad engineering culture. Too bad (for them and for you) if it's the former, but lucky you if it's the latter.


It was actually a phone interview with the usual questions (what's polymorphism, interfaces, generics, etc.), and then they sent me this test to do by email. I had an hour and a half, and I completely botched the number-extraction part by trying different things. I came close, but not close enough. I felt it was a bad test as well: I am quite competent in Java, but I have never used Java for this kind of thing. It just took me by surprise. Better luck next time, hopefully.
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18550

Debates about the validity of the test aside, to answer the original question: yes, in my opinion, regular expressions are definitely a worthwhile tool to have in your arsenal, and definitely worth getting good at too.

In this case, the pattern for a series of digits is "\\d+", and you could have extracted all six numbers with a single loop.
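
For anyone following along, a minimal sketch of that single loop using java.util.regex is below. The class and method names (RegexSum, sumOfNumbers) are made up for illustration, and the sketch assumes every run of digits is short enough to fit in a long.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSum {
    // One or more consecutive digits.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Hypothetical helper: adds up every run of digits found in the text.
    static long sumOfNumbers(String text) {
        long sum = 0;
        Matcher m = DIGITS.matcher(text);
        while (m.find()) {
            // m.group() is the matched run, e.g. "754"; leading zeros are fine for parseLong.
            sum += Long.parseLong(m.group());
        }
        return sum;
    }

    public static void main(String[] args) {
        String line = "7ghksh @4ndng754jndv= *&wbd234 Kner75>< wfs093";
        System.out.println(sumOfNumbers(line));
    }
}
```

On the sample line from the first post, the loop finds 7, 4, 754, 234, 75 and 093 (parsed as 93), which add up to 1167.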

Henry


Alan Smith
Ranch Hand

Joined: Oct 19, 2011
Posts: 152

Henry Wong wrote:
In this case, the pattern for a series of digits is "\\d+", and you could have extracted all six numbers with a single loop.
Henry


That's exactly what I thought while I was doing the test. That's good to hear. I'm off to buy this. Thanks guys.
Darryl Burke
Bartender

Joined: May 03, 2008
Posts: 4523

Alan Smith wrote:I was using nested loops to loop through the string and try and match the numbers with a separate array of numbers I had containing 0 - 9.

Regex aside, you need to familiarize yourself with the methods of the Character class. I'm not a programmer, but I wouldn't think it unreasonable for an interviewer to expect you to be aware of the API of all the primitive wrapper classes, and String. Professional developers, please correct me if that's not realistic.

FWIW, this took a lot less than 1½ hours, using a single loop.
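
The code posted with this reply is not preserved in this copy of the thread. As a rough sketch only (not the poster's original code), a single-loop solution built on the Character class could look like the following. The class name SingleLoop and its fields are hypothetical, lastNumber is the variable name mentioned later in the thread, and Map.merge assumes Java 8 or later.

```java
import java.util.Map;
import java.util.TreeMap;

class SingleLoop {
    // Character -> number of occurrences, sorted by character for readable output.
    final Map<Character, Integer> frequency = new TreeMap<>();
    long sum = 0;

    static SingleLoop process(String text) {
        SingleLoop result = new SingleLoop();
        long lastNumber = 0; // value of the digit run currently being read
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Count every character, digits included.
            result.frequency.merge(c, 1, Integer::sum);
            if (Character.isDigit(c)) {
                // Shift the digits seen so far one place left and append the new one.
                lastNumber = lastNumber * 10 + Character.digit(c, 10);
            } else {
                // A non-digit ends the current run (adds 0 if there was none).
                result.sum += lastNumber;
                lastNumber = 0;
            }
        }
        result.sum += lastNumber; // a run that reaches the end of the line
        return result;
    }

    public static void main(String[] args) {
        SingleLoop r = process("7ghksh @4ndng754jndv= *&wbd234 Kner75>< wfs093");
        System.out.println(r.frequency);
        System.out.println(r.sum);
    }
}
```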


luck, db
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7553

Darryl Burke wrote:FWIW, this took a lot less than 1½ hours, using a single loop...

Weirdly enough, that's pretty much exactly the way I'd have done it too; except I think I'd have used a HashMap<Character, AtomicInteger>.

@Alan: Regexes are very useful, but they're not for everything. Just for starters, it's likely that a regex-based solution would be significantly slower than Darryl's.

Secondly, when you read the book, be sure to digest the Java chapters (if it has them) as well as the standard operators. One particular one to know about is the 'possessive' operator, which is peculiar to Java (and maybe a few other languages, like perl).
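
To illustrate what the possessive form changes, here is a small sketch (the class and method names are hypothetical). A greedy `\d+` can give back characters it has already matched so the rest of the pattern can succeed; the possessive `\d++` never gives anything back.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // Greedy: \d+ first grabs every digit, then backtracks one so the final 5 can match.
    static boolean greedy(String s) {
        return Pattern.matches("\\d+5", s);
    }

    // Possessive: \d++ grabs every digit and refuses to backtrack,
    // so nothing is left for the final 5.
    static boolean possessive(String s) {
        return Pattern.matches("\\d++5", s);
    }

    public static void main(String[] args) {
        System.out.println(greedy("12345"));     // true
        System.out.println(possessive("12345")); // false
    }
}
```

For the digit-run problem in this thread the two forms find the same matches, since nothing follows `\d+` in the pattern.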

Winston


Alan Smith
Ranch Hand

Joined: Oct 19, 2011
Posts: 152

@Darryl, very nice! Answers like yours never hit me straight away; I always overcomplicate things. Guess I'll have to go back to the drawing board with the wrapper classes. I feel like I'm going backwards with programming!

Thanks Winston, I'll have a look. If nothing else, regex will be good to have under my belt, as another poster said.
Pat Farrell
Rancher

Joined: Aug 11, 2007
Posts: 4646

Winston Gutkowski wrote:except I think I'd have used a HashMap<Character, AtomicInteger>


Why Atomic..., since it's not multithreaded?
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3003
I imagine it's not the atomic part that's useful here, but just the fact that it's a mutable int-like object. For this use case, it reduces the number of key lookups you do: with an immutable Integer you have to look up one object and then insert a different object on every update. This way it's just one lookup per update, or two map operations for the very first update of a given key.
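
A minimal sketch of that counting pattern, assuming nothing beyond the standard library (the class and method names are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

class MutableCounts {
    // Counts how often each character occurs, using AtomicInteger purely as a mutable int holder.
    static Map<Character, AtomicInteger> count(String text) {
        Map<Character, AtomicInteger> counts = new HashMap<>();
        for (char c : text.toCharArray()) {
            AtomicInteger n = counts.get(c); // the only map access for a key already seen
            if (n == null) {
                n = new AtomicInteger();
                counts.put(c, n);            // extra map access only the first time a key appears
            }
            n.incrementAndGet();             // updates the stored object in place, no re-insert
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("banana"));
    }
}
```

With a Map<Character, Integer> the same loop would need a get followed by a put on every character, because Integer values cannot be changed in place.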
Alan Smith
Ranch Hand

Joined: Oct 19, 2011
Posts: 152

Darryl Burke wrote:
Alan Smith wrote:I was using nested loops to loop through the string and try and match the numbers with a separate array of numbers I had containing 0 - 9.

Regex aside, you need to familiarize yourself with the methods of the Character class. I'm not a programmer, but I wouldn't think it unreasonable for an interviewer to expect you to be aware of the API of all the primitive wrapper classes, and String. Professional developers, please correct me if that's not realistic.

FWIW, this took a lot less than 1½ hours, using a single loop.


Hi Darryl,

I am looking back over this again, the frequency part is ok but I am confused about what exactly is happening on this line:



Say we focus on the '745' sequence in the string... if 7 is the lastNumber variable and 4 is the current character in the loop, then that line will look like this:

7 = 7 * 10 + (4 - '0');

This works without the brackets as well, but what does the - '0' actually achieve? I understand the * 10 multiplication is to get 70 + 4, i.e. 74, but why the - '0'? I also know it doesn't work without the - '0', but I can't see why.

Thanks,
Alan
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19656

Not (4 - '0') but ('4' - '0'). The issue here is that '0' is not the same as 0.

If you take a look at http://www.asciitable.com/ you will see that the character '4' has ASCII value 52. Likewise, '0' is the same as 48. By subtracting '0' from '4' you are subtracting 48 from 52, yielding 4, the decimal value of the character.
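
A short sketch that makes the arithmetic visible (the class and method names are hypothetical; lastNumber is the variable name used in the question):

```java
class DigitValue {
    // Char arithmetic: '4' is 52 and '0' is 48, so the difference is the digit's value.
    static int viaSubtraction(char c) {
        return c - '0';
    }

    // The same conversion through the Character API.
    static int viaCharacter(char c) {
        return Character.digit(c, 10);
    }

    // Appends one digit character to the number built so far.
    static int append(int lastNumber, char c) {
        return lastNumber * 10 + (c - '0');
    }

    public static void main(String[] args) {
        System.out.println((int) '4');      // 52
        System.out.println((int) '0');      // 48
        System.out.println('4' - '0');      // 4
        System.out.println(append(7, '4')); // 74
        System.out.println(7 * 10 + '4');   // 122, because '4' counts as 52 without the - '0'
    }
}
```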


Alan Smith
Ranch Hand

Joined: Oct 19, 2011
Posts: 152

Ah cool, thanks Rob!
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19656

You're welcome.
 