
Text processing in Java with regex

 
Alan Smith
Ranch Hand
Posts: 185
Hi,

I just did an interview coding test that required me to read in a line of text from a file and do the following:

- Get the frequency of each character
- Get the sum of all the numbers

The text string looked something like "7ghksh @4ndng754jndv= *&wbd234 Kner75>< wfs093"

A number was any run of consecutive digits, i.e. 7, 4, 754, 234, 75, 093.

I failed to finish the test in time because I got stuck extracting the numbers from the string correctly.

My question is: would this have been possible, and easier, using a regular expression to find sequences of digits? It's one of the things I have never looked at, but after this test I plan on doing so. In the test I was using nested loops to go through the string and try to match each character against a separate array I had containing the digits 0 - 9. It was a lot trickier than I thought it would be! I know some of you think this would be a breeze, but text processing is something I never really came across or learned with Java.

Thanks,
Alan


 
Paul Clapham
Sheriff
Posts: 20758
Possible? Yes, I'm pretty sure it would be possible to extract all of the numeric sequences from a string using a regex.

Easy? Well, it would be easy for somebody who had enough experience with regexes. For me it wouldn't be easy, but there are plenty of people who could tell you the correct expression right away.
 
Ulf Dittmer
Rancher
Posts: 42967
This is a really bad interview question, as it requires you to know and recall some very arcane details. During a normal workday, you'd rarely need those, and could easily google them if you actually did. I'd say either that company interviews badly, or they have a bad engineering culture. Too bad (for them and for you) if it's the former, but lucky you if it's the latter.
 
Alan Smith
Ranch Hand
Posts: 185

It was actually a phone interview with the usual "what's polymorphism?", interfaces, generics, etc. questions, and then they sent me this test to do by email. I had an hour and a half, and I completely botched the number-extraction part trying different things. I came close but not close enough. I felt it was a bad test as well, as I am quite competent in Java but have never used it for this kind of thing. It just took me by surprise. Better luck next time, hopefully.
 
Henry Wong
author
Marshal
Posts: 20893

Debates about the validity of the test aside, to answer the original question: yes, in my opinion, regular expressions are definitely a worthwhile tool to have in your arsenal, and definitely worth getting good at too.

In this case, the pattern for a series of digits is "\\d+", and you could have extracted all six numbers with a single loop.

Henry
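For readers who want to see the single loop spelled out, here is a minimal sketch using java.util.regex. It is an illustration rather than code from the post; the class name DigitRunSum and the method sumNumbers are made up for this example. It assumes each run of digits is short enough to fit in a long (Long.parseLong throws NumberFormatException otherwise).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class, not from the thread.
class DigitRunSum {
    // One or more consecutive digits.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Sums every run of digits found in the text.
    static long sumNumbers(String text) {
        long sum = 0;
        Matcher m = DIGITS.matcher(text);
        while (m.find()) {                     // each find() advances to the next run
            sum += Long.parseLong(m.group());  // "093" parses as 93
        }
        return sum;
    }

    public static void main(String[] args) {
        String line = "7ghksh @4ndng754jndv= *&wbd234 Kner75>< wfs093";
        // 7 + 4 + 754 + 234 + 75 + 93
        System.out.println(sumNumbers(line)); // prints 1167
    }
}
```

The pattern is compiled once and reused, which is the usual choice when the same expression is applied to many lines.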
 
Alan Smith
Ranch Hand
Posts: 185
Henry Wong wrote:
In this case, the pattern for a series of digits is "\\d+", and you could have extracted all six numbers with a single loop.
Henry


That's exactly what I thought while I was doing the test. That's good to hear. I'm off to buy this. Thanks guys.
 
Darryl Burke
Bartender
Posts: 5125
Alan Smith wrote:I was using nested loops to loop through the string and try and match the numbers with a separate array of numbers I had containing 0 - 9.

Regex aside, you need to familiarize yourself with the methods of the Character class. I'm not a programmer, but I wouldn't think it unreasonable for an interviewer to expect you to be aware of the API of all the primitive wrapper classes, and String. Professional developers, please correct me if that's not realistic.

FWIW, this took a lot less than 1½ hours, using a single loop.
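
A single-loop approach along these lines might look like the sketch below. It is not the code from this post: the class name SingleLoopScan and the fields frequencies and sum are invented for illustration, and only the variable name lastNumber is borrowed from the discussion later in the thread. It leans on Character.isDigit and Character.digit, and assumes the sum fits in a long.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a one-pass scan; names are illustrative only.
class SingleLoopScan {
    final Map<Character, Integer> frequencies = new TreeMap<>();
    long sum = 0;

    SingleLoopScan(String text) {
        long lastNumber = 0; // value of the digit run currently being read
        for (char c : text.toCharArray()) {
            frequencies.merge(c, 1, Integer::sum); // count every character
            if (Character.isDigit(c)) {
                // Shift the digits read so far one place left, then append this one.
                // For '0'..'9', Character.digit(c, 10) equals c - '0'.
                lastNumber = lastNumber * 10 + Character.digit(c, 10);
            } else {
                sum += lastNumber; // a non-digit ends the current run (adds 0 if none)
                lastNumber = 0;
            }
        }
        sum += lastNumber; // the text may end in the middle of a digit run
    }

    public static void main(String[] args) {
        SingleLoopScan scan =
                new SingleLoopScan("7ghksh @4ndng754jndv= *&wbd234 Kner75>< wfs093");
        System.out.println(scan.frequencies);
        System.out.println(scan.sum); // prints 1167
    }
}
```

The final addition after the loop is easy to forget: without it, a trailing number such as 093 would never be added.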
 
Winston Gutkowski
Bartender
Posts: 10109
Darryl Burke wrote:FWIW, this took a lot less than 1½ hours, using a single loop...

Weirdly enough, that's pretty much exactly the way I'd have done it too; except I think I'd have used a HashMap<Character, AtomicInteger>.

@Alan: Regexes are very useful, but they're not for everything. Just for starters, it's likely that a regex-based solution would be significantly slower than Darryl's.

Secondly, when you read the book, be sure to digest the Java chapters (if it has them) as well as the standard operators. One in particular to know about is the 'possessive' quantifier, which Java supports (as do a few other regex flavors, such as Perl's).

Winston
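
As a small illustration of the possessive quantifier mentioned above, the sketch below compares the greedy \d+ with the possessive \d++ on the same input. The class and method names are made up for this example.

```java
// Hypothetical demo class; not from the thread.
class PossessiveDemo {
    // Greedy: \d+ first takes every digit, then backtracks so the literal 5 can match.
    static boolean greedyMatches(String s) {
        return s.matches("\\d+5");
    }

    // Possessive: \d++ takes every digit and never gives one back,
    // so nothing is left for the literal 5 and the match fails.
    static boolean possessiveMatches(String s) {
        return s.matches("\\d++5");
    }

    public static void main(String[] args) {
        System.out.println(greedyMatches("12345"));     // prints true
        System.out.println(possessiveMatches("12345")); // prints false
    }
}
```

Because a possessive quantifier never backtracks, it can make a failing match give up sooner, at the cost of rejecting inputs that a greedy quantifier would accept.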
 
Alan Smith
Ranch Hand
Posts: 185

@Darryl, very nice! Answers like yours never hit me straight away; I always overcomplicate things. Guess I'll have to go back to the drawing board with the wrapper classes. I feel like I'm going backwards with programming!

Thanks Winston, I'll have a look. If anything, regex will be good to have under my belt, like another poster said.
 
Pat Farrell
Rancher
Posts: 4678
Winston Gutkowski wrote:except I think I'd have used a HashMap<Character, AtomicInteger>


Why Atomic..., since it's not multithreaded?
 
Mike Simmons
Ranch Hand
Posts: 3028
I imagine it's not the atomic part that's useful here, but just the fact that it's a mutable int-like object. For this use case, it reduces the number of key lookups you do: with an immutable Integer value you have to look up one object and then insert a different object on every update. With a mutable counter it's just one lookup per update, or two for the very first update of a given key.
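
To make the lookup counting concrete, here is a minimal sketch of a frequency map that uses AtomicInteger purely as a mutable counter. The class name CharFrequency and the method count are assumptions made for this example, not code from the thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class; names are illustrative only.
class CharFrequency {
    static Map<Character, AtomicInteger> count(String text) {
        Map<Character, AtomicInteger> counts = new HashMap<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            AtomicInteger counter = counts.get(c); // one lookup per character
            if (counter == null) {
                counter = new AtomicInteger();
                counts.put(c, counter);            // second lookup, first sighting only
            }
            counter.incrementAndGet();             // mutate in place, nothing re-inserted
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("7ghksh @4ndng754jndv= *&wbd234 Kner75>< wfs093"));
    }
}
```

With a Map<Character, Integer>, each update would instead need a get followed by a put of a new Integer; on Java 8 and later, Map.merge hides that behind a single call.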
 
Alan Smith
Ranch Hand
Posts: 185
Darryl Burke wrote:
Alan Smith wrote:I was using nested loops to loop through the string and try and match the numbers with a separate array of numbers I had containing 0 - 9.

Regex aside, you need to familiarize yourself with the methods of the Character class. I'm not a programmer, but I wouldn't think it unreasonable for an interviewer to expect you to be aware of the API of all the primitive wrapper classes, and String. Professional developers, please correct me if that's not realistic.

FWIW, this took a lot less than 1½ hours, using a single loop.


Hi Darryl,

I am looking back over this again. The frequency part is OK, but I am confused about what exactly is happening on this line:



Say we focus on the '754' sequence in the string... if 7 is the lastNumber variable and 4 is the current character in the loop, then that line will look like this:

7 = 7 * 10 + (4 - '0');

This works without the brackets as well, but what does the - '0' actually achieve? I understand the * 10 multiplication is to get 70 + 4, i.e. 74, but why the - '0'? I also know it doesn't work without the - '0', but I can't see why.

Thanks,
Alan
 
Rob Spoor
Sheriff
Posts: 20495
Not (4 - '0') but ('4' - '0'). The issue here is that '0' is not the same as 0.

If you take a look at http://www.asciitable.com/ you will see that the character '4' has ASCII value 52. Likewise, '0' is the same as 48. By subtracting '0' from '4' you are subtracting 48 from 52, yielding 4, the numeric value of the digit character.
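
The arithmetic can be checked with a few lines of Java. This is only an illustrative sketch; the class name DigitValue and the helper appendDigit are invented for the example.

```java
// Hypothetical demo class; not from the thread.
class DigitValue {
    // Appends the digit character c to the number built so far.
    static int appendDigit(int lastNumber, char c) {
        return lastNumber * 10 + (c - '0'); // c - '0' turns '4' (52) into 4
    }

    public static void main(String[] args) {
        char c = '4';
        System.out.println((int) c);             // prints 52, the code of '4'
        System.out.println((int) '0');           // prints 48, the code of '0'
        System.out.println(c - '0');             // prints 4
        System.out.println(appendDigit(7, c));   // prints 74
        System.out.println(7 * 10 + c);          // prints 122: without - '0' the code 52 is added
    }
}
```

This also shows why the brackets are optional: 7 * 10 + c - '0' evaluates left to right as 70 + 52 - 48, which is still 74.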
 
Alan Smith
Ranch Hand
Posts: 185

Ah cool, thanks Rob!
 
Rob Spoor
Sheriff
Posts: 20495
You're welcome.
 