Resume Parsing

Michael Malley
Greenhorn

Joined: Oct 12, 2011
Posts: 20
My company has a website where staff upload resumes (.doc, .docx) and then manually enter data from each resume, such as name, telephone number, address, etc. The site uses PHP and MySQL and is hosted on an Apache server. They want to automate the process. At first I was thinking of parsing the files in PHP on the website itself, but I decided against that. I feel the best way to do this would be to use Java EE with a few EJBs and some object-relational mapping to the database that the website already uses. Therefore, I am here.

My questions range from simple to complex:
- Is it a good idea to use Java EE for this? (I think it's the most powerful way to do it with an Apache server running MySQL, and more robust than PHP.)
- Are there any parsing algorithms that could get me started? I did recursive descent parsing with J2SE back in school, but I think this is a different situation. The part I'm having difficulty with is predicting where information will be, given all the possible labels, titles, and formatting (job history vs. work history vs. professional experience, headed sections vs. bolded sections vs. indented sections, etc.).
- Additionally, the solution I'm envisioning involves a lot of looping and looking up words in an enumeration ("the first word is a name, so let's see if it matches those criteria; if not those criteria, then all the other criteria; and if none of them, then move on"). I feel that would be very inefficient. Can anyone suggest a conceptual approach or algorithm?
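
To make the lookup concern concrete, here is a minimal sketch of one possible approach, not something anyone in this thread has confirmed: normalize each line and look it up in a hash map, so each check is a constant-time lookup on average instead of a loop over every criterion. The class name `HeadingLookup`, the `Section` categories, and the heading list are all hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class HeadingLookup {
    // Hypothetical section categories; a real resume parser would need more.
    public enum Section { EXPERIENCE, EDUCATION, SKILLS, UNKNOWN }

    // Assumed synonym table: several heading spellings map to one category.
    private static final Map<String, Section> HEADINGS = new HashMap<>();
    static {
        HEADINGS.put("job history", Section.EXPERIENCE);
        HEADINGS.put("work history", Section.EXPERIENCE);
        HEADINGS.put("professional experience", Section.EXPERIENCE);
        HEADINGS.put("education", Section.EDUCATION);
        HEADINGS.put("skills", Section.SKILLS);
    }

    // Lowercase the line, replace punctuation with spaces, collapse whitespace,
    // then do a single hash lookup instead of testing each criterion in turn.
    public static Section classify(String line) {
        String key = line.toLowerCase(Locale.ROOT)
                         .replaceAll("[^a-z ]", " ")
                         .replaceAll("\\s+", " ")
                         .trim();
        return HEADINGS.getOrDefault(key, Section.UNKNOWN);
    }

    public static void main(String[] args) {
        System.out.println(classify("Work History:"));
        System.out.println(classify("John Smith"));
    }
}
```

Lines that come back as `UNKNOWN` are simply not recognized headings; deciding whether they are a name, an address, or body text would need separate rules.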

After reviewing my questions it's obvious to me that I have no idea what I'm doing, and a starting point would be much appreciated.

Oh, skill level: I've done a lot of academic work with Java and I'm strong in OOP concepts. I've been developing little programs here and there for my company up until now. I wouldn't say I'm an "expert" but I'm competent.
Shahzad Latif
Greenhorn

Joined: Apr 28, 2011
Posts: 28
Hi Michael,

After reading your question, it looks like you want to process .doc/.docx files in your Java application. If you're planning to do that from scratch, I would say it's going to be very tough and complicated. However, you may want to try a Java-based API for processing Word documents, such as Aspose.Words for Java. It is a commercial product, but you'll be able to process your documents quite easily with this component. You may try it at your end to see if it helps. If you need further assistance with this, please write back.


Developer Evangelist @ Aspose. I love to explore and learn new technologies and help other developers along the way.
Michael Malley
Greenhorn

Joined: Oct 12, 2011
Posts: 20
Thanks Shahzad.

Actually, I've used Aspose.Words for other projects before. The processing isn't really the issue. What I'm really looking for is some sort of algorithm. I would like to hear from anyone who has done something like this before: what type of parsing they used and how they went about doing it efficiently (i.e. did they make up an enumeration of common words to search for, did they use recursion and, if so, how, etc.). So I guess I'm really just looking for an in-depth discussion and some brainstorming partners.
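
As one illustration of the kind of technique being asked about, fields with a recognizable shape (such as the telephone number mentioned in the first post) could be pulled out with regular expressions rather than word lists. This is only a sketch under stated assumptions: the class `ContactFields`, both patterns, and the North American phone format are hypothetical choices, and the email field is an added example that the thread does not mention.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContactFields {
    // Assumed format: North American numbers such as 555-123-4567 or (555) 123-4567.
    private static final Pattern PHONE =
        Pattern.compile("\\(?\\d{3}\\)?[\\s.-]?\\d{3}[\\s.-]?\\d{4}");

    // Deliberately loose email pattern; it does not try to cover every valid address.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Returns the first substring of text that matches the pattern, if any.
    private static Optional<String> firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static Optional<String> findPhone(String text) {
        return firstMatch(PHONE, text);
    }

    public static Optional<String> findEmail(String text) {
        return firstMatch(EMAIL, text);
    }

    public static void main(String[] args) {
        String header = "Jane Doe | Tel: (555) 123-4567 | jane.doe@example.com";
        System.out.println(findPhone(header).orElse("no phone found"));
        System.out.println(findEmail(header).orElse("no email found"));
    }
}
```

Names and street addresses have no such fixed shape, so they would still need a different strategy.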
Shahzad Latif
Greenhorn

Joined: Apr 28, 2011
Posts: 28
Hi Michael,

I get your point. In that case, I think you should post the query in an algorithm-related forum; "Java in General" is not that specific. I wish you good luck in your endeavour.

By the way, I have also tweeted this, so maybe someone good with algorithms will come across it and help you: https://twitter.com/#!/shahzad_latif/status/149861663842115586.
Michael Malley
Greenhorn

Joined: Oct 12, 2011
Posts: 20
Do you know the correct forum? I have looked and don't see any sub-forums about algorithms in the main forums.
Shahzad Latif
Greenhorn

Joined: Apr 28, 2011
Posts: 28
If I search for "algorithm" on this site, I find most of the algorithm-related discussions in the General Computing forum. So I suppose that's the forum where you should discuss this.
Michael Malley
Greenhorn

Joined: Oct 12, 2011
Posts: 20
Thanks.

Rather than re-posting a topic, could a Mod please move this thread to the General Computing forum?
Michael Malley
Greenhorn

Joined: Oct 12, 2011
Posts: 20
As a follow-up question, could I use JavaCC to generate a parser for this project? I know it's not parsing lines of code and expressions, but is there a way I could define a grammar for a resume?
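
Whether JavaCC fits is left open here. As a point of comparison, the sketch below is not JavaCC at all but a hand-written line scanner that treats a resume as the informal grammar `resume := preamble section*` and `section := heading line*`. The class `SectionSplitter`, the heading vocabulary, and the `"preamble"` key are hypothetical and only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class SectionSplitter {
    // Hypothetical heading vocabulary; a real list would be much longer.
    private static final Set<String> HEADINGS = new HashSet<>(Arrays.asList(
        "work history", "job history", "professional experience",
        "education", "skills"));

    // A line counts as a heading if its normalized text is in the vocabulary.
    public static boolean isHeading(String line) {
        String key = line.toLowerCase(Locale.ROOT)
                         .replaceAll("[^a-z ]", " ")
                         .replaceAll("\\s+", " ")
                         .trim();
        return HEADINGS.contains(key);
    }

    // Single pass over the lines: text before the first heading goes under
    // "preamble"; each heading starts a new section that collects later lines.
    public static Map<String, List<String>> split(List<String> lines) {
        Map<String, List<String>> sections = new LinkedHashMap<>();
        String current = "preamble";
        sections.put(current, new ArrayList<>());
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // blank lines carry no content in this sketch
            }
            if (isHeading(line)) {
                current = line;
                sections.putIfAbsent(current, new ArrayList<>());
            } else {
                sections.get(current).add(line);
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        List<String> resume = Arrays.asList(
            "Jane Doe", "555-123-4567", "",
            "Education", "BS, State University",
            "Work History:", "Acme Corp");
        System.out.println(split(resume));
    }
}
```

This only recognizes headings by their text; the bolded or indented sections mentioned in the first post would need formatting information from the document itself.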
Michael Malley
Greenhorn

Joined: Oct 12, 2011
Posts: 20
This was sort of a brainstorming topic. I've since started this project and would like to thank all those who participated.
 