| Author |
Resume Parsing
|
Michael Malley
Greenhorn
Joined: Oct 12, 2011
Posts: 19
|
|
So my company has a website that they use to upload resumes (.doc, .docx) and manually enter data from each resume, such as name, telephone number, address, etc. The site uses PHP and MySQL and is hosted on an Apache server. They want to automate the process. At first I was thinking of writing some PHP and parsing the file on the website, but I decided against that. I feel the best way to do this would be to use Java EE with a few EJBs and some object-relational mapping to the database that the website already uses. Therefore, I am here.
My questions range from simple to complex:
- Is it a good idea to use Java EE for this? (I think it's the most powerful way to do it with an Apache server running MySQL, and more robust than PHP.)
- Are there some parsing algorithms that someone could start me out with? I've done recursive descent parsing with J2SE back in school, but I think this is a different situation. Obviously the part I'm having difficulty with is predicting where information will be when there are so many possibilities for labels, titles, and formatting (job history versus work history versus professional experience, headed sections versus bolded sections versus indented sections, etc.).
- Additionally, the solution I'm envisioning will involve a lot of looping and looking up words in an enumeration ("first word is a name, so let's see if it matches those criteria; if not those criteria, then all other criteria; and if not them, then move on"). I feel that would be very inefficient. Are there any conceptual algorithms anyone could lend me?
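One common way to avoid looping over every criterion for every word is to normalize each line once and look it up in a hash map of known heading variants. The following is only a minimal sketch of that idea: the class name SectionHeadings, the synonym lists, and the canonical labels are all made up for illustration, and a real table would be much larger.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: maps the many spellings of a section heading to one canonical label. */
final class SectionHeadings {
    // Illustrative synonym table (an assumption, not taken from any real parser).
    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        for (String s : new String[] {"job history", "work history", "professional experience"}) {
            CANONICAL.put(s, "EXPERIENCE");
        }
        for (String s : new String[] {"education", "academic background"}) {
            CANONICAL.put(s, "EDUCATION");
        }
        for (String s : new String[] {"skills", "technical skills"}) {
            CANONICAL.put(s, "SKILLS");
        }
    }

    /** Lower-cases the line, replaces anything but a-z and spaces, and collapses whitespace. */
    static String normalize(String line) {
        return line.toLowerCase(Locale.ROOT)
                   .replaceAll("[^a-z ]", " ")
                   .trim()
                   .replaceAll("\\s+", " ");
    }

    /** Returns the canonical label, or null if the line is not a known heading. */
    static String classify(String line) {
        return CANONICAL.get(normalize(line));
    }

    public static void main(String[] args) {
        System.out.println(classify("WORK HISTORY:"));
        System.out.println(classify("  Professional   Experience "));
        System.out.println(classify("Managed a team of five"));
    }
}
```

Each line costs one normalization and one map lookup, regardless of how many heading variants the table holds. Formatting cues such as bold or indentation are not covered here; they would have to come from whatever library reads the Word file.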
After reviewing my questions it's obvious to me that I have no idea what I'm doing, and a starting point would be much appreciated.
Oh, skill level: I've done a lot of academic work with Java and I'm strong in OOP concepts. I've been developing little programs here and there for my company up until now. I wouldn't say I'm an "expert" but I'm competent.
|
 |
Shahzad Latif
Greenhorn
Joined: Apr 28, 2011
Posts: 28
|
|
Hi Michael,
After reading your question, it looks like you want to process DOC/DOCX files in your Java application. If you're planning to do it from scratch, I would say it's going to be very tough and complicated. However, you may want to try a Java-based API for processing Word documents, such as Aspose.Words for Java. It is a commercial product, but you'll be able to process your documents quite easily with this component. You may try it at your end to see if it helps. If you need further assistance with this, please write back.
|
Developer Evangelist @ Aspose. I love to explore and learn new technologies and help other developers along the way.
|
 |
Michael Malley
Greenhorn
Joined: Oct 12, 2011
Posts: 19
|
|
Thanks Shahzad.
Actually, I've used Aspose.Words for other projects before. The processing isn't really the issue. What I'm really looking for is some sort of algorithm. I would like anyone who has done something like this before to share the type of parsing they used and some of the ways they went about doing it efficiently (i.e., did they make up an enumeration of common words to search for, did they use recursion and, if so, how, etc.). So I guess I'm really just looking for an in-depth discussion and some brainstorming partners.
|
 |
Shahzad Latif
Greenhorn
Joined: Apr 28, 2011
Posts: 28
|
|
Hi Michael,
I get your point. In that case, I think you should post the query in an algorithm-related forum; "Java in General" is not that specific. I wish you good luck in your endeavour.
By the way, I have also tweeted this, so maybe someone good with algorithms will come across it and help you: https://twitter.com/#!/shahzad_latif/status/149861663842115586.
|
 |
Michael Malley
Greenhorn
Joined: Oct 12, 2011
Posts: 19
|
|
|
Do you know the correct forum? I have looked and don't see any sub-forums about algorithms in the main forums.
|
 |
Shahzad Latif
Greenhorn
Joined: Apr 28, 2011
Posts: 28
|
|
|
If I search for "algorithm" on this site, I find most of the algorithm-related discussions in the General Computing forum. So I suppose that's the forum where you should discuss this.
|
 |
Michael Malley
Greenhorn
Joined: Oct 12, 2011
Posts: 19
|
|
Thanks.
Rather than re-posting a topic, could a Mod please move this thread to the General Computing forum?
|
 |
Michael Malley
Greenhorn
Joined: Oct 12, 2011
Posts: 19
|
|
|
As a follow-up question, could I use JavaCC to generate a parser for this project? I know it's not parsing lines of code and expressions, but is there a way I could define a grammar for a resume?
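A resume can at least be described by a very coarse, line-level grammar: resume := preamble section*, section := heading line*. The sketch below is hand-written Java rather than JavaCC, and it only illustrates that idea. The class name ResumeSegmenter and the heading heuristic (a short line that is all capitals or ends with a colon) are assumptions chosen for the example, not rules from any real parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical segmenter for the grammar: resume := preamble section*, section := heading line*. */
final class ResumeSegmenter {
    static final String PREAMBLE = "(preamble)";

    /** Assumed heuristic: at most four words, and either all capitals or ending with a colon. */
    static boolean looksLikeHeading(String line) {
        String t = line.trim();
        if (t.isEmpty() || t.split("\\s+").length > 4) {
            return false;
        }
        boolean hasLetter = t.chars().anyMatch(Character::isLetter);
        boolean allCaps = hasLetter && t.equals(t.toUpperCase(Locale.ROOT));
        return allCaps || t.endsWith(":");
    }

    /** Groups non-blank lines under the most recent heading, keeping document order. */
    static Map<String, List<String>> segment(List<String> lines) {
        Map<String, List<String>> sections = new LinkedHashMap<>();
        List<String> current = new ArrayList<>();
        sections.put(PREAMBLE, current);
        for (String line : lines) {
            if (looksLikeHeading(line)) {
                String key = line.trim().replaceAll(":$", "");
                current = sections.computeIfAbsent(key, k -> new ArrayList<>());
            } else if (!line.trim().isEmpty()) {
                current.add(line.trim());
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Jane Doe", "jane@example.com",
                "WORK HISTORY", "Acme Corp, 2009-2011",
                "", "Education:", "BSc Computer Science");
        System.out.println(segment(lines));
    }
}
```

The heuristic will misfire on some inputs, for example a name written in capitals would be treated as a heading. The point is only that the structure is regular at the line level even though the text inside each section is free-form, which is where a generated parser for a formal language fits less well.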
|
 |
Michael Malley
Greenhorn
Joined: Oct 12, 2011
Posts: 19
|
|
|
This was sort of a brainstorming topic. I've since started this project and would like to thank all those who participated.
|
 |
Vinay Johar
Greenhorn
Joined: Apr 22, 2012
Posts: 2
|
|
Wonderful topic and wonderful discussion. I have spent a good 4 years writing a resume parser,
so let me share my experience.
When you write any parser or extraction engine, you have to take care of two things: first position, and second grammar, as Michael said. When you start working on this, you realize there are around 67 fields that need to be addressed, so you should be ready to build a three-dimensional mapping: one dimension for position, one for grammar, and a third for validation.
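One possible reading of the position, grammar, and validation idea, applied to a single field, is sketched below. It is not the implementation described in this post. The class name PhoneExtractor, the regular expression, the 8-line header window, and the 7 to 15 digit range are all assumptions picked for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical extractor for one field, combining position, grammar, and validation. */
final class PhoneExtractor {
    // Grammar: the rough shape of a phone number (illustrative pattern, not exhaustive).
    private static final Pattern CANDIDATE =
            Pattern.compile("(?:\\+|\\()?\\d[\\d ().-]{5,}\\d");

    // Position: assume contact details appear in the first few lines of the resume.
    private static final int HEADER_LINES = 8;

    /** Validation: accept only candidates with a plausible number of digits (assumed 7 to 15). */
    static boolean isPlausible(String candidate) {
        int digits = candidate.replaceAll("\\D", "").length();
        return digits >= 7 && digits <= 15;
    }

    /** Returns the first candidate in the header window that passes validation. */
    static Optional<String> extract(List<String> lines) {
        int limit = Math.min(HEADER_LINES, lines.size());
        for (int i = 0; i < limit; i++) {
            Matcher m = CANDIDATE.matcher(lines.get(i));
            while (m.find()) {
                if (isPlausible(m.group())) {
                    return Optional.of(m.group());
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Jane Doe", "Tel: (555) 123-4567", "jane@example.com");
        System.out.println(extract(lines));
    }
}
```

Even this small example shows why validation becomes its own dimension: a year range such as 2009-2011 has the right shape and enough digits to pass, so only the position rule keeps it from being reported as a phone number.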
Once you get through that, the next challenge is accuracy. In the market, taking English alone, we found around 1300+ formats, so collect them, understand them, and then build for them.
Once you are done with that, you have to deal with different languages, formats, designs, and new fields coming up.
We spent a good 2+ years with 5 programmers and 2 mathematicians working full time to reach a result that is 95% accurate.
I hope this experience is useful. If you need any help, do let me know; I love to help the development community here.
Thanks
Vinay
www.rchilli.com
|
CEO, RChilli
www.rchilli.com
|
 |