So my company has a website where they upload resumes (.doc, .docx) and manually enter data from each resume, such as Name, Tel No, Address, etc. The site uses PHP and MySQL and is hosted on an Apache server. They want to automate the process. At first I was thinking of doing some PHP and parsing the file on the website, but I decided against that. I feel the best way to do this would be to use Java EE with a few EJBs and some relational mapping to the database that the website already uses. So here I am.
My questions range from simple to complex:
- Is it a good idea to use Java EE for this? (I think it's the most powerful way to do it with an Apache server running MySQL, and more robust than PHP.)
- Are there any parsing algorithms someone could start me out with? I did recursive descent parsing with J2SE back in school, but I think this is a different situation. The part I'm having difficulty with is predicting where information will be, given all the possibilities for labels, titles, and formatting (job history vs. work history vs. professional experience, headed sections vs. bolded sections vs. indented sections, etc.).
- Additionally, the solution I'm envisioning involves a lot of looping and looking up words in an enumeration ("first word might be a name, so let's see if it matches those criteria; if not, try the next criteria, then all the other criteria; if none match, move on"). I feel that would be very inefficient. Are there any conceptual algorithms anyone could lend me?
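To make the lookup idea concrete, here is a minimal sketch of one way it could work without looping over every criterion: normalize each line and look it up in a hash map of known heading variants, which is an expected constant-time operation per line. The class name `HeadingClassifier`, the `Section` categories, and the list of synonyms are all assumptions made up for illustration; a real resume set would need a much larger table and probably fuzzy matching as well.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps a line of resume text to a section category.
final class HeadingClassifier {

    // Illustrative categories only; not taken from any existing schema.
    enum Section { CONTACT, EXPERIENCE, EDUCATION, SKILLS, UNKNOWN }

    // Known heading variants, keyed by their normalized form.
    private static final Map<String, Section> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("job history", Section.EXPERIENCE);
        SYNONYMS.put("work history", Section.EXPERIENCE);
        SYNONYMS.put("professional experience", Section.EXPERIENCE);
        SYNONYMS.put("education", Section.EDUCATION);
        SYNONYMS.put("skills", Section.SKILLS);
        SYNONYMS.put("contact information", Section.CONTACT);
    }

    // Lower-case the line, turn anything that is not a-z or a space into a
    // space, trim the ends, and collapse runs of whitespace to one space.
    static String normalize(String line) {
        return line.toLowerCase(Locale.ROOT)
                   .replaceAll("[^a-z ]", " ")
                   .trim()
                   .replaceAll("\\s+", " ");
    }

    // One hash lookup per line instead of testing every criterion in turn.
    static Section classify(String line) {
        return SYNONYMS.getOrDefault(normalize(line), Section.UNKNOWN);
    }
}
```

With this table, `HeadingClassifier.classify("WORK HISTORY:")` and `HeadingClassifier.classify("Professional   Experience")` both resolve to `EXPERIENCE`, while an ordinary line such as a person's name falls through to `UNKNOWN`. Formatting cues like bold or indentation are not handled here; they would have to come from whatever library reads the Word file.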
After reviewing my questions it's obvious to me that I have no idea what I'm doing, and a starting point would be much appreciated.
Oh, skill level: I've done a lot of academic work with Java and I'm strong in OOP concepts. I've been developing little programs here and there for my company up until now. I wouldn't say I'm an "expert" but I'm competent.
After reading your question, it looks like you want to process the Doc/Docx files in your Java application. If you're planning to do it from scratch, I would say it's going to be very tough and complicated. However, you may want to try a Java-based API for processing Word documents, such as Aspose.Words for Java. It is a commercial product, but it should let you process your documents quite easily. You may try it at your end to see if it helps. If you need further assistance with this, please write back.
Developer Evangelist @ Aspose. I love to explore and learn new technologies and help other developers along the way.
Actually, I've used Aspose.Words for other projects before. The processing isn't really the issue. What I'm really looking for is some sort of algorithm. I'd like to hear from anyone who has done something like this before: what type of parsing did you use, and how did you make it efficient (i.e. did you build an enumeration of common words to search for, did you use recursion, and if so, how, etc.)? So I guess I'm really just looking for an in-depth discussion and some brainstorming partners.
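As one possible starting point for that discussion, fields with a recognizable shape (a telephone number, an email address) can often be pulled out with regular expressions rather than a word enumeration or recursion. The sketch below assumes the document text has already been extracted to a plain `String`; the class name `ContactExtractor` and both patterns are hypothetical and deliberately loose, not a validated grammar for real-world phone numbers or addresses.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the first email and phone number in plain text.
final class ContactExtractor {

    // Loose email shape: local part, '@', domain, dot, 2+ letter suffix.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Loose phone shape: optional '+', optional '(', a digit, then at least
    // six digits/spaces/parentheses/dots/hyphens, ending on a digit.
    private static final Pattern PHONE =
        Pattern.compile("\\+?\\(?\\d[\\d ().-]{6,}\\d");

    // Returns the first substring matching the pattern, if any.
    private static Optional<String> firstMatch(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    static Optional<String> email(String text) {
        return firstMatch(EMAIL, text);
    }

    static Optional<String> phone(String text) {
        return firstMatch(PHONE, text);
    }
}
```

Names and street addresses have no such fixed shape, so this approach only covers part of the problem; those fields would still need position hints (for example, the first non-empty line) or a lookup table.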
I got your point. In that case, I think you should post the query in an algorithm-related forum; "Java in General" is not that specific. Well, I wish you good luck in your endeavour.