We need to build an application that parses resume (CV) documents. These documents may be in any format, such as DOC or PDF.
We then have to extract various fields from these files, such as Name, Location, Experience, and Skill Set, and create an XML document from them.
We are collecting the files (resumes) from job portal sites, and their fields are neither fixed-width nor delimiter-separated.
That's a tough problem, both because you're dealing with different input formats for which no common API exists, and because the format is not fixed. E.g., one applicant might write "Location: ..." while another might use "City: ..." How do you know those refer to the same thing?
For accessing the textual contents of PDF, check out the PDFBox and JPedal libraries. For DOC files the Apache POI library would be the way to go.
Ulf Dittmer wrote: That's a tough problem, both because you're dealing with different input formats for which no common API exists, and because the format is not fixed. E.g., one applicant might write "Location: ..." while another might use "City: ..." How do you know those refer to the same thing?
We could keep a set of commonly used terms for each field (Location, Name, Age/Sex, etc.). For Location, say, we could accept any of the terms City, Location, or Place.
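A minimal sketch of that idea in Java, assuming a hand-maintained synonym table. The class name FieldLabels, the canonical field names, and the synonym lists are all made up for illustration and would have to grow as more resumes are seen:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps label variants found in resumes to one canonical field name.
final class FieldLabels {
    // Lower-case label variant -> canonical field name (illustrative entries only).
    private static final Map<String, String> SYNONYMS = new HashMap<>();

    static {
        register("Location", "location", "city", "place");
        register("Name", "name", "full name");
        register("Experience", "experience", "work experience");
        register("SkillSet", "skills", "skill set");
    }

    private static void register(String canonical, String... variants) {
        for (String variant : variants) {
            SYNONYMS.put(variant, canonical);
        }
    }

    // Returns the canonical field name for a raw label, or null if the label is unknown.
    static String canonical(String rawLabel) {
        if (rawLabel == null) {
            return null;
        }
        String key = rawLabel.trim().toLowerCase(Locale.ROOT);
        if (key.endsWith(":")) {
            // Drop a trailing colon such as in "City:".
            key = key.substring(0, key.length() - 1).trim();
        }
        // Collapse runs of whitespace so "Skill   Set" matches "skill set".
        key = key.replaceAll("\\s+", " ");
        return SYNONYMS.get(key);
    }
}
```

With this, FieldLabels.canonical("City:") and FieldLabels.canonical("Place") would both give "Location", which could then be used as the XML element name. Labels that are not in the table come back as null, so they can be logged and reviewed by hand.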
And it might get tougher if one document uses a pattern like
Location: .........
and another document has
Location
..........
with the value on a new line.
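Both layouts can probably be handled in one pass over the extracted text lines. The following is only a sketch under assumptions: the text has already been pulled out of the DOC/PDF and split into lines, the class name LabelValueScanner and the small label list are invented for the example, and a label whose value is missing on its own line is assumed to have its value on the next non-blank line.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical scanner that accepts "Label: value" as well as "Label" followed by the value on a later line.
final class LabelValueScanner {
    // Assumed set of known labels, stored in lower case (illustrative only).
    private static final Set<String> LABELS =
            new HashSet<>(Arrays.asList("location", "city", "place", "name"));

    // Returns the lower-cased label if the line starts with a known label, otherwise null.
    private static String labelOf(String line) {
        String trimmed = line.trim();
        int colon = trimmed.indexOf(':');
        String candidate = (colon >= 0 ? trimmed.substring(0, colon) : trimmed)
                .trim().toLowerCase(Locale.ROOT);
        return LABELS.contains(candidate) ? candidate : null;
    }

    // Maps each recognized label (lower case) to its value, in order of appearance.
    static Map<String, String> scan(List<String> lines) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            String label = labelOf(line);
            if (label == null) {
                continue; // not a line that starts with a known label
            }
            int colon = line.indexOf(':');
            String value = colon >= 0 ? line.substring(colon + 1).trim() : "";
            if (value.isEmpty()) {
                // Value is not on the label line: look at the next non-blank line.
                int j = i + 1;
                while (j < lines.size() && lines.get(j).trim().isEmpty()) {
                    j++;
                }
                if (j < lines.size() && labelOf(lines.get(j)) == null) {
                    value = lines.get(j).trim();
                    i = j; // the value line has been consumed
                }
            }
            if (!value.isEmpty()) {
                fields.put(label, value);
            }
        }
        return fields;
    }
}
```

This will still miss resumes that give the location with no label at all, or that put several fields on one line, so it is a starting point rather than a solution.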
Still, thanks for the input.
(I think this problem is tough to solve, so I'm thinking of suggesting that the job portal site create a web service for us.)
I do not see anything specific to Java here, let alone Java - Advanced, so I'm moving this to the General Computing forum.
If you need to discuss something specific to open source projects that will help you, then we can move it to Other Open Source projects forum later.