Generic Parser for files

Sagar Rohankar
Ranch Hand

Joined: Feb 19, 2008
Posts: 2896
    


Hello Ranchers,

We need to build an application that parses resume (CV) documents. These documents may be in any format, such as DOC, PDF, etc.
We have to extract various fields from these files, like Name, Location, Experience, Skill Set, etc., and create an XML document from them.

We are collecting the files (resumes) from job portal sites, and they use neither fixed-width fields nor delimiter-separated values.

So, any pointers?


Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 35241
    
That's a tough problem, both because you're dealing with different input formats for which no common API exists, and because the format is not fixed. E.g., one applicant might write "Location: ..." while another might use "City: ..." How do you know those refer to the same thing?

For accessing the textual contents of PDF, check out the PDFBox and JPedal libraries. For DOC files the Apache POI library would be the way to go.


Sagar Rohankar
Ranch Hand

Joined: Feb 19, 2008
Posts: 2896
    

Ulf Dittmer wrote:That's a tough problem, both because you're dealing with different input formats for which no common API exists, and because the format is not fixed. E.g., one applicant might write "Location: ..." while another might use "City: ..." How do you know those refer to the same thing?


We can keep a set of commonly used terms for each field (Location, Name, Age/Sex, etc.). For Location, say, we could accept the terms City, Location and Place.

And it might get tougher if one document uses a pattern like

Location: .........

and another document has

Location
..........

with the value on a new line.

Still, thanks for the input.

(I think this problem is tough to solve, so I'm thinking of suggesting that the job portal site create a web service for us.)
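
For what it's worth, the synonym-set idea above can be sketched in a few lines of plain Java. This is only a rough illustration, not a complete parser: it assumes the text has already been extracted from the DOC or PDF file into a String, and the class name, the label table and the canonical field names are all made up for the example. It handles both layouts mentioned above, "Location: value" on one line and a bare "Location" label with the value on the next non-blank line.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of any library mentioned in this thread.
class ResumeFieldExtractor {

    // Assumed synonym table: lower-cased label as it may appear in a resume
    // mapped to the canonical field name we want in the output.
    private static final Map<String, String> LABELS = Map.of(
            "location", "Location",
            "city", "Location",
            "place", "Location",
            "name", "Name",
            "experience", "Experience",
            "skills", "SkillSet",
            "skill set", "SkillSet");

    /** Extracts known fields from plain text already pulled out of the document. */
    static Map<String, String> extract(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        String[] lines = text.split("\\R"); // split on any line break
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.isEmpty()) {
                continue;
            }
            int colon = line.indexOf(':');
            String label = (colon >= 0 ? line.substring(0, colon) : line).trim();
            String field = LABELS.get(label.toLowerCase(Locale.ROOT));
            if (field == null) {
                continue; // not a label we recognise
            }
            String value = colon >= 0 ? line.substring(colon + 1).trim() : "";
            if (value.isEmpty()) {
                // Label on its own line: take the next non-blank line as the value.
                // Limitation: this does not check whether that line is itself a label.
                int j = i + 1;
                while (j < lines.length && lines[j].trim().isEmpty()) {
                    j++;
                }
                if (j < lines.length) {
                    value = lines[j].trim();
                    i = j; // skip the line we just consumed
                }
            }
            if (!value.isEmpty()) {
                fields.putIfAbsent(field, value); // first occurrence wins
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String resume = "Name: A. Sample\nCity\nPune\n\nSkills: Java, XML\n";
        System.out.println(extract(resume));
    }
}
```

Real resumes will break this quickly (free-form paragraphs, tables, labels the table does not know about), so treat it as a starting point for experiments rather than a solution.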
Nitesh Kant
Bartender

Joined: Feb 25, 2007
Posts: 1638

I do not see anything specific to Java here, let alone Advanced Java, so I'm moving this to the General Computing forum.
If you need to discuss specific open source projects that could help you, we can move it to the Other Open Source Projects forum later.

