I have been assigned to check student submissions for plagiarism and to advise the students on their writing style.
Plagiarism only needs to be checked among documents submitted for the same assignment.
In summary, I need to check whether they have explained every abbreviation, written sentences that are easy to follow, and used easy-to-understand language.
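To make the abbreviation requirement concrete, here is a rough sketch of the kind of check I have in mind. It rests on two assumptions of mine: an abbreviation is a standalone run of two or more uppercase letters, and it counts as explained if it appears somewhere in parentheses after its long form, as in "Natural Language Processing (NLP)". The class and method names are made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AbbreviationCheck {
    // Assumption: an abbreviation is a standalone run of 2+ uppercase letters.
    private static final Pattern ABBREVIATION =
            Pattern.compile("(?<!\\p{L})\\p{Lu}{2,}(?!\\p{L})");

    // Returns every abbreviation that never appears in the form "(ABBR)".
    static Set<String> undefinedAbbreviations(String text) {
        Set<String> undefined = new LinkedHashSet<>();
        Matcher m = ABBREVIATION.matcher(text);
        while (m.find()) {
            String abbr = m.group();
            // Assumption: "long form (ABBR)" anywhere in the text is a definition.
            if (!text.contains("(" + abbr + ")")) {
                undefined.add(abbr);
            }
        }
        return undefined;
    }

    public static void main(String[] args) {
        String text = "Wir nutzen Natural Language Processing (NLP). "
                + "NLP und UIMA helfen dabei.";
        System.out.println(undefinedAbbreviations(text));
    }
}
```

This does not check that the definition comes before the first use, and it will miss abbreviations written with dots or in mixed case, so it is only a starting point.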
I know about the libraries OpenNLP and UIMA, but I am unsure whether they are really what I need to get the job done.
To check for plagiarism I am using Lucene, and the target language is mostly German.
For the plagiarism part, I had a friend who was a teacher who would type unlikely-looking passages into Google and see if they came back embedded in the rest of the student's paper. It would be easy to automate that, except for determining what counts as an unlikely passage. But that might not be necessary once it is automated: you could submit every sentence, whereas she wasn't going to enter every sentence manually. I haven't used the two packages you mention, but from reading about them just now I'm not sure they would help in detecting plagiarism. How advanced are the students? Grade school? Post-doc?
Yeah, I know that, but they are expensive and my department thinks that it is a good idea to program it from scratch...
I plan to use OpenNLP mainly to extract the sentences and to analyze them for style.
I then plan to send those sentences/phrases to Lucene for indexing.
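As a minimal sketch of the sentence step, the following splits a text into sentences and flags the overly long ones. Note that it uses the JDK's `java.text.BreakIterator` as a stand-in so the example runs without any model files; it is not OpenNLP. The word limit and the class and method names are my own placeholders.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceStyleCheck {
    // Splits text into sentences using the JDK's German sentence rules.
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.GERMAN);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    // Flags sentences with more than maxWords whitespace-separated words.
    static List<String> overlongSentences(String text, int maxWords) {
        List<String> flagged = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            if (sentence.split("\\s+").length > maxWords) {
                flagged.add(sentence);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        String text = "Das ist kurz. Dieser Satz hat deutlich mehr als vier Worte.";
        for (String s : overlongSentences(text, 5)) {
            System.out.println("Too long: " + s);
        }
    }
}
```

The built-in rules can split wrongly at abbreviations such as "z.B.", which is one reason I would rather use a trained sentence detector for German. The resulting sentences are what I would hand to the indexer.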
The problem isn't that we suspect the students of copying from the internet but that they copy from each other, so comparing submissions with web content isn't the issue.
It would also be a legal issue to transfer those documents to servers outside of our control.
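To illustrate what comparing submissions with each other could look like when everything stays on our own machines, here is a small sketch using word n-gram "shingles" and Jaccard similarity. This is plain in-memory Java, not Lucene, and the shingle size and the names are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SubmissionSimilarity {
    // Builds the set of all runs of n consecutive words (lowercased).
    static Set<String> shingles(String text, int n) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.GERMAN).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        Set<String> result = new HashSet<>();
        for (int i = 0; i + n <= words.size(); i++) {
            result.add(String.join(" ", words.subList(i, i + n)));
        }
        return result;
    }

    // Jaccard similarity: shared shingles divided by all distinct shingles.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        int union = a.size() + b.size() - shared.size();
        return (double) shared.size() / union;
    }

    public static void main(String[] args) {
        String first = "Der schnelle braune Fuchs springt weit";
        String second = "Der schnelle braune Fuchs springt hoch";
        double score = jaccard(shingles(first, 3), shingles(second, 3));
        System.out.println("Similarity: " + score);
    }
}
```

Running this over every pair of documents from the same assignment and looking at the highest-scoring pairs would be the simplest version; it gets slow for large classes, which is where an index would help.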
If you have any general advice I would be very grateful.