PramilaT Thakur

Greenhorn
since Aug 21, 2003

Recent posts by PramilaT Thakur

Hi Ernest Friedman-Hill,

You are right. The files go through OCR, and the text is extracted from the images.
The only way I can tell the chapters apart is by their headings, which are set in a special font.
So I would need to read the whole text in one shot.
But because the text is so long, the machine could run out of memory.
So I am looking for some other alternative.

The page numbers are also not in sequence.
In the images, each chapter begins with page 1.

Thanks
15 years ago
Hi Ulf Dittmer,

I know we can do that, i.e. save each page as a separate PDF file. I tried that and it worked.

But I do not want the resulting PDFs to be single pages.

For example:
My original PDF contains 25,000 pages.
All pages are images. I will run OCR to get the text out of them.

The PDF contains several chapters (say 100 chapters).
The heading of each chapter contains a particular string in a unique font.

So I want to split the PDF into chapters, so that each resulting PDF contains one full chapter from start to finish.
The result I want is 100 PDFs holding the 100 different chapters.

This is what I am expecting, and I need help with it or any pointers.

Thanks in advance.
15 years ago
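One possible way to avoid holding the whole text in memory is to look at the pages one at a time and record only where each chapter starts. The sketch below is plain Java and assumes a hypothetical per-page check (`startsChapter`) that would run OCR on a single page and look for the heading string; the class and method names are illustrative, not part of any library. Detecting the font and writing each page range out as its own PDF would need a PDF library and is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class ChapterSplitter {

    /**
     * Scans pages 1..pageCount once and returns inclusive {firstPage, lastPage}
     * pairs, one per chapter. startsChapter is a hypothetical check that is
     * called for one page at a time, so only that page's text needs to be in
     * memory. Page 1 always opens the first range, so any pages before the
     * first detected heading end up in a leading range of their own.
     */
    static List<int[]> chapterRanges(int pageCount, IntPredicate startsChapter) {
        List<int[]> ranges = new ArrayList<>();
        if (pageCount < 1) {
            return ranges;
        }
        int currentStart = 1;
        for (int page = 2; page <= pageCount; page++) {
            if (startsChapter.test(page)) {
                // A new heading closes the range that was open so far.
                ranges.add(new int[] {currentStart, page - 1});
                currentStart = page;
            }
        }
        // Close the last open range at the end of the document.
        ranges.add(new int[] {currentStart, pageCount});
        return ranges;
    }

    public static void main(String[] args) {
        // Stand-in for OCR output; a real run would OCR each page on demand.
        String[] ocrTextByPage = {"CHAPTER One", "body", "body", "CHAPTER Two", "body"};
        List<int[]> ranges = chapterRanges(ocrTextByPage.length,
                page -> ocrTextByPage[page - 1].startsWith("CHAPTER"));
        for (int[] range : ranges) {
            System.out.println("pages " + range[0] + "-" + range[1]);
        }
    }
}
```

With the sample data in `main`, this reports two ranges, pages 1-3 and 4-5. Because the scan keeps only page numbers, memory use does not grow with the length of the document, and the page numbers printed on the scans (which restart at 1 in each chapter) are never consulted.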
Hi Everyone,

I would like to have some pointers from all our remarkable readers.

Here is what I want to achieve.
I have a few hundred files in PDF format, all very large.
All the pages are scanned and are images.
No consistent page numbers.
No bookmarks.

What I want to do is split the PDF files into smaller chunks. To do this I have a tokenString that I want to look for in the document text.
Each chunk should then be saved as a separate PDF file.

Reading the whole stream is a problem, as the files are too big.
What other options do I have?

Can someone help me here with some pointers, please?
15 years ago
Hi,

I tried it, and after some trial and error I got it working on my local machine as a standalone application.

Now I need to integrate it with Solr, so that the Solr server can search the index files.

I have been reading a lot about Solr, but it is confusing to me, especially SOLR_HOME and solr.solr.home.

If anyone has any pointers or mini tutorials, please help me.

Thanks in advance.
Hi Everyone,

I am new to Lucene. I need to index some PDF files. I tried using PDFBox and a Lucene document, but when I try to run the program it fails.

I have no idea why. I also tried the code given in the posting at https://coderanch.com/t/424178/open-source/PDF-file-indexing-Searching-lucene but even this does not work.
Can anyone help me?

I think it is a version issue. My code is:

package org.apache.solr.pdf.test;

import java.io.File;

import org.apache.lucene.index.IndexWriter;
import org.pdfbox.searchengine.lucene.IndexFiles;

public class PDFBoxIndexFiles {

    /**
     * @param args command-line arguments (unused)
     */
    public static void main(String[] args) throws Exception {
        // Index who.pdf with PDFBox's Lucene helper, using C:/temp as the index directory.
        IndexFiles indexFiles = new IndexFiles();
        indexFiles.index(new File("who.pdf"), true, "C:/temp");
    }

}

After running it I get:

Exception in thread "main" java.lang.IllegalAccessError: tried to access field org.apache.lucene.index.IndexWriter.maxFieldLength from class org.pdfbox.searchengine.lucene.IndexFiles
    at org.pdfbox.searchengine.lucene.IndexFiles.index(IndexFiles.java:158)
    at org.apache.solr.pdf.test.PDFBoxIndexFiles.main(PDFBoxIndexFiles.java:15)

I need some pointers, please.

Thanks
Hi Manish,

If possible, can you send me some info on Velocity? I am trying to use InfoGlue for content management and I am new to Velocity templates.

Thanks
Hi,
There could be another solution for this. You could use
<html:form action="/someAction.do">
<input type="submit" name="submit" value="Click"/>
<!-- add other buttons: all named "submit", but with different values -->
</html:form>
In the action class, just check what the value of the submit button is.
Depending on that, determine the forward name and return that forward.
But remember that these action forwards must be declared in struts-config.xml.
Hope this helps.
Enjoy.
Pramila Thakur
SCJP,SCJD, SCWCD
20 years ago
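The lookup described above, from the submit button's value to a forward name, could be sketched roughly as follows. This is plain Java with no Struts dependency so that it stands alone; `SubmitDispatcher`, the button values and the forward names are all hypothetical. In a Struts 1 action the value would presumably come from `request.getParameter("submit")`, and the resulting name would be passed to `mapping.findForward(...)`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: maps the value of the pressed submit button to a forward name. */
class SubmitDispatcher {

    private final Map<String, String> forwardsByButtonValue = new LinkedHashMap<>();
    private final String defaultForward;

    /** defaultForward is used when no button value matches (or none was sent). */
    SubmitDispatcher(String defaultForward) {
        this.defaultForward = defaultForward;
    }

    /** Registers the forward name to use when the button with this value is pressed. */
    SubmitDispatcher on(String buttonValue, String forwardName) {
        forwardsByButtonValue.put(buttonValue, forwardName);
        return this;
    }

    /** Returns the forward name for the submitted value, or the default. */
    String forwardNameFor(String submitValue) {
        if (submitValue == null) {
            return defaultForward;
        }
        return forwardsByButtonValue.getOrDefault(submitValue, defaultForward);
    }

    public static void main(String[] args) {
        // Button values and forward names here are made up for illustration.
        SubmitDispatcher dispatcher = new SubmitDispatcher("input")
                .on("Save", "saved")
                .on("Cancel", "cancelled");
        System.out.println(dispatcher.forwardNameFor("Save"));
        System.out.println(dispatcher.forwardNameFor(null));
    }
}
```

Each forward name returned this way still has to match a forward declared in struts-config.xml. Note that the lookup keys are the visible button labels, so renaming or translating a button also means updating the mapping.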