Extracting Data

kishore snaham
Greenhorn

Joined: Sep 23, 2009
Posts: 10
Hi forum,

I want to extract PDFs and text files from popular news sites. I have already found the sites that host PDFs and offer a download option. I want to store those PDFs in my site's database and later retrieve them from the database to display on my site. If I supply the link, is it possible to extract the file directly into our database? I am still unsure which technology would give the best results. Is there any other way to get the PDFs into my database?
(or)
If I enter the URL of the content web page in a text field, the data would be extracted and saved in the database directly; later I would retrieve it from the database for my site.
Is either of these approaches possible?
Can anyone help with their valuable suggestions?

thank you
Jesper de Jong
Java Cowboy
Saloon Keeper

Joined: Aug 16, 2005
Posts: 14432
    

Aside from the technical issues, you should check if you are legally allowed to copy content from those other websites before you do this. Just because other websites make content available to you doesn't mean you're allowed to republish that content yourself.


kishore snaham
Greenhorn

Joined: Sep 23, 2009
Posts: 10
Thank you, Jesper de Jong, for your suggestion.
But I have all the rights to access that data. Those are my colleagues' sites, so there is no problem in that regard. Do you have any ideas regarding the process and the coding?
Thank you
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42926
    
The Apache HttpClient library can access and download web content if you know the URLs. If you only know the domain names -but not the exact URLs- it gets a lot trickier; you'd essentially have to implement a web spider that parses the HTML pages for PDF links (and links to other pages where you could continue spidering). You can probably find existing spiders written in Java on java-source.net.
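To make the spidering step concrete, here is a minimal sketch of the "parse the HTML pages for PDF links" part, using only the JDK. The class name PdfLinkExtractor and the regex are illustrative assumptions, not part of any library; a regex is only a rough heuristic, and a real spider would more likely use a proper HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds links to PDF files in a page's HTML.
class PdfLinkExtractor {
    // Rough heuristic for href="..." or href='...' inside <a> tags.
    // The captured group stops before any #fragment.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"'#]+)[^\"']*[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns absolute URLs of all distinct PDF links, in page order.
    static List<String> extractPdfLinks(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            URI resolved;
            try {
                // Turn relative links into absolute ones.
                resolved = base.resolve(m.group(1).trim());
            } catch (IllegalArgumentException e) {
                continue; // skip hrefs that are not valid URIs
            }
            String path = resolved.getPath();
            if (path != null && path.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                String url = resolved.toString();
                if (!links.contains(url)) {
                    links.add(url);
                }
            }
        }
        return links;
    }
}
```

Calling PdfLinkExtractor.extractPdfLinks(html, pageUrl) on each downloaded page gives the list of PDFs to fetch; links that do not end in .pdf would be the candidates for further spidering.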
kishore snaham
Greenhorn

Joined: Sep 23, 2009
Posts: 10
Thank you very much, Ulf Dittmer.
I have just started searching as you suggested.
In the meantime, any other suggestions are welcome.

thank you
kishore snaham
Greenhorn

Joined: Sep 23, 2009
Posts: 10
Hi Ulf Dittmer,
As you suggested, I looked into web spiders and related concepts. There is some code available, but it is for database applications, not web-based ones.
What I have in mind is that the data files (PDF, HTML or images) go directly into my database server whenever I give the URL of the file.
Is this possible? The first thing I am confused about is where to start. I know how to get data from the database to the website, but not how to get a data file from a web page into the database.

thank you
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42926
    
There are two steps to it. First, you need to download the page/file through HTTP; the HttpClient library can help you with that. That will result either in a file on disk, or in a byte[] in memory. Either way, the second step is to store the binary data in the DB, most likely in a Blob field; that's probably the simpler part, but neither part is really hard. Let us know if you encounter any problems.
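The two steps could be sketched roughly like this. To keep the example free of external dependencies it uses the JDK's own java.net.URL instead of the HttpClient library, and plain JDBC for the second step. The class name DocumentImporter, the table documents and its columns source_url and content are assumptions made up for illustration; the Connection would come from whatever JDBC driver your database uses.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper showing the download-then-store flow.
class DocumentImporter {
    // Step 1: fetch the resource behind the URL into a byte[] in memory.
    static byte[] download(String url) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return in.readAllBytes(); // requires Java 9 or later
        }
    }

    // Step 2: store the bytes in a binary column.
    // Assumes a table like:
    //   CREATE TABLE documents (source_url VARCHAR(1000), content BLOB)
    static void store(Connection conn, String sourceUrl, byte[] content)
            throws SQLException {
        String sql = "INSERT INTO documents (source_url, content) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, sourceUrl);
            ps.setBytes(2, content);
            ps.executeUpdate();
        }
    }
}
```

Holding the whole file in a byte[] is fine for small documents; for large files, streaming to disk first and using setBinaryStream would probably be the better choice.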
Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24187
    

Odd that you first say the files are on "popular news sites" and then say the sites belong to your colleagues as soon as someone questions your legal standing.

Why can't you just link to the documents where they are? Why do you have to make copies?


kishore snaham
Greenhorn

Joined: Sep 23, 2009
Posts: 10
Thank you very much, Ulf Dittmer.
I feel there is a solution along the lines you described, but I am not familiar with the HttpClient library,
so I will start by looking into that. I am hopeful that there is a solution. I will be back soon with my results; in the meantime, please share any ideas you have on this.

As for Ernest Friedman-Hill's question: I will certainly provide all the links if I succeed. I want to make a site for people who speak my regional language.

Thanks for your suggestions
Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24187
    

I'm still not understanding. In your HTML on www.yoursite.com, you can just include

<a href="http://www.othersite.com/document.pdf">Click here to read PDF document on other site</a>

and you're done. What more do you need?

kishore snaham
Greenhorn

Joined: Sep 23, 2009
Posts: 10
Thank you, Ernest Friedman-Hill. Of course I can do as you said to display the file, but it is not a static site. After publishing my site, I will need to add more data regularly, and I can't change the HTML pages for every link.
For that I am creating an admin panel; along with the data I will add fields such as name, content site name, date, etc. to display.
Those items will also be displayed under latest items, most-read items, etc.
I think now you are getting my point.
Thank you again for your valuable suggestion.
Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24187
    

kishore venv wrote:
i think now you getting my point.


No, I am still not getting it. If the site is dynamic, then the HTML is generated dynamically, but it can still link to a document in its original location, rather than a pirated copy on your own site. Your database can simply contain the link to the remote document.
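As a rough illustration of that idea: the database row holds only the metadata and the remote URL, and the page is generated from it. The class DocumentLink and its fields are hypothetical names chosen for this sketch; in a real site the values would be read from your own table by the admin panel's backend.

```java
// Hypothetical value class: one row of link metadata from the database.
class DocumentLink {
    final String title;
    final String siteName;
    final String url;

    DocumentLink(String title, String siteName, String url) {
        this.title = title;
        this.siteName = siteName;
        this.url = url;
    }

    // Renders an anchor that points at the document in its original location.
    String toHtml() {
        return "<a href=\"" + escape(url) + "\">" + escape(title) + "</a> ("
                + escape(siteName) + ")";
    }

    // Minimal HTML escaping so stored text cannot break the markup.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }
}
```

Adding a new document then means inserting one more row through the admin panel; no HTML page has to be edited by hand, and no copy of the file is kept.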
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42926
    
Why can't you keep (and update) the metadata in a DB on your site without having to keep a copy of the content?
kishore snaham
Greenhorn

Joined: Sep 23, 2009
Posts: 10
OK, fine, let me explain clearly.
I am building a site with categories such as
regional news,
film reviews,
help desk (for social awareness programs held in the surrounding area), etc.
It is just a showcase of that information.
As I already said, when a user searches within a category, the results show data files along with the content site name.
Of course, the content site name is shown as a link. If the user wants to read the data, he/she reads the file and then chooses another category. If the user wants to read more topics in that category, he/she goes to the content site by clicking the content site name link. My language may have confused you. That is the project I want to do.
Thanks for your suggestions
thomas silver
Ranch Hand

Joined: Jun 20, 2003
Posts: 32
So they are not from your colleagues' sites like you said? I guess I am confused too.
 