JavaRanch » Java Forums » Databases » JDBC

search engine to search both at database and web application level

sapna rana
Greenhorn

Joined: Sep 01, 2008
Posts: 18
Please suggest a search engine that can search at both the database and
web application level.
Paul Sturrock
Bartender

Joined: Apr 14, 2004
Posts: 10336

Lucene is good.


JavaRanch FAQ HowToAskQuestionsOnJavaRanch
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42061
    
Lucene rocks, but it needs an indexer for each data source that you want to search. I'm not aware that one for databases exists, although it wouldn't be hard to write one. Just wanted to give a heads-up that it's not a simple plug-and-play solution.


Ping & DNS - my free Android networking tools app
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19697
    

Originally posted by Ulf Dittmer:
I'm not aware that one for databases exists, although it wouldn't be hard to write one.

It's not. I wrote a simple tool a while ago that searches any database for any String. DatabaseMetaData will help you retrieve the tables from a connection; for the rest it's just "SELECT * FROM <table>", iterate through the result set and the columns (using ResultSetMetaData), and voila.
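A minimal sketch of that approach (the class and method names here are illustrative, not the original tool): DatabaseMetaData lists the tables, ResultSetMetaData drives the column loop, and every String column is checked for the search term.

```java
import java.sql.*;
import java.util.*;

/** Sketch of the "search any database for any String" approach described above. */
public class DBGrep {

    // Builds the per-table query; kept as a separate method so it can be
    // exercised without a live database connection.
    static String selectAll(String table) {
        return "SELECT * FROM " + table;
    }

    /** Returns "table.column" locations whose value contains the search string. */
    static List<String> grep(Connection con, String needle) throws SQLException {
        List<String> hits = new ArrayList<String>();
        DatabaseMetaData meta = con.getMetaData();
        // List all ordinary tables visible through this connection.
        ResultSet tables = meta.getTables(null, null, "%", new String[] {"TABLE"});
        while (tables.next()) {
            String table = tables.getString("TABLE_NAME");
            Statement st = con.createStatement();
            ResultSet rs = st.executeQuery(selectAll(table));
            ResultSetMetaData rsMeta = rs.getMetaData();
            while (rs.next()) {
                for (int col = 1; col <= rsMeta.getColumnCount(); col++) {
                    String value = rs.getString(col);
                    if (value != null && value.contains(needle)) {
                        hits.add(table + "." + rsMeta.getColumnName(col));
                    }
                }
            }
            rs.close();
            st.close();
        }
        return hits;
    }
}
```

Note this does a full table scan of every table on every search, so it is a diagnostic tool rather than a search engine.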


SCJP 1.4 - SCJP 6 - SCWCD 5 - OCEEJBD 6
How To Ask Questions How To Answer Questions
Paul Sturrock
Bartender

Joined: Apr 14, 2004
Posts: 10336

Some databases, SQL Server for example, provide this sort of service out of the box. It includes a free-text search service, so some sort of Lucene/SQL Server mix would be a possibility. Other databases presumably have competing offerings.
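For illustration, querying SQL Server's built-in full-text service over JDBC might look like the sketch below. The documents table, the contents column, and the existence of a full-text index on that column are all assumptions here.

```java
import java.sql.*;

/** Sketch: SQL Server full-text search over JDBC.
    Assumes a full-text index already exists on documents.contents. */
public class FullTextSearch {

    // CONTAINS is SQL Server's full-text predicate; the search term is
    // supplied as a bind parameter.
    static String containsQuery(String table, String column) {
        return "SELECT * FROM " + table + " WHERE CONTAINS(" + column + ", ?)";
    }

    static void search(Connection con, String term) throws SQLException {
        PreparedStatement ps = con.prepareStatement(containsQuery("documents", "contents"));
        ps.setString(1, term);
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        rs.close();
        ps.close();
    }
}
```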
sapna rana
Greenhorn

Joined: Sep 01, 2008
Posts: 18
Hi,

I have tried Nutch, but it is web search only and does not include any search at the database level.

The following error appeared while using it:

/**********************************/

Nutch search engine(nutch-0.7.2).

After installing Nutch and Tomcat, I tried to crawl three URLs, one of which is my web application on JBoss,

using the command:

nutch crawl urls -dir crawl -depth 3 >& crawl.log

where urls is a file under the Nutch directory containing three URLs:
"http://localhost:8080/vinweb"
"http://www.orkut.co.in"
"http://apache.com"


But after crawling, I checked crawl.log and it seems it
didn't fetch anything:

080901 193120 FetchListTool started
080901 193121 Overall processing: Sorted 0 entries in 0.0 seconds.

The following is my crawl.log file:
*****************************************
run java in C:\Program Files\Java\jdk1.5.0_12
080901 193120 parsing file:/E:/SearchTools/nutch-0.7.2/conf/nutch-default.xml
080901 193120 parsing file:/E:/SearchTools/nutch-0.7.2/conf/crawl-tool.xml
080901 193120 parsing file:/E:/SearchTools/nutch-0.7.2/conf/nutch-site.xml
080901 193120 No FS indicated, using default:local
080901 193120 crawl started in: crawl
080901 193120 rootUrlFile = urls
080901 193120 threads = 10
080901 193120 depth = 3
080901 193120 Created webdb at LocalFS,E:\SearchTools\nutch-0.7.2\crawl\db
080901 193120 Starting URL processing
080901 193120 Plugins: looking in: E:\SearchTools\nutch-0.7.2\plugins
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\clustering-carrot2
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\creativecommons
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\index-basic\plugin.xml
080901 193120 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\index-more
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\language-identifier
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\nutch-extensionpoints\plugin.xml
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\ontology
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\parse-ext
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\parse-html\plugin.xml
080901 193120 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\parse-js
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\parse-msword
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\parse-pdf
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\parse-rss
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\parse-text\plugin.xml
080901 193120 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\protocol-file
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\protocol-ftp
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\protocol-http\plugin.xml
080901 193120 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\protocol-httpclient
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\query-basic\plugin.xml
080901 193120 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\query-more
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\query-site\plugin.xml
080901 193120 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\query-url\plugin.xml
080901 193120 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
080901 193120 not including: E:\SearchTools\nutch-0.7.2\plugins\urlfilter-prefix
080901 193120 parsing: E:\SearchTools\nutch-0.7.2\plugins\urlfilter-regex\plugin.xml
080901 193120 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
080901 193120 found resource crawl-urlfilter.txt at file:/E:/SearchTools/nutch-0.7.2/conf/crawl-urlfilter.txt
.080901 193120 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
080901 193120 bad url: "http://localhost:8080/vinweb"
.080901 193120 bad url: "http://www.orkut.co.in"
....080901 193120 Added 0 pages
080901 193120 FetchListTool started
080901 193121 Overall processing: Sorted 0 entries in 0.0 seconds.
080901 193121 Overall processing: Sorted NaN entries/second
080901 193121 FetchListTool completed
080901 193121 logging at INFO
080901 193122 Updating E:\SearchTools\nutch-0.7.2\crawl\db
080901 193122 Updating for E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193120
080901 193122 Finishing update
080901 193122 Update finished
080901 193122 FetchListTool started
080901 193122 Overall processing: Sorted 0 entries in 0.0 seconds.
080901 193122 Overall processing: Sorted NaN entries/second
080901 193122 FetchListTool completed
080901 193122 logging at INFO
080901 193123 Updating E:\SearchTools\nutch-0.7.2\crawl\db
080901 193123 Updating for E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193122
080901 193123 Finishing update
080901 193123 Update finished
080901 193123 FetchListTool started
080901 193123 Overall processing: Sorted 0 entries in 0.0 seconds.
080901 193123 Overall processing: Sorted NaN entries/second
080901 193124 FetchListTool completed
080901 193124 logging at INFO
080901 193125 Updating E:\SearchTools\nutch-0.7.2\crawl\db
080901 193125 Updating for E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193123
080901 193125 Finishing update
080901 193125 Update finished
080901 193125 Updating E:\SearchTools\nutch-0.7.2\crawl\segments from E:\SearchTools\nutch-0.7.2\crawl\db
080901 193125 reading E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193120
080901 193125 reading E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193122
080901 193125 reading E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193123
080901 193125 Sorting pages by url...
080901 193125 Getting updated scores and anchors from db...
080901 193125 Sorting updates by segment...
080901 193125 Updating segments...
080901 193125 Done updating E:\SearchTools\nutch-0.7.2\crawl\segments from E:\SearchTools\nutch-0.7.2\crawl\db
080901 193125 indexing segment: E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193120
080901 193125 * Opening segment 20080901193120
080901 193125 * Indexing segment 20080901193120
080901 193125 * Optimizing index...
080901 193125 * Moving index to NFS if needed...
080901 193125 DONE indexing segment 20080901193120: total 0 records in 0.047 s (NaN rec/s).
080901 193125 done indexing
080901 193125 indexing segment: E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193122
080901 193125 * Opening segment 20080901193122
080901 193125 * Indexing segment 20080901193122
080901 193125 * Optimizing index...
080901 193125 * Moving index to NFS if needed...
080901 193125 DONE indexing segment 20080901193122: total 0 records in 0.0 s (NaN rec/s).
080901 193125 done indexing
080901 193125 indexing segment: E:\SearchTools\nutch-0.7.2\crawl\segments\20080901193123
080901 193125 * Opening segment 20080901193123
080901 193125 * Indexing segment 20080901193123
080901 193125 * Optimizing index...
080901 193125 * Moving index to NFS if needed...
080901 193125 DONE indexing segment 20080901193123: total 0 records in 0.0 s (NaN rec/s).
080901 193125 done indexing
080901 193125 Reading url hashes...
080901 193125 Sorting url hashes...
080901 193125 Deleting url duplicates...
080901 193125 Deleted 0 url duplicates.
080901 193125 Reading content hashes...
080901 193125 Sorting content hashes...
080901 193125 Deleting content duplicates...
080901 193125 Deleted 0 content duplicates.
080901 193125 Duplicate deletion complete locally. Now returning to NFS...
080901 193125 DeleteDuplicates complete
080901 193125 Merging segment indexes...
080901 193125 crawl finished: crawl

*******************************************

and the following entries are in my crawl-urlfilter.txt:

*******************************************

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in MY.DOMAIN.NAME

# synapse is my domain name
+^http://([a-z0-9]*\.)*synapse.com

+^http://([a-z0-9]*\.)*apache.org

+^http://([a-z0-9]*\.)*localhost:8080/vinweb

+^http://([a-z0-9]*\.)*orkut.co.in


# skip everything else
-.

*************************************************

And the search in the web UI returns NULL.

Any suggestion will be very helpful.

/**********************************/

At the same time, please suggest which is better to use: Lucene or Nutch?

I have to implement this in my Struts/JBoss application.

Thanks in advance,
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42061
    
I have no hands-on experience with Nutch, but it's just a web crawling engine. I'm fairly certain that it uses Lucene underneath to do the indexing and searching.
Sachin Joshi
Ranch Hand

Joined: Aug 06, 2008
Posts: 83

If you are looking for database-level search, then you need to store all the required data in a search engine index.

Fetch all searchable data from the database and store it in a (say, Lucene) index.

Direct search on the database may not be that effective if you have to run SQL for each search.


Web Development Tips and Tutorials - By Sachin
sapna rana
Greenhorn

Joined: Sep 01, 2008
Posts: 18
Hi,

I have implemented Lucene in my application and am able to index and search PDF, Word, text, and HTML documents.
Please point me to references on how to parse and index Excel and XML.

Thanks and regards
Paul Sturrock
Bartender

Joined: Apr 14, 2004
Posts: 10336

Doesn't Lucene's own documentation have links for that? For Excel you just need to use POI. For XML you can either parse it first or just treat it as plain text.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42061
    
You'd need to write code that reads XLS files and extracts the text from them; then you can feed the text to Lucene. Apache POI is a library that allows you to access the text in an XLS file.

For XML it's probably easiest to use the SAX API; the characters method of the document handler provides you with the text contained in the file.
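A minimal sketch of that SAX approach (the class name is illustrative): collect whatever the characters method delivers and hand the resulting plain text to Lucene for indexing.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: pulling the text content out of an XML document with SAX,
    so it can then be indexed by Lucene as plain text. */
public class XmlTextExtractor {

    public static String extractText(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            // characters() receives the raw character data between tags,
            // possibly in several chunks; we simply append them all.
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }
}
```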
sapna rana
Greenhorn

Joined: Sep 01, 2008
Posts: 18
How can we index a database using Lucene?
I have tried to write a DBIndex that retrieves some records from the database and then writes an index file. But when I search for the same value, no results are found.

Please provide details on how we can search a database, as that is our main focus.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39095
    
I think if you are searching databases, this thread is no longer a "beginner's" thread. I shall have to move it.
Paul Sturrock
Bartender

Joined: Apr 14, 2004
Posts: 10336

Originally posted by sapna rana:
How can we index a database using Lucene?
I have tried to write a DBIndex that retrieves some records from the database and then writes an index file. But when I search for the same value, no results are found.

Please provide details on how we can search a database, as that is our main focus.


There is no more detail to add, really. You need to index your source of data; if it's a database, your indexer will need to connect via JDBC to do this. Can you show us your code and the query you expect to return results?

Also, there is a tool called Luke that will let you examine the index and run ad-hoc queries. Sometimes the issue is nothing more than a mistake in your query syntax.
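For reference, a minimal search against such an index might look like the sketch below. It assumes the Lucene 2.x API, an index directory of c:\dbindex, and a tokenized field named contents; substitute whatever names and paths your indexer actually wrote.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

/** Sketch: searching an existing Lucene index (Lucene 2.x API). */
public class DBSearch {

    // The field name searched by default; must match what the indexer used.
    static final String DEFAULT_FIELD = "contents";

    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("c:\\dbindex");
        // Use the same analyzer at search time as at index time.
        QueryParser parser = new QueryParser(DEFAULT_FIELD, new StandardAnalyzer());
        Query query = parser.parse(args.length > 0 ? args[0] : "smith");
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " hits");
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println(doc.get("id"));
        }
        searcher.close();
    }
}
```

If this returns 0 hits for a value you know is in the index, Luke is the quickest way to see what was actually indexed.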
sapna rana
Greenhorn

Joined: Sep 01, 2008
Posts: 18
Please find my code as follows:
*****************************************************

package com.knowledgebooks.utils;

import java.io.File;
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class DBIndex {

    private Connection con;
    private String dbDriver, connectionURL, user, password;

    public DBIndex() {
        con = null;
        dbDriver = "com.mysql.jdbc.Driver";
        connectionURL = "jdbc:mysql://172.16.80.214:3306/vinprocure1";
        user = "root";
        password = "root";
    }

    public void setDBDriver(String driver) {
        this.dbDriver = driver;
    }

    public void setConnectionURL(String connectionURL) {
        this.connectionURL = connectionURL;
    }

    public void setAuthentication(String user, String password) {
        this.user = user;
        this.password = password;
    }

    public Connection getConnection() {
        try {
            Class.forName(dbDriver);
            con = DriverManager.getConnection(connectionURL, user, password);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return con;
    }

    // Unused in main(), but kept: checks whether an index already exists.
    private boolean isIndexExist(String indexPath) {
        boolean exist = false;
        try {
            IndexReader ir = IndexReader.open(indexPath);
            exist = true;
            ir.close();
        } catch (IOException e) {
            System.out.println("ioexception: " + e);
        } catch (Exception e) {
            System.out.println("exception: " + e);
        }
        return exist;
    }

    public static void main(String[] args) {
        DBIndex dbi = new DBIndex();
        try {
            Connection connection = dbi.getConnection();
            String query = "select user_id,login_name,first_name,last_name,email_address from vinusers";
            Statement statement = connection.createStatement();
            ResultSet rs = statement.executeQuery(query);

            // true = create a new index, overwriting any existing one
            IndexWriter writer = new IndexWriter(new File("c:\\dbindex"),
                    new StandardAnalyzer(), true);

            while (rs.next()) {
                // Concatenate the searchable columns into a single string
                String contents = rs.getString(2) + " " + rs.getString(3) + " " + rs.getString(4);

                System.out.println("Indexing content no. (ID) " + rs.getString(1) + "\n" + contents);

                // One Lucene document per database record. The contents field
                // must be tokenized (indexed), otherwise searches find nothing.
                Document doc = new Document();
                doc.add(new Field("contents", contents, Field.Store.NO, Field.Index.TOKENIZED));
                doc.add(new Field("id", rs.getString(1), Field.Store.YES, Field.Index.UN_TOKENIZED));
                writer.addDocument(doc);
            }
            writer.optimize();
            writer.close();
            rs.close();
            statement.close();
            connection.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
****************************************************************

It creates files in C:\dbindex and writes the index into them,
containing the results of the query fetched above.

When I search for one of those names, 0 results are found.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42061
    
Please go back and edit your post to UseCodeTags. It's unnecessarily hard to read as it is.

How are you searching the index?
 