We are designing an application to enable full-text query and display of documents encoded in Unicode. The target language is classical Greek containing lots of weird accent marks, etc. The documents have to be encoded in a flavor of XML used by most universities for text markup called TEI (Text Encoding Initiative). Document metadata, title, author, and so on, are included in the xml file along with the text of a given document.
What database strategy should I use to best enable (a) full-text search of the unicode, (b) routine meta-data searches by author, title, etc., both of which would allow for multiple views of the same xml document--either, for example, display of the entire document, or simply a list of authors, titles, etc.?
The choice seems to lie between,
Relational database in combination with a full-text search engine like Jarkarta Lucene...
An embedded xml database such as Tamino or Exist.
I am more familiar, and comfortable with option #1. The xml-encoded document would be stored in a single database column, and the title, author, and other metadata information would be reduplicated in adjacent database columns or perhaps in other tables, etc. We have prototyped a unicode full-text searcher in Lucene.
But there are problems with this. Reduplicating the data in additional columns is clumsy, and for proper searching in Lucene the text data would either have to be stripped out of the xml to begin with, or else cumbersome filters would have to be written to strip out the xml on the fly.
The altnative, it seems to me, is to use one of the new xml databases. These would allow, supposedly, for xml-based queries of the full-text unicode as well as of the metadata, and would store XMl documents one by one, with no mapping required. I am less familiar with these, however. I do not much much about their query languages or much about their performance and flexibility. It seems I would always be dealing with xml formatted data, instead of simple strings, which would require xml parsing even for simple kinds of data.
So, which do you think is the best? 1. breakdown of xml into relation model? 2. use of some kind of additional xml-query functionality to access the xml document in the relational column (Oracle, I think, provides something like this ability)? 3. use an xml embedded database to the exclusion of the relational database model?
Huge thanks in advance for any comments or opinions you might have to offer!
Have you considered keeping your documentation in XML file format and using a collection builing tool like Verity? I've worked on some Content Management systems where this was the route we took. Verity (substitue in any other simmilar product here, but Verity is the only one I am familiar with so I'll just refer to it) is a specialist product for searchable collection building. The advantages are that your don't have to put anything in a Database, something which doesn't make sense anyway. Your docs contain their own meta-data, so what do you gain by repeating it elsewhere? And of course defining it in two places means there is a synchronization headache too. Also searching a Verity collection can be much more perforant than searching a DB (especially if your Relational model were relatively complex). [ January 17, 2005: Message edited by: Paul Sturrock ]
Verity sounds pretty interesting. Thanks much for the input.
Here are three questions about Verity: (1) does it offer a a set of Java interfaces such that I could integrate it within a larger application including, say, a web-service and a Swing GUI interface? (2) does it handle unicode searches of tagged xml data; (3) might I,as a member of an academic project based at a University (Oxford in the UK) be able to obtain it for free or for a reduced cost?
As an example of 2, an example of a greed word tagged in our system. We are putting on certain Greek texts that are tattered and fragmentary, and so tagged with various markings that show where the manuscript cannot be read. Here's an example, using an English phrase, "This is a word in the text" to demonstrate the general point. Assume that the last part of the word "word" is damaged in the manuscript and had to be guessed:
<line-segment>This is a w> <editors-guess>ord </editors-guess> <line-segment>in the text.</line-segment>
In this example, the <editors-guess> tag would be replaced, by XSLT, to render a display of "This is a w[ord] in the text."
Could verity manage such a search? The alternative, clumsy for the reasons I have mentioned but nevertheless functional, is to use Lucene with filters that strip out either all the <editors-guess> xml or else, in the finished text, the "[" and "]" enabling a search on "word"
Great project. What you are describing sounds like either A) a full-text database application, or B) a full-text search of filesystem data. Either of these would work depending on one or multiple concurrent clients. Either of these would perform depending on building & storing indexes for the search.
Filtering text CDATA from XML is easy to code, but you're correct that this would be inefficient to process repeatedly. Either DBMS or filesystem approach should manage this process so as to perform it only once upon document loading/ modification.
The Database option does have advantages in allowing concurrent access & maintenance of the document library. Separate columns (CLOB or BLOB) would store the XML version and perhaps also the 'stripped' plain text, but the main requirement would be storage of Lucene's index data. The issue here would be concurrent access and whether index data could be stored/ updated by document chunks, or as one monolithic central index (more difficult for concurrent updates).
I'm not sure what the advantages of an XML database would be as text-search is the particular challenge and XML databases wouldn't offer any extra help in this regard.