I am working on an academic project to put ancient manuscripts online. We will display an image of each manuscript page, some metadata fields (author, title, etc.), and formatted ancient Greek text in Unicode characters. We must enable search by the metadata fields and by full text. Later, we will enable (re)editing of each Greek text, and the reconstruction of larger wholes out of collections of multiple Greek texts. To satisfy various grant-funding bodies and to enable convenient formatting of the text (using XSLT), I thought we should store--or at least output--our text data in a certain flavor of XML.
My idea--and I admit it might have been a bit knee-jerk--was to use J2EE + relational database + Lucene as the software platform. The data files would be "redundantly" stored in the RDBMS, with metadata fields "unpacked" from the XML files and stored in table columns, and the entire XML file stored in the last column (most likely as a CLOB). This is all as opposed to using an XML database like eXist. The full text, stripped of XML tags, would be stored as a file for indexing by Lucene. I have prototyped the XSLT display and the Lucene full-text searching and I rather liked my choice of platform.
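A minimal sketch of that "unpacking" step, using only the XML classes that ship with the JDK. The element names (manuscript, author, title, text) and the class name ManuscriptRecord are made up for illustration; the real markup flavor would dictate the XPath expressions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical layout assumed here:
// <manuscript><author>..</author><title>..</title><text>..</text></manuscript>
class ManuscriptRecord {
    final String author;   // would go into a table column
    final String title;    // would go into a table column
    final String fullText; // tag-free text, to hand to the full-text indexer
    final String xml;      // the whole document, e.g. for a CLOB column

    ManuscriptRecord(String author, String title, String fullText, String xml) {
        this.author = author;
        this.title = title;
        this.fullText = fullText;
        this.xml = xml;
    }

    static ManuscriptRecord fromXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            String author = xpath.evaluate("/manuscript/author", doc).trim();
            String title = xpath.evaluate("/manuscript/title", doc).trim();
            // The string value of an element is the concatenation of all its
            // descendant text nodes, so nested markup tags drop out here.
            String fullText = xpath.evaluate("/manuscript/text", doc)
                    .replaceAll("\\s+", " ")
                    .trim();
            return new ManuscriptRecord(author, title, fullText, xml);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse manuscript XML", e);
        }
    }
}
```

The author and title fields would feed the table columns, xml the CLOB, and fullText the indexer, so the XML file stays the single source from which the rest is derived.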
Then I read Robert Martin's book on Agile development. One of the things he said was to keep everything as simple as possible.
This sent me into a tizzy. Have I knee-jerked my way into making things WAY too complicated and over-tooled? What do you think? Part of me suspects no, I have not made things more complicated, but part of me wonders: could I, for example, dispense entirely with the RDBMS and Lucene and perform all searching with simple greps of the XML files before calling them up for the XSLT processing? Do I even need J2EE?
If I were to choose right now, I would stick with my current platform. But perhaps some of you with loads of experience would recommend otherwise(?)
Doesn't sound too out of line if you know and want to use Java. If the whole language, platform, architecture area is wide open, some of the dynamic languages can be very productive. Ruby On Rails is quite the rage.
I hope J2EE just means a servlet container. An EJB container is a very big step, not one I'd try without an experienced lead and some strong requirements I couldn't meet otherwise.
We could probably get a debate going about whether to use frameworks like Struts or Hibernate. On a first app I'd probably avoid them and try to make the simplest framework I could manage. I made a little "front controller servlet" that I'm pretty happy with and Ben Souther (right?) has some great examples of very simple solutions if you ask up in the Servlet forum.
BTW: Lucene doesn't index files (unless there's some "default implementation" I didn't know about.) Instead, you call Lucene methods to index content. You can store a tagged value, like a database primary key, that you get back with search results. You could call Lucene with the content right before you put it in the database and not bother with an external copy. I hope that made things easier right away. I generally trust Lucene to keep its index in nice shape and I only call it with changes, but I kept an older strategy in my system to re-index everything from scratch just in case the index becomes corrupted.
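To show the shape of that idea without leaning on Lucene's actual classes, here is a toy in-memory index in plain Java. TinyIndex and its methods are invented for this sketch and are not Lucene's API; the point is only that you hand the indexer content plus a primary key, and searches give the keys back.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy stand-in for a full-text index; NOT Lucene's API.
class TinyIndex {
    // term -> primary keys of the records whose text contains that term
    private final Map<String, Set<Long>> postings = new HashMap<>();

    // Call this with the content right before inserting the row in the database.
    void index(long primaryKey, String content) {
        for (String term : content.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(primaryKey);
            }
        }
    }

    // Drop every posting for one record, e.g. before re-indexing an edited text.
    void remove(long primaryKey) {
        postings.values().forEach(keys -> keys.remove(primaryKey));
    }

    // Returns the primary keys to look up in the database.
    Set<Long> search(String term) {
        return Collections.unmodifiableSet(
                postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet()));
    }
}
```

Because the index only holds keys, rebuilding it from scratch is just a loop over the database rows calling index() again, which is the "re-index everything" fallback in miniature.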
So your thinking sounds sane so far. What do you think you'll do next?
A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi
Replying to myself I guess in a fresh note because I'm changing gears. I realized that I didn't touch any of the XP notions of keeping things simple.
An XP project would likely run an "architectural spike", a short iteration with a small number of lead techies. They can explore and build just enough architecture to support a small story from end to end. They might even cheat a bit on the ends, say writing flat files instead of building a database just to show the rest of it works. Once you know that works, turn the developer hordes loose to use it.
Some of my favorite lessons from XP are to build only exactly what you need, nothing on speculation that somebody might want to do X or Y in the future, and build it just barely sufficient to do the job right now. With extraordinary care to responsibilities and dependencies, strongly assisted by test driven design, you can keep the framework bits flexible enough to build out as new requirements come along.
For example, my little front controller mentioned above has a feature to require login before handling some requests. It remembers the request, diverts you to a login screen and, after you log in successfully, resumes the original request. That was not in the first release, heck, logging in was not in the first release, but it took only a few minutes to implement.
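Roughly what that remember-divert-resume logic looks like, boiled down to plain Java with no servlet API at all. This is not the actual controller, just a sketch: the class, the session-id strings, the "view:" return values and the "/home" fallback are all assumptions for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified model of a front controller; a real one would sit in a servlet
// and keep this state in the HTTP session.
class FrontController {
    private final Set<String> protectedPaths;
    // session id -> path the user asked for before being diverted to login
    private final Map<String, String> pendingRequests = new HashMap<>();
    private final Set<String> loggedInSessions = new HashSet<>();

    FrontController(Set<String> protectedPaths) {
        this.protectedPaths = protectedPaths;
    }

    // Returns the name of the view to render for this request.
    String handle(String sessionId, String path) {
        if (protectedPaths.contains(path) && !loggedInSessions.contains(sessionId)) {
            pendingRequests.put(sessionId, path); // remember the original request
            return "login";                       // divert to the login screen
        }
        return "view:" + path;
    }

    // Called when the login form is submitted.
    String login(String sessionId, boolean credentialsOk) {
        if (!credentialsOk) {
            return "login"; // stay on the login screen, pending request is kept
        }
        loggedInSessions.add(sessionId);
        String resumed = pendingRequests.remove(sessionId);
        // Resume the original request, or fall back to a default page.
        return handle(sessionId, resumed != null ? resumed : "/home");
    }
}
```

Since all requests already pass through handle(), the login check is one extra branch there, which is why a feature like this can be added late without touching the individual request handlers.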
Thanks, Stan. I have a reasonable amount of experience with Java and OOP, but not vast experience with web frameworks. I also have substantial experience with RDBMS and SQL. My thought therefore was yours: certainly no EJB, and, after some internal struggle, I have decided not--at least at the outset--to use a framework (e.g. Spring). I have used in-house frameworks before, and although I know Spring is supposed to be very good, I am nevertheless leery of the prospect of debugging through framework code when I might not need it. Some kind of controller servlet/command pattern would likely do the trick for a while. Another internal debate was over the database/persistence layer, and I have decided, for the time being, to go either with home-grown JDBC or at most with iBatis. Hibernate indeed looks great but since I have not done lots of coding for a couple of years, I have decided not to worry about persisting object graphs until it seems I need to. With iBatis I would retain control over the SQL sent to the database.
In a prototype of the display formatting I used some initial XSLT complemented by JSTL XML processing. This seemed to work pretty well: the prototype is up at http://126.96.36.199/prototypeApp (to view it properly you need a Unicode font enabled for classical Greek such as code2000). The point of the text markup is to represent a Greek text, originally written on a papyrus manuscript, and to show where letters are missing or obscure, where certain marks (called paragraphoi) are written on the papyrus, etc. The markup conventions--brackets etc.--are standard marking conventions used in academic hardcopy publications.
And thanks for the hint about Lucene. In my Lucene prototype I had simply read the unicode in from text files, since some demo code was available and since this was an easy way to ascertain exactly what unicode text I was processing. But you are right that in a real application I might well want to slurp up the text into Lucene right before insertion into the RDBMS.
The RDBMS would likely be PostgreSQL (I would like Oracle but PostgreSQL is free). Here again, though, I am haunted by the Agile maxims: perhaps I should just use MS Access and be done with it. Of course that would bind me to a Windows platform--maybe we will use Access as a front-end reporting tool. I also don't know if Access can handle a CLOB datatype.
This is all just a paraphrase to invite further discussion. Any comments, questions, suggestions or rebuttals on these ideas are welcome!
Originally posted by Benjamin Weaver: Here again, though, I am haunted by the Agile maxims: perhaps I should just use MS Access and be done with it. Of course that would bind me to a Windows platform--maybe we will use Access as a front-end reporting tool.
The "simplest thing" doesn't mean the most "constraining thing" either.
before you dismiss it out of hand (RoR will work with a number of other databases too). You could always tell yourself that you are prototyping. You'll probably be able to whip up your reports as web pages in a lot less time. Robert C. Martin: Agile Web Development in Rails
I think an important concept from Lean Software Development applies here: use Set Based Decision Making (instead of Point Based Decision Making).
That is, try to keep your options as open as possible as long as possible. Only commit to a single decision at the last responsible moment.
Keep your design flexible, so that you don't have to commit to a persistence mechanism yet. Use a simple approach - such as flat files - to begin with, but in a way that makes it easy to switch to something more elaborate later, should it prove necessary.
Or try several approaches in parallel until one proves to be superior to the others. That will take a little bit more effort now, but can save you a lot of time later, when you aren't committed to a technology that proves to be inappropriate. It also helps you design the system so that you aren't bound to one implementation.
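One way the flat-files-first idea could look in Java, as a sketch only: the rest of the application talks to a small interface, and the flat-file version is just the first implementation behind it. The names TextStore and FlatFileTextStore are made up here, and ids are assumed to be safe to use as file names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// The only thing the rest of the application gets to know about persistence.
interface TextStore {
    void save(String id, String xml);
    Optional<String> load(String id);
}

// First, simplest implementation: one UTF-8 file per text in a directory.
// A JDBC- or iBatis-backed class could implement the same interface later.
class FlatFileTextStore implements TextStore {
    private final Path dir;

    FlatFileTextStore(Path dir) {
        try {
            this.dir = Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void save(String id, String xml) {
        try {
            Files.write(dir.resolve(id + ".xml"), xml.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public Optional<String> load(String id) {
        Path file = dir.resolve(id + ".xml");
        if (!Files.exists(file)) {
            return Optional.empty();
        }
        try {
            return Optional.of(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Trying a second approach in parallel then means writing another class that implements TextStore and running the same tests against both.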
Hope this helps...
The soul is dyed the color of its thoughts. Think only on those things that are in line with your principles and can bear the light of day. The content of your character is your choice. Day by day, what you do is who you become. Your integrity is your destiny - it is the light that guides your way. - Heraclitus
Excellent advice about refraining from commitment, Ilja. Checked out Ruby on R. Wow! Glib, hot, cool. Definitely a new thing. I will definitely try a little of this, a little of that in the forthcoming design, not least on the persistence layer.
I don't have time right now to elaborate, but as I try out a bit of Ruby I will want to find out about (1) its handling of Unicode--UTF-8; (2) how to write business logic and classes in it (I need to learn a bit of it); (3) exception handling and debugging. Ruby does look very interesting though. More later.