JavaRanch » Java Forums » Java » Distributed Java

A story of 8 servers - a distributed design & architecture issue

Kareem Gad
Ranch Hand

Joined: Aug 06, 2001
Posts: 89
Hi guys/gals ...

I've been given the task of architecting and designing a solution to the following problem domain. I have something in mind but I'd like to hear from everyone as it is my first real distributed application experience.

Situation:
-8 anti-virus servers generate log files daily, each in its own repository.
-Each month a report needs to be generated showing the top ten virus occurrences & the overall count of virus occurrences. (All information is available in the logs.)
-The estimated size of all log files from the 8 servers to be analyzed each month is around 1.5 GB of compressed XML logs. (A single zipped log file is 10-12 KB.)
-Currently all files from the 8 servers are copied manually to a single location and a UNIX shell script is run to perform all the required analysis.

The Problem:
-The script requires 2 days to complete analyzing all log files.

The Opportunity:
-There are some idle servers that can be utilized as extra processing power to reduce the time this script needs for completion.

Proposed Solution:
-Replace the shell script with a distributed application based on Java technology & SALSA, taking advantage of idle processing power on available servers.


Any thoughts on a recommended architecture?


KaReEm (SCJP), Free Range Web Developer
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12823
The first thing that strikes me about this problem may not even be one of your options - but anyway:
Reports at once-a-month intervals seem WAY too slow for the rapidly changing virus threat. If the log file format permits, why not digest on a daily basis into a format that makes it easy to aggregate any number of daily reports? For example, a running 7-day average would be handy in spotting trends.
Thus your one big digestion problem turns into a lot of little digestion problems that could be distributed more easily.
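One toy way to sketch that daily-digest idea, assuming (purely for illustration) that a day's digest is just a per-virus count map:

```java
import java.util.*;

// A toy sketch of the daily-digest idea: reduce each day's logs to per-virus
// counts, then aggregate any window of daily digests (a month, or a running
// 7-day window). Virus names and counts are made up for illustration.
public class DailyDigest {

    // Merge a window of daily per-virus counts into one aggregate tally.
    static Map<String, Integer> aggregate(List<Map<String, Integer>> days) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> day : days)
            day.forEach((virus, n) -> total.merge(virus, n, Integer::sum));
        return total;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> window = List.of(
                Map.of("W32.Klez", 5, "Nimda", 2),
                Map.of("W32.Klez", 3));
        Map<String, Integer> total = aggregate(window);
        System.out.println(total.get("W32.Klez") + " " + total.get("Nimda")); // prints "8 2"
    }
}
```

The monthly report then becomes a cheap fold over ~30 tiny digests instead of one pass over 1.5 GB of raw logs.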

Bill
Kareem Gad
Ranch Hand

Joined: Aug 06, 2001
Posts: 89
Hi Bill,
I think you are absolutely right on both counts: first, that once a month is way too slow for reporting on virus attacks; second, that they could have opted for weekly reports and then summed the values every month to reduce the big hassle. BUT after speaking with them about this, it turns out the report is only needed once a month.

They apparently have other mechanisms to follow up on virus attacks; this report is just for administrative purposes.

I'm posting my view on the approach in the next message; let me know what you think.
Kareem Gad
Ranch Hand

Joined: Aug 06, 2001
Posts: 89
Some info & pre-requisites
-They have a constraint on network shares, so log files will have to be copied over SSH to a single file server before we start.
-I've decided to give it a go with SALSA.
-The SALSA framework defines "Theaters" & "Actors"; read more about them on the SALSA homepage.

The algorithm I have in mind :

- arrange for copying the log files from the 8 servers to a single file server (let's call it FSX).
- a "maestro" application starts identifying the period required for the report.
- "maestro" will identify the list of log files on FSX that are related to this period.
- depending on the number of log files & the number of available Theaters, the "maestro" will instantiate a proportionate number of file reader threads. [assuming a many-to-one relationship between log files and file readers]
- file readers will read files and prepare them into byte arrays.
- each file reader will create a FileGroupAnalyzer (FGA) actor on available theaters and pass on the byte array of each log file read to that group analyzer.
- the file group analyzer will divide the work to separate FileAnalyzer (FA) actors, which will do the actual work of locating the pattern being searched for and counting it.
- the FGA receives the responses from each FA and consolidates them, preparing them to be sent back to the "maestro".
- the maestro receives the responses from each FGA, consolidates them, and prepares them for presentation as the requested report.
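The scatter-gather shape of the steps above can be sketched with plain java.util.concurrent in place of SALSA actors and theaters. Class, method, and log-format names here are illustrative assumptions, not the real system:

```java
import java.util.*;
import java.util.concurrent.*;

// Hedged sketch of the maestro/FGA/FA flow: scatter one analysis task per log
// to a worker pool, then gather and consolidate the per-log counts.
public class MaestroSketch {

    // A stand-in "FileAnalyzer": count virus occurrences in one log's text,
    // assuming (for illustration) each hit is a line containing "virus=NAME".
    static Map<String, Integer> analyze(String logText) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : logText.split("\n")) {
            int i = line.indexOf("virus=");
            if (i >= 0)
                counts.merge(line.substring(i + 6).trim(), 1, Integer::sum);
        }
        return counts;
    }

    // The "maestro": submit tasks, then merge each worker's result map.
    public static Map<String, Integer> run(List<String> logs, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Map<String, Integer>>> futures = new ArrayList<>();
            for (String log : logs)
                futures.add(pool.submit(() -> analyze(log)));
            Map<String, Integer> total = new HashMap<>();
            for (Future<Map<String, Integer>> f : futures)
                f.get().forEach((virus, n) -> total.merge(virus, n, Integer::sum));
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> total = run(List.of(
                "virus=W32.Klez\nvirus=W32.Klez",
                "virus=Nimda\nvirus=W32.Klez"), 2);
        System.out.println(total.get("W32.Klez") + " " + total.get("Nimda")); // prints "3 1"
    }
}
```

In the SALSA version the pool.submit/Future.get pair would become asynchronous messages between the maestro and the FGA/FA actors, but the consolidation logic is the same.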


I need to know the pitfalls of this approach and where I can tweak/tune it for performance, which is my primary objective.
Gagan Indus
Ranch Hand

Joined: Feb 28, 2001
Posts: 346
So, if they are so sure it is *only* needed once a month and *only* for an administrative, non-actionable summary, then why is a 2-day runtime actually a problem? What does the business say the return on investment for such an effort is?

I would imagine that simply updating the UNIX script to rsh/ftp to the 8 servers and copy the log files programmatically, instead of doing it manually, should be all that's needed.


Gagan (/^_^\) SCJP2 SCWCD IBM486. Die-hard JavaMonk: a little Java a day keeps you going. Blog: http://www.objectfirst.com/blog
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12823
Quote: "file readers will read files and prepare them into byte arrays."

How large are these files anyway? Is fitting an entire log into memory a good idea (or am I not understanding what you mean)?

I think any approach will need to keep in mind the effect on performance of the balance between network transmission time, local file reading time and CPU analysis time. It would be a mistake to charge in with a complete architecture before you have a clear idea of what is really time consuming.

Years ago - at the beginning of the Java era - I wrote a text indexing program that used three Threads. One read disk files and prepared byte[] blocks of data, locating the start points of interest in each block. These data structures were queued up for a sorting Thread to work on; the final Thread took sorted data structures and merged them into the master file on disk. On a two-CPU NT system this managed to keep both CPUs about 70% busy, an improvement over a single Thread doing all the work, but less than I hoped.
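That three-Thread hand-off can be sketched with BlockingQueues. The stage work here is trivial placeholder logic, and the poison-pill sentinel is an assumption about how the stages signal end-of-input:

```java
import java.util.concurrent.*;

// Minimal reader -> sorter -> writer pipeline: each stage runs on its own
// Thread, with bounded queues as the hand-off so a fast stage can't flood
// a slow one.
public class PipelineSketch {
    static final String POISON = "\u0000EOF";   // sentinel marking end of input

    public static String run(String[] blocks) {
        BlockingQueue<String> readToSort = new ArrayBlockingQueue<>(16);
        BlockingQueue<String> sortToWrite = new ArrayBlockingQueue<>(16);
        StringBuilder out = new StringBuilder();

        Thread reader = new Thread(() -> {      // stage 1: produce data blocks
            try {
                for (String b : blocks) readToSort.put(b);
                readToSort.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread sorter = new Thread(() -> {      // stage 2: transform blocks
            try {
                String s;
                while (!(s = readToSort.take()).equals(POISON))
                    sortToWrite.put(s.toUpperCase());
                sortToWrite.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread writer = new Thread(() -> {      // stage 3: merge results
            try {
                String s;
                while (!(s = sortToWrite.take()).equals(POISON))
                    out.append(s);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        reader.start(); sorter.start(); writer.start();
        try {
            reader.join(); sorter.join(); writer.join();
        } catch (InterruptedException e) { throw new RuntimeException(e); }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(run(new String[] {"b", "a", "c"})); // prints "BAC"
    }
}
```

The bounded queue capacity (16 here, arbitrary) is what balances the stages: it decides how far the reader can run ahead of the analyzer before blocking.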

I think the moral of that story is "get some real measurements before building a complex architecture."

SALSA certainly looks interesting and your problem may be a good fit - please keep us up to date with your progress.
Bill
Kareem Gad
Ranch Hand

Joined: Aug 06, 2001
Posts: 89
Each log file is a gzipped XML document, and the .gz file is about 12 KB. But as I recall, after one month these log files accumulate from the 8 servers to a total of about 1.5 GB, all in the form of those 12 KB zipped files.

There's the overhead of unzipping them and then reading them (either as plain text or by parsing the XML). I think reading them as plain text will save on the reading overhead, but parsing the XML could be very beneficial when analyzing the files at the next stage.

I think it is wise to go for a "measure first" approach at this stage; it will certainly provide a lot of input into the overall architecture and approach.

I guess I'll need to measure the overhead of:
- reading the gzip files
- uncompressing them
then either
- reading the XML document as a plain text file
or
- parsing the XML document with DOM or SAX

then I need to sample the overhead of the analysis algorithm on
- the plain text array
or
- the parsed XML object
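The first two measurements can be sketched with a small self-contained harness. The XML snippet is illustrative, and the gzip bytes are built in memory so no real log files are needed; for real numbers you'd point this at the actual 12 KB .gz files:

```java
import java.io.*;
import java.util.zip.*;

// Gunzip a log and read it as plain text, timing the work with nanoTime.
public class GzipTiming {
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(text.getBytes("UTF-8"));
            }
            return bos.toByteArray();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    static String gunzipToText(byte[] compressed) {
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(compressed)), "UTF-8"))) {
            StringBuilder sb = new StringBuilder();
            for (String line; (line = r.readLine()) != null; )
                sb.append(line).append('\n');
            return sb.toString();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public static void main(String[] args) {
        byte[] log = gzip("<log><virus name=\"W32.Klez\"/></log>");
        long t0 = System.nanoTime();
        String text = gunzipToText(log);
        long micros = (System.nanoTime() - t0) / 1000;
        System.out.println("decompressed " + text.length() + " chars in " + micros + " us");
    }
}
```

The same t0/nanoTime pattern wraps the DOM-vs-SAX and analysis measurements; just swap the body between the two timestamps.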


Would you suggest I sample a single file and just do the calculations to extrapolate to the expected volume, or increase the sample?

I know I also need to measure the overhead of sending object references to the multiple actors over the network through the chosen framework (SALSA).

Anything I missed ?
[ February 12, 2006: Message edited by: Kareem Gad ]
Kareem Gad
Ranch Hand

Joined: Aug 06, 2001
Posts: 89
Hi Gagan,

I know what you mean, but for reasons beyond our understanding (usually called "business requirements") they think they need an improvement on the 2 day job. Perhaps when it's faster they might think of using it on a more frequent basis or whatever.

Beyond the domain of this particular problem, I'm interested to hear your insights on the general approach to resolving such an issue.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12823
Quote: "Would you suggest I sample a single file and just do the calculations to reach the expected volume or to just increase the sample?"

Interesting question!
In order to measure file reading and network delays you will have to use a big set of files; otherwise the operating system file buffers will make the operations appear much faster than they would be in reality.
Given a file in memory, I think you could get an estimate of XML parsing time by repeatedly using the same file.

Rather than writing a bunch of timing code, look into the JAMon toolkit, which can time a bunch of different methods.
Bill