A story of 8 servers - a distributed design & architecture issue

 
Kareem Gad
Ranch Hand
Posts: 89
Hi guys/gals ...

I've been given the task of architecting and designing a solution to the following problem domain. I have something in mind but I'd like to hear from everyone as it is my first real distributed application experience.

Situation:
-8 anti-virus servers generate log files daily, each in its own separate repository.
-Each month a report needs to be generated showing the top ten virus occurrences & the overall count of virus occurrences. (All of this information is available in the logs.)
-The estimated size of all log files from the 8 servers to be analyzed each month is around 1.5 GB of compressed XML logs. (A single zipped log file is 10 to 12 KB; rough file-count arithmetic below.)
-Currently all files from the 8 servers are copied manually to a single location and a UNIX shell script is run to perform all the required analysis.
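
(Rough arithmetic: at 10 to 12 KB per zipped file, 1.5 GB comes to somewhere around 130,000 to 160,000 individual files per month, give or take.)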

The Problem:
-The script takes 2 days to analyze all the log files.

The Opportunity:
-There are some idle servers that can be utilized as extra processing power to reduce the time this script needs for completion.

Proposed Solution:
-Replace the shell script with a distributed application based on Java technology & SALSA, taking advantage of idle processing power on available servers.


Any thoughts on a recommended architecture ?
 
William Brogden
Author and all-around good cowpoke
Posts: 13078
The first thing that strikes me about this problem may not even be one of your options - but anyway:
Reports at once-a-month intervals seem WAY too slow for the rapidly changing virus threat. If the log file format permits, why not digest on a daily basis into a format that makes it easy to aggregate any number of daily reports? For example, a running 7-day average would be handy in spotting trends.
Thus your one big digestion problem turns into a lot of little digestion problems that could be distributed more easily.
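
To make the idea concrete, here is a made-up sketch (class and method names invented for illustration) of the kind of aggregation a daily digest would allow - once each day boils down to a map of virus name to count, a weekly, monthly, or rolling 7-day report is just a merge of those maps:

import java.util.*;

// Sketch only: assumes each daily digest reduces to a Map of
// virus name -> occurrence count. Any number of days can then be merged.
public class DigestAggregator {

    // Merge several daily count maps into one running total.
    public static Map<String, Long> merge(List<Map<String, Long>> dailyCounts) {
        Map<String, Long> total = new HashMap<String, Long>();
        for (Map<String, Long> day : dailyCounts) {
            for (Map.Entry<String, Long> e : day.entrySet()) {
                Long soFar = total.get(e.getKey());
                total.put(e.getKey(), (soFar == null ? 0L : soFar) + e.getValue());
            }
        }
        return total;
    }

    // Top N viruses by count - the "top ten" part of the report.
    public static List<Map.Entry<String, Long>> topN(Map<String, Long> total, int n) {
        List<Map.Entry<String, Long>> entries =
                new ArrayList<Map.Entry<String, Long>>(total.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Long>>() {
            public int compare(Map.Entry<String, Long> a, Map.Entry<String, Long> b) {
                return b.getValue().compareTo(a.getValue());
            }
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }
}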

Bill
 
Kareem Gad
Ranch Hand
Posts: 89
Hi Bill,
I think you are absolutely right on both counts: first, that monthly reporting is way too slow for tracking virus attacks, and second, that they could have opted for weekly reports and then summed up the values every month to reduce the big hassle. BUT after speaking with them about this, it turns out the report really is only needed once a month.

They apparently have other mechanisms to follow up on virus attacks; this report is just for administrative purposes.

I'm posting my view on the approach to take in the next message; let me know what you think.
 
Kareem Gad
Ranch Hand
Posts: 89
Some info & prerequisites
-They have a constraint on network shares, so the log files will have to be SSH'd to a single file server before we start.
-I've decided to give it a go with SALSA.
-The SALSA framework defines "Theaters" & "Actors"; you can read more about them on the SALSA homepage.

The algorithm I have in mind:

- Arrange for copying the log files from the 8 servers to a single file server (let's call it FSX).
- A "maestro" application starts by identifying the period required for the report.
- The "maestro" identifies the list of log files on FSX that are relevant to this period.
- Depending on the number of log files & the number of available Theaters, the "maestro" instantiates a proportionate number of file reader threads. [Assuming a many-to-one relationship between log files and file readers.]
- File readers read the files and prepare them as byte arrays.
- Each file reader creates a FileGroupAnalyzer (FGA) actor on an available theater and passes the byte array of each log file it has read to that group analyzer.
- The file group analyzer divides the work among separate FileAnalyzer (FA) actors, which do the actual work of locating the pattern being searched for and counting it.
- The FGA receives the responses from each FA, consolidates them, and prepares them to be sent back to the "maestro".
- The maestro receives the responses from each FGA, consolidates them, and prepares them to be presented appropriately as the requested report.
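
In rough plain-Java terms the data flow looks like this (the SALSA theater/actor plumbing is left out entirely, and the class/method names are just the ones from the steps above, not real API):

import java.util.*;

// Local, single-JVM sketch of the intended data flow only. In the real design
// the FileGroupAnalyzer and FileAnalyzer pieces would be SALSA actors running
// on remote theaters; none of that is shown here.
public class MaestroSketch {

    // FileAnalyzer: counts virus occurrences in one decompressed log.
    static Map<String, Integer> analyzeOne(byte[] logContents) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        // ... scan logContents for virus entries and count them ...
        return counts;
    }

    // FileGroupAnalyzer: fans the work out over its FileAnalyzers and consolidates.
    static Map<String, Integer> analyzeGroup(List<byte[]> group) {
        Map<String, Integer> merged = new HashMap<String, Integer>();
        for (byte[] log : group) {
            mergeInto(merged, analyzeOne(log));
        }
        return merged;
    }

    // Maestro: consolidates the group results into the final report data.
    static Map<String, Integer> consolidate(List<Map<String, Integer>> groupResults) {
        Map<String, Integer> report = new HashMap<String, Integer>();
        for (Map<String, Integer> partial : groupResults) {
            mergeInto(report, partial);
        }
        return report;
    }

    static void mergeInto(Map<String, Integer> target, Map<String, Integer> source) {
        for (Map.Entry<String, Integer> e : source.entrySet()) {
            Integer soFar = target.get(e.getKey());
            target.put(e.getKey(), (soFar == null ? 0 : soFar) + e.getValue());
        }
    }
}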


I need to know the downsides of this approach and where I can tweak/tune it for performance, which is my primary objective.
 
Ranch Hand
Posts: 346
So, if they are so sure it is *only* needed once a month and *only* as an administrative, non-actionable summary, then why is 2 days of runtime actually a problem? What does the business say the return on investment for such an effort is?

I would imagine that simply updating the UNIX script to rsh/ftp to the 8 servers and copy the log files programmatically, instead of doing it manually, should be all that's needed.
 
William Brogden
Author and all-around good cowpoke
Posts: 13078

File readers read the files and prepare them as byte arrays.


How large are these files anyway? Is fitting an entire log into memory a good idea (or am I not understanding what you mean)?

I think any approach will need to keep in mind the effect on performance of the balance between network transmission time, local file reading time and CPU analysis time. It would be a mistake to charge in with a complete architecture before you have a clear idea of what is really time consuming.

Years ago - at the beginning of the Java era - I wrote a text indexing program that used three Threads. One read disk files and prepared byte[] blocks of data, locating the start points of interest in each block. These data structures were queued up for a sorting Thread to work on; the final Thread took the sorted data structures and merged them into the master file on disk. On a two-CPU NT system this managed to keep both CPUs about 70% busy, an improvement over a single Thread doing all the work, but less than I had hoped.
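
In today's java.util.concurrent terms the shape was roughly this (the original predates these classes, and the names here are made up):

import java.util.concurrent.*;

// Rough shape of the three-stage pipeline: reader -> sorter -> merger,
// each stage on its own thread, connected by bounded queues.
public class PipelineSketch {
    static final byte[] DONE = new byte[0];   // poison pill to shut the pipeline down

    public static void main(String[] args) {
        final BlockingQueue<byte[]> rawBlocks    = new ArrayBlockingQueue<byte[]>(16);
        final BlockingQueue<byte[]> sortedBlocks = new ArrayBlockingQueue<byte[]>(16);

        new Thread(new Runnable() {                   // stage 1: read files into blocks
            public void run() {
                try {
                    // for each disk file: read it, find the start points, then
                    // rawBlocks.put(block);
                    rawBlocks.put(DONE);              // no more input
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }
        }).start();

        new Thread(new Runnable() {                   // stage 2: sort each block
            public void run() {
                try {
                    byte[] block;
                    while ((block = rawBlocks.take()) != DONE) {
                        sortedBlocks.put(block);      // real code would sort/index here
                    }
                    sortedBlocks.put(DONE);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }
        }).start();

        new Thread(new Runnable() {                   // stage 3: merge into master file
            public void run() {
                try {
                    byte[] block;
                    while ((block = sortedBlocks.take()) != DONE) {
                        // merge the sorted block into the master file on disk
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }
        }).start();
    }
}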

I think the moral of that story is "get some real measurements before building a complex architecture."

SALSA certainly looks interesting and your problem may be a good fit - please keep us up to date with your progress.
Bill
 
Kareem Gad
Ranch Hand
Posts: 89
Each log file is a gzipped XML document; the size of each .gz file is about 12 KB. But as I mentioned, after 1 month these log files accumulate from the 8 servers to a total size of 1.5 GB, all in the form of those 12 KB zipped files.

There's the overhead of unzipping them and then reading them (either as plain text or by parsing the XML). I think reading them as plain text will save on the overhead of reading each file, but parsing the XML could be very beneficial when analyzing the files at the next stage.

I think it is wise to go for a "measure first" approach at this stage; it will certainly provide a lot of input into the overall architecture and approach.

I guess I'll need to measure the overhead of:
- reading the gzip files
- uncompressing them
then either
- reading the XML document as a plain text file
or
- parsing the XML document into a DOM tree or streaming it through SAX

then I need to sample the overhead of the analysis algorithm on
- the plain text array
or
- the parsed XML object
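
Something along these lines is what I have in mind for the first round of measurements (the directory argument and class name are placeholders, and the SALSA side is ignored completely):

import java.io.*;
import java.util.zip.GZIPInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// First-cut timing harness: compare "gunzip + read as text" against
// "gunzip + SAX parse" over a directory of .gz log files.
public class LogTimingSketch {

    public static void main(String[] args) throws Exception {
        File[] logs = new File(args[0]).listFiles();   // directory of .gz files

        long start = System.currentTimeMillis();
        for (File f : logs) readAsText(f);
        System.out.println("plain read: " + (System.currentTimeMillis() - start) + " ms");

        start = System.currentTimeMillis();
        for (File f : logs) saxParse(f);
        System.out.println("SAX parse:  " + (System.currentTimeMillis() - start) + " ms");
    }

    static void readAsText(File f) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(f))));
        while (in.readLine() != null) { /* just pull the data through */ }
        in.close();
    }

    static void saxParse(File f) throws Exception {
        InputStream in = new GZIPInputStream(new FileInputStream(f));
        // empty handler: measures parse overhead only, no analysis yet
        SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler());
        in.close();
    }
}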


Would you suggest I sample a single file and just extrapolate the calculations to the expected volume, or should I increase the sample size?

I know I also need to measure the overhead of sending object references to the multiple actors over the network through the chosen framework (SALSA).

Anything I missed ?
[ February 12, 2006: Message edited by: Kareem Gad ]
 
Kareem Gad
Ranch Hand
Posts: 89
Hi Gagan,

I know what you mean, but for reasons beyond our understanding (usually called "business requirements") they think they need an improvement on the 2-day job. Perhaps once it's faster they might think of running it more frequently.

Besides the problem domain itself, I'm interested to hear your insights on the general approach to resolving such an issue.
 
William Brogden
Author and all-around good cowpoke
Posts: 13078

Would you suggest I sample a single file and just extrapolate the calculations to the expected volume, or should I increase the sample size?


Interesting question!
In order to measure file reading and network delays you will have to use a big set of files - otherwise the operating system file buffers will make the operations appear to be much faster than they would be in reality.
Given a file in memory, I think you could get an estimate of XML parsing time by repeatedly parsing the same file.
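
Something like this would give a first estimate (the file name and repeat count are placeholders):

import java.io.*;
import java.util.zip.GZIPInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Estimate pure XML parsing cost by parsing the same in-memory document
// repeatedly - no disk or network in the loop.
public class ParseEstimate {
    public static void main(String[] args) throws Exception {
        byte[] xml = gunzip(new File("sample-log.xml.gz"));

        // reusing one parser instance for sequential parses
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        int repeats = 1000;
        long start = System.currentTimeMillis();
        for (int i = 0; i < repeats; i++) {
            parser.parse(new ByteArrayInputStream(xml), new DefaultHandler());
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("average parse time: " + (elapsed / (double) repeats) + " ms");
    }

    static byte[] gunzip(File f) throws IOException {
        InputStream in = new GZIPInputStream(new FileInputStream(f));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) != -1; ) out.write(buf, 0, n);
        in.close();
        return out.toByteArray();
    }
}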

Rather than hand-writing timing code for every case, look into the JAMon toolkit, which can time a bunch of different methods.
Bill
 