Efficient use of Apache compression

 
Joe Areeda
Ranch Hand
Here's the situation:

I have a new servlet that implements a RESTful service which sends rows from a MySQL table to the client as a CSV file.

If I run it uncompressed through Tomcat on localhost it transfers at over 12MB/sec (about a gigabit).

The production and prototype servers use Apache as a front end through an AJP connector, and if I run it uncompressed it transfers at the full speed of my Internet connection, 2MB/s (15Mbps).

If I enable DEFLATE for CSV files it only seems to transfer at about 454KB/s. Compression still cuts the transfer time in half, but it should be more like a factor of 15-20.

The compression ratio I get with gzip or zip is 18.6:1 (559MB vs 30MB).

If I use scp, the compressed transfer takes 15 sec vs 191 sec uncompressed, or 12.7:1. This is on the same systems over the same connection, so I doubt the CPU is maxed out.

The last test I just thought of: I took the file I downloaded, put it on the server as a static file and downloaded it again. With compression turned off it took 195 sec, with compression turned on, 12 sec.

The code that does the transfer looks like:


My question is how to speed this up. I understand the default compression block is 8K, so increasing that a bit might help. I wonder if there's any built-in buffering in the response writer, or whether we'd get a big gain by spawning a task to fetch the rows, convert them to CSV and buffer them in the servlet. Something simpler might be to buffer multiple rows between calls to the response writer, roughly as in the sketch below.
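For reference, the kind of row buffering I'm talking about would look roughly like this - just a sketch, with made-up table, column and JNDI names rather than the real code:

import java.io.IOException;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.sql.DataSource;

public class ChannelCsvServlet extends HttpServlet {

    private DataSource dataSource;

    @Override
    public void init() throws ServletException {
        try {
            // illustrative JNDI name for the connection pool
            dataSource = (DataSource) new InitialContext()
                    .lookup("java:comp/env/jdbc/channelDb");
        } catch (NamingException e) {
            throw new ServletException(e);
        }
    }

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        resp.setContentType("text/csv");
        resp.setCharacterEncoding("UTF-8");
        PrintWriter out = resp.getWriter();

        // Collect many rows, then hand the writer one big chunk, so the
        // compression filter downstream sees large blocks instead of one
        // tiny write per row.
        StringBuilder buf = new StringBuilder(1 << 20);

        try (Connection con = dataSource.getConnection();
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT name, sample_rate, type FROM channels")) {

            int rows = 0;
            while (rs.next()) {
                buf.append(rs.getString(1)).append(',')
                   .append(rs.getInt(2)).append(',')
                   .append(rs.getString(3)).append('\n');
                if (++rows % 10_000 == 0) {        // flush every 10,000 rows
                    out.write(buf.toString());
                    buf.setLength(0);
                }
            }
            out.write(buf.toString());             // whatever is left over
        } catch (SQLException e) {
            throw new ServletException(e);
        }
    }
}

The whole point is to make fewer, larger writes so that whatever compression happens downstream works on big blocks rather than one tiny write per row.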

As always, I appreciate any comments or suggestions before I start experimenting.

Best,
Joe
 
Joe Areeda
Ranch Hand
Well, I tried a few things and managed to knock about a second off of the 191, yeah baby!!

I used a StringBuilder to buffer up to 10,000 rows (out of about 9M); no change.
I changed the Apache compression buffer size from 8K to 32K and set the compression level explicitly to 6. Six seems to be the default anyway, as curl still reports 78.8M downloaded to write a 933M file.
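For the record, the mod_deflate directives I'm touching in httpd.conf look roughly like this (MIME types illustrative; DeflateBufferSize defaults to 8096 bytes, and DeflateCompressionLevel defaults to zlib's default, which appears to be 6):

# Compress CSV and other text responses
AddOutputFilterByType DEFLATE text/csv text/plain text/html

# Raise the compression buffer from the 8K default to 32K
DeflateBufferSize 32768

# zlib level: 1 = fastest, 9 = smallest output
DeflateCompressionLevel 6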

I guess we try a helper task next. That's about all I can think of.
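If we do go the helper-task route, what I have in mind is a bounded queue between a row-fetching/CSV-building task and whatever writes to the response - a rough sketch only (standalone, with made-up data so it runs outside the container; in the servlet the Writer would be the response's PrintWriter):

import java.io.PrintWriter;
import java.io.Writer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class CsvPipeline {

    private static final String EOF = "";           // sentinel marking end of data

    // Producer: in the servlet this would walk the JDBC ResultSet and build CSV chunks
    static void produce(BlockingQueue<String> queue) throws InterruptedException {
        for (int chunk = 0; chunk < 100; chunk++) {
            StringBuilder buf = new StringBuilder();
            for (int row = 0; row < 10_000; row++) {
                buf.append("chan").append(chunk).append('_').append(row)
                   .append(",16384,raw\n");
            }
            queue.put(buf.toString());              // blocks if the writer falls behind
        }
        queue.put(EOF);
    }

    // Consumer: drains chunks and writes them; in the servlet 'out' is response.getWriter()
    static void consume(BlockingQueue<String> queue, Writer out) throws Exception {
        for (String chunk = queue.take(); !chunk.equals(EOF); chunk = queue.take()) {
            out.write(chunk);
        }
        out.flush();
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(4);   // small, bounded
        Writer out = new PrintWriter(System.out);

        Thread producer = new Thread(() -> {
            try {
                produce(queue);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        consume(queue, out);
        producer.join();
    }
}

The bounded queue keeps the row fetching from running arbitrarily far ahead of the network, and the writer always has a full chunk ready to send.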

Joe
 
Ulf Dittmer
Rancher

If I run it uncompressed through Tomcat on localhost it transfers at over 12MB/sec (about a gigabit).


Can you tell us what the purpose of this effort is? I don't follow how 12MB/sec is about a GB/s (it sounds rather less than that to me), but it is rather more than your average internet connection, so I assume you're interested in optimizing performance across a LAN - which brings up the issue of what kinds of speed the routers, the cables, and the client network cards support.
 
Joe Areeda
Ranch Hand
Hi Ulf,

Sorry for my lack of clarity. I really should start by describing the problem I'm trying to solve rather than the low-level issue. There is more background, but I think this is a decent place to start.

I have a database with a couple of tables that contain meta-data describing big stores of real-time acquired data that people process offline. Some of these people would like to get a big part or all of the information in these tables, rather than search my database. Their reasons seem a bit flimsy to me, but I'm willing to try to implement it for them. The clients are expected to have a wide range of network connections: from perhaps 10GbE on the cluster (though I think there are some routes that don't pass through gigabit switches), through university-grade Internet connections and gigabit LANs, all the way down to home DSL connections at 1Mbps.

A typical download request asks for about 1/3 of a table with 24M rows. This results in a 553MB file with 8.5M rows. The file is ASCII CSV, which compresses easily; gzip from the command line gives a compression ratio of 18:1.

My naive expectation was that using gzip compression in Apache with curl, their browser, or their (probably) Python application would give a similar speed-up, but it doesn't. curl reports the amount of data actually downloaded and the wire's transfer rate. From that I calculate a compression ratio of about 8:1, but the transfer time only improves by about 2:1 because the wire rate drops to ~500KB/s. (Roughly: 553MB / 8 ≈ 69MB actually sent; at the full 2MB/s that should take ~35 sec, but at ~500KB/s it takes ~140 sec, about half of the ~280 sec the uncompressed file needs.)

I believe the compression ratio is a setting which we can adjust, but as far as I can tell it is set this way as a good trade-off between time to compress and effective compression.

I don't understand why the transfer rate drops so dramatically. All those numbers are meant to narrow down the reasons.

My first guess was the database row retrieval. This was disproved by the uncompressed localhost connection.

My next idea was that Apache is just slow to compress. This was disproved by taking the file I downloaded, uploading it to the server and making it available as a static file. That was able to saturate my home Internet connection at 2MB/s (15Mbps). I just realized I should also test that through the server's localhost connection to get an actual measurement of the compression time. Along the same lines, I used scp with and without compression; that was able to saturate the Internet pipe with a comparable compression ratio.

My next futile theory was that the overhead of sending millions of small buffers to the servlet response's PrintWriter was making the compression so inefficient. So I added a StringBuilder and buffered 10,000 rows, which were then written all at once. No change: uncompressed still saturates the Internet connection, compressed comes in at about 25% of that, around 500KB/s.

I surmise we can rule out the AJP connection as uncompressed data go through the same connection at much higher rates.

I'm running out of ideas about where this problem could be occurring. Ah, I just thought of another wild goose to chase: I wonder if Tomcat could be compressing the data it sends over the AJP connection to Apache, which is then decompressing and recompressing it? That seems pretty far-fetched, as I didn't see anything about AJP compression settings, only Tomcat's HTTP server ports.

I did capture the headers to prove data is being gzipped and it is:


I hope that long-winded tour of my muddled thought process at least answers your question.

Best,
Joe
 
Ulf Dittmer
Rancher
Oh wow, you put a lot of experimentation into this, kudos!

I believe the compression ratio is a setting which we can adjust


No, it doesn't work that way. The compression ratio for a particular file is an output of the compression algorithm, not an input. The only algorithm I can think of where you can affect the size of the output is JPEG, but the higher the compression you choose, the more quality it loses - not something you'd want to do to text data :-)

I wonder if Tomcat could be compressing the data it sends over the AJP connection to Apache, which is then decompressing and recompressing it? That seems pretty far-fetched, as I didn't see anything about AJP compression settings, only Tomcat's HTTP server ports.


I don't think AJP figures into it; it's either Tomcat or Apache. If Tomcat uses gzip I would expect Apache to leave it alone, but it might be worthwhile to make sure that's the case. By default, Tomcat's gzip support is turned off (the compression attribute on the Connector); you could try whether that gives different results than using Apache's compression.
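For completeness, turning compression on in Tomcat itself is just a matter of attributes on the HTTP(S) Connector in server.xml, something like the following (values illustrative; note it goes on the HTTP connector, not the AJP one, which as far as I know doesn't do compression at all):

<!-- Tomcat 7 server.xml: gzip on the HTTP connector -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           compression="on"
           compressionMinSize="2048"
           compressableMimeType="text/html,text/plain,text/csv" />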
 
Tim Holloway
Saloon Keeper
HTTP is not the optimal channel for large data transfers. Quite apart from the fact that HTTP was designed as a text protocol for traffic that could spend parts of its journey in different codesets (ASCII and EBCDIC, for example), there's no recovery mechanism for interrupted transfers. Compression isn't part of the original mechanism either, since most compression algorithms output bits rather than characters, which means that in the worst case the compressed bits would have to be character-encoded, at a rate of 2 characters per 8 bits.

Ulf didn't use those terms, but the difference between what's done to JPEGs and what's done to text files is that JPEG compression is "lossy" compression. It's safe to discard data in a picture, because although picture quality will suffer, human vision compensates (somewhat). Text data, however, cannot afford to lose anything, so there are no adjustable loss levels. The closest you can get is alternative compression algorithms, which is what the ZIP utility does: it does trial runs for each file being compressed and picks whichever algorithm works best. So when you see a file listed as "stored" in a ZIP archive, that means no compression at all was more efficient than any of the available algorithms. Compression algorithms have worst-case conditions where the "compressed" data is larger than the original data, and we'd rather avoid that!

Here are a couple more considerations while testing, though:

1. If you have any part of the network that does compression in hardware, you could end up with your compressed data being double-compressed, and that's one of the more likely ways to get the worst-case compression scenario.

2. Don't overlook the overhead of the expanded data. If you have a hyper-efficient compressed web channel, but the time spent writing the data to the client's disk after it's decompressed is such that the client becomes disk-bound, then it will skew the apparent benefits from compressed network traffic.
 
Joe Areeda
Ranch Hand
Thanks Ulf, Tim,

As far as the term "compression level" goes, I was referring to the "DeflateCompressionLevel" directive in Apache's configuration file (see http://httpd.apache.org/docs/2.2/mod/mod_deflate.html#deflatecompressionlevel). Tim is correct that it is more about how hard to search for an optimal compression than some tunable parameter of a specific algorithm.

I agree HTTP is not the optimal protocol for large file transfers; however, it does have a few redeeming qualities:
  • It is (almost) never blocked by firewalls, and this transfer (if it can be made fast enough) would be done from hotels and guest accounts on a university campus (which block way too many ports).
  • RESTful queries are well supported in most languages.
  • It's the easiest one to use from a servlet.


I did 3 tests that I think show the problem is either in the servlet itself or in the Tomcat-to-Apache connection, and not in redundant compression in the networking hardware:
  • Sending uncompressed data results in transfer times that are very close to file size / network speed, so I doubt there is any hardware compression in play.
  • Saving the uncompressed data to a file and posting it on the server as a static file results in transfer times close to the expected compression factor.
  • Using scp with and without compression results in transfer times close to the Apache-only times with and without compression.


To step back further in my problem domain: my application can be viewed as a web front-end to an application called the Network Data Server (NDS). NDS implements a proprietary network protocol and is itself a front-end to a proprietary file format that stores real-time data from Gravitational Wave observatories. NB: I'm not going to explain GW observatories, but if you're interested, the public web site is http://www.ligo.org.

There are currently 6 NDS servers available to my application. The database tables we are discussing here are read-mostly. Currently, once a day, in the middle of the night, we check each of the servers to see if the meta-data (called the channel list) has changed; if so, we download the full channel list and update these tables. Here's the rub: the channel list is huge and growing, and NDS does not compress it. People use NDS directly from computing clusters, desktops and laptops. They currently download the full uncompressed list and, as usual, users spend their waiting time complaining to the developers about it.

There are other alternatives, including having NDS compress the list itself. I thought a servlet was an easy alternative, but unless I can get better performance out of it, it won't be accepted.

In my mind the solution is to get people to do their queries against the central database and not maintain local copies of part of it. We've been debating this for a long time and positions have hardened. Some people demand an efficient bulk transfer of all the meta-data from a single server.

Right now my biggest problem is that I don't understand where the bottleneck is. The servlet that provides a RESTful search is working, and the bulk transfer is less important.

I will continue to search for the reason this approach doesn't work. If I figure anything out, I'll report back.

If you think of anything, please let me know.

Joe
     
Joe Areeda
Ranch Hand
Here's an interesting data point for this discussion:

I wondered if knowing the size of the file we're transferring would make a difference, so I added an option to the servlet to send a file from disk rather than build it on the fly. This is the same data: I transferred it using the build-on-the-fly method, then copied it to the server, where the servlet just copies it using a FileReader and the response's PrintWriter.
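The file-serving option boils down to very little code - roughly this, with an illustrative path; since the size is known up front, the Content-Length header can be set as well (the file is ASCII, so characters and bytes line up):

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class StaticCsvServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        File csv = new File("/var/cache/channels/full-list.csv");   // pre-generated file

        resp.setContentType("text/csv");
        // The size is known up front, so Content-Length can be set
        // (the file is ASCII, so characters == bytes here)
        resp.setHeader("Content-Length", Long.toString(csv.length()));

        // Copy the file to the response writer in large chunks
        PrintWriter out = resp.getWriter();
        char[] buf = new char[64 * 1024];
        try (Reader in = new FileReader(csv)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}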

With compression it saturated my Internet connection at 2MB/s and sent 932M in 25 sec.

Another interesting timing result: when I sent it uncompressed over localhost it clocked in at 385MB/s.

This at least gives me some options.

I still don't understand the bottleneck, but at least we're narrowing in on it.

Best,
Joe
     
Joe Areeda
Ranch Hand
Greetings,

Well, a little more experimentation and I think I have a handle on what's happening, but I still don't understand why.

For these experiments:
  • Everything was done on a virtual machine running Scientific Linux 6.5 (same upstream as CentOS) with 8GB RAM and 4 cores (i7-2600K).
  • The client used is curl.
  • Compression was controlled with the --compressed option.
  • All transfers used https over the localhost connection.
  • Apache version 2.2.15
  • Tomcat version 7.0.53
  • Apache uses the AJP connector to Tomcat, which is needed for Shibboleth authentication.


I started with the database transfer (the code is in one of the previous posts) and saved the resulting file (933M). This file was then used as static content served either by Apache or through another servlet.

This table summarizes the timing tests. They were repeated a few times and were consistent when rounded to seconds.



The interesting points to note from this table:
  • Apache serving the file from disk is about twice as fast as Tomcat doing the same thing. This can be explained by the use of AJP, since Tomcat sends the data to Apache, which then sends it on to the client.
  • When working with uncompressed data, setting the Content-Length header makes the file transfer more efficient, about a 50% increase in speed.
  • What I don't understand is why the servlet that generates content on the fly is 11 times slower when the output is compressed.

Fortunately, these tables are read-mostly. They are currently updated once a day, which may be increased to 4 times a day, but I doubt more than that. Also, while the user can filter their request, the most common queries that produce large (>10MB) results are few and well known. So the solution I will propose to my teammates is to generate the common BIG queries when the database is updated and serve those as static content. I haven't decided whether that will be through the servlet or using only Apache.

If anyone has a clue why the database-generated content is so slow over a compressed connection, I'd really like to understand it, even if the fix is hard or impossible. If I learn more, I'll post.

Best,
Joe
     