Recent posts by jim terry

Recently we experienced an interesting production problem. This application was running on multiple AWS EC2 instances behind an Elastic Load Balancer. The application was running on GNU/Linux OS, Java 8, and the Tomcat 8 application server. All of a sudden, one of the application instances became unresponsive. All other application instances were handling the traffic properly. Whenever an HTTP request was sent to this application instance from the browser, we were getting the following response printed in the browser.



We used our APM (Application Performance Monitoring) tool to examine the problem. From the APM tool, we could see that CPU and memory utilization were perfect. On the other hand, from the APM tool, we could observe that traffic wasn't coming into this particular application instance. It was really puzzling. Why wasn't traffic coming in?

We logged in to this problematic AWS EC2 instance. We executed the vmstat, iostat, netstat, top, and df commands to see whether we could uncover any anomaly. To our surprise, none of these great tools reported any issue.

As the next step, we restarted the Tomcat application server in which this application was running. It didn’t make any difference either. Still, this application instance wasn’t responding at all.

DMESG command
Then we issued the 'dmesg' command on this EC2 instance. This command prints the message buffer of the kernel. The output of this command typically contains the messages produced by the device drivers. In the output generated by this command, we noticed the following interesting message printed repeatedly:



We were intrigued to see this error message: "TCP: out of memory -- consider tuning tcp_mem". It means an out-of-memory error is happening at the TCP level. We had always thought out-of-memory errors happen only at the application level and never at the TCP level.
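If you want to check for the same condition on your own servers, filtering the dmesg output is enough. A minimal sketch (the exact message text can vary slightly between kernel versions):

# Scan the kernel message buffer for TCP memory pressure errors
dmesg | grep -i "tcp: out of memory"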

The problem was intriguing because we live and breathe this OutOfMemoryError problem day in and day out. We have built troubleshooting tools like GCeasy and HeapHero to help engineers debug OutOfMemoryErrors that happen at the application level (Java, Android, Scala, Jython… applications). We have written several blogs on this OutOfMemoryError topic. But we were stumped to see OutOfMemory happening at the device driver level. We never thought there would be a problem at the device driver level, that too in the stable Linux operating system. Being stumped by this problem, we weren't sure how to proceed further.

Thus, we resorted to the Google god's help 😊. Googling for the search term "TCP: out of memory -- consider tuning tcp_mem" showed only 12 search results. Except for one article, none of them had much content ☹. Even that one article was written in a foreign language that we couldn't understand. So we were still not sure how to troubleshoot this problem.

Left with no other solution, we went ahead and implemented the universal solution, i.e. "restart". We restarted the EC2 instance to put out the immediate fire. Hurray!! Restarting the server cleared the problem immediately. Apparently, this server hadn't been restarted for a long time (more than 70 days); maybe because of that, the application had saturated the TCP memory limits.

We reached out to one of our intelligent friends who works for a world-class technology company for help. This friend asked us what values we had set for the kernel properties below:

* net.core.netdev_max_backlog
* net.core.rmem_max
* net.core.wmem_max
* net.ipv4.tcp_max_syn_backlog
* net.ipv4.tcp_rmem
* net.ipv4.tcp_wmem

Honestly, this was the first time we had heard of these properties. We found that the following values were set for these properties on the server:



Our friend suggested changing the values as given below:



He mentioned that setting these values would eliminate the problem we had faced. We are sharing the values with you, as they may be of help to you. Apparently, our values were very low compared to the values he provided.
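For reference, kernel properties like these can be inspected and changed with sysctl. A minimal sketch (choose values appropriate for your workload; the placeholder below is not a recommendation):

# Inspect the current values of the relevant network/TCP kernel properties
sysctl net.core.netdev_max_backlog net.core.rmem_max net.core.wmem_max
sysctl net.ipv4.tcp_max_syn_backlog net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.ipv4.tcp_mem

# Change a value at runtime (takes effect immediately, lost on reboot)
sudo sysctl -w net.core.rmem_max=<new-value>

# To persist the change, add the same "key = value" line to /etc/sysctl.conf and reload
sudo sysctl -p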

Conclusion
Here are a few conclusions that we would like to draw:

* Even the modern industry-standard APM (Application Performance Monitoring) tools don't completely answer the application performance problems that we face today.
* The 'dmesg' command is your friend. You might want to execute this command when your application becomes unresponsive; it may point you to valuable information.
* Memory problems don't have to happen only in the code that we write 😊; they can happen even at the TCP/kernel level.



1 month ago

In this modern world, garbage collection logs are still analyzed in a tedious, manual way: you have to get hold of the DevOps engineer who has access to the production servers, then he mails you the application's GC logs, then you upload the logs to a GC analysis tool, then you apply your intelligence to analyze it. There is no programmatic way to analyze garbage collection logs in a proactive manner. To eliminate this hassle, gceasy.io is introducing a RESTful API to analyze garbage collection logs. With one line of code you can get your GC logs analyzed instantly.

Here are a few use cases where this API can be extremely useful.

Use case 1: Automatic Root Cause Analysis
Most DevOps teams use a simple HTTP ping or APM tools to monitor the application's health. This ping is good for detecting whether the application is alive or not. APM tools are great at informing you that the application's CPU spiked up by 'x%', memory utilization increased by 'y%', or response time dropped by 'z' milliseconds. But they won't tell you what caused the CPU to spike, what caused memory utilization to increase, or what caused the response time to degrade. If you configure a cron job to capture thread dumps/GC logs at periodic intervals and invoke our REST API (see the sketch after the advantages below), we apply our intelligent patterns & machine learning algorithms to instantly identify the root cause of the problem.

Advantage 1: Whenever these sorts of production problems happen, in the heat of the moment the DevOps team recycles the servers without capturing the thread dumps and GC logs. You need to capture thread dumps and GC logs at the very moment the problem is happening in order to diagnose it. With this new strategy you don't have to worry about that: because your cron job captures thread dumps/GC logs at periodic intervals and invokes the REST API, all your thread dumps/GC logs are archived on our servers.

Advantage 2: APM tools claim to add less than 3% overhead, whereas in reality they add several times that. The beauty of this strategy is that it adds no overhead (or negligible overhead), because the entire analysis of the thread dumps/GC logs is done on our servers and not on your production servers.
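A minimal sketch of such a cron job, assuming a hypothetical GC log path and an environment variable holding your API key:

# Illustrative crontab entries: every 15 minutes, POST the current GC log to the analysis API
GCEASY_API_KEY=your-api-key-here
*/15 * * * * curl -s -X POST --data-binary @/opt/myapp/logs/gc.log "https://api.gceasy.io/analyzeGC?apiKey=$GCEASY_API_KEY" >> /opt/myapp/logs/gc-analysis.jsonl 2>&1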

Use case 2: Performance Tests
When you conduct performance tests, you might want to capture thread dumps/GC logs on a periodic basis and get them analyzed through the API. If the thread count goes beyond a threshold, too many threads are WAITING, any threads are BLOCKED for a prolonged period, a lock isn't getting released, frequent full GC activity is happening, or GC pause times exceed thresholds, you need that visibility right then and there. It should be caught before the code hits production. In such circumstances this API will come in very handy.

Use case 3: Continuous Integration
As part of continuous integration, it's highly encouraged to execute performance tests. Thread dumps/GC logs should be captured and can be analyzed using the API. If the API reports any problems, the build can be failed, as sketched below. In this way, you can catch performance degradation at code commit time instead of catching it in performance labs or production.
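A minimal sketch of such a CI step, assuming a hypothetical GC log path produced by the performance test; only the endpoint and the "isProblem" element come from the API description below:

# Analyze the GC log from the performance test and fail the build if a problem is reported
response=$(curl -s -X POST --data-binary @./perf-test-gc.log "https://api.gceasy.io/analyzeGC?apiKey=$GCEASY_API_KEY")
if echo "$response" | grep -q '"isProblem" *: *true'; then
    echo "GC analysis reported a problem: $response"
    exit 1
fi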

How to invoke Garbage Collection log analysis API?

Invoking the Garbage Collection log analysis API is very simple:

1). Register with us. We will email you the API key. This is a one-time setup process. Note: if you have purchased the enterprise version with API, you don't have to register; the API key will be provided to you as part of the installation instructions.
2). POST an HTTP request to https://api.gceasy.io/analyzeGC?apiKey={API_KEY_SENT_IN_EMAIL}
3). The body of the HTTP request should contain the garbage collection log that needs to be analyzed.
4). The HTTP response will be sent back in JSON format. The JSON has several important stats about the GC log. The primary element to look for in the JSON response is "isProblem". This element will have the value "true" if any memory/performance problems have been discovered. The "problem" element will contain a detailed description of the memory problem.

CURL command

Assuming your GC log file is located at "./my-app-gc.log", the CURL command to invoke the API is:
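A sketch of that command, assuming the endpoint from step 2 and the "--data-binary" option (which preserves line breaks in the log):

curl -X POST --data-binary @./my-app-gc.log "https://api.gceasy.io/analyzeGC?apiKey={API_KEY_SENT_IN_EMAIL}"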


It can't get any simpler than that, can it?

How to invoke Java Garbage Collection log analysis API
2 months ago
Hi

The source is: https://www.youtube.com/watch?v=uJLOlCuOR4k&t=26s  and it has been allowed to be posted under copyright regulations.
2 months ago

Java Thread Dump Analyzer,  Troubleshoot JVM crashes, slowdowns, memory leaks, freezes, CPU Spikes
https://community.atlassian.com/t5/Marketplace-Apps-Integrations/How-do-you-analyze-GC-logs-thread-dumps-and-head-dumps/ba-p/985787

2 months ago
Based on the JVM version (1.4, 5, 6, 7, 8, 9), JVM vendor (Oracle, IBM, HP, Azul, Android), GC algorithm (Serial, Parallel, CMS, G1, Shenandoah), and a few other settings, the GC log format changes. Thus, today the world has ended up with several GC log formats.

‘GC Log standardization API’ normalizes GC Logs and provides a standardized JSON format as shown below.

The graphs provided by GCeasy are great, but some engineers would like to study every event of the GC log in detail. This standardized JSON format gives them that flexibility. Besides that, engineers can import this data into Excel, Tableau, or any other visualization tool.

How to invoke Garbage Collection log standardization API?

Invoking the Garbage Collection log standardization API is very simple:

1). Register with us. We will email you the API key. This is a one-time setup process. Note: if you have purchased the enterprise version, you don't have to register; the API key will be provided to you as part of the installation instructions.

2). POST an HTTP request to http://api.gceasy.io/std-format-api?apiKey={API_KEY_SENT_IN_EMAIL}

3). The body of the HTTP request should contain the Garbage collection log that needs to be analyzed.

4). The HTTP response will be sent back in JSON format. The JSON has several important stats about the GC log. The primary element to look for in the JSON response is "isProblem". This element will have the value "true" if any memory/performance problems have been discovered. The "problem" element will contain a detailed description of the memory problem.

CURL command

Assuming your GC log file is located at "./my-app-gc.log", the CURL command to invoke the API is:
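A sketch of that command, assuming the endpoint from step 2 and the "--data-binary" option recommended in the note below:

curl -X POST --data-binary @./my-app-gc.log "http://api.gceasy.io/std-format-api?apiKey={API_KEY_SENT_IN_EMAIL}"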



It can't get any simpler than that, can it?

Note: use the "--data-binary" option in CURL instead of the "--data" option. With "--data", new line breaks will not be preserved in the request. New line breaks must be preserved for correct parsing.

Other Tools

You can also invoke the API using any web service client tool, such as SOAP UI or the Postman browser plugin.

  Fig: POSTing GC logs through PostMan plugin

Sample Response




JSON Response Elements

gcEvents: This is the top-level root element. It contains an array of GC events. For every GC event reported in the GC log, you will see an element in this array.
timeStamp: Timestamp at which the particular GC event ran
gcType: YOUNG if it's a young GC event type; FULL if it's a full GC event type
durationInSecs: Duration for which the GC event ran
reclaimedBytes: Number of bytes reclaimed in this GC event
heapSizeBeforeGC: Overall heap size before this GC event ran
heapSizeAfterGC: Overall heap size after this GC event ran
youngGenSizeBeforeGC: Young generation size before this GC event ran
youngGenSizeAfterGC: Young generation size after this GC event ran
oldGenSizeBeforeGC: Old generation size before this GC event ran
oldGenSizeAfterGC: Old generation size after this GC event ran
permGenSizeBeforeGC: Perm generation size before this GC event ran
permGenSizeAfterGC: Perm generation size after this GC event ran
metaSpaceSizeBeforeGC: Metaspace size before this GC event ran
metaSpaceSizeAfterGC: Metaspace size after this GC event ran
2 months ago
Thus, from Java 9, if you launch the application with -XX:+UseConcMarkSweepGC (the argument that activates the CMS GC algorithm), you are going to see the below WARNING message:



Why CMS is deprecated?
2 months ago

jim terry wrote:Hello Stephen

Some of the GCeasy tutorials are given below:

What is Garbage collection log? How to enable and analyze?
How to enable Java 9 GC Logs?
Key sections on GCeasy Report
OutOfMemoryError
GCeasy Tutorials

3 months ago

In this article, we have attempted to answer the most common questions around the System.gc() API call. We hope it may be of help.

What is System.gc()?

System.gc() is an API provided in Java, Android, C#, and other popular languages. When invoked, it will make its best effort to clear accumulated unreferenced objects (i.e. garbage) from memory.

Who invokes System.gc()?

System.gc() calls can be invoked from various parts of your application stack:

a. Your own application developers might be explicitly calling the System.gc() method.
b. Sometimes System.gc() can be triggered by your 3rd-party libraries or frameworks, sometimes even by your application servers.
c. It could be triggered from external tools (like VisualVM) through the use of JMX.
d. If your application is using RMI, then RMI invokes System.gc() at periodic intervals.


What are the downsides of invoking System.gc()?

When the System.gc() or Runtime.getRuntime().gc() API calls are invoked from your application, stop-the-world Full GC events will be triggered. During stop-the-world Full GCs, the entire JVM will freeze (i.e. all the customer transactions that are in motion will be paused). Typically, these Full GCs take a long time to complete. Thus, they have the potential to hurt user experience and your SLAs at unnecessary times, when a GC isn't even required to run.

The JVM has sophisticated algorithms working in the background all the time, doing all the computations and calculations on when to trigger GC. When you invoke a System.gc() call, all those computations go for a toss. What if the JVM triggered a GC event just a millisecond back and your application invokes System.gc() once again? From your application, you don't know when GC last ran.

Are there any good/valid reasons to invoke System.gc()?

We haven't encountered many good reasons to invoke System.gc() from the application. But here is an interesting use case we saw in a major airline's application. This application uses 1 TB of memory, and its Full GC pause takes around 5 minutes to complete. Yes, don't be shocked, it's 5 minutes (we have seen cases of 23-minute GC pauses as well). To avoid any customer impact from this pause time, the airline has implemented a clever solution: on a nightly basis, they take one JVM instance at a time out of their load balancer pool, then they explicitly trigger a System.gc() call through JMX on that JVM. Once the GC event is complete and garbage is evicted from memory, they put that JVM back into the load balancer pool. Through this clever solution they have minimized the customer impact caused by this 5-minute GC pause time.

How to detect whether System.gc() calls are made from your application?

As you noticed in the 'Who invokes System.gc()?' section, System.gc() calls can be made from multiple sources, not just from your application source code. Thus, searching your application code for the 'System.gc()' string isn't enough to tell whether your application is making System.gc() calls. This poses a challenge: how do you detect whether System.gc() calls are invoked anywhere in your application stack?

This is where GC logs come in handy. Enable GC logs in your application. In fact, it's advisable to keep GC logging enabled all the time on all your production servers, as it helps you troubleshoot and optimize application performance. Enabling GC logs adds negligible (if at all observable) overhead. Now upload your GC log to a Garbage Collection log analyzer tool such as GCeasy or HP JMeter. These tools generate a rich garbage collection analysis report.
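For reference, here is a minimal sketch of enabling GC logging on a HotSpot JVM ('myapp.jar' and the log path are placeholders):

# Java 8 and earlier
java -Xloggc:/var/log/myapp/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -jar myapp.jar

# Java 9 and later (unified logging)
java -Xlog:gc*:file=/var/log/myapp/gc.log -jar myapp.jar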


  Fig: GC Causes reported by GCeasy.io tool

The above figure is an excerpt from the 'GC Causes' section of the report generated by GCeasy. You can see that the 'System.gc()' call was invoked 304 times, accounting for 52.42% of the GC pause time.

How to remove System.gc() calls?

You can remove explicit System.gc() calls through the following solutions:

a. Search & Replace

This might be a traditional method :-), but it works. Search your application code base for 'System.gc()' and 'Runtime.getRuntime().gc()'. If you see a match, then remove it. This solution works if 'System.gc()' is invoked from your own application source code. If 'System.gc()' is invoked from your 3rd-party libraries, frameworks, or external sources, then this solution will not work. In such circumstances, you can consider using the option outlined in #b.
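A quick way to do that search from the command line (a sketch; 'src/' is a placeholder for your source directory):

# Find explicit GC calls anywhere in the code base
grep -rnF -e "System.gc()" -e "Runtime.getRuntime().gc()" src/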

b. -XX:+DisableExplicitGC

You can forcefully disable System.gc() calls by passing the JVM argument '-XX:+DisableExplicitGC' when you launch the application. This option will silence all the 'System.gc()' calls invoked anywhere in your application stack.
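For example (assuming a standalone jar launch; adapt to however your application server passes JVM arguments):

java -XX:+DisableExplicitGC -jar myapp.jar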

c. RMI

If your application is using RMI, then you can control the frequency at which 'System.gc()' calls are made. This frequency can be configured using the following JVM arguments when you launch the application:

-Dsun.rmi.dgc.server.gcInterval=n

-Dsun.rmi.dgc.client.gcInterval=n

The default value for these properties is:

JDK 1.4.2 and 5.0: 60000 milliseconds (i.e. 60 seconds)

JDK 6 and later releases: 3600000 milliseconds (i.e. 60 minutes)

You might want to set these properties to a very high value so that the impact is minimized.
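For example, to stretch the RMI-triggered GC to once a day (86400000 ms is an illustrative value, not a recommendation from this post):

java -Dsun.rmi.dgc.server.gcInterval=86400000 -Dsun.rmi.dgc.client.gcInterval=86400000 -jar myapp.jar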



3 months ago