Starting on ~Feb 26, control of the antennas would be lost once every ~3-6 hours. I tracked the issue down to the QueueServer on antcntl and the following exception:
Exception in thread "dispatch" java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
The QueueServer would continue running but would not be able to respond to new JSDA packet connections. To "fix" it, I would kill -15 <PID of QueueServer> directly, only to have the QueueServer run out of memory again within 3-6 hours.
Going through the SVN logs, I could find no changes in the last 3-12 months to the JSDA classes used heavily by QueueServer, nor to QueueServer.java itself.
This started happening around the time of a series of power failures at the site, yet there seemed to be no related hardware issue that I could find on antcntl or the other JSDA-connected hosts.
My initial suspicion was that some extra traffic was passing through antcntl on the JSDA network, keeping the QueueServer from reclaiming memory fast enough with the incremental GC. I spent the day of Feb 27th adding the ability to write out all of the raw packets to /dev/shm so that I could capture the full packet log and process it after a crash. I also added -Xmx500m to the startup of QueueServer to increase the memory allocatable to it under the JVM. There has been some work done on the SonATABackend server, but, from what I could tell, it should not be related to this issue.
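In rough outline, the packet capture amounts to appending each received packet object to a serialized log on the tmpfs mount; the class and method names below are illustrative placeholders, not the actual JSDA code:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;

    // Sketch only: append each packet object to a log file on /dev/shm (tmpfs),
    // so logging does not compete with disk I/O on antcntl.
    public class PacketLogger {
        private final ObjectOutputStream log;

        public PacketLogger(String path) throws IOException {
            // e.g. path = "/dev/shm/queueserver-packets.log" (hypothetical path)
            this.log = new ObjectOutputStream(new FileOutputStream(path));
        }

        // Called from the packet dispatch path (hypothetical hook point).
        public synchronized void record(Object packet) throws IOException {
            log.writeObject(packet);
            log.reset();   // drop back-references so the logger itself does not leak
            log.flush();
        }
    }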
While I let QueueServer run again to collect the packet log, it did take longer (~8 hrs) before it ran out of memory again. Before starting it I also added JVM monitoring options to the startup, allowing me to attach to the JVM with memory and resource tracking tools such as jmap, hat (later renamed jhat in JVM 6) and jconsole.
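The exact options are not critical, but a typical set that makes the JVM attachable by jconsole and friends looks something like this (the port number and the disabled authentication/SSL below are illustrative choices, not necessarily what was used):

    -Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.port=9999
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false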
This gave me an instantaneous view of resource use for the running copy of QueueServer. Based on the initial memory allocation view from these tools, I became convinced that the issue really was a memory leak of some kind. I have been following the improvements that came with JVM 6, and since QueueServer itself does not instantiate the objects it passes as packets, I decided to run it under JVM 6, which also let me use jvisualvm, a much more responsive and useful resource-tracking tool.
Using jvisualvm, I did track down a few slow memory leaks immediately and plugged them; however, memory was still being leaked and the QueueServer would crash after ~6-8 hours. Meanwhile, I was writing a new tool to help process the raw packet log, but initial examinations did not show any extra bump in traffic that might be leading to this behavior.
Paying attention to the generational memory pools (eden, tenured and perm), I noticed that much of the memory allocated during regular packet-passing operations was not being reclaimed by the "minor" GC events. This indicated that much of it was being promoted into the older (tenured) generation, where it stays until a full GC event. Even then, the full GC events showed a gradual leak that would culminate in the out-of-memory exception.
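As an aside, the same promotion pattern can also be confirmed directly from HotSpot's own GC log rather than from jvisualvm; enabling it at startup looks something like this (the log path is just an example):

    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/dev/shm/gc.log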
This brings us to Mar 1st. I began switching between different GC algorithms, settling on ParallelGC (better suited to multi-core systems), and tuned the initial memory footprint with additional startup options.
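In sketch form, this came down to selecting the parallel collector and pinning the initial heap, roughly along these lines (treat the specific values as approximations; the -Xmx500m from before stays in place):

    -XX:+UseParallelGC -Xms100m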
This seemed to work well for ~6-9 hours, but still showed a general leak.
I spent most of Mar 2nd adding more aggressive object handling in Server.java and QueueServer.java, forcing more ObjectOutputStream resets, and made a few attempts at closing the ObjectOutputStream and ObjectInputStream objects after handling 20K packets and then re-instantiating them (these streams are inherently leaky by design and require resets at the very least).
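The reason the resets matter: ObjectOutputStream keeps a handle table of every object it has written so that shared references serialize correctly, so without periodic reset() calls it pins all of those objects in memory. A minimal sketch of the reset pattern (the surrounding names are placeholders, not the actual Server.java code):

    import java.io.IOException;
    import java.io.ObjectOutputStream;

    class PacketWriter {
        private static final int RESET_INTERVAL = 20000; // packets between resets
        private final ObjectOutputStream out;
        private int written = 0;

        PacketWriter(ObjectOutputStream out) {
            this.out = out;
        }

        void send(Object packet) throws IOException {
            out.writeObject(packet);
            out.flush();
            if (++written % RESET_INTERVAL == 0) {
                // Clears the stream's back-reference table on both ends,
                // at the cost of re-serializing shared objects afterwards.
                out.reset();
            }
        }
    }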
During the afternoon of Mar 2nd, I switched from a 100M minimum allocation to a 300M minimum (this is used as "eden" space; there is a bit of trickiness here, as major GC events may require double the eden space to transition eden objects into the tenured space). This caused the failure to occur much sooner (~20 minutes), even though there was plenty of memory available: max usage was 50M, falling to ~4-5MB after each minor GC event, with the low end gradually rising to over 20M. This GC algorithm allows real heap usage to grow to ~25% of the eden space and then does a minor GC. I'd post plots of all of this, but I haven't been recording them off of jvisualvm.
Currently, I have removed the minimum allocation entirely and the behavior is much improved. The initial eden space is ~8M, and the spread between peaks and dips across minor GC events is strikingly different, maxing out at ~6.5M and dropping to 2.5-3M. There does still appear to be an upward trend, but it is vastly reduced from the rate that was leading to crashes over the last 4 days. After 2.5 hours of monitoring, the heap usage floor increased from ~2.5M to ~3.2M (~0.7M over 2.5 hours, or ~0.28M/hr). If this trend continues, the QueueServer will run out of memory in ~2100+ hours, or ~89 days. One issue is that GC events now happen much more often, which could affect overall packet throughput; however, the parallel nature of the GC handling should help mitigate this, and it does not seem to have caused problems so far.
I have collected about a dozen heap dumps and used the Eclipse Memory Analyzer to try to discover what is actually being leaked. This led me to the object streams above and also to jsda.Connection; however, I have not been able to determine exactly what in jsda.Connection is responsible. Also, jsda.Connection came up as only about the third-ranked leak candidate after I added the resets described above. It may account for the 0.28M/hr of leaking that I found over the last 2.5 hours.
Update: had a crash ~12 hours after launching QueueServer. Logging of the system indicated that many threads had been created (2K total, with a maximum of 716 live at once). About 75% of these threads were unnamed, and further investigation showed that they belong to a thread pool for the server connection dispatcher. This pool was limited to 10 threads. I've bumped it up to a much larger value (50), which should help avoid what appears to have been a lot of thread thrashing this morning. The pool size has been added to the Server constructor as a parameter, so that only the QueueServer on antcntl gets the extra-large pool. (Note: this is different from the worker thread pool, which was set to allow growth up to a higher limit, 100 vs 10, last summer.)
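Roughly what the constructor change looks like (names and defaults here are a sketch, not the actual jsda source):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class Server {
        private final ExecutorService dispatchPool;

        // Existing callers keep the old 10-thread dispatch pool.
        public Server() {
            this(10);
        }

        // Only the QueueServer on antcntl passes a larger pool size (50).
        public Server(int dispatchThreads) {
            this.dispatchPool = Executors.newFixedThreadPool(dispatchThreads);
        }
    }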
I have also modified QueueServer to catch the OutOfMemoryError, issue a fatal error message, and immediately exit. ProcessServer will then attempt to restart QueueServer within 1 second of the crash. This will help significantly, as before this change the QueueServer was remaining up but unresponsive, holding the array hostage. From now on, if this error occurs, the antennas will come back under control within 1-2 minutes of a crash and without human intervention. One drawback is that some antennas come back needing to be re-init'd (same as before).
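The change in QueueServer is essentially the following (a sketch; dispatchLoop() and the exit code are placeholders for the real method and value):

    public void run() {
        try {
            dispatchLoop();             // normal packet handling (placeholder name)
        } catch (OutOfMemoryError oom) {
            // Exit loudly instead of lingering unresponsive while holding the array;
            // ProcessServer sees the exit and restarts QueueServer within ~1 second.
            System.err.println("FATAL: QueueServer out of memory: " + oom);
            System.exit(1);
        }
    }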
After running for 3 hours, another thread thrashing event took place, with an enormous rise in concurrent threads from ~120 to 695 over the course of 2 minutes. I will write out the raw packet log now to try to catch this...
So it sounds like this is circling back around to something hammering the antenna server with a ton of requests? Is there no logging of the requests and where they're coming from?
The packet routing is rather intensive, and until a few days ago there was no facility for logging requests beyond how many per second were occurring. I have now added the ability to log the raw packets (Java objects) and can store them without interfering with the normal operations of QueueServer. I'm going to turn that back on this afternoon.
Did any information come out of our little experiment of leaving the BackendServer off for over a day? That was Thursday 11pm, March 3rd to Friday 4:00, March 4th. Let me know if you'd like us to do that again, or if you need any help (or just encouragement!).
I don't think there was enough of a control, but it did seem as though the system was more stable during the period that the SonATABackendServer was off. I'd like to try some more controlled experiments during the maintenance tomorrow and the AF time on Thursday.
Some plots on traffic flowing through antcntl:
JSDA non-registration messages (all messages except for RegisterServer/RegisterQueue):
JSDA getItem calls to antennas, probably from tempmon2 (turned off today 09 Mar 2011)
One of the BackendServer problems we uncovered is that it queries the AntennaServer for the antenna states every 10 seconds, even when it is NOT running an observation. This causes more traffic than needed on the antcntl QueueServer.
On Monday March 15, 2011, at about 2pm PST, I restarted the BackendServer on tumulus with a change: the BackendServer now queries the AntennaServer for the antenna states only when it is connected to SonATA and running an observation. Anytime SonATA is not running, the AntennaServer is not queried. This will reduce JSDA traffic when SonATA is idle.
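In outline, the change is just a guard around the existing 10-second poll (method names below are placeholders for the actual BackendServer code):

    // Runs every 10 seconds; previously it queried unconditionally.
    void pollAntennaStates() {
        if (!connectedToSonATA() || !observationRunning()) {
            return;   // SonATA idle: generate no JSDA traffic toward antcntl
        }
        queryAntennaServerForStates();
    }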