After running 3 hours, another thread thrashing event took place with an enormous rise in concurrent threads, going from ~120 to 695 over the course of 2 minutes. Will write out raw packet log now to try to catch this...
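The JVM reports live, peak, and total-started thread counts directly, so a rise like this one can be caught by polling. A minimal sketch, assuming a hypothetical ThreadStats helper (not part of queueserver) and an arbitrary spike threshold chosen by the caller:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper (not queueserver code): samples JVM thread counts. */
final class ThreadStats {
    private final ThreadMXBean threads = ManagementFactory.getThreadMXBean();

    /** Threads currently alive in this JVM. */
    int live() {
        return threads.getThreadCount();
    }

    /** Highest live count since JVM start (or since the peak was last reset). */
    int peak() {
        return threads.getPeakThreadCount();
    }

    /** Total number of threads started since JVM start. */
    long totalStarted() {
        return threads.getTotalStartedThreadCount();
    }

    /** True when the live count grew by more than limit threads between two samples. */
    static boolean isSpike(int previousLive, int currentLive, int limit) {
        return currentLive - previousLive > limit;
    }

    /** One log line with all three counters. */
    String summary() {
        return "threads live=" + live() + " peak=" + peak() + " started=" + totalStarted();
    }
}
```

Polled every few seconds, isSpike(previous, current, limit) would flag a jump such as ~120 to 695, at which point the raw packet log could be written out.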
Another source for the monitoring device:
http://www.keyitec.com/keyitec3D.html#software, $725, includes 24 licenses (for use with ~24 computers for controlled shutdown?)
I currently have a call in to Jeff James at Pro Power for quotes on the Mitsubishi Netcom2 device and software. The ballpark estimate is $730 for the box plus $10 each for the ability to have shutdown notices sent to individual computers. One $10 license may cover multiple computers; I need to confirm this with Jeff James.
To connect this box to our system, we would run multimode fiber from the shed into the SPR and use a pair of 100 Mbps transceivers that we already have on site.
Service Hotline
Mitsubishi Technical support contact info:
Mitsubishi Regional Sales Manager
Pro Power (authorized re-seller of Netcom2 monitoring hardware+software)
Some plots on traffic flowing through antcntl:
JSDA bytes/sec
JSDA non-registration messages (all messages except for RegisterServer/RegisterQueue):
JSDA getItem calls to antennas, probably from tempmon2 (turned off today 09 Mar 2011)
Jon,
I don't think there was enough of a control, but it did seem as though the system was more stable during the period that the SonATABackendServer was off. I'd like to try some more controlled experiments during the maintenance tomorrow and the AF time on Thursday.
Colby,
Did any information come out of our little experiment of leaving the BackendServer off for over a day? That was Thursday 11pm, March 3rd to Friday 4:00, March 4th. Let me know if you'd like us to do that again, or if you need any help (or just encouragement!).
Jon
The packet routing is rather intensive, and until a few days ago there was no facility for logging requests beyond how many per second were occurring. I have now added the ability to log the raw packets (Java objects) and can store them without interfering with the normal operations of queueserver. I'm going to turn that back on this afternoon.
So it sounds like this is circling back around to something hammering the antenna server with a ton of requests? Is there no logging of the requests and where they're coming from?
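To illustrate what logging requests by origin could look like, here is a minimal sketch that counts requests per source per second. The class name RequestRateLog and the source labels are hypothetical; this is not the raw packet logger that was added to queueserver, and a real version would also need to prune old seconds to bound memory.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch: per-second request counts, broken down by source. */
final class RequestRateLog {
    // epoch second -> (source name -> request count)
    private final ConcurrentMap<Long, ConcurrentMap<String, LongAdder>> bySecond =
            new ConcurrentHashMap<>();

    /** Records one request from the given source at the given wall-clock time. */
    void record(String source, long epochMillis) {
        bySecond.computeIfAbsent(epochMillis / 1000, s -> new ConcurrentHashMap<>())
                .computeIfAbsent(source, s -> new LongAdder())
                .increment();
    }

    /** Requests seen from one source during one second. */
    long count(String source, long epochSecond) {
        Map<String, LongAdder> sources = bySecond.get(epochSecond);
        if (sources == null) {
            return 0;
        }
        LongAdder counter = sources.get(source);
        return counter == null ? 0 : counter.sum();
    }

    /** Requests seen from all sources during one second. */
    long total(long epochSecond) {
        Map<String, LongAdder> sources = bySecond.get(epochSecond);
        if (sources == null) {
            return 0;
        }
        long sum = 0;
        for (LongAdder counter : sources.values()) {
            sum += counter.sum();
        }
        return sum;
    }
}
```

Calling record(source, System.currentTimeMillis()) from the packet router would be enough to show whether a single client is responsible for a burst.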
Update: had a crash ~12 hours after launching queueserver. Logging of the system showed that many threads had been created (2K total, with a maximum of 716 live at once). About 75% of these threads were unnamed, and further investigation showed that they belong to a thread pool for the server connection dispatcher. This pool was limited to 10 threads. I've bumped the pool up to a much larger value (50), which should help avoid what appears to have been a lot of thread thrashing this morning. The pool size has been added to the Server c'tor as a parameter so that only the QueueServer on antcntl will have these extra-large thread pools. (Note: this is different from the worker thread pool, whose upper limit was raised from 10 to 100 threads last summer.)
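A minimal sketch of a dispatcher pool size passed through the constructor, assuming a hypothetical SketchServer class standing in for the real Server (whose code is not shown here). Naming the pool threads is an addition of this sketch, so that they would no longer show up as unnamed in thread logs.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the Server class; not the real implementation. */
class SketchServer {
    static final int DEFAULT_DISPATCHER_THREADS = 10;

    private final ThreadPoolExecutor dispatcherPool;

    /** Ordinary servers keep the small default pool. */
    SketchServer() {
        this(DEFAULT_DISPATCHER_THREADS);
    }

    /** A busy server (e.g. the one on antcntl) can ask for a larger pool. */
    SketchServer(int dispatcherThreads) {
        dispatcherPool = new ThreadPoolExecutor(
                dispatcherThreads, dispatcherThreads,
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(),
                named("dispatcher"));
    }

    /** Gives pool threads recognizable names such as dispatcher-1. */
    private static ThreadFactory named(String prefix) {
        AtomicInteger next = new AtomicInteger(1);
        return task -> new Thread(task, prefix + "-" + next.getAndIncrement());
    }

    int dispatcherThreads() {
        return dispatcherPool.getCorePoolSize();
    }

    void dispatch(Runnable connectionTask) {
        dispatcherPool.execute(connectionTask);
    }

    /** Stops accepting work and waits for queued tasks; false on timeout or interrupt. */
    boolean shutdownAndWait(long timeoutMillis) {
        dispatcherPool.shutdown();
        try {
            return dispatcherPool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With this shape, new SketchServer(50) gives the larger pool while every other caller of the no-argument constructor is unaffected.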
I have also modified QueueServer to catch OutOfMemoryError, issue a fatal error message, and exit immediately. ProcessServer will now attempt to restart QueueServer within 1 second of the crash. This will help significantly, because the QueueServer was remaining up but in an unresponsive state, keeping the array hostage. From now on, if this error occurs, the antennas will come back under control within 1-2 minutes of a crash and without human intervention. One drawback is that some antennas come back needing to be re-init'd (same as before).
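A minimal sketch of the catch-and-exit pattern, under stated assumptions: FatalErrorGuard and its exit code are hypothetical, and the exit action is injected so the behavior can be exercised without killing the JVM. Note that OutOfMemoryError is a java.lang.Error, so a catch of Exception would not see it.

```java
import java.util.function.IntConsumer;

/** Hypothetical sketch: exit promptly on OutOfMemoryError so a supervisor can restart us. */
final class FatalErrorGuard {
    /** Exit code chosen for this sketch only. */
    static final int EXIT_OUT_OF_MEMORY = 3;

    private final IntConsumer exit;

    FatalErrorGuard(IntConsumer exit) {
        this.exit = exit;
    }

    /** Production variant: halt() skips shutdown hooks, which may themselves need memory. */
    static FatalErrorGuard forProduction() {
        return new FatalErrorGuard(code -> Runtime.getRuntime().halt(code));
    }

    /** Runs the body; on OutOfMemoryError logs a fatal message and triggers the exit action. */
    void run(Runnable body) {
        try {
            body.run();
        } catch (OutOfMemoryError e) {
            System.err.println("FATAL: out of memory, exiting so the process can be restarted");
            exit.accept(EXIT_OUT_OF_MEMORY);
        }
    }
}
```

This only covers errors thrown on the thread that calls run(). Errors on pool threads would need a similar handler installed on those threads, for example through Thread.setDefaultUncaughtExceptionHandler.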
There is now an easier way to do this - http://log.hcro.org/content/too-much-data-accumulate
Assigning to Billy.
Do you mean startup as in the start-up of the various parts of foxtrot-related data capturing?
Figuring out whether Walshing is working correctly ought to be part of this.
We probably want to check that the array is in a good state at the start of each observation. If the first calibrator fails to image properly, try observing it again. If it fails again, alarm or email the user.
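The retry-then-alarm rule above can be sketched as follows. The names StartupCheck and observeAndImageCalibrator are hypothetical; the supplier stands for whatever observes the first calibrator and reports whether it imaged properly.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of the start-of-observation check. */
final class StartupCheck {
    enum Result { OK, OK_AFTER_RETRY, ALARM }

    /**
     * Observes the calibrator once; on failure observes it a second time.
     * Two failures in a row mean the user should be alarmed or emailed.
     */
    static Result check(BooleanSupplier observeAndImageCalibrator) {
        if (observeAndImageCalibrator.getAsBoolean()) {
            return Result.OK;
        }
        if (observeAndImageCalibrator.getAsBoolean()) {
            return Result.OK_AFTER_RETRY;
        }
        return Result.ALARM;
    }
}
```

Returning OK_AFTER_RETRY separately from OK keeps a record of marginal starts, which may be useful when tracking how often the array comes up in a bad state.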
Vector to scalar ratio for calibrators also gives a good indicator of data quality.
Aye Aye, I can update the kernel on cinder and we can see where that gets us...
Colby, can you update the kernel on one of the offending servers to see if it fixes this? I assume after the Air Force event this month?
Oooh, I knew I'd miss something. Fixed.
I did say that strato is experiencing this problem, but I must have been wrong. I just now tried to get the original test to fail on strato, but it would not fail, even after over 1500 iterations of the test.
I just now tested on maincntl, which is OpenSUSE 10.2, and it failed. The first time it took 307 tries till it failed. The second try took 47 iterations before it failed. Then 33, then 1, then 65. So it appears to be random.
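For reference, the repeat-until-failure procedure can be sketched as a small harness. RepeatUntilFailure is a hypothetical name, and the supplier stands in for the original test, which is not reproduced here.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical harness: repeats a test until it fails or a cap is reached. */
final class RepeatUntilFailure {
    /**
     * Returns the 1-based iteration at which the test first returned false,
     * or -1 if it passed on all maxIterations attempts.
     */
    static int firstFailure(BooleanSupplier test, int maxIterations) {
        for (int i = 1; i <= maxIterations; i++) {
            if (!test.getAsBoolean()) {
                return i;
            }
        }
        return -1;
    }
}
```

Running it several times and recording each returned iteration gives the kind of series reported above; a result of -1 with a cap of 1500 corresponds to a machine on which the test never failed.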
This also implies that we could update the kernels and fix them.
In the original posting, you'd mentioned that strato was showing the same 189 problem. Can you double check that strato does not do this?
I had forgotten to test the most important server - tumulus.
tumulus - 10.3 - OK
So I am thinking we are experiencing a bug in OpenSUSE 10.2, and this is why the 189 second problem did not affect SonATA until we moved the BackendServers from tumulus to auxcntl.
I tested on all the computers you mentioned. Here is a list that shows their Linux version:
foid - 11.2 - OK
my server - 11.2 - OK
caldera - 10.3 - OK
strato - solaris - OK
auxcntl - 10.2 - FAIL
cinder - 10.2 - FAIL
pulsar-2 - 10.2 - FAIL
Have you tried the same test on cinder, pulsar-2 or caldera? Keep in mind that auxcntl is running SUSE 10.2, while strato is running Solaris and foid is running SUSE 11.2. I'm not sure what version your home server is running.