Appears to have run out of local disk space for storing logs. May have to compress logs more often and/or reduce the number of log files kept around. A longer-term solution would be to send log output via syslog and to limit some logged outputs, such as the wrap pot values. Moving the wrap pot value into the metadatabase would make more sense for tracking purposes. At the very least, it would make sense to shorten the log message string to something like "wpotv: 0.48 wpotaz: 246.42148".
I was able to get new log messages by removing older data from the ~ataant/logs area:
...
ataant@ant2a:~/logs> rm DishServer.log-20101007
ataant@ant2a:~/logs> rm ~ataant/logs/*.gz
...
ataant@ant2a:~/logs> tail -5 DishServer.log
10-10-14 16:11:44 | Wrap pot value = 0.48, wrappotaz = 246.42148
10-10-14 16:11:48 | Wrap pot value = 0.468, wrappotaz = 261.74927
10-10-14 16:11:52 | Wrap pot value = 0.481, wrappotaz = 253.81003
10-10-14 16:11:53 | Entering DishServer.getFocus()
Retrieved counts value = 2474
...
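A minimal sketch of the two short-term ideas above: keeping only the newest few rotated logs, and emitting the shorter wrap pot message. The function names, the `DishServer.log-*` glob and the `keep` count are assumptions for illustration; nothing here reflects how DishServer or the log rotation on the antennas is actually configured.

```python
from pathlib import Path


def format_wrap_pot(value, az):
    """Build the shortened log string suggested in the comment above."""
    return f"wpotv: {value} wpotaz: {az}"


def prune_rotated_logs(log_dir, pattern="DishServer.log-*", keep=3):
    """Delete all but the `keep` newest rotated logs and return the removed names.

    Assumes rotated files carry a sortable date suffix such as
    DishServer.log-20101007, so sorting by name sorts by age.
    """
    rotated = sorted(Path(log_dir).glob(pattern), key=lambda p: p.name)
    doomed = rotated[:-keep] if keep > 0 else rotated
    for path in doomed:
        path.unlink()
    return [p.name for p in doomed]
```

The live `DishServer.log` is never matched by the glob, so only rotated copies are candidates for removal.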
Adding Gary and Sam to notifications list.
Load was very high on boot2 when I checked on this. No useful info about why in the logs. Session info is being properly culled, so that was not part of the issue. Restarted the rails application, and this seemed to solve it. Perhaps a memory leak issue. As atasys, it's possible to restart the application as follows:
ssh atasys@boot2
...
atasys@boot2:~> fxconf.sh stop
...
atasys@boot2:~> fxconf.sh start
...
atasys@boot2:~> fxconf.sh status
26149 ruby script/server -d -e production -p 8000
Leaving this bug open, as this software needs to get documented on an, oh, I dunno, documentation page.
Yes, now there is only one instance! Hmmmmm
The empty log file problem still remains. I will have to look into this. It should be putting out information.
Updated miriad on the following hosts today:
beast x86_64 11.1, carma, ata, sma
cosmic x86_64 11.2, carma, ata, sma
sun i386 10.2, carma, ata, sma
(need current versions of automake/autoconf/libtool, default configure, set path to include /usr/local/bin first)
The inserted image (below) illustrates one instance of many I have captured, during slews, where AntennaServer Az/El values are not updated fast enough to satisfy CollisionServer (and possibly other programs as well). See Az/El samples taken within the red box: we would not believe the data which shows that antenna 3f moved ~ 39 degrees in azimuth in two seconds. See Az/El samples taken within the green boxes, all at 2010-10-05T23:07:37Z: if CollisionServer had taken these Az/El samples it would have concluded that antenna 3g was separated in azimuth from antennas 3f and 3h by ~ 35 degrees (190 - 155) and therefore inhibited motion.
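The arithmetic in that example can be sketched as follows. This is not CollisionServer's code: the function names are hypothetical, the 30 deg limit is the figure quoted elsewhere in this thread, the wrap-around handling is an assumption, and the assignment of 190 and 155 to individual antennas is arbitrary, chosen only to reproduce the ~35 deg figure.

```python
def az_separation_deg(az_a, az_b):
    """Smallest angle between two azimuths, in degrees (0 to 180)."""
    diff = abs(az_a - az_b) % 360.0
    return min(diff, 360.0 - diff)


def should_inhibit(azimuths, limit_deg=30.0):
    """True if any pair of antennas is separated by more than limit_deg."""
    names = sorted(azimuths)
    return any(
        az_separation_deg(azimuths[a], azimuths[b]) > limit_deg
        for i, a in enumerate(names)
        for b in names[i + 1:]
    )


# Stale samples like those in the green boxes: one antenna reads ~190 deg
# while the other two read ~155 deg.
print(az_separation_deg(190.0, 155.0))  # 35.0
print(should_inhibit({"3f": 155.0, "3g": 190.0, "3h": 155.0}))  # True
```

The point of the sketch is that the check itself is trivial; the false trip comes entirely from samples that were not taken at the same moment.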
Now that I am back at work:
Did one instance get killed? I only see one instance now. And, it looks like the TrackingServer log is still zero anyway.
I should note that it's not a pulse being sent that stops motion; it is the lack of a pulse that stops motion. Another strange factor here is that the antenna is not immediately stopped, and if it coasts to within 30 deg of another antenna, the pulses to that antenna are restarted, which could lead to a jittery start-stop-start motion.
Reassigning to me...
The following plot shows 3 cases (blue vertical lines) where the collision server reported that one of the antennas was more than 30 degrees off from another antenna (in this case, 3fe vs 3he). Note that the 3 instances fall on 2-second intervals (the interval at which the collision server checks whether a pair is out of sync). Note also that the measure the collision server follows is based on the encoder readouts (3fe and 3he). Because of the cadences of data flow through the system (updates are very slow to reach the backend control system), outlined at http://log.hcro.org/content/anti-collision-design-flaws-and-potential-hardware-problems, the collision server still reports the 2 antennas as out of sync while the encoder readings on the antennas themselves show them to be < 30 deg apart.
So, the antennas are getting out of sync (mechanically), and the software underlying that part of the system is overly anxious in its reporting and halting of the antennas. I'm going to edit the collision server to attempt to get encoder readings from all 3 antennas at once, instead of spreading the queries individually over 3 seconds. This also reopens the discussion of our lack of a faster channel for updates: a low-latency, regular telemetry stream that sends back telemetry data more efficiently than using the full weight of a JSDA call. Such a stream, used by the collision antennas specifically, would solve the latency issues outlined in the anti-collision design flaws referenced above. Eventually expanding this to the rest of the antennas would also be useful, having a low impact on the network while enriching our ability to have better-performing backend services keyed to the antenna telemetry.
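A rough sketch of the "all 3 at once" idea, assuming the per-antenna query can be wrapped in a blocking callable. `read_encoder` is a hypothetical stand-in for whatever call the collision server really makes (it is not a JSDA API), and the thread-pool approach is just one way to issue the three queries in the same time window.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_all_encoders(read_encoder, antennas=("3f", "3g", "3h")):
    """Query every antenna concurrently so the samples share one time window.

    read_encoder is a hypothetical callable: antenna name -> (az, el).
    """
    with ThreadPoolExecutor(max_workers=len(antennas)) as pool:
        results = list(pool.map(read_encoder, antennas))
    return dict(zip(antennas, results))


def fake_read(name):
    """Stand-in for a slow remote query; returns made-up values."""
    time.sleep(0.1)
    return (155.0, 45.0)


print(read_all_encoders(fake_read))
```

With three workers, the total wait is roughly that of the slowest single query rather than the sum of all three, which is what narrows the window in which stale readings can disagree.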
More info about this issue: it appears that the culprit was ant3f, as it is pointed away from the other two antennas and it apparently went offline first:
ant3f: 10-09-30 11:06:43 | Entering DishServer.offline()
ant3g: 10-09-30 11:14:05 | Entering DishServer.offline()
ant3h: 10-09-30 11:14:00 | Entering DishServer.offline()
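For reference, the ordering can be checked mechanically from those three lines. This is a small sketch that assumes the timestamps are `yy-mm-dd HH:MM:SS` and that the `antenna: ` prefix was added when the lines were collected; the function name is illustrative.

```python
from datetime import datetime

OFFLINE_LINES = """\
ant3f: 10-09-30 11:06:43 | Entering DishServer.offline()
ant3g: 10-09-30 11:14:05 | Entering DishServer.offline()
ant3h: 10-09-30 11:14:00 | Entering DishServer.offline()
"""


def offline_order(text):
    """Return (antenna, timestamp) pairs sorted by time of the offline() call."""
    events = []
    for line in text.splitlines():
        if "DishServer.offline()" not in line:
            continue
        antenna, rest = line.split(": ", 1)
        stamp = rest.split(" | ", 1)[0]
        # Assumed timestamp layout: yy-mm-dd HH:MM:SS
        events.append((antenna, datetime.strptime(stamp, "%y-%m-%d %H:%M:%S")))
    return sorted(events, key=lambda event: event[1])


order = offline_order(OFFLINE_LINES)
first, first_time = order[0]
for antenna, when in order[1:]:
    gap = (when - first_time).total_seconds()
    print(f"{antenna} went offline {gap:.0f} s after {first}")
```

On the lines above this reports ant3h and ant3g going offline a little over seven minutes after ant3f.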
I don't think this is directly related, but it should definitely be followed up: there are some odd-looking exceptions being thrown by MonCom on all antennas. An example:
10-09-30 11:31:57 | Warning Moncom: Unexpected exception while updating ElIDraw.
10-09-30 11:31:57 | 10-09-30 11:31:57: java.lang.Exception: Invalid return value from servo.
jsda.Server.Call(Server.java:302)
jsda.Server.run(Server.java:276)
java.lang.Thread.run(Unknown Source)
However, again, I don't know if this is a problem causing the issue with the collision antennas going offline.
Taking a close look at the position updates that were logged on each of the collision antennas, it appears that the antennas were showing differences greater than 30 deg, at least momentarily, during every slew that happened around the time that 3f was knocked offline. It looks as though 3f was lagging behind 3g and 3h, which would indicate a mechanical issue with 3f or some sort of reduction in 3f's maximum slewing speed. I need to follow up on the max slew speed, but I have noticed one or another collision antenna lagging. It could also be a difference in when the trio receive new ephem files, which, even with the multithreaded distribution, can arrive at different times.
fxconf.rb sals was giving me this error. This is how I cleaned up the files:
find /opt/atasys/atasw/rails/ata/tmp/sessions -name "ruby_sess*" | xargs rm
installed on caldera, cinder, spatter and foid. (on caldera, it segfaults when attempting to open an X window, but text window works fine... gah.. unset DISPLAY)
I had gotten around the permissions issue by just building inside ata/src.
I had thought that /hcro/atasys had the same arch-specific mounting setup as /hcro/miriad but it appears that was wrong. It's no biggie just to build a custom jmir and/or switch the build to a 64-bit host, so this bug just reflects my misunderstanding of the system setup.
The production system has only ever been compiled to run on 64-bit systems for jmir. I hadn't anticipated that anyone would use /hcro/atasys for shortcuts to ata/apps/jmir in a sandbox, since the normal build process goes into those directories to actually attempt building (in fact, it should error on permissions in those cases). /hcro is picked up by the automounter, so atasys just appears there on its own. I'm open to suggestions about it not being mounted by default on the dev hosts, but that'll have to wait until next week.
Miriad is a corner case, as it is built separately and is not part of the 'make' for the ata software. Better to build a copy of jmir in another directory in your lab setup for now.
f1.fxc fboard#1 has yellow sticker with "runs hot" on it. Left in place.
f2.fxc fboard#1, replaced fan, brought fxc back up, but fbtemps.rb only showed fboards 0,1,2 in the list.
At first, assumed fboard#3 was the missing one, powered down and reseated it, with no change in the temps listing missing board 3.
Troubleshooting:
ssh root@f2.fxc
/sbin/lspci only shows 3 fboard devices:
02:0c.0 Signal processing controller: Xilinx Corporation Unknown device 0300
02:0d.0 Signal processing controller: Xilinx Corporation Unknown device 0300
02:0f.0 Signal processing controller: Xilinx Corporation Unknown device 0300
I would expect the first fboard to show up as device 02:0a.0. For reference, a good lspci output from f1.fxc looks like:
02:0a.0 Signal processing controller: Xilinx Corporation Unknown device 0300
02:0c.0 Signal processing controller: Xilinx Corporation Unknown device 0300
02:0d.0 Signal processing controller: Xilinx Corporation Unknown device 0300
02:0f.0 Signal processing controller: Xilinx Corporation Unknown device 0300
It appears that the fbtemps.rb program is showing a relative count of fboards and not the absolute count and positions. Powered fcrate.fxc down again, reseated f2.fxc fboard#0, and now all 4 boards showed up with lspci and all 4 temps are showing up in fbtemps.rb output.
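Since fbtemps.rb appears to number boards relatively, a check against absolute PCI slots would have pointed at fboard#0 straight away. A small sketch, assuming the four slots in the good f1.fxc listing map in order to fboards 0 through 3 (only the first mapping, 02:0a.0 to fboard#0, is stated above; the rest is an assumption), with a hypothetical function name:

```python
# Slots taken from the good f1.fxc listing; the slot-to-index order is assumed.
EXPECTED_SLOTS = ["02:0a.0", "02:0c.0", "02:0d.0", "02:0f.0"]


def missing_fboards(lspci_output, expected=EXPECTED_SLOTS):
    """Return the absolute fboard indexes whose PCI slot is absent from lspci."""
    present = {
        line.split()[0]
        for line in lspci_output.splitlines()
        if "Xilinx" in line
    }
    return [index for index, slot in enumerate(expected) if slot not in present]


F2_LSPCI = """\
02:0c.0 Signal processing controller: Xilinx Corporation Unknown device 0300
02:0d.0 Signal processing controller: Xilinx Corporation Unknown device 0300
02:0f.0 Signal processing controller: Xilinx Corporation Unknown device 0300
"""

print(missing_fboards(F2_LSPCI))  # [0]
```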
Attempted to init fxc with fxinit fx64c, and saw many suspicious looking errors:
------------------------
obs@user1 ~/bin % ssh x1.fxc
Last login: Sat Sep 11 21:51:24 2010 from strato.hcro.org
Have a lot of fun...
fxobs@x1.fxc:~> fxinit fx64c
Stopping slice a...done
Stopping slice b...done
Stopping slice c...done
Stopping slice d...done
Stopping slice e...done
Stopping slice f...done
Stopping slice g...done
Stopping slice h...done
Loading F board 0 (0x00005007)...OK
F Board FPGA Air
0 26 27 24
Using external sync from I board
Loading T board 0...OK
Setting initial bit selector values from '/opt/fx/etc/init_bitsel.f1'...OK
Loading F board 1 (0x00004006)...OK
F Board FPGA Air
1 28 27 25
Using external sync from I board
Loading T board 1...OK
Setting initial bit selector values from '/opt/fx/etc/init_bitsel.f1'...OK
Loading F board 2 (0x00005007)...OK
F Board FPGA Air
2 27 26 25
Using external sync from I board
Loading T board 2...OK
Setting initial bit selector values from '/opt/fx/etc/init_bitsel.f1'...OK
Loading F board 3 (0x00005007)...OK
F Board FPGA Air
3 29 29 24
Using external sync from I board
Loading T board 3...OK
Setting initial bit selector values from '/opt/fx/etc/init_bitsel.f1'...OK
Loading F board 0 (0x00005007)...OK
F Board FPGA Air
0 27 27 25
Using external sync from I board
/opt/fx/bin/fbreg.rb:15: [BUG] Segmentation fault
ruby 1.8.6 (2007-06-07) [i586-linux]
/opt/fx/bin/ftxinit: line 123: 2565 Aborted fbreg.rb ${n} 0x6400 ${SYNCSEL} 0x30 >/dev/null
/opt/fx/bin/fbreg.rb:15: [BUG] Segmentation fault
ruby 1.8.6 (2007-06-07) [i586-linux]
/opt/fx/bin/ftxinit: line 123: 2566 Aborted fbreg.rb ${n} 0x6400 0x000 0x100 >/dev/null
/opt/fx/bin/fbreg.rb:15: [BUG] Segmentation fault
ruby 1.8.6 (2007-06-07) [i586-linux]
/opt/fx/bin/ftxinit: line 123: 2567 Aborted fbreg.rb ${n} 0x6400 0x100 0x100 >/dev/null
/opt/fx/bin/fbreg.rb:15: [BUG] Segmentation fault
ruby 1.8.6 (2007-06-07) [i586-linux]
/opt/fx/bin/ftxinit: line 123: 2568 Aborted fbreg.rb ${n} 0x6400 0x000 0x100 >/dev/null
Loading T board 0...OK
Setting initial bit selector values from '/opt/fx/etc/init_bitsel.f2'.../opt/fx/lib/ruby/fboard.rb:188: [BUG] Segmentation fault
ruby 1.8.6 (2007-06-07) [i586-linux]
/opt/fx/etc/init_bitsel.f2: line 1: 2572 Aborted bitsel.rb 0 0-7 0-1023 8
/opt/fx/lib/ruby/fboard.rb:188: [BUG] Segmentation fault
ruby 1.8.6 (2007-06-07) [i586-linux]
/opt/fx/etc/init_bitsel.f2: line 7: 2576 Aborted bitsel.rb 0 0-7 512 13
OK
Loading F board 1 (0xffffffff)...OK
F Board FPGA Air
1 28 27 26
Using external sync from I board
Loading T board 1...OK
Setting initial bit selector values from '/opt/fx/etc/init_bitsel.f2'.../opt/fx/lib/ruby/fboard.rb:188: [BUG] Segmentation fault
ruby 1.8.6 (2007-06-07) [i586-linux]
/opt/fx/etc/init_bitsel.f2: line 1: 2593 Aborted bitsel.rb 0 0-7 0-1023 8
/opt/fx/lib/ruby/fboard.rb:188: [BUG] Segmentation fault
ruby 1.8.6 (2007-06-07) [i586-linux]
/opt/fx/etc/init_bitsel.f2: line 7: 2597 Aborted bitsel.rb 0 0-7 512 13
OK
Loading F board 2 (0x00005007)...OK
F Board FPGA Air
2 29 30 26
Using external sync from I board
Loading T board 2...OK
Setting initial bit selector values from '/opt/fx/etc/init_bitsel.f2'.../opt/fx/lib/ruby/fboard.rb:188: [BUG] Segmentation fault
ruby 1.8.6 (2007-06-07) [i586-linux]
/opt/fx/etc/init_bitsel.f2: line 1: 2614 Aborted bitsel.rb 0 0-7 0-1023 8
/opt/fx/lib/ruby/fboard.rb:188: [BUG] Segmentation fault
ruby 1.8.6 (2007-06-07) [i586-linux]
/opt/fx/etc/init_bitsel.f2: line 7: 2618 Aborted bitsel.rb 0 0-7 512 13
OK
Loading F board 3 (0x00005007)...OK
F Board FPGA Air
3 30 31 27
Using external sync from I board
Loading T board 3...OK
Setting initial bit selector values from '/opt/fx/etc/init_bitsel.f2'.../opt/fx/lib/ruby/fboard.rb:188: [BUG] Segmentation fault
ruby 1.8.6 (2007-06-07) [i586-linux]
/opt/fx/etc/init_bitsel.f2: line 1: 2635 Aborted bitsel.rb 0 0-7 0-1023 8
/opt/fx/lib/ruby/fboard.rb:188: [BUG] Segmentation fault
ruby 1.8.6 (2007-06-07) [i586-linux]
/opt/fx/etc/init_bitsel.f2: line 7: 2639 Aborted bitsel.rb 0 0-7 512 13
OK
Loading X board 0...OK
Loading X board 1...OK
Loading X board 2...OK
Loading X board 3...OK
Loading X board 0...OK
Loading X board 1...OK
Loading X board 2...OK
Loading X board 3...OK
x1...
xboard0 is CURRENTLY STOPPED at 0 frames
xboard1 is CURRENTLY STOPPED at 0 frames
xboard2 is CURRENTLY STOPPED at 0 frames
xboard3 is CURRENTLY STOPPED at 0 frames
x2...
xboard0 is CURRENTLY STOPPED at 0 frames
xboard1 is CURRENTLY STOPPED at 0 frames
xboard2 is CURRENTLY STOPPED at 0 frames
xboard3 is CURRENTLY STOPPED at 0 frames
Restarting slice a.../opt/fx/bin/xb8netcast.rb:58: Integration time is zero (RuntimeError)
done
Restarting slice b.../opt/fx/bin/xb8netcast.rb:58: Integration time is zero (RuntimeError)
done
Restarting slice c.../opt/fx/bin/xb8netcast.rb:58: Integration time is zero (RuntimeError)
done
Restarting slice d.../opt/fx/bin/xb8netcast.rb:58: Integration time is zero (RuntimeError)
done
Restarting slice e.../opt/fx/bin/xb8netcast.rb:58: Integration time is zero (RuntimeError)
done
Restarting slice f.../opt/fx/bin/xb8netcast.rb:58: Integration time is zero (RuntimeError)
done
Restarting slice g.../opt/fx/bin/xb8netcast.rb:58: Integration time is zero (RuntimeError)
done
Restarting slice h.../opt/fx/bin/xb8netcast.rb:58: Integration time is zero (RuntimeError)
done
------------------------
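Note that every step in the transcript above still ends in "OK" or "done" despite the errors, so a scan for the failure strings is more informative than the status words. A minimal sketch of such a scan; the signature names are made up, and treating an all-ones value in the "Loading F board" line as suspicious is an assumption, since the meaning of that parenthesised value is not stated here.

```python
import re

SIGNATURES = {
    "segfault": re.compile(r"\[BUG\] Segmentation"),
    # Assumption: an all-ones readback means the board could not be read.
    "all_ones_readback": re.compile(r"Loading F board \d+ \(0xffffffff\)"),
    "zero_integration": re.compile(r"Integration time is zero"),
}


def summarize_init_log(text):
    """Count lines in an fxinit transcript that match each failure signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Feeding it the saved output of an init run gives a per-signature count that could be compared against zero before declaring the init good.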
And, after this init, fcheck showed:
Checking f2.fxc - FPGA should be < 80
F Board FPGA Air
0 29 38 26
1 42 98 34 FPGA IS HOT!
2 41 75 37
3 42 72 37
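The same threshold test could feed an automated check. A sketch under the assumption that each data row reads board index, board temp, FPGA temp, air temp (inferred from the header and the "FPGA should be < 80" banner); the function name and the parsing are illustrative, not how fcheck itself works.

```python
def hot_fboards(fcheck_output, fpga_limit=80):
    """Return {board: fpga_temp} for rows whose FPGA column is at or above the limit.

    Assumes each data row is: board index, board temp, FPGA temp, air temp.
    """
    hot = {}
    for line in fcheck_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or not all(f.isdigit() for f in fields[:4]):
            continue  # banner or header line
        board, _board_temp, fpga, _air = (int(f) for f in fields[:4])
        if fpga >= fpga_limit:
            hot[board] = fpga
    return hot


FCHECK = """\
Checking f2.fxc - FPGA should be < 80
F Board FPGA Air
0 29 38 26
1 42 98 34 FPGA IS HOT!
2 41 75 37
3 42 72 37
"""

print(hot_fboards(FCHECK))  # {1: 98}
```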
bugger. Replaced fan again, reseated fboard#1 on f2.fxc, and again ran into fboard#0 not showing up. Twice powered off and reseated, lspci not showing board. Powered off and reseated all boards twice, including CPU board, powered back up and still fboard#0 not showing up. Went ahead and did fxinit fx64c on x1.fxc without fboard#0. After that, fboard#1 is cooler at 85C, but is still running hot. The fboard#1 on both f1.fxc and f2.fxc is sandwiched next to another fboard; perhaps they've both been running hot?
Still unable to get fboard#0 to show up. Will attempt to speak to Matt Dexter tomorrow (Sunday) to troubleshoot further.
poop. Fixed. Need to come up with checks that will feed into alarm system/automatic ticket creator.
fixed...
A redesign of how the antennas report their positions (by pushing them, preferably via a lightweight telemetry stream that could be added to JSDA) could eliminate the slow update rate of AntennaServer/AntennaShadow. More directly, a modification to CollisionServer could have it update the position of all 3 antennas every second, instead of interleaving them, shaving 2 seconds off the potential 6 and leaving a possible antenna divergence of up to ~ 32 deg.
Problem is being cleared, but slowly.
Working on this now, it will take some time to fix, sessions problem rearing head _again_. Spoke with David M about approach to fix and he has blessed it. Interrupting work on alarms to fix this one :)
Should have been closed long ago...
Node 3 dropped out over last evening. It seemed as though a few of the ants were having problems with their el drives. Anyway, a reset at the node house didn't do anything so Gary came out and had to bypass the Variable Frequency Drive which had gone out, to get the blower started again. Big thanks to Gary. Ant 3d, 3j and 3l are all back online and happy. The ion pump on 3e would not jump, so that one is out. Ant 3c was already in maint for pam issues and needs to have a feed replaced anyway. Totally unrelated to the node issue, ant 3f lost fiber link with the media converter and so Gary will have to troubleshoot that on Monday. This keeps the collisions offline for the rest of the weekend.
Several other antennas (3d, 3e, 3g, 3h, 3l) could not get on position so I
moved them to the maint group. I moved 3f to the collision group to
prevent it from being used.
I think Colby suggested a direct connection between a beamformer and
the GPU pulsar processor. I think this would be a good short-term
solution which avoids impact on the SETI lan. I'd rather not disrupt
the SETI Lan because we are trying to maintain our observing programs
on the Prelude system while trying to integrate and test its
replacement, SonATA. The SETI observing systems control the beamformer
through the BackendServer which is on the OBS lan. So, perhaps with a
temporary direct connection to a beamformer and a network connection on
the OBS lan, Gregory can test his processor.
Note: we currently have beamformers 2 & 3 connected to Prelude via analog outputs while SonATA is connected to digital packetized output from all three beamformers.
One thing to keep in mind about testing with the beamformer in the near
future, we plan to run tests on SonATA with one to three beamformers
much of the time. So any tests of the pulsar processor need to be
coordinated with the sonata group <sonata@seti.org>. For the long term, the pulsar processor cannot be on the SETI lan. We may have activity going on between our servers even when we're not observing. Perhaps we need a "pulsar" lan (or subnet?) which communicates with the OBS lan via sockets (that's what the SETI systems do).
Amending this: someone issued a shutdown() command to DishServer on all antennas. The ProcessServer restarted the DishServer software automatically.