|
|
Ben Cotton
Purdue University
Systems Administrator
Post Count: 9
|
4/29/2008 10:02 am
When I try to boot the nodes, I get a message (below) saying that power is off to the nodes. The main power is obviously on, since I'm logged in to the SSP. I checked the internal connections and nothing appears to be unplugged. The NICs on the main board all have lights. I've added (but not yet configured) 4 SATA drives, but the issue persists even after I pulled the power from them. What am I missing?
Thanks,
BC
sicortex-ssp ~ # scboot -p sca
/var/state/route_info.sca checks out OK!
Creating boot configuration
Setting up MSPs
Setting up node rootfs image
Halting all nodes
Restarting ev1d
* Stopping ev1d ... [ ok ]
* Starting ev1d ... [ ok ]
Starting Master Fabric Daemon
slurm_load_jobs error: Unable to contact slurm controller (connect failure)
slurm_load_part: Unable to contact slurm controller (connect failure)
slurm_update error: Unable to contact slurm controller (connect failure)
Trouble setting nodes to SLURM 'Idle' state.
Cleaning up FabriCache client partitions
Clearing event daemon state
Restarting syslog-ng
* Stopping syslog-ng ... [ ok ]
* Starting syslog-ng ... [ ok ]
Loading and booting linux
bamf: power is off to the nodes on sca-msp0 - turn it on before booting!
power off on some selected modules
Failed loading linux
Caught signal, cleaning up.
/sbin/scboot: line 326: 31983 Terminated ( trap - ${scboot_traps}; sleep "${msp_timeout}"; echo "TIMEOUT" >"${timeout_file}" )
|
|
Bobby Woods-Corwin
Engineer
Post Count: 7
|
4/29/2008 10:25 am
"bamf: power is off to the nodes on sca-msp0 - turn it on before booting!"
Node power is under software control, and is separate from the SSP power.
You can confirm the state of the node power with scpower:
sicortex-ssp % scpower -p sca
reading power from ['sca-msp0:0xfff']
[1L]
[1L] means the node power is on; [0L] means it's off.
You can turn the power back on with
sicortex-ssp % scpower -p sca -s 1
setting power to 1 on ['sca-msp0:0xfff']
and try again.
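The check-then-enable sequence above can be scripted. This is only a minimal sketch: it assumes scpower prints the state as a bracketed value such as [1L] or [0L], exactly as in the transcript, and that a single value is reported. The function names are hypothetical, and the parsing is based on that one sample output, not on any documented format.

```python
import re
import subprocess

def parse_power_state(output):
    """Return True if power is on, False if off, None if no state is found.

    Assumes the state appears as '[1L]' or '[0L]', as in the transcript.
    """
    match = re.search(r"\[([01])L?\]", output)
    if match is None:
        return None
    return match.group(1) == "1"

def ensure_power_on(partition="sca"):
    """Read node power with scpower and switch it on if it is reported off."""
    result = subprocess.run(["scpower", "-p", partition],
                            capture_output=True, text=True, check=True)
    state = parse_power_state(result.stdout)
    if state is False:
        # Same command as in the post: scpower -p sca -s 1
        subprocess.run(["scpower", "-p", partition, "-s", "1"], check=True)
    return state
```

Calling ensure_power_on() on the SSP returns the state that was read before any change, so a False result means the power had been off and a power-on was attempted.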
If the power is off, there are a number of possible reasons:
1. MSP temperature monitoring:
Check the SSP's /var/log/msp-messages-YYYYMM for emerg-level messages. If there's a message containing "Shutdown on overtemp", the MSP shut down the nodes for safety because it thought they got too hot.
2. SSP monitoring:
Check the SSP's /var/log/policyd.log for messages indicating that policyd shut down the power.
If either of these logs has interesting stuff in it, that may lead to a next step...
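Both log checks can be done in one pass. The sketch below assumes the two log locations named above and searches for the quoted "Shutdown on overtemp" string. The thread does not say what a policyd shutdown message looks like, so the search term "power" used for policyd.log is a guess, and find_matches is a hypothetical helper name.

```python
import glob

def find_matches(paths, needle):
    """Return (path, line) pairs for lines containing needle, ignoring case."""
    hits = []
    for path in paths:
        try:
            with open(path, errors="replace") as handle:
                for line in handle:
                    if needle.lower() in line.lower():
                        hits.append((path, line.rstrip("\n")))
        except OSError:
            continue  # missing or unreadable log: skip it
    return hits

if __name__ == "__main__":
    # msp-messages-YYYYMM: match every month that is present.
    msp_logs = sorted(glob.glob("/var/log/msp-messages-*"))
    for path, line in find_matches(msp_logs, "Shutdown on overtemp"):
        print(path, line, sep=": ")
    # Assumption: policyd mentions "power" when it shuts the nodes down.
    for path, line in find_matches(["/var/log/policyd.log"], "power"):
        print(path, line, sep=": ")
```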
|
|
Richard Dischler
Director, Custom Systems
Post Count: 45
|
4/29/2008 10:28 am
More quick tricks...
First try the reset button on the front (next to the larger power switch).
Also, make sure the four fans on the side panel are spinning.
|
|
Ben Cotton
Purdue University
Systems Administrator
Post Count: 9
|
4/29/2008 12:22 pm
As it turns out, the fans were not securely plugged in to the power supply, so that caused the SSP to freak out. The nodes boot now, but there's a new complaint that none of them are the router:
sicortex-ssp ~ # scboot -p sca
/var/state/route_info.sca checks out OK!
Creating boot configuration
Setting up MSPs
Setting up node rootfs image
Halting all nodes
Restarting ev1d
* Stopping ev1d ... [ ok ]
* Starting ev1d ... [ ok ]
Starting Master Fabric Daemon
slurm_load_jobs error: Unable to contact slurm controller (connect failure)
slurm_load_part: Unable to contact slurm controller (connect failure)
slurm_update error: Unable to contact slurm controller (connect failure)
Trouble setting nodes to SLURM 'Idle' state.
Cleaning up FabriCache client partitions
Clearing event daemon state
Restarting syslog-ng
* Stopping syslog-ng ... [ ok ]
* Starting syslog-ng ... [ ok ]
Loading and booting linux
bamf: Loading vmlinux
bamf: Loading bootk
Finished loading linux (kernel boot initiated)
----------scboot-monitor----------
secs kernel fabric initfs slurm
33 12 0 0 0
err: all 12 nodes checked in, but no router available (none are gateways) - BOOT FAILED
151 12 12 0 0
Timeout (120 seconds since last activity)
1077 12 12 0 0
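The scboot-monitor table above can be read mechanically to see where the boot stalled. The sketch below assumes each numeric row is elapsed seconds followed by the number of nodes that have reached the kernel, fabric, initfs and slurm stages; that reading is inferred from the column headings and the "all 12 nodes checked in" message, not from documentation. The function names are hypothetical.

```python
STAGES = ("kernel", "fabric", "initfs", "slurm")

def parse_monitor_rows(text):
    """Collect rows of five integers: secs, kernel, fabric, initfs, slurm."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 5 and all(f.isdigit() for f in fields):
            secs, *counts = (int(f) for f in fields)
            rows.append((secs, dict(zip(STAGES, counts))))
    return rows

def first_stalled_stage(rows, total_nodes):
    """Return the first stage in the last row below total_nodes, or None."""
    if not rows:
        return None
    _, counts = rows[-1]
    for stage in STAGES:
        if counts[stage] < total_nodes:
            return stage
    return None
```

Fed the output above with total_nodes=12, this reports initfs as the first stage that never filled in, which fits the error that no router (gateway node) became available.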
Here's my sicortex-system.conf
[DEFAULT]
sca.cluster.head-node = sca-m0n8
sca.cluster.io-nodes= sca-m0n8
sca.cluster.default-router = ssp
sca.node.sca-m0n8.router = nat
sca.node.sca-m0n8.interfaces = eth1
#sca.node.sca-m0n8.eth1.address = dhcp
sca.node.sca-m0n8.eth1.address = 128.210.127.241
sca.node.sca-m0n8.eth1.netmask = 255.255.255.0
sca.node.sca-m0n8.eth1.gateway = 128.210.127.1
|
|
Bobby Woods-Corwin
Engineer
Post Count: 7
|
4/29/2008 12:36 pm
> As it turns out, the fans were not securely plugged to the
> power supply, so that caused the SSP to freak out. The
> nodes now boot, but now there's a complaint that none
> of them are the router:
The message means that m0n8 (which should be physically connected to the outside world) didn't come up. /var/log/sca/sca-m0n8.console will have some information about what happened there.
> Here's my sicortex-system.conf
>
> [DEFAULT]
> sca.cluster.head-node = sca-m0n8
> sca.cluster.io-nodes= sca-m0n8
> sca.cluster.default-router = ssp
>
> sca.node.sca-m0n8.router = nat
> sca.node.sca-m0n8.interfaces = eth1
> #sca.node.sca-m0n8.eth1.address = dhcp
> sca.node.sca-m0n8.eth1.address = 128.210.127.241
> sca.node.sca-m0n8.eth1.netmask = 255.255.255.0
> sca.node.sca-m0n8.eth1.gateway = 128.210.127.1
We generally run with DHCP; I'm not 100% sure what to look for if you statically define the address.
|
|
Richard Dischler
Director, Custom Systems
Post Count: 45
|
4/29/2008 1:10 pm
The default config for sca-m0n8 has to be used.
Simply allow the two RJ45 Ethernet ports to DHCP and then use the addresses they receive. If you edit sicortex-system.conf, then the virtual console between the 72-processor module and the x86 board will break.
|
|
Richard Dischler
Director, Custom Systems
Post Count: 45
|
4/29/2008 1:11 pm
The file should look like this:
# cat /etc/sicortex-system.conf
# SiCortex system configuration
# See /etc/sicortex-system.conf.example
# This header is necessary for the config
# parser module. Do not alter.
[DEFAULT]
sca.cluster.head-node = sca-m0n8
sca.node.sca-m0n8.eth1.address = dhcp
sca.node.sca-m0n8.interfaces = eth1
sca.node.sca-m0n8.router = nat
|
|
Ben Cotton
Purdue University
Systems Administrator
Post Count: 9
|
4/29/2008 2:22 pm
Richard Dischler wrote:
The default config for sca-m0n8 has to be used.
Simply allow the two RJ Enet ports to DHCP and then
use the addresses they receive.
***
Okay, so then how do I get my head node to have a static IP? I don't control the DHCP server, so I can't give it a dynamically-assigned static IP. How would I give the head node a DNS host name so that users can log in to it?
The above point is moot temporarily. I used the sicortex-system.conf that Richard provided and I get the same problem. Any suggestions as to where I should look next?
|
|
Richard Dischler
Director, Custom Systems
Post Count: 45
|
4/29/2008 2:41 pm
Since the fan cable was in question, are the others ok? Looking down the rear edge of the x86 board, the blue Enet cable should be seen first, and it connects to J3, low down on the 72-processor module.
The Enet cable in the middle rear of the x86 board should be green and connect to the coupler in the rear bulkhead. The far red Enet cable (hardest to see) should connect to the RJ45 port that is near the rear bulkhead, but remains internal.
The recommendations from earlier notes were a way to reduce the changes and see if we can get the nodes to boot. Once that is solved, then other addressing changes can be looked into.
Here, I give our network administrator the hardware MAC address of the port I want to put on the net, along with a hostname. Then, whenever that MAC shows up, we automatically get the hostname that was requested.
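For the MAC-based approach, the address has to be read off the port first. A minimal sketch, assuming a Linux host that exposes interfaces under /sys/class/net and that it is run on the machine that owns the port (eth1 is only the interface name used in the config earlier in the thread); read_mac is a hypothetical helper.

```python
from pathlib import Path

def read_mac(interface, sysfs_root="/sys/class/net"):
    """Return the MAC address Linux reports for an interface, or None."""
    try:
        return (Path(sysfs_root) / interface / "address").read_text().strip()
    except OSError:
        return None  # no such interface, or sysfs not available

if __name__ == "__main__":
    print(read_mac("eth1"))
```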
|
|
Richard Dischler
Director, Custom Systems
Post Count: 45
|
4/29/2008 2:52 pm
Another thing to try is simply re-executing scboot -p sca or scboot -p sca --start_msp=force.
All boots should complete before the counter gets over 200s.
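The two retries can be chained so the forced variant runs only if the plain one fails. This sketch assumes scboot exits with a non-zero status when the boot fails, which the thread does not confirm; boot_with_fallback is a hypothetical name, and the commands are exactly the two given above.

```python
import subprocess

BOOT_COMMANDS = (
    ["scboot", "-p", "sca"],
    ["scboot", "-p", "sca", "--start_msp=force"],
)

def boot_with_fallback(commands=BOOT_COMMANDS, run=subprocess.run):
    """Run each command in order; return the first that exits 0, else None."""
    for command in commands:
        # Assumption: a failed boot is reported through the exit status.
        if run(command).returncode == 0:
            return command
    return None
```

The run parameter exists only so the logic can be exercised without a SiCortex system; on the SSP it would be called with no arguments.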
|