Wednesday, July 30, 2008

RHEL and Infiniband - advanced diagnostics - part one

I will continue from the point where I finished last time. The remaining diagnostics tools depend on sysfs interface. The provided information is extracted from this filesystem. If you don't remember the meaning of each entry under the /sys/class/infiniband directory use these tools.

The IB subnet manager is not running is one of the IB network issues. The IB nodes don't have assigned any LIDs and they aren't able to see each other. The node or his IB ports are connected but they aren't initialized yet. To find out this without sysfs use the ibstat command:

CA 'mthca0'
CA type: MT25208 (MT23108 compat mode)
Number of ports: 2
Firmware version: 4.7.400
Hardware version: a0
Node GUID: 0x0003ba0001007ba8
System image GUID: 0x0003ba0001007bab
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510a68
Port GUID: 0x0003ba0001007ba9
Port 2:
State: Initializing
Physical state: LinkUp
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510a68
Port GUID: 0x0003ba0001007baa


The output contains everything what we need - port state, LID, GUID, rate. The IB link is up but the ports are in the INIT state. No IB subnet manager is running. It is clear because the Sm lid parameter has zero value. It should have LID value of the node which acts like IB subnet manager. The same holds for Base lid. The zero value means that the IB network isn't initialized yet. The similar information will be provided by the ibnetdiscover command:

vendid=0x144f
devid=0x0
switchguid=0x144f00006e9794
Switch 9 "S-00144f00006e9794" # "" base port 0 lid 2 lmc 0
[6] "H-0003ba0001003de4"[2] # "node2 HCA-1" lid 0
[5] "H-0003ba0001003de4"[1] # "node2 HCA-1" lid 0
[4] "H-0003ba0001007ba8"[2] # "node1 HCA-1" lid 0
[3] "H-0003ba0001007ba8"[1] # "node1 HCA-1" lid 0

vendid=0x3ba
devid=0x6278
sysimgguid=0x3ba0001003de7
caguid=0x3ba0001003de4
Ca 2 "H-0003ba0001003de4" # "node2 HCA-1"
[2] "S-00144f00006e9794"[6] # lid 0 lmc 0 "" lid 2
[1] "S-00144f00006e9794"[5] # lid 0 lmc 0 "" lid 2

vendid=0x3ba
devid=0x6278
sysimgguid=0x3ba0001007bab
caguid=0x3ba0001007ba8
Ca 2 "H-0003ba0001007ba8" # "node1 HCA-1"
[2] "S-00144f00006e9794"[4] # lid 0 lmc 0 "" lid 2
[1] "S-00144f00006e9794"[3] # lid 0 lmc 0 "" lid 2

The square brackets contain the physical port number at the switch. As we can see, the network is up and discoverable. It consists of one IB switch and two IB nodes, each with dual ported IB HCA. The switch has assigned LID 2, the nodes aren't initialized yet. To display node GUIDs only, use the ibnodes command:

Ca : 0x0003ba0001003de4 ports 2 "node2 HCA-1"
Ca : 0x0003ba0001007ba8 ports 2 "node1 HCA-1"
Switch : 0x00144f00006e9794 ports 9 "" base port 0 lid 2 lmc 0

Finally, two commands remained - ibroute and ibchecknet. As the IB network is not fully initialized the nodes can't contact the switch for forwarding table. So the ibroute command isn't working otherwise it is helpful. The ibchecknet command produces address resolution errors, the IB network is not valid:

lid 2 address resolution: FAILED
# Switch: nodeguid 0x00144f00006e9794 failed

# Checking Ca: nodeguid 0x0003ba0001003de4
lid 0 address resolution: FAILED
# Ca: nodeguid 0x0003ba0001003de4 failed

# Checking Ca: nodeguid 0x0003ba0001007ba8
lid 0 address resolution: FAILED
# Ca: nodeguid 0x0003ba0001007ba8 failed

## Summary: 3 nodes checked, 3 bad nodes found
## 8 ports checked, 0 bad ports found
## 0 ports have errors beyond threshold

In the beginning, I stated the IB subnet manager is not running. Let's launch it with /etc/init.d/opensmd script and we will see how the behaviour of the tools will change.

No comments: