Thursday, October 16, 2008

RHEL and Infiniband - advanced diagnostics - part two

It has been almost two months since I began writing about advanced diagnostics of IB networks. At the end of that article, I suggested starting the IB subnet manager. So let's do it with the init script:
/etc/init.d/opensmd start
Now we are ready to compare the outputs of the commands when the IB subnet manager wasn't running and when it is running. There should be noticeable differences because the IB network should now be fully initialized. First, let's see what the ibstat command shows us now:
CA 'mthca0'
    CA type: MT25208 (MT23108 compat mode)
    Number of ports: 2
    Firmware version: 4.7.400
    Hardware version: a0
    Node GUID: 0x0003ba0001007ba8
    System image GUID: 0x0003ba0001007bab
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 10
        Base lid: 1
        LMC: 0
        SM lid: 1
        Capability mask: 0x02510a6a
        Port GUID: 0x0003ba0001007ba9
    Port 2:
        State: Active
        Physical state: LinkUp
        Rate: 10
        Base lid: 3
        LMC: 0
        SM lid: 1
        Capability mask: 0x02510a68
        Port GUID: 0x0003ba0001007baa
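If you want to check the port state from a script rather than by eye, here is a minimal Python sketch. It parses text shaped like the ibstat output above and reports whether every port is Active with a non-zero Base lid and SM lid. The function names parse_ports and port_is_initialized are my own, the sample is a trimmed copy of the output above, and the parser assumes a single CA with exactly this "Key: value" layout, so treat it as an illustration rather than a general ibstat parser.

```python
import re

# Trimmed copy of the ibstat output shown above (assumption: one CA only).
SAMPLE = """\
CA 'mthca0'
Number of ports: 2
Port 1:
State: Active
Physical state: LinkUp
Base lid: 1
SM lid: 1
Port GUID: 0x0003ba0001007ba9
Port 2:
State: Active
Physical state: LinkUp
Base lid: 3
SM lid: 1
Port GUID: 0x0003ba0001007baa
"""


def parse_ports(text):
    """Return {port_number: {field: value}} from ibstat-style text."""
    ports = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"Port (\d+):", line)
        if header:
            # A "Port N:" line opens a new per-port section.
            current = ports.setdefault(int(header.group(1)), {})
            continue
        # Lines before the first port header (CA-level fields) are skipped.
        if current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return ports


def port_is_initialized(fields):
    """True if the port is Active and has non-zero Base lid and SM lid."""
    return (
        fields.get("State") == "Active"
        and int(fields.get("Base lid", "0")) != 0
        and int(fields.get("SM lid", "0")) != 0
    )


if __name__ == "__main__":
    for number, fields in sorted(parse_ports(SAMPLE).items()):
        print(number, port_is_initialized(fields))
```

On a live system you would feed the function the real output of ibstat instead of SAMPLE, for example by capturing it with subprocess.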
The IB subnet manager is responsible for finishing the IB hardware initialization phase. Both ports of the HCA are now in the Active state and each has been assigned a Base lid, which is required for communication over the IB network. We can tell the IB subnet manager is working because the SM lid has been assigned as well. What about the other nodes in the network? Let's try the ibnetdiscover command. It should tell us more now:
Switch 9 "S-00144f00006e9794" # "" base port 0 lid 2 lmc 0
[6] "H-0003ba0001003de4"[2] # "node2 HCA-1" lid 5
[5] "H-0003ba0001003de4"[1] # "node2 HCA-1" lid 4
[4] "H-0003ba0001007ba8"[2] # "node1 HCA-1" lid 3
[3] "H-0003ba0001007ba8"[1] # "node1 HCA-1" lid 1

Ca 2 "H-0003ba0001003de4" # "node2 HCA-1"
[2] "S-00144f00006e9794"[6] # lid 5 lmc 0 "" lid 2
[1] "S-00144f00006e9794"[5] # lid 4 lmc 0 "" lid 2

Ca 2 "H-0003ba0001007ba8" # "node1 HCA-1"
[2] "S-00144f00006e9794"[4] # lid 3 lmc 0 "" lid 2
[1] "S-00144f00006e9794"[3] # lid 1 lmc 0 "" lid 2
Do you remember the LID numbers from the uninitialized IB network? They were all zeroes because the HCAs were uninitialized. Now each port has a unique LID. Next time, we are going to decompose this output.
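The uniqueness of the LIDs can also be verified mechanically. The following Python sketch reads the Ca sections of text shaped like the ibnetdiscover output above and collects the LID of each HCA port. The helper name ca_port_lids and both regular expressions are my own and are tailored to the exact line layout shown in this post; other versions of ibnetdiscover may print extra fields, so the patterns would likely need adjusting.

```python
import re

# Copy of the ibnetdiscover output shown above.
SAMPLE = '''\
Switch 9 "S-00144f00006e9794" # "" base port 0 lid 2 lmc 0
[6] "H-0003ba0001003de4"[2] # "node2 HCA-1" lid 5
[5] "H-0003ba0001003de4"[1] # "node2 HCA-1" lid 4
[4] "H-0003ba0001007ba8"[2] # "node1 HCA-1" lid 3
[3] "H-0003ba0001007ba8"[1] # "node1 HCA-1" lid 1

Ca 2 "H-0003ba0001003de4" # "node2 HCA-1"
[2] "S-00144f00006e9794"[6] # lid 5 lmc 0 "" lid 2
[1] "S-00144f00006e9794"[5] # lid 4 lmc 0 "" lid 2

Ca 2 "H-0003ba0001007ba8" # "node1 HCA-1"
[2] "S-00144f00006e9794"[4] # lid 3 lmc 0 "" lid 2
[1] "S-00144f00006e9794"[3] # lid 1 lmc 0 "" lid 2
'''

# Section header: node type, port count, quoted node id.
HEADER = re.compile(r'^(Switch|Ca)\s+\d+\s+"([^"]+)"')
# Port line inside a Ca section: local port number, then the port's own LID
# as the first "lid N" after the '#'.
PORT = re.compile(r'^\[(\d+)\]\s+"[^"]+"\[\d+\]\s+#\s+lid (\d+)')


def ca_port_lids(text):
    """Return {(ca_id, port_number): lid} for every port listed under a Ca."""
    lids = {}
    section = None  # (node type, node id) of the section being read
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            section = (header.group(1), header.group(2))
            continue
        port = PORT.match(line)
        if port and section and section[0] == "Ca":
            lids[(section[1], int(port.group(1)))] = int(port.group(2))
    return lids


if __name__ == "__main__":
    lids = ca_port_lids(SAMPLE)
    for (ca, port), lid in sorted(lids.items()):
        print(ca, port, lid)
    # Every HCA port should have a non-zero LID that no other port shares.
    print("unique:", len(set(lids.values())) == len(lids))
    print("non-zero:", 0 not in lids.values())
```

Only the Ca sections are parsed because there the first lid value on each line belongs to the local HCA port; in the Switch section the trailing lid belongs to the remote end of the link.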
