Monday, October 20, 2008

RHEL and Infiniband - advanced diagnostics - part three

Let's decompose the ibnetdiscover output a bit. The first paragraph begins with the Switch keyword; the switch has GUID 0x144f00006e9794. Each channel adapter entry begins with the Ca keyword. Their GUIDs are 0x3ba0001003de4 (node node2) and 0x3ba0001007ba8 (node node1). The second one corresponds to the node displayed by ibstat above. You may have noticed the many numbers in square brackets. They identify the ports at the two ends of each physical IB connection. Let's inspect them in more detail:
  • connections from switch to IB nodes (switch -> nodes)
    • switch port [6] is connected to the [2] channel of IB node node2
    • switch port [5] is connected to the [1] channel of IB node node2
    • switch port [4] is connected to the [2] channel of IB node node1
    • switch port [3] is connected to the [1] channel of IB node node1
  • connections from IB node node1 to switch (node -> switch)
    • the [1] IB channel is connected to switch port [3]
    • the [2] IB channel is connected to switch port [4]
  • connections from IB node node2 to switch (node -> switch)
    • the [1] IB channel is connected to switch port [5]
    • the [2] IB channel is connected to switch port [6]
The logic is simple: every connection is listed twice, once from the switch side and once from the node side, and the two views agree. It is also evident that the IB connections are full-duplex in our scenario.
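The port pairs listed above can be cross-checked mechanically. The following is only a minimal sketch: the two dictionaries are typed in by hand from the bullet list, and the names switch_view, node_view and is_consistent are made up for this illustration (they are not part of any IB tool).

```python
# Links as seen from the switch: switch port -> (node name, node IB channel).
switch_view = {
    6: ("node2", 2),
    5: ("node2", 1),
    4: ("node1", 2),
    3: ("node1", 1),
}

# Links as seen from the nodes: (node name, node IB channel) -> switch port.
node_view = {
    ("node1", 1): 3,
    ("node1", 2): 4,
    ("node2", 1): 5,
    ("node2", 2): 6,
}


def is_consistent(switch_side, node_side):
    """Return True if every link is reported identically from both ends."""
    # Invert the switch view so it has the same shape as the node view.
    inverted = {peer: port for port, peer in switch_side.items()}
    return inverted == node_side


print(is_consistent(switch_view, node_view))
```

If a cable were plugged into a different switch port than the node reports, the two views would no longer match and the check would fail.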

I'm going to skip the ibnodes command, because its output is the same as it was without a running subnet manager. The next command, ibroute, produces the following forwarding table:
Unicast lids [0x0-0x5] of switch Lid 2 guid 0x00144f00006e9794 ():
Lid Out Port Destination Info
0x0001 003 : (Channel Adapter portguid 0x0003ba0001007ba9: 'node1 HCA-1')
0x0002 000 : (Switch portguid 0x00144f00006e9794: '')
0x0003 004 : (Channel Adapter portguid 0x0003ba0001007baa: 'node1 HCA-1')
0x0004 005 : (Channel Adapter portguid 0x0003ba0001003de5: 'node2 HCA-1')
0x0005 006 : (Channel Adapter portguid 0x0003ba0001003de6: 'node2 HCA-1')
5 valid lids dumped
It lists the assigned LIDs, the corresponding switch output ports and the other ends of the connections. It is a classic routing table: LID X is reachable via switch port Y, with additional information about the entity that owns that LID. For example, LID 1 is reachable via switch port 3 and belongs to the channel adapter of node node1.
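To make the "LID X is reachable via port Y" reading concrete, here is a small sketch that turns the ibroute text shown above into a lookup table. The regular expression is only fitted to the lines pasted in this post, so treat it as an assumption about the format rather than a full parser; the names ROUTE_LINE and parse_ibroute are invented for this example.

```python
import re

# The ibroute output exactly as pasted in this post.
IBROUTE_OUTPUT = """\
Unicast lids [0x0-0x5] of switch Lid 2 guid 0x00144f00006e9794 ():
Lid Out Port Destination Info
0x0001 003 : (Channel Adapter portguid 0x0003ba0001007ba9: 'node1 HCA-1')
0x0002 000 : (Switch portguid 0x00144f00006e9794: '')
0x0003 004 : (Channel Adapter portguid 0x0003ba0001007baa: 'node1 HCA-1')
0x0004 005 : (Channel Adapter portguid 0x0003ba0001003de5: 'node2 HCA-1')
0x0005 006 : (Channel Adapter portguid 0x0003ba0001003de6: 'node2 HCA-1')
5 valid lids dumped
"""

# One table row: LID, output port, entity type, port GUID, description.
ROUTE_LINE = re.compile(
    r"^(0x[0-9a-fA-F]+)\s+(\d+)\s*:\s*"
    r"\((.+?) portguid (0x[0-9a-fA-F]+): '(.*)'\)$"
)


def parse_ibroute(text):
    """Map each LID to (switch out port, entity type, port GUID, description)."""
    table = {}
    for line in text.splitlines():
        match = ROUTE_LINE.match(line.strip())
        if match:  # header and summary lines do not match and are skipped
            lid = int(match.group(1), 16)
            out_port = int(match.group(2))
            table[lid] = (out_port, match.group(3), match.group(4), match.group(5))
    return table


routes = parse_ibroute(IBROUTE_OUTPUT)
out_port, kind, guid, name = routes[1]
print(f"LID 1 -> switch port {out_port} ({kind}, {name})")
```

Out port 0 for LID 2 is the switch itself, which is why that row has an empty description.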

To make a final decision on whether the IB network is working, run the ibchecknet command. The output below says that we have 2 working IB HCAs, 3 IB nodes (two with an HCA and one switch) and 8 working IB ports (physically only four links, but each link has a port at both ends: four on the switch and four on the HCAs).
# Checking Ca: nodeguid 0x0003ba0001003de4
# Checking Ca: nodeguid 0x0003ba0001007ba8
## Summary: 3 nodes checked, 0 bad nodes found
## 8 ports checked, 0 bad ports found
## 0 ports have errors beyond threshold
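If you want to turn that summary into a yes/no answer in a script, the counters can be pulled out of the text. This is a hedged sketch that only assumes the three summary lines look like the ones pasted above; the function names summarize and is_healthy are hypothetical.

```python
import re

# The ibchecknet output exactly as pasted in this post.
IBCHECKNET_OUTPUT = """\
# Checking Ca: nodeguid 0x0003ba0001003de4
# Checking Ca: nodeguid 0x0003ba0001007ba8
## Summary: 3 nodes checked, 0 bad nodes found
## 8 ports checked, 0 bad ports found
## 0 ports have errors beyond threshold
"""


def summarize(text):
    """Extract the counters from the ibchecknet summary lines."""
    nodes = re.search(r"(\d+) nodes checked, (\d+) bad nodes found", text)
    ports = re.search(r"(\d+) ports checked, (\d+) bad ports found", text)
    errors = re.search(r"(\d+) ports have errors beyond threshold", text)
    if not (nodes and ports and errors):
        raise ValueError("summary lines not found in ibchecknet output")
    return {
        "nodes": int(nodes.group(1)),
        "bad_nodes": int(nodes.group(2)),
        "ports": int(ports.group(1)),
        "bad_ports": int(ports.group(2)),
        "error_ports": int(errors.group(1)),
    }


def is_healthy(summary):
    """The fabric counts as healthy when no counter reports a problem."""
    return (
        summary["bad_nodes"] == 0
        and summary["bad_ports"] == 0
        and summary["error_ports"] == 0
    )


summary = summarize(IBCHECKNET_OUTPUT)
print(summary, "healthy:", is_healthy(summary))
```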
From now on, we have a working Infiniband network and we are able to do the following:
  • ibping the nodes natively
  • ping the nodes over IPoIB
  • run unmodified network applications over IPoIB (e.g. NFS, FTP and so on)
  • run native RDMA applications
