Wednesday, July 30, 2008

RHEL and Infiniband - advanced diagnostics - part one

I will continue from where I finished last time. The remaining diagnostic tools depend on the sysfs interface: the information they report is extracted from this filesystem. If you don't remember the meaning of each entry under the /sys/class/infiniband directory, use these tools instead.

One common IB network issue is that the IB subnet manager is not running. The IB nodes then have no LIDs assigned and aren't able to see each other. The node or its IB ports are connected but not yet initialized. To find this out without sysfs, use the ibstat command:

CA 'mthca0'
CA type: MT25208 (MT23108 compat mode)
Number of ports: 2
Firmware version: 4.7.400
Hardware version: a0
Node GUID: 0x0003ba0001007ba8
System image GUID: 0x0003ba0001007bab
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510a68
Port GUID: 0x0003ba0001007ba9
Port 2:
State: Initializing
Physical state: LinkUp
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510a68
Port GUID: 0x0003ba0001007baa

The ibstat output contains everything we need: port state, LID, GUID and rate. The IB link is up but the ports are in the INIT state, so no IB subnet manager is running. This is clear because the SM lid parameter is zero; it should hold the LID of the node that acts as the IB subnet manager. The same holds for Base lid: a zero value means that the IB network isn't initialized yet. Similar information is provided by the ibnetdiscover command:

Switch 9 "S-00144f00006e9794" # "" base port 0 lid 2 lmc 0
[6] "H-0003ba0001003de4"[2] # "node2 HCA-1" lid 0
[5] "H-0003ba0001003de4"[1] # "node2 HCA-1" lid 0
[4] "H-0003ba0001007ba8"[2] # "node1 HCA-1" lid 0
[3] "H-0003ba0001007ba8"[1] # "node1 HCA-1" lid 0

Ca 2 "H-0003ba0001003de4" # "node2 HCA-1"
[2] "S-00144f00006e9794"[6] # lid 0 lmc 0 "" lid 2
[1] "S-00144f00006e9794"[5] # lid 0 lmc 0 "" lid 2

Ca 2 "H-0003ba0001007ba8" # "node1 HCA-1"
[2] "S-00144f00006e9794"[4] # lid 0 lmc 0 "" lid 2
[1] "S-00144f00006e9794"[3] # lid 0 lmc 0 "" lid 2

The number in square brackets at the start of each line is the physical port number on the device being listed; the number in brackets after the quoted GUID is the port on the peer device. As we can see, the network is up and discoverable. It consists of one IB switch and two IB nodes, each with a dual-ported IB HCA. The switch has been assigned LID 2; the nodes aren't initialized yet. To display node GUIDs only, use the ibnodes command:

Ca : 0x0003ba0001003de4 ports 2 "node2 HCA-1"
Ca : 0x0003ba0001007ba8 ports 2 "node1 HCA-1"
Switch : 0x00144f00006e9794 ports 9 "" base port 0 lid 2 lmc 0
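
If you want to collect such a listing from a script, the lines are regular enough to parse. The following is only a minimal Python sketch, assuming the ibnodes output always looks like the three lines above; the function name parse_ibnodes and the dictionary keys are my own choice and not part of any OFED tool.

```python
import re

# Sample copied from the ibnodes listing above.
SAMPLE = '''\
Ca : 0x0003ba0001003de4 ports 2 "node2 HCA-1"
Ca : 0x0003ba0001007ba8 ports 2 "node1 HCA-1"
Switch : 0x00144f00006e9794 ports 9 "" base port 0 lid 2 lmc 0
'''

# Node type, node GUID, port count and the quoted node description.
LINE_RE = re.compile(
    r'^(Ca|Switch)\s*:\s*(0x[0-9a-fA-F]+)\s+ports\s+(\d+)\s+"([^"]*)"'
)


def parse_ibnodes(text):
    """Return one dict per node line; lines that do not match are skipped."""
    nodes = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            kind, guid, ports, description = match.groups()
            nodes.append({
                "type": kind,
                "guid": guid,
                "ports": int(ports),
                "description": description,
            })
    return nodes


if __name__ == "__main__":
    for node in parse_ibnodes(SAMPLE):
        print(node["type"], node["guid"], node["ports"], node["description"])
```

Anything after the quoted description (base port, lid, lmc) is ignored here; extend the expression if you need those fields.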

Finally, two commands remain: ibroute and ibchecknet. Because the IB network is not fully initialized, the nodes can't query the switch for its forwarding table, so ibroute doesn't work here, although it is helpful otherwise. The ibchecknet command produces address resolution errors, showing that the IB network is not valid:

lid 2 address resolution: FAILED
# Switch: nodeguid 0x00144f00006e9794 failed

# Checking Ca: nodeguid 0x0003ba0001003de4
lid 0 address resolution: FAILED
# Ca: nodeguid 0x0003ba0001003de4 failed

# Checking Ca: nodeguid 0x0003ba0001007ba8
lid 0 address resolution: FAILED
# Ca: nodeguid 0x0003ba0001007ba8 failed

## Summary: 3 nodes checked, 3 bad nodes found
## 8 ports checked, 0 bad ports found
## 0 ports have errors beyond threshold

At the beginning, I stated that the IB subnet manager is not running. Let's launch it with the /etc/init.d/opensmd script and see how the behaviour of the tools changes.
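
The manual check from the beginning of this post (ports stuck in the Initializing state with SM lid 0) can also be scripted. Below is a rough Python sketch, assuming the ibstat output of a single CA in the format shown earlier; the sample text is abbreviated from that listing and the function names are hypothetical.

```python
# Abbreviated from the ibstat listing earlier in this post.
SAMPLE = """\
CA 'mthca0'
Number of ports: 2
Port 1:
State: Initializing
Physical state: LinkUp
Base lid: 0
SM lid: 0
Port GUID: 0x0003ba0001007ba9
Port 2:
State: Initializing
Physical state: LinkUp
Base lid: 0
SM lid: 0
Port GUID: 0x0003ba0001007baa
"""


def parse_ibstat_ports(text):
    """Return {port_number: {field: value}} for the ports of one CA."""
    ports = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Port ") and line.endswith(":"):
            # A header such as "Port 1:" opens a new port section.
            current = int(line[len("Port "):-1])
            ports[current] = {}
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            ports[current][key.strip()] = value.strip()
    return ports


def ports_waiting_for_sm(ports):
    """Ports that are initializing and know no subnet manager (SM lid 0)."""
    return sorted(
        number for number, fields in ports.items()
        if fields.get("State") == "Initializing" and fields.get("SM lid") == "0"
    )


if __name__ == "__main__":
    print(ports_waiting_for_sm(parse_ibstat_ports(SAMPLE)))
```

With more than one CA in the output the port numbers would collide, so a real script would have to key the result by CA name as well.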

Friday, July 25, 2008

RHEL and Infiniband - basic diagnostics

I am going to close the article series about Infiniband technology on the RHEL platform (check the previous posts 1, 2, 3) with posts devoted to IB troubleshooting. I would like to introduce some basic diagnostic steps for an IB environment which may help you uncover errors and misconfiguration.

Most of the troubles you may meet are traceable via the OFED diagnostic tools. Up to OFED 1.2 they are part of the openib-diags package; since version 1.3 it has been replaced by the infiniband-diags package. Let's take a look at the most useful ones:
  1. ibstat - shows IB device status like firmware version, ports state, their rate, GUIDs, LIDs ...
  2. ibnetdiscover - discovers IB network topology
  3. ibroute - queries for IB switch forwarding table (like routing table)
  4. ibnodes - shows IB nodes in topology
  5. ibchecknet - runs IB network validation
  6. ibping - ping IB address
  7. sysfs - Linux virtual filesystem representing kernel structures; for IB there is the /sys/class/infiniband directory
The IB network is similar to other high-performance network technologies like Fibre Channel, and most of the troubles are common to them. You may need to resolve connectivity issues, incompatibilities between firmware or higher-level software revisions, driver bugs and similar.

First, I would like to explain the usage of the last two tools, ibping and sysfs. They are simple enough and familiar from other fields. ibping works in a client-server fashion: you need to run ibping in server mode on one side, and the other side acts as a client. The server answers the client's pings with pongs.
  1. Server mode - ibping -S -v
  2. Client mode - ibping -v SERVER_LID_ADDR
The -v argument only increases the verbosity level. The right LID address can be found with the ibnetdiscover command: run it, find the server node line and use the associated LID. I will explain it later. If the IB network is healthy, ibping should produce output like this at the server side (the server LID is 4, its hostname is node2):

ibwarn: [6795] ibping_serv: starting to serve...
ibwarn: [6795] ibping_serv: Pong: node2.(none)

The pongs have to be visible at the client side:

ibwarn: [17946] ibping: Ping..

Pong from node2.(none) (Lid 4): time 0.235 ms

If you aren't able to see them, you should check the connectivity status of your IB HCA. One way to do it is via sysfs. Each IB HCA is represented by a subdirectory under the /sys/class/infiniband directory, where you can find a lot of useful information. For example, if you have a dual-ported HCA from Mellanox, there should be the following entries for port states:
  1. /sys/class/infiniband/mthca0/ports/1/state
  2. /sys/class/infiniband/mthca0/ports/2/state
The state usually has one of three values with these meanings:
  1. DOWN - port is physically disconnected
  2. INIT - port is connected but not yet initialized by the subnet manager
  3. ACTIVE - port is online and working
For ibping to work, the ports of both nodes have to be in the ACTIVE state. If they are in the INIT state, the subnet manager may not be running. The DOWN state usually means a cable problem. By the way, the remaining tools offer other ways to get this information. I am going to explore them next time.
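
The sysfs check is easy to script. Here is a small Python sketch under a few assumptions: the HCA is called mthca0 as in this post, and the state file may hold the state name with a numeric prefix (for example "4: ACTIVE") or the bare name. The function names and the hint texts are mine.

```python
import os


def normalize_state(raw):
    """Reduce the file contents to the bare state name, e.g. 'ACTIVE'."""
    # Some kernels write "4: ACTIVE"; keep only the part after the colon.
    return raw.strip().split(":", 1)[-1].strip()


def diagnose(state):
    """Map a port state to the hint given in the text above."""
    hints = {
        "DOWN": "physically disconnected, check the cable",
        "INIT": "connected, but the subnet manager may not be running",
        "ACTIVE": "online and working",
    }
    return hints.get(state, "unexpected state: " + state)


def read_port_states(hca="mthca0", sysfs_root="/sys/class/infiniband"):
    """Return {port_name: state} for every port directory of the HCA."""
    ports_dir = os.path.join(sysfs_root, hca, "ports")
    states = {}
    for port in sorted(os.listdir(ports_dir)):
        with open(os.path.join(ports_dir, port, "state")) as handle:
            states[port] = normalize_state(handle.read())
    return states


if __name__ == "__main__":
    for port, state in read_port_states().items():
        print("port %s: %s (%s)" % (port, state, diagnose(state)))
```

Run on a node with an IB HCA, it prints one line per port; on a machine without the directory it simply fails with an error from os.listdir.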

Wednesday, July 23, 2008

VMware ESXi will be free

Within a few days or weeks, VMware should release its lightweight hypervisor VMware ESXi for free. It is an enterprise-class hypervisor with a footprint of about 32MB which is integrated into modern servers through e.g. solid state disks. The small footprint is achieved by dropping the so-called Console Operating System (based on RHEL 3). It includes basic functionality like vSMP or VMFS; for the advanced features, you need to manage it with VMware VirtualCenter. You can download it from here.

Tuesday, July 22, 2008

Quickly - /dev/vcs and /dev/tty magic on Linux

Have you ever wanted to check the content of the first virtual console without switching to it with "Ctrl+Alt+F1" shortcut from your desktop session? Or the second console of a remote server? Or would you like to send something to the user who is working at the third virtual console (not via wall command)?

The Linux kernel provides two kinds of character devices for such tasks:
  • /dev/ttyX - represents virtual console number X
  • /dev/vcsX - represents the text contents of virtual console number X
So, to answer the questions use these commands:
  1. cat /dev/vcs1
  2. ssh root@server 'cat /dev/vcs2'
  3. echo "something" > /dev/tty3
More information about the devices allocated by Linux can be found in /usr/src/linux/Documentation/devices.txt. You need to have the Linux kernel sources installed.
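
The same tricks can be used from a script. The Python sketch below rests on assumptions of mine: the /dev/vcsX device returns the screen as one long run of characters without newlines, the console is 80 columns wide, and the script runs with enough privileges (usually root) to open the devices. The function names are only for illustration.

```python
def split_rows(screen, columns=80):
    """Cut the flat screen text into rows and drop trailing spaces."""
    rows = [screen[i:i + columns] for i in range(0, len(screen), columns)]
    return [row.rstrip() for row in rows]


def read_console(number, columns=80, dev_dir="/dev"):
    """Return the text rows of virtual console `number`."""
    with open("%s/vcs%d" % (dev_dir, number), "rb") as handle:
        # latin-1 maps every byte to a character, so decoding never fails.
        screen = handle.read().decode("latin-1")
    return split_rows(screen, columns)


def send_to_console(number, message, dev_dir="/dev"):
    """Write a message to virtual console `number`."""
    with open("%s/tty%d" % (dev_dir, number), "w") as handle:
        handle.write(message + "\n")


if __name__ == "__main__":
    for row in read_console(1):
        print(row)
```

If your console uses a different width, pass it as the columns argument, otherwise the rows will be cut in the wrong places.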

Friday, July 18, 2008

RHEL and Infiniband - basic usage

As I wrote in the previous post, the /etc/init.d/openibd init script is in charge of starting the Infiniband (IB) network. The script parses the /etc/ofed/openib.conf configuration file where you can specify which ULPs should be initialized. By default, all the ULPs I mentioned last time - ipoib, srp, sdp - are enabled.

The opensm IB subnet manager is controlled with the /etc/init.d/opensmd init script, which is configurable via the /etc/ofed/opensm.conf configuration file. You can turn on debugging there, but it is not normally needed. It is more useful to enable verbose mode, which increases the log verbosity level. The default log file is /var/log/osm.log. So, if something goes wrong, enable verbose mode and check the log file.

After executing the init scripts, you should check the IB network state. The openibd script is started automatically during system startup, while opensmd has to be enabled (with ntsysv or chkconfig). Follow this checklist:
  1. Is Mellanox HCA recognized?
    • check the output of lsmod | grep ib_mthca
    • check the output of dmesg
  2. Are appropriate ULPs loaded?
    • check the output of lsmod | grep ib_
      • should contain ib_ipoib, ib_srp, ib_sdp
  3. Is IB network initialized and working?
    • check the output of cat /sys/class/infiniband/mthca0/ports/X/state
      • should be ACTIVE
  4. Is ib0 network interface available?
    • check the output of ifconfig -a
If all the checks pass, you should be able to use the IP protocol over the IB network. I assume you have at least two IB nodes in the IB network, both configured the same way, and both have passed the checks (as in the first article). To configure it, follow these steps:
  1. assign an IP address to the nodes
    • run ifconfig ib0 IP_ADDR1 up at first node
    • run ifconfig ib0 IP_ADDR2 up at second node
  2. check the IPoIB functionality
    • run ping IP_ADDR2 from the first node
    • run ping IP_ADDR1 from the second node
So, wasn't it simple? If everything is working, ping should receive replies from the other side. Now you can run any IP-based application over IB (FTP, NFS and so on) and utilize its benefits like high throughput and low latency. If you are interested in the topic, please leave me a comment.
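
The module part of the checklist above can be automated. The following Python sketch only inspects text in the lsmod format; the sample listing is invented for illustration (the sizes and counters are made up), and the function names are hypothetical.

```python
# Invented sample in lsmod format; sizes and use counts are made up.
SAMPLE_LSMOD = """\
Module                  Size  Used by
ib_ipoib               90000  0
ib_srp                 60000  0
ib_mthca              150000  0
ib_core                90000  3 ib_ipoib,ib_srp,ib_mthca
"""

# Modules named in the checklist: the Mellanox HCA driver and the ULPs.
REQUIRED = ("ib_mthca", "ib_ipoib", "ib_srp", "ib_sdp")


def loaded_modules(lsmod_text):
    """Return the set of module names found in lsmod-style text."""
    names = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Skip empty lines and the "Module  Size  Used by" header.
        if fields and fields[0] != "Module":
            names.add(fields[0])
    return names


def missing_modules(lsmod_text, required=REQUIRED):
    """Return the required modules that are not loaded, in checklist order."""
    present = loaded_modules(lsmod_text)
    return [name for name in required if name not in present]


if __name__ == "__main__":
    print("missing:", missing_modules(SAMPLE_LSMOD))
```

On a real node you would feed it the actual output of lsmod, for example through the subprocess module, instead of the sample text.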

Tuesday, July 15, 2008

Quickly - RPM uninstall and scriptlet failure

Sometimes I'm not able to uninstall an RPM package because of errors in the SPEC file related to the scriptlets. Last time it happened when I was uninstalling the HP OpenView Storage Data Protector packages from a RHEL server. By mistake, I uninstalled one package which was a dependency of another package, and afterwards I wasn't able to uninstall the second one, because it relied on the removed package and the dependency wasn't checked correctly. The whole uninstall procedure looked like this:
  1. rpm -e OB2-CORE-A.06.00-1
  2. rpm -e OB2-DA-A.06.00-1
And the produced error follows:
  • ERROR: Cannot find /opt/omni//bin/omnicc
  • error: %preun(OB2-DA-A.06.00-1.x86_64) scriptlet failed, exit status 3
So, is there a way to get rid of such a package? Yes, there is, and it is simple: just disable the execution of the scriptlets like this:
  1. rpm -e --noscripts OB2-DA-A.06.00-1
I think it is a pretty simple feature of RPM, but it is a bit difficult to remember.
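
For completeness, here is how the same fallback might look in a script. This is only a sketch, assuming rpm reports the problem with the words "scriptlet failed" on its error output as in the message above; the function names are mine. Keep in mind that --noscripts skips whatever cleanup the scriptlet was supposed to do, so a blind retry is not always what you want.

```python
import subprocess


def scriptlet_failed(stderr_text):
    """True if the rpm error output mentions a failed scriptlet."""
    return "scriptlet failed" in stderr_text


def erase_package(name, run=subprocess.run):
    """Try 'rpm -e'; retry with --noscripts only after a scriptlet failure.

    Returns the exit status of the last rpm call.
    """
    result = run(["rpm", "-e", name], capture_output=True, text=True)
    if result.returncode != 0 and scriptlet_failed(result.stderr):
        result = run(["rpm", "-e", "--noscripts", name],
                     capture_output=True, text=True)
    return result.returncode


if __name__ == "__main__":
    # Only the text check is exercised here; no package is touched.
    message = ("error: %preun(OB2-DA-A.06.00-1.x86_64) "
               "scriptlet failed, exit status 3")
    print(scriptlet_failed(message))
```

The run parameter exists only so that the function can be tried out with a stand-in instead of the real rpm command.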

Friday, July 11, 2008

Sun released new servers and storage arrays

We had to wait a few months for the AMD Opteron servers from Sun after the new quad-core AMD processors, code-named Barcelona, were released. A few days ago, Sun officially announced here the availability of the second generation of their AMD servers and new storage arrays, together called "next generation open storage hardware". You can find more about the Open Storage hardware and related projects here.

The new servers based on quad-core AMD processors are the Sun Fire X4140, X4240 and X4540. In the storage field, the new Sun Storage J4200, J4400 and J4500 arrays were introduced. All of them are SAS JBOD arrays. For more details, look at the Sun System Handbook at SunSolve.

Monday, July 7, 2008

RHEL and Infiniband - software intro

Let's continue with the software introduction. As I wrote, the switch is equipped with ALOM remote management. There is a universal set of commands for platform-independent management like password, poweroff, setupsc, resetsc and so on, and then a set of commands which are more specific to the platform. In the case of our IB switch there are two such commands:
  1. setbp - for setting so-called blueprint of switch
  2. showbp - for showing the current blueprint
  3. there are five predefined blueprints:
    • 9 node, 12 node, 18 node, none and unmanaged
The natural question is: what does the blueprint mean? According to the official documentation, it seems to be a predefined configuration of the switch. You can change it with the setbp command, which asks whether you want to run the IB management software, how many hosts will be in the subnet and what the subnet identifier is. By default, if you use switches preconfigured from the factory, two switches will have the same subnet ID. The trouble is that if you intend to configure some level of redundancy between IB switches, you will have to put them in different subnets with different subnet IDs. I find it strange, because I had to disable the IB management software, otherwise I wasn't able to see the nodes in the fabric. As we will see, the IB management software, including the IB subnet manager, doesn't seem to like the OFED included in the RHEL distro (I wrote more about RHEL and OFED here).

What about the servers? I preinstalled them with the CentOS 5.1 distribution (which is binary compatible with RHEL 5.1). The distribution contains the OFED implementation in version 1.2. The complete OFED implementation in CentOS is divided into a set of RPM packages. The platform-dependent part of OFED, that is the kernel modules, is distributed with the kernel package. Let's make a quick summary of the basic packages:
  1. kernel - contains IB hardware, IB core and IB ULP modules
    • ULP means Upper Level Protocol
    • everything is placed in the following directories:
      • /lib/modules/`uname -r`/kernel/drivers/infiniband/hw
      • /lib/modules/`uname -r`/kernel/drivers/infiniband/core
      • /lib/modules/`uname -r`/kernel/drivers/infiniband/ulp
    • currently only IB HCAs from Mellanox are supported
    • the supported ULPs are
      • ipoib - IP over IB driver
      • srp - IB SCSI RDMA initiator driver
      • sdp - SDP driver
  2. openib - this package contains a lot of useful documentation; the important parts are the OFED configuration file /etc/ofed/openib.conf and the init script /etc/init.d/openibd, which takes care of activating/deactivating the IB network interfaces. Simply put, it loads the IB core modules and the ULP modules specified in the config.
  3. openib-diags - this package contains diagnostic tools for IB debugging, I will introduce them later.
  4. opensm - here we have our IB subnet manager. The package provides the init script /etc/init.d/opensmd for starting it and the /etc/ofed/opensm.conf configuration file.
  5. libibverbs - this package provides a library allowing userspace programs direct hardware access.
  6. libibcommon, libibmad, libibumad, opensm-libs - and finally library dependencies for the above packages.
I need to add that the OFED packages belong to the System Environment/Libraries RPM group and are not installed by default, apart from openib, libibverbs and of course the kernel package. That's all for now; next time I'm going to describe how to work with it.

Tuesday, July 1, 2008

RHEL and Infiniband - hardware intro

In my two previous articles, I summarized a few facts about Infiniband support in RHEL distros and the included protocols; you can reach them via the following links - RHEL and Infiniband support and Infiniband, RDP, SDP.... Let's be more specific now.

My scenario was based on two Sun Fire X4200 M2 servers and one Infiniband (IB) switch, the Sun IB Switch 9P. The servers had Infiniband host channel adapters (HCA), Sun Dual Port 4x IB HCA, installed to be able to communicate over the IB fabric. The switch provides nine IB-compliant ports at dual speeds of 4X/12X, which means that each port is able to deliver 10/30 Gbit/s of raw bandwidth. What surprised me was that the switch management is the same as on the Sun SPARC midrange servers. Yes, it is ALOM, and it is perfect because you can use the same interface and similar commands you are used to. By the way, the switch chassis looks like a regular Sun server.
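
The 10/30 Gbit figures follow from the lane counts. As a small worked example in Python, assuming single data rate (SDR) links with 2.5 Gbit/s of signalling per lane and 8b/10b line coding (both are my assumptions about this hardware, consistent with the raw figures quoted above):

```python
LANE_SIGNALLING_GBIT = 2.5  # assumed SDR signalling rate per lane


def raw_rate_gbit(lanes):
    """Raw signalling rate of a link with the given number of lanes."""
    return lanes * LANE_SIGNALLING_GBIT


def data_rate_gbit(lanes):
    """Usable data rate: 8b/10b coding carries 8 data bits per 10 line bits."""
    return raw_rate_gbit(lanes) * 8 / 10


if __name__ == "__main__":
    for lanes in (4, 12):
        print("%dX: raw %.1f Gbit/s, data %.1f Gbit/s"
              % (lanes, raw_rate_gbit(lanes), data_rate_gbit(lanes)))
```

So a 4X port signals at 10 Gbit/s and carries at most 8 Gbit/s of data, while a 12X port signals at 30 Gbit/s and carries at most 24 Gbit/s.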

The switch is equipped with the IB subnet manager (SM), which is required to initialize the IB hardware and to allow communication over the IB fabric. Each IB subnet has to have at least one SM, and each subnet has an unambiguous identifier (ID) within the fabric. To be complete, the fabric comprises the defined subnets. In my opinion, the IB SM works like an ARP cache and a DHCP server in LANs. Each HCA in a fabric is globally identified by a so-called node GUID, which is like a WWN in FC or a MAC address in a LAN. The switch has its own GUID as well. The ports of an HCA have so-called port GUIDs. Now, when one HCA or its port wants to communicate with another one in the subnet, it needs to have a network address assigned. This address is called the LID, or local identifier, and the IB SM is in charge of assigning it to the members of the subnet. The conclusion is that LIDs are valid inside the subnet only, while GUIDs are routable across the subnets of the fabric.

But one thing confused me a bit. When you configure the switch, you need to remember to set its blueprint, otherwise you are asking for trouble. I'm going to write about it in the next part.