Showing posts with label troubleshooting. Show all posts
Showing posts with label troubleshooting. Show all posts

Thursday, February 24, 2011

ESXi log files

What is the fastest way to retrieve log files from an ESXi host? In my opinion, the best way is to configure remote logging via syslog server but this requires host reboot to apply configuration changes (KB1016621). The alternative method is to forward log files to different datastore. 
If  you don't have prepared syslog server for remote logging you can use vsphere client and generate system log bundles for particular host. But this takes some time. 
The last method is I think the fastest one because it will allow you to access log files  directly with your web browser. You can use web interface of the ESXi host,  enter the following URL:

https://ESXi_HOST_ADDR/host

The output should looks like shown at this picture:
You can download ESXi log files  messages, hostd.log and vpxa.log  now.

Tuesday, February 8, 2011

vMA missing libraries

If you are using vMA (vSphere Management Assistant) for some specific management tasks like UPS monitoring  or running a scheduled backup script from cron daemon, you may experience an error similar to this one:
Can't load '/usr/lib/perl5/site_perl/5.8.8/libvmatargetlib_perl.so'
for module vmatargetlib_perl: libtypes.so: cannot open shared object
file: No such file or directory at
/usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/DynaLoader.pm line 230.
at /usr/lib/perl5/site_perl/5.8.8/VMware/VmaTargetLib.pm line 10
Compilation failed in require at /usr/lib/perl5/site_perl/5.8.8/VMware/VIFPLib.pm line 10.
A reason for such  behaviour is typically caused by some misunderstandings how shell environment in vMA is configured. The most common mistake is testing the affected script with sudo which strips out some environment variables - especially LD_LIBRARY_PATH - due to some security restrictions. Otherwise, the error shouldn't appear because /etc/bashrc exports vmware SDK library path implicitly:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/vmware/vma/lib64:/opt/vmware/vma/lib
So in  case of sudo or other unspecified scenarios throwing the presented error try to create a wrapper script which explicitly exports a list of directories where to search for ELF libraries again:
#!/bin/bash

LD_LIBRARY_PATH=/opt/vmware/vma/lib64:/opt/vmware/vma/lib  
export LD_LIBRARY_PATH

/path/to/original-script "$@"

exit $?

Friday, January 21, 2011

VCB basic usage - debugging

During the series of articles about VCB usage I supposed that all the presented VCB command examples are running smoothly and without errors. But this is not always true. There can  be a lot of reason why it is not working as expected, e.g. wrong permissions assigned to VCB backup user, misconfigured SAN which doesn't allow to access VMFS volumes or  unspecified problem with creating virtual machine snapshot.

If something goes wrong all VCB commands can be run in more verbose mode with command line switch -L and verbosity level from 0 to 6. The next example illustrates it. We want to perform a full backup of a virtual machine named vcb-backup and it seems the provided user vcbadmin doesn't have required permissions to do it:
vcbmounter -h host -u vcbadmin -p pass -a name:vcb-backup -r c:\mnt\vcb-backup -t fullvm -m nbd -L4
[2010-06-15 13:15:12.843 'vcbMounter' 360 info] Connected using API Namespace vim25.
[2010-06-15 13:15:12.843 'vcbMounter' 360 info] Authenticating user vcbadmin
[2010-06-15 13:15:12.859 'vcbMounter' 360 info] Logged in!
[2010-06-15 13:15:12.890 'vcbMounter' 360 info] Got VM MoRef
[2010-06-15 13:15:12.890 'vcbMounter' 360 info] Got access method
[2010-06-15 13:15:12.890 'vcbMounter' 360 info] Got coordinator object
[2010-06-15 13:15:12.890 'vcbMounter' 360 info] Attempting data access.
[2010-06-15 13:15:12.890 'vcbMounter' 360 info] Creating mount directory
[2010-06-15 13:15:12.890 'vcbMounter' 360 info] No snapshot info for this VM, nothing to do.
[2010-06-15 13:15:12.890 'vcbMounter' 360 info] Creating snapshot
[2010-06-15 13:15:19.296 'vcbMounter' 360 info] Snapshot created, ID: snapshot-579
[2010-06-15 13:15:19.296 'vcbMounter' 360 info] Mount operation created snapshot.
[2010-06-15 13:15:19.312 'vcbMounter' 360 info] Mount operation obtained backupinfo.
[2010-06-15 13:15:19.312 'vcbMounter' 360 info] Performing SearchIndex find.
[2010-06-15 13:15:19.328 'vcbMounter' 360 info] Successfully obtained instance lock.
[2010-06-15 13:15:24.359 'vcbMounter' 360 error] Error: No permission to perform this action.
[2010-06-15 13:15:24.359 'vcbMounter' 360 error] An error occurred, cleaning up...
[2010-06-15 13:15:24.359 'vcbMounter' 360 info] Performing SearchIndex find.
[2010-06-15 13:15:24.359 'vcbMounter' 360 info] Successfully obtained instance lock.
[2010-06-15 13:15:29.437 'vcbMounter' 360 info] Remove clone disks successful.
Deleted directory c:\vcb\vcb-backup
The bold line helps us to identify the cause of the problem.

Tuesday, September 29, 2009

VMware Server 1.0.x library dependency problem

In the beginning of the year, I wrote this article about some problems between older VMware server 1.0.x and newer Linux distributions. The problem is related to the vmware kernel modules whose source code are not compatible with newer Linux kernels.

I was surprised with one thing. When I upgraded VMware Server from version 1.0.8 to 1.0.9, VMware Server console stopped working. The new version was installed on the same system (OpenSUSE 11.1) as the old one, so I don't understand the reason. The important thing is I found a solution. The new version began producing these new error messages after trying to run vmware command:
/usr/lib/vmware/lib/libgcc_s.so.1/libgcc_s.so.1: version `GCC_4.2.0' not found (required by /usr/lib/libstdc++.so.6)
/usr/lib/vmware/lib/libgcc_s.so.1/libgcc_s.so.1: version `GCC_4.2.0' not found (required by /usr/lib/libstdc++.so.6)
/usr/lib/vmware/bin/vmware: symbol lookup error: /usr/lib/libgio-2.0.so.0: undefined symbol: g_thread_gettime
I have tried to unset this environment variable influencing behavior of GTK2 applications:
unset GTK2_RC_FILES
Otherwise, the variable is referencing related gtkrc files defining GTK2 user's environment. Try it and I hope it will help.

Wednesday, August 19, 2009

Linux rc.local script

Sometimes, you need to run some commands during your Linux server startup. And you don't want to waste time with preparing valid init script now. The common task is to load some kernel module or to change speed of network interface and so on.

Red Hat distributions provides for this task rc.local script. You can find it in the directory /etc/rc.d. The script is executed after all the other init scripts. This is ensured with the proper START scripts linking to the /etc/rc.d/rc.local script:

/etc/rc.d/rc2.d/S99local
/etc/rc.d/rc4.d/S99local
/etc/rc.d/rc3.d/S99local
/etc/rc.d/rc5.d/S99local

SUSE distros like SLES or OpenSUSE provide similar mechanism. You have available two scripts. The before.local script should contain everything you want to run before runlevel is entered. The after.local script works like RedHat's rc.local script. It contains stuff which should be executed after runlevel is reached. The scripts don't exist by default, you need to create them at first in the directory /etc/init.d. They don't have to be set executable.

Besides this, the RedHat's rc.local script is executed only in runlevels 2, 3, 4 or 5. It is ignored in the single user mode. SUSE's version of after.local or before.local is interpreted during all runlevels including runlevel 1.

Thursday, April 16, 2009

Linux kernel crash dumps with kdump

Kdump is official GNU/Linux kernel crash dumping mechanism. It is part of vanilla kernel. Before it, there exists some projects like LKCD for performing such things. But they weren't part of mainline kernel so you needed to patch the kernel or to rely on Linux distribution to include it. In the event of LKCD, it was difficult to configure it, especially which device to use for dumping.

The first notice about kexec (read what it is useful for and how to use it) in GNU/Linux kernel was in changelog of version 2.6.7. Kexec tool is prerequisite for kdump mechanism. Kdump was firstly mentioned in changelog of version 2.6.13.

How is it working? When the kernel crashed the new so called capture kernel is booted via kexec tool. The memory of previous crashed kernel is leaved intact and the capture kernel is able to capture it. In detail, first kernel needs to reserve some memory for capture kernel. It is used by capture kernel for booting. The consequence is the total system memory is lowered by reserverd memory size.

When the capture kernel is booted, the old memory is captured from the following virtual /proc files:
  • /proc/vmcore - memory content in ELF format
  • /proc/oldmem - really raw memory image!

Next, we will check how to initialize kdump mechanism, how to configure it and how to invoke it for testing purposes.

Thursday, March 12, 2009

Running Linux kexec

The generic form of kexec command looks like
kexec -l kernel_image --initrd=kernel_initrd --append=command_line_options
The command has available many other options but the presented ones are the most important. To start kernel reset, run
kexec -e
How does it work? Linux kernel is placed in memory at defined address offset. On x86 architecture, it begins at 0x100000. Kexec is capable to call and run another kernel in the context of current kernel. It copies the new kernel somewhere into memory, moves it into kernel dynamic memory and finally copies it to the final destination which is the offset and runs it - kernel is exchanged and the reset is performed. An example how to reset running SLES 10.x kernel follows
kversion=`uname -r`
kexec -l /boot/vmlinuz-$kversion --initrd=/boot/initrd-$kversion --append="`cat /proc/cmdline`"
kexec -e
The example for RHEL 5.x is slightly different:
kexec -l /boot/vmlinuz-$kversion --initrd=/boot/initrd-${kversion}.img --append="`cat /proc/cmdline`"

Does it have any drawbacks? As I said, there may be some buggy devices which won't work after kernel reset. Typically, there are troubles with VGAs and their video memory initialization which results in garbled console after reset. The recommendation is to use normal video mode for console. You can change it with vga parameter set to zero and passed as kernel options (e.g. SLES 10 uses video framebuffer by default)
vga=0
Next, the earlier version of kexec had stability issues on any other platform than x86. Today, kexec is uspported on x86, x86_64, ppc64 or ia64.

Tuesday, March 10, 2009

Fast linux reboot with kexec

Kexec is a GNU/Linux kernel feature which allows to perform kernel reboots faster. The time savings around a few minutes are the result of not performing BIOS procedures and hardware reinitialization (each hardware part - like SCSI/FC HBAs - may have own BIOS and POST which takes some amount of time to finish). As we have cold or warm reset we can newly say we have kernel reset.

The GNU/Linux boot process consists of several stages. The hardware stage, firmware stage and bootloader stage are kernel independent and are run in defined order. The hardware stage performs basic hardware tasks such device initialization and testing it. The firmware stage known on PCs as BIOS is in charge of hardware detection. The bootloader can be split into two parts. The first-level bootloader is like master boot record on PCs which calls second-level bootloader which is able to boot Linux kernel. The final stage is kernel stage.

Kexec is smart thing. It can bypass all listed stages up to kernel stage. That means it is able to bypass all the things connected with hardware and jump to the kernel stage directly. The risk is a likely unreliability of untouched devices, typically VGAs or some buggy cards.

What about requirements to try it? The kernel has to be kexec-capable plus you have to have installed kexec tools. It is not problem in today's Linux distributions. Both RHEL 5.x and SLES 10.x contains kexec-tools package which you have to install. Their production kernels are capable of doing kernel resets. On SLES 10, you can check the running kernel configuration for CONFIG_KEXEC variable.
zgrep CONFIG_KEXEC /proc/config.gz


Kexec is controlled with command line program kexec. The command takes defined values for kernel to be booted, its initrd and kernel parameters and starts the kernel reset.

Thursday, January 8, 2009

VMware Server 1.0.8 on openSUSE 11.1

I decided to upgrade my laptop system from almost "prehistoric" openSuSE 10.1 to the newest version 11.1. It was quite successful but I had to resolve an issue with VMware Server 1.0.8 which I am used to using in my work a lot.

The whole configuration process crashed on vmware kernel modules compilation. The kernel version in new openSUSE is 2.6.27.7. As there aren't precompiled modules for it within version 1.0.8 they need to be recompiled at first. Don't forget to have installed kernel-source, make, gcc and patch packages. Secondly, you need to configure installed kernel sources with make cloneconfig to correspond with the running kernel and platform. Finally, configure VMware Server installation. Everything follows here:
zypper in -y kernel-source make gcc patch
cd /usr/src/linux
make mrproper; make cloneconfig
vmware-config.pl
But the last command produces these errors:
Building the vmmon module.
Using 2.6.x kernel build system.
make: Entering directory `/tmp/vmware-config2/vmmon-only'
make -C /lib/modules/2.6.27.7-9-pae/build/include/.. SUBDIRS=$PWD SRCROOT=$PWD/. modules
make[1]: Entering directory `/usr/src/linux-2.6.27.7-9-obj/i386/pae'
make -C ../../../linux-2.6.27.7-9 O=/usr/src/linux-2.6.27.7-9-obj/i386/pae/. modules
CC [M] /tmp/vmware-config2/vmmon-only/linux/driver.o
In file included from /tmp/vmware-config2/vmmon-only/./include/x86.h:20,
from /tmp/vmware-config2/vmmon-only/./include/machine.h:24,
from /tmp/vmware-config2/vmmon-only/linux/driver.h:15,
from /tmp/vmware-config2/vmmon-only/linux/driver.c:49:
/tmp/vmware-config2/vmmon-only/./include/x86apic.h:79:1: warning: "APIC_BASE_MSR" redefined
In file included from include2/asm/fixmap_32.h:29,
from include2/asm/fixmap.h:5,
from include2/asm/apic.h:9,
from include2/asm/smp.h:13,
from /usr/src/linux-2.6.27.7-9/include/linux/smp.h:28,
from /usr/src/linux-2.6.27.7-9/include/linux/topology.h:33,
from /usr/src/linux-2.6.27.7-9/include/linux/mmzone.h:687,
from /usr/src/linux-2.6.27.7-9/include/linux/gfp.h:4,
from /usr/src/linux-2.6.27.7-9/include/linux/kmod.h:22,
from /usr/src/linux-2.6.27.7-9/include/linux/module.h:13,
from /tmp/vmware-config2/vmmon-only/linux/driver.c:12:
include2/asm/apicdef.h:134:1: warning: this is the location of the previous definition
In file included from /tmp/vmware-config2/vmmon-only/./include/machine.h:24,
from /tmp/vmware-config2/vmmon-only/linux/driver.h:15,
from /tmp/vmware-config2/vmmon-only/linux/driver.c:49:
/tmp/vmware-config2/vmmon-only/./include/x86.h:830:1: warning: "PTE_PFN_MASK" redefined
In file included from include2/asm/paravirt.h:7,
from include2/asm/irqflags.h:55,
from /usr/src/linux-2.6.27.7-9/include/linux/irqflags.h:57,
from include2/asm/system.h:11,
from include2/asm/processor.h:17,
from /usr/src/linux-2.6.27.7-9/include/linux/prefetch.h:14,
from /usr/src/linux-2.6.27.7-9/include/linux/list.h:6,
from /usr/src/linux-2.6.27.7-9/include/linux/module.h:9,
from /tmp/vmware-config2/vmmon-only/linux/driver.c:12:
include2/asm/page.h:22:1: warning: this is the location of the previous definition
In file included from /tmp/vmware-config2/vmmon-only/linux/vmhost.h:13,
from /tmp/vmware-config2/vmmon-only/linux/driver.c:71:
/tmp/vmware-config2/vmmon-only/./include/compat_semaphore.h:5:27: error: asm/semaphore.h: No such file or directory
/tmp/vmware-config2/vmmon-only/linux/driver.c:146: error: unknown field 'nopage' specified in initializer
/tmp/vmware-config2/vmmon-only/linux/driver.c:147: warning: initialization from incompatible pointer type
/tmp/vmware-config2/vmmon-only/linux/driver.c:150: error: unknown field 'nopage' specified in initializer
/tmp/vmware-config2/vmmon-only/linux/driver.c:151: warning: initialization from incompatible pointer type
/tmp/vmware-config2/vmmon-only/linux/driver.c: In function 'LinuxDriver_Ioctl':
/tmp/vmware-config2/vmmon-only/linux/driver.c:1670: error: too many arguments to function 'smp_call_function'
make[4]: *** [/tmp/vmware-config2/vmmon-only/linux/driver.o] Error 1
make[3]: *** [_module_/tmp/vmware-config2/vmmon-only] Error 2
make[2]: *** [sub-make] Error 2
make[1]: *** [all] Error 2
make[1]: Leaving directory `/usr/src/linux-2.6.27.7-9-obj/i386/pae'
make: *** [vmmon.ko] Error 2
make: Leaving directory `/tmp/vmware-config2/vmmon-only'
Unable to build the vmmon module.
The compilation of vmmon module crashed because of incompatibility between the kernel version and available vmmon module. The solution is to download updated version of modules vmware-update-2.6.27-5.5.7-2 and update them:
wget http://www.insecure.ws/warehouse/vmware-update-2.6.27-5.5.7-2.tar.gz
tar zxfv vmware-update-2.6.27-5.5.7-2.tar.gz
cd vmware-update-2.6.27-5.5.7-2
./runme.pl
This update updates all required modules and configuration script vmware-config.pl. After that, the compilation of vmmon module is successful and you can finish the configuration. I hope it will help you.

Monday, October 20, 2008

RHEL and Infiniband - advanced diagnostics - part three

Let's decompose the ibnetdiscover output a bit. The first paragraph begins with Switch keyword. The switch has GUID 0x144f00006e9794. The channel adapter begins with Ca keyword. Their GUIDs are 0x3ba0001003de4 (node node2) and 0x3ba0001007ba8 (node node1). The second one corresponds with the node displayed by the ibstat above. You had to notice that there are many numbers in the square brackets. They identify the ends of IB physical connections. Let's inspect them in more detail:
  • connections from switch to IB nodes (switch -> nodes)
    • switch port [6] is connected to the [2] channel of IB node node2
    • switch port [5] is connected to the [1] channel of IB node node2
    • switch port [4] is connected to the [2] channel of IB node node1
    • switch port [3] is connected to the [1] channel of IB node node1
  • connections from IB node node1 to switch (node -> switch)
    • the [1] IB channel is connected to switch port [3]
    • the [2] IB channel is connected to switch port [4]
  • connections from IB node node2 to switch (node -> switch)
    • the [1] IB channel is connected to switch port [5]
    • the [2] IB channel is connected to switch port [6]
Do you understand the logic of it? I think it's simple. And it is evident the IB connections are full-duplex in our scenario.

I'm going to skip the ibnodes command. Its output is the same as without running subnet manager. Next command, the ibroute command, is producing the following nice forwarding table:
Unicast lids [0x0-0x5] of switch Lid 2 guid 0x00144f00006e9794 ():
Lid Out Port Destination Info
0x0001 003 : (Channel Adapter portguid 0x0003ba0001007ba9: 'node1 HCA-1')
0x0002 000 : (Switch portguid 0x00144f00006e9794: '')
0x0003 004 : (Channel Adapter portguid 0x0003ba0001007baa: 'node1 HCA-1')
0x0004 005 : (Channel Adapter portguid 0x0003ba0001003de5: 'node2 HCA-1')
0x0005 006 : (Channel Adapter portguid 0x0003ba0001003de6: 'node2 HCA-1')
5 valid lids dumped
It lists the assigned LIDs, corresponding switch ports and the other ends of the connections. It's classical routing table saying that a LID X is reachable via a switch port Y with an additional information about the entity owning that LID number. For example, the LID 1 is reachable via the switch port 3 and it is the channel adapter of node node1.

To make the final decision if the IB network is working run the ibchecknet command. The output might say that we have 2 working IB HCAs, 3 IB nodes (two with HCA and one switch) and 8 working IB ports (physically only four but the network is full-duplex in our scenario).
# Checking Ca: nodeguid 0x0003ba0001003de4
# Checking Ca: nodeguid 0x0003ba0001007ba8
## Summary: 3 nodes checked, 0 bad nodes found
## 8 ports checked, 0 bad ports found
## 0 ports have errors beyond threshold
From now, we have working Infiniband network and we are able to do this:
  • ibping the nodes natively
  • ping the nodes over ipovib
  • run unmodified network applications over ipoib (e.g. NFS, FTP and so on)
  • run natively RDMA application

Thursday, October 16, 2008

RHEL and Infiniband - advanced diagnostics - part two

It's almost two months ago when I began to write about advanced diagnostics of IB networks. In the end of the article, I suggested to start the IB subnet manager. So let's do it with the init script:
/etc/init.d/opensmd start
Now, we are ready to compare the outputs of the commands when the IB subnet manager wasn't running and when it is running. There should be noticeable differences because the IB network should be fully initialized since now. At first, what new shows us the ibstat command:
CA 'mthca0'
CA type: MT25208 (MT23108 compat mode)
Number of ports: 2
Firmware version: 4.7.400
Hardware version: a0
Node GUID: 0x0003ba0001007ba8
System image GUID: 0x0003ba0001007bab
Port 1:
State: Active
Physical state: LinkUp
Rate: 10
Base lid: 1
LMC: 0
SM lid: 1
Capability mask: 0x02510a6a
Port GUID: 0x0003ba0001007ba9
Port 2:
State: Active
Physical state: LinkUp
Rate: 10
Base lid: 3
LMC: 0
SM lid: 1
Capability mask: 0x02510a68
Port GUID: 0x0003ba0001007baa
The IB subnet manager is responsible for finishing IB hardware initialization phase. Both ports of HCA are in Active state and they have assigned the Base lid which is required for communication over IB network. The IB subnet manager is working because it has assigned SM lid as well. What about other nodes in the network? Let's try the ibnetdiscover command. It should say something more:
vendid=0x144f
devid=0x0
switchguid=0x144f00006e9794
Switch 9 "S-00144f00006e9794" # "" base port 0 lid 2 lmc 0
[6] "H-0003ba0001003de4"[2] # "node2 HCA-1" lid 5
[5] "H-0003ba0001003de4"[1] # "node2 HCA-1" lid 4
[4] "H-0003ba0001007ba8"[2] # "node1 HCA-1" lid 3
[3] "H-0003ba0001007ba8"[1] # "node1 HCA-1" lid 1

vendid=0x3ba
devid=0x6278
sysimgguid=0x3ba0001003de7
caguid=0x3ba0001003de4
Ca 2 "H-0003ba0001003de4" # "node2 HCA-1"
[2] "S-00144f00006e9794"[6] # lid 5 lmc 0 "" lid 2
[1] "S-00144f00006e9794"[5] # lid 4 lmc 0 "" lid 2

vendid=0x3ba
devid=0x6278
sysimgguid=0x3ba0001007bab
caguid=0x3ba0001007ba8
Ca 2 "H-0003ba0001007ba8" # "node1 HCA-1"
[2] "S-00144f00006e9794"[4] # lid 3 lmc 0 "" lid 2
[1] "S-00144f00006e9794"[3] # lid 1 lmc 0 "" lid 2
Do you remember the LIDs number from the uninitialized IB network? There were same zeroes, the HCAs were uninitialized. Now, each channel has an unique LID. Next time, we are going to decompose this output.

Tuesday, October 7, 2008

SLES10 update and SSL certificate problem

Have you ever needed to update some remote SLES10 system from your local update server (e.g. YUP server)? There may be many reasons for such situation. For example, the remote system can have unstable Internet connectivity to connect to the Novell servers or no connectivity at all with ability to see your local update server via VPN network only. You are able to imagine other situations, of course.

Let's suppose our update server is reachable from the remote locality via HTTPS protocol at URL https://update.domain.tld/path/. The update source is of YUM type and we want to update the system with the zypper command. At first, we need to subscribe to the update server. If the update server SSL certificate is subscribed by some well-known certification authority, then you don't have to worry. You can use the following command to add the update server to the update sources:
zypper subscribe https://update.domain.tld/path/update update
But if you generated your own certification authority or self-subscribed server certificate, then you may notice these errors:
Curl error for 'https://update.domain.tld/path/repodata/repomd.xml':
Error code:
Error message: SSL certificate problem, verify that the CA cert is OK. Details:
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
The message is comprehensible and it says that the server certificate is untrusted and can't be verified by the known CA certificates. Simply said, the server certificate is subscribed by your untrusted certificate or it is self-signed. The message only warns you that there may be an attempt of man in the middle attack.

The curl application uses a CA bundle to verify server certificates. The bundle is typically stored in the /usr/share/curl/curl-ca-bundle.crt file. If you want to make your own CA certificate valid, then concat its PEM content to the end of the file like this:
cat ca.crt >> /usr/share/curl/curl-ca-bundle.crt
After this command, everything will begin to work and the update server URL will be added to the update sources.Then, the update may start:
zypper update
I didn't mention that you will have a similar problem if you use the rug command. If I apply the previous steps the rug command will produce an error about SSL certificate verification failure anyway. I suspect that rug doesn't use curl to access the update server. So, does anybody know how to resolve it in case of rug usage?

Wednesday, September 3, 2008

VMware server 1.x and GNOME library issue

If you install VMware server 1.x at your Linux workstation you may encounter the dependency issue between installed VMware libraries and available system libraries like this (lines are broken):
(vmware:30311): libgnomevfs-WARNING **:
Cannot load module `/opt/gnome/lib/gnome-vfs-2.0/modules/libfile.so'
(/usr/lib/vmware/lib/libgcc_s.so.1/libgcc_s.so.1:
version `GCC_4.2.0' not found (required by /usr/lib/libstdc++.so.6))

The above message can be initiated via adding a new virtual disk to the virtual machine or via assigning an ISO image to its virtual CD-ROM. Such operations end with the error displayed in the parent console. The reason why such situation maybe appear is that the installed libraries are compiled with an older GCC compiler than the system libraries. The above error was produced at SLES 10 SP1 distribution which includes GCC 4.1.2. The installed VMware server had version 1.0.6 and in my opinion, it is compiled with GCC 3.x.

The resolution for the problem is to set environment variable VMWARE_USE_SHIPPED_GTK to value "yes", export it and run vmware command after that:
  1. VMWARE_USE_SHIPPED_GTK=yes
  2. export VMWARE_USE_SHIPPED_GTK
  3. vmware &
I recommend to place the variable to your startup script, e.g. to your ~/.profile or ~/.bash_profile.

Thursday, August 14, 2008

Quickly - how to determine the ESX host version?

In my opinion, the easiest way how to find out the ESX host version, is to log in to the service console and use the esxupdate command. The major version can be found in the file /etc/vmware-release. For example, it may contain:
VMware ESX Server 3
So, the major version is 3.x. To determine minor version is a little complicated. Run this command from service console:
esxupdate query
And try to identify patches with the following prefixes from the end of command output (the last one is the right one):
  1. ESX - the minor version should be 0, so we have version 3.0.x
  2. ESX350 - the minor version is 5, the version is 3.5.0
  3. 3.5.0 - initial instalation of version 3.5.0
The corresponding lines may look like these:
3.5.0-64607    16:25:29 05/27/08 ESX 3.0.x to 3.5.0-64607 upgrade
3.5.0-64607 10:42:31 08/06/08 Full bundle of ESX 3.5.0-64607
It remains to identify the update level. Use the same command as above and check the full patch name now:
  1. ESX350-Update01 - we have 3.5.0 Update 1
  2. ESX350-Update02 - we have 3.5.0 Update 2
The corresponding lines are:
ESX350-Update01    16:59:56 05/27/08 ESX Server 3.5.0 Update 1
ESX350-Update02 10:42:38 08/06/08 ESX Server 3.5.0 Update 2
Do you know another way how to reach the version? Beside this, I have found this knowledge base article.

Tuesday, August 12, 2008

VMware ESX 3.5 Update 2 and power on virtual machine bug ?

What a coincidence! Me and my colleague were preparing a server with VMware Infrastructure Update 2 yesterday. Just another simple scenario, we were thinking. We just wanted to run some checks if it is suitable for production usage. Everything went smoothly. But when my colleague began his job, things went worse. His task was easy - install some new virtual machines in prepared environment and check their behaviour (details aren't important now).

When he wanted to power on a prepared virtual machine to begin guest installation, the VMware Infrastructure client displays the following error message:
  • A general system error occurred: Internal error
It is nothing interesting, isn't it? No clue where to go. The virtual machine stays powered off. Any running machine stays running until you suspend it or power it off. So,
  • don't power off or vmotion your virtual machines!!!
Otherwise, you will have a big trouble! Naturally, I checked the virtual machine vmware.log log file. I found here a "virtual" reason why such strange behaviour:
  • [msg.License.product.expired] This product has expired.
  • Be sure that your host machine's date and time are set correctly.
  • There is a more recent version available at ...
Hm, the right time is often critical part of any installation but my habit is to configure NTP server where it is possible. I done it in this scenario as well. I checked the server time and nothing was wrong. The ESX host is licenced properly. When I try to search the VMware knowledge base, there is no answer at all. And you need to have a luck today because it is really really overloaded by others. Maybe, they are deailing with the same problem.

Because my colleague needed to work, my last try was to move the time backwards. I installed the server two days ago and everything was working fine. It was August 10. My colleague began working this morning and since this time, any virtual machine couldn't be powered on. Today, it is August 12. So I moved the time here and it helped! Of course, the NTP server has to be shutdown. You can do it via VI client or you can log in to the service console and use date -s command. I'm not aware of other working solution now.

This afternoon, I found the first article about this annoying bug at www.virtualization.info. It is written here the VMware knowledge base seems to collapsed due to this issue and the support team is promising solution in 36 hours. By the way, Update 2 can't be downloaded now.


Wednesday, July 30, 2008

RHEL and Infiniband - advanced diagnostics - part one

I will continue from the point where I finished last time. The remaining diagnostics tools depend on sysfs interface. The provided information is extracted from this filesystem. If you don't remember the meaning of each entry under the /sys/class/infiniband directory use these tools.

The IB subnet manager is not running is one of the IB network issues. The IB nodes don't have assigned any LIDs and they aren't able to see each other. The node or his IB ports are connected but they aren't initialized yet. To find out this without sysfs use the ibstat command:

CA 'mthca0'
CA type: MT25208 (MT23108 compat mode)
Number of ports: 2
Firmware version: 4.7.400
Hardware version: a0
Node GUID: 0x0003ba0001007ba8
System image GUID: 0x0003ba0001007bab
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510a68
Port GUID: 0x0003ba0001007ba9
Port 2:
State: Initializing
Physical state: LinkUp
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510a68
Port GUID: 0x0003ba0001007baa


The output contains everything what we need - port state, LID, GUID, rate. The IB link is up but the ports are in the INIT state. No IB subnet manager is running. It is clear because the Sm lid parameter has zero value. It should have LID value of the node which acts like IB subnet manager. The same holds for Base lid. The zero value means that the IB network isn't initialized yet. The similar information will be provided by the ibnetdiscover command:

vendid=0x144f
devid=0x0
switchguid=0x144f00006e9794
Switch 9 "S-00144f00006e9794" # "" base port 0 lid 2 lmc 0
[6] "H-0003ba0001003de4"[2] # "node2 HCA-1" lid 0
[5] "H-0003ba0001003de4"[1] # "node2 HCA-1" lid 0
[4] "H-0003ba0001007ba8"[2] # "node1 HCA-1" lid 0
[3] "H-0003ba0001007ba8"[1] # "node1 HCA-1" lid 0

vendid=0x3ba
devid=0x6278
sysimgguid=0x3ba0001003de7
caguid=0x3ba0001003de4
Ca 2 "H-0003ba0001003de4" # "node2 HCA-1"
[2] "S-00144f00006e9794"[6] # lid 0 lmc 0 "" lid 2
[1] "S-00144f00006e9794"[5] # lid 0 lmc 0 "" lid 2

vendid=0x3ba
devid=0x6278
sysimgguid=0x3ba0001007bab
caguid=0x3ba0001007ba8
Ca 2 "H-0003ba0001007ba8" # "node1 HCA-1"
[2] "S-00144f00006e9794"[4] # lid 0 lmc 0 "" lid 2
[1] "S-00144f00006e9794"[3] # lid 0 lmc 0 "" lid 2

The square brackets contain the physical port number at the switch. As we can see, the network is up and discoverable. It consists of one IB switch and two IB nodes, each with dual ported IB HCA. The switch has assigned LID 2, the nodes aren't initialized yet. To display node GUIDs only, use the ibnodes command:

Ca : 0x0003ba0001003de4 ports 2 "node2 HCA-1"
Ca : 0x0003ba0001007ba8 ports 2 "node1 HCA-1"
Switch : 0x00144f00006e9794 ports 9 "" base port 0 lid 2 lmc 0

Finally, two commands remained - ibroute and ibchecknet. As the IB network is not fully initialized the nodes can't contact the switch for forwarding table. So the ibroute command isn't working otherwise it is helpful. The ibchecknet command produces address resolution errors, the IB network is not valid:

lid 2 address resolution: FAILED
# Switch: nodeguid 0x00144f00006e9794 failed

# Checking Ca: nodeguid 0x0003ba0001003de4
lid 0 address resolution: FAILED
# Ca: nodeguid 0x0003ba0001003de4 failed

# Checking Ca: nodeguid 0x0003ba0001007ba8
lid 0 address resolution: FAILED
# Ca: nodeguid 0x0003ba0001007ba8 failed

## Summary: 3 nodes checked, 3 bad nodes found
## 8 ports checked, 0 bad ports found
## 0 ports have errors beyond threshold

In the beginning, I stated the IB subnet manager is not running. Let's launch it with /etc/init.d/opensmd script and we will see how the behaviour of the tools will change.

Friday, July 25, 2008

RHEL and Infiniband - basic diagnostics

I am going to close the article series about Infiniband technology on RHEL platform (check the previous posts 1, 2, 3) with posts intended to the IB troubleshooting. I would like to introduce a basic diagnostic steps of IB environment which may help you to uncover errors and misconfiguration.

The most of troubles you may meet with are traceable via OFED diagnostics tools. They are part of openib-diags package until OFED 1.2. Since version 1.3, it is replaced with infiniband-diags package. Let's take a look at the most useful ones:
  1. ibstat - shows IB device status like firmware version, ports state, their rate, GUIDs, LIDs ...
  2. ibnetdiscover - discovers IB network topology
  3. ibroute - queries for IB switch forwarding table (like routing table)
  4. ibnodes - shows IB nodes in topology
  5. ibchecknet - runs IB network validation
  6. ibping - ping IB address
  7. sysfs - Linux virtual filesystem representing kernel structures, for IB is there directory /sys/class/infiniband
The IB network is similar to the other high performance network technologies like Fibre Channel. The most of troubles with IB are in common. You may need to resolve connectivity issues, firmware or higher level software revisions incompatibilities, driver bugs and similar.

At first, I would like to explain the usage of last two tools - ibping and sysfs. They are simple enough and known from other fields. The IB ping works in client-server fashion. That means you need to run ibping in server mode at one side and another side will act as a client. The server is ponging to the client's pings.
  1. Server mode - ibping -S -v
  2. Client mode - ibping -v SERVER_LID_ADDR
The -v argument increases verbosity level only. The right LID address can be found with ibnetdiscover command. Run it, find the server node line and use the associated LID now. I will explain it later. If the IB network is healthy ibping should produce the output at the server side like this (the server LID is 4, his hostname is node2):

ibwarn: [6795] ibping_serv: starting to serve...
ibwarn: [6795] ibping_serv: Pong: node2.(none)

The pongs have to be visible at the client side:

ibwarn: [17946] ibping: Ping..

Pong from node2.(none) (Lid 4): time 0.235 ms

If you aren't able to see them you should check the connectivity status of your IB HCA. One method to do it is via sysfs. Each IB HCA is represented with a subdirectory under the /sys/class/infiniband directory where you can find a lof of useful stuff. For example, if you have dual ported HCA from Mellanox then there should be the following entries for port states:
  1. /sys/class/infiniband/mthca0/ports/0/state
  2. /sys/class/infiniband/mthca0/ports/1/state
The state can have three predefined values with these meanings:
  1. DOWN - port is physically disconnected
  2. INIT - port is connected and it is initialized
  3. ACTIVE - port is online and it is working
If ibping has to work the ports of both nodes have to be in ACTIVE state. If they are in INIT state then the subnet manager may be not running. The DOWN state simply means cable problem. By the way, there are other methods to achieve this with help of remaining tools. I am going to explore them next time.

Tuesday, July 22, 2008

Quickly - /dev/vcs and /dev/tty magic on Linux

Have you ever wanted to check the content of the first virtual console without switching to it with "Ctrl+Alt+F1" shortcut from your desktop session? Or the second console of a remote server? Or would you like to send something to the user who is working at the third virtual console (not via wall command)?

The GNU/Linux kernel provides two character devices for such tasks:
  • /dev/ttyX - represents X. virtual console
  • /dev/vcsX - represents X. virtual console text contents
So, to answer the questions use these commands:
  1. cat /dev/vcs1
  2. ssh root@server 'cat /dev/vcs2'
  3. echo "something" > /dev/tty3
More information about Linux allocated devices is written in /usr/src/linux/Documentation/devices.txt. You have to have GNU/Linux sources installed.

Friday, July 18, 2008

RHEL and Infiniband - basic usage

As I written in the previous post, the /etc/init.d/openibd init script is in charge of starting Infiniband (IB) network. The script parses the /etc/ofed/openibd.conf configuration file where you can specify which ULPs should be initialized. By default, all ULPs I mentioned last time - ipoib, srp, sdp - are enabled.

The opensm IB network manager is controlled with the /etc/init.d/opensmd init script which is configurable via /etc/ofed/opensm.conf configuration file. You can turn on debugging here but it is not normally needed. It is more useful to enable verbose mode which increases the log verbosity level. The default log file is /var/log/osm.log. So, if something goes wrong enable verbose mode and check the log file.

After executing the init scripts, you should check the IB network state. The openibd script is started automatically during the system startup, while the opensm has to be enabled (with ntsysv or chkconfig). Follow this checklist:
  1. Is Mellanox HCA recognized?
    • check the output of lsmod | grep ib_mthca
    • check the output of dmesg
  2. Are appropriate ULPs loaded?
    • check the output of lsmod | grep ib_
      • should contain ib_ipoib, ib_srp, ib_sdp
  3. Is IB network initialized and working?
    • check the output of cat /sys/class/infiniband/mthca0/ports/X/state
      • should be ACTIVE
  4. Is ib0 network interface available?
    • check the output of ifconfig -a
If you passed all the checks you would be able to use IP protocol over IB network. I supposed you have two IB nodes in the IB network at least, both are configured the same way and both have passed the checks (like in the first article). To configure it follow the commands:
  1. assign an IP address to the nodes
    • run ifconfig ib0 IP_ADDR1 up at first node
    • run ifconfig ib0 IP_ADDR2 up at second node
  2. check the IPoIB functionality
    • run ping IP_ADDR2 from the first node
    • run ping IP_ADDR1 from the second node
So, wasn't it simple? If everything is working the ping should receive replies from the other side. Now, you can run any IP based application over IB - FTP, NFS and so on and utilize its benefits like high throughput and low latencies. Please, if you are interested in the topic leave me a comment.

Tuesday, July 15, 2008

Quickly - RPM uninstall and scriptlet failure

Sometimes it happens that I'm not able to uninstall a RPM package because of some internal SPEC file errors related to the scriptlets. Last time it happened when I was uninstalling the HP OpenView Storage Data Protector packages from a RHEL server. By mistake, I uninstalled one package which was a dependency of another package and after that I wasn't able to uninstall it due to that dependency and due to it wasn't checked correctly. The whole uninstall procedure looked like this:
  1. rpm -e OB2-CORE-A.06.00-1
  2. rpm -e OB2-DA-A.06.00-1
And the produced error follows:
  • ERROR: Cannot find /opt/omni//bin/omnicc
  • error: %preun(OB2-DA-A.06.00-1.x86_64) scriptlet failed, exit status 3
So, is there a way how to get rid of such a package? Yes, it is and it is simple, just disable executing the scriptlets like this:
  1. rpm -e --noscripts OB2-DA-A.06.00-1
I think it is pretty simple feature of RPM but it is a bit difficult to remember it.