SERR/PERR errors on IBM’s System x3650

After applying all the latest firmware updates to a newly delivered IBM System x3650, I installed Windows Server 2003 R2. The machine worked fine at first, but crashed mysteriously about 3 hours into operation, logging a RAID failure in the RSA.

Looking further through the RSA error logs, I found this error occurring multiple times:

Unknown SERR/PERR detected on PCI bus Chassis#=NA Slot#=0 Bus#=0 Dev.ID=0x25e3 Vend.ID=0x8086 Status=0x0 DevFun#=0xff

I called IBM support, and they told me that I should power cycle the machine after a firmware update. I did that and then continued to set up the machine. It has been working flawlessly under heavy load for the past 3 months.

I’m going to remember this for the future: after a firmware upgrade on a server, do a power cycle.

Disk IO performance is dependent on the number of disk arms

If you already know where I’m going with this after reading the subject, you can stop reading now.

As hard disks get bigger and bigger, servers in small business environments are usually set up with too few disk arms to satisfy performance needs.

The problem is quite simple: a standard 36GB 2.5″ SAS hard drive can read data at rate x, and can do y IOPS.

A standard 72GB 2.5″ SAS hard drive can read data at rate x.1 (or similar), and can still only do y IOPS.

As you can see, disks get bigger, but they do not really get faster. If you need more IOPS, you need more disks.

If you have a legacy system with a considerable number of disk arms (more than 10), each with 4 or 8GB capacity, and migrate this setup to a new system with a RAID1 over two 147GB disks, you will get _WORSE_ performance than on the old system.

And if we look at consumer hard drives, with 750GB or 1TB per disk, the IOPS available per gigabyte get even worse.
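
To make the arithmetic concrete, here is a rough sketch. The figure of 150 IOPS per disk is purely an assumption for illustration (the post leaves it as y on purpose), and the model ignores controller caches and RAID write penalties. It simply multiplies the number of arms by the IOPS per arm, which is the best case for random reads.

```sh
#!/bin/sh
# Rough aggregate random-read IOPS: number of disk arms times IOPS per arm.
# Deliberately ignores controller cache and RAID write penalties.
aggregate_iops() {
    arms=$1
    iops_per_arm=$2
    echo $((arms * iops_per_arm))
}

# Hypothetical value for "y"; real drives vary.
Y=150

echo "old system, 10 arms: $(aggregate_iops 10 "$Y") IOPS"
echo "new RAID1, 2 arms:   $(aggregate_iops 2 "$Y") IOPS"
```

Whatever the real value of y is, the ratio stays the same: two arms deliver a fifth of what ten arms do.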

This is usually not a problem in more professional environments, where systems are purchased according to requirements, but in small businesses systems are usually purchased according to the money available for them.

Never forget about the need for disk arms.

i5 520/515/525 and the fan problem

You just got a new 520/515/525, and the first thing you’re seeing after turning the machine on is

1100 7611 or
1100 7621 or
1100 7631

Of course, this SRC is listed in the Hardware SRC list:

Fan missing error

A problem was detected with a Fan which can be caused by a Fan not being installed. Install Fan if missing, replace if already installed.

But when you open the lid, the fan is spinning, and only the green light is lit. Cycling the power on the machine does not resolve this problem.

So there’s only one thing left to do. Turn the machine off, remove the front cover, remove the tape/cd tray, remove the fan tray, remove and replace all fans, place all the stuff back into the machine, and then boot it.

The attention light will still be lit afterwards, so you will need to reset the error flag in the Service Action Log in SST, and also close all problems in WRKPRB.

Buying tape drives for small businesses

Backups are very important, and the media and technology used for them are even more important. While disk-to-disk is the best form of backup for home users, tapes still make a lot of sense in companies, because they make it a lot easier to get something off-site for disaster recovery.

The tape technologies available for x86 servers are numerous. Choosing a tape drive, on the other hand, is as easy as it gets: the more expensive they are, the better they are.

I only recommend one type of tape to customers: LTO. LTO tapes and drives are among the fastest and most reliable on the market. They are more expensive than cheaper alternatives like the VXA drives, but they are trouble free, which can’t be said about VXA drives.

LTO2 (200GB) half-height external drives can be had for about 2’500 CHF from Tandberg. Bought directly from HP/IBM, they are a bit more expensive, about 3’000-3’500 CHF. Do not buy internal tape drives that are not from your server manufacturer, as this could cause trouble down the road.

LTO3 drives are a bit more expensive, but pack 400GB instead of 200GB. If that’s still not enough, you should consider purchasing a small tape library: LTO2 libraries with 8 tapes can be had from about 8’000 CHF, which is quite a bargain.

Remember, you can’t extend the capacity of your tape drive unless you have a library. So if you are looking at an LTO2 drive but need more than 200GB of storage, you should buy LTO3. If you think you need more than 400GB of storage, buy an LTO3 tape library.
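
The rule of thumb above is simple enough to write down as a tiny helper. This is only a sketch: the function name is made up, and the thresholds are the 200GB and 400GB capacities quoted in this post.

```sh
#!/bin/sh
# Suggest a tape setup for a given backup size in GB,
# using the 200GB (LTO2) and 400GB (LTO3) capacities from the text.
recommend_tape() {
    need_gb=$1
    if [ "$need_gb" -le 200 ]; then
        echo "LTO2 drive"
    elif [ "$need_gb" -le 400 ]; then
        echo "LTO3 drive"
    else
        echo "LTO3 tape library"
    fi
}

recommend_tape 350
```

When sizing, it is safer to feed in the amount of data you expect to have over the life of the drive, not what you have today.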

I’ve had experience with DLT (which is usually too small), VXA (unreliable), and 4mm tapes (unreliable). What I’ve never worked with are Sony’s AIT tape drives. I would be interested to hear about experiences with those drives.

Windows XP supports 4GB of RAM, period.

Many people say that Windows XP doesn’t fully support 4GB of RAM. That’s not true, because Windows XP supports exactly 4GB of usable RAM, by using PAE.

If you can’t use 4GB of RAM, and have PAE enabled, you have bought hardware that doesn’t support 4GB of RAM. There’s a KB entry, which details some of the problems. If you want to use the full 4GB, buy better hardware.

Redundant equipment has to be monitored

While this might sound pretty obvious, I’ve seen it neglected more than once.

The problem with redundant equipment is that nobody notices when it fails. Okay, this is pretty much the point of having redundant equipment, but if nobody replaces the failed component, you’ve just lost that redundancy.

Better servers with multiple fans, power supplies, etc. usually offer integrated diagnostics with an audible alert, which is usually enough for a small business (running MOM on your only server has limited usefulness). But smaller machines, lacking any redundant PSUs or fans, usually don’t have embedded diagnostics. These won’t sound any audible alert when a disk in a RAID set fails.

On IBM servers with ServeRAID adapters, you can install the ServeRAID management program from the ServeRAID application CD (not the drivers CD, there are two of them). The ServeRAID management program is backward compatible with almost all ServeRAID controllers, as long as you have the IBM driver installed (for the 7e or similar controllers, there is also an Adaptec driver which works fine, but ServeRAID management doesn’t recognize it).

ServeRAID management can be configured to send mails automatically in case of a disk failure.

Booting PXElinux from RIS

You already have RIS/WDS set up to do network installations of Windows? But now you have some Linux machines that you want to set up using your already implemented Windows Server?

It’s quite easy to do:

Place the following text in X:\RemoteInstall\Setup\German\Tools\PXELinux\i386\templates

[OSChooser]
Description = "PXElinux"
Help = "Linux Network Boot Loader - Currently loads a Debian 4.0 Installer"
LaunchFile = "Setup\German\Tools\PXElinux\i386\templates\pxelinux.0"
Version = "1.00"
ImageType=Flat

You can get all the necessary Debian files from netboot.tar.gz. You generally do not need to edit any files in there, just extract them into the templates directory.

Virtual Server 2005 R2, Windows Server 2003 and Broadcom NetExtreme II cards

Interesting issues with Microsoft’s Virtual Server.

A new IBM x3650 with two Broadcom NetExtreme II cards had been running Windows Server 2003 flawlessly for a few months.

After installing Virtual Server 2005, everything went haywire. Some machines were still able to contact the server, some were not. It looked like something was horribly broken, and at first I had no idea how something like this could happen.

After searching the web, I found a few references to this and similar problems with newer NICs and Virtual Server.

The Broadcom NetExtreme II cards seem to have a specific problem with Virtual Server 2005 that is related to IPMI. There is a fix available from Broadcom:

IPMI disabling tool [Mirror]

Applying it causes just a short network interruption; no restart is necessary.

But there are other problems with modern network cards and Virtual Server 2005 (and possibly VMware’s offerings, but I don’t know about those).

There’s a KB entry which talks about disabling checksum/segmentation offloading when using Virtual Server 2005.

Creating simple graphs using rrdtool

[Figure: rrdtool graph for temperatures]

As discussed in the previous post, you can gather temperature data from RSA II or iLO cards using SNMP quite easily.

While the data itself can be good enough to make a decision, executives in a company always like nice diagrams. So my first try was to load the CSV-like datafile generated using said script into Excel, and make a diagram out of it. But Excel is restricted to 255 parameters per axis, which was severely limiting.

I’ve been using Cacti for quite some time, but wasn’t willing to implement it here because we’re mostly a Windows shop, and my plan was to integrate the Linux boxes into Operations Manager 2007. Cacti uses Tobi Oetiker’s rrdtool to create the graphs.

Creating graphs using rrdtool is quite easy, actually. I wrote a simple script that handled this:

makerrd

Creates the appropriate rrd file. Replace the unix timestamp as appropriate. The last value on the RRA lines is the number of values saved into the data file.

#!/bin/sh
# Create an RRD with one GAUGE data source sampled every 300 seconds.
rrdtool create test.rrd           \
           --start 1176465000     \
           -s 300                 \
           DS:temp:GAUGE:600:U:U  \
           RRA:AVERAGE:0:1:5000
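
How far back such an archive reaches follows directly from the numbers on the create command: step length times steps per row times number of rows. A small sketch of that arithmetic (the function name is made up for illustration):

```sh
#!/bin/sh
# Retention of an RRA in seconds: step * steps-per-row * rows.
rra_retention_seconds() {
    step=$1
    steps_per_row=$2
    rows=$3
    echo $((step * steps_per_row * rows))
}

# Values from the rrdtool create call: -s 300 and RRA:AVERAGE:0:1:5000
secs=$(rra_retention_seconds 300 1 5000)
echo "retention: ${secs} s, about $((secs / 86400)) days"
```

With a 300 second step and 5000 rows, the file holds a bit more than 17 days of data before the oldest values are overwritten. Increase the last RRA value if you need a longer history.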

inputrrd

Loads the data from the simple CSV-like file into the RRD file. The more elegant approach would be to load the data directly from SNMP into the rrd database, but I’m no programmer.

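#!/bin/zsh
# Read "timestamp;temperature" lines and feed them into the RRD file.
while IFS=';' read -r timestamp temp ; do
        # Strip the fractional part of the temperature.
        temp=${temp%%.*}
        if ! rrdtool update test.rrd "${timestamp}:${temp}" ; then
                echo "rrdtool update failed for ${timestamp}" >&2
        fi
done < machine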
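
For the more direct approach, a sketch could look like the following. It is untested against a real RSA II card: it assumes snmpget and rrdtool are in the PATH, reuses the snmpget options from the polling script elsewhere in this post, and takes the host and OID as parameters. The function names are made up, and the example input format in the comment is an assumption. `N` is rrdtool's shorthand for "now" in an update.

```sh
#!/bin/sh
# Strip quotes, the unit, spaces, and the fractional part from a
# temperature string (assumed to look like "23.50 Centigrade").
clean_temp() {
    printf '%s\n' "$1" | sed 's/"//g;s/Centigrade//;s/ //g;s/\..*//'
}

# Poll the given host/OID once and store the value in test.rrd.
# $1 = host name of the management card, $2 = temperature OID
poll_once() {
    raw=$(snmpget -Onqv -c public -v 1 "$1" "$2") || return 1
    rrdtool update test.rrd "N:$(clean_temp "$raw")"
}
```

Calling `poll_once` with your host and OID every 300 seconds (from cron or a loop) matches the step size of the RRD file and makes the intermediate CSV file unnecessary.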

makegraph

Creates a graph from the data in the rrd file. The HRULE lines draw horizontal lines for the thresholds, in this case 35C and 30C.

#!/bin/sh
# Plot the temperature with horizontal threshold lines at 35 and 30.
rrdtool graph temp.png                         \
        --start 1176465385 --end "$(date +%s)" \
        DEF:mytemp=test.rrd:temp:AVERAGE       \
        LINE2:mytemp#0000FF                    \
        HRULE:35#FF0000                        \
        HRULE:30#FFA500

See the created graph to the right. Of course, rrdtool has many more options and can create much nicer graphs.

Do I need AC?

Another SMB topic, as most enterprises are obviously capable of doing this by the book.

Summer seems to be starting, with the days here getting warmer and warmer. A particular problem that seems to crop up every summer is servers shutting down or failing due to excessive temperatures. The tolerance of these machines to high temperatures is actually quite low, even with redundant fans installed.

Most Small Businesses actually don’t follow any kind of strategy when choosing a place for servers, and usually try to ignore the AC problem – this works quite well when the new systems are installed during cold times.

While it might be possible to operate a server room without AC, this only works in rather rare circumstances:

  • No windows, or very small windows
  • Room is in direct sunlight for only a short part of the day
  • A very small number of machines installed in the room (one or two)

So, in general you will need an AC. But what are acceptable temperatures in a server room? The ideal would be 22C during the entire year. But it’s possible to run a server in a somewhat hotter environment. The exact limits depend on the server itself. Keep in mind that there is other temperature sensitive equipment in the room: tape drives, UPSs, etc.

Start reviewing the spec sheets of your server to see what is acceptable. Here is an example for an IBM System x3650:

  • Air temperature:
    • Server on:
      10° to 35°C (50.0° to 95.0°F); altitude: 0 to 914.4 m (3000 ft). Decrease system temperature by 0.75°C for every 1000-foot increase in altitude.
    • Server off:
      10° to 43°C (50.0° to 109.4°F); maximum altitude: 2133 m (7000 ft)
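
If your server room is at altitude, the derating rule in the spec can be worked out with a few lines of shell. This sketch assumes that the 0.75°C per 1000 ft reduction applies to the 35°C upper limit for every 1000 ft above 3000 ft. That is my reading of the spec sheet, not a formula published by IBM, and the function name is made up.

```sh
#!/bin/sh
# Maximum operating temperature in degrees C for an altitude in feet,
# ASSUMING the 0.75 C per 1000 ft derating applies to the 35 C limit
# above 3000 ft (an interpretation of the spec sheet).
max_operating_temp_c() {
    awk -v alt_ft="$1" 'BEGIN {
        max = 35
        if (alt_ft > 3000) max -= 0.75 * (alt_ft - 3000) / 1000
        printf "%.2f\n", max
    }'
}

max_operating_temp_c 3000
max_operating_temp_c 7000
```

Under that assumption, a server at 7000 ft should not see more than 32°C intake air.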

As you can see, the maximum temperature during operation is 35C. With outside temperatures reaching this level during summer, an AC is almost always necessary. A UPS like the Powerware 9125 is specified to work from 0C to 40C. This is a bit more generous than the x3650, but it’s still easy to get up to 40C with several servers in a room.

The best way to figure out whether you need an AC is to monitor your server. If you are using IBM Director or HP Insight Manager, these tools already have this functionality integrated. I personally don’t like these two products (and they’re usually overdesigned for fewer than 10 servers). If you have an iLO or RSA II card in your server, you can use SNMP to get the temperature, write it to a file, and create a graph from this later.

I wrote a quick and dirty script to do this. It runs on Linux, but the same could easily be implemented in PowerShell or VB.

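#!/bin/sh
# Append "timestamp;temperature" to the log every 5 minutes.
while true ; do
        printf '%s;' "$(date +%s)" >> ~/tempmon/machine
        # Query the RSA II temperature and strip quotes, unit, and spaces.
        snmpget -Onqv -c public -v 1 \
                machine.rsa.int.dataline.ch \
                SNMPv2-SMI::enterprises.2.3.51.1.2.1.5.1.0 |
                sed 's/"//g;s/Centigrade//;s/ //g' >> ~/tempmon/machine
        sleep 300
done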

Ugly? Yes. But it works fine. You can later load this “CSV” into Excel, and create appropriate graphs from the data. And get management to buy the AC before your servers die a fiery death. If you want to monitor this long term, you could integrate the appropriate values into Cacti quite easily.
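
If you just want a quick look at the numbers before firing up Excel, a few lines of awk will do. This is a sketch that assumes the log has the "timestamp;temperature" layout written by the script above. The function name is made up.

```sh
#!/bin/sh
# Print min, max and average of the temperature column in a
# "timestamp;temperature" log such as ~/tempmon/machine.
summarize_temps() {
    awk -F';' '
        NF >= 2 && $2 != "" {
            t = $2 + 0
            if (n == 0 || t < min) min = t
            if (n == 0 || t > max) max = t
            sum += t
            n++
        }
        END {
            if (n > 0) printf "min=%.1f max=%.1f avg=%.1f\n", min, max, sum / n
        }
    ' "$1"
}
```

Call it as `summarize_temps ~/tempmon/machine`. A maximum that creeps towards the limit in your server's spec sheet is usually all the argument management needs.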

A side note about ACs from my personal experience: the same rule applies as for servers, you get what you pay for. Buying self-install ACs from Fust, MediaMarkt or a similar chain won’t do you much good. Get a decent two-component AC, and have it installed by a professional. This also avoids damage to the building. Also, let a professional size the system: provide him with the maximum heat output of your servers (measured in BTU), and then double that value just to be sure. Network managed ACs are usually not available for Small Business-acceptable pricing,