Bare Metal Linux Restore

Several weeks ago we started seeing some pretty scary errors showing up on the main system disk for our Blackboard server. We had an extra server hanging around, so we decided to move all the data off the failing disk and onto our spare server. The only question was how to make the new server as close to a perfect copy of the old one as possible. Simply restoring all the filesystems failed for a variety of reasons, mostly related to GRUB and the kernel, so I had to find a way of excluding only the files and directories that were tied to the specific model of server.

To do this, I started by installing a minimal copy of RHEL 4, making sure to lay the filesystems out in exactly the same way as they were on the old server. I then went through several experiments, leaving just the bare minimum files and directories required for the hardware and booting, but formatting all other filesystems and restoring the data from our old server. In the end, the process below resulted in a system that worked perfectly and very closely mirrored the original server.

Installing APC on CentOS

Casey needed me to install APC cache for the Scriblio project. It’s a PECL module, and pecl install apc gives an error. Here are some great instructions for getting it all to work.

RMAN 10G NFS Mount Options

We back up our Oracle databases using RMAN and then write the backup pieces out to an NFS share. This has always worked well, but RMAN started complaining that the NFS share was not mounted with the correct options when we upgraded to Oracle 10G. After some poking around in the docs I finally came up with a set of mount options that work.

Vfstab entry on a Solaris 8 box:
nfsserver.domain.com:/path/to/remote/mountpoint - /local-mountpoint nfs - yes rw,bg,intr,hard,timeo=600,wsize=32768,rsize=32768
Manual mount on a Solaris 8 box:
mount -o rw,bg,intr,hard,timeo=600,wsize=32768,rsize=32768 nfsserver.domain.com:/path/to/remote/mountpoint /local-mountpoint

According to the docs, the options on a Linux box are pretty much the same, except you would add the following:
nfsvers=3,tcp

Creating Linux Partitions for CLARiiON

Creating a properly offset slab of disk for Linux systems on your CLARiiON is not just a matter of creating a partition using the default fdisk values. Disk management utilities for Intel-based systems generally write 63 sectors of metadata at the very beginning of the LUN, and the addressable space begins immediately after them. Because that starting point does not line up with the stripe element size (usually 64 KB), I/O ends up crossing disks on the CLARiiON, especially larger writes.

To get around this, you have to align the partition so that it starts writing data on a sector that meshes nicely with the stripe element size. In this case that is sector 128, since 128 sectors of 512 bytes is exactly 64 KB. Below is an example of how I create partitions on our CLARiiON for Linux systems. Check out the EMC Best Practices for Fibre Channel Storage white paper for more detail.

/sbin/fdisk /dev/emcpowera
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.

The number of cylinders for this disk is set to 39162.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-39162, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-39162, default 39162):
Using default value 39162

Command (m for help): x

Expert command (m for help): b
Partition number (1-4): 1
New beginning of data (63-629137529, default 63): 128

Expert command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
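
The choice of 128 over the default 63 comes down to simple arithmetic. The sketch below assumes 512-byte sectors and the 64 KB stripe element size mentioned above; is_aligned is just a helper name made up for this example, not part of fdisk or any EMC tool.

```shell
#!/bin/sh
# Assumed values: 512-byte sectors and a 64 KB stripe element.
SECTOR_BYTES=512
ELEMENT_BYTES=65536

# is_aligned START_SECTOR (hypothetical helper)
# Succeeds when the first data sector falls on a stripe element boundary.
is_aligned() {
    [ $(( $1 * SECTOR_BYTES % ELEMENT_BYTES )) -eq 0 ]
}

# Compare the fdisk default (63) with the value used in the session above (128).
for start in 63 128; do
    if is_aligned "$start"; then
        echo "sector $start: byte offset $(( start * SECTOR_BYTES )), aligned"
    else
        echo "sector $start: byte offset $(( start * SECTOR_BYTES )), not aligned"
    fi
done
```

If your stripe element size is something other than 64 KB, change ELEMENT_BYTES and pick a starting sector that passes the check.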

X11 Forwarding Broken on Solaris

If you’re running Solaris 8 or 9 and an upgrade results in broken SSH X11 forwarding, the problem may be Sun’s sockfs bug. The symptom will be SSH’s failure to set the $DISPLAY variable and an error in your system log looking something like this:

Jun 3 09:40:24 servername sshd[26432]: [ID 800057 auth.error] error: Failed to allocate internet-domain X11 display socket.

To fix this, you can either install Sun’s latest sockfs patch for your version of the OS, or simply force sshd into IPv4 mode by doing the following:

Edit your sshd_config file, adding the following:

# IPv4 only
ListenAddress 0.0.0.0

Edit your sshd startup script to pass “-4” to sshd on start:

case "$1" in
'start')
echo 'starting ssh daemon'
/usr/local/sbin/sshd -4
;;

Restart sshd, and that should pretty much do it… Enjoy.

VMware ESX 3.5 ntpdate strangeness

We just noticed that the time was very far off on our sparkly new VMware ESX 3.5 server. When I went to run ntpdate to bring it back into sync, I was surprised to find that it could not make a connection to the time server because outbound UDP 123 traffic was blocked by the internal firewall.

Here is what I got:
/usr/sbin/ntpdate -u time.nist.gov
9 Apr 03:47:53 ntpdate[20245]: sendto(192.43.244.18): Operation not permitted
9 Apr 03:47:54 ntpdate[20245]: sendto(192.43.244.18): Operation not permitted
9 Apr 03:47:55 ntpdate[20245]: sendto(192.43.244.18): Operation not permitted
9 Apr 03:47:56 ntpdate[20245]: sendto(192.43.244.18): Operation not permitted
9 Apr 03:47:57 ntpdate[20245]: no server suitable for synchronization found

Normally I would just add a rule to the “/etc/sysconfig/iptables” file to allow traffic out on this port, but VMware ESX server does not use iptables… It uses its own firewall, so I had to figure out how to change it. Happily, it turns out that there is a handy “esxcfg-firewall” command built just for such things.

Running this:
/usr/sbin/esxcfg-firewall -q | grep 123

12300 1803K valid-tcp-flags  tcp  --  *   *     0.0.0.0/0        0.0.0.0/0

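The only match was a packet counter that happened to contain “123”, not a rule for the port, which confirmed that outbound UDP port 123 was not allowed.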

Running this opened it up:
/usr/sbin/esxcfg-firewall -e ntpClient

Grep out “123” again just to be sure:
/usr/sbin/esxcfg-firewall -q | grep 123

1  76 ACCEPT  udp  --  *    *    0.0.0.0/0      0.0.0.0/0     udp dpt:123

And you can now run ntpdate to sync up the time:
/usr/sbin/ntpdate -u time.nist.gov
9 Apr 09:52:54 ntpdate[20319]: step time server 192.43.244.18 offset 21689.039217 sec

RHEL System Configuration Changes for Oracle 10G

Below is a list of RHEL system configuration changes that Oracle 10G requires before it is installed.

First, check the following kernel parameters using the commands below:

/sbin/sysctl -a | grep kernel.shmall
/sbin/sysctl -a | grep kernel.shmmax
/sbin/sysctl -a | grep kernel.shmmni
/sbin/sysctl -a | grep kernel.sem
/sbin/sysctl -a | grep fs.file-max
/sbin/sysctl -a | grep net.ipv4.ip_local_port_range
/sbin/sysctl -a | grep net.core.rmem_default
/sbin/sysctl -a | grep net.core.rmem_max
/sbin/sysctl -a | grep net.core.wmem_default
/sbin/sysctl -a | grep net.core.wmem_max

If any parameters are lower than the examples below, you will have to increase them by editing the “/etc/sysctl.conf” file and adding the appropriate lines as shown below. If the current value is higher, leave it as is. Running “/sbin/sysctl -p” afterwards loads the new values without a reboot.

kernel.shmall = 2097152
kernel.shmmax = 2147483648
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
fs.file-max = 65536
net.ipv4.ip_local_port_range = 1024 65000
net.core.rmem_default = 262144
net.core.rmem_max = 262144
net.core.wmem_default = 262144
net.core.wmem_max = 262144
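
Comparing ten values by eye gets tedious, so here is a minimal sketch of the "raise it only if it is lower" rule for the single-valued parameters. The check_min function is a made-up helper, not an Oracle or Red Hat tool, and the current values in the example calls are invented; on a real box you would read them with something like "/sbin/sysctl -n fs.file-max".

```shell
#!/bin/sh
# check_min NAME CURRENT REQUIRED (hypothetical helper)
# Prints the sysctl.conf line to add when CURRENT is below REQUIRED,
# and prints nothing when the current value is already high enough.
check_min() {
    name=$1 current=$2 required=$3
    if [ "$current" -lt "$required" ]; then
        echo "$name = $required"
    fi
}

# Example calls with invented current values.
check_min kernel.shmmni 4096 4096    # already high enough, prints nothing
check_min fs.file-max 8192 65536     # too low, prints the line to add
```

Multi-valued parameters such as kernel.sem and net.ipv4.ip_local_port_range would need a field-by-field comparison, which this sketch does not attempt.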

Next, edit your “/etc/security/limits.conf” file, adding the following lines:

oracle          soft    nproc           2047
oracle          hard    nproc           16384
oracle          soft    nofile          1024
oracle          hard    nofile          65536

If your current “/etc/pam.d/login” file does not already contain the following line, add it:

session    required     pam_limits.so

Finally, add the following lines to your “/etc/profile” file:

#Tweaks for Oracle
if [ "$USER" = "oracle" ]; then
    if [ "$SHELL" = "/bin/ksh" ]; then
        ulimit -p 16384
        ulimit -n 65536
    else
        ulimit -u 16384 -n 65536
    fi
fi

These are just the basic steps I take. See the “Oracle Database Installation Guide” for more complete instructions.

When Mac OSX SMB Connections Fail

Earlier today I had a problem with some Macs that could not establish SMB connections to our Windows File Server. There was no quick error, so the problem really “felt” like a firewall issue, but strangely I was able to make a CLI connection to the file server using smbclient:
smbclient //server/share -U domain/username
Password:*******
Domain=[DOMAIN] OS=[Windows Server x] Server=[Windows Server x]
smb: \> exit

I started thinking that perhaps the Mac was doing NETBIOS name lookups and that nmbd might be knocking on the firewall. Turns out this was the problem. Opening up the following ports on the Windows File Server did the trick:

SMB uses ports:
UDP 137 (NETBIOS Name Service)
UDP 138 (NETBIOS Datagram Service)
TCP/UDP 139 (NETBIOS Session Service)

WARNING: Only open these ports to your trusted networks. Statistical data indicates that UDP ports 135-139 and TCP ports 137-139 are amongst the most commonly scanned ports on remote computers.

Sun Project Blackbox - Datacenter in a Can

Lots of small companies want to hire an IT department in a can… You know, the ones who hire only one person to run their Linux servers, code their websites, architect their networks, support their users and order more printer toner. It’s a hard job, but it’s pretty common to see them advertised. What I never dreamed I would see is an entire data center in a can… Literally, in a can… Or at least a shipping container, which is really not that far off.

Leave it to Sun though. Not only have they packed an entire datacenter into a shipping container, they have packed a really good datacenter into a shipping container. Complete with integrated power, cooling, fire suppression, cable management and redundant everything, this little server room-in-a-box has it all. They even showed off how tough it is by putting it through an earthquake!

All told, I really like the idea of my brand new datacenter rolling in on the back of a tractor-trailer truck. It kinda reminds me of the setup the bad guys had in the latest Die Hard movie. I just hope nobody buys one and hires only one person to run it.

How to Make Gnarly Big Linux Filesystems

At least in RHEL 4, the fdisk command does not support the creation of partitions larger than 2TB. To get around this, you have to use the parted command with a GPT disk label. I found the basic info here, but this is the long and short of how to cut off a big ol’ slice of disk using parted:

Run parted

# /sbin/parted /dev/sdb

It’s interactive, so the following commands are issued within the utility.

1) Make the disk label

(parted) mklabel gpt

2) Create the partition

(parted) mkpart primary 0 -1

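3) Verify (the sample output below is from a different, msdos-labeled disk; yours should report the gpt label you just made)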

(parted) print


Disk geometry for /dev/sda: 0.000-38146.972 megabytes
Disk label type: msdos
Minor    Start       End     Type      Filesystem  Flags
1          0.031    101.975  primary   ext3        boot
2        101.975  38146.530  primary               lvm

4) Exit the GNU Parted command shell

(parted) quit

5) Make the filesystem:

# mkfs.ext3 -m0 -F /dev/sdb1

6) Finally, you don’t want to wait for that big filesystem to fsck from time to time, so make sure it does not get checked unless you run the command yourself:

# tune2fs -c0 -i0 /dev/sdb1

That should just about do it. Remember that only RHEL 4 and higher can support filesystems larger than 2TB. If I remember correctly, RHEL 3 can go up to 2TB, RHEL 4 can handle 8TB, and RHEL 5 can make a whopping 16TB chunk of disk. Have fun!
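
As far as I understand it, the 2TB ceiling on the fdisk side comes from the msdos partition table, which stores partition starts and lengths as 32-bit sector counts. The arithmetic below is a rough sketch that assumes 512-byte sectors; the variable names are only for illustration.

```shell
#!/bin/sh
# An msdos (MBR) partition entry holds 32-bit sector counts.
MAX_SECTORS=4294967296   # 2^32
SECTOR_BYTES=512         # assumed sector size

# Roughly the largest partition an msdos label can describe.
max_bytes=$(( MAX_SECTORS * SECTOR_BYTES ))
echo "largest msdos partition: about $max_bytes bytes"
echo "in TiB: $(( max_bytes / 1024 / 1024 / 1024 / 1024 ))"
```

A GPT label uses 64-bit sector addresses, which is why the mklabel gpt step above gets around the limit.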

Strange X11 Forwarding Problem

I started getting this error:
X11 connection rejected because of wrong authentication
when trying to forward X11 applications from a Linux server to my Mac. I had been forwarding the display on this server for years, so I was a little unsure what could be causing it. In the end, it turned out that I had filled up /var, and X11 could not write to “/var/log/XFree86.0.log”. It was an easy fix, but the error was certainly no help.
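
Since the error message gives no hint about disk space, a quick check of how full a filesystem is can save some head scratching. This is a minimal sketch; pct_used is a made-up helper name and the 95 percent threshold is arbitrary.

```shell
#!/bin/sh
# pct_used MOUNTPOINT (hypothetical helper)
# Prints the percentage of space in use on the filesystem holding MOUNTPOINT.
pct_used() {
    # -P forces the one-line-per-filesystem POSIX format, so the
    # capacity column is always field 5 on the second line.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example: warn when /var is nearly full.
if [ -d /var ] && [ "$(pct_used /var)" -ge 95 ]; then
    echo "/var is nearly full"
fi
```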

PHP and Sed for String Substitution

I needed to replace a string in several thousand files scattered all over the filesystem on one of our servers. I used find to create a list of files that needed to be changed, along with their complete path and called it "list.txt". It looked something like this:


/path/to/file/one/fileone.html
/path/to/file/two/filetwo.php
/path/to/file/three/filethree.htm
/path/to/directory with spaces/filefour.txt
and so on...

I worked out the "sed" command to do the in place editing, and Zach helped me whip up a quick PHP script to read the contents of "list.txt" into an array and iterate through it. He was also nice enough to show me how to use "str_replace" to escape any annoying spaces that happened to find their way into the names of directories.

PHP:
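<?php
// Read the list of paths, one per line, dropping trailing newlines and blank lines.
$files = file('list.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($files as $file)
{
    // Escape spaces so the shell treats each path as a single argument.
    $command = '/bin/sed -i \'s/old-string/new-string/g\' ' . str_replace(' ', '\ ', $file);
    exec($command);
}
?>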

It's a handy little script that I'm sure I will find a use for later, so I thought I would put it up here.
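
For comparison, roughly the same job can be done without PHP. The sketch below is not the script we used; it assumes GNU sed (for the -i flag), a "list.txt" with one path per line as shown above, and search and replacement strings that contain no slashes or regex metacharacters. The replace_in_list name is made up for this example.

```shell
#!/bin/sh
# replace_in_list LIST OLD NEW (hypothetical helper)
# Runs an in-place sed substitution on every path named in LIST.
replace_in_list() {
    list=$1 old=$2 new=$3
    # IFS= and -r keep leading spaces and backslashes in each path intact.
    while IFS= read -r file; do
        # Skip blank lines and anything that is not a regular file.
        [ -f "$file" ] || continue
        # Quoting "$file" handles directories with spaces, so no escaping is needed.
        sed -i "s/$old/$new/g" "$file"
    done < "$list"
}

# Usage: replace_in_list list.txt old-string new-string
```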

Solaris 8 SAN Frustrations

Getting Solaris 8 to light up a Qlogic QLA2310 Fibre Channel card using the SUNWqlc and SUNWqlcx drivers can be frustrating enough, but the headaches are only beginning if you want to connect it to a SAN and you don't have all the right packages installed.

Last week, I installed the QLA2310 in a Sun Fire V210 running Solaris 8. I installed the latest versions of SUNWqlc, SUNWqlcx and SUNWsan. After doing a reboot -- -r, the system came up and attached the driver to the card. I zoned it in the fabric and logged into Navisphere, where the WWN showed up, but neither PowerPath nor the Navisphere host agent could communicate with the CLARiiON. I also could not see any of the LUNs I had presented.

I thought it was strange that the CLARiiON could see the host, but the host could not see the CLARiiON.

I ran:
luxadm -e port
Which returned:

Found path to 1 HBA ports

/devices/pci@1d,700000/SUNW,qlc@1/fp@0,0:devctl                    CONNECTED

Clearly, it could see the HBA.

I ran:

ls -l /dev/cfg
total 8
lrwxrwxrwx 1 root  root   38 Nov 30 14:31 c0 -> ../../devices/pci@1e,600000/ide@d:scsi
lrwxrwxrwx 1 root  root   39 Nov 30 14:31 c1 -> ../../devices/pci@1c,600000/scsi@2:scsi
lrwxrwxrwx 1 root  root   41 Nov 30 14:31 c2 -> ../../devices/pci@1c,600000/scsi@2,1:scsi
lrwxrwxrwx 1 root  root   48 Dec  4 13:49 c3 -> ../../devices/pci@1d,700000/SUNW,qlc@1/fp@0,0:fc

The card was c3... This becomes useful later when we have to config it.

I ran:
cfgadm -al -o show_FCP_dev
Which returned:
cfgadm: Configuration administration not supported

There it was... I didn't have the complete SAN package installed. I hadn't done this in a few years, so I had forgotten all the packages I had to add to get the Sun SAN package working correctly... There are many.

Happily, Sun has now packaged them in a nice "SAN_4.4.12_install_it.tar.Z", which you can get from their website if you have a username. It installs everything for you in the right order.

The only thing left to do was another reboot -- -r and run cfgadm -c configure c3 to config the device. After this everything started working nicely.

Managing WordPress and Gallery2 With Subversion

Keeping WordPress up to date using the standard method of deleting old files, extracting the new ones and then running the database upgrade script is a bit cumbersome, but really not that difficult. Gallery2 uses more or less the same methodology, but it does not require you to delete your files prior to the upgrade because it generates a script to remove deprecated files after the install is complete. This is very kind of them, but the Gallery package is large, and upgrades can get a bit unwieldy. While there are certainly more difficult software packages to maintain, there are things I would much rather be doing than software updates, so I decided to make my life easier by using subversion to manage both applications.

Subversion is a code revision management tool that is everything CVS should have been. It is not only amazingly useful for software developers, but it can be readily used by end users as a convenient method of keeping their software up to date. It is for this reason that the Automattic and Gallery folks have started recommending it for those who have command line access to subversion enabled servers.

There is no way to force the software to make an existing install into a subversion checkout, so to convert my site, I pretty much just followed the instructions at the wordpress.org site.

Create a directory for the new install and "cd" into it:
$ mkdir wordpress-svn
$ cd wordpress-svn

Checkout the current WordPress version:
$ svn co http://svn.automattic.com/wordpress/tags/2.3.1 .

Copy the things I cared about from the old directory into the new one:
$ cp ../wordpress/wp-config.php .
And the same for:
favicon.ico
.htaccess

The only thing I didn't really like was their method for copying the files from the old wp-content directory to the new one.

They suggested using "cp -rpf":
$ cd wordpress
$ cp -rpf wp-content/* ../wordpress-svn/wp-content

But I prefer to use a "tar | tar" operation as root from my original wp-content directory.
$ cd wordpress/wp-content
$ tar cpEf - * | (cd ../../wordpress-svn/wp-content && tar xf -)

I don't really have a reason for this other than it's the way I have always moved large directories full of files with varying permissions. I just have more confidence in "tar" to maintain permissions than I do "cp". I seem to remember being bitten by "cp" while moving some Oracle databases at some point.
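
Here is the "tar | tar" idea as a small reusable sketch. The copy_tree name is made up, and it differs from my command above in two assumed-safe ways: it archives "." instead of "*" so dotfiles like .htaccess come along, and it leaves out the Solaris-style E flag so it should run with GNU tar as well.

```shell
#!/bin/sh
# copy_tree SRC DST (hypothetical helper)
# Copies everything under SRC into the existing directory DST through a tar pipe.
copy_tree() {
    src=$1 dst=$2
    # "&&" stops either tar from running in the wrong directory if a cd fails;
    # the p flag asks tar to preserve permissions on both ends.
    (cd "$src" && tar cpf - .) | (cd "$dst" && tar xpf -)
}

# Usage: copy_tree wordpress/wp-content wordpress-svn/wp-content
```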

That pretty much did it for WordPress... Next I moved onto Gallery2. The process was very similar.

Create the new gallery2 directory within your wordpress-svn folder and "cd" into it:
$ mkdir gallery2
$ cd gallery2

Checkout the latest version of the Gallery2 code:
$ svn co https://gallery.svn.sourceforge.net/svnroot/gallery/branches/BRANCH_2_2/gallery2 .

Copy over the config file:
$ cp ../../wordpress/gallery2/config.php .

Finally, copy the g2data directory over using "tar | tar":
$ cd ../../wordpress/gallery2/g2data
$ tar cpEf - * | (cd ../../../wordpress-svn/gallery2/g2data && tar xf -)

That was it... Now all that was left was to rename wordpress to wordpress-presvn and wordpress-svn to wordpress:
$ mv wordpress wordpress-presvn; mv wordpress-svn wordpress

Everything worked fine, so I was golden. If it had not been, I could have simply renamed the directories back to their original names.

What did all this get me? Much much easier upgrades. Upgrading WordPress is now just a matter of switching to the latest tag and running "svn up":
$ svn sw http://svn.automattic.com/wordpress/tags/2.3.2
$ svn up

With Gallery, however, there is no need to do a switch for dot releases, so until it goes from 2.2.x to 2.3.x there is no need to run the "svn switch"... I can just "svn up".

Remember that this is a 10,000 foot view of the process. Please read the links to the Automattic and Gallery documentation if you are going to make the move yourself.

Making a Connector for the Teledyne R22D Oxygen Sensor

If you dive rebreathers much, chances are you will have to repair or replace the Molex plugs and pins that connect your Teledyne R22D oxygen sensors to your head electronics. Many manufacturers are cool about sending you the parts so that you can do the repair yourself, but some, such as AP Diving, require that you send the entire head back for this simple repair.

If you are comfortable handling electronics, and you think it's silly to have to send your head all the way to England or wherever just to have a couple of parts costing less than $1 replaced, you can get the parts you need from just about any distributor that sells electronic components. I like Digi-Key because they would sell me the crystals to make a Red Box when nobody else would. I've been loyal ever since.

Here are the parts you will need:

CONNECTION TERMINAL FEMALE 22-30AWG GOLD
Digi-Key Part Number: WM1129-ND
Manufacturer Part Number: 08-56-0110

CONNECTION HOUSING 3POS .100 W/RAMP
Digi-Key Part Number: WM2001-ND
Manufacturer Part Number: 22-01-3037
