Bare Metal Linux Restore

Several weeks ago we started seeing some pretty scary errors on the main system disk of our Blackboard server. We had a spare server on hand, so we decided to move all the data off the failing disk and onto it. The only question was how to make the new server as close to a perfect copy of the old one as possible. Simply restoring all the filesystems failed for a variety of reasons, mostly related to GRUB and the kernel, so I had to find a way to exclude only the files and directories tied to the specific server model.

To do this, I started by installing a minimal copy of RHEL 4, making sure to lay out the filesystems exactly as they were on the old server. I then ran several experiments, keeping only the bare minimum of files and directories required for the hardware and booting, formatting all the other filesystems, and restoring their data from the old server. In the end, this process produced a system that worked perfectly and very closely mirrored the original server.

Installing APC on CentOS

Casey needed me to install the APC cache for the Scriblio project. It’s a PECL module, but pecl install apc gives an error. Here are some great instructions for getting it all to work.

RMAN 10G NFS Mount Options

We back up our Oracle databases using RMAN and then write the backup pieces out to an NFS share. This has always worked well, but when we upgraded to Oracle 10G, RMAN started complaining that the NFS share was not mounted with the correct options. After some poking around in the docs I finally came up with a set of mount options that work.

Vfstab entry on a Solaris 8 box:
nfsserver.domain.com:/path/to/remote/mountpoint - /local-mountpoint nfs - yes rw,bg,intr,hard,timeo=600,wsize=32768,rsize=32768
Manual mount on a Solaris 8 box:
mount -o rw,bg,intr,hard,timeo=600,wsize=32768,rsize=32768 nfsserver.domain.com:/path/to/remote/mountpoint /local-mountpoint

According to the docs, the options on a Linux box are pretty much the same, except that you also add the following:
nfsvers=3,tcp
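A missing option is easy to overlook, so a small check can catch it before RMAN does. This is a sketch rather than part of the original setup: missing_nfs_opts is a hypothetical helper, and its required list simply copies the Linux options above. It checks a configured option string, such as the options field of an /etc/fstab line, rather than the live mount, since the kernel may report options in a different form.

```shell
# Print each required NFS mount option that is missing from a
# comma-separated option string, one per line. The required list is
# an assumption copied from the Linux options in this post.
missing_nfs_opts() {
    opts=",$1,"   # pad with commas so every option is matched whole
    for want in rw bg intr hard timeo=600 wsize=32768 rsize=32768 nfsvers=3 tcp; do
        case "$opts" in
            *",$want,"*) ;;      # option present, nothing to report
            *) echo "$want" ;;   # option missing, report it
        esac
    done
}

# Example: check the fstab entry for the placeholder mount point.
# missing_nfs_opts "$(awk '$2 == "/local-mountpoint" {print $4}' /etc/fstab)"
```

Empty output means every option in the list is present.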

Creating Linux Partitions for CLARiiON

Creating a properly offset slab of disk for Linux systems on your CLARiiON is not just a matter of creating a partition with the default fdisk values. Disk management utilities for Intel-based systems generally reserve the first 63 sectors at the beginning of the LUN for metadata. The addressable space begins immediately after these sectors, at byte 63 × 512 = 32,256, which does not line up with the stripe element size (usually 64 KB). As a result, I/O that should fit inside one stripe element crosses onto a second disk, especially with larger I/O sizes.

To get around this, you have to align the partition so that it starts on a sector that meshes nicely with the stripe element size. In this case that is sector 128, because 128 × 512 bytes = 65,536 bytes, exactly 64 KB. Below is an example of how I create partitions on our CLARiiON for Linux systems. Check out the EMC Best Practices for Fibre Channel Storage white paper for more detail.

/sbin/fdisk /dev/emcpowera
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.

The number of cylinders for this disk is set to 39162.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-39162, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-39162, default 39162):
Using default value 39162

Command (m for help): x

Expert command (m for help): b
Partition number (1-4): 1
New beginning of data (63-629137529, default 63): 128

Expert command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
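The arithmetic behind the 128-sector choice is easy to check for any starting sector. A minimal sketch, assuming 512-byte sectors and a 64 KB (65,536-byte) stripe element; is_aligned is a hypothetical helper, not an EMC tool, and both sizes can be overridden:

```shell
# Report whether a partition's starting sector lands on a stripe
# element boundary. Returns 0 when aligned, 1 otherwise.
# Usage: is_aligned START_SECTOR [SECTOR_BYTES] [ELEMENT_BYTES]
is_aligned() {
    start=$1
    sector_bytes=${2:-512}       # assumed sector size
    element_bytes=${3:-65536}    # assumed 64 KB stripe element
    offset=$((start * sector_bytes))
    if [ $((offset % element_bytes)) -eq 0 ]; then
        echo "sector $start (byte offset $offset): aligned"
        return 0
    else
        echo "sector $start (byte offset $offset): NOT aligned"
        return 1
    fi
}
```

Feed it the start sector that fdisk -lu reports for an existing partition: is_aligned 63 flags the default layout, while is_aligned 128 passes.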

X11 Forwarding Broken on Solaris

If you’re running Solaris 8 or 9 and an upgrade breaks SSH X11 forwarding, the problem may be Sun’s sockfs bug. The symptoms are SSH failing to set the $DISPLAY variable and an error in your system log that looks something like this:

Jun 3 09:40:24 servername sshd[26432]: [ID 800057 auth.error] error: Failed to allocate internet-domain X11 display socket.

To fix this, you can either install Sun’s latest sockfs patch for your version of the OS, or simply force sshd into IPv4 mode by doing the following:

Edit your sshd_config file, adding the following:

# IPv4 only
ListenAddress 0.0.0.0

Edit your sshd startup script to pass -4 to sshd when it starts:

case "$1" in
'start')
        echo 'starting ssh daemon'
        /usr/local/sbin/sshd -4
        ;;
# ...other cases (such as 'stop') stay as they were...
esac

Restart sshd, and that should pretty much do it… Enjoy.
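To confirm the sshd_config change is in place, a quick check like the one below can help. It is a sketch: has_ipv4_listen is a hypothetical helper, it assumes a POSIX awk (on Solaris 8 or 9 that means nawk or /usr/xpg4/bin/awk rather than the old /usr/bin/awk), and it only inspects the config file, not the -4 flag in the startup script.

```shell
# Succeed if the given sshd_config has an active (uncommented)
# "ListenAddress 0.0.0.0" line. Keywords are matched case-insensitively,
# as sshd does; commented lines start with "#" and never match.
has_ipv4_listen() {
    awk 'tolower($1) == "listenaddress" && $2 == "0.0.0.0" { found = 1 }
         END { exit !found }' "$1"
}

# Example (path varies by install):
# has_ipv4_listen /usr/local/etc/sshd_config && echo "IPv4-only listen configured"
```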

Darkness Beckons

All next week I’ll be taking a cave diving class on my CCR down in North Florida. Cave diving has been a dream of mine since reading an article about Sheck Exley’s exploration of the Nacimiento Mante cave system in Mexico. At a time in my life when I almost bought into the idea that divers should not venture deeper than 130 feet, there I was, reading about a man who had plunged to a world record depth of 881 feet and returned safely to the surface after 14 hours of decompression. It was as if the wool that had been pulled over my eyes by the recreational diving agencies had suddenly been removed, and I was left totally inspired. I remain inspired to this day, and I am honored to have the opportunity to learn cave diving from legendary cave and technical diver Tom Mount.

VMware ESX 3.5 ntpdate strangeness

We just noticed that the time was very far off on our sparkly new VMware ESX 3.5 server. When I ran ntpdate to bring it back in sync, I was surprised to find that it could not reach the time server, because outbound UDP 123 traffic was blocked by the built-in firewall.

Here is what I got:
/usr/sbin/ntpdate -u time.nist.gov
9 Apr 03:47:53 ntpdate[20245]: sendto(192.43.244.18): Operation not permitted
9 Apr 03:47:54 ntpdate[20245]: sendto(192.43.244.18): Operation not permitted
9 Apr 03:47:55 ntpdate[20245]: sendto(192.43.244.18): Operation not permitted
9 Apr 03:47:56 ntpdate[20245]: sendto(192.43.244.18): Operation not permitted
9 Apr 03:47:57 ntpdate[20245]: no server suitable for synchronization found

Normally I would just add a rule to the “/etc/sysconfig/iptables” file to allow outbound traffic on this port, but VMware ESX Server doesn’t manage its firewall through that file… It has its own firewall tooling, so I had to figure out how to change it. Happily, it turns out that there is a handy “esxcfg-firewall” command built just for such things.

Running this:
/usr/sbin/esxcfg-firewall -q | grep 123

12300 1803K valid-tcp-flags  tcp  --  *   *     0.0.0.0/0        0.0.0.0/0

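The only match was a packet counter (12300) on an unrelated TCP rule. There was no rule for UDP port 123, which confirmed that outbound NTP traffic was not allowed.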

Running this opened it up:
/usr/sbin/esxcfg-firewall -e ntpClient

Grep for “123” again just to be sure:
/usr/sbin/esxcfg-firewall -q | grep 123

1  76 ACCEPT  udp  --  *    *    0.0.0.0/0      0.0.0.0/0     udp dpt:123

And you can now run ntpdate to sync up the time:
/usr/sbin/ntpdate -u time.nist.gov
9 Apr 09:52:54 ntpdate[20319]: step time server 192.43.244.18 offset 21689.039217 sec
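Because a plain grep 123 also matches packet and byte counters, a tighter check looks for the rule itself. This is a sketch: ntp_out_allowed is a hypothetical helper that only parses the esxcfg-firewall -q listing it receives on standard input, assuming rules appear in the format shown above.

```shell
# Succeed if the firewall listing on stdin contains an ACCEPT rule for
# UDP destination port 123. Comparing whole fields against "dpt:123"
# avoids false hits on counters such as "12300".
ntp_out_allowed() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i == "dpt:123" && $0 ~ /ACCEPT/ && $0 ~ /udp/)
                found = 1
    }
    END { exit !found }'
}

# Example: open the port only if the rule is not already there.
# /usr/sbin/esxcfg-firewall -q | ntp_out_allowed || /usr/sbin/esxcfg-firewall -e ntpClient
```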

RHEL System Configuration Changes for Oracle 10G

Below is a list of RHEL system configuration changes that Oracle 10G requires before it is installed.

First, check the following kernel parameters using the commands below:

/sbin/sysctl -a | grep kernel.shmall
/sbin/sysctl -a | grep kernel.shmmax
/sbin/sysctl -a | grep kernel.shmmni
/sbin/sysctl -a | grep kernel.sem
/sbin/sysctl -a | grep fs.file-max
/sbin/sysctl -a | grep net.ipv4.ip_local_port_range
/sbin/sysctl -a | grep net.core.rmem_default
/sbin/sysctl -a | grep net.core.rmem_max
/sbin/sysctl -a | grep net.core.wmem_default
/sbin/sysctl -a | grep net.core.wmem_max

If any parameter is lower than the values below, increase it by adding the appropriate line to the “/etc/sysctl.conf” file; running /sbin/sysctl -p then loads the new values without a reboot. If the current value is higher, leave it as is.

kernel.shmall = 2097152
kernel.shmmax = 2147483648
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
fs.file-max = 65536
net.ipv4.ip_local_port_range = 1024 65000
net.core.rmem_default = 262144
net.core.rmem_max = 262144
net.core.wmem_default = 262144
net.core.wmem_max = 262144
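Comparing ten values by eye is error-prone, so here is a sketch that does the comparison. check_oracle_sysctl is a hypothetical helper; its minimums are copied from the list above, and kernel.sem and net.ipv4.ip_local_port_range are deliberately skipped because they hold several numbers each and need a manual look.

```shell
# Read "key = value" lines (for example, /sbin/sysctl -a output) and
# print every scalar Oracle parameter that is below its minimum.
# Minimums are kept as strings so large values print exactly.
check_oracle_sysctl() {
    awk '
    BEGIN {
        min["kernel.shmall"]         = "2097152"
        min["kernel.shmmax"]         = "2147483648"
        min["kernel.shmmni"]         = "4096"
        min["fs.file-max"]           = "65536"
        min["net.core.rmem_default"] = "262144"
        min["net.core.rmem_max"]     = "262144"
        min["net.core.wmem_default"] = "262144"
        min["net.core.wmem_max"]     = "262144"
    }
    ($1 in min) && $2 == "=" && ($3 + 0) < (min[$1] + 0) {
        printf "%s is %s, needs at least %s\n", $1, $3, min[$1]
    }'
}

# Example:
# /sbin/sysctl -a 2>/dev/null | check_oracle_sysctl
```

No output means none of the scalar parameters needs raising.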

Next, edit your “/etc/security/limits.conf” file, adding the following lines:

oracle          soft    nproc           2047
oracle          hard    nproc           16384
oracle          soft    nofile          1024
oracle          hard    nofile          65536

If your current “/etc/pam.d/login” file does not already contain the following line, add it:

session    required     pam_limits.so
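If you script this step, it helps to make it idempotent so the line is never added twice. A sketch, with ensure_pam_limits as a hypothetical helper: back up /etc/pam.d/login first, and note that it appends at the end of the file, which is usually fine for a session line but worth checking against your existing PAM stack.

```shell
# Append the pam_limits session line to a PAM config file unless an
# active session line for pam_limits.so is already present.
ensure_pam_limits() {
    file=$1
    if grep -Eq '^[[:space:]]*session[[:space:]].*pam_limits\.so' "$file"; then
        echo "already present in $file"
    else
        # Start on a fresh line if the file does not end with a newline.
        if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
            echo >> "$file"
        fi
        printf 'session    required     pam_limits.so\n' >> "$file"
        echo "added to $file"
    fi
}

# Example (as root, after making a backup):
# ensure_pam_limits /etc/pam.d/login
```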

Finally, add the following lines to your “/etc/profile” file:

#Tweaks for Oracle
if [ "$USER" = "oracle" ]; then
    if [ "$SHELL" = "/bin/ksh" ]; then
        ulimit -p 16384
        ulimit -n 65536
    else
        ulimit -u 16384 -n 65536
    fi
fi

These are just the basic steps I take. See the “Oracle Database Installation Guide” for more complete instructions.

When Mac OSX SMB Connections Fail

Earlier today I had a problem with some Macs that could not establish SMB connections to our Windows file server. The connections hung instead of failing with an immediate error, so the problem really “felt” like a firewall issue, but strangely I was able to make a command-line connection to the file server using smbclient:
smbclient //server/share -U domain/username
Password:*******
Domain=[DOMAIN] OS=[Windows Server x] Server=[Windows Server x]
smb: \> exit

I started thinking that perhaps the Mac was doing NetBIOS name lookups and that nmbd might be knocking on the firewall. It turns out this was the problem. Opening up the following ports on the Windows file server did the trick:

SMB over NetBIOS uses these ports:
UDP 137 (NetBIOS Name Service)
UDP 138 (NetBIOS Datagram Service)
TCP/UDP 139 (NetBIOS Session Service)

WARNING: Only open these ports to your trusted networks. Statistical data indicates that UDP ports 135-139 and TCP ports 137-139 are among the most commonly scanned ports on remote computers.

Girl in Clinton’s “Red Phone” Ad Supports Obama

It turns out that the sleeping little girl in Hillary Clinton’s “Red Phone” TV advertisement is supporting Barack Obama. The Clinton campaign used stock footage of the girl, who is now of voting age and calls the “Red Phone” ad “fear mongering.” I guess the Clinton crew should have thought about that.

Sun Project Blackbox - Datacenter in a Can

Lots of small companies want to hire an IT department in a can… You know, the ones who hire only one person to run their Linux servers, code their websites, architect their networks, support their users and order more printer toner. It’s a hard job, but it’s pretty common to see them advertised. What I never dreamed I would see is an entire data center in a can… Literally, in a can… Or at least a shipping container, which is really not that far off.

Leave it to Sun, though. Not only have they packed an entire datacenter into a shipping container, they have packed a really good datacenter into one. Complete with integrated power, cooling, fire suppression, cable management and redundant everything, this little server-room-in-a-box has it all. They even showed off how tough it is by putting it through an earthquake!

All told, I really like the idea of my brand new datacenter rolling in on the back of a tractor-trailer truck. It kinda reminds me of the setup the bad guys had in the latest Die Hard movie. I just hope nobody buys one and hires only one person to run it.

Who Cares if the Rebreather Has Integrated Deco

For some time now, Innerspace Systems has been working on a Megalodon head called APECS 3 that supports integrated decompression. As with any major software/hardware engineering project, there have been some delays, which have Meg owners clamoring for information about when it will come out. It’s amazing how many of these rebreather divers are pestering the company and acting like a bunch of kids a few days before Christmas. What I don’t really understand is why people are so anxious.

It’s not that I wouldn’t like to have integrated deco, but I really don’t see it as being all that big a deal. When software gets more complex it also gets more buggy, which is why I’m pretty happy having a very basic loop controller. Keeping the deco on a different unit like a VR3 is a nice modular system, and besides, I don’t really even use the computer on really deep dives.

When I plan a bigger dive, I do it like this:

  • Work out the details on the laptop
  • Cut the tables and laminate them (wrist mounted slate)
  • Cut bailout tables and laminate them (also on wrist)
  • Fill O2 and Diluent
  • Fill bailout using thirds
  • Do the dive as it was planned and as it appears on my wrist.

While I use the computer to validate my deco schedule, it is really only there for backup.

Again, it would be nice to have integrated deco, but IMHO, you should not do big dives if you can’t maintain a setpoint. Provided you can, or even if you depend on your loop controller to do it, your actual setpoint will match the one on your computer. Everything should jibe, and you can validate the deco schedule on your wrist.

It’s tempting just to jump in, do a gnarly dive and depend on your computer to get you out of it, but doing so ignores some of the basic safety precautions of technical diving like proper gas management, which is a risk that I really don’t feel comfortable taking.

RHEL3 Upgrade to RHEL4 Breaks up2date

Last week I had to upgrade one of our old RHEL3 servers in order to get it to address disks larger than 2TB. I did the upgrade from CD, and it went fairly smoothly, except up2date would not run after the box came back up.

It gave me the following error:

[root@x up2date]# Traceback (most recent call last):
  File "/usr/sbin/up2date", line 27, in ?
    from up2date_client import repoDirector
  File "/usr/share/rhn/up2date_client/repoDirector.py", line 5, in ?
    import rhnChannel
  File "/usr/share/rhn/up2date_client/rhnChannel.py", line 10, in ?
    import up2dateAuth
  File "/usr/share/rhn/up2date_client/up2dateAuth.py", line 5, in ?
    import rpcServer
  File "/usr/share/rhn/up2date_client/rpcServer.py", line 23, in ?
    from rhn import rpclib
ImportError: No module named rhn

It turns out that there is no “really easy” way to fix it, but these directions on spaceblog do work. Basically it involves removing a lot of packages and re-adding them. The problem is related to python, so rather than remove the entire list of packages, I focused only on those relating to python and up2date:


libxml2-python
popt
pyOpenSSL
python
rhnlib
rhpl
up2date

Make sure not to remove these packages:


rpm
rpm-libs
rpm-python

Otherwise you will break rpm and not be able to add the packages back after you remove them. All told, this is a grisly process, and you will have to use rpm -e --nodeps to get it done. This sucks, but up2date will work everything out once you can get it running again.
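Since one slip here breaks rpm, it can help to build the removal command with a guard and review it before running it by hand. A sketch, with rpm_remove_cmd as a hypothetical helper: it only prints the command, and its protected list mirrors the three packages named above. If you are unsure what else rpm depends on, ldd /bin/rpm shows the shared libraries it loads.

```shell
# Print (do not run) an "rpm -e --nodeps" command for the given
# packages, dropping rpm, rpm-libs and rpm-python with a warning.
rpm_remove_cmd() {
    keep=""
    for pkg in "$@"; do
        case "$pkg" in
            rpm|rpm-libs|rpm-python)
                echo "skipping protected package: $pkg" >&2 ;;
            *)
                keep="$keep $pkg" ;;
        esac
    done
    # Print nothing (and return non-zero) if no packages are left.
    [ -n "$keep" ] && echo "rpm -e --nodeps$keep"
}

# Example:
# rpm_remove_cmd libxml2-python popt pyOpenSSL python rhnlib rhpl up2date
```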

How to Make Gnarly Big Linux Filesystems

At least in RHEL 4, the fdisk command cannot create partitions larger than 2TB, because the DOS (MBR) partition table it writes can’t describe them. To get around this, you have to use the parted command with a GPT disk label. I found the basic info here, but this is the long and short of how to cut off a big ol’ slice of disk using parted:

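Run parted against the new disk:

# /sbin/parted /dev/sdb

It’s interactive, so the following commands are issued within the utility. Without a device argument, parted picks a default device (usually the first disk), so name the disk explicitly; /dev/sdb here matches the mkfs step below.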

1) Make the disk label

(parted) mklabel gpt

2) Create the partition

(parted) mkpart primary 0 -1

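3) Verify (the sample output below was captured on a different disk, /dev/sda with an msdos label; on the new disk, print should show a gpt label and a single partition spanning the whole disk)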

(parted) print


Disk geometry for /dev/sda: 0.000-38146.972 megabytes
Disk label type: msdos
Minor    Start       End     Type      Filesystem  Flags
1          0.031    101.975  primary   ext3        boot
2        101.975  38146.530  primary               lvm

4) Exit the GNU Parted command shell

(parted) quit

5) Make the filesystem:

# mkfs.ext3 -m0 -F /dev/sdb1

6) Finally, you don’t want to wait for that big filesystem to fsck from time to time, so turn off the mount-count and time-interval checks; then it only gets checked when you run fsck yourself:

# tune2fs -c0 -i0 /dev/sdb1

That should just about do it. Remember that only RHEL 4 and higher can support filesystems larger than 2TB. If I remember correctly RHEL 3 can go up to 2TB, RHEL4 can handle 8TB, and RHEL 5 can make a whopping 16TB chunk of disk. Have fun!
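The 2TB ceiling comes from the DOS (MBR) partition table, which stores sector addresses and counts as 32-bit values. With 512-byte sectors that works out to 2^32 × 512 = 2,199,023,255,552 bytes (2 TiB). A small sketch of the arithmetic, assuming 512-byte sectors and a shell with 64-bit arithmetic; label_for_size is a hypothetical helper that suggests a label for a disk of a given size in bytes:

```shell
# Largest span an msdos (MBR) label can address: 2^32 sectors of 512 bytes.
mbr_limit_bytes() {
    echo $((4294967296 * 512))
}

# Print "gpt" if a disk of the given size in bytes is too big for an
# msdos label, otherwise "msdos".
label_for_size() {
    if [ "$1" -gt "$(mbr_limit_bytes)" ]; then
        echo gpt
    else
        echo msdos
    fi
}
```

For example, label_for_size 3000000000000 (a 3 TB LUN) prints gpt, while the 40 GB disk in the sample output above would be fine with msdos.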

I Can Finally Have My Rocket Belt!

Juan Manuel Lozano is my new hero, plain and simple! Since the Bell Rocket Belt of the early 1960s, the world has been disappointingly devoid of this amazing invention, but no longer. Lozano, a self-taught engineer from Mexico, has worked diligently for nearly 30 years to develop a working rocket belt, and now he has one.

Supposedly the biggest trick to making these rigs work is getting the throttle to operate smoothly enough, but there are a number of other challenges as well. They run on 90% pure hydrogen peroxide which is extremely caustic, so material compatibility is a huge factor. You can’t just march into your nearest drug store and pick up this highly concentrated chemical either, so there are also many issues surrounding the distillation process of the fuel.

Lozano’s commitment to his projects is truly impressive. He has financed everything himself and come up with a product that seems to work every bit as well as the old Bell Rocket Belt, but looks even cooler! What’s more, his jet packs are on sale now for $125,000! That may seem like a lot of money, but it is really very reasonable when you consider what you get:

1. A fully-tested, custom-made flying rocket belt,
2. This belt has been proved to be the most stable design and easier to fly
3. A special machine to make our own unlimited supply of rocket fuel
4. Hands-on training in the process and the equipment
5. Flight training of 10 flights in your own rocket belt
6. Maintenance and setup training
7. 24/7 expert support
8. Housing and food are included during training

When you consider that the original rocket belt cost Bell Aerosystems $250,000 in 1960, and that the guys from “The Rocket Belt Caper” spent a great deal more, you can only conclude that $125,000 is very reasonable indeed. This is not to say I can run out and buy one, but I admit that I am tempted by thoughts of being a full-time professional jet pack pilot.

Well done Lozano! My hat goes off to you!
