Debugging UDP Connections

The most basic network troubleshooting trick in the book is a simple test to make sure that a daemon is listening on its respective port. This is easy with TCP because you can simply set up the daemon on the destination and telnet to the port. It's harder with UDP because the protocol is connectionless: there's no handshake or ACK to confirm that anything arrived.
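For example, to verify that a mail daemon is answering (a hypothetical host and port; substitute your own):

telnet server.example.com 25

If the connection opens, something is listening; "Connection refused" means nothing is bound to the port.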

For Solaris, or better yet illumos, capture the traffic with snoop on the receiving end and generate UDP probes with ping from the source:

snoop -d ce0 'host server.example.com and udp and port 137'

ping -U -p 137 server.example.com

If you have Linux, the capture side is:

tcpdump -i eth0 'host server.example.com and udp and port 137'
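Linux's ping can't generate UDP probes the way the Solaris one can, but netcat makes a fine traffic generator (a quick sketch, assuming nc is installed; host and port are just the examples from above):

echo test | nc -u -w1 server.example.com 137

If the capture on the destination shows the datagrams arriving, the network path is fine and any remaining problem is with the daemon itself.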

See also:
http://www.cosine.org/2007/08/21/debugging-connectivity-problems/

Mounting ISO Images on Illumos

I always have to look this up, so I'm writing it down. To mount an ISO image on Solaris-derived systems, do the following:

lofiadm -a /path/to/image.iso /dev/lofi/1
mount -F hsfs -o ro /dev/lofi/1 /path/to/mountpoint
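When you're finished with the image, reverse the process: unmount it and tear down the lofi device:

umount /path/to/mountpoint
lofiadm -d /dev/lofi/1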

Illumos ZFS Storage Appliance

A while back Casey started complaining that his Drobo storage robot was no longer being awesome. This got me thinking about how easy it would be to build a nice ZFS storage appliance that provides massive storage, constant data protection, and self-healing against bit rot. I have wanted to build something like this for some time, but never had the storage needs at home to justify it. Well, data needs grow, and my discussion with Casey while stumbling around Fry's was all it took to get me moving.

What is ZFS? Well, put simply, ZFS is Jeff Bonwick and The Bonwick Youth's answer to every filesystem annoyance the world has ever known. It is the pinnacle of human achievement in filesystem development and, quite honestly, the only commonly available storage option that will truly protect your data. Now, before you start leaving angry comments explaining how [insert RAID solution here] does a perfectly good job of protecting data, I'm not talking about RAID. I'm talking about leveraging copy-on-write transactions and block-level checksums to ensure data integrity, and an implemented strategy for self-healing against bit rot, current spikes, bugs in disk firmware, ghost writes, etc. I'm also talking about a dead simple, logical volume management layer and a wealth of features too numerous to list here.

Anyhow, I cobbled together the items listed below, installed OpenIndiana (an Illumos distribution) on it, and configured Netatalk. It works wonderfully, and I can't say enough about how pleased I am with Illumos, and how happy I am to have an industrial-strength, feature-rich UNIX in the open source community.

Here's the parts list for the components I used:
Motherboard
Memory X2
Case
PSU
Drives X5

Price: $807.92
Usable Storage: 7.6TB (raidz)
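For reference, building the pool itself is a one-liner once the OS is installed (a sketch with hypothetical device names; run "format" to find yours):

# zpool create data raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0

A five-disk raidz yields the capacity of four disks plus one disk of parity, and ZFS checksums every block, so a scrub can detect bit rot and repair it from the redundancy.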


Using MacPorts Subversion With BBEdit

PROTIP: If you want to use the Subversion features in BBEdit, and you also like using v1.7+ of svn, you have to point BBEdit at the MacPorts binary instead of the default system copy. Obviously, this assumes that you have MacPorts installed and have used it to build and install the Subversion port.

CODE:

defaults write com.barebones.bbedit Subversion:SubversionToolPathOverride /opt/local/bin/svn
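You can confirm that the override took, and that the MacPorts binary is the version you expect:

defaults read com.barebones.bbedit Subversion:SubversionToolPathOverride
/opt/local/bin/svn --version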

How to Replace a Failed Drive in a ZFS Pool

So you have a failed disk in a ZFS pool and you want to fix it? Routine disk failures are really a non-event with ZFS because the volume management makes replacing them so dang easy. In many cases, unlike hardware RAID or older volume management solutions, the replacement disk doesn't even need to be exactly the same as the original. So let's get started replacing our failed disk. These instructions will be for a Solaris 10 system, so a few of the particulars related to unconfiguring the disk and device paths will vary with different flavors of UNIX.

First, take a look at the zpools to see if there are any errors. The -x flag will only display status for pools that are exhibiting errors or are otherwise unavailable.
Note: If the disk is actively failing (a process that sometimes takes a while as the OS offlines it), any commands that use storage related system calls will hang and take a long time to return. These include "zpool" and "format", so just be patient; they will eventually return.

# zpool status -x

 pool: data
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            c1t4d0  ONLINE       0     0     0
            c1t5d0  FAULTED      1    81     0  too many errors
          mirror-1  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0

errors: No known data errors

So we can easily see that c1t5d0 has failed. Take a look at the "format" output to get the particulars about the disk:
# format

Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c1t0d0 
          /pci@0/pci@0/pci@2/scsi@0/sd@0,0
       1. c1t1d0 
          /pci@0/pci@0/pci@2/scsi@0/sd@1,0
       2. c1t2d0 
          /pci@0/pci@0/pci@2/scsi@0/sd@2,0
       3. c1t3d0 
          /pci@0/pci@0/pci@2/scsi@0/sd@3,0
       4. c1t4d0 
          /pci@0/pci@0/pci@2/scsi@0/sd@4,0
       5. c1t5d0 <SEAGATE-ST914602SSUN146G-0603-136.73GB>
          /pci@0/pci@0/pci@2/scsi@0/sd@5,0
Specify disk (enter its number): 

Get your hands on a replacement disk that is as similar as possible to a SEAGATE-ST914602SSUN146G-0603-136.73GB. I was only able to dig up a HITACHI-H103014SCSUN146G-A2A8-136.73GB, so I'll be using that instead of a direct replacement.

Next, use "cfgadm" to look at the disks you have and their configuration status:

# cfgadm -al

Ap_Id                          Type         Receptacle   Occupant     Condition
c1                             scsi-sata    connected    configured   unknown
c1::dsk/c1t0d0                 disk         connected    configured   unknown
c1::dsk/c1t1d0                 disk         connected    configured   unknown
c1::dsk/c1t2d0                 disk         connected    configured   unknown
c1::dsk/c1t3d0                 disk         connected    configured   unknown
c1::dsk/c1t4d0                 disk         connected    configured   unknown
c1::dsk/c1t5d0                 disk         connected    configured   unknown

We want to replace t5, so we prepare it for removal by unconfiguring it:

# cfgadm -c unconfigure c1::dsk/c1t5d0

The "safe to remove" led should turn on and you can pull the disk, remembering to allow it several seconds to spin down. Replace it with the new disk and take a look at "cfgadm -al" output again to ensure that it has been automatically configured. If it has not, you can manually configure it like below:

# cfgadm -c configure c1::dsk/c1t5d0

Now, it's a simple matter of a quick "zpool replace" to get things rebuilding:

# zpool replace data c1t5d0
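Because the new disk went into the same slot, the single-device form of "zpool replace" is all it takes. If the replacement lives at a different device path, name both the old and new devices (c1t6d0 here is hypothetical):

# zpool replace data c1t5d0 c1t6d0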

You can use the output of zpool status to watch the resilver process...
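If you want to keep an eye on it, a trivial loop like this works:

# while true; do zpool status data | grep scrub; sleep 60; done

Once the resilver completes, "zpool status -x" should once again report that all pools are healthy.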

How to Enable SSL for CSWapache2

If you've spent any time at all around Solaris 10, you know that Sun has invested a fair amount of effort developing a pretty snazzy Service Management Facility (SMF). It is extremely flexible and feature-rich, but it's not quite as straightforward as the old legacy /etc/init.d scripts. If you're running the OpenCSW Apache package, it installs a service manifest into SMF, so you'll have to edit the service's properties to run Apache with SSL... Here's how:


# svccfg

svc:> select cswapache2
svc:/network/http:cswapache2> listprop httpd/ssl

httpd/ssl  boolean  false

svc:/network/http:cswapache2> setprop httpd/ssl=true
svc:/network/http:cswapache2> exit

Now, make the changes active:


# svcadm disable cswapache2
# svcadm enable cswapache2
# svcprop -p httpd/ssl svc:/network/http:cswapache2

false

Note that the property still reads false, even after the disable/enable cycle: setprop changes aren't visible to the running service until its configuration is refreshed.

# svcadm refresh cswapache2
# svcprop -p httpd/ssl svc:/network/http:cswapache2

true
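For what it's worth, the whole dance can also be done non-interactively (assuming the same service name as above), which is handy for configuration scripts:

# svccfg -s cswapache2 setprop httpd/ssl=true
# svcadm refresh cswapache2
# svcadm restart cswapache2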

ZoneType.sh Version 2.0

We just started supporting Solaris 10 in our VMware cluster, so I had to update my zone type script to detect whether the OS is running there. I'm not sure how I feel about depending on the output of prtdiag, since the interface is labeled "unstable", but it works for now, and I really don't see Sun changing the first line of output where the system configuration is listed. Anyhow, when issued with the -v or --vmware flag, the script returns 0 if it's running on the cluster and 1 if it is not.

Usage:

# zonetype.sh -g or --global
Return 0: The machine is a global zone with 1 or more local zones
Return 1: The machine is not a global zone

# zonetype.sh -l or --local
Return 0: The machine is a local zone
Return 1: The machine is not a local zone

# zonetype.sh -v or --vmware
Return 0: The machine is running on a VMware hypervisor
Return 1: The machine is not running in VMware

#!/bin/bash
#
# When issued with the -g or --global flag, this script returns:
# 0 if the machine is a global zone with one or more local zones;
# otherwise, it returns 1.
#
# When issued with the -l or --local flag, this script returns:
# 0 if it is a local zone and 1 if it is not.
#
# When issued with the -v or --vmware flag, this script returns:
# 0 if it is running on a VMware hypervisor and 1 if it is not.
#

# The first column of `zoneadm list -civ` is the zone ID. list[0] holds the
# "ID" column header, list[1] the first zone's ID, list[2] the second's, etc.
# The global zone is always ID 0; local zones have IDs of 1 or greater.
list=( $(/usr/sbin/zoneadm list -civ | awk '{ print $1 }') )

case "$1" in
    -g|--global)
        # If the third element in our array is null, set it to 0.
        if [ -z "${list[2]}" ]; then
            list[2]=0
        fi
        # This is a global zone only if the first zone ID is 0 (the global
        # zone itself) and at least one local zone is listed after it.
        if [ "${list[1]}" -eq 0 ] && [ "${list[2]}" -ge 1 ]; then
            exit 0
        else
            exit 1
        fi
        ;;
    -l|--local)
        # A local zone sees only itself, with a zone ID of 1 or greater.
        if [ "${list[1]}" -ge 1 ]; then
            exit 0
        else
            exit 1
        fi
        ;;
    -v|--vmware)
        # Don't run the check in a local zone... prtdiag can't run there.
        if [ "${list[1]}" != 0 ]; then
            exit 1
        else
            # The fifth field of prtdiag's "System Configuration" line
            # carries the platform vendor.
            vmhost=$(/usr/sbin/prtdiag | grep System | awk '{ print $5 }')
            if [ "$vmhost" == "VMware" ]; then
                exit 0
            else
                exit 1
            fi
        fi
        ;;
    *)
        echo "Usage: /local/adm/zonetype.sh {-l | --local | -g | --global | -v | --vmware}"
        exit 1
        ;;
esac
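For example, acting on the exit codes from another script looks like this:

if /local/adm/zonetype.sh --vmware; then
        echo "Running on the VMware cluster"
fi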