Making RHEL 3 See Multiple LUNS

For some reason, RHEL 3 comes out of the box configured to see only the first LUN on a SCSI channel. This is usually not a problem, since the first LUN is all you care about, but in some instances you will need to configure the SCSI module to see multiple LUNs.

In this case we are using an Adaptec DuraStor 6200S, which is set up to present the RAID controller as LUN 00, and the actual RAID array as LUN 01. Without any modifications to the system, we plug it in, and after a reboot check /proc/scsi/scsi. We can see the RAID controller, but since we can only see the first LUN on the channel, we never get to the array:

Host: scsi2 Channel: 00 Id: 00 Lun: 00
Vendor: Adaptec Model: DuraStor 6200S Rev: V100
Type: Processor ANSI SCSI revision: 03

The actual array would show up as “Channel: 00 Id: 00 Lun: 01”, but it’s not there. To resolve this, we have to first edit “/etc/modules.conf” and add the following line:

options scsi_mod max_scsi_luns=128 scsi_allow_ghost_devices=1

In our case, modules.conf looks like this after the modification:

alias eth0 e1000
alias eth1 e1000
alias scsi_hostadapter megaraid2
alias usb-controller usb-uhci
alias usb-controller1 ehci-hcd
alias scsi_hostadapter1 aic7xxx
options scsi_mod max_scsi_luns=128 scsi_allow_ghost_devices=1
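If you prefer to script the change, something like the following works. This is only a sketch, run here against a scratch copy of the file so it can be tested safely; on a real RHEL 3 box you would point it at /etc/modules.conf instead.

```shell
#!/bin/sh
# Sketch: add the scsi_mod options line only if one is not already present.
# Uses a scratch copy here; on the real system the file is /etc/modules.conf.
conf=/tmp/modules.conf
printf 'alias scsi_hostadapter megaraid2\n' > "$conf"   # stand-in contents

line='options scsi_mod max_scsi_luns=128 scsi_allow_ghost_devices=1'
grep -q '^options scsi_mod' "$conf" || echo "$line" >> "$conf"

# Confirm the line took:
grep '^options scsi_mod' "$conf"
```

The `grep -q ... ||` guard keeps the script from appending a duplicate line if you run it twice.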

Next we have to build a new initrd image. This is done with the “mkinitrd” command.

WARNING: MAKE DARN SURE you build this against the right kernel (the kernel you want to use). If you are going to replace your current initrd image with the new one, you should make a back-up copy first. The -f option forces mkinitrd to overwrite the existing initrd image file.

cp /boot/initrd-2.4.21-47.ELsmp.img /boot/initrd-2.4.21-47.ELsmp.img.bak
mkinitrd -f -v /boot/initrd-2.4.21-47.ELsmp.img 2.4.21-47.ELsmp

Once this is done, you can reboot your machine and check “/proc/scsi/scsi” to confirm that it sees the second LUN. You should see something like this:

Host: scsi2 Channel: 00 Id: 00 Lun: 00
Vendor: Adaptec Model: DuraStor 6200S Rev: V100
Type: Processor ANSI SCSI revision: 03

Host: scsi2 Channel: 00 Id: 00 Lun: 01
Vendor: Adaptec Model: DuraStor 6200S Rev: V100
Type: Direct-Access ANSI SCSI revision: 03

Hat Tip: Alan Baker for help figuring this out.
UPDATE: RHEL 4 does not have this problem.

Why Modern RAID 5 is Ideal for Oracle Databases

There is a school of thought among Oracle DBAs that databases should never be installed on disks that are configured into a RAID 5 array. The argument goes that since Oracle accesses and writes to random points within relatively large files, the overhead of constantly calculating block-level parity on these files is substantial, resulting in serious performance degradation. They suggest that RAID 1 (mirroring) is the ideal disk configuration, since no parity needs to be calculated, and Oracle is more than happy to divide up its database over many smaller mount points.

This way of thinking has largely been correct over the years because most systems have traditionally used software RAID. This means that the CPU of the server itself had the job of doing all those parity calculations, and it really did slow down both the server and the disk when RAID 5 configurations were used. Oracle, in particular, had a hard time with these configurations for the exact reasons the DBAs point to.

In many cases, software RAID is still used, and to be sure, it is wholly inappropriate to deploy RAID 5 in these environments. However, it is increasingly common to find IT departments using a SAN-type architecture where the RAID type and configuration are invisible to the host operating system. In these environments, the disk array has a dedicated controller that is singly tasked with handling all read, write, and parity operations. The RAID controller is no longer software running on a generic CPU, but rather firmware that is optimized to handle parity calculations. This results in a system where parity is calculated so quickly by the dedicated controller that differences in speed between RAID 1 and RAID 5 should be virtually nonexistent.
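The parity operation itself is nothing exotic: RAID 5 block parity is just a bytewise XOR across the data disks, which is why dedicated controller hardware can do it so cheaply. A toy sketch, with made-up byte values standing in for blocks:

```shell
#!/bin/sh
# Toy RAID 5 parity: the parity block is the XOR of the data blocks, and any
# single lost block can be rebuilt by XOR-ing the survivors with the parity.
# The byte values below are made up purely for illustration.
d0=0x5A; d1=0x3C; d2=0xF0
p=$(( d0 ^ d1 ^ d2 ))

# Pretend the disk holding d1 failed; rebuild it from the rest plus parity.
rebuilt=$(( d0 ^ d2 ^ p ))
printf 'parity=0x%02X rebuilt_d1=0x%02X\n' "$p" "$rebuilt"
# prints: parity=0x96 rebuilt_d1=0x3C
```

The rebuilt value matches the original d1, which is the whole trick behind surviving a single drive failure.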

To prove this, I carved up our new InfoTrend EonStor A12F-G2221 into three arrays – a RAID 5, a RAID 1, and a RAID 10. I then set out to run some benchmarks on these different arrays to see what, if any, the differences would be.

The test arrays were configured as follows:

  • The RAID 5 LUN consisted of 4 drives
  • The RAID 1 LUN consisted of 2 drives
  • The RAID 10 LUN consisted of 4 drives

I then identified the iozone tests that most accurately simulated Oracle disk activity. What I really wanted to do was to simulate select and update queries on various sized files and see how the different RAID types held up under the load. To do this, I ran iozone, a well-respected benchmark utility, with the following arguments:

/opt/iozone/bin/iozone -Ra -g 2G -b /home/sysop/new/raid5-2G-1.wks

This put the disk through its paces, as it ran the iozone tests in automatic mode on file sizes up to 2 GB, but in the end I was interested in analyzing the following tests, because they were the ones our DBA team suggested would most closely represent database activity.

Random Read (select queries)

This test measures the performance of reading a file with accesses being made to random locations within the file. The performance of a system under this type of activity can be impacted by several factors such as: Size of operating system’s cache, number of disks, seek latencies, and others.

Random Write (update queries)

This test measures the performance of writing a file with accesses being made to random locations within the file. Again the performance of a system under this type of activity can be impacted by several factors such as: Size of operating system’s cache, number of disks, seek latencies, and others.

Strided Read (more complex select queries)

This test measures the performance of reading a file with a strided access behavior. An example would be: read at offset zero for a length of 4 Kbytes, then seek 200 Kbytes, then read for a length of 4 Kbytes, then seek 200 Kbytes, and so on. Here the pattern is to read 4 Kbytes and then seek 200 Kbytes, repeating until the end of the file.

I ran several instances of the same tests using the same command line to ensure that there were no anomalies, and the machine was doing nothing else during the tests besides running the host OS. The results were pretty much as I expected, and I found little to no variation between the RAID types on this disk subsystem.

Random Read Tests:

In this test, there seems to be the slightest advantage to the mirror-type RAID arrays when it comes to very small files. This, I suspect, can be attributed to actual drive head latency, since in RAID 5 volumes the correct block needs to be found on a larger number of disks. This advantage quickly falls off, however, as the file size grows, meaning it would not be seen in an Oracle database.

Random Write Tests:

In this test, both RAID 5 and RAID 10 seem to hold a slight advantage over the direct mirror. This, I would imagine, can be attributed to the writes being spread over a larger number of spindles, and it indicates that the controller is calculating parity faster than the 2 Gb connection to the disk subsystem can deliver data. Again, the variation is incredibly small, so there is no arguable performance advantage to using one type of RAID over another when using a hardware controller.

Stride Read Tests:

Here again we see no real advantage to one RAID type over any other. It could be said that the RAID 10 volume held up ever-so-slightly better on this test, but any edge is so slight that it would be hard to imagine how this could translate into a noticeable performance gain in an Oracle database.

In the end, these tests proved my suspicion that hardware RAID controllers have become so efficient and fast that it no longer makes any real difference what type of RAID you decide to use for your Oracle database. Largely gone are the days when your disk space and RAID volumes were inextricably tied to the server itself. So long as you are using hardware RAID, and the LUNs are abstracted from your operating system, you can largely feel free to make the most of your storage dollar by using RAID 5 in your production database environments.

RHEL 3 Direct Connect to EonStor A12F-G2221

This summer we have been migrating a bunch of data to our shiny new InfoTrend EonStor A12F-G2221. With 1 GB of battery-backed cache, it’s a screaming box of disk, and it looks cool to boot. There is a gotcha, though, if you want to direct connect it to a QLogic QLA2340 card in a RHEL 3 server. Here is what you have to do.

First, get the new driver from QLogic, or install the one that came on CD with the HBA. The one Red Hat packages is always old and useless, and the one QLogic provides is better anyway because the installer rebuilds the initrd image for you. Once you have the package, just “cd” into the “qlafc-linux-X.XX.XX-X-install” directory and run “qlinstall”. This will install everything for you, so let it do its thing, and reboot the system when it’s done.

Now, go into the management console for your EonStor A12F-G2221. For the most part, the system defaults should work, but InfoTrend sets the default Fibre Connection to “Loop Only”. This is fine if you are dealing with a SAN, but since we are trying to do a direct connect, we have to change it to either “Auto” or “Direct Connect”. I suggest “Auto”, since that way you can have the other port connected to a loop if you want.


That should be all you have to do. You will have to reboot the controller for the change to take effect, so make sure you do this during a scheduled downtime if you have the disk in production.

Rebuilding the Solaris Device Tree

If you ever shift around any bootable drives within a Sun Solaris box, you may find that either the device names (cxtxd0sx) do not follow the disk position within the server, or the system just fails to boot because it can’t mount the other disk slices.

Let’s assume you are booting off of target 8 (c1t8d0s0), but wish to move that disk to the appropriate slot to make it target 0 (c1t0d0s0). You have changed all references in the /etc/vfstab file to reflect the new disk position, physically moved the drive from the target 8 slot to the target 0 slot, and changed the boot-device variable within the OBP to the appropriate disk. You should now be all set to boot from the disk in target 0, right?

Not quite yet.

Solaris creates a device tree with links to all the disks it knows about, and these don’t get rebuilt upon reboot. If you simply tried to boot the disk now in target 0, it would find the kernel, but fail to mount any of the other filesystems, because these device links are still pointing to the disk slices on target 8.
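The problem is easy to demonstrate in miniature: the /dev entries are just symlinks, and a symlink keeps naming the old path after the thing it points to moves. A toy sketch (the paths below are made up for the demo, not real Solaris device paths):

```shell
#!/bin/sh
# Toy illustration: a symlink goes stale when its target moves, which is
# exactly what happens to the /dev/dsk links after a disk changes slots.
# All paths here are invented for the demo.
rm -rf /tmp/devtree
mkdir -p /tmp/devtree/devices
echo 'disk contents' > /tmp/devtree/devices/target8
ln -s devices/target8 /tmp/devtree/dev_link

# "Move the disk" to a new slot:
mv /tmp/devtree/devices/target8 /tmp/devtree/devices/target0

# The link still names target8 and now dangles:
cat /tmp/devtree/dev_link 2>/dev/null || echo 'dangling link'
# prints: dangling link
```

On a real system those stale links live under /dev/dsk and /dev/rdsk, which is why the kernel boots but the remaining filesystems fail to mount.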

In order to boot off the drive in the new position, you will have to remove these device links and rebuild them. Here is how we do that:

1. Insert a Solaris 8, 9, or 10 CD into the host’s CD-ROM drive

2. From the ok prompt, enter boot cdrom -s

ok> boot cdrom -s

3. fsck the boot disk

# fsck -y /dev/rdsk/c1t0d0s0

Remember that your boot disk may differ from the example above. Since we have put the disk into the slot for target 0 (c1t0d0), that is what we are using here.

4. Mount the root slice on /mnt

# mount /dev/dsk/c1t0d0s0 /mnt

Note that your root slice may differ from the above example.

5. Move path_to_inst

# mv /mnt/etc/path_to_inst /mnt/etc/PATH_TO_INST_ORIG

6. Remove all old device links

# rm /mnt/dev/rdsk/c* ; rm /mnt/dev/dsk/c* ; rm /mnt/dev/rmt/*

7. Rebuild path_to_inst and devices

# devfsadm -r /mnt -p /mnt/etc/path_to_inst

8. Unmount the root slice and reboot

# umount /mnt ; init 6

You should now be able to boot off your old drive in its new slot.

What The Heck is RAID 10?

Earlier this month, a company came along and asked for a RAID 10 array. Understanding that RAID 10 is a cooler-sounding way of saying RAID 1+0, I understood it as a mirror set striped with another mirror set. Simple enough… just stripe a couple of mirrors together, and you’ve got RAID 10.

Indeed, RAID 10 is simply one or more RAID 1 arrays (mirrored sets) striped together (RAID 0).

RAID 1 creates an exact copy (or mirror) of all data on two or more disks, while RAID 0 splits data evenly across two or more disks with no parity information for redundancy. By combining the two into a RAID 10 array, you are able to take advantage of the faster write speed offered by RAID 0, while protecting your data against drive failures with mirroring.

This method of RAID is pretty costly, but useful if you find yourself in a situation where you need a lot of throughput combined with a lot of data protection.
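The cost difference is easy to quantify: with n identical disks, RAID 10 leaves you half the raw capacity, while RAID 5 leaves you all but one disk’s worth. A quick sketch, assuming a hypothetical shelf of four 73 GB drives:

```shell
#!/bin/sh
# Usable capacity with n identical disks of size s (in GB).
# RAID 10 mirrors everything, so half the raw space survives; RAID 5 spends
# one disk's worth of space on parity. The drive size here is hypothetical.
n=4; s=73
raid10=$(( n / 2 * s ))
raid5=$(( (n - 1) * s ))
echo "RAID 10 usable: ${raid10} GB"
echo "RAID 5 usable:  ${raid5} GB"
# prints: RAID 10 usable: 146 GB
#         RAID 5 usable:  219 GB
```

Same four drives, a third less usable space, which is exactly the trade-off you are paying for.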

Solaris X86 Compatible RAID Controller

Every time I have to spec a solution using Solaris, I have to answer a bunch of questions in meetings about why Sun is so costly compared to Dell servers. Usually the reason for the higher price is not the servers (especially with x86 Sun), but rather the storage. Since Sun does not offer a system with a RAID card, you always have to purchase a high-end disk enclosure capable of performing the RAID functions, unless you want the performance degradation that comes with software RAID.

The good news is that there is finally a really nice PCI RAID card that works with Solaris! The bad news is that it only works with x86 Solaris, and Sun only goes so far as to say that it is “reported to work”.

Anyhow, no matter. Here is the deal:

According to Sun Big Admin, the Mylex AcceleRAID 150 is reported to work with Solaris 9 04/04 through Solaris 10 03/05 (read: Solaris 9 and 10 x86). The firmware and BIOS on the card need to be: BIOS version 4.10-50; firmware 4.08-37.

Pity that there still does not seem to be a RAID controller that works with SPARC hardware. If someone would come up with that, it would make my life as a Solaris administrator a whole lot easier.