Creating Linux Partitions for CLARiiON
Creating a properly offset slab of disk for Linux systems on your CLARiiON is not just a matter of creating a partition using the default fdisk values. The reason is that disk management utilities for Intel-based systems generally write 63 sectors of metadata at the very beginning of the LUN. The addressable space begins immediately after these initial sectors, so the data does not line up with the stripe element size (usually 64k) and the CLARiiON ends up crossing disks, especially when writing larger IO.
To get around this, you have to align the partition so that it starts writing data on a sector that lines up with the stripe element size. In this case, that is sector 128 (128 sectors of 512 bytes is 64k). Below is an example of how I create partitions on our CLARiiON for Linux systems. Check out the EMC Best Practices for Fibre Channel storage white paper for more detail.
/sbin/fdisk /dev/emcpowera
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.
The number of cylinders for this disk is set to 39162.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-39162, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-39162, default 39162):
Using default value 39162
Command (m for help): x
Expert command (m for help): b
Partition number (1-4): 1
New beginning of data (63-629137529, default 63): 128
Expert command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
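To see why 128 works as a starting sector where the default 63 does not, here is a minimal sketch of the arithmetic. It assumes 512-byte sectors and a 64k stripe element; the function name is_aligned is made up for illustration and is not part of any EMC tool.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when a partition's starting sector falls on a
# stripe element boundary. Assumes 512-byte sectors.
is_aligned() {
    start_sector=$1      # first data sector of the partition
    element_kb=$2        # stripe element size in KB (usually 64 here)
    sectors_per_element=$(( element_kb * 1024 / 512 ))
    [ $(( start_sector % sectors_per_element )) -eq 0 ]
}

for s in 63 128; do
    if is_aligned "$s" 64; then
        echo "sector $s: aligned"
    else
        echo "sector $s: not aligned"
    fi
done
```

With a 64k element there are 128 sectors per element, so any multiple of 128 would line up just as well.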
How to Make Gnarly Big Linux Filesystems
At least in RHEL 4, the fdisk command cannot create partitions larger than 2TB, so it cannot be used to set up a filesystem bigger than that. To get around it, you have to use the parted command. I found the basic info here, but this is the long and short of how to cut off a big ol’ slice of disk using parted:
Run parted
# /sbin/parted
It’s interactive, so the following commands are issued within the utility.
1) Make the disk label
(parted) mklabel gpt
2) Create the partition
(parted) mkpart primary 0 -1
3) Verify
(parted) print
Disk geometry for /dev/sda: 0.000-38146.972 megabytes
Disk label type: msdos
Minor Start End Type Filesystem Flags
1 0.031 101.975 primary ext3 boot
2 101.975 38146.530 primary lvm
4) Exit the GNU Parted command shell
(parted) quit
5) Finally, make the filesystem:
# mkfs.ext3 -m0 -F /dev/sdb1
6) You also don’t want to wait for that big filesystem to fsck from time to time, so make sure it does not get checked unless you run the command yourself:
# tune2fs -c0 -i0 /dev/sdb1
That should just about do it. Remember that only RHEL 4 and higher can support filesystems larger than 2TB. If I remember correctly, RHEL 3 can go up to 2TB, RHEL 4 can handle 8TB, and RHEL 5 can make a whopping 16TB chunk of disk. Have fun!
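The 2TB ceiling comes from the msdos disk label, which stores sector counts in 32 bits. The sketch below works that limit out and reports which label a disk of a given size would need. It assumes 512-byte sectors, and label_for_size is a hypothetical name used only for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: reports which disk label a device of a given size needs.
# Assumes 512-byte sectors; an msdos label stores sector counts in 32 bits.
label_for_size() {
    size_gb=$1                                      # device size in GB (2^30 bytes)
    limit_gb=$(( 4294967296 * 512 / 1073741824 ))   # 2^32 sectors = 2048 GB
    if [ "$size_gb" -gt "$limit_gb" ]; then
        echo gpt
    else
        echo msdos
    fi
}

label_for_size 300     # prints: msdos
label_for_size 8192    # prints: gpt
```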
Solaris 8 SAN Frustrations
Getting Solaris 8 to light up a Qlogic QLA2310 Fibre Channel card using the SUNWqlc and SUNWqlcx drivers can be frustrating enough, but the headaches are only beginning if you want to connect it to a SAN and you don’t have all the right packages installed.
Last week, I installed the QLA2310 in a Sun Fire V210 running Solaris 8. I installed the latest versions of SUNWqlc, SUNWqlcx and SUNWsan. After doing a reboot -- -r, the system came up and attached the driver to the card. I zoned it in the fabric and logged into Navisphere, where the WWN showed up, but neither PowerPath nor the Navisphere host agent could communicate with the CLARiiON. I also could not see any of the LUNs I had presented.
I thought it was strange that the CLARiiON could see the host, but the host could not see the CLARiiON.
I ran:
luxadm -e port
Which returned:
Found path to 1 HBA ports
/devices/pci@1d,700000/SUNW,qlc@1/fp@0,0:devctl CONNECTED
Clearly, it could see the HBA.
I ran:
ls -l /dev/cfg
total 8
lrwxrwxrwx 1 root root 38 Nov 30 14:31 c0 -> ../../devices/pci@1e,600000/ide@d:scsi
lrwxrwxrwx 1 root root 39 Nov 30 14:31 c1 -> ../../devices/pci@1c,600000/scsi@2:scsi
lrwxrwxrwx 1 root root 41 Nov 30 14:31 c2 -> ../../devices/pci@1c,600000/scsi@2,1:scsi
lrwxrwxrwx 1 root root 48 Dec 4 13:49 c3 -> ../../devices/pci@1d,700000/SUNW,qlc@1/fp@0,0:fc
The card was c3. This becomes useful later when we have to configure it.
I ran:
cfgadm -al -o show_FCP_dev
Which returned:
cfgadm: Configuration administration not supported
There it was… I didn’t have the complete SAN package installed. I hadn’t done this in a few years, so I had forgotten all the packages I had to add to get the Sun SAN package working correctly… There are many.
Happily, Sun has now packaged them in a nice “SAN_4.4.12_install_it.tar.Z”, which you can get from their website if you have a username. It installs everything for you in the right order.
The only thing left to do was another reboot -- -r and run cfgadm -c configure c3 to config the device. After this everything started working nicely.
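If you have several controllers, picking the Fibre Channel one out of the /dev/cfg listing by eye gets tedious. Here is a small sketch that does it with awk. The function name fc_controllers is made up, and the sample input is the listing from earlier in this post with the wrapped lines joined.

```shell
#!/bin/sh
# Hypothetical helper: reads "ls -l /dev/cfg" output on stdin and prints the
# controller names whose device path ends in ":fc" (Fibre Channel ports).
fc_controllers() {
    awk '$NF ~ /:fc$/ { print $(NF - 2) }'
}

# Sample input taken from the listing in this post. On a live host you would
# run:  ls -l /dev/cfg | fc_controllers
fc_controllers <<'EOF'
total 8
lrwxrwxrwx 1 root root 39 Nov 30 14:31 c1 -> ../../devices/pci@1c,600000/scsi@2:scsi
lrwxrwxrwx 1 root root 48 Dec 4 13:49 c3 -> ../../devices/pci@1d,700000/SUNW,qlc@1/fp@0,0:fc
EOF
```

The name it prints is the one to hand to cfgadm -c configure.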
Little Known CLARiiON Facts and Trivia
I’ve just returned from EMC training in MA, where we learned a wealth of information about how to use the array, but also some interesting background information about the device itself.
First, the name CLARiiON has some interesting history. The array originated at Data General, which EMC later acquired, and DG had a 16-bit minicomputer called the NOVA. DG later came up with a new product, which the engineers had named the NOVAII. The marketing group, not wanting to recycle the “NOVA” name, insisted that a new name be chosen. The engineers, always wanting to get their way in the end, came up with the anagram “AViiON” by reversing the letters and cleverly placing the two “i”s in the middle. The CLARiiON is simply a derivative of this naming convention.
Secondly, most people know that the operating system of the CLARiiON is called “FLARE”, but it is not commonly known that this is actually an acronym that stands for Fibre Logic Array Runtime Environment.
It is also fairly common knowledge that one can access “Engineering Mode” on a CLARiiON by pressing Ctrl, Shift, and F12 and entering the password “messner”. The story behind this password, however, is that the engineering group at the time were avid mountain climbers and chose the password in honor of Reinhold Messner, the first person to climb Everest without supplemental oxygen. Apparently the password before that was “pink floyd”, but the marketing group didn’t approve and made them change it.
Problems Registering Solaris Hosts With QLA 2310 HBAs in Navisphere
Sun Microsystems likes the QLA 2310 Fibre Channel HBA. It’s only a 2Gig card, but it works with the Sun native driver, which makes it wonderful for us Solaris Administrators. Unfortunately, it does not integrate perfectly with EMC CLARiiON SANs because it does not register properly with Navisphere. Even if you manually register the host, the LUNs will not be presented to the host because the agent can’t pass commands to the array.
To remedy this situation on my Solaris 8 host, I used the following procedure:
Edit the /etc/system file and add the following line:
set fcp:ssfcp_enable_auto_configuration=1
Next, I rebooted my Solaris host with the “-r” flag:
reboot -- -r
Next, I checked Navisphere to make sure my paths had logged in. They had, so I logged into the Solaris host and ran the following commands:
cfgadm
devfsadm
format
I then saw the storage that was presented to my host. Finally, I restarted the Navisphere agent and started using my new LUNs.
How to Disable Automatic FSCK on EXT3 Filesystems
e2fsck will regularly force a check of a filesystem even if the filesystem is marked clean. By default, this happens every twenty mounts or every 180 days, whichever comes first.
The ext3 filesystem does this as well, which can be annoying if you have a very large filesystem and a short downtime window. Therefore, it’s a good idea to disable this feature on large volumes. Keep in mind that you should still run fsck occasionally. By disabling the automatic checks, you get to decide when, not the system.
Use the command:
tune2fs -c 0 -i 0 /dev/hdxx
This disables periodic, automatic checking: -c 0 turns off the mount-count trigger and -i 0 turns off the time-based one.
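To confirm what a filesystem is currently set to, you can pull the two relevant fields out of the tune2fs -l listing. This is only a sketch: fsck_schedule is a made-up function name, and the sample lines fed to it are illustrative rather than captured from a real volume.

```shell
#!/bin/sh
# Hypothetical helper: reads "tune2fs -l" output on stdin and prints only the
# two settings that schedule automatic checks.
fsck_schedule() {
    grep -E '^(Maximum mount count|Check interval):'
}

# On a live system (as root) you would run:
#   tune2fs -l /dev/hdxx | fsck_schedule
# The sample lines below are illustrative, not captured from a real volume.
fsck_schedule <<'EOF'
Filesystem state:         clean
Maximum mount count:      -1
Check interval:           0 (<none>)
EOF
```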
Setting Up The Automounter Service on RHEL
Mounting filesystems in RHEL is pretty straightforward and easy. Occasionally, however, you will not want the filesystem to remain mounted all the time, but rather to automatically mount for a set period of time only when it is needed. Because of networking overhead, and the general unreliability of networks, NFS mounts are a good example of when this can be especially useful.
In order to manage the automatic mounting and unmounting of filesystems on RHEL, we use the Automounter service. Here is how.
First, the main configuration file is “/etc/auto.master”. It should look something like this:
#
# $Id: auto.master,v 1.3 2003/09/29 08:22:35 raven Exp $
#
# Sample auto.master file
# This is an automounter map and it has the following format
# key [ -mount-options-separated-by-comma ] location
# For details of the format look at autofs(5).
#/misc /etc/auto.misc --timeout=60
#/misc /etc/auto.misc
#/net /etc/auto.net
Let’s assume that we want to set up an NFS mount on “/misc/backups”. We would first create an entry in this file that looks something like this:
/misc /etc/auto.misc --timeout=120
This tells the autofs service that we want to use it to manage mounts from within “/misc”, that the configuration file is “/etc/auto.misc”, and that it should disconnect after 2 minutes of inactivity.
Now, let’s edit the “/etc/auto.misc” file. The file has three columns: the mount point from within the /misc directory, the options for mounting the filesystem, and the filesystem to be mounted. It also includes the remote server’s name since we are using NFS. It should look something like this when you are done:
#
# $Id: auto.misc,v 1.2 2003/09/29 08:22:35 raven Exp $
#
# This is an automounter map and it has the following format
# key [ -mount-options-separated-by-comma ] location
# Details may be found in the autofs(5) manpage
cd -fstype=iso9660,ro,nosuid,nodev :/dev/cdrom
backups -rw,soft,intr remoteservername:/path/to/nfs/export
# the following entries are samples to pique your imagination
#linux -ro,soft,intr ftp.example.org:/pub/linux
#boot -fstype=ext2 :/dev/hda1
#floppy -fstype=auto :/dev/fd0
#floppy -fstype=ext2 :/dev/fd0
#e2floppy -fstype=ext2 :/dev/fd0
#jaz -fstype=ext2 :/dev/sdc1
#removable -fstype=ext2 :/dev/hdd
Next, we create the directory for the mount point in /misc:
# mkdir /misc/backups
And finally we restart the autofs service:
# service autofs restart
That should pretty much do it. If you don’t have autofs configured to start up, you can use chkconfig to enable it. “/misc/backups” will now be mounted whenever a user or process attempts to access data on it, and it will be automatically disconnected after 120 seconds of inactivity. Last, but not least, you can always confirm that it is running with the “service” command:
# service autofs status
As always, change the details to match your own requirements.
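Once a map grows past a few entries, it helps to list which keys are actually active and which are still commented out. Here is a minimal sketch that does that; map_keys is a made-up name, and the sample input is a trimmed copy of the auto.misc file shown above.

```shell
#!/bin/sh
# Hypothetical helper: reads an automounter map on stdin and prints the key of
# every active (uncommented, non-blank) entry.
map_keys() {
    awk '!/^[ \t]*#/ && NF >= 2 { print $1 }'
}

# On a live system: map_keys < /etc/auto.misc
map_keys <<'EOF'
# key [ -mount-options-separated-by-comma ] location
cd -fstype=iso9660,ro,nosuid,nodev :/dev/cdrom
backups -rw,soft,intr remoteservername:/path/to/nfs/export
#floppy -fstype=auto :/dev/fd0
EOF
```

Each key it prints is a directory name that autofs will mount on demand under the mount point named in auto.master.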
Working With Disk Labels in RHEL
When you install RHEL, the filesystems are labeled for you. Usually you won’t have to mess with it anymore, but on occasion, you may want to change them to more accurately represent the data that is stored on that partition. If, for instance, you used to have all of your database files on a partition labeled “/database”, but you have now moved them somewhere else, and you now wish to house your user account data there, it would make sense to change the label to something like “/users”.
Labels are, of course, arbitrary, so there is no technical need to do this, and you could, instead simply change the mount point in the fstab file, mounting the partition by device name rather than label, but it is usually cleaner to change the label. Here is how you do it:
First, let’s figure out what the partition is currently labeled as:
[root@calvin /]# /sbin/e2label /dev/hda4
/database
[root@calvin /]#
Its current label is “/database”. Since we have moved the database data somewhere else and now want to store our user account data here, we need to change it to “/users”.
[root@calvin /]# /sbin/e2label /dev/hda4 /users
[root@calvin /]#
That’s all there is to it. Now we check to make sure we have done what we think we have done.
[root@calvin /]# /sbin/e2label /dev/hda4
/users
[root@calvin /]#
Sure enough, it’s now labeled “/users” and the data on the disk remains intact. Now all we have to do is change the appropriate entry in the “/etc/fstab” file to represent the change.
Change this:
LABEL=/database /databases ext3 defaults 1 2
To this:
LABEL=/users /users ext3 defaults 1 2
Make sure you have unmounted “/databases” before making the change.
Now, just run:
[root@calvin /]# mount /users
[root@calvin /]#
And you’re all set to go. As always, change the values here to represent those in your environment.
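If you would rather not hand-edit the fstab entry, the same change can be previewed with sed first. This is a sketch under assumptions: relabel_entry is a made-up function, it only prints the rewritten text, and it assumes the label contains no characters that are special to sed.

```shell
#!/bin/sh
# Hypothetical helper: reads fstab text on stdin and rewrites the entry mounted
# by OLD label so that it uses NEW label and NEW mount point. It only prints
# the result; it never touches /etc/fstab itself.
relabel_entry() {
    old_label=$1
    new_label=$2
    new_mount=$3
    # "|" is the sed delimiter because labels and mount points contain "/".
    sed "s|^LABEL=${old_label}[[:space:]][[:space:]]*[^[:space:]][^[:space:]]*|LABEL=${new_label} ${new_mount}|"
}

echo 'LABEL=/database /databases ext3 defaults 1 2' |
    relabel_entry /database /users /users
```

Review the output before copying it into “/etc/fstab”, and keep a backup of the original file.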
Using Sort to List Directories by Size
If you manage a UNIX system with a large number of directories that vary in size, chances are that you’ve needed to figure out which ones are using up the most disk space. Of course if the directories are user accounts, the best way to do this is to enable quotas and use the “repquota” command. If you just have a bunch of directories, however, you can easily figure out which ones are largest by giving the correct arguments to “du” and “sort”. Here is how:
du -sk * | sort -nr
This will display the size of all directories and sort them from largest to smallest. If you want to sort them from smallest to largest, simply remove the “r”.
du -sk * | sort -n
If you have nested directories, you will need to incorporate foreach to recurse through and get all the directory names.
Taking Disk Cylinders From Swap on Solaris 8
Kids… DO NOT TRY THIS AT HOME! If this is not done exactly right, you will render your system unbootable and corrupt your data. That being said, under some circumstances you can take some space from your swap partition and add it to an unused one without initializing your entire disk. This is particularly useful if you decide you want to use DiskSuite to mirror your system disk, but have not allocated the 100MB partition that is needed to hold the state databases. As always, BACK EVERYTHING UP FIRST. Better yet, make two backups and store them on two different systems. This is a risky procedure, and you don’t want to lose any data!
You can also use my instructions for copying a Solaris boot drive to a disk with a different partition layout as a safer alternative.
The first thing you need to do is figure out if your disk layout will allow for this procedure. Usually the swap partition is the second one on the disk, making it partition number 1 (Partition number 0 is root). If partition number 1 is swap on your system, and partition number 3 or 4 are unused, you are in good shape, and this should work. To figure this out, you should do something like this:
# format
Select the boot disk - usually disk 0
Specify disk (enter its number): 0
format> partition
format> print
This will show you the current disk layout.
Current partition table (original):
Total disk cylinders available: 24620 + 2 (reserved cylinders)
Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 725 1.00GB (726/0/0) 2097414
1 swap wu 726 - 9436 11.90GB (8635/0/0) 24946515
2 backup wm 0 - 24619 33.92GB (24620/0/0) 71127180
3 unassigned wm 0 0 (0/0/0) 0
4 unassigned wm 0 0 (0/0/0) 0
5 usr wm 9437 - 10888 2.00GB (1452/0/0) 4194828
6 var wm 10889 - 18148 10.00GB (7260/0/0) 20974140
7 unassigned wm 18149 - 24619 8.91GB (6471/0/0) 18694719
Here we see that partitions 3 and 4 are unused and directly after partition 1, so we can take some space from swap and assign it to one of these. Partition 2 is, of course the entire disk. I have not tried it, so I don’t know if you could assign non-sequential cylinders to a partition that is not directly after swap.
So to take some space from partition 1 and add it to partition 3, the first thing we have to do is disable swap, so the format utility will let us change it.
Comment out the following lines in your /etc/vfstab file and reboot the system.
#/dev/dsk/c1t0d0s1 - - swap - no -
#swap - /tmp tmpfs - yes -
This will bring the system up without swap enabled. You can now edit the disk label. Remember that our cylinders need to be sequential, so always work in cylinders when using the format utility.
Re-enter the format utility, select your system disk and view the partition table:
# format
Select the boot disk - usually disk 0
Specify disk (enter its number): 0
format> partition
format> print
Again we see that partitions 3 and 4 are unused.
Current partition table (original):
Total disk cylinders available: 24620 + 2 (reserved cylinders)
Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 725 1.00GB (726/0/0) 2097414
1 swap wu 726 - 9436 11.90GB (8635/0/0) 24946515
2 backup wm 0 - 24619 33.92GB (24620/0/0) 71127180
3 unassigned wm 0 0 (0/0/0) 0
4 unassigned wm 0 0 (0/0/0) 0
5 usr wm 9437 - 10888 2.00GB (1452/0/0) 4194828
6 var wm 10889 - 18148 10.00GB (7260/0/0) 20974140
7 unassigned wm 18149 - 24619 8.91GB (6471/0/0) 18694719
The first thing we need to do is take some cylinders away from partition 1. In this example, we are looking to make partition 3 roughly 100MB, so we need to take about 75 cylinders from partition 1 and add them to partition 3. Partition 1 ends at cylinder 9436, so we subtract 75 from that number. 9436 - 75 = 9361, which becomes the first cylinder of partition 3, so partition 1 now has to end one cylinder earlier, at 9360. We then subtract the beginning cylinder (726) from 9361 to get the new total number of cylinders for partition 1. 9361 - 726 = 8635, so this is the number we enter when format asks for the size of the partition. Like so:
partition> 1
Part Tag Flag Cylinders Size Blocks
1 swap wu 726 - 9360 11.90GB (8635/0/0) 24946515
Enter partition id tag[swap]:
Enter partition permission flags[wu]:
Enter new starting cyl[726]:
Enter partition size[24946615b, 9436c, 12880.92mb, 12.00gb]: 8635c
partition>
Now we have to add the freed cylinders to partition 3. Cylinders 9361 through 9436 inclusive make 76 cylinders, so that is the size we enter.
partition> 3
Part Tag Flag Cylinders Size Blocks
3 unassigned wm 0 0 (0/0/0) 0
Enter partition id tag[unassigned]:
Enter partition permission flags[wm]:
Enter new starting cyl[0]:9361
Enter partition size[0b, 0c, 0.00mb, 0.00gb]:76c
partition>
Print out the new partition table to make sure everything lines up correctly:
partition> print
Current partition table (original):
Total disk cylinders available: 24620 + 2 (reserved cylinders)
Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 725 1.00GB (726/0/0) 2097414
1 swap wu 726 - 9360 11.90GB (8635/0/0) 24946515
2 backup wm 0 - 24619 33.92GB (24620/0/0) 71127180
3 unassigned wm 9361 - 9436 107.21MB (76/0/0) 219564
4 unassigned wm 0 0 (0/0/0) 0
5 usr wm 9437 - 10888 2.00GB (1452/0/0) 4194828
6 var wm 10889 - 18148 10.00GB (7260/0/0) 20974140
7 unassigned wm 18149 - 24619 8.91GB (6471/0/0) 18694719
Partition 1 ends at cylinder 9360, and partition 3 picks right up at cylinder 9361. Partition 3 ends at cylinder 9436, and partition 5 begins at cylinder 9437. Partition 4, of course, remains unused. Since none of the cylinders overlap, we can go ahead and write the disk label out. DO NOT DO THIS if you have any doubt at all about what you have just done. By writing out the disk label, you could corrupt the data on your formatted filesystems if any cylinders overlap into them. The format utility is usually pretty smart about keeping you from making mistakes, but be very careful anyway! You don’t want to end up with scrambled eggs on a disk that has valuable data on it.
partition> label
This writes out the disk label, so you can now exit the format utility and re-enable swap in your /etc/vfstab file. Simply uncomment the following two lines and reboot the system.
/dev/dsk/c1t0d0s1 - - swap - no -
swap - /tmp tmpfs - yes -
Reboot your system, and if all goes well, it will come up, and you will see that partition 3 will have a little over 100MB on it. Usually people want to do this so they can store the DiskSuite meta database on the newly created partition. If this is the case for you, you can now move on to mirroring the system disk.
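Since an off-by-one here can overlap two partitions, it is worth double-checking the cylinder arithmetic before typing anything into format. Here is a minimal sketch that does the sums. The function name carve_tail is made up, and the numbers are the ones from the final partition table above, where partition 3 spans cylinders 9361 to 9436 (76 cylinders, counting both ends).

```shell
#!/bin/sh
# Hypothetical helper: given a partition's first and last cylinder and the
# number of cylinders to give away from its end, print the values to type
# into format. Cylinder ranges are inclusive.
carve_tail() {
    first=$1
    last=$2
    take=$3
    new_last=$(( last - take ))
    echo "shrunk partition: start ${first}, size $(( new_last - first + 1 ))c"
    echo "new partition: start $(( new_last + 1 )), size ${take}c"
}

# Swap originally ran from cylinder 726 to 9436.
carve_tail 726 9436 76
```

For these inputs it reports a swap size of 8635c and a new partition starting at cylinder 9361, which matches the table.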
Making RHEL 3 See Multiple LUNS
For some reason RHEL 3 comes out of the box configured to see only the first Lun on a SCSI channel. This is usually not a problem, as the first Lun is all you care about, but in some instances, you will need to configure the SCSI module to see multiple Luns.
In this case we are using an Adaptec DuraStor 6200S, which is set up to present the RAID controller as Lun 00, and the actual RAID array as Lun 01. Without any modifications to the system, we plug it in, and after a reboot check /proc/scsi/scsi. We can see the RAID controller, but since we can only see the first Lun on the channel, we never get to the array:
Host: scsi2 Channel: 00 Id: 00 Lun: 00
Vendor: Adaptec Model: DuraStor 6200S Rev: V100
Type: Processor ANSI SCSI revision: 03
The actual array would show up as “Channel: 00 Id: 00 Lun: 01”, but it’s not there. To resolve this, we have to first edit “/etc/modules.conf” and add the following line:
options scsi_mod max_scsi_luns=128 scsi_allow_ghost_devices=1
In our case, modules.conf looks like this after the modification:
alias eth0 e1000
alias eth1 e1000
alias scsi_hostadapter megaraid2
alias usb-controller usb-uhci
alias usb-controller1 ehci-hcd
alias scsi_hostadapter1 aic7xxx
options scsi_mod max_scsi_luns=128 scsi_allow_ghost_devices=1
Next we have to build a new initrd image. This is done with the “mkinitrd” command.
WARNING: MAKE DARN SURE you build this against the right kernel (the kernel you want to use). If you are going to replace your current initrd image with the new one, you should make a back-up copy first. The -f option will force or overwrite the current initrd image file.
cp /boot/initrd-2.4.21-47.ELsmp.img /boot/initrd-2.4.21-47.ELsmp.img.bak
mkinitrd -f -v /boot/initrd-2.4.21-47.ELsmp.img 2.4.21-47.ELsmp
Once this is done, you can reboot your machine and check “/proc/scsi/scsi” to confirm that it sees the second Lun. You should see something like this:
Host: scsi2 Channel: 00 Id: 00 Lun: 00
Vendor: Adaptec Model: DuraStor 6200S Rev: V100
Type: Processor ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 00 Lun: 01
Vendor: Adaptec Model: DuraStor 6200S Rev: V100
Type: Direct-Access ANSI SCSI revision: 03
Hat Tip: Alan Baker for help figuring this out.
UPDATE: RHEL 4 does not have this problem.
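On a box with a lot of devices, the full /proc/scsi/scsi listing is long, so a missing Lun is easy to overlook. The sketch below boils it down to one line per device. list_luns is a made-up name, and the sample input is the output shown above.

```shell
#!/bin/sh
# Hypothetical helper: reads /proc/scsi/scsi-style text on stdin and prints one
# "host channel id lun" line per device, which makes a missing Lun easy to spot.
list_luns() {
    awk '$1 == "Host:" { print $2, $4, $6, $8 }'
}

# On a live system: list_luns < /proc/scsi/scsi
list_luns <<'EOF'
Host: scsi2 Channel: 00 Id: 00 Lun: 00
  Vendor: Adaptec Model: DuraStor 6200S Rev: V100
  Type: Processor ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 00 Lun: 01
  Vendor: Adaptec Model: DuraStor 6200S Rev: V100
  Type: Direct-Access ANSI SCSI revision: 03
EOF
```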
Why Modern RAID 5 is Ideal for Oracle Databases
There is a convention of thought amongst Oracle DBAs that databases should never be installed on disks that are configured into a RAID 5 array. The argument goes that since Oracle accesses and writes to random points within relatively large files, the overhead of constantly calculating block-level parity on these files is substantial, resulting in serious performance degradation. They suggest that RAID 1 (mirroring) is the ideal disk configuration since no parity needs to be calculated, and Oracle is more than happy to divide up its database over many smaller mount points.
This way of thinking has largely been correct over the years because most systems have traditionally used software RAID. This means that the CPU of the server itself had the job of doing all those parity calculations, and it really did slow down both the server and the disk when RAID 5 configurations were used. Oracle, in particular, had a hard time with these configurations for the exact reasons the DBAs point to.
In many cases, software RAID is still used, and to be sure, it is wholly inappropriate to deploy RAID 5 in these environments. However, it is increasingly common to find IT departments using a SAN-type architecture where the RAID type and configuration are invisible to the host operating system. In these environments, the disk array has a dedicated controller that is singly tasked with handling all read, write, and parity operations. The RAID controller is no longer software running on a generic CPU, but rather firmware that is optimized to handle parity calculations. This results in a system where parity is calculated so quickly by the dedicated controller that differences in speed between RAID 1 and RAID 5 should be virtually nonexistent.
To prove this, I carved up our new InfoTrend EonStor A12F-G2221 into three arrays - a RAID 5, a RAID 1, and a RAID 10. I then set out to run some benchmarks on these different arrays to see what, if any, the differences would be.
The hardware used was as follows:
- Dell OptiPlex GX260, 2.2 GHz Processor, 256 MB Ram
- RHEL 4 Linux
- QLogic QLA2340 HBA
- InfoTrend EonStor A12F-G2221 with 1GB cache
- The RAID 5 LUN consisted of 4 drives
- The RAID 1 LUN consisted of 2 drives
- The RAID 10 LUN consisted of 4 drives
I then identified the iozone tests that most accurately simulated Oracle disk activity. What I really wanted to do was to simulate select and update queries on various sized files and see how the different RAID types held up under the load. To do this, I ran iozone, a well-respected benchmark utility, with the following arguments:
/opt/iozone/bin/iozone -Ra -g 2G -b /home/sysop/new/raid5-2G-1.wks
This put the disk through its paces, as it ran the iozone tests in automatic mode on file sizes up to 2 GB, but in the end, I was interested in analyzing the following tests because they were the ones our DBA team suggested would most closely represent database activity.
Random Read (select queries)
This test measures the performance of reading a file with accesses being made to random locations within the file. The performance of a system under this type of activity can be impacted by several factors such as: Size of operating system’s cache, number of disks, seek latencies, and others.
Random Write (update queries)
This test measures the performance of writing a file with accesses being made to random locations within the file. Again the performance of a system under this type of activity can be impacted by several factors such as: Size of operating system’s cache, number of disks, seek latencies, and others.
Strided Read (more complex select queries)
This test measures the performance of reading a file with a strided access behavior. An example would be: Read at offset zero for a length of 4 Kbytes, then seek 200 Kbytes, and then read for a length of 4 Kbytes, then seek 200 Kbytes and so on. Here the pattern is to read 4 Kbytes and then seek 200 Kbytes, repeating through the file.
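To make the strided pattern concrete, here is a small sketch that prints the byte offsets where the first few reads would land. It is only an illustration and has nothing to do with how iozone is implemented: stride_offsets is a made-up name, and it assumes each seek is measured from the end of the previous read, which is how the description reads literally.

```shell
#!/bin/sh
# Hypothetical illustration: print the byte offsets touched by the first COUNT
# reads of a strided pattern. Assumes each seek is relative to the end of the
# previous read.
stride_offsets() {
    count=$1
    read_kb=$2     # size of each read in KB
    seek_kb=$3     # distance skipped after each read in KB
    i=0
    while [ "$i" -lt "$count" ]; do
        echo $(( i * (read_kb + seek_kb) * 1024 ))
        i=$(( i + 1 ))
    done
}

# Read 4 KB, skip 200 KB, three times over.
stride_offsets 3 4 200
```

Under that assumption the reads start 204 KB apart, so only a small fraction of the file is ever touched.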
I ran several instances of the same tests using the same command line to ensure that there were no anomalies, and the machine was doing nothing else during the tests besides running the host OS. The results were pretty much as I expected, and I found little to no variation between the raid types on this disk subsystem.
In the random read test, there seems to be the slightest advantage to the mirror-type RAID arrays when it comes to very small files. This, I suspect, can be attributed to actual drive head latency as, in RAID 5 volumes, the correct block needs to be found on a larger number of disks. This advantage quickly falls off, however, as the file size grows, meaning that this slight advantage would not be seen in an Oracle database.
In the random write test, both RAID 5 and RAID 10 seem to hold a slight advantage over the direct mirror. This, I would imagine, can be attributed to the fact that the writes are happening over a larger number of spindles. This indicates that the controller is calculating the parity faster than the 2Gb connection speed to the disk subsystem. Again, the variation is incredibly small, so there is no arguable performance advantage to using one type of RAID over another when using a hardware controller.
In the strided read test, we again see no real advantage to one RAID type over any other. It could be said that the RAID 10 volume held up ever-so-slightly better on this test, but any edge is so slight that it would be hard to imagine how this could translate into a noticeable performance gain in an Oracle database.
In the end, these tests supported my suspicion that hardware RAID controllers have become so efficient and fast that it no longer makes any real difference what type of RAID you decide to use for your Oracle database. Largely gone are the days when your disk space and RAID volumes were inextricably tied to the server itself. So long as you are using hardware RAID, and the LUNs are abstracted from your operating system, you can largely feel free to make the most of your storage dollar by using RAID 5 in your production database environments.
RHEL 3 Direct Connect to EonStor A12F-G2221
This summer we have been migrating a bunch of data to our shiny new InfoTrend EonStor A12F-G2221. With 1GB of battery-backed cache, it’s a screaming box of disk, and it looks cool to boot. There is a gotcha, though, if you want to direct connect it to a QLogic QLA2340 card on a RHEL 3 server. Here is what you have to do.
First, get the new driver from QLogic, or install the one that came on CD with the HBA. The one that Red Hat packages is always old and useless, and the one that QLogic provides is better anyway because the installer rebuilds the initrd image for you. Once you get the package, just “cd” into the “qlafc-linux-X.XX.XX-X-install” directory and run “qlinstall”. This will install it all for you, so let it do its thing, and reboot the system when it’s done.
Now, go into the management console for your EonStor A12F-G2221. For the most part, the system defaults should work, but InfoTrend sets the default Fibre Connection to “Loop Only”. This is fine if you are dealing with a SAN, but since we are trying to do a direct connect, we have to change it to either “Auto” or “Direct Connect”. I suggest “Auto”, since that way you can have the other port connected to a loop if you want.
That should be all you have to do. You will have to reboot the controller for the change to take effect, so make sure you do this during a scheduled downtime if you have the disk in production.
Changing Linux Mount Points
If you’re familiar with UNIX, you know that changing mount points is really pretty easy. All you have to do is go into “/etc/fstab”, “/etc/vfstab” (or whatever your flavor of UNIX happens to call its filesystem table) and change the mount directory.
If, for instance, you had a Solaris box, and you wanted to make the disk currently mounted as “/data” be mounted as “/database”, all you would have to do is the following:
# umount /data
# mv /data /database
Change this line in “/etc/vfstab” from something like this:
/dev/dsk/c1d0s6 /dev/rdsk/c1d0s6 /data ufs 1 yes -
to something like this:
/dev/dsk/c1d0s6 /dev/rdsk/c1d0s6 /database ufs 1 yes -
and remount it as “/database”.
# mount /database
With Linux, however, it’s not quite so clear anymore. It’s still easy, but it’s just not so clear what you have to do, since they have now taken to mounting filesystems using the volume label. Rather than pointing directly to the disk device, Linux points to the label, and “/etc/fstab” looks more like this:
LABEL=/data /data ext3 defaults 1 2
You can always simply change the disk label, but if you don’t care, you can just tell Linux where the raw device is, bypassing the need to worry about the label. The easiest way to do this is simply to replace the “LABEL=/data” value with the “/dev” entry of the disk itself. Then, simply change “/data” to “/database” and you’re all set.
Here is an example of what you would do to change the mount point of “/data” to “/database”:
# umount /data
# mv /data /database
Change this line in “/etc/fstab” from this:
LABEL=/data /data ext3 defaults 1 2
to this:
/dev/sda6 /database ext3 defaults 1 2
and remount it as /database
# mount /database
Remember to replace the example values here with those required for your situation.
Solaris Automounter
Whenever you’re using NFS mount points, it’s really nice to use some type of automounter. Linux and FreeBSD use AMD to accomplish this, but Solaris uses automountd, and it’s fun and easy to use. Here is an example of a configuration that will automatically mount an NFS share and unmount it after 5 minutes of inactivity.
We have a system called micky which has an NFS point shared to a system called minny as /shareme.
We can see that it is set up in the /etc/dfs/dfstab file on micky:
share -F nfs -o ro=minny.yourdomain.com -d "NFS ShareMe" /shareme
The above will share the directory read-only. If you would like to map the directory as root and be able to write to it, the command would look more like this:
share -F nfs -o rw,root=minny.yourdomain.com -d "NFS ShareMe" /shareme
You can run the share command on micky to check to make sure it is shared:
# share
- /shareme ro=minny.yourdomain.com "NFS ShareMe"
If it’s not shared, run shareall to share it:
# shareall
Now, jump on over to minny and add the following line to /etc/auto_master:
/- auto_direct
Automountd will now look in /etc/auto_direct for direct mount points.
Next edit /etc/auto_direct and add the following line:
/micky-shareme micky:/shareme
Now, create the directory for the NFS mount point on minny:
# mkdir /micky-shareme
Finally, run the automount command on minny to inform the daemon of the changes:
# automount
That should do it… Have fun with your new automount NFS share.
More information on this can be found here












