Adjusting Storage Device Access (204.2)
Candidates should be able to configure kernel options to support various drives. This objective includes software tools to view and modify hard disk settings including iSCSI devices.
Key Knowledge Areas¶
Tools and utilities to configure DMA for IDE devices including ATAPI and SATA
Tools and utilities to manipulate or analyse system resources (e.g. interrupts)
sdparm command and its uses
Tools and utilities for iSCSI
SSD and NVMe configuration and awareness of SAN
Terms and Utilities¶
iscsiadm, scsi_id, iscsid
WWID, WWN, LUN numbers
iSCSI is an abbreviation of Internet Small Computer System Interface. iSCSI is simply a networked implementation of the well-known standardized SCSI protocol. SCSI defines a model, cabling, electrical signals and commands used to access various devices. iSCSI uses just the model and the commands. SCSI (and hence iSCSI) can be used to access all kinds of devices, and though mostly used for disks and tape units, it has been used to access USB storage, printers and flatbed scanners too, just to name a few.
The iSCSI setup is similar to the SCSI setup as it consists of a client
and a server. In between is an IP based network, which can be a LAN, WAN
or the Internet. The client issues low-level SCSI commands as usual (e.g. read, write, seek, erase). The commands can be encrypted
by the client and are then encapsulated and sent over the network. The
server receives and possibly decrypts the commands and executes
them. Unlike traditional Fibre Channel, which requires
commands. Unlike traditional Fibre Channel, which requires
special-purpose cabling, iSCSI can be run over long distances using
existing network infrastructure.
Clients are referred to as "initiators". The commands are referred to as "CDBs" (Command Descriptor Blocks) and the server storage devices are known as "targets".
Definitions: the exported storage entity is the iSCSI target; the importing entity is the iSCSI initiator.
On Linux hosts that act as a client (initiator) to an iSCSI server
(target) you will need to install the client software. You will also
need to edit the configuration file so you can discover the desired
target(s). On Red Hat Linux and its derivatives the configuration file is
/etc/iscsi/iscsid.conf. Edit the file so it points to the proper
server (target) and contains a set of valid credentials - see the example
file at the bottom of this segment.
After configuration, ensure that the iSCSI service is running; if needed, start it.
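On a systemd-based Red Hat derivative this might look as follows (the service name iscsid is an assumption; Debian derivatives ship it as open-iscsi):

```
# Check that the iSCSI initiator daemon is running:
systemctl status iscsid
# Start it if needed:
systemctl start iscsid
```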
Use the iscsiadm command to discover the targets, substituting the IP address of your target where needed.
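A discovery run might look like this (192.168.1.100 is a placeholder for the target's IP address):

```
iscsiadm -m discovery -t sendtargets -p 192.168.1.100
```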
After successfully discovering targets you can log in to the volumes
that were discovered. This can be done with the
iscsiadm tool, which
can also be used to log out of a volume. A fast way to add newly
discovered volumes is to restart the iscsi service.
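A sketch of logging in and out (the IQN and portal address are placeholders):

```
# Log in to a discovered target:
iscsiadm -m node -T iqn.2001-04.com.example:storage.disk1 -p 192.168.1.100 --login
# Log out again:
iscsiadm -m node -T iqn.2001-04.com.example:storage.disk1 -p 192.168.1.100 --logout
# Or restart the service to pick up all newly discovered volumes:
systemctl restart iscsi
```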
When adding an iSCSI mount point to the
/etc/fstab file, use the
_netdev option; it ensures that the network is up before the system tries to
mount the device.
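Such an entry might look like this (device and mount point are assumptions):

```
/dev/sdb1  /mnt/iscsi  ext4  _netdev  0 0
```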
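An example /etc/iscsi/iscsid.conf fragment with CHAP credentials might look like this (usernames and passwords are placeholders; key names follow the standard iscsid.conf format):

```
node.startup = automatic
node.session.auth.authmethod = CHAP
node.session.auth.username = someuser
node.session.auth.password = somesecret
discovery.sendtargets.auth.authmethod = CHAP
discovery.sendtargets.auth.username = someuser
discovery.sendtargets.auth.password = somesecret
```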
There are a number of ways to set up a target. The SCSI Target Framework (STGT/TGT) was the standard before Linux 2.6.38. The current standard is the LIO target.
Another well-known implementation was the iSCSI Enterprise Target (IET) and its successor the SCSI Target Subsystem (SCST). IET was a candidate for kernel inclusion too, but LIO was the final choice. LIO (an abbreviation of linux-iscsi.org) is now the standard open-source SCSI target for shared data storage in Linux. It supports all prevalent storage fabrics, including Fibre Channel (QLogic), FCoE, IEEE 1394, iSCSI, iSER (Mellanox InfiniBand), SRP (Mellanox InfiniBand), USB, vHost, etc.
Adding a new target (on the target host)¶
To add a new iSCSI target edit the
/etc/tgt/targets.conf file. This file contains many examples of different configuration
options that have been commented out. A basic target may be defined in just a few lines.
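A sketch of such a definition in /etc/tgt/targets.conf (the IQN and storage paths are placeholders):

```
<target iqn.2008-09.com.example:server.target1>
    backing-store /srv/iscsi/lun1.img
    direct-store /dev/sdd
</target>
```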
This example defines a single target with two LUNs. LUNs are described with either the backing-store or direct-store directives where backing-store refers to either a file or a block device, and direct-store refers to local SCSI devices. Device parameters, such as serial numbers and vendor names, will be passed through to the new iSCSI LUN.
The target service is aptly named tgtd. It can be started and stopped with the service manager; if there are open connections, stopping must be forced.
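On a SysV-style system this might look as follows (the forcedstop action is an assumption based on common tgtd init scripts):

```
service tgtd start
service tgtd stop
# With open connections:
service tgtd forcedstop
```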
Any connected SCSI device is assigned a unique device name. The device
name begins with
/dev/sd, for example /dev/sda or /dev/sdb.
Additionally, each SCSI device has a unique World Wide Identifier
(WWID). The identifier can be obtained from the device itself by issuing
an inquiry command. On Linux systems you can find the WWIDs in the
/dev/disk/by-id/ directory, in which you will find symbolic links
whose names contain the WWID and which point to the assigned device
name. There are two types of WWIDs, known as 'page 0x83' and 'page
0x80', and any SCSI device has one of these.
For example, a device with a page 0x83 identifier would have:
Or, a device with a page 0x80 identifier would have:
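The symlinks might look like this (identifiers are illustrative placeholders):

```
# page 0x83 identifier:
scsi-3600508e000000000ce506dc50ab0ad05 -> ../../sda
# page 0x80 identifier:
scsi-SSEAGATE_ST373453LW_3HW1RHM6 -> ../../sda
```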
Red Hat Enterprise Linux automatically maintains the proper mapping from
the WWID-based device name to a current
/dev/sd name on that system.
Applications can use the
/dev/disk/by-id/ name to reference the data
on the disk, even if the path to the device changes, and even when
accessing the device from different systems. This also allows detection
of and access to pluggable devices, say a photo camera. If it is plugged in,
the system will send an
inquiry command and will create the proper
symbolic link, hence allowing you to access the camera under the same name.
It is possible and feasible to have more than one path to a device, for
example to offer a fail-over or fall-back option and/or to increase
performance. Such "paths" might be implemented as additional network
paths (preferably using their own network card) or over fibre
(preferably using their own HBAs) etc. If there are multiple paths from
a system to a device, a special kernel-module and associated daemons and
other software will use the WWID to detect the paths. A single
/dev/mapper/wwid device will be created by them, such as
/dev/mapper/3600508b400105df70000e00000ac0000, which will give access
to the device regardless of which path(s) to it are in use, as long
as at least one of them can be used ('is active').
The multipath -l command shows the mapping from the WWID-based multipath device to the non-persistent
/dev/sd names of the underlying paths.
Device-mapper-multipath automatically maintains the proper mapping of
each WWID-based device name to its corresponding
/dev/sd name on the
system. These names are persistent across path changes, and they are
consistent when accessing the device from different systems. When the
user_friendly_names feature (of device-mapper-multipath) is used, the
WWID is mapped to a name of the form
/dev/mapper/mpathn. By default,
this mapping is maintained in the file /etc/multipath/bindings. These
mpath names are persistent as long as that file is maintained.
Important: if you use user_friendly_names, then additional steps are required to obtain consistent names in a cluster. Refer to the Consistent Multipath Device Names in a Cluster section in the Using DM Multipath Configuration and Administration book.
Another way to address a SCSI device is by using its LUN, the Logical Unit Number. The LUN is a three-bit number (0..7) which indicates a device within a target. Any target should have at least a device with LUN 0, which can be used like any device, but is also capable of providing information about the other LUNs in the target (if any). If there is just one device in a target, that device has LUN 0.
Say you bought a SCSI device that contained three disks (all connected over the same SCSI cable). To find out how many disks there are in your target (or tape units etc.) you would send an inquiry to LUN 0. The device that has LUN 0 is designed to be able to tell you the configuration and might report that it knows of other devices. To address these, the client uses the LUNs as specified by LUN 0.
Because SCSI disks were often referred to by their LUN, the habit stuck, and nowadays the term is also used to refer to a logical disk in a SAN. But there can be many more LUNs in a SAN than the humble eight (0..7) that are used in older configurations. Also, such a SAN LUN may well consist of a series of disks, which might be configured as a RAID and possibly subdivided into volumes etc. - to the client, it is just a disk with a LUN.
Configuring device name persistence (aka LUN persistence)¶
Linux device names may change upon boot. This can lead to confusion and damage to data. Therefore, there are a number of techniques that allow persistent names for devices. This section covers how to implement device name persistence in guests and on the host machine, with and without multipath. Persistence simply means that you will have the same name for the device, regardless of its access paths. So your camera or disk is always recognised as such and will be issued the same device name, regardless of where you plug it in.
Implementing device name persistence without multipath¶
If your system is not using multipath, you can use
udev to implement device name persistence.
udev is a device manager that listens to an interface
with the kernel. If a new device is plugged in, or a device is being
removed, udev will receive a signal from the kernel and, using a set
of rules, will perform actions, for example assign an additional device
name to the device.
Before implementing LUN persistence on your system you will first need
to acquire the proper WWIDs of the devices you want to control. This
can be done by using the
scsi_id program. After you have acquired the WWIDs
of your devices, you will need to write a rule so
udev can assign a
persistent name to each of them.
A minor hurdle has to be taken first. The scsi_id command
will not list any device by default and will only list devices it has
been told to see ('whitelisted' devices). To whitelist a device you
will need to add its WWID to the
scsi_id configuration file, which is
/etc/scsi_id.conf. That's safe, but a nuisance if you want to
discover WWIDs of new devices. Hence you can specify the
-g option to
allow detection of any disk that has not been listed yet. If you feel
that this should be the default behaviour, you may consider editing
/etc/scsi_id.conf: delete (or comment out) the line that disables this behaviour and add a line that sets the -g option.
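Based on common scsi_id setups, the assumed edit looks like this:

```
# /etc/scsi_id.conf
# delete or comment out the blacklisting default:
# options=-b
# and add instead:
options=-g
```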
If this has been configured, the next time a program uses the scsi_id
command it will add the
-g option (and any other options you might
have specified in the configuration file) and hence allow extracting the
WWIDs from new devices. The
-g option should be used, either by default from
the configuration file or set manually in any
udev rule that uses the
scsi_id command, or else your newly plugged in device won't be seen.
You then will need to find out what device file your new device was
allocated by the kernel. This is routinely done by using the command
dmesg after you have plugged in the new device. Typically, you will see
lines reporting the detected device and the name it was assigned.
In our example, we plugged in a 4 GB flash disk that was assigned the device
name 'sdc'. This device is now available as
/sys/block/sdc in the
/sys pseudo-filesystem, and also as the device file /dev/sdc.
To determine the device WWID, use the scsi_id command. In older versions you could use the
-g option together with the -s option; the
-s option specifies that
you want to use the device name relative to the root of the sysfs filesystem
(/sys). Newer versions of scsi_id (e.g. those in RHEL6) do
not allow the use of
sysfs names anymore, so you should use the device name instead.
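Both invocations can be sketched like this (sdc is an example device):

```
# Older scsi_id versions, addressing the device via sysfs:
scsi_id -g -u -s /block/sdc
# Newer versions (e.g. RHEL6), addressing the device node directly:
scsi_id --whitelisted --device=/dev/sdc
```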
The long string of characters in the output is the WWID. The WWID does
not change when you add a new device to your system. Acquire the WWID
for each device in order to create rules for the devices. To create new
device rules, edit the
20-names.rules file in the /etc/udev/rules.d
directory. The device naming rules follow the format KERNEL=="match", PROGRAM=="command", RESULT=="match", NAME="name".
So, when you plug in a new device, the kernel will signal
udev that it found
a new SCSI device. The PROGRAM will be executed and the RESULT should
match. If so, the NAME will be used for the device file.
An example would be the following, in both the classic and the RHEL6 syntax:
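Both forms can be sketched like this (the WWID and the resulting name are placeholders):

```
KERNEL=="sd*", PROGRAM=="/sbin/scsi_id -g -u -s /block/%k", RESULT=="3600a0b80001327510000015427b625e", NAME="backupdisk%n"
KERNEL=="sd*", PROGRAM=="/sbin/scsi_id --whitelisted --device=/dev/%k", RESULT=="3600a0b80001327510000015427b625e", NAME="backupdisk%n"
```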
When the kernel allocates a new device whose name matches the KERNEL
pattern, the PROGRAM is executed, and when the RESULT matches, the NAME
is created. Note the
-g option that we have included so we can
actually see the device even though it is not whitelisted.
The udevd daemon is typically started by the execution of a local
startup script, normally done by the
init process and initiated by a
line in the
/etc/inittab file. On older systems you would ensure that
this line was present in /etc/inittab; on some systems, e.g. RHEL6, udevd is instead started from the early init scripts (such as /etc/rc.d/rc.sysinit).
Implementing device name persistence with multipath¶
You can implement device name persistence by using the multipath
drivers and software. Of course, normally you would only use this if you
have more than one path to your device. But you can actually define a
multipath with only one active path, and it will work just like
any other disk. Just define an alias and the name will be persistent.
The multipath software consists of a daemon (multipathd) that guards the paths to
the various devices. It executes the external multipath command when needed;
the multipath command in turn will signal the daemon after it
has reconfigured paths.
To implement LUN persistence in a multipath environment, you can define
an alias name for the
multipath device. Edit the device aliases in the
configuration file of the multipath daemon, /etc/multipath.conf.
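A sketch of such an alias definition in /etc/multipath.conf (using the standard multipaths section syntax):

```
multipaths {
    multipath {
        wwid   600a0b80001327510000015427b625e
        alias  backupdisk
    }
}
```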
This defines a device that has a persistent device file
/dev/mpath/backupdisk. That will grant access to the device - whatever
its name - that has WWID 600a0b80001327510000015427b625e. The alias
names are persistent across reboots.
To install a new disk in your
system, you will need to physically install the hard drive and configure
it in your computer's BIOS. Linux will detect the new drive
automatically in most cases. You could type
dmesg | more to find out
the name and device file of the drive. As an example: the second IDE
drive in your system will be named
/dev/hdb. We will assume that the
new drive is /dev/hdb.
On older kernels, IDE device names match the pattern /dev/hdx, where
a is used for the primary master IDE device, b is the primary slave, c
the secondary master and d is the secondary slave IDE device. For
PATA/SATA drives the Linux SCSI driver is used, as the PATA/SATA
architecture is similar to that of SCSI. Newer devices hence have
device files that match the pattern /dev/sdx.
After installation of your disk and its recognition by the
kernel, you can choose to (re)partition your new disk. As
root, in a terminal type fdisk /dev/hdb.
This will take you to the fdisk prompt.
At the prompt, type
p to display the existing partitions. If you have
partitions that you need to delete, type
d, then at the prompt, type
the number of the partition that you wish to delete. Next, type n to
create the new partition. Type
1 (assuming that this will be the first
partition on the new drive), and hit
enter. Then you will be prompted
for the cylinder that you want to start with on the drive. Next, you
will be asked for the ending cylinder for the partition. Repeat these
steps until all partitions are created.
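The dialogue can be sketched as a keystroke sequence (assuming /dev/hdb and a single partition spanning the disk; do not run this against a disk holding data you care about):

```
fdisk /dev/hdb
#   p        print the current partition table
#   d        delete a partition (if needed)
#   n        create a new partition
#   p        make it a primary partition
#   1        partition number 1
#   <enter>  accept the default first cylinder
#   <enter>  accept the default last cylinder
#   w        write the table to disk and exit
```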
To put a clean filesystem on the first partition of your newly partitioned drive, use mke2fs (or mkfs).
Repeat this step for all partitions. You can use the
-t parameter of mkfs to
define the filesystem type to build, e.g. ext4.
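For example (device name and filesystem type are assumptions):

```
mke2fs /dev/hdb1
# or, selecting the filesystem type explicitly:
mkfs -t ext4 /dev/hdb1
```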
You will now need to decide the mount point for each new
partition on your drive. We will assume a mount point of
/new as an example. Use mkdir
to create the mount point for your drive. Now you will need to edit
/etc/fstab. You will want to make an entry at the end of the file
similar to this:
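Such an entry could look like this (device, mount point and filesystem are example values):

```
/dev/hdb1  /new  ext4  defaults  1 2
```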
Modern SSDs (Solid State Disks) might profit from some optimisation.
SSDs are very fast compared to standard hard disks with platters, but
they have a bigger chance of suffering from bad blocks after many
rewrites to blocks. An SSD can use TRIM to discard unused blocks; on
ext4 this is typically enabled with the discard mount option. The
relatime mount option can also be used to reduce the
amount of writes to disk: it makes the filesystem update a file's
last-accessed time only when it is older than the last-modified time. In
/etc/fstab these options go into the fourth field.
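A sketch of such an entry (device and filesystem are assumptions):

```
/dev/sda1  /  ext4  discard,relatime  0 1
```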
NVM Express (NVMe) is a specification for accessing SSDs attached
through the PCI Express bus. The Linux NVMe driver is included in kernel
version 3.3 and higher. NVMe devices can be found under /dev/nvme*.
NVMe devices are a family of SSDs that should not be issued discards.
Discards are usually disabled on setups that use ext4 and LVM, but other
filesystems might need discards to be disabled explicitly.
Mounting the /tmp partition in memory (tmpfs) is another way to reduce
disk writes. This is advisable if your system has enough memory. Add a
tmpfs entry for /tmp to /etc/fstab.
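A sketch of such an entry (the size limit is an example value):

```
tmpfs  /tmp  tmpfs  defaults,nosuid,nodev,size=1G  0 0
```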
After you have created a similar entry for each partition, write the file.
Running mount -a (or rebooting) will mount the partitions on the directories that you specified in
/etc/fstab.
When a kernel interacts with ext2 or ext3 filesystems the
tune2fs command can be used to configure the interaction. Noteworthy
options of this command are:
-e error-behaviour: this option defines how the kernel should react in case of errors on the filesystem to which this command is applied. Possible behaviours are:
continue: ignore the error on the filesystem. Report the read or write error to the application that tries to access that part of the filesystem.
remount-ro: remount the filesystem as read-only. This secures data that is already on disc. Applications that try to write to this filesystem will receive an error.
panic: this causes a kernel panic and the system will halt.
-m reserved-blocks-percentage: full filesystems have very bad performance due to fragmentation. To prevent regular users from filling up the whole filesystem, a percentage of the blocks is reserved. This reserved part cannot be used by regular users; only the root user can make use of this part of the filesystem.
The default percentage of reserved blocks is 5%. With disks becoming ever larger, setting aside 5% would result in a large number of reserved blocks. For large disks, 1% reserved blocks should be sufficient for most filesystems.
-o mount-options: this sets the default mount options of the filesystem. Command line options or options in
/etc/fstab take precedence. The ^-sign clears the specified option.
This option is supported by kernel 2.4.20 and above.
-O sparse_super: enable or disable the sparse superblock feature. Enabling this feature on large filesystems saves space, because fewer superblock backups are used.
After enabling or disabling this feature the filesystem must be made valid again by running the e2fsck command.
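For example (the device name is an assumption):

```
# remount read-only when errors occur:
tune2fs -e remount-ro /dev/hdb1
# reserve only 1% of the blocks for root:
tune2fs -m 1 /dev/hdb1
```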
The hdparm command is used to get/set SATA/IDE device
parameters. It provides a command line interface to various kernel
interfaces supported by the Linux SATA/PATA/SAS "libata" subsystem and
the older IDE driver subsystem. Newer (from 2008) USB disk enclosures
support "SAT" (SCSI-ATA Command Translation), so they might also work
with this command. The syntax for this command is: hdparm [options] [device ...].
Frequently used options are:
-a - Get or set the sector count for filesystem read-ahead.
-d - Get or set the using_dma flag.
-g - Display drive geometry.
-i - Display identification information.
-r - Get (1) or set (0) the read-only flag.
-t - Perform device read timings for benchmarking.
-T - Perform device cache read timings for benchmarking.
-v - Display all settings, except -i.
See man hdparm for more information.
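Typical invocations might be (requires root; the device name is an example):

```
hdparm -i /dev/sda     # identification information
hdparm -tT /dev/sda    # device and cache read benchmarks
```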
The sdparm command is used to access SCSI mode pages, read VPD
(Vital Product Data) pages and send simple SCSI commands.
The utility fetches and can change SCSI device mode pages. Inquiry data including Vital Product Data pages can also be displayed. Commands associated with starting and stopping the medium; loading and unloading the medium; and other housekeeping functionality may also be issued by this utility.
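Typical invocations might be (device names are examples):

```
sdparm --all /dev/sda          # show all mode page fields
sdparm --inquiry /dev/sda      # show VPD pages
sdparm --command=stop /dev/sdb # spin the medium down
```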
Configuring kernel options¶
Elsewhere in this book the process to configure and debug
kernels is described. Additional kernel
options may be configured by patching the kernel source code. Normally
the kernel's tunable parameters are listed in the various header files
in the source code tree. There is no golden rule to apply here - you
need to read the kernel documentation or may even need to crawl through
the kernel code to find the information you need. Additionally, a
running kernel can be configured by using either the /proc filesystem
or the sysctl command.
Current kernels also support dynamic setting of kernel parameters. The
easiest method to set kernel parameters is by modifying the files in the
/proc/sys tree of the /proc filesystem. You can use the
echo command to set parameters,
e.g. writing 1 to /proc/sys/net/ipv4/ip_forward activates IP forwarding.
The changes take effect immediately but are lost during reboot.
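For example (writing requires root):

```
echo 1 > /proc/sys/net/ipv4/ip_forward
# verify:
cat /proc/sys/net/ipv4/ip_forward
```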
The /proc filesystem can also be queried. To see which interrupts a
system uses, for example, you could issue cat /proc/interrupts.
Its output lists, for every interrupt, its number, per-CPU delivery counts and the driver(s) that use it.
On multicore systems you will see multiple CPU columns, e.g. CPU0..CPU3 for a four-core system. The interrupts are delivered to core 0 by default. This can create a performance bottleneck, especially on networked storage systems. As CPU0 gets swamped by IRQs the system feels sluggish, even when the other cores have almost nothing to do. Newer kernels try to load-balance the interrupts, but you can override the defaults and choose which core will handle which interrupt. The procedure is listed in the next paragraph.
Each interrupt listed in the
/proc/interrupts file has a corresponding subdirectory in the
/proc/irq/ tree. So, interrupt 12 corresponds to the directory
/proc/irq/12. In that directory the file
smp_affinity contains data
that describes which core handles the interrupt. The file contains a
bitmask that has one bit per core. By setting the proper bit the APIC
chip will deliver the interrupt to the corresponding core, e.g. 1
selects CPU0, 2 selects CPU1, 4 selects CPU2, 8 selects CPU3, 16 selects CPU4, etc.
To make, for example, CPU5 receive all interrupts #13, write the mask with
bit 5 set (hex 20) to /proc/irq/13/smp_affinity.
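A sketch of the procedure; the mask arithmetic is plain shell, while the final write requires root and a real IRQ 13:

```shell
# One bit per core: bit 5 selects CPU5, which is 0x20 as a hex mask.
mask=$(printf '%x' $((1 << 5)))
echo "$mask"    # prints: 20
# As root, direct all IRQ 13 interrupts to CPU5:
# echo "$mask" > /proc/irq/13/smp_affinity
```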
By carefully selecting which interrupts go to which processor you can dramatically influence the performance of a busy system.
Kernel tuning can be automated by putting the
echo commands in the startup scripts (e.g.
/etc/rc.d/rc.local), or by listing the parameters in /etc/sysctl.conf.
The sysctl command is used to modify kernel parameters at runtime. The
parameters that can be modified by this command are listed in the
/proc/sys directory tree. Procfs is required for
sysctl support under Linux. sysctl can be used to read as well as write kernel parameters.
Frequently used options of the
sysctl command are:
-a - Display all values currently available.
-e - Use this option to ignore errors about unknown keys.
-n - Use this option to disable printing of the key name when printing values.
-p - Load sysctl settings from the file specified, or /etc/sysctl.conf if
none is given. Specifying
- as filename means reading data from standard input.
-w - Use this option to change a sysctl setting.
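For example (writing requires root):

```
sysctl net.ipv4.ip_forward          # read a parameter
sysctl -w net.ipv4.ip_forward=1     # change it
sysctl -p /etc/sysctl.conf          # load settings from file
```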