Compressed DASD Emulation




Introduction

Using compressed DASD files you can significantly reduce the file space required for emulated DASD files and possibly gain a performance boost because less physical I/O occurs. Both CKD (Count-Key-Data) and FBA (Fixed-Block-Architecture) emulation files can be compressed.

In regular (or uncompressed) files, each CKD track or FBA block occupies a specific spot in the emulation file. The offset of the track or block in the file can be calculated directly from the track or block number and the maximum size of the track or block. In compressed files, each track image or group of blocks may be compressed by zlib or bzip2, and occupies only the space necessary for the compressed image. The offset of a compressed track or block group is obtained by performing a two-table lookup. The lookup tables themselves reside in the emulation file.
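The direct calculation for a regular file can be sketched as follows. This is only an illustration: the function name is hypothetical, it assumes a 512-byte device header ahead of the first track, and the track size of 56832 bytes used in the usage note is an example value rather than something stated here.

```python
DEVHDR_SIZE = 512  # assumption: one 512-byte device header precedes track 0


def uncompressed_offset(unit_number, max_unit_size, header_size=DEVHDR_SIZE):
    """Return the file offset of a track (or block) in a regular file.

    Every unit occupies a fixed-size slot, so the offset is a simple
    multiplication; no lookup tables are involved.
    """
    if unit_number < 0 or max_unit_size <= 0:
        raise ValueError("unit number must be >= 0 and unit size > 0")
    return header_size + unit_number * max_unit_size
```

For example, `uncompressed_offset(3, 56832)` gives the offset of the fourth track when each track slot is 56832 bytes. Pass `header_size=0` for a file layout that has no leading header.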

Because FBA blocks are only 512 bytes long, they are grouped into block groups before being compressed. Each block group contains 120 FBA blocks (60K).
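The block group arithmetic is small enough to show in full. The sketch below uses hypothetical names; only the 512-byte block size and the 120 blocks per group come from the text above.

```python
FBA_BLOCK_SIZE = 512      # bytes per FBA block
BLOCKS_PER_GROUP = 120    # FBA blocks per block group
GROUP_SIZE = FBA_BLOCK_SIZE * BLOCKS_PER_GROUP  # 61440 bytes, that is 60K


def locate_fba_block(block_number):
    """Map an FBA block number to (block group number, byte offset in group)."""
    if block_number < 0:
        raise ValueError("block number must be >= 0")
    group, index = divmod(block_number, BLOCKS_PER_GROUP)
    return group, index * FBA_BLOCK_SIZE
```

The byte offset applies to the uncompressed block group image; a compressed group has to be expanded before a single block can be extracted from it.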

Whenever a track or block group is written to a compressed file, it is written either to an existing free space within the file, or at the end of the file, then the lookup tables are updated, and then the space the track or block group previously occupied is freed. The location of a track or block group in the file can change many times.

In the event of a failure (for example, Hercules crash, operating system crash or power failure), the compressed emulation file on the host's physical disk may be out of sync if the host operating system defers physical writes to the file system containing the emulation file. Several methods have evolved to reduce the amount of data lost in these kinds of events.

A compressed file may occupy only 20% of the disk space required by an uncompressed file. In other words, you may be able to have 5 times more emulated volumes using compressed DASD files. However, compressed files are more sensitive to failures and corruption may occur.


Shadow Files

A compressed CKD or FBA dasd can have more than one physical file. The additional files are called shadow files. The function is implemented as a kind of snapshot, where a new shadow file can be created on demand. An emulated dasd is represented by a base file and 0 or more shadow files. All files are opened read-only except for the current file, which is opened read-write.

Shadow files are specified by the sf=shadow-file-name parameter on the device statement for the compressed DASD device.

Please note that the specified shadow filename does not have to actually exist. The shadow-file-name operand of the sf= parameter is simply a filename template that will be used to name the shadow file whenever a shadow file is to be created, but shadow files do not actually get created until you specifically create them via the sf+xxxx (or sf+*) command. Please refer to the discussion of the sf command several paragraphs below for more information.

The shadow file name should have a spot where the shadow file number will be set. This is either the character preceding the last period after the last slash, or the last character if there is no period. For example:

0100 3390 disks/linux1.dsk sf=shadows/linux1_*.dsk
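The placement rule for the shadow file number can be sketched as below. The function name is hypothetical, and the sketch assumes the number is written as a single decimal digit (1 to 8) that replaces the character at the chosen position.

```python
def shadow_file_name(template, number):
    """Fill the shadow file number into a shadow file name template.

    The position is the character preceding the last period after the
    last slash, or the last character when there is no such period.
    """
    if not 1 <= number <= 8:
        raise ValueError("shadow file number must be between 1 and 8")
    if not template:
        raise ValueError("template must not be empty")
    slash = template.rfind("/")
    period = template.rfind(".")
    if period > slash + 1:
        position = period - 1          # character just before the period
    else:
        position = len(template) - 1   # no usable period: last character
    # Assumption: the number is stored as one decimal digit.
    return template[:position] + str(number) + template[position + 1:]
```

With the template from the example above, `shadow_file_name("shadows/linux1_*.dsk", 1)` yields `shadows/linux1_1.dsk`.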

There can be up to 8 shadow files in use at any time for an emulated dasd device. The base file is designated file[0] and the shadow files are file[1] to file[8]. The highest numbered file in use at a given time is the current file, where all writes will occur. Track reads start with the current file and proceed down until a file is found that actually contains the track image.
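The read and write rules for the file stack can be modelled in a few lines. This is a simplified sketch, not the Hercules code: each file is represented as a hypothetical dictionary mapping a track number to its image, with the base file at index 0 and the current file last.

```python
def find_track(files, track):
    """Search from the current file down to the base file.

    Returns (file index, image) for the first file that actually
    contains the track image, or None if no file contains it.
    """
    for index in range(len(files) - 1, -1, -1):
        if track in files[index]:
            return index, files[index][track]
    return None


def write_track(files, track, image):
    """All writes go to the current (highest numbered) file."""
    files[-1][track] = image
```

Because older files are never written, discarding the current file restores exactly the contents that were visible before it was created.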

A shadow file contains all the changes made to the emulated dasd since it was created, until the next shadow file is created. The moment of the shadow file's creation can be thought of as a snapshot of the current emulated dasd at that time, because if the shadow file is later removed, then the emulated dasd reverts back to the state it was at when the snapshot was taken.

Using shadow files, you can keep the base file on a read-only device such as cdrom, or change the base file attributes to read-only, ensuring that this file can never be corrupted.

Hercules console commands are provided to add a new shadow file, remove the current shadow file (with or without backward merge), compress the current shadow file, and display the shadow file status and statistics:

sf+ unit
    Create a new shadow file.
sf- unit [merge | nomerge | force]
    Remove a shadow file. If merge is specified or defaulted, then the contents of the current file is merged into the previous file, the current file is removed, and the previous file becomes the current file. The previous file must be able to be opened read-write. If nomerge is specified then the contents of the current file is discarded and the previous file becomes the current file. However, if the previous file is read-only, then a new shadow file is created (re-added) and that becomes the current file. The force option is required when doing a merge to the base file and the base file is read-only because the ro option was specified on the device config statement.
sfc unit
    Compress the current file.
sfk unit level
    Perform the chkdsk function on the current file. Level is a number -1 ... 4, the default is 2. The levels are:
    -1     devhdr, cdevhdr, l1 table
     0     devhdr, cdevhdr, l1 table, l2 tables
     1     devhdr, cdevhdr, l1 table, l2 tables, free spaces
     2     devhdr, cdevhdr, l1 table, l2 tables, free spaces, trkhdrs
     3     devhdr, cdevhdr, l1 table, l2 tables, free spaces, trkimgs
     4     devhdr, cdevhdr. Build everything else from recovery
sfd unit
    Display shadow file status and statistics.

Note. You can use * in place of unit address to apply the command to all compressed dasd (e.g. 'sf+*', or 'sf-* nomerge').


Compressed DASD File Structure

A compressed DASD file has 6 types of spaces, a device header, a compressed device header, a primary lookup table, secondary lookup tables, track or block group images, and free spaces. The first 3 types only occur once, at the beginning of the file in order. The rest of the file is occupied by the other 3 space types.

The first 512 bytes of a compressed DASD file contain a device header. The device header contains an eye-catcher that identifies the file type (CKD or FBA and base or shadow). The device type and file size are also specified in this header. The header is identical to the header used for uncompressed CKD files, except for the eye-catcher:

devid    heads    trksize
devt     seq      hicyl
reserved


The next 512 bytes contain the compressed device header. This contains file usage information such as the amount of free space in the file:

vrm      opts     numl1    numl2    size
used     ->free   free     largest
number   cyls     comp     parm
reserved


After the compressed device header is the primary lookup table, also called the level 1 table or l1tab. Each 4 byte unsigned entry in the l1tab contains the file offset of a secondary lookup table or level 2 table or l2tab. The track or block group number being accessed divided by 256 gives the index into the l1tab. That is, each l1tab entry represents 256 tracks or block groups. The number of entries in the l1tab is dependent on the size of the emulated device:

l2[0]    l2[1]    l2[2]    l2[3]
l2[4]    l2[5]    l2[6]    l2[7]
  .        .        .        .
l2[n-4]  l2[n-3]  l2[n-2]  l2[n-1]

Following the l1tab, in no particular order, are l2tabs, track or block group images, and free spaces.

Each secondary lookup table (or l2tab), contains 256 8-byte entries. The entry is indexed by the remainder of the track or block group number divided by 256. Each entry contains an unsigned 4 byte offset, an unsigned 2 byte length and an unsigned 2 byte size. The length is the space required for the track or block group image and the size is the amount of space actually used. The size may be greater than the length to prevent short free spaces from accumulating in the file.

0  ->image         length size
1  ->image         length size

.     .    .

255  ->image         length size
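The two-table lookup described above can be sketched as follows. Several details are assumptions made for illustration only: the function name is hypothetical, table entries are taken to be little-endian (the byte order of a real file depends on the host that wrote it, which is why a byte-order swapping utility exists), and an l1tab entry of zero is treated as "no l2tab".

```python
import struct

L1TAB_OFFSET = 1024                # l1tab follows the two 512-byte headers
L2_ENTRIES = 256                   # entries per l2tab
L1_ENTRY = struct.Struct("<I")     # 4-byte offset; little-endian is ASSUMED
L2_ENTRY = struct.Struct("<IHH")   # offset, length, size; little-endian ASSUMED


def lookup(image, unit):
    """Return (offset, length, size) for a track or block group number.

    `image` holds the raw bytes of a compressed file. Returns None when
    the l1tab entry is zero (assumed here to mean there is no l2tab).
    """
    l1_index, l2_index = divmod(unit, L2_ENTRIES)
    (l2tab_offset,) = L1_ENTRY.unpack_from(image, L1TAB_OFFSET + L1_ENTRY.size * l1_index)
    if l2tab_offset == 0:
        return None
    return L2_ENTRY.unpack_from(image, l2tab_offset + L2_ENTRY.size * l2_index)
```

Track 300, for instance, uses l1tab entry 1 (300 divided by 256) and l2tab entry 44 (the remainder).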

A track or block group image contains two fields, a 5-byte header and a variable amount of data that may or may not be compressed. The length in the l2tab entry includes the length of the header and the data.

hdr track or block group data

The 5 byte header contains a 1 byte compression indicator and 4 bytes that identify the track or block group. The format of the identifier depends on whether the emulated device is CKD or FBA:

CKD hdr
comp CC   HH  

The 2 byte CC is the cylinder number for the track image and the HH is the head number. These numbers are stored in big-endian byte order. When the compression indicator byte is zeroed, the 5 byte header is identical to the Home Address (or HA) for the track image. The data, which may or may not be compressed, begins with the R0 count and ends with the end-of-track (or eot) marker, which is a count field containing FFFFFFFFFFFFFFFF. The HA plus the uncompressed track data comprise the track image.

FBA hdr
comp nnnn        

The 4 byte nnnn field is the FBA block group number in big-endian byte order. The data contains 120 FBA blocks, which may or may not be compressed. Uncompressed, the FBA block group is 60K. The header for FBA, unlike CKD, is not used as part of the uncompressed image.
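Both 5-byte headers can be decoded with fixed big-endian formats, as described above. The function names below are hypothetical.

```python
import struct

CKD_HDR = struct.Struct(">BHH")  # comp, CC (cylinder), HH (head); big-endian
FBA_HDR = struct.Struct(">BI")   # comp, nnnn (block group number); big-endian


def parse_ckd_header(image):
    """Return (comp, cylinder, head) from the first 5 bytes of a track image."""
    return CKD_HDR.unpack_from(image, 0)


def parse_fba_header(image):
    """Return (comp, block_group) from the first 5 bytes of a block group image."""
    return FBA_HDR.unpack_from(image, 0)
```

Both formats are exactly 5 bytes long because the `>` prefix disables alignment padding.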

The compression indicator byte contains the value 0, 1 or 2. Any other value is invalid.
0
    Data is uncompressed
1
    Data is compressed using zlib
2
    Data is compressed using bzip2
3 .. 255
    Not valid
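A reader can dispatch on the compression indicator as sketched below, using Python's standard zlib and bz2 modules. The function name is hypothetical, and the sketch assumes the data following the 5-byte header is a complete zlib or bzip2 stream.

```python
import bz2
import zlib


def expand_image_data(comp, data):
    """Return the uncompressed data that follows the 5-byte header."""
    if comp == 0:
        return data                   # stored uncompressed
    if comp == 1:
        return zlib.decompress(data)  # zlib
    if comp == 2:
        return bz2.decompress(data)   # bzip2
    raise ValueError(f"invalid compression indicator: {comp}")
```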

Free space contains a 4-byte offset to the next free space, a 4-byte length of the free space, and zero or more bytes of residual data:

->next length    residual   

The minimum length of a free space is 8 bytes. The free space chain is ordered by file offset and no two free spaces are adjacent. The compressed device header contains the offset to the first free space. The chain is terminated when a free space has a zero offset to the next free space. The free space chain is read when the file is opened for read-write and written when the file is closed; while the file is open, the free space chain is maintained in storage.
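Walking the on-disk free space chain might look like the sketch below. The function name is hypothetical and the fields are assumed to be little-endian; the ordering check follows from the two properties stated above (ordered by offset, never adjacent).

```python
import struct

FREE_HDR = struct.Struct("<II")  # ->next, length; little-endian is ASSUMED


def walk_free_chain(image, first_offset):
    """Return [(offset, length), ...] for every free space in the chain.

    `first_offset` is the value taken from the compressed device header;
    zero means the file has no free space.
    """
    spaces = []
    offset = first_offset
    while offset != 0:
        next_offset, length = FREE_HDR.unpack_from(image, offset)
        if length < FREE_HDR.size:
            raise ValueError("free space shorter than 8 bytes")
        # Ordered by offset and never adjacent: the next free space must
        # start strictly beyond the end of this one.
        if next_offset != 0 and next_offset <= offset + length:
            raise ValueError("free space chain out of order or adjacent")
        spaces.append((offset, length))
        offset = next_offset
    return spaces
```

The ordering check also guarantees that the loop terminates on a corrupted chain, since offsets must strictly increase.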


How It Works

Reading

A track or block group image is read while executing a channel program or by the readahead thread. An image has to be read before it is updated or written to. An image may be cached. If an image is cached, then the channel program may complete synchronously. This means that if all the data a channel program accesses is cached and Hercules does not have to perform physical I/O, then the channel program runs synchronously within the SSCH or SIO instruction in the CPU thread. All DASD channel programs are started synchronously. If a CCW in the channel program requires physical I/O then the channel program is interrupted and restarted at that CCW asynchronously in a device I/O thread.

All compressed devices share a common cache; the devices can be a mixture of FBA and/or CKD device types. Each cache entry contains a pointer to a 64K buffer containing an uncompressed track or block group image. If the track or block group image being read is not found in the cache, then the oldest (or least recently used or LRU) entry that is not busy is stolen. A cache entry is busy if it is being read, or last accessed by an active channel program, or updated but not yet written, or being written. If no cache entries are available then the read must enter a cache wait. When images are detected to be accessed sequentially then the readahead thread(s) may be signalled to read following sequential images.
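The stealing rule can be illustrated with a toy cache. This is a simplified single-threaded model with hypothetical names; it collapses the several "busy" conditions listed above into one flag and leaves out locking, readahead and the writer threads.

```python
from collections import OrderedDict


class ImageCache:
    """Toy shared cache keyed by (device, track or block group number)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # least recently used entry first

    def get(self, key):
        entry = self.entries.get(key)
        if entry is not None:
            self.entries.move_to_end(key)  # mark as most recently used
        return entry

    def put(self, key, image):
        """Cache an image; return False if the caller would have to wait."""
        if key in self.entries:
            self.entries[key]["image"] = image
            self.entries.move_to_end(key)
            return True
        if len(self.entries) >= self.capacity:
            # Steal the oldest entry that is not busy.
            victim = next((k for k, e in self.entries.items() if not e["busy"]), None)
            if victim is None:
                return False  # every entry is busy: a cache wait
            del self.entries[victim]
        self.entries[key] = {"image": image, "busy": False}
        return True
```

Keying on the device as well as the unit number is what lets CKD and FBA devices share one cache in this model.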

Writing

When a cache entry is updated or written to, a bit is turned on indicating the cache entry has been updated. When a cache wait occurs, or (more likely) during space recovery, a cache flush is performed. When the cache is flushed, if any entries have the updated bit on, then the writer thread(s) are signalled. The writer thread selects the oldest cache entry with the updated bit on, compresses the image, and writes it to the file. The new image is written to a new space in the file and then the space previously occupied by the image is freed. In certain circumstances, the image may be written under stress. A stress write occurs when a reading thread is in a cache wait or when a high percentage of cache entries are pending write. In this circumstance, the compression parameters are relaxed to reduce the CPU requirements. An image written under stress is likely to take up more space than the same image written not under stress. The writer thread(s) run 1 nicer than the CPU thread(s); compression is a CPU intensive activity.

Space Recovery

Space recovery is also called, somewhat inaccurately, garbage collection. The primary function of the space recovery thread, or garbage collector, is to keep the emulated compressed DASD files as small as possible. After all, that is the reason for using compressed DASD files in the first place.

When a track or block group image is written, it is written to a new location in the file. It is either written to an existing free space within the file or to the end of the file, increasing the size of the file. The space previously occupied by the image is freed, but it is not immediately available for space allocation requests. Instead, it is pending free space. It is assigned a pending value (typically 2) that is decremented each space recovery cycle (typically every 10 seconds). When the pending value reaches 0 then the space is available for allocation. This increases the chance that a track or block group image can be recovered in the event of a failure.
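The pending mechanism can be modelled as a counter on each free space. The sketch below uses hypothetical names and plain dictionaries; it only illustrates the aging rule, not the actual allocation strategy.

```python
def age_free_spaces(free_spaces):
    """Run once per space recovery cycle: count pending values down to 0."""
    for space in free_spaces:
        if space["pending"] > 0:
            space["pending"] -= 1


def allocatable(free_spaces, needed):
    """Free spaces usable for a new image of `needed` bytes."""
    return [s for s in free_spaces if s["pending"] == 0 and s["length"] >= needed]
```

A space freed with a pending value of 2 therefore stays untouched for two full cycles, during which the old image it still holds could be recovered after a failure.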

The space recovery routine relocates track or block group images towards the beginning of the file, causing free space to move towards the end of the file. When a free space reaches the end of the file, it 'falls off', that is, the file size is reduced.

Simply, the space recovery routine selects a space after a sufficiently large non-pending free space. It then reads and writes consecutive spaces using the normal cckd read and write routines. The space read will become pending free space and will hopefully be written to a non-pending free space occurring earlier in the file. Sometimes it is necessary to write the space later in the file to increase free space size earlier in the file. Left to itself, the space recovery routine will eventually remove all free space from the file. However, it is not intended to be a replacement for the cckdcomp utility; rather, the intent is to provide sufficient free space to prevent excessive file growth.

Another function performed by space recovery is to relocate L2 (secondary lookup) tables towards the beginning of the file. This enables the chkdsk function to complete more quickly during initialization and simplifies chkdsk recovery.


The cckd command and initialization statement

The cckd command and initialization statement can be used to affect cckd processing. The CCKD initialization statement is specified as a Hercules configuration file statement and supports the same options as the cckd command explained below.

Syntax:
cckd help         Display cckd help
cckd stats        Display current cckd statistics
cckd opts         Display current cckd options
cckd opt=value    Set a cckd option
  Multiple options may be specified, separated by a comma with no intervening blanks.
    comp=n        Compression to be used
    compparm=n    Compression parameter to be used
    ra=n          Number of readahead threads
    raq=n         Readahead queue size
    rat=n         Number of tracks to read ahead
    wr=n          Number of writer threads
    gcint=n       Garbage collection interval
    gcparm=n      Garbage collection parameter
    nostress=n    Turn stress writes on or off
    freepend=n    Set the free pending value
    fsync=n       Turn fsync on or off
    trace=n       Number of trace table entries
    linuxnull=n   Check for null linux tracks
    gcstart=n     Start garbage collector
Options:
comp=n Compression type:
-1 Default
  0 None
  1 zlib
  2 bzip2

Override the compression used for all cckd files. -1 (default) means don't override the compression.

compparm=n Compression parameter. A value between -1 and 9. -1 means use the default parameter. A higher value generally means more compression at the expense of cpu and/or storage.

ra=n Number of readahead threads. When sequential track or block group access is detected, some number (rat=) of tracks or block groups are queued (raq=) to be read by one of the readahead threads.

The default is 2.

You can specify a number between 1 and 9.

raq=n Size of the readahead queue. When sequential track or block group access is detected, some number (rat= ) of tracks or block groups are queued in the readahead queue.

The default is 4.

You can specify a number between 0 and 16 (a value of zero disables readahead).

rat=n Number of tracks or block groups to read ahead when sequential access has been detected.

The default is 2.

You can specify a number between 0 and 16 (a value of zero disables readahead).

wr=n Number of writer threads. When the cache is flushed updated cache entries are marked write pending and a writer thread is signalled. The writer thread compresses the track or block group and writes the compressed image to the emulation file. A writer thread is cpu-intensive while compressing the track or block group and i/o-intensive while writing the compressed image. The writer thread runs one nicer than the CPU thread(s).

The default is 2.

You can specify a number between 1 and 9.

gcint=n Number of seconds the garbage collector thread waits during an interval. At the end of an interval, the garbage collector performs space recovery, flushes the cache, and optionally fsyncs the emulation file. (However, the file will not be fsynced unless at least 5 seconds have elapsed since the last fsync).

The default is 10 seconds.

You can specify a number between 1 and 60.

gcparm=n A value affecting the amount of data moved during the garbage collector's space recovery routine. The garbage collector determines an amount of space to move based on the ratio of free space to used space in an emulation file, and on the number of free spaces in the file. (The garbage collector wants to reduce the free space to used space ratio and the number of free spaces). The value is logarithmic; a value of 8 means moving 2^8 (256) times the selected amount, while a negative value similarly decreases the amount to be moved. Normally, 256K will be moved for a file in an interval. Specifying a value of 8 can increase the amount to 64M. At least 64K will be moved. Interestingly, specifying a large value (such as 8) may not increase the garbage collection efficiency correspondingly.

The default is 0.

You can specify a number between -8 and 8.
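The effect of gcparm on the amount moved can be checked with a short calculation. The function name is hypothetical, and the sketch takes the "normal" 256K figure as the selected amount; in practice the selected amount also depends on the free space ratio and the number of free spaces.

```python
KIB = 1024


def gc_move_amount(gcparm, selected=256 * KIB, minimum=64 * KIB):
    """Scale the selected amount by 2**gcparm, but move at least 64K."""
    if not -8 <= gcparm <= 8:
        raise ValueError("gcparm must be between -8 and 8")
    if gcparm >= 0:
        amount = selected << gcparm    # multiply by 2**gcparm
    else:
        amount = selected >> -gcparm   # divide by 2**(-gcparm)
    return max(amount, minimum)
```

With these figures, `gc_move_amount(8)` is 64M (256K times 256), and any value of -2 or lower ends up at the 64K floor.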

nostress= Indicates whether stress writes will occur or not. A track or block group may be written under stress when a high percentage of the cache is pending write or when a device i/o thread is waiting for a cache entry. When a stressed write occurs, the compression algorithm and/or compression parm may be relaxed, resulting in faster compression but usually a larger compressed image. If nostress is set to one, then a stressed situation is ignored. You would typically set this value to one when you want to create the smallest emulation file possible in exchange for a possible performance degradation.

The default is 0.

You can specify 0 (enable stressed writes) or 1 (disable stressed writes).

freepend= Specifies the free pending value for freed space. When a track or block group image is written the space it previously occupied is freed. This space will not be available for future allocations until n garbage collection intervals have completed. In the event of a catastrophic failure, previously written track or block group images should be recoverable if the current image has not yet been written to the physical disk. By default the value is set to -1. This means that if fsync is specified then the value is 1 otherwise it is 2. If 0 is specified then freed space is immediately available for new allocations.

The default is -1.

You can specify a number between -1 and 4.

fsync= Enables or disables fsync. When fsync is enabled, the disk emulation file is synchronized with the physical hard disk at the end of a garbage collection interval (however, no more often than every 5 seconds). This means that if freepend is non-zero and a catastrophic error occurs, the emulated disks should be recoverable in a coherent state. However, fsync may cause performance degradation depending on the host operating system and/or the host operating system level.

The default is 0 (fsync disabled).

You can specify 0 (disable fsync) or 1 (enable fsync).

trace= Number of cckd trace entries. You would normally specify a non-zero value when debugging or capturing a problem in cckd code. When the problem occurs, you should enter the k Hercules console command which will print the trace table entries.

The default is 0.

You can specify a number between 0 and 200000. Each entry represents 128 bytes. Normally, for debugging, I use 100000.

linuxnull= If set to 1 then tracks written to 3390 cckd volumes that were initialized with the -linux option will be checked if they are null (that is, if all 12 4096 byte user records contain zeroes). This is used by the dasdcopy utility.

The default is 0.

gcstart= If set to 1 then space recovery will become active on any emulated disks that have free space. Normally space recovery will ignore emulated disks until they have been updated.

The default is 0.
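The option ranges listed above can be collected into a small validator, for example to check a configuration statement before starting Hercules. This is an illustrative sketch with hypothetical names, not part of Hercules; the 0 to 1 ranges for linuxnull and gcstart are inferred from the descriptions rather than stated explicitly.

```python
CCKD_RANGES = {
    "comp": (-1, 2), "compparm": (-1, 9),
    "ra": (1, 9), "raq": (0, 16), "rat": (0, 16), "wr": (1, 9),
    "gcint": (1, 60), "gcparm": (-8, 8),
    "nostress": (0, 1), "freepend": (-1, 4), "fsync": (0, 1),
    "trace": (0, 200000),
    "linuxnull": (0, 1), "gcstart": (0, 1),  # ranges assumed from the text
}


def parse_cckd_options(text):
    """Parse 'opt=value,opt=value' (no blanks) into a dict of integers."""
    options = {}
    for item in text.split(","):
        name, sep, value = item.partition("=")
        if not sep or name not in CCKD_RANGES:
            raise ValueError(f"unknown or malformed option: {item!r}")
        number = int(value)
        low, high = CCKD_RANGES[name]
        if not low <= number <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
        options[name] = number
    return options


def effective_freepend(options):
    """Resolve the default of -1: 1 when fsync is enabled, otherwise 2."""
    freepend = options.get("freepend", -1)
    if freepend != -1:
        return freepend
    return 1 if options.get("fsync", 0) else 2
```

For example, `parse_cckd_options("ra=2,raq=4,rat=2")` returns the three readahead defaults as integers, and `effective_freepend({"fsync": 1})` resolves to 1.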


Utilities

ckd2cckd  
cckd2ckd  
fba2cfba  
cfba2fba  
These utilities are deprecated. Use the dasdcopy utility instead.

cckdcdsk   [-v] [-f] [-ro] [-level] filename1 [filename2 ...]
  Check the integrity of one or more compressed files. Recover damaged files.
 
-v   Display version and exit.
-f   Perform check even if the OPENED bit is on.
-ro   Open the file(s) read-only. The file will not be updated.
-level   A number 1 .. 4 indicating the level of checking.
1 Minimal checking (default)
2 Medium checking. All track headers will be read.
3 Maximal checking. All track images will be read and uncompressed.
4 Recover everything

cckdcomp   [-v] [-f] [-level] filename1 [filename2 ...]
  Remove all free space from a compressed file or files.
 
-v   Display version and exit.
-f   Perform compress even if the OPENED bit is on.
-level   A number 1 .. 3 indicating the chkdsk level.

cckdswap   [-v] [-f] [-level] filename1 [filename2 ...]
  Change the endianness or byte-order of a compressed file or files
 
-v   Display version and exit.
-f   Perform swap even if the OPENED bit is on.
-level   A number 1 .. 3 indicating the chkdsk level.


Greg Smith gsmith@nc.rr.com


