Thursday, April 23, 2009

Finding ‘lost’ Tape Volumes

The DELETE VOLHISTORY command deletes volume history file records that are no longer needed (for example, records for obsolete database backup volumes).

When you delete records for volumes that are not in storage pools (for example, database backup or export volumes), the volumes return to scratch status if TSM acquired them as scratch volumes. Scratch volumes of device type FILE are deleted. When you delete the records for storage pool volumes, the volumes remain in the TSM database.

For users of DRM, the database backup expiration should be controlled with the SET DRMDBBACKUPEXPIREDAYS command instead of the DELETE VOLHISTORY command. DELETE VOLHISTORY removes TSM's record of the volume, so volumes that were managed by the MOVE DRMEDIA command can become 'lost'. The following bash script identifies these volumes:

#!/bin/bash
# --------------------------------------------------------
#
# Description: 'Missing Tapes' volumes.
# Date: 29th March 2007
# Queries: A Singh - singh.ajith@gmail.com
#
# --------------------------------------------------------


# Update this with the highest value of the volume labels.
MAX_VOLUMES=1300


# Update this with the lowest value of the volume labels.
MIN_VOLUMES=1000


# --------------------------------------------------------


# Tape Label Parameters - update as necessary
PREFIX="BL"
SUFFIX="L3"
# length of label excluding PREFIX and SUFFIX
LABEL_LENGTH=4
ZERO="0"


# --------------------------------------------------------


# TSM Server administrator account details - update as necessary
DSM_DIR=/opt/tivoli/tsm/client/ba/bin
DSM_ADMIN=admin
DSM_PWD=secret
DSM_CMD="$DSM_DIR/dsmadmc -id=$DSM_ADMIN -pa=$DSM_PWD -datao=y"


# --------------------------------------------------------


# Stop early if the administrative command-line client is not installed.
test -x $DSM_DIR/dsmadmc || { echo "TSM Client Administrative CLI not installed."; if [ "$1" = "stop" ]; then exit 0; else exit 5; fi; }


DATA_VOLS_SQL="select volume_name from volumes order by 1 asc"
VOLH_SQL="select volume_name from volhistory order by 1 asc"
LIBVOLS_SQL="select volume_name from libvolumes order by 1 asc"


MISSING_VOLS=" "
DATA_VOLS=`$DSM_CMD $DATA_VOLS_SQL`
VOLH_VOLS=`$DSM_CMD $VOLH_SQL`
LIB_VOLS=`$DSM_CMD $LIBVOLS_SQL`


# Merge the three lists into one sorted list without duplicates.
DSM_VOLS=`echo $DATA_VOLS $VOLH_VOLS $LIB_VOLS | tr ' ' '\n' | sort | uniq`

# Build the full list of expected labels, for example BL1000L3 to BL1300L3.
for (( i=$MIN_VOLUMES; i<=$MAX_VOLUMES; i++ ))
do
  # Left-pad the number with zeros up to LABEL_LENGTH digits.
  tmpvol="$i"
  for (( j=${#tmpvol}; j<$LABEL_LENGTH; j++ )); do tmpvol=$ZERO$tmpvol; done
  tmpvol=$PREFIX$tmpvol$SUFFIX
  MISSING_VOLS=" "$tmpvol$MISSING_VOLS
done

# Remove every label that TSM already knows about.
for i in $DSM_VOLS
do
  MISSING_VOLS=`(for j in $MISSING_VOLS; do echo $j; done) | grep -v "$i"`
done

echo "'MISSING' TAPE VOLUMES"
echo "----------------------"
echo
echo "This is a list of tape volumes that are not in the tape libraries and are not listed in the volume history file and the TSM volumes list."
echo
echo $MISSING_VOLS | tr " " "\n"

# --------------------------------------------------------
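
The label padding and the comparison in the script above can also be sketched with printf and grep, which avoids the nested loops. This is only a minimal sketch, not the author's script: the function names make_labels and missing_labels are hypothetical, and the known-volume list is assumed to arrive one label per line in a file (for example, the combined output of the three select statements).

```bash
# Hypothetical helpers; the label layout (prefix, digits, suffix) mirrors the script above.

# Print every expected label from $1 to $2, zero-padded to $3 digits,
# with prefix $4 and suffix $5.
make_labels() {
  local i="$1" max="$2" width="$3" prefix="$4" suffix="$5"
  while [ "$i" -le "$max" ]; do
    printf "%s%0${width}d%s\n" "$prefix" "$i" "$suffix"
    i=$((i + 1))
  done
}

# Print the labels in file $1 (expected) that do not appear as a whole
# line in file $2 (labels known to TSM).
missing_labels() {
  grep -vxFf "$2" "$1"
}
```

With the values from the script, `make_labels 1000 1300 4 BL L3 > expected.txt` would write the labels BL1000L3 through BL1300L3; expected.txt is just an illustrative file name.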

Monday, April 20, 2009

Recovery Log Pinning

The recovery log can appear to be out of space when it is in fact pinned by an operation, or a combination of operations, on the server. A recovery log is pinned when its space cannot be reclaimed and used by current transactions because an existing transaction is processing too slowly or is hung.

To determine if the recovery log is pinned, issue SHOW LOGPINNED repeatedly over many minutes. If this reports the same client session or server processes as pinning the recovery log, it may be necessary to take action to cancel or terminate that operation in order to keep the recovery log from running out of space.

To cancel or terminate a session or process that is pinning the recovery log, issue SHOW LOGPINNED CANCEL. Server versions 5.1.7.0 and above, as well as 5.2.0.0 and above, can also recognize automatically that the recovery log is running out of space and, where possible, detect and resolve a pinned recovery log using the SHOW LOGPINNED processing.
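
Issuing SHOW LOGPINNED repeatedly and comparing the results by eye is tedious, so the check could be scripted roughly as follows. This is a sketch under assumptions, not a tested procedure: DSM_CMD is assumed to be a working dsmadmc invocation like the one in the 'lost' tape script, the function name logpinned_is_stable is hypothetical, and PIN_PATTERN is a guess at which output lines name the pinning session or process. Check the pattern against the actual output of your server level before relying on it.

```bash
# Assumption: DSM_CMD is a working dsmadmc invocation (credentials are placeholders).
DSM_CMD=${DSM_CMD:-"dsmadmc -id=admin -pa=secret -datao=y"}
# Assumption: lines that name the pinning session or process match this pattern.
PIN_PATTERN=${PIN_PATTERN:-"session|process"}

# Sample SHOW LOGPINNED $1 times, $2 seconds apart.
# Succeeds only if every sample contains matching lines and those lines
# never change, which suggests the same operation is still pinning the log.
logpinned_is_stable() {
  local samples="$1" interval="$2" first="" current="" n=0
  while [ "$n" -lt "$samples" ]; do
    current=$($DSM_CMD "show logpinned" | grep -Ei "$PIN_PATTERN")
    [ -n "$current" ] || return 1
    if [ "$n" -eq 0 ]; then
      first=$current
    elif [ "$current" != "$first" ]; then
      return 1
    fi
    n=$((n + 1))
    if [ "$n" -lt "$samples" ]; then sleep "$interval"; fi
  done
  return 0
}
```

For example, `logpinned_is_stable 10 60` would take ten samples one minute apart; a successful return is a hint to investigate the reported session or process, not proof that it must be cancelled.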


For the PDF version of this document, send a blank email, with subject line "Recovery Log Pinning", to TSM Assist

Sunday, April 19, 2009

Delaying the Re-use of Tape Storage Pools

The REUSEDELAY attribute of a sequential access storage pool (tape or file disk pools) specifies the number of days that must elapse before a volume can be reused or returned to scratch status, after all files have been expired, deleted, or moved from the volume.

When you delay reuse of such volumes and they no longer contain any files, they enter the pending state. Volumes remain in the pending state for as long as specified with the REUSEDELAY parameter for the storage pool to which the volume belongs. Once that time is up, the server deletes the pending volume from the storage pool automatically.

Delaying reuse of volumes can be helpful under certain conditions for disaster recovery. When TSM expires, deletes, or moves files from a volume, the files are not actually erased from the volumes: the database references to these files are removed. Thus the file data may still exist on sequential volumes if the volumes are not immediately reused.

If a disaster forces you to restore the TSM database using a database backup that is old or is not the most recent backup, some files may not be recoverable because TSM cannot find them on current volumes. However, the files may exist on volumes that are in pending state. You may be able to use the volumes in pending state to recover data by doing the following:

1. Restore the database to a point-in-time prior to file expiration.
2. Use a primary or copy storage pool volume that has not been rewritten and that contained the expired file at the time of the database backup.

If you back up your primary storage pools, set the REUSEDELAY parameter for the primary storage pools to 0 to efficiently reuse primary scratch volumes. For your copy storage pools, delay reuse of volumes for as long as you keep your oldest database backup. Setting REUSEDELAY to a value much larger than the retention period for database backups serves no useful purpose.

Volumes in a storage pool with a non-zero REUSEDELAY may not remain in the storage pool for the REUSEDELAY period if their access is set to DESTROYED: volumes in a destroyed state are deleted from the storage pool immediately and set to scratch once they have been restored or deleted. If REUSEDELAY is set to zero (the default), this problem does not apply. Avoid updating a volume's access to DESTROYED; use UNAVAILABLE instead.

The TSM database backup retention period is set with the SET DRMDBBACKUPEXPIREDAYS command. Setting REUSEDELAY in the copy pool definition to the same value ensures that, when the database is restored to an earlier level, its references to files in the storage pool are still valid.
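
To keep the two settings in step, the pair of administrative commands can be generated from a single value. The following is a minimal sketch; the function name align_reusedelay is hypothetical, and the commands are only printed so they can be reviewed before being passed to dsmadmc.

```bash
# Print the two administrative commands that give a copy storage pool the
# same reuse delay as the database backup retention.
# $1 = copy storage pool name, $2 = number of days.
align_reusedelay() {
  local pool="$1" days="$2"
  case "$days" in
    ''|*[!0-9]*) echo "days must be a non-negative integer" >&2; return 1 ;;
  esac
  echo "set drmdbbackupexpiredays $days"
  echo "update stgpool $pool reusedelay=$days"
}
```

For example, `align_reusedelay COPYPOOL 7` prints both commands for a seven-day retention; COPYPOOL is a made-up pool name.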



Thursday, April 16, 2009

Define a RAW volume to TSM

One of the main advantages of disk pools is that they let you control when heavy loads are sent to your tape drives.
Within TSM, there are three types of disk pools: Random Access Disk Pools (of device class DISK), File Disk Pools (of device class FILE) – files on hard drives that store data sequentially as on tape, and RAW Disk Pools.

The three types differ in how they are used and in the performance they can reach. RAW volumes give the best performance for large file migrations. Random access disk pools are best for small files. File disk pools sit in the middle: their sequential read and write operations give them an advantage over random access disk pools.

The size of each volume within a disk pool seems to be very important within TSM. To improve performance, reduce the size of the volumes and increase their number. Furthermore, and only on random access volumes, a single corrupt volume can be varied offline without halting operations on the entire storage pool.

To define a RAW volume to TSM, follow these steps:


1. Prepare a raw volume using operating system commands; this example uses a raw volume named ls_name on AIX.

2. Define to a storage pool:

def v stgp_name /dev/rls_name

3. Define as a TSM database volume:

def dbv /dev/rls_name

4. Define as a TSM log volume:

def logv /dev/rls_name


Monday, April 13, 2009

TSM Server-Side Daily Administrator Checklist

1. List TSM license compliance.

audit lic
select compliance from licenses


2. Query server processes and pending requests to determine if any jobs are waiting on operator action.

q pr
q req
q se

3. Query all disk storage pools to determine if the migration process has completed.

select stgpool_name, pct_utilized from stgpools where devclass='DISK'

4. List all drives that are OFFLINE.

select drive_name from drives where not online='YES'

5. List all paths that are OFFLINE.

select source_name, source_type, destination_name, destination_type from paths where not online='YES'

6. List all locked nodes.

select node_name from nodes where not locked='NO'

7. List all non-writeable tape and disk volumes.

q v acc=unavail
q v acc=reado
q v acc=destroyed

select volume_name, read_errors, write_errors from volumes where (read_errors>0 or write_errors>0)

select volume_name from volumes where devclass_name='DISK' and not status='ONLINE'


8. Verify that the library has sufficient scratch volumes.

select library_name,status,count(*) as "VOLUMES" from libvolumes group by library_name,status

9. Verify that the database extension and reduction values are non-zero and that the Cache Hit Ratio is above 99%.

q db f=d

10. Verify that the recovery log extension and reduction values are non-zero and that the Wait Percentage is zero.

q log f=d


11. Verify that database and recovery log volumes are online and synchronized.

q dbv f=d
q logv f=d


12. Inspect TSM database fragmentation level.

select cast((100 - (cast(max_reduction_mb as float) * 256 ) / (cast(usable_pages as float) - cast(used_pages as float) ) * 100) as decimal(4,2)) as PERCENT_FRAG from db

13. Verify that the scheduled database backups completed successfully.

select date (date_time) as date, time(date_time) as time, volume_name, type from volhistory where type in ('BACKUPFULL', 'BACKUPINCR', 'DBSNAPSHOT', 'DBDUMP')

14. Verify that all CLIENT schedules for the last day succeeded.

q ev * * begind=-1 endd=today begint=00:00:00 endt=00:00:00

To restrict the listing to only those nodes with non-completed status:

q ev * * begind=-1 endd=today begint=00:00:00 endt=00:00:00 ex=y

15. Verify that all ADMINISTRATIVE schedules for the last day succeeded.

q ev * t=a begind=-1 endd=today begint=00:00:00 endt=00:00:00

To restrict the listing to only those schedules with non-completed status:

q ev * t=a begind=-1 endd=today begint=00:00:00 endt=00:00:00 ex=y

16. Check the activity log for error messages.

q actl search=AN?????E begind=-1 begint=00:00 endd=today endt=00:00

17. Open files and other missed files will often not have the schedule name in activity log error messages. This query will list these files:

select nodename,date_time,message from actlog where (date_time>current_timestamp-1 day) and msgno in (4005,4007,4018,4037,4046,4047,4987,4973,4034,4042)


18. List nodes that are not associated with a backup schedule.

select node_name from nodes where node_name not in (select node_name from associations)

19. Cross match the TSM node name with the host name or computer name.

select node_name, tcp_address, tcp_name from nodes

20. List PRIMARY POOL volumes that have been checked out of the library.

select volume_name, stgpool_name from volumes where stgpool_name in (select stgpool_name from stgpools where devclass<>'DISK' and pooltype='PRIMARY') and volume_name not in (select volume_name from libvolumes)

21. Checkout all D/R Media for offsite storage.

move drm * wherest=mo tost=va rem=b

22. Verify that all D/R volumes have been checked out.

select volume_name from libvolumes where volume_name in (select volume_name from volumes where stgpool_name in (select stgpool_name from stgpools where devclass<>'DISK' and pooltype='COPY'))

23. Verify that all TSM database backup volumes have been checked out.

select volume_name from libvolumes where last_use='DbBackup'

24. Identify previous offsite volumes that can be recycled to scratch status and checkin the same.

q drm wherest=vaultr
move drm * wherest=vaultr tost=onsite
checki libv library_name checkl=b stat=scr search=b waitt=0


25. Generate a list of unlocked TSM administrator accounts with full system privileges.

select admin_name from admins where not system_priv='No' and locked='No'

26. List TSM Nodes and Client (BA/TDP) versions by platform.

select platform_name as OS, client_os_level as OS_VER, node_name as Node, cast(cast(client_version as char(2)) || '.' || cast(client_release as char(2)) || '.' || cast(client_level as char(2)) || '.' || cast(client_sublevel as char(2)) as char(15)) as "TSM Client" from nodes order by platform_name, "TSM Client", Node

27. Data backed up in the last 24 hours:

select entity, date(start_time) as DATE, time(start_time) as START_TIME, time(end_time) as END_TIME, substr(char(end_time-start_time),3,8) as DURATION, cast((bytes/1024/1024/1024) as decimal(18,2)) as GB_BACKED_UP, successful from summary where start_time>=current_timestamp-24 hours and activity='BACKUP' order by entity

28. Size and duration of archive operations for each node in the last 24 hours:

select entity as "Node Name ", cast(sum(bytes/1024/1024) as decimal(10,3)) as "Total MB", substr(cast(min(start_time) as char(26)),1,19) as "Date/Time ", cast(substr(cast(max(end_time)-min(start_time) as char(20)),3,8) as char(8)) as "Length " from summary where start_time>=current_timestamp-24 hours and activity='ARCHIVE' group by entity

29. Compare PRIMARY and COPY pool occupancy totals.

select sum(num_files) as num_of_files,sum(physical_mb) as Physical_mb,sum(logical_mb) as logical_mb from occupancy where stgpool_name in (select stgpool_name from stgpools where pooltype='PRIMARY')

select sum(num_files) as num_of_files,sum(physical_mb) as Physical_mb,sum(logical_mb) as logical_mb from occupancy where stgpool_name in (select stgpool_name from stgpools where pooltype='COPY')
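
Several of the checks above can be collected into one report with a small wrapper around dsmadmc. The sketch below is illustrative only: run_check and daily_report are hypothetical names, DSM_CMD is assumed to be a working dsmadmc invocation with placeholder credentials, and only four of the checklist items are included. Note that dsmadmc can return a non-zero exit code simply because a query matched nothing, so the exit code is reported rather than treated as a failure.

```bash
# Assumption: DSM_CMD is a working dsmadmc invocation (credentials are placeholders).
DSM_CMD=${DSM_CMD:-"dsmadmc -id=admin -pa=secret -datao=y"}

# Run one check ($2) and print its output under a heading ($1).
run_check() {
  local title="$1" query="$2"
  echo "== $title =="
  $DSM_CMD "$query" || echo "(dsmadmc exit code $?)"
  echo
}

# A few items from the checklist; extend with the remaining queries as needed.
daily_report() {
  run_check "Offline drives" "select drive_name from drives where not online='YES'"
  run_check "Offline paths" "select source_name, destination_name from paths where not online='YES'"
  run_check "Locked nodes" "select node_name from nodes where not locked='NO'"
  run_check "Unavailable volumes" "q v acc=unavail"
}
```

Running `daily_report > report.txt` from cron would leave a dated file to review each morning; report.txt is an arbitrary name.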




Running a TSM Library Audit

The AUDIT LIBR command synchronizes the TSM server’s library volume inventory with volumes that are physically located in an automated library. If TSM detects inconsistencies, it updates its inventory to reflect the current state of the library: missing volumes are removed from the server inventory list (q libv). The server does not automatically add new volumes; you must check in new volumes with the CHECKIN LIBVOLUME command.
It is usually a good idea to make sure the library is inactive before running a library audit:

1. Use the DISABLE SE command to prevent starting new client node sessions.
2. Use the QUERY SE command to identify any existing administrative and client node sessions.
3. Use the CANCEL SE command to cancel any existing administrative or client node sessions.
4. Use the Q PR command to identify active background processes.
5. Use the CANCEL PR command to cancel any active background processes.
6. Use the Q MO command to identify the status of any mounted tape volumes.
7. Use the DISMOUNT VOL command to dismount idle volumes.

With the library inactive, run the AUDIT LIBR command with the switch CHECKL=B (library_name below stands for the name of your library). This switch is optional, but it makes the audit run much faster: the robot scans the barcode labels of all tapes, and TSM mounts a tape to read its label only if the barcode label is missing or cannot be read.

AUDIT LIBR library_name CHECKL=B

The default action is to mount each tape to identify the volume. The audit runs until all tapes are dismounted.

Lastly, check in any new volumes that the audit process may discover (first SCRATCH volumes, then PRIVATE volumes), replacing library_name with the name of your library:

CHECKIN LIBV library_name CHECKL=B STAT=SCR SEARCH=Y WAITT=0

CHECKIN LIBV library_name CHECKL=B STAT=PRI SEARCH=Y WAITT=0

End the process by running the ENABLE SE command to enable new client node sessions.
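
CANCEL PR takes one process number at a time, so cancelling several background processes before an audit means issuing it repeatedly. A small helper can turn a list of process numbers into the matching commands. This is a sketch under assumptions: cancel_process_cmds is a hypothetical name, and the numbers are assumed to come one per line, for example from `select process_num from processes`.

```bash
# Read process numbers (one per line) on stdin and print a CANCEL PROCESS
# command for each; blank lines and anything that is not a plain number
# are skipped.
cancel_process_cmds() {
  local num
  while read -r num; do
    case "$num" in
      ''|*[!0-9]*) continue ;;
      *) echo "cancel process $num" ;;
    esac
  done
}
```

The printed commands can be reviewed first and then run through the administrative client, which is safer than cancelling everything blindly.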



Sunday, April 12, 2009

Halting the TSM Server

The HALT command forces an abrupt shutdown, which cancels all the administrative and client node sessions even if they are not completed. Any transactions in progress interrupted by the HALT command are rolled back when you restart the server.

Use the HALT command only after the administrative and client node sessions are completed or cancelled. To shut down the server without severely impacting administrative and client node sessions, perform the following steps:

  1. Use the DISABLE SE command to prevent starting new client node sessions.
  2. Use the QUERY SE command to identify any existing administrative and client node sessions.
  3. Use the CANCEL SE command to cancel any existing administrative or client node sessions.
  4. Use the Q PR command to identify active background processes.
  5. Use the CANCEL PR command to cancel any active background processes.
  6. Use the Q MO command to identify the status of any mounted tape volumes.
  7. Use the DISMOUNT VOL command to dismount idle volumes.
  8. With no existing administrative and client node sessions, no active background processes and no mounted volumes, run the HALT command to shut down the TSM server.
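
The condition in step 8 can be written down as a simple guard for a shutdown script. This is only a sketch: safe_to_halt is a hypothetical name, and the three counts are assumed to have been gathered beforehand (for example with `select count(*) from sessions`, `select count(*) from processes`, and by counting the volumes that Q MO reports). One session is allowed because the administrator running the check is itself a session.

```bash
# Succeeds only when nothing but the checking administrator's own session
# remains: $1 = session count, $2 = process count, $3 = mounted volume count.
safe_to_halt() {
  local sessions="$1" processes="$2" mounts="$3"
  [ "$sessions" -le 1 ] && [ "$processes" -eq 0 ] && [ "$mounts" -eq 0 ]
}
```

A script could then run something like `safe_to_halt "$s" "$p" "$m" && echo "halt"` and leave the final HALT to the operator.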
