Solving a disk outage situation during an online backup in DS5.2

Teknoloji

10 Jun 2009
Problem description

R3 replica run out of disk space on the 32GB transaction logs
partition /DB/txlogs while a nightly online backup was taking place
during the night of April 14th. More than 3200 transaction logs were
existing at the time the system run out of disk. The disk outage did
not impact the DB partition, which had enough room for both the DB and
the backup to be stored.

Analysis

An analysis was first executed on R3 errors log output to understand why the online backup in R3 had failed:

This is the start of the backup at 2am… So far so good! The very first transaction log recorded at 2:00am will be log.0000241693:

[15/May/2009:02:00:05 +0200] - Backup starting (/DATA/backup/DB_TO_BAK/last_backup)
[15/May/2009:02:00:06 +0200] - Multiple pool memory management enabled.
[15/May/2009:02:00:07 +0200] - Backing up file 1 (/DATA/backup/DB_TO_BAK/last_backup/UVA/uva_id2entry.db3)
[15/May/2009:02:00:07 +0200] - Backing up file 2 (/DATA/backup/DB_TO_BAK/last_backup/UVA/uva_entrydn.db3)

At 03:59, the server should have finished the backup. Unfortunately, a transaction log/s which was present but not processed at the time the backup was executed, log.0000241676, has been in the meantime processed by the DS compromising the backup, hence the backup needs to be restarted (this is a known behavior, an RFE is opened to change this):

[15/May/2009:03:59:08 +0200] - Backing up file 92 (/DATA/backup/BINARIO/DB_TO_BAK/last_backup/cldb/ou_puro_o_siux47e852bf000002bd0000.db3)
[15/May/2009:03:59:08 +0200] - Backing up file 93 (/DATA/backup/BINARIO/DB_TO_BAK/last_backup/cldb/ou_uva_o_siux486973b7000003230000.db3)
[15/May/2009:03:59:08 +0200] - WARNING<20510> - Backend Database - conn=-1 op=-1 msgId=-1 -
                               WARNING: Log log.0000241676 has been swiped out from under me! (retrying)
[15/May/2009:03:59:08 +0200] - Backing up file 94 (/DATA/backup/BINARIO/DB_TO_BAK/last_backup/UVA/uva_id2entry.db3)
[15/May/2009:03:59:08 +0200] - Backing up file 95 (/DATA/backup/BINARIO/DB_TO_BAK/last_backup/UVA/uva_entrydn.db3)

At 05:50, once again at the end of the second try for a good backup, we hit another issue while copying one of the transaction logs which has been created since 2:00 (I can’t deduce WHY we have an error on copying log.0000242125, but it is clear these transaction log is the transaction log #433 since the backup started (for as the first transaction log is log.0000241693) and the backup is restarted once again with the message "Backing up file 620 (186+433+1):

[15/May/2009:05:50:38 +0200] - Backing up file 186 (/DATA/backup/DB_TO_BAK/last_backup/cldb/ou_uva_o_aux486973b7000003230000.db3)
[15/May/2009:05:52:31 +0200] - WARNING<20511> - Backend Database - conn=-1 op=-1 msgId=-1 -
                               Error copying file '/BD/txlog/log.0000242125' (err=-1). Starting over...
[15/May/2009:05:52:31 +0200] - Backing up file 620 (/DATA/backup/DB_TO_BAK/last_backup/UVA/uva_id2entry.db3)
[15/May/2009:05:52:31 +0200] - Backing up file 621 (/DATA/backup/DB_TO_BAK/last_backup/UVA/uva_entrydn.db3)

All these restarts have been to much for the transaction logs partition disk space… Only 2 minutes afterwards, we have run out of disk in the transaction logs partition and from that moment onwards nothing will ever get written in the DB:

[15/May/2009:05:52:31 +0200] - Backing up file 631 (/DATA/backup/DB_TO_BAK/last_backup/userRoot/userRoot_id2entry.db3)
[15/May/2009:05:54:54 +0200] - WARNING<20993>  - conn=-1 op=-1 msgId=-1 - disk space under /BD/txlog is low
[15/May/2009:05:55:14 +0200] - ERROR<20994>  - conn=-1 op=-1 msgId=-1 -  disk full under /BD/txlog
[15/May/2009:05:55:16 +0200] - WARNING<10249> - Incremental Protocol - conn=-1 op=-1 msgId=-1 -
Internal error  Failed to replay LDAP operation with csn 4a0cf65a000000150000 for replica o=aux,
got error: DSA is unwilling to perform (53).

By looking at the data above, some evidences can be written:

  • R3 run out of disk space in the transaction logs partition and
    that has rendered the replica completely non capable to process any
    further update operation (as proven by the replication error messages
    from 05:55:16 onwards). It is therefore a smart move to stop R3
    immediately and proceed with a database recovery
  • this is the first time this problem (online backup requiring to be
    restarted several times up until the transaction log partition
    exploses) has been hit in one year. Possibly the execution of the
    maintenance task only one hour before this online backup started may have had an influence on this event, although that remains to be proven
Resolution

Customer allowed our P+ team to stop R3 and run a recovery work immediately
on Friday 15th right during  a long weekend window. For this, the
following was planned and executed succesfully:

  • stop R3 (Friday 15, 14:25)
  • remove the DB cache and region files under nsslapd-db-home-directory (/tmp/JESdb) to force a database recovery at next start up
  • restart R3 (Friday 15, 14:40) => the DB recovery starts and
    we start monitor ing how the recovery moves on over time. For this, we
    start running iteratively pfiles on the nsslapd PID and monitor the
    inodes opened by the process inside the transaction logs partition to
    verify that the recovery is running through each and every existing
    transaction log from the newest till the oldest at a rate that is close
    to 5-6 tx logs/minute at the beggining of the recovery and degrades
    over time to 1 tx-log per minute at the end, about 21 hours later
  • Saturday 16, 11:30 => the DB recovery ends the iteration above
    and starts a second iteration on the transaction log files, this time
    from the oldest to the youngest, at a much faster rate of 50-70 tx
    logs/minute.
  • Saturday 16, 13:14 => the DB recovery second iteration phase
    ends and the errors log stops writing any further status message (the
    last one written is "[15/May/2009:14:37:25 +0200] - Database recovery
    is 97% complete."), the CPU stabilizes at 25% of usage with nsslapd
    still in a single-thread
  • Saturday 16, 14:30 => as the status is still the same (no
    further messages in the errors log since 13:14, CPU stable at 25%), we
    start collecting evidences on the process itself. truss shows nothing
    at all, not even "sleeping" messages. pstack alway shows the following
    type of stack, most times looping on several different adresses on
    __memp_close, a bucket of adresses that overtime we detect to be always
    the same ones
        18:03:56 [root@PERA3]: /BD/userRoot # pstack 24623
        24623:  ./ns-slapd -D /var/opt/mps/serverroot/slapd-pera -i /var/opt/mps/serve
         0000000100036bd8 __memp_close (10076e3b0, 0, 0, 0, 0, 0) + 94
         0000000100021fc0 ???????? (10076e3b0, 100750070, 1004d03d0, 0, 5ade9b, 478)
         0000000100021adc __dbenv_open (10076e3b0, 0, 4000, 0, 7821, 40020) + 55c
         ffffffff7653760c dblayer_start (1003b6ea0, c00, f78, 1006b9080, 1003b5d18, 100021160) + 63c
         ffffffff76530840 ldbm_back_start (100785030, 1003b6ea0, 1003b6ea0, 0, 1001c6570,...
         ffffffff7ec86bf0 ???????? (1c00, 1f30, 100785030, 5400, 57b8, 10000)
         ffffffff7ec865c8 ???????? (57b8, ffffffff7ededa88, 35, ffffffff7edfe2a0, 2, 1f00)
         ffffffff7ec86a38 plugin_startall (5, ffffffff7ffffaf8, 1, 1, c80, 7d00) + 30
         00000001000195b4 main (0, 0, 185, ffffffff7ffffaf8, 1001bf550, 1001c1c48) + bf0
         0000000100010e7c _start (0, 0, 0, 0, 0, 0) + 17c
  • Saturday 16, 23:30 => situation is still the same one, same
    pstacks, same CPU usage, same absence of log messages. we are in a
    comfrotable position to deduce that an infinite loop on __memp_close is
    in place. stop-slapd
  • Saturday 16, 23:32 => without removing any db cache nor region
    files, start again R3. We decide to try our chances by running a new
    recovery to verify whether that new recovery would start from scratch
    the 23 hour long transaction logs iteration or if it would go directly
    into memp_close loop mode. And we are lucky, the DB recovery goes
    directly into memp_close loop without iterating through the transaction
    logs. This is excellent news, it means that, whichever the problem is
    regarding the memp_close loop, it has nothing to do with the Database
    integrity, which seems already ensured (no need to run over the
    transaction logs any longer). Could the memp_close loop be related to
    the lack of disk space in the transaction log partition instead?
  • Saturday 16, 23:35 => stop the server once again and remove the
    first 3000 transaction logs to empty the disk space in the transaction
    logs partition. We can safely do this now that we have proved via a
    second DB recovery that there was no longer need to iterate through
    them. We live however the last 200 files transaction logs untouched to
    be sure the server will be able to find out about its last checkpoint.
  • Saturday 16, 23:40 => restart the server once again, it enters
    for the third time in DB recovery mode, and this time leaves such mode
    in a couple of seconds and switches to regular mode immediately
    afterwards. A dozen of seconds afterwards, R3 starts receiving
    successfully replicated operations and the old 200 transaction logs are
    automatically trimmed by the DS instance in the next minute. Problem
    solved.

Source/Kaynak : http://blogs.sun.com/marcos/entry/solving_a_disk_outage_situation

Comment Form

Content In Different Language


Recent Comments


  • Jim Dougherty: You can fix Solaris 8 named_to_major, path_to_inst, drivers_alias errors on boot by simply installin [...]
  • psha: doesn't work [...]
  • Jiji joseph: Can you please let me know how can I get the SRMTools ? [...]
  • Sebastian: Hi, I don't think using a suite will work either. The order is also random. It is just a coincide [...]
  • Henry: Hey, I can't seem to get this working on my mac. The page down works if I put the focus on the wind [...]
  • Our Scores