center of tech

R3 replica run out of disk space on the 32GB transaction logs
partition /DB/txlogs while a nightly online backup was taking place
during the night of April 14th. More than 3200 transaction logs were
existing at the time the system run out of disk. The disk outage did
not impact the DB partition, which had enough room for both the DB and
the backup to be stored.
An analysis was first executed on R3 errors log output to understand why the online backup in R3 had failed:
This is the start of the backup at 2am… So far so good! The very first transaction log recorded at 2:00am will be log.0000241693:
[15/May/2009:02:00:05 +0200] - Backup starting (/DATA/backup/DB_TO_BAK/last_backup) [15/May/2009:02:00:06 +0200] - Multiple pool memory management enabled. [15/May/2009:02:00:07 +0200] - Backing up file 1 (/DATA/backup/DB_TO_BAK/last_backup/UVA/uva_id2entry.db3) [15/May/2009:02:00:07 +0200] - Backing up file 2 (/DATA/backup/DB_TO_BAK/last_backup/UVA/uva_entrydn.db3)
At 03:59, the server should have finished the backup. Unfortunately, a transaction log/s which was present but not processed at the time the backup was executed, log.0000241676, has been in the meantime processed by the DS compromising the backup, hence the backup needs to be restarted (this is a known behavior, an RFE is opened to change this):
[15/May/2009:03:59:08 +0200] - Backing up file 92 (/DATA/backup/BINARIO/DB_TO_BAK/last_backup/cldb/ou_puro_o_siux47e852bf000002bd0000.db3)
[15/May/2009:03:59:08 +0200] - Backing up file 93 (/DATA/backup/BINARIO/DB_TO_BAK/last_backup/cldb/ou_uva_o_siux486973b7000003230000.db3)
[15/May/2009:03:59:08 +0200] - WARNING<20510> - Backend Database - conn=-1 op=-1 msgId=-1 -
WARNING: Log log.0000241676 has been swiped out from under me! (retrying)
[15/May/2009:03:59:08 +0200] - Backing up file 94 (/DATA/backup/BINARIO/DB_TO_BAK/last_backup/UVA/uva_id2entry.db3)
[15/May/2009:03:59:08 +0200] - Backing up file 95 (/DATA/backup/BINARIO/DB_TO_BAK/last_backup/UVA/uva_entrydn.db3)
At 05:50, once again at the end of the second try for a good backup, we hit another issue while copying one of the transaction logs which has been created since 2:00 (I can’t deduce WHY we have an error on copying log.0000242125, but it is clear these transaction log is the transaction log #433 since the backup started (for as the first transaction log is log.0000241693) and the backup is restarted once again with the message "Backing up file 620 (186+433+1):
[15/May/2009:05:50:38 +0200] - Backing up file 186 (/DATA/backup/DB_TO_BAK/last_backup/cldb/ou_uva_o_aux486973b7000003230000.db3)
[15/May/2009:05:52:31 +0200] - WARNING<20511> - Backend Database - conn=-1 op=-1 msgId=-1 -
Error copying file '/BD/txlog/log.0000242125' (err=-1). Starting over...
[15/May/2009:05:52:31 +0200] - Backing up file 620 (/DATA/backup/DB_TO_BAK/last_backup/UVA/uva_id2entry.db3)
[15/May/2009:05:52:31 +0200] - Backing up file 621 (/DATA/backup/DB_TO_BAK/last_backup/UVA/uva_entrydn.db3)
All these restarts have been to much for the transaction logs partition disk space… Only 2 minutes afterwards, we have run out of disk in the transaction logs partition and from that moment onwards nothing will ever get written in the DB:
[15/May/2009:05:52:31 +0200] - Backing up file 631 (/DATA/backup/DB_TO_BAK/last_backup/userRoot/userRoot_id2entry.db3) [15/May/2009:05:54:54 +0200] - WARNING<20993> - conn=-1 op=-1 msgId=-1 - disk space under /BD/txlog is low [15/May/2009:05:55:14 +0200] - ERROR<20994> - conn=-1 op=-1 msgId=-1 - disk full under /BD/txlog [15/May/2009:05:55:16 +0200] - WARNING<10249> - Incremental Protocol - conn=-1 op=-1 msgId=-1 - Internal error Failed to replay LDAP operation with csn 4a0cf65a000000150000 for replica o=aux, got error: DSA is unwilling to perform (53).
By looking at the data above, some evidences can be written:
Customer allowed our P+ team to stop R3 and run a recovery work immediately
on Friday 15th right during a long weekend window. For this, the
following was planned and executed succesfully:
18:03:56 [root@PERA3]: /BD/userRoot # pstack 24623
24623: ./ns-slapd -D /var/opt/mps/serverroot/slapd-pera -i /var/opt/mps/serve
0000000100036bd8 __memp_close (10076e3b0, 0, 0, 0, 0, 0) + 94
0000000100021fc0 ???????? (10076e3b0, 100750070, 1004d03d0, 0, 5ade9b, 478)
0000000100021adc __dbenv_open (10076e3b0, 0, 4000, 0, 7821, 40020) + 55c
ffffffff7653760c dblayer_start (1003b6ea0, c00, f78, 1006b9080, 1003b5d18, 100021160) + 63c
ffffffff76530840 ldbm_back_start (100785030, 1003b6ea0, 1003b6ea0, 0, 1001c6570,...
ffffffff7ec86bf0 ???????? (1c00, 1f30, 100785030, 5400, 57b8, 10000)
ffffffff7ec865c8 ???????? (57b8, ffffffff7ededa88, 35, ffffffff7edfe2a0, 2, 1f00)
ffffffff7ec86a38 plugin_startall (5, ffffffff7ffffaf8, 1, 1, c80, 7d00) + 30
00000001000195b4 main (0, 0, 185, ffffffff7ffffaf8, 1001bf550, 1001c1c48) + bf0
0000000100010e7c _start (0, 0, 0, 0, 0, 0) + 17c
Source/Kaynak : http://blogs.sun.com/marcos/entry/solving_a_disk_outage_situation