02 January 2013

Repairing Suspect SQL Database


I'm not a DBA but I have a real life experience repairing a suspect databases twice. I guess I can say that I was unlucky at that time as I was presented in a tough situation where I was still in the process of learning how things work and we didn't have a DBA resource (remember I am a developer), but still consider it lucky as even some experienced DBAs might not have even experienced this situation. I experienced a similar situation before, and restoring it was easy before as we have, but that was easier as I have the proper restore points.
Now in this situation, the backups had failed so the last clean backup we had was 2 weeks old, so I could only rely on that one plus the current transaction log which is not backed up for now. So the first thing that came to my mind was to backup the transaction log so I could easily restore the full lot. I tried to backup the transaction log, but I had to remove the suspect status first by running this TSQL:
sp_resetstatus 
Now it was not a suspect anymore, I was hoping to run the backup of transaction log by using this TSQL:
BACKUP LOG 
TO DISK='X:\YourFolder\YourFileNameLOG.bak',
But to my surprise, it did not work as the database went back to a suspect status. I repeated the process again hoping that it would work but got no luck out of it, with further research its suspect status will always go back when the corruption is severe, so the next thing that came into my mind was that the disk might fail more so I have copied everything to a remote disc (MDF, LDF and the Backup), so I have to kill all connections to it and stop the database so that I can copy the files properly. After waiting for a long time around 15 minutes (remember the file is huge), it presented me an error which I cannot further, I cannot remember the exact error anymore but it's similar to copying files from a badly scratched CD or DVD, the error is in the lines of an I/O error (Oh sh*t!). So I went ahead and checked the event logs and to my surprise, here is what I got!
Everything is in red nearly all 10K events, don't ask me why this happened without someone noticing (remember I am the developer).
The message is almost the same from event to event, the HEX part is almost repetitive, and looks like it's using the same 20 values all over again.
Error: 823, Severity: 24, State: 3
I/O error 2(error not found) detected during write
at offset 0x000001d527a000
in file 'E:\Program Files\Data\MyDatabase.mdf'.
So moving ahead, the first thing that came to my mind was to check the disk and defragment as it might help as the file system would be repaired if there were errors that are repairable and I was hoping that the data would be moved from the offset mentioned logs to a better location on the drive. It took nearly 6 hours doing it as the data was large and drive was large, but after that I now tried copying the files I mentioned above and it worked, now I have a clean copy in a good disk of the MDF, LDF and the Backup which gave me peace of mind, as I know I can restore from those (but using those is not easy as the LDF is not a backup copy).
From there knowing that the disk was repaired by the checkdisk and defragmentation, I can now backup the transaction logs and the database which did not give me much problems anymore. Also having that in mind, I can now perform a full backup of the database which I can restore on a new server. And this is the TSQL I used:
BACKUP DATABASE 
TO DISK='X:\YourFolder\YourFileNameDB.bak',
That worked as well, so the checkdisk and defragmentation really helped.
Now I have a real backup, next is to restore it on a new machine which again did not give me a lot of issues until…
Once the applications start connecting to the database, I checked the logs and to my surprise the same errors (Error: 823) appeared, now I tried contacting some of the DBAs I know and they could not give me a convincing answer, so I just trusted my gut feeling and did it on my own with the help of Google. Now I found out that the error might be a remnant of the old database which was resorted in that state so the best thing was to repair it before applications start using it so I repaired it using this TSQL Command:
USE [master]
GO
ALTER DATABASE  SET SINGLE_USER WITH ROLLBACK IMMEDIATE
GO
ALTER DATABASE  SET SINGLE_USER
GO
DBCC CHECKDB , REPAIR_REBUILD) WITH ALL_ERRORMSGS, NO_INFOMSGS
GO
ALTER DATABASE  SET MULTI_USER
GO
So what it does is, you set the database to be used only by you then you repair and rebuild the indexes, suppressing all information messages except for errors, then set the database back for multi user use.
There are a lot of repair options and this is the most time consuming one so you need a lot of patience for at least 6 hours for a huge database, there are some other options and here they are defined.
Here is the complete list of options:
  1. REPAIR_REBUILD – Performs all repairs done by REPAIR_FAST and includes time-consumingrepairs while rebuilding indexes. This will not have data loss and this was my first option which worked for me.
  2. REPAIR_FAST – Performs minor, nontime-consuming repair actions without rebuilding the indexes and will not have a data loss as well.
  3. REPAIR_ALLOW_DATA_LOSS - Performs all repairs done by REPAIR_REBUILD but you allow data loss as the irrepairable corrupted text will be deleted, this was my last resort.
So after running that script, here is my result:
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, 
 alloc unit ID 0 (type Unknown), page ID (1:2425877) 
 contains an incorrect page ID in its page header. 
 The PageId in the page header = (0:0).
 The error has been repaired.
CHECKDB found 0 allocation errors and 1 consistency errors not associated 
 with any single object.
CHECKDB fixed 0 allocation errors and 1 consistency errors not associated 
 with any single object.
Repair: The Clustered index successfully rebuilt for the object 
 "dbo.MySampleTable" in database "MyDatabase".
Repair: The page (1:2425877) has been deallocated from object ID 462321894, 
 index ID 1, partition ID 311773704355840, alloc unit ID 30298727645184 
 (type In-row data).
Msg 8945, Level 16, State 1, Line 1
Table error: Object ID 462321894, index ID 1 will be rebuilt.
 The error has been repaired.
Msg 8928, Level 16, State 1, Line 1
Object ID 462321894, index ID 1, partition ID 311773704355840, 
 alloc unit ID 30298727645184 (type In-row data): 
 Page (1:2425877) could not be processed. See other errors for details.
 The error has been repaired.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 462321894, index ID 1, partition ID 311773704355840, 
 alloc unit ID 311773704355840 (type In-row data). 
 Page (1:2425877) was not seen in the scan although its parent (1:2425841) 
 and previous (1:2425876) refer to it. Check any previous errors.
 The error has been repaired.
CHECKDB found 0 allocation errors and 2 consistency errors in table 'MySampleTable' 
 (object ID 462321894).
CHECKDB fixed 0 allocation errors and 2 consistency errors in table 'MySampleTable' 
 (object ID 462321894).
CHECKDB found 0 allocation errors and 3 consistency errors in database 'MyDatabase'.
CHECKDB fixed 0 allocation errors and 3 consistency errors in database 'MyDatabase'.
Looks good from that point as it repaired any errors it found, so the last thing to do is backup your spanking shiny new repaired database in case something happens and create a good maintenance plans and backup.
So to summarize, here are the steps to follow in repairing the suspect database if the culprit was an I/O error:
  1. Kill all connections and stop SQL Server
  2. Perform a Checkdisk with repair option
  3. Perform a Defragmentation
  4. Now copy the files necessary (MDF, LDF and Backups)
  5. Perform a backup
  6. Restore the backup
  7. Repair the Database (DBCC CHECKDB)
  8. Backup the repaired DB
  9. Have fun!

No comments: