Sunday, March 4, 2012

Troubleshooting a NetApp SnapMirror problem

Last night I started getting Nagios alerts that one of my SnapMirror replication sessions was falling way behind its normal schedule. The source filer is a critical part of every web application we run, and this failing SnapMirror session is critical to our disaster recovery plan.
I started by checking 'Filer At-A-Glance' on the source and destination filers, along with my MRTG charts. CPU utilization was very high on the source filer, and it was throwing syslog errors every minute saying "[pipeline_3:notice]: snapmirror: Network communication error" and "SnapMirror source transfer from srcname to dstname: transfer failed." The CPU was at 45% - 50% when it normally runs in the 3% - 8% range. The destination filer was throwing syslog errors about failed transfers and about destination snapshots that were in use and couldn't be deleted, for example: "[snapmirror.dst.snapDelErr:error]: Snapshot dst-netapp(543534345435)_volname.12313 in destination volume volname is in use, cannot delete."
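For reference, the quick health checks from the CLI looked something like this (the snapmirror status output here is illustrative, not a capture from my filers):

src-netapp> sysstat -x 1
(watch the CPU column; mine was pinned at 45-50% instead of the usual single digits)
src-netapp> snapmirror status
Snapmirror is on.
Source               Destination          State         Lag        Status
src-netapp:volname   dst-netapp:volname   Snapmirrored  10:32:05   Pending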

The high CPU was really worrying; I was picturing the additional load of Monday morning web site traffic causing the source NetApp to fall over dead. I let it run overnight, thinking the CPU load might be caused by a batch job, sort of hoping it would clear up by morning and the SnapMirror would recover once the load dropped. Unfortunately, I woke up to the exact same situation.

This morning I started with some triage. The first task was to get the CPU load down so we could at least function tomorrow. Since this all seemed related to SnapMirror, I started by turning off SnapMirror altogether on both filers; the command to do this is 'snapmirror off'. Voilà, the CPU load on the source filer immediately dropped back to normal levels. Now, worst case, we could at least function come Monday morning, even if that meant not having the DR site up to date. I also disabled the SnapMirror schedule for this particular job, which normally runs every fifteen minutes. With the schedule for the troublesome job disabled, I turned SnapMirror back on (snapmirror on).
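For the record, the sequence went something like this. In 7-mode the schedule lives in /etc/snapmirror.conf on the destination filer; the schedule line below is illustrative:

src-netapp> snapmirror off
dst-netapp> snapmirror off
(comment out the schedule line for the troublesome job in /etc/snapmirror.conf on dst-netapp)
#src-netapp:volname dst-netapp:volname - 0,15,30,45 * * *
src-netapp> snapmirror on
dst-netapp> snapmirror on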

Next I wanted to test SnapMirror on a separate volume to see whether SnapMirror was generally broken or whether the problem was specific to that particular volume. To do this I created a new small (1GB) volume on both the source and destination filers, copied a few hundred megs of data into it, and set up SnapMirror for the volume. I hit Initialize and it worked just fine. Hmmm... OK, so SnapMirror works. It must be a problem with that particular volume.
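From the CLI the test looked roughly like this (I actually clicked Initialize in FilerView, which does the same thing as running 'snapmirror initialize' on the destination; the aggregate and volume names are placeholders):

src-netapp> vol create testvol aggr0 1g
(copy a few hundred megs of data into testvol)
dst-netapp> vol create testvol aggr0 1g
dst-netapp> vol restrict testvol
dst-netapp> snapmirror initialize -S src-netapp:testvol dst-netapp:testvol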

From my CPU chart I had a pretty good idea of what time things went wrong. I was starting to suspect that some strange combination of events around 7pm had interrupted or damaged the SnapMirror transfer that ran at that time. Looking at the snapshots on both the source and destination filers, I saw that I still had a couple of hourly snapshots and one nightly snapshot in common between both filers from before the CPU spike. So my next plan was to delete snapshots on the destination filer until it was rolled back to a point prior to the CPU spike. I had to break the SnapMirror relationship to make the destination filer read/write (snapmirror break volname), then delete the snapshots taken after, and just before, the spike. That set replication on the destination filer back to a point in time before the problem. I was desperately trying to avoid starting over with a full sync of 125GB of data! Fingers crossed, I ran resync (on the destination filer!) to reestablish the SnapMirror relationship.
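The rollback steps went something like this (the snapshot names are illustrative; the ones to delete are whatever exists on your destination that is newer than the last common snapshot):

dst-netapp> snap list volname
(compare with 'snap list volname' on src-netapp to find the last common snapshot)
dst-netapp> snapmirror break volname
dst-netapp> snap delete volname hourly.0
dst-netapp> snap delete volname nightly.0

And then the resync itself: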

dst-netapp> snapmirror resync dst-netapp:volname
The resync base snapshot will be: hourly.1
These older snapshots have already been deleted from the source and will be deleted from the destination:
    hourly.4
    hourly.5
    nightly.1
Are you sure you want to resync the volume? y
Sun Mar  4 11:43:26 CST [snapmirror.dst.resync.info:notice]: SnapMirror resync of volname to src-netapp:volname is using hourly.1 as the base snapshot.
Volume volname will be briefly unavailable before coming back online.
Sun Mar  4 11:43:28 CST [wafl.snaprestore.revert:notice]: Reverting volume volname to a previous snapshot.
Sun Mar  4 11:43:28 CST [wafl.vol.guarantee.replica:info]: Space for replica volume 'volname' is not guaranteed.
Revert to resync base snapshot was successful.
Sun Mar  4 11:43:28 CST [snapmirror.dst.resync.success:notice]: SnapMirror resync of volname to src-netapp:volname successful.
Transfer started.
Monitor progress with 'snapmirror status' or the snapmirror log.

Phew!! Success! CPU is at 3% and SnapMirror is working again.
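A quick status check confirmed it (again, illustrative output):

dst-netapp> snapmirror status
Snapmirror is on.
Source               Destination          State         Lag       Status
src-netapp:volname   dst-netapp:volname   Snapmirrored  00:03:12  Idle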

The error messages the source filer was throwing were less than helpful. Network communication error? WTH? The problem had nothing to do with networking; it must be a generic error message for any failed SnapMirror transfer. The destination filer's syslog messages gave me the best clue to the problem. One word of caution: make sure you run the resync command on the correct filer! In this situation, if I had run the resync command on the source filer, it would have reverted the source volume back to the previous afternoon, losing all the data in between. A quick 'snapmirror status' shows which filer is the source and which is the destination for each relationship, so check it before you hit Enter.
