Saturday, March 17, 2012

Measuring light levels and temperature with a pyMCU board

My pyMCU board running a light sensor and a Maxim DS18B20 digital temperature sensor.

Detailed logging for chrooted sftp users


At work we have been migrating some of our customers from ftp to sftp. This gives us and the customer better security, but one drawback of my initial sftp setup was that we didn't have the detailed logs that most ftp servers produce. All we were getting were records of logins and disconnects; we had no information on what a client was doing once they were connected: file uploads, file downloads, etc. I had some time this morning to take a look at this, so I started with some Google searches for 'sftp logging'.

I found a lot of blog posts saying that all you had to do was change this line in sshd_config:
ForceCommand internal-sftp
to:
ForceCommand internal-sftp -l VERBOSE
I tried this but didn't get any additional logging. What I finally figured out is that the logging setup for chrooted sftp is a bit more involved. I ran across this blog post on bigmite.com, which spells out what needs to be done quite clearly. The meat of the problem is that the chrooted sftp process can't open /dev/log because it is not within the chrooted filesystem. An additional layer of complexity is that my sftp home directories live on an NFS mount. Here are the steps from bigmite.com's blog that I used on my CentOS system.

1. Modify /etc/ssh/sshd_config
Edit /etc/ssh/sshd_config and add -l VERBOSE -f LOCAL6 to the internal-sftp line.
Match group sftpuser
 ChrootDirectory /sftp/%u
 X11Forwarding no
 AllowTcpForwarding no
 ForceCommand internal-sftp -l VERBOSE -f LOCAL6
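After editing sshd_config you will need to restart sshd for the new ForceCommand options to take effect. On CentOS that's just the usual init script; checking the config syntax first doesn't hurt:
# Check sshd_config for syntax errors before restarting
/usr/sbin/sshd -t
# Restart sshd so the new options take effect (existing sessions are not dropped)
service sshd restart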

2. Modify the syslog configuration
If the user's sftp directory is not on the root filesystem, syslog will need to use an additional logging socket within the user's filesystem. For example, /sftp is the separate sftp filesystem (like my setup, with the sftp home directories on an NFS mount). For syslog on Red Hat/CentOS, edit /etc/sysconfig/syslog so that the line:
SYSLOGD_OPTIONS="-m 0"
reads:
SYSLOGD_OPTIONS="-m 0 -a /sftp/sftp.log.socket"
To log the sftp information to a separate file, the syslog daemon needs to be told to log messages for LOCAL6 to /var/log/sftp.log. Add the following to /etc/syslog.conf:
#For SFTP logging
local6.* /var/log/sftp.log
Restart syslog with the command service syslog restart. When syslog starts up, it will create the sftp.log.socket file.
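As a quick sanity check (my own addition, not part of bigmite.com's write-up), you can confirm that the socket was created and that local6 messages are being routed to the new file:
# The extra log socket should exist and show up as a socket (file type 's')
ls -l /sftp/sftp.log.socket
# Send a test message to the local6 facility and make sure it lands in sftp.log
logger -p local6.info "sftp logging test"
tail /var/log/sftp.log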


3. Create links to the log socket
Now you will need to create a link in each user's chrooted home directory so the chrooted sftp process can write to the log. This will also need to be done every time you create a new user.
mkdir /sftp/testuser1/dev
chmod 755 /sftp/testuser1/dev
ln /sftp/sftp.log.socket /sftp/testuser1/dev/log
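If you already have a bunch of sftp users set up, a small shell loop saves some typing. This is just a sketch of my own, and it assumes every directory directly under /sftp is a user's chroot:
# Create dev/ and the log link under every existing chroot directory
for userdir in /sftp/*/ ; do
    mkdir -p "${userdir}dev"
    chmod 755 "${userdir}dev"
    ln -f /sftp/sftp.log.socket "${userdir}dev/log"
done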


And that's it! Now sftp will log everything an sftp user does while connected to your server. Here is a sample of what the logs look like:

Mar 16 15:36:45 sftpsrvname internal-sftp[2449]: session opened for local user sftpusername from [192.168.1.10]
Mar 16 15:36:45 sftpsrvname internal-sftp[2449]: received client version 3
Mar 16 15:36:45 sftpsrvname internal-sftp[2449]: realpath "."
Mar 16 15:37:13 sftpsrvname internal-sftp[2449]: lstat name "/"
Mar 16 15:37:13 sftpsrvname internal-sftp[2449]: lstat name "/"
Mar 16 15:37:13 sftpsrvname internal-sftp[2449]: opendir "/"
Mar 16 15:37:13 sftpsrvname internal-sftp[2449]: closedir "/"
Mar 16 15:37:21 sftpsrvname internal-sftp[2449]: realpath "/backup"
Mar 16 15:37:21 sftpsrvname internal-sftp[2449]: stat name "/backup"
Mar 16 15:37:33 sftpsrvname internal-sftp[2449]: lstat name "/backup"
Mar 16 15:37:33 sftpsrvname internal-sftp[2449]: lstat name "/backup/"
Mar 16 15:37:33 sftpsrvname internal-sftp[2449]: opendir "/backup/"
Mar 16 15:37:33 sftpsrvname internal-sftp[2449]: closedir "/backup/"
Mar 16 15:37:37 sftpsrvname internal-sftp[2449]: open "/backup/testfile" flags WRITE,CREATE,TRUNCATE mode 0664
Mar 16 15:37:37 sftpsrvname internal-sftp[2449]: close "/backup/testfile" bytes read 0 written 288
Mar 16 15:41:45 sftpsrvname internal-sftp[2449]: lstat name "/backup"
Mar 16 15:41:45 sftpsrvname internal-sftp[2449]: lstat name "/backup/"
Mar 16 15:41:45 sftpsrvname internal-sftp[2449]: opendir "/backup/"
Mar 16 15:41:45 sftpsrvname internal-sftp[2449]: closedir "/backup/"
Mar 16 15:42:16 sftpsrvname internal-sftp[2449]: lstat name "/backup/testfile"
Mar 16 15:42:16 sftpsrvname internal-sftp[2449]: remove name "/backup/testfile"
Mar 16 15:42:24 sftpsrvname internal-sftp[2449]: session closed for local user sftpusername from [192.168.1.10]
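Since everything ends up in one flat file, pulling out just the file activity is a quick grep away (my own habit, not something the logging setup requires):
# Show only the file transfers and deletes from the sftp log
grep -E 'open |close |remove ' /var/log/sftp.log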


Sunday, March 4, 2012

Troubleshooting a NetApp Snapmirror problem

Last night I started getting Nagios alerts that one of my Snapmirror replication sessions was falling way behind its normal schedule. The source filer is a critical part of every web application we run, and this failing Snapmirror session is critical to our disaster recovery plan.
I started by checking 'Filer At-A-Glance' on the source and destination filers along with my MRTG charts. CPU utilization was very high on the source filer, and it was throwing syslog errors every minute saying "pipeline_3:notice]: snapmirror: Network communication error" and "SnapMirror source transfer from srcname to dstname: transfer failed." The CPU was at 45% - 50% when it normally runs in the 3% - 8% range. The destination filer was throwing syslog errors about failed transfers and about destination snapshots being in use and impossible to delete, for example: "[snapmirror.dst.snapDelErr:error]: Snapshot dst-netapp(543534345435)_volname.12313 in destination volume volname is in use, cannot delete."

The high CPU was really worrying because I was picturing that come Monday morning the additional load of web site traffic might cause the source NetApp to fall over dead. I let it run overnight, thinking the CPU load might be caused by a batch job and sort of hoping it would clear up by morning and the Snapmirror would recover once the load dropped. Unfortunately I woke up to the exact same situation.

This morning I started with some triage. The first task was to get the CPU load down so we could at least function tomorrow. Since this all seemed related to Snapmirror, I started by turning off Snapmirror altogether on both filers. The command to do this is 'snapmirror off'. Voilà, the CPU load on the source filer immediately dropped down to normal levels. Now, worst case, we could at least function come Monday morning, even if that meant not having the DR site up to date. I also disabled the Snapmirror schedule for this particular job, which normally runs every fifteen minutes. With the schedule for the troublesome job disabled, I turned Snapmirror back on (snapmirror on).
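For reference, the per-job schedule lives in /etc/snapmirror.conf on the destination filer, so disabling one job is just a matter of commenting out its line. A rough sketch (the filer names, volume name, and 15-minute schedule fields below are placeholders, not my real entry):
dst-netapp> snapmirror off
src-netapp> snapmirror off
The commented-out entry in the destination's /etc/snapmirror.conf:
#src-netapp:volname dst-netapp:volname - 0,15,30,45 * * *
dst-netapp> snapmirror on
src-netapp> snapmirror on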

Next I wanted to test Snapmirror on a separate volume to see if Snapmirror was broken in general or if the problem was specific to that particular volume. To do this I created a new, small (1GB) volume on both the source and destination filers, copied a few hundred megs of data into it, and then set up Snapmirror for the volume. I hit Initialize and it worked just fine. Hmmm... ok, so Snapmirror works. It must be a problem with that particular volume.
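I ran that test through the GUI, but the CLI equivalent would be roughly the following; the volume and aggregate names here are made up for the example:
src-netapp> vol create smtest aggr0 1g
dst-netapp> vol create smtest aggr0 1g
dst-netapp> vol restrict smtest
dst-netapp> snapmirror initialize -S src-netapp:smtest dst-netapp:smtest
dst-netapp> snapmirror status smtest
The destination volume has to be restricted before the initial transfer, which is what the vol restrict is for.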

From my CPU chart I had a pretty good idea of what time things went wrong. I was starting to consider the possibility that some strange combination of events took place at around 7pm which somehow interrupted or damaged the Snapmirror sync that ran at that time. Looking at the snapshots on both the source and destination filers, I saw that I still had a couple of hourly snapshots and one nightly snapshot in common between the two filers from before the CPU spike. So my next plan was to delete snapshots on the destination filer until it was back at a point prior to the CPU spike. I had to break the snapmirror relationship to make the destination filer read/write (snapmirror break volname), and then I deleted the snapshots taken after, and just prior to, the CPU spike. Now replication on the destination filer was rolled back to a point before the problem. I was desperately trying to avoid starting over with a full sync of 125GB of data! Fingers crossed, I ran resync (on the destination filer!) to reestablish the snapmirror relationship:

dst-netapp> snapmirror resync dst-netapp:volname
The resync base snapshot will be: hourly.1
These older snapshots have already been deleted from the source and will be deleted from the destination:
    hourly.4
    hourly.5
    nightly.1
Are you sure you want to resync the volume? y
Sun Mar  4 11:43:26 CST [snapmirror.dst.resync.info:notice]: SnapMirror resync of volname to src-netapp:volname is using hourly.1 as the base snapshot.
Volume volname will be briefly unavailable before coming back online.
Sun Mar  4 11:43:28 CST [wafl.snaprestore.revert:notice]: Reverting volume vol1a to a previous snapshot.
Sun Mar  4 11:43:28 CST [wafl.vol.guarantee.replica:info]: Space for replica volume 'volname' is not guaranteed.
Revert to resync base snapshot was successful.
Sun Mar  4 11:43:28 CST [snapmirror.dst.resync.success:notice]: SnapMirror resync of volname to src-netapp:volname successful.
Transfer started.
Monitor progress with 'snapmirror status' or the snapmirror log.

Phew!! Success! CPU is at 3% and snapmirror is working again.

The error messages the source filer was throwing were less than helpful. Network communication error? WTH? The problem had nothing to do with networking; it must be a general error message for when a snapmirror transfer fails. The destination filer's syslog messages gave me the best clue to the problem. One word of caution: make sure you run the resync command on the correct filer! In this situation, if I had run the resync command on the source filer, it would have reverted the source volume back to the previous afternoon, losing all the data in between.