
seligman at nevis
Apr 16, 2012, 1:38 PM
Re: problem with nfs and exportfs failover
On 4/16/12 1:47 PM, Seth Galitzer wrote:
> Just a quick update. I set the wait_for_leasetime_on_stop parameter on
> the exportfs resource to false, so it no longer sleeps for 92 sec and
> the switchover is instantaneous. Now I just need to figure out how to
> disable nfsv4 on the server side and I should be home-free.

As you're testing this, a couple of reminders/observations:

- You're exporting /exports/admin with option rw. If your clients are actually
writing to that directory, and you want to have true failover, you may need
NFSv4. I suggest running a test in which you have a client do an extended write
(with dd, for example), then pull the plug on coronado. Is your file or
filesystem trashed when you do this?

- If you don't need your clients to be able to write to /exports/admin, you
don't have to figure out how to turn off NFSv4 (on RHEL6, this is done by
passing "-N 4" to nfsd, and is typically done in /etc/sysconfig/nfs). I have the
following exportfs definitions on my primary-primary cluster, and my failover
tests work just fine:

primitive ExportUsrNevis ocf:heartbeat:exportfs \
        description="Site-wide applications installed in /usr/nevis" \
        op start interval="0" timeout="40" \
        op stop interval="0" timeout="120" \
        params clientspec="*.nevis.columbia.edu" directory="/usr/nevis" fsid="20" options="ro,no_root_squash,async" rmtab_backup="none"

Note that I'm exporting this directory ro. If I wanted to support writes with
failover (especially in a primary-primary setup!) I'd have tons more work to do.

I notice in the configuration you've posted, you haven't included fencing yet.
Don't forget this! And test it as well.

> On 04/16/2012 12:42 PM, Seth Galitzer wrote:
>> I've been poking at this more over the weekend and this morning. And
>> while your tip about rmtab was useful, it still didn't resolve the
>> problem. I also made sure that my exports were only being
>> handled/defined by pacemaker and not by /etc/exports.
>> Though for the cloned nfsserver resource to work, it seems you need an
>> /etc/exports file to exist on the server, even if it's empty.
>>
>> It seems the clue as to what's going on is in this line from the log:
>>
>> coronado exportfs[20325]: INFO: Sleeping 92 seconds to accommodate for
>> NFSv4 lease expiry
>>
>> If I bump up the timeout for the exportfs resource to 95 sec, then after
>> the very long timeout, it switches over correctly. So while this is a
>> working solution to the problem, a 95 sec timeout is a little long for
>> my personal comfort on a live and active fileserver. Any idea what is
>> instigating this timeout? Is it exportfs (looks that way from the log
>> entry), nfsd, or pacemaker? If pacemaker, then where can I reduce or
>> remove this?
>>
>> I've been looking at disabling nfsv4 entirely on this server, as I don't
>> really need it, but haven't found a solution that works yet. Tried the
>> suggestion in this thread, but it seems to be for mounts, not nfsd, and
>> still doesn't help:
>> http://lists.debian.org/debian-user/2011/11/msg01585.html
>>
>> Though I have found that v4 is being loaded on one host but not the
>> other. So if I can find what's different, I may be able to make that work.
>>
>> coronado:~# rpcinfo -u localhost nfs
>> program 100003 version 2 ready and waiting
>> program 100003 version 3 ready and waiting
>> program 100003 version 4 ready and waiting
>>
>> cascadia:~# rpcinfo -u localhost nfs
>> program 100003 version 2 ready and waiting
>> program 100003 version 3 ready and waiting
>>
>> Any further suggestions are welcome. I'll keep poking until I find a
>> solution.
>>
>> Thanks.
>> Seth
>>
>> On 04/16/2012 11:49 AM, William Seligman wrote:
>>> On 4/14/12 5:55 AM, emmanuel segura wrote:
>>>> Maybe the problem is the primitive nfsserver lsb:nfs-kernel-server; I
>>>> think this primitive was stopped before exportfs-admin
>>>> ocf:heartbeat:exportfs.
>>>>
>>>> And if I remember, lsb:nfs-kernel-server and the exportfs agent do the
>>>> same thing:
>>>>
>>>> the first uses the OS scripts and the second the cluster agents.
>>>
>>> Now that Emmanuel has reminded me, I'll offer two more tips based on advice he's
>>> given me in the past:
>>>
>>> - You can deal with the issue he raises directly by putting additional constraints
>>> in your setup, something like:
>>>
>>> colocation fs-homes-nfsserver inf: group-homes clone-nfsserver
>>> order nfssserver-before-homes inf: clone-nfsserver group-homes
>>>
>>> That will make sure that all the group-homes resources (including
>>> exportfs-admin) will not be run unless an instance of nfsserver is already
>>> running on that node.
>>>
>>> - There's a more fundamental question: Why are you placing the start/stop of
>>> your NFS server on both nodes under pacemaker control? Why not have the NFS
>>> server start at system startup on each node?
>>>
>>> The only reason I see for putting NFS under Pacemaker control is if there are
>>> entries in your /etc/exports file (or the Debian equivalent) that won't work
>>> unless other Pacemaker-controlled resources are running, such as DRBD. If that's
>>> the case, you're better off controlling them with Pacemaker exportfs resources,
>>> the same as you're doing with exportfs-admin, instead of /etc/exports entries.
>>>
>>>> On 14 April 2012 at 01:50, William Seligman <seligman [at] nevis> wrote:
>>>>
>>>>> On 4/13/12 7:18 PM, William Seligman wrote:
>>>>>> On 4/13/12 6:42 PM, Seth Galitzer wrote:
>>>>>>> In attempting to build a nice clean config, I'm now in a state where
>>>>>>> exportfs never starts. It always times out and errors.
>>>>>>>
>>>>>>> crm config show is pasted here: http://pastebin.com/cKFFL0Xf
>>>>>>> syslog after an attempted restart here: http://pastebin.com/CHdF21M4
>>>>>>>
>>>>>>> Only IPs have been edited.
>>>>>>
>>>>>> It's clear that your exportfs resource is timing out for the admin
>>>>>> resource.
>>>>>>
>>>>>> I'm no expert, but here are some "stupid exportfs tricks" to try:
>>>>>>
>>>>>> - Check your /etc/exports file (or whatever the equivalent is in Debian;
>>>>>> "man exportfs" will tell you) on both nodes. Make sure you're not already
>>>>>> exporting the directory when the NFS server starts.
>>>>>>
>>>>>> - Take out the exportfs-admin resource. Then try doing things manually:
>>>>>>
>>>>>> # exportfs x.x.x.0/24:/exports/admin
>>>>>>
>>>>>> Assuming that works, then look at the output of just
>>>>>>
>>>>>> # exportfs
>>>>>>
>>>>>> The clientspec reported by exportfs has to match the clientspec you put
>>>>>> into the resource exactly. If exportfs is canonicalizing or reporting the
>>>>>> clientspec differently, the exportfs monitor won't work. If this is the
>>>>>> case, change the clientspec parameter in exportfs-admin to match.
>>>>>>
>>>>>> If the output of exportfs has any results that span more than one line,
>>>>>> then you've got the problem that the patch I referred you to (quoted
>>>>>> below) is supposed to fix. You'll have to apply the patch to your
>>>>>> exportfs resource.
>>>>>
>>>>> Wait a second; I completely forgot about this thread that I started:
>>>>>
>>>>> <http://www.gossamer-threads.com/lists/linuxha/users/78585>
>>>>>
>>>>> The solution turned out to be to remove the .rmtab files from the
>>>>> directories I was exporting, deleting & touching /var/lib/nfs/rmtab (you'll
>>>>> have to look up the Debian location), and adding rmtab_backup="none" to all
>>>>> my exportfs resources.
>>>>>
>>>>> Hopefully there's a solution for you in there somewhere!
>>>>>
>>>>>>> On 04/13/2012 01:51 PM, William Seligman wrote:
>>>>>>>> On 4/13/12 12:38 PM, Seth Galitzer wrote:
>>>>>>>>> I'm working through this howto doc:
>>>>>>>>> http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf
>>>>>>>>> and am stuck at section 4.4. When I put the primary node in standby, it
>>>>>>>>> seems that NFS never releases the export, so it can't shut down, and
>>>>>>>>> thus can't get started on the secondary node. Everything up to that
>>>>>>>>> point in the doc works fine and fails over correctly. But once I add
>>>>>>>>> the exportfs resource, it fails. I'm running this on debian wheezy with
>>>>>>>>> the included standard packages, not custom.
>>>>>>>>>
>>>>>>>>> Any suggestions? I'd be happy to post configs and logs if requested.
>>>>>>>>
>>>>>>>> Yes, please post the output of "crm configure show", the output of
>>>>>>>> "exportfs" while the resource is running properly, and the relevant
>>>>>>>> sections of your log file. I suggest using pastebin.com, to keep
>>>>>>>> mailboxes from filling up with walls of text.
>>>>>>>>
>>>>>>>> In case you haven't seen this thread already, you might want to take a look:
>>>>>>>>
>>>>>>>> <http://www.gossamer-threads.com/lists/linuxha/dev/77166>
>>>>>>>>
>>>>>>>> And the resulting commit:
>>>>>>>> <https://github.com/ClusterLabs/resource-agents/commit/5b0bf96e77ed3c4e179c8b4c6a5ffd4709f8fdae>
>>>>>>>>
>>>>>>>> (Links courtesy of Lars Ellenberg.)
>>>>>>>>
>>>>>>>> The problem and patch discussed in those links don't quite match
>>>>>>>> what you describe. I mention it because I had to patch my exportfs
>>>>>>>> resource (in /usr/lib/ocf/resource.d/heartbeat/exportfs on my RHEL
>>>>>>>> systems) to get it to work properly in my setup.

--
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman [at] nevis
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
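
For anyone repeating the rpcinfo comparison quoted in this thread, here is a minimal sketch of how the check for a registered NFSv4 service might be scripted. The helper name nfs_version_registered is hypothetical (it is not part of any package), and the sample text is just the coronado and cascadia output quoted in the thread; treat it as an illustration, not a tested cluster tool.

```shell
#!/bin/sh
# Hypothetical helper: reads rpcinfo-style text on stdin and succeeds
# (exit status 0) if the NFS program (100003) reports the given version.
nfs_version_registered() {
    version="$1"
    grep -q "program 100003 version ${version} ready"
}

# Sample output copied from the thread, standing in for a live
# "rpcinfo -u localhost nfs" run on each node.
coronado_output='program 100003 version 2 ready and waiting
program 100003 version 3 ready and waiting
program 100003 version 4 ready and waiting'

cascadia_output='program 100003 version 2 ready and waiting
program 100003 version 3 ready and waiting'

# Print a one-line verdict for a host name and its captured output.
report() {
    if printf '%s\n' "$2" | nfs_version_registered 4; then
        echo "$1: NFSv4 registered"
    else
        echo "$1: NFSv4 not registered"
    fi
}

report coronado "$coronado_output"
report cascadia "$cascadia_output"
```

On a live node the same helper could be fed directly, as in "rpcinfo -u localhost nfs | nfs_version_registered 4", assuming the output format matches what was posted above.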