Mailing List Archive: Xen: API

sharing NFS SRs

 

 



Dave.Scott at eu

May 26, 2012, 1:57 AM

Post #1 of 3
sharing NFS SRs

Hi,

IMHO one of the weaknesses of the current NFS SR backend in XCP is that a single SR cannot be shared between pools. This is because the backend relies on the xapi pool framework to prevent:

1. multiple hosts from coalescing the same vhds.

2. the same vhd being attached to two VMs at the same time.

3. a vhd being read on one node even after it has been coalesced and deleted on another.

If multiple pools could safely share the same NFS SR then a cross-pool migrate (which is possible with the current code) wouldn't have to actually mirror the disks.

With this in mind I've been looking into NFS locking again. I realize this is a... tricky thing to get right... and Google turns up lots of horror stories. Anyway, here's what I was thinking:

For handling (1) and (2), we would only need one lock file (really a "lease file") per vhd. In the event of a network interruption we already know that running VMs are likely to fail after 90s or so -- the maximum time (IIRC) a Windows VM will allow a page file write to take. So we could:

* explicitly tell tapdisk to shut down after this long (since the VM will probably have blue-screened anyway)

* periodically refresh our leases, setting them to expire well after the tapdisks are guaranteed to have shut down

So if a host leaves the network, all disks become unlocked a few minutes later and the VMs (and coalesce jobs) can safely be restarted on another pool. This could then be used as the foundation for a new "HA" feature, where only VMs whose I/Os have failed are shut down and restarted.

From an implementation point of view, this python library looks pretty good:

http://bazaar.launchpad.net/~barry/flufl.lock/trunk/view/head:/flufl/lock/_lockfile.py
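
To make the lease idea concrete, here is a minimal sketch of what a per-vhd lease might look like on top of that library. The lease file name, the timings and the tapdisk_is_running() helper are illustrative assumptions, not existing SR backend code:

    # Sketch only: one lease file per vhd on the NFS SR, refreshed well
    # within its lifetime so that a dead host's leases expire on their own.
    import time
    from datetime import timedelta
    from flufl.lock import Lock, TimeOutError

    LEASE_LIFETIME = timedelta(seconds=300)  # assumed: well beyond the ~90s tapdisk cut-off
    REFRESH_PERIOD = 60                      # seconds between refreshes (assumed)

    def tapdisk_is_running(vhd_path):
        """Hypothetical helper: a real version would check the tapdisk process."""
        return True

    def hold_vhd_lease(vhd_path):
        """Acquire and keep refreshing the lease for one vhd."""
        lease = Lock(vhd_path + '.lease', lifetime=LEASE_LIFETIME)
        try:
            lease.lock(timeout=timedelta(seconds=10))  # fail fast if another host holds it
        except TimeOutError:
            raise RuntimeError('vhd is leased by another host: %s' % vhd_path)
        try:
            while tapdisk_is_running(vhd_path):
                lease.refresh()            # push the expiry time out again
                time.sleep(REFRESH_PERIOD)
        finally:
            lease.unlock()                 # clean release once tapdisk has shut down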

I'm not totally sure how to handle (3): would it be sufficient to periodically reopen the vhd chain in tapdisk, or just handle the error where a read fails and reopen the chain then?

Comments are welcome!

Cheers,
Dave



george.shuklin at gmail

May 26, 2012, 2:50 AM

Post #2 of 3
Re: sharing NFS SRs

On 26.05.2012 12:57, Dave Scott wrote:
> Hi,
>
> IMHO one of the weaknesses of the current NFS SR backend in XCP is that a single SR cannot be shared between pools. This is because the backend relies on the xapi pool framework to prevent:
>
> 1. multiple hosts from coalescing the same vhds.
>
> 2. the same vhd being attached to two VMs at the same time.
>
> 3. a vhd being read on one node even after it has been coalesced and deleted on another.
>
> If multiple pools could safely share the same NFS SR then a cross-pool migrate (which is possible with the current code) wouldn't have to actually mirror the disks.
>
> With this in mind I've been looking into NFS locking again. I realize this is a... tricky thing to get right... and Google turns up lots of horror stories. Anyway, here's what I was thinking:
>
> For handling (1) and (2), we would only need one lock file (really a "lease file") per vhd. In the event of a network interruption we already know that running VMs are likely to fail after 90s or so -- the maximum time (IIRC) a Windows VM will allow a page file write to take. So we could:
>
> * explicitly tell tapdisk to shut down after this long (since the VM will probably have blue-screened anyway)
>
> * periodically refresh our leases, setting them to expire well after the tapdisks are guaranteed to have shut down
>
> So if a host leaves the network, all disks become unlocked a few minutes later and the VMs (and coalesce jobs) can safely be restarted on another pool. This could then be used as the foundation for a new "HA" feature, where only VMs whose I/Os have failed are shut down and restarted.
>
> From an implementation point of view, this python library looks pretty good:
>
> http://bazaar.launchpad.net/~barry/flufl.lock/trunk/view/head:/flufl/lock/_lockfile.py
>
> I'm not totally sure how to handle (3): would it be sufficient to periodically reopen the vhd chain in tapdisk, or just handle the error where a read fails and reopen the chain then?
>

I'm somewhat afraid of the idea of a 'leasing' operation (and of
periodic open/close operations).

Here are some scenarios to think about:

1) Temporary loss of the host's storage connectivity. NFS on the host
goes into an interruptible sleep and continues I/O as soon as
connectivity comes back. Meanwhile we have already killed tapdisk,
removed the lease and restarted the VM on another host -- and suddenly
the network is revived, and the pending NFS write lands straight in
the middle of a 'mission critical' database, carrying data that is
long past its expiration date. Possibly weeks after the 'issue' with
the VM restart.
2) SR live migration is still a very important feature that I really
hope to see.
3) Those leases will create additional I/O. For example, if we have
~20k VMs (not a really large number for modern clouds) and the lease
interval is 10 minutes, that creates ~33 IOPS -- the equivalent of
about 60-70 VMs (according to statistics from our cloud); the
arithmetic is sketched after this list.
4) How do you plan to guarantee that tapdisk shuts down? (This is NFS:
if the server is down or there are connectivity issues, there is no
way to shut down a process that is stuck in I/O.)
5) I think 30s is not a very good number: the Linux kernel only starts
throwing I/O errors after 120 seconds of I/O wait.
6) About this library: "you also need to make sure that your clocks
are properly synchronized." I think this adds a requirement on
coexisting hosts: do not allow an NFS SR to be plugged until the clock
is synced with the master. (Same for cross-pool migration: reject the
migration if the clock is out of sync, but allow people to shoot
themselves in the foot with --force.)
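
For reference, a quick back-of-the-envelope check of the numbers in point 3
(the VM count and lease interval are just the figures from the example above,
not measured data):

    # Rough arithmetic for point 3: extra write IOPS generated by lease refreshes.
    vms = 20000              # ~20k VMs sharing the storage
    lease_interval = 600.0   # 10-minute refresh period, in seconds

    refresh_iops = vms / lease_interval
    print(refresh_iops)      # ~33.3 extra write operations per second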




Dave.Scott at eu

May 30, 2012, 2:05 AM

Post #3 of 3
Re: sharing NFS SRs

> On 26.05.2012 12:57, Dave Scott wrote:
> > Hi,
> >
> > IMHO one of the weaknesses of the current NFS SR backend in XCP is
> > that a single SR cannot be shared between pools. This is because the
> > backend relies on the xapi pool framework to prevent:
> >
> > 1. multiple hosts from coalescing the same vhds.
> >
> > 2. the same vhd being attached to two VMs at the same time.
> >
> > 3. a vhd being read on one node even after it has been coalesced and
> > deleted on another.
> >
> > If multiple pools could safely share the same NFS SR then a
> > cross-pool migrate (which is possible with the current code) wouldn't
> > have to actually mirror the disks.
> >
> > With this in mind I've been looking into NFS locking again. I realize
> > this is a... tricky thing to get right... and Google turns up lots of
> > horror stories. Anyway, here's what I was thinking:
> >
> > For handling (1) and (2), we would only need one lock file (really a
> > "lease file") per vhd. In the event of a network interruption we
> > already know that running VMs are likely to fail after 90s or so --
> > the maximum time (IIRC) a Windows VM will allow a page file write to
> > take. So we could:
> >
> > * explicitly tell tapdisk to shut down after this long (since the VM
> > will probably have blue-screened anyway)
> >
> > * periodically refresh our leases, setting them to expire well after
> > the tapdisks are guaranteed to have shut down
> >
> > So if a host leaves the network, all disks become unlocked a few
> > minutes later and the VMs (and coalesce jobs) can safely be restarted
> > on another pool. This could then be used as the foundation for a new
> > "HA" feature, where only VMs whose I/Os have failed are shut down and
> > restarted.


George wrote:
> I'm somewhat afraid of the idea of a 'leasing' operation (and of
> periodic open/close operations).
>
> Here are some scenarios to think about:
>
> 1) Temporary loss of the host's storage connectivity. NFS on the host
> goes into an interruptible sleep and continues I/O as soon as
> connectivity comes back. Meanwhile we have already killed tapdisk,
> removed the lease and restarted the VM on another host -- and suddenly
> the network is revived, and the pending NFS write lands straight in
> the middle of a 'mission critical' database, carrying data that is
> long past its expiration date. Possibly weeks after the 'issue' with
> the VM restart.
> 2) SR live migration is still a very important feature that I really
> hope to see.
> 3) Those leases will create additional I/O. For example, if we have
> ~20k VMs (not a really large number for modern clouds) and the lease
> interval is 10 minutes, that creates ~33 IOPS -- the equivalent of
> about 60-70 VMs (according to statistics from our cloud).
> 4) How do you plan to guarantee that tapdisk shuts down? (This is NFS:
> if the server is down or there are connectivity issues, there is no
> way to shut down a process that is stuck in I/O.)
> 5) I think 30s is not a very good number: the Linux kernel only starts
> throwing I/O errors after 120 seconds of I/O wait.
> 6) About this library: "you also need to make sure that your clocks
> are properly synchronized." I think this adds a requirement on
> coexisting hosts: do not allow an NFS SR to be plugged until the clock
> is synced with the master. (Same for cross-pool migration: reject the
> migration if the clock is out of sync, but allow people to shoot
> themselves in the foot with --force.)

I see what you mean; perhaps leasing isn't a good idea.

Perhaps we can take advantage of driver domains (which we'll have good support for soon).

We could:

* acquire a lock (not a lease) whenever we use a .vhd file
* in the driver domain, a single process will send heartbeat I/O to the NFS server
* each tapdisk inside the driver domain is unmodified (so no tapdisk lease refresh generating extra I/O)
* when the heartbeat I/O fails for more than 'n' seconds (TBD), the whole domain is rebooted (either cleanly or forcibly by the hypervisor watchdog?)
* when another host (the pool master?) sees the driver domain has shutdown, it knows it can forcibly release the locks since all the buffers will have been thrown away

I quite like the idea of sending some heartbeats to the NFS server, since we don't really handle (or even detect) storage failure very well at the moment.
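
To sketch what I mean, something like the following could run as the single
heartbeat process in the driver domain. The file name, the periods and the
direct reboot call are all assumptions for illustration; a real implementation
would more likely arm the hypervisor watchdog as mentioned above:

    # Sketch only: periodically write a heartbeat file on the NFS SR; if the
    # writes keep failing for too long, reboot the whole driver domain so
    # another host can safely release our locks.
    import os
    import time
    import subprocess

    HEARTBEAT_FILE = '/var/run/sr-mount/heartbeat'  # hypothetical path on the SR
    PERIOD = 5            # seconds between heartbeat writes (assumed)
    MAX_SILENCE = 60      # the 'n' seconds of failed heartbeats before giving up (assumed)

    def heartbeat_loop():
        last_success = time.time()
        while True:
            try:
                # O_SYNC so the write really has to reach the NFS server
                fd = os.open(HEARTBEAT_FILE, os.O_WRONLY | os.O_CREAT | os.O_SYNC)
                try:
                    os.write(fd, str(time.time()).encode())
                finally:
                    os.close(fd)
                last_success = time.time()
            except OSError:
                pass  # heartbeat failed; fall through to the timeout check
            if time.time() - last_success > MAX_SILENCE:
                # Storage has been unreachable for too long: reboot the domain.
                subprocess.call(['reboot', '-f'])
            time.sleep(PERIOD)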

If we also support blkback/blkfront reconnect then the guest VBDs will be able to reconnect after the driver domain has rebooted -- that would be good.

What do you think?

Cheers,
Dave

