
steveu at netpage
Mar 29, 1999, 2:03 AM
Post #2 of 6
This is my first intrusion into the HA forum. I have implemented a variety of high availability and fault tolerant hardware and software at various times over the years, so I hope I have something meaningful to say.

I don't really see a problem with current commercial SCSI PCI cards in high availability systems, as far as functionality is concerned. The cards don't have to reinitialise the SCSI drives as they come up. That is (usually) controllable in software, and the drivers can be adapted appropriately. Adaptec, BusLogic and other cards all seem to have the functionality needed to provide a high availability solution, with hosts rebooting at any time.

An important issue, however, is how you terminate the bus. The active termination on some cards goes crazy during reboot, so termination must be provided elsewhere. As long as the cards are just another SCSI node along the bus there should be no problem with reboots.

Cycling the power is another issue, and starts to get to the heart of the problems with SCSI and high availability. Detecting failures and working around them in a high availability system is simple compared to the MAJOR problem you have to face - systems that don't die, but just act screwy. Screwy here could mean babbling, or functionality that comes and goes intermittently. When you think you have allowed for every screwy effect an intermittent fault could cause, you can be sure you haven't.

SCSI offers lots of potential for screwy behaviour. Like most systems, it has the potential for babbling faults, but these seem quite rare in practice. The real problems are intermittent connections, grounding problems and so forth. Any complex multi-machine SCSI configuration tends to get flaky because of grounding problems. These not only cause data errors - they can actually be bad enough to blow up the hardware. Grounding was a particularly bad problem with Ultra SCSI, with its high clock rates on single ended systems.
Now that the newest SCSI hardware has gone differential (yes, I know differential was always an option for SCSI, but I haven't seen much differential hardware in use for the older SCSI revisions), grounding should be a bit less of a bother. The system is still not immune from grounding troubles, though. It's just a bit more tolerant. Remember, those maximum cable lengths quoted in the SCSI specs have nothing to do with science (unlike, say, the maximum length for thick Ethernet, which is based on the speed of light), but are engineering guesstimates of what you might be able to get away with.

The Linux HA HOWTO says a bit about grounding and termination problems, but doesn't sufficiently emphasise the extent to which these, and SCSI's generally poor design for distributed operation, are a pain in the posterior. Modern SCSI variants use very flimsy connectors, which are quite trouble prone. SCSI is a parallel bus, so it has a lot of connections, and requires every one of them to be reliable. It has no fault tolerance, and only simple parity (which is often not even implemented) as a fault detection mechanism.

Taking SCSI cables and connectors outside a box, to connect to another box, is asking for trouble. It's fine in a test environment. In the real world anything remotely fragile gets damaged, even in a fairly well controlled room, or a rack. I only ever feel comfortable with SCSI when it's all tidily in one box, with one power supply. One processor box connected to one adjacent RAID box, tightly cabled together, and plugged into the same power outlet is about the greatest risk I want to take with distributed SCSI. I've never seen anything more complex be 100% reliable. So, spreading a SCSI bus around a number of boxes may have more potential for reducing availability than increasing it.

"If you want to know about fault tolerant systems ask Microsoft. I know of nobody else that tolerates quite so many faults in the systems!"
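A footnote on the parity remark above, for anyone wondering why "simple parity" is such weak protection. This is only a minimal sketch, assuming one odd-parity bit per data byte; the function names are made up for illustration and this is not driver code. A single flipped bit is caught, but two flipped bits in the same byte cancel out and pass unnoticed.

```python
def odd_parity_bit(byte: int) -> int:
    """Parity bit that makes the total count of 1 bits (data + parity) odd."""
    ones = bin(byte & 0xFF).count("1")
    return 0 if ones % 2 == 1 else 1


def parity_ok(byte: int, parity: int) -> bool:
    """True if data byte plus parity bit still contain an odd number of 1s."""
    return (bin(byte & 0xFF).count("1") + parity) % 2 == 1


sent = 0b10110010                 # byte as driven onto the bus
parity = odd_parity_bit(sent)     # parity line driven alongside it

one_flip = sent ^ 0b00000100      # one data line glitches
two_flips = sent ^ 0b00010100     # two data lines glitch together

print(parity_ok(sent, parity))       # clean transfer
print(parity_ok(one_flip, parity))   # single-bit error is detected
print(parity_ok(two_flips, parity))  # double-bit error slips through
```

An intermittent ground or a loose connector tends to disturb several lines at once, which is exactly the case this check cannot be relied on to catch.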
Steve

James O'Kane wrote:
> Hi,
> I've been thinking about the shared SCSI bus question, and I have
> some ideas that I would like to bounce off some people before I spend too
> many hours researching an empty dream. I've done a little reading on the
> archives, but I don't think I'm using the right search words.
>
> I'm curious why we should restrict ourselves to current SCSI
> hardware that is available? What about designing a special card that does
> what we want? I'll admit up front that I am nowhere near familiar with
> the current SCSI specs, but I would guess that we could add control codes
> to the set as long as we don't overlap in namespace.
>
> The card that I have in mind would work just like any other SCSI
> card when by itself, but when you configure it in a network of machines,
> each SCSI card could 'ping' the other SCSI cards on the bus, and tell if
> they are still alive. My first thoughts on how control is done would be by
> SCSI ID. The lower your number the more authority you have.
>
> I read that a problem is when a card comes back on-line it
> reinitializes the drives, but with a card designed for this purpose, it
> could be made to detect if the bus is already active.
>
> I have some more ideas about this, but they start getting into
> details of implementation. I'm hoping that I get one of a few replies.
> Either someone can point me to a link to someone who is already doing
> this, someone could point out why this idea is flawed, or someone could
> point me to some links for where I could start researching more.
>
> thanks
> -james
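For what it's worth, the control scheme James describes (every card pings the others, and the lowest SCSI ID that answers takes charge) can be sketched in a few lines. This is purely illustrative: the 'ping' is a hypothetical command that does not exist in any SCSI spec, and `elect_controller` and the sample data are made up. The sample rounds show what happens when host 2 has an intermittent fault and only answers every other ping.

```python
from typing import Dict, List, Optional


def elect_controller(ping_results: Dict[int, bool]) -> Optional[int]:
    """Return the lowest SCSI ID that answered the last ping round, if any."""
    alive = [scsi_id for scsi_id, answered in ping_results.items() if answered]
    return min(alive) if alive else None


# Four ping rounds on a bus with hosts at IDs 2 and 5.
# Host 2 is not dead, just intermittent.
rounds: List[Dict[int, bool]] = [
    {2: True, 5: True},
    {2: False, 5: True},
    {2: True, 5: True},
    {2: False, 5: True},
]

owners = [elect_controller(r) for r in rounds]
print(owners)  # [2, 5, 2, 5]
```

A host that dies cleanly hands over once and stays gone. A host that comes and goes makes control bounce between machines on every round, which is the "screwy" case that is hard to design for.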