
Mailing List Archive: NANOG: users

HE.net, Fremont-2 outage?

 

 



stef-list at memberwebs

Nov 4, 2009, 11:23 AM

Post #26 of 41
Re: HE.net, Fremont-2 outage?

Scott Howard wrote:
> Has anyone managed to get a root cause from HE yet regarding what happened?
>
> I'm still waiting for them to get back to me over 24 hours later...

Good luck.

I'm still waiting for them to get back to me about the outage six weeks
ago. I called and emailed all sorts of folks there, got the run around
for a week at least. Eventually got promises of "so and so should let
you know shortly" but that never occurred.

Cheers,

Stef


sethm at rollernet

Nov 4, 2009, 12:28 PM

Post #27 of 41
Re: HE.net, Fremont-2 outage?

Joe Greco wrote:
>
> Yup. Related: "100% availability" is a marketing person's dream; it
> sounds good in theory but is unattainable in practice, and is a reliable
> sign of non-100%-reliability.
>
> The most common way to gain "100% availability" is to avoid testing
> under load. This surely protects the equipment against a whole slew of
> failures in the less-used portions of your power systems, but also
> protects you from detecting them outside your Hour(s) Of Greatest Need.

Not testing under load is silly, IMHO. Does it work? Maybe. If it does
something strange during a test, the test is attended, trouble is
expected, and utility is available to fall back on. Starting your
generator only shows that it will turn over and idle, not that it will
provide power under load all the way to the racks.

Some people may prefer a colo that never risks it and therefore never
does more than idle the genset to claim 100% uptime. Others may prefer
one that won't promise 100% everything but does load tests. I'd rather
have a test go wrong while utility is available than have a backup fail
with no utility, hoping the power comes back before the UPS dies
or the room cooks itself. Both extremes are available to choose from if
you do your research before picking a colo.


> And even for those who follow best practices... You can inspect and
> maintain things until you're blue in the face. One day a contractor
> will drop a wrench into a PDU or UPS or whatever and spectacular things
> will happen. Or a battery develops a strange fault.
>
> You do live load testing, you'll lose now and then. It's best to simply
> assume no single circuit is 100% reliable. You should be able to get
> two circuits from separate power systems and the combination of the two
> should really closely approximate 100%, but even there... it isn't.
>

Separate power systems are overrated, especially if the fire department
ends up being involved for some reason. (Re: the infamous gas leak
story.) And of course with increased complexity comes increased risk of
failure and longer downtime to diagnose and repair. There is no perfect
balance.
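
As a rough illustration of the arithmetic behind the "two circuits"
argument and the common-mode objection to it, here is a minimal Python
sketch. The 99.9% per-feed figure and the common-mode fraction are
made-up numbers, not measurements from any facility, and the model
assumes the two feeds fail independently except for the explicit
common-mode term. The function names are hypothetical.

```python
HOURS_PER_YEAR = 8760.0


def parallel_availability(a, b, common_mode_unavailability=0.0):
    """Availability of a load fed by two redundant circuits.

    a, b: availability of each feed on its own (0.0 to 1.0).
    common_mode_unavailability: assumed fraction of time during which a
    single event (fire department shutoff, gas leak, ...) takes out both
    feeds at once, regardless of how independent they otherwise are.
    """
    # The load is down only when both independent feeds are down.
    independent = 1.0 - (1.0 - a) * (1.0 - b)
    # A common-mode event removes both feeds together.
    return independent * (1.0 - common_mode_unavailability)


def downtime_hours_per_year(availability):
    """Expected hours of outage per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.999  # assumed availability of one feed
    ideal = parallel_availability(single, single)
    realistic = parallel_availability(single, single, 0.0001)
    for label, value in (("single", single), ("ideal A+B", ideal),
                         ("A+B with common mode", realistic)):
        print(label, value, downtime_hours_per_year(value))
```

Under these assumptions the pair gets much closer to 100% than either
feed alone, but any shared failure mode quickly dominates the result,
which is consistent with both points made above.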

~Seth


jgreco at ns

Nov 4, 2009, 1:14 PM

Post #28 of 41
Re: HE.net, Fremont-2 outage?

> > Yup. Related: "100% availability" is a marketing person's dream; it
> > sounds good in theory but is unattainable in practice, and is a
> > reliable sign of non-100%-reliability.
>
> You are confusing two different things.

No, I'm not. They're interrelated. That doesn't mean that they are
the same thing, but to talk about them in terms of their relationship
or their effect on service is perfectly fair.

> > And even for those who follow best practices... You can inspect and
> > maintain things until you're blue in the face. One day a contractor
> > will drop a wrench into a PDU or UPS or whatever and spectacular things
> > will happen.
>
> That's were policies, procedures and methods come in (read: SAS70)

Policies, procedures, and methods are nice. Unfortunately, it is not
too uncommon for all of the above to be bent or broken for a whole
slew of reasons. What about a problem that hasn't been planned for?
It only takes one time ... one mistake ... of just the right kind.

> > Or a battery develops a strange fault.
>
> Get more than one string, one more than one UPS, with monitoring.
> Batteries are NOT the Achilles heel everyone wants to make you
> believe they are.

I know you have a rather higher faith in batteries than some of us,
but practical experience suggests that batteries are merely a mostly-
reliable technology.

... JG
--
Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net
"We call it the 'one bite at the apple' rule. Give me one chance [and] then I
won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN)
With 24 million small businesses in the US alone, that's way too many apples.


raphael.carrier at gmail

Nov 4, 2009, 2:08 PM

Post #29 of 41
Re: HE.net, Fremont-2 outage?

> I know you have a rather higher faith in batteries than some of us,
> but practical experience suggests that batteries are merely a mostly-
> reliable technology.
>

Agreed, batteries are unreliable. An alternative to battery-based UPS
is flywheel energy storage. These devices come either as an integrated
solution with the diesel generator (I think Cat offers such a package)
or as a standalone UPS (see:
www.pentadyne.com/uploads/18/File/Pentadyne-VSS-Brochure.pdf)

Another vendor is Active Power (which I think partners with Cat).

They seem to be MUCH more reliable than batteries, from what I have read.

HE probably acquired one of those solutions.


-Raphael Carrier


owen at delong

Nov 4, 2009, 2:18 PM

Post #30 of 41
Re: HE.net, Fremont-2 outage?

On Nov 4, 2009, at 2:08 PM, Raphael Carrier wrote:

>> I know you have a rather higher faith in batteries than some of us,
>> but practical experience suggests that batteries are merely a mostly-
>> reliable technology.
>>
>
> Agreed batteries are unreliable, an alternative to battery based UPS
> are flywheel energy storage devices, they come either as an integrated
> solution with the diesel generator (i think cat offers such a package)
> or as a standalone UPS (see:
> www.pentadyne.com/uploads/18/File/Pentadyne-VSS-Brochure.pdf)
>

Apparently you do not remember 365 Main...

Batteries are reliable.
Flywheels are reliable.

Both require proper maintenance and proper procedures to handle
corner cases (like the multiple-outage corner-case that took out
365 main).

Both have their issues.

In my experience working at and with a variety of datacenters, I have
to say that I have had generally better luck with batteries than
flywheels, but the key difference that suggests flywheels could
actually be better technology is this:

About 50% of battery failures traced back to human factors.

100% of the flywheel failures I experienced were human factors related.

Owen

Speaking as an individual, not representing any affiliation.


scott at doc

Nov 4, 2009, 2:20 PM

Post #31 of 41
Re: HE.net, Fremont-2 outage?

On Wed, Nov 4, 2009 at 2:08 PM, Raphael Carrier
<raphael.carrier [at] gmail>wrote:

> Agreed batteries are unreliable, an alternative to battery based UPS
> are flywheel energy storage devices, they come either as an integrated
> solution with the diesel generator (i think cat offers such a package)
>

Yup, just ask 365 Main how reliable they are -
http://365main.com/status_update.html

I'm not saying that battery-based UPSes are better, but no matter what type
of system you look at you're going to find failures.

Scott


jgreco at ns

Nov 4, 2009, 2:56 PM

Post #32 of 41
Re: HE.net, Fremont-2 outage?

> On Wed, Nov 4, 2009 at 2:08 PM, Raphael Carrier
> <raphael.carrier [at] gmail>wrote:
>
> > Agreed batteries are unreliable, an alternative to battery based UPS
> > are flywheel energy storage devices, they come either as an integrated
> > solution with the diesel generator (i think cat offers such a package)
>
> Yup, just ask 365 Main how reliable they are -
> http://365main.com/status_update.html
>
> I'm not saying that battery-based UPS's are better, but no matter what type
> of system you look at you're going to find failures.

I would point out that my cursory review of the document linked above
leaves a very positive impression. I don't know the actual details well
enough to know if there is any reason to doubt the document...

I would, however, tend to trust a vendor who disclosed events in this
manner. Even the best systems can fail. How a failure is handled is
in many ways the more important factor; being transparent about it is
good for confidence.

Best to plan for the occasional issue.

... JG


bking at inline

Nov 4, 2009, 2:56 PM

Post #33 of 41
RE: HE.net, Fremont-2 outage?

Sry for the top post...

As more facilities are built or retrofitted with an eye toward overall
efficiency using CCHP (combined cooling, heat and power), more of them
(like Syracuse U's new datacenter) will use systems such as Capstone
turbines for primary power, secure power, and CCHP. The main grid will
become the backup. I am not saying this approach replaces the need for
batteries or some other storage device such as a flywheel system.



bryan king | Internet Department Director
600 Lakeshore Pkwy
Birmingham AL, 35209
205-278-8139 [p]
205-314-7729[f]
bking [at] inline
www.InLine.com




scott at doc

Nov 4, 2009, 4:31 PM

Post #34 of 41
Re: HE.net, Fremont-2 outage?

On Wed, Nov 4, 2009 at 2:56 PM, Joe Greco <jgreco [at] ns> wrote:

> > Yup, just ask 365 Main how reliable they are -
> > http://365main.com/status_update.html
>
> I would point out that my cursory review of the document linked above
> leaves a very positive impression. I don't know the actual details well
> enough to know if there is any reason to doubt the document...
>
> I would, however, tend to trust a vendor who disclosed events in this
> manner. Even the best systems can fail. How a failure is handled is
> in many ways the more important factor; being transparent about it is
> good for confidence.
>

Absolutely! 365 Main handled this outage very well, both at the time and,
more importantly, with the followup at the URL above, which they made
(very!) public at the time and did not cover in
"confidential/customer only/etc" warnings.

Those that have (finally) received the notification from HE about yesterday's
outage will notice the stark difference between the way they've handled it
and the way 365 Main handled things...

Scott.


nevin at enginehosting

Nov 4, 2009, 6:38 PM

Post #35 of 41
Re: HE.net, Fremont-2 outage?

On Tuesday, November 3, 2009 10:03pm, "Joe Greco" <jgreco [at] ns> said:

>> Jeffrey Lyon wrote:
>> > FWIW: http://www.he.net/releases/release18.html
>>
>> No date on that 'press release' but the way back machine helps put it
>> somewhere in 2002. A lot of good this "Alameda" sized generator has done
>> recently...
>>
>> http://web.archive.org/web/*/http://www.he.net/releases/release18.html
>
> 2MW isn't super huge or anything. I would expect that, given the size
> I have been led to believe HE is, they've got a lot more than that now.
>
> My memory is that Alameda isn't huge, but it isn't small either. I'm
> not sure .. ah, here

The 2002 press release is talking about the Fremont 1 facility, not the
newer Fremont 2 facility. Fremont 1 has a fixed power availability to each
cabinet of just a single 15A circuit. You cannot modify or change that,
and if you need more power your only option is to add another cabinet. You
are not allowed to route power cords between cabinets, so you are forever
limited to a single circuit, loaded to no more than 80% of its 15A rating.
The data center was built in a different time.
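
For a sense of what "80% of a 15A circuit" means in practice, a small
Python sketch follows. The 120V supply voltage is an assumption (only
the amperage is given above), the 4000W example load is invented, and
the function names are hypothetical.

```python
import math


def usable_watts(breaker_amps, volts=120.0, continuous_factor=0.8):
    """Continuous power budget of one circuit.

    volts=120.0 is an assumed supply voltage; continuous_factor=0.8 is
    the 80% loading limit mentioned in the post.
    """
    return breaker_amps * continuous_factor * volts


def cabinets_needed(total_watts, breaker_amps=15.0, volts=120.0):
    """Cabinets required when each cabinet gets exactly one circuit."""
    return math.ceil(total_watts / usable_watts(breaker_amps, volts))


if __name__ == "__main__":
    print(usable_watts(15))       # budget on a single 15A circuit
    print(usable_watts(20))       # budget on a single 20A circuit
    print(cabinets_needed(4000))  # cabinets for a hypothetical 4000W load
```

With one circuit per cabinet, the power budget rather than the rack
space ends up deciding how many cabinets a given load needs.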

-- Nevin Lyne
-- Founder / Director of Technology
-- EngineHosting.com


stef-list at memberwebs

Nov 4, 2009, 6:51 PM

Post #36 of 41
Re: HE.net, Fremont-2 outage?

nevin [at] enginehosting wrote:
> The 2002 press release is talking about the Fremont 1 facility not
> the newer Fremont 2 facility. Fremont 1 has a fixed power
> availability to each cabinet of just a single 15A circuit. You can
> not modify or change that, and if you need more power your option is
> to add another cabinet. You are not allowed to route power cords
> between cabinets so you are forever running a single circuit and 80%
> of your 15A circuit max. The data center was built in a different
> time.

The same is true of racks in most of the suites in the more recent
Fremont 2 facility.

Cheers,

Stef


nevin at enginehosting

Nov 4, 2009, 6:53 PM

Post #37 of 41
Re: HE.net, Fremont-2 outage?

On Wednesday, November 4, 2009 10:00am, "dan syn" <dan.syn.ack [at] gmail> said:

> Maybe some of us [[soon-to-be-]ex-]customers of Hurricane can bake them a
> cake and beg for UPSes.
> Or reliable power.
> Or for someone to actually answer the voicemails much less phone calls
> within even a few hours of an outage.
> Or for there to be at the very least a status page notifying customers that
> they are, in fact, screwed, and for how long, and that it's useless to
> continue trying to get through at such time.
>
> Who's with me?

Yeah, after years of dealing with them all I can say is best of luck. While
we still have some legacy systems in Fremont #1, we moved 98% of our
operations out to other data centers back in 2005 because of the same lack
of communication, even about scheduled events (which to this day I don't
believe are posted anywhere). We were rapidly expanding at the time and were
given the brush-off, so we moved. That was the only way to get good, timely,
and detailed information about things taking place. Flash forward almost 5
years and it seems their flagship Fremont #2, which was just being announced
when we started moving, is still the same song, different year...


-- Nevin Lyne
-- Founder / Director of Technology
-- EngineHosting.com


Valdis.Kletnieks at vt

Nov 4, 2009, 6:57 PM

Post #38 of 41
Re: HE.net, Fremont-2 outage?

On Wed, 04 Nov 2009 12:26:15 CST, Joe Greco said:

> With power:
>
> N+1 is usually better than N
> Best to assume full load when doing math
> Things will go wrong, predict common failures

And uncommon ones. :)

So as part of a major compute-cluster install, we upgraded our UPS and diesel
generator one weekend, and breathed a collective sigh of relief that we were
now safe from power outages and had mostly dodged a bullet. We *did* have some
scary moments when we discovered that (a) of the 400 or so disks on our Sun
E10K, about 10 didn't spin up again and (b) several of the boot disks on said
box weren't mirrored. Fortunately, none of the 10 failures was on a
non-mirrored disk. By Tuesday, all the non-mirrored boot disks were in fact
mirrored.

That Friday, a bozo contractor relocating a doorway managed to set off the
Halon. Only lost two disks on the E10K. Guess which two? ;)

And a month later, we discovered that the nice shiny new automatic cutover
switch was wired in backwards, necessitating another power outage to re-wire it
correctly.

So much for safe from power outages... :)
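
For what it's worth, the "N+1" and "assume full load" rules quoted at
the top of this message reduce to a few lines of arithmetic. This is
only a sketch: the 1800kW load and 500kW unit size are invented for
illustration, and the function names are hypothetical.

```python
import math


def units_required(full_load_kw, unit_capacity_kw, spares=1):
    """Units (UPS modules, gensets, ...) for N+spares redundancy.

    N is sized against the full design load, not the typical load.
    """
    n = math.ceil(full_load_kw / unit_capacity_kw)
    return n + spares


def survives_single_failure(full_load_kw, unit_capacity_kw, installed_units):
    """True if the remaining units still carry full load with one unit out."""
    return (installed_units - 1) * unit_capacity_kw >= full_load_kw


if __name__ == "__main__":
    load_kw, unit_kw = 1800.0, 500.0  # invented example figures
    installed = units_required(load_kw, unit_kw)
    print(installed, survives_single_failure(load_kw, unit_kw, installed))
```

None of this models the uncommon failures, of course: a backwards
transfer switch defeats any number of spare units.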


scott at doc

Nov 4, 2009, 7:50 PM

Post #39 of 41
Re: HE.net, Fremont-2 outage?

On Wed, Nov 4, 2009 at 6:38 PM, <nevin [at] enginehosting> wrote:

> The 2002 press release is talking about the Fremont 1 facility not the
> newer Fremont 2 facility. Fremont 1 has a fixed power availability to each
> cabinet of just a single 15A circuit. You can not modify or change that,
> and if you need more power your option is to add another cabinet. You are
> not allowed to route power cords between cabinets so you are forever running
> a single circuit and 80% of your 15A circuit max. The data center was built
> in a different time.
>

A different time, but obviously not that much different...

Fremont 2 is still limited to either a single 15A or a single 20A circuit
per rack.

They are rebuilding one of the Fremont 2 wings and turning it into a single
area rather than the existing suites, so it'll be interesting to see if
things are done differently there.

Scott


mathews at hawaii

Nov 4, 2009, 9:32 PM

Post #40 of 41
Re: HE.net, Fremont-2 outage?

Alex Rubenstein wrote:
>> Yup. Related: "100% availability" is a marketing person's dream; it
>> sounds good in theory but is unattainable in practice, and is a
>> reliable sign of non-100%-reliability.
>
> You are confusing two different things.
>
> Availability != Reliability.

Pardon the interruption...

In the aforementioned statement, there appears an intense/flagrant -
compartmentalization/separation of terms without sufficient
explanation. Note that in being available, 'a' criteria to ensure
reliability is met. If one has the desire to delve into some of the
nuanced operational perspective, see: http://ow.ly/zmQg (pdf) or
http://ow.ly/zmTB (web friendly). The article is also available
through the IEEE Portal at http://ow.ly/zn3a (in case either of the other
links is unavailable at any time).

> For instance, an airplane is designed to be 100% reliable, but much less available. To keep a 747 from not crashing (100% reliability) it needs significant downtime (not 100% available).

This explanation, aside from being unsatisfactory, is misleading.
Operating times and maintenance times are very much separate quantities.

>> And even for those who follow best practices... You can inspect and
>> maintain things until you're blue in the face. One day a contractor
>> will drop a wrench into a PDU or UPS or whatever and spectacular things
>> will happen.
>
> That's were policies, procedures and methods come in (read: SAS70)

For the operationally minded -- on one hand, there is an assumption here
that 'accidents' are not preventable; on the other hand, there is at
least an assumption being made here that SAS 70 is the curative for
'accidents.' To be brief, accounting for human behavior as an
underlying contributor to accidents can be a backbreaking and immensely
messy endeavor. In this respect, SAS 70 can only be assistive.


All the best,
Robert Mathews.
--


jgreco at ns

Nov 5, 2009, 5:49 AM

Post #41 of 41
Re: HE.net, Fremont-2 outage?

> Alex Rubenstein wrote:
> >> Yup. Related: "100% availability" is a marketing person's dream; it
> >> sounds good in theory but is unattainable in practice, and is a
> >> reliable sign of non-100%-reliability.
> >
> > You are confusing two different things.
> >
> > Availability != Reliability.
>
> Pardon the interruption...
>
> In the aforementioned statement, there appears an intense/flagrant -
> compartmentalization/separation of terms without sufficient
> explanation.

Correct. It's even a bit more interesting than that; there's an
implication that marketing people will not really know the difference
and, having heard repeatedly about "high availability", may proceed to
use "availability" as a buzzword... I guess I was a bit more oblique
than intended.

> Note that in being available, 'a' criteria to ensure
> reliability is met. If one has the desire to delve into some of the
> nuanced operational perspective, see: http://ow.ly/zmQg (pdf) or
> http://ow.ly/zmTB (web friendly). The article is also available
> through the IEEE Portal at http://ow.ly/zn3a (if one of the other links
> appear to be unavailable, anytime).

I doubt marketing people will care. :-)

> > For instance, an airplane is designed to be 100% reliable, but much less available. To keep a 747 from not crashing (100% reliability) it needs significant downtime (not 100% available).
>
> This explanation, aside from being unsatisfactory, is misleading.
> Operating times and maintenance times are very much separate quantities.

And airplanes aren't 100% reliable regardless...

For a power system as a whole, though, one could see 100% availability
as a prereq for 100% reliability. Of course, you more closely approach
100% through redundancies... oops, should we introduce another term to
debate? :-)

> >> And even for those who follow best practices... You can inspect and
> >> maintain things until you're blue in the face. One day a contractor
> >> will drop a wrench into a PDU or UPS or whatever and spectacular things
> >> will happen.
> >
> > That's were policies, procedures and methods come in (read: SAS70)
>
> For the operationally minded -- on one hand, there is an assumption here
> that 'accidents' are not preventable;

You cannot eliminate accidents. Accidents represent things which are by
definition unforeseen and unplanned. Accidents may be reducible through
the use of good planning and practices. On one hand, one can foresee a
risk in resting a wrench near some energized busbars while needing one's
hands to do something else; you can define good practices that forbid
this sort of thing. Even that may not completely eliminate the practice;
there are plenty of examples of companies having good policies that are
disregarded by employees in the field. On the other hand, when Bruno is
moving a construction excavator around next door, suffers a heart attack,
and floors the controls such that the excavator rams your building and
the boom arm penetrates your wall and shoves a guy face-first into the
busbars, well, obviously we're talking extremely unlikely (I hope it's
obvious I'm even trying to be a bit ridiculous), but that's an Accident.
And they happen.

> on the other hand, there is at
> least an assumption being made here that SAS 70 is the curative for
> 'accidents.' To be brief, accounting for human behavior as an
> underlying contributor to accidents can be a backbreaking and immensely
> messy endeavor. In this respect, SAS 70 can only be assistive.

Correct. We can only hope to reduce accidents.

My original point was simply that I prefer people who recognize 100% as a
desirable-but-unobtainable goal.

... JG
