XMission

Announcements

See the archive of all announcements, or read frequently asked questions about the announcements.

From support@xmission.com  Wed Nov 12 16:37:32 MST 2008
Date: Wed, 12 Nov 2008 16:37:32 MST
From: XMission Support 
To: announce@xmission.com
Subject: ANNOUNCEMENT: XMission Outage: Updates and Moving Ahead

XMission Outage: Updates and Moving Ahead
-----------------------------------------

We at XMission want to give our customers more details and offer a proper
apology now that the dust is clearing after yesterday's outage on Tuesday,
November 11th.  This was a big one, and all customers experienced problems
to one degree or another.

This announcement is very long, but we want to address the questions and
concerns that have arisen and to restore customer confidence.

Synopsis
--------
While the power only went out for a moment, many systems were adversely
affected by the outage and took extensive attention and time to recover.
In the case of our primary storage device, we could not bring it back
online even after hours of trying so we restored files to the new NetApp
we already had on site.  Many systems tie into this device, which
exacerbated the problem.

We are happy to report that, now that the new NetApp storage appliance is
online, we will soon increase the base quota for customers from 100 MB to
5 GB at no additional charge. Our web hosting customers will also see
significantly increased quota in the very near future.

As of the start of business hours this morning, file backups had
completed and almost everything was in working order. We continue to find
and address remaining issues, though. Some customers are still
experiencing delays with sending and receiving email, but the queues are
clearing.

Additional Technical Details
----------------------------
We didn't have all the answers last night so here are further details
regarding the outage:

* Our primary storage device (a NetApp F810 we were in the process of
replacing this week with a new NetApp FAS2020) suffered the loss of 2
drives on one of the volumes, causing us to lose the data on the device
entirely.
 - We were waiting to get our SnapMirror license from NetApp to
   copy data over, but at least we had the new hardware on site and
   ready.
 - Since many systems NFS-mount this device, which handles /home, other
   servers required attention to get up and running properly.
 - Web hosting was down into the night until customer files were
   restored to the new hardware. This was completed by morning.
 - Our new NetApp FAS2020 has additional recovery options not available
   on the older F810.
 - We have plans to purchase another NetApp FAS2020 in the near future to
   host offsite. While we already have offsite backups, the FAS2020 is an
   upgrade to that system.

* Email services were down about 5 hours for most customers yesterday.
No mail should have been lost, although some customers continue to see
delays sending and receiving email.

* DSL and UTOPIA customers were offline for up to an hour because our
RADIUS server did not recover on its own. Some customers also needed to
power-cycle their modem before they could reconnect. As a rule of thumb,
we highly recommend that customers power-cycle their gear when
troubleshooting.

* Although it was a holiday, our systems administrators were on site
within minutes and many worked through the night, some up to 18 hours
without a break.

* We are sorry about the problems with our phone system. It was initially
offline due to the outage, then we maxed out connections to it, and we
couldn't answer all of the calls because we only had a skeleton staff of
phone technicians due to the holiday. We are making some changes to the
existing phone system and are also in the process of replacing it by the
end of this year. Some expressed concerns that our status messages were
not very helpful.  Unfortunately, we often did not know when systems would
come back up, and we also needed to keep the message short due to heavy
call volume.

* To clarify, the outage was due to human error during maintenance on one
of our 3 UPSes, not to a failure of any of our equipment. A mislabeled
breaker led to the mistake.

* DNS (Domain Name System) service was sporadic for up to an hour. This
was mostly due to a Cisco 6509 that continued to have issues in the
beginning, but we have since moved our two onsite authoritative
nameservers (ns.xmission.com and ns1.xmission.com), as well as most other
servers, to a new redundant connection.  We should note that we do have a
tertiary nameserver in California (ns2.xmission.com).  If you list
ns2.xmission.com as a tertiary nameserver for your domain, then your
domain will continue to have working nameservice in the event that the two
onsite nameservers are offline.

* QMOE customers suffered a prolonged outage due to the same Cisco 6509
that caused problems with our name servers. They have since been moved to
a different router with greater redundancy. We will send our QMOE
customers a separate email with further details by tomorrow.

* About 25% of our colocation customers suffered a brief power outage,
since our customers are spread across 3 separate UPSes. Otherwise, aside
from networking being briefly down after the initial outage, colocation
services were not widely affected. In case some colocation customers are
not aware, they can purchase power strips fed from different UPSes to have
redundant power. If you suffered equipment loss from the power outage or
would like to learn more about redundant power, please contact your sales
rep.
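
The nameserver advice in the DNS item above can be checked with a short
script. This is only a rough sketch: the helper names are made up for
illustration, and it assumes you paste in the NS records your domain
actually lists (for example, from your registrar's control panel). It
simply reports whether at least one listed nameserver is outside the two
onsite servers.

```python
# Hypothetical helper, not an XMission tool: checks whether a domain's
# NS list still has a nameserver left if the two onsite servers are down.

# The two onsite nameservers named in the announcement.
ONSITE = {"ns.xmission.com", "ns1.xmission.com"}


def normalize(host):
    """Lower-case a hostname and drop surrounding spaces and the trailing dot."""
    return host.strip().rstrip(".").lower()


def survives_onsite_outage(ns_records, onsite=ONSITE):
    """Return True if at least one listed nameserver is not an onsite one."""
    onsite_names = {normalize(h) for h in onsite}
    listed = {normalize(h) for h in ns_records if normalize(h)}
    return bool(listed - onsite_names)


# Illustrative NS sets for an example domain (assumed data).
two_only = ["ns.xmission.com.", "ns1.xmission.com."]
with_tertiary = two_only + ["NS2.XMission.com."]

print(survives_onsite_outage(two_only))       # only onsite servers listed
print(survives_onsite_outage(with_tertiary))  # ns2.xmission.com also listed
```

A domain that lists only the two onsite nameservers fails the check, while
one that also lists ns2.xmission.com passes.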

Resolutions and Moving Ahead
----------------------------
We are very sorry for all of the problems that this outage has caused our
customers and greatly appreciate all of the kind words and support you
have given us. More than anything, we want to assure you that we are
taking this matter seriously and proceeding with steps to lessen the
chances that something like this can happen again, which include:

* We will more diligently use our existing NetStatus page to keep
customers informed about our systems:
 http://stats.xmission.com/netstatus

* We will also announce all upcoming maintenance on the NetStatus page in
the future and email those who opt into a list, which will soon be created
for this purpose. To be added to the list, please email
support@xmission.com.

* We recognize that we need to handle communication much better in the
future. We did set up an outage page with updates but realize that most
customers did not know such a page existed:
 http://stats.xmission.com/outage/

* We also have our Nagios systems status page, which provides a very
good look into our systems:
 http://stats.xmission.com/nagios/

* For those who did not know, XMission has a blog which we use to talk
with our customers:
 http://transmission.xmission.com

* We plan to run redundant power from a second UPS up to our server room
to feed essential hardware with dual power supplies. That alone would have
dramatically minimized the effects of yesterday's outage.

* While we already perform most maintenance outside of business hours, we
have decided to enforce a policy that all system-critical maintenance
(i.e., maintenance involving power, routers, or core systems) must happen
outside of business hours. Some additional training is also planned
regarding our electrical infrastructure.
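
As a rough illustration of the dual power supply plan above, the sketch
below models which devices stay up when one UPS drops. The device names
and UPS labels are invented for the example and do not describe
XMission's actual layout; the only idea taken from the announcement is
that hardware fed from two different UPSes keeps running when one fails.

```python
# Illustrative model only: each device lists the UPS feeds it is plugged into.


def powered_devices(devices, failed_ups):
    """Return the sorted names of devices that still have a live feed."""
    return sorted(
        name
        for name, feeds in devices.items()
        if set(feeds) - {failed_ups}  # at least one feed is not the failed UPS
    )


# Assumed example inventory (hypothetical names).
devices = {
    "single-feed-server": ["ups1"],
    "dual-feed-server": ["ups1", "ups2"],
    "other-server": ["ups3"],
}

print(powered_devices(devices, "ups1"))
```

With "ups1" failed, the single-feed device is the only one that loses
power; the dual-feed device stays up on its second feed.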

------------------------------------------------------------------------------
This has been an XMission Announcement.   Past announcements available at: 
     WWW      - http://www.xmission.com/cgi-bin/announcements
     Homepage - http://home.xmission.com
     News     - xmission.announce
     FAQ      - http://www.xmission.com/help/misc/faqs/announcements.html


EPA Green Power Partner
XMission Internet
51 East 400 South Suite 200
Salt Lake City, Utah 84111
Phone: 801.539.0852
Toll-Free: +1.877.964.7746