FS#5570 — pcc-000159

Incident Report for Bare Metal Cloud

Resolved

The temporary filer used for the VPS beta tests, which was not removed from production (internal error), is down.
We are currently migrating the customers who renewed their VPS after the beta.
258 VPS are impacted.
Data is not lost.
Customers should have their service back in about 1 hour.

Update(s):

Date: 2011-06-30 21:44:33 UTC
All impacted VPS are back to normal.

Date: 2011-06-30 14:58:32 UTC
Communication has been lacking. We are sorry that
information has not been reaching you in real time, even though the
team has been working on the problem the whole time.
Here is some information that was posted on the mailing list
vps@ml.ovh.net

Date: Thu, 30 Jun 2011 00:39:15 +0200
From: Oles
To: ""
Cc: "vps@ml.ovh.net"
Subject: Re: [vps] the filer of the beta

Some explanations.

The maintenance is progressing more slowly than expected.
In simpler terms, we have lost one of the first-generation
filers that we used for the beta.
We should have moved the customers off it long ago,
but since not all of them had the 99.99% option,
the switch would have meant downtime.
We will therefore move everybody to
99.99% and then switch filers live.
The commercial offer changed yesterday, and we were preparing
all the migrations and modifications.
Unfortunately, one of the disks blocked half of the
filer, and since this is the first generation, there is no second
half. The result is a crash. In the latest version, the NAS
is HA with 2 disk shelves
instead of 1. The disk crashed the NAS so badly that
the ZFS filesystem
can no longer be written to. We managed to mount the
ZFS filesystem read-only and we copied the data from one
filer to another. The data is there, so nothing is lost,
but we need to switch everything to a new filer.
If there is a problem, we have the backups, but since the data is there,
we prefer to recover the most recent data, i.e. that of the filer.

We hope to finish during the night. In any case,
we are working on it at 100%. We are as sad and angry as you are
about this crash, because this problem has damaged all the work
we have done around the VPS.
This proves once again that we should look not at price
but at reliability and availability. With 99.99% by default, the migrations
would already have been done and this problem would never
have existed. But it does exist, and we will carry it with us for the next 3 years.

In brief :(

Well.

As soon as it is fixed, we will continue working on the
migrations. We will move everybody to 99.99%,
then we will do the migrations to the new filers.
We were 30% through the preparation. This week, the live
migrations should start.

Date: 2011-06-30 07:17:57 UTC
206 impacted VPS have been put back into production.

Date: 2011-06-29 22:08:44 UTC
There was a hardware problem on the filer.
We are moving the data to a new filer.

Posted Jun 29, 2011 - 22:07 UTC