FS#8565 — stop and go

Scheduled Maintenance Report for Bare Metal Cloud

Completed

Hello,

For the past few days, we have had instability issues on the VPS 2013 delivered to our customers.
The VPS 2013 delivered a few weeks ago were not faulty at the time. The issue appeared 10 days ago
and has grown worse every day. We are urgently handling bugs related to
vCloud 5.1 and the 1000v, which surfaced once thousands of VPS were running with real customers
performing all kinds of operations.

We have decided to suspend sales of new VPS until this problem is fixed.
We believe this will take us 7 to 8 days, which means that next Tuesday
or Wednesday we will reopen orders and provide good quality of service again.
This of course means that all customers who have had faults recently
will get a month for free.

Also, during these 7-8 days, we are going to split the VPS infrastructure into many
small infrastructures. This will be done tomorrow morning. It will cause a service
interruption of 60 to 180 seconds per VPS. For new orders, we are going to apply this new
maximum size of an infrastructure (the vendors' figures are... wrong). Then we will recode all the robots
and the API to use vSphere directly rather than vCloud. This will take 2-3 days with 9 people.
Then we will have 2-3 days to test the manager/API and the routine operations (reinstall, snap).
So this will be finished next Wednesday, and we will no longer hear about VPS issues.

During this work, the manager/API will most likely show unusual issues/errors.
This is normal: we are recoding it.
We are not used to taking a decision as radical as closing orders,
but the idea is to put all our resources on this issue. Also, handling the (large) flow of new orders
would not allow us to move as fast while rechecking the whole infrastructure, as we will be doing.

We apologize for these outages.

And now, to work. We have 8 days at most. Here we go.

Regards,
Octave

Update(s):

Date: 2013-05-06 00:33:41 UTC
Hello,
Here is some news on the progress of VPS 2013.

We have found the origin of the stability issues encountered
on the new VPS 2013 platform. It was due to an incompatibility
between physical servers using 10G network interfaces and
the Cisco 1000v virtual switch.
For a reason yet unknown, VPS sometimes stop responding to ping,
sometimes frequently and at random, when running on hosts with 10G.
Once we automatically move the VPS from one host to another,
it works again, then stops after a while if the new host is also in 10G.
It took us time to make the connection between the 10G hosts and the 1000v.
vCloud had to be removed in order to make sure it was
not the cause. Then we could see the infrastructure clearly: at first
a suspicion, then confirmation that it was a bug. On Saturday at 04:00 am, we
moved the last VPS off a 10G host, and since then we have
had no instability.
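
To illustrate how the link between 10G hosts and the ping losses can be made visible, here is a minimal sketch that groups ping observations by the network interface type of the host. The function name, the "10G"/"other" labels and the sample data are all hypothetical and made up for the example; this is not our monitoring code.

```python
from collections import defaultdict

def failure_rate_by_nic(observations):
    """Return the share of failed pings per host NIC type.

    observations: iterable of (nic_type, ping_ok) tuples (hypothetical format).
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for nic_type, ping_ok in observations:
        totals[nic_type] += 1
        if not ping_ok:
            failures[nic_type] += 1
    return {nic: failures[nic] / totals[nic] for nic in totals}

# Made-up sample: VPS on 10G hosts lose ping, the others do not.
sample = [
    ("10G", False), ("10G", True), ("10G", False),
    ("other", True), ("other", True),
]
rates = failure_rate_by_nic(sample)
for nic, rate in sorted(rates.items()):
    print(f"{nic}: {rate:.0%} of pings failed")
```

A clear gap between the two rates points to the host type rather than to the VPS themselves, which is the kind of signal that was hidden as long as vCloud was also a suspect.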

We have nevertheless replaced vCloud with vSphere, and we are finishing
the recoding for Tuesday night.
This will simplify the code, as we had to code many "workarounds"
for vCloud bugs, for things that work directly in vSphere.
A huge waste of time for you and for us, mainly for Windows,
the network, etc. Regarding the code, 80% has already been
rewritten and is running. The rest of the API will be fixed within 48h.

We are looking into extending all VPS by 1 month at our expense.
We have had a lot of failures over the past month, and it would be difficult for us
to justify billing for that month.

Now that we are on vSphere, it will be simpler, for example, to code
the "high IO" disks for those who need guaranteed storage performance.
Under vCloud, we spent 2 weeks searching for a way to
make the operation "non automatic", while in vSphere we decide everything
and leave no decision to vCloud. In short, we will finally be able to code simply
and directly.

The VPS 2013 infrastructures are protected by Arbor.
This lets us filter a few simple attacks and better protect
the infrastructure against instabilities. We are awaiting the rest
of the mitigation infrastructure to add new functionality in terms of
the types of attack detected.

Once again, we apologize for all these failures, which are unusual for us.

Regards,
Octave

Date: 2013-05-01 13:55:02 UTC
While investigating the remaining VPS issues,
we found an output problem on the VMs connected
to the 1000v on XL hosts.

We are launching an immediate migration of all
VMs on XL hosts to L2+ hosts.

If you have a problem, don't hesitate to send us
an email (oles@ovh.net) or a tweet (@olesovhcom)
with the details of the problem and the name of the VPS.

Date: 2013-05-01 05:57:04 UTC
If you have any problem, feel free to contact us by email (oles@ovh.net) or via Twitter (@olesovhcom), explaining the issue and specifying the VPS name.

Date: 2013-05-01 05:55:25 UTC
All VMs are up.

Date: 2013-04-30 23:14:33 UTC
All Cloud VPS have been switched. We are finishing the task
of rebooting the few VMs that do not respond to ping.

Among the VPS not responding to ping are Windows machines,
which do not respond to ping by default. However,
this will not last long.

We will focus on the remaining VPS that are down.
In their case, when we do a "vmotion" from one host to another,
ping comes back... there is a bug.

Also, a few remaining VPS were not reset because the
process crashed at the end (setting ACL,
MAC, VLAN, the port on the 1000v, IP/MAC on the router, etc.).
We are relaunching the script on the VPS that were not rebooted.
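
As an illustration of relaunching a script only where it did not finish, here is a minimal sketch that records the completed steps per VPS and resumes at the first missing one. The step names follow the list above, but their order, the `apply_step` callback and the VPS names are assumptions made for the example, not our actual robots.

```python
# Hypothetical step order; the real order used by the robots is not stated.
STEPS = ["acl", "mac", "vlan", "port_1000v", "router_ip_mac", "reboot"]

def run_steps(vps, done, apply_step):
    """Run the pending steps for one VPS.

    done: set of already completed steps, updated in place.
    Returns True when every step has completed, False on the first failure.
    """
    for step in STEPS:
        if step in done:
            continue  # already applied during a previous run
        try:
            apply_step(vps, step)
        except Exception:
            return False  # keep the progress made so far for the next run
        done.add(step)
    return True

def relaunch(progress, apply_step):
    """Resume every VPS in `progress` and return the names still incomplete."""
    return [vps for vps, done in progress.items()
            if not run_steps(vps, done, apply_step)]

# Made-up state: vps1 is finished, vps2 stopped after the MAC step.
progress = {"vps1": set(STEPS), "vps2": {"acl", "mac"}}
calls = []

def record_step(vps, step):
    # Stand-in for the real action; it only records what would be done.
    calls.append((vps, step))

remaining = relaunch(progress, record_step)
print(remaining, calls)
```

Keeping the progress per VPS means a second run does not touch the VPS that were already completed.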

Date: 2013-04-30 23:04:20 UTC
Around 250 VPS remain to be switched.

Date: 2013-04-30 12:40:21 UTC
60% of the infrastructure has been migrated.
Everything is going as expected, we are continuing with the maintenance.

Date: 2013-04-30 10:46:55 UTC
Everything is going smoothly, we are continuing with the maintenance.

Date: 2013-04-30 08:56:22 UTC
The migrations to the new infrastructure are still in progress. Everything is going according to plan.

We have migrated 10% of the infrastructure.

Date: 2013-04-30 07:04:24 UTC
We are starting the migrations.

Date: 2013-04-30 03:52:47 UTC
The VPS2013 robots are now stopped. All new operations inserted in our databases will be processed later. Setup of the new infrastructure is in progress. The robots driving basic actions such as start, stop and reboot have already been recoded to interface with vSphere without going through the vCloud layer.
We are currently running tests to make sure that the whole setup performs well and that the databases are consistent. We will keep you informed as this work progresses.

Posted Apr 29, 2013 - 23:49 UTC