OVHcloud Bare Metal Cloud Status

[BHS][Dedicated Servers] - Network Perturbation Incident Notification
Incident Report for Bare Metal Cloud
Postmortem

On Saturday, October 21st, 2023, at 08:13 UTC, a router in the Montreal PoP (Point of Presence) broadcast warnings.
Our technical teams began investigating the situation as soon as the alerts were received.

While we were troubleshooting, clients began to report issues between the BHS (Beauharnois) datacenter and external sources (such as their own external servers or third-party services).
The external network traffic going through the PoP and cascading down to BHS was degraded while our internal services were operational.

Since our monitoring relies essentially on internal agents and network traffic wasn't fully down, we didn't immediately identify that some external requests could fail randomly. This slowed down the investigation, which we regret.
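To illustrate the monitoring gap described above, here is a minimal sketch of how probe results from inside and outside the network could be compared. It is not OVHcloud's monitoring; the `ProbeResult` type, the "internal" and "external" labels, the 0.95 threshold, and the status names are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    vantage: str  # hypothetical labels: "internal" or "external"
    ok: bool      # True if the probe (ping, HTTP check, ...) succeeded


def success_rate(results, vantage):
    """Fraction of successful probes from one vantage point, or None if no samples."""
    samples = [r.ok for r in results if r.vantage == vantage]
    if not samples:
        return None
    return sum(samples) / len(samples)


def classify(results, threshold=0.95):
    """Compare internal and external success rates (threshold is an assumed value)."""
    internal = success_rate(results, "internal")
    external = success_rate(results, "external")
    if internal is None or external is None:
        # Without both vantage points, the situation in this incident is invisible.
        return "unknown"
    if internal >= threshold and external >= threshold:
        return "operational"
    if internal >= threshold:
        # Healthy from inside, failing from outside: the pattern clients reported.
        return "degraded-externally"
    return "degraded"
```

With only internal samples, the sketch returns "unknown" rather than "operational", which reflects why internal agents alone could not reveal randomly failing external requests.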

Due to the nature of the issue (Network), multiple impacts have been identified, such as:

  • VPN instabilities

  • Difficulty reaching OVHcloud public IPs

  • Timeouts from external domains to OVHcloud

  • Servers up from inside OVHcloud's network but down from an external point of view

  • Intermittent connection issues to OVHcloud servers 

  • Intermittent hostname resolution issues

  • Servers not answering external requests, or answering very slowly

  • Ping/traceroute not reaching OVHcloud

  • Packet loss between multiple network links and IPs

 

Around 14:30 UTC, we identified the faulty network device, which had an issue with its FIB (Forwarding Information Base). We immediately began the verification and isolation processes.

Isolating the device resolved the issue.

Some additional time was needed before we could deem the incident fully resolved.
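For readers unfamiliar with the term, the sketch below shows what a FIB does in principle: it maps a destination address to a next hop using the most specific matching prefix. It is an illustration only and says nothing about the actual fault on the affected device; the prefixes are documentation ranges and the next-hop names are hypothetical. Real routers perform this lookup in dedicated hardware rather than with a linear scan.

```python
import ipaddress


def lookup(fib, destination):
    """Return the next hop of the longest prefix containing destination, or None."""
    addr = ipaddress.ip_address(destination)
    best = None  # (network, next_hop) of the most specific match so far
    for prefix, next_hop in fib.items():
        network = ipaddress.ip_network(prefix)
        if addr in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, next_hop)
    return best[1] if best else None


# Hypothetical table: documentation prefixes and made-up next-hop names.
fib = {
    "0.0.0.0/0": "upstream-a",
    "192.0.2.0/24": "bhs-core",
    "192.0.2.128/25": "bhs-edge",
}

print(lookup(fib, "192.0.2.200"))  # matches both the /24 and the /25; the /25 wins
```

Because each destination is resolved independently, a wrong or missing entry in such a table could affect some destinations while leaving others reachable.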

Post-incident investigation points to a third-party software malfunction.

The issue has been raised with the appropriate parties.

This incident will help us improve our action plans (including a campaign to check all our PoP devices).

We are sorry for any inconvenience caused by this issue.

Posted Oct 26, 2023 - 13:31 UTC

Resolved
Start time : 21/10/2023 09:14 UTC
End time : 21/10/2023 14:24 UTC
Root cause : Post-Mortem will be published after incident closing
Our technical teams resolved the issue. All impacted services are now operational.
Posted Oct 26, 2023 - 13:29 UTC
Monitoring
Our technical teams have implemented a fix to resolve the network perturbations.
No network perturbations have been observed since 14:24 UTC.
Our teams are monitoring the situation.
Posted Oct 21, 2023 - 15:23 UTC
Update
Our teams are still working on fixing the issue.
An update will be posted as significant progress is made.
Posted Oct 21, 2023 - 14:00 UTC
Investigating
Start time : 21/10/2023 09:14 UTC
Service impact : We are experiencing some degradation on servers in BHS.
Ongoing actions : Our teams are investigating to determine the origin of the incident and fix it.
An update will be posted as significant progress is made.
Posted Oct 21, 2023 - 10:19 UTC
This incident affected: Dedicated Servers || Network (BHS).