We are aware that some DXI services are currently unavailable. Engineers are investigating the issue.
16:40 - Investigations are ongoing.
16:45 - Engineers have diagnosed the problem and are working to restore the service.
16:55 - All services have now been restored. We apologise for the loss of service and will produce a fault report once we have investigated the root cause.
Post Mortem Investigation
Some customers experienced problems loading web pages between approximately 16:15 and 16:55.
High Level explanation
A firmware upgrade of one of our Cisco UCS clusters, a procedure that is performed regularly (hundreds have been completed in the last 12 months), triggered a network event that caused a routing failure.
The issue was resolved when engineers reset the network devices.
The routing failure was caused by a brief loss of connectivity between virtualised load balancers, which occurred when a firmware patch on the Cisco UCS platform disconnected virtual machines from the network layer. Although the connection recovered within a second, a protocol within the network layer, GLBP, failed to correctly re-cache the MAC addresses of the affected load balancers. As a result, the backup load balancer was live in some parts of the network while the original primary load balancer was live in others.
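The inconsistent state described above, where different parts of the network resolve the same service address to different load balancers, can be illustrated with a minimal Python sketch. It assumes the ARP cache of each network device has already been collected into a dictionary; the device names, IP address, MAC addresses and the function names `group_devices_by_mac` and `has_split_view` are all hypothetical and are not taken from the DXI platform.

```python
from collections import defaultdict


def group_devices_by_mac(arp_views, vip):
    """Group devices by the MAC address each one has cached for `vip`.

    arp_views maps a device name to that device's ARP cache,
    itself a mapping of IP address -> MAC address.
    """
    by_mac = defaultdict(list)
    for device, cache in arp_views.items():
        mac = cache.get(vip)
        if mac is not None:
            # Normalise case so "AA:BB" and "aa:bb" compare equal.
            by_mac[mac.lower()].append(device)
    return dict(by_mac)


def has_split_view(arp_views, vip):
    """Return True if devices disagree on which MAC owns `vip`."""
    return len(group_devices_by_mac(arp_views, vip)) > 1


# Hypothetical snapshot: two switches still point at the primary
# load balancer, one has moved to the backup.
example_views = {
    "switch-a": {"192.0.2.10": "00:00:5e:00:53:01"},
    "switch-b": {"192.0.2.10": "00:00:5e:00:53:01"},
    "switch-c": {"192.0.2.10": "00:00:5e:00:53:02"},
}

if has_split_view(example_views, "192.0.2.10"):
    for mac, devices in group_devices_by_mac(example_views, "192.0.2.10").items():
        print(mac, "seen by", ", ".join(sorted(devices)))
```

A check of this kind only detects the disagreement; it does not say which load balancer should be considered live.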
The issue was resolved by logging into every network device and forcing a rebuild of the ARP cache. For customers connecting with their own layer 2 infrastructure, the issue was resolved by issuing gratuitous ARPs from the DXI platform to force their caches to be reset.
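For readers unfamiliar with gratuitous ARP, the following Python sketch builds the bytes of one such frame. It is an illustration only and makes no claim about the tooling used on the DXI platform. A gratuitous ARP is a broadcast ARP packet in which the sender and target IP addresses are the same, so every device that receives it can refresh its cached MAC address for that IP. The function name `build_gratuitous_arp` and the example addresses are hypothetical.

```python
import socket
import struct


def build_gratuitous_arp(mac, ip):
    """Build an Ethernet frame carrying a gratuitous ARP request.

    mac: sender MAC address as "aa:bb:cc:dd:ee:ff"
    ip:  IPv4 address being announced, in dotted-quad form
    """
    src_mac = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)
    broadcast = b"\xff" * 6

    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    eth_header = broadcast + src_mac + struct.pack("!H", 0x0806)

    # ARP header: hardware type 1 (Ethernet), protocol type 0x0800 (IPv4),
    # hardware length 6, protocol length 4, operation 1 (request).
    arp_header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)

    # Sender MAC/IP, then target MAC (zeroed) and target IP.
    # Sender IP == target IP is what makes the ARP gratuitous.
    arp_body = src_mac + ip_bytes + b"\x00" * 6 + ip_bytes

    return eth_header + arp_header + arp_body


frame = build_gratuitous_arp("00:00:5e:00:53:01", "192.0.2.10")
print(len(frame), "bytes:", frame.hex())
```

Actually transmitting the frame needs a raw socket and elevated privileges (for example `AF_PACKET` on Linux), which is deliberately left out of this sketch.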
Prevention of repetition
Whilst the trigger itself should not have occurred, its effect has highlighted an incompatibility between GLBP, REP and Keepalived, which are relied on to provide failover. The design has been reviewed, and the engineering team will perform emergency maintenance on Saturday 20th June from 7pm through the night to implement changes that remove this issue.