Earlier today we experienced two separate but related network faults that affected a number of customers. We take the availability of our network very seriously, and because the impact of these incidents was much larger than usual, we felt a post-mortem was warranted.
At 11:15pm on Monday 2nd July, we performed routine maintenance on our IPv6 network. Following the change, our engineering team confirmed normal traffic flow and levels and completed post-work checks.
At approximately 11:48pm we detected an issue affecting a small amount of traffic passing through one of the provider edge (PE) nodes in our Sydney syd02 core. Investigation showed MPLS-labelled traffic being dropped in some circumstances where the next-hop was directly attached to the PE interface.
We rolled back the earlier change; however, this did not fully resolve the incident, so we migrated the affected network traffic from that PE device to an alternate device. We once again completed post-incident checks and confirmed that all traffic appeared to be flowing correctly, including an extensive test from all of our POPs confirming IP reachability to external destinations and a number of key services.
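As a simplified illustration, a reachability sweep like the one described can be scripted. The sketch below (the function name and endpoint list are hypothetical, not our actual tooling) probes TCP reachability to a set of key destinations with a short timeout:

```python
import socket

def check_reachability(endpoints, timeout=2.0):
    """Probe TCP reachability for each (host, port) endpoint.

    Returns a dict mapping (host, port) -> True if a TCP
    connection could be established within the timeout.
    """
    results = {}
    for host, port in endpoints:
        try:
            # Attempt a full TCP handshake to the endpoint.
            with socket.create_connection((host, port), timeout=timeout):
                results[(host, port)] = True
        except OSError:
            # Timeout, refusal, or routing failure all count as unreachable.
            results[(host, port)] = False
    return results
```

In practice a check like this would be run from each POP against a list of external destinations and key on-net services, flagging any endpoint that fails to connect.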
At approximately 6:50am we were made aware that some customers in Victoria were experiencing difficulty accessing on-net services attached to the same PE device involved in the previous night's issue. The problems were specific to a handful of customers and could not be reproduced on similarly configured devices in NSW, Victoria, Queensland, Canberra, Darwin or Perth.
We identified that traffic following a certain MPLS labelled path from Victoria was now exhibiting a similar, though distinct, fault to the one we had seen the previous evening. We made a configuration change to the PE device relating to how it handled IPv4 and IPv6 routes within our IGP, which restored full network connectivity for the affected customers.
Unfortunately, a side effect of this change was that it prevented the advertisement of some WAN routes for six additional customers across our network, resulting in a further incident. After consultation, our engineering team decided to migrate the affected customers' interfaces to alternate PE devices. This was completed over a 60-minute period, with customer services restored progressively during that time.
We are working with the hardware vendor to identify the specific sequence of events that triggered the outage; however, there will be no further impact, as all relevant services have been migrated to alternate network paths. The PE device at the core of the fault has a hardware configuration that is unique within our infrastructure, and we do not use the same device elsewhere. As a result, we have a very high level of confidence that this problem will not recur.
On behalf of the whole team at Real World, I apologise for the inconvenience that interruptions such as this cause our customers, and I want to thank the affected customers for their patience this morning as our team worked to restore all affected services.
If you have any follow up questions or concerns, please do not hesitate to contact our team.