Rogers Comes Clean on July 8 Outage

By: Mary Jander


Rogers Communications Inc. (NYSE: RCI) revealed new details to Canadian officials about the day-long, systemwide outage on July 8 that affected millions of customers and could ultimately cost the telco $125 million or more in fines and expenses.

On Friday, July 22, Rogers submitted a detailed reply to multiple questions from the Canadian Radio-television and Telecommunications Commission (CRTC) about the outage. Though many details have been redacted in the public version of the document, ostensibly for security reasons, enough remains to provide a rough outline of how the outage unfolded -- and what might be done to ensure it doesn't happen again.

Anatomy of a Telco Disaster

Rogers says the trouble was caused by a planned upgrade to the carrier’s core network, during which a router’s filter was deleted, setting off a chaotic flood of routing messages that crashed the network. According to research by content delivery network provider Cloudflare (NYSE: NET), the situation led to Border Gateway Protocol (BGP) route flapping, in which routers were announcing and withdrawing routes willy-nilly.
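For readers unfamiliar with route flapping, here is a minimal Python sketch of how it might be spotted in a stream of routing updates. It is an illustration only: the event format, the function names, and the threshold are all hypothetical, and nothing here reflects Rogers' tooling or Cloudflare's analysis.

```python
from collections import defaultdict


def count_flaps(events):
    """Count announce/withdraw transitions per prefix.

    `events` is an iterable of (prefix, action) tuples, where action is
    "announce" or "withdraw". This event format is an assumption made
    for illustration, not a real BGP message format.
    """
    last_action = {}
    flaps = defaultdict(int)
    for prefix, action in events:
        previous = last_action.get(prefix)
        # A flap is any change of state for a prefix we have seen before.
        if previous is not None and previous != action:
            flaps[prefix] += 1
        last_action[prefix] = action
    return dict(flaps)


def flapping_prefixes(events, threshold=3):
    """Return prefixes whose state changed at least `threshold` times."""
    counts = count_flaps(events)
    return sorted(p for p, n in counts.items() if n >= threshold)


sample_events = [
    ("10.0.0.0/8", "announce"),
    ("10.0.0.0/8", "withdraw"),
    ("10.0.0.0/8", "announce"),
    ("10.0.0.0/8", "withdraw"),
    ("192.0.2.0/24", "announce"),
]
unstable = flapping_prefixes(sample_events)
```

In this toy data, one prefix changes state three times and is flagged, while the prefix that is announced once and left alone is not.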

What isn’t explained is how and why the router filter was deleted. Rogers claims that in making the upgrade, the company followed a seven-phase procedure that included “scoping, budget approval, project approval, kickoff, design document, method of procedure, risk assessment, and testing, finally culminating in the engineering and implementation phases.” What went wrong in that sequence, the “root cause” of the problem, has been redacted from the CRTC document.

Oddly, Rogers appears to lay some blame on its vendors. The document explains:

“[T]he two IP routing vendors Rogers uses have their own design and approaches to managing routing traffic and to protect their equipment from being overwhelmed. In the Rogers network, one IP routing manufacturer uses a design that limits the number of routes that are presented by the Distribution Routers to the core routers. The other IP routing vendor relies on controls at its core routers. The impact of these differences in equipment design and protocols are at the heart of the outage that Rogers experienced.”
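The idea of limiting how many routes a core router will accept can be sketched in a few lines of Python. This is a toy model under stated assumptions: the class name, the limit value, and the reject-on-overflow behavior are hypothetical, and the redacted document does not say how either vendor actually implements its protections.

```python
class PrefixLimitExceeded(Exception):
    """Raised when an update would push a session past its route limit."""


class CoreRouterSession:
    """Toy model of a core router session with a maximum-route guard."""

    def __init__(self, max_prefixes):
        self.max_prefixes = max_prefixes  # hypothetical configured ceiling
        self.routes = set()

    def receive(self, prefixes):
        """Accept a batch of routes unless it would exceed the limit."""
        candidate = self.routes | set(prefixes)
        if len(candidate) > self.max_prefixes:
            # Refuse the whole batch rather than let the table overflow.
            raise PrefixLimitExceeded(
                f"{len(candidate)} routes would exceed limit {self.max_prefixes}"
            )
        self.routes = candidate
        return len(self.routes)


session = CoreRouterSession(max_prefixes=3)
accepted = session.receive(["10.0.0.0/8", "172.16.0.0/12"])

# Simulate an unfiltered flood of routes arriving from a distribution router.
flood = [f"198.51.100.{i}/32" for i in range(10)]
try:
    session.receive(flood)
    flood_rejected = False
except PrefixLimitExceeded:
    flood_rejected = True
```

With a guard like this, the flood is refused and the existing routes stay in place. Without one, the table simply grows until the device runs out of resources.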

Problems with Redundancy and Resiliency

Despite the CRTC document's many redactions, several issues clearly emerge:

  • Rogers didn’t know whether critical networks were redundant. “Rogers provides wireless and wireline connectivity services to various customers who are classified as critical infrastructure (e.g. hospitals, gas and energy providers, etc.),” states the CRTC document. “Each of these customers’ services were impacted by the outage. It is not known whether these customers were fully impaired or if they had some degree of dual-carriers diversity that protected them from full disablement.” [Italics added.]
  • Rogers was unable to roll service over to competitive networks from Bell Canada and Telus (NYSE: TU). Rogers had lost access to its customer registries and databases, which would have been required to port services to another carrier’s system. Also, the outage was so big that transferring services to other carriers would probably have overwhelmed those networks, Rogers says.
  • Rogers had no fallback position for failure of its emergency alert system. As a result, Rogers failed to deliver four alerts in Saskatchewan: three weather-related warnings and one dangerous-person warning. Here Rogers faced relentless questioning from the CRTC, but its ultimate answer remained: “The only way to fully restore our alerting capabilities was to bring back on-line our IP core network.” And that didn’t happen for many hours.
  • Rogers tied its wireless network to its core IP network. When the core IP network failed, it dragged the cellular network with it, prompting questions about why the two networks can’t operate separately.

Steps to Avoid Future Outages

One thing is clear: It's time for Rogers to take action to introduce better redundancy and resiliency in its network. On Sunday, July 24, CEO and president Tony Staffieri outlined some general steps toward these goals in an open letter to customers. He said 9-1-1 alerts will be addressed as a priority. In the CRTC document, Rogers notes that a Memorandum of Understanding about keeping alert networks up through cooperative effort is expected to be delivered to Canada’s Minister of Innovation, Science and Economic Development in September 2022 by members of the Canadian Security Telecommunications Advisory Committee (CSTAC), including Rogers, Bell, and Telus.

Separately, Staffieri pledged to physically separate the company’s cellular and IP networks and to invest $10 billion over the next three years to explore better reliability, enlisting artificial intelligence (AI) to help the cause. Rogers will also undertake a “full review of our network” with tech partners. (Will router vendors be replaced, perhaps?)

The Fallout

Despite all these good intentions, many Canadians aren't placated. Indeed, the outage has opened a Pandora's box of issues. Today, for instance, the outage was the subject of a meeting of Canada's House of Commons committee on industry and technology, during which representatives criticized Rogers, demanding legislation that would hold the telco more accountable. The redactions in the CRTC document were pilloried. Some even suggested punishing Rogers with fines.

Heads have rolled already: In the weeks since the outage, CTO Jorge Fernandes was replaced by Ron McKenzie, who formerly led Rogers for Business and also acted as SVP of technical operations. Significantly, McKenzie also held senior posts at Shaw Communications, the company Rogers is trying to buy for $15 billion. The outage is the latest in a series of hurdles to that deal, and it may threaten the closing.

Whatever the outcome, it’s clear that Rogers is in the hot seat and will stay there until it has made effective changes to its fundamental network architecture. Stay tuned.