The Case for a Holistic Approach to Resiliency in Next-Gen Telecommunications Networks

By Robert Novo, Service Delivery Director - Voice Communications, Americas, BT


Telecommunications networks are growing more complex as rapidly evolving technologies such as cloud, IoT, AI, AR/VR and blockchain combine with time-proven voice, video and data services to enable new, mission-critical applications, some of which have strict uptime and performance requirements. For large multinational corporations, a network outage of even a few minutes can translate into millions of dollars in lost revenue or productivity and can ultimately cause unquantifiable damage to a company’s reputation and brand. Nevertheless, budgets available to deploy, upgrade and maintain networks are not bottomless. Resiliency cannot be achieved simply by building out hyper-redundancy across numerous diverse paths with ever-increasing capacity. Investments must be optimized to yield the highest ROI from both a line-of-business (revenue) and an IT operations (opex/capex) perspective. The cost of added resilience must be weighed against the value of the avoided risk of service interruption, a value that network operators often underestimate.
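One way to make that trade-off concrete is to compare the yearly cost of a resilience investment with the outage loss it is expected to avoid. The following is a minimal sketch of that arithmetic, not a costing method from any particular operator. The class and function names, and every figure used with them, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ResilienceOption:
    """A candidate design; all fields are illustrative assumptions."""
    name: str
    annual_cost: float               # yearly capex/opex attributed to this design
    outage_minutes_per_year: float   # expected outage minutes under this design


def expected_annual_loss(outage_minutes_per_year: float, loss_per_minute: float) -> float:
    """Expected yearly loss from outages, assuming a flat loss per minute."""
    return outage_minutes_per_year * loss_per_minute


def net_benefit(baseline: ResilienceOption, option: ResilienceOption,
                loss_per_minute: float) -> float:
    """Avoided outage loss minus the extra yearly cost of the option."""
    avoided_minutes = baseline.outage_minutes_per_year - option.outage_minutes_per_year
    avoided_loss = expected_annual_loss(avoided_minutes, loss_per_minute)
    extra_cost = option.annual_cost - baseline.annual_cost
    return avoided_loss - extra_cost


# Hypothetical figures for illustration only.
baseline = ResilienceOption("single path", annual_cost=0.0, outage_minutes_per_year=60.0)
protected = ResilienceOption("diverse path", annual_cost=500_000.0, outage_minutes_per_year=10.0)
print(net_benefit(baseline, protected, loss_per_minute=50_000.0))
```

A positive result suggests the investment pays for itself under the stated assumptions. The flat loss-per-minute model deliberately ignores reputational damage, which the paragraph above notes is hard to quantify.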

The saying goes, “a chain is only as strong as its weakest link.” This is highly relevant in the telecommunications industry, where applications and services are only as resilient as the complex networks they travel on, and the greater the complexity, the greater the vulnerabilities and risks. When pursuing resiliency, many people and organizations tend to focus on network architecture and hardware, aiming for path diversity and active/standby or load-balanced configurations of all critical network elements. However, when designing and operating the network it is important to take a holistic view and use proven design and operational methodologies such as DevOps. Other aspects must also be accounted for: the software in management tools and network elements, the skills of the people, and the processes in place to operate the network. Let us explore some areas that must be addressed to realize high end-to-end resiliency. Several are frequent blind spots for operators: security, business continuity and disaster recovery planning, cross-layer interdependencies, operational simplicity, and rapid or automated service restoration mechanisms.


Case in point: a network can be “diagrammed” to have full physical diversity, with at least two fully separate paths between every pair of endpoints. However, this diversity must be augmented with failure detection and automated protection switching in order to maintain or recover service after a fiber cut or other failure. A network operator may show full diversity on paper with a ring or mesh architecture, but how would service actually be maintained or restored after a fiber cut? If someone must log into a console and reroute traffic manually, restoration will be delayed. The delay grows if a technician must travel to a remote site and physically re-patch cables at a patch panel. Furthermore, each of the diverse paths must have adequate capacity to carry the protected traffic load in case of such a failure. This point may appear obvious, and networks are typically designed with adequate capacity when first deployed. However, as demand grows, it is critical that capacity keeps pace and that resiliency is not sacrificed to accommodate new traffic from growing business opportunities.

Maintaining a current asset inventory and managing asset life cycles is often overlooked, yet it is critical to service resiliency. Again, the inventory is normally accurate when the network is first deployed. As networks evolve, however, the record of all network elements and links must be kept up to date and easily accessible to everyone with a need to know, especially during emergency service restoration. Hence the need for a sound asset management and change management platform and process. A fiber cable can carry various services at the MPLS, DWDM, Ethernet and TDM layers, all with different Class of Service (CoS) requirements and protection schemes. Should that fiber be cut, a prioritized restoration plan must be put in place that accounts for the criticality and SLAs of each service. For instance, priority should be given to emergency response services, if applicable, and to network management communications, which facilitate rapid restoration of the rest of the traffic. Successfully executing such a plan depends on an accurate inventory database.
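To illustrate why the inventory matters, the sketch below derives a restoration order for a cut fiber from a small set of inventory records. It is one possible shape for such a lookup, not a description of any specific asset management product. The record fields, the category names and their ranking, and the fiber and service identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """One inventory record; field names are illustrative assumptions."""
    name: str
    layer: str         # e.g. "MPLS", "DWDM", "Ethernet", "TDM"
    fiber_ids: tuple   # fibers this service rides on
    category: str      # restoration category, ranked in PRIORITY below


# Lower number = restore first. The ranking itself is an assumed policy.
PRIORITY = {
    "emergency": 0,
    "network_management": 1,
    "premium_sla": 2,
    "best_effort": 3,
}


def restoration_order(inventory, cut_fiber):
    """Return the services affected by `cut_fiber`, highest priority first."""
    affected = [s for s in inventory if cut_fiber in s.fiber_ids]
    # Unknown categories sort last; ties are broken by name for a stable plan.
    return sorted(affected, key=lambda s: (PRIORITY.get(s.category, len(PRIORITY)), s.name))


inventory = [
    Service("branch-internet", "Ethernet", ("F-12",), "best_effort"),
    Service("noc-oob-mgmt", "MPLS", ("F-12", "F-07"), "network_management"),
    Service("emergency-dispatch-trunk", "TDM", ("F-12",), "emergency"),
    Service("bank-wavelength", "DWDM", ("F-03",), "premium_sla"),
]

for service in restoration_order(inventory, "F-12"):
    print(service.name, service.layer, service.category)
```

If a record is missing or points at the wrong fiber, the affected service never appears in the plan, which is exactly the failure mode that an out-of-date inventory produces.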

Another example is in the area of security and its impact on resiliency. Consider a network operator whose CISO implements a highly robust, multi-layer authentication system to improve resiliency against intrusions, yet does not audit or manage the user community to prevent passwords from being written on whiteboards or sticky notes around the office for easy reference. The strongest technical control is undermined by the weakest everyday practice.

The evolution of the network and its technology is another critical consideration in maintaining resiliency. As noted above, capacity needs to grow to keep pace with growing traffic, but that’s not all. The life cycle of the products in the network must also be considered. Devices kept in service beyond the vendor end-of-life date may compromise overall resiliency, because spares and software fixes for bugs and vulnerabilities will be of limited availability.

A frequent blind spot is simplex network operation, in which a redundant element has failed and the network runs unprotected. Many operators give lower priority to proactive repairs on network elements when service is not currently impacted. With the day-to-day challenges of running a business, there’s a natural tendency to defer tasks that do not have an immediate impact on service. Of course, this increases risk exposure: until the repair is completed, the network is running unprotected. We find that world-class service providers mitigate this by tracking KPIs and maintaining internal OLAs for non-service-impacting repairs.

We have examined only a few examples of areas for consideration. For the network to be truly resilient, a comprehensive approach must be undertaken that considers the everyday challenges of providing telecommunications services in a competitive environment.
