MPLS TE Reliability
Overview of MPLS TE Reliability
If attributes of a working MPLS TE tunnel, such as bandwidth, are modified, a new path is set up for the tunnel using modified attributes, and service traffic is switched to the new path. Reliability technologies are required to prevent or minimize packet loss in the process.
If a node or link on a working MPLS TE tunnel fails, reliability technologies are required to set up a backup CR-LSP and switch traffic to the backup CR-LSP, while minimizing packet loss in this process.
When a node on a working MPLS TE tunnel encounters a control plane failure but its forwarding plane is still working properly, reliability technologies are required to ensure nonstop traffic forwarding during fault recovery on the control plane.
Reliability Technology | Description | Function |
---|---|---|
Tunnel attribute update reliability | Ensures reliable traffic transmission when a CR-LSP is set up because of attribute updates. | |
Fault detection | Rapidly detects MPLS TE network faults and triggers protection switching. | |
Traffic protection | Provides end-to-end path protection and local protection. | |
Make-Before-Break
The make-before-break mechanism prevents traffic loss during a traffic switchover between two CR-LSPs. This mechanism improves MPLS TE tunnel reliability.
Background
Any change in link or tunnel attributes causes a CR-LSP to be reestablished using new attributes. Traffic is then switched from the previous CR-LSP to the new CR-LSP. If a traffic switchover is triggered before the new CR-LSP is set up, some traffic is lost. The make-before-break mechanism prevents traffic loss.
Implementation
The make-before-break mechanism sets up a new CR-LSP and switches traffic to it before the original CR-LSP is torn down. This mechanism helps minimize data loss and reduces bandwidth consumption. Make-before-break is implemented using the shared explicit (SE) resource reservation style.
The new CR-LSP may compete with the original CR-LSP for bandwidth on some shared links. The new CR-LSP cannot be established if it fails the competition. The make-before-break mechanism allows the new CR-LSP to use the bandwidth already reserved for the original CR-LSP, so that bandwidth on shared links is not counted twice. Additional bandwidth is required on links of the new path that do not overlap the links on the original path.
In Figure 4-19, the maximum reservable bandwidth on each link is 60 Mbit/s. A CR-LSP has been set up along Path 1 (Switch_1 -> Switch_2 -> Switch_3 -> Switch_4) with the bandwidth of 40 Mbit/s.
A new CR-LSP needs to be set up along Path 2 (Switch_1 -> Switch_5 -> Switch_3 -> Switch_4) to forward data through the lightly loaded Switch_5. The available bandwidth of the link Switch_3 -> Switch_4 is only 20 Mbit/s, not enough for the new path. The make-before-break mechanism can be used in this situation to allow the new CR-LSP to use the bandwidth of the link between Switch_3 and Switch_4 reserved for the original CR-LSP. After the new CR-LSP is established, traffic switches to the new CR-LSP, and the original CR-LSP is torn down.
The make-before-break mechanism can also be used to increase tunnel bandwidth. A new CR-LSP can be established if the reservable bandwidth of each shared link is sufficient for the increase.
On the network shown in Figure 4-19, the maximum reservable bandwidth on each link is 60 Mbit/s. A CR-LSP has been set up along Path 1 with the bandwidth of 30 Mbit/s.
A new CR-LSP needs to be set up along Path 2 to forward data through the lightly loaded Switch_5, and the path bandwidth needs to increase to 40 Mbit/s. The available bandwidth of the link Switch_3 -> Switch_4 is only 30 Mbit/s. The make-before-break mechanism can be used in this situation. This mechanism allows the new CR-LSP to use the bandwidth of the link between Switch_3 and Switch_4 reserved for the original CR-LSP, and reserves an additional bandwidth of 10 Mbit/s for the new path. After the new CR-LSP is set up, traffic is switched to the new CR-LSP, and the original CR-LSP is torn down.
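The bandwidth arithmetic in the two examples above can be sketched as follows. This is a minimal illustration, not device behavior: the function names are hypothetical, and the sketch assumes that on a shared link only the increase over the original reservation has to be newly reserved.

```python
def extra_bandwidth_needed(new_bw, old_bw, link_shared):
    """Bandwidth (Mbit/s) the new CR-LSP must newly reserve on one link."""
    if link_shared:
        # Shared explicit (SE) style: the original and new CR-LSPs share one
        # reservation, so only the increase (if any) is newly reserved.
        return max(0, new_bw - old_bw)
    # On a link the original CR-LSP does not use, the full amount is needed.
    return new_bw


def can_admit(new_bw, old_bw, link_shared, available_bw):
    """True if the link has enough available bandwidth for the new CR-LSP."""
    return extra_bandwidth_needed(new_bw, old_bw, link_shared) <= available_bw


# Example 1: 40 Mbit/s on both paths, 20 Mbit/s left on Switch_3 -> Switch_4.
print(can_admit(40, 40, link_shared=True, available_bw=20))   # True
print(can_admit(40, 40, link_shared=False, available_bw=20))  # False

# Example 2: bandwidth grows from 30 to 40 Mbit/s, 30 Mbit/s left on the link.
print(extra_bandwidth_needed(40, 30, link_shared=True))       # 10
print(can_admit(40, 30, link_shared=True, available_bw=30))   # True
```

The second call shows why the new CR-LSP would fail without sharing: the link would have to carry two independent 40 Mbit/s reservations.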
Switching and Deletion Delays
If a node is busy but its upstream or downstream node is idle, a CR-LSP may be torn down before a new CR-LSP is established, causing a temporary traffic interruption.
The make-before-break mechanism uses switching and deletion delay timers to prevent temporary traffic interruption. When the two timers are configured, the system switches traffic to a new CR-LSP after the switching delay time, and then deletes the original CR-LSP after the deletion delay time.
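The order of events enforced by the two timers can be sketched as a simple schedule. The function name and the timer values are hypothetical, and the sketch assumes that the deletion delay is counted from the moment traffic is switched.

```python
def mbb_schedule(new_lsp_up_at, switch_delay, delete_delay):
    """Return (time_ms, event) pairs for one make-before-break cycle."""
    switch_at = new_lsp_up_at + switch_delay
    # Assumption: the deletion delay starts when traffic has been switched.
    delete_at = switch_at + delete_delay
    return [
        (new_lsp_up_at, "new CR-LSP is up"),
        (switch_at, "switch traffic to the new CR-LSP"),
        (delete_at, "delete the original CR-LSP"),
    ]


for time_ms, event in mbb_schedule(0, switch_delay=5000, delete_delay=7000):
    print(time_ms, event)
```

Because the deletion always follows the switchover, the original CR-LSP keeps forwarding traffic while busy nodes along the new path finish their setup.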
RSVP Hello
The RSVP Hello mechanism rapidly detects reachability between RSVP nodes.
Background
RSVP Refresh messages can synchronize PSB and RSB between nodes, monitor reachability between RSVP neighbors, and maintain RSVP neighbor relationships.
This soft state mechanism detects neighbor relationships using Path and Resv messages. Detection is slow, so a link failure cannot promptly trigger a service traffic switchover. RSVP Hello is introduced to solve this problem.
Implementation
RSVP Hello is implemented as follows:
Hello handshake
As shown in Figure 4-20, LSRA and LSRB are directly connected.
When RSVP Hello is enabled on the interface of LSRA, LSRA sends a Hello Request message to LSRB.
If LSRB is enabled with RSVP Hello, LSRB replies to LSRA with a Hello ACK message after receiving the Hello Request message.
After LSRA receives the Hello ACK message from LSRB, LSRA determines that the neighbor LSRB is reachable.
Neighbor loss detection
After a successful Hello handshake, LSRA and LSRB exchange Hello messages. If LSRA receives no Hello ACK message from LSRB after sending three consecutive Hello Request messages to LSRB, LSRA considers the neighbor LSRB lost.
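The handshake and loss detection described above can be sketched as a small state machine. The class and method names are hypothetical, and the sketch models only the rule that three consecutive unanswered Hello Request messages mark the neighbor as lost. Timers and message encoding are left out.

```python
class HelloNeighbor:
    """Tracks one RSVP Hello neighbor as seen from the local node."""

    LOSS_THRESHOLD = 3  # consecutive Hello Requests without a Hello ACK

    def __init__(self):
        self.state = "init"
        self.unanswered = 0

    def on_request_result(self, ack_received):
        """Record whether the latest Hello Request was acknowledged."""
        if ack_received:
            # A Hello ACK completes (or renews) the handshake.
            self.unanswered = 0
            self.state = "reachable"
        elif self.state == "reachable":
            self.unanswered += 1
            if self.unanswered >= self.LOSS_THRESHOLD:
                self.state = "lost"
        return self.state


lsrb = HelloNeighbor()
for ack in [True, False, False, True, False, False, False]:
    print(lsrb.on_request_result(ack))
```

In this run the neighbor stays reachable after two missed ACKs, because the third request is answered, and is declared lost only after three misses in a row.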
CR-LSP Backup
CR-LSP backup provides end-to-end protection for an MPLS TE tunnel. If the ingress node detects a failure of the primary CR-LSP, it switches traffic to a backup CR-LSP. After the primary CR-LSP recovers, traffic switches back to the primary CR-LSP.
Concepts
CR-LSP backup functions include hot standby and the best-effort path:
Hot standby: A hot-standby CR-LSP is set up immediately after the primary CR-LSP is set up. When the primary CR-LSP fails, traffic switches to the hot-standby CR-LSP.
Best-effort path: If both the primary and backup CR-LSPs fail, a best-effort path is set up and takes over traffic.
In Figure 4-21, the primary CR-LSP is set up over the path PE1 -> P1 -> P2 -> PE2, and the backup CR-LSP is set up over the path PE1 -> P3 -> PE2. When both CR-LSPs fail, PE1 sets up a best-effort path PE1 -> P4 -> PE2 to take over traffic.
A best-effort path has no bandwidth reserved for traffic, but has a hop limit configured to control the nodes it passes.
Implementation
CR-LSP backup deployment
Table 4-16 lists CR-LSP backup deployment items.
Table 4-16 CR-LSP backup deployment

Item | Hot Standby | Best-Effort Path |
---|---|---|
Path | Determine whether the paths of primary and hot-standby CR-LSPs partially overlap. A hot-standby CR-LSP can be established over an explicit path. A hot-standby CR-LSP supports the following attributes: explicit path, hop limit, and path overlapping. | A best-effort path is automatically calculated by the ingress node. A best-effort path supports the following attribute: hop limit. |
Bandwidth | A hot-standby CR-LSP has the same bandwidth as a primary CR-LSP by default. Dynamic bandwidth protection can ensure that a hot-standby CR-LSP does not use additional bandwidth when it is not transmitting traffic. | A best-effort path is only a protection path that does not have reserved bandwidth. |
Table 4-17 CR-LSP backup modes

Backup Mode | Description | Advantage | Shortcoming |
---|---|---|---|
Hot standby | A hot-standby CR-LSP is set up over a separate path immediately after a primary CR-LSP is set up. | A rapid traffic switchover can be performed. | If dynamic bandwidth adjustment is disabled, additional bandwidth needs to be reserved for a hot-standby CR-LSP. |
Best-effort path | The system establishes a best-effort path over an available path if both the primary and backup CR-LSPs fail. | Establishing a best-effort path is easy and only a few constraints are needed. | Some quality of service (QoS) requirements cannot be met. |

Backup CR-LSP setup
Multiple CR-LSP backup methods may be supported for a tunnel. The ingress node uses these methods in turn until a CR-LSP is successfully established.
If new tunnel configuration is committed or a tunnel goes Down, the ingress node attempts to establish a hot-standby CR-LSP and a best-effort path in turn, until a CR-LSP is successfully established.
Backup CR-LSP attribute modification
If attributes of a backup CR-LSP are modified, the ingress node uses the make-before-break mechanism to reestablish the backup CR-LSP with the updated attributes. After that backup CR-LSP has been successfully reestablished, traffic on the original backup CR-LSP (if it is transmitting traffic) switches to this new backup CR-LSP, and then the original backup CR-LSP is torn down.
Fault detection
CR-LSP backup supports the following fault detection functions:
- Default error signaling mechanism of RSVP-TE: The fault detection speed is relatively slow.
- Bidirectional forwarding detection (BFD) for CR-LSP: This function is recommended because it implements fast fault detection.
Traffic switchover
After the primary CR-LSP fails, the ingress node attempts to switch traffic from the primary CR-LSP to a hot-standby CR-LSP. If the hot-standby CR-LSP is unavailable, the ingress node attempts to switch traffic to a best-effort path.
Traffic switchback
Traffic switches back to a path based on priorities of the available CR-LSPs. Traffic will first switch to the primary CR-LSP. If the primary CR-LSP is unavailable, traffic will switch to the hot-standby CR-LSP.
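The switchover and switchback rules above both come down to choosing the highest-priority CR-LSP that is currently available. The following sketch illustrates that selection. The names `PRIORITY` and `select_path` are hypothetical and the availability sets are made-up inputs.

```python
# Highest priority first, as described for switchover and switchback.
PRIORITY = ["primary", "hot-standby", "best-effort"]


def select_path(available):
    """Return the highest-priority available CR-LSP, or None if there is none."""
    for name in PRIORITY:
        if name in available:
            return name
    return None


print(select_path({"hot-standby", "best-effort"}))  # primary failed
print(select_path({"best-effort"}))                 # hot standby also failed
print(select_path({"primary", "hot-standby"}))      # primary recovered
```

The three calls print `hot-standby`, `best-effort`, and `primary`, which matches the switchover to the hot-standby CR-LSP, the fallback to the best-effort path, and the switchback to the primary CR-LSP.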
Dynamic Bandwidth Protection for Hot-standby CR-LSPs
Hot-standby CR-LSPs support dynamic bandwidth protection. The dynamic bandwidth protection function allows a hot-standby CR-LSP to obtain bandwidth resources only after the hot-standby CR-LSP takes over traffic from a faulty primary CR-LSP. This function improves bandwidth efficiency and reduces network costs.
- If the primary CR-LSP fails, traffic immediately switches to the hot-standby CR-LSP with 0 bit/s bandwidth. The ingress node then uses the make-before-break mechanism to establish a new hot-standby CR-LSP with the required bandwidth.
- After the new hot-standby CR-LSP has been successfully established, the ingress node switches traffic to this CR-LSP and tears down the hot-standby CR-LSP with 0 bit/s bandwidth.
- After the primary CR-LSP recovers, traffic switches back to the primary CR-LSP. The hot-standby CR-LSP then releases the bandwidth, and the ingress node establishes another hot-standby CR-LSP with 0 bit/s bandwidth.
TE FRR
Traffic engineering fast reroute (TE FRR) provides link protection and node protection for MPLS TE tunnels. If a link or node fails, TE FRR rapidly switches traffic to a backup path, minimizing traffic loss.
Background
A link or node failure triggers a primary/backup CR-LSP switchover. The switchover is not completed until the IGP routes of the backup path converge, CSPF calculates a new path, and a new CR-LSP is established. Traffic is lost during this process.
TE FRR technology can prevent traffic loss during a primary/backup CR-LSP switchover. TE FRR establishes, in advance, a bypass CR-LSP that bypasses the protected link or node. If that link or node fails, the bypass CR-LSP rapidly takes over traffic to minimize loss. At the same time, the ingress node reestablishes the primary CR-LSP.
Concepts
Table 4-18 explains the components shown in Figure 4-22.
Concept | Description |
---|---|
Primary CR-LSP | Protected CR-LSP. |
Bypass CR-LSP | CR-LSP protecting the primary CR-LSP. A bypass CR-LSP is usually idle and does not forward service traffic. If the bypass CR-LSP is required to forward service data, it must be assigned sufficient bandwidth. |
PLR | Point of local repair, the ingress node of a bypass CR-LSP. The PLR can be the ingress node but not the egress node of the primary CR-LSP. |
MP | Merge point, the egress node of a bypass CR-LSP. It must be on the path of the primary CR-LSP but cannot be the ingress node of the primary CR-LSP. |

Classified by | Type | Description |
---|---|---|
Protected object | Link protection | In Figure 4-23 below, the primary CR-LSP passes through the direct link between the PLR (LSRB) and MP (LSRC). Bypass LSP 1 can protect this link, which is called link protection. |
Protected object | Node protection | In Figure 4-23 below, the primary CR-LSP passes through LSRC between the PLR (LSRB) and MP (LSRD). Bypass LSP 2 can protect LSRC, which is called node protection. |
Bandwidth | Bandwidth protection | It is recommended that you configure the bandwidth of the bypass CR-LSP to be less than or equal to the bandwidth of the primary CR-LSP, according to the actual situation. |
Bandwidth | Non-bandwidth protection | A bypass CR-LSP has no bandwidth and protects only the path of the primary CR-LSP. |
Implementation | Manual protection | A bypass CR-LSP is manually configured and bound to a primary CR-LSP. |
Implementation | Auto protection | An auto FRR-enabled node automatically establishes a bypass CR-LSP. The node binds the bypass CR-LSP to a primary CR-LSP if the node receives an FRR protection request and the FRR topology requirements are met. |
A bypass CR-LSP supports the combination of protection modes. For example, manual protection, node protection, and bandwidth protection can be implemented together on a bypass CR-LSP.
Implementation
TE FRR is implemented as follows:
Setup of a primary CR-LSP
A primary CR-LSP is set up in the same way as a common CR-LSP except that the ingress node adds flags into the SESSION_ATTRIBUTE object in a Path message. For example, the local protection desired flag indicates that the primary CR-LSP requires a bypass CR-LSP, and the bandwidth protection desired flag indicates that the primary CR-LSP requires bandwidth protection.
Binding between a bypass CR-LSP and the primary CR-LSP
TE FRR searches for a suitable bypass CR-LSP for the primary CR-LSP. A bypass CR-LSP can be bound to a primary CR-LSP only if the primary CR-LSP has the local protection desired flag set. The binding process is completed before a CR-LSP switchover.
Before binding a bypass CR-LSP to a primary CR-LSP, the PLR must obtain the following from the Record Route Object (RRO) in the received Resv message: the outbound interface of the bypass CR-LSP, the next hop label forwarding entry (NHLFE), the label switching router (LSR) ID of the MP, the label allocated by the MP, and the protection type.
The PLR on the primary CR-LSP already knows its next hop (NHOP) and next NHOP (NNHOP). If the egress LSR ID of the bypass CR-LSP is the same as the NHOP LSR ID, the bypass CR-LSP provides link protection. If the egress LSR ID of the bypass CR-LSP is the same as the NNHOP LSR ID, the bypass CR-LSP provides node protection. In Figure 4-24, bypass LSP 1 protects the link between LSRB and LSRC, and bypass LSP 2 protects the node between LSRB and LSRD.
If multiple bypass CR-LSPs are established, the PLR compares them by the following criteria in sequence: whether they provide bandwidth protection, how they are implemented, and which object they protect. Bypass CR-LSPs providing bandwidth protection are preferred over those that do not provide bandwidth protection. Manual bypass CR-LSPs are preferred over auto bypass CR-LSPs. Bypass CR-LSPs providing node protection are preferred over those providing link protection.
Figure 4-24 shows two bypass CR-LSPs. If both bypass CR-LSPs provide bandwidth protection and are manually configured, bypass LSP 2 is bound to the primary CR-LSP, because bypass LSP 2 provides node protection and bypass LSP 1 provides link protection. If bypass LSP 1 provides bandwidth protection but bypass LSP 2 does not, bypass LSP 1 is bound to the primary CR-LSP.
After the binding is complete, the primary CR-LSP's NHLFE records the bypass CR-LSP's NHLFE index and an inner label that the MP allocates to the upstream node on the primary CR-LSP. This label is used to forward traffic during a primary/backup CR-LSP switchover.
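The binding rules above (link versus node protection, and the preference order among several bypass CR-LSPs) can be sketched as follows. This is an illustration only: the `Bypass` class and the function names are hypothetical, and node names stand in for LSR IDs.

```python
from dataclasses import dataclass


@dataclass
class Bypass:
    name: str
    egress_lsr_id: str
    bandwidth_protection: bool
    manual: bool


def protection_type(bypass, nhop, nnhop):
    """Classify a bypass CR-LSP by comparing its egress with NHOP and NNHOP."""
    if bypass.egress_lsr_id == nhop:
        return "link"
    if bypass.egress_lsr_id == nnhop:
        return "node"
    return None  # Does not merge at the NHOP or NNHOP, so it cannot be bound.


def select_bypass(candidates, nhop, nnhop):
    """Pick the preferred bypass: bandwidth, then manual, then node protection."""
    usable = [b for b in candidates if protection_type(b, nhop, nnhop)]
    if not usable:
        return None
    return max(
        usable,
        key=lambda b: (
            b.bandwidth_protection,
            b.manual,
            protection_type(b, nhop, nnhop) == "node",
        ),
    )


# PLR is LSRB, so NHOP is LSRC and NNHOP is LSRD, as in Figure 4-24.
both_equal = [
    Bypass("bypass LSP 1", "LSRC", bandwidth_protection=True, manual=True),
    Bypass("bypass LSP 2", "LSRD", bandwidth_protection=True, manual=True),
]
print(select_bypass(both_equal, "LSRC", "LSRD").name)  # bypass LSP 2

only_lsp1_has_bw = [
    Bypass("bypass LSP 1", "LSRC", bandwidth_protection=True, manual=True),
    Bypass("bypass LSP 2", "LSRD", bandwidth_protection=False, manual=True),
]
print(select_bypass(only_lsp1_has_bw, "LSRC", "LSRD").name)  # bypass LSP 1
```

Comparing tuples of booleans applies the criteria in sequence, so a later criterion matters only when all earlier ones are equal.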
Fault detection
- Link protection uses a link layer protocol to detect and report faults. The speed of fault detection at the link layer depends on the link type.
- Node protection uses a link layer protocol to detect link faults. If no fault occurs on a link, RSVP Hello or BFD for RSVP is used to detect faults on the protected node.
As soon as a link or node fault is detected, an FRR switchover is triggered.
In node protection, only the link between the protected node and the PLR is protected. The PLR cannot detect faults on the link between the protected node and the MP.
Link fault detection, BFD, and RSVP Hello detect a failure in descending order of speed: link fault detection is the fastest and RSVP Hello is the slowest.
Switchover
When the primary CR-LSP fails, service traffic and RSVP messages are switched to the bypass CR-LSP, and the switchover event is advertised to the upstream nodes. Upon receiving a data packet, the PLR pushes an inner label and an outer label into the packet. The inner label is allocated by the MP to the upstream node on the primary CR-LSP, and the outer label is allocated by the next hop on the bypass CR-LSP to the PLR. The penultimate hop of the bypass CR-LSP pops the outer label and forwards the packet with only the inner label to the MP. The MP forwards the packet to the next hop along the primary CR-LSP according to the inner label.
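The label operations above can be sketched as a few list manipulations. The sketch is illustrative only: the function names are hypothetical, the label stack is modeled as a Python list with the top label first, and the label values 1024, 1022, and 34 are those of the Figure 4-25 example.

```python
def plr_switchover(label_stack, mp_label, bypass_out_label):
    """PLR: swap the top label to the MP-allocated label, then push the outer label."""
    inner = [mp_label] + label_stack[1:]  # swap the incoming primary label
    return [bypass_out_label] + inner     # push the bypass CR-LSP label


def penultimate_hop_pop(label_stack):
    """Penultimate hop of the bypass CR-LSP: pop the outer label."""
    return label_stack[1:]


stack = [1024]                            # packet arriving at the PLR
stack = plr_switchover(stack, mp_label=1022, bypass_out_label=34)
print(stack)                              # [34, 1022]
stack = penultimate_hop_pop(stack)
print(stack)                              # [1022]
```

The MP receives the packet with only the label it allocated itself, so it can forward the packet along the primary CR-LSP as if no failure had occurred.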
Figure 4-25 shows nodes on the primary and bypass CR-LSPs, labels allocated to the nodes, and behavior that the nodes perform. The bypass CR-LSP provides node protection. If LSRC or the link between LSRB and LSRC fails, the PLR (LSRB) swaps the inner label 1024 to 1022, pushes the outer label 34 into a packet, and forwards the packet to the next hop along the bypass CR-LSP. The lower part of Figure 4-25 shows the packet forwarding process after a TE FRR switchover.
Switchback
After a TE FRR switchover is complete, the ingress node of the primary CR-LSP reestablishes the primary CR-LSP using the make-before-break mechanism. Service traffic and RSVP messages are switched back to the primary CR-LSP after the primary CR-LSP is successfully reestablished. The reestablished primary CR-LSP is called a modified CR-LSP. The make-before-break mechanism allows the original primary CR-LSP to be torn down only after the modified CR-LSP is set up successfully.
FRR does not take effect if multiple nodes fail simultaneously. After data is switched from the primary CR-LSP to the bypass CR-LSP, the bypass CR-LSP must remain Up to ensure data forwarding. If the bypass CR-LSP fails, the protected data cannot be forwarded using MPLS, and the FRR function fails. Even if the bypass CR-LSP is reestablished, it cannot forward data. Data forwarding is restored only after the primary CR-LSP recovers or is reestablished.
Cooperation Between CR-LSP Backup and TE FRR
CR-LSP ordinary backup and TE FRR: TE FRR can rapidly detect a link failure and switch traffic to the bypass CR-LSP. When both primary and bypass CR-LSPs fail, a backup CR-LSP is established to take over traffic.
CR-LSP hot standby and TE FRR: TE FRR can rapidly detect a link failure and switch traffic to the bypass CR-LSP. Link failure information is then sent to the tunnel ingress node through a signaling protocol and traffic is switched to a backup CR-LSP.
SRLG
Shared risk link group (SRLG) is a constraint to calculating a backup or a bypass CR-LSP on a network with CR-LSP hot-standby or TE FRR configured. SRLG prevents bypass and primary CR-LSPs from being set up on links with the same risk level, which enhances TE tunnel reliability.
Background
A network administrator often uses CR-LSP hot-standby or TE FRR technology to ensure MPLS TE tunnel reliability. However, CR-LSP hot-standby or TE FRR may fail in real-world application.
In the top diagram of Figure 4-26, the bypass CR-LSP provides TE FRR protection for the link between P1 and P2, which is part of the primary CR-LSP.
Core nodes P1, P2, and P3 on the backbone network are connected by a transport network device. In Figure 4-26, the top diagram is an abstract version of the actual topology below. NE1 is a transport network device. During network construction and deployment, two or more core nodes may share links on the transport network. For example, the yellow links in Figure 4-26 are shared by P1, P2, and P3. A shared link failure affects primary and bypass CR-LSPs and makes FRR protection invalid. To enable TE FRR to protect the CR-LSP, bypass and primary CR-LSPs must be set up over links of different risk levels. SRLG technology can be deployed to meet this requirement.
An SRLG is a set of links that share the same risks. If one of the links fails, other links in the group may fail as well. Therefore, protection fails if the hot-standby or bypass CR-LSP for the failed link runs over other links in the same group.
Implementation
SRLG is a link attribute, expressed by a numeric value. Links with the same SRLG value belong to a single SRLG.
The SRLG value is advertised to the entire MPLS TE domain using IGP TE. Nodes in a domain can then obtain SRLG values of all the links in the domain. The SRLG value is used in CSPF calculations together with other constraints such as bandwidth.
Strict mode: The SRLG value is a mandatory constraint when CSPF calculates paths for hot-standby and bypass CR-LSPs.
Preferred mode: The SRLG value is an optional constraint when CSPF calculates paths for hot-standby and bypass CR-LSPs. If CSPF fails to calculate a path based on the SRLG value, CSPF excludes the SRLG value when recalculating the path.
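The difference between the two modes can be sketched with a small shortest-path calculation. This is a simplified stand-in for CSPF, not the actual algorithm on the device: the function names, the link costs, the SRLG values, and node P4 are hypothetical, and bandwidth constraints are left out.

```python
import heapq


def shortest_path(links, src, dst, excluded_srlgs=frozenset()):
    """Dijkstra over bidirectional links given as (a, b, cost, srlg)."""
    adj = {}
    for a, b, cost, srlg in links:
        if srlg in excluded_srlgs:
            continue  # prune links that share a risk with the primary CR-LSP
        adj.setdefault(a, []).append((b, cost))
        adj.setdefault(b, []).append((a, cost))
    heap = [(0, src, [src])]
    visited = set()
    while heap:
        dist, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt, cost in adj.get(node, []):
            if nxt not in visited:
                heapq.heappush(heap, (dist + cost, nxt, path + [nxt]))
    return None


def backup_path(links, src, dst, primary_srlgs, mode):
    """Strict mode never relaxes the SRLG constraint; preferred mode may."""
    path = shortest_path(links, src, dst, frozenset(primary_srlgs))
    if path is None and mode == "preferred":
        path = shortest_path(links, src, dst)  # recalculate without SRLG
    return path


# Candidate links for a bypass around the P1-P2 link, which is in SRLG 1.
links = [("P1", "P3", 1, 1), ("P3", "P2", 1, None),
         ("P1", "P4", 5, None), ("P4", "P2", 5, None)]
print(backup_path(links, "P1", "P2", {1}, "strict"))         # ['P1', 'P4', 'P2']
print(backup_path(links[:2], "P1", "P2", {1}, "strict"))     # None
print(backup_path(links[:2], "P1", "P2", {1}, "preferred"))  # ['P1', 'P3', 'P2']
```

With only the P1-P3-P2 route available, strict mode yields no bypass path at all, whereas preferred mode falls back to a path that shares SRLG 1 with the protected link.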
Usage Scenario
SRLG applies to networks with CR-LSP hot-standby or TE FRR configured.
Benefits
SRLG constrains path calculation for hot-standby and bypass CR-LSPs so that they do not share risks with the primary CR-LSP.
BFD for MPLS TE
Bidirectional Forwarding Detection (BFD) can quickly detect faults in an MPLS TE tunnel and trigger a traffic switchover when a fault is detected, improving network reliability.
Background
In most cases, MPLS TE uses CR-LSP backup to enhance network reliability. CR-LSP backup detects faults using the RSVP Srefresh mechanism, but the detection speed is slow. When a Layer 2 device such as a switch or hub exists between two nodes, the traffic switchover is even slower, leading to traffic loss. BFD sends detection packets at short intervals to quickly detect faults on MPLS TE tunnels, so that a service traffic switchover can be triggered quickly to better protect the MPLS TE service.
Implementation
- BFD for Resource Reservation Protocol (RSVP) detects faults on links between RSVP nodes in milliseconds.
- BFD for CR-LSP can rapidly detect faults on CR-LSPs and notify the forwarding plane of the faults to ensure a fast traffic switchover. BFD for CR-LSP is usually used together with a hot-standby CR-LSP.
BFD for RSVP
When Layer 2 devices exist between neighboring RSVP nodes, the two nodes can detect a link failure based only on the RSVP Hello mechanism. Several seconds are required to complete a switchover. This results in the loss of a great deal of data.
BFD for RSVP detects faults in milliseconds on links between RSVP neighboring nodes, as shown in Figure 4-27.
BFD for RSVP can share BFD sessions with BFD for OSPF, BFD for IS-IS, or BFD for Border Gateway Protocol (BGP). When a session is shared, the local node uses the minimum value of each parameter among the protocols sharing the session as the local BFD parameters. The parameters include the transmit interval, the receive interval, and the local detection multiplier.
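The parameter selection for a shared session can be sketched as taking the minimum of each parameter independently. The `BfdParams` class, the function name, and the values are hypothetical and serve only to illustrate the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BfdParams:
    tx_interval_ms: int
    rx_interval_ms: int
    detect_multiplier: int


def shared_session_params(requests):
    """Take the minimum of each parameter across all protocols sharing a session."""
    return BfdParams(
        tx_interval_ms=min(r.tx_interval_ms for r in requests),
        rx_interval_ms=min(r.rx_interval_ms for r in requests),
        detect_multiplier=min(r.detect_multiplier for r in requests),
    )


rsvp = BfdParams(tx_interval_ms=100, rx_interval_ms=100, detect_multiplier=3)
ospf = BfdParams(tx_interval_ms=50, rx_interval_ms=200, detect_multiplier=4)
print(shared_session_params([rsvp, ospf]))
```

In this example the shared session uses a 50 ms transmit interval from the OSPF request and a 100 ms receive interval and multiplier of 3 from the RSVP request.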
BFD for CR-LSP
The device supports static BFD for CR-LSP.
BFD for CR-LSP can rapidly detect faults on CR-LSPs and notify the forwarding plane of the faults to ensure a fast traffic switchover. BFD for CR-LSP usually works with a hot-standby CR-LSP.
A BFD session is bound to a CR-LSP. That is, a BFD session is set up between ingress and egress nodes. A BFD packet is sent by the ingress node and forwarded to the egress node along a CR-LSP. The egress node then responds to the BFD packet. The BFD session at the ingress node can rapidly detect the status of the path through which the LSP passes.
Upon detecting a link failure, BFD notifies the forwarding plane of the failure. The forwarding plane searches for a backup CR-LSP, switches traffic to the backup CR-LSP, and reports the failure information to the control plane. In this case, you can configure static BFD for CR-LSP to detect backup CR-LSPs using BFD.