One important issue neglected by existing Traffic Engineering (TE) solutions is the network disturbance and service disruption caused by flow rerouting operations. To address this problem, we developed FlexDATE, a TE solution based on Reinforcement Learning (RL) that reduces the network disturbance of TE while achieving near-optimal load balancing performance. Our idea is to use RL to intelligently select and reroute a flexible number of critical flows that contribute the most to load balancing improvement. The majority of network traffic remains routed by static Equal-Cost Multi-Path (ECMP) routing without any routing updates, which mitigates network disturbance.
Project Overview:
(1) Motivation: Network disturbance caused by TE is neglected by existing work
(2) FlexDATE: Use RL to intelligently identify critical flows in dynamic networks, and then use Linear Programming (LP) to optimize routing for critical flows to achieve load balancing with low network disturbance
(3) RL training pipeline of FlexDATE
(4) Evaluation results: Generalizes well to dynamic traffic scenarios and unseen link failures with near-optimal load balancing performance and mitigated network disturbance
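The split between static ECMP routing and rerouting of critical flows can be illustrated with a toy example. The sketch below is not FlexDATE itself: the four-node topology, capacities, and demands are hypothetical, the critical flow is picked by hand rather than by an RL agent, and a simple grid search over split ratios stands in for the LP optimization. It assumes load balancing performance is measured as maximum link utilization.

```python
from collections import defaultdict

# Hypothetical toy topology: directed links mapped to capacities.
CAPACITY = {("A", "B"): 10, ("A", "C"): 10, ("B", "D"): 10, ("C", "D"): 10}

# Candidate equal-cost paths per (src, dst) pair.
PATHS = {
    ("A", "D"): [["A", "B", "D"], ["A", "C", "D"]],
    ("A", "B"): [["A", "B"]],
}

# Hypothetical traffic demands per flow.
DEMANDS = {("A", "D"): 10, ("A", "B"): 4}


def ecmp_splits(demands):
    """ECMP: split each flow evenly across its equal-cost paths."""
    return {f: [1 / len(PATHS[f])] * len(PATHS[f]) for f in demands}


def max_link_utilization(demands, splits):
    """Accumulate per-link load and return the highest load/capacity ratio."""
    loads = defaultdict(float)
    for flow, demand in demands.items():
        for path, weight in zip(PATHS[flow], splits[flow]):
            for link in zip(path, path[1:]):
                loads[link] += demand * weight
    return max(loads[link] / cap for link, cap in CAPACITY.items())


def reroute_critical(demands, critical, steps=10):
    """Grid-search the split of one critical flow (assumed to have exactly
    two candidate paths); every other flow keeps its ECMP split.
    This is a stand-in for an LP solver, not the LP formulation itself."""
    best_splits, best_mlu = None, float("inf")
    for i in range(steps + 1):
        splits = ecmp_splits(demands)
        splits[critical] = [i / steps, (steps - i) / steps]
        mlu = max_link_utilization(demands, splits)
        if mlu < best_mlu:
            best_splits, best_mlu = splits, mlu
    return best_splits, best_mlu


if __name__ == "__main__":
    baseline = max_link_utilization(DEMANDS, ecmp_splits(DEMANDS))
    splits, improved = reroute_critical(DEMANDS, ("A", "D"))
    print(f"ECMP only: {baseline:.2f}")
    print(f"After rerouting the critical flow: {improved:.2f}")
    print(f"Split for the critical flow: {splits[('A', 'D')]}")
```

In this toy case only one flow changes its routing, while the other stays on its ECMP path, which is the intuition behind rerouting a small set of critical flows to limit disturbance.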
Contributions:
- We proposed a new QoS metric named network disturbance to evaluate the negative impact of TE’s flow rerouting operations on WANs, such as service disruption.
- We designed a disturbance-aware TE solution that combines Graph Neural Networks (GNNs) and RL to intelligently reroute a flexible number of critical flows under dynamic traffic fluctuations and unexpected single link failures.
- Our proposed TE solution achieved close-to-optimal performance (i.e., at least 90% of optimal performance) in 99% of network scenarios and reduced the maximum network disturbance by up to 38.6% across five real-world network topologies.
Abstract:
Traffic Engineering (TE) is an important network operation that routes and reroutes flows based on network topology and traffic demands to optimize network performance. Recently, emerging applications have made network conditions more dynamic, which challenges TE: frequent routing updates are required to maintain good network performance with Software-Defined Networking (SDN). However, flow rerouting operations can lead to considerable Quality of Service (QoS) degradation and service disruption, which is often neglected by existing TE solutions.
In this paper, we apply a new QoS metric named network disturbance to measure the negative impact of flow rerouting operations performed by TE. To achieve near-optimal load balancing performance while also mitigating network disturbance in dynamic network scenarios, we propose a flexible and disturbance-aware TE solution called FlexDATE that combines Reinforcement Learning (RL) and Linear Programming (LP). Specifically, FlexDATE leverages RL to intelligently identify a flexible number of critical flows for each traffic matrix and reroutes these critical flows based on LP optimization to improve network performance with low disturbance. Empowered by a customized actor-critic architecture coupled with Graph Neural Networks (GNNs), FlexDATE generalizes well to unseen traffic scenarios and remains resilient to single link failures.
Extensive simulations are conducted on five real-world network topologies to evaluate FlexDATE with real and synthetic traffic traces. The results show that FlexDATE can achieve the performance target (i.e., 90% of optimal performance) in 99% of network scenarios and effectively mitigate the average and maximum network disturbance by up to 9.1% and 38.6%, respectively, compared to state-of-the-art TE solutions.
Publications:
- [ToN ’22] Minghao Ye, Junjie Zhang, Zehua Guo, and H. Jonathan Chao, “FlexDATE: Flexible and Disturbance-Aware Traffic Engineering with Reinforcement Learning in Software-Defined Networks,” IEEE/ACM Transactions on Networking (ToN), 2022. (Impact factor: 3.7) [Paper URL] [PDF]
- [IWQoS ’21] Minghao Ye, Junjie Zhang, Zehua Guo, and H. Jonathan Chao, “DATE: Disturbance-Aware Traffic Engineering with Reinforcement Learning in Software-Defined Networks,” The 29th IEEE/ACM International Symposium on Quality of Service (IWQoS), 2021. (Acceptance rate: 25%, 64/256) [Paper URL] [Video] [PDF]
- [JSAC ’20] Junjie Zhang, Minghao Ye, Zehua Guo, Chen-Yu Yen, and H. Jonathan Chao, “CFR-RL: Traffic Engineering with Reinforcement Learning in SDN,” IEEE Journal on Selected Areas in Communications (JSAC), 2020. (Impact factor: 16.4) [Paper URL] [arXiv] [Codes] [PDF]