D-CART is a platform architecture for flexible, customizable, and fine-grained monitoring of potentially large networks. Its primary goal is to expose the behavior of intra-domain routing protocols (IGPs) and to track the side effects of configuration changes on packet forwarding and network performance.
The D-CART architecture
D-CART uses both passive and active measurements. It ensures flexibility by relying on programmable hardware, and it supports fine-grained analyses (e.g., root-cause analyses) by relying on a centralized database and a monitoring controller.
More precisely, D-CART is based on three architectural components.
- Probers. These are several reprogrammable devices connected to specific routers in the network to be monitored. Their main role is to generate active measurements and record the results, in order to track data-plane performance.
- Routing listeners. These are one or a few (for robustness) routing processes running on a server and connected to routers in the network. Listeners participate in the intra-domain routing protocol, which enables passive measurement of routing messages.
- Monitoring controller. It is a software process that coordinates the actions of the other components. It configures both the probers and the listener processes, collects prober and listener logs, and enables their correlation for real-time and a-posteriori analyses.
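The correlation step performed by the controller can be illustrated with a minimal Python sketch. It is not the D-CART implementation: the record types `RoutingEvent` and `LossInterval`, the function `correlate`, the 2-second window, and the sample log entries are all hypothetical, and the sketch assumes that prober and listener timestamps come from a common clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingEvent:
    """One entry of a (hypothetical) listener log."""
    timestamp: float  # seconds, on a clock shared with the probers (assumption)
    description: str


@dataclass(frozen=True)
class LossInterval:
    """One period during which a prober pair recorded missing packets."""
    src: str
    dst: str
    start: float
    end: float


def correlate(events, losses, window=2.0):
    """For each loss interval, list routing events seen in [start - window, end]."""
    return {
        loss: [e for e in events if loss.start - window <= e.timestamp <= loss.end]
        for loss in losses
    }


# Invented sample logs, for illustration only.
events = [
    RoutingEvent(10.0, "link-state update: link down"),
    RoutingEvent(50.0, "link-state update: metric change"),
]
losses = [LossInterval("prober-1", "prober-2", 10.2, 11.2)]

for loss, causes in correlate(events, losses).items():
    print(loss.src, "->", loss.dst, [c.description for c in causes])
```

In this sketch, only the first routing event falls inside the window around the loss interval, so it is the only candidate cause reported for it.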
D-CART is deployed in RENATER
The D-CART project originates from a collaboration between RENATER, the French service provider for research and education, and the ICube research team at Université de Strasbourg.
The first objective of the project was to identify, in real time, the packet losses due to reconvergence of the routing protocol used in the RENATER network. Like the vast majority of service provider networks, RENATER uses a link-state IGP for internal connectivity. In a link-state IGP, each router computes how to forward packets on the basis of a network map shared with all the other routers. Upon events such as configuration changes and network failures, a new map has to be shared across all routers, and information about the new map is progressively disseminated from one router to another. During this synchronization phase, routers can use different map versions to forward some packets; in turn, this can result in transient loops, where packets are bounced back and forth between the same set of devices.
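The following Python sketch illustrates how inconsistent map versions can create a transient loop. The four-router topology (A, B, C, D), its link costs, and the helper names `build`, `next_hop`, and `trace` are invented for the example and do not describe RENATER. Each router runs a shortest-path computation on its own copy of the map; when router A already knows that link A-D has failed while B still uses the old map, packets from A to D bounce between A and B.

```python
import heapq


def build(links):
    """Build an undirected weighted graph from (node, node, cost) triples."""
    graph = {}
    for a, b, cost in links:
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph


def next_hop(graph, src, dst):
    """First hop of a shortest path from src to dst (Dijkstra), or None."""
    dist = {src: 0}
    first = {src: None}
    heap = [(0, src)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == dst:
            return first[u]
        for v, cost in graph.get(u, {}).items():
            if v not in dist or d + cost < dist[v]:
                dist[v] = d + cost
                first[v] = v if u == src else first[u]
                heapq.heappush(heap, (d + cost, v))
    return None


def trace(views, src, dst):
    """Forward hop by hop; each router decides with its own map version."""
    path, node = [src], src
    while node != dst:
        node = next_hop(views[node], node, dst)
        if node is None:
            return path, "blackhole"
        if node in path:
            return path + [node], "loop"
        path.append(node)
    return path, "delivered"


# Hypothetical topology: the link A-D fails, so it is absent from the new map.
old_map = build([("A", "D", 1), ("A", "B", 1), ("B", "C", 1), ("C", "D", 5)])
new_map = build([("A", "B", 1), ("B", "C", 1), ("C", "D", 5)])

# During convergence, only A has learned about the failure.
mixed = {"A": new_map, "B": old_map, "C": old_map, "D": old_map}
print(trace(mixed, "A", "D"))  # (['A', 'B', 'A'], 'loop')

# Once every router uses the new map, the loop disappears.
converged = {router: new_map for router in "ABCD"}
print(trace(converged, "A", "D"))  # (['A', 'B', 'C', 'D'], 'delivered')
```

The loop exists only while the two map versions coexist, which is why such loops are transient and hard to observe without fine-grained measurements.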
Transient loops triggered by link-state IGP reconvergence have been studied in the scientific literature. The Network Research Group of ICube has been particularly active in this field lately. Researchers from this group and their colleagues from other institutions (Cisco and UCLouvain) have developed techniques that avoid transient loops by forcing the IGP to break down the synchronization process into a limited number of safe sub-steps.
RENATER and ICube were therefore interested in monitoring possible transient loops resulting from network operation and failures, using active and intensive probing at a fine grain (< 100 ms for each path). RENATER did not have such monitoring functionalities: the monitoring infrastructure previously installed in the network aggregated statistical data at a frequency too low, and therefore too coarse-grained, for this purpose.
We therefore deployed a D-CART platform in RENATER. Since the deployment was a pilot project, we decided to build the probers from commodity hardware, specifically Raspberry Pis (see Fig. 1). An overview of the D-CART deployment in RENATER is shown in Fig. 2.
The platform has also been customized to answer questions such as:
- can transient loops occur in the RENATER network?
- can we detect those loops in almost real time?
- can we quantify the packet losses due to those loops?
- can we identify patterns and extract statistics (frequency, duration, impact, …) per transient loop?
- what are the specific events causing transient loops?
Our D-CART deployment did spot transient loops during the operation of the RENATER network. Figure 3 illustrates an example of detected loops. The triggering event was the failure of the link between Bordeaux and Nantes (see the cross in the figure), which affected traffic from Toulouse (tagged as SRC in the figure) to Quimper (tagged as DST in the figure). The figure shows in red the paths followed by traffic before the failure, and in green the final paths installed after IGP convergence.
Intuitively, the Bordeaux-Nantes link failure can result in inconsistencies between routers and cause transient loops between several router pairs (e.g., between Toulouse and Montpellier, or between Bordeaux and Clermont). Among the four possible loops, our D-CART platform found evidence of the one between Montpellier and Marseille. This also illustrates the ability of our platform to pinpoint side effects of specific events (the link failure in this case) on remote routers (in Montpellier and Marseille, in this example).
Furthermore, D-CART reports more details on the detected loops. First, it locates the loop by instructing probers to send packets with different TTL values, so that they expire at different routers. Moreover, our measurements make it possible to quantify the outage duration of transient loops. Figure 4 reports the measurements performed by our deployment in RENATER around the time of the Bordeaux-Nantes link failure. It illustrates how the flow between Toulouse and Quimper was interrupted for approximately one second: starting from time x=1.2, no packet was collected at the receiver during this period. Moreover, TTL Time Exceeded messages were collected by the prober connected to Toulouse for almost 900 ms. This is experimental evidence of the occurrence of the transient loop.
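The two analyses described above (locating a loop from TTL-limited probes, and measuring the outage at the receiver) can be sketched in Python as follows. This is an illustration under stated assumptions, not the D-CART code: the function names `locate_loop` and `longest_outage`, the router labels R1 to R3, the 1.5x gap threshold, and all sample values are invented. The sketch assumes that probes are sent with consecutive TTL values and at a fixed interval.

```python
def locate_loop(expired_at):
    """Find the routers forming a loop.

    expired_at maps each probe TTL to the router that answered with a
    Time Exceeded message. A router that answers for two different TTLs
    is traversed twice, so the hops between both visits form the loop.
    Returns [] when no router repeats.
    """
    seen = {}
    hops = []
    for ttl in sorted(expired_at):
        router = expired_at[ttl]
        if router in seen:
            return hops[seen[router]:]
        seen[router] = len(hops)
        hops.append(router)
    return []


def longest_outage(rx_times_ms, probe_interval_ms):
    """Return (gap, last_rx_before, first_rx_after) for the longest gap.

    Only gaps clearly larger than the probing interval (1.5x, an arbitrary
    threshold chosen for this sketch) count as outages.
    Returns (0, None, None) when no such gap exists.
    """
    best = (0, None, None)
    ordered = sorted(rx_times_ms)
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur - prev
        if gap > 1.5 * probe_interval_ms and gap > best[0]:
            best = (gap, prev, cur)
    return best


# Invented inputs: R2 and R3 answer alternately, which reveals a loop.
replies = {1: "R1", 2: "R2", 3: "R3", 4: "R2", 5: "R3"}
print(locate_loop(replies))  # ['R2', 'R3']

# Invented reception times (ms) for probes sent every 50 ms.
rx = [0, 50, 100, 150, 1150, 1200]
print(longest_outage(rx, 50))  # (1000, 150, 1150)
```

Working in integer milliseconds keeps the gap arithmetic exact; with floating-point seconds, the threshold comparison would need a small tolerance.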
Finally, D-CART can quantify the effects of events on network performance. As an illustration, Figure 4 quantifies the increase in the number of hops traversed by packets after the Bordeaux-Nantes failure, and the consequent increase in one-way delay.
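As a rough illustration of this kind of before/after comparison, the Python sketch below summarizes hop count and one-way delay around an event. It relies on assumptions that the text does not confirm: hop count is derived as the difference between the TTL set by the sender and the TTL observed by the receiver, one-way delay is the difference between reception and emission timestamps (which requires synchronized prober clocks), and the function name `summarize` and all sample values are invented.

```python
from statistics import median


def summarize(samples, event_time_ms):
    """Median hop count and one-way delay before and after an event.

    Each sample is (tx_ms, rx_ms, ttl_sent, ttl_received).
    Returns a (before, after) pair of dicts, or None for an empty side.
    """
    def stats(rows):
        if not rows:
            return None
        return {
            "hops": median(s[2] - s[3] for s in rows),
            "delay": median(s[1] - s[0] for s in rows),
        }

    before = [s for s in samples if s[0] < event_time_ms]
    after = [s for s in samples if s[0] >= event_time_ms]
    return stats(before), stats(after)


# Invented samples: the path grows from 4 to 6 hops after the event at t=1000 ms.
samples = [
    (0, 10, 64, 60), (100, 110, 64, 60), (200, 210, 64, 60),
    (2000, 2014, 64, 58), (2100, 2114, 64, 58), (2200, 2214, 64, 58),
]
print(summarize(samples, 1000))
# ({'hops': 4, 'delay': 10}, {'hops': 6, 'delay': 14})
```

Medians are used here so that a few outliers around the event do not dominate the summary; other statistics could be substituted.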
In order to improve our coverage of the RENATER network, we have set up a second set of probers (see Fig. 5) that will allow us to implement more systematic measurements to correlate routing events to network performance.
ICube: Pascal Mérindol (email contact), J.-J. Pansiot, P. David.
RENATER: GIP RENALab
Others: F. Clad & Pierre François (Cisco), Stefano Vissicchio (UCLouvain)
Computing Minimal Update Sequences for Graceful Router-wide Reconfigurations. F. Clad, S. Vissicchio, P. Mérindol, P. François and J.-J. Pansiot. IEEE/ACM Transactions on Networking, 2014.
A Fine-Grained Multi-Source Measurement Platform Correlating Routing Transitions with Packet Losses. P. Mérindol, P. David, J.-J. Pansiot, F. Clad and S. Vissicchio. Elsevier Computer Communications, 2018.