Diagnosing Network Problems with Looking Glass


For many of our clients, connectivity is a key factor when choosing network services. Here, network connectivity means how extensively an operator’s network is interconnected with other operators’ networks and, consequently, how many routes and intermediate nodes lie between it and a destination.

Network connectivity can be easily checked using Looking Glass services, which let you test routers from a remote network. Many organizations offer this kind of service (information on all of the Looking Glass services in the world can be found here).

We offer our own Looking Glass service at http://lg.selectel.com. Looking Glass diagnostics let you clearly identify errors and figure out why they occurred. We’ll talk more about this later on.

Our Looking Glass interface is extremely simple:

[Screenshot of the Looking Glass interface]

All commands can be executed from any of four routers, two in St. Petersburg and two in Moscow. To choose a router, simply check the corresponding box; scans can be run from all four routers at the same time.

Our Looking Glass service can run the following scans (chosen from the Operation dropdown menu):

  • ping — determines if the requested node has a network connection and measures the response time;
  • traceroute — tracks the packet route from the router to the given resource over the IP network;
  • bgp route detail — returns detailed information on the BGP routes to the given node in a routing table;
  • bgp route terse — returns concise information on BGP routes to the given node as well as the current active route;
  • bgp summary — returns information on all of the chosen routers’ BGP sessions.

After choosing a command, enter the destination host (an IP address or a DNS name) in the Host field, then click Execute. Let’s take a closer look at what each of these commands does and how we can interpret the results.

Ping

With the ping command, we can check a node’s availability over ICMP. The principle is simple: after receiving the ping command, the router sends echo requests to the target node, which sends replies back to the address the requests came from. Ping results tell us (1) if replies to the outgoing packets were received and (2) how much time (in milliseconds) the round trip took.

The results may look something like this:

PING 217.69.139.201 (217.69.139.201): 56 data bytes
64 bytes from 217.69.139.201: icmp_seq=0 ttl=62 time=8.776 ms
64 bytes from 217.69.139.201: icmp_seq=1 ttl=62 time=8.676 ms
64 bytes from 217.69.139.201: icmp_seq=2 ttl=62 time=10.024 ms

--- 217.69.139.201 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 8.676/9.159/10.024/0.613 ms

From this example, we can see that the destination node responded to our requests and no packets were lost. Let’s look at another example:

PING www.jnto.go.jp (210.165.34.236): 56 data bytes
64 bytes from 210.165.34.236: icmp_seq=0 ttl=241 time=325.734 ms
64 bytes from 210.165.34.236: icmp_seq=1 ttl=241 time=325.894 ms
64 bytes from 210.165.34.236: icmp_seq=2 ttl=241 time=334.896 ms

--- www.jnto.go.jp ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 325.734/328.841/334.896/4.282 ms

We can see from these results that the destination node is available, that no packets were lost, and that the response time is high due to geographic distance: the destination node is in Japan, and the requests are being sent from St. Petersburg, Russia.

The following is also possible:

PING 89.108.112.69 (89.108.112.69): 56 data bytes

--- 89.108.112.69 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss

The results here tell us that none of the packets reached their destination. The node is most likely unavailable or doesn’t respond to ICMP requests.

That’s how we use ping requests to check connections with remote nodes. A successful return lets us assume that everything between the router and destination is working normally. We should point out though that packet loss may occur even when everything is in working order. For example, packet loss can occur when a network is overloaded. It’s not unusual to see routers assign diagnostic packets a low priority; however, even if only one packet returns, it’s evidence enough that there is a connection and the node is available.
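The reading of the summary lines described above can be sketched in Python. This is only an illustration, assuming the output format shown in the examples (the "packets transmitted / packets received" wording differs between ping implementations); the function name parse_ping_summary is hypothetical and is not part of the Looking Glass service.

```python
import re

def parse_ping_summary(output: str) -> dict:
    """Extract packet counts and round-trip times from ping output
    in the format shown in the examples above (an assumption)."""
    counts = re.search(
        r"(\d+) packets transmitted, (\d+) packets received, ([\d.]+)% packet loss",
        output,
    )
    if counts is None:
        raise ValueError("no ping statistics line found")
    received = int(counts.group(2))
    result = {
        "sent": int(counts.group(1)),
        "received": received,
        "loss_percent": float(counts.group(3)),
        # A single reply is already evidence that the node is reachable.
        "reachable": received > 0,
    }
    # The round-trip line is absent when no replies came back.
    rtt = re.search(r"min/avg/max/stddev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms", output)
    if rtt:
        names = ("min", "avg", "max", "stddev")
        result["rtt_ms"] = dict(zip(names, map(float, rtt.groups())))
    return result

OK_SAMPLE = """\
--- 217.69.139.201 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 8.676/9.159/10.024/0.613 ms
"""

LOST_SAMPLE = """\
--- 89.108.112.69 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
"""

if __name__ == "__main__":
    print(parse_ping_summary(OK_SAMPLE))
    print(parse_ping_summary(LOST_SAMPLE))
```

Note that "reachable" being false only means that no replies arrived; as explained above, the node may simply ignore ICMP requests.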

It’s impossible to definitively conclude that something is out of order based on the results of a ping request. We can only move on to other possibilities, which require other tests with other tools.

Traceroute

Availability issues can also be caused by technical issues on the intermediate nodes that packets travel through to reach the destination host. These problems can be detected using the traceroute command.

How does this command work? Every packet sent over a network carries a time to live (TTL) value, which limits the number of hops (i.e. ‘jumps’ from one router to the next) that the packet can survive (or “live”) on the network. The field was originally specified as a time in seconds, but in practice it works as a hop counter. Every router that processes a packet lowers the TTL value by one. When the TTL reaches zero, the packet is discarded and an ICMP Time Exceeded message is returned to the sender. This mechanism exists to prevent routing loops, in which packets would otherwise circulate indefinitely.

Having received the traceroute command, the router sends a packet with TTL=1 to the destination host. The node the timeout response comes from is defined as the first hop (i.e. the first step towards the target). Then, packets with TTL=2, 3, 4, 5, etc. are sent out consecutively until one of the packets reaches its target node and a response is received.
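The procedure just described can be illustrated with a small simulation. The sketch below does not send any real packets; it models a path as a list of node names (all hypothetical) and shows how probes with increasing TTL values reveal one hop at a time.

```python
from typing import List, Tuple

def send_probe(path: List[str], ttl: int) -> Tuple[str, str]:
    """Simulate one probe along `path`, whose last element is the
    destination. Returns (reply_type, responder)."""
    # Every router before the destination lowers the TTL by one.
    for router in path[:-1]:
        ttl -= 1
        if ttl == 0:
            # The packet expired here, so this router reports back.
            return ("time_exceeded", router)
    return ("reply", path[-1])

def traceroute(path: List[str], max_hops: int = 30) -> List[Tuple[int, str]]:
    """Send probes with TTL=1, 2, 3, ... until the destination answers."""
    hops = []
    for ttl in range(1, max_hops + 1):
        kind, responder = send_probe(path, ttl)
        hops.append((ttl, responder))
        if kind == "reply":
            break
    return hops

if __name__ == "__main__":
    # Hypothetical node names, used only for illustration.
    demo_path = ["router-a", "router-b", "router-c", "destination"]
    for hop, node in traceroute(demo_path):
        print(hop, node)
```

A real traceroute also sends several probes per TTL value and measures the response time of each one, which is why every line of the output contains up to three times.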

A list of all the intermediate nodes that the packet passes, from the router to the destination, is printed on the screen. It may look like this:

traceroute to 89.108.112.69 (89.108.112.69), 30 hops max, 40 byte packets
 1 sap-b2-link.telia.net (213.248.86.117) 6.278 ms 62.572 ms 9.668 ms
 2 hls-b2-link.telia.net (213.155.131.128) 9.812 ms hls-b2-link.telia.net (80.91.245.172) 12.487 ms hls-b2-link.telia.net (80.91.245.170) 12.088 ms
 3 hls-b1-link.telia.net (213.155.133.132) 15.486 ms 12.360 ms s-bb2-link.telia.net (213.155.133.74) 90.086 ms
 4 s-b2-link.telia.net (213.155.131.33) 22.196 ms s-b2-link.telia.net (213.155.133.143) 22.272 ms s-b2-link.telia.net (80.91.246.235) 21.197 ms
 5 s-b2-link.telia.net (213.155.133.137) 21.807 ms s-b2-link.telia.net (80.91.247.217) 22.231 ms vimpelcom-ic-152423-s-b2.c.telia.net (213.155.129.118) 21.936 ms
 6 * * *
 7 * 81.211.13.162 (81.211.13.162) 58.058 ms 152.715 ms
 8 te6-1.rt1.dc5.agava.net (89.108.112.242) 40.044 ms 32.834 ms 39.235 ms
 9 te6-1.rt1.dc5.agava.net (89.108.112.242) 40.780 ms 43.640 ms *
10 * * *
11 * * *

What can we tell from these results?

At the 9th hop, an asterisk is given in place of a time. This means that a response wasn’t received for one of the packets. This could be caused by an overloaded network, or like a lot of routers, this one may just be dumping low priority ICMP packets. Asterisks in traceroute results are pretty common and are no cause for alarm.

If three asterisks are given for one of the intermediate routers, it means that no response was received from it at all. We shouldn’t jump to the conclusion that there’s something wrong just because of these asterisks though. There are a number of different reasons why we might get these. For example, routers are often configured to “quietly” discard expired packets without sending a response. In this case, packets with a higher TTL still successfully hop to the next router.

Another reason may be that packets from this router take too long to return and Looking Glass just stops waiting for them. If three asterisks are given for nodes at the end of the route (like in our example above), then more likely than not, this is evidence that our packets didn’t reach their destination.
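The distinction between silent hops in the middle of a route and silent hops at its end can be checked mechanically. The following is a rough sketch that assumes the traceroute output format shown above; the function names are hypothetical, and the sample is the tail of the example route.

```python
import re
from typing import Dict, List

def summarize_hops(output: str) -> List[Dict[str, int]]:
    """Count answered and lost probes for every hop line."""
    hops = []
    for line in output.splitlines():
        match = re.match(r"\s*(\d+)\s+(.*)", line)
        if not match:
            continue  # skips the "traceroute to ..." header and blank lines
        rest = match.group(2)
        hops.append({
            "hop": int(match.group(1)),
            # Each "<number> ms" is one answered probe.
            "answered": len(re.findall(r"[\d.]+ ms", rest)),
            # Each standalone asterisk is one probe without a response.
            "lost": rest.split().count("*"),
        })
    return hops

def trailing_silent_hops(hops: List[Dict[str, int]]) -> int:
    """Number of hops at the end of the route that never answered."""
    count = 0
    for hop in reversed(hops):
        if hop["answered"] > 0:
            break
        count += 1
    return count

SAMPLE = """\
 6 * * *
 7 * 81.211.13.162 (81.211.13.162) 58.058 ms 152.715 ms
 8 te6-1.rt1.dc5.agava.net (89.108.112.242) 40.044 ms 32.834 ms 39.235 ms
 9 te6-1.rt1.dc5.agava.net (89.108.112.242) 40.780 ms 43.640 ms *
10 * * *
11 * * *
"""

if __name__ == "__main__":
    hops = summarize_hops(SAMPLE)
    for hop in hops:
        print(hop)
    print("silent hops at the end:", trailing_silent_hops(hops))
```

Here hop 6 is silent but later hops answer, so it is most likely just discarding expired packets quietly, while hops 10 and 11 at the end of the route suggest that the probes never reached the destination.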

Interpreting traceroute outputs is more complicated and delicate than it may seem at first. You can find more information about this here.

BGP Diagnostics

BGP (Border Gateway Protocol) is the main protocol for dynamic routing on the Internet. It’s intended for exchanging routing information between autonomous systems rather than between the routers inside a single network. According to RFC 1930, an autonomous system is “a connected group of one or more IP prefixes run by one or more network operators which has a single and clearly defined routing policy”. Naturally, the Internet can be viewed as a set of interconnected autonomous systems.

Over BGP, autonomous systems tell one another (1) that they exist and (2) which networks they can access. They also gather information on how to access other Internet networks. By retrieving information on different routes to their destination, they determine which is the best (based on network rules and not technical metrics) and add it to their routing tables. This is why BGP is sometimes called the glue that holds the Internet together.
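To make the idea of choosing a route by policy rather than by technical metrics more concrete, here is a heavily simplified sketch. It assumes only two of the criteria that BGP implementations commonly apply (a locally configured preference first, then the length of the AS path) and ignores all the others; the Route class, the function name, and the AS numbers and prefix (taken from ranges reserved for documentation) are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Route:
    prefix: str
    as_path: Tuple[int, ...]   # autonomous systems the route passes through
    local_pref: int = 100      # preference set by the operator's own policy

def best_route(candidates: List[Route]) -> Route:
    """Pick a route: higher local preference wins,
    and a shorter AS path breaks ties (simplified)."""
    return max(candidates, key=lambda r: (r.local_pref, -len(r.as_path)))

if __name__ == "__main__":
    routes = [
        # Longer path, but the operator's policy prefers this neighbor.
        Route("203.0.113.0/24", (64500, 64510, 64511), local_pref=200),
        # Shorter path with the default preference.
        Route("203.0.113.0/24", (64501, 64511)),
    ]
    print(best_route(routes).as_path)
```

In this example the longer route is selected because the policy value outranks the path length, which is exactly why the chosen route is not necessarily the fastest or the shortest one.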

BGP was created at a time when the Internet didn’t have many of the problems or dangers that it has today, and so it’s fairly vulnerable. Protocol errors can be caused by technical issues on a specific autonomous system or intentional actions and can have very serious consequences. Because of these errors, traffic gets rerouted and/or dumped and does not reach the destination network; this creates network availability problems.

BGP can be checked using the commands bgp route detail, bgp route terse, and bgp summary: bgp route detail returns the routing table entries for the given node, including the list of autonomous systems along each route; bgp route terse prints an abridged version of the same information; and bgp summary prints a list of the BGP sessions that the chosen routers have with neighboring autonomous systems.

Conclusion

In this article, we took a very brief look at the capabilities of our Looking Glass service. We can’t cover everything involved in network diagnostics in one article though. If you have had any problems with Looking Glass, please tell us in the comments below. We’d also be happy to hear any constructive ideas or suggestions for the service.