Automatically fixing packet loss on my connection

A lesson in overengineering to avoid having to walk upstairs

by Robert May on Afternoon Robot

Sometimes my internet connection starts to experience packet loss, which is a common issue for many people. Unlike most ISPs, Andrews & Arnold actually provide a way to monitor packet loss statistics on your line. Physically restarting the modem is the most reliable way I've found to resolve packet loss issues.

Up until now I had to walk upstairs like some sort of muppet and manually toggle the modem (well, it's a DrayTek Vigor 130 which is sort-of a modem but more a PPPoA to PPPoE converter but whatever).

No longer! I have solved the issue with copious complexity.

Step 1: Write a Prometheus exporter for A&A line stats

I don't really enjoy writing Go, which a lot of exporters seem to be written in, so instead I've started writing my own small custom exporters in Ruby. This has overall made Prometheus significantly less frustrating to customise for my use-case.

https://gitlab.com/robotmay/prometheus-aanet-exporter/-/blob/master/prometheus-aanet-exporter

To use it you need to have Ruby set up and the gems required at the top of the script installed (or use the Gemfile); then put the script somewhere executable and run it under systemd or whatever. I'll update the instructions in the README at some point.
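Once the exporter is running, Prometheus needs to be told to scrape it. A minimal sketch of that, where the job name, interval and host:port are all placeholders I've made up for illustration (check the script for the port it actually binds to):

```yaml
# prometheus.yml (fragment) - sketch only, values are assumptions
scrape_configs:
  - job_name: aanet                # hypothetical job name
    scrape_interval: 1m            # assumption; pick whatever suits the feed
    static_configs:
      - targets:
          - localhost:9999         # placeholder; use the exporter's real host:port
```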

It works by fetching your personal stats data from an XML feed that A&A provide. If you're a customer, you can find the feed in your control panel, under the line stats graph.

Step 2: Graph it in Grafana

I then set up a graph showing the line stats, with an axis specifically for packet loss:

Shiny line graph

This then has an alert configured which both alerts me directly when it starts firing, and also hits a Home Assistant webhook URL:

Configure the alert based on the packet loss axis

Step 3: Set up the network

Houses in the UK are typically not well set up for people who like to mess around with networking. Our walls are thick and made of brick, and the copper/fibre entry point is typically somewhere really stupid. In the case of our house, this is the front upstairs bedroom, and my networking equipment is in the nice and cold kitchen area, with its stone (fire-resistant) floor, downstairs at the back of the house. When we moved in I ran some metres of Cat-6a cable through walls and stairwells to join the two ends.

I have a couple of Z-Wave plugs sat around from some previous experiments, so at first I tried just sticking the Vigor 130 on one of those in the bedroom, but unfortunately the signal doesn't penetrate that far and would require a bunch of repeaters. But then I ordered an Uninterruptible Power Supply and figured that I might as well power the Vigor 130 from that along with the rest of my equipment so that my connection stays up in power outages. The setup is now:

  1. 4-way plug adapter into the UPS
  2. Z-Wave plug into the 4-way
  3. 48v PoE injector into the Z-Wave plug. Non-PoE output into my EdgeRouter, PoE output runs to the wall socket and then up through the house
  4. 48v PoE to RJ45/12v 2.1mm barrel-jack splitter
  5. Cat 6a cable from the splitter into the Vigor 130 network port
  6. 2.1mm barrel-jack with 2.5mm adapter into the Vigor's power input

High-tech fucking-bright-LEDs avoidance system on the front of the DrayTek

Step 4: Configuring Home Assistant

This is handled via an automation:

Each webhook trigger has to have a unique webhook ID, which means a new notification channel in Grafana for each trigger. That's a bit of a pain but not the end of the world.

The next step is to come up with a process by which the modem can be restarted. I've shown the automation as YAML because it's significantly more concise than screenshotting the GUI for it:

alias: ADSL Packet Loss
description: Reset the Vigor 130 when packet loss is detected from A&A
trigger:
  - platform: webhook
    webhook_id: packet-loss
condition: []
action:
  - service: notify.telegram
    data:
      message: >-
        Packet loss detected, restarting ADSL modem in 1 minute. The internet
        will disconnect whilst the modem restarts.
  - delay: '60'
  - repeat:
      until:
        - condition: state
          entity_id: binary_sensor.internet
          state: 'on'
      sequence:
        - type: turn_off
          device_id: 012345
          entity_id: switch.vigor_130_switch_2
          domain: switch
        - delay: '10'
        - type: turn_on
          device_id: 012345
          entity_id: switch.vigor_130_switch_2
          domain: switch
        - wait_template: '{{ is_state("binary_sensor.internet", "on") }}'
          continue_on_timeout: true
          timeout: '360'
        - choose:
            - conditions:
                - condition: state
                  entity_id: binary_sensor.internet
                  state: 'on'
              sequence:
                - service: notify.telegram
                  data:
                    message: >-
                      I have finished toggling the modem off and on and the
                      internet is back up.
          default: []
  - delay: '1800'
mode: single

Steps

Notify

Firstly, we notify users that the internet will be disconnected briefly, then pause the automation for 1 minute. I'm using Telegram for this because I find it flexible and reliable:

  - service: notify.telegram
    data:
      message: >-
        Packet loss detected, restarting ADSL modem in 1 minute. The internet
        will disconnect whilst the modem restarts.
  - delay: '60'

I'd like to improve upon this by allowing the users to block the restart from happening by clicking a button in Telegram. Next iteration!
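One way that next iteration might look, as an untested sketch. It assumes the Telegram bot integration is set up so that telegram_bot.send_message can send inline keyboards and button presses arrive as telegram_callback events; the /cancel_restart command is a name I've invented, and the event field names are worth checking against the integration docs before relying on this:

```yaml
# Sketch only: would replace the notify + delay pair above.
- service: telegram_bot.send_message
  data:
    message: >-
      Packet loss detected, restarting ADSL modem in 1 minute.
    inline_keyboard:
      - 'Cancel restart:/cancel_restart'   # hypothetical button and command
# Wait up to a minute for somebody to press the button
- wait_for_trigger:
    - platform: event
      event_type: telegram_callback
      event_data:
        data: /cancel_restart
  timeout: 60
  continue_on_timeout: true
# wait.trigger is none if the wait timed out, i.e. nobody cancelled.
# If the button was pressed, this condition fails and the automation stops here.
- condition: template
  value_template: '{{ wait.trigger is none }}'
```

Note that stopping at the condition also skips the pause at the end of the automation, so a cancelled restart could be retriggered by the next alert.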

Repeat the next set of instructions

Next we start a repeating loop using repeat, which will keep trying the restart steps until the internet connection reappears.

Reset loop

The switch is turned off, we wait for 10 seconds, then turn it back on:

        - type: turn_off
          device_id: 012345
          entity_id: switch.vigor_130_switch_2
          domain: switch
        - delay: '10'
        - type: turn_on
          device_id: 012345
          entity_id: switch.vigor_130_switch_2
          domain: switch

Then we wait for up to 6 minutes (360 seconds) for the internet to come back up. binary_sensor.internet combines several ping binary sensors pointed at different websites; if any of them has responded within the last minute, the sensor reports as on.
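For anyone wanting to build a similar combined sensor, here is a rough sketch rather than my exact config. The hosts, names and timings are placeholders, it assumes a Home Assistant version that still accepts YAML for the ping platform (newer releases configure ping through the UI), and it uses delay_off to approximate the "responded within the last minute" behaviour:

```yaml
# Sketch only: hosts, names and timings are placeholders.
binary_sensor:
  - platform: ping
    name: Ping Cloudflare
    host: 1.1.1.1
    count: 2
    scan_interval: 30
  - platform: ping
    name: Ping Google
    host: 8.8.8.8
    count: 2
    scan_interval: 30

template:
  - binary_sensor:
      - name: Internet              # becomes binary_sensor.internet
        device_class: connectivity
        # On if any one of the ping sensors is currently on
        state: >-
          {{ is_state('binary_sensor.ping_cloudflare', 'on')
             or is_state('binary_sensor.ping_google', 'on') }}
        # Only report off once every ping has been failing for a full minute
        delay_off: '00:01:00'
```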

        - wait_template: '{{ is_state("binary_sensor.internet", "on") }}'
          continue_on_timeout: true
          timeout: '360'

This will time out after 6 minutes and continue to the next steps, but the sensor changing to on before that will exit it early. This means we allow up to 6 minutes for the connection to reappear before seeing whether we should notify the user that it's back:

        - choose:
            - conditions:
                - condition: state
                  entity_id: binary_sensor.internet
                  state: 'on'
              sequence:
                - service: notify.telegram
                  data:
                    message: >-
                      I have finished toggling the modem off and on and the
                      internet is back up.
          default: []

Pause for a bit

In case Grafana chooses to wing multiple alerts to the webhook, we should pause at the end of this automation. This will work with the single mode described in the next step to effectively rate limit how often this can happen; currently 30 minutes:

  - delay: '1800'

Super duper important

You almost certainly want this running in single mode:

mode: single

This will ensure only one instance of this automation can run at once; any webhook calls that arrive while it is already running are ignored.

Potential issues

I wish Home Assistant had an option for until loops with a maximum retry count. I can't seem to find it in the documentation, but it might be possible with some other combination of steps.
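One combination that might do it: Home Assistant exposes a repeat.index variable inside repeat loops (counting from 1), so the until condition can be a template that also gives up after a fixed number of attempts. An untested sketch, with the limit of 3 picked arbitrarily and the power-cycle steps left out for brevity:

```yaml
# Sketch only: stop when the internet is back OR after 3 attempts.
- repeat:
    sequence:
      # The turn_off / delay / turn_on steps from the automation would go first
      - wait_template: '{{ is_state("binary_sensor.internet", "on") }}'
        continue_on_timeout: true
        timeout: '360'
    until:
      - condition: template
        value_template: >-
          {{ is_state('binary_sensor.internet', 'on') or repeat.index >= 3 }}
```

The until condition is checked after each pass through the sequence, so this makes at most three restart attempts before moving on.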

I've tweaked my Grafana alert a few times to try and ensure it doesn't fire again as soon as the internet returns after the automation fires. This was a problem.

I almost certainly want to alter how this works when people are in the house vs an empty house. With an empty house it shouldn't really matter if we restart quickly, but with people in the house we might want to be able to override it and prevent any restarts for a certain timeframe. Some traffic inherently has high packet loss, such as (for me) downloads from Microsoft servers in the Xbox app on Windows.

Fin

It works! I could potentially drop one aspect of complexity and gather the packet loss stats directly in Home Assistant, but I already had it set up in Grafana/Prometheus so it was actually less effort this time around.

This will save me many minutes of effort over my lifetime, and is therefore a resounding success.