Juniper QFX10002 ICCP issue after migrating from 1.48.1 to 1.53.1 (same with 1.55)

louis · 2 October 2019 14:10

Hello,

We have a couple of Juniper QFX10002 switches in production using the ICCP feature. ICCP uses the BFD protocol.

We had a LibreNMS 1.48.1 server and decided this summer to migrate monitoring of these Juniper switches to a new 1.53.1 server. After the migration, BFD went down several times. We rolled the monitoring back to the 1.48.1 server, along with other actions, and have had no further issues since the rollback.

Last week we tried again to migrate to the same new LibreNMS server (since upgraded from 1.53.1 to 1.55), and we hit the BFD issue again.

We saw a correlation between polling times and the BFD down events:

root@QFX1:RE:0% cat /var/log/messages | grep BFDD_TRAP_MHOP_STATE
Oct 1 00:13:40.895 2019 QFX1 bfdd[4920]: %DAEMON-4-BFDD_TRAP_MHOP_STATE_DOWN: local discriminator: 16, new state: down, peer addr: 10.X.X.2
Oct 1 06:32:10.041 2019 QFX1 bfdd[4920]: %DAEMON-4-BFDD_TRAP_MHOP_STATE_DOWN: local discriminator: 16, new state: down, peer addr: 10.X.X.2

root@QFX2:RE:0% cat /var/log/messages | grep BFDD_TRAP_MHOP_STATE
Oct 1 00:13:40.953 2019 QFX2 bfdd[4914]: %DAEMON-4-BFDD_TRAP_MHOP_STATE_DOWN: local discriminator: 16, new state: down, peer addr: 10.X.X.1
Oct 1 06:32:10.054 2019 QFX2 bfdd[4914]: %DAEMON-4-BFDD_TRAP_MHOP_STATE_DOWN: local discriminator: 16, new state: down, peer addr: 10.X.X.1

QFX1 and QFX2 have device IDs 68 and 69, respectively, on the new LibreNMS server.

[root@librenms-prod-dc2 ~]$ journalctl --since "2019-10-01 0:12:00" --until "2019-10-01 0:14:00" | grep -Ew "68|69"
Oct 01 00:13:19 librenms-prod-dc2 librenms-service.py[2820]: Poller_0-15(INFO):Polling device 68
Oct 01 00:13:29 librenms-prod-dc2 librenms-service.py[2820]: Poller_0-15(INFO):Completed poller run for 68 in 9.90s
Oct 01 00:13:34 librenms-prod-dc2 librenms-service.py[2820]: Poller_0-14(INFO):Polling device 69
Oct 01 00:13:44 librenms-prod-dc2 librenms-service.py[2820]: Poller_0-14(INFO):Completed poller run for 69 in 9.74s

[root@librenms-prod-dc2 ~]$ journalctl --since "2019-10-01 6:32:00" --until "2019-10-01 6:33:00" | grep -Ew "68|69"
Oct 01 06:32:03 librenms-prod-dc2 librenms-service.py[2820]: Poller_0-10(INFO):Polling device 69
Oct 01 06:32:04 librenms-prod-dc2 librenms-service.py[2820]: Poller_0-12(INFO):Polling device 68
Oct 01 06:32:12 librenms-prod-dc2 librenms-service.py[2820]: Poller_0-12(INFO):Completed poller run for 68 in 8.62s
Oct 01 06:32:13 librenms-prod-dc2 librenms-service.py[2820]: Poller_0-10(INFO):Completed poller run for 69 in 9.95s
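
To make the correlation easier to check, here is a minimal Python sketch that tests whether each BFD down timestamp falls inside a poller run. The timestamps are copied from the logs above; the names (BFD_DOWN, POLLS, polls_in_progress, correlate) are made up for this illustration, and it assumes the switch clocks and the LibreNMS server clock are synchronized and that the "Polling device" and "Completed poller run" lines mark the start and end of a run.

```python
from datetime import datetime

# BFD down timestamps from /var/log/messages on each switch (2019-10-01).
BFD_DOWN = {
    "QFX1": ["00:13:40.895", "06:32:10.041"],
    "QFX2": ["00:13:40.953", "06:32:10.054"],
}

# (device id, poll start, poll completion) from the journalctl output.
# journalctl only shows whole seconds, so the windows are approximate.
POLLS = [
    (68, "00:13:19", "00:13:29"),
    (69, "00:13:34", "00:13:44"),
    (68, "06:32:04", "06:32:12"),
    (69, "06:32:03", "06:32:13"),
]


def parse(ts):
    """Parse HH:MM:SS with optional fractional seconds."""
    fmt = "%H:%M:%S.%f" if "." in ts else "%H:%M:%S"
    return datetime.strptime(ts, fmt)


def polls_in_progress(event_ts):
    """Return the sorted device ids whose poll window contains event_ts."""
    t = parse(event_ts)
    return sorted(
        {dev for dev, start, end in POLLS if parse(start) <= t <= parse(end)}
    )


def correlate():
    """Map every (switch, BFD down time) to the polls running at that time."""
    return {
        (host, ts): polls_in_progress(ts)
        for host, events in BFD_DOWN.items()
        for ts in events
    }


if __name__ == "__main__":
    for (host, ts), devices in correlate().items():
        print(f"{host} BFD down at {ts}: poll in progress for devices {devices}")
```

With these numbers, every BFD down event falls inside at least one poller run: the 00:13:40 events land only in the run for device 69 (QFX2), and the 06:32:10 events land in the runs for both devices.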

The differences between the old and new servers are:

Polling method: the old server runs poller-wrapper.py from crontab; the new server runs services-wrapper.py via systemctl
Version: 1.48.1 (old) / 1.53.1, then 1.55 (new)
Polling frequency: 1 min (old) / 2 min (new)
Polling modules: more SNMP polling modules on the old server

The polling method should not make a difference, because both rely on poller.php. Am I right?

So I think polling of Juniper devices changed somewhere between 1.48.1 and 1.53.1. My guess is that the new version requests a new OID, and that this triggers the BFD issue.

Do you have any clue which change could have caused the issue? I saw too many changes on GitHub to narrow it down myself.