Polling just got ~300% slower

RRDCached on, a local DNS recursor, pDNS, etc. were all set up, and I got the polling time down from 300s with default settings to ~40s on the slowest device, with all other devices polling in less than that.
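For reference, the RRDCached part of that tuning is only a couple of lines in config.php (a minimal sketch; the socket path is just an example and has to match whatever your rrdcached instance actually listens on):

# Batch RRD writes through a local rrdcached socket instead of hitting disk on every poll
$config['rrdcached'] = "unix:/run/rrdcached/rrdcached.sock";
# Tell LibreNMS which rrdtool version is installed so it uses the matching features
$config['rrdtool_version'] = '1.5.5';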

But now polling times have increased.

For example, even localhost polling has increased.

====================================

Component Version
LibreNMS 1.42.01-63-ge448190
DB Schema 260
PHP 7.0.30-0ubuntu0.16.04.1
MySQL 10.0.34-MariaDB-0ubuntu0.16.04.1
RRDTool 1.5.5
SNMP NET-SNMP 5.7.3
====================================

[OK] Composer Version: 1.7.2
[OK] Dependencies up-to-date.
[OK] Database connection successful
[OK] Database schema correct
[FAIL] Some devices have not completed their polling run in 5 minutes, this will create gaps in data.
[FIX] Check your poll log and see: Performance - LibreNMS Docs
Devices:

[FAIL] We have found some files that are owned by a different user than librenms, this will stop you updating automatically and / or rrd files being updated causing graphs to fail.
[FIX] chown -R librenms:librenms /opt/librenms
Files:
/opt/librenms/html/plugins/Weathermap/configs/location-example.php
/opt/librenms/html/plugins/Weathermap/configs/aistest.php
/opt/librenms/html/plugins/Weathermap/configs/AIS DC.php
/opt/librenms/html/plugins/Weathermap/configs/group-example.php
/opt/librenms/html/plugins/Weathermap/configs/testmap.conf
/opt/librenms/html/plugins/Weathermap/configs/hostname-example.php
/opt/librenms/html/plugins/Weathermap/configs/home-network-example.php
The Weathermap plugin has www-data as the owner of those files.

Any tips on where to start troubleshooting?

The max repeaters and max OIDs settings are unchanged.

Try ./poller.php -d -h host -m ports or -m sensors, as those are the modules with the biggest polling times. Then you can check what is taking so much time: SNMP, MySQL or RRD.

I bet it is a disk issue. What's your disk I/O on the LibreNMS host?

./poller.php r01 ports 2018-08-22 10:22:50 - 1 devices polled in 63.06 secs
SNMP [20/**2.67s**]: Get[3/0.05s] Getnext[0/0.00s] Walk[17/2.62s]
MySQL [1186/0.86s]: Cell[26/0.01s] Row[-26/-0.01s] Rows[18/0.04s] Column[2/0.00s] Update[1129/0.80s] Insert[37/0.02s] Delete[0/0.00s]
RRD [1138/0.23s]: Update[569/0.06s] Create [0/0.00s] Other[569/0.16s]


./poller.php ro01 sensors 2018-08-22 10:25:43 - 1 devices polled in 93.08 secs
SNMP [10/**30.53s**]: Get[10/30.53s] Getnext[0/0.00s] Walk[0/0.00s]
MySQL [94/0.09s]: Cell[25/0.01s] Row[-25/-0.01s] Rows[24/0.04s] Column[2/0.00s] Update[65/0.04s] Insert[3/0.00s] Delete[0/0.00s]
RRD [1018/0.23s]: Update[509/0.06s] Create [0/0.00s] Other[509/0.18s]

Polling ports took 63.06s, but I can't make the numbers add up: SNMP (2.67s, mostly the 2.62s walk), MySQL (0.86s) and RRD (0.23s) only account for about 4s, so where are the other ~59 seconds going?

Polling sensors took 93.08s; SNMP accounts for 30.53s (all of it in Get) and MySQL and RRD for well under a second, so roughly 60 seconds are unaccounted for there as well.

I checked with "iostat 1" and iowait never went above 0.6%.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1,52    0,00    1,14    0,13    0,00   97,22

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
fd0               0,00         0,00         0,00          0          0
sda              54,00         0,00       796,00          0        796


dstat
    ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
    usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
      8   7  85   0   0   0|  12k  632k| 210k   99k|   0     0 |2636  6748
     10   9  80   0   0   0|  20k  372k| 366k  108k|   0     0 |3366  7949
      9   6  85   0   0   0|   0   656k| 115k  421k|   0     0 |2364  5900
      5   3  92   0   0   0|4096B  552k|  77k   46k|   0     0 |1894  5036
      8   2  90   0   0   0|  32k  372k|  52k   49k|   0     0 |1593  4240
     17   4  79   0   0   0|8192B  576k| 135k 2454k|   0     0 |3135  6272
     39   6  55   0   0   0|  28k  888k|  82k  326k|   0     0 |5125    11k
      8   2  90   0   0   0|   0   736k|  32k  216k|   0     0 |1836  4359
      2   1  97   0   0   0|   0   508k|  24k   43k|   0     0 | 798  2681
      2   1  98   0   0   0|   0   384k|  21k   40k|   0     0 | 720  2598
      2   1  97   0   0   0|   0   508k|  19k   43k|   0     0 | 772  2601
      1   1  98   0   0   0|   0   836k|  19k   43k|   0     0 | 809  2589
      2   1  97   0   0   0|   0   576k|  22k   43k|   0     0 | 836  2749
      2   1  97   0   0   0|   0   452k|  25k   46k|   0     0 | 964  3016
      2   1  98   0   0   0|   0   544k|  24k   43k|   0     0 | 725  2582
      2   1  97   0   0   0|  24k  396k|  27k   48k|   0     0 | 752  2613
      1   1  98   1   0   0|  16k 2284k|  26k   46k|   0     0 |1157  2825
      2   1  98   0   0   0|  12k  524k|  26k   45k|   0     0 | 766  2758
      2   1  97   1   0   0|   0   336k|  27k   44k|   0     0 | 863  2794
      9   1  90   0   0   0|   0   556k|  34k  125k|   0     0 |1303  3657
     14   2  84   0   0   0|4096B  412k| 139k 2451k|   0     0 |2565  5689
     28   6  66   0   0   0|   0  2448k|  82k  139k|   0     0 |4982    10k
      8   2  89   1   0   0|  96k  592k|  34k  246k|   0     0 |1058  3315

When you ran the poller, did you notice where it hung and spent that time?

Running the poller there is no obvious hold-up or wait time; the script runs along printing output/results continuously.

There is a lot of output from writing to InfluxDB, though, but that has been running flawlessly for months.

I'm going to try turning off InfluxDB and see.
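Concretely, something like this in config.php should do for the test (just flipping the exporter off; re-enable it afterwards):

# Temporarily disable the InfluxDB exporter to rule it out as the cause of the slow polling
$config['influxdb']['enable'] = false;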

We do need InfluxDB though, for our Grafana dashboards.

Another observation: I have no apparent gaps in the graphs (running 1-minute polling), and with some polling runs taking 140s I should have gotten blanks in the RRD graphs.

Hello,

We are currently seeing the same issue with our LibreNMS deployment. We see a notification when logging into the portal which says: 'It appears as though you have some devices that haven't completed polling within the last 15 minutes, you may want to check that out :)'. Upon further investigation we have identified that 4 of our LNS routers are having polling timeouts. From the poller graphs, it seems to be a timeout issue around the OS/Ports modules. I have run snmpwalk against one of the affected devices and it is pulling data back fine.

Any thoughts?

OK, polling times are back down to normal values. It was InfluxDB via the HTTP transport.

This is the influx conf:
#$config['influxdb']['enable'] = true;
#$config['influxdb']['transport'] = 'http'; # Default, other options: https, udp
#$config['influxdb']['host'] = 'thehost';
#$config['influxdb']['port'] = '8086';
#$config['influxdb']['db'] = 'librenms';
#$config['influxdb']['username'] = 'theuser';
#$config['influxdb']['password'] = 'thepassword';
#$config['influxdb']['timeout'] = 0; # Optional
#$config['influxdb']['verifySSL'] = false; # Optional

Really odd that the HTTP transport is the cause.
I'm using the ./daily.sh script; can you somehow roll back LibreNMS?

Try different values for max repeaters and max OIDs.
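If you want to experiment with that, these are global starting points in config.php (the values here are only examples; the right numbers depend on the device, and both can also be overridden per device in the web UI):

# How many repetitions snmpbulkwalk asks for per request
$config['snmp']['max_repeaters'] = 30;
# How many OIDs LibreNMS requests per snmpget
$config['snmp']['max_oid'] = 16;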

Odd, polling is back to normal with the HTTP InfluxDB transport …

I would check the LibreNMS interface utilisation and any packet loss to the InfluxDB host, etc. Do you have a busy firewall in front of it?

Solved, but no, they are on the same subnet.

How did you solve it?

If you have a lot of data being sent to InfluxDB I'd 100% switch to the UDP transport.
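For reference, the config.php side of that would be roughly the following (the port has to match the [[udp]] listener you enable in influxdb.conf, and with UDP the target database is set on the InfluxDB side; 8089 below is just the commonly used example port):

$config['influxdb']['enable'] = true;
$config['influxdb']['transport'] = 'udp';
$config['influxdb']['host'] = 'thehost';
$config['influxdb']['port'] = '8089';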

Thanks, I'll try to set that up.