Deadlocks when discovery-wrapper.py tries to do parallel fdb-table discovery

When I run discovery-wrapper.py with the default 16 threads on each of two poller nodes, I get errors in librenms.log as the parallel discovery processes hit deadlocks in the database while updating the ports_fdb table, e.g.:

[2019-10-24 16:48:52] production.ERROR: SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction (SQL: UPDATE `ports_fdb` set `updated_at`=NOW() WHERE `device_id` = 435 AND `vlan_id` = 0 AND `mac_address` = aabbccddeeff)
#0 /opt/librenms/includes/discovery/fdb-table.inc.php(55): dbUpdate(Array, 'ports_fdb', '`device_id` = ?...', Array)
#1 /opt/librenms/includes/discovery/functions.inc.php(179): include('/opt/librenms/i...')
#2 /opt/librenms/discovery.php(120): discover_device(Array, true)
#3 {main}

Those updates are dropped, so I miss out on some of the FDB table entries. Having read the code, I couldn't see how to avoid the deadlocks; I guess they are just a result of coincidental timing between the parallel processes writing to the same table. But it is apparently legitimate to simply retry on deadlock failures, so can I suggest using the Laravel DB functions that handle this?

I would replace the following around line 51 in includes/discovery/fdb-table.inc.php:
dbUpdate(
    array('updated_at' => array('NOW()')), // we need to do this unless we use the Eloquent "update" method
    'ports_fdb',
    '`device_id` = ? AND `vlan_id` = ? AND `mac_address` = ?',
    array($device['device_id'], $vlan_id, $mac_address_entry)
);

With:
DB::transaction(function () use ($device, $vlan_id, $mac_address_entry) {
    DB::update(
        'UPDATE `ports_fdb` SET `updated_at` = NOW() WHERE `device_id` = ? AND `vlan_id` = ? AND `mac_address` = ?',
        array($device['device_id'], $vlan_id, $mac_address_entry)
    );
}, 3);

This retries the operation up to 3 times on deadlock. That's enough to fix the issue for me, but if you wanted to refactor the other queries you could do that too; a rough sketch of a reusable helper is below. It might be a general principle to think about wherever we run lots of processes updating the same table in parallel…
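
For illustration, here's a minimal sketch of what such a refactor might look like. retryOnDeadlock is a hypothetical helper name of my own, not existing LibreNMS code; the only API relied on is Laravel's DB::transaction($callback, $attempts), which re-runs the closure on deadlock/serialization failures:

use Illuminate\Support\Facades\DB;

// Hypothetical helper: run any write inside a transaction that Laravel
// will re-run up to $attempts times when the database reports a
// concurrency error such as a deadlock (SQLSTATE 40001 / MySQL 1213).
function retryOnDeadlock(callable $write, $attempts = 3)
{
    return DB::transaction($write, $attempts);
}

// Usage for the ports_fdb update above:
retryOnDeadlock(function () use ($device, $vlan_id, $mac_address_entry) {
    DB::update(
        'UPDATE `ports_fdb` SET `updated_at` = NOW() WHERE `device_id` = ? AND `vlan_id` = ? AND `mac_address` = ?',
        array($device['device_id'], $vlan_id, $mac_address_entry)
    );
});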