LibreNMS could benefit greatly from modernizing its data collection and graphing capabilities. This will make LibreNMS even more valuable than it already is today.
I’ve set up a LibreNMS → VictoriaMetrics → Grafana pipeline with custom dashboards and it’s working very well. But I want to see similar capabilities in LibreNMS natively.
(FYI, there was a previous discussion on this topic by @willhseitz from 2021.)
Below is a proposed architecture I generated about this topic. Please feel free to provide corrections or alternate ideas. Thanks! - Tristan
Strategic Architecture for Native Centralized Time-Series Storage in LibreNMS
The Architectural Imperative for Metric Storage Modernization
The foundation of LibreNMS, a widely deployed open-source network monitoring platform, is intricately tied to Round Robin Database (RRD) files for time-series metric storage. Inherited from its predecessor, Observium, this architecture relies heavily on RRDtool to define rigid data structures, store polled metrics, and dynamically render server-side graph images. While RRD files ensure predictable disk space utilization by automatically consolidating and discarding older data points through predefined Round Robin Archives (RRAs), the fundamental design introduces severe scalability limitations in modern, high-density network environments. As deployments scale to tens of thousands of ports and devices, the constant read-modify-write cycle of thousands of individual RRD files generates massive input/output operations per second (IOPS). Although the implementation of RRDCached mitigates immediate disk I/O bottlenecks by buffering writes in memory before flushing them to disk in batches, it merely defers the fundamental limitations of decentralized file-based storage.
Furthermore, the distributed polling architecture in LibreNMS, which utilizes a Dispatcher Service and Redis for horizontal scaling, requires all poller nodes to share a common storage backend. In the context of RRD, this necessitates complex and often brittle shared filesystems like NFS or GlusterFS to ensure all pollers and the central web interface can read and write to the same .rrd files across the /opt/librenms/rrd/ directory hierarchy. This shared filesystem requirement introduces significant latency and single points of failure, which degrade the efficiency of the polling cycle. Additionally, the graphical presentation layer is tightly coupled to the storage engine itself. LibreNMS utilizes modular PHP scripts to construct complex shell commands that execute the rrdtool graph binary, which parses the RRD files and outputs a static Portable Network Graphics (PNG) image to a temporary directory before streaming it to the client browser. This server-side rendering paradigm prevents the implementation of modern, interactive client-side graphing features such as dynamic zooming, real-time tooltip value inspection, and fluid panning without repeatedly querying the backend to generate a newly rendered static image.
While LibreNMS has incrementally introduced integrations with external Time-Series Databases (TSDBs) such as InfluxDB, Prometheus, Graphite, and OpenTSDB, these implementations operate strictly as write-only export mechanisms. The system replicates the data destined for RRD and pushes it to these external endpoints over HTTP protocols. However, the core LibreNMS web interface possesses no native capability to read from these TSDBs to render its built-in device, port, and health graphs. Consequently, administrators seeking modern visualization must deploy supplementary platforms like Grafana, manually rebuilding dashboards that replicate the automatic discovery topologies provided natively by LibreNMS. A comprehensive architectural redesign is required to sever the dependency on RRDtool entirely. This necessitates decoupling the storage ingestion layer from the visualization layer, introducing a centralized open-source TSDB as the primary backend, and refactoring the PHP presentation logic to consume JSON-based API payloads rendered via a modern JavaScript charting library.
Deconstructing the Legacy LibreNMS Metric Pipeline
To successfully abstract and replace the metric storage engine, it is necessary to map the exact code paths and structural dependencies that bind LibreNMS to RRDtool. The legacy system architecture operates across three highly interdependent phases: data definition, data ingestion, and graphical rendering. Understanding these components is critical to ensuring that a replacement architecture achieves absolute feature parity.
The Polling and Ingestion Subsystem
During the device discovery and polling phases, LibreNMS executes modular PHP scripts and YAML definitions tailored to specific vendor operating systems. The primary metadata, configuration settings, and sensor state information are stored in a relational database (typically MariaDB or MySQL). However, the time-series measurements themselves are directed to the RRD subsystem. When an SNMP query returns a valid numeric metric, the system invokes the LibreNMS\RRD\RrdDefinition class to establish the strict parameters of the metric. This class defines the exact structural requirements of the RRD file, specifying the Data Source (DS) type—such as GAUGE, COUNTER, DERIVE, or ABSOLUTE—as well as the minimum and maximum data bounds. The definition dictates how the metric will be treated over time, ensuring that 32-bit and 64-bit counter wraps and gauge fluctuations are handled correctly according to the strict mathematical rules of the RRD specification.
Once the definition is established, the polling mechanism utilizes the DataStorageInterface, specifically invoking the put method on the Datastore object, which routes the data into the storage backend. If a target .rrd file does not exist within the /opt/librenms/rrd/<hostname>/ directory structure, the system executes an rrdtool create command to initialize it, establishing the predetermined Round Robin Archives (RRAs) that govern how data will be averaged, minimized, and maximized over explicit time intervals. Subsequently, the system executes an rrdtool update command. If RRDCached is configured, this update is routed through a UNIX socket (e.g., unix:/run/rrdcached.sock) or over a TCP connection (${IPADDRESS}:42217), deferring the physical disk write to a background daemon.
Because RRD files strictly enforce time steps based on the configured polling interval, configuring parameters like the STEP and HEARTBEAT values is critical. By default, LibreNMS polls devices every 300 seconds (5 minutes), with a heartbeat of 600 seconds. If a polling cycle experiences latency and fails to deliver data within the heartbeat window, RRD registers an unknown (NaN) value, resulting in discontinuous graphs and missing data points. Migrating to a 1-minute polling interval requires executing scripts like lnms maintenance:rrd-step to physically restructure the binary RRD files, an intensely disk-heavy operation that highlights the inflexibility of the format.
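The heartbeat behaviour described above can be illustrated with a minimal sketch. It is a simplification of what RRDtool does internally (no interpolation or consolidation), and the function name `classify_updates` is hypothetical.

```python
from typing import List


def classify_updates(update_times: List[int], heartbeat: int = 600) -> List[str]:
    """Label every update after the first as 'ok' or 'unknown'.

    Simplified model: if more than `heartbeat` seconds pass between two
    consecutive updates, the interval is treated as unknown (a graph gap).
    """
    labels = []
    for prev, curr in zip(update_times, update_times[1:]):
        gap = curr - prev
        labels.append("ok" if gap <= heartbeat else "unknown")
    return labels


# Two on-time 300 s polls, then one poll that arrives 700 s after the last.
print(classify_updates([0, 300, 600, 1300]))  # ['ok', 'ok', 'unknown']
```

With the default 300-second step and 600-second heartbeat, a single skipped poll is tolerated, but anything slower than that shows up as a gap.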
The Disk I/O Bottleneck and Distributed Polling Complexity
The decentralized nature of RRD storage creates profound infrastructure challenges as LibreNMS environments scale. Each network interface, sensor, CPU core, and memory pool generates its own discrete .rrd file. For a moderate deployment of 2,700 devices and 65,000 ports, the system must constantly update hundreds of thousands of individual files every five minutes. Even with the buffering capabilities of RRDCached, the underlying hypervisor must manage massive quantities of random write operations, often requiring administrators to migrate the entire /opt/librenms/rrd directory to dedicated NVMe storage arrays or even in-memory RAM disks to prevent the polling queue from stalling.
This storage paradigm becomes exponentially more complex when implementing the LibreNMS Distributed Polling feature. Distributed polling is designed to spread the SNMP data collection workload across multiple discrete servers for horizontal scaling, coordinated via Redis and a Dispatcher Service. However, because all pollers must write metric data, and the central web server must read that data to generate UI graphs, the /opt/librenms/rrd directory must be accessible to all nodes. This relies on Network File System (NFS) mounts or clustered file systems. Network instability or lock contention on the NFS share immediately degrades poller performance, creating a fragile architectural dependency where a storage network issue can halt all network telemetry collection simultaneously. A centralized TSDB is uniquely equipped to resolve this by transforming all poller nodes into ephemeral, stateless agents that transmit payloads over stateless HTTP APIs rather than performing lock-based file operations.
The Server-Side Graphical Rendering Engine
The most complex barrier to replacing RRD is the deeply embedded graph generation logic located within the includes/html/graphs/ directory. Unlike modern decoupled web architectures that serve raw JSON time-series data to a frontend rendering library, LibreNMS dynamically builds extensive command-line strings that are fed directly into the rrdtool binary executable.
When an administrator requests a graph via the web interface, the request is parsed through html/graph.php, which subsequently loads includes/html/graphs/graph.inc.php. Depending on the specific context of the request (e.g., viewing an interface’s packet discard rate), the system loads specific mapping files, such as includes/html/graphs/generic_simplex.inc.php. These mapping files define the visual parameters of the image: color hex codes for the lines and shaded areas, the unit text labels for the Y-axis, and the specific Data Sources (DS) to be extracted from the binary file.
The system then invokes the rrdtool_graph function, located in includes/rrdtool.inc.php. This function utilizes rrdtool_build_command to compile the variables into a massive string containing RRD graphing instructions. Crucially, this string contains imperative logic for mathematical transformations. RRDtool utilizes Reverse Polish Notation (RPN) via CDEF (Compute Data Definition) commands to manipulate data on the fly before drawing it. For example, converting raw SNMP octet byte counters into bits per second is executed dynamically via a command sequence like CDEF:out_bits=out_bytes,8,*. The command may also include VDEF (Variable Data Definition) instructions to calculate the 95th percentile, the total aggregated volume over the time period, and standard deviations.
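To make the RPN step concrete, here is a minimal sketch of how an expression such as out_bytes,8,* is evaluated with a stack. It covers only the four arithmetic operators, not the full RRDtool RPN vocabulary, and `eval_rpn` is a hypothetical helper rather than anything in LibreNMS.

```python
import operator
from typing import Dict, List

# Only basic arithmetic is modelled; RRDtool's RPN supports many more operators.
_OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_rpn(expr: str, variables: Dict[str, float]) -> float:
    """Evaluate a comma-separated RPN expression in the style of a CDEF."""
    stack: List[float] = []
    for token in expr.split(","):
        if token in _OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(_OPS[token](left, right))
        elif token in variables:
            stack.append(variables[token])
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError(f"malformed RPN expression: {expr!r}")
    return stack[0]


# 1250 bytes/s converted to bits/s, as in CDEF:out_bits=out_bytes,8,*
print(eval_rpn("out_bytes,8,*", {"out_bytes": 1250.0}))  # 10000.0
```

Any replacement backend has to reproduce this kind of per-sample arithmetic somewhere, either in the query language or in the client.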
Once the command string is constructed, it is piped to a background process via Proc->sendCommand($cmd), instructing the rrdtool binary to parse the historical data, execute the RPN mathematics, and output a static PNG image to the /tmp/ directory. This image is then read by the PHP processor, encoded as a base64 string, and served back to the browser. This architecture inherently prevents client-side interactivity; features such as interactive data point inspection, responsive canvas resizing without refreshing, and localized dynamic zooming are structurally impossible because the frontend browser has no access to the underlying time-series data, only a flattened raster image.
| Component | Legacy RRDtool Implementation | Proposed Centralized TSDB Architecture |
|---|---|---|
| Storage Medium | Decentralized Binary Files (.rrd) | Centralized Database Cluster |
| Disk I/O Profile | High Random Write IOPS (Even with caching) | Optimized Sequential Batch Writes |
| Distributed State | Requires Shared Filesystem (NFS) | Stateless HTTP Pollers (via vmagent) |
| Mathematical Logic | Server-Side RPN (CDEF / VDEF) | Database Query Engine (e.g., MetricsQL) |
| Graphical Rendering | Server-Side Static PNG Generation | Client-Side JavaScript (HTML5 Canvas) |
| Data Interactivity | None (Static images require re-rendering) | Native (Hover tooltips, dynamic zoom) |
Evaluating Existing TSDB Integration Shortcomings
Over the years, community contributors have attempted to circumvent the limitations of RRD by building export integrations for modern Time-Series Databases. LibreNMS currently features configuration parameters to push metrics to Graphite, InfluxDB, OpenTSDB, and Prometheus. While these integrations demonstrate that LibreNMS can format its polled data for external ingestion, they are explicitly designed as unidirectional transport mechanisms. The official documentation clearly states that these backends cannot be used to display graphs within the LibreNMS interface. Users must rely entirely on external dashboarding software, such as Grafana, to visualize the exported metrics. Furthermore, analyzing the architectural implementations of these export modules reveals significant flaws that prevent them from serving as a 1:1 primary storage replacement for RRD in their current state.
Selecting the Optimal Centralized Backend: VictoriaMetrics
Given the structural limitations of RRDtool and the integration flaws of InfluxDB and Prometheus, VictoriaMetrics emerges as the most viable, performant, and architecturally sound open-source TSDB to serve as the primary backend for LibreNMS. Engineered specifically to handle massive volumes of telemetry data while maintaining extreme cost-efficiency, VictoriaMetrics serves as a high-performance drop-in replacement for both Prometheus and InfluxDB environments.
Columnar Storage and Unmatched Compression Ratios
The primary advantage of VictoriaMetrics is its extraordinary data compression capabilities. RRD files rely on static, pre-allocated block structures, meaning a newly created file instantly occupies its maximum required disk footprint on the filesystem, regardless of how much data it currently contains. VictoriaMetrics, utilizing optimized Log-Structured Merge (LSM) trees organized into columnar data files, achieves compression ratios up to 50:1 compared to uncompressed TSDBs.
Benchmarking demonstrates that VictoriaMetrics requires significantly less RAM and CPU compute power than Prometheus, scanning up to 50 million raw samples per second per CPU core during querying operations. In production environments, migrating from InfluxDB or Prometheus to VictoriaMetrics routinely results in a 60% to 90% reduction in disk space utilization and a massive decrease in memory pressure. For a LibreNMS deployment managing billions of data points across a multi-year retention period, this compression efficiency fundamentally alters the hardware economics of the monitoring platform.
Ingestion Throughput and Protocol Flexibility
Crucially for the LibreNMS architecture, VictoriaMetrics natively ingests the InfluxDB line protocol over standard HTTP endpoints. This allows LibreNMS to completely bypass the restrictive and synchronous Prometheus Pushgateway, utilizing rapid, high-volume HTTP POST requests. In performance comparisons, VictoriaMetrics has demonstrated the ability to ingest data at rates nearly six times faster than early versions of InfluxDB IOx, processing over 4 million rows per second on a single node. With ingestion headroom of this order, the LibreNMS polling threads are unlikely to stall waiting for storage acknowledgments.
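As a rough illustration of what a poller would emit, the sketch below formats one sample as an InfluxDB line protocol record. The measurement name `ports`, the tag and field names, and the helper `to_line` are all assumptions for illustration; they are not taken from the existing LibreNMS exporters.

```python
from typing import Dict, Optional


def _escape(text: str) -> str:
    """Escape commas, equals signs and spaces in tag/field keys and tag values."""
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def _escape_measurement(text: str) -> str:
    """Measurement names only need commas and spaces escaped."""
    return text.replace(",", "\\,").replace(" ", "\\ ")


def to_line(
    measurement: str,
    tags: Dict[str, str],
    fields: Dict[str, float],
    timestamp: Optional[int] = None,
) -> str:
    """Build one line: measurement,tag=value field=value [timestamp]."""
    tag_part = "".join(f",{_escape(k)}={_escape(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{_escape(k)}={float(v)}" for k, v in fields.items())
    line = f"{_escape_measurement(measurement)}{tag_part} {field_part}"
    if timestamp is not None:
        line += f" {timestamp}"
    return line


print(to_line("ports", {"hostname": "switch-01", "ifIndex": "101"}, {"ifOutOctets": 123456}))
# ports,hostname=switch-01,ifIndex=101 ifOutOctets=123456.0
```

Many such lines can be joined with newlines and sent in a single HTTP POST body, which is what makes batched writes cheap compared with per-file RRD updates.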
Furthermore, unlike InfluxDB, VictoriaMetrics provides a fully open-source cluster version, allowing deployments to scale horizontally by decoupling the ingestion (vminsert), storage (vmstorage), and query (vmselect) layers into discrete microservices.
MetricsQL: Bridging the RRD CDEF/VDEF Gap
Replacing the RRD graphing engine requires a query language capable of executing complex mathematical transformations on the fly. VictoriaMetrics utilizes MetricsQL, a query language that is designed to be backward-compatible with PromQL and adds its own extensions. MetricsQL provides built-in functions for calculating rates, derivatives, percentages, and complex arithmetic operations across multiple time-series.
When LibreNMS currently relies on an RRD CDEF command to multiply a byte counter by 8 to display bits per second, MetricsQL can replicate this natively during the data retrieval phase. When LibreNMS relies on a VDEF command to calculate the 95th percentile of bandwidth utilization for billing purposes, MetricsQL’s quantile_over_time rollup function (rather than histogram_quantile, which operates on histogram buckets) can compute the percentile directly over the stored samples in the selected window. This mathematical parity is the linchpin that allows the server-side image generation to be retired in favor of raw data extraction.
Designing the Unified Open-Source Architecture
Transforming LibreNMS to utilize VictoriaMetrics as its primary data store involves a rigorous multi-layered software engineering effort. The objective is to construct an architecture that captures data efficiently via optimized HTTP payloads, abstracts the querying logic to remain backend-agnostic, and delegates the graphical rendering to the client’s web browser.
Resilient Ingestion via Local vmagent Relays
By architecting the new LibreNMS driver to route all metric writes exclusively through a local vmagent instance on each poller node, the system gains profound operational resilience and flexibility. vmagent is a lightweight metrics collection and routing agent developed by VictoriaMetrics that accepts push-based payloads and forwards them to the central database cluster.
The primary advantage of deploying vmagent locally is absolute protection against data loss. Network monitoring platforms often experience gaps in telemetry data when the central TSDB is taken offline for version upgrades, scaling, or routine maintenance. If a Wide Area Network (WAN) link between a remote LibreNMS distributed poller and the central data center drops, direct API writes would fail entirely. vmagent resolves this by acting as a highly durable queue. If the central VictoriaMetrics cluster becomes unreachable, vmagent seamlessly buffers the unsent metrics to local persistent disk files. It continually collects and stores data in this local safety buffer while waiting for connectivity to be restored. Once the database is back online, vmagent automatically flushes the persistent queue to the remote storage, systematically backfilling the historical data to ensure no gaps exist in the resulting graphs. Administrators can enforce storage limits on this buffer using the -remoteWrite.maxDiskUsagePerURL parameter, ensuring that an extended outage does not inadvertently exhaust the poller node’s entire local hard drive.
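The store-and-forward behaviour described above can be sketched in a few lines. This is an in-memory toy model, not how vmagent is implemented (vmagent persists its queue to disk); the class name `ForwardBuffer`, the `send` callback, and the drop-oldest policy under the size cap are assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Deque


class ForwardBuffer:
    """Queue blocks while the remote is down and replay them in order later."""

    def __init__(self, send: Callable[[bytes], bool], max_bytes: int) -> None:
        self._send = send            # returns True when the remote accepted the block
        self._max_bytes = max_bytes  # plays the role of a disk usage cap
        self._queue: Deque[bytes] = deque()
        self._size = 0

    def push(self, block: bytes) -> None:
        self._queue.append(block)
        self._size += len(block)
        # Assumed policy: once over the cap, discard the oldest blocks first.
        while self._size > self._max_bytes and self._queue:
            self._size -= len(self._queue.popleft())
        self.flush()

    def flush(self) -> None:
        # Stop at the first failure so ordering is preserved for the next attempt.
        while self._queue:
            if not self._send(self._queue[0]):
                return
            self._size -= len(self._queue.popleft())

    def pending(self) -> int:
        return len(self._queue)


delivered = []
remote_up = [False]


def fake_send(block: bytes) -> bool:
    if remote_up[0]:
        delivered.append(block)
    return remote_up[0]


buf = ForwardBuffer(fake_send, max_bytes=10)
for block in (b"aaaa", b"bbbb", b"cccc"):
    buf.push(block)          # remote is down, so blocks accumulate
remote_up[0] = True
buf.flush()                  # remote is back, so the backlog is replayed
print(delivered)             # [b'bbbb', b'cccc'] (b'aaaa' fell off the 10-byte cap)
```

The example also shows why the size cap matters: an outage longer than the buffer can hold still loses the oldest samples.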
In addition to preventing data loss, passing metrics through vmagent introduces several other structural benefits:
- Bandwidth Optimization: When transmitting data from the local vmagent to the central cluster, the agent packages the data using the native VictoriaMetrics remote write protocol. This protocol is highly compressed, reducing network bandwidth utilization by 2x to 5x over WAN links compared to pushing standard, uncompressed JSON or InfluxDB line protocol payloads directly.
- Minimal Resource Footprint: Unlike other metrics agents that rely heavily on a Write-Ahead Log (WAL), vmagent is explicitly engineered without one. This design choice drastically reduces its CPU and RAM consumption and allows the agent to restart nearly instantaneously (skipping broken chunks if necessary) without needing to perform slow WAL replays.
- Stream Aggregation and Deduplication: vmagent features a built-in processing pipeline that can manipulate data in-flight before it is transmitted. It can perform stream aggregation (such as automatically calculating and forwarding 5-minute averages rather than raw high-frequency samples) and real-time deduplication to aggressively reduce the volume of data stored centrally.
Constructing the Data Retrieval Abstraction Layer
The most significant structural alteration is the dismantling of the server-side image generation. Currently, LibreNMS lacks a unified, native class for extracting queried data back into the application, operating strictly through the Rrd facade to pipe commands to the shell.
A new core interface, logically named LibreNMS\Data\Retrieve, must be instituted to act as the translation layer between the web application’s request for data and the underlying syntax of the TSDB. When a user requests a traffic graph for a specific interface via the web UI, the frontend will issue an asynchronous AJAX request to an internal LibreNMS REST API endpoint (e.g., /api/v0/devices/:hostname/ports/:id/metrics).
The LibreNMS\Data\Retrieve driver will intercept this request, parse the requested parameters, and construct a targeted MetricsQL query. For instance, replacing the RRD logic that calculates outbound bits per second from a raw octet counter involves translating the mathematical steps. The system will issue a query to the VictoriaMetrics /api/v1/query_range endpoint:
rate(ifOutOctets{hostname="switch-01", ifIndex="101"}[5m]) * 8
This query leverages the VictoriaMetrics native rate() function to calculate the per-second derivative of the counter over a five-minute window, and immediately applies an arithmetic operator (* 8) to convert bytes to bits. This operation elegantly replaces the CDEF:out_bits=out_bytes,8,* logic previously handled exclusively by the RRDtool binary. VictoriaMetrics processes this command across the clustered backend with sub-second latency and returns a standardized JSON payload containing an array of timestamps and their corresponding floating-point values.
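A minimal sketch of the retrieval step, shown in Python for brevity (the real driver would be PHP). It builds the query_range URL and flattens the Prometheus-compatible JSON response into timestamp/value pairs. The base URL, the function names, and the sample payload are placeholders, and the response shape is assumed to follow the standard Prometheus HTTP API "matrix" format.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode

Series = Tuple[Dict[str, str], List[Tuple[float, float]]]


def build_query_range_url(base_url: str, query: str, start: int, end: int, step: int) -> str:
    """Build a query_range URL; start/end are UNIX seconds, step is in seconds."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"


def parse_matrix(payload: str) -> List[Series]:
    """Convert a 'matrix' response into (labels, [(timestamp, value), ...]) tuples."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    series = []
    for item in body["data"]["result"]:
        # Values arrive as [timestamp, "string value"] pairs.
        points = [(float(ts), float(val)) for ts, val in item["values"]]
        series.append((item["metric"], points))
    return series


query = 'rate(ifOutOctets{hostname="switch-01", ifIndex="101"}[5m]) * 8'
print(build_query_range_url("http://localhost:8428", query, 1700000000, 1700003600, 300))

sample = (
    '{"status":"success","data":{"resultType":"matrix","result":['
    '{"metric":{"hostname":"switch-01"},'
    '"values":[[1700000000,"8000"],[1700000300,"16000"]]}]}}'
)
print(parse_matrix(sample))
```

The parsed pairs are what a charting library would consume directly, so this function is essentially the whole "translation layer" on the read side.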
Re-engineering the Presentation Layer
Receiving a JSON array of timestamps and values fundamentally shifts the burden of graph rendering from the backend PHP server to the client’s local browser context. The legacy PHP files situated within the includes/html/graphs/ directory must be repurposed. Instead of executing shell command arrays, these files will act as declarative mapping templates, defining which MetricsQL query templates correspond to which LibreNMS UI elements.
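One way to picture these declarative mapping templates is a lookup table from graph type to query template. The sketch below uses Python for illustration; the graph type keys, the `ifInOctets` metric, and the `render_query` helper are hypothetical, and only the outbound query mirrors the example given earlier.

```python
from string import Template
from typing import Dict

# Hypothetical registry: LibreNMS graph type -> MetricsQL query template.
GRAPH_QUERIES: Dict[str, Template] = {
    "port_bits_out": Template(
        'rate(ifOutOctets{hostname="$hostname", ifIndex="$ifindex"}[5m]) * 8'
    ),
    "port_bits_in": Template(
        'rate(ifInOctets{hostname="$hostname", ifIndex="$ifindex"}[5m]) * 8'
    ),
}


def _escape_label_value(value: str) -> str:
    """Escape backslashes and double quotes so values cannot break out of the selector."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def render_query(graph_type: str, **labels: str) -> str:
    """Fill the template for one graph type with escaped label values."""
    safe = {key: _escape_label_value(val) for key, val in labels.items()}
    return GRAPH_QUERIES[graph_type].substitute(safe)


print(render_query("port_bits_out", hostname="switch-01", ifindex="101"))
# rate(ifOutOctets{hostname="switch-01", ifIndex="101"}[5m]) * 8
```

Keeping the templates as data rather than code is what would let each legacy graph file shrink to a short declaration.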
Implementing Client-Side Rendering Integration
The LibreNMS WebUI must integrate a modern, high-performance JavaScript charting library to replace the static PNGs. Frameworks such as Apache ECharts, Chart.js, or uPlot are optimal candidates due to their proven ability to render large, dense time-series datasets efficiently using HTML5 Canvas or WebGL technologies.
When the user navigates to a device page, the browser will request the layout structure from the PHP server. The PHP server will return the HTML shell, embedding the necessary JavaScript parameters and the required API endpoint URLs. The client’s browser will then execute asynchronous fetch() requests to the /api/v0/devices/... endpoints, retrieving the JSON time-series data processed by VictoriaMetrics. The chosen JavaScript library will ingest this JSON and instantly render the graph within the browser window.
Achieving Graphing Capability Parity
To satisfy the requirement that the new system must possess at least the same capabilities as the existing system, the client-side implementation must meticulously replicate the nuanced features of the RRD engine:
- Dynamic Zooming and Panning: Client-side libraries natively support interactive click-and-drag zooming. Because the raw data is held in the browser’s memory, zooming does not require a round-trip request to the backend server to generate a new image, vastly improving the fluidity of the interface.
- Precise Tooltip Inspection: RRD images cannot show discrete values when hovered over. The JavaScript implementation provides instant, precise tooltip readouts of the data point values at any given timestamp, resolving a long-standing user interface limitation.
- Custom Graph Definitions: LibreNMS relies on user-contributed configuration files located in resources/definitions/config_definitions.json (or locally in config.php) to define custom graphs, such as application-specific session counts or active users. The new architecture maps these JSON definitions directly to the MetricsQL queries, ensuring that the custom graph ecosystem continues to function flawlessly without requiring users to manually configure Grafana panels.
- Advanced Aggregations: Parity with RRD’s VDEF features (such as generating 95th percentile lines for bandwidth billing) will be achieved by appending additional MetricsQL queries to the payload. The JavaScript library will receive the 95th percentile static value from VictoriaMetrics and render it as an overlay line on the canvas.
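For the percentile overlay in the last item, the arithmetic is small enough to sketch. The example uses the nearest-rank method, which is an assumption: RRDtool's VDEF and a TSDB's quantile function may pick or interpolate values slightly differently, so results should be compared before relying on either for billing. The function names are hypothetical.

```python
import math
from typing import List, Optional, Tuple


def percentile_nearest_rank(values: List[Optional[float]], pct: int) -> float:
    """Return the pct-th percentile using the nearest-rank method, ignoring gaps."""
    clean = sorted(v for v in values if v is not None and not math.isnan(v))
    if not clean:
        raise ValueError("no usable samples")
    rank = max(1, math.ceil(pct * len(clean) / 100))
    return clean[rank - 1]


def overlay_line(timestamps: List[int], level: float) -> List[Tuple[int, float]]:
    """Repeat one static value across the time axis so it draws as a flat line."""
    return [(ts, level) for ts in timestamps]


samples = [10.0 * i for i in range(1, 21)]       # 20 samples: 10, 20, ..., 200
p95 = percentile_nearest_rank(samples, 95)
print(p95)                                        # 190.0
print(overlay_line([0, 300, 600], p95))           # [(0, 190.0), (300, 190.0), (600, 190.0)]
```

In the proposed design the backend would return the single percentile value and the client would only need the equivalent of `overlay_line` to draw it.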
Data Retention, Downsampling, and Historical Migration
A fundamental shift in data storage architecture requires addressing the lifecycle of the data itself. RRD files are celebrated for their simplicity in data retention; they automatically and rigidly downsample data into lower-resolution archives over time (e.g., aggregating 5-minute data points into 2-hour averages after 30 days). This guarantees that the RRD file never expands beyond its initial byte allocation.
Replicating RRA with Recording Rules
VictoriaMetrics is an append-only TSDB that stores every raw metric pushed to it. While its 50:1 compression ratio ensures that storing billions of raw metrics requires a fraction of the space of uncompressed databases, storing raw 5-minute polling data indefinitely is computationally inefficient for performing multi-year trend analysis. The open-source version of VictoriaMetrics does not natively feature automatic downsampling, a capability that is strictly reserved for its commercial enterprise tier.
However, the architecture can precisely replicate RRD’s automatic aggregation behavior by leveraging the open-source vmalert component to execute recording rules. The LibreNMS configuration architecture can dynamically generate vmalert YAML configuration files that define continuous background MetricsQL queries. For example, to replicate an RRA that stores 1-hour averages of network interface traffic, a recording rule can be configured to execute every hour on the hour:
```yaml
groups:
  - name: downsample_traffic
    interval: 1h
    rules:
      # ifOutOctets is a counter, so record its average per-second rate over the hour
      - record: ifOutOctets:1h_avg
        expr: rate(ifOutOctets[1h])
```
The vmalert daemon evaluates these expressions against the raw data stored in VictoriaMetrics and writes the resulting aggregated data point back into the database under a new metric designation (ifOutOctets:1h_avg). To enforce strict data pruning and replicate RRD’s fixed-size constraints, administrators can deploy multiple vmstorage nodes (or utilize vmagent routing parameters) with distinct retention configurations. The raw, high-resolution 5-minute metrics can be routed to a storage pool with a -retentionPeriod of 30 days, while the downsampled metrics generated by vmalert can be directed to a separate storage pool configured with a multi-year retention policy. This guarantees perpetual data visibility without uncontrolled disk consumption.
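Generating such rule files from LibreNMS could look roughly like the sketch below. It emits the YAML as plain text to stay dependency-free. The function name, the group naming scheme, and the choice of `rate()` for counters versus `avg_over_time()` for gauges are assumptions about how the generator might be designed, not existing LibreNMS behaviour.

```python
from typing import Iterable, Tuple

# Assumed mapping from metric kind to the rollup used for downsampling.
_ROLLUPS = {
    "counter": "rate",            # average per-second increase over the window
    "gauge": "avg_over_time",     # plain average of the sampled values
}


def build_recording_rules(metrics: Iterable[Tuple[str, str]], window: str = "1h") -> str:
    """Render a vmalert rule group with one recording rule per (metric, kind)."""
    lines = [
        "groups:",
        f"  - name: downsample_{window}",
        f"    interval: {window}",
        "    rules:",
    ]
    for metric, kind in metrics:
        rollup = _ROLLUPS[kind]
        lines.append(f"      - record: {metric}:{window}_avg")
        lines.append(f"        expr: {rollup}({metric}[{window}])")
    return "\n".join(lines) + "\n"


print(build_recording_rules([("ifOutOctets", "counter"), ("sensor_temperature", "gauge")]))
```

The metric name `sensor_temperature` is only a placeholder; the point is that the rollup has to match the metric type, much as an RRD data source type (COUNTER versus GAUGE) does today.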
The Historical RRD Migration Pathway
For established LibreNMS deployments, preserving historical data stored in legacy .rrd files is an absolute operational requirement. Migrating binary RRD structures into a centralized TSDB necessitates the development of a highly specific translation utility.
The system will require the development of a CLI utility command, conceptually similar to the existing lnms maintenance:rrd-step or lnms migrate scripts. This script will recursively iterate through the /opt/librenms/rrd/ hierarchy. For each identified RRD file, it will invoke the rrdtool xport command to extract the historical time-series data into a parseable JSON or XML format. The script will then parse this output, structurally mapping the hierarchical directory names and filenames into native VictoriaMetrics labels (e.g., algorithmically converting the file path /opt/librenms/rrd/switch-core-01/port-id10.rrd into the label payload {hostname="switch-core-01", port_id="10"}).
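The path-to-label mapping can be sketched as follows. Only the port-id pattern comes from the example above; the fallback label `rrd_name` for other file types and the function name are assumptions for illustration.

```python
import re
from pathlib import PurePosixPath
from typing import Dict

_PORT_FILE = re.compile(r"^port-id(\d+)\.rrd$")


def labels_from_rrd_path(path: str) -> Dict[str, str]:
    """Derive labels from <rrd_dir>/<hostname>/<file>.rrd."""
    rrd_path = PurePosixPath(path)
    labels = {"hostname": rrd_path.parent.name}
    match = _PORT_FILE.match(rrd_path.name)
    if match:
        labels["port_id"] = match.group(1)
    else:
        # Hypothetical fallback: keep the bare file name for non-port RRDs.
        labels["rrd_name"] = rrd_path.stem
    return labels


print(labels_from_rrd_path("/opt/librenms/rrd/switch-core-01/port-id10.rrd"))
# {'hostname': 'switch-core-01', 'port_id': '10'}
```

A real migration tool would need one such pattern per RRD naming scheme (sensors, processors, applications, and so on), which is where most of the effort would go.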
The script will then stream this historical payload via the InfluxDB line protocol into the local vmagent relay for buffering and eventual ingestion. Because extracting data from thousands of RRD files is intensely I/O bound, this migration script must be executed asynchronously and optionally across multiple worker threads to prevent locking the system. The script will append the original backdated timestamps (converted from UNIX epoch seconds to milliseconds) to ensure the historical data aligns seamlessly with newly incoming active telemetry.
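The timestamp handling can be sketched as below. It assumes the export yields a start time, a fixed step, and a list of rows in which the first row corresponds to the start time and missing values appear as null; that layout should be verified against the actual rrdtool xport output before use. The function name is hypothetical.

```python
from typing import Iterator, List, Optional, Tuple


def rows_to_samples(
    start: int,
    step: int,
    rows: List[List[Optional[float]]],
    column: int = 0,
) -> Iterator[Tuple[int, float]]:
    """Yield (timestamp_ms, value) pairs for one exported column, skipping gaps."""
    for index, row in enumerate(rows):
        value = row[column]
        if value is None:
            continue  # unknown RRD slots are not written to the TSDB
        timestamp_seconds = start + index * step
        yield timestamp_seconds * 1000, value


rows = [[1.5], [None], [2.5]]
print(list(rows_to_samples(1700000000, 300, rows)))
# [(1700000000000, 1.5), (1700000600000, 2.5)]
```

Skipping unknown slots rather than writing zeros keeps historical gaps visible as gaps after migration.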
Strategic Conclusion
The persistence of RRDtool within modern network monitoring solutions represents a technological debt that fundamentally constrains scalability, limits user interface modernization, and isolates critical telemetry data in rigid, inaccessible silos. While the integration of external Time-Series Databases as secondary, write-only endpoints within LibreNMS has provided a functional stopgap for advanced users, it has comprehensively failed to resolve the core architectural limitations of the platform’s primary storage and rendering engines.
By architecting a solution that completely excises RRDtool and elevates VictoriaMetrics to the primary native datastore, the monitoring system is fundamentally transformed. Writing locally to vmagent relays adds a critical layer of operational resilience, ensuring that network telemetry is securely buffered during maintenance windows or network interruptions. Furthermore, VictoriaMetrics provides the requisite ingestion velocity, extreme data compression efficiency, and the mathematical query language complexity (MetricsQL) necessary to absorb the demanding, high-cardinality workloads generated by active SNMP network polling. Finally, transitioning from static, server-side CDEF image compilation to JSON-driven, client-side graphical rendering modernizes the presentation layer to match contemporary analytical standards.
While the requisite codebase refactoring—specifically the translation of thousands of legacy rrdtool graph definitions into the new LibreNMS\Data\Retrieve API configurations and the implementation of vmalert for downsampling—is an extensive undertaking, the strategic dividends are incontrovertible. The resulting unified architecture decouples storage from compute, eradicates dangerous shared-filesystem dependencies in distributed polling environments, prevents data loss at the edge, and establishes a highly scalable, fully interactive network observability platform capable of meeting the demands of next-generation enterprise networks.
