Key Takeaways
- The most dangerous failure is silent. A Wazuh component that stops collecting without alerting leaves the SOC blind while believing it is covered.
- Agent status is the first health signal. Disconnected or never-connected agents are coverage gaps, and Wazuh tracks and can alert on them automatically.
- Cluster and indexer health have clear states. A red indexer cluster means unassigned primary shards and lost or unsearchable data.
- Queue depth and disk usage are leading indicators. They warn of saturation before events are actually dropped, which is the moment to act.
- High availability removes single points of failure through clustering and replicas, so the loss of one node degrades rather than blinds the SOC.
Why You Must Monitor the SIEM Itself
Every other monitoring tool in the organisation is watched by the SIEM. The question too often left unasked is: what watches the SIEM? A Wazuh deployment that has silently stopped collecting from a critical server, or whose indexer has stopped accepting writes, presents the worst possible situation. The dashboards look calm, the SOC believes it has coverage, and an attacker operates in a blind spot that nobody knows exists.
Silent failure is the specific risk to engineer against. A loud failure (a server that crashes and pages someone) is recoverable. A quiet failure (an agent that stopped reporting three weeks ago, an index that filled and started rejecting writes) erodes coverage invisibly. Health monitoring exists to convert silent failures into loud ones, so that a gap in coverage generates an alert exactly like a security event would.
This is a SOC discipline, not a one-time setup. Coverage degrades continuously: agents are decommissioned and never cleaned up, new servers are built without an agent, disks fill, certificates expire. A mature Wazuh operation treats the health of the platform as a first-class monitored asset, with the same rigour applied to detecting an outage as to detecting an intrusion.
Tracking Agent Status and Coverage
Agent status is the front line of health monitoring. Wazuh classifies agents as active, disconnected, pending or never-connected, and each non-active state is a potential coverage gap. A disconnected agent has stopped reporting; a never-connected agent was registered but has never checked in; a pending agent has started to connect but the manager has not yet confirmed the connection. In each case the endpoint is dark to the SOC until the agent becomes active.
Wazuh can alert on disconnection automatically. The manager expects a periodic keepalive from each agent; when none arrives within the expected window, it marks the agent as disconnected and generates an event, which can be routed to the SOC like any other alert. This turns a silent gap into an actionable notification: a server that drops off the grid produces an alert the same way a brute-force attempt would.
Coverage monitoring goes beyond individual agents to the whole estate. The right question is not only which agents are disconnected but whether every asset that should have an agent actually has one. Reconciling the agent inventory against the asset inventory catches the build-without-agent gap that pure agent-status monitoring misses, because you cannot alert on an agent that was never installed.
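The reconciliation described above can be sketched in a few lines of Python. This is a minimal illustration, not a Wazuh feature: the `coverage_gaps` function, the asset list (imagined here as a CMDB export) and the shape of the agent records (dicts with `name` and `status` keys, similar to what an agent listing from the Wazuh API might provide) are all assumptions made for the example.

```python
from typing import Dict, Iterable, List

# Statuses that mean an agent exists but is not currently reporting.
NON_ACTIVE = {"disconnected", "never_connected", "pending"}


def coverage_gaps(asset_hosts: Iterable[str], agents: Iterable[dict]) -> Dict[str, List[str]]:
    """Compare an asset inventory with the registered agents.

    asset_hosts: hostnames that should be monitored (hypothetical CMDB export).
    agents: dicts with "name" and "status" keys (assumed record shape).
    """
    status_by_name = {a["name"].lower(): a["status"] for a in agents}
    assets = {h.lower() for h in asset_hosts}
    registered = set(status_by_name)

    return {
        # Assets built without an agent: invisible to agent-status alerts.
        "no_agent": sorted(assets - registered),
        # Assets with an agent that is not reporting.
        "dark": sorted(h for h in assets & registered if status_by_name[h] in NON_ACTIVE),
        # Agents with no matching asset: candidates for clean-up.
        "orphaned": sorted(registered - assets),
    }
```

Matching on lower-cased hostnames is a simplification; a real estate may need to match on a more stable identifier such as an asset tag or IP address.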
Cluster and Indexer Health
The indexer cluster reports an overall health colour that is the fastest read on storage availability. Green means all primary and replica shards are assigned. Yellow means primaries are assigned but some replicas are not, which is degraded but still searchable and not yet data-losing. Red means at least one primary shard is unassigned, which means some data is unsearchable or lost. Red is an incident.
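As an illustration, the colour-to-severity mapping can be expressed as a small function. The sketch below assumes the input is the parsed JSON body of the indexer's cluster health endpoint (`GET _cluster/health`), which includes `status` and `unassigned_shards` fields; the function name and the severity labels are hypothetical.

```python
from typing import Tuple


def classify_cluster_health(health: dict) -> Tuple[str, str]:
    """Map a cluster health document to a (severity, reason) pair."""
    status = health.get("status", "unknown")
    unassigned = health.get("unassigned_shards", 0)

    if status == "green":
        return ("ok", "all primary and replica shards are assigned")
    if status == "yellow":
        # Primaries are assigned, so every unassigned shard is a replica.
        return ("warning", f"{unassigned} replica shard(s) unassigned; searchable but degraded")
    if status == "red":
        return ("critical", f"{unassigned} shard(s) unassigned including at least one primary")
    # A missing or unexpected status is treated as a failure, not as healthy.
    return ("critical", f"unexpected status {status!r}")
```

Treating an unknown status as critical is a deliberate choice here: a health check that cannot read the cluster state should fail loudly rather than report nothing.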
Manager cluster health matters just as much. In a clustered deployment the worker nodes must stay synchronised with the master so detection is uniform. A worker that falls out of sync, or a master that becomes unreachable, breaks the assumption that a rule pushed in one place applies everywhere. Monitoring cluster node status and synchronisation catches this before it causes inconsistent detection.
Daemon-level health underpins both. The core processes (the manager's wazuh-analysisd and wazuh-remoted daemons, plus the indexer and dashboard services) must be running for the platform to function. A simple but essential check confirms each one is alive on each node, because a stopped analysisd silently halts detection while collection may appear to continue, which is exactly the kind of silent failure health monitoring exists to catch.
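A daemon check of this kind might look like the sketch below. It assumes the text being parsed is the output of the manager's `wazuh-control status` command, with lines of the form `wazuh-analysisd is running...` or `wazuh-clusterd not running...`; that line format, the `REQUIRED` list and the function name are assumptions to verify against your own installation.

```python
from typing import Iterable, List

# Daemons this sketch treats as essential (adjust for your deployment).
REQUIRED = ("wazuh-analysisd", "wazuh-remoted")


def stopped_daemons(status_output: str, required: Iterable[str] = REQUIRED) -> List[str]:
    """Return required daemons that are not reported as running."""
    running = set()
    for line in status_output.splitlines():
        parts = line.split()
        # Expected shape (assumed): "<daemon> is running..."
        if len(parts) >= 3 and parts[1] == "is" and parts[2].startswith("running"):
            running.add(parts[0])
    # A daemon missing from the output counts as stopped (fail-safe).
    return [name for name in required if name not in running]
```

On a manager node the input could be captured with `subprocess.run(["/var/ossec/bin/wazuh-control", "status"], capture_output=True, text=True).stdout`, assuming the default installation path.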
Queue Depth, Disk and Leading Indicators
The best health monitoring warns before failure, not after. The analysisd event queue is a leading indicator: a queue trending toward capacity means events are about to be dropped, and alerting at a threshold (well before one hundred percent) gives the SOC time to act while data is still being captured.
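A threshold check on the queue could be sketched as follows. The example assumes queue statistics are available as `key='value'` lines, as in the analysisd state file, with an `event_queue_usage` value between 0 and 1 and an `events_dropped` counter; those field names, the 70% and 90% thresholds and both function names are illustrative assumptions.

```python
from typing import Dict


def parse_state(text: str) -> Dict[str, str]:
    """Parse key='value' lines, skipping comments and blank lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        values[key.strip()] = raw.strip().strip("'")
    return values


def queue_alert(state: Dict[str, str], warn: float = 0.70, crit: float = 0.90) -> str:
    """Classify queue pressure from a parsed state mapping."""
    usage = float(state.get("event_queue_usage", 0))
    dropped = int(state.get("events_dropped", 0))
    # Any dropped event means the leading indicator was already missed.
    if dropped > 0 or usage >= crit:
        return "critical"
    if usage >= warn:
        return "warning"
    return "ok"
```

The warning level is the useful one: it fires while events are still being captured, which is the window the paragraph above describes.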
Disk usage is the other classic leading indicator. An indexer disk that fills will start rejecting writes, and write rejections mean lost events. Because disk fills predictably as data accumulates, alerting at a usage threshold turns a future outage into a planned maintenance task. Index lifecycle management that ages out old data automatically removes the most common cause of disk-full incidents.
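Because growth is roughly predictable, the time remaining before a usage threshold is crossed can be estimated with simple arithmetic. The sketch below assumes linear growth and an illustrative 85% threshold; the function name and all figures are hypothetical.

```python
from typing import Optional


def days_until_threshold(
    used_gb: float,
    capacity_gb: float,
    daily_growth_gb: float,
    threshold: float = 0.85,
) -> Optional[float]:
    """Estimate days until disk usage reaches the alert threshold.

    Returns 0.0 if the threshold is already reached and None if usage
    is flat or shrinking (no projected crossing).
    """
    limit_gb = capacity_gb * threshold
    if used_gb >= limit_gb:
        return 0.0
    if daily_growth_gb <= 0:
        return None
    return (limit_gb - used_gb) / daily_growth_gb
```

For example, a 1,000 GB volume holding 600 GB and growing by 10 GB a day reaches 85% (850 GB) in about 25 days, which is ample time to schedule a retention change or a disk expansion as planned work.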
Other leading indicators include JVM heap pressure on the indexer (frequent garbage collection precedes indexing lag), certificate expiry dates (an expired certificate can silently break TLS-secured communication between components, such as certificate-verified agent enrollment or forwarding from the manager to the indexer), and ingestion-rate drops (a sudden fall in EPS from a source often means that source stopped reporting). Monitoring these trends catches problems in the window where they are still cheap to fix.
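The ingestion-rate check can be sketched as a comparison between each source's current EPS and its recent typical rate. Everything here is an assumption for illustration: the function name, the use of the median as the baseline, the 20% drop ratio, and the idea that per-source EPS samples are already being collected somewhere.

```python
from statistics import median
from typing import Dict, List


def silent_sources(
    baseline: Dict[str, List[float]],
    current: Dict[str, float],
    drop_ratio: float = 0.2,
    min_baseline: float = 1.0,
) -> List[str]:
    """Return sources whose current EPS fell to drop_ratio of baseline or less.

    baseline: source name -> recent EPS samples.
    current: source name -> latest EPS (a missing source counts as 0).
    """
    flagged = []
    for source, samples in baseline.items():
        if not samples:
            continue
        typical = median(samples)
        # Skip sources that are normally near-silent to avoid noise.
        if typical < min_baseline:
            continue
        if current.get(source, 0.0) <= typical * drop_ratio:
            flagged.append(source)
    return sorted(flagged)
```

Using the median rather than the mean keeps a single burst in the history from inflating the baseline and triggering false alarms afterwards.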
Designing for High Availability
Monitoring tells you when something breaks. High availability ensures a break degrades rather than blinds. The goal is to remove single points of failure so that the loss of any one component leaves the platform functioning, even if reduced, while the failed part is repaired.
On the manager side, multiple worker nodes behind a load balancer mean the loss of one worker reroutes its agents to survivors automatically, so collection continues. On the indexer side, replica shards mean the loss of a data node does not lose data, because a copy exists elsewhere, and searches continue to be served. These two patterns together cover the most common hardware and node failures.
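On the indexer side, a quick sanity check is whether every index could survive the loss of one data node. The sketch below assumes you already have each index's `number_of_replicas` setting and a count of data nodes; the function name and the index names in the example are hypothetical.

```python
from typing import Dict, List


def indices_without_redundancy(index_replicas: Dict[str, int], data_nodes: int) -> List[str]:
    """Return indices that cannot survive the loss of one data node.

    index_replicas: index name -> number_of_replicas setting.
    data_nodes: number of data nodes in the indexer cluster.
    """
    at_risk = []
    for name, replicas in index_replicas.items():
        # A replica never shares a node with its primary, so a usable
        # copy needs at least one replica and at least two data nodes.
        if replicas < 1 or data_nodes < 2:
            at_risk.append(name)
    return sorted(at_risk)
```

For instance, `indices_without_redundancy({"alerts-a": 1, "alerts-b": 0}, data_nodes=3)` flags only `alerts-b`, while on a single-node cluster every index is flagged regardless of its replica setting.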
High availability is incomplete without tested recovery. A replica that has never been verified, a backup that has never been restored and a failover that has never been exercised are assumptions, not capabilities. Regular drills (taking a node offline deliberately, restoring a backup to a test environment) prove the design works before a real failure does. Codesecure builds health monitoring, alerting and HA into managed Wazuh deployments and exercises recovery so coverage holds through component failures rather than collapsing at the worst moment.
Turning Health Monitoring Into Operational Practice
Health checks only protect you if someone acts on them. Routing platform-health alerts to the same queue that handles security alerts ensures a disconnected critical agent or a red cluster gets the same urgency as an intrusion. Health alerts that fire into an unwatched inbox recreate the silent-failure problem they were meant to solve.
A health dashboard gives the SOC an at-a-glance view: agent connectivity across the estate, cluster colour, queue depth, disk headroom and ingestion rates. Reviewed at the start of every shift, it surfaces slow-building problems (creeping disk usage, a steadily rising count of disconnected agents) long before they become outages. Operational rhythm turns monitoring data into prevented incidents.
Finally, health metrics belong in service reporting. For a managed SOC, platform availability, agent coverage percentage and mean time to restore are the numbers that prove the service is actually watching. Codesecure reports these to clients alongside detection metrics, because a SIEM's value rests entirely on it running, and demonstrating that it runs is part of demonstrating the service works.
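Two of these reporting numbers are straightforward to compute once the underlying data is recorded. The sketch below is a hedged illustration: the function names are hypothetical, and it assumes outages are logged as (start, end) timestamp pairs and that the expected asset count comes from the asset inventory.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def coverage_percent(active_agents: int, expected_assets: int) -> float:
    """Share of expected assets with an active agent, as a percentage."""
    if expected_assets <= 0:
        # No expected assets is treated as no measurable coverage.
        return 0.0
    return round(100.0 * active_agents / expected_assets, 1)


def mean_time_to_restore(outages: List[Tuple[datetime, datetime]]) -> Optional[timedelta]:
    """Average outage duration, or None if there were no outages."""
    durations = [end - start for start, end in outages]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

Dividing by expected assets rather than registered agents matters: it keeps servers built without an agent in the denominator, so the figure reflects real coverage rather than agent uptime.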
Frequently Asked Questions
Why is monitoring the SIEM itself important?
Because a SIEM that fails silently is worse than no SIEM: the dashboards look calm while the SOC is actually blind, giving false confidence an attacker can exploit. Health monitoring converts silent failures (a disconnected agent, a full disk, a stopped daemon) into loud alerts so a coverage gap is treated like a security event.
How does Wazuh alert on disconnected agents?
The manager tracks agent keepalives and classifies agents as active, disconnected, pending or never-connected. When no keepalive arrives from an agent within the expected window, Wazuh generates an event that can be routed to the SOC like any other alert, turning a silent coverage gap into an actionable notification.
What do green, yellow and red mean for the Wazuh indexer?
Green means all primary and replica shards are assigned. Yellow means primaries are assigned but some replicas are not, which is degraded but still searchable and not data-losing. Red means at least one primary shard is unassigned, so some data is unsearchable or lost. Red is an incident that needs immediate attention.
What are the leading indicators of Wazuh problems?
The analysisd event queue trending toward capacity (events about to drop), disk usage approaching full (writes about to be rejected), JVM heap pressure and frequent garbage collection (indexing lag coming), certificate expiry (TLS-secured communication between components about to break), and sudden ingestion-rate drops (a source that stopped reporting). All warn before actual failure.
How do you make Wazuh highly available?
Remove single points of failure. Multiple manager worker nodes behind a load balancer keep collection running if a worker fails; replica shards on the indexer keep data available if a data node fails. High availability must be paired with tested recovery (deliberately taking nodes offline and restoring backups) so the design is proven before a real failure occurs.
Can Codesecure monitor the health of our Wazuh platform?
Yes. Codesecure builds health monitoring, disconnection alerting and high-availability design into managed Wazuh deployments, routes platform-health alerts to the SOC queue, and reports agent coverage, availability and mean time to restore. We exercise recovery so coverage holds through failures. ISO/IEC 27001:2022 certified delivery.

