Zabbix scales fine to 200 hosts. Configuring it well is a different question: it's the configuration that eats the weekend, not the scaling.

Template hygiene

Start from the official templates. Clone them. Never edit the originals. When a template update ships, you want to merge it into the clone, not discover your custom changes got overwritten.
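Before merging an update, it helps to see exactly where the clone has drifted from the original. A minimal sketch, assuming both templates have been exported to text (YAML, JSON, or XML) and read into strings; the sample export content below is made up for illustration and is not a real Zabbix export:

```python
import difflib


def export_diff(official: str, clone: str) -> list[str]:
    """Return unified-diff lines between two template exports given as text."""
    return list(
        difflib.unified_diff(
            official.splitlines(),
            clone.splitlines(),
            fromfile="official",
            tofile="clone",
            lineterm="",
        )
    )


# Illustrative stand-ins for exported template files.
official_export = "name: Linux by Zabbix agent\ndelay: 1m\n"
clone_export = "name: Linux by Zabbix agent (custom)\ndelay: 5m\n"

for line in export_diff(official_export, clone_export):
    print(line)
```

Every "+" line is a custom change you have to carry forward when the next official version lands.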

Discovery rules

Autoregistration plus host prototypes beats manual host creation. Hosts register and get tagged based on hostname conventions, and templates are linked based on those tags. You add a machine to the fleet, and Zabbix knows what to monitor within 60 seconds.
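The hostname-to-tags-to-templates chain can be sketched in a few lines. This only models the decision logic; in Zabbix itself it would normally live in autoregistration action conditions and operations. The `<role>-<env>-<number>` convention and the role-to-template mapping are hypothetical, so substitute your own:

```python
import re

# Hypothetical naming convention: <role>-<env>-<number>, e.g. "web-prod-03".
HOSTNAME_RE = re.compile(r"^(?P<role>[a-z]+)-(?P<env>prod|stage|dev)-(?P<num>\d+)$")

# Hypothetical mapping from the "role" tag to the templates to link.
ROLE_TEMPLATES = {
    "web": ["Linux by Zabbix agent", "Nginx by Zabbix agent"],
    "db": ["Linux by Zabbix agent", "PostgreSQL by Zabbix agent"],
}
DEFAULT_TEMPLATES = ["Linux by Zabbix agent"]


def tags_for(hostname: str) -> dict[str, str]:
    """Derive tags from a hostname; return no tags if it breaks the convention."""
    match = HOSTNAME_RE.match(hostname)
    if match is None:
        return {}
    return {"role": match.group("role"), "env": match.group("env")}


def templates_for(tags: dict[str, str]) -> list[str]:
    """Pick templates from the role tag, falling back to the base OS template."""
    return ROLE_TEMPLATES.get(tags.get("role"), DEFAULT_TEMPLATES)


tags = tags_for("web-prod-03")
print(tags, templates_for(tags))
```

A hostname that breaks the convention gets no tags and only the base template, which makes naming mistakes visible instead of silently unmonitored.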

The three dashboards

I build exactly three and stop:

  • Operations — active problems, recent resolves, host availability pie. This is the tab the team leaves open.
  • Capacity — disk/CPU/memory trends, top 10 offenders per resource. Review weekly.
  • Executive — uptime SLA, MTTR trend, incident count by severity. Monthly screenshot.
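For the Executive numbers, the arithmetic is simple enough to check by hand. A small sketch, assuming each incident is an (opened, resolved) timestamp pair and that uptime is measured as total period minus total downtime; the function names and sample data are illustrative:

```python
from datetime import datetime, timedelta


def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean time to resolve across (opened, resolved) pairs."""
    durations = [resolved - opened for opened, resolved in incidents]
    if not durations:
        return timedelta(0)
    return sum(durations, timedelta(0)) / len(durations)


def availability_pct(period: timedelta, downtime: timedelta) -> float:
    """Uptime as a percentage of the reporting period."""
    return 100.0 * (1 - downtime / period)


incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),  # 30 minutes
    (datetime(2024, 1, 9, 12, 0), datetime(2024, 1, 9, 13, 30)),  # 90 minutes
]
print(mttr(incidents))  # 1:00:00
print(round(availability_pct(timedelta(days=30), timedelta(hours=2)), 2))  # 99.72
```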

Alert routing

Severity maps to channel. Information goes to a log. Warning goes to a Slack channel the team already watches. Average and High go to a separate channel with a ding. Disaster goes to phone via PagerDuty or Opsgenie. Don't route everything to phone; you'll train the team to silence it.
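Written as a lookup, the routing looks like this. The numeric levels follow Zabbix's trigger severities (0 to 5); the channel names are placeholders. Two assumptions: "Not classified" is treated like Information, and Disaster goes only to the pager rather than to both the pager and the alert channel:

```python
SEVERITY_NAMES = {
    0: "Not classified",
    1: "Information",
    2: "Warning",
    3: "Average",
    4: "High",
    5: "Disaster",
}


def route(severity: int) -> str:
    """Map a Zabbix severity level to a single destination (placeholder names)."""
    if severity not in SEVERITY_NAMES:
        raise ValueError(f"unknown severity: {severity}")
    if severity == 5:
        return "pager"         # phone, e.g. via PagerDuty or Opsgenie
    if severity >= 3:
        return "slack-alerts"  # separate channel with a notification sound
    if severity == 2:
        return "slack-team"    # channel the team already watches
    return "log"               # assumption: "Not classified" is logged too


for level, name in SEVERITY_NAMES.items():
    print(f"{name}: {route(level)}")
```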

Macros save your future self

Every threshold should be a macro — {$CPU_WARN}, {$DISK_CRIT} — and every macro should be settable per-host. When the executive machine needs a 95% disk threshold because it's always full of logs, you change one macro instead of cloning a template.
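The override works because Zabbix resolves a user macro from the most specific scope first: host, then linked templates, then global. A simplified model of that lookup, with plain dicts standing in for each scope and illustrative values:

```python
def resolve_macro(
    name: str,
    host_macros: dict[str, str],
    template_macros: dict[str, str],
    global_macros: dict[str, str],
) -> str:
    """Return the macro value from the most specific scope that defines it."""
    for scope in (host_macros, template_macros, global_macros):
        if name in scope:
            return scope[name]
    raise KeyError(name)


# Illustrative values: the template sets the default, one host overrides it.
template = {"{$CPU_WARN}": "80", "{$DISK_CRIT}": "90"}
exec_host = {"{$DISK_CRIT}": "95"}

print(resolve_macro("{$DISK_CRIT}", exec_host, template, {}))  # 95
print(resolve_macro("{$DISK_CRIT}", {}, template, {}))         # 90
print(resolve_macro("{$CPU_WARN}", exec_host, template, {}))   # 80
```

Only the one host sees the 95% threshold; every other host linked to the template keeps the default.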