Zabbix scales fine to 200 hosts. Whether it is configured well is a different question: it's the configuration that eats the weekend, not the scaling.
Template hygiene
Start from the official templates. Clone them. Never edit the originals. When a template update ships, you want to merge it into the clone, not discover your custom changes got overwritten.
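One way to make that merge reviewable is to export both templates and diff them before applying anything. The sketch below is a minimal illustration using Python's standard `difflib`. The YAML fragments are trimmed, hypothetical stand-ins rather than real Zabbix exports, and `template_diff` is a name invented for this example.

```python
import difflib

# Hypothetical, heavily trimmed stand-ins for two exported templates.
CLONE_YAML = """\
name: Custom Linux
macros:
  - macro: '{$CPU_WARN}'
    value: '85'
"""

UPSTREAM_YAML = """\
name: Linux by Zabbix agent
macros:
  - macro: '{$CPU_WARN}'
    value: '90'
"""


def template_diff(clone_yaml: str, upstream_yaml: str) -> list[str]:
    """Return unified-diff lines from the clone export to the upstream export."""
    return list(
        difflib.unified_diff(
            clone_yaml.splitlines(),
            upstream_yaml.splitlines(),
            fromfile="clone",
            tofile="upstream",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Print the differences so they can be reviewed before merging by hand.
    for line in template_diff(CLONE_YAML, UPSTREAM_YAML):
        print(line)
```

A two-way diff like this mixes your own changes with upstream's, so it is a review aid, not a merge tool. Keeping the previous upstream export around as well makes it easier to tell the two apart.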
Discovery rules
Autoregistration plus host prototypes beats manual host creation. Hosts register with tags derived from hostname conventions, and templates are linked based on those tags. Add a machine to the fleet, and Zabbix knows what to monitor within 60 seconds.
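To make the hostname-to-tags-to-templates flow concrete, here is a minimal Python sketch of the decision logic. Everything specific in it is an assumption for illustration: the `role-env-number` hostname convention, the tag names, and the template names in `TEMPLATES_BY_ROLE`. In a real deployment this logic would be expressed as conditions and operations in Zabbix itself, not run as a script.

```python
import re

# Hypothetical hostname convention: <role>-<env>-<number>, e.g. "web-prod-01".
HOSTNAME_PATTERN = re.compile(r"^(?P<role>[a-z]+)-(?P<env>[a-z]+)-(?P<index>\d+)$")

# Hypothetical mapping from the role tag to cloned template names.
TEMPLATES_BY_ROLE = {
    "web": ["Custom Linux", "Custom Nginx"],
    "db": ["Custom Linux", "Custom PostgreSQL"],
}


def tags_from_hostname(hostname: str) -> dict[str, str]:
    """Derive tags from a hostname; return an empty dict if it does not match."""
    match = HOSTNAME_PATTERN.match(hostname)
    if match is None:
        return {}
    return {"role": match["role"], "env": match["env"]}


def templates_for_tags(tags: dict[str, str]) -> list[str]:
    """Pick templates from the role tag; unknown or missing roles get none."""
    return TEMPLATES_BY_ROLE.get(tags.get("role", ""), [])


if __name__ == "__main__":
    tags = tags_from_hostname("web-prod-01")
    print(tags, templates_for_tags(tags))
```

A host whose name does not follow the convention ends up with no tags and no templates, which is a useful failure mode: it shows up as an unmonitored host instead of being silently linked to the wrong template.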
The three dashboards
I build exactly three and stop:
- Operations — active problems, recent resolves, host availability pie. This is the tab the team leaves open.
- Capacity — disk/CPU/memory trends, top 10 offenders per resource. Review weekly.
- Executive — uptime SLA, MTTR trend, incident count by severity. Monthly screenshot.
Alert routing
Severity maps to channel. Information goes to a log. Warning goes to a Slack channel the team already watches. Average and above go to a separate channel with a ding. Disaster additionally pages a phone via PagerDuty or Opsgenie. Don't route everything to phone; you'll train the team to silence it.
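The routing table is small enough to write down as a function. This is a sketch only: the numeric values are the ones Zabbix assigns to its trigger severities, but the destination strings are placeholders. It also makes two assumptions of its own: Not classified is treated like Information, and Disaster goes to both the alert channel and the pager.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Zabbix trigger severities with their numeric values."""

    NOT_CLASSIFIED = 0
    INFORMATION = 1
    WARNING = 2
    AVERAGE = 3
    HIGH = 4
    DISASTER = 5


def route(severity: Severity) -> list[str]:
    """Return the destinations for a problem of the given severity.

    Destination names are placeholders, not real media type names.
    """
    if severity <= Severity.INFORMATION:
        return ["log"]
    if severity == Severity.WARNING:
        return ["slack:#ops"]
    destinations = ["slack:#ops-alerts"]  # the separate channel with the ding
    if severity == Severity.DISASTER:
        destinations.append("pager")  # PagerDuty or Opsgenie
    return destinations


if __name__ == "__main__":
    for level in Severity:
        print(level.name, route(level))
```

Only one branch ever reaches the pager, which is the point: the phone stays reserved for the top severity.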
Macros save your future self
Every threshold should be a macro, such as {$CPU_WARN} or {$DISK_CRIT}, and every macro should be overridable per host. When the executive machine needs a 95% disk threshold because it's always full of logs, you change one macro on that host instead of cloning a template.
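The reason a per-host override works is the lookup order: a value set on the host wins over the template's value, which in turn wins over a global one. The sketch below models that order in plain Python as a simplified picture, not as Zabbix's actual implementation. The host name `exec-laptop` and all the values are hypothetical.

```python
# Hypothetical macro values at each level, using the {$NAME} user macro syntax.
GLOBAL_MACROS = {"{$CPU_WARN}": "80"}
TEMPLATE_MACROS = {"{$CPU_WARN}": "85", "{$DISK_CRIT}": "90"}
HOST_MACROS = {
    "exec-laptop": {"{$DISK_CRIT}": "95"},  # the one-macro override
}


def resolve_macro(name: str, host: str) -> str:
    """Resolve a macro for a host: host value first, then template, then global."""
    for scope in (HOST_MACROS.get(host, {}), TEMPLATE_MACROS, GLOBAL_MACROS):
        if name in scope:
            return scope[name]
    raise KeyError(f"macro {name} is not defined at any level")


if __name__ == "__main__":
    print(resolve_macro("{$DISK_CRIT}", "exec-laptop"))
    print(resolve_macro("{$DISK_CRIT}", "web-prod-01"))
```

Every other host keeps the template default, so the exception lives in one place and is visible on the host that needs it.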