Availability covers many aspects, from conceptual considerations and simple hardware designs up to a complex storage cluster solution, which is not offered by napp-it but is available from http://www.high-availability.com/zfs-ha-plugin/. Such a high-availability solution requires, besides money, very skilled IT staff together with a good external support supplier.
In many cases it is better and cheaper to simply allow failures, ensure that they happen quite rarely and make sure that you can fix them yourself within 30-60 minutes, in the best case without external support. Do not try to avoid every disaster; try to handle the consequences, especially when cost matters.
Most important organisational considerations
- KISS - Keep it Simple, Stupid. You must be able to understand what is happening and how to fix problems. Simplicity avoids costs and allows stable configurations without the expert knowledge that would otherwise be needed, whether in-house or from external support.
- Quality: Use high-quality parts, especially for the power supply, ECC RAM, disks and backplanes. Most problems are caused by them. Do not try special "new high-end solutions"; use well-tested standard solutions. If everyone with experience recommends an LSI HBA in IT mode and Intel NICs for ZFS, do not use anything else.
- Redundancy: Provide a hot-spare system, either one without disks or a second system with enough empty disk bays into which you can plug all disks after a failure of your primary storage. Use hot-spare and cold-spare disks. Prefer a redundant power supply, where you can use a UPS for the second line. Select a RAID level that allows a failure of any two disks (3-way mirror or Raid-Z2/3).
- Worst case: Plan for a real disaster (fire or a stolen server). Use a backup server with replication at a physically different location so that you are back online in a short time even after a total loss of your primary system. The drawback is that you do not have the newest data, and you may have problems with files that were open during the last replication.
- Documentation: Document your configuration, e.g. the slot number and WWN of each disk (e.g. print the napp-it disk overview), and think through and test recovery procedures in advance (for a disk failure, a system failure or a disaster).
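
The redundancy rule above (survive the failure of any two disks) can be checked with a few lines of code. The following is only a minimal sketch: the function names are hypothetical and not part of napp-it or ZFS, and it looks at a single vdev only, assuming the usual behaviour that an n-way mirror survives n-1 failed disks and Raid-Z1/Z2/Z3 survive 1/2/3 failed disks.

```python
# Hypothetical helper for illustration; not part of napp-it or any ZFS tool.
RAIDZ_PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


def tolerated_failures(layout: str, disks: int) -> int:
    """Return how many arbitrary disk failures a single vdev survives."""
    if layout == "mirror":
        if disks < 2:
            raise ValueError("a mirror needs at least 2 disks")
        # An n-way mirror keeps data as long as one copy is left.
        return disks - 1
    if layout not in RAIDZ_PARITY:
        raise ValueError(f"unknown layout: {layout}")
    parity = RAIDZ_PARITY[layout]
    if disks <= parity:
        raise ValueError("a Raid-Z vdev needs more disks than parity devices")
    # Raid-Z survives as many failed disks as it has parity devices.
    return parity


def meets_two_disk_rule(layout: str, disks: int) -> bool:
    """True if the vdev survives the failure of any two disks."""
    return tolerated_failures(layout, disks) >= 2


if __name__ == "__main__":
    for layout, disks in [("mirror", 2), ("mirror", 3), ("raidz1", 5), ("raidz2", 6)]:
        print(layout, disks, meets_two_disk_rule(layout, disks))
```

A 2-way mirror and Raid-Z1 fail this check, while a 3-way mirror, Raid-Z2 and Raid-Z3 pass it, which matches the recommendation in the list.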
Most important actions on daily use
- Run online scrubs regularly. A scrub reads all data, verifies it against its checksums and repairs any silent data errors it finds. Scrub weekly with consumer disks and monthly with enterprise disks.
- React to errors rapidly. Enable alert and status emails. Check system logs, fault-service logs, scrub messages about repaired checksum errors, temperature, iostat messages about wait or errors, and SMART data.
- Act predictively. If the parameters of a single disk get worse, remove the disk and do a low-level check with the manufacturer's tool. A particularly nasty problem is a semi-dead disk that responds slowly or blocks the bus. This can result, for example, in an ESXi datastore failure due to a timeout.
- Use auto-snapshots, for example every 15 minutes in the current hour (keep 4), every hour in the current day (keep 24), daily in the current week (keep 7) and on Sunday in the current month (keep 4). Enable a long-term snapshot history on your backup system (e.g. keep monthly snaps for the last year).
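
The snapshot schedule above is in effect a keep-the-newest-N rule per snapshot class. The sketch below illustrates that rule only; it is not how napp-it implements its auto-snap jobs. The class labels, the function name and the snapshot names such as `tank/data@hourly-00` are made up for the example, and the keep counts are taken from the text.

```python
from datetime import datetime, timedelta

# Keep counts from the schedule in the text; the class labels are hypothetical.
RETENTION = {
    "quarter_hourly": 4,   # every 15 minutes in the current hour
    "hourly": 24,          # every hour in the current day
    "daily": 7,            # daily in the current week
    "weekly_sunday": 4,    # on Sunday in the current month
}


def snapshots_to_destroy(snapshots, keep):
    """Return the names of all snapshots beyond the newest `keep`.

    `snapshots` is a list of (name, created) tuples for ONE snapshot class.
    """
    newest_first = sorted(snapshots, key=lambda snap: snap[1], reverse=True)
    return [name for name, _created in newest_first[keep:]]


if __name__ == "__main__":
    # Example data: 30 hourly snapshots of a made-up filesystem.
    start = datetime(2024, 1, 1, 0, 0)
    hourly = [
        (f"tank/data@hourly-{i:02d}", start + timedelta(hours=i))
        for i in range(30)
    ]
    for name in snapshots_to_destroy(hourly, RETENTION["hourly"]):
        print("would destroy", name)
```

With 30 hourly snapshots and a keep count of 24, the six oldest snapshots are selected for removal. The sketch only selects names; actually destroying snapshots is left to the auto-snap job or the administrator.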