Startup Tip #1: Measure, Monitor, Alert
I can’t stress enough the importance of good monitoring. At Invite Media, monitoring and alerting were indispensable. We monitored everything we could: server, business, and application metrics. We knew about problems long before they became serious.
Some sample questions you should be able to answer about your app:
- How many requests are we serving?
- How many failed requests are happening?
- How many exceptions, warnings, and errors occur per minute?
- What is our resource utilization per customer?
- What is our response time?
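As a rough illustration of how an app could answer a few of these questions, here is a minimal sketch that counts requests, failures, and response time, then formats them as lines of Graphite's plaintext protocol (`<path> <value> <timestamp>`, usually sent to Carbon on TCP port 2003). The metric names, the `RequestStats` class, and the `graphite.example.com` host are assumptions made for the example, not what we ran at Invite Media.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="graphite.example.com", port=2003, timeout=2.0):
    """Send pre-formatted lines to a Carbon listener (hypothetical host)."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


class RequestStats:
    """Accumulates request counts and timings between flushes."""

    def __init__(self, prefix="app.web"):
        self.prefix = prefix  # hypothetical metric namespace
        self._reset()

    def _reset(self):
        self.requests = 0
        self.failed = 0
        self.total_ms = 0.0

    def record(self, duration_ms, failed=False):
        """Call once per handled request."""
        self.requests += 1
        self.total_ms += duration_ms
        if failed:
            self.failed += 1

    def flush(self, timestamp=None):
        """Return metric lines for the current window and reset the counters."""
        if timestamp is None:
            timestamp = time.time()
        lines = [
            format_metric(f"{self.prefix}.requests", self.requests, timestamp),
            format_metric(f"{self.prefix}.failed_requests", self.failed, timestamp),
        ]
        # Skip the average when there were no requests, to avoid dividing by zero.
        if self.requests:
            avg = self.total_ms / self.requests
            lines.append(
                format_metric(f"{self.prefix}.response_time_ms_avg", avg, timestamp)
            )
        self._reset()
        return lines
```

A background thread or timer could call `send_metrics(stats.flush())` once a minute; per-customer utilization would follow the same pattern with the customer ID in the metric path.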
Here is a sample Graphite graph from our war console showing test campaigns from a few years back.
I recommend making a webpage that lists all your important graphs. Bonus points for having it constantly refresh on a big monitor for everyone to see (i.e. a “war console”).
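A war console page can be very simple. The sketch below generates a static HTML page that embeds graphs from Graphite's render API and reloads itself through a meta refresh tag. The base URL, metric targets, and function names are hypothetical; only the `/render` endpoint and its `target`, `from`, `width`, and `height` parameters come from Graphite itself.

```python
from html import escape
from urllib.parse import urlencode


def graph_url(base_url, target, window="-24h", width=600, height=300):
    """Build a Graphite render URL for one graph image."""
    query = urlencode(
        {"target": target, "from": window, "width": width, "height": height}
    )
    return f"{base_url}/render?{query}"


def war_console_html(base_url, graphs, refresh_seconds=60):
    """Render a page of graphs; `graphs` is a list of (title, target) pairs."""
    parts = [
        "<!DOCTYPE html>",
        "<html><head>",
        # The browser reloads the page, and so every graph, at this interval.
        f'<meta http-equiv="refresh" content="{int(refresh_seconds)}">',
        "<title>War console</title>",
        "</head><body>",
    ]
    for title, target in graphs:
        url = graph_url(base_url, target)
        parts.append(f"<h2>{escape(title)}</h2>")
        parts.append(f'<img src="{escape(url)}" alt="{escape(title)}">')
    parts.append("</body></html>")
    return "\n".join(parts)
```

Writing the result of `war_console_html(...)` to a file behind any web server is enough to put it on a big monitor.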
- Standard monitoring and alerting software. Great for server-level monitoring: disk, memory, server up/down. It’s a bit tricky to learn.
- A useful tool in combination with a tool like Zenoss to make sure that critical alerts get resolved. It enables “on-duty” rotations as well as escalation rules if alerts are not resolved within a fixed time period.
- Amazon CloudWatch (new) or Graphite
- Allows for easy, up-to-the-minute monitoring at the application and business level.
- Pingdom / Webmetrics / Gomez
- Easy way to monitor URL up/down status and response time. Pingdom is cheaper and best for up/down checks of simple requests. Gomez and Webmetrics are good for response-time monitoring of full pages. Besides monitoring your own site, I recommend also adding partner URLs to Pingdom; this helps in debugging the root cause of latency on pages. Gomez is overpriced but, if your partners use it, you have to use it too. Otherwise, you won’t be able to tell whether a Gomez datacenter, rather than your service, is to blame [yes, partners will blame Gomez issues on you unless you can prove otherwise].
Of course, you can’t keep an eye on everything all the time, so setting up plenty of thresholds and alerts is just as important. Zenoss lets you configure these; for metrics in Graphite, a cron script that checks values against thresholds will do the job.
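One way such a cron check on top of Graphite might look is sketched below. It asks the render API for JSON (`format=json`), which returns a list of series, each with a `datapoints` list of `[value, timestamp]` pairs, and flags any series whose most recent non-null value exceeds a threshold. The host, metric name, threshold, and function names are assumptions for illustration.

```python
import json
import sys
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_series(base_url, target, window="-5min", timeout=10):
    """Fetch recent datapoints for a target from Graphite's render API."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    with urlopen(f"{base_url}/render?{query}", timeout=timeout) as response:
        return json.load(response)


def latest_value(datapoints):
    """Return the most recent non-null value, or None if there is none."""
    for value, _timestamp in reversed(datapoints):
        if value is not None:
            return value
    return None


def breaches(series_list, threshold):
    """Return (target, value) pairs whose latest value exceeds the threshold."""
    found = []
    for series in series_list:
        value = latest_value(series["datapoints"])
        if value is not None and value > threshold:
            found.append((series["target"], value))
    return found


def main(base_url, target, threshold):
    """Print one line per breach to stderr; return a non-zero code on any breach."""
    problems = breaches(fetch_series(base_url, target), threshold)
    for name, value in problems:
        print(f"ALERT {name}={value} exceeds {threshold}", file=sys.stderr)
    return 1 if problems else 0
```

A small wrapper script could call `sys.exit(main("http://graphite.example.com", "app.web.failed_requests", 50))` and be scheduled every few minutes from cron. Cron typically mails any output to the job's owner, which makes a crude but workable alert.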
I recommend breaking alerts into critical (must be fixed immediately) and non-critical (can wait until morning) subgroups. You can then set PagerDuty to send SMS messages and make phone calls for the critical group and only send emails for the non-critical group.
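The split itself is easy to model. The sketch below does not call PagerDuty or any other service; it only shows the routing rule described above, with hypothetical alert names and a hypothetical `Alert` type, so the mapping from severity to notification channel lives in one place.

```python
from dataclasses import dataclass

# Severity groups and the channels each one should trigger (from the rule above).
CHANNELS_BY_SEVERITY = {
    "critical": ("sms", "phone"),  # must be fixed immediately
    "non-critical": ("email",),    # can wait until morning
}


@dataclass(frozen=True)
class Alert:
    name: str
    severity: str
    message: str


def channels_for(alert):
    """Return the notification channels for an alert's severity."""
    try:
        return CHANNELS_BY_SEVERITY[alert.severity]
    except KeyError:
        raise ValueError(f"unknown severity: {alert.severity!r}") from None


def route(alerts):
    """Group alert names by the channel that should deliver them."""
    routed = {}
    for alert in alerts:
        for channel in channels_for(alert):
            routed.setdefault(channel, []).append(alert.name)
    return routed
```

Rejecting unknown severities outright is a design choice: a typo in a severity label fails loudly instead of silently dropping a page.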