Startup Tip #1: Measure, Monitor, Alert

I can’t stress enough the importance of good monitoring. At Invite Media, monitoring and alerting were indispensable. We monitored everything we could: server, business, and application metrics. We knew about problems long before they became serious.
Some sample questions you should be able to answer about your app:
- How many requests are we serving?
- How many failed requests are happening?
- How many exceptions, warnings, and errors occur per minute?
- What is our resource utilization per customer?
- What is our response time?
Here is a sample Graphite graph from our war console showing test campaigns from a few years back.

I recommend making a webpage that lists all your important graphs. Bonus points for having it constantly refresh on a big monitor for everyone to see (i.e. a “war console”).
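As a rough illustration of such a page, here is a minimal sketch that generates a self-refreshing HTML page of Graphite graphs. It assumes Graphite's standard `/render` endpoint; the host name, the metric names, and the function names are hypothetical placeholders, not what we ran at Invite Media.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical Graphite host; replace with your own.
GRAPHITE_URL = "http://graphite.example.com"

# Hypothetical graph titles mapped to Graphite metric targets.
GRAPHS = {
    "Requests per minute": "app.requests.count",
    "Failed requests": "app.requests.failed",
    "Response time (ms)": "app.response_time.mean",
}


def render_url(target, hours=24, width=600, height=300):
    """Build a Graphite render URL that returns a graph image."""
    query = urlencode(
        {"target": target, "from": f"-{hours}h", "width": width, "height": height}
    )
    return f"{GRAPHITE_URL}/render?{query}"


def build_console(graphs, refresh_seconds=60):
    """Return an HTML page that shows every graph and reloads itself."""
    parts = [
        "<html><head>",
        f'<meta http-equiv="refresh" content="{refresh_seconds}">',
        "<title>War console</title></head><body>",
    ]
    for title, target in graphs.items():
        parts.append(f"<h2>{escape(title)}</h2>")
        parts.append(f'<img src="{escape(render_url(target))}" alt="{escape(title)}">')
    parts.append("</body></html>")
    return "\n".join(parts)


if __name__ == "__main__":
    # Print the page; redirect the output to a file served by any web server.
    print(build_console(GRAPHS))
```

The meta refresh tag makes the browser reload the page on a timer, which is enough for a monitor on the wall.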
Recommended Tools:
- Zenoss
  - Standard monitoring and alerting software. Great for server-level monitoring: disk, memory, server up/down. It’s a bit tricky to learn.
- PagerDuty
  - Useful in combination with a tool like Zenoss to make sure that critical alerts get resolved. It enables “on-duty” rotations as well as escalation rules if alerts are not resolved within a fixed time period.
- Amazon CloudWatch (new) or Graphite
  - Allows for easy, up-to-the-minute application- and business-level monitoring.
- Pingdom / Webmetrics / Gomez
  - Easy way to monitor URL up/down status and response time. Pingdom is cheaper and best for up/down status of simple requests. Gomez and Webmetrics are good for response-time monitoring of full pages. Besides monitoring your own site, I recommend also adding your partners’ URLs to Pingdom. This helps in debugging the root cause of latency on pages. Gomez is overpriced, but if your partners use it, you have to use it too. Otherwise, you won’t be able to tell whether a Gomez datacenter is to blame instead of your service (yes, partners will blame Gomez issues on you unless you can prove otherwise).
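To make the application- and business-level monitoring in the Graphite entry above concrete, here is a minimal sketch of reporting a metric from application code. It assumes Graphite's Carbon daemon is listening on its plaintext port (2003 by default), where each line has the form `path value timestamp`. The host and metric names are hypothetical.

```python
import socket
import time

# Hypothetical Carbon host; 2003 is Carbon's default plaintext port.
CARBON_HOST = "graphite.example.com"
CARBON_PORT = 2003


def format_metric(path, value, timestamp=None):
    """Format one metric in Carbon's plaintext format: 'path value timestamp'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, timestamp=None,
                host=CARBON_HOST, port=CARBON_PORT, timeout=2.0):
    """Open a TCP connection to Carbon and send a single metric line."""
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))
```

A call such as `send_metric("app.requests.failed", 3)` from a request handler or a periodic job is then enough to get a new graph. In a busy application you would likely batch metrics or send them asynchronously rather than open a connection per data point.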
You can’t keep an eye on everything all the time, of course, so setting up thresholds and alerts is also important. Zenoss supports these directly. For Graphite, a cron script can do the job.
I recommend splitting alerts into critical (must be fixed immediately) and non-critical (can wait until morning) groups. You can then set PagerDuty to send SMS messages and make phone calls for the critical group and only send emails for the non-critical group.
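A cron-driven threshold check on top of Graphite might look roughly like the sketch below. It assumes Graphite's `/render` endpoint with `format=json`, which returns a list of series, each with a `datapoints` list of `[value, timestamp]` pairs. The host, metric names, thresholds, and function names are hypothetical, and delivery to a paging or email system is deliberately left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical Graphite host; replace with your own.
GRAPHITE_URL = "http://graphite.example.com"

# Hypothetical rules: (metric target, threshold, severity group).
RULES = [
    ("app.requests.failed", 100, "critical"),
    ("app.warnings.count", 500, "non-critical"),
]


def fetch_series(target, minutes=5):
    """Fetch the last few minutes of a metric from Graphite as parsed JSON."""
    query = urlencode({"target": target, "from": f"-{minutes}min", "format": "json"})
    with urlopen(f"{GRAPHITE_URL}/render?{query}", timeout=10) as resp:
        return json.load(resp)


def latest_value(series):
    """Return the most recent non-null value, or None if there is no data."""
    for item in series:
        for value, _timestamp in reversed(item["datapoints"]):
            if value is not None:
                return value
    return None


def evaluate(rules, fetch=fetch_series):
    """Check every rule and group the breaches by severity."""
    alerts = {"critical": [], "non-critical": []}
    for target, threshold, severity in rules:
        value = latest_value(fetch(target))
        if value is not None and value > threshold:
            alerts[severity].append(f"{target} = {value} (threshold {threshold})")
    return alerts


def main():
    """Entry point for cron: print breaches, one per line, tagged by group."""
    for severity, messages in evaluate(RULES).items():
        for message in messages:
            print(f"[{severity}] {message}")
```

Call `main()` at the bottom of the script and schedule it every minute or so. The two groups can then be routed differently, for example critical breaches to the paging service and non-critical ones to email.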

Scott J. Becker is my hero.
I am proud that my name is J. Scott.
Jordan
August 31, 2010 at 6:13 pm
I totally get the monitor, measure, alert message in this article but as start up tip, it confused me a little. My expectations of the article were different, plus I read tip 2 before I read tip 1 so kinda confused myself a bit :/
Inbound Tweet Rating 9
Facebook Like = Yes
Article Rating 8
Overall Rating 5.5 lol
Good stuff Scott. Thanks for sharing.
Neil
September 12, 2010 at 9:51 pm