Scott Becker's Blog

Tech startup experiences and lessons learned

Archive for the ‘Development’ Category

Engineer Tip #1: Software Architecture Links

Ever wonder how giant web systems work?  Wonder no more! (see links below)


Written by scottb

December 16, 2010 at 8:11 pm

Posted in Development

Startup Tip #3: Testing. Just Do it™

Why Test?

Ugghh! Testing… so lame.  Do we really have to do this?!

But you know what sucks more than testing?

  • Losing customers and partners after a massive outage.
  • Having to figure out what brought your system down at peak while clients and partners are yelling at you (aka debugging).
  • Finding out too late that your last release has a huge memory leak, then spending days restarting downed servers.

I have learned the hard way how important testing is.  Let’s just say… that I didn’t test a production push in my early days and… brought down a whole site for a few million people for an hour.  Oops, there goes that partner. 😦

If customers can be affected by the code, it needs to be tested thoroughly.

How to Test

Here are all the tests that are worth your time in the long run (in chronological order within a dev cycle).

  • unit tests:
    • Answers: Does this function work?
    • Besides catching bugs, unit testing is for your fellow developers to understand your code.
    • It’s also a quick memory aid if you have to come back to your own code in 2 months.
    • Aim for at least 70% code coverage.
  • integration tests:
    • Answers: Does the business logic of this feature work?
    • Here, use cases from product specs are converted into automated tests.  These tests should exercise code from as many components as would be realistic on a production system (e.g. UI logic, backend logic, and database logic).
    • Sample bug that would be caught:  “The UI team called a field ‘foo’ but the backend assumed it would be ‘Foo’.”
    • Integration tests should be clear enough for the product manager to read through.
    • To run quickly and with minimal setup, integration tests should be run in a single process.  For example, swap in SQLite through mocking instead of connecting over TCP to a real database.  (see mock objects and dependency injection)
  • smoke tests:
    • Answers: Does the server boot?  Does a basic request go through without throwing an error?
    • This test (often manual) is a quick sanity check before handing a feature off to a dedicated QA team.  Trust me, you will piss off your QA team if you say you are done and the server won’t even start.
  • code reviews:
    • Answers: Was proper testing done? Were coding standards followed? Where should the QA team be most concerned, especially around performance?
    • Code review can become a heated topic (see points #8/#9 from this pdf).  As lame as it sounds, I believe it’s critical to maintain a positive culture in which more bugs found in code review is taken to mean that the developer’s task was more complex, not that the developer did worse work.
    • Suggested Tool:  Review Board
  • ui / backend manual testing:
    • Answers:  Does the feature work?
    • Here you run through use cases by hand on a clone of the production system (aka a staging stack), using the UI and making requests.  Do you get the correct results?
    • These are essentially a repeat of the integration tests, done by hand.
    • UI testing automation:  If you are confident that parts of your UI layout won’t be changing soon, go ahead and build automated tests for them with Selenium.
  • system testing and load testing:
    • Answers:  Does the new feature work after a long period of load?  Do we have a memory leak?
    • Here you set up a staging stack and try to simulate the real world as much as possible using the feature.  The goal is to leave the feature running smoothly for a few hours.
    • Replay production traffic or a set of traffic designed to test specific features.
    • I recommend having a big script that populates your database with an object to be affected by each core feature.  For example, at Invite, our test script would generate campaigns that should only serve to US traffic and campaigns that had specific dollar budgets.  After you run traffic, run through a saved checklist of what the system should look like if old and new features are working properly.  At Invite, this meant that campaigns didn’t go over budget and served to the correct geographies.
    • Watch out for rapid rates of memory growth (aka memory leaks).
    • Don’t forget to review error logs after testing.
    • Suggested Tools:  apache bench, httperf, something custom to log and replay traffic (twisted makes building this easy)
  • fuzz tests (aka negative tests):
    • Answers: Do lots of malformed requests bring down the system? (Hint: they shouldn’t)
    • Verify that malformed requests are logged gracefully and don’t bring down the system.  For low latency systems, you likely don’t want every error to be written to disk, as that loses a lot of CPU time to I/O operations.
  • performance tests:
    • Answers:  Has performance degraded too much in this release? (Will serving costs get too high?)
    • Replay a repeatable set of traffic.  It can be custom or a replay of production traffic.  The important part is that it’s the same set of requests used in each release for comparison purposes.
    • Suggested Tools:  apache bench, httperf
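
To make the single-process integration test idea from the list above concrete, here is a minimal sketch in Python.  It is not code from Invite: the `CampaignStore` class, its table layout, and the budget rule are all made up for illustration.  The point is only that the database connection is injected, so the test can hand in an in-memory SQLite connection instead of a real database server.

```python
import sqlite3
import unittest


class CampaignStore:
    """Hypothetical data-access class; the caller injects the connection."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS campaigns "
            "(name TEXT PRIMARY KEY, budget REAL, spent REAL)"
        )

    def add(self, name, budget):
        self.conn.execute("INSERT INTO campaigns VALUES (?, ?, 0)", (name, budget))

    def record_spend(self, name, amount):
        """Apply the spend only if it keeps the campaign within budget."""
        cur = self.conn.execute(
            "UPDATE campaigns SET spent = spent + ? "
            "WHERE name = ? AND spent + ? <= budget",
            (amount, name, amount),
        )
        return cur.rowcount == 1

    def spent(self, name):
        row = self.conn.execute(
            "SELECT spent FROM campaigns WHERE name = ?", (name,)
        ).fetchone()
        return row[0]


class BudgetIntegrationTest(unittest.TestCase):
    def setUp(self):
        # Production code would pass a connection to the real database here.
        self.store = CampaignStore(sqlite3.connect(":memory:"))
        self.store.add("us-only", budget=10.0)

    def test_campaign_never_goes_over_budget(self):
        self.assertTrue(self.store.record_spend("us-only", 6.0))
        self.assertFalse(self.store.record_spend("us-only", 6.0))
        self.assertEqual(self.store.spent("us-only"), 6.0)
```

Because nothing here touches the network, a test like this starts instantly and can run on every commit.  The trade-off is that SQLite does not behave exactly like a production database, so the staging steps later in the list still matter.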

Continuous Integration (CI):

  • The sooner you catch bugs in the development process, the easier they are to debug and fix.  With CI running on every commit, you know immediately which commit caused a breakage.
  • Suggested Tool: Hudson

Who should be testing what?

Everything from unit tests through code review (in the list above) should be done by the development team.

“ui / backend manual testing” and everything after it can be done by a dedicated QA team.  Product managers should be involved in the UI / backend manual testing step.

What do you automate? Unit, integration, and some UI tests should go into your CI system.  System, fuzz, and performance tests should leverage scripts but are tough to automate and run with CI.
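
As an example of the kind of script a fuzz test can lean on, here is a rough Python sketch.  The `handle_request` function is a stand-in I invented for whatever entry point your server exposes, and the JSON payload with a `campaign` field is an assumption too.  The script throws random bytes at the handler and collects any input that makes it raise instead of answering with an error status.

```python
import json
import random


def handle_request(raw):
    """Hypothetical handler: returns (status, body) and should never raise."""
    try:
        payload = json.loads(raw.decode("utf-8"))
        campaign = payload["campaign"]
        if not isinstance(campaign, str):
            raise ValueError("campaign must be a string")
    except (UnicodeDecodeError, ValueError, KeyError, TypeError) as exc:
        # Malformed input is answered with a 400, not a crash.
        return 400, type(exc).__name__
    return 200, campaign


def malformed_requests(count, seed=0):
    """Yield random byte strings; a fixed seed keeps failures reproducible."""
    rng = random.Random(seed)
    for _ in range(count):
        yield bytes(rng.randrange(256) for _ in range(rng.randrange(64)))


def fuzz(handler, count=1000):
    """Return the inputs that made the handler raise."""
    crashes = []
    for raw in malformed_requests(count):
        try:
            handler(raw)
        except Exception:
            crashes.append(raw)
    return crashes
```

Calling `fuzz(handle_request)` should come back with an empty list.  Any input it does return can be saved and replayed as a regression case.  Purely random bytes are the crudest form of fuzzing; mutating real recorded requests usually finds more.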

Written by scottb

September 17, 2010 at 10:45 pm

Posted in Development

Startup Tip #1: Measure, Monitor, Alert

I can’t stress enough the importance of good monitoring.  At Invite Media, monitoring and alerting were indispensable.  We monitored everything we could: server, business, and application metrics.  We knew about problems long before they became serious.

Some sample questions you should be able to answer about your app:

  • How many requests are we serving?
  • How many failed requests are happening?
  • How many exceptions, warnings, and errors occur per minute?
  • What is our resource utilization per customer?
  • What is our response time?
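
To answer questions like these you first have to count things inside the app.  Here is a minimal sketch in Python of one way to do that, assuming you ship metrics to Graphite using its plaintext format (one "path value timestamp" line per metric).  The `Metrics` class and the metric names are hypothetical, and actually sending the lines to your Graphite server is left out.

```python
import time
from collections import Counter


class Metrics:
    """Hypothetical in-process counters, flushed as Graphite plaintext lines."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.counts = Counter()

    def incr(self, name, amount=1):
        self.counts[name] += amount

    def flush(self, now=None):
        """Return one "path value timestamp" line per counter, then reset."""
        ts = int(time.time() if now is None else now)
        lines = [
            f"{self.prefix}.{name} {value} {ts}"
            for name, value in sorted(self.counts.items())
        ]
        self.counts.clear()
        return lines
```

You would call `incr("requests")` or `incr("failed")` in the request path and flush on a timer, for example once a minute, so that each flush gives you a per-minute rate.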

Here is a sample Graphite graph from our war console showing test campaigns from a few years back.

I recommend making a webpage that lists all your important graphs.  Bonus points for having it constantly refresh on a big monitor for everyone to see (i.e. a “war console”).

Recommended Tools:

  • Zenoss
    • Standard monitoring and alerting software.  Great for server level monitoring: disk, memory, server up/down.  It’s a bit tricky to learn.
  • PagerDuty
    • A useful tool in combination with a tool like Zenoss to make sure that critical alerts get resolved.  It enables “on-duty” rotations as well as escalation rules if alerts are not resolved in a fixed time period.
  • Pingdom / Webmetrics / Gomez
    • Easy way to monitor URL up/down and response time.  Pingdom is cheaper and best for up/down status of simple requests.  Gomez & Webmetrics are good for response time monitoring of full pages.  Besides just monitoring your own site, I recommend also including partner URLs in Pingdom.  This will help in debugging the root cause of latency on pages.  Gomez is overpriced but, if your partners are using it, you have to use it.  Otherwise, you won’t be able to tell whether a Gomez datacenter is to blame instead of your service [yes, partners will blame Gomez issues on you unless you can prove otherwise].

Of course, you can’t keep an eye on everything all the time, so setting up lots of thresholds and alerts is also important.  Zenoss lets you set these up.  For Graphite, a cron script that checks the data against thresholds will do the job.
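
Here is a rough Python sketch of what such a cron script could look like.  It assumes Graphite's render endpoint with `format=json`, which returns a list of series, each with a `target` name and `datapoints` as `[value, timestamp]` pairs.  The base URL, metric names, and threshold values are placeholders, not anything we actually ran.

```python
import json
import urllib.parse
import urllib.request


def latest_value(series):
    """Return the most recent non-null value from one Graphite series."""
    for value, _timestamp in reversed(series["datapoints"]):
        if value is not None:
            return value
    return None


def breaches(render_json, thresholds):
    """List (target, value, limit) for every series above its maximum."""
    found = []
    for series in render_json:
        limit = thresholds.get(series["target"])
        value = latest_value(series)
        if limit is not None and value is not None and value > limit:
            found.append((series["target"], value, limit))
    return found


def fetch(base_url, target, window="-10min"):
    """Query the render endpoint; base_url is a placeholder for your server."""
    query = urllib.parse.urlencode(
        {"target": target, "from": window, "format": "json"}
    )
    with urllib.request.urlopen(f"{base_url}/render?{query}") as resp:
        return json.load(resp)
```

Run from cron every minute or so, the script would call `fetch` for each metric you care about, pass the result to `breaches`, and send an alert for anything that comes back.  Skipping null datapoints matters because the newest bucket in Graphite is often still empty.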

I recommend breaking alerts into critical (must be fixed immediately) and non-critical (can wait until morning) subgroups.  You can then set PagerDuty to send SMS messages and make phone calls for the critical group and only send emails for the non-critical group.
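
If you roll your own alerting script, the split can be as simple as two thresholds per metric.  This is only an illustrative Python sketch: the `warn_at` and `page_at` parameter names and the channel lists are my own invention and not part of any tool's API.

```python
# Which notification channels each severity group uses (illustrative only).
CHANNELS = {
    "critical": ["sms", "phone"],
    "non-critical": ["email"],
}


def classify(value, warn_at, page_at):
    """Map a metric value to a severity group, or None if it is healthy."""
    if value >= page_at:
        return "critical"
    if value >= warn_at:
        return "non-critical"
    return None


def channels_for(value, warn_at, page_at):
    """Return the channels to notify; an empty list means no alert."""
    return CHANNELS.get(classify(value, warn_at, page_at), [])
```

For example, with `warn_at=50` and `page_at=200` errors per minute, a reading of 75 would only send an email, while 250 would wake someone up.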

Written by scottb

August 31, 2010 at 11:42 am
