Ever wonder how giant web systems work? Wonder no more! (see links below)
- YouTube Architecture
- Plenty Of Fish Architecture
- Google Architecture
- Scaling Twitter: Making Twitter 10000 Percent Faster
- Flickr Architecture
- Amazon Architecture
- How I Learned to Stop Worrying and Love Using a Lot of Disk Space to Scale
- Stack Overflow Architecture
- An Unorthodox Approach to Database Design : The Coming of the Shard
- Building Super Scalable Systems: Blade Runner Meets Autonomic Computing in the Ambient Cloud
- Are Cloud Based Memory Architectures the Next Big Thing?
- Latency is Everywhere and it Costs You Sales – How to Crush it
- Useful Scalability Blogs
- The Canonical Cloud Architecture
- Justin.Tv’s Live Video Broadcasting Architecture
Here is a talk I gave to college students in Mansfield, PA. It discusses my startup history and a framework for founders interested in B2B to figure out what product/service to build.
ugghh! Testing… so lame. Do we really have to do this?!
But you know what sucks more than testing:
- Losing customers & partners after a massive outage
- Having to figure out what brought your system down at peak while clients and partners are yelling at you (aka debugging).
- Finding out too late that your last release has a huge memory leak and spending days constantly restarting downed servers.
I have learned the hard way how important testing is. Let's just say… I didn't test a production push in my early days and… brought down a whole site for a few million people for an hour. Oops, there goes that partner. 😦
If customers can be affected by the code, it needs to be tested thoroughly.
How to Test
Here are all the tests that are worth your time in the long run (in chronological order within a dev cycle).
- unit tests:
- Answers: Does each function or class behave correctly in isolation?
- These are small, fast tests written by developers alongside the code, and the first thing your CI system should run on every commit.
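To make this concrete, here is a minimal unit test sketch using Python's built-in unittest framework. The `remaining_budget` helper is a hypothetical example (not real Invite code), but the shape is what matters: one small function, exercised in isolation, including its error cases.

```python
import unittest

def remaining_budget(budget_dollars, spent_dollars):
    """Hypothetical campaign helper: dollars left on a budget, never negative."""
    if budget_dollars < 0 or spent_dollars < 0:
        raise ValueError("amounts must be non-negative")
    return max(budget_dollars - spent_dollars, 0.0)

class RemainingBudgetTest(unittest.TestCase):
    def test_normal_spend(self):
        self.assertEqual(remaining_budget(100.0, 40.0), 60.0)

    def test_overspend_clamps_to_zero(self):
        # Overspend should clamp, not go negative.
        self.assertEqual(remaining_budget(100.0, 140.0), 0.0)

    def test_negative_input_rejected(self):
        with self.assertRaises(ValueError):
            remaining_budget(-1.0, 0.0)
```

Run with `python -m unittest` — each test covers one function in isolation, which is exactly the scope a unit test should keep.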
- integration tests:
- Answers: Does the business logic of this feature work?
- Here use cases from product specs are converted into automated tests. These tests should exercise code from as many components as would be realistic on a production system (e.g. UI logic, backend logic, and database logic).
- Sample bug that would be caught: “The UI team called a field ‘foo’ but the backend assumed it would be ‘Foo'”
- Integration tests should be clear enough for the product manager to read through.
- To run quickly and with minimal setup, integration tests should be run in a single process, e.g. use SQLite through mocking instead of connecting over TCP to a real database (see mock objects and dependency injection).
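As a sketch of the single-process approach: below, an in-memory SQLite connection is injected into a backend component in place of a real database server, so the whole use case runs with no network setup. `CampaignService` and its schema are hypothetical stand-ins, not the author's actual code.

```python
import sqlite3

def make_test_db():
    # In-memory SQLite stands in for the production database server,
    # so the whole test runs in one process with no TCP connections.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE campaigns (id INTEGER PRIMARY KEY, geo TEXT, budget REAL)")
    return db

class CampaignService:
    """Hypothetical backend component; the connection is injected."""
    def __init__(self, db):
        self.db = db

    def create_campaign(self, geo, budget):
        cur = self.db.execute(
            "INSERT INTO campaigns (geo, budget) VALUES (?, ?)", (geo, budget))
        self.db.commit()
        return cur.lastrowid

    def campaigns_for_geo(self, geo):
        rows = self.db.execute("SELECT id FROM campaigns WHERE geo = ?", (geo,))
        return [r[0] for r in rows]

# A product-spec use case turned into a test:
# "US-only campaigns serve only to US traffic."
def test_us_campaign_only_matches_us():
    svc = CampaignService(make_test_db())
    cid = svc.create_campaign("US", 500.0)
    assert svc.campaigns_for_geo("US") == [cid]
    assert svc.campaigns_for_geo("UK") == []
```

Because the database is injected, the same service class runs against the real database in production and against SQLite in tests.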
- smoke tests:
- Answers: Does the server boot? Does a basic request go through without throwing an error?
- This test (often manual) is a quick sanity check before handing a feature off to a dedicated QA team. Trust me, you will piss off your QA team if you say you are done and the server won’t even start.
- code reviews:
- Answers: Was proper testing done? Were coding standards followed? Where should the QA team be most concerned, especially around performance?
- Code review can become a heated topic (see points #8/#9 from this pdf). As lame as it sounds, I believe it's critical to maintain a positive culture in which more bugs found in code review generally means that a developer's task was more complex.
- Suggested Tool: Review Board
- ui / backend manual testing:
- Answers: Does the feature work?
- Here you run through use cases by hand on a clone of the production system (aka a staging stack): using the UI and making requests. Do you get the correct results?
- These are essentially a repeat of integration tests.
- UI testing automation: If you are confident that parts of your UI layout won't be changing soon, go ahead and build automated tests for them with Selenium.
- system testing and load testing:
- Answers: Does the new feature work after a long period of load? Do we have a memory leak?
- Here you set up a staging stack and try to simulate the real world as much as possible using the feature. The goal is to leave the feature running smoothly for a few hours.
- Replay production traffic or a set of traffic designed to test specific features.
- I recommend having a big script that populates your database with an object to be affected by each core feature. For example, at Invite, our test script would generate campaigns that should only serve to US traffic and campaigns that had specific dollar budgets. After you run traffic, run through a saved check list of what the system should look like if old and new features are working properly. At Invite, this meant that campaigns didn’t go over budget and served to the correct geographies.
- Watch out for rapid rates of memory growth (aka memory leaks)
- Don’t forget to review error logs post testing.
- Suggested Tools: apache bench, httperf, something custom to log and replay traffic (Twisted makes building this easy)
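As a sketch of the "log and replay traffic" idea, here is the parsing half: pulling method/path pairs out of access logs so a replay client (httperf, apache bench, or something Twisted-based) can fire them at the staging stack. The log format and regex are assumptions — adapt them to whatever your servers actually write.

```python
import re

# Hypothetical combined-log request portion; adjust to your server's log format.
LOG_LINE = re.compile(r'"(?P<method>GET|POST) (?P<path>\S+) HTTP/[\d.]+"')

def parse_requests(log_lines):
    """Extract (method, path) pairs to replay against a staging stack."""
    requests = []
    for line in log_lines:
        m = LOG_LINE.search(line)
        if m:
            requests.append((m.group("method"), m.group("path")))
    return requests

# Each parsed request would then be re-issued against staging by a
# replay client, ideally at a rate matching production traffic.
```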
- fuzz tests (aka negative tests):
- Answers: Do lots of malformed requests bring down the system? (Hint: they shouldn’t)
- Verify that malformed requests are logged gracefully and don't bring down the system. For low-latency systems, you likely don't want every error written to disk, as that loses a lot of CPU time to I/O operations.
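A minimal fuzz harness can be as simple as throwing random strings at a handler and counting crashes. `handle_request` below is a hypothetical stand-in for your real entry point; the point is that the harness passes only if no malformed input produces a crash-level failure.

```python
import random
import string

def handle_request(raw):
    """Hypothetical request handler: returns a status code, never raises."""
    try:
        key, _, value = raw.partition("=")
        if not key or not value:
            return 400  # malformed: reject cheaply, don't crash
        return 200
    except Exception:
        return 500  # a crash-level failure the fuzzer should flag

def fuzz(handler, iterations=1000, seed=42):
    """Fire random junk at the handler; return how many requests blew up."""
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    failures = 0
    for _ in range(iterations):
        junk = "".join(rng.choice(string.printable)
                       for _ in range(rng.randint(0, 64)))
        if handler(junk) >= 500:
            failures += 1
    return failures
```

The system passes the fuzz run if `fuzz` returns 0 — i.e. every malformed request was rejected gracefully rather than crashing.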
- performance test:
- Answers: Has performance degraded too much in this release? (Will serving costs get too high?)
- Replay a repeatable set of traffic. It can be custom or a replay of production traffic. The important part is that it's the same set of requests used in each release, for comparison purposes.
- Suggested Tools: apache bench, httperf
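The comparison itself can be a few lines of scripting: compute a latency percentile over the replayed request set for each release and flag a regression past some tolerance. The p95 metric and the 10% threshold below are illustrative choices, not a prescription.

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (in ms)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, int(round(p / 100.0 * len(ordered))) - 1))
    return ordered[k]

def regressed(baseline_ms, candidate_ms, tolerance=1.10):
    """True if the new release's p95 latency grew more than 10%
    versus the previous release on the same request set."""
    return percentile(candidate_ms, 95) > percentile(baseline_ms, 95) * tolerance
```

Run against the identical traffic set each release, this gives you an apples-to-apples "did we get slower?" answer before shipping.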
Continuous Integration (CI):
- The sooner you catch bugs in the development process, the easier they are to debug and fix. e.g. you know immediately which commit is the culprit of a breakage.
- Suggested Tool: Hudson
Who should be testing what?
Unit tests through code review (listed above) should be done by the development team.
“ui / backend manual testing” and down can be done by a dedicated QA team. Product managers should be involved in the UI / backend manual testing step.
What do you automate? Unit, integration, and some UI tests should go into your CI system. System, fuzz, and performance should leverage scripts but are tough to automate and run with CI.
While, and more importantly before, you “release early, release often”, talk to your customers, advisors, and partners early and often. This is a great way to prevent yourself from building the wrong thing. Coding is hard, talking with people is easy. [Though, many engineers might argue the opposite. 🙂 ]
Customers: Pitch customers your ideas and get their feedback.
- Will they pay for the new product?
- Which features could they live without?
- Try to have a basic demo/walk-through so that everyone can see what you are thinking. White boarding works too.
- Be sure to talk to a decent number of potential clients; Early on, you don’t want to waste time building features that only one client will use (of course, eventually, you may want to do so for your biggest clients).
Advisors: You need a solid set of experts to bounce ideas off of — people who have been where you want to go. You would be surprised just how friendly and helpful people can be, especially when you give them a piece of equity. [I’d like to take a moment to thank David Brussin, Brian O’Kelley, and Mike Nolet; these guys are fantastic advisors on the product / management side.]
Strategic Partners: Be sure to get your partners on board with what you are doing. You don’t want to develop something only to find that you can’t get access to a critical relationship.
Vendors: If you can pay someone (a reasonable price) to supply a component of your system, don’t build it! Oh boy are you going to save time.
- Try to lease/pay monthly and avoid long term contracts off the bat; You may want to drop the vendor in a few months.
- Be very cautious though if you plan to rely on an early stage startup for something… they could shift their product focus and leave you stranded.
At Invite Media, we lost a year of work because we didn’t talk enough. We iterated in a bubble and essentially developed a product without talking to a significant sample of potential customers and critical partners. We finally did figure it out though and shifted the product to something customers wanted. Josh Kopelman compared our journey to a heat-seeking missile. I must admit that initially, we even made the mistake of developing with a single client in mind. We lost time working on features that only a single client ever used.
I can’t stress enough the importance of good monitoring. At Invite Media, monitoring and alerting were indispensable. We monitored everything we could: server, business, and application metrics. We knew about problems long before they became serious.
Some sample questions you should be able to answer about your app:
- How many requests are we serving?
- How many failed requests are happening?
- How many exceptions, warnings, and errors occur per minute?
- What is our resource utilization per customer?
- What is our response time?
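Most of these numbers can be pushed into Graphite with its plaintext protocol — one "path value timestamp" line per metric over TCP. A minimal sketch is below; the `graphite.internal` host name is a placeholder, and port 2003 is carbon's default plaintext listener.

```python
import socket
import time

def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol:
    'path value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)

def send_metric(path, value, host="graphite.internal", port=2003):
    # Hypothetical host; carbon's plaintext listener defaults to TCP 2003.
    sock = socket.create_connection((host, port), timeout=2)
    try:
        sock.sendall(graphite_line(path, value).encode("ascii"))
    finally:
        sock.close()
```

A cron job or an in-process reporter calling `send_metric("app.requests.count", n)` every minute is enough to start answering the questions above from a graph.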
Here is a sample Graphite graph from our war console showing test campaigns from a few years back.
I recommend making a webpage that lists all your important graphs. Bonus points for having it constantly refresh on a big monitor for everyone to see (i.e. a “war console”).
- Zenoss
- Standard monitoring and alerting software. Great for server-level monitoring: disk, memory, server up/down. It’s a bit tricky to learn.
- PagerDuty
- A useful tool in combination with a tool like Zenoss to make sure that critical alerts get resolved. It enables “on-duty” rotations as well as escalation rules if alerts are not resolved within a fixed time period.
- Amazon CloudWatch (new) or Graphite
- Allows for easy, up to the minute, application and business level monitoring.
- Pingdom / Webmetrics / Gomez
- Easy way to monitor url up/down and response time. Pingdom is cheaper and best for up/down status of simple requests. Gomez & Webmetrics are good for response time monitoring of full pages. Besides just monitoring your own site, I recommend also including partner url’s in pingdom. This will help in debugging the root cause of latency on pages. Gomez is overpriced but, if your partners are using it, you have to use it. Otherwise, you won’t be able to isolate if a gomez datacenter is to blame instead of your service [yes, partners will blame Gomez issues on you unless you can prove otherwise].
You can’t keep an eye on everything all the time, of course; setting up lots of thresholds and alerts is also important. Zenoss supports these out of the box, and a cron script will do the job on top of Graphite.
I recommend breaking alerts into critical (must be fixed immediately) and non-critical (can wait until morning) subgroups. You can then set PagerDuty to make SMS & phone calls for the critical group and only send emails for the non-critical group.
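The cron-driven threshold check that feeds this split can be a small pure function; the metric names and thresholds below are made-up examples, with the real routing (PagerDuty vs. email) left to whatever notification tool you use.

```python
# Hypothetical thresholds; a cron job would evaluate these each minute and
# route breaches to PagerDuty (critical) or email (non-critical).
CRITICAL = {"error_rate_per_min": 50, "requests_per_sec_min": 1}
NON_CRITICAL = {"p95_latency_ms": 250, "disk_used_pct": 80}

def classify_alerts(metrics):
    """Return (critical, non_critical) lists of breached metric names."""
    critical = []
    if metrics.get("error_rate_per_min", 0) > CRITICAL["error_rate_per_min"]:
        critical.append("error_rate_per_min")
    # Traffic dropping to ~zero usually means the site is down: critical,
    # and treated as such if the metric is missing entirely.
    if metrics.get("requests_per_sec", 0) < CRITICAL["requests_per_sec_min"]:
        critical.append("requests_per_sec")

    non_critical = []
    if metrics.get("p95_latency_ms", 0) > NON_CRITICAL["p95_latency_ms"]:
        non_critical.append("p95_latency_ms")
    if metrics.get("disk_used_pct", 0) > NON_CRITICAL["disk_used_pct"]:
        non_critical.append("disk_used_pct")
    return critical, non_critical
```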