The 5 Hardest Parts of Programming

Optimization
Networking
Security
Reliability
Scalability

Introduction

Many things in programming become easier as your level of experience increases. However, some things remain stubbornly difficult.

This document tries to characterize five of these difficult things; optimization, networking, security, reliability and scalability, what makes them hard and what you can do about it.

As intractable as they are common and necessary, the five things above share a few important traits:

It's easy to visualize what it would mean for your product to have each one.
They each mean something different every time.
The problems they encompass are difficult to prepare for.
They are dimensional in nature.
Being satisfied with your implementations almost never happens.
Addressing them effectively is part of the art of programming.

⇪ Table of Contents Networking ⇨ Chapter 1 Optimization

What is it?

Doing more things with less stuff.

Sometimes this is expressed experientially in that user feedback isn't coming back quick enough. Other times it's a real problem, e.g. you are getting input at some rate X, but you can only process it at some rate ¾X, and thus you are in trouble. Oftentimes the "stuff" is runtime related, but other limiting factors may include things like bitrate or power consumption.

Why does it sound easy?

Every programmer with a Bachelors degree should have taken an algorithms course. They should know about Big-O notation. They probably also have in their bag of tricks some rules of thumbs, things like "Floating point arithmetic is slow" or "String comparison is slow" or other mantras. When looking at an optimization problem, it's easy to think "Do static analysis, create some unit tests, I got this".

Why is it hard?

The problem is that these methods are either ineffective or irrelevant. Sure, the O(2^N) algorithm is slower than the O(N) one. If that's your only problem you're in the clear.

However, assume nobody was that careless and optimization is still a goal, the best you can do, measurement driven optimization, is only about as assured as card-counting at a Black Jack table.

Measurement Driven Optimization

Before going further, let's talk about the four types of programmers:

The hardware tinkerers who write assembly and know the very soul of the processor.
The compiled-language developer, writing platforms and frameworks these days.
The application-level script-writing programmer, commonly writing web apps.
The business-person first, programmer-second SEO or WordPress programmer.

The first group, you people are fine. You can probably skip this. But the other 99% of you are sitting on a complex world. If you think these rules of thumb on how to make your program faster are true and you don't need to do extensive careful measurements, you are probably mistaken.

What runs fast on your computer may not run fast on the clients, unless you control the whole stack. For instance, there's vast performance differences in Javascript runtime. Here Microsoft claims that IE9 is an order of magnitude faster than IE8. In what? Traversing the DOM? Event propagation? Parsing JSON? Only the first one if the other two aren't happening at the same time?

Optimization needs a plan, just like anything else. The following questions will help you make this plan:

What am I doing?
What are concrete, real-world goals that your software is addressing?
How am I doing it?
How have you modeled these real-world goals?
Where am I doing it?
Is the system DRY and cohesive?
When am I doing it?
How coupled is your presentation and computation?

Answering these questions may guide you in deriving your performance numbers. But watch out, measurements can be deceiving!

Here's an example in Ruby:

I have a string, "hello". This will test whether the first character is an "h". Here's three methods:

"hello"[0] == ?h Take the first index of the string (outputs as the numeric value of its ASCII character), convert "h" to its numeric value, then compares two integers.
/^h/ =~ "hello" Construct a regular expression and then use a Perlish match against the string.
"hello"[0].chr == 'h' Take the first index, convert it to a char, compare it to an 'h'.

NoteThis is called micro-optimization and is a useless form of optimization. Stick with me for now though.

1000000.times do 
  # "hello"[0] == ?h
  # /^h/ =~ "hello"
  # "hello"[0].chr == 'h'
end

Before looking at the numbers below, rank the methods described above from fastest to slowest. See how your guesses match what is measured below.

An Ubuntu 12.04 machine using the /usr/bin/time command yielded the following numbers averaged over five trials:

/usr/bin/time performance numbers
Method	Time: user/real	Rank
"hello"[0] == ?h	0.622/0.520	1st
/^h/ =~ "hello"	0.727/0.580	2nd
"hello"[0].chr == 'h'	1.026/0.980	3rd

The first and second methods are about the same, with the first method being slightly faster. Remember the rankings.

Do you know what you are measuring?

Now here is the same test as before, but using the ruby-prof Gem. That is to say that the same tests are profiled with a different methodology.

The hypothesis is that we should see the same pattern of results, the fastest should stay the fastest, the slowest should stay the slowest.

require 'rubygems'
require 'ruby-prof'

RubyProf.start

1000000.times do 
  # "hello"[0].chr == 'h'
  # "hello"[0] == ?h
  # /^h/ =~ "hello"
end

result = RubyProf.stop
printer = RubyProf::GraphPrinter.new(result)
printer.print(STDOUT, {})

ruby-perf performance numbers
Method	Total Time	Rank
"hello"[0] == ?h	2.845332828	2nd
/^h/ =~ "hello"	1.511132323	1st
"hello"[0].chr == 'h'	5.099165590	3rd

The differences between the two test's results don't match up even in relative terms.

When using the time command, the ?h comparison edged slightly ahead of the regex. However, with the ruby-prof profiler the regex method ran in half the time.

As it turns out, profilers don't always agree because measuring has overhead, methodology, and implementation details.

Knowing how to "fix" numbers generated by your tools requires an understanding of how those numbers are generated. Otherwise, what are you responding to?

Other Bounds

You can also be disk bound. To illustrate the problem, get on a Linux or Mac OS X machine. (or Fabrice Bellards' Linux in the browser.)

You are going to create a 32 megabyte file of zeros, all in one go. Open up the shell and type the following in:

dd if=/dev/zero of=swiftly bs=33554432 count=1

That was fast. Great. Now we will copy this file, but only one byte at a time:

dd if=swiftly of=patience bs=1 count=33554432

This will take much longer. Probably over a minute (took 62 seconds for me).

This is the problem of I/O (currently, in July 2012). When you switch between reading and writing, there's a huge overhead cost. Additionally, when your throughput in that transit is low, you have wasted a lot. Minimizing this is a major focus of database design.

So yes, CPU bound, Memory bound, IO bound, GPU, lots of stuff to slow you down. Worst of all, acting today may be a huge waste of time tomorrow.

Future-Poof!

The worst thing is that you at best only know what to look for on your machine, right now, doing certain things.

Historically, optimizing this way hasn't always panned out as expected.

Take something called Duff's Device. Don't worry if you don't understand what's going on, the mechanics of it are irrelevant for this point.

All you need to know is that in the 1980s, using the Duff's Device programming pattern supposedly led to a performance boost. Fast forward to 2000 and this email:

Jim Gettys has a wonderful explanation of this effect in the X server. It turns out that with branch predictions and the relative speed of CPUs. memory changing over the past decade, loop unrolling is pretty much pointless. In fact, by eliminating all instances of Duff's Device from the XFree86 4.0 server, the server shrunk in size by half a megabyte (!!!), and was faster to boot, because the elimination of all that excess code meant that the X server wasn't thrashing the cache lines as much.

The bottom line is that our intuitive assumptions of what's fast and what isn't can often be wrong, especially given how much CPU's have changed over the past couple of years.

So even if you achieve the momentous task of being right about optimization today, there is no guarantee that your measures will help you be right, or even beneficial tomorrow.

Everything is moving at once and it's your job to manage it.

Don't Micro-Optimize

Up until this point we've been looking at what is termed micro-optimization. This has been done to make the topic easier to talk about.

Now given that, micro-optimizations are worthless. Do not do them.

Here's a few reasons:

Your fancy refactoring tools may get things wrong and birth new and exciting bugs.
They at best offer localized, temporary, almost unmeasurable boosts in performance.
They can often lead to issues amongst coders.

The last one is important if you value your time.

Arguing about optimization is wasteful. It should be easily demonstrable (to the end-user) or it's nonsense.

And even when it's demonstrable, doing it may still be a waste of your time.

Here's a great example, goto. There's a super-famous paper that many quote but few have read titled, Go to statement considered harmful. What it argues for, in 1968 (the year that the frame-buffer was developed, and a computer with 32KB of memory cost $50,000) is that programmers should prefer sub-functions over direct jumps.

By doing this, programs will have call-stacks enabling programmers to unroll what path the code took and thus be more easily able to determine what made things go wrong. Principally, gotos (or jumps) should more or less be abstracted away from the programmer via higher-order abstractions.

Given this, goto can be a micro-optimization. It says, "I don't care about scope, where I came from, or anything else. Go to this part of the code right now!". You'll often find gotos peppered by respectable programmers.

But now that you know the proper history and intent, you probably shouldn't go off and use goto. Many programmers you will work with will challenge you and scoff at you for using it because these thoughts on goto constitute part of the programming cultures' group cohesion. Thus you'll have to explain this to them. It's not worth it. It just isn't.

Micro-optimizations will at best give you a temporary performance gain, and at worst, squander your time with arguments during code reviews. And furthermore, since you aren't going to be the perpetual maintainer of your code, it will probably just be undone in the future.

And maybe such back-tracking will lead to a real improvement when it's done five years from now. Woops, dig yourself out of that one.

Turning your human text into machine code is the job of the compiler or interpreter you are using. When you do things conventionally, you are doing them in ways that will be the focus of future optimizations. It's like future eyeballs making your code run faster for you. When you try to outsmart the compiler however, all bets are off.

Cache and Woes

Eventually, almost all programmers have the following epiphany:

I'm asked the same question over and over again. Hah! I'll just stash the result over here, then when the question comes again, I can cheat a bit and have the answer ready.

This brilliant idea is called a cache. Sorry, you aren't first to come up with it. Furthermore, cache invalidation is hard. Really hard.

Almost everywhere in the stack, from the CPU on up there's cache. Hard drives have cache, RAM is kinda a form of cache, people are eying SSDs as some magical cache silver bullet.

Well here's just a few baddies about caches:

The thing you are optimizing out may have important side effects you forgot about.
The cache may grow unwieldy and then magically become your bottleneck.
The responses from the cache may not be fresh unless rather sophisticated.
The performance and perhaps very nature of your application is now less than consistent.

That is what happens when you do cache, wrongly. When you do it rightly, everything is awesome. But doing it right is a black art in and of itself. Different things matter every time. There's no deterministic way to go about it.

Macro-Machines

A more fruitful way to think about optimization is at the macro-level.

You are dealing with a large program with many moving parts. Your key to making things faster is to do less:

Most performance comes not from the language, but from application design.from Making twitter 10,000% faster

Avoid running unnecessary code.
Avoid creating unnecessary objects.
Avoid unnecessary abstractions.
Take the shortest path possible.

Here's a good synopsis of what GitHub did.

Also, enjoy a fantastic re-interpretation of this Richard Dawkins' video on YouTube.

What can you do to maintain sanity?

In summary, here's the problems:

Your static analysis intuitions do not appreciate the complexity of the stack under your feet.
Measuring tools will disagree even with simple things.
You don't know the expectations of performance from your userbase.
You are unlikely to get any meaningful gains without trashing large chunks of code, resetting the QA clock.

So here's how to do it. Optimization needs a working idea, if not a relativistic definition.

You should be able to complete the sentence, "It's slow because" with a problem-centric finish. For instance, "It's slow because our convergence numbers are going down." or "It's slow because it can't process input fast enough".

When you phrase the problem in this way, then you have your goal. Instead of the amorphous "optimize" you have to get those convergence number up.

You may do so by making stuff faster, you may do so another way. The point is that you dissected what optimization means for your application to see whether it matters at all.

Fetishize problems, not solutions.

⇦ Optimization ⇪ Table of Contents Security ⇨ Chapter 2 Networking

What is it?

People doing stuff together.

The user experience of your product, but interaction between users having higher fidelity and more fluidity.

Why does it sound easy?

Oftentimes it's stated as asynchronous operations becoming more synchronous. For instance, someone making a comment and then everyone else being notified of it (Facebook's red alert notifications is a good example). Other times it's something like "We are processing a bunch of stuff at various warehouses and our system is really inefficient because blah blah blah."

Why is it hard?

All networking problems are basically a generalized form of the following:

Doing an Action A on one Device X causes an Action B to happen on another Device Y.

This sounds easy. It's really easy for managers to gloss over the very valid picky nerd objections here:

What if Y isn't ready to do what you want it to?
What if X can't contact Y?
How quickly must Y be done as a result of X?
What if someone on Y is doing A at the same time?
What if you hear back from some other Device Z, then Y?
What if X is an older version of your software than Y?

The most important things to understand is that networking is extraordinarily similar to parallel programming, only an order of magnitude more difficult.

You are still dealing with multiple parties vying for an acquisition of a scarce resource and perhaps not having a consistent view of its state, but now you have the problem of those parties just disappearing and reappearing randomly and the consumer having the expectation that none of this matters and everything should just pick up where it left off.

The Great Computing Flash Mob

Here's an example of where things can get hard. Back off algorithms. The idea is that you try to maintain contact with a node and if it goes down you then "back-off" in the frequency upon which you try by some algorithm, since if you continue to incessantly try at a high rate you can bog down the resources of the machine that is trying, the network it is trying on, and things like that.

Well, let's assume that it's not X talking to Y but it's [A ... X] nodes talking to a single node Y. A network partition hits and they all, relatively simultaneously detect it and then back off, simultaneously.

So what happens is that at the computed back-off interval you see a packet from every single node hitting the network at once. And then, when Y comes back, they all, in unison, do some kind of fantastic DDoS attack because all of their back-off algorithms were relatively in sync.

Oops. Ok, not doing your homework, but taking a good guess you do something like throw some Poisson random variable in there. So now you have some nodes knowing before others that Y is back online. What happens? Do they submit this information in some kind of gossipy protocol amongst themselves? How big must such a random interval be to guarantee that you aren't causing a problem? This stuff is hard.

And that's a basic disconnect. Even without a disconnect and even with TCP, you still have to accommodate A coming in before B and B coming in before A, you know, this thing. Good luck with that.

This is because, realistically if you are trying to do Action A on Device X yielding Action B on Device Y what you have is State. That's what state is. You can say that Action A₁ doesn't depend on Action A, which is great, if you can do that. But X(A) -> Y(B) is a state-ful transform. And state-fullness is the cause of lots of programming errors that are really hard to recover from.

What can you do to maintain sanity?

Avoid state as much as possible. This is as opinionated as it sounds. Take HTTP for instance. It's claim to fame is how state-less it is. In fact, things like cookies are needed to give it a semblance of state. And HTTP works terribly well. This is where you need to go in your applications.

Additionally, you need to understand and if possible define (definitions can change) the following:

Acceptable latency in generality and for some things specific in your application, if applicable.
Expected availability of your nodes and when you expect data to be consistent.
Who is responsible at what time for what state and piece of data.

These should be defined and then treated as dictatorial laws. If someone asks for anything slightly beyond it, such as "The system is down, I'm getting 92% uptime" and you defined it as 90%, it is in your right to say it's out of scope and therefore, you don't care. We'll cover more about when it's ok to be a jerk in the section on reliability.

The important thing here is that if you refuse to put your foot down, you'll create a culture of accommodation and paralyze forward-moving development. Such limits need not always be defined, but in certain areas, there's no other reasonable way to deal with it. Five hours of your time is five hours, regardless of whether it was helping someone or progressing the product.

Make it dumb before making it smart.

⇦ Networking ⇪ Table of Contents Reliability ⇨ Chapter 3 Security

What is it?

Preventing people from doing stuff you don't want them to.

Generally, you, or at least your manager, probably want your product to be used only in ways it was intended.

Whether this desire is manifested by "a malicious user not gaining access" or "something having complex permission systems", it's all the same general request.

Why does it sound easy?

The idea is that you built something to do X. Of course it will do only X. Why would it do anything else?

Maybe you discover a few things, Y, and Z, that are also possible. Then you make sure they aren't. Total cakewalk!

Why is it hard?

You are wrong. You are so dead wrong. You didn't see operations A - W you fool, and now that crafty spy is downloading all your datas. The core of the problem has been a popular observation lately (July 2012): You have written software on a general purpose reprogrammable computer. One that does only what it's told to, totally agnostic to the social norms of the world around it.

I once came in contact with a security system at a research lab. All of the doors had a 10-digit access pad that would unlock the door and let people in. Therefore, ostensibly, there are 2 ways of getting in:

You know the code
You have a key

The problem is that since we have a general-purpose door-opening computer just remove the faceplate and place a staple on two specific contact points on the PCB.

The unlocking computer doesn't care how it was instructed it to unlock the door. The necessary wire got the necessary charge and the door opened.

This is the general synopsis of security. It's a race of creativity. And even if you think you have found all the creative ways that a person can use your general purpose device, you are probably incorrect. Remember, your authorization mechanism wants to authorize people.

Hacking The Sneaker Net

The other problem is that people may have just voluntarily opened the door for me.

Your vulnerability count is at least equal to your access count.

What more, mechanized onerous security measures are more susceptible to social manipulation. Only humans will ever care who is gaining access, and they are subject to empathy and other forms of manipulation: "Let me help the new guy, woops."

In this interview, Kevin Mitnick talks about how he utilized social engineering:

MITNICK: ... we installed a new upgrade to the password changing program and I need you to test it. And then I'd have the target change her password to something I knew. Like, oh, yeah. Just for the test, let's change your password to test 1, 2, 3, 4.

And by the way, do not give me your old password. And I'd lecture the person on security guidelines that you should never, ever give out your password. I'd have them change it to something that I suggested, and then I'd have them test their applications. And when they were testing everything, I was already into that person's account.

What can you do to maintain sanity?

There's something called a threat model. Depending on the academic literature de jour, this means something different. But basically the idea is to understand what the implications are if someone manages to do something unintended with your product or service.

Threat models (sometimes "risk models") are incomplete, but better-than-not-doing-it profiles of your technology. Ask yourself the following questions:

Is the stuff I work with (or the stuff that the user stores on the stuff I work with) of any real value outside the world of my product?
Even within the world of my product and brand, how important is it? How important is one part versus another?
Am I willing to make any guarantees on what someone cannot do with the product? Under what conditions am I willing to make this guarantee?
What type of dedication, sophistication, resources, and intent am I prepared for?

For instance, I recently worked on a product where a users content wasn't as private as their authentication token. So we made the decision that the access token has much higher security then the content.

Windows 95 and 98 were like this. The login prompt that you could enable didn't lock down the system at all.

The FAT32 file system didn't have ACLs, processes didn't really care about owners, etc. Windows 95/98/ME were not Network Operating Systems. It was the traditional microcomputer OS pretending to be a NOS (this all changed with Windows XP).

A proper security stack isn't free. It's hard.

Do What's Needed, Then Stop.

At a certain bank you need to swipe your card to open up a door and access the ATMs. The card reader however, is hooked up to nothing but the unlocking mechanism of the door. You can scan any card, such as a public transit card you bought with cash that has no identifiable information, and the door opens.

This is all that is needed.

The customers feel secure, other bank card holders can still access the ATMs, and muggers would be caught on security camera anyway. Besides, the existence of the mechanism implicates what it does. Whether it actually achieves it or not is often immaterial to the affected parties.

Also, now the real infrastructure is in place to restrict access and track people. It's just that threat hasn't escalated enough to justify maturing the technology to its full potential.

They did what was needed, and then stopped.

Despite the effectiveness ... it's probably not worth the costfrom why bank robbery is a bad idea

You do this all the time. It's called actuarial analysis. Would you put an expensive theft tracker in a $600 car? What about a safe in the trunk to store your loose change for the parking meter?

No Narratives

Narrative-based security is when people respond to a previous threat. Someone had explosives in their shoes so now everyone takes off their shoes. So next time, when someone has explosives in their toupee, the ATF will be combing through everyone's hair.

The truth is whether it's someone sneaking in the food service door to the exclusive event or bringing the entire nation to a stand still with a pair of box cutters, the attack vector of basic ingenuity is just so vast that it is almost impossible to address it without affecting the very soul of your product and experience.

Narrative-driven security can satisfy the requirements only if the requirements are to manage the fears of your customers (see the bank example). Otherwise, it's just ceremonies done in the name of security, as if a magical ritual will appease the Goddess Securitia to breathe safety into your product.

Given the difference between perceptive security, real security, the challenges of general purpose computing, the difficulties of encryption, physical security, all of that, the core problem is still that Security is a race of creativity.

Understand what's possible and embrace it.

⇦ Security ⇪ Table of Contents Scalability ⇨ Chapter 4 Reliability

What is it?

Stuff not breaking.

No really, that's it. Having stuff that works well.

Why does it sound easy?

There's finite code, finite state, so finite bugs. So long as you don't really change stuff you'll eventually find them, address them, and then have a stable, mature platform.

And besides, how hard could it be? You've come prepared for everything, right?

Why is it hard?

See "Why does it sound easy?". The asymptotic ride to stability can take months or years. Oftentimes far exceeding the patience and motivation of the creator. You're ready to move on, but it keeps coming back, like some bad horror movie that just ends up being irritating.

There's really a few kinds of reliability:

Everything works under your envisioned use cases
Everything works in pure spite of your limited vision of use cases
Everything works when everything else around it is engulfed in flames and combusting loudly

Most programmers only have to ever satisfy the first requirement.

You don't know how lucky you are. Really. This discussion will mostly focus on you, the third group includes things like the black box on an airplane and other marvellous things that are supposed to survive spontaneous explosions and your product slamming into mountain sides. If you are in that industry, you are probably quite capable of skipping this.

So why is the first one hard?

Let's characterize every program you wrote in school, ever:

Your program must complete task X. It will be running for at most a few minutes, and will at most only need to do a task a few times

Now here is every program you will write professionally, ever:

Your program must complete task X, every day, for as long as we stay in business. If it fails at task X, it will cost us substantial amounts of reputation and money. Please, for the love of God, don't ever fail at doing X.

When people talk about the great developer being worth 20 times more than the good one, the very soul that defines the valley between good and great are those who will do the second and those who can only do the first.

It's the difference between the person you can trust to get the job done and the person you can trust to work under the person who you are trusting to get the job done.

What does Reliability mean?

Your program will not crash, ever.
Your program will be just as responsive after a year as it is after a day.
Your program will respond dynamically, almost sentiently to the changing world around it.
Your users will forget that your program exists.

The last one takes some explaining. When you were a kid and your parents had a land-line (this should cover everyone), you picked up a phone and got a dial-tone. Did you ever pick it up and be like "Hrmm, AT&T must have crashed, I can't make a phone call". Thought not. If you had a phone that didn't require power you may have been lucky enough to discover how the AT&T phone network easily beat the power grid in stability during storms.

Here's another one: radio stations. After natural disasters, radios are often one of the first things that people turn to. Well, how do the radio station survive when everything else does not?

And that's our segue...

What can you do to maintain sanity?

There's a big fat important thing every engineer should know about reliability: It's counter-intuitive and flies in the face of all Proper Engineering.

There's lots of manifestos out there on Software Engineering but one of the few things they agree on is that simple, elegant, robust solutions are awesome. So awesome that they should be your top priority.

Working towards reliability however, is different in that you:

Write more code that does less.
Diversify and minimize your points of failure.
Decouple and focus on cohesion.
Make sure that all modes of failure are either obnoxiously loud or handled silently.
Re-implement a solution multiple times and have those implementations check each other.

The reason that AT&T and the radio station could stay impeccably solid is because they didn't do much, relied on as few externalities as possible, knew when they could expect something to fail, and focused on making sure that the one thing that they did, they could continue to do.

In the UNIX philosophy and In the Valley, there is a popular saying: "Do one thing and do it well". Same here. The goal is to make sure that a fault in your application only affects the thing that faulted it.

When you scope and set the parameters of your application, it's much easier to understand what you intend to, versus do not intend to, support. And when you are forced to actually state what your program does, you will more then likely back-off from what you want it to do and only state what you actually think it can do.

And that's a step in the right direction.

Re-implementing your solution isn't necessarily important, unless your data is. You should never trust just one way of solving any problem. There's various names for this in other industries, for instance Dual-sourcing and Second source.

Other examples include the Boeing 787, which has *swappable* engines, NASA's stuff, and Tandem's Non-Stop architecture, whose machines famously stayed on during the 1989 SF earthquake that toppled the upper deck of the Bay Bridge.

The biggest news is that you to can enjoy the benefits of such forms of engineering, for the most part, when you have redundant programs that verify each other; however much work that sounds.

Don't trust. Always verify.

⇦ Reliability ⇪ Table of Contents Conclusion ⇨ Chapter 5 Scalability

What is it?

Doing lots of very similar stuff.

Often times this will be accompanied with conditionals such as "over many computers" or "over diverse localities". The core component, however, is that your system, which is designed to handle X of something, should be able to handle 10 * X of something.

Why does it sound easy?

Non-programmers see this as a resource problem. Managers often think, "well, can't we just buy more computers and then run this on all of them?"

It's a very sensible conclusion. If you are moving things in boxes, you can move things faster if you have more people moving more boxes.

Why is it hard?

Of course the world isn't that simple. Even in the "move more boxes" analogy, you've got a problem; the workers probably can't synergistically, in isolation, decide who moves what where without any communication or contact.

And that's the problem ... well, one of them.

You see, scalability doesn't really introduce new problems we haven't seen up to this point as much as it's a conspiracy of all of them.

Remember the section in optimization about cache control? Same thing here, but instead of all of those things being a result of you trying to think you can cheat out performance gains with a cache, it's a direct consequence of running multiple things on different machines.

Scaling = replacing all components of a car while driving it at 100mph. From Scaling instagram

Remember that section in networking about how things can be out of order, inconsistent, unreliable, and just plain break? Yes, you'll have to deal with that now too, as an inherent property of the system.

Remember that section in security on assessing risk? Well once you scale from data center to data center or off to third party services, you've opened up your application to the wild untamed Internet. How Fun.

Remember reliability? Now your application will have 10 or 20 times more points of failure. And if you don't watch it, it means you will fail 10 or 20 times more often.

What can you do to maintain sanity?

15 years ago many programmers would have never have to consider the problems of scalability. Now it is not only becoming increasingly relevant, but increasingly assumed with the rise of The Cloud.

Luckily computers are fast. Moore's Law is a wonderful thing and reports of its death have been greatly exaggerrated (It may be coming though).

NoSQL

Just kidding. This isn't that kind of discussion.

But really, disks will be the first issue if you have any. They store absolutely astounding amounts of data these days and are still horrible when it comes to trying to service a bunch of things at once.

Actually your architecture of creating lots of small files and doing things inefficiently is the first problem. More computers won't help you as much as more competence. Don't have time for competence? Disks it is then!

Have a nice rack of them, or better, use some third party provider. But watch out! Done poorly, CDNs can work against you. Don't do managed hosting if you like having money. CDNs and things like AWS are nice because:

Replacing disks and doing back-up sucks (time that is).
They aren't pricey.

Now that's not hard is it?

As far as distributed systems go, there's been a lot of movement in the past few years. This book tries to cover it. Here's a decent talk by a memcached author - go to about 9min 50s in for a special treat.

There's lots written about scalability these days. Here's a good blog dedicated to it.

This is one where homework is really what you need to do. It's a shopping trip; break down your product into things that you do and then find the tools, and perhaps, a one-stop-shop that does it well.

But don't do this before your product is actually you know, done. It's always easier to shop when you actually know what you need.

Get ten thousand customers before worrying about ten million.

⇦ Scalability ⇪ Table of Contents Chapter 6 Conclusion

Hopefully this adventure has given a little more insight or at least confirmation to some of the art of engineering a product that you can actually put out in the world without people totally hating it.

An important thing to note is that when it comes to any of the things above, depending on your application you may be able to get it for free or avoid it entirely.

If you have this opportunity, and you are developing software commercially, and it's not aligned with your goal, then for the love of God, don't venture into the Lions' Den, unless you want total abject masochism.

We can hijack the famous Donald Knuth quote and repurpose it as "Never do or claim X prematurely" with X being one of the things above. In fact, let's come up with a new one:

Don't optimize, network, secure, scale, or stabilize. Offload it, outsource it, avoid it at all costs. If you have to do it, anticipate it from Day One, but don't ever take action on it until everything else is figured out and you actually need it.

Each one of these are reactive patterns to consumer driven habits. You never know how much of any one you'll need until late in the game.

It's also worth noting that there is a significant problem not covered here: Usability. Since it didn't fit the formulaic model of the rest of the issues, it had to be omitted.

However, effectively communicating your product to your users is the hardest of hard and deserving of its own article of this size.

Where to go from here

Certain things are a perpetual foot-dragging problem, with the computer dragging the feet. The hardest thing is to minimize the amount of time in your day you are doing this.

Have a beautiful day!

About

The author is Chris McKenzie; a programmer dedicated to truth, no matter how crazy it gets. Go to the front page.

Thanks to Alan Chen, Allen Gittelson, Chris Takemura, Dima Kogan, fmota, Jared Updike, John Underkoffler, Ken Waters, Mark Eschbach, Michael Schuresko, and ninereeds314 for help in editing and putting this article together. Also, thanks to Ram Ramanath who was the colleague that I did the Ruby performance exploration with.

Thanks to Nick Treadway, Keith Moss, Joseph Carney, G. Gabriel Ruales, Luca Molteni, Jorge Medina, Burak Arikan, Jake A. Smith, Terence Tuhinanshu, Kristoffer Nordstrom, Ilya Teterin, Mario Fusco, Levi Gross, Muhammad Aboulmagd, Mickey Kawick, Wojciech Pietrzak, Sylvain Soliman, Clinton Paquin, Torsten Schroder, Janos Bodn, Arash Bannazadeh-Mahani, Francois Buhagiar, Amira Sya'diah, Jerome Faria, Stephen Foster, Jackielene Camomot, Chris Eargle, dzone, dotnetpro, progclub, and Code Project for the publicity.

Comments

I welcome comments and critiques of this article on Hacker News and Reddit.

See also The Acid3 of JS has a few surprises.
In 1968, David Evans and Ivan Sutherland, both professors of computer science, founded a company to develop a special graphics computer known as a frame buffer.
There's other uses for this "jump without context" idea, such as in handling signals or exception handling in C.
See Things You Should Never Do, Part I for a good synopsis of what a QA clock reset does.
"The most resource-intensive operation performed in a chat system is not sending messages. It is rather keeping each online user aware of the online-idle-offline states of their friends, so that conversations can begin." From Facebook Chat.
See The Coming War on General Purpose Computation and the buzz it created.
OWASP, Microsoft, SANS, STRIDE, empirical, Boehm.
Disk is the new tape. Also, here.
"Managed hosting can't scale with you. You can't control hardware or make favorable networking agreements." - Scaling Youtube
"Firewalls" will cost you $100 a month. "Backup" can be another $100. This site for instance, has a princing plan that would buy an equivalent server in half a single billing cycle. Here's a bigger company with the same pattern.
More disks isn't a bad long-term policy. See the optimization section for more.
There's Drizzle, Dynamo, Redis, Riak, Volt and Voldemort. Cassandra, Couch, Memcache, Membase, Mnesia, Mongo, Monet, Maria, MySQL, Hadoop and Hypertable, PostgreSQL, Carrot2, Lucene and Solr. And I guess also Oracle. But all you may really need is something like Luster or Gluster.
Shameless plug: A good friend of mine runs a VPS company.