
Software Development Blogs: Programming, Software Testing, Agile Project Management

Methods & Tools


High Scalability - Building bigger, faster, more reliable websites

Sponsored Post: Airseed, Uber, ScaleOut Software, Couchbase, Tokutek, Logentries, Booking, Apple, MongoDB, BlueStripe, AiScaler, Aerospike, LogicMonitor, AppDynamics, ManageEngine, Site24x7

Tue, 03/18/2014 - 16:56

 

Who's Hiring?
  • Apple is hiring for multiple positions. Imagine what you could do here. At Apple, great ideas have a way of becoming great products, services, and customer experiences very quickly.  
    • Senior Engineer. We are looking for a team player with focus on designing and developing WWDR’s web-based applications. The successful candidate must have the ability to take minimal business requirements and work proactively with cross-functional teams to obtain clear objectives that drive projects forward to completion. Please apply here.
    • Software Engineer. We are looking for a team player with focus on designing and developing WWDR’s web-based applications. The successful candidate must have the ability to take minimal business requirements and work proactively with cross-functional teams to obtain clear objectives that drive projects forward to completion. Please apply here.
    • Quality Assurance Engineer. The iOS Systems team is looking for a Quality Assurance engineer. In this role you will be expected to work hand-in-hand with the software engineering team to find and diagnose software defects. Please apply here.

  • Airseed -- a Google Ventures backed, developer platform that powers single sign-on authentication, rich consumer data, and analytics -- is hiring lead backend and fullstack engineers (employees #4, 5, 6). More info here: https://www.airseed.com/jobs

  • Join the team that scales Uber supply globally! Our supply engineering team is responsible for prototyping, building, and maintaining the partner-facing platform. We're looking for experienced back-end developers who care about developing highly scalable services. Apply at https://www.uber.com/jobs/4810.

  • We need awesome people @ Booking.com - We want YOU! Come design next generation interfaces, solve critical scalability problems, and hack on one of the largest Perl codebases. Apply: http://www.booking.com/jobs.en-us.html

  • UI Engineer. AppDynamics, founded in 2008 and led by proven innovators, is looking for a passionate UI Engineer to design, architect, and develop their user interface using the latest web and mobile technologies. Make the impossible possible and the hard easy. Apply here.

  • Software Engineer - Infrastructure & Big Data. AppDynamics, a leader in next-generation solutions for managing modern, distributed, and extremely complex applications residing in both the cloud and the data center, is looking for Software Engineers (all levels) to design and develop scalable software, written in Java and MySQL, for the backend component of software that manages application architectures. Apply here.
Fun and Informative Events
  • The Biggest MongoDB Event Ever Is On. Will You Be There? Join us in New York City June 23-25 for MongoDB World! The conference lineup includes Amazon CTO Werner Vogels and Cloudera Co-Founder Mike Olson for keynote addresses.  You’ll walk away with everything you need to know to build and manage modern applications. Register before April 4 to take advantage of super early bird pricing.

  • How to Scale MySQL for Big Data Applications. A Guide to Evaluating TokuDB on March 20th at 1pm ET. You can do more than you think with the MySQL you already have. Learn how to use MySQL or MariaDB in Big Data applications by simply upgrading the storage engine with TokuDB. Register now.

  • April 3 Webinar: The BlueKai Playbook for Scaling to 10 Trillion Transactions a Month. As the industry’s largest online data exchange, BlueKai knows a thing or two about pushing the limits of scale. Find out how they are processing up to 10 trillion transactions per month from Vice President of Data Delivery, Ted Wallace. Register today.
Cool Products and Services
  • As one of the fastest-growing VoIP services in the world, Viber has replaced MongoDB with Couchbase Server, supporting 100,000+ operations per second in the short term and 1,000,000+ operations per second in the long term for their third generation architecture.  See the full story on the Viber switch.

  • Do Continuous MapReduce on Live Data? ScaleOut Software's hServer was built to let you hold your daily business data in-memory, update it as it changes, and concurrently run continuous MapReduce tasks on it to analyze it in real-time. We call this "stateful" analysis. To learn more check out hServer.

  • Log management made easy with Logentries. Billions of log events are analyzed every day to unlock insights from the log data that matters to you. Simply powerful search, tagging, alerts, live tail and more for all of your log data. Automated AWS log collection and analytics, including CloudWatch events.

  • LogicMonitor is the cloud-based IT performance monitoring solution that enables companies to easily and cost-effectively monitor their entire IT infrastructure stack – storage, servers, networks, applications, virtualization, and websites – from the cloud. No firewall changes needed - start monitoring in only 15 minutes utilizing customized dashboards, trending graphs & alerting.

  • BlueStripe FactFinder Express is the ultimate tool for server monitoring and solving performance problems. Monitor URL response times and see if the problem is the application, a back-end call, a disk, or OS resources.

  • aiScaler, aiProtect, aiMobile Application Delivery Controller with integrated Dynamic Site Acceleration, Denial of Service Protection and Mobile Content Management. Cloud deployable. Free instant trial, no sign-up required.  http://aiscaler.com/

  • ManageEngine Applications Manager : Monitor physical, virtual and Cloud Applications.

  • www.site24x7.com : Monitor End User Experience from a global monitoring network.

If any of these items interest you there's a full description of each sponsor below. Please click to read more...

Categories: Architecture

Intuitively Showing How To Scale a Web Application Using a Coffee Shop as an Example

Mon, 03/17/2014 - 17:13

This is a guest repost by Sriram Devadas, Engineer at Vistaprint, Web platform group. A fun and well-written analogy of how to scale web applications using a familiar coffee shop as an example. No coffee was harmed during the making of this post.

I own a small coffee shop.

My expense is proportional to resources
100 square feet of built-up area with utilities, 1 barista, 1-cup coffee maker.

My shop's capacity
Serves 1 customer at a time, takes 3 minutes to brew a cup of coffee, a total of 5 minutes to serve a customer.

Since my barista works without breaks and the German-made coffee maker never breaks down, my shop's maximum throughput = 12 customers per hour.
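The 12 customers per hour follows from simple arithmetic: with one customer served at a time, throughput is minutes per hour divided by the time to serve one customer. A minimal Python sketch of that calculation; the function name `max_throughput_per_hour` is our own illustration and assumes no breaks, no breakdowns and no idle time between customers.

```python
def max_throughput_per_hour(service_minutes: float, concurrent: int = 1) -> float:
    """Customers served per hour when `concurrent` customers are served in
    parallel, each taking `service_minutes`, with no breaks or breakdowns."""
    return concurrent * 60 / service_minutes


# Original shop: 1 customer at a time, 5 minutes per customer.
print(max_throughput_per_hour(5))  # 12.0
```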

 

Web server

Customers walk away during peak hours. We only serve one customer at a time. There is no space to wait.

I upgrade the shop. My new setup is better!

Expenses
Same area and utilities, 3 baristas, 2-cup coffee maker, 2 chairs

Capacity
3 minutes to brew 2 cups of coffee, ~7 minutes to serve 3 concurrent customers, 2 additional customers can wait in queue on chairs.

Concurrent customers = 3, Customer capacity = 5
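Serving 3 customers in roughly 7 minutes works out to about 180 / 7 ≈ 25.7 customers per hour, but the bigger change is that the shop can now hold a burst: 3 being served plus 2 on chairs. Below is a minimal Python sketch of that bounded, first-come-first-served shop. `simulate` is a hypothetical helper, and treating each of the 3 concurrent customers as taking about 7 minutes is our simplifying assumption about the shared 2-cup coffee maker.

```python
import heapq


def simulate(arrivals, service_minutes, servers, queue_slots):
    """Return (served, turned_away) for customers arriving at the given minutes.

    Up to `servers` customers are served in parallel, each taking
    `service_minutes`; up to `queue_slots` more wait in FIFO order;
    anyone arriving to a full shop walks away.
    """
    free_at = [0.0] * servers  # min-heap: when each server next becomes free
    finishes = []              # finish times of every admitted customer
    served = turned_away = 0
    for t in sorted(arrivals):
        # Customers still in the shop (being served or waiting) at time t.
        in_shop = sum(1 for f in finishes if f > t)
        if in_shop >= servers + queue_slots:
            turned_away += 1
            continue
        # Take the server that frees up earliest; this preserves FIFO order.
        start = max(t, heapq.heappop(free_at))
        finish = start + service_minutes
        heapq.heappush(free_at, finish)
        finishes.append(finish)
        served += 1
    return served, turned_away
```

For a burst of 8 customers at minute 0, the original shop (`simulate([0] * 8, 5, 1, 0)`) serves 1 and turns 7 away, while the upgraded shop (`simulate([0] * 8, 7, 3, 2)`) serves 5 and turns 3 away. In web server terms, `servers` loosely plays the role of worker threads and `queue_slots` of a bounded request queue; that mapping is our reading of the analogy.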

 

Scaling vertically

Business is booming. Time to upgrade again. Bigger is better!...

Categories: Architecture

Stuff The Internet Says On Scalability For March 14th, 2014

Fri, 03/14/2014 - 16:56

Hey, it's HighScalability time:


LifeExplorer Cells in 3D
  • Quotable Quotes:
    • The Master Switch: History shows a typical progression of information technologies: from somebody’s hobby to somebody’s industry; from jury-rigged contraption to slick production marvel; from a freely accessible channel to one strictly controlled by a single corporation or cartel—from open to closed system.
    • @adrianco: #qconlondon @russmiles on PaaS "As old as I am, a leaky abstraction would be awful..."
    • @Obdurodon: "Scaling is hard.  Let's make excuses."
    • @TomRoyce: @jeffjarvis the rot is deep... The New Jersey pols just used Tesla to shake down the car dealers.
    • @CompSciFact: "The cheapest, fastest and most reliable components of a computer system are those that aren't there." -- Gordon Bell
    • @glyph: “Eventually consistent” is just another way to say “not consistent right now”.
    • @nutshell: LinkedIn is shutting down access to their APIs for CRMs (unless you’re Salesforce or Microsoft). Support open APIs!
    • Tim Berners-Lee: I never expected all these cats.
    • @muratdemirbas: "Simple clear purpose&principles give rise to complex&intelligent behavior. Complex rules&regulations give rise to simple&stupid behavior."
    • @BonzoESC: “Duplication is far cheaper than the wrong abstraction.” @sandimetz @rbonales 
    • @BenedictEvans: Umeng: there are 700m active smartphones and tablets in China.

  • Scale matters object lesson number infinity: HBO Go Crashes During True Detective Finale. Perhaps make HBO Go available without a cable package and maybe you'll have money to scale the service? Think peak. But wait, Dan Rayburn says bandwidth was not the problem, it's other parts of the system, which is why Internet TV will never be as reliable as broadcast TV. Still, I'd like to cut the cord.

  • Turns out ecommerce over messaging works well...quite well. Retailers Are Striking Gold with Instagram: at Fox and Fawn, items often sell out within minutes of the picture being posted on Instagram.

  • Even Facebook's infrastructure struggles when a new feature becomes an unexpected hit. That's the situation described in an engaging story: Looking back on “Look Back” videos. Look Backs are one-minute videos generated from a user's pics and posts. For the release they planned on 187 Gbps more bandwidth and 25 petabytes of disk. To get the rendering done they highly parallelized the pipeline. CDNs were alerted. Internal tests on employees found a few bugs. Less storage was needed than planned: because the videos could be regenerated, a high replication factor wasn't necessary. Go time! The videos were an unexpected hit, with a 40% reshare rate instead of the projected 10%. It seems people like themselves...a lot. Overnight, 30 teams cooperated to move tens of thousands of machines over to rendering. Good story. Though I'm disappointed it didn't have its own Look Back video. Stories are people too.

  • Why Google's services aren't really free. We all help train the beast. A Glimpse of Google, NASA & Peter Norvig + The Restaurant at the End of the Universe: Algorithms behave differently as they churn thru more data. For example in the figure, the Blue algorithm was better with a million training dataset. If one had stopped at that scale, one would think of optimizing that algorithm for better performance. But as the scale increased the purple algorithm started showing promise – in fact the blue one starts deteriorating at larger scale; In general, Google prefers algorithms that get better with data. Not all algorithms are like that, but Google likes to go after the ones with this type of performance characteristic.

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge...

Categories: Architecture

Paper: Scalable Eventually Consistent Counters over Unreliable Networks

Wed, 03/12/2014 - 16:56

Counting at scale in a distributed environment is surprisingly hard. And it's a subject we've covered before in various ways: Big Data Counting: How to count a billion distinct objects using only 1.5KB of Memory, How to update video views count effectively?, Numbers Everyone Should Know (sharded counters).

In Scalable Eventually Consistent Counters, Kellabyte (an excellent blog) explains how Cassandra's counter implementation scores well on scalability and high availability, but in so doing has an "over and under counting problem in partitioned environments."

Which is often fine. But if you want more accuracy there's the PN-counter, a CRDT (convergent replicated data type) that tracks "all the changes made to a counter on each node rather than storing and modifying a single value so that you can merge all the values into the proper final value. Of course the trade-off here is additional storage and processing but there are ways to optimize this."
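To make the idea concrete, here is a minimal state-based PN-counter in Python. It is a textbook sketch, not Cassandra's or the paper's implementation, and the class and method names are our own. Each node only bumps its own entries; merging takes the per-node maximum, so receiving the same state twice does not double-count and replicas converge regardless of merge order.

```python
from collections import defaultdict


class PNCounter:
    """State-based PN-counter CRDT: per-node increment and decrement tallies."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.p = defaultdict(int)  # increments, keyed by node id
        self.n = defaultdict(int)  # decrements, keyed by node id

    def increment(self, amount=1):
        self.p[self.node_id] += amount

    def decrement(self, amount=1):
        self.n[self.node_id] += amount

    def value(self):
        return sum(self.p.values()) - sum(self.n.values())

    def merge(self, other):
        # Element-wise max is commutative, associative and idempotent,
        # which is what lets replicas converge to the same value.
        for node, count in other.p.items():
            self.p[node] = max(self.p[node], count)
        for node, count in other.n.items():
            self.n[node] = max(self.n[node], count)


a, b = PNCounter("a"), PNCounter("b")
a.increment(3)
b.increment(2)
b.decrement()
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # 4 4
```

The extra storage the quote mentions is visible here: state grows with the number of nodes rather than being a single integer.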

And there's a paper you can count on that goes into more detail: Scalable Eventually Consistent Counters over Unreliable Networks:

Categories: Architecture

Douglas Adams - 3 Rules that Describe Our Reactions to Technologies

Tue, 03/11/2014 - 19:25

Chris Dixon unearthed a great quote from Douglas Adams on the nature of technological adoption that unsurprisingly hits the mark in our ever-changing and evolving world:

  1. Anything that is in the world when you’re born is normal and ordinary and is just a natural part of the way the world works.
  2. Anything that’s invented between when you’re fifteen and thirty-five is new and exciting and revolutionary and you can probably get a career in it.
  3. Anything invented after you’re thirty-five is against the natural order of things.

Some that come to mind: horse to car, index card to online search, PC to mobile, web to app, portal to messaging, Newton to Einstein, oil to electric, rock to rap, Aquinas to Bacon, buying to renting, files to streaming, network TV to cordkilling, broadcast to social, programming CPUs to programming biology, server to cloud, vm to container, wired to wireless, long read to TLDR, privacy to public to ephemeral, paper based news aggregation to digital aggregation, checks to online banking, gold to fiat to bitcoin, linear to exponential growth, large to small teams, a world that ignores you to a world that responds to you, nation states to who knows what, a military of people to a Military of Things, and weekly versus binge watching. Any others?

Categories: Architecture

Building a Social Music Service Using AWS, Scala, Akka, Play, MongoDB, and Elasticsearch

Tue, 03/11/2014 - 16:56

This is a guest repost by Rotem Hermon, former Chief Architect for serendip.me, on the architecture and scaling considerations behind making a startup music service.

serendip.me is a social music service that helps people discover great music shared by their friends, and also introduces them to their “music soulmates” - people outside their immediate social circle who share a similar taste in music.

Serendip is running on AWS and is built on the following stack: scala (and some Java), akka (for handling concurrency), Play framework (for the web and API front-ends), MongoDB and Elasticsearch.

Choosing the stack

One of the challenges of building serendip was the need to handle a large amount of data from day one, since a main feature of serendip is that it collects every piece of music being shared on Twitter from public music services. So when we approached the question of choosing the language and technologies to use, an important consideration was the ability to scale.

The JVM seemed the right basis for our system because of its proven performance and tooling. It's also the platform of choice for a lot of open source systems (like Elasticsearch), which enables using their native clients - a big plus.

When we looked at the JVM ecosystem, scala stood out as an interesting language option that allowed a modern approach to writing code, while keeping full interoperability with Java. Another argument in favour of scala was the akka actor framework which seemed to be a good fit for a stream processing infrastructure (and indeed it was!). The Play web framework was just starting to get some adoption and looked promising. Back when we started, at the very beginning of 2011, these were still kind of bleeding edge technologies. So of course we were very pleased that by the end of 2011 scala and akka consolidated to become Typesafe, with Play joining in shortly after.

MongoDB was chosen for its combination of developer friendliness, ease of use, feature set and possible scalability (using auto-sharding). We learned very soon that the way we wanted to use and query our data would require creating a lot of big indexes on MongoDB, which would cause us to hit performance and memory issues pretty fast. So we kept using MongoDB mainly as a key-value document store, also relying on its atomic increments for several features that required counters.
With this type of usage MongoDB turned out to be pretty solid. It is also rather easy to operate, but mainly because we managed to avoid using sharding and went with a single replica set (the sharding architecture of MongoDB is pretty complex).

For querying our data we needed a system with full-blown search capabilities. Out of the possible open source search solutions, Elasticsearch came out as the most scalable and cloud-oriented system. Its dynamic indexing schema and the many search and faceting possibilities it provides allowed us to build many features on top of it, making it a central component in our architecture.

We chose to manage both MongoDB and Elasticsearch ourselves and not use a hosted solution for two main reasons. First, we wanted full control over both systems. We did not want to depend on a third party for software upgrades/downgrades. And second, the amount of data we process meant that a hosted solution was more expensive than managing it directly on EC2 ourselves.

Some numbers
Categories: Architecture

Let's Play a Game of Take It or Leave It - Game 1

Mon, 03/10/2014 - 16:56

The way this game is played is you read a few statements on some hot topics below. If you agree with a statement then you “take it”; if you disagree then you “leave it.” And if you are so moved please write a convincing comment as to why. Got it?

  1. Snowden vs. the State. Snowden represents the true spirit of freedom and is not a threat to all we hold dear.

  2. Walled Garden vs. Federated Freedom. The Walled Garden has won the last decade. The cycle of life will return the balance and federated services will once again win the day.

  3. Mobile + messaging vs. Le Web. Mobile + messaging is eating search and the web, changing the way things are found, discovered, and bought.

  4. Fiat vs. Cryptocurrency. BitCoin has had its 400 million dollars of fame, it’s on the way out, a tulip gone out of bloom.

  5. True Detective vs. The Field. True Detective is the best show on TV, ever. The Wire and Breaking Bad need not apply.

Categories: Architecture

Stuff The Internet Says On Scalability For March 7th, 2014

Fri, 03/07/2014 - 17:58

Hey, it's HighScalability time:


Twitter valiantly survived an Oscar DDoS attack by non-state actors.
  • Several Billion: Apple iMessages per Day along with 40 billion notifications and 15 to 20 million FaceTime calls. Take that, WhatsApp. Their architecture? Hey, this is Apple, only the Shadow knows.
  • 200 bit quantum computer: more states than atoms in the universe; 10 million matches: Tinder's per day catch; $1 billion: Kickstarter's long tail pledge funding achievement
  • Quotable Quotes:
    • @cstross: Let me repeat that: 100,000 ARM processors will cost you a total of $75,000 and probably fit in your jacket pocket.
    • @openflow: "You can no longer separate compute, storage, and networking." -- @vkhosla #ONS2014
    • @HackerNewsOnion: New node.js co-working space has 1 table and everyone takes turns
    • @chrismunns: we're reaching the point where ease and low cost of doing DDOS attacks means you shouldn't serve anything directly out of your origin
    • @rilt: Mysql dead, Cassandra now in production using @DataStax python driver.
    • @CompSciFact: "No engineered structure is designed to be built and then neglected or ignored." -- Henry Petroski
    • Arundhati Roy: Revolutions can, and often have, begun with reading.
    • Brett Slatkin: 3D printing is to design what continuous deployment is to code.
  • Well, Facebook got on that right quick: Facebook wants to use drones to blanket remote regions with Internet. We talked about a drone driven Internet back in January. This is good news IMHO. Facebook will have the resources to make this really happen. Hopefully. Maybe. Cross your fingers.

  • A vast hidden surveillance network runs across America, powered by the repo industry. This intelligence database was powered by individuals driving around and taking pictures of licence plates to track cars. Imagine how Google Glass will enable the tracking of people, without any three letter government agencies in the loop. Crowdsourcing is fun!

  • Francis Bacon, way back in the 1600s, was all over BigData with his ant, spider, and honey bee analogy: Good scientists are not like ants (mindlessly gathering data) or spiders (spinning empty theories). Instead, they are like bees, transforming nature into a nourishing product. This essay examines Bacon's "middle way" by elucidating the means he proposes to turn experience and insight into understanding. The human intellect relies on "machines" to extend perceptual limits, check impulsive imaginations, and reveal nature's latent causal structure, or “forms.”

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge...

Categories: Architecture

10 Things You Should Know About Running MongoDB at Scale

Wed, 03/05/2014 - 17:56

Guest post by Asya Kamsky, Principal Solutions Architect at MongoDB.

This post outlines ten things you need to know for operating MongoDB at scale based on my experience working with MongoDB customers and open source users:

  1. MongoDB requires DevOps, too. MongoDB is a database. Like any other data store, it requires capacity planning, tuning, monitoring, and maintenance. Just because it's easy to install and get started and it fits the developer paradigm more naturally than a relational database, don't assume that MongoDB doesn't need proper care and feeding. And just because it performs super-fast on a small sample dataset in development doesn't mean you can get away without having a good schema and indexing strategy, as well as the right hardware resources in production! But if you prepare well and understand the best practices, operating large MongoDB clusters can be boring instead of nerve-wracking.
  2. Successful MongoDB users monitor everything and prepare for growth. Tracking current capacity and capacity planning are essential practices in any database system, and MongoDB is no different. You need to know how much work your cluster is currently capable of sustaining and what demands will be placed on it during times of highest use. If you don't notice growing load on your servers you'll eventually get caught without enough capacity. To monitor your MongoDB deployment, you can use MongoDB Management Service (MMS) to visualize your operations by viewing the opscounters (operation counters) chart:
  3. The obstacles to scaling performance as your usage grows may not be what you'd expect. Having seen hundreds of users' deployments, I've found the performance bottlenecks usually are (in this order):
Categories: Architecture

Sponsored Post: Uber, ScaleOut Software, Couchbase, Tokutek, Logentries, Booking, Apple, MongoDB, BlueStripe, AiScaler, Aerospike, LogicMonitor, AppDynamics, ManageEngine, Site24x7

Tue, 03/04/2014 - 17:56

Who's Hiring?
  • Apple is hiring for multiple positions. Imagine what you could do here. At Apple, great ideas have a way of becoming great products, services, and customer experiences very quickly.
    • C++ Senior Developer and Architect - Maps. The Maps Team is looking for a senior developer and architect to support and grow some of the core backend services that support Apple Maps' Front End Services. Please apply here.
    • Senior Engineer. We are looking for a team player with focus on designing and developing WWDR’s web-based applications. The successful candidate must have the ability to take minimal business requirements and work proactively with cross-functional teams to obtain clear objectives that drive projects forward to completion. Please apply here.
    • Software Engineer. We are looking for a team player with focus on designing and developing WWDR’s web-based applications. The successful candidate must have the ability to take minimal business requirements and work proactively with cross-functional teams to obtain clear objectives that drive projects forward to completion. Please apply here.
    • Quality Assurance Engineer. The iOS Systems team is looking for a Quality Assurance engineer. In this role you will be expected to work hand-in-hand with the software engineering team to find and diagnose software defects. Please apply here.

  • Join the team that scales Uber supply globally! Our supply engineering team is responsible for prototyping, building, and maintaining the partner-facing platform. We're looking for experienced back-end developers who care about developing highly scalable services. Apply at https://www.uber.com/jobs/4810.

  • We need awesome people @ Booking.com - We want YOU! Come design next generation interfaces, solve critical scalability problems, and hack on one of the largest Perl codebases. Apply: http://www.booking.com/jobs.en-us.html

  • UI Engineer. AppDynamics, founded in 2008 and led by proven innovators, is looking for a passionate UI Engineer to design, architect, and develop their user interface using the latest web and mobile technologies. Make the impossible possible and the hard easy. Apply here.

  • Software Engineer - Infrastructure & Big Data. AppDynamics, a leader in next-generation solutions for managing modern, distributed, and extremely complex applications residing in both the cloud and the data center, is looking for Software Engineers (all levels) to design and develop scalable software, written in Java and MySQL, for the backend component of software that manages application architectures. Apply here.
Fun and Informative Events
  • How to Scale MySQL for Big Data Applications. A Guide to Evaluating TokuDB on March 20th at 1pm ET. You can do more than you think with the MySQL you already have. Learn how to use MySQL or MariaDB in Big Data applications by simply upgrading the storage engine with TokuDB. Register now.

  • Snapdeal Selects Aerospike over MongoDB, Couchbase and Redis to Improve Shopper Satisfaction. After experiencing 500% growth in 2013, Snapdeal, India’s largest online marketplace, switched from 10 MongoDB servers to just two Linux servers on Amazon EC2 with Aerospike, and reduced response times to less than a millisecond. Read the case study.
Cool Products and Services
  • Do Continuous MapReduce on Live Data? ScaleOut Software's hServer was built to let you hold your daily business data in-memory, update it as it changes, and concurrently run continuous MapReduce tasks on it to analyze it in real-time. We call this "stateful" analysis. To learn more check out hServer.

  • As one of the fastest-growing VoIP services in the world, Viber has replaced MongoDB with Couchbase Server, supporting 100,000+ operations per second in the short term and 1,000,000+ operations per second in the long term for their third generation architecture.  See the full story on the Viber switch.

  • Log management made easy with Logentries. Billions of log events are analyzed every day to unlock insights from the log data that matters to you. Simply powerful search, tagging, alerts, live tail and more for all of your log data. Automated AWS log collection and analytics, including CloudWatch events.

  • LogicMonitor is the cloud-based IT performance monitoring solution that enables companies to easily and cost-effectively monitor their entire IT infrastructure stack – storage, servers, networks, applications, virtualization, and websites – from the cloud. No firewall changes needed - start monitoring in only 15 minutes utilizing customized dashboards, trending graphs & alerting.

  • MongoDB Backup Free Usage Tier Announced. We're pleased to introduce the free usage tier to MongoDB Management Service (MMS). MMS Backup provides point-in-time recovery for replica sets and consistent snapshots for sharded systems with minimal performance impact. Start backing up today at mms.mongodb.com.

  • BlueStripe FactFinder Express is the ultimate tool for server monitoring and solving performance problems. Monitor URL response times and see if the problem is the application, a back-end call, a disk, or OS resources.

  • aiScaler, aiProtect, aiMobile Application Delivery Controller with integrated Dynamic Site Acceleration, Denial of Service Protection and Mobile Content Management. Cloud deployable. Free instant trial, no sign-up required.  http://aiscaler.com/

  • ManageEngine Applications Manager : Monitor physical, virtual and Cloud Applications.

  • www.site24x7.com : Monitor End User Experience from a global monitoring network.

If any of these items interest you there's a full description of each sponsor below. Please click to read more...

Categories: Architecture

The “Four Hamiltons” Framework for Mitigating Faults in the Cloud: Avoid it, Mask it, Bound it, Fix it Fast

Mon, 03/03/2014 - 18:08

This is a guest post by Patrick Eaton, Software Engineer and Distributed Systems Architect at Stackdriver.

Stackdriver provides intelligent monitoring-as-a-service for cloud hosted applications.  Behind this easy-to-use service is a large distributed system for collecting and storing metrics and events, monitoring and alerting on them, analyzing them, and serving up all the results in a web UI.  Because we ourselves run in the cloud (mostly on AWS), we spend a lot of time thinking about how to deal with faults in the cloud.  We have developed a framework for thinking about fault mitigation for large, cloud-hosted systems.  We endearingly call this framework the “Four Hamiltons” because it is inspired by an article from James Hamilton, the Vice President and Distinguished Engineer at Amazon Web Services.

The article that led to this framework is called “The Power Failure Seen Around the World”.  Hamilton analyzes the causes of the power outage that affected Super Bowl XLVII in early 2013.  In the article, Hamilton writes:

As when looking at any system faults, the tools we have to mitigate the impact are: 1) avoid the fault entirely, 2) protect against the fault with redundancy, 3) minimize the impact of the fault through small fault zones, and 4) minimize the impact through fast recovery.

The mitigation options are roughly ordered by increasing impact to the customer.  In this article, we will refer to these strategies, in order, as “Avoid it”, “Mask it”, “Bound it”, and “Fix it fast”...
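As an illustration of the second strategy, "Mask it" (protecting against a fault with redundancy), here is a minimal Python sketch assuming a service has several interchangeable replicas that can be called as plain functions. `call_with_failover` and the zone functions are hypothetical names, not Stackdriver's code; a production version would also need timeouts, health checks and backoff.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Mask a fault by trying redundant replicas in order; return the first success."""
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # any failure: record it and try the next replica
            errors.append(exc)
    # Only when every replica fails does the fault become visible to the caller.
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors!r}")


def zone_a():
    raise ConnectionError("zone-a is down")


def zone_b():
    return "metrics from zone-b"


print(call_with_failover([zone_a, zone_b]))  # metrics from zone-b
```

The caller never sees zone-a's failure; masking only breaks down when all redundant copies fail together, which is where bounding the fault zone comes in.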

Categories: Architecture

Stuff The Internet Says On Scalability For February 28th, 2014

Fri, 02/28/2014 - 17:56

Hey, it's HighScalability time:


Plus ça change, plus c'est la même chose (full)
  • Quotable Quotes:
    • @ML_Hipster: A machine learning researcher, a crypto-currency expert, and an Erlang programmer walk into a bar. Facebook buys the bar for $27 billion.
    • OH: Network effects don't happen on toll roads.
    • Benedict Evans: Google is a vast machine learning engine... and it spent 10-15 years building that learning engine and feeding it data.
  • Mining Experiment: Running 600 Servers for a Year Yields 0.4 Bitcoin. Yes, this is a far superior way of doing things. Chew up the commons for marginal gain. It's like old times.
  • Game designers, forget the sardines and go hunt some whale. Swrve found: half of free-to-play games’ in-app purchases came from 0.15 percent of players. Only 1.5 percent of players of games in the Swrve network spent any money at all.
  • Google has a beta version of their cloud pricing calculator. The interface is a little funky with separate "Add to Estimate" sections, but the prices look good. 5 servers, with 2 cores, 7.5GB RAM, 24x7, 3TB storage, 100 million IOPS, 1TB snapshot storage, 1TB light Cloud SQL operations, 4TB cloud storage, all for $1,559.24 a month.
  • So scalability doesn't matter? After the WhatsApp acquisition here's a tweet from Telegram Messenger: 4 million users joined Telegram within the last 18 hours. We're doing our best, but the service is getting unstable due to high load... it'll take some time to transport and install the new equipment.
  • Maybe content can make money rather than being cheap commodity chum for aggregators. Financial Times’ CTO John O’Donovan: We make more money from our content than from advertising which is a really interesting shift – we are pushing boundaries in terms of how we are getting our content into these different services and platforms.
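The two Swrve percentages in the list above combine into a sharper statement. 0.15 percent of all players is one tenth of the 1.5 percent who pay anything, so roughly 10% of paying players account for half of in-app purchases. The check below uses a hypothetical audience of 10,000 players for illustration:

```python
# Figures from the Swrve item above, expressed as fractions of all players.
paying_share = 0.015   # 1.5 percent of players spent any money
whale_share = 0.0015   # 0.15 percent produced half of in-app purchases

# Fraction of *paying* players responsible for half the purchases.
whales_among_payers = whale_share / paying_share

players = 10_000  # hypothetical audience size, for illustration only
payers = round(players * paying_share)
whales = round(players * whale_share)

print(f"Of {players:,} players, {payers} pay and {whales} produce half the purchases "
      f"({whales_among_payers:.0%} of payers).")
```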

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge...

Categories: Architecture

The WhatsApp Architecture Facebook Bought For $19 Billion

Wed, 02/26/2014 - 17:56

Rick Reed in an upcoming talk in March titled That's 'Billion' with a 'B': Scaling to the next level at WhatsApp reveals some eye popping WhatsApp stats:

What has hundreds of nodes, thousands of cores, hundreds of terabytes of RAM, and hopes to serve the billions of smartphones that will soon be a reality around the globe? The Erlang/FreeBSD-based server infrastructure at WhatsApp. We've faced many challenges in meeting the ever-growing demand for our messaging services, but as we continue to push the envelope on size (>8000 cores) and speed (>70M Erlang messages per second) of our serving system...
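Taking the two figures in the abstract at face value gives a rough per-core rate. Both are stated as lower bounds, so their ratio is only approximate. "Erlang messages" presumably means messages between Erlang processes, not user chat messages.

```python
# Figures from the abstract quoted above; both are stated as lower bounds.
cores = 8_000
erlang_msgs_per_sec = 70_000_000

msgs_per_core_per_sec = erlang_msgs_per_sec / cores
# Average time available per message on one core, in microseconds.
budget_us_per_msg = 1_000_000 / msgs_per_core_per_sec

print(f"{msgs_per_core_per_sec:,.0f} msgs/core/s, ~{budget_us_per_msg:.0f} us each")
```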

But since we don’t have that talk yet, let’s take a look at a talk Rick Reed gave two years ago on WhatsApp: Scaling to Millions of Simultaneous Connections.

Having built a high performance messaging bus in C++ while at Yahoo, Rick Reed is not new to the world of high scalability architectures. The founders are also ex-Yahoo guys with not a little experience scaling systems. So WhatsApp comes by their scaling prowess honestly. And since they have a Big Hairy Audacious Goal of being on every smartphone in the world, which could be as many as 5 billion phones in a few years, they’ll need to make the most of that experience.

Before we get to the facts, let’s digress for a moment on this absolutely fascinating conundrum: How can WhatsApp possibly be worth $19 billion to Facebook?

As a programmer, if you ask me whether WhatsApp is worth that much, I’ll answer expletive no! It’s just sending stuff over a network. Get real. But I’m also the guy who thought we don’t need blogging platforms, because how hard is it to remote login to your own server, edit the index.html file with vi, then write your post in HTML? It has taken quite a while for me to realize it’s not the code, stupid; getting all those users to love and use your product is the hard part. You can’t buy love.

What is it that makes WhatsApp so valuable? The technology? Ignore all those people who say they could write WhatsApp in a week with PHP. That’s simply not true. It is, as we’ll see, pretty cool technology. But certainly Facebook has sufficient chops to build WhatsApp if they wished.

Let’s look at features. We know WhatsApp is a no-gimmicks product (no ads, no games, no gimmicks) with loyal users from across the world. It offers free texting in a cruel world where SMS charges can be abusive. As a sheltered American, what has surprised me most is how many real people use WhatsApp to really stay in touch with family and friends. So when you get on WhatsApp it’s likely that people you know are already on it, since everyone has a phone, which mitigates the empty social network problem. It is aggressively cross-platform, so everyone you know can use it and it will just work. “It just works” is a phrase often used. It is full featured: shared locations, video, audio, pictures, push-to-talk, voice messages and photos, read receipts, group chats, and sending messages via WiFi, all of which work regardless of whether the recipient is online. It handles the display of native languages well. And using your cell number as identity and your contacts list as a social graph is diabolically simple. There’s no email verification, no username and password, and no credit card number required. So it just works.

All impressive, but that can’t be worth $19 billion. Other products can compete on features.

One possible reason is that Google wanted it. It’s a threat. It’s for the 99 cents a user. Facebook is just desperate. It’s for your phone book. It’s for the metadata (even though WhatsApp keeps none).

It’s for the 450 million active users, with a user base growing at one million users a day and the potential for a billion users. Facebook needs WhatsApp for its next billion users. Certainly that must be part of it. And a cost of about $40 a user doesn’t seem unreasonable, especially with the bulk paid out in stock. Facebook acquired Instagram for about $30 per user. A Twitter user is worth $110.
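As a quick check on the "about $40 a user" figure, dividing the $19 billion headline price by the 450 million active users quoted above gives a little over $42. This ignores the cash/stock split and any future growth. It is only the headline arithmetic.

```python
price_usd = 19e9        # headline acquisition price (cash plus stock)
active_users = 450e6    # active users at the time of the deal

price_per_user = price_usd / active_users
print(f"${price_per_user:.2f} per active user")
```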

Benedict Evans makes a great case that mobile is a trillion-dollar-plus business and that WhatsApp is disrupting the SMS part of this industry, which globally has over $100 billion in revenue. WhatsApp sends 18 billion messages a day, while the global SMS system sends only 20 billion SMS messages a day. With the fundamental transition from PCs to nearly universal smartphone adoption, the addressable market is much larger than the one where Facebook normally plays.

But Facebook has promised no ads and no interference, so where’s the win?

There’s the interesting development of business use over mobile. WhatsApp is used to create group conversations for project teams and venture capitalists carry out deal flow conversations over WhatsApp.

Instagram is used in Kuwait to sell sheep.

WeChat, a WhatsApp competitor, launched a taxi-cab hailing service in January. In the first month 21 million cabs were hailed.

With the future of e-commerce looking like it will be funneled through mobile messaging apps, it must be an e-commerce play?

It’s not just businesses using WhatsApp for applications that were once on the desktop or on the web. Police officers in Spain use WhatsApp to catch criminals. People in Italy use it to organize basketball games.

Commerce and other applications are jumping onto mobile for obvious reasons. Everyone has mobile, and these messaging applications are powerful, free, and cheap to use. No longer do you need a desktop or a web application to get things done. A lot of functionality can be overlaid on a messaging app.

So messaging is a threat to Google and Facebook. The desktop is dead. The web is dying. Messaging + mobile is an entire ecosystem that sidesteps their channel.

Facebook needs to get into this market or become irrelevant?

With the move to mobile we are seeing deportalization of Facebook. The desktop web interface for Facebook is a portal style interface providing access to all the features made available by the backend. It’s big, complicated, and creaky. Who really loves the Facebook UI?

When Facebook moved to mobile they tried the portal approach and it didn’t work. So they are going with a strategy of smaller, more focussed, purpose built apps. Mobile first! There’s only so much you can do on a small screen. On mobile it’s easier to go find a special app than it is to find a menu buried deep within a complicated portal style application.

But Facebook is going one step further. They are not only creating purpose built apps, they are providing multiple competing apps that provide similar functionality and these apps may not even share a backend infrastructure. We see this with Messenger and WhatsApp, Instagram and Facebook’s photo app. Paper is an alternate interface to Facebook that provides very limited functionality, but it does what it does very well.

Conway's law may be operating here. The idea that “organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.” With a monolithic backend infrastructure we get a Borg-like portal design. The move to mobile frees the organization from this way of thinking. If apps can be built that provide a view of just a slice of the Facebook infrastructure then apps can be built that don’t use Facebook’s infrastructure at all. And if they don't need Facebook's infrastructure then they are free not to be built by Facebook at all. So exactly what is Facebook then?

Facebook CEO Mark Zuckerberg has his own take, saying in a keynote presentation at the Mobile World Congress that Facebook's acquisition of WhatsApp was closely related to the Internet.org vision:

The idea is to develop a group of basic internet services that would be free of charge to use — “a 911 for the internet." These could be a social networking service like Facebook, a messaging service, maybe search and other things like weather. Providing a bundle of these free of charge to users will work like a gateway drug of sorts — users who may be able to afford data services and phones these days just don’t see the point of why they would pay for those data services. This would give them some context for why they are important, and that will lead them to paying for more services like this — or so the hope goes.

This is the long play, which is a game that having a huge reservoir of valuable stock allows you to play. 

Have we reached a conclusion? I don’t think so. It’s such a stunning dollar amount with such tenuous immediate rewards that the long-term-play explanation actually does make some sense. We are still in the very early days of mobile. Nobody knows what the future will look like, so it pays not to try to force the future to look like your past. Facebook seems to be doing just that.

But enough of this. How do you support 450 million active users with only 32 engineers? Let’s find out...

Categories: Architecture

Peter Norvig's 9 Master Steps to Improving a Program

Tue, 02/25/2014 - 17:56

 

Inspired by an xkcd comic, Peter Norvig, Director of Research at Google and all-around interesting and nice guy, has created an above-par code kata involving a regex program that demonstrates the core inner loop of many successful systems profiled on HighScalability.

The original code is at xkcd 1313: Regex Golf, which comes up with an algorithm to find a short regex that matches the winners and not the losers from two arbitrary lists. The Python code is readable, the process is TDDish, and the problem sounds simple but soon explodes into regex weirdness, as most regex code does. If you find regular expressions confusing you'll definitely benefit from Peter's deliberate strategy for finding a regex.
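The core idea can be sketched as a greedy set cover. First build a pool of short regex fragments that match no loser. Then repeatedly pick the fragment that covers the most still-unmatched winners, penalized by its length, and join the picks with `|`. This is a simplified sketch in that spirit, not Norvig's code. The function names, the fragment pool (substrings of `^word$` only, no `.` wildcards), and the weight of 3 per matched word are assumptions made here. Words are assumed to contain no regex metacharacters, and the two lists are assumed to be disjoint.

```python
import re


def matches(regex: str, strings: set[str]) -> set[str]:
    """Return the subset of strings that the regex matches anywhere."""
    return {s for s in strings if re.search(regex, s)}


def candidate_parts(winners: set[str]) -> set[str]:
    """Every substring of '^word$' for each winner, e.g. '^wa', 'ngt', 'n$'."""
    parts: set[str] = set()
    for word in winners:
        anchored = f"^{word}$"
        for i in range(len(anchored)):
            for j in range(i + 1, len(anchored) + 1):
                parts.add(anchored[i:j])
    return parts


def find_regex(winners: set[str], losers: set[str]) -> str:
    """Greedy set cover over fragments that match no loser."""
    if winners & losers:
        raise ValueError("a word cannot be both a winner and a loser")
    pool = {p for p in candidate_parts(winners) if not matches(p, losers)}
    uncovered = set(winners)
    cover: list[str] = []
    while uncovered:
        # Only consider parts that make progress; '^word$' always qualifies.
        useful = [p for p in pool if matches(p, uncovered)]
        best = max(useful, key=lambda p: (3 * len(matches(p, uncovered)) - len(p), p))
        cover.append(best)
        uncovered -= matches(best, uncovered)
    # '|' has the lowest precedence, so '^ab|cd$' means (^ab)|(cd$).
    return "|".join(cover)


if __name__ == "__main__":
    winners = {"washington", "adams", "jefferson", "lincoln"}
    losers = {"clinton", "jackson", "nixon"}
    print(find_regex(winners, losers))
```

Greedy set cover is not guaranteed to find the shortest regex. Each chosen part matches at least one winner and no loser, so the alternation always matches every winner and no loser.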

The post demonstrating the iterated improvement of the program is at xkcd 1313: Regex Golf (Part 2: Infinite Problems). As with most first solutions it wasn't optimal. To improve the program Peter recommends the following steps:

Categories: Architecture

Stuff The Internet Says On Scalability For February 21st, 2014

Fri, 02/21/2014 - 17:56

Hey, it's HighScalability time (a particularly bountiful week):


The Telephone Wires of Manhattan in 1887

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge...

Categories: Architecture

Planetary-Scale Computing Architectures for Electronic Trading and How Algorithms Shape Our World

Wed, 02/19/2014 - 19:09

Algorithms are moving out of the Platonic realm and are becoming dynamic first class players in real life. We've seen corporations become people. Algorithms will likely also follow that path to agency.

Kevin Slavin in his intriguing TED talk: How Algorithms Shape Our World, gives many and varied examples of how algorithms have penetrated RL. 

One of his most interesting examples is from a highly technical paper on Relativistic statistical arbitrage, which says to make money on markets you have to be where the people are, the red dots (on the diagram below), which means you have to put servers where the blue dots are, many of which are in the ocean. Here's the diagram from the paper:

Mr. Slavin neatly sums this up by saying:

And it's not the money that's so interesting actually. It's what the money motivates, that we're actually terraforming the Earth itself with this kind of algorithmic efficiency. And in that light, you go back and you look at Michael Najjar's photographs, and you realize that they're not metaphor, they're prophecy. They're prophecy for the kind of seismic, terrestrial effects of the math that we're making. And the landscape was always made by this sort of weird, uneasy collaboration between nature and man. But now there's this third co-evolutionary force: algorithms -- the Boston Shuffler, the Carnival. And we will have to understand those as nature, and in a way, they are.
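Here is the placement idea from the paper in miniature. Assume two equally important markets, with the goal of minimizing the worse of the two light-speed delays. Under that assumption, the best spot for a server is the midpoint of the great-circle path between the exchanges. The paper's own analysis weights locations more carefully, so treat this as an illustration only. For New York and London the midpoint falls in the open North Atlantic, which is the kind of ocean location described above.

```python
import math


def to_unit_vector(lat_deg: float, lon_deg: float) -> tuple[float, float, float]:
    """Convert latitude/longitude in degrees to a 3D unit vector."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon), math.cos(lat) * math.sin(lon), math.sin(lat))


def great_circle_midpoint(a: tuple[float, float], b: tuple[float, float]) -> tuple[float, float]:
    """Midpoint of the shorter great-circle arc between two (lat, lon) points.

    Adding the two unit vectors gives a vector that bisects the arc; its
    direction is the midpoint. Undefined for antipodal points (the sum is zero).
    """
    x, y, z = (p + q for p, q in zip(to_unit_vector(*a), to_unit_vector(*b)))
    lat = math.degrees(math.atan2(z, math.hypot(x, y)))
    lon = math.degrees(math.atan2(y, x))
    return lat, lon


if __name__ == "__main__":
    new_york = (40.71, -74.01)
    london = (51.51, -0.13)
    print(great_circle_midpoint(new_york, london))  # roughly 52.4 N, 41.3 W
```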

The introduction to the paper spells out why this is so:

Categories: Architecture

Sponsored Post: Couchbase, Tokutek, Logentries, Booking, Apple, MongoDB, BlueStripe, AiScaler, Aerospike, LogicMonitor, AppDynamics, ManageEngine, Site24x7

Tue, 02/18/2014 - 19:00

Who's Hiring?
  • Apple is hiring for multiple positions. Imagine what you could do here. At Apple, great ideas have a way of becoming great products, services, and customer experiences very quickly.
    • Sr Software Engineer. The Emerging Technology team is looking for a highly motivated, detail-oriented, energetic individual with experience in a variety of big data technologies. You will be part of a fast growing, cohesive team with many exciting responsibilities related to Big Data. Please apply here.
    • C++ Senior Developer and Architect - Maps. The Maps Team is looking for a senior developer and architect to support and grow some of the core backend services that support Apple Maps' Front End Services. Please apply here.
    • Senior Engineer. We are looking for a team player with focus on designing and developing WWDR’s web-based applications. The successful candidate must have the ability to take minimal business requirements and work pro-actively with cross functional teams to obtain clear objectives that drive projects forward to completion. Please apply here.
    • Software Engineer. We are looking for a team player with focus on designing and developing WWDR’s web-based applications. The successful candidate must have the ability to take minimal business requirements and work pro-actively with cross functional teams to obtain clear objectives that drive projects forward to completion. Please apply here.
    • Quality Assurance Engineer. The iOS Systems team is looking for a Quality Assurance engineer. In this role you will be expected to work hand-in-hand with the software engineering team to find and diagnose software defects. Please apply here.

  • We need awesome people @ Booking.com - We want YOU! Come design next generation interfaces, solve critical scalability problems, and hack on one of the largest Perl codebases. Apply: http://www.booking.com/jobs.en-us.html

  • UI Engineer: AppDynamics, founded in 2008 and led by proven innovators, is looking for a passionate UI Engineer to design, architect, and develop their user interface using the latest web and mobile technologies. Make the impossible possible and the hard easy. Apply here.

  • Software Engineer - Infrastructure & Big Data: AppDynamics, leader in next generation solutions for managing modern, distributed, and extremely complex applications residing in both the cloud and the data center, is looking for Software Engineers (all levels) to design and develop scalable software written in Java and MySQL for the backend component of software that manages application architectures. Apply here.
Fun and Informative Events
  • Which MongoDB Distribution Should You Use? AOL Benchmark Results - TokuMX vs. MongoDB. March 5th at 1pm ET. It may be easy to choose a NoSQL database, but do you know which distribution is best for you? Which will perform better? Which will scale further? Look before you leap.  Register now.

  • Aerospike Webinar: “Getting the Most out of Your Flash/SSDs”. Tune in to Aerospike's latest webinar, “Getting the Most Out of your Flash/SSDs” at 10am PST Tuesday, Feb. 18 to learn how to select, test and prepare your drives for maximum database performance. Register now. 
Cool Products and Services
  • As one of the fastest growing VoIP services in the world Viber has replaced MongoDB with Couchbase Server, supporting 100,000+ operations per second in the short term and 1,000,000+ operations per second in the long term for their third generation architecture.  See the full story on the Viber switch.

  • Log management made easy with Logentries. Billions of log events analyzed every day to unlock insights from the log data that matters to you. Simply powerful search, tagging, alerts, live tail and more for all of your log data. Automated AWS log collection and analytics, including CloudWatch events.

  • LogicMonitor is the cloud-based IT performance monitoring solution that enables companies to easily and cost-effectively monitor their entire IT infrastructure stack – storage, servers, networks, applications, virtualization, and websites – from the cloud. No firewall changes needed - start monitoring in only 15 minutes utilizing customized dashboards, trending graphs & alerting.

  • MongoDB Backup Free Usage Tier Announced. We're pleased to introduce the free usage tier to MongoDB Management Service (MMS). MMS Backup provides point-in-time recovery for replica sets and consistent snapshots for sharded systems with minimal performance impact. Start backing up today at mms.mongodb.com.

  • BlueStripe FactFinder Express is the ultimate tool for server monitoring and solving performance problems. Monitor URL response times and see if the problem is the application, a back-end call, a disk, or OS resources.

  • aiScaler, aiProtect, aiMobile Application Delivery Controller with integrated Dynamic Site Acceleration, Denial of Service Protection and Mobile Content Management. Cloud deployable. Free instant trial, no sign-up required.  http://aiscaler.com/

  • ManageEngine Applications Manager : Monitor physical, virtual and Cloud Applications.

  • www.site24x7.com : Monitor End User Experience from a global monitoring network.

If any of these items interest you there's a full description of each sponsor below. Please click to read more...

Categories: Architecture

How the AOL.com Architecture Evolved to 99.999% Availability, 8 Million Visitors Per Day, and 200,000 Requests Per Second

Mon, 02/17/2014 - 17:56

This is a guest post by Dave Hagler, Systems Architect at AOL.

The AOL homepages receive more than 8 million visitors per day.  That’s more daily viewers than Good Morning America or the Today Show on television.  Over a billion page views are served each month.  AOL.com has been a major internet destination since 1996, and still has a strong following of loyal users.
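Two of the headline numbers are easy to put in perspective. Five nines leaves a yearly downtime budget of barely five minutes. A billion page views a month averages out to only a few hundred page views per second. So the 200,000 requests per second in the title presumably counts more than page views, or peak rather than average traffic; the excerpt does not break that figure down. The arithmetic below assumes a 365-day year and a 30-day month:

```python
MINUTES_PER_YEAR = 365 * 24 * 60    # ignores leap years
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (1 - availability) * MINUTES_PER_YEAR


five_nines = downtime_budget_minutes(0.99999)
avg_page_views_per_sec = 1e9 / SECONDS_PER_MONTH  # "over a billion page views" a month

print(f"99.999% allows ~{five_nines:.1f} minutes of downtime a year")
print(f"1 billion page views/month averages ~{avg_page_views_per_sec:.0f} per second")
```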

The architecture for AOL.com is in its 5th generation. It has essentially been rebuilt from scratch 5 times over two decades. The current architecture was designed 6 years ago. Pieces have been upgraded and new components have been added along the way, but the overall design remains largely intact. The code, tools, and development and deployment processes have been highly tuned over 6 years of continual improvement, making the AOL.com architecture battle tested and very stable.

The engineering team is made up of developers, testers, and operations and totals around 25 people.  The majority are in Dulles, Virginia with a smaller team in Dublin, Ireland.

In general the technologies in use are Java, JavaServer Pages, Tomcat, Apache, CentOS 5, Git, Jenkins, Selenium, and jQuery. There are some other technologies used outside that stack, but these are the main components.

Design Principles
Categories: Architecture

Stuff The Internet Says On Scalability For February 14th, 2014

Fri, 02/14/2014 - 18:27

Hey, it's HighScalability time:


Climbing the World's Second Tallest Building
  • 5 billion: Number of phone records NSA collects per day; Facebook: 1.23 billion users, 201.6 billion friend connections, 400 billion shared photos, and 7.8 trillion messages sent since the start of 2012.
  • Quotable Quotes:
    • @ShrikanthSS: people repeatedly underestimate the cost of busy waits
    • @mcclure111: Learning today java.net.URL.equals is a blocking operation that hits the network shook me badly. I don't know if I can trust the world now.
    • @hui_kenneth: @randybias: “3 ways 2 be market leader - be 1st, be best, or be cheapest. #AWS was all 3. Now #googlecloud may be best & is the cheapest.”
    • @thijs: The nice thing about Paper is that we can point out to clients that it took 18 experienced designers and developers two years to build.
    • @neil_conway: My guess is that the split between Spanner and F1 is a great example of Conway's Law.
  • How Facebook built the real-time posts search feature of Graph search. It's a big problem: one billion new posts added every day, the posts index contains more than one trillion total posts, comprising hundreds of terabytes of data. 

  • Chartbeat Engineering shares some of their experiences in two excellent articles: Part 1,  Part 2. Lessons: DNS is not a great means of load balancing traffic; Modifying sysctl values from their defaults can be important to ensure reliability; Graphing metrics is your friend;  Through TCP tuning and utilizing AWS Elastic Load Balancer we were able to decrease our response time by 98.5%, decrease our server footprint by 20% on our front end servers;  Enabling cross-zone load balancing got our request count distribution extremely well balanced;  planning to move from the m1.large instance type to the c3.large.  The c3.large is almost 50% cheaper and gives us more compute units which in turn yields slightly better response times.

  • Creating a resilient organization is a little like getting an allergy shot: you have to ingest a little of what ails you to boost your immune system. That's the idea behind DiRT, Google's Disaster Recovery Testing event. Weathering the Unexpected tells the story of how far Google goes to improve their corporate immune system with disaster scenarios. Disasters can range from a walk-through of a backup restore to a company-wide zombie attack simulation. More here and here.

  • 37signals shows the power of focus by shedding all their products except Basecamp and even renaming themselves to be just Basecamp. A company can grow wild unless pruned and shaped to let in the maximum amount of sunlight, growing the most and ripest fruit. While a hard prune is common in the orchard, it's not so common in an organization. A very brave move.

  • When I suggested this I was laughed at. So there! Patch Panels in the Sky: A Case for Free-Space Optics in Data Centers: We explore the vision of an all-wireless inter-rack datacenter fabric.

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge...

Categories: Architecture

Snabb Switch - Skip the OS and Get 40 million Requests Per Second in Lua

Thu, 02/13/2014 - 17:56

Snabb Switch - a toolkit for solving novel problems in networking. If you are building a new packet-processing network appliance then you can use Snabb Switch to get the job done more quickly.

Here's a great impassioned overview from erichocean:

Or, you could just avoid the OS altogether: https://github.com/SnabbCo/snabbswitch

Our current engineering target is 1 million writes/sec and > 10 million reads/sec on top of an architecture similar to that, on a single box, to our fully transactional, MVCC database (writes do not block reads, and vice versa) that runs in the same process (a la SQLite), which we've also merged with our application code and our caching tier, so we're down to, literally, a single process for what would have been at least three separate tiers in a traditional setup.

The result is that we had to move to measuring request latency in microseconds exclusively. The architecture (without additional application-specific processing) supports a wire-to-wire messaging speed of 26 nanoseconds, or approx. 40 million requests per second. And that's written in Lua!
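The two figures in the quote are linked by a reciprocal. If each message takes 26 ns, and messages are processed back to back on one pipeline, throughput is 1 / 26 ns. That is about 38.5 million per second, which the quote rounds to 40 million. The back-to-back assumption is ours; the quote does not say how the work is parallelized.

```python
latency_ns = 26  # wire-to-wire time per message, from the quote above

# If messages are handled strictly one after another, throughput is the
# reciprocal of the per-message time.
messages_per_sec = 1e9 / latency_ns

print(f"~{messages_per_sec / 1e6:.1f} million messages per second")
```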

To put that in perspective, that kind of performance is about 1/3 of what you'd need to be able to do to handle Facebook's messaging load (on average, obviously, Facebook bursts higher than the average at times...).

Point being, the OS is just plain out-of-date for how to solve heavy data plane problems efficiently. The disparity between what the OS can do and what the hardware is capable of delivering is off by a few orders of magnitude right now. It's downright ridiculous how much performance we're giving up for supposed "convenience" today.

Categories: Architecture