Skip to content

Software Development Blogs: Programming, Software Testing, Agile Project Management

Methods & Tools

Subscribe to Methods & Tools
if you are not afraid to read more than one page to be a smarter software developer, software tester or project manager!

High Scalability - Building bigger, faster, more reliable websites
Syndicate content
Updated: 7 hours 43 min ago

Marco Arment Uses Go Instead of PHP and Saves Money by Cutting the Number of Servers in Half

Mon, 02/02/2015 - 17:56

On the excellent Accidental Tech Podcast there's a running conversation about Marco Arment's (Tumblr, Instapaper) switch to Go, from a much loved PHP, to implement feed crawling for Overcast, his popular podcasting app for the iPhone.

In Episode 101 (at about 1:10) Marco said he halved the number of servers used for crawling feeds by switching to Go. The total savings was a few hundred dollars a month in server costs.

Why? Feed crawling requires lots of parallel networking requests and PHP is bad at that sort of thing, while Go is good at it. 

Amazingly, Marco wrote an article on how much Overcast earned in 2014. It earned $164,000 after Apple's 30%, but before other expenses. At this revenue level the savings, while not huge in absolute terms given the traffic of some other products Marco has worked on, was a good return on programming effort. 

How much effort? It took about two months to rewrite and debug the feed crawlers. In addition, lots of supporting infrastructure that tied into the crawling system had to be created, like the logging infrastructure, the infrastructure that says when a feed was last crawled, monitoring delays, knowing if there's queue congestion, and forcing a feed to be crawled immediately.

So while the development costs were high up front, as Overcast grows the savings will also grow over time as efficient code on fast servers can absorb more load without spinning up more servers.

Lots of good lessons here, especially for the lone developer:

Categories: Architecture

Stuff The Internet Says On Scalability For January 30th, 2015

Fri, 01/30/2015 - 17:56

Hey, it's HighScalability time:


It's a strange world...exotic, gigantic molecules Fit Inside Each Other like Russian nesting dolls
  • 1.39 billion: Facebook Monthly Active Users; $18 billion profit: Apple in 3 months; 200 million: Kik users; 11.2 billion: age of the oldest known solar system; 3 billion: videos viewed per day on Facebook
  • Quotable Quotes:
    • @kevinroose: This dude wins SF bingo. RT @caro: An Uber driver is Airbnb'ing the trunk of his Tesla for $85/night.
    • @BenedictEvans: Only 16% of Facebook DAUs aren't using it on mobile
    • @rezendi: Yo's Law: "in the 21st century tech industry, satire and reality are not merely indistinguishable but actually interchangeable."
    • Brent Ozar: I recommend that people back up data, not servers.
    • @AnnaPawlicka: "Shared State is the Root of All Evil"
    • Peter Lawrey: micro-day - about 1/12 of a second. micro-century - 51.3 minutes. femto-parsec - about 30 metres.
    • TapirLiu: OH: docker is like a condom to protect your computer from Node.
    • @DigitCurator: "The Next Decade In Storage": Resistive RAM promises better scaling, efficiency, and 1000x endurance of flash memory 
    • @BenedictEvans: At the end of 2014 Apple had ~650-675m live iOS devices. With zero unit sales growth, 700-720m by end 2015. Consumer PCs in use - 7-800m
    • @MailChimp: We sent 14.1 billion emails in December, including 741 million on Cyber Monday.
    • @mjpt777:  That's in the past. We can now do 20 million per second :-) per stream.
    • @bradwilson: Conclusions: 1. Ethernet over power does not perform as well as WiFi (??) 2. Ethernet over power hates being shared among multiple PCs
    • @mjpt777: Specialized Evolution of the General-Purpose CPU  - note that performance per watt is approx doubling per generation. 
    • @nighitingale: "The Earth is 4.6 billion years old. Scaling to 46 years, humans have been here 4 hours, the industrial..."
    • Joseph Campbell: The hero’s journey always begins with the call. One way or another, a guide must come to say, “Look, you’re in Sleepy Land. Wake. Come on a trip."
    • Frank Herbert: the most persistent principles of the universe were accident and error.

  • Will Facebook ever figure out this mobile thing? Not long ago that was the big question. We have an answer. In the fourth quarter, the percentage of its advertising revenue from mobile devices increased to 69%, up from 66% in the third quarter and 53% a year earlier. Mobile daily active users were 745 million on average for December 2014, an increase of 34 percent year-over-year.

  • The power of smart: Facebook’s Powerful Ad Tools Grew Its Revenue 25X Faster Than User Count. Facebook might be running out of people, but they aren't running out of ways of monetizing those people. Math grows faster than users.

  • The Cathedral of Computation by Ian Bogost. Agree in part. There does seem to be an uncritical acceptance of algorithms, as if because they enliven machines they are some how pure and objective, when the opposite is the case. Algorithms are made for human purposes by teams of humans and show the biases and hubris of their makers. And like all creatures, algorithms should be subject to skepticism, law, and review.

  • We have many long running debates in tech. Server side vs client side rendering is just one of them. A thoughtful analysis: Tradeoffs in server side and client side rendering by Malte Ubl.  Bret Slatkin boldly claims: Experimentally verified: "Why client-side templating is wrong". He concludes: I hope never to render anything server-side ever again. I feel more comfortable in making that choice than ever thanks to all this data. I see rare occasions when server-side rendering could make sense for performance, but I don't expect to encounter many of those situations in the future.

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge (which means this post has many more items to read so please keep on reading)...

Categories: Architecture

Instagram Strategy to Radically Reduce Traffic: Kill all the spambots!

Wed, 01/28/2015 - 18:50

RIP to my fallen robot followers on Instagram, if there's a heaven for robot instagram users, you guys are in there

— alldaychubbyboy (@Allday)

How do you scale to handle increased user traffic? Have less traffic. No, this is not a koan. The best way to deal with traffic is not to have it. 

In a two day span Instagram disappeared 18.9 million users or more than 29 percent of their "followers." Justin Bieber lost 3.5 million followers (15 percent), Kim Kardashian lost 1.3 million followers (5.5 percent), Rihanna lost 1.2 million followers.

Instagram explains this dramatic reckoning was achieved by "removing deactivated spam accounts and accounts that violated its community guidelines." 

In an age when high user counts and tantalizing engagement metrics are more valuable than bitcoins, this can't have been an easy decision, but it was made after being bought by Facebook.

Why? Gabe Madway, an Instagram spokesman, tells us why: We totally get that it’s uncomfortable for people. The overall goal is we want it to be perceived that the people following you are real.

Uncomfortable is an understatement. A BuzzFeed article nicely captured some of the anger, here's just one example (could be NSFW):

Categories: Architecture

Paper: Immutability Changes Everything by Pat Helland

Mon, 01/26/2015 - 18:07

I was excited to see that Pat Helland has published another thought provoking paper: Immutability Changes Everything. If video is more your style, Pat gave a wonderful talk on the same subject at RICON2012 (videoslides).

It's fun to see how Pat's thinking is evolving over time as he's worked at Tandem Computers (TransactionMonitoring Facility), Amazon, Microsoft (Microsoft Transaction Server and SQL Service Broker), and now Salesforce.

You might have enjoyed some of Pat's other visionary papers: Life beyond Distributed Transactions: an Apostate’s OpinionThe end of an architectural era: (it's time for a complete rewrite), and Idempotence Is Not a Medical Condition.

This new paper is a high level overview of why immutability, the idea that destructive updates are not allowed, is a huge architectural win and because of cheaper disk, RAM, and compute, it's now financially feasible to keep all the things. The key insight is that without data updates, coordination in a distributed system becomes a much simpler problem to solve.

Immutability is an architectural concept that's been gaining steam on several fronts. Facebook is using a declarative immutable programming model in both the model and the view. We are seeing the idea of immutable infrastructure rise in DevOps. Aeron is a new messaging system that uses a persistent log to good advantage. The Lambda Architecture makes use of immutability. Datomic is a database data that treats data as a time-ordered series of immutable objects.

If that's of interest, then you'll like the paper.

Overview:

Categories: Architecture

Stuff The Internet Says On Scalability For January 23rd, 2015

Fri, 01/23/2015 - 17:56

Hey, it's HighScalability time:


Elon Musk: The universe is really, really big  [Gigapixels of Andromeda [4K]]
  • 90: is the new 50 for woman designer; $656.8 million: 3 months of Uber payouts; $10 billion: all it takes to build the Internet in space; 1 billion: registered WeChat users
  • Quotable Quotes:
    • @antirez: Tech stacks, more replaceable than ever: hardware is better, startups get $$ (few nodes + or - who cares), alternatives countless.
    • Olivio Sarikas: If every Star in this Image was a 2 millimeter Sandcorn you would end up with 1110 kg of Sand!!!!!!!!!
    • Chad Cipoletti: In even simpler terms, we see brands as people.
    • @timoreilly: Love it: “We need a stack, not a pile” says @michalmigurski.
    • @neha: I would be very happy to never again see a distributed systems paper eval on a workload that would fit on one machine.
    • @etherealmind: OH: "oh yeah, the extra 4 PB of storage is being installed today. Its about 4 racks of gear".
    • @lintool: Andrew Moore: Google's ecommerce platform ingests 100K-200K events per second continuously. 

  • Programming as myth building. Myths to Live By: The true symbol does not merely point to something else. It contains in itself a structure which awakens our consciousness to a new awareness of the inner meaning of life and of reality itself. A true symbol takes us to the center of the circle, not to another point on the circumference.

  • Not shocking at all: "We found the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code...A majority (77%) of the failures require more than one input event to manifest, but most of the failures(90%) require no more than 3." Really, who has the time? More on human nature in Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems.

  • Let simplicity fail before climbing the complexity ladder. Scalability! But at what COST?: "Big data systems may scale well, but this can often be just because they introduce a lot of overhead. Rather than making your computation go faster, the systems introduce substantial overheads which can require large compute clusters just to bring under control. In many cases, you’d be better off running the same computation on your laptop." But notice the kicker: "it took some work for parallel union-find." Replacing smart work with brute force is often the greater win. What are a few machine cycles between friends?

  • Programming is the ultimate team sport, so Why are Some Teams Smarter Than Others? The smartest teams were distinguished by three characteristics. First, their members contributed more equally to the team’s discussions. Second, their members can better read complex emotional states. Third, teams with more women outperformed teams with more men.

  • WhatsApp doesn't understand the web. Interesting design and discussions. Using proprietary Chrome APIs is a tough call, but this is more perplexing: "Your phone needs to stay connected to the internet for our web client to work." Is this for consistency reasons? To make sure the phone and the web stay in sync? Is it for monetization reasons? It does create a closed proxy that effectively prevents monetization leaks. It's tough to judge a solution without understanding the requirements, but there must be something compelling to impose so many limitations.

  • Roman Leventov analysis of Redis data structures. In which Salvatore 'antirez' Sanfilippo addresses point by point criticisms of Redis' implementation. People love Redis, part of that love has to come from what a good guy antirez is. Here he doesn't go all black diamond alpha nerd in the face of a challenge. He admits where things can be improved. He explains design decisions in detail. He advances the discussion with grace, humility, and smarts. A worthy model to emulate.

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge (which means this post has many more items to read so please keep on reading)...

Categories: Architecture

As a DBA Expert, which database would you choose?

Thu, 01/22/2015 - 17:29

This is a guest post by Jenny Richards, a professional database administrator who is currently employed at Remote DBA.

In the world of databases, there is no single silver bullet fitting for every gun. How you select the database to use is very dependent on every other factor of your work: 

  • Who are you and what do you do? 
  • What is your end goal – what are you working to achieve?
  • How much data do you intend to store?
  • On what language and OS platforms do your applications run?
  • What is your budget?
  • Will you also require data warehousing, decision support systems and/or BI?
Background information
Categories: Architecture

Learn from my pain - 5 Lessons from Ello's Adventures in Rapid Scaling

Wed, 01/21/2015 - 17:56

Within one week Ello went from thousands of sessions a day to a few million sessions a day. Mike Pack wrote a great article sharing what they’ve learned: 5 Early Lessons from Rapid, High Availability Scaling with Rails.

Some of their scaling challenges: quantity of data, team size, DNS, bot prevention, responding to users, inappropriate content, and other forms of caching. What did they learn?

  1. Move the graph. User relationships were implemented on a standard Rails stack using Heroku and Postgres. The relationships table became the bottleneck. Solution: denormalize the social graph and move hot data into Redis. Redis is used for speed and Postgres is used for durability. Lesson: know the core pillar that supports your core offering and make it work.

  2. Create indexes early, or you're screwed. There's a camp that says only create indexes when they are needed. They are wrong. The lack of btree indexes kills query performance. Forget a unique index and your data becomes corrupted. Once the damage is done it's hard to add unique indexes later. The data has to be cleaned up and indexes take a long time to build when there's a lot of data.

  3. Sharding is cool, but not that cool. Shard all the things only after you've tried vertically scaling as much as possible. Sharding caused a lot of pain. Creating a covering index from the start and adding more RAM so data could be served from memory, not from disk, would have saved a lot of time and stress as the system scaled.

  4. Don't create bottlenecks, or do. Every new user automatically followed a system user that was used for announcements, etc. Scaling problems that would have been months down the road hit quickly as any write to the system user caused a write amplification of millions of records. The lesson here is not what you may think. While scaling to meet the challenge of the system user was a pain, it made them stay ahead of the scaling challenge. Lesson: self-inflict problems early and often.

  5. It always takes 10 times longer. All the solutions mentioned take much longer to implement than you might think. Early estimates of a couple days soon give way to the reality of much longer time hits. Simply moving large amounts of data can take days. Adding indexes to large amounts of data takes time. And with large amounts of data problems tend to happen as you get to the larger data sizes which means you need to apply a fix and start over. 

This full article is excellent and is filled with much more detail that makes it well worth reading.

Categories: Architecture

Sponsored Post: Couchbase, VividCortex, Internap, SocialRadar, Campanja, Transversal, MemSQL, Hypertable, Scalyr, FoundationDB, AiScaler, Aerospike, AppDynamics, ManageEngine, Site24x7

Tue, 01/20/2015 - 17:55

Who's Hiring?
  • Senior DevOps EngineerSocialRadar. We are a VC funded startup based in Washington, D.C. operated like our West Coast brethren. We specialize in location-based technology. Since we are rapidly consuming large amounts of location data and monitoring all social networks for location events, we have systems that consume vast amounts of data that need to scale. As our Senior DevOps Engineer you’ll take ownership over that infrastructure and, with your expertise, help us grow and scale both our systems and our team as our adoption continues its rapid growth. Full description and application here.

  • Linux Web Server Systems EngineerTransversal. We are seeking an experienced and motivated Linux System Engineer to join our Engineering team. This new role is to design, test, install, and provide ongoing daily support of our information technology systems infrastructure. As an experienced Engineer you will have comprehensive capabilities for understanding hardware/software configurations that comprise system, security, and library management, backup/recovery, operating computer systems in different operating environments, sizing, performance tuning, hardware/software troubleshooting and resource allocation. Apply here.

  • Campanja is an Internet advertising optimization company born in the cloud and today we are one of the nordics bigger AWS consumers, the time has come for us to the embrace the next generation of cloud infrastructure. We believe in immutable infrastructure, container technology and micro services, we hope to use PaaS when we can get away with it but consume at the IaaS layer when we have to. Please apply here.

  • UI EngineerAppDynamics, founded in 2008 and lead by proven innovators, is looking for a passionate UI Engineer to design, architect, and develop our their user interface using the latest web and mobile technologies. Make the impossible possible and the hard easy. Apply here.

  • Software Engineer - Infrastructure & Big DataAppDynamics, leader in next generation solutions for managing modern, distributed, and extremely complex applications residing in both the cloud and the data center, is looking for a Software Engineers (All-Levels) to design and develop scalable software written in Java and MySQL for backend component of software that manages application architectures. Apply here.
Fun and Informative Events
  • Sign Up for New Aerospike Training Courses.  Aerospike now offers two certified training courses; Aerospike for Developers and Aerospike for Administrators & Operators, to help you get the most out of your deployment.  Find a training course near you. http://www.aerospike.com/aerospike-training/
Cool Products and Services
  • See How PayPal Manages 1B Documents & 10TB Data with Couchbase. This presentation showcases PayPal's usage of Couchbase within its architecture, highlighting Linear scalability, Availability, Flexibility & Extensibility. See How PayPal Manages 1B Documents & 10TB Data with Couchbase.

  • VividCortex is a hosted (SaaS) database performance management platform that provides unparalleled insight and query-level analysis for both MySQL and PostgreSQL servers at micro-second detail. It's not just another tool to draw time-series charts from status counters. It's deep analysis of every metric, every process, and every query on your systems, stitched together with statistics and data visualization. Start a free trial today with our famous 15-second installation.

  • SQL for Big Data: Price-performance Advantages of Bare Metal. When building your big data infrastructure, price-performance is a critical factor to evaluate. Data-intensive workloads with the capacity to rapidly scale to hundreds of servers can escalate costs beyond your expectations. The inevitable growth of the Internet of Things (IoT) and fast big data will only lead to larger datasets, and a high-performance infrastructure and database platform will be essential to extracting business value while keeping costs under control. Read more.

  • MemSQL provides a distributed in-memory database for high value data. It's designed to handle extreme data ingest and store the data for real-time, streaming and historical analysis using SQL. MemSQL also cost effectively supports both application and ad-hoc queries concurrently across all data. Start a free 30 day trial here: http://www.memsql.com/

  • Aerospike demonstrates RAM-like performance with Google Compute Engine Local SSDs. After scaling to 1 M Writes/Second with 6x fewer servers than Cassandra on Google Compute Engine, we certified Google’s new Local SSDs using the Aerospike Certification Tool for SSDs (ACT) and found RAM-like performance and 15x storage cost savings. Read more.

  • FoundationDB 3.0. 3.0 makes the power of a multi-model, ACID transactional database available to a set of new connected device apps that are generating data at previously unheard of speed. It is the fastest, most scalable, transactional database in the cloud - A 32 machine cluster running on Amazon EC2 sustained more than 14M random operations per second.

  • Diagnose server issues from a single tab. The Scalyr log management tool replaces all your monitoring and analysis services with one, so you can pinpoint and resolve issues without juggling multiple tools and tabs. It's a universal tool for visibility into your production systems. Log aggregation, server metrics, monitoring, alerting, dashboards, and more. Not just “hosted grep” or “hosted graphs,” but enterprise-grade functionality with sane pricing and insane performance. Trusted by in-the-know companies like Codecademy – try it free! (See how Scalyr is different if you're looking for a Splunk alternative.)

  • aiScaler, aiProtect, aiMobile Application Delivery Controller with integrated Dynamic Site Acceleration, Denial of Service Protection and Mobile Content Management. Cloud deployable. Free instant trial, no sign-up required.  http://aiscaler.com/

  • ManageEngine Applications Manager : Monitor physical, virtual and Cloud Applications.

  • www.site24x7.com : Monitor End User Experience from a global monitoring network.

If any of these items interest you there's a full description of each sponsor below. Please click to read more...

Categories: Architecture

Stuff The Internet Says On Scalability For January 16th, 2015

Fri, 01/16/2015 - 17:56

Hey, it's HighScalability time:


First people to free-climb the Dawn Wall of El Capitan using nothing but stone knives and bearskins (pics). 
  • $3.3 trillion: mobile revenue in 2014; ~10%: the difference between a good SpaceX landing and a crash; 6: hours for which quantum memory was held stable 
  • Quotable Quotes:
    • @stevesi: "'If you had bought the computing power found inside an iPhone 5S in 1991, it would have cost you $3.56 million.'"
    • @imgurAPI: Where do you buy shares in data structures? The Stack Exchange
    • @postwait: @xaprb agreed. @circonus does per-second monitoring, but *retain* one minute for 7 years; that plus histograms provides magical insight.
    • @iamaaronheld: A single @awscloud datacenter consumes enough electricity to send 24 DeLoreans back in time
    • @rstraub46: "We are becoming aware that the major questions regarding technology are not technical but human questions" - Peter Drucker, 1967
    • @Noahpinion: Behavioral economics IS the economics of information. via @CFCamerer 
    • @sheeshee: "decentralize all the things" (guess what everybody did in the early 90ies & why we happily flocked to "services". ;)
    • New Clues: The Internet is no-thing at all. At its base the Internet is a set of agreements, which the geeky among us (long may their names be hallowed) call "protocols," but which we might, in the temper of the day, call "commandments."

  • Can't agree with this. We Suck at HTTP. HTTP is just a transport. It should only deliver transport related error codes. Application errors belong in application messages, not spread all over the stack. 

  • Apple has lost the functional high ground. It's funny how microservices are hot and one of its wins is the independent evolution of services. Apple's software releases now make everything tied together. It's a strategy tax. The watch just extends the rigidity of the structure. But this is a huge upgrade. Apple is moving to a cloud multi-device sync model, which is a complete revolution. It will take a while for all this to shake out. 

  • This is so cool, I've never heard of Cornelis Drebbel (1620s) before or about his amazing accomplishments. The Vulgar Mechanic and His Magical Oven: His oven is one of the earliest devices that gave human control away to a machine and thus can be seen as a forerunner of the smart machine, the self-deciding automaton, the thinking robot.

  • Do you think there's a DevOps identity crisis, as Baron Schwartz suggests? Does DevOps have a messaging and positioning problem? Is DevOps just old wine in a new skin? Is DevOps made up of echo chambers? I don't know, but an interesting analysis by Baron.

  • How does Hyper-threading double your CPU throughput?: So if you are optimizing for higher throughput – that may be fine. But if you are optimizing for response time, then you may consider running with HT turned off.

  • Underdog.io share's what's Inside Datadog’s Tech Stack: python, javascript and go; the front-end happen in D3 and React; databases are Kafka, redis, Cassandra, S3, ElasticSearch, PostgreSQL; DevOps is Chef, Capistrano, Jenkins, Hubot, and others.

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge (which means this post has many more items to read so please keep on reading)...

Categories: Architecture

StackExchange's Performance Dashboard

Wed, 01/14/2015 - 17:56

StackExchange created a very cool performance dashboard that looks to be updated from real system metrics. Wouldn't it be fascinating if every site had a similar dashboard?

The dashboard contains information like there are 560 million page views per month, 260,000 sustained connections,  34 TB data transferred per month, 9 web servers with 48GB of RAM handling 185 req/s at 15% CPU usage. There are 4 SQL servers, 2 redis servers, 3 tag engine servers, 3 elasticsearch servers, and 2 HAProxy servers, along with stats on each.

There's also an excellent discussion thread on reddit that goes into more interesting details, with questions being answered by folks from StackExchange. 

StackExchange is still doing innovative work and is very much an example worth learning from. They've always danced to their own tune and it's a catchy tune at that. More at StackOverflow Update: 560M Pageviews A Month, 25 Servers, And It's All About Performance.

Categories: Architecture

The Stunning Scale of AWS and What it Means for the Future of the Cloud

Mon, 01/12/2015 - 18:05

James Hamilton, VP and Distinguished Engineer at Amazon, and long time blogger of interesting stuff, gave an enthusiastic talk at AWS re:Invent 2014 on AWS Innovation at Scale. He’s clearly proud of the work they are doing and it shows.

James shared a few eye popping stats about AWS:

  • 1 million active customers
  • All 14 other cloud providers combined have 1/5th the aggregate capacity of AWS (estimate by Gartner in 2013)
  • 449 new services and major features released in 2014
  • Every day, AWS adds enough new server capacity to support all of Amazon’s global infrastructure when it was a $7B annual revenue enterprise (in 2004).
  • S3 has 132% year-over-year growth in data transfer
  • 102Tbps network capacity into a datacenter.

The major theme of the talk is the cloud is a different world. It’s a special environment that allows AWS to do great things at scale, things you can’t do, which is why the transition from on premise x86 servers to the public cloud is happening at a blistering pace. With so many scale driven benefits to the public cloud, it's a transition that can't be stopped. The cloud will keep getting more reliable, more functional, and cheaper at a rate that you can't begin to match with your limited resources, generalist gear, bloated software stacks, slow supply chains, and outdated innovation paradigms.

That's the PR message at least. But one thing you can say about Amazon is they are living it. They are making it real. So a healthy doubt is healthy, but extrapolating out the lines of fate would also be wise.

One of the fickle finger of fate advantages AWS has is resources. At one million customers they have the scale to keep the engine of expansion and improvement going. Profits aren't being taken out, money is being reinvested. This is perhaps the most important advantage of scale.

But money without smarts is simply waste. Amazon wants you to know they have the smarts. We've heard how Google and Facebook build their own gear, Amazon does too. They build their own networking gear, networking software, racks, and they work with Intel to get faster processor versions of processors than are available on the market. The key is they know everything and control everything about their environment, so they can build simpler gear that does exactly what they want, which turns out to be cheaper and more reliable in the end.

Complete control allows quality metrics to be built into everything. Metrics drive a constant quality increase in all parts of the system, which is why against all odds AWS is getting more reliable as the pace of innovation quickens. Great pools of actionable data turned into knowledge is another huge advantage of scale.

Another thing AWS can do that you can't is the Availability Zone architecture itself. Each AZ is its own datacenter and AZs within a region are located very close together. This reduces messaging latencies, which means state can be synchronously replicated between AZs, which greatly improves availability compared to the typical approach where redundant datacenters are very far apart. 

It's a talk rich with information and...well, spunk. The real meta-theme of the talk is how Amazon consciously uses scale to their competitive advantage. For Amazon scale isn't just an expense to be dealt with, scale is a resource to exploit, if you know how.

Here's my gloss of James Hamilton's incredible talk...

Everything in the Talk has a Foundation in Scale
Categories: Architecture

Stuff The Internet Says On Scalability For January 9th, 2015

Fri, 01/09/2015 - 17:56

Hey, it's HighScalability time:


UFOs or Floating Solar Balloon power stations? You decide.

 

  • 700 Million: WhatsApp active monthly users; 17 million: comments on Stack Exchange in 2014
  • Quotable Quotes
    • John von Neumann: It is easier to write a new code than to understand an old one.
    • @BenedictEvans: Gross revenue on Apple & Google's app stores was a little over $20bn in 2014. Bigger than recorded music, FWIW.
    • Julian Bigelow: Absence of a signal should never be used as a signal. 
    • Bigelow ~ separate signal from noise at every stage of the process—in this case, at the transfer of every single bit—rather than allowing noise to accumulate along the way
    • cgb_: One of the things I've found interesting about rapidly popular opensource solutions in the last 1-2 years is how quickly venture cap funding comes in and drives the direction of future development.
    • @miostaffin: "If Amazon wants to test 5,000 users to use a feature, they just need to turn it on for 45 seconds." -@jmspool #uxdc
    • Roberta Ness: Amazing possibility on the one hand and frustrating inaction on the other—that is the yin and yang of modern science. Invention generates ever more gizmos and gadgets, but imagination is not providing clues to solving the scientific puzzles that threaten our very existence.

  • Can HTTPS really be faster than HTTP? Yes, it can. Take the test for yourself. The secret: SPDY. More at Why we don’t use a CDN: A story about SPDY and SSL

  • A fascinating and well told tale of the unexpected at Facebook. Solving the Mystery of Link Imbalance: A Metastable Failure State at Scale: The most literal conclusion to draw from this story is that MRU connection pools shouldn’t be used for connections that traverse aggregated links. At a meta-level, the next time you are debugging emergent behavior, you might try thinking of the components as agents colluding via covert channels. At an organizational level, this investigation is a great example of why we say that nothing at Facebook is somebody else’s problem.

  • Everything old is new again. Facebook on disaggregation vs. hyperconvergence: Just when everyone agreed that scale-out infrastructure with commodity nodes of tightly-coupled CPU, memory and storage is the way to go, Facebook’s Jeff Qin, a capacity management engineer – in a talk at Storage Visions 2015 – offers an opposing vision: disaggregated racks. One rack for computes, another for memory and a third – and fourth – for storage.

  • Why Instagram Worked. Instagram was the result of a pivot away from a not popular enough social networking site to a stripped down app that allowed people to document their world in pictures. Though the source article is short on the why, there's a good discussion on Hacker News. Some interesting reasons: Instagram worked because it algorithmically hides flaws in photographs so everyone's pictures look "good"; Snapping a photo is easy and revolves around a moment -- something easier to recognize when it's worthy of sharing; Startups need lucky breaks, but connections with the right people increase the odds considerably; Instagram worked because it was at the right place at the right time; It worked because it's a simple, quick, ultra-low friction way of sharing photos.

  • Atheists, it's not what you think. The God Login. The incomparable Jeff Atwood does a deep dive on the design of a common everyday object: the Login page. The title was inspired by one of Jeff's teacher's who asked what was the "God Algorithm" for a problem, that is, if God solved a problem what would the solution look like? While you may not agree with the proposed solution to the Login page problem, you may at least come away believing that one may or may not exist.

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge (which means this post has many more items to read so please keep on reading)...

Categories: Architecture

The Ultimate Guide: 5 Methods for Debugging Production Servers at Scale

Wed, 01/07/2015 - 17:56

This a guest post by Alex Zhitnitsky, an engineer working at Takipi, who is on a mission to help Java and Scala developers solve bugs in production and rid the world of buggy software.

How to approach the production debugging conundrum?

All sorts of wild things happen when your code leaves the safe and warm development environment. Unlike the comfort of the debugger in your favorite IDE, when errors happen on a live server - you better come prepared. No more breakpoints, step over, or step into, and you can forget about adding that quick line of code to help you understand what just happened. In production, bad things happen first and then you have to figure out what exactly went wrong. To be able to debug in this kind of environment we first need to switch our debugging mindset to plan ahead. If you’re not prepared with good practices in advance, roaming around aimlessly through the logs wouldn’t be too effective.

And that’s not all. With high scalability architectures, enter high scalability errors. In many cases we find transactions that originate on one machine or microservice and break something on another. Together with Continuous Delivery practices and constant code changes, errors find their way to production with an increasing rate. The biggest problem we’re facing here is capturing the exact state which led to the error, what were the variable values, which thread are we in, and what was this piece of code even trying to do?

Let’s take a look at 5 methods that can help us answer just that. Distributed logging, advanced jstack techniques, BTrace and other custom JVM agents:

1. Distributed Logging
Categories: Architecture

Sponsored Post: Wikia, MemSQL, Campanja, Hypertable, Sprout Social, Scalyr, FoundationDB, AiScaler, Aerospike, AppDynamics, ManageEngine, Site24x7

Tue, 01/06/2015 - 17:56

Who's Hiring?
  • DevOps Engineer for Wikia. Wikia is the go-to place for fan content that is created entirely by fans! As a Quantcast Top 20 site with over 120 million monthly uniques we are tackling very interesting problems at a scale you won't find at many other places. We embrace a DevOps culture and are looking to expand our team with people that are excited about working with just about every piece of our stack. You'll also partner with our platform team as they break down the monolith and move towards service oriented architecture. Please apply here.

  • Engineer Manager - Platform. At Wikia we're tackling interesting problems at a scale you won't find at many other places. We're a Quantcast Top 20 site with over 120 million monthly uniques. 100% of the content on our 400,000+ communities is user generated. That combination of scale and UGC creates some pretty compelling challenges and on top of that we're working on moving away from a monolithic architecture and actively working on finding the best technologies to best suit each individual piece of our platform. We're currently in search of an experienced Engineer Manager to help drive this process. Please apply here.

  • Campanja is an Internet advertising optimization company born in the cloud and today we are one of the nordics bigger AWS consumers, the time has come for us to the embrace the next generation of cloud infrastructure. We believe in immutable infrastructure, container technology and micro services, we hope to use PaaS when we can get away with it but consume at the IaaS layer when we have to. Please apply here.

  • Performance and Scale EngineerSprout Social, will be like a physical trainer for the Sprout social media management platform: you will evaluate and make improvements to keep our large, diverse tech stack happy, healthy, and, most importantly, fast. You'll work up and down our back-end stack - from our RESTful API through to our myriad data systems and into the Java services and Hadoop clusters that feed them - searching for SPOFs, performance issues, and places where we can shore things up. Apply here.

  • UI EngineerAppDynamics, founded in 2008 and lead by proven innovators, is looking for a passionate UI Engineer to design, architect, and develop our their user interface using the latest web and mobile technologies. Make the impossible possible and the hard easy. Apply here.

  • Software Engineer - Infrastructure & Big DataAppDynamics, leader in next generation solutions for managing modern, distributed, and extremely complex applications residing in both the cloud and the data center, is looking for a Software Engineers (All-Levels) to design and develop scalable software written in Java and MySQL for backend component of software that manages application architectures. Apply here.
Fun and Informative Events
  • Sign Up for New Aerospike Training Courses.  Aerospike now offers two certified training courses; Aerospike for Developers and Aerospike for Administrators & Operators, to help you get the most out of your deployment.  Find a training course near you. http://www.aerospike.com/aerospike-training/
Cool Products and Services
  • MemSQL provides a distributed in-memory database for high value data. It's designed to handle extreme data ingest and store the data for real-time, streaming and historical analysis using SQL. MemSQL also cost effectively supports both application and ad-hoc queries concurrently across all data. Start a free 30 day trial here: http://www.memsql.com/

  • Aerospike Hits 1M writes per second with 6x Fewer Servers than Cassandra. A new Google Compute Engine benchmark demonstrates how the Aerospike database hit 1 million writes per second with just 50 nodes - compared to Cassandra's 300 nodes. Read the benchmark: http://www.aerospike.com/blog/1m-wps-6x-fewer-servers-than-cassandra/

  • Hypertable Inc. Announces New UpTime Support Subscription Packages. The developer of Hypertable, an open-source, high-performance, massively scalable database, announces three new UpTime support subscription packages – Premium 24/7, Enterprise 24/7 and Basic. 24/7/365 support packages start at just $1995 per month for a ten node cluster -- $49.95 per machine, per month thereafter. For more information visit us on the Web at http://www.hypertable.com/. Connect with Hypertable: @hypertable--Blog.

  • FoundationDB 3.0. 3.0 makes the power of a multi-model, ACID transactional database available to a set of new connected device apps that are generating data at previously unheard of speed. It is the fastest, most scalable, transactional database in the cloud - A 32 machine cluster running on Amazon EC2 sustained more than 14M random operations per second.

  • Diagnose server issues from a single tab. The Scalyr log management tool replaces all your monitoring and analysis services with one, so you can pinpoint and resolve issues without juggling multiple tools and tabs. It's a universal tool for visibility into your production systems. Log aggregation, server metrics, monitoring, alerting, dashboards, and more. Not just “hosted grep” or “hosted graphs,” but enterprise-grade functionality with sane pricing and insane performance. Trusted by in-the-know companies like Codecademy – try it free! (See how Scalyr is different if you're looking for a Splunk alternative.)

  • aiScaler, aiProtect, aiMobile Application Delivery Controller with integrated Dynamic Site Acceleration, Denial of Service Protection and Mobile Content Management. Cloud deployable. Free instant trial, no sign-up required.  http://aiscaler.com/

  • ManageEngine Applications Manager : Monitor physical, virtual and Cloud Applications.

  • www.site24x7.com : Monitor End User Experience from a global monitoring network.

If any of these items interest you there's a full description of each sponsor below. Please click to read more...

Categories: Architecture

Von Neumann had one piece of advice for us: not to originate anything.

Mon, 01/05/2015 - 18:05

I don't know about you, but when I read about the exploits of people like John von Neumann, Alan Turing, J. Robert Oppenheimer, and Kurt Gödel in Turing's Cathedral: The Origins of the Digital Universe by George Dyson, I can't help but flash back to the Age of Heroes, where the names are different--Achilles, Odysseus, Agamemnon, and Ajax--but the larger than life story they lived is familiar. Dyson's book is the Iliad of our times, telling the story of great battles of the human mind: the atomic bomb, Turing machines, programmable computers, weather prediction, genetic-modeling, Monte Carlo simulation, and cellular automata.

Which brings up another question I can't help but ponder: is it the age that makes the person or is it the person that makes the age? Do we have these kind of people today? Or can they only be forged in war?

Anyway, I found this advice from John von Neumann, as told by Julian Bigelow, about how to go about building the MANIAC  computer. This advice still echoes down project management halls today:

“Von Neumann had one piece of advice for us: not to originate anything.” This helped put the IAS project in the lead. “One of the reasons our group was successful, and got a big jump on others, was that we set up certain limited objectives, namely that we would not produce any new elementary components,” adds Bigelow. “We would try and use the ones which were available for standard communications purposes. We chose vacuum tubes which were in mass production, and very common types, so that we could hope to get reliable components, and not have to go into component research.”

They did innovate on architecture by making it possible to store and run programs. Some interesting quotes from the book around that development:

Categories: Architecture