Software Development Blogs: Programming, Software Testing, Agile Project Management

Stuff The Internet Says On Scalability For November 27th, 2015

Hey, it's HighScalability time:

The most detailed picture of the Internet ever as compiled by an illegal 420,000-node botnet.
  • $40 billion: P2P lending in China; 20%: amount of all US margin expansion accounted for by Apple since 2010; 11: years of Saturn photos; 117: number of different steering wheels offered for a VW Golf; 1Gbps: speed of a network using a lightbulb.

  • Quotable Quotes:
    • @jaksprats: If we could compile a subset of JavaScript to Lua, JS could run on Server (Node.js), Browser, Desktop, iOS, & Android. JS could run EVERYWHERE
    • @wilkieii: Tech: "Don't roll your own crypto if you aren't an expert" *replaces nutrition with Soylent, currency with bitcoin* *puts wifi in lightbulb*
    • @brianpeddle: The architecture of one human brain would require a zettabyte of capacity. Full simulation of a human brain by 2023.
    • MarshalBanana: That can still easily be the right choice. Complex algorithms trade asymptotic performance for setup cost and maintenance cost. Sometimes the tradeoff isn't worth it.
    • kevindeasi: There are so many things to know nowadays. Backend: Sql, NoSql, NewSql, etc. Middleware: Django, NodeJs, Spring, Groovy, RoR, Symfony, etc. Client: Angular, Ember, React, Jquery, etc. I haven't even mentioned hardware, security, servers/cloud, and api. Now you also need to know about theory, UI/UX, git, deploying servers, HTTP, scrum, software development process, testing.
    • Brian Chesky: It was better to have 100 people who loved us vs. 1M people who liked us. All movements grow this way.
    • idlewords: All the advantages of a dedicated server without the hassle of saving tons of money.
    • jorangreef: Well, how would you handle massive traffic spikes? Through a combination of vertical and horizontal scaling? Through having excess capacity? Except that I would probably want to start with something fast and inexpensive to begin with.
    • @jaykreps: "The bigger the interface, the weaker the abstraction"--@rob_pike
    • Animats: That still irks me. The real problem is not tinygram prevention. It's ACK delays, and that stupid fixed timer. They both went into TCP around the same time, but independently. I did tinygram prevention (the Nagle algorithm) and Berkeley did delayed ACKs, both in the early 1980s. The combination of the two is awful.
    • @jaykreps: Distributed computing is the new normal: Mesos, K8s = dist'd processes; Cassandra, Kafka, etc = dist'd data; microservices = dist'd apps.
    • @bradfitz: OH: "Well you can add nodes to the cluster. They made that work well, but you can't remove them. It's the Hotel California of auto-scaling."

  • Creating Your Own EC2 Spot Market -- Part 2. Video encoding represents 70% of Netflix's computing needs. And Netflix has a daily peak of 12,000 unused instances. So they created their own spot market, improving encoding throughput by the equivalent of a 210% increase in encoding capacity. Using their updated real-time approach they were able to perform an encoding job in 18 hours that they expected to take a few days. Great article with a lot of deep thinking on the topic.

  • Amen! We should come up with a catchy name for RAII so more languages support it because RAII is awesome and simplifies code!

  • Google as a cloud company instead of an ad company? It could happen: Google's Holzle Envisions Cloud Business Eclipsing Ads in 2020. Google announced Custom Machine Types so you can configure the number of virtual CPUs and the amount of RAM you want for your machine. I imagine this nifty feature is enabled by Google's advanced datacenter scheduling software, but it will take more than that to beat AWS and Azure. To take market share Google may need to instigate a price war. Though it looks like Google might make a lot of money charging back to Google.

  • Good explanation of what serverless computing is by Leonardo Federico: the phrase “serverless” doesn’t mean servers are no longer involved. It simply means that developers no longer have to think "that much" about them. Computing resources get used as services without having to manage around physical capacities or limits. Let's take for example AWS Lambda. "Lambda allows you to NOT think about servers. Which means you no longer have to deal with over/under capacity, deployments, scaling and fault tolerance, OS or language updates, metrics, and logging." A minimal handler sketch follows.
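
To make that concrete, here is a minimal sketch of a Lambda function in Go, using the aws-lambda-go runtime library (which postdates this post); the event shape and greeting are illustrative, not from the article:

package main

import (
    "context"
    "fmt"

    "github.com/aws/aws-lambda-go/lambda"
)

// Request is an illustrative event shape; Lambda deserializes the
// incoming JSON payload into it.
type Request struct {
    Name string `json:"name"`
}

// The handler holds only business logic: no servers, capacity
// planning, or OS updates to think about.
func handler(ctx context.Context, req Request) (string, error) {
    return fmt.Sprintf("Hello, %s!", req.Name), nil
}

func main() {
    lambda.Start(handler)
}

You upload the compiled binary and the platform runs it on demand; capacity, scaling and fault tolerance are AWS's problem, which is exactly Federico's point.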

Don't miss all that the Internet has to say on Scalability. Click below and become eventually consistent with all scalability knowledge (which means this post has many more items to read, so please keep on reading)...

Categories: Architecture

10 Personal Productivity Tools from Agile Results

“Great acts are made up of small deeds.” -- Lao Tzu

The best productivity tools are the ones you actually use and that get you results.

I'll share some quick personal productivity tools from Agile Results, introduced in the book, Getting Results the Agile Way.

Agile Results is a Personal Results System for work and life, and it's all about how to use your best energy for your best results.

With that in mind, here are some quick productivity tools you can use to think better, feel better, and do better, while getting results better, faster, and easier with more fun ...


Think in terms of Three Wins each day, each week, each month, each year.

You can apply the Rule of 3 to life. Rather than get overwhelmed by your tasks, choose three things you want to accomplish today. This puts you in control. If nothing else, it gives you a very simple way to focus for the day. This will help you get on track and practice the art of ruthless prioritization.

Consider the energy you have, what's most important, what's most valuable, and what would actually feel like a win for you and build momentum.

To get started, right here, right now, simply write down on paper the three things you want to achieve today.


The Monday Vision, Daily Outcomes, and Friday Reflection pattern is a simple habit for daily and weekly results.

Monday Vision - On Monday, identify Three Wins that you want for the week.  Imagine it's Friday and you're looking back on your week: what are three results that you would be proud of?  This helps you create a simple vision for your week.

Daily Wins - Get a Fresh Start each day.  Each day, identify Three Wins that you want for the day.  First thing in the morning, before you dive into the hustle and the bustle, step back.  Take the balcony view for your day and identify Three Wins that you want to accomplish.  This helps you create a simple vision for your day.  You can imagine three scenes from your day -- morning, noon and night -- or whatever works for you.

One way to stay balanced here is to ask yourself both, "What do I want to accomplish?", and "What are the key things that if I don't get done ... I'm screwed?"

Friday Reflection -- Each Friday, reflect on your week.  To do this, ask yourself two questions:

“What are 3 things going well?”

“What are 3 things to improve?”

You'll find that you are either focusing on the wrong things, getting distracted, or biting off more than you can chew.  Use what you learn here as input into next week's Monday Vision, Daily Wins, Friday Reflection. 

The real power of Friday Reflection is that you acknowledge and appreciate your Personal Victories.  If you gave your all during your workout, hats off to you.  If you pushed a bit harder to really nail your presentation, great job.

It's also a simple way to "put a bow" on your results for the week.  Now, if your manager or somebody were to ask you what you accomplished for the week, you have a simple story of Three Wins.


Hot Spots are a simple metaphor for thinking about what’s important.

Think of your life like a heat map.

Start with a simple set of categories:

  1. Mind
  2. Body
  3. Emotions
  4. Career
  5. Finance
  6. Relationships
  7. Fun

Where do you need to spend more time or less time?

The Hot Spot categories support each other; they are connected, and in some cases overlapping.  But they give you a very quick way to explore an area of your life. 

It's hard to do well at work if you're having issues with relationships.  And the surprise for a lot of people is that if they take better care of their body, work gets a lot easier, and they improve their mind and emotions. 


The Growth Mindset is a learning mindset.

Instead of a static view of things, you approach things as experiments to learn and explore.  Failure isn't final.  Failure isn't fatal.  Instead, find the lesson and change your approach.

By adopting a Growth Mindset, you get better and better over time.  You don't say, "I'm no good at that."  You say, "I'm getting better at that." or "I'm learning."

With a Growth Mindset and a focus on continuous learning, you turn your days into learning opportunities.  This helps you keep your motivation going and your energy strong.

Life-long Learners last longer :)


Timeboxing is a way to set a time "budget."  This helps you avoid spending too much time on something, or over-investing when it's diminishing returns.

For a lot of people, they find they can focus in short batches.  They can't focus indefinitely, but if they know they only have to work on something for, say, 20 minutes, it helps them fully focus on the task at hand.

If you've heard of the Pomodoro Technique, this is an example.  Set a time limit for a task, and work on the task until the buzzer goes off.

I use Timeboxing at multiple levels.  I might Timebox a mini-project to a week or a month, rather than let it go on forever "until it is done."  By using a Timebox, I create a sense of urgency and I give myself a finish line.  That's a real key to staying motivated and refueling your momentum.

Timeboxing can help you improve your productivity in a very simple way. For example, rather than try to figure out how long something might take, start by figuring out how much time you want to invest in it. Identify up front at what point it hits diminishing returns. This will help you cut your losses and figure out how to optimize your time.


Each week spend more time in your strengths, and less time in your weaknesses.

Push activities that make you weak to the first part of your day. By doing your Worst Things First, you create a glide path for the rest of the day. This is like Brian Tracy's Eat that Frog.

Set limits.  Stuff the things that make you weak into a Timebox. For example, if the stuff that makes you weak is taking more than 20 percent of your day, then find a way to keep it within that 20 percent boundary. This might mean limiting the time or quantity.

Sometimes you just can't get rid of the things that make you weak; in that case, balance it with more things that energize you and make you strong.

Apply this to your week too. Push the toughest things that drain you to the start of the week to create a glide path. Do the same with people. Spend more time with people that make you strong and less time with people that make you weak. Be careful not to confuse the things that make you weak with challenges that will actually make you stronger. Grow yourself stronger over time.


Pick one thing to improve for the month.

Each month, pick something new; this gives you a chance to cycle through 12 things over the year. Or if necessary, you can always repeat a sprint.

The idea is that 30 days is enough time to experiment with your results throughout the month. Because you might not see progress in the first couple of weeks while you’re learning, a month is a good chunk of time to check your progress.

This is especially helpful if you find that you start a bunch of things but never finish.  Just focus this month on the one thing, and then next month, you can focus on the other thing, and so on.

Each month is a Fresh Start and you get to pick a theme for the month so that everything you do accrues to something bigger.


This is perhaps one of the most impactful ways to improve your productivity.

Pair with people that complement your strengths.

Pair up or team up with others that complement your preferred patterns.  If you are a Starter, pair up with a Finisher.  If you are a Thinker, pair up with a Doer.  If you are a Maximizer, pair up with a Simplifier.

For anything, and I mean anything, that you want to do better or faster, there is somebody in the world who lives and breathes it.  And, in my experience, they are more than happy to teach you, if you just ask.

The best way to Pair Up is to find somebody where it's a two-way exchange of value and you both get something out of it.  To do this, it helps when you really know what you bring to the table, so it's clear why you are Pairing Up.

Ask yourself, who can you team up with to get better results?


Chances are you have certain hours in the day or night when you are able to accomplish more.

These are your personal Power Hours.

Guard your Power Hours so they are available to you and try to push the bulk of your productivity within these Timeboxes. This maximizes your results while optimizing your time.

You might find you only have a few great hours during the week where you feel you produce effective and efficient results. You may even feel “in the zone” or in your “flow” state. Gradually increase the number of Power Hours you have. You can build a powerful day, or powerful week, one power hour at a time. If you know you only have three Power Hours in a 40-hour week, see if you can set yourself up to have five Power Hours.


Your Creative Hours are those times during the week where you feel you are at your creative best.

This might be a Saturday morning or a Tuesday night, or maybe during weekday afternoons.

The key is to find those times where you have enough creative space, to do your creative work.

Just like adding power hours, you might benefit from adding more creative hours. Count how many creative hours you have during the week. If it’s not enough, schedule more and set yourself up so that they truly are creative hours. If you’re the creative type, this will be especially important. If you don’t think of yourself as very creative, then simply use your Creative Hours to explore any challenges in your life or to innovate.

There is so much more, but I find that if you play around with these Personal Productivity Tools, you can very quickly get better results in work and life.

If you don't know where to start, start simple:

Ask yourself what are the Three Wins you want to accomplish today, and write those down on a piece of paper.

That's it -- You're doing Agile Results.

Categories: Architecture, Programming

Scheduling containers and more with Nomad

Xebia Blog - Thu, 11/26/2015 - 11:18

Specifically for the Dutch Docker Day on the 20th of November, HashiCorp released version 0.2.0 of Nomad, which has some awesome features such as service discovery through integration with Consul, the system scheduler and restart policies.  HashiCorp worked hard to release version 0.2.0 on the 18th of November and we pushed ourselves to release a self-paced, hands-on workshop. If you would like to explore and play with these latest features of Nomad, go check out the workshop over at http://workshops.nauts.io.

In this blog post (or as I experienced it: roller coaster ride), you will catch a glimpse of the work that went into creating the workshop.

Last Friday, November the 20th, was the first edition of the Dutch Docker Day, where I helped prepare a workshop about "scheduling containers and more with Nomad". It was a great experience where attendees got to play with the new features included in 0.2.0, which nearly didn't make it into the workshop.


When HashiCorp released Nomad during their HashiConf event at the end of September, I was really excited, as they always produce high quality tools with great user experience. As soon as the binary was available I downloaded it and tried to set up a cluster to see how it compared to some of its competitors. The first release already had a lot of potential but also a lot of problems. For instance: when a container failed, Nomad would report it dead, but take no action; restart policies were still but a dream.

There were a lot of awesome features in store for the future of Nomad: integration with Consul, system jobs, batch jobs, restart policies, etc. Imagine all possible integrations with other HashiCorp tools! I was sold. So when I was asked to prepare a workshop for the Dutch Docker Day I jumped at the opportunity to get better acquainted with Nomad. The idea was that the attendees of the workshop, since it was a pretty new product and had some quirks, would go on an explorative journey into the far reaches of the scheduler and together find its treasures and dark secrets.

Time went by and the workshop was taking shape nicely. We had a nice setup with a cluster of machines that automatically bootstrap the Nomad cluster and set up its basic configuration. We were told that there would be a new version released before the Dutch Docker Day but nothing appeared, until the day before the event. I was both excited and terrified! The HashiCorp team worked long hours to get the new release of Nomad done in time for the Dutch Docker Day so Armon Dadgar, the CTO of HashiCorp and original creator of Nomad, could present the new features during his talk. This of course is a great thing, except for the fact that the workshop was entirely aimed at 0.1.2 and we had none of these new features incorporated into our Vagrant box. Were we going to throw all our work overboard and just start over, the night before the event?

“Immediately following the initial release of Nomad 0.1, we knew we wanted to get Nomad 0.2 and all its enhancements into the hands of our users by Dutch Docker Day. The team put in a huge effort over the course of a month and a half to get all the new features done and to the polish people expect of HashiCorp products. Leading up to the release we ran into one crazy bug after another (surprisingly all related to serialization). After intense debugging we got it to the fit and polish we wanted the night before at 3 AM! Once the website docs were updated and the blog post written, we released Nomad 0.2. The experience was very exciting but also exhausting! We are very happy with how it turned out and hope the users are too!”

- Alex Dadgar, HashiCorp Engineer working on Nomad

It took until late in the evening to get an updated Vagrant box with a bootstrapped Consul cluster and the new Nomad version, in order to showcase the auto discovery feature and Consul integration that 0.2.0 added. However, the slides for the workshop were still referencing the problems we encountered when trying out the 0.1.0 and 0.1.2 releases, so all the slides and statements we had made about things not working or being released in the future had to be aligned with the fixes and improvements that came with the new release. After some hours of hectic editing during the morning of the event, the slides were finally updated and showcased all the glorious new features!

Nomad simplifies operations by supporting blue/green deployments, automatically handling machine failures, and providing a single workflow to deploy applications.

The amount of new features, fixes and improvements they added in this release is staggering. In order to discover services there is no longer a need for extra tools such as Registrator: your services are now automatically registered and deregistered as soon as they are started and stopped (which I first thought was a bug, because I wasn't used to Nomad actually restarting my dead containers). The system scheduler is another feature I've been missing in other schedulers for a while, as it makes it possible to easily schedule services (such as Consul or Sysdig) on all of the eligible nodes in the cluster.
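
To give a feel for these features, here is a sketch of a Nomad job file; the syntax follows the 0.2-era documentation as best I can reconstruct it, and the image, port label, and resource numbers are illustrative rather than taken from the workshop:

# Illustrative job file: run a Redis container as a long-lived service.
job "redis" {
  datacenters = ["dc1"]
  type        = "service"   # "system" would run it on every eligible node

  group "cache" {
    # Restart policy (new in 0.2.0): retry failed tasks instead of
    # just reporting them dead.
    restart {
      attempts = 10
      interval = "5m"
      delay    = "25s"
    }

    task "redis" {
      driver = "docker"
      config {
        image = "redis:latest"
      }

      # Service discovery (new in 0.2.0): Nomad registers and
      # deregisters this service in Consul automatically.
      service {
        name = "redis"
        port = "db"
      }

      resources {
        cpu    = 500   # MHz
        memory = 256   # MB
        network {
          mbits = 10
          port "db" {}
        }
      }
    }
  }
}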

Feature                    Description                                                  0.1.2  0.2.0
Service scheduler          Schedule a long lived job.                                   Y      Y
Batch scheduler            Schedule batch workloads.                                    Y      Y
System scheduler           Schedule a long lived job on every eligible node.            N      Y
Service discovery          Discover launched services in Consul.                        N      Y
Restart policies           If and how to restart a service when it fails.               N      Y
Distinct host constraint   Ensure that Task Groups are running on distinct clients.     N      Y
Raw exec driver            Run exec jobs without jailing them.                          N      Y
Rkt driver                 A driver for running containers with Rkt.                    N      Y
External artifacts         Download external artifacts to execute for Exec and
                           Raw exec drivers.                                            N      Y

And numerous fixes/improvements were added to 0.2.0.

If you would like to follow the self-paced workshop by yourself, you can find the slides, machines and scripts for the workshop at http://workshops.nauts.io together with the other workshops of the event. Please let me know your experiences, so the workshop can be improved over time!

I would like to thank the HashiCorp team for their amazing work on the 0.2.0 release; the speed at which they have added so many great new features and improved the stability is incredible.

It was a lot of fun preparing the workshop for the Dutch Docker Day. Working with bleeding edge technology is always a great way to really get to know its inner workings and quirks, and I would recommend it to anyone. Just be prepared to do some last-minute work ;)

Software architecture diagrams should be maps of your source code

Coding the Architecture - Simon Brown - Wed, 11/25/2015 - 09:23

If you've ever worked on a codebase that's more than just a sample application, you'll know that understanding and navigating the code can be tricky, certainly until you familiarise yourself with the key structures within it. Once you have a shared vocabulary that you can use to describe those key structures, creating some diagrams to describe them is easy. And if those structures are hierarchical, your diagrams become maps that you can use to navigate the codebase.

Software architecture diagrams are maps of your code

If you open up something like Google Maps on your smartphone and do a search for Jersey, it will zoom into Jersey. This is great if you want to know what's inside Jersey and what the various place names are, but if you've never heard of Jersey it's completely useless. What you then need to do is pinch-to-zoom-out to get back to the map of Europe, which puts Jersey in context. Diagrams of our software should be the same. Sometimes, as developers, we want the zoomed-in view of the code and at other times, depending on who we are talking to for example, we need a zoomed-out view.


A feature that has been built into Structurizr is that you can link components on a component diagram to code-level elements, which provides that final level of navigation from diagrams to code. You can try this yourself on the software architecture diagrams for the Spring PetClinic application.

Whatever tooling you use to create software architecture diagrams though, make sure that your diagrams reflect real structures in the code and that the mapping between diagrams and code is simple. My FREE The Art of Visualising Software Architecture ebook has more information on this topic.

Categories: Architecture

Sponsored Post: StatusPage.io, iStreamPlanet, Redis Labs, Jut.io, SignalFx, InMemory.Net, VividCortex, MemSQL, Scalyr, AiScaler, AppDynamics, ManageEngine, Site24x7

Who's Hiring?
  • Senior Devops Engineer - StatusPage.io is looking for a senior devops engineer to help us in making the internet more transparent around downtime. Your mission: help us create a fast, scalable infrastructure that can be deployed to quickly and reliably.

  • As a Networking & Systems Software Engineer at iStreamPlanet you’ll be driving the design and implementation of a high-throughput video distribution system. Our cloud-based approach to video streaming requires terabytes of high-definition video routed throughout the world. You will work in a highly-collaborative, agile environment that thrives on success and eats big challenges for lunch. Please apply here.

  • As a Scalable Storage Software Engineer at iStreamPlanet you’ll be driving the design and implementation of numerous storage systems including software services, analytics and video archival. Our cloud-based approach to world-wide video streaming requires performant, scalable, and reliable storage and processing of data. You will work on small, collaborative teams to solve big problems, where you can see the impact of your work on the business. Please apply here.

  • At Scalyr, we're analyzing multi-gigabyte server logs in a fraction of a second. That requires serious innovation in every part of the technology stack, from frontend to backend. Help us push the envelope on low-latency browser applications, high-speed data processing, and reliable distributed systems. Help extract meaningful data from live servers and present it to users in meaningful ways. At Scalyr, you’ll learn new things, and invent a few of your own. Learn more and apply.

  • UI Engineer - AppDynamics, founded in 2008 and led by proven innovators, is looking for a passionate UI Engineer to design, architect, and develop their user interface using the latest web and mobile technologies. Make the impossible possible and the hard easy. Apply here.

  • Software Engineer - Infrastructure & Big Data - AppDynamics, leader in next generation solutions for managing modern, distributed, and extremely complex applications residing in both the cloud and the data center, is looking for Software Engineers (all levels) to design and develop scalable software written in Java and MySQL for the backend component of software that manages application architectures. Apply here.
Fun and Informative Events
  • Your event could be here. How cool is that?
Cool Products and Services
  • Real-time correlation across your logs, metrics and events.  Jut.io just released its operations data hub into beta and we are already streaming in billions of log, metric and event data points each day. Using our streaming analytics platform, you can get real-time monitoring of your application performance, deep troubleshooting, and even product analytics. We allow you to easily aggregate logs and metrics by micro-service, calculate percentiles and moving window averages, forecast anomalies, and create interactive views for your whole organization. Try it for free, at any scale.

  • Turn chaotic logs and metrics into actionable data. Scalyr replaces all your tools for monitoring and analyzing logs and system metrics. Imagine being able to pinpoint and resolve operations issues without juggling multiple tools and tabs. Get visibility into your production systems: log aggregation, server metrics, monitoring, intelligent alerting, dashboards, and more. Trusted by companies like Codecademy and InsideSales. Learn more and get started with an easy 2-minute setup. Or see how Scalyr is different if you're looking for a Splunk alternative or Sumo Logic alternative.

  • SignalFx just launched an advanced monitoring platform for modern applications that's already processing 10s of billions of data points per day. SignalFx lets you create custom analytics pipelines on metrics data collected from thousands or more sources to create meaningful aggregations--such as percentiles, moving averages and growth rates--within seconds of receiving data. Start a free 30-day trial!

  • InMemory.Net provides a Dot Net native in memory database for analysing large amounts of data. It runs natively on .Net, and provides native .Net, COM & ODBC APIs for integration. It also has an easy to use language for importing data, and supports standard SQL for querying data. http://InMemory.Net

  • VividCortex goes beyond monitoring and measures the system's work on your servers, providing unparalleled insight and query-level analysis. This unique approach ultimately enables your team to work more effectively, ship more often, and delight more customers.

  • MemSQL provides a distributed in-memory database for high value data. It's designed to handle extreme data ingest and store the data for real-time, streaming and historical analysis using SQL. MemSQL also cost effectively supports both application and ad-hoc queries concurrently across all data. Start a free 30 day trial here: http://www.memsql.com/

  • aiScaler, aiProtect, aiMobile Application Delivery Controller with integrated Dynamic Site Acceleration, Denial of Service Protection and Mobile Content Management. Also available on Amazon Web Services. Free instant trial, 2 hours of FREE deployment support, no sign-up required. http://aiscaler.com

  • ManageEngine Applications Manager: Monitor physical, virtual and Cloud Applications.

  • www.site24x7.com: Monitor End User Experience from a global monitoring network.

If any of these items interest you there's a full description of each sponsor below. Please click to read more...

Categories: Architecture

Example Mapping - Steering the conversation

Xebia Blog - Mon, 11/23/2015 - 19:14

People who are familiar with BDD and ATDD already know how useful the three amigos (product owner, tester and developer) session is for talking about what the system under development is supposed to do. But somehow these refinement sessions seem to drain the group's energy. One of the problems I see is not having a clear structure for conversations.

Example Mapping is a simple technique that can steer the conversation and break down any product backlog item within 30 minutes.

The Three Amigos

Example Mapping is best used in so-called Three Amigos Sessions. The purpose of this session is to create a common understanding of the requirements and a shared vocabulary across the product owner and the rest of the team. During this session the product owner shares every user story by explaining the need for change in the product. It is essential that the conversation has multiple points of view. Testers and developers identify missing requirements or edge cases, which are addressed by describing accepted behaviour, before a feature is considered ready for development.

In order to help you steer the conversations, here is a list of guidelines for Three Amigos Sessions:

  • Empathy: Make sure the team has the capability to help each other understand the requirements. Without empathy and the room for it, you are lost.
  • Common understanding of the domain: Make sure that the team uses the same vocabulary (digital or physical) and speaks the same domain language.
  • Think big, but act small: Make sure all user stories are small and ready enough to make an impact.
  • Rules and examples: Make sure every user story explains the specification with rules and scenarios / examples.
Example mapping

Basic ingredients for Example Mapping are curiosity and a pack of post-it notes containing the following colours:

  • Yellow for defining the user story
  • Red for defining questions
  • Blue for defining rules
  • Green for defining examples

Using the following steps can help you steer the conversations towards accepted behaviour of the system under development:

  1. Understanding the problem
    Let the product owner start by writing down the user story on a yellow post-it note and have him explain the need for change in the product. The product owner should help the team understand the problem.
  2. Challenge the problem by asking questions
    Once the team has understood the problem, the team challenges the problem by asking questions. Collect all the questions by writing them down starting with "What if ... " on red post-it notes. Place them on the right side of the user story (yellow) post-it note. We will treat this post-it note as a ticket for a specific and focussed discussion.
  3. Identifying rules
    The key here is to identify rules for every given answer (steered from the red question post-it notes). Extract rules from the answers and write them down on a blue post-it note. Place them below the user story (yellow) post-it note. This basically describes the acceptance criteria of a user story. Make sure that every rule can be discussed separately. The single responsibility principle and separation of concerns should be applied.
  4. Describing situations with examples
    Once you have collected all the important rules of the user story, you collect all interesting situations / examples by writing them down on a green post-it note. Place them below the rule (blue) post-it note. Make sure that the team talks about examples focussed on one single rule. Steer the discussion by asking questions like: Have we reached the boundaries of the rule? What happens when the rule fails?
An example


In the example given above, the product owner requires a free shipping process. She wrote it down on a yellow post-it note. After collecting and answering questions, two rules were discussed and refined on blue post-it notes: the shopping cart limit and the appearance of the free shipping banner on product pages. All further discussions were steered towards the appropriate rule. Two examples were defined for the shopping cart limit and one for the free shipping banner, on green post-it notes. Besides steering the team in rule based discussions, the team also gets a clear overview of the scope for the first iteration of the requirement.

Getting everyone on the same page is the key to success here. Try it a couple of times and let me know how it went.


Your New Technical Skills

One of the struggles a developer faces when moving up the ladder is how to keep their technical skills.

If they are used to being a high-performing, individual contributor, and a technical go-to resource, this is especially challenging.


Because the job is different, now.

It’s no longer about how awesome your developer skills are.  Now it’s about bringing out the best from the people you manage, and hopefully *lead.*  Your job is now about creating a high-performing team.   It’s about growing more leaders.  It’s about being the oil and the glue.  The oil so that the team can work effectively, as friction-free as possible, and the glue, so that all the work connects together.

There’s a good book called What Got You Here, Won’t Get You There, by Marshall Goldsmith.  The book title sort of says it all, but the big idea is that if you take on a new management role, but continue to perform like an individual contributor, or at a lower level, don’t expect to be successful.

The irony is that most people will quickly default to doing what they do best, which is what got them to where they are.   But now the rules have changed, and they don’t adapt.  And as the saying goes, adapt or die.  It’s how a lot of careers end.

But not you.

While you will want to keep up the skills that got you to where you are, the real challenge is about adding new ones.   And, at first blush, they might just seem like “soft skills”, while you are used to learning “technical skills.”   Well, treat these as your new technical skills to learn.

Your new technical skills are:

  1. Building EQ (Emotional Intelligence) in teams
  2. Building High-Performance Teams
  3. Putting vision/mission/values in place
  4. Putting the execution model in place
  5. Directing and inspiring as appropriate – situational leadership – per employee
  6. Creating and leveraging leadership opportunities and teachable moments
  7. Creating the right decision frameworks and flows and empowerment models
  8. Building a better business
  9. And doing thought-work in the space for the industry

I’ll leave this list at 9, so that it doesn’t become a Top 10 Skills to Learn to Advance Your Career post.

Emotional Intelligence as a Technical Skill

If you wonder how Emotional Intelligence can be a technical skill, I wish I could show you all the Mind Maps, the taxonomies, the techniques, the hard-core debates over the underlying principles, patterns, and practices, that I have seen many developers dive into over the years.

The good news is that Emotional Intelligence is a skill you can build.  I’ve seen many developers become first time managers and then work on their Emotional Intelligence skills and everything changes.  They become a better manager.  They become more influential.  They read a room better and know how to adapt themselves more effectively in any situation.  They know how to manage their emotions.  And they know how to inspire and delight others, instead of tick them off.

Along the lines of Emotional Intelligence, I should add Financial Intelligence to the mix.  So many developers and technologists would be more effective in the business arena, if they mastered the basics of Financial Intelligence.  There is actually a book called Financial Intelligence for IT Professionals.   It breaks down the basics of how to think in financial terms.   Innovation doesn’t fund itself.  Cool projects don’t fund themselves.  Technology is all fun and games until the money runs out.  But if you can show how technology helps the business, all of a sudden instead of being a cost or overhead, you are now part of the value chain, or at least the business can appreciate what you bring to the table.

Building High-Performance Teams as a Technical Skill

Building High-Performance Teams takes a lot of know-how.  It helps if you are already well grounded in how to ship stuff.  It really helps if you have some basic project management skills and you know how to see how the parts of the project come together as a whole.  It especially helps if you have a strong background in Agile methodologies like Kanban, Scrum, XP, etc.  While you don’t need to create Kanbans, it certainly helps if you get the idea of visualizing the workflow and reducing open work.  And, while you may not need to do Scrum per se, it helps if you get the idea behind a Product Backlog, a Sprint Backlog, and Sprints.  And while you may not need to do XP, it helps if you get the idea of sustainable pace, test-driven development, pairing, collective ownership, and an on-site customer. 

But the real key to building high-performance teams is actually about trust. 

Not trust as in “I trust that you’ll do that.”  

No.  It’s vulnerability-based trust, as in “I’ve got your back.”   This is what enables individuals on a team to go out on a limb, to try more, to do more, to become more.

Otherwise, everybody has to watch out for their own back, and they spend their days making sure they don’t get pushed off the boat, or hanging from a limb while somebody saws it off.   (See 10 Things Great Managers Do.)

And nothing beats a self-organizing team, where people sign-up for work (vs. get assigned work), where people play their position well, and help others play theirs.

Vision, Mission, Values as a Technical Skill

Vision, mission, and values are actually some of the greatest technical skills you can master, for yourself and for any people or teams you might lead, now or in the future.   So many people mix up vision and mission.

Here’s the deal:

Mission is the job.

Vision is where you want to go, now that you know what the job is.

And Values are what you express in actions in terms of what you reward.  Notice how I said actions, not words.  Too many people and teams say they value one thing, but their actions value another.

It’s one thing to go off and craft a vision, mission, and values that you want everybody to adhere to.  It’s another thing to co-create the future with a team, and create your vision, mission, and values, with everybody’s fingerprints on it.  But that’s how you get buy-in.   And getting buy-in usually involves dealing with conflict (which is a whole other set of technical skills you can master).

When a leader can express a vision, mission, and values with clarity, they can inspire the people around them, bring out the best in people, create a high-performance culture, and accelerate results.

Execution as a Technical Skill

This is where the rubber meets the road.  There are so many great books on how to execute with skill.  One of my favorites is Flawless Execution.  And one of the most insightful books on creating an effective execution model is Managing the Design Factory.

The main thing to master here is to be able to easily create a release schedule that optimizes resources and people, while flowing value to customers and stakeholders.

I know that’s boiling a lot down, but that’s the point.  To master execution, you need to be able to easily think about the challenges you are up against:  not enough time, not enough resources, not enough budget, not enough clarity, not enough customers, etc.

It’s a powerful thing when you can turn chaos into clarity and get the train leaving the station in a reliable way.

It’s hard to beat smart people shipping on a cadence, if they are always learning and always improving.

Situational Leadership as a Technical Skill

Sadly, this is one of the most common mistakes of new managers.  Seasoned ones, too.  They treat everybody on the team the same.  And they usually default to whatever they learned.   They either focus on motivating or they focus on directing.  And directing to the extreme, very quickly becomes micro-managing.

The big idea of Situational Leadership is to consider whether each person needs direction or motivation, or both.  

If you try to motivate somebody who is really looking for direction, you will both be frustrated.  Similarly, if you try to direct somebody who really is looking for motivation, it’s a quick spiral down.

There are many very good books on Situational Leadership and how to apply it in the real world.

Decision Making as a Technical Skill

This is where a lot of bloodshed happens.   This is where conflict thrives or dies.   Decision making is the bread-and-butter of today’s knowledge worker.  That’s what makes insight so valuable in a Digital Economy.  After all, what do you use the insight for?  To make better decisions.

It’s one thing for you to just make decisions.

But the real key here is how to create simple ways to deal with conflict and how to make better decisions as a group.   This includes how to avoid the pitfalls of groupthink.  It includes the ability to leverage the wisdom of the crowds.  It also includes the ability to influence and persuade with skill.  It includes the ability to balance connection with conviction.  It includes the ability to balance your Conflict Management Style with the Conflict Management Style of others.

Business as a Technical Skill

Business can be hard-core.   This isn’t so obvious if you deal with mediocre business people.  But when you interact with serious business leaders, you quickly understand how complicated, and technical, running a business and changing a business really is.

At the most fundamental level, the purpose of a business is to create a customer.

But even who you choose to serve as your “customer” is a strategic choice.

You can learn a lot about business by studying some of the great business challenges in the book, Case Interview Secrets, which is written by a former McKinsey consultant.

You can also learn a lot about business by studying which KPIs and business outcomes matter, in each industry, and by each business function.

It also helps to be able to quickly know how to whiteboard a value chain and be able to use some simple tools like SWOT analysis.  If you can really internalize Michael Porter’s mental models and toolset, then you will be ahead of many people in the business world.

Thoughtwork as a Technical Skill

There are many books and guides on how to be a leader in your field.   One of my favorites is Lead the Field, by Earl Nightingale.  It’s an oldie, but goodie.

The real key is to be able to master ideation.  You need to be able to come up with ideas.   Probably the best technique I learned was long ago.   I simply set an idea quota.   In the book, ThinkerToys, by Michael Michalko, I learned that Thomas Edison set a quota for thinking up new ideas.  Success really is a numbers game.   Anyway, I started by writing one idea per note in my little yellow sticky pad.  The first week, I had a handful of ideas.   But once my mind was cleared by writing my ideas down, I was soon filling up multiple yellow sticky pads per week.

I very quickly went from having an innovation challenge to having an execution challenge.

So then I went back to the drawing board and focused on mastering execution as a technical skill ;)

Hopefully, if you are worried about how to keep growing your skills as you climb your corporate ladder, this will give you some food for thought.

Categories: Architecture, Programming

How Wistia Handles Millions of Requests Per Hour and Processes Rich Video Analytics

This is a guest repost from Christophe Limpalair of his interview with Max Schnur, Web Developer at Wistia.

Wistia is video hosting for business. They offer video analytics like heatmaps, and they give you the ability to add calls to action, for example. I was really interested in learning how all the different components work and how they’re able to stream so much video content, so that’s what this episode focuses on.

What does Wistia’s stack look like?

As you will see, Wistia is made up of different parts. Here are some of the technologies powering these different parts:

What scale are you running at?
Categories: Architecture

Add ifPresent to Swift Optionals

Xebia Blog - Fri, 11/20/2015 - 22:43

In my previous post I wrote a lot about how you can use the map and flatMap functions of Swift Optionals. In this one, I'll add a custom function to Optionals through an extension, the ifPresent function.

extension Optional {

    public func ifPresent(@noescape f: (Wrapped) throws -> Void) rethrows {
        switch self {
        case .Some(let value): try f(value)
        case .None: ()
        }
    }
}

What this does is simply execute the closure f if the Optional is not nil. A small example:

var str: String? = "Hello, playground"
str.ifPresent { print($0) }

This works pretty much the same as the ifPresent method of Java optionals.

Why do we need it?

Well, we don't really need it. Swift has the built-in language features if let and guard that deal pretty well with these kinds of situations (which Java cannot do).

We could simply write the above example as follows:

var str: String? = "Hello, playground"
if let str = str {
    print(str)
}

For this example it doesn't matter much whether you would use ifPresent or if let. And because everyone is familiar with if let, you'd probably want to stick with that.

When to use it

Sometimes, when you want to call a function that has exactly one parameter with the same type as your Optional, you might benefit a bit more from this syntax. Let's have a look at that:

var someOptionalView: UIView? = ...
var parentView: UIView = ...

someOptionalView.ifPresent(parentView.addSubview)


Since addSubview has one parameter of type UIView, we can immediately pass in that function reference to the ifPresent function.

Otherwise we would have to write the following code instead:

var someOptionalView: UIView? = ...
var parentView: UIView = ...

if let someOptionalView = someOptionalView {
    parentView.addSubview(someOptionalView)
}

When you can't use it

Unfortunately, it's not always possible to use this in the way we'd like to. If we look back at the very first example with the print function we would ideally write it without closure:

var str: String? = "Hello, playground"
str.ifPresent(print) // would be ideal, but does not compile
Even though print can be called with just a String, its function signature takes variable arguments and default parameters. Whenever that's the case it's not possible to use it as a function reference. This becomes increasingly frustrating when you add default parameters to existing methods, after which the code that refers to them doesn't compile anymore.

It would also be useful if it was possible to set class variables through a method reference. Instead of:

if let value = someOptionalValue {
  self.value = value
}

We would write something like this:


But that doesn't compile, so we need to write it like this:

someOptionalValue.ifPresent { self.value = $0 }

Which isn't really much better than the if let variant.

(I had a look at a post about If-Let Assignment Operator but unfortunately that crashed my Swift compiler while building, which is probably due to a compiler bug)


Is the ifPresent function a big game changer for Swift? Definitely not.
Is it necessary? No.
Can it be useful? Yes.

Initial experiences with the Prometheus monitoring system

Agile Testing - Grig Gheorghiu - Fri, 11/20/2015 - 22:23
I've been looking for a while for a monitoring system written in Go, self-contained and easy to deploy. I think I finally found what I was looking for in Prometheus, a monitoring system open-sourced by SoundCloud and started there by ex-Googlers who took their inspiration from Google's Borgmon system.

Prometheus is a pull system, where the monitoring server pulls data from its clients by hitting a special HTTP handler exposed by each client ("/metrics" by default) and retrieving a list of metrics from that handler. The output of /metrics is plain text, which makes it fairly easily parseable by humans as well, and also helps in troubleshooting.

Here's a subset of the OS-level metrics that are exposed by a client running the node_exporter Prometheus binary (and available when you hit http://client_ip_or_name:9100/metrics):

# HELP node_cpu Seconds the cpus spent in each mode.
# TYPE node_cpu counter
node_cpu{cpu="cpu0",mode="guest"} 0
node_cpu{cpu="cpu0",mode="idle"} 2803.93
node_cpu{cpu="cpu0",mode="iowait"} 31.38
node_cpu{cpu="cpu0",mode="irq"} 0
node_cpu{cpu="cpu0",mode="nice"} 2.26
node_cpu{cpu="cpu0",mode="softirq"} 0.23
node_cpu{cpu="cpu0",mode="steal"} 21.16
node_cpu{cpu="cpu0",mode="system"} 25.84
node_cpu{cpu="cpu0",mode="user"} 79.94
# HELP node_disk_io_now The number of I/Os currently in progress.
# TYPE node_disk_io_now gauge
node_disk_io_now{device="xvda"} 0
# HELP node_disk_io_time_ms Milliseconds spent doing I/Os.
# TYPE node_disk_io_time_ms counter
node_disk_io_time_ms{device="xvda"} 44608
# HELP node_disk_io_time_weighted The weighted # of milliseconds spent doing I/Os. See https://www.kernel.org/doc/Documentation/iostats.txt.
# TYPE node_disk_io_time_weighted counter
node_disk_io_time_weighted{device="xvda"} 959264

There are many such "exporters" available for Prometheus, exposing metrics in the format expected by the Prometheus server from systems such as Apache, MySQL, PostgreSQL, HAProxy and many others (see a list here).

What drew me to Prometheus though was the fact that it allows for easy instrumentation of code by providing client libraries for many languages: Go, Java/Scala, Python, Ruby and others. 
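
As a sketch of what such instrumentation looks like, here is a minimal Go program using the client_golang library (current API; the metric name and handler label are my own illustration, not from the Prometheus docs):

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// A counter with a "handler" label; labels are what make the
// Prometheus query language powerful.
var requestsTotal = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "myapp_http_requests_total", // illustrative name
        Help: "Total HTTP requests served.",
    },
    []string{"handler"},
)

func main() {
    prometheus.MustRegister(requestsTotal)

    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        requestsTotal.WithLabelValues("root").Inc()
        w.Write([]byte("hello\n"))
    })

    // Expose the /metrics endpoint that the Prometheus server scrapes.
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}
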
One of the main advantages of Prometheus over alternative systems such as Graphite is the rich query language that it provides. You can associate labels (which are arbitrary key/value pairs) with any metrics, and you are then able to query the system by label. I'll show examples in this post. Here's a more in-depth comparison between Prometheus and Graphite.
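
For a quick taste of label-based querying, here are two expressions against the node_cpu metric shown earlier; the instance value is illustrative:

# Per-CPU rate of time spent in user mode over the last 5 minutes:
rate(node_cpu{mode="user"}[5m])

# The same, summed across all CPUs of a single host:
sum(rate(node_cpu{mode="user", instance="api01.example.com:9100"}[5m]))
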
Installation (on Ubuntu 14.04)
I put together an ansible role that is loosely based on Brian Brazil's demo_prometheus_ansible repo.
Check out my ansible-prometheus repo for this ansible role, which installs Prometheus, node_exporter and PromDash (a ruby-based dashboard builder). For people not familiar with ansible, most of the installation commands are in the install.yml task file. Here is the sequence of installation actions, in broad strokes.
For the Prometheus server:
  • download prometheus-0.16.1.linux-amd64.tar.gz from https://github.com/prometheus/prometheus/releases/download
  • extract tar.gz into /opt/prometheus/dist and link /opt/prometheus/prometheus-server to /opt/prometheus/dist/prometheus-0.16.1.linux-amd64
  • create Prometheus configuration file from ansible template and drop it in /etc/prometheus/prometheus.yml (more on the config file later)
  • create Prometheus default command-line options file from ansible template and drop it in /etc/default/prometheus
  • create Upstart script for Prometheus in /etc/init/prometheus.conf:
# Run prometheus
start on startup
chdir /opt/prometheus/prometheus-server

script
  ./prometheus -config.file /etc/prometheus/prometheus.yml
end script
For node_exporter:
  • download node_exporter-0.12.0rc1.linux-amd64.tar.gz from https://github.com/prometheus/node_exporter/releases/download
  • extract tar.gz into /opt/prometheus/dist and move node_exporter binary to /opt/prometheus/bin/node_exporter
  • create Upstart script for Prometheus in /etc/init/prometheus_node_exporter.conf:
# Run prometheus node_exporter
start on startup

script
  /opt/prometheus/bin/node_exporter
end script
For PromDash:
  • git clone from https://github.com/prometheus/promdash
  • follow instructions in the Prometheus tutorial from Digital Ocean (can't stop myself from repeating that D.O. publishes the best technical tutorials out there!)
Here is a minimal Prometheus configuration file (/etc/prometheus/prometheus.yml):
global:
  scrape_interval: 30s
  evaluation_interval: 5s

scrape_configs:
  - job_name: 'prometheus'
    target_groups:
      - targets:
        - prometheus.example.com:9090
  - job_name: 'node'
    target_groups:
      - targets:
        - prometheus.example.com:9100
        - api01.example.com:9100
        - api02.example.com:9100
        - test-api01.example.com:9100
        - test-api02.example.com:9100
The configuration file format for Prometheus is well documented in the official docs. My example shows that the Prometheus server itself is monitored (or "scraped" in Prometheus parlance) on port 9090, and that OS metrics are also scraped from 5 clients which are running the node_exporter binary on port 9100, including the Prometheus server.
At this point, you can start Prometheus and node_exporter on your Prometheus server via Upstart:
# start prometheus
# start prometheus_node_exporter
Then you should be able to hit http://prometheus.example.com:9100 to see the metrics exposed by node_exporter, and more importantly http://prometheus.example.com:9090 to see the default Web console included in the Prometheus server. A demo page available from Robust Perception can be examined here.
Note that Prometheus also provides default Web consoles for node_exporter OS-level metrics. They are available at http://prometheus.example.com:9090/consoles/node.html (the ansible-prometheus role installs nginx and redirects http://prometheus.example.com:80 to the previous URL). The node consoles show CPU, Disk I/O and Memory graphs and also network traffic metrics for each client running node_exporter. 

Working with the MySQL exporter
I installed the mysqld_exporter binary on my Prometheus server box.
# cd /opt/prometheus/dist
# git clone https://github.com/prometheus/mysqld_exporter.git
# cd mysqld_exporter
# make
Then I created a wrapper script I called run_mysqld_exporter.sh:
# cat run_mysqld_exporter.sh
#!/bin/bash
export DATA_SOURCE_NAME="dbuser:dbpassword@tcp(dbserver:3306)/dbname"; ./mysqld_exporter
Two important notes here:
1) Note the somewhat awkward format for the DATA_SOURCE_NAME environment variable. I tried many other formats but only this one worked for me. The wrapper script's main purpose is to define this variable properly. With some of my other tries, I got this error message:
INFO[0089] Error scraping global state: Default addr for network 'dbserver:3306' unknown  file=mysqld_exporter.go line=697
You could also define this variable in ~/.bashrc but in that case it may clash with other Prometheus exporters (the one for PostgreSQL for example) which also need to define this variable.
2) Note that the dbuser specified in the DATA_SOURCE_NAME variable needs either the SUPER or the REPLICATION CLIENT privilege on the MySQL server you want to monitor. I ran a SQL statement of this form (the user name, host and password below are placeholders):
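GRANT REPLICATION CLIENT ON *.* TO 'dbuser'@'%' IDENTIFIED BY 'dbpassword';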

I created an Upstart init script I called /etc/init/prometheus_mysqld_exporter.conf:
# cat /etc/init/prometheus_mysqld_exporter.conf
# Run prometheus mysqld exporter
start on startup
chdir /opt/prometheus/dist/mysqld_exporter
script
   ./run_mysqld_exporter.sh
end script
I modified the Prometheus server configuration file (/etc/prometheus/prometheus.yml) and added a scrape job for the MySQL metrics:

  - job_name: 'mysql'
    honor_labels: true
    target_groups:
      - targets:
        - prometheus.example.com:9104

I restarted the Prometheus server:

# stop prometheus
# start prometheus

Then I started up mysqld_exporter via Upstart:
# start prometheus_mysqld_exporter
If everything goes well, the metrics scraped from MySQL will be available at http://prometheus.example.com:9104/metrics
Here are some of the available metrics:
# HELP mysql_global_status_innodb_data_reads Generic metric from SHOW GLOBAL STATUS.
# TYPE mysql_global_status_innodb_data_reads untyped
mysql_global_status_innodb_data_reads 12660
# HELP mysql_global_status_innodb_data_writes Generic metric from SHOW GLOBAL STATUS.
# TYPE mysql_global_status_innodb_data_writes untyped
mysql_global_status_innodb_data_writes 528790
# HELP mysql_global_status_innodb_data_written Generic metric from SHOW GLOBAL STATUS.
# TYPE mysql_global_status_innodb_data_written untyped
mysql_global_status_innodb_data_written 9.879318016e+09
# HELP mysql_global_status_innodb_dblwr_pages_written Generic metric from SHOW GLOBAL STATUS.
# TYPE mysql_global_status_innodb_dblwr_pages_written untyped
mysql_global_status_innodb_dblwr_pages_written 285184
# HELP mysql_global_status_innodb_row_ops_total Total number of MySQL InnoDB row operations.
# TYPE mysql_global_status_innodb_row_ops_total counter
mysql_global_status_innodb_row_ops_total{operation="deleted"} 14580
mysql_global_status_innodb_row_ops_total{operation="inserted"} 847656
mysql_global_status_innodb_row_ops_total{operation="read"} 8.1021419e+07
mysql_global_status_innodb_row_ops_total{operation="updated"} 35305

Most of the metrics exposed by mysqld_exporter are of type Counter, which means they always increase. A meaningful number to graph then is not their absolute value, but their rate of change. For example, for the mysql_global_status_innodb_row_ops_total metric, the rate of change of reads for the last 5 minutes (reads/sec) can be expressed as:
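rate(mysql_global_status_innodb_row_ops_total{operation="read"}[5m])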
This is also an example of a Prometheus query which filters by a specific label (in this case {operation="read"}).
A good way to get a feel for the metrics available to the Prometheus server is to go to the Web console and graphing tool available at http://prometheus.example.com:9090/graph. You can copy and paste the expression above into the Expression edit box and click Execute. You should see something like this graph in the Graph tab:

It's important to familiarize yourself with the 4 types of metrics handled by Prometheus: Counter, Gauge, Histogram and Summary. 
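To make the distinction concrete, here is a minimal Go sketch declaring one metric of each type with the client_golang library (the metric names are made up for illustration):

import "github.com/prometheus/client_golang/prometheus"

var (
    // Counter: monotonically increasing; graph its rate, not its value.
    opsTotal = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "myapp_ops_total", Help: "Total operations performed."})
    // Gauge: can go up and down; graph its absolute value.
    inflight = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "myapp_inflight_requests", Help: "Requests currently in flight."})
    // Histogram: counts observations into predefined buckets.
    latencyHist = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name: "myapp_latency_milliseconds", Help: "Latency distribution."})
    // Summary: computes quantiles over observations on the client side.
    latencySumm = prometheus.NewSummary(prometheus.SummaryOpts{
        Name: "myapp_latency_summary_milliseconds", Help: "Latency quantiles."})
)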
Working with the Postgres exporter
Although not an official Prometheus package, the Postgres exporter has worked just fine for me. 
I installed the postgres_exporter binary on my Prometheus server box.
# cd /opt/prometheus/dist
# git clone https://github.com/wrouesnel/postgres_exporter.git
# cd postgres_exporter
# make
Then I created a wrapper script I called run_postgres_exporter.sh:

# cat run_postgres_exporter.sh

export DATA_SOURCE_NAME="postgres://dbuser:dbpassword@dbserver/dbname"; ./postgres_exporter
Note that the format for DATA_SOURCE_NAME is a bit different from the MySQL format.
I created an Upstart init script I called /etc/init/prometheus_postgres_exporter.conf:
# cat /etc/init/prometheus_postgres_exporter.conf
# Run prometheus postgres exporter
start on startup
chdir /opt/prometheus/dist/postgres_exporter
script
   ./run_postgres_exporter.sh
end script
I modified the Prometheus server configuration file (/etc/prometheus/prometheus.yml) and added a scrape job for the Postgres metrics:

  - job_name: 'postgres'
    honor_labels: true
    target_groups:
      - targets:
        - prometheus.example.com:9113

I restarted the Prometheus server:

# stop prometheus
# start prometheus
Then I started up postgres_exporter via Upstart:
# start prometheus_postgres_exporter
If everything goes well, the metrics scraped from Postgres will be available at http://prometheus.example.com:9113/metrics
Here are some of the available metrics:
# HELP pg_stat_database_tup_fetched Number of rows fetched by queries in this database
# TYPE pg_stat_database_tup_fetched counter
pg_stat_database_tup_fetched{datid="1",datname="template1"} 7.730469e+06
pg_stat_database_tup_fetched{datid="12998",datname="template0"} 0
pg_stat_database_tup_fetched{datid="13003",datname="postgres"} 7.74208e+06
pg_stat_database_tup_fetched{datid="16740",datname="mydb"} 2.18194538e+08
# HELP pg_stat_database_tup_inserted Number of rows inserted by queries in this database
# TYPE pg_stat_database_tup_inserted counter
pg_stat_database_tup_inserted{datid="1",datname="template1"} 0
pg_stat_database_tup_inserted{datid="12998",datname="template0"} 0
pg_stat_database_tup_inserted{datid="13003",datname="postgres"} 0
pg_stat_database_tup_inserted{datid="16740",datname="mydb"} 3.5467483e+07
# HELP pg_stat_database_tup_returned Number of rows returned by queries in this database
# TYPE pg_stat_database_tup_returned counter
pg_stat_database_tup_returned{datid="1",datname="template1"} 6.41976558e+08
pg_stat_database_tup_returned{datid="12998",datname="template0"} 0
pg_stat_database_tup_returned{datid="13003",datname="postgres"} 6.42022129e+08
pg_stat_database_tup_returned{datid="16740",datname="mydb"} 7.114057378094e+12
# HELP pg_stat_database_tup_updated Number of rows updated by queries in this database
# TYPE pg_stat_database_tup_updated counter
pg_stat_database_tup_updated{datid="1",datname="template1"} 1
pg_stat_database_tup_updated{datid="12998",datname="template0"} 0
pg_stat_database_tup_updated{datid="13003",datname="postgres"} 1
pg_stat_database_tup_updated{datid="16740",datname="mydb"} 4351

These metrics are also of type Counter, so to generate meaningful graphs for them, you need to plot their rates. For example, to see the rate of rows returned per second from the database called mydb, you would plot this expression:
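rate(pg_stat_database_tup_returned{datname="mydb"}[5m])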
The Prometheus expression evaluator available at http://prometheus.example.com:9090/graph is again your friend. BTW, if you start typing pg_ in the expression field, you'll see a drop-down filled automatically with all the available metrics starting with pg_. Handy!
Working with the AWS CloudWatch exporter
This is one of the officially supported Prometheus exporters, used for graphing and alerting on AWS CloudWatch metrics. I installed it on the Prometheus server box. It's a Java app, so it needs a JDK installed, and also maven for building the app.
# cd /opt/prometheus/dist
# git clone https://github.com/prometheus/cloudwatch_exporter.git
# apt-get install maven2 openjdk-7-jdk
# cd cloudwatch_exporter
# mvn package
The cloudwatch_exporter app needs AWS credentials in order to connect to CloudWatch and read the metrics. Here's what I did:
  1. created an AWS IAM user called cloudwatch_ro and downloaded its access key and secret key
  2. created an AWS IAM custom policy called CloudWatchReadOnlyAccess-201511181031, which includes the default CloudWatchReadOnlyAccess policy (the custom policy is not strictly necessary, and you can use the default one, but I preferred a custom one because I may need to make further edits to the policy file)
  3. attached the CloudWatchReadOnlyAccess-201511181031 policy to the cloudwatch_ro user
  4. created a file called ~/.aws/credentials with the contents:
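[default]
aws_access_key_id = <access key of the cloudwatch_ro user>
aws_secret_access_key = <secret key of the cloudwatch_ro user>
(This is the standard AWS credentials file layout; substitute the real keys for the placeholders.)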
The cloudwatch_exporter app also needs a json file containing the CloudWatch metrics we want it to retrieve from AWS. Here is an example of ELB-related metrics I specified in a file called cloudwatch.json:
  "region": "us-west-2",
  "metrics": [
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "RequestCount",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": [“LB1”, “LB2”]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "BackendConnectionErrors",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": [“LB1”, “LB2”]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "HTTPCode_Backend_2XX",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": [“LB1”, “LB2”]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "HTTPCode_Backend_4XX",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": [“LB1”, “LB2”]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "HTTPCode_Backend_5XX",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": [“LB1”, “LB2”]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "HTTPCode_ELB_4XX",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": [“LB1”, “LB2”]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "HTTPCode_ELB_5XX",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": [“LB1”, “LB2”]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "SurgeQueueLength",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": [“LB1”, “LB2”]},
     "aws_statistics": ["Maximum", "Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "SpilloverCount",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": [“LB1”, “LB2”]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "Latency",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": [“LB1”, “LB2”]},
     "aws_statistics": ["Average"]},
Note that you need to look up the exact syntax for each metric name, its dimensions and its preferred statistics in the AWS CloudWatch documentation. For ELB metrics, the documentation is here. The CloudWatch metric name corresponds to the cloudwatch_exporter JSON parameter aws_metric_name, dimensions correspond to aws_dimensions, and preferred statistics correspond to aws_statistics.
I modified the Prometheus server configuration file (/etc/prometheus/prometheus.yml) and added a scrape job for the CloudWatch metrics:

  - job_name: 'cloudwatch'
    honor_labels: true
    target_groups:
      - targets:
        - prometheus.example.com:9106

I restarted the Prometheus server:

# stop prometheus
# start prometheus

I created an Upstart init script I called /etc/init/prometheus_cloudwatch_exporter.conf:
# cat /etc/init/prometheus_cloudwatch_exporter.conf
# Run prometheus cloudwatch exporter
start on startup
chdir /opt/prometheus/dist/cloudwatch_exporter
script
   /usr/bin/java -jar target/cloudwatch_exporter-0.2-SNAPSHOT-jar-with-dependencies.jar 9106 cloudwatch.json
end script
Then I started up cloudwatch_exporter via Upstart:
# start prometheus_cloudwatch_exporter
If everything goes well, the metrics scraped from CloudWatch will be available at http://prometheus.example.com:9106/metrics
Here are some of the available metrics:
# HELP aws_elb_request_count_sum CloudWatch metric AWS/ELB RequestCount Dimensions: [AvailabilityZone, LoadBalancerName] Statistic: Sum Unit: Count
# TYPE aws_elb_request_count_sum gauge
aws_elb_request_count_sum{job="aws_elb",load_balancer_name="LB1",availability_zone="us-west-2a",} 1.0
aws_elb_request_count_sum{job="aws_elb",load_balancer_name="LB1",availability_zone="us-west-2c",} 1.0
aws_elb_request_count_sum{job="aws_elb",load_balancer_name="LB2",availability_zone="us-west-2c",} 2.0
aws_elb_request_count_sum{job="aws_elb",load_balancer_name="LB2",availability_zone="us-west-2a",} 12.0
# HELP aws_elb_httpcode_backend_2_xx_sum CloudWatch metric AWS/ELB HTTPCode_Backend_2XX Dimensions: [AvailabilityZone, LoadBalancerName] Statistic: Sum Unit: Count
# TYPE aws_elb_httpcode_backend_2_xx_sum gauge
aws_elb_httpcode_backend_2_xx_sum{job="aws_elb",load_balancer_name="LB1",availability_zone="us-west-2a",} 1.0
aws_elb_httpcode_backend_2_xx_sum{job="aws_elb",load_balancer_name="LB1",availability_zone="us-west-2c",} 1.0
aws_elb_httpcode_backend_2_xx_sum{job="aws_elb",load_balancer_name="LB2",availability_zone="us-west-2c",} 2.0
aws_elb_httpcode_backend_2_xx_sum{job="aws_elb",load_balancer_name="LB2",availability_zone="us-west-2a",} 12.0
# HELP aws_elb_latency_average CloudWatch metric AWS/ELB Latency Dimensions: [AvailabilityZone, LoadBalancerName] Statistic: Average Unit: Seconds
# TYPE aws_elb_latency_average gauge
aws_elb_latency_average{job="aws_elb",load_balancer_name="LB1",availability_zone="us-west-2a",} 0.5571935176849365
aws_elb_latency_average{job="aws_elb",load_balancer_name="LB1",availability_zone="us-west-2c",} 0.5089397430419922
aws_elb_latency_average{job="aws_elb",load_balancer_name="LB2",availability_zone="us-west-2c",} 0.035556912422180176
aws_elb_latency_average{job="aws_elb",load_balancer_name="LB2",availability_zone="us-west-2a",} 0.0031794110933939614

Note that there are 3 labels available to query the metrics above: job, load_balancer_name and availability_zone. 
If we specify something like aws_elb_request_count_sum{job="aws_elb"} in the expression evaluator at http://prometheus.example.com:9090/graph, we'll see 4 graphs, one for each load_balancer_name/availability_zone combination. 
To see only graphs related to a specific load balancer, say LB1, we can specify an expression of the form:
aws_elb_request_count_sum{job="aws_elb",load_balancer_name="LB1"}
In this case, we'll see 2 graphs for LB1, one for each availability zone.
In order to see the request count across all availability zones for a specific load balancer, we need to apply the sum function:
sum(aws_elb_request_count_sum{job="aws_elb",load_balancer_name="LB1"}) by (load_balancer_name)
In this case, we'll see one graph with the request count across the 2 availability zones pertaining to LB1.
If we want to graph all load balancers but only show one graph per balancer, summing all availability zones for each balancer, we would use an expression like this:
sum(aws_elb_request_count_sum{job="aws_elb"}) by (load_balancer_name)
So in this case we'll see 2 graphs, one for LB1 and one for LB2, with each graph summing the request count across the availability zones for LB1 and LB2 respectively.
Note that in all the expressions above, since the job label has the value "aws_elb" common to all metrics, it can be dropped from the queries because it doesn't produce any useful filtering.
For other AWS CloudWatch metrics, consult the Amazon CloudWatch Namespaces, Dimensions and Metrics Reference.

Instrumenting Go code with Prometheus
For me, the most interesting feature of Prometheus is that it allows for easy instrumentation of the code. Instead of pushing metrics a la statsd and Graphite, a web app needs to implement a /metrics handler and use the Prometheus client library code to publish app-level metrics to that handler. The Prometheus server will then hit /metrics on the client and pull/scrape the metrics.

More specifics for Go code instrumentation

1) Declare and register Prometheus metrics in your code

I have the following 2 variables defined in an init.go file in a common package that gets imported in all of the webapp code:

var PrometheusHTTPRequestCount = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Namespace: "myapp",
        Name:      "http_request_count",
        Help:      "The number of HTTP requests.",
    },
    []string{"method", "type", "endpoint"},
)

var PrometheusHTTPRequestLatency = prometheus.NewSummaryVec(
    prometheus.SummaryOpts{
        Namespace: "myapp",
        Name:      "http_request_latency",
        Help:      "The latency of HTTP requests.",
    },
    []string{"method", "type", "endpoint"},
)

Note that the first metric is a CounterVec, which in the Prometheus client_golang library specifies a Counter metric that can also get labels associated with it. The labels in my case are "method", "type" and "endpoint". The purpose of this metric is to measure the HTTP request count. Since it's a Counter, it will increase monotonically, so for graphing purposes we'll need to plot its rate and not its absolute value.

The second metric is a SummaryVec, which in the client_golang library specifies a Summary metric with labels. I use the same labels as for the CounterVec metric. The purpose of this metric is to measure the HTTP request latency. Because it's a Summary, it will provide the absolute measurement, the count, as well as quantiles for the measurements.

These 2 variables then get registered in the init function:

func init() {
    // Register Prometheus metric trackers
    prometheus.MustRegister(PrometheusHTTPRequestCount)
    prometheus.MustRegister(PrometheusHTTPRequestLatency)
}

2) Let Prometheus handle the /metrics endpoint

The GitHub README for client_golang shows the simplest way of doing this:

http.Handle("/metrics", prometheus.Handler())
http.ListenAndServe(":8080", nil)

However, most of the Go webapp code will rely on some sort of web framework, so YMMV. In our case, I had to insert the prometheus.Handler function as a variable pretty deep in our framework code in order to associate it with the /metrics endpoint.

3) Modify Prometheus metrics in your code

The final step in getting Prometheus to instrument your code is to modify the Prometheus metrics you registered by incrementing Counter variables and taking measurements for Summary variables in the appropriate places in your app. In my case, I increment PrometheusHTTPRequestCount in every HTTP handler in my webapp by calling its Inc() method. I also measure the HTTP latency, i.e. the time it took for the handler code to execute, and call the Observe() method on the PrometheusHTTPRequestLatency variable.

The values I associate with the "method", "type" and "endpoint" labels come from the endpoint URL associated with each instrumented handler. As an example, for an HTTP GET request to a URL such as http://api.example.com/customers/find, "method" is the HTTP method used in the request ("GET"), "type" is "customers", and "endpoint" is "/customers/find".

Here is the code I use for modifying the Prometheus metrics (R is an object/struct which represents the HTTP request):

    // Modify Prometheus metrics; elapsed is the handler's execution time,
    // measured elsewhere as a time.Duration
    pkg, endpoint := common.SplitUrlForMonitoring(R.URL.Path)
    method := R.Method
    // Increment the request counter for this method/type/endpoint combination
    PrometheusHTTPRequestCount.WithLabelValues(method, pkg, endpoint).Inc()
    // Record the latency in milliseconds for the same label combination
    PrometheusHTTPRequestLatency.WithLabelValues(method, pkg, endpoint).Observe(float64(elapsed) / float64(time.Millisecond))

4) Retrieving your metrics

Assuming your web app runs on port 8080, you'll need to modify the Prometheus server configuration file and add a scrape job for app-level metrics. I have something similar to this in /etc/prometheus/prometheus.yml:

  - job_name: 'myapp-api'
    target_groups:
      - targets:
        - api01.example.com:8080
        - api02.example.com:8080
        labels:
          group: 'production'
      - targets:
        - test-api01.example.com:8080
        - test-api02.example.com:8080
        labels:
          group: 'test'

Note an extra label called "group" defined in the configuration file. It has the values "production" and "test" respectively, and allows for the filtering of Prometheus measurements by the environment of the monitored nodes.

Whenever the Prometheus configuration file gets modified, you need to restart the Prometheus server:

# stop prometheus
# start prometheus

At this point, the metrics scraped from the webapp servers will be available at http://api01.example.com:8080/metrics.

Here are some of the available metrics:
# HELP myapp_http_request_count The number of HTTP requests.
# TYPE myapp_http_request_count counter
myapp_http_request_count{endpoint="/merchant/register",method="GET",type="admin"} 2928
# HELP myapp_http_request_latency The latency of HTTP requests.
# TYPE myapp_http_request_latency summary
myapp_http_request_latency{endpoint="/merchant/register",method="GET",type="admin",quantile="0.5"} 31.284808
myapp_http_request_latency{endpoint="/merchant/register",method="GET",type="admin",quantile="0.9"} 33.353354
myapp_http_request_latency{endpoint="/merchant/register",method="GET",type="admin",quantile="0.99"} 33.353354
myapp_http_request_latency_sum{endpoint="/merchant/register",method="GET",type="admin"} 93606.57930099976
myapp_http_request_latency_count{endpoint="/merchant/register",method="GET",type="admin"} 2928

Note that myapp_http_request_count and myapp_http_request_latency_count show the same value for the method/type/endpoint combination in this example. You could argue that myapp_http_request_count is redundant in this case. There could be instances where you want to increment a counter without taking a measurement for the summary, so it's still useful to have both. 
Also note that myapp_http_request_latency, being a summary, computes 3 different quantiles: 0.5, 0.9 and 0.99 (so 50%, 90% and 99% of the measurements respectively fall under the given numbers for the latencies).

5) Graphing your metrics with PromDash
The PromDash tool provides an easy way to create dashboards with a look and feel similar to Graphite. PromDash is available at http://prometheus.example.com:3000. 
First you need to define a server by clicking on the Servers link up top, then entering a name ("prometheus") and the URL of the Prometheus server ("http://prometheus.example.com:9090/").
Then click on Dashboards up top, and create a new directory, which offers a way to group dashboards. You can call it something like "myapp". Now you can create a dashboard (you also need to select the directory it belongs to). Once you are in the Dashboard create/edit screen, you'll see one empty graph with the default title "Title". 
When you hover over the header of the graph, you'll see other buttons available. You want to click on the 2nd button from the left, called Datasources, then click Add Expression. Note that the server field is already pre-filled. If you start typing myapp in the expression field, you should see the metrics exposed by your application (for example myapp_http_request_count and myapp_http_request_latency).
To properly graph a Counter-type metric, you need to plot its rate. Use this expression to show the HTTP request/second rate measured in the last minute for all the production endpoints in my webapp:
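rate(myapp_http_request_count{job="myapp-api", group="production"}[1m])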
(the job and group values correspond to what we specified in /etc/prometheus/prometheus.yml)
If you want to show the HTTP request/second rate for test endpoints of "admin" type, use this expression:
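rate(myapp_http_request_count{job="myapp-api", group="test", type="admin"}[1m])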
If you want to show the HTTP request/second rate for a specific production endpoint, use an expression similar to this:
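rate(myapp_http_request_count{job="myapp-api", group="production", endpoint="/merchant/register"}[1m])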
Once you enter the expression you want, close the Datasources form (it will save everything). Also change the title by clicking on the button called "Graph and Axis Settings". In that form, you can also specify that you want the plot lines stacked as opposed to regular lines.
For latency metrics, you don't need to look at the rate. Instead, you can look at a specific quantile. Let's say you want to plot the 99% quantile for latencies observed in all production endpoints, for write operations (corresponding to HTTP methods which are not GET). Then you would use an expression like this:
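myapp_http_request_latency{job="myapp-api", group="production", method!="GET", quantile="0.99"}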
As for the HTTP request/second graphs, you can refine the latency queries by specifying a type, an endpoint or both:
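myapp_http_request_latency{job="myapp-api", group="production", type="admin", endpoint="/merchant/register", method!="GET", quantile="0.99"}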
I hope you have enough information at this point to go wild with dashboards! Remember, who has the most dashboards wins!
Wrapping up
I wanted to write this blog post so I don't forget all the stuff that was involved in setting up and using Prometheus. It's a lot, but it's also not that bad once you get the hang of it. In particular, the Prometheus server itself is remarkably easy to set up and maintain, a refreshing change from other monitoring systems I've used before.
One thing I haven't touched on is the alerting mechanism used in Prometheus. I haven't looked at that yet, since I'm still using a combination of Pingdom, monit and Jenkins for my alerting. I'll tackle Prometheus alerting in another blog post.
I really like Prometheus so far and I hope you'll give it a try!

Stuff The Internet Says On Scalability For November 20th, 2015

Hey, it's HighScalability time:

100 years ago people saw this as our future. We will be so laughably wrong about the future.
  • $24 billion: amount telcos make selling data about you; $500,000: cost of iOS zero day exploit; 50%: a year's growth of internet users in India; 72: number of cores in Intel's new chip; 30,000: Docker containers started on 1,000 nodes; 1962: when the first Cathode Ray Tube entered interplanetary space; 2x: cognitive improvement with better indoor air quality; 1 million: Kubernetes requests per second;

  • Quotable Quotes:
    • Zuckerberg: One of our goals for the next five to 10 years is to basically get better than human level at all of the primary human senses: vision, hearing, language, general cognition. 
    • Sawyer Hollenshead: I decided to do what any sane programmer would do: Devise an overly complex solution on AWS for a seemingly simple problem.
    • Marvin Minsky: Big companies and bad ideas don't mix very well.
    • @mathiasverraes: Events != hooks. Hooks allow you to reach into a procedure, change its state. Events communicate state change. Hooks couple, events decouple
    • @neil_conway: Lamport, trolling distributed systems engineers since 1998. 
    • @timoreilly: “Silicon Valley is the QA department for the rest of the world. It’s where you test out new business models.” @jamescham #NextEconomy
    • Henry Miller: It is my belief that the immature artist seldom thrives in idyllic surroundings. What he seems to need, though I am the last to advocate it, is more first-hand experience of life—more bitter experience, in other words. In short, more struggle, more privation, more anguish, more disillusionment.
    • @mollysf: "We save north of 30% when we move apps to cloud. Not in infrastructure; in operating model." @cdrum #structureconf
    • Alex Rampell: This is the flaw with looking at Square and Stripe and calling them commodity players. They have the distribution. They have the engineering talent. They can build their own TiVo. It doesn’t mean they will, but their success hinges on their own product and engineering prowess, not on an improbable deal with an oligopoly or utility.
    • @csoghoian: The Michigan Supreme Court, 1922: Cars are tools for robbery, rape, murder, enabling silent approach + swift escape.
    • @tomk_: Developers are kingmakers, driving technology adoption. They choose MongoDB for cost, agility, dev productivity. @dittycheria #structureconf
    • Andrea “Andy” Cunningham: You have to always foster an environment where people can stand up against the orthodoxy, otherwise you will never create anything new.
    • @joeweinman: Jay Parikh at #structureconf on moving Instagram to Facebook: only needed 1 FB server for every 3 AWS servers
    • amirmc: The other unikernel projects (i.e. MirageOS and HaLVM), take a clean-slate approach which means application code also has to be in the same language (OCaml and Haskell, respectively). However, there's also ongoing work to make pieces of the different implementations play nicely together too (but it's early days).

  • After a tragedy you can always expect the immediate fear inspired reframing of agendas. Snowden responsible for Paris...really?

  • High finance in low places. The Hidden Wealth of Nations: In 2003, less than a year before its initial public offering in August 2004, Google US transferred its search and advertisement technologies to “Google Holdings,” a subsidiary incorporated in Ireland, but which for Irish tax purposes is a resident of Bermuda.

  • The entertaining True Tales of Engineering Scaling. Started with Rails and Postgres. Traffic jumped. High memory workers on Heroku broke the bank. Can't afford the time to move to AWS. Lots of connection issues. More traffic. More problems. More solutions. An interesting story with many twists. The lesson: Building and, more importantly, shipping software is about the constant trade off of forward movement and present stability.

  • 5 Tips to Increase Node.js Application Performance: Implement a Reverse Proxy Server; Cache Static Files; Implement a Node.js Load Balancer; Proxy WebSocket Connections; Implement SSL/TLS and HTTP/2.

  • Docker adoption is not that easy, Uber took months to get up and running with Docker. How Docker Turbocharged Uber’s Deployments: Everything just changes a bit, we need to think about stuff differently...You really need to rethink all of the parts of your infrastructure...Uber recognizes that Docker removed team dependencies, offering more freedom because members were no longer tied to specific frameworks or specific versions. Framework and service owners are now able to experiment with new technologies and to manage their own environments.

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge (which means this post has many more items to read so please keep on reading)...

Categories: Architecture

The Sunk Cost Fallacy Fallacy

Xebia Blog - Thu, 11/19/2015 - 15:41

Imagine two football fans planning to attend a match 60 miles away. One of them paid for a ticket in advance; the other was just about to buy a ticket when he got one from a friend for free. The night of the game, a blizzard hits. Which fan do you think is more likely to drive through a blizzard to see the game?

You probably (correctly) guessed that the fan who paid for his ticket is more likely to drive through the blizzard. What you may not have realized, though, is that this is an irrational decision, at least economically speaking.

The football fan story is a classic example of the Sunk Cost Fallacy, adapted from Richard Thaler's "Toward a Positive Theory of Consumer Choice" (1980) in Daniel Kahneman's excellent book, "Thinking, Fast and Slow" (2011). Many thanks to my colleagues Joshua Appelman, Viktor Clerc and Bulat Yaminov for the recommendations.

The Sunk Cost Fallacy

The Sunk Cost Fallacy is a faulty pattern of behavior in which past investments cloud our judgment on how to move forward. When past investments are irrecoverable (we call them 'sunk' costs), they should have no effect on our choices for the future. In practice, however, we find it difficult to cut our losses, even when it's the rational thing to do.

We see the Sunk Cost Fallacy effect in action every day when evaluating technical and business decisions. For instance, you may recognize a tendency to become attached to an "elegant" abstraction or invariant, even when evidence is mounting that it does the overall complexity more harm than good. Perhaps you've seen a Product Owner who remains too attached to a particular feature, even after its proven failure to achieve the desired effect. Or the team that sticks to an in-house graphing library even after better ones become available for free, because they are too emotional about throwing out their own code.

This is the Sunk Cost Fallacy in action. It's healthy to take a step back and see if it's time to cut your losses.

Abuse of the Sunk Cost Fallacy

However, the Sunk Cost Fallacy can be abused when it's used as an excuse to freely backtrack on choices with little regard for past costs. I call this the Sunk Cost Fallacy Fallacy.

Should you move from framework A to framework B? If B will help you be more effective in the future, even when you've invested in A, the Sunk Cost Fallacy says you should move to B. However, don't forget to factor in the 'cost of switching': the past investments in framework A may be sunk costs, but switching could introduce a technical debt of code that needs to now be ported. Make sure to compare the expected gain against this cost, and make a rational decision.

You might feel bad about having picked framework A in the first place. The Sunk Cost Fallacy teaches you not to let this emotion cloud your judgment while evaluating framework B. However, it is still a useful emotion that can trigger valuable questions: Could you have seen this coming? Is there something you could have done in the past to make it cheaper to move from framework A to framework B now? Can you learn from this experience and make a better initial choice next time?


An awareness of the Sunk Cost Fallacy can help you make better decisions: cut your losses when it is the rational thing to do. Be careful not to use the Sunk Cost Fallacy as an excuse, and take into account the cost of switching. Most importantly, look for opportunities to learn from your mistakes.

Free Book: Practical Scalability Analysis with the Universal Scalability Law

If you are very comfortable with math and modeling, Dr. Neil Gunther's Universal Scalability Law is a powerful way of predicting system performance and whittling down those bottlenecks. If not, the USL can be hard to wrap your head around.

There's a free eBook for that. Performance and scalability expert Baron Schwartz, founder of VividCortex, has written a wonderful exploration of scalability truths using the USL as a lens: Practical Scalability Analysis with the Universal Scalability Law

As a sample of what you'll learn, here are some of the key takeaways from the book:

  • Scalability is a formal concept that is best defined as a mathematical function.
  • Linear scalability means equal return on investment. Double down on workers and you’ll get twice as much work done; add twice as many nodes and you’ll increase the maximum capacity twofold. Linear scalability is oft claimed but seldom delivered.
  • Systems scale sublinearly because of contention, which adds queueing delay, and crosstalk, which inflates service times. The penalty for contention grows linearly and the crosstalk penalty grows quadratically; the formula sketched after this list makes this explicit. (An alternative to the crosstalk theory is that longer queues are more costly to manage.)
  • Contention causes throughput to asymptotically approach the reciprocal of the serialized fraction of the workload. If your workload is 5% serialized you’ll never grow the effective speedup by more than 20-fold.
  • Crosstalk causes the system to regress. The harder you try to push systems with crosstalk, the more time they spend fighting amongst themselves.
  • To build scalable systems, avoid contention (serialization) and crosstalk (synchronization). The contention and crosstalk penalties degrade system scalability and performance much faster than you’d think. Even tiny amounts of serialization or pairwise data synchronization cause big losses in efficiency.
  • If you can’t avoid crosstalk, partition (shard) into smaller systems that will lose less efficiency by avoiding the explosion of service times at larger sizes.
  • To model systems with the USL, obtain measurements of throughput at various levels of load or size, and use regression to estimate the parameters to Equation 3.
  • To forecast scalability beyond what’s observable, be pessimistic and treat the USL as a best-case scenario that won’t really happen. Use Equation 4 to forecast the maximum possible throughput, but don’t forecast too far out. Use Equation 6 to forecast response time.
  • Use your judgment to predict limitations that USL can’t see, such as saturation of network bandwidth or changes in the system’s model when all of the CPUs become busy
  • Use the USL to explain why systems aren’t scaling well. Too much queueing? Too much crosstalk? Treat the USL as a pessimistic model and demand that your systems scale at least as well as it does.
  • If you see superlinear scaling, check your measurements and how you’ve set up the system under test. In most cases σ should be positive, not negative. Make sure you’re not varying the system’s dimensions relative to each other and creating apparent superlinear efficiencies that don’t really exist.
  • It’s fun to fantasize about models that might match observed system behavior more closely than the USL, but the USL arises analytically from how we know queueing systems work. Invented models might not have any basis in reality. Besides, the USL usually models systems extremely well up to the point of inflection, and modeling what happens beyond that isn’t as interesting as knowing why it happens.
  • Never trust a scatterplot with an arbitrary curve fit through it unless you know why that’s the right curve. Don’t confuse the USL, hockey stick charts from queueing theory, or other charts that just happen to have similar shapes. Know what shape various plots should exhibit, and suspect bad measurements or other mistakes if you don’t see them.
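For reference, the USL in its commonly cited form looks like this (a sketch of the standard formulation; the equation numbers referenced above are the book's own):

X(N) = λN / (1 + σ(N − 1) + κN(N − 1))

where N is the load or node count, λ is the throughput of a single unit, σ is the contention (serialization) penalty, and κ is the crosstalk (coherency) penalty. The σ term grows linearly with N while the κ term grows quadratically, which is why crosstalk eventually makes throughput regress rather than merely plateau.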

Note, the link to the eBook requires entering some data, but it's free, well written, and useful, so it's probably worth it.

Related Articles
Categories: Architecture

Teach the World Your Skills in a Mobile-First, Cloud-First World

“Be steady and well-ordered in your life so that you can be fierce and original in your work.”  -- Gustave Flaubert

An important aspect of personal effectiveness and career development is learning business skills for a technology-centric world.

I know a lot of developers figuring out how to share their expertise in a mobile-first, cloud-first world.  Some are creating software services, some are selling online courses, some are selling books, and some are building digital products.    It’s how they are sharing and scaling their expertise with the world, while doing what they love. 

In each case, the underlying pattern is the same:

"Write once, share many." 

It’s how you scale.  It’s how you amplify your impact.  It’s a simple way to combine passion + purpose + profit.

With our mobile-first, cloud-first world, and so much technology at your fingertips to help with automation, it’s time to learn better business skills and how to stay relevant in an ever-changing market.

But the challenge is, how do you actually start?

On the consumer side ...
In a mobile-first, cloud-first world, users want the ability to consume information anywhere, anytime, from any device.

On the producer side ...
Producers want the ability to easily create digital products that they can share with the world -- and automate the process as much as possible. 

I've researched and tested a lot of ways to share your experience in a way that works in a mobile-first, cloud-first world.  I've gone through a lot of people, programs, processes, and tools.  Ultimately, the proven practice for building high-end digital products is building courses.  And teaching courses is the easiest way to get started.  And Dr. Cha~zay is one of the best in the world at teaching people how to teach the world what they love.

I have a brilliant and deep guest post by Dr. Cha~zay on how to teach courses in a mobile-first, cloud-first world:

Teach the World What You Love

You could very much change your future, or your kid’s future, or your friend’s future, or whoever you know that needs to figure out new ways to teach in a mobile first, cloud-first world.

The sooner you start doing, testing, and experimenting, the sooner you'll figure out what working in a Digital Economy could mean to you, your family, and your friends, in a mobile-first, cloud-first world.

The world changes. 

Do you?

Categories: Architecture, Programming

Docker to the on-premise rescue

Xebia Blog - Wed, 11/18/2015 - 10:18

During the second day at Dockercon EU 2015 in Barcelona, Docker introduced the missing glue, which they call the "Containers as a Service Platform". With focus on both public cloud and on-premise, this is a great addition to the ecosystem. For this blog post I would like to focus on the Run part of Docker's "Build-Ship-Run" motto, with the focus on on-premise. To realize this, Docker launched the Docker Universal Control Plane, the project formerly known as Orca.

I got to play with version 0.4.0 of the software during a hands-on lab and I will try to summarize what I've learned.

Easy installation

Of course the installation is done by launching Docker containers on one or more hosts, so you will need to provision your hosts with the Docker Engine. After that you can launch an `orca-bootstrap` container to install, uninstall, or add an Orca controller. The orca-bootstrap script will generate a Swarm Root CA and an Orca Root CA, then deploy the necessary Orca containers (I will talk more about this in the next section), after which you can log in to the Docker Universal Control Plane. Adding a second Orca controller is as simple as running orca-bootstrap with a join parameter and specifying the existing Orca controller.


Let's talk a bit about the technical parts, and keep in mind that I'm not the creator of this product. There are 7 containers running after you have successfully run the orca-bootstrap installer. You have the Orca controller itself, listening on port 443, which is your main entry point to Docker UCP. There are 2 cfssl containers, one for the Orca CA and one for the Swarm CA. Then you have the Swarm containers (Manager and Agent) and the key-value store, for which Docker chose etcd. Finally, there is an orca-proxy container, whose port 12376 redirects to the Swarm Manager. I'm not sure why this is yet; maybe we will find out in the beta.

From the frontend (which we will discuss next) you can download a 'bundle', which is a zip file containing the TLS parts and a sourceable environment file containing:

export DOCKER_CERT_PATH=$(pwd)
export DOCKER_HOST=tcp://orca_controller_ip:443
# Run this command from within this directory to configure your shell:
# eval $(env.sh)
# This admin cert will also work directly against Swarm and the individual
# engine proxies for troubleshooting.  After sourcing this env file, use
# "docker info" to discover the location of Swarm managers and engines.
# and use the --host option to override $DOCKER_HOST

As you can see, it also works directly against Swarm manager and Engine to troubleshoot. Running `docker version` with this environment returns:

Client:
 Version:      1.9.0
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   76d6bc9
 Built:        Tue Nov  3 17:43:42 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      orca/0.4.0
 API version:  1.21
 Go version:   go1.5
 Git commit:   56afff6
 OS/Arch:      linux/amd64


Okay, so when I opened up the frontend it looked pretty familiar and I was trying to remember where I've seen this before. After a look at the source, I found an ng-app parameter in the html tag named shipyard. The GUI is based on the Shipyard project, which is cool because this was an already well functioning management tool built upon Docker Swarm and the Docker API, so people familiar with Shipyard already know the functionality. Let me quickly sum up what it can do and what it looks like in Docker UCP.

[Screenshot] Dashboard overview

[Screenshot] Application expanded: quickly start/stop/restart/destroy/inspect a running container

[Screenshot] Application overview: graphs of resource usage; container IDs can be included in or excluded from the graph

[Screenshot] Containers overview: multi-select containers and execute actions

[Screenshot] Ability to quickly inspect logs

[Screenshot] Ability to exec into the container to debug/troubleshoot etc.

Secrets Management & Authentication/Authorization

So, in this hands-on lab there were a few things that were not ready yet. Eventually it will be possible to hook up Docker UCP to an existing LDAP directory but I was not able to test this yet. Once fully implemented you can hook it up to your existing RBAC system and give teams the authorization they need.

There was also a demo showing off a secret management tool, which also was not yet available. I guess this is what the key-value store is used for as well. Basically you can store a secret at a path such as secret/prod/redis and then access it by running a container with a label like:

docker run -ti --rm --label com.docker.secret.scope=secret/prod

Now you can access the secret within the container in the file /secret/prod/redis.

Now what?

A lot of new things are being added to the ecosystem, which is certainly going to help the adoption of Docker for some customers and bring it into production. I like that Docker thought of the on-premise customers and delivers them an experience equal to that of cloud users. As this is an early version they need feedback from users, so if you are able to test it, please do so in order to make it a better product. They said they are already working on multi-tenancy, for instance, but no timelines were given.

If you would like to sign up for the beta of Docker Universal Control Plane, you can sign up at this page: https://www.docker.com/try-ducp



Why I like golang: a programming autobiography

Agile Testing - Grig Gheorghiu - Mon, 11/16/2015 - 19:07
Tried my hand at writing a story on Medium.

9ish Low Latency Strategies for SaaS Companies

Achieving very low latencies takes special engineering, but if you are a SaaS company, latencies of a few hundred milliseconds are possible for complex business logic using standard technologies like load balancers, queues, JVMs, and REST APIs.

Itai Frenkel, a software engineer at Forter, which provides a Fraud Prevention Decision as a Service, shows how in an excellent article: 9.5 Low Latency Decision as a Service Design Patterns.

While any article on latency will have some familiar suggestions, Itai goes into some new territory you can really learn from. The full article is rich with detail, so you'll want to read it, but here's a short gloss:

Categories: Architecture

Video: Software architecture as code

Coding the Architecture - Simon Brown - Mon, 11/16/2015 - 12:52

I presented a new version of my "Software architecture as code" talk at the Devoxx Belgium 2015 conference last week, and the video is already online. If you're interested in how to communicate software architecture without using tools like Microsoft Visio, you might find it interesting.

The slides are also available to view online and download. Enjoy!

Categories: Architecture

How Facebook's Safety Check Works

I noticed on Facebook during this horrible tragedy in Paris that there was some worry because not everyone had checked in using Safety Check (video). So I thought people might want to know a little more about how Safety Check works.

If a friend or family member hasn't checked in yet, it doesn't mean anything bad has happened to them. Please keep that in mind. Safety Check is a good system, but not a perfect system, so keep your hopes up.

This is a really short version, there's a longer article if you are interested.

When is Safety Check Triggered?
  • Before the Paris attack Safety Check was only activated for natural disasters. Paris was the first time it was activated for human disasters and they will be doing it more in the future. As a sign of this policy change, Safety Check has been activated for the recent bombing in Nigeria.

How Does Safety Check Work?
  • If you are in an area impacted by a disaster Facebook will send you a push notification asking if you are OK. 

  • Tapping the “I’m Safe” button marks you as safe.

  • All your friends are notified that you are safe.

  • Friends can also see a list of all the people impacted by the disaster and how they are doing.

How is the impacted area selected?
  • Since Facebook only has city-level location for most users, declaring the area isn't as hard as drawing on a map. Facebook usually selects a number of cities, regions, states, or countries that are affected by the crisis.

  • Facebook always allows people to declare themselves into the crisis (or out) in case the geolocation prediction is inaccurate. This means Facebook can be a bit more selective with the geographic area, since they want a pretty high signal with the notifications. Notification click-through and conversion rates are used as downstream signals on how well a launch went.

  • For something like Paris, Facebook selected the whole city and launched. Especially with the media reporting "Paris terror attacks," this seemed like a good fit.

How do you build the pool of people impacted by a disaster in a certain area?
  • Building a geoindex is the obvious solution, but it has weaknesses.

  • People are constantly moving so the index will be stale.

  • A geoindex of 1.5 billion people is huge and would take a lot of resources they didn’t have. Remember, this is a small team without a lot of resources trying to implement a solution.

  • Instead of keeping a data pipeline that’s rarely used active all of the time, the solution should work only when there is an incident. This requires being able to make a query that is dynamic and instant.

  • Facebook does not have GPS-level location information for the majority of its user base (only those that turn on the nearby friends feature), so they use the same IP2Geo prediction algorithms that Google and other web companies use -- essentially determining city level location based on IP address.

The solution leveraged the shape of the social graph and its properties:
  • When there’s a disaster, say an earthquake in Nepal, a hook for Safety Check is turned on in every single news feed load.

  • When people check their news feed the hook executes. If the person checking their news feed is not in Nepal then nothing happens.

  • When someone in Nepal checks their news feed is when the magic happens.

  • Safety Check fans out to all their friends on their social graph. If a friend is in the same area then a push notification is sent asking if they are OK.

  • The process keeps repeating recursively. For every friend found in the disaster area a job is spawned to check their friends. Notifications are sent as needed.

In Practice this Solution Was Very Effective
  • At the end of the day it's really just DFS (Depth First Search) with seen state and selective exploration; a sketch of the traversal follows this list.

  • The product experience feels live and instant because the algorithm is so fast at finding people. Everyone in the same room, for example, will appear to get their notifications at the same time. Why?

  • Using the news feed gives a random sampling of users that is biased towards the most active users with the most friends. And it filters out inactive users, saving billions of rows of computation that need not be performed.

  • The graph is dense and interconnected. Six Degrees of Kevin Bacon is wrong, at least on Facebook. The average distance between any two of Facebook’s 1.5 billion users is 4.74 edges. Sorry Kevin. With 1.5 billion users the whole graph can be explored within 5 hops. Most people can be efficiently reached by following the social graph.

  • There’s a lot of parallelism for free using a social graph approach. Friends can be assigned to different machines and processed in parallel. As can their friends, and so on.

  • Isn't it possible to use something like Hadoop/Hive/Presto to simply get a list of all users in Paris on demand? Hive and Hadoop are offline. It can take ~45 minutes to execute a query on Facebook's entire user table (even longer if it involves joins), and at certain times of the day it's slower (during work hours usually). Not only that, but once the query executes some engineer has to go copy and paste the results into a script that would likely run on one machine. Doing this in a distributed async job fashion allowed for a lot more flexibility. Even better, it's possible to change the geographic area as the algorithm runs and those changes are reflected immediately.

  • The cost of searching for the users in the area directly correlates with the size of the crisis (geographically). A smaller crisis ends up being fairly cheap, whereas larger crises end up checking on a larger and larger portion of the user base until 100% of the user base is reached. For Nepal, a big disaster, ~1B profiles were checked. For some smaller launches only ~100k profiles were checked. Had an index been used, or an offline job that did joins and filters, the cost would be constant, no matter how small the crisis.
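Here is a small Go sketch of that traversal. This is hypothetical code, not Facebook's; the User type and the inArea/notify callbacks are stand-ins for their profile data and IP2Geo/push-notification machinery.

type UserID int64

type User struct {
    ID      UserID
    Friends []*User
}

// fanOut walks the friend graph starting from a user who loaded their
// news feed inside the affected area. It keeps a seen set to avoid
// revisiting profiles, and only notifies and expands friends who are
// also predicted to be in the area (the selective exploration).
func fanOut(seed *User, inArea func(*User) bool, notify func(*User)) {
    seen := map[UserID]bool{seed.ID: true}
    stack := []*User{seed}
    for len(stack) > 0 {
        u := stack[len(stack)-1]
        stack = stack[:len(stack)-1]
        for _, f := range u.Friends {
            if seen[f.ID] {
                continue
            }
            seen[f.ID] = true
            if inArea(f) {
                notify(f)                // push notification: "Are you OK?"
                stack = append(stack, f) // spawn a job to check f's friends
            }
        }
    }
}

In production each inner expansion would run as a parallel async job rather than on one stack, which is where the free parallelism mentioned above comes from.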

On HackerNews

Categories: Architecture

Stuff The Internet Says On Scalability For November 13th, 2015

Hey, it's HighScalability time:

Gorgeous picture of where microbes live in species. Humans have the most. (M. WARDEH ET AL)

  • $14.3 billion: Alibaba single day sales; 1.55 billion: Facebook monthly active users; 6 billion: Snapchat video views per day; unlimited: now defined as 300 GB by Comcast; 80km: circumference of China's proposed supercollider; 500: alien worlds visualized; 50: future sensors per acre on farms; 1 million: Instagram requests per second.

  • Quotable Quotes:
    • Adam Savage~ Lesson learned: do not test fire rockets indoors.
    • dave_sullivan: I'm going to say something unpopular, but horizontally-scaled deep learning is overkill for most applications. Can anyone here present a use case where they have personally needed horizontal scaling because a Titan X couldn't fit what they were trying to do? 
    • @bcantrill: Question I've been posing at #KubeCon: are we near Peak Confusion in the container space? Consensus: no -- confusion still accelerating!
    • @PeterGleick: When I was born, CO2 levels were  ~300 ppm. This week may be the last time anyone alive will see less than 400 ppm. 
    • @patio11: "So I'm clear on this: our business is to employ people who can't actually do worthwhile work, train them up, then hand to competition?"
    • Settlement-Size: This finding reveals that incipient forms of hierarchical settlement structure may have preceded socioeconomic complexity in human societies
    • wingolog: for a project to be technically cohesive, it needs to be socially cohesive as well; anything else is magical thinking.
    • @mjpt777: Damn! @toddlmontgomery has got Aeron C++ IPC to go at over 30m msg/sec. Java is struggling to keep up.
    • Tim O'Reilly: While technological unemployment is a real phenomenon, I think it's far more important to look at the financial incentives we've put in place for companies to cut workers and the cost of labor. If you're a public company whose management compensation is tied to your stock price, it's easy to make short term decisions that are good for your pocketbook but bad long term for both the company and for society as a whole.
    • @RichardDawkins: Evolution is "Descent with modification". Languages, computers and fashions evolve. Solar systems, mountains and embryos don't. They develop
    • @Grady_Booch: Dispatches from a programmer in the year 2065: "How do you expect me to fit 'Hello, World' into only a terabyte of memory?" via Joe Marasco
    • @huntchr: I find #Zookeeper to be the Achilles Heal of a few otherwise interesting projects e.g. #kafka, #mesos.
    • Robert Scoble~ Facebook Live was bringing 10x more viewers than Twitter/Periscope
    • cryptoz: I've always wondered about this. Presumably the people leading big oil companies are not dumb idiots; so why wouldn't they take this knowledge and prepare in advance?

  • Waze is using data from sources you may not expect. Robert Scoble: How about Waze? I witnessed an accident one day on the highway near my house. Two lane road. The map turned red within 30 seconds of the accident. How did that happen? Well, it turns out cell phone companies (Verizon, in particular, in the United States) gather real time data from cell phones. Your phone knows how fast it’s going. In fact, today, Waze shows you that it knows. Verizon sells that data (anonymized) to Google, which then uses that data to put the red line on your map.

  • If email would have been done really right in the early days then we wouldn't need half the social networks or messaging apps we have today. Almost everything we see is a reimplementation of email. Gmail, We Need To Talk.

  • Don Norman and Bruce Tognazzini, prophets from Apple's time in the wilderness, don't much like the new religion. They stand before the temple shaking fists at blasphemy. How Apple Is Giving Design A Bad Name: Apple is destroying design. Worse, it is revitalizing the old belief that design is only about making things look pretty. No, not so! Design is a way of thinking, of determining people’s true, underlying needs, and then delivering products and services that help them. Design combines an understanding of people, technology, society, and business. 

  • There's a new vision of the Internet out there and it's built around the idea of Named Data Networking (NDN). It's an evolution from today’s host-centric network architecture IP to a data-centric network architecture. Luminaries like Van Jacobson like the idea. Packet Pushers with good coverage in Show 262 – Future of Networking – Dave Ward. Dave Ward is the CTO of Engineering and Chief Architect at Cisco. For me, make the pipes dumb, fast, and secure. Everything else is emergent.

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge (which means this post has many more items to read so please keep on reading)...

Categories: Architecture