Software Development Blogs: Programming, Software Testing, Agile Project Management

Methods & Tools

Feed aggregator

Python: Creating a skewed random discrete distribution

Mark Needham - Mon, 03/30/2015 - 23:28

I’m planning to write a variant of the TF/IDF algorithm over the HIMYM corpus which weights in favour of terms that appear in a medium number of documents, and as a prerequisite I needed a function that, when given a number of documents, would return a weighting.

It should return a higher value when a term appears in a medium number of documents, i.e. if I pass in 10 I should get back a higher value than if I pass in 200, as a term that appears in 10 episodes is likely to be more interesting than one which appears in almost every episode.

I went through a few different scipy distributions but none of them did exactly what I want so I ended up writing a sampling function which chooses random numbers in a range with different probabilities:

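import math
import numpy as np

# One value per possible document count (1..208)
values = range(1, 209)
# Start from a uniform distribution
probs = [1.0 / 208] * 208
for idx, prob in enumerate(probs):
    # Ramp the weight up over indices 4..19
    if idx > 3 and idx < 20:
        probs[idx] = probs[idx] * (1 + math.log(idx + 1))
    # Ramp it back down over indices 21..39 (index 20 keeps the uniform weight)
    if idx > 20 and idx < 40:
        probs[idx] = probs[idx] * (1 + math.log((40 - idx) + 1))
# Re-normalise so the probabilities sum to 1
total = sum(probs)
probs = [p / total for p in probs]
sample = np.random.choice(values, 1000, p=probs)
>>> print(sample[:10])
[ 33   9  22 126  54   4  20  17  45  56]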

Now let’s visualise the distribution of this sample by plotting a histogram:

import matplotlib.pyplot as plt

# Fixed-width bins covering the full range of sampled values
binwidth = 2
plt.hist(sample, bins=np.arange(min(sample), max(sample) + binwidth, binwidth))
plt.xlim([0, max(sample)])
plt.show()

[Image: histogram of the sample]

It’s a bit hacky but it does seem to work in terms of weighting the values correctly. It would be nice if I could smooth it off a bit better but I’m not sure how at the moment.
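Since the eventual goal is a function that returns a weighting for a given number of documents, the same probabilities could also be wrapped up as a lookup. This is only a sketch built on the same hacky weights as the sampling code; the names build_weights and weight_for are made up for illustration and are not part of numpy or scipy.

```python
import math


def build_weights(n_docs=208):
    """Return normalised weights for document counts 1..n_docs."""
    weights = [1.0 / n_docs] * n_docs
    for idx in range(n_docs):
        # Same ramps as the sampling code: up over 4..19, down over 21..39
        if 3 < idx < 20:
            weights[idx] *= 1 + math.log(idx + 1)
        if 20 < idx < 40:
            weights[idx] *= 1 + math.log((40 - idx) + 1)
    total = sum(weights)
    return [w / total for w in weights]


def weight_for(doc_count, weights):
    """Look up the weight for a term that appears in doc_count documents."""
    return weights[doc_count - 1]


weights = build_weights()
# A term in 10 episodes should outweigh one in 200 episodes
print(weight_for(10, weights) > weight_for(200, weights))
```

The lookup avoids sampling altogether, which may be all the TF/IDF variant needs.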

One final thing we can do is get the count of any one of the values by using the bincount function:

>>> print(np.bincount(sample)[1])
>>> print(np.bincount(sample)[10])
>>> print(np.bincount(sample)[206])

Now I need to plug this into the rest of my code and see if it actually works!

Categories: Programming

How to Be More Productive: The Chunking Technique

NOOP.NL - Jurgen Appelo - Mon, 03/30/2015 - 21:42

“You’re so productive!”
“You’re so disciplined!”
“You’re so handsome!”

OK, I may have misremembered that last one. But I did receive the first two compliments more than once. I usually accept this positive feedback reluctantly, given my ongoing frustrations about “never getting anything done around here!”

The post How to Be More Productive: The Chunking Technique appeared first on NOOP.NL.

Categories: Project Management

Open Google Maps from your iOS app

Google Code Blog - Mon, 03/30/2015 - 20:00
Originally posted on the Google Geo Developers Blog
Posted by Todd Kerpelman, Developer Advocate

If you're an iOS developer, you're probably aware that you have the ability to open some apps directly by taking advantage of their custom URL schemes. (And if you're not aware of that fact, I have an excellent set of videos to recommend to you!)

Of course, we wouldn't be telling you all of this on the Google Geo Developers blog if it weren't for the fact that you can also use the comgooglemaps:// custom URL scheme to open up a map, Street View, or direction request directly in Google Maps on iOS.
Constructing these URLs, however, isn't always easy -- I don't know about you, but I don't spend a lot of my time memorizing key/value pairs for URL arguments. And adding x-callback-url support, while super useful for redirecting users back to your app, means adding even more URL arguments and escaping. And because not everybody has Google Maps installed on their iOS device, you may also want to build URLs to open up Apple Maps, which has its own similar-but-slightly-different set of URL arguments.
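To give a feel for what hand-building one of these URLs involves, here is a rough sketch in Python (chosen only to keep the example short and runnable off-device). The parameter names saddr, daddr, directionsmode, x-source and x-success, and the comgooglemaps-x-callback:// scheme, are assumptions based on the published Google Maps URL scheme documentation and should be checked against it; build_directions_url is a hypothetical helper, not part of any Google library.

```python
from urllib.parse import quote, urlencode


def build_directions_url(start, destination, mode="bicycling",
                         callback_url=None, app_name=None):
    """Build a Google Maps for iOS directions URL (hypothetical helper)."""
    # Assumed parameter names: saddr, daddr, directionsmode
    params = {"saddr": start, "daddr": destination, "directionsmode": mode}
    scheme = "comgooglemaps://"
    if callback_url and app_name:
        # Assumed x-callback-url variant of the scheme and its arguments
        scheme = "comgooglemaps-x-callback://"
        params["x-source"] = app_name
        params["x-success"] = callback_url
    # quote (rather than the default quote_plus) encodes spaces as %20
    return scheme + "?" + urlencode(params, quote_via=quote)


url = build_directions_url("221B Baker Street, London", "51.498511,-0.133091")
print(url)

callback = build_directions_url("221B Baker Street, London", "Scotland Yard",
                                callback_url="myapp://done", app_name="MyApp")
print(callback)
```

Even in this small form, the escaping and the extra callback arguments show why a typed wrapper is easier to live with than string concatenation.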

It was one of those situations that made me say, "Hey, somebody should write a utility to make this easier." And that's how, a few months later, we ended up publishing the OpenInGoogleMapsController for iOS.

OpenInGoogleMapsController is a class that makes it easy to build links to open a map (or display Street View or directions) directly in Google Maps for iOS. Rather than creating URLs by hand, you can create map requests using Objective-C classes and types, so you can take advantage of all the type-checking and code hinting you've come to expect from Xcode.

For instance, if you needed biking directions from Sherlock Holmes' apartment on Baker Street to Scotland Yard, your request might look something like this:

GoogleDirectionsDefinition *defn = [[GoogleDirectionsDefinition alloc] init];
defn.startingPoint =
[GoogleDirectionsWaypoint waypointWithQuery:@"221B Baker Street, London"];
defn.destinationPoint = [GoogleDirectionsWaypoint
waypointWithLocation:CLLocationCoordinate2DMake(51.498511, -0.133091)];
defn.travelMode = kGoogleMapsTravelModeBiking;
[[OpenInGoogleMapsController sharedInstance] openDirections:defn];

My favorite feature about this utility is that it supports a number of fallback strategies. If, for instance, you want to open up your map request in Google Maps, but then fall back to Apple Maps if the user doesn't have Google Maps installed, our library can do that for you. On the other hand, if it's important that your map location uses Google's data set, you can open up the map request in Google Maps in Safari or Chrome as a fallback strategy. And, of course, it fully supports the x-callback-url standard, so you can make sure Google Maps (or Google Chrome) has a button that points back to your app.

Sound interesting? Give it a try. Just add a couple of files to your Xcode project, and you're ready to go. Feel free to file any issues or enhancement requests in the GitHub repository, and let us know if you use it in your app. We'd be excited to check it out.
Categories: Programming

How We Scale VividCortex's Backend Systems

This is a guest post by Baron Schwartz, Founder & CEO of VividCortex, the first unified suite of performance management tools specifically designed for today's large-scale, polyglot persistence tier.

VividCortex is a cloud-hosted SaaS platform for database performance management. Our customers install agents that measure the work their servers perform (queries, processes, etc) and generate metrics and events from that at high frequency. The agents send the resulting data to our APIs, where we host our analysis backend. The backend system is a collection of databases, internal services (quasi-microservices), and web-facing APIs. These APIs also power our AngularJS frontend application.

We deal with a lot of data. We ingest metrics and events at high speed. We also perform analytics that touch large amounts of data interactively. We are not unique and I don't want to imply we are somehow impressive in the scheme of things. We don't yet operate at "web scale." Nevertheless, our workload has some relatively unusual characteristics, and we've been able to scale as far as we have, while remaining pretty efficient in terms of cost and infrastructure. And my career in consulting has taught me that building systems like this is usually a challenge for a company (as it has been for us). Our story might be useful to others. For that reason I will go into unnecessary detail on specific parts of our workload and the challenges it brings.

What We Do
Categories: Architecture

Why You Need to Speak at Your Next Code Camp

Making the Complex Simple - John Sonmez - Mon, 03/30/2015 - 16:00

I just came back from Orlando Code Camp, so I thought I’d do a post talking about why speaking at a code camp is a great opportunity and why, even if you think you have nothing to talk about or teach, you should be speaking at a code camp near you. Now, just in case you aren’t […]

The post Why You Need to Speak at Your Next Code Camp appeared first on Simple Programmer.

Categories: Programming

Most Prolific Bloggers on .Net

Phil Trelford's Array - Mon, 03/30/2015 - 08:15

On Saturday I headed down to Thoughtworks in Soho, London for an F# Open Data Hackathon organized by Thoughtworker Sean Newham. We started up with questions we’d like to answer using open data. I was interested in finding the most prolific bloggers in .Net, and formed a team with Adam Kosiński, Emmet Cassidy and my son Sean. We had from around 11am to 3pm to answer the question and present the results.

Dew Drop

We used data mined from Alvin Ashcraft’s Morning Dew site, which provides a labelled list of top links almost every week day since 2008.

Here’s Alvin’s activity since 2008:

Dew Drop Calendar 2008-2015

Alvin’s links cover many topics including .Net, Web, Mobile and XAML:

Dew Drop Tags 2008-2015

Top 100 .Net Bloggers

For this analysis we’re looking only at links labelled as “Top Links” or “.NET”. Between 2008 and 2015 there were over 20,000 links from over 3000 unique author names.

Interestingly the top 100 bloggers account for roughly half of all posts, and here’s the table of top 100 .Net bloggers based on data extracted from the Morning Dew:

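Rank | Name | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | Total
1 | Greg Duncan | 2 | 16 | 74 | 142 | 116 | 86 | 74 | 14 | 524
2 | Oren Eini | 31 | 71 | 76 | 42 | 94 | 49 | 32 | 12 | 407
3 | Sean Sexton | 0 | 0 | 0 | 0 | 0 | 193 | 195 | 0 | 388
4 | Zain Naboulsi | 0 | 0 | 236 | 56 | 17 | 37 | 3 | 0 | 349
5 | Richard Carr | 6 | 13 | 78 | 77 | 90 | 58 | 12 | 0 | 334
6 | Eric Lippert | 7 | 48 | 47 | 46 | 37 | 68 | 44 | 0 | 297
7 | Raymond Chen | 0 | 0 | 0 | 20 | 75 | 92 | 86 | 17 | 290
8 | Scott Hanselman | 28 | 16 | 38 | 39 | 44 | 37 | 50 | 7 | 259
9 | MS Downloads | 27 | 35 | 42 | 48 | 46 | 23 | 13 | 0 | 234
10 | CodePlex | 36 | 12 | 34 | 70 | 49 | 13 | 10 | 0 | 224
11 | Sasha Goldshtein | 4 | 16 | 30 | 31 | 22 | 25 | 19 | 7 | 154
12 | Brian Harry | 0 | 0 | 0 | 0 | 35 | 58 | 46 | 8 | 147
13 | Julie Lerman | 19 | 45 | 15 | 16 | 19 | 16 | 7 | 2 | 139
14 | Scott Guthrie | 3 | 12 | 47 | 18 | 19 | 24 | 12 | 2 | 137
15 | Martin Hinshelwood | 3 | 12 | 12 | 20 | 27 | 32 | 25 | 5 | 136
16 | Mike Hadlow | 6 | 16 | 30 | 21 | 25 | 22 | 4 | 0 | 124
17 | Dhananjay Kumar | 0 | 0 | 0 | 60 | 26 | 7 | 25 | 1 | 119
18 | Derik Whittaker | 24 | 21 | 21 | 21 | 18 | 8 | 6 | 0 | 119
19 | Gunnar Peipman | 1 | 40 | 36 | 9 | 9 | 15 | 4 | 1 | 115
20 | Abhijit Jana | 0 | 0 | 18 | 87 | 0 | 3 | 2 | 3 | 113
21 | Ricardo Peres | 0 | 0 | 0 | 6 | 15 | 31 | 38 | 13 | 103
22 | Jimmy Bogard | 19 | 18 | 14 | 12 | 7 | 5 | 18 | 6 | 99
23 | Dennis Delimarsky | 0 | 0 | 50 | 28 | 16 | 5 | 0 | 0 | 99
24 | Peter Vogel | 0 | 0 | 0 | 0 | 13 | 27 | 44 | 12 | 96
25 | Jonathan Allen | 0 | 0 | 11 | 19 | 25 | 15 | 17 | 9 | 96
26 | Matthew Podwysocki | 27 | 45 | 21 | 1 | 1 | 0 | 0 | 0 | 95
27 | Rockford Lhotka | 8 | 19 | 17 | 20 | 20 | 5 | 3 | 0 | 92
28 | K. Scott Allen | 11 | 8 | 14 | 15 | 19 | 11 | 8 | 3 | 89
29 | Shai Raiten | 0 | 0 | 18 | 40 | 22 | 9 | 0 | 0 | 89
30 | Charles Sterling | 3 | 8 | 4 | 6 | 25 | 24 | 15 | 1 | 86
31 | Deborah Kurata | 0 | 24 | 36 | 2 | 9 | 3 | 8 | 1 | 83
32 | Phil Haack | 3 | 5 | 10 | 31 | 13 | 11 | 6 | 0 | 79
33 | James Michael Hare | 0 | 0 | 8 | 38 | 24 | 3 | 0 | 3 | 76
34 | Peter Kellner | 2 | 18 | 10 | 8 | 22 | 9 | 3 | 3 | 75
35 | Sacha Barber | 7 | 4 | 5 | 9 | 4 | 8 | 31 | 7 | 75
36 | Kunal Chowdhury | 0 | 0 | 14 | 13 | 20 | 20 | 6 | 2 | 75
37 | Pete Brown | 3 | 6 | 28 | 17 | 16 | 3 | 2 | 0 | 75
38 | Mike Taulty | 7 | 9 | 9 | 10 | 13 | 4 | 21 | 1 | 74
39 | Glenn Block | 12 | 9 | 18 | 11 | 9 | 5 | 6 | 2 | 72
40 | Miguel de Icaza | 0 | 0 | 25 | 19 | 8 | 5 | 12 | 3 | 72
41 | Davy Brion | 7 | 30 | 26 | 9 | 0 | 0 | 0 | 0 | 72
42 | Mary Jo Foley | 0 | 3 | 1 | 20 | 17 | 16 | 12 | 2 | 71
43 | Rick Strahl | 9 | 13 | 1 | 11 | 15 | 8 | 7 | 7 | 71
44 | Alex Skorkin | 0 | 0 | 11 | 51 | 9 | 0 | 0 | 0 | 71
45 | Jesse Liberty | 3 | 0 | 10 | 13 | 19 | 4 | 15 | 1 | 65
46 | Tatworth | 0 | 0 | 0 | 14 | 30 | 7 | 14 | 0 | 65
47 | Daniel Moth | 17 | 23 | 0 | 12 | 6 | 3 | 4 | 0 | 65
48 | Abhishek Sur | 0 | 1 | 20 | 35 | 2 | 3 | 1 | 2 | 64
49 | Clemens Reijnen | 11 | 12 | 19 | 5 | 8 | 3 | 5 | 1 | 64
50 | Justin Etheredge | 25 | 20 | 16 | 3 | 0 | 0 | 0 | 0 | 64
51 | Jeremy Likness | 0 | 0 | 18 | 12 | 21 | 3 | 5 | 3 | 62
52 | Iris Classon | 0 | 0 | 0 | 0 | 37 | 16 | 6 | 3 | 62
53 | Bill Wagner | 0 | 7 | 2 | 21 | 11 | 7 | 11 | 3 | 62
54 | Mark Needham | 0 | 31 | 31 | 0 | 0 | 0 | 0 | 0 | 62
55 | Harry Pierson | 17 | 37 | 0 | 2 | 3 | 0 | 1 | 0 | 60
56 | Rob Eisenberg | 4 | 1 | 14 | 5 | 3 | 12 | 13 | 7 | 59
57 | Steven Sinofsky | 0 | 0 | 1 | 22 | 31 | 3 | 1 | 0 | 58
58 | John Papa | 8 | 3 | 7 | 3 | 13 | 15 | 7 | 0 | 56
59 | Patrick Smacchia | 9 | 8 | 11 | 12 | 10 | 4 | 2 | 0 | 56
60 | Jon Skeet | 6 | 1 | 0 | 15 | 9 | 5 | 16 | 3 | 55
61 | Steve Smith | 0 | 1 | 3 | 22 | 8 | 6 | 15 | 0 | 55
62 | Bnaya Eshet | 0 | 0 | 0 | 12 | 18 | 9 | 5 | 9 | 53
63 | Carl Franklin & Richard Campbell | 0 | 0 | 3 | 0 | 7 | 17 | 16 | 10 | 53
64 | Shawn Wildermuth | 7 | 13 | 8 | 6 | 10 | 1 | 6 | 2 | 53
65 | Rory Primrose | 0 | 0 | 19 | 18 | 7 | 7 | 2 | 0 | 53
66 | Willy-P. Schaub | 0 | 0 | 0 | 3 | 17 | 9 | 19 | 4 | 52
67 | Kirill Osenkov | 1 | 18 | 12 | 5 | 9 | 4 | 3 | 0 | 52
68 | S.Somasegar | 0 | 0 | 0 | 0 | 15 | 16 | 17 | 3 | 51
69 | Bart de Smet | 21 | 24 | 5 | 1 | 0 | 0 | 0 | 0 | 51
70 | Laurent Bugnion | 0 | 0 | 7 | 17 | 13 | 4 | 6 | 3 | 50
71 | Eric Battalio | 0 | 0 | 0 | 0 | 6 | 15 | 27 | 2 | 50
72 | Maarten Balliauw | 2 | 0 | 6 | 16 | 7 | 15 | 3 | 1 | 50
73 | Don Syme | 1 | 2 | 4 | 15 | 13 | 14 | 1 | 0 | 50
74 | Gil Fink | 1 | 4 | 24 | 20 | 1 | 0 | 0 | 0 | 50
75 | Cameron Skinner | 11 | 11 | 21 | 6 | 1 | 0 | 0 | 0 | 50
76 | Dave M. Bush | 7 | 25 | 5 | 0 | 0 | 3 | 7 | 2 | 49
77 | Stephen Forte | 4 | 21 | 21 | 3 | 0 | 0 | 0 | 0 | 49
78 | Michael Crump | 0 | 0 | 2 | 11 | 13 | 4 | 13 | 5 | 48
79 | Kim Spilker | 0 | 0 | 2 | 16 | 7 | 9 | 14 | 0 | 48
80 | Jura Gorohovsky | 0 | 0 | 5 | 12 | 15 | 12 | 4 | 0 | 48
81 | Wally McClure | 0 | 4 | 11 | 17 | 15 | 0 | 1 | 0 | 48
82 | Jason Zander | 0 | 10 | 13 | 5 | 18 | 0 | 1 | 0 | 47
83 | Chris Sells | 2 | 22 | 10 | 4 | 9 | 0 | 0 | 0 | 47
84 | Marcelo Lopez Ruiz | 1 | 2 | 37 | 5 | 1 | 0 | 0 | 0 | 46
85 | Nicholas Blumhardt | 0 | 4 | 6 | 6 | 1 | 7 | 18 | 3 | 45
86 | Rob Reynolds | 0 | 12 | 11 | 9 | 6 | 3 | 4 | 0 | 45
87 | Dmitri Nesteruk | 0 | 1 | 1 | 1 | 14 | 20 | 2 | 5 | 44
88 | Rory Becker | 0 | 0 | 13 | 11 | 1 | 1 | 18 | 0 | 44
89 | Filip Ekberg | 0 | 0 | 0 | 0 | 11 | 21 | 11 | 0 | 43
90 | G. Andrew Duthie | 2 | 12 | 4 | 4 | 20 | 1 | 0 | 0 | 43
91 | Grigori Melnik | 0 | 0 | 8 | 13 | 8 | 8 | 4 | 1 | 42
92 | Jonathan Wood | 0 | 0 | 6 | 21 | 4 | 0 | 8 | 3 | 42
93 | Peter Ritchie | 3 | 1 | 14 | 6 | 13 | 2 | 3 | 0 | 42
94 | Rob Conery | 10 | 2 | 4 | 15 | 5 | 1 | 5 | 0 | 42
95 | Gian Maria Ricci | 0 | 0 | 1 | 37 | 4 | 0 | 0 | 0 | 42
96 | Anoop Madhusudanan | 0 | 0 | 14 | 8 | 11 | 8 | 0 | 0 | 41
97 | Jeff Blankenburg | 0 | 1 | 15 | 6 | 19 | 0 | 0 | 0 | 41
98 | Rowan Miller | 0 | 0 | 5 | 4 | 4 | 10 | 15 | 2 | 40
99 | Hadi Hariri | 0 | 1 | 4 | 21 | 11 | 2 | 0 | 0 | 39
100 | Yochay Kiriaty | 3 | 15 | 15 | 6 | 0 | 0 | 0 | 0 | 39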


You can download the data from

I’d be interested in hearing about what you find :)

Categories: Programming

SPaMCAST 335 – Critical Agile Definitions, Communication Content, Microservices and Granularity

Software Process and Measurement Cast - Sun, 03/29/2015 - 22:00

In this episode of the Software Process and Measurement Cast we feature three columns! The first is our essay on the definitions of four critical words. What do the words effectiveness, efficiency, frameworks and methodologies really mean? These words get used ALL the time; however, they really do have fairly specific meanings. Meanings that, once understood and used to guide how we work, can help everyone to deliver more value and make our customers more satisfied! The second column is from Jo Ann Sweeney with another of her stellar Explaining Change columns. In this segment, Jo Ann talks about content and a framework to guide the development of content. Anchoring the Cast this week is Gene Hughson with another of his Form Follows Function columns. Gene extends his mini-series on microservices with a discussion of whether granularity is irrelevant. Lots of content in this installment of the Software Process and Measurement Cast!

Call to action!

Reviews of the podcast help to attract new listeners. Can you write a review of the Software Process and Measurement Cast and post it on the podcatcher of your choice? Whether you listen on iTunes or any other podcatcher, a review will help to grow the podcast! Thank you in advance!

Re-Read Saturday News

The Re-Read Saturday focus on Eliyahu M. Goldratt and Jeff Cox’s The Goal: A Process of Ongoing Improvement began on February 21st. The Goal has been hugely influential because it introduced the Theory of Constraints, which is central to lean thinking. The book is written as a business novel. Visit the Software Process and Measurement Blog and catch up on the re-read.

Note: If you don’t have a copy of the book, buy one.  If you use the link below it will support the Software Process and Measurement blog and podcast.

Dead Tree Version or Kindle Version 

I am beginning to think of which book will be next. Do you have any ideas?

Upcoming Events

Categories: Process Management

InetAddressImpl#lookupAllHostAddr slow/hangs

Mark Needham - Sun, 03/29/2015 - 01:31

Since I upgraded to Yosemite I’ve noticed that attempts to resolve localhost on my home network have been taking ages (sometimes over a minute) so I thought I’d try and work out why.

This is what my initial /etc/hosts file looked like based on the assumption that my machine’s hostname was teetotal:

$ cat /etc/hosts
# Host Database
# localhost is used to configure the loopback interface
# when the system is booting.  Do not change this entry.
##	localhost	broadcasthost
::1             localhost
#fe80::1%lo0	localhost	wuqour.local       teetotal

I set up a little test which replicated the problem:

import java.net.InetAddress;
import java.net.UnknownHostException;

public class LocalhostResolution {
    public static void main( String[] args ) throws UnknownHostException {
        long start = System.currentTimeMillis();
        InetAddress localHost = InetAddress.getLocalHost();
        System.out.println(System.currentTimeMillis() - start);
    }
}

which has the following output:

Exception in thread "main" teetotal-2: teetotal-2: nodename nor servname provided, or not known
	at LocalhostResolution.main(
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(
	at java.lang.reflect.Method.invoke(
	at com.intellij.rt.execution.application.AppMain.main(
Caused by: teetotal-2: nodename nor servname provided, or not known
	at Method)
	... 6 more

Somehow my hostname has changed to teetotal-2 so I added the following entry to /etc/hosts:	teetotal-2

Now if we run the program we see this output instead:


It’s still taking 5 seconds to resolve which is much longer than I’d expect. After break pointing through the code it seems like it’s trying to do an IPv6 resolution rather than IPv4 so I added an /etc/hosts entry for that too:

::1             teetotal-2

Now resolution is much quicker:


Happy days!
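To check a hosts file for this situation without stepping through the JDK, something like the following Python sketch could report which address families a hostname is mapped to. The helper name address_families and the sample file contents are made up for illustration; in particular the 127.0.0.1 address is an assumption, since the address of the IPv4 entry isn't shown above.

```python
import ipaddress


def address_families(hostname, hosts_text):
    """Return the set of IP versions (4 and/or 6) mapped to hostname."""
    families = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if hostname in names:
            # Strip any zone id such as "%lo0" before parsing the address
            ip = ipaddress.ip_address(address.split("%", 1)[0])
            families.add(ip.version)
    return families


# Hypothetical hosts file contents; the IPv4 address is assumed
sample_hosts = """
127.0.0.1       localhost
127.0.0.1       teetotal-2
::1             localhost
::1             teetotal-2
"""

print(address_families("teetotal-2", sample_hosts))
```

If the result is missing 4 or 6 for the machine's current hostname, that is the entry to add.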

Categories: Programming

Re-Read Saturday: The Goal: A Process of Ongoing Improvement. Part 6


The Goal: A Process of Ongoing Improvement, published in 1984, is a business novel. The Goal uses the story of Alex Rogo, plant manager, to illustrate the theory of constraints and how the wrong measurement focus can harm an organization. The focus of the re-read is less on the story and more on the ideas the book explores that have shaped lean thinking. Earlier entries in this re-read are:

Part 1, Part 2, Part 3, Part 4, Part 5

This week I attended the CMMI EMEA conference. Kirk Botula, CEO of the CMMI Institute, delivered a great speech about developing and managing capability. Capability is the ability to do something. Without capabilities, individuals, teams and organizations are powerless. Tools like the CMMI, Agile and lean provide frameworks to decide which capabilities are important and then to measure the level and impact of current capabilities. During Kirk’s talk he referenced two books. The first was Out of the Crisis by Edwards Deming and the second was The Goal: A Process of Ongoing Improvement by Jeff Cox and Eliyahu Goldratt. Deming’s work defined the continuous process improvement movement, while Goldratt framed the lean movement. Later during the week I asked a group of C-level executives if they were familiar with The Goal and the Theory of Constraints. Unfortunately the answer was not a resounding yes. The limited awareness is problematic because it leads to a focus on cost containment and efficiency as the most important goals over effectiveness and throughput. Overfocusing on cost and efficiency leads to sub-optimal performance and, in the long run, market failure. In this installment of Re-read Saturday we begin to encounter the nuts and bolts that make the Theory of Constraints an effective tool for structured process improvement.

Chapter 17

After dealing with the complications of a household in turmoil (Alex’s household is a reflection of a series of dependent events and statistical variability), Alex arrives at the plant to find an order for an important client is late. Alex explains how dependent events and statistical variability will lead to the sub-optimal performance of a system. Despite Alex’s revelations from observing the scout troop, his team believes they can deliver the parts in time to make the last truck at 5 PM. Alex places a ten dollar bet that the parts will not be ready to ship that day due to variations in the production process. The production of the parts requires two steps. The first step in the production process averages 25 parts per hour. In this case the production started slowly; however, by the end of four hours 100 parts had been completed. Therefore the process had averaged 25 parts per hour. The hourly production ranged from 19 parts to 32 parts. While the second step in the process had the same 25 parts per hour capacity, because of the variability of the first process the capacity of the second process was exceeded on two occasions, so not all 100 parts were completed.
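The arithmetic behind the bet can be sketched in a few lines of Python. The summary above gives only the total (100 parts) and the range (19 to 32 parts per hour), so the two middle hourly figures below are assumptions chosen to fit those numbers, and run_line is a made-up helper. The sketch also assumes, for simplicity, that the second step can work on parts in the same hour the first step produces them.

```python
def run_line(first_step_output, second_step_capacity):
    """Feed hourly output of step one into a capacity-limited step two."""
    backlog = 0
    finished = 0
    for produced in first_step_output:
        available = backlog + produced
        # Step two can never process more than its hourly capacity
        done = min(available, second_step_capacity)
        finished += done
        backlog = available - done
    return finished, backlog


# Assumed hourly output of the first step: sums to 100, ranges from 19 to 32
hourly_output = [19, 21, 28, 32]
finished, backlog = run_line(hourly_output, second_step_capacity=25)
print(finished, backlog)
```

With these figures the second step is starved in the first two hours and swamped in the last two, so the slow hours are lost for good while the fast hours only build a backlog.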

Chapter 18

Alex and his now attentive staff (the performance on the late order has gotten their attention) attempt to determine how to control the flow of work through the steps in the process (dependent events) as they are subject to statistical variability. They consider longer lead times as a method to smooth the flow, but longer lead times would generate higher inventories and reduce efficiency. As a group they elect to call Jonah, who introduces another new concept: bottleneck and non-bottleneck resources. A bottleneck resource is any resource that has a capacity that is equal to or less than the demand placed upon it. Any variability in the flow of work that is beyond the capacity of the bottlenecked resources will generate a backlog. While The Goal was written from the point of view of a manufacturing plant, bottlenecks can occur in any type of project. Any resource that is planned to 100% capacity is a bottlenecked resource. Through discussion with Jonah the team discovers that, like the scout troop, the capacity of the plant is governed by the step with the least capacity. In the plant, the team needs to discover the bottlenecks that are blocking their ability to deliver effectively.

Bottlenecks are neither good nor bad in their own right. A process with infinite capacity at all steps will not be efficient because resources would be underutilized. Bottlenecks can be used to generate a predictable process. This is a reflection of how Alex used the slow scout in Chapter 15. Also, understanding the capacity and variability of the bottleneck step will help make the process predictable.

The search for the bottleneck becomes a process of discovery in which the team learns that they really don’t know the process well and that the time accounting data is less than perfect. The big revelation is that a build-up of work-in-process waiting to be addressed at a step is an indication of a bottleneck. In this case the team finds two culprits. The first is the industrial robot that is the darling of senior management and the second is the heat treatment step. The bottlenecks cannot be moved to the head of the column, as Alex did to solve the problem with the boy scouts, nor can extra capacity be generated by increasing staff or buying machines. The workday ends with Alex searching for a solution.
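The definition of a bottleneck given earlier (capacity equal to or less than the demand placed on the resource) is simple enough to express directly. In the Python sketch below all of the capacity and demand figures are hypothetical, as is the find_bottlenecks helper; only the robot and heat treatment step names come from the chapter.

```python
def find_bottlenecks(capacity_per_hour, demand_per_hour):
    """Return resources whose capacity is equal to or less than demand."""
    return [name for name, capacity in capacity_per_hour.items()
            if capacity <= demand_per_hour]


# Hypothetical capacities in parts per hour, for illustration only
capacity_per_hour = {
    "machining": 40,
    "robot": 25,
    "heat treatment": 20,
    "assembly": 35,
}
demand_per_hour = 25

bottlenecks = find_bottlenecks(capacity_per_hour, demand_per_hour)
# The plant as a whole can go no faster than its slowest step
plant_capacity = min(capacity_per_hour.values())
print(bottlenecks, plant_capacity)
```

Note that the robot counts as a bottleneck even though its capacity matches demand exactly, which is the point made above about resources planned to 100% capacity.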

The effect of the combination of dependent events and statistical variability on the plant floor is obvious. In a manufacturing plant where processes are heavily automated, variability might be low; however, it still exists. In software or product development the same interaction of process steps and variability exists, although because the work tends to be more discovery-oriented, the variability is probably higher, making the impact more substantial and less predictable. The variability in flow through the process exposes bottlenecks that limit our ability to catch up, making projects and products late or, worse, generating technical debt when corners are cut in order to make the date or budget.

Summary of The Goal so far:

Chapters 1 through 3 actively present the reader with a burning platform. The plant and division are failing. Alex Rogo has actively pursued increased efficiency and automation to generate cost reductions; however, performance is falling even further behind and fear has become a central feature in the corporate culture.

Chapters 4 through 6 shift the focus from steps in the process to the process as a whole. These chapters move us down the path of identifying the ultimate goal of the organization (in this book). The goal is making money and embracing the big picture of systems thinking. In this section, the authors point out that we are often caught up with pursuing interim goals, such as quality, efficiency or even employment, to the exclusion of the ultimate goal. We are reminded by the burning platform identified in the first few pages of the book, the impending closure of the plant and perhaps the division, that in the long run an organization must make progress towards its ultimate goal, or it won’t exist.

Chapters 7 through 9 show Alex’s commitment to change. He seeks more precise advice from Jonah, brings his closest reports into the discussion and begins a dialog with his wife (remember this is a novel). In this section of the book the concept that “you get what you measure” is addressed. We see measures of efficiency being used at the level of part production, but not at the level of whole orders or even sales. We discover the corollary to the adage ‘you get what you measure’ is that if you measure the wrong thing … you get the wrong thing. We begin to see Alex’s urgency and commitment to make a change.

Chapters 10 through 12 mark a turning point in the book. Alex has embraced a more systems view of the plant and recognizes that the measures used to date are focused on optimizing parts of the process to the detriment of the overall goal of the plant. What has not fallen into place is how to take that new knowledge and change how the plant works. The introduction of the concepts of dependent events and statistical variation begins to shift the conceptual understanding of what to measure towards how the management team can actually use that information.

Chapters 13 through 15 drive home the point that dependent events and statistical variation impact the performance of the overall system. In order for the overall process to be more effective you have to understand the capability and capacity of each step and then take a systems view. These chapters establish the concepts of bottlenecks and constraints without directly naming them and that focusing on local optimums causes more trouble than benefit.

Note: If you don’t have a copy of the book, buy one.  If you use the link below it will support the Software Process and Measurement blog and podcast. Dead Tree Version or Kindle Version

Categories: Process Management

How to Avoid the "Yesterday's Weather" Estimating Problem

Herding Cats - Glen Alleman - Sat, 03/28/2015 - 23:39

One suggestion from the #NoEstimates community is the use of empirical data of past performance. This is often called yesterday's weather. First let's make sure we're not using just the averages from yesterday's weather. Even adding the variance to that small sample of past performance can lead to very naive outcomes.

We need to do some actual statistics on that time series. A simple R set of commands will produce the chart below from the time series of past performance data.


 But that doesn't really help without some more work.

  • Is the future really like the past - are the work products and the actual work performed in the past replicated in the future? If so, this sounds like a simple project; just turn out features that all look alike.
  • Are there any interdependencies that grow in complexity as the project moves forward? This is the integration and test problem. Then the system of systems integration and test problem. Again, simple projects don't usually have this problem. More complex projects do.
  • What about those pesky emerging requirements? This is a favorite idea of agile (and correctly so), but simple past performance is not going to forecast the needed performance in the presence of emerging requirements.
  • Then all the externalities of all project work: where are those captured in the sample of past performance?
  • "All big projects have little projects inside them" is a common phrase. Except that the collection of little projects needs to be integrated, tuned, tested, verified and validated so that all the parts, when assembled, actually do what the customer wants.

Getting Out of the Yesterday's Weather Dilemma

Let's use the chart below to speak about some sources of estimating NOT based on simple small samples of yesterday's weather. This is a Master Plan for a non-trivial project to integrate a half dozen or so legacy enterprise systems with a new health insurance ERP system for an integrated payer/provider solution:

Capabilities Flow

  • Reference Class Forecasting for each class of work product.
    • As the project moves left to right in time the¬†classes of product and the related work likely change.¬†
    • Reference classes for each of this movements through increasing maturity, and increasing complexity from integration interactions needs to be used to estimate not only the current work but the next round of work
    • In the chart above work on the left is planned with some level of confiidence, because it's work in hand. Work in the right is in the future, so an courser estimate is all that is needed for the moment.
    • This is a¬†planning package¬†notion used in space and defense. Only plan in detail what yuo understand in detail.
  • Interdependencies Modeling in MCS
    • On any non-trivial project there are interdependencies
    • The notion of INVEST needs to be tested
      • Independent - not usually the case on enterprise projects
      • Negotiable - usually not, since the ERP system provides the core capability to do business. It would be illogical to have half the procurement system.
      • We can issue purchase orders and receive goods. But we can't pay for them until we get the Accounts Payable system. We need both at the same time.
      • Valuable - Yep, why are we doing this if it's not valuable to the business? This is a strawman used by low business maturity projects.
      • Estimable - to a good approximation is what the advice tells us. The term good needs a unit of measure.
      • Small - is a domain dependent measure. Small to an enterprise IT project may be huge to a sole contributor game developer.
      • Testable - Yep, and verifiable, and validatable, and secure, and robust, and fault tolerant, and meets all performance requirements.
  • Margin - protects dates, cost, and technical performance from irreducible uncertainty. Irreducible means nothing can be done about the uncertainty. It is not the lack of knowledge found in reducible uncertainty, which is Epistemic. Irreducible uncertainty is Aleatory. It's the natural randomness in the underlying processes that creates the uncertainty. When we are estimating in the presence of aleatory uncertainty, we must account for it. This is why using the average of a time series for making a decision about possible future outcomes will always lead to disappointment.
    • First we should always use the Most Likely value of the time series, not the Average of the time series.
    • The Most Likely - the Mode - is the number that occurs most often of all the possible values that have occurred in the past. This should make complete sense when we consider what value will appear next. Why, the value that has appeared Most Often in the past.
    • The Average of two numbers 1 and 99 is 50. The average of two numbers 49 and 51 is 50. Be careful with averages in the absence of knowing the variance.
  • Risk retirement - Epistemic uncertainty creates risks that can be retired. This means spending money and time. So when we're looking at past performance in an attempt to estimate future performance (Yesterday's Weather), we must determine what kind of uncertainties there are in the future and what kind of uncertainties we encountered in the past.
    • Were they, and are they, reducible or irreducible?
    • Did the performance in the past contain irreducible uncertainties, baked into the numbers, that we did not recognize?
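
As a minimal sketch of the Most Likely versus Average point above, the snippet below compares the mean and the mode of a small sample and shows two samples with the same average but very different spread. The `durations` list is hypothetical data invented for illustration. The 1/99 and 49/51 pairs come from the text.

```python
import statistics

# Hypothetical past durations (days) for similar tasks; not data from the article.
durations = [3, 4, 4, 4, 5, 5, 6, 9, 14]

mean_duration = statistics.mean(durations)  # pulled upward by the long tail
mode_duration = statistics.mode(durations)  # the Most Likely (most frequent) value

# Two samples with the same average but very different spread,
# using the number pairs from the text above.
wide = [1, 99]
narrow = [49, 51]

print("mean:", mean_duration, "mode:", mode_duration)
print("wide:   mean", statistics.mean(wide), "stdev", statistics.pstdev(wide))
print("narrow: mean", statistics.mean(narrow), "stdev", statistics.pstdev(narrow))
```

The long tail in the hypothetical sample drags the average above the value that occurs most often, and the two pairs show why an average says little without the variance alongside it.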

This brings up a critical issue with all estimates. Did the numbers produced from the past performance meet the expected values, or were they just the numbers we observed? This notion of taking the observed numbers and using them for forecasting the future is an Open Loop control system. What SHOULD the numbers have been to meet our goals? What SHOULD the goal have been? If we didn't know that, then there is no baseline to compare the past performance against to see if it will be able to meet the future goal.

I'll say this again - THIS IS OPEN LOOP control, NOT CLOSED LOOP. No amount of dancing around will get over this; it's a simple control systems principle found here: Open and Closed Loop Project Controls

  • Measures of physical percent complete to forecast future performance with cost, schedule, and technical performance measures - once we have the notion of Closed Loop Control, have constructed a steering target, and can capture actuals against plan, we need to define measures that are meaningful to the decision makers. Agile does a good job of forcing working product to appear often. The assessment of Physical Percent Complete, though, needs to define what that working software is supposed to do in support of the business plan.
  • Measures of Effectiveness - one very good measure is of Effectiveness. Does the software provide an effective solution to the problem? This raises the question, or questions: what is the problem, and what would an effective solution look like were it to show up?
    • MOE's are operational measures of success that are closely related to the achievements of the mission or operational objectives evaluated in the operational environment, under a specific set of conditions.
  • Measures of Performance - the companions of Measures of Effectiveness.
    • MOP's characterize physical or functional attributes relating to the system operation, measured or estimated under specific conditions.
  • Along with these two measures are Technical Performance Measures
    • TPM's are attributes that determine how well a system or system element is satisfying or expected to satisfy a technical requirement or goal.
  • And finally there are Key Performance Parameters
    • KPPs represent the capabilities and characteristics so significant that failure to meet them can be cause for reevaluation, reassessing, or termination of the program.

The connections between these measures are shown below.

Screen Shot 2015-03-28 at 4.37.51 PM

With these measures and statistical tools for making estimates of the future - forecasts - we can use yesterday's weather, tomorrow's models and related reference classes, and the desired MOE's, MOP's, KPP's, and TPM's to construct a credible estimate of what needs to happen. We can then measure what is happening, close the loop with an error signal, and take corrective action to stay on track toward our goal.
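
A minimal sketch of what closing the loop could look like, assuming cumulative physical percent complete is the measure. The `planned` and `actual` values, the function names, and the tolerance are all hypothetical. The point is only that an error signal needs a plan to subtract from, which an open loop lacks.

```python
# Hypothetical plan and actuals (cumulative physical percent complete per period).
planned = [10, 25, 45, 70, 100]
actual = [8, 20, 35]  # only the first three periods have been measured


def error_signal(planned, actual):
    """Return plan-minus-actual variance for each measured period."""
    return [p - a for p, a in zip(planned, actual)]


def needs_corrective_action(errors, tolerance):
    """Flag the latest period if its variance exceeds the agreed tolerance."""
    return abs(errors[-1]) > tolerance


errors = error_signal(planned, actual)
print("error signal per period:", errors)
print("corrective action needed:", needs_corrective_action(errors, tolerance=5))
```

Without the `planned` series there is nothing to compare the actuals against, which is the open loop situation described earlier.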

This all sounds simple in principle, but in practice of course it's not. It's hard work, but when you assess the value at risk to be beyond what the customer is willing to risk of their investment, we need tools and processes that actually control the project.

Categories: Project Management

Some More Background on Probability, Needed for Estimating

Herding Cats - Glen Alleman - Sat, 03/28/2015 - 15:05

The lack of understanding of the underlying probability and statistics of making decisions in the presence of uncertainty continues to plague the discussion of estimating software.

All elements of all projects are statistical in nature. This statistical behaviour - reducible or irreducible stochastic processes - creates uncertainty.

Event-based uncertainties have a probability of occurrence and a probability of the impact once that uncertainty becomes a reality. These are Epistemic uncertainties - epistemology is the study of knowledge. Epistemic means knowing or not knowing in this case. We can buy knowledge. This is the core concept of the agile paradigm. We are buying down risk by building software to test the uncertainties of the project deliverables. This is the basis of saying agile is about risk management. But I suspect those saying that, without being able to do the math as we say in our domain, don't realize what they are actually saying.

The naturally occurring variances are aleatory. They are always there; they are irreducible. That is, they can't be fixed. Work effort and duration are aleatory. The ONLY fix for aleatory uncertainty and the resulting risk is margin: cost margin, schedule margin, technical performance margin. You can't buy the fix to aleatory uncertainty.
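
One way to size schedule margin for aleatory uncertainty is a small Monte Carlo simulation. The sketch below is an illustration under stated assumptions, not a method taken from this post: the three-point task durations are hypothetical, the triangular distribution is an assumed shape, and the 80% confidence level is an arbitrary choice.

```python
import random


def simulate_durations(tasks, trials=10_000, seed=42):
    """Sum one triangular draw per task, repeated for many trials."""
    rng = random.Random(seed)
    totals = []
    for _ in range(trials):
        total = sum(rng.triangular(low, high, mode) for low, mode, high in tasks)
        totals.append(total)
    return sorted(totals)


def percentile(sorted_values, pct):
    """Nearest-rank style percentile of an already sorted list."""
    index = min(len(sorted_values) - 1, int(pct / 100 * len(sorted_values)))
    return sorted_values[index]


# Hypothetical (optimistic, most likely, pessimistic) durations in days.
tasks = [(3, 5, 10), (8, 10, 20), (2, 4, 9)]

totals = simulate_durations(tasks)
most_likely_sum = sum(mode for _, mode, _ in tasks)
p80 = percentile(totals, 80)
margin = p80 - most_likely_sum
print(f"sum of most likely: {most_likely_sum}, P80: {p80:.1f}, margin: {margin:.1f}")
```

The gap between the sum of the most likely values and the chosen percentile of the simulated totals is one candidate for the margin. Because the assumed distributions are skewed to the right, simply adding up the most likely values understates the total.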

Found a book today: Discover Probability: How to Use It, How to Avoid Misusing It, and How It Affects Every Aspect of Your Life, Arieh Ben-Naim, World Scientific.

This is one of those must-read books for anyone working in a domain where probability and statistics dominate the decision making process. Other books, such as How To Measure Anything, The Flaw of Averages, and How Not to Be Wrong: The Power of Mathematical Thinking, are very good but populist in that they contain little in terms of actual mathematics. This book is in between: lots of narrative, but math as well. It is not like Probability Methods for Cost Uncertainty Analysis: A Systems Engineering Perspective, but sits in the middle.

In The End

You can't make decisions in the presence of uncertainty without estimating the outcome of your decision in the future. Using empirical data is preferred. But that empirical data MUST be adjusted for future uncertainty, past variances, sampling errors, poor representations of the actual process, and the plethora of other drivers of uncertainty. Having more than small, simple samples without variances, and most of all confirming that the past actually does represent the future - and doing that mathematically, not just announcing it - is needed for any estimates of the future to have any credibility. Otherwise it's just an uninformed bad guess.

Categories: Project Management

Internet of things

Xebia Blog - Sat, 03/28/2015 - 11:26

Do you remember the time that you were not connected to the internet? When you had to call your friend instead of sending a message via WhatsApp to make an appointment? The times you had to apply for a job by writing a letter with a pen instead of sending your LinkedIn page? We all have gone through a major revolution in which we've all got a digital identity. The same revolution has started for things, right now!

Internet of Things


Things we use all day, like your watch or thermostat, get suddenly connected to the internet. These things are equipped with electronics, sensors and connectivity. Whether you like it or not, you will not escape this movement. According to Gartner there will be nearly 26 billion devices on the Internet of Things by 2020.
The potential of the Internet of Things isn’t just linking millions of devices. It is more like the potential Google or Amazon had when building massive IT infrastructure. The IoT is about transforming business models and enabling companies to sell products in entirely new and better ways.


The definition of IoT is fairly simple. The internet of things involves 'every thing that is connected to the internet'. You should exclude mobiles and computers, because they've already become mainstream.

In fact, the Internet of things is a name for a set of trends. First of all we have the trend that objects we use all day are suddenly connected to the internet. For example, bathroom scales that compare your weight with others.
Another trend is that devices that were already shipped with CPUs have become much more powerful when they’re connected to a network. Functionality of the hardware depends on local computing as well as powerful cloud computing. A fitness tracker can carry out some functionality, like basic data cleaning and tabulation, on local hardware, but it depends on sophisticated machine-learning algorithms on an ever improving cloud service to offer prescription and insight.
Besides that we have the maker movement. Creating hardware isn’t just for the big companies like Sony or Samsung anymore, just like creating software in the 80’s and 90’s was a privilege for IBM and Microsoft.

What’s new?

Devices that are connected to the internet aren’t new. For many years, we've connected devices to the internet, although not at this scale. I think we have to compare it to concepts like 'the cloud'. Before I heard the name 'the cloud' I was already using services that run on the internet, like Hotmail. But one swallow doesn't make a summer. When suddenly lots of services are running on the web, we speak of a trend and the market will give it a name. The same is happening with the internet of things. Suddenly all kinds of 'normal' devices are connected to the internet and we have a movement in that direction that is unstoppable. And the possibilities it will bring are endless.

Splitting up in pieces

We can divide the internet of things into five smart categories. First of all we have the smart wearable. It is designed for a variety of purposes as well as for wear on a variety of parts of the body. Think of a bicycle helmet that detects a crash. Next to it we have the smart home. Its goal is to make the experience of living at home more convenient and pleasant. Think of a thermostat that learns what temperatures users like and builds a context-aware personalized schedule. A third and fourth category are the smart city and smart environment. They focus on finding sustainable solutions to the growing problems. The last category is very interesting. It is the smart enterprise. One can think of real-time shipment tracking or a smart metering solution that manages energy consumption at the individual appliance and machine level.

Changes of business models

There are plenty of new business models due to the internet of things. An example is a system that provides a sensor-embedded trash can, so it’s capable of real-time context analysis and alerting the authorities when it is full and needs to be emptied. Another example is the insurance world, which will adopt the internet of things. There's already behaviour-based car insurance pricing available. Customers who agree to install a monitoring device on their car's diagnostic port get a break on their insurance rates and pricing that varies according to usage and driving habits.

Big data

Other disciplines in software development also have interest in the new movement, like big data. Internet of things will not generate big data, but obese data. Did you know that every flight generates 3 terabytes of data from all the sensors in an airplane? Think about how much data will be generated, and will have to be analyzed, when all buildings in a town are equipped with sensors to measure the quality of the air.


The intersection between software and the physical world becomes smaller every day and that is the result of the unstoppable movement of internet of things. It is still at the beginning. Business models are changing and there are lots of opportunities. You just have to find them...

Toward a Better Markdown Tutorial

Coding Horror - Jeff Atwood - Sat, 03/28/2015 - 01:19

It's always surprised me when people, especially technical people, say they don't know Markdown. Do you not use GitHub? Stack Overflow? Reddit?

I get that an average person may not understand how Markdown is based on simple old-school plaintext ASCII typing conventions. Like when you're *really* excited about something, you naturally put asterisks around it, and Markdown makes that automagically italic.

But how can we expect them to know that, if they grew up with wizzy-wig editors where the only way to make italic is to click a toolbar button, like an animal?

I am not advocating for WYSIWYG here. While there's certainly more than one way to make italic, I personally don't like invisible formatting tags and I find that WYSIWYG is more like WYCSYCG in practice. It's dangerous to be dependent on these invisible formatting codes you can't control. And they're especially bad if you ever plan to care about differences, revisions, and edit history. That's why I like to teach people simple, visible formatting codes.

We can certainly debate which markup language is superior, but in Discourse we tried to build a rainbow tool that satisfies everyone. We support:

  • HTML (safe subset)
  • BBCode (basic subset)
  • Markdown (full)

This makes coding our editor kind of hellishly complex, but it means that for you, the user, whatever markup language you're used to will probably "just work" on any Discourse site you happen to encounter in the future. But BBCode and HTML are supported mostly as bridges. What we view as our primary markup format, and what we want people to learn to use, is Markdown.

However, one thing I have really struggled with is that there isn't any single great place to refer people to with a simple walkthrough and explanation of Markdown.

When we built Stack Overflow circa 2008-2009, I put together my best effort at the time which became the "editing help" page:

It's just OK. And GitHub has their Markdown Basics, and GitHub Flavored Markdown help pages. They're OK.

The Ghost editor I am typing this in has an OK Markdown help page too.

But none of these are great.

What we really need is a great Markdown tutorial and reference page, one that we can refer anyone to, anywhere in the world, from someone who barely touches computers to the hardest of hard-core coders. I don't want to build another one of these kinds of help pages for Discourse, I want to build one for everyone. Since it is for everyone, I want to involve everyone. And by everyone, I mean you.

After writing about Our Programs Are Fun To Use – which I just updated with a bunch of great examples contributed in the comments, so go check that out even if you read it already – I am inspired by the idea that we can make a fun, interactive Markdown tutorial together.

So here's what I propose: a small contest to build an interactive Markdown tutorial and reference, which we will eventually host at the home page of, and can be freely mirrored anywhere in the world.

Some ground rules:

  • It should be primarily in JavaScript and HTML. Ideally entirely so. If you need to use a server-side scripting language, that's fine, but try to keep it simple, and make sure it's something that is reasonable to deploy on a generic Linux server anywhere.

  • You can pick any approach you want, but it should be highly interactive, and I suggest that you at minimum provide two tracks:

    • A gentle, interactive tutorial for absolute beginners who are asking "what the heck does Markdown even mean?"

    • A dynamic, interactive reference for intermediates and experts who are asking more advanced usage questions, like "how do I make code inside a list, or a list inside a list?"

  • There's a lot of variance in Markdown implementations, so teach the most common parts of Markdown, and cover the optional / less common variations either in the advanced reference areas or in extra bonus sections. People do love their tables and footnotes! We recommend using a CommonMark compatible implementation, but it is not a requirement.

  • Your code must be MIT licensed.

  • Judging will be completely at the whim of myself and John MacFarlane. Our decisions will be capricious, arbitrary, probably nonsensical, and above all, final.

  • We'll run this contest for a period of one month, from today until April 28th, 2015.

  • If I have hastily left out any clarifying rules I should have had, they will go here.

Of course, the real reward for building is the admiration of your peers, and the knowledge that an entire generation of people will grow up learning basic Markdown skills through your contribution to a global open source project.

But on top of that, I am offering … fabulous prizes!

  1. Let's start with my Recommended Reading List. I count sixteen books on it. As long as you live in a place Amazon can ship to, I'll send you all the books on that list. (Or the equivalent value in an Amazon gift certificate, if you happen to have a lot of these books already, or prefer that.)

  2. Second prize is a CODE Keyboard. This can be shipped worldwide.

  3. Third prize is you're fired. Just kidding. Third prize is your choice of any three books on my reading list. (Same caveats around Amazon apply.)

Looking for a place to get started? Check out:

If you want privacy, you can mail your entries to me directly (see the about page here for my email address), or if you are comfortable with posting your contest entry in public, I'll create a topic on talk.commonmark for you to post links and gather feedback. Leaving your entry in the comments on this article is also OK.

We desperately need a great place that we can send everyone to learn Markdown, and we need your help to build it. Let's give this a shot. Surprise and amaze us!

Categories: Programming

Critical Success Factors of IT Forecasting

Herding Cats - Glen Alleman - Fri, 03/27/2015 - 19:11

The Estimation problem in enterprise IT and Software Intensive Systems has been with us for decades, if not from the beginning of time in the software business. While not software, but a hardware implementation of an algorithm, Alan Turing's problem was to tell his boss when he'd be able to crack the Enigma Code.

We still struggle with software estimates. Tom DeMarco's quote is  an important starting point, which I'll repeat here from the paper below

The estimator's charter is not to state what developers should do, but rather to provide a reasonable projection of what they will do.

Screen Shot 2015-03-27 at 11.48.41 AM
For those interested in further understanding the estimating problem, this is a very good starting point.

A key point from a reference "The Inaccurate Conception," CACM, 51(3):13-16, 2008, says

"When a weather forecast indicates a 40% chance of rain, and it rains on you, was the forecast accurate? If it doesn't rain on you, was the forecast inaccurate? Thought of in these terms, the concept of accuracy takes on a different meaning. It seems that the fact that it does or does not rain on you is not a particularly good measure of the accuracy of the rain estimate."

"The accuracy of a weather forecast is not whether it rains or not but whether it rains at the likelihood it was forecast to rain. Similarly, the accuracy of an estimate on a project is not whether the project achieves its goals but whether it correctly forecasts the probability of achieving its goals."

For some more background from the ACM Library, referenced here are some further readings

  • What is a Good Estimate? Whether Forecasting is Valuable, Phillip G. Armour, CACM, Vol. 56 No. 6. June 2013
  • Estimation is Not Evil, Phillip G. Armour, CACM, Vol. 57 No. 1, March 2014
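
The quoted point about forecast accuracy, that a forecast is accurate when events occur at the stated likelihood, can be illustrated with a small calibration check. This is a minimal sketch rather than anything taken from the referenced papers: the `history` data and the `calibration_table` helper are hypothetical.

```python
from collections import defaultdict


def calibration_table(forecasts):
    """Group (probability, outcome) pairs by stated probability and
    return {probability: observed frequency of the event}."""
    buckets = defaultdict(list)
    for probability, happened in forecasts:
        buckets[probability].append(1 if happened else 0)
    return {p: sum(hits) / len(hits) for p, hits in sorted(buckets.items())}


# Hypothetical history: ten "40% chance of rain" days and five "80%" days.
history = (
    [(0.4, True)] * 4 + [(0.4, False)] * 6
    + [(0.8, True)] * 4 + [(0.8, False)] * 1
)

for stated, observed in calibration_table(history).items():
    print(f"forecast {stated:.0%} -> observed {observed:.0%}")
```

In this made-up history the forecaster is well calibrated even though it stayed dry on most of the 40% days. The same comparison could be applied to project estimates that are stated with a probability of meeting their goals.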
Categories: Project Management

7 Habits of Highly Motivated People

Motivation is a powerful skill.

It can lift you up from the worst of places, and inspire you to new heights.

After all, nothing is worse than slogging your way through your days, or working your way through a bunch of mundane tasks.

But, like I said, motivation is a skill.

You need to learn it.   For many people it does not come naturally.   And chances are, many of us have had bad models, bad advice, and worse, bad habits for a lifetime.

One of the most important insights I found was said by Stephen Covey long ago – satisfied needs don’t motivate.

It’s why we need to stay hungry.

Here’s how you stay hungry -- find a problem you hate, and focus on creating a solution you love.

But how do you light your fire from the inside out in a sustainable way?

That’s where the 7 habits of highly motivated people come in.

I wanted to put together a very simple set of habits and practices that actually work for building your motivational muscle and finding your inner mojo.

Here are the 7 habits of highly motivated people at a glance:

  1. Find Your WHY
  2. Change Your Beliefs About What’s Possible
  3. Change Your Beliefs That Limit You
  4. Spend More Time In Your Values
  5. Surround Yourself With Catalysts
  6. Build Better Feedback Loops
  7. “Pull” Yourself with Compelling Goals

There is a lot of science behind the habits.  If you’re that motivated, you can research it through bunches of books, bunches of sites, and brilliant TED talks.

But, I’d much rather that you spent the time simply adopting and applying the habits, so you can set your motivation on fire.

It’s time to do more of what you were born to do.

It’s time to live and breathe the things that you want to live and breathe.

It’s time to rise again from whatever ashes might have burned you down, and let your phoenix fly.

If you aren’t sure where to get started, first read 7 habits of highly motivated people, and then adopt habit #1:

Find Your WHY.

You’ll be glad you did.

I can see your pilot light is on already.

Categories: Architecture, Programming

Stuff The Internet Says On Scalability For March 27th, 2015

Hey, it's HighScalability time:

@scienceporn: That Hubble Telescope picture explained in depth. I have never had anything blow my mind so hard.

  • hundreds of billions: files in Dropbox; $2 billion: amount Facebook saved building their own servers, racks, cooling, storage, flat fabric, etc.
  • Quotable Quotes:
    • Buckminster Fuller: I was born in the era of the specialist. I set about to be purposely comprehensive. I made up my mind that you don't find out something just to entertain yourself. You find out things in order to be able to turn everything not just into a philosophical statement, but actual tools to reorganize the environment of man by which greater numbers of men can prosper. That's been my main undertaking.
    • @mjpt777: PCI-e based SSDs are getting so fast. Writing at 1GB/s for well less than $1000 is so cool.
    • @DrQz: All meaning has a pattern, but not all patterns have a meaning.
    • Stu: “Exactly once” has *always* meant “at least once but dupe-detected”. Mainly because we couldn’t convince customers to send idempotent and commutative state changes.
    • @solarce: When some companies have trouble scaling their database they use Support or Consultants. Apple buys a database vendor. 
    • @nehanarkhede: Looks like Netflix will soon surpass LinkedIn's Kafka deployment of 800B events a day. Impressive.
    • @ESPNFantasy: More than 11.57 million brackets entered. Just 14 got the entire Sweet 16 correct.
    • @BenedictEvans: A cool new messaging app getting to 1m users is the new normal. Keeping them, and getting to 100m, is the question.
    • @jbogard: tough building software systems these days when your only two choices are big monoliths and microservices
    • @nvidia: "It isn't about one GPU anymore, it's about 32 GPUs" Andrew Ng quotes Jen-Hsun Huang. GPU scaling is important #GTC15

  • FoundationDB, a High Scalability advertiser and article contributor, has been acquired. Apple scooped them up. Though saving between 5% to 10% less hardware than Cassandra seems unlikely. And immediately taking their software off GitHub is a concerning trend. It adds uncertainty to the entire product selection dance. Something to think about.

  • In the future when an AI tries to recreate a virtual you from your vast data footprint, the loss of FriendFeed will create a big hole in your virtual personality. I think FF catches a side of people that isn't made manifest in other mediums. Perhaps 50 years from now people will look back on our poor data hygiene with horror and disbelief. How barbaric they were in the past, people will say. 

  • When the nanobots turn the world to goo this 3D printer can recreate it again. New 3-D printer that grows objects from goo. Instead of a world marked by an endless battle of good vs evil we'll have a ceaseless cycle of destruction and rebirth through goo. That's unexpected. A modern mythology in the making.

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge (which means this post has many more items to read so please keep on reading)...

Categories: Architecture

Quote of the Day

Herding Cats - Glen Alleman - Fri, 03/27/2015 - 15:38

I have a card on my desk with a cute quote

May the Sun Always Shine Down on You a constant reminder that there is a giant nuclear-powered fireball in the sky just barely holding it together.

Simply said It's Not About You

Categories: Project Management

Neo4j: Generating real time recommendations with Cypher

Mark Needham - Fri, 03/27/2015 - 07:59

One of the most common uses of Neo4j is for building real time recommendation engines and a common theme is that they make use of lots of different bits of data to come up with an interesting recommendation.

For example in this video Amanda shows how dating websites build real time recommendation engines by starting with social connections and then introducing passions, location and a few other things.

Graph Aware have a neat framework that helps you to build your own recommendation engine using Java and I was curious what a Cypher version would look like.

This is the sample graph:

    (m:Person:Male {name:'Michal', age:30}),
    (d:Person:Female {name:'Daniela', age:20}),
    (v:Person:Male {name:'Vince', age:40}),
    (a:Person:Male {name:'Adam', age:30}),
    (l:Person:Female {name:'Luanne', age:25}),
    (c:Person:Male {name:'Christophe', age:60}),
    (lon:City {name:'London'}),
    (mum:City {name:'Mumbai'}),

We want to recommend some potential friends to ‘Adam’ so the first layer of our query is to find his friends of friends as there are bound to be some potential friends amongst them:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
RETURN me, potentialFriend, COUNT(*) AS friendsInCommon
==> +--------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | friendsInCommon |
==> +--------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 1               |
==> +--------------------------------------------------------------------------------------+
==> 3 rows

This query gives us back a list of potential friends and how many friends we have in common.
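
To make the friends-of-friends step concrete outside of Cypher, here is a rough Python sketch of the same counting idea over an in-memory adjacency map. The `friend_edges` list is hypothetical, since the relationships in the sample graph are not shown above, and the helper names are made up for illustration.

```python
from collections import Counter

# Hypothetical undirected FRIEND_OF edges; the relationships in the
# article's sample graph are not shown, so these are assumptions.
friend_edges = [
    ("Adam", "Michal"),
    ("Michal", "Vince"),
    ("Michal", "Daniela"),
    ("Michal", "Luanne"),
]


def build_adjacency(edges):
    """Map each person to the set of their direct friends."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def friends_of_friends(adjacency, me):
    """Count, per candidate, how many of my friends they are friends with."""
    counts = Counter()
    for friend in adjacency.get(me, ()):
        for candidate in adjacency.get(friend, ()):
            if candidate != me:
                counts[candidate] += 1
    return counts


adjacency = build_adjacency(friend_edges)
print(friends_of_friends(adjacency, "Adam"))
```

Like the query at this stage, the sketch does not yet exclude people who are already direct friends.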

Now that we’ve got some potential friends let’s start building a ranking for each of them. One indicator which could weigh in favour of a potential friend is if they live in the same location as us so let’s add that to our query:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
RETURN me,
       potentialFriend,
       SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation
==> +-----------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation |
==> +-----------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            |
==> +-----------------------------------------------------------------------------------+
==> 3 rows

Next we’ll check whether Adam’s potential friends have the same gender as him by comparing the labels each node has. We’ve got ‘Male’ and ‘Female’ labels which indicate gender.

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
RETURN me,
       potentialFriend,
       SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
       LABELS(me) = LABELS(potentialFriend) AS gender
==> +--------------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation | gender |
==> +--------------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            | true   |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            | false  |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            | false  |
==> +--------------------------------------------------------------------------------------------+
==> 3 rows

Next up let’s calculate the age difference between Adam and his potential friends:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
RETURN me,
       potentialFriend,
       SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
       abs(me.age - potentialFriend.age) AS ageDifference,
       LABELS(me) = LABELS(potentialFriend) AS gender,
       friendsInCommon
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation | ageDifference | gender | friendsInCommon |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            | 10.0          | true   | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            | 10.0          | false  | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            | 5.0           | false  | 1               |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> 3 rows

Now let’s do some filtering to get rid of people that Adam is already friends with – there wouldn’t be much point in recommending those people!

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
WITH me,
     potentialFriend,
     SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
     abs( me.age - potentialFriend.age) AS ageDifference,
     LABELS(me) = LABELS(potentialFriend) AS gender,
     friendsInCommon
WHERE NOT (me)-[:FRIEND_OF]-(potentialFriend)
RETURN me,
       potentialFriend,
       SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
       abs( me.age - potentialFriend.age) AS ageDifference,
       LABELS(me) = LABELS(potentialFriend) AS gender,
       friendsInCommon
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation | ageDifference | gender | friendsInCommon |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            | 10.0          | true   | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            | 10.0          | false  | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            | 5.0           | false  | 1               |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> 3 rows

In this case we haven’t actually filtered anyone out but for some of the other people we would see a reduction in the number of potential friends.

Our final step is to come up with a score for each of the features that we’ve identified as being important for making a friend suggestion.

We’ll assign a score of 10 if the people live in the same location or have the same gender as Adam and 0 if not. For ageDifference and friendsInCommon we’ll apply a log curve so that those values don’t have a disproportionate effect on our final score. We’ll use the formula defined in the ParetoScoreTransformer to do this:

    public <OUT> float transform(OUT item, float score) {
        if (score < minimumThreshold) {
            return 0;
        }

        double alpha = Math.log((double) 5) / eightyPercentLevel;
        double exp = Math.exp(-alpha * score);
        return new Double(maxScore * (1 - exp)).floatValue();
    }
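
To get a feel for what that curve does, here is a rough Python port of the formula above. It is only a sketch: the function name `pareto_score` and the `minimum_threshold` default of 0 are my own choices for illustration, not part of the library.

```python
import math


def pareto_score(score, max_score, eighty_percent_level, minimum_threshold=0.0):
    """Rough port of the transform above (names and defaults are assumptions)."""
    # Anything below the threshold contributes nothing.
    if score < minimum_threshold:
        return 0.0
    # alpha is chosen so that exp(-alpha * eighty_percent_level) == 1/5.
    alpha = math.log(5) / eighty_percent_level
    return max_score * (1 - math.exp(-alpha * score))


# A raw value equal to eighty_percent_level maps to roughly 80% of max_score.
print(pareto_score(10, max_score=100, eighty_percent_level=10))
# One friend in common with the same settings scores far lower.
print(pareto_score(1, max_score=100, eighty_percent_level=10))
```

The curve rises quickly for small values and flattens as it approaches max_score, which is why a very large number of friends in common can’t swamp the other features.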

And now for our completed recommendation query:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
WITH me,
     potentialFriend,
     SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
     abs( me.age - potentialFriend.age) AS ageDifference,
     LABELS(me) = LABELS(potentialFriend) AS gender,
     friendsInCommon
WHERE NOT (me)-[:FRIEND_OF]-(potentialFriend)
WITH potentialFriend,
       // 100 -> maxScore, 10 -> eightyPercentLevel, friendsInCommon -> score (from the formula above)
       100 * (1 - exp((-1.0 * (log(5.0) / 10)) * friendsInCommon)) AS friendsInCommon,
       sameLocation * 10 AS sameLocation,
       // 10 -> maxScore, 20 -> eightyPercentLevel; negated so a bigger age gap lowers the score
       -1 * (10 * (1 - exp((-1.0 * (log(5.0) / 20)) * ageDifference))) AS ageDifference,
       CASE WHEN gender THEN 10 ELSE 0 END as sameGender
RETURN potentialFriend,
      {friendsInCommon: friendsInCommon,
       sameLocation: sameLocation,
       ageDifference: ageDifference,
       sameGender: sameGender} AS parts,
     friendsInCommon + sameLocation + ageDifference + sameGender AS score
ORDER BY score DESC
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | potentialFriend                   | parts                                                                                                           | score             |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | Node[1006]{name:"Vince",age:40}   | {friendsInCommon -> 14.86600774792154, sameLocation -> 0, ageDifference -> -5.52786404500042, sameGender -> 10} | 19.33814370292112 |
==> | Node[1008]{name:"Luanne",age:25}  | {friendsInCommon -> 14.86600774792154, sameLocation -> 0, ageDifference -> -3.312596950235779, sameGender -> 0} | 11.55341079768576 |
==> | Node[1005]{name:"Daniela",age:20} | {friendsInCommon -> 14.86600774792154, sameLocation -> 0, ageDifference -> -5.52786404500042, sameGender -> 0}  | 9.33814370292112  |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
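
As a sanity check, the scores in that table can be reproduced outside the database. The sketch below is just my own arithmetic check in Python: the helper names `log_curve` and `friend_score` are made up for illustration, and the feature values are copied from the earlier query output.

```python
import math


def log_curve(value, max_score, eighty_percent_level):
    # Same expression as in the query: max * (1 - exp(-(ln 5 / level) * value))
    return max_score * (1 - math.exp(-(math.log(5.0) / eighty_percent_level) * value))


def friend_score(friends_in_common, same_location, age_difference, same_gender):
    """Return the individual parts and their sum (hypothetical helper)."""
    parts = {
        "friendsInCommon": log_curve(friends_in_common, 100, 10),
        "sameLocation": same_location * 10,
        # Negated: a larger age gap should pull the score down.
        "ageDifference": -1 * log_curve(age_difference, 10, 20),
        "sameGender": 10 if same_gender else 0,
    }
    return parts, sum(parts.values())


# (friendsInCommon, sameLocation, ageDifference, gender) from the output above
candidates = {
    "Vince": (1, 0, 10.0, True),
    "Luanne": (1, 0, 5.0, False),
    "Daniela": (1, 0, 10.0, False),
}

ranked = sorted(
    ((name, friend_score(*features)[1]) for name, features in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(name, round(score, 4))
```

This should give the same ordering as the query (Vince, then Luanne, then Daniela), with the same scores up to floating-point rounding.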

The final query isn’t too bad – the only really complex bit is the log curve calculation. This is where user-defined functions will come into their own in the future.

The nice thing about this approach is that we don’t have to step outside of Cypher, so if you’re not comfortable with Java you can still do real-time recommendations! On the other hand, the different parts of the recommendation engine all get mixed up, so it’s not as easy to see the whole pipeline as it is when you use the GraphAware framework.

The next step is to apply this to the Twitter graph and come up with follower recommendations on there.

Categories: Programming

Personas: Nuts and Bolts Without Navel Gazing

Two completely different personas using the same system simultaneously.

Personas are often used in software development to generate and group requirements. For example, one of the classic constructions of user stories is: <Persona> <Goal> <Benefit>. The use of personas has become nearly ubiquitous. A persona represents one of the archetypical users that interact with the system or product. Every system or product will have at least one archetypical user that can be identified. The concept of personas was originally developed by Alan Cooper in his book The Inmates Are Running the Asylum. Personas represent different behaviors based on research into users’ jobs, lifestyles, spare time activities, attitudes, motivations, behaviors and social roles. Given how widely personas are used, I am often asked where a team should get a set of personas (as if they were a deck of playing cards) and, once a team gathers the data, how that data should be captured.

In most internal development, enhancement and maintenance projects, personas are developed using some form of brainstorming process. Brainstorming with affinity diagramming is a great tool to develop personas. The Software Process and Measurement Blog has described the process of affinity diagramming. Two things are critical. First, the people involved in generating personas matter. The team you use to generate personas needs to have a broad experiential base. It needs to be cross-functional and diverse. The team needs to be able to take a systems view of who uses the system or product, otherwise the list of personas and their definitions will be incomplete. Second, when generating ideas in the brainstorming session someone needs to act as a facilitator and provide seed questions to the team. The seed questions will help the team to uncover groups of users and potential users, their needs, behaviors, attitudes and both work and non-work activities. In the grouping portion of the affinity diagramming process, the team should think in terms of personas when grouping the entries and then, when naming the groups, give each group a persona name. A further twist on the tried and true affinity session mechanism that I have recently found to be effective is to have the team break up into smaller sub-teams after the grouping and naming exercise to review what was done and re-brainstorm entries that could fill the gaps. I generally time box this second step to fifteen minutes, although time-boxing may constrain innovation. When the sub-teams return, integrate the new entries into the affinity diagram. The named and grouped entries from the affinity diagram will form the basis for the documentation needed to capture the personas.

Documenting personas might be the most contentious part of the process of establishing and using personas. Documentation smacks of overhead to many in the development community. I have found that if you don’t take the time to document a persona, it is far less useful. Documenting personas does a number of things. First, the act of documentation allows the team to flesh out their ideas and drive out gaps and inconsistencies. Second, documentation helps the team to establish collective memory. Third, fleshing out the nuances and establishing institutional memory makes it more likely that the team will refer to the persona after development because the investment in time and effort generates a perception of value. When documenting personas, ensure that you keep the documentation to the absolute minimum. I use a single sheet of flip chart paper per persona; not only does the single sheet feel lean, it also allows teams to post the personas around the room. Templates are good for maintaining repeatability.

In Agile Story Maps: Personas we discussed and referenced a template defined by Roman Pichler. There are a wide variety of templates for capturing information about a persona. In general they all include the following information:

  1. Persona Name
  2. Picture: a picture gives the persona visual substance and personality.
  3. Job: What does a person in this archetypical group do for a living (as it relates to the product or system).
  4. Goal: A definition of why the persona would want to use the product or system. Support for the goal can be found in the personality and lifestyle (often written as backstory) of the persona.
  5. Personality: How does the group that the persona represents think, act, or react. This attribute includes motivation, attitudes, behaviors and social roles.
  6. Lifestyle: How does the group that the persona represents actually act? What does the persona do both at work and in their spare time?

Every template that I have seen (or created) is nuanced based on the needs of the team and the project. For example, the personas generated to guide developing requirements for an enhancement to a data entry application would be different from the personas needed to plan a major Agile conference. In the latter case more emphasis might be needed around motivation and lifestyle which would influence the speakers that might be accepted, social events planned and when the speakers and events might be slotted in the schedule. Teams use personas to help them think conceptually about the problem they are trying to solve without getting bogged down by having to think about and address each individual that uses the system or product as a special, individual case.

Personas are a tool that helps teams to understand a class of users by creating an archetypical user. Personas need to be identifiable and have a backstory that includes needs and goals. The use of fictional people allows the team to dissociate themselves from the large number of individuals in the user base so they don’t get trapped in paralyzing detail. Personas need to be written down or you will not remember the nuances of personalities and lifestyles that make any two personas different.

Categories: Process Management