Skip to content

Software Development Blogs: Programming, Software Testing, Agile Project Management

Methods & Tools

Subscribe to Methods & Tools
if you are not afraid to read more than one page to be a smarter software developer, software tester or project manager!

Feed aggregator

Componentize the web, with Polycasts!

Google Code Blog - Fri, 10/03/2014 - 18:58
Today at Google, we’re excited to announce the launch of Polycasts, a new video series to get developers up and running with Polymer and Web Components.

Web Components usher in a new era of web development, allowing developers to create encapsulated, interoperable elements that extend HTML itself. Built atop these new standards, Polymer makes it easier and faster to build Web Components, while also adding polyfill support so they work across all modern browsers.

Because Polymer and Web Components are such big changes for the platform, there’s a lot to learn, and it can be easy to get lost in the complexity. For that reason, we created Polycasts.

Polycasts are designed to be bite sized, and to teach one concept at a time. Along the way we plan to highlight best practices for not only working with Polymer, but also using the DevTools to make sure your code is performant.

We’ll be releasing new videos often over the coming weeks, initially focusing on core elements and layout. These episodes will also be embedded throughout the Polymer site, helping to augment the existing documentation. Because there’s so much to cover in the Polymer universe, we want to hear from you! What would you like to see? Feel free to shoot a tweet to @rob_dodson, if you have an idea for a show, and be sure to subscribe to our YouTube channel so you’re notified when new episodes are released.

Posted by Rob Dodson, Developer Advocate
Categories: Programming

Integrating Geb with FitNesse using the Groovy ConfigSlurper

Xebia Blog - Fri, 10/03/2014 - 18:01

We've been playing around with Geb for a while now and writing tests using WebDriver and Groovy has been a delight! Geb integrates well with JUnit, TestNG, Spock, and Cucumber. All there is left to do is integrating it with FitNesse ... or not :-).

Setup Gradle and Dependencies

First we start with grabbing the gradle fitnesse classpath builder from Arjan Molenaar.
Add the following dependencies to the gradle build file:

compile 'org.codehaus.groovy:groovy-all:2.3.7'
compile 'org.gebish:geb-core:0.9.3'
compile 'org.seleniumhq.selenium:selenium-java:2.43.1
Configure different drivers with the ConfigSlurper

Geb provides a configuration mechanism using the Groovy ConfigSlurper. It's perfect for environment sensitive configuration. Geb uses the geb.env system property to determine the environment to use. So we use the ConfigSlurper to configure different drivers.

import org.openqa.selenium.firefox.FirefoxDriver

driver = { new FirefoxDriver() }

environments {
  chrome {
    driver = { new ChromeDriver() }
FitNesse using the ConfigSlurper

We need to tweak the gradle build script to let FitNesse play nice with the ConfigSlurper. So we pass the geb.env system property as a JVM argument. Look for the gradle task "wiki" in the gradle build script and add the following lines.

def gebEnv = (System.getProperty("geb.env")) ? (System.getProperty("geb.env")) : "firefox"
jvmArgs "-Dgeb.env=${gebEnv}"

Since FitNesse spins up a separate 'service' process when you execute a test, we need to pass the geb.env system property into the COMMAND_PATTERN of FitNesse. That service needs the geb.env system property to let Geb know which environment to use. Put the following lines in the FitNesse page.

!define COMMAND_PATTERN {java -Dgeb.env=${geb.env} -cp %p %m}

Now you can control the Geb environment by specifying it on the following command line.

gradle wiki -Dgeb.env=chrome

The gradle build script will pass the geb.env system property as JVM argument when FitNesse starts up. And the COMMAND_PATTERN will pass it to the test runner service.

Want to see it in action? Sources can be found here.

Preview: DataGrid for Xamarin.Forms

Eric.Weblog() - Eric Sink - Fri, 10/03/2014 - 17:00

What is it?

It's a Xamarin.Forms grid control for displaying data in rows and columns.

Where's the code?

Is this open source?

Yes. Apache License v2.

Why are you writing a grid?

Because I see an unmet need. Xamarin.Forms needs a good way to display row/column data. And it needs to be capable of handling lots (millions) of cells. And it needs to be really, really fast.

I'm one of the founders of Zumero. We're all about mobile apps for businesses dealing with data. Many of our customers are using Xamarin, and we want to be able to recommend Xamarin.Forms to them. A DataGrid is one of the pieces we need.

What are the features?
  • Scrolling, both vertical and horizontal
  • Either scroll range can be fixed, infinite, or wraparound.
  • Optional headers, top, left, right, bottom.
  • Ample flexibility in connecting to different data sources
  • Column widths can be fixed width or variable. Same for row heights.
Is this ready for use?

No, this code is still pretty rough.

Is this ready to play with and complain about?

Yes, hence its presence on GitHub. :-)

Open dg.sln in Xamarin Studio and it should build. There's a demo app (Android and iOS) you can use to try it out. The WP8 code isn't there yet, but it'll be moving in soon.

Is there a NuGet package?

Not yet.

Is the API frozen yet?

No. In fact, I'm still considering some API changes that could be described as major.

What platforms does it support?

Android and iOS are currently in decent shape. Windows Phone is in progress. (The header row was bashful and refused to cooperate for the WP8 screenshot.)

What will the API be like?

I don't know yet. In fact, I'm tempted to quibble when you say "the API", because you're speaking of it in the singular, and I think I will end up with more than one. :-)

Earlier, I described this thing as "a grid control", but it would be more accurate right now to describe it as a framework for building grid controls.

I have implemented some sample grids, mostly just to demonstrate the framework's capabilities and to experiment with what kinds of user-facing APIs would be most friendly. Examples include:

  • A grid that gets its data from an IList, where the properties of objects of class T become columns.
  • A data connector that gets its data from ReactiveList (uses ReactiveUI).
  • A grid that gets its data from a sqlite3_stmt handle (uses my SQLitePCL.raw package).
  • A grid that just draws shapes.
  • A grid that draws nothing but a cell border, but the farther you scroll, the bigger the cells get.
  • A 2x2 grid that scrolls forever and just repeats its four cells over and over.
How is this different from the layouts built into Xamarin.Forms?

This control is not a "Layout", in the Xamarin.Forms sense. It is not a subclass of Xamarin.Forms.Layout. You can't add child views to it.

If you need something to help arrange the visual elements of your UI on the screen, DataGrid is not what you want. Just use one of the Layouts. That's what they're for.

But maybe you need to display a lot of data. Maybe you have 200,000 rows. Maybe you don't know how many rows you have and you won't know until you read the last one. Maybe you have lots of columns too, so you need the ability to scroll in both directions. Maybe you need one or more header rows at the top which sync-scroll horizontally but are frozen vertically. And so on.

Those kind of use cases are what DataGrid is aimed for.

What drawing API are you using?

Mostly I'm working with a hacked-up copy of Frank Krueger's CrossGraphics library, modified in a variety of ways.

The biggest piece of the code (in DataGrid.Core) actually doesn't care about the graphics API. That assembly contains generic classes which accept <TGraphics> as a type parameter. (As a proof of concept demo, I've got an iOS implementation built on CGContext which doesn't actually depend on Xamarin.Forms at all.)

So I can't add ANY child views to your DataGrid control?

Currently, no. I would like to add this capability in the future.

(At the moment, I'm pretty sure it is impossible to build a layout control for Xamarin.Forms unless you're a Xamarin employee. There seem to be a few important things that are not public.)

How fast is it?

Very. On my Nexus 7, a debug build of DataGrid can easily scroll a full screen of text cells at 60 frames/second. Performance on iOS is similar.

How much code is cross-platform?

Not counting the demo app or my copy of CrossGraphics, the following table shows lines of code in each combination of dependencies:

 PortableiOS-specificAndroid-specific <TGraphics>2,741141174 Xamarin.Forms6339281

Xamarin.Forms is [going to be] a wonderful foundation for cross-platform mobile apps.

Can I use this from Objective-C or Java?

No. It's all C#.

Why are you naming_things_with_underscores?

Sorry about that. It's a habit from my Unix days that I keep slipping back into. I'll clean up my mess soon.

What's up with IChanged? Why not IObservable<T>?

Er, yeah, remember above when I said I'm still considering some major changes? That's one them.

Does this in any way depend on your Zumero for SQL Server product?

No, DataGrid is a standalone open source library.

But it's rather likely that our commercial products will in the future depend on DataGrid.


Stuff The Internet Says On Scalability For October 3rd, 2014

Hey, it's HighScalability time:

Is the database landscape evolving or devolving?


  • 76 million: once more through the data breach; 2016: when a Zettabyte is transfered over the Internet in one year
  • Quotable Quotes:
    • @wattersjames: Words missing from the Oracle PaaS keynote: agile, continuous delivery, microservices, scalability, polyglot, open source, community #oow14
    • @samcharrington: At last count, there were over 1,000,000 million containers running in the wild.  @jejb_ #ccevent
    • @mappingbabel: Oracle's cloud has 30,000 computers. Google has about two million computers. Amazon over a million. Rackspace over 100,000.
    • Andrew Auernheimer: The world should have given the GNU project some money to hire developers and security auditors. Hell, it should have given Stallman a place to sleep that isn't a couch at a university. There is no f*cking justice in this world.
    • John Nagle: The right answer is to track wins and losses on delayed and non-delayed ACKs. Don't turn on ACK delay unless you're sending a lot of non-delayed ACKs closely followed by packets on which the ACK could have been piggybacked. Turn it off when a delayed ACK has to be sent. I should have pushed for this in the 1980s.
    • @neil_conway: The number of < 15 node Hadoop clusters is >> the number of > 15 node Hadoop clusters. Unfortunately not reflected in SW architecture.

  • In the meat world Google wants devices to talk to you. The Physical Web. This will be better than Apple's beacons because Apple is severely limiting the functionality of beacons by requiring IDs be baked into applications. It's a very static and controlled world. In other words, it's very Apple. By using URLs Google is supporting both the web and apps; and adding flexibility because a single app can dynamically and generically handle the interaction from any kind of device. In other words, it's very Google. Apple has the numbers though, with hundreds of millions of beacon enabled phones in customer hands. Since it's just another protocol over BLE it should work on Apple devices as well.

  • Did Netflix survive the great AWS rebootathon? The Chaos Monkey says yes, yes they did: Out of our 2700+ production Cassandra nodes, 218 were rebooted. 22 Cassandra nodes were on hardware that did not reboot successfully. This led to those Cassandra nodes not coming back online. Our automation detected the failed nodes and replaced them all, with minimal human intervention. Netflix experienced 0 downtime that weekend. 

  • Google Compute Engine is following Moore's Law by announcing a 10% discount. Bandwidth is still expensive because networks don't care about silly laws. And margin has to come from somewhere.

  • Software is eating...well you've heard it before. Mesosphere cofounder envisions future data center as ‘one big computer’: The data center of the future will be fully virtualized, with everything from power supplies to storage devices consolidated into a single pool and managed by software, according to an executive whose company intends to lead the way.

  • Companies, startups, hacker spaces, teams are all intentional communities. People choose to work together towards some end. A consistent group killer is that people can be really sh*tty to each other. There's a lot of work that has been done around how to make intentional communities work. Holacracy is just one option. Here's a really interesting interview with Diana Leafe Christian on what makes communities work. It requires creating Community Glue, Good Process and Communication Skill, Effective Project Management, and good Governance and Decision making. Which is why most communities fail. Did you know there's even something called Non-Defensive Communication? If followed the Internet would collapse.

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge (which means this post has many more items to read so please keep on reading)...

Categories: Architecture

Distributed Agile: Demonstrations



Demonstrations are an important tool for teams to gather feedback to shape the value they deliver.  Demonstrations provide a platform for the team to show the stories that have been completed so the stakeholders can interact with the solution.  The feedback a team receives not only ensures that the solution delivered meets the needs but also generates new insights and lets the team know they are on track.  Demonstrations should provide value to everyone involved. Given the breadth of participation in a demo the chance of a distributed meeting is even more likely.  Techniques that support distributed demonstrations include:

  1. More written documentation: Teams, especially long-established teams, often develop shorthand expressions that convey meaning fall short before a broader audience. Written communication can be more effective at conveying meaning where body language can’t be read and eye contact can’t be made. Publish an agenda to guide the meeting; this will help everyone stay on track or get back on track when the phone line drops. Capture comments and ideas on paper where everyone can see them.  If using flip charts, use webcams to share the written notes.  Some collaboration tools provide a notepad feature that stays resident on the screen that can be used to capture notes that can be referenced by all sites.
  2. Prepare and practice the demo. The risk that something will go wrong with the logistics of the meeting increase exponentially with the number of sites involved.  Have a plan for the demo and then practice the plan to reduce the risk that you have not forgotten something.  Practice will not eliminate all risk of an unforeseen problem, but it will help.
  3. Replicate the demo in multiple locations. In scenarios with multiple locations with large or important stakeholder populations, consider running separate demonstrations.  Separate demonstrations will lose some of the interaction between sites and add some overhead but will reduce the logistical complications.
  4. Record the demo. Some sites may not be able to participate in the demo live due to their time zones or other limitations. Recording the demo lets stakeholders that could not participate in the live demo hear and see what happened and provide feedback, albeit asynchronously.  Recording the demo will also give the team the ability to use the recording as documentation and reference material, which I strongly recommend.
  5. Check the network(s)! Bandwidth is generally not your friend. Make sure the network at each location can support the tools you are going to use (video, audio or other collaboration tools) and then have a fallback plan. Fallback plans should be as low tech as practical.  One team I observed actually had to fall back to scribes in two locations who kept notes on flip charts by mirroring each-other (cell phones, bluetooth headphones and whispering were employed) when the audio service they were using went down.

Demonstrations typically involve stakeholders, management and others.  The team needs feedback, but also needs to ensure a successful demo to maintain credibility within the organization.  In order to get the most effective feedback in a demo everyone needs to be able to hear, see and get involved.  Distributed demos need to focus on facilitating interaction more than in-person demos. Otherwise, distributed demos risk not being effective.

Categories: Process Management

Critical Thinking - The Missing Element in Project Management Processes

Herding Cats - Glen Alleman - Thu, 10/02/2014 - 21:56

The_Thinker_Musee_RodinComplex and unstable environments encountered in project work - especially software development project work - calls for critical thinking by all participants. Complexity comes in part from technical uncertainties, starting with requirements for software capabilities. If there is uncertainty in what capabilities are needed, the project is starting off on the path to failure on day one. Certainly the functional and operational requirements have emergent properties that create uncertainty. As do staffing, productivity and risks created by reducible and irreducible uncertainty.

To deal with these complexities, critical thinking is needed on behalf of the project leaders and project participants alike.

The first responsibility of the project staff as well as management is to think. To think what is being asked of them. To consider that they are being paid to produce value by someone other than themselves. To think about why they are there, what they are being asked to do, and how they can go about being stewards of the money provided by those paying for their work. To be true professionals applying their education, training, and experience through analysis and creative, informed thought to make daily decisions.

Failure to apply critical thinking creates a  disconnect between those providing the value and those paying for the value. This is best illustrated in the notion that business decisions can be made in the absence of knowing the cost and outcomes of those decisions. The first gap in critical thinking occurs when the decsion making process ignores the principles of MicroEconomics. This gap is future reinforced when the probabilistic nature of all project work is also ignored. 

The three core elements of all projects are the delivered capabilities, the cost of producing those capabilities, and the time frame over which those capabilities are delivered for the cost. Each of these acts as a random variable, interacting with the other two in statistically complex ways.

To manage in the presence of these random variables, means making decisions in the presence of uncertainties - and resulting risks - created by the randomness of the variables. These uncertainties are reducible - we can pay more to find out more information. Or they are irreducible, we can only manage in their presence with margin for cost, schedule, and technical performance.

To make decisions in the presence of this paradigm we need to Estimate. Making decisions in the absence of estimating first violates the principles of microeconomics, and secondly ignores the underlying statistical and probabilistic nature of all project work.

When we hear that decisions be made in the absence of estimates, ask how Microeconomics and statistical uncertainty are handled. 


Categories: Project Management

Promises in the Google APIs JavaScript Client Library

Google Code Blog - Thu, 10/02/2014 - 21:28
The JavaScript Client Library for Google APIs is now Promises/A+-conformant. Requests made using gapi.client.request, gapi.client.newBatch, and from generated API methods like are also promises. You can pass in response and error handlers through their then methods.

Requests can be made using the then syntax provided by Promises:
gapi.client.load(‘plus’, ‘v1’).then(function () {{query: ‘John’}).then(function(res) {
}, function(err) {
All fulfilled responses and rejected application errors passed to the handlers will have these fields:
result: *, // JSON-parsed body or boolean false if not JSON-parseable
body: string,
headers: (Object.),
status: (?number),
statusText: (?string)
The promises can also be chained, making your code more readable:{
playlistId: 'PLOU2XLYxmsIIwGK7v7jg3gQvIAWJzdat_',
part: 'snippet'
}).then(function(res) {
return {
return item.snippet.resourceId.videoId;
}).then(function(videoIds) {
id: videoIds.join(','),
part: 'snippet,contentDetails'
}).then(function(res) {
res.result.items.forEach(function(item) {
}, function(err) {
Using promises makes it easy to handle results of API requests and offer elegant error propagation.

To learn more about promises in the library and about converting from callbacks to promises, visit Using Promises and check out our latest API reference.

Posted by Jane Park, Software Engineer
Categories: Programming

A quick note on haproxy acl rules

Agile Testing - Grig Gheorghiu - Thu, 10/02/2014 - 20:29
I blogged in the past about haproxy acl rules we used for geolocation detection purposes. In that post, I referenced acl conditions that were met when traffic was coming from a non-US IP address. In that case, we were using a different haproxy backend. We had an issue recently when trying to introduce yet another backend for a given country. We added these acl conditions:

       acl acl_geoloc_akamai_true_client_ip_some_country req.hdr(X-Country-Akamai) -m str -i SOME_COUNTRY_CODE
       acl acl_geoloc_src_some_country req.hdr(X-Country-Src) -m str -i SOME_COUNTRY_CODE

We also added this use_backend rule:

      use_backend www_some_country-backend if acl_akamai_true_client_ip_header_exists acl_geoloc_akamai_true_client_ip_some_country or acl_geoloc_src_some_country

However, the backend www_some_country-backend was never chosen by haproxy, even though we could see traffic coming from IP address from SOME_COUNTRY_CODE.

The cause of this issue was that another use_backend rule (for non-US traffic) was firing before the new rule we added. I believe this is because this rule is more generic:

       use_backend www_row-backend if acl_akamai_true_client_ip_header_exists !acl_geoloc_akamai_true_client_ip_us or !acl_geoloc_src_us
The solution was to modify the use_backend rule for non-US traffic to fire only when the SOME_COUNTRY acl condition isn't met:
       use_backend www_row-backend if acl_akamai_true_client_ip_header_exists !acl_geoloc_akamai_true_client_ip_us !acl_geoloc_akamai_true_client_ip_some_country or !acl_geoloc_src_us !acl_geoloc_src_some_country
Maybe another solution would be to change the order of acls and use_backend rules. I couldn't find any good documentation on how this order affects what gets triggered when.

Validation and Verification- An Engineer’s Perspective

Software Requirements Blog - - Thu, 10/02/2014 - 17:00
I recently began teaching our training courses here at Seilevel and one of the topics we cover is validation and verification. In the training, we ask the students to brainstorm what validation and verification are and how they apply to software requirements. Surprisingly, to me at least, there are many people who think these are […]
Categories: Requirements

Quote of the Day

Herding Cats - Glen Alleman - Thu, 10/02/2014 - 15:33

We shall not cease from exploration We shall not cease from exploration and the end of all our exploring will be to arrive where we started and be to arrive where we started and know the place for the first time. -T.S. Eliot

Quoting 29 year old reports or referencing 30 year old books it not likely the best way to reveal the problems of the day. These problems have existed of 30 year, but new more effective, and much more transparent solutions are available today.

Categories: Project Management

Management Feedback: Are You Abrasive or Assertive?

Let me guess. If you are a successful woman, in the past, you’ve been told you’re too abrasive, too direct, maybe even too assertive. Too much. See The One Word Men Never See In Their Performance Reviews.

Here’s the problem. You might be.

I was.

But never in the “examples” my bosses provided. The “examples” they provided were the ones when I advocated for my staff. The ones where I made my managers uncomfortable. The examples, where, if I had different anatomy, they would have relaxed afterwards, and we’d gone out for a beer.

But we didn’t.

Because my bosses could never get over the fact that I was a woman, and “women didn’t act this way.” Now, this was more than 20 years ago. (I’ve been a consultant for 20 years.) But, based on the Fast Company article, it doesn’t seem like enough culture has changed.

Middle and senior managers, here’s the deal: At work, you want your managers to advocate for their people. You want this. This is a form of problem-solving. Your first-line and middle managers see a problem. If they don’t have the entire context, explain the context to them. Now, does that change anything?

If it does, you, senior or middle manager, have been derelict in your management responsibility. Your first-line manager might have been able to solve the problem with his/her staff without being abrasive if you had explained the context earlier. Maybe you need to have more one-on-ones. Maybe all your first-line managers could have solved this problem in your staff meeting, as a cross-functional team. Are you canceling one-on-ones or canceling problem-solving meetings? Don’t do that.

Do you have a first-line manager who doesn’t want to be a manager? Maybe you fell prey to the myth of promoting the best technical person into a management position. You are not alone. Find someone who wants to work with people, and ask that person to try  management.

We all need feedback. Managers need feedback, too. Because managers leverage the work of others, they need feedback even more than technical people.

If you think a manager on your management team is “too” abrasive or assertive,” ask yourself, is this person female? Then ask yourself, “Would I say the same thing if this person looked as if she could be a large sports figure, male attributes and all?”

You see, the fact that I have the physical attributes of a short, kind-of cute woman has not bothered me one bit. I feel seven feet tall. I often act like it. I am not afraid to take chances or calculated risks. I am not afraid to talk to anyone in the organization about anything. How else would I accomplish the work that needs to be done? (You may have noticed that I write tall, too.)

Abrasive and assertive are code words for fearless problem solvers. Don’t penalize the women—or the men—in your organization who are fearless problem solvers.

Categories: Project Management

C# 6 Cuts

Phil Trelford's Array - Thu, 10/02/2014 - 08:21

In a recent thread on CodePlex Mads Torgeson, C# Language PM at Microsoft, announced 2 of the key features planned for C# 6 release have now been cut:

  • Primary constructors
  • Declaration expressions

According to Mads:

They are both characterized by having large amounts of downstream work still remaining.

primary constructors could grow up to become a full-blown record feature

Reading between the lines Mads seems to be saying the features weren’t finished and even if they were they seemed to conflict with a potential record feature currently being prototyped.

The full thread is here: Changes to the language feature set

Language Design

I think there’s two distinct options when adding new features to an existing language with a large user base:

  • upfront design
  • implement incrementally

Upfront design should mean that all cases are met but comes at a time-to-market cost, where as an incremental implementation means quick releases with the potential risk of either sub-optimal syntax or backward compatibility issues when applying more features.

It appeared at the high level that the C# team’s had initially opted for the incremental option. The feature cuts however suggest to me that there may have been a change in direction towards more upfront design.

Primary Constructors

The primary constructors feature was intended to reduce the verbosity of C#’s class declaration syntax. The new feature appeared to be inspired by F# ’s class syntax.

If you like the idea of a lighter syntax for class declarations then you may just want to try F# which already has a well thought out mature implementation, i.e.

type Person(name:string, age:int) =
    member this.Name = name
    member this.Age = age

Or for simple types use the even simpler record type:

type Person = { Name:string, Age:int }

Note: on top of lighter class syntax F# also packs a whole raft of cool features not available in C#, including powerful pattern matching and data access via Type Providers.

Declaration Expressions

Declaration expressions was again designed to reduce verbosity in C# providing a lighter syntax for handling out parameters. Out parameters are used in C# to allow a method to return multiple values:

int result;
bool success = Int32.TryParse("123", out result);

Again handling multiple return values is handled elegantly in F# which employs first-class tuples, i.e.

let success, value = Int32.TryParse("123")

As shown above, C# out parameters can be simply captured in F# as if the method were returning multiple values as a tuple.


The first time I saw Mads publicly announce the now cut primary constructor syntax and declaration expressions was nearly a year ago at NDC London. At the time the features were announced with a number of disclaimers that they may not actually ship. I think in future it may be better for everyone to take those disclaimers with more than just a pinch of salt.

Categories: Programming

Distributed Agile: Backlog Grooming Meetings

Understanding how a story or a group of stories fits into the big picture is sometime like reading a single line of Shakespeare and trying to develop the plot for the entire play.

Understanding how a story or a group of stories fits into the big picture is sometime like reading a single line of Shakespeare and trying to develop the plot for the entire play.

There are two reasons to hold backlog grooming meetings. The first is to make sprint planning more efficient and effective. The second reason is make sure you understand your backlog. When teams don’t spend the time needed to groom the backlog, planning meetings can be very tense and extend for hours . . . and hours. Backlog grooming sessions can be whole team activities (rare) or sub-team activities (more common). The most common technique used to generate a sub-team for grooming is the Three Amigos (or some variant). The tallest hurdle all distributed teams face is ensuring effective communications, followed quickly by staying focused on the task at hand. Many of the same techniques we discussed for sprint planning in distributed teams will be effective, however backlog grooming has a few unique twists.

  1. Everybody needs to see the story at all times. Everyone involved must be able to see the story being groomed, preferably as it is being edited. Reading a story to someone at the other end of a phone and then amending the reading as you wordsmith the statement is difficult for many people to conceptualize. Most webinar tools now have whiteboard options. Cut and paste the story and acceptance criteria into the whiteboard feature so that everyone can see the words. One team I recently worked with used messaging software to approximate the process (it worked fairly well). Tools like webcams and telepresence can be used, however make sure the story and the acceptance criteria are easily readable by all parties. When a team member can’t hear or see well enough to stay involved, they will lose focus and probably start doing email.
  2. The right people and locations need to be involved. There are many shades of distributed teams, ranging from two locations to completely dispersed (everyone in different locations). The goal of grooming is to make sure the backlog items that may be used in the next sprint are understood, well-formed and have acceptance criteria. Typically, grooming is most effective when the three major team constituencies are to be involved: the business, the developers and the testers. When a team is distributed, locations can become constituencies that need to be involved to ensure that the grooming session attains the goal of making sure the stories are understood. This is an argument for whole team grooming sessions so that no location feels left out.
  3. Use story maps to link stories the big picture. Understanding how a story or a group of stories fits into the big picture is sometime like reading a single line of Shakespeare and trying to develop the plot for the entire play. When a team is distributed, it becomes more difficult for members to have a side conversation to get things back on track or to develop ways to stay aligned to project’s big picture without a more formal reference. A story map provides a frame of reference so that the team members involved in the grooming session can see how the stories fit into overall project. The use of a story map in the grooming process makes it easier to identify or develop a theme for the next sprint (a theme provides focus and direction to the team).

Backlog grooming is a process to make sure the stories that might be used in the next sprint are understood, well-formed and have acceptance criteria. When backlogs are not well groomed teams tend to spend a lot time planning and re-planning rather than delivering value. This is true whether a team is distributed or not. The problem is that when a team is distributed any hiccup takes more effort to fix, making grooming even more important.

Categories: Process Management

Modularity and testability

Coding the Architecture - Simon Brown - Wed, 10/01/2014 - 22:21

I've been writing blog posts covering a number of topics over the past few months; from the conflict between software architecture and code and architecturally-evident coding styles through to representing a software architecture model as code and how microservice architectures can easily turn into distributed big balls of mud. The common theme running throughout all of them is structure, and this in turn has a relationship with testability.

The TL;DR version of this post is: think about modularity, think about how you structure your code, think about the options you have for testing your code and stop making everything public.

1. The conflict between software architecture and code

I've recently been talking a lot about the disconnect between software architecture and code. George Fairbanks calls this the "model-code gap". It basically says that the abstractions we consider at the architecture level (components, services, modules, layers, etc) are often not explicitly reflected in the code. A major cause is that we don't have those concepts in OO programming languages such as Java, C#, etc. You can't do public component X in Java, for example.

2. The "unit testing is wasteful" thing

Hopefully, we've all see the "unit testing is wasteful" thing, and all of the follow-up discussion. The unfortunate thing about much of the discussion is that "unit testing" has been used interchangeably with "TDD". In my mind, the debate is about unit testing rather than TDD as a practice. I'm not a TDDer, but I do write automated tests. I mostly write tests afterwards. But sometimes I write them beforehand, particularly if I want to test-drive my implementation of something before integrating it. If TDD works for you, that's great. If not, don't worry about it. Just make sure that you *do* write some tests. :-)

There are, of course, a number of sides to the debate, but in TDD is dead. Long live testing. (ignore the title), DHH makes some good points about the numbers and types of tests that a system should have. To quote (strikethrough mine):

I think that's the direction we're heading. Less emphasis on unit tests, because we're no longer doing test-first as a design practice, and more emphasis on, yes, slow, system tests. (Which btw do not need to be so slow any more, thanks to advances in parallelization and cloud runner infrastructure).

The type of software system you're building will also have an impact on the number and types of tests. I once worked on a system where we had a huge number of integration tests, but very few unit tests, primarily because the system actually did very little aside from get data from a Microsoft Dynamics CRM system (via web services) and display it on some web pages. I've also worked on systems that were completely the opposite, with lots of complex business logic.

There's another implicit assumption in all of this ... what's the "unit" in "unit testing"? For many it's an isolated class, but for others the word "unit" can be used to represent anything from a single class through to an entire sub-system.

3. The microservices hype

Microservices is the new, shiny kid in town. There *are* many genuine benefits from adopting this style of architecture, but I do worry that we're simply going to end up building the next wave of distributed big balls of mud if we're not careful. Technologies like Spring Boot make creating and deploying microservices relatively straightforward, but the design thinking behind partitioning a software system into services is still as hard as it's ever been. This is why I've been using this slide in my recent talks.

If you can't build a structured monolith, what makes you think microservices is the answer!?


Uncle Bob Martin posted Microservices and Jars last month, which touches upon the topic of building monolithic applications that do have a clean internal structure, by using the concept of separately deployable units (e.g. JARs, DLLs, etc). Although he doesn't talk about the mechanisms needed to make this happen (e.g. plugin architectures, Java classloaders, etc), it's all achievable. I rarely see teams doing this though.

Structuring our code for modularity at the macro level, even in monolithic systems, provides a number of benefits, but it's a simple way to reduce the model-code gap. In other words, we structure our code to reflect the structural building blocks (e.g. components, services, modules) that we define at the architecture level. If there are "components" on the architecture diagrams, I want to see "components" in the code. This alignment of architecture and code has positive implications for explaining, understanding, maintaining, adapting and working with the system.

It's also about avoiding big balls of mud, and I want to do this by enforcing some useful boundaries in order to slice up my thousands of lines of code/classes into manageable chunks. Uncle Bob suggests that you can use JARs to do this. There are other modularity mechanisms available in Java too; including SPI, CDI and OSGi. But you don't even need a plugin architecture to build a structured monolith. Simply using the scoping modifiers built in to Java is sufficient to represent the concept of a lightweight in-process component/module.

Stop making everything public

We need to resist the temptation to make everything public though, because this is often why codebases turn into a sprawling mass of interconnected objects. I do wonder whether the keystrokes used to write public class are ingrained into our muscle memory as developers. As I said during my closing session at DevDay in Krakow last week, we should make a donation to charity every time we type public class without thinking about whether that class really needs to be public.

Donate to charity every time you type public class without thinking

A simple way to create a lightweight component/module in Java is to create a public interface and keep all of the implementation (one or more classes) package protected, ensuring there is only one "component" per package. Here's an example of a such a component, which also happens to be a Spring Bean. This isn't a silver bullet and there are trade-offs that I have consciously made (e.g. shared domain classes and utility code), but it does at least illustrate that all code doesn't need to be public. Proponents of DDD and ports & adapters may disagree with the naming I've used but, that aside, I do like the stronger sense of modularity that such an approach provides.


And now you have some options for writing automated tests. In this particular example, I've chosen to write automated tests that treat the component as a single thing; going through the component API to the database and back again. You can still do class-level testing too (inside the package), but only if it makes sense and provides value. You can also do TDD; both at the component API and the component implementation level. Treating your components/modules as black boxes results in a slightly different testing pyramid, in that it changes the balance of class and component tests.

Rethinking the testing pyramid?

A microservice architecture will likely push you down this route too, with a balanced mix of low-level class and higher-level service tests. Of course there is no "typical" shape for the testing pyramid; the type of system you're building will determine what it looks like. There are many options for building testable software, but neither unit testing or TDD are dead.

In summary, I'm looking for ways in which it we can structure our code for modularity at the macro-level, to avoid the big ball of mud and to shrink the model-code gap. I also want to be able to automatically draw some useful architecture diagrams based upon the code. We shouldn't blindly be making everything public and writing automated tests at the class level. After all, there are a number of different approaches that we can take for all of this, and the modularity you choose has an implication on the number and types of tests that you write. As I said at the start; think about modularity, think about how you structure your code, think about the options you have for testing your code and stop making everything public. Designing software requires conscious effort. Let's not stop thinking.

Categories: Architecture

No Estimates Needs to Come In Contact With Those Providing the Money

Herding Cats - Glen Alleman - Wed, 10/01/2014 - 17:36

For all the words written and posted around estimating or not estimating - and I've contributed my share - the basis of estimates has yet to be addressed outside of a few people. @PeterKretzman @aritanninen @kalapaistos@fscavo come to mind.

The gap here is simple. No one seems to ask - or even want to ask - Who are the estimates for? They are not likely for developers, who rightly, so in some cases see estimating as taking away from their valuable development duties.

Who Are Estimates For?

Estimates are for business managers providing the money that appears in the developers paycheck. Estimates are for those same business managers accountable for the Profit & Loss statement of the firm employing the developers writing the code. Those estimates forecast confidence intervals of profit or loss on a project or service before that profit or loss arrives and is irrevocable. 

Estimates are for the business marketing staff in a product firm, who are forecasting the "break even" plan for the sunk cost of developing  software that will be sold in the market. Whose revenue will pay back the short term loan (line of credit) used to pay the salaries of the developers. Without this forecast, decisions about spending or further spending have to be made in the dark.

Estimates are for the business development staff in a professional services and development firm to forecast the confidence in the assure that the contractual obligations to provide working software will not cost more - including management reserve and contingency - than they quoted the customer during the early phases of the project. Since all forecasting are probabilistic, this confidence is - or should be - discussed as the probability of cost at of below or completing on or before. The dysfunction of using estimates as commitments, is recognized as just that - dysfuntion. But as a dysfunction, it's classified as Bad Management. Don't Do Stop Things on Purpose is good advice for any business.

Estimates are for the internal business finance staff accountable for managing and forecasting costs for internal software development or procurement used to run the business - and likely used to generate revenue - and assure the senior finance people that the "value" produced by this software measured in monetized units of "money" will exceed the cost to achieve that value when the project completes. And some sense of when the date will be, so those monetized benefits can start to appear on the balance sheet using FASB 86 accounting rules.

The estimates are not for the developers

Those talking about #NoEstimates from the developers point of view are talking to the wrong people. They appeat to be talking to their own self-selected group and not the group that provides the money for their work. As my former NASA Cost Director colleague reminds me "follow the money." So follow the money. Unless the developers are providing the money themselves, the question of estimating or not estimating is a self-referencing conversation in the absence of these people. Because of that, those best to say if estimates are of value or not are not in the conversation. 

So Back To The Original Question

Ignoring for the moment the observed or perceived dysfunctions found in low maturity software development organizations of the misuse of estimates. Ignore for the moment the preception that making estimates of the future cost, duration, and probabilistic outcomes of development work is part of normal engineering processes. Ignore the emotional rhetoric of the Dilbert approach to management. 

The core principle of Microeconomics of software development requires we  have some approximation of the future to make decisions about alternatives. The opportunity cost, the trade-space of decision making, requires we approximate the cost and outcomes of our decisions. 

Now add the core business process of managing expenditures against a planned and targeted Return on Investment, which has both Value and Cost in it's equation. 

Then ask those conjecturing there are:

  • Decision making frameworks for project that do not require estimates
  • Investment models for software projects that do not require estimates
  • Project management approaches of dealing with risk, scope management, progress reporting that do not require estimates

To connect the dots to those conjectures with Microeconomics of software development and ROI assessments of standard business processes.


Related articles How NOT to Estimate Anything How To Fix Martin Fowler's Estimating Problem in 3 Easy Steps More #NoEstimates All Project Numbers are Random Numbers - Act Accordingly How To Estimate, If You Really Want To Resources for Moving Beyond the "Estimating Fallacy" Back To The Future How to "Lie" with Statistics


Categories: Project Management

Announcing the GTAC 2014 Agenda

Google Testing Blog - Wed, 10/01/2014 - 01:37
by Anthony Vallone on behalf of the GTAC Committee

We have completed selection and confirmation of all speakers and attendees for GTAC 2014. You can find the detailed agenda at:

Thank you to all who submitted proposals! It was very hard to make selections from so many fantastic submissions.

There was a tremendous amount of interest in GTAC this year with over 1,500 applicants (up from 533 last year) and 194 of those for speaking (up from 88 last year). Unfortunately, our venue only seats 250. However, don’t despair if you did not receive an invitation. Just like last year, anyone can join us via YouTube live streaming. We’ll also be setting up Google Moderator, so remote attendees can get involved in Q&A after each talk. Information about live streaming, Moderator, and other details will be posted on the GTAC site soon and announced here.

Categories: Testing & QA

Distributed Agile: Daily Stand-up Meetings

Stand-ups are best on your feet!

Stand-ups are best on your feet!

Distributed Agile teams require a different level of care and feeding than a co-located team in order to ensure that they are as effective as possible. This is even truer for a team that is working through their forming-storming-norming process. Core to making Agile-as-framework work effectively is the concepts of team and communication. Daily stand-up meetings are one the most important communication tools used in Scrum or other Agile/Lean frameworks. Techniques that are effective making daily stand-ups work for distributed teams include:

  1. Deal with the time zone issue. There are two primary options to deal with time zones. The first is to keep the team members within three or four time zones of each other. Given typical sourcing options, this tends to be difficult. A second option is to rotate the time for the stand-up meeting from sprint to sprint so that everyone loses a similar amount of sleep (share the pain option). One usable solution that can be tried when distributed teams can’t overlap is to have one team member (rotate) staying late or coming in early to overlap work times.
  2. Identify and attack blockers between stand-ups. Typically, on distributed teams, all parties will not work at the same time. Team members should be counseled to communicate blockers to the team as soon as they are discovered so that something discovered late in the day in one time zone does not affect the team in a different time zone that might just be starting to work. One group I worked with had stand-ups twice each day (at the beginning of the day and at the end of the day) to ensure continuous communication.
  3. Push status outside the stand-up. A solution suggested by Matt Hauser is to have the team answer the classic three questions (What did you do yesterday? What will you do today? Is there anything blocking your progress?) on a WIKI for everyone on the team to read before the stand-up meeting. This helps focus the meeting on planning or dealing with issues.
  4. Vary the question set being asked. The process of varying the question set keeps the team focused on communication rather than giving a memorized speech. For example ask:
    1. Is anyone stuck?
    2. Does anyone need help?
    3. What did not get competed yesterday?
    4. Is there anything everyone should know?

This technique can be used for non-distributed teams, as well as distributed teams.

  1. Ensure that everyone is standing. This is code for making sure that everyone is paying attention and staying focused. Standing is just one technique for helping team members stay focused. Others include banning cell phones and side conversations.
  2. Make sure the meeting stays “crisp.” Stand-up meetings by definition are short and to the point. The team needs to ensure that the meeting stays as disciplined as possible. All team members should show up on time and be prepared to discuss their role in the project. Discussion includes the willingness to ask for help and to provide help to team members.
  3. Use a physical status wall. While the term “distributed” screams tool usage, using a physical wall helps to focus the team. The simplicity of a physical wall takes the complexity of tool usage off the table so the focus can be on communication. Use of a physical wall in a distributed environment will mean using video to moving tasks on the wall (after the fact a picture can be provided to the team). If video is not available, use a tool that EVERYONE has access to. Keep tools as simple as possible.
  4. Don’t stop doing stand-ups. Stand-up meetings are a critical communication and planning event, not doing stand-ups for a distributed team is an indicator that the organization should go back to project manager/plan-based methods.

Like any other distributed team meeting, having good telecommunication/video tools is not only important, it is a prerequisite. If team members can’t hear each other, they CAN’T communicate.

Stand-ups are nearly ubiquitous in Agile. I would do stand-ups even if I were not doing Agile. However despite their simplicity, the added complexity of distributed teams can cause problems. The whole team is responsible for making the stand-up meetings work. While the Scrum master may take the lead in insuring the logistics are right or to facilitate the session when needed, everyone needs to play a role.

Categories: Process Management

R: A first attempt at linear regression

Mark Needham - Tue, 09/30/2014 - 23:20

I’ve been working through the videos that accompany the Introduction to Statistical Learning with Applications in R book and thought it’d be interesting to try out the linear regression algorithm against my meetup data set.

I wanted to see how well a linear regression algorithm could predict how many people were likely to RSVP to a particular event. I started with the following code to build a data frame containing some potential predictors:

officeEventsQuery = "MATCH (g:Group {name: \"Neo4j - London User Group\"})-[:HOSTED_EVENT]->(event)<-[:TO]-({response: 'yes'})<-[:RSVPD]-(),
                     WHERE (event.time + event.utc_offset) < timestamp() AND IN [\"Neo Technology\", \"OpenCredo\"]
                     RETURN event.time + event.utc_offset AS eventTime,event.announced_at AS announcedAt,, COUNT(*) AS rsvps"
events = subset(cypher(graph, officeEventsQuery), !
events$eventTime <- timestampToDate(events$eventTime)
events$day <- format(events$eventTime, "%A")
events$monthYear <- format(events$eventTime, "%m-%Y")
events$month <- format(events$eventTime, "%m")
events$year <- format(events$eventTime, "%Y")
events$announcedAt<- timestampToDate(events$announcedAt)
events$timeDiff = as.numeric(events$eventTime - events$announcedAt, units = "days")

If we preview ‘events’ it contains the following columns:

> head(events)
            eventTime         announcedAt                               rsvps       day monthYear month year  timeDiff
1 2013-01-29 18:00:00 2012-11-30 11:30:57                                   Intro to Graphs    24   Tuesday   01-2013    01 2013 60.270174
2 2014-06-24 18:30:00 2014-06-18 19:11:19                                   Intro to Graphs    43   Tuesday   06-2014    06 2014  5.971308
3 2014-06-18 18:30:00 2014-06-08 07:03:13                         Neo4j World Cup Hackathon    24 Wednesday   06-2014    06 2014 10.476933
4 2014-05-20 18:30:00 2014-05-14 18:56:06                                   Intro to Graphs    53   Tuesday   05-2014    05 2014  5.981875
5 2014-02-11 18:00:00 2014-02-05 19:11:03                                   Intro to Graphs    35   Tuesday   02-2014    02 2014  5.950660
6 2014-09-04 18:30:00 2014-08-26 06:34:01 Hands On Intro to Cypher - Neo4j's Query Language    20  Thursday   09-2014    09 2014  9.497211

We want to predict ‘rsvps’ from the other columns so I started off by creating a linear model which took all the other columns into account:

> summary(lm(rsvps ~., data = events))
lm(formula = rsvps ~ ., data = events)
    Min      1Q  Median      3Q     Max 
-8.2582 -1.1538  0.0000  0.4158 10.5803 
Coefficients: (14 not defined because of singularities)
                                                                    Estimate Std. Error t value Pr(>|t|)   
(Intercept)                                                       -9.365e+03  3.009e+03  -3.113  0.00897 **
eventTime                                                          3.609e-06  2.951e-06   1.223  0.24479   
announcedAt                                                        3.278e-06  2.553e-06   1.284  0.22339   
event.nameGraph Modelling - Do's and Don'ts                        4.884e+01  1.140e+01   4.286  0.00106 **
event.nameHands on build your first Neo4j app for Java developers  3.735e+01  1.048e+01   3.562  0.00391 **
event.nameHands On Intro to Cypher - Neo4j's Query Language        2.560e+01  9.713e+00   2.635  0.02177 * 
event.nameIntro to Graphs                                          2.238e+01  8.726e+00   2.564  0.02480 * 
event.nameIntroduction to Graph Database Modeling                 -1.304e+02  4.835e+01  -2.696  0.01946 * 
event.nameLunch with Neo4j's CEO, Emil Eifrem                      3.920e+01  1.113e+01   3.523  0.00420 **
event.nameNeo4j Clojure Hackathon                                 -3.063e+00  1.195e+01  -0.256  0.80203   
event.nameNeo4j Python Hackathon with py2neo's Nigel Small         2.128e+01  1.070e+01   1.989  0.06998 . 
event.nameNeo4j World Cup Hackathon                                5.004e+00  9.622e+00   0.520  0.61251   
dayTuesday                                                         2.068e+01  5.626e+00   3.676  0.00317 **
dayWednesday                                                       2.300e+01  5.522e+00   4.165  0.00131 **
monthYear01-2014                                                  -2.350e+02  7.377e+01  -3.185  0.00784 **
monthYear02-2013                                                  -2.526e+01  1.376e+01  -1.836  0.09130 . 
monthYear02-2014                                                  -2.325e+02  7.763e+01  -2.995  0.01118 * 
monthYear03-2013                                                  -4.605e+01  1.683e+01  -2.736  0.01805 * 
monthYear03-2014                                                  -2.371e+02  8.324e+01  -2.848  0.01468 * 
monthYear04-2013                                                  -6.570e+01  2.309e+01  -2.845  0.01477 * 
monthYear04-2014                                                  -2.535e+02  8.746e+01  -2.899  0.01336 * 
monthYear05-2013                                                  -8.672e+01  2.845e+01  -3.049  0.01011 * 
monthYear05-2014                                                  -2.802e+02  9.420e+01  -2.975  0.01160 * 
monthYear06-2013                                                  -1.022e+02  3.283e+01  -3.113  0.00897 **
monthYear06-2014                                                  -2.996e+02  1.003e+02  -2.988  0.01132 * 
monthYear07-2014                                                  -3.123e+02  1.054e+02  -2.965  0.01182 * 
monthYear08-2013                                                  -1.326e+02  4.323e+01  -3.067  0.00976 **
monthYear08-2014                                                  -3.060e+02  1.107e+02  -2.763  0.01718 * 
monthYear09-2013                                                          NA         NA      NA       NA   
monthYear09-2014                                                  -3.465e+02  1.164e+02  -2.976  0.01158 * 
monthYear10-2012                                                   2.602e+01  1.959e+01   1.328  0.20886   
monthYear10-2013                                                  -1.728e+02  5.678e+01  -3.044  0.01020 * 
monthYear11-2012                                                   2.717e+01  1.509e+01   1.800  0.09704 . 
month02                                                                   NA         NA      NA       NA   
month03                                                                   NA         NA      NA       NA   
month04                                                                   NA         NA      NA       NA   
month05                                                                   NA         NA      NA       NA   
month06                                                                   NA         NA      NA       NA   
month07                                                                   NA         NA      NA       NA   
month08                                                                   NA         NA      NA       NA   
month09                                                                   NA         NA      NA       NA   
month10                                                                   NA         NA      NA       NA   
month11                                                                   NA         NA      NA       NA   
year2013                                                                  NA         NA      NA       NA   
year2014                                                                  NA         NA      NA       NA   
timeDiff                                                                  NA         NA      NA       NA   
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.287 on 12 degrees of freedom
Multiple R-squared:  0.9585,	Adjusted R-squared:  0.8512 
F-statistic: 8.934 on 31 and 12 DF,  p-value: 0.0001399

As I understand it we can look at the R-squared value to understand how much of the variance in the data has been explained by the model – in this case it’s 85%.

A lot of the coefficients seem to be based around specific event names which seems a bit too specific to me so I wanted to see what would happen if I derived a feature which indicated whether a session was practical:

events$practical = grepl("Hackathon|Hands on|Hands On", events$

We can now run the model again with the new column having excluded ‘’ field:

> summary(lm(rsvps ~., data = subset(events, select = -c(
lm(formula = rsvps ~ ., data = subset(events, select = -c(
    Min      1Q  Median      3Q     Max 
-18.647  -2.311   0.000   2.908  23.218 
Coefficients: (13 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)  
(Intercept)      -3.980e+03  4.752e+03  -0.838   0.4127  
eventTime         2.907e-06  3.873e-06   0.751   0.4621  
announcedAt       3.336e-08  3.559e-06   0.009   0.9926  
dayTuesday        7.547e+00  6.080e+00   1.241   0.2296  
dayWednesday      2.442e+00  7.046e+00   0.347   0.7327  
monthYear01-2014 -9.562e+01  1.187e+02  -0.806   0.4303  
monthYear02-2013 -4.230e+00  2.289e+01  -0.185   0.8553  
monthYear02-2014 -9.156e+01  1.254e+02  -0.730   0.4742  
monthYear03-2013 -1.633e+01  2.808e+01  -0.582   0.5676  
monthYear03-2014 -8.094e+01  1.329e+02  -0.609   0.5496  
monthYear04-2013 -2.249e+01  3.785e+01  -0.594   0.5595  
monthYear04-2014 -9.230e+01  1.401e+02  -0.659   0.5180  
monthYear05-2013 -3.237e+01  4.654e+01  -0.696   0.4952  
monthYear05-2014 -1.015e+02  1.509e+02  -0.673   0.5092  
monthYear06-2013 -3.947e+01  5.355e+01  -0.737   0.4701  
monthYear06-2014 -1.081e+02  1.604e+02  -0.674   0.5084  
monthYear07-2014 -1.110e+02  1.678e+02  -0.661   0.5163  
monthYear08-2013 -5.144e+01  6.988e+01  -0.736   0.4706  
monthYear08-2014 -1.023e+02  1.784e+02  -0.573   0.5731  
monthYear09-2013 -6.057e+01  7.893e+01  -0.767   0.4523  
monthYear09-2014 -1.260e+02  1.874e+02  -0.672   0.5094  
monthYear10-2012  9.557e+00  2.873e+01   0.333   0.7430  
monthYear10-2013 -6.450e+01  9.169e+01  -0.703   0.4903  
monthYear11-2012  1.689e+01  2.316e+01   0.729   0.4748  
month02                  NA         NA      NA       NA  
month03                  NA         NA      NA       NA  
month04                  NA         NA      NA       NA  
month05                  NA         NA      NA       NA  
month06                  NA         NA      NA       NA  
month07                  NA         NA      NA       NA  
month08                  NA         NA      NA       NA  
month09                  NA         NA      NA       NA  
month10                  NA         NA      NA       NA  
month11                  NA         NA      NA       NA  
year2013                 NA         NA      NA       NA  
year2014                 NA         NA      NA       NA  
timeDiff                 NA         NA      NA       NA  
practicalTRUE    -9.388e+00  5.289e+00  -1.775   0.0919 .
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10.21 on 19 degrees of freedom
Multiple R-squared:  0.7546,	Adjusted R-squared:  0.4446 
F-statistic: 2.434 on 24 and 19 DF,  p-value: 0.02592

Now we’re only accounting for 44% of the variance and none of our coefficients are significant so this wasn’t such a good change.

I also noticed that we’ve got a bit of overlap in the date related features – we’ve got one column for monthYear and then separate ones for month and year. Let’s strip out the combined one:

> summary(lm(rsvps ~., data = subset(events, select = -c(, monthYear))))
lm(formula = rsvps ~ ., data = subset(events, select = -c(, 
     Min       1Q   Median       3Q      Max 
-16.5745  -4.0507  -0.1042   3.6586  24.4715 
Coefficients: (1 not defined because of singularities)
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)   -1.573e+03  4.315e+03  -0.364   0.7185  
eventTime      3.320e-06  3.434e-06   0.967   0.3425  
announcedAt   -2.149e-06  2.201e-06  -0.976   0.3379  
dayTuesday     4.713e+00  5.871e+00   0.803   0.4294  
dayWednesday  -2.253e-01  6.685e+00  -0.034   0.9734  
month02        3.164e+00  1.285e+01   0.246   0.8075  
month03        1.127e+01  1.858e+01   0.607   0.5494  
month04        4.148e+00  2.581e+01   0.161   0.8736  
month05        1.979e+00  3.425e+01   0.058   0.9544  
month06       -1.220e-01  4.271e+01  -0.003   0.9977  
month07        1.671e+00  4.955e+01   0.034   0.9734  
month08        8.849e+00  5.940e+01   0.149   0.8827  
month09       -5.496e+00  6.782e+01  -0.081   0.9360  
month10       -5.066e+00  7.893e+01  -0.064   0.9493  
month11        4.255e+00  8.697e+01   0.049   0.9614  
year2013      -1.799e+01  1.032e+02  -0.174   0.8629  
year2014      -3.281e+01  2.045e+02  -0.160   0.8738  
timeDiff              NA         NA      NA       NA  
practicalTRUE -9.816e+00  5.084e+00  -1.931   0.0645 .
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10.19 on 26 degrees of freedom
Multiple R-squared:  0.666,	Adjusted R-squared:  0.4476 
F-statistic: 3.049 on 17 and 26 DF,  p-value: 0.005187

Again none of the coefficients are statistically significant which is disappointing. I think the main problem may be that I have very few data points (only 42) making it difficult to come up with a general model.

I think my next step is to look for some other features that could impact the number of RSVPs e.g. other events on that day, the weather.

I’m a novice at this but trying to learn more so if you have any ideas of what I should do next please let me know.

Categories: Programming

Neo4j: Generic/Vague relationship names

Mark Needham - Tue, 09/30/2014 - 17:47

An approach to modelling that I often see while working with Neo4j users is creating very generic relationships (e.g. HAS, CONTAINS, IS) and filtering on a relationship property or on a property/label at the end node.

Intuitively this doesn’t seem to make best use of the graph model as it means that you have to evaluate many relationships and nodes that you’re not interested in.

However, I’ve never actually tested the performance differences between the approaches so I thought I’d try it out.

I created 4 different databases which had one node with 60,000 outgoing relationships – 10,000 which we wanted to retrieve and 50,000 that were irrelevant.

I modelled the ‘relationship’ in 4 different ways…

  • Using a specific relationship type
  • Using a generic relationship type and then filtering by end node label
  • Using a generic relationship type and then filtering by relationship property
    (node)-[:HAS {type: "address"}]->(address)
  • Using a generic relationship type and then filtering by end node property
    (node)-[:HAS]->(address {type: “address”})

…and then measured how long it took to retrieve the ‘has address’ relationships.

The code is on github if you want to take a look.

Although it’s obviously not as precise as a JMH micro benchmark I think it’s good enough to get a feel for the difference between the approaches.

I ran a query against each database 100 times and then took the 50th, 75th and 99th percentiles (times are in ms):

Using a generic relationship type and then filtering by end node label
50%ile: 6.0    75%ile: 6.0    99%ile: 402.60999999999825
Using a generic relationship type and then filtering by relationship property
50%ile: 21.0   75%ile: 22.0   99%ile: 504.85999999999785
Using a generic relationship type and then filtering by end node label
50%ile: 4.0    75%ile: 4.0    99%ile: 145.65999999999931
Using a specific relationship type
50%ile: 0.0    75%ile: 1.0    99%ile: 25.749999999999872

We can drill further into why there’s a difference in the times for each of the approaches by profiling the equivalent cypher query. We’ll start with the one which uses a specific relationship name

Using a specific relationship type

neo4j-sh (?)$ profile match (n) where id(n) = 0 match (n)-[:HAS_ADDRESS]->() return count(n);
| count(n) |
| 10000    |
1 row
|             Operator |  Rows | DbHits |                 Identifiers |                 Other |
|         ColumnFilter |     1 |      0 |                             | keep columns count(n) |
|     EagerAggregation |     1 |      0 |                             |                       |
| SimplePatternMatcher | 10000 |  10000 | n,   UNNAMED53,   UNNAMED35 |                       |
|      NodeByIdOrEmpty |     1 |      1 |                        n, n |          {  AUTOINT0} |
Total database accesses: 10001

Here we can see that there were 10,002 database accesses in order to get a count of our 10,000 HAS_ADDRESS relationships. We get a database access each time we load a node, relationship or property.

By contrast the other approaches have to load in a lot more data only to then filter it out:

Using a generic relationship type and then filtering by end node label

neo4j-sh (?)$ profile match (n) where id(n) = 0 match (n)-[:HAS]->(:Address) return count(n);
| count(n) |
| 10000    |
1 row
|             Operator |  Rows | DbHits |                 Identifiers |                            Other |
|         ColumnFilter |     1 |      0 |                             |            keep columns count(n) |
|     EagerAggregation |     1 |      0 |                             |                                  |
|               Filter | 10000 |  10000 |                             | hasLabel(  UNNAMED45:Address(0)) |
| SimplePatternMatcher | 10000 |  60000 | n,   UNNAMED45,   UNNAMED35 |                                  |
|      NodeByIdOrEmpty |     1 |      1 |                        n, n |                     {  AUTOINT0} |
Total database accesses: 70001

Using a generic relationship type and then filtering by relationship property

neo4j-sh (?)$ profile match (n) where id(n) = 0 match (n)-[:HAS {type: "address"}]->() return count(n);
| count(n) |
| 10000    |
1 row
|             Operator |  Rows | DbHits |                 Identifiers |                                            Other |
|         ColumnFilter |     1 |      0 |                             |                            keep columns count(n) |
|     EagerAggregation |     1 |      0 |                             |                                                  |
|               Filter | 10000 |  20000 |                             | Property(  UNNAMED35,type(0)) == {  AUTOSTRING1} |
| SimplePatternMatcher | 10000 | 120000 | n,   UNNAMED63,   UNNAMED35 |                                                  |
|      NodeByIdOrEmpty |     1 |      1 |                        n, n |                                     {  AUTOINT0} |
Total database accesses: 140001

Using a generic relationship type and then filtering by end node property

neo4j-sh (?)$ profile match (n) where id(n) = 0 match (n)-[:HAS]->({type: "address"}) return count(n);
| count(n) |
| 10000    |
1 row
|             Operator |  Rows | DbHits |                 Identifiers |                                            Other |
|         ColumnFilter |     1 |      0 |                             |                            keep columns count(n) |
|     EagerAggregation |     1 |      0 |                             |                                                  |
|               Filter | 10000 |  20000 |                             | Property(  UNNAMED45,type(0)) == {  AUTOSTRING1} |
| SimplePatternMatcher | 10000 | 120000 | n,   UNNAMED45,   UNNAMED35 |                                                  |
|      NodeByIdOrEmpty |     1 |      1 |                        n, n |                                     {  AUTOINT0} |
Total database accesses: 140001

So in summary…specific relationships #ftw!

Categories: Programming

Sudoku, Linear Optimization, and the Ten Cent Diet

Google Code Blog - Tue, 09/30/2014 - 17:12
Originally posted on the Google Research blog. Cross posted on the Google Apps Developers blog

In 1945, future Nobel laureate George Stigler wrote an essay in the Journal of Farm Economics titled The Cost of Subsistence about a seemingly simple problem: how could a soldier be fed for as little money as possible?

The “Stigler Diet” became a classic problem in the then-new field of linear optimization, which is used today in many areas of science and engineering. Any time you have a set of linear constraints such as “at least 50 square meters of solar panels” or “the amount of paint should equal the amount of primer” along with a linear goal (e.g., “minimize cost” or “maximize customers served”), that’s a linear optimization problem.

At Google, our engineers work on plenty of optimization problems. One example is our YouTube video stabilization system, which uses linear optimization to eliminate the shakiness of handheld cameras. A more lighthearted example is in the Google Docs Sudoku add-on, which instantaneously generates and solves Sudoku puzzles inside a Google Sheet, using the SCIP mixed integer programming solver to compute the solution.
Today we’re proud to announce two new ways for everyone to solve linear optimization problems. First, you can now solve linear optimization problems in Google Sheets with the Linear Optimization add-on written by Google Software Engineer Mihai Amarandei-Stavila. The add-on uses Google Apps Script to send optimization problems to Google servers. The solutions are displayed inside the spreadsheet. For developers who want to create their own applications on top of Google Apps, we also provide an API to let you call our linear solver directly.
Second, we’re open-sourcing the linear solver underlying the add-on: Glop (the Google Linear Optimization Package), created by Bruno de Backer with other members of the Google Optimization team. It’s available as part of the or-tools suite and we provide a few examples to get you started. On that page, you’ll find the Glop solution to the Stigler diet problem. (A Google Sheets file that uses Glop and the Linear Optimization add-on to solve the Stigler diet problem is available here. You’ll need to install the add-on first.)

Stigler posed his problem as follows: given nine nutrients (calories, protein, Vitamin C, and so on) and 77 candidate foods, find the foods that could sustain soldiers at minimum cost.

The Simplex algorithm for linear optimization was two years away from being invented, so Stigler had to do his best, arriving at a diet that cost $39.93 per year (in 1939 dollars), or just over ten cents per day. Even that wasn’t the cheapest diet. In 1947, Jack Laderman used Simplex, nine calculator-wielding clerks, and 120 person-days to arrive at the optimal solution.

Glop’s Simplex implementation solves the problem in 300 milliseconds. Unfortunately, Stigler didn’t include taste as a constraint, and so the poor hypothetical soldiers will eat nothing but the following, ever:

  • Enriched wheat flour
  • Liver
  • Cabbage
  • Spinach
  • Navy beans

Is it possible to create an appealing dish out of these five ingredients? Google Chef Anthony Marco took it as a challenge, and we’re calling the result Foie Linéaire à la Stigler:
This optimal meal consists of seared calf liver dredged in flour, atop a navy bean purée with marinated cabbage and a spinach pesto.

Chef Marco reported that the most difficult constraint was making the dish tasty without butter or cream. That said, I had the opportunity to taste our linear optimization solution, and it was delicious.

Posted by Jon Orwant, Engineering Manager
Categories: Programming