Skip to content

Software Development Blogs: Programming, Software Testing, Agile Project Management

Methods & Tools

Subscribe to Methods & Tools
if you are not afraid to read more than one page to be a smarter software developer, software tester or project manager!


Spark: pyspark/Hadoop – py4j.protocol.Py4JJavaError: An error occurred while calling o23.load.: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4

Mark Needham - Tue, 08/04/2015 - 07:35

I’ve been playing around with pyspark – Spark’s Python library – and I wanted to execute the following job which takes a file from my local HDFS and then counts how many times each FBI code appears using Spark SQL:

from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext("local", "Simple App")
sqlContext = SQLContext(sc)
file = "hdfs://localhost:9000/user/markneedham/Crimes_-_2001_to_present.csv"
sqlContext.load(source="com.databricks.spark.csv", header="true", path = file).registerTempTable("crimes")
rows = sqlContext.sql("select `FBI Code` AS fbiCode, COUNT(*) AS times FROM crimes GROUP BY `FBI Code` ORDER BY times DESC").collect()
for row in rows:
    print("{0} -> {1}".format(row.fbiCode, row.times))

I submitted the job and waited:

$ ./spark-1.3.0-bin-hadoop1/bin/spark-submit --driver-memory 5g --packages com.databricks:spark-csv_2.10:1.1.0
Traceback (most recent call last):
  File "/Users/markneedham/projects/neo4j-spark-chicago/", line 11, in <module>
    sqlContext.load(source="com.databricks.spark.csv", header="true", path = file).registerTempTable("crimes")
  File "/Users/markneedham/projects/neo4j-spark-chicago/spark-1.3.0-bin-hadoop1/python/pyspark/sql/", line 482, in load
    df = self._ssql_ctx.load(source, joptions)
  File "/Users/markneedham/projects/neo4j-spark-chicago/spark-1.3.0-bin-hadoop1/python/lib/", line 538, in __call__
  File "/Users/markneedham/projects/neo4j-spark-chicago/spark-1.3.0-bin-hadoop1/python/lib/", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o23.load.
: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(
	at com.sun.proxy.$Proxy7.getProtocolVersion(Unknown Source)
	at org.apache.hadoop.ipc.RPC.getProxy(
	at org.apache.hadoop.ipc.RPC.getProxy(
	at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(
	at org.apache.hadoop.hdfs.DFSClient.<init>(
	at org.apache.hadoop.hdfs.DFSClient.<init>(
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(
	at org.apache.hadoop.fs.FileSystem.createFileSystem(
	at org.apache.hadoop.fs.FileSystem.access$200(
	at org.apache.hadoop.fs.FileSystem$Cache.get(
	at org.apache.hadoop.fs.FileSystem.get(
	at org.apache.hadoop.fs.Path.getFileSystem(
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1156)
	at org.apache.spark.rdd.RDD.first(RDD.scala:1189)
	at com.databricks.spark.csv.CsvRelation.firstLine$lzycompute(CsvRelation.scala:129)
	at com.databricks.spark.csv.CsvRelation.firstLine(CsvRelation.scala:127)
	at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:109)
	at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:62)
	at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:115)
	at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:40)
	at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:28)
	at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:290)
	at org.apache.spark.sql.SQLContext.load(SQLContext.scala:679)
	at org.apache.spark.sql.SQLContext.load(SQLContext.scala:667)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(
	at java.lang.reflect.Method.invoke(
	at py4j.reflection.MethodInvoker.invoke(
	at py4j.reflection.ReflectionEngine.invoke(
	at py4j.Gateway.invoke(
	at py4j.commands.AbstractCommand.invokeMethod(
	at py4j.commands.CallCommand.execute(

It looks like my Hadoop Client and Server are using different versions which in fact they are! We can see from the name of the spark folder that I’m using Hadoop 1.x there and if we check the local Hadoop version we’ll notice it’s using the 2.x seris:

$ hadoop version
Hadoop 2.6.0

In this case the easiest fix is use a version of Spark that’s compiled against Hadoop 2.6 which as of now means Spark 1.4.1.

Let’s try and run our job again:

$ ./spark-1.4.1-bin-hadoop2.6/bin/spark-submit --driver-memory 5g --packages com.databricks:spark-csv_2.10:1.1.0
06 -> 859197
08B -> 653575
14 -> 488212
18 -> 457782
26 -> 431316
05 -> 257310
07 -> 197404
08A -> 188964
03 -> 157706
11 -> 112675
04B -> 103961
04A -> 60344
16 -> 47279
15 -> 40361
24 -> 31809
10 -> 22467
17 -> 17555
02 -> 17008
20 -> 15190
19 -> 10878
22 -> 8847
09 -> 6358
01A -> 4830
13 -> 1561
12 -> 835
01B -> 16

And it’s working!

Categories: Programming

7 Things Your Boss Doesn’t Understand About Software Development

Making the Complex Simple - John Sonmez - Mon, 08/03/2015 - 16:00

Your boss may be awesome. I’ve certainly had a few awesome bosses in my programming career, but even the most awesome bosses don’t always seem to “get it.” In fact, I’d say that most software development managers are a little short-sighted when it comes to more than a few elements of programming. So, I’ve compiled […]

The post 7 Things Your Boss Doesn’t Understand About Software Development appeared first on Simple Programmer.

Categories: Programming

Building IntelliJ plugins from the command line

Xebia Blog - Mon, 08/03/2015 - 13:16

For a few years already, IntelliJ IDEA has been my IDE of choice. Recently I dove into the world of plugin development for IntelliJ IDEA and was unhappily surprised. Plugin development all relies on IDE features. It looked hard to create a build script to do the actual plugin compilation and packaging from a build script. The JetBrains folks simply have not catered for that. Unless you're using TeamCity as your CI tool, you're out of luck.

For me it makes no sense writing code if:

  1. it can not be compiled and packaged from the command line
  2. the code can not be compiled and tested on a CI environment
  3. IDE configurations can not be generated from the build script

Google did not help out a lot. Tomasz Dziurko put me in the right direction.

In order to build and test a plugin, the following needs to be in place:

  1. First of all you'll need IntelliJ IDEA. This is quite obvious. The Plugin DevKit plugins need to be installed. If you want to create a language plugin you might want to install Grammar-Kit too.
  2. An IDEA SDK needs to be registered. The SDK can point to your IntelliJ installation.

The plugin module files are only slightly different from your average project.

Compiling and testing the plugin

Now for the build script. My build tool of choice is Gradle. My plugin code adheres to the default Gradle project structure.

First thing to do is to get a hold of the IntelliJ IDEA libraries in an automated way. Since the IDEA libraries are not available via Maven repos, an IntelliJ IDEA Community Edition download is probably the best option to get a hold of the libraries.

The plan is as follows: download the Linux version of IntelliJ IDEA, and extract it in a predefined location. From there, we can point to the libraries and subsequently compile and test the plugin. The libraries are Java, and as such platform independent. I picked the Linux version since it has a nice, simple file structure.

The following code snippet caters for this:

apply plugin: 'java'

// Pick the Linux version, as it is a tar.gz we can simply extract
def IDEA_SDK_URL = ''
def IDEA_SDK_NAME = 'IntelliJ IDEA Community Edition IC-139.1603.1'

configurations {
    bundle // dependencies bundled with the plugin

dependencies {
    ideaSdk fileTree(dir: 'lib/sdk/', include: ['*/lib/*.jar'])

    compile configurations.ideaSdk
    compile configurations.bundle
    testCompile 'junit:junit:4.12'
    testCompile 'org.mockito:mockito-core:1.10.19'

// IntelliJ IDEA can still run on a Java 6 JRE, so we need to take that into account.
sourceCompatibility = 1.6
targetCompatibility = 1.6

task downloadIdeaSdk(type: Download) {
    sourceUrl = IDEA_SDK_URL
    target = file('lib/idea-sdk.tar.gz')

task extractIdeaSdk(type: Copy, dependsOn: [downloadIdeaSdk]) {
    def zipFile = file('lib/idea-sdk.tar.gz')
    def outputDir = file("lib/sdk")

    from tarTree(resources.gzip(zipFile))
    into outputDir

compileJava.dependsOn extractIdeaSdk

class Download extends DefaultTask {
    String sourceUrl

    File target

    void download() {
       if (!target.parentFile.exists()) {
       ant.get(src: sourceUrl, dest: target, skipexisting: 'true')

If parallel test execution does not work for your plugin, you'd better turn it off as follows:

test {
    // Avoid parallel execution, since the IntelliJ boilerplate is not up to that
    maxParallelForks = 1
The plugin deliverable

Obviously, the whole build process should be automated. That includes the packaging of the plugin. A plugin is simply a zip file with all libraries together in a lib folder.

task dist(type: Zip, dependsOn: [jar, test]) {
    from configurations.bundle
    from jar.archivePath
    rename { f -> "lib/${f}" }

build.dependsOn dist
Handling IntelliJ project files

We also need to generate IntelliJ IDEA project and module files so the plugin can live within the IDE. Telling the IDE it's dealing with a plugin opens some nice features, mainly the ability to run the plugin from within the IDE. Anton Arhipov's blog post put me on the right track.

The Gradle idea plugin helps out in creating those files. This works out of the box for your average project, but for plugins IntelliJ expects some things differently. The project files should mention that we're dealing with a plugin project and the module file should point to the plugin.xml file required for each plugin. Also, the SDK libraries are not to be included in the module file; so, I excluded those from the configuration.

The following code snippet caters for this:

apply plugin: 'idea'

idea {
    project {
        languageLevel = '1.6'
        jdkName = IDEA_SDK_NAME

        ipr {
            withXml {
                it.node.find { node ->
                    node.@name == 'ProjectRootManager'
                }.'@project-jdk-type' = 'IDEA JDK'

                logger.warn "=" * 71
                logger.warn " Configured IDEA JDK '${jdkName}'."
                logger.warn " Make sure you have it configured IntelliJ before opening the project!"
                logger.warn "=" * 71

    module {
        scopes.COMPILE.minus = [ configurations.ideaSdk ]

        iml {
            beforeMerged { module ->
            withXml {
                it.node.@type = 'PLUGIN_MODULE'
                //  &lt;component name="DevKit.ModuleBuildProperties" url="file://$MODULE_DIR$/src/main/resources/META-INF/plugin.xml" />
                def cmp = it.node.appendNode('component')
                cmp.@name = 'DevKit.ModuleBuildProperties'
                cmp.@url = 'file://$MODULE_DIR$/src/main/resources/META-INF/plugin.xml'
Put it to work!

Combining the aforementioned code snippets will result in a build script that can be run on any environment. Have a look at my idea-clock plugin for a working example.

Spark: Processing CSV files using Databricks Spark CSV Library

Mark Needham - Sun, 08/02/2015 - 19:08

Last year I wrote about exploring the Chicago crime data set using Spark and the OpenCSV parser and while this worked well, a few months ago I noticed that there’s now a spark-csv library which I should probably use instead.

I thought it’d be a fun exercise to translate my code to use it.

So to recap our goal: we want to count how many times each type of crime has been committed. I have a more up to date version of the crimes file now so the numbers won’t be exactly the same.

First let’s launch the spark-shell and register our CSV file as a temporary table so we can query it as if it was a SQL table:

$ ./spark-1.3.0-bin-hadoop1/bin/spark-shell
scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext
scala> val crimeFile = "/Users/markneedham/Downloads/Crimes_-_2001_to_present.csv"
crimeFile: String = /Users/markneedham/Downloads/Crimes_-_2001_to_present.csv
scala> val sqlContext = new SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@9746157
scala> sqlContext.load("com.databricks.spark.csv", Map("path" -> crimeFile, "header" -> "true")).registerTempTable("crimes")
java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
	at scala.sys.package$.error(package.scala:27)
	at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:268)
	at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:279)
	at org.apache.spark.sql.SQLContext.load(SQLContext.scala:679)
        at java.lang.reflect.Method.invoke(
	at org.apache.spark.repl.SparkIMain$
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
	at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
	at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
	at org.apache.spark.repl.Main$.main(Main.scala:31)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(
	at java.lang.reflect.Method.invoke(
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I’ve actually forgotten to tell spark-shell about the CSV package so let’s restart the shell and pass it as an argument:

$ ./spark-1.3.0-bin-hadoop1/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.1.0
scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext
scala> val crimeFile = "/Users/markneedham/Downloads/Crimes_-_2001_to_present.csv"
crimeFile: String = /Users/markneedham/Downloads/Crimes_-_2001_to_present.csv
scala> val sqlContext = new SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@44587c44
scala> sqlContext.load("com.databricks.spark.csv", Map("path" -> crimeFile, "header" -> "true")).registerTempTable("crimes")
15/08/02 18:57:46 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/08/02 18:57:46 INFO DAGScheduler: Stage 0 (first at CsvRelation.scala:129) finished in 0.207 s
15/08/02 18:57:46 INFO DAGScheduler: Job 0 finished: first at CsvRelation.scala:129, took 0.267327 s

Now we can write a simple SQL query on our ‘crimes’ table to find the most popular crime types:

scala>  sqlContext.sql(
        select `Primary Type` as primaryType, COUNT(*) AS times
        from crimes
        group by `Primary Type`
        order by times DESC
        """).save("/tmp/agg.csv", "com.databricks.spark.csv")

That spits out a load of CSV ‘part files’ into /tmp/agg.csv so let’s bring in the merge function that we’ve used previously to combine these into one CSV file:

scala> import org.apache.hadoop.conf.Configuration
scala> import org.apache.hadoop.fs._
scala> def merge(srcPath: String, dstPath: String): Unit =  {
         val hadoopConfig = new Configuration()
         val hdfs = FileSystem.get(hadoopConfig)
         FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
scala> merge("/tmp/agg.csv", "agg.csv")

And finally let’s browse the contents of our new CSV file:

$ cat agg.csv

Great! We’ve got the same output with much less code which is always a #win.

Categories: Programming

17 Theses on Software Estimation

10x Software Development - Steve McConnell - Sun, 08/02/2015 - 17:20

(with apologies to Martin Luther for the title)

Arriving late to the #NoEstimates discussion, I’m amazed at some of the assumptions that have gone unchallenged, and I’m also amazed at the absence of some fundamental points that no one seems to have made so far. The point of this article is to state unambiguously what I see as the arguments in favor of estimation in software and put #NoEstimates in context.  

1. Estimation is often done badly and ineffectively and in an overly time-consuming way. 

My company and I have taught upwards of 10,000 software professionals better estimation practices, and believe me, we have seen every imaginable horror story of estimation done poorly. There is no question that “estimation is often done badly” is a true observation of the state of the practice. 

2. The root cause of poor estimation is usually lack of estimation skills. 

Estimation done poorly is most often due to lack of estimation skills. Smart people using common sense is not sufficient to estimate software projects. Reading two page blog articles on the internet is not going to teach anyone how to estimate very well. Good estimation is not that hard, once you’ve developed the skill, but it isn’t intuitive or obvious, and it requires focused self-education or training. 

3. Many comments in support of #NoEstimates demonstrate a lack of basic software estimation knowledge. 

I don’t expect most #NoEstimates advocates to agree with this thesis, but as someone who does know a lot about estimation I think it’s clear on its face. Here are some examples

(a) Are estimation and forecasting the same thing? As far as software estimation is concerned, yes they are. (Just do a Google or Bing search of “definition of forecast”.) Estimation, forecasting, prediction--it's all the same basic activity, as far as software estimation is concerned. 

(b) Is showing someone several pictures of kitchen remodels that have been completed for $30,000 and implying that the next kitchen remodel can be completed for $30,000 estimation? Yes, it is. That’s an implementation of a technique called Reference Class Forecasting. 

(c) Is doing a few iterations, calculating team velocity, and then using that empirical velocity data to project a completion date count as estimation? Yes it does. Not only is it estimation, it is a really effective form of estimation. I’ve heard people argue that because velocity is empirically based, it isn’t estimation. That argument is incorrect and shows a lack of basic understanding of the nature of estimation. 

(d) Is estimation time consuming and a waste of time? One of the most common symptoms of lack of estimation skill is spending too much time on the wrong activities. This work is often well-intentioned, but it’s common to see well-intentioned people doing more work than they need to get worse answers than they could be getting.  

4. Being able to estimate effectively is a skill that any true software professional needs to develop, even if they don’t need it on every project. 

“Estimating is problematic, therefore software professionals should not develop estimation skill” – this is a common line of reasoning in #NoEstimates. Unless a person wants to argue that the need for estimation is rare, this argument is not supported by the rest of #NoEstimate’s premises. 

If I agreed, for sake of argument, that 50% of the projects don’t need to be estimated, the other 50% of the projects would still benefit from the estimators having good estimation skills. If you’re a true software professional, you should develop estimation skill so that you can estimate competently on the 50% of projects that do require estimation. 

In practice, I think the number of projects that need estimates is much higher than 50%. 

5. Estimates serve numerous legitimate, important business purposes.

Estimates are used by businesses in numerous ways, including: 

  • Allocating budgets to projects (i.e., estimating the effort and budget of each project)
  • Making cost/benefit decisions at the project/product level, which is based on cost (software estimate) and benefit (defined feature set)
  • Deciding which projects get funded and which do not, which is often based on cost/benefit
  • Deciding which projects get funded this year vs. next year, which is often based on estimates of which projects will finish this year
  • Deciding which projects will be funded from CapEx budget and which will be funded from OpEx budget, which is based on estimates of total project effort, i.e., budget
  • Allocating staff to specific projects, i.e., estimates of how many total staff will be needed on each project
  • Allocating staff within a project to different component teams or feature teams, which is based on estimates of scope of each component or feature area
  • Allocating staff to non-project work streams (e.g., budget for a product support group, which is based on estimates for the amount of support work needed)
  • Making commitments to internal business partners (based on projects’ estimated availability dates)
  • Making commitments to the marketplace (based on estimated release dates)
  • Forecasting financials (based on when software capabilities will be completed and revenue or savings can be booked against them)
  • Tracking project progress (comparing actual progress to planned (estimated) progress)
  • Planning when staff will be available to start the next project (by estimating when staff will finish working on the current project)
  • Prioritizing specific features on a cost/benefit basis (where cost is an estimate of development effort)

These are just a subset of the many legitimate reasons that businesses request estimates from their software teams. I would be very interested to hear how #NoEstimates advocates suggest that a business would operate if you remove the ability to use estimates for each of these purposes.

The #NoEstimates response to these business needs is typically of the form, “Estimates are inaccurate and therefore not useful for these purposes” rather than, “The business doesn’t need estimates for these purposes.” 

That argument really just says that businesses are currently operating on much worse quality information than they should be, and probably making poorer decisions as a result, because the software staff are not providing very good estimates. If software staff provided more accurate estimates, the business would make better decisions in each of these areas, which would make the business stronger. 

This all supports my point that improved estimation skill should be part of the definition of being a true software professional. 

6. Part of being an effective estimator is understanding that different estimation techniques should be used for different kinds of estimates. 

One thread that runs throughout the #NoEstimates discussions is lack of clarity about whether we’re estimating before the project starts, very early in the project, or after the project is underway. The conversation is also unclear about whether the estimates are project-level estimates, task-level estimates, sprint-level estimates, or some combination. Some of the comments imply ineffective attempts to combine kinds of estimates—the most common confusion I’ve read is trying to use task-level estimates to estimate a whole project, which is another example of lack of software estimation skill. 

Effective estimation requires that the right kind of technique be applied to each different kind of estimate. Learning when to use each technique, as well as learning each technique, requires some professional skills development. 

7. Estimation and planning are not the same thing, and you can estimate things that you can’t plan. 

Many of the examples given in support of #NoEstimates are actually indictments of overly detailed waterfall planning, not estimation. The simple way to understand the distinction is to remember that planning is about “how” and estimation is about “how much.” 

Can I “estimate” a chess game, if by “estimate” I mean how each piece will move throughout the game? No, because that isn’t estimation; it’s planning; it’s “how.”

Can I estimate a chess game in the sense of “how much”? Sure. I can collect historical data on the length of chess games and know both the average length and the variation around that average and predict the length of a game. 

More to the point, estimating software projects is not analogous to estimating one chess game. It’s analogous to estimating a series of chess games. People who are not skilled in estimation often assume it’s more difficult to estimate a series of games than to estimate an individual game, but estimating the series is actually easier. Indeed, the more chess games in the set, the more accurately we can estimate the set, once you understand the math involved. 

8. You can estimate what you don’t know, up to a point. 

In addition to estimating “how much,” you can also estimate “how uncertain.” In the #NoEstimates discussions, people throw out lots of examples along the lines of, “My project was doing unprecedented work in Area X, and therefore it was impossible to estimate the whole project.” That isn’t really true. What you would end up with in cases like that is high variability in your estimate for Area X, and a common estimation mistake would be letting X’s uncertainty apply to the whole project rather than constraining it’s uncertainty just to Area X. 

Most projects contain a mix of precedented and unprecedented work, or certain and uncertain work. Decomposing the work, estimating uncertainty in different areas, and building up an overall estimate from that is one way of dealing with uncertainty in estimates. 

9. Both estimation and control are needed to achieve predictability. 

Much of the writing on Agile development emphasizes project control over project estimation. I actually agree that project control is more powerful than project estimation, however, effective estimation usually plays an essential role in achieving effective control. 

To put this in Agile Manifesto-like terms:

We have come to value project control over project estimation, 
as a means of achieving predictability

As in the Agile Manifesto, we value both terms, which means we still value the term on the right. 

#NoEstimates seems to pay lip service to both terms, but the emphasis from the hashtag onward is really about discarding the term on the right. This is a case where I believe the right answer is both/and, not either/or

10. People use the word "estimate" sloppily. 

No doubt. Lack of understanding of estimation is not limited to people tweeting about #NoEstimates. Business partners often use the word “estimate” to refer to what would more properly be called a “planning target” or “commitment.” Further, one common mistake software professionals make is trying to create estimates when the business is really asking for a commitment, or asking for a plan to meet a target, but using the word “estimate” to ask for that. 

We have worked with many companies to achieve organizational clarity about estimates, targets, and commitments. Clarifying these terms makes a huge difference in the dynamics around creating, presenting, and using software estimates effectively. 

11. Good project-level estimation depends on good requirements, and average requirements skills are about as bad as average estimation skills. 

A common refrain in Agile development is “It’s impossible to get good requirements,” and that statement has has never been true. I agree that it’s impossible to get perfect requirements, but that isn’t the same thing as getting good requirements. I would agree that “It is impossible to get good requirements if you don’t have very good requirement skills,” and in my experience that is a common case.  I would also agree that “Projects usually don’t have very good requirements,” as an empirical observation—but not as a normative statement that we should accept as inevitable. 

Like estimation skill, requirements skill is something that any true software professional should develop, and the state of the art in requirements at this time is far too advanced for even really smart people to invent everything they need to know on their own. Like estimation skill, a person is not going to learn adequate requirements skills by reading blog entries or watching short YouTube videos. Acquiring skill in requirements requires focused, book-length self-study or explicit training or both. 

Why would we care about getting good requirements if we’re Agile? Isn’t trying to get good requirements just waterfall? The answer is both yes and no. You can’t achieve good predictability of the combination of cost, schedule, and functionality if you don’t have a good definition of functionality. If your business truly doesn’t care about predictability (and some truly don’t), then letting your requirements emerge over the course of the project can be a good fit for business needs. But if your business does care about predictability, you should develop the skill to get good requirements, and then you should actually do the work to get them. You can still do the rest of the project using by-the-book Scrum, and then you’ll get the benefits of both good requirements and Scrum. 

12. The typical estimation context involves moderate volatility and a moderate levels of unknowns

Ron Jeffries writes, “It is conventional to behave as if all decent projects have mostly known requirements, low volatility, understood technology, …, and are therefore capable of being more or less readily estimated by following your favorite book.” 

I don’t know who said that, but it wasn’t me, and I agree with Ron that that statement doesn’t describe most of the projects that I have seen. 

I think it would be more true to say, “The typical software project has requirements that are knowable in principle, but that are mostly unknown in practice due to insufficient requirements skills; low volatility in most areas with high volatility in selected areas; and technology that tends to be either mostly leading edge or mostly mature; …; and are therefore amenable to having both effective requirements work and effective estimation work performed on those projects, given sufficient training in both skill sets.”

In other words, software projects are challenging, and they’re even more challenging if you don’t have the skills needed to work on them. If you have developed the right skills, the projects will still be challenging, but you’ll be able to overcome most of the challenges or all of them. 

Of course there is a small percentage of projects that do have truly unknowable requirements and across-the-board volatility. I consider those to be corner cases. It’s good to explore corner cases, but also good not to lose sight of which cases are most common. 

13. Responding to change over following a plan does not imply not having a plan. 

It’s amazing that in 2015 we’re still debating this point. Many of the #NoEstimates comments literally emphasize not having a plan, i.e., treating 100% of the project as emergent. They advocate a process—typically Scrum—but no plan beyond instantiating Scrum. 

According to the Agile Manifesto, while agile is supposed to value responding to change, it also is supposed to value following a plan. Doing no planning at all is not only inconsistent with the Agile Manifesto, it also wastes some of Scrum's capabilities. One of the amazingly powerful aspects of Scrum is that it gives you the ability to respond to change; and that doesn’t imply that you need to avoid committing to plans in the first place. 

My company and I have seen Agile adoptions shut down in some companies because an Agile team is unwilling to commit to requirements up front or refuses to estimate up front. As a strategy, that’s just dumb. If you fight your business up front about providing estimates, even if you win the argument that day, you will still get knocked down a peg in the business’s eyes. 

Instead, use your velocity to estimate how much work you can do over the course of a project, and commit to a product backlog based on your demonstrated capacity for work. Your business will like that. Then, later, when your business changes its mind—which it probably will—you’ll be able to respond to change. Your business will like that even more. Wouldn’t you rather look good twice than look bad once? 

14. Scrum provides better support for estimation than waterfall ever did, and there does not have to be a trade off between agility and predictability. 

Some of the #NoEstimates discussion seems to interpret challenges to #NoEstimates as challenges to the entire ecosystem of Agile practices, especially Scrum. Many of the comments imply that predictability comes at the expense of agility. The examples cited to support that are mostly examples of unskilled misapplications of estimation practices, so I see them as additional examples of people not understanding estimation very well. 

The idea that we have to trade off agility to achieve predictability is a false trade off. In particular, if no one had ever uttered the word “agile,” I would still want to use Scrum because of its support for estimation and predictability. 

The combination of story pointing, product backlog, velocity calculation, short iterations, just-in-time sprint planning, and timely retrospectives after each sprint creates a nearly perfect context for effective estimation. Scrum provides better support for estimation than waterfall ever did. 

If a company truly is operating in a high uncertainty environment, Scrum can be an effective approach. In the more typical case in which a company is operating in a moderate uncertainty environment, Scrum is well-equipped to deal with the moderate level of uncertainty and provide high predictability (e.g., estimation) at the same time. 

15. There are contexts where estimates provide little value. 

I don’t estimate how long it will take me to eat dinner, because I know I’m going to eat dinner regardless of what the estimate says. If I have a defect that keeps taking down my production system, the business doesn’t need an estimate for that because the issue needs to get fixed whether it takes an hour, a day, or a week. 

The most common context I see where estimates are not done on an ongoing basis and truly provide little business value is online contexts, especially mobile, where the cycle times are measured in days or shorter, the business context is highly volatile, and the mission truly is, “Always do the next most useful thing with the resources available.” 

In both these examples, however, there is a point on the scale at which estimates become valuable. If the work on the production system stretches into weeks or months, the business is going to want and need an estimate. As the mobile app matures from one person working for a few days to a team of people working for a few weeks, with more customers depending on specific functionality, the business is going to want more estimates. Enjoy the #NoEstimates context while it lasts; don’t assume that it will last forever. 

16. This is not religion. We need to get more technical and economic about software discussions. 

I’ve seen #NoEstimates advocates treat these questions of requirements volatility, estimation effectiveness, and supposed tradeoffs between agility and predictability as value-laden moral discussions in which their experience with usually-bad requirements and usually-bad estimates calls for an iterative approach like pure Scrum, rather than a front-loaded approach like Scrum with a pre-populated product backlog. In these discussions, “Waterfall” is used as an invective, where the tone of the argument is often more moral than economic. That religion isn’t unique to Agile advocates, and I’ve seen just as much religion on the non-Agile sides of various discussions. I’ve appreciated my most recent discussion with Ron Jeffries because he hasn’t done that. It would be better for the industry at large if people could stay more technical and economic more often. 

For my part, software is not religion, and the ratio of work done up front on a software project is not a moral issue. If we assume professional-level skills in agile practices, requirements, and estimation, the decision about how much work to do up front should be an economic decision based on cost of change and value of predictability. If the environment is volatile enough, then it’s a bad economic decision to do lots of up front requirements work just to have a high percentage of requirements spoil before they can be implemented. If there’s little or no business value created by predictability, that also suggests that emphasizing up front estimation work would be a bad economic decision.

On the other hand, if the business does value predictability, then how we support that predictability should also be an economic decision. If we do a lot of the requirements work up front, and some requirements spoil, but most do not, and that supports improved predictability, and the business derives value from that, that would be a good economic choice. 

The economics of these decisions are affected by the skills of the people involved. If my team is great at Scrum but poor at estimation and requirements, the economics of up front vs. emergent will tilt one way. If my team is great at estimation and requirements but poor at Scrum, the economic might tilt the other way. 

Of course, skill sets are not divinely dictated or cast in stone; they can be improved through focused self-study and training. So we can treat the question of whether we should invest in developing additional skills as an economic issue too. 

What is the cost of training staff to reach proficiency in estimation and requirements? Does the cost of achieving proficiency exceed the likely benefits that would derive from proficiency? That goes back to the question of how much the business values predictability. If the business truly places no value on predictability, there’s won’t be any ROI from training staff in practices that support predictability. But I do not see that as the typical case. 

My company and I can train software professionals to become proficient in both requirements and estimation in about a week. In my experience most businesses place enough value on predictability that investing a week to make that option available provides a good ROI to the business. Note: this is about making the option available, not necessarily exercising the option on every project. 

My company and I can also train software professionals to become proficient in a full complement of Scrum and other Agile technical practices in about a week. That produces a good ROI too. In any given case, I would recommend both sets of training. If I had to recommend only one or the other, sometimes I would recommend starting with the Agile practices. But I wouldn’t recommend stopping with them. 

Skills development in practices that support predictability vs. practices that support agility is not an either/or decision. A truly agile business would be able to be flexible when needed, or predictable when needed. A true software professional will be most effective when skilled in both skill sets. 

17. Agility plus predictability is better than agility alone. 

If you think your business values agility only, ask your business what it values. Businesses vary, and you might work in a business that truly does value agility over predictability or that values agility exclusively. 

In some cases, businesses will value predictability over agility. Odds are that your business actually values both agility and predictability. The point is, ask the business, don’t just assume it’s one or the other. 

I think it’s self-evident that a business that has both agility and predictability will outperform a business that has agility only. We need to get past the either/or thinking that limits us to one set of skills or the other and embrace both/and thinking that leads us to develop the full set of skills needed to become true software professionals. 


#NoEstimates - Response to Ron Jeffries

10x Software Development - Steve McConnell - Fri, 07/31/2015 - 19:22

Ron Jeffries posted a thoughtful response to my #NoEstimates video. While I like some elements of his response, it still ultimately glosses over problems with #NoEstimates. 

I'll walk through Ron's critique and show where I think it makes good points vs. where it misses the point. 

Ron's First Remodel of my Kitchen Remodel Example

Ron describes a variation on my video's kitchen remodel story. (If Ron is modifying my original story, does that qualify as fan fic??? Cool!) He says the real #NoEstimates way to approach that kind of remodel would be for the contractor to say something like, "Let's divide your remodel up into areas, and we'll allocate $6,000 per area." The customer then says, "I need 15 linear feet of cabinets. What kind of cabinets can I get for $6,000?" Ron characterizes that as a "very answerable question." The contractor then goes through the rest of the project similarly. 

I like the idea of dividing the project into pieces, and I like the idea of budgeting each piece individually. But what makes it possible to break down the project into areas with budget amounts in each area? What makes it possible to know that we can deliver 15 linear feet of cabinets for $6,000? Ron says that question is "very answerable." What makes that question "very answerable?"  


Specifically, we can answer that question because we have lots of historical data about the cost of kitchen cabinets. As Ron says, "Here are pictures of $30,000 kitchens I've done in the past." That's historical data from completed past projects. As I discuss in my estimation book, historical data is the key to good estimation in general, whether kitchen cabinets or software. In software, Ron's example would called a "reference table of similar completed projects," which is a kind of "estimation by analogy." 

Far from supporting #NoEstimates, the example supports the value of collecting historical data so that you can use it in estimates. 

Ron's Second Remodel of My Kitchen Remodel Example

Ron presents a second modification of my scenario, this one based on the observation that kitchens involve physical material and software doesn't. "Kitchens are 'hard', Software is 'soft'."

(The whole hard vs. soft argument is a red herring. Yes, there are physical components in a kitchen remodel and there are few if any physical components in most software projects. So that's a difference. But even with the physical components in a kitchen remodel, the cost of labor is a major cost driver, just as it is in software, and more to the point, the labor cost is the primary source of uncertainty and risk in both cases. The presence of uncertainty and risk is the factor that makes estimation interesting in both cases. If there wasn't any uncertainty or risk, we could just look up the correct answer in a reference guide. Estimation would not present any challenges, and we would not need to write blog articles or create hashtags about it. So I think the contexts are more similar than different. Having said that, this issue really is beside the point.)

Ron goes on to say that, because software is soft, if the kitchen remodel was a software project we could just build it up $1,000 at a time, always doing the next most useful thing, and always leaving the kitchen in a state in which it can be used each day. The customer can inspect progress each day and give feedback on the direction. As we go along, if we see we don't really need $6,000 for a sink, we can redirect those funds to other areas, or just not spend them at all. If we get to the point where we've spent $20,000 and we're satisfied with where we are, we can can just stop, and we'll have saved $10,000. 

This sounds appealing and probably works in some cases, especially in cases where the people have done the same kind of work many, many times and have a well calibrated gut feel that they can do the whole project satisfactorily for $30,000. However, it also depends on available resources exceeding the resources needed to satisfy requirements. I would love to work in an environment that had excess resources, but my experience says that resources normally fall short of what is needed to satisfy requirements. 

When resources are not sufficient to satisfy the requirements, a less idealized version of Ron's scenario would go more like this: 

The contractor gets to work diligently working and spending $1,000 per day. The kitchen is indeed usable each day, and each day the customer agrees that the kitchen is incrementally better. After reaching the $15,000 mark, however, the customer says, "It doesn't look to me like we're halfway done with the work. I like each piece better than I did before, but we're no where near the end state I wanted." The contractor asks the customer for more detailed feedback and tries to adjust. The daily deliveries continue until the entire $30,000 is gone. 

The kitchen is better, and it is usable, but at the project retrospective the customer says, "None of the major parts are really what I wanted. If I'd known ahead of time that this approach would not get me what I wanted in any category, I would said, Do the appliances and the countertops, and that's all. That way I would at least have been satisfied with something. As it turned out, I'm not satisfied with anything." 

In this case, "collaboration" turned into "going down the drain together," which is not what anyone wanted. 

How do you avoid this outcome? You estimate the cost of each of the components. Or you give ranges of estimates and work with the customer to develop budgets for each area. Estimates and budgets help the customer prioritize, which is one of the more common reasons customers want estimates.  

Ron's Third Example 

Ron gives a third example in which he built a database product that no one had built before. There are ways to estimate that kind of work (more than you'd think, if you haven't received training in software estimation), but there is going to be more variability in those estimates, and if there are enough unknowns the variability might be high enough to make the estimates worthless. That's a better example of a case in which #NoEstimates might apply. But even then, I think #AskTheCustomer is a better position than #NoEstimates, or at least better than #AssumeNoEstimates, which is what #NoEstimates is often taken to imply. 


Ron's first example is based on expert estimation using historical data and directly supports #KnowWhenToEstimate. His example actually undermines #NoEstimates. 

Ron's second example assumes resources exceed what is needed to satisfy the requirements. When assumptions are adjusted to the more common condition of scarce resources, Ron's second example also supports the need for estimates. 

Ron closes with encouragement to get better at working within budgets (I agree!), collaborating with customers to identify budgets and similar constraints (I agree!). He also closes with the encouragement to get better at "giving an idea what we can do with that slice, for a slice of the budget"--I agree again, and we can only provide "giving an idea of what we can do with that slice" through estimation! 

None of this should be taken as a knock against decomposing into parts or building incrementally. Estimation by decomposition is a fundamental estimation approach. And I like the incremental emphasis in Ron's examples. It's just that, while building incrementally is good, building incrementally with predictability is even better. 

#NoHacked: How to avoid being the target of hackers

Google Code Blog - Fri, 07/31/2015 - 19:03

Originally posted by the Webmaster Central Blog.

If you publish anything online, one of your top priorities should be security. Getting hacked can negatively affect your online reputation and result in loss of critical and private data. Over the past year Google has noticed a 180% increase in the number of sites getting hacked. While we are working hard to combat this hacked trend, there are steps you can take to protect your content on the web.

This week, Google Webmasters has launched a second #NoHacked campaign. We’ll be focusing on how to protect your site from hacking and give you better insight into how some of these hacking campaigns work. You can follow along with #NoHacked on Twitter and Google+. We’ll also be wrapping up with a Google Hangout focused on security where you can ask our security experts questions.

We’re kicking off the campaign with some basic tips on how to keep your site safe on the web.

1. Strengthen your account security

Creating a password that’s difficult to guess or crack is essential to protecting your site. For example, your password might contain a mixture of letters, numbers, symbols, or be a passphrase. Password length is important. The longer your password, the harder it will be to guess. There are many resources on the web that can test how strong your password is. Testing a similar password to yours (never enter your actual password on other sites) can give you an idea of how strong your password is.

Also, it’s important to avoid reusing passwords across services. Attackers often try known username and password combinations obtained from leaked password lists or hacked services to compromise as many accounts as possible.

You should also turn on 2-Factor Authentication for accounts that offer this service. This can greatly increase your account’s security and protect you from a variety of account attacks. We’ll be talking more about the benefits of 2-Factor Authentication in two weeks.

2. Keep your site’s software updated

One of the most common ways for a hacker to compromise your site is through insecure software on your site. Be sure to periodically check your site for any outdated software, especially updates that patch security holes. If you use a web server like Apache, nginx or commercial web server software, make sure you keep your web server software patched. If you use a Content Management System (CMS) or any plug-ins or add-ons on your site, make sure to keep these tools updated with new releases. Also, sign up to the security announcement lists for your web server software and your CMS if you use one. Consider completely removing any add-ons or software that you don't need on your website -- aside from creating possible risks, they also might slow down the performance of your site.

3. Research how your hosting provider handles security issues

Your hosting provider’s policy for security and cleaning up hacked sites is in an important factor to consider when choosing a hosting provider. If you use a hosting provider, contact them to see if they offer on-demand support to clean up site-specific problems. You can also check online reviews to see if they have a track record of helping users with compromised sites clean up their hacked content.

If you control your own server or use Virtual Private Server (VPS) services, make sure that you’re prepared to handle any security issues that might arise. Server administration is very complex, and one of the core tasks of a server administrator is making sure your web server and content management software is patched and up to date. If you don't have a compelling reason to do your own server administration, you might find it well worth your while to see if your hosting provider offers a managed services option.

4. Use Google tools to stay informed of potential hacked content on your site

It’s important to have tools that can help you proactively monitor your site.The sooner you can find out about a compromise, the sooner you can work on fixing your site.

We recommend you sign up for Search Console if you haven’t already. Search Console is Google’s way of communicating with you about issues on your site including if we have detected hacked content. You can also set up Google Alerts on your site to notify you if there are any suspicious results for your site. For example, if you run a site selling pet accessories called, you can set up an alert for [ cheap software] to alert you if any hacked content about cheap software suddenly starts appearing on your site. You can set up multiple alerts for your site for different spammy terms. If you’re unsure what spammy terms to use, you can use Google to search for common spammy terms.

We hope these tips will keep your site safe on the web. Be sure to follow our social campaigns and share any tips or tricks you might have about staying safe on the web with the #NoHacked hashtag.

If you have any additional questions, you can post in the Webmaster Help Forums where a community of webmasters can help answer your questions. You can also join our Hangout on Air about Security on August 26th.

Posted by Eric Kuan, Webmaster Relations Specialist and Yuan Niu, Webspam Analyst

Categories: Programming

Get your hands on Android Studio 1.3

Android Developers Blog - Thu, 07/30/2015 - 23:55

Posted by Jamal Eason, Product Manager, Android

Previewed earlier this summer at Google I/O, Android Studio 1.3 is now available on the stable release channel. We appreciated the early feedback from those developers on our canary and beta channels to help ship a great product.

Android Studio 1.3 is our biggest feature release for the year so far, which includes a new memory profiler, improved testing support, and full editing and debugging support for C++. Let’s take a closer look.

New Features in Android Studio 1.3

Performance & Testing Tools

  • Android Memory (HPROF) Viewer

    Android Studio now allows you to capture and analyze memory snapshots in the native Android HPROF format.

  • Allocation Tracker

    In addition to displaying a table of memory allocations that your app uses, the updated allocation tracker now includes a visual way to view the your app allocations.

  • APK Tests in Modules

    For more flexibility in app testing, you now have the option to place your code tests in a separate module and use the new test plugin (‘’) instead of keeping your tests right next to your app code. This feature does require your app project to use the Gradle Plugin 1.3.

Code and SDK Management

  • App permission annotations

    Android Studio now has inline code annotation support to help you manage the new app permissions model in the M release of Android. Learn more about code annotations.

  • Data Binding Support

    New data brinding features allow you to create declarative layouts in order to minimize boilerplate code by binding your application logic into your layouts. Learn more about data binding.

  • SDK Auto Update & SDK Manager

    Managing Android SDK updates is now a part of the Android Studio. By default, Android Studio will now prompt you about new SDK & Tool updates. You can still adjust your preferences with the new & integrated Android SDK Manager.

  • C++ Support

    As a part of the Android 1.3 stable release, we included an Early Access Preview of the C++ editor & debugger support paired with an experimental build plugin. See the Android C++ Preview page for information on how to get started. Support for more complex projects and build configurations is in development, but let us know your feedback.

Time to Update

An important thing to remember is that an update to Android Studio does not require you to change your Android app projects. With updating, you get the latest features but still have control of which build tools and app dependency versions you want to use for your Android app.

For current developers on Android Studio, you can check for updates from the navigation menu. For new users, you can learn more about Android Studio on the product overview page or download the stable version from the Android Studio download site.

We are excited to launch this set of features in Android Studio and we are hard at work developing the next set of tools to make develop Android development easier on Android Studio. As always we welcome feedback on how we can help you. Connect with the Android developer tools team on Google+.

Join the discussion on

+Android Developers
Categories: Programming

Iterate faster on Google Play with improved beta testing

Android Developers Blog - Thu, 07/30/2015 - 23:53

Posted by Ellie Powers, Product Manager, Google Play

Today, Google Play is making it easier for you to manage beta tests and get your users to join them. Since we launched beta testing two years ago, developers have told us that it’s become a critical part of their workflow in testing ideas, gathering rapid feedback, and improving their apps. In fact, we’ve found that 80 percent of developers with popular apps routinely run beta tests as part of their workflow.

Improvements to managing a beta test in the Developer Console

Currently, the Google Play Developer Console lets developers release early versions of their app to selected users as an alpha or beta test before pushing updates to full production. The select user group downloads the app on Google Play as normal, but can’t review or rate it on the store. This gives you time to address bugs and other issues without negatively impacting your app listing.

Based on your feedback, we’re launching new features to more effectively manage your beta tests, and enable users to join with one click.

  • NEW! Open beta – Use an open beta when you want any user who has the link to be able to join your beta with just one click. One of the advantages of an open beta is that it allows you to scale to a large number of testers. However, you can also limit the maximum number of users who can join.
  • NEW! Closed beta using email addresses – If you want to restrict which users can access your beta, you have a new option: you can now set up a closed beta using lists of individual email addresses which you can add individually or upload as a .csv file. These users will be able to join your beta via a one-click opt-in link.
  • Closed beta with Google+ community or Google Group – This is the option that you’ve been using today, and you can continue to use betas with Google+ communities or Google Groups. You will also be able to move to an open beta while maintaining your existing testers.
How developers are finding success with beta testing

Beta testing is one of the fast iteration features of Google Play and Android that help drive success for developers like Wooga, the creators of hit games Diamond Dash, Jelly Splash, and Agent Alice. Find out more about how Wooga iterates on Android first from Sebastian Kriese, Head of Partnerships, and Pal Tamas Feher, Head of Engineering.

Kabam is a global leader in AAA quality mobile games developed in partnership with Hollywood studios for such franchises such as Fast & Furious, Marvel, Star Wars and The Hobbit. Beta testing helps Kabam engineers perfect the gameplay for Android devices before launch. “The ability to receive pointed feedback and rapidly reiterate via alpha/beta testing on Google Play has been extremely beneficial to our worldwide launches,” said Kabam VP Rob Oshima.

Matt Small, Co-Founder of Vector Unit recently told us how they’ve been using beta testing extensively to improve Beach Buggy Racing and uncover issues they may not have found otherwise. You can read Matt’s blog post about beta testing on Google Play on Gamasutra to hear about their experience. We’ve picked a few of Matt’s tips and shared them below:

  1. Limit more sensitive builds to a closed beta where you invite individual testers via email addresses. Once glaring problems are ironed out, publish your app to an open beta to gather feedback from a wider audience before going to production.
  2. Set expectations early. Let users know about the risks of beta testing (e.g. the software may not be stable) and tell them what you’re looking for in their feedback.
  3. Encourage critical feedback. Thank people when their criticisms are thoughtful and clearly explained and try to steer less-helpful feedback in a more productive direction.
  4. Respond quickly. The more people see actual responses from the game developer, the more encouraged they are to participate.
  5. Enable Google Play game services. To let testers access features like Achievements and Leaderboards before they are published, go into the Google Play game services testing panel and enable them.

We hope this update to beta testing makes it easier for you to test your app and gather valuable feedback and that these tips help you conduct successful tests. Visit the Developer Console Help Center to find out more about setting up beta testing for your app.

Join the discussion on

+Android Developers
Categories: Programming

Auto Backup for Apps made simple

Android Developers Blog - Thu, 07/30/2015 - 23:53

Posted by Wojtek Kaliciński, Developer Advocate, Android

Auto Backup for Apps makes seamless app data backup and restore possible with zero lines of application code. This feature will be available on Android devices running the upcoming M release. All you need to do to enable it for your app is update the targetSdkVersion to 23. You can test it now on the M Developer Preview, where we’ve enabled Auto Backup for all apps regardless of targetSdkVersion.

Auto Backup for Apps is provided by Google to both users and developers at no charge. Even better, the backup data stored in Google Drive does not count against the user's quota. Please note that data transferred may still incur charges from the user's cellular / internet provider.

What is Auto-Backup for Apps?

By default, for users that have opted in to backup, all of the data files of an app are automatically copied out to a user’s Drive. That includes databases, shared preferences and other content in the application’s private directory, up to a limit of 25 megabytes per app. Any data residing in the locations denoted by Context.getCacheDir(), Context.getCodeCacheDir() and Context.getNoBackupFilesDir() is excluded from backup. As for files on external storage, only those in Context.getExternalFilesDir() are backed up.

How to control what is backed up

You can customize what app data is available for backup by creating a backup configuration file in the res/xml folder and referencing it in your app’s manifest:


In the configuration file, specify <include/> or <exclude/> rules that you need to fine tune the behavior of the default backup agent. Please refer to a detailed explanation of the rules syntax available in the documentation.

What to exclude from backup

You may not want to have certain app data eligible for backup. For such data, please use one of the mechanisms above. For example:

  • You must exclude any device specific identifiers, either issued by a server or generated on the device. This includes the Google Cloud Messaging (GCM) registration token which, when restored to another device, can render your app on that device unable to receive GCM messages.
  • Consider excluding account credentials or other sensitive information information, e.g., by asking the user to reauthenticate the first time they launch a restored app rather than allowing for storage of such information in the backup.

With such a diverse landscape of apps, it’s important that developers consider how to maximise the benefits to the user of automatic backups. The goal is to reduce the friction of setting up a new device, which in most cases means transferring over user preferences and locally saved content.

For example, if you have the user’s account stored in shared preferences such that it can be restored on install, they won’t have to even think about which account they used to sign in with previously - they can submit their password and get going!

If you support a variety of log-ins (Google Sign-In and other providers, username/password), it’s simple to keep track of which log-in method was used previously so the user doesn’t have to.

Transitioning from key/value backups

If you have previously implemented the legacy, key/value backup by subclassing BackupAgent and setting it in your Manifest (android:backupAgent), you’re just one step away from transitioning to full-data backups. Simply add the android:fullBackupOnly="true" attribute on <application/>. This is ignored on pre-M versions of Android, meaning onBackup/onRestore will still be called, while on M+ devices it lets the system know you wish to use full-data backups while still providing your own BackupAgent.

You can use the same approach even if you’re not using key/value backups, but want to do any custom processing in onCreate(), onFullBackup() or be notified when a restore operation happens in onRestoreFinished(). Just remember to call super.onFullBackup() if you want to retain the system implementation of XML include/exclude rules handling.

What is the backup/restore lifecycle?

The data restore happens as part of the package installation, before the user has a chance to launch your app. Backup runs at most once a day, when your device is charging and connected to Wi-Fi. If your app exceeds the data limit (currently set at 25 MB), no more backups will take place and the last saved snapshot will be used for subsequent restores. Your app’s process is killed after a full backup happens and before a restore if you invoke it manually through the bmgr command (more about that below).

Test your apps now

Before you begin testing Auto Backup, make sure you have the latest M Developer Preview on your device or emulator. After you’ve installed your APK, use the adb shell command to access the bmgr tool.

Bmgr is a tool you can use to interact with the Backup Manager:

  • bmgr run schedules an immediate backup pass; you need to run this command once after installing your app on the device so that the Backup Manager has a chance to initialize properly
  • bmgr fullbackup <packagename> starts a full-data backup operation.
  • bmgr restore <packagename> restores previously backed up data

If you forget to invoke bmgr run, you might see errors in Logcat when trying the fullbackup and restore commands. If you are still having problems, make sure you have Backup enabled and a Google account set up in system Settings -> Backup & reset.

Learn more

You can find a sample application that shows how to use Auto Backup on our GitHub. The full documentation is available on

Join the Android M Developer Preview Community on Google+ for more information on Android M features and remember to report any bugs you find with Auto Backup in the bug tracker.

Join the discussion on

+Android Developers
Categories: Programming

[New eBook] Download The No-nonsense Guide to App Growth

Android Developers Blog - Thu, 07/30/2015 - 23:53

Originally posted on the AdMob Blog.

What’s the secret to rapid growth for your app?

Play Store or App Store optimization? A sophisticated paid advertising strategy? A viral social media campaign?

While all of these strategies could help you grow your user base, the foundation for rapid growth is much more basic and fundamental—you need an engaging app.

This handbook will walk you through practical ways to increase your app’s user engagement to help you eventually transition to growth. You’ll learn how to:

  • Pick the right metric to represent user engagement
  • Look at data to audit your app and find areas to fix
  • Promote your app after you’ve reached a healthy level of user engagement

Download a free copy here.

For more tips on app monetization, be sure to stay connected on all things AdMob by following our Twitter and Google+ pages.

Posted by Raj Ajrawat, Product Specialist, AdMob

Join the discussion on

+Android Developers
Categories: Programming

Lighting the way with BLE beacons

Android Developers Blog - Thu, 07/30/2015 - 23:52

Originally posted on the Google Developers blog.

Posted by Chandu Thota, Engineering Director and Matthew Kulick, Product Manager

Just like lighthouses have helped sailors navigate the world for thousands of years, electronic beacons can be used to provide precise location and contextual cues within apps to help you navigate the world. For instance, a beacon can label a bus stop so your phone knows to have your ticket ready, or a museum app can provide background on the exhibit you’re standing in front of. Today, we’re beginning to roll out a new set of features to help developers build apps using this technology. This includes a new open format for Bluetooth low energy (BLE) beacons to communicate with people’s devices, a way for you to add this meaningful data to your apps and to Google services, as well as a way to manage your fleet of beacons efficiently.

Eddystone: an open BLE beacon format

Working closely with partners in the BLE beacon industry, we’ve learned a lot about the needs and the limitations of existing beacon technology. So we set out to build a new class of beacons that addresses real-life use-cases, cross-platform support, and security.

At the core of what it means to be a BLE beacon is the frame format—i.e., a language—that a beacon sends out into the world. Today, we’re expanding the range of use cases for beacon technology by publishing a new and open format for BLE beacons that anyone can use: Eddystone. Eddystone is robust and extensible: It supports multiple frame types for different use cases, and it supports versioning to make introducing new functionality easier. It’s cross-platform, capable of supporting Android, iOS or any platform that supports BLE beacons. And it’s available on GitHub under the open-source Apache v2.0 license, for everyone to use and help improve.

By design, a beacon is meant to be discoverable by any nearby Bluetooth Smart device, via its identifier which is a public signal. At the same time, privacy and security are really important, so we built in a feature called Ephemeral Identifiers (EIDs) which change frequently, and allow only authorized clients to decode them. EIDs will enable you to securely do things like find your luggage once you get off the plane or find your lost keys. We’ll publish the technical specs of this design soon.

Eddystone for developers: Better context for your apps

Eddystone offers two key developer benefits: better semantic context and precise location. To support these, we’re launching two new APIs. The Nearby API for Android and iOS makes it easier for apps to find and communicate with nearby devices and beacons, such as a specific bus stop or a particular art exhibit in a museum, providing better context. And the Proximity Beacon API lets developers associate semantic location (i.e., a place associated with a lat/long) and related data with beacons, stored in the cloud. This API will also be used in existing location APIs, such as the next version of the Places API.

Eddystone for beacon manufacturers: Single hardware for multiple platforms

Eddystone’s extensible frame formats allow hardware manufacturers to support multiple mobile platforms and application scenarios with a single piece of hardware. An existing BLE beacon can be made Eddystone compliant with a simple firmware update. At the core, we built Eddystone as an open and extensible protocol that’s also interoperable, so we’ll also introduce an Eddystone certification process in the near future by closely working with hardware manufacturing partners. We already have a number of partners that have built Eddystone-compliant beacons.

Eddystone for businesses: Secure and manage your beacon fleet with ease

As businesses move from validating their beacon-assisted apps to deploying beacons at scale in places like stadiums and transit stations, hardware installation and maintenance can be challenging: which beacons are working, broken, missing or displaced? So starting today, beacons that implement Eddystone’s telemetry frame (Eddystone-TLM) in combination with the Proximity Beacon API’s diagnostic endpoint can help deployers monitor their beacons’ battery health and displacement—common logistical challenges with low-cost beacon hardware.

Eddystone for Google products: New, improved user experiences

We’re also starting to improve Google’s own products and services with beacons. Google Maps launched beacon-based transit notifications in Portland earlier this year, to help people get faster access to real-time transit schedules for specific stations. And soon, Google Now will also be able to use this contextual information to help prioritize the most relevant cards, like showing you menu items when you’re inside a restaurant.

We want to make beacons useful even when a mobile app is not available; to that end, the Physical Web project will be using Eddystone beacons that broadcast URLs to help people interact with their surroundings.

Beacons are an important way to deliver better experiences for users of your apps, whether you choose to use Eddystone with your own products and services or as part of a broader Google solution like the Places API or Nearby API. The ecosystem of app developers and beacon manufacturers is important in pushing these technologies forward and the best ideas won’t come from just one company, so we encourage you to get some Eddystone-supported beacons today from our partners and begin building!

Join the discussion on

+Android Developers
Categories: Programming

Connect With the World Around You Through Nearby APIs

Android Developers Blog - Thu, 07/30/2015 - 23:52

Originally posted on the Google Developers blog.

Posted by Akshay Kannan, Product Manager

Mobile phones have made it easy to communicate with anyone, whether they’re right next to you or on the other side of the world. The great irony, however, is that those interactions can often feel really awkward when you're sitting right next to someone.

Today, it takes several steps -- whether it’s exchanging contact information, scanning a QR code, or pairing via bluetooth -- to get a simple piece of information to someone right next to you. Ideally, you should be able to just turn to them and do so, the same way you do in the real world.

This is why we built Nearby. Nearby provides a proximity API, Nearby Messages, for iOS and Android devices to discover and communicate with each other, as well as with beacons.

Nearby uses a combination of Bluetooth, Wi-Fi, and inaudible sound (using the device’s speaker and microphone) to establish proximity. We’ve incorporated Nearby technology into several products, including Chromecast Guest Mode, Nearby Players in Google Play Games, and Google Tone.

With the latest release of Google Play services 7.8, the Nearby Messages API becomes available to all developers across iOS and Android devices (Gingerbread and higher). Nearby doesn’t use or require a Google Account. The first time an app calls Nearby, users get a permission dialog to grant that app access.

A few of our partners have built creative experiences to show what's possible with Nearby.

Edjing Pro uses Nearby to let DJs publish their tracklist to people around them. The audience can vote on tracks that they like, and their votes are updated in realtime.

Trello uses Nearby to simplify sharing. Share a Trello board to the people around you with a tap of a button.

Pocket Casts uses Nearby to let you find and compare podcasts with people around you. Open the Nearby tab in Pocket Casts to view a list of podcasts that people around you have, as well as podcasts that you have in common with others.

Trulia uses Nearby to simplify the house hunting process. Create a board and use Nearby to make it easy for the people around you to join it.

To learn more, visit

Join the discussion on

+Android Developers
Categories: Programming


10x Software Development - Steve McConnell - Thu, 07/30/2015 - 22:13

I've posted a YouTube video that gives my perspective on #NoEstimates. 

This is in the new Construx Brain Casts video series. 


How Do I Come Up With A Profitable Side Project Idea?

Making the Complex Simple - John Sonmez - Thu, 07/30/2015 - 15:00

In this episode, I’m finally going to answer a question about side projects. Full transcript: John:               Hey, John Sonmez from I got a question. I’ve been getting asked this a lot, so I’m finally going to answer this. I’ll probably do a blogpost at some point on this as well, but I always mention […]

The post How Do I Come Up With A Profitable Side Project Idea? appeared first on Simple Programmer.

Categories: Programming

Doing Terrible Things To Your Code

Coding Horror - Jeff Atwood - Thu, 07/30/2015 - 10:31

In 1992, I thought I was the best programmer in the world. In my defense, I had just graduated from college, this was pre-Internet, and I lived in Boulder, Colorado working in small business jobs where I was lucky to even hear about other programmers much less meet them.

I eventually fell in with a guy named Bill O'Neil, who hired me to do contract programming. He formed a company with the regrettably generic name of Computer Research & Technologies, and we proceeded to work on various gigs together, building line of business CRUD apps in Visual Basic or FoxPro running on Windows 3.1 (and sometimes DOS, though we had a sense by then that this new-fangled GUI thing was here to stay).

Bill was the first professional programmer I had ever worked with. Heck, for that matter, he was the first programmer I ever worked with. He'd spec out some work with me, I'd build it in Visual Basic, and then I'd hand it over to him for review. He'd then calmly proceed to utterly demolish my code:

  • Tab order? Wrong.
  • Entering a number instead of a string? Crash.
  • Entering a date in the past? Crash.
  • Entering too many characters? Crash.
  • UI element alignment? Off.
  • Does it work with unusual characters in names like, say, O'Neil? Nope.

One thing that surprised me was that the code itself was rarely the problem. He occasionally had some comments about the way I wrote or structured the code, but what I clearly had no idea about is testing my code.

I dreaded handing my work over to him for inspection. I slowly, painfully learned that the truly difficult part of coding is dealing with the thousands of ways things can go wrong with your application at any given time – most of them user related.

That was my first experience with the buddy system, and thanks to Bill, I came out of that relationship with a deep respect for software craftsmanship. I have no idea what Bill is up to these days, but I tip my hat to him, wherever he is. I didn't always enjoy it, but learning to develop discipline around testing (and breaking) my own stuff unquestionably made me a better programmer.

It's tempting to lay all this responsibility at the feet of the mythical QA engineer.

QA Engineer walks into a bar. Orders a beer. Orders 0 beers. Orders 999999999 beers. Orders a lizard. Orders -1 beers. Orders a sfdeljknesv.

— Bill Sempf (@sempf) September 23, 2014

If you are ever lucky enough to work with one, you should have a very, very healthy fear of professional testers. They are terrifying. Just scan this "Did I remember to test" list and you'll be having the worst kind of flashbacks in no time. Did I mention that's the abbreviated version of his list?

I believe a key turning point in every professional programmer's working life is when you realize you are your own worst enemy, and the only way to mitigate that threat is to embrace it. Act like your own worst enemy. Break your UI. Break your code. Do terrible things to your software.

This means programmers need a good working knowledge of at least the common mistakes, the frequent cases that average programmers tend to miss, to work against. You are tester zero. This is your responsibility.

Let's start with Patrick McKenzie's classic Falsehoods Programmers Believe about Names:

  1. People have exactly one canonical full name.
  2. People have exactly one full name which they go by.
  3. People have, at this point in time, exactly one canonical full name.
  4. People have, at this point in time, one full name which they go by.
  5. People have exactly N names, for any value of N.
  6. People’s names fit within a certain defined amount of space.
  7. People’s names do not change.
  8. People’s names change, but only at a certain enumerated set of events.
  9. People’s names are written in ASCII.
  10. People’s names are written in any single character set.

That's just the first 10. There are thirty more. Plus a lot in the comments if you're in the mood for extra credit. Or, how does Falsehoods Programmers Believe About Time grab you?

  1. There are always 24 hours in a day.
  2. Months have either 30 or 31 days.
  3. Years have 365 days.
  4. February is always 28 days long.
  5. Any 24-hour period will always begin and end in the same day (or week, or month).
  6. A week always begins and ends in the same month.
  7. A week (or a month) always begins and ends in the same year.
  8. The machine that a program runs on will always be in the GMT time zone.
  9. Ok, that’s not true. But at least the time zone in which a program has to run will never change.
  10. Well, surely there will never be a change to the time zone in which a program has to run in production.
  11. The system clock will always be set to the correct local time.
  12. The system clock will always be set to a time that is not wildly different from the correct local time.
  13. If the system clock is incorrect, it will at least always be off by a consistent number of seconds.
  14. The server clock and the client clock will always be set to the same time.
  15. The server clock and the client clock will always be set to around the same time.

Are there more? Of course there are! There's even a whole additional list of stuff he forgot when he put that giant list together.

Catastrophic Error - User attempted to use program in the manner program was meant to be used

I think you can see where this is going. This is programming. We do this stuff for fun, remember?

But in true made-for-TV fashion, wait, there's more! Seriously, guys, where are you going? Get back here. We have more awesome failure states to learn about:

At this point I wouldn't blame you if you decided to quit programming altogether. But I think it's better if we learn to do for each other what Bill did for me, twenty years ago — teach less experienced developers that a good programmer knows they have to do terrible things to their code. Do it because if you don't, I guarantee you other people will, and when they do, they will either walk away or create a support ticket. I'm not sure which is worse.

[advertisement] Find a better job the Stack Overflow way - what you need when you need it, no spam, and no scams.
Categories: Programming

Neo4j: Cypher – Removing consecutive duplicates

Mark Needham - Thu, 07/30/2015 - 07:23

When writing Cypher queries I sometimes find myself wanting to remove consecutive duplicates in collections that I’ve joined together.

e.g we might start with the following query where 1 and 7 appear consecutively:

RETURN [1,1,2,3,4,5,6,7,7,8] AS values
==> +-----------------------+
==> | values                |
==> +-----------------------+
==> | [1,1,2,3,4,5,6,7,7,8] |
==> +-----------------------+
==> 1 row

We want to end up with [1,2,3,4,5,6,7,8]. We can start by exploding our array and putting consecutive elements next to each other:

WITH [1,1,2,3,4,5,6,7,7,8] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
RETURN idx, idx+1, values[idx], values[idx+1]
==> +-------------------------------------------+
==> | idx | idx+1 | values[idx] | values[idx+1] |
==> +-------------------------------------------+
==> | 0   | 1     | 1           | 1             |
==> | 1   | 2     | 1           | 2             |
==> | 2   | 3     | 2           | 3             |
==> | 3   | 4     | 3           | 4             |
==> | 4   | 5     | 4           | 5             |
==> | 5   | 6     | 5           | 6             |
==> | 6   | 7     | 6           | 7             |
==> | 7   | 8     | 7           | 7             |
==> | 8   | 9     | 7           | 8             |
==> +-------------------------------------------+
==> 9 rows

Next we can filter out rows which have the same values since that means they have consecutive duplicates:

WITH [1,1,2,3,4,5,6,7,7,8] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
WITH values[idx] AS a, values[idx+1] AS b
WHERE a <> b
==> +-------+
==> | a | b |
==> +-------+
==> | 1 | 2 |
==> | 2 | 3 |
==> | 3 | 4 |
==> | 4 | 5 |
==> | 5 | 6 |
==> | 6 | 7 |
==> | 7 | 8 |
==> +-------+
==> 7 rows

Now we need to join the collection back together again. Most of the values we want are in field ‘b’ but we also need to grab the first value from field ‘a':

WITH [1,1,2,3,4,5,6,7,7,8] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
WITH values[idx] AS a, values[idx+1] AS b
WHERE a <> b
RETURN COLLECT(a)[0] + COLLECT(b) AS noDuplicates
==> +-------------------+
==> | noDuplicates      |
==> +-------------------+
==> | [1,2,3,4,5,6,7,8] |
==> +-------------------+
==> 1 row

What about if we have more than 2 duplicates in a row?

WITH [1,1,1,2,3,4,5,5,6,7,7,8] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
WITH values[idx] AS a, values[idx+1] AS b
WHERE a <> b
RETURN COLLECT(a)[0] + COLLECT(b) AS noDuplicates
==> +-------------------+
==> | noDuplicates      |
==> +-------------------+
==> | [1,2,3,4,5,6,7,8] |
==> +-------------------+
==> 1 row

Still happy, good times! Of course if we have a non consecutive duplicate that wouldn’t be removed:

WITH [1,1,1,2,3,4,5,5,6,7,7,8,1] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
WITH values[idx] AS a, values[idx+1] AS b
WHERE a <> b
RETURN COLLECT(a)[0] + COLLECT(b) AS noDuplicates
==> +---------------------+
==> | noDuplicates        |
==> +---------------------+
==> | [1,2,3,4,5,6,7,8,1] |
==> +---------------------+
==> 1 row
Categories: Programming

What is Insight?

"A moment's insight is sometimes worth a life's experience." -- Oliver Wendell Holmes, Sr.

Some say we’re in the Age of Insight.  Others say insight is the new currency in the Digital Economy.

And still others say that insight is the backbone of innovation.

Either way, we use “insight” an awful lot without talking about what insight actually is.

So, what is insight?

I thought it was time to finally do a deeper dive on what insight actually is.  Here is my elaboration of “insight” on Sources of Insight:


You can think of it as “insight explained.”

The simple way that I think of insight, or those “ah ha” moments, is by remembering a question Ward Cunningham uses a lot:

“What did you learn that you didn’t expect?” or “What surprised you?”

Ward uses these questions to reveal insights, rather than have somebody tell him a bunch of obvious or uneventful things he already knows.  For example, if you ask somebody what they learned at their presentation training, they’ll tell you that they learned how to present more effectively, speak more confidently, and communicate their ideas better.

No kidding.

But if you instead ask them, “What did you learn that you didn’t expect?” they might actually reveal some insight and say something more like this:

“Even though we say don’t shoot the messenger all the time, you ARE the message.”


“If you win the heart, the mind follows.”

It’s the non-obvious stuff, that surprises you (at least at first).  Or sometimes, insight strikes us as something that should have been obvious all along and becomes the new obvious, or the new normal.

Ward used this insights gathering technique to more effectively share software patterns.  He wanted stories and insights from people, rather than descriptions of the obvious.

I’ve used it myself over the years and it really helps get to deeper truths.  If you are a truth seeker or a lover of insights, you’ll enjoy how you can tease out more insights, just by changing your questions.   For example, if you have kids, don’t ask, “How was your day?”   Ask them, “What was the favorite part of your day?” or “What did you learn that surprised you?”

Wow, I now this is a short post, but I almost left without defining insight.

According to the dictionary, insight is “The capacity to gain an accurate and deep intuitive understanding of a person or thing.”   Or you may see insight explained as inner sight, mental vision, or wisdom.

I like Edward de Bono’s simple description of insight as “Eureka moments.”

Some people count steps in their day.  I count my “ah-ha” moments.  After all, the most important ingredient of effective ideation and innovation is …yep, you guessed it – insight!

For a deeper dive on the power of insight, read my page on Insight explained, on Sources Of

Categories: Architecture, Programming

SE-Radio Episode 233: Fangjin Yang on OLAP and the Druid Real-Time Analytical Data Store

Fangjin Yang, creator of the Druid real-time analytical database, talks with Robert Blumen. They discuss the OLAP (online analytical processing) domain, OLAP concepts (hypercube, dimension, metric, and pivot), types of OLAP queries (roll-up, drill-down, and slicing and dicing), use cases for OLAP by organizations, the OLAP store’s position in the enterprise workflow, what “real time” […]
Categories: Programming

Neo4j: MERGE’ing on super nodes

Mark Needham - Tue, 07/28/2015 - 22:04

In my continued playing with the Chicago crime data set I wanted to connect the crimes committed to their position in the FBI crime type hierarchy.

These are the sub graphs that I want to connect:

2015 07 26 22 19 04

We have a ‘fbiCode’ on each ‘Crime’ node which indicates which ‘Crime Sub Category’ the crime belongs to.

I started with the following query to connect the nodes together:

MATCH (crime:Crime)
WITH crime SKIP {skip} LIMIT 10000
MATCH (subCat:SubCategory {code: crime.fbiCode})
MERGE (crime)-[:CATEGORY]->(subCat)
RETURN COUNT(*) AS crimesProcessed

I had this running inside a Python script which incremented ‘skip’ by 10,000 on each iteration as long as ‘crimesProcessed’ came back with a value > 0.

To start with the ‘CATEGORY’ relationships were being created very quickly but it slowed down quite noticeably about 1 million nodes in.

I profiled the queries but the query plans didn’t show anything obviously wrong. My suspicion was that I had a super node problem where the cypher run time was iterating through all of the sub category’s relationships to check whether one of them pointed to the crime on the other side of the ‘MERGE’ statement.

I cancelled the import job and wrote a query to check how many relationships each sub category had. It varied from 1,000 to 93,000 somewhat confirming my suspicion.

Michael suggested tweaking the query to use the shortestpath function to check for the existence of the relationship and then use the ‘CREATE’ clause to create it if it didn’t exist.

The neat thing about the shortestpath function is that it will start from the side with the lowest cardinality and as soon as it finds a relationship it will stop searching. Let’s have a look at that version of the query:

MATCH (crime:Crime)
WITH crime SKIP {skip} LIMIT 10000
MATCH (subCat:SubCategory {code: crime.fbiCode})
WITH crime, subCat, shortestPath((crime)-[:CATEGORY]->(subCat)) AS path
  CREATE (crime)-[:CATEGORY]->(subCat))

This worked much better – 10,000 nodes processed in ~ 2.5 seconds – and the time remained constant as more relationships were added. This allowed me to create all the category nodes but we can actually do even better if we use CREATE UNIQUE instead of MERGE

MATCH (crime:Crime)
WITH crime SKIP {skip} LIMIT 10000
MATCH (subCat:SubCategory {code: crime.fbiCode})
CREATE UNIQUE (crime)-[:CATEGORY]->(subCat)
RETURN COUNT(*) AS crimesProcessed

Using this query 10,000 nodes took ~ 250ms -900ms second to process which means we can process all the nodes in 5-6 minutes – good times!

I’m not super familiar with the ‘CREATE UNIQUE’ code so I’m not sure that it’s always a good substitute for ‘MERGE’ but on this occasion it does the job.

The lesson for me here is that if a query is taking longer than you think it should try and use other constructs / a combination of other constructs and see whether things improve – they just might!

Categories: Programming