
Software Development Blogs: Programming, Software Testing, Agile Project Management

Methods & Tools


Programming

Fearless Speaking

“Do one thing every day that scares you.” ― Eleanor Roosevelt

I did a deep dive book review.

This time, I reviewed Fearless Speaking.

The book is more than meets the eye.

It’s actually a wealth of personal development skills at your fingertips and it’s a powerful way to grow your personal leadership skills.

In fact, there are almost fifty exercises throughout the book.

Here’s an example of one of the techniques …

Spotlight Technique #1

When you’re overly nervous and anxious as a public speaker, you place yourself in a ‘third degree’ spotlight. That’s the name for the harsh bright light police detectives used in days gone by to ‘sweat’ a suspect and elicit a confession. An interrogation room was always otherwise dimly lit, so the source of light trained on the person (who was usually forced to sit in a hard, straight-backed chair) was unrelenting.

This spotlight is always harsh, hot, and uncomfortable – and the truth is, you voluntarily train it on yourself by believing your audience is unforgiving.  The larger the audience, the more likely you believe that to be true.

So here’s a technique to get out from under this hot spotlight that you’re imagining so vividly: turn it around! Visualize swiveling the spotlight so it’s aimed at your audience instead of you. After all, aren’t you supposed to illuminate your listeners? You don’t want to leave them in the dark, do you?

There’s no doubt that it’s cooler and much more comfortable when you’re out from under that harsh light. The added benefit is that now the light is shining on your listeners – without question the most important people in the room or auditorium!

I like that there are so many exercises and techniques to choose from.   Many of them don’t fit my style, but there were several that exposed me to new ways of thinking and new ideas to try.

And what’s especially great is knowing that these exercises come from professional actors and speakers – it’s like an insider’s guide at your fingertips.

My book review on Fearless Speaking includes a list of all the exercises, the chapters at a glance, key features from the book, and a few of my favorite highlights (sort of like a movie trailer for the book).

You Might Also Like

7 Habits of Highly Effective People at a Glance

347 Personal Effectiveness Articles to Help You Change Your Game

Effectiveness Blog Post Roundup

Categories: Architecture, Programming

Transform the input before indexing in elasticsearch

Gridshore - Sat, 07/26/2014 - 07:51

Sometimes you are indexing data and have little or even no influence on the input. Still you need to make changes: you want other content, other fields, or maybe you even want to remove fields. Elasticsearch 1.3 introduces a new feature called Transform. In this blogpost I am going to show some aspects of this new feature.

Insert the document with the problem

The input we get is coming from a system that puts the string null in a field if it is empty. We do not want null as a string in the elasticsearch index. Therefore we want to remove this field completely when indexing such a document. We start with the example and the proof that you can search on the field.

PUT /transform/simple/1
{
  "title":"This is a document with text",
  "description":"null"
}

Now search for the word null in the description.
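
A minimal sketch of such a search, assuming a plain match query on the description field of the index we just created, looks like this:

GET /transform/simple/_search
{
  "query": {
    "match": {
      "description": "null"
    }
  }
}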

For completeness I’ll show you the response as well.

Response:
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.30685282,
      "hits": [
         {
            "_index": "transform",
            "_type": "simple",
            "_id": "1",
            "_score": 0.30685282,
            "_source": {
               "title": "This is a document with text",
               "description": "null"
            }
         }
      ]
   }
}
Change mapping to contain transform

Next we are going to use the transform functionality to remove the field if it contains the string null. To do that we need to remove the index and create a new mapping that contains the transform. We use the groovy language for the script. Beware that the script is only validated when the first document is inserted.
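
Removing the existing index first is a plain index delete (shown here as a sketch; note that it also wipes the document we just indexed, so it has to be inserted again afterwards):

DELETE /transform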

PUT /transform
{
  "mappings": {
    "simple": {
      "transform": {
        "lang":"groovy",
        "script":"if (ctx._source['description']?.equals('null')) ctx._source['description'] = null"
      },
      "properties": {
        "title": {
          "type": "string"
        },
        "description": {
          "type": "string"
        }
      }
    }
  }
}

When we insert the same document as before and execute the same query, we no longer get any hits. The description field is no longer indexed. An important aspect is that the actual _source is not changed: when requesting the _source of the document you still get back the original document.

GET transform/simple/1/_source
Response:
{
   "title": "This is a document with text",
   "description": "null"
}
Add a field to the mapping

To add a bit more complexity, we add a field called nullField which will contain the name of the field that was null. Not very useful, but it serves to show the possibilities.

PUT /transform
{
  "mappings": {
    "simple": {
      "transform": {
        "lang":"groovy",
        "script":"if (ctx._source['description']?.equals('null')) {ctx._source['description'] = null;ctx._source['nullField'] = 'description';}"
      },
      "properties": {
        "title": {
          "type": "string"
        },
        "description": {
          "type": "string"
        },
        "nullField": {
          "type": "string"
        }
      }
    }
  }
}

Notice that the script has changed: not only do we remove the description field, we now also add a new field called nullField. Check that the _source is still not changed. Now we do a search and only return the fields description and nullField. Before scrolling to the response, think about the response that you would expect.

GET /transform/_search
{
  "query": {
    "match_all": {}
  },
  "fields": ["nullField","description"]
}

Did you really think about it? Try it out and notice that the nullField is not returned. That is because we did not store it in the index and it is not obtained from the source. So if we really need this value, we can store the nullField in the index and we are fine.

PUT /transform
{
  "mappings": {
    "simple": {
      "transform": {
        "lang":"groovy",
        "script":"if (ctx._source['description']?.equals('null')) {ctx._source['description'] = null;ctx._source['nullField'] = 'description';}"
      },
      "properties": {
        "title": {
          "type": "string"
        },
        "description": {
          "type": "string"
        },
        "nullField": {
          "type": "string",
          "store": "yes"
        }
      }
    }
  }
}

Then, with the match_all query asking for the two fields, we get the following response.

GET /transform/_search
{
  "query": {
    "match_all": {}
  },
  "fields": ["nullField","description"]
}
Response:
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "transform",
            "_type": "simple",
            "_id": "1",
            "_score": 1,
            "fields": {
               "description": [
                  "null"
               ],
               "nullField": [
                  "description"
               ]
            }
         }
      ]
   }
}

Yes, now we do have the new field. That is it, but wait, there is more you need to know: there is a way to check what is actually passed to the index for a certain document.

GET transform/simple/1?pretty&_source_transform
Result:
{
   "_index": "transform",
   "_type": "simple",
   "_id": "1",
   "_version": 1,
   "found": true,
   "_source": {
      "description": null,
      "nullField": "description",
      "title": "This is a document with text"
   }
}

Notice the null description and the nullField in the _source.

Final remark

You cannot update the transform part of a mapping: think about what would happen to your index if some documents had passed through version 1 of the transform and others through version 2.

I would be careful with this feature: try to fix the input before sending it to elasticsearch. But maybe you have just the use case for this feature; now you know it exists.

In my next blogpost I dive a little bit deeper into the scripting module.

The post Transform the input before indexing in elasticsearch appeared first on Gridshore.

Categories: Architecture, Programming

R: ggplot – Plotting back to back charts using facet_wrap

Mark Needham - Fri, 07/25/2014 - 22:57

Earlier in the week I showed a way to plot back to back charts using R’s ggplot library but looking back on the code it felt like it was a bit hacky to ‘glue’ two charts together using a grid.

I wanted to find a better way.

To recap, in that post I came up with the following charts showing the RSVPs to Neo4j London meetup events.
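
The two plots there were glued together with the grid approach mentioned above. A rough sketch of that kind of code, assuming the gridExtra package and the allYesRSVPs/allNoRSVPs data frames from that post (so not necessarily the exact original), looks like this:

library(ggplot2)
library(gridExtra)
 
yesPlot = ggplot(allYesRSVPs, aes(x = difference)) +
  geom_histogram(binwidth = 1, fill = "green")
noPlot = ggplot(allNoRSVPs, aes(x = difference)) +
  geom_histogram(binwidth = 1, fill = "red") +
  scale_y_reverse()
 
# stack the two separately built plots in a single display
grid.arrange(yesPlot, noPlot, nrow = 2)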

[Charts: the back to back 'yes' and 'no' RSVP plots from the earlier post]

The first thing we need to do to simplify chart generation is to return ‘yes’ and ‘no’ responses in the same cypher query, like so:

timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
 
query = "MATCH (e:Event)<-[:TO]-(response {response: 'yes'})
         WITH e, COLLECT(response) AS yeses
         MATCH (e)<-[:TO]-(response {response: 'no'})<-[:NEXT]-()
         WITH e, COLLECT(response) + yeses AS responses
         UNWIND responses AS response
         RETURN response.time AS time, e.time + e.utc_offset AS eventTime, response.response AS response"
allRSVPs = cypher(graph, query)
allRSVPs$time = timestampToDate(allRSVPs$time)
allRSVPs$eventTime = timestampToDate(allRSVPs$eventTime)
allRSVPs$difference = as.numeric(allRSVPs$eventTime - allRSVPs$time, units="days")

The query is a bit more involved because we only want to capture the ‘no’ responses from people who initially said yes, which is why we check for a ‘NEXT’ relationship when looking for the negative responses.

Let’s inspect allRSVPs:

> allRSVPs[1:10,]
                  time           eventTime response difference
1  2014-06-13 21:49:20 2014-07-22 18:30:00       no   38.86157
2  2014-07-02 22:24:06 2014-07-22 18:30:00      yes   19.83743
3  2014-05-23 23:46:02 2014-07-22 18:30:00      yes   59.78053
4  2014-06-23 21:07:11 2014-07-22 18:30:00      yes   28.89084
5  2014-06-06 15:09:29 2014-07-22 18:30:00      yes   46.13925
6  2014-05-31 13:03:09 2014-07-22 18:30:00      yes   52.22698
7  2014-05-23 23:46:02 2014-07-22 18:30:00      yes   59.78053
8  2014-07-02 12:28:22 2014-07-22 18:30:00      yes   20.25113
9  2014-06-30 23:44:39 2014-07-22 18:30:00      yes   21.78149
10 2014-06-06 15:35:53 2014-07-22 18:30:00      yes   46.12091

We’ve returned the actual response with each row so that we can distinguish between responses. It will also come in useful for pivoting our single chart later on.

The next step is to get ggplot to generate our side by side charts. I started off by plotting both types of response on the same chart:

ggplot(allRSVPs, aes(x = difference, fill=response)) + 
  geom_bar(binwidth=1)

[Chart: 'yes' and 'no' responses stacked in a single histogram]

This one stacks the ‘yes’ and ‘no’ responses on top of each other which isn’t what we want as it’s difficult to compare the two.

What we need is the facet_wrap function which allows us to generate multiple charts grouped by key. We’ll group by ‘response’:

ggplot(allRSVPs, aes(x = difference, fill=response)) + 
  geom_bar(binwidth=1) + 
  facet_wrap(~ response, nrow=2, ncol=1)

[Chart: separate 'no' and 'yes' panels generated by facet_wrap]

The only thing we’re missing now is the red and green colours which is where the scale_fill_manual function comes in handy:

ggplot(allRSVPs, aes(x = difference, fill=response)) + 
  scale_fill_manual(values=c("#FF0000", "#00FF00")) + 
  geom_bar(binwidth=1) +
  facet_wrap(~ response, nrow=2, ncol=1)

[Chart: the facetted panels with the red/green colour scale applied]

If we want to show the ‘yes’ chart on top we can pass in an extra parameter to facet_wrap to change where it places the highest value:

ggplot(allRSVPs, aes(x = difference, fill=response)) + 
  scale_fill_manual(values=c("#FF0000", "#00FF00")) + 
  geom_bar(binwidth=1) +
  facet_wrap(~ response, nrow=2, ncol=1, as.table = FALSE)

[Chart: the same facetted panels with the 'yes' chart on top]

We could go one step further and group by response and day. First let’s add a ‘day’ column to our data frame:

allRSVPs$dayOfWeek = format(allRSVPs$eventTime, "%A")

And now let’s plot the charts using both columns:

ggplot(allRSVPs, aes(x = difference, fill=response)) + 
  scale_fill_manual(values=c("#FF0000", "#00FF00")) + 
  geom_bar(binwidth=1) +
  facet_wrap(~ response + dayOfWeek, as.table = FALSE)

[Chart: panels facetted by response and day of week]

The distribution of dropouts looks fairly similar for all the days – Thursday is just an order of magnitude below the other days because we haven’t run many events on Thursdays so far.

At a glance it doesn’t appear that so many people sign up for Thursday events on the day or one day before.

One potential hypothesis is that people have things planned for Thursday whereas they decide more last minute what to do on the other days.

We’ll have to run some more events on Thursdays to see whether that trend holds out.

The code is on github if you want to play with it.

Categories: Programming

Marketing scrum vs IT scrum - a report published and presented at agile 2014

Xebia Blog - Fri, 07/25/2014 - 17:49

As we know, Scrum is the perfect framework for IT / software development projects to learn, adapt to change and deliver great software of value, faster.

But is Scrum also usable outside of software development? Can we apply similar or maybe even the same principles in other departments in the enterprise?

Yes, we can! And yes there are differences but there are also a lot of similarities.

We (Remco and I) successfully implemented Scrum in the marketing departments of two large companies: the ANWB and ING Bank. Both companies are now using Scrum for the development of new campaigns, their full commercial expressions and even at the product development level. They wanted a faster time to market, more ownership, and greater innovation. How did we approach and realize a transition with those goals in the marketing environment? And what are the results?

So when we are not delivering software but other things, how does Scrum change? Well, a great deal actually. The people working in these other departments are, in general, quite different from those in software development (and yes, more than you would expect). This means coaches or change agents need to take another approach.

Since the people are different, it is possible to go faster or ‘deeper’ in certain areas. Entrepreneurial skills and ambitions are more present in marketing. This gives a sense of ‘act first, apologize later’, taking ownership, a higher drive to succeed, and upfront and willing behavior. Scrumming here means thinking more about business goals and KPIs (how to go from department goals to scrum team goals, for example). After that the fun begins…

I will be speaking about this topic at Agile 2014. A great honor, of course, to be standing there. I will also attend the conference and will therefore try to post some updates here.

To read more about this topic you can read my publication about marketing scrum. It contains the extensive research paper I published about this story. Please feel free to give me comments and questions, either about Agile 2014 or the paper.

 

Enjoy reading the paper:

Marketing scrum vs IT scrum – two marketing case studies who now ‘act first and apologize later'

 

The Drag of Old Mental Models on Innovation and Change

“Don’t worry about people stealing your ideas. If your ideas are any good, you’ll have to ram them down people’s throats.” — Howard Aiken

It's not a lack of risk taking that holds innovation and change back. 

Even big companies take big risks all the time.

The real barrier to innovation and change is the drag of old mental models.

People end up emotionally invested in their ideas, or they are limited by their beliefs or their world views.  They can't see what's possible with the lens they look through, or fear and doubt hold them back.  In some cases, it's even learned helplessness.

In the book The Future of Management, Gary Hamel shares some great insight into what holds people and companies back from innovation and change.

Yesterday’s Heresies are Tomorrow’s Dogmas

Yesterday's ideas that were profoundly at odds with what is generally accepted eventually become the norm, and then harden into a belief system that is tough to change.

Via The Future of Management:

“Innovators are, by nature, contrarians.  Trouble is, yesterday's heresies often become tomorrow's dogmas, and when they do, innovation stalls and the growth curve flattens out.”

Deeply Held Beliefs are the Real Barrier to Strategic Innovation

Success turns beliefs into barriers by cementing ideas that become inflexible to change.

Via The Future of Management:

“... the real barrier to strategic innovation is more than denial -- it's a matrix of deeply held beliefs about the inherent superiority of a business model, beliefs that have been validated by millions of customers; beliefs that have been enshrined in physical infrastructure and operating handbooks; beliefs that have hardened into religious convictions; beliefs that are held so strongly, that nonconforming ideas seldom get considered, and when they do, rarely get more than grudging support.”

It's Not a Lack of Risk Taking that Holds Innovation Back

Big companies take big risks every day.  But the risks are scoped and constrained by old beliefs and the way things have always been done.

Via The Future of Management:

“Contrary to popular mythology, the thing that most impedes innovation in large companies is not a lack of risk taking.  Big companies take big, and often imprudent, risks every day.  The real brake on innovation is the drag of old mental models.  Long-serving executives often have a big chunk of their emotional capital invested in the existing strategy.  This is particularly true for company founders.  While many start out as contrarians, success often turns them into cardinals who feel compelled to defend the one true faith.  It's hard for founders to credit ideas that threaten the foundations of the business models they invented.  Understanding this, employees lower down self-edit their ideas, knowing that anything too far adrift from conventional thinking won't win support from the top.  As a result, the scope of innovation narrows, the risk of getting blindsided goes up, and the company's young contrarians start looking for opportunities elsewhere.”

Legacy Beliefs are a Much Bigger Liability When It Comes to Innovation

When you want to change the world, sometimes it takes a new view, and existing world views get in the way.

Via The Future of Management:

“When it comes to innovation, a company's legacy beliefs are a much bigger liability than its legacy costs.  Yet in my experience, few companies have a systematic process for challenging deeply held strategic assumptions.  Few have taken bold steps to open up their strategy process to contrarian points of view.  Few explicitly encourage disruptive innovation.  Worse, it's usually senior executives, with their doctrinaire views, who get to decide which ideas go forward and which get spiked.  This must change.”

What you see, or can’t see, changes everything.

You Might Also Like

The New Competitive Landscape

The New Realities that Call for New Organizational and Management Capabilities

Who’s Managing Your Company

Categories: Architecture, Programming

Playing with two most interesting new features of elasticsearch 1.3.0

Gridshore - Fri, 07/25/2014 - 11:47

Just a few days ago elasticsearch released version 1.3.0 of their flagship product. It contains two features I find most interesting. The first one is the long-awaited Top hits aggregation. Basically this is what is called grouping: you want to group certain items based on one characteristic, but within this group you want to have the best matching result(s) based on score. The other very important feature is the new support for scripts: better security options when using scripts, through sandboxed script languages.

In this blogpost I am going to explain and show the top_hits feature as well as the new scripting support.


Top hits

I am going to show a very simple example of top hits using my music index. This index contains all the songs I have in my iTunes library. The first step is to find songs by genre; the following query gives (the default) 10 hits based on the match_all query, together with the requested terms aggregation.

GET /mymusic/_search
{
  "aggs": {
    "byGenre": {
      "terms": {
        "field": "genre",
        "size": 10
      }
    }
  }
}

The response is of the format:

{
	"hits": {},
	"aggregations": {
		"byGenre": {
			"buckets": [
				{"key":"rock","doc_count":1910},
				...
			]
		}
    }
}

Now we add a query to the request: songs containing the word love in the title.

GET /mymusic/_search
{
  "query": {
    "match": {
      "name": "love"
    }
  }, 
  "aggs": {
    "byGenre": {
      "terms": {
        "field": "genre",
        "size": 10
      }
    }
  }
}

Now we have fewer hits, still a number of buckets, and the amount of songs that match our query within each bucket. The biggest change is the score in the returned hits. In the previous query the score was always 1; now the score differs due to the query we execute. The highest score now is the song Love by The Mission. The genre for this song is Rock and the song is from the year 1990. Time to introduce the top hits aggregation. With this query we can return the top song containing the word love in the title per genre.

GET /mymusic/_search
{
  "query": {
    "match": {
      "name": "love"
    }
  },
  "aggs": {
    "byGenre": {
      "terms": {
        "field": "genre",
        "size": 5
      },
      "aggs": {
        "topFoundHits": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}

Again we get hits, but they are no different from those of the previous query. The interesting part is the aggs section: here we add a sub aggregation to the byGenre aggregation. This aggregation is called topFoundHits and is of type top_hits. We only return the best hit per genre. The next code block shows the part of the response with the top hits; I removed most of the content of the _source field in the top_hits to keep the response shorter.

{
   "took": 4,
   "timed_out": false,
   "_shards": {
      "total": 3,
      "successful": 3,
      "failed": 0
   },
   "hits": {
      "total": 141,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "byGenre": {
         "buckets": [
            {
               "key": "rock",
               "doc_count": 52,
               "topFoundHits": {
                  "hits": {
                     "total": 52,
                     "max_score": 4.715253,
                     "hits": [
                        {
                           "_index": "mymusic",
                           "_type": "itunes",
                           "_id": "4147",
                           "_score": 4.715253,
                           "_source": {
                              "name": "Love",
                           }
                        }
                     ]
                  }
               }
            },
            {
               "key": "pop",
               "doc_count": 39,
               "topFoundHits": {
                  "hits": {
                     "total": 39,
                     "max_score": 3.3341873,
                     "hits": [
                        {
                           "_index": "mymusic",
                           "_type": "itunes",
                           "_id": "11381",
                           "_score": 3.3341873,
                           "_source": {
                              "name": "Love To Love You",
                           }
                        }
                     ]
                  }
               }
            },
            {
               "key": "alternative",
               "doc_count": 12,
               "topFoundHits": {
                  "hits": {
                     "total": 12,
                     "max_score": 4.1945505,
                     "hits": [
                        {
                           "_index": "mymusic",
                           "_type": "itunes",
                           "_id": "7889",
                           "_score": 4.1945505,
                           "_source": {
                              "name": "Love Love Love",
                           }
                        }
                     ]
                  }
               }
            },
            {
               "key": "b",
               "doc_count": 9,
               "topFoundHits": {
                  "hits": {
                     "total": 9,
                     "max_score": 3.0271564,
                     "hits": [
                        {
                           "_index": "mymusic",
                           "_type": "itunes",
                           "_id": "2549",
                           "_score": 3.0271564,
                           "_source": {
                              "name": "First Love",
                           }
                        }
                     ]
                  }
               }
            },
            {
               "key": "r",
               "doc_count": 7,
               "topFoundHits": {
                  "hits": {
                     "total": 7,
                     "max_score": 3.0271564,
                     "hits": [
                        {
                           "_index": "mymusic",
                           "_type": "itunes",
                           "_id": "2549",
                           "_score": 3.0271564,
                           "_source": {
                              "name": "First Love",
                           }
                        }
                     ]
                  }
               }
            }
         ]
      }
   }
}

Did you notice a problem with my analyser for the genre field? Hint: R&B!
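
The standard analyser splits a value like R&B into the separate tokens r and b, which is why they show up as two different genre buckets above. A minimal sketch of a fix, assuming you are free to recreate and reindex the index, is to map the genre field as not_analyzed so the whole value is kept as a single term:

PUT /mymusic
{
  "mappings": {
    "itunes": {
      "properties": {
        "genre": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}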

More information on the top_hits aggregation can be found here:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html

Scripting

Elasticsearch has had support for scripts for a long time. The default scripting language was, and up to version 1.3 still is, mvel; it will change to groovy in 1.4. Mvel is not a well-known scripting language. Its biggest advantage is that it is very powerful; the disadvantage is that it is too powerful. Mvel does not come with a sandbox principle, therefore it is possible to write some very nasty scripts, even when only doing a query. This was very well shown by a colleague of mine (Byron Voorbach), who created a query to read private keys on developer machines that did not safeguard their elasticsearch instance. Therefore dynamic scripting was switched off by default in version 1.2.

This came with a very big disadvantage: it was no longer possible to use the function_score query without resorting to stored scripts on the server. Version 1.3 of elasticsearch introduces a much better way: you can now use sandboxed scripting languages like groovy and keep the flexible approach. Groovy can be configured with the object creation and method calls that are allowed. More information about this is provided in the elasticsearch documentation about scripting.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html

Next is an example query against my music index, which contains all the songs from my music library. It queries all the songs after the year 1999 and calculates the score based on the year, so the newest songs get the highest score. And yes, I know a sort by year desc would have given the same result.

GET mymusic/_search
{
  "fields": ["year"],
  "query": {
    "function_score": {
      "query": {
        "range": {
          "year": {
            "gte": 2000
          }
        }
      },
      "functions": [
        {
          "script_score": {
            "lang": "groovy", 
            "script": "_score * doc['year'].value"
          }
        }
      ]
    }
  }
}

The scores now become high: since we do a range query we only get back scores of one, and with the function_score multiplying the year with the score, the end score is the year. I added the year as the only field to return; some of the results then are:

{
   "took": 4,
   "timed_out": false,
   "_shards": {
      "total": 3,
      "successful": 3,
      "failed": 0
   },
   "hits": {
      "total": 2895,
      "max_score": 2014,
      "hits": [
         {
            "_index": "mymusic",
            "_type": "itunes",
            "_id": "12965",
            "_score": 2014,
            "fields": {
               "year": [
                  "2014"
               ]
            }
         },
         {
            "_index": "mymusic",
            "_type": "itunes",
            "_id": "12975",
            "_score": 2014,
            "fields": {
               "year": [
                  "2014"
               ]
            }
         }
      ]
   }
}

Next up is the last sample, a combination of top_hits and scripting.

Top hits with scripting

We start with the sample from top_hits using my music index. Now we want to sort the buckets on the score of the best matching document in the bucket. The default is the number of documents in the bucket. As mentioned in the documentation you need a trick to do this.

The top_hits aggregator isn’t a metric aggregator and therefore can’t be used in the order option of the terms aggregator.

GET /mymusic/_search?search_type=count
{
  "query": {
    "match": {
      "name": "love"
    }
  },
  "aggs": {
    "byGenre": {
      "terms": {
        "field": "genre",
        "size": 5,
        "order": {
          "best_hit":"desc"
        }
      },
      "aggs": {
        "topFoundHits": {
          "top_hits": {
            "size": 1
          }
        },
        "best_hit": {
          "max": {
            "lang": "groovy", 
            "script": "doc.score"
          }
        }
      }
    }
  }
}

The results of this query, again with most of the _source taken out, follow. Compare it to the query in the top_hits section: notice the different genres that we get back now, and also check the scores.

{
   "took": 4,
   "timed_out": false,
   "_shards": {
      "total": 3,
      "successful": 3,
      "failed": 0
   },
   "hits": {
      "total": 141,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "byGenre": {
         "buckets": [
            {
               "key": "rock",
               "doc_count": 37,
               "topFoundHits": {
                  "hits": {
                     "total": 37,
                     "max_score": 4.715253,
                     "hits": [
                        {
                           "_index": "mymusic",
                           "_type": "itunes",
                           "_id": "4147",
                           "_score": 4.715253,
                           "_source": {
                              "name": "Love",
                           }
                        }
                     ]
                  }
               },
               "best_hit": {
                  "value": 4.715252876281738
               }
            },
            {
               "key": "alternative",
               "doc_count": 12,
               "topFoundHits": {
                  "hits": {
                     "total": 12,
                     "max_score": 4.1945505,
                     "hits": [
                        {
                           "_index": "mymusic",
                           "_type": "itunes",
                           "_id": "7889",
                           "_score": 4.1945505,
                           "_source": {
                              "name": "Love Love Love",
                           }
                        }
                     ]
                  }
               },
               "best_hit": {
                  "value": 4.194550514221191
               }
            },
            {
               "key": "punk",
               "doc_count": 3,
               "topFoundHits": {
                  "hits": {
                     "total": 3,
                     "max_score": 4.1945505,
                     "hits": [
                        {
                           "_index": "mymusic",
                           "_type": "itunes",
                           "_id": "7889",
                           "_score": 4.1945505,
                           "_source": {
                              "name": "Love Love Love",
                           }
                        }
                     ]
                  }
               },
               "best_hit": {
                  "value": 4.194550514221191
               }
            },
            {
               "key": "pop",
               "doc_count": 24,
               "topFoundHits": {
                  "hits": {
                     "total": 24,
                     "max_score": 3.3341873,
                     "hits": [
                        {
                           "_index": "mymusic",
                           "_type": "itunes",
                           "_id": "11381",
                           "_score": 3.3341873,
                           "_source": {
                              "name": "Love To Love You",
                           }
                        }
                     ]
                  }
               },
               "best_hit": {
                  "value": 3.3341872692108154
               }
            },
            {
               "key": "b",
               "doc_count": 7,
               "topFoundHits": {
                  "hits": {
                     "total": 7,
                     "max_score": 3.0271564,
                     "hits": [
                        {
                           "_index": "mymusic",
                           "_type": "itunes",
                           "_id": "2549",
                           "_score": 3.0271564,
                           "_source": {
                              "name": "First Love",
                           }
                        }
                     ]
                  }
               },
               "best_hit": {
                  "value": 3.027156352996826
               }
            }
         ]
      }
   }
}

This is just a first introduction to top_hits and scripting. Stay tuned for more blogs around these topics.

The post Playing with two most interesting new features of elasticsearch 1.3.0 appeared first on Gridshore.

Categories: Architecture, Programming

Review The Twitter Story by Nick Bilton

Gridshore - Thu, 07/24/2014 - 23:35


A lot of people dream of creating the new Facebook, Twitter or Instagram. Nobody knows when they will create it; most of those ideas just happened to become big. Personally I do not believe I will ever create such a product. I am too busy with too many different things, usually based on great ideas of others. One thing I do like is reading about the success stories of others. I have read the books about Starbucks, Microsoft, Apple and a few others.

Recently I started reading the Twitter story. It reads like an exciting novel, yet it tells a story based on interviews and facts behind one of the most exciting companies on the internet of the past decade.

I do not think it is a coincidence that a lot of what I read in The Lean Startup, but also in the Starbucks story, comes back in the Twitter story. One thing that struck me in this book is what business does to friendship. It also shows that the people with great ideas are usually not the people who turn those ideas into a profitable company.

When starting a company based on your terrific idea, read this book and learn from it. It might make your life a lot better.

The post Review The Twitter Story by Nick Bilton appeared first on Gridshore.

Categories: Architecture, Programming

How Do I Make $2,000 A Month On Passive Income?

Making the Complex Simple - John Sonmez - Thu, 07/24/2014 - 15:00

In this video I answer a question about how to make passive income from a book and a blog.

The post How Do I Make $2,000 A Month On Passive Income? appeared first on Simple Programmer.

Categories: Programming

Java: Determining the status of data import using kill signals

Mark Needham - Wed, 07/23/2014 - 23:20

A few weeks ago I was working on the initial import of ~ 60 million bits of data into Neo4j and we kept running into a problem where the import process just seemed to freeze and nothing else was imported.

It was very difficult to tell what was happening inside the process – taking a thread dump merely informed us that it was attempting to process one line of the CSV file and was somehow unable to do so.

One way to help debug this would have been to print out every single line of the CSV as we processed it and then watch where it got stuck, but this seemed a bit overkill. Ideally we wanted to print out the line we were processing only on demand.

As luck would have it, we can do exactly this by sending a kill signal to our import process and having it print out where it has got up to. We had to make sure we picked a signal which wasn’t already being handled by the JVM and decided to go with ‘SIGTRAP’, i.e. kill -5 [pid].

We came across a neat blog post that explained how to wire everything up and then created our own version:

import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
 
import sun.misc.Signal;
import sun.misc.SignalHandler;
 
// prints how far the import has got whenever the registered signal is received
class Kill3Handler implements SignalHandler
{
    private AtomicInteger linesProcessed;
    private AtomicReference<Map<String, Object>> lastRowProcessed;
 
    public Kill3Handler( AtomicInteger linesProcessed, AtomicReference<Map<String, Object>> lastRowProcessed )
    {
        this.linesProcessed = linesProcessed;
        this.lastRowProcessed = lastRowProcessed;
    }
 
    @Override
    public void handle( Signal signal )
    {
        System.out.println("Last Line Processed: " + linesProcessed.get() + " " + lastRowProcessed.get());
    }
}

We then wired that up like so:

AtomicInteger linesProcessed = new AtomicInteger( 0 );
AtomicReference<Map<String, Object>> lastRowProcessed = new AtomicReference<>(  );
Kill3Handler kill3Handler = new Kill3Handler( linesProcessed, lastRowProcessed );
Signal.handle(new Signal("TRAP"), kill3Handler);
 
// as we iterate each line we update those variables
 
linesProcessed.incrementAndGet();
lastRowProcessed.getAndSet( properties ); // properties = a representation of the row we're processing

This worked really well for us and we were able to work out that we had a slight problem with some of the data in our CSV file which was causing it to be processed incorrectly.

We hadn’t been able to see this by visual inspection since the CSV files were a few GB in size. We’d therefore only skimmed a few lines as a sanity check.

I didn’t even know you could do this but it’s a neat trick to keep in mind – I’m sure it shall come in useful again.

Categories: Programming

The New Realities that Call for New Organizational and Management Capabilities

“The only people who can change the world are people who want to. And not everybody does.” -- Hugh MacLeod

Is it just me or is the world changing faster than ever?

I hear from everybody around me (inside and outside of Microsoft) how radically their worlds are changing under their feet, business models are flipped on their heads, and the game of generating new business value for customers is at an all-time competitive high.

Great.

Challenge is where growth and greatness come from.  It’s always a chance to test what we’re capable of and respond to whatever gets thrown our way.  But first, it helps to put a finger on what exactly these changes are that are disrupting our world, and what to focus on to survive and thrive.

In the book The Future of Management, Gary Hamel shares some great insight into the key challenges that companies are facing that create even more demand for management innovation.

The New Realities We’re Facing that Call for Management Innovation

I think Hamel describes our new world pretty well …

Via The Future of Management:

  • “As the pace of change accelerates more, more and more companies are finding themselves on the wrong side of the change curve.  Recent research by L.G. Thomas and Richard D'Aveni suggests that industry leadership is changing hands more frequently, and competitive advantage is eroding more rapidly, than ever before.  Today, it's not just the occasional company that gets caught out by the future, but entire industries -- be it traditional airlines, old-line department stores, network television broadcasters, the big drug companies, America's carmakers, or the newspaper and music industries.”
  • “Deregulation, along with the de-scaling effects of new technology, are dramatically reducing the barriers to entry across a wide range of industries, from publishing to telecommunications to banking to airlines.  As a result, long-standing oligopolies are fracturing and competitive 'anarchy' is on the rise.”
  • “Increasingly, companies are finding themselves enmeshed in 'value webs' and 'ecosystems' over which they have only partial control.  As a result, competitive outcomes are becoming less the product of market power, and more the product of artful negotiation.  De-verticalization, disintermediation, and outsource-industry consortia, are leaving firms with less and less control over their own destinies.”
  • “The digitization of anything not nailed down threatens companies that make their living out of creating and selling intellectual property.  Drug companies, film studios, publishers, and fashion designers are all struggling to adapt to a world where information and ideas 'want to be free.'”
  • “The internet is rapidly shifting bargaining power from producers to consumers.  In the past, customer 'loyalty' was often an artifact of high search costs and limited information, and companies frequently profited from customer ignorance.  Today, customers are in control as never before -- and in a world of near-perfect information, there is less and less room for mediocre products and services.”
  • “Strategy cycles are shrinking.  Thanks to plentiful capital, the power of outsourcing, and the global reach of the Web, it's possible to ramp up a new business faster than ever before.  But the more rapidly a business grows, the sooner it fulfills the promise of its original business model, peaks, and enters its dotage.  Today, the parabola of success is often a short, sharp spike.”
  • “Plummeting communication costs and globalization are opening up industries to a horde of new ultra-low-cost competitors.  These new entrants are eager to exploit the legacy costs of the old guard.  While some veterans will join the 'race to the bottom' and move their core activities to the world's lowest-cost locations, many others will find it difficult to reconfigure their global operations.  As Indian companies suck in service jobs and China steadily expands its share of global manufacturing, companies everywhere will struggle to maintain their margins.”

Strategically Adaptive and Operationally Efficient

So how do you respond to these challenges? Hamel says it takes becoming strategically adaptable and operationally efficient. What a powerful combo.

Via The Future of Management:

“These new realities call for new organizational and managerial capabilities.  To thrive in an increasingly disruptive world, companies must become as strategically adaptable as they are operationally efficient.  To safeguard their margins, they must become gushers of rule-breaking innovation.  And if they're going to out-invent and outthink a growing mob of upstarts, they must learn how to inspire their employees to give the very best of themselves every day.  These are the challenges that must be addressed by 21st-century management innovations.”

There are plenty of challenges.  It’s time to get your greatness on. 

If there ever was a chance to put to the test what you’re capable of, now is the time.

No matter what, as long as you live and learn, you’ll grow from the process.

You Might Also Like

Simplicity is the Ultimate Enabler

The New Competitive Landscape

Who’s Managing Your Company

Categories: Architecture, Programming

Google Play Services 5.0

Android Developers Blog - Tue, 07/22/2014 - 00:43

Google Play services 5.0 is now rolled out to devices worldwide, and it includes a number of features you can use to improve your apps. This release introduces Android wearable services APIs, Dynamic Security Provider and App Indexing, whilst also including updates to the Google Play game services, Cast, Drive, Wallet, Analytics, and Mobile Ads.

Android wearable services

Google Play services 5.0 introduces a set of APIs that make it easier to communicate with your apps running on Android wearables. The APIs provide an automatically synchronized, persistent data store and a low-latency messaging interface that let you sync data, exchange control messages, and transfer assets.

Dynamic security provider

Provides an API that apps can use to easily install a dynamic security provider. The dynamic security provider includes a replacement for the platform's secure networking APIs, which can be updated frequently for rapid delivery of security patches. The current version includes fixes for recent issues identified in OpenSSL.

Google Play game services

Quests are a new set of APIs to run time-based goals for players, and reward them without needing to update the game. To do this, you can send game activity data to the Quests service whenever a player successfully wins a level, kills an alien, or saves a rare black sheep, for example. This tells Quests what’s going on in the game, and you can use that game activity to create new Quests. By running Quests on a regular basis, you can create an unlimited number of new player experiences to drive re-engagement and retention.

Saved games lets you store a player's game progress to the cloud for use across many screens, using a new saved game snapshot API. Along with game progress, you can store a cover image, description and time-played. Players never play level 1 again when they have their progress stored with Google, and they can see where they left off when you attach a cover image and description. Adding cover images and descriptions provides additional context on the player’s progress and helps drive re-engagement through the Play Games app.

App Indexing API

The App Indexing API provides a way for you to notify Google about deep links in your native mobile applications and drive additional user engagement. Integrating with the App Indexing API allows the Google Search app to serve up your app’s history to users as instant Search suggestions, providing fast and easy access to inner pages in your app. The deep links reported using the App Indexing API are also used by Google to index your app’s content and surface them as deep links in Google search results.

Google Cast

The Google Cast SDK now includes media tracks that introduce closed caption support for Chromecast.

Drive

The Google Drive API adds the ability to sort query results, create folders offline, and select any mime type in the file picker by default.

Wallet

Wallet objects from Google take physical objects (like loyalty cards, offers) from your wallet and store them in the cloud. In this release, Wallet adds "Save to Wallet" button support for offers. When a user clicks "Save to Wallet" the offer gets saved and shows up in the user's Google Wallet app. Geo-fenced in-store notifications prompt the user to show and scan digital cards at point-of-sale, driving higher redemption. This also frees the user from having to carry around offers and loyalty cards.

Users can also now use their Google Wallet Balance to pay for Instant Buy transactions by providing split tender support. With split tender, if your Wallet Balance is not sufficient, the payment is split between your Wallet Balance and a credit/debit card in your Google Wallet.

Analytics

Enhanced Ecommerce provides visibility into the full customer journey, adding the ability to measure product impressions, product clicks, viewing product details, adding a product to a shopping cart, initiating the checkout process, internal promotions, transactions, and refunds. Together they help users gain deeper insights into the performance of their business, including how far users progress through the shopping funnel and where they are abandoning in the purchase process. Enhanced Ecommerce also allows users to analyze the effectiveness of their marketing and merchandising efforts, including the impact of internal promotions, coupons, and affiliate marketing programs.

Mobile Ads

Google Mobile Ads are a great way to monetise your apps and you now have access to better in-app purchase ads. We've now added a default implementation for consumable purchases using the Google Play In-app Billing service.

And that’s another release of Google Play services. The updated Google Play services SDK is now available through the Android SDK manager. For details on the APIs, please see New Features in Google Play services 5.0.

Categories: Programming

KNOX Contribution to Android: Accelerating Android in the Workplace

Android Developers Blog - Mon, 07/21/2014 - 17:21

Srikanth Rajagopalan, PM Director and Workplace aficionado

Recently at Google I/O, we announced a comprehensive set of new features that will allow IT organizations to easily deploy and manage Android devices in enterprise environments. These features will be built into the upcoming Android L release.

Samsung, with its KNOX technology, has been a thought leader in the enterprise mobility space. In order to accelerate Android adoption in the enterprise, we have partnered with Samsung to bring key KNOX functionality into Android, for the benefit of the entire Android ecosystem. We thank Samsung for their contributions. These new capabilities will make it easy for IT organizations to allow employees to bring their own Android devices to work (BYOD) and use them on the corporate network or to simply issue new Android devices to their employees. IT administrators will be able to manage a wide range of Android devices from many manufacturers, using third-party Enterprise Mobility Management (EMM) solutions that are built on top of the new enterprise APIs launching with Android L release.

Google and Samsung together designed the new enterprise APIs around three major concepts:

  • Device and data security
  • Support for IT policies and restrictions
  • Mobile application management
Device and data security

At the core of the expanded enterprise capabilities being introduced in Android ‘L’ lies a set of technologies that are designed to keep personal and corporate data both separate and safe. We achieve the data separation by building on the existing multi-user support in Android: personal and corporate applications will run as two separate Android users. Data is kept safe by using block-level disk encryption as well as verified boot technology. For those of you familiar with KNOX, this is analogous to KNOX Workspace. EMMs will be able to take advantage of new Android SDK APIs to enable the creation of a managed profile, which is where all corporate applications and data will reside.

Support for IT restrictions and policies

EMMs can use new Android SDK APIs, which have evolved from KNOX APIs, to allow IT admins to enforce a wide set of policies, ranging from system settings and certificate provisioning to application-specific (e.g. Chrome) configurations and restrictions.

Mobile application management

EMMs will be able to use new backend APIs, adapted from KNOX APIs and built around strong security principles for on-device app deployment, to allow IT admins to curate the corporate application catalog and to remotely deploy applications to the managed profile on the employees’ devices.

We encourage developers interested in the new Enterprise APIs to download and test the Android L Developer Preview. For developers who have already built applications using Samsung KNOX APIs, Samsung will be providing a KNOX Compatibility Library that will let such applications run on all Android L devices.

You can read more about this collaboration on the Samsung KNOX blog. Stay tuned for additional details.

Categories: Programming

Who's Managing Your Company?

One of the best books I’m reading lately is The Future of Management, by Gary Hamel.

It’s all about how management innovation is the best competitive advantage, whether you look through the history of great businesses or the history of great militaries.  Hamel makes a great case that strategic innovation, product or service innovation, and operational innovation are fleeting advantages, but management innovation leads to competitive advantage for the long haul.

In The Future of Management, Hamel poses a powerful question …

“Who is managing your company?”

Via The Future of Management:

“Who's managing your company?  You might be tempted to answer, 'the CEO,' or 'the executive team,' or 'all of us in middle management.'  And you'd be right, but that wouldn't be the whole truth.  To a large extent, your company is being managed right now by a small coterie of long-departed theorists and practitioners who invented the rules and conventions of 'modern' management back in the early years of the 20th century.  They are the poltergeists who inhabit the musty machinery of management.  It is their edicts, echoing across the decades, that invisibly shape the way your company allocates resources, sets budgets, distributes power, rewards people, and makes decisions.”

That’s why it’s easy for CEOs to hop around companies …

Via The Future of Management:

“So pervasive is the influence of these patriarchs that the technology of management varies only slightly from firm to firm.  Most companies have a roughly similar management hierarchy (a cascade of EVPs, SVPs, and VPs).  They have analogous control systems, HR practices, and planning rituals, and rely on comparable reporting structures and review systems.  That's why it's so easy for a CEO to jump from one company to another -- the levers and dials of management are more or less the same in every corporate cockpit.”

What really struck me here is how much management approach has been handed down through the ages, and accepted as status quo.

It’s some great food for thought, especially given that management innovation is THE most powerful form of competitive advantage from an innovation standpoint (a case Hamel builds strongly throughout the entirety of the book).

You Might Also Like

The New Competitive Landscape

Principles and Values Define a Culture

The Enterprise of the Future

Cognizant on The Next Generation Enterprise

Satya Nadella on the Future is Software

Categories: Architecture, Programming

Why I am AGAINST Net Neutrality

Making the Complex Simple - John Sonmez - Mon, 07/21/2014 - 15:30

No, I’m not saying this just to get a rise out of you, I am actually against net neutrality and here is why: Net neutrality is this principle that says internet service providers should treat all data on the internet equally. I actually like this idea; I agree with it; I hope internet providers choose […]

The post Why I am AGAINST Net Neutrality appeared first on Simple Programmer.

Categories: Programming

My First Xamarin Mobile App

Phil Trelford's Array - Mon, 07/21/2014 - 08:09

Following on from F# week, Xamarin have been running a contest to build your first F# mobile app which ends today. If you missed it you can still enter, run an F# app and get a free F# T-shirt.

Over the weekend I’ve put together a simple calculator app that includes units of measure support:

[Screenshot: units calculator app]


Getting started

I installed the latest version of Xamarin on my Mac. The happy path seems to be the stable release channel these days, as F# is now baked into the IDE. Once it’s installed you can start straight away developing Android apps. For iOS you need to install Xcode from the Apple store too. The iOS emulator seemed to run the fastest, taking just a few seconds to build and run, so I settled on that.

The Xamarin IDE includes plenty of project templates for building Android and iOS applications with F#. I used the universal single view application template for iOS as a start point, and the Creating iOS Applications in Code tutorial as a guide. I’d also recommend checking out Rachel Reese’s excellent Introduction to F# with Xamarin article.

Note: F# is a mature, open source, cross-platform language so you get the same compiler in Xamarin as you do in Visual Studio.


Units of measure

F# has built-in support for units of measure, so once measures like m and s have been declared with the [<Measure>] attribute you can write:

let speed = 100<m> / 10<s>

This unit information is used for compile-time checking; it has no cost at runtime, and consequently no unit metadata is available at runtime.

For the units calculator an expression parser and a units implementation are required. This was a case of “here’s one I wrote earlier” – see the source code link below.

The implementation uses a couple of F# discriminated unions for defining unit types and a simple recursive descent parser built with F# active patterns for parsing expressions. No libraries were imported and the code is just over 200 lines.

You can also play with the units implementation in F# Interactive: simply highlight the source in Units.fs, then right-click and execute the selection in F# Interactive.

Note: Google and Bing both provide unit calculators from their respective search boxes.


Source code

All the source code is available on BitBucket: https://bitbucket.org/ptrelford/units-calculator/src

Categories: Programming

100 Articles to Sharpen Your Mind

Actually, it's more than 100 articles for your mind.  On Sources of Insight, I've tagged the articles that focus on increasing your "intellectual horsepower" with "mind":

Articles on Mind Power and the Power of Thoughts

Here are a few of the top mind articles that you can quickly get results with:

Note that if reading faster is important to you, then I recommend also reading How To Read 10,000 Words a Minute (it’s my ultimate edge) and The Truth About Speed Reading.

If there’s one little trick I use with reading (whether it’s a book, an email, or whatever), I ask myself “what’s the insight?” or “what’s the action?” or “how can I use this?”  You’d be surprised, but just asking yourself those little focusing questions can help you cut through cluttered content fast and find the needles in the haystack.

Categories: Architecture, Programming

R: ggplot – Plotting back to back bar charts

Mark Needham - Sun, 07/20/2014 - 17:50

I’ve been playing around with R’s ggplot library to explore data from the Neo4j London meetup, and the next thing I wanted to do was plot back-to-back bar charts showing ‘yes’ and ‘no’ RSVPs.

I’d already done the ‘yes’ bar chart using the following code:

query = "MATCH (e:Event)<-[:TO]-(response {response: 'yes'})
         RETURN response.time AS time, e.time + e.utc_offset AS eventTime"
allYesRSVPs = cypher(graph, query)
allYesRSVPs$time = timestampToDate(allYesRSVPs$time)
allYesRSVPs$eventTime = timestampToDate(allYesRSVPs$eventTime)
allYesRSVPs$difference = as.numeric(allYesRSVPs$eventTime - allYesRSVPs$time, units="days")
 
ggplot(allYesRSVPs, aes(x=difference)) + geom_histogram(binwidth=1, fill="green")
[Chart: histogram of ‘yes’ RSVPs by days before the event]

The next step was to create a similar thing for people who’d RSVP’d ‘no’ having originally RSVP’d ‘yes’ i.e. people who dropped out:

query = "MATCH (e:Event)<-[:TO]-(response {response: 'no'})<-[:NEXT]-()
         RETURN response.time AS time, e.time + e.utc_offset AS eventTime"
allNoRSVPs = cypher(graph, query)
allNoRSVPs$time = timestampToDate(allNoRSVPs$time)
allNoRSVPs$eventTime = timestampToDate(allNoRSVPs$eventTime)
allNoRSVPs$difference = as.numeric(allNoRSVPs$eventTime - allNoRSVPs$time, units="days")
 
ggplot(allNoRSVPs, aes(x=difference)) + geom_histogram(binwidth=1, fill="red")
[Chart: histogram of ‘no’ (drop-out) RSVPs by days before the event]

As expected, if people are going to drop out they tend to do so in the day or two before the event. By requiring a ‘NEXT’ relationship we only capture the people who replied ‘yes’ and then changed it to ‘no’; we don’t capture the people who said ‘no’ straight away.

I thought it’d be cool to be able to have the two charts back to back using the same scale so I could compare them against each other which led to my first attempt:

yes = ggplot(allYesRSVPs, aes(x=difference)) + geom_histogram(binwidth=1, fill="green")
no = ggplot(allNoRSVPs, aes(x=difference)) + geom_histogram(binwidth=1, fill="red") + scale_y_reverse()
library(gridExtra)
grid.arrange(yes,no,ncol=1,widths=c(1,1))

scale_y_reverse() flips the y axis so the ‘no’ chart appears upside down. The last line lays the two charts out in a grid with a single column, which stacks them one above the other.

[Chart: ‘yes’ and ‘no’ histograms stacked, before fixing the axes]

When we compare them next to each other we can see that the ‘yes’ replies are much more spread out, whereas if people are going to drop out it nearly always happens in the week or so before the event. This is what we thought was happening, but it’s cool to have it confirmed by the data.

One annoying thing about that visualisation is that the two charts aren’t on the same scale. The ‘no’ chart only goes up to 100 days whereas the ‘yes’ one goes up to 120 days. In addition, the top end of the ‘yes’ chart is around 200 whereas the ‘no’ is around 400.

Luckily we can solve that problem by fixing the axes for both plots:

yes = ggplot(allYesRSVPs, aes(x=difference)) + 
  geom_histogram(binwidth=1, fill="green") +
  xlim(0,120) + 
  ylim(0, 400)
 
no = ggplot(allNoRSVPs, aes(x=difference)) +
  geom_histogram(binwidth=1, fill="red") +
  xlim(0,120) + 
  ylim(0, 400) +
  scale_y_reverse()

Now if we re-render it looks much better:

[Chart: ‘yes’ and ‘no’ histograms stacked, with matching axes]

Now that the axes are comparable we can see that a lot more people drop out of an event as it approaches (500) than new people sign up (300). This is quite helpful for working out how many people are likely to show up.

We’ve found that the number of ‘yes’ RSVPs for an event drops by 15-20% overall between 2 days before the event and the evening of the event, and the data seems to confirm this.
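
As a rough, untested sketch (and only a proxy, since it counts drop-outs in the final 2 days against the total ‘yes’ replies per event and ignores late sign-ups), we could estimate that drop-off rate from the data frames we already have:

# count 'no' replies in the final 2 days and all 'yes' replies, per event
lateDropOuts = aggregate(difference ~ eventTime,
                         data = subset(allNoRSVPs, difference <= 2), FUN = length)
totalYes     = aggregate(difference ~ eventTime, data = allYesRSVPs, FUN = length)
 
perEvent = merge(lateDropOuts, totalYes, by = "eventTime",
                 suffixes = c(".lateNo", ".yes"))
 
# spread of late drop-outs as a fraction of 'yes' RSVPs across events
summary(perEvent$difference.lateNo / perEvent$difference.yes)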

The only annoying thing about this side-by-side approach is that the axes are repeated, because the plots are two completely separate charts.

I expect it would look better if I could work out how to combine the two data frames and then draw the back-to-back charts from a single variable in the combined data frame.
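
One possible (untested) sketch of that: tag each data frame with a new response column, rbind them together, and negate the ‘no’ counts so both series hang off a single pair of axes:

allYesRSVPs$response = "yes"
allNoRSVPs$response = "no"
combined = rbind(allYesRSVPs, allNoRSVPs)
 
# draw the 'yes' counts above the x axis and the 'no' counts below it
ggplot(combined, aes(x = difference, fill = response)) +
  geom_histogram(data = subset(combined, response == "yes"), binwidth = 1) +
  geom_histogram(data = subset(combined, response == "no"),
                 aes(y = -..count..), binwidth = 1) +
  scale_fill_manual(values = c("yes" = "green", "no" = "red")) +
  xlim(0, 120)

That would give one chart with a single x axis and a legend rather than repeated axes.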

I’m still working on that so suggestions are most welcome. The code is on github if you want to play with it.

Categories: Programming

Neo4j 2.1.2: Finding where I am in a linked list

Mark Needham - Sun, 07/20/2014 - 16:13

I was recently asked how to calculate the position of a node in a linked list, and realised that as the list grows this is one of the occasions where we should write an unmanaged extension rather than use cypher.

I wrote a quick bit of code to create a linked list with 10,000 elements in it:

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
 
public class Chains 
{
    // NEXT isn't shown in the original snippet; presumably it's defined along these lines
    private static final RelationshipType NEXT = DynamicRelationshipType.withName( "NEXT" );
 
    public static void main(String[] args)
    {
        String simpleChains = "/tmp/longchains";
        populate( simpleChains, 10000 );
    }
 
    private static void populate( String path, int chainSize )
    {
        GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase( path );
        try(Transaction tx = db.beginTx()) {
            Node currentNode = null;
            for ( int i = 0; i < chainSize; i++ )
            {
                Node node = db.createNode();
 
                if(currentNode != null) {
                    currentNode.createRelationshipTo( node, NEXT );
                }
                currentNode = node;
            }
            tx.success();
        }
 
 
        db.shutdown();
    }
}

To find our distance from the end of the linked list we could write the following cypher query:

match n  where id(n) = {nodeId}  with n 
match path = (n)-[:NEXT*]->() 
RETURN id(n) AS nodeId, length(path) AS length 
ORDER BY length DESC 
LIMIT 1;

For simplicity we’re finding a node by its internal node id and then recursively following the ‘NEXT’ relationships going out from that node. We then filter the results so that we only keep the longest path, which is our distance to the end of the list.

I noticed that this query would sometimes take 10s of seconds so I wrote a version using the Java Traversal API to see whether I could get it any quicker.

This is the Java version:

try(Transaction tx = db.beginTx()) {
    Node startNode = db.getNodeById( nodeId );
    TraversalDescription traversal = db.traversalDescription();
    Traverser traverse = traversal
            .depthFirst()
            .relationships( NEXT, Direction.OUTGOING )
            .sort( new Comparator<Path>()
            {
                @Override
                public int compare( Path o1, Path o2 )
                {
                    return Integer.valueOf( o2.length() ).compareTo( o1 .length() );
                }
            } )
            .traverse( startNode );
 
    Collection<Path> paths = IteratorUtil.asCollection( traverse );
 
    int maxLength = traverse.iterator().next().length();
    System.out.print( maxLength );
 
    tx.failure();
}

This is a bit more verbose than the cypher version but computes the same result. We’ve sorted the paths by length using a comparator to ensure we get the longest path back first.

I created a little program to warm up the caches and kick off a few iterations where I queried from different nodes and returned the length and time taken. These were the results:

--------
(Traversal API) Node:    1, Length: 9998, Time (ms):  15
       (Cypher) Node:    1, Length: 9998, Time (ms): 26225
(Traversal API) Node:  456, Length: 9543, Time (ms):  10
       (Cypher) Node:  456, Length: 9543, Time (ms): 24881
(Traversal API) Node:  761, Length: 9238, Time (ms):   9
       (Cypher) Node:  761, Length: 9238, Time (ms): 9941
--------
(Traversal API) Node:    1, Length: 9998, Time (ms):   9
       (Cypher) Node:    1, Length: 9998, Time (ms): 12537
(Traversal API) Node:  456, Length: 9543, Time (ms):   8
       (Cypher) Node:  456, Length: 9543, Time (ms): 15690
(Traversal API) Node:  761, Length: 9238, Time (ms):   7
       (Cypher) Node:  761, Length: 9238, Time (ms): 9202
--------
(Traversal API) Node:    1, Length: 9998, Time (ms):   8
       (Cypher) Node:    1, Length: 9998, Time (ms): 11905
(Traversal API) Node:  456, Length: 9543, Time (ms):   7
       (Cypher) Node:  456, Length: 9543, Time (ms): 22296
(Traversal API) Node:  761, Length: 9238, Time (ms):   8
       (Cypher) Node:  761, Length: 9238, Time (ms): 8739
--------

Interestingly when I reduced the size of the linked list to 1000 the difference wasn’t so pronounced:

--------
(Traversal API) Node:    1, Length: 998, Time (ms):   5
       (Cypher) Node:    1, Length: 998, Time (ms): 174
(Traversal API) Node:  456, Length: 543, Time (ms):   2
       (Cypher) Node:  456, Length: 543, Time (ms):  71
(Traversal API) Node:  761, Length: 238, Time (ms):   1
       (Cypher) Node:  761, Length: 238, Time (ms):  13
--------
(Traversal API) Node:    1, Length: 998, Time (ms):   2
       (Cypher) Node:    1, Length: 998, Time (ms): 111
(Traversal API) Node:  456, Length: 543, Time (ms):   1
       (Cypher) Node:  456, Length: 543, Time (ms):  40
(Traversal API) Node:  761, Length: 238, Time (ms):   1
       (Cypher) Node:  761, Length: 238, Time (ms):  12
--------
(Traversal API) Node:    1, Length: 998, Time (ms):   3
       (Cypher) Node:    1, Length: 998, Time (ms): 129
(Traversal API) Node:  456, Length: 543, Time (ms):   2
       (Cypher) Node:  456, Length: 543, Time (ms):  48
(Traversal API) Node:  761, Length: 238, Time (ms):   0
       (Cypher) Node:  761, Length: 238, Time (ms):  12
--------

which is good news, as most linked lists that we’ll create will be in the 10s – 100s range rather than the 10,000 I was faced with here.

I’m sure cypher will reach parity for this type of query in future, which will be great as I like writing cypher much more than Java. For now though it’s good to know we have a backup option to call on when necessary.

The code is available as a gist if you want to play around with it further.

Categories: Programming

R: ggplot – Don’t know how to automatically pick scale for object of type difftime – Discrete value supplied to continuous scale

Mark Needham - Sun, 07/20/2014 - 01:21

While reading ‘Why The R Programming Language Is Good For Business‘ I came across Udacity’s ‘Data Analysis with R‘ courses – part of which focus on exploring data sets using visualisations, something I haven’t done much of yet.

I thought it’d be interesting to create some visualisations around the times that people RSVP ‘yes’ to the various Neo4j events that we run in London.

I started off with the following query which returns the date time that people replied ‘Yes’ to an event and the date time of the event:

library(Rneo4j)
query = "MATCH (e:Event)<-[:TO]-(response {response: 'yes'})
         RETURN response.time AS time, e.time + e.utc_offset AS eventTime"
allYesRSVPs = cypher(graph, query)
allYesRSVPs$time = timestampToDate(allYesRSVPs$time)
allYesRSVPs$eventTime = timestampToDate(allYesRSVPs$eventTime)
 
> allYesRSVPs[1:10,]
                  time           eventTime
1  2011-06-05 12:12:27 2011-06-29 18:30:00
2  2011-06-05 14:49:04 2011-06-29 18:30:00
3  2011-06-10 11:22:47 2011-06-29 18:30:00
4  2011-06-07 15:27:07 2011-06-29 18:30:00
5  2011-06-06 20:21:45 2011-06-29 18:30:00
6  2011-07-04 19:49:04 2011-07-27 19:00:00
7  2011-07-05 16:40:10 2011-07-27 19:00:00
8  2011-08-19 07:41:10 2011-08-31 18:30:00
9  2011-08-24 12:47:40 2011-08-31 18:30:00
10 2011-08-18 09:56:53 2011-08-31 18:30:00

I wanted to create a bar chart showing the amount of time in advance of a meetup that people RSVP’d ‘yes’ so I added the following column to my data frame:

allYesRSVPs$difference = allYesRSVPs$eventTime - allYesRSVPs$time
 
> allYesRSVPs[1:10,]
                  time           eventTime    difference
1  2011-06-05 12:12:27 2011-06-29 18:30:00 34937.55 mins
2  2011-06-05 14:49:04 2011-06-29 18:30:00 34780.93 mins
3  2011-06-10 11:22:47 2011-06-29 18:30:00 27787.22 mins
4  2011-06-07 15:27:07 2011-06-29 18:30:00 31862.88 mins
5  2011-06-06 20:21:45 2011-06-29 18:30:00 33008.25 mins
6  2011-07-04 19:49:04 2011-07-27 19:00:00 33070.93 mins
7  2011-07-05 16:40:10 2011-07-27 19:00:00 31819.83 mins
8  2011-08-19 07:41:10 2011-08-31 18:30:00 17928.83 mins
9  2011-08-24 12:47:40 2011-08-31 18:30:00 10422.33 mins
10 2011-08-18 09:56:53 2011-08-31 18:30:00 19233.12 mins

I then tried to use ggplot to create a bar chart of that data:

> ggplot(allYesRSVPs, aes(x=difference)) + geom_histogram(binwidth=1, fill="green")

Unfortunately that resulted in this error:

Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous
Error: Discrete value supplied to continuous scale

I couldn’t find anyone who had come across this problem before, but I did find the as.numeric function, which looked like it would convert the difference into an appropriate format:

allYesRSVPs$difference = as.numeric(allYesRSVPs$eventTime - allYesRSVPs$time, units="days")
> ggplot(allYesRSVPs, aes(x=difference)) + geom_histogram(binwidth=1, fill="green")

that resulted in the following chart:

[Chart: histogram of ‘yes’ RSVPs by days before the event]

We can see there is quite a heavy concentration of people RSVPing yes in the few days before the event and then the rest are scattered across the first 30 days.

We usually announce events 3-4 weeks in advance, so I don’t know that it tells us anything interesting other than that people seem to sign up for events when an email is sent out about them.

The date the meetup was announced (by email) isn’t currently exposed by the API but hopefully one day it will be.

The code is on github if you want to have a play – any suggestions welcome.

Categories: Programming

Episode 206: Ken Collier on Agile Analytics

Johannes Thönes talks to Dr. Ken Collier, Director of Agile Analytics at ThoughtWorks, about Agile Analytics. The outline includes: descriptive analytics, predictive analytics and prescriptive analytics; artificial intelligence, machine learning, data mining and statistics; collaborative filtering; data science and data scientists; data warehousing and business intelligence; online analytical processing (OLAP), extract transform load (ETL), feature […]
Categories: Programming