Skip to content

Software Development Blogs: Programming, Software Testing, Agile Project Management

Methods & Tools

Subscribe to Methods & Tools
if you are not afraid to read more than one page to be a smarter software developer, software tester or project manager!

Mark Needham
Syndicate content
Thoughts on Software Development
Updated: 32 min 36 sec ago

4 types of user

Tue, 07/29/2014 - 20:07

I’ve been working with Neo4j full time for slightly more than a year now and from interacting with the community I’ve noticed that while using different features of the product people fall into 4 categories.

These are as follows:

4types

On one axis we have ‘loudness’ i.e. how vocal somebody is either on twitter, StackOverflow or by email and on the other we have ‘success’ which is how well a product feature is working for them.

The people in the top half of the diagram will get the most attention because they’re the most visible.

Of those people we’ll tend to spend more time on the people who are unhappy and vocal to try and help them solve the problems their having.

When working with the people in the top left it’s difficult to understand how representative they are for the whole user base.

It could be the case that they aren’t representative at all and actually there is a quiet majority who the product is working for and are just getting on with it with no fuss.

However, it could equally be the case that they are absolutely representative and there are a lot of users quietly suffering / giving up using the product.

I haven’t come up with a good way to come across the less vocal users but in my experience they’ll often be passive users of the user group or Stack Overflow i.e. they’ll read existing issues but not post anything themselves.

Given this uncertainty I think it makes sense to assume that the silent majority suffer the same problems as the more vocal minority.

Another interesting thing I’ve noticed about this quadrant is that the people in the top right are often the best people in the community to help those who are struggling.

It’d be interesting to know whether anyone has noticed a similar thing with the products they worked on, and if so what approach do you take to unveiling the silent majority?

Categories: Programming

R: ggplot – Plotting back to back charts using facet_wrap

Fri, 07/25/2014 - 22:57

Earlier in the week I showed a way to plot back to back charts using R’s ggplot library but looking back on the code it felt like it was a bit hacky to ‘glue’ two charts together using a grid.

I wanted to find a better way.

To recap, I came up with the following charts showing the RSVPs to Neo4j London meetup events using this code:

2014 07 20 17 42 40

The first thing we need to do to simplify chart generation is to return ‘yes’ and ‘no’ responses in the same cypher query, like so:

timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
 
query = "MATCH (e:Event)<-[:TO]-(response {response: 'yes'})
         WITH e, COLLECT(response) AS yeses
         MATCH (e)<-[:TO]-(response {response: 'no'})<-[:NEXT]-()
         WITH e, COLLECT(response) + yeses AS responses
         UNWIND responses AS response
         RETURN response.time AS time, e.time + e.utc_offset AS eventTime, response.response AS response"
allRSVPs = cypher(graph, query)
allRSVPs$time = timestampToDate(allRSVPs$time)
allRSVPs$eventTime = timestampToDate(allRSVPs$eventTime)
allRSVPs$difference = as.numeric(allRSVPs$eventTime - allRSVPs$time, units="days")

The query is a bit because we want to capture the ‘no’ responses when they initially said yes which is why we check for a ‘NEXT’ relationship when looking for the negative responses.

Let’s inspect allRSVPs:

> allRSVPs[1:10,]
                  time           eventTime response difference
1  2014-06-13 21:49:20 2014-07-22 18:30:00       no   38.86157
2  2014-07-02 22:24:06 2014-07-22 18:30:00      yes   19.83743
3  2014-05-23 23:46:02 2014-07-22 18:30:00      yes   59.78053
4  2014-06-23 21:07:11 2014-07-22 18:30:00      yes   28.89084
5  2014-06-06 15:09:29 2014-07-22 18:30:00      yes   46.13925
6  2014-05-31 13:03:09 2014-07-22 18:30:00      yes   52.22698
7  2014-05-23 23:46:02 2014-07-22 18:30:00      yes   59.78053
8  2014-07-02 12:28:22 2014-07-22 18:30:00      yes   20.25113
9  2014-06-30 23:44:39 2014-07-22 18:30:00      yes   21.78149
10 2014-06-06 15:35:53 2014-07-22 18:30:00      yes   46.12091

We’ve returned the actual response with each row so that we can distinguish between responses. It will also come in useful for pivoting our single chart later on.

The next step is to get ggplot to generate our side by side charts. I started off by plotting both types of response on the same chart:

ggplot(allRSVPs, aes(x = difference, fill=response)) + 
  geom_bar(binwidth=1)

2014 07 25 22 14 28

This one stacks the ‘yes’ and ‘no’ responses on top of each other which isn’t what we want as it’s difficult to compare the two.

What we need is the facet_wrap function which allows us to generate multiple charts grouped by key. We’ll group by ‘response’:

ggplot(allRSVPs, aes(x = difference, fill=response)) + 
  geom_bar(binwidth=1) + 
  facet_wrap(~ response, nrow=2, ncol=1)

2014 07 25 22 34 46

The only thing we’re missing now is the red and green colours which is where the scale_fill_manual function comes in handy:

ggplot(allRSVPs, aes(x = difference, fill=response)) + 
  scale_fill_manual(values=c("#FF0000", "#00FF00")) + 
  geom_bar(binwidth=1) +
  facet_wrap(~ response, nrow=2, ncol=1)

2014 07 25 22 39 56

If we want to show the ‘yes’ chart on top we can pass in an extra parameter to facet_wrap to change where it places the highest value:

ggplot(allRSVPs, aes(x = difference, fill=response)) + 
  scale_fill_manual(values=c("#FF0000", "#00FF00")) + 
  geom_bar(binwidth=1) +
  facet_wrap(~ response, nrow=2, ncol=1, as.table = FALSE)

2014 07 25 22 43 29

We could go one step further and group by response and day. First let’s add a ‘day’ column to our data frame:

allRSVPs$dayOfWeek = format(allRSVPs$eventTime, "%A")

And now let’s plot the charts using both columns:

ggplot(allRSVPs, aes(x = difference, fill=response)) + 
  scale_fill_manual(values=c("#FF0000", "#00FF00")) + 
  geom_bar(binwidth=1) +
  facet_wrap(~ response + dayOfWeek, as.table = FALSE)

2014 07 25 22 49 57

The distribution of dropouts looks fairly similar for all the days – Thursday is just at an order of magnitude below the other days because we haven’t run many events on Thursdays so far.

At a glance it doesn’t appear that so many people sign up for Thursday events on the day or one day before.

One potential hypothesis is that people have things planned for Thursday whereas they decide more last minute what to do on the other days.

We’ll have to run some more events on Thursdays to see whether that trend holds out.

The code is on github if you want to play with it

Categories: Programming

Java: Determining the status of data import using kill signals

Wed, 07/23/2014 - 23:20

A few weeks ago I was working on the initial import of ~ 60 million bits of data into Neo4j and we kept running into a problem where the import process just seemed to freeze and nothing else was imported.

It was very difficult to tell what was happening inside the process – taking a thread dump merely informed us that it was attempting to process one line of a CSV line and was somehow unable to do so.

One way to help debug this would have been to print out every single line of the CSV as we processed it and then watch where it got stuck but this seemed a bit over kill. Ideally we wanted to only print out the line we were processing on demand.

As luck would have it we can do exactly this by sending a kill signal to our import process and have it print out where it had got up to. We had to make sure we picked a signal which wasn’t already being handled by the JVM and decided to go with ‘SIGTRAP’ i.e. kill -5 [pid]

We came across a neat blog post that explained how to wire everything up and then created our own version:

class Kill3Handler implements SignalHandler
{
    private AtomicInteger linesProcessed;
    private AtomicReference<Map<String, Object>> lastRowProcessed;
 
    public Kill3Handler( AtomicInteger linesProcessed, AtomicReference<Map<String, Object>> lastRowProcessed )
    {
        this.linesProcessed = linesProcessed;
        this.lastRowProcessed = lastRowProcessed;
    }
 
    @Override
    public void handle( Signal signal )
    {
        System.out.println("Last Line Processed: " + linesProcessed.get() + " " + lastRowProcessed.get());
    }
}

We then wired that up like so:

AtomicInteger linesProcessed = new AtomicInteger( 0 );
AtomicReference<Map<String, Object>> lastRowProcessed = new AtomicReference<>(  );
Kill3Handler kill3Handler = new Kill3Handler( linesProcessed, lastRowProcessed );
Signal.handle(new Signal("TRAP"), kill3Handler);
 
// as we iterate each line we update those variables
 
linesProcessed.incrementAndGet();
lastRowProcessed.getAndSet( properties ); // properties = a representation of the row we're processing

This worked really well for us and we were able to work out that we had a slight problem with some of the data in our CSV file which was causing it to be processed incorrectly.

We hadn’t been able to see this by visual inspection since the CSV files were a few GB in size. We’d therefore only skimmed a few lines as a sanity check.

I didn’t even know you could do this but it’s a neat trick to keep in mind – I’m sure it shall come in useful again.

Categories: Programming

R: ggplot – Plotting back to back bar charts

Sun, 07/20/2014 - 17:50

I’ve been playing around with R’s ggplot library to explore the Neo4j London meetup and the next thing I wanted to do was plot back to back bar charts showing ‘yes’ and ‘no’ RSVPs.

I’d already done the ‘yes’ bar chart using the following code:

query = "MATCH (e:Event)<-[:TO]-(response {response: 'yes'})
         RETURN response.time AS time, e.time + e.utc_offset AS eventTime"
allYesRSVPs = cypher(graph, query)
allYesRSVPs$time = timestampToDate(allYesRSVPs$time)
allYesRSVPs$eventTime = timestampToDate(allYesRSVPs$eventTime)
allYesRSVPs$difference = as.numeric(allYesRSVPs$eventTime - allYesRSVPs$time, units="days")
 
ggplot(allYesRSVPs, aes(x=difference)) + geom_histogram(binwidth=1, fill="green")
2014 07 20 01 15 39

The next step was to create a similar thing for people who’d RSVP’d ‘no’ having originally RSVP’d ‘yes’ i.e. people who dropped out:

query = "MATCH (e:Event)<-[:TO]-(response {response: 'no'})<-[:NEXT]-()
         RETURN response.time AS time, e.time + e.utc_offset AS eventTime"
allNoRSVPs = cypher(graph, query)
allNoRSVPs$time = timestampToDate(allNoRSVPs$time)
allNoRSVPs$eventTime = timestampToDate(allNoRSVPs$eventTime)
allNoRSVPs$difference = as.numeric(allNoRSVPs$eventTime - allNoRSVPs$time, units="days")
 
ggplot(allNoRSVPs, aes(x=difference)) + geom_histogram(binwidth=1, fill="red")
2014 07 20 17 25 03

As expected if people are going to drop out they do so a day or two before the event happens. By including the need for a ‘NEXT’ relationship we only capture the people who replied ‘yes’ and changed it to ‘no’. We don’t capture the people who said ‘no’ straight away.

I thought it’d be cool to be able to have the two charts back to back using the same scale so I could compare them against each other which led to my first attempt:

yes = ggplot(allYesRSVPs, aes(x=difference)) + geom_histogram(binwidth=1, fill="green")
no = ggplot(allNoRSVPs, aes(x=difference)) + geom_histogram(binwidth=1, fill="red") + scale_y_reverse()
library(gridExtra)
grid.arrange(yes,no,ncol=1,widths=c(1,1))

scale_y_reverse() flips the y axis so we’d see the ‘no’ chart upside down. The last line plots the two charts in a grid containing 1 column which forces them to go next to each other vertically.

2014 07 20 17 29 27

When we compare them next to each other we can see that the ‘yes’ replies are much more spread out whereas if people are going to drop out it nearly always happens a week or so before the event happens. This is what we thought was happening but it’s cool to have it confirmed by the data.

One annoying thing about that visualisation is that the two charts aren’t on the same scale. The ‘no’ chart only goes up to 100 days whereas the ‘yes’ one goes up to 120 days. In addition, the top end of the ‘yes’ chart is around 200 whereas the ‘no’ is around 400.

Luckily we can solve that problem by fixing the axes for both plots:

yes = ggplot(allYesRSVPs, aes(x=difference)) + 
  geom_histogram(binwidth=1, fill="green") +
  xlim(0,120) + 
  ylim(0, 400)
 
no = ggplot(allNoRSVPs, aes(x=difference)) +
  geom_histogram(binwidth=1, fill="red") +
  xlim(0,120) + 
  ylim(0, 400) +
  scale_y_reverse()

Now if we re-render it looks much better:

2014 07 20 17 42 40

From having comparable axes we can see that a lot more people drop out of an event (500) as it approaches than new people sign up (300). This is quite helpful for working out how many people are likely to show up.

We’ve found that the number of people RSVP’d ‘yes’ to an event will drop by 15-20% overall from 2 days before an event up until the evening of the event and the data seems to confirm this.

The only annoying thing about this approach is that the axes are repeated due to them being completely separate charts.

I expect it would look better if I can work out how to combine the two data frames together and then pull out back to back charts based on a variable in the combined data frame.

I’m still working on that so suggestions are most welcome. The code is on github if you want to play with it.

Categories: Programming

Neo4j 2.1.2: Finding where I am in a linked list

Sun, 07/20/2014 - 16:13

I was recently asked how to calculate the position of a node in a linked list and realised that as the list increases in size this is one of the occasions when we should write an unmanaged extension rather than using cypher.

I wrote a quick bit of code to create a linked list with 10,000 elements in it:

public class Chains 
{
    public static void main(String[] args)
    {
        String simpleChains = "/tmp/longchains";
        populate( simpleChains, 10000 );
    }
 
    private static void populate( String path, int chainSize )
    {
        GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase( path );
        try(Transaction tx = db.beginTx()) {
            Node currentNode = null;
            for ( int i = 0; i < chainSize; i++ )
            {
                Node node = db.createNode();
 
                if(currentNode != null) {
                    currentNode.createRelationshipTo( node, NEXT );
                }
                currentNode = node;
            }
            tx.success();
        }
 
 
        db.shutdown();
    }
}

To find our distance from the end of the linked list we could write the following cypher query:

match n  where id(n) = {nodeId}  with n 
match path = (n)-[:NEXT*]->() 
RETURN id(n) AS nodeId, length(path) AS length 
ORDER BY length DESC 
LIMIT 1;

For simplicity we’re finding a node by it’s internal node id and then finding the ‘NEXT’ relationships going out from this node recursively. We then filter the results so that we only get the longest path back which will be our distance to the end of the list.

I noticed that this query would sometimes take 10s of seconds so I wrote a version using the Java Traversal API to see whether I could get it any quicker.

This is the Java version:

try(Transaction tx = db.beginTx()) {
    Node startNode = db.getNodeById( nodeId );
    TraversalDescription traversal = db.traversalDescription();
    Traverser traverse = traversal
            .depthFirst()
            .relationships( NEXT, Direction.OUTGOING )
            .sort( new Comparator<Path>()
            {
                @Override
                public int compare( Path o1, Path o2 )
                {
                    return Integer.valueOf( o2.length() ).compareTo( o1 .length() );
                }
            } )
            .traverse( startNode );
 
    Collection<Path> paths = IteratorUtil.asCollection( traverse );
 
    int maxLength = traverse.iterator().next().length();
    System.out.print( maxLength );
 
    tx.failure();
}

This is a bit more verbose than the cypher version but computes the same result. We’ve sorted the paths by length using a comparator to ensure we get the longest path back first.

I created a little program to warm up the caches and kick off a few iterations where I queried from different nodes and returned the length and time taken. These were the results:

--------
(Traversal API) Node:    1, Length: 9998, Time (ms):  15
       (Cypher) Node:    1, Length: 9998, Time (ms): 26225
(Traversal API) Node:  456, Length: 9543, Time (ms):  10
       (Cypher) Node:  456, Length: 9543, Time (ms): 24881
(Traversal API) Node:  761, Length: 9238, Time (ms):   9
       (Cypher) Node:  761, Length: 9238, Time (ms): 9941
--------
(Traversal API) Node:    1, Length: 9998, Time (ms):   9
       (Cypher) Node:    1, Length: 9998, Time (ms): 12537
(Traversal API) Node:  456, Length: 9543, Time (ms):   8
       (Cypher) Node:  456, Length: 9543, Time (ms): 15690
(Traversal API) Node:  761, Length: 9238, Time (ms):   7
       (Cypher) Node:  761, Length: 9238, Time (ms): 9202
--------
(Traversal API) Node:    1, Length: 9998, Time (ms):   8
       (Cypher) Node:    1, Length: 9998, Time (ms): 11905
(Traversal API) Node:  456, Length: 9543, Time (ms):   7
       (Cypher) Node:  456, Length: 9543, Time (ms): 22296
(Traversal API) Node:  761, Length: 9238, Time (ms):   8
       (Cypher) Node:  761, Length: 9238, Time (ms): 8739
--------

Interestingly when I reduced the size of the linked list to 1000 the difference wasn’t so pronounced:

--------
(Traversal API) Node:    1, Length: 998, Time (ms):   5
       (Cypher) Node:    1, Length: 998, Time (ms): 174
(Traversal API) Node:  456, Length: 543, Time (ms):   2
       (Cypher) Node:  456, Length: 543, Time (ms):  71
(Traversal API) Node:  761, Length: 238, Time (ms):   1
       (Cypher) Node:  761, Length: 238, Time (ms):  13
--------
(Traversal API) Node:    1, Length: 998, Time (ms):   2
       (Cypher) Node:    1, Length: 998, Time (ms): 111
(Traversal API) Node:  456, Length: 543, Time (ms):   1
       (Cypher) Node:  456, Length: 543, Time (ms):  40
(Traversal API) Node:  761, Length: 238, Time (ms):   1
       (Cypher) Node:  761, Length: 238, Time (ms):  12
--------
(Traversal API) Node:    1, Length: 998, Time (ms):   3
       (Cypher) Node:    1, Length: 998, Time (ms): 129
(Traversal API) Node:  456, Length: 543, Time (ms):   2
       (Cypher) Node:  456, Length: 543, Time (ms):  48
(Traversal API) Node:  761, Length: 238, Time (ms):   0
       (Cypher) Node:  761, Length: 238, Time (ms):  12
--------

which is good news as most linked lists that we’ll create will be in the 10s – 100s range rather than 10,000 which was what I was faced with.

I’m sure cypher will reach parity for this type of query in future which will be great as I like writing cypher much more than I do Java. For now though it’s good to know we have a backup option to call on when necessary.

The code is available as a gist if you want to play around with it further.

Categories: Programming

R: ggplot – Don’t know how to automatically pick scale for object of type difftime – Discrete value supplied to continuous scale

Sun, 07/20/2014 - 01:21

While reading ‘Why The R Programming Language Is Good For Business‘ I came across Udacity’s ‘Data Analysis with R‘ courses – part of which focuses exploring data sets using visualisations, something I haven’t done much of yet.

I thought it’d be interesting to create some visualisations around the times that people RSVP ‘yes’ to the various Neo4j events that we run in London.

I started off with the following query which returns the date time that people replied ‘Yes’ to an event and the date time of the event:

library(Rneo4j)
query = "MATCH (e:Event)<-[:TO]-(response {response: 'yes'})
         RETURN response.time AS time, e.time + e.utc_offset AS eventTime"
allYesRSVPs = cypher(graph, query)
allYesRSVPs$time = timestampToDate(allYesRSVPs$time)
allYesRSVPs$eventTime = timestampToDate(allYesRSVPs$eventTime)
 
> allYesRSVPs[1:10,]
                  time           eventTime
1  2011-06-05 12:12:27 2011-06-29 18:30:00
2  2011-06-05 14:49:04 2011-06-29 18:30:00
3  2011-06-10 11:22:47 2011-06-29 18:30:00
4  2011-06-07 15:27:07 2011-06-29 18:30:00
5  2011-06-06 20:21:45 2011-06-29 18:30:00
6  2011-07-04 19:49:04 2011-07-27 19:00:00
7  2011-07-05 16:40:10 2011-07-27 19:00:00
8  2011-08-19 07:41:10 2011-08-31 18:30:00
9  2011-08-24 12:47:40 2011-08-31 18:30:00
10 2011-08-18 09:56:53 2011-08-31 18:30:00

I wanted to create a bar chart showing the amount of time in advance of a meetup that people RSVP’d ‘yes’ so I added the following column to my data frame:

allYesRSVPs$difference = allYesRSVPs$eventTime - allYesRSVPs$time
 
> allYesRSVPs[1:10,]
                  time           eventTime    difference
1  2011-06-05 12:12:27 2011-06-29 18:30:00 34937.55 mins
2  2011-06-05 14:49:04 2011-06-29 18:30:00 34780.93 mins
3  2011-06-10 11:22:47 2011-06-29 18:30:00 27787.22 mins
4  2011-06-07 15:27:07 2011-06-29 18:30:00 31862.88 mins
5  2011-06-06 20:21:45 2011-06-29 18:30:00 33008.25 mins
6  2011-07-04 19:49:04 2011-07-27 19:00:00 33070.93 mins
7  2011-07-05 16:40:10 2011-07-27 19:00:00 31819.83 mins
8  2011-08-19 07:41:10 2011-08-31 18:30:00 17928.83 mins
9  2011-08-24 12:47:40 2011-08-31 18:30:00 10422.33 mins
10 2011-08-18 09:56:53 2011-08-31 18:30:00 19233.12 mins

I then tried to use ggplot to create a bar chart of that data:

> ggplot(allYesRSVPs, aes(x=difference)) + geom_histogram(binwidth=1, fill="green")

Unfortunately that resulted in this error:

Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous
Error: Discrete value supplied to continuous scale

I couldn’t find anyone who had come across this problem before in my search but I did find the as.numeric function which seemed like it would put the difference into an appropriate format:

allYesRSVPs$difference = as.numeric(allYesRSVPs$eventTime - allYesRSVPs$time, units="days")
> ggplot(allYesRSVPs, aes(x=difference)) + geom_histogram(binwidth=1, fill="green")

that resulted in the following chart:

2014 07 20 01 15 39

We can see there is quite a heavy concentration of people RSVPing yes in the few days before the event and then the rest are scattered across the first 30 days.

We usually announce events 3/4 weeks in advance so I don’t know that it tells us anything interesting other than that it seems like people sign up for events when an email is sent out about them.

The date the meetup was announced (by email) isn’t currently exposed by the API but hopefully one day it will be.

The code is on github if you want to have a play – any suggestions welcome.

Categories: Programming

R: Apply a custom function across multiple lists

Wed, 07/16/2014 - 06:04

In my continued playing around with R I wanted to map a custom function over two lists comparing each item with its corresponding items.

If we just want to use a built in function such as subtraction between two lists it’s quite easy to do:

> c(10,9,8,7,6,5,4,3,2,1) - c(5,4,3,4,3,2,2,1,2,1)
 [1] 5 5 5 3 3 3 2 2 0 0

I wanted to do a slight variation on that where instead of returning the difference I wanted to return a text value representing the difference e.g. ’5 or more’, ’3 to 5′ etc.

I spent a long time trying to figure out how to do that before finding an excellent blog post which describes all the different ‘apply’ functions available in R.

As far as I understand ‘apply’ is the equivalent of ‘map’ in Clojure or other functional languages.

In this case we want the mapply variant which we can use like so:

> mapply(function(x, y) { 
    if((x-y) >= 5) {
        "5 or more"
    } else if((x-y) >= 3) {
        "3 to 5"
    } else {
        "less than 5"
    }    
  }, c(10,9,8,7,6,5,4,3,2,1),c(5,4,3,4,3,2,2,1,2,1))
 [1] "5 or more"   "5 or more"   "5 or more"   "3 to 5"      "3 to 5"      "3 to 5"      "less than 5"
 [8] "less than 5" "less than 5" "less than 5"

We could then pull that out into a function if we wanted:

summarisedDifference <- function(one, two) {
  mapply(function(x, y) { 
    if((x-y) >= 5) {
      "5 or more"
    } else if((x-y) >= 3) {
      "3 to 5"
    } else {
      "less than 5"
    }    
  }, one, two)
}

which we could call like so:

> summarisedDifference(c(10,9,8,7,6,5,4,3,2,1),c(5,4,3,4,3,2,2,1,2,1))
 [1] "5 or more"   "5 or more"   "5 or more"   "3 to 5"      "3 to 5"      "3 to 5"      "less than 5"
 [8] "less than 5" "less than 5" "less than 5"

I also wanted to be able to compare a list of items to a single item which was much easier than I expected:

> summarisedDifference(c(10,9,8,7,6,5,4,3,2,1), 1)
 [1] "5 or more"   "5 or more"   "5 or more"   "5 or more"   "5 or more"   "3 to 5"      "3 to 5"     
 [8] "less than 5" "less than 5" "less than 5"

If we wanted to get a summary of the differences between the lists we could plug them into ddply like so:

> library(plyr)
> df = data.frame(x=c(10,9,8,7,6,5,4,3,2,1), y=c(5,4,3,4,3,2,2,1,2,1))
> ddply(df, .(difference=summarisedDifference(x,y)), summarise, count=length(x))
   difference count
1      3 to 5     3
2   5 or more     3
3 less than 5     4
Categories: Programming

Neo4j: LOAD CSV – Processing hidden arrays in your CSV documents

Thu, 07/10/2014 - 15:54

I was recently asked how to process an ‘array’ of values inside a column in a CSV file using Neo4j’s LOAD CSV tool and although I initially thought this wouldn’t be possible as every cell is treated as a String, Michael showed me a way of working around this which I thought was pretty neat.

Let’s say we have a CSV file representing people and their friends. It might look like this:

name,friends
"Mark","Michael,Peter"
"Michael","Peter,Kenny"
"Kenny","Anders,Michael"

And what we want is to have the following nodes:

  • Mark
  • Michael
  • Peter
  • Kenny
  • Anders

And the following friends relationships:

  • Mark -> Michael
  • Mark -> Peter
  • Michael -> Peter
  • Michael -> Kenny
  • Kenny -> Anders
  • Kenny -> Michael

We’ll start by loading the CSV file and returning each row:

$ load csv with headers from "file:/Users/markneedham/Desktop/friends.csv" AS row RETURN row;
+------------------------------------------------+
| row                                            |
+------------------------------------------------+
| {name -> "Mark", friends -> "Michael,Peter"}   |
| {name -> "Michael", friends -> "Peter,Kenny"}  |
| {name -> "Kenny", friends -> "Anders,Michael"} |
+------------------------------------------------+
3 rows

As expected the ‘friends’ column is being treated as a String which means we can use the split function to get an array of people that we want to be friends with:

$ load csv with headers from "file:/Users/markneedham/Desktop/friends.csv" AS row RETURN row, split(row.friends, ",") AS friends;
+-----------------------------------------------------------------------+
| row                                            | friends              |
+-----------------------------------------------------------------------+
| {name -> "Mark", friends -> "Michael,Peter"}   | ["Michael","Peter"]  |
| {name -> "Michael", friends -> "Peter,Kenny"}  | ["Peter","Kenny"]    |
| {name -> "Kenny", friends -> "Anders,Michael"} | ["Anders","Michael"] |
+-----------------------------------------------------------------------+
3 rows

Now that we’ve got them as an array we can use UNWIND to get pairs of friends that we want to create:

$ load csv with headers from "file:/Users/markneedham/Desktop/friends.csv" AS row 
  WITH row, split(row.friends, ",") AS friends 
  UNWIND friends AS friend 
  RETURN row.name, friend;
+-----------------------+
| row.name  | friend    |
+-----------------------+
| "Mark"    | "Michael" |
| "Mark"    | "Peter"   |
| "Michael" | "Peter"   |
| "Michael" | "Kenny"   |
| "Kenny"   | "Anders"  |
| "Kenny"   | "Michael" |
+-----------------------+
6 rows

And now we’ll introduce some MERGE statements to create the appropriate nodes and relationships:

$ load csv with headers from "file:/Users/markneedham/Desktop/friends.csv" AS row 
  WITH row, split(row.friends, ",") AS friends 
  UNWIND friends AS friend  
  MERGE (p1:Person {name: row.name}) 
  MERGE (p2:Person {name: friend}) 
  MERGE (p1)-[:FRIENDS_WITH]->(p2);
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 5
Relationships created: 6
Properties set: 5
Labels added: 5
373 ms

And now if we query the database to get back all the nodes + relationships…

$ match (p1:Person)-[r]->(p2) RETURN p1,r, p2;
+------------------------------------------------------------------------+
| p1                      | r                  | p2                      |
+------------------------------------------------------------------------+
| Node[0]{name:"Mark"}    | :FRIENDS_WITH[0]{} | Node[1]{name:"Michael"} |
| Node[0]{name:"Mark"}    | :FRIENDS_WITH[1]{} | Node[2]{name:"Peter"}   |
| Node[1]{name:"Michael"} | :FRIENDS_WITH[2]{} | Node[2]{name:"Peter"}   |
| Node[1]{name:"Michael"} | :FRIENDS_WITH[3]{} | Node[3]{name:"Kenny"}   |
| Node[3]{name:"Kenny"}   | :FRIENDS_WITH[4]{} | Node[4]{name:"Anders"}  |
| Node[3]{name:"Kenny"}   | :FRIENDS_WITH[5]{} | Node[1]{name:"Michael"} |
+------------------------------------------------------------------------+
6 rows

…you’ll see that we have everything.

If instead of a comma separated list of people we have a literal array in the cell…

name,friends
"Mark", "[Michael,Peter]"
"Michael", "[Peter,Kenny]"
"Kenny", "[Anders,Michael]"

…we’d need to tweak the part of the query which extracts our friends to strip off the first and last characters:

$ load csv with headers from "file:/Users/markneedham/Desktop/friendsa.csv" AS row 
  RETURN row, split(substring(row.friends, 1, length(row.friends) -2), ",") AS friends;
+-------------------------------------------------------------------------+
| row                                              | friends              |
+-------------------------------------------------------------------------+
| {name -> "Mark", friends -> "[Michael,Peter]"}   | ["Michael","Peter"]  |
| {name -> "Michael", friends -> "[Peter,Kenny]"}  | ["Peter","Kenny"]    |
| {name -> "Kenny", friends -> "[Anders,Michael]"} | ["Anders","Michael"] |
+-------------------------------------------------------------------------+
3 rows

And then if we put the whole query together we end up with this:

$ load csv with headers from "file:/Users/markneedham/Desktop/friendsa.csv" AS row 
  WITH row, split(substring(row.friends, 1, length(row.friends) -2), ",") AS friends 
  UNWIND friends AS friend  
  MERGE (p1:Person {name: row.name}) 
  MERGE (p2:Person {name: friend}) 
  MERGE (p1)-[:FRIENDS_WITH]->(p2);;
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 5
Relationships created: 6
Properties set: 5
Labels added: 5
Categories: Programming

R/plyr: ddply – Error in vector(type, length) : vector: cannot make a vector of mode ‘closure’.

Mon, 07/07/2014 - 07:07

In my continued playing around with plyr’s ddply function I was trying to group a data frame by one of its columns and return a count of the number of rows with specific values and ran into a strange (to me) error message.

I had a data frame:

n = c(2, 3, 5) 
s = c("aa", "bb", "cc") 
b = c(TRUE, FALSE, TRUE) 
df = data.frame(n, s, b)

And wanted to group and count on column ‘b’ so I’d get back a count of 2 for TRUE and 1 for FALSE. I wrote this code:

ddply(df, "b", function(x) { 
  countr <- length(x$n) 
  data.frame(count = count) 
})

which when evaluated gave the following error:

Error in vector(type, length) : 
  vector: cannot make a vector of mode 'closure'.

It took me quite a while to realise that I’d just made a typo in assigned the count to a variable called ‘countr’ instead of ‘count’.

As a result of that typo I think the R compiler was trying to find a variable called ‘count’ somwhere else in the lexical scope but was unable to. If I’d defined the variable ‘count’ outside the call to ddply function then my typo wouldn’t have resulted in an error but rather an unexpected resulte.g.

> count = 10
> ddply(df, "b", function(x) { 
+   countr <- length(x$n) 
+   data.frame(count = count) 
+ })
      b count
1 FALSE     4
2  TRUE     4

Once I spotted the typo and fixed it things worked as expected:

> ddply(df, "b", function(x) { 
+   count <- length(x$n) 
+   data.frame(count = count) 
+ })
      b count
1 FALSE     1
2  TRUE     2
Categories: Programming

R/plyr: ddply – Renaming the grouping/generated column when grouping by date

Wed, 07/02/2014 - 07:30

On Nicole’s recommendation I’ve been having a look at R’s plyr package to see if I could simplify my meetup analysis and I started by translating my code that grouped meetup join dates by day of the week.

To refresh, the code without plyr looked like this:

library(Rneo4j)
timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01")
 
query = "MATCH (:Person)-[:HAS_MEETUP_PROFILE]->()-[:HAS_MEMBERSHIP]->(membership)-[:OF_GROUP]->(g:Group {name: \"Neo4j - London User Group\"})
         RETURN membership.joined AS joinDate"
meetupMembers = cypher(graph, query)
meetupMembers$joined <- timestampToDate(meetupMembers$joinDate)
 
dd = aggregate(meetupMembers$joined, by=list(format(meetupMembers$joined, "%A")), function(x) length(x))
colnames(dd) = c("dayOfWeek", "count")

which returns the following:

> dd
  dayOfWeek count
1    Friday   135
2    Monday   287
3  Saturday    80
4    Sunday   102
5  Thursday   187
6   Tuesday   286
7 Wednesday   211

We need to use plyr’s ddply function which takes a data frame and transforms it into another one.

To refresh, this is what the initial data frame looks like:

> meetupMembers[1:10,]
       joinDate              joined
1  1.376572e+12 2013-08-15 14:13:40
2  1.379491e+12 2013-09-18 08:55:11
3  1.349454e+12 2012-10-05 17:28:04
4  1.383127e+12 2013-10-30 09:59:03
5  1.372239e+12 2013-06-26 10:27:40
6  1.330295e+12 2012-02-26 22:27:00
7  1.379676e+12 2013-09-20 12:22:39
8  1.398462e+12 2014-04-25 22:41:19
9  1.331734e+12 2012-03-14 14:11:43
10 1.396874e+12 2014-04-07 13:32:26

Most of the examples of using ddply show how to group by a specific ‘column’ e.g. joined but I want to group by part of the value in that column and eventually came across an example which showed how to do it:

> ddply(meetupMembers, .(format(joined, "%A")), function(x) {
    count <- length(x$joined)
    data.frame(count = count)
  })
  format(joined, "%A") count
1               Friday   135
2               Monday   287
3             Saturday    80
4               Sunday   102
5             Thursday   187
6              Tuesday   286
7            Wednesday   211

Unfortunately the generated column heading for the group by key isn’t very readable and it took me way longer than it should have to work out how to name it as I wanted! This is how you do it:

> ddply(meetupMembers, .(dayOfWeek=format(joined, "%A")), function(x) {
    count <- length(x$joined)
    data.frame(count = count)
  })
  dayOfWeek count
1    Friday   135
2    Monday   287
3  Saturday    80
4    Sunday   102
5  Thursday   187
6   Tuesday   286
7 Wednesday   211

If we want to sort that in descending order by ‘count’ we can wrap that ddply in another one:

> ddply(ddply(meetupMembers, .(dayOfWeek=format(joined, "%A")), function(x) {
    count <- length(x$joined)
    data.frame(count = count)
  }), .(count = count* -1))
  dayOfWeek count
1    Monday   287
2   Tuesday   286
3 Wednesday   211
4  Thursday   187
5    Friday   135
6    Sunday   102
7  Saturday    80

From reading a bit about ddply I gather that its slower than using some other approaches e.g. data.table but I’m not dealing with much data so it’s not an issue yet.

Once I got the hang of how it worked ddply was quite nice to work with so I think I’ll have a go at translating some of my other code to use it now.

Categories: Programming

R: Aggregate by different functions and join results into one data frame

Mon, 06/30/2014 - 23:47

In continuing my analysis of the London Neo4j meetup group using R I wanted to see which days of the week we organise meetups and how many people RSVP affirmatively by the day.

I started out with this query which returns each event and the number of ‘yes’ RSVPS:

library(Rneo4j)
timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01")
 
query = "MATCH (g:Group {name: \"Neo4j - London User Group\"})-[:HOSTED_EVENT]->(event)<-[:TO]-({response: 'yes'})<-[:RSVPD]-()
         WHERE (event.time + event.utc_offset) < timestamp()
         RETURN event.time + event.utc_offset AS eventTime, COUNT(*) AS rsvps"
events = cypher(graph, query)
events$datetime <- timestampToDate(events$eventTime)
      eventTime rsvps            datetime
1  1.314815e+12     3 2011-08-31 19:30:00
2  1.337798e+12    13 2012-05-23 19:30:00
3  1.383070e+12    29 2013-10-29 18:00:00
4  1.362474e+12     5 2013-03-05 09:00:00
5  1.369852e+12    66 2013-05-29 19:30:00
6  1.385572e+12    67 2013-11-27 17:00:00
7  1.392142e+12    35 2014-02-11 18:00:00
8  1.364321e+12    23 2013-03-26 18:00:00
9  1.372183e+12    22 2013-06-25 19:00:00
10 1.401300e+12    60 2014-05-28 19:00:00

I wanted to get a data frame which had these columns:

Day of Week | RSVPs | Number of Events

Getting the number of events for a given day was quite easy as I could use the groupBy function I wrote last time:

groupBy = function(dates, format) {
  dd = aggregate(dates, by=list(format(dates, format)), function(x) length(x))
  colnames(dd) = c("key", "count")
  dd
}
 
> groupBy(events$datetime, "%A")
        key count
1  Thursday     9
2   Tuesday    24
3 Wednesday    35

The next step is to get the sum of RSVPs by the day which we can get with the following code:

dd = aggregate(events$rsvps, by=list(format(events$datetime, "%A")), FUN=sum)
colnames(dd) = c("key", "count")

The difference between this and our previous use of the aggregate function is that we’re passing in the number of RSVPs for each event and then grouping by the day and summing up the values for each day rather than counting how many occurrences there are.

If we evaluate ‘dd’ we get the following:

> dd
        key count
1  Thursday   194
2   Tuesday   740
3 Wednesday  1467

We now have two data tables with a very similar shape and it turns out there’s a function called merge which makes it very easy to convert these two data frames into a single one:

x = merge(groupBy(events$datetime, "%A"), dd, by = "key")
colnames(x) = c("day", "events", "rsvps")
> x
        day events rsvps
1  Thursday      9   194
2   Tuesday     24   740
3 Wednesday     35  1467

We could now choose to order our new data frame by number of events descending:

> x[order(-x$events),]
        day events rsvps
3 Wednesday     35  1467
2   Tuesday     24   740
1  Thursday      9   194

We might also add an extra column to calculate the average number of RSVPs per day:

> x$rsvpsPerEvent = x$rsvps / x$events
> x
        day events rsvps rsvpsPerEvent
1  Thursday      9   194      21.55556
2   Tuesday     24   740      30.83333
3 Wednesday     35  1467      41.91429

I’m still getting the hang of it but already it seems like the combination of R and Neo4j allows us to quickly get insights into our data and I’ve barely scratched the surface!

Categories: Programming

R: Order by data frame column and take top 10 rows

Mon, 06/30/2014 - 22:44

I’ve been doing some ad-hoc analysis of the Neo4j London meetup group using R and Neo4j and having worked out how to group by certain keys the next step was to order the rows of the data frame.

I wanted to drill into the days on which people join the group and see whether they join it at a specific time of day. My feeling was that most people would join on a Monday morning.

The first step was to run the query using RNeo4j and then group by day and hour:

library(Rneo4j)
 
query = "MATCH (:Person)-[:HAS_MEETUP_PROFILE]->()-[:HAS_MEMBERSHIP]->(membership)-[:OF_GROUP]->(g:Group {name: \"Neo4j - London User Group\"})
         RETURN membership.joined AS joinDate"
 
timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01")
 
meetupMembers = cypher(graph, query)
meetupMembers$joined <- timestampToDate(meetupMembers$joinDate)
 
groupBy = function(dates, format) {
  dd = aggregate(dates, by= list(format(dates, format)), function(x) length(x))
  colnames(dd) = c("key", "count")
  dd
}
 
byDayTime = groupBy(meetupMembers$joined, "%A %H:00")

This returned quite a few rows so we’ll just display a subset of them:

> byDayTime[12:25,]
            key count
12 Friday 14:00    12
13 Friday 15:00     8
14 Friday 16:00    11
15 Friday 17:00    10
16 Friday 18:00     3
17 Friday 19:00     1
18 Friday 20:00     3
19 Friday 21:00     4
20 Friday 22:00     7
21 Friday 23:00     2
22 Monday 00:00     3
23 Monday 01:00     1
24 Monday 03:00     1
25 Monday 05:00     3

The next step was to order by the ‘count’ column which wasn’t too difficult:

> byDayTime[order(byDayTime$count),][1:10,]
              key count
2    Friday 03:00     1
3    Friday 04:00     1
4    Friday 05:00     1
5    Friday 07:00     1
17   Friday 19:00     1
23   Monday 01:00     1
24   Monday 03:00     1
46 Saturday 03:00     1
66   Sunday 06:00     1
67   Sunday 07:00     1

If we run the order function on its own we’ll see that it returns the order in which the current rows in the data frame should appear:

> order(byDayTime$count)
  [1]   2   3   4   5  17  23  24  46  66  67 109 128 129   1  21  44  47  48  81  86  87  88 108 130  16  18  22  25  45  53  64  71  75 107  19  26  49  51  55  56  58  59  61
 [44]  65  68  77  79  85 106 110 143  50  52  54  82  84 101 127 146  27  57  60  62  63  69  70  73  99 103 126 145   6  20  76  83  89 105 122 131 144   7  13  40  43  72  80
 [87] 102  39  78 100 132 147  15  94 121 123 142  14  42  74 104 137 140  12  38  92  93 111 124   8   9  11  90  96 125 139  10  32  34  36  95  97  98  28 135 136  33  35 112
[130] 113 116 134  91 141  41 115 120 133  37 119 138  31 117 118  30 114  29

The first 4 rows in our sorted data frame will be rows 2-5 from the initial data frame, which are:

           key count
2 Friday 03:00     1
3 Friday 04:00     1
4 Friday 05:00     1
5 Friday 07:00     1

So that makes sense! In our case we want to sort in descending order which we can do by prefixing the sorting variable with a minus sign:

> byDayTime[order(-byDayTime$count),][1:10,]
                key count
29     Monday 09:00    34
30     Monday 10:00    28
114   Tuesday 11:00    28
31     Monday 11:00    27
117   Tuesday 14:00    27
118   Tuesday 15:00    27
138 Wednesday 14:00    23
119   Tuesday 16:00    22
37     Monday 17:00    21
115   Tuesday 12:00    20

As expected Monday morning makes a strong showing although Tuesday afternoon is also popular which is unexpected. We’ll need to do some more investigation to figure out what’s going on there.

Categories: Programming

Neo4j/R: Grouping meetup members by join timestamp

Mon, 06/30/2014 - 01:06

I wanted to do some ad-hoc analysis on the join date of members of the Neo4j London meetup group and since cypher doesn’t yet have functions for dealings with dates I thought I’d give R a try.

I started off by executing a cypher query which returned the join timestamp of all the group members using Nicole White’s RNeo4j package:

> library(Rneo4j)
 
> query = "match (:Person)-[:HAS_MEETUP_PROFILE]->()-[:HAS_MEMBERSHIP]->(membership)-[:OF_GROUP]->(g:Group {name: \"Neo4j - London User Group\"})
RETURN membership.joined AS joinDate"
 
> meetupMembers = cypher(graph, query)
 
> meetupMembers[1:5,]
[1] 1.389107e+12 1.376572e+12 1.379491e+12 1.349454e+12 1.383127e+12

I realised that if I was going to do any date manipulation I’d need to translate the timestamp into an R friendly format so I wrote the following function to help me do that:

> timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01")

I added another column to the data frame with this date representation:

> meetupMembers$joined <- timestampToDate(meetupMembers$joinDate)
 
> meetupMembers[1:5,]
      joinDate              joined
1 1.389107e+12 2014-01-07 15:08:40
2 1.376572e+12 2013-08-15 14:13:40
3 1.379491e+12 2013-09-18 08:55:11
4 1.349454e+12 2012-10-05 17:28:04
5 1.383127e+12 2013-10-30 09:59:03

Next I wanted to group those timestamps by the combination of month + year for which the aggregate and format functions came in handy:

> dd = aggregate(meetupMembers$joined, by=list(format(meetupMembers$joined, "%m-%Y")), function(x) length(x))
> colnames(dd) = c("month", "count")
> dd
     month count
1  01-2012     4
2  01-2013    52
3  01-2014    88
4  02-2012     7
5  02-2013    52
6  02-2014    91
7  03-2012    12
8  03-2013    23
9  03-2014    93
10 04-2012     3
11 04-2013    34
12 04-2014   119
13 05-2012     9
14 05-2013    69
15 05-2014   102
16 06-2011    14
17 06-2012     5
18 06-2013    39
19 06-2014   114
20 07-2011     4
21 07-2012    16
22 07-2013    20
23 08-2011     2
24 08-2012    34
25 08-2013    50
26 09-2012    14
27 09-2013    52
28 10-2011     2
29 10-2012    29
30 10-2013    42
31 11-2011     2
32 11-2012    31
33 11-2013    34
34 12-2012     7
35 12-2013    19

I wanted to be able to group by different date formats so I created the following function to make life easier:

groupBy = function(dates, format) {
  dd = aggregate(dates, by= list(format(dates, format)), function(x) length(x))
  colnames(dd) = c("key", "count")
  dd
}

Now we can find the join dates grouped by year:

> groupBy(meetupMembers$joined, "%Y")
   key count
1 2011    24
2 2012   171
3 2013   486
4 2014   607

or by day:

> groupBy(meetupMembers$joined, "%A")
        key count
1    Friday   135
2    Monday   287
3  Saturday    80
4    Sunday   102
5  Thursday   187
6   Tuesday   286
7 Wednesday   211

or by month:

> groupBy(meetupMembers$joined, "%m")
   key count
1   01   144
2   02   150
3   03   128
4   04   156
5   05   180
6   06   172
7   07    40
8   08    86
9   09    66
10  10    73
11  11    67
12  12    26

I found the ‘by day’ grouping interesting as I had the impression that the huge majority of people joined meetup groups on a Monday but the difference between Monday and Tuesday isn’t significant. 60% of the joins happen between Monday and Wednesday.

The ‘by month’ grouping is a bit skewed by the fact we’re only half way into 2014 and there have been a lot more people joining this year than in previous years.

If we exclude this year then the spread is more uniform with a slight dip in December:

> groupBy(meetupMembers$joined[format(meetupMembers$joined, "%Y") != 2014], "%m")
   key count
1   01    56
2   02    59
3   03    35
4   04    37
5   05    78
6   06    58
7   07    40
8   08    86
9   09    66
10  10    73
11  11    67
12  12    26

Next up I think I need to get some charts going on and perhaps compare the distributions of join dates of various London meetup groups against each other.

I’m an absolute R newbie so if anything I’ve done is stupid and can be done better please let me know.

Categories: Programming

Neo4j: Set Based Operations with the experimental Cypher optimiser

Sun, 06/29/2014 - 09:45

A few months ago I wrote about cypher queries which look for a missing relationship and showed how you could optimise them by re-working the query slightly.

To refresh, we wanted to find all the people in the London office that I hadn’t worked with given this model…

…and this initial query:

MATCH (p:Person {name: "me"})-[:MEMBER_OF]->(office {name: "London Office"})<-[:MEMBER_OF]-(colleague)
WHERE NOT (p-[:COLLEAGUES]->(colleague))
RETURN COUNT(colleague)

This took on average 7.46 seconds to execute using cypher-query-tuning so we came up with the following version which took 150 ms on average:

MATCH (p:Person {name: "me"})-[:COLLEAGUES]->(colleague)
WITH p, COLLECT(colleague) as marksColleagues
MATCH (colleague)-[:MEMBER_OF]->(office {name: "London Office"})<-[:MEMBER_OF]-(p)
WHERE NOT (colleague IN marksColleagues)
RETURN COUNT(colleague)

With the release of Neo4j 2.1 we can now make use of Ronja – the experimental Cypher optimiser – which performs much better for certain types of queries. I thought I’d give it a try against this one.

We can use the experimental optimiser by prefixing our query like so:

cypher 2.1.experimental MATCH (p:Person {name: "me"})-[:MEMBER_OF]->(office {name: "London Office"})<-[:MEMBER_OF]-(colleague)
WHERE NOT (p-[:COLLEAGUES]->(colleague))
RETURN COUNT(colleague)

If we run that through the query tuner we get the following results:

$ python set-based.py
 
cypher 2.1.experimental MATCH (p:Person {name: "me"})-[:MEMBER_OF]->(office {name: "London Office"})<-[:MEMBER_OF]-(colleague)
WHERE NOT (p-[:COLLEAGUES]->(colleague))
RETURN COUNT(colleague)
Min 0.719580888748 50% 0.723278999329 95% 0.741609430313 Max 0.743646144867
 
 
MATCH (p:Person {name: "me"})-[:COLLEAGUES]->(colleague)
WITH p, COLLECT(colleague) as marksColleagues
MATCH (colleague)-[:MEMBER_OF]->(office {name: "London Office"})<-[:MEMBER_OF]-(p)
WHERE NOT (colleague IN marksColleagues)
RETURN COUNT(colleague)
Min 0.706955909729 50% 0.715770959854 95% 0.731880950928 Max 0.733670949936

As you can see there’s not much in it – our original query now runs as quickly as the optimised one. Ronja #ftw!

Give it a try on your slow queries and see how it gets on. There’ll certainly be some cases where it’s slower but over time it should be faster for a reasonable chunk of queries.

Categories: Programming

Neo4j’s Cypher vs Clojure – Group by and Sorting

Sun, 06/29/2014 - 03:56

One of the points that I emphasised during my talk on building Neo4j backed applications using Clojure last week is understanding when to use Cypher to solve a problem and when to use the programming language.

A good example of this is in the meetup application I’ve been working on. I have a collection of events and want to display past events in descending order and future events in ascending order.

First let’s create some future and some past events based on the current timestamp of 1404006050535:

CREATE (event1:Event {name: "Future Event 1", timestamp: 1414002772427 })
CREATE (event2:Event {name: "Future Event 2", timestamp: 1424002772427 })
CREATE (event3:Event {name: "Future Event 3", timestamp: 1416002772427 })
 
CREATE (event4:Event {name: "Past Event 1", timestamp: 1403002772427 })
CREATE (event5:Event {name: "Past Event 2", timestamp: 1402002772427 })

If we return all the events we see the following:

$ MATCH (e:Event) RETURN e;
==> +------------------------------------------------------------+
==> | e                                                          |
==> +------------------------------------------------------------+
==> | Node[15414]{name:"Future Event 1",timestamp:1414002772427} |
==> | Node[15415]{name:"Future Event 2",timestamp:1424002772427} |
==> | Node[15416]{name:"Future Event 3",timestamp:1416002772427} |
==> | Node[15417]{name:"Past Event 1",timestamp:1403002772427}   |
==> | Node[15418]{name:"Past Event 2",timestamp:1402002772427}   |
==> +------------------------------------------------------------+
==> 5 rows
==> 13 ms

We can achieve the desired grouping and sorting with the following cypher query:

(def sorted-query "MATCH (e:Event)
WITH COLLECT(e) AS events
WITH [e IN events WHERE e.timestamp <= timestamp()] AS pastEvents,
     [e IN events WHERE e.timestamp > timestamp()] AS futureEvents
UNWIND pastEvents AS pastEvent
WITH pastEvent, futureEvents ORDER BY pastEvent.timestamp DESC
WITH COLLECT(pastEvent) as orderedPastEvents, futureEvents
UNWIND futureEvents AS futureEvent
WITH futureEvent, orderedPastEvents ORDER BY futureEvent.timestamp
RETURN COLLECT(futureEvent) AS orderedFutureEvents, orderedPastEvents")

We then use the following function to call through to the Neo4j server using the excellent neocons library:

(ns neo4j-meetup.db
  (:require [clojure.walk :as walk])
  (:require [clojurewerkz.neocons.rest.cypher :as cy])
  (:require [clojurewerkz.neocons.rest :as nr]))
 
(def NEO4J_HOST "http://localhost:7521/db/data/")
 
(defn cypher
  ([query] (cypher query {}))
  ([query params]
     (let [conn (nr/connect! NEO4J_HOST)]
       (->> (cy/tquery query params)
            walk/keywordize-keys))))

We call that function and grab the first row since we know there won’t be any other rows in the result:

(def query-result (->> ( db/cypher sorted-query) first))

Now we need to extract the past and future collections so that we can display them on the page which we can do like so:

> (map #(% :data) (query-result :orderedPastEvents))
({:timestamp 1403002772427, :name "Past Event 1"} {:timestamp 1402002772427, :name "Past Event 2"})
 
> (map #(% :data) (query-result :orderedFutureEvents))
({:timestamp 1414002772427, :name "Future Event 1"} {:timestamp 1416002772427, :name "Future Event 3"} {:timestamp 1424002772427, :name "Future Event 2"})

An alternative approach is to return the events from cypher and then handle the grouping and sorting in clojure. In that case our query is much simpler:

(def unsorted-query "MATCH (e:Event) RETURN e")

We’ll use the clj-time library to determine the current time:

(def now (clj-time.coerce/to-long (clj-time.core/now)))

First let’s split the events into past and future:

> (def grouped-by-events 
     (->> (db/cypher unsorted-query)
          (map #(->> % :e :data))
          (group-by #(> (->> % :timestamp) now))))
 
> grouped-by-events
{true [{:timestamp 1414002772427, :name "Future Event 1"} {:timestamp 1424002772427, :name "Future Event 2"} {:timestamp 1416002772427, :name "Future Event 3"}], 
 false [{:timestamp 1403002772427, :name "Past Event 1"} {:timestamp 1402002772427, :name "Past Event 2"}]}

And finally we sort appropriately using these functions:

(defn time-descending [row] (* -1 (->> row :timestamp)))
(defn time-ascending [row] (->> row :timestamp))
> (sort-by time-descending (get grouped-by-events false))
({:timestamp 1403002772427, :name "Past Event 1"} {:timestamp 1402002772427, :name "Past Event 2"})
 
> (sort-by time-ascending (get grouped-by-events true))
({:timestamp 1414002772427, :name "Future Event 1"} {:timestamp 1416002772427, :name "Future Event 3"} {:timestamp 1424002772427, :name "Future Event 2"})

I used Clojure to do the sorting and grouping in my project because the query to get the events was a bit more complicated and became very difficult to read with the sorting and grouping mixed in.

Unfortunately cypher doesn’t provide an easy way to sort within a collection so we need our sorting in the row context and then collect the elements back again afterwards.

Categories: Programming

Data Science: Mo’ Data Mo’ Problems

Sun, 06/29/2014 - 00:35

Over the last couple of years I’ve worked on several proof of concept style Neo4j projects and on a lot of them people have wanted to work with their entire data set which I don’t think makes sense so early on.

In the early parts of a project we’re trying to prove out our approach rather than prove we can handle big data – something that Ashok taught me a couple of years ago on a project we worked on together.

In a Neo4j project that means coming up with an effective way to model and query our data and if we lose track of this it’s very easy to get sucked into working on the big data problem.

This could mean optimising our import scripts to deal with huge amounts of data or working out how to handle different aspects of the data (e.g. variability in shape or encoding) that only seem to reveal themselves at scale.

These are certainly problems that we need to solve but in my experience they end up taking much more time than expected and therefore aren’t the best problem to tackle when time is limited. Early on we want to create some momentum and keep the feedback cycle fast.

We probably want to tackle the data size problem as part of the implementation/production stage of the project to use Michael Nygaard’s terminology.

At this stage we’ll have some confidence that our approach makes sense and then we can put aside the time to set things up properly.

I’m sure there are some types of projects where this approach doesn’t make sense so I’d love to hear about them in the comments so I can spot them in future.

Categories: Programming

Neo4j: Cypher – Finding movies by decade

Sat, 06/28/2014 - 12:12

I was recently asked how to find the number of movies produced per decade in the movie data set that comes with the Neo4j browser and can be imported with the following command:

:play movies

We want to get one row per decade and have a count alongside so the easiest way is to start with one decade and build from there.

MATCH (movie:Movie)
WHERE movie.released >= 1990 and movie.released <= 1999
RETURN 1990 + "-" + 1999 as years, count(movie) AS movies
ORDER BY years

Note that we’re doing a label scan of all nodes of type Movie as there are no indexes for range queries. In this case it’s fine as we have few movies but If we had 100s of thousands of movies then we’d want to optimise the WHERE clause to make use of an IN which would then use any indexes.

If we run the query we get the following result:

==> +----------------------+
==> | years       | movies |
==> +----------------------+
==> | "1990-1999" | 21     |
==> +----------------------+
==> 1 row

Let’s pull out the start and end years so they’re explicitly named:

WITH 1990 AS startDecade, 1999 AS endDecade
MATCH (movie:Movie)
WHERE movie.released >= startDecade and movie.released <= endDecade
RETURN startDecade + "-" + endDecade as years, count(movie)
ORDER BY years

Now we need to create a collection of start and end years so we can return more than one. We can use the UNWIND function to take a collection of decades and run them through the rest of the query:

UNWIND [{start: 1970, end: 1979}, {start: 1980, end: 1989}, {start: 1980, end: 1989}, {start: 1990, end: 1999}, {start: 2000, end: 2009}, {start: 2010, end: 2019}] AS row
WITH row.start AS startDecade, row.end AS endDecade
MATCH (movie:Movie)
WHERE movie.released >= startDecade and movie.released <= endDecade
RETURN startDecade + "-" + endDecade as years, count(movie)
ORDER BY years
==> +----------------------------+
==> | years       | count(movie) |
==> +----------------------------+
==> | "1970-1979" | 2            |
==> | "1980-1989" | 2            |
==> | "1990-1999" | 21           |
==> | "2000-2009" | 13           |
==> | "2010-2019" | 1            |
==> +----------------------------+
==> 5 rows

Alistair pointed out that we can simplify this even further by using the RANGE function:

UNWIND range(1970,2010,10) as startDecade
WITH startDecade, startDecade + 9 as endDecade
MATCH (movie:Movie)
WHERE movie.released >= startDecade and movie.released <= endDecade
RETURN startDecade + "-" + endDecade as years, count(movie)
ORDER BY years

And here’s a graph gist for you to play with.

Categories: Programming

Neo4j: Cypher – Separation of concerns

Fri, 06/27/2014 - 11:51

While preparing my talk on building Neo4j backed applications with Clojure I realised that some of the queries I’d written were incredibly complicated and went against anything I’d learnt about separating different concerns.

One example of this was the query I used to generate the data for the following page of the meetup application I’ve been working on:

2014 06 27 08 19 34 2014 06 27 08 31 13

Depending on the selected tab you can choose to see the people signed up for the meetup and the date that they signed up or the topics that those people are interested in.

For reference, this is an outline of the schema of the graph behind the application:

2014 06 27 11 51 00

This was my initial query to get the data:

MATCH (event:Event {id: {eventId}})-[:HELD_AT]->(venue)
OPTIONAL MATCH (event)<-[:TO]-(rsvp)<-[:RSVPD]-(person)
OPTIONAL MATCH (person)-[:INTERESTED_IN]->(topic) WHERE ()-[:HAS_TOPIC]->(topic)
WITH event, venue, rsvp, person, COLLECT(topic) as topics ORDER BY rsvp.time
OPTIONAL MATCH (rsvp)<-[:NEXT]-(initial)
WITH event, venue, COLLECT({rsvp: rsvp, initial: initial, person: person, topics: topics}) AS responses
WITH event, venue,
    [response in responses WHERE response.initial is null AND response.rsvp.response = "yes"] as attendees,
    [response in responses WHERE NOT response.initial is null] as dropouts, responses
UNWIND([response in attendees | response.topics]) AS topics
UNWIND(topics) AS topic
WITH event, venue, attendees, dropouts, {id: topic.id, name:topic.name, freq:COUNT(*)} AS t
RETURN event, venue, attendees, dropouts, COLLECT(t) AS topics

The first two lines of the query works out which people have RSVP’d to a particular event, the 3rd line captures the topics they’re interested in as long as the topic is linked to at least one of the NoSQL London groups.

We then optionally capture their initial RSVP in case they’ve changed it before doing a bit of data manipulation to group everything together.

If we run a slight variation of that which only shows a few of the topics, attendees and dropouts this is the type of result we get:

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| event.name                               | venue.name      | [a IN attendees[0..5] | a.person.name]                                 | [d in dropouts[0..5] | d.person.name]                              | topics[0..5]                                                                                                                                                                                                                                                    |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| "Building Neo4j backed web applications" | "Skills Matter" | ["Mark Needham","Alistair Jones","Jim Webber","Axel Morgner","Ramesh"] | ["Frank Gibson","Keith Hinde","Richard Mason","Ollie Glass","Tom"] | [{id -> 10538, name -> "Business Intelligence", freq -> 3},{id -> 61680, name -> "HBase", freq -> 3},{id -> 61679, name -> "Hive", freq -> 2},{id -> 193021, name -> "Graph Databases", freq -> 12},{id -> 85951, name -> "JavaScript Frameworks", freq -> 10}] |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The problem is we’ve mixed together two different concerns – the attendees to a meetup and the topics they’re interested in – which made the query quite hard to understand when I came back to it a couple of months later.

Instead what we can do is split the query in two and make two different calls to the server. We then end up with the following:

// Get the event + attendees + dropouts
MATCH (event:Event {id: {eventId}})-[:HELD_AT]->(venue)
OPTIONAL MATCH (event)<-[:TO]-(rsvp)<-[:RSVPD]-(person)
WITH event, venue, rsvp, person ORDER BY rsvp.time
OPTIONAL MATCH (rsvp)<-[:NEXT]-(initial)
WITH event, venue, COLLECT({rsvp: rsvp, initial: initial, person: person}) AS responses
WITH event, venue,
    [response in responses WHERE response.initial is null 
                           AND response.rsvp.response = "yes"] as attendees,
    [response in responses WHERE NOT response.initial is null] as dropouts
RETURN event, venue, attendees, dropouts
// Get the topics the attendees are interested in
MATCH (event:Event {id: {eventId}})
MATCH (event)<-[:TO]-(rsvp {response: "yes"})<-[:RSVPD]-(person)-[:INTERESTED_IN]->(topic)
WHERE ()-[:HAS_TOPIC]->(topic)
RETURN topic.id AS id, topic.name AS name, COUNT(*) AS freq

The first query is still a bit complex but that’s because there’s a bit of tricky logic to distinguish people who signed up and dropped out. However, the second query is now quite easy to read and expresses it’s intent very clearly.

Categories: Programming

Neo4j: LOAD CSV – Handling conditionals

Wed, 06/18/2014 - 00:41

While building up the Neo4j World Cup Graph I’ve been making use of the LOAD CSV function and I frequently found myself needing to do different things depending on the value in one of the columns.

For example I have one CSV file which contains the different events that can happen in a football match:

match_id,player,player_id,time,type
"1012","Antonin Panenka","174835",21,"penalty"
"1012","Faisal Al Dakhil","2204",57,"goal"
"102","Roger Milla","79318",106,"goal"
"102","Roger Milla","79318",108,"goal"
"102","Bernardo Redin","44555",115,"goal"
"102","Andre Kana-biyik","174649",44,"yellow"

If the type is ‘penalty’, ‘owngoal’ or ‘goal’ then I want to create a SCORED_GOAL relationship whereas if it’s ‘yellow’, ‘yellowred’ or ‘red’ then I want to create a RECEIVED_CARD relationship instead.

I learnt – from reading a cypher script written by Chris Leishman – that we can make FOREACH mimic a conditional by creating a collection with one item in to represent ‘true’ and an empty collection to represent ‘false’.

In this case we’d end up with something like this to handle the case where a row represents a goal:

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/mneedham/neo4j-worldcup/master/data/import/events.csv" AS csvLine
 
// removed for conciseness
 
// goals
FOREACH(n IN (CASE WHEN csvLine.type IN ["penalty", "goal", "owngoal"] THEN [1] else [] END) |
  FOREACH(t IN CASE WHEN team = home THEN [home] ELSE [away] END |
    MERGE (stats)-[:SCORED_GOAL]->(penalty:Goal {time: csvLine.time, type: csvLine.type})
  )		
)

And equally when we want to process a row that represents a card we’d have this:

// cards
FOREACH(n IN (CASE WHEN csvLine.type IN ["yellow", "red", "yellowred"] THEN [1] else [] END) |
  FOREACH(t IN CASE WHEN team = home THEN [home] ELSE [away] END |
    MERGE (stats)-[:RECEIVED_CARD]->(card {time: csvLine.time, type: csvLine.type})
  )		
)

And if we put everything together we get this:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/mneedham/neo4j-worldcup/master/data/import/events.csv" AS csvLine
 
MATCH (home)<-[:HOME_TEAM]-(match:Match {id: csvLine.match_id})-[:AWAY_TEAM]->(away)
 
MATCH (player:Player {id: csvLine.player_id})-[:IN_SQUAD]->(squad)<-[:NAMED_SQUAD]-(team)
MATCH (player)-[:STARTED|:SUBSTITUTE]->(stats)-[:IN_MATCH]->(match)
 
// goals
FOREACH(n IN (CASE WHEN csvLine.type IN ["penalty", "goal", "owngoal"] THEN [1] else [] END) |
  FOREACH(t IN CASE WHEN team = home THEN [home] ELSE [away] END |
    MERGE (stats)-[:SCORED_GOAL]->(penalty:Goal {time: csvLine.time, type: csvLine.type})
  )		
)
 
// cards
FOREACH(n IN (CASE WHEN csvLine.type IN ["yellow", "red", "yellowred"] THEN [1] else [] END) |
  FOREACH(t IN CASE WHEN team = home THEN [home] ELSE [away] END |
    MERGE (stats)-[:RECEIVED_CARD]->(card {time: csvLine.time, type: csvLine.type})
  )		
)
;

You can have a look at the code on github or follow the instructions to get all the World Cup graph into your own local Neo4j.

Feedback welcome as always.

Categories: Programming

Ruby: Regex – Matching the Trademark ™ character

Sun, 06/08/2014 - 02:34

I’ve been playing around with some World Cup data and while cleaning up the data I wanted to strip out the year and host country for a world cup.

I started with a string like this which I was reading from a file:

1930 FIFA World Cup Uruguay ™

And I wanted to be able to extract just the ‘Uruguay’ bit without getting the trademark or the space preceding it. I initially tried the following to match all parts of the line and extract my bit:

p text.match(/\d{4} FIFA World Cup (.*?) ™/)[1]

Unfortunately that doesn’t actually compile:

tm.rb:4: syntax error, unexpected $end, expecting ')'
p text.match(/\d{4} FIFA World Cup (.*?) ™/)[1]
                                           ^

I was initially able to work around the problem by matching the unicode code point instead:

p text.match(/\d{4} FIFA World Cup (.*?) \u2122/)[1]

While working on this blog post I also remembered that you can specify the character set of your Ruby file and by default it’s ASCII which would explain why it doesn’t like the ™ character.

If we add the following line at the top of the file then we can happily use the ™ character in our regex:

# encoding: utf-8
# ...
p text.match(/\d{4} FIFA World Cup (.*?) ™/)[1]
# returns "Uruguay"

This post therefore ends up being more of a reminder for future Mark when he comes across this problem again having forgotten about Ruby character sets!

Categories: Programming