Mark Needham

Thoughts on Software Development

Neo4j: Summarising neo4j-shell output

Fri, 08/21/2015 - 21:59

I frequently find myself trying to optimise a set of cypher queries and I tend to group them together in a script that I feed to the Neo4j shell.

When tweaking the queries it’s easy to make a mistake and end up not creating the same data, so I decided to write a script which shows me the aggregates across all the commands executed.

I want to see the number of constraints created, indexes added, and nodes, relationships and properties created. The first two don’t need to match across the scripts but the latter three should be the same.

I put together the following script:

import re
import sys
from tabulate import tabulate
 
lines = sys.stdin.readlines()
 
def search(term, line):
    m = re.match(term + r": (.*)", line)
    return int(m.group(1)) if m else 0
 
nodes_created, relationships_created, constraints_added, indexes_added, labels_added, properties_set = 0, 0, 0, 0, 0, 0
time = "N/A"
for line in lines:
    nodes_created = nodes_created + search("Nodes created", line)
    relationships_created = relationships_created + search("Relationships created", line)
    constraints_added = constraints_added + search("Constraints added", line)
    indexes_added = indexes_added + search("Indexes added", line)
    labels_added = labels_added + search("Labels added", line)
    properties_set = properties_set + search("Properties set", line)
 
    # the import script wraps neo4j-shell in 'time', so the output ends with e.g. 'real 0m17.595s'
    time_match = re.match(r"real.*([0-9]+m[0-9]+\.[0-9]+s)$", line)
 
    if time_match:
        time = time_match.group(1)
 
table = [
    ["Constraints added", constraints_added],
    ["Indexes added", indexes_added],
    ["Nodes created", nodes_created],
    ["Relationships created", relationships_created],
    ["Labels added", labels_added],
    ["Properties set", properties_set],
    ["Time", time]
]
print tabulate(table)

Its input is the piped output of the neo4j-shell command, which contains a description of all the queries it executed.

$ cat import.sh
#!/bin/sh
 
{ ./neo4j-community-2.2.3/bin/neo4j stop; } 2>&1
rm -rf neo4j-community-2.2.3/data/graph.db/
{ ./neo4j-community-2.2.3/bin/neo4j start; } 2>&1
{ time ./neo4j-community-2.2.3/bin/neo4j-shell --file "$1"; } 2>&1
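
(The curly braces around each command are there so that stderr gets redirected along with stdout – see the ‘Unix: Redirecting stderr to stdout’ post further down for the full story.)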

We can use the script in two ways.

Either we can pipe the output of our shell straight into it and just get the summary, e.g.

$ ./import.sh local.import.optimised.cql | python summarise.py
 
---------------------  ---------
Constraints added      5
Indexes added          1
Nodes created          13249
Relationships created  32227
Labels added           21715
Properties set         36480
Time                   0m17.595s
---------------------  ---------

…or we can use the Unix ‘tee’ command to write the output both to stdout and to a file, and then either tail the file in another window or inspect it afterwards to see the detailed timings, e.g.

$ ./import.sh local.import.optimised.cql | tee /tmp/output.txt |  python summarise.py
 
---------------------  ---------
Constraints added      5
Indexes added          1
Nodes created          13249
Relationships created  32227
Labels added           21715
Properties set         36480
Time                   0m11.428s
---------------------  ---------
$ tail -f /tmp/output.txt
+-------------+
| appearances |
+-------------+
| 3771        |
+-------------+
1 row
Nodes created: 3439
Properties set: 3439
Labels added: 3439
289 ms
+------------------------------------+
| appearances -> player, match, team |
+------------------------------------+
| 3771                               |
+------------------------------------+
1 row
Relationships created: 10317
1006 ms
...

My only dependency is the tabulate package to get the pretty table:

$ cat requirements.txt
 
tabulate==0.7.5
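
A quick pip install -r requirements.txt (or just pip install tabulate) takes care of that.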

The cypher script I’m running creates a BBC football graph which is available as a github project. Feel free to grab it and play around – if you run into any problems let me know!

Categories: Programming

Python: Extracting Excel spreadsheet into CSV files

Thu, 08/20/2015 - 00:27

I’ve been playing around with the Road Safety open data set and the download comes with several CSV files and an Excel spreadsheet containing the legend.

There are 45 sheets in total and each of them looks like this:

[Screenshot: one of the legend sheets – a column of codes alongside a column of labels]

I wanted to create a CSV file for each sheet so that I can import the data set into Neo4j using the LOAD CSV command.

I came across the Python Excel website which pointed me at the xlrd library, since I’m working with a pre-2010 Excel file.


The main documentation is very extensive but I found the github example much easier to follow.

I ended up with the following script which iterates through all but the first two sheets in the spreadsheet – the first two sheets contain instructions rather than data:

from xlrd import open_workbook
import csv
 
wb = open_workbook('Road-Accident-Safety-Data-Guide-1979-2004.xls')
 
# skip the first two sheets, which contain instructions rather than data
for i in range(2, wb.nsheets):
    sheet = wb.sheet_by_index(i)
    print sheet.name
    with open("data/%s.csv" % sheet.name.replace(" ", ""), "w") as f:
        writer = csv.writer(f, delimiter=",")
        print sheet, sheet.name, sheet.ncols, sheet.nrows
 
        header = [cell.value for cell in sheet.row(0)]
        writer.writerow(header)
 
        # xlrd hands numeric cells back as floats, so coerce those to ints
        for row_idx in range(1, sheet.nrows):
            row = [int(cell.value) if isinstance(cell.value, float) else cell.value
                   for cell in sheet.row(row_idx)]
            writer.writerow(row)

I’ve replaced spaces in the sheet names so that the file names on disk are a bit easier to work with. For some reason the numeric values all come back as floats whereas I wanted them as ints, so I had to explicitly apply that transformation.

Here are a few examples of what the CSV files look like:

$ cat data/1stPointofImpact.csv
code,label
0,Did not impact
1,Front
2,Back
3,Offside
4,Nearside
-1,Data missing or out of range
 
$ cat data/RoadType.csv
code,label
1,Roundabout
2,One way street
3,Dual carriageway
6,Single carriageway
7,Slip road
9,Unknown
12,One way street/Slip road
-1,Data missing or out of range
 
$ cat data/Weather.csv
code,label
1,Fine no high winds
2,Raining no high winds
3,Snowing no high winds
4,Fine + high winds
5,Raining + high winds
6,Snowing + high winds
7,Fog or mist
8,Other
9,Unknown
-1,Data missing or out of range

And that’s it. Not too difficult!

Categories: Programming

Unix: Stripping first n bytes in a file / Byte Order Mark (BOM)

Thu, 08/20/2015 - 00:27

I’ve previously written a couple of blog posts showing how to strip the byte order mark (BOM) out of CSV files to make loading them into Neo4j easier, and today I came across another way to clean up the file using tail.

The BOM is 3 bytes long and sits at the beginning of the file, so if we know that a file contains it then we can strip out those first 3 bytes using tail like this:

$ time tail -c +4 Casualty7904.csv > Casualty7904_stripped.csv
 
real	0m31.945s
user	0m31.370s
sys	0m0.518s

The -c flag is described thus:

-c number
             The location is number bytes.

So in this case we start reading at byte 4 (i.e. skipping the first 3 bytes) and then direct the output into a new file.

Although using tail is quite simple, it took 30 seconds to process a 300MB CSV file, which might actually be slower than opening the file with a hex editor and manually deleting the bytes!
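
If that bothers us we could also copy the bytes ourselves. Here’s a minimal Python sketch of that idea (assuming the file really does start with the 3-byte UTF-8 BOM, 0xEF 0xBB 0xBF) – I haven’t timed it against tail:

import shutil
 
with open("Casualty7904.csv", "rb") as source, \
     open("Casualty7904_stripped.csv", "wb") as target:
    # skip the 3-byte UTF-8 BOM, then copy the rest of the file in bulk
    assert source.read(3) == "\xef\xbb\xbf", "no UTF-8 BOM at the start of the file"
    shutil.copyfileobj(source, target)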

Categories: Programming

Unix: Redirecting stderr to stdout

Sat, 08/15/2015 - 16:55

I’ve been trying to optimise some Neo4j import queries over the last couple of days, and as part of the script I’ve been executing I wanted to redirect the output of a couple of commands into a file to parse afterwards.

I started with the following script which doesn’t do any explicit redirection of the output:

#!/bin/sh
 
./neo4j-community-2.2.3/bin/neo4j start

Now let’s run that script and redirect the output to a file:

$ ./foo.sh > /tmp/output.txt
Unable to find any JVMs matching version "1.7".
 
$ cat /tmp/output.txt
Starting Neo4j Server...WARNING: not changing user
process [48230]... waiting for server to be ready.... OK.
http://localhost:7474/ is ready.

So the line about not finding a matching JVM is being printed to stderr. That’s reasonably easy to fix:

#!/bin/sh
 
./neo4j-community-2.2.3/bin/neo4j start 2>&1

Let’s run the script again:

$ ./foo.sh > /tmp/output.txt
 
$ cat /tmp/output.txt
Unable to find any JVMs matching version "1.7".
Starting Neo4j Server...WARNING: not changing user
process [47989]... waiting for server to be ready.... OK.
http://localhost:7474/ is ready.

Great, that worked as expected. Next I extended the script to stop Neo4j, delete all its data, start it again and execute a cypher script:

#!/bin/sh
 
./neo4j-community-2.2.3/bin/neo4j start 2>&1
rm -rf neo4j-community-2.2.3/data/graph.db/
./neo4j-community-2.2.3/bin/neo4j start 2>&1
time ./neo4j-community-2.2.3/bin/neo4j-shell --file foo.cql 2>&1

Let’s run that script and redirect the output:

$ ./foo.sh > /tmp/output.txt
Unable to find any JVMs matching version "1.7".
 
real	0m0.604s
user	0m0.334s
sys	0m0.054s
 
$ cat /tmp/output.txt
Unable to find any JVMs matching version "1.7".
Another server-process is running with [50614], cannot start a new one. Exiting.
Unable to find any JVMs matching version "1.7".
Another server-process is running with [50614], cannot start a new one. Exiting.
+---------+
| "hello" |
+---------+
| "hello" |
+---------+
1 row
4 ms

It looks like our stderr -> stdout redirection on the last line didn’t work. My understanding is that the ‘time’ command swallows all the arguments that follow, whereas we want the redirection to be applied afterwards.

We can work our way around this problem by putting the actual command in a code block and redirecting the output of that:

#!/bin/sh
 
./neo4j-community-2.2.3/bin/neo4j start 2>&1
rm -rf neo4j-community-2.2.3/data/graph.db/
./neo4j-community-2.2.3/bin/neo4j start 2>&1
{ time ./neo4j-community-2.2.3/bin/neo4j-shell --file foo.cql; } 2>&1
Let’s run the script again:

$ ./foo.sh > /tmp/output.txt
 
$ cat /tmp/output.txt
Unable to find any JVMs matching version "1.7".
Another server-process is running with [50614], cannot start a new one. Exiting.
Unable to find any JVMs matching version "1.7".
Another server-process is running with [50614], cannot start a new one. Exiting.
Unable to find any JVMs matching version "1.7".
+---------+
| "hello" |
+---------+
| "hello" |
+---------+
1 row
4 ms
 
real	0m0.615s
user	0m0.316s
sys	0m0.050s

Much better!
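
An alternative is to do the redirection when invoking the script instead – e.g. ./foo.sh > /tmp/output.txt 2>&1 – which captures anything the script itself doesn’t redirect.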

Categories: Programming

Sed: Using environment variables

Thu, 08/13/2015 - 20:30

I’ve been playing around with the BBC football data set that I wrote about a couple of months ago and I wanted to write some code that would take the import script and replace all instances of remote URIs with a file system path.

For example the import file contains several lines similar to this:

LOAD CSV WITH HEADERS 
FROM "https://raw.githubusercontent.com/mneedham/neo4j-bbc/master/data/matches.csv" 
AS row

And I want that to read:

LOAD CSV WITH HEADERS 
FROM "file:///Users/markneedham/repos/neo4j-bbc/data/matches.csv" 
AS row

The start of that path also happens to be my working directory:

$ echo $PWD
/Users/markneedham/repos/neo4j-bbc

So I wanted to write a script that would look for occurrences of ‘https://raw.githubusercontent.com/mneedham/neo4j-bbc/master’ and replace it with $PWD. I’m a fan of Sed so I thought I’d try and use it to solve my problem.

The first thing we can do to make life easier is to change the default delimiter. Sed usually uses ‘/’ to separate the parts of the command, but since we’re working with URIs that are full of slashes that would be horrible, so we’ll use an underscore instead.

For a first cut I tried just removing that first part of the URI but not replacing it with anything in particular:

$ sed 's_https://raw.githubusercontent.com/mneedham/neo4j-bbc/master__' import.cql
 
$ sed 's_https://raw.githubusercontent.com/mneedham/neo4j-bbc/master__' import.cql | grep LOAD
LOAD CSV WITH HEADERS FROM "/data/matches.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/fouls.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "/data/subs.csv" AS row

Cool! That worked as expected. Now let’s try and replace it with $PWD:

$ sed 's_https://raw.githubusercontent.com/mneedham/neo4j-bbc/master_file://$PWD_' import.cql | grep LOAD
LOAD CSV WITH HEADERS FROM "file://$PWD/data/matches.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/fouls.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "file://$PWD/data/subs.csv" AS row

Hmmm that didn’t work as expected. The $PWD is being treated as a literal instead of being evaluated like we want it to be.

It turns out this is a popular question on Stack Overflow and there are lots of suggestions – I tried a few of them and found that closing the single quotes around $PWD, so that the shell evaluates it, did the trick:

$ sed 's_https://raw.githubusercontent.com/mneedham/neo4j-bbc/master_file://'$PWD'_' import.cql | grep LOAD
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/matches.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/fouls.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/subs.csv" AS row

We could also use double quotes everywhere if we prefer:

$ sed "s_https://raw.githubusercontent.com/mneedham/neo4j-bbc/master_file://"$PWD"_" import.cql | grep LOAD
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/matches.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/players.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/fouls.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/attempts.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/corners.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/cards.csv" AS row
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/repos/neo4j-bbc/data/subs.csv" AS row
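
Once we’re happy with the substitution we can drop the grep and redirect sed’s output into a new file – e.g. sed '...' import.cql > local.import.cql – to end up with an import script that points at the local copies of the CSV files.
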
Categories: Programming

Java: Jersey – java.lang.NoSuchMethodError: com.sun.jersey.core.reflection.ReflectionHelper.getContextClassLoaderPA()Ljava/security/PrivilegedAction;

Tue, 08/11/2015 - 07:59

I’ve been trying to put some tests around a Neo4j unmanaged extension I’ve been working on and ran into the following stack trace when launching the server using the Neo4j test harness:

public class ExampleResourceTest {
    @Rule
    public Neo4jRule neo4j = new Neo4jRule()
            .withFixture("CREATE (:Person {name: 'Mark'})")
            .withFixture("CREATE (:Person {name: 'Nicole'})")
            .withExtension( "/unmanaged", ExampleResource.class );
 
    @Test
    public void shouldReturnAllTheNodes() {
        // Given
        URI serverURI = neo4j.httpURI();
        // When
        HTTP.Response response = HTTP.GET(serverURI.resolve("/unmanaged/example/people").toString());
 
        // Then
        assertEquals(200, response.status());
        List content = response.content();
 
        assertEquals(2, content.size());
    }
}

07:51:32.985 [main] WARN  o.e.j.u.component.AbstractLifeCycle - FAILED o.e.j.s.ServletContextHandler@29eda4f8{/unmanaged,null,STARTING}: java.lang.NoSuchMethodError: com.sun.jersey.core.reflection.ReflectionHelper.getContextClassLoaderPA()Ljava/security/PrivilegedAction;
java.lang.NoSuchMethodError: com.sun.jersey.core.reflection.ReflectionHelper.getContextClassLoaderPA()Ljava/security/PrivilegedAction;
	at com.sun.jersey.spi.scanning.AnnotationScannerListener.<init>(AnnotationScannerListener.java:94) ~[jersey-server-1.19.jar:1.19]
	at com.sun.jersey.spi.scanning.PathProviderScannerListener.<init>(PathProviderScannerListener.java:59) ~[jersey-server-1.19.jar:1.19]
	at com.sun.jersey.api.core.ScanningResourceConfig.init(ScanningResourceConfig.java:79) ~[jersey-server-1.19.jar:1.19]
	at com.sun.jersey.api.core.PackagesResourceConfig.init(PackagesResourceConfig.java:104) ~[jersey-server-1.19.jar:1.19]
	at com.sun.jersey.api.core.PackagesResourceConfig.<init>(PackagesResourceConfig.java:78) ~[jersey-server-1.19.jar:1.19]
	at com.sun.jersey.api.core.PackagesResourceConfig.<init>(PackagesResourceConfig.java:89) ~[jersey-server-1.19.jar:1.19]
	at com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:696) ~[jersey-servlet-1.17.1.jar:1.17.1]
	at com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:674) ~[jersey-servlet-1.17.1.jar:1.17.1]
	at com.sun.jersey.spi.container.servlet.WebComponent.init(WebComponent.java:203) ~[jersey-servlet-1.17.1.jar:1.17.1]
	at com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:374) ~[jersey-servlet-1.17.1.jar:1.17.1]
	at com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:557) ~[jersey-servlet-1.17.1.jar:1.17.1]
	at javax.servlet.GenericServlet.init(GenericServlet.java:244) ~[javax.servlet-api-3.1.0.jar:3.1.0]
	at org.eclipse.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:612) ~[jetty-servlet-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.servlet.ServletHolder.initialize(ServletHolder.java:395) ~[jetty-servlet-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:871) ~[jetty-servlet-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:298) ~[jetty-servlet-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:741) ~[jetty-server-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) [jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:132) [jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:114) [jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:61) [jetty-server-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) [jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:132) [jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.server.Server.start(Server.java:387) [jetty-server-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:114) [jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:61) [jetty-server-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.server.Server.doStart(Server.java:354) [jetty-server-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) [jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.neo4j.server.web.Jetty9WebServer.startJetty(Jetty9WebServer.java:381) [neo4j-server-2.2.3.jar:2.2.3]
	at org.neo4j.server.web.Jetty9WebServer.start(Jetty9WebServer.java:184) [neo4j-server-2.2.3.jar:2.2.3]
	at org.neo4j.server.AbstractNeoServer.startWebServer(AbstractNeoServer.java:474) [neo4j-server-2.2.3.jar:2.2.3]
	at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:230) [neo4j-server-2.2.3.jar:2.2.3]
	at org.neo4j.harness.internal.InProcessServerControls.start(InProcessServerControls.java:59) [neo4j-harness-2.2.3.jar:2.2.3]
	at org.neo4j.harness.internal.InProcessServerBuilder.newServer(InProcessServerBuilder.java:72) [neo4j-harness-2.2.3.jar:2.2.3]
	at org.neo4j.harness.junit.Neo4jRule$1.evaluate(Neo4jRule.java:64) [neo4j-harness-2.2.3.jar:2.2.3]
	at org.junit.rules.RunRules.evaluate(RunRules.java:20) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) [junit-4.11.jar:na]
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) [junit-4.11.jar:na]
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309) [junit-4.11.jar:na]
	at org.junit.runner.JUnitCore.run(JUnitCore.java:160) [junit-4.11.jar:na]
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:78) [junit-rt.jar:na]
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:212) [junit-rt.jar:na]
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:68) [junit-rt.jar:na]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_51]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_51]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_51]
	at java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_51]
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140) [idea_rt.jar:na]
07:51:32.991 [main] WARN  o.e.j.u.component.AbstractLifeCycle - FAILED org.eclipse.jetty.server.handler.HandlerList@2b22a1cc[o.e.j.s.h.MovedContextHandler@5e671e20{/,null,AVAILABLE}, o.e.j.s.ServletContextHandler@29eda4f8{/unmanaged,null,STARTING}, o.e.j.s.ServletContextHandler@62573c86{/db/manage,null,null}, o.e.j.s.ServletContextHandler@2418ba04{/db/data,null,null}, o.e.j.w.WebAppContext@14229fa7{/browser,jar:file:/Users/markneedham/.m2/repository/org/neo4j/app/neo4j-browser/2.2.3/neo4j-browser-2.2.3.jar!/browser,null}, o.e.j.s.ServletContextHandler@2ab0702e{/,null,null}]: java.lang.NoSuchMethodError: com.sun.jersey.core.reflection.ReflectionHelper.getContextClassLoaderPA()Ljava/security/PrivilegedAction;
java.lang.NoSuchMethodError: com.sun.jersey.core.reflection.ReflectionHelper.getContextClassLoaderPA()Ljava/security/PrivilegedAction;
	at com.sun.jersey.spi.scanning.AnnotationScannerListener.<init>(AnnotationScannerListener.java:94) ~[jersey-server-1.19.jar:1.19]
	at com.sun.jersey.spi.scanning.PathProviderScannerListener.<init>(PathProviderScannerListener.java:59) ~[jersey-server-1.19.jar:1.19]
	at com.sun.jersey.api.core.ScanningResourceConfig.init(ScanningResourceConfig.java:79) ~[jersey-server-1.19.jar:1.19]
	at com.sun.jersey.api.core.PackagesResourceConfig.init(PackagesResourceConfig.java:104) ~[jersey-server-1.19.jar:1.19]
	at com.sun.jersey.api.core.PackagesResourceConfig.<init>(PackagesResourceConfig.java:78) ~[jersey-server-1.19.jar:1.19]
	at com.sun.jersey.api.core.PackagesResourceConfig.<init>(PackagesResourceConfig.java:89) ~[jersey-server-1.19.jar:1.19]
	at com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:696) ~[jersey-servlet-1.17.1.jar:1.17.1]
	at com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:674) ~[jersey-servlet-1.17.1.jar:1.17.1]
	at com.sun.jersey.spi.container.servlet.WebComponent.init(WebComponent.java:203) ~[jersey-servlet-1.17.1.jar:1.17.1]
	at com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:374) ~[jersey-servlet-1.17.1.jar:1.17.1]
	at com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:557) ~[jersey-servlet-1.17.1.jar:1.17.1]
	at javax.servlet.GenericServlet.init(GenericServlet.java:244) ~[javax.servlet-api-3.1.0.jar:3.1.0]
	at org.eclipse.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:612) ~[jetty-servlet-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.servlet.ServletHolder.initialize(ServletHolder.java:395) ~[jetty-servlet-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:871) ~[jetty-servlet-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:298) ~[jetty-servlet-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:741) ~[jetty-server-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) [jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:132) [jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:114) [jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:61) [jetty-server-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) [jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:132) [jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.server.Server.start(Server.java:387) [jetty-server-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:114) [jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:61) [jetty-server-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.server.Server.doStart(Server.java:354) [jetty-server-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) [jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.neo4j.server.web.Jetty9WebServer.startJetty(Jetty9WebServer.java:381) [neo4j-server-2.2.3.jar:2.2.3]
	at org.neo4j.server.web.Jetty9WebServer.start(Jetty9WebServer.java:184) [neo4j-server-2.2.3.jar:2.2.3]
	at org.neo4j.server.AbstractNeoServer.startWebServer(AbstractNeoServer.java:474) [neo4j-server-2.2.3.jar:2.2.3]
	at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:230) [neo4j-server-2.2.3.jar:2.2.3]
	at org.neo4j.harness.internal.InProcessServerControls.start(InProcessServerControls.java:59) [neo4j-harness-2.2.3.jar:2.2.3]
	at org.neo4j.harness.internal.InProcessServerBuilder.newServer(InProcessServerBuilder.java:72) [neo4j-harness-2.2.3.jar:2.2.3]
	at org.neo4j.harness.junit.Neo4jRule$1.evaluate(Neo4jRule.java:64) [neo4j-harness-2.2.3.jar:2.2.3]
	at org.junit.rules.RunRules.evaluate(RunRules.java:20) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) [junit-4.11.jar:na]
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) [junit-4.11.jar:na]
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309) [junit-4.11.jar:na]
	at org.junit.runner.JUnitCore.run(JUnitCore.java:160) [junit-4.11.jar:na]
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:78) [junit-rt.jar:na]
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:212) [junit-rt.jar:na]
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:68) [junit-rt.jar:na]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_51]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_51]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_51]
	at java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_51]
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140) [idea_rt.jar:na]
07:51:33.013 [main] INFO  o.e.jetty.server.ServerConnector - Started ServerConnector@19962194{HTTP/1.1}{localhost:7475}
07:51:33.014 [main] WARN  o.e.j.u.component.AbstractLifeCycle - FAILED org.eclipse.jetty.server.Server@481e91b6: java.lang.NoSuchMethodError: com.sun.jersey.core.reflection.ReflectionHelper.getContextClassLoaderPA()Ljava/security/PrivilegedAction;
java.lang.NoSuchMethodError: com.sun.jersey.core.reflection.ReflectionHelper.getContextClassLoaderPA()Ljava/security/PrivilegedAction;
	at com.sun.jersey.spi.scanning.AnnotationScannerListener.<init>(AnnotationScannerListener.java:94) ~[jersey-server-1.19.jar:1.19]
	at com.sun.jersey.spi.scanning.PathProviderScannerListener.<init>(PathProviderScannerListener.java:59) ~[jersey-server-1.19.jar:1.19]
	at com.sun.jersey.api.core.ScanningResourceConfig.init(ScanningResourceConfig.java:79) ~[jersey-server-1.19.jar:1.19]
	at com.sun.jersey.api.core.PackagesResourceConfig.init(PackagesResourceConfig.java:104) ~[jersey-server-1.19.jar:1.19]
	at com.sun.jersey.api.core.PackagesResourceConfig.<init>(PackagesResourceConfig.java:78) ~[jersey-server-1.19.jar:1.19]
	at com.sun.jersey.api.core.PackagesResourceConfig.<init>(PackagesResourceConfig.java:89) ~[jersey-server-1.19.jar:1.19]
	at com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:696) ~[jersey-servlet-1.17.1.jar:1.17.1]
	at com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:674) ~[jersey-servlet-1.17.1.jar:1.17.1]
	at com.sun.jersey.spi.container.servlet.WebComponent.init(WebComponent.java:203) ~[jersey-servlet-1.17.1.jar:1.17.1]
	at com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:374) ~[jersey-servlet-1.17.1.jar:1.17.1]
	at com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:557) ~[jersey-servlet-1.17.1.jar:1.17.1]
	at javax.servlet.GenericServlet.init(GenericServlet.java:244) ~[javax.servlet-api-3.1.0.jar:3.1.0]
	at org.eclipse.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:612) ~[jetty-servlet-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.servlet.ServletHolder.initialize(ServletHolder.java:395) ~[jetty-servlet-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:871) ~[jetty-servlet-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:298) ~[jetty-servlet-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:741) ~[jetty-server-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) ~[jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:132) ~[jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:114) ~[jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:61) ~[jetty-server-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) ~[jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:132) ~[jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.server.Server.start(Server.java:387) ~[jetty-server-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:114) ~[jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:61) ~[jetty-server-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.server.Server.doStart(Server.java:354) ~[jetty-server-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) ~[jetty-util-9.2.4.v20141103.jar:9.2.4.v20141103]
	at org.neo4j.server.web.Jetty9WebServer.startJetty(Jetty9WebServer.java:381) [neo4j-server-2.2.3.jar:2.2.3]
	at org.neo4j.server.web.Jetty9WebServer.start(Jetty9WebServer.java:184) [neo4j-server-2.2.3.jar:2.2.3]
	at org.neo4j.server.AbstractNeoServer.startWebServer(AbstractNeoServer.java:474) [neo4j-server-2.2.3.jar:2.2.3]
	at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:230) [neo4j-server-2.2.3.jar:2.2.3]
	at org.neo4j.harness.internal.InProcessServerControls.start(InProcessServerControls.java:59) [neo4j-harness-2.2.3.jar:2.2.3]
	at org.neo4j.harness.internal.InProcessServerBuilder.newServer(InProcessServerBuilder.java:72) [neo4j-harness-2.2.3.jar:2.2.3]
	at org.neo4j.harness.junit.Neo4jRule$1.evaluate(Neo4jRule.java:64) [neo4j-harness-2.2.3.jar:2.2.3]
	at org.junit.rules.RunRules.evaluate(RunRules.java:20) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) [junit-4.11.jar:na]
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) [junit-4.11.jar:na]
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) [junit-4.11.jar:na]
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309) [junit-4.11.jar:na]
	at org.junit.runner.JUnitCore.run(JUnitCore.java:160) [junit-4.11.jar:na]
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:78) [junit-rt.jar:na]
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:212) [junit-rt.jar:na]
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:68) [junit-rt.jar:na]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_51]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_51]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_51]
	at java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_51]
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140) [idea_rt.jar:na]
 
org.neo4j.server.ServerStartupException: Starting Neo4j Server failed: com.sun.jersey.core.reflection.ReflectionHelper.getContextClassLoaderPA()Ljava/security/PrivilegedAction;
	at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:258)
	at org.neo4j.harness.internal.InProcessServerControls.start(InProcessServerControls.java:59)
	at org.neo4j.harness.internal.InProcessServerBuilder.newServer(InProcessServerBuilder.java:72)
	at org.neo4j.harness.junit.Neo4jRule$1.evaluate(Neo4jRule.java:64)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:78)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:212)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:68)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.lang.NoSuchMethodError: com.sun.jersey.core.reflection.ReflectionHelper.getContextClassLoaderPA()Ljava/security/PrivilegedAction;
	at com.sun.jersey.spi.scanning.AnnotationScannerListener.<init>(AnnotationScannerListener.java:94)
	at com.sun.jersey.spi.scanning.PathProviderScannerListener.<init>(PathProviderScannerListener.java:59)
	at com.sun.jersey.api.core.ScanningResourceConfig.init(ScanningResourceConfig.java:79)
	at com.sun.jersey.api.core.PackagesResourceConfig.init(PackagesResourceConfig.java:104)
	at com.sun.jersey.api.core.PackagesResourceConfig.<init>(PackagesResourceConfig.java:78)
	at com.sun.jersey.api.core.PackagesResourceConfig.<init>(PackagesResourceConfig.java:89)
	at com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:696)
	at com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:674)
	at com.sun.jersey.spi.container.servlet.WebComponent.init(WebComponent.java:203)
	at com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:374)
	at com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:557)
	at javax.servlet.GenericServlet.init(GenericServlet.java:244)
	at org.eclipse.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:612)
	at org.eclipse.jetty.servlet.ServletHolder.initialize(ServletHolder.java:395)
	at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:871)
	at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:298)
	at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:741)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
	at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:132)
	at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:114)
	at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:61)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
	at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:132)
	at org.eclipse.jetty.server.Server.start(Server.java:387)
	at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:114)
	at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:61)
	at org.eclipse.jetty.server.Server.doStart(Server.java:354)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
	at org.neo4j.server.web.Jetty9WebServer.startJetty(Jetty9WebServer.java:381)
	at org.neo4j.server.web.Jetty9WebServer.start(Jetty9WebServer.java:184)
	at org.neo4j.server.AbstractNeoServer.startWebServer(AbstractNeoServer.java:474)
	at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:230)
	... 22 more

I was a bit baffled at first, but if we look closely about 5-10 lines down the stack trace we can see the mistake I’ve made:

	at com.sun.jersey.api.core.PackagesResourceConfig.<init>(PackagesResourceConfig.java:89) ~[jersey-server-1.19.jar:1.19]
	at com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:696) ~[jersey-servlet-1.17.1.jar:1.17.1]
	at com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:674) ~[jersey-servlet-1.17.1.jar:1.17.1]

We have different versions of the Jersey libraries. Since we’re writing the extension for Neo4j 2.2.3 we can quickly check which version it depends on:

$ ls -alh neo4j-community-2.2.3/system/lib/ | grep jersey
-rwxr-xr-x@  1 markneedham  staff   426K 22 Jun 04:57 jersey-core-1.19.jar
-rwxr-xr-x@  1 markneedham  staff    52K 22 Jun 05:02 jersey-multipart-1.19.jar
-rwxr-xr-x@  1 markneedham  staff   686K 22 Jun 05:02 jersey-server-1.19.jar
-rwxr-xr-x@  1 markneedham  staff   126K 22 Jun 05:02 jersey-servlet-1.19.jar

So we shouldn’t have 1.17.1 in our project, and it was easy enough to find my mistake by looking in the pom file:

<properties>
...
    <jersey.version>1.17.1</jersey.version>
...
</properties>
 
<dependencies>
...
 
    <dependency>
        <groupId>com.sun.jersey</groupId>
        <artifactId>jersey-servlet</artifactId>
        <version>${jersey.version}</version>
    </dependency>
 
...
</dependencies>

Also easy enough to fix!

<properties>
...
    <jersey.version>1.19</jersey.version>
...
</properties>

You can see an example of this working on github.
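
As an aside, mvn dependency:tree (piped through grep jersey) is a quick way to work out exactly which dependency is pulling in a rogue version in situations like this.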

Categories: Programming

Neo4j 2.2.3: Unmanaged extensions – Creating gzipped streamed responses with Jetty

Tue, 08/11/2015 - 00:57

Back in 2013 I wrote a couple of blog posts showing examples of an unmanaged extension which had a streamed and gzipped response, but two years on I realised they were a bit out of date and deserved a refresh.

When writing unmanaged extensions in Neo4j a good rule of thumb is to try to reduce the number of objects you keep hanging around. In this context that means we should stream our response to the client as quickly as possible rather than building it up in memory and sending it in one go.

The documentation has a good example showing how to stream a list of colleagues but in this blog post we’ll look at how to do something simpler – we’ll create a couple of nodes representing people and then write an unmanaged extension to return them.

We’ll first create an unmanaged extension which runs a cypher query, iterates through the rows returned and sends them to the client:

@Path("/example")
public class ExampleResource {
    private final GraphDatabaseService db;
    private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
 
    public ExampleResource(@Context GraphDatabaseService db) {
        this.db = db;
    }
 
    @GET
    @Produces(MediaType.APPLICATION_JSON)
    @Path("/people")
    public Response allNodes() throws IOException {
        StreamingOutput stream = streamQueryResponse("MATCH (n:Person) RETURN n.name AS name");
        return Response.ok().entity(stream).type(MediaType.APPLICATION_JSON).build();
    }
 
    private StreamingOutput streamQueryResponse(final String query) {
        return new StreamingOutput() {
                @Override
                public void write(OutputStream os) throws IOException, WebApplicationException {
                    JsonGenerator jg = OBJECT_MAPPER.getJsonFactory().createJsonGenerator(os, JsonEncoding.UTF8);
                    jg.writeStartArray();
 
                    writeQueryResultTo(query, jg);
 
                    jg.writeEndArray();
                    jg.flush();
                    jg.close();
                }
            };
    }
 
    private void writeQueryResultTo(String query, JsonGenerator jg) throws IOException {
        try (Result result = db.execute(query)) {
            while (result.hasNext()) {
                Map<String, Object> row = result.next();
 
                jg.writeStartObject();
                for (Map.Entry<String, Object> entry : row.entrySet()) {
                    jg.writeFieldName(entry.getKey());
                    jg.writeString(entry.getValue().toString());
                }
                jg.writeEndObject();
            }
        }
    }
}

There’s nothing too complicated going on here, although notice that we make much more fine-grained calls to the JSON library rather than creating a JSON object in memory and calling ObjectMapper#writeValueAsString on it.

To get this to work we’d build a JAR containing this class, put that into the plugins folder and then add the following property to conf/neo4j-server.properties (or the Neo4j desktop equivalent) before restarting the server:

org.neo4j.server.thirdparty_jaxrs_classes=org.neo4j.unmanaged=/unmanaged
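
(The left hand side of that property is the Java package containing our resource classes; the right hand side is the URI path they get mounted under.)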

We can then test it out like this:

$ curl http://localhost:7474/unmanaged/example/people
[{"name":"Mark"},{"name":"Nicole"}]

I’ve put in a couple of test people nodes – full instructions are available on the github README page.

Next we want to make it possible to send that response in the gzip format. To do that we need to add a GzipFilter to the Neo4j lifecycle. This class has moved to a different namespace in Jetty 9, which Neo4j 2.2.3 depends on, but the following class does the job:

import org.apache.commons.configuration.Configuration;
import org.eclipse.jetty.servlets.GzipFilter;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.server.AbstractNeoServer;
import org.neo4j.server.NeoServer;
import org.neo4j.server.plugins.Injectable;
import org.neo4j.server.plugins.SPIPluginLifecycle;
import org.neo4j.server.web.WebServer;
 
import java.util.Collection;
import java.util.Collections;
 
public class GZipInitialiser implements SPIPluginLifecycle {
    private WebServer webServer;
 
    @Override
    public Collection<Injectable<?>> start(NeoServer neoServer) {
        webServer = getWebServer(neoServer);
        GzipFilter filter = new GzipFilter();
 
        webServer.addFilter(filter, "/*");
        return Collections.emptyList();
    }
 
    private WebServer getWebServer(final NeoServer neoServer) {
        if (neoServer instanceof AbstractNeoServer) {
            return ((AbstractNeoServer) neoServer).getWebServer();
        }
        throw new IllegalArgumentException("expected AbstractNeoServer");
    }
 
    @Override
    public Collection<Injectable<?>> start(GraphDatabaseService graphDatabaseService, Configuration configuration) {
        throw new IllegalAccessError();
    }
 
    @Override
    public void stop() {
 
    }
}

I needed to include the jetty-servlets JAR (which is where GzipFilter lives) in my unmanaged extension JAR in order for this to work correctly. Once we redeploy the JAR and restart Neo4j we can try making the same request as above but with a gzip header:

$ curl -v -H "Accept-Encoding:gzip,deflate" http://localhost:7474/unmanaged/example/people
��V�K�MU�R�M,�V�Ձ��2��sR�jcf(�#

We can unpack that on the fly by piping it through gunzip to check we get a sensible result:

$ curl -v -H "Accept-Encoding:gzip,deflate" http://localhost:7474/unmanaged/example/people | gunzip
[{"name":"Mark"},{"name":"Nicole"}]

And there we have it – a gzipped streamed response. All the code is on github so give it a try and give me a shout if it doesn’t work. The fastest way to get me is probably on our new shiny neo4j-users Slack group.

Categories: Programming

Record Linkage: Playing around with Duke

Sat, 08/08/2015 - 23:50

I’ve become quite interested in record linkage recently and came across the Duke project, which provides some tools to help solve this problem. I thought I’d give it a try.

The typical problem when doing record linkage is that we have two records from different data sets which represent the same entity but don’t have a common key that we can use to merge them together. We therefore need to come up with a heuristic that will allow us to do so.

Duke has a few examples showing it in action and I decided to go with the linking countries one. Here we have countries from DBpedia and the Mondial database and we want to link them together.

The first thing we need to do is build the project:

export JAVA_HOME=`/usr/libexec/java_home`
mvn clean package -DskipTests

At the time of writing this will put a zip file containing everything we need under duke-dist/target/. Let’s unpack that:

unzip duke-dist/target/duke-dist-1.3-SNAPSHOT-bin.zip

Next we need to download the data files and Duke configuration file:

wget https://raw.githubusercontent.com/larsga/Duke/master/doc/example-data/countries-dbpedia.csv
wget https://raw.githubusercontent.com/larsga/Duke/master/doc/example-data/countries.xml
wget https://raw.githubusercontent.com/larsga/Duke/master/doc/example-data/countries-mondial.csv
wget https://raw.githubusercontent.com/larsga/Duke/master/doc/example-data/countries-test.txt

Now we’re ready to give it a go:

java -cp "duke-dist-1.3-SNAPSHOT/lib/*" no.priv.garshol.duke.Duke --testfile=countries-test.txt --testdebug --showmatches countries.xml
 
...
 
NO MATCH FOR:
ID: '7706', NAME: 'guatemala', AREA: '108890', CAPITAL: 'guatemala city',
 
MATCH 0.9825124555160142
ID: '10052', NAME: 'pitcairn islands', AREA: '47', CAPITAL: 'adamstown',
ID: 'http://dbpedia.org/resource/Pitcairn_Islands', NAME: 'pitcairn islands', AREA: '47', CAPITAL: 'adamstown',
 
Correct links found: 200 / 218 (91.7%)
Wrong links found: 0 / 24 (0.0%)
Unknown links found: 0
Percent of links correct 100.0%, wrong 0.0%, unknown 0.0%
Records with no link: 18
Precision 100.0%, recall 91.74311926605505%, f-number 0.9569377990430622
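
(The f-number is the harmonic mean of precision and recall, i.e. 2 * precision * recall / (precision + recall).)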

We can look in countries.xml to see how the similarity between records is being calculated:

  <schema>
    <threshold>0.7</threshold>
...
    <property>
      <name>NAME</name>
      <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
      <low>0.09</low>
      <high>0.93</high>
    </property>
    <property>
      <name>AREA</name>
      <comparator>no.priv.garshol.duke.comparators.NumericComparator</comparator>
      <low>0.04</low>
      <high>0.73</high>
    </property>
    <property>
      <name>CAPITAL</name>
      <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
      <low>0.12</low>
      <high>0.61</high>
    </property>
  </schema>

So we’re working out the similarity of the capital city and country name by calculating their Levenshtein distance, i.e. the minimum number of single-character edits required to change one word into the other.
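
As a refresher, a minimal (and deliberately naive) Python sketch of that distance might look like this – Duke’s comparator then turns the raw distance into a similarity score:

def levenshtein(a, b):
    # classic dynamic programming formulation: previous[j] holds the edit
    # distance between the first i-1 characters of a and the first j of b
    previous = range(len(b) + 1)
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]
 
print levenshtein("pitcairn islands", "pitcairn island")  # 1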

This works very well if there is a typo or difference in spelling in one of the data sets. However, I was curious what would happen if the country had two completely different names, e.g. Cote d’Ivoire is sometimes known as Ivory Coast. Let’s try changing the country name in one of the files:

"19147","Cote dIvoire","Yamoussoukro","322460"
java -cp "duke-dist-1.3-SNAPSHOT/lib/*" no.priv.garshol.duke.Duke --testfile=countries-test.txt --testdebug --showmatches countries.xml
 
NO MATCH FOR:
ID: '19147', NAME: 'ivory coast', AREA: '322460', CAPITAL: 'yamoussoukro',

I also tried it out with the BBC and ESPN match reports of the Man Utd vs Tottenham match – the BBC references players by surname, while ESPN has their full names.

When I compared the full name against surname using the Levenshtein comparator there were no matches as you’d expect. I had to split the ESPN names up into first name and surname to get the linking to work.

Equally when I varied the team names to be ‘Man Utd’ rather than ‘Manchester United’ and ‘Tottenham’ rather than ‘Tottenham Hotspur’ that didn’t work either.

I think I probably need to write a domain-specific comparator, but I’m also curious whether I could come up with a bunch of training examples and then train a model to detect what makes two records similar. It’d be less deterministic but perhaps more robust.

Categories: Programming

Spark: Convert RDD to DataFrame

Thu, 08/06/2015 - 22:11

As I mentioned in a previous blog post I’ve been playing around with the Databricks Spark CSV library and wanted to take a CSV file, clean it up and then write out a new CSV file containing some of the columns.

I started by processing the CSV file and writing it into a temporary table:

import org.apache.spark.sql.{SQLContext, Row, DataFrame}
 
val sqlContext = new SQLContext(sc)
val crimeFile = "Crimes_-_2001_to_present.csv"
sqlContext.load("com.databricks.spark.csv", Map("path" -> crimeFile, "header" -> "true")).registerTempTable("crimes")

I wanted to get to the point where I could call the following function which writes a DataFrame to disk:

private def createFile(df: DataFrame, file: String, header: String): Unit = {
  FileUtil.fullyDelete(new File(file))
  val tmpFile = "tmp/" + System.currentTimeMillis() + "-" + file
  df.distinct.save(tmpFile, "com.databricks.spark.csv")
}

The first file only needs to contain the primary type of crime, which we can extract with the following query:

val rows = sqlContext.sql("select `Primary Type` as primaryType FROM crimes LIMIT 10")
 
rows.collect()
res4: Array[org.apache.spark.sql.Row] = Array([ASSAULT], [ROBBERY], [CRIMINAL DAMAGE], [THEFT], [THEFT], [BURGLARY], [THEFT], [BURGLARY], [THEFT], [CRIMINAL DAMAGE])

Some of the primary types have trailing spaces which I want to get rid of. As far as I can tell Spark’s variant of SQL doesn’t have the LTRIM or RTRIM functions, but we can map over ‘rows’ and use the String ‘trim’ function instead:

rows.map { case Row(primaryType: String) => Row(primaryType.trim) }
res8: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[29] at map at DataFrame.scala:776

Now we’ve got an RDD of Rows which we need to convert back to a DataFrame again. ‘sqlContext’ has a function which we might be able to use:

sqlContext.createDataFrame(rows.map { case Row(primaryType: String) => Row(primaryType.trim) })
 
<console>:27: error: overloaded method value createDataFrame with alternatives:
  [A <: Product](data: Seq[A])(implicit evidence$4: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame <and>
  [A <: Product](rdd: org.apache.spark.rdd.RDD[A])(implicit evidence$3: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
 cannot be applied to (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row])
              sqlContext.createDataFrame(rows.map { case Row(primaryType: String) => Row(primaryType.trim) })
                         ^
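
Neither alternative matches because both overloads require the element type to be a subtype of Product (i.e. a tuple or a case class), and Row isn’t one.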

These are the signatures we can choose from:

[Screenshot: the overloaded createDataFrame signatures]

If we want to pass in an RDD of type Row we’re going to have to define a StructType or we can convert each row into something more strongly typed:

case class CrimeType(primaryType: String)
 
sqlContext.createDataFrame(rows.map { case Row(primaryType: String) => CrimeType(primaryType.trim) })
res14: org.apache.spark.sql.DataFrame = [primaryType: string]

Great, we’ve got our DataFrame which we can now plug into the ‘createFile’ function like so:

createFile(
  sqlContext.createDataFrame(rows.map { case Row(primaryType: String) => CrimeType(primaryType.trim) }),
  "/tmp/crimeTypes.csv",
  "crimeType:ID(CrimeType)")

We can actually do better though!

Since we’ve got an RDD of a specific class we can make use of the ‘rddToDataFrameHolder’ implicit function and then the ‘toDF’ function on ‘DataFrameHolder’. This is what the code looks like:

import sqlContext.implicits._
createFile(
  rows.map { case Row(primaryType: String) => CrimeType(primaryType.trim) }.toDF(),
  "/tmp/crimeTypes.csv",
  "crimeType:ID(CrimeType)")

And we’re done!

Categories: Programming

Spark: pyspark/Hadoop – py4j.protocol.Py4JJavaError: An error occurred while calling o23.load.: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4

Tue, 08/04/2015 - 07:35

I’ve been playing around with pyspark – Spark’s Python library – and I wanted to execute the following job which takes a file from my local HDFS and then counts how many times each FBI code appears using Spark SQL:

from pyspark import SparkContext
from pyspark.sql import SQLContext
 
sc = SparkContext("local", "Simple App")
sqlContext = SQLContext(sc)
 
file = "hdfs://localhost:9000/user/markneedham/Crimes_-_2001_to_present.csv"
 
sqlContext.load(source="com.databricks.spark.csv", header="true", path = file).registerTempTable("crimes")
rows = sqlContext.sql("select `FBI Code` AS fbiCode, COUNT(*) AS times FROM crimes GROUP BY `FBI Code` ORDER BY times DESC").collect()
 
for row in rows:
    print("{0} -> {1}".format(row.fbiCode, row.times))

I submitted the job and waited:

$ ./spark-1.3.0-bin-hadoop1/bin/spark-submit --driver-memory 5g --packages com.databricks:spark-csv_2.10:1.1.0 fbi_spark.py
...
Traceback (most recent call last):
  File "/Users/markneedham/projects/neo4j-spark-chicago/fbi_spark.py", line 11, in <module>
    sqlContext.load(source="com.databricks.spark.csv", header="true", path = file).registerTempTable("crimes")
  File "/Users/markneedham/projects/neo4j-spark-chicago/spark-1.3.0-bin-hadoop1/python/pyspark/sql/context.py", line 482, in load
    df = self._ssql_ctx.load(source, joptions)
  File "/Users/markneedham/projects/neo4j-spark-chicago/spark-1.3.0-bin-hadoop1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/Users/markneedham/projects/neo4j-spark-chicago/spark-1.3.0-bin-hadoop1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o23.load.
: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
	at org.apache.hadoop.ipc.Client.call(Client.java:1070)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
	at com.sun.proxy.$Proxy7.getProtocolVersion(Unknown Source)
	at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
	at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
	at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1156)
	at org.apache.spark.rdd.RDD.first(RDD.scala:1189)
	at com.databricks.spark.csv.CsvRelation.firstLine$lzycompute(CsvRelation.scala:129)
	at com.databricks.spark.csv.CsvRelation.firstLine(CsvRelation.scala:127)
	at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:109)
	at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:62)
	at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:115)
	at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:40)
	at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:28)
	at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:290)
	at org.apache.spark.sql.SQLContext.load(SQLContext.scala:679)
	at org.apache.spark.sql.SQLContext.load(SQLContext.scala:667)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:207)
	at java.lang.Thread.run(Thread.java:745)

It looks like my Hadoop client and server are using different versions, which in fact they are! We can see from the name of the Spark folder that I’m using Hadoop 1.x there, and if we check the local Hadoop version we’ll notice it’s from the 2.x series:

$ hadoop version
Hadoop 2.6.0

In this case the easiest fix is to use a version of Spark that’s compiled against Hadoop 2.6, which as of now means Spark 1.4.1.

Let’s try and run our job again:

$ ./spark-1.4.1-bin-hadoop2.6/bin/spark-submit --driver-memory 5g --packages com.databricks:spark-csv_2.10:1.1.0 fbi_spark.py
 
06 -> 859197
08B -> 653575
14 -> 488212
18 -> 457782
26 -> 431316
05 -> 257310
07 -> 197404
08A -> 188964
03 -> 157706
11 -> 112675
04B -> 103961
04A -> 60344
16 -> 47279
15 -> 40361
24 -> 31809
10 -> 22467
17 -> 17555
02 -> 17008
20 -> 15190
19 -> 10878
22 -> 8847
09 -> 6358
01A -> 4830
13 -> 1561
12 -> 835
01B -> 16

And it’s working!

Categories: Programming

Spark: Processing CSV files using Databricks Spark CSV Library

Sun, 08/02/2015 - 19:08

Last year I wrote about exploring the Chicago crime data set using Spark and the OpenCSV parser and while this worked well, a few months ago I noticed that there’s now a spark-csv library which I should probably use instead.

I thought it’d be a fun exercise to translate my code to use it.

So to recap our goal: we want to count how many times each type of crime has been committed. I have a more up to date version of the crimes file now so the numbers won’t be exactly the same.

First let’s launch the spark-shell and register our CSV file as a temporary table so we can query it as if it was a SQL table:

$ ./spark-1.3.0-bin-hadoop1/bin/spark-shell
 
scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext
 
scala> val crimeFile = "/Users/markneedham/Downloads/Crimes_-_2001_to_present.csv"
crimeFile: String = /Users/markneedham/Downloads/Crimes_-_2001_to_present.csv
 
scala> val sqlContext = new SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@9746157
 
scala> sqlContext.load("com.databricks.spark.csv", Map("path" -> crimeFile, "header" -> "true")).registerTempTable("crimes")
java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
	at scala.sys.package$.error(package.scala:27)
	at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:268)
	at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:279)
	at org.apache.spark.sql.SQLContext.load(SQLContext.scala:679)
        at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
	at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
	at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
	at org.apache.spark.repl.Main$.main(Main.scala:31)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I’ve actually forgotten to tell spark-shell about the CSV package so let’s restart the shell and pass it as an argument:

$ ./spark-1.3.0-bin-hadoop1/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.1.0
 
scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext
 
scala> val crimeFile = "/Users/markneedham/Downloads/Crimes_-_2001_to_present.csv"
crimeFile: String = /Users/markneedham/Downloads/Crimes_-_2001_to_present.csv
 
scala> val sqlContext = new SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@44587c44
 
scala> sqlContext.load("com.databricks.spark.csv", Map("path" -> crimeFile, "header" -> "true")).registerTempTable("crimes")
...
15/08/02 18:57:46 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/08/02 18:57:46 INFO DAGScheduler: Stage 0 (first at CsvRelation.scala:129) finished in 0.207 s
15/08/02 18:57:46 INFO DAGScheduler: Job 0 finished: first at CsvRelation.scala:129, took 0.267327 s

Now we can write a simple SQL query on our ‘crimes’ table to find the most popular crime types:

scala>  sqlContext.sql(
        """
        select `Primary Type` as primaryType, COUNT(*) AS times
        from crimes
        group by `Primary Type`
        order by times DESC
        """).save("/tmp/agg.csv", "com.databricks.spark.csv")

That spits out a load of CSV ‘part files’ into /tmp/agg.csv so let’s bring in the merge function that we’ve used previously to combine these into one CSV file:

scala> import org.apache.hadoop.conf.Configuration
scala> import org.apache.hadoop.fs._
 
scala> def merge(srcPath: String, dstPath: String): Unit =  {
         val hadoopConfig = new Configuration()
         val hdfs = FileSystem.get(hadoopConfig)
         FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
       }
 
scala> merge("/tmp/agg.csv", "agg.csv")

And finally let’s browse the contents of our new CSV file:

$ cat agg.csv
THEFT,1206745
BATTERY,1066110
CRIMINAL DAMAGE,672782
NARCOTICS,662257
OTHER OFFENSE,360824
ASSAULT,354583
BURGLARY,343443
MOTOR VEHICLE THEFT,278014
ROBBERY,218190
DECEPTIVE PRACTICE,197477
CRIMINAL TRESPASS,171363
PROSTITUTION,65660
WEAPONS VIOLATION,56218
PUBLIC PEACE VIOLATION,42446
OFFENSE INVOLVING CHILDREN,37010
CRIM SEXUAL ASSAULT,21346
SEX OFFENSE,21305
GAMBLING,13704
LIQUOR LAW VIOLATION,13264
INTERFERENCE WITH PUBLIC OFFICER,11366
ARSON,9642
HOMICIDE,7192
KIDNAPPING,6029
INTIMIDATION,3443
STALKING,2760
OBSCENITY,331
PUBLIC INDECENCY,123
OTHER NARCOTIC VIOLATION,106
CONCEALED CARRY LICENSE VIOLATION,34
NON-CRIMINAL,31
NON - CRIMINAL,25
RITUALISM,23
HUMAN TRAFFICKING,9
NON-CRIMINAL (SUBJECT SPECIFIED),3
DOMESTIC VIOLENCE,1

Great! We’ve got the same output with much less code which is always a #win.

Categories: Programming

Neo4j: Cypher – Removing consecutive duplicates

Thu, 07/30/2015 - 07:23

When writing Cypher queries I sometimes find myself wanting to remove consecutive duplicates in collections that I’ve joined together.

e.g. we might start with the following query where 1 and 7 each appear twice in a row:

RETURN [1,1,2,3,4,5,6,7,7,8] AS values
 
==> +-----------------------+
==> | values                |
==> +-----------------------+
==> | [1,1,2,3,4,5,6,7,7,8] |
==> +-----------------------+
==> 1 row

We want to end up with [1,2,3,4,5,6,7,8]. We can start by exploding our array and putting consecutive elements next to each other:

WITH [1,1,2,3,4,5,6,7,7,8] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
RETURN idx, idx+1, values[idx], values[idx+1]
 
==> +-------------------------------------------+
==> | idx | idx+1 | values[idx] | values[idx+1] |
==> +-------------------------------------------+
==> | 0   | 1     | 1           | 1             |
==> | 1   | 2     | 1           | 2             |
==> | 2   | 3     | 2           | 3             |
==> | 3   | 4     | 3           | 4             |
==> | 4   | 5     | 4           | 5             |
==> | 5   | 6     | 5           | 6             |
==> | 6   | 7     | 6           | 7             |
==> | 7   | 8     | 7           | 7             |
==> | 8   | 9     | 7           | 8             |
==> +-------------------------------------------+
==> 9 rows

Next we can filter out rows which have the same values since that means they have consecutive duplicates:

WITH [1,1,2,3,4,5,6,7,7,8] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
WITH values[idx] AS a, values[idx+1] AS b
WHERE a <> b
RETURN a,b
 
==> +-------+
==> | a | b |
==> +-------+
==> | 1 | 2 |
==> | 2 | 3 |
==> | 3 | 4 |
==> | 4 | 5 |
==> | 5 | 6 |
==> | 6 | 7 |
==> | 7 | 8 |
==> +-------+
==> 7 rows

Now we need to join the collection back together again. Most of the values we want are in field ‘b’ but we also need to grab the first value from field ‘a’:

WITH [1,1,2,3,4,5,6,7,7,8] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
WITH values[idx] AS a, values[idx+1] AS b
WHERE a <> b
RETURN COLLECT(a)[0] + COLLECT(b) AS noDuplicates
 
==> +-------------------+
==> | noDuplicates      |
==> +-------------------+
==> | [1,2,3,4,5,6,7,8] |
==> +-------------------+
==> 1 row

What about if we have more than 2 duplicates in a row?

WITH [1,1,1,2,3,4,5,5,6,7,7,8] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
WITH values[idx] AS a, values[idx+1] AS b
WHERE a <> b
RETURN COLLECT(a)[0] + COLLECT(b) AS noDuplicates
 
==> +-------------------+
==> | noDuplicates      |
==> +-------------------+
==> | [1,2,3,4,5,6,7,8] |
==> +-------------------+
==> 1 row

Still happy, good times! Of course a non-consecutive duplicate wouldn’t be removed:

WITH [1,1,1,2,3,4,5,5,6,7,7,8,1] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
WITH values[idx] AS a, values[idx+1] AS b
WHERE a <> b
RETURN COLLECT(a)[0] + COLLECT(b) AS noDuplicates
 
==> +---------------------+
==> | noDuplicates        |
==> +---------------------+
==> | [1,2,3,4,5,6,7,8,1] |
==> +---------------------+
==> 1 row
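As an aside, the same consecutive deduplication is a one liner in Python via itertools.groupby, which shows the shape of the algorithm we’ve hand-rolled in Cypher:

from itertools import groupby
 
values = [1, 1, 1, 2, 3, 4, 5, 5, 6, 7, 7, 8, 1]
# groupby collapses runs of equal consecutive elements, keeping their order
print([key for key, _ in groupby(values)])
# [1, 2, 3, 4, 5, 6, 7, 8, 1]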
Categories: Programming

Neo4j: MERGE’ing on super nodes

Tue, 07/28/2015 - 22:04

In my continued playing with the Chicago crime data set I wanted to connect the crimes committed to their position in the FBI crime type hierarchy.

These are the sub graphs that I want to connect:

[Image: the crime and FBI category sub graphs]

We have a ‘fbiCode’ on each ‘Crime’ node which indicates which ‘Crime Sub Category’ the crime belongs to.

I started with the following query to connect the nodes together:

MATCH (crime:Crime)
WITH crime SKIP {skip} LIMIT 10000
 
MATCH (subCat:SubCategory {code: crime.fbiCode})
MERGE (crime)-[:CATEGORY]->(subCat)
RETURN COUNT(*) AS crimesProcessed

I had this running inside a Python script which incremented ‘skip’ by 10,000 on each iteration as long as ‘crimesProcessed’ came back with a value > 0.
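The script itself isn’t shown here, but a minimal sketch of that loop – assuming py2neo 2.x and the ‘foobar’ password used in earlier posts – would look something like this:

from py2neo import Graph, authenticate
 
authenticate("localhost:7474", "neo4j", "foobar")  # assumed credentials
graph = Graph()
 
query = """
MATCH (crime:Crime)
WITH crime SKIP {skip} LIMIT 10000
MATCH (subCat:SubCategory {code: crime.fbiCode})
MERGE (crime)-[:CATEGORY]->(subCat)
RETURN COUNT(*) AS crimesProcessed
"""
 
skip = 0
while True:
    # keep paging through the crimes until a batch comes back empty
    result = graph.cypher.execute(query, skip=skip)
    if result[0].crimesProcessed == 0:
        break
    skip += 10000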

To start with the ‘CATEGORY’ relationships were being created very quickly but it slowed down quite noticeably about 1 million nodes in.

I profiled the queries but the query plans didn’t show anything obviously wrong. My suspicion was that I had a super node problem where the Cypher runtime was iterating through all of the sub category’s relationships to check whether one of them pointed to the crime on the other side of the ‘MERGE’ statement.

I cancelled the import job and wrote a query to check how many relationships each sub category had. It varied from 1,000 to 93,000, somewhat confirming my suspicion.

Michael suggested tweaking the query to use the shortestPath function to check for the existence of the relationship and then using the CREATE clause to create it if it didn’t exist.

The neat thing about the shortestPath function is that it will start from the side with the lowest cardinality and as soon as it finds a relationship it will stop searching. Let’s have a look at that version of the query:

MATCH (crime:Crime)
WITH crime SKIP {skip} LIMIT 10000
MATCH (subCat:SubCategory {code: crime.fbiCode})
WITH crime, subCat, shortestPath((crime)-[:CATEGORY]->(subCat)) AS path
FOREACH(ignoreMe IN CASE WHEN path is NULL THEN [1] ELSE [] END |
  CREATE (crime)-[:CATEGORY]->(subCat))
RETURN COUNT(*)

This worked much better – 10,000 nodes processed in ~2.5 seconds – and the time remained constant as more relationships were added. This allowed me to create all the ‘CATEGORY’ relationships, but we can actually do even better if we use CREATE UNIQUE instead of MERGE:

MATCH (crime:Crime)
WITH crime SKIP {skip} LIMIT 10000
 
MATCH (subCat:SubCategory {code: crime.fbiCode})
CREATE UNIQUE (crime)-[:CATEGORY]->(subCat)
RETURN COUNT(*) AS crimesProcessed

Using this query, 10,000 nodes took ~250ms-900ms to process, which means we can process all the nodes in 5-6 minutes – good times!

I’m not super familiar with the ‘CREATE UNIQUE’ command so I’m not sure it’s always a good substitute for ‘MERGE’, but on this occasion it does the job.

The lesson for me here is that if a query is taking longer than you think it should, try other constructs or a combination of them and see whether things improve – they just might!

Categories: Programming

Python: Difference between two datetimes in milliseconds

Tue, 07/28/2015 - 21:05

I’ve been doing a bit of ad hoc measurement of some Cypher queries executed via py2neo and wanted to work out how many milliseconds each query was taking end to end.

I thought there’d be an obvious way of doing this but if there is it’s evaded me so far, so I ended up calculating the difference between two datetime objects, which gave me the following timedelta object:

>>> import datetime
>>> start = datetime.datetime.now()
>>> end = datetime.datetime.now()
 
>>> end - start
datetime.timedelta(0, 3, 519319)

The 3 parts of this object are ‘days’, ‘seconds’ and ‘microseconds’ which I found quite strange!

These are the methods/attributes we have available to us:

>>> dir(end - start)
['__abs__', '__add__', '__class__', '__delattr__', '__div__', '__doc__', '__eq__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__pos__', '__radd__', '__rdiv__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmul__', '__rsub__', '__setattr__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', 'days', 'max', 'microseconds', 'min', 'resolution', 'seconds', 'total_seconds']

There’s no ‘milliseconds’ on there so we’ll have to calculate it from what we do have:

>>> diff = end - start
>>> elapsed_ms = (diff.days * 86400000) + (diff.seconds * 1000) + (diff.microseconds / 1000)
 
>>> elapsed_ms
3519

Or we could do the following slightly simpler calculation:

>>> diff.total_seconds() * 1000
3519.319
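Wrapped up as a tiny helper function this becomes:

def elapsed_milliseconds(start, end):
    # total_seconds() accounts for days, seconds and microseconds in one go
    return (end - start).total_seconds() * 1000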

And now back to the query profiling!

Categories: Programming

Neo4j: From JSON to CSV to LOAD CSV via jq

Sun, 07/26/2015 - 00:05


In my last blog post I showed how to import a Chicago crime categories & sub categories JSON document using Neo4j’s cypher query language via the py2neo driver. While this is a good approach for people with a developer background, many of the users I encounter aren’t developers and favour using Cypher via the Neo4j browser.

If we’re going to do this we’ll need to transform our JSON document into a CSV file so that we can use the LOAD CSV command on it. Michael pointed me to the jq tool which comes in very handy.

To recap, this is a part of the JSON file:

{
    "categories": [
        {
            "name": "Index Crime",
            "sub_categories": [
                {
                    "code": "01A",
                    "description": "Homicide 1st & 2nd Degree"
                },
            ]
        },
        {
            "name": "Non-Index Crime",
            "sub_categories": [
                {
                    "code": "01B",
                    "description": "Involuntary Manslaughter"
                },
            ]
        },
        {
            "name": "Violent Crime",
            "sub_categories": [
                {
                    "code": "01A",
                    "description": "Homicide 1st & 2nd Degree"
                },
            ]
        }
    ]
}

We want to get one row for each sub category which contains three columns – category name, sub category code, sub category description.

First we need to pull out the categories:

$ jq ".categories[]" categories.json
 
{
  "name": "Index Crime",
  "sub_categories": [
    {
      "code": "01A",
      "description": "Homicide 1st & 2nd Degree"
    },
  ]
}
{
  "name": "Non-Index Crime",
  "sub_categories": [
    {
      "code": "01B",
      "description": "Involuntary Manslaughter"
    },
  ]
}
{
  "name": "Violent Crime",
  "sub_categories": [
    {
      "code": "01A",
      "description": "Homicide 1st & 2nd Degree"
    },
  ]
}

Next we want to create a row for each sub category with the category alongside it. We can use the pipe function to combine the two selectors:

$ jq ".categories[] | {name: .name, sub_category: .sub_categories[]}" categories.json
 
{
  "name": "Index Crime",
  "sub_category": {
    "code": "01A",
    "description": "Homicide 1st & 2nd Degree"
  }
}
...
{
  "name": "Non-Index Crime",
  "sub_category": {
    "code": "01B",
    "description": "Involuntary Manslaughter"
  }
}
...
{
  "name": "Violent Crime",
  "sub_category": {
    "code": "01A",
    "description": "Homicide 1st & 2nd Degree"
  }
}

Now we want to un-nest the sub category:

$ jq ".categories[] | {name: .name, sub_category: .sub_categories[]} | [.name, .sub_category.code, .sub_category.description]" categories.json
 
[
  "Index Crime",
  "01A",
  "Homicide 1st & 2nd Degree"
]
 
[
  "Non-Index Crime",
  "01B",
  "Involuntary Manslaughter"
]
 
[
  "Violent Crime",
  "01A",
  "Homicide 1st & 2nd Degree"
]

And finally let’s use the @csv filter to generate CSV lines:

$ jq ".categories[] | {name: .name, sub_category: .sub_categories[]} | [.name, .sub_category.code, .sub_category.description] | @csv" categories.json
"\"Index Crime\",\"01A\",\"Homicide 1st & 2nd Degree\""
"\"Index Crime\",\"02\",\"Criminal Sexual Assault\""
"\"Index Crime\",\"03\",\"Robbery\""
"\"Index Crime\",\"04A\",\"Aggravated Assault\""
"\"Index Crime\",\"04B\",\"Aggravated Battery\""
"\"Index Crime\",\"05\",\"Burglary\""
"\"Index Crime\",\"06\",\"Larceny\""
"\"Index Crime\",\"07\",\"Motor Vehicle Theft\""
"\"Index Crime\",\"09\",\"Arson\""
"\"Non-Index Crime\",\"01B\",\"Involuntary Manslaughter\""
"\"Non-Index Crime\",\"08A\",\"Simple Assault\""
"\"Non-Index Crime\",\"08B\",\"Simple Battery\""
"\"Non-Index Crime\",\"10\",\"Forgery & Counterfeiting\""
"\"Non-Index Crime\",\"11\",\"Fraud\""
"\"Non-Index Crime\",\"12\",\"Embezzlement\""
"\"Non-Index Crime\",\"13\",\"Stolen Property\""
"\"Non-Index Crime\",\"14\",\"Vandalism\""
"\"Non-Index Crime\",\"15\",\"Weapons Violation\""
"\"Non-Index Crime\",\"16\",\"Prostitution\""
"\"Non-Index Crime\",\"17\",\"Criminal Sexual Abuse\""
"\"Non-Index Crime\",\"18\",\"Drug Abuse\""
"\"Non-Index Crime\",\"19\",\"Gambling\""
"\"Non-Index Crime\",\"20\",\"Offenses Against Family\""
"\"Non-Index Crime\",\"22\",\"Liquor License\""
"\"Non-Index Crime\",\"24\",\"Disorderly Conduct\""
"\"Non-Index Crime\",\"26\",\"Misc Non-Index Offense\""
"\"Violent Crime\",\"01A\",\"Homicide 1st & 2nd Degree\""
"\"Violent Crime\",\"02\",\"Criminal Sexual Assault\""
"\"Violent Crime\",\"03\",\"Robbery\""
"\"Violent Crime\",\"04A\",\"Aggravated Assault\""
"\"Violent Crime\",\"04B\",\"Aggravated Battery\""

The only annoying thing about this output is that all the double quotes are escaped. We can sort that out by passing the ‘-r’ flag when we call jq:

$ jq -r ".categories[] | {name: .name, sub_category: .sub_categories[]} | [.name, .sub_category.code, .sub_category.description] | @csv" categories.json
"Index Crime","01A","Homicide 1st & 2nd Degree"
"Index Crime","02","Criminal Sexual Assault"
"Index Crime","03","Robbery"
"Index Crime","04A","Aggravated Assault"
"Index Crime","04B","Aggravated Battery"
"Index Crime","05","Burglary"
"Index Crime","06","Larceny"
"Index Crime","07","Motor Vehicle Theft"
"Index Crime","09","Arson"
"Non-Index Crime","01B","Involuntary Manslaughter"
"Non-Index Crime","08A","Simple Assault"
"Non-Index Crime","08B","Simple Battery"
"Non-Index Crime","10","Forgery & Counterfeiting"
"Non-Index Crime","11","Fraud"
"Non-Index Crime","12","Embezzlement"
"Non-Index Crime","13","Stolen Property"
"Non-Index Crime","14","Vandalism"
"Non-Index Crime","15","Weapons Violation"
"Non-Index Crime","16","Prostitution"
"Non-Index Crime","17","Criminal Sexual Abuse"
"Non-Index Crime","18","Drug Abuse"
"Non-Index Crime","19","Gambling"
"Non-Index Crime","20","Offenses Against Family"
"Non-Index Crime","22","Liquor License"
"Non-Index Crime","24","Disorderly Conduct"
"Non-Index Crime","26","Misc Non-Index Offense"
"Violent Crime","01A","Homicide 1st & 2nd Degree"
"Violent Crime","02","Criminal Sexual Assault"
"Violent Crime","03","Robbery"
"Violent Crime","04A","Aggravated Assault"
"Violent Crime","04B","Aggravated Battery"

Excellent. The only thing left is to write a header and then direct the output into a CSV file and get it into Neo4j:

$ echo "category,sub_category_code,sub_category_description" > categories.csv
$ jq -r ".categories[] |
         {name: .name, sub_category: .sub_categories[]} |
         [.name, .sub_category.code, .sub_category.description] |
         @csv " categories.json >> categories.csv
$ head -n10 categories.csv
category,sub_category_code,sub_category_description
"Index Crime","01A","Homicide 1st & 2nd Degree"
"Index Crime","02","Criminal Sexual Assault"
"Index Crime","03","Robbery"
"Index Crime","04A","Aggravated Assault"
"Index Crime","04B","Aggravated Battery"
"Index Crime","05","Burglary"
"Index Crime","06","Larceny"
"Index Crime","07","Motor Vehicle Theft"
"Index Crime","09","Arson"
And now we can get it into Neo4j with a LOAD CSV query:

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-spark-chicago/categories.csv" AS row
MERGE (c:CrimeCategory {name: row.category})
MERGE (sc:SubCategory {code: row.sub_category_code})
ON CREATE SET sc.description = row.sub_category_description
MERGE (c)-[:CHILD]->(sc)

And that’s it!

[Image: the imported category graph]
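As an aside, if you’d rather stay in Python than learn jq’s syntax, the same flattening only takes a few lines – a sketch, assuming the JSON structure shown above:

import json
 
with open("categories.json") as data_file:
    doc = json.load(data_file)
 
with open("categories.csv", "w") as csv_file:
    csv_file.write("category,sub_category_code,sub_category_description\n")
    for category in doc["categories"]:
        for sub_category in category["sub_categories"]:
            csv_file.write('"%s","%s","%s"\n' % (
                category["name"], sub_category["code"], sub_category["description"]))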
Categories: Programming

Neo4j: Loading JSON documents with Cypher

Thu, 07/23/2015 - 07:15

One of the questions I get asked most often is how to load JSON documents into Neo4j and although Cypher doesn’t have a ‘LOAD JSON’ command we can still get JSON data into the graph.

Michael shows how to do this from various languages in this blog post and I recently wanted to load a JSON document that I generated from Chicago crime types.

This is a snippet of the JSON document:

{
    "categories": [
        {
            "name": "Index Crime", 
            "sub_categories": [
                {
                    "code": "01A", 
                    "description": "Homicide 1st & 2nd Degree"
                }
            ]
        }, 
        {
            "name": "Non-Index Crime", 
            "sub_categories": [
                {
                    "code": "01B", 
                    "description": "Involuntary Manslaughter"
                }
            ]
        }, 
        {
            "name": "Violent Crime", 
            "sub_categories": [
                {
                    "code": "01A", 
                    "description": "Homicide 1st & 2nd Degree"
                }
            ]
        }
    ]
}

We want to create the following graph structure from this document:

[Image: the target graph structure]

We can then connect the crimes to the appropriate sub category and write aggregation queries that drill down from the category.

To do this we’re going to have to pass the JSON document to Neo4j via its HTTP API rather than through the browser. Luckily there are drivers available for {insert your favourite language here} so we should still be good.

Python is my current goto language so I’m going to use py2neo to load the data in.

Let’s start by writing a simple query which passes our JSON document in and gets it straight back. Note that I’ve updated my Neo4j password to be ‘foobar’ – replace that with your equivalent if you’re following along:

import json
from py2neo import Graph, authenticate
 
# replace 'foobar' with your password
authenticate("localhost:7474", "neo4j", "foobar")
graph = Graph()
 
with open('categories.json') as data_file:
    # note: this rebinds the name 'json' from the module to the parsed
    # document, which is fine here as we don't use the module again
    json = json.load(data_file)
 
query = """
RETURN {json}
"""
 
# Send Cypher query.
print graph.cypher.execute(query, json = json)
$ python import_categories.py
   | document
---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1 | {u'categories': [{u'name': u'Index Crime', u'sub_categories': [{u'code': u'01A', u'description': u'Homicide 1st & 2nd Degree'}, {u'code': u'02', u'description': u'Criminal Sexual Assault'}, {u'code': u'03', u'description': u'Robbery'}, {u'code': u'04A', u'description': u'Aggravated Assault'}, {u'code': u'04B', u'description': u'Aggravated Battery'}, {u'code': u'05', u'description': u'Burglary'}, {u'code': u'06', u'description': u'Larceny'}, {u'code': u'07', u'description': u'Motor Vehicle Theft'}, {u'code': u'09', u'description': u'Arson'}]}, {u'name': u'Non-Index Crime', u'sub_categories': [{u'code': u'01B', u'description': u'Involuntary Manslaughter'}, {u'code': u'08A', u'description': u'Simple Assault'}, {u'code': u'08B', u'description': u'Simple Battery'}, {u'code': u'10', u'description': u'Forgery & Counterfeiting'}, {u'code': u'11', u'description': u'Fraud'}, {u'code': u'12', u'description': u'Embezzlement'}, {u'code': u'13', u'description': u'Stolen Property'}, {u'code': u'14', u'description': u'Vandalism'}, {u'code': u'15', u'description': u'Weapons Violation'}, {u'code': u'16', u'description': u'Prostitution'}, {u'code': u'17', u'description': u'Criminal Sexual Abuse'}, {u'code': u'18', u'description': u'Drug Abuse'}, {u'code': u'19', u'description': u'Gambling'}, {u'code': u'20', u'description': u'Offenses Against Family'}, {u'code': u'22', u'description': u'Liquor License'}, {u'code': u'24', u'description': u'Disorderly Conduct'}, {u'code': u'26', u'description': u'Misc Non-Index Offense'}]}, {u'name': u'Violent Crime', u'sub_categories': [{u'code': u'01A', u'description': u'Homicide 1st & 2nd Degree'}, {u'code': u'02', u'description': u'Criminal Sexual Assault'}, {u'code': u'03', u'description': u'Robbery'}, {u'code': u'04A', u'description': u'Aggravated Assault'}, {u'code': u'04B', u'description': u'Aggravated Battery'}]}]}

It’s a bit ugly but we can see that everything’s there! Our next step is to extract each category into its own row. We can do this by accessing the ‘categories’ key in our JSON document and then using the UNWIND clause, which allows us to expand a collection into a sequence of rows:

query = """
WITH {json} AS document
UNWIND document.categories AS category
RETURN category.name
"""
$ python import_categories.py
   | category.name
---+-----------------
 1 | Index Crime
 2 | Non-Index Crime
 3 | Violent Crime

Now we can create a node for each of those categories. We’ll use the MERGE command so that we can run this script multiple times without ending up with repeat categories:

query = """
WITH {json} AS document
UNWIND document.categories AS category
MERGE (:CrimeCategory {name: category.name}) 
"""

Let’s quickly check those categories were correctly imported:

match (category:CrimeCategory)
return category

[Image: the crime category nodes]

Looking good so far – now for the sub categories. We’re going to use UNWIND to help us out here as well:

query = """
WITH {json} AS document
UNWIND document.categories AS category
UNWIND category.sub_categories AS subCategory
RETURN category.name, subCategory.code, subCategory.description
"""
$ python import_categories.py
    | category.name   | subCategory.code | subCategory.description
----+-----------------+------------------+---------------------------
  1 | Index Crime     | 01A              | Homicide 1st & 2nd Degree
  2 | Index Crime     | 02               | Criminal Sexual Assault
  3 | Index Crime     | 03               | Robbery
  4 | Index Crime     | 04A              | Aggravated Assault
  5 | Index Crime     | 04B              | Aggravated Battery
  6 | Index Crime     | 05               | Burglary
  7 | Index Crime     | 06               | Larceny
  8 | Index Crime     | 07               | Motor Vehicle Theft
  9 | Index Crime     | 09               | Arson
 10 | Non-Index Crime | 01B              | Involuntary Manslaughter
 11 | Non-Index Crime | 08A              | Simple Assault
 12 | Non-Index Crime | 08B              | Simple Battery
 13 | Non-Index Crime | 10               | Forgery & Counterfeiting
 14 | Non-Index Crime | 11               | Fraud
 15 | Non-Index Crime | 12               | Embezzlement
 16 | Non-Index Crime | 13               | Stolen Property
 17 | Non-Index Crime | 14               | Vandalism
 18 | Non-Index Crime | 15               | Weapons Violation
 19 | Non-Index Crime | 16               | Prostitution
 20 | Non-Index Crime | 17               | Criminal Sexual Abuse
 21 | Non-Index Crime | 18               | Drug Abuse
 22 | Non-Index Crime | 19               | Gambling
 23 | Non-Index Crime | 20               | Offenses Against Family
 24 | Non-Index Crime | 22               | Liquor License
 25 | Non-Index Crime | 24               | Disorderly Conduct
 26 | Non-Index Crime | 26               | Misc Non-Index Offense
 27 | Violent Crime   | 01A              | Homicide 1st & 2nd Degree
 28 | Violent Crime   | 02               | Criminal Sexual Assault
 29 | Violent Crime   | 03               | Robbery
 30 | Violent Crime   | 04A              | Aggravated Assault
 31 | Violent Crime   | 04B              | Aggravated Battery

Let’s give sub categories the MERGE treatment too:

query = """
WITH {json} AS document
UNWIND document.categories AS category
UNWIND category.sub_categories AS subCategory
MERGE (c:CrimeCategory {name: category.name})
MERGE (sc:SubCategory {code: subCategory.code})
ON CREATE SET sc.description = subCategory.description
MERGE (c)-[:CHILD]->(sc)
"""

And finally let’s write a query to check what we’ve imported:

match (category:CrimeCategory)-[:CHILD]->(subCategory)
return *
[Image: categories connected to their sub categories]

What I hadn’t realised before running this query is that some sub categories sit under multiple categories, which is quite an interesting insight. The final Python script is available on github – any questions let me know.

Categories: Programming

Neo4j 2.2.3: neo4j-import – Encoder StringEncoder[2] returned an illegal encoded value 0

Tue, 07/21/2015 - 07:11

I’ve been playing around with the Chicago crime data set again while preparing for a Neo4j webinar next week and while running the import tool ran into the following exception:

Importing the contents of these files into tmp/crimes.db:
Nodes:
  /Users/markneedham/projects/neo4j-spark-chicago/tmp/crimes.csv
 
  /Users/markneedham/projects/neo4j-spark-chicago/tmp/beats.csv
 
  /Users/markneedham/projects/neo4j-spark-chicago/tmp/primaryTypes.csv
 
  /Users/markneedham/projects/neo4j-spark-chicago/tmp/locations.csv
Relationships:
  /Users/markneedham/projects/neo4j-spark-chicago/tmp/crimesBeats.csv
 
  /Users/markneedham/projects/neo4j-spark-chicago/tmp/crimesPrimaryTypes.csv
 
  /Users/markneedham/projects/neo4j-spark-chicago/tmp/crimesLocationsCleaned.csv
 
Available memory:
  Free machine memory: 263.17 MB
  Max heap memory : 3.56 GB
 
Nodes
[*>:17.41 MB/s-------------------------|PROPERTIES(3)=|NODE:3|LABEL SCAN----|v:36.30 MB/s(2)===]  3MImport error: Panic called, so exiting
java.lang.RuntimeException: Panic called, so exiting
	at org.neo4j.unsafe.impl.batchimport.staging.AbstractStep.assertHealthy(AbstractStep.java:200)
	at org.neo4j.unsafe.impl.batchimport.staging.AbstractStep.await(AbstractStep.java:191)
	at org.neo4j.unsafe.impl.batchimport.staging.ProcessorStep.receive(ProcessorStep.java:98)
	at org.neo4j.unsafe.impl.batchimport.staging.ProcessorStep.sendDownstream(ProcessorStep.java:224)
	at org.neo4j.unsafe.impl.batchimport.staging.ProcessorStep.access$400(ProcessorStep.java:42)
	at org.neo4j.unsafe.impl.batchimport.staging.ProcessorStep$Sender.send(ProcessorStep.java:250)
	at org.neo4j.unsafe.impl.batchimport.LabelScanStorePopulationStep.process(LabelScanStorePopulationStep.java:60)
	at org.neo4j.unsafe.impl.batchimport.LabelScanStorePopulationStep.process(LabelScanStorePopulationStep.java:37)
	at org.neo4j.unsafe.impl.batchimport.staging.ProcessorStep$4.run(ProcessorStep.java:120)
	at org.neo4j.unsafe.impl.batchimport.staging.ProcessorStep$4.run(ProcessorStep.java:102)
	at org.neo4j.unsafe.impl.batchimport.executor.DynamicTaskExecutor$Processor.run(DynamicTaskExecutor.java:237)
Caused by: java.lang.IllegalStateException: Encoder StringEncoder[2] returned an illegal encoded value 0
	at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.encode(EncodingIdMapper.java:229)
	at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.put(EncodingIdMapper.java:208)
	at org.neo4j.unsafe.impl.batchimport.NodeEncoderStep.process(NodeEncoderStep.java:77)
	at org.neo4j.unsafe.impl.batchimport.NodeEncoderStep.process(NodeEncoderStep.java:43)
	... 3 more

I narrowed the problem down to a specific file and from tracing the code I learned that this exception happens when we’ve ended up with a node that doesn’t have an id.

I guessed that this might be due to there being an empty column somewhere in my CSV file so I did a bit of grepping:

$ grep -rn  "\"\"" tmp/locations.csv
tmp/locations.csv:11:"",Location
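grep does the job here, but the same check in Python scales to fiddlier cases – a sketch, assuming the id lives in the first column:

import csv
 
# flag any row whose id column is empty - those are the rows
# that trip up the import tool
with open("tmp/locations.csv") as f:
    for line_number, row in enumerate(csv.reader(f), start=1):
        if row and row[0] == "":
            print("line %d has an empty id" % line_number)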

We can now narrow down the import to just the one line and see if we still get the exception:

$ cat foo.csv
id:ID(Location),:LABEL
"",Location
 
$ ./neo4j-community-2.2.3/bin/neo4j-import --into tmp/foo --nodes foo.csv
Importing the contents of these files into tmp/foo:
Nodes:
  /Users/markneedham/projects/neo4j-spark-chicago/foo.csv
 
Available memory:
  Free machine memory: 2.22 GB
  Max heap memory : 3.56 GB
 
Nodes
Import error: Encoder StringEncoder[2] returned an illegal encoded value 0

Yep, same error. Now we can clean up our CSV file and try again:

$ grep -v  "\"\"" foo.csv > fooCleaned.csv
 
# I put in a few real records so we can see them import
$ cat fooCleaned.csv
id:ID(Location),:LABEL
"RAILROAD PROPERTY",Location
"NEWSSTAND",Location
"SCHOOL, PRIVATE, BUILDING",Location
 
$ ./neo4j-community-2.2.3/bin/neo4j-import --into tmp/foo --nodes fooCleaned.csv
Importing the contents of these files into tmp/foo:
Nodes:
  /Users/markneedham/projects/neo4j-spark-chicago/fooCleaned.csv
 
Available memory:
  Free machine memory: 1.23 GB
  Max heap memory : 3.56 GB
 
Nodes
[*>:??-------------------------------------------------|PROPE|NODE:7.63 MB-----------------|LA] 10k
Done in 110ms
Prepare node index
[*DETECT:7.63 MB-------------------------------------------------------------------------------]   0
Done in 60ms
Calculate dense nodes
[*>:??-----------------------------------------------------------------------------------------]   0
Done in 10ms
Relationships
[*>:??-----------------------------------------------------------------------------------------]   0
Done in 11ms
Node --> Relationship
[*v:??-----------------------------------------------------------------------------------------] 10k
Done in 1ms
Relationship --> Relationship
[*>:??-----------------------------------------------------------------------------------------]   0
Done in 11ms
Node counts
[>:|*COUNT:76.29 MB----------------------------------------------------------------------------] 10k
Done in 46ms
Relationship counts
[*>:??-----------------------------------------------------------------------------------------]   0
Done in 12ms
 
IMPORT DONE in 1s 576ms. Imported:
  3 nodes
  0 relationships
  3 properties

Sweet! We’re back in business.

Categories: Programming

R: Bootstrap confidence intervals

Sun, 07/19/2015 - 20:44

I recently came across an interesting post on Julia Evans’ blog showing how to generate a bigger set of data points by sampling the small set of data points that we actually have using bootstrapping. Julia’s examples are all in Python so I thought it’d be a fun exercise to translate them into R.

We’re doing the bootstrapping to simulate the number of no-shows for a flight so we can work out how many seats we can overbook the plane by.

We start out with a small sample of no-shows and work off the assumption that it’s ok to kick someone off a flight 5% of the time. Let’s work out how many people that’d be for our initial sample:

> data = c(0, 1, 3, 2, 8, 2, 3, 4)
> quantile(data, 0.05)
  5% 
0.35

0.35 people! That’s not a particularly useful result so we’re going to resample the initial data set 10,000 times, taking the 5%ile each time and see if we come up with something better:

We’re going to use the sample function with replacement to generate our resamples:

> sample(data, replace = TRUE)
[1] 0 3 2 8 8 0 8 0
> sample(data, replace = TRUE)
[1] 2 2 4 3 4 4 2 2

Now let’s write a function to do that multiple times:

library(ggplot2)
 
bootstrap_5th_percentile = function(data, n_bootstraps) {
  return(sapply(1:n_bootstraps, 
                function(iteration) quantile(sample(data, replace = TRUE), 0.05)))
}
 
values = bootstrap_5th_percentile(data, 10000)
 
ggplot(aes(x = value), data = data.frame(value = values)) + geom_histogram(binwidth=0.25)

[Plot: histogram of the bootstrapped 5th percentiles]

So this visualisation is telling us that we can oversell by 0-2 people but we don’t know an exact number.

Let’s try the same exercise but with a bigger initial data set of 1,000 values rather than just 8. First we’ll generate a distribution (with a mean of 5 and standard deviation of 2) and visualise it:

library(dplyr)
 
df = data.frame(value = rnorm(1000,5, 2))
df = df %>% filter(value >= 0) %>% mutate(value = as.integer(round(value)))
ggplot(aes(x = value), data = df) + geom_histogram(binwidth=1)

[Plot: histogram of the generated distribution]

Our distribution seems to have a lot more values around 4 & 5 whereas the Python version has a flatter distribution – I’m not sure why that is, so if you have any ideas let me know. In any case, let’s check the 5%ile for this data set:

> quantile(df$value, 0.05)
5% 
 2

Cool! Now at least we have an integer value rather than the 0.35 we got earlier. Finally let’s do some bootstrapping over our new distribution and see what 5%ile we come up with:

resampled = bootstrap_5th_percentile(df$value, 10000)
byValue = data.frame(value = resampled) %>% count(value)
 
> byValue
Source: local data frame [3 x 2]
 
  value    n
1   1.0    3
2   1.7    2
3   2.0 9995
 
ggplot(aes(x = value, y = n), data = byValue) + geom_bar(stat = "identity")

[Plot: bar chart of the bootstrapped 5th percentile counts]

‘2’ is by far the most popular 5%ile here although it seems weighted more towards that value than with Julia’s Python version, which I imagine is because we seem to have sampled from a slightly different distribution.

Categories: Programming

R: Blog post frequency anomaly detection

Sat, 07/18/2015 - 00:34

I came across Twitter’s anomaly detection library last year but haven’t yet had a reason to take it for a test run so having got my blog post frequency data into shape I thought it’d be fun to run it through the algorithm.

I wanted to see if it would detect any periods of time when the number of posts differed significantly – I don’t really have an action I’m going to take based on the results, it’s curiosity more than anything else!

First we need to get the library installed. It’s not on CRAN so we need to use devtools to install it from the github repository:

install.packages("devtools")
devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)

The expected data format is two columns – one containing a time stamp and the other a count – e.g. the ‘raw_data’ data frame that comes into scope when you load the library:

> library(dplyr)
> raw_data %>% head()
            timestamp   count
1 1980-09-25 14:01:00 182.478
2 1980-09-25 14:02:00 176.231
3 1980-09-25 14:03:00 183.917
4 1980-09-25 14:04:00 177.798
5 1980-09-25 14:05:00 165.469
6 1980-09-25 14:06:00 181.878

In our case the timestamps will be the start date of a week and the count the number of posts in that week. But first let’s get some practice calling the anomaly function using the canned data:

res = AnomalyDetectionTs(raw_data, max_anoms=0.02, direction='both', plot=TRUE)
res$plot

[Plot: anomalies detected in the canned raw_data set]

From this visualisation we learn that we should expect both high and low outliers to be identified. Let’s give it a try with the blog post publication data.

We need to get the data into shape so we’ll start by getting a count of the number of blog posts by (week, year) pair:

> df %>% sample_n(5)
                                                           title                date
1425                            Coding: Copy/Paste then refactor 2009-10-31 07:54:31
783  Neo4j 2.0.0-M06 -> 2.0.0-RC1: Working with path expressions 2013-11-23 10:30:41
960                                        R: Removing for loops 2015-04-18 23:53:20
966   R: dplyr - Error in (list: invalid subscript type 'double' 2015-04-27 22:34:43
343                     Parsing XML from the unix terminal/shell 2011-09-03 23:42:11
 
> byWeek = df %>% 
    mutate(year = year(date), week = week(date)) %>% 
    group_by(week, year) %>% summarise(n = n()) %>% 
    ungroup() %>% arrange(desc(n))
 
> byWeek %>% sample_n(5)
Source: local data frame [5 x 3]
 
  week year n
1   44 2009 6
2   37 2011 4
3   39 2012 3
4    7 2013 4
5    6 2010 6

Great. The next step is to translate this data frame into one containing a date representing the start of that week and the number of posts:

> data = byWeek %>% 
    mutate(start_of_week = calculate_start_of_week(week, year)) %>%
    filter(start_of_week > ymd("2008-07-01")) %>%
    select(start_of_week, n)
 
> data %>% sample_n(5)
Source: local data frame [5 x 2]
 
  start_of_week n
1    2010-09-10 4
2    2013-04-09 4
3    2010-04-30 6
4    2012-03-11 3
5    2014-12-03 3

We’re now ready to plug it into the anomaly detection function:

res = AnomalyDetectionTs(data, 
                         max_anoms=0.02, 
                         direction='both', 
                         plot=TRUE)
res$plot

[Plot: anomalies detected in weekly blog post counts]

Interestingly I don’t seem to have any low end anomalies – there were a couple of really high frequency weeks when I first started writing and I think one of the other weeks contains a New Year’s Eve when I was particularly bored!

If we group by month instead only the very first month stands out as an outlier:

# byMonth is built in the same way as byWeek, but grouping by (month, year)
byMonth = df %>% 
  mutate(year = year(date), month = month(date)) %>% 
  group_by(month, year) %>% 
  summarise(n = n()) %>% 
  ungroup()
 
data = byMonth %>% 
  mutate(start_of_month = ymd(paste(year, month, 1, sep="-"))) %>%
  filter(start_of_month > ymd("2008-07-01")) %>%
  select(start_of_month, n)
res = AnomalyDetectionTs(data, 
                         max_anoms=0.02, 
                         direction='both',       
                         #longterm = TRUE,
                         plot=TRUE)
res$plot

[Plot: anomalies detected in monthly blog post counts]

I’m not sure what else to do as far as anomaly detection goes but if you have any ideas please let me know!

Categories: Programming

Neo4j: The football transfers graph

Thu, 07/16/2015 - 07:40

Given we’re still in pre season transfer madness as far as European football is concerned I thought it’d be interesting to put together a football transfers graph to see whether there are any interesting insights to be had.

It took me a while to find an appropriate source but I eventually came across transfermarkt.co.uk which contains transfers going back at least as far as the start of the Premier League in 1992.

I wrote a quick Python script to create a CSV file of all the transfers. This is what the file looks like:

$ head -n 10 data/transfers.csv
player,from_team,from_team_id,to_team,to_team_id,fee,season
Martin Keown,Everton,29,Arsenal FC,11,"2,10 Mill. £",1992-1993
John Jensen,Bröndby IF,206,Arsenal FC,11,"1,12 Mill. £",1992-1993
Alan Miller,Birmingham,337,Arsenal FC,11,,1992-1993
Jim Will,Sheffield Utd.,350,Arsenal FC,11,,1992-1993
David Rocastle,Arsenal FC,11,Leeds,399,"1,68 Mill. £",1992-1993
Perry Groves,Arsenal FC,11,Southampton FC,180,595 Th. £,1992-1993
Ty Gooden,Arsenal FC,11,Wycombe Wand.,2805,?,1992-1993
Geraint Williams,Derby,22,Ipswich Town,677,525 Th. £,1992-1993
Jason Winters,Chelsea U21,9250,Ipswich Town,677,?,1992-1993

I’m going to create the following graph and then we’ll write some queries which explore chains of transfers involving players and clubs.

[Image: the football transfers graph model]

I wrote a few import scripts using Neo4j’s LOAD CSV command, having set up the appropriate indexes first:

create index on :Team(id);
create index on :Season(name);
create index on :Transfer(description);
create index on :Player(name);
// teams
load csv with headers from "file:///Users/markneedham/projects/football-transfers/data/teams.csv" as row
merge (team:Team {id: toint(row.team_id)})
on create set team.name = row.team;
 
// seasons
load csv with headers from "file:///Users/markneedham/projects/football-transfers/data/transfers.csv" as row
merge (season:Season {name: row.season})
ON CREATE SET season.starts =  toint(split(season.name, "-")[0]);
 
// players
load csv with headers from "file:///Users/markneedham/projects/football-transfers/data/transfers.csv" as row
merge (player:Player {name: row.player});
 
// transfers
load csv with headers from "file:///Users/markneedham/projects/football-transfers/data/transfers.csv" as row
match (from:Team {id: toint(row.from_team_id)})
match (to:Team {id: toint(row.to_team_id)})
match (season:Season {name: row.season})
match (player:Player {name: row.player})
 
merge (transfer:Transfer {description: row.player + " from " + from.name + " to " + to.name})
merge (transfer)-[:FROM_TEAM]->(from)
merge (transfer)-[:TO_TEAM]->(to)
merge (transfer)-[:IN_SEASON]->(season)
merge (transfer)-[:PLAYER]->(player);
 
// connect transfers
match (season)<-[:IN_SEASON]-(transfer:Transfer)-[:PLAYER]->(player)
WITH player, season, transfer
ORDER BY player.name, season.starts
WITH player, COLLECT({s: season, t: transfer}) AS transfers
UNWIND range(0, length(transfers)-2) AS idx
WITH player, transfers[idx] AS t1, transfers[idx +1] AS t2
WITH player, t1.t AS t1, t2.t AS t2
MERGE (t1)-[:NEXT]->(t2);

All the files and scripts are on this gist if you want to play around with the data. The only thing you’ll need to change is the file path on each of the ‘LOAD CSV’ lines.

The ‘connect transfers’ query is a bit more complicated than the others – in that one we’re first ordering the transfers in ascending order grouped by player and then creating a linked list of a player’s transfers.
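In Python terms it’s doing something like this for each player – an illustrative sketch with made up data, not part of the import:

# one player's transfers, already sorted by season
transfers = ["Everton -> Arsenal", "Arsenal -> Leeds", "Leeds -> Newcastle"]
 
# pair each transfer with the one that follows it,
# i.e. MERGE (t1)-[:NEXT]->(t2) in the Cypher above
for t1, t2 in zip(transfers, transfers[1:]):
    print("%s -[:NEXT]-> %s" % (t1, t2))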

Now that we’ve got the data loaded let’s find out which player was transferred the most:

match path = (:Transfer)-[:NEXT*0..]->(transfer:Transfer)
where NOT (transfer)-[:NEXT]->()
RETURN path 
ORDER BY LENGTH(path) DESC
LIMIT 1
[Image: the player with the longest chain of transfers]

Which other players have moved teams frequently?

match path = (first:Transfer)-[:NEXT*0..]->(transfer:Transfer),
             (player)<-[:PLAYER]-(transfer)
where NOT ((transfer)-[:NEXT]->()) AND NOT ((first)<-[:NEXT]-())
RETURN player.name, LENGTH(path) AS numberOfTransfers 
ORDER BY numberOfTransfers DESC
LIMIT 10
 
==> +--------------------------------------+
==> | player.name      | numberOfTransfers |
==> +--------------------------------------+
==> | "Craig Bellamy"  | 7                 |
==> | "David Unsworth" | 6                 |
==> | "Andrew Cole"    | 6                 |
==> | "Peter Crouch"   | 6                 |
==> | "Les Ferdinand"  | 5                 |
==> | "Kevin Phillips" | 5                 |
==> | "Mark Hughes"    | 5                 |
==> | "Tommy Wright"   | 4                 |
==> | "Carl Tiler"     | 4                 |
==> | "Don Hutchison"  | 4                 |
==> +--------------------------------------+
==> 10 rows

What are the most frequent combinations of clubs involved in transfers?

match (from)<-[:FROM_TEAM]-(t:Transfer)-[:TO_TEAM]->(to), (t)-[:PLAYER]->(p)
RETURN from.name, to.name, COUNT(*) AS times, COLLECT(p.name) AS players
ORDER BY times DESC
LIMIT 10
 
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | from.name           | to.name               | times | players                                                                                                                                                                                                    |
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | "West Ham United"   | "Queens Park Rangers" | 13    | ["Keith Rowland","Iain Dowie","Tim Breacker","Ludek Miklosko","Bertie Brayley","Terrell Forbes","Steve Lomas","Hogan Ephraim","Nigel Quashie","Danny Gabbidon","Kieron Dyer","Robert Green","Gary O'Neil"] |
==> | "Tottenham Hotspur" | "Portsmouth FC"       | 12    | ["Paul Walsh","Andy Turner","Rory Allen","Justin Edinburgh","Tim Sherwood","Teddy Sheringham","Noé Pamarot","Pedro Mendes","Sean Davis","Jermain Defoe","Younès Kaboul","Kevin-Prince Boateng"]            |
==> | "Liverpool FC"      | "West Ham United"     | 12    | ["Julian Dicks","David Burrows","Mike Marsh","Don Hutchison","Neil Ruddock","Titi Camara","Rob Jones","Rigobert Song","Craig Bellamy","Joe Cole","Andy Carroll","Stewart Downing"]                         |
==> | "Manchester United" | "Everton FC"          | 9     | ["Andrey Kanchelskis","John O'Kane","Jesper Blomqvist","Phil Neville","Tim Howard","Louis Saha","Darron Gibson","Sam Byrne","Tom Cleverley"]                                                               |
==> | "Newcastle United"  | "West Ham United"     | 9     | ["Paul Kitson","Shaka Hislop","Stuart Pearce","Wayne Quinn","Lee Bowyer","Kieron Dyer","Scott Parker","Nolberto Solano","Kevin Nolan"]                                                                     |
==> | "Blackburn Rovers"  | "Leicester City"      | 9     | ["Steve Agnew","Tim Flowers","Callum Davidson","John Curtis","Keith Gillespie","Craig Hignett","Nils-Eric Johansson","Bruno Berner","Paul Gallagher"]                                                      |
==> | "Chelsea FC"        | "Southampton FC"      | 8     | ["Ken Monkou","Kerry Dixon","Neil Shipperley","Mark Hughes","Paul Hughes","Graeme Le Saux","Jack Cork","Ryan Bertrand"]                                                                                    |
==> | "Birmingham City"   | "Coventry City"       | 8     | ["David Rennie","John Gayle","Liam Daish","Gary Breen","Stern John","Julian Gray","Lee Carsley","Gary McSheffrey"]                                                                                         |
==> | "Southampton FC"    | "Fulham FC"           | 8     | ["Micky Adams","Kevin Moore","Terry Hurlock","Maik Taylor","Alan Neilson","Luís Boa Morte","Antti Niemi","Chris Baird"]                                                                                    |
==> | "Portsmouth FC"     | "Stoke City"          | 8     | ["Kevin Harper","Lewis Buxton","Anthony Pulis","Vincent Péricard","Asmir Begovic","Marc Wilson","Elliot Wheeler","Alex Grant"]                                                                             |
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> 10 rows

Are there ever situations where players get transferred in both directions? In the query below, the id(from) < id(to) predicate makes sure we only get one row per pair of clubs:

match (from)<-[:FROM_TEAM]-(t:Transfer)-[:TO_TEAM]->(to), (t)-[:PLAYER]->(player)
where id(from) < id(to)
WITH from, to, COUNT(*) AS times, COLLECT(player.name) AS players
match (to)<-[:FROM_TEAM]-(t:Transfer)-[:TO_TEAM]->(from), (t)-[:PLAYER]->(player)
RETURN from.name, to.name, times, COUNT(*) as otherWayTimes, players, COLLECT(player.name) AS otherWayPlayers
ORDER BY times + otherWayTimes DESC
LIMIT 10
 
==> +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | from.name           | to.name               | times | otherWayTimes | players                                                                                                                                                                                                    | otherWayPlayers                                                                                                                                                                    |
==> +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | "Tottenham Hotspur" | "Portsmouth FC"       | 12    | 5             | ["Paul Walsh","Andy Turner","Rory Allen","Justin Edinburgh","Tim Sherwood","Teddy Sheringham","Noé Pamarot","Pedro Mendes","Sean Davis","Jermain Defoe","Younès Kaboul","Kevin-Prince Boateng"]            | ["Jermain Defoe","Niko Kranjcar","Younès Kaboul","Peter Crouch","Darren Anderton"]                                                                                                 |
==> | "West Ham United"   | "Liverpool FC"        | 4     | 12            | ["Julian Dicks","Daniel Sjölund","Yossi Benayoun","Javier Mascherano"]                                                                                                                                     | ["Stewart Downing","Andy Carroll","Joe Cole","Craig Bellamy","Rigobert Song","Titi Camara","Rob Jones","Neil Ruddock","Don Hutchison","Julian Dicks","Mike Marsh","David Burrows"] |
==> | "West Ham United"   | "Queens Park Rangers" | 13    | 2             | ["Keith Rowland","Iain Dowie","Tim Breacker","Ludek Miklosko","Bertie Brayley","Terrell Forbes","Steve Lomas","Hogan Ephraim","Nigel Quashie","Danny Gabbidon","Kieron Dyer","Robert Green","Gary O'Neil"] | ["Andy Impey","Trevor Sinclair"]                                                                                                                                                   |
==> | "West Ham United"   | "Tottenham Hotspur"   | 5     | 8             | ["Jermain Defoe","Frédéric Kanouté","Michael Carrick","Jimmy Walker","Scott Parker"]                                                                                                                       | ["Sergiy Rebrov","Mauricio Taricco","Calum Davenport","Les Ferdinand","Matthew Etherington","Bobby Zamora","Ilie Dumitrescu","Mark Robson"]                                        |
==> | "West Ham United"   | "Portsmouth FC"       | 8     | 5             | ["Martin Allen","Adrian Whitbread","Marc Keller","Svetoslav Todorov","Hayden Foxe","Shaka Hislop","Sébastien Schemmel","Hayden Mullins"]                                                                   | ["Stephen Henderson","Teddy Sheringham","Shaka Hislop","Marc Keller","Lee Chapman"]                                                                                                |
==> | "Newcastle United"  | "West Ham United"     | 9     | 3             | ["Paul Kitson","Shaka Hislop","Stuart Pearce","Wayne Quinn","Lee Bowyer","Kieron Dyer","Scott Parker","Nolberto Solano","Kevin Nolan"]                                                                     | ["Demba Ba","Lee Bowyer","David Terrier"]                                                                                                                                          |
==> | "Birmingham City"   | "Coventry City"       | 8     | 4             | ["David Rennie","John Gayle","Liam Daish","Gary Breen","Stern John","Julian Gray","Lee Carsley","Gary McSheffrey"]                                                                                         | ["Scott Dann","David Burrows","Peter Ndlovu","David Smith"]                                                                                                                        |
==> | "Manchester City"   | "Portsmouth FC"       | 8     | 4             | ["Paul Walsh","Carl Griffiths","Fitzroy Simpson","Eyal Berkovic","David James","Andrew Cole","Sylvain Distin","Tal Ben Haim"]                                                                              | ["Benjani","Gerry Creaney","Kit Symons","Paul Walsh"]                                                                                                                              |
==> | "Blackburn Rovers"  | "Southampton FC"      | 5     | 6             | ["David Speedie","Stuart Ripley","James Beattie","Kevin Davies","Zak Jones"]                                                                                                                               | ["Zak Jones","Egil Östenstad","Kevin Davies","Alan Shearer","Jeff Kenna","Tim Flowers"]                                                                                            |
==> | "AFC Bournemouth"   | "West Ham United"     | 3     | 8             | ["Keith Rowland","Paul Mitchell","Scott Mean"]                                                                                                                                                             | ["Steve Jones","Matt Holland","Mohammed Berthé","Scott Mean","Paul Mitchell","Jamie Victory","Mark Watson","Stephen Purches"]                                                      |
==> +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Are there any players who went back to a club they’d been at previously?

match (player:Player)<-[:PLAYER]-(t1:Transfer)-[:FROM_TEAM]->(from)<-[:TO_TEAM]-(t2:Transfer)-[:FROM_TEAM]->(to),
      (t2)-[:PLAYER]->(player), (t1)-[:TO_TEAM]->(to)
WHERE ID(to) < ID(from)
WITH player, COLLECT([ from.name, " ⥄ ", to.name]) AS teams
RETURN player.name, 
       REDUCE(acc = [], item in teams | acc  + REDUCE(acc2 = "", i in item | acc2 + i)) AS thereAndBack
ORDER BY LENGTH(thereAndBack) DESC
LIMIT 10
 
==> +-------------------------------------------------------------------------------------+
==> | player.name       | thereAndBack                                                    |
==> +-------------------------------------------------------------------------------------+
==> | "Mark Stein"      | ["Stoke City ⥄ Chelsea FC","Ipswich Town ⥄ Chelsea FC"]         |
==> | "Peter Beagrie"   | ["Bradford City ⥄ Everton FC","Bradford City ⥄ Wigan Athletic"] |
==> | "Richard Dryden"  | ["Southampton FC ⥄ Stoke City","Southampton FC ⥄ Swindon Town"] |
==> | "Robbie Elliott"  | ["Bolton Wanderers ⥄ Newcastle United"]                         |
==> | "Elliot Grandin"  | ["Blackpool FC ⥄ Crystal Palace"]                               |
==> | "Robert Fleck"    | ["Chelsea FC ⥄ Norwich City"]                                   |
==> | "Paul Walsh"      | ["Portsmouth FC ⥄ Manchester City"]                             |
==> | "Rick Holden"     | ["Manchester City ⥄ Oldham Athletic"]                           |
==> | "Gary McAllister" | ["Liverpool FC ⥄ Coventry City"]                                |
==> | "Lee Bowyer"      | ["West Ham United ⥄ Newcastle United"]                          |
==> +-------------------------------------------------------------------------------------+
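
The nested REDUCE in that query is doing two jobs: the inner call concatenates each [from, " ⥄ ", to] triple into a single string, while the outer call appends those strings onto a list. Here’s the same idea run standalone against a literal list, reproducing Mark Stein’s row:

// inner REDUCE builds one string per triple, outer REDUCE collects the strings
WITH [["Stoke City", " ⥄ ", "Chelsea FC"], ["Ipswich Town", " ⥄ ", "Chelsea FC"]] AS teams
RETURN REDUCE(acc = [], item IN teams |
       acc + REDUCE(acc2 = "", i IN item | acc2 + i)) AS thereAndBack;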

That’s all I’ve got for now. If you can think of any other interesting avenues to explore, let me know and I’ll take a look.

Categories: Programming