Mark Needham

Thoughts on Software Development

Kubernetes: Spinning up a Neo4j 3.1 Causal Cluster

Fri, 11/25/2016 - 17:55

A couple of weeks ago I wrote a blog post explaining how I’d created a Neo4j causal cluster using docker containers directly. For my next pet project I wanted to use Kubernetes as an orchestration layer so that I could declaratively change the number of servers in my cluster.

I’d never used Kubernetes before, but I’d seen a presentation at the GDG Cloud meetup a couple of months ago showing how to use it to create an Elastic cluster.

That presentation introduced me to the idea of a PetSet, an abstraction exposed by Kubernetes for managing a set of pods (containers) which have a fixed identity. The documentation explains it better:

A PetSet ensures that a specified number of β€œpets” with unique identities are running at any given time. The identity of a Pet is comprised of:

  • a stable hostname, available in DNS
  • an ordinal index
  • stable storage: linked to the ordinal & hostname

In my case I need to have a stable hostname because each member of a Neo4j cluster is given a list of other cluster members with which it can create a new cluster or join an already existing one. This is the first use case described in the documentation:

PetSet also helps with the 2 most common problems encountered managing such clustered applications:

  • discovery of peers for quorum
  • startup/teardown ordering

So the first thing we need to do is create some stable storage for our pods to use.

We’ll create a cluster of 3 members so we need to create one PersistentVolume for each of them. The following script does the job:

volumes.sh

for i in $(seq 0 2); do
  cat <<EOF | kubectl create -f -
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv${i}
  labels:
    type: local
    app: neo4j
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/tmp/${i}"
EOF
 
  cat <<EOF | kubectl create -f -
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: datadir-neo4j-${i}
  labels:
    app: neo4j
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF
done;

If we run this script it’ll create 3 volumes and their corresponding claims, which we can see by running the following commands:

$ kubectl get pv
NAME      CAPACITY   ACCESSMODES   STATUS    CLAIM                     REASON    AGE
pv0       1Gi        RWO           Bound     default/datadir-neo4j-0             7s
pv1       1Gi        RWO           Bound     default/datadir-neo4j-1             7s
pv2       1Gi        RWO           Bound     default/datadir-neo4j-2             7s
$ kubectl get pvc
NAME              STATUS    VOLUME    CAPACITY   ACCESSMODES   AGE
datadir-neo4j-0   Bound     pv0       1Gi        RWO           26s
datadir-neo4j-1   Bound     pv1       1Gi        RWO           26s
datadir-neo4j-2   Bound     pv2       1Gi        RWO           25s

Next we need to create a PetSet template. After a lot of iterations I ended up with the following:

# Headless service to provide DNS lookup
apiVersion: v1
kind: Service
metadata:
  labels:
    app: neo4j
  name: neo4j
spec:
  clusterIP: None
  ports:
    - port: 7474
  selector:
    app: neo4j
---
# new API name
apiVersion: "apps/v1alpha1"
kind: PetSet
metadata:
  name: neo4j
spec:
  serviceName: neo4j
  replicas: 3
  template:
    metadata:
      annotations:
        pod.alpha.kubernetes.io/initialized: "true"
        pod.beta.kubernetes.io/init-containers: '[
            {
                "name": "install",
                "image": "gcr.io/google_containers/busybox:1.24",
                "command": ["/bin/sh", "-c", "echo \"
                unsupported.dbms.edition=enterprise\n
                dbms.mode=CORE\n
                dbms.connectors.default_advertised_address=$HOSTNAME.neo4j.default.svc.cluster.local\n
                dbms.connectors.default_listen_address=0.0.0.0\n
                dbms.connector.bolt.type=BOLT\n
                dbms.connector.bolt.enabled=true\n
                dbms.connector.bolt.listen_address=0.0.0.0:7687\n
                dbms.connector.http.type=HTTP\n
                dbms.connector.http.enabled=true\n
                dbms.connector.http.listen_address=0.0.0.0:7474\n
                causal_clustering.raft_messages_log_enable=true\n
                causal_clustering.initial_discovery_members=neo4j-0.neo4j.default.svc.cluster.local:5000,neo4j-1.neo4j.default.svc.cluster.local:5000,neo4j-2.neo4j.default.svc.cluster.local:5000\n
                causal_clustering.leader_election_timeout=2s\n
                  \" > /work-dir/neo4j.conf" ],
                "volumeMounts": [
                    {
                        "name": "confdir",
                        "mountPath": "/work-dir"
                    }
                ]
            }
        ]'
      labels:
        app: neo4j
    spec:
      containers:
      - name: neo4j
        image: "neo4j/neo4j-experimental:3.1.0-M13-beta3-enterprise"
        imagePullPolicy: Always
        ports:
        - containerPort: 5000
          name: discovery
        - containerPort: 6000
          name: tx
        - containerPort: 7000
          name: raft
        - containerPort: 7474
          name: browser
        - containerPort: 7687
          name: bolt
        securityContext:
          privileged: true
        volumeMounts:
        - name: datadir
          mountPath: /data
        - name: confdir
          mountPath: /conf
      volumes:
      - name: confdir
  volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi

The main thing I had trouble with was getting the members of the cluster to talk to each other. The default docker config uses hostnames but I found that pods were unable to contact each other unless I specified the FQDN in the config file. We can run the following command to create the PetSet:

$ kubectl create -f neo4j.yaml 
service "neo4j" created
petset "neo4j" created

We can check if the pods are up and running by executing the following command:

$ kubectl get pods
NAME      READY     STATUS    RESTARTS   AGE
neo4j-0   1/1       Running   0          2m
neo4j-1   1/1       Running   0          14s
neo4j-2   1/1       Running   0          10s

And we can tail neo4j’s log files like this:

$ kubectl logs neo4j-0
Starting Neo4j.
2016-11-25 16:39:50.333+0000 INFO  Starting...
2016-11-25 16:39:51.723+0000 INFO  Bolt enabled on 0.0.0.0:7687.
2016-11-25 16:39:51.733+0000 INFO  Initiating metrics...
2016-11-25 16:39:51.911+0000 INFO  Waiting for other members to join cluster before continuing...
2016-11-25 16:40:12.074+0000 INFO  Started.
2016-11-25 16:40:12.428+0000 INFO  Mounted REST API at: /db/manage
2016-11-25 16:40:13.350+0000 INFO  Remote interface available at http://neo4j-0.neo4j.default.svc.cluster.local:7474/
$ kubectl logs neo4j-1
Starting Neo4j.
2016-11-25 16:39:53.846+0000 INFO  Starting...
2016-11-25 16:39:56.212+0000 INFO  Bolt enabled on 0.0.0.0:7687.
2016-11-25 16:39:56.225+0000 INFO  Initiating metrics...
2016-11-25 16:39:56.341+0000 INFO  Waiting for other members to join cluster before continuing...
2016-11-25 16:40:16.623+0000 INFO  Started.
2016-11-25 16:40:16.951+0000 INFO  Mounted REST API at: /db/manage
2016-11-25 16:40:17.607+0000 INFO  Remote interface available at http://neo4j-1.neo4j.default.svc.cluster.local:7474/
$ kubectl logs neo4j-2
Starting Neo4j.
2016-11-25 16:39:57.828+0000 INFO  Starting...
2016-11-25 16:39:59.166+0000 INFO  Bolt enabled on 0.0.0.0:7687.
2016-11-25 16:39:59.176+0000 INFO  Initiating metrics...
2016-11-25 16:39:59.329+0000 INFO  Waiting for other members to join cluster before continuing...
2016-11-25 16:40:19.216+0000 INFO  Started.
2016-11-25 16:40:19.675+0000 INFO  Mounted REST API at: /db/manage
2016-11-25 16:40:21.029+0000 INFO  Remote interface available at http://neo4j-2.neo4j.default.svc.cluster.local:7474/

I wanted to log into the servers from my host machine’s browser so I set up port forwarding for each of the servers:

$ kubectl port-forward neo4j-0 7474:7474 7687:7687
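
That forwards the first server; to reach the other two at the same time we can forward them to different local ports (the local port numbers here are arbitrary choices):

$ kubectl port-forward neo4j-1 7475:7474 7688:7687
$ kubectl port-forward neo4j-2 7476:7474 7689:7687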

We can then get an overview of the cluster by running the following procedure:

CALL dbms.cluster.overview()
 
╒════════════════════════════════════╀═════════════════════════════════════════════════════╀════════╕
β”‚id                                  β”‚addresses                                            β”‚role    β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════════════════════════════════════════════β•ͺ════════║
β”‚81d8e5e2-02db-4414-85de-a7025b346e84β”‚[bolt://neo4j-0.neo4j.default.svc.cluster.local:7687,β”‚LEADER  β”‚
β”‚                                    β”‚ http://neo4j-0.neo4j.default.svc.cluster.local:7474]β”‚        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚347b7517-7ca0-4b92-b9f0-9249d46b2ad3β”‚[bolt://neo4j-1.neo4j.default.svc.cluster.local:7687,β”‚FOLLOWERβ”‚
β”‚                                    β”‚ http://neo4j-1.neo4j.default.svc.cluster.local:7474]β”‚        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚a5ec1335-91ce-4358-910b-8af9086c2969β”‚[bolt://neo4j-2.neo4j.default.svc.cluster.local:7687,β”‚FOLLOWERβ”‚
β”‚                                    β”‚ http://neo4j-2.neo4j.default.svc.cluster.local:7474]β”‚        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜

So far so good. What if we want to have 5 servers in the cluster instead of 3? We can run the following command to increase our replica size:

$ kubectl patch petset neo4j -p '{"spec":{"replicas":5}}'
"neo4j" patched

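If we list the pods again we should see the two new members, neo4j-3 and neo4j-4, spinning up alongside the original three – something like this, assuming the pods start cleanly:

$ kubectl get pods
NAME      READY     STATUS    RESTARTS   AGE
neo4j-0   1/1       Running   0          5m
neo4j-1   1/1       Running   0          3m
neo4j-2   1/1       Running   0          3m
neo4j-3   1/1       Running   0          30s
neo4j-4   1/1       Running   0          25s
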
Let’s run that procedure again:

CALL dbms.cluster.overview()
 
╒════════════════════════════════════╀═════════════════════════════════════════════════════╀════════╕
β”‚id                                  β”‚addresses                                            β”‚role    β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════════════════════════════════════════════β•ͺ════════║
β”‚81d8e5e2-02db-4414-85de-a7025b346e84β”‚[bolt://neo4j-0.neo4j.default.svc.cluster.local:7687,β”‚LEADER  β”‚
β”‚                                    β”‚ http://neo4j-0.neo4j.default.svc.cluster.local:7474]β”‚        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚347b7517-7ca0-4b92-b9f0-9249d46b2ad3β”‚[bolt://neo4j-1.neo4j.default.svc.cluster.local:7687,β”‚FOLLOWERβ”‚
β”‚                                    β”‚ http://neo4j-1.neo4j.default.svc.cluster.local:7474]β”‚        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚a5ec1335-91ce-4358-910b-8af9086c2969β”‚[bolt://neo4j-2.neo4j.default.svc.cluster.local:7687,β”‚FOLLOWERβ”‚
β”‚                                    β”‚ http://neo4j-2.neo4j.default.svc.cluster.local:7474]β”‚        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚28613d06-d4c5-461c-b277-ddb3f05e5647β”‚[bolt://neo4j-3.neo4j.default.svc.cluster.local:7687,β”‚FOLLOWERβ”‚
β”‚                                    β”‚ http://neo4j-3.neo4j.default.svc.cluster.local:7474]β”‚        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚2eaa0058-a4f3-4f07-9f22-d310562ad1ecβ”‚[bolt://neo4j-4.neo4j.default.svc.cluster.local:7687,β”‚FOLLOWERβ”‚
β”‚                                    β”‚ http://neo4j-4.neo4j.default.svc.cluster.local:7474]β”‚        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Neat! And it’s as easy to go back down to 3 again:

$ kubectl patch petset neo4j -p '{"spec":{"replicas":3}}'
"neo4j" patched
CALL dbms.cluster.overview()
 
╒════════════════════════════════════╀══════════════════════════════════════════════════════╀════════╕
β”‚id                                  β”‚addresses                                             β”‚role    β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════════════════════════════════════════════β•ͺ════════║
β”‚81d8e5e2-02db-4414-85de-a7025b346e84β”‚[bolt://neo4j-0.neo4j.default.svc.cluster.local:7687, β”‚LEADER  β”‚
β”‚                                    β”‚http://neo4j-0.neo4j.default.svc.cluster.local:7474]  β”‚        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚347b7517-7ca0-4b92-b9f0-9249d46b2ad3β”‚[bolt://neo4j-1.neo4j.default.svc.cluster.local:7687, β”‚FOLLOWERβ”‚
β”‚                                    β”‚http://neo4j-1.neo4j.default.svc.cluster.local:7474]  β”‚        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚a5ec1335-91ce-4358-910b-8af9086c2969β”‚[bolt://neo4j-2.neo4j.default.svc.cluster.local:7687, β”‚FOLLOWERβ”‚
β”‚                                    β”‚http://neo4j-2.neo4j.default.svc.cluster.local:7474]  β”‚        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Next I need to look at how we can add read replicas into the cluster. These don’t take part in the membership/quorum algorithm so I think I’ll be able to use the more common ReplicationController/Pod architecture for those.

If you want to play around with this the code is available as a gist. I’m using the minikube tool for all my experiments but I’ll hopefully get around to trying this on GCE or AWS soon.

Categories: Programming

Kubernetes: Writing hostname to a file

Tue, 11/22/2016 - 20:56

Over the weekend I spent a bit of time playing around with Kubernetes, and to get the hang of the technology I set myself the task of writing the hostname of the machine to a file.

I’m using the excellent minikube tool to create a local Kubernetes cluster for my experiments so the first step is to spin that up:

$ minikube start
Starting local Kubernetes cluster...
Kubectl is now configured to use the cluster.

The first thing I needed to work out was how to get the hostname. I figured there was probably an environment variable that I could access. We can call the env command to see a list of all the environment variables in a container, so I created a pod template that would show me that information:

hostname_super_simple.yaml

apiVersion: v1
kind: Pod
metadata:
  name: mark-super-simple-test-pod
spec:
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox:1.24
      command: [ "/bin/sh", "-c", "env" ]      
  dnsPolicy: Default
  restartPolicy: Never

I then created a pod from that template and checked the logs of that pod:

$ kubectl create -f hostname_super_simple.yaml 
pod "mark-super-simple-test-pod" created
$ kubectl logs  mark-super-simple-test-pod
KUBERNETES_SERVICE_PORT=443
KUBERNETES_PORT=tcp://10.0.0.1:443
HOSTNAME=mark-super-simple-test-pod
SHLVL=1
HOME=/root
KUBERNETES_PORT_443_TCP_ADDR=10.0.0.1
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_PORT_443_TCP=tcp://10.0.0.1:443
PWD=/
KUBERNETES_SERVICE_HOST=10.0.0.1

The information we need is in $HOSTNAME so the next thing we need to do is create a pod template which puts that into a file.

hostname_simple.yaml

apiVersion: v1
kind: Pod
metadata:
  name: mark-test-pod
spec:
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox:1.24
      command: [ "/bin/sh", "-c", "echo $HOSTNAME > /tmp/bar; cat /tmp/bar" ]
  dnsPolicy: Default
  restartPolicy: Never

We can create a pod using this template by running the following command:

$ kubectl create -f hostname_simple.yaml
pod "mark-test-pod" created

Now let’s check the logs of the instance to see whether our script worked:

$ kubectl logs mark-test-pod
mark-test-pod

Indeed it did, good times!
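
Since we set restartPolicy: Never the completed pods stick around, so if we want to re-run the experiment we need to delete them first:

$ kubectl delete pod mark-super-simple-test-pod mark-test-pod
pod "mark-super-simple-test-pod" deleted
pod "mark-test-pod" deleted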

Categories: Programming

Neo4j 3.1 beta3 + docker: Creating a Causal Cluster

Sun, 11/13/2016 - 13:30

Over the weekend I’ve been playing around with docker and learning how to spin up a Neo4j Causal Cluster.

Causal Clustering is Neo4j’s new clustering architecture which makes use of Diego Ongaro’s Raft consensus algorithm to ensure writes are committed on a majority of servers. It’ll be available in the 3.1 series of Neo4j which is currently in beta. I’ll be using BETA3 in this post.

I don’t know much about docker but luckily my colleague Kevin Van Gundy wrote a blog post a couple of weeks ago explaining how to spin up Neo4j inside a docker container, which was very helpful for getting me started.

Kevin spins up a single Neo4j server using the latest released version, which at the time of writing is 3.0.7. Since we want to use a beta version we’ll need to use a docker image from the neo4j-experimental repository.

We’re going to create 3 docker instances, each running Neo4j, and have them form a cluster. We’ll name them instance0, instance1, and instance2. We’ll create config files for each instance on the host machine and refer to those from our docker instance. This is the config file for instance0:

/tmp/ce/instance0/conf/neo4j.conf

unsupported.dbms.edition=enterprise
dbms.mode=CORE
 
dbms.security.auth_enabled=false
dbms.memory.heap.initial_size=512m
dbms.memory.heap.max_size=512m
dbms.memory.pagecache.size=100M
dbms.tx_log.rotation.retention_policy=false
 
dbms.connector.bolt.type=BOLT
dbms.connector.bolt.enabled=true
dbms.connector.bolt.listen_address=0.0.0.0:7687
dbms.connector.http.type=HTTP
dbms.connector.http.enabled=true
dbms.connector.http.listen_address=0.0.0.0:7474
 
dbms.connectors.default_listen_address=0.0.0.0
dbms.connectors.default_advertised_address=instance0
 
causal_clustering.initial_discovery_members=instance0:5000,instance1:5000,instance2:5000
causal_clustering.leader_election_timeout=2s

The only config that changes between instances is dbms.connectors.default_advertised_address which would have a value of instance1 or instance2 for the other members of our cluster.

We can create a docker instance using this config:

docker run --name=instance0 --detach \
           --publish=7474:7474 \
           --publish=7687:7687 \
           --net=cluster \
           --hostname=instance0 \
           --volume /tmp/ce/instance0/conf:/conf \
           --volume /tmp/ce/instance0/data:/data \
           neo4j/neo4j-experimental:3.1.0-M13-beta3-enterprise

We create the network ‘cluster’ referenced on the 4th line like this:

docker network create --driver=bridge cluster

It’s a bit of a pain having to create these config files and calls to docker by hand but luckily Michael has scripted the whole thing for us.

docker.sh

function config {
mkdir -p /tmp/ce/$1/conf
cat > /tmp/ce/$1/conf/neo4j.conf << EOF
unsupported.dbms.edition=enterprise
dbms.mode=CORE
 
dbms.security.auth_enabled=false
dbms.memory.heap.initial_size=512m
dbms.memory.heap.max_size=512m
dbms.memory.pagecache.size=100M
dbms.tx_log.rotation.retention_policy=false
 
dbms.connector.bolt.type=BOLT
dbms.connector.bolt.enabled=true
dbms.connector.bolt.listen_address=0.0.0.0:7687
dbms.connector.http.type=HTTP
dbms.connector.http.enabled=true
dbms.connector.http.listen_address=0.0.0.0:7474
 
dbms.connectors.default_listen_address=0.0.0.0
dbms.connectors.default_advertised_address=${1}
 
causal_clustering.initial_discovery_members=instance0:5000,instance1:5000,instance2:5000
causal_clustering.leader_election_timeout=2s
EOF
}
 
function run {
HOST=$1
INSTANCE=instance$HOST
config $INSTANCE
docker run --name=$INSTANCE --detach \
           --publish=$[7474+$HOST]:7474 \
           --publish=$[7687+$HOST]:7687 \
           --net=cluster \
           --hostname=$INSTANCE \
           --volume /tmp/ce/$INSTANCE/conf:/conf \
           --volume /tmp/ce/$INSTANCE/data:/data \
           neo4j/neo4j-experimental:3.1.0-M13-beta3-enterprise
}
 
docker network create --driver=bridge cluster
 
run 0
run 1
run 2

Once we run the script we can run the following command to check that the cluster has come up:

$ docker logs instance0
Starting Neo4j.
2016-11-13 11:46:55.863+0000 INFO  Starting...
2016-11-13 11:46:57.241+0000 INFO  Bolt enabled on 0.0.0.0:7687.
2016-11-13 11:46:57.255+0000 INFO  Initiating metrics...
2016-11-13 11:46:57.439+0000 INFO  Waiting for other members to join cluster before continuing...
2016-11-13 11:47:17.816+0000 INFO  Started.
2016-11-13 11:47:18.054+0000 INFO  Mounted REST API at: /db/manage
2016-11-13 11:47:19.068+0000 INFO  Remote interface available at http://instance0:7474/

Each instance is available at port 7474 but we’ve mapped these to different ports on the host OS by using this line in the parameters we passed to docker run:

--publish=$[7474+$HOST]:7474

We can therefore access each of these Neo4j instances from the host OS at the following ports:

instance0 -> http://localhost:7474
instance1 -> http://localhost:7475
instance2 -> http://localhost:7476

If we open one of those we’ll be confronted with the following dialog:

[Screenshot: the Neo4j browser connection dialog]

This is a bit strange as we explicitly disabled security in our config.

The actual problem is that the Neo4j browser is unable to communicate with the underlying database. There are two ways to work around this:

Connect using HTTP instead of BOLT

We can tell the browser to connect to the database using the HTTP protocol rather than BOLT by unticking the checkbox:

[Screenshot: the connection dialog with the checkbox unticked]

Update the Bolt host

Or we can update the Bolt host value to refer to a host:port value that’s accessible from the host OS. Each server is accessible from port 7687 but we mapped those ports to different ports on the host OS with this flag that we passed to docker run:

--publish=$[7687+$HOST]:7687 \

We can access BOLT from the following ports:

instance0 -> localhost:7687
instance1 -> localhost:7688
instance2 -> localhost:7689

Let’s try changing it for instance2:

[Screenshot: updating the Bolt host for instance2]

You might have to refresh your web browser after you change the value but it usually updates automatically. We can run the :sysinfo command in the browser to see the state of our cluster:

[Screenshot: :sysinfo output showing the state of the cluster]

And we’re good to go. The full script is available as a gist if you want to give it a try.
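
And if you want to tear everything down and start again, something like this does the job:

docker rm -f instance0 instance1 instance2
docker network rm cluster
rm -rf /tmp/ce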

Let me know how you get on!

Categories: Programming

Neo4j: Find the intermediate point between two lat/longs

Tue, 11/01/2016 - 23:10

Yesterday I wrote a blog post showing how to find the midpoint between two lat/longs using Cypher, which worked well as a first attempt at filling in missing locations, but I realised I could do better.

As I mentioned in the last post, when I find a stop that’s missing lat/long coordinates I can usually find two nearby stops that allow me to triangulate this stop’s location.

I also have train routes which indicate the number of seconds it takes to go from one stop to another, which allows me to indicate whether the location-less stop is closer to one stop than the other.

For example, consider stops a, b, and c where b doesn’t have a location. If we have these distances between the stops:

(a)-[:NEXT {time: 60}]->(b)-[:NEXT {time: 240}]->(c)

it tells us that point ‘b’ is actually 0.2 of the distance from ‘a’ to ‘c’ rather than being the midpoint.

There’s a formula we can use to work out that point:

a = sin((1βˆ’f)β‹…Ξ΄) / sin Ξ΄
b = sin(fβ‹…Ξ΄) / sin Ξ΄
x = a β‹… cos Ο†1 β‹… cos Ξ»1 + b β‹… cos Ο†2 β‹… cos Ξ»2
y = a β‹… cos Ο†1 β‹… sin Ξ»1 + b β‹… cos Ο†2 β‹… sin Ξ»2
z = a β‹… sin Ο†1 + b β‹… sin Ο†2
Ο†i = atan2(z, √xΒ² + yΒ²)
Ξ»i = atan2(y, x)
 
Ξ΄ is the angular distance d/R between the two points.
Ο† = latitude
Ξ» = longitude

Translated to Cypher (with mandatory Greek symbols) it reads like this to find the point 0.2 of the way from one point to another:

with {latitude: 51.4931963543, longitude: -0.0475185810} AS p1, 
     {latitude: 51.47908, longitude: -0.05393950 } AS p2
 
WITH p1, p2, distance(point(p1), point(p2)) / 6371000 AS Ξ΄, 0.2 AS f
WITH p1, p2, Ξ΄, 
     sin((1-f) * Ξ΄) / sin(Ξ΄) AS a,
     sin(f * Ξ΄) / sin(Ξ΄) AS b
WITH radians(p1.latitude) AS Ο†1, radians(p1.longitude) AS Ξ»1,
     radians(p2.latitude) AS Ο†2, radians(p2.longitude) AS Ξ»2,
     a, b
WITH a * cos(Ο†1) * cos(Ξ»1) + b * cos(Ο†2) * cos(Ξ»2) AS x,
     a * cos(Ο†1) * sin(Ξ»1) + b * cos(Ο†2) * sin(Ξ»2) AS y,
     a * sin(Ο†1) + b * sin(Ο†2) AS z
RETURN degrees(atan2(z, sqrt(x^2 + y^2))) AS Ο†i,
       degrees(atan2(y,x)) AS Ξ»i
╒═════════════════╀════════════════════╕
β”‚Ο†i               β”‚Ξ»i                  β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ════════════════════║
β”‚51.49037311149128β”‚-0.04880308288561931β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

A quick sanity check plugging in 0.5 instead of 0.2 finds the midpoint which I was able to sanity check against yesterday’s post:

╒═════════════════╀═════════════════════╕
β”‚Ο†i               β”‚Ξ»i                   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════════════║
β”‚51.48613822097523β”‚-0.050729537454086385β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
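
The next step is to wire this into the graph. A rough sketch of how we might find the location-less stops and derive f from the time properties (latitude and name are assumed property names here):

MATCH (a)-[r1:NEXT]->(b)-[r2:NEXT]->(c)
WHERE b.latitude IS NULL AND exists(a.latitude) AND exists(c.latitude)
RETURN a.name, b.name, c.name,
       toFloat(r1.time) / (r1.time + r2.time) AS f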

That’s all for now!

Categories: Programming

Neo4j: Find the midpoint between two lat/longs

Mon, 10/31/2016 - 20:31

[Screenshot: Google Maps showing the two points and the calculated midpoint]

Over the last couple of weekends I’ve been playing around with some transport data and I wanted to run the A* algorithm to find the quickest route between two stations.

The A* algorithm takes an estimateEvaluator as one of its parameters and the evaluator looks at lat/longs of nodes to work out whether a path is worth following or not. I therefore needed to add lat/longs for each station and I found it surprisingly hard to find this location data for all the points in the dataset.

Luckily I tend to have the lat/longs for two points either side of a station so I can work out the midpoint as an approximation for the missing one.

I found an article which defines a formula we can use to do this and there’s a StackOverflow post which has some Java code that implements the formula.
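
For reference, the formula reads like this (Ο† is latitude, Ξ» is longitude, and Δλ = Ξ»2 βˆ’ Ξ»1):

Bx = cos Ο†2 β‹… cos Δλ
By = cos Ο†2 β‹… sin Δλ
Ο†m = atan2(sin Ο†1 + sin Ο†2, √((cos Ο†1 + Bx)Β² + ByΒ²))
Ξ»m = Ξ»1 + atan2(By, cos Ο†1 + Bx)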

I wanted to find the midpoint between Surrey Quays station (51.4931963543,-0.0475185810) and a point further south on the train line (51.47908,-0.05393950). I wrote the following Cypher query to calculate this point:

WITH 51.4931963543 AS lat1, -0.0475185810 AS lon1, 
     51.47908 AS lat2 , -0.05393950 AS lon2
 
WITH radians(lat1) AS rlat1, radians(lon1) AS rlon1, 
     radians(lat2) AS rlat2, radians(lon2) AS rlon2, 
     radians(lon2 - lon1) AS dLon
 
WITH rlat1, rlon1, rlat2, rlon2, 
     cos(rlat2) * cos(dLon) AS Bx, 
     cos(rlat2) * sin(dLon) AS By
 
WITH atan2(sin(rlat1) + sin(rlat2), 
           sqrt( (cos(rlat1) + Bx) * (cos(rlat1) + Bx) + By * By )) AS lat3,
     rlon1 + atan2(By, cos(rlat1) + Bx) AS lon3
 
RETURN degrees(lat3) AS midLat, degrees(lon3) AS midLon
╒═════════════════╀═════════════════════╕
β”‚midLat           β”‚midLon               β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════════════║
β”‚51.48613822097523β”‚-0.050729537454086385β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The Google Maps screenshot on the right hand side shows the initial points at the top and bottom and the midpoint in between. It’s not perfect; ideally I’d like the midpoint to be on the track, but I think it’s good enough for the purposes of the algorithm.

Now I need to go and fill in the lat/longs for my location-less stations!

Categories: Programming

Neo4j: Create dynamic relationship type

Sun, 10/30/2016 - 23:12

One of the things I’ve often found frustrating when importing data using Cypher, Neo4j’s query language, is that it’s quite difficult to create dynamic relationship types.

Say we have a CSV file structured like this:

load csv with headers from "file:///people.csv" AS row
RETURN row
╒═══════════════════════════════════════════════════════╕
β”‚row                                                    β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘
β”‚{node1: Mark, node2: Reshmee, relationship: MARRIED_TO}β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚{node1: Mark, node2: Alistair, relationship: FRIENDS}  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
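
For reference, a people.csv file with contents like this would produce those rows:

node1,node2,relationship
Mark,Reshmee,MARRIED_TO
Mark,Alistair,FRIENDS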

We want to create the nodes and connect them with the relationship type specified in the file. Unfortunately, in Cypher we can’t parameterise relationship types, so we have to resort to the FOREACH hack to create our relationships:

load csv with headers from "file:///people.csv" AS row
MERGE (p1:Person {name: row.node1})
MERGE (p2:Person {name: row.node2})
 
FOREACH(ignoreMe IN CASE WHEN row.relationship = "MARRIED_TO" THEN [1] ELSE [] END |
 MERGE (p1)-[:MARRIED_TO]->(p2))
 
FOREACH(ignoreMe IN CASE WHEN row.relationship = "FRIENDS" THEN [1] ELSE [] END |
 MERGE (p1)-[:FRIENDS]->(p2))

This works, but:

  1. Looks horrendous
  2. Doesn’t scale particularly well when we have multiple relationship types to deal with

As in my last post the APOC library comes to the rescue again, this time in the form of the apoc.create.relationship procedure.

This procedure allows us to change our initial query to read like this:

load csv with headers from "file:///people.csv" AS row
MERGE (p1:Person {name: row.node1})
MERGE (p2:Person {name: row.node2})
 
WITH p1, p2, row
CALL apoc.create.relationship(p1, row.relationship, {}, p2) YIELD rel
RETURN rel

Much better!

Categories: Programming

Neo4j: Dynamically add property/Set dynamic property

Thu, 10/27/2016 - 06:29

I’ve been playing around with a dataset which has the timetable for the national rail in the UK, and it gives you the departure and arrival times of each train in a textual format.

For example, the node to represent a stop could be created like this:

CREATE (stop:Stop {arrival: "0802", departure: "0803H"})

That time format isn’t particularly amenable to querying so I wanted to add another property which indicated the number of seconds since the start of the day.

So we want to add ‘arrivalSecondsSinceStartOfDay’ and ‘departureSecondsSinceStartOfDay’ properties to our node. I wrote the following query to calculate the values for those properties.

MATCH (stop:Stop)
UNWIND ["arrival", "departure"] AS key
 
WITH key,
     toInteger(substring(stop[key], 0, 2)) AS hours,          
     toInteger(substring(stop[key], 2, 2)) AS minutes,
     CASE WHEN substring(stop[key], 4,1) = "H" THEN 30 ELSE 0 END AS seconds
 
WITH key, (hours * 60 * 60) + (minutes * 60) + seconds AS secondsSinceStartOfDay
 
RETURN key + "SecondsSinceStartOfDay" AS newKey, secondsSinceStartOfDay
╒═══════════════════════════════╀══════════════════════╕
β”‚newKey                         β”‚secondsSinceStartOfDayβ”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════════════║
β”‚arrivalSecondsSinceStartOfDay  β”‚28920                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚departureSecondsSinceStartOfDayβ”‚29010                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Now we’re ready to set those properties on the ‘stop’ node.

MATCH (stop:Stop)
UNWIND ["arrival", "departure"] AS key
 
WITH stop,
     key,
     toInteger(substring(stop[key], 0, 2)) AS hours,          
     toInteger(substring(stop[key], 2, 2)) AS minutes,
     CASE WHEN substring(stop[key], 4,1) = "H" THEN 30 ELSE 0 END AS seconds
 
WITH stop, key, (hours * 60 * 60) + (minutes * 60) + seconds AS secondsSinceStartOfDay
WITH stop, key + "SecondsSinceStartOfDay" AS newKey, secondsSinceStartOfDay
SET stop[newKey] = secondsSinceStartOfDay
Invalid input '[': expected an identifier character, whitespace, '{', node labels, a property map, a relationship pattern, '.', '(', '=' or "+=" (line 12, column 9 (offset: 447))
"SET stop[newKey] = secondsSinceStartOfDay"
         ^

Hmmm that didn’t work as expected! It doesn’t look like we can set dynamic properties using Cypher just yet.

Luckily my colleague Michael Hunger and the Neo4j community have been curating the APOC procedures library and it has just the procedure to help us out.

You’ll need to download the jar for your version of Neo4j and then place it in the plugins directory. I’m using Neo4j 3.1 Beta1 so this is what it looks like for me:

$ tree neo4j-enterprise-3.1.0-BETA1/plugins/
 
neo4j-enterprise-3.1.0-BETA1/plugins/
└── apoc-3.1.0.1-all.jar
 
0 directories, 1 file

After you’ve done that you’ll need to restart Neo4j so that it can pick up the new procedures that we’ve added. Once you’ve done that execute the following query to check they’ve installed correctly:

call dbms.procedures()
YIELD name 
WITH name 
WHERE name STARTS WITH "apoc"
RETURN COUNT(*)
╒════════╕
β”‚COUNT(*)β”‚
β•žβ•β•β•β•β•β•β•β•β•‘
β”‚183     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜

We’re now ready to dynamically set properties in the graph. The procedure that we’ll use is apoc.create.setProperty and it’s easy to update our query to use it:

MATCH (stop:Stop)
UNWIND ["arrival", "departure"] AS key
 
WITH stop,
     key,
     toInteger(substring(stop[key], 0, 2)) AS hours,          
     toInteger(substring(stop[key], 2, 2)) AS minutes,
     CASE WHEN substring(stop[key], 4,1) = "H" THEN 30 ELSE 0 END AS seconds
 
WITH stop, key, (hours * 60 * 60) + (minutes * 60) + seconds AS secondsSinceStartOfDay
WITH stop, key + "SecondsSinceStartOfDay" AS newKey, secondsSinceStartOfDay
CALL apoc.create.setProperty(stop, newKey, secondsSinceStartOfDay)
Query cannot conclude with CALL (must be RETURN or an update clause) (line 12, column 1 (offset: 439))
"CALL apoc.create.setProperty(stop, newKey, secondsSinceStartOfDay)"
 ^

Oops I spoke too soon! We need to yield the return column of the procedure and return it or just return a count to work around this:

MATCH (stop:Stop)
UNWIND ["arrival", "departure"] AS key
 
WITH stop,
     key,
     toInteger(substring(stop[key], 0, 2)) AS hours,          
     toInteger(substring(stop[key], 2, 2)) AS minutes,
     CASE WHEN substring(stop[key], 4,1) = "H" THEN 30 ELSE 0 END AS seconds
 
WITH stop, key, (hours * 60 * 60) + (minutes * 60) + seconds AS secondsSinceStartOfDay
WITH stop, key + "SecondsSinceStartOfDay" AS newKey, secondsSinceStartOfDay
CALL apoc.create.setProperty(stop, newKey, secondsSinceStartOfDay) 
YIELD node
RETURN COUNT(*)
╒════════╕
β”‚COUNT(*)β”‚
β•žβ•β•β•β•β•β•β•β•β•‘
β”‚2       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜

And that’s it, we can now dynamically set properties in our queries.
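
A quick sanity check that the new properties are in place:

MATCH (stop:Stop)
RETURN stop.arrival, stop.arrivalSecondsSinceStartOfDay,
       stop.departure, stop.departureSecondsSinceStartOfDay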

Categories: Programming

Neo4j: Detecting rogue spaces in CSV headers with LOAD CSV

Wed, 10/19/2016 - 06:16

Last week I was helping someone load the data from a CSV file into Neo4j and we were having trouble filtering out rows which contained a null value in one of the columns.

This is what the data looked like:

load csv with headers from "file:///foo.csv" as row
RETURN row
╒══════════════════════════════════╕
β”‚row                               β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘
β”‚{key1: a,  key2: (null),  key3: c}β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚{key1: d,  key2: e,  key3: f}     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

We’d like to filter out any rows which have ‘key2’ as null, so let’s tweak our query to do that:

load csv with headers from "file:///foo.csv" as row
WITH row WHERE NOT row.key2 is null
RETURN row
(no rows)

Hmmm that’s odd, it’s got rid of both rows. We’d expect to see the 2nd row since that doesn’t have a null value.

At this point we might suspect that what we’re seeing on the screen isn’t actually what the data looks like. Let’s write the following query to check our header values:

load csv with headers from "file:///foo.csv" as row
WITH row LIMIT 1
UNWIND keys(row) AS key
RETURN key, SIZE(key)
╒═════╀═════════╕
β”‚key  β”‚SIZE(key)β”‚
β•žβ•β•β•β•β•β•ͺ═════════║
β”‚key1 β”‚4        β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ key2β”‚5        β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ key3β”‚5        β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The second column tells us that there are some extra characters in the columns for ‘key2’ and ‘key3’ or rather ‘ key2’ and ‘ key3’. In this case they are spaces, but it could easily be another character:

load csv with headers from "file:///foo.csv" as row
WITH row LIMIT 1
UNWIND keys(row) AS key
RETURN key, replace(key, " ", "_SPACE_") AS spaces
╒═════╀═══════════╕
β”‚key  β”‚spaces     β”‚
β•žβ•β•β•β•β•β•ͺ═══════════║
β”‚key1 β”‚key1       β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ key2β”‚_SPACE_key2β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ key3β”‚_SPACE_key3β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
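
So the underlying file presumably looked something like this, with stray spaces after the commas in the header row:

key1, key2, key3
a,,c
d,e,f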

If we clean up our CSV file and try again everything works as expected:

load csv with headers from "file:///foo.csv" as row
WITH row LIMIT 1
UNWIND keys(row) AS key
RETURN key, SIZE(key)
╒════╀═════════╕
β”‚key β”‚SIZE(key)β”‚
β•žβ•β•β•β•β•ͺ═════════║
β”‚key1β”‚4        β”‚
β”œβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚key2β”‚4        β”‚
β”œβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚key3β”‚4        β”‚
β””β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
load csv with headers from "file:///foo.csv" as row
WITH row WHERE NOT row.key2 is null
RETURN row
╒═══════════════════════════╕
β”‚row                        β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘
β”‚{key1: d, key2: e, key3: f}β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Categories: Programming

Neo4j: requirement failed

Tue, 10/04/2016 - 23:33

Last week during a hands on Cypher meetup, using Neo4j’s built in movie dataset, one of the attendees showed me the following query which wasn’t working as expected:

MATCH (p:Person)-[:ACTED_IN]->(movie)
RETURN p, COLLECT(movie.title) AS movies
ORDER BY COUNT(movies) DESC
LIMIT 10
requirement failed

We can get a full stack trace in logs/debug.log if we run the same query from the cypher-shell, which was introduced during one of the Neo4j 3.1 milestone releases:

2016-10-03 23:25:07.529+0000 ERROR [o.n.b.v.r.ErrorReporter] Client triggered an unexpected error [UnknownError]: requirement failed, reference to. requirement failed
java.lang.IllegalArgumentException: requirement failed
        at scala.Predef$.require(Predef.scala:212)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.steps.sortSkipAndLimit$.apply(sortSkipAndLimit.scala:38)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanEventHorizon$.apply(PlanEventHorizon.scala:43)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanEventHorizon$.apply(PlanEventHorizon.scala:31)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanWithTail.apply(PlanWithTail.scala:46)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanWithTail.apply(PlanWithTail.scala:29)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanSingleQuery.apply(PlanSingleQuery.scala:47)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.PlanSingleQuery.apply(PlanSingleQuery.scala:30)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.DefaultQueryPlanner$$anonfun$2.apply(QueryPlanner.scala:51)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.DefaultQueryPlanner$$anonfun$2.apply(QueryPlanner.scala:51)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.immutable.List.map(List.scala:285)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.DefaultQueryPlanner.planQueries(QueryPlanner.scala:51)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.logical.DefaultQueryPlanner.plan(QueryPlanner.scala:36)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.CostBasedExecutablePlanBuilder.produceLogicalPlan(CostBasedExecutablePlanBuilder.scala:95)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.CostBasedExecutablePlanBuilder$$anonfun$1.apply(CostBasedExecutablePlanBuilder.scala:71)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.CostBasedExecutablePlanBuilder$$anonfun$1.apply(CostBasedExecutablePlanBuilder.scala:71)
        at org.neo4j.cypher.internal.compiler.v3_1.helpers.package$$anonfun$closing$1.apply(package.scala:29)
        at org.neo4j.cypher.internal.compiler.v3_1.helpers.package$$anonfun$closing$1.apply(package.scala:29)
        at org.neo4j.cypher.internal.compiler.v3_1.helpers.package$.using(package.scala:37)
        at org.neo4j.cypher.internal.compiler.v3_1.helpers.package$.closing(package.scala:29)
        at org.neo4j.cypher.internal.compiler.v3_1.planner.CostBasedExecutablePlanBuilder.producePlan(CostBasedExecutablePlanBuilder.scala:70)
        at org.neo4j.cypher.internal.compiler.v3_1.executionplan.procs.DelegatingProcedureExecutablePlanBuilder.producePlan(DelegatingProcedureExecutablePlanBuilder.scala:99)
        at org.neo4j.cypher.internal.compiler.v3_1.executionplan.FallbackBuilder$class.producePlan(SilentFallbackPlanBuilder.scala:37)
        at org.neo4j.cypher.internal.compiler.v3_1.executionplan.SilentFallbackPlanBuilder.producePlan(SilentFallbackPlanBuilder.scala:56)
        at org.neo4j.cypher.internal.compiler.v3_1.executionplan.ExecutionPlanBuilder.build(ExecutionPlanBuilder.scala:106)
        at org.neo4j.cypher.internal.compiler.v3_1.CypherCompiler$$anonfun$5.apply(CypherCompiler.scala:176)
        at org.neo4j.cypher.internal.compiler.v3_1.CypherCompiler$$anonfun$5.apply(CypherCompiler.scala:176)
        at org.neo4j.cypher.internal.compiler.v3_1.QueryCache$$anonfun$getOrElseUpdate$1$$anonfun$apply$1.apply(CacheAccessor.scala:36)
        at org.neo4j.cypher.internal.compiler.v3_1.MonitoringCacheAccessor$$anonfun$1.apply(CacheAccessor.scala:57)
        at org.neo4j.cypher.internal.compiler.v3_1.LFUCache$$anon$1.apply(LFUCache.scala:31)

The problem line seems to be this one:

require(sortItems.forall(_.expression.isInstanceOf[Variable]))

The solution is actually quite simple. We should be using the SIZE function to calculate the size of a list rather than the COUNT function:

MATCH (p:Person)-[:ACTED_IN]->(movie)
RETURN p, COLLECT(movie.title) AS movies
ORDER BY SIZE(movies) DESC
LIMIT 10
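
As an aside, another way of writing the query is to do the aggregation and ordering in a WITH clause, which sidesteps the problem as well:

MATCH (p:Person)-[:ACTED_IN]->(movie)
WITH p, COLLECT(movie.title) AS movies, COUNT(movie) AS numberOfMovies
ORDER BY numberOfMovies DESC
LIMIT 10
RETURN p, movies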
Categories: Programming

Neo4j: Procedure call inside a query does not support passing arguments implicitly (pass explicitly after procedure name instead)

Sun, 10/02/2016 - 11:13

A couple of days ago I was trying to write a Cypher query to filter the labels in my database.

I started with the following procedure call to get the list of all the labels:

CALL db.labels
╒══════════╕
β”‚label     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•‘
β”‚Airport   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚Flight    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚Airline   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚Movie     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚AirportDayβ”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚Person    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚Engineer  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

I was only interested in labels that contained the letter ‘a’ so I tweaked the query to filter the output of the procedure:

CALL db.labels
YIELD label 
WITH label WHERE tolower(label) contains "a"
RETURN label

Unfortunately that didn’t work as I expected:

Procedure call inside a query does not support passing arguments implicitly (pass explicitly after procedure name instead) (line 1, column 9 (offset: 8))
"CALL db.labels"
         ^

The mistake I made was calling the procedure implicitly without using parentheses. If you want to do any post processing on the output of a procedure you need to call it explicitly otherwise Cypher gets very confused.

If we add back the parentheses it’s much happier:

CALL db.labels()
YIELD label 
WITH label WHERE tolower(label) contains "a"
RETURN label
╒══════════╕
β”‚label     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•‘
β”‚Airport   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚Airline   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚AirportDayβ”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

It stumped me for a while until I figured out what the error message meant! I think I’ll just use explicit parentheses all the time from now on to save me running into this one again.

Categories: Programming

scikit-learn: First steps with log_loss

Wed, 09/14/2016 - 06:33

Over the last week I’ve spent a little bit of time playing around with the data in the Kaggle TalkingData Mobile User Demographics competition, and came across a notebook written by dune_dweller showing how to run a logistic regression algorithm on the dataset.

The metric used to evaluate the output in this competition is multi class logarithmic loss, which is implemented by the log_loss function in the scikit-learn library.

I’ve not used it before so I created a small example to get to grips with it.

Let’s say we have 3 rows to predict and we happen to know that they should be labelled ‘bam’, ‘ham’, and ‘spam’ respectively:

>>> actual_labels = ["bam", "ham", "spam"]


To work out the log loss score we need to make a prediction for what we think each label actually is. We do this by passing an array containing a probability between 0 and 1 for each label.

For example, if we think the first label is definitely ‘bam’ then we’d pass [1, 0, 0], whereas if we thought it had a 50-50 chance of being ‘bam’ or ‘spam’ then we might pass [0.5, 0, 0.5]. As far as I can tell the labels get sorted into (alphabetical) order so we need to provide our predictions in the same order.

Let’s give it a try. First we’ll import the function:

>>> from sklearn.metrics import log_loss

Now let’s see what score we get if we make a perfect prediction:

>>> log_loss(actual_labels,  [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
2.1094237467877998e-15

What about if we make a completely wrong prediction?

>>> log_loss(actual_labels,  [[0, 0, 1], [1, 0, 0], [0, 1, 0]])
34.538776394910684

We can reverse engineer this score to work out the probability that we’ve predicted the correct class.

If we look at the case where the average log loss exceeds 1, it is when log(pij) < -1, where j is the true class. This means that the predicted probability for that given class would be less than exp(-1) or around 0.368. So, seeing a log loss greater than one can be expected in the case that your model only gives less than a 36% probability estimate for the correct class.

This is the formula of logloss:

logloss = -1/N * Ξ£i Ξ£j (yij * log(pij))

In which yij is 1 for the correct class and 0 for other classes and pij is the probability assigned for that class.

The interesting thing about this formula is that we only care about the correct class. The yij value of 0 cancels out the wrong classes.

In our two examples so far we actually already know the probability estimate for the correct class – 100% in the first case and 0% in the second case, but we can plug in the numbers to check we end up with the same result.

First we need to work out what value would have been passed to the log function, which is easy in this case: it’s the probability estimate we made for the correct class.

# every prediction exactly right
>>> math.log(1)
0.0
 
>>> math.exp(0)
1.0
# every prediction completely wrong
>>> math.log(0.000000001)
-20.72326583694641
 
>>> math.exp(-20.72326583694641)
1.0000000000000007e-09

I used a really small value instead of 0 in the second example because log(x) tends towards negative infinity as x approaches 0, and math.log(0) itself raises an error.

Let’s try another example where we have less certainty:

>>> print log_loss(actual_labels, [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.15, 0.15, 0.7]])
0.363548039673

We’ll have to do a bit more work to figure out what value was being passed to the log function this time, but not too much. This is roughly the calculation being performed:

# 0.363548039673 = -1/3 * (log(0.8) + log(0.6) + log(0.7))
 
>>> print log_loss(actual_labels,  [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.15, 0.15, 0.7]])
0.363548039673

In this case, on average our probability estimate would be:

# we put in the negative value since we multiplied by -1/N
>>> math.exp(-0.363548039673)
0.6952053289772744

We had 60%, 70%, and 80% accuracy for our 3 labels so an overall probability of 69.5% seems about right.

One more example. This time we’ll make one more very certain (90%) prediction for ‘spam’:

>>> print log_loss(["bam", "ham", "spam", "spam"], [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.15, 0.15, 0.7], [0.05, 0.05, 0.9]])
0.299001158669
 
>>> math.exp(-0.299001158669)
0.741558550213609

74% accuracy overall, sounds about right!
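
As a final sanity check, plugging those four probability estimates into the formula by hand reproduces the score:

# -1/4 * (log(0.8) + log(0.6) + log(0.7) + log(0.9))
>>> -(math.log(0.8) + math.log(0.6) + math.log(0.7) + math.log(0.9)) / 4
0.2990011586691898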

Categories: Programming

scikit-learn: Clustering and the curse of dimensionality

Sat, 08/27/2016 - 21:32

In my last post I attempted to cluster Game of Thrones episodes based on character appearances without much success. After I wrote that post I was flicking through the scikit-learn clustering documentation and noticed the following section which describes some of the weaknesses of the K-means clustering algorithm:

Inertia is not a normalized metric: we just know that lower values are better and zero is optimal.

But in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called β€œcurse of dimensionality”).

Running a dimensionality reduction algorithm such as PCA prior to k-means clustering can alleviate this problem and speed up the computations.

Each episode has 638 dimensions so this is probably the problem we’re seeing. I actually thought the ‘curse of dimensionality’ referred to the greater than linear increase in computation time; I hadn’t realised it could also impact the clustering itself.

As the documentation notes, the K-Means algorithm calculates euclidean distances to work out which cluster episodes should go in. Episodes in the same cluster should have a small euclidean distance and items in different clusters should have larger ones.

I created a little script to help me understand the curse of dimensionality. I’ve got 4 pairs of vectors, of size 4, 6, 100, and 600. Half of the items in the vector match and the other half differ. I calculate the cosine similarity and euclidean distance for each pair of vectors:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
 
def distances(a, b):
    return np.linalg.norm(a-b), cosine_similarity([a, b])[0][1]
 
def mixed(n_zeros, n_ones):
    return np.concatenate((np.repeat([1], n_ones), np.repeat([0], n_zeros)), axis=0)
 
def ones(n_ones):
    return np.repeat([1], n_ones)
 
print distances(mixed(2, 2), ones(4))
print distances(mixed(3, 3), ones(6))
print distances(mixed(50, 50), ones(100))
print distances(mixed(300, 300), ones(600))
 
(1.4142135623730951, 0.70710678118654746)
(1.7320508075688772, 0.70710678118654768)
(7.0710678118654755, 0.70710678118654757)
(17.320508075688775, 0.70710678118654746)

The euclidean distance for the 600 item vector (17.3) is more than 12x larger than for the one containing 4 items (1.4), despite the pairs having the same similarity score.

Having convinced myself that reducing the dimensionality of the vectors could make a difference, I reduced the size of the episode vectors using the TruncatedSVD algorithm before trying K-means clustering again.

First we reduce the dimensionality of the episodes vectors:

from sklearn.decomposition import TruncatedSVD
 
n_components = 2
reducer = TruncatedSVD(n_components=n_components)
reducer.fit(all)
new_all = reducer.transform(all)
print("%d: Percentage explained: %s\n" % (n_components, reducer.explained_variance_ratio_.sum()))
 
2: Percentage explained: 0.124579183633

I’m not sure how much I should be reducing the number of dimensions so I thought 2 would be an interesting place to start. The reducer.explained_variance_ratio_ attribute reports the fraction of the original variance retained by each component, so a sum of ~0.12 suggests these two components only capture around 12% of the variance. I need to do some more reading to figure out whether it makes sense to carry on with a dimension of 2.
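
One way to get a feel for it is to sweep over a few component counts and see how much variance each retains. A rough sketch, assuming all still holds the 60x638 episode matrix:

from sklearn.decomposition import TruncatedSVD

# how much of the original variance do different numbers of components keep?
for n in [2, 3, 5, 10, 30]:
    reducer = TruncatedSVD(n_components=n)
    reducer.fit(all)
    print n, reducer.explained_variance_ratio_.sum()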

For now though let’s try out the clustering algorithm again and see how it gets on:

from sklearn.cluster import KMeans
 
for n_clusters in range(2, 10):
    km = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
    cluster_labels = km.fit_predict(new_all)
    silhouette_avg = metrics.silhouette_score(new_all, cluster_labels, sample_size=1000)
 
    print n_clusters, silhouette_avg
 
2 0.559681096025
3 0.498456585461
4 0.524704352941
5 0.441580592398
6 0.44703058946
7 0.447895331824
8 0.433698007009
9 0.459874485986

This time our silhouette scores are much better. I came across a tutorial from the Guide to Advanced Data Analysis which includes a table explaining how to interpret this score:

[Screenshot: table explaining how to interpret silhouette scores]

We have a couple of cluster sizes which fit in the ‘reasonable structure’ and a few just on the edge of fitting in that category.

I tried varying the number of dimensions and found that 3 worked reasonably well, but after that the silhouette score dropped rapidly. Once we reach 30 dimensions the silhouette score is almost the same as if we hadn’t reduced dimensionality at all.

I haven’t figured out a good way of visualising the results of my experiments where I vary the dimensions and number of clusters so that’s something to work on next. I find it quite difficult to see what’s going on by just staring at the raw numbers.
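
One idea to try is a line chart with one line per dimensionality, plotting silhouette score against the number of clusters. A rough sketch, assuming the all matrix from before and a working matplotlib installation:

import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

for n_components in [2, 3, 10, 30]:
    reduced = TruncatedSVD(n_components=n_components).fit_transform(all)
    scores = []
    for n_clusters in range(2, 10):
        km = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
        scores.append(metrics.silhouette_score(reduced, km.fit_predict(reduced)))
    plt.plot(range(2, 10), scores, label="%d dimensions" % n_components)

plt.xlabel("number of clusters")
plt.ylabel("silhouette score")
plt.legend()
plt.show()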

I also need to read up on the SVD algorithm to understand when it is/isn’t acceptable to reduce dimensions and how much I should be reducing them by.

Any questions/thoughts/advice do let me know in the comments.

Categories: Programming

scikit-learn: Trying to find clusters of Game of Thrones episodes

Thu, 08/25/2016 - 23:07

In my last post I showed how to find similar Game of Thrones episodes based on the characters that appear in different episodes. This allowed us to find similar episodes on an episode by episode basis, but I was curious whether there were groups of similar episodes that we could identify.

scikit-learn provides several clustering algorithms that can run over our episode vectors and hopefully find clusters of similar episodes. A clustering algorithm groups similar documents together, where similarity is based on calculating a ‘distance’ between documents. Documents separated by a small distance would be placed in the same cluster, while documents separated by a large distance would probably end up in different clusters.

The simplest variant is K-means clustering:

The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified.

The output from the algorithm is a list of labels which correspond to the cluster assigned to each episode.

Let’s give it a try on the Game of Thrones episodes. We’ll start from the 2 dimensional array of episodes/character appearances that we created in the previous post.

>>> all.shape
(60, 638)
 
>>> all
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

We have a 60 (episodes) x 638 (characters) array which we can now plug into the K-means clustering algorithm:

>>> from sklearn.cluster import KMeans
 
>>> n_clusters = 3
>>> km = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
>>> cluster_labels = km.fit_predict(all)
 
>>> cluster_labels
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 2, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)

cluster_labels is an array containing a label for each episode in the all array. The spread of these labels is as follows:

>>> import numpy as np
>>> np.bincount(cluster_labels)
array([19, 12, 29])

i.e. 19 episodes in cluster 0, 12 in cluster 1, and 29 in cluster 2.

How do we know if the clustering is any good?

Ideally we’d have some labelled training data which we could compare our labels against, but since we don’t we can measure the effectiveness of our clustering by calculating inter-centroidal separation and intra-cluster variance.

i.e. how close are the episodes to other episodes in the same cluster vs how close are they to episodes in the closest different cluster.

scikit-learn gives us a function that we can use to calculate this score – the silhouette coefficient.

The output of this function is a score between -1 and 1.

  • A score of 1 means that our clustering has worked well and a document is far away from the boundary of another cluster.
  • A score of -1 means that our document should have been placed in another cluster.
  • A score of 0 means that the document is very close to the decision boundary between two clusters.

I tried calculating this coefficient for some different values of K. This is what I found:

from sklearn import metrics
 
for n_clusters in range(2, 10):
    km = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
    cluster_labels = km.fit_predict(all)
 
    silhouette_avg = metrics.silhouette_score(all, cluster_labels, sample_size=1000)
    sample_silhouette_values = metrics.silhouette_samples(all, cluster_labels)  # per-episode scores; unused here, handy for a silhouette plot later
 
    print n_clusters, silhouette_avg
 
2 0.0798610142955
3 0.0648416081725
4 0.0390877994786
5 0.020165277756
6 0.030557856406
7 0.0389677156458
8 0.0590721834989
9 0.0466170527996

The best score we manage here is 0.07 when we set the number of clusters to 2. Even our highest score is much lower than the lowest score on the documentation page!

I tried it out with some higher values of K but only saw a score over 0.5 once I put the number of clusters to 40 which would mean 1 or 2 episodes per cluster at most.

At the moment our episode arrays contain 638 elements so they’re too long to visualise on a 2D silhouette plot. We’d need to apply a dimensionality reduction algorithm before doing that.

In summary it looks like character co-occurrence isn’t a good way to cluster episodes. I’m curious what would happen if we flip the array on its head and try and cluster the characters instead, but that’s for another day.

If anyone spots anything that I’ve missed when reading the output of the algorithm let me know in the comments. I’m just learning by experimentation at the moment.

Categories: Programming

Neo4j/scikit-learn: Calculating the cosine similarity of Game of Thrones episodes

Mon, 08/22/2016 - 22:12

A couple of months ago Praveena and I created a Game of Thrones dataset to use in a workshop and I thought it’d be fun to run it through some machine learning algorithms and hopefully find some interesting insights.

The dataset is available as CSV files but for this analysis I’m assuming that it’s already been imported into neo4j. If you want to import the data you can run the tutorial by typing the following into the query bar of the neo4j browser:

:play http://guides.neo4j.com/got

Since we don’t have any training data we’ll be using unsupervised learning methods, and we’ll start simple by calculating the similarity of episodes based on character appearances. We’ll be using scikit-learn‘s cosine similarity function to determine episode similarity.

Christian Perone has an excellent blog post explaining how to use cosine similarity on text documents which is well worth a read. We’ll be using a similar approach here, but instead of building a TF/IDF vector for each document we’re going to create a vector indicating whether a character appeared in an episode or not.

e.g. imagine that we have 3 characters – A, B, and C – and 2 episodes. A and B appear in the first episode and B and C appear in the second episode. We would represent that with the following vectors:

Episode 1 = [1, 1, 0]
Episode 2 = [0, 1, 1]

We could then calculate the cosine similarity between these two episodes like this:

>>> from sklearn.metrics.pairwise import cosine_similarity
>>> one = [1,1,0]
>>> two = [0,1,1]
 
>>> cosine_similarity([one, two])
array([[ 1. ,  0.5],
       [ 0.5,  1. ]])

So this is telling us that Episode 1 is 100% similar to Episode 1, Episode 2 is 100% similar to itself as well, and Episodes 1 and 2 are 50% similar to each other based on the fact that they both have an appearance of Character B.

Note that the character names aren’t mentioned at all; each character is implicitly represented by a position in the array. This means that when we use our real dataset we need to ensure that the characters appear in the same order for each episode, otherwise the calculation will be meaningless!
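
To see why, here’s a quick sketch. Episode 2 contains characters B and C, but if its vector is built with the characters ordered B, C, A rather than A, B, C it becomes indistinguishable from Episode 1:

episode_one = [1, 1, 0]            # A and B appear; positions ordered A, B, C
episode_two_scrambled = [1, 1, 0]  # B and C appear, but positions ordered B, C, A

# reports 1.0 instead of the correct 0.5
print cosine_similarity([episode_one, episode_two_scrambled])[0][1]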

In neo4j land we have an APPEARED_IN relationship between a character and each episode that they appeared in. We can therefore write the following code using the Python driver to get all pairs of episodes and characters:

from neo4j.v1 import GraphDatabase, basic_auth
driver = GraphDatabase.driver("bolt://localhost", auth=basic_auth("neo4j", "neo"))
session = driver.session()
 
rows = session.run("""
    MATCH (c:Character), (e:Episode)
    OPTIONAL MATCH (c)-[appearance:APPEARED_IN]->(e)
    RETURN e, c, appearance
    ORDER BY e.id, c.id""")

We can iterate through the rows to see what the output looks like:

>>> for row in rows:
        print row
 
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5415 labels=set([u'Character']) properties={u'name': u'Addam Marbrand', u'id': u'/wiki/Addam_Marbrand'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5882 labels=set([u'Character']) properties={u'name': u'Adrack Humble', u'id': u'/wiki/Adrack_Humble'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6747 labels=set([u'Character']) properties={u'name': u'Aegon V Targaryen', u'id': u'/wiki/Aegon_V_Targaryen'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5750 labels=set([u'Character']) properties={u'name': u'Aemon', u'id': u'/wiki/Aemon'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5928 labels=set([u'Character']) properties={u'name': u'Aeron Greyjoy', u'id': u'/wiki/Aeron_Greyjoy'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5503 labels=set([u'Character']) properties={u'name': u'Aerys II Targaryen', u'id': u'/wiki/Aerys_II_Targaryen'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6753 labels=set([u'Character']) properties={u'name': u'Alannys Greyjoy', u'id': u'/wiki/Alannys_Greyjoy'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=6750 labels=set([u'Character']) properties={u'name': u'Alerie Tyrell', u'id': u'/wiki/Alerie_Tyrell'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5753 labels=set([u'Character']) properties={u'name': u'Alliser Thorne', u'id': u'/wiki/Alliser_Thorne'}> appearance=None>
<Record e=<Node id=6780 labels=set([u'Episode']) properties={u'season': 1, u'number': 1, u'id': 1, u'title': u'Winter Is Coming'}> c=<Node id=5858 labels=set([u'Character']) properties={u'name': u'Alton Lannister', u'id': u'/wiki/Alton_Lannister'}> appearance=None>

Next we’ll build a ‘matrix’ of episodes/characters. If a character appears in an episode then we’ll put a ‘1’ in the matrix, if not we’ll put a ‘0’:

episodes = {}
for row in rows:
    if episodes.get(row["e"]["id"]) is None:
        if row["appearance"] is None:
            episodes[row["e"]["id"]] = [0]
        else:
            episodes[row["e"]["id"]] = [1]
    else:
        if row["appearance"] is None:
            episodes[row["e"]["id"]].append(0)
        else:
            episodes[row["e"]["id"]].append(1)
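
As an aside, a defaultdict would let us build the same matrix a little more compactly. An equivalent sketch, still relying on the query’s ORDER BY to keep the characters in a consistent order:

from collections import defaultdict

episodes = defaultdict(list)
for row in rows:
    episodes[row["e"]["id"]].append(0 if row["appearance"] is None else 1)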

Here’s an example of one entry in the matrix:

>>> len(episodes)
60
 
>>> len(episodes[1])
638
 
>>> episodes[1]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

From this output we learn that there are 60 episodes and 638 characters in Game of Thrones so far. We can also see which characters appeared in the first episode, although it’s a bit tricky to work out which index in the array corresponds to each character.
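
If we wanted to map an index back to a character we could keep a parallel list of names for a single episode, since the ordered character list repeats for every episode. A sketch, assuming we re-run the query first because the rows cursor has already been consumed:

characters = []
for row in rows:
    if row["e"]["id"] == 1:
        characters.append(row["c"]["name"])

# the names of the characters who appeared in episode 1
print [name for name, appeared in zip(characters, episodes[1]) if appeared == 1]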

The next thing we’re going to do is calculate the cosine similarity between episodes. Let’s start by seeing how similar the first episode is to all the others:

>>> all = episodes.values()
 
>>> cosine_similarity(all[0:1], all)[0]
array([ 1.        ,  0.69637306,  0.48196269,  0.54671752,  0.48196269,
        0.44733753,  0.31707317,  0.42340087,  0.34989921,  0.43314808,
        0.36597766,  0.18421252,  0.30961158,  0.2328101 ,  0.30616181,
        0.41905818,  0.36842504,  0.35338088,  0.18376917,  0.3569686 ,
        0.2328101 ,  0.34539847,  0.25043516,  0.31707317,  0.25329221,
        0.33342786,  0.34921515,  0.2174909 ,  0.2533473 ,  0.28429311,
        0.23026565,  0.22310537,  0.22365301,  0.23816275,  0.28242289,
        0.16070148,  0.24847093,  0.21434648,  0.03582872,  0.21189672,
        0.15460414,  0.17161693,  0.15460414,  0.17494961,  0.1234662 ,
        0.21426863,  0.21434648,  0.18748505,  0.15308091,  0.20161946,
        0.19877675,  0.30920827,  0.21058466,  0.19127301,  0.24607943,
        0.18033393,  0.17734311,  0.16296707,  0.18740851,  0.23995201])

The first entry in the array indicates that episode 1 is 100% similar to episode 1 which is a good start. It’s 69% similar to episode 2 and 48% similar to episode 3. We can sort that array to work out which episodes it’s most similar to:

>>> for idx, score in sorted(enumerate(cosine_similarity(all[0:1], all)[0]), key = lambda x: x[1], reverse = True)[:5]:
        print idx, score
 
0 1.0
1 0.696373059207
3 0.546717521051
2 0.481962692712
4 0.481962692712

Or we can see how similar the last episode of season 6 is compared to the others:

>>> for idx, score in sorted(enumerate(cosine_similarity(all[59:60], all)[0]), key = lambda x: x[1], reverse = True)[:5]:
        print idx, score
 
59 1.0
52 0.500670191678
46 0.449085146211
43 0.448218732478
49 0.446296233312

I found it a bit painful exploring similarities like this so I decided to write them into neo4j instead and then write a query to find the most similar episodes. The following query creates a SIMILAR_TO relationship between episodes and sets a score property on that relationship:

>>> episode_mapping = {}
>>> for idx, episode_id in enumerate(episodes):
        episode_mapping[idx] = episode_id
 
>>> for idx, episode_id in enumerate(episodes):
        similarity_matrix = cosine_similarity(all[idx:idx+1], all)[0]
        for other_idx, similarity_score in enumerate(similarity_matrix):
            other_episode_id = episode_mapping[other_idx]
            print episode_id, other_episode_id, similarity_score
            if episode_id != other_episode_id:
                session.run("""
                    MATCH (episode1:Episode {id: {episode1}}), (episode2:Episode {id: {episode2}})
                    MERGE (episode1)-[similarity:SIMILAR_TO]-(episode2)
                    ON CREATE SET similarity.score = {similarityScore}
                    """, {'episode1': episode_id, 'episode2': other_episode_id, 'similarityScore': similarity_score})
 
>>> session.close()

The episode_mapping dictionary is needed to map from episode ids to indices e.g. episode 1 is at index 0.

If we want to find the most similar pair of episodes in Game of Thrones we can execute the following query:

MATCH (episode1:Episode)-[similarity:SIMILAR_TO]-(episode2:Episode)
WHERE ID(episode1) > ID(episode2)
RETURN "S" + episode1.season + "E" + episode1.number AS ep1, 
       "S" + episode2.season + "E" + episode2.number AS ep2, 
       similarity.score AS score
ORDER BY similarity.score DESC
LIMIT 10
 
╒═════╤════╤══════════════════╕
│ep1  │ep2 │score             │
╞═════╪════╪══════════════════╡
│S1E2 │S1E1│0.6963730592072543│
├─────┼────┼──────────────────┤
│S1E4 │S1E3│0.6914173051223086│
├─────┼────┼──────────────────┤
│S1E9 │S1E8│0.6869464497590777│
├─────┼────┼──────────────────┤
│S2E10│S2E8│0.6869037302955034│
├─────┼────┼──────────────────┤
│S3E7 │S3E6│0.6819943394704735│
├─────┼────┼──────────────────┤
│S2E7 │S2E6│0.6813598225089799│
├─────┼────┼──────────────────┤
│S1E10│S1E9│0.6796436827080401│
├─────┼────┼──────────────────┤
│S1E5 │S1E4│0.6698105143372364│
├─────┼────┼──────────────────┤
│S1E10│S1E8│0.6624062584864754│
├─────┼────┼──────────────────┤
│S4E5 │S4E4│0.6518358737330705│
└─────┴────┴──────────────────┘

And the least similar?

MATCH (episode1:Episode)-[similarity:SIMILAR_TO]-(episode2:Episode)
WHERE ID(episode1) > ID(episode2)
RETURN "S" + episode1.season + "E" + episode1.number AS ep1, 
       "S" + episode2.season + "E" + episode2.number AS ep2, 
       similarity.score AS score
ORDER BY similarity.score
LIMIT 10
 
╒════╤════╤═══════════════════╕
│ep1 │ep2 │score              │
╞════╪════╪═══════════════════╡
│S4E9│S1E5│0                  │
├────┼────┼───────────────────┤
│S4E9│S1E6│0                  │
├────┼────┼───────────────────┤
│S4E9│S4E2│0                  │
├────┼────┼───────────────────┤
│S4E9│S2E9│0                  │
├────┼────┼───────────────────┤
│S4E9│S2E4│0                  │
├────┼────┼───────────────────┤
│S5E6│S4E9│0                  │
├────┼────┼───────────────────┤
│S6E8│S4E9│0                  │
├────┼────┼───────────────────┤
│S4E9│S4E6│0                  │
├────┼────┼───────────────────┤
│S3E9│S2E9│0.03181423814878889│
├────┼────┼───────────────────┤
│S4E9│S1E1│0.03582871819500093│
└────┴────┴───────────────────┘

The output of this query suggests that there are no common characters between 8 pairs of episodes, which at first glance sounds surprising. Let’s write a query to check that finding:

MATCH (episode1:Episode)<-[:APPEARED_IN]-(character)-[:APPEARED_IN]->(episode2:Episode)
WHERE episode1.season = 4 AND episode1.number = 9 AND episode2.season = 1 AND episode2.number = 5
return episode1, episode2
 
(no changes, no rows)

It’s possible I made a mistake with the scraping of the data but from a quick look over the Wiki page I don’t think I have. I found it interesting that Season 4 Episode 9 shows up on 9 of the top 10 least similar pairs of episodes.

Next I’m going to cluster the episodes based on character appearances, but this post is long enough already so that’ll have to wait for another post another day.

Categories: Programming

Python: matplotlib/seaborn/virtualenv – Python is not installed as a framework

Sun, 08/14/2016 - 19:56

Over the weekend I was following The Marketing Technologist’s content based recommender tutorial but ran into the following exception when trying to import the seaborn library:

$ python 5_content_based_recommender/run.py 
Traceback (most recent call last):
  File "5_content_based_recommender/run.py", line 14, in <module>
    import seaborn as sns
  File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/seaborn/__init__.py", line 6, in <module>
    from .rcmod import *
  File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/seaborn/rcmod.py", line 8, in <module>
    from . import palettes, _orig_rc_params
  File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/seaborn/palettes.py", line 12, in <module>
    from .utils import desaturate, set_hls_values, get_color_cycle
  File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/seaborn/utils.py", line 12, in <module>
    import matplotlib.pyplot as plt
  File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/matplotlib/pyplot.py", line 114, in <module>
    _backend_mod, new_figure_manager, draw_if_interactive, _show = pylab_setup()
  File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/matplotlib/backends/__init__.py", line 32, in pylab_setup
    globals(),locals(),[backend_name],0)
  File "/Users/markneedham/projects/themarketingtechnologist/tmt/lib/python2.7/site-packages/matplotlib/backends/backend_macosx.py", line 24, in <module>
    from matplotlib.backends import _macosx
RuntimeError: Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends. If you are Working with Matplotlib in a virtual enviroment see 'Working with Matplotlib in Virtual environments' in the Matplotlib FAQ

We can see from the stacktrace that seaborn calls matplotlib so that’s where the problem lies. There’s even a page on the matplotlib website suggesting some workarounds.

I’ve come across this error before and been unable to get any of the suggestions to work, but this time I was successful. I needed to create the following function in my bash profile file:


~/.bash_profile

function frameworkpython {
    # inside a virtualenv: run the system (framework) python, but point
    # PYTHONHOME at the virtualenv so its site-packages are picked up
    if [[ ! -z "$VIRTUAL_ENV" ]]; then
        PYTHONHOME=$VIRTUAL_ENV /usr/bin/python "$@"
    else
        # otherwise fall back to the system python directly
        /usr/bin/python "$@"
    fi
}

And call that function instead of my virtualenv’s python:

$ frameworkpython 5_content_based_recommender/run.py
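
If the function isn’t picked up straight away, reloading the profile in the current shell first does the trick:

$ source ~/.bash_profile
$ frameworkpython 5_content_based_recommender/run.py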

This time the matplotlib visualisation works:

[Screenshot: the matplotlib visualisation rendering correctly]

#win

Categories: Programming