Mark Needham
Thoughts on Software Development

R: Removing for loops

Sun, 04/19/2015 - 00:53

In my last blog post I showed the translation of a likelihood function from Think Bayes into R and in my first attempt at this function I used a couple of nested for loops.

likelihoods = function(names, mixes, observations) {
  scores = rep(1, length(names))
  names(scores) = names

  for(name in names) {
    for(observation in observations) {
      scores[name] = scores[name] * mixes[[name]][observation]
    }
  }
  return(scores)
}
Names = c("Bowl 1", "Bowl 2")
 
bowl1Mix = c(0.75, 0.25)
names(bowl1Mix) = c("vanilla", "chocolate")
bowl2Mix = c(0.5, 0.5)
names(bowl2Mix) = c("vanilla", "chocolate")
Mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix)
Mixes
 
Observations = c("vanilla", "vanilla", "vanilla", "chocolate")
l = likelihoods(Names, Mixes, Observations)
 
> l / sum(l)
  Bowl 1   Bowl 2 
0.627907 0.372093

We pass in a vector of bowls, a nested dictionary describing the mixes of cookies in each bowl and the observations that we've made. The function tells us there's an almost 2/3 probability of the cookies coming from Bowl 1 and just over 1/3 probability of them coming from Bowl 2.
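
To make those numbers concrete, here's the underlying arithmetic as a one-liner – with equal priors the posterior for Bowl 1 is just its likelihood for the observations divided by the sum of both likelihoods:

# 3 vanilla + 1 chocolate: likelihood for each bowl, then normalise
(0.75^3 * 0.25) / ((0.75^3 * 0.25) + (0.5^3 * 0.5))
# [1] 0.627907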

In this case there probably won’t be much of a performance improvement by getting rid of the loops but we should be able to write something that’s more concise and hopefully idiomatic.

Let's start by getting rid of the inner for loop. That can be replaced by a call to the Reduce function like so:

likelihoods2 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names

  for(name in names) {
    scores[name] = Reduce(function(acc, observation) acc * mixes[[name]][observation], observations, 1)
  }
  return(scores)
}
l2 = likelihoods2(Names, Mixes, Observations)
 
> l2 / sum(l2)
  Bowl 1   Bowl 2 
0.627907 0.372093

So that’s good, we’ve still got the same probabilities as before. Now to get rid of the outer for loop. The Map function helps us out here:

likelihoods3 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names

  scores = Map(function(name)
    Reduce(function(acc, observation) acc * mixes[[name]][observation], observations, 1),
    names)

  return(scores)
}
 
l3 = likelihoods3(Names, Mixes, Observations)
> l3
$`Bowl 1`
  vanilla 
0.1054688 
 
$`Bowl 2`
vanilla 
 0.0625

We end up with a list instead of a vector which we need to fix by using the unlist function:

likelihoods3 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names

  scores = Map(function(name)
    Reduce(function(acc, observation) acc * mixes[[name]][observation], observations, 1),
    names)

  return(unlist(scores))
}
 
l3 = likelihoods3(Names, Mixes, Observations)
 
> l3 / sum(l3)
Bowl 1.vanilla Bowl 2.vanilla 
      0.627907       0.372093

Now we just have this annoying ‘vanilla’ in the name. That’s fixed easily enough:

likelihoods3 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names

  scores = Map(function(name)
    Reduce(function(acc, observation) acc * mixes[[name]][observation], observations, 1),
    names)

  result = unlist(scores)
  names(result) = names

  return(result)
}
 
l3 = likelihoods3(Names, Mixes, Observations)
 
> l3 / sum(l3)
  Bowl 1   Bowl 2 
0.627907 0.372093

A slightly cleaner alternative makes use of the sapply function:

likelihoods3 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names

  scores = sapply(names, function(name)
    Reduce(function(acc, observation) acc * mixes[[name]][observation], observations, 1))
  names(scores) = names

  return(scores)
}
 
l3 = likelihoods3(Names, Mixes, Observations)
 
> l3 / sum(l3)
  Bowl 1   Bowl 2 
0.627907 0.372093

That's the best I've got for now but I wonder if we could write a version of this using matrix operations somehow – but that's for next time!
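
As a rough sketch of what that might look like (just one possibility, not the worked-through version): store the mixes as a matrix, count how many times each flavour was observed and the whole calculation becomes a single matrix product on the log scale. mixMatrix and obsCounts are names introduced purely for illustration.

mixMatrix = rbind("Bowl 1" = c(vanilla = 0.75, chocolate = 0.25),
                  "Bowl 2" = c(vanilla = 0.5,  chocolate = 0.5))

# how many times each flavour was observed, in the same order as the columns
obsCounts = table(factor(Observations, levels = colnames(mixMatrix)))

# prod(p ^ count) for each bowl, computed as exp(counts dotted with log probabilities)
l4 = exp(log(mixMatrix) %*% as.numeric(obsCounts))[, 1]

l4 / sum(l4)
#   Bowl 1   Bowl 2
# 0.627907 0.372093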

Categories: Programming

R: Think Bayes – More posterior probability calculations

Thu, 04/16/2015 - 21:57

As I mentioned in a post last week I've been reading through Think Bayes and translating some of the examples from Python to R.

After my first post, Antonios suggested a more idiomatic way of writing the function in R, so I thought I'd give it a try and calculate the probability that different combinations of cookies had come from each bowl.

In the simplest case we have this function which takes in the names of the bowls and the likelihood scores:

f = function(names,likelihoods) {
  # Assume each option has an equal prior
  priors = rep(1, length(names)) / length(names)
 
  # create a data frame with all info you have
  dt = data.frame(names,priors,likelihoods)
 
  # calculate posterior probabilities
  dt$post = dt$priors*dt$likelihoods / sum(dt$priors*dt$likelihoods)
 
  # specify what you want the function to return
  list(names=dt$names, priors=dt$priors, likelihoods=dt$likelihoods, posteriors=dt$post)  
}

We assume a prior probability of 0.5 for each bowl.

Given the following probabilities of different cookies being in each bowl…

mixes = {
  'Bowl 1':dict(vanilla=0.75, chocolate=0.25),
  'Bowl 2':dict(vanilla=0.5, chocolate=0.5),
}

…we can simulate taking one vanilla cookie with the following parameters:

Likelihoods = c(0.75,0.5)
Names = c("Bowl 1", "Bowl 2")
res=f(Names,Likelihoods)
 
> res$posteriors[res$names == "Bowl 1"]
[1] 0.6
> res$posteriors[res$names == "Bowl 2"]
[1] 0.4

If we want to simulate taking 3 vanilla cookies and 1 chocolate one we’d have the following:

Likelihoods = c((0.75 ** 3) * (0.25 ** 1), (0.5 ** 3) * (0.5 ** 1))
Names = c("Bowl 1", "Bowl 2")
res=f(Names,Likelihoods)
 
> res$posteriors[res$names == "Bowl 1"]
[1] 0.627907
> res$posteriors[res$names == "Bowl 2"]
[1] 0.372093

That’s a bit clunky and the intent of ‘3 vanilla cookies and 1 chocolate’ has been lost. I decided to refactor the code to take in a vector of cookies and calculate the likelihoods internally.

First we need to create a data structure to store the mixes of cookies in each bowl that we defined above. It turns out we can do this using a nested list:

bowl1Mix = c(0.75, 0.25)
names(bowl1Mix) = c("vanilla", "chocolate")
bowl2Mix = c(0.5, 0.5)
names(bowl2Mix) = c("vanilla", "chocolate")
Mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix)
 
> Mixes
$`Bowl 1`
  vanilla chocolate 
     0.75      0.25 
 
$`Bowl 2`
  vanilla chocolate 
      0.5       0.5
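
With the mixes stored this way, the likelihood of a set of observations for a single bowl is just the product of the matching probabilities, e.g. for the observations we'll use below:

Observations = c("vanilla", "vanilla", "vanilla", "chocolate")

prod(Mixes[["Bowl 1"]][Observations])   # 0.75^3 * 0.25 = 0.1054688
prod(Mixes[["Bowl 2"]][Observations])   # 0.5^3  * 0.5  = 0.0625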

Now let’s tweak our function to take in observations rather than likelihoods and then calculate those likelihoods internally:

likelihoods = function(names, mixes, observations) {
  scores = rep(1, length(names))
  names(scores) = names

  for(name in names) {
    for(observation in observations) {
      scores[name] = scores[name] * mixes[[name]][observation]
    }
  }
  return(scores)
}

f = function(names, mixes, observations) {
  # Assume each option has an equal prior
  priors = rep(1, length(names)) / length(names)

  # create a data frame with all info you have
  dt = data.frame(names, priors)

  dt$likelihoods = likelihoods(names, mixes, observations)

  # calculate posterior probabilities
  dt$post = dt$priors*dt$likelihoods / sum(dt$priors*dt$likelihoods)

  # specify what you want the function to return
  list(names=dt$names, priors=dt$priors, likelihoods=dt$likelihoods, posteriors=dt$post)
}

And if we call that function:

Names = c("Bowl 1", "Bowl 2")
 
bowl1Mix = c(0.75, 0.25)
names(bowl1Mix) = c("vanilla", "chocolate")
bowl2Mix = c(0.5, 0.5)
names(bowl2Mix) = c("vanilla", "chocolate")
Mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix)
Mixes
 
Observations = c("vanilla", "vanilla", "vanilla", "chocolate")
 
res=f(Names,Mixes,Observations)
 
> res$posteriors[res$names == "Bowl 1"]
[1] 0.627907
 
> res$posteriors[res$names == "Bowl 2"]
[1] 0.372093

Exactly the same result as before! #win

Categories: Programming

Spark: Generating CSV files to import into Neo4j

Tue, 04/14/2015 - 23:56

About a year ago Ian pointed me at a Chicago Crime data set which seemed like a good fit for Neo4j and after much procrastination I’ve finally got around to importing it.

The data set covers crimes committed from 2001 until now. It contains around 4 million crimes along with metadata about those crimes such as the location, type of crime and year, to name a few.

The contents of the file follow this structure:

$ head -n 10 ~/Downloads/Crimes_-_2001_to_present.csv
ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
9464711,HX114160,01/14/2014 05:00:00 AM,028XX E 80TH ST,0560,ASSAULT,SIMPLE,APARTMENT,false,true,0422,004,7,46,08A,1196652,1852516,2014,01/20/2014 12:40:05 AM,41.75017626412204,-87.55494559131228,"(41.75017626412204, -87.55494559131228)"
9460704,HX113741,01/14/2014 04:55:00 AM,091XX S JEFFERY AVE,031A,ROBBERY,ARMED: HANDGUN,SIDEWALK,false,false,0413,004,8,48,03,1191060,1844959,2014,01/18/2014 12:39:56 AM,41.729576153145636,-87.57568059471686,"(41.729576153145636, -87.57568059471686)"
9460339,HX113740,01/14/2014 04:44:00 AM,040XX W MAYPOLE AVE,1310,CRIMINAL DAMAGE,TO PROPERTY,RESIDENCE,false,true,1114,011,28,26,14,1149075,1901099,2014,01/16/2014 12:40:00 AM,41.884543798701515,-87.72803579358926,"(41.884543798701515, -87.72803579358926)"
9461467,HX114463,01/14/2014 04:43:00 AM,059XX S CICERO AVE,0820,THEFT,$500 AND UNDER,PARKING LOT/GARAGE(NON.RESID.),false,false,0813,008,13,64,06,1145661,1865031,2014,01/16/2014 12:40:00 AM,41.785633535413176,-87.74148516669783,"(41.785633535413176, -87.74148516669783)"
9460355,HX113738,01/14/2014 04:21:00 AM,070XX S PEORIA ST,0820,THEFT,$500 AND UNDER,STREET,true,false,0733,007,17,68,06,1171480,1858195,2014,01/16/2014 12:40:00 AM,41.766348042591375,-87.64702037047671,"(41.766348042591375, -87.64702037047671)"
9461140,HX113909,01/14/2014 03:17:00 AM,016XX W HUBBARD ST,0610,BURGLARY,FORCIBLE ENTRY,COMMERCIAL / BUSINESS OFFICE,false,false,1215,012,27,24,05,1165029,1903111,2014,01/16/2014 12:40:00 AM,41.889741146006095,-87.66939334853973,"(41.889741146006095, -87.66939334853973)"
9460361,HX113731,01/14/2014 03:12:00 AM,022XX S WENTWORTH AVE,0820,THEFT,$500 AND UNDER,CTA TRAIN,false,false,0914,009,25,34,06,1175363,1889525,2014,01/20/2014 12:40:05 AM,41.85223460427207,-87.63185047834335,"(41.85223460427207, -87.63185047834335)"
9461691,HX114506,01/14/2014 03:00:00 AM,087XX S COLFAX AVE,0650,BURGLARY,HOME INVASION,RESIDENCE,false,false,0423,004,7,46,05,1195052,1847362,2014,01/17/2014 12:40:17 AM,41.73607283858007,-87.56097809501115,"(41.73607283858007, -87.56097809501115)"
9461792,HX114824,01/14/2014 03:00:00 AM,012XX S CALIFORNIA BLVD,0810,THEFT,OVER $500,STREET,false,false,1023,010,28,29,06,1157929,1894034,2014,01/17/2014 12:40:17 AM,41.86498077118534,-87.69571529596696,"(41.86498077118534, -87.69571529596696)"

Since I wanted to import this into Neo4j I needed to massage the data a bit, as the neo4j-import tool expects to receive CSV files containing the nodes and relationships we want to create.


I’d been looking at Spark towards the end of last year and the pre-processing of the big initial file into smaller CSV files containing nodes and relationships seemed like a good fit.

I therefore needed to create a Spark job to do this. We’ll then pass this job to a Spark executor running locally and it will spit out CSV files.


We start by creating a Scala object with a main method that will contain our processing code. Inside that main method we’ll instantiate a Spark context:

import org.apache.spark.{SparkConf, SparkContext}
 
object GenerateCSVFiles {  
    def main(args: Array[String]) {    
        val conf = new SparkConf().setAppName("Chicago Crime Dataset")    
        val sc = new SparkContext(conf)  
    }
}

Easy enough. Next we’ll read in the CSV file. I found the easiest way to reference this was with an environment variable but perhaps there’s a more idiomatic way:

import java.io.File
import org.apache.spark.{SparkConf, SparkContext}

object GenerateCSVFiles {
  def main(args: Array[String]) {
    val crimeFile = System.getenv("CSV_FILE")

    if(crimeFile == null || !new File(crimeFile).exists()) {
      throw new RuntimeException("Cannot find CSV file [" + crimeFile + "]")
    }

    println("Using %s".format(crimeFile))

    val conf = new SparkConf().setAppName("Chicago Crime Dataset")

    val sc = new SparkContext(conf)
    val crimeData = sc.textFile(crimeFile).cache()
  }
}

The type of crimeData is RDD[String] – Spark’s way of representing the (lazily evaluated) lines of the CSV file. This also includes the header of the file so let’s write a function to get rid of that since we’ll be generating our own headers for the different files:

import org.apache.spark.rdd.RDD
 
// http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3CCAEYYnxYuEaie518ODdn-fR7VvD39d71=CgB_Dxw_4COVXgmYYQ@mail.gmail.com%3E
def dropHeader(data: RDD[String]): RDD[String] = {
  data.mapPartitionsWithIndex((idx, lines) => {
    if (idx == 0) {
      lines.drop(1)
    }
    lines
  })
}

Now we’re ready to start generating our new CSV files so we’ll write a function which parses each line and extracts the appropriate columns. I’m using Open CSV for this:

import java.io.File
import org.apache.hadoop.fs.FileUtil  // assuming Hadoop's FileUtil here for fullyDelete
import au.com.bytecode.opencsv.CSVParser
 
def generateFile(file: String, withoutHeader: RDD[String], fn: Array[String] => Array[String], header: String , distinct:Boolean = true, separator: String = ",") = {
  FileUtil.fullyDelete(new File(file))
 
  val tmpFile = "/tmp/" + System.currentTimeMillis() + "-" + file
  val rows: RDD[String] = withoutHeader.mapPartitions(lines => {
    val parser = new CSVParser(',')
    lines.map(line => {
      val columns = parser.parseLine(line)
      fn(columns).mkString(separator)
    })
  })
 
  if (distinct) rows.distinct() saveAsTextFile tmpFile else rows.saveAsTextFile(tmpFile)
}

We then call this function like this:

generateFile("/tmp/crimes.csv", withoutHeader, columns => Array(columns(0),"Crime", columns(2), columns(6)), "id:ID(Crime),:LABEL,date,description", false)

The output into ‘tmpFile’ is actually 32 ‘part files’ but I wanted to be able to merge those together into individual CSV files that were easier to work with.

I won't paste the full job here but if you want to take a look it's on github.

Now we need to submit the job to Spark. I’ve wrapped this in a script if you want to follow along but these are the contents:

./spark-1.1.0-bin-hadoop1/bin/spark-submit \
--driver-memory 5g \
--class GenerateCSVFiles \
--master local[8] \
target/scala-2.10/playground_2.10-1.0.jar \
$@

If we execute that we'll see the following output:

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Crimes_-_2001_to_present.csv
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/04/15 00:31:44 INFO SparkContext: Running Spark version 1.3.0
...
15/04/15 00:47:26 INFO TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have all completed, from pool
15/04/15 00:47:26 INFO DAGScheduler: Stage 8 (saveAsTextFile at GenerateCSVFiles.scala:51) finished in 2.702 s
15/04/15 00:47:26 INFO DAGScheduler: Job 4 finished: saveAsTextFile at GenerateCSVFiles.scala:51, took 8.715588 s
 
real	0m44.935s
user	4m2.259s
sys	0m14.159s

and these CSV files will be generated:

$ ls -alh /tmp/*.csv
-rwxrwxrwx  1 markneedham  wheel   3.0K 14 Apr 07:37 /tmp/beats.csv
-rwxrwxrwx  1 markneedham  wheel   217M 14 Apr 07:37 /tmp/crimes.csv
-rwxrwxrwx  1 markneedham  wheel    84M 14 Apr 07:37 /tmp/crimesBeats.csv
-rwxrwxrwx  1 markneedham  wheel   120M 14 Apr 07:37 /tmp/crimesPrimaryTypes.csv
-rwxrwxrwx  1 markneedham  wheel   912B 14 Apr 07:37 /tmp/primaryTypes.csv

Let's have a quick look at what they contain:

$ head -n 10 /tmp/beats.csv
id:ID(Beat),:LABEL
1135,Beat
1421,Beat
2312,Beat
1113,Beat
1014,Beat
2411,Beat
1333,Beat
2521,Beat
1652,Beat
$ head -n 10 /tmp/crimes.csv
id:ID(Crime),:LABEL,date,description
9464711,Crime,01/14/2014 05:00:00 AM,SIMPLE
9460704,Crime,01/14/2014 04:55:00 AM,ARMED: HANDGUN
9460339,Crime,01/14/2014 04:44:00 AM,TO PROPERTY
9461467,Crime,01/14/2014 04:43:00 AM,$500 AND UNDER
9460355,Crime,01/14/2014 04:21:00 AM,$500 AND UNDER
9461140,Crime,01/14/2014 03:17:00 AM,FORCIBLE ENTRY
9460361,Crime,01/14/2014 03:12:00 AM,$500 AND UNDER
9461691,Crime,01/14/2014 03:00:00 AM,HOME INVASION
9461792,Crime,01/14/2014 03:00:00 AM,OVER $500
$ head -n 10 /tmp/crimesBeats.csv
:START_ID(Crime),:END_ID(Beat),:TYPE
5896915,0733,ON_BEAT
9208776,2232,ON_BEAT
8237555,0111,ON_BEAT
6464775,0322,ON_BEAT
6468868,0411,ON_BEAT
4189649,0524,ON_BEAT
7620897,0421,ON_BEAT
7720402,0321,ON_BEAT
5053025,1115,ON_BEAT

Looking good. Let’s get them imported into Neo4j:

$ ./neo4j-community-2.2.0/bin/neo4j-import --into /tmp/my-neo --nodes /tmp/crimes.csv --nodes /tmp/beats.csv --nodes /tmp/primaryTypes.csv --relationships /tmp/crimesBeats.csv --relationships /tmp/crimesPrimaryTypes.csv
Nodes
[*>:45.76 MB/s----------------------------------|PROPERTIES(2)=============|NODE:3|v:118.05 MB/]  4M
Done in 5s 605ms
Prepare node index
[*RESOLVE:64.85 MB-----------------------------------------------------------------------------]  4M
Done in 4s 930ms
Calculate dense nodes
[>:42.33 MB/s-------------------|*PREPARE(7)===================================|CALCULATOR-----]  8M
Done in 5s 417ms
Relationships
[>:42.33 MB/s-------------|*PREPARE(7)==========================|RELATIONSHIP------------|v:44.]  8M
Done in 6s 62ms
Node --> Relationship
[*>:??-----------------------------------------------------------------------------------------]  4M
Done in 324ms
Relationship --> Relationship
[*LINK-----------------------------------------------------------------------------------------]  8M
Done in 1s 984ms
Node counts
[*>:??-----------------------------------------------------------------------------------------]  4M
Done in 360ms
Relationship counts
[*>:??-----------------------------------------------------------------------------------------]  8M
Done in 653ms
 
IMPORT DONE in 26s 517ms

Next I updated conf/neo4j-server.properties to point to my new database:

#***************************************************************
# Server configuration
#***************************************************************
 
# location of the database directory
#org.neo4j.server.database.location=data/graph.db
org.neo4j.server.database.location=/tmp/my-neo

Now I can start up Neo and start exploring the data:

$ ./neo4j-community-2.2.0/bin/neo4j start
MATCH (:Crime)-[r:CRIME_TYPE]->() 
RETURN r 
LIMIT 10
[Graph visualisation of the query results]

There are lots more relationships and entities that we could pull out of this data set – what I've done is just a start. So if you're up for some more Chicago crime exploration, the code and instructions explaining how to run it are on github.

Categories: Programming

R: Creating an object with functions to calculate conditional probability

Sun, 04/12/2015 - 08:55

I've been working through Allen Downey's Think Bayes and I thought it'd be an interesting exercise to translate some of the code from Python to R.

The first example is a simple one about conditional probability and the author creates a class 'PMF' (Probability Mass Function) to solve the following problem:

Suppose there are two bowls of cookies. Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies. Bowl 2 contains 20 of each.

Now suppose you choose one of the bowls at random and, without looking, select a cookie at random. The cookie is vanilla.

What is the probability that it came from Bowl 1?

In Python the code looks like this:

pmf = Pmf()
pmf.Set('Bowl 1', 0.5)
pmf.Set('Bowl 2', 0.5)
 
pmf.Mult('Bowl 1', 0.75)
pmf.Mult('Bowl 2', 0.5)
 
pmf.Normalize()
 
print pmf.Prob('Bowl 1')

The ‘PMF’ class is defined here.

  • ‘Set’ defines the prior probability of picking a cookie from either bowl i.e. in our case it’s random.
  • ‘Mult’ defines the likelihood of picking a vanilla biscuit from either bowl
  • ‘Normalize’ applies a normalisation so that our posterior probabilities add up to 1.

We want to create something similar in R and the actual calculation is straightforward:

pBowl1 = 0.5
pBowl2 = 0.5
 
pVanillaGivenBowl1 = 0.75
pVanillaGivenBowl2 = 0.5
 
> (pBowl1 * pVanillaGivenBowl1) / ((pBowl1 * pVanillaGivenBowl1) + (pBowl2 * pVanillaGivenBowl2))
0.6
 
> (pBowl2 * pVanillaGivenBowl2) / ((pBowl1 * pVanillaGivenBowl1) + (pBowl2 * pVanillaGivenBowl2))
0.4

The problem is we have quite a bit of duplication and it doesn’t read as cleanly as the Python version.

I'm not sure of the idiomatic way of handling this type of mutable state in R but it seems like we can achieve it using functions.

I ended up writing the following function which returns a list of other functions to call.

create.pmf = function() {
  priors <<- c()
  likelihoods <<- c()
  list(
    prior = function(option, probability) {
      l = c(probability)  
      names(l) = c(option)
      priors <<- c(priors, l)
    },
    likelihood = function(option, probability) {
      l = c(probability)  
      names(l) = c(option)
      likelihoods <<- c(likelihoods, l)
    },
    posterior = function(option) {
      names = names(priors)
      normalised = 0.0
      for(name in names) {
        normalised = normalised + (priors[name] * likelihoods[name])
      }
 
      (priors[option] * likelihoods[option]) / normalised
    }    
  )
}

I couldn’t work out how to get ‘priors’ and ‘likelihoods’ to be lexically scoped so I’ve currently got those defined as global variables. I’m using a list as a kind of dictionary following a suggestion on Stack Overflow.

The code doesn’t handle the unhappy path very well but it seems to work for the example from the book:

pmf = create.pmf()
 
pmf$prior("Bowl 1", 0.5)
pmf$prior("Bowl 2", 0.5)
 
pmf$likelihood("Bowl 1", 0.75)
pmf$likelihood("Bowl 2", 0.5)
 
> pmf$posterior("Bowl 1")
Bowl 1 
   0.6 
> pmf$posterior("Bowl 2")
Bowl 2 
   0.4
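
On the lexical scoping point, one possible alternative (just a sketch) is to make priors and likelihoods locals of create.pmf. The inner functions then close over them, so <<- updates the enclosing environment rather than a global and each pmf carries its own state:

create.pmf = function() {
  # locals of create.pmf - the inner functions below close over these
  priors = c()
  likelihoods = c()
  list(
    prior = function(option, probability) {
      l = c(probability)
      names(l) = c(option)
      priors <<- c(priors, l)       # updates the enclosing environment, not a global
    },
    likelihood = function(option, probability) {
      l = c(probability)
      names(l) = c(option)
      likelihoods <<- c(likelihoods, l)
    },
    posterior = function(option) {
      normalised = 0.0
      for(name in names(priors)) {
        normalised = normalised + (priors[name] * likelihoods[name])
      }
      (priors[option] * likelihoods[option]) / normalised
    }
  )
}

Usage stays exactly the same as above.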

How would you solve this type of problem? Is there a cleaner/better way?

Categories: Programming

R: Snakes and ladders markov chain

Thu, 04/09/2015 - 23:02

A few days ago I read a really cool blog post explaining how Markov chains can be used to model the possible state transitions in a game of snakes and ladders, a use of Markov chains I hadn’t even thought of!

While the example is very helpful for understanding the concept, my understanding of the code is that it works off the assumption that any roll of the dice that puts you on a score > 100 is a winning roll.

In the version of the game that I know you have to land exactly on 100 to win, e.g. if you're on square 98 and roll a 6 you would go forward 2 spaces to 100 and then bounce back 4 spaces to 96.

I thought it’d be a good exercise to tweak the code to cater for this:

n=100
 
# We have 6 extra columns because we want to represent throwing of the dice which results in a final square > 100
M=matrix(0,n+1,n+1+6)
rownames(M)=0:n
colnames(M)=0:(n+6)
 
# set probabilities of landing on each square assuming that there aren't any snakes or ladders
for(i in 1:6){
  diag(M[,(i+1):(i+1+n)])=1/6
}
 
# account for 'bounce back' if a dice roll leads to a final score > 100
for(i in 96:100) {
  for(c in 102:107) {
    idx = 101 - (c - 101)  
    M[i, idx] = M[i, idx] + M[i, c]
  }  
}

We can inspect the last few rows to check that the transition matrix is accurate:

> M[95:100,95:101]
 
   94        95        96        97        98        99       100
94  0 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
95  0 0.0000000 0.1666667 0.1666667 0.1666667 0.3333333 0.1666667
96  0 0.0000000 0.0000000 0.1666667 0.3333333 0.3333333 0.1666667
97  0 0.0000000 0.0000000 0.1666667 0.3333333 0.3333333 0.1666667
98  0 0.0000000 0.1666667 0.1666667 0.1666667 0.3333333 0.1666667
99  0 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667

If we’re on the 99th square (the last row) we could roll a 1 and end up on 100, a 2 and end up on 99 (1 forward, 1 back), a 3 and end up on 98 (1 forward, 2 back), a 4 and end up on 97 (1 forward, 3 back), a 5 and end up on 96 (1 forward, 4 back) or a 6 and end up on 95 (1 forward, 5 back). i.e. we can land on 95, 96, 97, 98, 99 or 100 with 1/6 probability.

If we’re on the 96th square (the 3rd row) we could roll a 1 and end up on 97, a 2 and end up on 98, a 3 and end up on 99, a 4 and end up on 100, a 5 and end up on 99 (4 forward, 1 back) or a 6 and end up on 98 (4 forward, 2 back). i.e. we can land on 97 with 1/6 probability, 98 with 2/6 probability, 99 with 2/6 probability or 100 with 1/6 probability.

We could do a similar analysis for the other squares but it seems like the probabilities are being calculated correctly.

Next we can update the matrix with the snakes and ladders. That code stays the same:

# get rid of the extra columns, we don't need them anymore
M=M[,1:(n+1)]
 
# add in the snakes and ladders
starting = c(4,9,17,20,28,40,51,54,62,64,63,71,93,95,92)
ending   = c(14,31,7,38,84,59,67,34,19,60,81,91,73,75,78)
 
for(i in 1:length(starting)) {
  # Retrieve current probabilities of landing on the starting square
  v=M[,starting[i]+1]  
  ind=which(v>0)
 
  # Set no probability of falling on the starting squares
  M[ind,starting[i]+1]=0
 
  # Move all existing probabilities to the ending squares
  M[ind,ending[i]+1]=M[ind,ending[i]+1]+v[ind]
}

We can also simplify the powermat function which is used to simulate what the board would look like after a certain number of dice rolls:

# original
powermat=function(P,h){
  Ph=P
  if(h>1) {
    for(k in 2:h) {
      Ph=Ph%*%P
    }
  }
  return(Ph)
}
 
#new 
library(expm)
powermat = function(P,h) {
  return (P %^% h)
}

initial=c(1,rep(0,n))
h = 1
> (initial%*%powermat(M,h))[1:15]
     0         1         2         3 4         5         6 7 8 9 10 11 12 13        14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
[1,] 0 0.1666667 0.1666667 0.1666667 0 0.1666667 0.1666667 0 0 0  0  0  0  0 0.1666667  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
     46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
[1,]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   0

One interesting thing I noticed is that it now seems to take way more turns on average to finish the game than when you didn’t need to score exactly 100 to win:

distrib = initial %*% M
game = rep(NA, 1000)
for(h in 1:length(game)) {
  game[h] = distrib[n+1]
  distrib = distrib %*% M
}
plot(1 - game[1:200], type="l", lwd=2, col="red",
     ylab="Probability to be still playing")

> sum(1 - game)
[1] 999
[Plot: probability of still playing after each roll]

I expected it to take longer to finish the game but not this long! I think I’ve probably made a mistake but I’m not sure where…

Update

Antonios found the mistake I'd made – when we're on the 100th square we should have a probability of 1 of staying on the 100th square, i.e. we need to update M like so:

M[101,101] = 1
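
As a quick sanity check, every row of the fixed transition matrix should now sum to 1. Here's a rough sketch of how we might verify that and also estimate the expected game length via the fundamental matrix of the absorbing chain:

# every row should sum to 1 (allowing for floating point error) - expect TRUE
all(abs(rowSums(M) - 1) < 1e-9)

# expected number of rolls from the start: with square 100 absorbing, Q is the
# transient part of M and the row sums of (I - Q)^-1 give the expected number
# of rolls before the game ends
Q = M[1:n, 1:n]
sum(solve(diag(n) - Q)[1, ])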

Now if we visualise the probability that we're still playing we get a more accurate curve:

distrib = initial %*% M
game = rep(NA, 1000)
for(h in 1:length(game)) {
  game[h] = distrib[n+1]
  distrib = distrib %*% M
}
plot(1 - game[1:200], type="l", lwd=2, col="red",
     ylab="Probability to be still playing")
[Plot: probability of still playing after each roll, with the fix applied]

Categories: Programming

Neo4j: The learning to cycle dependency graph

Tue, 04/07/2015 - 21:59

Over the past couple of weeks I've been reading about skill building and the breakdown of skills into more manageable chunks, and recently had a chance to break down the skills required to learn to cycle.

I initially sketched out the skill progression but quickly realised I had drawn a dependency graph and thought that putting it into Neo4j would simplify things.

I started out with the overall goal for cycling, which was to 'Be able to cycle through a public park':

MERGE (:Goal:Task {name: "Be able to cycle through a public park"})

This goal is easy for someone who’s already learnt to cycle but if we’re starting from scratch it’s a bit daunting so we need to break it down into a simpler skill that we can practice.

The mini algorithm that we’re going to employ for task breakdown is this:

  1. Can we do the given task now? If so, we're done.
  2. If not, break the task down into something simpler and return to 1.

One of the things to keep in mind is that we won't get the breakdown perfect the first time so we may need to change it. For a diagram drawn on a piece of paper this would be annoying but in Neo4j it's just a simple refactoring.

Going back to cycling. Since the goal isn’t yet achievable we need to break that down into something a bit easier. Let’s start with something really simple:

MERGE (task:Task {name: "Take a few steps forward while standing over the bike"})
WITH task
MATCH (goal:Goal:Task {name: "Be able to cycle through a public park"})
MERGE (goal)-[:DEPENDS_ON]->(task)

In the first line we create our new task and then we connect it to our goal which we created earlier.

[Graph visualisation]

After we’ve got the hang of walking with the bike we want to get comfortable with cycling forward a few rotations while sitting on the bike but to do that we need to be able to get the bike moving from a standing start. We might also have another step where we cycle forward while standing on the bike as that might be slightly easier.

Let’s update our graph:

// First let's get rid of the relationship between our initial task and the goal
MATCH (initialTask:Task {name: "Take a few steps forward while standing over the bike"})
MATCH (goal:Goal {name: "Be able to cycle through a public park"})
MATCH (goal)-[rel:DEPENDS_ON]->(initialTask)
DELETE rel
 
WITH initialTask, goal, ["Get bike moving from standing start", "Cycle forward while standing", "Cycle forward while sitting"] AS newTasks
 
// Create some nodes for our new tasks
UNWIND newTasks AS newTask
MERGE (t:Task {name: newTask})
WITH initialTask, goal, COLLECT(t) AS newTasks
WITH initialTask, goal, newTasks, newTasks[0] AS firstTask, newTasks[-1] AS lastTask
 
// Connect the last task to the goal
MERGE (goal)-[:DEPENDS_ON]->(lastTask)
 
// And the first task to our initial task
MERGE (firstTask)-[:DEPENDS_ON]->(initialTask)
 
// And all the tasks to each other
FOREACH(i in RANGE(0, length(newTasks) - 2) |
  FOREACH(t1 in [newTasks[i]] | FOREACH(t2 in [newTasks[i+1]] |
    MERGE (t2)-[:DEPENDS_ON]->(t1) 
)))
[Graph visualisation]

We don’t strictly need to learn how to cycle while standing up – we could just go straight from getting the bike moving to cycling forward while sitting. Let’s update the graph to reflect that:

MATCH (sitting:Task {name: "Cycle forward while sitting"})
MATCH (moving:Task {name: "Get bike moving from standing start"})
MERGE (sitting)-[:DEPENDS_ON]->(moving)

[Graph visualisation]

Once we’ve got the hang of those tasks let’s add in a few more to get us closer to our goal:

WITH [
  {skill: "Controlled stop using brakes/feet", dependsOn: "Cycle forward while sitting"},
  {skill: "Steer around stationary objects", dependsOn: "Controlled stop using brakes/feet"},
  {skill: "Steer around people", dependsOn: "Steer around stationary objects"},
  {skill: "Navigate a small circular circuit", dependsOn: "Steer around stationary objects"},
  {skill: "Navigate a loop of a section of the park", dependsOn: "Navigate a small circular circuit"},
  {skill: "Navigate a loop of a section of the park", dependsOn: "Steer around people"},
  {skill: "Be able to cycle through a public park", dependsOn: "Navigate a loop of a section of the park"}
 
] AS newTasks
 
FOREACH(newTask in newTasks |
  MERGE (t1:Task {name: newTask.skill})   
  MERGE (t2:Task {name: newTask.dependsOn})
  MERGE (t1)-[:DEPENDS_ON]->(t2)
)

Finally let’s get rid of the relationship from our goal to ‘Cycle forward while sitting’ since we’ve replaced that with some intermediate steps:

MATCH (task:Task {name: "Cycle forward while sitting"})
WITH task
MATCH (goal:Goal:Task {name: "Be able to cycle through a public park"})
MATCH (goal)-[rel:DEPENDS_ON]->(task)
DELETE rel

And here’s what the final dependency graph looks like:

[Graph visualisation: the final dependency graph]

Although I put this into Neo4j in order to visualise the dependencies we can now query the data as well. For example, let’s say I know how to cycle forward while sitting on the bike. What steps are there between me and being able to cycle around a park?

MATCH (t:Task {name: "Cycle forward while sitting"}),
      (g:Goal {name: "Be able to cycle through a public park"}),
      path = shortestpath((g)-[:DEPENDS_ON*]->(t))
RETURN path

[Graph visualisation: the shortest path between the task and the goal]

Or if we want a list of the tasks we need to do next we could restructure the query slightly:

MATCH (t:Task {name: "Cycle forward while sitting"}),
      (g:Goal {name: "Be able to cycle through a public park"}),
      path = shortestpath((t)<-[:DEPENDS_ON*]-(g))
WITH [n in nodes(path) | n.name] AS tasks
UNWIND tasks AS task
RETURN task
 
==> +--------------------------------------------+
==> | task                                       |
==> +--------------------------------------------+
==> | "Cycle forward while sitting"              |
==> | "Controlled stop using brakes/feet"        |
==> | "Steer around stationary objects"          |
==> | "Steer around people"                      |
==> | "Navigate a loop of a section of the park" |
==> | "Be able to cycle through a public park"   |
==> +--------------------------------------------+
==> 6 rows

That’s all for now but I think this is an interesting way of tracking how you’d learn a skill. I’m trying a similar approach for some statistics topics I’m learning about but I’ve found the order of tasks isn’t so linear there – interestingly much more a graph than a tree.

Categories: Programming

R: Markov Chain Wikipedia Example

Sun, 04/05/2015 - 11:07

Over the weekend I’ve been reading about Markov Chains and I thought it’d be an interesting exercise for me to translate Wikipedia’s example into R code.

But first a definition:

A Markov chain is a random process that undergoes transitions from one state to another on a state space.

It is required to possess a property that is usually characterized as “memoryless”: the probability distribution of the next state depends only on the current state and not on the sequence of events that preceded it.

That 'random process' could be moves in a Monopoly game, the next word in a sentence or, as in Wikipedia's example, the next state of the Stock Market.

The diagram below shows the probabilities of transitioning between the various states:

[Diagram: the finance Markov chain example state space, from Wikipedia]

e.g. if we’re in a Bull Market the probability of the state of the market next week being a Bull Market is 0.9, a Bear Market is 0.075 and a Stagnant Market is 0.025.

We can model the various transition probabilities as a matrix:

M = matrix(c(0.9, 0.075, 0.025, 0.15, 0.8, 0.05, 0.25, 0.25, 0.5),
          nrow = 3,
          ncol = 3,
          byrow = TRUE)
 
> M
     [,1]  [,2]  [,3]
[1,] 0.90 0.075 0.025
[2,] 0.15 0.800 0.050
[3,] 0.25 0.250 0.500

Rows/Cols 1-3 are Bull, Bear, Stagnant respectively.

Now let’s say we start with a Bear market and want to find the probability of each state in 3 weeks time.

We can do this by raising our probability/transition matrix to the third power and then multiplying the result by a vector representing the initial Bear market state.

threeIterations = (M %*% M %*% M)
 
> threeIterations
       [,1]    [,2]    [,3]
[1,] 0.7745 0.17875 0.04675
[2,] 0.3575 0.56825 0.07425
[3,] 0.4675 0.37125 0.16125
 
> c(0,1,0) %*% threeIterations
       [,1]    [,2]    [,3]
[1,] 0.3575 0.56825 0.07425

So we have a 56.825% chance of still being in a Bear Market, 35.75% chance that we’re now in a Bull Market and only a 7.425% chance of being in a stagnant market.

I found it a bit annoying having to type ‘%*% M’ multiple times but luckily the expm library allows us to apply a Matrix power operation:

install.packages("expm")
library(expm)
 
> M %^% 3
       [,1]    [,2]    [,3]
[1,] 0.7745 0.17875 0.04675
[2,] 0.3575 0.56825 0.07425
[3,] 0.4675 0.37125 0.16125

The nice thing about this function is that we can now easily see where the stock market will trend towards over a large number of weeks:

> M %^% 100
      [,1]   [,2]   [,3]
[1,] 0.625 0.3125 0.0625
[2,] 0.625 0.3125 0.0625
[3,] 0.625 0.3125 0.0625

i.e. in the long run we'll be in a bull market 62.5% of weeks, a bear market 31.25% of weeks and a stagnant market 6.25% of weeks.
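
That limiting distribution is the stationary distribution of the chain and, as a rough cross-check (a sketch, not part of the original example), we can get it without iterating at all by taking the left eigenvector of M with eigenvalue 1:

# eigen() returns eigenvalues in decreasing order of modulus, so for a
# stochastic matrix the first column of eigen(t(M))$vectors corresponds to
# eigenvalue 1; normalising it to sum to 1 gives the stationary distribution
e = eigen(t(M))
stationary = Re(e$vectors[, 1]) / sum(Re(e$vectors[, 1]))

stationary          # approximately 0.6250 0.3125 0.0625, matching M %^% 100
stationary %*% M    # applying another transition leaves it (essentially) unchanged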

Categories: Programming

How I met your mother: Story arcs

Sat, 04/04/2015 - 00:31

After weeks of playing around with various algorithms to extract story arcs in How I met your mother I’ve come to the conclusion that I don’t yet have the skills to completely automate this process so I’m going to change my approach.

The new plan is to treat the outputs of the algorithms as suggestions for possible themes but then have a manual step where I extract what I think are interesting themes in the series.

A theme can consist of a single word or a phrase and the idea is that once a story arc is identified we’ll search over the corpus and find the episodes where that phrase occurs.

We can then generate a CSV file of (story arc) -> (episodeId), store that into our HIMYM graph and use the story arc as another factor for episode similarity.

I ended up with the following script to work out which episodes contained a story arc:

#!/bin/bash
 
find_term() {
  arc=${1}
  searchTerm=${2}
  episodes=$(grep --color -iE "${searchTerm}" data/import/sentences.csv | awk -F"," '{ print $2  }' | sort | uniq)
  for episode in ${episodes}; do
    echo ${arc},${episode}
  done
}
 
find_term "Bro Code" "bro code"
find_term "Legendary" "legen(.*)ary"
find_term "Slutty Pumpkin" "slutty pumpkin"
find_term "Magician's Code" "magician's code"
find_term "Thanksgiving" "thanksgiving"
find_term "The Playbook" "playbook"
find_term "Slap Bet" "slap bet"
find_term "Wrestlers and Robots" "wrestlers"
find_term "Robin Sparkles" "sparkles"
find_term "Blue French Horn" "blue french horn"
find_term "Olive Theory" "olive"
find_term "Thank You Linus" "thank you, linus"
find_term "Have you met...?" "have you met"
find_term "Laser Tag" "laser tag"
find_term "Goliath National Bank" "goliath national bank"
find_term "Challenge Accepted" "challenge accepted"
find_term "Best Man" "best man"

If we run this script we’ll see something like the following:

$ ./scripts/arcs.sh
Bro Code,14
Bro Code,155
Bro Code,170
Bro Code,188
Bro Code,201
Bro Code,61
Bro Code,64
Legendary,117
Legendary,120
Legendary,122
Legendary,136
Legendary,137
Legendary,15
Legendary,152
Legendary,157
Legendary,162
Legendary,171
...
Best Man,208
Best Man,30
Best Man,32
Best Man,41
Best Man,42

I pulled out these themes by eyeballing the output of the following scripts:

  • TF/IDF – calculates TF/IDF scores for ngrams. This helps find important themes in the context of a single episode. I then did some manual searching to see how many of those themes existed in other episodes
  • Weighted Term Frequency – this returns a weighted term frequency for ngrams of different lengths. The weights are determined by the skewed random discrete distribution I wrote about earlier in the week. I ran it with different skews and ngram lengths.
  • Named entity extraction – this pulls out any phrases that are named entities. It mostly pulled out names of people (which I used as a stop word list in some other algorithms) but also revealed a couple of themes.
  • Topic modelling – I used mallet to extract topics across the corpus. Most of them didn’t make much sense to me but there were a few which identified themes that I recognised.

I can’t remember off the top of my head if any obvious themes have been missed so if you know HIMYM better than me let me know and I’ll try and work out why those didn’t surface.

Next I want to see how these scripts fare against some other TV shows and see how quickly I can extract themes for those. It’d also be cool if I can make the whole process a bit more automated.

Categories: Programming

Neo4j: Cypher – Building the query for a movie’s profile page

Wed, 04/01/2015 - 12:54

Yesterday I spent the day in Berlin delivering a workshop as part of the Data Science Retreat and one of the exercises we did was write a query that would pull back all the information you’d need to create the IMDB page for a movie.

Scanning the page we can see that we need to get some basic metadata including the title. Next we'll need to pull in the actors, directors, producers and finally a recommendation for some other movies the viewer might like to see.

I'd struggle to write this all in one go – it's non-trivial. However, if we break it down there are actually 5 simpler queries that we can probably write. Our final step is then to work out how to glue them all together.

Let’s get started.

If you want to follow along open up your Neo4j browser and type :play movies and import the built in data set.

We’re going to create the query for The Matrix home page so the first step is to find the node representing that movie in the database:

match (movie:Movie {title: "The Matrix"})
return movie.title
 
==> +--------------+
==> | movie.title  |
==> +--------------+
==> | "The Matrix" |
==> +--------------+
==> 1 row

Easy enough. Now let’s get back the producers:

match (movie:Movie {title: "The Matrix"})
optional match (producer)-[:PRODUCED]->(movie)
 
RETURN movie.title, COLLECT(producer.name) AS producers
 
==> +--------------------------------+
==> | movie.title  | producers       |
==> +--------------------------------+
==> | "The Matrix" | ["Joel Silver"] |
==> +--------------------------------+
==> 1 row

We’ve introduced the COLLECT function here as we want to ensure that our final result only has one row regardless of how many producers there are. COLLECT applies an implicit group by ‘movie.title’ and collects the producers for each movie (in this case just The Matrix) into an array.

We've used OPTIONAL MATCH because we still want to return a row for the query even if it has no producers. In the case that there are no producers we'd hope to see an empty array.

Now let’s write the same query to pull back the directors of the movie:

match (movie:Movie {title: "The Matrix"})
optional match (director)-[:DIRECTED]->(movie)
 
RETURN movie.title, COLLECT(director.name) AS directors
 
==> +----------------------------------------------------+
==> | movie.title  | directors                           |
==> +----------------------------------------------------+
==> | "The Matrix" | ["Lana Wachowski","Andy Wachowski"] |
==> +----------------------------------------------------+
==> 1 row

We really want to do both of these in one query so we get back a single result with 3 columns. To do that we're going to introduce the WITH clause which allows us to combine the results of traversals together.

In this case we’ll first do a traversal to get the producers, collect those into an array and then traverse out again to get the directors and collect those. This is what the query looks like:

match (movie:Movie {title: "The Matrix"})
optional match (producer)-[:PRODUCED]->(movie)
 
with movie, COLLECT(producer.name) AS producers
optional match (director)-[:DIRECTED]->(movie)
 
RETURN movie.title, producers, COLLECT(director.name) AS directors
 
==> +----------------------------------------------------------------------+
==> | movie.title  | producers       | directors                           |
==> +----------------------------------------------------------------------+
==> | "The Matrix" | ["Joel Silver"] | ["Lana Wachowski","Andy Wachowski"] |
==> +----------------------------------------------------------------------+
==> 1 row

We can follow the same pattern to return the actors:

match (movie:Movie {title: "The Matrix"})
optional match (producer)-[:PRODUCED]->(movie)
 
with movie, COLLECT(producer.name) AS producers
optional match (director)-[:DIRECTED]->(movie)
 
with movie, producers, COLLECT(director.name) AS directors
optional match (actor)-[:ACTED_IN]->(movie)
 
RETURN movie.title, COLLECT(actor.name) AS actors, producers, directors
 
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | movie.title  | actors                                                                                | producers       | directors                           |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | "The Matrix" | ["Hugo Weaving","Laurence Fishburne","Carrie-Anne Moss","Keanu Reeves","Emil Eifrem"] | ["Joel Silver"] | ["Lana Wachowski","Andy Wachowski"] |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> 1 row

So far, so good. We've got everything except the recommendation of other movies, which is a bit trickier, so we'll write it on its own first:

match (movie:Movie {title: "The Matrix"})<-[:ACTED_IN]-(actor)-[:ACTED_IN]->(otherMovie)
RETURN otherMovie, COUNT(*) AS score
ORDER BY score DESC
 
==> +---------------------------------------------------------------------------------------------------------------------------+
==> | otherMovie                                                                                                        | score |
==> +---------------------------------------------------------------------------------------------------------------------------+
==> | Node[348]{title:"The Matrix Revolutions",released:2003,tagline:"Everything that has a beginning has an end"}      | 4     |
==> | Node[347]{title:"The Matrix Reloaded",released:2003,tagline:"Free your mind"}                                     | 4     |
==> | Node[490]{title:"Something's Gotta Give",released:2003}                                                           | 1     |
==> | Node[349]{title:"The Devil's Advocate",released:1997,tagline:"Evil has its winning ways"}                         | 1     |
==> | Node[438]{title:"Johnny Mnemonic",released:1995,tagline:"The hottest data on earth. In the coolest head in town"} | 1     |
==> | Node[443]{title:"Cloud Atlas",released:2012,tagline:"Everything is connected"}                                    | 1     |
==> | Node[452]{title:"V for Vendetta",released:2006,tagline:"Freedom! Forever!"}                                       | 1     |
==> | Node[425]{title:"The Replacements",released:2000,tagline:"Pain heals, Chicks dig scars... Glory lasts forever"}   | 1     |
==> +---------------------------------------------------------------------------------------------------------------------------+
==> 8 rows

Our recommendation query finds all the actors in The Matrix and then traverses out to find other movies they’ve acted in and orders those movies based on how many of our actors appeared in them. Not surprisingly the other Matrix movies come out top.

In order to plug this into the rest of the query we need a single row to be returned i.e. our other movie suggestions need to be returned as an array rather than individual rows. Let’s do that:

match (movie:Movie {title: "The Matrix"})<-[:ACTED_IN]-(actor)-[:ACTED_IN]->(otherMovie)
 
WITH otherMovie, COUNT(*) AS score
ORDER BY score DESC
 
RETURN COLLECT({movie: otherMovie.title, score: score}) AS recommended
 
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | recommended                                                                                                                                                                                                                                                                                                                                                  |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | [{movie -> "The Matrix Revolutions", score -> 4},{movie -> "The Matrix Reloaded", score -> 4},{movie -> "Something's Gotta Give", score -> 1},{movie -> "The Devil's Advocate", score -> 1},{movie -> "Johnny Mnemonic", score -> 1},{movie -> "Cloud Atlas", score -> 1},{movie -> "V for Vendetta", score -> 1},{movie -> "The Replacements", score -> 1}] |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

We’ve introduced a WITH clause for two reasons:

  1. To ensure the order of the movies based on highest score
  2. Because we can’t do an aggregation within an aggregation i.e. COLLECT(COUNT(…)) would be an illegal operation in Cypher.

Now we’re ready to plug this recommendation query into our main one:

match (movie:Movie {title: "The Matrix"})
optional match (producer)-[:PRODUCED]->(movie)
 
with movie, COLLECT(producer.name) AS producers
optional match (director)-[:DIRECTED]->(movie)
 
with movie, producers, COLLECT(director.name) AS directors
optional match (actor)-[:ACTED_IN]->(movie)
 
WITH  movie, COLLECT(actor.name) AS actors, producers, directors
optional match (movie)<-[:ACTED_IN]-(actor)-[:ACTED_IN]->(otherMovie)
 
WITH movie, actors, producers, directors, otherMovie, COUNT(*) AS score
ORDER BY score DESC
 
RETURN movie, actors, producers, directors,  
       COLLECT({movie: otherMovie.title, score: score}) AS recommended
 
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | movie                                                                           | actors                                                                                | producers       | directors                           | recommended                                                                                                                                                                                                                                                                                                                                                  |
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | Node[338]{title:"The Matrix",released:1999,tagline:"Welcome to the Real World"} | ["Hugo Weaving","Laurence Fishburne","Carrie-Anne Moss","Keanu Reeves","Emil Eifrem"] | ["Joel Silver"] | ["Lana Wachowski","Andy Wachowski"] | [{movie -> "The Matrix Revolutions", score -> 4},{movie -> "The Matrix Reloaded", score -> 4},{movie -> "Johnny Mnemonic", score -> 1},{movie -> "The Replacements", score -> 1},{movie -> "Cloud Atlas", score -> 1},{movie -> "V for Vendetta", score -> 1},{movie -> "Something's Gotta Give", score -> 1},{movie -> "The Devil's Advocate", score -> 1}] |
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> 1 row

Voila! 4 different types of data gathered and just one query to do it all.

For the eagle-eyed Cypher specialists (Hi Michael!), you’ll have noticed a bit of duplication: we traverse out to the actors twice, once to retrieve them and once to make the movie recommendation.

We could optimise this by collecting the actors once and then using the UNWIND clause, but that’s an optimisation which I think slightly obscures the intent of the query, so I’ve left it like this for now.

Categories: Programming

Python: Creating a skewed random discrete distribution

Mon, 03/30/2015 - 23:28

I’m planning to write a variant of the TF/IDF algorithm over the HIMYM corpus which weights in favour of terms that appear in a medium number of documents, and as a prerequisite I needed a function that, given a number of documents, would return a weighting.

It should return a higher value when a term appears in a medium number of documents, i.e. if I pass in 10 I should get back a higher value than if I pass in 200, as a term that appears in 10 episodes is likely to be more interesting than one which appears in almost every episode.

I went through a few different scipy distributions but none of them did exactly what I wanted, so I ended up writing a sampling function which chooses random numbers in a range with different probabilities:

import math
import numpy as np
 
# start with a uniform distribution over the 208 possible document counts
values = range(1, 209)
probs = [1.0 / 208] * 208
 
for idx, prob in enumerate(probs):
    # rising log boost for indexes 4-19 (i.e. values 5-20)...
    if idx > 3 and idx < 20:
        probs[idx] = probs[idx] * (1 + math.log(idx + 1))
    # ...and a falling log boost for indexes 21-39 (i.e. values 22-40)
    if idx > 20 and idx < 40:
        probs[idx] = probs[idx] * (1 + math.log((40 - idx) + 1))
 
# renormalise so the probabilities sum to 1 again
probs = [p / sum(probs) for p in probs]
sample = np.random.choice(values, 1000, p=probs)
 
>>> print sample[:10]
[ 33   9  22 126  54   4  20  17  45  56]

Now let’s visualise the distribution of this sample by plotting a histogram:

import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
 
binwidth = 2
plt.hist(sample, bins=np.arange(min(sample), max(sample) + binwidth, binwidth))
plt.xlim([0, max(sample)])
plt.show()

2015 03 30 23 25 05

It’s a bit hacky but it does seem to weight the values the way I wanted. It would be nice if I could smooth it off a bit better but I’m not sure how at the moment.
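
One idea for smoothing it (just a sketch, and the centre/spread values below are guesses I’d have to tune) would be to swap the two log boosted bands for a single Gaussian shaped bump centred on the ‘medium’ document counts:

import math
import numpy as np
 
values = range(1, 209)
 
# one smooth bump centred around 20 documents instead of two log boosted bands
centre, spread = 20, 10
weights = [1 + 2 * math.exp(-((v - centre) ** 2) / (2.0 * spread ** 2)) for v in values]
 
total = sum(weights)
probs = [w / total for w in weights]
 
sample = np.random.choice(values, 1000, p=probs)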

One final thing we can do is get the count of any one of the values by using the bincount function:

>>> print np.bincount(sample)[1]
4
 
>>> print np.bincount(sample)[10]
16
 
>>> print np.bincount(sample)[206]
3

Now I need to plug this into the rest of my code and see if it actually works!

Categories: Programming

InetAddressImpl#lookupAllHostAddr slow/hangs

Sun, 03/29/2015 - 01:31

Since I upgraded to Yosemite I’ve noticed that attempts to resolve localhost on my home network have been taking ages (sometimes over a minute) so I thought I’d try and work out why.

This is what my initial /etc/hosts file looked like based on the assumption that my machine’s hostname was teetotal:

$ cat /etc/hosts
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting.  Do not change this entry.
##
127.0.0.1	localhost
255.255.255.255	broadcasthost
::1             localhost
#fe80::1%lo0	localhost
127.0.0.1	wuqour.local
127.0.0.1       teetotal

I setup a little test which replicated the problem:

import java.net.InetAddress;
import java.net.UnknownHostException;
 
public class LocalhostResolution
{
    public static void main( String[] args ) throws UnknownHostException
    {
        long start = System.currentTimeMillis();
        InetAddress localHost = InetAddress.getLocalHost();
        System.out.println(localHost);
        System.out.println(System.currentTimeMillis() - start);
    }
}

which has the following output:

Exception in thread "main" java.net.UnknownHostException: teetotal-2: teetotal-2: nodename nor servname provided, or not known
	at java.net.InetAddress.getLocalHost(InetAddress.java:1473)
	at LocalhostResolution.main(LocalhostResolution.java:9)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: java.net.UnknownHostException: teetotal-2: nodename nor servname provided, or not known
	at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
	at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
	at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
	... 6 more

Somehow my hostname had changed to teetotal-2, so I added the following entry to /etc/hosts:

127.0.0.1	teetotal-2

Now if we run the program we see this output instead:

teetotal-2/127.0.0.1
5011

It’s still taking 5 seconds to resolve, which is much longer than I’d expect. After stepping through the code with a debugger it seems like it’s trying to do an IPv6 resolution rather than IPv4, so I added an /etc/hosts entry for that too:

::1             teetotal-2

Now resolution is much quicker:

teetotal-2/127.0.0.1
6
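
If you want to check the same lookup outside of the JVM, a couple of lines of Python will do the job (timings will obviously vary):

import socket
import time
 
start = time.time()
hostname = socket.gethostname()            # should now be 'teetotal-2'
address = socket.gethostbyname(hostname)   # roughly the same lookup InetAddress.getLocalHost() ends up doing
print "%s/%s resolved in %.0fms" % (hostname, address, (time.time() - start) * 1000)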

Happy days!

Categories: Programming

Neo4j: Generating real time recommendations with Cypher

Fri, 03/27/2015 - 07:59

One of the most common uses of Neo4j is for building real time recommendation engines and a common theme is that they make use of lots of different bits of data to come up with an interesting recommendation.

For example in this video Amanda shows how dating websites build real time recommendation engines by starting with social connections and then introducing passions, location and a few other things.

GraphAware have a neat framework that helps you build your own recommendation engine using Java and I was curious what a Cypher version would look like.

This is the sample graph:

CREATE
    (m:Person:Male {name:'Michal', age:30}),
    (d:Person:Female {name:'Daniela', age:20}),
    (v:Person:Male {name:'Vince', age:40}),
    (a:Person:Male {name:'Adam', age:30}),
    (l:Person:Female {name:'Luanne', age:25}),
    (c:Person:Male {name:'Christophe', age:60}),
 
    (lon:City {name:'London'}),
    (mum:City {name:'Mumbai'}),
 
    (m)-[:FRIEND_OF]->(d),
    (m)-[:FRIEND_OF]->(l),
    (m)-[:FRIEND_OF]->(a),
    (m)-[:FRIEND_OF]->(v),
    (d)-[:FRIEND_OF]->(v),
    (c)-[:FRIEND_OF]->(v),
    (d)-[:LIVES_IN]->(lon),
    (v)-[:LIVES_IN]->(lon),
    (m)-[:LIVES_IN]->(lon),
    (l)-[:LIVES_IN]->(mum);

We want to recommend some potential friends to ‘Adam’, so the first layer of our query is to find his friends of friends, as there are bound to be some potential friends amongst them:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
RETURN me, potentialFriend, COUNT(*) AS friendsInCommon
 
==> +--------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | friendsInCommon |
==> +--------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 1               |
==> +--------------------------------------------------------------------------------------+
==> 3 rows

This query gives us back a list of potential friends and how many friends we have in common.

Now that we’ve got some potential friends let’s start building a ranking for each of them. One indicator which could weigh in favour of a potential friend is if they live in the same location as us so let’s add that to our query:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
RETURN  me,
        potentialFriend,
        SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation
 
==> +-----------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation |
==> +-----------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            |
==> +-----------------------------------------------------------------------------------+
==> 3 rows

Next we’ll check whether Adam’s potential friends have the same gender as he does by comparing the labels on each node. We’ve got ‘Male’ and ‘Female’ labels which indicate gender.

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
RETURN  me,
        potentialFriend,
        SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
        LABELS(me) = LABELS(potentialFriend) AS gender
 
==> +--------------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation | gender |
==> +--------------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            | true   |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            | false  |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            | false  |
==> +--------------------------------------------------------------------------------------------+
==> 3 rows

Next up, let’s calculate the age difference between Adam and his potential friends:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
RETURN me,
       potentialFriend,
       SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
       abs( me.age - potentialFriend.age) AS ageDifference,
       LABELS(me) = LABELS(potentialFriend) AS gender,
       friendsInCommon
 
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation | ageDifference | gender | friendsInCommon |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            | 10.0          | true   | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            | 10.0          | false  | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            | 5.0           | false  | 1               |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> 3 rows

Now let’s do some filtering to get rid of people that Adam is already friends with – there wouldn’t be much point in recommending those people!

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
WITH me,
     potentialFriend,
     SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
     abs( me.age - potentialFriend.age) AS ageDifference,
     LABELS(me) = LABELS(potentialFriend) AS gender,
     friendsInCommon
 
WHERE NOT (me)-[:FRIEND_OF]-(potentialFriend)
 
RETURN me,
       potentialFriend,
       SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
       abs( me.age - potentialFriend.age) AS ageDifference,
       LABELS(me) = LABELS(potentialFriend) AS gender,
       friendsInCommon
 
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | me                             | potentialFriend                   | sameLocation | ageDifference | gender | friendsInCommon |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | Node[1007]{name:"Adam",age:30} | Node[1006]{name:"Vince",age:40}   | 0            | 10.0          | true   | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1005]{name:"Daniela",age:20} | 0            | 10.0          | false  | 1               |
==> | Node[1007]{name:"Adam",age:30} | Node[1008]{name:"Luanne",age:25}  | 0            | 5.0           | false  | 1               |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> 3 rows

In this case we haven’t actually filtered anyone out but for some of the other people we would see a reduction in the number of potential friends.

Our final step is to come up with a score for each of the features that we’ve identified as being important for making a friend suggestion.

We’ll assign a score of 10 if a potential friend lives in the same location or has the same gender as Adam, and 0 if not. For ageDifference and friendsInCommon we’ll apply a log curve so that those values don’t have a disproportionate effect on our final score. We’ll use the formula defined in the ParetoScoreTransformer to do this:

    public <OUT> float transform(OUT item, float score) {
        if (score < minimumThreshold) {
            return 0;
        }
 
        double alpha = Math.log((double) 5) / eightyPercentLevel;
        double exp = Math.exp(-alpha * score);
        return new Double(maxScore * (1 - exp)).floatValue();
    }
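
Before translating that into Cypher it’s worth sanity checking the formula with the parameters we’re about to use (maxScore 100 and eightyPercentLevel 10 for friendsInCommon, maxScore 10 and eightyPercentLevel 20 for ageDifference):

import math
 
def pareto(score, max_score, eighty_percent_level, minimum_threshold=0):
    # the same shape as the transform method above
    if score < minimum_threshold:
        return 0
    alpha = math.log(5.0) / eighty_percent_level
    return max_score * (1 - math.exp(-alpha * score))
 
print pareto(1, 100, 10)    # 1 friend in common -> ~14.87
print pareto(10, 10, 20)    # 10 year age gap    -> ~5.53 (we subtract this in the query below)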

And now for our completed recommendation query:

MATCH (me:Person {name: "Adam"})
MATCH (me)-[:FRIEND_OF]-()-[:FRIEND_OF]-(potentialFriend)
 
WITH me, potentialFriend, COUNT(*) AS friendsInCommon
 
WITH me,
     potentialFriend,
     SIZE((potentialFriend)-[:LIVES_IN]->()<-[:LIVES_IN]-(me)) AS sameLocation,
     abs( me.age - potentialFriend.age) AS ageDifference,
     LABELS(me) = LABELS(potentialFriend) AS gender,
     friendsInCommon
 
WHERE NOT (me)-[:FRIEND_OF]-(potentialFriend)
 
WITH potentialFriend,
       // 100 -> maxScore, 10 -> eightyPercentLevel, friendsInCommon -> score (from the formula above)
       100 * (1 - exp((-1.0 * (log(5.0) / 10)) * friendsInCommon)) AS friendsInCommon,
       sameLocation * 10 AS sameLocation,
       -1 * (10 * (1 - exp((-1.0 * (log(5.0) / 20)) * ageDifference))) AS ageDifference,
       CASE WHEN gender THEN 10 ELSE 0 END as sameGender
 
RETURN potentialFriend,
      {friendsInCommon: friendsInCommon,
       sameLocation: sameLocation,
       ageDifference:ageDifference,
       sameGender: sameGender} AS parts,
     friendsInCommon + sameLocation + ageDifference + sameGender AS score
ORDER BY score DESC
 
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | potentialFriend                   | parts                                                                                                           | score             |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | Node[1006]{name:"Vince",age:40}   | {friendsInCommon -> 14.86600774792154, sameLocation -> 0, ageDifference -> -5.52786404500042, sameGender -> 10} | 19.33814370292112 |
==> | Node[1008]{name:"Luanne",age:25}  | {friendsInCommon -> 14.86600774792154, sameLocation -> 0, ageDifference -> -3.312596950235779, sameGender -> 0} | 11.55341079768576 |
==> | Node[1005]{name:"Daniela",age:20} | {friendsInCommon -> 14.86600774792154, sameLocation -> 0, ageDifference -> -5.52786404500042, sameGender -> 0}  | 9.33814370292112  |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The final query isn’t too bad – the only really complex bit is the log curve calculation. This is where user defined functions will come into their own in the future.

The nice thing about this approach is that we don’t have to step outside of Cypher, so if you’re not comfortable with Java you can still do real time recommendations! On the other hand, the different parts of the recommendation engine all get mixed up, so it’s not as easy to see the whole pipeline as it is with the GraphAware framework.

The next step is to apply this to the Twitter graph and come up with follower recommendations on there.

Categories: Programming

Python: matplotlib hangs and shows nothing (Mac OS X)

Thu, 03/26/2015 - 01:02

I’ve been playing around with some of the matplotlib demos recently and discovered that simply copying one of the examples didn’t actually work for me.

I was following the bar chart example and had the following code:

import numpy as np
import matplotlib.pyplot as plt
 
N = 5
ind = np.arange(N)
fig, ax = plt.subplots()
menMeans = (20, 35, 30, 35, 27)
menStd =   (2, 3, 4, 1, 2)
width = 0.35       # the width of the bars
rects1 = ax.bar(ind, menMeans, width, color='r', yerr=menStd)
 
plt.show()

When I execute this script from the command line it just hangs and I don’t see anything at all.

Via a combination of different blog posts (which all suggested different things!) I ended up with the following variation of imports which seems to do the job:

import numpy as np
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
 
N = 5
ind = np.arange(N)
fig, ax = plt.subplots()
menMeans = (20, 35, 30, 35, 27)
menStd =   (2, 3, 4, 1, 2)
width = 0.35       # the width of the bars
rects1 = ax.bar(ind, menMeans, width, color='r', yerr=menStd)
 
plt.show()

If I run this script a Python window pops up and contains the following image which is what I expected to happen in the first place!

2015 03 25 23 56 08

The thing to notice is that we’ve had to change the backend in order to use matplotlib from the shell:

With the TkAgg backend, which uses the Tkinter user interface toolkit, you can use matplotlib from an arbitrary non-gui python shell.
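
If you’re not sure which backend you’re currently running, you can check it like this:

import matplotlib
print matplotlib.get_backend()    # e.g. 'MacOSX' or 'TkAgg'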

Current state: Wishing for ggplot!

Categories: Programming

Topic Modelling: Working out the optimal number of topics

Tue, 03/24/2015 - 23:33

In my continued exploration of topic modelling I came across The Programming Historian blog and a post showing how to derive topics from a corpus using the Java library mallet.

The instructions on the blog make it very easy to get up and running but as with other libraries I’ve used, you have to specify how many topics the corpus consists of. I’m never sure what value to select but the authors make the following suggestion:

How do you know the number of topics to search for? Is there a natural number of topics? What we have found is that one has to run the train-topics with varying numbers of topics to see how the composition file breaks down. If we end up with the majority of our original texts all in a very limited number of topics, then we take that as a signal that we need to increase the number of topics; the settings were too coarse.

There are computational ways of searching for this, including using MALLETs hlda command, but for the reader of this tutorial, it is probably just quicker to cycle through a number of iterations (but for more see Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Science, 101, 5228-5235).

Since I haven’t yet had the time to dive into the paper or explore how to use the appropriate option in mallet I thought I’d do some variations on the stop words and number of topics and see how that panned out.

As I understand it, the idea is to try and get a reasonably uniform spread of documents across topics, i.e. we don’t want all the documents to have the same dominant topic, otherwise any topic similarity calculations we run won’t be that interesting.

I tried running mallet with 10, 15, 20 and 30 topics and also varied the stop words used. I had one version which just stripped out the main characters and the word ‘narrator’, and another where I stripped out the top 20% of words by occurrence and any words that appeared fewer than 10 times.

The reason for doing this was that it should identify interesting phrases across episodes better than TF/IDF can while not just selecting the most popular words across the whole corpus.

I used mallet from the command line and ran it in two parts.

  1. Generate the model
  2. Work out the allocation of topics and documents based on hyper parameters

I wrote a script to help me out:

#!/bin/sh
 
train_model() {
  ./mallet-2.0.7/bin/mallet import-dir \
    --input mallet-2.0.7/sample-data/himym \
    --output ${2} \
    --keep-sequence \
    --remove-stopwords \
    --extra-stopwords ${1}
}
 
extract_topics() {
  ./mallet-2.0.7/bin/mallet train-topics \
    --input ${2} --num-topics ${1} \
    --optimize-interval 20 \
    --output-state himym-topic-state.gz \
    --output-topic-keys output/himym_${1}_${3}_keys.txt \
    --output-doc-topics output/himym_${1}_${3}_composition.txt
}
 
train_model "stop_words.txt" "output/himym.mallet"
train_model "main-words-stop.txt" "output/himym.main.words.stop.mallet"
 
extract_topics 10 "output/himym.mallet" "all.stop.words"
extract_topics 15 "output/himym.mallet" "all.stop.words"
extract_topics 20 "output/himym.mallet" "all.stop.words"
extract_topics 30 "output/himym.mallet" "all.stop.words"
 
extract_topics 10 "output/himym.main.words.stop.mallet" "main.stop.words"
extract_topics 15 "output/himym.main.words.stop.mallet" "main.stop.words"
extract_topics 20 "output/himym.main.words.stop.mallet" "main.stop.words"
extract_topics 30 "output/himym.main.words.stop.mallet" "main.stop.words"

As you can see, this script first imports the text files in ‘mallet-2.0.7/sample-data/himym’ (one file per episode of HIMYM) into two mallet files, one for each set of stop words. We then use those to train differently sized topic models.

The output is two files; one containing a list of topics and another describing what percentage of the words in each document come from each topic.

$ cat output/himym_10_all.stop.words_keys.txt
 
0	0.08929	back brad natalie loretta monkey show call classroom mitch put brunch betty give shelly tyler interview cigarette mc laren
1	0.05256	zoey jerry arthur back randy arcadian gael simon blauman blitz call boats becky appartment amy gary made steve boat
2	0.06338	back claudia trudy doug int abby call carl stuart voix rachel stacy jenkins cindy vo katie waitress holly front
3	0.06792	tony wendy royce back jersey jed waitress bluntly lucy made subtitle film curt mosley put laura baggage officer bell
4	0.21609	back give patrice put find show made bilson nick call sam shannon appartment fire robots top basketball wrestlers jinx
5	0.07385	blah bob back thanksgiving ericksen maggie judy pj valentine amanda made call mickey marcus give put dishes juice int
6	0.04638	druthers karen back jen punchy jeanette lewis show jim give pr dah made cougar call jessica sparkles find glitter
7	0.05751	nora mike pete scooter back magazine tiffany cootes garrison kevin halloween henrietta pumpkin slutty made call bottles gruber give
8	0.07321	ranjit back sandy mary burger call find mall moby heather give goat truck made put duck found stangel penelope
9	0.31692	back give call made find put move found quinn part ten original side ellen chicago italy locket mine show
$ head -n 10 output/himym_10_all.stop.words_composition.txt
#doc name topic proportion ...
0	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/1.txt	0	0.70961794636687	9	0.1294699168584466	8	0.07950442338871108	2	0.07192178481473664	4	0.008360809510263838	5	2.7862560133367015E-4	3	2.562409242784946E-4	7	2.1697378721335337E-4	1	1.982849604752168E-4	6	1.749937876710496E-4
1	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/10.txt	2	0.9811551470820473	9	0.016716882136209997	4	6.794128563082893E-4	0	2.807350575301132E-4	5	2.3219634098530471E-4	8	2.3018997315244256E-4	3	2.1354177341696056E-4	7	1.8081798384467614E-4	1	1.6524340216541808E-4	6	1.4583339433951297E-4
2	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/100.txt	2	0.724061485807234	4	0.13624729774423758	0	0.13546964196228636	9	0.0019436342339785994	5	4.5291919356563914E-4	8	4.490055982996677E-4	3	4.1653183421485213E-4	7	3.5270123154213927E-4	1	3.2232165301666123E-4	6	2.8446074162457316E-4
3	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/101.txt	2	0.7815231689893246	0	0.14798271520316794	9	0.023582384458063092	8	0.022251052243582908	1	0.022138209217973336	4	0.0011804626661380394	5	4.0343527385745457E-4	3	3.7102343418895774E-4	7	3.1416667687862693E-4	6	2.533818368250992E-
4	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/102.txt	6	0.6448245189567259	4	0.18612146979166502	3	0.16624873439661025	9	0.0012233726722317548	0	3.4467218590717303E-4	5	2.850788252495599E-4	8	2.8261550915084904E-4	2	2.446611421432842E-4	7	2.2199909869250053E-4	1	2.028774216237081E-
5	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/103.txt	8	0.7531586740033047	5	0.17839539108961253	0	0.06512376460651902	9	0.001282794040111701	4	8.746645156304241E-4	3	2.749100345664577E-4	2	2.5654476523149865E-4	7	2.327819863700214E-4	1	2.1273153572848481E-4	6	1.8774342292520802E-4
6	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/104.txt	7	0.9489502365148181	8	0.030091466847852504	4	0.017936457663121977	9	0.0013482824985091328	0	3.7986419553884905E-4	5	3.141861834124008E-4	3	2.889445824352445E-4	2	2.6964174000656E-4	1	2.2359178288566958E-4	6	1.9732799141958482E-4
7	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/105.txt	8	0.7339694064061175	7	0.1237041841318045	9	0.11889696041555338	0	0.02005288536233353	4	0.0014026751618923005	5	4.793786828705149E-4	3	4.408655780020889E-4	2	4.1141370625324785E-4	1	3.411516484151411E-4	6	3.0107890675777946E-4
8	file:/Users/markneedham/projects/mallet/mallet-2.0.7/sample-data/himym/106.txt	5	0.37064909999661005	9	0.3613559917055785	0	0.14857567731040344	6	0.09545466082502917	4	0.022300625744661403	8	3.8725629469313333E-4	3	3.592484711785775E-4	2	3.3524900189121E-4	7	3.041961449432886E-4	1	2.779945050112539E-4

The output is a bit tricky to understand on its own so I did a bit of post processing using pandas and then ran the results of that through matplotlib to see the distribution of documents for different topic sizes with different stop words. You can see the script here.
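
Roughly speaking, the post processing boils down to: for each document take the topic with the highest proportion, then count documents per dominant topic. A stripped down sketch of that idea (not the exact script, and the column layout is assumed from the composition output above):

import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
import pandas as pd
 
rows = []
with open("output/himym_10_all.stop.words_composition.txt", "r") as file:
    file.next()  # skip the '#doc name topic proportion ...' header
    for line in file:
        parts = line.strip().split("\t")
        # columns 2 onwards are alternating (topic, proportion) pairs
        pairs = zip(parts[2::2], parts[3::2])
        top_topic, proportion = max(pairs, key=lambda (topic, prop): float(prop))
        rows.append({"doc": parts[1], "topic": int(top_topic), "proportion": float(proportion)})
 
df = pd.DataFrame(rows)
df["topic"].value_counts().sort_index().plot(kind="bar")  # number of documents per dominant topic
plt.show()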

I ended up with the following chart:

2015 03 24 22 08 48

On the left hand side we’re using more stop words and on the right just the main ones. For most of the variations there are one or two topics which most documents belong to, but interestingly the most uniform distribution seems to be when we have few topics.

These are the main words for the most popular topics on the left hand side:

15 topics

8       0.50732 back give call made put find found part move show side ten mine top abby front fire full fianc

20 topics

12      0.61545 back give call made put find show found part move side mine top front ten full cry fire fianc

30 topics

22      0.713   back call give made put find show part found side move front ten full top mine fire cry bottom

All contain more or less the same words, which at first glance seem quite generic, so I’m surprised they weren’t excluded.

On the right hand side we haven’t removed many words so we’d expect common words in the English language to dominate. Let’s see if they do:

10 topics

1       3.79451 don yeah ll hey ve back time guys good gonna love god night wait uh thing guy great make

15 topics

5       2.81543 good time love ll great man guy ve night make girl day back wait god life yeah years thing
 
10      1.52295 don yeah hey gonna uh guys didn back ve ll um kids give wow doesn thing totally god fine

20 topics

1       3.06732 good time love wait great man make day back ve god life years thought big give apartment people work
 
13      1.68795 don yeah hey gonna ll uh guys night didn back ve girl um kids wow guy kind thing baby

30 topics

14      1.42509 don yeah hey gonna uh guys didn back ve um thing ll kids wow time doesn totally kind wasn
 
24      2.19053 guy love man girl wait god ll back great yeah day call night people guys years home room phone
 
29      1.84685 good make ve ll stop time made nice put feel love friends big long talk baby thought things happy

Again we have similar words across each run and as expected they are all quite generic words.

My take away from this exploration is that I should vary the stop word percentages as well and see if that leads to an improved distribution.

Taking out very common words like we do with the left hand side charts seems to make sense although I need to work out why there’s a single outlier in each group.

The authors suggest that having the majority of our texts in a small number of topics means we need to create more of them so I will investigate that too.

The code is all on github along with the transcripts so give it a try and let me know what you think.

Categories: Programming

Python: Equivalent to flatMap for flattening an array of arrays

Mon, 03/23/2015 - 01:45

I found myself wanting to flatten an array of arrays while writing some Python code earlier this afternoon and, being lazy, my first attempt involved building the flattened array manually:

episodes = [
    {"id": 1, "topics": [1,2,3]},
    {"id": 2, "topics": [4,5,6]}
]
 
flattened_episodes = []
for episode in episodes:
    for topic in episode["topics"]:
        flattened_episodes.append({"id": episode["id"], "topic": topic})
 
for episode in flattened_episodes:
    print episode

If we run that we’ll see this output:

$ python flatten.py
 
{'topic': 1, 'id': 1}
{'topic': 2, 'id': 1}
{'topic': 3, 'id': 1}
{'topic': 4, 'id': 2}
{'topic': 5, 'id': 2}
{'topic': 6, 'id': 2}

What I was really looking for was the Python equivalent to the flatmap function which I learnt can be achieved in Python with a list comprehension like so:

flattened_episodes = [{"id": episode["id"], "topic": topic}
                      for episode in episodes
                      for topic in episode["topics"]]
 
for episode in flattened_episodes:
    print episode

We could also choose to use itertools in which case we’d have the following code:

from itertools import chain, imap
flattened_episodes = chain.from_iterable(
                        imap(lambda episode: [{"id": episode["id"], "topic": topic}
                                             for topic in episode["topics"]],
                             episodes))
for episode in flattened_episodes:
    print episode

We can then simplify this approach a little by wrapping it up in a ‘flatmap’ function:

def flatmap(f, items):
    return chain.from_iterable(imap(f, items))
 
flattened_episodes = flatmap(
    lambda episode: [{"id": episode["id"], "topic": topic} for topic in episode["topics"]], episodes)
 
for episode in flattened_episodes:
    print episode

I think the list comprehension approach works best here but I need to look into itertools more – it looks like it could work well for other list operations.
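
One thing to watch out for: imap doesn’t exist in Python 3 because the built-in map is already lazy there, so the Python 3 version of the helper would just be:

from itertools import chain
 
def flatmap(f, items):
    # in Python 3 map already returns a lazy iterator so imap isn't needed
    return chain.from_iterable(map(f, items))
 
flattened_episodes = flatmap(
    lambda episode: [{"id": episode["id"], "topic": topic} for topic in episode["topics"]],
    episodes)
 
for episode in flattened_episodes:
    print(episode)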

Categories: Programming

Python: Simplifying the creation of a stop word list with defaultdict

Sun, 03/22/2015 - 02:51

I’ve been playing around with topic models again and recently read a paper by David Mimno which suggested the following heuristic for working out which words should go onto the stop list:

A good heuristic for identifying such words is to remove those that occur in more than 5-10% of documents (most common) and those that occur fewer than 5-10 times in the entire corpus (least common).

I decided to try this out on the HIMYM dataset that I’ve been working on over the last couple of months.

I started out with the following code to build a dictionary of words, their total occurrences and the episodes they’d been used in:

import csv
from sklearn.feature_extraction.text import CountVectorizer
from collections import defaultdict
 
episodes = defaultdict(str)
with open("sentences.csv", "r") as file:
    reader = csv.reader(file, delimiter = ",")
    reader.next()
    for row in reader:
        episodes[row[1]] += row[4]
 
vectorizer = CountVectorizer(analyzer='word', min_df = 0, stop_words = 'english')
matrix = vectorizer.fit_transform(episodes.values())
features = vectorizer.get_feature_names()
 
words = {}
for doc_id, doc in enumerate(matrix.todense()):
    for word_id, score in enumerate(doc.tolist()[0]):
        word = features[word_id]
        if not words.get(word):
            words[word] = {}
 
        if not words[word].get("score"):
            words[word]["score"] = 0
        words[word]["score"] += score
 
        if not words[word].get("episodes"):
            words[word]["episodes"] = set()
 
        if score > 0:
            words[word]["episodes"].add(doc_id)

This works fine, but the code inside the last for block is ugly: most of it is handling the case where parts of a dictionary aren’t yet initialised, which is defaultdict territory. You’ll notice I’m using defaultdict in the first part of the code but not yet in the second, as I’d struggled to get it working there.

This was my first attempt to make the ‘words’ variable based on it:

>>> words = defaultdict({})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: first argument must be callable

We can see why this doesn’t work if we try to evaluate ‘{}’ as a function which is what defaultdict does internally:

>>> {}()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'dict' object is not callable

Instead what we need to do is pass in ‘dict’:

>>> dict()
{}
 
>>> words = defaultdict(dict)
 
>>> words
defaultdict(<type 'dict'>, {})

That simplifies the first bit of the loop:

words = defaultdict(dict)
for doc_id, doc in enumerate(matrix.todense()):
    for word_id, score in enumerate(doc.tolist()[0]):
        word = features[word_id]
        if not words[word].get("score"):
            words[word]["score"] = 0
        words[word]["score"] += score
 
        if not words[word].get("episodes"):
            words[word]["episodes"] = set()
 
        if score > 0:
            words[word]["episodes"].add(doc_id)

We’ve still got a couple of other places to simplify though which we can do by defining a custom function and passing that into defaultdict:

def default_dict_function():
   return {"score": 0, "episodes": set()}
 
>>> words = defaultdict(default_dict_function)
 
>>> words
defaultdict(<function default_dict_function at 0x10963fcf8>, {})
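
A lambda does the same job if you’d rather not define a named function:

>>> words = defaultdict(lambda: {"score": 0, "episodes": set()})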

And here’s the final product:

def default_dict_function():
   return {"score": 0, "episodes": set()}
words = defaultdict(default_dict_function)
 
for doc_id, doc in enumerate(matrix.todense()):
    for word_id, score in enumerate(doc.tolist()[0]):
        word = features[word_id]
        words[word]["score"] += score
        if score > 0:
            words[word]["episodes"].add(doc_id)

After this we can write out the words to our stop list:

with open("stop_words.txt", "w") as file:
    writer = csv.writer(file, delimiter = ",")
    for word, value in words.iteritems():
        # appears in > 10% of episodes
        if len(value["episodes"]) > int(len(episodes) / 10):
            writer.writerow([word.encode('utf-8')])
 
        # less than 10 occurences
        if value["score"] < 10:
            writer.writerow([word.encode('utf-8')])
Categories: Programming

Python: Forgetting to use enumerate

Sun, 03/22/2015 - 02:28

Earlier this evening I found myself writing the equivalent of the following Python code while building a stop list for a topic model…

words = ["mark", "neo4j", "michael"]
word_position = 0
for word in words:
   print word_position, word
   word_position +=1

…which is very foolish given that there’s already a function that makes it really easy to grab the position of an item in a list:

for word_position, word in enumerate(words):
   print word_position, word

Python does make things extremely easy at times – you’re welcome future Mark!

Categories: Programming

Badass: Making users awesome – Kathy Sierra: Book Review

Fri, 03/20/2015 - 08:30

I started reading Kathy Sierra’s new book ‘Badass: Making users awesome‘ a couple of weeks ago and with the gift of flights to/from Stockholm this week I’ve got through the rest of it.

I really enjoyed the book and have found myself returning to it almost every day to check up exactly what was said on a particular topic.

There were a few things that I’ve taken away and have been going on about to anyone who will listen.

2015 03 20 06 52 51

Paraphrasing, ‘help users acquire skills, don’t throw knowledge at them.’ I found this advice helpful both for my own learning of new things as well as for thinking about how to help users of Neo4j get up and running faster.

Whether we’re doing a talk, workshop or online training, the goal isn’t to teach the user a bunch of information/facts but rather to help them learn skills which they can use to achieve their ‘compelling context‘.

Having said that, it’s very easy to fall into the information/facts trap as that type of information is much easier to prepare and present. You don’t have to spend much time thinking about how the user is going to use it; rather, you hope that if you throw enough information at them some of it will stick.

A user’s compelling context is the problem they’re trying to solve, regardless of the tools they use to solve it. The repeated example of this is a camera: we don’t buy a camera because we want to buy a camera, we buy it because we want to take great photographs.

2015 03 17 23 49 25

There’s a really interesting section in the middle of the book which talks about expert performance and skill acquisition and how we can achieve this through deliberate practice.

My main take away here is that we have only mastered a skill once we can perform it with 95% reliability within one to three practice sessions of 45-90 minutes.

If we can’t achieve this then the typical reaction is to either give up or keep trying to achieve the goal for many more hours. Neither of these is considered a useful approach.

Instead we should realise that if we can’t do the skill it’s probably because there’s a small sub skill that we need to master first. So our next step is to break this skill down into its components, master those and then try the original skill again.

Amy Hoy’s ‘doing it backwards’ guide is very helpful for doing the skill breakdown as it makes you ask the question ‘can I do it tomorrow?’ and, if not, work out what else you need to do (learn) first.

I’ve been trying to apply this approach to my machine learning adventures which most recently has involved various topic modelling attempts on a How I met your mother data set.

I’d heard good things about the MALLET open source library but, having never used it before, I sketched out the goals/skills I wanted to achieve:

Extract topics for HIMYM corpus ->
Train a topic model with mallet ->
Tweak an existing topic model that uses mallet ->
Run an existing topic model that uses mallet -> 
Install mallet
2015 03 20 00 11 48

The idea is that you then start from the last action and work your way back up the chain – it should also act as a nice deterrent for yak shaving.

While learning about mallet I came across several more articles on topic modelling that I should read. While these don’t directly contribute to learning a skill, I think they will give me good background and help me pick up some of the intuition behind topic modelling.

My take away about gaining knowledge of a skill is that when we’re getting started we should spend more time on practical work rather than only reading; once we get more into it we’ll naturally become more curious and do the background reading. I often find myself reading non-stop about things but never completely understanding them because I don’t go hands-on, so this was a good reminder.

One of the next things I’m working on is a similar skill break down for people learning Neo4j and then we’ll look to apply this to make our meetup sessions more effective – should be fun!

The other awesome thing about this book is that I’ve come away with a bunch of other books to read as well.

In summary, if learning is your thing get yourself a copy of the book and read it over a few times – so many great tips, I’ve only covered a few.

Categories: Programming

Neo4j: Detecting potential typos using EXPLAIN

Tue, 03/17/2015 - 23:46

I’ve been running a few intro to Neo4j training sessions recently using Neo4j 2.2.0 RC1 and at some stage in every session somebody will make a typo when writing out one of the example queries.

For example, one of the queries that we do about half way through finds the actors and directors who have worked together and aggregates the movies they were in.

This is the correct query:

MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5

which should yield the following results:

==> +-----------------------------------------------------------------------------------------------------------------------+
==> | actor.name           | director.name    | movies                                                                      |
==> +-----------------------------------------------------------------------------------------------------------------------+
==> | "Hugo Weaving"       | "Andy Wachowski" | ["Cloud Atlas","The Matrix Revolutions","The Matrix Reloaded","The Matrix"] |
==> | "Hugo Weaving"       | "Lana Wachowski" | ["Cloud Atlas","The Matrix Revolutions","The Matrix Reloaded","The Matrix"] |
==> | "Laurence Fishburne" | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> | "Keanu Reeves"       | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> | "Carrie-Anne Moss"   | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> +-----------------------------------------------------------------------------------------------------------------------+

However, a common typo is to write ‘DIRECTED_IN’ instead of ‘DIRECTED’ in which case we’ll see no results:

MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED_IN]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5
 
==> +-------------------------------------+
==> | actor.name | director.name | movies |
==> +-------------------------------------+
==> +-------------------------------------+
==> 0 row

It’s not immediately obvious why we aren’t seeing any results, which can be quite frustrating.

However, in Neo4j 2.2 the ‘EXPLAIN’ keyword has been introduced and we can use this to see what the query planner thinks of the query we want to execute without actually executing it.

Instead the planner makes use of knowledge that it has about our schema to come up with a plan that it would run and how much of the graph it thinks that plan would touch:

EXPLAIN MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED_IN]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5

2015 03 17 23 39 55

The first row of the query plan describes an all nodes scan, which tells us that the query will start from the ‘director’, but it’s the second row that’s interesting.

The estimated number of rows when expanding the ‘DIRECTED_IN’ relationship is 0, whereas we’d expect it to be a positive value if there were any instances of that relationship in the database.

If we compare this to the plan generated when using the proper ‘DIRECTED’ relationship we can see the difference:

2015 03 17 23 43 11

Here we see an estimated 44 rows from expanding the ‘DIRECTED’ relationship so we know there are at least some nodes connected by that relationship type.

In summary, if you find your query not returning anything when you expect it to, prefix it with ‘EXPLAIN’ and make sure you’re not seeing the dreaded ‘0 estimated rows’.

Categories: Programming