Mark Needham
Thoughts on Software Development

Python: Joining multiple generators/iterators

Mon, 05/25/2015 - 00:51

In my previous blog post I described how I’d refactored some scraping code I’ve been working on to use iterators and ended up with a function which returned a generator containing all the events for one BBC live text match:

match_id = "32683310"
events = extract_events("data/raw/%s" % (match_id))
 
>>> print type(events)
<type 'generator'>

The next thing I wanted to do was get the events for multiple matches, which meant I needed to glue multiple generators together into one big generator.

itertools’ chain function does exactly what we want:

itertools.chain(*iterables)

Make an iterator that returns elements from the first iterable until it is exhausted, then proceeds to the next iterable, until all of the iterables are exhausted. Used for treating consecutive sequences as a single sequence.

First let’s try it out on a collection of range generators:

import itertools
gens = [(n*2 for n in range(0, 3)), (n*2 for n in range(4,7))]
>>> gens
[<generator object <genexpr> at 0x10ff3b140>, <generator object <genexpr> at 0x10ff7d870>]
 
output = itertools.chain()
for gen in gens:
  output = itertools.chain(output, gen)

Now if we iterate through ‘output’ we’d expect to see the doubled values from both ranges – 0, 2 and 4 from the first generator, followed by 8, 10 and 12 from the second:

>>> for item in output:
...   print item
...
0
2
4
8
10
12

Exactly as we expected! Our scraping code looks like this once we plug the chaining in:

matches = ["32683310", "32683303", "32384894", "31816155"]
 
raw_events = itertools.chain()
for match_id in matches:
    raw_events = itertools.chain(raw_events, extract_events("data/raw/%s" % (match_id)))

‘raw_events’ now contains a single generator that we can iterate through and process the events for all matches.
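As an aside, itertools can do the whole job in one call – chain.from_iterable lazily flattens an iterable of iterables, so the re-wrapping loop above isn’t needed:

import itertools
 
matches = ["32683310", "32683303", "32384894", "31816155"]
 
# each extract_events call only happens once chain reaches that match
raw_events = itertools.chain.from_iterable(
    extract_events("data/raw/%s" % match_id) for match_id in matches)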

Categories: Programming

Python: Refactoring to iterator

Sat, 05/23/2015 - 11:14

Over the last week I’ve been building a set of scripts to scrape the events from the Bayern Munich/Barcelona game and I’ve ended up with a few hundred lines of nested for statements, if statements and mutated lists. I thought it was about time I did a bit of refactoring.

The following is a function which takes in a match file and spits out a collection of maps containing times & events.

import bs4
import re
from bs4 import BeautifulSoup
from soupselect import select
 
def extract_events(file):
    match = open(file, 'r')
    soup = BeautifulSoup(match.read())
 
    all_events = []
    for event in select(soup, 'div#live-text-commentary-wrapper div#live-text'):
        for child in event.children:
            if type(child) is bs4.element.Tag:
                all_events.append(child.getText().strip())
 
    for event in select(soup, 'div#live-text-commentary-wrapper div#more-live-text'):
        for child in event.children:
            if type(child) is bs4.element.Tag:
                all_events.append(child.getText().strip())
 
    timed_events = []
    for i in range(0, len(all_events)):
        event = all_events[i]
        time =  re.findall("\d{1,2}:\d{2}", event)
        formatted_time = " +".join(time)
        if time:
            timed_events.append({'time': formatted_time, 'event': all_events[i+1]})
    return timed_events

We call it like this:

match_id = "32683310"
for event in extract_events("data/%s" % (match_id))[:10]:
    print event

The file we’re loading is the Bayern Munich vs Barcelona match HTML file which I have saved locally. After we’ve read that into Beautiful Soup we locate the two divs on the page which contain the match events.

We then iterate over that list and create a new list containing (time, event) pairs which we return.

I think we should be able to get to our resulting collection without persisting an intermediate list, but first things first – let’s remove the duplicated for loops:

def extract_events(file):
    match = open(file, 'r')
    soup = BeautifulSoup(match.read())
 
    all_events = []
    events = select(soup, 'div#live-text-commentary-wrapper div#live-text')
    more_events = select(soup, 'div#live-text-commentary-wrapper div#more-live-text')
 
    for event in events + more_events:
        for child in event.children:
            if type(child) is bs4.element.Tag:
                all_events.append(child.getText().strip())
 
    timed_events = []
    for i in range(0, len(all_events)):
        event = all_events[i]
        time =  re.findall("\d{1,2}:\d{2}", event)
        formatted_time = " +".join(time)
        if time:
            timed_events.append({'time': formatted_time, 'event': all_events[i+1]})
    return timed_events

The next step is to refactor towards using an iterator. After a bit of reading I realised a generator would make life even easier.

I created a function which returned an iterator of the raw events and plugged that into the original function:

def raw_events(file):
    match = open(file, 'r')
    soup = BeautifulSoup(match.read())
    events = select(soup, 'div#live-text-commentary-wrapper div#live-text')
    more_events = select(soup, 'div#live-text-commentary-wrapper div#more-live-text')
    for event in events + more_events:
        for child in event.children:
            if type(child) is bs4.element.Tag:
                yield child.getText().strip()
 
def extract_events(file):
    all_events = list(raw_events(file))
 
    timed_events = []
    for i in range(0, len(all_events)):
        event = all_events[i]
        time =  re.findall("\d{1,2}:\d{2}", event)
        formatted_time = " +".join(time)
        if time:
            timed_events.append({'time': formatted_time, 'event': all_events[i+1]})
    return timed_events

If we run that function we still get the same output as before which is good. Now we need to work out how to clean up the second bit of the code which groups the appropriate rows together.

The goal is that ‘extract_events’ returns an iterator rather than a list – we need to figure out how to iterate over the output of ‘raw_events’ in such a way that when we find a ‘time row’ we can yield that and the row immediately after.

Luckily I found a Stack Overflow post explaining that you can use the ‘next’ function inside an iterator to achieve this:

def extract_events(file):
    events = raw_events(file)
    for event in events:
        time =  re.findall("\d{1,2}:\d{2}", event)
        formatted_time = " +".join(time)
        if time:
            yield {'time': formatted_time, 'event': next(events)}

It’s not that much less code than the original function but I think it’s an improvement. Any thoughts/tips to simplify it further are always welcome.
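One thing to watch out for: now that ‘extract_events’ returns a generator, the ‘[:10]’ slice used in the earlier calling code will no longer work since generators can’t be sliced. itertools.islice gives us the equivalent behaviour:

import itertools
 
match_id = "32683310"
 
# lazily take the first 10 events from the generator
for event in itertools.islice(extract_events("data/%s" % match_id), 10):
    print event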

Categories: Programming

Python: UnicodeEncodeError: ‘ascii’ codec can’t encode character u’\xfc’ in position 11: ordinal not in range(128)

Thu, 05/21/2015 - 07:14

I’ve been trying to write some Python code to extract the players and the team they represented in the Bayern Munich/Barcelona match into a CSV file and had much more difficulty than I expected.

I have some scraping code (which is beyond the scope of this article) which gives me a list of (player, team) pairs that I want to write to disk. The contents of the list are as follows:

$ python extract_players.py
(u'Sergio Busquets', u'Barcelona')
(u'Javier Mascherano', u'Barcelona')
(u'Jordi Alba', u'Barcelona')
(u'Bastian Schweinsteiger', u'FC Bayern M\xfcnchen')
(u'Dani Alves', u'Barcelona')

I started with the following script:

with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])
 
    for player, team in players:
        print player, team, type(player), type(team)
        writer.writerow([player, team])

And if I run that I’ll see this error:

$ python extract_players.py
...
Bastian Schweinsteiger FC Bayern München <type 'unicode'> <type 'unicode'>
Traceback (most recent call last):
  File "extract_players.py", line 67, in <module>
    writer.writerow([player, team])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 11: ordinal not in range(128)

So it looks like the ‘ü’ in ‘FC Bayern München’ is causing us issues. Let’s try and encode the teams to avoid this:

with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])
 
    for player, team in players:
        print player, team, type(player), type(team)
        writer.writerow([player, team.encode("utf-8")])
$ python extract_players.py
...
Thomas Müller FC Bayern München <type 'unicode'> <type 'unicode'>
Traceback (most recent call last):
  File "extract_players.py", line 70, in <module>
    writer.writerow([player, team.encode("utf-8")])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 8: ordinal not in range(128)

Now we’ve got the same issue with the ‘ü’ in Müller so let’s encode the players too:

with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])
 
    for player, team in players:
        print player, team, type(player), type(team)
        writer.writerow([player.encode("utf-8"), team.encode("utf-8")])
$ python extract_players.py
...
Gerard Piqué Barcelona <type 'str'> <type 'unicode'>
Traceback (most recent call last):
  File "extract_players.py", line 70, in <module>
    writer.writerow([player.encode("utf-8"), team.encode("utf-8")])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)

Now we’ve got a problem with Gerard Piqué because that value has type string rather than unicode. Let’s fix that:

with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])
 
    for player, team in players:
        if isinstance(player, str):
            player = unicode(player, "utf-8")
        print player, team, type(player), type(team)
        writer.writerow([player.encode("utf-8"), team.encode("utf-8")])

Et voila! All the players are now successfully written to the file.
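The two fixes can be rolled together into a small helper – my own generalisation rather than code from the original script – which copes with values that arrive as either unicode or UTF-8 encoded byte strings:

import csv
 
def encode_value(value):
    # normalise byte strings to unicode first, then encode everything to UTF-8
    if isinstance(value, str):
        value = unicode(value, "utf-8")
    return value.encode("utf-8")
 
# 'players' is the list of (player, team) pairs from above
with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])
    for player, team in players:
        writer.writerow([encode_value(player), encode_value(team)])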

An alternative approach is to change the default encoding of the whole script to be ‘UTF-8’, like so:

# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
 
with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])
 
    for player, team in players:
        print player, team, type(player), type(team)
        writer.writerow([player, team])

It took me a while to figure it out but finally the players are ready to go!
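As a final aside, this whole class of error disappears in Python 3, where the csv module works with unicode natively as long as the file is opened with an explicit encoding:

# Python 3 version (for comparison)
import csv
 
# sample data in the shape of the list above
players = [(u'Gerard Piqué', u'Barcelona'), (u'Thomas Müller', u'FC Bayern München')]
 
with open("data/players.csv", "w", encoding="utf-8", newline="") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])
    for player, team in players:
        writer.writerow([player, team])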

Categories: Programming

Neo4j: Finding all shortest paths

Tue, 05/19/2015 - 23:45

One of the Cypher language features we show in Neo4j training courses is the shortest path function which allows you to find the shortest path in terms of number of relationships between two nodes.

Using the movie graph, which you can import via the ‘:play movies’ command in the browser, we’ll first create a ‘KNOWS’ relationship between any people that have appeared in the same movie:

MATCH (p1:Person)-[:ACTED_IN]->()<-[:ACTED_IN]-(p2:Person)
MERGE (p1)-[:KNOWS]-(p2)

Now that we’ve got that relationship we can easily find the shortest path between two people, say Tom Cruise and Tom Hanks:

MATCH (p1:Person {name: "Tom Hanks"}), (p2:Person {name: "Tom Cruise"}),
      path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
[graph visualisation of the shortest path]

That works pretty well but what if we want to find the longest shortest path between any two people in the graph? We can calculate it like this:

MATCH (p1:Person), (p2:Person),
      path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
ORDER BY LENGTH(path) DESC
LIMIT 1
[graph visualisation of the longest shortest path]

So that’s 6 hops which is actually the Bacon number – I expect we’d probably see a smaller maximum value if we imported all the movies.

And to round off the post what if we want to find the longest shortest path between the 10 people who acted in the most movies? We might start out with the following query which seems like it should do the job:

MATCH (p1:Person)-[:ACTED_IN]->()
 
WITH p1, COUNT(*) AS appearances
ORDER BY appearances DESC
LIMIT 10
 
WITH p1 AS p1, p1 AS p2
MATCH path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
ORDER BY LENGTH(path) DESC
LIMIT 1

Unfortunately if we run that query we get no rows returned because ‘p1’ and ‘p2’ always refer to the same node.

Instead we can calculate the shortest path between our hardest working people by creating a cross product using COLLECT and UNWIND:

MATCH (p1:Person)-[:ACTED_IN]->()
 
WITH p1, COUNT(*) AS appearances
ORDER BY appearances DESC
LIMIT 10
 
WITH COLLECT(p1) AS ps
UNWIND ps AS p1 UNWIND ps AS p2
MATCH path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
ORDER BY LENGTH(path) DESC
LIMIT 1

[graph visualisation of the result]
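As an aside, the UNWIND cross product also generates self pairs and both orderings of each pair. Adding a WHERE clause – my own tweak, not part of the original query – skips those; it replaces the second half of the query above:

WITH COLLECT(p1) AS ps
UNWIND ps AS p1 UNWIND ps AS p2
WITH p1, p2 WHERE ID(p1) < ID(p2)
MATCH path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
ORDER BY LENGTH(path) DESC
LIMIT 1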

That’s all for now!

Categories: Programming

Neo4j: Refactoring the BBC football live text fouls graph

Sun, 05/17/2015 - 12:04

Yesterday I wrote about a Neo4j graph I’ve started building which contains all the fouls committed in the Champions League game between Barcelona & Bayern Munich and the surrounding metadata.

While adding other events into the graph I realised that I’d added some duplication in the model and the model could do with some refactoring to make it easier to use.

To recap, this is the model that we designed in the previous blog post:

The duplication is on the left hand side of the model – we model a foul as being committed by one player against another and then hook the foul back into the match. By doing that we’re not using the ‘appearance’ concept which links a player and a match together.

We can make the ‘COMMITTED_IN_MATCH’ relationship redundant by connecting the foul to appearance rather than to player. The match the foul was committed in can then be found by navigating through the appearance node.

This is what we want the graph to look like:

[refactored graph model diagram]

We’ll move towards this new model in 3 steps:

  • Introduce the new structure alongside the existing one
  • Rewrite our queries to use the new structure
  • Remove the old structure
Introducing the new structure

First up let’s write a query to introduce the new structure.

match (foul:Foul)-[:COMMITTED_AGAINST]->(fouledPlayer),
      (foul)<-[:COMMITTED_FOUL]-(foulingPlayer),
      (foul)-[:COMMITTED_IN_MATCH]->(match:Match {id: "32683310"}),
      (foulingPlayer)-[:MADE_APPEARANCE]-(foulingPlayerApp)-[:IN_MATCH]->(match),
      (fouledPlayer)-[:MADE_APPEARANCE]-(fouledPlayerApp)-[:IN_MATCH]->(match)
MERGE (foul)<-[:COMMITTED_FOUL]-(foulingPlayerApp)
MERGE (foul)-[:COMMITTED_AGAINST]->(fouledPlayerApp)

Remember we’re not going to delete the old structure yet so that’s why there aren’t any delete statements in here.

Rewriting our queries

Now we need to update our queries to work against the new graph structure:

Where do the fouls happen?
match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul)
RETURN foul.location AS location, COUNT(*) as fouls
ORDER BY fouls DESC

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-()<-[]-(foul:Foul)
RETURN foul.location AS location, COUNT(*) as fouls
ORDER BY fouls DESC
Who fouls the most?
match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul:Foul)<-[:COMMITTED_FOUL]-(fouler:Player)
RETURN fouler.name AS fouler, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(appearance)-[:COMMITTED_FOUL]->(foul:Foul),
      (appearance)<-[:MADE_APPEARANCE]-(fouler)
RETURN fouler.name AS fouler, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10
Who was fouled the most?
match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul:Foul)-[:COMMITTED_AGAINST]->(fouled:Player)
RETURN fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(appearance)<-[:COMMITTED_AGAINST]-(foul:Foul),
      (appearance)<-[:MADE_APPEARANCE]-(fouled)
RETURN fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10
Who fouled who the most?
match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul:Foul)-[:COMMITTED_AGAINST]->(fouled:Player),
      (foul)<-[:COMMITTED_FOUL]-(fouler:Player)
RETURN fouler.name AS fouler, fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10

becomes

match (match:Match {id: "32683310"}),
      (match)<-[:IN_MATCH]-(fouledApp)<-[:COMMITTED_AGAINST]-(foul:Foul)<-[:COMMITTED_FOUL]-(foulerApp)-[:IN_MATCH]->(match),
      (fouledApp)<-[:MADE_APPEARANCE]-(fouled),
      (foulerApp)<-[:MADE_APPEARANCE]-(fouler)
RETURN fouler.name AS fouler, fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
Which team fouled most?
match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-()<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
RETURN team.name, COUNT(*) as fouls
ORDER BY fouls DESC

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(app:Appearance)-[:COMMITTED_FOUL]->(),
      (app)-[:FOR_TEAM]->(team)
RETURN team.name, COUNT(*) as fouls
ORDER BY fouls DESC
Worst fouler for each team
match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
WITH team, fouler, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouler:fouler, fouls:fouls})[0] AS topFouler
RETURN team.name, topFouler.fouler.name, topFouler.fouls;

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(app:Appearance)-[:COMMITTED_FOUL]->(),
      (app)-[:FOR_TEAM]->(team),
      (fouler)-[:MADE_APPEARANCE]->(app)
WITH team, fouler, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouler:fouler, fouls:fouls})[0] AS topFouler
RETURN team.name, topFouler.fouler.name, topFouler.fouls;
Most fouled against for each team
match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul)-[:COMMITTED_AGAINST]-(fouled),
      (fouled)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
WITH team, fouled, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouled:fouled, fouls:fouls})[0] AS topFouled
RETURN team.name, topFouled.fouled.name, topFouled.fouls

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(app:Appearance)<-[:COMMITTED_AGAINST]-(),
      (app)-[:FOR_TEAM]->(team),
      (fouled)-[:MADE_APPEARANCE]->(app)
WITH team, fouled, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouled:fouled, fouls:fouls})[0] AS topFouled
RETURN team.name, topFouled.fouled.name, topFouled.fouls

The early queries are made more complicated by the refactoring but the latter ones are slightly simpler. I think we need to hook some more events onto the appearance node to see whether this refactoring is worthwhile or not.

Removing the old structure

Holding judgement for now, let’s look at how we’d remove the old structure – the final step in this refactoring:

match (match:Match {id: "32683310"})<-[oldRel:COMMITTED_IN_MATCH]-(foul:Foul)
DELETE oldRel;
 
match (player:Player)<-[oldRel:COMMITTED_AGAINST]-(foul:Foul)
DELETE oldRel;
 
match (player:Player)-[oldRel:COMMITTED_FOUL]->(foul:Foul)
DELETE oldRel;
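Before running those deletes it’s worth checking that the new structure is complete. A quick sanity check – my own addition, not part of the original post – is to compare the old and new relationship counts, which should be equal:

// old structure
match (:Foul)-[r:COMMITTED_IN_MATCH]->(:Match) RETURN COUNT(r);
 
// new structure - should report the same number of fouls
match (:Appearance)-[r:COMMITTED_FOUL]->(:Foul) RETURN COUNT(r);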

Hopefully you can see how you’d go about refactoring your own graph if you realise the model isn’t quite what you want.

Any questions/thoughts/suggestions let me know!

Categories: Programming

Neo4j: BBC football live text fouls graph

Sat, 05/16/2015 - 22:13

I recently came across the Partially Derivative podcast and in episode 17 they describe how Kirk Goldsberry scraped a bunch of data about shots in basketball matches then ran some analysis on that data.

It got me thinking that we might be able to do something similar for football matches and although event based data for football matches only comes from Opta, the BBC does expose some of the events in its live text feeds.

We’ll start with the Champions League match between Barcelona and Bayern Munich from last Tuesday.


Our first task is to extract the events that happened in the match along with the players involved. After we’ve got that we’ll generate a Neo4j graph and see if we can find some interesting insights.

I find the feedback cycle with this type of work is dramatically improved if we have the source data available locally so the first step was to get the BBC web page downloaded:

$ wget http://www.bbc.co.uk/sport/0/football/32683310

Next we need to write a scraper which will extract all the events. We want to get an array containing one entry for each event, where the following is an example of an event:

[screenshot of an example event]

HTML-wise it looks like this:

[screenshot of the event’s HTML]


I do most of my scraping work in Python so I used the Beautiful Soup library with the soupselect wrapper to get the data into CSV format ready to import into Neo4j.

It was mostly a straightforward job of finding the appropriate CSS tag and pulling out the values, although the way fouls are described on the page is a bit strange – sometimes the person fouled comes on the first row and the fouler on the next line, and sometimes vice versa.

Luckily the two parts of the foul can be joined together by matching the time which made life easier.

The full code for the scraper is on github if you want to play with it.

This is what the resulting CSV file looks like:

$ head -n 10 data/events.csv 
matchId,foulId,freeKickId,time,foulLocation,fouledPlayer,fouledPlayerTeam,foulingPlayer,foulingPlayerTeam
32683310,3,2,90:00 +0:40,in the defensive half.,Xabi Alonso,FC Bayern München,Pedro,Barcelona
32683310,9,8,84:38,on the right wing.,Rafinha,FC Bayern München,Pedro,Barcelona
32683310,12,13,83:17,in the attacking half.,Lionel Messi,Barcelona,Sebastian Rode,FC Bayern München
32683310,15,14,82:43,in the defensive half.,Sebastian Rode,FC Bayern München,Neymar,Barcelona
32683310,17,18,80:41,in the attacking half.,Pedro,Barcelona,Xabi Alonso,FC Bayern München
32683310,22,23,76:31,in the defensive half.,Neymar,Barcelona,Rafinha,FC Bayern München
32683310,25,26,75:03,in the attacking half.,Lionel Messi,Barcelona,Xabi Alonso,FC Bayern München
32683310,31,30,69:37,in the attacking half.,Bastian Schweinsteiger,FC Bayern München,Dani Alves,Barcelona
32683310,36,35,63:27,in the attacking half.,Robert Lewandowski,FC Bayern München,Ivan Rakitic,Barcelona

Now it’s time to create a graph. We’ll aim to massage the data into this model:

[graph model diagram]

Next we need to write some Cypher code to get the CSV data into the graph. The full script is here, a sample of which is below:

// match
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (:Match {id: row.matchId});
 
// teams
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (:Team {name: row.foulingPlayerTeam});
 
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (:Team {name: row.fouledPlayerTeam});
 
// players
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (player:Player {id: row.foulingPlayer + "_" + row.foulingPlayerTeam})
ON CREATE SET player.name = row.foulingPlayer;
 
// appearances
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MATCH (match:Match {id: row.matchId})
MATCH (player:Player {id: row.foulingPlayer + "_" + row.foulingPlayerTeam})
MATCH (team:Team {name: row.foulingPlayerTeam})
 
MERGE (appearance:Appearance {id: player.id + " in " + row.matchId})
MERGE (player)-[:MADE_APPEARANCE]->(appearance)
MERGE (appearance)-[:IN_MATCH]->(match)
MERGE (appearance)-[:FOR_TEAM]->(team);
 
// fouls
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
 
MATCH (foulingPlayer:Player {id:row.foulingPlayer + "_" + row.foulingPlayerTeam })
MATCH (fouledPlayer:Player {id:row.fouledPlayer + "_" + row.fouledPlayerTeam })
MATCH (match:Match {id: row.matchId})
 
MERGE (foul:Foul {eventId: row.foulId})
ON CREATE SET foul.time = row.time, foul.location = row.foulLocation
 
MERGE (foul)<-[:COMMITTED_FOUL]-(foulingPlayer)
MERGE (foul)-[:COMMITTED_AGAINST]->(fouledPlayer)
MERGE (foul)-[:COMMITTED_IN_MATCH]->(match);
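One thing worth doing before the import – it isn’t part of the sample above, so treat it as a suggestion – is creating uniqueness constraints, which also index the properties that all those MERGE statements look up:

CREATE CONSTRAINT ON (m:Match) ASSERT m.id IS UNIQUE;
CREATE CONSTRAINT ON (t:Team) ASSERT t.name IS UNIQUE;
CREATE CONSTRAINT ON (p:Player) ASSERT p.id IS UNIQUE;
CREATE CONSTRAINT ON (a:Appearance) ASSERT a.id IS UNIQUE;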

We’ll use neo4j-shell to execute the script:

$ ./neo4j-community-2.2.1/bin/neo4j-shell --file import.cql

Now that we’ve got the data into Neo4j we need to come up with some questions to ask of it. I came up with the following but perhaps you can think of some others!

  • Where do the fouls happen on the pitch?
  • Who made the most fouls?
  • Who was fouled the most?
  • Who fouled who the most?
  • Which team fouled the most?
  • Who’s the worst fouler in each team?
  • Who’s the most fouled in each team?
Where do the fouls happen?
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)
RETURN foul.location AS location, COUNT(*) as fouls
ORDER BY fouls DESC;
 
+----------------------------------+
| location                 | fouls |
+----------------------------------+
| "in the defensive half." | 12    |
| "in the attacking half." | 12    |
| "on the right wing."     | 3     |
| "on the left wing."      | 3     |
+----------------------------------+
4 rows
Who fouls the most?
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler)
RETURN fouler.name AS fouler, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
 
+------------------------------+
| fouler               | fouls |
+------------------------------+
| "Rafinha"            | 4     |
| "Pedro"              | 3     |
| "Medhi Benatia"      | 3     |
| "Dani Alves"         | 3     |
| "Xabi Alonso"        | 3     |
| "Javier Mascherano"  | 2     |
| "Thiago Alcántara"   | 2     |
| "Robert Lewandowski" | 2     |
| "Sebastian Rode"     | 1     |
| "Sergio Busquets"    | 1     |
+------------------------------+
10 rows
Who was fouled the most?
// who was fouled the most
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)-[:COMMITTED_AGAINST]->(fouled)
RETURN fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
 
+----------------------------------+
| fouled                   | fouls |
+----------------------------------+
| "Robert Lewandowski"     | 4     |
| "Lionel Messi"           | 4     |
| "Neymar"                 | 3     |
| "Pedro"                  | 2     |
| "Xabi Alonso"            | 2     |
| "Andrés Iniesta"         | 2     |
| "Rafinha"                | 2     |
| "Bastian Schweinsteiger" | 2     |
| "Sebastian Rode"         | 1     |
| "Sergio Busquets"        | 1     |
+----------------------------------+
10 rows
Who fouled who the most?
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)-[:COMMITTED_AGAINST]->(fouled),
      (foul)<-[:COMMITTED_FOUL]-(fouler)
RETURN fouler.name AS fouler, fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
 
+--------------------------------------------------------+
| fouler              | fouled                   | fouls |
+--------------------------------------------------------+
| "Javier Mascherano" | "Robert Lewandowski"     | 2     |
| "Dani Alves"        | "Bastian Schweinsteiger" | 2     |
| "Xabi Alonso"       | "Lionel Messi"           | 2     |
| "Rafinha"           | "Neymar"                 | 2     |
| "Rafinha"           | "Andrés Iniesta"         | 2     |
| "Dani Alves"        | "Xabi Alonso"            | 1     |
| "Thiago Alcántara"  | "Javier Mascherano"      | 1     |
| "Pedro"             | "Juan Bernat"            | 1     |
| "Medhi Benatia"     | "Pedro"                  | 1     |
| "Neymar"            | "Sebastian Rode"         | 1     |
+--------------------------------------------------------+
10 rows
Which team fouled the most?
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
RETURN team.name, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
 
+-----------------------------+
| team.name           | fouls |
+-----------------------------+
| "FC Bayern München" | 18    |
| "Barcelona"         | 12    |
+-----------------------------+
2 rows
Worst fouler for each team?
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
WITH team, fouler, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouler:fouler, fouls:fouls})[0] AS topFouler
RETURN team.name, topFouler.fouler.name, topFouler.fouls;
 
+---------------------------------------------------------------+
| team.name           | topFouler.fouler.name | topFouler.fouls |
+---------------------------------------------------------------+
| "FC Bayern München" | "Rafinha"             | 4               |
| "Barcelona"         | "Pedro"               | 3               |
+---------------------------------------------------------------+
2 rows
Most fouled against for each team
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)-[:COMMITTED_AGAINST]-(fouled),
      (fouled)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
WITH team, fouled, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouled:fouled, fouls:fouls})[0] AS topFouled
RETURN team.name, topFouled.fouled.name, topFouled.fouls;
 
+---------------------------------------------------------------+
| team.name           | topFouled.fouled.name | topFouled.fouls |
+---------------------------------------------------------------+
| "FC Bayern München" | "Robert Lewandowski"  | 4               |
| "Barcelona"         | "Lionel Messi"        | 4               |
+---------------------------------------------------------------+
2 rows

So Bayern fouled a bit more than Barca, the main forwards for each team (Messi/Lewandowski) were the most fouled players on the pitch and the fouling was mostly in the middle of the pitch.

I expect this graph will become much more interesting to query with more matches and with the other event types as well but I haven’t got those scraped yet. The code is on github if you want to play around with it and perhaps get the other events into the graph.

Categories: Programming

R: ggplot – Displaying multiple charts with a for loop

Thu, 05/14/2015 - 01:17

Continuing with my analysis of the Neo4j London user group I wanted to drill into some individual meetups and see the makeup of the people attending those meetups with respect to the cohort they belong to.

I started by writing a function which would take in an event ID and output a bar chart showing the number of people who attended that event from each cohort.

We can work out the cohort that a member belongs to by querying for the first event they attended.

Our query for the most recent Intro to graphs session looks like this:

library(RNeo4j)
graph = startGraph("http://127.0.0.1:7474/db/data/")
 
eventId = "220750415"
query =  "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
                (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person) 
          WITH rsvp, person
          MATCH (person)-[:RSVPD]->(otherRSVP)
 
          WITH person, rsvp, otherRSVP
          ORDER BY person.id, otherRSVP.time
 
          WITH person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
          return rsvp.time, earliestRSVP.time,  person.id"
 
df = cypher(graph, query, id= eventId)
 
> df %>% sample_n(10)
      rsvp.time earliestRSVP.time person.id
18 1.430819e+12      1.392726e+12 130976662
95 1.430069e+12      1.430069e+12  10286388
79 1.429035e+12      1.429035e+12  38344282
64 1.428108e+12      1.412935e+12 153473172
73 1.429513e+12      1.398236e+12 143322942
19 1.430389e+12      1.430389e+12 129261842
37 1.429643e+12      1.327603e+12   9750821
49 1.429618e+12      1.429618e+12 184325696
69 1.430781e+12      1.404554e+12  67485912
1  1.430929e+12      1.430146e+12 185405773

We’re not doing anything too clever here, just using a couple of WITH clauses to order RSVPs so we can get the earliest one for each person.

Once we’ve done that we’ll tidy up the data frame, adding columns which identify the cohort in which each member attended their first event:

timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
 
df$time = timestampToDate(df$rsvp.time)
df$date = format(as.Date(df$time), "%Y-%m")
df$earliestTime = timestampToDate(df$earliestRSVP.time)
df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")
 
> df %>% sample_n(10)
      rsvp.time earliestRSVP.time person.id                time    date        earliestTime earliestDate
47 1.430697e+12      1.430697e+12 186893861 2015-05-03 23:47:11 2015-05 2015-05-03 23:47:11      2015-05
44 1.430924e+12      1.430924e+12 186998186 2015-05-06 14:49:44 2015-05 2015-05-06 14:49:44      2015-05
85 1.429611e+12      1.422378e+12  53761842 2015-04-21 10:13:46 2015-04 2015-01-27 16:56:02      2015-01
14 1.430125e+12      1.412690e+12   7994846 2015-04-27 09:01:58 2015-04 2014-10-07 13:57:09      2014-10
29 1.430035e+12      1.430035e+12  37719672 2015-04-26 07:59:03 2015-04 2015-04-26 07:59:03      2015-04
12 1.430855e+12      1.430855e+12 186968869 2015-05-05 19:38:10 2015-05 2015-05-05 19:38:10      2015-05
41 1.428917e+12      1.422459e+12 133623562 2015-04-13 09:20:07 2015-04 2015-01-28 15:37:40      2015-01
87 1.430927e+12      1.430927e+12 185155627 2015-05-06 15:46:59 2015-05 2015-05-06 15:46:59      2015-05
62 1.430849e+12      1.430849e+12 186965212 2015-05-05 17:56:23 2015-05 2015-05-05 17:56:23      2015-05
8  1.430237e+12      1.425567e+12 184979500 2015-04-28 15:58:23 2015-04 2015-03-05 14:45:40      2015-03

Now that we’ve got that we can group by the earliestDate cohort and then create a bar chart:

byCohort = df %>% count(earliestDate) 
 
ggplot(aes(x= earliestDate, y = n), data = byCohort) + 
    geom_bar(stat="identity", fill = "dark blue") +
    theme(axis.text.x=element_text(angle=90,hjust=1,vjust=1))

[bar chart of attendees by joining cohort]

This is good and gives us the insight that most of the members attending this version of intro to graphs just joined the group. The event was on 7th April and most people joined in March which makes sense.

Let’s see if that trend continues over the previous two years. To do this we need to create a for loop which goes over all the Intro to Graphs events and then outputs a chart for each one.

First I pulled out the code above into a function:

plotEvent = function(eventId) {
  query =  "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
                (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person) 
          WITH rsvp, person
          MATCH (person)-[:RSVPD]->(otherRSVP)
 
          WITH person, rsvp, otherRSVP
          ORDER BY person.id, otherRSVP.time
 
          WITH person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
          return rsvp.time, earliestRSVP.time,  person.id"
 
  df = cypher(graph, query, id= eventId)
  df$time = timestampToDate(df$rsvp.time)
  df$date = format(as.Date(df$time), "%Y-%m")
  df$earliestTime = timestampToDate(df$earliestRSVP.time)
  df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")
 
  byCohort = df %>% count(earliestDate) 
 
  ggplot(aes(x= earliestDate, y = n), data = byCohort) + 
    geom_bar(stat="identity", fill = "dark blue") +
    theme(axis.text.x=element_text(angle=90,hjust=1,vjust=1))
}

We’d call it like this for the Intro to graphs meetup:

> plotEvent("220750415")

Next I tweaked the code to look up all Intro to Graphs events and then loop through and output a chart for each event:

events = cypher(graph, "match (e:Event {name: 'Intro to Graphs'}) RETURN e.id ORDER BY e.time")
 
for(event in events[, 1]) {
  plotEvent(as.character(event))
}

Unfortunately that doesn’t print anything at all – ggplot objects are only rendered when they’re explicitly printed, and the automatic printing you get at the console doesn’t happen inside a for loop. We can fix this by storing our plots in a list and then displaying them afterwards:

library(gridExtra)
p = list()
for(i in 1:count(events)$n) {
  event = events[i, 1]
  p[[i]] = plotEvent(as.character(event))
}
 
do.call(grid.arrange,p)
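Alternatively – a variation rather than what the original code did – we could keep the for loop and print each plot explicitly, since the only problem was the missing print call:

for(event in events[, 1]) {
  print(plotEvent(as.character(event)))
}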

[grid of bar charts, one per Intro to Graphs event]

This visualisation is probably better without any of the axes so let’s update the function to scrap those. We’ll also add the date of the event at the top of each chart which will require a slight tweak of the query:

plotEvent = function(eventId) {
  query =  "match (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->
                (e {id: {id}})<-[:TO]-(rsvp {response: 'yes'})<-[:RSVPD]-(person) 
            WITH e,rsvp, person
            MATCH (person)-[:RSVPD]->(otherRSVP)
 
            WITH e,person, rsvp, otherRSVP
            ORDER BY person.id, otherRSVP.time
 
            WITH e, person, rsvp, COLLECT(otherRSVP)[0] AS earliestRSVP
            return rsvp.time, earliestRSVP.time,  person.id, e.time"
 
  df = cypher(graph, query, id= eventId)
  df$time = timestampToDate(df$rsvp.time)
  df$eventTime = timestampToDate(df$e.time)
  df$date = format(as.Date(df$time), "%Y-%m")
  df$earliestTime = timestampToDate(df$earliestRSVP.time)
  df$earliestDate = format(as.Date(df$earliestTime), "%Y-%m")
 
  byCohort = df %>% count(earliestDate) 
 
  ggplot(aes(x= earliestDate, y = n), data = byCohort) + 
    geom_bar(stat="identity", fill = "dark blue") +
    theme(axis.ticks = element_blank(), 
          axis.text.x = element_blank(), 
          axis.text.y = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank()) + 
    labs(title = df$eventTime[1])
}
[grid of bar charts with event dates as titles]

I think this makes it a bit easier to read although I’ve made the mistake of not having all the charts representing the same scale – one to fix for next time.

We started doing the intro to graphs sessions less frequently towards the end of last year so my hypothesis was that we’d see a range of people from different cohorts RSVPing for them but that doesn’t seem to be the case. Instead it’s very dominated by people signing up close to the event.

Categories: Programming

R: Cohort heatmap of Neo4j London meetup

Tue, 05/12/2015 - 00:16

A few months ago I had a go at doing some cohort analysis of the Neo4j London meetup group which was an interesting experiment but unfortunately resulted in a chart that was completely illegible.

I wasn’t sure how to progress from there but a few days ago I came across the cohort heatmap which seemed like a better way of visualising things over time.

The underlying idea is still the same – we’re comparing different cohorts of users against each other to see whether a change or intervention we did at a certain time had any impact.

However, the way we display the cohorts changes and I think for the better.

To recap, we start with the following data frame:

library(dplyr)
library(zoo)     # provides as.yearmon, used further down
library(ggplot2)
 
df = read.csv("/tmp/df.csv")
> df %>% sample_n(5)
        rsvp.time person.id                time    date
255  1.354277e+12  12228948 2012-11-30 12:05:08 2012-11
2475 1.407342e+12  19057581 2014-08-06 16:26:04 2014-08
3988 1.421769e+12  66122172 2015-01-20 15:58:02 2015-01
4411 1.419377e+12 165750262 2014-12-23 23:27:44 2014-12
1010 1.383057e+12  74602292 2013-10-29 14:24:32 2013-10

And we need to transform this into a data frame which is grouped by cohort (members who attended their first meetup in a particular month). The following code gets us there:

firstMeetup = df %>% 
  group_by(person.id) %>% 
  summarise(firstEvent = min(time), count = n()) %>% 
  arrange(desc(count))
firstMeetup$date = format(as.Date(firstMeetup$firstEvent), "%Y-%m")
 
countsForCohort = function(df, firstMeetup, cohort) {
  members = (firstMeetup %>% filter(date == cohort))$person.id
 
  attendance = df %>% 
    filter(person.id %in% members) %>% 
    count(person.id, date) %>% 
    ungroup() %>%
    count(date)
 
  allCohorts = df %>% select(date) %>% unique
  cohortAttendance = merge(allCohorts, attendance, by = "date", all.x = TRUE)  
  cohortAttendance[is.na(cohortAttendance) & cohortAttendance$date > cohort] = 0
  cohortAttendance %>% mutate(cohort = cohort, retention = n / length(members), members = n)  
}
 
cohorts = collect(df %>% select(date) %>% unique())[,1]
 
cohortAttendance = data.frame()
for(cohort in cohorts) {
  cohortAttendance = rbind(cohortAttendance,countsForCohort(df, firstMeetup, cohort))      
}
 
monthNumber = function(cohort, date) {
  cohortAsDate = as.yearmon(cohort)
  dateAsDate = as.yearmon(date)
 
  if(cohortAsDate > dateAsDate) {
    "NA"
  } else {
    paste(round((dateAsDate - cohortAsDate) * 12), sep="")
  }
}
 
cohortAttendanceWithMonthNumber = cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(!is.na(members)) %>%
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber)
 
> cohortAttendanceWithMonthNumber %>% head(10)
Source: local data frame [10 x 7]
Groups: row_number()
 
      date n  cohort retention members row_number() monthNumber
1  2011-06 4 2011-06      1.00       4            1           0
2  2011-07 1 2011-06      0.25       1            2           1
3  2011-08 1 2011-06      0.25       1            3           2
4  2011-09 2 2011-06      0.50       2            4           3
5  2011-10 1 2011-06      0.25       1            5           4
6  2011-11 1 2011-06      0.25       1            6           5
7  2012-01 1 2011-06      0.25       1            7           7
8  2012-04 2 2011-06      0.50       2            8          10
9  2012-05 1 2011-06      0.25       1            9          11
10 2012-06 1 2011-06      0.25       1           10          12

Now let’s create our first heatmap.

t <- max(cohortAttendanceWithMonthNumber$members)
 
cols <- c("#e7f0fa", "#c9e2f6", "#95cbee", "#0099dc", "#4ab04a", "#ffd73e", "#eec73a", "#e29421", "#e29421", "#f05336", "#ce472e")
ggplot(cohortAttendanceWithMonthNumber, aes(y=cohort, x=date, fill=members)) +
  theme_minimal() +
  geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
  scale_fill_gradientn(colours=cols, limits=c(0, t),
                       breaks=seq(0, t, by=t/4),
                       labels=c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
                       guide=guide_colourbar(ticks=T, nbin=50, barheight=.5, label=T, barwidth=10)) +
  theme(legend.position='bottom', 
        legend.direction="horizontal",
        plot.title = element_text(size=20, face="bold", vjust=2),
        axis.text.x=element_text(size=8, angle=90, hjust=.5, vjust=.5, face="plain")) +
  ggtitle("Cohort Activity Heatmap (number of members who attended event)")

[cohort heatmap – absolute number of members]

‘t’ is the maximum number of members within a cohort who attended a meetup in a given month. This makes it easy to see which cohorts started with the most members but makes it difficult to compare their retention over time.

We can fix that by showing the percentage of members in the cohort who attend each month rather than using absolute values. To do that we must first add an extra column containing the percentage values:

cohortAttendanceWithMonthNumber$retentionPercentage = ifelse(!is.na(cohortAttendanceWithMonthNumber$retention),  cohortAttendanceWithMonthNumber$retention * 100, 0)
t <- max(cohortAttendanceWithMonthNumber$retentionPercentage)
 
cols <- c("#e7f0fa", "#c9e2f6", "#95cbee", "#0099dc", "#4ab04a", "#ffd73e", "#eec73a", "#e29421", "#e29421", "#f05336", "#ce472e")
ggplot(cohortAttendanceWithMonthNumber, aes(y=cohort, x=date, fill=retentionPercentage)) +
  theme_minimal() +
  geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
  scale_fill_gradientn(colours=cols, limits=c(0, t),
                       breaks=seq(0, t, by=t/4),
                       labels=c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
                       guide=guide_colourbar(ticks=T, nbin=50, barheight=.5, label=T, barwidth=10)) +
  theme(legend.position='bottom', 
        legend.direction="horizontal",
        plot.title = element_text(size=20, face="bold", vjust=2),
        axis.text.x=element_text(size=8, angle=90, hjust=.5, vjust=.5, face="plain")) +
  ggtitle("Cohort Activity Heatmap (number of members who attended event)")

[cohort heatmap – retention percentage]

This version allows us to compare cohorts against each other but now we don’t have the exact numbers, which means earlier cohorts will look better since there are fewer people in them. We can get the best of both worlds by keeping this visualisation but showing the actual values inside each box:

t <- max(cohortAttendanceWithMonthNumber$retentionPercentage)
 
cols <- c("#e7f0fa", "#c9e2f6", "#95cbee", "#0099dc", "#4ab04a", "#ffd73e", "#eec73a", "#e29421", "#e29421", "#f05336", "#ce472e")
ggplot(cohortAttendanceWithMonthNumber, aes(y=cohort, x=date, fill=retentionPercentage)) +
  theme_minimal() +
  geom_tile(colour="white", linewidth=2, width=.9, height=.9) +
  scale_fill_gradientn(colours=cols, limits=c(0, t),
                       breaks=seq(0, t, by=t/4),
                       labels=c("0", round(t/4*1, 1), round(t/4*2, 1), round(t/4*3, 1), round(t/4*4, 1)),
                       guide=guide_colourbar(ticks=T, nbin=50, barheight=.5, label=T, barwidth=10)) +
  theme(legend.position='bottom', 
        legend.direction="horizontal",
        plot.title = element_text(size=20, face="bold", vjust=2),
        axis.text.x=element_text(size=8, angle=90, hjust=.5, vjust=.5, face="plain")) +
  ggtitle("Cohort Activity Heatmap (number of members who attended event)") + 
  geom_text(aes(label=members),size=3)

[cohort heatmap – retention percentage with member counts overlaid]

What we can learn overall is that the majority of people seem to have a passing interest and then we have a smaller percentage who will continue to come to events.

It seems like we did a better job at retaining attendees in the middle of last year – one hypothesis is that the events we ran around then were more compelling but I need to do more analysis.

Next I’m going to drill further into some of the recent events and see what cohorts the attendees came from.

Categories: Programming

R: Neo4j London meetup group – How many events do people come to?

Sat, 05/09/2015 - 23:33

Earlier this week the number of members in the Neo4j London meetup group crept over the 2,000 mark and I thought it’d be fun to re-explore the data that I previously imported into Neo4j.

How often do people come to meetups?

library(RNeo4j)
library(dplyr)
library(ggplot2)
 
graph = startGraph("http://localhost:7474/db/data/")
 
query = "MATCH (g:Group {name: 'Neo4j - London User Group'})-[:HOSTED_EVENT]->(event)<-[:TO]-({response: 'yes'})<-[:RSVPD]-(profile)-[:HAS_MEMBERSHIP]->(membership)-[:OF_GROUP]->(g)
         WHERE (event.time + event.utc_offset) < timestamp()
         RETURN event.id, event.time + event.utc_offset AS eventTime, profile.id, membership.joined"
 
df = cypher(graph, query)
 
> df %>% head()
  event.id    eventTime profile.id membership.joined
1 20616111 1.309372e+12    6436797      1.307285e+12
2 20616111 1.309372e+12   12964956      1.307275e+12
3 20616111 1.309372e+12   14533478      1.307290e+12
4 20616111 1.309372e+12   10793775      1.307705e+12
5 24528711 1.311793e+12   10793775      1.307705e+12
6 29953071 1.314815e+12   10595297      1.308154e+12
 
byEventsAttended = df %>% count(profile.id)
 
> byEventsAttended %>% sample_n(10)
Source: local data frame [10 x 2]
 
   profile.id  n
1   128137932  2
2   126947632  1
3    98733862  2
4    20468901 11
5    48293132  5
6   144764532  1
7    95259802  1
8    14524524  3
9    80611852  2
10  134907492  2

Now let’s visualise the number of people that have attended certain number of events:

ggplot(aes(x = n), data = byEventsAttended) + 
  geom_bar(binwidth = 1, fill = "Dark Blue") +
  scale_y_continuous(breaks = seq(0,750,by = 50))

[bar chart of number of events attended]

Most people come to one meetup and then there’s a long tail after that with fewer and fewer people coming to lots of meetups.

The chart has lots of blank space due to the sparseness of people on the right hand side. If we exclude any people who’ve attended more than 20 events we might get a more interesting visualisation:

ggplot(aes(x = n), data = byEventsAttended %>% filter(n <= 20)) + 
  geom_bar(binwidth = 1, fill = "Dark Blue") +
  scale_y_continuous(breaks = seq(0,750,by = 50))
[bar chart, filtered to 20 or fewer events]

Nicole suggested a more interesting visualisation would be a box plot so I decided to try that next:

ggplot(aes(x = "Attendees", y = n), data = byEventsAttended) +
  geom_boxplot(fill = "grey80", colour = "Dark Blue") +
  coord_flip()

[box plot of events attended]

This visualisation really emphasises that the majority are between 1 and 3 and it’s much less obvious how many values there are at the higher end. A quick check of the data with the summary function reveals as much:

> summary(byEventsAttended$n)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   2.000   2.837   3.000  69.000


Now to figure out how to move that box plot a bit to the right :)

Categories: Programming

Python: Selecting certain indexes in an array

Tue, 05/05/2015 - 22:39

A couple of days ago I was scraping the UK parliament constituencies from Wikipedia in preparation for the Graph Connect hackathon and had got to the point where I had an array with one entry per column in the table.

[screenshot of the Wikipedia constituencies table]

import requests
 
from bs4 import BeautifulSoup
from soupselect import select
 
page = open("constituencies.html", 'r')
soup = BeautifulSoup(page.read())
 
for row in select(soup, "table.wikitable tr"):
    if select(row, "th"):
        print [cell.text for cell in select(row, "th")]
 
    if select(row, "td"):
        print [cell.text for cell in select(row, "td")]
$ python blog.py
[u'Constituency', u'Electorate (2000)', u'Electorate (2010)', u'Largest Local Authority', u'Country of the UK']
[u'Aldershot', u'66,499', u'71,908', u'Hampshire', u'England']
[u'Aldridge-Brownhills', u'58,695', u'59,506', u'West Midlands', u'England']
[u'Altrincham and Sale West', u'69,605', u'72,008', u'Greater Manchester', u'England']
[u'Amber Valley', u'66,406', u'69,538', u'Derbyshire', u'England']
[u'Arundel and South Downs', u'71,203', u'76,697', u'West Sussex', u'England']
[u'Ashfield', u'74,674', u'77,049', u'Nottinghamshire', u'England']
[u'Ashford', u'72,501', u'81,947', u'Kent', u'England']
[u'Ashton-under-Lyne', u'67,334', u'68,553', u'Greater Manchester', u'England']
[u'Aylesbury', u'72,023', u'78,750', u'Buckinghamshire', u'England']
...

I wanted to get rid of the 2nd and 3rd columns (containing the electorates) from the array since they aren’t interesting to me – I already have that data from another source.

I was struggling to do this but two different Stack Overflow questions came to the rescue with suggestions to use enumerate to get the index of each column and then filter on that index inside the list comprehension.

First we’ll look at the filtering on a simple example. Imagine we have a list of 5 people:

people = ["mark", "michael", "brian", "alistair", "jim"]

And we only want to keep the 1st, 4th and 5th people. We therefore only want to keep the values that exist in index positions 0,3 and 4 which we can do like this:

>>> [x[1] for x in enumerate(people) if x[0] in [0,3,4]]
['mark', 'alistair', 'jim']
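As an aside, the operator module’s itemgetter can do the same job without the enumerate dance:

from operator import itemgetter
 
people = ["mark", "michael", "brian", "alistair", "jim"]
 
# itemgetter builds a callable which picks out the given positions
print list(itemgetter(0, 3, 4)(people))
# ['mark', 'alistair', 'jim']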

Now let’s apply the same approach to our constituencies data set:

import requests
 
from bs4 import BeautifulSoup
from soupselect import select
 
page = open("constituencies.html", 'r')
soup = BeautifulSoup(page.read())
 
for row in select(soup, "table.wikitable tr"):
    if select(row, "th"):
        print [entry[1].text for entry in enumerate(select(row, "th")) if entry[0] in [0,3,4]]
 
    if select(row, "td"):
        print [entry[1].text for entry in enumerate(select(row, "td")) if entry[0] in [0,3,4]]
$ python blog.py
[u'Constituency', u'Largest Local Authority', u'Country of the UK']
[u'Aldershot', u'Hampshire', u'England']
[u'Aldridge-Brownhills', u'West Midlands', u'England']
[u'Altrincham and Sale West', u'Greater Manchester', u'England']
[u'Amber Valley', u'Derbyshire', u'England']
[u'Arundel and South Downs', u'West Sussex', u'England']
[u'Ashfield', u'Nottinghamshire', u'England']
[u'Ashford', u'Kent', u'England']
[u'Ashton-under-Lyne', u'Greater Manchester', u'England']
[u'Aylesbury', u'Buckinghamshire', u'England']
Categories: Programming

Neo4j: LOAD CSV – java.io.InputStreamReader there’s a field starting with a quote and whereas it ends that quote there seems to be character in that field after that ending quote. That isn’t supported.

Mon, 05/04/2015 - 10:56

I recently came across the last.fm dataset via Ben Frederickson’s blog and thought it’d be an interesting one to load into Neo4j and explore.

I started with a simple query to parse the CSV file and count the number of rows:

LOAD CSV FROM "file:///Users/markneedham/projects/neo4j-recommendations/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv" 
AS row FIELDTERMINATOR  "\t"
return COUNT(*)
 
At java.io.InputStreamReader@4d307fda:6484 there's a field starting with a quote and whereas it ends that quote there seems  to be character in that field after that ending quote. That isn't supported. This is what I read: 'weird al"'

This blows up because (as the message says) we’ve got a field which uses double quotes but then has other characters either side of the quotes.

A quick search through the file reveals one of the troublesome lines:

$ grep "\"weird" lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv  | head -n 1
0015371426d2cbef354b2f680340de38d0ebd2f0	7746d775-9550-4360-b8d5-c37bd448ce01	"weird al" yankovic	4099

I ran a file containing only that line through CSV Lint to see what it thought and indeed it is invalid:

[CSV Lint validation result]

Let’s clean up our file to use single quotes instead of double quotes and try the query again:

$ tr "\"" "'" < lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv > lastfm-dataset-360K/clean.tsv
LOAD CSV FROM "file:///Users/markneedham/projects/neo4j-recommendations/lastfm-dataset-360K/clean.tsv" as row FIELDTERMINATOR  "\t"
return COUNT(*)
 
17559530

And we’re back in business! Interestingly Python’s CSV reader chooses to strip out the double quotes rather than throw an exception:

import csv
with open("smallWeird.tsv", "r") as file:
    reader = csv.reader(file, delimiter="\t")
 
    for row in reader:
        print row
$ python explore.py
['0015371426d2cbef354b2f680340de38d0ebd2f0', '7746d775-9550-4360-b8d5-c37bd448ce01', 'weird al yankovic', '4099']

I prefer LOAD CSV’s approach but it’s an interesting trade off I hadn’t considered before.
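If you’d rather Python behaved like LOAD CSV and left the quotes alone, the csv module’s QUOTE_NONE option – an aside, not something the original post covered – keeps them verbatim:

import csv
 
with open("smallWeird.tsv", "r") as file:
    # QUOTE_NONE treats quote characters as ordinary data
    reader = csv.reader(file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        print row
 
# ['0015371426d2cbef354b2f680340de38d0ebd2f0', '7746d775-9550-4360-b8d5-c37bd448ce01', '"weird al" yankovic', '4099']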

Categories: Programming

Coding: Visualising a bitmap

Sun, 05/03/2015 - 01:19

Over the last month or so I’ve spent some time each day reading a new part of the Neo4j code base to get more familiar with it, and one of my favourite classes is the Bits class which does all things low level on the wire and to disk.

In particular I like its toString method which returns a binary representation of the values that we’re storing in bytes, ints and longs.

I thought it’d be a fun exercise to try and write my own function which takes in a 32 bit bitmap and returns a string containing a 1 or 0 depending on whether each bit is set or not.

The key insight is that we need to iterate down from the highest order bit, create a bit mask for that bit’s position and do a bitwise AND with the full bitmap. If the result of that calculation is 0 then the bit isn’t set, otherwise it is.

For example, to check if the highest order bit (index 31) was set our bit mask would have the 32nd bit set and all of the others 0’d out.

java> (1 << 31) & 0x80000000
java.lang.Integer res5 = -2147483648

If we wanted to check whether the lowest order bit was set we’d run this computation instead – first against a bitmap where it isn’t set, then against one where it is:
 
java> (1 << 0) & 0x80000000
java.lang.Integer res7 = 0
 
java> (1 << 0) & 0x00000001
java.lang.Integer res8 = 1

Now let’s put that into a function which checks all 32 bits of the bitmap rather than just the ones we define:

private String  asString( int bitmap )
{
    StringBuilder sb = new StringBuilder();
    sb.append( "[" );
    for ( int i = Integer.SIZE - 1; i >= 0; i-- )
    {
        int bitMask = 1 << i;
        boolean bitIsSet = (bitmap & bitMask) != 0;
        sb.append( bitIsSet ? "1" : "0" );
 
        if ( i > 0 &&  i % 8 == 0 )
        {
            sb.append( "," );
        }
    }
    sb.append( "]" );
    return sb.toString();
}

And a quick test to check it works:

@Test
public void shouldInspectBits()
{
    System.out.println(asString( 0x00000001 ));
    // [00000000,00000000,00000000,00000001]
 
    System.out.println(asString( 0x80000000 ));
    // [10000000,00000000,00000000,00000000]
 
    System.out.println(asString( 0xA0 ));
    // [00000000,00000000,00000000,10100000]
 
    System.out.println(asString( 0xFFFFFFFF ));
    // [11111111,11111111,11111111,11111111]
}

Neat!

Categories: Programming

Deliberate Practice: Building confidence vs practicing

Thu, 04/30/2015 - 08:48

A few weeks ago I wrote about the learning to cycle dependency graph which described some of the skills required to become proficient at riding a bike.

While we’ve been practicing various skills/sub skills I’ve often found myself saying the following:

if it’s not hard you’re not practicing

me, April 2015


i.e. you should find the skill you’re currently practicing difficult otherwise you’re not stretching yourself and therefore aren’t getting better.

For example, in cycling you could be very comfortable riding with both hands on the handlebars but find using one hand a struggle. However, if you don’t practice that you won’t be able to indicate or turn corners.

This ties in with all my reading about deliberate practice which suggests that the type of exercises you do while deliberately practicing aren’t intended to be fun and are meant to expose your lack of knowledge.

In an ideal world we would spend all our time practicing these challenging skills, but in reality part of us wants to feel we’re actually improving by spending some of the time doing things we’re already good at. Doing things you’re not good at is also a bit of a slog, so we tend to have less motivation for it.

We therefore need to find a balance between doing challenging exercises and having fun building something or writing code that we already know how to do. I’ve found the best way to do this is to combine the two types of work into mini projects which contain some tasks that we’re already good at and some that require us to struggle.

For me this might involve cleaning up and importing a data set into Neo4j, which I’m comfortable with, and combining that with something else that I want to learn.

For example in the middle of last year I did some meetup analysis which involved creating a Neo4j graph of London’s NoSQL meetups and learning a bit about R, dplyr and linear regression along the way.

In January I built a How I met your mother graph and then spent a few weeks learning various algorithms for extracting topics from free text to give even more ways to explore the dataset.

Most recently I’ve been practicing exercises from Think Bayes and while it’s good practice I think I’d probably spend more time doing it if I linked it into a mini project with something I’m already comfortable with.

I’ll go off and have a think what that should be!

Categories: Programming

R: dplyr – Error in (list: invalid subscript type ‘double’

Mon, 04/27/2015 - 23:34

In my continued playing around with R I wanted to find the minimum value for a specified percentile given a data frame representing a cumulative distribution function (CDF).

e.g. imagine we have the following CDF represented in a data frame:

library(dplyr)
df = data.frame(score = c(5,7,8,10,12,20), percentile = c(0.05,0.1,0.15,0.20,0.25,0.5))

and we want to find the minimum score whose percentile exceeds 0.05. We can use the filter function to do so:

> (df %>% filter(percentile > 0.05) %>% slice(1))$score
[1] 7

Things become more tricky if we want to return multiple percentiles in one go.

My first thought was to create a data frame with one row for each target percentile and then pull in the appropriate row from our original data frame:

targetPercentiles = c(0.05, 0.2)
percentilesDf = data.frame(targetPercentile = targetPercentiles)
> percentilesDf %>% 
    group_by(targetPercentile) %>%
    mutate(x = (df %>% filter(percentile > targetPercentile) %>% slice(1))$score)
 
Error in (list(score = c(5, 7, 8, 10, 12, 20), percentile = c(0.05, 0.1,  : 
  invalid subscript type 'double'

Unfortunately this didn’t quite work as I expected – Antonios pointed out that this is probably because we’re mixing up two pipelines and dplyr can’t figure out what we want to do.

Instead he suggested the following variant which uses the do function:

df = data.frame(score = c(5,7,8,10,12,20), percentile = c(0.05,0.1,0.15,0.20,0.25,0.5))
targetPercentiles = c(0.05, 0.2)
 
> data.frame(targetPercentile = targetPercentiles) %>%
    group_by(targetPercentile) %>%
    do(df) %>% 
    filter(percentile > targetPercentile) %>% 
    slice(1) %>%
    select(targetPercentile, score)
Source: local data frame [2 x 2]
Groups: targetPercentile
 
  targetPercentile score
1             0.05     7
2             0.20    12

We can then wrap this up in a function:

percentiles = function(df, targetPercentiles) {
  # make sure the percentiles are in order
  df = df %>% arrange(percentile)
 
  data.frame(targetPercentile = targetPercentiles) %>%
    group_by(targetPercentile) %>%
    do(df) %>% 
    filter(percentile > targetPercentile) %>% 
    slice(1) %>%
    select(targetPercentile, score)
}

which we call like this:

df = data.frame(score = c(5,7,8,10,12,20), percentile = c(0.05,0.1,0.15,0.20,0.25,0.5))
> percentiles(df, c(0.08, 0.10, 0.50, 0.80))
Source: local data frame [2 x 2]
Groups: targetPercentile
 
  targetPercentile score
1             0.08     7
2             0.10     8

Note that we don’t actually get any rows back for 0.50 or 0.80 since we don’t have any entries greater than those percentiles. With a proper CDF we would though so the function does its job.
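
As an aside, we could avoid the do trick entirely with a plain sapply over the target percentiles – a minimal sketch (percentiles2 and its NA handling are my own additions, not from the original post):

percentiles2 = function(df, targetPercentiles) {
  # make sure the percentiles are in order
  df = df %>% arrange(percentile)
 
  data.frame(
    targetPercentile = targetPercentiles,
    score = sapply(targetPercentiles, function(p) {
      matching = (df %>% filter(percentile > p))$score
      # no entry above this percentile -> return NA instead of dropping the row
      if (length(matching) == 0) NA else matching[1]
    }))
}
 
> percentiles2(df, c(0.08, 0.10, 0.50, 0.80))
  targetPercentile score
1             0.08     7
2             0.10     8
3             0.50    NA
4             0.80    NA

Unlike the do version this keeps a row (with NA) for percentiles the CDF doesn’t reach, which may or may not be what you want.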

Categories: Programming

Deliberate Practice: Watching yourself fail

Sat, 04/25/2015 - 23:26

I’ve recently been reading the literature on Deliberate Practice written by K. Anders Ericsson and colleagues, and one of the suggestions for increasing our competence at a skill is to put ourselves in situations where we can fail.

I’ve been reading Think Bayes – an introductory text on Bayesian statistics, something I know nothing about – and each chapter concludes with a set of exercises to practice, a potentially perfect exercise in failure!

I’ve been going through the exercises and capturing my screen while I do so, an idea I picked up from one of the papers:

our most important breakthrough was developing a relatively inexpensive and efficient way for students to record their exercises on video and to review and analyze their own performances against well-defined criteria

Ideally I’d get a coach to review the video but that seems too much to ask of someone. Antonios has taken a look at some of my answers, however, and made suggestions for how he’d solve them, which has been really helpful.

After each exercise I watch the video and look for areas where I get stuck or don’t make progress so that I can go and practice more in that area. I also try to find inefficiencies in how I solve a problem as well as the types of approaches I’m taking.

These are some of the observations from watching myself back over the last week or so:

  • I was most successful when I had some idea of what I was going to try first. Most of the time the first code I wrote didn’t end up being correct but it moved me closer to the answer or ruled out an approach.

    It’s much easier to see the error in approach if there is an approach! On one occasion where I hadn’t planned out an approach I ended up staring at the question for 10 minutes and didn’t make any progress at all.

  • I could either solve the problems within 20 minutes or I wasn’t going to solve them and needed to chunk down to a simpler problem and then try the original exercise again.

    e.g. one exercise was to calculate the 5th percentile of a posterior distribution which I flailed around with for 15 minutes before giving up. Watching back on the video it was obvious that I hadn’t completely understood what a probability mass function was. I read the Wikipedia entry and retried the exercise and this time got the answer.

  • Knowing that you’re going to watch the video back stops you from getting distracted by email, twitter, Facebook etc.
  • It’s a painful experience watching yourself struggle – you can see exactly which functions you don’t know or things you need to look up on Google.
  • I deliberately don’t copy/paste any code while doing these exercises. I want to see how well I can do the exercises from scratch so that would defeat the point.

One of the suggestions that Ericsson makes for practice sessions is to focus on ‘technique’ rather than only on outcome, but I haven’t yet worked out what exactly that would involve in a programming context.

If you have any ideas or thoughts on this approach do let me know in the comments.

Categories: Programming

R: Think Bayes Locomotive Problem – Posterior probabilities for different priors

Sat, 04/25/2015 - 00:53

In my continued reading of Think Bayes the next problem to tackle is the Locomotive problem which is defined thus:

A railroad numbers its locomotives in order 1..N.

One day you see a locomotive with the number 60. Estimate how many locomotives the railroad has.

The interesting thing about this question is that it initially seems that we don’t have enough information to come up with any sort of answer. However, we can get an estimate if we come up with a prior to work with.

The simplest prior is to assume that there’s one railroad operator with between, say, 1 and 1000 locomotives, and an equal probability of each fleet size.

We can then write similar code as with the dice problem to update the prior based on the trains we’ve seen.

First we’ll create a data frame which captures the product of ‘number of locomotives’ and the observations of locomotives that we’ve seen (in this case we’ve only seen one locomotive, with number 60):

library(dplyr)
 
possibleValues = 1:1000
observations = c(60)
 
l = list(value = possibleValues, observation = observations)
df = expand.grid(l) 
 
> df %>% head()
  value observation
1     1          60
2     2          60
3     3          60
4     4          60
5     5          60
6     6          60

Next we want to add a column which represents the probability that the observed locomotive could have come from a particular fleet. If the fleet has fewer than 60 locomotives then we have a 0 probability, otherwise we have 1 / fleetSize:

prior = 1  / length(possibleValues)
df = df %>% mutate(score = ifelse(value < observation, 0, 1/value))
 
> df %>% sample_n(10)
     value observation       score
179    179          60 0.005586592
1001  1001          60 0.000999001
400    400          60 0.002500000
438    438          60 0.002283105
667    667          60 0.001499250
661    661          60 0.001512859
284    284          60 0.003521127
233    233          60 0.004291845
917    917          60 0.001090513
173    173          60 0.005780347

To find the probability of each fleet size we write the following code:

weightedDf = df %>% 
  group_by(value) %>% 
  summarise(aggScore = prior * prod(score)) %>%
  ungroup() %>%
  mutate(weighted = aggScore / sum(aggScore))
 
> weightedDf %>% sample_n(10)
Source: local data frame [10 x 3]
 
   value     aggScore     weighted
1    906 1.102650e-06 0.0003909489
2    262 3.812981e-06 0.0013519072
3    994 1.005031e-06 0.0003563377
4    669 1.493275e-06 0.0005294465
5    806 1.239455e-06 0.0004394537
6    673 1.484400e-06 0.0005262997
7    416 2.401445e-06 0.0008514416
8    624 1.600963e-06 0.0005676277
9     40 0.000000e+00 0.0000000000
10   248 4.028230e-06 0.0014282246

Let’s plot the data frame to see how the probability varies for each fleet size:

library(ggplot2)
ggplot(aes(x = value, y = weighted), data = weightedDf) + 
  geom_line(color="dark blue")

(Plot: posterior probability for each fleet size)

The most likely choice is a fleet size of 60 based on this diagram but an alternative would be to find the mean of the posterior which we can do like so:

> weightedDf %>% mutate(mean = value * weighted) %>% select(mean) %>% sum()
[1] 333.6561
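
For completeness we can also read off the single most likely fleet size (the maximum a posteriori value) directly from the data frame – a quick sketch:

> weightedDf %>% filter(weighted == max(weighted)) %>% select(value)
Source: local data frame [1 x 1]
 
  value
1    60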

Now let’s create a function with all that code in so we can play around with some different priors and observations:

meanOfPosterior = function(values, observations) {
  l = list(value = values, observation = observations)   
  df = expand.grid(l) %>% mutate(score = ifelse(value < observation, 0, 1/value))
 
  prior = 1 / length(values)
  weightedDf = df %>% 
    group_by(value) %>% 
    summarise(aggScore = prior * prod(score)) %>%
    ungroup() %>%
    mutate(weighted = aggScore / sum(aggScore))
 
  return (weightedDf %>% mutate(mean = value * weighted) %>% select(mean) %>% sum()) 
}

If we update our observations to locomotives with the numbers 60, 30 and 90 we’d get the following posterior means assuming different priors:

> meanOfPosterior(1:500, c(60, 30, 90))
[1] 151.8496
> meanOfPosterior(1:1000, c(60, 30, 90))
[1] 164.3056
> meanOfPosterior(1:2000, c(60, 30, 90))
[1] 171.3382

At the moment the function assumes that we always want a uniform prior i.e. every fleet size has an equal probability of being chosen, but we might want to vary the prior to see how different assumptions influence the posterior.

We can refactor the function to take in values & priors instead of calculating the priors in the function:

meanOfPosterior = function(values, priors, observations) {
  priorDf = data.frame(value = values, prior = priors)
  l = list(value = priorDf$value, observation = observations)
 
  df = merge(expand.grid(l), priorDf, by.x = "value", by.y = "value") %>% 
    mutate(score = ifelse(value < observation, 0, 1 / value))
 
  df %>% 
    group_by(value) %>% 
    summarise(aggScore = max(prior) * prod(score)) %>%
    ungroup() %>%
    mutate(weighted = aggScore / sum(aggScore)) %>%
    mutate(mean = value * weighted) %>%
    select(mean) %>%
    sum()
}

Now let’s check we get the same posterior means for the uniform priors:

> meanOfPosterior(1:500,  1/length(1:500), c(60, 30, 90))
[1] 151.8496
> meanOfPosterior(1:1000, 1/length(1:1000), c(60, 30, 90))
[1] 164.3056
> meanOfPosterior(1:2000, 1/length(1:2000), c(60, 30, 90))
[1] 171.3382

Now, instead of a uniform prior, let’s use a power law prior where the assumption is that smaller fleets are more likely. (Note that the priors we pass in don’t need to sum to 1 – the function normalises them.)

> meanOfPosterior(1:500,  sapply(1:500,  function(x) x ** -1), c(60, 30, 90))
[1] 130.7085
> meanOfPosterior(1:1000, sapply(1:1000, function(x) x ** -1), c(60, 30, 90))
[1] 133.2752
> meanOfPosterior(1:2000, sapply(1:2000, function(x) x ** -1), c(60, 30, 90))
[1] 133.9975
> meanOfPosterior(1:5000, sapply(1:5000, function(x) x ** -1), c(60, 30, 90))
[1] 134.212
> meanOfPosterior(1:10000, sapply(1:10000, function(x) x ** -1), c(60, 30, 90))
[1] 134.2435

Now we get very similar posterior means which converge on 134 and so that’s our best prediction.

Categories: Programming

R: Replacing for loops with data frames

Wed, 04/22/2015 - 23:18

In my last blog post I showed how to derive posterior probabilities for the Think Bayes dice problem:

Suppose I have a box of dice that contains a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die, and a 20-sided die. If you have ever played Dungeons & Dragons, you know what I am talking about.

Suppose I select a die from the box at random, roll it, and get a 6.
What is the probability that I rolled each die?

To recap, this was my final solution:

likelihoods = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))  
  names(scores) = names
 
  for(name in names) {
      for(observation in observations) {
        if(name < observation) {
          scores[paste(name)]  = 0
        } else {
          scores[paste(name)] = scores[paste(name)] *  (1.0 / name)
        }        
      }
    }  
  return(scores)
}
 
dice = c(4,6,8,12,20)
l1 = likelihoods(dice, c(6))
 
> l1 / sum(l1)
        4         6         8        12        20 
0.0000000 0.3921569 0.2941176 0.1960784 0.1176471

Although it works we have nested for loops which aren’t very idiomatic R so let’s try and get rid of them.

The first thing we want to do is return a data frame rather than a vector so we tweak the first two lines to read like this:

scores = rep(1.0 / length(names), length(names))  
df = data.frame(score = scores, name = names)

Next we can get rid of the inner for loop and replace it with a call to ifelse wrapped inside a dplyr mutate call:

library(dplyr)
likelihoods2 = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))  
  df = data.frame(score = scores, name = names)
 
  for(observation in observations) {
    df = df %>% mutate(score = ifelse(name < observation, 0, score * (1.0 / name)) )
  }
 
  return(df)
}
 
dice = c(4,6,8,12,20)
l1 = likelihoods2(dice, c(6))
 
> l1
       score name
1 0.00000000    4
2 0.03333333    6
3 0.02500000    8
4 0.01666667   12
5 0.01000000   20

Finally we’ll normalise the scores so they’re weighted relative to each other:

likelihoods2 = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))  
  df = data.frame(score = scores, name = names)
 
  for(observation in observations) {
    df = df %>% mutate(score = ifelse(name < observation, 0, score * (1.0 / name)) )
  }
 
  return(df %>% mutate(weighted = score / sum(score)) %>% select(name, weighted))
}
 
dice = c(4,6,8,12,20)
l1 = likelihoods2(dice, c(6))
 
> l1
  name  weighted
1    4 0.0000000
2    6 0.3921569
3    8 0.2941176
4   12 0.1960784
5   20 0.1176471

Now we’re down to just the one for loop. Getting rid of that one is a bit trickier. First we’ll create a data frame which contains a row for every (observation, dice) pair, simulating the nested for loops:

likelihoods3 = function(names, observations) {
  l = list(observation = observations, roll = names)
  obsDf = do.call(expand.grid,l) %>% 
    mutate(likelihood = 1.0 / roll, 
           score = ifelse(roll < observation, 0, likelihood))   
 
  return(obsDf)
}
 
dice = c(4,6,8,12,20)
l1 = likelihoods3(dice, c(6))
 
> l1
  observation roll likelihood      score
1           6    4 0.25000000 0.00000000
2           6    6 0.16666667 0.16666667
3           6    8 0.12500000 0.12500000
4           6   12 0.08333333 0.08333333
5           6   20 0.05000000 0.05000000
 
l2 = likelihoods3(dice, c(6, 4, 8, 7, 7, 2))
> l2
   observation roll likelihood      score
1            6    4 0.25000000 0.00000000
2            4    4 0.25000000 0.25000000
3            8    4 0.25000000 0.00000000
4            7    4 0.25000000 0.00000000
5            7    4 0.25000000 0.00000000
6            2    4 0.25000000 0.25000000
7            6    6 0.16666667 0.16666667
8            4    6 0.16666667 0.16666667
9            8    6 0.16666667 0.00000000
10           7    6 0.16666667 0.00000000
11           7    6 0.16666667 0.00000000
12           2    6 0.16666667 0.16666667
13           6    8 0.12500000 0.12500000
14           4    8 0.12500000 0.12500000
15           8    8 0.12500000 0.12500000
16           7    8 0.12500000 0.12500000
17           7    8 0.12500000 0.12500000
18           2    8 0.12500000 0.12500000
19           6   12 0.08333333 0.08333333
20           4   12 0.08333333 0.08333333
21           8   12 0.08333333 0.08333333
22           7   12 0.08333333 0.08333333
23           7   12 0.08333333 0.08333333
24           2   12 0.08333333 0.08333333
25           6   20 0.05000000 0.05000000
26           4   20 0.05000000 0.05000000
27           8   20 0.05000000 0.05000000
28           7   20 0.05000000 0.05000000
29           7   20 0.05000000 0.05000000
30           2   20 0.05000000 0.05000000

Now we need to collapse the data frame, grouping by ‘roll’ so that we end up with one row per die.

We’ll add a new column which stores the posterior probability for each die. This is calculated by multiplying the prior probability by the product of the ‘score’ entries.

This is what our new likelihood function looks like:

likelihoods3 = function(names, observations) {
  l = list(observation = observations, roll = names)
  obsDf = do.call(expand.grid,l) %>% 
    mutate(likelihood = 1.0 / roll, 
           score = ifelse(roll < observation, 0, likelihood))   
 
  return (obsDf %>% 
    group_by(roll) %>% 
    summarise(s = (1.0/length(names)) * prod(score) ) %>%
    ungroup() %>% 
    mutate(weighted = s / sum(s)) %>%
    select(roll, weighted))
}
 
l1 = likelihoods3(dice, c(6))
> l1
Source: local data frame [5 x 2]
 
  roll  weighted
1    4 0.0000000
2    6 0.3921569
3    8 0.2941176
4   12 0.1960784
5   20 0.1176471
 
l2 = likelihoods3(dice, c(6, 4, 8, 7, 7, 2))
> l2
Source: local data frame [5 x 2]
 
  roll    weighted
1    4 0.000000000
2    6 0.000000000
3    8 0.915845272
4   12 0.080403426
5   20 0.003751302

We’ve now got the same result as we did with our nested for loops so I think the refactoring has been a success.
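
As a final sanity check we can compare the two implementations directly – a quick sketch, assuming the original nested-loop likelihoods function is still loaded:

oldScores = likelihoods(dice, c(6, 4, 8, 7, 7, 2))
newScores = likelihoods3(dice, c(6, 4, 8, 7, 7, 2))
 
> all.equal(unname(oldScores / sum(oldScores)), newScores$weighted)
[1] TRUE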

Categories: Programming

R: Numeric keys in the nested list/dictionary

Tue, 04/21/2015 - 06:59

Last week I described how I’ve been creating fake dictionaries in R using lists and I found myself using the same structure while solving the dice problem in Think Bayes.

The dice problem is described as follows:

Suppose I have a box of dice that contains a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die, and a 20-sided die. If you have ever played Dungeons & Dragons, you know what I am talking about.

Suppose I select a die from the box at random, roll it, and get a 6.
What is the probability that I rolled each die?

Here’s a simple example of the nested list that I started with:

dice = c(4,6,8,12,20)
priors = rep(1.0 / length(dice), length(dice))
names(priors) = dice
 
> priors
  4   6   8  12  20 
0.2 0.2 0.2 0.2 0.2

I wanted to retrieve the prior for the 8-sided die, which I tried to do like this:

> priors[8]
<NA> 
  NA

That comes back with ‘NA’ because R treats the 8 as a positional index into the vector – and we only have 5 elements – rather than as a lookup by name.

As far as I understand if we want to look up values by name we have to use a string so I tweaked the code to store names as strings:

dice = c(4,6,8,12,20)
priors = rep(1.0 / length(dice), length(dice))
names(priors) = sapply(dice, paste)
 
> priors["8"]
  8 
0.2

That works much better, but with some experimentation I realised I didn’t even need to run ‘dice’ through the sapply function – assigning a numeric vector to names coerces the values to character anyway:

dice = c(4,6,8,12,20)
priors = rep(1.0 / length(dice), length(dice))
names(priors) = dice
 
> priors["8"]
  8 
0.2
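
We can confirm that the names really are stored as strings:

> class(names(priors))
[1] "character"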

Now that we’ve got that working we can write a likelihood function which takes in observed dice rolls and tells us how likely it was that we rolled each type of dice. We start simple by copying the above code into a function:

likelihoods = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))  
  names(scores) = names
 
  return(scores)
}
 
dice = c(4,6,8,12,20)
l1 = likelihoods(dice, c(6))
 
> l1
  4   6   8  12  20 
0.2 0.2 0.2 0.2 0.2

Next we’ll update the score for a particular die to 0 if one of the observed rolls is greater than that die’s maximum value:

likelihoods = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))  
  names(scores) = names
 
  for(name in names) {
      for(observation in observations) {
        if(name < observation) {
          scores[paste(name)]  = 0       
        }
      }
    }  
  return(scores)
}
 
dice = c(4,6,8,12,20)
l1 = likelihoods(dice, c(6))
 
> l1
  4   6   8  12  20 
0.0 0.2 0.2 0.2 0.2

The 4-sided die has been ruled out since we’ve rolled a 6! Now let’s put in the else condition which updates our score by the probability of getting the observed roll with each valid die, i.e. we have a 1/20 chance of rolling any particular number with the 20-sided die, a 1/8 chance with the 8-sided die, etc.

likelihoods = function(names, observations) {
  scores = rep(1.0 / length(names), length(names))  
  names(scores) = names
 
  for(name in names) {
      for(observation in observations) {
        if(name < observation) {
          scores[paste(name)]  = 0
        } else {
          scores[paste(name)] = scores[paste(name)] *  (1.0 / name)
        }        
      }
    }  
  return(scores)
}
 
dice = c(4,6,8,12,20)
l1 = likelihoods(dice, c(6))
 
> l1
         4          6          8         12         20 
0.00000000 0.03333333 0.02500000 0.01666667 0.01000000

And finally let’s normalise those scores so they’re a bit more readable:

> l1 / sum(l1)
        4         6         8        12        20 
0.0000000 0.3921569 0.2941176 0.1960784 0.1176471

Categories: Programming

R: non-numeric argument to binary operator

Mon, 04/20/2015 - 00:08

When debugging R code, given my Java background, I often find myself trying to print out the state of variables along with an appropriate piece of text like this:

names = c(1,2,3,4,5,6)
> print("names: " + names)
Error in "names: " + names : non-numeric argument to binary operator

We might try this next:

> print("names: ", names)
[1] "names: "

which doesn’t actually print the names variable – only the first argument to the print function is printed.

We’ll find more success with the paste function:

> print(paste("names: ", names))
[1] "names:  1" "names:  2" "names:  3" "names:  4" "names:  5" "names:  6"

This is an improvement but, since paste is vectorised over its arguments, it repeats the ‘names:’ prefix for every element which isn’t what we want. Introducing the toString function gets us over the line:

> print(paste("names: ", toString(names)))
[1] "names:  1, 2, 3, 4, 5, 6"
Categories: Programming

R: Removing for loops

Sun, 04/19/2015 - 00:53

In my last blog post I showed the translation of a likelihood function from Think Bayes into R and in my first attempt at this function I used a couple of nested for loops.

likelihoods = function(names, mixes, observations) {
  scores = rep(1, length(names))
  names(scores) = names
 
  for(name in names) {
      for(observation in observations) {
        scores[name] = scores[name] *  mixes[[name]][observation]      
      }
    }  
  return(scores)
}
Names = c("Bowl 1", "Bowl 2")
 
bowl1Mix = c(0.75, 0.25)
names(bowl1Mix) = c("vanilla", "chocolate")
bowl2Mix = c(0.5, 0.5)
names(bowl2Mix) = c("vanilla", "chocolate")
Mixes = list("Bowl 1" = bowl1Mix, "Bowl 2" = bowl2Mix)
Mixes
 
Observations = c("vanilla", "vanilla", "vanilla", "chocolate")
l = likelihoods(Names, Mixes, Observations)
 
> l / sum(l)
  Bowl 1   Bowl 2 
0.627907 0.372093

We pass in a vector of bowls, a nested dictionary describing the mixes of cookies in each bowl and the observations that we’ve made. The function tells us that there’s an almost 2/3 probability of the cookies coming from Bowl 1 and just over 1/3 of being Bowl 2.

In this case there probably won’t be much of a performance improvement by getting rid of the loops but we should be able to write something that’s more concise and hopefully idiomatic.

Let’s start by getting rid of the inner for loop. That can be replaced by a call to the Reduce function like so:

likelihoods2 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names
 
  for(name in names) {
    scores[name] = Reduce(function(acc, observation) acc * mixes[[name]][observation], observations, 1)
  }  
  return(scores)
}
l2 = likelihoods2(Names, Mixes, Observations)
 
> l2 / sum(l2)
  Bowl 1   Bowl 2 
0.627907 0.372093

So that’s good, we’ve still got the same probabilities as before. Now to get rid of the outer for loop. The Map function helps us out here:

likelihoods3 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names
 
  scores = Map(function(name) 
    Reduce(function(acc, observation) acc * mixes[[name]][observation], observations, 1), 
    names)
 
  return(scores)
}
 
l3 = likelihoods3(Names, Mixes, Observations)
> l3
$`Bowl 1`
  vanilla 
0.1054688 
 
$`Bowl 2`
vanilla 
 0.0625

We end up with a list instead of a vector which we need to fix by using the unlist function:

likelihoods3 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names
 
  scores = Map(function(name) 
    Reduce(function(acc, observation) acc * mixes[[name]][observation], observations, 1), 
    names)
 
  return(unlist(scores))
}
 
l3 = likelihoods3(Names, Mixes, Observations)
 
> l3 / sum(l3)
Bowl 1.vanilla Bowl 2.vanilla 
      0.627907       0.372093

Now we just have this annoying ‘vanilla’ in the name. That’s fixed easily enough:

likelihoods3 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names
 
  scores = Map(function(name) 
    Reduce(function(acc, observation) acc * mixes[[name]][observation], observations, 1), 
    names)
 
  result = unlist(scores)
  names(result) = names
 
  return(result)
}
 
l3 = likelihoods3(Names, Mixes, Observations)
 
> l3 / sum(l3)
  Bowl 1   Bowl 2 
0.627907 0.372093

A slightly cleaner alternative makes use of the sapply function:

likelihoods3 = function(names, mixes, observations) {
  scores = rep(0, length(names))
  names(scores) = names
 
  scores = sapply(names, function(name) 
    Reduce(function(acc, observation) acc * mixes[[name]][observation], observations, 1))
  names(scores) = names
 
  return(scores)
}
 
l3 = likelihoods3(Names, Mixes, Observations)
 
> l3 / sum(l3)
  Bowl 1   Bowl 2 
0.627907 0.372093
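
One more variant for completeness: swapping sapply for vapply guarantees we get a numeric vector back – a sketch (likelihoods4 is my own name; note the [[ ]] indexing, which drops the inner ‘vanilla’ name so we no longer need to reset the names afterwards):

likelihoods4 = function(names, mixes, observations) {
  # vapply checks each result against numeric(1), so we always get a numeric vector
  vapply(names, function(name)
    Reduce(function(acc, observation) acc * mixes[[name]][[observation]],
           observations, 1),
    numeric(1))
}
 
l4 = likelihoods4(Names, Mixes, Observations)
 
> l4 / sum(l4)
  Bowl 1   Bowl 2 
0.627907 0.372093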

That’s the best I’ve got for now but I wonder if we could write a version of this using matrix operations somehow – but that’s for next time!

Categories: Programming