
Mark Needham
Thoughts on Software Development

Neo4j: Detecting potential typos using EXPLAIN

Tue, 03/17/2015 - 23:46

I’ve been running a few intro to Neo4j training sessions recently using Neo4j 2.2.0 RC1 and at some stage in every session somebody will make a typo when writing out one of the example queries.

For example, one of the queries that we do about half way through finds the actors and directors who have worked together and aggregates the movies they were in.

This is the correct query:

MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5

which should yield the following results:

==> +-----------------------------------------------------------------------------------------------------------------------+
==> | actor.name           | director.name    | movies                                                                      |
==> +-----------------------------------------------------------------------------------------------------------------------+
==> | "Hugo Weaving"       | "Andy Wachowski" | ["Cloud Atlas","The Matrix Revolutions","The Matrix Reloaded","The Matrix"] |
==> | "Hugo Weaving"       | "Lana Wachowski" | ["Cloud Atlas","The Matrix Revolutions","The Matrix Reloaded","The Matrix"] |
==> | "Laurence Fishburne" | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> | "Keanu Reeves"       | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> | "Carrie-Anne Moss"   | "Lana Wachowski" | ["The Matrix Revolutions","The Matrix Reloaded","The Matrix"]               |
==> +-----------------------------------------------------------------------------------------------------------------------+

However, a common typo is to write ‘DIRECTED_IN’ instead of ‘DIRECTED’ in which case we’ll see no results:

MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED_IN]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5
 
==> +-------------------------------------+
==> | actor.name | director.name | movies |
==> +-------------------------------------+
==> +-------------------------------------+
==> 0 row

It’s not immediately obvious why we aren’t seeing any results, which can be quite frustrating.

However, in Neo4j 2.2 the ‘EXPLAIN’ keyword has been introduced, and we can use it to see what the query planner thinks of the query we want to execute without actually executing it.

Instead the planner makes use of the knowledge it has about our schema to come up with the plan it would run and an estimate of how much of the graph that plan would touch:

EXPLAIN MATCH (actor:Person)-[:ACTED_IN]->(movie)<-[:DIRECTED_IN]-(director)
RETURN actor.name, director.name, COLLECT(movie.title) AS movies
ORDER BY LENGTH(movies) DESC
LIMIT 5

[Screenshot: query plan for the query with the mistyped ‘DIRECTED_IN’ relationship]

The first row of the query plan describes an all nodes scan, which tells us that the query will start from the ‘director’, but it’s the second row that’s interesting.

The estimated number of rows when expanding the ‘DIRECTED_IN’ relationship is 0, whereas we’d expect it to be at least a positive value if there were any instances of that relationship in the database.

If we compare this to the plan generated when using the proper ‘DIRECTED’ relationship we can see the difference:

[Screenshot: query plan for the query with the correct ‘DIRECTED’ relationship]

Here we see an estimated 44 rows from expanding the ‘DIRECTED’ relationship so we know there are at least some nodes connected by that relationship type.

In summary, if you find your query not returning anything when you expect it to, prefix it with ‘EXPLAIN’ and make sure you’re not seeing the dreaded ‘0 estimated rows’.

Categories: Programming

One month of mini habits

Tue, 03/17/2015 - 02:32

I recently read a book in the ‘getting things done’ genre written by Stephen Guise titled ‘Mini Habits’ and although I generally don’t like those types of books I quite enjoyed this one and decided to give his system a try.

The underlying idea is that there are two parts of actually doing stuff:

  • Planning what to do
  • Doing it

We often get stuck in between the first and second steps because what we’ve planned to do is too big and overwhelming.

Guise’s approach for overcoming this inaction is to shrink the amount of work to do until it’s small enough that we don’t feel any resistance to getting started.

It should be something that you can do in 1 or 2 minutes – stupidly small – something that you can do even on your worst day when you have no time/energy.

I’m extremely good at procrastinating so I thought I’d give it a try and see if it helped. Guise suggests starting with one or two habits but I had four things that I wanted to do so I’ve ignored that advice for now.

My attempted habits are the following:

  • Read one page of a data science related paper/article a day
  • Read one page of a computer science related paper/article a day
  • Write one line of data science related code a day
  • Write 50 words on the blog a day

Sooooo….has it helped?

In terms of doing each of the habits I’ve been successful so far – today is the 35th day in a row that I’ve managed to do each of them. Having said that, there have been some times when I’ve got back home at 11pm and realised that I haven’t done 2 of the habits and need to quickly do the minimum to ‘tick them off’.

The habit I’ve enjoyed doing the most is writing one line of data science related code a day.

My initial intention was that this was only going to involve writing machine learning code but at the moment I’ve made it a bit more generic so it can include things like the Twitter Graph or other bits and pieces that I want to get started on.

The main problem I’ve had with making progress on mini projects like that is that I imagine their end state and it feels too daunting to start on. Committing to just one line of code a day has been liberating in some way.

One tweak I have made to all the habits is to have some rough goal of where all the daily habits are leading, as I noticed that the stuff I was doing each day was becoming very random. Michael pointed me at Amy Hoy’s ‘Guide to doing it backwards’, which describes a neat technique for working back from a goal and determining the small steps required to achieve it.

Writing at least 50 words a day has been beneficial for getting blog posts written. Before the last month I found myself writing most of my posts at the end of the month, but I have a more regular cadence now which feels better.

Computer science wise I’ve been picking up papers which have some sort of link to databases to try and learn more of the low level detail there. e.g. I’ve read the LRU-K cache paper which Neo4j 2.2’s page cache is based on and have been flicking through the original CRDTs paper over the last few days.

I also recently came across the Papers We Love repository so I’ll probably work through some of the distributed systems papers they’ve collated next.

Other observations

I’ve found that if I do the habits early in the morning it feels better as I know they’re out of the way and won’t linger over me for the rest of the day.

I sometimes find myself wanting to just tick off the habits for the day even when it might be interesting to spend more time on one of the habits. I’m not sure what to make of this really – perhaps I should reduce the number of habits to the ones I’m really interested in?

With the writing it does sometimes feel like I’m just writing for the sake of it, but it is a good habit to get into as it forces me to explain what I’m working on and get ideas from other people, so I’m going to keep doing it.

I’ve enjoyed my experience with ‘mini habits’ so far although I think I’d be better off focusing on fewer habits so that there’s still enough time in the day to read/learn random spontaneous stuff that doesn’t fit into these habits.

Categories: Programming

Python: Transforming Twitter datetime string to timestamp (‘z’ is a bad directive in format)

Sun, 03/15/2015 - 23:43

I’ve been playing around with importing Twitter data into Neo4j and since Neo4j can’t store dates natively just yet I needed to convert a date string to a timestamp.

I started with the following which unfortunately throws an exception:

from datetime import datetime
date = "Sat Mar 14 18:43:19 +0000 2015"
 
>>> datetime.strptime(date, "%a %b %d %H:%M:%S %z %Y")
 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/_strptime.py", line 317, in _strptime
    (bad_directive, format))
ValueError: 'z' is a bad directive in format '%a %b %d %H:%M:%S %z %Y'

%z is actually a valid directive used to extract the timezone but my googling suggests that it not working here is one of the idiosyncrasies of Python 2’s strptime.

I eventually came across the python-dateutil library, as recommended by Joe Shaw on StackOverflow.

Using that library the problem is suddenly much simpler:

$ pip install python-dateutil
from dateutil import parser
parsed_date = parser.parse(date)
 
>>> parsed_date
datetime.datetime(2015, 3, 14, 18, 43, 19, tzinfo=tzutc())

To get to a timestamp we can use calendar as I’ve described before:

import calendar
timestamp = calendar.timegm(parser.parse(date).timetuple())
 
>>> timestamp
1426358599
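
If we’re going to do this for lots of tweets it’s worth wrapping those two steps up in a small helper function. This is just my own sketch rather than anything from the dateutil documentation:

import calendar
from dateutil import parser
 
def twitter_date_to_timestamp(date_string):
    # e.g. "Sat Mar 14 18:43:19 +0000 2015" -> 1426358599
    return calendar.timegm(parser.parse(date_string).timetuple())
 
>>> twitter_date_to_timestamp("Sat Mar 14 18:43:19 +0000 2015")
1426358599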
Categories: Programming

Python: Checking any value in a list exists in a line of text

Sat, 03/14/2015 - 03:52

I’ve been doing some log file analysis to see which cypher queries were being run on a Neo4j instance and I wanted to narrow down the lines I looked at to only those which had mutating operations, i.e. those containing the words MERGE, DELETE, SET or CREATE.

Here’s an example of the text file I was parsing:

$ cat blog.txt
MATCH n RETURN n
MERGE (n:Person {name: "Mark"}) RETURN n
MATCH (n:Person {name: "Mark"}) ON MATCH SET n.counter = 1 RETURN n

So I only want lines 2 & 3 to be returned as the first one only returns data and doesn’t execute any updates on the graph.

I started off with a very crude way of doing this:

with open("blog.txt", "r") as ins:
    for line in ins:
        if "MERGE" in line or "DELETE" in line or "SET" in line or "CREATE" in line:
           print line.strip()

A better way of doing this is to use the any function and check that at least one of the words exists in the line:

mutating_commands = ["SET", "DELETE", "MERGE", "CREATE"]
with open("blog.txt", "r") as ins:
    for line in ins:
        if any(command in line for command in mutating_commands):
           print line.strip()

I thought I might be able to simplify the code even further by using itertools but my best attempt so far is less legible than the above:

import itertools
 
commands = ["SET", "CREATE", "MERGE", "DELETE"]
with open("blog.txt", "r") as ins:
    for line in ins:
        if len(list(itertools.ifilter(lambda x: x in line, commands))) > 0:
            print line.strip()

I think I’ll go with the 2nd approach for now but if I’m doing something wrong with itertools and it’s much easier to use than the example I’ve shown do correct me!
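
Another option that occurred to me, although I’ve only sketched it out here rather than tested it properly, is to compile the keywords into a single regular expression:

import re
 
mutating_commands = ["SET", "DELETE", "MERGE", "CREATE"]
pattern = re.compile("|".join(mutating_commands))
 
with open("blog.txt", "r") as ins:
    for line in ins:
        if pattern.search(line):
            print line.strip()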

Categories: Programming

Python/Neo4j: Finding interesting computer sciency people to follow on Twitter

Wed, 03/11/2015 - 22:13

At the beginning of this year I moved from Neo4j’s field team to the dev team and since the code we write there is much lower level than I’m used to I thought I should find some people to follow on twitter whom I can learn from.

My technique for finding some of those people was to pick a person from the Neo4j kernel team who’s very good at systems programming and uses twitter which led me to Mr Chris Vest.

I thought that the people Chris interacts with on twitter are likely to be interested in this type of programming, so I manually traversed out to those people, had a look at who they interacted with and put them all into a list.

This approach has worked well and I’ve picked up lots of reading material from following these people. It does feel like something that could be automated though, and we’ll use Python and Neo4j as our toolkit.

We need to find a library to connect to the Twitter API. There are a few to choose from but tweepy seems simple enough so I started using that.

The first thing we need to do is fill in our Twitter API auth details. Since the application is just for me I’m not going to bother with setting up OAuth – instead I’ll just create an application on apps.twitter.com and grab the appropriate tokens:

[Screenshot: creating an application on apps.twitter.com]

Once we’ve done that we can navigate to that app and get a consumer key, consumer secret, access token and access token secret. They all reside on the ‘Keys and Access Tokens’ tab:

[Screenshots: the ‘Keys and Access Tokens’ tab of the application]

Now that we’ve got ourselves all auth’d up let’s write some code that starts from Chris and goes out and finds his latest tweets and the people he’s interacted with in them. We’ll write the appropriate information out to a CSV file so that we can import it into Neo4j later on:

import tweepy
import csv
from collections import Counter, deque
 
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
 
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)
 
counter = Counter()
users_to_process = deque()
USERS_TO_PROCESS = 50
 
def extract_tweet(tweet):
    user_mentions = ",".join([user["screen_name"].encode("utf-8")
                             for user in tweet.entities["user_mentions"]])
    urls = ",".join([url["expanded_url"]
                     for url in tweet.entities["urls"]])
    return [tweet.user.screen_name.encode("utf-8"),
            tweet.id,
            tweet.text.encode("utf-8"),
            user_mentions,
            urls]
 
starting_user = "chvest"
with open("tweets.csv", "a") as tweets:
    writer = csv.writer(tweets, delimiter=",", escapechar="\\", doublequote = False)
    for tweet in tweepy.Cursor(api.user_timeline, id=starting_user).items(50):
        writer.writerow(extract_tweet(tweet))
        tweets.flush()
        for user in tweet.entities["user_mentions"]:
            if not len(users_to_process) > USERS_TO_PROCESS:
                users_to_process.append(user["screen_name"])
                counter[user["screen_name"]] += 1
            else:
                break

As well as writing out Chris’ tweets I’m also capturing the other users who he’s interacted with and putting them in a queue that we’ll drain later on. We’re limiting the number of other users that we’ll process to 50 for now but it’s easy to change.

If we print out the first few lines of ‘tweets.csv’ this is what we’d see:

$ head -n 5 tweets.csv
userName,tweetId,contents,usersMentioned,urls
chvest,575427045167071233,@shipilev http://t.co/WxqFIsfiSF,shipilev,
chvest,575403105174552576,@AlTobey I often use http://t.co/G7Cdn9Udst for small one-off graph diagrams.,AlTobey,http://www.apcjones.com/arrows/
chvest,575337346687766528,RT @theburningmonk: this is why you need composition over inheritance... :s #CompositionOverInheritance http://t.co/aKRwUaZ0qo,theburningmonk,
chvest,575269402083459072,@chvest except…? “Each library implementation should therefore be identical with respect to the public API”,chvest,

We’re capturing the user, tweetId, the tweet itself, any users mentioned in the tweet and any URLs shared in the tweet.

Next we want to get some of the tweets of the people Chris has interacted with:

# Grab the code from here too - https://gist.github.com/mneedham/3188c44b2cceb88c6de0
 
import tweepy
import csv
from collections import Counter, deque
 
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
 
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)
 
counter = Counter()
users_to_process = deque()
USERS_TO_PROCESS = 50
 
def extract_tweet(tweet):
    user_mentions = ",".join([user["screen_name"].encode("utf-8")
                             for user in tweet.entities["user_mentions"]])
    urls = ",".join([url["expanded_url"]
                     for url in tweet.entities["urls"]])
    return [tweet.user.screen_name.encode("utf-8"),
            tweet.id,
            tweet.text.encode("utf-8"),
            user_mentions,
            urls]
 
starting_user = "chvest"
with open("tweets.csv", "a") as tweets:
    writer = csv.writer(tweets, delimiter=",", escapechar="\\", doublequote = False)
    for tweet in tweepy.Cursor(api.user_timeline, id=starting_user).items(50):
        writer.writerow(extract_tweet(tweet))
        tweets.flush()
        for user in tweet.entities["user_mentions"]:
            if not len(users_to_process) > USERS_TO_PROCESS:
                users_to_process.append(user["screen_name"])
                counter[user["screen_name"]] += 1
            else:
                break
    users_processed = set([starting_user])
    while True:
        if len(users_processed) >= USERS_TO_PROCESS:
            break
        else:
            if len(users_to_process) > 0:
                next_user = users_to_process.popleft()
                print next_user
                if next_user in users_processed:
                    print "-- user already processed"
                else:
                    print "-- processing user"
                    users_processed.add(next_user)
                    for tweet in tweepy.Cursor(api.user_timeline, id=next_user).items(10):
                        writer.writerow(extract_tweet(tweet))
                        tweets.flush()
                        for user_mentioned in tweet.entities["user_mentions"]:
                            if not len(users_processed) > 50:
                                users_to_process.append(user_mentioned["screen_name"])
                                counter[user_mentioned["screen_name"]] += 1
                            else:
                                break
            else:
                break

Finally let’s take a quick look at the users who show up most frequently:

>>> for user_name, count in counter.most_common(20):
        print user_name, count
 
neo4j 13
devnexus 12
AlTobey 11
bitprophet 11
hazelcast 10
chvest 9
shipilev 9
AntoineGrondin 8
gvsmirnov 8
GatlingTool 8
lagergren 7
tomsontom 6
dorkitude 5
noctarius2k 5
DanHeidinga 5
chris_mahan 5
coda 4
mccv 4
gAmUssA 4
jmhodges 4

A few of the people on that list are already in my manually curated list, which is a good start. We can explore the data set better once it’s in Neo4j though, so let’s write some Cypher import statements to create our own mini Twitter graph:

// add people
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
MERGE (p:Person {userName: row.userName});
 
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
WITH SPLIT(row.usersMentioned, ",") AS users
UNWIND users AS user
MERGE (p:Person {userName: user});
 
// add tweets
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
MERGE (t:Tweet {id: row.tweetId})
ON CREATE SET t.contents = row.contents;
 
// add urls
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
WITH SPLIT(row.urls, ",") AS urls
UNWIND urls AS url
MERGE (:URL {value: url});
 
// add links
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
MATCH (p:Person {userName: row.userName})
MATCH (t:Tweet {id: row.tweetId})
MERGE (p)-[:TWEETED]->(t);
 
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
WITH SPLIT(row.usersMentioned, ",") AS users, row
UNWIND users AS user
MATCH (p:Person {userName: user})
MATCH (t:Tweet {id: row.tweetId})
MERGE (p)-[:MENTIONED_IN]->(t);
 
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-twitter/tweets.csv" AS row
WITH SPLIT(row.urls, ",") AS urls, row
UNWIND urls AS url
MATCH (u:URL {value: url})
MATCH (t:Tweet {id: row.tweetId})
MERGE (t)-[:CONTAINS_LINK]->(u);

We can put all those commands in a file and execute them using neo4j-shell:

$ ./neo4j-community-2.2.0-RC01/bin/neo4j-shell --file import.cql

Now let’s write some queries against the graph:

// Find the tweets where Chris mentioned himself
MATCH path = (n:Person {userName: "chvest"})-[:TWEETED]->()<-[:MENTIONED_IN]-(n)
RETURN path

[Screenshot: graph visualisation of the tweets where Chris mentioned himself]

// Find the most popular links shared in the network
MATCH (u:URL)<-[r:CONTAINS_LINK]->()
RETURN u.value, COUNT(*) AS times
ORDER BY times DESC
LIMIT 10
 
+-------------------------------------------------------------------------------------------------+
| u.value                                                                                 | times |
+-------------------------------------------------------------------------------------------------+
| "http://www.polyglots.dk/"                                                              | 4     |
| "http://www.java-forum-nord.de/"                                                        | 4     |
| "http://hirt.se/blog/?p=646"                                                            | 3     |
| "http://wp.me/p26jdv-Ja"                                                                | 3     |
| "https://instagram.com/p/0D4I_hH77t/"                                                   | 3     |
| "https://blogs.oracle.com/java/entry/new_java_champion_tom_chindl"                      | 3     |
| "http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html"                  | 2     |
| "https://firstlook.org/theintercept/2015/03/10/ispy-cia-campaign-steal-apples-secrets/" | 2     |
| "http://buff.ly/1GzZXlo"                                                                | 2     |
| "http://buff.ly/1BrgtQd"                                                                | 2     |
+-------------------------------------------------------------------------------------------------+
10 rows

The first link is for a programming language meetup in Copenhagen, the second for a Java conference in Hanover and the third an announcement about the latest version of Java Mission Control. So far so good!

A next step in this area would be to run the links through Prismatic’s interest graph so we can model topics in our graph as well. For now let’s have a look at the interactions between Chris and others in the graph:

// Find the people who Chris interacts with most often
MATCH path = (n:Person {userName: "chvest"})-[:TWEETED]->()<-[:MENTIONED_IN]-(other)
RETURN other.userName, COUNT(*) AS times
ORDER BY times DESC
LIMIT 5
 
+------------------------+
| other.userName | times |
+------------------------+
| "gvsmirnov"    | 7     |
| "shipilev"     | 5     |
| "nitsanw"      | 4     |
| "DanHeidinga"  | 3     |
| "AlTobey"      | 3     |
+------------------------+
5 rows

Let’s generalise that to find interactions between any pair of people:

// Find the people who interact most often
MATCH (n:Person)-[:TWEETED]->()<-[:MENTIONED_IN]-(other)
WHERE n <> other
RETURN n.userName, other.userName, COUNT(*) AS times
ORDER BY times DESC
LIMIT 5
 
+------------------------------------------+
| n.userName    | other.userName   | times |
+------------------------------------------+
| "fbogsany"    | "AntoineGrondin" | 8     |
| "chvest"      | "gvsmirnov"      | 7     |
| "chris_mahan" | "bitprophet"     | 6     |
| "maxdemarzi"  | "neo4j"          | 6     |
| "chvest"      | "shipilev"       | 5     |
+------------------------------------------+
5 rows

Let’s combine a couple of these together to come up with a score for each person:

MATCH (n:Person)
 
// number of mentions
OPTIONAL MATCH (n)-[mention:MENTIONED_IN]->()
WITH n, COUNT(mention) AS mentions
 
// number of links shared by someone else
OPTIONAL MATCH (n)-[:TWEETED]->()-[:CONTAINS_LINK]->(link)<-[:CONTAINS_LINK]-()
 
WITH n, mentions, COUNT(link) AS links
RETURN n.userName, mentions + links AS score, mentions, links
ORDER BY score DESC
LIMIT 10
 
+------------------------------------------+
| n.userName    | score | mentions | links |
+------------------------------------------+
| "chvest"      | 17    | 10       | 7     |
| "hazelcast"   | 16    | 10       | 6     |
| "neo4j"       | 15    | 13       | 2     |
| "noctarius2k" | 14    | 4        | 10    |
| "devnexus"    | 12    | 12       | 0     |
| "polyglotsdk" | 11    | 2        | 9     |
| "shipilev"    | 11    | 10       | 1     |
| "AlTobey"     | 11    | 10       | 1     |
| "bitprophet"  | 10    | 9        | 1     |
| "GatlingTool" | 10    | 8        | 2     |
+------------------------------------------+
10 rows

Amusingly Chris is top of his own network, but we also see three accounts which aren’t people but rather products – neo4j, hazelcast and GatlingTool. The rest are legit though.

That’s as far as I’ve got but to make this more useful I think we need to introduce follower/friend links as well as importing more data.
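
A rough sketch of how we might capture who Chris follows, assuming tweepy’s friends_ids endpoint works the way I expect, could look something like this:

import csv
import tweepy
 
# assumes `api` is the authenticated tweepy.API instance created earlier
with open("follows.csv", "a") as follows_file:
    writer = csv.writer(follows_file, delimiter=",")
    for friend_id in tweepy.Cursor(api.friends_ids, screen_name="chvest").items():
        writer.writerow(["chvest", friend_id])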

In the mean time I’ve got a bunch of links to go and read!

Categories: Programming

Python: Streaming/Appending to a file

Tue, 03/10/2015 - 00:00

I’ve been playing around with Twitter’s API (via the tweepy library) and due to the rate limiting it imposes I wanted to stream results to a CSV file rather than waiting until my whole program had finished.

I wrote the following program to simulate what I was trying to do:

import csv
import time
 
with open("rows.csv", "a") as file:
    writer = csv.writer(file, delimiter = ",")
 
    end = time.time() + 10
    while True:
        if time.time() > end:
            break
        else:
            writer.writerow(["mark", "123"])
            time.sleep(1)

The program will run for 10 seconds and append one line to ‘rows.csv’ once a second. Although I have used the ‘a’ flag in my call to ‘open’, if I poll that file before the 10 seconds are up it’s empty:

$ date && wc -l rows.csv
Mon  9 Mar 2015 22:54:27 GMT
       0 rows.csv
 
$ date && wc -l rows.csv
Mon  9 Mar 2015 22:54:31 GMT
       0 rows.csv
 
$ date && wc -l rows.csv
Mon  9 Mar 2015 22:54:34 GMT
       0 rows.csv
 
$ date && wc -l rows.csv
Mon  9 Mar 2015 22:54:43 GMT
      10 rows.csv

I thought the flushing of the file was completely controlled by the with block but luckily for me there’s actually a flush() function which allows me to force writes to the file whenever I want.

Here’s the new and improved sample program:

import csv
import time
 
with open("rows.csv", "a") as file:
    writer = csv.writer(file, delimiter = ",")
 
    end = time.time() + 10
    while True:
        if time.time() > end:
            break
        else:
            writer.writerow(["mark", "123"])
            time.sleep(1)
            file.flush()

And if we poll the file while the program’s running:

$ date && wc -l rows.csv
Mon  9 Mar 2015 22:57:36 GMT
      14 rows.csv
 
$ date && wc -l rows.csv
Mon  9 Mar 2015 22:57:37 GMT
      15 rows.csv
 
$ date && wc -l rows.csv
Mon  9 Mar 2015 22:57:40 GMT
      18 rows.csv
 
$ date && wc -l rows.csv
Mon  9 Mar 2015 22:57:45 GMT
      20 rows.csv
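
An alternative that should also work, although it’s only a sketch and not something I used in my actual script, is to ask open() to line buffer the file by passing a buffering value of 1, so each completed row should reach the file without an explicit flush():

import csv
import time
 
# buffering=1 asks Python to line buffer the file, so every row the csv
# writer finishes should be pushed through without calling flush()
with open("rows.csv", "a", 1) as file:
    writer = csv.writer(file, delimiter = ",")
 
    end = time.time() + 10
    while True:
        if time.time() > end:
            break
        else:
            writer.writerow(["mark", "123"])
            time.sleep(1)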

Much easier than I expected – I ♥ python!

Categories: Programming

Neo4j: TF/IDF (and variants) with cypher

Sun, 03/08/2015 - 14:24

A few weeks ago I wrote a blog post on running TF/IDF over HIMYM transcripts using scikit-learn to find the most important phrases by episode and afterwards I was curious how difficult it’d be to do the same in Neo4j.

I started by translating one of wikipedia’s TF/IDF examples to cypher to see what the algorithm would look like:

WITH 3 as termFrequency, 2 AS numberOfDocuments, 1 as numberOfDocumentsWithTerm
WITH termFrequency, log10(numberOfDocuments / numberOfDocumentsWithTerm) AS inverseDocumentFrequency
return termFrequency * inverseDocumentFrequency
 
0.9030899869919435
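
The same calculation in Python makes a handy sanity check; this is just my own snippet mirroring the formula above:

from math import log10
 
term_frequency = 3
number_of_documents = 2
number_of_documents_with_term = 1
 
inverse_document_frequency = log10(float(number_of_documents) / number_of_documents_with_term)
print term_frequency * inverse_document_frequency
# => 0.903089986992, matching the cypher version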

Next I needed to go over the HIMYM episode transcripts and extract phrases and their corresponding counts in each episode. I used scikit-learn’s CountVectorizer to do this and wrote the results into a CSV file.
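
The extraction code isn’t shown here, but a rough sketch of the approach (assuming the transcripts are sitting in a dictionary keyed by episode id) might look something like this:

import csv
from sklearn.feature_extraction.text import CountVectorizer
 
# episode_texts is assumed to be a dict of {episode_id: transcript text}
vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words="english")
matrix = vectorizer.fit_transform(episode_texts.values())
phrases = vectorizer.get_feature_names()
 
with open("data/import/words_scikit.csv", "w") as words_file:
    writer = csv.writer(words_file, delimiter=",")
    writer.writerow(["EpisodeId", "Phrase", "Count"])
    for row_index, episode_id in enumerate(episode_texts.keys()):
        row = matrix.getrow(row_index)
        for column_index, count in zip(row.indices, row.data):
            writer.writerow([episode_id, phrases[column_index], count])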

Here’s a preview of that file:

$ head -n 10 data/import/words_scikit.csv
EpisodeId,Phrase,Count
1,2005,1
1,2005 seven,1
1,2005 seven just,1
1,2030,3
1,2030 kids,1
1,2030 kids intently,1
1,2030 narrator,1
1,2030 narrator kids,1
1,2030 son,1

Now let’s import that into Neo4j using the LOAD CSV tool:

// phrases
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/words_scikit.csv" AS row
MERGE (phrase:Phrase {value: row.Phrase});
// episode -> phrase
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/words_scikit.csv" AS row
MATCH (phrase:Phrase {value: row.Phrase})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (episode)-[:CONTAINED_PHRASE {times:TOINT(row.Count)}]->(phrase);

Now that all the data’s in we can translate the TF/IDF query to make use of our graph. We’ll start with episode 1:

match (e:Episode)
WITH COUNT(e) AS numberOfDocuments
match (p:Phrase)<-[r:CONTAINED_PHRASE]-(e:Episode {id: 1})
WITH numberOfDocuments, p, r.times AS termFrequency
MATCH (p)<-[:CONTAINED_PHRASE]->(otherEpisode)
WITH p, COUNT(otherEpisode) AS numberOfDocumentsWithTerm, numberOfDocuments, termFrequency
WITH p, numberOfDocumentsWithTerm,  log10(numberOfDocuments / numberOfDocumentsWithTerm) AS inverseDocumentFrequency, termFrequency, numberOfDocuments
RETURN p.value, termFrequency, numberOfDocumentsWithTerm, inverseDocumentFrequency, termFrequency * inverseDocumentFrequency AS score
ORDER BY score DESC
LIMIT 10
 
==> +--------------------------------------------------------------------------------------------------------------------+
==> | p.value                | termFrequency | numberOfDocumentsWithTerm | inverseDocumentFrequency | score              |
==> +--------------------------------------------------------------------------------------------------------------------+
==> | "olives"               | 18            | 2                         | 2.0170333392987803       | 36.306600107378046 |
==> | "yasmine"              | 13            | 1                         | 2.3180633349627615       | 30.1348233545159   |
==> | "signal"               | 11            | 5                         | 1.6127838567197355       | 17.740622423917088 |
==> | "goanna"               | 10            | 4                         | 1.7160033436347992       | 17.16003343634799  |
==> | "flashback date"       | 6             | 1                         | 2.3180633349627615       | 13.908380009776568 |
==> | "scene"                | 17            | 37                        | 0.6989700043360189       | 11.88249007371232  |
==> | "flashback date robin" | 5             | 1                         | 2.3180633349627615       | 11.590316674813808 |
==> | "ted yasmine"          | 5             | 1                         | 2.3180633349627615       | 11.590316674813808 |
==> | "smurf pen1s"          | 5             | 2                         | 2.0170333392987803       | 10.085166696493902 |
==> | "eye patch"            | 5             | 2                         | 2.0170333392987803       | 10.085166696493902 |
==> +--------------------------------------------------------------------------------------------------------------------+
==> 10 rows

The score we’ve calculated is different to scikit-learn’s but the relative order seems fine so that’s good. The neat thing about calculating this in Neo4j is that we can now vary the ‘inverse document’ part of the equation e.g. to find out the most important phrases in a season rather than an episode:

match (:Season) 
WITH COUNT(*) AS numberOfDocuments
match (p:Phrase)<-[r:CONTAINED_PHRASE]-(:Episode)-[:IN_SEASON]->(s:Season {number: "1"})
WITH p, SUM(r.times) AS termFrequency, numberOfDocuments
MATCH (p)<-[:CONTAINED_PHRASE]->(otherEpisode)-[:IN_SEASON]->(s:Season)
WITH p, COUNT(DISTINCT s) AS numberOfDocumentsWithTerm, termFrequency, numberOfDocuments
WITH p, numberOfDocumentsWithTerm,  log10(numberOfDocuments / numberOfDocumentsWithTerm) AS inverseDocumentFrequency, termFrequency, numberOfDocuments
RETURN p.value, termFrequency, numberOfDocumentsWithTerm, inverseDocumentFrequency, termFrequency * inverseDocumentFrequency AS score
ORDER BY score DESC
LIMIT 10
 
==> +-------------------------------------------------------------------------------------------------------------+
==> | p.value         | termFrequency | numberOfDocumentsWithTerm | inverseDocumentFrequency | score              |
==> +-------------------------------------------------------------------------------------------------------------+
==> | "moby"          | 46            | 1                         | 0.9542425094393249       | 43.895155434208945 |
==> | "int"           | 71            | 3                         | 0.47712125471966244      | 33.87560908509603  |
==> | "ellen"         | 53            | 2                         | 0.6020599913279624       | 31.909179540382006 |
==> | "claudia"       | 104           | 4                         | 0.3010299956639812       | 31.307119549054043 |
==> | "ericksen"      | 59            | 3                         | 0.47712125471966244      | 28.150154028460083 |
==> | "party number"  | 29            | 1                         | 0.9542425094393249       | 27.67303277374042  |
==> | "subtitle"      | 27            | 1                         | 0.9542425094393249       | 25.76454775486177  |
==> | "vo"            | 47            | 3                         | 0.47712125471966244      | 22.424698971824135 |
==> | "ted vo"        | 47            | 3                         | 0.47712125471966244      | 22.424698971824135 |
==> | "future ted vo" | 45            | 3                         | 0.47712125471966244      | 21.47045646238481  |
==> +-------------------------------------------------------------------------------------------------------------+
==> 10 rows

From this query we learn that ‘Moby’ was only mentioned in one season of the whole series and in fact all of those mentions were in the same episode. The occurrence of ‘int’ seems to be more of a data issue – in some episodes the transcript describes the location but in many it doesn’t:

$ ack -iw "int" data/import/sentences.csv
2361,8,1,8,"INT. LIVING ROOM, YEAR 2030"
2377,8,1,8,INT. CHINESE RESTAURANT
2395,8,1,8,INT. APARTMENT
2412,8,1,8,INT. APARTMENT
2419,8,1,8,INT. BAR
2472,8,1,8,INT. APARTMENT
2489,8,1,8,INT. BAR
2495,8,1,8,INT. APARTMENT
2506,8,1,8,INT. BAR
2584,8,1,8,INT. APARTMENT
2629,8,1,8,INT. RESTAURANT
2654,8,1,8,INT. APARTMENT
2682,8,1,8,INT. RESTAURANT
2689,8,1,8,(Robin gets up and leaves restaurant) INT. HOSPITAL WAITING AREA

‘vo’ stands for voice over which should probably be stripped out in the stop words as it doesn’t add much value. It shows up here because the transcripts aren’t consistent in the way they represent Future Ted speaking.

Let’s have a look at the final season to see how that fares:

match (:Season)
WITH COUNT(*) AS numberOfDocuments
match (p:Phrase)<-[r:CONTAINED_PHRASE]-(:Episode)-[:IN_SEASON]->(s:Season {number: "9"})
WITH p, SUM(r.times) AS termFrequency, numberOfDocuments
MATCH (p)<-[:CONTAINED_PHRASE]->(otherEpisode:Episode)-[:IN_SEASON]->(s:Season)
WITH p, COUNT(DISTINCT s) AS numberOfDocumentsWithTerm, termFrequency, numberOfDocuments
WITH p, numberOfDocumentsWithTerm,  log10(numberOfDocuments / numberOfDocumentsWithTerm) AS inverseDocumentFrequency, termFrequency, numberOfDocuments
RETURN p.value, termFrequency, numberOfDocumentsWithTerm, inverseDocumentFrequency, termFrequency * inverseDocumentFrequency AS score
ORDER BY score DESC
LIMIT 10
 
==> +------------------------------------------------------------------------------------------------------------------+
==> | p.value              | termFrequency | numberOfDocumentsWithTerm | inverseDocumentFrequency | score              |
==> +------------------------------------------------------------------------------------------------------------------+
==> | "ring bear"          | 28            | 1                         | 0.9542425094393249       | 26.718790264301095 |
==> | "click options"      | 26            | 1                         | 0.9542425094393249       | 24.810305245422448 |
==> | "thank linus"        | 26            | 1                         | 0.9542425094393249       | 24.810305245422448 |
==> | "vow"                | 39            | 2                         | 0.6020599913279624       | 23.480339661790534 |
==> | "just click"         | 24            | 1                         | 0.9542425094393249       | 22.901820226543798 |
==> | "rehearsal dinner"   | 23            | 1                         | 0.9542425094393249       | 21.947577717104473 |
==> | "linus"              | 36            | 2                         | 0.6020599913279624       | 21.674159687806647 |
==> | "just click options" | 22            | 1                         | 0.9542425094393249       | 20.993335207665147 |
==> | "locket"             | 32            | 2                         | 0.6020599913279624       | 19.265919722494797 |
==> | "cassie"             | 19            | 1                         | 0.9542425094393249       | 18.13060767934717  |
==> +------------------------------------------------------------------------------------------------------------------+

There are several phrases which are specific to Barney & Robin’s wedding (‘vow’, ‘ring bear’, ‘rehearsal dinner’) so it makes sense that those come out top. The ‘linus’ here mostly refers to the server at the bar who interacts with Lily although a quick search over the transcripts reveals that she also had an Uncle Linus!

$ ack -iw "linus" data/import/sentences.csv  | head -n 5
18649,61,3,17,Lily: Why don't we just call Duluth Mental Hospital and say my Uncle Linus can live with us?
59822,185,9,1,Linus.
59826,185,9,1,"Are you my guy, Linus?"
59832,185,9,1,Thank you Linus.
59985,185,9,1,"Thank you, Linus."
...

From doing this exercise I think TF/IDF is an interesting way of exploring unstructured data but for a phrase to be really interesting to us it should appear across multiple episodes/seasons.

One way to achieve that would be to weight those features more so I’ll try that out next.

All the code in this post is on github if you want to take a look and improve it.

Categories: Programming

Python: scikit-learn/lda: Extracting topics from QCon talk abstracts

Thu, 03/05/2015 - 09:52

Following on from Rik van Bruggen’s blog post on a QCon graph he’s created ahead of this week’s conference, I was curious whether we could extract any interesting relationships between talks based on their abstracts.

Talks are already grouped by their hosting track but there’s likely to be some overlap in topics even for talks on different tracks.
I therefore wanted to extract topics and connect each talk to the topic that describes it best.

My first attempt was following an example which uses Non-Negative Matrix Factorization, which worked very well for extracting topics but didn’t seem to provide an obvious way to work out how to link those topics to individual talks.

Instead I ended up looking at the lda library which uses Latent Dirichlet Allocation and allowed me to achieve both goals.

I already had some code to run TF/IDF over each of the talks so I thought I’d be able to feed the matrix output from that into the LDA function. This is what I started with:

import csv
 
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF
from collections import defaultdict
from bs4 import BeautifulSoup, NavigableString
from soupselect import select
 
def uri_to_file_name(uri):
    return uri.replace("/", "-")
 
sessions = {}
with open("data/sessions.csv", "r") as sessions_file:
    reader = csv.reader(sessions_file, delimiter = ",")
    reader.next() # header
    for row in reader:
        session_id = int(row[0])
        filename = "data/sessions/" + uri_to_file_name(row[4])
        page = open(filename).read()
        soup = BeautifulSoup(page)
        abstract = select(soup, "div.brenham-main-content p")
        if abstract:
            sessions[session_id] = {"abstract" : abstract[0].text, "title": row[3] }
        else:
            abstract = select(soup, "div.pane-content p")
            sessions[session_id] = {"abstract" : abstract[0].text, "title": row[3] }
 
corpus = []
titles = []
for id, session in sorted(sessions.iteritems(), key=lambda t: int(t[0])):
    corpus.append(session["abstract"])
    titles.append(session["title"])
 
n_topics = 15
n_top_words = 50
n_features = 6000
 
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 0, stop_words = 'english')
matrix =  vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names()
 
import lda
import numpy as np
 
vocab = feature_names
 
model = lda.LDA(n_topics=20, n_iter=500, random_state=1)
model.fit(matrix)
topic_word = model.topic_word_
n_top_words = 20
 
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

And if we run it?

Topic 0: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 1: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 2: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 3: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 4: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 5: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 6: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 7: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 8: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 9: 10 faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 10: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 11: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 12: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 13: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 14: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 15: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 16: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 17: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 18: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure
Topic 19: zoosk faced exposing expression external externally extra extraordinary extreme extremes face facebook facilitates faster factor factors fail failed failure

As you can see, every topic has the same set of words which isn’t what we want, presumably because the lda library expects a matrix of raw term counts rather than TF/IDF weights. Let’s switch out our TF/IDF vectorizer for a simpler count based one:

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,1), min_df = 0, stop_words = 'english')

The rest of the code stays the same and these are the topics that get extracted:

Topic 0: time people company did writing real way used let cassandra soundcloud successful know web organization audio lives swift stuck
Topic 1: process development delivery platform developer continuous testing rapidly deployment implementing release demonstrate paas advice hard light predictable radically introduce
Topic 2: way open space kind people change meetings ll lead powerful practice times everyday simple qconlondon organization unconference track extraordinary
Topic 3: apache apis processing open spark distributed leading making environments solr cases brooklyn components existing ingestion contributing data target evolved
Topic 4: management million effective cost halo gameplay player billion ad catastrophic store microsoft final music influence information launch research purchased
Topic 5: product look like use talk problems working analysis projects challenges 2011 functionality useful spread business deep inside happens sensemaker
Topic 6: ll computers started principles free focus face smaller atlas control uses products avoid computing ground billions mean volume consistently
Topic 7: code end users developers just application way line apps mobile features sites hours issues applications write faster game better
Topic 8: ve development teams use things world like time learned lessons think methods multiple story say customer developer experiences organisations
Topic 9: software building docker built challenges monitoring gilt application discuss solution decision talk download source center critical decisions bintray customers
Topic 10: years infrastructure tools language different service lot devops talk adoption scala popular clojure advantages introduced effectively looking wasn includes
Topic 11: high does latency session requirements functional performance real world questions problem second engineering patterns gravity explain discuss expected time
Topic 12: business make build technology technologies help trying developers parts want interfaces small best centres implementations critical moo databases going
Topic 13: need design systems large driven scale software applications slow protocol change needs approach gets new contracts solutions complicated distributed
Topic 14: architecture service micro architectures increasing talk microservices order market value values new present presents services scalable trading practices today
Topic 15: java using fast robovm lmax ios presentation really jvm native best exchange azul hardware started project slowdowns goal bring
Topic 16: data services using traditional create ways support uk large user person complex systems production impact art organizations accessing mirage
Topic 17: agile team experience don work doing processes based key reach extra defined pressure machines nightmare practices learn goals guidance
Topic 18: internet new devices programming things iot big number deliver day connected performing growing got state thing provided times automated
Topic 19: cloud including deploy session api government security culture software type attack techniques environment digital secure microservice better creation interaction

Some of the groupings seem to make sense, e.g. Topic 11 contains words related to high performance code with low latency and Topic 15 covers Java, the JVM and other related words, but others are more difficult to decipher.

For example, both Topic 14 and Topic 19 talk about micro services but the latter mentions ‘government’ and ‘security’ so perhaps the talks linked to that topic come at micro services from a different angle altogether.

Next let’s see which topics a talk is most likely to be about. We’ll look at the first ten:

doc_topic = model.doc_topic_
for i in range(0, 10):
    print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))
    print(doc_topic[i].argsort()[::-1][:3])
 
To the Moon (top topic: 8)
[ 8  0 11]
Evolutionary Architecture and Micro-Services - A Match Enabled by Continuous Delivery (top topic: 14)
[14 19 16]
How SoundCloud uses Cassandra (top topic: 0)
[0 6 5]
DevOps and the Need for Speed (top topic: 18)
[18  5 16]
Neuro-diversity and agile (top topic: 7)
[17  7  2]
Java 8 in Anger (top topic: 7)
[ 7 15 12]
APIs that Change Lifestyles (top topic: 9)
[ 9  6 19]
Elasticsearch powers the Citizen Advice Bureau (CAB) to monitor trends in society before they become issues (top topic: 16)
[16 12 19]
Architecture Open Space (top topic: 2)
[ 2 19 18]
Don’t let Data Gravity crush your infrastructure (top topic: 11)
[11 16  3]

So our third talk on the list ‘How SoundCloud uses Cassandra’ does end up being tagged with topic 0 which mentions SoundCloud so that’s good!

Topic 0: time people company did writing real way used let cassandra soundcloud successful know web organization audio lives swift stuck

Its next two topics are 5 & 6 which contain the following words…

Topic 5: product look like use talk problems working analysis projects challenges 2011 functionality useful spread business deep inside happens sensemaker
Topic 6: ll computers started principles free focus face smaller atlas control uses products avoid computing ground billions mean volume consistently

…which are not as intuitive. What about Java 8 in Anger? It’s been tagged with topics 7, 15 and 12:

Topic 7: code end users developers just application way line apps mobile features sites hours issues applications write faster game better
Topic 15: java using fast robovm lmax ios presentation really jvm native best exchange azul hardware started project slowdowns goal bring
Topic 12: business make build technology technologies help trying developers parts want interfaces small best centres implementations critical moo databases going

Topic 15 makes sense since it mentions Java, and perhaps 12 and 7 do as well as they both mention developers.

So while the topics pulled out are not horrendous I don’t think they’re particularly useful yet either. These are some of the areas I need to do some more research around:

  • How do you measure the success of topic modelling? I’ve been eyeballing the output of the algorithm but I imagine there’s an automated way to do that.
  • How do you determine the right number of topics? I found an article written by Christophe Grainger which explains a way of doing that which I need to look at in more detail.
  • It feels like I would be able to pull out better topics if I had an ontology of computer science/software words and then ran the words through that to derive topics.
  • Another approach suggested by Michael is to find the most popular words using the CountVectorizer and tag talks with those instead.
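
A rough sketch of that last suggestion, which I haven’t yet run against the real abstracts, might look like this:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
 
count_vectorizer = CountVectorizer(analyzer='word', stop_words='english')
count_matrix = count_vectorizer.fit_transform(corpus)
words = np.array(count_vectorizer.get_feature_names())
 
# for each talk, pick out its five most frequent non-stopword words
for title, row in zip(titles, count_matrix.toarray()):
    top_words = words[row.argsort()[::-1][:5]]
    print("{} -> {}".format(title, ", ".join(top_words)))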

If you have any suggestions let me know. The full code is on github if you want to play around with it.

Categories: Programming

Python: scikit-learn – Training a classifier with non numeric features

Mon, 03/02/2015 - 08:48

Following on from my previous posts on training a classifier to pick out the speaker in sentences of HIMYM transcripts the next thing to do was train a random forest of decision trees to see how that fared.

I’ve used scikit-learn for this before so I decided to use that. However, before building a random forest I wanted to check that I could build an equivalent decision tree.

I initially thought that scikit-learn’s DecisionTreeClassifier would take in data in the same format as nltk’s so I started out with the following code:

import json
import nltk
import collections
 
from himymutil.ml import pos_features
from sklearn import tree
from sklearn.cross_validation import train_test_split
 
with open("data/import/trained_sentences.json", "r") as json_file:
    json_data = json.load(json_file)
 
tagged_sents = []
for sentence in json_data:
    tagged_sents.append([(word["word"], word["speaker"]) for word in sentence["words"]])
 
featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    sentence_pos = nltk.pos_tag(untagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append((pos_features(untagged_sent, sentence_pos, i), tag) )
 
clf = tree.DecisionTreeClassifier()
 
train_data, test_data = train_test_split(featuresets, test_size=0.20, train_size=0.80)
 
>>> train_data[1]
({'word': u'your', 'word-pos': 'PRP$', 'next-word-pos': 'NN', 'prev-word-pos': 'VB', 'prev-word': u'throw', 'next-word': u'body'}, False)
 
>>> clf.fit([item[0] for item in train_data], [item[1] for item in train_data])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/sklearn/tree/tree.py", line 137, in fit
    X, = check_arrays(X, dtype=DTYPE, sparse_format="dense")
  File "/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/sklearn/utils/validation.py", line 281, in check_arrays
    array = np.asarray(array, dtype=dtype)
  File "/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/numpy/core/numeric.py", line 460, in asarray
    return array(a, dtype, copy=False, order=order)
TypeError: float() argument must be a string or a number

In fact, the classifier can only deal with numeric features so we need to translate our features into that format using DictVectorizer.

from sklearn.feature_extraction import DictVectorizer
 
vec = DictVectorizer()
X = vec.fit_transform([item[0] for item in featuresets]).toarray()
 
>>> len(X)
13016
 
>>> len(X[0])
7302
 
>>> vec.get_feature_names()[10:15]
['next-word-pos=EX', 'next-word-pos=IN', 'next-word-pos=JJ', 'next-word-pos=JJR', 'next-word-pos=JJS']

We end up with one feature for every key/value combination that exists in featuresets.

I was initially confused about how to split up training and test data sets but it’s actually fairly easy – train_test_split allows us to pass in multiple lists which it splits along the same seam:

vec = DictVectorizer()
X = vec.fit_transform([item[0] for item in featuresets]).toarray()
Y = [item[1] for item in featuresets]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, train_size=0.80)

Next we want to train the classifier which is a couple of lines of code:

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, Y_train)

I wrote the following function to assess the classifier:

import collections
import nltk
 
def assess(text, predictions_actual):
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)
    for i, (prediction, actual) in enumerate(predictions_actual):
        refsets[actual].add(i)
        testsets[prediction].add(i)
    speaker_precision = nltk.metrics.precision(refsets[True], testsets[True])
    speaker_recall = nltk.metrics.recall(refsets[True], testsets[True])
    non_speaker_precision = nltk.metrics.precision(refsets[False], testsets[False])
    non_speaker_recall = nltk.metrics.recall(refsets[False], testsets[False])
    return [text, speaker_precision, speaker_recall, non_speaker_precision, non_speaker_recall]

We can call it like so:

predictions = clf.predict(X_test)
assessment = assess("Decision Tree", zip(predictions, Y_test))
 
>>> assessment
['Decision Tree', 0.9459459459459459, 0.9210526315789473, 0.9970134395221503, 0.9980069755854509]

Those values are in the same ball park as we’ve seen with the nltk classifier so I’m happy it’s all wired up correctly.
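
Since the eventual goal is a random forest, swapping in scikit-learn’s RandomForestClassifier should just be a one line change. This is a sketch of what I’ll try next rather than something I’ve run yet:

from sklearn.ensemble import RandomForestClassifier
 
forest = RandomForestClassifier(n_estimators=50)  # 50 trees is an arbitrary starting point
forest = forest.fit(X_train, Y_train)
forest_assessment = assess("Random Forest", zip(forest.predict(X_test), Y_test))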

The last thing I wanted to do was visualise the decision tree that had been created and the easiest way to do that is to export the classifier to DOT format and then use graphviz to create an image:

with open("/tmp/decisionTree.dot", 'w') as file:
    tree.export_graphviz(clf, out_file = file, feature_names = vec.get_feature_names())
dot -Tpng /tmp/decisionTree.dot -o /tmp/decisionTree.png


The decision tree is quite a few levels deep so here’s part of it:

DecisionTreeSection

The full script is on github if you want to play around with it.

Categories: Programming

Python: Detecting the speaker in HIMYM using Parts of Speech (POS) tagging

Sun, 03/01/2015 - 03:36

Over the last couple of weeks I’ve been experimenting with different classifiers to detect speakers in HIMYM transcripts and in all my attempts so far the only features I’ve used have been words.

This led to classifiers that were overfitted to the training data, so I wanted to generalise them by introducing the parts of speech of the words in each sentence, which are more generic features.

First I changed the function which generates the features for each word to also contain the parts of speech of the previous and next words as well as the word itself:

def pos_features(sentence, sentence_pos, i):
    features = {}
 
    features["word"] = sentence[i]
    features["word-pos"] = sentence_pos[i][1]
 
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-word-pos"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-word-pos"] = sentence_pos[i-1][1]
 
    if i == len(sentence) - 1:
        features["next-word"] = "<END>"
        features["next-word-pos"] = "<END>"
    else:
        features["next-word"] = sentence[i+1]
        features["next-word-pos"] = sentence_pos[i+1][1]
 
    return features

Next we need to tweak our calling code to calculate the part of speech tags for each sentence and pass them in:

featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    sentence_pos = nltk.pos_tag(untagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append((pos_features(untagged_sent, sentence_pos, i), tag) )

I’m using nltk to do this and although it’s slower than some alternatives, the data set is small enough that it’s not an issue.
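
For anyone who hasn’t used it before, nltk.pos_tag takes a tokenised sentence and returns (word, tag) pairs; you may need to download the tagger model via nltk.download() first, and the exact tags depend on the model that ships with your nltk version:

import nltk
 
tokens = nltk.word_tokenize("Robin: Hi Ted, how are you?")
print(nltk.pos_tag(tokens))
# a list of (word, tag) tuples, e.g. [('Robin', 'NNP'), (':', ':'), ...]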

Now it’s time to train a Decision Tree with the new features. I created three variants – one with both words and POS; one with only words; one with only POS.

I took a deep copy of the training/test data sets and then removed the appropriate keys:

def get_rid_of(entry, *keys):
    for key in keys:
        del entry[key]
 
import copy
 
# Word based classifier
tmp_train_data = copy.deepcopy(train_data)
for entry, tag in tmp_train_data:
    get_rid_of(entry, 'prev-word-pos', 'word-pos', 'next-word-pos')
 
tmp_test_data = copy.deepcopy(test_data)
for entry, tag in tmp_test_data:
    get_rid_of(entry, 'prev-word-pos', 'word-pos', 'next-word-pos')
 
c = nltk.DecisionTreeClassifier.train(tmp_train_data)
nltk.classify.accuracy(c, tmp_test_data)
 
# POS based classifier
tmp_train_data = copy.deepcopy(train_data)
for entry, tag in tmp_train_data:
    get_rid_of(entry, 'prev-word', 'word', 'next-word')
 
tmp_test_data = copy.deepcopy(test_data)
for entry, tag in tmp_test_data:
    get_rid_of(entry, 'prev-word', 'word', 'next-word')
 
c = nltk.DecisionTreeClassifier.train(tmp_train_data)
nltk.classify.accuracy(c, tmp_test_data)

The full code is on my github but these were the results I saw:

$ time python scripts/detect_speaker.py
Classifier              speaker precision    speaker recall    non-speaker precision    non-speaker recall
--------------------  -------------------  ----------------  -----------------------  --------------------
Decision Tree All In             0.911765          0.939394                 0.997602              0.996407
Decision Tree Words              0.911765          0.939394                 0.997602              0.996407
Decision Tree POS                0.90099           0.919192                 0.996804              0.996008

There’s still not much in it – the POS one has slightly more false positives and false negatives when classifying speakers but on other runs it performed better.

If we take a look at the decision tree that’s been built for the POS one we can see that it’s all about POS now as you’d expect:

>>> print(c.pseudocode(depth=2))
if next-word-pos == '$': return False
if next-word-pos == "''": return False
if next-word-pos == ',': return False
if next-word-pos == '-NONE-': return False
if next-word-pos == '.': return False
if next-word-pos == ':':
  if prev-word-pos == ',': return False
  if prev-word-pos == '.': return False
  if prev-word-pos == ':': return False
  if prev-word-pos == '<START>': return True
  if prev-word-pos == 'CC': return False
  if prev-word-pos == 'CD': return False
  if prev-word-pos == 'DT': return False
  if prev-word-pos == 'IN': return False
  if prev-word-pos == 'JJ': return False
  if prev-word-pos == 'JJS': return False
  if prev-word-pos == 'MD': return False
  if prev-word-pos == 'NN': return False
  if prev-word-pos == 'NNP': return False
  if prev-word-pos == 'NNS': return False
  if prev-word-pos == 'POS': return False
  if prev-word-pos == 'PRP': return False
  if prev-word-pos == 'PRP$': return False
  if prev-word-pos == 'RB': return False
  if prev-word-pos == 'RP': return False
  if prev-word-pos == 'TO': return False
  if prev-word-pos == 'VB': return False
  if prev-word-pos == 'VBD': return False
  if prev-word-pos == 'VBG': return False
  if prev-word-pos == 'VBN': return True
  if prev-word-pos == 'VBP': return False
  if prev-word-pos == 'VBZ': return False
if next-word-pos == '<END>': return False
if next-word-pos == 'CC': return False
if next-word-pos == 'CD':
  if word-pos == '$': return False
  if word-pos == ',': return False
  if word-pos == ':': return True
  if word-pos == 'CD': return True
  if word-pos == 'DT': return False
  if word-pos == 'IN': return False
  if word-pos == 'JJ': return False
  if word-pos == 'JJR': return False
  if word-pos == 'JJS': return False
  if word-pos == 'NN': return False
  if word-pos == 'NNP': return False
  if word-pos == 'PRP$': return False
  if word-pos == 'RB': return False
  if word-pos == 'VB': return False
  if word-pos == 'VBD': return False
  if word-pos == 'VBG': return False
  if word-pos == 'VBN': return False
  if word-pos == 'VBP': return False
  if word-pos == 'VBZ': return False
  if word-pos == 'WDT': return False
  if word-pos == '``': return False
if next-word-pos == 'DT': return False
if next-word-pos == 'EX': return False
if next-word-pos == 'IN': return False
if next-word-pos == 'JJ': return False
if next-word-pos == 'JJR': return False
if next-word-pos == 'JJS': return False
if next-word-pos == 'MD': return False
if next-word-pos == 'NN': return False
if next-word-pos == 'NNP': return False
if next-word-pos == 'NNPS': return False
if next-word-pos == 'NNS': return False
if next-word-pos == 'PDT': return False
if next-word-pos == 'POS': return False
if next-word-pos == 'PRP': return False
if next-word-pos == 'PRP$': return False
if next-word-pos == 'RB': return False
if next-word-pos == 'RBR': return False
if next-word-pos == 'RBS': return False
if next-word-pos == 'RP': return False
if next-word-pos == 'TO': return False
if next-word-pos == 'UH': return False
if next-word-pos == 'VB': return False
if next-word-pos == 'VBD': return False
if next-word-pos == 'VBG': return False
if next-word-pos == 'VBN': return False
if next-word-pos == 'VBP': return False
if next-word-pos == 'VBZ': return False
if next-word-pos == 'WDT': return False
if next-word-pos == 'WP': return False
if next-word-pos == 'WRB': return False
if next-word-pos == '``': return False

I like that it’s identified the ‘:‘ pattern:

if next-word-pos == ':':
  ...
  if prev-word-pos == '<START>': return True

Next I need to drill into the types of sentence structures that it’s failing on and work out some features that can handle those. I still need to see how well a random forest of decision trees would do as well.
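
As a starting point for that random forest experiment, here’s a minimal sketch using scikit-learn, assuming the features have been pushed through DictVectorizer and train_test_split into X_train/Y_train/X_test as in the post above (n_estimators = 50 here is an arbitrary choice):

from sklearn.ensemble import RandomForestClassifier
 
# assumes X_train / Y_train / X_test were produced by DictVectorizer + train_test_split
forest = RandomForestClassifier(n_estimators = 50)
forest = forest.fit(X_train, Y_train)
predictions = forest.predict(X_test)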

Categories: Programming

R/ggplot: Controlling X axis order

Fri, 02/27/2015 - 01:49

As part of a talk I gave at the Neo4j London meetup earlier this week I wanted to show how you could build a simple chart showing the number of friends that different actors had using the ggplot library.

I started out with the following code:

library(dplyr)
library(ggplot2)
 
df = read.csv("/tmp/friends.csv")
top = df %>% head(20)
 
ggplot(aes(x = p.name, y = colleagues), data = top) + 
  geom_bar(fill = "dark blue", stat = "identity")

The friends CSV file is available as a gist if you want to reproduce the chart. This is how the chart renders:

2015 02 27 00 41 08

It’s in a fairly arbitrary order whereas it would be much more useful if the most popular people showed from left to right.

I had the people’s names in the correct order in the data frame but annoyingly it was then sorting them into alphabetical order. Luckily I came across the scale_x_discrete function which does exactly what I needed.

If we pass in the list of names to that function we get the chart we desire:

ggplot(aes(x = p.name, y = colleagues), data = top) + 
  geom_bar(fill = "dark blue", stat = "identity") + 
  scale_x_discrete(limits= top$p.name)

2015 02 27 00 45 01

Categories: Programming

R: Conditionally updating rows of a data frame

Thu, 02/26/2015 - 01:45

In a blog post I wrote a couple of days ago about cohort analysis I had to assign a monthNumber to each row in a data frame and started out with the following code:

library(zoo)
library(dplyr)
 
monthNumber = function(cohort, date) {
  cohortAsDate = as.yearmon(cohort)
  dateAsDate = as.yearmon(date)
 
  if(cohortAsDate > dateAsDate) {
    "NA"
  } else {
    paste(round((dateAsDate - cohortAsDate) * 12), sep="")
  }
}
 
cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(monthNumber != "0") %>% 
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber)

If we time this function using system.time we’ll see that it’s not very snappy:

system.time(cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(monthNumber != "0") %>% 
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber))
 
   user  system elapsed 
  1.968   0.019   2.016

The reason for the poor performance is that we process each row of the data frame individually due to the call to group_by on the second line. One way we can refactor the code is to use the ifelse function, which can process multiple rows at a time:

system.time(
cohortAttendance %>% 
  mutate(monthNumber = ifelse(as.yearmon(cohort) > as.yearmon(date), 
                              NA, 
                              paste(round((as.yearmon(date) - as.yearmon(cohort)) * 12), sep=""))))
   user  system elapsed 
  0.026   0.000   0.026

Antonios suggested another approach which involves first setting every row to ‘NA’ and then selectively updating the appropriate rows. I ended up with the following code:

cohortAttendance$monthNumber = NA
 
isAfterCohort = as.yearmon(cohortAttendance$cohort) <= as.yearmon(cohortAttendance$date)
cohortAttendance$monthNumber[isAfterCohort] = paste(round((as.yearmon(cohortAttendance$date[isAfterCohort]) - as.yearmon(cohortAttendance$cohort[isAfterCohort])) * 12), sep="")

Let’s measure that:

system.time(paste(round((as.yearmon(cohortAttendance$date) - as.yearmon(cohortAttendance$cohort)) * 12), sep=""))
   user  system elapsed 
  0.013   0.000   0.013

Both approaches are much quicker than my original version although this one seems to be marginally quicker than the ifelse approach.

Note to future Mark: try to avoid grouping by row number – there’s usually a better and faster solution!

Categories: Programming

Python/nltk: Naive vs Naive Bayes vs Decision Tree

Tue, 02/24/2015 - 23:39

Last week I wrote a blog post describing a decision tree I’d trained to detect the speakers in a How I met your mother transcript and after writing the post I wondered whether a simple classifier would do the job.

The simple classifier will work on the assumption that any word followed by a “:” is a speaker and anything else isn’t. Here’s the definition of a NaiveClassifier:

import nltk
from nltk import ClassifierI
 
class NaiveClassifier(ClassifierI):
    def classify(self, featureset):
        if featureset['next-word'] == ":":
            return True
        else:
            return False

As you can see it only implements the classify method and executes a static check.

While reading about ways to evaluate the effectiveness of text classifiers I came across Jacob Perkins blog which suggests that we should measure two things: precision and recall.

  • Higher precision means fewer false positives, while lower precision means more false positives.
  • Higher recall means fewer false negatives, while lower recall means more false negatives.
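
As a quick sanity check of those definitions, here’s a toy example using the same nltk.metrics functions that appear below; the sets hold the indices of items labelled as speakers:

import nltk
 
reference = set([0, 1, 2, 3])  # items that really are speakers
test = set([1, 2, 3, 4])       # items our classifier labelled as speakers
 
print(nltk.metrics.precision(reference, test))  # 0.75 - one of the four predictions is a false positive
print(nltk.metrics.recall(reference, test))     # 0.75 - one of the four real speakers was missed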

If (like me) you often get confused between false positives and negatives the following photo should help fix that:

False positive negative

I wrote the following function (adapted from Jacob’s blog post) to calculate precision and recall values for a given classifier:

import nltk
import collections
 
def assess_classifier(classifier, test_data, text):
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)
    for i, (feats, label) in enumerate(test_data):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)
 
    speaker_precision = nltk.metrics.precision(refsets[True], testsets[True])
    speaker_recall = nltk.metrics.recall(refsets[True], testsets[True])
 
    non_speaker_precision = nltk.metrics.precision(refsets[False], testsets[False])
    non_speaker_recall = nltk.metrics.recall(refsets[False], testsets[False])
 
    return [text, speaker_precision, speaker_recall, non_speaker_precision, non_speaker_recall]

Now let’s call that function with each of our classifiers:

import json
 
from sklearn.cross_validation import train_test_split
from himymutil.ml import pos_features
from himymutil.naive import NaiveClassifier
from tabulate import tabulate
 
with open("data/import/trained_sentences.json", "r") as json_file:
    json_data = json.load(json_file)
 
tagged_sents = []
for sentence in json_data:
    tagged_sents.append([(word["word"], word["speaker"]) for word in sentence["words"]])
 
featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append( (pos_features(untagged_sent, i), tag) )
 
train_data,test_data = train_test_split(featuresets, test_size=0.20, train_size=0.80)
 
table = []
table.append(assess_classifier(NaiveClassifier(), test_data, "Naive"))
table.append(assess_classifier(nltk.NaiveBayesClassifier.train(train_data), test_data, "Naive Bayes"))
table.append(assess_classifier(nltk.DecisionTreeClassifier.train(train_data), test_data, "Decision Tree"))
 
print(tabulate(table, headers=["Classifier","speaker precision", "speaker recall", "non-speaker precision", "non-speaker recall"]))

I’m using the tabulate library to print out a table showing each of the classifiers and their associated value for precision and recall. If we execute this file we’ll see the following output:

$ python scripts/detect_speaker.py
Classifier       speaker precision    speaker recall    non-speaker precision    non-speaker recall
-------------  -------------------  ----------------  -----------------------  --------------------
Naive                     0.9625            0.846154                 0.994453              0.998806
Naive Bayes               0.674603          0.934066                 0.997579              0.983685
Decision Tree             0.965517          0.923077                 0.997219              0.998806

The naive classifier is good on most measures but makes some mistakes on speaker recall – we have roughly 15% false negatives i.e. about 15% of words that should be classified as speaker aren’t.

Naive Bayes does poorly in terms of speaker false positives – 1/3 of the time when we say a word is a speaker it actually isn’t.

The decision tree performs best but has 8% speaker false negatives – 8% of words that should be classified as speakers aren’t.

The code is on github if you want to play around with it.

Categories: Programming

R: Cohort analysis of Neo4j meetup members

Tue, 02/24/2015 - 02:19

A few weeks ago I came across a blog post explaining how to apply cohort analysis to customer retention using R and I thought it’d be a fun exercise to calculate something similar for meetup attendees.

In the customer retention example we track customer purchases on a month by month basis and each customer is put into a cohort or bucket based on the first month they made a purchase in.

We then calculate how many of them made purchases in subsequent months and compare that with the behaviour of people in other cohorts.

In our case we aren’t selling anything so our equivalent will be a person attending a meetup. We’ll put people into cohorts based on the month of the first meetup they attended.

This can act as a proxy for when people become interested in a technology and could perhaps allow us to see how the behaviour of innovators, early adopters and the early majority differs, if at all.

The first thing we need to do is get the data showing the events that people RSVP’d ‘yes’ to. I’ve already got the data in Neo4j so we’ll write a query to extract it as a data frame:

library(RNeo4j)
library(dplyr)
library(ggplot2)
library(zoo)
 
graph = startGraph("http://127.0.0.1:7474/db/data/")
 
query = "MATCH (g:Group {name: \"Neo4j - London User Group\"})-[:HOSTED_EVENT]->(e),
               (e)<-[:TO]-(rsvp {response: \"yes\"})<-[:RSVPD]-(person)
         RETURN rsvp.time, person.id"
 
timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
 
df = cypher(graph, query)
df$time = timestampToDate(df$rsvp.time)
df$date = format(as.Date(df$time), "%Y-%m")
> df %>% head()
##         rsvp.time person.id                time    date
## 612  1.404857e+12  23362191 2014-07-08 22:00:29 2014-07
## 1765 1.380049e+12 112623332 2013-09-24 18:58:00 2013-09
## 1248 1.390563e+12   9746061 2014-01-24 11:24:35 2014-01
## 1541 1.390920e+12   7881615 2014-01-28 14:40:35 2014-01
## 3056 1.420670e+12  12810159 2015-01-07 22:31:04 2015-01
## 607  1.406025e+12  14329387 2014-07-22 10:34:51 2014-07
## 1634 1.391445e+12  91330472 2014-02-03 16:33:58 2014-02
## 2137 1.371453e+12  68874702 2013-06-17 07:17:10 2013-06
## 430  1.407835e+12 150265192 2014-08-12 09:15:31 2014-08
## 2957 1.417190e+12 182752269 2014-11-28 15:45:18 2014-11

Next we need to find the first meetup that a person attended – this will determine the cohort that the person is assigned to:

firstMeetup = df %>% 
  group_by(person.id) %>% 
  summarise(firstEvent = min(time), count = n()) %>% 
  arrange(desc(count))
 
> firstMeetup
## Source: local data frame [10 x 3]
## 
##    person.id          firstEvent count
## 1   13526622 2013-01-24 20:25:19     2
## 2  119400912 2014-10-03 13:09:09     2
## 3  122524352 2014-08-14 14:09:44     1
## 4   37201052 2012-05-21 10:26:24     3
## 5  137112712 2014-07-31 09:32:12     1
## 6  152448642 2014-06-20 08:32:50    17
## 7   98563682 2014-11-05 17:27:57     1
## 8  146976492 2014-05-17 00:04:42     4
## 9   12318409 2014-11-03 05:25:26     2
## 10  41280492 2014-10-16 19:02:03     5

Let’s assign each person to a cohort (month/year) and see how many people belong to each one:

firstMeetup$date = format(as.Date(firstMeetup$firstEvent), "%Y-%m")
byMonthYear = firstMeetup %>% count(date) %>% arrange(date)
 
ggplot(aes(x=date, y = n), data = byMonthYear) + 
  geom_bar(stat="identity", fill = "dark blue") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Unnamed chunk 4 1

Next we need to track a cohort over time to see whether people keep coming to events. I wrote the following function to work it out:

countsForCohort = function(df, firstMeetup, cohort) {
  members = (firstMeetup %>% filter(date == cohort))$person.id
 
  attendance = df %>% 
    filter(person.id %in% members) %>% 
    count(person.id, date) %>% 
    ungroup() %>%
    count(date)
 
  allCohorts = df %>% select(date) %>% unique
  cohortAttendance = merge(allCohorts, attendance, by = "date", all = TRUE)
 
  cohortAttendance[is.na(cohortAttendance) & cohortAttendance$date > cohort] = 0
  cohortAttendance %>% mutate(cohort = cohort, retention = n / length(members))  
}

On the first line we get the ids of all the people in the cohort so that we can filter the data frame to only include RSVPs by these people. The first call to ‘count’ makes sure that we only have one entry per person per month and the second call gives us a count of how many people attended an event in a given month.

Next we do the equivalent of a left join using the merge function to ensure we have a row representing each month even if no one from the cohort attended. This will lead to NA entries where there’s no matching row in the ‘attendance’ data frame – we replace those with a 0 for months after the cohort started, and leave the months before the cohort started as NA.

Finally we calculate the retention rate for each month for that cohort. e.g. these are some of the rows for the ‘2011-06′ cohort:

> countsForCohort(df, firstMeetup, "2011-06") %>% sample_n(10)
      date n  cohort retention
16 2013-01 1 2011-06      0.25
5  2011-10 1 2011-06      0.25
30 2014-03 0 2011-06      0.00
29 2014-02 0 2011-06      0.00
40 2015-01 0 2011-06      0.00
31 2014-04 0 2011-06      0.00
8  2012-04 2 2011-06      0.50
39 2014-12 0 2011-06      0.00
2  2011-07 1 2011-06      0.25
19 2013-04 1 2011-06      0.25

We could then choose to plot that cohort:

ggplot(aes(x=date, y = retention, colour = cohort), data = countsForCohort(df, firstMeetup, "2011-06")) + 
  geom_line(aes(group = cohort)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Unnamed chunk 5 1

From this chart we can see that none of the people who first attended a Neo4j meetup in June 2011 have attended any events for the last two years.

Next we want to be able to plot multiple cohorts on the same chart which we can easily do by constructing one big data frame and passing it to ggplot:

cohorts = collect(df %>% select(date) %>% unique())[,1]
 
cohortAttendance = data.frame()
for(cohort in cohorts) {
  cohortAttendance = rbind(cohortAttendance,countsForCohort(df, firstMeetup, cohort))      
}
 
ggplot(aes(x=date, y = retention, colour = cohort), data = cohortAttendance) + 
  geom_line(aes(group = cohort)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Unnamed chunk 5 2

This all looks a bit of a mess and at the moment we can’t easily compare cohorts as they start at different places on the x axis. We can fix that by adding a ‘monthNumber’ column to the data frame which we calculate with the following function:

monthNumber = function(cohort, date) {
  cohortAsDate = as.yearmon(cohort)
  dateAsDate = as.yearmon(date)
 
  if(cohortAsDate > dateAsDate) {
    "NA"
  } else {
    paste(round((dateAsDate - cohortAsDate) * 12), sep="")
  }
}

Now let’s create a new data frame with the month field added:

cohortAttendanceWithMonthNumber = cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(monthNumber != "0") %>% 
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber)

We’re also filtering out any ‘NA’ rows, which would represent entries for months from before the cohort started. We don’t want to plot those.

Finally let’s plot a chart containing all cohorts normalised by month number:

ggplot(aes(x=monthNumber, y = retention, colour = cohort), data = cohortAttendanceWithMonthNumber) + 
  geom_line(aes(group = cohort)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1), panel.background = element_blank())
Unnamed chunk 5 3

It’s still a bit of a mess but what stands out is that when the number of people in a cohort is small the fluctuation in the retention value can be quite pronounced.

The next step is to make the cohorts a bit more coarse grained to see if it reveals some insights. I think I’ll start out with a cohort covering a 3 month period and see how that works out.

Categories: Programming

R/dplyr: Extracting data frame column value for filtering with %in%

Sun, 02/22/2015 - 09:58

I’ve been playing around with dplyr over the weekend and wanted to extract the values from a data frame column to use in a later filtering step.

I had a data frame:

library(dplyr)
df = data.frame(userId = c(1,2,3,4,5), score = c(2,3,4,5,5))

And wanted to extract the userIds of those people who have a score greater than 3. I started with:

highScoringPeople = df %>% filter(score > 3) %>% select(userId)
> highScoringPeople
  userId
1      3
2      4
3      5

And then filtered the data frame expecting to get back those 3 people:

> df %>% filter(userId %in% highScoringPeople)
[1] userId score 
<0 rows> (or 0-length row.names)

No rows! I created vector with the numbers 3-5 to make sure that worked:

> df %>% filter(userId %in% c(3,4,5))
  userId score
1      3     4
2      4     5
3      5     5

That works as expected so highScoringPeople obviously isn’t in the right format to facilitate an ‘in lookup’. Let’s explore:

> str(c(3,4,5))
 num [1:3] 3 4 5
 
> str(highScoringPeople)
'data.frame':	3 obs. of  1 variable:
 $ userId: num  3 4 5

Now it’s even more obvious why it doesn’t work – highScoringPeople is still a data frame when we need it to be a vector/list.

One way to fix this is to extract the userIds using the $ syntax instead of the select function:

highScoringPeople = (df %>% filter(score > 3))$userId
 
> str(highScoringPeople)
 num [1:3] 3 4 5
 
> df %>% filter(userId %in% highScoringPeople)
  userId score
1      3     4
2      4     5
3      5     5

Or if we want to do the column selection using dplyr we can extract the values for the column like this:

highScoringPeople = (df %>% filter(score > 3) %>% select(userId))[[1]]
 
> str(highScoringPeople)
 num [1:3] 3 4 5

Not so difficult after all.

Categories: Programming

Python/scikit-learn: Detecting which sentences in a transcript contain a speaker

Fri, 02/20/2015 - 23:42

Over the past couple of months I’ve been playing around with How I met your mother transcripts and the most recent thing I’ve been working on is how to extract the speaker for a particular sentence.

This initially seemed like a really simple problem as most of the initial sentences I looked at were structured like this:

<speaker>: <sentence>

If they were all in that format then we could write a simple regular expression and then move on but unfortunately they aren’t. We could probably write a more complex regex to pull out the speaker but I thought it’d be fun to see if I could train a model to work it out instead.
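
For reference, the simple regex version might look something like this; it’s only a sketch (the pattern is my own guess) and it falls over as soon as a line doesn’t follow the ‘<speaker>: <sentence>’ structure:

import re
 
# matches lines such as "Robin: Hi Ted, how are you?" and captures the speaker
speaker_pattern = re.compile(r"^([A-Za-z .']+):\s*(.*)$")
 
match = speaker_pattern.match("Robin: Hi Ted, how are you?")
if match:
    speaker, sentence = match.groups()  # ('Robin', 'Hi Ted, how are you?')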

The approach I’ve taken is derived from an example in the NLTK book.

The first problem with this approach was that I didn’t have any labelled data to work with so I wrote a little web application that made it easy for me to train chunks of sentences at a time:

2015 02 20 00 44 38

I stored the trained words in a JSON file. Each entry looks like this:

import json
with open("data/import/trained_sentences.json", "r") as json_file:
    json_data = json.load(json_file)
 
>>> json_data[0]
{u'words': [{u'word': u'You', u'speaker': False}, {u'word': u'ca', u'speaker': False}, {u'word': u"n't", u'speaker': False}, {u'word': u'be', u'speaker': False}, {u'word': u'friends', u'speaker': False}, {u'word': u'with', u'speaker': False}, {u'word': u'Robin', u'speaker': False}, {u'word': u'.', u'speaker': False}]}
 
>>> json_data[1]
{u'words': [{u'word': u'Robin', u'speaker': True}, {u'word': u':', u'speaker': False}, {u'word': u'Well', u'speaker': False}, {u'word': u'...', u'speaker': False}, {u'word': u'it', u'speaker': False}, {u'word': u"'s", u'speaker': False}, {u'word': u'a', u'speaker': False}, {u'word': u'bit', u'speaker': False}, {u'word': u'early', u'speaker': False}, {u'word': u'...', u'speaker': False}, {u'word': u'but', u'speaker': False}, {u'word': u'...', u'speaker': False}, {u'word': u'of', u'speaker': False}, {u'word': u'course', u'speaker': False}, {u'word': u',', u'speaker': False}, {u'word': u'I', u'speaker': False}, {u'word': u'might', u'speaker': False}, {u'word': u'consider', u'speaker': False}, {u'word': u'...', u'speaker': False}, {u'word': u'I', u'speaker': False}, {u'word': u'moved', u'speaker': False}, {u'word': u'here', u'speaker': False}, {u'word': u',', u'speaker': False}, {u'word': u'let', u'speaker': False}, {u'word': u'me', u'speaker': False}, {u'word': u'think', u'speaker': False}, {u'word': u'.', u'speaker': False}]}

Each word in the sentence is represented by a JSON object which also indicates if that word was a speaker in the sentence.

Feature selection

Now that I’ve got some trained data to work with I needed to choose which features I’d use to train my model.

One of the most obvious indicators that a word is the speaker in the sentence is that the next word is ‘:’ so ‘next word’ can be a feature. I also went with ‘previous word’ and the word itself for my first cut.

This is the function I wrote to convert a word in a sentence into a set of features:

def pos_features(sentence, i):
    features = {}
    features["word"] = sentence[i]
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    if i == len(sentence) - 1:
        features["next-word"] = "<END>"
    else:
        features["next-word"] = sentence[i+1]
    return features

Let’s try a couple of examples:

import nltk
 
>>> pos_features(nltk.word_tokenize("Robin: Hi Ted, how are you?"), 0)
{'prev-word': '<START>', 'word': 'Robin', 'next-word': ':'}
 
>>> pos_features(nltk.word_tokenize("Robin: Hi Ted, how are you?"), 5)
{'prev-word': ',', 'word': 'how', 'next-word': 'are'}

Now let’s run that function over our full set of labelled data:

with open("data/import/trained_sentences.json", "r") as json_file:
    json_data = json.load(json_file)
 
tagged_sents = []
for sentence in json_data:
    tagged_sents.append([(word["word"], word["speaker"]) for word in sentence["words"]])
 
featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append( (pos_features(untagged_sent, i), tag) )

Here’s a sample of the contents of featuresets:

>>> featuresets[:5]
[({'prev-word': '<START>', 'word': u'You', 'next-word': u'ca'}, False), ({'prev-word': u'You', 'word': u'ca', 'next-word': u"n't"}, False), ({'prev-word': u'ca', 'word': u"n't", 'next-word': u'be'}, False), ({'prev-word': u"n't", 'word': u'be', 'next-word': u'friends'}, False), ({'prev-word': u'be', 'word': u'friends', 'next-word': u'with'}, False)]

It’s nearly time to train our model, but first we need to split our labelled data into training and test sets so we can see how well our model performs on data it hasn’t seen before. scikit-learn has a function that does this for us:

from sklearn.cross_validation import train_test_split
train_data,test_data = train_test_split(featuresets, test_size=0.20, train_size=0.80)
 
>>> len(train_data)
9480
 
>>> len(test_data)
2370

Now let’s train our model. I decided to try out Naive Bayes and Decision tree models to see how they got on:

>>> classifier = nltk.NaiveBayesClassifier.train(train_data)
>>> print nltk.classify.accuracy(classifier, test_data)
0.977215189873
 
>>> classifier = nltk.DecisionTreeClassifier.train(train_data)
>>> print nltk.classify.accuracy(classifier, test_data)
0.997046413502

It looks like both are doing a good job here with the decision tree doing slightly better. One thing to keep in mind is that most of the sentences we’ve trained on are in the form ‘<speaker>: <sentence>’ and we can get those correct with a simple regex, so we should expect the accuracy to be very high.

If we explore the internals of the decision tree we’ll see that it’s massively overfitting which makes sense given our small training data set and the repetitiveness of the data:

>>> print(classifier.pseudocode(depth=2))
if next-word == u'!': return False
if next-word == u'$': return False
...
if next-word == u"'s": return False
if next-word == u"'ve": return False
if next-word == u'(':
  if word == u'!': return False
  ...
if next-word == u'*': return False
if next-word == u'*****': return False
if next-word == u',':
  if word == u"''": return False
  ...
if next-word == u'--': return False
if next-word == u'.': return False
if next-word == u'...':
  ...
  if word == u'who': return False
  if word == u'you': return False
if next-word == u'/i': return False
if next-word == u'1': return True
...
if next-word == u':':
  if prev-word == u"'s": return True
  if prev-word == u',': return False
  if prev-word == u'...': return False
  if prev-word == u'2030': return True
  if prev-word == '<START>': return True
  if prev-word == u'?': return False
...
if next-word == u'\u266a\u266a': return False

One update I may make to the features is to include the part of speech of the word rather than its actual value to see if that makes the model a bit more general. Another option is to train a bunch of decision trees against a subset of the data and build an ensemble/random forest of those trees.

Once I’ve got a working ‘speaker detector’ I want to then go and work out who the likely speaker is for the sentences which don’t contain a speaker. The plan is to calculate the word distributions of the speakers from sentences I do have and then calculate the probability that they spoke the unlabelled sentences.

This might not work perfectly as there could be new characters in those episodes but hopefully we can come up with something decent.
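
A rough sketch of what that scoring might look like, assuming we’ve already built a dict mapping each speaker to the words from their labelled sentences; the function and variable names here are made up and the smoothing is the crudest possible:

from nltk import FreqDist
 
def speaker_scores(words_by_speaker, unlabelled_words):
    # words_by_speaker: {speaker name: list of words from sentences they spoke}
    scores = {}
    for speaker, words in words_by_speaker.items():
        dist = FreqDist(words)
        total = float(dist.N())
        score = 1.0
        for word in unlabelled_words:
            # add-one smoothing so unseen words don't zero out the score
            score *= (dist[word] + 1) / (total + len(dist))
        scores[speaker] = score
    return scores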

The full code for this example is on github if you want to have a play with it.

Any suggestions for improvements are always welcome in the comments.

Categories: Programming

Python’s pandas vs Neo4j’s cypher: Exploring popular phrases in How I met your mother transcripts

Thu, 02/19/2015 - 01:52

I’ve previously written about extracting TF/IDF scores for phrases in documents using scikit-learn and the final step in that post involved writing the words into a CSV file for analysis later on.

I wasn’t sure what the most appropriate tool of choice for that analysis was so I decided to explore the data using Python’s pandas library and load it into Neo4j and write some Cypher queries.

To do anything with Neo4j we need to first load the CSV file into the database. The easiest way to do that is with Cypher’s LOAD CSV command.

First we’ll load the phrases in and then we’ll connect them to the episodes which were previously loaded:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/tfidf_scikit.csv" AS row
MERGE (phrase:Phrase {value: row.Phrase});
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/tfidf_scikit.csv" AS row
MATCH (phrase:Phrase {value: row.Phrase})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (phrase)-[:USED_IN_EPISODE {tfidfScore: TOFLOAT(row.Score)}]->(episode);

Now we’re ready to start writing some queries. To start with we’ll write a simple query to find the top 3 phrases for each episode.

In pandas this is quite easy – we just need to group by the appropriate field and then take the top 3 records in that grouping:

import pandas as pd
 
# load the TF/IDF scores we generated previously
df = pd.read_csv("data/import/tfidf_scikit.csv")
 
top_words_by_episode = df \
    .sort(["EpisodeId", "Score"], ascending = [True, False]) \
    .groupby(["EpisodeId"], sort = False) \
    .head(3)
 
>>> print(top_words_by_episode.to_string())
 
        EpisodeId              Phrase     Score
3976            1                 ted  0.262518
2912            1              olives  0.195714
2441            1            marshall  0.155515
8143            2                 ted  0.292184
5197            2              carlos  0.227454
7482            2               robin  0.195150
12551           3                 ted  0.232662
9040            3              barney  0.187255
11254           3              mcneil  0.170619
15641           4             natalie  0.562485
16763           4                 ted  0.191873
16234           4               robin  0.102671
20715           5            subtitle  0.310866
18121           5          coat check  0.181682
20861           5                 ted  0.169973
...

The cypher version looks quite similar, the main difference being that we use COLLECT to generate an array of phrases by episode and then take the top 3:

MATCH (e:Episode)<-[rel:USED_IN_EPISODE]-(phrase)
WITH e, rel, phrase
ORDER BY e.id, rel.tfidfScore DESC
RETURN e.id, e.title, COLLECT({phrase: phrase.value, score: rel.tfidfScore})[..3]
ORDER BY e.id
 
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | e.id | e.title                                     | COLLECT({phrase: phrase.value, score: rel.tfidfScore})[..3]                                                                                                               |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | 1    | "Pilot"                                     | [{phrase -> "ted", score -> 0.2625177493269755},{phrase -> "olives", score -> 0.19571419072701732},{phrase -> "marshall", score -> 0.15551468983363487}]                  |
==> | 2    | "Purple Giraffe"                            | [{phrase -> "ted", score -> 0.292184496766088},{phrase -> "carlos", score -> 0.22745438090499026},{phrase -> "robin", score -> 0.19514993122773566}]                      |
==> | 3    | "Sweet Taste of Liberty"                    | [{phrase -> "ted", score -> 0.23266190616714866},{phrase -> "barney", score -> 0.18725456678444408},{phrase -> "officer mcneil", score -> 0.17061872221616137}]           |
==> | 4    | "Return of the Shirt"                       | [{phrase -> "natalie", score -> 0.5624848345525686},{phrase -> "ted", score -> 0.19187323894701674},{phrase -> "robin", score -> 0.10267067360622682}]                    |
==> | 5    | "Okay Awesome"                              | [{phrase -> "subtitle", score -> 0.310865508347106},{phrase -> "coat check", score -> 0.18168178787561182},{phrase -> "ted", score -> 0.16997258596683185}]               |
==> | 6    | "Slutty Pumpkin"                            | [{phrase -> "mike", score -> 0.2966610054610693},{phrase -> "ted", score -> 0.19333276951599407},{phrase -> "robin", score -> 0.1656172994411056}]                        |
==> | 7    | "Matchmaker"                                | [{phrase -> "ellen", score -> 0.4947912795578686},{phrase -> "sarah", score -> 0.24462913913669443},{phrase -> "ted", score -> 0.23728319597607636}]                      |
==> | 8    | "The Duel"                                  | [{phrase -> "ted", score -> 0.26713931416222847},{phrase -> "marshall", score -> 0.22816702335751904},{phrase -> "swords", score -> 0.17841675237702592}]                 |
==> | 9    | "Belly Full of Turkey"                      | [{phrase -> "ericksen", score -> 0.43145756691027665},{phrase -> "mrs ericksen", score -> 0.1939318283559959},{phrase -> "kendall", score -> 0.1846969793866628}]         |
==> | 10   | "The Pineapple Incident"                    | [{phrase -> "ted", score -> 0.439756993033922},{phrase -> "trudy", score -> 0.36367907631894536},{phrase -> "carl", score -> 0.16413071244131686}]                        |
==> | 11   | "The Limo"                                  | [{phrase -> "moby", score -> 0.48314164479037003},{phrase -> "party number", score -> 0.30458929780262456},{phrase -> "ranjit", score -> 0.1991061739767796}]             |
...
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

In the cypher version we get one row per episode whereas with the Python version we get 3 rows. It might be possible to achieve this effect with pandas too but I wasn’t sure how to do so.
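
One option would be to collect the phrases into a list per group, which gets us something closer to the cypher shape; this is only a sketch and uses sort_values, the newer spelling of sort:

top_phrases_per_episode = (
    df.sort_values(["EpisodeId", "Score"], ascending = [True, False])
      .groupby("EpisodeId")
      .apply(lambda g: list(zip(g["Phrase"].head(3), g["Score"].head(3))))
)
print(top_phrases_per_episode.head())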

Next let’s find the top phrases for a single episode – the type of query that might be part of an episode page on a How I met your mother wiki:

top_words = df[(df["EpisodeId"] == 1)] \
    .sort(["Score"], ascending = False) \
    .head(20)
 
>>> print(top_words.to_string())
 
      EpisodeId                Phrase     Score
3976          1                   ted  0.262518
2912          1                olives  0.195714
2441          1              marshall  0.155515
4732          1               yasmine  0.152279
3347          1                 robin  0.130418
209           1                barney  0.124412
2146          1                  lily  0.122925
3637          1                signal  0.103793
1366          1                goanna  0.098138
3524          1                 scene  0.095342
710           1                   cut  0.091734
2720          1              narrator  0.086462
1147          1             flashback  0.078296
1148          1        flashback date  0.070283
3224          1                ranjit  0.069393
4178          1           ted yasmine  0.058569
1149          1  flashback date robin  0.058569
525           1                  carl  0.058210
3714          1           smurf pen1s  0.054365
2048          1              lebanese  0.054365
 
The equivalent cypher query looks like this:
 
MATCH (e:Episode {title: "Pilot"})<-[rel:USED_IN_EPISODE]-(phrase)
WITH phrase, rel
ORDER BY rel.tfidfScore DESC
RETURN phrase.value AS phrase, rel.tfidfScore AS score
LIMIT 20
 
==> +-----------------------------------------------+
==> | phrase                 | score                |
==> +-----------------------------------------------+
==> | "ted"                  | 0.2625177493269755   |
==> | "olives"               | 0.19571419072701732  |
==> | "marshall"             | 0.15551468983363487  |
==> | "yasmine"              | 0.15227880637176266  |
==> | "robin"                | 0.1304175242341549   |
==> | "barney"               | 0.12441175186690791  |
==> | "lily"                 | 0.12292497785945679  |
==> | "signal"               | 0.1037932464656365   |
==> | "goanna"               | 0.09813798750091524  |
==> | "scene"                | 0.09534236041231685  |
==> | "cut"                  | 0.09173366535740156  |
==> | "narrator"             | 0.08646229819848741  |
==> | "flashback"            | 0.07829592155397117  |
==> | "flashback date"       | 0.07028252601773662  |
==> | "ranjit"               | 0.06939276915589167  |
==> | "ted yasmine"          | 0.05856877168144719  |
==> | "flashback date robin" | 0.05856877168144719  |
==> | "carl"                 | 0.058210117288760355 |
==> | "smurf pen1s"          | 0.05436505297972703  |
==> | "lebanese"             | 0.05436505297972703  |
==> +-----------------------------------------------+

Our next query is a negation – find the episodes which don’t mention the phrase ‘robin’. In python we can do some simple set operations to work this out:

all_episodes = set(range(1, 209))
robin_episodes = set(df[(df["Phrase"] == "robin")]["EpisodeId"])
 
>>> print(set(all_episodes) - set(robin_episodes))
set([145, 198, 143])

In cypher land a query will suffice:

MATCH (episode:Episode), (phrase:Phrase {value: "robin"})
WHERE NOT (episode)<-[:USED_IN_EPISODE]-(phrase)
RETURN episode.id AS id, episode.season AS season, episode.number AS episode

And finally a mini recommendation engine type query – how many of the top phrases in Episode 1 were used in other episodes:

First python:

phrases_used = set(df[(df["EpisodeId"] == 1)] \
    .sort(["Score"], ascending = False) \
    .head(10)["Phrase"])
 
phrases = df[df["Phrase"].isin(phrases_used)]
 
print (phrases[phrases["EpisodeId"] != 1] \
    .groupby(["Phrase"]) \
    .size() \
    .order(ascending = False))

Here we’ve pulled it out into a few steps – first we identify the top phrases, then we find out where they occur across the whole data set and finally we filter out the occurrences in the first episode and count the other occurrences.

Phrase
marshall    207
barney      207
ted         206
lily        206
robin       204
scene        36
signal        4
goanna        3
olives        1

In cypher we can write a query to do this as well:

MATCH (episode:Episode {title: "Pilot"})<-[rel:USED_IN_EPISODE]-(phrase)
WITH phrase, rel, episode
ORDER BY rel.tfidfScore DESC
LIMIT 10
MATCH (phrase)-[:USED_IN_EPISODE]->(otherEpisode)
WHERE otherEpisode <> episode
RETURN phrase.value AS phrase, COUNT(*) AS numberOfOtherEpisodes
ORDER BY numberOfOtherEpisodes DESC
 
==> +------------------------------------+
==> | phrase     | numberOfOtherEpisodes |
==> +------------------------------------+
==> | "barney"   | 207                   |
==> | "marshall" | 207                   |
==> | "ted"      | 206                   |
==> | "lily"     | 206                   |
==> | "robin"    | 204                   |
==> | "scene"    | 36                    |
==> | "signal"   | 4                     |
==> | "goanna"   | 3                     |
==> | "olives"   | 1                     |
==> +------------------------------------+

Overall there’s not much in it – for some of the queries I found it easier in cypher and for others easier with pandas. It’s always useful to have multiple tools in the toolbox!

Categories: Programming

Python/pandas: Column value in list (ValueError: The truth value of a Series is ambiguous.)

Mon, 02/16/2015 - 22:39

I’ve been using Python’s pandas library while exploring some CSV files and although for the most part I’ve found it intuitive to use, I had trouble filtering a data frame based on checking whether a column value was in a list.

A subset of one of the CSV files I’ve been working with looks like this:

$ cat foo.csv
"Foo"
1
2
3
4
5
6
7
8
9
10

Loading it into a pandas data frame is reasonably simple:

import pandas as pd
df = pd.read_csv('foo.csv', index_col=False, header=0)
>>> df
   Foo
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
9   10

If we want to find the rows which have a value of 1 we’d write the following:

>>> df[df["Foo"] == 1]
   Foo
0    1

Finding the rows with a value less than 7 is as you’d expect too:

>>> df[df["Foo"] < 7]
   Foo
0    1
1    2
2    3
3    4
4    5
5    6

Next I wanted to filter out the rows containing odd numbers which I initially tried to do like this:

odds = [i for i in range(1,10) if i % 2 != 0]
>>> odds
[1, 3, 5, 7, 9]
 
>>> df[df["Foo"] in odds]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/pandas/core/generic.py", line 698, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Unfortunately that doesn’t work and I couldn’t get any of the suggestions from the error message to work either. Luckily pandas has a special isin function for this use case which we can call like this:

>>> df[df["Foo"].isin(odds)]
   Foo
0    1
2    3
4    5
6    7
8    9

Much better!

Categories: Programming

Python/scikit-learn: Calculating TF/IDF on How I met your mother transcripts

Sun, 02/15/2015 - 16:56

Over the past few weeks I’ve been playing around with various NLP techniques to find interesting insights into How I met your mother from its transcripts and one technique that kept coming up is TF/IDF.

The Wikipedia definition reads like this:

tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

It is often used as a weighting factor in information retrieval and text mining.

The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
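
As a concrete (if simplified) example, here’s the textbook formulation computed by hand for a tiny made-up corpus; scikit-learn’s TfidfVectorizer applies smoothing and normalisation on top of this, so its scores won’t match these numbers exactly:

import math
 
documents = [doc.split() for doc in [
    "ted loves robin",
    "ted loves architecture",
    "barney loves barney",
]]
 
def tf(term, doc):
    return doc.count(term) / float(len(doc))
 
def idf(term, documents):
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / float(containing))
 
def tf_idf(term, doc, documents):
    return tf(term, doc) * idf(term, documents)
 
print(tf_idf("loves", documents[0], documents))  # 0.0   - 'loves' appears in every document
print(tf_idf("robin", documents[0], documents))  # ~0.37 - 'robin' only appears in the first one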

I wanted to generate a TF/IDF representation of phrases used in the hope that it would reveal some common themes used in the show.

Python’s scikit-learn library gives you two ways to generate the TF/IDF representation:

  1. Generate a matrix of token/phrase counts from a collection of text documents using CountVectorizer and feed it to TfidfTransformer to generate the TF/IDF representation.
  2. Feed the collection of text documents directly to TfidfVectorizer and go straight to the TF/IDF representation skipping the middle man.

I started out using the first approach and hadn’t quite got it working when I realised there was a much easier way!
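
For reference, the two step version looks roughly like this, where corpus is the list of episode transcripts built below; this is a sketch rather than the code I ended up with:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
 
count_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,3), stop_words='english')
counts = count_vectorizer.fit_transform(corpus)   # raw phrase counts per episode
tfidf = TfidfTransformer().fit_transform(counts)  # the counts re-weighted into TF/IDF scores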

I have a collection of sentences in a CSV file so the first step is to convert those into a list of documents:

from collections import defaultdict
import csv
 
episodes = defaultdict(list)
with open("data/import/sentences.csv", "r") as sentences_file:
    reader = csv.reader(sentences_file, delimiter=',')
    reader.next()
    for row in reader:
        episodes[row[1]].append(row[4])
 
for episode_id, text in episodes.iteritems():
    episodes[episode_id] = "".join(text)
 
corpus = []
for id, episode in sorted(episodes.iteritems(), key=lambda t: int(t[0])):
    corpus.append(episode)

corpus contains 208 entries (1 per episode), each of which is a string containing the transcript of that episode. Next it’s time to train our TF/IDF model which is only a few lines of code:

from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df = 0, stop_words = 'english')

The most interesting parameter here is ngram_range – we’re telling it to generate 2 and 3 word phrases along with the single words from the corpus.

e.g. if we had the sentence “Python is cool” we’d end up with 6 phrases – ‘Python’, ‘is’, ‘cool’, ‘Python is’, ‘Python is cool’ and ‘is cool’.
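
We can check that quickly with CountVectorizer, which tokenises the same way; note that it lower-cases by default and that on scikit-learn 1.2+ the method is get_feature_names_out rather than get_feature_names:

from sklearn.feature_extraction.text import CountVectorizer
 
cv = CountVectorizer(ngram_range=(1,3))
cv.fit(["Python is cool"])
print(cv.get_feature_names())
# ['cool', 'is', 'is cool', 'python', 'python is', 'python is cool']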

Let’s execute the model against our corpus:

tfidf_matrix =  tf.fit_transform(corpus)
feature_names = tf.get_feature_names()
>>> len(feature_names)
498254
 
>>> feature_names[50:70]
[u'00 does sound', u'00 don', u'00 don buy', u'00 dressed', u'00 dressed blond', u'00 drunkenly', u'00 drunkenly slurred', u'00 fair', u'00 fair tonight', u'00 fall', u'00 fall foliage', u'00 far', u'00 far impossible', u'00 fart', u'00 fart sure', u'00 friends', u'00 friends singing', u'00 getting', u'00 getting guys', u'00 god']

So we’ve got nearly 500,000 phrases and if we look at tfidf_matrix we’d expect it to be a 208 x 498254 matrix – one row per episode, one column per phrase:

>>> tfidf_matrix
<208x498254 sparse matrix of type '<type 'numpy.float64'>'
	with 740396 stored elements in Compressed Sparse Row format>

This is what we’ve got although under the covers it’s using a sparse representation to save space. Let’s convert the matrix to dense format to explore further and find out why:

dense = tfidf_matrix.todense()
>>> len(dense[0].tolist()[0])
498254

What I’ve printed out here is the size of one row of the matrix which contains the TF/IDF score for every phrase in our corpus for the 1st episode of How I met your mother. A lot of those phrases won’t have happened in the 1st episode so let’s filter those out:

episode = dense[0].tolist()[0]
phrase_scores = [pair for pair in zip(range(0, len(episode)), episode) if pair[1] > 0]
 
>>> len(phrase_scores)
4823

There are just under 5000 phrases used in this episode, roughly 1% of the phrases in the whole corpus.
The sparse matrix makes a bit more sense – if scipy used a dense matrix representation there’d be 493,000 entries with no score which becomes more significant as the number of documents increases.
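
To put some numbers on that, using the 740,396 stored elements reported above:

>>> 498254 - 4823  # episode one phrases that a dense row would store as zeros
493431
 
>>> 208 * 498254   # cells in a fully dense matrix
103636832
 
>>> round(740396 / float(208 * 498254), 4)  # fraction of cells that actually hold a score
0.0071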

Next we’ll sort the phrases by score in descending order to find the most interesting phrases for the first episode of How I met your mother:

>>> sorted(phrase_scores, key=lambda t: t[1] * -1)[:5]
[(419207, 0.2625177493269755), (312591, 0.19571419072701732), (267538, 0.15551468983363487), (490429, 0.15227880637176266), (356632, 0.1304175242341549)]

The first value in each tuple is the phrase’s position in our initial vector and also corresponds to the phrase’s position in feature_names which allows us to map the scores back to phrases. Let’s look up a couple of phrases:

>>> feature_names[419207]
u'ted'
>>> feature_names[312591]
u'olives'
>>> feature_names[356632]
u'robin'

Let’s automate that lookup:

sorted_phrase_scores = sorted(phrase_scores, key=lambda t: t[1] * -1)
for phrase, score in [(feature_names[word_id], score) for (word_id, score) in sorted_phrase_scores][:20]:
   print('{0: <20} {1}'.format(phrase, score))
 
ted                  0.262517749327
olives               0.195714190727
marshall             0.155514689834
yasmine              0.152278806372
robin                0.130417524234
barney               0.124411751867
lily                 0.122924977859
signal               0.103793246466
goanna               0.0981379875009
scene                0.0953423604123
cut                  0.0917336653574
narrator             0.0864622981985
flashback            0.078295921554
flashback date       0.0702825260177
ranjit               0.0693927691559
flashback date robin 0.0585687716814
ted yasmine          0.0585687716814
carl                 0.0582101172888
eye patch            0.0543650529797
lebanese             0.0543650529797

We see all the main characters’ names, which aren’t that interesting – perhaps they should be part of the stop list – but also ‘olives’, which is where the olive theory is first mentioned. I thought olives came up more often but a quick search for the term suggests it isn’t mentioned again until Episode 9 in Season 9:

$ grep -rni --color "olives" data/import/sentences.csv | cut -d, -f 2,3,4 | sort | uniq -c
  16 1,1,1
   3 193,9,9

‘yasmine’ is also an interesting phrase in this episode but she’s never mentioned again:

$ grep -h -rni --color "yasmine" data/import/sentences.csv
49:48,1,1,1,"Barney: (Taps a woman names Yasmine) Hi, have you met Ted? (Leaves and watches from a distance)."
50:49,1,1,1,"Ted: (To Yasmine) Hi, I'm Ted."
51:50,1,1,1,Yasmine: Yasmine.
53:52,1,1,1,"Yasmine: Thanks, It's Lebanese."
65:64,1,1,1,"[Cut to the bar, Ted is chatting with Yasmine]"
67:66,1,1,1,Yasmine: So do you think you'll ever get married?
68:67,1,1,1,"Ted: Well maybe eventually. Some fall day. Possibly in Central Park. Simple ceremony, we'll write our own vows. But--eh--no DJ, people will dance. I'm not going to worry about it! Damn it, why did Marshall have to get engaged? (Yasmine laughs) Yeah, nothing hotter than a guy planning out his own imaginary wedding, huh?"
69:68,1,1,1,"Yasmine: Actually, I think it's cute."
79:78,1,1,1,"Lily: You are unbelievable, Marshall. No-(Scene splits in half and shows both Lily and Marshall on top arguing and Ted and Yasmine on the bottom mingling)"
82:81,1,1,1,Ted: (To Yasmine) you wanna go out sometime?
85:84,1,1,1,[Cut to Scene with Ted and Yasmine at bar]
86:85,1,1,1,Yasmine: I'm sorry; Carl's my boyfriend (points to bartender)

It would be interesting to filter out the phrases which don’t occur in any other episode and see what insights we get from doing that. For now though we’ll extract phrases for all episodes and write to CSV so we can explore more easily:

with open("data/import/tfidf_scikit.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["EpisodeId", "Phrase", "Score"])
 
    doc_id = 0
    for doc in tfidf_matrix.todense():
        print "Document %d" %(doc_id)
        word_id = 0
        for score in doc.tolist()[0]:
            if score > 0:
                word = feature_names[word_id]
                writer.writerow([doc_id+1, word.encode("utf-8"), score])
            word_id +=1
        doc_id +=1

And finally a quick look at the contents of the CSV:

$ tail -n 10 data/import/tfidf_scikit.csv
208,york apparently laughs,0.012174304095213192
208,york aren,0.012174304095213192
208,york aren supposed,0.012174304095213192
208,young,0.013397275854758335
208,young ladies,0.012174304095213192
208,young ladies need,0.012174304095213192
208,young man,0.008437685963000223
208,young man game,0.012174304095213192
208,young stupid,0.011506395106658192
208,young stupid sighs,0.012174304095213192
 
Categories: Programming

Neo4j: Building a topic graph with Prismatic Interest Graph API

Sat, 02/14/2015 - 00:38

Over the last few weeks I’ve been using various NLP libraries to derive topics for my corpus of How I met your mother episodes without success and was therefore enthused to see the release of Prismatic’s Interest Graph API.

The Interest Graph API exposes a web service to which you feed a block of text and get back a set of topics, each with an associated score.

It has been trained over the last few years with millions of articles that people share on their social media accounts and in my experience using Prismatic the topics have been very useful for finding new material to read.

The first step is to head to interest-graph.getprismatic.com and get an API key which will be emailed to you.

Having done that we’re ready to make some calls to the API and get back some topics.

I’m going to use Python to call the API, and I’ve found the requests library the easiest to use for this type of work. Our call to the API looks like this:

import requests
payload = { 'title': "insert title of article here",
            'body': "insert body of text here",
            'api-token': "insert token sent by email here"}
r = requests.post("http://interest-graph.getprismatic.com/text/topic", data=payload)
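
The response comes back as JSON. Judging by how it’s used later in this post, the interesting part is a ‘topics’ array where each entry has an id, a topic name and a score, so a minimal sketch for pulling those out looks like this:

# assuming the call succeeded, parse the JSON body and print each topic
# (the 'topics'/'id'/'topic'/'score' keys are the ones relied on later in this post)
response = r.json()
for topic in response["topics"]:
    print topic["id"], topic["topic"], topic["score"]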

One thing to keep in mind is that the API is rate limited to 20 requests a second so we need to restrict our requests or we’re going to receive error response codes. Luckily I came across an excellent blog post showing how to write a decorator around a function and only allow it to execute at a certain frequency.

To rate limit our calls to the Interest Graph we need to pull the above code into a function and annotate it appropriately:

import time
 
def RateLimited(maxPerSecond):
    # decorator which ensures the wrapped function is called at most
    # 'maxPerSecond' times per second
    minInterval = 1.0 / float(maxPerSecond)
    def decorate(func):
        lastTimeCalled = [0.0]
        def rateLimitedFunction(*args, **kargs):
            # sleep for whatever is left of the minimum interval since the last call
            elapsed = time.clock() - lastTimeCalled[0]
            leftToWait = minInterval - elapsed
            if leftToWait > 0:
                time.sleep(leftToWait)
            ret = func(*args, **kargs)
            lastTimeCalled[0] = time.clock()
            return ret
        return rateLimitedFunction
    return decorate
 
@RateLimited(0.3)
def topics(title, body):
    payload = { 'title': title,
                'body': body,
                'api-token': "insert token sent by email here"}
    r = requests.post("http://interest-graph.getprismatic.com/text/topic", data=payload)
    return r
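
With the decorator in place, consecutive calls to topics get spaced out automatically. As a hypothetical sanity check – the titles and bodies here are just placeholders, and a valid API token would need to be filled in first:

# with maxPerSecond=0.3 the decorator enforces a gap of roughly 3.3 seconds
# between calls, so the second request is delayed rather than rejected
start = time.time()
topics("Pilot", "placeholder transcript text")
topics("Purple Giraffe", "some more placeholder text")
print "Two calls took %.2f seconds" % (time.time() - start)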

The text I want to classify is stored in a CSV file – one sentence per line. Here’s a sample:

$ head -n 10 data/import/sentences.csv
SentenceId,EpisodeId,Season,Episode,Sentence
1,1,1,1,Pilot
2,1,1,1,Scene One
3,1,1,1,[Title: The Year 2030]
4,1,1,1,"Narrator: Kids, I'm going to tell you an incredible story. The story of how I met your mother"
5,1,1,1,Son: Are we being punished for something?
6,1,1,1,Narrator: No
7,1,1,1,"Daughter: Yeah, is this going to take a while?"
8,1,1,1,"Narrator: Yes. (Kids are annoyed) Twenty-five years ago, before I was dad, I had this whole other life."
9,1,1,1,"(Music Plays, Title ""How I Met Your Mother"" appears)"

We’ll also need to refer to another CSV file to get the title of each episode since it isn’t being stored with the sentence:

$ head -n 10 data/import/episodes_full.csv
NumberOverall,NumberInSeason,Episode,Season,DateAired,Timestamp,Title,Director,Viewers,Writers,Rating
1,1,/wiki/Pilot,1,"September 19, 2005",1127084400,Pilot,Pamela Fryman,10.94,"Carter Bays,Craig Thomas",68
2,2,/wiki/Purple_Giraffe,1,"September 26, 2005",1127689200,Purple Giraffe,Pamela Fryman,10.40,"Carter Bays,Craig Thomas",63
3,3,/wiki/Sweet_Taste_of_Liberty,1,"October 3, 2005",1128294000,Sweet Taste of Liberty,Pamela Fryman,10.44,"Phil Lord,Chris Miller",67
4,4,/wiki/Return_of_the_Shirt,1,"October 10, 2005",1128898800,Return of the Shirt,Pamela Fryman,9.84,Kourtney Kang,59
5,5,/wiki/Okay_Awesome,1,"October 17, 2005",1129503600,Okay Awesome,Pamela Fryman,10.14,Chris Harris,53
6,6,/wiki/Slutty_Pumpkin,1,"October 24, 2005",1130108400,Slutty Pumpkin,Pamela Fryman,10.89,Brenda Hsueh,62
7,7,/wiki/Matchmaker,1,"November 7, 2005",1131321600,Matchmaker,Pamela Fryman,10.55,"Sam Johnson,Chris Marcil",57
8,8,/wiki/The_Duel,1,"November 14, 2005",1131926400,The Duel,Pamela Fryman,10.35,Gloria Calderon Kellett,46
9,9,/wiki/Belly_Full_of_Turkey,1,"November 21, 2005",1132531200,Belly Full of Turkey,Pamela Fryman,10.29,"Phil Lord,Chris Miller",60

Now we need to get our episode titles and transcripts ready to pass to the topics function. Since we’ve only got ~200 episodes, we can create a dictionary to store that data:

import csv
 
episodes = {}
with open("data/import/episodes_full.csv", "r") as episodesfile:
    episodes_reader = csv.reader(episodesfile, delimiter=",")
    episodes_reader.next()  # skip the header row
    for episode in episodes_reader:
        # column 0 is the overall episode number, column 6 is the title
        episodes[int(episode[0])] = {"title": episode[6], "sentences": []}
 
with open("data/import/sentences.csv", "r") as sentencesfile:
    sentences_reader = csv.reader(sentencesfile, delimiter=",")
    sentences_reader.next()  # skip the header row
    for sentence in sentences_reader:
        # column 1 is the episode id, column 4 is the sentence text
        episodes[int(sentence[1])]["sentences"].append(sentence[4])
 
>>> episodes[1]["title"]
'Pilot'
>>> episodes[1]["sentences"][:5]
['Pilot', 'Scene One', '[Title: The Year 2030]', "Narrator: Kids, I'm going to tell you an incredible story. The story of how I met your mother", 'Son: Are we being punished for something?']

Now we’re going to loop through each of the episodes, call topics and write the result into a CSV file so we can load it into Neo4j afterwards to explore the data:

import json
 
with open("data/import/topics.csv", "w") as topicsfile:
    topics_writer = csv.writer(topicsfile, delimiter=",")
    topics_writer.writerow(["EpisodeId", "TopicId", "Topic", "Score"])
 
    for episode_id, episode in episodes.iteritems():
        tmp = topics(episode["title"], "".join(episode["sentences"])).json()
        print episode_id, tmp
        for topic in tmp['topics']:
            topics_writer.writerow([episode_id, topic["id"], topic["topic"], topic["score"]])

It takes about 10 minutes to run and this is a sample of the output:

$ head -n 10 data/import/topics.csv
EpisodeId,TopicId,Topic,Score
1,1519,Fiction,0.5798245566455255
1,2015,Humour,0.565154963605359
1,24031,Laughing,0.5587120401021765
1,16693,Flirting,0.5514098189505282
1,1163,Dating and Courtship,0.5487490108554022
1,2386,Kissing,0.5476185929151934
1,31929,Puns,0.5375100569837977
2,24031,Laughing,0.5670926949850333
2,1519,Fiction,0.5396488295397263

We’ll use Neo4j’s LOAD CSV command to load the data in:

// make sure the topics exist
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/topics.csv" AS row
MERGE (topic:Topic {id: TOINT(row.TopicId)})
ON CREATE SET topic.value = row.Topic
// now link the episodes and topics
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/topics.csv" AS row
MATCH (topic:Topic {id: TOINT(row.TopicId)})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (episode)-[:TOPIC {score: TOFLOAT(row.Score)}]->(topic)

We’ll assume that the episodes and seasons are already loaded – the commands to load those in are on GitHub.

We can now write some queries against our topic graph. We’ll start simple – show me the topics for an episode:

MATCH (episode:Episode {id: 1})-[r:TOPIC]->(topic)
RETURN topic, r

[Graph visualisation of the query results]

Let’s say we liked the ‘Puns’ aspect of the Pilot episode and want to find out which other episodes had puns. The following query would let us find those:

MATCH (episode:Episode {id: 1})-[r:TOPIC]->(topic {value: "Puns"})<-[:TOPIC]-(other)
RETURN episode, topic, other

[Graph visualisation of the query results]

Or maybe we want to find the episodes which have the most topics in common with the Pilot:

MATCH (episode:Episode {id: 1})-[:TOPIC]->(topic),
      (topic)<-[r:TOPIC]-(otherEpisode)
RETURN otherEpisode.title as episode, COUNT(r) AS topicsInCommon
ORDER BY topicsInCommon DESC
LIMIT 10
==> +------------------------------------------------+
==> | episode                       | topicsInCommon |
==> +------------------------------------------------+
==> | "Purple Giraffe"              | 6              |
==> | "Ten Sessions"                | 5              |
==> | "Farhampton"                  | 4              |
==> | "The Three Days Rule"         | 4              |
==> | "How I Met Everyone Else"     | 4              |
==> | "The Time Travelers"          | 4              |
==> | "Mary the Paralegal"          | 4              |
==> | "Lobster Crawl"               | 4              |
==> | "The Magician's Code, Part 2" | 4              |
==> | "Slutty Pumpkin"              | 4              |
==> +------------------------------------------------+
==> 10 rows

We could then tweak that query to also show each episode’s season and the names of the topics it has in common with the Pilot:

MATCH (episode:Episode {id: 1})-[:TOPIC]->(topic),
      (topic)<-[r:TOPIC]-(otherEpisode)-[:IN_SEASON]->(season)
RETURN otherEpisode.title as episode, season.number AS season, COUNT(r) AS topicsInCommon, COLLECT(topic.value)
ORDER BY topicsInCommon DESC
LIMIT 10
 
==> +-----------------------------------------------------------------------------------------------------------------------------------+
==> | episode                   | season | topicsInCommon | COLLECT(topic.value)                                                        |
==> +-----------------------------------------------------------------------------------------------------------------------------------+
==> | "Purple Giraffe"          | "1"    | 6              | ["Humour","Fiction","Kissing","Dating and Courtship","Flirting","Laughing"] |
==> | "Ten Sessions"            | "3"    | 5              | ["Humour","Puns","Dating and Courtship","Flirting","Laughing"]              |
==> | "How I Met Everyone Else" | "3"    | 4              | ["Humour","Fiction","Dating and Courtship","Laughing"]                      |
==> | "Farhampton"              | "8"    | 4              | ["Humour","Fiction","Kissing","Dating and Courtship"]                       |
==> | "Bedtime Stories"         | "9"    | 4              | ["Humour","Puns","Dating and Courtship","Laughing"]                         |
==> | "Definitions"             | "5"    | 4              | ["Kissing","Dating and Courtship","Flirting","Laughing"]                    |
==> | "Lobster Crawl"           | "8"    | 4              | ["Humour","Dating and Courtship","Flirting","Laughing"]                     |
==> | "Little Boys"             | "3"    | 4              | ["Humour","Puns","Dating and Courtship","Laughing"]                         |
==> | "Wait for It"             | "3"    | 4              | ["Fiction","Puns","Flirting","Laughing"]                                    |
==> | "Mary the Paralegal"      | "1"    | 4              | ["Humour","Dating and Courtship","Flirting","Laughing"]                     |
==> +-----------------------------------------------------------------------------------------------------------------------------------+

Overall, 168 (out of 208) of the other episodes have a topic in common with the first episode, so perhaps just having a topic in common isn’t the best indicator of similarity.

An interesting next step would be to calculate cosine or Jaccard similarity between the episodes and store that value in the graph for querying later on.
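
As a rough illustration of the Jaccard idea, here’s a sketch which compares the topic sets we’ve already seen in the output above – in practice we’d pull the topic sets for every pair of episodes out of the graph and store the score on a relationship:

# Jaccard similarity = size of the intersection divided by size of the union
def jaccard(topics_a, topics_b):
    union = topics_a | topics_b
    if not union:
        return 0.0
    return float(len(topics_a & topics_b)) / len(union)
 
# topic sets taken from the query output above
pilot = {"Fiction", "Humour", "Laughing", "Flirting", "Dating and Courtship", "Kissing", "Puns"}
purple_giraffe = {"Humour", "Fiction", "Kissing", "Dating and Courtship", "Flirting", "Laughing"}
 
print jaccard(pilot, purple_giraffe)
# 6 shared topics out of 7 distinct ones ~= 0.86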

I’ve also calculated the most common bigrams across all the transcripts, so it would be interesting to see whether there are any insights at the intersection of episodes, topics and phrases.

Categories: Programming