Skip to content

Software Development Blogs: Programming, Software Testing, Agile Project Management

Methods & Tools

Subscribe to Methods & Tools
if you are not afraid to read more than one page to be a smarter software developer, software tester or project manager!

Mark Needham
Syndicate content
Thoughts on Software Development
Updated: 3 hours 20 min ago

R/ggplot: Controlling X axis order

Fri, 02/27/2015 - 01:49

As part of a talk I gave at the Neo4j London meetup earlier this week I wanted to show how you could build a simple chart showing the number of friends that different actors had using the ggplot library.

I started out with the following code:

df = read.csv("/tmp/friends.csv")
top = df %>% head(20)
 
ggplot(aes(x = p.name, y = colleagues), data = top) + 
  geom_bar(fill = "dark blue", stat = "identity")

The friends CSV file is available as a gist if you want to reproduce the chart. This is how the chart renders:

2015 02 27 00 41 08

It’s in a fairly arbitrary order when it would be quite cool if we could get the most popular people to show from left to right.

I had the people’s names in the correct order in the data frame but annoyingly it was then sorting them into alphabetical order. Luckily I came across the by using the scale_x_discrete function which does exactly what I needed.

If we pass in the list of names to that function we get the chart we desire:

ggplot(aes(x = p.name, y = colleagues), data = top) + 
  geom_bar(fill = "dark blue", stat = "identity") + 
  scale_x_discrete(limits= top$p.name)

2015 02 27 00 45 01

Categories: Programming

R: Conditionally updating rows of a data frame

Thu, 02/26/2015 - 01:45

In a blog post I wrote a couple of days ago about cohort analysis I had to assign a monthNumber to each row in a data frame and started out with the following code:

library(zoo)
library(dplyr)
 
monthNumber = function(cohort, date) {
  cohortAsDate = as.yearmon(cohort)
  dateAsDate = as.yearmon(date)
 
  if(cohortAsDate > dateAsDate) {
    "NA"
  } else {
    paste(round((dateAsDate - cohortAsDate) * 12), sep="")
  }
}
 
cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(monthNumber != "0") %>% 
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber)

If we time this function using system.time we’ll see that it’s not very snappy:

system.time(cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(monthNumber != "0") %>% 
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber))
 
   user  system elapsed 
  1.968   0.019   2.016

The reason for the poor performance is that we process each row of the data table individually due to the call to group_by on the second line. One way we can refactor the code is to use the ifelse which can process multiple rows at a time:

system.time(
cohortAttendance %>% 
  mutate(monthNumber = ifelse(as.yearmon(cohort) > as.yearmon(date), 
                              paste((round(as.yearmon(date) - as.yearmon(cohort))*12), sep=""), 
                              NA)))
   user  system elapsed 
  0.026   0.000   0.026

Antonios suggested another approach which involves first setting every row to ‘NA’ and then selectively updating the appropriate rows. I ended up with the following code:

cohortAttendance$monthNumber = NA
 
cohortAttendance$monthNumber[as.yearmon(cohortAttendance$cohort) > as.yearmon(cohortAttendance$date)] = paste((round(as.yearmon(cohortAttendance$date) - as.yearmon(cohortAttendance$cohort))*12), sep="")

Let’s measure that:

system.time(paste((round(as.yearmon(cohortAttendance$date) - as.yearmon(cohortAttendance$cohort))*12), sep=""))
   user  system elapsed 
  0.013   0.000   0.013

Both approaches are much quicker than my original version although this one seems to be marginally quicker than the ifelse approach.

Note to future Mark: try to avoid grouping by row number – there’s usually a better and faster solution!

Categories: Programming

Python/nltk: Naive vs Naive Bayes vs Decision Tree

Tue, 02/24/2015 - 23:39

Last week I wrote a blog post describing a decision tree I’d trained to detect the speakers in a How I met your mother transcript and after writing the post I wondered whether a simple classifier would do the job.

The simple classifier will work on the assumption that any word followed by a “:” is a speaker and anything else isn’t. Here’s the definition of a NaiveClassifier:

import nltk
from nltk import ClassifierI
 
class NaiveClassifier(ClassifierI):
    def classify(self, featureset):
        if featureset['next-word'] == ":":
            return True
        else:
            return False

As you can see it only implements the classify method and executes a static check.

While reading about ways to evaluate the effectiveness of text classifiers I came across Jacob Perkins blog which suggests that we should measure two things: precision and recall.

  • Higher precision means less false positives, while lower precision means more false positives.
  • Higher recall means less false negatives, while lower recall means more false negatives.

If (like me) you often get confused between false positives and negatives the following photo should help fix that:

False positive negative

I wrote the following function (adapted from Jacob’s blog post) to calculate precision and recall values for a given classifier:

import nltk
import collections
 
def assess_classifier(classifier, test_data, text):
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)
    for i, (feats, label) in enumerate(test_data):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)
 
    speaker_precision = nltk.metrics.precision(refsets[True], testsets[True])
    speaker_recall = nltk.metrics.recall(refsets[True], testsets[True])
 
    non_speaker_precision = nltk.metrics.precision(refsets[False], testsets[False])
    non_speaker_recall = nltk.metrics.recall(refsets[False], testsets[False])
 
    return [text, speaker_precision, speaker_recall, non_speaker_precision, non_speaker_recall]

Now let’s call that function with each of our classifiers:

import json
 
from sklearn.cross_validation import train_test_split
from himymutil.ml import pos_features
from himymutil.naive import NaiveClassifier
from tabulate import tabulate
 
with open("data/import/trained_sentences.json", "r") as json_file:
    json_data = json.load(json_file)
 
tagged_sents = []
for sentence in json_data:
    tagged_sents.append([(word["word"], word["speaker"]) for word in sentence["words"]])
 
featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append( (pos_features(untagged_sent, i), tag) )
 
train_data,test_data = train_test_split(featuresets, test_size=0.20, train_size=0.80)
 
table = []
table.append(assess_classifier(NaiveClassifier(), test_data, "Naive"))
table.append(assess_classifier(nltk.NaiveBayesClassifier.train(train_data), test_data, "Naive Bayes"))
table.append(assess_classifier(nltk.DecisionTreeClassifier.train(train_data), test_data, "Decision Tree"))
 
print(tabulate(table, headers=["Classifier","speaker precision", "speaker recall", "non-speaker precision", "non-speaker recall"]))

I’m using the tabulate library to print out a table showing each of the classifiers and their associated value for precision and recall. If we execute this file we’ll see the following output:

$ python scripts/detect_speaker.py
Classifier       speaker precision    speaker recall    non-speaker precision    non-speaker recall
-------------  -------------------  ----------------  -----------------------  --------------------
Naive                     0.9625            0.846154                 0.994453              0.998806
Naive Bayes               0.674603          0.934066                 0.997579              0.983685
Decision Tree             0.965517          0.923077                 0.997219              0.998806

The naive classifier is good on most measures but makes some mistakes on speaker recall – we have 16% false negatives i.e. 16% of words that should be classified as speaker aren’t.

Naive Bayes does poorly in terms of speaker false positives – 1/3 of the time when we say a word is a speaker it actually isn’t.

The decision tree performs best but has 8% speaker false negatives – 8% of words that should be classified as speakers aren’t.

The code is on github if you want to play around with it.

Categories: Programming

R: Cohort analysis of Neo4j meetup members

Tue, 02/24/2015 - 02:19

A few weeks ago I came across a blog post explaining how to apply cohort analysis to customer retention using R and I thought it’d be a fun exercise to calculate something similar for meetup attendees.

In the customer retention example we track customer purchases on a month by month basis and each customer is put into a cohort or bucket based on the first month they made a purchase in.

We then calculate how many of them made purchases in subsequent months and compare that with the behaviour of people in other cohorts.

In our case we aren’t selling anything so our equivalent will be a person attending a meetup. We’ll put people into cohorts based on the month of the first meetup they attended.

This can act as a proxy for when people become interested in a technology and could perhaps allow us to see how the behaviour of innovators, early adopters and the early majority differs, if at all.

The first thing we need to do is get the data showing the events that people RSVP’d ‘yes’ to. I’ve already got the data in Neo4j so we’ll write a query to extract it as a data frame:

library(RNeo4j)
graph = startGraph("http://127.0.0.1:7474/db/data/")
 
query = "MATCH (g:Group {name: \"Neo4j - London User Group\"})-[:HOSTED_EVENT]->(e),
               (e)<-[:TO]-(rsvp {response: \"yes\"})<-[:RSVPD]-(person)
         RETURN rsvp.time, person.id"
 
timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
 
df = cypher(graph, query)
df$time = timestampToDate(df$rsvp.time)
df$date = format(as.Date(df$time), "%Y-%m")
> df %>% head()
##         rsvp.time person.id                time    date
## 612  1.404857e+12  23362191 2014-07-08 22:00:29 2014-07
## 1765 1.380049e+12 112623332 2013-09-24 18:58:00 2013-09
## 1248 1.390563e+12   9746061 2014-01-24 11:24:35 2014-01
## 1541 1.390920e+12   7881615 2014-01-28 14:40:35 2014-01
## 3056 1.420670e+12  12810159 2015-01-07 22:31:04 2015-01
## 607  1.406025e+12  14329387 2014-07-22 10:34:51 2014-07
## 1634 1.391445e+12  91330472 2014-02-03 16:33:58 2014-02
## 2137 1.371453e+12  68874702 2013-06-17 07:17:10 2013-06
## 430  1.407835e+12 150265192 2014-08-12 09:15:31 2014-08
## 2957 1.417190e+12 182752269 2014-11-28 15:45:18 2014-11

Next we need to find the first meetup that a person attended – this will determine the cohort that the person is assigned to:

firstMeetup = df %>% 
  group_by(person.id) %>% 
  summarise(firstEvent = min(time), count = n()) %>% 
  arrange(desc(count))
 
> firstMeetup
## Source: local data frame [10 x 3]
## 
##    person.id          firstEvent count
## 1   13526622 2013-01-24 20:25:19     2
## 2  119400912 2014-10-03 13:09:09     2
## 3  122524352 2014-08-14 14:09:44     1
## 4   37201052 2012-05-21 10:26:24     3
## 5  137112712 2014-07-31 09:32:12     1
## 6  152448642 2014-06-20 08:32:50    17
## 7   98563682 2014-11-05 17:27:57     1
## 8  146976492 2014-05-17 00:04:42     4
## 9   12318409 2014-11-03 05:25:26     2
## 10  41280492 2014-10-16 19:02:03     5

Let’s assign each person to a cohort (month/year) and see how many people belong to each one:

firstMeetup$date = format(as.Date(firstMeetup$firstEvent), "%Y-%m")
byMonthYear = firstMeetup %>% count(date) %>% arrange(date)
 
ggplot(aes(x=date, y = n), data = byMonthYear) + 
  geom_bar(stat="identity", fill = "dark blue") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Unnamed chunk 4 1

Next we need to track a cohort over time to see whether people keep coming to events. I wrote the following function to work it out:

countsForCohort = function(df, firstMeetup, cohort) {
  members = (firstMeetup %>% filter(date == cohort))$person.id
 
  attendance = df %>% 
    filter(person.id %in% members) %>% 
    count(person.id, date) %>% 
    ungroup() %>%
    count(date)
 
  allCohorts = df %>% select(date) %>% unique
  cohortAttendance = merge(allCohorts, attendance, by = "date", all = TRUE)
 
  cohortAttendance[is.na(cohortAttendance) & cohortAttendance$date > cohort] = 0
  cohortAttendance %>% mutate(cohort = cohort, retention = n / length(members))  
}

On the first line we get the ids of all the people in the cohort so that we can filter the data frame to only include RSVPs by these people. The first call to ‘count’ makes sure that we only have one entry per person per month and the second call gives us a count of how many people attended an event in a given month.

Next we do the equivalent of a left join using the merge function to ensure we have a row representing each month even if noone from the cohort attended. This will lead to NA entries if there’s no matching row in the ‘attendance’ data frame – we’ll replace those with a 0 if the cohort is in the future. If not we’ll leave it as it is.

Finally we calculate the retention rate for each month for that cohort. e.g. these are some of the rows for the ‘2011-06′ cohort:

> countsForCohort(df, firstMeetup, "2011-06") %>% sample_n(10)
      date n  cohort retention
16 2013-01 1 2011-06      0.25
5  2011-10 1 2011-06      0.25
30 2014-03 0 2011-06      0.00
29 2014-02 0 2011-06      0.00
40 2015-01 0 2011-06      0.00
31 2014-04 0 2011-06      0.00
8  2012-04 2 2011-06      0.50
39 2014-12 0 2011-06      0.00
2  2011-07 1 2011-06      0.25
19 2013-04 1 2011-06      0.25

We could then choose to plot that cohort:

ggplot(aes(x=date, y = retention, colour = cohort), data = countsForCohort(df, firstMeetup, "2011-06")) + 
  geom_line(aes(group = cohort)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Unnamed chunk 5 1

From this chart we can see that none of the people who first attended a Neo4j meetup in June 2011 have attended any events for the last two years.

Next we want to be able to plot multiple cohorts on the same chart which we can easily do by constructing one big data frame and passing it to ggplot:

cohorts = collect(df %>% select(date) %>% unique())[,1]
 
cohortAttendance = data.frame()
for(cohort in cohorts) {
  cohortAttendance = rbind(cohortAttendance,countsForCohort(df, firstMeetup, cohort))      
}
 
ggplot(aes(x=date, y = retention, colour = cohort), data = cohortAttendance) + 
  geom_line(aes(group = cohort)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Unnamed chunk 5 2

This all looks a bit of a mess and at the moment we can’t easily compare cohorts as they start at different places on the x axis. We can fix that by adding a ‘monthNumber’ column to the data frame which we calculate with the following function:

monthNumber = function(cohort, date) {
  cohortAsDate = as.yearmon(cohort)
  dateAsDate = as.yearmon(date)
 
  if(cohortAsDate > dateAsDate) {
    "NA"
  } else {
    paste(round((dateAsDate - cohortAsDate) * 12), sep="")
  }
}

Now let’s create a new data frame with the month field added:

cohortAttendanceWithMonthNumber = cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(monthNumber != "0") %>% 
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber)

We’re also filtering out any ‘NA’ columns which would represent row entries for months from before the cohort started. We don’t want to plot those.

finally let’s plot a chart containing all cohorts normalised by month number:

ggplot(aes(x=monthNumber, y = retention, colour = cohort), data = cohortAttendanceWithMonthNumber) + 
  geom_line(aes(group = cohort)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1), panel.background = element_blank())
Unnamed chunk 5 3

It’s still a bit of a mess but what stands out is that when the number of people in a cohort is small the fluctuation in the retention value can be quite pronounced.

The next step is to make the cohorts a bit more coarse grained to see if it reveals some insights. I think I’ll start out with a cohort covering a 3 month period and see how that works out.

Categories: Programming

R/dplyr: Extracting data frame column value for filtering with %in%

Sun, 02/22/2015 - 09:58

I’ve been playing around with dplyr over the weekend and wanted to extract the values from a data frame column to use in a later filtering step.

I had a data frame:

library(dplyr)
df = data.frame(userId = c(1,2,3,4,5), score = c(2,3,4,5,5))

And wanted to extract the userIds of those people who have a score greater than 3. I started with:

highScoringPeople = df %>% filter(score > 3) %>% select(userId)
> highScoringPeople
  userId
1      3
2      4
3      5

And then filtered the data frame expecting to get back those 3 people:

> df %>% filter(userId %in% highScoringPeople)
[1] userId score 
<0 rows> (or 0-length row.names)

No rows! I created vector with the numbers 3-5 to make sure that worked:

> df %>% filter(userId %in% c(3,4,5))
  userId score
1      3     4
2      4     5
3      5     5

That works as expected so highScoringPeople obviously isn’t in the right format to facilitate an ‘in lookup’. Let’s explore:

> str(c(3,4,5))
 num [1:3] 3 4 5
 
> str(highScoringPeople)
'data.frame':	3 obs. of  1 variable:
 $ userId: num  3 4 5

Now it’s even more obvious why it doesn’t work – highScoringPeople is still a data frame when we need it to be a vector/list.

One way to fix this is to extract the userIds using the $ syntax instead of the select function:

highScoringPeople = (df %>% filter(score > 3))$userId
 
> str(highScoringPeople)
 num [1:3] 3 4 5
 
> df %>% filter(userId %in% highScoringPeople)
  userId score
1      3     4
2      4     5
3      5     5

Or if we want to do the column selection using dplyr we can extract the values for the column like this:

highScoringPeople = (df %>% filter(score > 3) %>% select(userId))[[1]]
 
> str(highScoringPeople)
 num [1:3] 3 4 5

Not so difficult after all.

Categories: Programming

Python/scikit-learn: Detecting which sentences in a transcript contain a speaker

Fri, 02/20/2015 - 23:42

Over the past couple of months I’ve been playing around with How I met your mother transcripts and the most recent thing I’ve been working on is how to extract the speaker for a particular sentence.

This initially seemed like a really simple problem as most of the initial sentences I looked at weere structured like this:

<speaker>: <sentence>

If there were all in that format then we could write a simple regular expression and then move on but unfortunately they aren’t. We could probably write a more complex regex to pull out the speaker but I thought it’d be fun to see if I could train a model to work it out instead.

The approach I’ve taken is derived from an example in the NLTK book.

The first problem with this approach was that I didn’t have any labelled data to work with so I wrote a little web application that made it easy for me to train chunks of sentences at a time:

2015 02 20 00 44 38

I stored the trained words in a JSON file. Each entry looks like this:

import json
with open("data/import/trained_sentences.json", "r") as json_file:
    json_data = json.load(json_file)
 
>>> json_data[0]
{u'words': [{u'word': u'You', u'speaker': False}, {u'word': u'ca', u'speaker': False}, {u'word': u"n't", u'speaker': False}, {u'word': u'be', u'speaker': False}, {u'word': u'friends', u'speaker': False}, {u'word': u'with', u'speaker': False}, {u'word': u'Robin', u'speaker': False}, {u'word': u'.', u'speaker': False}]}
 
>>> json_data[1]
{u'words': [{u'word': u'Robin', u'speaker': True}, {u'word': u':', u'speaker': False}, {u'word': u'Well', u'speaker': False}, {u'word': u'...', u'speaker': False}, {u'word': u'it', u'speaker': False}, {u'word': u"'s", u'speaker': False}, {u'word': u'a', u'speaker': False}, {u'word': u'bit', u'speaker': False}, {u'word': u'early', u'speaker': False}, {u'word': u'...', u'speaker': False}, {u'word': u'but', u'speaker': False}, {u'word': u'...', u'speaker': False}, {u'word': u'of', u'speaker': False}, {u'word': u'course', u'speaker': False}, {u'word': u',', u'speaker': False}, {u'word': u'I', u'speaker': False}, {u'word': u'might', u'speaker': False}, {u'word': u'consider', u'speaker': False}, {u'word': u'...', u'speaker': False}, {u'word': u'I', u'speaker': False}, {u'word': u'moved', u'speaker': False}, {u'word': u'here', u'speaker': False}, {u'word': u',', u'speaker': False}, {u'word': u'let', u'speaker': False}, {u'word': u'me', u'speaker': False}, {u'word': u'think', u'speaker': False}, {u'word': u'.', u'speaker': False}]}

Each word in the sentence is represented by a JSON object which also indicates if that word was a speaker in the sentence.

Feature selection

Now that I’ve got some trained data to work with I needed to choose which features I’d use to train my model.

One of the most obvious indicators that a word is the speaker in the sentence is that the next word is ‘:’ so ‘next word’ can be a feature. I also went with ‘previous word’ and the word itself for my first cut.

This is the function I wrote to convert a word in a sentence into a set of features:

def pos_features(sentence, i):
    features = {}
    features["word"] = sentence[i]
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    if i == len(sentence) - 1:
        features["next-word"] = "<END>"
    else:
        features["next-word"] = sentence[i+1]
    return features

Let’s try a couple of examples:

import nltk
 
>>> pos_features(nltk.word_tokenize("Robin: Hi Ted, how are you?"), 0)
{'prev-word': '<START>', 'word': 'Robin', 'next-word': ':'}
 
>>> pos_features(nltk.word_tokenize("Robin: Hi Ted, how are you?"), 5)
{'prev-word': ',', 'word': 'how', 'next-word': 'are'}

Now let’s run that function over our full set of labelled data:

with open("data/import/trained_sentences.json", "r") as json_file:
    json_data = json.load(json_file)
 
tagged_sents = []
for sentence in json_data:
    tagged_sents.append([(word["word"], word["speaker"]) for word in sentence["words"]])
 
featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append( (pos_features(untagged_sent, i), tag) )

Here’s a sample of the contents of featuresets:

>>> featuresets[:5]
[({'prev-word': '<START>', 'word': u'You', 'next-word': u'ca'}, False), ({'prev-word': u'You', 'word': u'ca', 'next-word': u"n't"}, False), ({'prev-word': u'ca', 'word': u"n't", 'next-word': u'be'}, False), ({'prev-word': u"n't", 'word': u'be', 'next-word': u'friends'}, False), ({'prev-word': u'be', 'word': u'friends', 'next-word': u'with'}, False)]

It’s nearly time to train our model, but first we need to split out labelled data into training and test sets so we can see how well our model performs on data it hasn’t seen before. sci-kit learn has a function that does this for us:

from sklearn.cross_validation import train_test_split
train_data,test_data = train_test_split(featuresets, test_size=0.20, train_size=0.80)
 
>>> len(train_data)
9480
 
>>> len(test_data)
2370

Now let’s train our model. I decided to try out Naive Bayes and Decision tree models to see how they got on:

>>> classifier = nltk.NaiveBayesClassifier.train(train_data)
>>> print nltk.classify.accuracy(classifier, test_data)
0.977215189873
 
>>> classifier = nltk.DecisionTreeClassifier.train(train_data)
>>> print nltk.classify.accuracy(classifier, test_data)
0.997046413502

It looks like both are doing a good job here with the decision tree doing slightly better. One thing to keep in mind is that most of the sentences we’ve trained at in the form ‘:‘ and we can get those correct with a simple regex so we should expect the accuracy to be very high.

If we explore the internals of the decision tree we’ll see that it’s massively overfitting which makes sense given our small training data set and the repetitiveness of the data:

>>> print(classifier.pseudocode(depth=2))
if next-word == u'!': return False
if next-word == u'$': return False
...
if next-word == u"'s": return False
if next-word == u"'ve": return False
if next-word == u'(':
  if word == u'!': return False
  ...
if next-word == u'*': return False
if next-word == u'*****': return False
if next-word == u',':
  if word == u"''": return False
  ...
if next-word == u'--': return False
if next-word == u'.': return False
if next-word == u'...':
  ...
  if word == u'who': return False
  if word == u'you': return False
if next-word == u'/i': return False
if next-word == u'1': return True
...
if next-word == u':':
  if prev-word == u"'s": return True
  if prev-word == u',': return False
  if prev-word == u'...': return False
  if prev-word == u'2030': return True
  if prev-word == '<START>': return True
  if prev-word == u'?': return False
...
if next-word == u'\u266a\u266a': return False

One update I may make to the features is to include the part of speech of the word rather than its actual value to see if that makes the model a bit more general. Another option is to train a bunch of decision trees against a subset of the data and build an ensemble/random forest of those trees.

Once I’ve got a working ‘speaker detector’ I want to then go and work out who the likely speaker is for the sentences which don’t contain a speaker. The plan is to calculate the word distributions of the speakers from sentences I do have and then calculate the probability that they spoke the unlabelled sentences.

This might not work perfectly as there could be new characters in those episodes but hopefully we can come up with something decent.

The full code for this example is on github if you want to have a play with it.

Any suggestions for improvements are always welcome in the comments.

Categories: Programming

Python’s pandas vs Neo4j’s cypher: Exploring popular phrases in How I met your mother transcripts

Thu, 02/19/2015 - 01:52

I’ve previously written about extracting TF/IDF scores for phrases in documents using scikit-learn and the final step in that post involved writing the words into a CSV file for analysis later on.

I wasn’t sure what the most appropriate tool of choice for that analysis was so I decided to explore the data using Python’s pandas library and load it into Neo4j and write some Cypher queries.

To do anything with Neo4j we need to first load the CSV file into the database. The easiest way to do that is with Cypher’s LOAD CSV command.

First we’ll load the phrases in and then we’ll connect them to the episodes which were previously loaded:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/tfidf_scikit.csv" AS row
MERGE (phrase:Phrase {value: row.Phrase});
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/tfidf_scikit.csv" AS row
MATCH (phrase:Phrase {value: row.Phrase})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (phrase)-[:USED_IN_EPISODE {tfidfScore: TOFLOAT(row.Score)}]->(episode);

Now we’re ready to start writing some queries. To start with we’ll write a simple query to find the top 3 phrases for each episode.

In pandas this is quite easy – we just need to group by the appropriate field and then take the top 3 records in that grouping:

top_words_by_episode = df \
    .sort(["EpisodeId", "Score"], ascending = [True, False]) \
    .groupby(["EpisodeId"], sort = False) \
    .head(3)
 
>>> print(top_words_by_episode.to_string())
 
        EpisodeId              Phrase     Score
3976            1                 ted  0.262518
2912            1              olives  0.195714
2441            1            marshall  0.155515
8143            2                 ted  0.292184
5197            2              carlos  0.227454
7482            2               robin  0.195150
12551           3                 ted  0.232662
9040            3              barney  0.187255
11254           3              mcneil  0.170619
15641           4             natalie  0.562485
16763           4                 ted  0.191873
16234           4               robin  0.102671
20715           5            subtitle  0.310866
18121           5          coat check  0.181682
20861           5                 ted  0.169973
...

The cypher version looks quite similar, the main difference being that we use the COLLECT to generate an array of phrases by episode and then take the top 3:

MATCH (e:Episode)<-[rel:USED_IN_EPISODE]-(phrase)
WITH e, rel, phrase
ORDER BY e.id, rel.tfidfScore DESC
RETURN e.id, e.title, COLLECT({phrase: phrase.value, score: rel.tfidfScore})[..3]
ORDER BY e.id
 
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | e.id | e.title                                     | COLLECT({phrase: phrase.value, score: rel.tfidfScore})[..3]                                                                                                               |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | 1    | "Pilot"                                     | [{phrase -> "ted", score -> 0.2625177493269755},{phrase -> "olives", score -> 0.19571419072701732},{phrase -> "marshall", score -> 0.15551468983363487}]                  |
==> | 2    | "Purple Giraffe"                            | [{phrase -> "ted", score -> 0.292184496766088},{phrase -> "carlos", score -> 0.22745438090499026},{phrase -> "robin", score -> 0.19514993122773566}]                      |
==> | 3    | "Sweet Taste of Liberty"                    | [{phrase -> "ted", score -> 0.23266190616714866},{phrase -> "barney", score -> 0.18725456678444408},{phrase -> "officer mcneil", score -> 0.17061872221616137}]           |
==> | 4    | "Return of the Shirt"                       | [{phrase -> "natalie", score -> 0.5624848345525686},{phrase -> "ted", score -> 0.19187323894701674},{phrase -> "robin", score -> 0.10267067360622682}]                    |
==> | 5    | "Okay Awesome"                              | [{phrase -> "subtitle", score -> 0.310865508347106},{phrase -> "coat check", score -> 0.18168178787561182},{phrase -> "ted", score -> 0.16997258596683185}]               |
==> | 6    | "Slutty Pumpkin"                            | [{phrase -> "mike", score -> 0.2966610054610693},{phrase -> "ted", score -> 0.19333276951599407},{phrase -> "robin", score -> 0.1656172994411056}]                        |
==> | 7    | "Matchmaker"                                | [{phrase -> "ellen", score -> 0.4947912795578686},{phrase -> "sarah", score -> 0.24462913913669443},{phrase -> "ted", score -> 0.23728319597607636}]                      |
==> | 8    | "The Duel"                                  | [{phrase -> "ted", score -> 0.26713931416222847},{phrase -> "marshall", score -> 0.22816702335751904},{phrase -> "swords", score -> 0.17841675237702592}]                 |
==> | 9    | "Belly Full of Turkey"                      | [{phrase -> "ericksen", score -> 0.43145756691027665},{phrase -> "mrs ericksen", score -> 0.1939318283559959},{phrase -> "kendall", score -> 0.1846969793866628}]         |
==> | 10   | "The Pineapple Incident"                    | [{phrase -> "ted", score -> 0.439756993033922},{phrase -> "trudy", score -> 0.36367907631894536},{phrase -> "carl", score -> 0.16413071244131686}]                        |
==> | 11   | "The Limo"                                  | [{phrase -> "moby", score -> 0.48314164479037003},{phrase -> "party number", score -> 0.30458929780262456},{phrase -> "ranjit", score -> 0.1991061739767796}]             |
...
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

In the cypher version we get one row per episode whereas with the Python version we get 3 rows. It might be possible to achieve this effect with pandas too but I wasn’t sure how to do so.

Next let’s find the top phrases for a single episode – the type of query that might be part of an episode page on a How I met your mother wiki:

top_words = df[(df["EpisodeId"] == 1)] \
    .sort(["Score"], ascending = False) \
    .head(20)
 
>>> print(top_words.to_string())
 
      EpisodeId                Phrase     Score
3976          1                   ted  0.262518
2912          1                olives  0.195714
2441          1              marshall  0.155515
4732          1               yasmine  0.152279
3347          1                 robin  0.130418
209           1                barney  0.124412
2146          1                  lily  0.122925
3637          1                signal  0.103793
1366          1                goanna  0.098138
3524          1                 scene  0.095342
710           1                   cut  0.091734
2720          1              narrator  0.086462
1147          1             flashback  0.078296
1148          1        flashback date  0.070283
3224          1                ranjit  0.069393
4178          1           ted yasmine  0.058569
1149          1  flashback date robin  0.058569
525           1                  carl  0.058210
3714          1           smurf pen1s  0.054365
2048          1              lebanese  0.054365
MATCH (e:Episode {title: "Pilot"})<-[rel:USED_IN_EPISODE]-(phrase)
WITH phrase, rel
ORDER BY rel.tfidfScore DESC
RETURN phrase.value AS phrase, rel.tfidfScore AS score
LIMIT 20
 
==> +-----------------------------------------------+
==> | phrase                 | score                |
==> +-----------------------------------------------+
==> | "ted"                  | 0.2625177493269755   |
==> | "olives"               | 0.19571419072701732  |
==> | "marshall"             | 0.15551468983363487  |
==> | "yasmine"              | 0.15227880637176266  |
==> | "robin"                | 0.1304175242341549   |
==> | "barney"               | 0.12441175186690791  |
==> | "lily"                 | 0.12292497785945679  |
==> | "signal"               | 0.1037932464656365   |
==> | "goanna"               | 0.09813798750091524  |
==> | "scene"                | 0.09534236041231685  |
==> | "cut"                  | 0.09173366535740156  |
==> | "narrator"             | 0.08646229819848741  |
==> | "flashback"            | 0.07829592155397117  |
==> | "flashback date"       | 0.07028252601773662  |
==> | "ranjit"               | 0.06939276915589167  |
==> | "ted yasmine"          | 0.05856877168144719  |
==> | "flashback date robin" | 0.05856877168144719  |
==> | "carl"                 | 0.058210117288760355 |
==> | "smurf pen1s"          | 0.05436505297972703  |
==> | "lebanese"             | 0.05436505297972703  |
==> +-----------------------------------------------+

Our next query is a negation – find the episodes which don’t mention the phrase ‘robin’. In python we can do some simple set operations to work this out:

all_episodes = set(range(1, 209))
robin_episodes = set(df[(df["Phrase"] == "robin")]["EpisodeId"])
 
>>> print(set(all_episodes) - set(robin_episodes))
set([145, 198, 143])

In cypher land a query will suffice:

MATCH (episode:Episode), (phrase:Phrase {value: "robin"})
WHERE NOT (episode)<-[:USED_IN_EPISODE]-(phrase)
RETURN episode.id AS id, episode.season AS season, episode.number AS episode

And finally a mini recommendation engine type query – how many of the top phrases in Episode 1 were used in other episodes:

First python:

phrases_used = set(df[(df["EpisodeId"] == 1)] \
    .sort(["Score"], ascending = False) \
    .head(10)["Phrase"])
 
phrases = df[df["Phrase"].isin(phrases_used)]
 
print (phrases[phrases["EpisodeId"] != 1] \
    .groupby(["Phrase"]) \
    .size() \
    .order(ascending = False))

Here we’ve pulled it out into a few steps – first we identify the top phrases, then we find out where they occur across the whole data set and finally we filter out the occurrences in the first episode and count the other occurrences.

Phrase
marshall    207
barney      207
ted         206
lily        206
robin       204
scene        36
signal        4
goanna        3
olives        1

In cypher we can write a query to do this as well:

MATCH (episode:Episode {title: "Pilot"})<-[rel:USED_IN_EPISODE]-(phrase)
WITH phrase, rel, episode
ORDER BY rel.tfidfScore DESC
LIMIT 10
MATCH (phrase)-[:USED_IN_EPISODE]->(otherEpisode)
WHERE otherEpisode <> episode
RETURN phrase.value AS phrase, COUNT(*) AS numberOfOtherEpisodes
ORDER BY numberOfOtherEpisodes DESC
 
==> +------------------------------------+
==> | phrase     | numberOfOtherEpisodes |
==> +------------------------------------+
==> | "barney"   | 207                   |
==> | "marshall" | 207                   |
==> | "ted"      | 206                   |
==> | "lily"     | 206                   |
==> | "robin"    | 204                   |
==> | "scene"    | 36                    |
==> | "signal"   | 4                     |
==> | "goanna"   | 3                     |
==> | "olives"   | 1                     |
==> +------------------------------------+

Overall there’s not much in it – for some of the queries I found it easier in cypher and for others easier with pandas. It’s always useful to have multiple tools in the toolbox!

Categories: Programming

Python/pandas: Column value in list (ValueError: The truth value of a Series is ambiguous.)

Mon, 02/16/2015 - 22:39

I’ve been using Python’s pandas library while exploring some CSV files and although for the most part I’ve found it intuitive to use, I had trouble filtering a data frame based on checking whether a column value was in a list.

A subset of one of the CSV files I’ve been working with looks like this:

$ cat foo.csv
"Foo"
1
2
3
4
5
6
7
8
9
10

Loading it into a pandas data frame is reasonably simple:

import pandas as pd
df = pd.read_csv('foo.csv', index_col=False, header=0)
>>> df
   Foo
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
9   10

If we want to find the rows which have a value of 1 we’d write the following:

>>> df[df["Foo"] == 1]
   Foo
0    1

Finding the rows with a value less than 7 is as you’d expect too:

>>> df[df["Foo"] < 7]
   Foo
0    1
1    2
2    3
3    4
4    5
5    6

Next I wanted to filter out the rows containing odd numbers which I initially tried to do like this:

odds = [i for i in range(1,10) if i % 2 <> 0]
>>> odds
[1, 3, 5, 7, 9]
 
>>> df[df["Foo"] in odds]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/pandas/core/generic.py", line 698, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Unfortunately that doesn’t work and I couldn’t get any of the suggestions from the error message to work either. Luckily pandas has a special isin function for this use case which we can call like this:

>>> df[df["Foo"].isin(odds)]
   Foo
0    1
2    3
4    5
6    7
8    9

Much better!

Categories: Programming

Python/scikit-learn: Calculating TF/IDF on How I met your mother transcripts

Sun, 02/15/2015 - 16:56

Over the past few weeks I’ve been playing around with various NLP techniques to find interesting insights into How I met your mother from its transcripts and one technique that kept coming up is TF/IDF.

The Wikipedia definition reads like this:

tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

It is often used as a weighting factor in information retrieval and text mining.

The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.

I wanted to generate a TF/IDF representation of phrases used in the hope that it would reveal some common themes used in the show.

Python’s scikit-learn library gives you two ways to generate the TF/IDF representation:

  1. Generate a matrix of token/phrase counts from a collection of text documents using CountVectorizer and feed it to TfidfTransformer to generate the TF/IDF representation.
  2. Feed the collection of text documents directly to TfidfVectorizer and go straight to the TF/IDF representation skipping the middle man.

I started out using the first approach and hadn’t quite got it working when I realised there was a much easier way!

I have a collection of sentences in a CSV file so the first step is to convert those into a list of documents:

from collections import defaultdict
import csv
 
episodes = defaultdict(list)
with open("data/import/sentences.csv", "r") as sentences_file:
    reader = csv.reader(sentences_file, delimiter=',')
    reader.next()
    for row in reader:
        episodes[row[1]].append(row[4])
 
for episode_id, text in episodes.iteritems():
    episodes[episode_id] = "".join(text)
 
corpus = []
for id, episode in sorted(episodes.iteritems(), key=lambda t: int(t[0])):
    corpus.append(episode)

corpus contains 208 entries (1 per episode), each of which is a string containing the transcript of that episode. Next it’s time to train our TF/IDF model which is only a few lines of code:

from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df = 0, stop_words = 'english')

The most interesting parameter here is ngram_range – we’re telling it to generate 2 and 3 word phrases along with the single words from the corpus.

e.g. if we had the sentence “Python is cool” we’d end up with 6 phrases – ‘Python’, ‘is’, ‘cool’, ‘Python is’, ‘Python is cool’ and ‘is cool’.

Let’s execute the model against our corpus:

tfidf_matrix =  tf.fit_transform(corpus)
>>> len(feature_names)
498254
 
>>> feature_names[50:70]
[u'00 does sound', u'00 don', u'00 don buy', u'00 dressed', u'00 dressed blond', u'00 drunkenly', u'00 drunkenly slurred', u'00 fair', u'00 fair tonight', u'00 fall', u'00 fall foliage', u'00 far', u'00 far impossible', u'00 fart', u'00 fart sure', u'00 friends', u'00 friends singing', u'00 getting', u'00 getting guys', u'00 god']

So we’re got nearly 500,000 phrases and if we look at tfidf_matrix we’d expect it to be a 208 x 498254 matrix – one row per episode, one column per phrase:

>>> tfidf_matrix
<208x498254 sparse matrix of type '<type 'numpy.float64'>'
	with 740396 stored elements in Compressed Sparse Row format>

This is what we’ve got although under the covers it’s using a sparse representation to save space. Let’s convert the matrix to dense format to explore further and find out why:

dense = tfidf_matrix.todense()
>>> len(dense[0].tolist()[0])
498254

What I’ve printed out here is the size of one row of the matrix which contains the TF/IDF score for every phrase in our corpus for the 1st episode of How I met your mother. A lot of those phrases won’t have happened in the 1st episode so let’s filter those out:

episode = dense[0].tolist()[0]
phrase_scores = [pair for pair in zip(range(0, len(episode)), episode) if pair[1] > 0]
 
>>> len(phrase_scores)
4823

There are just under 5000 phrases used in this episode, roughly 1% of the phrases in the whole corpus.
The sparse matrix makes a bit more sense – if scipy used a dense matrix representation there’d be 493,000 entries with no score which becomes more significant as the number of documents increases.

Next we’ll sort the phrases by score in descending order to find the most interesting phrases for the first episode of How I met your mother:

>>> sorted(phrase_scores, key=lambda t: t[1] * -1)[:5]
[(419207, 0.2625177493269755), (312591, 0.19571419072701732), (267538, 0.15551468983363487), (490429, 0.15227880637176266), (356632, 0.1304175242341549)]

The first value in each tuple is the phrase’s position in our initial vector and also corresponds to the phrase’s position in feature_names which allows us to map the scores back to phrases. Let’s look up a couple of phrases:

>>> feature_names[419207]
u'ted'
>>> feature_names[312591]
u'olives'
>>> feature_names[356632]
u'robin'

Let’s automate that lookup:

sorted_phrase_scores = sorted(phrase_scores, key=lambda t: t[1] * -1)
for phrase, score in [(feature_names[word_id], score) for (word_id, score) in sorted_phrase_scores][:20]:
   print('{0: <20} {1}'.format(phrase, score))
 
ted                  0.262517749327
olives               0.195714190727
marshall             0.155514689834
yasmine              0.152278806372
robin                0.130417524234
barney               0.124411751867
lily                 0.122924977859
signal               0.103793246466
goanna               0.0981379875009
scene                0.0953423604123
cut                  0.0917336653574
narrator             0.0864622981985
flashback            0.078295921554
flashback date       0.0702825260177
ranjit               0.0693927691559
flashback date robin 0.0585687716814
ted yasmine          0.0585687716814
carl                 0.0582101172888
eye patch            0.0543650529797
lebanese             0.0543650529797

We see all the main characters names which aren’t that interested – perhaps they should be part of the stop list – but ‘olives’ which is where the olive theory is first mentioned. I thought olives came up more often but a quick search for the term suggests it isn’t mentioned again until Episode 9 in Season 9:

$ grep -rni --color "olives" data/import/sentences.csv | cut -d, -f 2,3,4 | sort | uniq -c
  16 1,1,1
   3 193,9,9

‘yasmine’ is also an interesting phrase in this episode but she’s never mentioned again:

$ grep -h -rni --color "yasmine" data/import/sentences.csv
49:48,1,1,1,"Barney: (Taps a woman names Yasmine) Hi, have you met Ted? (Leaves and watches from a distance)."
50:49,1,1,1,"Ted: (To Yasmine) Hi, I'm Ted."
51:50,1,1,1,Yasmine: Yasmine.
53:52,1,1,1,"Yasmine: Thanks, It's Lebanese."
65:64,1,1,1,"[Cut to the bar, Ted is chatting with Yasmine]"
67:66,1,1,1,Yasmine: So do you think you'll ever get married?
68:67,1,1,1,"Ted: Well maybe eventually. Some fall day. Possibly in Central Park. Simple ceremony, we'll write our own vows. But--eh--no DJ, people will dance. I'm not going to worry about it! Damn it, why did Marshall have to get engaged? (Yasmine laughs) Yeah, nothing hotter than a guy planning out his own imaginary wedding, huh?"
69:68,1,1,1,"Yasmine: Actually, I think it's cute."
79:78,1,1,1,"Lily: You are unbelievable, Marshall. No-(Scene splits in half and shows both Lily and Marshall on top arguing and Ted and Yasmine on the bottom mingling)"
82:81,1,1,1,Ted: (To Yasmine) you wanna go out sometime?
85:84,1,1,1,[Cut to Scene with Ted and Yasmine at bar]
86:85,1,1,1,Yasmine: I'm sorry; Carl's my boyfriend (points to bartender)

It would be interesting to filter out the phrases which don’t occur in any other episode and see what insights we get from doing that. For now though we’ll extract phrases for all episodes and write to CSV so we can explore more easily:

with open("data/import/tfidf_scikit.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["EpisodeId", "Phrase", "Score"])
 
    doc_id = 0
    for doc in tfidf_matrix.todense():
        print "Document %d" %(doc_id)
        word_id = 0
        for score in doc.tolist()[0]:
            if score > 0:
                word = feature_names[word_id]
                writer.writerow([doc_id+1, word.encode("utf-8"), score])
            word_id +=1
        doc_id +=1

And finally a quick look at the contents of the CSV:

$ tail -n 10 data/import/tfidf_scikit.csv
208,york apparently laughs,0.012174304095213192
208,york aren,0.012174304095213192
208,york aren supposed,0.012174304095213192
208,young,0.013397275854758335
208,young ladies,0.012174304095213192
208,young ladies need,0.012174304095213192
208,young man,0.008437685963000223
208,young man game,0.012174304095213192
208,young stupid,0.011506395106658192
208,young stupid sighs,0.012174304095213192
Categories: Programming

Neo4j: Building a topic graph with Prismatic Interest Graph API

Sat, 02/14/2015 - 00:38

Over the last few weeks I’ve been using various NLP libraries to derive topics for my corpus of How I met your mother episodes without success and was therefore enthused to see the release of Prismatic’s Interest Graph API

The Interest Graph API exposes a web service to which you feed a block of text and get back a set of topics and associated score.

It has been trained over the last few years with millions of articles that people share on their social media accounts and in my experience using Prismatic the topics have been very useful for finding new material to read.

The first step is to head to interest-graph.getprismatic.com and get an API key which will be emailed to you.

Having done that we’re ready to make some calls to the API and get back some topics.

I’m going to use Python to call the API and I’ve found the requests library the easiest library to use for this type of work. Our call to the API looks like this:

import requests
payload = { 'title': "insert title of article here",
            'body': "insert body of text here"),
            'api-token': "insert token sent by email here"}
r = requests.post("http://interest-graph.getprismatic.com/text/topic", data=payload)

One thing to keep in mind is that the API is rate limited to 20 requests a second so we need to restrict our requests or we’re going to receive error response codes. Luckily I came across an excellent blog post showing how to write a decorator around a function and only allow it to execute at a certain frequency.

To rate limit our calls to the Interest Graph we need to pull the above code into a function and annotate it appropriately:

import time
 
def RateLimited(maxPerSecond):
    minInterval = 1.0 / float(maxPerSecond)
    def decorate(func):
        lastTimeCalled = [0.0]
        def rateLimitedFunction(*args,**kargs):
            elapsed = time.clock() - lastTimeCalled[0]
            leftToWait = minInterval - elapsed
            if leftToWait>0:
                time.sleep(leftToWait)
            ret = func(*args,**kargs)
            lastTimeCalled[0] = time.clock()
            return ret
        return rateLimitedFunction
    return decorate
 
@RateLimited(0.3)
def topics(title, body):
    payload = { 'title': title,
                'body': body,
                'api-token': "insert token sent by email here"}
    r = requests.post("http://interest-graph.getprismatic.com/text/topic", data=payload)
    return r

The text I want to classify is stored in a CSV file – one sentence per line. Here’s a sample:

$ head -n 10 data/import/sentences.csv
SentenceId,EpisodeId,Season,Episode,Sentence
1,1,1,1,Pilot
2,1,1,1,Scene One
3,1,1,1,[Title: The Year 2030]
4,1,1,1,"Narrator: Kids, I'm going to tell you an incredible story. The story of how I met your mother"
5,1,1,1,Son: Are we being punished for something?
6,1,1,1,Narrator: No
7,1,1,1,"Daughter: Yeah, is this going to take a while?"
8,1,1,1,"Narrator: Yes. (Kids are annoyed) Twenty-five years ago, before I was dad, I had this whole other life."
9,1,1,1,"(Music Plays, Title ""How I Met Your Mother"" appears)"

We’ll also need to refer to another CSV file to get the title of each episode since it isn’t being stored with the sentence:

$ head -n 10 data/import/episodes_full.csv
NumberOverall,NumberInSeason,Episode,Season,DateAired,Timestamp,Title,Director,Viewers,Writers,Rating
1,1,/wiki/Pilot,1,"September 19, 2005",1127084400,Pilot,Pamela Fryman,10.94,"Carter Bays,Craig Thomas",68
2,2,/wiki/Purple_Giraffe,1,"September 26, 2005",1127689200,Purple Giraffe,Pamela Fryman,10.40,"Carter Bays,Craig Thomas",63
3,3,/wiki/Sweet_Taste_of_Liberty,1,"October 3, 2005",1128294000,Sweet Taste of Liberty,Pamela Fryman,10.44,"Phil Lord,Chris Miller",67
4,4,/wiki/Return_of_the_Shirt,1,"October 10, 2005",1128898800,Return of the Shirt,Pamela Fryman,9.84,Kourtney Kang,59
5,5,/wiki/Okay_Awesome,1,"October 17, 2005",1129503600,Okay Awesome,Pamela Fryman,10.14,Chris Harris,53
6,6,/wiki/Slutty_Pumpkin,1,"October 24, 2005",1130108400,Slutty Pumpkin,Pamela Fryman,10.89,Brenda Hsueh,62
7,7,/wiki/Matchmaker,1,"November 7, 2005",1131321600,Matchmaker,Pamela Fryman,10.55,"Sam Johnson,Chris Marcil",57
8,8,/wiki/The_Duel,1,"November 14, 2005",1131926400,The Duel,Pamela Fryman,10.35,Gloria Calderon Kellett,46
9,9,/wiki/Belly_Full_of_Turkey,1,"November 21, 2005",1132531200,Belly Full of Turkey,Pamela Fryman,10.29,"Phil Lord,Chris Miller",60

Now we need to get our episode titles and transcripts ready to pass to the topics function. Since we’ve only got ~ 200 episodes we can create a dictionary to store that data:

episodes = {}
with open("data/import/episodes_full.csv", "r") as episodesfile:
    episodes_reader = csv.reader(episodesfile, delimiter=",")
    episodes_reader.next()
    for episode in episodes_reader:
        episodes[int(episode[0])] = {"title": episode[6], "sentences" : [] }
 
with open("data/import/sentences.csv", "r") as sentencesfile:
     sentences_reader = csv.reader(sentencesfile, delimiter=",")
     sentences_reader.next()
     for sentence in sentences_reader:
         episodes[int(sentence[1])]["sentences"].append(sentence[4])
 
>>> episodes[1]["title"]
'Pilot'
>>> episodes[1]["sentences"][:5]
['Pilot', 'Scene One', '[Title: The Year 2030]', "Narrator: Kids, I'm going to tell you an incredible story. The story of how I met your mother", 'Son: Are we being punished for something?']

Now we’re going to loop through each of the episodes, call topics and write the result into a CSV file so we can load it into Neo4j afterwards to explore the data:

import json
 
with open("data/import/topics.csv", "w") as topicsfile:
    topics_writer = csv.writer(topicsfile, delimiter=",")
    topics_writer.writerow(["EpisodeId", "TopicId", "Topic", "Score"])
 
    for episode_id, episode in episodes.iteritems():
        tmp = topics(episode["title"], "".join(episode["sentences"]).json()
        print episode_id, tmp
        for topic in tmp['topics']:
            topics_writer.writerow([episode_id, topic["id"], topic["topic"], topic["score"]])

It takes about 10 minutes to run and this is a sample of the output:

$ head -n 10 data/import/topics.csv
EpisodeId,TopicId,Topic,Score
1,1519,Fiction,0.5798245566455255
1,2015,Humour,0.565154963605359
1,24031,Laughing,0.5587120401021765
1,16693,Flirting,0.5514098189505282
1,1163,Dating and Courtship,0.5487490108554022
1,2386,Kissing,0.5476185929151934
1,31929,Puns,0.5375100569837977
2,24031,Laughing,0.5670926949850333
2,1519,Fiction,0.5396488295397263

We’ll use Neo4j’s LOAD CSV command to load the data in:

// make sure the topics exist
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/topics.csv" AS row
MERGE (topic:Topic {id: TOINT(row.TopicId)})
ON CREATE SET topic.value = row.Topic
// make sure the topics exist
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/topics.csv" AS row
MERGE (topic:Topic {id: TOINT(row.TopicId)})
ON CREATE SET topic.value = row.Topic
// now link the episodes and topics
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/topics.csv" AS row
MATCH (topic:Topic {id: TOINT(row.TopicId)})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (episode)-[:TOPIC {score: TOFLOAT(row.Score)}]->(topic)

We’ll assume that the episodes and seasons are already loaded – the commands to load those in are on github.

We can now write some queries against our topic graph. We’ll start simple – show me the topics for an episode:

MATCH (episode:Episode {id: 1})-[r:TOPIC]->(topic)
RETURN topic, r

Graph

Let’s say we liked the ‘Puns’ aspect of the Pilot episode and want to find out which other episodes had puns. The following query would let us find those:

MATCH (episode:Episode {id: 1})-[r:TOPIC]->(topic {value: "Puns"})<-[:TOPIC]-(other)
RETURN episode, topic, other

Graph  1

Or maybe we want to find the episode which has the most topics in common:

MATCH (episode:Episode {id: 1})-[:TOPIC]->(topic),
      (topic)<-[r:TOPIC]-(otherEpisode)
RETURN otherEpisode.title as episode, COUNT(r) AS topicsInCommon
ORDER BY topicsInCommon DESC
LIMIT 10
==> +------------------------------------------------+
==> | episode                       | topicsInCommon |
==> +------------------------------------------------+
==> | "Purple Giraffe"              | 6              |
==> | "Ten Sessions"                | 5              |
==> | "Farhampton"                  | 4              |
==> | "The Three Days Rule"         | 4              |
==> | "How I Met Everyone Else"     | 4              |
==> | "The Time Travelers"          | 4              |
==> | "Mary the Paralegal"          | 4              |
==> | "Lobster Crawl"               | 4              |
==> | "The Magician's Code, Part 2" | 4              |
==> | "Slutty Pumpkin"              | 4              |
==> +------------------------------------------------+
==> 10 rows

We could then tweak that query to get the names of those topics:

MATCH (episode:Episode {id: 1})-[:TOPIC]->(topic),
      (topic)<-[r:TOPIC]-(otherEpisode)-[:IN_SEASON]->(season)
RETURN otherEpisode.title as episode, season.number AS season, COUNT(r) AS topicsInCommon, COLLECT(topic.value)
ORDER BY topicsInCommon DESC
LIMIT 10
 
==> +-----------------------------------------------------------------------------------------------------------------------------------+
==> | episode                   | season | topicsInCommon | COLLECT(topic.value)                                                        |
==> +-----------------------------------------------------------------------------------------------------------------------------------+
==> | "Purple Giraffe"          | "1"    | 6              | ["Humour","Fiction","Kissing","Dating and Courtship","Flirting","Laughing"] |
==> | "Ten Sessions"            | "3"    | 5              | ["Humour","Puns","Dating and Courtship","Flirting","Laughing"]              |
==> | "How I Met Everyone Else" | "3"    | 4              | ["Humour","Fiction","Dating and Courtship","Laughing"]                      |
==> | "Farhampton"              | "8"    | 4              | ["Humour","Fiction","Kissing","Dating and Courtship"]                       |
==> | "Bedtime Stories"         | "9"    | 4              | ["Humour","Puns","Dating and Courtship","Laughing"]                         |
==> | "Definitions"             | "5"    | 4              | ["Kissing","Dating and Courtship","Flirting","Laughing"]                    |
==> | "Lobster Crawl"           | "8"    | 4              | ["Humour","Dating and Courtship","Flirting","Laughing"]                     |
==> | "Little Boys"             | "3"    | 4              | ["Humour","Puns","Dating and Courtship","Laughing"]                         |
==> | "Wait for It"             | "3"    | 4              | ["Fiction","Puns","Flirting","Laughing"]                                    |
==> | "Mary the Paralegal"      | "1"    | 4              | ["Humour","Dating and Courtship","Flirting","Laughing"]                     |
==> +-----------------------------------------------------------------------------------------------------------------------------------+

Overall 168 (out of 208) of the other episodes have a topic in common with the first episode so perhaps just having a topic in common isn’t the best indication of similarity.

An interesting next step would be to calculate cosine or jaccard similarity between the episodes and store that value in the graph for querying later on.

I’ve also calculated the most common bigrams across all the transcripts so it would be interesting to see if there are any interesting insights at the intersection of episodes, topics and phrases.

Categories: Programming

Python/gensim: Creating bigrams over How I met your mother transcripts

Fri, 02/13/2015 - 00:45

As part of my continued playing around with How I met your mother transcripts I wanted to identify plot arcs and as a first step I wrote some code using the gensim and nltk libraries to identify bigrams (two word phrases).

There’s an easy to follow tutorial in the gensim docs showing how to go about this but I needed to do a couple of extra steps to get my text data from a CSV file into the structure gensim expects.

Let’s first remind ourselves what the sentences CSV file looks like:

$ head -n 15 data/import/sentences.csv  | tail
5,1,1,1,Son: Are we being punished for something?
6,1,1,1,Narrator: No
7,1,1,1,"Daughter: Yeah, is this going to take a while?"
8,1,1,1,"Narrator: Yes. (Kids are annoyed) Twenty-five years ago, before I was dad, I had this whole other life."
9,1,1,1,"(Music Plays, Title ""How I Met Your Mother"" appears)"
10,1,1,1,"Narrator: It was way back in 2005. I was twenty-seven just starting to make it as an architect and living in New York with my friend Marshall, my best friend from college. My life was good and then Uncle Marshall went and screwed the whole thing up."
11,1,1,1,Marshall: (Opens ring) Will you marry me.
12,1,1,1,"Ted: Yes, perfect! And then you're engaged, you pop the champagne! You drink a toast! You have s*x on the kitchen floor... Don't have s*x on our kitchen floor."
13,1,1,1,"Marshall: Got it. Thanks for helping me plan this out, Ted."
14,1,1,1,"Ted: Dude, are you kidding? It's you and Lily! I've been there for all the big moments of you and Lily. The night you met. Your first date... other first things."

We need to transform those sentences into an array of words for each line and feed it into gensim’s models.Phrase object:

import nltk
import csv
import string
from gensim.models import Phrases
from gensim.models import Word2Vec
from nltk.corpus import stopwords
 
sentences = []
bigram = Phrases()
with open("data/import/sentences.csv", "r") as sentencesfile:
    reader = csv.reader(sentencesfile, delimiter = ",")
    reader.next()
    for row in reader:
        sentence = [word.decode("utf-8")
                    for word in nltk.word_tokenize(row[4].lower())
                    if word not in string.punctuation]
        sentences.append(sentence)
        bigram.add_vocab([sentence])

We’re used nltk’s word_tokezine function to create our array of words and then we’ve got a clause to make sure we remove any words which are punctuation otherwise they will dominate our phrases.

We can take a quick peek at some of the phrases that have been created like so:

>>> list(bigram[sentences])[:5]
[[u'pilot'], [u'scene', u'one'], [u'title', u'the', u'year_2030'], [u'narrator_kids', u'i', u"'m", u'going', u'to', u'tell', u'you', u'an_incredible', u'story.', u'the', u'story', u'of', u'how', u'i', u'met_your', u'mother'], [u'son', u'are', u'we', u'being', u'punished', u'for', u'something']]

gensim uses an underscore character to indicate when it’s joined two words together and in this sample we’ve got three phrases – ‘narrator_kids’, ‘met_you’ and ‘an_incredible’.

We can now populate a Counter with our phrases and their counts and find out the most common phrases. One thing to note is that I’ve chosen to get rid of stopwords at this point rather than earlier because I didn’t want to generate ‘false bigrams’ where there was actually a stop word sitting in between.

bigram_counter = Counter()
for key in bigram.vocab.keys():
    if key not in stopwords.words("english"):
        if len(key.split("_")) > 1:
            bigram_counter[key] += bigram.vocab[key]
 
for key, counts in bigram_counter.most_common(20):
    print '{0: <20} {1}'.format(key.encode("utf-8"), counts)
 
i_'m                 4607
it_'s                4288
you_'re              2659
do_n't               2436
that_'s              2044
in_the               1696
gon_na               1576
you_know             1497
i_do                 1464
this_is              1407
and_i                1389
want_to              1071
it_was               1053
on_the               1052
at_the               1035
we_'re               1033
i_was                1018
of_the               1014
ca_n't               1010
are_you              994

Most of the phrases aren’t really that interesting and I had better luck feeding the phrases into a Word2Vec model and repeating the exercise:

bigram_model = Word2Vec(bigram[sentences], size=100)
bigram_model_counter = Counter()
for key in bigram_model.vocab.keys():
    if key not in stopwords.words("english"):
        if len(key.split("_")) > 1:
            bigram_model_counter[key] += bigram_model.vocab[key].count
 
for key, counts in bigram_model_counter.most_common(50):
    print '{0: <20} {1}'.format(key.encode("utf-8"), counts)
 
do_n't               2436
gon_na               1576
ca_n't               1010
did_n't              704
come_on              499
end_of               460
kind_of              396
from_2030            394
my_god               360
they_'re             351
'm_sorry             349
does_n't             341
end_flashback        327
all_right            308
've_been             303
'll_be               301
of_course            289
a_lot                284
right_now            279
new_york             270
look_at              265
trying_to            238
tell_me              196
a_few                195
've_got              189
wo_n't               174
so_much              172
got_ta               168
each_other           166
my_life              157
talking_about        157
talk_about           154
what_happened        151
at_least             141
oh_god               138
wan_na               129
supposed_to          126
give_me              124
last_night           121
my_dad               120
more_than            119
met_your             115
excuse_me            112
part_of              110
phone_rings          109
get_married          107
looks_like           105
'm_sorry.            104
said_``              101

The first 20 phrases or so aren’t particularly interesting although we do have ‘new_york’ in there which is good as that’s where the show is set. If we go further we’ll notice phrases like ‘my_dad’, ‘get_married’ and ‘last_night’ which may all explain interesting parts of the plot.

Having the data in the Word2Vec model allows us to do some other fun queries too. e.g.

>>> bigram_model.most_similar(['marshall', 'lily'], ['ted'], topn=10)
[(u'robin', 0.5474381446838379), (u'go_ahead', 0.5138797760009766), (u'zoey', 0.505358874797821), (u'karen', 0.48617005348205566), (u'cootes', 0.4757827818393707), (u'then', 0.45426881313323975), (u'lewis', 0.4510520100593567), (u'natalie.', 0.45070385932922363), (u'vo', 0.4189065098762512), (u'players', 0.4149518311023712)]
 
>>> bigram_model.similarity("ted", "robin")
0.51928683064927905
 
>>> bigram_model.similarity("barney", "robin")
0.62980405583219112
 
>>> bigram_model.most_similar(positive=['getting_married'])
[(u'so_glad', 0.780311107635498), (u'kidding', 0.7683225274085999), (u'awake', 0.7682262659072876), (u'lunch.', 0.7591195702552795), (u'ready.', 0.7372316718101501), (u'single.', 0.7350872755050659), (u'excited_about', 0.725479006767273), (u'swamped', 0.7252731323242188), (u'boyfriends', 0.7127221822738647), (u'believe_this.', 0.71015864610672)]
 
>>> bigram_model.most_similar(positive=['my_dad'])
[(u'my_mom', 0.7994954586029053), (u'somebody', 0.7758427262306213), (u'easier', 0.7305313944816589), (u'hot.', 0.7282992601394653), (u'pregnant.', 0.7103987336158752), (u'nobody', 0.7059557437896729), (u'himself.', 0.7046393156051636), (u'physically', 0.7044381499290466), (u'young_lady', 0.69412761926651), (u'at_bernie', 0.682607889175415)]

I’m not quite at the stage where I can automatically pull out the results of a gensim model and do something with it but it is helping me to see some of the main themes in the show.

Next up I’ll try out trigrams and then TF/IDF over the bigrams to see which are the most important on a per episode basis. I also need to dig into Word2Vec to figure out why it comes up with different top phrases than the Phrases model.

Categories: Programming

R: Weather vs attendance at NoSQL meetups

Wed, 02/11/2015 - 08:09

A few weeks ago I came across a tweet by Sean Taylor asking for a weather data set with a few years worth of recording and I was surprised to learn that R already has such a thing – the weatherData package.

Winner is: @UTVilla! library(weatherData) df <- getWeatherForYear("SFO", 2013) ggplot(df, aes(x=Date, y = Mean_TemperatureF)) + geom_line()

— Sean J. Taylor (@seanjtaylor) January 22, 2015

weatherData provides a thin veneer around the wunderground API and was exactly what I’d been looking for to compare meetup at London’s NoSQL against weather conditions that day.

The first step was to download the appropriate weather recordings and save them to a CSV file so I wouldn’t have to keep calling the API.

I thought I may as well download all the recordings available to me and wrote the following code to make that happen:

library(weatherData)
 
# London City Airport
getDetailedWeatherForYear = function(year) {
  getWeatherForDate("LCY", 
                    start_date= paste(sep="", year, "-01-01"),
                    end_date = paste(sep="", year, "-12-31"),
                    opt_detailed = FALSE,
                    opt_all_columns = TRUE)
}
 
df = rbind(getDetailedWeatherForYear(2011), 
      getDetailedWeatherForYear(2012),
      getDetailedWeatherForYear(2013),
      getDetailedWeatherForYear(2014),
      getWeatherForDate("LCY", start_date="2015-01-01",
                        end_date = "2015-01-25",
                        opt_detailed = FALSE,
                        opt_all_columns = TRUE))

I then saved that to a CSV file:

write.csv(df, 'weather/temp_data.csv', row.names = FALSE)
"Date","GMT","Max_TemperatureC","Mean_TemperatureC","Min_TemperatureC","Dew_PointC","MeanDew_PointC","Min_DewpointC","Max_Humidity","Mean_Humidity","Min_Humidity","Max_Sea_Level_PressurehPa","Mean_Sea_Level_PressurehPa","Min_Sea_Level_PressurehPa","Max_VisibilityKm","Mean_VisibilityKm","Min_VisibilitykM","Max_Wind_SpeedKm_h","Mean_Wind_SpeedKm_h","Max_Gust_SpeedKm_h","Precipitationmm","CloudCover","Events","WindDirDegrees"
2011-01-01,"2011-1-1",7,6,4,5,3,1,93,85,76,1027,1025,1023,10,9,3,14,10,NA,0,7,"Rain",312
2011-01-02,"2011-1-2",4,3,2,1,0,-1,87,81,75,1029,1028,1027,10,10,10,11,8,NA,0,7,"",321
2011-01-03,"2011-1-3",4,2,1,0,-2,-5,87,74,56,1028,1024,1019,10,10,10,8,5,NA,0,6,"Rain-Snow",249
2011-01-04,"2011-1-4",6,3,1,3,1,-1,93,83,65,1019,1013,1008,10,10,10,21,6,NA,0,5,"Rain",224
2011-01-05,"2011-1-5",8,7,5,6,3,0,93,80,61,1008,1000,994,10,9,4,26,16,45,0,4,"Rain",200
2011-01-06,"2011-1-6",7,4,3,6,3,1,93,90,87,1002,996,993,10,9,5,13,6,NA,0,5,"Rain",281
2011-01-07,"2011-1-7",11,6,2,9,5,2,100,91,82,1003,999,996,10,7,2,24,11,NA,0,5,"Rain-Snow",124
2011-01-08,"2011-1-8",11,7,4,8,4,-1,87,77,65,1004,997,987,10,10,5,39,23,50,0,5,"Rain",230
2011-01-09,"2011-1-9",7,4,3,1,0,-1,87,74,57,1018,1012,1004,10,10,10,24,16,NA,0,NA,"",242

If we want to read that back in future we can do so with the following code:

weather = read.csv("weather/temp_data.csv")
weather$Date = as.POSIXct(weather$Date)
 
> weather %>% sample_n(10) %>% select(Date, Min_TemperatureC, Mean_TemperatureC, Max_TemperatureC)
           Date Min_TemperatureC Mean_TemperatureC Max_TemperatureC
1471 2015-01-10                5                 9               14
802  2013-03-12               -2                 1                4
1274 2014-06-27               14                18               22
848  2013-04-27                5                 8               10
832  2013-04-11                6                 8               10
717  2012-12-17                6                 7                9
1463 2015-01-02                6                 9               13
1090 2013-12-25                4                 6                7
560  2012-07-13               15                18               20
1230 2014-05-14                9                14               19

The next step was to bring the weather data together with the meetup attendance data that I already had.

For simplicity’s sake I’ve got those saved in a CSV file as we can just read those in as well:

timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01", tz = "GMT")
 
events = read.csv("events.csv")
events$eventTime = timestampToDate(events$eventTime)
 
> events %>% sample_n(10) %>% select(event.name, rsvps, eventTime)
                                                           event.name rsvps           eventTime
36                                   London Office Hours - Old Street    10 2012-01-18 17:00:00
137                                          Enterprise Search London    34 2011-05-23 18:15:00
256                           MarkLogic User Group London: Jim Fuller    40 2014-04-29 18:30:00
117                                  Neural Networks and Data Science   171 2013-03-28 18:30:00
210                                  London Office Hours - Old Street     3 2011-09-15 17:00:00
443                                                      July social!    12 2014-07-14 19:00:00
322                                                   Intro to Graphs    39 2014-09-03 18:30:00
203                                  Vendor focus: Amazon CloudSearch    24 2013-05-16 17:30:00
17  Neo4J Tales from the Trenches: A Recommendation Engine Case Study    12 2012-04-25 18:30:00
55                                                London Office Hours    10 2013-09-18 17:00:00

Now that we’ve got our two datasets ready we can plot a simple chart of the average attendance and temperature grouped by month:

byMonth = events %>% 
  mutate(month = factor(format(eventTime, "%B"), levels=month.name)) %>%
  group_by(month) %>%
  summarise(events = n(), 
            count = sum(rsvps)) %>%
  mutate(ave = count / events) %>%
  arrange(desc(ave))
 
averageTemperatureByMonth = weather %>% 
  mutate(month = factor(format(Date, "%B"), levels=month.name)) %>%
  group_by(month) %>% 
  summarise(aveTemperature = mean(Mean_TemperatureC))
 
g1 = ggplot(aes(x = month, y = aveTemperature, group=1), data = averageTemperatureByMonth) + 
  geom_line( ) + 
  ggtitle("Temperature by month")
 
g2 = ggplot(aes(x = month, y = count, group=1), data = byMonth) + 
  geom_bar(stat="identity", fill="dark blue") +
  ggtitle("Attendance by month")
 
library(gridExtra)
grid.arrange(g1,g2, ncol = 1)

2015 02 09 20 32 50

We can see a rough inverse correlation between the temperature and attendance, particularly between April and August – as the temperature increases, total attendance decreases.

But what about if we compare at a finer level of granularity such as a specific date? We can do that by adding a ‘day’ column to our events data frame and merging it with the weather one:

byDay = events %>% 
  mutate(day = as.Date(as.POSIXct(eventTime))) %>%
  group_by(day) %>%
  summarise(events = n(), 
            count = sum(rsvps)) %>%
  mutate(ave = count / events) %>%
  arrange(desc(ave))
weather = weather %>% mutate(day = Date)
merged = merge(weather, byDay, by = "day")

Now we can plot the attendance vs the mean temperature for individual days:

ggplot(aes(x =count, y = Mean_TemperatureC,group = day), data = merged) + 
  geom_point()
2015 02 10 07 21 24

Interestingly there now doesn’t seem to be any correlation between the temperature and attendance. We can confirm our suspicions by running a correlation:

> cor(merged$count, merged$Mean_TemperatureC)
[1] 0.008516294

Not even 1% correlation between the values! One way we could confirm that non correlation is to plot the average temperature against the average attendance rather than total attendance:

g1 = ggplot(aes(x = month, y = aveTemperature, group=1), data = averageTemperatureByMonth) + 
  geom_line( ) + 
  ggtitle("Temperature by month")
 
g2 = ggplot(aes(x = month, y = ave, group=1), data = byMonth) + 
  geom_bar(stat="identity", fill="dark blue") +
  ggtitle("Attendance by month")
 
grid.arrange(g1,g2, ncol = 1)

2015 02 11 06 48 05

Now we can see there’s not really that much of a correlation between temperature and month – in fact 9 of the months have a very similar average attendance. It’s only July, December and especially August where there’s a noticeable dip.

This could suggest there’s another variable other than temperature which is influencing attendance in these months. My hypothesis is that we’d see lower attendance in the weeks of school holidays – the main ones happen in July/August, December and March/April (which interestingly don’t show the dip!)

Another interesting thing to look into is whether the reason for the dip in attendance isn’t through lack of will from attendees but rather because there aren’t actually any events to go to. Let’s plot the number of events being hosted each month against the temperature:

g1 = ggplot(aes(x = month, y = aveTemperature, group=1), data = averageTemperatureByMonth) + 
  geom_line( ) + 
  ggtitle("Temperature by month")
 
g2 = ggplot(aes(x = month, y = events, group=1), data = byMonth) + 
  geom_bar(stat="identity", fill="dark blue") +
  ggtitle("Events by month")
 
grid.arrange(g1,g2, ncol = 1)

2015 02 11 06 57 16

Here we notice there’s a big dip in events in December – organisers are hosting less events and we know from our earlier plot that on average less people are attending those events. Lots of events are hosted in the Autumn, slightly fewer in the Spring and fewer in January, March and August in particular.

Again there’s no particular correlation between temperature and the number of events being hosted on a particular day:

ggplot(aes(x = events, y = Mean_TemperatureC,group = day), data = merged) + 
  geom_point()

2015 02 11 07 05 48

There’s not any obvious correlation from looking at this plot although I find it difficult to interpret plots where we have the values all grouped around very few points (often factor variables) on one axis and spread out (continuous variable) on the other. Let’s confirm our suspicion by calculating the correlation between these two variables:

> cor(merged$events, merged$Mean_TemperatureC)
[1] 0.0251698

Back to the drawing board for my attendance prediction model then!

If you have any suggestions for doing this analysis more effectively or I’ve made any mistakes please let me know in the comments, I’m still learning how to investigate what data is actually telling us.

Categories: Programming

Python/matpotlib: Plotting occurrences of the main characters in How I Met Your Mother

Fri, 01/30/2015 - 22:29

Normally when I’m playing around with data sets in R I get out ggplot2 to plot some charts to get a feel for the data but having spent quite a bit of time with Python and How I met your mother transcripts I haven’t created a single plot. I thought I’d better change change that.

After a bit of searching around it seems that matplotlib is the go to library for this job and I thought an interesting thing to plot would be how often each of the main characters appear in each episode across the show.

I’ve already got all the sentences from each episode as well as the list of episodes pulled out into CSV files so we can start from there.

This is a sample of the sentences file:

$ head -n 10 data/import/sentences.csv
SentenceId,EpisodeId,Season,Episode,Sentence
1,1,1,1,Pilot
2,1,1,1,Scene One
3,1,1,1,[Title: The Year 2030]
4,1,1,1,"Narrator: Kids, I'm going to tell you an incredible story. The story of how I met your mother"
5,1,1,1,Son: Are we being punished for something?
6,1,1,1,Narrator: No
7,1,1,1,"Daughter: Yeah, is this going to take a while?"
8,1,1,1,"Narrator: Yes. (Kids are annoyed) Twenty-five years ago, before I was dad, I had this whole other life."
9,1,1,1,"(Music Plays, Title ""How I Met Your Mother"" appears)"

My first step was to transform the CSV file into an array of words grouped by episode. I created a dictionary, iterated over the CSV file and then used nltk’s word tokeniser to pull out words from sentences:

import csv
from collections import defaultdict
 
episodes = defaultdict(list)
with open("data/import/sentences.csv", "r") as sentencesfile:
    reader = csv.reader(sentencesfile, delimiter = ",")
    reader.next()
    for row in reader:
        episodes[row[1]].append([ word for word in nltk.word_tokenize(row[4].lower())] )

Let’s have a quick look what’s in our dictionary:

>>> episodes.keys()[:10]
['165', '133', '132', '131', '130', '137', '136', '135', '134', '139']

We’ve got some episode numbers as we’d expect. Now let’s have a look at some of the words for one of the episodes:

>>> episodes["165"][5]
['\xe2\x99\xaa', 'how', 'i', 'met', 'your', 'mother', '8x05', '\xe2\x99\xaa']

So we’ve got an list of lists of words for each episode but gensim (which I wanted to play around with) requires a single array of words per document.

I transformed the data into the appropriate format and fed it into a gensim Dictionary:

from gensim import corpora
texts = []
for id, episode in episodes.iteritems():
    texts.append([item for sublist in episode for item in sublist])
dictionary = corpora.Dictionary(texts)

If we peek into ‘texts’ we can see that the list has been flattened:

>>> texts[0][10:20]
['a', 'bit', 'of', 'a', 'dog', ',', 'and', 'even', 'though', 'he']

We’ll now convert our dictionary of words into a sparse vector which contains pairs of word ids and the number of time they occur:

corpus = [dictionary.doc2bow(text) for text in texts]

Let’s try and find out how many times the word ‘ted’ occurs in our corpus. First we need to find out the word id for ‘ted':

>>> dictionary.token2id["ted"]
551

I don’t know how to look up the word id directly from the corpus but you can get back an individual document (episode) and its words quite easily:

>>> corpus[0][:5]
[(0, 8), (1, 1), (2, 2), (3, 13), (4, 20)]

We can then convert that into a dictionary and look up our word:

>>> dict(corpus[0]).get(551)
16

So ‘ted’ occurs 16 times in the first episode. If we generify that code we end up with the following:

words = ["ted", "robin", "barney", "lily", "marshall"]
words_dict = dict()
for word in words:
    word_id = dictionary.token2id[word]
    counts = []
    for episode in corpus:
        count = dict(episode).get(word_id) or 0
        counts.append(count)
    words_dict[word] = counts

There’s quite a lot of counts in there so let’s just preview the first 5 episodes:

>>> for word, counts in words_dict.iteritems():
       print word, counts[:5]
 
lily [3, 20, 47, 26, 41]
marshall [8, 25, 63, 27, 34]
barney [9, 94, 58, 92, 102]
ted [16, 46, 66, 32, 44]
robin [18, 43, 25, 24, 34]

Now it’s time to bring out matplotlib and make this visual! I initially put all the characters on one chart but it looks very messy and there’s a lot of overlap so I decided on separate charts.

The only thing I had to do to achieve this was call plt.figure() at the beginning of the loop to create a new plot:

import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
pylab.show()
 
for word, counts in words_dict.iteritems():
    plt.figure()
    plt.plot(counts)
    plt.legend([word], loc='upper left')
    plt.ylabel('occurrences')
    plt.xlabel('episode')
    plt.xlim(0, 208)
    plt.savefig('images/%s.png' % (word), dpi=200)

This generates plots like this:

2015 01 30 21 15 03

This is good but I thought it’d be interesting to put in the season demarcations to see if that could give any more insight. We can call the function plt.axvline and pass in the appropriate episode number to achieve this effect but I needed to know the episode ID for the last episode in each season which required a bit of code:

import pandas as pd
df = pd.read_csv('data/import/episodes.csv', index_col=False, header=0)
last_episode_in_season = list(df.groupby("Season").max()["NumberOverall"])
 
>>> last_episode_in_season
[22, 44, 64, 88, 112, 136, 160, 184, 208]

Now let’s plug that into matplotlib:

for word, counts in words_dict.iteritems():
    plt.figure()
    plt.plot(counts)
    for episode in last_episode_in_season:
        plt.axvline(x=episode, color = "red")
    plt.legend([word], loc='upper left')
    plt.ylabel('occurrences')
    plt.xlabel('episode')
    plt.xlim(0, 208)
    plt.savefig('images/%s.png' % (word), dpi=200)

2015 01 30 21 10 52

The last thing I wanted to do is get all the plots on the same scale for which I needed to get the maximum number of occurrences of any character in any episode. It was easier than I expected:

>>> y_max = max([max(count) for count in words_dict.values()])
>>> y_max
260

And now let’s plot again:

for word, counts in words_dict.iteritems():
    plt.figure()
    plt.plot(counts)
    for episode in last_episode_in_season:
        plt.axvline(x=episode, color = "red")
    plt.legend([word], loc='upper left')
    plt.ylabel('occurrences')
    plt.xlabel('episode')
    plt.xlim(0, 208)
    plt.ylim(0, y_max)
    plt.savefig('images/%s.png' % (word), dpi=200)


Our charts are now easy to compare:

2015 01 30 21 23 48

2015 01 30 21 24 03

For some reason there’s a big spike of the word ‘ted’ in the middle of the 7th season – I’m clearly not a big enough fan to know why that is but it’s a spike of 30% over the next highest value.

The data isn’t perfect – some of the episodes list the speaker of the sentence and some don’t so it may be that the spikes indicate that rather than anything else.

I find it’s always nice to do a bit of visual exploration of the data anyway and now I know it’s possible to do so pretty easily in Python land.

Categories: Programming

R: ggplot2 – Each group consist of only one observation. Do you need to adjust the group aesthetic?

Fri, 01/30/2015 - 01:27

I’ve been playing around with some weather data over the last couple of days which I aggregated down to the average temperature per month over the last 4 years and stored in a CSV file.

This is what the file looks like:

$ cat /tmp/averageTemperatureByMonth.csv
"month","aveTemperature"
"January",6.02684563758389
"February",5.89380530973451
"March",7.54838709677419
"April",10.875
"May",13.3064516129032
"June",15.9666666666667
"July",18.8387096774194
"August",18.3709677419355
"September",16.2583333333333
"October",13.4596774193548
"November",9.19166666666667
"December",7.01612903225806

I wanted to create a simple line chart which would show the months of the year in ascending order with the appropriate temperature.

My first attempt was the following:

df = read.csv("/tmp/averageTemperatureByMonth.csv")
df$month = factor(df$month, month.name)
 
ggplot(aes(x = month, y = aveTemperature), data = df) + 
  geom_line( ) + 
  ggtitle("Temperature by month")

which resulted in the following error:

geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?

My understanding is that the points don’t get joined up by default because the variable on the x axis is not a continuous one but rather a factor variable.

One way to work around this problem is to make it numeric, like so:

ggplot(aes(x = as.numeric(month), y = aveTemperature), data = df) + 
  geom_line( ) + 
  ggtitle("Temperature by month")

which results in the following chart:

2015 01 30 00 25 18

This isn’t bad but it’d be much nicer if we could have the month names along the bottom instead.

It turns out we can but we need to specify a group that each point belongs to. ggplot will then connects points which belong to the same group.

In this case we don’t really have one so we’ll define a dummy one instead:

ggplot(aes(x = month, y = aveTemperature, group=1), data = df) + 
  geom_line( ) + 
  ggtitle("Temperature by month")

And now we get the visualisation we want:

2015 01 29 23 28 23

Categories: Programming

Python: Find the highest value in a group

Sun, 01/25/2015 - 13:47

In my continued playing around with a How I met your mother data set I needed to find out the last episode that happened in a season so that I could use it in a chart I wanted to plot.

I had this CSV file containing each of the episodes:

$ head -n 10 data/import/episodes.csv
NumberOverall,NumberInSeason,Episode,Season,DateAired,Timestamp
1,1,/wiki/Pilot,1,"September 19, 2005",1127084400
2,2,/wiki/Purple_Giraffe,1,"September 26, 2005",1127689200
3,3,/wiki/Sweet_Taste_of_Liberty,1,"October 3, 2005",1128294000
4,4,/wiki/Return_of_the_Shirt,1,"October 10, 2005",1128898800
5,5,/wiki/Okay_Awesome,1,"October 17, 2005",1129503600
6,6,/wiki/Slutty_Pumpkin,1,"October 24, 2005",1130108400
7,7,/wiki/Matchmaker,1,"November 7, 2005",1131321600
8,8,/wiki/The_Duel,1,"November 14, 2005",1131926400
9,9,/wiki/Belly_Full_of_Turkey,1,"November 21, 2005",1132531200

I started out by parsing the CSV file into a dictionary of (seasons -> episode ids):

import csv
from collections import defaultdict
 
seasons = defaultdict(list)
with open("data/import/episodes.csv", "r") as episodesfile:
    reader = csv.reader(episodesfile, delimiter = ",")
    reader.next()
    for row in reader:
        seasons[int(row[3])].append(int(row[0]))
 
print seasons

which outputs the following:

$ python blog.py
defaultdict(<type 'list'>, {
  1: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], 
  2: [23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44], 
  3: [45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64], 
  4: [65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88], 
  5: [89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112], 
  6: [113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136], 
  7: [137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160], 
  8: [161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184], 
  9: [185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208]})

It’s reasonably easy to transform that into a dictionary of (season -> max episode id) with the following couple of lines:

for season, episode_ids in seasons.iteritems():
    seasons[season] = max(episode_ids)
 
>>> print seasons
defaultdict(<type 'list'>, {1: 22, 2: 44, 3: 64, 4: 88, 5: 112, 6: 136, 7: 160, 8: 184, 9: 208})

This works fine but it felt very much like a dplyr problem to me so I wanted to see whether I could write something cleaner using pandas.

I started out by capturing the seasons and episode ids in separate lists and then building up a DataFrame:

import pandas as pd
from pandas import DataFrame
 
seasons, episode_ids = [], []
with open("data/import/episodes.csv", "r") as episodesfile:
    reader = csv.reader(episodesfile, delimiter = ",")
    reader.next()
    for row in reader:
        seasons.append(int(row[3]))
        episode_ids.append(int(row[0]))
 
df = DataFrame.from_items([('Season', seasons), ('EpisodeId', episode_ids)])
 
>>> print df.groupby("Season").max()["EpisodeId"]
Season
1          22
2          44
3          64
4          88
5         112
6         136
7         160
8         184
9         208

Or we can simplify that and read the CSV file directly into a DataFrame:

df = pd.read_csv('data/import/episodes.csv', index_col=False, header=0)
 
>>> print df.groupby("Season").max()["NumberOverall"]
Season
1          22
2          44
3          64
4          88
5         112
6         136
7         160
8         184
9         208

Pretty neat. I need to get more into pandas.

Categories: Programming

Python/pdfquery: Scraping the FIFA World Player of the Year votes PDF into shape

Thu, 01/22/2015 - 01:25

Last week the FIFA Ballon d’Or 2014 was announced and along with the announcement of the winner the individual votes were also made available.

Unfortunately they weren’t made open in a way that Ben Wellington (of IQuantNY fame) would approve of – the choice of format for the data is a PDF file!

I wanted to extract this data to play around with it but I wanted to automate the extraction as I’d done when working with Google Trends data.

I had a quick look for PDF scraping libraries in Python and R and eventually settled on Python’s pdfquery, mainly because there was lots of documentation which made it easy to get started.

One way you scrape data from a PDF is by locating an element on the page and then grabbing everything within a bounded box relative to that element.

In my case I had 17 pages all of which had a heading for each of six columns.

2015 01 22 00 08 18

I wanted to grab the data in each of those columns but initially struggled working out what elements I should be looking for until I came across the following function which allows you to dump an XML version of the PDF to disk:

import pdfquery
pdf = pdfquery.PDFQuery("fboaward_menplayer2014_neutral.pdf")
pdf.load()
pdf.tree.write("/tmp/yadda", pretty_print=True)

The output looks like this:

$ head -n 10 /tmp/yadda
<pdfxml ModDate="D:20150110224554+01'00'" CreationDate="D:20150110224539+01'00'" Producer="Microsoft&#174; Excel&#174; 2010" Creator="Microsoft&#174; Excel&#174; 2010">
  <LTPage bbox="[0, 0, 841.8, 595.2]" height="595.2" pageid="1" rotate="0" width="841.8" x0="0" x1="841.8" y0="0" y1="595.2" page_index="0" page_label="">
    <LTAnon> </LTAnon>
    <LTTextLineHorizontal bbox="[31.08, 546.15, 122.524, 556.59]" height="10.44" width="91.444" word_margin="0.1" x0="31.08" x1="122.524" y0="546.15" y1="556.59"><LTTextBoxHorizontal bbox="[31.08, 546.15, 122.524, 556.59]" height="10.44" index="0" width="91.444" x0="31.08" x1="122.524" y0="546.15" y1="556.59">FIFA Ballon d'Or 2014 </LTTextBoxHorizontal></LTTextLineHorizontal>
    <LTAnon> </LTAnon>
    <LTAnon> </LTAnon>
    <LTAnon> </LTAnon>
    <LTAnon> </LTAnon>
    <LTAnon> </LTAnon>
    <LTAnon> </LTAnon>

Having scanned through the file I realised that what I needed to do was locate the ‘LTTextLineHorizontal’ element for each heading and then grab all the ‘LTTextLineHorizontal’ elements that appeared in that column.

I started out by trying to grab the ‘Name’ column on the first page:

>>> name_element = pdf.pq('LTPage[pageid=\'1\'] LTTextLineHorizontal:contains("Name")')[0]
>>> name_element.text
'Name '

Next I needed to get the other elements in that column. With a bit of trial and error I ended up with the following code:

x = float(name_element.get('x0'))
y = float(name_element.get('y0'))
cells = pdf.extract( [
         ('with_parent','LTPage[pageid=\'1\']'),
         ('cells', 'LTTextLineHorizontal:in_bbox("%s,%s,%s,%s")' % (x, y-500, x+150, y))
    ])
 
>>> [cell.text.encode('utf-8').strip() for cell in cells['cells']]
['Amiri Islam', 'Cana Lorik', 'Bougherra Madjid', 'Luvu Rafe Talalelei', 'Sonejee Masand Oscar', 'Amaral Felisberto', 'Liddie Ryan', 'Griffith Quinton', 'Messi Lionel', 'Berezovskiy Roman', 'Breinburg Reinhard', 'Jedinak Mile', 'Fuchs Christian', 'Sadigov Rashad', 'Gavin Christie', 'Hasan Mohamed', 'Mamun Md Mamnul Islam', 'Burgess Romelle', 'Kalachou Tsimafei', 'Komany Vincent', 'Eiley Dalton', 'Nusum John', 'Tshering Passang', 'Raldes Ronald', 'D\xc5\xbeeko Edin', 'Da Silva Santos Junior Neymar', 'Ceasar Troy', 'Popov Ivelin', 'Kabore Charles', 'Ntibazonkiza Saidi', 'Kouch Sokumpheak']

I cleaned that up and generified it to work for any page and for columns of different widths. This is what the function looks like:

def extract_cells(page, header, cell_width):
    name_element = pdf.pq('LTPage[pageid=\'%s\'] LTTextLineHorizontal:contains("%s")' % (page, header))[0]
    x = float(name_element.get('x0'))
    y = float(name_element.get('y0'))
    cells = pdf.extract( [
         ('with_parent','LTPage[pageid=\'%s\']' %(page)),
         ('cells', 'LTTextLineHorizontal:in_bbox("%s,%s,%s,%s")' % (x, y-500, x+cell_width, y))
    ])
    return [cell.text.encode('utf-8').strip() for cell in cells['cells']]

We can then call that for each column on the page and zip together the resulting arrays to get a tuple for each row:

roles = extract_cells(1, "Vote", 50)
countries = extract_cells(1, "Country", 150)
voters = extract_cells(1, "Name", 170)
first = extract_cells(1, "First (5 points)", 150)
second = extract_cells(1, "Second (3 points)", 150)
third = extract_cells(1, "Third (1 point)", 130)
 
>>> for vote in zip(roles, countries, voters, first, second, third)[:5]:
       print vote
 
('Captain', 'Afghanistan', 'Amiri Islam', 'Messi Lionel', 'Cristiano Ronaldo', 'Ibrahimovic Zlatan')
('Captain', 'Albania', 'Cana Lorik', 'Cristiano Ronaldo', 'Robben Arjen', 'Mueller Thomas')
('Captain', 'Algeria', 'Bougherra Madjid', 'Cristiano Ronaldo', 'Robben Arjen', 'Benzema Karim')
('Captain', 'American Samoa', 'Luvu Rafe Talalelei', 'Neymar', 'Robben Arjen', 'Cristiano Ronaldo')
('Captain', 'Andorra', 'Sonejee Masand Oscar', 'Cristiano Ronaldo', 'Mueller Thomas', 'Kroos Toni')

The next step was to write out each of those rows to a CSV file so we can use it from another program. The full script looks like this:

import pdfquery
import csv
 
def extract_cells(page, header, cell_width):
    name_element = pdf.pq('LTPage[pageid=\'%s\'] LTTextLineHorizontal:contains("%s")' % (page, header))[0]
    x = float(name_element.get('x0'))
    y = float(name_element.get('y0'))
    cells = pdf.extract( [
         ('with_parent','LTPage[pageid=\'%s\']' %(page)),
         ('cells', 'LTTextLineHorizontal:in_bbox("%s,%s,%s,%s")' % (x, y-500, x+cell_width, y))
    ])
    return [cell.text.encode('utf-8').strip() for cell in cells['cells']]
 
if __name__ == "__main__":
    pdf = pdfquery.PDFQuery("fboaward_menplayer2014_neutral.pdf")
    pdf.load()
    pdf.tree.write("/tmp/yadda", pretty_print=True)
 
    pages_in_pdf = len(pdf.pq('LTPage'))
 
    with open('votes.csv', 'w') as votesfile:
        writer = csv.writer(votesfile, delimiter=",")
        writer.writerow(["Role", "Country", "Voter", "FirstPlace", "SecondPlace", "ThirdPlace"])
        for page in range(1, pages_in_pdf + 1):
            print page
            roles = extract_cells(page, "Vote", 50)
            countries = extract_cells(page, "Country", 150)
            voters = extract_cells(page, "Name", 170)
            first = extract_cells(page, "First (5 points)", 150)
            second = extract_cells(page, "Second (3 points)", 150)
            third = extract_cells(page, "Third (1 point)", 130)
            votes = zip(roles, countries, voters, first, second, third)
            print votes
            for vote in votes:
                writer.writerow(list(vote))

The code is on github if you want to play around with it or if you just want to grab the votes data that’s there too.

Categories: Programming

Python/NLTK: Finding the most common phrases in How I Met Your Mother

Mon, 01/19/2015 - 01:24

Following on from last week’s blog post where I found the most popular words in How I met your mother transcripts, in this post we’ll have a look at how we can pull out sentences and then phrases from our corpus.

The first thing I did was tweak the scraping script to pull out the sentences spoken by characters in the transcripts.

Each dialogue is separated by two line breaks so we use that as our separator. I also manually skimmed through the transcripts and found out which tags we need to strip out. I ended up with the following:

import csv
import nltk
import re
import bs4
 
from bs4 import BeautifulSoup, NavigableString
from soupselect import select
from nltk.corpus import stopwords
from collections import Counter
from nltk.tokenize import word_tokenize
 
episodes_dict = {}
 
def strip_tags(soup, invalid_tags):
    for tag in invalid_tags:
        for match in soup.findAll(tag):
            match.replaceWithChildren()
    return soup
 
def extract_sentences(html):
    clean = []
    brs_in_a_row = 0
    temp = ""
    for item in raw_text.contents:
        if item.name == "br":
            brs_in_a_row = brs_in_a_row + 1
        else:
            temp = temp + item
        if brs_in_a_row == 2:
            clean.append(temp)
            temp = ""
            brs_in_a_row = 0
    return clean
 
speakers = []
with open('data/import/episodes.csv', 'r') as episodes_file, \
     open("data/import/sentences.csv", 'w') as sentences_file:
    reader = csv.reader(episodes_file, delimiter=',')
    reader.next()
 
    writer = csv.writer(sentences_file, delimiter=',')
    writer.writerow(["SentenceId", "EpisodeId", "Season", "Episode", "Sentence"])
    sentence_id = 1
 
    for row in reader:
        transcript = open("data/transcripts/S%s-Ep%s" %(row[3], row[1])).read()
        soup = BeautifulSoup(transcript)
        rows = select(soup, "table.tablebg tr td.post-body div.postbody")
 
        raw_text = rows[0]
        [ad.extract() for ad in select(raw_text, "div.ads-topic")]
        [ad.extract() for ad in select(raw_text, "div.t-foot-links")]
        [ad.extract() for ad in select(raw_text, "hr")]
 
        for tag in ['strong', 'em', "a"]:
            for match in raw_text.findAll(tag):
                match.replace_with_children()
        print row
        for sentence in [
                item.encode("utf-8").strip()
                for item in extract_sentences(raw_text.contents)
            ]:
            writer.writerow([sentence_id, row[0], row[3], row[1], sentence])
            sentence_id = sentence_id + 1

Here’s a preview of the sentences CSV file:

$ head -n 10 data/import/sentences.csv
SentenceId,EpisodeId,Season,Episode,Sentence
1,1,1,1,Pilot
2,1,1,1,Scene One
3,1,1,1,[Title: The Year 2030]
4,1,1,1,"Narrator: Kids, I'm going to tell you an incredible story. The story of how I met your mother"
5,1,1,1,Son: Are we being punished for something?
6,1,1,1,Narrator: No
7,1,1,1,"Daughter: Yeah, is this going to take a while?"
8,1,1,1,"Narrator: Yes. (Kids are annoyed) Twenty-five years ago, before I was dad, I had this whole other life."
9,1,1,1,"(Music Plays, Title ""How I Met Your Mother"" appears)"

The next step is to iterate through each of those sentences and create some n-grams to capture the common phrases in the transcripts.

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech.

Python’s nltk library has a function that makes this easy e.g.

>>> import nltk
>>> tokens = nltk.word_tokenize("I want to be in an n gram")
>>> tokens
['I', 'want', 'to', 'be', 'in', 'an', 'n', 'gram']
>>> nltk.util.ngrams(tokens, 2)
[('I', 'want'), ('want', 'to'), ('to', 'be'), ('be', 'in'), ('in', 'an'), ('an', 'n'), ('n', 'gram')]
>>> nltk.util.ngrams(tokens, 3)
[('I', 'want', 'to'), ('want', 'to', 'be'), ('to', 'be', 'in'), ('be', 'in', 'an'), ('in', 'an', 'n'), ('an', 'n', 'gram')]

If we do a similar thing of HIMYM transcripts while stripping out the speaker’s name – lines are mostly in the form “Speaker:Sentence” – we end up with the following top phrases:

import nltk
import csv
import string
import re
 
from collections import Counter
 
non_speaker = re.compile('[A-Za-z]+: (.*)')
 
def extract_phrases(text, phrase_counter, length):
    for sent in nltk.sent_tokenize(text):
        strip_speaker = non_speaker.match(sent)
        if strip_speaker is not None:
            sent = strip_speaker.group(1)
        words = nltk.word_tokenize(sent)
        for phrase in nltk.util.ngrams(words, length):
            phrase_counter[bphrase] += 1
 
phrase_counter = Counter()
 
with open("data/import/sentences.csv", "r") as sentencesfile:
    reader = csv.reader(sentencesfile, delimiter=",")
    reader.next()
    for sentence in reader:
        extract_phrases(sentence[4], phrase_counter, 3)
 
most_common_phrases = phrase_counter.most_common(50)
for k,v in most_common_phrases:
    print '{0: <5}'.format(v), k

And if we run that:

$ python extract_phrases.py
1123  (',', 'I', "'m")
1099  ('I', 'do', "n't")
1005  (',', 'it', "'s")
535   ('I', 'ca', "n't")
523   ('I', "'m", 'not')
507   ('I', 'mean', ',')
507   (',', 'you', "'re")
459   (',', 'that', "'s")
458   ('2030', ')', ':')
454   ('(', '2030', ')')
453   ('Ted', '(', '2030')
449   ('I', "'m", 'sorry')
...
247   ('I', 'have', 'to')
247   ('No', ',', 'I')
246   ("'s", 'gon', 'na')
241   (',', 'I', "'ll")
229   ('I', "'m", 'going')
226   ('do', "n't", 'want')
226   ('It', "'s", 'not')

I noticed that quite a few of the phrases had punctuation in so my next step was to get rid of any of the phrases that had any punctuation in. I updated extract_phrases like so:

def extract_phrases(text, phrase_counter, length):
    for sent in nltk.sent_tokenize(text):
        strip_speaker = non_speaker.match(sent)
        if strip_speaker is not None:
            sent = strip_speaker.group(1)
        words = nltk.word_tokenize(sent)
        for phrase in nltk.util.ngrams(words, length):
            if all(word not in string.punctuation for word in phrase):
                phrase_counter[phrase] += 1

Let’s run it again:

$ python extract_phrases.py
1099  ('I', 'do', "n't")
535   ('I', 'ca', "n't")
523   ('I', "'m", 'not')
449   ('I', "'m", 'sorry')
414   ('do', "n't", 'know')
383   ('Ted', 'from', '2030')
338   ("'m", 'gon', 'na')
334   ('I', "'m", 'gon')
300   ('gon', 'na', 'be')
279   ('END', 'OF', 'FLASHBACK')
267   ("'re", 'gon', 'na')
...
155   ('It', "'s", 'just')
151   ('at', 'the', 'bar')
150   ('a', 'lot', 'of')
147   ("'re", 'going', 'to')
144   ('I', 'have', 'a')
142   ('I', "'m", 'so')
138   ('do', "n't", 'have')
137   ('I', 'think', 'I')
136   ('not', 'gon', 'na')
136   ('I', 'can', 'not')
135   ('and', 'I', "'m")

Next I wanted to display each phrase as a string rather than a tuple which was more difficult than I expected. I ended up with the following function which almost does the job:

def untokenize(ngram):
    tokens = list(ngram)
    return "".join([" "+i if not i.startswith("'") and \
                             i not in string.punctuation and \
                             i != "n't"
                          else i for i in tokens]).strip()

I updated extract_phrases to use that function:

def extract_phrases(text, phrase_counter, length):
    for sent in nltk.sent_tokenize(text):
        strip_speaker = non_speaker.match(sent)
        if strip_speaker is not None:
            sent = strip_speaker.group(1)
        words = nltk.word_tokenize(sent)
        for phrase in nltk.util.ngrams(words, length):
            if all(word not in string.punctuation for word in phrase):
                phrase_counter[untokenize(phrase)] += 1

Let’s go again:

$ python extract_phrases.py
1099  I don't
535   I can't
523   I'm not
449   I'm sorry
414   don't know
383   Ted from 2030
338   'm gon na
334   I'm gon
300   gon na be
279   END OF FLASHBACK
...
151   at the bar
150   a lot of
147   're going to
144   I have a
142   I'm so
138   don't have
137   I think I
136   not gon na
136   I can not
135   and I'm

These were some of the interesting things that stood out for me and deserve further digging into:

  • A lot of the most popular phrases begin with ‘I’ – it would be interesting to filter those sentences to find the general sentiment.
  • The ‘untokenize’ function struggles to reconstruct the slang phrase ‘gonna’ into a single word.
  • ‘Ted from 2030′ is actually a speaker which doesn’t follow the expected regex pattern and so wasn’t filtered out.
  • ‘END OF FLASHBACK’ shows quite high up and pulling out those flashbacks would probably be an interesting feature to extract to see which episodes reference each other.
  • ‘Marshall and Lily’ and ‘Lily and Marshall’ show up on the list – it would be interesting to explore the frequency of pairs of other characters.

The code is all on github if you want to play with it.

Categories: Programming

Python: Counter – ValueError: too many values to unpack

Tue, 01/13/2015 - 00:16

I recently came across Python’s Counter tool which makes it really easy to count the number of occurrences of items in a list.

In my case I was trying to work out how many times words occurred in a corpus so I had something like the following:

>> from collections import Counter
>> counter = Counter(["word1", "word2", "word3", "word1"])
>> print counter
Counter({'word1': 2, 'word3': 1, 'word2': 1})

I wanted to write a for loop to iterate over the counter and print the (key, value) pairs and started with the following:

>>> for key, value in counter:
...   print key, value
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: too many values to unpack

I’m not sure why I expected this to work but in fact since Counter is a sub class of dict we need to call iteritems to get an iterator of pairs rather than just keys.

The following does the job:

>>> for key, value in counter.iteritems():
...   print key, value
...
word1 2
word3 1
word2 1

Hopefully future Mark will remember this!

Categories: Programming

Python: scikit-learn: ImportError: cannot import name __check_build

Sat, 01/10/2015 - 09:48

In part 3 of Kaggle’s series on text analytics I needed to install scikit-learn and having done so ran into the following error when trying to use one of its classes:

>>> from sklearn.feature_extraction.text import CountVectorizer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/sklearn/__init__.py", line 37, in <module>
    from . import __check_build
ImportError: cannot import name __check_build

This error doesn’t reveal very much but I found that when I exited the REPL and tried the same command again I got a different error which was a bit more useful:

>>> from sklearn.feature_extraction.text import CountVectorizer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/sklearn/__init__.py", line 38, in <module>
    from .base import clone
  File "/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/sklearn/base.py", line 10, in <module>
    from scipy import sparse
ImportError: No module named scipy

The fix for this is now obvious:

$ pip install scipy

And I can now load CountVectorizer without any problem:

$ python
Python 2.7.5 (default, Aug 25 2013, 00:04:04)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from sklearn.feature_extraction.text import CountVectorizer
Categories: Programming

Python: gensim – clang: error: unknown argument: ‘-mno-fused-madd’ [-Wunused-command-line-argument-hard-error-in-future]

Sat, 01/10/2015 - 09:39

While working through part 2 of Kaggle’s bag of words tutorial I needed to install the gensim library and initially ran into the following error:

$ pip install gensim
 
...
 
cc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -arch i386 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -mno-fused-madd -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch x86_64 -arch i386 -pipe -I/Users/markneedham/projects/neo4j-himym/himym/build/gensim/gensim/models -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -I/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/numpy/core/include -c ./gensim/models/word2vec_inner.c -o build/temp.macosx-10.9-intel-2.7/./gensim/models/word2vec_inner.o
 
clang: error: unknown argument: '-mno-fused-madd' [-Wunused-command-line-argument-hard-error-in-future]
 
clang: note: this will be a hard error (cannot be downgraded to a warning) in the future
 
command 'cc' failed with exit status 1
 
an integer is required
 
Traceback (most recent call last):
 
  File "<string>", line 1, in <module>
 
  File "/Users/markneedham/projects/neo4j-himym/himym/build/gensim/setup.py", line 166, in <module>
 
    include_package_data=True,
 
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/core.py", line 152, in setup
 
    dist.run_commands()
 
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 953, in run_commands
 
    self.run_command(cmd)
 
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 972, in run_command
 
    cmd_obj.run()
 
  File "/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/setuptools/command/install.py", line 59, in run
 
    return orig.install.run(self)
 
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/command/install.py", line 573, in run
 
    self.run_command('build')
 
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/cmd.py", line 326, in run_command
 
    self.distribution.run_command(command)
 
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 972, in run_command
 
    cmd_obj.run()
 
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/command/build.py", line 127, in run
 
    self.run_command(cmd_name)
 
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/cmd.py", line 326, in run_command
 
    self.distribution.run_command(command)
 
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 972, in run_command
 
    cmd_obj.run()
 
  File "/Users/markneedham/projects/neo4j-himym/himym/build/gensim/setup.py", line 71, in run
 
    "There was an issue with your platform configuration - see above.")
 
TypeError: an integer is required
 
----------------------------------------
Cleaning up...
Command /Users/markneedham/projects/neo4j-himym/himym/bin/python -c "import setuptools, tokenize;__file__='/Users/markneedham/projects/neo4j-himym/himym/build/gensim/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/sb/6zb6j_7n6bz1jhhplc7c41n00000gn/T/pip-i8aeKR-record/install-record.txt --single-version-externally-managed --compile --install-headers /Users/markneedham/projects/neo4j-himym/himym/include/site/python2.7 failed with error code 1 in /Users/markneedham/projects/neo4j-himym/himym/build/gensim
Storing debug log for failure in /Users/markneedham/.pip/pip.log

The exception didn’t make much sense to me but I came across a blog post which explained it:

The Apple LLVM compiler in Xcode 5.1 treats unrecognized command-line options as errors. This issue has been seen when building both Python native extensions and Ruby Gems, where some invalid compiler options are currently specified.

The author suggests this only became a problem with XCode 5.1 so I’m surprised I hadn’t come across it sooner since I haven’t upgraded XCode in a long time.

We can work around the problem by telling the compiler to treat extra command line arguments as a warning rather than an error

export ARCHFLAGS=-Wno-error=unused-command-line-argument-hard-error-in-future

Now it installs with no problems.

Categories: Programming