Mark Needham: Thoughts on Software Development

Python: UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xe2 in position 0: ordinal not in range(128)

Wed, 07/15/2015 - 07:20

I was recently doing some text scrubbing and had difficulty working out how to remove the ‘†’ character from strings.

e.g. I had a string like this:

>>> u'foo †'
u'foo \u2020'

I wanted to get rid of the ‘†’ character and then strip any trailing spaces so I’d end up with the string ‘foo’. I tried to do this in one call to ‘replace’:

>>> u'foo †'.replace(" †", "")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)

It took me a while to work out that the replacement string ‘ †’ was a byte string being decoded as ASCII rather than UTF-8. Let’s fix that by making it a unicode literal too:

>>> u'foo †'.replace(u' †', "")
u'foo'

I think the following call to unicode, which I’ve written about before, is equivalent:

>>> u'foo †'.replace(unicode(' †', "utf-8"), "")
u'foo'

Now back to the scrubbing!


R: I write more in the last week of the month, or do I?

Sun, 07/12/2015 - 10:53

I’ve been writing on this blog for almost 7 years and have always believed that I write more frequently towards the end of a month. Now that I’ve got all the data I thought it’d be interesting to test that belief.

I started with a data frame containing each post and its publication date and added an extra column which works out how many weeks from the end of the month that post was written:

> df %>% sample_n(5)
                                                               title                date
946  Python: Equivalent to flatMap for flattening an array of arrays 2015-03-23 00:45:00
175                                         Ruby: Hash default value 2010-10-16 14:02:37
375               Java/Scala: Runtime.exec hanging/in 'pipe_w' state 2011-11-20 20:20:08
1319                            Coding Dojo #18: Groovy Bowling Game 2009-06-26 08:15:23
381                   Continuous Delivery: Removing manual scenarios 2011-12-05 23:13:34
 
calculate_start_of_week = function(week, year) {
  date <- ymd(paste(year, 1, 1, sep="-"))
  week(date) = week
  return(date)
}
 
tidy_df  = df %>% 
  mutate(year = year(date), 
         week = week(date),
         week_in_month = ceiling(day(date) / 7),
         max_week = max(week_in_month), 
         weeks_from_end = max_week - week_in_month,
         start_of_week = calculate_start_of_week(week, year))
 
> tidy_df %>% select(date, weeks_from_end, start_of_week) %>% sample_n(5)
 
                    date weeks_from_end start_of_week
1023 2008-08-08 21:16:02              3    2008-08-05
800  2014-01-31 06:51:06              0    2014-01-29
859  2014-08-14 10:24:52              3    2014-08-13
107  2010-07-10 22:49:52              3    2010-07-09
386  2011-12-20 23:57:51              2    2011-12-17

Next I want to get a count of how many posts were published in a given week. The following code does that transformation for us:

weeks_from_end_counts =  tidy_df %>%
  group_by(start_of_week, weeks_from_end) %>%
  summarise(count = n())
 
> weeks_from_end_counts
Source: local data frame [540 x 4]
Groups: start_of_week, weeks_from_end
 
   start_of_week weeks_from_end year count
1     2006-08-27              0 2006     1
2     2006-08-27              4 2006     3
3     2006-09-03              4 2006     1
4     2008-02-05              3 2008     2
5     2008-02-12              3 2008     2
6     2008-07-15              2 2008     1
7     2008-07-22              1 2008     1
8     2008-08-05              3 2008     8
9     2008-08-12              2 2008     5
10    2008-08-12              3 2008     9
..           ...            ...  ...   ...

We group by both ‘start_of_week’ and ‘weeks_from_end’ because we could have posts published in the same week but in different months and we want to capture that difference. Now we can run a correlation on the data frame to see if there’s any relationship between ‘count’ and ‘weeks_from_end’:

> cor(weeks_from_end_counts %>% ungroup() %>% select(weeks_from_end, count))
               weeks_from_end       count
weeks_from_end     1.00000000 -0.08253569
count             -0.08253569  1.00000000

This suggests there’s a slight negative correlation between the two variables i.e. ‘count’ decreases as ‘weeks_from_end’ increases. Let’s plug the data frame into a linear model to see how good ‘weeks_from_end’ is as a predictor of ‘count’:

> fit = lm(count ~ weeks_from_end, weeks_from_end_counts)
 
> summary(fit)
 
Call:
lm(formula = count ~ weeks_from_end, data = weeks_from_end_counts)
 
Residuals:
    Min      1Q  Median      3Q     Max 
-2.0000 -1.5758 -0.5758  1.1060  8.0000 
 
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     3.00000    0.13764  21.795   <2e-16 ***
weeks_from_end -0.10605    0.05521  -1.921   0.0553 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 1.698 on 538 degrees of freedom
Multiple R-squared:  0.006812,	Adjusted R-squared:  0.004966 
F-statistic:  3.69 on 1 and 538 DF,  p-value: 0.05527

We see a similar result here. The ‘weeks_from_end’ coefficient is -0.106, i.e. roughly 0.1 fewer posts per week for each week further from the end of the month, with a p value of 0.0553, so it’s borderline significant.

We also have a very low R squared value, which suggests that ‘weeks_from_end’ isn’t explaining much of the variation in the data. That makes sense given that we didn’t see much of a correlation.
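
As a sanity check, for a model with a single predictor the R squared is just the square of the correlation between the two variables, so the two results should line up:

summary(fit)$r.squared   # 0.006812...
(-0.08253569) ^ 2        # the correlation from above, squared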

If we charged on and wanted to predict the number of posts likely to be published in a given week we could run the predict function like this:

> predict(fit, data.frame(weeks_from_end=c(1,2,3,4,5)))
       1        2        3        4        5 
2.893952 2.787905 2.681859 2.575812 2.469766

Obviously it’s a bit flawed since we could plug in any numeric value we want, even ones that don’t make any sense, and it’d still come back with a prediction:

> predict(fit, data.frame(weeks_from_end=c(30 ,-10)))
        1         2 
-0.181394  4.060462

I think we’d probably protect against that with a function wrapping our call to predict that doesn’t allow ‘weeks_from_end’ to be greater than 5 or less than 0.
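
A minimal sketch of what that wrapper might look like (‘predict_bounded’ is just an illustrative name):

predict_bounded = function(fit, weeks_from_end) {
  # refuse values outside the range we actually observed
  stopifnot(all(weeks_from_end >= 0 & weeks_from_end <= 5))
  predict(fit, data.frame(weeks_from_end = weeks_from_end))
}
 
predict_bounded(fit, c(1, 2, 3))  # behaves like the call above
predict_bounded(fit, 30)          # raises an error instead of extrapolating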

So far it looks like my belief is incorrect! I’m a bit dubious about my calculation of ‘weeks_from_end’ though – it’s not completely capturing what I want since in some months the last week only contains a couple of days.

Next I’m going to explore whether it makes any difference if I calculate that value by counting the number of days back from the last day of the month rather than using week number.
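
As a rough sketch of that alternative, lubridate’s days_in_month function returns the length of the month a date falls in, so the calculation could look something like this (‘days_from_month_end’ is a made-up name):

days_from_month_end = function(date) {
  # e.g. 2014-01-01 -> 31 - 1 = 30 days from the end of January
  as.numeric(days_in_month(date) - day(date))
}
 
days_from_month_end(ymd("2014-01-01"))  # 30
days_from_month_end(ymd("2014-01-31"))  # 0, the last day of the month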


R: Filling in missing dates with 0s

Sun, 07/12/2015 - 09:30

I wanted to plot a chart showing the number of blog posts published by month and started with the following code which makes use of zoo’s ‘as.yearmon’ function to add the appropriate column and grouping:

> library(zoo)
> library(dplyr)
> df %>% sample_n(5)
                                                  title                date
888        R: Converting a named vector to a data frame 2014-10-31 23:47:26
144  Rails: Populating a dropdown list using 'form_for' 2010-08-31 01:22:14
615                    Onboarding: Sketch the landscape 2013-02-15 07:36:06
28                        Javascript: The 'new' keyword 2010-03-06 15:16:02
1290                Coding Dojo #16: Reading SUnit code 2009-05-28 23:23:19
 
> posts_by_date  = df %>% mutate(year_mon = as.Date(as.yearmon(date))) %>% count(year_mon)
> posts_by_date %>% head(5)
 
    year_mon  n
1 2006-08-01  1
2 2006-09-01  4
3 2008-02-01  4
4 2008-07-01  2
5 2008-08-01 38

I then plugged the new data frame into ggplot to get the chart:

> ggplot(aes(x = year_mon, y = n), data = posts_by_date) + geom_line()

[chart: posts per month]

The problem with this chart is that it shows 4 posts per month for all the dates between September 2006 and February 2008 even though I didn’t write anything in that period! It does the same thing between February 2008 and July 2008 too.

We can fix that by filling in the gaps with 0s.

First we’ll create a vector containing every month in the date range covered by our data frame:

> all_dates = seq(as.Date(as.yearmon(min(df$date))), as.Date(as.yearmon(max(df$date))), by="month")
 
> all_dates
  [1] "2006-08-01" "2006-09-01" "2006-10-01" "2006-11-01" "2006-12-01" "2007-01-01" "2007-02-01" "2007-03-01"
  [9] "2007-04-01" "2007-05-01" "2007-06-01" "2007-07-01" "2007-08-01" "2007-09-01" "2007-10-01" "2007-11-01"
 [17] "2007-12-01" "2008-01-01" "2008-02-01" "2008-03-01" "2008-04-01" "2008-05-01" "2008-06-01" "2008-07-01"
 [25] "2008-08-01" "2008-09-01" "2008-10-01" "2008-11-01" "2008-12-01" "2009-01-01" "2009-02-01" "2009-03-01"
 [33] "2009-04-01" "2009-05-01" "2009-06-01" "2009-07-01" "2009-08-01" "2009-09-01" "2009-10-01" "2009-11-01"
 [41] "2009-12-01" "2010-01-01" "2010-02-01" "2010-03-01" "2010-04-01" "2010-05-01" "2010-06-01" "2010-07-01"
 [49] "2010-08-01" "2010-09-01" "2010-10-01" "2010-11-01" "2010-12-01" "2011-01-01" "2011-02-01" "2011-03-01"
 [57] "2011-04-01" "2011-05-01" "2011-06-01" "2011-07-01" "2011-08-01" "2011-09-01" "2011-10-01" "2011-11-01"
 [65] "2011-12-01" "2012-01-01" "2012-02-01" "2012-03-01" "2012-04-01" "2012-05-01" "2012-06-01" "2012-07-01"
 [73] "2012-08-01" "2012-09-01" "2012-10-01" "2012-11-01" "2012-12-01" "2013-01-01" "2013-02-01" "2013-03-01"
 [81] "2013-04-01" "2013-05-01" "2013-06-01" "2013-07-01" "2013-08-01" "2013-09-01" "2013-10-01" "2013-11-01"
 [89] "2013-12-01" "2014-01-01" "2014-02-01" "2014-03-01" "2014-04-01" "2014-05-01" "2014-06-01" "2014-07-01"
 [97] "2014-08-01" "2014-09-01" "2014-10-01" "2014-11-01" "2014-12-01" "2015-01-01" "2015-02-01" "2015-03-01"
[105] "2015-04-01" "2015-05-01" "2015-06-01" "2015-07-01"

Now we need to create a data frame containing those dates and merge it with the original:

posts_by_date_clean = merge(data.frame(date = all_dates),
                            posts_by_date,
                            by.x='date',
                            by.y='year_mon',
                            all.x=T,
                            all.y=T)
 
> posts_by_date_clean %>% head()
        date  n
1 2006-08-01  1
2 2006-09-01  4
3 2006-10-01 NA
4 2006-11-01 NA
5 2006-12-01 NA
6 2007-01-01 NA

We’ve still got some ‘NA’ values in there which won’t plot so well. Let’s set those to 0 and then try and plot our chart again:

> posts_by_date_clean$n[is.na(posts_by_date_clean$n)] = 0
> ggplot(aes(x = date, y = n), data = posts_by_date_clean) + geom_line()
[chart: posts per month with the gaps filled in]

Much better!
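
As an aside, newer versions of the tidyr package can do the merge and NA replacement in one step with its complete function – a rough equivalent of the code above:

library(tidyr)
 
posts_by_date_clean = posts_by_date %>% 
  complete(year_mon = all_dates, fill = list(n = 0))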


R: Date for given week/year

Fri, 07/10/2015 - 23:01

As I mentioned in my last couple of blog posts I’ve been looking at the data behind this blog and I wanted to plot a chart showing the number of posts per week since the blog started.

I started out with a data frame with posts and publication date:

> library(dplyr)
> df = read.csv("posts.csv")
> df$date = ymd_hms(df$date)
 
> df %>% sample_n(10)
                                                                                title                date
538                                    Nygard Big Data Model: The Investigation Stage 2012-10-10 00:00:36
341                                                            The read-only database 2011-08-29 23:32:26
1112                                  CSS in Internet Explorer - Some lessons learned 2008-10-31 15:24:51
143                                                       Coding: Mutating parameters 2010-08-26 07:47:23
433  Scala: Counting number of inversions (via merge sort) for an unsorted collection 2012-03-20 06:53:18
618                                    neo4j/cypher: SQL style GROUP BY functionality 2013-02-17 21:05:27
1111                                 Testing Hibernate mappings: Setting up test data 2008-10-30 13:24:14
462                                       neo4j: What question do you want to answer? 2012-05-05 13:20:41
1399                                       Book Club: Design Sense (Michael Feathers) 2009-09-29 14:42:29
494                                    Bash Shell: Reusing parts of previous commands 2012-07-05 23:42:35

The first step was to add a couple of columns representing the week and year for the publication date. The ‘lubridate’ library came in handy here:

byWeek = df %>% 
  mutate(year = year(date), week = week(date)) %>% 
  group_by(week, year) %>% summarise(n = n()) %>% 
  ungroup() %>% arrange(desc(n))
 
> byWeek
Source: local data frame [352 x 3]
 
   week year  n
1    33 2008 14
2    35 2008 11
3    53 2012 11
4     9 2013 10
5    12 2013  9
6    21 2009  9
7    22 2009  9
8    38 2013  9
9    40 2008  9
10   48 2012  9
..  ...  ... ..

The next step is to calculate the start date of each of those weeks so that we can plot the counts on a continuous date scale. I spent a while searching for how to do this before realising that the ‘week’ function I used before can also set the week for a given date. Let’s get to work:

calculate_start_of_week = function(week, year) {
  date <- ymd(paste(year, 1, 1, sep="-"))
  week(date) = week
  return(date)
}
 
> calculate_start_of_week(c(1,2,3), c(2015,2014,2013))
[1] "2015-01-01 UTC" "2014-01-08 UTC" "2013-01-15 UTC"

And now let’s transform our data frame and plot the counts:

ggplot(aes(x=start_of_week, y=n, group=1), 
       data = byWeek %>% mutate(start_of_week = calculate_start_of_week(week, year))) + 
  geom_line()

[chart: posts per week]

It’s a bit erratic as you can see. Some of this can be explained by the fact that I do post in an erratic way, and some by the fact that some weeks only contain a few days if they start on the 29th onwards.


R: dplyr – Error: cannot modify grouping variable

Thu, 07/09/2015 - 06:55

I’ve been doing some exploration of the posts made on this blog and I thought I’d start with answering a simple question – on which dates did I write the most posts?

I started with a data frame containing each post and the date it was published:

> library(dplyr)
> df %>% sample_n(5)
                                                title                date
1148 Taiichi Ohno's Workplace Management: Book Review 2008-12-08 14:14:48
158     Rails: Faking a delete method with 'form_for' 2010-09-20 18:52:15
331           Retrospectives: The 4 L's Retrospective 2011-07-25 21:00:30
1035       msbuild - Use OutputPath instead of OutDir 2008-08-14 18:54:03
1181                The danger of commenting out code 2009-01-17 06:02:33

To find the most popular days for blog posts we can write the following aggregation function:

> df %>% mutate(day = as.Date(date)) %>% count(day) %>% arrange(desc(n))
 
Source: local data frame [1,140 x 2]
 
          day n
1  2012-12-31 6
2  2014-05-31 6
3  2008-08-08 5
4  2013-01-27 5
5  2009-08-24 4
6  2012-06-24 4
7  2012-09-30 4
8  2012-10-27 4
9  2012-11-24 4
10 2013-02-28 4

So we can see a couple of days with 6 posts, a couple with 5 posts, a few more with 4 posts and then presumably loads of days with 1 post.

I thought it’d be cool if we could plot a histogram which had the number of posts on the x axis and on the y axis how many days that number of posts occurred e.g. for an x value of 6 (posts) we’d have a y value of 2 (occurrences).

My initial attempt was this:

> df %>% mutate(day = as.Date(date)) %>% count(day) %>% count(n)
Error: cannot modify grouping variable

Unfortunately that isn’t allowed. I tried ungrouping and then counting again:

> df %>% mutate(day = as.Date(date)) %>% count(day) %>% ungroup() %>% count(n)
Error: cannot modify grouping variable

Still no luck. I did a bit of googling around and came across a post which suggested using a combination of group_by + mutate or group_by + summarize.

I tried the mutate approach first:

> df %>% mutate(day = as.Date(date)) %>% 
+     group_by(day) %>% mutate(n = n()) %>% ungroup() %>% sample_n(5)
Source: local data frame [5 x 4]
 
                                    title                date        day n
1 QCon London 2009: DDD & BDD - Dan North 2009-03-13 15:28:04 2009-03-13 2
2        Onboarding: Sketch the landscape 2013-02-15 07:36:06 2013-02-15 1
3                           Ego Depletion 2013-06-04 23:16:29 2013-06-04 1
4                 Clean Code: Book Review 2008-09-15 09:52:33 2008-09-15 1
5            Dreyfus Model: More thoughts 2009-08-10 10:36:51 2009-08-10 1

That keeps the ‘title’ column around, which is a bit annoying. We can get rid of it using a distinct on ‘day’ if we want, and if we also implement the second part of the function we end up with the following:

> df %>% mutate(day = as.Date(date)) %>% 
    group_by(day) %>% mutate(n = n()) %>% distinct(day) %>% ungroup() %>% 
    group_by(n) %>%
    mutate(c = n()) %>%
    distinct(n)  
 
Source: local data frame [6 x 5]
Groups: n
 
                                                title                date        day n   c
1       Functional C#: Writing a 'partition' function 2010-02-01 23:34:02 2010-02-01 1 852
2                            Willed vs Forced designs 2010-02-08 22:48:05 2010-02-08 2 235
3                            TDD: Testing collections 2010-07-28 06:05:25 2010-07-28 3  41
4  Creating a Samba share between Ubuntu and Mac OS X 2012-06-24 00:40:35 2012-06-24 4   8
5            Gamification and Software: Some thoughts 2012-12-31 10:57:19 2012-12-31 6   2
6 Python/numpy: Selecting specific column in 2D array 2013-01-27 02:10:10 2013-01-27 5   2

Annoyingly we’ve still got the ‘title’, ‘date’ and ‘day’ columns hanging around which we’d need to get rid of with a call to ‘select’. The code also feels quite icky, especially the use of distinct in a couple of places.

In fact we can simplify the code if we use summarize instead of mutate:

> df %>% mutate(day = as.Date(date)) %>% 
    group_by(day) %>% summarize(n = n()) %>% ungroup() %>% 
    group_by(n) %>% summarize(c = n())
 
 
Source: local data frame [6 x 2]
 
  n   c
1 1 852
2 2 235
3 3  41
4 4   8
5 5   2
6 6   2

And we’ve also got rid of the extra columns into the bargain, which is great! Now we can plot our histogram:

> library(ggplot2)
> post_frequencies = df %>% mutate(day = as.Date(date)) %>% 
    group_by(day) %>% summarize(n = n()) %>% ungroup() %>% 
    group_by(n) %>% summarize(c = n())
> ggplot(aes(x = n, y = c), data = post_frequencies) + geom_bar(stat = "identity")

[chart: histogram of posts-per-day frequencies]

In this case we don’t actually need to do the second grouping to create the bar chart since ggplot will do it for us if we feed it the following data:

> ggplot(aes(x = n), 
         data = df %>% mutate(day = as.Date(date)) %>% group_by(day) %>% summarize(n = n()) %>% ungroup()) +
    geom_bar(binwidth = 1) +
    scale_x_continuous(limits=c(1, 6))
[chart: the same histogram produced directly by ggplot]

Still, it’s good to know how!


Python: Converting WordPress posts to CSV format

Tue, 07/07/2015 - 07:28

Over the weekend I wanted to look into the WordPress data behind this blog (very meta!) and wanted to get the data in CSV format so I could do some analysis in R.


I found a couple of WordPress CSV plugins but unfortunately I couldn’t get any of them to work and ended up working with the raw XML data that WordPress produces when you ‘export’ a blog.

I had the problem of the export being incomplete, which I ‘solved’ by exporting the posts in two parts of a few years each.

I then spent quite a few hours struggling to get the data into shape using R’s rvest library but eventually decided to do the scraping using Python’s beautifulsoup and save it to a CSV file for analysis in R.

The structure of the XML that we want to extract is as follows:

<rss version="2.0"
	xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:wp="http://wordpress.org/export/1.2/"
>
...
    <channel>
		<item>
		<title>First thoughts on Ruby...</title>
		<link>http://www.markhneedham.com/blog/2006/08/29/first-thoughts-on-ruby/</link>
		<pubDate>Tue, 29 Aug 2006 13:31:05 +0000</pubDate>
...

I wrote the following script to parse the files:

from bs4 import BeautifulSoup
from soupselect import select
from dateutil import parser
 
import csv
 
def read_page(page):
    return BeautifulSoup(open(page, 'r').read())
 
with open("posts.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["title", "date"])
 
    for row in select(read_page("part2.xml"), "item"):
        title = select(row, "title")[0].text.encode("utf-8")
        date = parser.parse(select(row, "pubdate")[0].text)
        writer.writerow([title, date])
 
    for row in select(read_page("part1.xml"), "item"):
        title = select(row, "title")[0].text.encode("utf-8")
        date = parser.parse(select(row, "pubdate")[0].text)
        writer.writerow([title, date])

We end up with a CSV file that looks like this:

$ head -n 10 posts.csv
title,date
Functional C#: Writing a 'partition' function,2010-02-01 23:34:02+00:00
Coding: Wrapping/not wrapping 3rd party libraries and DSLs,2010-02-02 23:54:21+00:00
Functional C#: LINQ vs Method chaining,2010-02-05 18:06:28+00:00
F#: function keyword,2010-02-07 02:54:13+00:00
Willed vs Forced designs,2010-02-08 22:48:05+00:00
Functional C#: Extracting a higher order function with generics,2010-02-08 23:17:47+00:00
Javascript: File encoding when using string.replace,2010-02-10 00:02:02+00:00
F#: Inline functions and statically resolved type parameters,2010-02-10 23:06:14+00:00
Javascript: Passing functions around with call and apply,2010-02-12 20:18:02+00:00

Let’s quickly look over the data in R and check it’s being correctly exported:

require(dplyr)
require(lubridate)
 
df = read.csv("posts.csv")
 
> df %>% count()
Source: local data frame [1 x 1]
 
     n
1 1501

So we’ve exported 1501 posts. Let’s cross check with the WordPress dashboard:

[screenshot: WordPress dashboard post count]

We’ve gained two extra posts! A bit more exploration of the WordPress dashboard reveals that there are actually 2 draft posts lying around.

We probably want to remove those from the export and luckily there’s a ‘status’ tag for each post that we can check. We want to make sure it doesn’t have the value ‘draft’:

from bs4 import BeautifulSoup
from soupselect import select
from dateutil import parser
 
import csv
 
def read_page(page):
    return BeautifulSoup(open(page, 'r').read())
 
with open("posts.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["title", "date"])
 
    for row in select(read_page("part2.xml"), "item"):
        if (not row.find("wp:status")) or row.find("wp:status").text != "draft":
            title = select(row, "title")[0].text.encode("utf-8")
            date = parser.parse(select(row, "pubdate")[0].text)
            writer.writerow([title, date])
 
    for row in select(read_page("part1.xml"), "item"):
        if (not row.find("wp:status")) or row.find("wp:status").text != "draft":
            title = select(row, "title")[0].text.encode("utf-8")
            date = parser.parse(select(row, "pubdate")[0].text)
            writer.writerow([title, date])

I also had to check if that tag actually existed since there were a couple of posts which didn’t have it but had been published. If we check the resulting CSV file in R we can see that we’ve now got all the posts:

> df = read.csv("posts.csv")
> df %>% count()
Source: local data frame [1 x 1]
 
     n
1 1499

Now we’re ready to test a couple of hypotheses that I have but that’s for another post!


R: Wimbledon – How do the seeds get on?

Sun, 07/05/2015 - 09:38

Continuing on with the Wimbledon data set I’ve been playing with I wanted to do some exploration on how the seeded players have fared over the years.

Taking the last 10 years’ worth of data, there have always been 32 seeds, and with the following function we can feed in a seeding and get back the round a player with that seeding would be expected to reach:

expected_round = function(seeding) {  
  if(seeding == 1) {
    return("Winner")
  } else if(seeding == 2) {
    return("Finals") 
  } else if(seeding <= 4) {
    return("Semi-Finals")
  } else if(seeding <= 8) {
    return("Quarter-Finals")
  } else if(seeding <= 16) {
    return("Round of 16")
  } else {
    return("Round of 32")
  }
}
 
> expected_round(1)
[1] "Winner"
 
> expected_round(4)
[1] "Semi-Finals"

We can then have a look at each of the Wimbledon tournaments and work out how far each seed actually got.

round_reached = function(player, main_matches) {
  furthest_match = main_matches %>% 
    filter(winner == player | loser == player) %>% 
    arrange(desc(round)) %>% 
    head(1)  
 
    return(ifelse(furthest_match$winner == player, "Winner", as.character(furthest_match$round)))
}
 
seeds = function(matches_to_consider) {
  winners =  matches_to_consider %>% filter(!is.na(winner_seeding)) %>% 
    select(name = winner, seeding =  winner_seeding) %>% distinct()
  losers = matches_to_consider %>% filter( !is.na(loser_seeding)) %>% 
    select(name = loser, seeding =  loser_seeding) %>% distinct()
 
  return(rbind(winners, losers) %>% distinct() %>% mutate(name = as.character(name)))
}

Let’s have a look at how the seeds got on last year:

matches_to_consider = main_matches %>% filter(year == 2014)
 
result = seeds(matches_to_consider) %>% group_by(name) %>% 
    mutate(expected = expected_round(seeding), round = round_reached(name, matches_to_consider)) %>% 
    ungroup() %>% arrange(seeding)
 
rounds = c("Did not enter", "Round of 128", "Round of 64", "Round of 32", "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")
result$round = factor(result$round, levels = rounds, ordered = TRUE)
result$expected = factor(result$expected, levels = rounds, ordered = TRUE) 
 
> result %>% head(10)
Source: local data frame [10 x 4]
 
             name seeding       expected          round
1  Novak Djokovic       1         Winner         Winner
2    Rafael Nadal       2         Finals    Round of 16
3     Andy Murray       3    Semi-Finals Quarter-Finals
4   Roger Federer       4    Semi-Finals         Finals
5   Stan Wawrinka       5 Quarter-Finals Quarter-Finals
6   Tomas Berdych       6 Quarter-Finals    Round of 32
7    David Ferrer       7 Quarter-Finals    Round of 64
8    Milos Raonic       8 Quarter-Finals    Semi-Finals
9      John Isner       9    Round of 16    Round of 32
10  Kei Nishikori      10    Round of 16    Round of 16

We’ll wrap all of that code into the following function:

expectations = function(y, matches) {
  matches_to_consider = matches %>% filter(year == y)  
 
  result = seeds(matches_to_consider) %>% group_by(name) %>% 
    mutate(expected = expected_round(seeding), round = round_reached(name, matches_to_consider)) %>% 
    ungroup() %>% arrange(seeding)
 
  result$round = factor(result$round, levels = rounds, ordered = TRUE)
  result$expected = factor(result$expected, levels = rounds, ordered = TRUE)  
 
  return(result)
}

Next, instead of showing the round names it’d be cool to come up with a numerical value indicating how well the player did:

  • -1 would mean they lost one round earlier than their seeding suggested e.g. seed 2 loses in the Semi-Finals
  • 2 would mean they got 2 rounds further than they should have e.g. seed 7 reaches the Finals

The unclass function comes to our rescue here:

# expectations plot
years = 2005:2014
exp = data.frame()
for(y in years) {
  differences = expectations(y, main_matches) %>% 
    mutate(expected_n = unclass(expected) %>% as.numeric(), 
           round_n = unclass(round) %>% as.numeric(), 
           difference = round_n - expected_n,
           year = y) %>% 
    select(name, seeding, expected_n, round_n, difference, year)
  exp = rbind(exp, differences)
}
 
> exp %>% sample_n(10)
Source: local data frame [10 x 6]
 
              name seeding expected_n round_n difference year
1    Tomas Berdych       6          6       5         -1 2011
2    Tomas Berdych       7          6       6          0 2013
3     Rafael Nadal       2          8       5         -3 2014
4    Fabio Fognini      16          5       4         -1 2014
5  Robin Soderling      13          5       5          0 2009
6    Jurgen Melzer      16          5       5          0 2010
7  Nicolas Almagro      19          4       2         -2 2010
8    Stan Wawrinka      14          5       3         -2 2011
9     David Ferrer       7          6       5         -1 2011
10 Mikhail Youzhny      14          5       5          0 2007

We can then group by the ‘difference’ column to see how seeds are getting on as a whole:

> exp %>% count(difference)
Source: local data frame [9 x 2]
 
  difference  n
1         -5  2
2         -4  7
3         -3 24
4         -2 70
5         -1 66
6          0 85
7          1 43
8          2 17
9          3  4
 
library(ggplot2)
ggplot(aes(x = difference, y = n), data = exp %>% count(difference)) +
  geom_bar(stat = "identity") +
  scale_x_continuous(limits = c(min(exp$difference), max(exp$difference) + 1))
[chart: distribution of the difference between expected and actual rounds]

So from this visualisation we can see that the most common outcome for a seed is that they reach the round they were expected to reach. There are still a decent number of seeds who do 1 or 2 rounds worse than expected as well though.

Antonios suggested doing some analysis of how the seeds fared on a year by year basis – we’ll start by looking at what % of them exactly achieved their seeding:

exp$correct_pred = 0
exp$correct_pred[exp$difference == 0] = 1
 
exp %>% group_by(year) %>% 
  summarise(MeanDiff = mean(difference),
            PrcCorrect = mean(correct_pred),
            N=n())
 
Source: local data frame [10 x 4]
 
   year   MeanDiff PrcCorrect  N
1  2005 -0.6562500  0.2187500 32
2  2006 -0.8125000  0.2812500 32
3  2007 -0.4838710  0.4193548 31
4  2008 -0.9677419  0.2580645 31
5  2009 -0.3750000  0.2500000 32
6  2010 -0.7187500  0.4375000 32
7  2011 -0.7187500  0.0937500 32
8  2012 -0.7500000  0.2812500 32
9  2013 -0.9375000  0.2500000 32
10 2014 -0.7187500  0.1875000 32

Some years are better than others – we can use a chisq test to see whether there are any significant differences between the years:

tbl = table(exp$year, exp$correct_pred)
tbl
 
> chisq.test(tbl)
 
	Pearson's Chi-squared test
 
data:  tbl
X-squared = 14.9146, df = 9, p-value = 0.09331

This looks for at least one statistically significant difference between the years, although it doesn’t look like there is one. We can also try comparing each year against all the others:

> pairwise.prop.test(tbl)
 
	Pairwise comparisons using Pairwise comparison of proportions 
 
data:  tbl 
 
     2005 2006 2007 2008 2009 2010 2011 2012 2013
2006 1.00 -    -    -    -    -    -    -    -   
2007 1.00 1.00 -    -    -    -    -    -    -   
2008 1.00 1.00 1.00 -    -    -    -    -    -   
2009 1.00 1.00 1.00 1.00 -    -    -    -    -   
2010 1.00 1.00 1.00 1.00 1.00 -    -    -    -   
2011 1.00 1.00 0.33 1.00 1.00 0.21 -    -    -   
2012 1.00 1.00 1.00 1.00 1.00 1.00 1.00 -    -   
2013 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 -   
2014 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
 
P value adjustment method: holm


2007/2011 and 2010/2011 show the biggest differences but they’re still not significant. Since we have so few data items in each bucket there has to be a really massive difference for it to be significant.

The data I used in this post is available on this gist if you want to look into it and come up with your own analysis.


R: Calculating the difference between ordered factor variables

Thu, 07/02/2015 - 23:55

In my continued exploration of Wimbledon data I wanted to work out whether a player had done as well as their seeding suggested they should.

I therefore wanted to work out the difference between the round they reached and the round they were expected to reach. A ‘round’ in the dataset is an ordered factor variable.

These are all the possible values:

rounds = c("Did not enter", "Round of 128", "Round of 64", "Round of 32", "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")

And if we want to factorise a couple of strings into this factor we would do it like this:

round = factor("Finals", levels = rounds, ordered = TRUE)
expected = factor("Winner", levels = rounds, ordered = TRUE)  
 
> round
[1] Finals
9 Levels: Did not enter < Round of 128 < Round of 64 < Round of 32 < Round of 16 < Quarter-Finals < ... < Winner
 
> expected
[1] Winner
9 Levels: Did not enter < Round of 128 < Round of 64 < Round of 32 < Round of 16 < Quarter-Finals < ... < Winner

In this case the difference between the actual round and expected round should be -1 – the player was expected to win the tournament but lost in the final. We can calculate that difference by calling the unclass function on each variable:

 
> unclass(round) - unclass(expected)
[1] -1
attr(,"levels")
[1] "Did not enter"  "Round of 128"   "Round of 64"    "Round of 32"    "Round of 16"    "Quarter-Finals"
[7] "Semi-Finals"    "Finals"         "Winner"

That still seems to have some remnants of the factor variable so to get rid of that we can cast it to a numeric value:

> as.numeric(unclass(round) - unclass(expected))
[1] -1

And that’s it! We can now go and apply this calculation to all seeds to see how they got on.
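
If we’re going to apply it repeatedly it might be worth wrapping the calculation in a little helper first (‘round_difference’ is just an illustrative name):

round_difference = function(actual, expected) {
  # both arguments are ordered factors with the same levels
  as.numeric(unclass(actual)) - as.numeric(unclass(expected))
}
 
round_difference(round, expected)  # -1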


R: write.csv – unimplemented type ‘list’ in ‘EncodeElement’

Tue, 06/30/2015 - 23:26

Every now and then I want to serialise an R data frame to a CSV file so I can easily load it up again if my R environment crashes, without having to recalculate everything. Recently I ran into the following error:

> write.csv(foo, "/tmp/foo.csv", row.names = FALSE)
Error in .External2(C_writetable, x, file, nrow(x), p, rnames, sep, eol,  : 
  unimplemented type 'list' in 'EncodeElement'

If we take a closer look at the data frame in question it looks ok:

> foo
  col1 col2
1    1    a
2    2    b
3    3    c

However, one of the columns contains a list in each cell and we need to find out which one it is. I’ve found the quickest way is to run the typeof function over each column:

> typeof(foo$col1)
[1] "double"
 
> typeof(foo$col2)
[1] "list"

So ‘col2’ is the problem one, which isn’t surprising if you consider the way I created ‘foo’:

library(dplyr)
foo = data.frame(col1 = c(1,2,3)) %>% mutate(col2 = list("a", "b", "c"))

If we do have a list that we want to add to the data frame we need to convert it to a vector first so we don’t run into this type of problem:

foo = data.frame(col1 = c(1,2,3)) %>% mutate(col2 = list("a", "b", "c") %>% unlist())

And now we can write to the CSV file:

write.csv(foo, "/tmp/foo.csv", row.names = FALSE)
$ cat /tmp/foo.csv
"col1","col2"
1,"a"
2,"b"
3,"c"

And that’s it!


R: Speeding up the Wimbledon scraping job

Mon, 06/29/2015 - 06:36

Over the past few days I’ve written a few blog posts about a Wimbledon data set I’ve been building and after running the scripts a few times I noticed that they were taking much longer to run than I expected.

To recap, I started out with the following function which takes in a URI and returns a data frame containing a row for each match:

library(rvest)
library(dplyr)
library(stringr)  # str_trim is used below
 
scrape_matches1 = function(uri) {
  matches = data.frame()
 
  s = html(uri)
  rows = s %>% html_nodes("div#scoresResultsContent tr")
  i = 0
  for(row in rows) {  
    players = row %>% html_nodes("td.day-table-name a")
    seedings = row %>% html_nodes("td.day-table-seed")
    score = row %>% html_node("td.day-table-score a")
    flags = row %>% html_nodes("td.day-table-flag img")
 
    if(!is.null(score)) {
      player1 = players[1] %>% html_text() %>% str_trim()
      seeding1 = ifelse(!is.na(seedings[1]), seedings[1] %>% html_node("span") %>% html_text() %>% str_trim(), NA)
      flag1 = flags[1] %>% html_attr("alt")
 
      player2 = players[2] %>% html_text() %>% str_trim()
      seeding2 = ifelse(!is.na(seedings[2]), seedings[2] %>% html_node("span") %>% html_text() %>% str_trim(), NA)
      flag2 = flags[2] %>% html_attr("alt")
 
      matches = rbind(data.frame(winner = player1, 
                                 winner_seeding = seeding1, 
                                 winner_flag = flag1,
                                 loser = player2, 
                                 loser_seeding = seeding2,
                                 loser_flag = flag2,
                                 score = score %>% html_text() %>% str_trim(),
                                 round = round), matches)      
    } else {
      round = row %>% html_node("th") %>% html_text()
    }
  } 
  return(matches)
}

Let’s run it to get an idea of the data that it returns:

matches1 = scrape_matches1("http://www.atpworldtour.com/en/scores/archive/wimbledon/540/2014/results")
 
> matches1 %>% filter(round %in% c("Finals", "Semi-Finals", "Quarter-Finals"))
           winner winner_seeding winner_flag           loser loser_seeding loser_flag            score          round
1    Milos Raonic            (8)         CAN    Nick Kyrgios          (WC)        AUS    674 62 64 764 Quarter-Finals
2   Roger Federer            (4)         SUI   Stan Wawrinka           (5)        SUI     36 765 64 64 Quarter-Finals
3 Grigor Dimitrov           (11)         BUL     Andy Murray           (3)        GBR        61 764 62 Quarter-Finals
4  Novak Djokovic            (1)         SRB     Marin Cilic          (26)        CRO  61 36 674 62 62 Quarter-Finals
5   Roger Federer            (4)         SUI    Milos Raonic           (8)        CAN         64 64 64    Semi-Finals
6  Novak Djokovic            (1)         SRB Grigor Dimitrov          (11)        BUL    64 36 762 767    Semi-Finals
7  Novak Djokovic            (1)         SRB   Roger Federer           (4)        SUI 677 64 764 57 64         Finals

As I mentioned, it’s quite slow but I thought I’d wrap it in system.time so I could see exactly how long it was taking:

> system.time(scrape_matches1("http://www.atpworldtour.com/en/scores/archive/wimbledon/540/2014/results"))
   user  system elapsed 
 25.570   0.111  31.416

About 30 seconds! The first thing I tried was downloading the file separately and running the function against the local file:

> system.time(scrape_matches1("data/raw/2014.html"))
   user  system elapsed 
 25.662   0.123  25.863

Hmmm, that’s only saved us 5 seconds so the bottleneck must be somewhere else. Still, there’s no point making an HTTP request every time we run the script so we’ll stick with the local file version.
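
For reference, caching the page is a one-off call to base R’s download.file – the destination path is just where I happen to keep the raw data:

download.file("http://www.atpworldtour.com/en/scores/archive/wimbledon/540/2014/results",
              destfile = "data/raw/2014.html")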

While browsing rvest’s vignette I noticed a function called html_table which I was curious about. I decided to try and replace some of my code with a call to that:

matches2 = html("data/raw/2014.html") %>% 
  html_node("div#scoresResultsContent table.day-table") %>% html_table(header = FALSE) %>% 
  mutate(X1 = ifelse(X1 == "", NA, X1)) %>%
  mutate(round = ifelse(grepl("\\([0-9]\\)|\\(", X1), NA, X1)) %>% 
  mutate(round = na.locf(round)) %>%
  filter(!is.na(X8)) %>%
  select(winner = X3, winner_seeding = X1, loser = X7, loser_seeding = X5, score = X8, round)
 
> matches2 %>% filter(round %in% c("Finals", "Semi-Finals", "Quarter-Finals"))
           winner winner_seeding           loser loser_seeding            score          round
1  Novak Djokovic            (1)   Roger Federer           (4) 677 64 764 57 64         Finals
2  Novak Djokovic            (1) Grigor Dimitrov          (11)    64 36 762 767    Semi-Finals
3   Roger Federer            (4)    Milos Raonic           (8)         64 64 64    Semi-Finals
4  Novak Djokovic            (1)     Marin Cilic          (26)  61 36 674 62 62 Quarter-Finals
5 Grigor Dimitrov           (11)     Andy Murray           (3)        61 764 62 Quarter-Finals
6   Roger Federer            (4)   Stan Wawrinka           (5)     36 765 64 64 Quarter-Finals
7    Milos Raonic            (8)    Nick Kyrgios          (WC)    674 62 64 764 Quarter-Finals

I had to do some slightly clever stuff to get the ‘round’ column into shape using zoo’s na.locf function which I wrote about previously.

Unfortunately I couldn’t work out how to extract the flag with this version – that value is hidden in the ‘alt’ attribute of an img and presumably html_table is just grabbing the text value of each cell. This version is much quicker though!

system.time(html("data/raw/2014.html") %>% 
  html_node("div#scoresResultsContent table.day-table") %>% html_table(header = FALSE) %>% 
  mutate(X1 = ifelse(X1 == "", NA, X1)) %>%
  mutate(round = ifelse(grepl("\\([0-9]\\)|\\(", X1), NA, X1)) %>% 
  mutate(round = na.locf(round)) %>%
  filter(!is.na(X8)) %>%
  select(winner = X3, winner_seeding = X1, loser = X7, loser_seeding = X5, score = X8, round))
 
   user  system elapsed 
  0.545   0.002   0.548

What I realised from writing this version is that I need to match all the columns with one call to html_nodes rather than getting the row and then each column in a loop.

I rewrote the function to do that:

scrape_matches3 = function(uri) {
  s = html(uri)
 
  players  = s %>% html_nodes("div#scoresResultsContent tr td.day-table-name a")
  seedings = s %>% html_nodes("div#scoresResultsContent tr td.day-table-seed")
  scores   = s %>% html_nodes("div#scoresResultsContent tr td.day-table-score a")
  flags    = s %>% html_nodes("div#scoresResultsContent tr td.day-table-flag img") %>% html_attr("alt") %>% str_trim()
 
  matches3 = data.frame(
    winner         = sapply(seq(1,length(players),2),  function(idx) players[[idx]] %>% html_text()),
    winner_seeding = sapply(seq(1,length(seedings),2), function(idx) seedings[[idx]] %>% html_text() %>% str_trim()),
    winner_flag    = sapply(seq(1,length(flags),2),    function(idx) flags[[idx]]),  
    loser          = sapply(seq(2,length(players),2),  function(idx) players[[idx]] %>% html_text()),
    loser_seeding  = sapply(seq(2,length(seedings),2), function(idx) seedings[[idx]] %>% html_text() %>% str_trim()),
    loser_flag     = sapply(seq(2,length(flags),2),    function(idx) flags[[idx]]),
    score          = sapply(scores,                    function(score) score %>% html_text() %>% str_trim())
  )
  return(matches3)
}

Let’s run and time that to check we’re getting back the right results in a timely manner:

> matches3 %>% sample_n(10)
                   winner winner_seeding winner_flag               loser loser_seeding loser_flag         score
70           David Ferrer            (7)         ESP Pablo Carreno Busta                      ESP  60 673 61 61
128        Alex Kuznetsov           (26)         USA         Tim Smyczek           (3)        USA   46 63 63 63
220   Rogerio Dutra Silva                        BRA   Kristijan Mesaros                      CRO         62 63
83         Kevin Anderson           (20)         RSA        Aljaz Bedene          (LL)        GBR      63 75 62
73          Kei Nishikori           (10)         JPN   Kenny De Schepper                      FRA     64 765 75
56  Roberto Bautista Agut           (27)         ESP         Jan Hernych           (Q)        CZE   75 46 62 62
138            Ante Pavic                        CRO        Marc Gicquel          (29)        FRA  46 63 765 64
174             Tim Puetz                        GER     Ruben Bemelmans                      BEL         64 62
103        Lleyton Hewitt                        AUS   Michal Przysiezny                      POL 62 6714 61 64
35          Roger Federer            (4)         SUI       Gilles Muller           (Q)        LUX      63 75 63
 
> system.time(scrape_matches3("data/raw/2014.html"))
   user  system elapsed 
  0.815   0.006   0.827

It’s still quick – a bit slower than html_table but we can deal with that. As you can see, I also had to add some logic to separate the values for the winners and losers – the players, seedings and flags come back as one big list. The odd rows represent the winner; the even rows the loser.

Annoyingly we’ve now lost the ‘round’ column because that appears as a table heading, so we can’t extract it the same way. I ended up cheating a bit to get it to work: I worked out how many matches each round should contain and generated a vector with that number of entries:

raw_rounds = s %>% html_nodes("th") %>% html_text()
 
> raw_rounds
 [1] "Finals"               "Semi-Finals"          "Quarter-Finals"       "Round of 16"          "Round of 32"         
 [6] "Round of 64"          "Round of 128"         "3rd Round Qualifying" "2nd Round Qualifying" "1st Round Qualifying"
 
rounds = c( sapply(0:6, function(idx) rep(raw_rounds[[idx + 1]], 2 ** idx)) %>% unlist(),
            sapply(7:9, function(idx) rep(raw_rounds[[idx + 1]], 2 ** (idx - 3))) %>% unlist())
 
> rounds[1:10]
 [1] "Finals"         "Semi-Finals"    "Semi-Finals"    "Quarter-Finals" "Quarter-Finals" "Quarter-Finals" "Quarter-Finals"
 [8] "Round of 16"    "Round of 16"    "Round of 16"

Let’s put that code into the function and see if we end up with the same resulting data frame:

scrape_matches4 = function(uri) {
  s = html(uri)
 
  players  = s %>% html_nodes("div#scoresResultsContent tr td.day-table-name a")
  seedings = s %>% html_nodes("div#scoresResultsContent tr td.day-table-seed")
  scores   = s %>% html_nodes("div#scoresResultsContent tr td.day-table-score a")
  flags    = s %>% html_nodes("div#scoresResultsContent tr td.day-table-flag img") %>% html_attr("alt") %>% str_trim()
 
  raw_rounds = s %>% html_nodes("th") %>% html_text()
  rounds = c( sapply(0:6, function(idx) rep(raw_rounds[[idx + 1]], 2 ** idx)) %>% unlist(),
              sapply(7:9, function(idx) rep(raw_rounds[[idx + 1]], 2 ** (idx - 3))) %>% unlist())
 
  matches4 = data.frame(
    winner         = sapply(seq(1,length(players),2),  function(idx) players[[idx]] %>% html_text()),
    winner_seeding = sapply(seq(1,length(seedings),2), function(idx) seedings[[idx]] %>% html_text() %>% str_trim()),
    winner_flag    = sapply(seq(1,length(flags),2),    function(idx) flags[[idx]]),  
    loser          = sapply(seq(2,length(players),2),  function(idx) players[[idx]] %>% html_text()),
    loser_seeding  = sapply(seq(2,length(seedings),2), function(idx) seedings[[idx]] %>% html_text() %>% str_trim()),
    loser_flag     = sapply(seq(2,length(flags),2),    function(idx) flags[[idx]]),
    score          = sapply(scores,                    function(score) score %>% html_text() %>% str_trim()),
    round          = rounds
  )
  return(matches4)
}
 
matches4 = scrape_matches4("data/raw/2014.html")
 
> matches4 %>% filter(round %in% c("Finals", "Semi-Finals", "Quarter-Finals"))
           winner winner_seeding winner_flag           loser loser_seeding loser_flag            score          round
1  Novak Djokovic            (1)         SRB   Roger Federer           (4)        SUI 677 64 764 57 64         Finals
2  Novak Djokovic            (1)         SRB Grigor Dimitrov          (11)        BUL    64 36 762 767    Semi-Finals
3   Roger Federer            (4)         SUI    Milos Raonic           (8)        CAN         64 64 64    Semi-Finals
4  Novak Djokovic            (1)         SRB     Marin Cilic          (26)        CRO  61 36 674 62 62 Quarter-Finals
5 Grigor Dimitrov           (11)         BUL     Andy Murray           (3)        GBR        61 764 62 Quarter-Finals
6   Roger Federer            (4)         SUI   Stan Wawrinka           (5)        SUI     36 765 64 64 Quarter-Finals
7    Milos Raonic            (8)         CAN    Nick Kyrgios          (WC)        AUS    674 62 64 764 Quarter-Finals

We shouldn’t have added much to the time but let’s check:

> system.time(scrape_matches4("data/raw/2014.html"))
   user  system elapsed 
  0.816   0.004   0.824

Sweet. We’ve saved ourselves 29 seconds per page as long as the number of rounds stays constant over the years. For the 10 years that I’ve looked at it has, but I expect that if you go back further the draw sizes will have been different and our script would break.
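
A cheap way to guard against that would be to assert that the generated rounds vector lines up with the number of matches before building the data frame – one extra line inside scrape_matches4:

# every match has exactly one score node, so these lengths should agree
stopifnot(length(rounds) == length(scores))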

For now though this will do!


R: dplyr – Update rows with earlier/previous rows values

Sun, 06/28/2015 - 23:30

Recently I had a data frame which contained a column which had mostly empty values:

> data.frame(col1 = c(1,2,3,4,5), col2  = c("a", NA, NA , "b", NA))
  col1 col2
1    1    a
2    2 <NA>
3    3 <NA>
4    4    b
5    5 <NA>

I wanted to fill in the NA values with the last non NA value from that column. So I want the data frame to look like this:

1    1    a
2    2    a
3    3    a
4    4    b
5    5    b

I spent ages searching around before I came across the na.locf function in the zoo library which does the job:

library(zoo)
library(dplyr)
 
> data.frame(col1 = c(1,2,3,4,5), col2  = c("a", NA, NA , "b", NA)) %>% 
    do(na.locf(.))
  col1 col2
1    1    a
2    2    a
3    3    a
4    4    b
5    5    b

This will fill in the missing values for every column, so if we had a third column with missing values it would populate those too:

> data.frame(col1 = c(1,2,3,4,5), col2  = c("a", NA, NA , "b", NA), col3 = c("A", NA, "B", NA, NA)) %>% 
    do(na.locf(.))
 
  col1 col2 col3
1    1    a    A
2    2    a    A
3    3    a    B
4    4    b    B
5    5    b    B

If we only want to populate ‘col2’ and leave ‘col3’ as it is we can apply the function specifically to that column:

> data.frame(col1 = c(1,2,3,4,5), col2  = c("a", NA, NA , "b", NA), col3 = c("A", NA, "B", NA, NA)) %>% 
    mutate(col2 = na.locf(col2))
  col1 col2 col3
1    1    a    A
2    2    a <NA>
3    3    a    B
4    4    b <NA>
5    5    b <NA>

It’s quite a neat function and certainly comes in helpful when cleaning up data sets which don’t tend to be as uniform as you’d hope!


R: Command line – Error in GenericTranslator$new : could not find function “loadMethod”

Sat, 06/27/2015 - 23:47

I’ve been reading Text Processing with Ruby over the last week or so and one of the ideas the author describes is setting up your scripts so you can run them directly from the command line.

I wanted to do this with my Wimbledon R script and wrote the following script which uses the ‘Rscript’ executable so that R doesn’t launch in interactive mode:

wimbledon

#!/usr/bin/env Rscript
 
library(rvest)
library(dplyr)
library(stringr)
library(readr)
 
# stuff

Then I tried to run it:

$ time ./wimbledon
 
...
 
Error in GenericTranslator$new : could not find function "loadMethod"
Calls: write.csv ... html_extract_n -> <Anonymous> -> Map -> mapply -> <Anonymous> -> $
Execution halted
 
real	0m1.431s
user	0m1.127s
sys	0m0.078s

As the error suggests, the script fails when trying to write to a CSV file – it looks like Rscript doesn’t load the ‘methods’ package that we need. It turns out adding the following line to our script is all we need:

library(methods)

So we end up with this:

#!/usr/bin/env Rscript
 
library(methods)
library(rvest)
library(dplyr)
library(stringr)
library(readr)

And when we run that all is well!


R: dplyr – squashing multiple rows per group into one

Sat, 06/27/2015 - 23:36

I spent a bit of the day working on my Wimbledon data set and the next thing I explored is all the people that have beaten Andy Murray in the tournament.

The following dplyr query gives us the names of those people and the year the match took place:

library(dplyr)
 
> main_matches %>% filter(loser == "Andy Murray") %>% select(winner, year)
 
            winner year
1  Grigor Dimitrov 2014
2    Roger Federer 2012
3     Rafael Nadal 2011
4     Rafael Nadal 2010
5     Andy Roddick 2009
6     Rafael Nadal 2008
7 Marcos Baghdatis 2006
8 David Nalbandian 2005

As you can see, Rafael Nadal shows up multiple times. I wanted to get one row per player and list all the years in a single column.

This was my initial attempt:

> main_matches %>% filter(loser == "Andy Murray") %>% 
     group_by(winner) %>% summarise(years = paste(year))
Source: local data frame [6 x 2]
 
            winner years
1     Andy Roddick  2009
2 David Nalbandian  2005
3  Grigor Dimitrov  2014
4 Marcos Baghdatis  2006
5     Rafael Nadal  2011
6    Roger Federer  2012

Unfortunately it just gives you the last matching row per group, which isn’t quite what we want. I realised my mistake while trying to pass a vector into paste and noticing that a vector came back when I’d expected a string:

> paste(c(2008,2009,2010))
[1] "2008" "2009" "2010"

The missing argument was ‘collapse’ – something I’d come across when using plyr last year:

> paste(c(2008,2009,2010), collapse=", ")
[1] "2008, 2009, 2010"

Now, if we apply that to our original function:

> main_matches %>% filter(loser == "Andy Murray") %>% 
     group_by(winner) %>% summarise(years = paste(year, collapse=", "))
Source: local data frame [6 x 2]
 
            winner            years
1     Andy Roddick             2009
2 David Nalbandian             2005
3  Grigor Dimitrov             2014
4 Marcos Baghdatis             2006
5     Rafael Nadal 2011, 2010, 2008
6    Roger Federer             2012

That’s exactly what we want. Let’s tidy that up a bit:

> main_matches %>% filter(loser == "Andy Murray") %>% 
     group_by(winner) %>% arrange(year) %>%
     summarise(years  = paste(year, collapse =","), times = length(year))  %>%
     arrange(desc(times), years)
Source: local data frame [6 x 3]
 
            winner          years times
1     Rafael Nadal 2008,2010,2011     3
2 David Nalbandian           2005     1
3 Marcos Baghdatis           2006     1
4     Andy Roddick           2009     1
5    Roger Federer           2012     1
6  Grigor Dimitrov           2014     1

R: ggplot – Show discrete scale even with no value

Fri, 06/26/2015 - 23:48

As I mentioned in a previous blog post, I’ve been scraping data for the Wimbledon tennis tournament, and having got the data for the last ten years I wrote a query using dplyr to find out how players did each year over that period.

I ended up with the following functions to filter my data frame of all the matches:

round_reached = function(player, main_matches) {
  furthest_match = main_matches %>% 
    filter(winner == player | loser == player) %>% 
    arrange(desc(round)) %>% 
    head(1)  
 
    return(ifelse(furthest_match$winner == player, "Winner", as.character(furthest_match$round)))
}
 
player_performance = function(name, matches) {
  player = data.frame()
  for(y in 2005:2014) {
    round = round_reached(name, filter(matches, year == y))
    if(length(round) == 1) {
      player = rbind(player, data.frame(year = y, round = round))      
    } else {
      player = rbind(player, data.frame(year = y, round = "Did not enter"))
    } 
  }
  return(player)
}

When we call that function we see the following output:

> player_performance("Andy Murray", main_matches)
   year          round
1  2005    Round of 32
2  2006    Round of 16
3  2007  Did not enter
4  2008 Quarter-Finals
5  2009    Semi-Finals
6  2010    Semi-Finals
7  2011    Semi-Finals
8  2012         Finals
9  2013         Winner
10 2014 Quarter-Finals

I wanted to create a chart showing Murray’s progress over the years with the round reached on the y axis and the year on the x axis. In order to do this I had to make sure the ‘round’ column was being treated as a factor variable:

df = player_performance("Andy Murray", main_matches)
 
rounds = c("Did not enter", "Round of 128", "Round of 64", "Round of 32", "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")
df$round = factor(df$round, levels =  rounds)
 
> df$round
 [1] Round of 32    Round of 16    Did not enter  Quarter-Finals Semi-Finals    Semi-Finals    Semi-Finals   
 [8] Finals         Winner         Quarter-Finals
Levels: Did not enter Round of 128 Round of 64 Round of 32 Round of 16 Quarter-Finals Semi-Finals Finals Winner

Now that we’ve got that we can plot his progress:

ggplot(aes(x = year, y = round, group=1), data = df) + 
    geom_point() + 
    geom_line() + 
    scale_x_continuous(breaks=df$year) + 
    scale_y_discrete(breaks = rounds)

[chart: Andy Murray’s round reached by year]

This is a good start but we’ve lost the rounds which don’t have a corresponding entry on the y axis. I’d like to keep them so it’s easier to compare the performance of different players.

It turns out that all we need to do is pass ‘drop = FALSE’ to scale_y_discrete and it will work exactly as we want:

ggplot(aes(x = year, y = round, group=1), data = df) + 
    geom_point() + 
    geom_line() + 
    scale_x_continuous(breaks=df$year) + 
    scale_y_discrete(breaks = rounds, drop = FALSE)

[Chart: the same plot with every round kept on the y axis via drop = FALSE]

Neat. Now let’s have a look at the performances of some of the other top players:

draw_chart = function(player, main_matches){
  df = player_performance(player, main_matches)
  df$round = factor(df$round, levels =  rounds)
 
  ggplot(aes(x = year, y = round, group=1), data = df) + 
    geom_point() + 
    geom_line() + 
    scale_x_continuous(breaks=df$year) + 
    scale_y_discrete(breaks = rounds, drop=FALSE) + 
    ggtitle(player) + 
    theme(axis.text.x=element_text(angle=90, hjust=1))
}
 
a = draw_chart("Andy Murray", main_matches)
b = draw_chart("Novak Djokovic", main_matches)
c = draw_chart("Rafael Nadal", main_matches)
d = draw_chart("Roger Federer", main_matches)
 
library(gridExtra)
grid.arrange(a,b,c,d, ncol=2)

[Charts: round reached by year for Murray, Djokovic, Nadal and Federer, arranged in a 2x2 grid]

And that’s all for now!

Categories: Programming

R: Scraping Wimbledon draw data

Fri, 06/26/2015 - 00:14

Given Wimbledon starts next week I wanted to find a data set to explore before it gets underway. Having searched around and failed to find one I had to resort to scraping the ATP World Tour’s event page which displays the matches in an easy to access format.

We’ll be using the Wimbledon 2013 draw since Andy Murray won that year! This is what the page looks like:

[Screenshot: the ATP World Tour results page for Wimbledon 2013]

Each match is in its own row of a table and each column has a class attribute which makes it really easy to scrape. We’ll be using R’s rvest again. I wrote the following script which grabs the player names, seedings and score of the match and stores everything in a data frame:

library(rvest)
library(dplyr)
library(stringr)
 
s = html_session("http://www.atpworldtour.com/en/scores/archive/wimbledon/540/2013/results")
rows = s %>% html_nodes("div#scoresResultsContent tr")
 
matches = data.frame()
for(row in rows) {  
  players = row %>% html_nodes("td.day-table-name a")
  seedings = row %>% html_nodes("td.day-table-seed")
  score = row %>% html_node("td.day-table-score a")
 
  if(!is.null(score)) {
    player1 = players[1] %>% html_text() %>% str_trim()
    seeding1 = ifelse(!is.na(seedings[1]), seedings[1] %>% html_node("span") %>% html_text() %>% str_trim(), NA)
 
    player2 = players[2] %>% html_text() %>% str_trim()
    seeding2 = ifelse(!is.na(seedings[2]), seedings[2] %>% html_node("span") %>% html_text() %>% str_trim(), NA)
 
    matches = rbind(data.frame(winner = player1, 
                               winner_seeding = seeding1, 
                               loser = player2, 
                               loser_seeding = seeding2,
                               score = score %>% html_text() %>% str_trim(),
                               round = round), matches)
 
  } else {
    # rows without a score are header rows which tell us which round the
    # matches beneath them belong to
    round = row %>% html_node("th") %>% html_text()
  }
}

This is what the data frame looks like:

> matches %>% sample_n(10)
               winner winner_seeding                       loser loser_seeding            score                round
61      Wayne Odesnik            (4)                Thiago Alves          <NA>            61 64 1st Round Qualifying
4     Danai Udomchoke           <NA>            Marton Fucsovics          <NA>       61 57 1210 1st Round Qualifying
233    Jerzy Janowicz           (24)                Lukasz Kubot          <NA>         75 64 64       Quarter-Finals
90       Malek Jaziri           <NA>             Illya Marchenko           (9)        674 75 64 2nd Round Qualifying
222      David Ferrer            (4)         Alexandr Dolgopolov          (26) 676 762 26 61 62          Round of 32
54  Michal Przysiezny           (11)                 Dusan Lojda          <NA>         26 63 62 1st Round Qualifying
52           Go Soeda           (13)               Nikola Mektic          <NA>            62 60 1st Round Qualifying
42    Ruben Bemelmans           (23) Jonathan Dasnieres de Veigy          <NA>            63 64 1st Round Qualifying
31        Mirza Basic           <NA>              Tsung-Hua Yang          <NA>     674 33 (RET) 1st Round Qualifying
179     Jurgen Melzer           <NA>              Julian Reister           (Q)    36 762 765 62          Round of 64

It also contains qualifying matches which I’m not so interested in. Let’s strip those out:

main_matches = matches %>% filter(!grepl("Qualifying", round)) %>% mutate(year = 2013)

We’ll also put a column in for ‘year’ so that we can handle the draws for multiple years later on.

Next I wanted to clean up the data a bit. I’d like to be able to do some queries based on the seedings of the players but at the moment that column contains seedings wrapped in brackets as well as some other bracketed values which indicate whether a player is a qualifier, lucky loser or wildcard entry.

I started by adding a column to store this extra information:

main_matches$winner_type = NA
main_matches$winner_type[main_matches$winner_seeding == "(WC)"] = "wildcard"
main_matches$winner_type[main_matches$winner_seeding == "(Q)"] = "qualifier"
main_matches$winner_type[main_matches$winner_seeding == "(LL)"] = "lucky loser"
 
main_matches$loser_type = NA
main_matches$loser_type[main_matches$loser_seeding == "(WC)"] = "wildcard"
main_matches$loser_type[main_matches$loser_seeding == "(Q)"] = "qualifier"
main_matches$loser_type[main_matches$loser_seeding == "(LL)"] = "lucky loser"

And then I cleaned up the existing column:

tidy_seeding = function(seeding) {
  no_brackets = gsub("\\(|\\)", "", seeding)
  return(gsub("WC|Q|L", NA, no_brackets))
}
 
main_matches = main_matches %>% 
  mutate(winner_seeding = as.numeric(tidy_seeding(winner_seeding)),
         loser_seeding = as.numeric(tidy_seeding(loser_seeding)))

Now we can write a query against the data frame to find out when the underdog won i.e. a player with no seeding beat a player with a seeding or a lower seeded player beat a higher seeded one:

> main_matches %>%  filter((winner_seeding > loser_seeding) | (is.na(winner_seeding) & !is.na(loser_seeding)))
                  winner winner_seeding                 loser loser_seeding                  score          round year
1          Jurgen Melzer             NA         Fabio Fognini            30           675 75 63 62   Round of 128 2013
2          Bernard Tomic             NA           Sam Querrey            21       766 763 36 26 63   Round of 128 2013
3        Feliciano Lopez             NA          Gilles Simon            19             62 64 7611   Round of 128 2013
4             Ivan Dodig             NA Philipp Kohlschreiber            16 46 676 763 63 21 (RET)   Round of 128 2013
5         Viktor Troicki             NA      Janko Tipsarevic            14              63 64 765   Round of 128 2013
6         Lleyton Hewitt             NA         Stan Wawrinka            11               64 75 63   Round of 128 2013
7           Steve Darcis             NA          Rafael Nadal             5             764 768 64   Round of 128 2013
8      Fernando Verdasco             NA      Julien Benneteau            31             761 764 64    Round of 64 2013
9           Grega Zemlja             NA       Grigor Dimitrov            29       36 764 36 64 119    Round of 64 2013
10      Adrian Mannarino             NA            John Isner            18               11 (RET)    Round of 64 2013
11         Igor Sijsling             NA          Milos Raonic            17              75 64 764    Round of 64 2013
12     Kenny De Schepper             NA           Marin Cilic            10                  (W/O)    Round of 64 2013
13        Ernests Gulbis             NA    Jo-Wilfried Tsonga             6         36 63 63 (RET)    Round of 64 2013
14     Sergiy Stakhovsky             NA         Roger Federer             3         675 765 75 765    Round of 64 2013
15          Lukasz Kubot             NA          Benoit Paire            25               61 63 64    Round of 32 2013
16     Kenny De Schepper             NA           Juan Monaco            22              64 768 64    Round of 32 2013
17        Jerzy Janowicz             24       Nicolas Almagro            15              766 63 64    Round of 32 2013
18         Andreas Seppi             23         Kei Nishikori            12        36 62 674 61 64    Round of 32 2013
19         Bernard Tomic             NA       Richard Gasquet             9          767 57 75 765    Round of 32 2013
20 Juan Martin Del Potro              8          David Ferrer             4              62 64 765 Quarter-Finals 2013
21           Andy Murray              2        Novak Djokovic             1               64 75 64         Finals 2013

There are actually very few times when a lower seeded player beat a higher seeded one, but there are quite a few instances of unseeded players beating seeds. In total we’ve got 21 occurrences of underdogs winning out of 127 matches.
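
If you want to double check those numbers, a quick count does the trick (a small sketch reusing the same filter as above):

underdog_wins = main_matches %>% 
  filter((winner_seeding > loser_seeding) | 
         (is.na(winner_seeding) & !is.na(loser_seeding)))
 
> nrow(underdog_wins)
[1] 21
 
> nrow(main_matches)
[1] 127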

Let’s filter that set of rows and see which seeds lost in the first round:

> main_matches %>%  filter(round == "Round of 128" & !is.na(loser_seeding))
           winner winner_seeding                 loser loser_seeding                  score        round year
1   Jurgen Melzer             NA         Fabio Fognini            30           675 75 63 62 Round of 128 2013
2   Bernard Tomic             NA           Sam Querrey            21       766 763 36 26 63 Round of 128 2013
3 Feliciano Lopez             NA          Gilles Simon            19             62 64 7611 Round of 128 2013
4      Ivan Dodig             NA Philipp Kohlschreiber            16 46 676 763 63 21 (RET) Round of 128 2013
5  Viktor Troicki             NA      Janko Tipsarevic            14              63 64 765 Round of 128 2013
6  Lleyton Hewitt             NA         Stan Wawrinka            11               64 75 63 Round of 128 2013
7    Steve Darcis             NA          Rafael Nadal             5             764 768 64 Round of 128 2013

Rafael Nadal is the most prominent but Stan Wawrinka also lost in the first round that year, which I’d forgotten about! Next let’s make the ‘round’ column an ordered factor so that we can sort matches by round:

main_matches$round = factor(main_matches$round, levels =  c("Round of 128", "Round of 64", "Round of 32", "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals"))
 
> main_matches$round
...     
Levels: Round of 128 Round of 64 Round of 32 Round of 16 Quarter-Finals Semi-Finals Finals

We can now really easily work out which unseeded players went the furthest in the tournament:

> main_matches %>% filter(is.na(loser_seeding)) %>% arrange(desc(round)) %>% head(5)
             winner winner_seeding             loser loser_seeding           score          round year
1    Jerzy Janowicz             24      Lukasz Kubot            NA        75 64 64 Quarter-Finals 2013
2       Andy Murray              2 Fernando Verdasco            NA  46 36 61 64 75 Quarter-Finals 2013
3 Fernando Verdasco             NA Kenny De Schepper            NA        64 64 64    Round of 16 2013
4      Lukasz Kubot             NA  Adrian Mannarino            NA  46 63 36 63 64    Round of 16 2013
5    Jerzy Janowicz             24     Jurgen Melzer            NA 36 761 64 46 64    Round of 16 2013

Next up I thought it’d be cool to write a function which showed which round each player exited in:

round_reached = function(player, main_matches) {
  furthest_match = main_matches %>% 
    filter(winner == player | loser == player) %>% 
    arrange(desc(round)) %>% 
    head(1)  
 
    return(ifelse(furthest_match$winner == player, "Winner", as.character(furthest_match$round)))
}

Our function isn’t vectorisable – it only works if we pass in a single player at a time so we’ll have to group the data frame by player before calling it. Let’s check it works by seeing how far Andy Murray and Rafael Nadal got:

> round_reached("Rafael Nadal", main_matches)
[1] "Round of 128"
> round_reached("Andy Murray", main_matches)
[1] "Winner"

Great. What about if we try it against each of the top 8 seeds?

> rbind(main_matches %>% filter(winner_seeding %in% 1:8) %>% mutate(name = winner, seeding = winner_seeding), 
        main_matches %>% filter(loser_seeding %in% 1:8) %>% mutate(name = loser, seeding = loser_seeding)) %>%
    select(name, seeding) %>%
    distinct() %>%
    arrange(seeding) %>%
    group_by(name) %>%
    mutate(round_reached = round_reached(name, main_matches))
Source: local data frame [8 x 3]
Groups: name
 
                   name seeding  round_reached
1        Novak Djokovic       1         Finals
2           Andy Murray       2         Winner
3         Roger Federer       3    Round of 64
4          David Ferrer       4 Quarter-Finals
5          Rafael Nadal       5   Round of 128
6    Jo-Wilfried Tsonga       6    Round of 64
7         Tomas Berdych       7 Quarter-Finals
8 Juan Martin Del Potro       8    Semi-Finals

Neat. Next up I want to do a comparison between the round they reached and the round you’d expect them to get to given their seeding but that’s for the weekend!
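
In case you want to play along, this is a minimal sketch of the mapping I have in mind, assuming the usual convention that seeds 1-2 are expected to reach the final, seeds 3-4 the semi finals, seeds 5-8 the quarter finals and so on:

expected_round = function(seeding) {
  ifelse(seeding <= 2, "Finals",
  ifelse(seeding <= 4, "Semi-Finals",
  ifelse(seeding <= 8, "Quarter-Finals",
  ifelse(seeding <= 16, "Round of 16", "Round of 32"))))
}
 
> expected_round(c(1, 5, 20))
[1] "Finals"         "Quarter-Finals" "Round of 32"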

I’ve put a CSV file containing all the data in this gist in case you want to play with it. I’m planning to scrape a few more years’ worth of data before Monday and add in some extra fields as well. In case I don’t get around to it, the full script from this blog post is included in the gist too, so feel free to tweak it if tennis is your thing.

Categories: Programming

R: Scraping the release dates of github projects

Tue, 06/23/2015 - 23:34

Continuing on from my blog post about scraping Neo4j’s release dates I thought it’d be even more interesting to chart the release dates of some github projects.

In theory the release dates should be accessible through the github API but the few projects that I looked at weren’t returning any data so I scraped the release pages instead.
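
For reference, this is roughly the sort of call I mean, sketched with httr against the standard releases endpoint (I haven’t re-run this, and results will vary by project):

library(httr)
 
resp = GET("https://api.github.com/repos/apache/cassandra/releases")
length(content(resp))  # an empty list here is what pushed me towards scraping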

We’ll be using rvest again and I first wrote the following function to extract the release versions and dates from a single page:

library(dplyr)
library(rvest)
library(stringr)
 
process_page = function(releases, session) {
  rows = session %>% html_nodes("ul.release-timeline-tags li")
 
  for(row in rows) {
    date = row %>% html_node("span.date")
    version = row %>% html_node("div.tag-info a")
 
    if(!is.null(version) && !is.null(date)) {
      date = date %>% html_text() %>% str_trim()
      version = version %>% html_text() %>% str_trim()
      releases = rbind(releases, data.frame(date = date, version = version))
    }  
  }
  return(releases)
}

Let’s try it out on the Cassandra release page and see what it comes back with:

> r = process_page(data.frame(), html_session("https://github.com/apache/cassandra/releases"))
> r
           date               version
1  Jun 22, 2015       cassandra-2.1.7
2  Jun 22, 2015      cassandra-2.0.16
3   Jun 8, 2015       cassandra-2.1.6
4   Jun 8, 2015   cassandra-2.2.0-rc1
5  May 19, 2015 cassandra-2.2.0-beta1
6  May 18, 2015      cassandra-2.0.15
7  Apr 29, 2015       cassandra-2.1.5
8   Apr 1, 2015      cassandra-2.0.14
9   Apr 1, 2015       cassandra-2.1.4
10 Mar 16, 2015      cassandra-2.0.13

That works pretty well but it’s only one page! To get all the pages we can use the follow_link function to follow the ‘Next’ link until there aren’t any more pages to process.

We end up with the following function to do this:

find_all_releases = function(starting_page) {
  s = html_session(starting_page)
  releases = data.frame()
 
  next_page = TRUE
  while(next_page) {
    possibleError = tryCatch({  
      releases = process_page(releases, s)
      s = s %>% follow_link("Next") 
    }, error = function(e) { e })
 
    if(inherits(possibleError, "error")){
      next_page = FALSE
    }
  }
  return(releases)
}

Let’s try it out starting from the Cassandra page:

> cassandra = find_all_releases("https://github.com/apache/cassandra/releases")
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-2.0.13
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-2.0.10
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-2.0.8
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.2.13
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-2.0.0-rc1
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.2.3
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.2.0-beta2
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.0.10
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.0.6
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.0.0-rc2
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.7.7
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.7.4
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.7.0-rc3
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.6.4
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.5.0-rc3
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.4.0-final
 
> cassandra %>% sample_n(10)
            date               version
151 Mar 13, 2010   cassandra-0.5.0-rc2
25   Jul 3, 2014      cassandra-1.2.18
51  Jul 27, 2013       cassandra-1.2.8
21  Aug 19, 2014   cassandra-2.1.0-rc6
73  Sep 24, 2012 cassandra-1.2.0-beta1
158 Mar 13, 2010   cassandra-0.4.0-rc2
113 May 20, 2011     cassandra-0.7.6-2
15  Oct 24, 2014       cassandra-2.1.1
103 Sep 15, 2011 cassandra-1.0.0-beta1
93  Nov 29, 2011       cassandra-1.0.4

I want to plot when the different releases happened over time. In order to do that we need to create an extra column containing the ‘release series’, which we can do with the following transformation:

series = function(version) {
  parts = strsplit(as.character(version), "\\.")  
  return(unlist(lapply(parts, function(p) paste(p %>% unlist %>% head(2), collapse = "."))))  
}
 
library(lubridate)
 
bySeries = cassandra %>%
  mutate(date2 = mdy(date), series = series(version),
         short_version = gsub("cassandra-", "", version),
         short_series = series(short_version))
 
> bySeries %>% sample_n(10)
            date               version      date2        series short_version short_series
3    Jun 8, 2015       cassandra-2.1.6 2015-06-08 cassandra-2.1         2.1.6          2.1
161 Mar 13, 2010 cassandra-0.4.0-beta1 2010-03-13 cassandra-0.4   0.4.0-beta1          0.4
62  Feb 15, 2013      cassandra-1.1.10 2013-02-15 cassandra-1.1        1.1.10          1.1
153 Mar 13, 2010 cassandra-0.5.0-beta2 2010-03-13 cassandra-0.5   0.5.0-beta2          0.5
37   Feb 7, 2014       cassandra-2.0.5 2014-02-07 cassandra-2.0         2.0.5          2.0
36   Feb 7, 2014      cassandra-1.2.15 2014-02-07 cassandra-1.2        1.2.15          1.2
29   Jun 2, 2014   cassandra-2.1.0-rc1 2014-06-02 cassandra-2.1     2.1.0-rc1          2.1
21  Aug 19, 2014   cassandra-2.1.0-rc6 2014-08-19 cassandra-2.1     2.1.0-rc6          2.1
123 Feb 16, 2011       cassandra-0.7.2 2011-02-16 cassandra-0.7         0.7.2          0.7
135  Nov 1, 2010 cassandra-0.7.0-beta3 2010-11-01 cassandra-0.7   0.7.0-beta3          0.7
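
As a quick sanity check, this is what the series helper gives back for a single version string:

> series("cassandra-2.1.7")
[1] "cassandra-2.1"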

Now let’s plot those releases and see what we get:

ggplot(aes(x = date2, y = short_series), 
       data = bySeries %>% filter(!grepl("beta|rc", short_version))) +     
  geom_text(aes(label=short_version),hjust=0.5, vjust=0.5, size = 4, angle = 90) + 
  theme_bw()

[Chart: Cassandra release dates by series, with beta and RC versions excluded]

An interesting thing we can see from this visualisation is how much the various series of versions overlap. Most of the time only two series overlap, but the 1.2, 2.0 and 2.1 series all overlap, which is unusual.

In this chart we excluded all beta and RC versions. Let’s bring those back in and just show the last 3 series:

ggplot(aes(x = date2, y = short_series), 
       data = bySeries %>% filter(grepl("2\\.[012]\\.|1\\.2\\.", short_version))) +     
  geom_text(aes(label=short_version),hjust=0.5, vjust=0.5, size = 4, angle = 90) + 
  theme_bw()

[Chart: the 1.2, 2.0 and 2.1 series, including beta and RC releases]

From this chart it’s clearer that the 2.0 and 2.1 series have recent releases so there will probably be three overlapping versions when the 2.2 series is released as well.

The chart is still a bit cluttered although less than before. I’m not sure of a better way of visualising this type of data so if you have any ideas do let me know!

Categories: Programming

R: Scraping Neo4j release dates with rvest

Sun, 06/21/2015 - 23:07

As part of my log analysis I wanted to get the Neo4j release dates which are accessible from the release notes and decided to try out Hadley Wickham’s rvest scraping library which he released at the end of 2014.

rvest is inspired by Python’s beautifulsoup, which has become my scraping library of choice, so I didn’t find it too difficult to pick up.

To start with we need to download the release notes locally so we don’t have to go over the network when we’re doing our scraping:

download.file("http://neo4j.com/release-notes/page/1", "release-notes.html")
download.file("http://neo4j.com/release-notes/page/2", "release-notes2.html")

We want to parse those pages back and return the rows which contain version numbers and release dates. The HTML looks like this:

[Screenshot: the HTML structure of the release notes rows]

We can get the rows with the following code:

library(rvest)
library(dplyr)
 
page1 <- html("release-notes.html")
page2 <- html("release-notes2.html")
 
rows = c(page1 %>% html_nodes("div.small-12 div.row"), 
         page2 %>% html_nodes("div.small-12 div.row") ) 
 
> rows %>% head(1)
[[1]]
<div class="row"> <h3 class="entry-title"><a href="http://neo4j.com/release-notes/neo4j-2-2-2/">Latest Release: Neo4j 2.2.2</a></h3> <h6>05/21/2015</h6> <p>Neo4j 2.2.2 is a maintenance release, with critical improvements.</p> <p>Notably, this release:</p> <ul><li>Provides support for running Neo4j on Oracle and OpenJDK Java 8 runtimes</li> <li>Resolves an issue that prevented the Neo4j Browser from loading in the latest Chrome release (43.0.2357.65).</li> <li>Corrects the behavior of the <code>:sysinfo</code> (aka <code>:play sysinfo</code>) browser directive.</li> <li>Improves the <a href="http://neo4j.com/docs/2.2.2/import-tool.html">import tool</a> handling of values containing newlines, and adds support f...</li></ul><a href="http://neo4j.com/release-notes/neo4j-2-2-2/">Read full notes →</a> </div>

Now we need to loop through the rows and pull out just the version and release date. I wrote the following function to do this and strip out any extra text that we’re not interested in:

generate_releases = function(rows) {
  releases = data.frame()
  for(row in rows) {
    version = row %>% html_node("h3.entry-title")
    date = row %>% html_node("h6")  
 
    if(!is.null(version) && !is.null(date)) {
      version = version %>% html_text()
      version = gsub("Latest Release: ", "", version)
      version = gsub("Neo4j ", "", version)
      releases = rbind(releases, data.frame(version = version, date = date %>% html_text()))
    }
  }
  return(releases)
}
 
> generate_releases(rows)
   version       date
1    2.2.2 05/21/2015
2    2.2.1 04/14/2015
3    2.1.8 04/01/2015
4    2.2.0 03/25/2015
5    2.1.7 02/03/2015
6    2.1.6 11/25/2014
7    1.9.9 10/13/2014
8    2.1.5 09/30/2014
9    2.1.4 09/04/2014
10   2.1.3 07/28/2014
11   2.0.4 07/08/2014
12   1.9.8 06/19/2014
13   2.1.2 06/11/2014
14   2.0.3 04/30/2014
15   2.0.1 02/04/2014
16   2.0.2 04/15/2014
17   1.9.7 04/11/2014
18   1.9.6 02/03/2014
19     2.0 12/11/2013
20   1.9.5 11/11/2013
21   1.9.4 09/19/2013
22   1.9.3 08/30/2013
23   1.9.2 07/16/2013
24   1.9.1 06/24/2013
25     1.9 05/13/2013
26   1.8.3         //

Finally I wanted to convert the ‘date’ column to be in R date format and get rid of the 1.8.3 row since it doesn’t contain a date. lubridate is my go-to library for date manipulation in R so we’ll use that here:

library(lubridate)
 
> generate_releases(rows) %>%  
      mutate(date = mdy(date)) %>%   
      filter(!is.na(date)) 
 
   version       date
1    2.2.2 2015-05-21
2    2.2.1 2015-04-14
3    2.1.8 2015-04-01
4    2.2.0 2015-03-25
5    2.1.7 2015-02-03
6    2.1.6 2014-11-25
7    1.9.9 2014-10-13
8    2.1.5 2014-09-30
9    2.1.4 2014-09-04
10   2.1.3 2014-07-28
11   2.0.4 2014-07-08
12   1.9.8 2014-06-19
13   2.1.2 2014-06-11
14   2.0.3 2014-04-30
15   2.0.1 2014-02-04
16   2.0.2 2014-04-15
17   1.9.7 2014-04-11
18   1.9.6 2014-02-03
19     2.0 2013-12-11
20   1.9.5 2013-11-11
21   1.9.4 2013-09-19
22   1.9.3 2013-08-30
23   1.9.2 2013-07-16
24   1.9.1 2013-06-24
25     1.9 2013-05-13

We could then easily see how many releases there were by year:

releasesByDate = generate_releases(rows) %>%  
  mutate(date = mdy(date)) %>%   
  filter(!is.na(date))
 
> releasesByDate %>% mutate(year = year(date)) %>% count(year)
Source: local data frame [3 x 2]
 
  year  n
1 2013  7
2 2014 13
3 2015  5

Or by month:

> releasesByDate %>% mutate(month = month(date)) %>% count(month)
Source: local data frame [11 x 2]
 
   month n
1      2 3
2      3 1
3      4 5
4      5 2
5      6 3
6      7 3
7      8 1
8      9 3
9     10 1
10    11 2
11    12 1
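
A quick bar chart makes those counts easier to scan. This is just a sketch, with the counting pipeline from above piped straight into ggplot:

library(ggplot2)
 
releasesByDate %>% 
  mutate(year = year(date)) %>% 
  count(year) %>%
  ggplot(aes(x = factor(year), y = n)) + 
  geom_bar(stat = "identity")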

Before this quick bit of hacking I’d always turned to Ruby or Python whenever I wanted to scrape a dataset, but it looks like rvest makes R a decent option for this type of work now. Good times!

Categories: Programming

R: dplyr – segfault cause ‘memory not mapped’

Sat, 06/20/2015 - 23:18

In my continued playing around with web logs in R I wanted to process the logs for a day and see what the most popular URIs were.

I first read in all the lines using the read_lines function in readr and put the vector it produced into a data frame so I could process it using dplyr.

library(readr)
dlines = data.frame(column = read_lines("~/projects/logs/2015-06-18-22-docs"))

In the previous post I showed some code to extract the URI from a log line. I extracted this code out into a function and adapted it so that I could pass in a list of values instead of a single value:

library(stringr)
 
extract_uri = function(log) {
  parts = str_extract_all(log, "\"[^\"]*\"")
  return(lapply(parts, function(p) str_match(p[1], "GET (.*) HTTP")[2] %>% as.character))
}

Next I ran the following function to count the number of times each URI appeared in the logs:

library(dplyr)
pages_viewed = dlines %>%
  mutate(uri  = extract_uri(column)) %>% 
  count(uri) %>%
  arrange(desc(n))

This crashed my R process with the following error message:

segfault cause 'memory not mapped'

I narrowed it down to a problem when doing a group by operation on the ‘uri’ field and came across this post which suggested that it was handled more cleanly in more recent versions of dplyr.
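
For anyone wanting to do the same, something along these lines should do it:

install.packages("dplyr")
packageVersion("dplyr")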

I upgraded to 0.4.2 and tried again:

## Error in eval(expr, envir, enclos): cannot group column uri, of class 'list'

That makes more sense. We’re probably returning a list from extract_uri rather than a vector which would fit nicely back into the data frame. That’s fixed easily enough by unlisting the result:

extract_uri = function(log) {
  parts = str_extract_all(log, "\"[^\"]*\"")
  return(unlist(lapply(parts, function(p) str_match(p[1], "GET (.*) HTTP")[2] %>% as.character)))
}

And now when we run the count function it’s happy again, good times!
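
As an aside, here’s the list versus vector distinction in miniature (nothing to do with the logs, just an illustration):

> lapply(c("a", "bb"), nchar) %>% class()
[1] "list"
 
> unlist(lapply(c("a", "bb"), nchar)) %>% class()
[1] "integer"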

Categories: Programming

R: Regex – capturing multiple matches of the same group

Fri, 06/19/2015 - 22:38

I’ve been playing around with some web logs using R and I wanted to extract everything that existed in double quotes within a logged entry.

This is an example of a log entry that I want to parse:

log = '2015-06-18-22:277:548311224723746831\t2015-06-18T22:00:11\t2015-06-18T22:00:05Z\t93317114\tip-127-0-0-1\t127.0.0.5\tUser\tNotice\tneo4j.com.access.log\t127.0.0.3 - - [18/Jun/2015:22:00:11 +0000] "GET /docs/stable/query-updating.html HTTP/1.1" 304 0 "http://neo4j.com/docs/stable/cypher-introduction.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36"'

And I want to extract these 3 things:

  • /docs/stable/query-updating.html
  • http://neo4j.com/docs/stable/cypher-introduction.html
  • Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36

i.e. the URI, the referrer and browser details.

I’ll be using the stringr library which seems to work quite well for this type of work.

To extract these values we need to find all the occurrences of double quotes and get the text inside those quotes. We might start by using the str_match function:

> library(stringr)
> str_match(log, "\"[^\"]*\"")
     [,1]                                               
[1,] "\"GET /docs/stable/query-updating.html HTTP/1.1\""

Unfortunately that only picked up the first occurrence of the pattern so we’ve got the URI but not the referrer or browser details. I tried str_extract with similar results before I found str_extract_all which does the job:

> str_extract_all(log, "\"[^\"]*\"")
[[1]]
[1] "\"GET /docs/stable/query-updating.html HTTP/1.1\""                                                                            
[2] "\"http://neo4j.com/docs/stable/cypher-introduction.html\""                                                                    
[3] "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36\""

We still need to do a bit of cleanup to get rid of the ‘GET’ and ‘HTTP/1.1’ in the URI and the quotes in all of them:

parts = str_extract_all(log, "\"[^\"]*\"")[[1]]
uri = str_match(parts[1], "GET (.*) HTTP")[2]
referer = str_match(parts[2], "\"(.*)\"")[2]
browser = str_match(parts[3], "\"(.*)\"")[2]
 
> uri
[1] "/docs/stable/query-updating.html"
 
> referer
[1] "https://www.google.com/"
 
> browser
[1] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36"

We could then go on to split out the browser string into its sub components but that’ll do for now!
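
If you did want to make a start on that, a rough sketch might look like this. The patterns are my guesses and certainly won’t cover every user agent out there:

os = str_match(browser, "\\((.*?)\\)")[2]
chrome_version = str_match(browser, "Chrome/([0-9.]+)")[2]
 
> os
[1] "Macintosh; Intel Mac OS X 10_10_2"
 
> chrome_version
[1] "43.0.2357.124"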

Categories: Programming

Coding: Explore and retreat

Wed, 06/17/2015 - 18:23

When refactoring code or looking for the best way to integrate a new piece of functionality I generally favour a small-steps, incremental approach, but recent experiences have led me to believe that this isn’t always the quickest way.

Sometimes it seems to make more sense to go on little discovery missions in the code, make some bigger steps and then, if necessary, retreat, revert our changes and apply the lessons learnt on the next discovery mission. This technique isn’t anything novel but I think it’s quite effective.

Michael and I were recently looking at the Smart Local Moving algorithm which is used for community detection in large networks and decided to refactor the code to make sure we understood how it worked. When we started the outline of the main class was like this:

public class Network implements Cloneable, Serializable
{
    private static final long serialVersionUID = 1;
 
    private int numberOfNodes;
    private int[] firstNeighborIndex;
    private int[] neighbor;
    private double[] edgeWeight;
    private double totalEdgeWeightSelfLinks;
    private double[] nodeWeight;
    private int nClusters;
    private int[] cluster;
 
    private double[] clusterWeight;
    private int[] numberNodesPerCluster;
    private int[][] nodePerCluster;
    private boolean clusteringStatsAvailable;
...
}

My initial approach was to put methods around things to make the code a bit easier to understand and then, step by step, replace each of those fields with nodes and relationships. I spent the first couple of hours doing this and while it made the code more readable, progress was slow and I wasn’t much wiser about how the code worked.

Michael and I paired on it for a few hours and he adopted a different and more successful approach where we looked at bigger chunks of code e.g. all the loops that used the firstNeighborIndex field, and then formed a hypothesis of what that code was doing.

In this case firstNeighborIndex acts as an offset into neighbor and is used to iterate through a node’s relationships. We thought we could probably replace that with something more similar to the Neo4j model where you have classes for nodes and relationships and a node has a method which returns a collection of relationships.

We tried tearing out everywhere that used those two fields and replacing them with our new nodes/relationships code but that didn’t work because we hadn’t realised that edgeWeight and nodeWeight are also tied to the contents of the original fields.

We therefore needed to retreat and try again. This time I put the new approach alongside the existing approach and then slowly replaced existing bits of code.

Along the way I came up with other ideas about how to restructure the code, tried some more bigger leaps to validate my ideas and then moved back into incremental mode again.

In summary I’ve found the combination of incrementally changing code and going on bigger exploratory missions works quite well.

Now I’m trying to work out when each approach is appropriate and I’ll write that up when I learn more! You can see my progress via the github commits.

Categories: Programming