Skip to content

Software Development Blogs: Programming, Software Testing, Agile Project Management

Methods & Tools

Subscribe to Methods & Tools
if you are not afraid to read more than one page to be a smarter software developer, software tester or project manager!

Mark Needham
Syndicate content
Thoughts on Software Development
Updated: 6 hours 32 min ago

R: Time to/from the weekend

Sat, 12/13/2014 - 21:38

In my last post I showed some examples using R’s lubridate package and another problem it made really easy to solve was working out how close a particular date time was to the weekend.

I wanted to write a function which would return the previous Sunday or upcoming Saturday depending on which was closer.

lubridate’s floor_date and ceiling_date functions make this quite simple.

e.g. if we want to round the 18th December down to the beginning of the week and up to the beginning of the next week we could do the following:

> library(lubridate)
> floor_date(ymd("2014-12-18"), "week")
[1] "2014-12-14 UTC"
 
> ceiling_date(ymd("2014-12-18"), "week")
[1] "2014-12-21 UTC"

For the date in the future we actually want to grab the Saturday rather than the Sunday so we’ll subtract one day from that:

> ceiling_date(ymd("2014-12-18"), "week") - days(1)
[1] "2014-12-20 UTC"

Now let’s put that together into a function which finds the closest weekend for a given date:

findClosestWeekendDay = function(dateToLookup) {
  before = floor_date(dateToLookup, "week") + hours(23) + minutes(59) + seconds(59)
  after  = ceiling_date(dateToLookup, "week") - days(1)
  if((dateToLookup - before) < (after - dateToLookup)) {
    before  
  } else {
    after  
  }
}
 
> findClosestWeekendDay(ymd_hms("2014-12-13 13:33:29"))
[1] "2014-12-13 UTC"
 
> findClosestWeekendDay(ymd_hms("2014-12-14 18:33:29"))
[1] "2014-12-14 23:59:59 UTC"
 
> findClosestWeekendDay(ymd_hms("2014-12-15 18:33:29"))
[1] "2014-12-14 23:59:59 UTC"
 
> findClosestWeekendDay(ymd_hms("2014-12-17 11:33:29"))
[1] "2014-12-14 23:59:59 UTC"
 
> findClosestWeekendDay(ymd_hms("2014-12-17 13:33:29"))
[1] "2014-12-20 UTC"
 
> findClosestWeekendDay(ymd_hms("2014-12-19 13:33:29"))
[1] "2014-12-20 UTC"

I’ve set the Sunday date at 23:59:59 so that I can use this date in the next step where we want to calculate how how many hours it is from the current date to the nearest weekend.

I ended up with this function:

distanceFromWeekend = function(dateToLookup) {
  before = floor_date(dateToLookup, "week") + hours(23) + minutes(59) + seconds(59)
  after  = ceiling_date(dateToLookup, "week") - days(1)
  timeToBefore = dateToLookup - before
  timeToAfter = after - dateToLookup
 
  if(timeToBefore < 0 || timeToAfter < 0) {
    0  
  } else {
    if(timeToBefore < timeToAfter) {
      timeToBefore / dhours(1)
    } else {
      timeToAfter / dhours(1)
    }
  }
}
 
> distanceFromWeekend(ymd_hms("2014-12-13 13:33:29"))
[1] 0
 
> distanceFromWeekend(ymd_hms("2014-12-14 18:33:29"))
[1] 0
 
> distanceFromWeekend(ymd_hms("2014-12-15 18:33:29"))
[1] 18.55833
 
> distanceFromWeekend(ymd_hms("2014-12-17 11:33:29"))
[1] 59.55833
 
> distanceFromWeekend(ymd_hms("2014-12-17 13:33:29"))
[1] 58.44194
 
> distanceFromWeekend(ymd_hms("2014-12-19 13:33:29"))
[1] 10.44194

While this works it’s quite slow when you run it over a data frame which contains a lot of rows.

There must be a clever R way of doing the same thing (perhaps using matrices) which I haven’t figured out yet so if you know how to speed it up do let me know.

Categories: Programming

R: Numeric representation of date time

Sat, 12/13/2014 - 20:58

I’ve been playing around with date times in R recently and I wanted to derive a numeric representation for a given value to make it easier to see the correlation between time and another variable.

e.g. December 13th 2014 17:30 should return 17.5 since it’s 17.5 hours since midnight.

Using the standard R libraries we would write the following code:

> december13 = as.POSIXlt("2014-12-13 17:30:00")
> as.numeric(december13 - trunc(december13, "day"), units="hours")
[1] 17.5

That works pretty well but Antonios recently introduced me to the lubridate so I thought I’d give that a try as well.

The first nice thing about lubridate is that we can use the date we created earlier and call the floor_date function rather than truncate:

> (december13 - floor_date(december13, "day"))
Time difference of 17.5 hours

That gives us back a difftime…

> class((december13 - floor_date(december13, "day")))
[1] "difftime"

…which we can divide by different units to get the granularity we want:

> diff = (december13 - floor_date(december13, "day"))
> diff / dhours(1)
[1] 17.5
 
> diff / ddays(1)
[1] 0.7291667
 
> diff / dminutes(1)
[1] 1050

Pretty neat!

lubridate also has some nice functions for creating dates/date times. e.g.

> ymd_hms("2014-12-13 17:00:00")
[1] "2014-12-13 17:00:00 UTC"
 
> ymd_hm("2014-12-13 17:00")
[1] "2014-12-13 17:00:00 UTC"
 
> ymd_h("2014-12-13 17")
[1] "2014-12-13 17:00:00 UTC"
 
> ymd("2014-12-13")
[1] "2014-12-13 UTC"

And if you want a different time zone that’s pretty easy too:

> with_tz(ymd("2014-12-13"), "GMT")
[1] "2014-12-13 GMT"
Categories: Programming

R: data.table/dplyr/lubridate – Error in wday(date, label = TRUE, abbr = FALSE) : unused arguments (label = TRUE, abbr = FALSE)

Thu, 12/11/2014 - 20:03

I spent a couple of hours playing around with data.table this evening and tried changing some code written using a data frame to use a data table instead.

I started off by building a data frame which contains all the weekends between 2010 and 2015…

> library(lubridate)
 
> library(dplyr)
 
> dates = data.frame(date = seq( dmy("01-01-2010"), to=dmy("01-01-2015"), by="day" ))
 
> dates = dates %>% filter(wday(date, label = TRUE, abbr = FALSE) %in% c("Saturday", "Sunday"))

…which works fine:

> dates %>% head()
         date
1: 2010-01-02
2: 2010-01-03
3: 2010-01-09
4: 2010-01-10
5: 2010-01-16
6: 2010-01-17

I then tried to change the code to use a data table instead which led to the following error:

> library(data.table)
 
> dates = data.table(date = seq( dmy("01-01-2010"), to=dmy("01-01-2015"), by="day" ))
 
> dates = dates %>% filter(wday(date, label = TRUE, abbr = FALSE) %in% c("Saturday", "Sunday"))
Error in wday(date, label = TRUE, abbr = FALSE) : 
  unused arguments (label = TRUE, abbr = FALSE)

I wasn’t sure what was going on so I went back to the data frame version to check if that still worked…

> dates = data.frame(date = seq( dmy("01-01-2010"), to=dmy("01-01-2015"), by="day" ))
 
> dates = dates %>% filter(wday(date, label = TRUE, abbr = FALSE) %in% c("Saturday", "Sunday"))
Error in wday(c(1262304000, 1262390400, 1262476800, 1262563200, 1262649600,  : 
  unused arguments (label = TRUE, abbr = FALSE)

…except it now didn’t work either! I decided to check what wday was referring to…

Help on topic ‘wday’ was found in the following packages:

Integer based date class
(in package data.table in library /Library/Frameworks/R.framework/Versions/3.1/Resources/library)
Get/set days component of a date-time.
(in package lubridate in library /Library/Frameworks/R.framework/Versions/3.1/Resources/library)

…and realised that data.table has its own wday function – I’d been caught out by R’s global scoping of all the things!

We can probably work around that by the order in which we require the various libraries but for now I’m just prefixing the call to wday and all is well:

dates = dates %>% filter(lubridate::wday(date, label = TRUE, abbr = FALSE) %in% c("Saturday", "Sunday"))
Categories: Programming

R: Cleaning up and plotting Google Trends data

Tue, 12/09/2014 - 19:14

I recently came across an excellent article written by Stian Haklev in which he describes things he wishes he’d been told before starting out with R, one being to do all data clean up in code which I thought I’d give a try.

My goal is to leave the raw data completely unchanged, and do all the transformation in code, which can be rerun at any time.

While I’m writing the scripts, I’m often jumping around, selectively executing individual lines or code blocks, running commands to inspect the data in the REPL (read-evaluate-print-loop, where each command is executed as soon as you type enter, in the picture above it’s the pane to the right), etc.

But I try to make sure that when I finish up, the script is runnable by itself.

I thought the Google Trends data set would be an interesting one to play around with as it gives you a CSV containing several different bits of data of which I’m only interested in ‘interest over time’.

It’s not very easy to automate the download of the CSV file so I did that bit manually and automated everything from there onwards.

The first step was to read the CSV file and explore some of the rows to see what it contained:

> library(dplyr)
 
> googleTrends = read.csv("/Users/markneedham/Downloads/report.csv", row.names=NULL)
 
> googleTrends %>% head()
##                   row.names Web.Search.interest..neo4j
## 1 Worldwide; 2004 - present                           
## 2        Interest over time                           
## 3                      Week                      neo4j
## 4   2004-01-04 - 2004-01-10                          0
## 5   2004-01-11 - 2004-01-17                          0
## 6   2004-01-18 - 2004-01-24                          0
 
> googleTrends %>% sample_n(10)
##                   row.names Web.Search.interest..neo4j
## 109 2006-01-08 - 2006-01-14                          0
## 113 2006-02-05 - 2006-02-11                          0
## 267 2009-01-18 - 2009-01-24                          0
## 199 2007-09-30 - 2007-10-06                          0
## 522 2013-12-08 - 2013-12-14                         88
## 265 2009-01-04 - 2009-01-10                          0
## 285 2009-05-24 - 2009-05-30                          0
## 318 2010-01-10 - 2010-01-16                          0
## 495 2013-06-02 - 2013-06-08                         79
## 28  2004-06-20 - 2004-06-26                          0
 
> googleTrends %>% tail()
##                row.names Web.Search.interest..neo4j
## 658        neo4j example                   Breakout
## 659 neo4j graph database                   Breakout
## 660           neo4j java                   Breakout
## 661           neo4j node                   Breakout
## 662           neo4j rest                   Breakout
## 663       neo4j tutorial                   Breakout

We only want to keep the rows which contain (week, interest) pairs so the first thing we’ll do is rename the columns:

names(googleTrends) = c("week", "score")

Now we want to strip out the rows which don’t contain (week, interest) pairs. The easiest way to do this is to look for rows which don’t contain date values in the ‘week’ column.

First we need to split the start and end dates in that column by using the strsplit function.

I found it much easier to apply the function to each row individually rather than passing in a list of values so I created a dummy column with a row number in to allow me to do that (a trick Antonios showed me):

> googleTrends %>% 
    mutate(ind = row_number()) %>% 
    group_by(ind) %>%
    mutate(dates = strsplit(week, " - "),
           start = dates[[1]][1] %>% strptime("%Y-%m-%d") %>% as.character(),
           end =   dates[[1]][2] %>% strptime("%Y-%m-%d") %>% as.character()) %>%
    head()
## Source: local data frame [6 x 6]
## Groups: ind
## 
##                        week score ind    dates      start        end
## 1 Worldwide; 2004 - present     1   1 <chr[2]>         NA         NA
## 2        Interest over time     1   2 <chr[1]>         NA         NA
## 3                      Week    90   3 <chr[1]>         NA         NA
## 4   2004-01-04 - 2004-01-10     3   4 <chr[2]> 2004-01-04 2004-01-10
## 5   2004-01-11 - 2004-01-17     3   5 <chr[2]> 2004-01-11 2004-01-17
## 6   2004-01-18 - 2004-01-24     3   6 <chr[2]> 2004-01-18 2004-01-24

Now we need to get rid of the rows which have an NA value for ‘start’ or ‘end':

> googleTrends %>% 
    mutate(ind = row_number()) %>% 
    group_by(ind) %>%
    mutate(dates = strsplit(week, " - "),
           start = dates[[1]][1] %>% strptime("%Y-%m-%d") %>% as.character(),
           end =   dates[[1]][2] %>% strptime("%Y-%m-%d") %>% as.character()) %>%
    filter(!is.na(start) | !is.na(end)) %>% 
    head()
## Source: local data frame [6 x 6]
## Groups: ind
## 
##                      week score ind    dates      start        end
## 1 2004-01-04 - 2004-01-10     3   4 <chr[2]> 2004-01-04 2004-01-10
## 2 2004-01-11 - 2004-01-17     3   5 <chr[2]> 2004-01-11 2004-01-17
## 3 2004-01-18 - 2004-01-24     3   6 <chr[2]> 2004-01-18 2004-01-24
## 4 2004-01-25 - 2004-01-31     3   7 <chr[2]> 2004-01-25 2004-01-31
## 5 2004-02-01 - 2004-02-07     3   8 <chr[2]> 2004-02-01 2004-02-07
## 6 2004-02-08 - 2004-02-14     3   9 <chr[2]> 2004-02-08 2004-02-14

Next we’ll get rid of ‘week’, ‘ind’ and ‘dates’ as we aren’t going to need those anymore:

> cleanGoogleTrends = googleTrends %>% 
    mutate(ind = row_number()) %>% 
    group_by(ind) %>%
    mutate(dates = strsplit(week, " - "),
           start = dates[[1]][1] %>% strptime("%Y-%m-%d") %>% as.character(),
           end =   dates[[1]][2] %>% strptime("%Y-%m-%d") %>% as.character()) %>%
    filter(!is.na(start) | !is.na(end)) %>%
    ungroup() %>%
    select(-c(ind, dates, week))
 
> cleanGoogleTrends %>% head()
## Source: local data frame [6 x 3]
## 
##   score      start        end
## 1     3 2004-01-04 2004-01-10
## 2     3 2004-01-11 2004-01-17
## 3     3 2004-01-18 2004-01-24
## 4     3 2004-01-25 2004-01-31
## 5     3 2004-02-01 2004-02-07
## 6     3 2004-02-08 2004-02-14
 
> cleanGoogleTrends %>% sample_n(10)
## Source: local data frame [10 x 3]
## 
##    score      start        end
## 1      8 2010-09-26 2010-10-02
## 2     73 2013-11-17 2013-11-23
## 3     52 2012-07-01 2012-07-07
## 4      3 2005-06-19 2005-06-25
## 5      3 2004-12-12 2004-12-18
## 6      3 2009-09-06 2009-09-12
## 7     71 2014-09-14 2014-09-20
## 8      3 2004-12-26 2005-01-01
## 9     62 2013-03-03 2013-03-09
## 10     3 2006-03-19 2006-03-25
 
> cleanGoogleTrends %>% tail()
## Source: local data frame [6 x 3]
## 
##   score      start        end
## 1    80 2014-10-19 2014-10-25
## 2    80 2014-10-26 2014-11-01
## 3    84 2014-11-02 2014-11-08
## 4    81 2014-11-09 2014-11-15
## 5    83 2014-11-16 2014-11-22
## 6     2 2014-11-23 2014-11-29

Ok now we’re ready to plot. This was my first attempt:

> library(ggplot2)
> ggplot(aes(x = start, y = score), data = cleanGoogleTrends) + 
    geom_line(size = 0.5)
## geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
2014 12 09 17 57 49

As you can see, not too successful! The first mistake I’ve made is not telling ggplot that the ‘start’ column is a date and so it can use that ordering when plotting:

> cleanGoogleTrends = cleanGoogleTrends %>% mutate(start =  as.Date(start))
> ggplot(aes(x = start, y = score), data = cleanGoogleTrends) + 
    geom_line(size = 0.5)

2014 12 09 18 00 03

My next mistake is that ‘score’ is not being treated as a continuous variable and so we’re ending up with this very strange looking chart. We can see that if we call the class function:

> class(cleanGoogleTrends$score)
## [1] "factor"

Let’s fix that and plot again:

> cleanGoogleTrends = cleanGoogleTrends %>% mutate(score = as.numeric(score))
> ggplot(aes(x = start, y = score), data = cleanGoogleTrends) + 
    geom_line(size = 0.5)

2014 12 09 18 02 39

That’s much better but there is quite a bit of noise in the week to week scores which we can flatten a bit by plotting a rolling mean of the last 4 weeks instead:

> library(zoo)
> cleanGoogleTrends = cleanGoogleTrends %>% 
    mutate(rolling = rollmean(score, 4, fill = NA, align=c("right")),
           start =  as.Date(start))
 
> ggplot(aes(x = start, y = rolling), data = cleanGoogleTrends) + 
    geom_line(size = 0.5)

2014 12 09 18 05 26

Here’s the full code if you want to reproduce:

library(dplyr)
library(zoo)
library(ggplot2)
 
googleTrends = read.csv("/Users/markneedham/Downloads/report.csv", row.names=NULL)
names(googleTrends) = c("week", "score")
 
cleanGoogleTrends = googleTrends %>% 
  mutate(ind = row_number()) %>% 
  group_by(ind) %>%
  mutate(dates = strsplit(week, " - "),
         start = dates[[1]][1] %>% strptime("%Y-%m-%d") %>% as.character(),
         end =   dates[[1]][2] %>% strptime("%Y-%m-%d") %>% as.character()) %>%
  filter(!is.na(start) | !is.na(end)) %>%
  ungroup() %>%
  select(-c(ind, dates, week)) %>%
  mutate(start =  as.Date(start),
         score = as.numeric(score),
         rolling = rollmean(score, 4, fill = NA, align=c("right")))
 
ggplot(aes(x = start, y = rolling), data = cleanGoogleTrends) + 
  geom_line(size = 0.5)

My next step is to plot the Google Trends scores against my meetup data set to see if there’s any interesting correlations going on.

As an aside I made use of knitr while putting together this post – it works really well for checking that you’ve included all the steps and that it actually works!

Categories: Programming

R: dplyr – mutate with strptime (incompatible size/wrong result size)

Mon, 12/08/2014 - 20:02

Having worked out how to translate a string into a date or NA if it wasn’t the appropriate format the next thing I wanted to do was store the result of the transformation in my data frame.

I started off with this:

data = data.frame(x = c("2014-01-01", "2014-02-01", "foo"))
> data
           x
1 2014-01-01
2 2014-02-01
3        foo

And when I tried to do the date translation ran into the following error:

> data %>% mutate(y = strptime(x, "%Y-%m-%d"))
Error: wrong result size (11), expected 3 or 1

As I understand it this error is telling us that we are trying to put a value into the data frame which represents 11 rows rather than 3 rows or 1 row.

It turns out that storing POSIXlts in a data frame isn’t such a good idea! In this case we can use the as.character function to create a character vector which can be stored in the data frame:

> data %>% mutate(y = strptime(x, "%Y-%m-%d") %>% as.character())
           x          y
1 2014-01-01 2014-01-01
2 2014-02-01 2014-02-01
3        foo       <NA>

We can then get rid of the NA row by using the is.na function:

> data %>% mutate(y = strptime(x, "%Y-%m-%d") %>% as.character()) %>% filter(!is.na(y))
           x          y
1 2014-01-01 2014-01-01
2 2014-02-01 2014-02-01

And a final tweak so that we have 100% pipelining goodness:

> data %>% 
    mutate(y = x %>% strptime("%Y-%m-%d") %>% as.character()) %>%
    filter(!is.na(y))
           x          y
1 2014-01-01 2014-01-01
2 2014-02-01 2014-02-01
Categories: Programming

R: dplyr – mutate with strptime (incompatible size/wrong result size)

Mon, 12/08/2014 - 20:02

Having worked out how to translate a string into a date or NA if it wasn’t the appropriate format the next thing I wanted to do was store the result of the transformation in my data frame.

I started off with this:

data = data.frame(x = c("2014-01-01", "2014-02-01", "foo"))
> data
           x
1 2014-01-01
2 2014-02-01
3        foo

And when I tried to do the date translation ran into the following error:

> data %>% mutate(y = strptime(x, "%Y-%m-%d"))
Error: wrong result size (11), expected 3 or 1

As I understand it this error is telling us that we are trying to put a value into the data frame which represents 11 rows rather than 3 rows or 1 row.

It turns out that storing POSIXlts in a data frame isn’t such a good idea! In this case we can use the as.character function to create a character vector which can be stored in the data frame:

> data %>% mutate(y = strptime(x, "%Y-%m-%d") %>% as.character())
           x          y
1 2014-01-01 2014-01-01
2 2014-02-01 2014-02-01
3        foo       <NA>

We can then get rid of the NA row by using the is.na function:

> data %>% mutate(y = strptime(x, "%Y-%m-%d") %>% as.character()) %>% filter(!is.na(y))
           x          y
1 2014-01-01 2014-01-01
2 2014-02-01 2014-02-01

And a final tweak so that we have 100% pipelining goodness:

> data %>% 
    mutate(y = x %>% strptime("%Y-%m-%d") %>% as.character()) %>%
    filter(!is.na(y))
           x          y
1 2014-01-01 2014-01-01
2 2014-02-01 2014-02-01
Categories: Programming

R: String to Date or NA

Sun, 12/07/2014 - 20:29

I’ve been trying to clean up a CSV file which contains some rows with dates and some not – I only want to keep the cells which do have dates so I’ve been trying to work out how to do that.

My first thought was that I’d try and find a function which would convert the contents of the cell into a date if it was in date format and NA if not. I could then filter out the NA values using the is.na function.

I started out with the as.Date function…

> as.Date("2014-01-01")
[1] "2014-01-01"
 
> as.Date("foo")
Error in charToDate(x) : 
  character string is not in a standard unambiguous format

…but that throws an error if we have a non date value so it’s not so useful in this case.

Instead we can make use of the strptime function which does exactly what we want:

> strptime("2014-01-01", "%Y-%m-%d")
[1] "2014-01-01 GMT"
 
> strptime("foo", "%Y-%m-%d")
[1] NA

We can then feed those values into is.na..

> strptime("2014-01-01", "%Y-%m-%d") %>% is.na()
[1] FALSE
 
> strptime("foo", "%Y-%m-%d") %>% is.na()
[1] TRUE

…and we have exactly the behaviour we were looking for.

Categories: Programming

R: Applying a function to every row of a data frame

Thu, 12/04/2014 - 07:31

In my continued exploration of London’s meetups I wanted to calculate the distance from meetup venues to a centre point in London.

I’ve created a gist containing the coordinates of some of the venues that host NoSQL meetups in London town if you want to follow along:

library(dplyr)
 
# https://gist.github.com/mneedham/7e926a213bf76febf5ed
venues = read.csv("/tmp/venues.csv")
 
venues %>% head()
##                        venue      lat       lon
## 1              Skills Matter 51.52482 -0.099109
## 2                   Skinkers 51.50492 -0.083870
## 3          Theodore Bullfrog 51.50878 -0.123749
## 4 The Skills Matter eXchange 51.52452 -0.099231
## 5               The Guardian 51.53373 -0.122340
## 6            White Bear Yard 51.52227 -0.109804

Now to do the calculation. I’ve chosen the Centre Point building in Tottenham Court Road as our centre point. We can use the distHaversine function in the geosphere library allows us to do the calculation:

options("scipen"=100, "digits"=4)
library(geosphere)
 
centre = c(-0.129581, 51.516578)
aVenue = venues %>% slice(1)
aVenue
##           venue   lat      lon
## 1 Skills Matter 51.52 -0.09911

Now we can calculate the distance from Skillsmatter to our centre point:

distHaversine(c(aVenue$lon, aVenue$lat), centre)
## [1] 2302

That works pretty well so now we want to apply it to every row in the venues data frame and add an extra column containing that value.

This was my first attempt…

venues %>% mutate(distHaversine(c(lon,lat),centre))
## Error in .pointsToMatrix(p1): Wrong length for a vector, should be 2

…which didn’t work quite as I’d imagined!

I eventually found my way to the by function which allows you to ‘apply a function to a data frame split by factors’. In this case I wouldn’t be grouping rows by a factor – I’d apply the function to each row separately.

I wired everything up like so:

distanceFromCentre = by(venues, 1:nrow(venues), function(row) { distHaversine(c(row$lon, row$lat), centre)  })
 
distanceFromCentre %>% head()
## 1:nrow(venues)
##      1      2      3      4      5      6 
## 2301.6 3422.6  957.5 2280.6 1974.1 1509.5

We can now add the distances to our venues data frame:

venuesWithCentre = venues %>% 
  mutate(distanceFromCentre = by(venues, 1:nrow(venues), function(row) { distHaversine(c(row$lon, row$lat), centre)  }))
 
venuesWithCentre %>% head()
##                        venue   lat      lon distanceFromCentre
## 1              Skills Matter 51.52 -0.09911             2301.6
## 2                   Skinkers 51.50 -0.08387             3422.6
## 3          Theodore Bullfrog 51.51 -0.12375              957.5
## 4 The Skills Matter eXchange 51.52 -0.09923             2280.6
## 5               The Guardian 51.53 -0.12234             1974.1
## 6            White Bear Yard 51.52 -0.10980             1509.5

Et voila!

Categories: Programming

Spark: Write to CSV file with header using saveAsFile

Sun, 11/30/2014 - 09:21

In my last blog post I showed how to write to a single CSV file using Spark and Hadoop and the next thing I wanted to do was add a header row to the resulting row.

Hadoop’s FileUtil#copyMerge function does take a String parameter but it adds this text to the end of each partition file which isn’t quite what we want.

However, if we copy that function into our own FileUtil class we can restructure it to do what we want:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.IOUtils;
import java.io.IOException;
 
public class MyFileUtil {
    public static boolean copyMergeWithHeader(FileSystem srcFS, Path srcDir, FileSystem dstFS, Path dstFile, boolean deleteSource, Configuration conf, String header) throws IOException {
        dstFile = checkDest(srcDir.getName(), dstFS, dstFile, false);
        if(!srcFS.getFileStatus(srcDir).isDir()) {
            return false;
        } else {
            FSDataOutputStream out = dstFS.create(dstFile);
            if(header != null) {
                out.write((header + "\n").getBytes("UTF-8"));
            }
 
            try {
                FileStatus[] contents = srcFS.listStatus(srcDir);
 
                for(int i = 0; i < contents.length; ++i) {
                    if(!contents[i].isDir()) {
                        FSDataInputStream in = srcFS.open(contents[i].getPath());
 
                        try {
                            IOUtils.copyBytes(in, out, conf, false);
 
                        } finally {
                            in.close();
                        }
                    }
                }
            } finally {
                out.close();
            }
 
            return deleteSource?srcFS.delete(srcDir, true):true;
        }
    }
 
    private static Path checkDest(String srcName, FileSystem dstFS, Path dst, boolean overwrite) throws IOException {
        if(dstFS.exists(dst)) {
            FileStatus sdst = dstFS.getFileStatus(dst);
            if(sdst.isDir()) {
                if(null == srcName) {
                    throw new IOException("Target " + dst + " is a directory");
                }
 
                return checkDest((String)null, dstFS, new Path(dst, srcName), overwrite);
            }
 
            if(!overwrite) {
                throw new IOException("Target " + dst + " already exists");
            }
        }
        return dst;
    }
}

We can then update our merge function to call this instead:

def merge(srcPath: String, dstPath: String, header:String): Unit =  {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  MyFileUtil.copyMergeWithHeader(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, header)
}

We call merge from our code like this:

merge(file, destinationFile, "type,count")

I wasn’t sure how to import my Java based class into the Spark shell so I compiled the code into a JAR and submitted it as a job instead:

$ sbt package
[info] Loading global plugins from /Users/markneedham/.sbt/0.13/plugins
[info] Loading project definition from /Users/markneedham/projects/spark-play/playground/project
[info] Set current project to playground (in build file:/Users/markneedham/projects/spark-play/playground/)
[info] Compiling 3 Scala sources to /Users/markneedham/projects/spark-play/playground/target/scala-2.10/classes...
[info] Packaging /Users/markneedham/projects/spark-play/playground/target/scala-2.10/playground_2.10-1.0.jar ...
[info] Done packaging.
[success] Total time: 8 s, completed 30-Nov-2014 08:12:26
 
$ time ./bin/spark-submit --class "WriteToCsvWithHeader" --master local[4] /path/to/playground/target/scala-2.10/playground_2.10-1.0.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.propertie
...
14/11/30 08:16:15 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
14/11/30 08:16:15 INFO SparkContext: Job finished: saveAsTextFile at WriteToCsvWithHeader.scala:49, took 0.589036 s
 
real	0m13.061s
user	0m38.977s
sys	0m3.393s

And if we look at our destination file:

$ cat /tmp/singlePrimaryTypes.csv
type,count
THEFT,859197
BATTERY,757530
NARCOTICS,489528
CRIMINAL DAMAGE,488209
BURGLARY,257310
OTHER OFFENSE,253964
ASSAULT,247386
MOTOR VEHICLE THEFT,197404
ROBBERY,157706
DECEPTIVE PRACTICE,137538
CRIMINAL TRESPASS,124974
PROSTITUTION,47245
WEAPONS VIOLATION,40361
PUBLIC PEACE VIOLATION,31585
OFFENSE INVOLVING CHILDREN,26524
CRIM SEXUAL ASSAULT,14788
SEX OFFENSE,14283
GAMBLING,10632
LIQUOR LAW VIOLATION,8847
ARSON,6443
INTERFERE WITH PUBLIC OFFICER,5178
HOMICIDE,4846
KIDNAPPING,3585
INTERFERENCE WITH PUBLIC OFFICER,3147
INTIMIDATION,2471
STALKING,1985
OFFENSES INVOLVING CHILDREN,355
OBSCENITY,219
PUBLIC INDECENCY,86
OTHER NARCOTIC VIOLATION,80
RITUALISM,12
NON-CRIMINAL,12
OTHER OFFENSE ,6
NON - CRIMINAL,2
NON-CRIMINAL (SUBJECT SPECIFIED),2

Happy days!

The code is available as a gist if you want to see all the details.

Categories: Programming

Spark: Write to CSV file

Sun, 11/30/2014 - 08:40

A couple of weeks ago I wrote how I’d been using Spark to explore a City of Chicago Crime data set and having worked out how many of each crime had been committed I wanted to write that to a CSV file.

Spark provides a saveAsTextFile function which allows us to save RDD’s so I refactored my code into the following format to allow me to use that:

import au.com.bytecode.opencsv.CSVParser
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._
 
def dropHeader(data: RDD[String]): RDD[String] = {
  data.mapPartitionsWithIndex((idx, lines) => {
    if (idx == 0) {
      lines.drop(1)
    }
    lines
  })
}
 
// https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
val crimeFile = "/Users/markneedham/Downloads/Crimes_-_2001_to_present.csv"
 
val crimeData = sc.textFile(crimeFile).cache()
val withoutHeader: RDD[String] = dropHeader(crimeData)
 
val file = "/tmp/primaryTypes.csv"
FileUtil.fullyDelete(new File(file))
 
val partitions: RDD[(String, Int)] = withoutHeader.mapPartitions(lines => {
  val parser = new CSVParser(',')
  lines.map(line => {
    val columns = parser.parseLine(line)
    (columns(5), 1)
  })
})
 
val counts = partitions.
  reduceByKey {case (x,y) => x + y}.
  sortBy {case (key, value) => -value}.
  map { case (key, value) => Array(key, value).mkString(",") }
 
counts.saveAsTextFile(file)

If we run that code from the Spark shell we end up with a folder called /tmp/primaryTypes.csv containing multiple part files:

$ ls -lah /tmp/primaryTypes.csv/
total 496
drwxr-xr-x  66 markneedham  wheel   2.2K 30 Nov 07:17 .
drwxrwxrwt  80 root         wheel   2.7K 30 Nov 07:16 ..
-rw-r--r--   1 markneedham  wheel     8B 30 Nov 07:16 ._SUCCESS.crc
-rw-r--r--   1 markneedham  wheel    12B 30 Nov 07:16 .part-00000.crc
-rw-r--r--   1 markneedham  wheel    12B 30 Nov 07:16 .part-00001.crc
-rw-r--r--   1 markneedham  wheel    12B 30 Nov 07:16 .part-00002.crc
-rw-r--r--   1 markneedham  wheel    12B 30 Nov 07:16 .part-00003.crc
...
-rwxrwxrwx   1 markneedham  wheel     0B 30 Nov 07:16 _SUCCESS
-rwxrwxrwx   1 markneedham  wheel    28B 30 Nov 07:16 part-00000
-rwxrwxrwx   1 markneedham  wheel    17B 30 Nov 07:16 part-00001
-rwxrwxrwx   1 markneedham  wheel    23B 30 Nov 07:16 part-00002
-rwxrwxrwx   1 markneedham  wheel    16B 30 Nov 07:16 part-00003
...

If we look at some of those part files we can see that it’s written the crime types and counts as expected:

$ cat /tmp/primaryTypes.csv/part-00000
THEFT,859197
BATTERY,757530
 
$ cat /tmp/primaryTypes.csv/part-00003
BURGLARY,257310

This is fine if we’re going to pass those CSV files into another Hadoop based job but I actually want a single CSV file so it’s not quite what I want.

One way to achieve this is to force everything to be calculated on one partition which will mean we only get one part file generated:

val counts = partitions.repartition(1).
  reduceByKey {case (x,y) => x + y}.
  sortBy {case (key, value) => -value}.
  map { case (key, value) => Array(key, value).mkString(",") }
 
 
counts.saveAsTextFile(file)

part-00000 now looks like this:

$ cat !$
cat /tmp/primaryTypes.csv/part-00000
THEFT,859197
BATTERY,757530
NARCOTICS,489528
CRIMINAL DAMAGE,488209
BURGLARY,257310
OTHER OFFENSE,253964
ASSAULT,247386
MOTOR VEHICLE THEFT,197404
ROBBERY,157706
DECEPTIVE PRACTICE,137538
CRIMINAL TRESPASS,124974
PROSTITUTION,47245
WEAPONS VIOLATION,40361
PUBLIC PEACE VIOLATION,31585
OFFENSE INVOLVING CHILDREN,26524
CRIM SEXUAL ASSAULT,14788
SEX OFFENSE,14283
GAMBLING,10632
LIQUOR LAW VIOLATION,8847
ARSON,6443
INTERFERE WITH PUBLIC OFFICER,5178
HOMICIDE,4846
KIDNAPPING,3585
INTERFERENCE WITH PUBLIC OFFICER,3147
INTIMIDATION,2471
STALKING,1985
OFFENSES INVOLVING CHILDREN,355
OBSCENITY,219
PUBLIC INDECENCY,86
OTHER NARCOTIC VIOLATION,80
NON-CRIMINAL,12
RITUALISM,12
OTHER OFFENSE ,6
NON - CRIMINAL,2
NON-CRIMINAL (SUBJECT SPECIFIED),2

This works but it’s quite a bit slower than when we were doing the aggregation across partitions so it’s not ideal.

Instead, what we can do is make use of one of Hadoop’s merge functions which squashes part files together into a single file.

First we import Hadoop into our SBT file:

libraryDependencies += "org.apache.hadoop" % "hadoop-hdfs" % "2.5.2"

Now let’s bring our merge function into the Spark shell:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
 
def merge(srcPath: String, dstPath: String): Unit =  {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
}

And now let’s make use of it:

val file = "/tmp/primaryTypes.csv"
FileUtil.fullyDelete(new File(file))
 
val destinationFile= "/tmp/singlePrimaryTypes.csv"
FileUtil.fullyDelete(new File(destinationFile))
 
val counts = partitions.
reduceByKey {case (x,y) => x + y}.
sortBy {case (key, value) => -value}.
map { case (key, value) => Array(key, value).mkString(",") }
 
counts.saveAsTextFile(file)
 
merge(file, destinationFile)

And now we’ve got the best of both worlds:

$ cat /tmp/singlePrimaryTypes.csv
THEFT,859197
BATTERY,757530
NARCOTICS,489528
CRIMINAL DAMAGE,488209
BURGLARY,257310
OTHER OFFENSE,253964
ASSAULT,247386
MOTOR VEHICLE THEFT,197404
ROBBERY,157706
DECEPTIVE PRACTICE,137538
CRIMINAL TRESPASS,124974
PROSTITUTION,47245
WEAPONS VIOLATION,40361
PUBLIC PEACE VIOLATION,31585
OFFENSE INVOLVING CHILDREN,26524
CRIM SEXUAL ASSAULT,14788
SEX OFFENSE,14283
GAMBLING,10632
LIQUOR LAW VIOLATION,8847
ARSON,6443
INTERFERE WITH PUBLIC OFFICER,5178
HOMICIDE,4846
KIDNAPPING,3585
INTERFERENCE WITH PUBLIC OFFICER,3147
INTIMIDATION,2471
STALKING,1985
OFFENSES INVOLVING CHILDREN,355
OBSCENITY,219
PUBLIC INDECENCY,86
OTHER NARCOTIC VIOLATION,80
RITUALISM,12
NON-CRIMINAL,12
OTHER OFFENSE ,6
NON - CRIMINAL,2
NON-CRIMINAL (SUBJECT SPECIFIED),2

The full code is available as a gist if you want to play around with it.

Categories: Programming

Docker/Neo4j: Port forwarding on Mac OS X not working

Thu, 11/27/2014 - 13:28

Prompted by Ognjen Bubalo’s excellent blog post I thought it was about time I tried running Neo4j on a docker container on my Mac Book Pro to make it easier to play around with different data sets.

I got the container up and running by following Ognien’s instructions and had the following ports forwarded to my host machine:

$ docker ps
CONTAINER ID        IMAGE                 COMMAND                CREATED             STATUS              PORTS                                              NAMES
c62f8601e557        tpires/neo4j:latest   "/bin/bash -c /launc   About an hour ago   Up About an hour    0.0.0.0:49153->1337/tcp, 0.0.0.0:49154->7474/tcp   neo4j

This should allow me to access Neo4j on port 49154 but when I tried to access that host:port pair I got a connection refused message:

$ curl -v http://localhost:49154
* Adding handle: conn: 0x7ff369803a00
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x7ff369803a00) send_pipe: 1, recv_pipe: 0
* About to connect() to localhost port 49154 (#0)
*   Trying ::1...
*   Trying 127.0.0.1...
*   Trying fe80::1...
* Failed connect to localhost:49154; Connection refused
* Closing connection 0
curl: (7) Failed connect to localhost:49154; Connection refused

My first thought was the maybe Neo4j hadn’t started up correctly inside the container so I checked the logs:

$ docker logs --tail=10 c62f8601e557
10:59:12.994 [main] INFO  o.e.j.server.handler.ContextHandler - Started o.e.j.w.WebAppContext@2edfbe28{/webadmin,jar:file:/usr/share/neo4j/system/lib/neo4j-server-2.1.5-static-web.jar!/webadmin-html,AVAILABLE}
10:59:13.449 [main] INFO  o.e.j.server.handler.ContextHandler - Started o.e.j.s.ServletContextHandler@192efb4e{/db/manage,null,AVAILABLE}
10:59:13.699 [main] INFO  o.e.j.server.handler.ContextHandler - Started o.e.j.s.ServletContextHandler@7e94c035{/db/data,null,AVAILABLE}
10:59:13.714 [main] INFO  o.e.j.w.StandardDescriptorProcessor - NO JSP Support for /browser, did not find org.apache.jasper.servlet.JspServlet
10:59:13.715 [main] INFO  o.e.j.server.handler.ContextHandler - Started o.e.j.w.WebAppContext@3e84ae71{/browser,jar:file:/usr/share/neo4j/system/lib/neo4j-browser-2.1.5.jar!/browser,AVAILABLE}
10:59:13.807 [main] INFO  o.e.j.server.handler.ContextHandler - Started o.e.j.s.ServletContextHandler@4b6690b1{/,null,AVAILABLE}
10:59:13.819 [main] INFO  o.e.jetty.server.ServerConnector - Started ServerConnector@495350f0{HTTP/1.1}{c62f8601e557:7474}
10:59:13.900 [main] INFO  o.e.jetty.server.ServerConnector - Started ServerConnector@23ad0c5a{SSL-HTTP/1.1}{c62f8601e557:7473}
2014-11-27 10:59:13.901+0000 INFO  [API] Server started on: http://c62f8601e557:7474/
2014-11-27 10:59:13.902+0000 INFO  [API] Remote interface ready and available at [http://c62f8601e557:7474/]

Nope! It’s up and running perfectly fine which suggested the problemw was with port forwarding.

I eventually found my way to Chris Jones’ ‘How to use Docker on OS X: The Missing Guide‘ which explained the problem:

The Problem: Docker forwards ports from the container to the host, which is boot2docker, not OS X.

The Solution: Use the VM’s IP address.

So to access Neo4j on my machine I need to use the VM’s IP address rather than localhost. We can get the VM’s IP address like so:

$ boot2docker ip
 
The VM's Host only interface IP address is: 192.168.59.103

Let’s strip out that surrounding text though:

$ boot2docker ip 2> /dev/null
192.168.59.103

Now if we cURL using that IP instead:

$ curl -v http://192.168.59.103:49154
* About to connect() to 192.168.59.103 port 49154 (#0)
*   Trying 192.168.59.103...
* Adding handle: conn: 0x7fd794003a00
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x7fd794003a00) send_pipe: 1, recv_pipe: 0
* Connected to 192.168.59.103 (192.168.59.103) port 49154 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.30.0
> Host: 192.168.59.103:49154
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=UTF-8
< Access-Control-Allow-Origin: *
< Content-Length: 112
* Server Jetty(9.0.5.v20130815) is not blacklisted
< Server: Jetty(9.0.5.v20130815)
<
{
  "management" : "http://192.168.59.103:49154/db/manage/",
  "data" : "http://192.168.59.103:49154/db/data/"
* Connection #0 to host 192.168.59.103 left intact

Happy days!

Chris has solutions to lots of other common problems people come across when using Docker with Mac OS X so it’s worth having a flick through his post.

Categories: Programming

R: dplyr – Select ‘random’ rows from a data frame

Wed, 11/26/2014 - 01:01

Frequently I find myself wanting to take a sample of the rows in a data frame where just taking the head isn’t enough.

Let’s say we start with the following data frame:

data = data.frame(
    letter = sample(LETTERS, 50000, replace = TRUE),
    number = sample (1:10, 50000, replace = TRUE)
    )

And we’d like to sample 10 rows to see what it contains. We’ll start by generating 10 random numbers to represent row numbers using the runif function:

> randomRows = sample(1:length(data[,1]), 10, replace=T)
> randomRows
 [1]  8723 18772  4964 36134 27467 31890 16313 12841 49214 15621

We can then pass that list of row numbers into dplyr’s slice function like so:

> data %>% slice(randomRows)
   letter number
1       Z      4
2       F      1
3       Y      6
4       R      6
5       Y      4
6       V     10
7       R      6
8       D      6
9       J      7
10      E      2

If we’re using that code throughout our code then we might want to pull out a function like so:

pickRandomRows = function(df, numberOfRows = 10) {
  df %>% slice(runif(numberOfRows,0, length(df[,1])))
}

And then call it like so:

> data %>% pickRandomRows()
   letter number
1       W      5
2       Y      3
3       E      6
4       Q      8
5       M      9
6       H      9
7       E     10
8       T      2
9       I      5
10      V      4
 
> data %>% pickRandomRows(7)
  letter number
1      V      7
2      N      4
3      W      1
4      N      8
5      G      7
6      V      1
7      N      7
Update

Antonios pointed out via email that we could just make use of the in-built sample_n function which I didn’t know about until now:

> data %>% sample_n(10)
      letter number
29771      U      1
48666      T     10
30635      A      1
34865      X      7
20140      A      3
41715      T     10
43786      E     10
18284      A      7
21406      S      8
35542      J      8
Categories: Programming

R: dplyr – “Variables not shown”

Sun, 11/23/2014 - 02:02

I recently ran into a problem where the result of applying some operations to a data frame wasn’t being output the way I wanted.

I started with this data frame:

words = function(numberOfWords, lengthOfWord) {
  w = c(1:numberOfWords)  
  for(i in 1:numberOfWords) {
    w[i] = paste(sample(letters, lengthOfWord, replace=TRUE), collapse = "")
  }
  w
}
 
numberOfRows = 100
df = data.frame(a = sample (1:numberOfRows, 10, replace = TRUE),
                b = sample (1:numberOfRows, 10, replace = TRUE),
                name = words(numberOfRows, 10))

I wanted to group the data frame by a and b and output a comma separated list of the associated names. I started with this:

> df %>% 
    group_by(a,b) %>%
    summarise(n = n(), words = paste(name, collapse = ",")) %>%
    arrange(desc(n)) %>%
    head(5)
 
Source: local data frame [5 x 4]
Groups: a
 
   a  b  n
1 19 90 10
2 24 36 10
3 29 20 10
4 29 80 10
5 62 54 10
Variables not shown: words (chr)

Unfortunately the words column has been excluded and I came across this Stack Overflow post which suggested that the print.tbl_df function was the one responsible for filtering columns.

Browsing the docs I found a couple of ways to overwrite this behaviour:

> df %>% 
    group_by(a,b) %>%
    summarise(n = n(), words = paste(name, collapse = ",")) %>%
    arrange(desc(n)) %>%
    head(5) %>%
    print(width = Inf)

or

> options(dplyr.width = Inf)
> df %>% 
    group_by(a,b) %>%
    summarise(n = n(), words = paste(name, collapse = ",")) %>%
    arrange(desc(n)) %>%
    head(5)

And now we see this output instead:

Source: local data frame [5 x 4]
Groups: a
 
   a  b  n                                                                                                         words
1 19 90 10 dfhtcgymxt,zpemxbpnri,rfmkksuavp,jxaarxzdzd,peydpxjizc,trdzchaxiy,arthnxbaeg,kjbpdvvghm,kpvsddlsua,xmysfcynxw
2 24 36 10 wtokzdfecx,eprsvpsdcp,kzgxtwnqli,jbyuicevrn,klriuenjzu,qzgtmkljoy,bonbhmqfaz,uauoybprrl,rzummfbkbx,icyeorwzxl
3 29 20 10 ebubytlosp,vtligdgvqw,ejlqonhuit,jwidjvtark,kmdzcalblg,qzrlewxcsr,eckfgjnkys,vfdaeqbfqi,rumblliqmn,fvezcdfiaz
4 29 80 10 wputpwgayx,lpawiyhzuh,ufykwguynu,nyqnwjallh,abaxicpixl,uirudflazn,wyynsikwcl,usescualww,bkvsowfaab,gfhyifzepx
5 62 54 10 beuegfzssp,gfmegjtrys,wkubhvnkkk,rkhgprxttb,cwsrzulnpo,hzkvjbiywc,gbmiupnlbw,gffovxwtok,uxadfrjvdn,aojjfhxygs

Much better!

Categories: Programming

R: ggmap – Overlay shapefile with filled polygon of regions

Mon, 11/17/2014 - 01:53

I’ve been playing around with plotting maps in R over the last week and got to the point where I wanted to have a google map in the background with a filled polygon on a shapefile in the foreground.

The first bit is reasonably simple – we can just import the ggmap library and make a call to get_map:

> library(ggmap)
> sfMap = map = get_map(location = 'San Francisco', zoom = 12)
2014 11 17 00 27 11

Next I wanted to show the outlines of the different San Francisco zip codes and came across a blog post by Paul Bidanset on Baltimore neighbourhoods which I was able to adapt.

I downloaded a shapefile of San Francisco’s zip codes from the DataSF website and then loaded it into R using the readOGR and spTransform functions from the rgdal package:

> library(rgdal)
> library(ggplot2)
> sfn = readOGR(".","sfzipcodes") %>% spTransform(CRS("+proj=longlat +datum=WGS84"))
> ggplot(data = sfn, aes(x = long, y = lat, group = group)) + geom_path()
2014 11 17 00 38 32

sfn is a spatial type of data frame…

> class(sfn)
[1] "SpatialPolygonsDataFrame"
attr(,"package")
[1] "sp"

…but we need a normal data frame to be able to easily merge other data onto the map and then plot it. We can use ggplot2’s fortify command to do this:

> names(sfn)
[1] "OBJECTID" "ZIP_CODE" "ID"   
 
> sfn.f = sfn %>% fortify(region = 'ZIP_CODE')
 
SFNeighbourhoods  = merge(sfn.f, sfn@data, by.x = 'id', by.y = 'ZIP_CODE')

I then made up some fake values for each zip code so that we could have different colour shadings for each zip code on the visualisation:

> library(dplyr) 
 
> postcodes = SFNeighbourhoods %>% select(id) %>% distinct()
 
> values = data.frame(id = c(postcodes),
                      value = c(runif(postcodes %>% count() %>% unlist(),5.0, 25.0)))

I then merged those values onto SFNeighbourhoods:

> sf = merge(SFNeighbourhoods, values, by.x='id')
 
> sf %>% group_by(id) %>% do(head(., 1)) %>% head(10)
Source: local data frame [10 x 10]
Groups: id
 
      id      long      lat order  hole piece   group OBJECTID    ID     value
1  94102 -122.4193 37.77515     1 FALSE     1 94102.1       14 94102  6.184814
2  94103 -122.4039 37.77006   106 FALSE     1 94103.1       12 94103 21.659752
3  94104 -122.4001 37.79030   255 FALSE     1 94104.1       10 94104  5.173199
4  94105 -122.3925 37.79377   293 FALSE     1 94105.1        2 94105 15.723456
5  94107 -122.4012 37.78202   504 FALSE     1 94107.1        1 94107  8.402726
6  94108 -122.4042 37.79169  2232 FALSE     1 94108.1       11 94108  8.632652
7  94109 -122.4139 37.79046  2304 FALSE     1 94109.1        8 94109 20.129402
8  94110 -122.4217 37.73181  2794 FALSE     1 94110.1       16 94110 12.410610
9  94111 -122.4001 37.79369  3067 FALSE     1 94111.1        9 94111 10.185054
10 94112 -122.4278 37.73469  3334 FALSE     1 94112.1       18 94112 24.297588

Now we can easily plot those colours onto our shapefile by calling geom_polgon instead of geom_path:

> ggplot(sf, aes(long, lat, group = group)) + 
    geom_polygon(aes(fill = value))

2014 11 17 00 49 11

And finally let’s wire it up to our google map:

> ggmap(sfMap) + 
    geom_polygon(aes(fill = value, x = long, y = lat, group = group), 
                 data = sf,
                 alpha = 0.8, 
                 color = "black",
                 size = 0.2)
2014 11 17 00 50 13

I spent way too long with the alpha value set to ‘0’ on this last plot wondering why I wasn’t seeing any shading so don’t make that mistake!

Categories: Programming

Spark: Parse CSV file and group by column value

Sun, 11/16/2014 - 23:53

I’ve found myself working with large CSV files quite frequently and realising that my existing toolset didn’t let me explore them quickly I thought I’d spend a bit of time looking at Spark to see if it could help.

I’m working with a crime data set released by the City of Chicago: it’s 1GB in size and contains details of 4 million crimes:

$ ls -alh ~/Downloads/Crimes_-_2001_to_present.csv
-rw-r--r--@ 1 markneedham  staff   1.0G 16 Nov 12:14 /Users/markneedham/Downloads/Crimes_-_2001_to_present.csv
 
$ wc -l ~/Downloads/Crimes_-_2001_to_present.csv
 4193441 /Users/markneedham/Downloads/Crimes_-_2001_to_present.csv

We can get a rough idea of the contents of the file by looking at the first row along with the header:

$ head -n 2 ~/Downloads/Crimes_-_2001_to_present.csv
ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
9464711,HX114160,01/14/2014 05:00:00 AM,028XX E 80TH ST,0560,ASSAULT,SIMPLE,APARTMENT,false,true,0422,004,7,46,08A,1196652,1852516,2014,01/20/2014 12:40:05 AM,41.75017626412204,-87.55494559131228,"(41.75017626412204, -87.55494559131228)"

I wanted to do a count of the ‘Primary Type’ column to see how many of each crime we have. Using just Unix command line tools this is how we’d do that:

$ time tail +2 ~/Downloads/Crimes_-_2001_to_present.csv | cut -d, -f6  | sort | uniq -c | sort -rn
859197 THEFT
757530 BATTERY
489528 NARCOTICS
488209 CRIMINAL DAMAGE
257310 BURGLARY
253964 OTHER OFFENSE
247386 ASSAULT
197404 MOTOR VEHICLE THEFT
157706 ROBBERY
137538 DECEPTIVE PRACTICE
124974 CRIMINAL TRESPASS
47245 PROSTITUTION
40361 WEAPONS VIOLATION
31585 PUBLIC PEACE VIOLATION
26524 OFFENSE INVOLVING CHILDREN
14788 CRIM SEXUAL ASSAULT
14283 SEX OFFENSE
10632 GAMBLING
8847 LIQUOR LAW VIOLATION
6443 ARSON
5178 INTERFERE WITH PUBLIC OFFICER
4846 HOMICIDE
3585 KIDNAPPING
3147 INTERFERENCE WITH PUBLIC OFFICER
2471 INTIMIDATION
1985 STALKING
 355 OFFENSES INVOLVING CHILDREN
 219 OBSCENITY
  86 PUBLIC INDECENCY
  80 OTHER NARCOTIC VIOLATION
  12 RITUALISM
  12 NON-CRIMINAL
   6 OTHER OFFENSE
   2 NON-CRIMINAL (SUBJECT SPECIFIED)
   2 NON - CRIMINAL
 
real	2m37.495s
user	3m0.337s
sys	0m1.471s

This isn’t too bad but it seems like the type of calculation that Spark is made for so I had a look at how I could go about doing that. To start with I created an SBT project with the following build file:

name := "playground"
 
version := "1.0"
 
scalaVersion := "2.10.4"
 
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"
 
libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.3"
 
ideaExcludeFolders += ".idea"
 
ideaExcludeFolders += ".idea_modules"

I downloaded Spark and after unpacking it launched the Spark shell:

$ pwd
/Users/markneedham/projects/spark-play/spark-1.1.0/spark-1.1.0-bin-hadoop1
 
$ ./bin/spark-shell
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/
 
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
...
Spark context available as sc.
 
scala>

I first import some classes I’m going to need:

scala> import au.com.bytecode.opencsv.CSVParser
import au.com.bytecode.opencsv.CSVParser
 
scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD

Now, following the quick start example, we’ll create a Resilient Distributed Dataset (RDD) from our Crime CSV file:

scala> val crimeFile = "/Users/markneedham/Downloads/Crimes_-_2001_to_present.csv"
crimeFile: String = /Users/markneedham/Downloads/Crimes_-_2001_to_present.csv
 
scala> val crimeData = sc.textFile(crimeFile).cache()
14/11/16 22:31:16 INFO MemoryStore: ensureFreeSpace(32768) called with curMem=0, maxMem=278302556
14/11/16 22:31:16 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 32.0 KB, free 265.4 MB)
crimeData: org.apache.spark.rdd.RDD[String] = /Users/markneedham/Downloads/Crimes_-_2001_to_present.csv MappedRDD[1] at textFile at <console>:17

Our next step is to process each line of the file using our CSV Parser. A simple way to do this would be to create a new CSVParser for each line:

scala> crimeData.map(line => {
         val parser = new CSVParser(',')
         parser.parseLine(line).mkString(",")
       }).take(5).foreach(println)
14/11/16 22:35:49 INFO SparkContext: Starting job: take at <console>:23
...
4/11/16 22:35:49 INFO SparkContext: Job finished: take at <console>:23, took 0.013904 s
ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
9464711,HX114160,01/14/2014 05:00:00 AM,028XX E 80TH ST,0560,ASSAULT,SIMPLE,APARTMENT,false,true,0422,004,7,46,08A,1196652,1852516,2014,01/20/2014 12:40:05 AM,41.75017626412204,-87.55494559131228,(41.75017626412204, -87.55494559131228)
9460704,HX113741,01/14/2014 04:55:00 AM,091XX S JEFFERY AVE,031A,ROBBERY,ARMED: HANDGUN,SIDEWALK,false,false,0413,004,8,48,03,1191060,1844959,2014,01/18/2014 12:39:56 AM,41.729576153145636,-87.57568059471686,(41.729576153145636, -87.57568059471686)
9460339,HX113740,01/14/2014 04:44:00 AM,040XX W MAYPOLE AVE,1310,CRIMINAL DAMAGE,TO PROPERTY,RESIDENCE,false,true,1114,011,28,26,14,1149075,1901099,2014,01/16/2014 12:40:00 AM,41.884543798701515,-87.72803579358926,(41.884543798701515, -87.72803579358926)
9461467,HX114463,01/14/2014 04:43:00 AM,059XX S CICERO AVE,0820,THEFT,$500 AND UNDER,PARKING LOT/GARAGE(NON.RESID.),false,false,0813,008,13,64,06,1145661,1865031,2014,01/16/2014 12:40:00 AM,41.785633535413176,-87.74148516669783,(41.785633535413176, -87.74148516669783)

That works but it’s a bit wasteful to create a new CSVParser each time so instead let’s just create one for each partition that Spark splits our file up into:

scala> crimeData.mapPartitions(lines => {
         val parser = new CSVParser(',')
         lines.map(line => {
           parser.parseLine(line).mkString(",")
         })
       }).take(5).foreach(println)
14/11/16 22:38:44 INFO SparkContext: Starting job: take at <console>:25
...
14/11/16 22:38:44 INFO SparkContext: Job finished: take at <console>:25, took 0.015216 s
ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
9464711,HX114160,01/14/2014 05:00:00 AM,028XX E 80TH ST,0560,ASSAULT,SIMPLE,APARTMENT,false,true,0422,004,7,46,08A,1196652,1852516,2014,01/20/2014 12:40:05 AM,41.75017626412204,-87.55494559131228,(41.75017626412204, -87.55494559131228)
9460704,HX113741,01/14/2014 04:55:00 AM,091XX S JEFFERY AVE,031A,ROBBERY,ARMED: HANDGUN,SIDEWALK,false,false,0413,004,8,48,03,1191060,1844959,2014,01/18/2014 12:39:56 AM,41.729576153145636,-87.57568059471686,(41.729576153145636, -87.57568059471686)
9460339,HX113740,01/14/2014 04:44:00 AM,040XX W MAYPOLE AVE,1310,CRIMINAL DAMAGE,TO PROPERTY,RESIDENCE,false,true,1114,011,28,26,14,1149075,1901099,2014,01/16/2014 12:40:00 AM,41.884543798701515,-87.72803579358926,(41.884543798701515, -87.72803579358926)
9461467,HX114463,01/14/2014 04:43:00 AM,059XX S CICERO AVE,0820,THEFT,$500 AND UNDER,PARKING LOT/GARAGE(NON.RESID.),false,false,0813,008,13,64,06,1145661,1865031,2014,01/16/2014 12:40:00 AM,41.785633535413176,-87.74148516669783,(41.785633535413176, -87.74148516669783)

You’ll notice that we’ve still got the header being printed which isn’t ideal – let’s get rid of it!

I expected there to be a ‘drop’ function which would allow me to do that but in fact there isn’t. Instead we can make use of our knowledge that the first partition will contain the first line and strip it out that way:

scala> def dropHeader(data: RDD[String]): RDD[String] = {
         data.mapPartitionsWithIndex((idx, lines) => {
           if (idx == 0) {
             lines.drop(1)
           }
           lines
         })
       }
dropHeader: (data: org.apache.spark.rdd.RDD[String])org.apache.spark.rdd.RDD[String]

Now let’s grab the first 5 lines again and print them out:

scala> val withoutHeader: RDD[String] = dropHeader(crimeData)
withoutHeader: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at mapPartitionsWithIndex at <console>:14
 
scala> withoutHeader.mapPartitions(lines => {
         val parser = new CSVParser(',')
         lines.map(line => {
           parser.parseLine(line).mkString(",")
         })
       }).take(5).foreach(println)
14/11/16 22:43:27 INFO SparkContext: Starting job: take at <console>:29
...
14/11/16 22:43:27 INFO SparkContext: Job finished: take at <console>:29, took 0.018557 s
9464711,HX114160,01/14/2014 05:00:00 AM,028XX E 80TH ST,0560,ASSAULT,SIMPLE,APARTMENT,false,true,0422,004,7,46,08A,1196652,1852516,2014,01/20/2014 12:40:05 AM,41.75017626412204,-87.55494559131228,(41.75017626412204, -87.55494559131228)
9460704,HX113741,01/14/2014 04:55:00 AM,091XX S JEFFERY AVE,031A,ROBBERY,ARMED: HANDGUN,SIDEWALK,false,false,0413,004,8,48,03,1191060,1844959,2014,01/18/2014 12:39:56 AM,41.729576153145636,-87.57568059471686,(41.729576153145636, -87.57568059471686)
9460339,HX113740,01/14/2014 04:44:00 AM,040XX W MAYPOLE AVE,1310,CRIMINAL DAMAGE,TO PROPERTY,RESIDENCE,false,true,1114,011,28,26,14,1149075,1901099,2014,01/16/2014 12:40:00 AM,41.884543798701515,-87.72803579358926,(41.884543798701515, -87.72803579358926)
9461467,HX114463,01/14/2014 04:43:00 AM,059XX S CICERO AVE,0820,THEFT,$500 AND UNDER,PARKING LOT/GARAGE(NON.RESID.),false,false,0813,008,13,64,06,1145661,1865031,2014,01/16/2014 12:40:00 AM,41.785633535413176,-87.74148516669783,(41.785633535413176, -87.74148516669783)
9460355,HX113738,01/14/2014 04:21:00 AM,070XX S PEORIA ST,0820,THEFT,$500 AND UNDER,STREET,true,false,0733,007,17,68,06,1171480,1858195,2014,01/16/2014 12:40:00 AM,41.766348042591375,-87.64702037047671,(41.766348042591375, -87.64702037047671)

We’re finally in good shape to extract the values from the ‘Primary Type’ column and count how many times each of those appears in our data set:

scala> withoutHeader.mapPartitions(lines => {
         val parser=new CSVParser(',')
         lines.map(line => {
           val columns = parser.parseLine(line)
           Array(columns(5)).mkString(",")
         })
       }).countByValue().toList.sortBy(-_._2).foreach(println)
14/11/16 22:45:20 INFO SparkContext: Starting job: countByValue at <console>:30
14/11/16 22:45:20 INFO DAGScheduler: Got job 7 (countByValue at <console>:30) with 32 output partitions (allowLocal=false)
...
14/11/16 22:45:30 INFO SparkContext: Job finished: countByValue at <console>:30, took 9.796565 s
(THEFT,859197)
(BATTERY,757530)
(NARCOTICS,489528)
(CRIMINAL DAMAGE,488209)
(BURGLARY,257310)
(OTHER OFFENSE,253964)
(ASSAULT,247386)
(MOTOR VEHICLE THEFT,197404)
(ROBBERY,157706)
(DECEPTIVE PRACTICE,137538)
(CRIMINAL TRESPASS,124974)
(PROSTITUTION,47245)
(WEAPONS VIOLATION,40361)
(PUBLIC PEACE VIOLATION,31585)
(OFFENSE INVOLVING CHILDREN,26524)
(CRIM SEXUAL ASSAULT,14788)
(SEX OFFENSE,14283)
(GAMBLING,10632)
(LIQUOR LAW VIOLATION,8847)
(ARSON,6443)
(INTERFERE WITH PUBLIC OFFICER,5178)
(HOMICIDE,4846)
(KIDNAPPING,3585)
(INTERFERENCE WITH PUBLIC OFFICER,3147)
(INTIMIDATION,2471)
(STALKING,1985)
(OFFENSES INVOLVING CHILDREN,355)
(OBSCENITY,219)
(PUBLIC INDECENCY,86)
(OTHER NARCOTIC VIOLATION,80)
(NON-CRIMINAL,12)
(RITUALISM,12)
(OTHER OFFENSE ,6)
(NON-CRIMINAL (SUBJECT SPECIFIED),2)
(NON - CRIMINAL,2)

We get the same results as with the Unix commands except it took less than 10 seconds to calculate which is pretty cool!

Categories: Programming

R: dplyr – Sum for group_by multiple columns

Tue, 11/11/2014 - 01:17

Over the weekend I was playing around with dplyr and had the following data frame grouped by both columns:

> library(dplyr)
 
> data = data.frame(
    letter = sample(LETTERS, 50000, replace = TRUE),
    number = sample (1:10, 50000, replace = TRUE)
    )
 
> data %>% count(letter, number) %>% head(20)
Source: local data frame [20 x 3]
Groups: letter
 
   letter number   n
1       A      1 184
2       A      2 192
3       A      3 183
4       A      4 193
5       A      5 214
6       A      6 172
7       A      7 196
8       A      8 198
9       A      9 174
10      A     10 196
11      B      1 212
12      B      2 198
13      B      3 194
14      B      4 181
15      B      5 203
16      B      6 234
17      B      7 221
18      B      8 179
19      B      9 182
20      B     10 170

I wanted to add an extra column which would show what percentage of the values for that letter each number had.

If we wrote that code standalone we’d have the following:

> data %>% count(letter)
Source: local data frame [26 x 2]
 
   letter    n
1       A 1902
2       B 1974
3       C 1911
4       D 1948
5       E 1888
6       F 1975
7       G 1914
8       H 1886
9       I 1910
10      J 1924
11      K 1974
12      L 1891
13      M 1894
14      N 1946
15      O 1956
16      P 1934
17      Q 1865
18      R 1992
19      S 1946
20      T 1854
21      U 1919
22      V 1913
23      W 1928
24      X 1934
25      Y 1908
26      Z 1914

We can also get that value by summing up the individual (letter, number) pairs from the previous data frame. The ungroup function allows us to do so:

> data %>% count(letter, number) %>% ungroup %>% group_by(letter) %>% mutate(sum.n = sum(n)) %>% head(20)
Source: local data frame [20 x 4]
Groups: letter
 
   letter number   n sum.n
1       A      1 184  1902
2       A      2 192  1902
3       A      3 183  1902
4       A      4 193  1902
5       A      5 214  1902
6       A      6 172  1902
7       A      7 196  1902
8       A      8 198  1902
9       A      9 174  1902
10      A     10 196  1902
11      B      1 212  1974
12      B      2 198  1974
13      B      3 194  1974
14      B      4 181  1974
15      B      5 203  1974
16      B      6 234  1974
17      B      7 221  1974
18      B      8 179  1974
19      B      9 182  1974
20      B     10 170  1974

Pretty neat!

Categories: Programming

R: dplyr – Maximum value row in each group

Mon, 11/10/2014 - 23:06

In my continued work with R’s dplyr I wanted to be able to group a data frame by some columns and then find the maximum value for each group.

We’ll continue with my favourite dummy data set:

> library(dplyr)
 
> data = data.frame(
    letter = sample(LETTERS, 50000, replace = TRUE),
    number = sample (1:10, 50000, replace = TRUE)
    )

I started with the following code to count how many occurrences of each (letter, number) pair there were:

> data %>% count(letter, number)
Source: local data frame [260 x 3]
Groups: letter
 
   letter number   n
1       A      1 184
2       A      2 192
3       A      3 183
4       A      4 193
5       A      5 214
6       A      6 172
7       A      7 196
8       A      8 198
9       A      9 174
10      A     10 196
..    ...    ... ...

I wanted to find the top number for each letter and with a bit of help from Stack Overflow I ended up with the following:

> data %>% count(letter, number) %>% filter(n == max(n))
Source: local data frame [26 x 3]
Groups: letter
 
   letter number   n
1       A      5 214
2       B      6 234
3       C      8 213
4       D      6 211
5       E      9 208
6       F      2 219
7       G      1 213
8       H      2 208
9       I      9 220
10      J      7 213
11      K      3 228
12      L      2 206
13      M      3 212
14      N      4 222
15      O      1 211
16      P      7 211
17      Q      9 210
18      R      5 224
19      S      2 211
20      T      9 204
21      U      8 217
22      V      6 220
23      W      2 213
24      X      2 214
25      Y      3 211
26      Z      3 206

Here we’re filtering over a collection of rows grouped by letter and only selecting the row which has the max value. We can see more clearly what’s happening if we filter the data frame to only contain one letter:

> letterA = data %>% filter(letter == "A") %>% count(letter, number)
> letterA
Source: local data frame [10 x 3]
Groups: letter
 
   letter number   n
1       A      1 184
2       A      2 192
3       A      3 183
4       A      4 193
5       A      5 214
6       A      6 172
7       A      7 196
8       A      8 198
9       A      9 174
10      A     10 196

And now let’s apply the filter command while also printing out information about each row:

> letterA %>% filter(print(n), print(n == max(n)), n == max(n))
 [1] 184 192 183 193 214 172 196 198 174 196
 [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
Source: local data frame [1 x 3]
Groups: letter
 
  letter number   n
1      A      5 214

If instead of getting the top number by letter we wanted to get the top letter by number we just need to reorder the field names in the count method:

> data %>% count(number, letter) %>% filter(n == max(n))
Source: local data frame [10 x 3]
Groups: number
 
   number letter   n
1       1      G 213
2       2      F 219
3       3      K 228
4       4      N 222
5       5      R 224
6       6      B 234
7       7      B 221
8       8      U 217
9       9      I 220
10     10      O 209

And if we want to see the letter that shows up least frequently we can change the call to ‘max’ to an equivalent one to ‘min':

> data %>% count(number, letter) %>% filter(n == min(n))
Source: local data frame [11 x 3]
Groups: number
 
   number letter   n
1       1      H 166
2       2      O 160
3       3      E 156
4       4      R 169
5       5      L 169
6       6      I 164
7       7      H 170
8       7      I 170
9       8      Q 166
10      9      W 162
11     10      J 168
Categories: Programming

R: dplyr – Ordering by count after multiple column group_by

Sun, 11/09/2014 - 10:30

I was recently trying to group a data frame by two columns and then sort by the count using dplyr but it wasn’t sorting in the way I expecting which was initially very confusing.

I started with this data frame:

library(dplyr)
 
data = data.frame(
  letter = sample(LETTERS, 50000, replace = TRUE),
  number = sample (1:10, 50000, replace = TRUE)
  )

And I wanted to find out how many occurrences of each (letter, number) pair exist in the data set. I started with the following code:

> data %>% count(letter, number, sort = TRUE)
Source: local data frame [260 x 3]
Groups: letter
 
   letter number   n
1       A      4 205
2       A      9 201
3       A      3 197
4       A      1 195
5       A     10 191
6       A      2 189
7       A      8 184
8       A      7 183
9       A      5 181
10      A      6 173
..    ...    ... ...

As you can see it’s only showing A’s which is interesting as I wouldn’t expect there to be a bias towards that letter. Let’s filter out the A’s:

> data %>% filter(letter != "A") %>% count(letter, number, sort = TRUE)
Source: local data frame [250 x 3]
Groups: letter
 
   letter number   n
1       B      8 222
2       B      9 212
3       B      5 207
4       B      6 201
5       B     10 200
6       B      7 192
7       B      2 189
8       B      3 189
9       B      1 187
10      B      4 181
..    ...    ... ...

Now all we see are B’s and we can see that both (B,8) and (B,9) have a higher ‘n’ value than any of the A’s.

I put the code back into the more verbose form to see if it was the count function that behaved unexpectedly:

> data %>% group_by(letter, number) %>% summarise(n = n()) %>% arrange(desc(n))
Source: local data frame [260 x 3]
Groups: letter
 
   letter number   n
1       A      4 205
2       A      9 201
3       A      3 197
4       A      1 195
5       A     10 191
6       A      2 189
7       A      8 184
8       A      7 183
9       A      5 181
10      A      6 173
..    ...    ... ...

Nope, still the same behaviour.

At this point I vaguely remembered there being a function called ungroup which I hadn’t used and wondered if now was the time.

> data %>% group_by(letter, number) %>% summarise(n = n()) %>% ungroup() %>% arrange(desc(n))
Source: local data frame [260 x 3]
 
   letter number   n
1       L      2 236
2       V      1 231
3       Y      8 226
4       J      4 225
5       J     10 223
6       Q      7 223
7       B      8 222
8       O      9 222
9       Q     10 221
10      Z      9 221
..    ...    ... ...

Indeed it was and now we can go back to our original version of the code using count and handle the sorting afterwards:

> data %>% count(letter, number) %>% ungroup() %>% arrange(desc(n))
Source: local data frame [260 x 3]
 
   letter number   n
1       L      2 236
2       V      1 231
3       Y      8 226
4       J      4 225
5       J     10 223
6       Q      7 223
7       B      8 222
8       O      9 222
9       Q     10 221
10      Z      9 221
..    ...    ... ...
Categories: Programming

R: Refactoring to dplyr

Sun, 11/09/2014 - 01:11

I’ve been looking back over some of the early code I wrote using R before I knew about the dplyr library and thought it’d be an interesting exercise to refactor some of the snippets.

We’ll use the following data frame for each of the examples:

library(dplyr)
 
data = data.frame(
  letter = sample(LETTERS, 50000, replace = TRUE),
  number = sample (1:10, 50000, replace = TRUE)
  )
Take {n} rows
> data[1:5,]
  letter number
1      R      7
2      Q      3
3      B      8
4      R      3
5      U      2

becomes:

> data %>% head(5)
  letter number
1      R      7
2      Q      3
3      B      8
4      R      3
5      U      2
Order by numeric value descending
> data[order(-(data$number)),][1:5,]
   letter number
14      H     10
17      G     10
63      L     10
66      W     10
73      R     10

becomes:

> data %>% arrange(desc(number)) %>% head(5)
  letter number
1      H     10
2      G     10
3      L     10
4      W     10
5      R     10
Count number of items
> length(data[,1])
[1] 50000

becomes:

> data %>% count()
Source: local data frame [1 x 1]
 
      n
1 50000
Filter by column value
> length(subset(data, number == 1)[, 1])
[1] 4928

becomes:

> data %>% filter(number == 1) %>% count()
Source: local data frame [1 x 1]
 
     n
1 4928
Group by variable and count
> aggregate(data, by= list(data$number), function(x) length(x))
   Group.1 letter number
1        1   4928   4928
2        2   5045   5045
3        3   5064   5064
4        4   4823   4823
5        5   5032   5032
6        6   5163   5163
7        7   4945   4945
8        8   5077   5077
9        9   5025   5025
10      10   4898   4898

becomes:

> data %>% count(number)
Source: local data frame [10 x 2]
 
   number    n
1       1 4928
2       2 5045
3       3 5064
4       4 4823
5       5 5032
6       6 5163
7       7 4945
8       8 5077
9       9 5025
10     10 4898
Select a range of rows
> data[4:5,]
  letter number
4      R      3
5      U      2

becomes:

> data %>% slice(4:5)
  letter number
1      R      3
2      U      2

There’s certainly more code in some of the dplyr examples but I find it easier to remember how the dplyr code works when I come back to it and hence tend to favour that approach.

Categories: Programming

R: dplyr – Group by field dynamically (‘regroup’ is deprecated / no applicable method for ‘as.lazy’ applied to an object of class “list” )

Sat, 11/08/2014 - 23:29

A few months ago I wrote a blog explaining how to dynamically/programatically group a data frame by a field using dplyr but that approach has been deprecated in the latest version.

To recap, the original function looked like this:

library(dplyr)
 
groupBy = function(df, field) {
  df %.% regroup(list(field)) %.% summarise(n = n())
}

And if we execute that with a sample data frame we’ll see the following:

> data = data.frame(
      letter = sample(LETTERS, 50000, replace = TRUE),
      number = sample (1:10, 50000, replace = TRUE)
  )
 
> groupBy(data, 'letter') %>% head(5)
Source: local data frame [5 x 2]
 
  letter    n
1      A 1951
2      B 1903
3      C 1954
4      D 1923
5      E 1886
Warning messages:
1: %.% is deprecated. Please use %>% 
2: %.% is deprecated. Please use %>% 
3: 'regroup' is deprecated.
Use 'group_by_' instead.
See help("Deprecated")

I replaced each of the deprecated operators and ended up with this function:

groupBy = function(df, field) {
  df %>% group_by_(list(field)) %>% summarise(n = n())
}

Now if we run that:

> groupBy(data, 'letter') %>% head(5)
Error in UseMethod("as.lazy") : 
  no applicable method for 'as.lazy' applied to an object of class "list"

It turns out the ‘group_by_’ function doesn’t want to receive a list of fields so let’s remove the call to list:

groupBy = function(df, field) {
  df %>% group_by_(field) %>% summarise(n = n())
}

And now if we run that:

> groupBy(data, 'letter') %>% head(5)
Source: local data frame [5 x 2]
 
  letter    n
1      A 1951
2      B 1903
3      C 1954
4      D 1923
5      E 1886

Good times! We get the correct result and no deprecation messages.

If we want to group by multiple fields we can just pass in the field names like so:

groupBy = function(df, field1, field2) {
  df %>% group_by_(field1, field2) %>% summarise(n = n())
}
> groupBy(data, 'letter', 'number') %>% head(5)
Source: local data frame [5 x 3]
Groups: letter
 
  letter number   n
1      A      1 200
2      A      2 218
3      A      3 205
4      A      4 176
5      A      5 203

Or with this simpler version:

groupBy = function(df, ...) {
  df %>% group_by_(...) %>% summarise(n = n())
}
> groupBy(data, 'letter', 'number') %>% head(5)
Source: local data frame [5 x 3]
Groups: letter
 
  letter number   n
1      A      1 200
2      A      2 218
3      A      3 205
4      A      4 176
5      A      5 203

I realised that we can actually just use the group_by itself and pass in the field names without quotes, something I couldn’t get to work in earlier versions:

groupBy = function(df, ...) {
  df %>% group_by(...) %>% summarise(n = n())
}
> groupBy(data, letter, number) %>% head(5)
Source: local data frame [5 x 3]
Groups: letter
 
  letter number   n
1      A      1 200
2      A      2 218
3      A      3 205
4      A      4 176
5      A      5 203

We could even get a bit of pipelining going on if we fancied it:

> data %>% groupBy(letter, number) %>% head(5)
Source: local data frame [5 x 3]
Groups: letter
 
  letter number   n
1      A      1 200
2      A      2 218
3      A      3 205
4      A      4 176
5      A      5 203

And as of dplyr 0.3 we can simplify our groupBy function to make use of the new count function which combines group_by and summarise:

groupBy = function(df, ...) {
  df %>% count(...)
}
> data %>% groupBy(letter, number) %>% head(5)
Source: local data frame [5 x 3]
Groups: letter
 
  letter number   n
1      A      1 200
2      A      2 218
3      A      3 205
4      A      4 176
5      A      5 203
Categories: Programming