Mark Needham

Thoughts on Software Development

Neo4j 2.1.6 – Cypher: FOREACH slowness

Sun, 12/28/2014 - 05:28

A common problem that people have when using Neo4j for social network applications is updating a person with their newly imported friends.

We’ll have an array of friends that we want to connect to a single Person node. Assuming the following schema…

$ schema
Indexes
  ON :Person(id) ONLINE
 
No constraints

…a simplified version would look like this:

WITH range(2, 1002) AS friends
MERGE (p:Person {id: 1})
 
FOREACH(f IN friends |
  MERGE (friend:Person {id: f})
  MERGE (friend)-[:FRIENDS]->(p));

If we execute that on an empty database we’ll see something like this:

+-------------------+
| No data returned. |
+-------------------+
Nodes created: 1002
Relationships created: 1001
Properties set: 1002
Labels added: 1002
19173 ms

This took much longer than we’d expect, so let’s have a look at the PROFILE output:

EmptyResult
  |
  +UpdateGraph(0)
    |
    +Eager
      |
      +UpdateGraph(1)
        |
        +Extract
          |
          +Null
 
+----------------+------+---------+-------------+--------------------------------------+
|       Operator | Rows |  DbHits | Identifiers |                                Other |
+----------------+------+---------+-------------+--------------------------------------+
|    EmptyResult |    0 |       0 |             |                                      |
| UpdateGraph(0) |    1 | 3015012 |             |                              Foreach |
|          Eager |    1 |       0 |             |                                      |
| UpdateGraph(1) |    1 |       5 |        p, p | MergeNode; {  AUTOINT2}; :Person(id) |
|        Extract |    1 |       0 |             |                              friends |
|           Null |    ? |       ? |             |                                      |
+----------------+------+---------+-------------+--------------------------------------+

The DbHits value on the second row looks suspiciously high: roughly 3 million hits across 1,001 iterations is about 3,000 per friend, which is consistent with FOREACH scanning every Person node on each MERGE rather than using the :Person(id) index.

I’m not sure how to drill into that further, but an alternative is to run the same query using UNWIND instead and check its profile output:

WITH range(2, 1002) AS friends
MERGE (p:Person {id: 1})
WITH p, friends
UNWIND friends AS f
MERGE (friend:Person {id: f})
MERGE (friend)-[:FRIENDS]->(p);
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 1002
Relationships created: 1001
Properties set: 1002
Labels added: 1002
343 ms
EmptyResult
  |
  +UpdateGraph(0)
    |
    +Eager(0)
      |
      +UpdateGraph(1)
        |
        +UNWIND
          |
          +Eager(1)
            |
            +UpdateGraph(2)
              |
              +Extract
                |
                +Null
 
+----------------+------+--------+-------------------------+--------------------------------------+
|       Operator | Rows | DbHits |             Identifiers |                                Other |
+----------------+------+--------+-------------------------+--------------------------------------+
|    EmptyResult |    0 |      0 |                         |                                      |
| UpdateGraph(0) | 1001 |      0 | friend, p,   UNNAMED136 |                         MergePattern |
|       Eager(0) | 1001 |      0 |                         |                                      |
| UpdateGraph(1) | 1001 |   5005 |          friend, friend |            MergeNode; f; :Person(id) |
|         UNWIND | 1001 |      0 |                         |                                      |
|       Eager(1) |    1 |      0 |                         |                                      |
| UpdateGraph(2) |    1 |      5 |                    p, p | MergeNode; {  AUTOINT2}; :Person(id) |
|        Extract |    1 |      0 |                         |                              friends |
|           Null |    ? |      ? |                         |                                      |
+----------------+------+--------+-------------------------+--------------------------------------+

That’s much quicker and touches far fewer nodes than the FOREACH version: 5,005 DbHits across 1,001 merges works out to 5 per node, the same per-node cost as the indexed MERGE on p, which suggests the index is now being used. I expect the FOREACH issue will be sorted out in a future release, but until then UNWIND is our friend.


R: Vectorising all the things

Mon, 12/22/2014 - 12:46

After my last post about finding the distance a date/time is from the weekend, Hadley Wickham suggested I could improve the function by vectorising it…

@markhneedham vectorise with pmin(pmax(dateToLookup - before, 0), pmax(after - dateToLookup, 0)) / dhours(1)

— Hadley Wickham (@hadleywickham) December 14, 2014

…so I thought I’d try and vectorise some of the other functions I’ve written recently and show the two versions.

I found a few articles useful for explaining vectorisation and why you might want to do it.

Let’s get started.

Distance from the weekend

We want to find out how many hours a particular date/time is from the weekend, i.e. from the nearest Saturday/Sunday. We’ll be using the following libraries and set of date/times:

library(dplyr)
library(lubridate)
library(geosphere)
options("scipen"=100, "digits"=4)
 
times = ymd_hms("2002-01-01 17:00:00") + c(0:99) * hours(1)
data = data.frame(time = times)
> data %>% head()
                 time
1 2002-01-01 17:00:00
2 2002-01-01 18:00:00
3 2002-01-01 19:00:00
4 2002-01-01 20:00:00
5 2002-01-01 21:00:00
6 2002-01-01 22:00:00

Let’s have a look at the non-vectorised version first:

distanceFromWeekend = function(dateToLookup) {
  before = floor_date(dateToLookup, "week") + hours(23) + minutes(59) + seconds(59)
  after  = ceiling_date(dateToLookup, "week") - days(1)
  timeToBefore = dateToLookup - before
  timeToAfter = after - dateToLookup
 
  if(timeToBefore < 0 || timeToAfter < 0) {
    0  
  } else {
    if(timeToBefore < timeToAfter) {
      timeToBefore / dhours(1)
    } else {
      timeToAfter / dhours(1)
    }
  }
}

Now let’s run it against our data frame:

> system.time(
    data %>% mutate(ind = row_number()) %>% group_by(ind) %>% mutate(dist = distanceFromWeekend(time))  
    )
   user  system elapsed 
  1.837   0.020   1.884

And now for Hadley’s vectorised version, where pmax(…, 0) clamps the ‘already inside the weekend’ cases to zero and pmin picks whichever side of the weekend is closer:

distanceFromWeekendVectorised = function(dateToLookup) {
  before = floor_date(dateToLookup, "week") + hours(23) + minutes(59) + seconds(59)
  after  = ceiling_date(dateToLookup, "week") - days(1)
  pmin(pmax(dateToLookup - before, 0), pmax(after - dateToLookup, 0)) / dhours(1)
}
 
> system.time(data %>% mutate(dist = distanceFromWeekendVectorised(time)))
   user  system elapsed 
  0.020   0.001   0.023

Extracting start date

My next example was from cleaning up Google Trends data and extracting the start date from a cell inside a CSV file.

We’ll use this data frame:

googleTrends = read.csv("/Users/markneedham/Downloads/report.csv", row.names=NULL)
names(googleTrends) = c("week", "score")
> googleTrends %>% head(10)
                        week score
1  Worldwide; 2004 - present      
2         Interest over time      
3                       Week neo4j
4    2004-01-04 - 2004-01-10     0
5    2004-01-11 - 2004-01-17     0
6    2004-01-18 - 2004-01-24     0
7    2004-01-25 - 2004-01-31     0
8    2004-02-01 - 2004-02-07     0
9    2004-02-08 - 2004-02-14     0
10   2004-02-15 - 2004-02-21     0

The non-vectorised version looked like this:

> system.time(
    googleTrends %>% 
      mutate(ind = row_number()) %>% 
      group_by(ind) %>%
      mutate(dates = strsplit(week, " - "),
             start = dates[[1]][1] %>% strptime("%Y-%m-%d") %>% as.character())
    )
   user  system elapsed 
  0.215   0.000   0.214

In this case we can’t vectorise the code while keeping strsplit, since it returns a list rather than an atomic vector, so we need something else. Antonios showed me how to do it with substr, which works here because the start date is always the first 10 characters of the cell:

> system.time(googleTrends %>% mutate(start = substr(week, 1, 10) %>% ymd()))
   user  system elapsed 
  0.018   0.000   0.017

Calculating haversine distance

I wanted to work out the great circle distance from a collection of venues to a centre point in London. I started out with this data frame:

centre = c(-0.129581, 51.516578)
venues = read.csv("/tmp/venues.csv")
 
> venues %>% head()
                       venue   lat      lon
1              Skills Matter 51.52 -0.09911
2                   Skinkers 51.50 -0.08387
3          Theodore Bullfrog 51.51 -0.12375
4 The Skills Matter eXchange 51.52 -0.09923
5               The Guardian 51.53 -0.12234
6            White Bear Yard 51.52 -0.10980

My non-vectorised version looked like this:

> system.time(venues %>% 
    mutate(distanceFromCentre = by(venues, 1:nrow(venues), function(row) { distHaversine(c(row$lon, row$lat), centre)  }))
    )
   user  system elapsed 
  0.034   0.000   0.033

It’s pretty quick, but we can do better: the distHaversine function will calculate multiple distances in one go if its first argument is a matrix of lon/lat values rather than a vector:

> system.time(
    venues %>% mutate(distanceFromCentre = distHaversine(cbind(venues$lon, venues$lat), centre))
    )
   user  system elapsed 
  0.001   0.000   0.001

One I can’t figure out…

And finally I have a function which I can’t figure out how to vectorise but maybe someone with more R skillz than me can?

I have a data frame containing the cumulative member counts of various NoSQL London groups:

cumulativeMeetupMembers = read.csv("/tmp/cumulativeMeetupMembers.csv")
> cumulativeMeetupMembers %>% sample_n(10)
                               g.name dayMonthYear    n
4734            Hadoop Users Group UK   2013-10-26 1144
4668            Hadoop Users Group UK   2013-08-03  979
4936            Hadoop Users Group UK   2014-07-31 1644
5150                      Hive London   2012-10-15  109
8020        Neo4j - London User Group   2014-03-15  826
7666        Neo4j - London User Group   2012-08-06   78
1030                  Big Data London   2013-03-01 1416
6500        London MongoDB User Group   2013-09-21  952
8290 Oracle Big Data 4 the Enterprise   2012-06-04   61
2584              Data Science London   2012-03-20  285

And I want to find out the number of members for a group on a specific date. For example, given the following data…

> cumulativeMeetupMembers %>% head(10)
                                          g.name dayMonthYear  n
1  Big Data / Data Science / Data Analytics Jobs   2013-01-29  1
2  Big Data / Data Science / Data Analytics Jobs   2013-02-06 15
3  Big Data / Data Science / Data Analytics Jobs   2013-02-07 28
4  Big Data / Data Science / Data Analytics Jobs   2013-02-10 31
5  Big Data / Data Science / Data Analytics Jobs   2013-02-18 33
6  Big Data / Data Science / Data Analytics Jobs   2013-03-27 38
7  Big Data / Data Science / Data Analytics Jobs   2013-04-16 41
8  Big Data / Data Science / Data Analytics Jobs   2013-07-17 53
9  Big Data / Data Science / Data Analytics Jobs   2013-08-28 58
10 Big Data / Data Science / Data Analytics Jobs   2013-11-11 63

…the number of members for the ‘Big Data / Data Science / Data Analytics Jobs’ group on the 10th November 2013 should be 58.

I created this data frame of groups and random dates:

dates = ymd("2014-09-01") + c(0:9) * weeks(1)
groups = cumulativeMeetupMembers %>% distinct(g.name) %>% select(g.name)
 
groupsOnDate = merge(dates, groups)
names(groupsOnDate) = c('date', 'name')
 
> groupsOnDate %>% sample_n(10)
          date                                            name
156 2014-10-06                                 GridGain London
153 2014-09-15                                 GridGain London
70  2014-11-03                                Couchbase London
185 2014-09-29                           Hadoop Users Group UK
105 2014-09-29                             Data Science London
137 2014-10-13            Equal Experts Technical Meetup Group
360 2014-11-03                        Scale Warriors of London
82  2014-09-08 Data Science & Business Analytics London Meetup
233 2014-09-15                 London ElasticSearch User Group
84  2014-09-22 Data Science & Business Analytics London Meetup

The non-vectorised version looks like this:

memberCount = function(meetupMembers) {
  function(groupName, date) {
    (meetupMembers %>% 
       filter(g.name == groupName & dayMonthYear < date) %>% do(tail(., 1)))$n    
  }  
} 
 
findMemberCount = memberCount(cumulativeMeetupMembers)
 
> system.time(groupsOnDate %>% mutate(groupMembers = by(groupsOnDate, 1:nrow(groupsOnDate), function(row) { 
          findMemberCount(row$name, as.character(row$date))
        }) %>% 
        cbind() %>% 
        as.vector() ))
   user  system elapsed 
  2.259   0.005   2.269

The output looks like this:

          date                                     name groupMembers
116 2014-10-06                      DeNormalised London          157
322 2014-09-08                 OpenCredo Tech Workshops            7
71  2014-09-01                  Data Enthusiasts London             
233 2014-09-15          London ElasticSearch User Group          614
171 2014-09-01 HPC & GPU Supercomputing Group of London           80
109 2014-10-27                      Data Science London         3632
20  2014-11-03            Big Data Developers in London          708
42  2014-09-08              Big Data Week London Meetup           96
127 2014-10-13          Enterprise Search London Meetup          575
409 2014-10-27                            Women in Data          548

I’ve tried many different approaches but haven’t been able to come up with a version that lets me pass all the rows to memberCount and calculate the count for each row in one go; the closest is the join-based sketch below, which I haven’t verified properly.

Any ideas/advice/hints welcome!
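One idea that might work: rather than calling memberCount once per row, join every (name, date) lookup row to all of that group’s cumulative counts, keep only the counts recorded before the lookup date, and take the maximum per row. Since the counts are cumulative, the maximum is also the most recent. A rough sketch (unverified against the full data set):

# join each lookup row to all of its group's cumulative counts
joined = merge(groupsOnDate, cumulativeMeetupMembers,
               by.x = "name", by.y = "g.name")
 
memberCounts = joined %>%
  filter(as.Date(dayMonthYear) < as.Date(date)) %>%  # counts before the lookup date
  group_by(name, date) %>%
  summarise(groupMembers = max(n))                   # cumulative, so max = most recent

This trades memory for speed, as the join materialises every (group, date) combination, and groups with no rows before the lookup date drop out of the result rather than coming back with an empty count.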


R: Time to/from the weekend

Sat, 12/13/2014 - 21:38

In my last post I showed some examples using R’s lubridate package. Another problem it made really easy to solve was working out how close a particular date/time was to the weekend.

I wanted to write a function which would return the previous Sunday or upcoming Saturday depending on which was closer.

lubridate’s floor_date and ceiling_date functions make this quite simple.

For example, if we want to round the 18th of December down to the beginning of the week and up to the beginning of the next week, we could do the following:

> library(lubridate)
> floor_date(ymd("2014-12-18"), "week")
[1] "2014-12-14 UTC"
 
> ceiling_date(ymd("2014-12-18"), "week")
[1] "2014-12-21 UTC"

For the date in the future we actually want to grab the Saturday rather than the Sunday, so we’ll subtract one day from it:

> ceiling_date(ymd("2014-12-18"), "week") - days(1)
[1] "2014-12-20 UTC"

Now let’s put that together into a function which finds the closest weekend for a given date:

findClosestWeekendDay = function(dateToLookup) {
  before = floor_date(dateToLookup, "week") + hours(23) + minutes(59) + seconds(59)
  after  = ceiling_date(dateToLookup, "week") - days(1)
  if((dateToLookup - before) < (after - dateToLookup)) {
    before  
  } else {
    after  
  }
}
 
> findClosestWeekendDay(ymd_hms("2014-12-13 13:33:29"))
[1] "2014-12-13 UTC"
 
> findClosestWeekendDay(ymd_hms("2014-12-14 18:33:29"))
[1] "2014-12-14 23:59:59 UTC"
 
> findClosestWeekendDay(ymd_hms("2014-12-15 18:33:29"))
[1] "2014-12-14 23:59:59 UTC"
 
> findClosestWeekendDay(ymd_hms("2014-12-17 11:33:29"))
[1] "2014-12-14 23:59:59 UTC"
 
> findClosestWeekendDay(ymd_hms("2014-12-17 13:33:29"))
[1] "2014-12-20 UTC"
 
> findClosestWeekendDay(ymd_hms("2014-12-19 13:33:29"))
[1] "2014-12-20 UTC"

I’ve set the Sunday date to 23:59:59 so that I can use it in the next step, where we want to calculate how many hours a given date is from the nearest weekend.

I ended up with this function:

distanceFromWeekend = function(dateToLookup) {
  before = floor_date(dateToLookup, "week") + hours(23) + minutes(59) + seconds(59)
  after  = ceiling_date(dateToLookup, "week") - days(1)
  timeToBefore = dateToLookup - before
  timeToAfter = after - dateToLookup
 
  if(timeToBefore < 0 || timeToAfter < 0) {
    0  
  } else {
    if(timeToBefore < timeToAfter) {
      timeToBefore / dhours(1)
    } else {
      timeToAfter / dhours(1)
    }
  }
}
 
> distanceFromWeekend(ymd_hms("2014-12-13 13:33:29"))
[1] 0
 
> distanceFromWeekend(ymd_hms("2014-12-14 18:33:29"))
[1] 0
 
> distanceFromWeekend(ymd_hms("2014-12-15 18:33:29"))
[1] 18.55833
 
> distanceFromWeekend(ymd_hms("2014-12-17 11:33:29"))
[1] 59.55833
 
> distanceFromWeekend(ymd_hms("2014-12-17 13:33:29"))
[1] 58.44194
 
> distanceFromWeekend(ymd_hms("2014-12-19 13:33:29"))
[1] 10.44194

While this works, it’s quite slow when you run it over a data frame which contains a lot of rows.

There must be a clever R way of doing the same thing (perhaps using matrices) which I haven’t figured out yet, so if you know how to speed it up do let me know. (As it turned out, there is: Hadley Wickham’s vectorised pmin/pmax version is covered in ‘R: Vectorising all the things’ above.)


R: Numeric representation of date time

Sat, 12/13/2014 - 20:58

I’ve been playing around with date times in R recently and I wanted to derive a numeric representation for a given value to make it easier to see the correlation between time and another variable.

e.g. December 13th 2014 17:30 should return 17.5 since it’s 17.5 hours since midnight.

Using the standard R libraries we would write the following code:

> december13 = as.POSIXlt("2014-12-13 17:30:00")
> as.numeric(december13 - trunc(december13, "day"), units="hours")
[1] 17.5

That works pretty well, but Antonios recently introduced me to the lubridate package so I thought I’d give that a try as well.

The first nice thing about lubridate is that we can use the date we created earlier and call the floor_date function rather than trunc:

> (december13 - floor_date(december13, "day"))
Time difference of 17.5 hours

That gives us back a difftime…

> class((december13 - floor_date(december13, "day")))
[1] "difftime"

…which we can divide by different units to get the granularity we want:

> diff = (december13 - floor_date(december13, "day"))
> diff / dhours(1)
[1] 17.5
 
> diff / ddays(1)
[1] 0.7291667
 
> diff / dminutes(1)
[1] 1050

Pretty neat!
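If we wanted to reuse this, we could wrap the trick in a small helper (a sketch; the function name is my own):

timeOfDay = function(dateTime) {
  # hours since midnight, as a plain number
  (dateTime - floor_date(dateTime, "day")) / dhours(1)
}
 
> timeOfDay(ymd_hms("2014-12-13 17:30:00"))
[1] 17.5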

lubridate also has some nice functions for creating dates and date/times, e.g.:

> ymd_hms("2014-12-13 17:00:00")
[1] "2014-12-13 17:00:00 UTC"
 
> ymd_hm("2014-12-13 17:00")
[1] "2014-12-13 17:00:00 UTC"
 
> ymd_h("2014-12-13 17")
[1] "2014-12-13 17:00:00 UTC"
 
> ymd("2014-12-13")
[1] "2014-12-13 UTC"

And if you want a different time zone that’s pretty easy too:

> with_tz(ymd("2014-12-13"), "GMT")
[1] "2014-12-13 GMT"

R: data.table/dplyr/lubridate – Error in wday(date, label = TRUE, abbr = FALSE) : unused arguments (label = TRUE, abbr = FALSE)

Thu, 12/11/2014 - 20:03

I spent a couple of hours playing around with data.table this evening and tried changing some code written using a data frame to use a data table instead.

I started off by building a data frame which contains all the weekends between 2010 and 2015…

> library(lubridate)
 
> library(dplyr)
 
> dates = data.frame(date = seq( dmy("01-01-2010"), to=dmy("01-01-2015"), by="day" ))
 
> dates = dates %>% filter(wday(date, label = TRUE, abbr = FALSE) %in% c("Saturday", "Sunday"))

…which works fine:

> dates %>% head()
         date
1: 2010-01-02
2: 2010-01-03
3: 2010-01-09
4: 2010-01-10
5: 2010-01-16
6: 2010-01-17

I then tried to change the code to use a data table instead which led to the following error:

> library(data.table)
 
> dates = data.table(date = seq( dmy("01-01-2010"), to=dmy("01-01-2015"), by="day" ))
 
> dates = dates %>% filter(wday(date, label = TRUE, abbr = FALSE) %in% c("Saturday", "Sunday"))
Error in wday(date, label = TRUE, abbr = FALSE) : 
  unused arguments (label = TRUE, abbr = FALSE)

I wasn’t sure what was going on so I went back to the data frame version to check if that still worked…

> dates = data.frame(date = seq( dmy("01-01-2010"), to=dmy("01-01-2015"), by="day" ))
 
> dates = dates %>% filter(wday(date, label = TRUE, abbr = FALSE) %in% c("Saturday", "Sunday"))
Error in wday(c(1262304000, 1262390400, 1262476800, 1262563200, 1262649600,  : 
  unused arguments (label = TRUE, abbr = FALSE)

…except it now didn’t work either! I decided to check what wday was referring to (a quick ?wday shows where it’s defined)…

Help on topic ‘wday’ was found in the following packages:

Integer based date class
(in package data.table in library /Library/Frameworks/R.framework/Versions/3.1/Resources/library)
Get/set days component of a date-time.
(in package lubridate in library /Library/Frameworks/R.framework/Versions/3.1/Resources/library)

…and realised that data.table has its own wday function – I’d been caught out by R’s global scoping of all the things!

We can probably work around that by changing the order in which we load the various libraries, but for now I’m just prefixing the call to wday with its namespace and all is well:

dates = dates %>% filter(lubridate::wday(date, label = TRUE, abbr = FALSE) %in% c("Saturday", "Sunday"))
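For completeness, here’s a sketch of the load-order workaround: whichever package is attached last sits first on the search path, so attaching lubridate after data.table makes the unprefixed wday resolve to lubridate’s version (I haven’t checked what else swapping the order masks):

library(data.table)
library(dplyr)
library(lubridate)  # attached last, so its wday shadows data.table's
 
dates = data.table(date = seq(dmy("01-01-2010"), to = dmy("01-01-2015"), by = "day"))
dates = dates %>% filter(wday(date, label = TRUE, abbr = FALSE) %in% c("Saturday", "Sunday"))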

R: Cleaning up and plotting Google Trends data

Tue, 12/09/2014 - 19:14

I recently came across an excellent article by Stian Haklev in which he describes things he wishes he’d been told before starting out with R, one of which is to do all data clean-up in code. I thought I’d give that a try.

My goal is to leave the raw data completely unchanged, and do all the transformation in code, which can be rerun at any time.

While I’m writing the scripts, I’m often jumping around, selectively executing individual lines or code blocks, running commands to inspect the data in the REPL (read-evaluate-print-loop, where each command is executed as soon as you type enter, in the picture above it’s the pane to the right), etc.

But I try to make sure that when I finish up, the script is runnable by itself.

I thought the Google Trends data set would be an interesting one to play around with, as it gives you a CSV containing several different bits of data, of which I’m only interested in the ‘interest over time’ section.

It’s not very easy to automate the download of the CSV file so I did that bit manually and automated everything from there onwards.

The first step was to read the CSV file and explore some of the rows to see what it contained:

> library(dplyr)
 
> googleTrends = read.csv("/Users/markneedham/Downloads/report.csv", row.names=NULL)
 
> googleTrends %>% head()
##                   row.names Web.Search.interest..neo4j
## 1 Worldwide; 2004 - present                           
## 2        Interest over time                           
## 3                      Week                      neo4j
## 4   2004-01-04 - 2004-01-10                          0
## 5   2004-01-11 - 2004-01-17                          0
## 6   2004-01-18 - 2004-01-24                          0
 
> googleTrends %>% sample_n(10)
##                   row.names Web.Search.interest..neo4j
## 109 2006-01-08 - 2006-01-14                          0
## 113 2006-02-05 - 2006-02-11                          0
## 267 2009-01-18 - 2009-01-24                          0
## 199 2007-09-30 - 2007-10-06                          0
## 522 2013-12-08 - 2013-12-14                         88
## 265 2009-01-04 - 2009-01-10                          0
## 285 2009-05-24 - 2009-05-30                          0
## 318 2010-01-10 - 2010-01-16                          0
## 495 2013-06-02 - 2013-06-08                         79
## 28  2004-06-20 - 2004-06-26                          0
 
> googleTrends %>% tail()
##                row.names Web.Search.interest..neo4j
## 658        neo4j example                   Breakout
## 659 neo4j graph database                   Breakout
## 660           neo4j java                   Breakout
## 661           neo4j node                   Breakout
## 662           neo4j rest                   Breakout
## 663       neo4j tutorial                   Breakout

We only want to keep the rows which contain (week, interest) pairs, so the first thing we’ll do is rename the columns:

names(googleTrends) = c("week", "score")

Now we want to strip out the rows which don’t contain (week, interest) pairs. The easiest way to do this is to look for rows which don’t contain date values in the ‘week’ column.

First we need to split the start and end dates in that column by using the strsplit function.

I found it much easier to apply the function to each row individually rather than passing in a list of values, so I created a dummy column containing the row number to let me do that (a trick Antonios showed me):

> googleTrends %>% 
    mutate(ind = row_number()) %>% 
    group_by(ind) %>%
    mutate(dates = strsplit(week, " - "),
           start = dates[[1]][1] %>% strptime("%Y-%m-%d") %>% as.character(),
           end =   dates[[1]][2] %>% strptime("%Y-%m-%d") %>% as.character()) %>%
    head()
## Source: local data frame [6 x 6]
## Groups: ind
## 
##                        week score ind    dates      start        end
## 1 Worldwide; 2004 - present     1   1 <chr[2]>         NA         NA
## 2        Interest over time     1   2 <chr[1]>         NA         NA
## 3                      Week    90   3 <chr[1]>         NA         NA
## 4   2004-01-04 - 2004-01-10     3   4 <chr[2]> 2004-01-04 2004-01-10
## 5   2004-01-11 - 2004-01-17     3   5 <chr[2]> 2004-01-11 2004-01-17
## 6   2004-01-18 - 2004-01-24     3   6 <chr[2]> 2004-01-18 2004-01-24

Now we need to get rid of the rows which have an NA value for ‘start’ or ‘end’:

> googleTrends %>% 
    mutate(ind = row_number()) %>% 
    group_by(ind) %>%
    mutate(dates = strsplit(week, " - "),
           start = dates[[1]][1] %>% strptime("%Y-%m-%d") %>% as.character(),
           end =   dates[[1]][2] %>% strptime("%Y-%m-%d") %>% as.character()) %>%
    filter(!is.na(start) & !is.na(end)) %>% 
    head()
## Source: local data frame [6 x 6]
## Groups: ind
## 
##                      week score ind    dates      start        end
## 1 2004-01-04 - 2004-01-10     3   4 <chr[2]> 2004-01-04 2004-01-10
## 2 2004-01-11 - 2004-01-17     3   5 <chr[2]> 2004-01-11 2004-01-17
## 3 2004-01-18 - 2004-01-24     3   6 <chr[2]> 2004-01-18 2004-01-24
## 4 2004-01-25 - 2004-01-31     3   7 <chr[2]> 2004-01-25 2004-01-31
## 5 2004-02-01 - 2004-02-07     3   8 <chr[2]> 2004-02-01 2004-02-07
## 6 2004-02-08 - 2004-02-14     3   9 <chr[2]> 2004-02-08 2004-02-14

Next we’ll get rid of ‘week’, ‘ind’ and ‘dates’ as we aren’t going to need those anymore:

> cleanGoogleTrends = googleTrends %>% 
    mutate(ind = row_number()) %>% 
    group_by(ind) %>%
    mutate(dates = strsplit(week, " - "),
           start = dates[[1]][1] %>% strptime("%Y-%m-%d") %>% as.character(),
           end =   dates[[1]][2] %>% strptime("%Y-%m-%d") %>% as.character()) %>%
    filter(!is.na(start) & !is.na(end)) %>%
    ungroup() %>%
    select(-c(ind, dates, week))
 
> cleanGoogleTrends %>% head()
## Source: local data frame [6 x 3]
## 
##   score      start        end
## 1     3 2004-01-04 2004-01-10
## 2     3 2004-01-11 2004-01-17
## 3     3 2004-01-18 2004-01-24
## 4     3 2004-01-25 2004-01-31
## 5     3 2004-02-01 2004-02-07
## 6     3 2004-02-08 2004-02-14
 
> cleanGoogleTrends %>% sample_n(10)
## Source: local data frame [10 x 3]
## 
##    score      start        end
## 1      8 2010-09-26 2010-10-02
## 2     73 2013-11-17 2013-11-23
## 3     52 2012-07-01 2012-07-07
## 4      3 2005-06-19 2005-06-25
## 5      3 2004-12-12 2004-12-18
## 6      3 2009-09-06 2009-09-12
## 7     71 2014-09-14 2014-09-20
## 8      3 2004-12-26 2005-01-01
## 9     62 2013-03-03 2013-03-09
## 10     3 2006-03-19 2006-03-25
 
> cleanGoogleTrends %>% tail()
## Source: local data frame [6 x 3]
## 
##   score      start        end
## 1    80 2014-10-19 2014-10-25
## 2    80 2014-10-26 2014-11-01
## 3    84 2014-11-02 2014-11-08
## 4    81 2014-11-09 2014-11-15
## 5    83 2014-11-16 2014-11-22
## 6     2 2014-11-23 2014-11-29

OK, now we’re ready to plot. This was my first attempt:

> library(ggplot2)
> ggplot(aes(x = start, y = score), data = cleanGoogleTrends) + 
    geom_line(size = 0.5)
## geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
[Plot: first attempt, which doesn’t render a line because ‘start’ isn’t yet a date]

As you can see, not too successful! The first mistake I’ve made is not telling ggplot that the ‘start’ column is a date, so it can’t make use of that ordering when plotting:

> cleanGoogleTrends = cleanGoogleTrends %>% mutate(start =  as.Date(start))
> ggplot(aes(x = start, y = score), data = cleanGoogleTrends) + 
    geom_line(size = 0.5)

[Plot: ‘start’ now treated as a date, but ‘score’ still a factor]

My next mistake is that ‘score’ is not being treated as a continuous variable, which is why we end up with this strange-looking chart. We can confirm that by calling the class function:

> class(cleanGoogleTrends$score)
## [1] "factor"

Let’s fix that and plot again. We have to go via as.character here: calling as.numeric directly on a factor returns the underlying level codes rather than the original values:

> cleanGoogleTrends = cleanGoogleTrends %>% mutate(score = as.numeric(as.character(score)))
> ggplot(aes(x = start, y = score), data = cleanGoogleTrends) + 
    geom_line(size = 0.5)

[Plot: weekly interest scores as a continuous variable]

That’s much better, but there’s quite a bit of noise in the week-to-week scores, which we can smooth out by plotting a rolling mean over the last 4 weeks instead:

> library(zoo)
> cleanGoogleTrends = cleanGoogleTrends %>% 
    mutate(rolling = rollmean(score, 4, fill = NA, align = "right"),
           start =  as.Date(start))
 
> ggplot(aes(x = start, y = rolling), data = cleanGoogleTrends) + 
    geom_line(size = 0.5)

[Plot: 4-week rolling mean of the interest scores]

Here’s the full code if you want to reproduce:

library(dplyr)
library(zoo)
library(ggplot2)
 
googleTrends = read.csv("/Users/markneedham/Downloads/report.csv", row.names=NULL)
names(googleTrends) = c("week", "score")
 
cleanGoogleTrends = googleTrends %>% 
  mutate(ind = row_number()) %>% 
  group_by(ind) %>%
  mutate(dates = strsplit(week, " - "),
         start = dates[[1]][1] %>% strptime("%Y-%m-%d") %>% as.character(),
         end =   dates[[1]][2] %>% strptime("%Y-%m-%d") %>% as.character()) %>%
  filter(!is.na(start) & !is.na(end)) %>%
  ungroup() %>%
  select(-c(ind, dates, week)) %>%
  mutate(start =  as.Date(start),
         score = as.numeric(as.character(score)),
         rolling = rollmean(score, 4, fill = NA, align = "right"))
 
ggplot(aes(x = start, y = rolling), data = cleanGoogleTrends) + 
  geom_line(size = 0.5)

My next step is to plot the Google Trends scores against my meetup data set to see if there are any interesting correlations going on.

As an aside, I made use of knitr while putting together this post. It works really well for checking that you’ve included all the steps and that the code actually runs!
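If you want to do the same, re-running the whole post takes only a couple of lines (a sketch; the file name is whatever you saved the post as):

library(knitr)
opts_chunk$set(error = FALSE)  # stop as soon as a chunk fails
knit("google-trends.Rmd")      # hypothetical file name for this post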


R: dplyr – mutate with strptime (incompatible size/wrong result size)

Mon, 12/08/2014 - 20:02

Having worked out how to translate a string into a date, or NA if it wasn’t in the appropriate format, the next thing I wanted to do was store the result of the transformation in my data frame.

I started off with this:

data = data.frame(x = c("2014-01-01", "2014-02-01", "foo"))
> data
           x
1 2014-01-01
2 2014-02-01
3        foo

And when I tried to do the date translation I ran into the following error:

> data %>% mutate(y = strptime(x, "%Y-%m-%d"))
Error: wrong result size (11), expected 3 or 1

As I understand it, this error is telling us that we are trying to put a value into the data frame which has 11 elements rather than 3 (one per row) or 1. The culprit is that strptime returns a POSIXlt, which under the covers is a list of 11 components (seconds, minutes, hours, day of the month and so on).
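We can see where the 11 comes from directly, at least on the machine that produced the error above:

> length(unclass(strptime("2014-01-01", "%Y-%m-%d")))
[1] 11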

It turns out that storing POSIXlts in a data frame isn’t such a good idea! In this case we can use the as.character function to create a character vector which can be stored in the data frame:

> data %>% mutate(y = strptime(x, "%Y-%m-%d") %>% as.character())
           x          y
1 2014-01-01 2014-01-01
2 2014-02-01 2014-02-01
3        foo       <NA>

We can then get rid of the NA row by using the is.na function:

> data %>% mutate(y = strptime(x, "%Y-%m-%d") %>% as.character()) %>% filter(!is.na(y))
           x          y
1 2014-01-01 2014-01-01
2 2014-02-01 2014-02-01

And a final tweak so that we have 100% pipelining goodness:

> data %>% 
    mutate(y = x %>% strptime("%Y-%m-%d") %>% as.character()) %>%
    filter(!is.na(y))
           x          y
1 2014-01-01 2014-01-01
2 2014-02-01 2014-02-01
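Alternatively, since we only care about the calendar date, a sketch of another option: as.Date also returns NA for strings that don’t parse, and unlike a POSIXlt a Date vector stores happily in a data frame, so we can skip the as.character round-trip:

> data %>% 
    mutate(y = as.Date(x, "%Y-%m-%d")) %>%
    filter(!is.na(y))
           x          y
1 2014-01-01 2014-01-01
2 2014-02-01 2014-02-01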