Skip to content

Software Development Blogs: Programming, Software Testing, Agile Project Management

Methods & Tools

Subscribe to Methods & Tools
if you are not afraid to read more than one page to be a smarter software developer, software tester or project manager!

Mark Needham
Syndicate content
Thoughts on Software Development
Updated: 9 hours 27 min ago

R: Speeding up the Wimbledon scraping job

Mon, 06/29/2015 - 06:36

Over the past few days I’ve written a few blog posts about a Wimbledon data set I’ve been building and after running the scripts a few times I noticed that it was taking much longer to run that I expected.

To recap, I started out with the following function which takes in a URI and returns a data frame containing a row for each match:

library(rvest)
library(dplyr)
 
scrape_matches1 = function(uri) {
  matches = data.frame()
 
  s = html(uri)
  rows = s %>% html_nodes("div#scoresResultsContent tr")
  i = 0
  for(row in rows) {  
    players = row %>% html_nodes("td.day-table-name a")
    seedings = row %>% html_nodes("td.day-table-seed")
    score = row %>% html_node("td.day-table-score a")
    flags = row %>% html_nodes("td.day-table-flag img")
 
    if(!is.null(score)) {
      player1 = players[1] %>% html_text() %>% str_trim()
      seeding1 = ifelse(!is.na(seedings[1]), seedings[1] %>% html_node("span") %>% html_text() %>% str_trim(), NA)
      flag1 = flags[1] %>% html_attr("alt")
 
      player2 = players[2] %>% html_text() %>% str_trim()
      seeding2 = ifelse(!is.na(seedings[2]), seedings[2] %>% html_node("span") %>% html_text() %>% str_trim(), NA)
      flag2 = flags[2] %>% html_attr("alt")
 
      matches = rbind(data.frame(winner = player1, 
                                 winner_seeding = seeding1, 
                                 winner_flag = flag1,
                                 loser = player2, 
                                 loser_seeding = seeding2,
                                 loser_flag = flag2,
                                 score = score %>% html_text() %>% str_trim(),
                                 round = round), matches)      
    } else {
      round = row %>% html_node("th") %>% html_text()
    }
  } 
  return(matches)
}

Let’s run it to get an idea of the data that it returns:

matches1 = scrape_matches1("http://www.atpworldtour.com/en/scores/archive/wimbledon/540/2014/results")
 
> matches1 %>% filter(round %in% c("Finals", "Semi-Finals", "Quarter-Finals"))
           winner winner_seeding winner_flag           loser loser_seeding loser_flag            score          round
1    Milos Raonic            (8)         CAN    Nick Kyrgios          (WC)        AUS    674 62 64 764 Quarter-Finals
2   Roger Federer            (4)         SUI   Stan Wawrinka           (5)        SUI     36 765 64 64 Quarter-Finals
3 Grigor Dimitrov           (11)         BUL     Andy Murray           (3)        GBR        61 764 62 Quarter-Finals
4  Novak Djokovic            (1)         SRB     Marin Cilic          (26)        CRO  61 36 674 62 62 Quarter-Finals
5   Roger Federer            (4)         SUI    Milos Raonic           (8)        CAN         64 64 64    Semi-Finals
6  Novak Djokovic            (1)         SRB Grigor Dimitrov          (11)        BUL    64 36 762 767    Semi-Finals
7  Novak Djokovic            (1)         SRB   Roger Federer           (4)        SUI 677 64 764 57 64         Finals

As I mentioned, it’s quite slow but I thought I’d wrap it in system.time so I could see exactly how long it was taking:

> system.time(scrape_matches1("http://www.atpworldtour.com/en/scores/archive/wimbledon/540/2014/results"))
   user  system elapsed 
 25.570   0.111  31.416

About 30 seconds! The first thing I tried was downloading the file separately and running the function against the local file:

> system.time(scrape_matches1("data/raw/2014.html"))
   user  system elapsed 
 25.662   0.123  25.863

Hmmm, that’s only saved us 5 seconds so the bottleneck must be somewhere else. Still there’s no point making a HTTP request every time we run the script so we’ll stick with the local file version.

While browsing rvest’s vignette I noticed a function called html_table which I was curious about. I decided to try and replace some of my code with a call to that:

matches2= html("data/raw/2014.html") %>% 
  html_node("div#scoresResultsContent table.day-table") %>% html_table(header = FALSE) %>% 
  mutate(X1 = ifelse(X1 == "", NA, X1)) %>%
  mutate(round = ifelse(grepl("\\([0-9]\\)|\\(", X1), NA, X1)) %>% 
  mutate(round = na.locf(round)) %>%
  filter(!is.na(X8)) %>%
  select(winner = X3, winner_seeding = X1, loser = X7, loser_seeding = X5, score = X8, round)
 
> matches2 %>% filter(round %in% c("Finals", "Semi-Finals", "Quarter-Finals"))
           winner winner_seeding           loser loser_seeding            score          round
1  Novak Djokovic            (1)   Roger Federer           (4) 677 64 764 57 64         Finals
2  Novak Djokovic            (1) Grigor Dimitrov          (11)    64 36 762 767    Semi-Finals
3   Roger Federer            (4)    Milos Raonic           (8)         64 64 64    Semi-Finals
4  Novak Djokovic            (1)     Marin Cilic          (26)  61 36 674 62 62 Quarter-Finals
5 Grigor Dimitrov           (11)     Andy Murray           (3)        61 764 62 Quarter-Finals
6   Roger Federer            (4)   Stan Wawrinka           (5)     36 765 64 64 Quarter-Finals
7    Milos Raonic            (8)    Nick Kyrgios          (WC)    674 62 64 764 Quarter-Finals

I had to do some slightly clever stuff to get the ’round’ column into shape using zoo’s na.locf function which I wrote about previously.

Unfortunately I couldn’t work out how to extract the flag with this version – that value is hidden in the ‘alt’ tag of an img and presumably html_table is just grabbing the text value of each cell. This version is much quicker though!

system.time(html("data/raw/2014.html") %>% 
  html_node("div#scoresResultsContent table.day-table") %>% html_table(header = FALSE) %>% 
  mutate(X1 = ifelse(X1 == "", NA, X1)) %>%
  mutate(round = ifelse(grepl("\\([0-9]\\)|\\(", X1), NA, X1)) %>% 
  mutate(round = na.locf(round)) %>%
  filter(!is.na(X8)) %>%
  select(winner = X3, winner_seeding = X1, loser = X7, loser_seeding = X5, score = X8, round))
 
   user  system elapsed 
  0.545   0.002   0.548

What I realised from writing this version is that I need to match all the columns with one call to html_nodes rather than getting the row and then each column in a loop.

I rewrote the function to do that:

scrape_matches3 = function(uri) {
  s = html(uri)
 
  players  = s %>% html_nodes("div#scoresResultsContent tr td.day-table-name a")
  seedings = s %>% html_nodes("div#scoresResultsContent tr td.day-table-seed")
  scores   = s %>% html_nodes("div#scoresResultsContent tr td.day-table-score a")
  flags    = s %>% html_nodes("div#scoresResultsContent tr td.day-table-flag img") %>% html_attr("alt") %>% str_trim()
 
  matches3 = data.frame(
    winner         = sapply(seq(1,length(players),2),  function(idx) players[[idx]] %>% html_text()),
    winner_seeding = sapply(seq(1,length(seedings),2), function(idx) seedings[[idx]] %>% html_text() %>% str_trim()),
    winner_flag    = sapply(seq(1,length(flags),2),    function(idx) flags[[idx]]),  
    loser          = sapply(seq(2,length(players),2),  function(idx) players[[idx]] %>% html_text()),
    loser_seeding  = sapply(seq(2,length(seedings),2), function(idx) seedings[[idx]] %>% html_text() %>% str_trim()),
    loser_flag     = sapply(seq(2,length(flags),2),    function(idx) flags[[idx]]),
    score          = sapply(scores,                    function(score) score %>% html_text() %>% str_trim())
  )
  return(matches3)
}

Let’s run and time that to check we’re getting back the right results in a timely manner:

> matches3 %>% sample_n(10)
                   winner winner_seeding winner_flag               loser loser_seeding loser_flag         score
70           David Ferrer            (7)         ESP Pablo Carreno Busta                      ESP  60 673 61 61
128        Alex Kuznetsov           (26)         USA         Tim Smyczek           (3)        USA   46 63 63 63
220   Rogerio Dutra Silva                        BRA   Kristijan Mesaros                      CRO         62 63
83         Kevin Anderson           (20)         RSA        Aljaz Bedene          (LL)        GBR      63 75 62
73          Kei Nishikori           (10)         JPN   Kenny De Schepper                      FRA     64 765 75
56  Roberto Bautista Agut           (27)         ESP         Jan Hernych           (Q)        CZE   75 46 62 62
138            Ante Pavic                        CRO        Marc Gicquel          (29)        FRA  46 63 765 64
174             Tim Puetz                        GER     Ruben Bemelmans                      BEL         64 62
103        Lleyton Hewitt                        AUS   Michal Przysiezny                      POL 62 6714 61 64
35          Roger Federer            (4)         SUI       Gilles Muller           (Q)        LUX      63 75 63
 
> system.time(scrape_matches3("data/raw/2014.html"))
   user  system elapsed 
  0.815   0.006   0.827

It’s still quick – a bit slower than html_table but we can deal with that. As you can see, I also had to add some logic to separate the values for the winners and losers – the players, seeds, flags come back as as one big list. The odd rows represent the winner; the even rows the loser.

Annoyingly we’ve now lost the ’round’ column because that appears as a table heading so we can’t extract it the same way. I ended up cheating a bit to get it to work by working out how many matches each round should contain and generated a vector with that number of entries:

raw_rounds = s %>% html_nodes("th") %>% html_text()
 
> raw_rounds
 [1] "Finals"               "Semi-Finals"          "Quarter-Finals"       "Round of 16"          "Round of 32"         
 [6] "Round of 64"          "Round of 128"         "3rd Round Qualifying" "2nd Round Qualifying" "1st Round Qualifying"
 
rounds = c( sapply(0:6, function(idx) rep(raw_rounds[[idx + 1]], 2 ** idx)) %>% unlist(),
            sapply(7:9, function(idx) rep(raw_rounds[[idx + 1]], 2 ** (idx - 3))) %>% unlist())
 
> rounds[1:10]
 [1] "Finals"         "Semi-Finals"    "Semi-Finals"    "Quarter-Finals" "Quarter-Finals" "Quarter-Finals" "Quarter-Finals"
 [8] "Round of 16"    "Round of 16"    "Round of 16"

Let’s put that code into the function and see if we end up with the same resulting data frame:

scrape_matches4 = function(uri) {
  s = html(uri)
 
  players  = s %>% html_nodes("div#scoresResultsContent tr td.day-table-name a")
  seedings = s %>% html_nodes("div#scoresResultsContent tr td.day-table-seed")
  scores   = s %>% html_nodes("div#scoresResultsContent tr td.day-table-score a")
  flags    = s %>% html_nodes("div#scoresResultsContent tr td.day-table-flag img") %>% html_attr("alt") %>% str_trim()
 
  raw_rounds = s %>% html_nodes("th") %>% html_text()
  rounds = c( sapply(0:6, function(idx) rep(raw_rounds[[idx + 1]], 2 ** idx)) %>% unlist(),
              sapply(7:9, function(idx) rep(raw_rounds[[idx + 1]], 2 ** (idx - 3))) %>% unlist())
 
  matches4 = data.frame(
    winner         = sapply(seq(1,length(players),2),  function(idx) players[[idx]] %>% html_text()),
    winner_seeding = sapply(seq(1,length(seedings),2), function(idx) seedings[[idx]] %>% html_text() %>% str_trim()),
    winner_flag    = sapply(seq(1,length(flags),2),    function(idx) flags[[idx]]),  
    loser          = sapply(seq(2,length(players),2),  function(idx) players[[idx]] %>% html_text()),
    loser_seeding  = sapply(seq(2,length(seedings),2), function(idx) seedings[[idx]] %>% html_text() %>% str_trim()),
    loser_flag     = sapply(seq(2,length(flags),2),    function(idx) flags[[idx]]),
    score          = sapply(scores,                    function(score) score %>% html_text() %>% str_trim()),
    round          = rounds
  )
  return(matches4)
}
 
matches4 = scrape_matches4("data/raw/2014.html")
 
> matches4 %>% filter(round %in% c("Finals", "Semi-Finals", "Quarter-Finals"))
           winner winner_seeding winner_flag           loser loser_seeding loser_flag            score          round
1  Novak Djokovic            (1)         SRB   Roger Federer           (4)        SUI 677 64 764 57 64         Finals
2  Novak Djokovic            (1)         SRB Grigor Dimitrov          (11)        BUL    64 36 762 767    Semi-Finals
3   Roger Federer            (4)         SUI    Milos Raonic           (8)        CAN         64 64 64    Semi-Finals
4  Novak Djokovic            (1)         SRB     Marin Cilic          (26)        CRO  61 36 674 62 62 Quarter-Finals
5 Grigor Dimitrov           (11)         BUL     Andy Murray           (3)        GBR        61 764 62 Quarter-Finals
6   Roger Federer            (4)         SUI   Stan Wawrinka           (5)        SUI     36 765 64 64 Quarter-Finals
7    Milos Raonic            (8)         CAN    Nick Kyrgios          (WC)        AUS    674 62 64 764 Quarter-Finals

We shouldn’t have added much to the time but let’s check:

> system.time(scrape_matches4("data/raw/2014.html"))
   user  system elapsed 
  0.816   0.004   0.824

Sweet. We’ve saved ourselves 29 seconds per page as long as the number of rounds stayed constant over the years. For the 10 years that I’ve looked at it has but I expect if you go back further the draw sizes will have been different and our script would break.

For now though this will do!

Categories: Programming

R: dplyr – Update rows with earlier/previous rows values

Sun, 06/28/2015 - 23:30

Recently I had a data frame which contained a column which had mostly empty values:

> data.frame(col1 = c(1,2,3,4,5), col2  = c("a", NA, NA , "b", NA))
  col1 col2
1    1    a
2    2 <NA>
3    3 <NA>
4    4    b
5    5 <NA>

I wanted to fill in the NA values with the last non NA value from that column. So I want the data frame to look like this:

1    1    a
2    2    a
3    3    a
4    4    b
5    5    b

I spent ages searching around before I came across the na.locf function in the zoo library which does the job:

library(zoo)
library(dplyr)
 
> data.frame(col1 = c(1,2,3,4,5), col2  = c("a", NA, NA , "b", NA)) %>% 
    do(na.locf(.))
  col1 col2
1    1    a
2    2    a
3    3    a
4    4    b
5    5    b

This will fill in the missing values for every column, so if we had a third column with missing values it would populate those too:

> data.frame(col1 = c(1,2,3,4,5), col2  = c("a", NA, NA , "b", NA), col3 = c("A", NA, "B", NA, NA)) %>% 
    do(na.locf(.))
 
  col1 col2 col3
1    1    a    A
2    2    a    A
3    3    a    B
4    4    b    B
5    5    b    B

If we only want to populate ‘col2′ and leave ‘col3′ as it is we can apply the function specifically to that column:

> data.frame(col1 = c(1,2,3,4,5), col2  = c("a", NA, NA , "b", NA), col3 = c("A", NA, "B", NA, NA)) %>% 
    mutate(col2 = na.locf(col2))
  col1 col2 col3
1    1    a    A
2    2    a <NA>
3    3    a    B
4    4    b <NA>
5    5    b <NA>

It’s quite a neat function and certainly comes in helpful when cleaning up data sets which don’t tend to be as uniform as you’d hope!

Categories: Programming

R: Command line – Error in GenericTranslator$new : could not find function “loadMethod”

Sat, 06/27/2015 - 23:47

I’ve been reading Text Processing with Ruby over the last week or so and one of the ideas the author describes is setting up your scripts so you can run them directly from the command line.

I wanted to do this with my Wimbledon R script and wrote the following script which uses the ‘Rscript’ executable so that R doesn’t launch in interactive mode:

wimbledon

#!/usr/bin/env Rscript
 
library(rvest)
library(dplyr)
library(stringr)
library(readr)
 
# stuff

Then I tried to run it:

$ time ./wimbledon
 
...
 
Error in GenericTranslator$new : could not find function "loadMethod"
Calls: write.csv ... html_extract_n -> <Anonymous> -> Map -> mapply -> <Anonymous> -> $
Execution halted
 
real	0m1.431s
user	0m1.127s
sys	0m0.078s

As the error suggests, the script fails when trying to write to a CSV file – it looks like Rscript doesn’t load in something from the core library that we need. It turns out adding the following line to our script is all we need:

library(methods)

So we end up with this:

#!/usr/bin/env Rscript
 
library(methods)
library(rvest)
library(dplyr)
library(stringr)
library(readr)

And when we run that all is well!

Categories: Programming

R: dplyr – squashing multiple rows per group into one

Sat, 06/27/2015 - 23:36

I spent a bit of the day working on my Wimbledon data set and the next thing I explored is all the people that have beaten Andy Murray in the tournament.

The following dplyr query gives us the names of those people and the year the match took place:

library(dplyr)
 
> main_matches %>% filter(loser == "Andy Murray") %>% select(winner, year)
 
            winner year
1  Grigor Dimitrov 2014
2    Roger Federer 2012
3     Rafael Nadal 2011
4     Rafael Nadal 2010
5     Andy Roddick 2009
6     Rafael Nadal 2008
7 Marcos Baghdatis 2006
8 David Nalbandian 2005

As you can see, Rafael Nadal shows up multiple times. I wanted to get one row per player and list all the years in a single column.

This was my initial attempt:

> main_matches %>% filter(loser == "Andy Murray") %>% 
     group_by(winner) %>% summarise(years = paste(year))
Source: local data frame [6 x 2]
 
            winner years
1     Andy Roddick  2009
2 David Nalbandian  2005
3  Grigor Dimitrov  2014
4 Marcos Baghdatis  2006
5     Rafael Nadal  2011
6    Roger Federer  2012

Unfortunately it just gives you the last matching row per group which isn’t quite what we want.. I realised my mistake while trying to pass a vector into paste and noticing that a vector came back when I’d expected a string:

> paste(c(2008,2009,2010))
[1] "2008" "2009" "2010"

The missing argument was ‘collapse’ – something I’d come across when using plyr last year:

> paste(c(2008,2009,2010), collapse=", ")
[1] "2008, 2009, 2010"

Now, if we apply that to our original function:

> main_matches %>% filter(loser == "Andy Murray") %>% 
     group_by(winner) %>% summarise(years = paste(year, collapse=", "))
Source: local data frame [6 x 2]
 
            winner            years
1     Andy Roddick             2009
2 David Nalbandian             2005
3  Grigor Dimitrov             2014
4 Marcos Baghdatis             2006
5     Rafael Nadal 2011, 2010, 2008
6    Roger Federer             2012

That’s exactly what we want. Let’s tidy that up a bit:

> main_matches %>% filter(loser == "Andy Murray") %>% 
     group_by(winner) %>% arrange(year) %>%
     summarise(years  = paste(year, collapse =","), times = length(year))  %>%
     arrange(desc(times), years)
Source: local data frame [6 x 3]
 
            winner          years times
1     Rafael Nadal 2008,2010,2011     3
2 David Nalbandian           2005     1
3 Marcos Baghdatis           2006     1
4     Andy Roddick           2009     1
5    Roger Federer           2012     1
6  Grigor Dimitrov           2014     1
Categories: Programming

R: ggplot – Show discrete scale even with no value

Fri, 06/26/2015 - 23:48

As I mentioned in a previous blog post, I’ve been scraping data for the Wimbledon tennis tournament, and having got the data for the last ten years I wrote a query using dplyr to find out how players did each year over that period.

I ended up with the following functions to filter my data frame of all the mataches:

round_reached = function(player, main_matches) {
  furthest_match = main_matches %>% 
    filter(winner == player | loser == player) %>% 
    arrange(desc(round)) %>% 
    head(1)  
 
    return(ifelse(furthest_match$winner == player, "Winner", as.character(furthest_match$round)))
}
 
player_performance = function(name, matches) {
  player = data.frame()
  for(y in 2005:2014) {
    round = round_reached(name, filter(matches, year == y))
    if(length(round) == 1) {
      player = rbind(player, data.frame(year = y, round = round))      
    } else {
      player = rbind(player, data.frame(year = y, round = "Did not enter"))
    } 
  }
  return(player)
}

When we call that function we see the following output:

> player_performance("Andy Murray", main_matches)
   year          round
1  2005    Round of 32
2  2006    Round of 16
3  2007  Did not enter
4  2008 Quarter-Finals
5  2009    Semi-Finals
6  2010    Semi-Finals
7  2011    Semi-Finals
8  2012         Finals
9  2013         Winner
10 2014 Quarter-Finals

I wanted to create a chart showing Murray’s progress over the years with the round reached on the y axis and the year on the x axis. In order to do this I had to make sure the ’round’ column was being treated as a factor variable:

df = player_performance("Andy Murray", main_matches)
 
rounds = c("Did not enter", "Round of 128", "Round of 64", "Round of 32", "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")
df$round = factor(df$round, levels =  rounds)
 
> df$round
 [1] Round of 32    Round of 16    Did not enter  Quarter-Finals Semi-Finals    Semi-Finals    Semi-Finals   
 [8] Finals         Winner         Quarter-Finals
Levels: Did not enter Round of 128 Round of 64 Round of 32 Round of 16 Quarter-Finals Semi-Finals Finals Winner

Now that we’ve got that we can plot his progress:

ggplot(aes(x = year, y = round, group=1), data = df) + 
    geom_point() + 
    geom_line() + 
    scale_x_continuous(breaks=df$year) + 
    scale_y_discrete(breaks = rounds)

2015 06 26 23 37 32

This is a good start but we’ve lost the rounds which don’t have a corresponding entry on the x axis. I’d like to keep them so it’s easier to compare the performance of different players.

It turns out that all we need to do is pass ‘drop = FALSE’ to scale_y_discrete and it will work exactly as we want:

ggplot(aes(x = year, y = round, group=1), data = df) + 
    geom_point() + 
    geom_line() + 
    scale_x_continuous(breaks=df$year) + 
    scale_y_discrete(breaks = rounds, drop = FALSE)

2015 06 26 23 41 01

Neat. Now let’s have a look at the performances of some of the other top players:

draw_chart = function(player, main_matches){
  df = player_performance(player, main_matches)
  df$round = factor(df$round, levels =  rounds)
 
  ggplot(aes(x = year, y = round, group=1), data = df) + 
    geom_point() + 
    geom_line() + 
    scale_x_continuous(breaks=df$year) + 
    scale_y_discrete(breaks = rounds, drop=FALSE) + 
    ggtitle(player) + 
    theme(axis.text.x=element_text(angle=90, hjust=1))
}
 
a = draw_chart("Andy Murray", main_matches)
b = draw_chart("Novak Djokovic", main_matches)
c = draw_chart("Rafael Nadal", main_matches)
d = draw_chart("Roger Federer", main_matches)
 
library(gridExtra)
grid.arrange(a,b,c,d, ncol=2)

2015 06 26 23 46 15

And that’s all for now!

Categories: Programming

R: Scraping Wimbledon draw data

Fri, 06/26/2015 - 00:14

Given Wimbledon starts next week I wanted to find a data set to explore before it gets underway. Having searched around and failed to find one I had to resort to scraping the ATP World Tour’s event page which displays the matches in an easy to access format.

We’ll be using the Wimbledon 2013 draw since Andy Murray won that year! This is what the page looks like:

2015 06 25 23 47 16

Each match is in its own row of a table and each column has a class attribute which makes it really easy to scrape. We’ll be using R’s rvest again. I wrote the following script which grabs the player names, seedings and score of the match and stores everything in a data frame:

library(rvest)
library(dplyr)
library(stringr)
 
s = html_session("http://www.atpworldtour.com/en/scores/archive/wimbledon/540/2013/results")
rows = s %>% html_nodes("div#scoresResultsContent tr")
 
matches = data.frame()
for(row in rows) {  
  players = row %>% html_nodes("td.day-table-name a")
  seedings = row %>% html_nodes("td.day-table-seed")
  score = row %>% html_node("td.day-table-score a")
 
  if(!is.null(score)) {
    player1 = players[1] %>% html_text() %>% str_trim()
    seeding1 = ifelse(!is.na(seedings[1]), seedings[1] %>% html_node("span") %>% html_text() %>% str_trim(), NA)
 
    player2 = players[2] %>% html_text() %>% str_trim()
    seeding2 = ifelse(!is.na(seedings[2]), seedings[2] %>% html_node("span") %>% html_text() %>% str_trim(), NA)
 
    matches = rbind(data.frame(winner = player1, 
                               winner_seeding = seeding1, 
                               loser = player2, 
                               loser_seeding = seeding2,
                               score = score %>% html_text() %>% str_trim(),
                               round = round), matches)
 
  } else {
    round = row %>% html_node("th") %>% html_text()
  }
}

This is what the data frame looks like:

> matches %>% sample_n(10)
               winner winner_seeding                       loser loser_seeding            score                round
61      Wayne Odesnik            (4)                Thiago Alves          <NA>            61 64 1st Round Qualifying
4     Danai Udomchoke           <NA>            Marton Fucsovics          <NA>       61 57 1210 1st Round Qualifying
233    Jerzy Janowicz           (24)                Lukasz Kubot          <NA>         75 64 64       Quarter-Finals
90       Malek Jaziri           <NA>             Illya Marchenko           (9)        674 75 64 2nd Round Qualifying
222      David Ferrer            (4)         Alexandr Dolgopolov          (26) 676 762 26 61 62          Round of 32
54  Michal Przysiezny           (11)                 Dusan Lojda          <NA>         26 63 62 1st Round Qualifying
52           Go Soeda           (13)               Nikola Mektic          <NA>            62 60 1st Round Qualifying
42    Ruben Bemelmans           (23) Jonathan Dasnieres de Veigy          <NA>            63 64 1st Round Qualifying
31        Mirza Basic           <NA>              Tsung-Hua Yang          <NA>     674 33 (RET) 1st Round Qualifying
179     Jurgen Melzer           <NA>              Julian Reister           (Q)    36 762 765 62          Round of 64

It also contains qualifying matches which I’m not so interested in. Let’s strip those out:

main_matches = matches %>% filter(!grepl("Qualifying", round)) %>% mutate(year = 2013)

We’ll also put a column in for ‘year’ so that we can handle the draws for multiple years later on.

Next I wanted to clean up the data a bit. I’d like to be able to do some queries based on the seedings of the players but at the moment that column contains numeric brackets in values as well as some other values which indicate whether a player is a qualifier, lucky loser or wildcard entry.

I started by adding a column to store this extra information:

main_matches$winner_type = NA
main_matches$winner_type[main_matches$winner_seeding == "(WC)"] = "wildcard"
main_matches$winner_type[main_matches$winner_seeding == "(Q)"] = "qualifier"
main_matches$winner_type[main_matches$winner_seeding == "(LL)"] = "lucky loser"
 
main_matches$loser_type = NA
main_matches$loser_type[main_matches$loser_seeding == "(WC)"] = "wildcard"
main_matches$loser_type[main_matches$loser_seeding == "(Q)"] = "qualifier"
main_matches$loser_type[main_matches$loser_seeding == "(LL)"] = "lucky loser"

And then I cleaned up the existing column:

tidy_seeding = function(seeding) {
  no_brackets = gsub("\\(|\\)", "", seeding)
  return(gsub("WC|Q|L", NA, no_brackets))
}
 
main_matches = main_matches %>% 
  mutate(winner_seeding = as.numeric(tidy_seeding(winner_seeding)),
         loser_seeding = as.numeric(tidy_seeding(loser_seeding)))

Now we can write a query against the data frame to find out when the underdog won i.e. a player with no seeding beat a player with a seeding or a lower seeded player beat a higher seeded one:

> main_matches %>%  filter((winner_seeding > loser_seeding) | (is.na(winner_seeding) & !is.na(loser_seeding)))
                  winner winner_seeding                 loser loser_seeding                  score          round year
1          Jurgen Melzer             NA         Fabio Fognini            30           675 75 63 62   Round of 128 2013
2          Bernard Tomic             NA           Sam Querrey            21       766 763 36 26 63   Round of 128 2013
3        Feliciano Lopez             NA          Gilles Simon            19             62 64 7611   Round of 128 2013
4             Ivan Dodig             NA Philipp Kohlschreiber            16 46 676 763 63 21 (RET)   Round of 128 2013
5         Viktor Troicki             NA      Janko Tipsarevic            14              63 64 765   Round of 128 2013
6         Lleyton Hewitt             NA         Stan Wawrinka            11               64 75 63   Round of 128 2013
7           Steve Darcis             NA          Rafael Nadal             5             764 768 64   Round of 128 2013
8      Fernando Verdasco             NA      Julien Benneteau            31             761 764 64    Round of 64 2013
9           Grega Zemlja             NA       Grigor Dimitrov            29       36 764 36 64 119    Round of 64 2013
10      Adrian Mannarino             NA            John Isner            18               11 (RET)    Round of 64 2013
11         Igor Sijsling             NA          Milos Raonic            17              75 64 764    Round of 64 2013
12     Kenny De Schepper             NA           Marin Cilic            10                  (W/O)    Round of 64 2013
13        Ernests Gulbis             NA    Jo-Wilfried Tsonga             6         36 63 63 (RET)    Round of 64 2013
14     Sergiy Stakhovsky             NA         Roger Federer             3         675 765 75 765    Round of 64 2013
15          Lukasz Kubot             NA          Benoit Paire            25               61 63 64    Round of 32 2013
16     Kenny De Schepper             NA           Juan Monaco            22              64 768 64    Round of 32 2013
17        Jerzy Janowicz             24       Nicolas Almagro            15              766 63 64    Round of 32 2013
18         Andreas Seppi             23         Kei Nishikori            12        36 62 674 61 64    Round of 32 2013
19         Bernard Tomic             NA       Richard Gasquet             9          767 57 75 765    Round of 32 2013
20 Juan Martin Del Potro              8          David Ferrer             4              62 64 765 Quarter-Finals 2013
21           Andy Murray              2        Novak Djokovic             1               64 75 64         Finals 2013

There are actually very few times when a lower seeded player beat a higher seeded one but there are quite a few instances of non seeds beating seeds. We’ve got 21 occurrences of underdogs winning out of a total of 127 matches.

Let’s filter that set of rows and see which seeds lost in the first round:

> main_matches %>%  filter(round == "Round of 128" & !is.na(loser_seeding))
           winner winner_seeding                 loser loser_seeding                  score        round year
1   Jurgen Melzer             NA         Fabio Fognini            30           675 75 63 62 Round of 128 2013
2   Bernard Tomic             NA           Sam Querrey            21       766 763 36 26 63 Round of 128 2013
3 Feliciano Lopez             NA          Gilles Simon            19             62 64 7611 Round of 128 2013
4      Ivan Dodig             NA Philipp Kohlschreiber            16 46 676 763 63 21 (RET) Round of 128 2013
5  Viktor Troicki             NA      Janko Tipsarevic            14              63 64 765 Round of 128 2013
6  Lleyton Hewitt             NA         Stan Wawrinka            11               64 75 63 Round of 128 2013
7    Steve Darcis             NA          Rafael Nadal             5             764 768 64 Round of 128 2013

Rafael Nadal is the most prominent but Stan Wawrinka also lost in the first round that year which I’d forgotten about! Next let’s make the ’round’ column an ordered factor one so that we can sort matches by round:

main_matches$round = factor(main_matches$round, levels =  c("Round of 128", "Round of 64", "Round of 32", "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals"))
 
> main_matches$round
...     
Levels: Round of 128 Round of 64 Round of 32 Round of 16 Quarter-Finals Semi-Finals Finals

We can now really easily work out which unseeded players went the furthest in the tournament:

> main_matches %>% filter(is.na(loser_seeding)) %>% arrange(desc(round)) %>% head(5)
             winner winner_seeding             loser loser_seeding           score          round year
1    Jerzy Janowicz             24      Lukasz Kubot            NA        75 64 64 Quarter-Finals 2013
2       Andy Murray              2 Fernando Verdasco            NA  46 36 61 64 75 Quarter-Finals 2013
3 Fernando Verdasco             NA Kenny De Schepper            NA        64 64 64    Round of 16 2013
4      Lukasz Kubot             NA  Adrian Mannarino            NA  46 63 36 63 64    Round of 16 2013
5    Jerzy Janowicz             24     Jurgen Melzer            NA 36 761 64 46 64    Round of 16 2013

Next up I thought it’d be cool to write a function which showed which round each player exited in:

round_reached = function(player, main_matches) {
  furthest_match = main_matches %>% 
    filter(winner == player | loser == player) %>% 
    arrange(desc(round)) %>% 
    head(1)  
 
    return(ifelse(furthest_match$winner == player, "Winner", as.character(furthest_match$round)))
}

Our function isn’t vectorisable – it only works if we pass in a single player at a time so we’ll have to group the data frame by player before calling it. Let’s check it works by seeing how far Andy Murray and Rafael Nadal got:

> round_reached("Rafael Nadal", main_matches)
[1] "Round of 128"
> round_reached("Andy Murray", main_matches)
[1] "Winner"

Great. What about if we try it against each of the top 8 seeds?

> rbind(main_matches %>% filter(winner_seeding %in% 1:8) %>% mutate(name = winner, seeding = winner_seeding), 
        main_matches %>% filter(loser_seeding %in% 1:8) %>% mutate(name = loser, seeding = loser_seeding)) %>%
    select(name, seeding) %>%
    distinct() %>%
    arrange(seeding) %>%
    group_by(name) %>%
    mutate(round_reached = round_reached(name, main_matches))
Source: local data frame [8 x 3]
Groups: name
 
                   name seeding  round_reached
1        Novak Djokovic       1         Finals
2           Andy Murray       2         Winner
3         Roger Federer       3    Round of 64
4          David Ferrer       4 Quarter-Finals
5          Rafael Nadal       5   Round of 128
6    Jo-Wilfried Tsonga       6    Round of 64
7         Tomas Berdych       7 Quarter-Finals
8 Juan Martin Del Potro       8    Semi-Finals

Neat. Next up I want to do a comparison between the round they reached and the round you’d expect them to get to given their seeding but that’s for the weekend!

I’ve put a CSV file containing all the data in this gist in case you want to play with it. I’m planning to scrape a few more years worth of data before Monday and add in some extra fields as well but in case I don’t get around to it the full script in this blog post is included in the gist as well so feel free to tweak it if tennis is your thing.

Categories: Programming

R: Scraping the release dates of github projects

Tue, 06/23/2015 - 23:34

Continuing on from my blog post about scraping Neo4j’s release dates I thought it’d be even more interesting to chart the release dates of some github projects.

In theory the release dates should be accessible through the github API but the few that I looked at weren’t returning any data so I scraped the data together.

We’ll be using rvest again and I first wrote the following function to extract the release versions and dates from a single page:

library(dplyr)
library(rvest)
 
process_page = function(releases, session) {
  rows = session %>% html_nodes("ul.release-timeline-tags li")
 
  for(row in rows) {
    date = row %>% html_node("span.date")
    version = row %>% html_node("div.tag-info a")
 
    if(!is.null(version) && !is.null(date)) {
      date = date %>% html_text() %>% str_trim()
      version = version %>% html_text() %>% str_trim()
      releases = rbind(releases, data.frame(date = date, version = version))
    }  
  }
  return(releases)
}

Let’s try it out on the Cassandra release page and see what it comes back with:

> r = process_page(data.frame(), html_session("https://github.com/apache/cassandra/releases"))
> r
           date               version
1  Jun 22, 2015       cassandra-2.1.7
2  Jun 22, 2015      cassandra-2.0.16
3   Jun 8, 2015       cassandra-2.1.6
4   Jun 8, 2015   cassandra-2.2.0-rc1
5  May 19, 2015 cassandra-2.2.0-beta1
6  May 18, 2015      cassandra-2.0.15
7  Apr 29, 2015       cassandra-2.1.5
8   Apr 1, 2015      cassandra-2.0.14
9   Apr 1, 2015       cassandra-2.1.4
10 Mar 16, 2015      cassandra-2.0.13

That works pretty well but it’s only one page! To get all the pages we can use the follow_link function to follow the ‘Next’ link until there aren’t anymore pages to process.

We end up with the following function to do this:

find_all_releases = function(starting_page) {
  s = html_session(starting_page)
  releases = data.frame()
 
  next_page = TRUE
  while(next_page) {
    possibleError = tryCatch({  
      releases = process_page(releases, s)
      s = s %>% follow_link("Next") 
    }, error = function(e) { e })
 
    if(inherits(possibleError, "error")){
      next_page = FALSE
    }
  }
  return(releases)
}

Let’s try it out starting from the Cassandra page:

> cassandra = find_all_releases("https://github.com/apache/cassandra/releases")
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-2.0.13
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-2.0.10
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-2.0.8
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.2.13
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-2.0.0-rc1
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.2.3
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.2.0-beta2
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.0.10
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.0.6
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.0.0-rc2
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.7.7
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.7.4
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.7.0-rc3
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.6.4
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.5.0-rc3
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.4.0-final
 
> cassandra %>% sample_n(10)
            date               version
151 Mar 13, 2010   cassandra-0.5.0-rc2
25   Jul 3, 2014      cassandra-1.2.18
51  Jul 27, 2013       cassandra-1.2.8
21  Aug 19, 2014   cassandra-2.1.0-rc6
73  Sep 24, 2012 cassandra-1.2.0-beta1
158 Mar 13, 2010   cassandra-0.4.0-rc2
113 May 20, 2011     cassandra-0.7.6-2
15  Oct 24, 2014       cassandra-2.1.1
103 Sep 15, 2011 cassandra-1.0.0-beta1
93  Nov 29, 2011       cassandra-1.0.4

I want to plot when the different releases happened in time and in order to do that we need to create an extra column containing the ‘release series’ which we can do with the following transformation:

series = function(version) {
  parts = strsplit(as.character(version), "\\.")  
  return(unlist(lapply(parts, function(p) paste(p %>% unlist %>% head(2), collapse = "."))))  
}
 
bySeries = cassandra %>%
  mutate(date2 = mdy(date), series = series(version),
         short_version = gsub("cassandra-", "", version),
         short_series = series(short_version))
 
> bySeries %>% sample_n(10)
            date               version      date2        series short_version short_series
3    Jun 8, 2015       cassandra-2.1.6 2015-06-08 cassandra-2.1         2.1.6          2.1
161 Mar 13, 2010 cassandra-0.4.0-beta1 2010-03-13 cassandra-0.4   0.4.0-beta1          0.4
62  Feb 15, 2013      cassandra-1.1.10 2013-02-15 cassandra-1.1        1.1.10          1.1
153 Mar 13, 2010 cassandra-0.5.0-beta2 2010-03-13 cassandra-0.5   0.5.0-beta2          0.5
37   Feb 7, 2014       cassandra-2.0.5 2014-02-07 cassandra-2.0         2.0.5          2.0
36   Feb 7, 2014      cassandra-1.2.15 2014-02-07 cassandra-1.2        1.2.15          1.2
29   Jun 2, 2014   cassandra-2.1.0-rc1 2014-06-02 cassandra-2.1     2.1.0-rc1          2.1
21  Aug 19, 2014   cassandra-2.1.0-rc6 2014-08-19 cassandra-2.1     2.1.0-rc6          2.1
123 Feb 16, 2011       cassandra-0.7.2 2011-02-16 cassandra-0.7         0.7.2          0.7
135  Nov 1, 2010 cassandra-0.7.0-beta3 2010-11-01 cassandra-0.7   0.7.0-beta3          0.7

Now let’s plot those releases and see what we get:

ggplot(aes(x = date2, y = short_series), 
       data = bySeries %>% filter(!grepl("beta|rc", short_version))) +     
  geom_text(aes(label=short_version),hjust=0.5, vjust=0.5, size = 4, angle = 90) + 
  theme_bw()

2015 06 23 22 59 19

An interesting thing we can see from this visualisation is what overlap the various series of versions have. Most of the time there are only two series of versions overlapping but the 1.2, 2.0 and 2.1 series all overlap which is unusual.

In this chart we excluded all beta and RC versions. Let’s bring those back in and just show the last 3 versions:

ggplot(aes(x = date2, y = short_series), 
       data = bySeries %>% filter(grepl("2\\.[012]\\.|1\\.2\\.", short_version))) +     
  geom_text(aes(label=short_version),hjust=0.5, vjust=0.5, size = 4, angle = 90) + 
  theme_bw()

2015 06 23 23 08 04

From this chart it’s clearer that the 2.0 and 2.1 series have recent releases so there will probably be three overlapping versions when the 2.2 series is released as well.

The chart is still a bit cluttered although less than before. I’m not sure of a better way of visualising this type of data so if you have any ideas do let me know!

Categories: Programming

R: Scraping Neo4j release dates with rvest

Sun, 06/21/2015 - 23:07

As part of my log analysis I wanted to get the Neo4j release dates which are accessible from the release notes and decided to try out Hadley Wickham’s rvest scraping library which he released at the end of 2014.

rvest is based on Python’s beautifulsoup which has become my scraping library of choice so I didn’t find it too difficult to pick up.

To start with we need to download the release notes locally so we don’t have to go over the network when we’re doing our scraping:

download.file("http://neo4j.com/release-notes/page/1", "release-notes.html")
download.file("http://neo4j.com/release-notes/page/2", "release-notes2.html")

We want to parse those pages back and return the rows which contain version numbers and release dates. The HTML looks like this:

2015 06 21 22 57 20

We can get the rows with the following code:

library(rvest)
library(dplyr)
 
page1 <- html("release-notes.html")
page2 <- html("release-notes2.html")
 
rows = c(page1 %>% html_nodes("div.small-12 div.row"), 
         page2 %>% html_nodes("div.small-12 div.row") ) 
 
> rows %>% head(1)
[[1]]
<div class="row"> <h3 class="entry-title"><a href="http://neo4j.com/release-notes/neo4j-2-2-2/">Latest Release: Neo4j 2.2.2</a></h3> <h6>05/21/2015</h6> <p>Neo4j 2.2.2 is a maintenance release, with critical improvements.</p> <p>Notably, this release:</p> <ul><li>Provides support for running Neo4j on Oracle and OpenJDK Java 8 runtimes</li> <li>Resolves an issue that prevented the Neo4j Browser from loading in the latest Chrome release (43.0.2357.65).</li> <li>Corrects the behavior of the <code>:sysinfo</code> (aka <code>:play sysinfo</code>) browser directive.</li> <li>Improves the <a href="http://neo4j.com/docs/2.2.2/import-tool.html">import tool</a> handling of values containing newlines, and adds support f...</li></ul><a href="http://neo4j.com/release-notes/neo4j-2-2-2/">Read full notes →</a> </div>

Now we need to loop through the rows and pull out just the version and release date. I wrote the following function to do this and strip out any extra text that we’re not interested in:

generate_releases = function(rows) {
  releases = data.frame()
  for(row in rows) {
    version = row %>% html_node("h3.entry-title")
    date = row %>% html_node("h6")  
 
    if(!is.null(version) && !is.null(date)) {
      version = version %>% html_text()
      version = gsub("Latest Release: ", "", version)
      version = gsub("Neo4j ", "", version)
      releases = rbind(releases, data.frame(version = version, date = date %>% html_text()))
    }
  }
  return(releases)
}
 
> generate_releases(rows)
   version       date
1    2.2.2 05/21/2015
2    2.2.1 04/14/2015
3    2.1.8 04/01/2015
4    2.2.0 03/25/2015
5    2.1.7 02/03/2015
6    2.1.6 11/25/2014
7    1.9.9 10/13/2014
8    2.1.5 09/30/2014
9    2.1.4 09/04/2014
10   2.1.3 07/28/2014
11   2.0.4 07/08/2014
12   1.9.8 06/19/2014
13   2.1.2 06/11/2014
14   2.0.3 04/30/2014
15   2.0.1 02/04/2014
16   2.0.2 04/15/2014
17   1.9.7 04/11/2014
18   1.9.6 02/03/2014
19     2.0 12/11/2013
20   1.9.5 11/11/2013
21   1.9.4 09/19/2013
22   1.9.3 08/30/2013
23   1.9.2 07/16/2013
24   1.9.1 06/24/2013
25     1.9 05/13/2013
26   1.8.3         //

Finally I wanted to convert the ‘date’ column to be in R date format and get rid of the 1.8.3 row since it doesn’t contain a date. lubridate is my goto library for date manipulation in R so we’ll use that here:

library(lubridate)
 
> generate_releases(rows) %>%  
      mutate(date = mdy(date)) %>%   
      filter(!is.na(date)) 
 
   version       date
1    2.2.2 2015-05-21
2    2.2.1 2015-04-14
3    2.1.8 2015-04-01
4    2.2.0 2015-03-25
5    2.1.7 2015-02-03
6    2.1.6 2014-11-25
7    1.9.9 2014-10-13
8    2.1.5 2014-09-30
9    2.1.4 2014-09-04
10   2.1.3 2014-07-28
11   2.0.4 2014-07-08
12   1.9.8 2014-06-19
13   2.1.2 2014-06-11
14   2.0.3 2014-04-30
15   2.0.1 2014-02-04
16   2.0.2 2014-04-15
17   1.9.7 2014-04-11
18   1.9.6 2014-02-03
19     2.0 2013-12-11
20   1.9.5 2013-11-11
21   1.9.4 2013-09-19
22   1.9.3 2013-08-30
23   1.9.2 2013-07-16
24   1.9.1 2013-06-24
25     1.9 2013-05-13

We could then easily see how many releases there were by year:

releasesByDate = generate_releases(rows) %>%  
  mutate(date = mdy(date)) %>%   
  filter(!is.na(date))
 
> releasesByDate %>% mutate(year = year(date)) %>% count(year)
Source: local data frame [3 x 2]
 
  year  n
1 2013  7
2 2014 13
3 2015  5

Or by month:

> releasesByDate %>% mutate(month = month(date)) %>% count(month)
Source: local data frame [11 x 2]
 
   month n
1      2 3
2      3 1
3      4 5
4      5 2
5      6 3
6      7 3
7      8 1
8      9 3
9     10 1
10    11 2
11    12 1

Previous to this quick bit of hacking I’d always turned to Ruby or Python whenever I wanted to scrape a dataset but it looks like rvest makes R a decent option for this type of work now. Good times!

Categories: Programming

R: dplyr – segfault cause ‘memory not mapped’

Sat, 06/20/2015 - 23:18

In my continued playing around with web logs in R I wanted to process the logs for a day and see what the most popular URIs were.

I first read in all the lines using the read_lines function in readr and put the vector it produced into a data frame so I could process it using dplyr.

library(readr)
dlines = data.frame(column = read_lines("~/projects/logs/2015-06-18-22-docs"))

In the previous post I showed some code to extract the URI from a log line. I extracted this code out into a function and adapted it so that I could pass in a list of values instead of a single value:

extract_uri = function(log) {
  parts = str_extract_all(log, "\"[^\"]*\"")
  return(lapply(parts, function(p) str_match(p[1], "GET (.*) HTTP")[2] %>% as.character))
}

Next I ran the following function to count the number of times each URI appeared in the logs:

library(dplyr)
pages_viewed = dlines %>%
  mutate(uri  = extract_uri(column)) %>% 
  count(uri) %>%
  arrange(desc(n))

This crashed my R process with the following error message:

segfault cause 'memory not mapped'

I narrowed it down to a problem when doing a group by operation on the ‘uri’ field and came across this post which suggested that it was handled more cleanly in more recently version of dplyr.

I upgraded to 0.4.2 and tried again:

## Error in eval(expr, envir, enclos): cannot group column uri, of class 'list'

That makes more sense. We’re probably returning a list from extract_uri rather than a vector which would fit nicely back into the data frame. That’s fixed easily enough by unlisting the result:

extract_uri = function(log) {
  parts = str_extract_all(log, "\"[^\"]*\"")
  return(unlist(lapply(parts, function(p) str_match(p[1], "GET (.*) HTTP")[2] %>% as.character)))
}

And now when we run the count function it’s happy again, good times!

Categories: Programming

R: Regex – capturing multiple matches of the same group

Fri, 06/19/2015 - 22:38

I’ve been playing around with some web logs using R and I wanted to extract everything that existed in double quotes within a logged entry.

This is an example of a log entry that I want to parse:

log = '2015-06-18-22:277:548311224723746831\t2015-06-18T22:00:11\t2015-06-18T22:00:05Z\t93317114\tip-127-0-0-1\t127.0.0.5\tUser\tNotice\tneo4j.com.access.log\t127.0.0.3 - - [18/Jun/2015:22:00:11 +0000] "GET /docs/stable/query-updating.html HTTP/1.1" 304 0 "http://neo4j.com/docs/stable/cypher-introduction.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36"'

And I want to extract these 3 things:

  • /docs/stable/query-updating.html
  • http://neo4j.com/docs/stable/cypher-introduction.html
  • Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36

i.e. the URI, the referrer and browser details.

I’ll be using the stringr library which seems to work quite well for this type of work.

To extract these values we need to find all the occurrences of double quotes and get the text inside those quotes. We might start by using the str_match function:

> library(stringr)
> str_match(log, "\"[^\"]*\"")
     [,1]                                               
[1,] "\"GET /docs/stable/query-updating.html HTTP/1.1\""

Unfortunately that only picked up the first occurrence of the pattern so we’ve got the URI but not the referrer or browser details. I tried str_extract with similar results before I found str_extract_all which does the job:

> str_extract_all(log, "\"[^\"]*\"")
[[1]]
[1] "\"GET /docs/stable/query-updating.html HTTP/1.1\""                                                                            
[2] "\"http://neo4j.com/docs/stable/cypher-introduction.html\""                                                                    
[3] "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36\""

We still need to do a bit of cleanup to get rid of the ‘GET’ and ‘HTTP/1.1′ in the URI and the quotes in all of them:

parts = str_extract_all(log, "\"[^\"]*\"")[[1]]
uri = str_match(parts[1], "GET (.*) HTTP")[2]
referer = str_match(parts[2], "\"(.*)\"")[2]
browser = str_match(parts[3], "\"(.*)\"")[2]
 
> uri
[1] "/docs/stable/query-updating.html"
 
> referer
[1] "https://www.google.com/"
 
> browser
[1] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36"

We could then go on to split out the browser string into its sub components but that’ll do for now!

Categories: Programming

Coding: Explore and retreat

Wed, 06/17/2015 - 18:23

When refactoring code or looking for the best way to integrate a new piece of functionality I generally favour a small steps/incremental approach but recent experiences have led me to believe that this isn’t always the quickest approach.

Sometimes it seems to make more sense to go on little discovery missions in the code, make some bigger steps and then if necessary retreat and revert our changes and apply the lessons learnt on our next discovery mission. This technique which isn’t anything novel but I think is quite effective.

Michael and I were recently looking at the Smart Local Moving algorithm which is used for community detection in large networks and decided to refactor the code to make sure we understood how it worked. When we started the outline of the main class was like this:

public class Network implements Cloneable, Serializable
{
    private static final long serialVersionUID = 1;
 
    private int numberOfNodes;
    private int[] firstNeighborIndex;
    private int[] neighbor;
    private double[] edgeWeight;
    private double totalEdgeWeightSelfLinks;
    private double[] nodeWeight;
    private int nClusters;
    private int[] cluster;
 
    private double[] clusterWeight;
    private int[] numberNodesPerCluster;
    private int[][] nodePerCluster;
    private boolean clusteringStatsAvailable;
...
}

My initial approach was to put methods around things to make it a bit easier to understand and then step by step replace each of those fields with nodes and relationships. I spent the first couple of hours doing this and while it was making the code more readable it wasn’t progressing very quickly and I wasn’t much wiser about how the code worked.

Michael and I paired on it for a few hours and he adopted a slightly different but more successful approach where we looked at slightly bigger chunks of code e.g. all the loops that used the firstNeighborIndex field and then created a hypothesis of what that code was doing.

In this case firstNeighborIndex acts as an offset into neighbor and is used to iterate through a node’s relationships. We thought we could probably replace that with something more similar to the Neo4j model where you have classes for nodes and relationships and a node has a method which returns a collection of relationships.

We tried tearing out everywhere that used those two fields and replacing them with our new nodes/relationships code but that didn’t work because we hadn’t realised that edgeWeight and nodeWeight are also tied to the contents of the original fields.

We therefore needed to retreat and try again. This time I put the new approach alongside the existing approach and then slowly replaced existing bits of code.

Along the way I came up with other ideas about how to restructure the code, tried some more bigger leaps to validate my ideas and then moved back into incremental mode again.

In summary I’ve found the combination of incrementally changing code and going on bigger exploratory missions works quite well.

Now I’m trying to work out when each approach is appropriate and I’ll write that up when I learn more! You can see my progress via the github commits.

Categories: Programming

Northwind: Finding direct/transitive Reports in SQL and Neo4j’s Cypher

Mon, 06/15/2015 - 23:53

Every few months we run a relational to graph meetup at the Neo London office where we go through how to take your data from a relational database and into the graph.

We use the Northwind dataset which often comes as a demo dataset on relational databases and come up with some queries which seem graph in nature.

My favourite query is one which finds out how employees are organised and who reports to whom. I thought it’d be quite interesting to see what it would look like in Postgres SQL as well, just for fun.

We’ll start off by getting a list of employees and the person they report to:

SELECT e."EmployeeID", e."ReportsTo"
FROM employees AS e
WHERE e."ReportsTo" IS NOT NULL;
 
 EmployeeID | ReportsTo
------------+-----------
          1 |         2
          3 |         2
          4 |         2
          5 |         2
          6 |         5
          7 |         5
          8 |         2
          9 |         5
(8 ROWS)

In cypher we’d do this:

MATCH (e:Employee)<-[:REPORTS_TO]-(sub)
RETURN sub.EmployeeID, e.EmployeeID 
 
+-------------------------------+
| sub.EmployeeID | e.EmployeeID |
+-------------------------------+
| "4"            | "2"          |
| "5"            | "2"          |
| "1"            | "2"          |
| "3"            | "2"          |
| "8"            | "2"          |
| "9"            | "5"          |
| "6"            | "5"          |
| "7"            | "5"          |
+-------------------------------+
8 rows

Next let’s find the big boss who doesn’t report to anyone. First in SQL:

SELECT e."EmployeeID" AS bigBoss
FROM employees AS e
WHERE e."ReportsTo" IS NULL
 
 bigboss
---------
       2
(1 ROW)

And now cypher:

MATCH (e:Employee)
WHERE NOT (e)-[:REPORTS_TO]->()
RETURN e.EmployeeID AS bigBoss
 
+---------+
| bigBoss |
+---------+
| "2"     |
+---------+
1 row

We still don’t need to join anything so the query isn’t that interesting yet. Let’s bring in some more properties from the manager record so we have to self join on the employees table:

SELECT e."FirstName", e."LastName", e."Title", manager."FirstName", manager."LastName", manager."Title"
FROM employees AS e
JOIN employees AS manager ON e."ReportsTo" = manager."EmployeeID"
WHERE e."ReportsTo" IS NOT NULL
 
 FirstName | LastName  |          Title           | FirstName | LastName |         Title
-----------+-----------+--------------------------+-----------+----------+-----------------------
 Nancy     | Davolio   | Sales Representative     | Andrew    | Fuller   | Vice President, Sales
 Janet     | Leverling | Sales Representative     | Andrew    | Fuller   | Vice President, Sales
 Margaret  | Peacock   | Sales Representative     | Andrew    | Fuller   | Vice President, Sales
 Steven    | Buchanan  | Sales Manager            | Andrew    | Fuller   | Vice President, Sales
 Michael   | Suyama    | Sales Representative     | Steven    | Buchanan | Sales Manager
 Robert    | King      | Sales Representative     | Steven    | Buchanan | Sales Manager
 Laura     | Callahan  | Inside Sales Coordinator | Andrew    | Fuller   | Vice President, Sales
 Anne      | Dodsworth | Sales Representative     | Steven    | Buchanan | Sales Manager
(8 ROWS)
MATCH (e:Employee)<-[:REPORTS_TO]-(sub)
RETURN sub.FirstName, sub.LastName, sub.Title, e.FirstName, e.LastName, e.Title
 
+----------------------------------------------------------------------------------------------------------------+
| sub.FirstName | sub.LastName | sub.Title                  | e.FirstName | e.LastName | e.Title                 |
+----------------------------------------------------------------------------------------------------------------+
| "Margaret"    | "Peacock"    | "Sales Representative"     | "Andrew"    | "Fuller"   | "Vice President, Sales" |
| "Steven"      | "Buchanan"   | "Sales Manager"            | "Andrew"    | "Fuller"   | "Vice President, Sales" |
| "Nancy"       | "Davolio"    | "Sales Representative"     | "Andrew"    | "Fuller"   | "Vice President, Sales" |
| "Janet"       | "Leverling"  | "Sales Representative"     | "Andrew"    | "Fuller"   | "Vice President, Sales" |
| "Laura"       | "Callahan"   | "Inside Sales Coordinator" | "Andrew"    | "Fuller"   | "Vice President, Sales" |
| "Anne"        | "Dodsworth"  | "Sales Representative"     | "Steven"    | "Buchanan" | "Sales Manager"         |
| "Michael"     | "Suyama"     | "Sales Representative"     | "Steven"    | "Buchanan" | "Sales Manager"         |
| "Robert"      | "King"       | "Sales Representative"     | "Steven"    | "Buchanan" | "Sales Manager"         |
+----------------------------------------------------------------------------------------------------------------+
8 rows

Now let’s see how many direct reports each manager has:

SELECT manager."EmployeeID" AS manager, COUNT(e."EmployeeID") AS reports
FROM employees AS manager
LEFT JOIN employees AS e ON e."ReportsTo" = manager."EmployeeID"
GROUP BY manager
ORDER BY reports DESC;
 
 manager | reports
---------+---------
       2 |       5
       5 |       3
       1 |       0
       3 |       0
       4 |       0
       9 |       0
       6 |       0
       7 |       0
       8 |       0
(9 ROWS)
MATCH (e:Employee)
OPTIONAL MATCH (e)<-[rel:REPORTS_TO]-(report)
RETURN e.EmployeeID AS employee, COUNT(rel) AS reports
 
+--------------------+
| employee | reports |
+--------------------+
| "2"      | 5       |
| "5"      | 3       |
| "8"      | 0       |
| "7"      | 0       |
| "1"      | 0       |
| "4"      | 0       |
| "6"      | 0       |
| "9"      | 0       |
| "3"      | 0       |
+--------------------+
9 rows

Things start to get more interesting if we find the transitive reporting relationships that exist. I’m not an expert at Postgres but one way to achieve this is by writing a recursive WITH query like so:

WITH RECURSIVE recursive_employees("EmployeeID", "ReportsTo") AS (
        SELECT e."EmployeeID", e."ReportsTo"
        FROM employees e
      UNION ALL
        SELECT e."EmployeeID", e."ReportsTo"
        FROM employees e, recursive_employees re
        WHERE e."EmployeeID" = re."ReportsTo"
)
SELECT re."ReportsTo", COUNT(*) AS COUNT
FROM recursive_employees AS re
WHERE re."ReportsTo" IS NOT NULL
GROUP BY re."ReportsTo";
 
 ReportsTo | COUNT
-----------+-------
         2 |     8
         5 |     3
(2 ROWS)

If there’s a simpler way let me know in the comments.

In cypher we only need to add one character, ‘*’, after the ‘REPORTS_TO’ relationship to get it to recurse as far as it can. We’ll also remove the ‘OPTIONAL MATCH’ so that we only get back people who have people reporting to them:

MATCH (e:Employee)<-[rel:REPORTS_TO*]-(report)
RETURN e.EmployeeID AS employee, COUNT(rel) AS reports
 
+--------------------+
| employee | reports |
+--------------------+
| "2"      | 8       |
| "5"      | 3       |
+--------------------+
2 rows

Now I need to find some relational datasets with more complicated queries to play around with. If you have any ideas do let me know.

Categories: Programming

The Willpower Instinct: Reducing time spent mindlessly scrolling for things to read

Sat, 06/13/2015 - 00:12

I recently finished reading Kelly McGonigal’s excellent book ‘The Willpower Instinct‘ having previously watched her Google talk of the same title

My main takeaway from the book is that there are things that we want to do (or not do) but doing them (or not as the case may be) isn’t necessarily instinctive and so we need to develop some strategies to help ourselves out.

In one of the early chapters she suggests picking a habit that you want to do less off and write down on a piece of paper every time you want to do it and how you’re feeling at that point.

After writing it down you’re free to then follow through and do it but you don’t have to if you change your mind.

I was quite aware of the fact that I spend a lot of time idly scrolling from email to Twitter to Facebook to LinkedIn to news websites and back again so I thought it’d be interesting to track when/why I was doing this. The annoying thing about this habit is that it can easily eat up 20-30 minutes at a time without you even noticing.

I’ve been tracking myself for about three weeks and in the first few days I noticed that the first thing I did as soon as I woke up was grab my phone and get into the cycle.

It was quite frustrating to be lured in so early in the day but one of the suggestions in the book is that feeling guilty about something is actually detrimental to our progress. Instead we should note why it happened and then move on – the day isn’t a write off because of one event!

Kelly suggests that if we can work out the times when we’re most likely to fall into our habits then we can pre-plan a mitigation strategy.

From looking over my notes the following are the reasons why I want to start mindlessly scrolling:

  • I’m stuck on the problem I’m working on
  • I’m bored
  • I’m tired
  • I’m hungry
  • I’m getting distracted by notifications
  • I want to not think for a while

The notifications bullet is easy to address – I turn off notifications on my phone for 4 hours at a time so I don’t even know there’s anything to read.

I was intrigued to note that I got distracted when stuck on a problem – the main take away here is to check whether the urge to scroll mindlessly is being driven by having to think hard. If it is then I can choose to either get back to it or go for a short walk and then come back. But definitely don’t start scrolling!

I often find myself bored on my commute to work so I’ve addressed this by working out a book/paper I’m going to read the night before and then having that ready for the journey. Lunch time is prime time for mindless scrolling as well so I’ve filled that time with various computer science/data science videos.

Since I started tracking my scrolling I’ve found myself sleeping earlier so my assumption is that the extra hours awake were being spent mindlessly scrolling which led to being more tired so a win all around in that respect.

Something I’ve noticed is that I’m sometimes wasting time on other activities which I’m are not ‘forbidden’ but are equally unconstructive e.g. chat applications / watching music videos.

The former are obviously useful for communicating with people so I’ve been trying to use them only when I actually want to chat to someone rather than mindlessly looking for messages to read.

I also find myself not wanting to write down the times I’ve mindlessly scrolled when I’m doing it a lot on a given day. Being aware of this is helpful as I just write it down anyway and get on with the day.

The summary of my experience so far is it seems beneficial – I don’t think I’ve lost anything by not checking those mediums so often and I’ve definitely read a lot more than I usually do and been more focused as well.

Now I need to go and try out some of the other exercises from the book – if you’ve read it / tried out any of the tips I’d love to hear what’s worked well for you.

Categories: Programming

Neo4j: Using LOAD CSV to help explore CSV files

Fri, 06/12/2015 - 00:15

During the Neo4j How I met your mother hackathon that we ran last week one of the attendees noticed that one of the CSV files we were importing wasn’t creating as many records as they expected it to.

This is typically the case when there’s some odd quoting in the CSV file but we decided to look into it.

The file in question was one containing references made in HIMYM. The first 5 lines look like this:

$ head -n 5 data/import/references.csv
ReferencedEpisodeId,ReferencingEpisodeId,ReferenceText
168,184,"Marshall will eventually hear back from the New York State Judicatory Committee in Something New, which will become a main plot point of Season 9."
168,169,Barney proclaiming to be done with Robin will be the focal point of Lobster Crawl.
58,57,"Barney finally confronts his saboteur (Abby, whom he slept with in Ten Sessions) in Everything Must Go."
58,63,"Barney finally confronts his saboteur (Abby, whom he slept with in Ten Sessions) in Everything Must Go."

And this is how many lines the Unix ‘wc’ command sees:

$ wc -l data/import/references.csv
     782 data/import/references.csv

So we might expect that there are going to be 782 records created if we import that file into Neo4j. Let’s run a quick query in Neo4j to see what it thinks:

LOAD CSV WITH HEADERS 
FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/references.csv" 
AS row
return count(*)
 
==> +----------+
==> | count(*) |
==> +----------+
==> | 636      |
==> +----------+
==> 1 row

So we have 146 less records than we expected which means Neo4j is treating multiple lines as one CSV line in some cases.

Let’s go back to the Unix command line to try and work out which lines those are. There must be some lines which start with part of the ‘ReferenceText’ rather than a ‘ReferenceEpisodeId’ so let’s extract the first column and see what’s going on there:

$ cat data/import/references.csv | cut -d"," -f1 | grep -v  '[0-9]\+$'| head -n 10
ReferencedEpisodeId
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
Also

We’ve extracted the first column and then filter the output to only keep rows which don’t contain all numbers which will be our rogue rows.

Let’s switch back to Neo4j land to see which rows it thinks contains these fragments of text:

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/references.csv" AS row
WITH row WHERE row.ReferenceText =~ ".*This is the Mother's first.*"
RETURN row.ReferencedEpisodeId, row.ReferencingEpisodeId, row.ReferenceText
 
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | row.ReferencedEpisodeId | row.ReferencingEpisodeId | row.ReferenceText                                                                                                                                                                                                                                                                                                                                                                     |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | "45"                    | "37"                     | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "184"                    | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Interestingly it only returns two rows containing that phrase whereas we see it at least 8 times. Initially I thought this was an issue with the LOAD CSV command but if we filter the rows to only return ones that have a ‘ReferencedEpisodeId’ of ’45’ then we do see them returned:

==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | row                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "53", ReferenceText -> "The website counting down to the next slap (slapcountdown.com) that Marshall sends Barney reaches zero in Slapsgiving, when the third slap is delivered."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "55", ReferenceText -> "Ted gets rid of his butterfly tramp stamp through ten weekly sessions of laser tattoo removal between The Platinum Rule and Ten Sessions, over the course of which he meets, asks out, and eventually starts dating his dermatologist, Stella Zinman."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "57", ReferenceText -> "Ted gets rid of his butterfly tramp stamp through ten weekly sessions of laser tattoo removal between The Platinum Rule and Ten Sessions, over the course of which he meets, asks out, and eventually starts dating his dermatologist, Stella Zinman."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "56", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "200", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "100", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "86", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "113", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "161", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "37", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "184", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "37", ReferenceText -> "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}                                                                                                                                                                                                                                                                                                                                                                         |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "184", ReferenceText -> "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}                                                                                                                                                                                                                                                                                                                                                                        |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "63", ReferenceText -> "Marshall makes other home-made websites in Everything Must Go (lilyandmarshallselltheirstuff.com) and The Sexless Innkeeper (itwasthebestnightever.com), where Lily and Future Ted mention it being a problem."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "92", ReferenceText -> "Marshall makes other home-made websites in Everything Must Go (lilyandmarshallselltheirstuff.com) and The Sexless Innkeeper (itwasthebestnightever.com), where Lily and Future Ted mention it being a problem."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

So the actual problem is that the regex matcher doesn’t deal with the new line in the string.

Our next step is therefore to get rid of new lines within strings. I spent ages trying to find the appropriate command before coming across the following use of awk which does the job:

$ cat data/import/references.csv | awk '(NR-1)%2{$1=$1} {print $0}' RS=\" ORS=\" | wc -l
637
 
$ cat data/import/references.csv | awk '(NR-1)%2{$1=$1} {print $0}' RS=\" ORS=\" > data/import/refs.csv

Let’s try the LOAD CSV command again:

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/refs.csv" AS row
WITH row WHERE row.ReferenceText =~ ".*This is the Mother's first.*"
RETURN row.ReferencedEpisodeId, row.ReferencingEpisodeId, row.ReferenceText
 
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | row.ReferencedEpisodeId | row.ReferencingEpisodeId | row.ReferenceText                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | "45"                    | "56"                     | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "200"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "100"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "86"                     | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "113"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "161"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "37"                     | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "184"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "37"                     | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."                                                                                                                                                                                                                                                                                                                                                                        |
==> | "45"                    | "184"                    | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."                                                                                                                                                                                                                                                                                                                                                                        |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

And there we go!

Update

Michael pointed out that I could have used the dotall regex flag at the beginning of the regular expression in order to search across new lines without having to remove them! In that case the query would read like this:

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/references.csv" AS row
WITH row WHERE row.ReferenceText =~ "(?s).*This is the Mother.*"
RETURN row
 
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | row                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "56", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "200", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "100", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "86", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "113", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "161", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "37", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "184", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "37", ReferenceText -> "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}                                                                                                                                                                                                                                                                                                                                                                         |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "184", ReferenceText -> "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}                                                                                                                                                                                                                                                                                                                                                                        |
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Categories: Programming

Mac OS X: GNU sed – Hex string replacement / replacing new line characters

Thu, 06/11/2015 - 22:38

Recently I was working with a CSV file which contained both Windows and Unix line endings which was making it difficult to work with.

The actual line endings were HEX ‘0A0D’ i.e. Windows line breaks but there were also HEX ‘OA’ i.e. Unix line breaks within one of the columns.

I wanted to get rid of the Unix line breaks and discovered that you can do HEX sequence replacement using the GNU version of sed – unfortunately the Mac ships with the BSD version which doesn’t have this functionaltiy.

The first step was therefore to install the GNU version of sed.

brew install coreutils
brew install gnu-sed --with-default-names

I wanted to replace my system sed so that’s why I went with the ‘–with-default-names’ flag – without that flag I believe the sed installation would be accessible as ‘gs-sed’.

The following is an example of what the lines in the file look like:

$ echo -e "Hello\x0AMark\x0A\x0D"
Hello
Mark

We want to get rid of the new line in between ‘Hello’ and ‘Mark’ but leave the other one be. I adapted one of the commands from this tutorial to look for lines which end in ‘0A’ where that isn’t followed by a ‘0D':

$ echo -e "Hello\x0AMark\x0A\x0D" | \
  sed 'N;/\x0A[^\x0D]/s/\n/ /'
Hello Mark

Let’s go through the parts of the sed command:

  • N – this creates a multiline pattern space by reading a new line of input and appending it to the contents of the pattern space. The two lines are separated by a new line.
  • /\x0A[^\x0D]/ – this matches any lines which contain ‘OA’ not followed by ‘OD’
  • /s/\n/ / – this substitutes the new line character with a space for those matching lines from the previous command.

Now let’s check it works if we have multiple lines that we want to squash:

$ echo -e "Hello\x0AMark\x0A\x0DHello\x0AMichael\x0A\x0D"
Hello
Mark
Hello
Michael
 
$ echo -e "Hello\x0AMark\x0A\x0DHello\x0AMichael\x0A\x0D" | \
  sed 'N;/\x0A[^\x0D]/s/\n/ /'
Hello Mark
Hello Michael

Looks good! The actual file is a bit more nuanced so I’ve still got a bit more work to do but this is a good start.

Categories: Programming

Unix: Converting a file of values into a comma separated list

Mon, 06/08/2015 - 23:23

I recently had a bunch of values in a file that I wanted to paste into a Java program which required a comma separated list of strings.

This is what the file looked like:

$ cat foo2.txt | head -n 5
1.0
1.0
1.0
1.0
1.0

And the idea is that we would end up with something like this:

"1.0","1.0","1.0","1.0","1.0"


The first thing we need to do is quote each of the values. I found a nice way to do this using sed:

$ sed 's/.*/"&"/g' foo2.txt | head -n 5
"1.0"
"1.0"
"1.0"
"1.0"
"1.0"

Now that we’ve got all the values quoted we need to get rid of the new lines and replace them with commas. The way I’d normally do this is using ‘tr’ and then just not copy the final comma…

$ sed 's/.*/"&"/g' foo2.txt | tr '\n' ','
"1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0",

…but I learnt that we can actually do one better than this using ‘paste’ which allows you to replace new lines excluding the last one.

The only annoying thing about paste is that you can’t pipe to it so we need to use process substitution instead:

$ paste -s -d ',' <(sed 's/.*/"&"/g' foo2.txt)
"1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0"

If we’re only a Mac we could even automate the copy/paste step too by piping to ‘pbcopy':

$ paste -s -d ',' <(sed 's/.*/"&"/g' foo2.txt) | pbcopy
Categories: Programming

Netty: Testing encoders/decoders

Fri, 06/05/2015 - 22:25

I’ve been working with Netty a bit recently and having built a pipeline of encoders/decoders as described in this excellent tutorial wanted to test that the encoders and decoders were working without having to send real messages around.

Luckily there is a EmbeddedChannel which makes our life very easy indeed.

Let’s say we’ve got a message ‘Foo’ that we want to send across the wire. It only contains a single integer value so we’ll just send that and reconstruct ‘Foo’ on the other side.

We might write the following encoder to do this:

// Examples uses Netty 4.0.28.Final
public static class MessageEncoder extends MessageToMessageEncoder<Foo>
{
    @Override
    protected void encode( ChannelHandlerContext ctx, Foo msg, List<Object> out ) throws Exception
    {
        ByteBuf buf = ctx.alloc().buffer();
        buf.writeInt( msg.value() );
        out.add( buf );
    }
}
 
public static class Foo
{
    private Integer value;
 
    public Foo(Integer value)
    {
        this.value = value;
    }
 
    public int value()
    {
        return value;
    }
}

So all we’re doing is taking the ‘value’ field out of ‘Foo’ and putting it into the list which gets passed downstream.

Let’s write a test which simulates sending a ‘Foo’ message and has an empty decoder attempt to process the message:

@Test
public void shouldEncodeAndDecodeVoteRequest()
{
    // given
    EmbeddedChannel channel = new EmbeddedChannel( new MessageEncoder(), new MessageDecoder() );
 
    // when
    Foo foo = new Foo( 42 );
    channel.writeOutbound( foo );
    channel.writeInbound( channel.readOutbound() );
 
    // then
    Foo returnedFoo = (Foo) channel.readInbound();
    assertNotNull(returnedFoo);
    assertEquals( foo.value(), returnedFoo.value() );
}
 
public static class MessageDecoder extends MessageToMessageDecoder<ByteBuf>
{
    @Override
    protected void decode( ChannelHandlerContext ctx, ByteBuf msg, List<Object> out ) throws Exception { }
}

So in the test we write ‘Foo’ to the outbound channel and then read it back into the inbound channel and then check what we’ve got. If we run that test now this is what we’ll see:

junit.framework.AssertionFailedError
	at NettyTest.shouldEncodeAndDecodeVoteRequest(NettyTest.java:28)

The message we get back is null which makes sense given that we didn’t bother writing the decoder. Let’s implement the decoder then:

public static class MessageDecoder extends MessageToMessageDecoder<ByteBuf>
{
    @Override
    protected void decode( ChannelHandlerContext ctx, ByteBuf msg, List<Object> out ) throws Exception
    {
        int value = msg.readInt();
        out.add( new Foo(value) );
    }
}

Now if we run our test again it’s all green and happy. We can now go and encode/decode some more complex structures and update our test accordingly.

Categories: Programming

Neo4j: Cypher – Step by step to creating a linked list of adjacent nodes using UNWIND

Thu, 06/04/2015 - 23:17

In late 2013 I wrote a post showing how to create a linked list connecting different football seasons together using Neo4j’s Cypher query language, a post I’ve frequently copy & pasted from!

Now 18 months later, and using Neo4j 2.2 rather than 2.0, we can actually solve this problem in what I believe is a more intuitive way using the UNWIND function. Credit for the idea goes to Michael, I’m just the messenger.

To recap, we had a collection of football seasons and we wanted to connect adjacent seasons to each other to allow easy querying between seasons. The following is the code we used:

CREATE (:Season {name: "2013/2014", timestamp: 1375315200})
CREATE (:Season {name: "2012/2013", timestamp: 1343779200})
CREATE (:Season {name: "2011/2012", timestamp: 1312156800})
CREATE (:Season {name: "2010/2011", timestamp: 1280620800})
CREATE (:Season {name: "2009/2010", timestamp: 1249084800})
MATCH (s:Season)
WITH s
ORDER BY s.timestamp
WITH COLLECT(s) AS seasons
 
FOREACH(i in RANGE(0, length(seasons)-2) | 
    FOREACH(si in [seasons[i]] | 
        FOREACH(si2 in [seasons[i+1]] | 
            MERGE (si)-[:NEXT]->(si2))))

Our goal is to replace those 3 FOREACH loops with something a bit easier to understand. To start with, let’s run the first part of the query to get some intuition of what we’re trying to do:

MATCH (s:Season)
WITH s
ORDER BY s.timestamp
RETURN COLLECT(s) AS seasons
 
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | seasons                                                                                                                                                                                                                                                     |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | [Node[1973]{timestamp:1249084800,name:"2009/2010"},Node[1972]{timestamp:1280620800,name:"2010/2011"},Node[1971]{timestamp:1312156800,name:"2011/2012"},Node[1970]{timestamp:1343779200,name:"2012/2013"},Node[1969]{timestamp:1375315200,name:"2013/2014"}] |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

So at this point we’ve got all the seasons in an array going from 2009/2010 up to 2013/2014. We want to create a ‘NEXT’ relationship between 2009/2010 -> 2010/2011, 2010/2011 -> 2011/2012 and so on.

To achieve this we need to get the adjacent seasons split into two columns, like so:

2009/2010	2010/2011
2010/2011	2011/2012
2011/2012	2012/2013
2012/2013	2013/2014

If we can get the data into that format then we can apply a MERGE between the two fields to create the ‘NEXT’ relationship. So how do we do that?

If we were in Python we’d be calling for the zip function which we could apply like this:

>>> seasons = ["2009/2010", "2010/2011", "2011/2012", "2012/2013", "2013/2014"]
 
>>> zip(seasons, seasons[1:])
[('2009/2010', '2010/2011'), ('2010/2011', '2011/2012'), ('2011/2012', '2012/2013'), ('2012/2013', '2013/2014')]

Unfortunately we don’t have an equivalent function in Cypher but we can achieve the same outcome by creating 2 columns with adjacent integer values. The RANGE and UNWIND functions are our friends here:

return RANGE(0,4)
 
==> +-------------+
==> | RANGE(0,4)  |
==> +-------------+
==> | [0,1,2,3,4] |
==> +-------------+
UNWIND RANGE(0,4) as idx 
RETURN idx, idx +1;
 
==> +--------------+
==> | idx | idx +1 |
==> +--------------+
==> | 0   | 1      |
==> | 1   | 2      |
==> | 2   | 3      |
==> | 3   | 4      |
==> | 4   | 5      |
==> +--------------+
==> 5 rows

Now all we need to do is plug this code into our original query where ‘idx’ and ‘idx + 1′ represent indexes into the array of seasons. We use a range which stops 1 element early since there isn’t anywhere to connect our last season to:

MATCH (s:Season)
WITH s
ORDER BY s.timestamp
WITH COLLECT(s) AS seasons
UNWIND RANGE(0,LENGTH(seasons) - 2) as idx 
RETURN seasons[idx], seasons[idx+1]
 
==> +-------------------------------------------------------------------------------------------------------+
==> | seasons[idx]                                      | seasons[idx+1]                                    |
==> +-------------------------------------------------------------------------------------------------------+
==> | Node[1973]{timestamp:1249084800,name:"2009/2010"} | Node[1972]{timestamp:1280620800,name:"2010/2011"} |
==> | Node[1972]{timestamp:1280620800,name:"2010/2011"} | Node[1971]{timestamp:1312156800,name:"2011/2012"} |
==> | Node[1971]{timestamp:1312156800,name:"2011/2012"} | Node[1970]{timestamp:1343779200,name:"2012/2013"} |
==> | Node[1970]{timestamp:1343779200,name:"2012/2013"} | Node[1969]{timestamp:1375315200,name:"2013/2014"} |
==> +-------------------------------------------------------------------------------------------------------+
==> 4 rows

Now we’ve got all the adjacent seasons lined up we complete the query with a call to MERGE:

MATCH (s:Season)
WITH s
ORDER BY s.timestamp
WITH COLLECT(s) AS seasons
UNWIND RANGE(0,LENGTH(seasons) - 2) as idx 
WITH seasons[idx] AS s1, seasons[idx+1] AS s2
MERGE (s1)-[:NEXT]->(s2)
 
==> +-------------------+
==> | No data returned. |
==> +-------------------+
==> Relationships created: 4

And we’re done. Hopefully I can remember this approach more than I did the initial one!

Categories: Programming

R: ggplot geom_density – Error in exists(name, envir = env, mode = mode) : argument “env” is missing, with no default

Wed, 06/03/2015 - 06:52

Continuing on from yesterday’s blog post where I worked out how to clean up the Think Bayes Price is Right data set, the next task was to plot a distribution of the prices of show case items.

To recap, this is what the data frame we’re working with looks like:

library(dplyr)
 
df2011 = read.csv("~/projects/rLearning/showcases.2011.csv", na.strings = c("", "NA"))
df2011 = df2011 %>% na.omit()
 
> df2011 %>% head()
              X Sep..19 Sep..20 Sep..21 Sep..22 Sep..23 Sep..26 Sep..27 Sep..28 Sep..29 Sep..30 Oct..3
3    Showcase 1   50969   21901   32815   44432   24273   30554   20963   28941   25851   28800  37703
4    Showcase 2   45429   34061   53186   31428   22320   24337   41373   45437   41125   36319  38752
6         Bid 1   42000   14000   32000   27000   18750   27222   25000   35000   22500   21300  21567
7         Bid 2   34000   59900   45000   38000   23000   18525   32000   45000   32000   27500  23800
9  Difference 1    8969    7901     815   17432    5523    3332   -4037   -6059    3351    7500  16136
10 Difference 2   11429  -25839    8186   -6572    -680    5812    9373     437    9125    8819  14952
...

So our goal is to plot the density of the ‘Showcase 1′ items. Unfortunately those aren’t currently stored in a way that makes this easy for us. We need to flip the data frame so that we have a row for each date/price type/price:

PriceType  Date     Price
Showcase 1 Sep..19  50969
Showcase 2 Sep..19  21901
...
Showcase 1 Sep..20  45429
Showcase 2 Sep..20  34061

The reshape library’s melt function is our friend here:

library(reshape)
meltedDf = melt(df2011, id=c("X"))
 
> meltedDf %>% sample_n(10)
                X variable value
643    Showcase 1  Feb..24 27883
224    Showcase 2  Nov..10 34089
1062 Difference 2   Jun..4  9962
770    Showcase 2  Mar..28 39620
150  Difference 2  Oct..24  9137
431  Difference 1   Jan..4  7516
345         Bid 1  Dec..12 21569
918  Difference 2    May.1 -2093
536    Showcase 2  Jan..31 30918
502         Bid 2  Jan..23 27000

Now we need to plug this into ggplot. We’ll start by just plotting all the prices for showcase 1:

> ggplot(aes(x = value), data = meltedDf %>% filter(X == "Showcase 1")) +
    geom_density()
 
Error in exists(name, envir = env, mode = mode) : 
  argument "env" is missing, with no default


This error usually means that you’ve passed an empty data set to ggplot which isn’t the case here, but if we extract the values column we can see the problem:

> meltedDf$value[1:10]
 [1] "50969" "45429" "42000" "34000" "8969"  "11429" "21901" "34061" "14000" "59900"

They are all strings! Making it very difficult to plot a density curve which relies on the data being continuous. Let’s fix that and try again:

meltedDf$value = as.numeric(meltedDf$value)

ggplot(aes(x = value), data = meltedDf %>% filter(X == "Showcase 1")) +
  geom_density()

2015 06 03 06 46 48

If we want to show the curves for both showcases we can tweak our code slightly:

ggplot(meltedDf %>% filter(grepl("Showcase", X)), aes(x = value, colour = X)) + 
  geom_density() + 
  theme(legend.position="top")
2015 06 03 06 50 35

Et voila!

Categories: Programming

R: dplyr – removing empty rows

Tue, 06/02/2015 - 07:49

I’m still working my way through the exercises in Think Bayes and in Chapter 6 needed to do some cleaning of the data in a CSV file containing information about the Price is Right.

I downloaded the file using wget:

wget http://www.greenteapress.com/thinkbayes/showcases.2011.csv

And then loaded it into R and explored the first few rows using dplyr

library(dplyr)
df2011 = read.csv("~/projects/rLearning/showcases.2011.csv")
 
> df2011 %>% head(10)
 
           X Sep..19 Sep..20 Sep..21 Sep..22 Sep..23 Sep..26 Sep..27 Sep..28 Sep..29 Sep..30 Oct..3
1              5631K   5632K   5633K   5634K   5635K   5641K   5642K   5643K   5644K   5645K  5681K
2                                                                                                  
3 Showcase 1   50969   21901   32815   44432   24273   30554   20963   28941   25851   28800  37703
4 Showcase 2   45429   34061   53186   31428   22320   24337   41373   45437   41125   36319  38752
5                                                                                                  
...

As you can see, we have some empty rows which we want to get rid of to ease future processing. I couldn’t find an easy way to filter those out but what we can do instead is have empty columns converted to ‘NA’ and then filter those.

First we need to tell read.csv to treat empty columns as NA:

df2011 = read.csv("~/projects/rLearning/showcases.2011.csv", na.strings = c("", "NA"))

And now we can filter them out using na.omit:

df2011 = df2011 %>% na.omit()
 
> df2011  %>% head(5)
             X Sep..19 Sep..20 Sep..21 Sep..22 Sep..23 Sep..26 Sep..27 Sep..28 Sep..29 Sep..30 Oct..3
3   Showcase 1   50969   21901   32815   44432   24273   30554   20963   28941   25851   28800  37703
4   Showcase 2   45429   34061   53186   31428   22320   24337   41373   45437   41125   36319  38752
6        Bid 1   42000   14000   32000   27000   18750   27222   25000   35000   22500   21300  21567
7        Bid 2   34000   59900   45000   38000   23000   18525   32000   45000   32000   27500  23800
9 Difference 1    8969    7901     815   17432    5523    3332   -4037   -6059    3351    7500  16136
...

Much better!

Categories: Programming