Skip to content

Software Development Blogs: Programming, Software Testing, Agile Project Management

Methods & Tools

Subscribe to Methods & Tools
if you are not afraid to read more than one page to be a smarter software developer, software tester or project manager!

Mark Needham
Syndicate content
Thoughts on Software Development
Updated: 5 hours 37 min ago

Northwind: Finding direct/transitive Reports in SQL and Neo4j’s Cypher

Mon, 06/15/2015 - 23:53

Every few months we run a relational to graph meetup at the Neo London office where we go through how to take your data from a relational database and into the graph.

We use the Northwind dataset which often comes as a demo dataset on relational databases and come up with some queries which seem graph in nature.

My favourite query is one which finds out how employees are organised and who reports to whom. I thought it’d be quite interesting to see what it would look like in Postgres SQL as well, just for fun.

We’ll start off by getting a list of employees and the person they report to:

SELECT e."EmployeeID", e."ReportsTo"
FROM employees AS e
WHERE e."ReportsTo" IS NOT NULL;
 
 EmployeeID | ReportsTo
------------+-----------
          1 |         2
          3 |         2
          4 |         2
          5 |         2
          6 |         5
          7 |         5
          8 |         2
          9 |         5
(8 ROWS)

In cypher we’d do this:

MATCH (e:Employee)<-[:REPORTS_TO]-(sub)
RETURN sub.EmployeeID, e.EmployeeID 
 
+-------------------------------+
| sub.EmployeeID | e.EmployeeID |
+-------------------------------+
| "4"            | "2"          |
| "5"            | "2"          |
| "1"            | "2"          |
| "3"            | "2"          |
| "8"            | "2"          |
| "9"            | "5"          |
| "6"            | "5"          |
| "7"            | "5"          |
+-------------------------------+
8 rows

Next let’s find the big boss who doesn’t report to anyone. First in SQL:

SELECT e."EmployeeID" AS bigBoss
FROM employees AS e
WHERE e."ReportsTo" IS NULL
 
 bigboss
---------
       2
(1 ROW)

And now cypher:

MATCH (e:Employee)
WHERE NOT (e)-[:REPORTS_TO]->()
RETURN e.EmployeeID AS bigBoss
 
+---------+
| bigBoss |
+---------+
| "2"     |
+---------+
1 row

We still don’t need to join anything so the query isn’t that interesting yet. Let’s bring in some more properties from the manager record so we have to self join on the employees table:

SELECT e."FirstName", e."LastName", e."Title", manager."FirstName", manager."LastName", manager."Title"
FROM employees AS e
JOIN employees AS manager ON e."ReportsTo" = manager."EmployeeID"
WHERE e."ReportsTo" IS NOT NULL
 
 FirstName | LastName  |          Title           | FirstName | LastName |         Title
-----------+-----------+--------------------------+-----------+----------+-----------------------
 Nancy     | Davolio   | Sales Representative     | Andrew    | Fuller   | Vice President, Sales
 Janet     | Leverling | Sales Representative     | Andrew    | Fuller   | Vice President, Sales
 Margaret  | Peacock   | Sales Representative     | Andrew    | Fuller   | Vice President, Sales
 Steven    | Buchanan  | Sales Manager            | Andrew    | Fuller   | Vice President, Sales
 Michael   | Suyama    | Sales Representative     | Steven    | Buchanan | Sales Manager
 Robert    | King      | Sales Representative     | Steven    | Buchanan | Sales Manager
 Laura     | Callahan  | Inside Sales Coordinator | Andrew    | Fuller   | Vice President, Sales
 Anne      | Dodsworth | Sales Representative     | Steven    | Buchanan | Sales Manager
(8 ROWS)
MATCH (e:Employee)<-[:REPORTS_TO]-(sub)
RETURN sub.FirstName, sub.LastName, sub.Title, e.FirstName, e.LastName, e.Title
 
+----------------------------------------------------------------------------------------------------------------+
| sub.FirstName | sub.LastName | sub.Title                  | e.FirstName | e.LastName | e.Title                 |
+----------------------------------------------------------------------------------------------------------------+
| "Margaret"    | "Peacock"    | "Sales Representative"     | "Andrew"    | "Fuller"   | "Vice President, Sales" |
| "Steven"      | "Buchanan"   | "Sales Manager"            | "Andrew"    | "Fuller"   | "Vice President, Sales" |
| "Nancy"       | "Davolio"    | "Sales Representative"     | "Andrew"    | "Fuller"   | "Vice President, Sales" |
| "Janet"       | "Leverling"  | "Sales Representative"     | "Andrew"    | "Fuller"   | "Vice President, Sales" |
| "Laura"       | "Callahan"   | "Inside Sales Coordinator" | "Andrew"    | "Fuller"   | "Vice President, Sales" |
| "Anne"        | "Dodsworth"  | "Sales Representative"     | "Steven"    | "Buchanan" | "Sales Manager"         |
| "Michael"     | "Suyama"     | "Sales Representative"     | "Steven"    | "Buchanan" | "Sales Manager"         |
| "Robert"      | "King"       | "Sales Representative"     | "Steven"    | "Buchanan" | "Sales Manager"         |
+----------------------------------------------------------------------------------------------------------------+
8 rows

Now let’s see how many direct reports each manager has:

SELECT manager."EmployeeID" AS manager, COUNT(e."EmployeeID") AS reports
FROM employees AS manager
LEFT JOIN employees AS e ON e."ReportsTo" = manager."EmployeeID"
GROUP BY manager
ORDER BY reports DESC;
 
 manager | reports
---------+---------
       2 |       5
       5 |       3
       1 |       0
       3 |       0
       4 |       0
       9 |       0
       6 |       0
       7 |       0
       8 |       0
(9 ROWS)
MATCH (e:Employee)
OPTIONAL MATCH (e)<-[rel:REPORTS_TO]-(report)
RETURN e.EmployeeID AS employee, COUNT(rel) AS reports
 
+--------------------+
| employee | reports |
+--------------------+
| "2"      | 5       |
| "5"      | 3       |
| "8"      | 0       |
| "7"      | 0       |
| "1"      | 0       |
| "4"      | 0       |
| "6"      | 0       |
| "9"      | 0       |
| "3"      | 0       |
+--------------------+
9 rows

Things start to get more interesting if we find the transitive reporting relationships that exist. I’m not an expert at Postgres but one way to achieve this is by writing a recursive WITH query like so:

WITH RECURSIVE recursive_employees("EmployeeID", "ReportsTo") AS (
        SELECT e."EmployeeID", e."ReportsTo"
        FROM employees e
      UNION ALL
        SELECT e."EmployeeID", e."ReportsTo"
        FROM employees e, recursive_employees re
        WHERE e."EmployeeID" = re."ReportsTo"
)
SELECT re."ReportsTo", COUNT(*) AS COUNT
FROM recursive_employees AS re
WHERE re."ReportsTo" IS NOT NULL
GROUP BY re."ReportsTo";
 
 ReportsTo | COUNT
-----------+-------
         2 |     8
         5 |     3
(2 ROWS)

If there’s a simpler way let me know in the comments.

In cypher we only need to add one character, ‘*’, after the ‘REPORTS_TO’ relationship to get it to recurse as far as it can. We’ll also remove the ‘OPTIONAL MATCH’ so that we only get back people who have people reporting to them:

MATCH (e:Employee)<-[rel:REPORTS_TO*]-(report)
RETURN e.EmployeeID AS employee, COUNT(rel) AS reports
 
+--------------------+
| employee | reports |
+--------------------+
| "2"      | 8       |
| "5"      | 3       |
+--------------------+
2 rows

Now I need to find some relational datasets with more complicated queries to play around with. If you have any ideas do let me know.

Categories: Programming

The Willpower Instinct: Reducing time spent mindlessly scrolling for things to read

Sat, 06/13/2015 - 00:12

I recently finished reading Kelly McGonigal’s excellent book ‘The Willpower Instinct‘ having previously watched her Google talk of the same title

My main takeaway from the book is that there are things that we want to do (or not do) but doing them (or not as the case may be) isn’t necessarily instinctive and so we need to develop some strategies to help ourselves out.

In one of the early chapters she suggests picking a habit that you want to do less off and write down on a piece of paper every time you want to do it and how you’re feeling at that point.

After writing it down you’re free to then follow through and do it but you don’t have to if you change your mind.

I was quite aware of the fact that I spend a lot of time idly scrolling from email to Twitter to Facebook to LinkedIn to news websites and back again so I thought it’d be interesting to track when/why I was doing this. The annoying thing about this habit is that it can easily eat up 20-30 minutes at a time without you even noticing.

I’ve been tracking myself for about three weeks and in the first few days I noticed that the first thing I did as soon as I woke up was grab my phone and get into the cycle.

It was quite frustrating to be lured in so early in the day but one of the suggestions in the book is that feeling guilty about something is actually detrimental to our progress. Instead we should note why it happened and then move on – the day isn’t a write off because of one event!

Kelly suggests that if we can work out the times when we’re most likely to fall into our habits then we can pre-plan a mitigation strategy.

From looking over my notes the following are the reasons why I want to start mindlessly scrolling:

  • I’m stuck on the problem I’m working on
  • I’m bored
  • I’m tired
  • I’m hungry
  • I’m getting distracted by notifications
  • I want to not think for a while

The notifications bullet is easy to address – I turn off notifications on my phone for 4 hours at a time so I don’t even know there’s anything to read.

I was intrigued to note that I got distracted when stuck on a problem – the main take away here is to check whether the urge to scroll mindlessly is being driven by having to think hard. If it is then I can choose to either get back to it or go for a short walk and then come back. But definitely don’t start scrolling!

I often find myself bored on my commute to work so I’ve addressed this by working out a book/paper I’m going to read the night before and then having that ready for the journey. Lunch time is prime time for mindless scrolling as well so I’ve filled that time with various computer science/data science videos.

Since I started tracking my scrolling I’ve found myself sleeping earlier so my assumption is that the extra hours awake were being spent mindlessly scrolling which led to being more tired so a win all around in that respect.

Something I’ve noticed is that I’m sometimes wasting time on other activities which I’m are not ‘forbidden’ but are equally unconstructive e.g. chat applications / watching music videos.

The former are obviously useful for communicating with people so I’ve been trying to use them only when I actually want to chat to someone rather than mindlessly looking for messages to read.

I also find myself not wanting to write down the times I’ve mindlessly scrolled when I’m doing it a lot on a given day. Being aware of this is helpful as I just write it down anyway and get on with the day.

The summary of my experience so far is it seems beneficial – I don’t think I’ve lost anything by not checking those mediums so often and I’ve definitely read a lot more than I usually do and been more focused as well.

Now I need to go and try out some of the other exercises from the book – if you’ve read it / tried out any of the tips I’d love to hear what’s worked well for you.

Categories: Programming

Neo4j: Using LOAD CSV to help explore CSV files

Fri, 06/12/2015 - 00:15

During the Neo4j How I met your mother hackathon that we ran last week one of the attendees noticed that one of the CSV files we were importing wasn’t creating as many records as they expected it to.

This is typically the case when there’s some odd quoting in the CSV file but we decided to look into it.

The file in question was one containing references made in HIMYM. The first 5 lines look like this:

$ head -n 5 data/import/references.csv
ReferencedEpisodeId,ReferencingEpisodeId,ReferenceText
168,184,"Marshall will eventually hear back from the New York State Judicatory Committee in Something New, which will become a main plot point of Season 9."
168,169,Barney proclaiming to be done with Robin will be the focal point of Lobster Crawl.
58,57,"Barney finally confronts his saboteur (Abby, whom he slept with in Ten Sessions) in Everything Must Go."
58,63,"Barney finally confronts his saboteur (Abby, whom he slept with in Ten Sessions) in Everything Must Go."

And this is how many lines the Unix ‘wc’ command sees:

$ wc -l data/import/references.csv
     782 data/import/references.csv

So we might expect that there are going to be 782 records created if we import that file into Neo4j. Let’s run a quick query in Neo4j to see what it thinks:

LOAD CSV WITH HEADERS 
FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/references.csv" 
AS row
return count(*)
 
==> +----------+
==> | count(*) |
==> +----------+
==> | 636      |
==> +----------+
==> 1 row

So we have 146 less records than we expected which means Neo4j is treating multiple lines as one CSV line in some cases.

Let’s go back to the Unix command line to try and work out which lines those are. There must be some lines which start with part of the ‘ReferenceText’ rather than a ‘ReferenceEpisodeId’ so let’s extract the first column and see what’s going on there:

$ cat data/import/references.csv | cut -d"," -f1 | grep -v  '[0-9]\+$'| head -n 10
ReferencedEpisodeId
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny
Also

We’ve extracted the first column and then filter the output to only keep rows which don’t contain all numbers which will be our rogue rows.

Let’s switch back to Neo4j land to see which rows it thinks contains these fragments of text:

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/references.csv" AS row
WITH row WHERE row.ReferenceText =~ ".*This is the Mother's first.*"
RETURN row.ReferencedEpisodeId, row.ReferencingEpisodeId, row.ReferenceText
 
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | row.ReferencedEpisodeId | row.ReferencingEpisodeId | row.ReferenceText                                                                                                                                                                                                                                                                                                                                                                     |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | "45"                    | "37"                     | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "184"                    | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Interestingly it only returns two rows containing that phrase whereas we see it at least 8 times. Initially I thought this was an issue with the LOAD CSV command but if we filter the rows to only return ones that have a ‘ReferencedEpisodeId’ of ’45’ then we do see them returned:

==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | row                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "53", ReferenceText -> "The website counting down to the next slap (slapcountdown.com) that Marshall sends Barney reaches zero in Slapsgiving, when the third slap is delivered."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "55", ReferenceText -> "Ted gets rid of his butterfly tramp stamp through ten weekly sessions of laser tattoo removal between The Platinum Rule and Ten Sessions, over the course of which he meets, asks out, and eventually starts dating his dermatologist, Stella Zinman."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "57", ReferenceText -> "Ted gets rid of his butterfly tramp stamp through ten weekly sessions of laser tattoo removal between The Platinum Rule and Ten Sessions, over the course of which he meets, asks out, and eventually starts dating his dermatologist, Stella Zinman."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "56", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "200", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "100", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "86", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "113", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "161", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "37", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "184", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "37", ReferenceText -> "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}                                                                                                                                                                                                                                                                                                                                                                         |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "184", ReferenceText -> "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}                                                                                                                                                                                                                                                                                                                                                                        |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "63", ReferenceText -> "Marshall makes other home-made websites in Everything Must Go (lilyandmarshallselltheirstuff.com) and The Sexless Innkeeper (itwasthebestnightever.com), where Lily and Future Ted mention it being a problem."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "92", ReferenceText -> "Marshall makes other home-made websites in Everything Must Go (lilyandmarshallselltheirstuff.com) and The Sexless Innkeeper (itwasthebestnightever.com), where Lily and Future Ted mention it being a problem."}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

So the actual problem is that the regex matcher doesn’t deal with the new line in the string.

Our next step is therefore to get rid of new lines within strings. I spent ages trying to find the appropriate command before coming across the following use of awk which does the job:

$ cat data/import/references.csv | awk '(NR-1)%2{$1=$1} {print $0}' RS=\" ORS=\" | wc -l
637
 
$ cat data/import/references.csv | awk '(NR-1)%2{$1=$1} {print $0}' RS=\" ORS=\" > data/import/refs.csv

Let’s try the LOAD CSV command again:

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/refs.csv" AS row
WITH row WHERE row.ReferenceText =~ ".*This is the Mother's first.*"
RETURN row.ReferencedEpisodeId, row.ReferencingEpisodeId, row.ReferenceText
 
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | row.ReferencedEpisodeId | row.ReferencingEpisodeId | row.ReferenceText                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | "45"                    | "56"                     | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "200"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "100"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "86"                     | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "113"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "161"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "37"                     | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "184"                    | "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton. This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9." |
==> | "45"                    | "37"                     | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."                                                                                                                                                                                                                                                                                                                                                                        |
==> | "45"                    | "184"                    | "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."                                                                                                                                                                                                                                                                                                                                                                        |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

And there we go!

Update

Michael pointed out that I could have used the dotall regex flag at the beginning of the regular expression in order to search across new lines without having to remove them! In that case the query would read like this:

LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/references.csv" AS row
WITH row WHERE row.ReferenceText =~ "(?s).*This is the Mother.*"
RETURN row
 
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | row                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "56", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "200", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "100", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "86", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "113", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "161", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "37", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}  |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "184", ReferenceText -> "Ted eventually acquires the yellow umbrella in No Tomorrow (after the Mother leaves it behind at the St. Patrick's Day party, as seen in How Your Mother Met Me), and leaves it in Cindy's and the Mother's apartment in Girls Versus Suits. The umbrella is also seen/referenced in many other episodes, including Right Place, Right Time, Big Days, and Farhampton.
==> This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."} |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "37", ReferenceText -> "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}                                                                                                                                                                                                                                                                                                                                                                         |
==> | {ReferencedEpisodeId -> "45", ReferencingEpisodeId -> "184", ReferenceText -> "This is the Mother's first on-screen appearance with the yellow umbrella. Previously she appeared in Lucky Penny, with her head obscured by a bridal veil. She is seen again in No Tomorrow, again hidden by the umbrella, her ankle is seen briefly in Girls Versus Suits, and she gets her first proper appearance in Something New, after which she appears throughout Season 9."}                                                                                                                                                                                                                                                                                                                                                                        |
==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Categories: Programming

Mac OS X: GNU sed – Hex string replacement / replacing new line characters

Thu, 06/11/2015 - 22:38

Recently I was working with a CSV file which contained both Windows and Unix line endings which was making it difficult to work with.

The actual line endings were HEX ‘0A0D’ i.e. Windows line breaks but there were also HEX ‘OA’ i.e. Unix line breaks within one of the columns.

I wanted to get rid of the Unix line breaks and discovered that you can do HEX sequence replacement using the GNU version of sed – unfortunately the Mac ships with the BSD version which doesn’t have this functionaltiy.

The first step was therefore to install the GNU version of sed.

brew install coreutils
brew install gnu-sed --with-default-names

I wanted to replace my system sed so that’s why I went with the ‘–with-default-names’ flag – without that flag I believe the sed installation would be accessible as ‘gs-sed’.

The following is an example of what the lines in the file look like:

$ echo -e "Hello\x0AMark\x0A\x0D"
Hello
Mark

We want to get rid of the new line in between ‘Hello’ and ‘Mark’ but leave the other one be. I adapted one of the commands from this tutorial to look for lines which end in ‘0A’ where that isn’t followed by a ‘0D':

$ echo -e "Hello\x0AMark\x0A\x0D" | \
  sed 'N;/\x0A[^\x0D]/s/\n/ /'
Hello Mark

Let’s go through the parts of the sed command:

  • N – this creates a multiline pattern space by reading a new line of input and appending it to the contents of the pattern space. The two lines are separated by a new line.
  • /\x0A[^\x0D]/ – this matches any lines which contain ‘OA’ not followed by ‘OD’
  • /s/\n/ / – this substitutes the new line character with a space for those matching lines from the previous command.

Now let’s check it works if we have multiple lines that we want to squash:

$ echo -e "Hello\x0AMark\x0A\x0DHello\x0AMichael\x0A\x0D"
Hello
Mark
Hello
Michael
 
$ echo -e "Hello\x0AMark\x0A\x0DHello\x0AMichael\x0A\x0D" | \
  sed 'N;/\x0A[^\x0D]/s/\n/ /'
Hello Mark
Hello Michael

Looks good! The actual file is a bit more nuanced so I’ve still got a bit more work to do but this is a good start.

Categories: Programming

Unix: Converting a file of values into a comma separated list

Mon, 06/08/2015 - 23:23

I recently had a bunch of values in a file that I wanted to paste into a Java program which required a comma separated list of strings.

This is what the file looked like:

$ cat foo2.txt | head -n 5
1.0
1.0
1.0
1.0
1.0

And the idea is that we would end up with something like this:

"1.0","1.0","1.0","1.0","1.0"


The first thing we need to do is quote each of the values. I found a nice way to do this using sed:

$ sed 's/.*/"&"/g' foo2.txt | head -n 5
"1.0"
"1.0"
"1.0"
"1.0"
"1.0"

Now that we’ve got all the values quoted we need to get rid of the new lines and replace them with commas. The way I’d normally do this is using ‘tr’ and then just not copy the final comma…

$ sed 's/.*/"&"/g' foo2.txt | tr '\n' ','
"1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0",

…but I learnt that we can actually do one better than this using ‘paste’ which allows you to replace new lines excluding the last one.

The only annoying thing about paste is that you can’t pipe to it so we need to use process substitution instead:

$ paste -s -d ',' <(sed 's/.*/"&"/g' foo2.txt)
"1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0","1.0"

If we’re only a Mac we could even automate the copy/paste step too by piping to ‘pbcopy':

$ paste -s -d ',' <(sed 's/.*/"&"/g' foo2.txt) | pbcopy
Categories: Programming

Netty: Testing encoders/decoders

Fri, 06/05/2015 - 22:25

I’ve been working with Netty a bit recently and having built a pipeline of encoders/decoders as described in this excellent tutorial wanted to test that the encoders and decoders were working without having to send real messages around.

Luckily there is a EmbeddedChannel which makes our life very easy indeed.

Let’s say we’ve got a message ‘Foo’ that we want to send across the wire. It only contains a single integer value so we’ll just send that and reconstruct ‘Foo’ on the other side.

We might write the following encoder to do this:

// Examples uses Netty 4.0.28.Final
public static class MessageEncoder extends MessageToMessageEncoder<Foo>
{
    @Override
    protected void encode( ChannelHandlerContext ctx, Foo msg, List<Object> out ) throws Exception
    {
        ByteBuf buf = ctx.alloc().buffer();
        buf.writeInt( msg.value() );
        out.add( buf );
    }
}
 
public static class Foo
{
    private Integer value;
 
    public Foo(Integer value)
    {
        this.value = value;
    }
 
    public int value()
    {
        return value;
    }
}

So all we’re doing is taking the ‘value’ field out of ‘Foo’ and putting it into the list which gets passed downstream.

Let’s write a test which simulates sending a ‘Foo’ message and has an empty decoder attempt to process the message:

@Test
public void shouldEncodeAndDecodeVoteRequest()
{
    // given
    EmbeddedChannel channel = new EmbeddedChannel( new MessageEncoder(), new MessageDecoder() );
 
    // when
    Foo foo = new Foo( 42 );
    channel.writeOutbound( foo );
    channel.writeInbound( channel.readOutbound() );
 
    // then
    Foo returnedFoo = (Foo) channel.readInbound();
    assertNotNull(returnedFoo);
    assertEquals( foo.value(), returnedFoo.value() );
}
 
public static class MessageDecoder extends MessageToMessageDecoder<ByteBuf>
{
    @Override
    protected void decode( ChannelHandlerContext ctx, ByteBuf msg, List<Object> out ) throws Exception { }
}

So in the test we write ‘Foo’ to the outbound channel and then read it back into the inbound channel and then check what we’ve got. If we run that test now this is what we’ll see:

junit.framework.AssertionFailedError
	at NettyTest.shouldEncodeAndDecodeVoteRequest(NettyTest.java:28)

The message we get back is null which makes sense given that we didn’t bother writing the decoder. Let’s implement the decoder then:

public static class MessageDecoder extends MessageToMessageDecoder<ByteBuf>
{
    @Override
    protected void decode( ChannelHandlerContext ctx, ByteBuf msg, List<Object> out ) throws Exception
    {
        int value = msg.readInt();
        out.add( new Foo(value) );
    }
}

Now if we run our test again it’s all green and happy. We can now go and encode/decode some more complex structures and update our test accordingly.

Categories: Programming

Neo4j: Cypher – Step by step to creating a linked list of adjacent nodes using UNWIND

Thu, 06/04/2015 - 23:17

In late 2013 I wrote a post showing how to create a linked list connecting different football seasons together using Neo4j’s Cypher query language, a post I’ve frequently copy & pasted from!

Now 18 months later, and using Neo4j 2.2 rather than 2.0, we can actually solve this problem in what I believe is a more intuitive way using the UNWIND function. Credit for the idea goes to Michael, I’m just the messenger.

To recap, we had a collection of football seasons and we wanted to connect adjacent seasons to each other to allow easy querying between seasons. The following is the code we used:

CREATE (:Season {name: "2013/2014", timestamp: 1375315200})
CREATE (:Season {name: "2012/2013", timestamp: 1343779200})
CREATE (:Season {name: "2011/2012", timestamp: 1312156800})
CREATE (:Season {name: "2010/2011", timestamp: 1280620800})
CREATE (:Season {name: "2009/2010", timestamp: 1249084800})
MATCH (s:Season)
WITH s
ORDER BY s.timestamp
WITH COLLECT(s) AS seasons
 
FOREACH(i in RANGE(0, length(seasons)-2) | 
    FOREACH(si in [seasons[i]] | 
        FOREACH(si2 in [seasons[i+1]] | 
            MERGE (si)-[:NEXT]->(si2))))

Our goal is to replace those 3 FOREACH loops with something a bit easier to understand. To start with, let’s run the first part of the query to get some intuition of what we’re trying to do:

MATCH (s:Season)
WITH s
ORDER BY s.timestamp
RETURN COLLECT(s) AS seasons
 
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | seasons                                                                                                                                                                                                                                                     |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | [Node[1973]{timestamp:1249084800,name:"2009/2010"},Node[1972]{timestamp:1280620800,name:"2010/2011"},Node[1971]{timestamp:1312156800,name:"2011/2012"},Node[1970]{timestamp:1343779200,name:"2012/2013"},Node[1969]{timestamp:1375315200,name:"2013/2014"}] |
==> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

So at this point we’ve got all the seasons in an array going from 2009/2010 up to 2013/2014. We want to create a ‘NEXT’ relationship between 2009/2010 -> 2010/2011, 2010/2011 -> 2011/2012 and so on.

To achieve this we need to get the adjacent seasons split into two columns, like so:

2009/2010	2010/2011
2010/2011	2011/2012
2011/2012	2012/2013
2012/2013	2013/2014

If we can get the data into that format then we can apply a MERGE between the two fields to create the ‘NEXT’ relationship. So how do we do that?

If we were in Python we’d be calling for the zip function which we could apply like this:

>>> seasons = ["2009/2010", "2010/2011", "2011/2012", "2012/2013", "2013/2014"]
 
>>> zip(seasons, seasons[1:])
[('2009/2010', '2010/2011'), ('2010/2011', '2011/2012'), ('2011/2012', '2012/2013'), ('2012/2013', '2013/2014')]

Unfortunately we don’t have an equivalent function in Cypher but we can achieve the same outcome by creating 2 columns with adjacent integer values. The RANGE and UNWIND functions are our friends here:

return RANGE(0,4)
 
==> +-------------+
==> | RANGE(0,4)  |
==> +-------------+
==> | [0,1,2,3,4] |
==> +-------------+
UNWIND RANGE(0,4) as idx 
RETURN idx, idx +1;
 
==> +--------------+
==> | idx | idx +1 |
==> +--------------+
==> | 0   | 1      |
==> | 1   | 2      |
==> | 2   | 3      |
==> | 3   | 4      |
==> | 4   | 5      |
==> +--------------+
==> 5 rows

Now all we need to do is plug this code into our original query where ‘idx’ and ‘idx + 1′ represent indexes into the array of seasons. We use a range which stops 1 element early since there isn’t anywhere to connect our last season to:

MATCH (s:Season)
WITH s
ORDER BY s.timestamp
WITH COLLECT(s) AS seasons
UNWIND RANGE(0,LENGTH(seasons) - 2) as idx 
RETURN seasons[idx], seasons[idx+1]
 
==> +-------------------------------------------------------------------------------------------------------+
==> | seasons[idx]                                      | seasons[idx+1]                                    |
==> +-------------------------------------------------------------------------------------------------------+
==> | Node[1973]{timestamp:1249084800,name:"2009/2010"} | Node[1972]{timestamp:1280620800,name:"2010/2011"} |
==> | Node[1972]{timestamp:1280620800,name:"2010/2011"} | Node[1971]{timestamp:1312156800,name:"2011/2012"} |
==> | Node[1971]{timestamp:1312156800,name:"2011/2012"} | Node[1970]{timestamp:1343779200,name:"2012/2013"} |
==> | Node[1970]{timestamp:1343779200,name:"2012/2013"} | Node[1969]{timestamp:1375315200,name:"2013/2014"} |
==> +-------------------------------------------------------------------------------------------------------+
==> 4 rows

Now we’ve got all the adjacent seasons lined up we complete the query with a call to MERGE:

MATCH (s:Season)
WITH s
ORDER BY s.timestamp
WITH COLLECT(s) AS seasons
UNWIND RANGE(0,LENGTH(seasons) - 2) as idx 
WITH seasons[idx] AS s1, seasons[idx+1] AS s2
MERGE (s1)-[:NEXT]->(s2)
 
==> +-------------------+
==> | No data returned. |
==> +-------------------+
==> Relationships created: 4

And we’re done. Hopefully I can remember this approach more than I did the initial one!

Categories: Programming

R: ggplot geom_density – Error in exists(name, envir = env, mode = mode) : argument “env” is missing, with no default

Wed, 06/03/2015 - 06:52

Continuing on from yesterday’s blog post where I worked out how to clean up the Think Bayes Price is Right data set, the next task was to plot a distribution of the prices of show case items.

To recap, this is what the data frame we’re working with looks like:

library(dplyr)
 
df2011 = read.csv("~/projects/rLearning/showcases.2011.csv", na.strings = c("", "NA"))
df2011 = df2011 %>% na.omit()
 
> df2011 %>% head()
              X Sep..19 Sep..20 Sep..21 Sep..22 Sep..23 Sep..26 Sep..27 Sep..28 Sep..29 Sep..30 Oct..3
3    Showcase 1   50969   21901   32815   44432   24273   30554   20963   28941   25851   28800  37703
4    Showcase 2   45429   34061   53186   31428   22320   24337   41373   45437   41125   36319  38752
6         Bid 1   42000   14000   32000   27000   18750   27222   25000   35000   22500   21300  21567
7         Bid 2   34000   59900   45000   38000   23000   18525   32000   45000   32000   27500  23800
9  Difference 1    8969    7901     815   17432    5523    3332   -4037   -6059    3351    7500  16136
10 Difference 2   11429  -25839    8186   -6572    -680    5812    9373     437    9125    8819  14952
...

So our goal is to plot the density of the ‘Showcase 1′ items. Unfortunately those aren’t currently stored in a way that makes this easy for us. We need to flip the data frame so that we have a row for each date/price type/price:

PriceType  Date     Price
Showcase 1 Sep..19  50969
Showcase 2 Sep..19  21901
...
Showcase 1 Sep..20  45429
Showcase 2 Sep..20  34061

The reshape library’s melt function is our friend here:

library(reshape)
meltedDf = melt(df2011, id=c("X"))
 
> meltedDf %>% sample_n(10)
                X variable value
643    Showcase 1  Feb..24 27883
224    Showcase 2  Nov..10 34089
1062 Difference 2   Jun..4  9962
770    Showcase 2  Mar..28 39620
150  Difference 2  Oct..24  9137
431  Difference 1   Jan..4  7516
345         Bid 1  Dec..12 21569
918  Difference 2    May.1 -2093
536    Showcase 2  Jan..31 30918
502         Bid 2  Jan..23 27000

Now we need to plug this into ggplot. We’ll start by just plotting all the prices for showcase 1:

> ggplot(aes(x = value), data = meltedDf %>% filter(X == "Showcase 1")) +
    geom_density()
 
Error in exists(name, envir = env, mode = mode) : 
  argument "env" is missing, with no default


This error usually means that you’ve passed an empty data set to ggplot which isn’t the case here, but if we extract the values column we can see the problem:

> meltedDf$value[1:10]
 [1] "50969" "45429" "42000" "34000" "8969"  "11429" "21901" "34061" "14000" "59900"

They are all strings! Making it very difficult to plot a density curve which relies on the data being continuous. Let’s fix that and try again:

meltedDf$value = as.numeric(meltedDf$value)

ggplot(aes(x = value), data = meltedDf %>% filter(X == "Showcase 1")) +
  geom_density()

2015 06 03 06 46 48

If we want to show the curves for both showcases we can tweak our code slightly:

ggplot(meltedDf %>% filter(grepl("Showcase", X)), aes(x = value, colour = X)) + 
  geom_density() + 
  theme(legend.position="top")
2015 06 03 06 50 35

Et voila!

Categories: Programming

R: dplyr – removing empty rows

Tue, 06/02/2015 - 07:49

I’m still working my way through the exercises in Think Bayes and in Chapter 6 needed to do some cleaning of the data in a CSV file containing information about the Price is Right.

I downloaded the file using wget:

wget ïżŒhttp://www.greenteapress.com/thinkbayes/showcases.2011.csvïżŒ

And then loaded it into R and explored the first few rows using dplyr

library(dplyr)
df2011 = read.csv("~/projects/rLearning/showcases.2011.csv")
 
> df2011 %>% head(10)
 
           X Sep..19 Sep..20 Sep..21 Sep..22 Sep..23 Sep..26 Sep..27 Sep..28 Sep..29 Sep..30 Oct..3
1              5631K   5632K   5633K   5634K   5635K   5641K   5642K   5643K   5644K   5645K  5681K
2                                                                                                  
3 Showcase 1   50969   21901   32815   44432   24273   30554   20963   28941   25851   28800  37703
4 Showcase 2   45429   34061   53186   31428   22320   24337   41373   45437   41125   36319  38752
5                                                                                                  
...

As you can see, we have some empty rows which we want to get rid of to ease future processing. I couldn’t find an easy way to filter those out but what we can do instead is have empty columns converted to ‘NA’ and then filter those.

First we need to tell read.csv to treat empty columns as NA:

df2011 = read.csv("~/projects/rLearning/showcases.2011.csv", na.strings = c("", "NA"))

And now we can filter them out using na.omit:

df2011 = df2011 %>% na.omit()
 
> df2011  %>% head(5)
             X Sep..19 Sep..20 Sep..21 Sep..22 Sep..23 Sep..26 Sep..27 Sep..28 Sep..29 Sep..30 Oct..3
3   Showcase 1   50969   21901   32815   44432   24273   30554   20963   28941   25851   28800  37703
4   Showcase 2   45429   34061   53186   31428   22320   24337   41373   45437   41125   36319  38752
6        Bid 1   42000   14000   32000   27000   18750   27222   25000   35000   22500   21300  21567
7        Bid 2   34000   59900   45000   38000   23000   18525   32000   45000   32000   27500  23800
9 Difference 1    8969    7901     815   17432    5523    3332   -4037   -6059    3351    7500  16136
...

Much better!

Categories: Programming

R: Think Bayes Euro Problem

Mon, 06/01/2015 - 00:11

I’ve got back to working my way through Think Bayes after a month’s break and started out with the one euro coin problem in Chapter 4:

A statistical statement appeared in “The Guardian” on Friday January 4, 2002:

When spun on edge 250 times, a Belgian one-euro coin came up heads 140 times and tails 110. ‘It looks very suspicious to me,’ said Barry Blight, a statistics lecturer at the London School of Economics. ‘If the coin were unbiased, the chance of getting a result as extreme as that would be less than 7%.’

But do these data give evidence that the coin is biased rather than fair?

We’re going to create a data frame with each row representing the probability that heads shows up that often. We need one row for each value between 0 (no heads) and 100 (all heads) and we’ll start with the assumption that each value can be chosen equally (a uniform prior):

library(dplyr)
 
values = seq(0, 100)
scores = rep(1.0 / length(values), length(values))  
df = data.frame(score = scores, value = values)
 
> df %>% sample_n(10)
         score value
60  0.00990099    59
101 0.00990099   100
10  0.00990099     9
41  0.00990099    40
2   0.00990099     1
83  0.00990099    82
44  0.00990099    43
97  0.00990099    96
100 0.00990099    99
12  0.00990099    11

Now we need to feed in our observations. We need to create a vector containing 140 heads and 110 tails. The ‘rep’ function comes in handy here:

observations = c(rep("T", times = 110), rep("H", times = 140))
> observations
  [1] "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T"
 [29] "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T"
 [57] "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T"
 [85] "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "H" "H"
[113] "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H"
[141] "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H"
[169] "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H"
[197] "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H"
[225] "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H" "H"

Now we need to iterate over each of the observations and update our data frame appropriately.

for(observation in observations) {
  if(observation == "H") {
    df = df %>% mutate(score = score * (value / 100.0))
  } else {
    df = df %>% mutate(score = score * (1.0 - (value / 100.0)))
  }    
}
 
df = df %>% mutate(weighted = score / sum(score))

Now that we’ve done that we can calculate the maximum likelihood, mean, median and credible interval. We’ll create a ‘percentile’ function to help us out:

percentile = function(df, p) {
    df %>% filter(cumsum(weighted) > p) %>% head(1) %>% select(value) %>% as.numeric
}

And now let’s calculate the values:

# Maximum likelihood
> df %>% filter(weighted == max(weighted)) %>% select(value) %>% as.numeric
[1] 56
 
# Mean
> df %>% mutate(mean = value * weighted) %>% select(mean) %>% sum
[1] 55.95238
 
# Median
> percentile(df, 0.5)
[1] 56
 
# Credible Interval
percentage = 90
prob = (1 - percentage / 100.0) / 2
 
# lower
> percentile(df, prob)
[1] 51
 
# upper
> percentile(df, 1 - prob)
[1] 61

This all wraps up nicely into a function:

euro = function(values, priors, observations) {
  df = data.frame(score = priors, value = values)
 
  for(observation in observations) {
    if(observation == "H") {
      df = df %>% mutate(score = score * (value / 100.0))
    } else {
      df = df %>% mutate(score = score * (1.0 - (value / 100.0)))
    }    
  }
 
  return(df %>% mutate(weighted = score / sum(score)))
}

which we can call like so:

values = seq(0,100)
priors = rep(1.0 / length(values), length(values))
observations = c(rep("T", times = 110), rep("H", times = 140))
df = euro(values, priors, observations)

The next part of the problem requires us to change the prior distribution to be more weighted to values close to 50%. We can tweak the parameters we pass into the function accordingly:

values = seq(0,100)
priors = sapply(values, function(x) ifelse(x < 50, x, 100 - x))
priors = priors / sum(priors)
observations = c(rep("T", times = 110), rep("H", times = 140))
df = euro(values, priors, observations)

In fact even with the adjusted priors we still end up with the same posterior distribution:

> df %>% filter(weighted == max(weighted)) %>% select(value) %>% as.numeric
[1] 56
 
> df %>% mutate(mean = value * weighted) %>% select(mean) %>% sum
[1] 55.7435
 
> percentile(df, 0.5)
[1] 56
 
> percentile(df, 0.05)
[1] 51
 
> percentile(df, 0.95)
[1] 61

The book describes this phenemenom as follows:

This is an example of swamping the priors: with enough data, people who start with different priors will tend to converge on the same posterior.

Categories: Programming

Python: CSV writing – TypeError: ‘builtin_function_or_method’ object has no attribute ‘__getitem__’

Sun, 05/31/2015 - 23:33

When I’m working in Python I often find myself writing to CSV files using the in built library and every now and then make a mistake when calling writerow:

import csv
writer = csv.writer(file, delimiter=",")
writer.writerow["player", "team"]

This results in the following error message:

TypeError: 'builtin_function_or_method' object has no attribute '__getitem__'

The error message is a bit weird at first but it’s basically saying that I’ve tried to do an associative lookup on an object which doesn’t support that operation.

The resolution is simply to include the appropriate parentheses instead of leaving them out!

writer.writerow(["player", "team"])

This one’s for future Mark.

Categories: Programming

Neo4j: The BBC Champions League graph

Sat, 05/30/2015 - 22:45

A couple of weekends ago I started scraping the BBC live text feed of the Bayern Munich/Barcelona match, initially starting out with just the fouls and building the foul graph.

I’ve spent a bit more time on it since then and have managed to model several other events as well including attempts, goals, cards and free kicks.

I started doing this just for the Bayern Munich/Barcelona match but realised it wasn’t particularly difficult to extend this out and graph the events for every match in the Champions League 2014/2015.

To do this we first need to download the pages for each of the matches. I downloaded this page and wrote a simple Python script to get the appropriate URIs:

from bs4 import BeautifulSoup
from soupselect import select
import bs4
 
soup = BeautifulSoup(open("data/results", "r"))
 
matches = select(soup, "a.report")
 
for match in matches:
    print "http://www.bbc.co.uk/%s" %(match.get("href"))

I then piped the output of running this script into wget:

find_all_matches.py | xargs wget -O data/raw

It was relatively simple to update the scraping and import code to handle multiple matches. The whole process from end to end looks like this:

2015 05 29 23 27 56

Most of the code is in the ‘scraping magic’ phase where I go through all the events and pull out the appropriate elements that we can link together in the graph.

e.g. a freekick and foul event are typically adjacent so we’d look to pull out the two players involved, the type of card issues, the time of the incident and the match the event occurred in.

I used Python’s Beautiful Soup library for this task but there’s no reason you couldn’t use another set of tools.

The README page shows how to create your own version of the graph but here’s an overview of what the graph looks like using Rik’s meta graph query:

Graph  21

Here’s a few of my favourite queries so far:

Which player with more than 10 shots has the best conversion rate?
match (a:Attempt)<-[:HAD_ATTEMPT]-(app)<-[:MADE_APPEARANCE]-(player),
      (app)-[:FOR_TEAM]-(team)
WITH player, COUNT(*) as times, COLLECT(a) AS attempts, team
WITH player, times, LENGTH([a in attempts WHERE a:Goal]) AS goals, team
WHERE times > 10
RETURN player.name, team.name, goals, times, (goals * 1.0 / times) AS conversionRate
ORDER BY conversionRate DESC
LIMIT 10
 
==> +------------------------------------------------------------------------------------+
==> | player.name           | team.name            | goals | times | conversionRate      |
==> +------------------------------------------------------------------------------------+
==> | "Luiz Adriano"        | "Shakhtar Donetsk"   | 9     | 14    | 0.6428571428571429  |
==> | "Yacine Brahimi"      | "FC Porto"           | 5     | 13    | 0.38461538461538464 |
==> | "Mario Mandzukic"     | "Atlético de Madrid" | 5     | 14    | 0.35714285714285715 |
==> | "Sergio AgĂŒero"       | "Manchester City"    | 6     | 18    | 0.3333333333333333  |
==> | "Karim Benzema"       | "Real Madrid"        | 6     | 19    | 0.3157894736842105  |
==> | "Klaas-Jan Huntelaar" | "FC Schalke 04"      | 5     | 16    | 0.3125              |
==> | "Neymar"              | "Barcelona"          | 9     | 29    | 0.3103448275862069  |
==> | "Thomas MĂŒller"       | "FC Bayern MĂŒnchen"  | 7     | 24    | 0.2916666666666667  |
==> | "Jackson MartĂ­nez"    | "FC Porto"           | 7     | 24    | 0.2916666666666667  |
==> | "Callum McGregor"     | "Celtic"             | 3     | 11    | 0.2727272727272727  |
==> +------------------------------------------------------------------------------------+
Which players gained immediate revenge for a foul?
match (firstFoul:Foul)-[:COMMITTED_AGAINST]->(app1)<-[:MADE_APPEARANCE]-(revengeFouler),
      (app1)-[:IN_MATCH]->(match), (firstFoulerApp)-[:COMMITTED_FOUL]->(firstFoul),
      (app1)-[:COMMITTED_FOUL]->(revengeFoul)-[:COMMITTED_AGAINST]->(firstFoulerApp),
       (firstFouler)-[:MADE_APPEARANCE]->(firstFoulerApp)
WHERE (firstFoul)-[:NEXT]->(revengeFoul)
RETURN firstFouler.name AS firstFouler, revengeFouler.name AS revengeFouler, firstFoul.time, revengeFoul.time, match.home + " vs " + match.away
 
==> +---------------------------------------------------------------------------------------------------------------------------------+
==> | firstFouler         | revengeFouler               | firstFoul.time | revengeFoul.time | match.home + " vs " + match.away        |
==> +---------------------------------------------------------------------------------------------------------------------------------+
==> | "Derk Boerrigter"   | "Jean Philippe Mendy"       | "88:48"        | "89:42"          | "Celtic vs NK Maribor"                  |
==> | "Mario Suårez"      | "Pajtim Kasami"             | "27:17"        | "32:38"          | "Olympiakos vs Atlético de Madrid"      |
==> | "Aleksandr Volodko" | "Casemiro"                  | "39:27"        | "44:32"          | "FC Porto vs BATE Borisov"              |
==> | "Thomas MĂŒller"     | "Mario Fernandes"           | "87:22"        | "88:31"          | "CSKA Moscow vs FC Bayern MĂŒnchen"      |
==> | "Vinicius"          | "Marco Verratti"            | "56:36"        | "58:00"          | "APOEL Nicosia vs Paris Saint Germain"  |
==> | "Lasse Schöne"      | "Dani Alves"                | "84:08"        | "86:18"          | "Barcelona vs Ajax"                     |
==> | "Nick Viergever"    | "Dani Alves"                | "57:22"        | "60:37"          | "Barcelona vs Ajax"                     |
==> | "Nani"              | "Atsuto Uchida"             | "6:10"         | "8:40"           | "FC Schalke 04 vs Sporting Lisbon"      |
==> | "Andreas Samaris"   | "Yannick Ferreira-Carrasco" | "89:21"        | "90:00 +4:21"    | "Monaco vs Benfica"                     |
==> | "Simon Kroon"       | "Guillherme Siqueira"       | "84:05"        | "90:00 +0:29"    | "Atlético de Madrid vs Malmö FF"        |
==> | "Mario Suårez"      | "Isaac Thelin"              | "32:02"        | "38:47"          | "Atlético de Madrid vs Malmö FF"        |
==> | "Hakan Balta"       | "Henrikh Mkhitaryan"        | "62:09"        | "64:14"          | "Borussia Dortmund vs Galatasaray"      |
==> | "Marco Reus"        | "Selcuk Inan"               | "36:17"        | "44:03"          | "Borussia Dortmund vs Galatasaray"      |
==> | "Hakan Balta"       | "Sven Bender"               | "10:57"        | "12:51"          | "Borussia Dortmund vs Galatasaray"      |
==> | "Vinicius"          | "Edinson Cavani"            | "87:56"        | "90:00 +1:25"    | "Paris Saint Germain vs APOEL Nicosia"  |
==> | "Jackson MartĂ­nez"  | "Carlos Gurpegi"            | "64:55"        | "66:17"          | "Athletic Club vs FC Porto"             |
==> | "Nani"              | "Chinedu Obasi"             | "1:30"         | "4:47"           | "Sporting Lisbon vs FC Schalke 04"      |
==> | "Vitali Rodionov"   | "Bruno Martins Indi"        | "52:16"        | "60:08"          | "BATE Borisov vs FC Porto"              |
==> | "Raheem Sterling"   | "Behrang Safari"            | "29:00"        | "33:27"          | "Liverpool vs FC Basel"                 |
==> | "Derlis GonzĂĄlez"   | "FĂĄbio CoentrĂŁo"            | "52:55"        | "57:59"          | "FC Basel vs Real Madrid"               |
==> | "Josip Drmic"       | "Lisandro LĂłpez"            | "15:04"        | "17:35"          | "Benfica vs Bayer 04 Leverkusen"        |
==> | "Fred"              | "Bastian Schweinsteiger"    | "6:04"         | "9:28"           | "Shakhtar Donetsk vs FC Bayern MĂŒnchen" |
==> | "Alex Sandro"       | "Derlis GonzĂĄlez"           | "4:07"         | "7:28"           | "FC Basel vs FC Porto"                  |
==> | "Luca Zuffi"        | "Ruben Neves"               | "73:49"        | "84:44"          | "FC Porto vs FC Basel"                  |
==> | "Marco Verratti"    | "Oscar"                     | "28:49"        | "34:04"          | "Chelsea vs Paris Saint Germain"        |
==> | "Cristiano Ronaldo" | "JesĂșs GĂĄmez"               | "20:59"        | "25:37"          | "Real Madrid vs AtlĂ©tico de Madrid"     |
==> | "Bernardo Silva"    | "Álvaro Morata"             | "49:20"        | "62:31"          | "Monaco vs Juventus"                    |
==> | "Arturo Vidal"      | "Fabinho"                   | "38:19"        | "45:00"          | "Monaco vs Juventus"                    |
==> +---------------------------------------------------------------------------------------------------------------------------------+
Which players took the longest to gain revenge for a foul?
match (foul1:Foul)-[:COMMITTED_AGAINST]->(app1)-[:COMMITTED_FOUL]->(foul2)-[:COMMITTED_AGAINST]->(app2)-[:COMMITTED_FOUL]->(foul1),
      (player1)-[:MADE_APPEARANCE]->(app1), (player2)-[:MADE_APPEARANCE]->(app2),
      (foul1)-[:COMMITTED_IN_MATCH]->(match:Match)<-[:COMMITTED_IN_MATCH]-(foul2)
WHERE (foul1)-[:NEXT*]->(foul2)
WITH match, foul1, player1, player2, foul2 ORDER BY foul1.sortableTime, foul2.sortableTime
WITH match, foul1, player1, player2, COLLECT(foul2) AS revenge
WITH match, foul1,  player1,player2,  revenge[0] AS revengeFoul
RETURN player1.name, player2.name, foul1.time, revengeFoul.time, revengeFoul.sortableTime - foul1.sortableTime AS secondsWaited, match.home + " vs " + match.away AS match
ORDER BY secondsWaited DESC
LIMIT 5
 
==> +---------------------------------------------------------------------------------------------------------------------------+
==> | player1.name      | player2.name        | foul1.time | revengeFoul.time | secondsWaited | match                           |
==> +---------------------------------------------------------------------------------------------------------------------------+
==> | "Stefan Johansen" | "Ondrej Duda"       | "1:30"     | "82:11"          | 4841          | "Legia Warsaw vs Celtic"        |
==> | "Neymar"          | "Vinicius"          | "2:35"     | "80:08"          | 4653          | "Barcelona vs APOEL Nicosia"    |
==> | "JĂ©rĂ©my Toulalan" | "Stefan Kießling"   | "9:19"     | "86:37"          | 4638          | "Monaco vs Bayer 04 Leverkusen" |
==> | "Nabil Dirar"     | "Domenico Criscito" | "6:32"     | "82:39"          | 4567          | "Zenit St Petersburg vs Monaco" |
==> | "Nabil Dirar"     | "Eliseu"            | "7:20"     | "81:30"          | 4450          | "Monaco vs Benfica"             |
==> +---------------------------------------------------------------------------------------------------------------------------+
Who’s had the most shots?
match (team)<-[:FOR_TEAM]-(app)<-[appRel:MADE_APPEARANCE]-(player:Player)
optional match (a:Attempt)<-[att:HAD_ATTEMPT]-(app)
WITH player, COUNT( DISTINCT appRel) AS apps, COUNT(att) as times, COLLECT(a) AS attempts, team
WITH player,apps, times, LENGTH([a in attempts WHERE a:Goal]) AS goals, team
WHERE times > 10
RETURN player.name, team.name, apps, goals, times, (goals * 1.0 / times) AS conversionRate
ORDER BY times DESC
LIMIT 10
 
==> +-------------------------------------------------------------------------------------------+
==> | player.name          | team.name             | apps | goals | times | conversionRate      |
==> +-------------------------------------------------------------------------------------------+
==> | "Cristiano Ronaldo"  | "Real Madrid"         | 12   | 10    | 69    | 0.14492753623188406 |
==> | "Lionel Messi"       | "Barcelona"           | 12   | 10    | 51    | 0.19607843137254902 |
==> | "Robert Lewandowski" | "FC Bayern MĂŒnchen"   | 12   | 6     | 43    | 0.13953488372093023 |
==> | "Carlos TĂ©vez"       | "Juventus"            | 12   | 7     | 34    | 0.20588235294117646 |
==> | "Gareth Bale"        | "Real Madrid"         | 10   | 2     | 32    | 0.0625              |
==> | "Luis SuĂĄrez"        | "Barcelona"           | 9    | 6     | 30    | 0.2                 |
==> | "Neymar"             | "Barcelona"           | 11   | 9     | 29    | 0.3103448275862069  |
==> | "Hakan Calhanoglu"   | "Bayer 04 Leverkusen" | 8    | 2     | 29    | 0.06896551724137931 |
==> | "Edinson Cavani"     | "Paris Saint Germain" | 8    | 6     | 27    | 0.2222222222222222  |
==> | "Alexis SĂĄnchez"     | "Arsenal"             | 9    | 4     | 25    | 0.16                |
==> +-------------------------------------------------------------------------------------------+

Maybe you can think of some cooler ones? I’d love to see them. Grab the code from github and give it a try.

Categories: Programming

Python: Look ahead multiple elements in an iterator/generator

Thu, 05/28/2015 - 21:56

As part of the BBC live text scraping code I’ve been working on I needed to take an iterator of raw events created by a generator and transform this into an iterator of cards shown in a match.

The structure of the raw events I’m interested in is as follows:

  • Line 1: Player booked
  • Line 2: Player fouled
  • Line 3: Information about the foul

e.g.

events = [
  {'event': u'Booking     Pedro (Barcelona) is shown the yellow card for a bad foul.', 'sortable_time': 5083, 'match_id': '32683310', 'formatted_time': u'84:43'}, 
  {'event': u'Rafinha (FC Bayern M\xfcnchen) wins a free kick on the right wing.', 'sortable_time': 5078, 'match_id': '32683310', 'formatted_time': u'84:38'}, 
  {'event': u'Foul by Pedro (Barcelona).', 'sortable_time': 5078, 'match_id': '32683310', 'formatted_time': u'84:38'}
]

We want to take these 3 raw events and create one ‘booking event’. We therefore need to have access to all 3 lines at the same time.

I started with the following function:

def cards(events):
    events = iter(events)
 
    item = events.next()
    next = events.next()
    event_id = 0
 
    for next_next in events:
        event = item["event"]
        booking = re.findall("Booking.*", event)
        if booking:
            player = re.findall("Booking([^(]*)", event)[0].strip()
            team = re.findall("Booking([^(]*) \((.*)\)", event)[0][1]
 
            associated_foul = [x for x in [(next, event_id+1), (next_next, event_id+2)]
                                 if re.findall("Foul by.*", x[0]["event"])]
 
            if associated_foul:
                associated_foul = associated_foul[0]
                yield event_id, associated_foul[1], player, team, item, "yellow"
            else:
                yield event_id, "", player, team, item, "yellow"
 
        item = next
        next = next_next
        event_id += 1

If we run the sample events through it we’d see this output:

>>> for card_id, associated_foul, player, team, item, card_type in cards(iter(events)):
      print card_id, associated_foul, player, team, item, card_type
 
0 2 Pedro Barcelona {'match_id': '32683310', 'event': u'Booking     Pedro (Barcelona) is shown the yellow card for a bad foul.', 'formatted_time': u'84:43', 'sortable_time': 5083} yellow

In retrospect, it’s a bit of a hacky way of moving a window of 3 items over the iterator of raw events and yielding a ‘booking’ event if the regex matches.

I thought there was probably a better way of doing this using itertools and indeed there is!

First we need to create a window function which will return us an iterator of tuples containing consecutive events:

from itertools import tee, izip
 
def window(iterable, size):
    iters = tee(iterable, size)
    for i in xrange(1, size):
        for each in iters[i:]:
            next(each, None)
    return izip(*iters)

Now let’s have a look at how it works using a simple example:

>>> numbers = iter(range(0,10))
>>> for triple in window(numbers, 3):
      print triple
 
(0, 1, 2)
(1, 2, 3)
(2, 3, 4)
(3, 4, 5)
(4, 5, 6)
(5, 6, 7)
(6, 7, 8)
(7, 8, 9)

Exactly what we need. Now let’s plug the windows function into our cards function:

def cards(events):
    events = iter(events)
    for event_id, triple in enumerate(window(events, 3)):
        item = triple[0]
        event = triple[0]["event"]
 
        booking = re.findall("Booking.*", event)
        if booking:
            player = re.findall("Booking([^(]*)", event)[0].strip()
            team = re.findall("Booking([^(]*) \((.*)\)", event)[0][1]
 
            associated_foul = [x for x in [(triple[1], event_id+1), (triple[2], event_id+2)]
                                 if re.findall("Foul by.*", x[0]["event"])]
 
            if associated_foul:
                associated_foul = associated_foul[0]
                yield event_id, associated_foul[1], player, team, item, "yellow"
            else:
                yield event_id, "", player, team, item, "yellow"

And finally check it still processes events correctly:

>>> for card_id, associated_foul, player, team, item, card_type in cards(iter(events)):
      print card_id, associated_foul, player, team, item, card_type
 
0 2 Pedro Barcelona {'match_id': '32683310', 'event': u'Booking     Pedro (Barcelona) is shown the yellow card for a bad foul.', 'formatted_time': u'84:43', 'sortable_time': 5083} yellow
Categories: Programming

Neo4j: The foul revenge graph

Tue, 05/26/2015 - 08:03

Last week I was showing the foul graph to my colleague Alistair who came up with the idea of running a ‘foul revenge’ query to find out which players gained revenge for a foul with one of their own later in them match.

Queries like this are very path centric and therefore work well in a graph. To recap, this is what the foul graph looks like:

2015 05 26 07 35 33

The first thing that we need to do is connect the fouls in a linked list based on time so that we can query their order more easily.

We can do this with the following query:

MATCH (foul:Foul)-[:COMMITTED_IN_MATCH]->(match)
WITH foul,match
ORDER BY match.id, foul.sortableTime
WITH match, COLLECT(foul) AS fouls
FOREACH(i in range(0, length(fouls) -2) |
  FOREACH(foul1 in [fouls[i]] | FOREACH (foul2 in [fouls[i+1]] |
    MERGE (foul1)-[:NEXT]->(foul2)
)));

This query collects fouls grouped by match and then adds a ‘NEXT’ relationship between adjacent fouls. The graph now looks like this:

2015 05 26 07 43 28

Now let’s find the revenge foulers in the Bayern Munich vs Barcelona match. We’re looking for the following pattern:

2015 05 26 07 55 45

This translates to the following cypher query:

match (foul1:Foul)-[:COMMITTED_AGAINST]->(app1)-[:COMMITTED_FOUL]->(foul2)-[:COMMITTED_AGAINST]->(app2)-[:COMMITTED_FOUL]->(foul1),
      (player1)-[:MADE_APPEARANCE]->(app1), (player2)-[:MADE_APPEARANCE]->(app2),
      (foul1)-[:COMMITTED_IN_MATCH]->(match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul2)
WHERE (foul1)-[:NEXT*]->(foul2)
RETURN player2.name AS firstFouler, player1.name AS revengeFouler, foul1.time, foul1.location, foul2.time, foul2.location

I’ve added in a few extra parts to the pattern to pull out the players involved and to find the revenge foulers in a specific match – the Bayern Munich vs Barcelona Semi Final 2nd leg.


We end up with the following revenge fouls:

2015 05 26 00 05 48

We can see here that Dani Alves actually gains revenge on Bastian Schweinsteiger twice for a foul he made in the 10th minute.

If we tweak the query to the following we can get a visual representation of the revenge fouls as well:

match (foul1:Foul)-[:COMMITTED_AGAINST]->(app1)-[:COMMITTED_FOUL]->(foul2)-[:COMMITTED_AGAINST]->(app2)-[:COMMITTED_FOUL]->(foul1),
      (player1)-[:MADE_APPEARANCE]->(app1), (player2)-[:MADE_APPEARANCE]->(app2),
      (foul1)-[:COMMITTED_IN_MATCH]->(match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul2),
      (foul1)-[:NEXT*]->(foul2)
RETURN *

2015 05 23 15 23 22

At the moment I’ve restricted the revenge concept to single matches but I wonder whether it’d be more interesting to create a linked list of fouls which crosses matches between teams in the same season.

The code for all of this is on github – the README is a bit sketchy at the moment but I’ll be fixing that up soon.

Categories: Programming

Python: Joining multiple generators/iterators

Mon, 05/25/2015 - 00:51

In my previous blog post I described how I’d refactored some scraping code I’ve been working on to use iterators and ended up with a function which returned a generator containing all the events for one BBC live text match:

match_id = "32683310"
events = extract_events("data/raw/%s" % (match_id))
 
>>> print type(events)
<type 'generator'>

The next thing I wanted to do is get the events for multiple matches which meant I needed to glue together multiple generators into one big generator.

itertools’ chain function does exactly what we want:

itertools.chain(*iterables)

Make an iterator that returns elements from the first iterable until it is exhausted, then proceeds to the next iterable, until all of the iterables are exhausted. Used for treating consecutive sequences as a single sequence.

First let’s try it out on a collection of range generators:

import itertools
gens = [(n*2 for n in range(0, 3)), (n*2 for n in range(4,7))]
>>> gens
[<generator object <genexpr> at 0x10ff3b140>, <generator object <genexpr> at 0x10ff7d870>]
 
output = itertools.chain()
for gen in gens:
  output = itertools.chain(output, gen)

Now if we iterate through ‘output’ we’d expect to see the multiples of 2 up to and including 12:

>>> for item in output:
...   print item
...
0
2
4
8
10
12

Exactly as we expected! Our scraping code looks like this once we plug the chaining in:

matches = ["32683310", "32683303", "32384894", "31816155"]
 
raw_events = itertools.chain()
for match_id in matches:
    raw_events = itertools.chain(raw_events, extract_events("data/raw/%s" % (match_id)))

‘raw_events’ now contains a single generator that we can iterate through and process the events for all matches.

Categories: Programming

Python: Refactoring to iterator

Sat, 05/23/2015 - 11:14

Over the last week I’ve been building a set of scripts to scrape the events from the Bayern Munich/Barcelona game and I’ve ended up with a few hundred lines of nested for statements, if statements and mutated lists. I thought it was about time I did a bit of refactoring.

The following is a function which takes in a match file and spits out a collection of maps containing times & events.

import bs4
import re
from bs4 import BeautifulSoup
from soupselect import select
 
def extract_events(file):
    match = open(file, 'r')
    soup = BeautifulSoup(match.read())
 
    all_events = []
    for event in select(soup, 'div#live-text-commentary-wrapper div#live-text'):
        for child in event.children:
            if type(child) is bs4.element.Tag:
                all_events.append(child.getText().strip())
 
    for event in select(soup, 'div#live-text-commentary-wrapper div#more-live-text'):
        for child in event.children:
            if type(child) is bs4.element.Tag:
                all_events.append(child.getText().strip())
 
    timed_events = []
    for i in range(0, len(all_events)):
        event = all_events[i]
        time =  re.findall("\d{1,2}:\d{2}", event)
        formatted_time = " +".join(time)
        if time:
            timed_events.append({'time': formatted_time, 'event': all_events[i+1]})
    return timed_events

We call it like this:

match_id = "32683310"
for event in extract_events("data/%s" % (match_id))[:10]:
    print event

The file we’re loading is the Bayern Munich vs Barcelona match HTML file which I have saved locally. After we’ve got that read into beautiful soup we locate the two divs on the page which contain the match events.

We then iterate over that list and create a new list containing (time, event) pairs which we return.

I think we should be able to get to our resulting collection without persisting an intermediate list, but first things first – let’s remove the duplicated for loops:

def extract_events(file):
    match = open(file, 'r')
    soup = BeautifulSoup(match.read())
 
    all_events = []
    events = select(soup, 'div#live-text-commentary-wrapper div#live-text')
    more_events = select(soup, 'div#live-text-commentary-wrapper div#more-live-text')
 
    for event in events + more_events:
        for child in event.children:
            if type(child) is bs4.element.Tag:
                all_events.append(child.getText().strip())
 
    timed_events = []
    for i in range(0, len(all_events)):
        event = all_events[i]
        time =  re.findall("\d{1,2}:\d{2}", event)
        formatted_time = " +".join(time)
        if time:
            timed_events.append({'time': formatted_time, 'event': all_events[i+1]})
    return timed_events

The next step is to refactor towards using an iterator. After a bit of reading I realised a generator would make life even easier.

I created a function which returned an iterator of the raw events and plugged that into the original function:

def raw_events(file):
    match = open(file, 'r')
    soup = BeautifulSoup(match.read())
    events = select(soup, 'div#live-text-commentary-wrapper div#live-text')
    more_events = select(soup, 'div#live-text-commentary-wrapper div#more-live-text')
    for event in events + more_events:
        for child in event.children:
            if type(child) is bs4.element.Tag:
                yield child.getText().strip()
 
def extract_events(file):
    all_events = list(raw_events(file))
 
    timed_events = []
    for i in range(0, len(all_events)):
        event = all_events[i]
        time =  re.findall("\d{1,2}:\d{2}", event)
        formatted_time = " +".join(time)
        if time:
            timed_events.append({'time': formatted_time, 'event': all_events[i+1]})
    return timed_events

If we run that function we still get the same output as before which is good. Now we need to work out how to clean up the second bit of the code which groups the appropriate rows together.

The goal is that ‘extract_events’ returns an iterator rather than a list – we need to figure out how to iterate over the output of ‘raw_events’ in such a way that when we find a ‘time row’ we can yield that and the row immediately after.

Luckily I found a Stack Overflow post explaining that you can use the ‘next’ function inside an iterator to achieve this:

def extract_events(file):
    events = raw_events(file)
    for event in events:
        time =  re.findall("\d{1,2}:\d{2}", event)
        formatted_time = " +".join(time)
        if time:
            yield {'time': formatted_time, 'event': next(events)}

It’s not that much less code than the original function but I think it’s an improvement. Any thoughts/tips to simplify it further are always welcome.

Categories: Programming

Python: UnicodeEncodeError: ‘ascii’ codec can’t encode character u’\xfc’ in position 11: ordinal not in range(128)

Thu, 05/21/2015 - 07:14

I’ve been trying to write some Python code to extract the players and the team they represented in the Bayern Munich/Barcelona match into a CSV file and had much more difficulty than I expected.

I have some scraping code (which is beyond the scope of this article) which gives me a list of (player, team) pairs that I want to write to disk. The contents of the list is as follows:

$ python extract_players.py
(u'Sergio Busquets', u'Barcelona')
(u'Javier Mascherano', u'Barcelona')
(u'Jordi Alba', u'Barcelona')
(u'Bastian Schweinsteiger', u'FC Bayern M\xfcnchen')
(u'Dani Alves', u'Barcelona')

I started with the following script:

with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])
 
    for player, team in players:
        print player, team, type(player), type(team)
        writer.writerow([player, team])

And if I run that I’ll see this error:

$ python extract_players.py
...
Bastian Schweinsteiger FC Bayern MĂŒnchen <type 'unicode'> <type 'unicode'>
Traceback (most recent call last):
  File "extract_players.py", line 67, in <module>
    writer.writerow([player, team])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 11: ordinal not in range(128)

So it looks like the ‘ĂŒ’ in ‘FC Bayern MĂŒnchen’ is causing us issues. Let’s try and encode the teams to avoid this:

with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])
 
    for player, team in players:
        print player, team, type(player), type(team)
        writer.writerow([player, team.encode("utf-8")])
$ python extract_players.py
...
Thomas MĂŒller FC Bayern MĂŒnchen <type 'unicode'> <type 'unicode'>
Traceback (most recent call last):
  File "extract_players.py", line 70, in <module>
    writer.writerow([player, team.encode("utf-8")])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 8: ordinal not in range(128)

Now we’ve got the same issue with the ‘ĂŒ’ in MĂŒller so let’s encode the players too:

with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])
 
    for player, team in players:
        print player, team, type(player), type(team)
        writer.writerow([player.encode("utf-8"), team.encode("utf-8")])
$ python extract_players.py
...
Gerard Piqué Barcelona <type 'str'> <type 'unicode'>
Traceback (most recent call last):
  File "extract_players.py", line 70, in <module>
    writer.writerow([player.encode("utf-8"), team.encode("utf-8")])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)

Now we’ve got a problem with Gerard PiquĂ© because that value has type string rather than unicode. Let’s fix that:

with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])
 
    for player, team in players:
        if isinstance(player, str):
            player = unicode(player, "utf-8")
        print player, team, type(player), type(team)
        writer.writerow([player.encode("utf-8"), team.encode("utf-8")])

Et voila! All the players are now successfully written to the file.

An alternative approach is to change the default encoding of the whole script to be ‘UTF-8′, like so:

# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
 
with open("data/players.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["player", "team"])
 
    for player, team in players:
        print player, team, type(player), type(team)
        writer.writerow([player, team])

It took me a while to figure it out but finally the players are ready to go!

Categories: Programming

Neo4j: Finding all shortest paths

Tue, 05/19/2015 - 23:45

One of the Cypher language features we show in Neo4j training courses is the shortest path function which allows you to find the shortest path in terms of number of relationships between two nodes.

Using the movie graph, which you can import via the ‘:play movies’ command in the browser, we’ll first create a ‘KNOWS’ relationship between any people that have appeared in the same movie:

MATCH (p1:Person)-[:ACTED_IN]->()<-[:ACTED_IN]-(p2:Person)
MERGE (p1)-[:KNOWS]-(p2)

Now that we’ve got that relationship we can easily find the shortest path between two people, say Tom Cruise and Tom Hanks:

MATCH (p1:Person {name: "Tom Hanks"}), (p2:Person {name: "Tom Cruise"}),
      path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
Graph  18

That works pretty well but what if we want to find the longest shortest path between any two people in the graph? We can calculate it like this:

MATCH (p1:Person), (p2:Person),
      path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
ORDER BY LENGTH(path) DESC
LIMIT 1
Graph  19

So that’s 6 hops which is actually the Bacon number – I expect we’d probably see a smaller maximum value if we imported all the movies.

And to round off the post what if we want to find the longest shortest path between the 10 people who acted in the most movies? We might start out with the following query which seems like it should do the job:

MATCH (p1:Person)-[:ACTED_IN]->()
 
WITH p1, COUNT(*) AS appearances
ORDER BY appearances DESC
LIMIT 10
 
WITH p1 AS p1, p1 AS p2
MATCH path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
ORDER BY LENGTH(path) DESC
LIMIT 1

Unfortunately if we run that query we get no rows returned because ‘p1′ and ‘p2′ always refer to the same node.

Instead we can calculate the shortest path between our hardest working people by creating a cross product using COLLECT and UNWIND:

MATCH (p1:Person)-[:ACTED_IN]->()
 
WITH p1, COUNT(*) AS appearances
ORDER BY appearances DESC
LIMIT 10
 
WITH COLLECT(p1) AS ps
UNWIND ps AS p1 UNWIND ps AS p2
MATCH path = shortestpath((p1)-[:KNOWS*]-(p2))
RETURN path
ORDER BY LENGTH(path) DESC
LIMIT 1

Graph  20

That’s all for now!

Categories: Programming

Neo4j: Refactoring the BBC football live text fouls graph

Sun, 05/17/2015 - 12:04

Yesterday I wrote about a Neo4j graph I’ve started building which contains all the fouls committed in the Champions League game between Barcelona & Bayern Munich and surrounding meta data.

While adding other events into the graph I realised that I’d added some duplication in the model and the model could do with some refactoring to make it easier to use.

To recap, this is the model that we designed in the previous blog post:

The duplication is on the left hand side of the model – we model a foul as being committed by one player against another and then hook the foul back into the match. By doing that we’re not using the ‘appearance’ concept which links a player and a match together.

We can make the ‘COMMITTED_IN_MATCH’ relationship redundant by connecting the foul to appearance rather than to player. The match the foul was committed in can then be found by navigating through the appearance node.

This is what we want the graph to look like:

2015 05 17 10 40 44

We’ll move towards this new model in 3 steps:

  • Introduce the new structure alongside the existing one
  • Rewrite our queries to use the new structure
  • Remove the old structure
Introducing the new structure

First up let’s write a query to introduce the new structure.

match (foul:Foul)-[:COMMITTED_AGAINST]->(fouledPlayer),
      (foul)<-[:COMMITTED_FOUL]-(foulingPlayer),
      (foul)-[:COMMITTED_IN_MATCH]->(match:Match {id: "32683310"}),
      (foulingPlayer)-[:MADE_APPEARANCE]-(foulingPlayerApp)-[:IN_MATCH]->(match),
      (fouledPlayer)-[:MADE_APPEARANCE]-(fouledPlayerApp)-[:IN_MATCH]->(match)
MERGE (foul)<-[:COMMITTED_FOUL]-(foulingPlayerApp)
MERGE (foul)-[:COMMITTED_AGAINST]->(fouledPlayerApp)

Remember we’re not going to delete the old structure yet so that’s why there aren’t any delete statements in here.

Rewriting our queries

Now we need to update our queries to work against the new graph structure:

Where do the fouls happen?
match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul)
RETURN foul.location AS location, COUNT(*) as fouls
ORDER BY fouls DESC

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-()<-[]-(foul:Foul)
RETURN foul.location AS location, COUNT(*) as fouls
ORDER BY fouls DESC
Who fouls the most?
match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul:Foul)<-[:COMMITTED_FOUL]-(fouler:Player)
RETURN fouler.name AS fouler, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(appearance)-[:COMMITTED_FOUL]->(foul:Foul),
      (appearance)<-[:MADE_APPEARANCE]-(fouler)
RETURN fouler.name AS fouler, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10
Who was fouled the most?
match (match:Match {id: "32683310"})<-[:IN_MATCH]-(appearance)-[r:COMMITTED_FOUL]->(foul:Foul),
      (appearance)<-[:MADE_APPEARANCE]-(fouler)
RETURN fouler.name AS fouler, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(appearance)<-[:COMMITTED_AGAINST]->(foul:Foul),
      (appearance)<-[:MADE_APPEARANCE]-(fouled)
RETURN fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10
Who fouled who the most?
match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul:Foul)-[:COMMITTED_AGAINST]->(fouled:Player),
      (foul)<-[:COMMITTED_FOUL]-(fouler:Player)
RETURN fouler.name AS fouler, fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10

becomes

match (match:Match {id: "32683310"}),
      (match)<-[:IN_MATCH]-(fouledApp)<-[:COMMITTED_AGAINST]->(foul:Foul)<-[:COMMITTED_FOUL]-(foulerApp)-[:IN_MATCH]->(match),
      (fouledApp)<-[:MADE_APPEARANCE]-(fouled),
      (foulerApp)<-[:MADE_APPEARANCE]-(fouler)
RETURN fouler.name AS fouler, fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
Which team fouled most?
match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-()<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
RETURN team.name, COUNT(*) as fouls
ORDER BY fouls DESC

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(app:Appearance)-[:COMMITTED_FOUL]->(),
      (app)-[:FOR_TEAM]->(team)
RETURN team.name, COUNT(*) as fouls
ORDER BY fouls DESC
Worst fouler for each team
match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
WITH team, fouler, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouler:fouler, fouls:fouls})[0] AS topFouler
RETURN team.name, topFouler.fouler.name, topFouler.fouls;

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(app:Appearance)-[:COMMITTED_FOUL]->(),
      (app)-[:FOR_TEAM]->(team),
      (fouler)-[:MADE_APPEARANCE]->(app)
WITH team, fouler, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouler:fouler, fouls:fouls})[0] AS topFouler
RETURN team.name, topFouler.fouler.name, topFouler.fouls;
Most fouled against for each team
match (match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
WITH team, fouler, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouler:fouler, fouls:fouls})[0] AS topFouler
RETURN team.name, topFouler.fouler.name, topFouler.fouls

becomes

match (match:Match {id: "32683310"})<-[:IN_MATCH]-(app:Appearance)<-[:COMMITTED_AGAINST]->(),
      (app)-[:FOR_TEAM]->(team),
      (fouled)-[:MADE_APPEARANCE]->(app)
WITH team, fouled, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouled:fouled, fouls:fouls})[0] AS topFouled
RETURN team.name, topFouled.fouled.name, topFouled.fouls

The early queries are made more complicated by the refactoring but the latter ones are slightly simpler. I think we need to hook some more events onto the appearance node to see whether this refactoring is worthwhile or not.

Removing the old structure

Holding judgement for now, let’s look at how we’d remove the old structure – the final step in this refactoring:

match (match:Match {id: "32683310"})<-[oldRel:COMMITTED_IN_MATCH]-(foul:Foul)
DELETE oldRel
match (player:Player)<-[oldRel:COMMITTED_AGAINST]-(foul:Foul)
DELETE oldRel
match (player:Player)-[oldRel:COMMITTED_FOUL]->(foul:Foul)
DELETE oldRel

Hopefully you can see how you’d go about refactoring your own graph if you realise the model isn’t quite what you want.

Any questions/thoughts/suggestions let me know!

Categories: Programming

Neo4j: BBC football live text fouls graph

Sat, 05/16/2015 - 22:13

I recently came across the Partially Derivative podcast and in episode 17 they describe how Kirk Goldsberry scraped a bunch of data about shots in basketball matches then ran some analysis on that data.

It got me thinking that we might be able to do something similar for football matches and although event based data for football matches only comes from Opta, the BBC does expose some of them in live text feeds.

We’ll start with the Champions League match between Barcelona and Bayern Munich from last Tuesday.

2015 05 16 23 10 43

Our first task is to extract the events that happened in the match along with the players involved. After we’ve got that we’ll generate a Neo4j graph and see if we can find some interesting insights.

I find the feedback cycle with this type of work is dramatically improved if we have the source data available locally so the first step was to get the BBC web page downloaded:

$ wget http://www.bbc.co.uk/sport/0/football/32683310

Next we need to write a scraper which will extract all the events. We want to get an array containing one entry for each event, where the following is an example of an event:

2015 05 16 22 19 00

HTML-wise it looks like this:

2015 05 16 22 20 28

4709393221 bddd85c64e z

Image courtesy of William Brawley

I do most of my scraping work in Python so I used the Beautiful Soup library with the soupselect wrapper to get the data into CSV format ready to import into Neo4j.

It was mostly a straight forward job of finding the appropriate CSS tag and pulling out the values although the way fouls are described in the page is a bit strange – sometimes the person fouled comes first row and the fouler comes on the next line and sometimes vice versa.

Luckily the two parts of the foul can be joined together by matching the time which made life easier.

The full code for the scrapper is on github if you want to play with it.

This is what the resulting CSV file looks like:

$ head -n 10 data/events.csv 
matchId,foulId,freeKickId,time,foulLocation,fouledPlayer,fouledPlayerTeam,foulingPlayer,foulingPlayerTeam
32683310,3,2,90:00 +0:40,in the defensive half.,Xabi Alonso,FC Bayern MĂŒnchen,Pedro,Barcelona
32683310,9,8,84:38,on the right wing.,Rafinha,FC Bayern MĂŒnchen,Pedro,Barcelona
32683310,12,13,83:17,in the attacking half.,Lionel Messi,Barcelona,Sebastian Rode,FC Bayern MĂŒnchen
32683310,15,14,82:43,in the defensive half.,Sebastian Rode,FC Bayern MĂŒnchen,Neymar,Barcelona
32683310,17,18,80:41,in the attacking half.,Pedro,Barcelona,Xabi Alonso,FC Bayern MĂŒnchen
32683310,22,23,76:31,in the defensive half.,Neymar,Barcelona,Rafinha,FC Bayern MĂŒnchen
32683310,25,26,75:03,in the attacking half.,Lionel Messi,Barcelona,Xabi Alonso,FC Bayern MĂŒnchen
32683310,31,30,69:37,in the attacking half.,Bastian Schweinsteiger,FC Bayern MĂŒnchen,Dani Alves,Barcelona
32683310,36,35,63:27,in the attacking half.,Robert Lewandowski,FC Bayern MĂŒnchen,Ivan Rakitic,Barcelona

Now it’s time to create a graph. We’ll aim to massage the data into this model:

2015 05 16 22 50 32

Next we need to write some Cypher code to get the CSV data into the graph. The full script is here, a sample of which is below:

// match
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (:Match {id: row.matchId});
 
// teams
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (:Team {name: row.foulingPlayerTeam});
 
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (:Team {name: row.fouledPlayerTeam});
 
// players
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (player:Player {id: row.foulingPlayer + "_" + row.foulingPlayerTeam})
ON CREATE SET player.name = row.foulingPlayer;
 
// appearances
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MATCH (match:Match {id: row.matchId})
MATCH (player:Player {id: row.foulingPlayer + "_" + row.foulingPlayerTeam})
MATCH (team:Team {name: row.foulingPlayerTeam})
 
MERGE (appearance:Appearance {id: player.id + " in " + row.matchId})
MERGE (player)-[:MADE_APPEARANCE]->(appearance)
MERGE (appearance)-[:IN_MATCH]->(match)
MERGE (appearance)-[:FOR_TEAM]->(team);
 
// fouls
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
 
MATCH (foulingPlayer:Player {id:row.foulingPlayer + "_" + row.foulingPlayerTeam })
MATCH (fouledPlayer:Player {id:row.fouledPlayer + "_" + row.fouledPlayerTeam })
MATCH (match:Match {id: row.matchId})
 
MERGE (foul:Foul {eventId: row.foulId})
ON CREATE SET foul.time = row.time, foul.location = row.foulLocation
 
MERGE (foul)<-[:COMMITTED_FOUL]-(foulingPlayer)
MERGE (foul)-[:COMMITTED_AGAINST]->(fouledPlayer)
MERGE (foul)-[:COMMITTED_IN_MATCH]->(match);

We’ll use neo4j-shell to execute the script:

$ ./neo4j-community-2.2.1/bin/neo4j-shell --file import.cql

Now that we’ve got the data into Neo4j we need to come up with some questions to ask of it. I came up with the following but perhaps you can think of some others!

  • Where do the fouls happen on the pitch?
  • Who made the most fouls?
  • Who was fouled the most?
  • Who fouled who the most?
  • Which team fouled the most?
  • Who’s the worst fouler in each team?
  • Who’s the most fouled in each team?
Where do the fouls happen?
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)
RETURN foul.location AS location, COUNT(*) as fouls
ORDER BY fouls DESC;
 
+----------------------------------+
| location                 | fouls |
+----------------------------------+
| "in the defensive half." | 12    |
| "in the attacking half." | 12    |
| "on the right wing."     | 3     |
| "on the left wing."      | 3     |
+----------------------------------+
4 rows
Who fouls the most?
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler)
RETURN fouler.name AS fouler, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
 
+------------------------------+
| fouler               | fouls |
+------------------------------+
| "Rafinha"            | 4     |
| "Pedro"              | 3     |
| "Medhi Benatia"      | 3     |
| "Dani Alves"         | 3     |
| "Xabi Alonso"        | 3     |
| "Javier Mascherano"  | 2     |
| "Thiago AlcĂĄntara"   | 2     |
| "Robert Lewandowski" | 2     |
| "Sebastian Rode"     | 1     |
| "Sergio Busquets"    | 1     |
+------------------------------+
10 rows
Who was fouled the most?
// who was fouled the most
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)-[:COMMITTED_AGAINST]->(fouled)
RETURN fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
 
+----------------------------------+
| fouled                   | fouls |
+----------------------------------+
| "Robert Lewandowski"     | 4     |
| "Lionel Messi"           | 4     |
| "Neymar"                 | 3     |
| "Pedro"                  | 2     |
| "Xabi Alonso"            | 2     |
| "Andrés Iniesta"         | 2     |
| "Rafinha"                | 2     |
| "Bastian Schweinsteiger" | 2     |
| "Sebastian Rode"         | 1     |
| "Sergio Busquets"        | 1     |
+----------------------------------+
10 rows
Who fouled who the most?
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)-[:COMMITTED_AGAINST]->(fouled),
      (foul)<-[:COMMITTED_FOUL]-(fouler)
RETURN fouler.name AS fouler, fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
 
+--------------------------------------------------------+
| fouler              | fouled                   | fouls |
+--------------------------------------------------------+
| "Javier Mascherano" | "Robert Lewandowski"     | 2     |
| "Dani Alves"        | "Bastian Schweinsteiger" | 2     |
| "Xabi Alonso"       | "Lionel Messi"           | 2     |
| "Rafinha"           | "Neymar"                 | 2     |
| "Rafinha"           | "Andrés Iniesta"         | 2     |
| "Dani Alves"        | "Xabi Alonso"            | 1     |
| "Thiago AlcĂĄntara"  | "Javier Mascherano"      | 1     |
| "Pedro"             | "Juan Bernat"            | 1     |
| "Medhi Benatia"     | "Pedro"                  | 1     |
| "Neymar"            | "Sebastian Rode"         | 1     |
+--------------------------------------------------------+
10 rows
Which team fouled the most?
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
RETURN team.name, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
 
+-----------------------------+
| team.name           | fouls |
+-----------------------------+
| "FC Bayern MĂŒnchen" | 18    |
| "Barcelona"         | 12    |
+-----------------------------+
2 rows
Worst fouler for each team?
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
      (fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
WITH team, fouler, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouler:fouler, fouls:fouls})[0] AS topFouler
RETURN team.name, topFouler.fouler.name, topFouler.fouls;
 
+---------------------------------------------------------------+
| team.name           | topFouler.fouler.name | topFouler.fouls |
+---------------------------------------------------------------+
| "FC Bayern MĂŒnchen" | "Rafinha"             | 4               |
| "Barcelona"         | "Pedro"               | 3               |
+---------------------------------------------------------------+
2 rows
Most fouled against for each team
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)-[:COMMITTED_AGAINST]-(fouled),
      (fouled)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
      (app)-[:FOR_TEAM]->(team)
WITH team, fouled, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouled:fouled, fouls:fouls})[0] AS topFouled
RETURN team.name, topFouled.fouled.name, topFouled.fouls;
 
+---------------------------------------------------------------+
| team.name           | topFouled.fouled.name | topFouled.fouls |
+---------------------------------------------------------------+
| "FC Bayern MĂŒnchen" | "Robert Lewandowski"  | 4               |
| "Barcelona"         | "Lionel Messi"        | 4               |
+---------------------------------------------------------------+
2 rows

So Bayern fouled a bit more than Barca, the main forwards for each team (Messi/Lewandowski) were the most fouled players on the pitch and the fouling was mostly in the middle of the pitch.

I expect this graph will become much more interesting to query with more matches and with the other event types as well but I haven’t got those scraped yet. The code is on github if you want to play around with it and perhaps get the other events into the graph.

Categories: Programming