Warning: Table './devblogsdb/cache_page' is marked as crashed and last (automatic?) repair failed query: SELECT data, created, headers, expire, serialized FROM cache_page WHERE cid = 'http://www.softdevblogs.com/?q=aggregator/sources/23&page=1' in /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/database.mysql.inc on line 135

Warning: Cannot modify header information - headers already sent by (output started at /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/database.mysql.inc:135) in /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/bootstrap.inc on line 729

Warning: Cannot modify header information - headers already sent by (output started at /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/database.mysql.inc:135) in /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/bootstrap.inc on line 730

Warning: Cannot modify header information - headers already sent by (output started at /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/database.mysql.inc:135) in /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/bootstrap.inc on line 731

Warning: Cannot modify header information - headers already sent by (output started at /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/database.mysql.inc:135) in /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/bootstrap.inc on line 732
Software Development Blogs: Programming, Software Testing, Agile, Project Management
Skip to content

Software Development Blogs: Programming, Software Testing, Agile Project Management

Methods & Tools

Subscribe to Methods & Tools
if you are not afraid to read more than one page to be a smarter software developer, software tester or project manager!

Mark Needham
warning: Cannot modify header information - headers already sent by (output started at /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/database.mysql.inc:135) in /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/common.inc on line 153.
Syndicate content
Thoughts on Software Development
Updated: 5 hours 3 min ago

Unix: Find files greater than date

Fri, 06/24/2016 - 17:56

For the latter part of the week I’ve been running some tests against Neo4j which generate a bunch of log files and I wanted to filter those files based on the time they were created to do some further analysis.

This is an example of what the directory listing looks like:

$ ls -alh foo/database-agent-*
-rw-r--r--  1 markneedham  wheel   2.5K 23 Jun 14:00 foo/database-agent-mac17f73-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel   8.6K 23 Jun 11:49 foo/database-agent-mac19b6b-1-logs-archive-201606231049507.tar.gz
-rw-r--r--  1 markneedham  wheel   8.6K 23 Jun 11:49 foo/database-agent-mac1f427-1-logs-archive-201606231049507.tar.gz
-rw-r--r--  1 markneedham  wheel   2.5K 23 Jun 14:00 foo/database-agent-mac29389-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel    11K 23 Jun 13:44 foo/database-agent-mac3533f-1-logs-archive-201606231244152.tar.gz
-rw-r--r--  1 markneedham  wheel   4.8K 23 Jun 14:00 foo/database-agent-mac35563-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel   3.8K 23 Jun 13:44 foo/database-agent-mac35f7e-1-logs-archive-201606231244165.tar.gz
-rw-r--r--  1 markneedham  wheel   4.8K 23 Jun 14:00 foo/database-agent-mac40798-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel    12K 23 Jun 13:44 foo/database-agent-mac490bf-1-logs-archive-201606231244151.tar.gz
-rw-r--r--  1 markneedham  wheel   2.5K 23 Jun 14:00 foo/database-agent-mac5f094-1-logs-archive-201606231300189.tar.gz
-rw-r--r--  1 markneedham  wheel   5.8K 23 Jun 14:00 foo/database-agent-mac636b8-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel   9.5K 23 Jun 11:49 foo/database-agent-mac7e165-1-logs-archive-201606231049507.tar.gz
-rw-r--r--  1 markneedham  wheel   2.7K 23 Jun 11:49 foo/database-agent-macab7f1-1-logs-archive-201606231049507.tar.gz
-rw-r--r--  1 markneedham  wheel   2.8K 23 Jun 13:44 foo/database-agent-macbb8e1-1-logs-archive-201606231244151.tar.gz
-rw-r--r--  1 markneedham  wheel   3.1K 23 Jun 11:49 foo/database-agent-macbcbe8-1-logs-archive-201606231049520.tar.gz
-rw-r--r--  1 markneedham  wheel    13K 23 Jun 13:44 foo/database-agent-macc8177-1-logs-archive-201606231244152.tar.gz
-rw-r--r--  1 markneedham  wheel   3.8K 23 Jun 13:44 foo/database-agent-maccd92c-1-logs-archive-201606231244151.tar.gz
-rw-r--r--  1 markneedham  wheel   3.9K 23 Jun 13:44 foo/database-agent-macdf24f-1-logs-archive-201606231244165.tar.gz
-rw-r--r--  1 markneedham  wheel   3.1K 23 Jun 11:49 foo/database-agent-mace075e-1-logs-archive-201606231049520.tar.gz
-rw-r--r--  1 markneedham  wheel   3.1K 23 Jun 11:49 foo/database-agent-mace8859-1-logs-archive-201606231049507.tar.gz

I wanted to split the files in half so that I could have the ones created before and after 12pm on the 23rd June.

I discovered that this type of filtering is actually quite easy to do with the ‘find’ command. So if I want to get the files after 12pm I could write the following:

$ find foo -name database-agent* -newermt "Jun 23, 2016 12:00" -ls
121939705        8 -rw-r--r--    1 markneedham      wheel                2524 23 Jun 14:00 foo/database-agent-mac17f73-1-logs-archive-201606231300176.tar.gz
121939704        8 -rw-r--r--    1 markneedham      wheel                2511 23 Jun 14:00 foo/database-agent-mac29389-1-logs-archive-201606231300176.tar.gz
121934591       24 -rw-r--r--    1 markneedham      wheel               11294 23 Jun 13:44 foo/database-agent-mac3533f-1-logs-archive-201606231244152.tar.gz
121939707       16 -rw-r--r--    1 markneedham      wheel                4878 23 Jun 14:00 foo/database-agent-mac35563-1-logs-archive-201606231300176.tar.gz
121934612        8 -rw-r--r--    1 markneedham      wheel                3896 23 Jun 13:44 foo/database-agent-mac35f7e-1-logs-archive-201606231244165.tar.gz
121939708       16 -rw-r--r--    1 markneedham      wheel                4887 23 Jun 14:00 foo/database-agent-mac40798-1-logs-archive-201606231300176.tar.gz
121934589       24 -rw-r--r--    1 markneedham      wheel               12204 23 Jun 13:44 foo/database-agent-mac490bf-1-logs-archive-201606231244151.tar.gz
121939720        8 -rw-r--r--    1 markneedham      wheel                2510 23 Jun 14:00 foo/database-agent-mac5f094-1-logs-archive-201606231300189.tar.gz
121939706       16 -rw-r--r--    1 markneedham      wheel                5912 23 Jun 14:00 foo/database-agent-mac636b8-1-logs-archive-201606231300176.tar.gz
121934588        8 -rw-r--r--    1 markneedham      wheel                2895 23 Jun 13:44 foo/database-agent-macbb8e1-1-logs-archive-201606231244151.tar.gz
121934590       32 -rw-r--r--    1 markneedham      wheel               13427 23 Jun 13:44 foo/database-agent-macc8177-1-logs-archive-201606231244152.tar.gz
121934587        8 -rw-r--r--    1 markneedham      wheel                3882 23 Jun 13:44 foo/database-agent-maccd92c-1-logs-archive-201606231244151.tar.gz
121934611        8 -rw-r--r--    1 markneedham      wheel                3970 23 Jun 13:44 foo/database-agent-macdf24f-1-logs-archive-201606231244165.tar.gz

And to get the ones before 12pm:

$ find foo -name database-agent* -not -newermt "Jun 23, 2016 12:00" -ls
121879391       24 -rw-r--r--    1 markneedham      wheel                8856 23 Jun 11:49 foo/database-agent-mac19b6b-1-logs-archive-201606231049507.tar.gz
121879394       24 -rw-r--r--    1 markneedham      wheel                8772 23 Jun 11:49 foo/database-agent-mac1f427-1-logs-archive-201606231049507.tar.gz
121879390       24 -rw-r--r--    1 markneedham      wheel                9702 23 Jun 11:49 foo/database-agent-mac7e165-1-logs-archive-201606231049507.tar.gz
121879393        8 -rw-r--r--    1 markneedham      wheel                2812 23 Jun 11:49 foo/database-agent-macab7f1-1-logs-archive-201606231049507.tar.gz
121879413        8 -rw-r--r--    1 markneedham      wheel                3144 23 Jun 11:49 foo/database-agent-macbcbe8-1-logs-archive-201606231049520.tar.gz
121879414        8 -rw-r--r--    1 markneedham      wheel                3131 23 Jun 11:49 foo/database-agent-mace075e-1-logs-archive-201606231049520.tar.gz
121879392        8 -rw-r--r--    1 markneedham      wheel                3130 23 Jun 11:49 foo/database-agent-mace8859-1-logs-archive-201606231049507.tar.gz

Or we could even find the ones last modified between 12pm and 2pm:

$ find foo -name database-agent* -not -newermt "Jun 23, 2016 14:00" -newermt "Jun 23, 2016 12:00" -ls
121934591       24 -rw-r--r--    1 markneedham      wheel               11294 23 Jun 13:44 foo/database-agent-mac3533f-1-logs-archive-201606231244152.tar.gz
121934612        8 -rw-r--r--    1 markneedham      wheel                3896 23 Jun 13:44 foo/database-agent-mac35f7e-1-logs-archive-201606231244165.tar.gz
121934589       24 -rw-r--r--    1 markneedham      wheel               12204 23 Jun 13:44 foo/database-agent-mac490bf-1-logs-archive-201606231244151.tar.gz
121934588        8 -rw-r--r--    1 markneedham      wheel                2895 23 Jun 13:44 foo/database-agent-macbb8e1-1-logs-archive-201606231244151.tar.gz
121934590       32 -rw-r--r--    1 markneedham      wheel               13427 23 Jun 13:44 foo/database-agent-macc8177-1-logs-archive-201606231244152.tar.gz
121934587        8 -rw-r--r--    1 markneedham      wheel                3882 23 Jun 13:44 foo/database-agent-maccd92c-1-logs-archive-201606231244151.tar.gz
121934611        8 -rw-r--r--    1 markneedham      wheel                3970 23 Jun 13:44 foo/database-agent-macdf24f-1-logs-archive-201606231244165.tar.gz

Or we can filter by relative time e.g. to find the files last modified in the last 1 day, 5 hours:

$ find foo -name database-agent* -mtime -1d5h -ls
121939705        8 -rw-r--r--    1 markneedham      wheel                2524 23 Jun 14:00 foo/database-agent-mac17f73-1-logs-archive-201606231300176.tar.gz
121939704        8 -rw-r--r--    1 markneedham      wheel                2511 23 Jun 14:00 foo/database-agent-mac29389-1-logs-archive-201606231300176.tar.gz
121934591       24 -rw-r--r--    1 markneedham      wheel               11294 23 Jun 13:44 foo/database-agent-mac3533f-1-logs-archive-201606231244152.tar.gz
121939707       16 -rw-r--r--    1 markneedham      wheel                4878 23 Jun 14:00 foo/database-agent-mac35563-1-logs-archive-201606231300176.tar.gz
121934612        8 -rw-r--r--    1 markneedham      wheel                3896 23 Jun 13:44 foo/database-agent-mac35f7e-1-logs-archive-201606231244165.tar.gz
121939708       16 -rw-r--r--    1 markneedham      wheel                4887 23 Jun 14:00 foo/database-agent-mac40798-1-logs-archive-201606231300176.tar.gz
121934589       24 -rw-r--r--    1 markneedham      wheel               12204 23 Jun 13:44 foo/database-agent-mac490bf-1-logs-archive-201606231244151.tar.gz
121939720        8 -rw-r--r--    1 markneedham      wheel                2510 23 Jun 14:00 foo/database-agent-mac5f094-1-logs-archive-201606231300189.tar.gz
121939706       16 -rw-r--r--    1 markneedham      wheel                5912 23 Jun 14:00 foo/database-agent-mac636b8-1-logs-archive-201606231300176.tar.gz
121934588        8 -rw-r--r--    1 markneedham      wheel                2895 23 Jun 13:44 foo/database-agent-macbb8e1-1-logs-archive-201606231244151.tar.gz
121934590       32 -rw-r--r--    1 markneedham      wheel               13427 23 Jun 13:44 foo/database-agent-macc8177-1-logs-archive-201606231244152.tar.gz
121934587        8 -rw-r--r--    1 markneedham      wheel                3882 23 Jun 13:44 foo/database-agent-maccd92c-1-logs-archive-201606231244151.tar.gz
121934611        8 -rw-r--r--    1 markneedham      wheel                3970 23 Jun 13:44 foo/database-agent-macdf24f-1-logs-archive-201606231244165.tar.gz

Or the ones modified more than 1 day, 5 hours ago:

$ find foo -name database-agent* -mtime +1d5h -ls
121879391       24 -rw-r--r--    1 markneedham      wheel                8856 23 Jun 11:49 foo/database-agent-mac19b6b-1-logs-archive-201606231049507.tar.gz
121879394       24 -rw-r--r--    1 markneedham      wheel                8772 23 Jun 11:49 foo/database-agent-mac1f427-1-logs-archive-201606231049507.tar.gz
121879390       24 -rw-r--r--    1 markneedham      wheel                9702 23 Jun 11:49 foo/database-agent-mac7e165-1-logs-archive-201606231049507.tar.gz
121879393        8 -rw-r--r--    1 markneedham      wheel                2812 23 Jun 11:49 foo/database-agent-macab7f1-1-logs-archive-201606231049507.tar.gz
121879413        8 -rw-r--r--    1 markneedham      wheel                3144 23 Jun 11:49 foo/database-agent-macbcbe8-1-logs-archive-201606231049520.tar.gz
121879414        8 -rw-r--r--    1 markneedham      wheel                3131 23 Jun 11:49 foo/database-agent-mace075e-1-logs-archive-201606231049520.tar.gz
121879392        8 -rw-r--r--    1 markneedham      wheel                3130 23 Jun 11:49 foo/database-agent-mace8859-1-logs-archive-201606231049507.tar.gz

There are lots of other flags you can pass to find but these ones did exactly what I wanted!

Categories: Programming

Unix: Find all text below string in a file

Sun, 06/19/2016 - 09:36

I recently wanted to parse some text out of a bunch of files so that I could do some sentiment analysis on it. Luckily the text I want is at the end of the file and doesn’t have anything after it but there is text before it that I want to get rid.

The files look like this:

# text I don't care about
 
= Heading of the bit I care about
 
# text I care about

In other words I want to find the line that contains the Heading and then get all the text after that point.

I figured sed was the tool for the job but my knowledge of the syntax was a bit rusty. Luckily this post served as a refresher.

Effectively what we want to do is delete from the beginning of the file up until the line after the heading. We can do this with the following command:

$ cat /tmp/foo.txt 
# text I don't care about
 
= Heading of the bit I care about
 
# text I care about
$ cat /tmp/foo.txt | sed '1,/Heading of the bit I care about/d'
 
# text I care about

That still leaves an extra empty line after the heading which is a bit annoying but easy enough to get rid of by passing another command to sed that strips empty lines:

$ cat /tmp/foo.txt | sed -e '1,/Heading of the bit I care about/d' -e '/^\s*$/d'
# text I care about

The only difference here is that we’re now passing the ‘-e’ flag to allow us to specify multiple commands. If we just pass them sequentially then the 2nd one will be interpreted as the name of a file.

Categories: Programming

Unix: Split string using separator

Sun, 06/19/2016 - 08:22

I recently found myself needing to iterate over a bunch of ‘/’ separated strings on the command line and extract just the text after the last ‘/’.

e.g. an example of one of the strings

A/B/C

I wanted to write some code that could split on ‘/’ and then pick the 3rd item in the resulting collection.

One way of doing this is to echo the string and then pipe it through cut:

$ string="A/B/C"
$ echo ${string} | cut -d"/" -f3
C

or awk:

$ echo ${string} | awk -F"/" '{ print $3}'
C

I don’t like having to echo the string – it feels a bit odd so I wanted to see if there was a way to do the parsing more ‘inline’.

I came across this post which explains how to change the internal field separator (IFS) on the shell and then parse the string into an array using read. I gave it a try:

$ IFS="/" read -ra ADDR <<< "${string}"; echo ${ADDR[2]}
C

Works! We can even refer to the last item in the array using -1 instead of it’s absolute position:

$ IFS="/" read -ra ADDR <<< "${string}"; echo ${ADDR[-1]}
C

I’d not come across this use of the ‘read’ function before. The key is the ‘-a’ parameter. From the man page:

-a aname
The words are assigned to sequential indices of the array variable aname,
starting at 0. All elements are removed from aname before the assignment.
Other name arguments are ignored.

So we’re resetting the internal field separator and then reading the string into another variable as an array split on the ‘/’ character.

Pretty neat although now it’s longer than the original command and I’m sure I’ll forget the syntax.

Further down the page is another suggestion which seems even harder to remember but is much shorter:

$ echo ${string##*/} 
C

This drops from the beginning of the string up until the last occurrence of ‘/’ which is exactly what we want.

This way is the nicest and doesn’t require any echoing if we just want to assign the result to a variable. The echo is only used here to see the output.

Categories: Programming

Python: Regex – matching foreign characters/unicode letters

Sat, 06/18/2016 - 08:38

I’ve been back in the land of screen scrapping this week extracting data from the Game of Thrones wiki and needed to write a regular expression to pull out characters and actors.

Here are some examples of the format of the data:

Peter Dinklage as Tyrion Lannister
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
Filip Lozić as Young Nobleman
Morgan C. Jones as a Braavosi captain
Adewale Akinnuoye-Agbaje as Malko

So the pattern is:

<actor> as <character>

optionally followed by some other text that we’re not interested in.

The output I want to get is:

Peter Dinklage, Tyrion Lannister
Daniel Naprous, Oznak zo Pahl
Filip Lozić, Young Nobleman
Morgan C. Jones, a Braavosi captain
Adewale Akinnuoye-Agbaje, Malko

I started using the ‘split’ command on the word ‘as’ but that broke down when I realised some of the characters had the letters ‘as’ in the middle of their name. So regex it is!

This was my first attempt:

import re
 
strings = [
    "Peter Dinklage as Tyrion Lannister",
    "Filip Lozić as Young Nobleman",
    "Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)",
    "Morgan C. Jones as a Braavosi captain",
    "Adewale Akinnuoye-Agbaje as Malko"
]
 
regex = "([A-Za-z\-'\. ]*) as ([A-Za-z\-'\. ]*)"
 
for string in strings:
    print string
    match = re.match( regex, string)
    if match is not None:
        print match.groups()
    else:
        print "FAIL"
	print ""
Peter Dinklage as Tyrion Lannister
('Peter Dinklage', 'Tyrion Lannister')
 
Filip Lozić as Young Nobleman
FAIL
 
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
('Daniel Naprous', 'Oznak zo Pahl')
 
Morgan C. Jones as a Braavosi captain
('Morgan C. Jones', 'a Braavosi captain')
 
Adewale Akinnuoye-Agbaje as Malko
('Adewale Akinnuoye-Agbaje', 'Malko')

It works for 4 of the 5 scenarios but now for Filip Lozić. The ‘ć’ character causes the issue so we need to be able to match foreign characters which the current charset I defined in the regex doesn’t capture.

I came across this Stack Overflow post which said that in some regex libraries you can use ‘\p{L}’ to match all letters. I gave that a try:

regex = "([\p{L}\-'\. ]*) as ([\p{L}\-'\. ]*)"

And then re-ran the script:

Peter Dinklage as Tyrion Lannister
FAIL
 
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
FAIL
 
Filip Lozić as Young Nobleman
FAIL
 
Morgan C. Jones as a Braavosi captain
FAIL
 
Adewale Akinnuoye-Agbaje as Malko
FAIL

Hmmm, not sure if I did it wrong or if that isn’t available in Python. I’ll assume the latter but feel free to correct me in the comments and I’ll update the post.

I went search again and found this post which suggested another approach:

You can construct a new character class:

[^\W\d_]

instead of \w. Translated into English, it means “Any character that is not a non-alphanumeric character ([^\W] is the same as \w), but that is also not a digit and not an underscore”.

Let’s try plugging that in:

regex = "([A-Za-z\-'\.^\W\d_ ]*) as ([A-Za-z\-'\.^\W\d_ ]*)"
Peter Dinklage as Tyrion Lannister
('Peter Dinklage', 'Tyrion Lannister')
 
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
('Daniel Naprous as Oznak zo Pahl(credited', 'Stunt Performer)')
 
Filip Lozić as Young Nobleman
('Filip Lozi\xc4\x87', 'Young Nobleman')
 
Morgan C. Jones as a Braavosi captain
('Morgan C. Jones', 'a Braavosi captain')
 
Adewale Akinnuoye-Agbaje as Malko
('Adewale Akinnuoye-Agbaje', 'Malko')

So that’s fixed Filip but now Daniel Naprous is being incorrectly parsed.

For Attempt #4 I decided to try excluding what I don’t want instead:

regex = "([^0-9\(]*) as ([^0-9\(]*)"
Peter Dinklage as Tyrion Lannister
('Peter Dinklage', 'Tyrion Lannister')
 
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
('Daniel Naprous', 'Oznak zo Pahl')
 
Filip Lozić as Young Nobleman
('Filip Lozi\xc4\x87', 'Young Nobleman')
 
Morgan C. Jones as a Braavosi captain
('Morgan C. Jones', 'a Braavosi captain')
 
Adewale Akinnuoye-Agbaje as Malko
('Adewale Akinnuoye-Agbaje', 'Malko')

That does the job but has exposed my lack of regex skillz. If you know a better way let me know in the comments.

Categories: Programming

Unix parallel: Populating all the USB sticks

Wed, 06/01/2016 - 06:53

The day before Graph Connect Europe 2016 we needed to create a bunch of USB sticks containing Neo4j and the training materials and eventually iterated our way to a half decent approach which made use of the GNU parallel command which I’ve always wanted to use!


But first I needed to get a USB hub so I could do lots of them at the same time. I bought the EasyAcc USB 3.0 but there are lots of other ones that do the same job.

Next I mouunted all the USB sticks and then renamed the volumes to be NEO4J1 -> NEO4J7:

for i in 1 2 3 4 5 6 7; do diskutil renameVolume "USB DISK" NEO4J${i}; done

I then created a bash function called ‘duplicate’ to do the copying work:

function duplicate() {
  i=${1}
  echo ${i}
  time rsync -avP --size-only --delete --exclude '.*' --omit-dir-times /Users/markneedham/Downloads/graph-connect-europe-2016/ /Volumes/NEO4J${i}/
}

We can now call this function in parallel like so:

seq 1 7 | parallel duplicate

And that’s it. We didn’t get a 7x improvement in the throughput of USB creation from doing 7 in parallel but it took ~ 9 minutes to complete 7 compared to 5 minutes each. Presumably there’s still some part of the copying that is sequential further down – Amdahl’s law #ftw.

I want to go and find other things that I can use pipe into parallel now!

Categories: Programming

Neo4j vs Relational: Refactoring – Extracting node/table

Sun, 05/22/2016 - 10:58

In my previous blog post I showed how to add a new property/field to a node with a label/record in a table for a football transfers dataset that I’ve been playing with.

After introducing this ‘nationality’ property I realised that I now had some duplication in the model:

2016 05 22 10 15 15

players.nationality and clubs.country are referring to the same countries but they’ve both got them stored as strings so we can’t ensure the integrity of our countries and ensure that we’re referring to the same country.

We have the same issue in the graph model as well:

2016 05 22 10 40 40

This time Player.nationality and Club.country refer to the same countries.

We can solve our problem by introducing a countries table in the relational model and a set of nodes with a ‘Country’ label in the graph model. Let’s start with relational.

This is the model we’re driving towards:

2016 05 22 10 50 43

The first thing we need to do is create a countries table and populate it:

CREATE TABLE countries (
    "code" CHARACTER VARYING(3) NOT NULL PRIMARY KEY,
    "name" CHARACTER VARYING(50) NOT NULL
);
INSERT INTO countries VALUES('MNE', 'Montenegro');
INSERT INTO countries VALUES('SWZ', 'Swaziland');
...

Next let’s update the clubs table to reference the countries table:

ALTER TABLE clubs
ADD COLUMN country_id CHARACTER VARYING(3)
REFERENCES countries(code);

And let’s run a query to populate that column:

UPDATE clubs AS cl
SET country_id = c.code
FROM clubs
INNER JOIN countries AS c 
ON c.name = clubs.country
WHERE cl.id = clubs.id;

This query iterates over all the clubs, queries the country table to find the country id for that row and then stores it in the ‘country_id’ field. Finally we can remove the ‘country’ field:

ALTER TABLE clubs
DROP COLUMN country;

Now we do the same drill for the players table:

ALTER TABLE players
ADD COLUMN country_id CHARACTER VARYING(3)
REFERENCES countries(code);
UPDATE players AS p
SET country_id = c.code
FROM players
INNER JOIN countries AS c 
ON c.name = players.nationality
WHERE p.id = players.id;
ALTER TABLE players
DROP COLUMN nationality;

Now it’s time for the graph. This is the model we want to get to:

2016 05 22 10 51 49

First we’ll create the countries:

CREATE CONSTRAINT ON (c:Country)
ASSERT c.id IS UNIQUE
LOAD CSV WITH HEADERS FROM "file:///countries.csv"
AS row
MERGE (country:Country {id: row.countryCode})
ON CREATE SET country.name = row.country

And now let’s get clubs and players to point at those countries nodes and get rid of their respective nationality/country properties:

MATCH (club:Club)
MATCH (country:Country {name: club.country})
MERGE (club)-[:PART_OF]->(country)
REMOVE club.country
MATCH (player:Player)
MATCH (country:Country {name: player.nationality})
MERGE (player)-[:PLAYS_FOR]->(country)
REMOVE player.nationality

And that’s it, we can now write queries against our new model.

Categories: Programming

Neo4j vs Relational: Refactoring – Add a new field/property

Sun, 05/22/2016 - 10:09

A couple of months ago I presented a webinar comparing how you’d model and evolve a data model using a Postgres SQL database and Neo4j.

This is what the two data models looked like after the initial data import and before any refactoring/migration had been done:

Relational

2016 05 22 09 49 23

Graph

2016 05 22 09 52 16

I wanted to add a ‘nationality’ property to the players table in the SQL schema and to the nodes with the ‘Player’ label in the graph.

This refactoring is quite easy in both models. In the relational database we first run a query to add the ‘nationality’ field to the table:

ALTER TABLE players 
ADD COLUMN nationality VARYING(30);

And then we need to generate UPDATE statements from our data dump to update all the existing records:

UPDATE players 
SET nationality = 'Brazil'
WHERE players.id = '/aldair/profil/spieler/4151';
 
...

In the graph we can do this in a single step by processing our data dump using the LOAD CSV command and then setting a property on each player:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///transfers.csv" AS row
MATCH (player:Player {id: row.playerUri})
SET player.nationality = row.playerNationality

If we wanted to make the nationality field non nullable we could go back and run the following queries:

ALTER TABLE players 
ALTER COLUMN nationality SET NOT NULL
CREATE CONSTRAINT ON (player:Player) 
ASSERT exists(player.nationality)

And we’re done!

Categories: Programming

R: substr – Getting a vector of positions

Mon, 04/18/2016 - 20:49

I recently found myself writing an R script to extract parts of a string based on a beginning and end index which is reasonably easy using the substr function:

> substr("mark loves graphs", 0, 4)
[1] "mark"

But what if we have a vector of start and end positions?

> substr("mark loves graphs", c(0, 6), c(4, 10))
[1] "mark"

Hmmm that didn’t work as I expected! It turns out we actually need to use the substring function instead which wasn’t initially obvious to me on reading the documentation:

> substring("mark loves graphs", c(0, 6, 12), c(4, 10, 17))
[1] "mark"   "loves"  "graphs"

Easy when you know how!

Categories: Programming

R: tm – Unique words/terms per document

Mon, 04/11/2016 - 06:40

I’ve been doing a bit of text mining over the weekend using the R tm package and I wanted to only count a term once per document which isn’t how it works out the box.

For example let’s say we’re writing a bit of code to calculate the frequency of terms across some documents. We might write the following code:

library(tm)
text = c("I am Mark I am Mark", "Neo4j is cool Neo4j is cool")
corpus = VCorpus(VectorSource(text))
tdm = as.matrix(TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf))))
 
> tdm
       Docs
Terms   1 2
  am    2 0
  cool  0 2
  i     2 0
  is    0 2
  mark  2 0
  neo4j 0 2
 
> rowSums(tdm)
   am  cool     i    is  mark neo4j 
    2     2     2     2     2     2

We’ve created a small corpus over a vector which contains two bits of text. On the last line we output a TermDocumentMatrix which shows how frequently each term shows up across the corpus. I had to tweak the default word length of 3 to make sure we could see ‘am’ and ‘cool’.

But we’ve actually got some duplicate terms in each of our documents so we want to get rid of those and only count unique terms per document.

We can achieve that by mapping over the corpus using the tm_map function and then applying a function which returns unique terms. I wrote the following function:

uniqueWords = function(d) {
  return(paste(unique(strsplit(d, " ")[[1]]), collapse = ' '))
}

We can then apply the function like so:

corpus = tm_map(corpus, content_transformer(uniqueWords))
tdm = as.matrix(TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf))))
 
> tdm
       Docs
Terms   1 2
  am    1 0
  cool  0 1
  i     1 0
  is    0 1
  mark  1 0
  neo4j 0 1
 
> rowSums(tdm)
   am  cool     i    is  mark neo4j 
    1     1     1     1     1     1

And now each term is only counted once. Success!

Categories: Programming