Mark Needham
Thoughts on Software Development

Mahout/Hadoop: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4

Fri, 07/22/2016 - 14:55

I’ve been working my way through Dragan Milcevski’s mini tutorial on using Mahout to do content-based filtering on documents and reached the final step, where I needed to read in the generated item-similarity files.

I got the example compiling by using the following Maven dependency:

<dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-core</artifactId>
      <version>0.9</version>
</dependency>

Unfortunately, when I ran the code I hit a version incompatibility problem:

Exception in thread "main" org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
	at org.apache.hadoop.ipc.Client.call(Client.java:1113)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
	at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
	at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
	at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
	at org.apache.hadoop.hdfs.DFSClient.createNamenode(DFSClient.java:183)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:281)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:245)
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:100)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:124)
	at com.markhneedham.mahout.Similarity.getDocIndex(Similarity.java:86)
	at com.markhneedham.mahout.Similarity.main(Similarity.java:25)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

Version 0.9.0 of mahout-core was published in early 2014 so I expect it was built against an earlier version of Hadoop than I’m using (2.7.2).

I tried updating the Hadoop dependencies referenced in the stack trace, but to no avail:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.2</version>
</dependency>
 
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.2</version>
</dependency>

When stepping through the stack trace I noticed that my program was still using an old version of hadoop-core, so with one last throw of the dice I decided to try explicitly excluding that:

<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>0.9</version>
 
    <exclusions>
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
        </exclusion>
    </exclusions>
</dependency>

And amazingly it worked. Now, finally, I can see how similar my documents are!

Categories: Programming

Hadoop: DataNode not starting

Fri, 07/22/2016 - 14:31

In my continued playing with Mahout I eventually decided to give up on using my local file system and use a local Hadoop installation instead, since that seems to have much less friction when following examples.

Unfortunately all my attempts to upload any files from my local file system to HDFS were being met with the following exception:

java.io.IOException: File /user/markneedham/book2.txt could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1448)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:690)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:342)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1350)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1346)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1344)
 
at org.apache.hadoop.ipc.Client.call(Client.java:905)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:198)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:928)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:811)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

I eventually realised, from looking at the output of jps, that the DataNode wasn’t actually starting up, which explained the error message I was seeing.

A quick look at the log files showed what was going wrong:


/usr/local/Cellar/hadoop/2.7.1/libexec/logs/hadoop-markneedham-datanode-marks-mbp-4.zte.com.cn.log

2016-07-21 18:58:00,496 WARN org.apache.hadoop.hdfs.server.common.Storage: java.io.IOException: Incompatible clusterIDs in /usr/local/Cellar/hadoop/hdfs/tmp/dfs/data: namenode clusterID = CID-c2e0b896-34a6-4dde-b6cd-99f36d613e6a; datanode clusterID = CID-403dde8b-bdc8-41d9-8a30-fe2dc951575c
2016-07-21 18:58:00,496 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to /0.0.0.0:8020. Exiting.
java.io.IOException: All specified directories are failed to load.
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:477)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1361)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1326)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:316)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:801)
        at java.lang.Thread.run(Thread.java:745)
2016-07-21 18:58:00,497 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to /0.0.0.0:8020
2016-07-21 18:58:00,602 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool <registering> (Datanode Uuid unassigned)
2016-07-21 18:58:02,607 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2016-07-21 18:58:02,608 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
2016-07-21 18:58:02,610 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:

I’m not sure how my clusterIDs got out of sync, although I expect it’s because I reformatted HDFS at some stage without realising it. There are other ways of solving this problem but the quickest for me was to just nuke the DataNode’s data directory, which the log file told me sits here:

sudo rm -r /usr/local/Cellar/hadoop/hdfs/tmp/dfs/data/current
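If you don’t want to lose the data there’s a less destructive option, which is to make the DataNode’s clusterID match the NameNode’s one in their respective VERSION files. I didn’t try it this time, but a quick Python sketch for comparing the two values would look something like this (the data directory comes from the log above; the name directory path is my guess based on the default layout):

# compare the clusterID recorded by the NameNode and the DataNode
name_version = "/usr/local/Cellar/hadoop/hdfs/tmp/dfs/name/current/VERSION"  # assumed path
data_version = "/usr/local/Cellar/hadoop/hdfs/tmp/dfs/data/current/VERSION"
 
def cluster_id(path):
    with open(path) as f:
        props = dict(line.strip().split("=", 1) for line in f if "=" in line)
    return props["clusterID"]
 
print "NameNode:", cluster_id(name_version)
print "DataNode:", cluster_id(data_version)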

I then re-ran the hstart script that I stole from this tutorial and everything, including the DataNode this time, started up correctly:

$ jps
26736 NodeManager
26392 DataNode
26297 NameNode
26635 ResourceManager
26510 SecondaryNameNode

And now I can upload local files to HDFS again. #win!

Categories: Programming

Mahout: Exception in thread “main” java.lang.IllegalArgumentException: Wrong FS: file:/… expected: hdfs://

Thu, 07/21/2016 - 18:57

I’ve been playing around with Mahout over the last couple of days to see how well it works for content based filtering.

I started following a mini tutorial from Stack Overflow but ran into trouble on the first step:

bin/mahout seqdirectory \
--input file:///Users/markneedham/Downloads/apache-mahout-distribution-0.12.2/foo \
--output file:///Users/markneedham/Downloads/apache-mahout-distribution-0.12.2/foo-out \
-c UTF-8 \
-chunk 64 \
-prefix mah
16/07/21 21:19:20 INFO AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[file:///Users/markneedham/Downloads/apache-mahout-distribution-0.12.2/foo], --keyPrefix=[mah], --method=[mapreduce], --output=[file:///Users/markneedham/Downloads/apache-mahout-distribution-0.12.2/foo-out], --startPhase=[0], --tempDir=[temp]}
16/07/21 21:19:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/21 21:19:20 INFO deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
16/07/21 21:19:20 INFO deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
16/07/21 21:19:20 INFO deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file:/Users/markneedham/Downloads/apache-mahout-distribution-0.12.2/foo, expected: hdfs://localhost:8020
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:646)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
	at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
	at org.apache.mahout.text.SequenceFilesFromDirectory.runMapReduce(SequenceFilesFromDirectory.java:156)
	at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:90)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:64)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
	at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

I was trying to run the command against the local file system on my laptop, which should have been possible according to the instructions. I couldn’t find any flag I could pass to Mahout to tell it not to use HDFS, but I eventually stumbled on someone else experiencing a similar problem.

It turns out that the last time I was playing around with Hadoop, in late 2015, I’d actually configured HDFS as the default file system and had completely forgotten. I needed to comment out the following config:

/usr/local/Cellar/hadoop/2.7.1/libexec/etc/hadoop/core-site.xml

<property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
</property>

I commented that property out and all was happy with the (Hadoop) world again.

Categories: Programming

Neo4j: Cypher – Detecting duplicates using relationships

Wed, 07/20/2016 - 18:32

I’ve been building a graph of computer science papers on and off for a couple of months and now that I’ve got a few thousand loaded in I realised that there are quite a few duplicates.

They’re not duplicates in the sense that there are multiple entries with the same identifier; rather, they have different identifiers but seem to be the same paper!

e.g. there are a couple of papers titled ‘Authentication in the Taos operating system’:

http://dl.acm.org/citation.cfm?id=174614

[screenshot]

http://dl.acm.org/citation.cfm?id=168640

[screenshot]

This is the same paper published in two different journals as far as I can tell.

Now in this case it’s quite easy to just do a string similarity comparison of the titles of these papers and realise that they’re identical. I’ve previously used the excellent dedupe library to do this, and there’s also an excellent talk from Berlin Buzzwords 2014 where the author uses locality-sensitive hashing to achieve a similar outcome.
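For example, a minimal sketch using Python’s built-in difflib picks up the Taos pair straight away (dedupe and the locality-sensitive hashing approach are much more robust ways of doing this at scale):

from difflib import SequenceMatcher
 
title_1 = "Authentication in the Taos operating system"  # id 174614
title_2 = "Authentication in the Taos operating system"  # id 168640
 
# ratio() gives a score between 0 (nothing in common) and 1 (identical)
print SequenceMatcher(None, title_1.lower(), title_2.lower()).ratio()
1.0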

However, I was curious whether I could use any of the relationships these papers have to detect duplicates rather than just relying on string matching.

This is what the graph looks like:

[graph visualisation]

We’ll start by writing a query to see how many common references the different Taos papers have:

MATCH (r:Resource {id: "168640"})-[:REFERENCES]->(other)
WITH r, COLLECT(other) as myReferences
 
UNWIND myReferences AS reference
OPTIONAL MATCH path = (other)-[:REFERENCES]->(reference)
WITH other, COUNT(path) AS otherReferences, SIZE(myReferences) AS myReferences
WITH other, 1.0 * otherReferences / myReferences AS similarity WHERE similarity > 0.5
 
RETURN other.id, other.title, similarity
ORDER BY similarity DESC
LIMIT 10
╒════════╀═══════════════════════════════════════════╀══════════╕
β”‚other.idβ”‚other.title                                β”‚similarityβ”‚
β•žβ•β•β•β•β•β•β•β•β•ͺ═══════════════════════════════════════════β•ͺ══════════║
β”‚168640  β”‚Authentication in the Taos operating systemβ”‚1         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚174614  β”‚Authentication in the Taos operating systemβ”‚1         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

This query:

  • picks one of the Taos papers and finds its references
  • finds other papers which reference those same papers
  • calculates a similarity score based on how many common references they have
  • returns papers that have more than 50% of the same references with the most similar ones at the top

I tried it with other papers to see how it fared:

Performance of Firefly RPC

╒════════╀════════════════════════════════════════════════════════════════╀══════════════════╕
β”‚other.idβ”‚other.title                                                     β”‚similarity        β”‚
β•žβ•β•β•β•β•β•β•β•β•ͺ════════════════════════════════════════════════════════════════β•ͺ══════════════════║
β”‚74859   β”‚Performance of Firefly RPC                                      β”‚1                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚77653   β”‚Performance of the Firefly RPC                                  β”‚0.8333333333333334β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚110815  β”‚The X-Kernel: An Architecture for Implementing Network Protocolsβ”‚0.6666666666666666β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚96281   β”‚Experiences with the Amoeba distributed operating system        β”‚0.6666666666666666β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚74861   β”‚Lightweight remote procedure call                               β”‚0.6666666666666666β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚106985  β”‚The interaction of architecture and operating system design     β”‚0.6666666666666666β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚77650   β”‚Lightweight remote procedure call                               β”‚0.6666666666666666β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Authentication in distributed systems: theory and practice

╒════════╀══════════════════════════════════════════════════════════╀══════════════════╕
β”‚other.idβ”‚other.title                                               β”‚similarity        β”‚
β•žβ•β•β•β•β•β•β•β•β•ͺ══════════════════════════════════════════════════════════β•ͺ══════════════════║
β”‚121160  β”‚Authentication in distributed systems: theory and practiceβ”‚1                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚138874  β”‚Authentication in distributed systems: theory and practiceβ”‚0.9090909090909091β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Sadly it’s not as simple as finding 100% matches on references! I expect the later revisions of a paper added more content and therefore additional references.

What if we look for author similarity as well?

MATCH (r:Resource {id: "121160"})-[:REFERENCES]->(other)
WITH r, COLLECT(other) as myReferences
 
UNWIND myReferences AS reference
OPTIONAL MATCH path = (other)-[:REFERENCES]->(reference)
WITH r, other, COUNT(path) AS otherReferences, SIZE(myReferences) AS myReferences
WITH r, other, 1.0 * otherReferences / myReferences AS referenceSimilarity
WHERE referenceSimilarity > 0.5
 
MATCH (r)<-[:AUTHORED]-(author)
WITH other, referenceSimilarity, COLLECT(author) AS myAuthors
 
UNWIND myAuthors AS author
OPTIONAL MATCH path = (other)<-[:AUTHORED]-(author)
WITH other, referenceSimilarity, COUNT(path) AS otherAuthors, SIZE(myAuthors) AS myAuthors
WITH other, referenceSimilarity, 1.0 * otherAuthors / myAuthors AS authorSimilarity
WHERE authorSimilarity > 0.5
 
RETURN other.id, other.title, referenceSimilarity, authorSimilarity
ORDER BY (referenceSimilarity + authorSimilarity) DESC
LIMIT 10
╒════════╀══════════════════════════════════════════════════════════╀═══════════════════╀════════════════╕
β”‚other.idβ”‚other.title                                               β”‚referenceSimilarityβ”‚authorSimilarityβ”‚
β•žβ•β•β•β•β•β•β•β•β•ͺ══════════════════════════════════════════════════════════β•ͺ═══════════════════β•ͺ════════════════║
β”‚121160  β”‚Authentication in distributed systems: theory and practiceβ”‚1                  β”‚1               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚138874  β”‚Authentication in distributed systems: theory and practiceβ”‚0.9090909090909091 β”‚1               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
╒════════╀══════════════════════════════╀═══════════════════╀════════════════╕
β”‚other.idβ”‚other.title                   β”‚referenceSimilarityβ”‚authorSimilarityβ”‚
β•žβ•β•β•β•β•β•β•β•β•ͺ══════════════════════════════β•ͺ═══════════════════β•ͺ════════════════║
β”‚74859   β”‚Performance of Firefly RPC    β”‚1                  β”‚1               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚77653   β”‚Performance of the Firefly RPCβ”‚0.8333333333333334 β”‚1               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

I’m sure I could find some other papers where neither of these similarities worked well but it’s an interesting start.

I think the next step is to build up a training set of pairs of documents that are and aren’t similar to each other. We could then train a classifier to determine whether two documents are identical.

But that’s for another day!

Categories: Programming

Python: Scraping elements relative to each other with BeautifulSoup

Mon, 07/11/2016 - 07:01

Last week we hosted a Game of Thrones based intro to Cypher at the Women Who Code London meetup and in preparation had to scrape the wiki to build a dataset.

I’ve built lots of datasets this way and it’s a painless experience as long as the pages make liberal use of CSS classes and/or IDs.

Unfortunately the Game of Thrones wiki doesn’t really do that, so I had to find another way to extract the data I wanted – extracting elements based on their position relative to more prominent elements on the page.

For example, I wanted to extract Arya Stark’s allegiances which look like this on the page:

[screenshot of the Allegiance section]

We don’t have a direct route to her allegiances but we do have an indirect path via the h3 element with the text ‘Allegiance’.

The following code gets us the ‘Allegiance’ element:

from bs4 import BeautifulSoup
 
file_name = "Arya_Stark"
wikia = BeautifulSoup(open("data/wikia/characters/{0}".format(file_name), "r"), "html.parser")
allegiance_element = [tag for tag in wikia.find_all('h3') if tag.text == "Allegiance"]
 
> print allegiance_element
[<h3 class="pi-data-label pi-secondary-font">Allegiance</h3>]

Now we need to work out the relative position of the div containing the houses. It’s inside the same parent div so I thought it’d probably be the next sibling:

next_element = allegiance_element[0].next_sibling
 
> print next_element

Nope. Nothing! Hmmm, wonder why:

> print next_element.name, type(next_element)
None <class 'bs4.element.NavigableString'>

Ah, empty string. Maybe it’s the one after that?

next_element = allegiance_element[0].next_sibling.next_sibling
 
> print next_element.contents
[<a href="/wiki/House_Stark" title="House Stark">House Stark</a>, <br/>, <a href="/wiki/Faceless_Men" title="Faceless Men">Faceless Men</a>, u' (Formerly)']

Hoorah! After this it became a case of working out how the text was structured and pulling out what I wanted.
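As a rough sketch of that last step, pulling the house names out of the anchor tags inside that element looks something like this:

allegiances = [a.text for a in next_element.find_all("a")]
 
> print allegiances
[u'House Stark', u'Faceless Men']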

The code I ended up with is on github if you want to recreate it yourself.

Categories: Programming

Neo4j 3.0 Drivers – Failed to save the server ID and the certificate received from the server

Mon, 07/11/2016 - 06:21

I’ve been using the Neo4j Java Driver on various local databases over the past week and ran into the following certificate problem a few times:

org.neo4j.driver.v1.exceptions.ClientException: Unable to process request: General SSLEngine problem
	at org.neo4j.driver.internal.connector.socket.SocketClient.start(SocketClient.java:88)
	at org.neo4j.driver.internal.connector.socket.SocketConnection.<init>(SocketConnection.java:63)
	at org.neo4j.driver.internal.connector.socket.SocketConnector.connect(SocketConnector.java:52)
	at org.neo4j.driver.internal.pool.InternalConnectionPool.acquire(InternalConnectionPool.java:113)
	at org.neo4j.driver.internal.InternalDriver.session(InternalDriver.java:53)
Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem
	at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1431)
	at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
	at sun.security.ssl.SSLEngineImpl.writeAppRecord(SSLEngineImpl.java:1214)
	at sun.security.ssl.SSLEngineImpl.wrap(SSLEngineImpl.java:1186)
	at javax.net.ssl.SSLEngine.wrap(SSLEngine.java:469)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.wrap(TLSSocketChannel.java:270)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.runHandshake(TLSSocketChannel.java:131)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.<init>(TLSSocketChannel.java:95)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.<init>(TLSSocketChannel.java:77)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.<init>(TLSSocketChannel.java:70)
	at org.neo4j.driver.internal.connector.socket.SocketClient$ChannelFactory.create(SocketClient.java:251)
	at org.neo4j.driver.internal.connector.socket.SocketClient.start(SocketClient.java:75)
	... 14 more
Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
	at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1728)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:304)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1497)
	at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:212)
	at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
	at sun.security.ssl.Handshaker$1.run(Handshaker.java:919)
	at sun.security.ssl.Handshaker$1.run(Handshaker.java:916)
	at java.security.AccessController.doPrivileged(Native Method)
	at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1369)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.runDelegatedTasks(TLSSocketChannel.java:142)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.unwrap(TLSSocketChannel.java:203)
	at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.runHandshake(TLSSocketChannel.java:127)
	... 19 more
Caused by: java.security.cert.CertificateException: Unable to connect to neo4j at `localhost:10003`, because the certificate the server uses has changed. This is a security feature to protect against man-in-the-middle attacks.
If you trust the certificate the server uses now, simply remove the line that starts with `localhost:10003` in the file `/Users/markneedham/.neo4j/known_hosts`.
The old certificate saved in file is:
-----BEGIN CERTIFICATE-----
7770ee598be69c8537b0e576e62442c84400008ca0d3e3565b379b7cce9a51de
0fd4396251df2e8da50eb1628d44dcbca3fae5c8fb9c0adc29396839c25eb0c8
 
-----END CERTIFICATE-----
The New certificate received is:
-----BEGIN CERTIFICATE-----
01a422739a39625ee95a0547fa99c7e43fbb33c70ff720e5ae4a8408421aa63b
2fe4f5d6094c5fd770ed1ad214dbdc428a6811d0955ed80d48cc67d84067df2c
 
-----END CERTIFICATE-----
 
	at org.neo4j.driver.internal.connector.socket.TrustOnFirstUseTrustManager.checkServerTrusted(TrustOnFirstUseTrustManager.java:153)
	at sun.security.ssl.AbstractTrustManagerWrapper.checkServerTrusted(SSLContextImpl.java:936)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1484)
	... 28 more

I got a bit lazy and just nuked the file it mentions in the error message – /Users/markneedham/.neo4j/known_hosts – which led to this error the next time I called the driver in my application:

Failed to save the server ID and the certificate received from the server to file /Users/markneedham/.neo4j/known_hosts.
Server ID: localhost:10003
Received cert:
-----BEGIN CERTIFICATE-----
933c7ec5d6a1b876bd186dc6d05b04478ae771262f07d26a4d7d2e6b7f71054c
3e6b7c172474493b7fe93170d940b9cc3544661c7966632361649f2fda7c66be
 
-----END CERTIFICATE-----

I recreated the file with no content and tried again and it worked fine. Alternatively we can choose to turn off encryption when working with local databases and avoid the issue:

Config config = Config.build().withEncryptionLevel( Config.EncryptionLevel.NONE ).toConfig();
 
try ( Driver driver = GraphDatabase.driver( "bolt://localhost:7687", config );
      Session session = driver.session() )
{
   // use the driver
}
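Another option, rather than nuking the whole file like I did, is to do what the error message suggests and remove just the line that starts with `localhost:10003`. That’s easy enough to script, e.g. with a bit of Python:

# drop the stale localhost:10003 entry from known_hosts, keeping everything else
path = "/Users/markneedham/.neo4j/known_hosts"
 
with open(path) as f:
    lines = f.readlines()
 
with open(path, "w") as f:
    f.writelines(line for line in lines if not line.startswith("localhost:10003"))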
Categories: Programming

R: Sentiment analysis of morning pages

Sat, 07/09/2016 - 07:36

A couple of months ago I came across a cool blog post by Julia Silge where she runs a sentiment analysis algorithm over her tweet stream to see how her tweet sentiment has varied over time.

I wanted to give it a try but couldn’t figure out how to get a dump of my tweets so I decided to try it out on the text from my morning pages writing which I’ve been experimenting with for a few months.

Here’s an explanation of morning pages if you haven’t come across it before:

Morning Pages are three pages of longhand, stream of consciousness writing, done first thing in the morning.

*There is no wrong way to do Morning Pages* – they are not high art.

They are not even “writing.” They are about anything and everything that crosses your mind – and they are for your eyes only.

Morning Pages provoke, clarify, comfort, cajole, prioritize and synchronize the day at hand. Do not over-think Morning Pages: just put three pages of anything on the page…and then do three more pages tomorrow.

Most of my writing is complete gibberish but I thought it’d be fun to see how my mood changes over time and see if it identifies any peaks or troughs in sentiment that I could then look into further.

I’ve got one file per day so we’ll start by building a data frame containing the text, one row per day:

library(syuzhet)
library(lubridate)
library(ggplot2)
library(scales)
library(reshape2)
library(dplyr)
 
root="/path/to/files"
files = list.files(root)
 
df = data.frame(file = files, stringsAsFactors=FALSE)
df$fullPath = paste(root, df$file, sep = "/")
df$text = sapply(df$fullPath, get_text_as_string)

We end up with a data frame with 3 fields:

> names(df)
 
[1] "file"     "fullPath" "text"

Next we’ll run the sentiment analysis function – syuzhet::get_nrc_sentiment – over the data frame and get a score for each type of sentiment for each entry:

get_nrc_sentiment(df$text) %>% head()
 
  anger anticipation disgust fear joy sadness surprise trust negative positive
1     7           14       5    7   8       6        6    12       14       27
2    11           12       2   13   9      10        4    11       22       24
3     6           12       3    8   7       7        5    13       16       21
4     5           17       4    7  10       6        7    13       16       37
5     4           13       3    7   7       9        5    14       16       25
6     7           11       5    7   8       8        6    15       16       26

Now we’ll merge these columns into our original data frame:

df = cbind(df, get_nrc_sentiment(df$text))
df$date = ymd(sapply(df$file, function(file) unlist(strsplit(file, "[.]"))[1]))
df %>% select(-text, -fullPath, -file) %>% head()
 
  anger anticipation disgust fear joy sadness surprise trust negative positive       date
1     7           14       5    7   8       6        6    12       14       27 2016-01-02
2    11           12       2   13   9      10        4    11       22       24 2016-01-03
3     6           12       3    8   7       7        5    13       16       21 2016-01-04
4     5           17       4    7  10       6        7    13       16       37 2016-01-05
5     4           13       3    7   7       9        5    14       16       25 2016-01-06
6     7           11       5    7   8       8        6    15       16       26 2016-01-07

Finally we can build some ‘sentiment over time’ charts like Julia has in her post:

posnegtime <- df %>% 
  group_by(date = cut(date, breaks="1 week")) %>%
  summarise(negative = mean(negative), positive = mean(positive)) %>% 
  melt
 
names(posnegtime) <- c("date", "sentiment", "meanvalue")
posnegtime$sentiment = factor(posnegtime$sentiment,levels(posnegtime$sentiment)[c(2,1)])
 
ggplot(data = posnegtime, aes(x = as.Date(date), y = meanvalue, group = sentiment)) +
  geom_line(size = 2.5, alpha = 0.7, aes(color = sentiment)) +
  geom_point(size = 0.5) +
  ylim(0, NA) + 
  scale_colour_manual(values = c("springgreen4", "firebrick3")) +
  theme(legend.title=element_blank(), axis.title.x = element_blank()) +
  scale_x_date(breaks = date_breaks("1 month"), labels = date_format("%b %Y")) +
  ylab("Average sentiment score") + 
  ggtitle("Sentiment Over Time")

[sentiment over time chart]

So overall it seems like my writing displays more positive sentiment than negative, which is nice to know. The chart shows a weekly average and there isn’t a single week where there’s more negative sentiment than positive.

I thought it’d be fun to drill into the highest negative and positive days to see what was going on there:

> df %>% filter(negative == max(negative)) %>% select(date)
 
        date
1 2016-03-19
 
> df %>% filter(positive == max(positive)) %>% select(date)
 
        date
1 2016-01-05
2 2016-06-20

On the 19th March I was really frustrated because my boiler had broken down and I had to buy a new one – I’d completely forgotten how annoyed I was, so thanks sentiment analysis for reminding me!

I couldn’t find anything particularly positive on the 5th January or 20th June. The 5th January was the day after my birthday so perhaps I was happy about that but I couldn’t see any particular evidence that was the case.

Playing around with the get_nrc_sentiment function, it does seem to identify positive sentiment when I wouldn’t say there is any. For example, here are some sentences from my writing today:

> get_nrc_sentiment("There was one section that I didn't quite understand so will have another go at reading that.")
 
  anger anticipation disgust fear joy sadness surprise trust negative positive
1     0            0       0    0   0       0        0     0        0        1
> get_nrc_sentiment("Bit of a delay in starting my writing for the day...for some reason was feeling wheezy again.")
 
  anger anticipation disgust fear joy sadness surprise trust negative positive
1     2            1       2    2   1       2        1     1        2        2

I don’t think there’s any positive sentiment in either of those sentences but the function claims 3 bits of positive sentiment! It would be interesting to see if I fare any better with Stanford’s sentiment extraction tool, which you can use with syuzhet but which requires a bit of setup first.

I’ll give that a try next but in terms of getting an overview of my mood I thought I might get a better picture if I looked for the difference between positive and negative sentiment rather than absolute values.

The following code does the trick:

difftime <- df %>% 
  group_by(date = cut(date, breaks="1 week")) %>%
  summarise(diff = mean(positive) - mean(negative))
 
ggplot(data = difftime, aes(x = as.Date(date), y = diff)) +
  geom_line(size = 2.5, alpha = 0.7) +
  geom_point(size = 0.5) +
  ylim(0, NA) + 
  scale_colour_manual(values = c("springgreen4", "firebrick3")) +
  theme(legend.title=element_blank(), axis.title.x = element_blank()) +
  scale_x_date(breaks = date_breaks("1 month"), labels = date_format("%b %Y")) +
  ylab("Average sentiment difference score") + 
  ggtitle("Sentiment Over Time")
[sentiment difference over time chart]

This one identifies peak happiness in mid January/February. We can find the peak day for this measure as well:

> df %>% mutate(diff = positive - negative) %>% filter(diff == max(diff)) %>% select(date)
 
        date
1 2016-02-25

Or if we want to see the individual scores:

> df %>% mutate(diff = positive - negative) %>% filter(diff == max(diff)) %>% select(-text, -file, -fullPath)
 
  anger anticipation disgust fear joy sadness surprise trust negative positive       date diff
1     0           11       2    3   7       1        6     6        3       31 2016-02-25   28

After reading through the entry for this day I’m wondering if the individual pieces of sentiment might be more interesting than the positive/negative score.

On the 25th February I was:

  • quite excited about reading a distributed systems book I’d just bought (I know?!)
  • thinking about how to apply the tag clustering technique to meetup topics
  • preparing my submission to PyData London and thinking about what was gonna go in it
  • thinking about the soak testing we were about to start doing on our project

Each of those is a type of anticipation so it makes sense that this day scores highly. I looked through some other days which specifically rank highly for anticipation and couldn’t figure out what I was anticipating so even this is a bit hit and miss!

I have a few avenues to explore further but if you have any other ideas for what I can try next let me know in the comments.

Categories: Programming

Python: BeautifulSoup – Insert tag

Thu, 06/30/2016 - 22:28

I’ve been scraping the Game of Thrones wiki in preparation for a meetup at Women Who Code next week and while attempting to extract character allegiances I wanted to insert missing line breaks to separate different allegiances.

I initially tried creating a line break like this:

>>> from bs4 import BeautifulSoup
>>> tag = BeautifulSoup("<br />", "html.parser")
>>> tag
<br/>

It looks like it should work but later on in my script I check the ‘name’ attribute to work out whether I’ve got a line break and it doesn’t return the value I expected it to:

>>> tag.name
u'[document]'

My script assumes it’s going to return the string ‘br’ so I needed another way of creating the tag. The following does the trick:

>>> from bs4 import Tag
>>> tag = Tag(name = "br")
>>> tag
<br></br>
>>> tag.name
'br'
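I believe the more conventional way to create a tag is via an existing soup’s new_tag method, which gives the name I was expecting as well:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("", "html.parser")
>>> tag = soup.new_tag("br")
>>> tag.name
'br'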

That’s all for now, back to scraping for me!

Categories: Programming

Unix: Find files greater than date

Fri, 06/24/2016 - 17:56

For the latter part of the week I’ve been running some tests against Neo4j which generate a bunch of log files and I wanted to filter those files based on the time they were created to do some further analysis.

This is an example of what the directory listing looks like:

$ ls -alh foo/database-agent-*
-rw-r--r--  1 markneedham  wheel   2.5K 23 Jun 14:00 foo/database-agent-mac17f73-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel   8.6K 23 Jun 11:49 foo/database-agent-mac19b6b-1-logs-archive-201606231049507.tar.gz
-rw-r--r--  1 markneedham  wheel   8.6K 23 Jun 11:49 foo/database-agent-mac1f427-1-logs-archive-201606231049507.tar.gz
-rw-r--r--  1 markneedham  wheel   2.5K 23 Jun 14:00 foo/database-agent-mac29389-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel    11K 23 Jun 13:44 foo/database-agent-mac3533f-1-logs-archive-201606231244152.tar.gz
-rw-r--r--  1 markneedham  wheel   4.8K 23 Jun 14:00 foo/database-agent-mac35563-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel   3.8K 23 Jun 13:44 foo/database-agent-mac35f7e-1-logs-archive-201606231244165.tar.gz
-rw-r--r--  1 markneedham  wheel   4.8K 23 Jun 14:00 foo/database-agent-mac40798-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel    12K 23 Jun 13:44 foo/database-agent-mac490bf-1-logs-archive-201606231244151.tar.gz
-rw-r--r--  1 markneedham  wheel   2.5K 23 Jun 14:00 foo/database-agent-mac5f094-1-logs-archive-201606231300189.tar.gz
-rw-r--r--  1 markneedham  wheel   5.8K 23 Jun 14:00 foo/database-agent-mac636b8-1-logs-archive-201606231300176.tar.gz
-rw-r--r--  1 markneedham  wheel   9.5K 23 Jun 11:49 foo/database-agent-mac7e165-1-logs-archive-201606231049507.tar.gz
-rw-r--r--  1 markneedham  wheel   2.7K 23 Jun 11:49 foo/database-agent-macab7f1-1-logs-archive-201606231049507.tar.gz
-rw-r--r--  1 markneedham  wheel   2.8K 23 Jun 13:44 foo/database-agent-macbb8e1-1-logs-archive-201606231244151.tar.gz
-rw-r--r--  1 markneedham  wheel   3.1K 23 Jun 11:49 foo/database-agent-macbcbe8-1-logs-archive-201606231049520.tar.gz
-rw-r--r--  1 markneedham  wheel    13K 23 Jun 13:44 foo/database-agent-macc8177-1-logs-archive-201606231244152.tar.gz
-rw-r--r--  1 markneedham  wheel   3.8K 23 Jun 13:44 foo/database-agent-maccd92c-1-logs-archive-201606231244151.tar.gz
-rw-r--r--  1 markneedham  wheel   3.9K 23 Jun 13:44 foo/database-agent-macdf24f-1-logs-archive-201606231244165.tar.gz
-rw-r--r--  1 markneedham  wheel   3.1K 23 Jun 11:49 foo/database-agent-mace075e-1-logs-archive-201606231049520.tar.gz
-rw-r--r--  1 markneedham  wheel   3.1K 23 Jun 11:49 foo/database-agent-mace8859-1-logs-archive-201606231049507.tar.gz

I wanted to split the files into two groups so that I could have the ones created before and after 12pm on the 23rd June.

I discovered that this type of filtering is actually quite easy to do with the ‘find’ command. So if I want to get the files after 12pm I could write the following:

$ find foo -name database-agent* -newermt "Jun 23, 2016 12:00" -ls
121939705        8 -rw-r--r--    1 markneedham      wheel                2524 23 Jun 14:00 foo/database-agent-mac17f73-1-logs-archive-201606231300176.tar.gz
121939704        8 -rw-r--r--    1 markneedham      wheel                2511 23 Jun 14:00 foo/database-agent-mac29389-1-logs-archive-201606231300176.tar.gz
121934591       24 -rw-r--r--    1 markneedham      wheel               11294 23 Jun 13:44 foo/database-agent-mac3533f-1-logs-archive-201606231244152.tar.gz
121939707       16 -rw-r--r--    1 markneedham      wheel                4878 23 Jun 14:00 foo/database-agent-mac35563-1-logs-archive-201606231300176.tar.gz
121934612        8 -rw-r--r--    1 markneedham      wheel                3896 23 Jun 13:44 foo/database-agent-mac35f7e-1-logs-archive-201606231244165.tar.gz
121939708       16 -rw-r--r--    1 markneedham      wheel                4887 23 Jun 14:00 foo/database-agent-mac40798-1-logs-archive-201606231300176.tar.gz
121934589       24 -rw-r--r--    1 markneedham      wheel               12204 23 Jun 13:44 foo/database-agent-mac490bf-1-logs-archive-201606231244151.tar.gz
121939720        8 -rw-r--r--    1 markneedham      wheel                2510 23 Jun 14:00 foo/database-agent-mac5f094-1-logs-archive-201606231300189.tar.gz
121939706       16 -rw-r--r--    1 markneedham      wheel                5912 23 Jun 14:00 foo/database-agent-mac636b8-1-logs-archive-201606231300176.tar.gz
121934588        8 -rw-r--r--    1 markneedham      wheel                2895 23 Jun 13:44 foo/database-agent-macbb8e1-1-logs-archive-201606231244151.tar.gz
121934590       32 -rw-r--r--    1 markneedham      wheel               13427 23 Jun 13:44 foo/database-agent-macc8177-1-logs-archive-201606231244152.tar.gz
121934587        8 -rw-r--r--    1 markneedham      wheel                3882 23 Jun 13:44 foo/database-agent-maccd92c-1-logs-archive-201606231244151.tar.gz
121934611        8 -rw-r--r--    1 markneedham      wheel                3970 23 Jun 13:44 foo/database-agent-macdf24f-1-logs-archive-201606231244165.tar.gz

And to get the ones before 12pm:

$ find foo -name database-agent* -not -newermt "Jun 23, 2016 12:00" -ls
121879391       24 -rw-r--r--    1 markneedham      wheel                8856 23 Jun 11:49 foo/database-agent-mac19b6b-1-logs-archive-201606231049507.tar.gz
121879394       24 -rw-r--r--    1 markneedham      wheel                8772 23 Jun 11:49 foo/database-agent-mac1f427-1-logs-archive-201606231049507.tar.gz
121879390       24 -rw-r--r--    1 markneedham      wheel                9702 23 Jun 11:49 foo/database-agent-mac7e165-1-logs-archive-201606231049507.tar.gz
121879393        8 -rw-r--r--    1 markneedham      wheel                2812 23 Jun 11:49 foo/database-agent-macab7f1-1-logs-archive-201606231049507.tar.gz
121879413        8 -rw-r--r--    1 markneedham      wheel                3144 23 Jun 11:49 foo/database-agent-macbcbe8-1-logs-archive-201606231049520.tar.gz
121879414        8 -rw-r--r--    1 markneedham      wheel                3131 23 Jun 11:49 foo/database-agent-mace075e-1-logs-archive-201606231049520.tar.gz
121879392        8 -rw-r--r--    1 markneedham      wheel                3130 23 Jun 11:49 foo/database-agent-mace8859-1-logs-archive-201606231049507.tar.gz

Or we could even find the ones last modified between 12pm and 2pm:

$ find foo -name database-agent* -not -newermt "Jun 23, 2016 14:00" -newermt "Jun 23, 2016 12:00" -ls
121934591       24 -rw-r--r--    1 markneedham      wheel               11294 23 Jun 13:44 foo/database-agent-mac3533f-1-logs-archive-201606231244152.tar.gz
121934612        8 -rw-r--r--    1 markneedham      wheel                3896 23 Jun 13:44 foo/database-agent-mac35f7e-1-logs-archive-201606231244165.tar.gz
121934589       24 -rw-r--r--    1 markneedham      wheel               12204 23 Jun 13:44 foo/database-agent-mac490bf-1-logs-archive-201606231244151.tar.gz
121934588        8 -rw-r--r--    1 markneedham      wheel                2895 23 Jun 13:44 foo/database-agent-macbb8e1-1-logs-archive-201606231244151.tar.gz
121934590       32 -rw-r--r--    1 markneedham      wheel               13427 23 Jun 13:44 foo/database-agent-macc8177-1-logs-archive-201606231244152.tar.gz
121934587        8 -rw-r--r--    1 markneedham      wheel                3882 23 Jun 13:44 foo/database-agent-maccd92c-1-logs-archive-201606231244151.tar.gz
121934611        8 -rw-r--r--    1 markneedham      wheel                3970 23 Jun 13:44 foo/database-agent-macdf24f-1-logs-archive-201606231244165.tar.gz

Or we can filter by relative time e.g. to find the files last modified in the last 1 day, 5 hours:

$ find foo -name database-agent* -mtime -1d5h -ls
121939705        8 -rw-r--r--    1 markneedham      wheel                2524 23 Jun 14:00 foo/database-agent-mac17f73-1-logs-archive-201606231300176.tar.gz
121939704        8 -rw-r--r--    1 markneedham      wheel                2511 23 Jun 14:00 foo/database-agent-mac29389-1-logs-archive-201606231300176.tar.gz
121934591       24 -rw-r--r--    1 markneedham      wheel               11294 23 Jun 13:44 foo/database-agent-mac3533f-1-logs-archive-201606231244152.tar.gz
121939707       16 -rw-r--r--    1 markneedham      wheel                4878 23 Jun 14:00 foo/database-agent-mac35563-1-logs-archive-201606231300176.tar.gz
121934612        8 -rw-r--r--    1 markneedham      wheel                3896 23 Jun 13:44 foo/database-agent-mac35f7e-1-logs-archive-201606231244165.tar.gz
121939708       16 -rw-r--r--    1 markneedham      wheel                4887 23 Jun 14:00 foo/database-agent-mac40798-1-logs-archive-201606231300176.tar.gz
121934589       24 -rw-r--r--    1 markneedham      wheel               12204 23 Jun 13:44 foo/database-agent-mac490bf-1-logs-archive-201606231244151.tar.gz
121939720        8 -rw-r--r--    1 markneedham      wheel                2510 23 Jun 14:00 foo/database-agent-mac5f094-1-logs-archive-201606231300189.tar.gz
121939706       16 -rw-r--r--    1 markneedham      wheel                5912 23 Jun 14:00 foo/database-agent-mac636b8-1-logs-archive-201606231300176.tar.gz
121934588        8 -rw-r--r--    1 markneedham      wheel                2895 23 Jun 13:44 foo/database-agent-macbb8e1-1-logs-archive-201606231244151.tar.gz
121934590       32 -rw-r--r--    1 markneedham      wheel               13427 23 Jun 13:44 foo/database-agent-macc8177-1-logs-archive-201606231244152.tar.gz
121934587        8 -rw-r--r--    1 markneedham      wheel                3882 23 Jun 13:44 foo/database-agent-maccd92c-1-logs-archive-201606231244151.tar.gz
121934611        8 -rw-r--r--    1 markneedham      wheel                3970 23 Jun 13:44 foo/database-agent-macdf24f-1-logs-archive-201606231244165.tar.gz

Or the ones modified more than 1 day, 5 hours ago:

$ find foo -name database-agent* -mtime +1d5h -ls
121879391       24 -rw-r--r--    1 markneedham      wheel                8856 23 Jun 11:49 foo/database-agent-mac19b6b-1-logs-archive-201606231049507.tar.gz
121879394       24 -rw-r--r--    1 markneedham      wheel                8772 23 Jun 11:49 foo/database-agent-mac1f427-1-logs-archive-201606231049507.tar.gz
121879390       24 -rw-r--r--    1 markneedham      wheel                9702 23 Jun 11:49 foo/database-agent-mac7e165-1-logs-archive-201606231049507.tar.gz
121879393        8 -rw-r--r--    1 markneedham      wheel                2812 23 Jun 11:49 foo/database-agent-macab7f1-1-logs-archive-201606231049507.tar.gz
121879413        8 -rw-r--r--    1 markneedham      wheel                3144 23 Jun 11:49 foo/database-agent-macbcbe8-1-logs-archive-201606231049520.tar.gz
121879414        8 -rw-r--r--    1 markneedham      wheel                3131 23 Jun 11:49 foo/database-agent-mace075e-1-logs-archive-201606231049520.tar.gz
121879392        8 -rw-r--r--    1 markneedham      wheel                3130 23 Jun 11:49 foo/database-agent-mace8859-1-logs-archive-201606231049507.tar.gz

There are lots of other flags you can pass to find but these ones did exactly what I wanted!
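And if you’d rather do this kind of filtering from a script, here’s a rough Python equivalent of the first command (files last modified after 12pm on the 23rd June):

import os
from datetime import datetime
 
cutoff = datetime(2016, 6, 23, 12, 0)
 
for name in os.listdir("foo"):
    if not name.startswith("database-agent"):
        continue
    path = os.path.join("foo", name)
    # compare each file's modification time against the cutoff
    modified = datetime.fromtimestamp(os.path.getmtime(path))
    if modified > cutoff:
        print path, modified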

Categories: Programming

Unix: Find all text below string in a file

Sun, 06/19/2016 - 09:36

I recently wanted to parse some text out of a bunch of files so that I could do some sentiment analysis on it. Luckily the text I want is at the end of the file and doesn’t have anything after it, but there is text before it that I want to get rid of.

The files look like this:

# text I don't care about
 
= Heading of the bit I care about
 
# text I care about

In other words I want to find the line that contains the Heading and then get all the text after that point.

I figured sed was the tool for the job but my knowledge of the syntax was a bit rusty. Luckily this post served as a refresher.

Effectively what we want to do is delete from the beginning of the file up until the line after the heading. We can do this with the following command:

$ cat /tmp/foo.txt 
# text I don't care about
 
= Heading of the bit I care about
 
# text I care about
$ cat /tmp/foo.txt | sed '1,/Heading of the bit I care about/d'
 
# text I care about

That still leaves an extra empty line after the heading which is a bit annoying but easy enough to get rid of by passing another command to sed that strips empty lines:

$ cat /tmp/foo.txt | sed -e '1,/Heading of the bit I care about/d' -e '/^\s*$/d'
# text I care about

The only difference here is that we’re now passing the ‘-e’ flag to allow us to specify multiple commands. If we just pass them sequentially then the 2nd one will be interpreted as the name of a file.
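If you’d rather do the same thing from a script, a rough Python equivalent of those two sed commands might look like this:

def text_after_heading(path, heading):
    with open(path) as f:
        lines = f.readlines()
    # keep everything after the line containing the heading, dropping blank lines
    for index, line in enumerate(lines):
        if heading in line:
            return [l for l in lines[index + 1:] if l.strip()]
    return []
 
print "".join(text_after_heading("/tmp/foo.txt", "Heading of the bit I care about"))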

Categories: Programming

Unix: Split string using separator

Sun, 06/19/2016 - 08:22

I recently found myself needing to iterate over a bunch of ‘/’ separated strings on the command line and extract just the text after the last ‘/’.

e.g. an example of one of the strings

A/B/C

I wanted to write some code that could split on ‘/’ and then pick the 3rd item in the resulting collection.

One way of doing this is to echo the string and then pipe it through cut:

$ string="A/B/C"
$ echo ${string} | cut -d"/" -f3
C

or awk:

$ echo ${string} | awk -F"/" '{ print $3}'
C

I don’t like having to echo the string – it feels a bit odd so I wanted to see if there was a way to do the parsing more ‘inline’.

I came across this post which explains how to change the internal field separator (IFS) on the shell and then parse the string into an array using read. I gave it a try:

$ IFS="/" read -ra ADDR <<< "${string}"; echo ${ADDR[2]}
C

Works! We can even refer to the last item in the array using -1 instead of its absolute position:

$ IFS="/" read -ra ADDR <<< "${string}"; echo ${ADDR[-1]}
C

I’d not come across this use of the ‘read’ function before. The key is the ‘-a’ parameter. From the man page:

-a aname
The words are assigned to sequential indices of the array variable aname,
starting at 0. All elements are removed from aname before the assignment.
Other name arguments are ignored.

So we’re resetting the internal field separator and then reading the string into another variable as an array split on the ‘/’ character.

Pretty neat although now it’s longer than the original command and I’m sure I’ll forget the syntax.

Further down the page is another suggestion which seems even harder to remember but is much shorter:

$ echo ${string##*/} 
C

This drops from the beginning of the string up until the last occurrence of ‘/’ which is exactly what we want.

This way is the nicest and doesn’t require any echoing if we just want to assign the result to a variable. The echo is only used here to see the output.
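For completeness, a quick sketch of the related expansions – ‘%’ and ‘%%’ trim from the end of the string in the same way that ‘#’ and ‘##’ trim from the beginning:

$ string="A/B/C"
$ echo ${string%/*}    # drop the shortest '/...' suffix -> everything before the last '/'
A/B
$ echo ${string%%/*}   # drop the longest '/...' suffix -> everything before the first '/'
A
$ echo ${string#*/}    # drop the shortest '.../' prefix -> everything after the first '/'
B/C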

Categories: Programming

Python: Regex – matching foreign characters/unicode letters

Sat, 06/18/2016 - 08:38

I’ve been back in the land of screen scraping this week, extracting data from the Game of Thrones wiki, and needed to write a regular expression to pull out characters and actors.

Here are some examples of the format of the data:

Peter Dinklage as Tyrion Lannister
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
Filip Lozić as Young Nobleman
Morgan C. Jones as a Braavosi captain
Adewale Akinnuoye-Agbaje as Malko

So the pattern is:

<actor> as <character>

optionally followed by some other text that we’re not interested in.

The output I want to get is:

Peter Dinklage, Tyrion Lannister
Daniel Naprous, Oznak zo Pahl
Filip Lozić, Young Nobleman
Morgan C. Jones, a Braavosi captain
Adewale Akinnuoye-Agbaje, Malko

I started using the ‘split’ command on the word ‘as’ but that broke down when I realised some of the characters had the letters ‘as’ in the middle of their name. So regex it is!

This was my first attempt:

import re
 
strings = [
    "Peter Dinklage as Tyrion Lannister",
    "Filip Lozić as Young Nobleman",
    "Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)",
    "Morgan C. Jones as a Braavosi captain",
    "Adewale Akinnuoye-Agbaje as Malko"
]
 
regex = "([A-Za-z\-'\. ]*) as ([A-Za-z\-'\. ]*)"
 
for string in strings:
    print string
    match = re.match( regex, string)
    if match is not None:
        print match.groups()
    else:
        print "FAIL"
	print ""
Peter Dinklage as Tyrion Lannister
('Peter Dinklage', 'Tyrion Lannister')
 
Filip Lozić as Young Nobleman
FAIL
 
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
('Daniel Naprous', 'Oznak zo Pahl')
 
Morgan C. Jones as a Braavosi captain
('Morgan C. Jones', 'a Braavosi captain')
 
Adewale Akinnuoye-Agbaje as Malko
('Adewale Akinnuoye-Agbaje', 'Malko')

It works for 4 of the 5 scenarios but not for Filip Lozić. The ‘ć’ character causes the issue, so we need to be able to match foreign characters, which the charset I defined in the regex doesn’t capture.

I came across this Stack Overflow post which said that in some regex libraries you can use ‘\p{L}’ to match all letters. I gave that a try:

regex = "([\p{L}\-'\. ]*) as ([\p{L}\-'\. ]*)"

And then re-ran the script:

Peter Dinklage as Tyrion Lannister
FAIL
 
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
FAIL
 
Filip Lozić as Young Nobleman
FAIL
 
Morgan C. Jones as a Braavosi captain
FAIL
 
Adewale Akinnuoye-Agbaje as Malko
FAIL

Hmmm, not sure if I did it wrong or if that isn’t available in Python. I’ll assume the latter but feel free to correct me in the comments and I’ll update the post.

I went searching again and found this post which suggested another approach:

You can construct a new character class:

[^\W\d_]

instead of \w. Translated into English, it means “Any character that is not a non-alphanumeric character ([^\W] is the same as \w), but that is also not a digit and not an underscore”.

Let’s try plugging that in:

regex = "([A-Za-z\-'\.^\W\d_ ]*) as ([A-Za-z\-'\.^\W\d_ ]*)"
Peter Dinklage as Tyrion Lannister
('Peter Dinklage', 'Tyrion Lannister')
 
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
('Daniel Naprous as Oznak zo Pahl(credited', 'Stunt Performer)')
 
Filip Lozić as Young Nobleman
('Filip Lozi\xc4\x87', 'Young Nobleman')
 
Morgan C. Jones as a Braavosi captain
('Morgan C. Jones', 'a Braavosi captain')
 
Adewale Akinnuoye-Agbaje as Malko
('Adewale Akinnuoye-Agbaje', 'Malko')

So that’s fixed Filip but now Daniel Naprous is being incorrectly parsed.

For Attempt #4 I decided to try excluding what I don’t want instead:

regex = "([^0-9\(]*) as ([^0-9\(]*)"
Peter Dinklage as Tyrion Lannister
('Peter Dinklage', 'Tyrion Lannister')
 
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
('Daniel Naprous', 'Oznak zo Pahl')
 
Filip Lozić as Young Nobleman
('Filip Lozi\xc4\x87', 'Young Nobleman')
 
Morgan C. Jones as a Braavosi captain
('Morgan C. Jones', 'a Braavosi captain')
 
Adewale Akinnuoye-Agbaje as Malko
('Adewale Akinnuoye-Agbaje', 'Malko')

That does the job but has exposed my lack of regex skillz. If you know a better way let me know in the comments.
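One option I haven’t explored properly (so treat this as a sketch): the third-party regex module (pip install regex) does understand the \p{L} syntax from the earlier attempt, as long as we give it unicode strings:

import regex  # third-party module, not the standard library 're'
 
string = u"Filip Lozić as Young Nobleman"
pattern = regex.compile(ur"([\p{L}\-'\. ]*) as ([\p{L}\-'\. ]*)", regex.UNICODE)
 
match = pattern.match(string)
if match is not None:
    print match.groups()

That should keep the ‘ć’ intact while behaving like the first attempt for the other strings.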

Categories: Programming

Unix parallel: Populating all the USB sticks

Wed, 06/01/2016 - 06:53

The day before Graph Connect Europe 2016 we needed to create a bunch of USB sticks containing Neo4j and the training materials, and we eventually iterated our way to a half decent approach that made use of the GNU parallel command, which I’ve always wanted to use!


But first I needed to get a USB hub so I could do lots of them at the same time. I bought the EasyAcc USB 3.0 but there are lots of other ones that do the same job.

Next I mounted all the USB sticks and then renamed the volumes to be NEO4J1 -> NEO4J7:

for i in 1 2 3 4 5 6 7; do diskutil renameVolume "USB DISK" NEO4J${i}; done

I then created a bash function called ‘duplicate’ to do the copying work:

function duplicate() {
  i=${1}
  echo ${i}
  time rsync -avP --size-only --delete --exclude '.*' --omit-dir-times /Users/markneedham/Downloads/graph-connect-europe-2016/ /Volumes/NEO4J${i}/
}

We can now call this function in parallel like so:

seq 1 7 | parallel duplicate

And that’s it. We didn’t get a 7x improvement in the throughput of USB creation from doing 7 in parallel: it took ~9 minutes to complete all 7, compared to ~5 minutes for a single stick. Presumably there’s still some part of the copying further down that is sequential – Amdahl’s law #ftw.
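One caveat worth noting: GNU parallel runs each job in a new shell, so if it complains that it can’t find ‘duplicate’ the function needs exporting first – a sketch of what that looks like in bash:

export -f duplicate
seq 1 7 | parallel duplicate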

I want to go and find other things that I can pipe into parallel now!

Categories: Programming

Neo4j vs Relational: Refactoring – Extracting node/table

Sun, 05/22/2016 - 10:58

In my previous blog post I showed how to add a new property to nodes with a particular label (and the equivalent field to records in a table) for a football transfers dataset that I’ve been playing with.

After introducing this ‘nationality’ property I realised that I now had some duplication in the model:

[diagram: relational model with players.nationality and clubs.country stored as strings]

players.nationality and clubs.country refer to the same countries, but both store them as plain strings, so we can’t ensure the integrity of our countries or guarantee that the two fields refer to the same country.

We have the same issue in the graph model as well:

[diagram: graph model with Player.nationality and Club.country stored as strings]

This time Player.nationality and Club.country refer to the same countries.

We can solve our problem by introducing a countries table in the relational model and a set of nodes with a ‘Country’ label in the graph model. Let’s start with relational.

This is the model we’re driving towards:

[diagram: target relational model with a separate countries table]

The first thing we need to do is create a countries table and populate it:

CREATE TABLE countries (
    "code" CHARACTER VARYING(3) NOT NULL PRIMARY KEY,
    "name" CHARACTER VARYING(50) NOT NULL
);
INSERT INTO countries VALUES('MNE', 'Montenegro');
INSERT INTO countries VALUES('SWZ', 'Swaziland');
...

Next let’s update the clubs table to reference the countries table:

ALTER TABLE clubs
ADD COLUMN country_id CHARACTER VARYING(3)
REFERENCES countries(code);

And let’s run a query to populate that column:

UPDATE clubs AS cl
SET country_id = c.code
FROM clubs
INNER JOIN countries AS c 
ON c.name = clubs.country
WHERE cl.id = clubs.id;

This query joins each club to the countries table on the country name and stores the matching country code in the ‘country_id’ field. Finally we can remove the ‘country’ field:

ALTER TABLE clubs
DROP COLUMN country;

Now we do the same drill for the players table:

ALTER TABLE players
ADD COLUMN country_id CHARACTER VARYING(3)
REFERENCES countries(code);
UPDATE players AS p
SET country_id = c.code
FROM players
INNER JOIN countries AS c 
ON c.name = players.nationality
WHERE p.id = players.id;
ALTER TABLE players
DROP COLUMN nationality;

Now it’s time for the graph. This is the model we want to get to:

[diagram: target graph model with Country nodes]

First we’ll create the countries:

CREATE CONSTRAINT ON (c:Country)
ASSERT c.id IS UNIQUE

LOAD CSV WITH HEADERS FROM "file:///countries.csv"
AS row
MERGE (country:Country {id: row.countryCode})
ON CREATE SET country.name = row.country

And now let’s get clubs and players to point at those countries nodes and get rid of their respective nationality/country properties:

MATCH (club:Club)
MATCH (country:Country {name: club.country})
MERGE (club)-[:PART_OF]->(country)
REMOVE club.country

MATCH (player:Player)
MATCH (country:Country {name: player.nationality})
MERGE (player)-[:PLAYS_FOR]->(country)
REMOVE player.nationality

And that’s it, we can now write queries against our new model.
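For example, here’s a quick sketch of a query against each model to count players per country (not from the original webinar, just using the tables, labels and relationship types above). In SQL:

SELECT c.name AS country, COUNT(*) AS players
FROM players p
JOIN countries c ON c.code = p.country_id
GROUP BY c.name
ORDER BY players DESC;

And the equivalent Cypher:

MATCH (player:Player)-[:PLAYS_FOR]->(country:Country)
RETURN country.name AS country, COUNT(*) AS players
ORDER BY players DESC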

Categories: Programming

Neo4j vs Relational: Refactoring – Add a new field/property

Sun, 05/22/2016 - 10:09

A couple of months ago I presented a webinar comparing how you’d model and evolve a data model using a Postgres SQL database and Neo4j.

This is what the two data models looked like after the initial data import and before any refactoring/migration had been done:

Relational

[diagram: initial relational schema]

Graph

[diagram: initial graph model]

I wanted to add a ‘nationality’ field to the players table in the SQL schema and a ‘nationality’ property to the nodes with the ‘Player’ label in the graph.

This refactoring is quite easy in both models. In the relational database we first run a query to add the ‘nationality’ field to the table:

ALTER TABLE players 
ADD COLUMN nationality CHARACTER VARYING(30);

And then we need to generate UPDATE statements from our data dump to update all the existing records:

UPDATE players 
SET nationality = 'Brazil'
WHERE players.id = '/aldair/profil/spieler/4151';
 
...

In the graph we can do this in a single step by processing our data dump using the LOAD CSV command and then setting a property on each player:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///transfers.csv" AS row
MATCH (player:Player {id: row.playerUri})
SET player.nationality = row.playerNationality

If we wanted to make the nationality field non nullable we could go back and run the following queries:

ALTER TABLE players 
ALTER COLUMN nationality SET NOT NULL;

CREATE CONSTRAINT ON (player:Player) 
ASSERT exists(player.nationality)
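Before doing that it’s worth checking that every record actually picked up a value – a quick sketch of the checks I’d run on each side:

SELECT COUNT(*) FROM players WHERE nationality IS NULL;

MATCH (player:Player)
WHERE NOT exists(player.nationality)
RETURN COUNT(*)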

And we’re done!

Categories: Programming

R: substr – Getting a vector of positions

Mon, 04/18/2016 - 20:49

I recently found myself writing an R script to extract parts of a string based on a beginning and end index which is reasonably easy using the substr function:

> substr("mark loves graphs", 0, 4)
[1] "mark"

But what if we have a vector of start and end positions?

> substr("mark loves graphs", c(0, 6), c(4, 10))
[1] "mark"

Hmmm that didn’t work as I expected! It turns out we actually need to use the substring function instead which wasn’t initially obvious to me on reading the documentation:

> substring("mark loves graphs", c(0, 6, 12), c(4, 10, 17))
[1] "mark"   "loves"  "graphs"

Easy when you know how!

Categories: Programming

R: tm – Unique words/terms per document

Mon, 04/11/2016 - 06:40

I’ve been doing a bit of text mining over the weekend using the R tm package and I wanted to only count a term once per document, which isn’t how it works out of the box.

For example let’s say we’re writing a bit of code to calculate the frequency of terms across some documents. We might write the following code:

library(tm)
text = c("I am Mark I am Mark", "Neo4j is cool Neo4j is cool")
corpus = VCorpus(VectorSource(text))
tdm = as.matrix(TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf))))
 
> tdm
       Docs
Terms   1 2
  am    2 0
  cool  0 2
  i     2 0
  is    0 2
  mark  2 0
  neo4j 0 2
 
> rowSums(tdm)
   am  cool     i    is  mark neo4j 
    2     2     2     2     2     2

We’ve created a small corpus over a vector which contains two bits of text. The TermDocumentMatrix shows how frequently each term shows up in each document, and the rowSums call on the last line adds those counts up across the corpus. I had to tweak the default minimum word length of 3 to make sure we could see ‘am’ and ‘cool’.

But we’ve actually got some duplicate terms in each of our documents so we want to get rid of those and only count unique terms per document.

We can achieve that by mapping over the corpus using the tm_map function and then applying a function which returns unique terms. I wrote the following function:

uniqueWords = function(d) {
  return(paste(unique(strsplit(d, " ")[[1]]), collapse = ' '))
}

We can then apply the function like so:

corpus = tm_map(corpus, content_transformer(uniqueWords))
tdm = as.matrix(TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf))))
 
> tdm
       Docs
Terms   1 2
  am    1 0
  cool  0 1
  i     1 0
  is    0 1
  mark  1 0
  neo4j 0 1
 
> rowSums(tdm)
   am  cool     i    is  mark neo4j 
    1     1     1     1     1     1

And now each term is only counted once. Success!
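An alternative I came across afterwards, which I’ve only tried quickly so treat it as a sketch: TermDocumentMatrix accepts a weighting function via its control list, and weightBin records presence/absence rather than raw counts, which should get us to the same place without transforming the corpus at all:

library(tm)
text = c("I am Mark I am Mark", "Neo4j is cool Neo4j is cool")
corpus = VCorpus(VectorSource(text))
 
# weightBin scores each term 1 if it appears in a document and 0 otherwise
tdm = as.matrix(TermDocumentMatrix(corpus, control = list(weighting = weightBin, wordLengths = c(1, Inf))))
rowSums(tdm)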

Categories: Programming