Warning: Table './devblogsdb/cache_page' is marked as crashed and last (automatic?) repair failed query: SELECT data, created, headers, expire, serialized FROM cache_page WHERE cid = 'http://www.softdevblogs.com/?q=aggregator/sources/23&page=2' in /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/database.mysql.inc on line 135

Warning: Cannot modify header information - headers already sent by (output started at /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/database.mysql.inc:135) in /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/bootstrap.inc on line 729

Warning: Cannot modify header information - headers already sent by (output started at /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/database.mysql.inc:135) in /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/bootstrap.inc on line 730

Warning: Cannot modify header information - headers already sent by (output started at /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/database.mysql.inc:135) in /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/bootstrap.inc on line 731

Warning: Cannot modify header information - headers already sent by (output started at /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/database.mysql.inc:135) in /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/bootstrap.inc on line 732
Software Development Blogs: Programming, Software Testing, Agile, Project Management
Skip to content

Software Development Blogs: Programming, Software Testing, Agile Project Management

Methods & Tools

Subscribe to Methods & Tools
if you are not afraid to read more than one page to be a smarter software developer, software tester or project manager!

Mark Needham
warning: Cannot modify header information - headers already sent by (output started at /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/database.mysql.inc:135) in /home/content/O/c/n/Ocnarfparking9/html/softdevblogs/includes/common.inc on line 153.
Syndicate content
Thoughts on Software Development
Updated: 5 hours 37 min ago

R: substr – Getting a vector of positions

Mon, 04/18/2016 - 20:49

I recently found myself writing an R script to extract parts of a string based on a beginning and end index which is reasonably easy using the substr function:

> substr("mark loves graphs", 0, 4)
[1] "mark"

But what if we have a vector of start and end positions?

> substr("mark loves graphs", c(0, 6), c(4, 10))
[1] "mark"

Hmmm that didn’t work as I expected! It turns out we actually need to use the substring function instead which wasn’t initially obvious to me on reading the documentation:

> substring("mark loves graphs", c(0, 6, 12), c(4, 10, 17))
[1] "mark"   "loves"  "graphs"

Easy when you know how!

Categories: Programming

R: tm – Unique words/terms per document

Mon, 04/11/2016 - 06:40

I’ve been doing a bit of text mining over the weekend using the R tm package and I wanted to only count a term once per document which isn’t how it works out the box.

For example let’s say we’re writing a bit of code to calculate the frequency of terms across some documents. We might write the following code:

library(tm)
text = c("I am Mark I am Mark", "Neo4j is cool Neo4j is cool")
corpus = VCorpus(VectorSource(text))
tdm = as.matrix(TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf))))
 
> tdm
       Docs
Terms   1 2
  am    2 0
  cool  0 2
  i     2 0
  is    0 2
  mark  2 0
  neo4j 0 2
 
> rowSums(tdm)
   am  cool     i    is  mark neo4j 
    2     2     2     2     2     2

We’ve created a small corpus over a vector which contains two bits of text. On the last line we output a TermDocumentMatrix which shows how frequently each term shows up across the corpus. I had to tweak the default word length of 3 to make sure we could see ‘am’ and ‘cool’.

But we’ve actually got some duplicate terms in each of our documents so we want to get rid of those and only count unique terms per document.

We can achieve that by mapping over the corpus using the tm_map function and then applying a function which returns unique terms. I wrote the following function:

uniqueWords = function(d) {
  return(paste(unique(strsplit(d, " ")[[1]]), collapse = ' '))
}

We can then apply the function like so:

corpus = tm_map(corpus, content_transformer(uniqueWords))
tdm = as.matrix(TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf))))
 
> tdm
       Docs
Terms   1 2
  am    1 0
  cool  0 1
  i     1 0
  is    0 1
  mark  1 0
  neo4j 0 1
 
> rowSums(tdm)
   am  cool     i    is  mark neo4j 
    1     1     1     1     1     1

And now each term is only counted once. Success!

Categories: Programming

Neo4j: A procedure for the SLM clustering algorithm

Sun, 02/28/2016 - 21:40

In the middle of last year I blogged about the Smart Local Moving algorithm which is used for community detection in networks and with the upcoming introduction of procedures in Neo4j I thought it’d be fun to make that code accessible as one.

If you want to grab the code and follow along it’s sitting on the SLM repository on my github.

At the moment the procedure is hard coded to work with a KNOWS relationship between two nodes but that could easily be changed.

To check it’s working correctly I thought it’d make most sense to use the Karate Club data set described on the SLM home page. I think this data set is originally from Networks, Crowds and Markets.

I wrote the following LOAD CSV script to create the graph in Neo4j:

LOAD CSV FROM "file:///Users/markneedham/projects/slm/karate_club_network.txt" as row
FIELDTERMINATOR "\t"
MERGE (person1:Person {id: row[0]})
MERGE (person2:Person {id: row[1]})
MERGE (person1)-[:KNOWS]->(person2)
Graph

Next we need to call the procedure which will add an appropriate label to each node depending which community it belongs to. This is what the procedure code looks like:

public class ClusterAllTheThings
{
    @Context
    public org.neo4j.graphdb.GraphDatabaseService db;
 
    @Procedure
    @PerformsWrites
    public Stream<Cluster> knows() throws IOException
    {
        String query = "MATCH (person1:Person)-[r:KNOWS]->(person2:Person) \n" +
                       "RETURN person1.id AS p1, person2.id AS p2, toFloat(1) AS weight";
 
        Result rows = db.execute( query );
 
        ModularityOptimizer.ModularityFunction modularityFunction = ModularityOptimizer.ModularityFunction.Standard;
        Network network = Network.create( modularityFunction, rows );
 
        double resolution = 1.0;
        int nRandomStarts = 1;
        int nIterations = 10;
        long randomSeed = 0;
 
        double modularity;
 
        Random random = new Random( randomSeed );
 
        double resolution2 = modularityFunction.resolution( resolution, network );
 
        Map<Integer, Node> cluster = new HashMap<>();
        double maxModularity = Double.NEGATIVE_INFINITY;
 
        for ( int randomStart = 0; randomStart < nRandomStarts; randomStart++ )
        {
            network.initSingletonClusters();
 
            int iteration = 0;
            do
            {
                network.runSmartLocalMovingAlgorithm( resolution2, random );
                iteration++;
 
                modularity = network.calcQualityFunction( resolution2 );
            } while ( (iteration < nIterations) );
 
            if ( modularity > maxModularity )
            {
                network.orderClustersByNNodes();
                cluster = network.getNodes();
                maxModularity = modularity;
            }
        }
 
        for ( Map.Entry<Integer, Node> entry : cluster.entrySet() )
        {
            Map<String, Object> params = new HashMap<>();
            params.put("userId", String.valueOf(entry.getKey()));
 
            db.execute("MATCH (person:Person {id: {userId}})\n" +
                       "SET person:`" + (format( "Community-%d`", entry.getValue().getCluster() )),
                    params);
        }
 
        return cluster
                .entrySet()
                .stream()
                .map( ( entry ) -> new Cluster( entry.getKey(), entry.getValue().getCluster() ) );
    }
 
    public static class Cluster
    {
        public long id;
        public long clusterId;
 
        public Cluster( int id, int clusterId )
        {
            this.id = id;
            this.clusterId = clusterId;
        }
    }
}

I’ve hardcoded some parameters to use defaults which could be exposed through the procedure to allow more control if necessary. The Network#create function assumes it is going to receive a stream of rows containing columns ‘p1’, ‘p2’ and ‘weight’ to represent the ‘source’, ‘destination’ and ‘weight’ of the relationship between them.

We call the procedure like this:

CALL org.neo4j.slm.knows()

It will return each of the nodes and the cluster it’s been assigned to and if we then visualise the network in the neo4j browser we’ll see this:

Graph  1

which is similar to the visualisation from the SLM home page:

Network

If you want to play around with the code feel free. You’ll need to run the following commands to create the JAR for the plugin and deploy it.

$ mvn clean package 
$ cp target/slm-1.0.jar /path/to/neo4j/plugins/ 
$ ./path/to/neo4j/bin/neo4j restart

And you’ll need the latest milestone of Neo4j which has procedures enabled.

Categories: Programming

Clojure: First steps with reducers

Sun, 01/24/2016 - 23:01

I’ve been playing around with Clojure a bit today in preparation for a talk I’m giving next week and found myself writing the following code to apply the same function to three different scores:

(defn log2 [n]
  (/ (Math/log n) (Math/log 2)))
 
(defn score-item [n]
  (if (= n 0) 0 (log2 n)))
 
(+ (score-item 12) (score-item 13) (score-item 5)) 9.60733031374961

I’d forgotten about folding over a collection but quickly remembered that I could achieve the same result with the following code:

(reduce #(+ %1 (score-item %2)) 0 [12 13 5]) 9.60733031374961

The added advantage here is that if I want to add a 4th score to the mix all I need to do is append it to the end of the vector:

(reduce #(+ %1 (score-item %2)) 0 [12 13 5 6]) 12.192292814470767

However, while Googling to remind myself of the order of the arguments to reduce I kept coming across articles and documentation about reducers which I’d heard about but never used.

As I understand they’re used to achieve performance gains and easier composition of functions over collections so I’m not sure how useful they’ll be to me but I thought I’d give them a try.

Our first step is to bring the namespace into scope:

(require '[clojure.core.reducers :as r])

Now we can compute the same result using the reduce function:

(r/reduce #(+ %1 (score-item %2)) 0 [12 13 5 6]) 12.192292814470767

So far, so identical. If we wanted to calculate individual scores and then filter out those below a certain threshold the code would behave a little differently:

(->>[12 13 5 6]
    (map score-item)
    (filter #(> % 3))) (3.5849625007211565 3.700439718141092)
 
(->> [12 13 5 6]
     (r/map score-item)
     (r/filter #(> % 3))) #object[clojure.core.reducers$folder$reify__19192 0x5d0edf21 "clojure.core.reducers$folder$reify__19192@5d0edf21"]

Instead of giving us a vector of scores the reducers version returns a reducer which can pass into reduce or fold if we want an accumulated result or into if we want to output a collection. In this case we want the latter:

(->> [12 13 5 6]
     (r/map score-item)
     (r/filter #(> % 3))
     (into [])) (3.5849625007211565 3.700439718141092)

With a measly 4 item collection I don’t think the reducers are going to provide much speed improvement here but we’d need to use the fold function if we want processing of the collection to be done in parallel.

One for next time!

Categories: Programming

Neo4j: Cypher – avoid duplicate calls to NOT patterns

Sun, 01/17/2016 - 13:19

I’ve been reacquainting myself with the meetup.com dataset ahead of Wednesday’s meetup in London and wanted to write a collaborative filtering type query to work out which groups people in my groups were in.

This started simple enough:

MATCH (member:Member {name: "Mark Needham"})-[:MEMBER_OF]->(group:Group)<-[:MEMBER_OF]-(other:Member)-[:MEMBER_OF]->(otherGroup:Group)
RETURN otherGroup, COUNT(*) AS commonMembers
ORDER BY commonMembers DESC
LIMIT 5

And doesn’t take too long to run:

Cypher version: CYPHER 2.3, planner: COST. 1084378 total db hits in 1103 ms.

However, it was showing up several groups that I’m already a member of so I added in a “WHERE NOT” clause to sort that out:

MATCH (member:Member {name: "Mark Needham"})-[:MEMBER_OF]->(group:Group)<-[:MEMBER_OF]-(other:Member)-[:MEMBER_OF]->(otherGroup:Group)
WHERE NOT (member)-[:MEMBER_OF]->(otherGroup)
RETURN otherGroup, COUNT(*) AS commonMembers
ORDER BY commonMembers DESC
LIMIT 5

Unfortunately when I ran this the amount of db hits increased by 14x and it now took 3x as long to run:

Cypher version: CYPHER 2.3, planner: COST. 14061442 total db hits in 3364 ms.


The problem is that we’re making lots of duplicate calls to NOT (member)-[:MEMBER_OF]->(otherGroup) because each group shows up lots of times.

This is the ‘reduce cardinality of work in progress’ tip from Michael Hunger’s blog post:

Bonus Query Tuning Tip: Reduce Cardinality of Work in Progress

When following longer paths, you’ll encounter duplicates. If you’re not interested in all the possible paths – but just distinct information from stages of the path – make sure that you eagerly eliminate duplicates, so that later matches don’t have to be executed many multiple times.

We can reduce the WIP in our query by doing the counting of common members first and then filtering out the groups we’re already a member of:

MATCH (member:Member {name: "Mark Needham"})-[:MEMBER_OF]->(group:Group)<-[:MEMBER_OF]-(other:Member)-[:MEMBER_OF]->(otherGroup:Group)
WITH otherGroup, member, COUNT(*) AS commonMembers
WHERE NOT (member)-[:MEMBER_OF]->(otherGroup)
RETURN otherGroup, commonMembers
ORDER BY commonMembers DESC
LIMIT 5

This gets us back down to something closer to the running time/db hits of our initial query:

Cypher version: CYPHER 2.3, planner: COST. 1097114 total db hits in 1004 ms.
Categories: Programming