
Mark Needham
Thoughts on Software Development

Shell: Create a comma separated string

Fri, 06/23/2017 - 13:26

I recently needed to generate a string with comma separated values, based on iterating a range of numbers.

e.g. we should get the following output where n = 3

foo-0,foo-1,foo-2

I only had the shell available to me so I couldn’t shell out into Python or Ruby for example. That means it’s bash scripting time!

If we want to iterate a range of numbers and print them out on the screen we can write the following code:

n=3
for i in $(seq 0 $(($n > 0? $n-1: 0))); do 
  echo "foo-$i"
done

foo-0
foo-1
foo-2

Combining them into a string is a bit trickier, but luckily I found a great blog post by Andreas Haupt which shows what to do. Andreas is solving a more complicated problem than mine, but these are the bits of code that we need from his post.

n=3
combined=""

for i in $(seq 0 $(($n > 0? $n-1: 0))); do 
  token="foo-$i"
  combined="${combined}${combined:+,}$token"
done
echo $combined

foo-0,foo-1,foo-2

This won’t work if you set n<0 but that’s ok for me! I’ll let Andreas explain how it works:

  • ${combined:+,} will return either a comma (if combined exists and is set) or nothing at all.
  • In the first invocation of the loop combined is not yet set and nothing is put out.
  • In the next rounds combined is set and a comma will be put out.

We can see it in action by printing out the value of $combined after each iteration of the loop:

n=3
combined=""

for i in $(seq 0 $(($n > 0 ? $n-1: 0))); do 
  token="foo-$i"
  combined="${combined}${combined:+,}$token"
  echo $combined
done

foo-0
foo-0,foo-1
foo-0,foo-1,foo-2

Looks good to me!

The post Shell: Create a comma separated string appeared first on Mark Needham.

Categories: Programming

scikit-learn: Random forests – Feature Importance

Fri, 06/16/2017 - 06:55

As I mentioned in a blog post a couple of weeks ago, I’ve been playing around with the Kaggle House Prices competition and the most recent thing I tried was training a random forest regressor.

Unfortunately, although it gave me better results locally it got a worse score on the unseen data, which I figured meant I’d overfitted the model.

I wasn’t really sure how to work out if that theory was true or not, but by chance I was reading Chris Albon’s blog and found a post where he explains how to inspect the importance of every feature in a random forest. Just what I needed!

Stealing from Chris’ post I wrote the following code to work out the feature importance for my dataset:

Prerequisites
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# We'll use this library to make the display pretty
from tabulate import tabulate
Load Data
train = pd.read_csv('train.csv')

# the model can only handle numeric values so filter out the rest
data = train.select_dtypes(include=[np.number]).interpolate().dropna()
Split train/test sets
y = train.SalePrice
X = data.drop(["SalePrice", "Id"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)
Train model
clf = RandomForestRegressor(n_jobs=2, n_estimators=1000)
model = clf.fit(X_train, y_train)
Feature Importance
headers = ["name", "score"]
values = sorted(zip(X_train.columns, model.feature_importances_), key=lambda x: x[1] * -1)
print(tabulate(values, headers, tablefmt="plain"))
name                 score
OverallQual    0.553829
GrLivArea      0.131
BsmtFinSF1     0.0374779
TotalBsmtSF    0.0372076
1stFlrSF       0.0321814
GarageCars     0.0226189
GarageArea     0.0215719
LotArea        0.0214979
YearBuilt      0.0184556
2ndFlrSF       0.0127248
YearRemodAdd   0.0126581
WoodDeckSF     0.0108077
OpenPorchSF    0.00945239
LotFrontage    0.00873811
TotRmsAbvGrd   0.00803121
GarageYrBlt    0.00760442
BsmtUnfSF      0.00715158
MasVnrArea     0.00680341
ScreenPorch    0.00618797
Fireplaces     0.00521741
OverallCond    0.00487722
MoSold         0.00461165
MSSubClass     0.00458496
BedroomAbvGr   0.00253031
FullBath       0.0024245
YrSold         0.00211638
HalfBath       0.0014954
KitchenAbvGr   0.00140786
BsmtFullBath   0.00137335
BsmtFinSF2     0.00107147
EnclosedPorch  0.000951266
3SsnPorch      0.000501238
PoolArea       0.000261668
LowQualFinSF   0.000241304
BsmtHalfBath   0.000179506
MiscVal        0.000154799

So OverallQual is quite a good predictor but then there’s a steep fall to GrLivArea before things really tail off after WoodDeckSF.

I think this is telling us that a lot of these features aren’t useful at all and can be removed from the model. There are also a bunch of categorical/factor variables that have been stripped out of the model but might be predictive of the house price.

These are the next things I’m going to explore:

  • Make the categorical variables numeric (perhaps by using one hot encoding for some of them) – see the sketch below
  • Remove the most predictive features and build a model that only uses the other features
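As a rough sketch of that first idea (my own illustration rather than code from the post; the choice of which columns to encode is an assumption), the pandas get_dummies function can one hot encode the object columns before they go into the same regressor:

import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

train = pd.read_csv('train.csv')

# keep the numeric columns as before and one hot encode the categorical (object) ones
numeric = train.select_dtypes(include=[np.number]).interpolate().dropna()
categorical = pd.get_dummies(train.select_dtypes(include=['object'])).loc[numeric.index]
data = pd.concat([numeric, categorical], axis=1)

y = data.SalePrice
X = data.drop(["SalePrice", "Id"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)

clf = RandomForestRegressor(n_jobs=2, n_estimators=1000)
model = clf.fit(X_train, y_train)
print(model.score(X_test, y_test))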

The post scikit-learn: Random forests – Feature Importance appeared first on Mark Needham.

Categories: Programming

Kubernetes: Which node is a pod on?

Wed, 06/14/2017 - 09:49

When running Kubernetes on a cloud provider, rather than locally using minikube, it’s useful to know which node a pod is running on.

The normal command to list pods doesn’t contain this information:

$ kubectl get pod
NAME           READY     STATUS    RESTARTS   AGE       
neo4j-core-0   1/1       Running   0          6m        
neo4j-core-1   1/1       Running   0          6m        
neo4j-core-2   1/1       Running   0          2m        

I spent a while searching for a command that I could use before I came across Ta-Ching Chen’s blog post while looking for something else.

Ta-Ching points out that we just need to add the flag -o wide to our original command to get the information we require:

$ kubectl get pod -o wide
NAME           READY     STATUS    RESTARTS   AGE       IP           NODE
neo4j-core-0   1/1       Running   0          6m        10.32.3.6    gke-neo4j-cluster-default-pool-ded394fa-0kpw
neo4j-core-1   1/1       Running   0          6m        10.32.3.7    gke-neo4j-cluster-default-pool-ded394fa-0kpw
neo4j-core-2   1/1       Running   0          2m        10.32.0.10   gke-neo4j-cluster-default-pool-ded394fa-kp68

Easy!

The post Kubernetes: Which node is a pod on? appeared first on Mark Needham.

Categories: Programming

Kaggle: House Prices: Advanced Regression Techniques – Trying to fill in missing values

Sun, 06/04/2017 - 10:22

I’ve been playing around with the data in Kaggle’s House Prices: Advanced Regression Techniques and while replicating Poonam Ligade’s exploratory analysis I wanted to see if I could create a model to fill in some of the missing values.

Poonam wrote the following code to identify which columns in the dataset had the most missing values:

import pandas as pd
train = pd.read_csv('train.csv')
null_columns=train.columns[train.isnull().any()]

>>> print(train[null_columns].isnull().sum())
LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

The one that I’m most interested in is LotFrontage, which describes ‘Linear feet of street connected to property’. There are a few other columns related to lots so I thought I might be able to use them to fill in the missing LotFrontage values.

We can write the following code to find a selection of the rows missing a LotFrontage value:

cols = [col for col in train.columns if col.startswith("Lot")]
missing_frontage = train[cols][train["LotFrontage"].isnull()]

>>> print(missing_frontage.head())
    LotFrontage  LotArea LotShape LotConfig
7           NaN    10382      IR1    Corner
12          NaN    12968      IR2    Inside
14          NaN    10920      IR1    Corner
16          NaN    11241      IR1   CulDSac
24          NaN     8246      IR1    Inside

I want to use scikit-learn’s linear regression model, which only works with numeric values, so we need to convert our categorical variables into numeric equivalents. We can use the pandas get_dummies function for this.

Let’s try it out on the LotShape column:

sub_train = train[train.LotFrontage.notnull()]
dummies = pd.get_dummies(sub_train[cols].LotShape)

>>> print(dummies.head())
   IR1  IR2  IR3  Reg
0    0    0    0    1
1    0    0    0    1
2    1    0    0    0
3    1    0    0    0
4    1    0    0    0

Cool, that looks good. We can do the same with LotConfig and then we need to add these new columns onto the original DataFrame. We can use pandas concat function to do this.

import numpy as np

data = pd.concat([
        sub_train[cols],
        pd.get_dummies(sub_train[cols].LotShape),
        pd.get_dummies(sub_train[cols].LotConfig)
    ], axis=1).select_dtypes(include=[np.number])

>>> print(data.head())
   LotFrontage  LotArea  IR1  IR2  IR3  Reg  Corner  CulDSac  FR2  FR3  Inside
0         65.0     8450    0    0    0    1       0        0    0    0       1
1         80.0     9600    0    0    0    1       0        0    1    0       0
2         68.0    11250    1    0    0    0       0        0    0    0       1
3         60.0     9550    1    0    0    0       1        0    0    0       0
4         84.0    14260    1    0    0    0       0        0    1    0       0

We can now split data into train and test sets and create a model.

from sklearn import linear_model
from sklearn.model_selection import train_test_split

X = data.drop(["LotFrontage"], axis=1)
y = data.LotFrontage

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)

lr = linear_model.LinearRegression()

model = lr.fit(X_train, y_train)

Now it’s time to give it a try on the test set:

>>> print("R^2 is: \n", model.score(X_test, y_test))
R^2 is: 
 -0.84137438493

Hmm that didn’t work too well – an R^2 score of less than 0 suggests that we’d be better off just predicting the average LotFrontage regardless of any of the other features. We can confirm that with the following code:

from sklearn.metrics import r2_score

>>> print(r2_score(y_test, np.repeat(y_test.mean(), len(y_test))))
0.0

whereas if we had all of the values correct we’d get a score of 1:

>>> print(r2_score(y_test, y_test))
1.0

In summary, not a very successful experiment. Poonam derives a value for LotFrontage based on the square root of LotArea so perhaps that’s the best we can do here.
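For what it’s worth, here’s a quick sketch of that idea (my own code, loosely following Poonam’s approach rather than anything taken from her kernel): work out a typical ratio of LotFrontage to the square root of LotArea from the rows that do have a value and use it to fill in the gaps.

import numpy as np
import pandas as pd

train = pd.read_csv('train.csv')

# ratio of known frontages to sqrt(LotArea), taken from the rows that have a value
known = train[train.LotFrontage.notnull()]
ratio = (known.LotFrontage / np.sqrt(known.LotArea)).median()

# fill the missing LotFrontage values using that ratio
missing = train.LotFrontage.isnull()
train.loc[missing, "LotFrontage"] = ratio * np.sqrt(train.loc[missing, "LotArea"])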

The post Kaggle: House Prices: Advanced Regression Techniques – Trying to fill in missing values appeared first on Mark Needham.

Categories: Programming

GraphQL-Europe: A trip to Berlin

Sat, 05/27/2017 - 12:31

Last weekend my colleagues Will, Michael, Oskar, and I went to Berlin to spend Sunday at the GraphQL Europe conference.


Neo4j sponsored the conference as we’ve been experimenting with building a GraphQL to Neo4j integration and wanted to get some feedback from the community as well as learn what’s going on in GraphQL land.

Will and Michael have written about their experience where they talk more about the hackathon we hosted so I’ll cover it more from a personal perspective.

The first thing that stood out for me was how busy it was – I knew GraphQL was pretty hipster but I wasn’t expecting there to be ~ 300 attendees.

The venue was amazing – the nHow Hotel is located right next to the Spree River so there were great views to be had during the breaks. It also helped that it was really sunny for the whole day!


I spent most of the day hanging out at the Neo4j booth which was good fun – several people pointed out that an integration between Neo4j and GraphQL made a lot of sense, given that GraphQL talks about the application graph and Neo4j is all about graphs in general.

I managed to attend a few of the talks, including one by Brooks Swinnerton from GitHub who announced that they’d be moving to GraphQL for v4 of their API.

The most interesting part of the talk for me was when Brooks said they’d directed requests for their REST API to the GraphQL one behind the scenes for a while now to check that it could handle the load.

GitHub is moving to GraphQL for v4 of our API because it offers significantly more flexibility for our integrators. The ability to define precisely the data you want – and only the data you want – is a powerful advantage over the REST API v3 endpoints.

I think Twitter may be doing something similar, based on this tweet by Tom Ashworth:

Heh. Twitter GraphQL is quietly serving more than 40 million queries per day. Tiny at Twitter scale but not a bad start.

— tom (@tgvashworth) May 9, 2017

From what I could tell the early adoption of GraphQL seems to be on the front end of applications – several of the attendees had been at ReactEurope a couple of days earlier – but microservices were mentioned in a few of the talks and it was suggested that GraphQL works well in that world as well.

It was a fun day out so thanks to the folks at Graphcool for organising!

The post GraphQL-Europe: A trip to Berlin appeared first on Mark Needham.

Categories: Programming

PostgreSQL: ERROR: argument of WHERE must not return a set

Mon, 05/01/2017 - 21:42

In my last post I showed how to load and query data from the Strava API in PostgreSQL, and after executing some simple queries my next task was to query a more complex part of the JSON structure.


Strava allows users to create segments, which are edited portions of road or trail where athletes can compete for time.

I wanted to write a query to find all the times that I’d run a particular segment. e.g. the Akerman Road segment covers a road running North to South in Kennington/Stockwell in South London.

This segment has the id ‘6818475’ so we’ll need to look inside segment_efforts and then compare the value segment.id against this id.

I initially wrote this query to try and find the times I’d run this segment:

SELECT id, data->'start_date' AS startDate, data->'average_speed' AS averageSpeed
FROM runs
WHERE jsonb_array_elements(data->'segment_efforts')->'segment'->>'id' = '6818475'

ERROR:  argument of WHERE must not return a set
LINE 3: WHERE jsonb_array_elements(data->'segment_efforts')->'segmen...

This doesn’t work because jsonb_array_elements returns a set of values, so our WHERE clause ends up returning a set of booleans rather than a single boolean, as Craig Ringer points out on Stack Overflow.

Instead we can use a LATERAL subquery to achieve our goal:

SELECT id, data->'start_date' AS startDate, data->'average_speed' AS averageSpeed
FROM runs r,
LATERAL jsonb_array_elements(r.data->'segment_efforts') segment
WHERE segment ->'segment'->>'id' = '6818475'

    id     |       startdate        | averagespeed 
-----------+------------------------+--------------
 455461182 | "2015-12-24T11:20:26Z" | 2.841
 440088621 | "2015-11-27T06:10:42Z" | 2.975
 407930503 | "2015-10-07T05:18:34Z" | 2.985
 317170464 | "2015-06-03T04:44:59Z" | 2.842
 312629236 | "2015-05-27T04:46:33Z" | 2.857
 277786711 | "2015-04-02T05:25:59Z" | 2.408
 226351235 | "2014-12-05T07:59:15Z" | 2.803
 225073326 | "2014-12-01T06:15:21Z" | 2.929
 224287690 | "2014-11-29T09:02:46Z" | 3.087
 223964715 | "2014-11-28T06:18:29Z" | 2.844
(10 rows)

Perfect!

The post PostgreSQL: ERROR: argument of WHERE must not return a set appeared first on Mark Needham.

Categories: Programming

Loading and analysing Strava runs using PostgreSQL JSON data type

Mon, 05/01/2017 - 20:11

In my last post I showed how to map Strava runs using data that I’d extracted from their /activities API, but the API returns a lot of other data that I discarded because I wasn’t sure what I should keep.

The API returns a nested JSON structure so the easiest solution would be to save each run as an individual file but I’ve always wanted to try out PostgreSQL’s JSON data type and this seemed like a good opportunity.

Creating a JSON ready PostgreSQL table

First up we need to create a database in which we’ll store our Strava data. Let’s name it appropriately:

create database strava;
\connect strava;

Now we can create a table with one field that uses the JSON data type:

CREATE TABLE runs (
  id INTEGER NOT NULL,
  data jsonb
);

ALTER TABLE runs ADD PRIMARY KEY(id);

Easy enough. Now we’re ready to populate the table.

Importing Strava API

We can partially reuse the script from the last post, except rather than saving to a CSV file we’ll save to PostgreSQL using the psycopg2 library.


The script relies on a TOKEN environment variable. If you want to try this on your own Strava account you’ll need to create an application, which will give you a key.

extract-runs.py

import requests
import os
import json
import psycopg2

token = os.environ["TOKEN"]
headers = {'Authorization': "Bearer {0}".format(token)}

with psycopg2.connect("dbname=strava user=markneedham") as conn:
    with conn.cursor() as cur:
        page = 1
        while True:
            r = requests.get("https://www.strava.com/api/v3/athlete/activities?page={0}".format(page), headers = headers)
            response = r.json()

            if len(response) == 0:
                break
            else:
                for activity in response:
                    r = requests.get("https://www.strava.com/api/v3/activities/{0}?include_all_efforts=true".format(activity["id"]), headers = headers)
                    json_response = r.json()
                    cur.execute("INSERT INTO runs (id, data) VALUES(%s, %s)", (activity["id"], json.dumps(json_response)))
                    conn.commit()
                page += 1
Querying Strava

We can now write some queries against our newly imported data.

My quickest runs

SELECT id, data->>'start_date' as start_date, 
       (data->>'average_speed')::float as speed 
FROM runs 
ORDER BY speed DESC 
LIMIT 5

    id     |      start_date      | speed 
-----------+----------------------+-------
 649253963 | 2016-07-22T05:18:37Z | 3.736
 914796614 | 2017-03-26T08:37:56Z | 3.614
 653703601 | 2016-07-26T05:25:07Z | 3.606
 548540883 | 2016-04-17T18:18:05Z | 3.604
 665006485 | 2016-08-05T04:11:21Z | 3.604
(5 rows)
My longest runs

SELECT id, data->>'start_date' as start_date, 
       (data->>'distance')::float as distance
FROM runs
ORDER BY distance DESC
LIMIT 5

    id     |      start_date      | distance 
-----------+----------------------+----------
 840246999 | 2017-01-22T10:20:33Z |  10764.1
 461124609 | 2016-01-02T08:42:47Z |  10457.9
 467634177 | 2016-01-10T18:48:47Z |  10434.5
 471467618 | 2016-01-16T12:33:28Z |  10359.3
 540811705 | 2016-04-10T07:26:55Z |   9651.6
(5 rows)
Runs this year

SELECT COUNT(*)
FROM runs
WHERE data->>'start_date' >= '2017-01-01 00:00:00'

 count 
-------
    62
(1 row)
Runs per year
SELECT EXTRACT(year from to_date(data->>'start_date', 'YYYY-mm-dd')) AS year, 
       count(*) 
FROM runs 
GROUP BY year 
ORDER BY year

 year | count 
------+-------
 2014 |    18
 2015 |   139
 2016 |   166
 2017 |    62
(4 rows)
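In the meantime, pulling any of these queries back into Python only needs the same psycopg2 connection that we used for the import – a quick sketch (not from the original post) that runs the longest runs query:

import psycopg2

with psycopg2.connect("dbname=strava user=markneedham") as conn:
    with conn.cursor() as cur:
        # same query as above: the five longest runs
        cur.execute("""
            SELECT id, data->>'start_date' AS start_date, (data->>'distance')::float AS distance
            FROM runs
            ORDER BY distance DESC
            LIMIT 5
        """)
        for row in cur.fetchall():
            print(row)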

That’s all for now. Next I’m going to learn how to query segments, which are stored inside a nested array inside the JSON document. Stay tuned for that in a future post.

The post Loading and analysing Strava runs using PostgreSQL JSON data type appeared first on Mark Needham.

Categories: Programming

Leaflet: Mapping Strava runs/polylines on Open Street Map

Sat, 04/29/2017 - 16:36

I’m a big Strava user and spent a bit of time last weekend playing around with their API to work out how to map all my runs.


Strava API and polylines

This is a two step process:

  1. Call the /athlete/activities/ endpoint to get a list of all my activities
  2. For each of those activities call /activities/[activityId] endpoint to get more detailed information for each activity

That second API returns a ‘polyline’ property which the documentation describes as follows:

Activity and segment API requests may include summary polylines of their respective routes. The values are string encodings of the latitude and longitude points using the Google encoded polyline algorithm format.

If we navigate to that page we get the following explanation:

Polyline encoding is a lossy compression algorithm that allows you to store a series of coordinates as a single string.

I tried out a couple of my polylines using the interactive polyline encoder utility which worked well once I realised that I needed to escape backslashes (“\”) in the polyline before pasting it into the tool.
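To save doing that escaping by hand, a little helper along these lines does the job (my own aside, reusing the API call that appears later in this post; the activity id is a placeholder):

import requests
import os

token = os.environ["TOKEN"]
headers = {'Authorization': "Bearer {0}".format(token)}

# 12345678 is a placeholder activity id - swap in one of your own runs
r = requests.get("https://www.strava.com/api/v3/activities/12345678?include_all_efforts=true", headers = headers)

# double up the backslashes so the encoded polyline can be pasted straight into a JavaScript string literal
print(r.json()["map"]["polyline"].replace("\\", "\\\\"))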

Now that I’d figured out how to map one run it was time to automate the process.

Leaflet and OpenStreetMap

I’ve previously had a good experience using Leaflet so I was keen to use that and luckily came across a Stack Overflow answer showing how to do what I wanted.

I created an HTML file and manually pasted in a couple of my runs (not forgetting to escape those backslashes!) to check that they worked. The stylesheet and script references below are placeholders for Leaflet and for a plugin that provides L.Polyline.fromEncoded:

blog.html


  
<html>
  <head>
    <title>Mapping my runs</title>
  </head>

  <body>
    <!-- placeholder references: point these at Leaflet and a plugin that provides L.Polyline.fromEncoded -->
    <link rel="stylesheet" href="leaflet.css" />
    <script src="leaflet.js"></script>
    <script src="Polyline.encoded.js"></script>

    <!-- the map div needs an explicit height to render -->
    <div id="map" style="width: 100%; height: 600px"></div>

    <script>
    var map = L.map('map').setView([55.609818, 13.003286], 13);
    L.tileLayer(
        'http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
            maxZoom: 18,
        }).addTo(map);

    var encodedRoutes = [
      "{zkrIm`inANPD?BDXGPKLATHNRBRFtAR~AFjAHl@D|ALtATj@HHJBL?`@EZ?NQ\\Y^MZURGJKR]RMXYh@QdAWf@[~@aAFGb@?j@YJKBU@m@FKZ[NSPKTCRJD?`@Wf@Wb@g@HCp@Qh@]z@SRMRE^EHJZnDHbBGPHb@NfBTxBN|DVbCBdA^lBFl@Lz@HbBDl@Lr@Bb@ApCAp@Ez@g@bEMl@g@`B_AvAq@l@    QF]Rs@Nq@CmAVKCK?_@Nw@h@UJIHOZa@xA]~@UfASn@U`@_@~@[d@Sn@s@rAs@dAGN?NVhAB\\Ox@@b@S|A?Tl@jBZpAt@vBJhATfGJn@b@fARp@H^Hx@ARGNSTIFWHe@AGBOTAP@^\\zBMpACjEWlEIrCKl@i@nAk@}@}@yBOWSg@kAgBUk@Mu@[mC?QLIEUAuAS_E?uCKyCA{BH{DDgF`AaEr@uAb@oA~@{AE}AKw@    g@qAU[_@w@[gAYm@]qAEa@FOXg@JGJ@j@o@bAy@NW?Qe@oCCc@SaBEOIIEQGaAe@kC_@{De@cE?KD[H[P]NcAJ_@DGd@Gh@UHI@Ua@}Bg@yBa@uDSo@i@UIICQUkCi@sCKe@]aAa@oBG{@G[CMOIKMQe@IIM@KB]Tg@Nw@^QL]NMPMn@@\\Lb@P~@XT",
      "u}krIq_inA_@y@My@Yu@OqAUsA]mAQc@CS@o@FSHSp@e@n@Wl@]ZCFEBK?OC_@Qw@?m@CSK[]]EMBeAA_@m@qEAg@UoCAaAMs@IkBMoACq@SwAGOYa@IYIyA_@kEMkC]{DEaAScC@yEHkGA_ALsCBiA@mCD{CCuAZcANOH@HDZl@Z`@RFh@\\TDT@ZVJBPMVGLM\\Mz@c@NCPMXERO|@a@^Ut@s@p@KJAJ    Bd@EHEXi@f@a@\\g@b@[HUD_B@uADg@DQLCLD~@l@`@J^TF?JANQ\\UbAyABEZIFG`@o@RAJEl@_@ZENDDIA[Ki@BURQZaARODKVs@LSdAiAz@G`BU^A^GT@PRp@zARXRn@`BlDHt@ZlAFh@^`BX|@HHHEf@i@FAHHp@bBd@v@DRAVMl@i@v@SROXm@tBILOTOLs@NON_@t@KX]h@Un@k@\\c@h@Ud@]ZGNKp@Sj@KJo@    b@W`@UPOX]XWd@UF]b@WPOAIBSf@QVi@j@_@V[b@Uj@YtAEFCCELARBn@`@lBjAzD^vB^hB?LENURkAv@[Ze@Xg@Py@p@QHONMA[HGAWE_@Em@Hg@AMCG@QHq@Cm@M[Jy@?UJIA{@Ae@KI@GFKNIX[QGAcAT[JK?OVMFK@IAIUKAYJI?QKUCGFIZCXDtAHl@@p@LjBCZS^ERAn@Fj@Br@Hn@HzAHh@RfD?j@TnCTlA    NjANb@\\z@TtARr@P`AFnAGfBG`@CFE?"
  ]

    for (let encoded of encodedRoutes) {
      var coordinates = L.Polyline.fromEncoded(encoded).getLatLngs();

      L.polyline(
          coordinates,
          {
              color: 'blue',
              weight: 2,
              opacity: .7,
              lineJoin: 'round'
          }
      ).addTo(map);
    }
    </script>
  </body>
</html>

We can spin up a Python web server over that HTML file to see how it renders:

$ python -m http.server
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...

And below we can see both runs plotted on the map.

Automating Strava API to Open Street Map

The final step is to automate the whole thing so that I can see all of my runs.

I wrote the following script to call the Strava API and save the polyline for every run to a CSV file:

import requests
import os
import sys
import csv

token = os.environ["TOKEN"]
headers = {'Authorization': "Bearer {0}".format(token)}

with open("runs.csv", "w") as runs_file:
    writer = csv.writer(runs_file, delimiter=",")
    writer.writerow(["id", "polyline"])

    page = 1
    while True:
        r = requests.get("https://www.strava.com/api/v3/athlete/activities?page={0}".format(page), headers = headers)
        response = r.json()

        if len(response) == 0:
            break
        else:
            for activity in response:
                r = requests.get("https://www.strava.com/api/v3/activities/{0}?include_all_efforts=true".format(activity["id"]), headers = headers)
                polyline = r.json()["map"]["polyline"]
                writer.writerow([activity["id"], polyline])
            page += 1

I then wrote a simple script using Flask to parse the CSV files and send a JSON representation of my runs to a slightly modified version of the HTML page that I described above:

from flask import Flask
from flask import render_template
import csv
import json

app = Flask(__name__)

@app.route('/')
def my_runs():
    runs = []
    with open("runs.csv", "r") as runs_file:
        reader = csv.DictReader(runs_file)

        for row in reader:
            runs.append(row["polyline"])

    return render_template("leaflet.html", runs = json.dumps(runs))

if __name__ == "__main__":
    app.run(port = 5001)

I changed the following line in the HTML file:

var encodedRoutes = {{ runs|safe }};

Now we can launch our Flask web server:

$ python app.py 
 * Running on http://127.0.0.1:5001/ (Press CTRL+C to quit)

And if we navigate to http://127.0.0.1:5001/ we can see all my runs that went near Westminster.


The full code for all the files I’ve described in this post is available on GitHub. If you give it a try you’ll need to provide your Strava token in the ‘TOKEN’ environment variable before running extract_runs.py.

Hope this was helpful and if you have any questions ask me in the comments.

The post Leaflet: Mapping Strava runs/polylines on Open Street Map appeared first on Mark Needham.

Categories: Programming

Python: Flask – Generating a static HTML page

Thu, 04/27/2017 - 21:59

Whenever I need to quickly spin up a web application, Python’s Flask library is my go to tool, but I recently found myself wanting to generate a static HTML page to upload to S3 and wondered if I could use it for that as well.

It’s actually not too tricky. If we’re in the scope of the app context then we have access to the template rendering that we’d normally use when serving the response to a web request.

The following code will generate an HTML file based on a template file templates/blog.html:

from flask import render_template
import flask

app = flask.Flask('my app')

if __name__ == "__main__":
    with app.app_context():
        rendered = render_template('blog.html', \
            title = "My Generated Page", \
            people = [{"name": "Mark"}, {"name": "Michael"}])
        print(rendered)

templates/blog.html



  
<html>
  <head>
    <title>{{ title }}</title>
  </head>
  <body>
    <h1>{{ title }}</h1>
    <ul>
      {% for person in people %}
      <li>{{ person.name }}</li>
      {% endfor %}
    </ul>
  </body>
</html>

If we execute the Python script it will generate the following HTML:

$ python blog.py 
<html>
  <head>
    <title>My Generated Page</title>
  </head>
  <body>
    <h1>My Generated Page</h1>
    <ul>
      <li>Mark</li>
      <li>Michael</li>
    </ul>
  </body>
</html>

And we can finish off by redirecting that output into a file:

$ python blog.py  > blog.html

We could also write to the file from Python but this seems just as easy!
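For completeness, the write-it-from-Python version is only a couple of lines different – a sketch of the same script with the print swapped for a file write:

if __name__ == "__main__":
    with app.app_context():
        rendered = render_template('blog.html', \
            title = "My Generated Page", \
            people = [{"name": "Mark"}, {"name": "Michael"}])
        # write the rendered template straight to disk instead of printing it
        with open("blog.html", "w") as html_file:
            html_file.write(rendered)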

The post Python: Flask – Generating a static HTML page appeared first on Mark Needham.

Categories: Programming

AWS Lambda: Programmatically scheduling a CloudWatchEvent

Thu, 04/06/2017 - 00:49

I recently wrote a blog post showing how to create a Python ‘Hello World’ AWS lambda function and manually invoke it, but what I really wanted to do was have it run automatically every hour.

To achieve that in AWS Lambda land we need to create a CloudWatch Event. The documentation describes them as follows:

Using simple rules that you can quickly set up, you can match events and route them to one or more target functions or streams.


This is actually really easy from the Amazon web console as you just need to click the ‘Triggers’ tab and then ‘Add trigger’. It’s not obvious that there are actually three steps involved as they’re abstracted away from you.

So what are the steps?

  1. Create rule
  2. Give permission for that rule to execute
  3. Map the rule to the function

I forgot to do step 2 initially, in which case you just end up with a rule that never triggers, which isn’t particularly useful.

The following code creates a ‘Hello World’ lambda function and runs it once an hour:

import boto3

lambda_client = boto3.client('lambda')
events_client = boto3.client('events')

fn_name = "HelloWorld"
fn_role = 'arn:aws:iam::[your-aws-id]:role/lambda_basic_execution'

fn_response = lambda_client.create_function(
    FunctionName=fn_name,
    Runtime='python2.7',
    Role=fn_role,
    Handler="{0}.lambda_handler".format(fn_name),
    Code={'ZipFile': open("{0}.zip".format(fn_name), 'rb').read(), },
)

fn_arn = fn_response['FunctionArn']
frequency = "rate(1 hour)"
name = "{0}-Trigger".format(fn_name)

rule_response = events_client.put_rule(
    Name=name,
    ScheduleExpression=frequency,
    State='ENABLED',
)

lambda_client.add_permission(
    FunctionName=fn_name,
    StatementId="{0}-Event".format(name),
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_response['RuleArn'],
)

events_client.put_targets(
    Rule=name,
    Targets=[
        {
            'Id': "1",
            'Arn': fn_arn,
        },
    ]
)

We can now check if our trigger has been configured correctly:

$ aws events list-rules --query "Rules[?Name=='HelloWorld-Trigger']"
[
    {
        "State": "ENABLED", 
        "ScheduleExpression": "rate(1 hour)", 
        "Name": "HelloWorld-Trigger", 
        "Arn": "arn:aws:events:us-east-1:[your-aws-id]:rule/HelloWorld-Trigger"
    }
]

$ aws events list-targets-by-rule --rule HelloWorld-Trigger
{
    "Targets": [
        {
            "Id": "1", 
            "Arn": "arn:aws:lambda:us-east-1:[your-aws-id]:function:HelloWorld"
        }
    ]
}

$ aws lambda get-policy --function-name HelloWorld
{
    "Policy": "{\"Version\":\"2012-10-17\",\"Id\":\"default\",\"Statement\":[{\"Sid\":\"HelloWorld-Trigger-Event\",\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"events.amazonaws.com\"},\"Action\":\"lambda:InvokeFunction\",\"Resource\":\"arn:aws:lambda:us-east-1:[your-aws-id]:function:HelloWorld\",\"Condition\":{\"ArnLike\":{\"AWS:SourceArn\":\"arn:aws:events:us-east-1:[your-aws-id]:rule/HelloWorld-Trigger\"}}}]}"
}

All looks good so we’re done!
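If you want to tear everything down again, the same boto3 clients can do that too. A sketch (not from the original post) that removes the target, the rule, and the function:

import boto3

lambda_client = boto3.client('lambda')
events_client = boto3.client('events')

name = "HelloWorld-Trigger"

# targets have to be removed before the rule itself can be deleted
events_client.remove_targets(Rule=name, Ids=["1"])
events_client.delete_rule(Name=name)
lambda_client.delete_function(FunctionName="HelloWorld")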

The post AWS Lambda: Programmatically scheduling a CloudWatchEvent appeared first on Mark Needham.

Categories: Programming

AWS Lambda: Encrypted environment variables

Mon, 04/03/2017 - 06:49

Continuing on from my post showing how to create a ‘Hello World’ AWS lambda function I wanted to pass encrypted environment variables to my function.

The following function takes in both an encrypted and unencrypted variable and prints them out.

Don’t print out encrypted variables in a real function, this is just so we can see the example working!

import boto3
import os

from base64 import b64decode

def lambda_handler(event, context):
    encrypted = os.environ['ENCRYPTED_VALUE']
    decrypted = boto3.client('kms').decrypt(CiphertextBlob=b64decode(encrypted))['Plaintext']

    # Don't print out your decrypted value in a real function! This is just to show how it works.
    print("Decrypted value:", decrypted)

    plain_text = os.environ["PLAIN_TEXT_VALUE"]
    print("Plain text:", plain_text)

Now we’ll zip up our function into HelloWorldEncrypted.zip, ready to send to AWS.

zip HelloWorldEncrypted.zip HelloWorldEncrypted.py

Now it’s time to upload our function to AWS and create the associated environment variables.

If you’re using a Python editor then you’ll need to install boto3 locally to keep the editor happy but you don’t need to include boto3 in the code you send to AWS Lambda – it comes pre-installed.

Now we write the following code to automate the creation of our Lambda function:

import boto3
from base64 import b64encode

fn_name = "HelloWorldEncrypted"
kms_key = "arn:aws:kms:[aws-zone]:[your-aws-id]:key/[your-kms-key-id]"
fn_role = 'arn:aws:iam::[your-aws-id]:role/lambda_basic_execution'

lambda_client = boto3.client('lambda')
kms_client = boto3.client('kms')

encrypt_me = "abcdefg"
encrypted = b64encode(kms_client.encrypt(Plaintext=encrypt_me, KeyId=kms_key)["CiphertextBlob"])

plain_text = 'hijklmno'

lambda_client.create_function(
        FunctionName=fn_name,
        Runtime='python2.7',
        Role=fn_role,
        Handler="{0}.lambda_handler".format(fn_name),
        Code={ 'ZipFile': open("{0}.zip".format(fn_name), 'rb').read(),},
        Environment={
            'Variables': {
                'ENCRYPTED_VALUE': encrypted,
                'PLAIN_TEXT_VALUE': plain_text,
            }
        },
        KMSKeyArn=kms_key
)

The tricky bit for me here was figuring out that I needed to base 64 encode the output of the value encrypted by the KMS client before passing it in as an environment variable. The KMS client relies on a KMS key that we need to set up. We can see a list of all our KMS keys by running the following command:

$ aws kms list-keys

The format of these keys is arn:aws:kms:[zone]:[account-id]:key/[key-id].

Now let’s run that script to create our Lambda function:

$ python CreateHelloWorldEncrypted.py

Let’s check it got created:

$ aws lambda list-functions --query "Functions[*].FunctionName"
[
    "HelloWorldEncrypted", 
]

And now let’s execute the function:

$ aws lambda invoke --function-name HelloWorldEncrypted --invocation-type RequestResponse --log-type Tail /tmp/out | jq ".LogResult"
"U1RBUlQgUmVxdWVzdElkOiA5YmNlM2E1MC0xODMwLTExZTctYjFlNi1hZjQxZDYzMzYxZDkgVmVyc2lvbjogJExBVEVTVAooJ0RlY3J5cHRlZCB2YWx1ZTonLCAnYWJjZGVmZycpCignUGxhaW4gdGV4dDonLCAnaGlqa2xtbm8nKQpFTkQgUmVxdWVzdElkOiA5YmNlM2E1MC0xODMwLTExZTctYjFlNi1hZjQxZDYzMzYxZDkKUkVQT1JUIFJlcXVlc3RJZDogOWJjZTNhNTAtMTgzMC0xMWU3LWIxZTYtYWY0MWQ2MzM2MWQ5CUR1cmF0aW9uOiAzNjAuMDQgbXMJQmlsbGVkIER1cmF0aW9uOiA0MDAgbXMgCU1lbW9yeSBTaXplOiAxMjggTUIJTWF4IE1lbW9yeSBVc2VkOiAyNCBNQgkK"

That’s a bit hard to read, some decoding is needed:

$ echo "U1RBUlQgUmVxdWVzdElkOiA5YmNlM2E1MC0xODMwLTExZTctYjFlNi1hZjQxZDYzMzYxZDkgVmVyc2lvbjogJExBVEVTVAooJ0RlY3J5cHRlZCB2YWx1ZTonLCAnYWJjZGVmZycpCignUGxhaW4gdGV4dDonLCAnaGlqa2xtbm8nKQpFTkQgUmVxdWVzdElkOiA5YmNlM2E1MC0xODMwLTExZTctYjFlNi1hZjQxZDYzMzYxZDkKUkVQT1JUIFJlcXVlc3RJZDogOWJjZTNhNTAtMTgzMC0xMWU3LWIxZTYtYWY0MWQ2MzM2MWQ5CUR1cmF0aW9uOiAzNjAuMDQgbXMJQmlsbGVkIER1cmF0aW9uOiA0MDAgbXMgCU1lbW9yeSBTaXplOiAxMjggTUIJTWF4IE1lbW9yeSBVc2VkOiAyNCBNQgkK" | base64 --decode
START RequestId: 9bce3a50-1830-11e7-b1e6-af41d63361d9 Version: $LATEST
('Decrypted value:', 'abcdefg')
('Plain text:', 'hijklmno')
END RequestId: 9bce3a50-1830-11e7-b1e6-af41d63361d9
REPORT RequestId: 9bce3a50-1830-11e7-b1e6-af41d63361d9	Duration: 360.04 ms	Billed Duration: 400 ms 	Memory Size: 128 MB	Max Memory Used: 24 MB	

And it worked, hoorah!

The post AWS Lambda: Encrypted environment variables appeared first on Mark Needham.

Categories: Programming

AWS Lambda: Programmatically create a Python ‘Hello World’ function

Sun, 04/02/2017 - 23:11

I’ve been playing around with AWS Lambda over the last couple of weeks and I wanted to automate the creation of these functions and all their surrounding config.

Let’s say we have the following Hello World function:

def lambda_handler(event, context):
    print("Hello world")

To upload it to AWS we need to put it inside a zip file so let’s do that:

$ zip HelloWorld.zip HelloWorld.py
$ unzip -l HelloWorld.zip 
Archive:  HelloWorld.zip
  Length     Date   Time    Name
 --------    ----   ----    ----
       61  04-02-17 22:04   HelloWorld.py
 --------                   -------
       61                   1 file

Now we’re ready to write a script to create our AWS lambda function.

import boto3

lambda_client = boto3.client('lambda')

fn_name = "HelloWorld"
fn_role = 'arn:aws:iam::[your-aws-id]:role/lambda_basic_execution'

lambda_client.create_function(
    FunctionName=fn_name,
    Runtime='python2.7',
    Role=fn_role,
    Handler="{0}.lambda_handler".format(fn_name),
    Code={'ZipFile': open("{0}.zip".format(fn_name), 'rb').read(), },
)

[your-aws-id] needs to be replaced with the identifier of our AWS account. We can find that out by running the following command with the AWS CLI:

$ aws ec2 describe-security-groups --query 'SecurityGroups[0].OwnerId' --output text
123456789012
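If you’d rather stay in Python, boto3 can tell us the same thing via STS (an aside, not from the original post):

import boto3

# returns the 12 digit account id for the current credentials
account_id = boto3.client('sts').get_caller_identity()['Account']
print(account_id)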

Now we can create our function:

$ python CreateHelloWorld.py


And if we test the function we’ll see the expected ‘Hello world’ output.


The post AWS Lambda: Programmatically create a Python ‘Hello World’ function appeared first on Mark Needham.

Categories: Programming

My top 10 technology podcasts

Thu, 03/30/2017 - 23:38

For the last six months I’ve been listening to 2 or 3 technology podcasts every day while out running and on my commute and I thought it’d be cool to share some of my favourites.

I listen to all of these on the Podbean android app which seems pretty good. It can’t read the RSS feeds of some podcasts but other than that it’s worked well.

Anyway, on with the podcasts:

Software Engineering Daily

This is the most reliable of all the podcasts I’ve listened to and a new episode is posted every weekday.

It sweeps across lots of different areas of technology – there’s a bit of software development, a bit of data engineering, and a bit of infrastructure.

Every now and then there’s a focus on a particular topic area or company which I find really interesting e.g. in 2015 there was a week of Bitcoin focused episodes and more recently there’s been a bunch of episodes about Stripe.

Partially Derivative

This one is more of a data science podcast and covers lots of different areas in that space, but thankfully keeps the conversation at a level that a non data scientist like me can understand.

I especially liked the post US election episode where they talked about the problems with polling and how most election predictions had ended up being wrong.

There’s roughly one new episode a week.

O’Reilly Bots podcast

I didn’t know anything about bots before I listened to this podcast and it was quite addictive – I powered through all the episodes in a few weeks.

They cover all sorts of topics that I’d never have thought of – why have developers got interested in bots? How do bot UIs differ from ones in apps? How do users find out about bots?

I really enjoy listening to this one but it’s been a bit quiet recently.

Datanauts

I found this one really useful for getting the hang of infrastructure topics. I wanted to learn a bit more about Kubernetes a few months ago and they had an episode which gives an overview as well as more detailed episodes.

One neat feature of this podcast is that after each part of an interview the hosts summarise what they picked up from that segment. I like that it gives you a few seconds to think about what you picked up and whether it matches the summary.

Some of the episodes go really deep into specific infrastructure topics and I struggle to follow along but there are enough other ones to keep me happy.

Becoming a Data Scientist

This one mirrors the journey of Renee Teate getting into data science and bringing everyone along on the journey.

Each episode is paired with a learning exercise for the listener to try and, although I haven’t done any of the learning exercises yet, I like how some interviews are structured around them. e.g. Sebastian Raschka was interviewed about model accuracy in the week that topic was being explored in the learning club.

If you’re interested in data science topics but aren’t a data scientist yourself this is a good one to listen to.

This Week In Machine Learning and AI Podcast

This one mostly goes well over my head but it’s still interesting to listen to other people talk about stuff they’re working on.

There’s a lot of focus on Deep Learning so I think I need to learn a bit more about that and then the episodes will make more sense.

The last episode with Evan Wright was much more accessible. I need more like that one!

The Women in Tech Show

I came across Edaena Salinas on Software Engineering Daily and didn’t realise that Edaena had a podcast until a couple of weeks ago.

There’s lots of interesting content on this one. The episodes on data driven marketing and unconscious bias are my favourites of the ones I’ve listened to so far.

The Bitcoin Podcast

I listened to a few shows about bitcoin on Software Engineering Daily and found this podcast while trying to learn more.

Some of the episodes are general enough that I can follow along, but others use a lot of blockchain specific terminology that leaves me feeling a bit lost.

I especially liked the episode that featured Greg Walker of learnmeabitcoin fame. Greg uses Neo4j as part of the website and presented at the London Neo4j meetup earlier this week.

Go Time

This one has a chat based format that I really like. They have a cool section called ‘Free Software Friday’ at the end of each show where everybody calls out a piece of software or a maintainer that they’re grateful for.

I was playing around with Go in November/December last year so it was really helpful in pointing me in the right direction. I haven’t done anything with Go recently so it’s more of a general interest show for now.

Change Log

This one covers lots of different topics, mostly around different open source projects.

The really cool thing about this one is they get every guest to explain their ‘origin story’ i.e. how did they get into software and what was their path to the current job. The interview with Nathan Sobo about Atom was particularly good in this respect.

It’s always interesting to hear how other people got started and contrast it with my own experiences.

Another cool feature of this podcast is that they sometimes have episodes where they interview people at open source conferences.

That’s it folks

That’s all for now. Hopefully there’s one or more in there that you haven’t listened to before.

If you’ve got any suggestions for other ones I should listen to, let me know in the comments or send me a message on Twitter @markhneedham.

The post My top 10 technology podcasts appeared first on Mark Needham.

Categories: Programming

Luigi: Defining dynamic requirements (on output files)

Tue, 03/28/2017 - 06:39

In my last blog post I showed how to convert a JSON document containing meetup groups into a CSV file using Luigi, the Python library for building data pipelines. As well as creating that CSV file I wanted to go back to the meetup.com API and download all the members of those groups.

This was a rough flow of what I wanted to do:

  • Take JSON document containing all groups
  • Parse that document and for each group:
    • Call the /members endpoint
    • Save each one of those files as a JSON file
  • Iterate over all those JSON files and create a members CSV file

In the previous post we created the GroupsToJSON task which calls the /groups endpoint on the meetup API and creates the file /tmp/groups.json.

Our new task has that as its initial requirement:

class MembersToCSV(luigi.Task):
    key = luigi.Parameter()
    lat = luigi.Parameter()
    lon = luigi.Parameter()

    def requires(self):
        yield GroupsToJSON(self.key, self.lat, self.lon)

But we also want to create a requirement on a task that will make those calls to the /members endpoint and store the result in a JSON file.

One of the patterns that Luigi imposes on us is that each task should only create one file so actually we have a requirement on a collection of tasks rather than just one. It took me a little while to get my head around that!

We don’t know the parameters of those tasks at compile time – we can only calculate them by parsing the JSON file produced by GroupsToJSON.

In Luigi terminology what we want to create is a dynamic requirement. A dynamic requirement is defined inside the run method of a task and can rely on the output of any tasks specified in the requires method, which is exactly what we need.

This code does the delegating part of the job:

class MembersToCSV(luigi.Task):
    key = luigi.Parameter()
    lat = luigi.Parameter()
    lon = luigi.Parameter()


    def run(self):
        outputs = []
        for input in self.input():
            with input.open('r') as group_file:
                groups_json = json.load(group_file)
                groups = [str(group['id']) for group in groups_json]


                for group_id in groups:
                    members = MembersToJSON(group_id, self.key)
                    outputs.append(members.output().path)
                    yield members


    def requires(self):
        yield GroupsToJSON(self.key, self.lat, self.lon)

Inside our run method we iterate over the output of GroupsToJSON (which is our input) and yield another task, as well as collecting its outputs in the outputs array that we’ll use later.

MembersToJSON looks like this:

class MembersToJSON(luigi.Task):
    group_id = luigi.IntParameter()
    key = luigi.Parameter()


    def run(self):
        results = []
        uri = "https://api.meetup.com/2/members?&group_id={0}&key={1}".format(self.group_id, self.key)
        while True:
            if uri is None:
                break
            r = requests.get(uri)
            response = r.json()
            for result in response["results"]:
                results.append(result)
            uri = response["meta"]["next"] if response["meta"]["next"] else None


        with self.output().open("w") as output:
            json.dump(results, output)

    def output(self):
        return luigi.LocalTarget("/tmp/members/{0}.json".format(self.group_id))

This task generates one file per group containing a list of all the members of that group.

We can now go back to MembersToCSV and convert those JSON files into a single CSV file:

class MembersToCSV(luigi.Task):
    out_path = "/tmp/members.csv"
    key = luigi.Parameter()
    lat = luigi.Parameter()
    lon = luigi.Parameter()


    def run(self):
        outputs = []
        for input in self.input():
            with input.open('r') as group_file:
                groups_json = json.load(group_file)
                groups = [str(group['id']) for group in groups_json]


                for group_id in groups:
                    members = MembersToJSON(group_id, self.key)
                    outputs.append(members.output().path)
                    yield members

        with self.output().open("w") as output:
            writer = csv.writer(output, delimiter=",")
            writer.writerow(["id", "name", "joined", "topics", "groupId"])

            for path in outputs:
                group_id = path.split("/")[-1].replace(".json", "")
                with open(path) as json_data:
                    d = json.load(json_data)
                    for member in d:
                        topic_ids = ";".join([str(topic["id"]) for topic in member["topics"]])
                        if "name" in member:
                            writer.writerow([member["id"], member["name"], member["joined"], topic_ids, group_id])

    def output(self):
        return luigi.LocalTarget(self.out_path)

    def requires(self):
        yield GroupsToJSON(self.key, self.lat, self.lon)

We then just need to add our new task as a requirement of the wrapper task, which (sketching from the Meetup wrapper task in the previous post) ends up looking something like this:
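import os

class Meetup(luigi.WrapperTask):
    def run(self):
        print("Running Meetup")

    def requires(self):
        key = os.environ['MEETUP_API_KEY']
        lat = os.getenv('LAT', "51.5072")
        lon = os.getenv('LON', "0.1275")

        # sketch: the wrapper task from the previous post with MembersToCSV added as a requirement
        yield GroupsToCSV(key, lat, lon)
        yield MembersToCSV(key, lat, lon)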

And we’re ready to roll:

$ PYTHONPATH="." luigi --module blog --local-scheduler Meetup --workers 3

We’ve defined the number of workers here as we can execute those calls to the /members endpoint in parallel and there are ~ 600 calls to make.

All the code from both blog posts is available as a gist if you want to play around with it.

Any questions/advice let me know in the comments or I’m @markhneedham on twitter.

The post Luigi: Defining dynamic requirements (on output files) appeared first on Mark Needham.

Categories: Programming

Luigi: An ExternalProgramTask example – Converting JSON to CSV

Sat, 03/25/2017 - 15:09

I’ve been playing around with the Python library Luigi which is used to build pipelines of batch jobs and I struggled to find an example of an ExternalProgramTask so this is my attempt at filling that void.

Luigi - the Python data library for building data science pipelines

I’m building a little data pipeline to get data from the meetup.com API and put it into CSV files that can be loaded into Neo4j using the LOAD CSV command.

The first task I created calls the /groups endpoint and saves the result into a JSON file:

import luigi
import requests
import json
from collections import Counter

class GroupsToJSON(luigi.Task):
    key = luigi.Parameter()
    lat = luigi.Parameter()
    lon = luigi.Parameter()

    def run(self):
        seed_topic = "nosql"
        uri = "https://api.meetup.com/2/groups?&topic={0}&lat={1}&lon={2}&key={3}".format(seed_topic, self.lat, self.lon, self.key)

        r = requests.get(uri)
        all_topics = [topic["urlkey"]  for result in r.json()["results"] for topic in result["topics"]]
        c = Counter(all_topics)

        topics = [entry[0] for entry in c.most_common(10)]

        groups = {}
        for topic in topics:
            uri = "https://api.meetup.com/2/groups?&topic={0}&lat={1}&lon={2}&key={3}".format(topic, self.lat, self.lon, self.key)
            r = requests.get(uri)
            for group in r.json()["results"]:
                groups[group["id"]] = group

        with self.output().open('w') as groups_file:
            json.dump(list(groups.values()), groups_file, indent=4, sort_keys=True)

    def output(self):
        return luigi.LocalTarget("/tmp/groups.json")

We define a few parameters at the top of the class which will be passed in when this task is executed. The most interesting lines of the run function are the last couple where we write the JSON to a file. self.output() refers to the target defined in the output function which in this case is /tmp/groups.json.

Now we need to create a task to convert that JSON file into CSV format. The jq command line tool does this job well so we’ll use that. The following task does the job:

from luigi.contrib.external_program import ExternalProgramTask

class GroupsToCSV(luigi.contrib.external_program.ExternalProgramTask):
    file_path = "/tmp/groups.csv"
    key = luigi.Parameter()
    lat = luigi.Parameter()
    lon = luigi.Parameter()

    def program_args(self):
        return ["./groups.sh", self.input()[0].path, self.output().path]

    def output(self):
        return luigi.LocalTarget(self.file_path)

    def requires(self):
        yield GroupsToJSON(self.key, self.lat, self.lon)

groups.sh

#!/bin/bash

in=${1}
out=${2}

echo "id,name,urlname,link,rating,created,description,organiserName,organiserMemberId" > ${out}
jq -r '.[] | [.id, .name, .urlname, .link, .rating, .created, .description, .organizer.name, .organizer.member_id] | @csv' ${in} >> ${out}

I wanted to call jq directly from the Python code but I couldn’t figure out how to do it so putting that code in a shell script is my workaround.
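For what it’s worth, one way of avoiding the shell script (an aside and only a sketch, not what the post uses) is to shell out to jq with subprocess from a plain luigi.Task and write the header line from Python:

import subprocess

class GroupsToCSVInline(luigi.Task):
    file_path = "/tmp/groups.csv"
    key = luigi.Parameter()
    lat = luigi.Parameter()
    lon = luigi.Parameter()

    def run(self):
        jq_filter = ('.[] | [.id, .name, .urlname, .link, .rating, .created, '
                     '.description, .organizer.name, .organizer.member_id] | @csv')
        # run jq against the JSON file produced by GroupsToJSON
        rows = subprocess.check_output(["jq", "-r", jq_filter, self.input()[0].path])
        with self.output().open("w") as out:
            out.write("id,name,urlname,link,rating,created,description,organiserName,organiserMemberId\n")
            out.write(rows.decode("utf-8"))

    def output(self):
        return luigi.LocalTarget(self.file_path)

    def requires(self):
        yield GroupsToJSON(self.key, self.lat, self.lon)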

The last piece of the puzzle is a wrapper task that launches the others:

import os

class Meetup(luigi.WrapperTask):
    def run(self):
        print("Running Meetup")

    def requires(self):
        key = os.environ['MEETUP_API_KEY']
        lat = os.getenv('LAT', "51.5072")
        lon = os.getenv('LON', "0.1275")

        yield GroupsToCSV(key, lat, lon)

Now we’re ready to run the tasks:

$ PYTHONPATH="." luigi --module blog --local-scheduler Meetup
DEBUG: Checking if Meetup() is complete
DEBUG: Checking if GroupsToCSV(key=xxx, lat=51.5072, lon=0.1275) is complete
INFO: Informed scheduler that task   Meetup__99914b932b   has status   PENDING
DEBUG: Checking if GroupsToJSON(key=xxx, lat=51.5072, lon=0.1275) is complete
INFO: Informed scheduler that task   GroupsToCSV_xxx_51_5072_0_1275_e07372cebf   has status   PENDING
INFO: Informed scheduler that task   GroupsToJSON_xxx_51_5072_0_1275_e07372cebf   has status   PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 3
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) running   GroupsToJSON(key=xxx, lat=51.5072, lon=0.1275)
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) done      GroupsToJSON(key=xxx, lat=51.5072, lon=0.1275)
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   GroupsToJSON_xxx_51_5072_0_1275_e07372cebf   has status   DONE
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) running   GroupsToCSV(key=xxx, lat=51.5072, lon=0.1275)
INFO: Running command: ./groups.sh /tmp/groups.json /tmp/groups.csv
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) done      GroupsToCSV(key=xxx, lat=51.5072, lon=0.1275)
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   GroupsToCSV_xxx_51_5072_0_1275_e07372cebf   has status   DONE
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) running   Meetup()
Running Meetup
INFO: [pid 4452] Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) done      Meetup()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   Meetup__99914b932b   has status   DONE
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
INFO: Worker Worker(salt=970508581, workers=1, host=Marks-MBP-4, username=markneedham, pid=4452) was stopped. Shutting down Keep-Alive thread
INFO: 
===== Luigi Execution Summary =====

Scheduled 3 tasks of which:
* 3 ran successfully:
    - 1 GroupsToCSV(key=xxx, lat=51.5072, lon=0.1275)
    - 1 GroupsToJSON(key=xxx, lat=51.5072, lon=0.1275)
    - 1 Meetup()

This progress looks 🙂 because there were no failed tasks or missing external dependencies

===== Luigi Execution Summary =====

Looks good! Let’s quickly look at our CSV file:

$ head -n10 /tmp/groups.csv 
id,name,urlname,link,rating,created,description,organiserName,organiserMemberId
1114381,"London NoSQL, MySQL, Open Source Community","london-nosql-mysql","https://www.meetup.com/london-nosql-mysql/",4.28,1208505614000,"

Meet others in London interested in NoSQL, MySQL, and Open Source Databases.

","Sinead Lawless",185675230 1561841,"Enterprise Search London Meetup","es-london","https://www.meetup.com/es-london/",4.66,1259157419000,"

Enterprise Search London is a meetup for anyone interested in building search and discovery experiences – from intranet search and site search, to advanced discovery applications and beyond.

Disclaimer: This meetup is NOT about SEO or search engine marketing.

What people are saying:

  • ""Join this meetup if you have a passion for enterprise search and user experience that you would like to share with other able-minded practitioners."" β€” Vegard Sandvold
  • ""Full marks for vision and execution. Looking forward to the next Meetup."" β€” Martin White
  • β€œConsistently excellent” β€” Helen Lippell

Sweet! And what if we run it again?

$ PYTHONPATH="." luigi --module blog --local-scheduler Meetup
DEBUG: Checking if Meetup() is complete
INFO: Informed scheduler that task   Meetup__99914b932b   has status   DONE
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
INFO: Worker Worker(salt=172768377, workers=1, host=Marks-MBP-4, username=markneedham, pid=4531) was stopped. Shutting down Keep-Alive thread
INFO: 
===== Luigi Execution Summary =====

Scheduled 1 tasks of which:
* 1 present dependencies were encountered:
    - 1 Meetup()

Did not run any tasks
This progress looks 🙂 because there were no failed tasks or missing external dependencies

===== Luigi Execution Summary =====

As expected nothing happens since our dependencies are already satisfied and we have our first Luigi pipeline up and running.

The post Luigi: An ExternalProgramTask example – Converting JSON to CSV appeared first on Mark Needham.

Categories: Programming

Python 3: TypeError: Object of type ‘dict_values’ is not JSON serializable

Sun, 03/19/2017 - 17:40

I’ve recently upgraded to Python 3 (I know, took me a while!) and realised that one of my scripts that writes JSON to a file no longer works!

This is a simplified version of what I’m doing:

>>> import json
>>> x = {"mark": {"name": "Mark"}, "michael": {"name": "Michael"}  } 
>>> json.dumps(x.values())
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'dict_values' is not JSON serializable

Python 2.7 would be perfectly happy:

>>> json.dumps(x.values())
'[{"name": "Michael"}, {"name": "Mark"}]'

The difference is in the results returned by the values method:

# Python 2.7.10
>>> x.values()
[{'name': 'Michael'}, {'name': 'Mark'}]

# Python 3.6.0
>>> x.values()
dict_values([{'name': 'Mark'}, {'name': 'Michael'}])
>>> 

Python 3 no longer returns a list; instead we get a dict_values view of the data.

Luckily this is easy to resolve – we just need to wrap the call to values with a call to list:

>>> json.dumps(list(x.values()))
'[{"name": "Mark"}, {"name": "Michael"}]'

This version works with Python 2.7 as well, so if I accidentally run the script with an old version the world isn’t going to explode.
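As an aside (not from the original post), another option is to tell json.dumps how to handle the type it can’t serialise by passing a default function – list does the job here:

>>> json.dumps(x.values(), default=list)
'[{"name": "Mark"}, {"name": "Michael"}]'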

The post Python 3: TypeError: Object of type ‘dict_values’ is not JSON serializable appeared first on Mark Needham.

Categories: Programming

Neo4j: apoc.date.parse – java.lang.IllegalArgumentException: Illegal pattern character ‘T’ / java.text.ParseException: Unparseable date: “2012-11-12T08:46:15Z”

Mon, 03/06/2017 - 21:52

I often find myself wanting to convert date strings into Unix timestamps using Neo4j’s APOC library and unfortunately some sources don’t use the format that apoc.date.parse expects.

e.g.

return apoc.date.parse("2012-11-12T08:46:15Z",'s') 
AS ts

Failed to invoke function `apoc.date.parse`: 
Caused by: java.lang.IllegalArgumentException: java.text.ParseException: Unparseable date: "2012-11-12T08:46:15Z"

We need to define the format explicitly, so the SimpleDateFormat documentation comes in handy. I tried the following:

return apoc.date.parse("2012-11-12T08:46:15Z",'s',"yyyy-MM-ddTHH:mm:ssZ") 
AS ts

Failed to invoke function `apoc.date.parse`: 
Caused by: java.lang.IllegalArgumentException: Illegal pattern character 'T'

Hmmm, we need to quote the ‘T’ character – we can’t just include it in the pattern. Let’s try again:

return  apoc.date.parse("2012-11-12T08:46:15Z",'s',"yyyy-MM-dd'T'HH:mm:ssZ") 
AS ts

Failed to invoke function `apoc.date.parse`: 
Caused by: java.lang.IllegalArgumentException: java.text.ParseException: Unparseable date: "2012-11-12T08:46:15Z"

The problem now is that we haven’t quoted the ‘Z’ but the error doesn’t indicate that – not sure why!

We can either quote the ‘Z’:

return  apoc.date.parse("2012-11-12T08:46:15Z",'s',"yyyy-MM-dd'T'HH:mm:ss'Z'") 
AS ts

╒══════════╕
β”‚"ts"      β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•‘
β”‚1352709975β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Or we can match the timezone using ‘XXX’:

return  apoc.date.parse("2012-11-12T08:46:15Z",'s',"yyyy-MM-dd'T'HH:mm:ssXXX") 
AS ts

╒══════════╕
β”‚"ts"      β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•‘
β”‚1352709975β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The post Neo4j: apoc.date.parse – java.lang.IllegalArgumentException: Illegal pattern character ‘T’ / java.text.ParseException: Unparseable date: “2012-11-12T08:46:15Z” appeared first on Mark Needham.

Categories: Programming