Skip to content

Software Development Blogs: Programming, Software Testing, Agile Project Management

Methods & Tools

Subscribe to Methods & Tools
if you are not afraid to read more than one page to be a smarter software developer, software tester or project manager!

Agile Testing - Grig Gheorghiu
Syndicate content
Did anybody say webscale?
Updated: 4 hours 46 min ago

Dynamic DNS updates with nsupdate (new and improved!)

Wed, 12/17/2014 - 23:14
I blogged about this topic before. This post shows a slightly different way of using nsupdate remotely against a DNS server running BIND 9 in order to programatically update DNS records. The scenario I am describing here involves an Ubuntu 12.04 DNS server running BIND 9 and an Ubuntu 12.04 client running nsupdate against the DNS server.

1) Run ddns-confgen and specify /dev/urandom as the source of randomness and the name of the zone file you want to dynamically update via nsupdate:

$ ddns-confgen -r /dev/urandom -z

# To activate this key, place the following in named.conf, and
# in a separate keyfile on the system or systems from which nsupdate
# will be run:
key "" {
algorithm hmac-sha256;
secret "1D1niZqRvT8pNDgyrJcuCiykOQCHUL33k8ZYzmQYe/0=";

# Then, in the "zone" definition statement for "",
# place an "update-policy" statement like this one, adjusted as
# needed for your preferred permissions:
update-policy {
 grant zonesub ANY;

# After the keyfile has been placed, the following command will
# execute nsupdate using this key:
nsupdate -k <keyfile>

2) Follow the instructions in the output of ddns-keygen (above). I actually named the key just ddns-key, since I was going to use it for all the zones on my DNS server. So I added this stanza to /etc/bind/named.conf on the DNS server:

key "ddns-key" {
algorithm hmac-sha256;
secret "1D1niZqRvT8pNDgyrJcuCiykOQCHUL33k8ZYzmQYe/0=";

3) Allow updates when the key ddns-key is used. In my case, I added the allow-update line below to all zones that I wanted to dynamically update, not only to

zone "" {
        type master;
        file "/etc/bind/zones/";
allow-update { key "ddns-key"; };

At this point I also restarted the bind9 service on my DNS server.

4) On the client box, create a text file containing nsupdate commands to be sent to the DNS server. In the example below, I want to dynamically add both an A record and a reverse DNS PTR record:

$ cat update_dns1.txt
debug yes
update add 3600 A
update add 3600 PTR

Still on the client box, create a file containing the stanza with the DDNS key generated in step 1:

$ cat ddns-key.txt
key "ddns-key" {
algorithm hmac-sha256;
secret "Wxp1uJv3SHT+R9rx96o6342KKNnjW8hjJTyxK2HYufg=";

5) Run nsupdate and feed it both the update_dns1.txt file containing the commands, and the ddns-key.txt file:

$ nsupdate -k ddns-key.txt -v update_dns1.txt

You should see some fairly verbose output, since the command file specifies 'debug yes'. At the same time, tail /var/log/syslog on the DNS server and make sure there are no errors.

In my case, there were some hurdles I had to overcome on the DNS server. The first one was that apparmor was installed and it wasn't allowing the creation of the journal files used to keep track of DDNS records. I saw lines like these in /var/log/syslog:

Dec 16 11:22:59 dns1 kernel: [49671335.189689] type=1400 audit(1418757779.712:12): apparmor="DENIED" operation="mknod" parent=1 profile="/usr/sbin/named" name="/etc/bind/zones/" pid=31154 comm="named" requested_mask="c" denied_mask="c" fsuid=107 ouid=107
Dec 16 11:22:59 dns1 kernel: [49671335.306304] type=1400 audit(1418757779.828:13): apparmor="DENIED" operation="mknod" parent=1 profile="/usr/sbin/named" name="/etc/bind/zones/" pid=31153 comm="named" requested_mask="c" denied_mask="c" fsuid=107 ouid=107

To get past this issue, I disabled apparmor for named:

# ln -s /etc/apparmor.d/usr.sbin.named /etc/apparmor.d/disable/
# service apparmor restart

The next issue was an OS permission denied (nothing to do with apparmor) when trying to create the journal files in /etc/bind/zones:

Dec 16 11:30:54 dns1 named[32640]: /etc/bind/zones/ create: permission denied
Dec 16 11:30:54 dns named[32640]: /etc/bind/zones/ create: permission denied

I got past this issue by running

# chown -R bind:bind /etc/bind/zones

At this point everything worked as expected.

Service discovery with consul and consul-template

Tue, 11/18/2014 - 00:20
I talked in the past about an "Ops Design Pattern: local haproxy talking to service layer". I described how we used a local haproxy on pretty much all nodes at a given layer of our infrastructure (webapp, API, e-commerce) to talk to services offered by the layer below it. So each webapp server has a local haproxy that talks to all API nodes it sends requests to. Similarly, each API node has a local haproxy that talks to all e-commerce nodes it needs info from.

This seemed like a good idea at a time, but it turns out it has a couple of annoying drawbacks:
  • each local haproxy runs health checks against N nodes, so if you have M nodes running haproxy, each of the N nodes will receive M health checks; if M and N are large, then you have a health check storm on your hands
  • to take a node out of a cluster at any given layer, we tag it as 'inactive' in Chef, then run chef-client on all nodes that run haproxy and talk to the inactive node at layers above it; this gets old pretty fast, especially when you're doing anything that might conflict with Chef and that the chef-client run might overwrite (I know, I know, you're not supposed to do anything of that nature, but we are all human :-)
For the second point, we are experimenting with haproxyctl so that we don't have to run chef-client on every node running haproxy. But it still feels like a heavy-handed approach.
If I were to do this again (which I might), I would still have an haproxy instance in front of our webapp servers, but for communicating from one layer of services to another I would use a proper service discovery tool such as grampa Apache ZooKeeper or the newer kids on the block, etcd from CoreOS and consul from HashiCorp.

I settled on consul for now, so in this post I am going to show how you can use consul in conjunction with the recently released consul-template to discover services and to automate configuration changes. At the same time, I wanted to experiment a bit with Ansible as a configuration management tool. So the steps I'll describe were actually automated with Ansible, but I'll leave that for another blog post.

The scenario I am going to describe involves 2 haproxy instances, each pointing to 2 Wordpress servers running Apache, PHP and MySQL, with Varnish fronting the Wordpress application. One of the 2 Wordpress servers is considered primary as far as haproxy is concerned, and the other one is a backup server, which will only get requests if the primary server is down. All servers are running Ubuntu 12.04.

Install and run the consul agent on all nodes

The agent will start in server mode on the 2 haproxy nodes, and in agent mode on the 2 Wordpress nodes.

I first deployed consul to the 2 haproxy nodes. I used a modified version of the ansible-consul role from jivesoftware. The configuration file /etc/consul.cfg for the first server (lb1) is:

  "domain": "consul.",
  "data_dir": "/opt/consul/data",
  "log_level": "INFO",
  "node_name": "lb1",
  "server": true,
  "bind_addr": "",
  "datacenter": "us-west-1b",
  "bootstrap": true,
  "rejoin_after_leave": true

(and similar for lb2, with only node_name and bind_addr changed to lb2 and respectively)

The ansible-consul role also creates a consul user and group, and an upstart configuration file like this:

# cat /etc/init/consul.conf

# Consul Agent (Upstart unit)
description "Consul Agent"
start on (local-filesystems and net-device-up IFACE!=lo)
stop on runlevel [06]

exec sudo -u consul -g consul /opt/consul/bin/consul agent -config-dir /etc/consul.d -config-file=/etc/consul.conf >> /var/log/consul 2>&1
respawn limit 10 10
kill timeout 10

To start/stop consul, I use:

# start consul
# stop consul

Note that "server" is set to true and "bootstrap" is also set to true, which means that each consul server will be the leader of a cluster with 1 member, itself. To join the 2 servers into a consul cluster, I did the following:
  • join lb1 to lb2: on lb1 run consul join
  • tail /var/log/consul on lb1, note messages complaining about both consul servers (lb1 and lb2) running in bootstrap mode
  • stop consul on lb1: stop consul
  • edit /etc/consul.conf on lb1 and set  "bootstrap": false
  • start consul on lb1: start consul
  • tail /var/log/consul on both lb1 and lb2; it should show no more errors
  • run consul info on both lb1 and lb2; the output should show server=true on both nodes, but leader=true only on lb2
Next I ran the consul agent in regular non-server mode on the 2 Wordpress nodes. The configuration file /etc/consul.cfg on node wordpress1 was:
{  "domain": "consul.",  "data_dir": "/opt/consul/data",  "log_level": "INFO",  "node_name": "wordpress1",  "server": false,  "bind_addr": "",  "datacenter": "us-west-1b",  "rejoin_after_leave": true}
(and similar for wordpress2, with the node_name set to wordpress2 and bind_addr set to
After starting up the agents via upstart, I joined them to lb2 (although the could be joined to any of the existing members of the cluster). I ran this on both wordpress1 and wordpress2:
# consul join
At this point, running consul members on any of the 4 nodes should show all 4 members of the cluster:
Node          Address         Status  Type    Build  Protocollb1    alive   server  0.4.0  2wordpress2   alive   client  0.4.0  2lb2    alive   server  0.4.0  2wordpress1   alive   client  0.4.0  2
Install and run dnsmasq on all nodes
The ansible-consul role does this for you. Consul piggybacks on DNS resolution for service naming, and by default the domain names internal to Consul start with consul. In my case they are configured in consul.cfg via "domain": "consul."
The dnsmasq configuration file for consul is:
# cat /etc/dnsmasq.d/10-consul
This causes dnsmasq to provide DNS resolution for domain names starting with consul. by querying a DNS server on running on port 8600 (which is the port the local consul agent listens on to provide DNS resolution).
To start/stop dnsmasq, use: service dnsmasq start | stop.
Now that dnsmasq is running, you can look up names that end in .node.consul from any member node of the consul cluster (there are 4 member nodes in my cluster, 2 servers and 2 agents). For example, I ran this on lb2:
$ dig wordpress1.node.consul
; <<>> DiG 9.8.1-P1 <<>> wordpress1.node.consul;; global options: +cmd;; Got answer:;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2511;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:;wordpress1.node.consul. IN A
;; ANSWER SECTION:wordpress1.node.consul. 0 IN A
;; Query time: 1 msec;; SERVER:;; WHEN: Fri Nov 14 00:09:16 2014;; MSG SIZE  rcvd: 76
Configure services and checks on consul agent nodes

Internal DNS resolution within the .consul domain becomes even more useful when nodes define services and checks. For example, the 2 Wordpress nodes run varnish and apache (on port 80 and port 443) so we can define 3 services as JSON files in /etc/consul.d. On wordpress1, which is our active/primary node in haproxy, I defined these services:

$ cat http_service.json
    "service": {
        "name": "http",
        "tags": ["primary"],
        "check": {
                "id": "http_check",
                "name": "HTTP Health Check",
  "script": "curl -H '' http://localhost",
        "interval": "5s"

$ cat ssl_service.json
    "service": {
        "name": "ssl",
        "tags": ["primary"],
        "check": {
                "id": "ssl_check",
                "name": "SSL Health Check",
  "script": "curl -k -H '' https://localhost:443",
        "interval": "5s"

$ cat varnish_service.json
    "service": {
        "name": "varnish",
        "tags": ["primary"],
        "port":6081 ,
        "check": {
                "id": "varnish_check",
                "name": "Varnish Health Check",
  "script": "curl http://localhost:6081",
        "interval": "5s"

Each service we defined has a name, a port and a check with its own ID, name, script that runs whenever the check is executed, and an interval that specifies how often the check is run. In the examples above I specified simple curl commands against the ports that these services are running on. Note also that each service has a list of tags associated with it. In my case, the services on wordpress1 have the tag "primary". The services defined on wordpress2 are identical to the ones on wordpress1 with the only difference being the tag, which on wordpress2 is "backup".

After restarting consul on wordpress1 and wordpress2, the following service-related DNS names are available for resolution on all nodes in the consul cluster (I am going to include only relevant portions of the dig output):

$ dig varnish.service.consul

varnish.service.consul. 0 IN A
varnish.service.consul. 0 IN A

This name resolves in DNS round-robin fashion to the IP addresses of all nodes that are running the varnish service, regardless of their tags and regardless of the data centers that their nodes run in. In our case, it resolves to the IP addresses of wordpress1 and wordpress2.

Note that the IP address of a given node only appears in the DNS result set if the service running on that node has a healty check. If the check fails, then consul's DNS service will not include the IP of the node in the result set. This is very important for the dynamic discovery of healthy services.

$ dig


If we include the data center (in our case us-west-1b) in the DNS name we query, then only the services running on nodes in that data center will be returned in the result set. In our case though, all nodes run in the us-west-1b data center, so this query returns, like the previous one, the IP addresses of wordpress1 and wordpress2. Note that the IPs can be returned in any order, because of DNS round-robin. In this case the IP of wordpress2 was first.

$ dig SRV varnish.service.consul

varnish.service.consul. 0 IN SRV 1 1 6081
varnish.service.consul. 0 IN SRV 1 1 6081


A useful feature of the consul DNS service is that it returns the port number that a given service runs on when queried for an SRV record. So this query returns the names and IPs of the nodes that the varnish service runs on, as well as the port number, which in this case is 6081. The application querying for the SRV record needs to interpret this extra piece of information, but this is very useful for the discovery of internal services that might run on non-standard port numbers.

$ dig primary.varnish.service.consul

primary.varnish.service.consul. 0 IN A

$ dig backup.varnish.service.consul

backup.varnish.service.consul. 0 IN A

The 2 DNS queries above show that it's possible to query a service by its tag, in our case 'primary' vs. 'backup'. The result set will contain the IP addresses of the nodes tagged with the specific tag and running the specific service we asked for. This feature will prove useful when dealing with consul-template in haproxy, as I'll show later in this post.

Load balance across services

It's easy now to see how an application can take advantage of the internal DNS service provided by consul and load balance across services. For example, an application that needs to load balance across the 2 varnish services on wordpress1 and wordpress2 would use varnish.service.consul as the DNS name it talks to when it needs to hit varnish. Every time this DNS name is resolved, a random node from wordpress1 and wordpress2 is returned via the DNS round-robin mechanism. If varnish were to run on a non-standard port number, the application would need to issue a DNS request for the SRV record in order to obtain the port number as well as the IP address to hit.

Note that this method of load balancing has health checks built in. If the varnish health check fails on one of the nodes providing the varnish service, that node's IP address will not be included in the DNS result set returned by the DNS query for that service.

Also note that the DNS query can be customized for the needs of the application, which can query for a specific data center, or a specific tag, as I showed in the examples above.

Force a node out of service

I am still looking for the best way to take nodes in and out of service for maintenance or other purposes. One way I found so far is to deregister a given service via the Consul HTTP API. Here is an example of a curl command that accomplishes that, executed on node wordpress1:

$ curl -v http://localhost:8500/v1/agent/service/deregister/varnish
* About to connect() to localhost port 8500 (#0)
*   Trying connected
> GET /v1/agent/service/deregister/varnish HTTP/1.1
> User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/ libidn/1.23 librtmp/2.3
> Host: localhost:8500
> Accept: */*
< HTTP/1.1 200 OK
< Date: Mon, 17 Nov 2014 19:01:06 GMT
< Content-Length: 0
< Content-Type: text/plain; charset=utf-8
* Connection #0 to host localhost left intact
* Closing connection #0

The effect of this command is that the varnish service on node wordpress1 is 'deregistered', which for my purposes means 'marked as down'. DNS queries for varnish.service.consul will only return the IP address of wordpress2:

$ dig varnish.service.consul

varnish.service.consul. 0 IN A

We can also use the Consul HTTP API to verify that the varnish service does not appear in the list of active services on node wordpress1. We'll use the /agent/services API call and we'll save the output to a file called services.out, then we'll use the jq tool to pretty-print the output:

$ curl -v http://localhost:8500/v1/agent/services -o services.out
$ jq . <<< `cat services.out`{  "http": {    "ID": "http",    "Service": "http",    "Tags": [      "primary"    ],    "Port": 80  },  "ssl": {    "ID": "ssl",    "Service": "ssl",    "Tags": [      "primary"    ],    "Port": 443  }}
Note that only the http and ssl services are shown.
Force a node back in service

Again, I am still looking for the best way to mark as service as 'up' once it was marked as 'down'. One way would be to register the service via the Consul HTTP API, and that requires issuing a POST request with the payload being the JSON configuration file for that service. Another way is to just restart the consul agent on the node in question. This will register the service that had been deregistered previously.

Install and configure consul-template

For the next few steps, I am going to show how to use consul-template in conjuction with consul for discovering services and configuring haproxy based on the discovered services.

I automated the installation and configuration of consul-template via an Ansible role that I put on Github, but I am going to discuss the main steps here. See also the instructions on the consul-template Github page.

In my Ansible role, I copy the consul-template binary to the target node (in my case the 2 haproxy nodes lb1 and lb2), then create a directory structure /opt/consul-template/{bin,config,templates}. The consul-template configuration file is /opt/consul-template/config/consul-template.cfg and it looks like this in my case:

$ cat config/consul-template.cfg
consul = ""

template {
  source = "/opt/consul-template/templates/haproxy.ctmpl"
  destination = "/etc/haproxy/haproxy.cfg"
  command = "service haproxy restart"

Note that consul-template needs to be able to talk a consul agent, which in my case is the local agent listening on port 8500. The template that consul-template maintains is defined in another file,  /opt/consul-template/templates/haproxy.ctmpl. What consul-template does is monitor changes to that file via changes to the services referenced in the file. Upon any such change, consul-template will generate a new target file based on the template and copy it to the destination file, which in my case is the haproxy config file /etc/haproxy/haproxy.cfg. Finally, consul-template will executed a command, which in my case is the restarting of the haproxy service.

Here is the actual template file for my haproxy config, which is written in the Go template format:

$ cat /opt/consul-template/templates/haproxy.ctmpl

  log   local0
  maxconn 4096
  user haproxy
  group haproxy

  log     global
  mode    http
  option  dontlognull
  retries 3
  option redispatch
  timeout connect 5s
  timeout client 50s
  timeout server 50s
  balance  roundrobin

# Set up application listeners here.

frontend http
  maxconn {{key "service/haproxy/maxconn"}}
  default_backend servers-http-varnish

backend servers-http-varnish
  balance            roundrobin
  option httpchk GET /
  option  httplog
{{range service "primary.varnish"}}
    server {{.Node}} {{.Address}}:{{.Port}} weight 1 check port {{.Port}}
{{range service "backup.varnish"}}
    server {{.Node}} {{.Address}}:{{.Port}} backup weight 1 check port {{.Port}}

frontend https
  maxconn            {{key "service/haproxy/maxconn"}}
  mode               tcp
  default_backend    servers-https

backend servers-https
  mode               tcp
  option             tcplog
  balance            roundrobin
{{range service "primary.ssl"}}
    server {{.Node}} {{.Address}}:{{.Port}} weight 1 check port {{.Port}}
{{range service "backup.ssl"}}
    server {{.Node}} {{.Address}}:{{.Port}} backup weight 1 check port {{.Port}}

To the trained eye, this looks like a regular haproxy configuration file, with the exception of the portions bolded above. These are Go template snippets which rely on a couple of template functions exposed by consul-template above and beyond what the Go templating language offers. Specifically, the key function queries a key stored in the Consul key/value store and outputs the value associated with that key (or an empty string if the value doesn't exist). The service function queries a consul service by its DNS name and returns a result set used inside the range statement. The variables inside the result set can be inspected for properties such as Node, Address and Port, which correspond to the Consul service node name, IP address and port number for that particular service.

In my example above, I use the value of the key service/haproxy/maxconn as the value of maxconn. In the http-varnish backend, I used 2 sets of services names, primary.varnish and backup.varnish, because I wanted to differentiate in haproxy.cfg between the primary server (wordpress1 in my case) and the backup server (wordpress2). In the ssl backend, I did the same but with the ssl service.
Everything so far would work fine with the exception of the key/value pair represented by the key service/haproxy/maxconn. To define that pair, I used the Consul key/value store API (this can be run on any member of the Consul cluster):
$ cat!/bin/bash
curl -X PUT -d "$MAXCONN" http://localhost:8500/v1/kv/service/haproxy/maxconn
To verify that the value was set, I used:
$ cat!/bin/bash
curl -v http://localhost:8500/v1/kv/?recurse
$ ./* About to connect() to localhost port 8500 (#0)*   Trying connected> GET /v1/kv/?recurse HTTP/1.1> User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/ libidn/1.23 librtmp/2.3> Host: localhost:8500> Accept: */*>< HTTP/1.1 200 OK< Content-Type: application/json< X-Consul-Index: 30563< X-Consul-Knownleader: true< X-Consul-Lastcontact: 0< Date: Mon, 17 Nov 2014 23:01:07 GMT< Content-Length: 118<* Connection #0 to host localhost left intact* Closing connection #0[{"CreateIndex":10995,"ModifyIndex":30563,"LockIndex":0,"Key":"service/haproxy/maxconn","Flags":0,"Value":"NDAwMA=="}]
At this point, everything is ready for starting up the consul-template service (in Ubuntu), I did it via this Upstart configuration file:
# cat /etc/init/consul-template.conf# Consul Template (Upstart unit)description "Consul Template"start on (local-filesystems and net-device-up IFACE!=lo)stop on runlevel [06]
exec /opt/consul-template/bin/consul-template  -config=/opt/consul-template/config/consul-template.cfg >> /var/log/consul-template 2>&1
respawnrespawn limit 10 10kill timeout 10
# start consul-template
Once consul-template starts, it will peform the actions corresponding to the functions defined in the template file /opt/consul-template/templates/haproxy.ctmpl. In my case, it will query Consul for the value of the key service/haproxy/maxconn and for information about the 2 Consul services varnish.service and ssl.service. It will then save the generated file to /etc/haproxy/haproxy.cfg and it will restart the haproxy service. The relevant snippets from haproxy.cfg are:
frontend http  maxconn 4000  bind  default_backend servers-http
backend servers-http  balance            roundrobin  option httpchk GET /  option  httplog
    server wordpress1 weight 1 check port 6081

    server wordpress2 backup weight 1 check port 6081
frontend https  maxconn            4000  mode               tcp  bind       default_backend    servers-https
backend servers-https  mode               tcp  option             tcplog  balance            roundrobin
    server wordpress1 weight 1 check port 443

    server wordpress2 backup weight 1 check port 443
I've been running this as a test on lb2. I don't consider my setup quite production-ready because I don't have monitoring in place, and I also want to experiment with consul security tokens for better security. But this is a pattern that I think will work.

Testing CDN and geolocation with

Wed, 10/15/2014 - 19:31
Assume you want to migrate to a new CDN provider. Eventually you'll have to point as a CNAME to a domain name handled by the CDN provider, let's call it To test this setup before you put it in production, the usual way is to get an IP address corresponding to, then associate with that IP address in your local /etc/hosts file.

This works well for testing most of the functionality of your web site, but it doesn't work when you want to test geolocation-specific features such as displaying the currency based on the users's country of origin. For this, you can use a nifty feature from the amazing free service WebPageTest.

On the main page of WebPageTest, you can specify the test location from the dropdown. It contains a generous list of locations across the globe. To fake your DNS setting and point, you can specify something like this in the Script tab:

setDNSName example.cdnprovider.comnavigate
This will effectively associate the page you want to test with the CDN provider-specified URL, so you will hit the CDN first from the location you chose.

Watch the open files limit when running Riak

Mon, 10/13/2014 - 17:53
I was close to expressing my unbridled joy at how little hand-holding our Riak cluster needs, when we started to see strange increased latencies when hitting the cluster, on calls that should have been very fast. Also, the health of the Riak nodes seems fine in terms of CPU, memory and disk. As usual, our good old friend the error log file pointed us towards the solution. We saw entries like this in /var/log/riak/error.log:

2014-10-11 03:22:40.565 UTC [error] <0.12830.4607> CRASH REPORT Process <0.12830.4607> with 0 neighbours exited with reason: {error,accept_failed} in mochiweb_acceptor:init/3 line 342014-10-11 03:22:40.619 UTC [error] <0.168.0> {mochiweb_socket_server,310,{acceptor_error,{error,accept_failed}}}2014-10-11 03:22:40.619 UTC [error] <0.12831.4607> application: mochiweb, "Accept failed error", "{error,emfile}"
A google search revealed that a possible cause of these errors is the dreaded open file descriptor limit, which is 1024 by default in Ubuntu.

To be perfectly honest, we hadn't done almost any tuning on our Riak cluster, because it had been running so smoothly. But recently we started to throw more traffic at it, hence issues with open file descriptors made sense. To fix it, we followed the advice in this Riak doc and created /etc/default/riak with the contents:
ulimit -n 65536
We also took the opportunity to apply the networking-related kernel tuning recommendations from this other Riak tuning doc and added these lines to /etc/sysctl.conf:
net.ipv4.tcp_max_syn_backlog = 40000net.core.somaxconn=4000net.ipv4.tcp_timestamps = 0net.ipv4.tcp_sack = 1net.ipv4.tcp_window_scaling = 1net.ipv4.tcp_fin_timeout = 15net.ipv4.tcp_keepalive_intvl = 30net.ipv4.tcp_tw_reuse = 1
Then we ran sysctl -p to update the above values in the kernel. Finally we restarted our Riak nodes one at a time.
I am happy to report that ever since, we've had absolutely no issues with our Riak cluster.  I should also say we are running Riak 1.3, and I understand that Riak 2.0 has better tests in place for avoiding this issue.

I do want to give kudos to Basho for an amazingly robust piece of technology, whose only fault is that it gets you into the habit of ignoring it because it just works!

A quick note on haproxy acl rules

Thu, 10/02/2014 - 20:29
I blogged in the past about haproxy acl rules we used for geolocation detection purposes. In that post, I referenced acl conditions that were met when traffic was coming from a non-US IP address. In that case, we were using a different haproxy backend. We had an issue recently when trying to introduce yet another backend for a given country. We added these acl conditions:

       acl acl_geoloc_akamai_true_client_ip_some_country req.hdr(X-Country-Akamai) -m str -i SOME_COUNTRY_CODE
       acl acl_geoloc_src_some_country req.hdr(X-Country-Src) -m str -i SOME_COUNTRY_CODE

We also added this use_backend rule:

      use_backend www_some_country-backend if acl_akamai_true_client_ip_header_exists acl_geoloc_akamai_true_client_ip_some_country or acl_geoloc_src_some_country

However, the backend www_some_country-backend was never chosen by haproxy, even though we could see traffic coming from IP address from SOME_COUNTRY_CODE.

The cause of this issue was that another use_backend rule (for non-US traffic) was firing before the new rule we added. I believe this is because this rule is more generic:

       use_backend www_row-backend if acl_akamai_true_client_ip_header_exists !acl_geoloc_akamai_true_client_ip_us or !acl_geoloc_src_us
The solution was to modify the use_backend rule for non-US traffic to fire only when the SOME_COUNTRY acl condition isn't met:
       use_backend www_row-backend if acl_akamai_true_client_ip_header_exists !acl_geoloc_akamai_true_client_ip_us !acl_geoloc_akamai_true_client_ip_some_country or !acl_geoloc_src_us !acl_geoloc_src_some_country
Maybe another solution would be to change the order of acls and use_backend rules. I couldn't find any good documentation on how this order affects what gets triggered when.

Booting a Raspberry Pi B+ with the Raspbian Debian Wheezy image

Thu, 09/11/2014 - 01:32
It took me a while to boot my brand new Raspberry Pi B+ with a usable Linux image. I chose the Raspbian Debian Wheezy image available on the downloads page of the official site. Here are the steps I needed:

1) Bought micro SD card. Note DO NOT get a regular SD card for the B+ because it will not fit in the SD card slot. You need a micro SD card.

2) Inserted the SD card via an SD USB adaptor in my MacBook Pro.

3) Went to the command line and ran df to see which volume the SD card was mounted as. In my case, it was /dev/disk1s1.

4) Unmounted the SD card. I initially tried 'sudo umount /dev/disk1s1' but the system told me to use 'diskutil unmount', so the command that worked for me was:

diskutil unmount /dev/disk1s1

5) Used dd to copy the Raspbian Debian Wheezy image (which I previously downloaded) per these instructions. Important note: the target of the dd command is /dev/disk1 and NOT /dev/disk1s1. I tried initially with the latter, and the Raspberry Pi wouldn't boot (one of the symptoms that something was wrong other than the fact that nothing appeared on the monitor, was that the green light was solid and not flashing; a google search revealed that one possible cause for that was a problem with the SD card). The dd command I used was:

dd if=2014-06-20-wheezy-raspbian.img of=/dev/disk1 bs=1m

6) At this point, I inserted the micro SD card into the SD slot on the Raspberry Pi, then connected the Pi to a USB power cable, a monitor via an HDMI cable, a USB keyboard and a USB mouse. I was able to boot and change the password for the pi user. The sky is the limit next ;-)