Skip to content

Software Development Blogs: Programming, Software Testing, Agile Project Management

Methods & Tools

Subscribe to Methods & Tools
if you are not afraid to read more than one page to be a smarter software developer, software tester or project manager!


Testing CDN and geolocation with

Agile Testing - Grig Gheorghiu - Wed, 10/15/2014 - 19:31
Assume you want to migrate to a new CDN provider. Eventually you'll have to point as a CNAME to a domain name handled by the CDN provider, let's call it To test this setup before you put it in production, the usual way is to get an IP address corresponding to, then associate with that IP address in your local /etc/hosts file.

This works well for testing most of the functionality of your web site, but it doesn't work when you want to test geolocation-specific features such as displaying the currency based on the users's country of origin. For this, you can use a nifty feature from the amazing free service WebPageTest.

On the main page of WebPageTest, you can specify the test location from the dropdown. It contains a generous list of locations across the globe. To fake your DNS setting and point, you can specify something like this in the Script tab:

setDNSName example.cdnprovider.comnavigate
This will effectively associate the page you want to test with the CDN provider-specified URL, so you will hit the CDN first from the location you chose.

Watch the open files limit when running Riak

Agile Testing - Grig Gheorghiu - Mon, 10/13/2014 - 17:53
I was close to expressing my unbridled joy at how little hand-holding our Riak cluster needs, when we started to see strange increased latencies when hitting the cluster, on calls that should have been very fast. Also, the health of the Riak nodes seems fine in terms of CPU, memory and disk. As usual, our good old friend the error log file pointed us towards the solution. We saw entries like this in /var/log/riak/error.log:

2014-10-11 03:22:40.565 UTC [error] <0.12830.4607> CRASH REPORT Process <0.12830.4607> with 0 neighbours exited with reason: {error,accept_failed} in mochiweb_acceptor:init/3 line 342014-10-11 03:22:40.619 UTC [error] <0.168.0> {mochiweb_socket_server,310,{acceptor_error,{error,accept_failed}}}2014-10-11 03:22:40.619 UTC [error] <0.12831.4607> application: mochiweb, "Accept failed error", "{error,emfile}"
A google search revealed that a possible cause of these errors is the dreaded open file descriptor limit, which is 1024 by default in Ubuntu.

To be perfectly honest, we hadn't done almost any tuning on our Riak cluster, because it had been running so smoothly. But recently we started to throw more traffic at it, hence issues with open file descriptors made sense. To fix it, we followed the advice in this Riak doc and created /etc/default/riak with the contents:
ulimit -n 65536
We also took the opportunity to apply the networking-related kernel tuning recommendations from this other Riak tuning doc and added these lines to /etc/sysctl.conf:
net.ipv4.tcp_max_syn_backlog = 40000net.core.somaxconn=4000net.ipv4.tcp_timestamps = 0net.ipv4.tcp_sack = 1net.ipv4.tcp_window_scaling = 1net.ipv4.tcp_fin_timeout = 15net.ipv4.tcp_keepalive_intvl = 30net.ipv4.tcp_tw_reuse = 1
Then we ran sysctl -p to update the above values in the kernel. Finally we restarted our Riak nodes one at a time.
I am happy to report that ever since, we've had absolutely no issues with our Riak cluster.  I should also say we are running Riak 1.3, and I understand that Riak 2.0 has better tests in place for avoiding this issue.

I do want to give kudos to Basho for an amazingly robust piece of technology, whose only fault is that it gets you into the habit of ignoring it because it just works!

A quick note on haproxy acl rules

Agile Testing - Grig Gheorghiu - Thu, 10/02/2014 - 20:29
I blogged in the past about haproxy acl rules we used for geolocation detection purposes. In that post, I referenced acl conditions that were met when traffic was coming from a non-US IP address. In that case, we were using a different haproxy backend. We had an issue recently when trying to introduce yet another backend for a given country. We added these acl conditions:

       acl acl_geoloc_akamai_true_client_ip_some_country req.hdr(X-Country-Akamai) -m str -i SOME_COUNTRY_CODE
       acl acl_geoloc_src_some_country req.hdr(X-Country-Src) -m str -i SOME_COUNTRY_CODE

We also added this use_backend rule:

      use_backend www_some_country-backend if acl_akamai_true_client_ip_header_exists acl_geoloc_akamai_true_client_ip_some_country or acl_geoloc_src_some_country

However, the backend www_some_country-backend was never chosen by haproxy, even though we could see traffic coming from IP address from SOME_COUNTRY_CODE.

The cause of this issue was that another use_backend rule (for non-US traffic) was firing before the new rule we added. I believe this is because this rule is more generic:

       use_backend www_row-backend if acl_akamai_true_client_ip_header_exists !acl_geoloc_akamai_true_client_ip_us or !acl_geoloc_src_us
The solution was to modify the use_backend rule for non-US traffic to fire only when the SOME_COUNTRY acl condition isn't met:
       use_backend www_row-backend if acl_akamai_true_client_ip_header_exists !acl_geoloc_akamai_true_client_ip_us !acl_geoloc_akamai_true_client_ip_some_country or !acl_geoloc_src_us !acl_geoloc_src_some_country
Maybe another solution would be to change the order of acls and use_backend rules. I couldn't find any good documentation on how this order affects what gets triggered when.

Booting a Raspberry Pi B+ with the Raspbian Debian Wheezy image

Agile Testing - Grig Gheorghiu - Thu, 09/11/2014 - 01:32
It took me a while to boot my brand new Raspberry Pi B+ with a usable Linux image. I chose the Raspbian Debian Wheezy image available on the downloads page of the official site. Here are the steps I needed:

1) Bought micro SD card. Note DO NOT get a regular SD card for the B+ because it will not fit in the SD card slot. You need a micro SD card.

2) Inserted the SD card via an SD USB adaptor in my MacBook Pro.

3) Went to the command line and ran df to see which volume the SD card was mounted as. In my case, it was /dev/disk1s1.

4) Unmounted the SD card. I initially tried 'sudo umount /dev/disk1s1' but the system told me to use 'diskutil unmount', so the command that worked for me was:

diskutil unmount /dev/disk1s1

5) Used dd to copy the Raspbian Debian Wheezy image (which I previously downloaded) per these instructions. Important note: the target of the dd command is /dev/disk1 and NOT /dev/disk1s1. I tried initially with the latter, and the Raspberry Pi wouldn't boot (one of the symptoms that something was wrong other than the fact that nothing appeared on the monitor, was that the green light was solid and not flashing; a google search revealed that one possible cause for that was a problem with the SD card). The dd command I used was:

dd if=2014-06-20-wheezy-raspbian.img of=/dev/disk1 bs=1m

6) At this point, I inserted the micro SD card into the SD slot on the Raspberry Pi, then connected the Pi to a USB power cable, a monitor via an HDMI cable, a USB keyboard and a USB mouse. I was able to boot and change the password for the pi user. The sky is the limit next ;-)

Two lessons on haproxy checks and swap space

Agile Testing - Grig Gheorghiu - Thu, 08/21/2014 - 02:03
Let's assume you want to host a Wordpress site which is not going to get a lot of traffic. You want to use EC2 for this. You still want as much fault tolerance as you can get at a decent price, so you create an Elastic Load Balancer endpoint which points to 2 (smallish) EC2 instances running haproxy, with each haproxy instance pointing in turn to 2 (not-so-smallish) EC2 instances running Wordpress (Apache + MySQL). 
You choose to run haproxy behind the ELB because it gives you more flexibitity in terms of load balancing algorithms, health checks, redirections etc. Within haproxy, one of the Wordpress servers is marked as a backup for the other, so it only gets hit by haproxy when the primary one goes down. On this secondary Wordpress instance you set up MySQL to be a slave of the primary instance's MySQL. 
Here are two things (at least) that you need to make sure you have in this scenario:
1) Make sure you specify the httpchk option in haproxy.cfg, otherwise the primary server will not be marked as down even if Apache goes down. So you should have something like:
backend servers-http  server s1 weight 1 maxconn 5000 check port 80  server s2 backup weight 1 maxconn 5000 check port 80  option httpchk GET /
2) Make sure you have swap space in case the memory on the Wordpress instances gets exhausted, in which case random processes will be killed by the oom process (and one of those processes can be mysqld). By default, there is no swap space when you spin up an Ubuntu EC2 instance. Here's how to set up a 2 GB swapfile:
dd if=/dev/zero of=/swapfile1 bs=1024 count=2097152mkswap /swapfile1chmod 0600 /swapfile1swapon /swapfile1echo "/swapfile1 swap swap defaults 0 0" >> /etc/fstab
I hope these two things will help you if you're not already doing them ;-)

Managing OpenStack security groups from the command line

Agile Testing - Grig Gheorghiu - Fri, 08/15/2014 - 20:47
I had an issue today where I couldn't connect to a particular OpenStack instance on port 443. I decided to inspect the security group it belongs (let's call it myapp) to from the command line:

# nova secgroup-list-rules myapp
| IP Protocol | From Port | To Port | IP Range   | Source Group |
| tcp         | 80        | 80      |  |              |
| tcp         | 443       | 443     | |              |

Note that the IP range for port 443 is wrong. It should be all IPs and not a /24 network.

I proceeded to delete the wrong rule:

# nova secgroup-delete-rule myapp tcp 443 443                                                               
| IP Protocol | From Port | To Port | IP Range   | Source Group |
| tcp         | 443       | 443     | |              |

Then I added back the correct rule:
 # nova secgroup-add-rule myapp tcp 443 443                                                                   +-------------+-----------+---------+-----------+--------------+| IP Protocol | From Port | To Port | IP Range  | Source Group |+-------------+-----------+---------+-----------+--------------+| tcp         | 443       | 443     | |              |+-------------+-----------+---------+------------+--------------+
Finally, I verified that the rules are now correct:
# nova secgroup-list-rules myapp                                                                                       +-------------+-----------+---------+-----------+--------------+| IP Protocol | From Port | To Port | IP Range  | Source Group |+-------------+-----------+---------+-----------+--------------+| tcp         | 443       | 443     | |              || tcp         | 80        | 80      | |              |+-------------+-----------+---------+-----------+--------------+
Of course, the real test was to see if I could now hit port 443 on my instance, and indeed I was able to.

Troubleshooting haproxy 502 errors related to malformed/large HTTP headers

Agile Testing - Grig Gheorghiu - Wed, 07/23/2014 - 00:02
We had a situation recently where our web application started to behave strangely. First nginx (which sits in front of the application) started to error out with messages of this type:

upstream sent too big header while reading response header from upstream

A quick Google search revealed that a fix for this is to bump up proxy_buffer_size in nginx.conf, for both http and https traffic, along these lines:

proxy_buffer_size   256k;
proxy_buffers   4 256k;
proxy_busy_buffers_size   256k;

Now nginx was happy when hit directly. However, haproxy was still erroring out with a 502 'bad gateway' return code, followed by PH. Here is a snippet from the haproxy log file:

Jul 22 21:27:13 haproxy[14317]: [22/Jul/2014:21:27:12.776] www-frontend www-backend/www2:80 1/0/1/-1/898 502 8396 - - PH-- 0/0/0/0/0 0/0 "GET /someurl HTTP/1.1"

Another Google search revealed that PH means that haproxy rejected the header from the backend because it was malformed.

At this point, an investigation into the web app did discover a loop in the code that kept adding elements to a cookie included in the response header.

Anyway, I leave this here in the hope that somebody will stumble on it and benefit from it.

First experiences with OpenStack

Agile Testing - Grig Gheorghiu - Thu, 07/17/2014 - 21:37
We hit a big milestone this week, as we started to use OpenStack as a private cloud, intially just for QA/integration environments. Up to now we've been creating KVM machines semi-manually, which used to take minutes. Now we cut down that process to seconds, calling the Nova API from the command line, e.g.:

$ nova boot --image precise-image --flavor www --key_name mykey --nic net-id=3eafbd4f-0389-4c5b-93ba-7764742ee8cd www1.qa1

Once an instance is provisioned, we bootstrap it with Chef:

$ knife bootstrap -x ubuntu --sudo -E qa1 -N www1.qa1 -r "role[base], role[www]"

Our internal network architecture is fairly complex, so my colleague Jeff Roberts spent quite some time bending OpenStack Neutron to his will (in conjunction with Open vSwitch) in order to support our internal VLANs. The OpenStack infrastructure has been stable so far, and it's just such a pleasure to do everything via an API and not to spin VMs up manually. Being back to working with a (private) cloud feels good.

This is just version 1.0 of our OpenStack rollout. Soon we'll start spinning up one environment at a time using chef-metal and fog  and we'll also integrate instance + environment spin-up with Jenkins. Exciting times ahead!