Software Development Blogs: Programming, Software Testing, Agile, Project Management

Agile Testing - Grig Gheorghiu
Did anybody say webscale?

Setting up Jenkins to run headless Selenium tests in Docker containers

Sat, 02/06/2016 - 01:40
This is the third post in a series on running headless Selenium WebDriver tests. Here are the first two posts:
  1. Running Selenium WebDriver tests using Firefox headless mode on Ubuntu
  2. Running headless Selenium WebDriver tests in Docker containers
In this post I will show how to add the final piece to this workflow, namely how to fully automate the execution of Selenium-based WebDriver tests running Firefox in headless mode in Docker containers. I will use Jenkins for this example, but the same applies to other continuous integration systems.
1) Install docker-engine on the server running Jenkins (I covered this in my post #2 above)
2) Add the jenkins user to the docker group so that Jenkins can run the docker command-line tool in order to communicate with the docker daemon. Remember to restart Jenkins after doing this.
3) Go through the rest of the workflow in my post above ("Running headless Selenium WebDriver tests in Docker containers") and make sure you can run all the commands in that post from the command line of the server running Jenkins.
4) Create a directory structure for your Selenium WebDriver tests (mine are written in Python). 
I have a directory called selenium-docker which contains a directory called tests, under which I put all my Python WebDriver tests named sel_wd_*.py. I also have a simple shell script named run_all_selenium_tests.sh which does the following:
#!/bin/bash
TARGET=$1 # e.g. someotherdomain.example.com (if not specified, the default is somedomain.example.com)
for f in `ls tests/sel_wd_*.py`; do
    echo Running $f against $TARGET
    python $f $TARGET
done
My selenium-docker directory also contains the xvfb.init file I need for starting up Xvfb in the container, and finally it contains this Dockerfile:
FROM ubuntu:trusty
RUN echo "deb http://ppa.launchpad.net/mozillateam/firefox-next/ubuntu trusty main" > /etc/apt/sources.list.d//mozillateam-firefox-next-trusty.listRUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys CE49EC21RUN apt-get updateRUN apt-get install -y firefox xvfb python-pipRUN pip install seleniumRUN mkdir -p /root/selenium/tests
ADD tests /root/selenium/testsADD run_all_selenium_tests.sh /root/selenium
ADD xvfb.init /etc/init.d/xvfbRUN chmod +x /etc/init.d/xvfbRUN update-rc.d xvfb defaults
ENV TARGET=somedomain.example.com
CMD (service xvfb start; export DISPLAY=:10; cd /root/selenium; ./run_all_selenium_tests.sh $TARGET)
I explained what this Dockerfile achieves in the 2nd post referenced above. The ADD instructions will copy all the files in the tests directory to the directory called /root/selenium/tests, and will copy run_all_selenium_tests.sh to /root/selenium. The ENV variable TARGET represents the URL against which we want to run our Selenium tests. It is set by default to somedomain.example.com, and is used as the first argument when running run_all_selenium_tests.sh in the CMD instruction.
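For instance, to exercise the override by hand (outside Jenkins), you could build and run the image like this; selenium-wd:v1 is the image name used in the Jenkins build step shown below, and someotherdomain.example.com is just an example target:

# build the image from the Dockerfile in the current directory
docker build -t selenium-wd:v1 .

# run the tests against a different target by overriding the TARGET ENV variable
docker run --rm -e "TARGET=someotherdomain.example.com" selenium-wd:v1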
At this point, I checked the selenium-docker directory, and all files and directories under it, into a Github repository I will call 'devops'.
5) Create a new Jenkins project (I usually create a New Item and copy it from an existing project).
I specified that the build is parameterized and I indicated a choice parameter called TARGET_HOST with a few host/domain names that I want to test. I also specified Git as the Source Code Management type, and I indicated the URL of the devops repository on Github. Most of the action of course happens in the Jenkins build step, which in my case is of type "Execute shell". Here it is:
#!/bin/bash
set +e
IMAGE_NAME=selenium-wd:v1
cd $WORKSPACE/selenium-docker
# build the image out of the Dockerfile in the current directory
/usr/bin/docker build -t $IMAGE_NAME .

# run a container based on the image
CONTAINER_ID=`/usr/bin/docker run -d -e "TARGET=$TARGET_HOST" $IMAGE_NAME`
echo CONTAINER_ID=$CONTAINER_ID

# while the container is still running, sleep and check logs; repeat every 40 sec
while [ $? -eq 0 ]; do
  sleep 40
  /usr/bin/docker logs $CONTAINER_ID
  /usr/bin/docker ps | grep $IMAGE_NAME
done

# docker logs sends errors to stderr so we need to save its output to a file first
/usr/bin/docker logs $CONTAINER_ID > d.out 2>&1

# remove the container so they don't keep accumulating
docker rm $CONTAINER_ID

# mark jenkins build as failed if log output contains FAILED
grep "FAILED" d.out

if [[ $? -eq 0 ]]; then
  rm d.out
  exit 1
else
  rm d.out
  exit 0
fi
Some notes:
  • it is recommended that you specify #!/bin/bash as the 1st line of your script, to make sure that bash is the shell that is being used
  • use set +e if you want the Jenkins shell script to continue after hitting a non-zero return code (the default behavior is for the script to stop on the first line it encounters an error and for the build to be marked as failed; subsequent lines won't get executed, resulting in much pulling of hair)
  • the Jenkins script will build a new image every time it runs, so that we make sure we have updated Selenium scripts in place
  • when running the container via docker run, we specify -e "TARGET=$TARGET_HOST" as an extra command line argument. This will override the ENV variable named TARGET in the Dockerfile with the value received from the Jenkins multiple choice dropdown for TARGET_HOST
  • the main part of the shell script stays in a while loop that checks for the return code of "/usr/bin/docker ps | grep $IMAGE_NAME". This is so we wait for all the Selenium tests to finish, at which point docker ps will not show the container running anymore (you can still see the container by running docker ps -a)
  • once the tests finish, we save the stdout and stderr of the docker logs command for our container to a file (this is so we capture both stdout and stderr; at first I tried something like docker logs $CONTAINER_ID | grep FAILED but this was never successful, because it was grep-ing against stdout, and errors are sent to stderr)
  • we grep the file (d.out) for the string FAILED and if we find it, we exit with code 1, i.e. unsuccessful as far as Jenkins is concerned. If we don't find it, we exit successfully with code 0.


Running headless Selenium WebDriver tests in Docker containers

Sat, 01/09/2016 - 00:24
In my previous post, I showed how to install Firefox on an Ubuntu box and how to use Xvfb so that Selenium WebDriver scripts can run against Firefox in headless mode.

Here I want to show how to run each Selenium test suite in a Docker container, so that each suite gets access to its own Firefox browser. This makes it easy to parallelize the test runs, and thus allows you to load test your Web infrastructure with real-life test cases.

Install docker-engine on Ubuntu 14.04

We import the dockerproject.org signing key and apt repo into our apt repositories, then we install the linux-image-extra and docker-engine packages.

# apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
# echo "deb https://apt.dockerproject.org/repo ubuntu-trusty main" > /etc/apt/sources.list.d/docker.list
# apt-get update
# apt-get install linux-image-extra-$(uname -r)
# apt-get install docker-engine

Start the docker service and verify that it is operational

Installing docker-engine actually starts the docker service as well, but if you ever need to start it manually, run:

# service docker start

To verify that the docker service is operational, run a container based on the public “hello-world” Docker image:

# docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
b901d36b6f2f: Pull complete
0a6ba66e537a: Pull complete
Digest: sha256:8be990ef2aeb16dbcb9271ddfe2610fa6658d13f6dfb8bc72074cc1ca36966a7
Status: Downloaded newer image for hello-world:latest
Hello from Docker.
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.
To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker Hub account:
https://hub.docker.com
For more examples and ideas, visit:
https://docs.docker.com/userguide/

Pull the ubuntu:trusty public Docker image
# docker pull ubuntu:trusty
trusty: Pulling from library/ubuntu
fcee8bcfe180: Pull complete
4cdc0cbc1936: Pull complete
d9e545b90db8: Pull complete
c4bea91afef3: Pull complete
Digest: sha256:3a7f4c0573b303f7a36b20ead6190bd81cabf323fc62c77d52fb8fa3e9f7edfe
Status: Downloaded newer image for ubuntu:trusty

# docker images
REPOSITORY    TAG      IMAGE ID       CREATED        VIRTUAL SIZE
ubuntu        trusty   c4bea91afef3   3 days ago     187.9 MB
hello-world   latest   0a6ba66e537a   12 weeks ago   960 B

Build custom Docker image for headless Selenium WebDriver testing
I created a directory called selwd on my host Ubuntu 14.04 box, and in that directory I created this Dockerfile:

FROM ubuntu:trusty
RUN echo "deb http://ppa.launchpad.net/mozillateam/firefox-next/ubuntu trusty main" > /etc/apt/sources.list.d/mozillateam-firefox-next-trusty.list
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys CE49EC21
RUN apt-get update
RUN apt-get install -y firefox xvfb python-pip
RUN pip install selenium
RUN mkdir -p /root/selenium_wd_tests
ADD sel_wd_new_user.py /root/selenium_wd_tests
ADD xvfb.init /etc/init.d/xvfb
RUN chmod +x /etc/init.d/xvfb
RUN update-rc.d xvfb defaults
CMD (service xvfb start; export DISPLAY=:10; python /root/selenium_wd_tests/sel_wd_new_user.py)

This Dockerfile tells docker, via the FROM instruction, to create an image based on the ubuntu:trusty image that we pulled before (if we hadn’t pulled it, it would be pulled the first time our image was built). The various RUN instructions specify commands to be run at build time. The above instructions add the Firefox Beta repository and key to the apt repositories inside the image, then install firefox, xvfb and python-pip. Then they install the selenium Python package via pip and create a directory structure for the Selenium tests.

The ADD instructions copy local files to the image. In my case, I copy one Selenium WebDriver Python script, and an init.d-type file for starting Xvfb as a service (by default it starts in the foreground, which is not something I want inside a Docker container).

The last two RUN instructions make the /etc/init.d/xvfb script executable and run update-rc.d to install it as a service. The xvfb script is the usual init.d wrapper around a command, in my case this command:

PROG="/usr/bin/Xvfb"
PROG_OPTIONS=":10 -ac"

Here is a gist for the xvfb.init script for reference.
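The gist itself is not reproduced here; as a rough illustrative sketch (not the exact script from the gist), a minimal init.d-style wrapper built around those two variables might look like this:

#!/bin/bash
# Minimal init.d-style wrapper around Xvfb (illustrative sketch, not the original gist)
PROG="/usr/bin/Xvfb"
PROG_OPTIONS=":10 -ac"
PIDFILE="/var/run/xvfb.pid"

case "$1" in
  start)
    echo "Starting : X Virtual Frame Buffer"
    start-stop-daemon --start --background --make-pidfile --pidfile $PIDFILE --exec $PROG -- $PROG_OPTIONS
    ;;
  stop)
    echo "Stopping : X Virtual Frame Buffer"
    start-stop-daemon --stop --pidfile $PIDFILE
    rm -f $PIDFILE
    ;;
  *)
    echo "Usage: /etc/init.d/xvfb {start|stop}"
    exit 1
    ;;
esac
exit 0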

Finally, the CMD instruction specifies what gets executed when a container based on this image starts up (assuming no other commands are given in the ‘docker run’ command-line for this container). The CMD instruction in the Dockerfile above starts up the xvfb service (which connects to DISPLAY 10 as specified in the xvfb init script), sets the DISPLAY environment variable to 10, then runs the Selenium WebDriver script sel_wd_new_user.py, which will launch firefox in headless mode and execute its commands against it.

Here’s the official documentation for Dockerfile instructions. To build a Docker image based on this Dockerfile, run:

# docker build -t selwd:v1 .

selwd is the name of the image and v1 is a tag associated with this name. The dot . tells docker to look for a Dockerfile in the current directory.

The build process will take a while initially, because it will install all the dependencies necessary for the packages we are installing with apt. Every time you make a modification to the Dockerfile, you need to run ‘docker build’ again, but subsequent runs will be much faster.

Run Docker containers based on the custom image
At this point, we are ready to run Docker containers based on the selwd image we created above.

Here’s how to run a single container:

# docker run --rm selwd:v1
In this format, the command specified in the CMD instruction inside the Dockerfile will get executed, then the container will stop. This is exactly what we need: we run our Selenium WebDriver tests against headless firefox, inside their own container isolated from any other container.

The output of the ‘docker run’ command above is:


Starting : X Virtual Frame Buffer
.
----------------------------------------------------------------------
Ran 1 test in 40.441s

OK
(or a traceback if the Selenium test encountered an error)

Note that we also specified the --rm flag to ‘docker run’ so that the container gets removed once it stops; otherwise these short-lived containers will be kept around and will pile up, as you can see for yourself if you run:

# docker ps -a
CONTAINER ID   IMAGE      COMMAND                  CREATED          STATUS                       PORTS   NAMES
6c9673e59585   selwd:v1   "/bin/bash"              5 minutes ago    Exited (130) 5 seconds ago           modest_mccarthy
980651e1b167   selwd:v1   "/bin/sh -c '(service"   9 minutes ago    Exited (0) 8 minutes ago             stupefied_turing
4a9b2f4c8c28   selwd:v1   "/bin/sh -c '(service"   13 minutes ago   Exited (0) 12 minutes ago            nostalgic_ride
9f1fa953c83b   selwd:v1   "/bin/sh -c '(service"   13 minutes ago   Exited (0) 12 minutes ago            admiring_ride
c15b180832f6   selwd:v1   "/bin/sh -c '(service"   13 minutes ago   Exited (0) 12 minutes ago            jovial_booth
.....etc

If you do have large numbers of containers that you want to remove in one go, use this command:

# docker rm `docker ps -aq`

For troubleshooting purposes, we can run a container in interactive mode (with the -i and -t flags) and specify a shell command to be executed on startup, which will override the CMD instruction in the Dockerfile:

# docker run -it selwd:v1 /bin/bash
root@6c9673e59585:/#
At the bash prompt, you can run the shell commands specified by the Dockerfile CMD instruction in order to see interactively what is going on. The official ‘docker run’ documentation has lots of details.
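For example, based on the CMD instruction in the Dockerfile above, you could run its commands step by step inside the interactive container:

# inside the container started with 'docker run -it selwd:v1 /bin/bash'
service xvfb start
export DISPLAY=:10
python /root/selenium_wd_tests/sel_wd_new_user.py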

One other thing I found useful for troubleshooting Selenium WebDriver scripts running against headless firefox was to have the scripts take screenshots during their execution with the save_screenshot command:

driver.save_screenshot("before_place_order.png")
# Click Place Order
driver.find_element_by_xpath("//*[@id='order_submit_button']").click()
driver.save_screenshot("after_place_order.png")

I then inspected the PNG files to see what was going on.

Running multiple Docker containers for load testing

Because each Selenium WebDriver test run is isolated in its own Docker container, we can run N containers in parallel to do a poor man’s load test of our site. We’ll use the -d option to ‘docker run’ to run each container in ‘detached’ mode. Here is a bash script that launches COUNT Docker containers, where COUNT is the first command-line argument (2 by default):


#!/bin/bash
COUNT=$1
if [ -z "$COUNT" ]; then
  COUNT=2
fi

for i in `seq 1 $COUNT`; do
  docker run -d selwd:v1
done

The output of the script consists of a list of container IDs, one for each container that was launched.

Note that if you launch a container in detached mode with -d, you can’t specify the --rm flag to have the container removed automatically when it stops. You will need to periodically clean up your containers with the command I referenced above (docker rm `docker ps -aq`).

To inspect the output of the Selenium scripts in the containers that were launched, first get the container IDs:
# docker ps -a
CONTAINER ID   IMAGE      COMMAND                  CREATED             STATUS                         PORTS   NAMES
6fb931689c03   selwd:v1   "/bin/sh -c '(service"   About an hour ago   Exited (0) About an hour ago           grave_northcutt
1b82ef59ad46   selwd:v1   "/bin/sh -c '(service"   About an hour ago   Exited (0) About an hour ago           admiring_fermat

Then run ‘docker logs <container_id>’ to see the output for a specific container:
# docker logs 6fb931689c03
Starting : X Virtual Frame Buffer
.
----------------------------------------------------------------------
Ran 1 test in 68.436s

OK
Have fun load testing your site!

Running Selenium WebDriver tests using Firefox headless mode on Ubuntu

Thu, 01/07/2016 - 23:58
Selenium IDE is a very good tool for recording and troubleshooting Selenium tests, but you are limited to clicking around in a GUI. For a better testing workflow, including load testing, you need to use Selenium WebDriver, which can programmatically drive a browser and run Selenium test cases.

In its default mode, WebDriver will launch a browser and run the test scripts in the browser, then exit. If you like to work exclusively from the command line, then you need to look into running the browser in headless mode. Fortunately, this is easy to do with Firefox on Ubuntu. Here’s what you need to do:

Install the official Firefox Beta PPA:
$ sudo apt-add-repository ppa:mozillateam/firefox-next

(this will add the file /etc/apt/sources.list.d/mozillateam-firefox-next-trusty.list and also fetch the PPA’s key, which enables your Ubuntu system to verify that the packages in the PPA have not been interfered with since they were built)

Run apt-get update:
$ sudo apt-get update

Install firefox and xvfb (the X windows virtual framebuffer) packages:
$ sudo apt-get install firefox xvfb

Run Xvfb in the background and specify a display number (10 in my example):
$ Xvfb :10 -ac &

Set the DISPLAY variable to the number you chose:
$ export DISPLAY=:10

Test that you can run firefox in the foreground with no errors:
$ firefox
(kill it with Ctrl-C)

Now run your regular Selenium WebDriver scripts (no modifications required if they already use Firefox as their browser).

Here is an example of a script I have written in Python, which clicks on a category link in an e-commerce store, adds an item to the cart, then starts filling out the user’s billing information at checkout:
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
import unittest, time, re, random
class SelWebdriverNewUser(unittest.TestCase):
  def setUp(self):
    self.driver = webdriver.Firefox()
    self.driver.implicitly_wait(20)
    self.base_url = "http://myhost.mycompany.com/"
    self.verificationErrors = []
    self.accept_next_alert = True

  def test_sel_webdriver_new_user(self):
    driver = self.driver
    HOST = "myhost.mycompany.com"
    RANDINT = random.random()*10000
    driver.get("https://" + HOST)

    # Click on category link
    driver.find_element_by_xpath("//*[@id='nav']/ol/li[3]/a").click()
    # Click on sub-category link
    driver.find_element_by_xpath("//*[@id='top']/body/div/div[2]/div[2]/div/div[2]/ul/li[4]/a/span").click()
    # Click on product image
    driver.find_element_by_xpath("//*[@id='product-collection-image-374']").click()
    # Click Checkout button
    driver.find_element_by_xpath("//*[@id='checkout-button']/span/span").click()
    driver.find_element_by_id("billing:firstname").clear()
    driver.find_element_by_id("billing:firstname").send_keys("selenium", RANDINT, "_fname")
    driver.find_element_by_id("billing:lastname").clear()
    driver.find_element_by_id("billing:lastname").send_keys("selenium", RANDINT, "_lname")
    # Click Place Order
    driver.find_element_by_xpath("//*[@id='order_submit_button']").click()

  def is_element_present(self, how, what):
    try: self.driver.find_element(by=how, value=what)
    except NoSuchElementException as e: return False
    return True

  def is_alert_present(self):
    try: self.driver.switch_to_alert()
    except NoAlertPresentException as e: return False
    return True

  def close_alert_and_get_its_text(self):
    try:
      alert = self.driver.switch_to_alert()
      alert_text = alert.text
      if self.accept_next_alert:
        alert.accept()
      else:
        alert.dismiss()
      return alert_text
    finally: self.accept_next_alert = True

  def tearDown(self):
    self.driver.quit()
    self.assertEqual([], self.verificationErrors)

if __name__ == "__main__":
  unittest.main()

To run this script, you first need to install the selenium Python package:

$ sudo pip install selenium

Then run the script (called selenium_webdriver_new_user.py in my case):

$ python selenium_webdriver_new_user.py

After a hopefully short wait, you should see a successful test run:

$ python selenium_webdriver_new_user.py
.
----------------------------------------------------------------------
Ran 1 test in 29.317s

OK

A few notes regarding Selenium WebDriver scripts.

I was stumped for a while when I was trying to use the "find_element_by_id" form of finding an HTML element on a Web page. It was working fine in Selenium IDE, but Selenium WebDriver couldn't find that element. I had to resort to finding elements via their XPath using "find_element_by_xpath". Fortunately, Chrome for example makes it easy to right-click an element on a page, choose Inspect, then right-click the HTML code for the element and choose Copy -> Copy XPath to get its XPath, which can then be pasted into the Selenium WebDriver script.

I also had to use time.sleep(N) (where N is in seconds) to wait for certain elements of the page to load asynchronously. I know it's not best practice, but it works; an explicit wait, sketched below, is the more robust approach.
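As a sketch of that alternative, here is an explicit wait with WebDriverWait (the element ID is borrowed from the script above, and the 30-second timeout is arbitrary):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 30 seconds for the billing first-name field to be present,
# instead of sleeping for a fixed amount of time
element = WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.ID, "billing:firstname"))
)
element.send_keys("selenium_fname")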

Distributing a beta version of an iOS app

Fri, 01/01/2016 - 20:41

I am not an iOS expert by any means, but recently I’ve had to maintain an iOS app and distribute it to beta testers. I had to jump through a few hoops, so I am documenting here the steps I had to take.

First of all, I am using Xcode 6.4 with the Fabric 2.1.1 plugin. I assume you are already signed up for the Fabric/Crashlytics service and that you also have an Apple developer account.
  1. Ask each beta tester to send you the UDID of each device they want to run your app on.
  2. Go to developer.apple.com -> “Certificates, Identifiers and Profiles” -> “Devices” and add each device with its associated UDID. Let’s say you add a device called “Tom’s iPhone 6s” with its UDID.
  3. Go to Xcode -> Preferences -> Accounts. If you already have an account set up, remove it by selecting it and clicking the minus icon on the lower left side. Add an account: click the plus icon, choose “Add Apple ID” and enter your Apple ID and password. This will import your Apple developer provisioning profile into Xcode, with the newly added device UDIDs (note: there may be a better way of adding/modifying the provisioning profile within Xcode, but this worked for me).
  4. Make sure the Fabric plugin is running on your Mac.
  5. Go to Xcode and choose the iOS application you want to distribute. Choose iOS Device as the target for the build.
  6. Go to Xcode -> Product -> Archive. This will build the app, then the Fabric plugin will pop up a message box asking you if you want to distribute the archive build. Click Distribute.
  7. The Fabric plugin will pop up a dialog box asking you for the email of the tester(s) you want to invite. Enter one or more email addresses. Enter release notes. At this point the Fabric plugin will upload your app build to the Fabric site and notify the tester(s) that they are invited to test the app.

Installing and configuring Raspbian Jessie on a Raspberry Pi B+

Wed, 12/23/2015 - 20:38
I blogged before about configuring a Raspberry Pi B+ with Raspbian Wheezy. Here are some notes I took today while going through the whole process again, but this time with the latest Raspbian version, Jessie, from 2015-11-21. Many steps are the same, but I will add instructions for configuring a wireless connection.

1) Bought micro SD card. Note: DO NOT get a regular SD card for the B+ because it will not fit in the SD card slot. You need a micro SD card.

2) Inserted the SD card via an SD USB adaptor in my MacBook Pro.

3) Went to the command line and ran df to see which volume the SD card was mounted as. In my case, it was /dev/disk2s1.

4) Unmounted the SD card from my Mac. I initially tried 'sudo umount /dev/disk2s1' but the system told me to use 'diskutil unmount', so the command that worked for me was:

$ diskutil unmount /dev/disk2s1

5) Downloaded 2015-11-21-raspbian-jessie.zip from  https://downloads.raspberrypi.org/raspbian/images. Unzipped it to obtain the image file 2015-11-21-raspbian-jessie.img
6) Used dd to copy the image from my Mac to the SD card. Thanks to an anonymous commenter on my previous blog post, I specified the target of the dd command as the raw device /dev/rdisk2. Note: DO NOT specify the target as /dev/disk2s1 or /dev/rdisk2s1. Either /dev/disk2 or /dev/rdisk2 will work, but copying to the raw device is faster. Here is the dd command I used:
$ sudo dd if=2015-11-21-raspbian-jessie.img of=/dev/rdisk2 bs=1m
3752+0 records in
3752+0 records out
3934257152 bytes transferred in 233.218961 secs (16869371 bytes/sec)
7) I unmounted the SD card from my Mac one more time:
$ diskutil unmount /dev/disk2s1
8) I inserted the SD card into my Raspberry Pi. I also inserted a USB WiFi adapter (I used the Wi-Pi 802.11n adapter). My Pi was also connected to a USB keyboard, to a USB mouse and to a monitor via HDMI. 
9) I powered up the Pi. It went through the Raspbian Jessie boot process uneventfully, and it brought up the X Windows GUI interface (which is the default in Jessie, as opposed to the console in Wheezy). At this point, I configured the Pi to boot back into console mode by going to Menu -> Preferences -> Raspberry Pi Configuration and changing the Boot option from "To Desktop" to "To CLI". While in the configuration dialog, I also changed the default password for user pi, and unchecked the autologin option.
10) I rebooted the Pi and this time it booted up in console mode and stopped at the login prompt. I logged in as user pi.
11) I spent the next 30 minutes googling around to find out how to make the wireless interface work. It's always been a chore for me to get wlan to work on a Pi, hence the following instructions (based on this really good blog post).
12) Edit /etc/network/interfaces:
(i) change "auto lo" to "auto wlan0"
(ii) change "iface wlan0 inet manual" to "iface wlan0 inet dhcp"
The relevant wlan0 stanza ends up looking like the sketch below.
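A minimal sketch of the resulting wlan0 lines in /etc/network/interfaces (the rest of the stock file is left untouched):

auto wlan0
iface wlan0 inet dhcp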
13) Edit /etc/wpa_supplicant/wpa_supplicant.conf and add this at the end:
network={
  ssid="your_ssid"
  psk="your_ssid_password"
}
14) Rebooted the Pi and ran ifconfig. At this point I could see wlan0 configured properly with an IP address.
Hope these instructions work for you. Merry Christmas!

Protecting your site for free with Let's Encrypt SSL certificates and acmetool

Tue, 12/08/2015 - 02:47
The buzz level around Let's Encrypt has been more elevated lately, due to their opening up their service as a public beta. If you don't know what Let's Encrypt is, it's a Certificate Authority which provides SSL certificates free of charge. The twist is that they implement a protocol called ACME ("Automated Certificate Management Environment") for automating the management of domain-validation certificates, based on a simple JSON-over-HTTPS interface. Read more technical details about Let's Encrypt here.

The certificates from Let's Encrypt have a short life of 90 days, and this is done on purpose so that they encourage web site administrators to renew them programmatically and automatically. In what follows, I'll walk you through how to obtain and install Let's Encrypt certificates for nginx on Ubuntu. I will use a tool called acmetool, and not the official Let's Encrypt client tools, because acmetool generates standalone SSL keys and certs and doesn't try to reconfigure a given web server automatically in order to use them (like the letsencrypt client tools do). I like this separation of concerns. Plus acmetool is written in Go, so you just deploy it as a binary and you're off to the races.
1) Configure nginx to serve your domain name

I will assume you want to protect www.mydomain.com with SSL certificates from Let's Encrypt. The very first step, which I assume you have already taken, is to configure nginx to serve www.mydomain.com on port 80. I also assume the document root is /var/www/mydomain.
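A minimal sketch of such a server block might look like the following (the server name and document root are the placeholders used throughout this post):

server {
    listen 80;
    server_name www.mydomain.com;

    root /var/www/mydomain;
    index index.html;
}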
2) Install acmetool

$ sudo apt-get install libcap-dev
$ git clone https://github.com/hlandau/acme
$ cd acme
$ make && sudo make install
3) Run "acmetool quickstart" to configure ACME
The ACME protocol requires a verification of your ownership of mydomain.com. There are multiple ways to prove that ownership; the one I chose below was to let the ACME agent (in this case acmetool) drop a file under the nginx document root. As part of the verification, the ACME agent will also generate a keypair under the covers, and sign a nonce sent from the ACME server with the private key, in order to prove possession of the keypair.

# acmetool quickstart
------------------------- Select ACME Server -----------------------
Please choose an ACME server from which to request certificates. Your principal choices are the Let's Encrypt Live Server, and the Let's Encrypt Staging Server.
You can use the Let's Encrypt Live Server to get real certificates.
The Let's Encrypt Staging Server does not issue publically trusted certificates. It is useful for development purposes, as it has far higher rate limits than the live server.
  1) Let's Encrypt Live Server - I want live certificates
  2) Let's Encrypt Staging Server - I want test certificates
  3) Enter an ACME server URL
I chose option 1 (Let's Encrypt Live Server).
----------------- Select Challenge Conveyance Method ---------------
acmetool needs to be able to convey challenge responses to the ACME server in order to prove its control of the domains for which you issue certificates. These authorizations expire rapidly, as do ACME-issued certificates (Let's Encrypt certificates have a 90 day lifetime), thus it is essential that the completion of these challenges is a) automated and b) functioning properly. There are several options by which challenges can be facilitated:
WEBROOT: The webroot option installs challenge files to a given directory. You must configure your web server so that the files will be available at <http://[HOST]/.well-known/acme-challenge/>. For example, if your webroot is "/var/www", specifying a webroot of "/var/www/.well-known/acme-challenge" is likely to work well. The directory will be created automatically if it does not already exist.
PROXY: The proxy option requires you to configure your web server to proxy requests for paths under /.well-known/acme-challenge/ to a special web server running on port 402, which will serve challenges appropriately.
REDIRECTOR: The redirector option runs a special web server daemon on port 80. This means that you cannot run your own web server on port 80. The redirector redirects all HTTP requests to the equivalent HTTPS URL, so this is useful if you want to enforce use of HTTPS. You will need to configure your web server to not listen on port 80, and you will need to configure your system to run "acmetool redirector" as a daemon. If your system uses systemd, an appropriate unit file can automatically be installed.
LISTEN: Directly listen on port 80 or 443, whichever is available, in order to complete challenges. This is useful only for development purposes.
  1) WEBROOT - Place challenges in a directory
  2) PROXY - I'll proxy challenge requests to an HTTP server
  3) REDIRECTOR - I want to use acmetool's redirect-to-HTTPS functionality
  4) LISTEN - Listen on port 80 or 443 (only useful for development purposes)
I chose option 1 (WEBROOT).
------------------------- Enter Webroot Path -----------------------
Please enter the path at which challenges should be stored.
If your webroot path is /var/www, you would enter /var/www/.well-known/acme-challenge here. The directory will be created if it does not exist.
Webroot paths vary by OS; please consult your web server configuration.
I indicated /var/www/mydomain/.well-known/acme-challenge as the directory where the challenge will be stored.
------------------------- Quickstart Complete ----------------------
The quickstart process is complete.
Ensure your chosen challenge conveyance method is configured properly before attempting to request certificates. You can find more information about how to configure your system for each method in the acmetool documentation: https://github.com/hlandau/acme.t/blob/master/doc/WSCONFIG.md
To request a certificate, run:
$ sudo acmetool want example.com www.example.com
If the certificate is successfully obtained, it will be placed in /var/lib/acme/live/example.com/{cert,chain,fullchain,privkey}.
Press Return to continue.
4) Obtain the Let's Encrypt SSL key and certificates for www.mydomain.com
As the quickstart output indicates above, we need to run:
# acmetool want www.mydomain.com
This should run with no errors and drop the following files in /var/lib/acme/live/www.mydomain.com: cert, chain, fullchain, privkey and url.
5) Configure nginx to use the Let's Encrypt SSL key and certificate chain
I found a good resource for specifying secure (as of Dec. 2015) SSL configurations for a variety of software, including nginx: cipherli.st.
Here is the nginx configuration pertaining to SSL that I used, pointing to the SSL key and certificate chain retrieved by acmetool from Let's Encrypt:
        listen 443 ssl default_server;
        listen [::]:443 ssl default_server;

        ssl_certificate     /var/lib/acme/live/www.mydomain.com/fullchain;
        ssl_certificate_key /var/lib/acme/live/www.mydomain.com/privkey;

        ssl_ciphers "EECDH+AESGCM:EDH+AESGCM:AES256+EECDH:AES256+EDH";
        ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
        ssl_prefer_server_ciphers on;
        ssl_session_cache shared:SSL:10m;
        add_header Strict-Transport-Security "max-age=63072000; includeSubdomains; preload";
        add_header X-Frame-Options DENY;
        add_header X-Content-Type-Options nosniff;
        ssl_session_tickets off; # Requires nginx >= 1.5.9
        ssl_stapling on; # Requires nginx >= 1.3.7
        ssl_stapling_verify on; # Requires nginx >= 1.3.7
At this point, if you hit www.mydomain.com over SSL, you should be able to inspect the SSL certificate and see that it's considered valid by your browser (I tested it in Chrome, Firefox and Safari). The Issuer Name has Organization Name "Let's Encrypt" and Common Name "Let's Encrypt Authority X1".
6) Configure cron job for SSL certificate renewal
Let's Encrypt certificates expire 90 days after the issue date, so you need to renew them more often than you are used to with regular SSL certificates. I added this line to my crontab on the server that handles www.mydomain.com:
# m h  dom mon dow   command
0 0 1 * * /usr/local/bin/acmetool reconcile --batch; service nginx restart
This runs the acmetool "reconcile" command in batch mode (with no input required from the user) at midnight on the 1st day of every month, then restarts nginx just in case the certificate has changed. If the Let's Encrypt SSL certificate is 30 days away from expiring, acmetool reconcile will renew it.
I think Let's Encrypt is a great service, and you should start using it if you're not already!


Initial experiences with the Prometheus monitoring system

Fri, 11/20/2015 - 22:23
I've been looking for a while for a monitoring system written in Go, self-contained and easy to deploy. I think I finally found what I was looking for in Prometheus, a monitoring system open-sourced by SoundCloud and started there by ex-Googlers who took their inspiration from Google's Borgmon system.

Prometheus is a pull system, where the monitoring server pulls data from its clients by hitting a special HTTP handler exposed by each client ("/metrics" by default) and retrieving a list of metrics from that handler. The output of /metrics is plain text, which makes it fairly easily parseable by humans as well, and also helps in troubleshooting.

Here's a subset of the OS-level metrics that are exposed by a client running the node_exporter Prometheus binary (and available when you hit http://client_ip_or_name:9100/metrics):

# HELP node_cpu Seconds the cpus spent in each mode.
# TYPE node_cpu counter
node_cpu{cpu="cpu0",mode="guest"} 0
node_cpu{cpu="cpu0",mode="idle"} 2803.93
node_cpu{cpu="cpu0",mode="iowait"} 31.38
node_cpu{cpu="cpu0",mode="irq"} 0
node_cpu{cpu="cpu0",mode="nice"} 2.26
node_cpu{cpu="cpu0",mode="softirq"} 0.23
node_cpu{cpu="cpu0",mode="steal"} 21.16
node_cpu{cpu="cpu0",mode="system"} 25.84
node_cpu{cpu="cpu0",mode="user"} 79.94
# HELP node_disk_io_now The number of I/Os currently in progress.
# TYPE node_disk_io_now gauge
node_disk_io_now{device="xvda"} 0
# HELP node_disk_io_time_ms Milliseconds spent doing I/Os.
# TYPE node_disk_io_time_ms counter
node_disk_io_time_ms{device="xvda"} 44608
# HELP node_disk_io_time_weighted The weighted # of milliseconds spent doing I/Os. See https://www.kernel.org/doc/Documentation/iostats.txt.
# TYPE node_disk_io_time_weighted counter
node_disk_io_time_weighted{device="xvda"} 959264

There are many such "exporters" available for Prometheus, exposing metrics in the format expected by the Prometheus server from systems such as Apache, MySQL, PostgreSQL, HAProxy and many others (see a list here).

What drew me to Prometheus though was the fact that it allows for easy instrumentation of code by providing client libraries for many languages: Go, Java/Scala, Python, Ruby and others. 
One of the main advantages of Prometheus over alternative systems such as Graphite is the rich query language that it provides. You can associate labels (which are arbitrary key/value pairs) with any metrics, and you are then able to query the system by label. I'll show examples in this post. Here's a more in-depth comparison between Prometheus and Graphite.
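As a quick illustration of instrumentation with labels, here is a minimal sketch using the Python client library (the metric name, label and port are made up for the example):

from prometheus_client import Counter, start_http_server
import random, time

# a counter with a 'status' label; Prometheus can later be queried per label,
# e.g. rate(app_requests_total{status="error"}[5m])
REQUESTS = Counter('app_requests_total', 'Total requests handled', ['status'])

if __name__ == '__main__':
    start_http_server(8000)  # exposes /metrics on port 8000
    while True:
        REQUESTS.labels(status=random.choice(['ok', 'error'])).inc()
        time.sleep(1)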
Installation (on Ubuntu 14.04)
I put together an ansible role that is loosely based on Brian Brazil's demo_prometheus_ansible repo.
Check out my ansible-prometheus repo for this ansible role, which installs Prometheus, node_exporter and PromDash (a ruby-based dashboard builder). For people not familiar with ansible, most of the installation commands are in the install.yml task file. Here is the sequence of installation actions, in broad strokes.
For the Prometheus server:
  • download prometheus-0.16.1.linux-amd64.tar.gz from https://github.com/prometheus/prometheus/releases/download
  • extract tar.gz into /opt/prometheus/dist and link /opt/prometheus/prometheus-server to /opt/prometheus/dist/prometheus-0.16.1.linux-amd64
  • create Prometheus configuration file from ansible template and drop it in /etc/prometheus/prometheus.yml (more on the config file later)
  • create Prometheus default command-line options file from ansible template and drop it in /etc/default/prometheus
  • create Upstart script for Prometheus in /etc/init/prometheus.conf:
# Run prometheus

start on startup

chdir /opt/prometheus/prometheus-server

script
./prometheus -config.file /etc/prometheus/prometheus.yml
end script
For node_exporter:
  • download node_exporter-0.12.0rc1.linux-amd64.tar.gz from https://github.com/prometheus/node_exporter/releases/download
  • extract tar.gz into /opt/prometheus/dist and move node_exporter binary to /opt/prometheus/bin/node_exporter
  • create Upstart script for node_exporter in /etc/init/prometheus_node_exporter.conf:
# Run prometheus node_exporter
start on startup
script
  /opt/prometheus/bin/node_exporter
end script
For PromDash:
  • git clone from https://github.com/prometheus/promdash
  • follow instructions in the Prometheus tutorial from Digital Ocean (can't stop myself from repeating that D.O. publishes the best technical tutorials out there!)
Here is a minimal Prometheus configuration file (/etc/prometheus/prometheus.yml):
global:
  scrape_interval: 30s
  evaluation_interval: 5s

scrape_configs:
  - job_name: 'prometheus'
    target_groups:
      - targets:
        - prometheus.example.com:9090
  - job_name: 'node'
    target_groups:
      - targets:
        - prometheus.example.com:9100
        - api01.example.com:9100
        - api02.example.com:9100
        - test-api01.example.com:9100
        - test-api02.example.com:9100
The configuration file format for Prometheus is well documented in the official docs. My example shows that the Prometheus server itself is monitored (or "scraped" in Prometheus parlance) on port 9090, and that OS metrics are also scraped from 5 clients which are running the node_exporter binary on port 9100, including the Prometheus server.
At this point, you can start Prometheus and node_exporter on your Prometheus server via Upstart:
# start prometheus
# start prometheus_node_exporter
Then you should be able to hit http://prometheus.example.com:9100 to see the metrics exposed by node_exporter, and more importantly http://prometheus.example.com:9090 to see the default Web console included in the Prometheus server. A demo page available from Robust Perception can be examined here.
Note that Prometheus also provides default Web consoles for node_exporter OS-level metrics. They are available at http://prometheus.example.com:9090/consoles/node.html (the ansible-prometheus role installs nginx and redirects http://prometheus.example.com:80 to the previous URL). The node consoles show CPU, Disk I/O and Memory graphs and also network traffic metrics for each client running node_exporter. 
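If you are setting up that redirect by hand rather than through the ansible role, one simple nginx sketch (not necessarily what the role itself does) would be:

server {
    listen 80;
    server_name prometheus.example.com;

    location / {
        return 301 http://prometheus.example.com:9090/consoles/node.html;
    }
}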




Working with the MySQL exporter
I installed the mysqld_exporter binary on my Prometheus server box.
# cd /opt/prometheus/dist
# git clone https://github.com/prometheus/mysqld_exporter.git
# cd mysqld_exporter
# make
Then I created a wrapper script I called run_mysqld_exporter.sh:
# cat run_mysqld_exporter.sh
#!/bin/bash

export DATA_SOURCE_NAME="dbuser:dbpassword@tcp(dbserver:3306)/dbname"; ./mysqld_exporter
Two important notes here:
1) Note the somewhat awkward format for the DATA_SOURCE_NAME environment variable. I tried many other formats, but only this one worked for me. The wrapper script's main purpose is to define this variable properly. With some of my other tries, I got this error message:
INFO[0089] Error scraping global state: Default addr for network 'dbserver:3306' unknown  file=mysqld_exporter.go line=697
You could also define this variable in ~/.bashrc but in that case it may clash with other  Prometheus exporters (the one for PostgreSQL for example) which also need to define this variable.
2) Note that the dbuser specified in the DATA_SOURCE_NAME variable needs to have either SUPER or REPLICATION CLIENT privileges on the MySQL server you want to monitor. I ran a SQL statement of this form:
GRANT REPLICATION CLIENT ON *.* TO dbuser@'%' IDENTIFIED BY 'dbpassword';

I created an Upstart init script I called /etc/init/prometheus_mysqld_exporter.conf:
# cat /etc/init/prometheus_mysqld_exporter.conf
# Run prometheus mysqld exporter
start on startup
chdir /opt/prometheus/dist/mysqld_exporter
script
  ./run_mysqld_exporter.sh
end script
I modified the Prometheus server configuration file (/etc/prometheus/prometheus.yml) and added a scrape job for the MySQL metrics:

  - job_name: 'mysql'
    honor_labels: true
    target_groups:
      - targets:
        - prometheus.example.com:9104

I restarted the Prometheus server:

# stop prometheus
# start prometheus

Then I started up mysqld_exporter via Upstart:
# start prometheus_mysqld_exporter
If everything goes well, the metrics scraped from MySQL will be available at http://prometheus.example.com:9104/metrics
Here are some of the available metrics:
# HELP mysql_global_status_innodb_data_reads Generic metric from SHOW GLOBAL STATUS.
# TYPE mysql_global_status_innodb_data_reads untyped
mysql_global_status_innodb_data_reads 12660
# HELP mysql_global_status_innodb_data_writes Generic metric from SHOW GLOBAL STATUS.
# TYPE mysql_global_status_innodb_data_writes untyped
mysql_global_status_innodb_data_writes 528790
# HELP mysql_global_status_innodb_data_written Generic metric from SHOW GLOBAL STATUS.
# TYPE mysql_global_status_innodb_data_written untyped
mysql_global_status_innodb_data_written 9.879318016e+09
# HELP mysql_global_status_innodb_dblwr_pages_written Generic metric from SHOW GLOBAL STATUS.
# TYPE mysql_global_status_innodb_dblwr_pages_written untyped
mysql_global_status_innodb_dblwr_pages_written 285184
# HELP mysql_global_status_innodb_row_ops_total Total number of MySQL InnoDB row operations.
# TYPE mysql_global_status_innodb_row_ops_total counter
mysql_global_status_innodb_row_ops_total{operation="deleted"} 14580
mysql_global_status_innodb_row_ops_total{operation="inserted"} 847656
mysql_global_status_innodb_row_ops_total{operation="read"} 8.1021419e+07
mysql_global_status_innodb_row_ops_total{operation="updated"} 35305

Most of the metrics exposed by mysqld_exporter are of type Counter, which means they always increase. A meaningful number to graph then is not their absolute value, but their rate of change. For example, for the mysql_global_status_innodb_row_ops_total metric, the rate of change of reads for the last 5 minutes (reads/sec) can be expressed as:
rate(mysql_global_status_innodb_row_ops_total{operation="read"}[5m])
This is also an example of a Prometheus query which filters by a specific label (in this case {operation="read"})
A good way to get a feel for the metrics available to the Prometheus server is to go to the Web console and graphing tool available at http://prometheus.example.com:9090/graph. You can copy and paste the line above into the Expression edit box and click Execute. You should see something like this graph in the Graph tab:


It's important to familiarize yourself with the 4 types of metrics handled by Prometheus: Counter, Gauge, Histogram and Summary. 
Working with the Postgres exporter
Although not an official Prometheus package, the Postgres exporter has worked just fine for me. 
I installed the postgres_exporter binary on my Prometheus server box.
# cd /opt/prometheus/dist
# git clone https://github.com/wrouesnel/postgres_exporter.git
# cd postgres_exporter
# make
Then I created a wrapper script I called run_postgres_exporter.sh:

# cat run_postgres_exporter.sh
#!/bin/bash

export DATA_SOURCE_NAME="postgres://dbuser:dbpassword@dbserver/dbname"; ./postgres_exporter
Note that the format for DATA_SOURCE_NAME is a bit different from the MySQL format.
I created an Upstart init script I called /etc/init/prometheus_postgres_exporter.conf:
# cat /etc/init/prometheus_postgres_exporter.conf
# Run prometheus postgres exporter
start on startup
chdir /opt/prometheus/dist/postgres_exporter
script
  ./run_postgres_exporter.sh
end script
I modified the Prometheus server configuration file (/etc/prometheus/prometheus.yml) and added a scrape job for the Postgres metrics:

  - job_name: 'postgres'
    honor_labels: true
    target_groups:
      - targets:
        - prometheus.example.com:9113

I restarted the Prometheus server:

# stop prometheus
# start prometheus
Then I started up postgres_exporter via Upstart:
# start prometheus_postgres_exporter
If everything goes well, the metrics scraped from Postgres will be available at http://prometheus.example.com:9113/metrics
Here are some of the available metrics:
# HELP pg_stat_database_tup_fetched Number of rows fetched by queries in this database
# TYPE pg_stat_database_tup_fetched counter
pg_stat_database_tup_fetched{datid="1",datname="template1"} 7.730469e+06
pg_stat_database_tup_fetched{datid="12998",datname="template0"} 0
pg_stat_database_tup_fetched{datid="13003",datname="postgres"} 7.74208e+06
pg_stat_database_tup_fetched{datid="16740",datname="mydb"} 2.18194538e+08
# HELP pg_stat_database_tup_inserted Number of rows inserted by queries in this database
# TYPE pg_stat_database_tup_inserted counter
pg_stat_database_tup_inserted{datid="1",datname="template1"} 0
pg_stat_database_tup_inserted{datid="12998",datname="template0"} 0
pg_stat_database_tup_inserted{datid="13003",datname="postgres"} 0
pg_stat_database_tup_inserted{datid="16740",datname="mydb"} 3.5467483e+07
# HELP pg_stat_database_tup_returned Number of rows returned by queries in this database
# TYPE pg_stat_database_tup_returned counter
pg_stat_database_tup_returned{datid="1",datname="template1"} 6.41976558e+08
pg_stat_database_tup_returned{datid="12998",datname="template0"} 0
pg_stat_database_tup_returned{datid="13003",datname="postgres"} 6.42022129e+08
pg_stat_database_tup_returned{datid="16740",datname="mydb"} 7.114057378094e+12
# HELP pg_stat_database_tup_updated Number of rows updated by queries in this database
# TYPE pg_stat_database_tup_updated counter
pg_stat_database_tup_updated{datid="1",datname="template1"} 1
pg_stat_database_tup_updated{datid="12998",datname="template0"} 0
pg_stat_database_tup_updated{datid="13003",datname="postgres"} 1
pg_stat_database_tup_updated{datid="16740",datname="mydb"} 4351

These metrics are also of type Counter, so to generate meaningful graphs for them, you need to plot their rates. For example, to see the rate of rows returned per second from the database called mydb, you would plot this expression:
rate(pg_stat_database_tup_returned{datid="16740",datname="mydb"}[5m])
The Prometheus expression evaluator available at http://prometheus.example.com:9090/graph is again your friend. BTW, if you start typing pg_ in the expression field, you'll see a drop-down filled automatically with all the available metrics starting with pg_. Handy!
Working with the AWS CloudWatch exporter
This is one of the officially supported Prometheus exporters, used for graphing and alerting on AWS CloudWatch metrics. I installed it on the Prometheus server box. It's a Java app, so it needs a JDK installed, as well as Maven for building the app.
# cd /opt/prometheus/dist
# git clone https://github.com/prometheus/cloudwatch_exporter.git
# apt-get install maven2 openjdk-7-jdk
# cd cloudwatch_exporter
# mvn package
The cloudwatch_exporter app needs AWS credentials in order to connect to CloudWatch and read the metrics. Here's what I did:
  1. created an AWS IAM user called cloudwatch_ro and downloaded its access key and secret key
  2. created an AWS IAM custom policy called CloudWatchReadOnlyAccess-201511181031, which includes the default CloudWatchReadOnlyAccess policy (the custom policy is not strictly necessary, and you can use the default one, but I preferred a custom one because I may need to make further edits to the policy file)
  3. attached the CloudWatchReadOnlyAccess-201511181031 policy to the cloudwatch_ro user
  4. created a file called ~/.aws/credentials with the contents:
[default]
aws_access_key_id=ACCESS_KEY_FOR_USER_CLOUDWATCH_RO
aws_secret_access_key=SECRET_KEY_FOR_USER_CLOUDWATCH_RO
The cloudwatch_exporter app also needs a json file containing the CloudWatch metrics we want it to retrieve from AWS. Here is an example of ELB-related metrics I specified in a file called cloudwatch.json:
{
  "region": "us-west-2",
  "metrics": [
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "RequestCount",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": ["LB1", "LB2"]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "BackendConnectionErrors",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": ["LB1", "LB2"]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "HTTPCode_Backend_2XX",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": ["LB1", "LB2"]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "HTTPCode_Backend_4XX",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": ["LB1", "LB2"]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "HTTPCode_Backend_5XX",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": ["LB1", "LB2"]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "HTTPCode_ELB_4XX",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": ["LB1", "LB2"]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "HTTPCode_ELB_5XX",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": ["LB1", "LB2"]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "SurgeQueueLength",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": ["LB1", "LB2"]},
     "aws_statistics": ["Maximum", "Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "SpilloverCount",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": ["LB1", "LB2"]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "Latency",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": ["LB1", "LB2"]},
     "aws_statistics": ["Average"]}
  ]
}
Note that you need to look up the exact syntax for each metric name, dimensions and preferred statistics in the AWS CloudWatch documentation. For ELB metrics, the documentation is here. The CloudWatch name corresponds to the cloudwatch_exporter JSON parameter aws_metric_name, dimensions corresponds to aws_dimensions, and preferred statistics corresponds to aws_statistics.
I modified the Prometheus server configuration file (/etc/prometheus/prometheus.yml) and added a scrape job for the CloudWatch metrics:

  - job_name: 'cloudwatch'
    honor_labels: true
    target_groups:
      - targets:
        - prometheus.example.com:9106

I restarted the Prometheus server:

# stop prometheus
# start prometheus

I created an Upstart init script I called /etc/init/prometheus_cloudwatch_exporter.conf:
# cat /etc/init/prometheus_cloudwatch_exporter.conf
# Run prometheus cloudwatch exporter
start on startup
chdir /opt/prometheus/dist/cloudwatch_exporter
script
  /usr/bin/java -jar target/cloudwatch_exporter-0.2-SNAPSHOT-jar-with-dependencies.jar 9106 cloudwatch.json
end script
Then I started up cloudwatch_exporter via Upstart:
# start prometheus_cloudwatch_exporter
If everything goes well, the metrics scraped from CloudWatch will be available at http://prometheus.example.com:9106/metrics
Here are some of the available metrics:
# HELP aws_elb_request_count_sum CloudWatch metric AWS/ELB RequestCount Dimensions: [AvailabilityZone, LoadBalancerName] Statistic: Sum Unit: Count
# TYPE aws_elb_request_count_sum gauge
aws_elb_request_count_sum{job="aws_elb",load_balancer_name="LB1",availability_zone="us-west-2a",} 1.0
aws_elb_request_count_sum{job="aws_elb",load_balancer_name="LB1",availability_zone="us-west-2c",} 1.0
aws_elb_request_count_sum{job="aws_elb",load_balancer_name="LB2",availability_zone="us-west-2c",} 2.0
aws_elb_request_count_sum{job="aws_elb",load_balancer_name="LB2",availability_zone="us-west-2a",} 12.0
# HELP aws_elb_httpcode_backend_2_xx_sum CloudWatch metric AWS/ELB HTTPCode_Backend_2XX Dimensions: [AvailabilityZone, LoadBalancerName] Statistic: Sum Unit: Count
# TYPE aws_elb_httpcode_backend_2_xx_sum gauge
aws_elb_httpcode_backend_2_xx_sum{job="aws_elb",load_balancer_name="LB1",availability_zone="us-west-2a",} 1.0
aws_elb_httpcode_backend_2_xx_sum{job="aws_elb",load_balancer_name="LB1",availability_zone="us-west-2c",} 1.0
aws_elb_httpcode_backend_2_xx_sum{job="aws_elb",load_balancer_name="LB2",availability_zone="us-west-2c",} 2.0
aws_elb_httpcode_backend_2_xx_sum{job="aws_elb",load_balancer_name="LB2",availability_zone="us-west-2a",} 12.0
# HELP aws_elb_latency_average CloudWatch metric AWS/ELB Latency Dimensions: [AvailabilityZone, LoadBalancerName] Statistic: Average Unit: Seconds
# TYPE aws_elb_latency_average gauge
aws_elb_latency_average{job="aws_elb",load_balancer_name="LB1",availability_zone="us-west-2a",} 0.5571935176849365
aws_elb_latency_average{job="aws_elb",load_balancer_name="LB1",availability_zone="us-west-2c",} 0.5089397430419922
aws_elb_latency_average{job="aws_elb",load_balancer_name="LB2",availability_zone="us-west-2c",} 0.035556912422180176
aws_elb_latency_average{job="aws_elb",load_balancer_name="LB2",availability_zone="us-west-2a",} 0.0031794110933939614

Note that there are 3 labels available to query the metrics above: job, load_balancer_name and availability_zone. 
If we specify something like aws_elb_request_count_sum{job="aws_elb"} in the expression evaluator at http://prometheus.example.com:9090/graph, we'll see 4 graphs, one for each load_balancer_name/availability_zone combination. 
To see only graphs related to a specific load balancer, say LB1, we can specify an expression of the form:
aws_elb_request_count_sum{job="aws_elb",load_balancer_name="LB1"}
In this case, we'll see 2 graphs for LB1, one for each availability zone.
In order to see the request count across all availability zones for a specific load balancer, we need to apply the sum function:
sum(aws_elb_request_count_sum{job="aws_elb",load_balancer_name="LB1"}) by (load_balancer_name)
In this case, we'll see one graph with the request count across the 2 availability zones pertaining to LB1.
If we want to graph all load balancers but only show one graph per balancer, summing all availability zones for each balancer, we would use an expression like this:
sum(aws_elb_request_count_sum{job="aws_elb"}) by (load_balancer_name)
So in this case we'll see 2 graphs, one for LB1 and one for LB2, with each graph summing the request count across the availability zones for LB1 and LB2 respectively.
Note that in all the expressions above, since the job label has the value "aws_elb" common to all metrics, it can be dropped from the queries because it doesn't produce any useful filtering.
For other AWS CloudWatch metrics, consult the Amazon CloudWatch Namespaces, Dimensions and Metrics Reference.

Instrumenting Go code with Prometheus
For me, the most interesting feature of Prometheus is that it allows for easy instrumentation of the code. Instead of pushing metrics a la statsd and Graphite, a web app needs to implement a /metrics handler and use the Prometheus client library code to publish app-level metrics to that handler. The Prometheus server will then hit /metrics on the client and pull/scrape the metrics.

More specifics for Go code instrumentation

1) Declare and register Prometheus metrics in your code

I have the following 2 variables defined in an init.go file in a common package that gets imported in all of the webapp code:

var PrometheusHTTPRequestCount = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Namespace: "myapp",
        Name:      "http_request_count",
        Help:      "The number of HTTP requests.",
    },
    []string{"method", "type", "endpoint"},
)

var PrometheusHTTPRequestLatency = prometheus.NewSummaryVec(
    prometheus.SummaryOpts{
        Namespace: "myapp",
        Name:      "http_request_latency",
        Help:      "The latency of HTTP requests.",
    },
    []string{"method", "type", "endpoint"},
)

Note that the first metric is a CounterVec, which in the Prometheus client_golang library specifies a Counter metric that can also get labels associated with it. The labels in my case are "method", "type" and "endpoint". The purpose of this metric is to measure the HTTP request count. Since it's a Counter, it will increase monotonically, so for graphing purposes we'll need to plot its rate and not its absolute value.

The second metric is a SummaryVec, which in the client_golang library specifies a Summary metric with labels. I use the same labels as for the CounterVec metric. The purpose of this metric is to measure the HTTP request latency. Because it's a Summary, it will provide the sum of the measurements and their count, as well as quantiles for the measurements.

These 2 variables then get registered in the init function:

func init() {
    // Register Prometheus metric trackers
    prometheus.MustRegister(PrometheusHTTPRequestCount)
    prometheus.MustRegister(PrometheusHTTPRequestLatency)
}

2) Let Prometheus handle the /metrics endpoint

The GitHub README for client_golang shows the simplest way of doing this:

http.Handle("/metrics", prometheus.Handler())
http.ListenAndServe(":8080", nil)

However, most of the Go webapp code will rely on some sort of web framework, so YMMV. In our case, I had to insert the prometheus.Handler function as a variable pretty deep in our framework code in order to associate it with the /metrics endpoint.
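
If you don't have a framework in the way, the same wiring with just the standard library's ServeMux would look something like this (a minimal sketch, not our actual framework code; the /customers/find handler is a placeholder):

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
)

func main() {
    mux := http.NewServeMux()

    // Application endpoints live on the same mux as /metrics.
    mux.HandleFunc("/customers/find", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte(`{"success": true}`))
    })

    // The Prometheus server scrapes this endpoint.
    mux.Handle("/metrics", prometheus.Handler())

    http.ListenAndServe(":8080", mux)
}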

3) Modify Prometheus metrics in your code

The final step in getting Prometheus to instrument your code is to modify the Prometheus metrics you registered by incrementing Counter variables and taking measurements for Summary variables in the appropriate places in your app. In my case, I increment PrometheusHTTPRequestCount in every HTTP handler in my webapp by calling its Inc() method. I also measure the HTTP latency, i.e. the time it took for the handler code to execute, and call the Observe() method on the PrometheusHTTPRequestLatency variable.

The values I associate with the "method", "type" and "endpoint" labels come from the endpoint URL associated with each instrumented handler. As an example, for an HTTP GET request to a URL such as http://api.example.com/customers/find, "method" is the HTTP method used in the request ("GET"), "type" is "customers", and "endpoint" is "/customers/find".

Here is the code I use for modifying the Prometheus metrics (R is an object/struct which represents the HTTP request):

    // Modify Prometheus metrics
    pkg, endpoint := common.SplitUrlForMonitoring(R.URL.Path)
    method := R.Method
    PrometheusHTTPRequestCount.WithLabelValues(method, pkg, endpoint).Inc()
    PrometheusHTTPRequestLatency.WithLabelValues(method, pkg, endpoint).Observe(float64(elapsed) / float64(time.Millisecond))
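
The elapsed value and the SplitUrlForMonitoring helper above come from our own code, which isn't shown here. As a hypothetical sketch of how the pieces fit together, a middleware along these lines would time the handler and derive the labels as described (the URL splitting is simplified):

// Hypothetical sketch: wrap a handler, time it, and update both metrics.
// Needs "net/http", "strings" and "time" imported, plus the metric
// variables declared in the init.go snippet above.
func instrument(next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        next(w, r)
        elapsed := time.Since(start)

        // e.g. GET /customers/find -> type "customers", endpoint "/customers/find"
        parts := strings.Split(strings.Trim(r.URL.Path, "/"), "/")
        pkg := ""
        if len(parts) > 0 {
            pkg = parts[0]
        }
        endpoint := r.URL.Path

        PrometheusHTTPRequestCount.WithLabelValues(r.Method, pkg, endpoint).Inc()
        PrometheusHTTPRequestLatency.WithLabelValues(r.Method, pkg, endpoint).Observe(float64(elapsed) / float64(time.Millisecond))
    }
}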


4) Retrieving your metrics

Assuming your web app runs on port 8080, you'll need to modify the Prometheus server configuration file and add a scrape job for the app-level metrics. I have something similar to this in /etc/prometheus/prometheus.yml:

  - job_name: 'myapp-api'
    target_groups:
      - targets:
        - api01.example.com:8080
        - api02.example.com:8080
        labels:
          group: 'production'
      - targets:
        - test-api01.example.com:8080
        - test-api02.example.com:8080
        labels:
          group: 'test'

Note an extra label called "group" defined in the configuration file. It has the values "production" and "test" respectively, and allows for the filtering of Prometheus measurements by the environment of the monitored nodes.

Whenever the Prometheus configuration file gets modified, you need to restart the Prometheus server:

# stop prometheus
# start prometheus

At this point, the metrics scraped from the webapp servers will be available at http://api01.example.com:8080/metrics.

Here are some of the available metrics:
# HELP myapp_http_request_count The number of HTTP requests.
# TYPE myapp_http_request_count counter
myapp_http_request_count{endpoint="/merchant/register",method="GET",type="admin"} 2928
# HELP myapp_http_request_latency The latency of HTTP requests.
# TYPE myapp_http_request_latency summary
myapp_http_request_latency{endpoint="/merchant/register",method="GET",type="admin",quantile="0.5"} 31.284808
myapp_http_request_latency{endpoint="/merchant/register",method="GET",type="admin",quantile="0.9"} 33.353354
myapp_http_request_latency{endpoint="/merchant/register",method="GET",type="admin",quantile="0.99"} 33.353354
myapp_http_request_latency_sum{endpoint="/merchant/register",method="GET",type="admin"} 93606.57930099976

myapp_http_request_latency_count{endpoint="/merchant/register",method="GET",type="admin"} 2928

Note that myapp_http_request_count and myapp_http_request_latency_count show the same value for the method/type/endpoint combination in this example. You could argue that myapp_http_request_count is redundant in this case. There could be instances where you want to increment a counter without taking a measurement for the summary, so it's still useful to have both. 
Also note that myapp_http_request_latency, being a summary, computes 3 different quantiles: 0.5, 0.9 and 0.99 (so 50%, 90% and 99% of the measurements respectively fall under the given numbers for the latencies).
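
Those three quantiles are the client_golang defaults. If you need different ones, my understanding is that you can set them explicitly via the Objectives field of SummaryOpts, where each target quantile is paired with its allowed estimation error. A sketch based on the declaration shown earlier:

var PrometheusHTTPRequestLatency = prometheus.NewSummaryVec(
    prometheus.SummaryOpts{
        Namespace: "myapp",
        Name:      "http_request_latency",
        Help:      "The latency of HTTP requests.",
        // Track the median, 95th and 99th percentiles with the given error margins.
        Objectives: map[float64]float64{0.5: 0.05, 0.95: 0.01, 0.99: 0.001},
    },
    []string{"method", "type", "endpoint"},
)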

5) Graphing your metrics with PromDash
The PromDash tool provides an easy way to create dashboards with a look and feel similar to Graphite. PromDash is available at http://prometheus.example.com:3000. 
First you need to define a server by clicking on the Servers link up top, then entering a name ("prometheus") and the URL of the Prometheus server ("http://prometheus.example.com:9090/").
Then click on Dashboards up top, and create a new directory, which offers a way to group dashboards. You can call it something like "myapp". Now you can create a dashboard (you also need to select the directory it belongs to). Once you are in the Dashboard create/edit screen, you'll see one empty graph with the default title "Title". 
When you hover over the header of the graph, you'll see other buttons available. You want to click on the 2nd button from the left, called Datasources, then click Add Expression. Note that the server field is already pre-filled. If you start typing myapp in the expression field, you should see the metrics exposed by your application (for example myapp_http_request_count and myapp_http_request_latency).
To properly graph a Counter-type metric, you need to plot its rate. For example, this expression shows the HTTP requests/second rate, measured over the last minute, for all the production endpoints in my webapp:
rate(myapp_http_request_count{group="production",job="myapp-api"}[1m])
(the job and group values correspond to what we specified in /etc/prometheus/prometheus.yml)
If you want to show the HTTP request/second rate for test endpoints of "admin" type, use this expression:
rate(myapp_http_request_count{group="test",job="myapp-api",type="admin"}[1m])
If you want to show the HTTP request/second rate for a specific production endpoint, use an expression similar to this:
rate(myapp_http_request_count{group="production",job="myapp-api",endpoint="/merchant/register",type="admin"}[1m])
Once you enter the expression you want, close the Datasources form (it will save everything). Also change the title by clicking on the button called "Graph and Axis Settings". In that form, you can also specify that you want the plot lines stacked as opposed to regular lines.
For latency metrics, you don't need to look at the rate. Instead, you can look at a specific quantile. Let's say you want to plot the 99% quantile for latencies observed across all production endpoints, for write operations (corresponding to HTTP methods other than GET). Then you would use an expression like this:
myapp_http_request_latency{method!="GET",quantile="0.99",group="production",job="myapp-api"}
As for the HTTP request/second graphs, you can refine the latency queries by specifying a type, an endpoint or both:
myapp_http_request_latency{method!="GET",quantile="0.99",group="production",type="admin",endpoint="/merchant/register",job="myapp-api"}
I hope you have enough information at this point to go wild with dashboards! Remember, who has the most dashboards wins!
Wrapping up
I wanted to write this blog post so I don't forget all the stuff that was involved in setting up and using Prometheus. It's a lot, but it's also not that bad once you get the hang of it. In particular, the Prometheus server itself is remarkably easy to set up and maintain, a refreshing change from other monitoring systems I've used before.
One thing I haven't touched on is the alerting mechanism used in Prometheus. I haven't looked at that yet, since I'm still using a combination of Pingdom, monit and Jenkins for my alerting. I'll tackle Prometheus alerting in another blog post.
I really like Prometheus so far and I hope you'll give it a try!

Why I like golang: a programming autobiography

Mon, 11/16/2015 - 19:07
Tried my hand at writing a story on Medium.

Notes on testing in golang

Fri, 11/06/2015 - 00:00
I've been doing a lot of testing of APIs written in the Go programming language in the last few months. It's been FUN! Writing code in Golang is also very fun. Can't have good code quality without tests though, so I am going to focus on the testing part in this post.


Unit testing
Go is a "batteries included" type of language, just like Python, so naturally it comes with its own testing package, which provides support for automated execution of unit tests. Here's an excerpt from its documentation:
Package testing provides support for automated testing of Go packages. It is intended to be used in concert with the “go test” command, which automates execution of any function of the form
func TestXxx(*testing.T)
where Xxx can be any alphanumeric string (but the first letter must not be in [a-z]) and serves to identify the test routine.
The functionality offered by the testing package is fairly bare-bones though, so I've actually been using another package called testify, which provides test suites and more friendly assertions.

Whether you're using testing or a 3rd party package such as testify, the Go way of writing unit tests is to include them in a file ending with _test.go in the same directory as your code under test. For example, if you have a file called customers.go which deals with customer management business logic, you would write unit tests for that code and put them in file called customers_test.go in the same directory as customers.go. Then, when you run the "go test" command in that same directory, your unit tests will be automatically run. In fact, "go test" will discover all tests in files named *_test.go and run them. You can find more details on Go unit testing in the Testing section of the "How to Write Go Code" article.
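
As a small, made-up example (the NormalizePhoneNumber function is hypothetical), a unit test in customers_test.go using testify's assert package might look like this:

package customers

import (
    "testing"

    "github.com/stretchr/testify/assert"
)

// 'go test' run in this directory will discover and execute this function.
func TestNormalizePhoneNumber(t *testing.T) {
    normalized, err := NormalizePhoneNumber("(555) 555-0000")
    assert.Nil(t, err, "valid numbers should not return an error")
    assert.Equal(t, "+15555550000", normalized, "numbers should be normalized to E.164 format")
}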

Integration testing
I'll give some examples of how I organize my integration tests. Let's take again the example of testing an API that deals with the management of customers. An integration test, by definition, will hit the API endpoint from the outside, via HTTP. This is in contrast with a unit test, which tests the business logic of the API handler internally and lives, as I said above, in the same package as the API code.

For my integration tests, I usually create a directory per set of endpoints that I want to test, something like core-api for example. In there I drop a file called main.go where I set some constants used throughout my tests:

package main

import (
        "fmt"
)

const API_VERSION = "v2"
const API_HOST = "myapi.example.com"
const API_PORT = 8000
const API_PROTO = "http"
const API_INIT_KEY = "some_init_key"
const API_SECRET_KEY = "some_secret_key"
const TEST_PHONE_NUMBER = "+15555550000"
const DEBUG = true

func init() {
        fmt.Printf("API_PROTO:%s; API_HOST:%s; API_PORT:%d\n", API_PROTO, API_HOST, API_PORT)
}

func main() {
}

For integration tests related to the customer API, I create a file called customer_test.go with the following boilerplate:

package main

import (
        "fmt"
        "testing"

        "github.com/stretchr/testify/assert"
        "github.com/stretchr/testify/suite"
)

// Define the suite, and absorb the built-in basic suite
// functionality from testify - including a T() method which
// returns the current testing context
type CustomerTestSuite struct {
        suite.Suite
        apiURL          string
        testPhoneNumber string
}

// Set up variables used in all tests
// this method is called before each test
func (suite *CustomerTestSuite) SetupTest() {
        suite.apiURL = fmt.Sprintf("%s://%s:%d/%s/customers", API_PROTO, API_HOST, API_PORT, API_VERSION)
        suite.testPhoneNumber = TEST_PHONE_NUMBER
}

// Tear down variables used in all tests
// this method is called after each test
func (suite *CustomerTestSuite) TearDownTest() {
}

// In order for 'go test' to run this suite, we need to create
// a normal test function and pass our suite to suite.Run
func TestCustomerTestSuite(t *testing.T) {
        suite.Run(t, new(CustomerTestSuite))
}

By using the testify package, I am able to define a test suite, a struct I call CustomerTestSuite which contains a testify suite.Suite as an anonymous field. Golang uses composition over inheritance, and the effect of embedding a suite.Suite in my test suite is that I can define methods such as SetupTest and TearDownTest on my CustomerTestSuite. I do the common set up for all test functions in SetupTest (which is called before each test function is executed), and the common tear down for all test functions in TearDownTest (which is called after each test function is executed).

In the example above, I set some variables in SetupTest which I will use in every test function I'll define. Here is an example of a test function:

func (suite *CustomerTestSuite) TestCreateCustomerNewEmailNewPhone() {
        url := suite.apiURL
        random_email_addr := fmt.Sprintf("test-user%d@example.com", common.RandomInt(1, 1000000))
        phone_num := suite.testPhoneNumber
        status_code, json_data := create_customer(url, phone_num, random_email_addr)

        customer_id := get_nested_item_property(json_data, "customer", "id")

        assert_success_response(suite.T(), status_code, json_data)
        assert.NotEmpty(suite.T(), customer_id, "customer id should not be empty")
}

The actual HTTP call to the backend API that I want to test happens inside the create_customer function, which I defined in a separate utils.go file:

func create_customer(url, phone_num, email_addr string) (int, map[string]interface{}) {
        fmt.Printf("Sending request to %s\n", url)

        payload := map[string]string{
                "phone_num":  phone_num,
                "email_addr": email_addr,
        }
        ro := &grequests.RequestOptions{}
        ro.JSON = payload

        var resp *grequests.Response
        resp, _ = grequests.Post(url, ro)

        var json_data map[string]interface{}
        status_code := resp.StatusCode
        err := resp.JSON(&json_data)
        if err != nil {
                fmt.Println("Unable to coerce to JSON", err)
                return 0, nil
        }

        return status_code, json_data
}

Notice that I use the grequests package, which is a Golang port of the Python Requests package. Using grequests allows me to encapsulate the HTTP request and response in a sane way, and to easily deal with JSON.
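
For completeness, here is what a GET with query parameters could look like with grequests (a sketch I'm adding for illustration; get_customer and the customer_id parameter are hypothetical):

func get_customer(url, customer_id string) (int, map[string]interface{}) {
        ro := &grequests.RequestOptions{
                Params: map[string]string{"customer_id": customer_id},
        }

        resp, err := grequests.Get(url, ro)
        if err != nil {
                fmt.Println("Request failed", err)
                return 0, nil
        }

        var json_data map[string]interface{}
        if err := resp.JSON(&json_data); err != nil {
                fmt.Println("Unable to coerce to JSON", err)
                return resp.StatusCode, nil
        }

        return resp.StatusCode, json_data
}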

To go back to the TestCreateCustomerNewEmailNewPhone test function, once I get back the response from the API call to create a customer, I call another helper function called assert_success_response, which uses the assert package from testify in order to verify that the HTTP response code was 200 and that certain JSON parameters that we send back with every response (such as error_msg, error_code, req_id) are what we expect them to be:

func assert_success_response(testobj *testing.T, status_code int, json_data map[string]interface{}) {
        assert.Equal(testobj, 200, status_code, "HTTP status code should be 200")
        assert.Equal(testobj, 0.0, json_data["error_code"], "error_code should be 0")
        assert.Empty(testobj, json_data["error_msg"], "error_msg should be empty")
        assert.NotEmpty(testobj, json_data["req_id"], "req_id should not be empty")
        assert.Equal(testobj, true, json_data["success"], "success should be true")
}

To actually run the integration test, I run the usual 'go test' command inside the directory containing my test files.

This pattern has served me well in creating an ever-growing collection of integration tests against our API endpoints.

Test coverage
Part of Golang's "batteries included" series of tools is a test coverage tool. To use it, you first need to run 'go test' with various coverage options. Here is a shell script we use to produce our test coverage numbers:

#!/bin/bash

#
# Run all of our go unit-like tests from each package
#
CTMP=$GOPATH/src/core_api/coveragetmp.out
CREAL=$GOPATH/src/core_api/coverage.out
CMERGE=$GOPATH/src/core_api/merged_coverage.out

set -e
set -x

cp /dev/null $CTMP
cp /dev/null $CREAL
cp /dev/null $CMERGE

go test -v -coverprofile=$CTMP -covermode=count -parallel=9 ./auth
cat $CTMP > $CREAL

go test -v -coverprofile=$CTMP -covermode=count -parallel=9 ./customers
cat $CTMP |tail -n+2 >> $CREAL

#
# Finally run all the go integration tests
#

go test -v -coverprofile=$CTMP -covermode=count -coverpkg=./auth,./customers ./all_test.go
cat $CTMP |tail -n+2 >> $CREAL

rm $CTMP

#
# Merge the coverage report from unit tests and integration tests
#

cd $GOPATH/src/core_api/
cat $CREAL | go run ../samples/mergecover/main.go >> $CMERGE

#
set +x

echo "You can run the following to view the full coverage report!::::"
echo "go tool cover -func=$CMERGE"
echo "You can run the following to generate the html coverage report!::::"
echo "go tool cover -html=$CMERGE -o coverage.html"


The first section of the bash script above runs 'go test' in covermode=count against every sub-package we have (auth, customers etc). It combines the coverprofile output files (CTMP) into a single file (CREAL).

The second section runs the integration tests by calling 'go test' in covermode=count, with coverpkg=[comma-separated list of our packages], against a file called all_test.go. This file starts an HTTP server exposing our APIs, then hits those APIs by calling 'go test' from within the integration test directory.
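
Our all_test.go isn't shown here, but the idea can be sketched roughly like this (startAPIServer stands in for however your app wires up its routes, and the ./core-api path is the integration test directory from earlier, so both are assumptions):

package main

import (
    "os"
    "os/exec"
    "testing"
)

// TestAll starts the API in-process (so the coverage tool sees which handlers
// the integration tests exercise), then shells out to 'go test' in the
// integration test directory, whose tests hit the running server over HTTP.
func TestAll(t *testing.T) {
    go startAPIServer(":8000") // stand-in for the app's real server setup

    cmd := exec.Command("go", "test", "-v", "./core-api/")
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    if err := cmd.Run(); err != nil {
        t.Fatalf("integration tests failed: %v", err)
    }
}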

The coverage numbers from the unit tests and integration tests are then merged into the CMERGE file by running the mergecover tool.

At this point, you can generate an html file via go tool cover -html=$CMERGE -o coverage.html, then inspect coverage.html in a browser. Aim for more than 80% coverage for each package under test.