Using Python to Tackle Big Problems

1. Coding for great good
- 1.1. A workflow for sustainable, correct code
2. A Quick Python (re)fresher
3. Making a module
4. Compartmentalisation
5. Testing your code
- 5.1. A test suite
- 5.2. Different types of test
6. Documenting code
7. Hooking with mercurial and git
8. Making a setup script for your package
- 8.1. Distributing your code

The Python programming language was designed to be simple to read and understand, and most of its design philosophies are based around these goals. In fact, the Zen of Python is an important guiding principle for the language:

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren’t special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you’re Dutch.
Now is better than never.
Although never is often better than right now.
If the implementation is hard to explain, it’s a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let’s do more of those!

There are lots of other good choices of programming language; in this part of the workshop I’m using Python largely because of its clean and easy-to-read syntax, and partly because Python is a language which has the batteries included: there are lots of packages (extensions to the language) which are easy to use and easy to obtain. This definitely isn’t intended to be a full course on Python, and if you feel a bit left behind by some of the syntax, don’t panic. The details of the syntax are probably not important; the underlying principles being covered are.

1 Coding for great good

Most scientists write code to get answers to some deep question about some aspect of the universe, and often don’t think about code as an important product of their research. This is a very tempting trap to fall into (few of us got into scientific research to craft careful computer programs, after all), but it is a dangerous approach to take. There are a number of reasons to treat your code as a product in its own right; just a few are:

You spend a lot of time on it: be proud of it!
Science should be reproducible (and you should make doing so easy if you can).
Research is expensive: we should share code so that we don’t keep re-inventing the wheel.

Of course, writing good code is sometimes harder than writing bad code, but in my own experience any time saved by dashing off code is always lost again later on, as hastily written code is harder to debug, and difficult to maintain or come back to.

1.1 A workflow for sustainable, correct code

We’ll cover a few principles which can help to make your life (and those of your collaborators and readers) easier:

Keeping your code organised
Testing your code
Packaging your code for the world
Writing good documentation

To do this we’ll touch on some things which are specific to the Python programming language, and to the “ecosystem” which is built around it. All of the principles, however, can carry over to other languages (though most of the time Python does it best!).

2 A Quick Python (re)fresher

Python programming encourages breaking your code into small blocks called functions, and that’s the simplest form of Python syntax we’ll cover in this session. Here’s a simple example:

def mean(numbers):
    """
    Take a list of numbers, and find their mean.

    Parameters
    ----------
    numbers : list
        A list of numbers
    """

    return sum(numbers) / len(numbers)

This function takes a list, which is (one of) Python’s array datatypes. They look something like this:

a = [1,3,4,6,9,13]

So we could use this function like this:

average = mean(a)

This calculates the mean of the list of numbers in a, and stores it in average.

2.1 Embracing objectification

In Python everything is an “object”. If you’ve programmed in a language like C++, Objective-C, or Java in the past this might be a familiar idea to you (if you’ve programmed in JavaScript this will also be quite familiar, but oddly different¹).

A code object can be thought of as roughly analogous to a physical object: it has properties, and it can perform various actions. If we wanted to represent a car in code we might want to represent things like its model, color, and engine_size as its properties, and accelerate, brake, and change_gear as its actions.

There are obviously ways that you can do this without turning to a new way of coding, perhaps something like this:

enginesize = 1600
model = "Renault Captur"
color = "Midnight Sierra"

def accelerate(enginesize, model):
    """
    Simulate the acceleration of the car.

    Parameters
    ----------
    enginesize : int
        The engine size of the car in cubic centimetres.
    model : str
        The model of the car.

    Notes
    -----
    We need to know the weight of the car, and the power of the
    engine to calculate the acceleration, so we need to collect
    these details in the arguments of the function.
    """

    # <<< DO SOME PHYSICS AND MATHS >>>

def change_gear(current_gear, new_gear):
    """
    Calculate the effect of changing gear on the speed of the car.

    Parameters
    ----------
    current_gear : int
        The number of the current gear
    new_gear : int
        The number of the gear we're switching to.

    Returns
    -------
    gear_ratio : float
        The ratio between two gears
    """

    # <<< DO SOME MORE MATHS >>>

This is all very well, but suppose we want to simulate two cars: we’re going to end up with lots of variables with names like enginesize_clio, and color_zafira. Objects give us a tidier way of doing this, and make it easier to recycle code.

2.2 A car object

We can design an object in Python using a class, which you can think of as the blueprint to help Python build it.

class Car:
    """
    This class represents a car: a motorised vehicle which can move over land.
    """

    def __init__(self, model, enginesize, color, enginepower=75000, mass=1000):
        """
        Set up a car object.

        Parameters
        ----------
        model : str
            The model of the car.
        enginesize : int
            The engine size of the car in cubic centimetres.
        color : str
            The color of the car.
        enginepower : float
            The power of the engine in watts (the default is only an
            illustrative placeholder).
        mass : float
            The mass of the car in kilograms (the default is only an
            illustrative placeholder).
        """

        self.enginesize = enginesize
        self.model = model
        self.color = color
        # accelerate() needs these two; the defaults above are assumed
        # values, chosen only so that the examples which follow still work.
        self.enginepower = enginepower
        self.mass = mass

    def accelerate(self, time):
        """
        Simulate the acceleration of the car.

        Parameters
        ----------
        time : float
            The time at which to calculate the acceleration

        Notes
        -----
        We need to know the mass of the car, and the power of the
        engine to calculate the acceleration; both are stored on
        the object when it is created.
        """

        # Before we needed to collect information in the function
        # arguments, here it's replaced by self, because we can
        # access all of the properties of the object from it.

        acceleration = (self.enginepower / (2*self.mass*time))**0.5

        return acceleration

We’ve introduced some new things here, to make the class work: the weirdest looking of these is the function inside the class called __init__, which is the ’constructor’ for the class. It sets all of the variables up in the object. When we make an object, by running something like

fiesta = Car("Ford Fiesta", 1350, "red")

the __init__ function is what’s called.

The object now contains various bits of information about the car, so we can use:

print(fiesta.color)

to print the color of the car.

Say we want to introduce another car into our program: we can do that just by defining it in another variable:

mondeo = Car("Ford Mondeo", 1900, "blue")

So the objects keep everything together neatly. We also keep all of the logic which applies to the car with the data (but don’t duplicate code), because we can run

a = fiesta.accelerate(100)

to calculate a quantity which depends both on the properties stored in the object and on the time we pass in.

2.3 Exercise: Making a dataset object

We’ve already seen a function which can calculate the mean of a list of data: try making an object which can store a dataset, and perform some simple statistical operations on the data (try standard deviation first).

2.4 Hint: Making a dataset object

class Dataset:
    """
    Represents a dataset.
    """

    def __init__(self, data):
        """
        Construct the data set object
        """
        self.data = data

    def mean(self):
        """
        Take a list of numbers, and find their mean.
        """

        return sum(self.data) / len(self.data)
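One possible way to take the hint a step further towards the exercise is sketched below. It assumes that the population standard deviation (dividing by N rather than N - 1) is what’s wanted, and the method name std is just an illustrative choice; neither is fixed by the exercise.

```python
import math


class Dataset:
    """
    Represents a dataset.
    """

    def __init__(self, data):
        self.data = data

    def mean(self):
        """
        Find the mean of the data.
        """
        return sum(self.data) / len(self.data)

    def std(self):
        """
        Find the population standard deviation of the data.
        """
        mu = self.mean()
        # Average squared distance from the mean, then square-rooted.
        variance = sum((x - mu) ** 2 for x in self.data) / len(self.data)
        return math.sqrt(variance)


heights = Dataset([2, 4, 4, 4, 5, 5, 7, 9])
print(heights.mean())
print(heights.std())
```

Because std calls self.mean(), the two calculations can’t drift apart if the mean is ever changed.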

2.5 More things with objects

This has been a very break-neck introduction to Python objects, and we’ve not really had time to look at other neat things we can do with them:

Inheritance: You can use one class to build another (for example we could build the Car class atop a Vehicle class).
Operator overloading: You can define arithmetic operations on your classes (for example, if we add a number to one of our Dataset objects, what should happen?). You can do this by defining the method __add__() in a class. For more on this see the Python documentation.
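Both ideas can be sketched with the classes we’ve already met. The choices here are assumptions made purely for illustration: adding a number to a Dataset shifts every value by that number, and Vehicle is a hypothetical parent class which only looks after the colour.

```python
class Dataset:
    """
    Represents a dataset.
    """

    def __init__(self, data):
        self.data = data

    def __add__(self, number):
        # Called for `dataset + number`; here we choose to shift every value.
        return Dataset([x + number for x in self.data])


class Vehicle:
    """
    A hypothetical parent class for anything that can be driven.
    """

    def __init__(self, color):
        self.color = color


class Car(Vehicle):
    """
    A car is a vehicle with a model and an engine size.
    """

    def __init__(self, model, enginesize, color):
        # Let Vehicle store the properties that all vehicles share.
        super().__init__(color)
        self.model = model
        self.enginesize = enginesize


shifted = Dataset([1, 2, 3]) + 10
print(shifted.data)

fiesta = Car("Ford Fiesta", 1350, "red")
print(fiesta.color, isinstance(fiesta, Vehicle))
```

Returning a new Dataset from __add__, rather than changing the existing one, mirrors how addition behaves for ordinary numbers.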

3 Making a module

To keep your code tidy it’s often a good idea to keep different objects from your program in different files, which makes it easier to find the code for some specific job, and also makes it easier to include in another program (thus improving your code re-usability).

You may have seen a module at work in Python before, from a line like

import numpy as np

which loads the numpy module.

To make our own module we need to put the Dataset code in its own file, and put that in its own directory. This should look something like this:

|
|-dataset
| |-dataset.py

But we also need to put in an additional, blank file, called __init__.py, so we have a folder structure like this:

|
|-dataset
| |-__init__.py
| |-dataset.py

Now we can make a script in the root directory of the project which can import the dataset module.

|
|-script.py
|-dataset
| |-__init__.py
| |-dataset.py

Then, because the Dataset class lives in the file dataset.py inside the dataset directory, script.py might contain something along the lines of

from dataset.dataset import Dataset
a = Dataset([0,4,5,3,6])
print(a.mean())
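If you’d like to see the whole layout working without touching your own project, the sketch below builds it in a throwaway directory and imports from it. The temporary directory and the cut-down copy of the class are only there to make the example self-contained; the class is imported from dataset.dataset because it sits in dataset.py inside the dataset directory.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# A cut-down copy of the Dataset class, to be written into dataset.py.
DATASET_SOURCE = '''
class Dataset:
    def __init__(self, data):
        self.data = data

    def mean(self):
        return sum(self.data) / len(self.data)
'''

# Build the layout from the text inside a throwaway project directory.
project = Path(tempfile.mkdtemp())
package = project / "dataset"
package.mkdir()
(package / "__init__.py").write_text("")  # the blank file
(package / "dataset.py").write_text(DATASET_SOURCE)

# A script run from the project root finds that directory automatically;
# we add it by hand because this sketch lives somewhere else.
sys.path.insert(0, str(project))
importlib.invalidate_caches()

from dataset.dataset import Dataset

a = Dataset([0, 4, 5, 3, 6])
print(a.mean())
```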

4 Compartmentalisation

Now that we’re progressing with our software project, we need to start thinking about dependencies - code that we bring into the project from elsewhere. If we want someone else to be able to use our code we need to make sure it can run on their machine.

We can do this in Python with a mechanism called a virtualenv, or virtual environment, which isolates your code from (most of) the other software on your computer.

You can install virtualenv on your system by running

sudo apt-get install python-virtualenv

on an Ubuntu (or WSL) system. You should also run

sudo apt-get install python-virtualenvwrapper

which provides the mkvirtualenv and workon commands used below.

Once it’s installed you can make a virtualenv by running

mkvirtualenv supa

which makes a virtualenv called supa.

We can leave the virtualenv by running deactivate in the terminal, and start it again with

workon supa

Now that we’re in the virtualenv we can install dependencies, for example

pip install numpy

will install numpy, which is a module for doing array and matrix arithmetic. pip is the Python package manager, and handles all of the downloading and installation of packages.

We can check all of the packages installed in the virtualenv by running

pip freeze

which can be helpful for making a list of dependencies for your code.

5 Testing your code

Suppose you collaborate on writing code: how do you make sure none of your collaborators break your code? The answer: introduce quality controls. We can do this by testing the code frequently. To make sure this happens (and happens consistently) we should automate the process, so that the tests run every time code is committed to the repository.

Let’s start with a simple example which tests our Dataset object.

from dataset.dataset import Dataset

def test_mean():
    data = [1,2,3]
    result = 2

    testobj = Dataset(data)

    # Compare against the expected result, not the input data
    assert testobj.mean() == result

If our mean function doesn’t return 2 when given the numbers 1,2,3 Python will throw an AssertionError, and the code will fail its test. Otherwise the function will operate as normal, and we can conclude that the test has passed.

This is helpful, but chances are that the code won’t produce errors on easy jobs like this. Instead we need to consider so-called “edge cases”, which are places where the behaviour of the function doesn’t follow the usual pattern. (A good example of this is the Fibonacci function, which has special cases for 0 and 1).

For our mean function we might want to check the behaviour of the function

When the input is empty (i.e. when data = [])
When all of the values are negative (or do we - are there any times this might be sensible to check?)
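These edge cases can be written down as tests too. The sketch below repeats the Dataset class from the hint so that it runs on its own, and it assumes we’re happy for an empty dataset to raise Python’s ZeroDivisionError; you may well prefer your own class to do something friendlier.

```python
class Dataset:
    """
    Represents a dataset (repeated from the hint).
    """

    def __init__(self, data):
        self.data = data

    def mean(self):
        return sum(self.data) / len(self.data)


def test_mean_empty():
    # len([]) is 0, so the division in mean() cannot succeed.
    try:
        Dataset([]).mean()
    except ZeroDivisionError:
        return
    raise AssertionError("mean() of an empty dataset should have failed")


def test_mean_all_negative():
    assert Dataset([-1, -2, -3]).mean() == -2


test_mean_empty()
test_mean_all_negative()
```

Writing the empty-list test forces a decision about what the “right” behaviour is, which is often the most useful part of thinking about edge cases.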

5.1 A test suite

Clearly as we keep adding new tests we’re going to want a sensible way of managing them. This is the point at which we turn to a testing framework.

Let’s start by adding a new directory to our project to keep the test files in: we’ll put all our tests here in a file called test_dataset.py.

|
|-script.py
|-dataset
| |-__init__.py
| |-dataset.py
|-tests
| |-test_dataset.py

We now need to install a new python module called nose, which we can do by running

pip install nose

Nose makes things easy by looking for files, classes, and functions which match the “regular expression”

(?:^|[\b_\.-])[Tt]est

(basically, anything whose name starts with test or Test, or has it straight after an underscore, dot, or hyphen).
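To get a feel for what that pattern does and doesn’t pick up, we can try it out with Python’s own re module. This is only a sketch of the matching rule quoted above, with single backslashes and applied with search; the names in the list are made up for the demonstration.

```python
import re

# The pattern quoted above; inside [...] the "-" comes last so that it
# is treated as a literal hyphen.
test_match = re.compile(r"(?:^|[\b_\.-])[Tt]est")

names = ["test_dataset.py", "TestDataset", "mean_test", "dataset.py", "contest"]

for name in names:
    # search() succeeds if the pattern matches anywhere in the name.
    print(name, bool(test_match.search(name)))
```

The first three names match, because test or Test sits at the start or just after an underscore. The last two do not: dataset.py doesn’t contain test at all, and in contest it follows an ordinary letter.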

We should update the last test to use nose:

from dataset.dataset import Dataset
from nose.tools import assert_equal

def test_mean():
    data = [1,2,3]
    result = 2

    testobj = Dataset(data)

    # Compare against the expected result, not the input data
    assert_equal(testobj.mean(), result)

This allows us to run nosetests in the tests directory, to run all of the tests in the files in that directory.
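nose has not had a new release since 2015, and it may not run on recent versions of Python, so here is the same test sketched with the unittest module from the standard library. This is a swapped-in tool rather than the one used in the rest of this section, and the Dataset class is repeated from the hint only so that the example runs on its own.

```python
import unittest


class Dataset:
    """
    Represents a dataset (repeated from the hint).
    """

    def __init__(self, data):
        self.data = data

    def mean(self):
        return sum(self.data) / len(self.data)


class TestDataset(unittest.TestCase):
    def test_mean(self):
        data = [1, 2, 3]
        result = 2

        testobj = Dataset(data)

        self.assertEqual(testobj.mean(), result)


# Build and run the suite by hand so that this works as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestDataset)
outcome = unittest.TextTestRunner().run(suite)
print(outcome.wasSuccessful())
```

In a real project you would normally leave out the last three lines and let unittest find the tests for you, for example with python -m unittest discover -s tests from the root of the project.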

5.2 Different types of test

All of the tests we’ve looked at so far are “unit tests”, which individually test the smallest units of your code (what that smallest unit is might well be up for debate, but often it means functions or object methods). There are other types of test which it can be sensible to implement:

Integration tests: These check that all of the parts of your code interoperate the way you expect. Integration tests may test functions which depend on the functions which are unit-tested, or they may implement simple versions of the ways you expect your code to be used.
Regression tests: These act a bit like short-term memory for your project, and compare the outputs of a new version of your code with those of an earlier version which was in some way “accepted”. These can be useful as a way of identifying when a change in the code’s behaviour was introduced, but can be bad at actually finding the underlying cause. They might be useful if you work in an environment where releases of your software undergo peer-review.
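As a rough sketch of how these differ from a unit test: summarise below is a hypothetical higher-level function built on top of Dataset, and ACCEPTED stands in for output saved from an earlier version of the code (in practice it would usually be loaded from a file kept in the repository). Both are invented for this example.

```python
class Dataset:
    """
    Represents a dataset (repeated from the hint).
    """

    def __init__(self, data):
        self.data = data

    def mean(self):
        return sum(self.data) / len(self.data)


def summarise(values):
    """
    A hypothetical higher-level function which relies on Dataset.
    """
    dataset = Dataset(values)
    return {"n": len(dataset.data), "mean": dataset.mean()}


def test_summarise_integration():
    # Integration: checks that summarise() and Dataset work together.
    assert summarise([1, 2, 3]) == {"n": 3, "mean": 2}


# Stand-in for output saved from an earlier, "accepted" version.
ACCEPTED = {"n": 5, "mean": 3.6}


def test_summarise_regression():
    # Regression: the current output must still agree with the accepted one.
    assert summarise([0, 4, 5, 3, 6]) == ACCEPTED


test_summarise_integration()
test_summarise_regression()
```

Notice that the regression test says nothing about whether 3.6 is correct, only that the answer hasn’t changed; that is exactly why it is good at spotting changes and bad at explaining them.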

6 Documenting code

“You could be run over by a bus tomorrow, then how could we run your code?!”

A caring supervisor

“Sure you can download our code. But good luck downloading the only grad-student who can run it.”

A scientist

Documenting your code is generally a good idea. It allows you to remind yourself of how your code works a few months after you’ve written it, and it lets other people work out what to do with it without bugging you later.

Some guidelines:

Good code needs good documentation. Code with no documentation is bad code.
Include comments in tricky lines
Document every function and object
Document your whole package / software product
Include usage examples

The numpy project does documentation very well, and they have a very nice standard, which I’m going to suggest you follow.

def spam2eggs(spam):
    """
    This is a function which turns spam into eggs

    Parameters
    ----------
    spam : str
        A spam string, which contains the spam.

    Returns
    -------
    eggs : str
        An egg string, which is not spam
    """
    return "eggs"
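The guideline about usage examples fits into the same docstring style, as an Examples section. A nice side-effect is that Python’s doctest module can run those examples and check that the documentation still tells the truth. The sketch below reuses the mean function from earlier; the Returns and Examples sections are additions for illustration.

```python
import doctest


def mean(numbers):
    """
    Take a list of numbers, and find their mean.

    Parameters
    ----------
    numbers : list
        A list of numbers

    Returns
    -------
    float
        The mean of the numbers.

    Examples
    --------
    >>> mean([1, 3, 4, 6, 9, 13])
    6.0
    """
    return sum(numbers) / len(numbers)


# Pull the ">>>" examples out of the docstring and run them.
test = doctest.DocTestParser().get_doctest(
    mean.__doc__, {"mean": mean}, "mean", None, 0
)
results = doctest.DocTestRunner().run(test)
print(results)
```

For a whole file you don’t need the last few lines: running python -m doctest on the file checks every docstring in it.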

6.1 READ ME

It’s also a good idea to include a file called “README” or some obvious variant of it in the root directory of your project which contains some basic information about the project. That might be a description, and a link to the full documentation, or it might be the full documentation, depending how complex the project is.

Other common files are

CONTRIBUTING: Instructions on how to contribute to the project
LICENSE: Details of the license the code is released under. (If this isn’t present people can’t reuse your code in most jurisdictions.)

6.2 Additional documentation

You’ll probably want to include some additional documentation on your project (maybe theory, or large usage examples). The standard tool for doing this in Python is called sphinx.

We can install it by running

pip install sphinx

And then set it up in the repository with

mkdir docs
cd docs
sphinx-quickstart

Because we’ve used numpy-style docstrings we should install the numpydoc extension too:

pip install numpydoc

Sphinx will set us up with some files to get us going, and the docs directory will now look something like this:

docs
| Makefile
|-build
|-source
| |-conf.py
| |-index.rst

You’ll have some extra directories which I left out, but right now we don’t need to worry about them.

The first thing we need to do is to activate the numpydoc extension, and we can do that by editing the configuration file, source/conf.py.

You’ll need to find the place where the variable extensions is defined. It should look something like

extensions = ['sphinx.ext.autodoc',
 'sphinx.ext.doctest',
 'sphinx.ext.coverage',
 'sphinx.ext.viewcode']

and then add in 'numpydoc', so we get

extensions = ['sphinx.ext.autodoc',
 'sphinx.ext.doctest',
 'sphinx.ext.coverage',
 'sphinx.ext.viewcode',
 'numpydoc']

We can then edit the main index.rst page of the documentation to look something like:

.. dataset documentation master file, created by

Welcome to dataset's documentation!
=====================================

Contents:

.. toctree::
   :maxdepth: 2

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

API
====

.. toctree::
   :maxdepth: 1

   dataset

Then we can make a new file, dataset.rst for the documentation for that class:

.. _dataset_dataset:

Dataset – :mod:`dataset.dataset`
================================

.. currentmodule:: dataset.dataset
.. automodule:: dataset.dataset

Write some documentation, whatever you like, really, about the
``dataset`` package here. The documentation of the API can be found
below.

Dataset class
*************

.. autoclass:: Dataset

6.3 Automating documentation production

We want to make your documentation easy (and pleasant) to read, and sphinx handles all of this too.

In the docs folder you should have a file called Makefile. If that’s there you can simply run

make html

to make HTML-format documentation, which you can upload to a web server. (You can find it in the build directory under docs.)

Sphinx can produce numerous other output formats as well, including epub, should you wish to peruse your documentation on your Kindle.

If you keep your code on an online service like GitHub you can use a service called Read the Docs, which clones your repository, builds the documentation using sphinx, and then uploads it to a web server, for free. Alternatively you can set up your own workflow, which can involve “hooks” in your version control system (more on that later).

7 Hooking with mercurial and git

An important part of your programming workflow probably centres around your version control system (which is probably mercurial or git). Both systems allow us to define special events which should happen when code is committed to the repository (or when various other actions happen). This is a ’hook’ into the VCS. If we wanted to run our tests when a commit was made in git we could edit (or possibly create) the file

.git/hooks/pre-commit

and add something like

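#! /bin/sh
# The line above this tells the shell that this is a shell script
cd tests
# workon is a shell function provided by virtualenvwrapper, so it is
# only available if virtualenvwrapper has been loaded in this shell
workon supa # make sure we're using the virtualenv
nosetests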

and then make the file executable:

chmod +x .git/hooks/pre-commit
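A hook can be any executable file, so it doesn’t have to be a shell script. Here’s a rough sketch of the same idea in Python. The function name run_tests and the stand-in command are invented for the example: the stand-in just lets the sketch run anywhere, and a real hook would call your actual test command instead.

```python
import subprocess
import sys


def run_tests(command):
    """
    Run a test command and return its exit status (0 means success).
    """
    return subprocess.run(command).returncode


# Stand-in command so that this sketch runs anywhere; a real hook would
# use something like ["nosetests", "tests"] here instead.
TEST_COMMAND = [sys.executable, "-c", "print('pretend the tests ran')"]

status = run_tests(TEST_COMMAND)
print(status)

# A real hook would finish with sys.exit(status): git abandons the
# commit whenever the pre-commit hook exits with a non-zero status.
```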

The process in mercurial is similar, but we need to put the script for the hook somewhere in the main repository. We could just make a script called run_tests.py in the tests directory of the repository, and put the same thing in it as the git hook’s script. We then need to edit the hgrc file for the repository, which we can do by running

hg config --local

or by opening the file in an editor directly:

nano .hg/hgrc

and adding to the end

[hooks]
pretxncommit.runtests = tests/run_tests.py

8 Making a setup script for your package

Now we have a thoroughly documented and tested project, it’s time to make sure other people can install it on their machine, and to make sure you can package it up to send to them.

Python has a standard build system, setuptools, which can handle all of this, and to use it we only need to add one file to the repository, which is conventionally called setup.py.

This file contains the information that is needed to install the project, and should look something like this:

#!/usr/bin/env python

from setuptools import setup

# Each requirement is the name of a package, written as a string
requirements = [
    # paste output of `pip freeze` here
    'numpy', 'scipy', 'matplotlib',
]

setup(
    name='dataset',
    version='0.1.0',
    description="Dataset is a neat way of handling data in Python.",
    author="Daniel Williams",
    author_email='d.williams.2@research.gla.ac.uk',
    url='https://fakey.mcfakeface.com/daniel/dataset',
    # The packages to install: here just the dataset directory
    packages=['dataset'],
    package_dir={'dataset': 'dataset'},
    install_requires=requirements,
    license="ISCL",
    classifiers=[
        'Development Status :: 2 - Pre-Alpha',
    ],
    test_suite='tests',
)

Now, by running

python setup.py install

you can install your project in your virtualenv. This means that it’s accessible from working directories other than the one for this project, so you can include the code in other projects.

We can also run the tests with

python setup.py test

8.1 Distributing your code

setuptools is also able to roll your package up in such a way that it can be distributed (and indeed uploaded to PyPI, the Python Package Index, which pip pulls code from). We can do this by running

python setup.py sdist

which generates a source distribution (saved as a gzipped tarball in the dist directory) that can be unpacked and installed easily. By adding upload:

python setup.py sdist upload

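and following the instructions it gives you, you can upload it to PyPI (but please don’t spam it with test projects!). Recent versions of setuptools no longer provide the upload command; the separate twine tool (twine upload dist/*) does the same job.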

It’s also possible to produce binary distributions with the bdist command, but this is a case of “There be dragons”, especially with Linux, so we’ll not cover this today.

Footnotes:

¹ JavaScript handles object-orientation through “prototypes”, whereas Python, C++, and others use “classes”.
