Using Python to Tackle Big Problems
The Python programming language was designed to be simple to read and understand, and most of its design philosophies are based around these goals. In fact, the Zen of Python is an important guiding principle for the language:
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren’t special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one– and preferably only one –obvious way to do it.
Although that way may not be obvious at first unless you’re Dutch.
Now is better than never.
Although never is often better than right now.
If the implementation is hard to explain, it’s a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea – let’s do more of those!
There are lots of other good choices of programming language, and in this part of the workshop I’m using Python largely because of its clean and easy-to-read syntax, and partly because Python is a language which has the batteries included - there are lots of packages (extensions to the language) which are easy to use and easy to obtain. This definitely isn’t intended to be a full course on Python, and if you feel a bit left behind by some of the syntax, don’t panic: the syntax itself is probably not important, and it’s the underlying principles being covered which matter.
1 Coding for great good
Most scientists write code to get answers to some deep question about some aspect of the universe, and often don’t think about code as an important product of their research. This is a very tempting trap to fall into (few of us got into scientific research to craft careful computer programs, after all), but it is a dangerous approach to take, for a number of reasons. Here are just a few:
- You spend a lot of time on it: be proud of it!
- Science should be reproducible (and you should make doing so easy if you can).
- Research is expensive: we should share code so that we don’t keep re-inventing the wheel.
Of course, writing good code is sometimes harder than writing bad code, but from my own experience, any time saved by dashing off code is usually lost again later on, as hastily written code is often harder to debug, and difficult to maintain or come back to.
1.1 A workflow for sustainable, correct code
We’ll cover a few principles which can help to make your life (and those of your collaborators and readers) easier:
- Keeping your code organised
- Testing your code
- Packaging your code for the world
- Writing good documentation
To do this we’ll touch on some things which are specific to the Python programming language, and to the “ecosystem” which is built around it. All of the principles, however, can carry over to other languages (though most of the time Python does it best!).
2 A Quick Python (re)fresher
Python programming encourages breaking your code into small blocks called functions, and that’s the simplest form of Python syntax we’ll cover in this session. Here’s a simple example:
    def mean(numbers):
        """
        Take a list of numbers, and find their mean.

        Parameters
        ----------
        numbers : list
           A list of numbers
        """

        return sum(numbers) / len(numbers)
This function takes a list, which is one of Python’s array datatypes. Lists look something like this:
a = [1,3,4,6,9,13]
So we could use this function like this:
average = mean(a)
in order to calculate the mean of the list of numbers in a.
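Putting the pieces above together gives a short script you can run. This is only a minimal sketch which repeats the function with a shortened docstring:

```python
def mean(numbers):
    """Take a list of numbers, and find their mean."""
    return sum(numbers) / len(numbers)


a = [1, 3, 4, 6, 9, 13]

# Call the function on the list and keep the result
average = mean(a)
print(average)  # prints 6.0
```

Note that the result is a floating point number even though every entry in the list is an integer, because / always performs true division in Python 3.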
2.1 Embracing objectification
In Python everything is an “object”. If you’ve programmed in a language like C++, Objective-C, or Java in the past this might be a familiar idea to you (if you’ve programmed in JavaScript this will also be quite familiar, but oddly different; see footnote 1).
A code object can be thought of as roughly analogous to a physical object: it has properties, and can perform various different actions. If we wanted to represent a car in code we might want to be able to represent some things like its model, color, and engine_size as its properties, and accelerate, brake, and change_gear as its actions.
There are obviously ways that you can do this without turning to a new way of coding, perhaps something like this:
    enginesize = 1600
    model = "Renault Captur"
    color = "Midnight Sierra"

    def accelerate(enginesize, model):
        """
        Simulate the acceleration of the car.

        Parameters
        ----------
        enginesize : int
           The engine size of the car in cubic centimetres.
        model : str
           The model of the car.

        Notes
        -----
        We need to know the weight of the car, and the power of the
        engine to calculate the acceleration, so we need to collect
        these details in the arguments of the function.
        """

        # ... do some physics and maths here ...

    def change_gear(current_gear, new_gear):
        """
        Calculate the effect of changing gear on the speed of the car.

        Parameters
        ----------
        current_gear : int
           The number of the current gear
        new_gear : int
           The number of the gear we're switching to.

        Returns
        -------
        gear_ratio : float
           The ratio between two gears
        """

        # ... do some more maths here ...
This is all very well, but suppose we want to simulate two cars: we’re going to end up with lots of variables with names like enginesize_clio, and color_zafira. Objects give us a tidier way of doing this, and make it easier to recycle code.
2.2 A car object
We can design an object in Python using a class, which you can think of as the blueprint to help Python build it.
    class Car:
        """
        This class represents a car: a motorised vehicle which can move over land.
        """

        def __init__(self, model, enginesize, color, enginepower=75e3, mass=1200.0):
            """
            Set up a car object.

            Parameters
            ----------
            model : str
               The model of the car.
            enginesize : int
               The engine size of the car in cubic centimetres.
            color : str
               The color of the car.
            enginepower : float, optional
               The engine power in watts (the default is only an
               illustrative placeholder).
            mass : float, optional
               The mass of the car in kilograms (the default is only an
               illustrative placeholder).
            """

            self.enginesize = enginesize
            self.model = model
            self.color = color
            # accelerate() needs these two properties as well
            self.enginepower = enginepower
            self.mass = mass

        def accelerate(self, time):
            """
            Simulate the acceleration of the car.

            Parameters
            ----------
            time : float
               The time at which to calculate the acceleration

            Notes
            -----
            We need to know the mass of the car, and the power of the
            engine to calculate the acceleration; both are now stored
            on the object rather than passed in as arguments.
            """

            # Before we needed to collect information in the function
            # arguments, here it's replaced by self, because we can
            # access all of the properties of the object from it.

            acceleration = (self.enginepower / (2*self.mass*time))**0.5

            return acceleration
We’ve introduced some new things here to make the class work. The weirdest looking of these is the function inside the class called __init__, which is the ’constructor’ for the class. It sets all of the variables up in the object. When we make an object, by running something like
fiesta = Car("Ford Fiesta", 1350, "red")
the __init__ function is what’s called.
The object now contains various bits of information about the car, so we can use:
print(fiesta.color)
to print the color of the car.
Say we want to introduce another car into our program: we can do that by defining it in another variable:
mondeo = Car("Ford Mondeo", 1900, "blue")
So the objects keep everything together neatly. We also keep all of the logic which applies to the car with the data (but don’t duplicate code), because we can run
a = fiesta.accelerate(100)
to find a property of the object which changes.
2.3 Exercise: Making a dataset object
We’ve already seen a function which can calculate the mean of a list of data: try making an object which can store a dataset, and perform some simple statistical operations on the data (try standard deviation first).
2.4 Hint: Making a dataset object
    class Dataset:
        """
        Represents a dataset.
        """

        def __init__(self, data):
            """
            Construct the data set object
            """
            self.data = data

        def mean(self):
            """
            Take a list of numbers, and find their mean.
            """

            return sum(self.data) / len(self.data)
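If you get stuck on the standard deviation, here is one possible way to extend the hint. It is only a sketch: the method name std is my own choice, and it computes the population standard deviation (dividing by N rather than N - 1), which may or may not be the convention you want.

```python
import math


class Dataset:
    """Represents a dataset."""

    def __init__(self, data):
        self.data = data

    def mean(self):
        """Find the mean of the stored data."""
        return sum(self.data) / len(self.data)

    def std(self):
        """Find the population standard deviation of the stored data."""
        mu = self.mean()
        # Average squared distance from the mean, then take the square root
        variance = sum((x - mu) ** 2 for x in self.data) / len(self.data)
        return math.sqrt(variance)


d = Dataset([2, 4, 4, 4, 5, 5, 7, 9])
print(d.mean())  # prints 5.0
print(d.std())   # prints 2.0
```

Notice that std re-uses mean through self, so the logic for the mean lives in exactly one place.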
2.5 More things with objects
This has been a very break-neck introduction to Python objects, and we’ve not really had time to look at other neat things we can do with them:
- Inheritance: You can use one class to build another (for example we could build the Car class atop a Vehicle class).
- Operator overloading: You can define arithmetic operations on your classes (for example, if we add a number to one of our Dataset objects, what should happen?). You can do this by defining the method __add__() in a class. For more on this see the Python documentation.
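Both ideas can be sketched in a few lines. The Vehicle class and its describe method are hypothetical, invented just for this illustration, and making "dataset + number" shift every value is only one of several reasonable answers to the question posed above.

```python
class Vehicle:
    """Hypothetical base class for anything that can move."""

    def __init__(self, model):
        self.model = model

    def describe(self):
        # type(self).__name__ is the name of the actual (sub)class
        return f"{self.model} ({type(self).__name__})"


class Car(Vehicle):
    """A Car is a Vehicle with an engine size and a color."""

    def __init__(self, model, enginesize, color):
        super().__init__(model)  # let Vehicle set up the shared properties
        self.enginesize = enginesize
        self.color = color


class Dataset:
    """A dataset which supports adding a number to every value."""

    def __init__(self, data):
        self.data = list(data)

    def __add__(self, other):
        # Assumption: adding a number shifts every value in the dataset
        if isinstance(other, (int, float)):
            return Dataset([x + other for x in self.data])
        return NotImplemented


fiesta = Car("Ford Fiesta", 1350, "red")
print(fiesta.describe())  # prints Ford Fiesta (Car)

shifted = Dataset([1, 2, 3]) + 10
print(shifted.data)  # prints [11, 12, 13]
```

Car never defines describe itself: it inherits the method from Vehicle. Returning NotImplemented from __add__ lets Python raise a sensible TypeError when someone tries to add something unsupported.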
3 Making a module
To keep your code tidy it’s often a good idea to keep different objects from your program in different files, which makes it easier to find the code for some specific job, and also makes it easier to include in another program (thus improving your code re-usability).
You may have seen a module at work in Python before, from a line like
import numpy as np
which loads the numpy module.
To make our own module we need to put the Dataset code in its own file, and put that in its own directory. This should look something like this:

    |
    |-dataset
    | |-dataset.py
But we also need to put in an additional, blank file, called __init__.py, so we have a folder structure like this:

    |
    |-dataset
    | |-__init__.py
    | |-dataset.py
Now we can make a script in the root directory of the project which can import the dataset module.

    |
    |-script.py
    |-dataset
    | |-__init__.py
    | |-dataset.py
Then we might have something along the lines of
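    # The Dataset class lives in dataset/dataset.py, so with a blank
    # __init__.py we import it from the dataset.dataset submodule
    from dataset.dataset import Dataset

    a = Dataset([0,4,5,3,6])
    print(a.mean())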
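If you want to convince yourself of how the folder structure maps onto import statements, the sketch below builds the same layout in a temporary directory and imports from it. The temporary directory and the trick of adding it to sys.path are purely for demonstration; in a real project you would create the files by hand and run script.py from the project root.

```python
import importlib
import sys
import tempfile
from pathlib import Path

DATASET_SOURCE = '''
class Dataset:
    def __init__(self, data):
        self.data = data

    def mean(self):
        return sum(self.data) / len(self.data)
'''

# Recreate the layout: <root>/dataset/__init__.py and <root>/dataset/dataset.py
project_root = Path(tempfile.mkdtemp())
package_dir = project_root / "dataset"
package_dir.mkdir()
(package_dir / "__init__.py").write_text("")  # the blank file
(package_dir / "dataset.py").write_text(DATASET_SOURCE)

# Pretend we are running a script from the project root
sys.path.insert(0, str(project_root))
importlib.invalidate_caches()

# Directory "dataset" + file "dataset.py" gives the module dataset.dataset
from dataset.dataset import Dataset

a = Dataset([0, 4, 5, 3, 6])
print(a.mean())  # prints 3.6
```

If you would rather write "from dataset import Dataset", one common option is to put the line "from .dataset import Dataset" in __init__.py instead of leaving it blank.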
4 Compartmentalisation
Now that we’re progressing with our software project, we need to start thinking about dependencies - code that we bring into the project from elsewhere. If we want someone else to be able to use our code we need to make sure it can run on their machine.
We can do this in Python with a mechanism called a virtualenv, or virtual environment, which isolates your code from most of the other software on your computer.
You can install virtualenv on your system by running
sudo apt-get install python-virtualenv
on an Ubuntu (or WSL) system. You should also run
sudo apt-get install python-virtualenvwrapper
which makes things work a bit better, and provides the mkvirtualenv and workon commands used below.
Once it’s installed you can make a virtualenv by running
mkvirtualenv supa
which makes a virtualenv called supa.
We can leave the virtualenv by running deactivate in the terminal, and start it again with
workon supa
Now that we’re in the virtualenv we can install dependencies, for example
pip install numpy
will install numpy, which is a module for doing matrix arithmetic.
pip is the Python package manager, and handles all of the downloading and installation of packages.
We can check all of the packages installed in the virtualenv by running
pip freeze
which can be helpful for making a list of dependencies for your code.
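You can get a similar list from inside Python, which is sometimes handy for recording the environment alongside your results. This is a rough sketch using the standard library's importlib.metadata module (Python 3.8 and later); the function name frozen_requirements is my own, and the output is not guaranteed to match pip freeze line for line.

```python
from importlib import metadata


def frozen_requirements():
    """Return 'name==version' strings for the installed packages."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip anything with missing metadata
            lines.add(f"{name}=={dist.version}")
    # Sort case-insensitively so the list is easy to scan
    return sorted(lines, key=str.lower)


for requirement in frozen_requirements():
    print(requirement)
```

Run inside the virtualenv, this should only list the packages installed there, which is exactly why the isolation is useful.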
5 Testing your code
Suppose you collaborate on writing code: how do you make sure none of your collaborators break your code? The answer: introduce quality controls. We can do this by testing the code frequently. To make sure you do this (and do it consistently), we should automate this process, so that we can make sure it happens every time that the code is committed to your repository.
Let’s start with a simple example which tests our Dataset object.

    from dataset.dataset import Dataset

    def test_mean():
        data = [1,2,3]
        result = 2

        testobj = Dataset(data)

        # Compare against the expected result, not the input data
        assert testobj.mean() == result
If our mean function doesn’t return 2 when given the numbers 1,2,3 Python will throw an AssertionError, and the code will fail its test. Otherwise the function will operate as normal, and we can conclude that the test has passed.
This is helpful, but chances are that the code won’t produce errors on easy jobs like this. Instead we need to consider so-called “edge cases”, which are places where the behaviour of the function doesn’t follow the usual pattern. (A good example of this is the Fibonacci function, which has special cases for 0 and 1).
For our mean function we might want to check the behaviour of the function
- When the input is empty (i.e. when data = [])
- When all of the values are negative (or do we - are there any times this might be sensible to check?)
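The two edge cases in the list above might be tested like this. It is a sketch which repeats a minimal Dataset class so that it runs on its own, and it simply documents what the current implementation does with empty data (it raises ZeroDivisionError); you may well decide that a different behaviour is more appropriate for your own code.

```python
class Dataset:
    """Minimal copy of the Dataset class, so this example is self-contained."""

    def __init__(self, data):
        self.data = data

    def mean(self):
        return sum(self.data) / len(self.data)


def test_mean_empty():
    # len([]) is 0, so the division in mean() fails
    try:
        Dataset([]).mean()
    except ZeroDivisionError:
        return
    raise AssertionError("expected ZeroDivisionError for empty data")


def test_mean_all_negative():
    assert Dataset([-1, -2, -3]).mean() == -2


# Run the tests by hand; a testing framework would normally do this for us
test_mean_empty()
test_mean_all_negative()
print("all edge-case tests passed")
```

Writing the empty-data test forces a decision: is an exception the right response, or should mean return something like None? Either is defensible, but the test records which one you chose.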
5.1 A test suite
Clearly as we keep adding new tests we’re going to want a sensible way of managing them. This is the point at which we turn to a testing framework.
Let’s start by adding a new directory to our project to keep the test files in: we’ll put all our tests here in a file called test_dataset.py.

    |
    |-script.py
    |-dataset
    | |-__init__.py
    | |-dataset.py
    |-tests
    | |-test_dataset.py
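We now need to install a new Python module called nose, which we can do by running
pip install nose
(Be aware that nose is no longer actively maintained, so it may not work with recent versions of Python.)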
Nose makes things easy by looking for files, classes, and functions which match the “regular expression”
(?:^|[\b_\.-])[Tt]est
(basically, anything that starts with test).
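To see what that regular expression actually picks up, you can try it on a few names with Python's re module. This is a sketch: the names are invented, and I'm assuming the pattern is applied as a search anywhere in the name.

```python
import re

# The default test-matching pattern quoted above
TEST_MATCH = re.compile(r"(?:^|[\b_\.-])[Tt]est")

names = [
    "test_mean",       # starts with "test"
    "TestDataset",     # starts with "Test"
    "mean_test",       # "test" directly after an underscore
    "dataset",         # no "test" at all
    "latest_results",  # contains "test", but not at a word boundary
    "contest",
]

collected = [name for name in names if TEST_MATCH.search(name)]
print(collected)  # prints ['test_mean', 'TestDataset', 'mean_test']
```

So "starts with test" is a slight simplification: "test" following an underscore, dot, or hyphen also counts, while names which merely contain the letters (like latest_results) are ignored.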
We should update the last test to use nose:

    from dataset.dataset import Dataset
    from nose.tools import assert_equal

    def test_mean():
        data = [1,2,3]
        result = 2

        testobj = Dataset(data)

        # Compare against the expected result, not the input data
        assert_equal(testobj.mean(), result)
This allows us to run nosetests in the tests directory, to run all of the tests in the files in that directory.
5.2 Different types of test
All of the tests we’ve looked at so far are “unit tests”, which individually test the smallest units of your code (what that smallest unit is might well be up for debate, but often it means functions or object methods). There are other types of test which it can be sensible to implement:
- integration tests: These check that all of the parts of your code interoperate the way you expect. Integration tests may test functions which depend on the functions which are unit-tested, or they may implement simple versions of the workflows you expect your code to be used in.
- regression tests: These act a bit like short-term memory for your project, and compare the outputs of a new version of your code with those from an earlier version which was in some way “accepted”. These can be useful as a way of identifying when a change in the code’s behaviour was introduced, but can be bad at actually finding the underlying cause. They might be useful if you work in an environment where releases of your software undergo peer-review.
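A regression test can be as simple as saving the outputs you have accepted to a file, and comparing against them later. The sketch below does this with JSON; the file name, the function names and the tolerance are all my own choices, not an established convention.

```python
import json
import tempfile
from pathlib import Path


def mean(numbers):
    return sum(numbers) / len(numbers)


def compute_outputs():
    """Run the code on some fixed inputs and collect the results."""
    return {
        "mean_small": mean([1, 2, 3]),
        "mean_mixed": mean([0, 4, 5, 3, 6]),
    }


def check_regression(reference_file, outputs, tolerance=1e-12):
    """Return the names of any outputs which differ from the accepted ones."""
    accepted = json.loads(Path(reference_file).read_text())
    return [
        key for key, old_value in accepted.items()
        if abs(outputs[key] - old_value) > tolerance
    ]


# "Accept" the outputs of the current version of the code
reference = Path(tempfile.mkdtemp()) / "accepted_outputs.json"
reference.write_text(json.dumps(compute_outputs()))

# Later: the unchanged code reproduces the accepted outputs
print(check_regression(reference, compute_outputs()))  # prints []

# Simulate a change in behaviour creeping into one output
changed_outputs = dict(compute_outputs(), mean_small=2.5)
print(check_regression(reference, changed_outputs))  # prints ['mean_small']
```

The test tells you that mean_small changed, but not why: that is the weakness described above, and the reason regression tests complement unit tests rather than replace them.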
6 Documenting code
“You could be run over by a bus tomorrow, then how could we run your code?!”
- A caring supervisor
“Sure you can download our code. But good luck downloading the only grad-student who can run it.”
- A scientist
Documenting your code is generally a good idea. It allows you to remind yourself of how your code works a few months after you’ve written it, and it lets other people work out what to do with it without bugging you later.
Some guidelines:
- Good code needs good documentation. Code with no documentation is bad code.
- Include comments in tricky lines
- Document every function and object
- Document your whole package / software product
- Include usage examples
The numpy project does documentation very well, and they have a very nice standard, which I’m going to suggest you follow.

    def spam2eggs(spam):
        """
        This is a function which turns spam into eggs

        Parameters
        ----------
        spam : str
           A spam string, which contains the spam.

        Returns
        -------
        eggs : str
           An egg string, which is not spam
        """
        return "eggs"
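The guideline about usage examples fits nicely into this standard: numpy-style docstrings can carry an Examples section, and Python's built-in doctest module can check that the examples still work. The sketch below adds such a section to spam2eggs and runs it by hand; the way the example is run here is just one of several options that doctest offers.

```python
import doctest


def spam2eggs(spam):
    """
    This is a function which turns spam into eggs

    Parameters
    ----------
    spam : str
       A spam string, which contains the spam.

    Returns
    -------
    eggs : str
       An egg string, which is not spam

    Examples
    --------
    >>> spam2eggs("spam")
    'eggs'
    """
    return "eggs"


# Pull the example out of the docstring and run it
parser = doctest.DocTestParser()
test = parser.get_doctest(
    spam2eggs.__doc__, {"spam2eggs": spam2eggs}, "spam2eggs", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

print(results.attempted, "example(s) run,", results.failed, "failed")
```

This way your documentation doubles as a (small) test: if the function's behaviour changes, the example in the docstring fails rather than quietly going out of date.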
6.1 READ ME
It’s also a good idea to include a file called “README” or some obvious variant of it in the root directory of your project which contains some basic information about the project. That might be a description, and a link to the full documentation, or it might be the full documentation, depending how complex the project is.
Other common files are
- CONTRIBUTING: Instructions on how to contribute to the project.
- LICENSE: Details of the license the code is released under. (If this isn’t present people can’t reuse your code in most jurisdictions.)
6.2 Additional documentation
You’ll probably want to include some additional documentation on your project (maybe theory, or large usage examples). The standard tool for doing this in Python is called sphinx.
We can install it by running
pip install sphinx
And then set it up in the repository with

    mkdir docs
    cd docs
    sphinx-quickstart

Because we’ve used numpy-style docstrings we should install the numpydoc extension too:
pip install numpydoc
Sphinx will set us up with some files to get us going, and the docs directory will now look something like this:

    docs
    |-Makefile
    |-build
    |-source
    | |-conf.py
    | |-index.rst
You’ll have some extra directories which I left out, but right now we don’t need to worry about them.
The first thing we need to do is to activate the numpydoc extension, and we can do that by editing the configuration file, source/conf.py.
You’ll need to find the place where the variable extensions is defined. It should look something like
extensions = ['sphinx.ext.autodoc', 'sphinx.ext.doctest', 'sphinx.ext.coverage', 'sphinx.ext.viewcode']
and then add in 'numpydoc', so we get
extensions = ['sphinx.ext.autodoc', 'sphinx.ext.doctest', 'sphinx.ext.coverage', 'sphinx.ext.viewcode', 'numpydoc']
We can then edit the main index.rst page of the documentation to look something like:

    .. dataset documentation master file, created by

    Welcome to dataset's documentation!
    =====================================

    Contents:

    .. toctree::
       :maxdepth: 2

    Indices and tables
    ==================

    * :ref:`genindex`
    * :ref:`modindex`
    * :ref:`search`

    API
    ====

    .. toctree::
       :maxdepth: 1

       dataset

Then we can make a new file, dataset.rst, for the documentation for that class:

    .. _dataset_dataset:

    Dataset – :mod:`dataset.dataset`
    ================================

    .. currentmodule:: dataset.dataset

    .. automodule:: dataset.dataset

    Write some documentation, whatever you like, really, about the
    ``dataset`` package here. The documentation of the API can be found below.

    Dataset class
    *************

    .. autoclass:: Dataset
6.3 Automating documentation production
We want to make your documentation easy (and pleasant) to read, and sphinx handles all of this too.
In the docs folder you should have a file called Makefile. If that’s there you can simply run
make html
to make html-format documentation, which you can upload to a web-server. (You can find it in the build directory under docs.)
Sphinx can produce numerous other output formats as well, including epub, should you wish to peruse your documentation on your Kindle.
If you keep your code on an online service like GitHub you can use a service called Read the Docs, which clones your repository, builds the documentation using sphinx, and then uploads it to a webserver, for free. Alternatively you can set up your own workflow, which can involve ’hooks’ in your version control system (more on that later).
7 Hooking with mercurial and git
An important part of your programming workflow probably centres around your version control system (which is probably mercurial or git). Both systems allow us to define special events which should happen when code is committed to the repository (or when various other actions happen). This is a ’hook’ into the VCS. If we wanted to run our tests when a commit was made in git we could edit (or possibly create) the file
.git/hooks/pre-commit
and add something like

    #!/bin/sh
    # The line above this tells the shell that this is a shell script
    cd tests
    # make sure we're using the virtualenv (workon comes from
    # virtualenvwrapper, so it must be available in this shell)
    workon supa
    nosetests
and then make the file executable:
chmod +x .git/hooks/pre-commit
The process in mercurial is similar, but we need to make the script for the hook somewhere in the main repository. We could just make a script called run_tests.py in the repository, and put the same thing in it as the git commit hook’s script. We then need to edit the hgrc file for the repository, which we can do by running
hg config --local
or
nano .hg/hgrc
and adding to the end

    [hooks]
    pretxncommit.runtests = tests/run_tests.py
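A hook is just an executable which signals success or failure through its exit status: zero lets the commit go ahead, anything else stops it. That means the hook script could equally be written in Python. Here is a minimal sketch; the function name run_tests is my own, and a harmless stand-in command is used so that the example runs anywhere.

```python
import subprocess
import sys


def run_tests(command, cwd=None):
    """Run a test command and return its exit status (0 means success)."""
    completed = subprocess.run(command, cwd=cwd)
    return completed.returncode


# Stand-in for the real test command, so this sketch runs anywhere
status = run_tests([sys.executable, "-c", "print('pretend tests passed')"])
print("exit status:", status)  # prints exit status: 0

# In a real hook script you would finish with something like
#     sys.exit(run_tests(["nosetests"], cwd="tests"))
# so that a failing test run causes the commit to be rejected.
```

Keeping the logic in a function like this also means the same script can serve both version control systems.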
8 Making a setup script for your package
Now we have a thoroughly documented and tested project, it’s time to make sure other people can install it on their machine, and to make sure you can package it up to send to them.
Python has a built-in build system which can handle all of this, and to use it we only need to add one file to the repository, which is conventionally called setup.py.
This file contains the information that is needed to install the project, and should look something like this:
    #!/usr/bin/env python
    from setuptools import setup

    requirements = [
        # paste output of `pip freeze` here
        # (each requirement must be a quoted string)
        'numpy',
        'scipy',
        'matplotlib',
    ]

    setup(
        name='dataset',
        version='0.1.0',
        description="Dataset is a neat way of handling data in Python.",
        author="Daniel Williams",
        author_email='d.williams.2@research.gla.ac.uk',
        url='https://fakey.mcfakeface.com/daniel/dataset',
        # the name of the package directory to install
        packages=['dataset'],
        package_dir={'dataset': 'dataset'},
        install_requires=requirements,
        license="ISCL",
        classifiers=[
            'Development Status :: 2 - Pre-Alpha',
        ],
        test_suite='tests',
    )
Now, by running
python setup.py install
you can install your project in your virtualenv. This means that it’s accessible from working directories other than the one for this project, so you can include the code in other projects.
We can also run the tests with
python setup.py test
8.1 Distributing your code
setuptools is also able to roll your package up in such a way that it can be distributed (and indeed uploaded to PyPI, the Python Package Index, which pip pulls code from). We can do this by running
python setup.py sdist
which generates a source distribution (which will be saved as a gzipped tarball in the dist directory), which can be unpacked and installed easily. By adding upload:
python setup.py sdist upload
and following the instructions it gives you, you can upload it to PyPI (but please don’t spam it with test projects!)
It’s also possible to produce binary distributions with the bdist command, but this is a case of “There be dragons”, especially with Linux, so we’ll not cover this today.
Footnotes:
1. JavaScript handles object-orientation through “prototypes”, whereas Python, C++, and others use “classes”.