Creating and executing an analysis¶
The purpose of “analyses” in a paper version is to automate the generation of
variable files, table data files and figures for inclusion in the manuscript.
You write Python functions that generate summary data, generate table data and
perform plotting. In an analysis specification file (spec.yaml) you then declare how
the functions should be used to generate the variable/table/figure files.
The Python functions should be located in one or more .py modules in the analysis folder. Alternatively, if you would like to work with the analysis functions interactively, an IPython notebook (.ipynb) can also be used (see Using IPython notebooks).
Creating variable files¶
A data summary function for creating variable files should return
a (nested) dictionary of strings, numbers or lists. Numpy arrays will
be automatically converted to lists before the dictionary is saved to
a YAML file in the variables folder.
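As a minimal sketch of such a function, the hypothetical stats_function below returns a nested dictionary that contains only strings, numbers and lists. The function name, the default input values and the dictionary keys are assumptions made for illustration; they are not part of PaperBuilder.

```python
import statistics


def stats_function(values=(2.0, 4.0, 4.0, 5.0)):
    """Summarize a sequence of numbers as a nested dictionary.

    The default `values` are made-up illustration data; a real function
    would load the data of the analysis instead.
    """
    values = list(values)
    return {
        "n": len(values),
        "units": "ms",  # hypothetical unit label, stored as a string
        "response": {
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "range": [min(values), max(values)],
        },
    }
```

Nesting related values under a common key (here response) is one way to keep variable names unique once the variable files are merged.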
Within and across analyses, all YAML variable files are merged prior to the preprocessing step. It is therefore important to make sure that variable names are unique and do not clash.
In the spec.yaml, data summary functions need to be declared in a summary
block. Below is an example in which stats and tests are the identifiers of the
summaries to be generated, and for each a function in <module>.<function> notation
is indicated. Optionally, extra keyword arguments are given as well. Note that a Python
module <module>.py needs to exist in the analysis folder. The name of the variable
files is formed from the name of the analysis and the summary identifier, i.e.
<analysis>.<summary>.yaml.
summary:
  stats: module.stats_function
  tests:
    function: module.tests_function
    args:
      test: wilcoxon
See Insert variable values for more information about how to refer to variables in the document file.
Creating tables¶
Functions for generating tables should return two outputs: a list of column names and a sequence of rows (where each row is a sequence of table cell data). The output of the function will be saved as a comma-separated values (csv) file.
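A table function could look like the sketch below. The name table_two, the nrows argument and the placeholder cell values are assumptions for illustration only; what matters is the shape of the two return values.

```python
def table_two(nrows=3):
    """Return column names and rows for a small table.

    The contents are placeholder values; `nrows` limits the number of rows.
    """
    columns = ["subject", "condition", "score"]
    all_rows = [
        ("s01", "control", 0.82),
        ("s02", "control", 0.79),
        ("s03", "treatment", 0.91),
        ("s04", "treatment", 0.88),
    ]
    # every row has one cell per column name
    return columns, all_rows[:nrows]
```

Keyword arguments such as nrows are the kind of value that would be passed through an args entry in the spec.yaml file.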
In the spec.yaml file, table generating functions need to be declared in a tables
block. Below is an example in which table1 and table2 are the identifiers of the
tables to be generated, and for each a function in <module>.<function> notation
is indicated. Optionally, extra keyword arguments are given as well. Note that a Python
module <module>.py needs to exist in the analysis folder. The name of each csv
file is formed from the name of the analysis and the table identifier, i.e.
<analysis>.<table>.csv.
tables:
  table1: module.table_one
  table2:
    function: module.table_two
    args:
      nrows: 3
See Insert tables for more information about how to incorporate tables from csv files in the document file.
Creating figures¶
To create a figure, you first have to define a figure layout that sets the location
of the axes for plotting. Individual axes or groups of axes are labeled and organized
in a (possibly nested) dictionary. There are three ways in which a layout can be
specified in the figures block in the spec.yaml file:
- grid layout: a regular grid of axes created by a call to matplotlib’s subplots function. You specify the number of rows and columns, as well as any additional arguments that should be passed to the subplots function. By default, axes are organized in a flat map and labeled ax1, ax2, etc. You may also provide a custom label prefix or a custom list of labels by specifying the label option.

  Optionally, axes may be grouped column-wise or row-wise using the group option. By default, the groups and the axes inside a group are labeled col1, col2, … or row1, row2, … (depending on the grouping dimension). A custom label prefix or list of labels can be specified for groups and axes inside groups by the group_label and label options respectively.

  If the array option is set to True, then (grouped) axes are organized in an array and not labeled individually. If no grouping is performed, then the label option must be specified as a string to set the label of the axes array. If grouping is performed, then group labels are determined as described before and the label option is ignored.

  Here is an example figure layout definition that creates a figure with a 2x2 grid of axes and specifies custom labels.

  figures:
    main:
      layout:
        kind: grid
        nrows: 2
        ncols: 2
        group: columns
        group_label: column
        label: [top, bottom]
        args:
          figsize: [8,4]

  The resulting hierarchy of labeled axes would be:

  column1
  ├─ top
  └─ bottom
  column2
  ├─ top
  └─ bottom

- svg layout: a layout is created from an svg drawing in which rectangles are tagged as axes. Groups of axes can also be specially tagged to create a hierarchy. PaperBuilder uses the FigureFirst Python module. See the FigureFirst documentation for details on how to tag rectangles and groups in Inkscape. It is strongly recommended to use a separate layer in Inkscape for the layout elements and to put other drawings that need to appear above or below the plots in dedicated overlay/underlay layers. To define an svg layout in the spec.yaml, provide the name of the svg file. Below are three examples (for figures main, suppl1 and suppl2) that show how this can be done:

  figures:
    main: main_layout.svg
    suppl1:
      layout: suppl1_layout.svg
      style:
        - seaborn-white
        - lines.linewidth: 5
          font.size: 9
    suppl2:
      layout:
        kind: svg
        file: suppl2_layout.svg
        output_layer: output
        hide_layer: layout

  (Note that the style block for the suppl1 figure will be explained later.) The suppl2 example shows two extra options that can be set. output_layer sets the name of the layer in the svg file in which the axes are drawn (default: output). hide_layer sets the name(s) of layer(s) that need to be hidden, which is usually the layer that contains the layout (default: layout).

- custom layout: a figure and axes layout is created by a custom Python function. The function should return two values: the matplotlib figure object and a (nested) dictionary of axes. In the spec.yaml file, the function and (optionally) extra arguments for the function can be specified in the following ways:

  figures:
    main: module.main_layout
    suppl1:
      layout: module.suppl1_layout
      style:
        - seaborn-white
        - lines.linewidth: 5
          font.size: 9
    suppl2:
      layout:
        kind: function
        function: module.suppl2_layout
        args:
          n: 1

  Note that the custom layout function could also perform all the necessary plotting to fully create the figure, without making use of the plotting functions (see below). The downside of this approach is the strong coupling between layout and content generation, which does not allow the flexible reuse of (parameterized) plots across layouts.
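To make the contract of a custom layout function concrete, here is a minimal sketch. The axes positions and the labels overview, insets and inset1, inset2, … are assumptions chosen for illustration; only the return signature (a figure and a nested dictionary of axes) follows the description above.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt


def suppl2_layout(n=1):
    """Create a figure with one wide axes and a group of `n` small axes.

    Returns the figure and a nested dictionary of labeled axes.
    """
    fig = plt.figure(figsize=(8, 4))

    # axes positions are [left, bottom, width, height] in figure coordinates
    overview = fig.add_axes([0.08, 0.15, 0.50, 0.75])

    # stack the small axes vertically in the right part of the figure
    height = 0.75 / n
    insets = {}
    for k in range(n):
        bottom = 0.15 + k * height
        insets[f"inset{k + 1}"] = fig.add_axes([0.68, bottom, 0.27, 0.8 * height])

    return fig, {"overview": overview, "insets": insets}
```

With this layout, a nested label such as insets.inset1 would address a single axes, while insets would address the whole group.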
Creating plot content¶
To plot the data in a figure, plotting functions need to be mapped to the labeled axes or groups of axes
that were defined in the figure layout. The first argument to the plotting function will be the destination
axes, an array of axes or a dictionary representing a group of axes. To define which plotting function
should be used for which (group of) axes, one could put the following in the spec.yaml:
plots:
  column1.top: module.plot1
  column2:
    function: module.plot2
    args:
      npoints: 100
    style:
      - default
      - lines.linewidth: 1
The plots section in the spec.yaml file is a map between a (nested) label in the
figure layout and a Python function in a local <module>.py with optional extra arguments.
Deeper levels of the axes dictionary in the figure layout can be indicated using dot-notation.
Given the grid layout example presented previously, the example above
will map the plot1 function to the top axes in the first column and the plot2 function
to the group of axes in the second column (i.e. a dictionary with the top and bottom axes).
Note that a plotting style is also defined for the column2 entry; this will be explained
further below.
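The two kinds of destination can be sketched as follows: plot1 receives a single axes, whereas plot2 receives a dictionary that represents a group of axes. The function bodies are illustrative assumptions; they only rely on the plot and set_title methods that matplotlib axes provide.

```python
import math


def plot1(ax):
    """Draw a diagonal line in a single axes."""
    ax.plot([0, 1], [0, 1])


def plot2(axes, npoints=100):
    """Draw one sine curve in every axes of a group.

    `axes` is a dictionary that maps labels to axes objects and
    `npoints` (at least 2) is the number of samples per curve.
    """
    x = [2 * math.pi * k / (npoints - 1) for k in range(npoints)]
    for index, (label, ax) in enumerate(axes.items(), start=1):
        # the frequency increases with the position of the axes in the group
        ax.plot(x, [math.sin(index * value) for value in x])
        ax.set_title(label)
```

Because the destination axes are passed in as the first argument, the same function can be reused in a different layout by changing only the label it is mapped to.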
Configuring plotting options¶
Matplotlib’s plotting functions accept arguments to set (line) style, line color, etc. each time
you call them. However, if you need to consistently apply the same plotting style across figures,
it is more convenient to use matplotlib’s style system rather than hard-coding the style in calls
to plotting functions. PaperBuilder provides a mechanism to set the plotting style through
plot_options.yaml files. Similar to the configuration system (Configuration options),
plotting options can be set at the user, project and paper version level by the corresponding
plot_options.yaml file. In addition, plot options can also be specified at the figure level
(see Creating figures) and plot level (see Creating plot content).
PaperBuilder will call the plotting function as defined in the spec.yaml file in the context
of the desired default plotting style. In the spec.yaml file, this plotting style is defined
in the style block, for example:
style:
  - seaborn-white
  - lines.linewidth: 5
    font.size: 14
The content of the style block should be a valid argument for the
matplotlib.style.use
function, i.e. a string specifying the name of a style, a dictionary of rc parameters
or a list of these. For the library of named styles that ship with matplotlib, see
the style sheets reference.
Note that default plotting styles are combined across all levels and are applied in
order from general (user level) to specific (figure and plot level).
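The effect of applying styles from general to specific can be illustrated with a small pure-Python sketch. This is not PaperBuilder's implementation: the helper names combine_styles and effective_rc and the example values are assumptions, and named styles are left unresolved because that would require matplotlib's style library.

```python
def combine_styles(*levels):
    """Concatenate style definitions from general to specific levels.

    Each level is a style name, a dictionary of rc parameters, a list of
    these, or None if no style is set at that level.
    """
    combined = []
    for level in levels:
        if level is None:
            continue
        if isinstance(level, (str, dict)):
            level = [level]
        combined.extend(level)
    return combined


def effective_rc(styles):
    """Merge the rc dictionaries in a style list; later entries win.

    Named styles (strings) are skipped in this sketch.
    """
    rc = {}
    for entry in styles:
        if isinstance(entry, dict):
            rc.update(entry)
    return rc


user = ["seaborn-white", {"lines.linewidth": 5, "font.size": 14}]
figure = {"font.size": 9}
plot = {"lines.linewidth": 1}

# no style is set at the project level in this example
styles = combine_styles(user, None, figure, plot)
print(effective_rc(styles))  # {'lines.linewidth': 1, 'font.size': 9}
```

A combined list of this form is also a valid argument for matplotlib.style.use, which applies the entries in order so that later entries override earlier ones.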
In some cases, it would be useful to temporarily set a custom plot style other than the
default and other than the ones that ship with matplotlib. For example, you may have a
default plot style set for drawing the data, but you would like to use a different plot
style for annotations (e.g. thinner lines). To do this, you can create a custom style sheet
under a given name in the style_library section of a plot_options.yaml file.
For example:
style_library:
  annotation:
    lines.linewidth: 1
Within a plotting function, you can now do:
with matplotlib.style.context('annotation'):
    ax.plot([0,1], [0,1])
Often, you find yourself using the same colors across plots because they represent the same
experimental group in the data. Piggy-backing on matplotlib’s internal map of
named colors, you can create
custom named colors in a plot_options.yaml file.
For example:
colors:
  reward-low: royalblue
  reward-high: crimson
  annotation: black
  order-first: mediumseagreen
  order-last: seagreen
  delay: cadetblue
  ontime: orange
You could either map the custom color names to existing named colors (as is done above), or use one of the accepted color formats (see https://matplotlib.org/api/colors_api.html).
With the above custom colors defined, you can now do the following in a plotting function:
# plot data for low reward group
ax.plot([0,1], [0,1], color='reward-low')
Executing an analysis¶
(to be completed)
Using IPython notebooks¶
PaperBuilder supports IPython notebooks in addition to standard Python modules, if you prefer
to build and test the analysis functions interactively. Thus, when specifying any function in
the spec.yaml file, the module part could refer to an .ipynb file in the analysis folder.
However, there are a few caveats:
- When a notebook is imported, only import statements, constants (i.e. capitalized module-level variables) and function definitions are imported. No other code in cells is executed. This means that the summary / table / plot functions should work independently of the remaining code.
- Currently, plotting styles, custom named styles and named colors are not automatically set when you work interactively in the notebook. Future work may add this functionality.
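To get a feeling for which parts of a notebook cell survive the import, the sketch below filters the top-level statements of a piece of source code with Python's ast module. It illustrates the rule described in the first caveat and is not PaperBuilder's actual import mechanism; in particular, treating "capitalized" as "all upper case" is an assumption.

```python
import ast


def importable_nodes(source):
    """Return the source of the top-level statements that would be kept."""
    kept = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.Import, ast.ImportFrom, ast.FunctionDef)):
            kept.append(node)
        elif isinstance(node, ast.Assign):
            # keep assignments to upper-case names only, e.g. N_BINS = 20
            targets = node.targets
            if all(isinstance(t, ast.Name) and t.id.isupper() for t in targets):
                kept.append(node)
    return [ast.get_source_segment(source, node) for node in kept]


cell_source = """\
import math
N_BINS = 20
data = [1, 2, 3]
print(len(data))

def area(radius):
    return math.pi * radius ** 2
"""

kept = importable_nodes(cell_source)
```

Here the import, the constant N_BINS and the function area are kept, whereas the data assignment and the print call are dropped. A function that relied on the module-level data variable would therefore fail after import.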
Sharing data across functions¶
You will quickly discover the need for sharing data across summary / table / plot functions and across analyses. The recommended way of doing this is to write separate data loading or data generation functions that cache their outputs given the same inputs. One way of doing this is to use Python’s memoizing decorators:
import functools

@functools.lru_cache(maxsize=1)
def load_data():
    # here you would get data from disk
    # optionally do some clean up of the data
    # and return the data
    pass
If your cached function does not take any arguments, then a maximum cache size of 1 would be sufficient. However, if your function takes arguments, then you may want to set the cache size to a value that is close to the number of unique argument combinations used in your code.
Beware that in interactive work in the notebook, you may have to occasionally clear the cache manually, if source data has changed (e.g. in case the data file was updated).
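Functions decorated with functools.lru_cache expose a cache_clear method for this purpose. The sketch below uses a hypothetical call counter and placeholder data to show when the function body actually runs:

```python
import functools

CALLS = {"count": 0}  # only used to show when the function body runs


@functools.lru_cache(maxsize=1)
def load_data():
    CALLS["count"] += 1
    return (1.0, 2.0, 3.0)  # placeholder for data read from disk


load_data()
load_data()  # served from the cache, the body does not run again
assert CALLS["count"] == 1

load_data.cache_clear()  # forget cached results, e.g. after the file changed
load_data()
assert CALLS["count"] == 2
```

The cache_info method of the decorated function reports hits, misses and the current cache size, which can help when choosing a value for maxsize.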
Code organization¶
(to be completed)
put code in right place: code specific to paper version, specific to project or general code