Advanced Tutorial

This second tutorial targets more experienced users. We will focus on:

Launching cdt Docker containers
Tweaking cdt.SETTINGS to adapt the package to the hardware configuration
Generating an artificial dataset from scratch
Performing causal discovery on GPU
Evaluating the results

1. Launch the Docker containers

Docker images provide a portable environment with minimal impact on performance. This is especially handy here, because the required R libraries are time-consuming to install and often run into incompatibilities depending on the user environment. See https://docs.docker.com/install/ to install Docker and for a quick tutorial on its usage.

cdt Docker images are available at https://hub.docker.com/u/divkal . Check here to select the image adapted to your configuration. This tutorial assumes that GPUs are available, but the steps are nearly identical without GPUs: select the CPU Docker image instead of the GPU one.

$ docker pull divkal/nv-cdt-py3.6:XX  # XX corresponds to the latest version
$ nvidia-docker run -it --init --ipc=host --rm -u=$(id -u):$(id -g) divkal/nv-cdt-py3.6:XX /bin/bash
=============
== PyTorch ==
=============

NVIDIA Release 18.09 (build 687447)

Container image Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.

Copyright (c) 2016-     Facebook, Inc            (Adam Paszke)
Copyright (c) 2014-     Facebook, Inc            (Soumith Chintala)
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
Failed to detect NVIDIA driver version.

I have no name!@5308f95cd331:/workspace$
I have no name!@5308f95cd331:/workspace$ ipython
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

The Docker image is built upon the NVIDIA NGC Docker image for PyTorch. The options of the docker command are:

nvidia-docker is a variant of docker developed by NVIDIA for GPU passthrough. It is available at https://github.com/NVIDIA/nvidia-docker
-it launches the container in interactive mode.
--init runs an init process inside the container that forwards signals such as SIGINT or SIGTERM to the running program.
--ipc=host shares the host's IPC namespace, and thus its shared memory, with the container.
--rm saves space by deleting the container at the end of the execution.
-u launches the container as a specific user; otherwise it runs as root. This makes files created in the container accessible from the outside environment.
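
The options above can also be assembled programmatically. The following is a minimal sketch, using only the Python standard library, that builds the same command line without running it. The helper name build_run_command is hypothetical, and the tag XX is the same version placeholder as in the command above.

```python
import os
import shlex


def build_run_command(image, shell="/bin/bash"):
    """Return the nvidia-docker invocation from this tutorial as an argument list.

    Hypothetical helper for illustration; it only builds the command.
    """
    # os.getuid/os.getgid mirror `id -u` and `id -g` (POSIX systems only).
    user = "{}:{}".format(os.getuid(), os.getgid())
    return [
        "nvidia-docker", "run",
        "-it",            # interactive mode with a terminal
        "--init",         # init process that forwards signals
        "--ipc=host",     # copied as-is from the command above
        "--rm",           # delete the container on exit
        "-u=" + user,     # run as the current user instead of root
        image,
        shell,
    ]


command = build_run_command("divkal/nv-cdt-py3.6:XX")  # XX: version placeholder
print(" ".join(shlex.quote(arg) for arg in command))
```

Passing the resulting list to subprocess.run would launch the container, provided nvidia-docker is installed and XX is replaced by an actual tag.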

2. Adapt the cdt package configuration

In this section, we will tweak cdt.SETTINGS to fit our usage. We first check the current configuration, then increase the number of jobs: since the graph generated in the next section is quite small, several jobs can share one GPU. More details on the package settings are provided here.

In [1]: import cdt
Detecting 1 CUDA device(s).

In [2]: cdt.SETTINGS.GPU  # Is set to the number of devices
Out[2]: 1

In [3]: cdt.SETTINGS.NJOBS  # Defaults to the number of devices
Out[3]: 1

In [4]: cdt.SETTINGS.NJOBS = 3  # 3 jobs per GPU

In [5]: cdt.SETTINGS.verbose = False
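
To picture what several jobs per GPU means, here is an illustrative sketch of how a workload of 12 runs (an arbitrary example) might be spread over the job slots configured above. It assumes a simple round-robin placement. The function assign_jobs is hypothetical, and cdt's own scheduler may distribute work differently.

```python
def assign_jobs(n_runs, n_jobs, n_gpus):
    """Map each run to a (run, worker slot, device) triple, round-robin.

    Hypothetical illustration, not cdt's scheduler.
    """
    placement = []
    for run in range(n_runs):
        slot = run % n_jobs  # which of the parallel workers takes the run
        device = "cuda:{}".format(slot % n_gpus) if n_gpus > 0 else "cpu"
        placement.append((run, slot, device))
    return placement


# Values from this tutorial: 1 GPU and NJOBS = 3.
for run, slot, device in assign_jobs(n_runs=12, n_jobs=3, n_gpus=1):
    print("run {:2d} -> worker {} on {}".format(run, slot, device))
```

Under this assumption, all three workers share cuda:0, so at most three runs execute concurrently on that device.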

3. Artificial graph generation

Generating an artificial graph with the cdt package is straightforward using the cdt.data.AcyclicGraphGenerator class. Check here for more details on how to customize the graph generator.

In [6]: generator = cdt.data.AcyclicGraphGenerator('gp_add', noise_coeff=.2,
                                                   nodes=20, parents_max=3)

In [7]: data, graph = generator.generate()

In [7]: data.head()
Out[7]:
         V0        V1        V2        V3    ...          V16       V17       V18       V19
0 -0.948506  0.366023 -0.659409 -1.012921    ...    -0.086537  0.504257  1.163381 -0.815508
1 -1.175473  1.612285  1.087017 -1.505346    ...    -0.119292 -1.251204  0.303203 -0.730214
2 -0.899956  0.757223 -0.394799 -1.345747    ...    -0.620322 -0.919279 -1.948743  0.027883
3 -1.143217  1.419192  0.608848 -1.144207    ...     1.992465 -1.277411 -0.109563 -0.907268
4 -0.653106 -0.582684 -0.947306 -0.701014    ...    -0.217655  1.429272 -1.156742  1.305437

[5 rows x 20 columns]

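The data and the graph are now generated: data is a pandas DataFrame with one column per variable, and graph holds the ground-truth causal structure used in the scoring step.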
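
To give an idea of what such a generator does, the sketch below samples a random acyclic graph and draws data from an additive noise model using only the standard library. It is a simplified stand-in, not cdt's implementation: the function generate_toy_dataset is hypothetical, and tanh replaces the Gaussian-process mechanism selected by 'gp_add'.

```python
import math
import random


def generate_toy_dataset(nodes=20, parents_max=3, noise_coeff=0.2,
                         npoints=500, seed=0):
    """Sample a random DAG and data from an additive noise model.

    Stand-in for illustration: tanh replaces the Gaussian-process mechanism.
    """
    rng = random.Random(seed)
    # Node i may only have parents among nodes 0..i-1, which guarantees acyclicity.
    parents = {}
    for i in range(nodes):
        k = rng.randint(0, min(parents_max, i))
        parents[i] = sorted(rng.sample(range(i), k))

    data = []
    for _ in range(npoints):
        row = []
        for i in range(nodes):  # index order is a topological order
            noise = rng.gauss(0.0, 1.0)
            if parents[i]:
                signal = sum(math.tanh(row[p]) for p in parents[i])
                row.append(signal + noise_coeff * noise)
            else:
                row.append(noise)  # root causes are pure noise
        data.append(row)
    return data, parents


data_toy, parents_toy = generate_toy_dataset()
print(len(data_toy), "samples,", len(data_toy[0]), "variables")
print("parents of V5:", ["V{}".format(p) for p in parents_toy[5]])
```

Each variable is computed from its parents plus scaled noise, so noise_coeff plays the same role as in the call above: it sets how much noise is added relative to the causal signal.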

4. Run SAM on GPUs

Running multiple bootstrapped runs of SAM has proved to yield much better results than a single run. The nruns parameter controls the total number of runs. As soon as cdt.SETTINGS.GPU > 0, GPU-compatible algorithms are automatically executed on those devices, which makes the prediction step look the same as for a traditional algorithm:

In [8]: sam = cdt.causality.graph.SAM(nruns=12)

In [9]: prediction = sam.predict(data)
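
The benefit of repeated runs can be pictured as averaging: each run produces a score per candidate edge, and the mean over runs smooths out edges that are found only occasionally. The sketch below assumes such a mean aggregation over plain nested lists. The function average_runs and the example scores are hypothetical, not SAM's output.

```python
def average_runs(run_scores):
    """Average a list of square edge-score matrices element-wise."""
    n_runs = len(run_scores)
    size = len(run_scores[0])
    return [[sum(run[i][j] for run in run_scores) / n_runs
             for j in range(size)]
            for i in range(size)]


# Made-up scores for a 2-variable problem over 4 runs (entry [i][j]: edge i -> j).
runs = [
    [[0.0, 1.0], [0.0, 0.0]],
    [[0.0, 1.0], [0.0, 0.0]],
    [[0.0, 0.0], [1.0, 0.0]],  # one run picks the reverse direction
    [[0.0, 1.0], [0.0, 0.0]],
]
mean_scores = average_runs(runs)
print(mean_scores)  # [[0.0, 0.75], [0.25, 0.0]]
```

The single dissenting run lowers the score of the majority edge to 0.75 rather than flipping the result, which is the stabilizing effect that multiple runs are meant to provide.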

5. Scoring the results

As in the first tutorial, we can quickly score the results using the functions in cdt.metrics:

In [10]: from cdt.metrics import (precision_recall, SHD)

In [11]: [metric(graph, prediction) for metric in
         (precision_recall, SHD)]
Out[11]: [(0.53, [(0.06, 1.0), (1.0, 0.0)]), 24.0]
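
For intuition about the second number, the structural Hamming distance counts the entries where two adjacency matrices disagree. The following sketch implements that count on plain nested lists, under the convention that a reversed edge counts as two errors. It illustrates the definition and is not cdt's implementation; the function name shd_sketch is hypothetical.

```python
def shd_sketch(target, pred):
    """Count differing entries between two binary adjacency matrices."""
    return sum(abs(t - p)
               for row_t, row_p in zip(target, pred)
               for t, p in zip(row_t, row_p))


# Ground truth: 0 -> 1 -> 2
target = [[0, 1, 0],
          [0, 0, 1],
          [0, 0, 0]]
# Prediction: 0 -> 1 found, 1 -> 2 reversed, and a spurious 0 -> 2
pred = [[0, 1, 1],
        [0, 0, 0],
        [0, 1, 0]]

print(shd_sketch(target, pred))  # 3: the reversal counts twice, the extra edge once
```

A lower value means the predicted graph is closer to the ground truth, with 0 indicating identical graphs.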

This concludes our second tutorial on the cdt package.