Basic Tutorial
In this first tutorial, we will go through all the main features of the cdt package:
Loading a dataset
Recovering a graph skeleton with independence tests
Applying a causal discovery algorithm
Evaluating our approach using 3 different scoring metrics
1. Load data
Loading a standard dataset with the cdt package is straightforward using the cdt.data module. In this tutorial, we are going to load the Sachs dataset.
See also
Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A., & Nolan, G. P. (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721), 523-529.
This dataset is quite useful: it is small, based on real data, and has a relatively well-known ground truth.
>>> import cdt
>>> import networkx as nx  # used below to inspect the graphs
>>> data, graph = cdt.data.load_dataset('sachs')
>>> print(data.head())
praf pmek plcg PIP2 PIP3 p44/42 pakts473 PKA PKC P38 pjnk
0 26.4 13.2 8.82 18.30 58.80 6.61 17.0 414.0 17.00 44.9 40.0
1 35.9 16.5 12.30 16.80 8.13 18.60 32.5 352.0 3.37 16.5 61.5
2 59.4 44.1 14.60 10.20 13.00 14.90 32.5 403.0 11.40 31.9 19.5
3 73.0 82.8 23.10 13.50 1.29 5.83 11.8 528.0 13.70 28.6 23.1
4 33.7 19.8 5.19 9.73 24.80 21.10 46.1 305.0 4.66 25.7 81.3
The dataset is now loaded: the data object is a pandas.DataFrame containing all the data, and graph contains the ground truth of the dataset.
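As a quick sanity check, both objects can be inspected directly; this is a minimal sketch continuing the session above (printed output omitted here):
>>> print(type(data), data.shape)  # a pandas.DataFrame of observations
>>> print(type(graph), graph.number_of_nodes(), graph.number_of_edges())  # the ground-truth graph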
2. Graph skeleton
Recovering a graph skeleton from the data is quite useful to get information on its structure. To do so, let's apply the Graph Lasso.
See also
Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432-441:
>>> glasso = cdt.independence.graph.Glasso()
>>> skeleton = glasso.predict(data)
>>> print(skeleton)
<networkx.classes.digraph.DiGraph at 0x7fe3ccfb1438>
The skeleton object is a networkx.Graph instance, which might seem quite obscure at first but is handy in the long run (see the networkx documentation for a quick introduction to networkx graphs).
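For instance, a couple of standard networkx calls give a first look at the recovered skeleton; a minimal sketch continuing the session above (outputs omitted):
>>> print(skeleton.number_of_nodes(), skeleton.number_of_edges())  # size of the skeleton
>>> print(list(skeleton.edges(data=True))[:3])                     # a few edges with their attributes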
We can check the structure of the skeleton by looking at its adjacency matrix:
>>> print(nx.adjacency_matrix(skeleton).todense())
matrix([[ 9.26744031e-04, -6.13751618e-04, 1.66612981e-05,
-1.10912131e-06, -3.04172363e-05, -9.71526466e-05,
7.00340545e-05, -1.93863471e-06, -7.31774543e-06,
2.29788237e-06, 8.31264711e-06],
[-6.13751618e-04, 4.14978956e-04, -1.37962487e-05,
1.42164753e-06, 2.04443539e-05, 8.24208108e-05,
-5.63238668e-05, 1.62688021e-06, 8.63444133e-06,
-2.88779755e-06, -4.69605195e-06],
[ 1.66612981e-05, -1.37962487e-05, 2.60824802e-04,
-1.35895911e-04, 8.78979413e-05, 2.17234579e-05,
-2.07856535e-05, 1.23313600e-06, 2.12954874e-05,
-3.22869246e-06, -7.47522248e-06],
[-1.10912131e-06, 1.42164753e-06, -1.35895911e-04,
8.68622146e-05, -7.05405720e-05, 3.08709259e-06,
-2.60810094e-06, 9.09261370e-09, -6.25320515e-06,
2.56399675e-07, 2.85201875e-07],
[-3.04172363e-05, 2.04443539e-05, 8.78979413e-05,
-7.05405720e-05, 6.09681818e-04, -9.91703900e-06,
1.78188074e-05, -5.97491176e-07, 6.11896719e-06,
-4.30918870e-07, 5.79322379e-06],
[-9.71526466e-05, 8.24208108e-05, 2.17234579e-05,
3.08709259e-06, -9.91703900e-06, 1.10860610e-03,
-3.08483289e-04, -1.30867663e-05, -3.31258890e-05,
7.76132824e-06, 2.10416319e-05],
[ 7.00340545e-05, -5.63238668e-05, -2.07856535e-05,
-2.60810094e-06, 1.78188074e-05, -3.08483289e-04,
1.66144775e-04, 1.26667898e-06, 3.11407736e-05,
-7.29116898e-06, -1.86454298e-05],
[-1.93863471e-06, 1.62688021e-06, 1.23313600e-06,
9.09261370e-09, -5.97491176e-07, -1.30867663e-05,
1.26667898e-06, 2.80073467e-06, -3.78879972e-06,
8.67580852e-07, 6.92379671e-07],
[-7.31774543e-06, 8.63444133e-06, 2.12954874e-05,
-6.25320515e-06, 6.11896719e-06, -3.31258890e-05,
3.11407736e-05, -3.78879972e-06, 1.59642510e-03,
-2.58155157e-04, -1.01767664e-04],
[ 2.29788237e-06, -2.88779755e-06, -3.22869246e-06,
2.56399675e-07, -4.30918870e-07, 7.76132824e-06,
-7.29116898e-06, 8.67580852e-07, -2.58155157e-04,
5.32997159e-05, -3.35285721e-06],
[ 8.31264711e-06, -4.69605195e-06, -7.47522248e-06,
2.85201875e-07, 5.79322379e-06, 2.10416319e-05,
-1.86454298e-05, 6.92379671e-07, -1.01767664e-04,
-3.35285721e-06, 7.05796078e-05]])
As you may have noticed, the graph is quite dense. Let's remove indirect links in the graph using the ARACNE algorithm.
See also
Margolin, A. A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Dalla Favera, R., & Califano, A. (2006). ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context. BMC Bioinformatics, 7(Suppl 1), S7. https://doi.org/10.1186/1471-2105-7-S1-S7
>>> new_skeleton = cdt.utils.graph.remove_indirect_links(skeleton, alg='aracne')
>>> print(nx.adjacency_matrix(new_skeleton).todense())
matrix([[9.26576364e-04, 0.00000000e+00, 1.66279016e-05, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 6.99676073e-05, 0.00000000e+00,
0.00000000e+00, 2.26182196e-06, 8.29822467e-06],
[0.00000000e+00, 4.14897907e-04, 0.00000000e+00, 0.00000000e+00,
2.04095344e-05, 8.22844158e-05, 0.00000000e+00, 1.62835373e-06,
8.48527014e-06, 0.00000000e+00, 0.00000000e+00],
[1.66279016e-05, 0.00000000e+00, 2.60808178e-04, 0.00000000e+00,
8.78753504e-05, 2.17299267e-05, 0.00000000e+00, 1.23340219e-06,
2.12217433e-05, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 8.68568259e-05,
0.00000000e+00, 3.07901285e-06, 0.00000000e+00, 8.94955462e-09,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 2.04095344e-05, 8.78753504e-05, 0.00000000e+00,
6.09654932e-04, 0.00000000e+00, 1.77984674e-05, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 5.80118715e-06],
[0.00000000e+00, 8.22844158e-05, 2.17299267e-05, 3.07901285e-06,
0.00000000e+00, 1.10847276e-03, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 7.72649753e-06, 2.10224309e-05],
[6.99676073e-05, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
1.77984674e-05, 0.00000000e+00, 1.66117739e-04, 1.26646124e-06,
3.10736844e-05, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 1.62835373e-06, 1.23340219e-06, 8.94955462e-09,
0.00000000e+00, 0.00000000e+00, 1.26646124e-06, 2.80075082e-06,
0.00000000e+00, 8.67949681e-07, 6.92548597e-07],
[0.00000000e+00, 8.48527014e-06, 2.12217433e-05, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 3.10736844e-05, 0.00000000e+00,
1.59628546e-03, 0.00000000e+00, 0.00000000e+00],
[2.26182196e-06, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 7.72649753e-06, 0.00000000e+00, 8.67949681e-07,
0.00000000e+00, 5.32959890e-05, 0.00000000e+00],
[8.29822467e-06, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
5.80118715e-06, 2.10224309e-05, 0.00000000e+00, 6.92548597e-07,
0.00000000e+00, 0.00000000e+00, 7.05766621e-05]])
The resulting skeleton is much sparser. Let's build on this new skeleton to perform our causal discovery.
3. Causal discovery
Now that we have a graph skeleton, we are going to perform constrained causal discovery using the GES algorithm.
See also
Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research, 3, 507–554.
>>> model = cdt.causality.graph.GES()
>>> output_graph = model.predict(data, new_skeleton)
>>> print(nx.adjacency_matrix(output_graph).todense())
matrix([[0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1],
[1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)
We obtain GES's prediction using the skeleton as a constraint. Next, we are going to evaluate our solution by comparing it to CAM run without a skeleton.
See also
J. Peters, J. Mooij, D. Janzing, B. Schölkopf: Causal Discovery with Continuous Additive Noise Models, JMLR 15:2009-2053, 2014.
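Before moving on to the evaluation, note that since the skeleton was passed to GES as a constraint, every edge in output_graph is expected to also appear (in some orientation) in new_skeleton. A quick, hedged check of that expectation (continuing the session above):
>>> undirected_output = output_graph.to_undirected()
>>> print(all(new_skeleton.has_edge(u, v) for u, v in undirected_output.edges()))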
4. Evaluation and scoring metrics
In order to evaluate the various predictions against the ground truth, the cdt package includes 3 different metrics in the cdt.metrics module:
Area under the precision-recall curve
Structural Hamming Distance (SHD)
Structural Intervention Distance (SID)
>>> from cdt.metrics import (precision_recall, SID, SHD)
>>> scores = [metric(graph, output_graph) for metric in (precision_recall, SID, SHD)]
>>> print(scores)
[(0.3212943387361992, [(0.1487603305785124, 1.0), (0.16279069767441862, 0.3888888888888889), (1.0, 0.0)]),
array(76.),
47]
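Each entry of scores follows the return format of its metric: precision_recall returns the area under the precision-recall curve together with the list of points used to compute it, SID returns a numpy array, and SHD returns a single count. A minimal sketch of unpacking these values (continuing the session above):
>>> aupr, pr_points = scores[0]   # area under the PR curve and the curve points
>>> sid_score, shd_score = scores[1], scores[2]
>>> print(aupr, float(sid_score), shd_score)
0.3212943387361992 76.0 47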
>>> # now we compute the CAM graph without constraints and the associated scores
>>> model2 = cdt.causality.graph.CAM()
>>> output_graph_nc = model2.predict(data)
>>> scores_nc = [metric(graph, output_graph_nc) for metric in (precision_recall, SID, SHD)]
>>> print(scores_nc)
[(0.4423624964377315, [(0.1487603305785124, 1.0), (0.3103448275862069, 0.5), (1.0, 0.0)]),
array(68.),
29]
We can observe that CAM performs better than our previous pipeline (a side-by-side comparison is sketched after this list), as:
The average precision score is higher
The SID score is lower
The SHD score is lower as well
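To make this comparison explicit, the score lists from both pipelines can be printed side by side; a small sketch reusing scores and scores_nc from above (only the area is kept for the precision-recall entry):
>>> for name, ours, cam in zip(['AUPR', 'SID', 'SHD'], scores, scores_nc):
...     if name == 'AUPR':  # keep only the area under the precision-recall curve
...         ours, cam = ours[0], cam[0]
...     print(name, float(ours), float(cam))
AUPR 0.3212943387361992 0.4423624964377315
SID 76.0 68.0
SHD 47.0 29.0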
This concludes our first tutorial on the cdt package.