Basic Tutorial
In this first tutorial, we will go through all the main features of the cdt package:
Loading a dataset
Recovering a graph skeleton with independence tests
Applying a causal discovery algorithm
Evaluating our approach using 3 different scoring metrics
1. Load data
Loading a standard dataset with the cdt package is straightforward using the cdt.data module. In this tutorial, we are going to load the Sachs dataset.
See also
Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A., & Nolan, G. P. (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721), 523-529.
This dataset is quite useful: it is small, based on real data, and has a relatively well-known ground truth.
>>> import cdt
>>> import networkx as nx  # used below to inspect the graphs
>>> data, graph = cdt.data.load_dataset('sachs')
>>> print(data.head())
praf pmek plcg PIP2 PIP3 p44/42 pakts473 PKA PKC P38 pjnk
0 26.4 13.2 8.82 18.30 58.80 6.61 17.0 414.0 17.00 44.9 40.0
1 35.9 16.5 12.30 16.80 8.13 18.60 32.5 352.0 3.37 16.5 61.5
2 59.4 44.1 14.60 10.20 13.00 14.90 32.5 403.0 11.40 31.9 19.5
3 73.0 82.8 23.10 13.50 1.29 5.83 11.8 528.0 13.70 28.6 23.1
4 33.7 19.8 5.19 9.73 24.80 21.10 46.1 305.0 4.66 25.7 81.3
The dataset is now loaded: the data object is a pandas.DataFrame containing all the data, and graph contains the ground truth of the dataset.
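As a quick sanity check, both objects can be inspected directly; this is a minimal sketch continuing the session above (printed output omitted here):
>>> print(type(data), data.shape)  # a pandas.DataFrame of observations
>>> print(type(graph), graph.number_of_nodes(), graph.number_of_edges())  # the ground-truth graph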
2. Graph skeleton
Recovering a graph skeleton from the data is quite useful to get information on its structure. To do so, let's apply the Graph Lasso.
See also
Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432-441:
>>> glasso = cdt.independence.graph.Glasso()
>>> skeleton = glasso.predict(data)
>>> print(skeleton)
<networkx.classes.digraph.DiGraph at 0x7fe3ccfb1438>
The skeleton object is a networkx.Graph instance, which might seem quite obscure at first but is handy in the long run (see the networkx documentation for a quick introduction to networkx graphs).
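For instance, a couple of standard networkx calls give a first look at the recovered skeleton; a minimal sketch continuing the session above (outputs omitted):
>>> print(skeleton.number_of_nodes(), skeleton.number_of_edges())  # size of the skeleton
>>> print(list(skeleton.edges(data=True))[:3])                     # a few edges with their attributes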
We can check the structure of the skeleton by looking at its adjacency matrix:
>>> print(nx.adjacency_matrix(skeleton).todense())
matrix([[ 9.26744031e-04, -6.13751618e-04, 1.66612981e-05,
-1.10912131e-06, -3.04172363e-05, -9.71526466e-05,
7.00340545e-05, -1.93863471e-06, -7.31774543e-06,
2.29788237e-06, 8.31264711e-06],
[-6.13751618e-04, 4.14978956e-04, -1.37962487e-05,
1.42164753e-06, 2.04443539e-05, 8.24208108e-05,
-5.63238668e-05, 1.62688021e-06, 8.63444133e-06,
-2.88779755e-06, -4.69605195e-06],
[ 1.66612981e-05, -1.37962487e-05, 2.60824802e-04,
-1.35895911e-04, 8.78979413e-05, 2.17234579e-05,
-2.07856535e-05, 1.23313600e-06, 2.12954874e-05,
-3.22869246e-06, -7.47522248e-06],
[-1.10912131e-06, 1.42164753e-06, -1.35895911e-04,
8.68622146e-05, -7.05405720e-05, 3.08709259e-06,
-2.60810094e-06, 9.09261370e-09, -6.25320515e-06,
2.56399675e-07, 2.85201875e-07],
[-3.04172363e-05, 2.04443539e-05, 8.78979413e-05,
-7.05405720e-05, 6.09681818e-04, -9.91703900e-06,
1.78188074e-05, -5.97491176e-07, 6.11896719e-06,
-4.30918870e-07, 5.79322379e-06],
[-9.71526466e-05, 8.24208108e-05, 2.17234579e-05,
3.08709259e-06, -9.91703900e-06, 1.10860610e-03,
-3.08483289e-04, -1.30867663e-05, -3.31258890e-05,
7.76132824e-06, 2.10416319e-05],
[ 7.00340545e-05, -5.63238668e-05, -2.07856535e-05,
-2.60810094e-06, 1.78188074e-05, -3.08483289e-04,
1.66144775e-04, 1.26667898e-06, 3.11407736e-05,
-7.29116898e-06, -1.86454298e-05],
[-1.93863471e-06, 1.62688021e-06, 1.23313600e-06,
9.09261370e-09, -5.97491176e-07, -1.30867663e-05,
1.26667898e-06, 2.80073467e-06, -3.78879972e-06,
8.67580852e-07, 6.92379671e-07],
[-7.31774543e-06, 8.63444133e-06, 2.12954874e-05,
-6.25320515e-06, 6.11896719e-06, -3.31258890e-05,
3.11407736e-05, -3.78879972e-06, 1.59642510e-03,
-2.58155157e-04, -1.01767664e-04],
[ 2.29788237e-06, -2.88779755e-06, -3.22869246e-06,
2.56399675e-07, -4.30918870e-07, 7.76132824e-06,
-7.29116898e-06, 8.67580852e-07, -2.58155157e-04,
5.32997159e-05, -3.35285721e-06],
[ 8.31264711e-06, -4.69605195e-06, -7.47522248e-06,
2.85201875e-07, 5.79322379e-06, 2.10416319e-05,
-1.86454298e-05, 6.92379671e-07, -1.01767664e-04,
-3.35285721e-06, 7.05796078e-05]])
As you may have noticed, the graph is quite dense. Let's remove indirect links in the graph using the ARACNE algorithm.
See also
Margolin, A. A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Dalla Favera, R., & Califano, A. (2006). ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context. BMC Bioinformatics, 7(Suppl 1), S7. https://doi.org/10.1186/1471-2105-7-S1-S7
>>> new_skeleton = cdt.utils.graph.remove_indirect_links(skeleton, alg='aracne')
>>> print(nx.adjacency_matrix(new_skeleton).todense())
matrix([[9.26576364e-04, 0.00000000e+00, 1.66279016e-05, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 6.99676073e-05, 0.00000000e+00,
0.00000000e+00, 2.26182196e-06, 8.29822467e-06],
[0.00000000e+00, 4.14897907e-04, 0.00000000e+00, 0.00000000e+00,
2.04095344e-05, 8.22844158e-05, 0.00000000e+00, 1.62835373e-06,
8.48527014e-06, 0.00000000e+00, 0.00000000e+00],
[1.66279016e-05, 0.00000000e+00, 2.60808178e-04, 0.00000000e+00,
8.78753504e-05, 2.17299267e-05, 0.00000000e+00, 1.23340219e-06,
2.12217433e-05, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 8.68568259e-05,
0.00000000e+00, 3.07901285e-06, 0.00000000e+00, 8.94955462e-09,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 2.04095344e-05, 8.78753504e-05, 0.00000000e+00,
6.09654932e-04, 0.00000000e+00, 1.77984674e-05, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 5.80118715e-06],
[0.00000000e+00, 8.22844158e-05, 2.17299267e-05, 3.07901285e-06,
0.00000000e+00, 1.10847276e-03, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 7.72649753e-06, 2.10224309e-05],
[6.99676073e-05, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
1.77984674e-05, 0.00000000e+00, 1.66117739e-04, 1.26646124e-06,
3.10736844e-05, 0.00000000e+00, 0.00000000e+00],
[0.00000000e+00, 1.62835373e-06, 1.23340219e-06, 8.94955462e-09,
0.00000000e+00, 0.00000000e+00, 1.26646124e-06, 2.80075082e-06,
0.00000000e+00, 8.67949681e-07, 6.92548597e-07],
[0.00000000e+00, 8.48527014e-06, 2.12217433e-05, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 3.10736844e-05, 0.00000000e+00,
1.59628546e-03, 0.00000000e+00, 0.00000000e+00],
[2.26182196e-06, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 7.72649753e-06, 0.00000000e+00, 8.67949681e-07,
0.00000000e+00, 5.32959890e-05, 0.00000000e+00],
[8.29822467e-06, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
5.80118715e-06, 2.10224309e-05, 0.00000000e+00, 6.92548597e-07,
0.00000000e+00, 0.00000000e+00, 7.05766621e-05]])
The resulting skeleton is much sparser. Let's build on this new skeleton to perform our causal discovery.
3. Causal discovery
Now that we have a graph skeleton, we are going to perform constrained causal discovery using the GES algorithm.
See also
Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research, 3, 507–554.
>>> model = cdt.causality.graph.GES()
>>> output_graph = model.predict(data, new_skeleton)
>>> print(nx.adjacency_matrix(output_graph).todense())
matrix([[0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1],
[1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)
We obtain GES's prediction using the skeleton as a constraint. Next, we are going to evaluate our solution by comparing it to CAM run without a skeleton.
See also
J. Peters, J. Mooij, D. Janzing, B. Schölkopf: Causal Discovery with Continuous Additive Noise Models, JMLR 15:2009-2053, 2014.
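Before moving on to the evaluation, note that since the skeleton was passed to GES as a constraint, every edge in output_graph is expected to also appear (in some orientation) in new_skeleton. A quick, hedged check of that expectation (continuing the session above):
>>> undirected_output = output_graph.to_undirected()
>>> print(all(new_skeleton.has_edge(u, v) for u, v in undirected_output.edges()))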
4. Evaluation and scoring metrics
In order to evaluate the various predictions against the ground truth, the cdt package includes 3 different metrics in the cdt.metrics module:
Area under the precision-recall curve
Structural Hamming Distance (SHD)
Structural Intervention Distance (SID)
>>> from cdt.metrics import (precision_recall, SID, SHD)
>>> scores = [metric(graph, output_graph) for metric in (precision_recall, SID, SHD)]
>>> print(scores)
[(0.3212943387361992, [(0.1487603305785124, 1.0), (0.16279069767441862, 0.3888888888888889), (1.0, 0.0)]),
array(76.),
47]
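Each entry of scores follows the return format of its metric: precision_recall returns the area under the precision-recall curve together with the list of points used to compute it, SID returns a numpy array, and SHD returns a single count. A minimal sketch of unpacking these values (continuing the session above):
>>> aupr, pr_points = scores[0]   # area under the PR curve and the curve points
>>> sid_score, shd_score = scores[1], scores[2]
>>> print(aupr, float(sid_score), shd_score)
0.3212943387361992 76.0 47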
>>> # now we compute the CAM graph without constraints and the associated scores
>>> model2 = cdt.causality.graph.CAM()
>>> output_graph_nc = model2.predict(data)
>>> scores_nc = [metric(graph, output_graph_nc) for metric in (precision_recall, SID, SHD)]
>>> print(scores_nc)
[(0.4423624964377315, [(0.1487603305785124, 1.0), (0.3103448275862069, 0.5), (1.0, 0.0)]),
array(68.),
29]
We can observe that CAM performs better than our previous pipeline (a side-by-side comparison is sketched after this list), as:
The average precision score is higher
The SID score is lower
The SHD score is lower as well
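To make this comparison explicit, the score lists from both pipelines can be printed side by side; a small sketch reusing scores and scores_nc from above (only the area is kept for the precision-recall entry):
>>> for name, ours, cam in zip(['AUPR', 'SID', 'SHD'], scores, scores_nc):
...     if name == 'AUPR':  # keep only the area under the precision-recall curve
...         ours, cam = ours[0], cam[0]
...     print(name, float(ours), float(cam))
AUPR 0.3212943387361992 0.4423624964377315
SID 76.0 68.0
SHD 47.0 29.0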
This concludes our first tutorial on the cdt package.