Basic Tutorial ============== In this first tutorial, we will got through all the main features of the `cdt` package: 1. Loading a dataset 2. Recovering a graph skeleton with independence tests 3. Apply a causal Discovery algorithm 4. Evaluate our approach using 3 different scoring metrics 1. Load data ------------ Loading a standard dataset using the `cdt` package is straightforward using the ``cdt.data`` module. In this tutorial, we are going to load the `Sachs` dataset. .. seealso:: `Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A., & Nolan, G. P. (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721), 523-529`. This dataset is quite useful as it is quite a small dataset with a relatively known ground truth and real data. .. code-block:: python >>> import cdt >>> data, graph = cdt.data.load_dataset('sachs') >>> print(data.head()) praf pmek plcg PIP2 PIP3 p44/42 pakts473 PKA PKC P38 pjnk 0 26.4 13.2 8.82 18.30 58.80 6.61 17.0 414.0 17.00 44.9 40.0 1 35.9 16.5 12.30 16.80 8.13 18.60 32.5 352.0 3.37 16.5 61.5 2 59.4 44.1 14.60 10.20 13.00 14.90 32.5 403.0 11.40 31.9 19.5 3 73.0 82.8 23.10 13.50 1.29 5.83 11.8 528.0 13.70 28.6 23.1 4 33.7 19.8 5.19 9.73 24.80 21.10 46.1 305.0 4.66 25.7 81.3 And graph is loaded: the ``data`` object is a ``pandas.DataFrame`` containing all the data, and ``graph`` contains the ground truth of the dataset: 2. Graph skeleton ----------------- Having a graph skeleton on given data might be quite useful for having information on the structure of the data. In order to do so, let's apply the Graph Lasso. .. seealso:: `Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432-441`: .. code-block:: python >>> glasso = cdt.independence.graph.Glasso() >>> skeleton = glasso.predict(data) >>> print(skeleton) The ``skeleton`` object is a ``networkx.Graph`` instance, which might be quite obscure at first but is handy in the long run. (Check :ref:`here ` for a quick introduction on ``networkx`` graphs). We can check the structure of the skeleton by looking at its adjacency matrix: .. code-block:: python >>> print(nx.adjacency_matrix(skeleton).todense()) matrix([[ 9.26744031e-04, -6.13751618e-04, 1.66612981e-05, -1.10912131e-06, -3.04172363e-05, -9.71526466e-05, 7.00340545e-05, -1.93863471e-06, -7.31774543e-06, 2.29788237e-06, 8.31264711e-06], [-6.13751618e-04, 4.14978956e-04, -1.37962487e-05, 1.42164753e-06, 2.04443539e-05, 8.24208108e-05, -5.63238668e-05, 1.62688021e-06, 8.63444133e-06, -2.88779755e-06, -4.69605195e-06], [ 1.66612981e-05, -1.37962487e-05, 2.60824802e-04, -1.35895911e-04, 8.78979413e-05, 2.17234579e-05, -2.07856535e-05, 1.23313600e-06, 2.12954874e-05, -3.22869246e-06, -7.47522248e-06], [-1.10912131e-06, 1.42164753e-06, -1.35895911e-04, 8.68622146e-05, -7.05405720e-05, 3.08709259e-06, -2.60810094e-06, 9.09261370e-09, -6.25320515e-06, 2.56399675e-07, 2.85201875e-07], [-3.04172363e-05, 2.04443539e-05, 8.78979413e-05, -7.05405720e-05, 6.09681818e-04, -9.91703900e-06, 1.78188074e-05, -5.97491176e-07, 6.11896719e-06, -4.30918870e-07, 5.79322379e-06], [-9.71526466e-05, 8.24208108e-05, 2.17234579e-05, 3.08709259e-06, -9.91703900e-06, 1.10860610e-03, -3.08483289e-04, -1.30867663e-05, -3.31258890e-05, 7.76132824e-06, 2.10416319e-05], [ 7.00340545e-05, -5.63238668e-05, -2.07856535e-05, -2.60810094e-06, 1.78188074e-05, -3.08483289e-04, 1.66144775e-04, 1.26667898e-06, 3.11407736e-05, -7.29116898e-06, -1.86454298e-05], [-1.93863471e-06, 1.62688021e-06, 1.23313600e-06, 9.09261370e-09, -5.97491176e-07, -1.30867663e-05, 1.26667898e-06, 2.80073467e-06, -3.78879972e-06, 8.67580852e-07, 6.92379671e-07], [-7.31774543e-06, 8.63444133e-06, 2.12954874e-05, -6.25320515e-06, 6.11896719e-06, -3.31258890e-05, 3.11407736e-05, -3.78879972e-06, 1.59642510e-03, -2.58155157e-04, -1.01767664e-04], [ 2.29788237e-06, -2.88779755e-06, -3.22869246e-06, 2.56399675e-07, -4.30918870e-07, 7.76132824e-06, -7.29116898e-06, 8.67580852e-07, -2.58155157e-04, 5.32997159e-05, -3.35285721e-06], [ 8.31264711e-06, -4.69605195e-06, -7.47522248e-06, 2.85201875e-07, 5.79322379e-06, 2.10416319e-05, -1.86454298e-05, 6.92379671e-07, -1.01767664e-04, -3.35285721e-06, 7.05796078e-05]]) As you have noticed already, the graph is quite dense. Let's remove indirect links in the graph using the Aracne algorithm .. seealso:: `An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context Adam A Margolin, Ilya Nemenman, Katia Basso, Chris Wiggins, Gustavo Stolovitzky, Riccardo Dalla Favera and Andrea Califano DOI: https://doi.org/10.1186/1471-2105-7-S1-S7` .. code-block:: python >>> new_skeleton = cdt.utils.graph.remove_indirect_links(skeleton, alg='aracne') >>> print(nx.adjacency_matrix(new_skeleton).todense()) matrix([[9.26576364e-04, 0.00000000e+00, 1.66279016e-05, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 6.99676073e-05, 0.00000000e+00, 0.00000000e+00, 2.26182196e-06, 8.29822467e-06], [0.00000000e+00, 4.14897907e-04, 0.00000000e+00, 0.00000000e+00, 2.04095344e-05, 8.22844158e-05, 0.00000000e+00, 1.62835373e-06, 8.48527014e-06, 0.00000000e+00, 0.00000000e+00], [1.66279016e-05, 0.00000000e+00, 2.60808178e-04, 0.00000000e+00, 8.78753504e-05, 2.17299267e-05, 0.00000000e+00, 1.23340219e-06, 2.12217433e-05, 0.00000000e+00, 0.00000000e+00], [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 8.68568259e-05, 0.00000000e+00, 3.07901285e-06, 0.00000000e+00, 8.94955462e-09, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00], [0.00000000e+00, 2.04095344e-05, 8.78753504e-05, 0.00000000e+00, 6.09654932e-04, 0.00000000e+00, 1.77984674e-05, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 5.80118715e-06], [0.00000000e+00, 8.22844158e-05, 2.17299267e-05, 3.07901285e-06, 0.00000000e+00, 1.10847276e-03, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 7.72649753e-06, 2.10224309e-05], [6.99676073e-05, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.77984674e-05, 0.00000000e+00, 1.66117739e-04, 1.26646124e-06, 3.10736844e-05, 0.00000000e+00, 0.00000000e+00], [0.00000000e+00, 1.62835373e-06, 1.23340219e-06, 8.94955462e-09, 0.00000000e+00, 0.00000000e+00, 1.26646124e-06, 2.80075082e-06, 0.00000000e+00, 8.67949681e-07, 6.92548597e-07], [0.00000000e+00, 8.48527014e-06, 2.12217433e-05, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.10736844e-05, 0.00000000e+00, 1.59628546e-03, 0.00000000e+00, 0.00000000e+00], [2.26182196e-06, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 7.72649753e-06, 0.00000000e+00, 8.67949681e-07, 0.00000000e+00, 5.32959890e-05, 0.00000000e+00], [8.29822467e-06, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 5.80118715e-06, 2.10224309e-05, 0.00000000e+00, 6.92548597e-07, 0.00000000e+00, 0.00000000e+00, 7.05766621e-05]]) and the resulting skeleton is much more sparse. Let's build on this new skeleton to perform our causal discovery. 3. Causal discovery ------------------- Having a graph skeleton, we are going to perform causal discovery with constraints, by using the `GES` algorithm. .. seealso:: `D.M. Chickering (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research 3 , 507–554` .. code-block:: python >>> model = cdt.causality.graph.GES() >>> output_graph = model.predict(data, new_skeleton) >>> print(nx.adjacency_matrix(output_graph).todense()) matrix([[0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1], [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1], [1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64) And we obtain GES's prediction on this graph using the skeleton as constraint. Next we are going to evaluate our solution compared to using CAM without skeleton. .. seealso:: `J. Peters, J. Mooij, D. Janzing, B. Schölkopf: Causal Discovery with Continuous Additive Noise Models, JMLR 15:2009-2053, 2014.` 4. Evaluation and scoring metrics --------------------------------- In order to evaluate various predictions with the ground truth, the `cdt` package includes 3 different metrics in the ``cdt.metrics`` module: - Area under the precision recall curve - Structural Hamming Distance (SHD) - Structural Intervention Distance (SID) .. code-block:: python >>> from cdt.metrics import (precision_recall, SID, SHD) >>> scores = [metric(graph, output_graph) for metric in (precision_recall, SID, SHD)] >>> print(scores) [(0.3212943387361992, [(0.1487603305785124, 1.0), (0.16279069767441862, 0.3888888888888889), (1.0, 0.0)]), array(76.), 47] >>> # now we compute the CAM graph without constraints and the associated scores >>> model2 = cdt.causality.graph.CAM() >>> output_graph_nc = model2.predict(data) >>> scores_nc = [metric(graph, output_graph_nc) for metric in (precision_recall, SID, SHD)] >>> print(scores_nc) [(0.4423624964377315, [(0.1487603305785124, 1.0), (0.3103448275862069, 0.5), (1.0, 0.0)]), array(68.), 29] We can observe that CAM has better performance than our previous pipeline, as: - The average precision score is higher - The SID score is lower - The SHD score is lower as well This concludes our first tutorial on the `cdt` package.