cdt.data

This module focuses on data: it provides data generators as well as standard, well-known datasets that are useful for validation and benchmarking.

The generators let the user choose which causal mechanism is used in the data generation process, as well as the type of noise contribution (additive and/or multiplicative). The currently implemented mechanisms are listed below, where \(+\times\) denotes either addition or multiplication, \(\mathbf{X}\) denotes the vector of causes, and \(E\) represents the noise variable accounting for all unobserved variables:

Linear: \(y = \mathbf{X}W +\times E\)
Polynomial: \(y = \left( W_0 + \mathbf{X}W_1 + ...+ \mathbf{X}^d W_d \right) +\times E\)
Gaussian Process: \(y = GP(\mathbf{X}) +\times E\)
Sigmoid: \(y = \sum_i^d W_i * sigmoid(\mathbf{X_i}) +\times E\)
Randomly initialized neural network: \(y = \sigma((\mathbf{X},E) W_{in})W_{out}\)
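
The difference between additive and multiplicative noise in the formulas above can be illustrated with a minimal NumPy sketch. It is not the library's implementation: the helper `combine`, the weight ranges, the noise scale of 0.4 and the element-wise reading of \(\mathbf{X}^d\) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def combine(signal, noise, mode="additive"):
    """Combine a noiseless signal with noise (hypothetical helper)."""
    if mode == "additive":
        return signal + noise
    if mode == "multiplicative":
        return signal * noise
    raise ValueError(f"unknown mode: {mode!r}")


n_points, n_causes, degree = 500, 3, 2
X = rng.normal(size=(n_points, n_causes))   # vector of causes
E = 0.4 * rng.normal(size=(n_points, 1))    # noise variable (scale is an assumption)

# Linear: y = X W (+ or x) E
W = rng.uniform(0.5, 2.0, size=(n_causes, 1))
y_linear_add = combine(X @ W, E, "additive")
y_linear_mul = combine(X @ W, E, "multiplicative")

# Polynomial: y = (W_0 + X W_1 + ... + X^d W_d) (+ or x) E,
# with powers of X taken element-wise (an assumption)
W0 = rng.uniform(-1.0, 1.0)
Ws = [rng.uniform(0.5, 2.0, size=(n_causes, 1)) for _ in range(degree)]
signal = W0 + sum((X ** (k + 1)) @ Ws[k] for k in range(degree))
y_poly_add = combine(signal, E, "additive")
```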

Causal pairs can be generated using the cdt.data.CausalPairGenerator class, and acyclic graphs can be generated using the cdt.data.AcyclicGraphGenerator class.

CausalPairGenerator

class cdt.data.CausalPairGenerator(causal_mechanism, noise=<function normal_noise>, noise_coeff=0.4, initial_variable_generator=<function gmm_cause>)

Generates Bivariate Causal Distributions.

Parameters

causal_mechanism (str) – currently implemented mechanisms: [‘linear’, ‘polynomial’, ‘sigmoid_add’, ‘sigmoid_mix’, ‘gp_add’, ‘gp_mix’, ‘nn’].
noise (str or function) – type of noise to use in the generative process (‘normal’, ‘uniform’ or a custom noise function).
noise_coeff (float) – Proportion of noise in the mechanisms.
initial_variable_generator (function) – Function used to init variables of the graph, defaults to a Gaussian Mixture model.
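
A custom noise function might look like the sketch below. The calling convention is an assumption, not something this page specifies: the function is assumed to receive the number of points and to return a column array of shape (points, 1). The name `laplace_noise` is hypothetical.

```python
import numpy as np


def laplace_noise(points):
    """Hypothetical custom noise: Laplace samples as a (points, 1) column.

    Assumes the generator calls the noise function with the number of points.
    """
    return np.random.laplace(loc=0.0, scale=1.0, size=(points, 1))


sample = laplace_noise(500)
print(sample.shape)  # (500, 1)
```

Under that assumption, such a function would be passed through the `noise` parameter, for example `CausalPairGenerator('linear', noise=laplace_noise)`.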

Example

>>> from cdt.data import CausalPairGenerator
>>> generator = CausalPairGenerator('linear')
>>> data, labels = generator.generate(100, npoints=500)
>>> generator.to_csv('generated_pairs')

generate(npairs, npoints=500, rescale=True, njobs=None)

Generate Causal pairs, such that one variable causes the other.

Parameters

npairs (int) – Number of pairs of variables to generate.
npoints (int) – Number of data points to generate.
rescale (bool) – Rescale the output with zero mean and unit variance.
njobs (int) – Number of parallel jobs to execute. Defaults to cdt.SETTINGS.NJOBS

Returns

(pandas.DataFrame, pandas.DataFrame) data and corresponding labels. The data is in the (SampleID, a (numpy.ndarray), b (numpy.ndarray)) format.

Return type

tuple
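
The returned layout, one row per pair with each cell holding a full sample vector, can be pictured with a small stand-in built by hand. This sketch does not call the generator; the column names `A`, `B` and `label` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
npairs, npoints = 3, 500

# Hypothetical stand-in for the output of generate(): one row per pair
causes = [rng.normal(size=npoints) for _ in range(npairs)]
effects = [2.0 * a + 0.4 * rng.normal(size=npoints) for a in causes]
data = pd.DataFrame({"A": causes, "B": effects})
data.index.name = "SampleID"
labels = pd.DataFrame({"label": [1] * npairs}, index=data.index)

# Each cell is a numpy.ndarray containing all points of one variable
first_a = data.loc[0, "A"]
first_b = data.loc[0, "B"]
corr = np.corrcoef(first_a, first_b)[0, 1]
```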

to_csv(fname_radical, **kwargs)

Save data to the csv format by default, in two separate files.

Optional keyword arguments can be passed to pandas.

AcyclicGraphGenerator

class cdt.data.AcyclicGraphGenerator(causal_mechanism, noise='gaussian', noise_coeff=0.4, initial_variable_generator=<function gmm_cause>, npoints=500, nodes=20, parents_max=5, expected_degree=3, dag_type='default')

Generate an acyclic graph and data given a causal mechanism.

Parameters

causal_mechanism (str) – currently implemented mechanisms: [‘linear’, ‘polynomial’, ‘sigmoid_add’, ‘sigmoid_mix’, ‘gp_add’, ‘gp_mix’, ‘nn’].
noise (str or function) – type of noise to use in the generative process (‘gaussian’, ‘uniform’ or a custom noise function).
noise_coeff (float) – Proportion of noise in the mechanisms.
initial_variable_generator (function) – Function used to init variables of the graph, defaults to a Gaussian Mixture model.
npoints (int) – Number of data points to generate.
nodes (int) – Number of nodes in the graph to generate.
parents_max (int) – Maximum number of parents of a node.
expected_degree (int) – Expected degree (number of edges per node); only used for ‘erdos’ graphs.
dag_type (str) – type of graph to generate (‘default’, ‘erdos’)
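
To give an intuition for how `nodes` and `expected_degree` could interact in an Erdős–Rényi-style graph, here is a minimal NumPy sketch. It is not the library's algorithm: the function name `random_dag_adjacency`, the edge probability formula and the reading of "degree" as incoming plus outgoing edges are assumptions.

```python
import numpy as np


def random_dag_adjacency(nodes=20, expected_degree=3, seed=0):
    """Sample a random DAG as a 0/1 adjacency matrix (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Edge probability such that a node has `expected_degree` incident edges on average
    p = min(1.0, expected_degree / (nodes - 1))
    adjacency = (rng.random((nodes, nodes)) < p).astype(int)
    # Keep only edges i -> j with i < j: a strictly upper-triangular
    # adjacency matrix cannot contain a cycle
    return np.triu(adjacency, k=1)


adj = random_dag_adjacency()
mean_degree = 2 * adj.sum() / adj.shape[0]  # in-degree plus out-degree, averaged
```

Restricting edges to the upper triangle fixes a node ordering in advance, which is a simple way to guarantee acyclicity without an explicit cycle check.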

Example

>>> from cdt.data import AcyclicGraphGenerator
>>> generator = AcyclicGraphGenerator('linear', npoints=1000)
>>> data, graph = generator.generate()
>>> generator.to_csv('generated_graph')

generate(rescale=True)

Generate data from an FCM defined in self.init_variables().

Parameters: rescale (bool) – rescale the generated data (recommended)
Returns: (pandas.DataFrame, networkx.DiGraph), respectively the generated data and graph.
Return type: tuple
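
Generating data from a functional causal model (FCM) amounts to sampling each node from its parents in topological order. The sketch below shows the idea with a linear additive mechanism, followed by the zero-mean, unit-variance rescaling. The 4-node graph, the weight range and the noise scale are assumptions; the library's internals may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
npoints = 1000

# Hypothetical 4-node DAG: adjacency[i, j] == 1 means i -> j,
# with nodes already listed in topological order
adjacency = np.array([
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
])
nodes = adjacency.shape[0]
data = np.zeros((npoints, nodes))

for j in range(nodes):
    parents = np.flatnonzero(adjacency[:, j])
    if parents.size == 0:
        # Root node: drawn from an initial distribution (standard normal here)
        data[:, j] = rng.normal(size=npoints)
    else:
        # Linear additive mechanism: weighted sum of the parents plus noise
        weights = rng.uniform(0.5, 2.0, size=parents.size)
        data[:, j] = data[:, parents] @ weights + 0.4 * rng.normal(size=npoints)

# Rescale each column to zero mean and unit variance
rescaled = (data - data.mean(axis=0)) / data.std(axis=0)
```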

init_dag(verbose)

Redefine the structure of the graph depending on dag_type (‘default’, ‘erdos’)

Parameters: verbose (bool) – Verbosity

init_variables(verbose=False)

Redefine the causes, mechanisms and the structure of the graph. Called by self.generate() if it has not been called before.

Parameters: verbose (bool) – Verbosity

to_csv(fname_radical, **kwargs)

Save the generated data to the csv format by default, in two separate files: data, and the adjacency matrix of the corresponding graph.

Parameters

fname_radical (str) – radical of the file names. Completed by _data.csv for the data file and _target.csv for the adjacency matrix of the generated graph.
**kwargs – Optional keyword arguments can be passed to pandas.
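
The file naming scheme described above can be mimicked with pandas alone. This is a sketch of the convention, not the method's actual code: the variable names, the toy data and the `index=False` keyword argument are assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Write into a temporary directory so the sketch leaves no files behind elsewhere
fname_radical = os.path.join(tempfile.mkdtemp(), "generated_graph")

columns = ["V0", "V1", "V2"]
data = pd.DataFrame(np.random.default_rng(0).normal(size=(5, 3)), columns=columns)
adjacency = pd.DataFrame(
    [[0, 1, 1], [0, 0, 1], [0, 0, 0]], index=columns, columns=columns
)

# Keyword arguments are forwarded to pandas, as **kwargs would be
csv_kwargs = {"index": False}
data.to_csv(fname_radical + "_data.csv", **csv_kwargs)
adjacency.to_csv(fname_radical + "_target.csv", **csv_kwargs)

written = sorted(os.listdir(os.path.dirname(fname_radical)))
print(written)  # ['generated_graph_data.csv', 'generated_graph_target.csv']
```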

load_dataset

cdt.data.load_dataset(name, **kwargs)

Main function of this module; it allows well-known causal datasets to be easily imported into Python.

Details on the supported datasets:

tuebingen, dataset of 100 real cause-effect pairs
J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, B. Schoelkopf: “Distinguishing cause from effect using observational data: methods and benchmarks”, Journal of Machine Learning Research 17(32):1-102, 2016.
sachs, Dataset of flow cytometry, real data,
11 variables x 7466 samples; Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A., & Nolan, G. P. (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721), 523-529.
dream4, multifactorial artificial data of the challenge.
Data generated with GeneNetWeaver 2.0, 5 graphs of 100 variables x 100 samples. Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, and Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. PNAS, 107(14):6286-6291, 2010.

Parameters

name (str) – Name of the dataset. Currently supported datasets: [tuebingen, sachs, dream4-1, dream4-2, dream4-3, dream4-4, dream4-5]
**kwargs – Optional additional arguments for dataset loaders. The tuebingen dataset accepts the shuffle (bool) option to shuffle the causal pairs and their corresponding labels.

Returns

(pandas.DataFrame, pandas.DataFrame or networkx.DiGraph) Standard dataframe containing the data, and the target.

Return type

tuple

Examples

>>> from cdt.data import load_dataset
>>> s_data, s_graph = load_dataset('sachs')
>>> t_data, t_labels = load_dataset('tuebingen')

Warning

The ‘Tuebingen’ dataset is loaded with the same label for all samples (1: A causes B)
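
Because every pair carries the same label, a classifier evaluated on the raw dataset never sees the opposite direction. One way to balance the labels by hand is to swap the two variables for a random subset of pairs and flip their label. The sketch below works on a hand-built stand-in; the column names `A`, `B`, `Target` and the use of -1 for "B causes A" are assumptions, and this is not necessarily what the shuffle option does.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
npairs = 6

# Hypothetical stand-in for the loaded pairs: every label is 1 (A causes B)
data = pd.DataFrame({
    "A": [rng.normal(size=50) for _ in range(npairs)],
    "B": [rng.normal(size=50) for _ in range(npairs)],
})
labels = pd.DataFrame({"Target": [1] * npairs})

# Choose which pairs to reverse, then swap their columns
swap = rng.random(npairs) < 0.5
new_a = [b if s else a for a, b, s in zip(data["A"], data["B"], swap)]
new_b = [a if s else b for a, b, s in zip(data["A"], data["B"], swap)]
balanced = pd.DataFrame({"A": new_a, "B": new_b}, index=data.index)

# Reversed pairs get the opposite label (-1 is an assumed encoding)
balanced_labels = labels.copy()
balanced_labels["Target"] = np.where(swap, -1, 1)
```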