cdt.data
This module focuses on data: it provides data generators as well as standard, well-known datasets that are useful for validation and benchmarking.
The generators let the user choose the causal mechanism used in the data generation process, as well as the type of noise contribution (additive and/or multiplicative). The currently implemented mechanisms are listed below, where \(+\times\) denotes either addition or multiplication, \(\mathbf{X}\) denotes the vector of causes, and \(E\) is the noise variable accounting for all unobserved variables:
Linear: \(y = \mathbf{X}W +\times E\)
Polynomial: \(y = \left( W_0 + \mathbf{X}W_1 + ...+ \mathbf{X}^d W_d \right) +\times E\)
Gaussian Process: \(y = GP(\mathbf{X}) +\times E\)
Sigmoid: \(y = \sum_i^d W_i * sigmoid(\mathbf{X_i}) +\times E\)
Randomly initialized neural network: \(y = \sigma((\mathbf{X},E) W_{in})W_{out}\)
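As a rough illustration of the additive and multiplicative noise variants listed above, the following sketch applies the linear mechanism to a small matrix of causes using only the standard library. The function name `linear_mechanism` and its parameters are hypothetical and are not part of cdt; the library's own implementation may differ.

```python
import random


def linear_mechanism(X, W, noise, multiplicative=False):
    """Compute y = XW + E (additive) or y = XW * E (multiplicative).

    X: list of rows, each row a list of cause values.
    W: list of weights, one per cause.
    noise: list of noise values, one per row.
    """
    out = []
    for row, e in zip(X, noise):
        # Weighted sum of the causes for this sample.
        signal = sum(x * w for x, w in zip(row, W))
        out.append(signal * e if multiplicative else signal + e)
    return out


# Illustrative data: 5 samples, 3 causes (all values are arbitrary).
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(5)]
W = [rng.uniform(0.5, 2.0) for _ in range(3)]
E = [rng.gauss(0, 0.4) for _ in range(5)]

y_add = linear_mechanism(X, W, E)
y_mul = linear_mechanism(X, W, E, multiplicative=True)
```

The same noise vector is reused for both variants so that the only difference between `y_add` and `y_mul` is how the noise is combined with the signal.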
Causal pairs can be generated using the cdt.data.CausalPairGenerator class, and acyclic graphs can be generated using the cdt.data.AcyclicGraphGenerator class.
CausalPairGenerator
- class cdt.data.CausalPairGenerator(causal_mechanism, noise=<function normal_noise>, noise_coeff=0.4, initial_variable_generator=<function gmm_cause>)[source]
Generates Bivariate Causal Distributions.
- Parameters
causal_mechanism (str) – currently implemented mechanisms: [‘linear’, ‘polynomial’, ‘sigmoid_add’, ‘sigmoid_mix’, ‘gp_add’, ‘gp_mix’, ‘nn’].
noise (str or function) – type of noise to use in the generative process (‘normal’, ‘uniform’ or a custom noise function).
noise_coeff (float) – Proportion of noise in the mechanisms.
initial_variable_generator (function) – Function used to init variables of the graph, defaults to a Gaussian Mixture model.
Example
>>> from cdt.data import CausalPairGenerator
>>> generator = CausalPairGenerator('linear')
>>> data, labels = generator.generate(100, npoints=500)
>>> generator.to_csv('generated_pairs')
- generate(npairs, npoints=500, rescale=True, njobs=None)[source]
Generate Causal pairs, such that one variable causes the other.
- Parameters
npairs (int) – Number of pairs of variables to generate.
npoints (int) – Number of data points to generate.
rescale (bool) – Rescale the output with zero mean and unit variance.
njobs (int) – Number of parallel jobs to execute. Defaults to cdt.SETTINGS.NJOBS
- Returns
(pandas.DataFrame, pandas.DataFrame) data and corresponding labels. The data is in the (SampleID, a (numpy.ndarray), b (numpy.ndarray)) format.
- Return type
tuple
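The rescale option above standardizes each variable to zero mean and unit variance. A minimal standard-library sketch of that operation follows. The helper name `rescale` is hypothetical, and the use of the population standard deviation is an assumption, since the documentation does not state which estimator cdt uses.

```python
import statistics


def rescale(values):
    """Return values shifted to zero mean and scaled to unit variance."""
    mean = statistics.fmean(values)
    # Assumption: population standard deviation (divides by n, not n - 1).
    std = statistics.pstdev(values)
    if std == 0:
        # A constant variable cannot be scaled; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


scaled = rescale([1.0, 2.0, 3.0, 4.0])
```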
AcyclicGraphGenerator
- class cdt.data.AcyclicGraphGenerator(causal_mechanism, noise='gaussian', noise_coeff=0.4, initial_variable_generator=<function gmm_cause>, npoints=500, nodes=20, parents_max=5, expected_degree=3, dag_type='default')[source]
Generate an acyclic graph and data given a causal mechanism.
- Parameters
causal_mechanism (str) – currently implemented mechanisms: [‘linear’, ‘polynomial’, ‘sigmoid_add’, ‘sigmoid_mix’, ‘gp_add’, ‘gp_mix’, ‘nn’].
noise (str or function) – type of noise to use in the generative process (‘gaussian’, ‘uniform’ or a custom noise function).
noise_coeff (float) – Proportion of noise in the mechanisms.
initial_variable_generator (function) – Function used to init variables of the graph, defaults to a Gaussian Mixture model.
npoints (int) – Number of data points to generate.
nodes (int) – Number of nodes in the graph to generate.
parents_max (int) – Maximum number of parents of a node.
expected_degree (int) – Expected degree (number of edges per node); only used for the ‘erdos’ graph type.
dag_type (str) – type of graph to generate (‘default’, ‘erdos’)
Example
>>> from cdt.data import AcyclicGraphGenerator
>>> generator = AcyclicGraphGenerator('linear', npoints=1000)
>>> data, graph = generator.generate()
>>> generator.to_csv('generated_graph')
- generate(rescale=True)[source]
Generate data from an FCM defined in self.init_variables().
- Parameters
rescale (bool) – rescale the generated data (recommended)
- Returns
(pandas.DataFrame, networkx.DiGraph), respectively the generated data and graph.
- Return type
tuple
- init_dag(verbose)[source]
Redefine the structure of the graph depending on dag_type (‘default’, ‘erdos’)
- Parameters
verbose (bool) – Verbosity
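To give an intuition for how an Erdős–Rényi style acyclic graph with a target expected degree can be sampled, here is a standard-library sketch. It is not the procedure cdt uses for dag_type ‘erdos’, which is not described here. The function `random_dag` is hypothetical, and it assumes that the expected degree counts both incoming and outgoing edges of a node.

```python
import random


def random_dag(nodes, expected_degree, seed=None):
    """Sample a DAG over node ids 0..nodes-1.

    Returns (edges, order), where edges is a list of (parent, child)
    tuples and order is the topological order used for sampling.
    """
    rng = random.Random(seed)
    order = list(range(nodes))
    rng.shuffle(order)  # random topological order

    # Each of the nodes*(nodes-1)/2 candidate edges is kept with
    # probability p, so the expected in+out degree of a node is
    # p * (nodes - 1). Assumption: this is the intended meaning of
    # "expected degree".
    p = min(1.0, expected_degree / (nodes - 1)) if nodes > 1 else 0.0

    edges = []
    for i in range(nodes):
        for j in range(i + 1, nodes):
            if rng.random() < p:
                # Edges only point forward in the order, so no cycle
                # can be created.
                edges.append((order[i], order[j]))
    return edges, order


edges, order = random_dag(nodes=20, expected_degree=3, seed=1)
```

Orienting every edge from the earlier to the later node in a fixed ordering guarantees acyclicity by construction, so no cycle check is needed afterwards.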
- init_variables(verbose=False)[source]
Redefine the causes, mechanisms and the structure of the graph. Called by self.generate() if never called before.
- Parameters
verbose (bool) – Verbosity
- to_csv(fname_radical, **kwargs)[source]
Save the generated data to the csv format by default, in two separate files: data, and the adjacency matrix of the corresponding graph.
- Parameters
fname_radical (str) – radical of the file names. Completed by _data.csv for the data file and _target.csv for the adjacency matrix of the generated graph.
**kwargs – Optional keyword arguments can be passed to pandas.
load_dataset
- cdt.data.load_dataset(name, **kwargs)[source]
Main function of this module; it makes it easy to import well-known causal datasets into Python.
- Details on the supported datasets:
- tuebingen, dataset of 100 real cause-effect pairs
J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, B. Schoelkopf: “Distinguishing cause from effect using observational data: methods and benchmarks”, Journal of Machine Learning Research 17(32):1-102, 2016.
- sachs, Dataset of flow cytometry, real data,
11 variables x 7466 samples; Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A., & Nolan, G. P. (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721), 523-529.
- dream4, multifactorial artificial data of the challenge.
Data generated with GeneNetWeaver 2.0, 5 graphs of 100 variables x 100 samples. Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, and Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. PNAS, 107(14):6286-6291, 2010.
- Parameters
name (str) – Name of the dataset. Currently supported datasets: [tuebingen, sachs, dream4-1, dream4-2, dream4-3, dream4-4, dream4-5]
**kwargs – Optional additional arguments for dataset loaders.
The tuebingen dataset accepts the shuffle (bool) option to shuffle the causal pairs and their corresponding labels.
- Returns
(pandas.DataFrame, pandas.DataFrame or networkx.DiGraph) Standard dataframe containing the data, and the target.
- Return type
tuple
Examples
>>> from cdt.data import load_dataset
>>> s_data, s_graph = load_dataset('sachs')
>>> t_data, t_labels = load_dataset('tuebingen')
Warning
The ‘Tuebingen’ dataset is loaded with the same label for all samples (1: A causes B)
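Because every pair carries the same label, a pairwise method evaluated on such data can score perfectly by always predicting one direction. One common remedy is to swap the two variables of randomly chosen pairs and flip their labels accordingly. The sketch below illustrates the idea with plain Python lists. The function `shuffle_directions` is hypothetical, the label -1 for "B causes A" is an assumed convention, and this is not necessarily what the shuffle option of the loader does.

```python
import random


def shuffle_directions(pairs, seed=None):
    """Randomly swap the direction of cause-effect pairs.

    pairs: list of (a, b) tuples in which a causes b.
    Returns (new_pairs, labels) with label 1 when the pair is kept
    as (cause, effect) and -1 (assumed convention) when it is swapped.
    """
    rng = random.Random(seed)
    new_pairs, labels = [], []
    for a, b in pairs:
        if rng.random() < 0.5:
            new_pairs.append((b, a))
            labels.append(-1)
        else:
            new_pairs.append((a, b))
            labels.append(1)
    return new_pairs, labels


# Two toy pairs; the values are arbitrary placeholders.
toy_pairs = [([1.0, 2.0], [3.0, 4.0]), ([5.0], [6.0])]
mixed_pairs, mixed_labels = shuffle_directions(toy_pairs, seed=0)
```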