cdt.independence

cdt.independence.stats

class cdt.independence.stats.model.IndependenceModel(predictor=None)[source]

Base class for independence tests, with utilities to recover the undirected graph (skeleton) from data.

Parameters: predictor (function) – function that estimates the dependence between two array-like variables (0 means independence).

predict(a, b)[source]

Compute a dependence test statistic between variables.

Parameters

a (numpy.ndarray) – First variable
b (numpy.ndarray) – Second variable

Returns

dependence test statistic (close to 0 -> independent)

Return type

float

predict_undirected_graph(data)[source]

Build a skeleton using a pairwise independence criterion.

Parameters: data (pandas.DataFrame) – Raw data table
Returns: Undirected graph representing the skeleton.
Return type: networkx.Graph
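The pairwise construction can be illustrated with a small, library-independent sketch. It assumes each unordered pair of columns is scored once with a symmetric dependence statistic and that pairs scoring above a cutoff become weighted edges. The names `pairwise_skeleton`, `abs_corr` and `min_abs_score`, and the plain-dict data layout, are hypothetical stand-ins; the package itself works on a pandas.DataFrame and returns a networkx.Graph, and its exact edge rule is not stated here.

```python
from itertools import combinations
from math import sqrt


def abs_corr(a, b):
    """Hypothetical predictor: absolute Pearson correlation (0 means no linear dependence)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = sqrt(sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b))
    return abs(cov / norm) if norm else 0.0


def pairwise_skeleton(columns, predictor, min_abs_score=0.0):
    """Score every unordered pair of variables and keep pairs above the cutoff.

    columns: dict mapping variable name -> list of samples (stand-in for a DataFrame).
    Returns a dict mapping (name_i, name_j) -> score (stand-in for weighted edges).
    """
    edges = {}
    for i, j in combinations(columns, 2):
        score = predictor(columns[i], columns[j])
        if abs(score) > min_abs_score:
            edges[(i, j)] = score
    return edges


data = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.1, 3.9, 6.2, 8.1, 9.8],    # roughly 2 * A
    "C": [1.0, -1.0, 0.0, -1.0, 1.0],  # uncorrelated with A
}
skeleton = pairwise_skeleton(data, abs_corr, min_abs_score=0.5)
print(skeleton)  # only the A-B pair survives the cutoff
```

Any function with the same two-argument shape could be passed in place of `abs_corr`.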

AdjMI

class cdt.independence.stats.AdjMI[source]

Dependency criterion made of binning and mutual information.

The dependency metric is the adjusted mutual information, a clustering metric, applied to variables binned with the Freedman-Diaconis estimator.

Note

Ref: Vinh, Nguyen Xuan and Epps, Julien and Bailey, James, “Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance”, Journal of Machine Learning Research, Volume 11, Oct 2010.

Ref: Freedman, David and Diaconis, Persi, “On the histogram as a density estimator: L2 theory”, Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 1981, ISSN 1432-2064, doi:10.1007/BF01025868.
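The binning step can be sketched with NumPy alone. The Freedman-Diaconis rule sets the bin width to 2 * IQR / n^(1/3); the sketch below relies on `numpy.histogram_bin_edges` with `bins="fd"` and turns each value into an integer bin label, which is the kind of input a clustering metric expects. The helper name `fd_bin` is hypothetical, and this is not necessarily how the package bins its inputs.

```python
import numpy as np


def fd_bin(x):
    """Discretize x with Freedman-Diaconis bins; return integer labels and bin edges."""
    x = np.asarray(x, dtype=float)
    # bins="fd" uses a bin width of 2 * IQR / n**(1/3)
    edges = np.histogram_bin_edges(x, bins="fd")
    # Digitize against interior edges only, so labels run from 0 to n_bins - 1
    labels = np.digitize(x, edges[1:-1])
    return labels, edges


rng = np.random.default_rng(0)
x = rng.normal(size=200)
labels, edges = fd_bin(x)
print(len(edges) - 1, "bins; first labels:", labels[:5])
```

The resulting labels of two variables can then be compared with a mutual-information score.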

Example

>>> import numpy as np
>>> from cdt.independence.stats import AdjMI
>>> obj = AdjMI()
>>> a = np.array([1, 2, 1, 5])
>>> b = np.array([1, 3, 0, 6])
>>> obj.predict(a, b)

predict(a, b, **kwargs)[source]

Perform the independence test.

Parameters

a (array-like, numerical data) – input data
b (array-like, numerical data) – input data

Returns

dependency statistic (1=Highly dependent, 0=Not dependent)

Return type

float

KendallTau

class cdt.independence.stats.KendallTau[source]

Compute Kendall’s Tau.

Example

>>> import numpy as np
>>> from cdt.independence.stats import KendallTau
>>> obj = KendallTau()
>>> a = np.array([1, 2, 1, 5])
>>> b = np.array([1, 3, 0, 6])
>>> obj.predict(a, b)

predict(a, b)[source]

Compute the test statistic

Parameters

a (array-like) – Variable 1
b (array-like) – Variable 2

Returns

test statistic

Return type

float
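For reference, Kendall's Tau compares every pair of observations and counts how often the two variables move in the same direction (concordant) or in opposite directions (discordant). The sketch below implements the tie-corrected tau-b variant in plain Python; which variant the class computes is not stated here, so treat `kendall_tau_b` as a hypothetical illustration rather than the package's code.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(a, b):
    """Tau-b: (concordant - discordant) / sqrt((n0 - ties_a) * (n0 - ties_b))."""
    concordant = discordant = ties_a = ties_b = 0
    for (a_i, b_i), (a_j, b_j) in combinations(zip(a, b), 2):
        da, db = a_i - a_j, b_i - b_j
        if da == 0:
            ties_a += 1          # pair tied in a
        if db == 0:
            ties_b += 1          # pair tied in b
        if da * db > 0:
            concordant += 1      # both variables move the same way
        elif da * db < 0:
            discordant += 1      # variables move in opposite ways
    n = len(a)
    n_pairs = n * (n - 1) // 2
    denom = sqrt((n_pairs - ties_a) * (n_pairs - ties_b))
    return (concordant - discordant) / denom if denom else float("nan")


tau = kendall_tau_b([1, 2, 1, 5], [1, 3, 0, 6])
print(round(tau, 3))  # 5 concordant pairs, 1 tie in a: 5 / sqrt(30), about 0.913
```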

MIRegression

class cdt.independence.stats.MIRegression[source]

Test statistic based on a mutual information regression.

Example

>>> import numpy as np
>>> from cdt.independence.stats import MIRegression
>>> obj = MIRegression()
>>> a = np.array([1, 2, 1, 5])
>>> b = np.array([1, 3, 0, 6])
>>> obj.predict(a, b)

predict(a, b)[source]

Compute the test statistic

Parameters

a (array-like) – Variable 1
b (array-like) – Variable 2

Returns

test statistic

Return type

float

NormalizedHSIC

class cdt.independence.stats.NormalizedHSIC[source]

Kernel-based independence test statistic. Uses RBF kernel.

Example

>>> import numpy as np
>>> from cdt.independence.stats import NormalizedHSIC
>>> obj = NormalizedHSIC()
>>> a = np.array([1, 2, 1, 5])
>>> b = np.array([1, 3, 0, 6])
>>> obj.predict(a, b)

predict(a, b, sig=[-1, -1], maxpnt=500)[source]

Compute the test statistic

Parameters

a (array-like) – Variable 1
b (array-like) – Variable 2
sig (list) – kernel sizes: sig[0] for a and sig[1] for b (a value of -1 sets the size to the median distance)
maxpnt (int) – maximum number of points used, for computational time

Returns

test statistic

Return type

float
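To make the parameters concrete, here is a minimal NumPy sketch of an HSIC statistic with RBF kernels and the median-distance heuristic for the kernel size. Several choices are assumptions rather than the package's documented behaviour: the biased estimator trace(KHLH)/n^2, the normalization by sqrt(HSIC(a, a) * HSIC(b, b)), and truncating to the first `maxpnt` points. The helper names `rbf_gram`, `hsic` and `normalized_hsic` are hypothetical.

```python
import numpy as np


def rbf_gram(x, sigma=-1):
    """RBF Gram matrix; a non-positive sigma selects the median pairwise distance."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq_dist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    if sigma <= 0:
        dist = np.sqrt(sq_dist[np.triu_indices(len(x), k=1)])
        sigma = np.median(dist[dist > 0])  # assumes at least two distinct points
    return np.exp(-sq_dist / (2 * sigma ** 2))


def hsic(K, L):
    """Biased HSIC estimate trace(K H L H) / n^2, with H the centering matrix."""
    n = K.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    return np.trace(K @ H @ L @ H) / n ** 2


def normalized_hsic(a, b, sig=(-1, -1), maxpnt=500):
    """HSIC(a, b) scaled by sqrt(HSIC(a, a) * HSIC(b, b)) (assumed normalization)."""
    a = np.asarray(a, dtype=float)[:maxpnt]  # assumed: keep the first maxpnt points
    b = np.asarray(b, dtype=float)[:maxpnt]
    K, L = rbf_gram(a, sig[0]), rbf_gram(b, sig[1])
    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L))


rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = x ** 2 + 0.1 * rng.normal(size=300)  # nonlinear dependence on x
z = rng.normal(size=300)                 # independent of x
print(normalized_hsic(x, y), normalized_hsic(x, z))
```

Capping the sample size matters because the Gram matrices grow quadratically with the number of points.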

NormMI

class cdt.independence.stats.NormMI[source]

Dependency criterion made of binning and mutual information.

The dependency metric is the normalized mutual information, a clustering metric, applied to variables binned with the Freedman-Diaconis estimator.

Note

Ref: Vinh, Nguyen Xuan and Epps, Julien and Bailey, James, “Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance”, Journal of Machine Learning Research, Volume 11, Oct 2010.

Ref: Freedman, David and Diaconis, Persi, “On the histogram as a density estimator: L2 theory”, Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 1981, ISSN 1432-2064, doi:10.1007/BF01025868.
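As an illustration of a normalized mutual-information score on already binned (discrete) labels, the plain-Python sketch below divides the mutual information by the arithmetic mean of the two entropies. That normalization is one common convention and is an assumption here; the helper names are hypothetical and this is not the package's code.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """Mutual information (in nats) between two label sequences of equal length."""
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log(c * n / (count_x[i] * count_y[j]))
        for (i, j), c in count_xy.items()
    )


def normalized_mi(x, y):
    """MI divided by the mean entropy (assumed normalization); 1 = identical labelings."""
    mean_entropy = (entropy(x) + entropy(y)) / 2
    # Both variables constant: treated as identical labelings by convention here
    return mutual_information(x, y) / mean_entropy if mean_entropy > 0 else 1.0


print(normalized_mi([0, 0, 1, 1], [0, 0, 1, 1]))  # identical labelings
print(normalized_mi([0, 0, 1, 1], [0, 1, 0, 1]))  # independent labelings
```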

Example

>>> import numpy as np
>>> from cdt.independence.stats import NormMI
>>> obj = NormMI()
>>> a = np.array([1, 2, 1, 5])
>>> b = np.array([1, 3, 0, 6])
>>> obj.predict(a, b)

predict(a, b, **kwargs)[source]

Perform the independence test.

Parameters

a (array-like, numerical data) – input data
b (array-like, numerical data) – input data

Returns

dependency statistic (1=Highly dependent, 0=Not dependent)

Return type

float

PearsonCorrelation

class cdt.independence.stats.PearsonCorrelation[source]

Pearson’s correlation coefficient.

\[r(a, b) = \frac{\sum_{i=1}^n (a_i - \bar{a})(b_i - \bar{b})} {\sqrt{\sum_{i=1}^n(a_i - \bar{a})^2} \sqrt{\sum_{i=1}^n(b_i - \bar{b})^2}}\]
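The coefficient is the sum of products of the centred values divided by the product of the two root sums of squares. A direct plain-Python transcription, using a hypothetical helper `pearson_r` that is independent of the package:

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation: centred cross-product over the product of root sums of squares."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    numerator = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    denominator = sqrt(sum((x - mean_a) ** 2 for x in a)) * sqrt(
        sum((y - mean_b) ** 2 for y in b)
    )
    return numerator / denominator


r = pearson_r([1, 2, 1, 5], [1, 3, 0, 6])
print(round(r, 3))  # 14.5 / sqrt(10.75 * 21), about 0.965
```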

Example

>>> import numpy as np
>>> from cdt.independence.stats import PearsonCorrelation
>>> obj = PearsonCorrelation()
>>> a = np.array([1, 2, 1, 5])
>>> b = np.array([1, 3, 0, 6])
>>> obj.predict(a, b)

predict(a, b)[source]

Compute the test statistic

Parameters

a (array-like) – Variable 1
b (array-like) – Variable 2

Returns

test statistic

Return type

float

SpearmanCorrelation

class cdt.independence.stats.SpearmanCorrelation[source]

Spearman correlation.

Applies Pearson’s correlation on the rank of the values.
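A plain-Python sketch of that two-step recipe: replace each value by its rank (ties share the average rank), then compute Pearson's correlation on the ranks. The helper names are hypothetical, and the tie-handling rule is an assumption, since the text does not say how ties are ranked.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson_r(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = sqrt(sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b))
    return num / den


def spearman_rho(a, b):
    """Spearman correlation: Pearson correlation of the rank-transformed variables."""
    return pearson_r(average_ranks(a), average_ranks(b))


print(average_ranks([1, 2, 1, 5]))                   # [1.5, 3.0, 1.5, 4.0]
print(spearman_rho([1, 2, 3, 4], [1, 8, 27, 64]))    # monotone relation gives 1.0
```

Because only ranks matter, any strictly increasing relation scores 1 even when it is far from linear.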

Example

>>> import numpy as np
>>> from cdt.independence.stats import SpearmanCorrelation
>>> obj = SpearmanCorrelation()
>>> a = np.array([1, 2, 1, 5])
>>> b = np.array([1, 3, 0, 6])
>>> obj.predict(a, b)

predict(a, b)[source]

Compute the test statistic

Parameters

a (array-like) – Variable 1
b (array-like) – Variable 2

Returns

test statistic

Return type

float

cdt.independence.graph

class cdt.independence.graph.model.GraphSkeletonModel[source]

Base class for undirected graph recovery directly out of data.

predict(data)[source]

Infer an undirected graph from data.

Parameters: data (pandas.DataFrame) – observational data
Returns: Graph skeleton
Return type: networkx.Graph

Warning

Not implemented in the base class; each algorithm provides its own implementation.

class cdt.independence.graph.model.FeatureSelectionModel[source]

Base class for methods using feature selection on each variable independently.

predict(df_data, threshold=0.05, **kwargs)[source]

Predict the skeleton of the graph from raw data.

Runs the feature selection algorithm iteratively on each node.

Parameters

df_data (pandas.DataFrame) – data to construct a graph from
threshold (float) – cutoff value for feature selection scores
kwargs (dict) – additional arguments for algorithms

Returns

predicted skeleton of the graph.

Return type

networkx.Graph

predict_features(df_features, df_target, idx=0, **kwargs)[source]

For one variable, predict its neighbouring nodes.

Parameters

df_features (pandas.DataFrame) – Features to select
df_target (pandas.Series) – Target variable to predict
idx (int) – (optional) for printing purposes
kwargs (dict) – additional options for algorithms

Returns

scores of each feature relative to the target

Return type

list

Warning

Not implemented in the base class; each algorithm provides its own implementation.

run_feature_selection(df_data, target, idx=0, **kwargs)[source]

Run feature selection for one node: wrapper around self.predict_features.

Parameters

df_data (pandas.DataFrame) – All the observational data
target (str) – Name of the target variable
idx (int) – (optional) For printing purposes

Returns

scores of each feature relative to the target

Return type

list
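The three methods fit together roughly as follows: for each node, the remaining columns are scored as features of that node, and the per-node scores are then combined and thresholded to obtain edges. The plain-Python sketch below is a hypothetical reconstruction under stated assumptions: data is a dict of lists instead of a DataFrame, `abs_corr_scores` stands in for an algorithm's `predict_features`, and the two directional scores of a pair are combined by their product before applying `threshold`. None of this is confirmed as the package's exact rule.

```python
from math import sqrt


def abs_corr_scores(features, target):
    """Hypothetical scorer: |Pearson r| of each feature column with the target."""
    n = len(target)
    mean_t = sum(target) / n
    scores = []
    for col in features.values():
        mean_c = sum(col) / n
        cov = sum((c - mean_c) * (t - mean_t) for c, t in zip(col, target))
        norm = sqrt(sum((c - mean_c) ** 2 for c in col) * sum((t - mean_t) ** 2 for t in target))
        scores.append(abs(cov / norm) if norm else 0.0)
    return scores


def run_feature_selection(columns, target_name, score_fn):
    """Split off the target, score the other columns, and put 0 in the target's own slot."""
    features = {name: col for name, col in columns.items() if name != target_name}
    scores = iter(score_fn(features, columns[target_name]))
    return [0.0 if name == target_name else next(scores) for name in columns]


def predict_skeleton(columns, score_fn, threshold=0.05):
    """Run feature selection on every node, then keep pairs whose combined score passes."""
    names = list(columns)
    matrix = [run_feature_selection(columns, name, score_fn) for name in names]
    edges = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            weight = matrix[i][j] * matrix[j][i]  # assumption: symmetrize by product
            if weight > threshold:
                edges[(names[i], names[j])] = weight
    return edges


data = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.1, 3.9, 6.2, 8.1, 9.8],    # roughly 2 * A
    "C": [1.0, -1.0, 0.0, -1.0, 1.0],  # uncorrelated with A
}
edges = predict_skeleton(data, abs_corr_scores)
print(edges)  # only the A-B pair passes the threshold
```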

ARD

class cdt.independence.graph.ARD[source]

Feature selection with Bayesian ARD regression.

Example

>>> import pandas as pd
>>> from cdt.independence.graph import ARD
>>> # load_boston was removed in scikit-learn 1.2; this example needs an earlier release
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> df_features = pd.DataFrame(boston['data'])
>>> df_target = pd.DataFrame(boston['target'])
>>> obj = ARD()
>>> output = obj.predict_features(df_features, df_target)
>>> ugraph = obj.predict(df_features)  # Predict skeleton

predict_features(df_features, df_target, idx=0, **kwargs)[source]

For one variable, predict its neighbouring nodes.

Parameters

df_features (pandas.DataFrame) – Features to select
df_target (pandas.Series) – Target variable to predict
idx (int) – (optional) for printing purposes
kwargs (dict) – additional options for algorithms

Returns

scores of each feature relative to the target

Return type

list

DecisionTreeRegression

class cdt.independence.graph.DecisionTreeRegression[source]

Feature selection with decision tree regression.

Example

>>> import pandas as pd
>>> from cdt.independence.graph import DecisionTreeRegression
>>> # load_boston was removed in scikit-learn 1.2; this example needs an earlier release
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> df_features = pd.DataFrame(boston['data'])
>>> df_target = pd.DataFrame(boston['target'])
>>> obj = DecisionTreeRegression()
>>> output = obj.predict_features(df_features, df_target)
>>> ugraph = obj.predict(df_features)  # Predict skeleton

predict_features(df_features, df_target, idx=0, **kwargs)[source]

For one variable, predict its neighbouring nodes.

Parameters

df_features (pandas.DataFrame) – Features to select
df_target (pandas.Series) – Target variable to predict
idx (int) – (optional) for printing purposes
kwargs (dict) – additional options for algorithms

Returns

scores of each feature relative to the target

Return type

list

FSGNN

class cdt.independence.graph.FSGNN(nh=20, dropout=0.0, activation_function=<class 'torch.nn.modules.activation.ReLU'>, lr=0.01, l1=0.1, batch_size=-1, train_epochs=1000, test_epochs=1000, verbose=None, nruns=3, dataloader_workers=0, njobs=None)[source]

Feature Selection using MMD and Generative Neural Networks.

Parameters

nh (int) – number of hidden units
dropout (float) – probability of dropout (between 0 and 1)
activation_function (torch.nn.Module) – activation function of the NN
lr (float) – learning rate of Adam
l1 (float) – L1 penalization coefficient
batch_size (int) – batch size, defaults to full-batch
train_epochs (int) – number of train epochs
test_epochs (int) – number of test epochs
verbose (bool) – verbosity (defaults to cdt.SETTINGS.verbose)
nruns (int) – number of bootstrap runs
dataloader_workers (int) – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)

Example

>>> import pandas as pd
>>> from cdt.independence.graph import FSGNN
>>> # load_boston was removed in scikit-learn 1.2; this example needs an earlier release
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> df_features = pd.DataFrame(boston['data'])
>>> df_target = pd.DataFrame(boston['target'])
>>> obj = FSGNN()
>>> output = obj.predict_features(df_features, df_target)
>>> ugraph = obj.predict(df_features)  # Predict skeleton

predict(df_data, threshold=0.05, gpus=None, **kwargs)[source]

Predict the skeleton of the graph from raw data.

Runs the feature selection algorithm iteratively on each node.

Parameters

df_data (pandas.DataFrame) – data to construct a graph from
threshold (float) – cutoff value for feature selection scores
kwargs (dict) – additional arguments for algorithms

Returns

predicted skeleton of the graph.

Return type

networkx.Graph

predict_features(df_features, df_target, datasetclass=<class 'torch.utils.data.dataset.TensorDataset'>, device=None, idx=0)[source]

For one variable, predict its neighbours.

Parameters

df_features (pandas.DataFrame) – Features to select
df_target (pandas.Series) – Target variable to predict
datasetclass (torch.utils.data.Dataset) – Class to override for custom loading of data.
idx (int) – (optional) for printing purposes
device (str) – cuda or cpu device (defaults to cdt.SETTINGS.default_device)

Returns

scores of each feature relative to the target

Return type

list
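FSGNN is described as relying on MMD (maximum mean discrepancy), a kernel distance between two samples. As background only, here is a NumPy sketch of the biased squared-MMD estimate with a single RBF kernel. The fixed `bandwidth`, the helper names, and the choice of estimator are assumptions; the kernel configuration used inside FSGNN is not stated here, and this sketch does not involve any neural network.

```python
import numpy as np


def rbf_kernel(x, y, bandwidth=1.0):
    """Pairwise RBF kernel values between samples x and y."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    sq_dist = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2 * bandwidth ** 2))


def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD: mean k(x,x') + mean k(y,y') - 2 mean k(x,y)."""
    return (
        rbf_kernel(x, x, bandwidth).mean()
        + rbf_kernel(y, y, bandwidth).mean()
        - 2 * rbf_kernel(x, y, bandwidth).mean()
    )


rng = np.random.default_rng(0)
sample = rng.normal(size=200)
same_dist = rng.normal(size=200)          # drawn from the same distribution
shifted = rng.normal(loc=3.0, size=200)   # drawn from a shifted distribution
print(mmd2(sample, same_dist), mmd2(sample, shifted))
```

A small value means the two samples are hard to tell apart under the chosen kernel, which is what makes MMD usable as a training loss for a generative model.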

Glasso

class cdt.independence.graph.Glasso[source]

Graphical Lasso to find an adjacency matrix

Note

Ref: Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432-441.

Example

>>> import numpy as np
>>> import pandas as pd
>>> from cdt.independence.graph import Glasso
>>> df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
>>> obj = Glasso()
>>> output = obj.predict(df)

predict(data, alpha=0.01, max_iter=2000, **kwargs)[source]

Predict the graph skeleton.

Parameters

data (pandas.DataFrame) – observational data
alpha (float) – regularization parameter
max_iter (int) – maximum number of iterations

Returns

Graph skeleton

Return type

networkx.Graph
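The idea behind reading a skeleton from an inverse covariance (precision) matrix can be shown without any regularization: a zero off-diagonal entry means the two variables are conditionally independent given the others (for Gaussian data). The NumPy sketch below inverts the sample covariance and thresholds the partial correlations. It is deliberately not the graphical lasso, which instead estimates a sparse precision matrix under an L1 penalty controlled by `alpha`; the helper names and the `cutoff` parameter are hypothetical.

```python
import numpy as np


def partial_correlations(data):
    """Partial correlations from the unregularized inverse of the sample covariance."""
    precision = np.linalg.inv(np.cov(data, rowvar=False))
    scale = np.sqrt(np.diag(precision))
    pcorr = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr


def skeleton_from_precision(data, cutoff=0.1):
    """Edges (i, j) whose absolute partial correlation exceeds the cutoff."""
    pcorr = partial_correlations(data)
    p = pcorr.shape[0]
    return {(i, j) for i in range(p) for j in range(i + 1, p) if abs(pcorr[i, j]) > cutoff}


# Chain X -> Y -> Z: X and Z are dependent, but independent given Y
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = x + rng.normal(size=5000)
z = y + rng.normal(size=5000)
chain = np.column_stack([x, y, z])
print(skeleton_from_precision(chain))  # expected to keep X-Y and Y-Z, drop X-Z
```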

HSICLasso

class cdt.independence.graph.HSICLasso[source]

Graphical Lasso with a kernel-based independence test.

Example

>>> import pandas as pd
>>> from cdt.independence.graph import HSICLasso
>>> # load_boston was removed in scikit-learn 1.2; this example needs an earlier release
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> df_features = pd.DataFrame(boston['data'])
>>> df_target = pd.DataFrame(boston['target'])
>>> obj = HSICLasso()
>>> output = obj.predict_features(df_features, df_target)
>>> ugraph = obj.predict(df_features)  # Predict skeleton

predict_features(df_features, df_target, idx=0, **kwargs)[source]

For one variable, predict its neighbouring nodes.

Parameters

df_features (pandas.DataFrame) – Features to select
df_target (pandas.Series) – Target variable to predict
idx (int) – (optional) for printing purposes
kwargs (dict) – additional options for algorithms

Returns

scores of each feature relative to the target

Return type

list

LinearSVRL2

class cdt.independence.graph.LinearSVRL2[source]

Feature selection with Linear Support Vector Regression.

Example

>>> import pandas as pd
>>> from cdt.independence.graph import LinearSVRL2
>>> # load_boston was removed in scikit-learn 1.2; this example needs an earlier release
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> df_features = pd.DataFrame(boston['data'])
>>> df_target = pd.DataFrame(boston['target'])
>>> obj = LinearSVRL2()
>>> output = obj.predict_features(df_features, df_target)
>>> ugraph = obj.predict(df_features)  # Predict skeleton

predict_features(df_features, df_target, idx=0, C=0.1, **kwargs)[source]

For one variable, predict its neighbouring nodes.

Parameters

df_features (pandas.DataFrame) – Features to select
df_target (pandas.Series) – Target variable to predict
idx (int) – (optional) for printing purposes
kwargs (dict) – additional options for algorithms
C (float) – Penalty parameter of the error term

Returns

scores of each feature relative to the target

Return type

list