cdt.independence

cdt.independence.stats

class cdt.independence.stats.model.IndependenceModel(predictor=None)[source]

Base class for independence tests, with utilities to recover the undirected graph from data.

Parameters

predictor (function) – function estimating the dependence between two array-like variables (0 denotes independence).

predict(a, b)[source]

Compute a dependence test statistic between variables.

Parameters
  • a (numpy.ndarray) – First variable

  • b (numpy.ndarray) – Second variable

Returns

dependence test statistic (values close to 0 indicate independence)

Return type

float

predict_undirected_graph(data)[source]

Build a skeleton using a pairwise independence criterion.

Parameters

data (pandas.DataFrame) – Raw data table

Returns

Undirected graph representing the skeleton.

Return type

networkx.Graph
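
Any callable that scores dependence between two arrays can be plugged in as the predictor. A minimal sketch (assuming, as documented above, that the predictor is applied pairwise over the columns; the absolute Pearson coefficient here is an arbitrary choice):

>>> import numpy as np
>>> import pandas as pd
>>> from scipy.stats import pearsonr
>>> from cdt.independence.stats.model import IndependenceModel
>>> model = IndependenceModel(predictor=lambda a, b: abs(pearsonr(a, b)[0]))
>>> data = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))
>>> skeleton = model.predict_undirected_graph(data)  # networkx.Graph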

AdjMI

class cdt.independence.stats.AdjMI[source]

Dependency criterion based on binning and mutual information.

The dependency measure applies the adjusted mutual information clustering metric to variables binned with the Freedman-Diaconis estimator.

Note

Ref: Vinh, Nguyen Xuan, Epps, Julien, and Bailey, James, “Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance”, Journal of Machine Learning Research, Volume 11, Oct 2010.
Ref: Freedman, David and Diaconis, Persi, “On the histogram as a density estimator: L2 theory”, Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 1981, ISSN 1432-2064, doi:10.1007/BF01025868.

Example

>>> import numpy as np
>>> from cdt.independence.stats import AdjMI
>>> obj = AdjMI()
>>> a = np.array([1, 2, 1, 5])
>>> b = np.array([1, 3, 0, 6])
>>> obj.predict(a, b)
predict(a, b, **kwargs)[source]

Perform the independence test.

Parameters
  • a (array-like, numerical data) – input data

  • b (array-like, numerical data) – input data

Returns

dependency statistic (1=Highly dependent, 0=Not dependent)

Return type

float
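
The score can be approximated with off-the-shelf tools; the following sketch assumes Freedman-Diaconis binning via numpy and scikit-learn's adjusted mutual information (the library's implementation may differ in details):

>>> import numpy as np
>>> from sklearn.metrics import adjusted_mutual_info_score
>>> a, b = np.random.randn(500), np.random.randn(500)
>>> bins_a = np.digitize(a, np.histogram_bin_edges(a, bins='fd'))
>>> bins_b = np.digitize(b, np.histogram_bin_edges(b, bins='fd'))
>>> adjusted_mutual_info_score(bins_a, bins_b)  # near 0 for independent data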

KendallTau

class cdt.independence.stats.KendallTau[source]

Compute Kendall’s Tau.

Example

>>> import numpy as np
>>> from cdt.independence.stats import KendallTau
>>> obj = KendallTau()
>>> a = np.array([1, 2, 1, 5])
>>> b = np.array([1, 3, 0, 6])
>>> obj.predict(a, b)
predict(a, b)[source]

Compute the test statistic.

Parameters
  • a (array-like) – Variable 1

  • b (array-like) – Variable 2

Returns

test statistic

Return type

float
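
The statistic is Kendall's tau coefficient; assuming predict returns the same coefficient, it can be cross-checked against SciPy:

>>> import numpy as np
>>> from scipy.stats import kendalltau
>>> a, b = np.array([1, 2, 1, 5]), np.array([1, 3, 0, 6])
>>> tau, p_value = kendalltau(a, b)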

MIRegression

class cdt.independence.stats.MIRegression[source]

Test statistic based on a mutual information regression.

Example

>>> import numpy as np
>>> from cdt.independence.stats import MIRegression
>>> obj = MIRegression()
>>> a = np.array([1, 2, 1, 5])
>>> b = np.array([1, 3, 0, 6])
>>> obj.predict(a, b)
predict(a, b)[source]

Compute the test statistic.

Parameters
  • a (array-like) – Variable 1

  • b (array-like) – Variable 2

Returns

test statistic

Return type

float
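
For intuition, a comparable mutual-information estimate between two variables is available in scikit-learn (a sketch only; this class's regression-based estimator may differ):

>>> import numpy as np
>>> from sklearn.feature_selection import mutual_info_regression
>>> a, b = np.random.randn(200), np.random.randn(200)
>>> mi = mutual_info_regression(a.reshape(-1, 1), b)[0]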

NormalizedHSIC

class cdt.independence.stats.NormalizedHSIC[source]

Kernel-based independence test statistic, using RBF kernels.

Example

>>> import numpy as np
>>> from cdt.independence.stats import NormalizedHSIC
>>> obj = NormalizedHSIC()
>>> a = np.array([1, 2, 1, 5])
>>> b = np.array([1, 3, 0, 6])
>>> obj.predict(a, b)
predict(a, b, sig=[-1, -1], maxpnt=500)[source]

Compute the test statistic.

Parameters
  • a (array-like) – Variable 1

  • b (array-like) – Variable 2

  • sig (list) – sig[0] (resp. sig[1]) is the kernel bandwidth for a (resp. b); if set to -1, the median pairwise distance is used

  • maxpnt (int) – maximum number of points used, to limit computation time

Returns

test statistic

Return type

float
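
For reference, the plain (non-normalized) HSIC statistic with RBF kernels and median-distance bandwidths can be sketched in a few lines of numpy; the normalization applied by this class may differ:

>>> import numpy as np
>>> def rbf_gram(x, sigma):
...     d2 = (x[:, None] - x[None, :]) ** 2
...     return np.exp(-d2 / (2 * sigma ** 2))
>>> a, b = np.random.randn(100), np.random.randn(100)
>>> sig_a = np.median(np.abs(a[:, None] - a[None, :]))  # median heuristic
>>> sig_b = np.median(np.abs(b[:, None] - b[None, :]))
>>> n = len(a)
>>> H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
>>> K, L = rbf_gram(a, sig_a), rbf_gram(b, sig_b)
>>> hsic = np.trace(K @ H @ L @ H) / (n - 1) ** 2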

NormMI

class cdt.independence.stats.NormMI[source]

Dependency criterion based on binning and mutual information.

The dependency measure applies the adjusted mutual information clustering metric to variables binned with the Freedman-Diaconis estimator.

Note

Ref: Vinh, Nguyen Xuan, Epps, Julien, and Bailey, James, “Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance”, Journal of Machine Learning Research, Volume 11, Oct 2010.
Ref: Freedman, David and Diaconis, Persi, “On the histogram as a density estimator: L2 theory”, Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 1981, ISSN 1432-2064, doi:10.1007/BF01025868.

Example

>>> import numpy as np
>>> from cdt.independence.stats import NormMI
>>> obj = NormMI()
>>> a = np.array([1, 2, 1, 5])
>>> b = np.array([1, 3, 0, 6])
>>> obj.predict(a, b)
predict(a, b, **kwargs)[source]

Perform the independence test.

Parameters
  • a (array-like, numerical data) – input data

  • b (array-like, numerical data) – input data

Returns

dependency statistic (1=Highly dependent, 0=Not dependent)

Return type

float
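
As with AdjMI, the variables are binned first; the class name suggests the normalized (rather than adjusted) mutual information scorer, so the following sketch is an assumption:

>>> import numpy as np
>>> from sklearn.metrics import normalized_mutual_info_score
>>> a, b = np.random.randn(500), np.random.randn(500)
>>> bins_a = np.digitize(a, np.histogram_bin_edges(a, bins='fd'))
>>> bins_b = np.digitize(b, np.histogram_bin_edges(b, bins='fd'))
>>> normalized_mutual_info_score(bins_a, bins_b)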

PearsonCorrelation

class cdt.independence.stats.PearsonCorrelation[source]

Pearson’s correlation coefficient.

\[r(a, b) = \frac{\sum_{i=1}^n (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^n (a_i - \bar{a})^2}\,\sqrt{\sum_{i=1}^n (b_i - \bar{b})^2}}\]

Example

>>> import numpy as np
>>> from cdt.independence.stats import PearsonCorrelation
>>> obj = PearsonCorrelation()
>>> a = np.array([1, 2, 1, 5])
>>> b = np.array([1, 3, 0, 6])
>>> obj.predict(a, b)
predict(a, b)[source]

Compute the test statistic.

Parameters
  • a (array-like) – Variable 1

  • b (array-like) – Variable 2

Returns

test statistic

Return type

float
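
The formula above is exactly numpy's correlation coefficient, which makes for a quick cross-check:

>>> import numpy as np
>>> a, b = np.array([1, 2, 1, 5]), np.array([1, 3, 0, 6])
>>> r = np.corrcoef(a, b)[0, 1]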

SpearmanCorrelation

class cdt.independence.stats.SpearmanCorrelation[source]

Spearman correlation.

Applies Pearson’s correlation to the ranks of the values.

Example

>>> import numpy as np
>>> from cdt.independence.stats import SpearmanCorrelation
>>> obj = SpearmanCorrelation()
>>> a = np.array([1, 2, 1, 5])
>>> b = np.array([1, 3, 0, 6])
>>> obj.predict(a, b)
predict(a, b)[source]

Compute the test statistic.

Parameters
  • a (array-like) – Variable 1

  • b (array-like) – Variable 2

Returns

test statistic

Return type

float
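
The rank-based definition can be verified directly: rank both variables, then apply Pearson's correlation:

>>> import numpy as np
>>> from scipy.stats import rankdata, spearmanr
>>> a, b = np.array([1, 2, 1, 5]), np.array([1, 3, 0, 6])
>>> np.isclose(np.corrcoef(rankdata(a), rankdata(b))[0, 1], spearmanr(a, b)[0])
True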

cdt.independence.graph

class cdt.independence.graph.model.GraphSkeletonModel[source]

Base class for recovering an undirected graph directly from data.

predict(data)[source]

Infer an undirected graph from data.

Parameters

data (pandas.DataFrame) – observational data

Returns

Graph skeleton

Return type

networkx.Graph

Warning

Not implemented here; concrete algorithms provide the implementation.
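
Concrete algorithms subclass this model and supply predict. A minimal sketch of a custom skeleton method (the thresholded-correlation rule here is hypothetical, not one of the library's algorithms):

>>> import networkx as nx
>>> from cdt.independence.graph.model import GraphSkeletonModel
>>> class CorrSkeleton(GraphSkeletonModel):
...     def predict(self, data, threshold=0.3):  # hypothetical rule
...         corr = data.corr().abs()
...         graph = nx.Graph()
...         graph.add_nodes_from(data.columns)
...         for i, u in enumerate(data.columns):
...             for v in data.columns[i + 1:]:
...                 if corr.loc[u, v] > threshold:
...                     graph.add_edge(u, v)
...         return graph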

class cdt.independence.graph.model.FeatureSelectionModel[source]

Base class for methods using feature selection on each variable independently.

predict(df_data, threshold=0.05, **kwargs)[source]

Predict the skeleton of the graph from raw data.

Iteratively runs the feature selection algorithm on each node and assembles the results into a skeleton.

Parameters
  • df_data (pandas.DataFrame) – data to construct a graph from

  • threshold (float) – cutoff value for feature selection scores

  • kwargs (dict) – additional arguments for algorithms

Returns

predicted skeleton of the graph.

Return type

networkx.Graph

predict_features(df_features, df_target, idx=0, **kwargs)[source]

For one variable, predict its neighbouring nodes.

Parameters
  • df_features (pandas.DataFrame) – features to select from

  • df_target (pandas.Series) – target variable to predict

  • idx (int) – (optional) for printing purposes

  • kwargs (dict) – additional options for algorithms

Returns

scores of each feature relative to the target

Return type

list

Warning

Not implemented here; concrete algorithms provide the implementation.

run_feature_selection(df_data, target, idx=0, **kwargs)[source]

Run feature selection for one node: wrapper around self.predict_features.

Parameters
  • df_data (pandas.DataFrame) – All the observational data

  • target (str) – Name of the target variable

  • idx (int) – (optional) For printing purposes

Returns

scores of each feature relative to the target

Return type

list
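
Per the contract above, a subclass only needs to implement predict_features; predict then assembles the skeleton by thresholding the scores. A minimal sketch with a hypothetical absolute-correlation scorer:

>>> import numpy as np
>>> import pandas as pd
>>> from cdt.independence.graph.model import FeatureSelectionModel
>>> class AbsCorrSelector(FeatureSelectionModel):
...     def predict_features(self, df_features, df_target, idx=0, **kwargs):
...         y = np.ravel(df_target.values)
...         return [abs(np.corrcoef(df_features[col].values, y)[0, 1])
...                 for col in df_features.columns]
>>> df_data = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
>>> skeleton = AbsCorrSelector().predict(df_data, threshold=0.3)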

ARD

class cdt.independence.graph.ARD[source]

Feature selection with Bayesian ARD regression.

Example

>>> import pandas as pd
>>> from cdt.independence.graph import ARD
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> df_features = pd.DataFrame(boston['data'])
>>> df_target = pd.DataFrame(boston['target'])
>>> obj = ARD()
>>> output = obj.predict_features(df_features, df_target)
>>> ugraph = obj.predict(df_features)  # Predict skeleton
predict_features(df_features, df_target, idx=0, **kwargs)[source]

For one variable, predict its neighbouring nodes.

Parameters
  • df_features (pandas.DataFrame) – features to select from

  • df_target (pandas.Series) – target variable to predict

  • idx (int) – (optional) for printing purposes

  • kwargs (dict) – additional options for algorithms

Returns

scores of each feature relative to the target

Return type

list
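
For intuition, the scoring can be sketched with scikit-learn's ARD regression, assuming coefficient magnitudes serve as relevance scores (the class's exact post-processing may differ):

>>> import numpy as np
>>> from sklearn.linear_model import ARDRegression
>>> X, y = np.random.randn(100, 5), np.random.randn(100)
>>> reg = ARDRegression().fit(X, y)
>>> scores = np.abs(reg.coef_)  # one relevance score per feature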

DecisionTreeRegression

class cdt.independence.graph.DecisionTreeRegression[source]

Feature selection with decision tree regression.

Example

>>> import pandas as pd
>>> from cdt.independence.graph import DecisionTreeRegression
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> df_features = pd.DataFrame(boston['data'])
>>> df_target = pd.DataFrame(boston['target'])
>>> obj = DecisionTreeRegression()
>>> output = obj.predict_features(df_features, df_target)
>>> ugraph = obj.predict(df_features)  # Predict skeleton
predict_features(df_features, df_target, idx=0, **kwargs)[source]

For one variable, predict its neighbouring nodes.

Parameters
  • df_features (pandas.DataFrame) – features to select from

  • df_target (pandas.Series) – target variable to predict

  • idx (int) – (optional) for printing purposes

  • kwargs (dict) – additional options for algorithms

Returns

scores of each feature relative to the target

Return type

list
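
A sketch of the idea with scikit-learn, assuming tree feature importances serve as the selection scores:

>>> import numpy as np
>>> from sklearn.tree import DecisionTreeRegressor
>>> X, y = np.random.randn(100, 5), np.random.randn(100)
>>> tree = DecisionTreeRegressor().fit(X, y)
>>> scores = tree.feature_importances_  # one score per feature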

FSGNN

class cdt.independence.graph.FSGNN(nh=20, dropout=0.0, activation_function=<class 'torch.nn.modules.activation.ReLU'>, lr=0.01, l1=0.1, batch_size=-1, train_epochs=1000, test_epochs=1000, verbose=None, nruns=3, dataloader_workers=0, njobs=None)[source]

Feature Selection using MMD and Generative Neural Networks.

Parameters
  • nh (int) – number of hidden units

  • dropout (float) – probability of dropout (between 0 and 1)

  • activation_function (torch.nn.Module) – activation function of the NN

  • lr (float) – learning rate of Adam

  • l1 (float) – L1 penalization coefficient

  • batch_size (int) – batch size, defaults to full-batch

  • train_epochs (int) – number of train epochs

  • test_epochs (int) – number of test epochs

  • verbose (bool) – verbosity (defaults to cdt.SETTINGS.verbose)

  • nruns (int) – number of bootstrap runs

  • dataloader_workers (int) – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)

Example

>>> import pandas as pd
>>> from cdt.independence.graph import FSGNN
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> df_features = pd.DataFrame(boston['data'])
>>> df_target = pd.DataFrame(boston['target'])
>>> obj = FSGNN()
>>> output = obj.predict_features(df_features, df_target)
>>> ugraph = obj.predict(df_features)  # Predict skeleton
predict(df_data, threshold=0.05, gpus=None, **kwargs)[source]

Predict the skeleton of the graph from raw data.

Iteratively runs the feature selection algorithm on each node and assembles the results into a skeleton.

Parameters
  • df_data (pandas.DataFrame) – data to construct a graph from

  • threshold (float) – cutoff value for feature selection scores

  • kwargs (dict) – additional arguments for algorithms

Returns

predicted skeleton of the graph.

Return type

networkx.Graph

predict_features(df_features, df_target, datasetclass=<class 'torch.utils.data.dataset.TensorDataset'>, device=None, idx=0)[source]

For one variable, predict its neighbours.

Parameters
  • df_features (pandas.DataFrame) – Features to select

  • df_target (pandas.Series) – Target variable to predict

  • datasetclass (torch.utils.data.Dataset) – Class to override for custom loading of data.

  • idx (int) – (optional) for printing purposes

  • device (str) – cuda or cpu device (defaults to cdt.SETTINGS.default_device)

Returns

scores of each feature relative to the target

Return type

list

Glasso

class cdt.independence.graph.Glasso[source]

Graphical Lasso to find an adjacency matrix.

Note

Ref: Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432-441.

Example

>>> import numpy as np
>>> import pandas as pd
>>> from cdt.independence.graph import Glasso
>>> df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
>>> obj = Glasso()
>>> output = obj.predict(df)
predict(data, alpha=0.01, max_iter=2000, **kwargs)[source]

Predict the graph skeleton.

Parameters
  • data (pandas.DataFrame) – observational data

  • alpha (float) – regularization parameter

  • max_iter (int) – maximum number of iterations

Returns

Graph skeleton

Return type

networkx.Graph
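
The skeleton follows from the sparsity pattern of the estimated precision (inverse covariance) matrix; a sketch of the idea using scikit-learn's estimator (an assumption about the wrapped implementation):

>>> import numpy as np
>>> from sklearn.covariance import GraphicalLasso
>>> X = np.random.randn(100, 4)
>>> est = GraphicalLasso(alpha=0.01, max_iter=2000).fit(X)
>>> adjacency = np.abs(est.precision_) > 1e-8  # nonzero entries = edges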

HSICLasso

class cdt.independence.graph.HSICLasso[source]

Graphical Lasso with a kernel-based independence test.

Example

>>> import pandas as pd
>>> from cdt.independence.graph import HSICLasso
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> df_features = pd.DataFrame(boston['data'])
>>> df_target = pd.DataFrame(boston['target'])
>>> obj = HSICLasso()
>>> output = obj.predict_features(df_features, df_target)
>>> ugraph = obj.predict(df_features)  # Predict skeleton
predict_features(df_features, df_target, idx=0, **kwargs)[source]

For one variable, predict its neighbouring nodes.

Parameters
  • df_features (pandas.DataFrame) – features to select from

  • df_target (pandas.Series) – target variable to predict

  • idx (int) – (optional) for printing purposes

  • kwargs (dict) – additional options for algorithms

Returns

scores of each feature relative to the target

Return type

list

LinearSVRL2

class cdt.independence.graph.LinearSVRL2[source]

Feature selection with Linear Support Vector Regression.

Example

>>> import pandas as pd
>>> from cdt.independence.graph import LinearSVRL2
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> df_features = pd.DataFrame(boston['data'])
>>> df_target = pd.DataFrame(boston['target'])
>>> obj = LinearSVRL2()
>>> output = obj.predict_features(df_features, df_target)
>>> ugraph = obj.predict(df_features)  # Predict skeleton
predict_features(df_features, df_target, idx=0, C=0.1, **kwargs)[source]

For one variable, predict its neighbouring nodes.

Parameters
  • df_features (pandas.DataFrame) – features to select from

  • df_target (pandas.Series) – target variable to predict

  • idx (int) – (optional) for printing purposes

  • kwargs (dict) – additional options for algorithms

  • C (float) – Penalty parameter of the error term

Returns

scores of each feature relative to the target

Return type

list
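
A sketch of the idea with scikit-learn, assuming the absolute coefficients of an L2-regularized linear SVR serve as the selection scores:

>>> import numpy as np
>>> from sklearn.svm import LinearSVR
>>> X, y = np.random.randn(100, 5), np.random.randn(100)
>>> svr = LinearSVR(C=0.1).fit(X, y)
>>> scores = np.abs(svr.coef_)  # one score per feature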