cdt.independence
cdt.independence.stats
- class cdt.independence.stats.model.IndependenceModel(predictor=None)[source]
Base class for independence tests, with utilities to recover the undirected graph from data.
- Parameters
predictor (function) – function estimating dependence (0: independence), taking two array-like variables as input.
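Since predictor is just a callable scoring the dependence between two samples, any off-the-shelf measure can be plugged in. A minimal sketch (the abs_pearson helper is hypothetical, not part of cdt):

import numpy as np
from scipy.stats import pearsonr
from cdt.independence.stats.model import IndependenceModel

def abs_pearson(a, b):
    # |r| matches the expected convention: 0 means independence.
    return abs(pearsonr(np.ravel(a), np.ravel(b))[0])

model = IndependenceModel(predictor=abs_pearson)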
AdjMI
- class cdt.independence.stats.AdjMI[source]
Dependency criterion made of binning and mutual information.
The dependency metric relies on the adjusted mutual information clustering metric, applied to variables binned with the Freedman-Diaconis estimator.
Note
Ref: Vinh, Nguyen Xuan and Epps, Julien and Bailey, James, “Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance”, Journal of Machine Learning Research, Volume 11, Oct 2010.
Ref: Freedman, David and Diaconis, Persi, “On the histogram as a density estimator: L2 theory”, Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 1981, ISSN 1432-2064, doi:10.1007/BF01025868.
Example
>>> import numpy as np
>>> from cdt.independence.stats import AdjMI
>>> obj = AdjMI()
>>> a = np.array([1, 2, 1, 5])
>>> b = np.array([1, 3, 0, 6])
>>> obj.predict(a, b)
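For intuition, the recipe above can be sketched with numpy and scikit-learn (an illustrative reimplementation, not cdt's actual code):

import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

def adjmi_sketch(a, b):
    # Bin each variable with the Freedman-Diaconis rule...
    da = np.digitize(a, np.histogram_bin_edges(a, bins='fd'))
    db = np.digitize(b, np.histogram_bin_edges(b, bins='fd'))
    # ...then score the binned variables as if they were clusterings.
    return adjusted_mutual_info_score(da, db)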
KendallTau
MIRegression
NormalizedHSIC
- class cdt.independence.stats.NormalizedHSIC[source]
Kernel-based independence test statistic, using an RBF kernel.
Example
>>> import numpy as np
>>> from cdt.independence.stats import NormalizedHSIC
>>> obj = NormalizedHSIC()
>>> a = np.array([1, 2, 1, 5])
>>> b = np.array([1, 3, 0, 6])
>>> obj.predict(a, b)
- predict(a, b, sig=[-1, -1], maxpnt=500)[source]
Compute the test statistic
- Parameters
a (array-like) – Variable 1
b (array-like) – Variable 2
sig (list) – sig[0] (resp. sig[1]) is the kernel bandwidth for a (resp. b); set to the median distance if -1
maxpnt (int) – maximum number of points used, to bound computation time
- Returns
test statistic
- Return type
float
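For intuition, here is a rough sketch of a normalized HSIC statistic (assumptions: RBF kernels with median-distance bandwidths and the biased estimator; cdt's normalization details may differ):

import numpy as np
from scipy.spatial.distance import cdist

def rbf_gram(x, sigma):
    # RBF Gram matrix: K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    return np.exp(-cdist(x, x, 'sqeuclidean') / (2 * sigma ** 2))

def median_bandwidth(x):
    d = cdist(x, x)
    return np.median(d[d > 0])  # median pairwise distance heuristic

def normalized_hsic_sketch(a, b):
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(-1, 1)
    n = len(a)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    K = H @ rbf_gram(a, median_bandwidth(a)) @ H
    L = H @ rbf_gram(b, median_bandwidth(b)) @ H
    # Cosine similarity between centered Gram matrices, in [0, 1]
    return np.sum(K * L) / np.sqrt(np.sum(K * K) * np.sum(L * L))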
NormMI
- class cdt.independence.stats.NormMI[source]
Dependency criterion made of binning and mutual information.
The dependency metric relies on the adjusted mutual information clustering metric, applied to variables binned with the Freedman-Diaconis estimator.
- Parameters
a (array-like, numerical data) – input data
b (array-like, numerical data) – input data
- Returns
dependency statistic (1 = highly dependent, 0 = not dependent)
- Return type
float
Note
Ref: Vinh, Nguyen Xuan and Epps, Julien and Bailey, James, “Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance”, Journal of Machine Learning Research, Volume 11, Oct 2010.
Ref: Freedman, David and Diaconis, Persi, “On the histogram as a density estimator: L2 theory”, Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 1981, ISSN 1432-2064, doi:10.1007/BF01025868.
Example
>>> import numpy as np
>>> from cdt.independence.stats import NormMI
>>> obj = NormMI()
>>> a = np.array([1, 2, 1, 5])
>>> b = np.array([1, 3, 0, 6])
>>> obj.predict(a, b)
PearsonCorrelation
- class cdt.independence.stats.PearsonCorrelation[source]
Pearson’s correlation coefficient.
\[r(a, b) = \frac{\sum_{i=1}^n (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^n (a_i - \bar{a})^2}\,\sqrt{\sum_{i=1}^n (b_i - \bar{b})^2}}\]
Example
>>> import numpy as np
>>> from cdt.independence.stats import PearsonCorrelation
>>> obj = PearsonCorrelation()
>>> a = np.array([1, 2, 1, 5])
>>> b = np.array([1, 3, 0, 6])
>>> obj.predict(a, b)
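The formula transcribes directly to numpy (an illustrative sketch, not cdt's code):

import numpy as np

def pearson(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    ac, bc = a - a.mean(), b - b.mean()          # center both variables
    return (ac @ bc) / np.sqrt((ac @ ac) * (bc @ bc))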
SpearmanCorrelation
- class cdt.independence.stats.SpearmanCorrelation[source]
Spearman correlation.
Applies Pearson’s correlation to the ranks of the values.
Example
>>> import numpy as np
>>> from cdt.independence.stats import SpearmanCorrelation
>>> obj = SpearmanCorrelation()
>>> a = np.array([1, 2, 1, 5])
>>> b = np.array([1, 3, 0, 6])
>>> obj.predict(a, b)
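Equivalently, for intuition (ties resolved by average ranks, as scipy's rankdata does by default):

import numpy as np
from scipy.stats import rankdata, pearsonr

a = np.array([1, 2, 1, 5])
b = np.array([1, 3, 0, 6])
rho = pearsonr(rankdata(a), rankdata(b))[0]  # Pearson applied to the ranks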
cdt.independence.graph
- class cdt.independence.graph.model.GraphSkeletonModel[source]
Base class for undirected graph recovery directly from data.
- class cdt.independence.graph.model.FeatureSelectionModel[source]
Base class for methods using feature selection on each variable independently.
- predict(df_data, threshold=0.05, **kwargs)[source]
Predict the skeleton of the graph from raw data.
Iteratively applies the feature selection algorithm to each node and thresholds the resulting scores.
- Parameters
df_data (pandas.DataFrame) – data to construct a graph from
threshold (float) – cutoff value for feature selection scores
kwargs (dict) – additional arguments for algorithms
- Returns
predicted skeleton of the graph.
- Return type
networkx.Graph
- predict_features(df_features, df_target, idx=0, **kwargs)[source]
For one variable, predict its neighbouring nodes.
- Parameters
df_features (pandas.DataFrame) – Features to select
df_target (pandas.Series) – Target variable to predict
idx (int) – (optional) for printing purposes
kwargs (dict) – additional options for algorithms
- Returns
scores of each feature relative to the target
- Return type
list
Warning
Not implemented here; this method is overridden by each specific algorithm.
- run_feature_selection(df_data, target, idx=0, **kwargs)[source]
Run feature selection for one node: wrapper around self.predict_features.
- Parameters
df_data (pandas.DataFrame) – All the observational data
target (str) – Name of the target variable
idx (int) – (optional) For printing purposes
- Returns
scores of each feature relative to the target
- Return type
list
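To make the contract concrete, here is a minimal hypothetical subclass: only predict_features needs overriding, and the inherited predict then thresholds its scores into a skeleton. The Lasso-based scoring is illustrative, not one of cdt's selectors:

import numpy as np
from sklearn.linear_model import Lasso
from cdt.independence.graph.model import FeatureSelectionModel

class LassoSelector(FeatureSelectionModel):
    """Hypothetical selector: score features by |Lasso coefficient|."""
    def predict_features(self, df_features, df_target, idx=0, **kwargs):
        model = Lasso(alpha=kwargs.get('alpha', 0.1))
        model.fit(df_features.values, np.ravel(df_target.values))
        return list(np.abs(model.coef_))  # one score per candidate feature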
ARD
- class cdt.independence.graph.ARD[source]
Feature selection with Bayesian ARD regression.
Example
>>> import pandas as pd
>>> from cdt.independence.graph import ARD
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> df_features = pd.DataFrame(boston['data'])
>>> df_target = pd.DataFrame(boston['target'])
>>> obj = ARD()
>>> output = obj.predict_features(df_features, df_target)
>>> ugraph = obj.predict(df_features)  # Predict skeleton
- predict_features(df_features, df_target, idx=0, **kwargs)[source]
For one variable, predict its neighbouring nodes.
- Parameters
df_features (pandas.DataFrame) – Features to select
df_target (pandas.Series) – Target variable to predict
idx (int) – (optional) for printing purposes
kwargs (dict) – additional options for algorithms
- Returns
scores of each feature relative to the target
- Return type
list
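For intuition, the per-feature scores plausibly correspond to magnitudes of ARD regression coefficients (a sketch assuming an sklearn-style ARDRegression; cdt's exact estimator and settings may differ):

import numpy as np
from sklearn.linear_model import ARDRegression

def ard_scores(X, y):
    reg = ARDRegression().fit(X, np.ravel(y))
    return np.abs(reg.coef_)  # larger magnitude -> more relevant feature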
DecisionTreeRegression
- class cdt.independence.graph.DecisionTreeRegression[source]
Feature selection with decision tree regression.
Example
>>> import pandas as pd
>>> from cdt.independence.graph import DecisionTreeRegression
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> df_features = pd.DataFrame(boston['data'])
>>> df_target = pd.DataFrame(boston['target'])
>>> obj = DecisionTreeRegression()
>>> output = obj.predict_features(df_features, df_target)
>>> ugraph = obj.predict(df_features)  # Predict skeleton
- predict_features(df_features, df_target, idx=0, **kwargs)[source]
For one variable, predict its neighbouring nodes.
- Parameters
df_features (pandas.DataFrame) – Features to select
df_target (pandas.Series) – Target variable to predict
idx (int) – (optional) for printing purposes
kwargs (dict) – additional options for algorithms
- Returns
scores of each feature relative to the target
- Return type
list
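A comparable sketch with scikit-learn's impurity-based importances (an assumption about the scoring, not cdt's verbatim code):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def tree_scores(X, y):
    tree = DecisionTreeRegressor().fit(X, np.ravel(y))
    return tree.feature_importances_  # impurity-based relevance per feature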
FSGNN
- class cdt.independence.graph.FSGNN(nh=20, dropout=0.0, activation_function=<class 'torch.nn.modules.activation.ReLU'>, lr=0.01, l1=0.1, batch_size=-1, train_epochs=1000, test_epochs=1000, verbose=None, nruns=3, dataloader_workers=0, njobs=None)[source]
Feature Selection using MMD and Generative Neural Networks.
- Parameters
nh (int) – number of hidden units
dropout (float) – probability of dropout (between 0 and 1)
activation_function (torch.nn.Module) – activation function of the NN
lr (float) – learning rate of Adam
l1 (float) – L1 penalization coefficient
batch_size (int) – batch size, defaults to full-batch
train_epochs (int) – number of train epochs
test_epochs (int) – number of test epochs
verbose (bool) – verbosity (defaults to cdt.SETTINGS.verbose)
nruns (int) – number of bootstrap runs
dataloader_workers (int) – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)
Example
>>> import pandas as pd
>>> from cdt.independence.graph import FSGNN
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> df_features = pd.DataFrame(boston['data'])
>>> df_target = pd.DataFrame(boston['target'])
>>> obj = FSGNN()
>>> output = obj.predict_features(df_features, df_target)
>>> ugraph = obj.predict(df_features)  # Predict skeleton
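A usage sketch with explicit hyperparameters (the values are illustrative, chosen only to show the documented knobs):

from cdt.independence.graph import FSGNN

obj = FSGNN(nh=50, dropout=0.2, lr=0.005, l1=0.05,
            train_epochs=500, test_epochs=500, nruns=5)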
- predict(df_data, threshold=0.05, gpus=None, **kwargs)[source]
Predict the skeleton of the graph from raw data.
Iteratively applies the feature selection algorithm to each node and thresholds the resulting scores.
- Parameters
df_data (pandas.DataFrame) – data to construct a graph from
threshold (float) – cutoff value for feature selection scores
kwargs (dict) – additional arguments for algorithms
- Returns
predicted skeleton of the graph.
- Return type
networkx.Graph
- predict_features(df_features, df_target, datasetclass=<class 'torch.utils.data.dataset.TensorDataset'>, device=None, idx=0)[source]
For one variable, predict its neighbours.
- Parameters
df_features (pandas.DataFrame) – Features to select
df_target (pandas.Series) – Target variable to predict
datasetclass (torch.utils.data.Dataset) – Class to override for custom loading of data.
idx (int) – (optional) for printing purposes
device (str) – cuda or cpu device (defaults to cdt.SETTINGS.default_device)
- Returns
scores of each feature relative to the target
- Return type
list
Glasso
- class cdt.independence.graph.Glasso[source]
Graphical Lasso to find an adjacency matrix.
Note
Ref : Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432-441.
Example
>>> import numpy as np
>>> import pandas as pd
>>> from cdt.independence.graph import Glasso
>>> df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
>>> obj = Glasso()
>>> output = obj.predict(df)
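For intuition, the skeleton corresponds to the support of an l1-penalized precision-matrix estimate; a sketch with scikit-learn (not cdt's exact code):

import numpy as np
from sklearn.covariance import GraphicalLassoCV

def glasso_adjacency(X):
    precision = GraphicalLassoCV().fit(X).precision_
    adjacency = np.abs(precision) > 1e-8   # nonzero partial correlation
    np.fill_diagonal(adjacency, False)     # no self-loops
    return adjacency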
HSICLasso
- class cdt.independence.graph.HSICLasso[source]
Graphical Lasso with a kernel-based independence test.
Example
>>> import pandas as pd
>>> from cdt.independence.graph import HSICLasso
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> df_features = pd.DataFrame(boston['data'])
>>> df_target = pd.DataFrame(boston['target'])
>>> obj = HSICLasso()
>>> output = obj.predict_features(df_features, df_target)
>>> ugraph = obj.predict(df_features)  # Predict skeleton
- predict_features(df_features, df_target, idx=0, **kwargs)[source]
For one variable, predict its neighbouring nodes.
- Parameters
df_features (pandas.DataFrame) – Features to select
df_target (pandas.Series) – Target variable to predict
idx (int) – (optional) for printing purposes
kwargs (dict) – additional options for algorithms
- Returns
scores of each feature relative to the target
- Return type
list
LinearSVRL2
- class cdt.independence.graph.LinearSVRL2[source]
Feature selection with Linear Support Vector Regression.
Example
>>> import pandas as pd
>>> from cdt.independence.graph import LinearSVRL2
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> df_features = pd.DataFrame(boston['data'])
>>> df_target = pd.DataFrame(boston['target'])
>>> obj = LinearSVRL2()
>>> output = obj.predict_features(df_features, df_target)
>>> ugraph = obj.predict(df_features)  # Predict skeleton
- predict_features(df_features, df_target, idx=0, C=0.1, **kwargs)[source]
For one variable, predict its neighbouring nodes.
- Parameters
df_features (pandas.DataFrame) – Features to select
df_target (pandas.Series) – Target variable to predict
idx (int) – (optional) for printing purposes
kwargs (dict) – additional options for algorithms
C (float) – Penalty parameter of the error term
- Returns
scores of each feature relative to the target
- Return type
list
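A hedged sketch of the scoring (assuming an sklearn LinearSVR; the C argument matches the documented parameter, other settings may differ in cdt):

import numpy as np
from sklearn.svm import LinearSVR

def svr_scores(X, y, C=0.1):
    svr = LinearSVR(C=C).fit(X, np.ravel(y))
    return np.abs(svr.coef_)  # weight magnitude per candidate feature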