easymlpy package

easymlpy.core module

The core functionality of easyml.

class easymlpy.core.easy_analysis(data, dependent_variable, algorithm=None, family='gaussian', resample=None, preprocess=None, measure=None, exclude_variables=None, categorical_variables=None, train_size=0.667, survival_rate_cutoff=0.05, n_samples=1000, n_divisions=1000, n_iterations=10, random_state=None, progress_bar=True, n_core=1, generate_coefficients=None, generate_variable_importances=None, generate_predictions=None, generate_model_performance=None, model_args=None)

Bases: object

The core recipe of easyml.

This recipe is the workhorse behind all of the easy_* functions.

create_estimator()

See the subclass documentation for more information on this method.

extract_coefficients(estimator)

See the subclass documentation for more information on this method.

extract_variable_importances(estimator)

See the subclass documentation for more information on this method.

generate_coefficients()

Generate coefficients for a model (if applicable).

Returns:An ndarray.
generate_coefficients_()

Generate coefficients for a model (if applicable).

Returns:An ndarray.
generate_model_performance()

Generate measures of model performance for a model.

Returns:An ndarray.
generate_model_performance_()

Generate measures of model performance for a model.

Returns:An ndarray.
generate_predictions()

Generate predictions for a model.

Returns:An ndarray.
generate_predictions_()

Generate predictions for a model.

Returns:An ndarray.
generate_variable_importances()

Generate variable importances for a model (if applicable).

Returns:An ndarray.
generate_variable_importances_()

Generate variable importances for a model (if applicable).

Returns:An ndarray.
plot_coefficients_processed()

See the subclass documentation for more information on this method.

plot_model_performance_test()

Plot model performance.

Returns:Figure and axes objects.
plot_model_performance_train()

Plot model performance.

Returns:Figure and axes objects.
plot_predictions_single_train_test_split_test()

Plot predictions.

Returns:Figure and axes objects.
plot_predictions_single_train_test_split_train()

Plot predictions.

Returns:Figure and axes objects.
plot_roc_single_train_test_split_test()

Plot ROC Curve.

Returns:Figure and axes objects.
plot_roc_single_train_test_split_train()

Plot ROC Curve.

Returns:Figure and axes objects.
plot_variable_importances_processed()

See the subclass documentation for more information on this method.

predict_model()

See the subclass documentation for more information on this method.

process_coefficients()

See the subclass documentation for more information on this method.

process_variable_importances()

See the subclass documentation for more information on this method.

easymlpy.datasets module

Helper functions for datasets.

easymlpy.datasets.load_cocaine_dependence()

Loads the cocaine dependence dataset.

Returns:An object of class pandas.DataFrame.
easymlpy.datasets.load_prostate()

Loads the prostate cancer dataset.

Returns:An object of class pandas.DataFrame.

easymlpy.glmnet module

Functions for glmnet analysis.

class easymlpy.glmnet.easy_glmnet(data, dependent_variable, algorithm='glmnet', family='gaussian', resample=None, preprocess=<function preprocess_scale>, measure=None, exclude_variables=None, categorical_variables=None, train_size=0.667, survival_rate_cutoff=0.05, n_samples=1000, n_divisions=1000, n_iterations=10, random_state=None, progress_bar=True, n_core=1, generate_coefficients=True, generate_variable_importances=False, generate_predictions=True, generate_model_performance=True, model_args=None)

Bases: easymlpy.core.easy_analysis

Easily build and evaluate a penalized regression model.

This function wraps the easyml core framework, allowing a user to easily run the easyml methodology for a glmnet model.

Please see the core class easy_analysis for more details on arguments.

create_estimator()

Create an estimator.

Creates an estimator depending on the family of regression.

Returns:A scikit-learn estimator.
extract_coefficients(estimator)

Extract coefficients from a penalized regression model.

Parameters:estimator – An estimator that has been fit to data.
Returns:An ndarray.
plot_coefficients()

Plots the coefficients.

Returns:Figure and axes.
predict_model(model, X)

Predict values from model.

Generates predictions from a model depending on the family of regression.

Parameters:
  • model – The model to use for generating predictions.
  • X – The data to use for generating predictions.
Returns:

An ndarray.

process_coefficients(coefficients, column_names, survival_rate_cutoff=0.05)

Process coefficients for plotting.

Parameters:
  • coefficients – An ndarray.
  • column_names – A list of strings, the columns of the data.
  • survival_rate_cutoff – The cutoff for survival.
Returns:

An object of class pandas.DataFrame.

easymlpy.measure module

Functions for measuring model performance.

easymlpy.measure.measure_mean_squared_error(y_true, y_pred)

Measure mean squared error.

Given the ground truth (correct) target values and the estimated target values, calculates the mean squared error metric.

Parameters:
  • y_true – An ndarray; the ground truth (correct) target values.
  • y_pred – An ndarray; the estimated target values.
Returns:

A float.
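
As an illustration of the metric itself, the following is a minimal sketch in plain Python, not the library's implementation. The function name mean_squared_error_sketch is hypothetical, and plain lists stand in for ndarrays.

```python
def mean_squared_error_sketch(y_true, y_pred):
    """Hypothetical stand-in: average of the squared residuals."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Square each residual, then average over all observations.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


mse = mean_squared_error_sketch([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
```

Here the squared residuals are 0.25, 0.25, 0 and 1, so mse is 0.375. Lower values indicate a better fit.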

easymlpy.measure.measure_cor_score(y_true, y_pred)

Measure Pearson's correlation coefficient.

Given the ground truth (correct) target values and the estimated target values, calculates the correlation metric.

Parameters:
  • y_true – An ndarray; the ground truth (correct) target values.
  • y_pred – An ndarray; the estimated target values.
Returns:

A float.
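
As an illustration of the metric itself, the following is a minimal sketch of Pearson's correlation coefficient in plain Python. It is not the library's implementation; the function name pearson_correlation_sketch is hypothetical, and plain lists stand in for ndarrays.

```python
import math


def pearson_correlation_sketch(y_true, y_pred):
    """Hypothetical stand-in: covariance divided by the product of spreads."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    # Sums of centered cross-products and centered squares.
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


r_up = pearson_correlation_sketch([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_down = pearson_correlation_sketch([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

A perfectly increasing linear relationship gives 1 and a perfectly decreasing one gives -1. The sketch does not guard against constant inputs, for which the coefficient is undefined.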

easymlpy.measure.measure_r2_score(y_true, y_pred)

Measure Coefficient of Determination (R^2 Score).

Given the ground truth (correct) target values and the estimated target values, calculates the R^2 metric.

Parameters:
  • y_true – An ndarray; the ground truth (correct) target values.
  • y_pred – An ndarray; the estimated target values.
Returns:

A float.
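
As an illustration of the metric itself, the following is a minimal sketch of the coefficient of determination, computed as one minus the ratio of the residual sum of squares to the total sum of squares. It is not the library's implementation; the function name r2_score_sketch is hypothetical, and plain lists stand in for ndarrays.

```python
def r2_score_sketch(y_true, y_pred):
    """Hypothetical stand-in: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    # Variation left unexplained by the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total variation of the targets around their mean.
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


r2 = r2_score_sketch([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
```

For these values the residual sum of squares is 1.5 and the total sum of squares is 29.1875, so r2 is about 0.949. A value of 1 corresponds to perfect predictions.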

easymlpy.measure.measure_area_under_curve(y_true, y_pred)

Measure area under the curve.

Given the ground truth (correct) target values and the estimated target values, calculates the AUC metric.

Parameters:
  • y_true – An ndarray; the ground truth (correct) target values.
  • y_pred – An ndarray; the estimated target values.
Returns:

A float.
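
As an illustration of the metric itself, the following is a minimal sketch of the area under the ROC curve using its pairwise interpretation: the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. It is not the library's implementation; the function name area_under_curve_sketch is hypothetical, and it assumes y_true holds the labels 0 and 1.

```python
def area_under_curve_sketch(y_true, y_pred):
    """Hypothetical stand-in: pairwise (Mann-Whitney) form of the AUC."""
    positives = [s for t, s in zip(y_true, y_pred) if t == 1]
    negatives = [s for t, s in zip(y_true, y_pred) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present in y_true")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


auc = area_under_curve_sketch([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Three of the four positive-negative pairs are ordered correctly here, so auc is 0.75. The double loop is quadratic in the number of samples and is meant for explanation only.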

easymlpy.plot module

Functions for plotting.

easymlpy.plot.plot_predictions_binomial(y_true, y_pred, subtitle='Train')

Plot binomial predictions.

Plots a logistic plot of the ground truth (correct) target values and the estimated target values.

Parameters:
  • y_true – Ground truth (correct) target values.
  • y_pred – Estimated target values.
  • subtitle – A string, whether one is plotting the ‘Train’ or ‘Test’ Dataset.
Returns:

Figure and axes objects.

easymlpy.plot.plot_predictions_gaussian(y_true, y_pred, subtitle='Train')

Plot gaussian predictions.

Plots a scatter plot of the ground truth (correct) target values and the estimated target values.

Parameters:
  • y_true – Ground truth (correct) target values.
  • y_pred – Estimated target values.
  • subtitle – A string, whether one is plotting the ‘Train’ or ‘Test’ Dataset.
Returns:

Figure and axes objects.

easymlpy.plot.plot_model_performance_binomial_area_under_curve(x, subtitle='Train')

Plot histogram of the area under the curve (AUC) metrics.

This function plots a histogram of the area under the curve (AUC) metrics.

Parameters:
  • x – An ndarray, the area under the curve (AUC) metrics to be plotted as a histogram.
  • subtitle – A string, whether one is plotting the ‘Train’ or ‘Test’ Dataset.
Returns:

Figure and axes objects.

easymlpy.plot.plot_model_performance_gaussian_cor_score(x, subtitle='Train')

Plot histogram of the correlation coefficient metrics.

This function plots a histogram of the correlation coefficient metrics.

Parameters:
  • x – An ndarray, the correlation coefficient metrics to be plotted as a histogram.
  • subtitle – A string, whether one is plotting the ‘Train’ or ‘Test’ Dataset.
Returns:

Figure and axes objects.

easymlpy.plot.plot_model_performance_gaussian_mean_squared_error(x, subtitle='Train')

Plot histogram of the mean squared error metrics.

This function plots a histogram of the mean squared error metrics.

Parameters:
  • x – An ndarray, the mean squared error metrics to be plotted as a histogram.
  • subtitle – A string, whether one is plotting the ‘Train’ or ‘Test’ Dataset.
Returns:

Figure and axes objects.

easymlpy.plot.plot_model_performance_gaussian_r2_score(x, subtitle='Train')

Plot histogram of the coefficient of determination (R^2) metrics.

This function plots a histogram of the coefficient of determination (R^2) metrics.

Parameters:
  • x – An ndarray, the coefficient of determination (R^2) metrics to be plotted as a histogram.
  • subtitle – A string, whether one is plotting the ‘Train’ or ‘Test’ Dataset.
Returns:

Figure and axes objects.

easymlpy.plot.plot_roc_single_train_test_split(y_true, y_pred, subtitle='Train')

Plot ROC Curve.

Given the ground truth (correct) target values and the estimated target values, plots an ROC curve.

Parameters:
  • y_true – Ground truth (correct) target values.
  • y_pred – Estimated target values.
  • subtitle – A string, whether one is plotting the ‘Train’ or ‘Test’ Dataset.
Returns:

Figure and axes objects.

easymlpy.preprocess module

Functions for preprocessing.

easymlpy.preprocess.preprocess_identity(*data, categorical_variables=None)

An identity function for preprocessing.

Returns inputs without modifying them.

Parameters:
  • data – array(s).
  • categorical_variables – A list of strings representing the variables that are categorical.
Returns:

array(s).

easymlpy.preprocess.preprocess_scale(*data, categorical_variables=None)

A function for scaling data.

Takes one or two arrays and scales them using a standard scaler.

Parameters:
  • data – array(s).
  • categorical_variables – A list of strings representing the variables that are categorical.
Returns:

array(s).
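
As an illustration of standard scaling with one or two arrays, the following is a minimal sketch in plain Python. It is not the library's implementation. The function name scale_sketch is hypothetical, and it rests on two assumptions not stated above: when two arrays are given, the second is scaled with the column means and standard deviations of the first, and categorical variables are ignored.

```python
import statistics


def scale_sketch(first, second=None):
    """Hypothetical stand-in: standardize columns using statistics of `first`."""
    columns = list(zip(*first))
    means = [statistics.mean(c) for c in columns]
    # Population standard deviation; constant columns fall back to 1.0
    # so they are centered without dividing by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]

    def transform(rows):
        return [[(v - m) / s for v, m, s in zip(row, means, stds)]
                for row in rows]

    if second is None:
        return transform(first)
    return transform(first), transform(second)


train = [[1.0, 10.0], [3.0, 10.0]]
test = [[5.0, 10.0]]
train_scaled, test_scaled = scale_sketch(train, test)
```

Under these assumptions train_scaled is [[-1.0, 0.0], [1.0, 0.0]] and test_scaled is [[3.0, 0.0]]. Reusing the first array's statistics keeps information from the second array out of the scaling step.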

easymlpy.random_forest module

Functions for random forest analysis.

class easymlpy.random_forest.easy_random_forest(data, dependent_variable, algorithm='random_forest', family='gaussian', resample=None, preprocess=None, measure=None, exclude_variables=None, categorical_variables=None, train_size=0.667, survival_rate_cutoff=0.05, n_samples=1000, n_divisions=1000, n_iterations=10, random_state=None, progress_bar=True, n_core=1, generate_coefficients=False, generate_variable_importances=True, generate_predictions=True, generate_model_performance=True, model_args=None)

Bases: easymlpy.core.easy_analysis

Easily build and evaluate a random forest model.

This function wraps the easyml core framework, allowing a user to easily run the easyml methodology for a random forest model.

Please see the core class easy_analysis for more details on arguments.

create_estimator()

Create an estimator.

Creates an estimator depending on the family of regression.

Returns:A scikit-learn estimator.
extract_variable_importances(estimator)

Extract variable importances from a random forest model.

Parameters:estimator – An estimator that has been fit to data.
Returns:An ndarray.
plot_variable_importances()

Plots the variable importances.

Returns:Figure and axes.
predict_model(model, X)

Predict values from model.

Generates predictions from a model depending on the family of regression.

Parameters:
  • model – The model to use for generating predictions.
  • X – The data to use for generating predictions.
Returns:

An ndarray.

process_variable_importances(variable_importances)

Process variable importances for plotting.

Returns:An ndarray.

easymlpy.resample module

Functions for resampling data.

easymlpy.resample.resample_fold_train_test_split(X, y, foldid=None, train_size=0.667, random_state=None)

Sample with respect to an identification vector

This will sample the training and test sets so that case identifiers (e.g. subject IDs) are not shared across training and test sets.

Parameters:
  • X – An ndarray, the data to be resampled.
  • y – An ndarray with two classes, 0 and 1.
  • train_size – A float; specifies what proportion of the data should be used for the training data set. Defaults to 0.667.
  • foldid – A vector with length equal to len(y) which identifies cases belonging to the same fold.
  • random_state – An integer; specifies the seed to be used for the analysis. Defaults to None.
Returns:

A tuple of arrays; the arrays X, y split into X_train, X_test, y_train, y_test.
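
As an illustration of sampling with respect to an identification vector, the following is a minimal sketch in plain Python. It is not the library's implementation. The function name fold_split_sketch is hypothetical, and it assumes that train_size is applied to the number of distinct identifiers rather than to the number of rows.

```python
import random


def fold_split_sketch(X, y, foldid, train_size=0.667, random_state=None):
    """Hypothetical stand-in: no identifier appears in both train and test."""
    rng = random.Random(random_state)
    folds = sorted(set(foldid))      # distinct identifiers in a stable order
    rng.shuffle(folds)
    # Assumption: the proportion applies to identifiers, not rows.
    n_train = round(train_size * len(folds))
    train_folds = set(folds[:n_train])
    train_idx = [i for i, f in enumerate(foldid) if f in train_folds]
    test_idx = [i for i, f in enumerate(foldid) if f not in train_folds]
    return ([X[i] for i in train_idx], [X[i] for i in test_idx],
            [y[i] for i in train_idx], [y[i] for i in test_idx])


foldid = ["s1", "s1", "s2", "s2", "s3", "s3"]
X = [[0], [1], [2], [3], [4], [5]]
y = [0, 1, 0, 1, 0, 1]
X_train, X_test, y_train, y_test = fold_split_sketch(X, y, foldid, random_state=0)
```

With three identifiers of two rows each, two identifiers (four rows) go to training and one (two rows) goes to testing. Rows from the same subject always stay together.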

easymlpy.resample.resample_simple_train_test_split(X, y, train_size=0.667, foldid=None, random_state=None)

Train test split.

This will split the data into train and test.

Parameters:
  • X – An ndarray, the data to be resampled.
  • y – An ndarray with two classes, 0 and 1.
  • train_size – A float; specifies what proportion of the data should be used for the training data set. Defaults to 0.667.
  • foldid – Not currently supported in this function.
  • random_state – An integer; specifies the seed to be used for the analysis. Defaults to None.
Returns:

A tuple of arrays; the arrays X, y split into X_train, X_test, y_train, y_test.
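
As an illustration of a plain train/test split, the following is a minimal sketch in plain Python. It is not the library's implementation. The function name simple_split_sketch is hypothetical, and rounding the training size down is an assumption.

```python
import random


def simple_split_sketch(X, y, train_size=0.667, random_state=None):
    """Hypothetical stand-in: shuffle row indices, then cut at train_size."""
    rng = random.Random(random_state)
    indices = list(range(len(X)))
    rng.shuffle(indices)
    n_train = int(train_size * len(indices))  # assumption: round down
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    # X and y are indexed with the same positions so rows stay aligned.
    return ([X[i] for i in train_idx], [X[i] for i in test_idx],
            [y[i] for i in train_idx], [y[i] for i in test_idx])


X = [[i] for i in range(9)]
y = [i % 2 for i in range(9)]
X_train, X_test, y_train, y_test = simple_split_sketch(X, y, random_state=42)
```

With nine rows and the default train_size, six rows go to training and three to testing. Passing the same random_state reproduces the same split.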

easymlpy.resample.resample_stratified_class_train_test_split(X, y, train_size=0.667, foldid=None, random_state=None)

Sample in equal proportion.

This will sample in equal proportion.

Parameters:
  • X – An ndarray, the data to be resampled.
  • y – An ndarray with two classes, 0 and 1.
  • train_size – A float; specifies what proportion of the data should be used for the training data set. Defaults to 0.667.
  • foldid – Not currently supported in this function.
  • random_state – An integer; specifies the seed to be used for the analysis. Defaults to None.
Returns:

A tuple of arrays; the arrays X, y split into X_train, X_test, y_train, y_test.

easymlpy.resample.resample_stratified_simple_train_test_split(X, y, train_size=0.667, foldid=None, random_state=None)

Sample in equal proportion.

This will sample in equal proportion.

Parameters:
  • X – An ndarray, the data to be resampled.
  • y – An ndarray with two classes, 0 and 1.
  • train_size – A float; specifies what proportion of the data should be used for the training data set. Defaults to 0.667.
  • foldid – A vector with length equal to len(y) which identifies cases belonging to the same fold.
  • random_state – An integer; specifies the seed to be used for the analysis. Defaults to None.
Returns:

A tuple of arrays; the arrays X, y split into X_train, X_test, y_train, y_test.

easymlpy.setters module

Functions for setting certain functions and parameters.

easymlpy.setters.set_random_state(random_state=None)

Set random state.

Sets the random state to a specific seed. Please note this function affects global state.

Parameters:random_state – An integer; specifies the seed to be used for the analysis. Defaults to None.
Returns:None.
easymlpy.setters.set_parallel(n_core)

Set parallel.

This helper function decides whether the analysis should be run in parallel based on the number of cores specified.

Parameters:n_core – An integer; specifies the number of cores to use for this analysis.
Returns:A boolean; whether analysis should be run in parallel or not.
easymlpy.setters.set_resample(resample=None, family=None)

Set resample function.

Sets the function responsible for resampling the data.

Parameters:
  • resample – A function; the function for resampling the data. Defaults to None.
  • family – A string; the type of regression to run on the data. Choices are either ‘gaussian’ or ‘binomial’.
Returns:

A function; the function for resampling the data.

easymlpy.setters.set_categorical_variables(column_names, categorical_variables=None)

Set categorical variables.

This helper function determines a boolean vector based on the column names and on which of them are designated as categorical variables.

Parameters:
  • column_names – A list of strings; the column names of the data for this analysis.
  • categorical_variables – A list of strings; the variables that are categorical. Defaults to None.
Returns:

None, or if categorical_variables is not None, then a list of booleans of length len(column_names) where True represents that column is a categorical variable.
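
The return value described above can be sketched as follows. This is an illustrative stand-in rather than the library's code, and the function name categorical_mask_sketch is hypothetical.

```python
def categorical_mask_sketch(column_names, categorical_variables=None):
    """Hypothetical stand-in: one boolean per column, or None."""
    if categorical_variables is None:
        return None
    wanted = set(categorical_variables)
    # True where the column is listed as categorical, False elsewhere.
    return [name in wanted for name in column_names]


mask = categorical_mask_sketch(["age", "sex", "income"], ["sex"])
no_mask = categorical_mask_sketch(["age", "sex", "income"])
```

Here mask is [False, True, False] and no_mask is None.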

easymlpy.setters.set_column_names(column_names, dependent_variable, exclude_variables=None, preprocess=None, categorical_variables=None)

Set column names.

This function helps decide what the updated column names of a data frame should be within the easyml framework, based on the dependent variable, preprocessing function, excluded variables, and categorical variables.

Parameters:
  • column_names – A list of strings; the column names of the data for this analysis.
  • dependent_variable – A string; the dependent variable for this analysis.
  • preprocess – A function; the function for preprocessing the data. Defaults to None.
  • exclude_variables – A list of strings; the variables from the data set to exclude. Defaults to None.
  • categorical_variables – A list of strings; the variables that are categorical. Defaults to None.
Returns:

The updated columns, in the correct order for preprocessing.

easymlpy.setters.set_dependent_variable(data, dependent_variable)

Set dependent variable.

This helper function isolates the dependent variable in a data frame.

Parameters:
  • data – An object of class pandas.DataFrame; the data to be analyzed.
  • dependent_variable – A string; the dependent variable for this analysis.
Returns:

An ndarray, the dependent variable of the analysis.

easymlpy.setters.set_independent_variables(data, dependent_variable)

Set independent variables.

This helper function isolates the independent variables in a data frame.

Parameters:
  • data – An object of class pandas.DataFrame; the data to be analyzed.
  • dependent_variable – A string; the dependent variable for this analysis.
Returns:

An object of class pandas.DataFrame; the independent variables of the analysis.

easymlpy.setters.set_measure(measure=None, family=None)

Set measure function.

Sets the function responsible for measuring the results.

Parameters:
  • measure – A function; the function for measuring the results. Defaults to None.
  • family – A string; the type of regression to run on the data. Choices are either ‘gaussian’ or ‘binomial’.
Returns:

A function; the function for measuring the results.

easymlpy.setters.set_plot_model_performance(measure)

Set plot model performance function.

Sets the function responsible for plotting the measures of model performance generated from the predictions generated from a fitted model.

Parameters:measure – A function; the function for measuring the results.
Returns:A function; the function for plotting the measures of model performance generated from the predictions generated from a fitted model.
easymlpy.setters.set_plot_predictions(family=None)

Set plot predictions function.

Sets the function responsible for plotting the predictions generated from a fitted model.

Parameters:family – A string; the type of regression to run on the data. Choices are either ‘gaussian’ or ‘binomial’.
Returns:A function; the function for plotting the predictions generated from a fitted model.
easymlpy.setters.set_preprocess(preprocess=None)

Set preprocess function.

Sets the function responsible for preprocessing the data.

Parameters:preprocess – A function; the function for preprocessing the data. Defaults to None.
Returns:A function; the function for preprocessing the data.

easymlpy.support_vector_machine module

Functions for support vector machine analysis.

class easymlpy.support_vector_machine.easy_support_vector_machine(data, dependent_variable, algorithm='support_vector_machine', family='gaussian', resample=None, preprocess=<function preprocess_scale>, measure=None, exclude_variables=None, categorical_variables=None, train_size=0.667, survival_rate_cutoff=0.05, n_samples=1000, n_divisions=1000, n_iterations=10, random_state=None, progress_bar=True, n_core=1, generate_coefficients=False, generate_variable_importances=False, generate_predictions=True, generate_model_performance=True, model_args=None)

Bases: easymlpy.core.easy_analysis

Easily build and evaluate a support vector machine model.

This function wraps the easyml core framework, allowing a user to easily run the easyml methodology for a support vector machine model.

Please see the core class easy_analysis for more details on arguments.

create_estimator()

Create an estimator.

Creates an estimator depending on the family of regression.

Returns:A scikit-learn estimator.
predict_model(model, X)

Predict values from model.

Generates predictions from a model depending on the family of regression.

Parameters:
  • model – The model to use for generating predictions.
  • X – The data to use for generating predictions.
Returns:

An ndarray.

easymlpy.utils module

Utility functions.

easymlpy.utils.reduce_cores(n_core, cpu_count=None)

Reduces cores.

If the requested number of cores exceeds the number of CPUs available on the machine, n_core is reduced to the number available.

Parameters:
  • n_core – An integer; the number of cores to use for the analysis.
  • cpu_count – An integer or None; the number of CPUs available on the machine. Defaults to os.cpu_count() if None.
Returns:

The number of cores.
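
The capping behaviour described above can be sketched as follows. This is an illustrative stand-in rather than the library's code. The function name reduce_cores_sketch is hypothetical, and falling back to a single core when os.cpu_count() cannot determine the count is an assumption.

```python
import os


def reduce_cores_sketch(n_core, cpu_count=None):
    """Hypothetical stand-in: never request more cores than are available."""
    if cpu_count is None:
        # os.cpu_count() may return None; assume one core in that case.
        cpu_count = os.cpu_count() or 1
    return min(n_core, cpu_count)


capped = reduce_cores_sketch(8, cpu_count=4)
unchanged = reduce_cores_sketch(2, cpu_count=4)
```

Here capped is 4 and unchanged is 2. Passing cpu_count explicitly makes the behaviour easy to test independently of the host machine.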

easymlpy.utils.remove_variables(data, exclude_variables=None)

Removes variables from the data set.

If passed a list of variable names to exclude, remove_variables will drop those variables from the dataset.

Parameters:
  • data – A pandas.DataFrame.
  • exclude_variables – A list of strings.
Returns:

A pandas.DataFrame.