machine_learning_tools package

Submodules

machine_learning_tools.clustering_ml module

machine_learning_tools.clustering_ml.adjusted_rand_score(labels_true, labels_pred, verbose=False)[source]
machine_learning_tools.clustering_ml.calculate_k_means_loss(data, labels, cluster_centers)[source]

Purpose: Will calculate the k-means loss, which depends on: 1) the cluster centers 2) the current labels of the data

Pseudocode: For each datapoint:

  1. Calculate the squared euclidean distance between datapoint and center of cluster it is assigned to

Sum up all of the squared distances
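
A minimal NumPy sketch of this loss (an illustrative version following the argument names above, not necessarily the package's exact implementation):

    import numpy as np

    def k_means_loss_sketch(data, labels, cluster_centers):
        # squared euclidean distance between each datapoint and its assigned center
        diffs = data - cluster_centers[labels]
        # sum up all of the squared distances
        return np.sum(np.sum(diffs ** 2, axis=1))

    # tiny usage example
    data = np.array([[0., 0.], [1., 1.], [10., 10.]])
    labels = np.array([0, 0, 1])
    centers = np.array([[0.5, 0.5], [10., 10.]])
    print(k_means_loss_sketch(data, labels, centers))  # 1.0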

machine_learning_tools.clustering_ml.category_classifications(model, labeled_data, return_dataframe=True, verbose=False, classification_types=['hard', 'soft'])[source]
machine_learning_tools.clustering_ml.closest_k_nodes_on_dendrogram(node, k, G=None, model=None, verbose=False)[source]

Purpose: Find the k nodes closest to a given node by traversing the dendrogram

machine_learning_tools.clustering_ml.closet_k_neighbors_from_hierarchical_clustering(X, node_name, row_names, k, n_components=3, verbose=False)[source]
machine_learning_tools.clustering_ml.cluster_stats_dataframe(labeled_data_classification)[source]

Purpose: Just want to visualize the soft and the hard assignment (and show they are not that different)

Pseudocode: 1)

machine_learning_tools.clustering_ml.clustering_stats(data, clust_perc=0.8)[source]

Will compute different statistics about the clusters formed, to be shown or plotted later

Metrics: For each category and classification type 1) the identity of the highest cluster 2) highest_cluster_percentage 3) the number of clusters needed to encompass clust_perc % of the category 4) a purity statistic

machine_learning_tools.clustering_ml.compute_average_log_likelihood_per_K(peaks, N=5000, n_iterations=10, K_list=array([8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]), return_train=True, covariance_type='full')[source]
machine_learning_tools.clustering_ml.dendrogram_HC(model, p=10000, no_plot=True, **kwargs)[source]

Purpose: to create a dendrogram and plot it for a hierarchical clustering model

Ex:
    p = 1000  # plot the top levels of the dendrogram
    curr_dendrogram = clu.dendrogram_HC(model, no_plot=False)

machine_learning_tools.clustering_ml.dendrogram_graph_from_model(model)[source]

Purpose: will return the dendrogram as a graph object so you can navigate it

machine_learning_tools.clustering_ml.dendrogram_leaves_ordered(model, **kwargs)[source]

Gets the order of the leaves in the dendrogram

Application: For bi-clustering

machine_learning_tools.clustering_ml.gmm_analysis(X_train, possible_K=[2, 3, 4, 5, 6, 7], reg_covar=1e-05, init_params='kmeans', covariance_type='full', pca_obj=None, scaler_obj=None, column_titles=None, model_type='mixture', verbose=True)[source]

Purpose: Will perform a GMM analysis for each of the specified numbers of clusters and save the models and relevant data for further analysis

machine_learning_tools.clustering_ml.gmm_classification(gmm_model, curr_data, classification='hard', verbose=True, return_counts=True)[source]

Purpose: Will use the Gaussian mixture model passed in to classify each data point by the cluster to which it belongs

machine_learning_tools.clustering_ml.gmm_hard_classify(model, df, classes_as_str=False, verbose=False)[source]
machine_learning_tools.clustering_ml.gmm_pipeline(df, title_suffix=None, labeled_data_indices=None, category_column=None, columns_picked=None, possible_K=[2, 3, 4, 5, 6, 7], print_tables=None, apply_normalization=True, apply_pca=True, pca_whiten=True, plot_sqrt_eigvals=True, n_components_pca=None, classification_types=['hard'], model_type='mixture', verbose=True)[source]

Will carry out all of the clustering analysis and advanced stats analysis on a given dataset

Arguments: A data table with all of the labeled data

machine_learning_tools.clustering_ml.k_mean_clustering(data, n_clusters=3, max_iterations=1000, return_snapshots=True, verbose=True)[source]

Purpose: Will take in the input data and the number of expected clusters and run the K-Means algorithm to cluster the data into k-clusters

Arguments: - data (np.array): the data points in R^40 to be clustered - n_clusters: number of expected clusters - max_iterations: upper bound on the number of iterations - return_snapshots: return the label assignments, cluster centers, and loss value for every iteration

Returns: - final_data_labels: what cluster each datapoint was assigned to at end - final_cluster_centers: cluster centers on last iteration - final_loss: steady-state value of loss function at end

  • snapshots of labels,centers and loss if requested

Pseudocode: 1) Randomly assign labels to data 2) Calculate the cluster centers from random labels 3) Calculate loss value 4) Begin iteration loop:

  1. Reassign data labels to the closest cluster center

  2. Recalculate the cluster center

  3. Calculate the k-means loss

  4. If the k-means loss did not change from previous value

    OR max_iterations is reached

    break out of loop

5) Return the final values (and snapshots if requested)
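
A minimal NumPy sketch of the loop described by this pseudocode (illustrative only, not the package's own implementation; the random re-seeding of empty clusters is an added assumption):

    import numpy as np

    def k_means_sketch(data, n_clusters=3, max_iterations=1000, seed=0):
        rng = np.random.default_rng(seed)
        # 1) randomly assign labels to the data
        labels = rng.integers(0, n_clusters, size=len(data))
        prev_loss = np.inf
        for _ in range(max_iterations):
            # 2) cluster centers = mean of the points assigned to each cluster
            centers = np.array([data[labels == k].mean(axis=0)
                                if np.any(labels == k)
                                else data[rng.integers(len(data))]
                                for k in range(n_clusters)])
            # 3) reassign each datapoint to its closest cluster center
            dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
            labels = np.argmin(dists, axis=1)
            # 4) k-means loss; break when it no longer changes
            loss = np.sum(np.min(dists, axis=1) ** 2)
            if loss == prev_loss:
                break
            prev_loss = loss
        return labels, centers, loss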

machine_learning_tools.clustering_ml.normalized_mutual_info_score(labels_true, labels_pred, verbose=False)[source]
machine_learning_tools.clustering_ml.plot_4D_GMM_clusters(X_train, X_test=None, K=10, covariance_type='full')[source]

To graph the 4D clustering of the peaks of the AP

machine_learning_tools.clustering_ml.plot_BIC_and_Likelihood(gmm_data, fig_width=12, fig_height=5, title_suffix=None)[source]

Purpose

machine_learning_tools.clustering_ml.plot_advanced_stats_per_k(advanced_stats_per_k, stats_to_plot=['highest_cluster_perc', 'purity'], title_suffix='', fig_width=12, fig_height=5)[source]

Purpose: plotting the highest cluster and purity as a function of k

Pseudocode: 0) Get all the possible categories, n_clusters 0) Sort by n_clusters 1) Iterate through all the stats we want to plot

  1. Iterate through all of the categories

– for all n_clusters: a. Restrict by category and n_clusters and pull down the statistic b. Add to the list c. Save the full list in a dictionary

  2. Plot the stat using the category dictionary (using the ax index id)

machine_learning_tools.clustering_ml.plot_loss_function_history(loss_history, title='K-Means Loss vs. Iterations', n_clusters=None)[source]
machine_learning_tools.clustering_ml.plot_voltage_vs_time(n_clusters, final_cluster_centers, final_labels, data)[source]

Purpose: For each cluster:

  1. Plot the cluster center as a waveform

  2. All the waveform snippets assigned to the cluster

machine_learning_tools.clustering_ml.purity_score(labels_true, labels_pred, verbose=False)[source]
machine_learning_tools.clustering_ml.reassign_data_to_clusters(data, cluster_centers)[source]
machine_learning_tools.clustering_ml.updated_cluster_centers(data, labels, n_clusters)[source]

Calculate the new cluster centers for k-means by averaging the data points of those assigned to each cluster

machine_learning_tools.data_annotation_utils module

Pigeon is a fast way to annotate data: https://github.com/agermanidis/pigeon

machine_learning_tools.data_input_utils module

machine_learning_tools.data_input_utils.df_from_rda(filepath, verbose=True, reset_index=False, use_label_for_name=True)[source]

Convert an .rda R file into a pandas DataFrame

machine_learning_tools.dimensionality_reduction_ml module

machine_learning_tools.dimensionality_reduction_ml.add_dimensionality_reduction_embeddings_to_df(df, method='UMAP', feature_columns=None, return_transform_columns=False, n_components=3, **kwargs)[source]
machine_learning_tools.dimensionality_reduction_ml.compute_total_variance(data)[source]
machine_learning_tools.dimensionality_reduction_ml.data_covariance(data)[source]
machine_learning_tools.dimensionality_reduction_ml.data_mean(data)[source]
machine_learning_tools.dimensionality_reduction_ml.dimensionality_reduction_by_method(X, method='umap', n_components=3, plot=False, plot_kwargs=None, y=None, verbose=False, **kwargs)[source]

Purpose: To apply a dimensionality reduction technique to a dataset (and optionally plot)

Ex:
    dimensionality_reduction_by_method(
        method="tsne",
        X=X_pca[y != "Unsure"],
        n_components=2,
        y=y[y != "Unsure"],
        plot=True,
        plot_kwargs=dict(
            target_to_color=ctu.cell_type_fine_color_map,
            ndim=3,
        ),
    )

machine_learning_tools.dimensionality_reduction_ml.dimensionality_reduction_by_umap(x, random_state=42, n_neighbors=15, min_dist=0.1)[source]
machine_learning_tools.dimensionality_reduction_ml.eigen_decomp(data=None, cov_matrix=None)[source]
machine_learning_tools.dimensionality_reduction_ml.explained_variance(data=None, return_cumulative=True, eigenValues_sorted=None)[source]
machine_learning_tools.dimensionality_reduction_ml.fraction_of_variance_after_proj_back_proj()[source]
machine_learning_tools.dimensionality_reduction_ml.kth_eigenvector_proj(data, k, whiten=False, plot_perc_variance_explained=False, **kwargs)[source]

Purpose: To get the data projected onto the eigenvector with the k-th largest eigenvalue

machine_learning_tools.dimensionality_reduction_ml.largest_eigenvector_proj(data, whiten=False, plot_perc_variance_explained=False, **kwargs)[source]
machine_learning_tools.dimensionality_reduction_ml.pca_analysis(data, n_components=None, whiten=False, method='sklearn', plot_sqrt_eigvals=False, plot_perc_variance_explained=False, verbose=False, **kwargs)[source]

Arguments: - data: where each data point is stored as a column vector - method: whether PCA is performed with sklearn or with a manual implementation - n_components: the number of principal components to analyze - plot_sqrt_eigvals: whether to plot the square root of the eigenvalues at the end

Purpose: Will compute the following parts of the PCA analysis:

- mean
- covariance_matrix
- eigenvalues (the variance explained)
- eigenvectors (the principal components), as row vectors
- percent_variance_explained
- percent_variance_explained_up_to_n_comp
- data_proj = the data projected onto the first n_components PCs
- data_backproj = the data points projected into PC space and reprojected back into the original R^N space (possibly with reduced components, used for reconstruction)

Ex:

    # practice on iris data
    from sklearn import datasets
    iris = datasets.load_iris()
    test_data = iris.data

    pca_dict_sklearn = pca_analysis(data=test_data,
                                    n_components=3,
                                    method="sklearn")

    pca_dict_manual = pca_analysis(data=test_data,
                                   n_components=3,
                                   method="manual")
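
For reference, a minimal manual-PCA sketch producing the quantities listed above (an illustrative version that assumes one data point per row; it is not the package's implementation):

    import numpy as np
    from sklearn import datasets

    def pca_manual_sketch(data, n_components=2):
        mean = data.mean(axis=0)
        centered = data - mean
        cov = np.cov(centered, rowvar=False)            # covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
        order = np.argsort(eigvals)[::-1]
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        perc_var = eigvals / eigvals.sum()              # percent variance explained
        W = eigvecs[:, :n_components]                   # top principal components
        data_proj = centered @ W                        # projection onto the PCs
        data_backproj = data_proj @ W.T + mean          # back-projection / reconstruction
        return dict(mean=mean, covariance_matrix=cov, eigenvalues=eigvals,
                    eigenvectors=eigvecs.T, percent_variance_explained=perc_var,
                    data_proj=data_proj, data_backproj=data_backproj)

    out = pca_manual_sketch(datasets.load_iris().data, n_components=3)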

machine_learning_tools.dimensionality_reduction_ml.plot_projected_data(data_proj, labels=None, axis_prefix='', text_to_plot_dict=None, text_to_plot_individual=None, use_labels_as_text_to_plot=False, cmap='viridis', figsize=(10, 10), scatter_alpha=0.5, title='')[source]

To plot the PC projection in 3D

machine_learning_tools.dimensionality_reduction_ml.plot_projected_data_2D(data_proj, labels=None, axis_prefix='Proj', text_to_plot_dict=None, cmap=<matplotlib.colors.LinearSegmentedColormap object>, figsize=(10, 10), scatter_alpha=0.5, title='')[source]

To plot the PC projection in 2D

machine_learning_tools.dimensionality_reduction_ml.plot_sq_root_eigvals(eigVals, title=None, title_prefix=None)[source]

Create a square root eigenvalue plot from pca analysis

machine_learning_tools.dimensionality_reduction_ml.plot_top_2_PC_and_mean_waveform(data, spikewaves_pca=None, title='Waveforms for Mean waveform and top 2 PC', title_prefix=None, scale_by_sq_root_eigVal=True, return_spikewaves_pca=False, mean_scale=1)[source]
machine_learning_tools.dimensionality_reduction_ml.plot_um(UM, height=8, width=4, title='Imshow of the UM matrix')[source]
machine_learning_tools.dimensionality_reduction_ml.plot_variance_explained(data_var=None, pca_model=None, title=None, title_prefix=None)[source]

Create a variance-explained plot from the PCA analysis

machine_learning_tools.dimensionality_reduction_ml.projected_and_backprojected_data(data, eigenVectors_sorted_filt)[source]
machine_learning_tools.dimensionality_reduction_ml.second_largest_eigenvector_proj(data, whiten=False, plot_perc_variance_explained=False, **kwargs)[source]

machine_learning_tools.dimensionality_reduction_utils module

machine_learning_tools.dimensionality_reduction_utils.compute_total_variance(data)[source]
machine_learning_tools.dimensionality_reduction_utils.data_covariance(data)[source]
machine_learning_tools.dimensionality_reduction_utils.data_mean(data)[source]
machine_learning_tools.dimensionality_reduction_utils.eigen_decomp(data=None, cov_matrix=None)[source]
machine_learning_tools.dimensionality_reduction_utils.explained_variance(data=None, return_cumulative=True, eigenValues_sorted=None)[source]
machine_learning_tools.dimensionality_reduction_utils.fraction_of_variance_after_proj_back_proj()[source]
machine_learning_tools.dimensionality_reduction_utils.kth_eigenvector_proj(data, k, whiten=False, plot_perc_variance_explained=False, **kwargs)[source]

Purpose: To get the data projected onto the eigenvector with the k-th largest eigenvalue

machine_learning_tools.dimensionality_reduction_utils.largest_eigenvector_proj(data, whiten=False, plot_perc_variance_explained=False, **kwargs)[source]
machine_learning_tools.dimensionality_reduction_utils.pca_analysis(data, n_components=None, whiten=False, method='sklearn', plot_sqrt_eigvals=False, plot_perc_variance_explained=False, verbose=False, **kwargs)[source]

Arguments: - data: where each data point is stored as a column vector - method: whether PCA is performed with sklearn or with a manual implementation - n_components: the number of principal components to analyze - plot_sqrt_eigvals: whether to plot the square root of the eigenvalues at the end

Purpose: Will compute the following parts of the PCA analysis:

- mean
- covariance_matrix
- eigenvalues (the variance explained)
- eigenvectors (the principal components), as row vectors
- percent_variance_explained
- percent_variance_explained_up_to_n_comp
- data_proj = the data projected onto the first n_components PCs
- data_backproj = the data points projected into PC space and reprojected back into the original R^N space (possibly with reduced components, used for reconstruction)

Ex:

    # practice on iris data
    from sklearn import datasets
    iris = datasets.load_iris()
    test_data = iris.data

    pca_dict_sklearn = pca_analysis(data=test_data,
                                    n_components=3,
                                    method="sklearn")

    pca_dict_manual = pca_analysis(data=test_data,
                                   n_components=3,
                                   method="manual")

machine_learning_tools.dimensionality_reduction_utils.plot_projected_data(data_proj, labels, axis_prefix='Proj', text_to_plot_dict=None)[source]

To plot the PC projection in 3D

machine_learning_tools.dimensionality_reduction_utils.plot_sq_root_eigvals(eigVals, title=None, title_prefix=None)[source]

Create a square root eigenvalue plot from pca analysis

machine_learning_tools.dimensionality_reduction_utils.plot_top_2_PC_and_mean_waveform(data, spikewaves_pca=None, title='Waveforms for Mean waveform and top 2 PC', title_prefix=None, scale_by_sq_root_eigVal=True, return_spikewaves_pca=False, mean_scale=1)[source]
machine_learning_tools.dimensionality_reduction_utils.plot_um(UM, height=8, width=4, title='Imshow of the UM matrix')[source]
machine_learning_tools.dimensionality_reduction_utils.plot_variance_explained(data_var, title=None, title_prefix=None)[source]

Create a variance-explained plot from the PCA analysis

machine_learning_tools.dimensionality_reduction_utils.projected_and_backprojected_data(data, eigenVectors_sorted_filt)[source]
machine_learning_tools.dimensionality_reduction_utils.second_largest_eigenvector_proj(data, whiten=False, plot_perc_variance_explained=False, **kwargs)[source]

machine_learning_tools.evaluation_metrics_utils module

machine_learning_tools.evaluation_metrics_utils.accuracy(M)[source]
machine_learning_tools.evaluation_metrics_utils.average_and_class_accuracy(M, labels)[source]
machine_learning_tools.evaluation_metrics_utils.class_accuracy(M)[source]
machine_learning_tools.evaluation_metrics_utils.class_accuracy_str(M, labels)[source]
machine_learning_tools.evaluation_metrics_utils.class_mean_accuracy(M)[source]
machine_learning_tools.evaluation_metrics_utils.confusion_matrix(y_true, y_pred, labels=None, normalize=None, return_df=False, df=None)[source]
machine_learning_tools.evaluation_metrics_utils.normalize_confusion_matrix(cf_matrix, axis=1)[source]
machine_learning_tools.evaluation_metrics_utils.plot_confusion_matrix(cf_matrix, annot=True, annot_fontsize=30, cell_fmt='.2f', cmap='Blues', vmin=0, vmax=1, axes_font_size=20, xlabel_rotation=15, ylabel_rotation=0, xlabels=None, ylabels=None, plot_colorbar=True, colobar_tick_fontsize=25, ax=None)[source]

machine_learning_tools.feature_selection_utils module

Functions to help with feature evaluation

machine_learning_tools.feature_selection_utils.best_subset_k(df, k, model, target_name=None, y=None, verbose=False, evaluation_method='R_squared', return_model=False)[source]

Purpose: To pick the best subset of features for a certain number of features allowed

  • the evaluation of the best subset is chosen by the evaluation_method: R^2, MSE

Pseudocode: 0) divide df into X,y 1) Get all choose k subsets of the features 2) For each combination of features: - find the evaluation score

sklm.best_subset_k(
    df,
    k=2,
    target_name=target_name,
    model=sklm.LinearRegression(),
    evaluation_method="MSE",
    verbose=True,
)
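
A minimal sketch of the exhaustive search this describes (illustrative only; it assumes scikit-learn, a numeric DataFrame, and MSE as the evaluation method):

    from itertools import combinations
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    def best_subset_k_sketch(df, k, target_name, model=None):
        model = model if model is not None else LinearRegression()
        X, y = df.drop(columns=[target_name]), df[target_name]
        best = None
        # 1) all choose-k subsets of the features, 2) score each combination
        for subset in combinations(X.columns, k):
            model.fit(X[list(subset)], y)
            mse = mean_squared_error(y, model.predict(X[list(subset)]))
            if best is None or mse < best[1]:
                best = (subset, mse)
        return best  # (best feature subset, its MSE)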

machine_learning_tools.feature_selection_utils.best_subset_k_individual_sklearn(df, target_name, k, model_type='regression', verbose=False, return_data=False)[source]

Purpose: To run the sklearn best subsets k using built in sklearn method (NOTE THIS SELECTS THE BEST FEATURES INDIVIDUALLY)

Useful Link: https://www.datatechnotes.com/2021/02/seleckbest-feature-selection-example-in-python.html

Example

'''
Note: This method uses the following evaluation criterion for the best feature:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html

  1. The correlation between each regressor and the target is computed, that is, ((X[:, i] - mean(X[:, i])) * (y - mean_y)) / (std(X[:, i]) * std(y)).

  2. It is converted to an F score and then to a p-value.
'''

    best_features_over_sk = []

    for k in tqdm(range(1, n_features + 1)):
        eval_method = "sklearn"
        curr_best = fsu.best_subset_k_individual_sklearn(
            df,
            k=k,
            target_name=target_name,
            verbose=False,
        )
        best_features_over_sk.append(dict(k=k,
                                          evaluation_method=eval_method,
                                          best_subset=curr_best))

    import pandas as pd
    print(f"Using sklearn method")
    pd.DataFrame.from_records(best_features_over_sk)

machine_learning_tools.feature_selection_utils.reverse_feature_elimination(df, k, model, target_name=None, y=None, verbose=False, return_model=False)[source]

Use the sklearn function for recursively eliminating the least important features

How does it pick the best features? By the absolute value of model.coef_ (not considering the p-value)
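
A minimal sketch with sklearn's RFE, which implements this recursive elimination (illustrative usage, not necessarily identical to the wrapper above):

    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LinearRegression

    def rfe_sketch(X, y, k):
        # repeatedly drops the feature with the smallest |coef_| until k remain
        selector = RFE(estimator=LinearRegression(), n_features_to_select=k)
        selector.fit(X, y)
        return X.columns[selector.support_]  # assumes X is a DataFrame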

machine_learning_tools.hyperparameters_ml module

machine_learning_tools.hyperparameters_ml.best_hyperparams_RandomizedSearchCV(clf, parameter_dict, X, y, n_iter_search=2, return_clf=False, return_cv_results=True, verbose=True, n_cv_folds=5, n_jobs=1)[source]

Purpose: To find the best parameters from a random search over a parameter space defined by a dict

Source: https://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html#sphx-glr-auto-examples-model-selection-plot-randomized-search-py
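
A minimal sketch of the underlying sklearn call (illustrative; the classifier and parameter distributions below are hypothetical placeholders):

    from scipy.stats import randint
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    clf = RandomForestClassifier()
    parameter_dict = {"max_depth": randint(2, 10),
                      "n_estimators": randint(10, 200)}
    search = RandomizedSearchCV(clf, parameter_dict, n_iter=2, cv=5, n_jobs=1)
    # search.fit(X, y); search.best_params_ then holds the best sampled parameters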

machine_learning_tools.machine_learning_utils module

machine_learning_tools.machine_learning_utils.decision_tree_analysis(df, target_column, max_depth=None, max_features=None)[source]

Purpose: To perform decision tree analysis and plot it

machine_learning_tools.machine_learning_utils.decision_tree_sklearn(df, target_column, feature_columns=None, perform_testing=False, test_size=0, criterion='entropy', splitter='best', max_depth=None, max_features=None, min_samples_split=0.1, min_samples_leaf=0.02)[source]

Purpose: To train a decision tree based on a dataframe with the features and the classifications

Parameters: max_depth: if None then the depth is chosen so that all leaves contain fewer than min_samples_split samples

The higher the depth, the more overfitting

machine_learning_tools.machine_learning_utils.encode_labels_as_ints(labels)[source]

Purpose: Will convert a list of labels into an array encoded 0,1,2…. and return the unique labels used

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

Ex:
    import machine_learning_utils as mlu
    mlu.encode_labels_as_ints(["hi","hello","yo","yo"])

machine_learning_tools.machine_learning_utils.export_model(model, path)[source]
machine_learning_tools.machine_learning_utils.kNN_classifier(X, y, n_neighbors, labels_list=None, weights='distance', verbose=False, plot_map=False, add_labels=True, **kwargs)[source]

Purpose: Will create a kNN classifier and optionally plot the decision map

Ex: Running it on the iris data

    from sklearn import neighbors, datasets
    import machine_learning_utils as mlu

    n_neighbors = 15

    # import some data to play with
    iris = datasets.load_iris()

    # we only take the first two features. We could avoid this ugly
    # slicing by using a two-dim dataset
    X = iris.data[:, :2]
    y = iris.target

    mlu.kNN_classifier(X, y,
                       n_neighbors=n_neighbors,
                       labels_list=iris.target_names,
                       plot_map=True,
                       feature_1_name=iris.feature_names[0],
                       feature_2_name=iris.feature_names[1],
                       )

machine_learning_tools.machine_learning_utils.load_model(path)[source]
machine_learning_tools.machine_learning_utils.plot_classifier_map(clf, data_to_plot=None, data_to_plot_color='red', data_to_plot_size=50, figsize=(8, 6), map_fill_colors=None, scatter_colors=None, h=0.02, feature_1_idx=0, feature_2_idx=1, feature_1_name='feature_1_name', feature_2_name='feature_2_name', x_min=None, x_max=None, y_min=None, y_max=None, verbose=False, plot_training_points=False, **kwargs)[source]

Purpose: To plot the decision map of a classifier over two features

machine_learning_tools.machine_learning_utils.plot_decision_tree(clf, feature_names, class_names=None)[source]

Purpose: Will show the structure of the fitted decision tree

machine_learning_tools.machine_learning_utils.predict_class_single_datapoint(clf, data, verbose=False, return_probability=False)[source]

Purpose: To predict the class of a single datapoint

Ex:
    data = [1,1]
    mlu.predict_class_single_datapoint(clf, data, verbose=True)

machine_learning_tools.machine_learning_utils.print_tree_structure_description(clf)[source]

machine_learning_tools.matplotlib_ml module

Notes on other functions:

eventplot: will plot 1D data as lines and can stack multiple 1D events; doing a lot of these gives the characteristic neuron spikes, all stacked on top of each other

matplotlib colors can be described with "C102", where for C{number} there are only 10 possible colors but the number can go as high as you want (it just repeats after 10). Ex: C100 = C110

# How to set the figure size:
fig.set_size_inches(18.5, 10.5)

# To not have the subplots run into each other:
fig.tight_layout()

machine_learning_tools.matplotlib_ml.add_random_color_for_missing_labels_in_dict(labels, label_color_dict, verbose=False)[source]

Purpose: Will generate random colors for labels that are missing in the labels dict

machine_learning_tools.matplotlib_ml.apply_alpha_to_color_list(color_list, alpha=0.2, print_flag=False)[source]
machine_learning_tools.matplotlib_ml.bins_from_width_range(bin_width, bin_max, bin_min=None)[source]

To compute the bin boundaries to help with plotting and give a constant bin width
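
For instance, constant-width bin edges can be produced with np.arange (a sketch of the idea, not necessarily the exact implementation):

    import numpy as np

    def bins_from_width_range_sketch(bin_width, bin_max, bin_min=0):
        # edges spaced bin_width apart, covering [bin_min, bin_max]
        return np.arange(bin_min, bin_max + bin_width, bin_width)

    bins_from_width_range_sketch(20, 100)  # array([  0,  20,  40,  60,  80, 100])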

machine_learning_tools.matplotlib_ml.closest_colour(requested_colour)[source]
machine_learning_tools.matplotlib_ml.color_to_hex(color)[source]
machine_learning_tools.matplotlib_ml.color_to_rgb(color_str)[source]

To turn a string of a color into an RGB value

Ex: color_to_rgb("red")

machine_learning_tools.matplotlib_ml.color_to_rgba(current_color, alpha=0.2)[source]
machine_learning_tools.matplotlib_ml.convert_dict_rgb_values_to_names(color_dict)[source]

Purpose: To convert a dictionary with colors as values to the color names instead of the rgb equivalents

Application: can be used on the color dictionary returned by the neuron plotting function

Example:
    from datasci_tools import matplotlib_utils as mu
    mu = reload(mu)
    nviz = reload(nviz)

    returned_color_dict = nviz.visualize_neuron(
        uncompressed_neuron,
        visualize_type=["network"],
        network_resolution="branch",
        network_directional=True,
        network_soma=["S1","S0"],
        network_soma_color=["black","red"],
        limb_branch_dict=dict(L1="all", L2="all"),
        node_size=1,
        arrow_size=1,
        return_color_dict=True,
    )

    color_info = mu.convert_dict_rgb_values_to_names(returned_color_dict)

machine_learning_tools.matplotlib_ml.convert_rgb_to_name(rgb_value)[source]

Example: convert_rgb_to_name(np.array([[1,0,0,0.5]]))

machine_learning_tools.matplotlib_ml.display_figure(fig)[source]
machine_learning_tools.matplotlib_ml.generate_color_list(user_colors=[], n_colors=-1, colors_to_omit=[], alpha_level=0.2, return_named_colors=False)[source]

Can specify the number of colors that you want, the colors that you don't want, and what alpha you want

Example of how to use:
    colors_array = generate_color_list(colors_to_omit=["green"])

machine_learning_tools.matplotlib_ml.generate_color_list_no_alpha_change(user_colors=[], n_colors=-1, colors_to_omit=[], alpha_level=0.2)[source]

Can specify the number of colors that you want, the colors that you don't want, and what alpha you want

Example of how to use:
    colors_array = generate_color_list(colors_to_omit=["green"])

machine_learning_tools.matplotlib_ml.generate_non_randon_named_color_list(n_colors, user_colors=[], colors_to_omit=[])[source]

To generate a list of colors of a certain length that is non-random

machine_learning_tools.matplotlib_ml.generate_random_color(print_flag=False, colors_to_omit=[])[source]
machine_learning_tools.matplotlib_ml.generate_random_rgba(print_flag=False)[source]
machine_learning_tools.matplotlib_ml.generate_unique_random_color_list(n_colors, print_flag=False, colors_to_omit=[])[source]
machine_learning_tools.matplotlib_ml.get_axes_layout_from_figure(fig)[source]
machine_learning_tools.matplotlib_ml.get_axes_locations_from_figure(fig)[source]
machine_learning_tools.matplotlib_ml.get_colour_name(requested_colour)[source]
machine_learning_tools.matplotlib_ml.get_graph_color_list()[source]
machine_learning_tools.matplotlib_ml.histogram(data, n_bins=50, bin_width=None, bin_max=None, bin_min=None, density=False, logscale=False, return_fig_ax=True, fontsize_axes=20, **kwargs)[source]

Ex: histogram(in_degree, bin_max=700, bin_width=20, return_fig_ax=True)

machine_learning_tools.matplotlib_ml.plot_color_dict(colors, sorted_names=None, hue_sort=False, ncols=4, figure_width=20, figure_height=8, print_flag=True)[source]

Ex:

    # how to plot the base colors
    mu.plot_color_dict(mu.base_colors_dict, figure_height=20)
    mu.plot_color_dict(mu.base_colors_dict, hue_sort=True, figure_height=20)

    # How to plot colors returned from the plotting function:
    from datasci_tools import matplotlib_utils as mu
    mu = reload(mu)
    nviz = reload(nviz)

    returned_color_dict = nviz.visualize_neuron(
        uncompressed_neuron,
        visualize_type=["network"],
        network_resolution="branch",
        network_directional=True,
        network_soma=["S1","S0"],
        network_soma_color=["black","red"],
        limb_branch_dict=dict(L1="all", L2="all"),
        node_size=1,
        arrow_size=1,
        return_color_dict=True,
    )

    mu.plot_color_dict(returned_color_dict, hue_sort=False, figure_height=20)

machine_learning_tools.matplotlib_ml.plot_graph(title, y_values, x_values, x_axis_label, y_axis_label, return_fig=False, figure=None, ax_index=None, label=None, x_axis_int=True)[source]

Purpose: For easy plotting and concatenating plots

machine_learning_tools.matplotlib_ml.process_non_dict_color_input(color_input)[source]

Will return a color list that is as long as n_items based on a diverse set of options for how to specify colors

  • string

  • list of strings

  • 1D np.array

  • list of strings and 1D np.array

  • list of 1D np.array or 2D np.array

Warning: This will not be alpha corrected

machine_learning_tools.matplotlib_ml.reset_default_settings()[source]
machine_learning_tools.matplotlib_ml.scatter_2D_with_labels(X, Y, labels, label_color_dict=None, x_title='', y_title='', axis_append='', Z=None, z_title='', alpha=0.5, verbose=False, move_legend_outside_plot=True)[source]

Purpose: Will plot scatter points where each point has a unique label (and allows to specify the colors of each label)

Pseudocode: 1) Find the unique labels 2) For all unique labels, if a color mapping is not specified then add a random unique color (using the helper function) 3) Iterate through the labels to plot: a. Find all indices of that label b. Plot them with the correct color and label 4) Move the legend to outside of the plot

Ex:
    mu.scatter_2D_with_labels(
        X=np.concatenate([f1_inh, f1_exc]),
        Y=np.concatenate([f2_inh, f2_exc]),
        #Z=np.ones(194),
        x_title=feature_1,
        y_title=feature_2,
        axis_append="(per um of skeleton)",
        labels=np.concatenate([class_inh, class_exc]),
        alpha=0.5,
        label_color_dict=dict(BC="blue",
                              BPC="black",
                              MC="yellow",
                              excitatory="red"),
        verbose=True)

machine_learning_tools.matplotlib_ml.set_font_size(font_size)[source]
machine_learning_tools.matplotlib_ml.set_legend_outside_plot(ax, scale_down=0.8)[source]

Will adjust your axis so that the legend appears outside of the box

machine_learning_tools.numpy_ml module

machine_learning_tools.numpy_ml.all_choose_1_combinations_form_dict_values(parameter_dict, verbose=False)[source]

Purpose: To generate a list of dictionaries that encompass all the possible parameter settings defined by the value options in the dictionary

Pseudocode:
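
A minimal sketch of what this likely does, using itertools.product to take one value per key (illustrative, not the package's code):

    from itertools import product

    def all_choose_1_combinations_sketch(parameter_dict):
        keys = list(parameter_dict)
        # one dictionary per possible combination of parameter values
        return [dict(zip(keys, values))
                for values in product(*(parameter_dict[k] for k in keys))]

    all_choose_1_combinations_sketch({"lr": [0.1, 0.01], "depth": [3, 5]})
    # [{'lr': 0.1, 'depth': 3}, {'lr': 0.1, 'depth': 5},
    #  {'lr': 0.01, 'depth': 3}, {'lr': 0.01, 'depth': 5}]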

machine_learning_tools.numpy_ml.all_directed_choose_2_combinations(array)[source]

Ex:
    seg_split_ids = ["864691136388279671_0",
                     "864691135403726574_0",
                     "864691136194013910_0"]

    output:
    [['864691136388279671_0', '864691135403726574_0'],
     ['864691136388279671_0', '864691136194013910_0'],
     ['864691135403726574_0', '864691136388279671_0'],
     ['864691135403726574_0', '864691136194013910_0'],
     ['864691136194013910_0', '864691136388279671_0'],
     ['864691136194013910_0', '864691135403726574_0']]

machine_learning_tools.numpy_ml.all_partitions(array, min_partition_size=2, verbose=False)[source]

Will form all of the possible 2-partitions of an array, where you can specify the minimum number of elements needed for each partition

Ex: x = nu.all_partitions(array = np.array([4,5,6,9]))

machine_learning_tools.numpy_ml.all_subarrays(l)[source]

Ex:
    from datasci_tools import numpy_utils as nu
    nu.all_subarrays([[1,"a"],[2,"b"],[3,"c"]])

    Output:
    [[],
     [[1, 'a']], [[2, 'b']], [[1, 'a'], [2, 'b']],
     [[3, 'c']], [[1, 'a'], [3, 'c']], [[2, 'b'], [3, 'c']],
     [[1, 'a'], [2, 'b'], [3, 'c']]]

machine_learning_tools.numpy_ml.all_unique_choose_2_combinations(array)[source]

Given a list of numbers or labels, will determine all the possible unique pairings

machine_learning_tools.numpy_ml.all_unique_choose_k_combinations(array, k)[source]
machine_learning_tools.numpy_ml.angle_between_vectors(v1, v2, acute=True, degrees=True, verbose=False)[source]

Ex:
    vec1 = np.array([0,0,1])
    vec2 = np.array([1,1,-0.1])
    angle_between_vectors(vec1, vec2, verbose=True)

machine_learning_tools.numpy_ml.argnan(array)[source]
machine_learning_tools.numpy_ml.argsort_multidim_array_by_rows(array, descending=False)[source]

Ex:
    x = np.array([
        [2,2,3,4,5],
        [-2,2,3,4,5],
        [3,1,1,1,1],
        [1,10,10,10,10],
        [3,0,1,1,1],
        [-2,-3,3,4,5]
    ])

    # showing this argsort will correctly sort
    x[nu.argsort_multidim_array_by_rows(x)]

    >> Output:
    array([[-2, -3,  3,  4,  5],
           [-2,  2,  3,  4,  5],
           [ 1, 10, 10, 10, 10],
           [ 2,  2,  3,  4,  5],
           [ 3,  0,  1,  1,  1],
           [ 3,  1,  1,  1,  1]])

machine_learning_tools.numpy_ml.argsort_rows_of_2D_array_independently(array, descending=False)[source]

Purpose: will return one array of row indices and one of column indices that will sort the values of each row independently of the other rows

Ex:
    x = np.array([
        [2,2,3,4,5],
        [-2,2,3,4,5],
        [3,1,1,1,1],
        [1,10,10,10,10],
        [3,0,1,1,1],
        [-2,-3,3,4,5]
    ])

    row_idx, col_idx = nu.argsort_rows_of_2D_array_independently(x)
    x[row_idx, col_idx]

    Output:
    >> array([[ 2,  2,  3,  4,  5],
              [-2,  2,  3,  4,  5],
              [ 1,  1,  1,  1,  3],
              [ 1, 10, 10, 10, 10],
              [ 0,  1,  1,  1,  3],
              [-3, -2,  3,  4,  5]])

machine_learning_tools.numpy_ml.array_after_exclusion(original_array=[], exclusion_list=[], n_elements=0)[source]

To efficiently get the difference between 2 lists:

    original_list = [1,5,6,10,11]
    exclusion = [10,6]
    n_elements = 20

    array_after_exclusion(n_elements=n_elements, exclusion_list=exclusion)

** pretty much the same thing as: np.setdiff1d(array1, array2)

machine_learning_tools.numpy_ml.array_split(array, n_splits)[source]

Split an array into multiple sub-arrays

Ex: from datasci_tools import numpy_utils as nu nu.array_split(np.arange(0,10),3)

machine_learning_tools.numpy_ml.average_by_weights(values, weights)[source]
machine_learning_tools.numpy_ml.bounding_box_side_lengths(array)[source]
machine_learning_tools.numpy_ml.bounding_box_volume(array)[source]
machine_learning_tools.numpy_ml.bouning_box_corners(array)[source]
machine_learning_tools.numpy_ml.bouning_box_midpoint(array)[source]
machine_learning_tools.numpy_ml.choose_k_combinations(array, k)[source]
machine_learning_tools.numpy_ml.comma_str(num)[source]
machine_learning_tools.numpy_ml.compare_threshold(item1, item2, threshold=0.0001, print_flag=False)[source]

Purpose: Function that takes two scalars or 2D arrays and subtracts them; if the distance between them is less than the specified threshold they are considered equal

Example:
    nu = reload(nu)

    item1 = [[1,4,5,7],
             [1,4,5,7],
             [1,4,5,7]]

    item2 = [[1,4,5,8.00001],
             [1,4,5,7.00001],
             [1,4,5,7.00001]]

    # item1 = [1,4,5,7]
    # item2 = [1,4,5,9.0000001]

    print(nu.compare_threshold(item1, item2, print_flag=True))

machine_learning_tools.numpy_ml.concatenate_arrays_along_last_axis_after_upgraded_to_at_least_2D(arrays)[source]

Example:
    from datasci_tools import numpy_utils as nu
    arrays = [np.array([1,2,3]), np.array([4,5,6])]
    nu.concatenate_arrays_along_last_axis_after_upgraded_to_at_least_2D(arrays)

    >> output:
    array([[1, 4],
           [2, 5],
           [3, 6]])

machine_learning_tools.numpy_ml.concatenate_lists(list_of_lists)[source]
machine_learning_tools.numpy_ml.convert_to_array_like(array, include_tuple=False)[source]

Will convert something to an array

machine_learning_tools.numpy_ml.divide_data_into_classes(classes_array, data_array, unique_classes=None)[source]

Purpose: Will divide two parallel arrays of classes and data into a dictionary that maps each unique class to all of the data that belongs to that class
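
A minimal sketch of the grouping described (illustrative only):

    import numpy as np

    def divide_data_into_classes_sketch(classes_array, data_array):
        classes_array = np.asarray(classes_array)
        data_array = np.asarray(data_array)
        # one dictionary entry per unique class, holding that class's data
        return {c: data_array[classes_array == c] for c in np.unique(classes_array)}

    divide_data_into_classes_sketch(["a", "b", "a"], [1, 2, 3])
    # {'a': array([1, 3]), 'b': array([2])}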

machine_learning_tools.numpy_ml.divide_into_label_indexes(mapping)[source]

Purpose: To take an array that attributes labels to indices and divide it into a list of the arrays that correspond to the indices of all of the labels

machine_learning_tools.numpy_ml.find_matching_endpoints_row(branch_idx_to_endpoints, end_coordinates)[source]
machine_learning_tools.numpy_ml.float_to_datetime(fl)[source]
machine_learning_tools.numpy_ml.function_over_multi_lists(arrays, set_function)[source]
machine_learning_tools.numpy_ml.get_coordinate_distance_matrix(coordinates)[source]
machine_learning_tools.numpy_ml.get_matching_vertices(possible_vertices, ignore_diagonal=True, equiv_distance=0, print_flag=False)[source]

ignore_diagonal is not implemented yet

machine_learning_tools.numpy_ml.indices_of_comparison_func(func, array1, array2)[source]

Returns the indices of the elements that result from applying func to array1 and array2

machine_learning_tools.numpy_ml.interpercentile_range(array, range_percentage, axis=None, verbose=False)[source]

range_percentage should be 50 or 90 (not 0.5 or 0.9)

Purpose: To compute the range that extends from the (100 - range_percentage)/2 percentile to the (100 + range_percentage)/2 percentile

Ex:
    interpercentile_range(np.vstack([np.arange(1,11),
                                     np.arange(1,11),
                                     np.arange(1,11)]),
                          90, verbose=True, axis=1)
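
A hedged sketch of the same computation with np.percentile, under the assumption that the range runs from the (100 - range_percentage)/2 percentile to the (100 + range_percentage)/2 percentile:

    import numpy as np

    def interpercentile_range_sketch(array, range_percentage, axis=None):
        lower = (100 - range_percentage) / 2      # e.g. 5 for range_percentage = 90
        upper = 100 - lower                       # e.g. 95
        return (np.percentile(array, upper, axis=axis)
                - np.percentile(array, lower, axis=axis))

    interpercentile_range_sketch(np.arange(1, 11), 90)  # 8.1 (95th minus 5th percentile)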

machine_learning_tools.numpy_ml.intersect1d(arr1, arr2, assume_unique=False, return_indices=False)[source]

Will return the common elements from 2 possibly different-sized arrays

If return_indices=True, it will also return the indices of the common elements

machine_learning_tools.numpy_ml.intersect1d_multi_list(arrays)[source]
machine_learning_tools.numpy_ml.intersect2d(arr1, arr2)[source]
machine_learning_tools.numpy_ml.intersect2d_multi_list(arrays)[source]
machine_learning_tools.numpy_ml.intersect_indices(array1, array2)[source]

Returns the indices of the intersection of array1 and 2

machine_learning_tools.numpy_ml.intersecting_array_components(arrays, sort_components=True, verbose=False, perfect_match=False)[source]

Purpose: Will find the groups of arrays that are connected components based on overlap of elements

Pseudocode: 1) Create an empty edges list 2) Iterate through all combinations of arrays (skipping the redundant ones): a. Check if there is an intersection b. If yes then add to the edges list 3) Turn the edges into a graph 4) Return the connected components
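
A minimal sketch of this pseudocode using itertools and networkx (illustrative; assumes networkx is available and that "intersection" means sharing at least one element):

    from itertools import combinations
    import networkx as nx
    import numpy as np

    def intersecting_array_components_sketch(arrays):
        G = nx.Graph()
        G.add_nodes_from(range(len(arrays)))
        # add an edge whenever two arrays share at least one element
        for i, j in combinations(range(len(arrays)), 2):
            if len(np.intersect1d(arrays[i], arrays[j])) > 0:
                G.add_edge(i, j)
        return [sorted(c) for c in nx.connected_components(G)]

    intersecting_array_components_sketch([[1, 2], [2, 3], [7, 8]])
    # [[0, 1], [2]]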

machine_learning_tools.numpy_ml.is_array_like(current_data, include_tuple=False)[source]
machine_learning_tools.numpy_ml.load_compressed(filepath)[source]
machine_learning_tools.numpy_ml.load_dict(file_path)[source]
machine_learning_tools.numpy_ml.matching_row_index(vals, row)[source]
machine_learning_tools.numpy_ml.matching_rows(vals, row, print_flag=False, equiv_distance=0.0001)[source]
machine_learning_tools.numpy_ml.matching_rows_old(vals, row, print_flag=False)[source]
machine_learning_tools.numpy_ml.matrix_of_col_idx(n_rows, n_cols)[source]
machine_learning_tools.numpy_ml.matrix_of_row_idx(n_rows, n_cols=None)[source]
machine_learning_tools.numpy_ml.min_max(array, axis=0)[source]
machine_learning_tools.numpy_ml.min_max_3D_coordinates(array)[source]
machine_learning_tools.numpy_ml.mode_1d(array)[source]
machine_learning_tools.numpy_ml.non_empty_or_none(current_data)[source]
machine_learning_tools.numpy_ml.number_matching_vertices_between_lists(arr1, arr2, verbose=False)[source]
machine_learning_tools.numpy_ml.obj_array_to_dtype_array(array, dtype=None)[source]
machine_learning_tools.numpy_ml.order_array_using_original_and_matching(original_array, matching_array, array, verbose=False)[source]

Purpose: To rearrange arrays so that a specific array matches an original array

Pseudocode: 1) Find the matching array elements 2) For each array in arrays index using the matching indices

Ex:
    x = [1,2,3,4,5,6]
    y = [4,6,2]
    arrays = [np.array(["hi","yes","but"])]
    arrays = [np.array(["hi","yes","but"]), ["no","yes","hi"]]
    arrays = [np.array([1,2,3]), [7,8,9]]

    order_arrays_using_original_and_matching(original_array=x,
                                             matching_array=y,
                                             arrays=arrays,
                                             verbose=True)

    Return:
    >> [array(['but', 'hi', 'yes'], dtype='<U3')]

machine_learning_tools.numpy_ml.order_arrays_using_original_and_matching(original_array, matching_array, arrays, verbose=False)[source]

Purpose: To rearrange arrays so that a specific array matches an original array

Pseudocode: 1) Find the matching array elements 2) For each array in arrays index using the matching indices

Ex:
    x = [1,2,3,4,5,6]
    y = [4,6,2]
    arrays = [np.array(["hi","yes","but"])]
    arrays = [np.array(["hi","yes","but"]), ["no","yes","hi"]]
    arrays = [np.array([1,2,3]), [7,8,9]]

    order_arrays_using_original_and_matching(original_array=x,
                                             matching_array=y,
                                             arrays=arrays,
                                             verbose=True)

    Return:
    >> [array(['but', 'hi', 'yes'], dtype='<U3')]

machine_learning_tools.numpy_ml.original_array_indices_of_elements(original_array, matching_array)[source]

Purpose: Will find the indices of the matching array from the original array

Ex:
    x = [1,2,3,4,5,6]
    y = [4,6,2]
    nu.original_array_indices_of_elements(x, y)

machine_learning_tools.numpy_ml.polyfit(x, y, degree)[source]
machine_learning_tools.numpy_ml.polyval(poly, data)[source]
machine_learning_tools.numpy_ml.random_2D_subarray(array, n_samples, replace=False, verbose=False)[source]

Purpose: To choose a random number of rows from a 2D array

Ex:
    from datasci_tools import numpy_utils as nu
    import numpy as np

    y = np.array([[1,3],[3,2],[5,6]])
    nu.random_2D_subarray(array=y,
                          n_samples=2,
                          replace=False)

machine_learning_tools.numpy_ml.random_shuffled_indexes_for_array(array)[source]
machine_learning_tools.numpy_ml.randomly_shuffle_array(array)[source]
machine_learning_tools.numpy_ml.remove_indexes(arr1, arr2)[source]
machine_learning_tools.numpy_ml.remove_nans(array)[source]
machine_learning_tools.numpy_ml.repeat_vector_down_rows(array, n_repeat)[source]
Ex: Turn [705895.1025, 711348.065, 761467.87]

into (the vector repeated n_repeat times down the rows):

    TrackedArray([[705895.1025, 711348.065, 761467.87],
                  [705895.1025, 711348.065, 761467.87],
                  ...,
                  [705895.1025, 711348.065, 761467.87]])

machine_learning_tools.numpy_ml.save_compressed(array, filepath)[source]
machine_learning_tools.numpy_ml.setdiff1d(arr1, arr2, assume_unique=False, return_indices=True)[source]

Purpose: To get the elements in arr1 that aren’t in arr2 and then to possibly return the indices of those that were unique in the first array

machine_learning_tools.numpy_ml.setdiff1d_multi_list(arrays)[source]
machine_learning_tools.numpy_ml.setdiff2d(arr1, arr2)[source]
machine_learning_tools.numpy_ml.sort_elements_in_every_row(current_array)[source]
machine_learning_tools.numpy_ml.sort_multidim_array_by_rows(edge_array, order_row_items=False)[source]

Purpose: To sort an array along the 0 axis where you maintain the row integrity (with possibly sorting the individual elements along a row)

Example: How to get sorted edges
    from datasci_tools import numpy_utils as nu
    nu = reload(nu)
    nu.sort_multidim_array_by_rows(limb_concept_network.edges(), order_row_items=True)

machine_learning_tools.numpy_ml.sort_rows_by_column(array, column_idx, largest_to_smallest=True)[source]

Will sort the rows based on the values of 1 column

machine_learning_tools.numpy_ml.test_matching_vertices_in_lists(arr1, arr2, verbose=False)[source]
machine_learning_tools.numpy_ml.turn_off_scientific_notation()[source]
machine_learning_tools.numpy_ml.union1d_multi_list(arrays)[source]
machine_learning_tools.numpy_ml.unique_non_self_pairings(array)[source]

Purpose: Will take a list of pairings and then filter the list to only unique pairings where there is no self-pairing

machine_learning_tools.numpy_ml.unique_pairings_between_2_arrays(array1, array2)[source]

Turns 2 separate arrays into all possible combinations of elements

Ex: turns

    [1,2], [3,4]

into

    array([[1, 3],
           [1, 4],
           [2, 3],
           [2, 4]])

machine_learning_tools.numpy_ml.unique_rows(array)[source]
machine_learning_tools.numpy_ml.vector_from_endpoints(start_endpoint, end_endpoint, normalize_vector=True)[source]
machine_learning_tools.numpy_ml.weighted_average(array, weights)[source]

Ex:
    from datasci_tools import numpy_utils as nu
    nu.weighted_average(d_widths, d_sk_lengths)

machine_learning_tools.pandas_ml module

Purpose: pandas functions that are useful for machine learning

.iloc: indexes with integers. Ex: df_test.iloc[:5] -> gets the first 5 rows
.loc: indexes with strings. Ex: df_test.loc[df.columns, df.columns[:5]]

machine_learning_tools.pandas_ml.X_y(df, target_name)[source]
machine_learning_tools.pandas_ml.center_df(df)[source]
machine_learning_tools.pandas_ml.correlations_by_col(df, correlation_method='pearson')[source]

Will return a table that has the correlations between all the columns in the dataframe

Other correlation methods: "pearson", "spearman", "kendall"

machine_learning_tools.pandas_ml.correlations_to_target(df, target_name='target', correlation_method='pearson', verbose=False, sort_by_value=True)[source]

Purpose: Will find the correlation between all columns and the target column

machine_learning_tools.pandas_ml.csv_to_df(csv_filepath)[source]
machine_learning_tools.pandas_ml.df_column_summaries(df)[source]
machine_learning_tools.pandas_ml.df_from_X_y(X, y, data_column_names='feature', target_name='target')[source]

Ex: pdml.df_from_X_y(X_trans, y, target_name="cell_type")

machine_learning_tools.pandas_ml.df_mean(df)[source]
machine_learning_tools.pandas_ml.df_no_target(df, target_name)[source]
machine_learning_tools.pandas_ml.df_std_dev(df)[source]
machine_learning_tools.pandas_ml.df_to_csv(df, output_filename='df.csv', output_folder='./', file_suffix='.csv', output_filepath=None, verbose=False, return_filepath=True, compression='infer', index=True)[source]

Purpose: To export a dataframe as a csv file

machine_learning_tools.pandas_ml.df_to_gzip(df, output_filename='df.gzip', output_folder='./', output_filepath=None, verbose=False, return_filepath=True, index=False)[source]

Purpose: Save off a compressed version of dataframe (usually 1/3 of the size)

machine_learning_tools.pandas_ml.dropna(axis=0)[source]

A more straightforward way of dropping NaNs

machine_learning_tools.pandas_ml.feature_names(df, target_name=None)[source]
machine_learning_tools.pandas_ml.filter_away_nan_rows(df)[source]
machine_learning_tools.pandas_ml.gzip_to_df(filepath)[source]
machine_learning_tools.pandas_ml.hstack(dfs)[source]
machine_learning_tools.pandas_ml.n_features(df, target_name=None)[source]
machine_learning_tools.pandas_ml.plot_df_x_y_with_std_err(df, x_column, y_column=None, std_err_column=None, log_scale_x=True, log_scale_y=True, verbose=False)[source]

Purpose: to plot the x and y columns where the y column has an associated standard error with it

Example:
    from machine_learning_tools import pandas_ml as pdml
    pdml.plot_df_x_y_with_std_err(
        df,
        x_column="C",
    )

machine_learning_tools.pandas_ml.split_df_by_target(df, target_name)[source]

machine_learning_tools.preprocessing_ml module

Functions used on data before models analyze the data

Application: 1) Lasso Linear Regression should have all columns on the same scale so they get regularized the same amount

Useful link explaining the different sklearn scalers for preprocessing:

http://benalexkeen.com/feature-scaling-with-scikit-learn/
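
For reference, the sklearn scalers discussed at that link are used like this (a hedged sketch; the name-to-scaler mapping below is an assumption for illustration, not necessarily what get_scaler returns):

    from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

    # hypothetical mapping from a name to a scaler object
    scaler_options = {
        "StandardScaler": StandardScaler(),  # zero mean, unit variance
        "MinMaxScaler": MinMaxScaler(),      # rescale to [0, 1]
        "RobustScaler": RobustScaler(),      # median/IQR based, robust to outliers
    }

    scaler = scaler_options["StandardScaler"]
    # X_scaled = scaler.fit_transform(X)     # fit on the feature columns, then transform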

machine_learning_tools.preprocessing_ml.get_scaler(scaler='normal_dist')[source]

Purpose: To return the appropriate scaler option

machine_learning_tools.preprocessing_ml.non_negative_df(df)[source]
machine_learning_tools.preprocessing_ml.scale_df(df, scaler='StandardScaler', scaler_trained=None, target_name=None, verbose=False)[source]

Purpose: To apply a preprocessing scaler to all of the feature columns of a df

  1. Get the appropriate scaler

Ex:
    from machine_learning_tools import preprocessing_ml as preml
    preml.scale_df(df,
                   target_name=target_name,
                   scaler="RobustScaler",
                   verbose=False)

machine_learning_tools.seaborn_ml module

machine_learning_tools.seaborn_ml.corrplot(df, figsize=(10, 10), fmt='.2f', annot=True, **kwargs)[source]

Purpose: Computes and plots the correlation

machine_learning_tools.seaborn_ml.heatmap(df, cmap='Blues', annot=True, logscale=True, title=None, figsize=None, fontsize=16, axes_fontsize=30, ax=None, fmt=None, **kwargs)[source]

Purpose: Will make a heatmap

machine_learning_tools.seaborn_ml.hist(array, bins=50, figsize=(10, 10), **kwargs)[source]
machine_learning_tools.seaborn_ml.hist2D(x_df, y_df, n_bins=100, cbar=True, **kwargs)[source]
machine_learning_tools.seaborn_ml.lineplot(df, x, y, hue=None, **kwargs)[source]
machine_learning_tools.seaborn_ml.pairplot(df, **kwargs)[source]
machine_learning_tools.seaborn_ml.pairwise_hist2D(df, reject_outliers=True, percentile_upper=99.5, percentile_lower=None, features=None, verbose=True, return_data=False, bins='auto')[source]
machine_learning_tools.seaborn_ml.plot_with_param(plot_func, width_inches=10, height_inches=10, **kwargs)[source]
machine_learning_tools.seaborn_ml.save_plot_as_png(sns_plot, filename='seaborn_plot.png')[source]
machine_learning_tools.seaborn_ml.scatter_2D(x, y, x_label='feature_1', y_label='feature_2', title=None, alpha=0.5, **kwargs)[source]

machine_learning_tools.sklearn_models module

Purpose: Stores models implemented in sklearn, tested and wrapped in an easier API

Notes:
model.predict -> predicts results
model.coef_ -> coefficients
model.intercept_ -> intercept
model.score(X, y) -> gives the R^2 of the prediction
model.alpha_ -> the Lagrange multiplier after the fit

machine_learning_tools.sklearn_models.AdaBoostClassifier(base_estimator=None, n_estimators=50, learning_rate=1.0, random_state=None, **kwargs)[source]

Purpose: To perform boosting (sequential ensembles) where subsequent models are trained on weighted data, with the data points missed by the previous model weighted more highly

—- AdaBoost specific parameters —– base_estimator: Object, default=None

the estimator used for the ensemble, if None then it is Tree with max_depth = 1

n_estimators: int, default = 50

maximum number of estimators used before termination, but learning could be terminated earlier (just max possible)

learning_rate: float, default = 1.0

weight applied to each classifier at each iteration, so the lower the learning rate the more models you will need

random_state: int

controls the seed given to each base_estimator at each boosting iteration (i.e. the base estimator has to have a random_state argument)

attributes: base_estimator_:Object

base from which estimators were built

estimators_: list of objects

list of the fitted sub estimators

estimator_weights_: array of floats

the weights given for each estimator

estimator_errors_: array of floats

classification error for each estimator in the boosted ensemble

feature_importances_: array of floats:

impurity-based feature importances

n_features_in_:

number of features seen during fit

feature_names_in:

names of the features seen during the fit
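
A minimal usage sketch with the underlying sklearn estimator that this wrapper presumably configures (illustrative; the iris data is just a placeholder):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import AdaBoostClassifier

    X, y = load_iris(return_X_y=True)
    clf = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
    clf.fit(X, y)
    print(clf.estimator_weights_[:3])   # weight given to each estimator
    print(clf.feature_importances_)     # impurity-based feature importances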

machine_learning_tools.sklearn_models.AdaptiveLasso(X, y, CV=True, cv_n_splits=10, fit_intercept=False, alpha=None, coef=None, verbose=False, n_lasso_iterations=5)[source]

Example of adaptive Lasso to produce even sparser solutions

Adaptive lasso consists in computing many Lasso fits with feature reweighting. It's also known as iterated L1.

Help with the implementation:

https://gist.github.com/agramfort/1610922

--- Example 1: Using generated data ---

    from sklearn.datasets import make_regression
    X, y, coef = make_regression(n_samples=306, n_features=8000, n_informative=50,
                                 noise=0.1, shuffle=True, coef=True, random_state=42)

    X /= np.sum(X ** 2, axis=0)  # scale features
    alpha = 0.1

    model_al = sklm.AdaptiveLasso(
        X, y,
        alpha=alpha,
        coef=coef,
        verbose=True,
    )

--- Example 2: Using simpler data ---

    X, y = pdml.X_y(df_scaled, target_name)
    model_al = sklm.AdaptiveLasso(
        X, y,
        verbose=True,
    )

machine_learning_tools.sklearn_models.BaggingClassifier(base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, random_state=True, verbose=True, max_depth=None, **kwargs)[source]

Purpose: 1) Fits classifiers on random subsets of the original data (done using a bootstrapping method that samples with replacement) 2) Aggregates the individual predictions of the classifiers (voting or averaging)

Application: to reduce the variance of an estimator

BaggingClassifier parameters

— Bagging specific parameters ——- base_estimator: Object, default = DecisionTreeClassifier

the model used that will be trained many times

n_estimators: int ,

Number of estimators ensemble will use

max_samples: int/float, default = 1.0 - int: only draw that many samples - float: draw that proportion of the samples max_features: int/float, default = 1.0

number of features drawn when creating the training set for that iteration

bootstrap: bool, default=True

whether samples are drawn with replacement or not

bootstrap_features: bool, default = False

to draw features with replacement

oob_score: bool, default=False

whether to use out-of-bag samples to estimate the generalization error while training

random_state: int, default = None verbose: int, default = None

machine_learning_tools.sklearn_models.DecisionTreeClassifier(splitter='best', criterion='gini', max_depth=None, max_features=None, random_state=None, class_weight=None, **kwargs)[source]

DecisionTreeClassifier parameters:

------- DecisionTree specific parameters ----------
splitter: default="best", how the split will be chosen
- "best": best split
- "random": best random split

------- generic tree parameters ---------
criterion: how the best split is determined
- 'gini'
- 'entropy'
max_depth: int, max depth of the trees
max_features: number of features to look at when doing the best split
- 'auto': sqrt of n_features
- 'sqrt'
- 'log2'
- int: specifying the exact number of features
- float: specifying the percentage of total n_features
- None: always use the maximum number of features
class_weight: dict, list of dicts, or "balanced"; will add weighting to the class decision
- "balanced" will automatically balance based on the composition of y

machine_learning_tools.sklearn_models.ElasticNet(alpha=1, l1_ratio=0.5, fit_intercept=False, **kwargs)[source]

Purpose: Model that has a mix of L1 and L2 regularization and chooses the lambda (called alpha) based on cross validation later when it is fitted

machine_learning_tools.sklearn_models.ElasticNetCV(l1_ratio=0.5, fit_intercept=False, cv_n_splits=10, **kwargs)[source]

Purpose: Model that has a mix of L1 and L2 regularization and chooses the lambda (called alpha) based on cross validation later when it is fitted

machine_learning_tools.sklearn_models.GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, max_depth=3, random_state=None, max_features=None, verbose=0)[source]

GradientBoosting

Purpose: to optimize a certain loss function. The idea is to fit classifiers that go downhill in the gradient but do not fit all the way. This makes the learning slower and harder to overfit.

Procedure: For many stages 1) regression trees are fit on the negative gradient of the loss 2) the classifiers are only weighted by a learning rate so they do not learn too quickly 3) continue to the next stage and learn on the subsequent gradients

Application: Usually pretty good at resisting overfitting

— GradientBoosting specific parameters — loss: str, default = “deviance”

the loss function that should be optimized - "deviance": logistic regression loss (with probabilistic outputs) - "exponential": this is just the AdaBoost algorithm

learning_rate: float default = 0.1,

how much each of the estimators contributes

n_estimators: int, default = 100 subsample:float, default = 1.0

if less than 1, stochastic gradient boosting is performed, which does not look at all of the samples

------- generic tree parameters ---------
max_depth: int, max depth of the trees
max_features: number of features to look at when doing the best split
- 'auto': sqrt of n_features
- 'sqrt'
- 'log2'
- int: specifying the exact number of features
- float: specifying the percentage of total n_features
- None: always use the maximum number of features

What it returns: n_estimators_: int

number of estimators made

feature_importances_: array

impurity based feature importances

oob_improvement_: array

the improvement in loss on the out-of-bag samples relative to the previous iteration (ONLY AVAILABLE IF subsample < 1.0)

train_score_: array

the i-th train score is the deviance of the model at iteration i on the in-bag sample (if subsample == 1, this is the deviance on the training data)

estimators_: array of DecisionTreeRegressor

machine_learning_tools.sklearn_models.Lasso(alpha=1, fit_intercept=False, **kwargs)[source]
machine_learning_tools.sklearn_models.LassoCV(fit_intercept=False, cv_n_splits=10, **kwargs)[source]
machine_learning_tools.sklearn_models.LinearRegression(**kwargs)[source]
machine_learning_tools.sklearn_models.LogisticRegression(**kwargs)[source]

With this one you can set the coefficients of the linear classifier

machine_learning_tools.sklearn_models.RandomForestClassifier(n_estimators=30, criterion='gini', max_depth=None, max_features='auto', bootstrap=True, oob_score=True, random_state=None, verbose=False, max_samples=None, class_weight=None, **kwargs)[source]

Purpose: An ensemble where the number of features trained on is a subset of the overall features and the samples trained on are bootstrapped samples

---------RandomForest specific parameters---------:
n_estimators: number of trees to use in the forest
bootstrap: bool (True): whether bootstrap sampling is used to build the trees (if not, then the whole dataset is used)

oob_score: bool (False): whether to use out-of-bag samples to estimate the generalization score

max_samples: number of samples to draw if doing bootstrapping
verbose

------- generic tree parameters ---------
criterion: how the best split is determined
- 'gini'
- 'entropy'
max_depth: int, max depth of the trees
max_features: number of features to look at when doing the best split
- 'auto': sqrt of n_features
- 'sqrt'
- 'log2'
- int: specifying the exact number of features
- float: specifying the percentage of total n_features
- None: always use the maximum number of features
class_weight: dict, list of dicts, or "balanced"; will add weighting to the class decision
- "balanced" will automatically balance based on the composition of y

Example:
clf = sklm.RandomForestClassifier(max_depth=5)
clf.fit(X_train, y_train)
print(sklu.accuracy(clf, X_test, y_test), clf.oob_score_)
_ = sklm.feature_importances(clf, return_std=True, plot=True)

machine_learning_tools.sklearn_models.Ridge(alpha=1, fit_intercept=False, **kwargs)[source]
machine_learning_tools.sklearn_models.RidgeCV(fit_intercept=False, cv_n_splits=10, **kwargs)[source]
machine_learning_tools.sklearn_models.SVC(kernel='rbf', C=1, gamma='scale', **kwargs)[source]

SVM with the possibility of adding kernels
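A minimal sketch comparing kernels, assuming the wrapper mirrors sklearn.svm.SVC and that X_train, y_train, X_test, y_test already exist:

from machine_learning_tools import sklearn_models as sklm
from machine_learning_tools import sklearn_utils as sklu

for kernel in ("linear", "rbf"):
    clf = sklm.SVC(kernel=kernel, C=1, gamma="scale")  # gamma only matters for "rbf"
    clf.fit(X_train, y_train)
    print(kernel, sklu.accuracy(clf, X_test, y_test))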

machine_learning_tools.sklearn_models.classes(clf)[source]
machine_learning_tools.sklearn_models.clf_name(clf)[source]
machine_learning_tools.sklearn_models.coef_summary(feature_names, model=None, coef_=None, intercept=True)[source]
machine_learning_tools.sklearn_models.compute_class_weight(y)[source]
machine_learning_tools.sklearn_models.feature_importances(clf, verbose=True, plot=False, feature_names=None, return_std=False, method='impurity_decrease', X_permuation=None, y_permutation=None, n_repeats=10, random_state=None, **kwargs)[source]

Purpose: Will return the feature importance of a tree-based classifier

sklm.feature_importances(
    clf,
    #method=,
    verbose = True,
    plot = True,
    X_permuation = X_test,
    y_permutation = y_test,
    n_repeats = 1)

machine_learning_tools.sklearn_models.is_ensemble(clf)[source]
machine_learning_tools.sklearn_models.n_features_in_(clf)[source]
machine_learning_tools.sklearn_models.oob_score(clf)[source]

Purpose: Returns the out-of-bag error for ensembles that use the bootstrap method
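A short sketch (assumes X_train, y_train already exist and the RandomForestClassifier wrapper defaults shown above apply):

from machine_learning_tools import sklearn_models as sklm

clf = sklm.RandomForestClassifier(bootstrap=True, oob_score=True)  # oob only exists with bootstrapping
clf.fit(X_train, y_train)
print(sklm.oob_score(clf))  # generalization estimate from the held-out bootstrap samples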

machine_learning_tools.sklearn_models.plot_regularization_paths(model_func, df=None, target_name=None, X=None, y=None, n_alphas=200, alph_log_min=-1, alpha_log_max=5, reverse_axis=True, model_func_kwargs=None)[source]

Purpose: Will plot the regularization paths for a certain model

# Author: Fabian Pedregosa – <fabian.pedregosa@inria.fr> # License: BSD 3 clause

Example (adapted from the sklearn documentation):

# X is the 10x10 Hilbert matrix
X = 1. / (np.arange(1, 11) + np.arange(0, 10)[:, np.newaxis])
y = np.ones(10)

plot_regularization_paths(
    sklm.Ridge,
    X = X,
    y = y,
    alph_log_min = -10,
    alpha_log_max = -2,
)

machine_learning_tools.sklearn_models.ranked_features(model, feature_names=None, verbose=False)[source]

Purpose: to return the features (or feature indexes) that are most important, ranked by the absolute values of their coefficients
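The ranking idea in a few lines of numpy (a sketch, not the library implementation; assumes model is a fitted linear model and feature_names matches its columns):

import numpy as np

coef = np.ravel(model.coef_)
order = np.argsort(np.abs(coef))[::-1]                 # largest |coefficient| first
ranked = [(feature_names[i], coef[i]) for i in order]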

machine_learning_tools.sklearn_models.residuals(model, x, y, return_score=True)[source]
machine_learning_tools.sklearn_models.set_legend_outside_plot(ax, scale_down=0.8)[source]

Will adjust your axis so that the legend appears outside of the box

machine_learning_tools.sklearn_utils module

Important notes:

sklearn.utils.Bunch: just an extended dictionary that allows values to be referenced by key, bunch[“value_key”], or by attribute, bunch.value_key
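Quick illustration of the Bunch behavior described above:

from sklearn.utils import Bunch

b = Bunch(value_key=3)
assert b["value_key"] == b.value_key == 3   # key access and attribute access are equivalent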

Notes: R^2 number: lm2.score(X, y)

machine_learning_tools.sklearn_utils.CV_optimal_param_1D(parameter_options, clf_function, loss_function, n_splits=5, data_splits=None, X=None, y=None, test_size=0.2, val_size=0.2, clf_parameters={}, standard_error_buffer=True, plot_loss=False, return_data_splits=False, verbose=False)[source]

Purpose: To run cross validation by hand with a specific: dataset, model type, 1D parameter grid to search over, loss function to measure, and method of evaluating the best loss function

Pseudocode: 0) Define the parameter space to iterate over 1) Split the data into test, training, and validation 2) Combine the validation and training datasets in order to do cross validation 3) Compute the datasets for each cross validation fold

For every parameter option:
For every K fold dataset:

Train the model on the dataset, measure the MSE (or another loss) for that model, and store the loss

Find the average loss and the standard error on the loss

Pick the optimal parameter by one of the options: a) picking the parameter with the lowest average loss, or b) picking the parameter value giving the least complex model that is within one standard error of the parameter with the minimum average loss (see the sketch below)
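Sketch of the one-standard-error selection rule in option (b); the loss values below are placeholders (one row per parameter option, one column per fold), not output of the library:

import numpy as np

param_values = np.array([0.001, 0.01, 0.1, 1.0, 10.0])  # ordered least -> most complex
fold_losses = np.random.rand(len(param_values), 5)       # stand-in for the per-fold losses

mean_loss = fold_losses.mean(axis=1)
std_error = fold_losses.std(axis=1, ddof=1) / np.sqrt(fold_losses.shape[1])

best_idx = mean_loss.argmin()
threshold = mean_loss[best_idx] + std_error[best_idx]

# least complex parameter whose average loss is within one standard error of the minimum
chosen_idx = np.where(mean_loss <= threshold)[0].min()
print(param_values[chosen_idx])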

Example:
clf, data_splits = sklu.CV_optimal_param_1D(
    parameter_options = dict(C = np.array([10.**(k) for k in np.linspace(-4,3,25)])),
    X = X,
    y = y,

    #parameters for the type of classifier
    clf_function = linear_model.LogisticRegression,
    clf_parameters = dict(
        penalty = "l1",
        solver = "saga",
        max_iter = 10000,
    ),

    #arguments for loss function
    loss_function = sklu.logistic_log_loss,

    #arguments for the determination of the optimal parameter
    standard_error_buffer = True,
    plot_loss = True,

    #arguments for return
    return_data_splits = True,

    verbose = True,
)

machine_learning_tools.sklearn_utils.MSE(y_true, y_pred=None, model=None, X=None, clf=None)[source]

Purpose: Will calculate the MSE of a model

machine_learning_tools.sklearn_utils.accuracy(clf, X, y)[source]

Returns the accuracy of a classifier

machine_learning_tools.sklearn_utils.accuracy_score(y_true, y_pred, **kwargs)[source]
machine_learning_tools.sklearn_utils.dataset_df(dataset_name, verbose=False, target_name='target', dropna=True)[source]
machine_learning_tools.sklearn_utils.k_fold_df_split(X, y, target_name=None, n_splits=5, random_state=None)[source]

Purpose: To divide a test and training dataframe into multiple test/train dataframes to use for k-fold cross validation

Ex:
n_splits = 5
fold_dfs = sklu.k_fold_df_split(
    X_train_val,
    y_train_val,
    n_splits = n_splits)

fold_dfs[1]["X_train"]

machine_learning_tools.sklearn_utils.load_boston()[source]

MEDV: the median value of home prices

machine_learning_tools.sklearn_utils.logistic_log_loss(clf, X, y_true)[source]

Computes the log loss (aka logistic loss or cross-entropy loss) of a model
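Presumably equivalent to applying sklearn's log_loss to the classifier's predicted probabilities, roughly (a sketch, assuming clf, X, y_true already exist):

from sklearn.metrics import log_loss

loss = log_loss(y_true, clf.predict_proba(X))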

machine_learning_tools.sklearn_utils.optimal_parameter_from_kfold_df(df, parameter_name='k', columns_prefix='mse_fold', higher_param_higher_complexity=True, d=True, verbose=False, return_df=False, standard_error_buffer=False, plot_loss=True, **kwargs)[source]

Purpose: Will find the optimal parameter based on a dataframe of the mse scores for different parameters

Ex:
opt_k, ret_df = sklu.optimal_parameter_from_kfold_df(
    best_subset_df,
    parameter_name = "k",
    columns_prefix = "mse_fold",
    higher_param_higher_complexity = True,
    standard_error_buffer = True,
    verbose = True,
    return_df = True,
)

ret_df

machine_learning_tools.sklearn_utils.random_regression_with_informative_features(n_samples=306, n_features=8000, n_informative=50, random_state=42, noise=0.1, return_true_coef=True)[source]

Purpose: will create a random regression with a certain number of informative features
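For reference, a dataset like this can be generated directly with sklearn's make_regression (a sketch of what the helper presumably wraps):

from sklearn.datasets import make_regression

X, y, true_coef = make_regression(
    n_samples=306,
    n_features=8000,
    n_informative=50,   # only these features actually drive y
    noise=0.1,
    random_state=42,
    coef=True,          # also return the true coefficients
)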

machine_learning_tools.sklearn_utils.train_val_test_split(X, y, test_size=0.2, val_size=None, verbose=False, random_state=None, return_dict=False)[source]

Purpose: To split the data into 1) train 2) validation (if requested) 3) test

Note: All percentages are specified as numbers between 0 and 1.

Process: 1) Split the data into test and train percentages 2) If validation is requested, split the train set into train and val using the adjusted percentage

val_perc_adjusted = val_perc / (1 - test_perc)

3) Return the different splits (see the sketch after the example below)

Example:
(X_train,
 X_val,
 X_test,
 y_train,
 y_val,
 y_test) = sklu.train_val_test_split(
    X,
    y,
    test_size = 0.2,
    val_size = 0.2,
    verbose = True)
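The two-stage split and the adjusted validation percentage can be sketched directly with sklearn's train_test_split (assumes X, y already exist; with test_size = val_size = 0.2 the second split uses 0.2 / (1 - 0.2) = 0.25):

from sklearn.model_selection import train_test_split

test_perc, val_perc = 0.2, 0.2
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=test_perc)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=val_perc / (1 - test_perc))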

machine_learning_tools.statsmodels_utils module

Purpose: to export some functionality of the statsmodels library for things like:

- linear regression

Application: Behaves a lot like the R functions

Notes: lm1.rsquared gives the r squared value

machine_learning_tools.statsmodels_utils.coef(model)[source]
machine_learning_tools.statsmodels_utils.coef_pvalues_df(model)[source]
machine_learning_tools.statsmodels_utils.linear_regression(df, target_name, add_intercept=True, print_summary=True)[source]

Purpose: To run a linear regression and then print out the summary of the coefficients

Ex:
from machine_learning_tools import statsmodels_utils as smu
smu.linear_regression(
    df_raw[[target_name, "LSTAT"]],
    target_name = target_name,
    add_intercept = True,
)

machine_learning_tools.statsmodels_utils.pvalues(model)[source]
machine_learning_tools.statsmodels_utils.ranked_features(model, pval_max=0.001, verbose=False, return_filtered_features=False)[source]

Purpose: Will get the ranked features by their coefficients and filter away those with a high p value

machine_learning_tools.visualizations_ml module

machine_learning_tools.visualizations_ml.color_list_for_y(y, color_list=None)[source]
machine_learning_tools.visualizations_ml.contour_map_for_2D_classifier(clf, axes_min_default=-10, axes_max_default=10, axes_step_default=0.01, axes_min_max_step_dict=None, axes_limit_buffers=0, figure_width=10, figure_height=10, color_type='classes', color_prob_axis=0, contour_color_map='RdBu', map_fill_colors=None, training_df=None, training_df_class_name='class', training_df_feature_names=None, feature_names=('feature_1', 'feature_2'), X=None, y=None, scatter_colors=['darkorange', 'c'])[source]

Purpose: To plot the decision boundary of a classifier that depends on only 2 features

Tried extending this to classifiers of more than 2 features, but ran into confusion on how to collapse across the other dimensions of the feature space

Ex:
%matplotlib inline
vml.contour_map_for_2D_classifier(ctu.e_i_model)

#plotting the probability
%matplotlib inline
vml.contour_map_for_2D_classifier(
    ctu.e_i_model,
    color_type = "probability")

machine_learning_tools.visualizations_ml.meshgrid_for_plot(axes_min_default=-10, axes_max_default=10, axes_step_default=1, axes_min_max_step_dict=None, n_axis=None, return_combined_coordinates=True, clf=None)[source]

Purpose: To generate a meshgrid for plotting that is configured as a mixture of custom and default values

axes_min_max_step_dict must be a dictionary mapping the class label or class index to a min/max/step specification (see the example below)

Ex:
vml.meshgrid_for_plot(
    axes_min_default = -20,
    axes_max_default = 10,
    axes_step_default = 1,
    #axes_min_max_step_dict = {1:[-2,2,0.5]},
    axes_min_max_step_dict = {1:dict(axis_min = -2, axis_max = 3, axis_step = 1)},
    n_axis = 2,
    clf = None,
)

machine_learning_tools.visualizations_ml.plot_binary_classifier_map(clf, X=None, xmin=None, xmax=None, ymin=None, ymax=None, buffer=0.01, class_idx_to_plot=0, figure_width=10, figure_height=10, axes_fontsize=25, class_0_color=None, class_1_color=None, mid_color='white', alpha=0.5, plot_colorbar=True, colorbar_label=None, colorbar_labelpad=30, colorbar_label_fontsize=20, colorbar_tick_fontsize=25, ax=None, **kwargs)[source]

Purpose: To plot the prediction map of a binary classifier

Arguments: 1) Model 2) Definition of the input space to test over (xmin, xmax)

Pseudocode: 1) Create a meshgrid of the input space 2) Send the meshgrid points through the classifier and plot the predictions

Example:
from machine_learning_tools import visualizations_ml as vml

figure_width = 10
figure_height = 10
fig, ax = plt.subplots(1, 1, figsize=(figure_width, figure_height))

X = df_plot[trans_cols].to_numpy().astype("float")

vml.plot_binary_classifier_map(
    clf = ctu.e_i_model,
    X = X,
    xmin = 0,
    xmax = 4.5,
    ymin = -0.1,
    ymax = 1.2,
    alpha = 0.5,
    colorbar_label = "Excitatory Probability",
    ax = ax,
)

machine_learning_tools.visualizations_ml.plot_decision_tree(clf, feature_names, class_names=None, max_depth=None)[source]

Purpose: Will show the structure of a fitted decision tree
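Presumably a wrapper around sklearn.tree.plot_tree; a hedged sketch of the underlying call (clf is assumed to be a fitted tree classifier, feature_names a list of column names, and the class names are hypothetical):

import matplotlib.pyplot as plt
from sklearn import tree

fig, ax = plt.subplots(figsize=(12, 8))
tree.plot_tree(
    clf,
    feature_names=feature_names,
    class_names=["class_0", "class_1"],  # hypothetical labels
    max_depth=3,                         # only draw the top levels
    filled=True,
    ax=ax,
)
plt.show()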

machine_learning_tools.visualizations_ml.plot_df_scatter_2d_classification(df=None, target_name=None, feature_names=None, figure_width=10, figure_height=10, alpha=0.5, axis_append='', verbose=False, X=None, y=None, title=None)[source]

Purpose: To plot features in 2D


Ex:
ax = vml.plot_df_scatter_2d_classification(
    X = X_trans[y != "Unknown"],
    y = y[y != "Unknown"],
)

from datasci_tools import matplotlib_utils as mu
mu.set_legend_outside_plot(ax)

machine_learning_tools.visualizations_ml.plot_df_scatter_3d_classification(df=None, target_name=None, feature_names=None, figure_width=10, figure_height=10, alpha=0.5, axis_append='', verbose=False, X=None, y=None, title=None)[source]

Purpose: To plot features in 3D

Ex:
%matplotlib notebook
sys.path.append("/machine_learning_tools/machine_learning_tools/")
from machine_learning_tools import visualizations_ml as vml
vml.plot_df_scatter_3d_classification(
    df,
    target_name = "group_label",
    feature_names = [
        #"ipr_eig_xz_to_width_50",
        "center_to_width_50",
        "n_limbs",
        "ipr_eig_xz_max_95",
    ])

machine_learning_tools.visualizations_ml.plot_df_scatter_classification(df=None, target_name=None, feature_names=None, ndim=3, figure_width=14, figure_height=14, alpha=0.5, axis_append='', verbose=False, X=None, y=None, title=None, target_to_color=None, default_color='yellow', plot_legend=True, scale_down_legend=0.75, bbox_to_anchor=(1.02, 0.5), ax=None, text_to_plot_dict=None, use_labels_as_text_to_plot=False, text_to_plot_individual=None, replace_None_with_str_None=False, **kwargs)[source]

Purpose: To plot features in 2D or 3D (controlled by ndim)


machine_learning_tools.visualizations_ml.plot_dim_red_analysis(X, method, y=None, n_components=[2, 3], alpha=0.5, color_mapppings=None, plot_kwargs=None, verbose=False, **kwargs)[source]
machine_learning_tools.visualizations_ml.plot_feature_importance(clf, feature_names=None, sort_features=True, n_features_to_plot=20, title='Feature Importance', importances=None, std=None, **kwargs)[source]

Purpose: Will plot the feature importance of a classifier

machine_learning_tools.visualizations_ml.plot_svm_kernels(clf, X, y, X_test=None, title=None)[source]

Module contents