machine_learning_tools package
Submodules
machine_learning_tools.clustering_ml module
- machine_learning_tools.clustering_ml.adjusted_rand_score(labels_true, labels_pred, verbose=False)[source]
- machine_learning_tools.clustering_ml.calculate_k_means_loss(data, labels, cluster_centers)[source]
Purpose: Calculate the k-means loss, which depends on 1) the cluster centers and 2) the current labels of the data
Pseudocode: For each datapoint:
Calculate the squared euclidean distance between datapoint and center of cluster it is assigned to
Sum up all of the squared distances
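A minimal numpy sketch of the loss described above (an illustration under assumed array shapes, not the package's exact implementation):

import numpy as np

def k_means_loss(data, labels, cluster_centers):
    # data: (n_samples, n_features); labels: (n_samples,) cluster indices
    # cluster_centers: (n_clusters, n_features)
    diffs = data - cluster_centers[labels]        # vector from each point to its assigned center
    return np.sum(np.sum(diffs ** 2, axis=1))     # total squared euclidean distance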
- machine_learning_tools.clustering_ml.category_classifications(model, labeled_data, return_dataframe=True, verbose=False, classification_types=['hard', 'soft'])[source]
- machine_learning_tools.clustering_ml.closest_k_nodes_on_dendrogram(node, k, G=None, model=None, verbose=False)[source]
Purpose: Find the first k nodes that are closest to a given node by traversing the dendrogram
- machine_learning_tools.clustering_ml.closet_k_neighbors_from_hierarchical_clustering(X, node_name, row_names, k, n_components=3, verbose=False)[source]
- machine_learning_tools.clustering_ml.cluster_stats_dataframe(labeled_data_classification)[source]
Purpose: Visualize the soft and the hard assignments (and show they are not that different)
- machine_learning_tools.clustering_ml.clustering_stats(data, clust_perc=0.8)[source]
Will compute different statistics about the clusters formed, to be shown or plotted later
Metrics: For each category and classification type: 1) highest_cluster identity 2) highest_cluster_percentage 3) number of clusters needed to encompass clust_perc % of the category 4) purity statistic
- machine_learning_tools.clustering_ml.compute_average_log_likelihood_per_K(peaks, N=5000, n_iterations=10, K_list=array([8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]), return_train=True, covariance_type='full')[source]
- machine_learning_tools.clustering_ml.dendrogram_HC(model, p=10000, no_plot=True, **kwargs)[source]
Purpose: to create a dendrogram and plot it for a hierarchical clustering model
Ex:
p = 1000  # plot the top three levels of the dendrogram
curr_dendrogram = clu.dendrogram_HC(model, no_plot=False)
- machine_learning_tools.clustering_ml.dendrogram_graph_from_model(model)[source]
Purpose: will return the dendrogram as a graph object so you can navigate it
- machine_learning_tools.clustering_ml.dendrogram_leaves_ordered(model, **kwargs)[source]
Gets the order of the leaves in the dendrogram
Application: For bi-clustering
- machine_learning_tools.clustering_ml.gmm_analysis(X_train, possible_K=[2, 3, 4, 5, 6, 7], reg_covar=1e-05, init_params='kmeans', covariance_type='full', pca_obj=None, scaler_obj=None, column_titles=None, model_type='mixture', verbose=True)[source]
Purpose: Will perform gmm analysis for a specified different number of clusters and save the models and relevant data for further analysis
- machine_learning_tools.clustering_ml.gmm_classification(gmm_model, curr_data, classification='hard', verbose=True, return_counts=True)[source]
Purpose: Will use the Gaussian mixture model passed in to classify which cluster each data point belongs to
- machine_learning_tools.clustering_ml.gmm_hard_classify(model, df, classes_as_str=False, verbose=False)[source]
- machine_learning_tools.clustering_ml.gmm_pipeline(df, title_suffix=None, labeled_data_indices=None, category_column=None, columns_picked=None, possible_K=[2, 3, 4, 5, 6, 7], print_tables=None, apply_normalization=True, apply_pca=True, pca_whiten=True, plot_sqrt_eigvals=True, n_components_pca=None, classification_types=['hard'], model_type='mixture', verbose=True)[source]
Will carry out all of the clustering analysis and advanced stats analysis on a given dataset
Arguments: A data table with all of the labeled data
- machine_learning_tools.clustering_ml.k_mean_clustering(data, n_clusters=3, max_iterations=1000, return_snapshots=True, verbose=True)[source]
Purpose: Will take in the input data and the number of expected clusters and run the K-means algorithm to cluster the data into k clusters
Arguments:
- data (np.array): the data points to be clustered
- n_clusters: number of expected clusters
- max_iterations: upper bound on the number of iterations
- return_snapshots: return the label assignments, cluster centers and loss value for every iteration
Returns:
- final_data_labels: what cluster each datapoint was assigned to at the end
- final_cluster_centers: cluster centers on the last iteration
- final_loss: steady-state value of the loss function at the end
- snapshots of labels, centers and loss if requested
Pseudocode:
1) Randomly assign labels to the data
2) Calculate the cluster centers from the random labels
3) Calculate the loss value
4) Begin iteration loop:
   - Reassign data labels to the closest cluster center
   - Recalculate the cluster centers
   - Calculate the k-means loss
   - If the k-means loss did not change from the previous value OR max_iterations is reached, break out of the loop
5) Return final values (and snapshots if requested); a sketch of this loop follows below
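A compact numpy sketch of that loop (illustrative only; argument names follow the docstring, and a production version would guard against empty clusters):

import numpy as np

def k_means(data, n_clusters=3, max_iterations=1000, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_clusters, size=len(data))        # 1) random initial labels
    prev_loss = np.inf
    for _ in range(max_iterations):
        centers = np.array([data[labels == k].mean(axis=0)      # 2) centers from current labels
                            for k in range(n_clusters)])
        dists = np.linalg.norm(data[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)                            # reassign to the closest center
        loss = np.sum(dists[np.arange(len(data)), labels] ** 2)  # 3) k-means loss
        if loss == prev_loss:                                    # 4) stop when the loss stops changing
            break
        prev_loss = loss
    return labels, centers, loss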
- machine_learning_tools.clustering_ml.normalized_mutual_info_score(labels_true, labels_pred, verbose=False)[source]
- machine_learning_tools.clustering_ml.plot_4D_GMM_clusters(X_train, X_test=None, K=10, covariance_type='full')[source]
To graph the 4D clustering of the peaks of the AP
- machine_learning_tools.clustering_ml.plot_BIC_and_Likelihood(gmm_data, fig_width=12, fig_height=5, title_suffix=None)[source]
Purpose: Plot the BIC and likelihood curves from the GMM analysis
- machine_learning_tools.clustering_ml.plot_advanced_stats_per_k(advanced_stats_per_k, stats_to_plot=['highest_cluster_perc', 'purity'], title_suffix='', fig_width=12, fig_height=5)[source]
Purpose: plotting the highest cluster and purity as a function of k
Pseudocode:
0) Get all the possible categories and n_clusters values, then sort by n_clusters
1) Iterate through all the stats we want to plot
   - Iterate through all of the categories
     - For all n_clusters: a. restrict by category and n_clusters and pull the statistic b. add to a list c. save the full list in a dictionary
2) Plot the stat using the category dictionary (using the ax index id)
- machine_learning_tools.clustering_ml.plot_loss_function_history(loss_history, title='K-Means Loss vs. Iterations', n_clusters=None)[source]
machine_learning_tools.data_annotation_utils module
Pigeon is a fast way to annotate data: https://github.com/agermanidis/pigeon
machine_learning_tools.data_input_utils module
machine_learning_tools.dimensionality_reduction_ml module
- machine_learning_tools.dimensionality_reduction_ml.add_dimensionality_reduction_embeddings_to_df(df, method='UMAP', feature_columns=None, return_transform_columns=False, n_components=3, **kwargs)[source]
- machine_learning_tools.dimensionality_reduction_ml.dimensionality_reduction_by_method(X, method='umap', n_components=3, plot=False, plot_kwargs=None, y=None, verbose=False, **kwargs)[source]
Purpose: To apply a dimensionality reduction technique to a dataset (and optionally plot)
Ex:
dimensionality_reduction_by_method(
    method="tsne",
    X=X_pca[y != "Unsure"],
    n_components=2,
    y=y[y != "Unsure"],
    plot=True,
    plot_kwargs=dict(
        target_to_color=ctu.cell_type_fine_color_map,
        ndim=3,
    ),
)
- machine_learning_tools.dimensionality_reduction_ml.dimensionality_reduction_by_umap(x, random_state=42, n_neighbors=15, min_dist=0.1)[source]
- machine_learning_tools.dimensionality_reduction_ml.eigen_decomp(data=None, cov_matrix=None)[source]
- machine_learning_tools.dimensionality_reduction_ml.explained_variance(data=None, return_cumulative=True, eigenValues_sorted=None)[source]
- machine_learning_tools.dimensionality_reduction_ml.fraction_of_variance_after_proj_back_proj()[source]
- machine_learning_tools.dimensionality_reduction_ml.kth_eigenvector_proj(data, k, whiten=False, plot_perc_variance_explained=False, **kwargs)[source]
Purpose: To get the data projected onto the eigenvector with the kth largest eigenvalue
- machine_learning_tools.dimensionality_reduction_ml.largest_eigenvector_proj(data, whiten=False, plot_perc_variance_explained=False, **kwargs)[source]
- machine_learning_tools.dimensionality_reduction_ml.pca_analysis(data, n_components=None, whiten=False, method='sklearn', plot_sqrt_eigvals=False, plot_perc_variance_explained=False, verbose=False, **kwargs)[source]
Arguments:
- data: the data, where each data point is stored as a column vector
- method: whether PCA is performed with sklearn or a manual implementation
- n_components: the number of principal components to analyze
- plot_sqrt_eigvals: whether to plot the square root of the eigenvalues at the end
Purpose: Will compute the following parts of the PCA analysis:
mean, covariance_matrix, eigenvalues (the variance explained), eigenvectors (the principal components, as row vectors), percent_variance_explained, percent_variance_explained_up_to_n_comp, data_proj (the data projected onto the top n_components PCs), data_backproj (the data projected into PC space and reprojected back into the original R^N space, possibly with reduced components; useful for reconstruction)
Ex:
# practice on iris data
from sklearn import datasets
iris = datasets.load_iris()
test_data = iris.data

pca_dict_sklearn = pca_analysis(data=test_data,
                                n_components=3,
                                method="sklearn")

pca_dict_manual = pca_analysis(data=test_data,
                               n_components=3,
                               method="manual")
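For reference, a minimal sketch of the eigendecomposition route the manual method describes (assuming rows are samples; not the package's exact code):

import numpy as np

def pca_manual(data, n_components=3):
    # center the data and compute the covariance matrix
    mean = data.mean(axis=0)
    centered = data - mean
    cov = np.cov(centered, rowvar=False)
    # eigendecomposition; sort eigenpairs by decreasing eigenvalue
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    order = np.argsort(eig_vals)[::-1]
    eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
    # project onto the top n_components PCs and back-project for reconstruction
    W = eig_vecs[:, :n_components]
    data_proj = centered @ W
    data_backproj = data_proj @ W.T + mean
    return dict(mean=mean,
                eigenvalues=eig_vals,
                eigenvectors=eig_vecs.T,                      # PCs as row vectors
                percent_variance_explained=eig_vals / eig_vals.sum(),
                data_proj=data_proj,
                data_backproj=data_backproj)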
- machine_learning_tools.dimensionality_reduction_ml.plot_projected_data(data_proj, labels=None, axis_prefix='', text_to_plot_dict=None, text_to_plot_individual=None, use_labels_as_text_to_plot=False, cmap='viridis', figsize=(10, 10), scatter_alpha=0.5, title='')[source]
To plot the PC projection in 3D
- machine_learning_tools.dimensionality_reduction_ml.plot_projected_data_2D(data_proj, labels=None, axis_prefix='Proj', text_to_plot_dict=None, cmap=<matplotlib.colors.LinearSegmentedColormap object>, figsize=(10, 10), scatter_alpha=0.5, title='')[source]
To plot the PC projection in 2D
- machine_learning_tools.dimensionality_reduction_ml.plot_sq_root_eigvals(eigVals, title=None, title_prefix=None)[source]
Create a square root eigenvalue plot from pca analysis
- machine_learning_tools.dimensionality_reduction_ml.plot_top_2_PC_and_mean_waveform(data, spikewaves_pca=None, title='Waveforms for Mean waveform and top 2 PC', title_prefix=None, scale_by_sq_root_eigVal=True, return_spikewaves_pca=False, mean_scale=1)[source]
- machine_learning_tools.dimensionality_reduction_ml.plot_um(UM, height=8, width=4, title='Imshow of the UM matrix')[source]
- machine_learning_tools.dimensionality_reduction_ml.plot_variance_explained(data_var=None, pca_model=None, title=None, title_prefix=None)[source]
Create a variance-explained plot from the PCA analysis
machine_learning_tools.dimensionality_reduction_utils module
- machine_learning_tools.dimensionality_reduction_utils.eigen_decomp(data=None, cov_matrix=None)[source]
- machine_learning_tools.dimensionality_reduction_utils.explained_variance(data=None, return_cumulative=True, eigenValues_sorted=None)[source]
- machine_learning_tools.dimensionality_reduction_utils.fraction_of_variance_after_proj_back_proj()[source]
- machine_learning_tools.dimensionality_reduction_utils.kth_eigenvector_proj(data, k, whiten=False, plot_perc_variance_explained=False, **kwargs)[source]
Purpose: To get the data projected onto the eigenvector with the kth largest eigenvalue
- machine_learning_tools.dimensionality_reduction_utils.largest_eigenvector_proj(data, whiten=False, plot_perc_variance_explained=False, **kwargs)[source]
- machine_learning_tools.dimensionality_reduction_utils.pca_analysis(data, n_components=None, whiten=False, method='sklearn', plot_sqrt_eigvals=False, plot_perc_variance_explained=False, verbose=False, **kwargs)[source]
Arguments:
- data: the data, where each data point is stored as a column vector
- method: whether PCA is performed with sklearn or a manual implementation
- n_components: the number of principal components to analyze
- plot_sqrt_eigvals: whether to plot the square root of the eigenvalues at the end
Purpose: Will compute the following parts of the PCA analysis:
mean, covariance_matrix, eigenvalues (the variance explained), eigenvectors (the principal components, as row vectors), percent_variance_explained, percent_variance_explained_up_to_n_comp, data_proj (the data projected onto the top n_components PCs), data_backproj (the data projected into PC space and reprojected back into the original R^N space, possibly with reduced components; useful for reconstruction)
Ex:
# practice on iris data
from sklearn import datasets
iris = datasets.load_iris()
test_data = iris.data

pca_dict_sklearn = pca_analysis(data=test_data,
                                n_components=3,
                                method="sklearn")

pca_dict_manual = pca_analysis(data=test_data,
                               n_components=3,
                               method="manual")
- machine_learning_tools.dimensionality_reduction_utils.plot_projected_data(data_proj, labels, axis_prefix='Proj', text_to_plot_dict=None)[source]
To plot the PC projection in 3D
- machine_learning_tools.dimensionality_reduction_utils.plot_sq_root_eigvals(eigVals, title=None, title_prefix=None)[source]
Create a square root eigenvalue plot from pca analysis
- machine_learning_tools.dimensionality_reduction_utils.plot_top_2_PC_and_mean_waveform(data, spikewaves_pca=None, title='Waveforms for Mean waveform and top 2 PC', title_prefix=None, scale_by_sq_root_eigVal=True, return_spikewaves_pca=False, mean_scale=1)[source]
- machine_learning_tools.dimensionality_reduction_utils.plot_um(UM, height=8, width=4, title='Imshow of the UM matrix')[source]
- machine_learning_tools.dimensionality_reduction_utils.plot_variance_explained(data_var, title=None, title_prefix=None)[source]
Create a variance-explained plot from the PCA analysis
machine_learning_tools.evaluation_metrics_utils module
- machine_learning_tools.evaluation_metrics_utils.confusion_matrix(y_true, y_pred, labels=None, normalize=None, return_df=False, df=None)[source]
- machine_learning_tools.evaluation_metrics_utils.normalize_confusion_matrix(cf_matrix, axis=1)[source]
- machine_learning_tools.evaluation_metrics_utils.plot_confusion_matrix(cf_matrix, annot=True, annot_fontsize=30, cell_fmt='.2f', cmap='Blues', vmin=0, vmax=1, axes_font_size=20, xlabel_rotation=15, ylabel_rotation=0, xlabels=None, ylabels=None, plot_colorbar=True, colobar_tick_fontsize=25, ax=None)[source]
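For orientation, row-normalizing a confusion matrix (what normalize_confusion_matrix with axis=1 presumably does) looks like this with sklearn's standard confusion_matrix; an illustrative sketch, not this module's code:

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = ["exc", "inh", "exc", "exc", "inh"]
y_pred = ["exc", "exc", "exc", "inh", "inh"]

cf_matrix = confusion_matrix(y_true, y_pred, labels=["exc", "inh"])
# divide each row by its sum so entries become per-true-class rates
cf_norm = cf_matrix / cf_matrix.sum(axis=1, keepdims=True)
print(cf_norm)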
machine_learning_tools.feature_selection_utils module
Functions to help with feature evaluation
- machine_learning_tools.feature_selection_utils.best_subset_k(df, k, model, target_name=None, y=None, verbose=False, evaluation_method='R_squared', return_model=False)[source]
Purpose: To pick the best subset of features for a certain number of features allowed
The evaluation of the best subset is chosen by the evaluation method: R^2 or MSE
Pseudocode: 0) divide df into X, y 1) get all choose-k subsets of the features 2) for each combination of features, find the evaluation score (see the sketch after the example)
Ex:
sklm.best_subset_k(
    df,
    k=2,
    target_name=target_name,
    model=sklm.LinearRegression(),
    evaluation_method="MSE",
    verbose=True,
)
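An illustrative sketch of the exhaustive search described in the pseudocode (assumed dataframe layout; not the package's implementation):

from itertools import combinations
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def best_subset_k(df, k, target_name, model=None):
    model = model or LinearRegression()
    y = df[target_name]
    features = [c for c in df.columns if c != target_name]
    best_subset, best_score = None, float("inf")
    for subset in combinations(features, k):          # all choose-k subsets
        X = df[list(subset)]
        model.fit(X, y)
        score = mean_squared_error(y, model.predict(X))
        if score < best_score:                        # lower MSE is better
            best_subset, best_score = subset, score
    return best_subset, best_score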
- machine_learning_tools.feature_selection_utils.best_subset_k_individual_sklearn(df, target_name, k, model_type='regression', verbose=False, return_data=False)[source]
Purpose: To run the sklearn best subsets k using built in sklearn method (NOTE THIS SELECTS THE BEST FEATURES INDIVIDUALLY)
Useful Link: https://www.datatechnotes.com/2021/02/seleckbest-feature-selection-example-in-python.html
Example:

Note: This method uses the following evaluation criterion for the best feature:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html
The correlation between each regressor and the target is computed, that is,
((X[:, i] - mean(X[:, i])) * (y - mean_y)) / (std(X[:, i]) * std(y)).
It is converted to an F score and then to a p-value.

best_features_over_sk = []
for k in tqdm(range(1, n_features + 1)):
    eval_method = "sklearn"
    curr_best = fsu.best_subset_k_individual_sklearn(
        df,
        k=k,
        target_name=target_name,
        verbose=False,
    )
    best_features_over_sk.append(dict(k=k,
                                      evaluation_method=eval_method,
                                      best_subset=curr_best))

import pandas as pd
print(f"Using sklearn method")
pd.DataFrame.from_records(best_features_over_sk)
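For comparison, the underlying sklearn call is roughly the documented SelectKBest + f_regression pattern (a hedged sketch, not this module's code):

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

X, y = load_diabetes(return_X_y=True)
selector = SelectKBest(score_func=f_regression, k=2)   # keep the 2 individually best features
X_new = selector.fit_transform(X, y)
print(selector.get_support(indices=True))              # indices of the selected columns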
- machine_learning_tools.feature_selection_utils.reverse_feature_elimination(df, k, model, target_name=None, y=None, verbose=False, return_model=False)[source]
Use the sklearn function for recursively eliminating the least important features
How does it pick the best features? By the absolute value of model.coef_ (not considering the p-value)
machine_learning_tools.hyperparameters_ml module
- machine_learning_tools.hyperparameters_ml.best_hyperparams_RandomizedSearchCV(clf, parameter_dict, X, y, n_iter_search=2, return_clf=False, return_cv_results=True, verbose=True, n_cv_folds=5, n_jobs=1)[source]
Purpose: To find the best parameters from a random search of a parameter grid defined by a dict
machine_learning_tools.machine_learning_utils module
- machine_learning_tools.machine_learning_utils.decision_tree_analysis(df, target_column, max_depth=None, max_features=None)[source]
Purpose: To perform decision tree analysis and plot it
- machine_learning_tools.machine_learning_utils.decision_tree_sklearn(df, target_column, feature_columns=None, perform_testing=False, test_size=0, criterion='entropy', splitter='best', max_depth=None, max_features=None, min_samples_split=0.1, min_samples_leaf=0.02)[source]
Purpose: To train a decision tree based on a dataframe with the features and the classifications
Parameters: max_depth = if None then the depth is chosen so all leaves contain fewer than min_samples_split samples
The higher the depth, the more overfitting
- machine_learning_tools.machine_learning_utils.encode_labels_as_ints(labels)[source]
Purpose: Will convert a list of labels into an array encoded 0,1,2…. and return the unique labels used
Reference: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
Ex:
import machine_learning_utils as mlu
mlu.encode_labels_as_ints(["hi", "hello", "yo", "yo"])
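The referenced sklearn LabelEncoder does the same job; a small sketch with the standard API:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(["hi", "hello", "yo", "yo"])
print(encoded)      # [1 0 2 2] -- integer codes, assigned in sorted label order
print(le.classes_)  # ['hello' 'hi' 'yo'] -- the unique labels used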
- machine_learning_tools.machine_learning_utils.kNN_classifier(X, y, n_neighbors, labels_list=None, weights='distance', verbose=False, plot_map=False, add_labels=True, **kwargs)[source]
Purpose: Will create a kNN classifier and optionally plot the decision map
Ex: Running it on the iris data
from sklearn import neighbors, datasets
import machine_learning_utils as mlu

n_neighbors = 15

# import some data to play with
iris = datasets.load_iris()

# we only take the first two features. We could avoid this ugly
# slicing by using a two-dim dataset
X = iris.data[:, :2]
y = iris.target

mlu.kNN_classifier(X, y,
    n_neighbors=n_neighbors,
    labels_list=iris.target_names,
    plot_map=True,
    feature_1_name=iris.feature_names[0],
    feature_2_name=iris.feature_names[1],
)
- machine_learning_tools.machine_learning_utils.plot_classifier_map(clf, data_to_plot=None, data_to_plot_color='red', data_to_plot_size=50, figsize=(8, 6), map_fill_colors=None, scatter_colors=None, h=0.02, feature_1_idx=0, feature_2_idx=1, feature_1_name='feature_1_name', feature_2_name='feature_2_name', x_min=None, x_max=None, y_min=None, y_max=None, verbose=False, plot_training_points=False, **kwargs)[source]
Purpose: To plot the decision map of a classifier over two features (optionally with the training points overlaid)
- machine_learning_tools.machine_learning_utils.plot_decision_tree(clf, feature_names, class_names=None)[source]
Purpose: Will show the fitted decision tree
machine_learning_tools.matplotlib_ml module
Notes on other functions: eventplot will plot 1D data as lines and can stack multiple 1D event series; with many of these it gives the characteristic raster of neuron spikes, all stacked on top of each other
matplotlib colors can be described with "C{number}" (e.g. "C102"); there are only 10 colors in the default cycle but the number can go as high as you want, it just repeats after 10 (e.g. C100 is the same color as C110)
How to set the figure size: fig.set_size_inches(18.5, 10.5)
To keep the subplots from running into each other: fig.tight_layout()
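A tiny sketch putting those notes together (standard matplotlib calls):

import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 1)
fig.set_size_inches(18.5, 10.5)               # set the figure size
axes[0].eventplot(np.random.rand(3, 20))      # stack three 1D event series as a raster
axes[1].plot(np.arange(10), color="C2")       # "C{number}" picks from the default color cycle
fig.tight_layout()                            # keep subplots from overlapping
plt.show()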
- machine_learning_tools.matplotlib_ml.add_random_color_for_missing_labels_in_dict(labels, label_color_dict, verbose=False)[source]
Purpose: Will generate random colors for labels that are missing in the labels dict
- machine_learning_tools.matplotlib_ml.apply_alpha_to_color_list(color_list, alpha=0.2, print_flag=False)[source]
- machine_learning_tools.matplotlib_ml.bins_from_width_range(bin_width, bin_max, bin_min=None)[source]
To compute the bin boundaries to help with plotting and give a constant bin width
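A plausible sketch of that computation (behavior assumed from the name and arguments, not taken from the source):

import numpy as np

def bins_from_width_range(bin_width, bin_max, bin_min=None):
    if bin_min is None:        # assume bins start at 0 when bin_min is not given
        bin_min = 0
    # evenly spaced bin edges with a constant width, inclusive of bin_max
    return np.arange(bin_min, bin_max + bin_width, bin_width)

print(bins_from_width_range(bin_width=20, bin_max=100))  # [  0  20  40  60  80 100]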
- machine_learning_tools.matplotlib_ml.color_to_rgb(color_str)[source]
To turn a string of a color into an RGB value
Ex: color_to_rgb(“red”)
- machine_learning_tools.matplotlib_ml.convert_dict_rgb_values_to_names(color_dict)[source]
Purpose: To convert a dictionary with colors as values so the values are the color names instead of the RGB equivalents
Application: can be used on the color dictionary returned by the neuron plotting function
Example:
from datasci_tools import matplotlib_utils as mu
mu = reload(mu)
nviz = reload(nviz)

returned_color_dict = nviz.visualize_neuron(uncompressed_neuron,
    visualize_type=["network"],
    network_resolution="branch",
    network_directional=True,
    network_soma=["S1", "S0"],
    network_soma_color=["black", "red"],
    limb_branch_dict=dict(L1="all", L2="all"),
    node_size=1,
    arrow_size=1,
    return_color_dict=True)

color_info = mu.convert_dict_rgb_values_to_names(returned_color_dict)
- machine_learning_tools.matplotlib_ml.convert_rgb_to_name(rgb_value)[source]
Example: convert_rgb_to_name(np.array([[1,0,0,0.5]]))
- machine_learning_tools.matplotlib_ml.generate_color_list(user_colors=[], n_colors=-1, colors_to_omit=[], alpha_level=0.2, return_named_colors=False)[source]
Can specify the number of colors that you want, the colors that you don't want, and what alpha you want
Example of how to use: colors_array = generate_color_list(colors_to_omit=["green"])
- machine_learning_tools.matplotlib_ml.generate_color_list_no_alpha_change(user_colors=[], n_colors=-1, colors_to_omit=[], alpha_level=0.2)[source]
Can specify the number of colors that you want, the colors that you don't want, and what alpha you want
Example of how to use: colors_array = generate_color_list(colors_to_omit=["green"])
- machine_learning_tools.matplotlib_ml.generate_non_randon_named_color_list(n_colors, user_colors=[], colors_to_omit=[])[source]
To generate a list of colors of a certain length that is non-random
- machine_learning_tools.matplotlib_ml.generate_random_color(print_flag=False, colors_to_omit=[])[source]
- machine_learning_tools.matplotlib_ml.generate_unique_random_color_list(n_colors, print_flag=False, colors_to_omit=[])[source]
- machine_learning_tools.matplotlib_ml.histogram(data, n_bins=50, bin_width=None, bin_max=None, bin_min=None, density=False, logscale=False, return_fig_ax=True, fontsize_axes=20, **kwargs)[source]
Ex: histogram(in_degree, bin_max=700, bin_width=20, return_fig_ax=True)
- machine_learning_tools.matplotlib_ml.plot_color_dict(colors, sorted_names=None, hue_sort=False, ncols=4, figure_width=20, figure_height=8, print_flag=True)[source]
Ex:
# how to plot the base colors
mu.plot_color_dict(mu.base_colors_dict, figure_height=20)
mu.plot_color_dict(mu.base_colors_dict, hue_sort=True, figure_height=20)

How to plot colors returned from the plotting function:
from datasci_tools import matplotlib_utils as mu
mu = reload(mu)
nviz = reload(nviz)

returned_color_dict = nviz.visualize_neuron(uncompressed_neuron,
    visualize_type=["network"],
    network_resolution="branch",
    network_directional=True,
    network_soma=["S1", "S0"],
    network_soma_color=["black", "red"],
    limb_branch_dict=dict(L1="all", L2="all"),
    node_size=1,
    arrow_size=1,
    return_color_dict=True)

mu.plot_color_dict(returned_color_dict, hue_sort=False, figure_height=20)
- machine_learning_tools.matplotlib_ml.plot_graph(title, y_values, x_values, x_axis_label, y_axis_label, return_fig=False, figure=None, ax_index=None, label=None, x_axis_int=True)[source]
Purpose: For easy plotting and concatenating plots
- machine_learning_tools.matplotlib_ml.process_non_dict_color_input(color_input)[source]
Will return a color list that is as long as n_items based on a diverse set of options for how to specify colors
string
list of strings
1D np.array
list of strings and 1D np.array
list of 1D np.array or 2D np.array
Warning: This will not be alpha corrected
- machine_learning_tools.matplotlib_ml.scatter_2D_with_labels(X, Y, labels, label_color_dict=None, x_title='', y_title='', axis_append='', Z=None, z_title='', alpha=0.5, verbose=False, move_legend_outside_plot=True)[source]
Purpose: Will plot scatter points where each point has a unique label (and allows to specify the colors of each label)
Pseudocode:
1) Find the unique labels
2) For all unique labels, if a color mapping is not specified then add a random unique color (use function)
3) Iterate through the labels to plot: a. find all indices of that label b. plot them with the correct color and label
4) Move the legend to outside of the plot
Ex:
mu.scatter_2D_with_labels(
    X=np.concatenate([f1_inh, f1_exc]),
    Y=np.concatenate([f2_inh, f2_exc]),
    #Z=np.ones(194),
    x_title=feature_1,
    y_title=feature_2,
    axis_append="(per um of skeleton)",
    labels=np.concatenate([class_inh, class_exc]),
    alpha=0.5,
    label_color_dict=dict(BC="blue",
                          BPC="black",
                          MC="yellow",
                          excitatory="red"),
    verbose=True)
machine_learning_tools.numpy_ml module
- machine_learning_tools.numpy_ml.all_choose_1_combinations_form_dict_values(parameter_dict, verbose=False)[source]
Purpose: To generate a list of dictionaries that encompasses all of the possible parameter settings defined by the value lists in the dictionary
- machine_learning_tools.numpy_ml.all_directed_choose_2_combinations(array)[source]
Ex:
seg_split_ids = ["864691136388279671_0",
                 "864691135403726574_0",
                 "864691136194013910_0"]

output:
[['864691136388279671_0', '864691135403726574_0'],
 ['864691136388279671_0', '864691136194013910_0'],
 ['864691135403726574_0', '864691136388279671_0'],
 ['864691135403726574_0', '864691136194013910_0'],
 ['864691136194013910_0', '864691136388279671_0'],
 ['864691136194013910_0', '864691135403726574_0']]
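The same result can be obtained with the standard library (a sketch of the equivalent call, not necessarily the module's implementation):

from itertools import permutations

seg_split_ids = ["864691136388279671_0",
                 "864691135403726574_0",
                 "864691136194013910_0"]
# all ordered (directed) pairs of distinct elements
directed_pairs = [list(p) for p in permutations(seg_split_ids, 2)]
print(directed_pairs)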
- machine_learning_tools.numpy_ml.all_partitions(array, min_partition_size=2, verbose=False)[source]
Will form all of the possible two-way partitions of an array, where you can specify the minimum number of elements needed for each partition
Ex: x = nu.all_partitions(array = np.array([4,5,6,9]))
- machine_learning_tools.numpy_ml.all_subarrays(l)[source]
Ex:
from datasci_tools import numpy_utils as nu
nu.all_subarrays([[1, "a"], [2, "b"], [3, "c"]])

Output:
[[],
 [[1, 'a']], [[2, 'b']], [[1, 'a'], [2, 'b']], [[3, 'c']],
 [[1, 'a'], [3, 'c']], [[2, 'b'], [3, 'c']],
 [[1, 'a'], [2, 'b'], [3, 'c']]]
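This is the classic powerset construction; a standard-library sketch (the ordering of the subsets may differ from the module's output):

from itertools import chain, combinations

def all_subarrays(l):
    # all subsets of l (the powerset), from the empty list up to l itself
    return [list(c) for c in chain.from_iterable(combinations(l, r)
                                                 for r in range(len(l) + 1))]

print(all_subarrays([[1, "a"], [2, "b"], [3, "c"]]))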
- machine_learning_tools.numpy_ml.all_unique_choose_2_combinations(array)[source]
Given a list of numbers or labels, will determine all the possible unique pairings
- machine_learning_tools.numpy_ml.angle_between_vectors(v1, v2, acute=True, degrees=True, verbose=False)[source]
Ex:
vec1 = np.array([0, 0, 1])
vec2 = np.array([1, 1, -0.1])
angle_between_vectors(vec1, vec2, verbose=True)
- machine_learning_tools.numpy_ml.argsort_multidim_array_by_rows(array, descending=False)[source]
Ex:
x = np.array([
    [2, 2, 3, 4, 5],
    [-2, 2, 3, 4, 5],
    [3, 1, 1, 1, 1],
    [1, 10, 10, 10, 10],
    [3, 0, 1, 1, 1],
    [-2, -3, 3, 4, 5],
])

# showing this argsort will correctly sort
x[nu.argsort_multidim_array_by_rows(x)]

>> Output:
array([[-2, -3, 3, 4, 5],
       [-2, 2, 3, 4, 5],
       [ 1, 10, 10, 10, 10],
       [ 2, 2, 3, 4, 5],
       [ 3, 0, 1, 1, 1],
       [ 3, 1, 1, 1, 1]])
- machine_learning_tools.numpy_ml.argsort_rows_of_2D_array_independently(array, descending=False)[source]
Purpose: will return one array of row indices and one of column indices that will sort the values of each row independently
Ex:
x = np.array([
    [2, 2, 3, 4, 5],
    [-2, 2, 3, 4, 5],
    [3, 1, 1, 1, 1],
    [1, 10, 10, 10, 10],
    [3, 0, 1, 1, 1],
    [-2, -3, 3, 4, 5],
])

row_idx, col_idx = nu.argsort_rows_of_2D_array_independently(x)
x[row_idx, col_idx]

>> Output:
array([[ 2, 2, 3, 4, 5],
       [-2, 2, 3, 4, 5],
       [ 1, 1, 1, 1, 3],
       [ 1, 10, 10, 10, 10],
       [ 0, 1, 1, 1, 3],
       [-3, -2, 3, 4, 5]])
- machine_learning_tools.numpy_ml.array_after_exclusion(original_array=[], exclusion_list=[], n_elements=0)[source]
To efficiently get the difference between 2 lists:
original_list = [1, 5, 6, 10, 11]
exclusion = [10, 6]
n_elements = 20
array_after_exclusion(n_elements=n_elements, exclusion_list=exclusion)

** pretty much the same thing as: np.setdiff1d(array1, array2)
- machine_learning_tools.numpy_ml.array_split(array, n_splits)[source]
Split an array into multiple sub-arrays
Ex: from datasci_tools import numpy_utils as nu nu.array_split(np.arange(0,10),3)
- machine_learning_tools.numpy_ml.compare_threshold(item1, item2, threshold=0.0001, print_flag=False)[source]
Purpose: Function that will take two scalars or 2D arrays and subtract them; if the distance between them is less than the specified threshold, they are considered equal
Example:
nu = reload(nu)

item1 = [[1, 4, 5, 7],
         [1, 4, 5, 7],
         [1, 4, 5, 7]]

item2 = [[1, 4, 5, 8.00001],
         [1, 4, 5, 7.00001],
         [1, 4, 5, 7.00001]]

# item1 = [1, 4, 5, 7]
# item2 = [1, 4, 5, 9.0000001]

print(nu.compare_threshold(item1, item2, print_flag=True))
- machine_learning_tools.numpy_ml.concatenate_arrays_along_last_axis_after_upgraded_to_at_least_2D(arrays)[source]
Example:
from datasci_tools import numpy_utils as nu
arrays = [np.array([1, 2, 3]), np.array([4, 5, 6])]
nu.concatenate_arrays_along_last_axis_after_upgraded_to_at_least_2D(arrays)

>> output:
array([[1, 4],
       [2, 5],
       [3, 6]])
- machine_learning_tools.numpy_ml.convert_to_array_like(array, include_tuple=False)[source]
Will convert something to an array
- machine_learning_tools.numpy_ml.divide_data_into_classes(classes_array, data_array, unique_classes=None)[source]
Purpose: Will divide two parallel arrays (classes and data) into a dictionary keyed by the unique classes, with all of the data belonging to each class as the values
- machine_learning_tools.numpy_ml.divide_into_label_indexes(mapping)[source]
Purpose: To take an array that attributes labels to indices and divide it into a list of the arrays that correspond to the indices of all of the labels
- machine_learning_tools.numpy_ml.find_matching_endpoints_row(branch_idx_to_endpoints, end_coordinates)[source]
- machine_learning_tools.numpy_ml.get_matching_vertices(possible_vertices, ignore_diagonal=True, equiv_distance=0, print_flag=False)[source]
ignore_diagonal is not implemented yet
- machine_learning_tools.numpy_ml.indices_of_comparison_func(func, array1, array2)[source]
Returns the indices of the elements that result from applying func to array1 and array2
- machine_learning_tools.numpy_ml.interpercentile_range(array, range_percentage, axis=None, verbose=False)[source]
range_percentage should be 50 or 90 (not 0.5 or 0.9)
Purpose: To compute the range that extends from the (50 - range_percentage/2)th percentile to the (50 + range_percentage/2)th percentile
Ex:
interpercentile_range(np.vstack([np.arange(1, 11),
                                 np.arange(1, 11),
                                 np.arange(1, 11)]),
                      90, verbose=True, axis=1)
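A hedged numpy sketch of that definition (using np.percentile; assumed to mirror the behavior, not taken from the source):

import numpy as np

def interpercentile_range(array, range_percentage, axis=None):
    # e.g. range_percentage=90 -> range between the 5th and 95th percentiles
    lower = (100 - range_percentage) / 2
    upper = 100 - lower
    return (np.percentile(array, upper, axis=axis)
            - np.percentile(array, lower, axis=axis))

print(interpercentile_range(np.arange(1, 11), 90))  # 8.1 for the integers 1..10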
- machine_learning_tools.numpy_ml.intersect1d(arr1, arr2, assume_unique=False, return_indices=False)[source]
Will return the common elements from 2 possibly different sized arrays
If return_indices=True, will also return the indices of the common elements
- machine_learning_tools.numpy_ml.intersect_indices(array1, array2)[source]
Returns the indices of the intersection of array1 and 2
- machine_learning_tools.numpy_ml.intersecting_array_components(arrays, sort_components=True, verbose=False, perfect_match=False)[source]
Purpose: Will find the groups of arrays that are connected components based on overlap of elements
Pseudocode: 1) Create an empty edges list 2) Iterate through all combinations of arrays (skipping the redundant ones) a. check if there is an intersection b. if yes, add to the edges list 3) Turn the edges into a graph 4) Return the connected components
- machine_learning_tools.numpy_ml.matching_rows(vals, row, print_flag=False, equiv_distance=0.0001)[source]
- machine_learning_tools.numpy_ml.number_matching_vertices_between_lists(arr1, arr2, verbose=False)[source]
- machine_learning_tools.numpy_ml.order_array_using_original_and_matching(original_array, matching_array, array, verbose=False)[source]
Purpose: To rearrange arrays so that a specific array matches an original array
Pseudocode: 1) Find the matching array elements 2) For each array in arrays index using the matching indices
Ex:
x = [1, 2, 3, 4, 5, 6]
y = [4, 6, 2]
arrays = [np.array(["hi", "yes", "but"])]
arrays = [np.array(["hi", "yes", "but"]), ["no", "yes", "hi"]]
arrays = [np.array([1, 2, 3]), [7, 8, 9]]

order_arrays_using_original_and_matching(original_array=x,
                                         matching_array=y,
                                         arrays=arrays,
                                         verbose=True)

Return:
>> [array(['but', 'hi', 'yes'], dtype='<U3')]
- machine_learning_tools.numpy_ml.order_arrays_using_original_and_matching(original_array, matching_array, arrays, verbose=False)[source]
Purpose: To rearrange arrays so that a specific array matches an original array
Pseudocode: 1) Find the matching array elements 2) For each array in arrays index using the matching indices
Ex:
x = [1, 2, 3, 4, 5, 6]
y = [4, 6, 2]
arrays = [np.array(["hi", "yes", "but"])]
arrays = [np.array(["hi", "yes", "but"]), ["no", "yes", "hi"]]
arrays = [np.array([1, 2, 3]), [7, 8, 9]]

order_arrays_using_original_and_matching(original_array=x,
                                         matching_array=y,
                                         arrays=arrays,
                                         verbose=True)

Return:
>> [array(['but', 'hi', 'yes'], dtype='<U3')]
- machine_learning_tools.numpy_ml.original_array_indices_of_elements(original_array, matching_array)[source]
Purpose: Will find the indices of the matching array from the original array
Ex:
x = [1, 2, 3, 4, 5, 6]
y = [4, 6, 2]
nu.original_array_indices_of_elements(x, y)
- machine_learning_tools.numpy_ml.random_2D_subarray(array, n_samples, replace=False, verbose=False)[source]
Purpose: To choose a random subset of rows from a 2D array
Ex:
from datasci_tools import numpy_utils as nu
import numpy as np

y = np.array([[1, 3], [3, 2], [5, 6]])
nu.random_2D_subarray(array=y,
                      n_samples=2,
                      replace=False)
- machine_learning_tools.numpy_ml.repeat_vector_down_rows(array, n_repeat)[source]
Ex: Turn [705895.1025, 711348.065, 761467.87] into that row repeated n_repeat times:

TrackedArray([[705895.1025, 711348.065, 761467.87],
              [705895.1025, 711348.065, 761467.87],
              ...
              [705895.1025, 711348.065, 761467.87]])
- machine_learning_tools.numpy_ml.setdiff1d(arr1, arr2, assume_unique=False, return_indices=True)[source]
Purpose: To get the elements in arr1 that aren’t in arr2 and then to possibly return the indices of those that were unique in the first array
- machine_learning_tools.numpy_ml.sort_multidim_array_by_rows(edge_array, order_row_items=False)[source]
Purpose: To sort an array along the 0 axis where you maintain the row integrity (with possibly sorting the individual elements along a row)
Example (how to get sorted edges):
from datasci_tools import numpy_utils as nu
nu = reload(nu)
nu.sort_multidim_array_by_rows(limb_concept_network.edges(), order_row_items=True)
- machine_learning_tools.numpy_ml.sort_rows_by_column(array, column_idx, largest_to_smallest=True)[source]
Will sort the rows based on the values of 1 column
- machine_learning_tools.numpy_ml.unique_non_self_pairings(array)[source]
Purpose: Will take a list of pairings and then filter the list to only unique pairings where there is no self-pairing
- machine_learning_tools.numpy_ml.unique_pairings_between_2_arrays(array1, array2)[source]
Turns 2 separate arrays into all possible combinations of elements, e.g. [1, 2] and [3, 4] into:

array([[1, 3],
       [1, 4],
       [2, 3],
       [2, 4]])
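An equivalent result with itertools.product (a sketch, not necessarily the module's implementation):

import numpy as np
from itertools import product

array1, array2 = [1, 2], [3, 4]
# all pairings with one element taken from each array
pairings = np.array(list(product(array1, array2)))
print(pairings)  # [[1 3] [1 4] [2 3] [2 4]]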
machine_learning_tools.pandas_ml module
Purpose: pandas functions that are useful for machine learning
.iloc: indexes with integers, e.g. df_test.iloc[:5] gets the first 5 rows
.loc: indexes with strings (labels), e.g. df_test.loc[df.columns, df.columns[:5]]
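A quick illustration of the two indexers with standard pandas:

import numpy as np
import pandas as pd

df_test = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])
print(df_test.iloc[:2])            # positional: the first 2 rows
print(df_test.loc[:, ["a", "c"]])  # label-based: all rows, the columns named "a" and "c"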
- machine_learning_tools.pandas_ml.correlations_by_col(df, correlation_method='pearson')[source]
will return a table that has the correlations between all the columns in the dataframe
other correlation methods: "pearson", "spearman", "kendall"
- machine_learning_tools.pandas_ml.correlations_to_target(df, target_name='target', correlation_method='pearson', verbose=False, sort_by_value=True)[source]
Purpose: Will find the correlation between all columns and the target column
- machine_learning_tools.pandas_ml.df_from_X_y(X, y, data_column_names='feature', target_name='target')[source]
Ex: pdml.df_from_X_y(X_trans,y,target_name = “cell_type”)
- machine_learning_tools.pandas_ml.df_to_csv(df, output_filename='df.csv', output_folder='./', file_suffix='.csv', output_filepath=None, verbose=False, return_filepath=True, compression='infer', index=True)[source]
Purpose: To export a dataframe as a csv file
- machine_learning_tools.pandas_ml.df_to_gzip(df, output_filename='df.gzip', output_folder='./', output_filepath=None, verbose=False, return_filepath=True, index=False)[source]
Purpose: Save off a compressed version of dataframe (usually 1/3 of the size)
- machine_learning_tools.pandas_ml.dropna(axis=0)[source]
A more straightforward way of dropping NaNs
- machine_learning_tools.pandas_ml.plot_df_x_y_with_std_err(df, x_column, y_column=None, std_err_column=None, log_scale_x=True, log_scale_y=True, verbose=False)[source]
Purpose: to plot the x and y columns where the y column has an associated standard error with it
Example:
from machine_learning_tools import pandas_ml as pdml
pdml.plot_df_x_y_with_std_err(
    df,
    x_column="C",
)
machine_learning_tools.preprocessing_ml module
Functions used on data before models analyze the data
Application: 1) Lasso Linear Regression should have all columns on the same scale so they get regularized the same amount
Useful link explaining different sklearn scalars for preprocessing:
http://benalexkeen.com/feature-scaling-with-scikit-learn/
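For orientation, the sklearn scalers referenced here follow the standard fit/transform pattern (a hedged sketch with the documented sklearn API, not this module's code):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [10.0, 20.0, 30.0]})
scaler = StandardScaler()                 # or RobustScaler(), MinMaxScaler(), ...
scaled = scaler.fit_transform(df)         # zero mean, unit variance per column
df_scaled = pd.DataFrame(scaled, columns=df.columns)
print(df_scaled)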
- machine_learning_tools.preprocessing_ml.get_scaler(scaler='normal_dist')[source]
Purpose: To return the appropriate scaler option
- machine_learning_tools.preprocessing_ml.scale_df(df, scaler='StandardScaler', scaler_trained=None, target_name=None, verbose=False)[source]
Purpose: To apply a preprocessing scaler to all of the feature columns of a df (first getting the appropriate scaler)
Ex:
from machine_learning_tools import preprocessing_ml as preml
preml.scale_df(df,
    target_name=target_name,
    scaler="RobustScaler",
    verbose=False)
machine_learning_tools.seaborn_ml module
- machine_learning_tools.seaborn_ml.corrplot(df, figsize=(10, 10), fmt='.2f', annot=True, **kwargs)[source]
Purpose: Computes and plots the correlation
- machine_learning_tools.seaborn_ml.heatmap(df, cmap='Blues', annot=True, logscale=True, title=None, figsize=None, fontsize=16, axes_fontsize=30, ax=None, fmt=None, **kwargs)[source]
Purpose: Will make a heatmap
- machine_learning_tools.seaborn_ml.pairwise_hist2D(df, reject_outliers=True, percentile_upper=99.5, percentile_lower=None, features=None, verbose=True, return_data=False, bins='auto')[source]
machine_learning_tools.sklearn_models module
Purpose: Storing models that were implemented in sklearn, tested, and wrapped in an easier API
Notes:
model.predict --> predicts results
model.coef_ --> coefficients
model.intercept_ --> intercept
model.score(X, y) --> gives the R^2 of the prediction
model.alpha_ --> the Lagrange multiplier after the fit
- machine_learning_tools.sklearn_models.AdaBoostClassifier(base_estimator=None, n_estimators=50, learning_rate=1.0, random_state=None, **kwargs)[source]
Purpose: To perform boosting (sequential ensembles) where subsequent models are trained on weighted data, with data missed by previous models weighted more highly
---- AdaBoost-specific parameters ----
base_estimator: Object, default=None
    the estimator used for the ensemble; if None it is a tree with max_depth=1
n_estimators: int, default=50
    maximum number of estimators used before termination; learning could be terminated earlier (this is just the max possible)
learning_rate: float, default=1.0
    weight applied to each classifier at each iteration, so the lower the learning rate the more models are needed
random_state: int
    controls the seed given to each base_estimator at each boosting iteration (i.e. the base estimator has to have a random_state argument)

Attributes:
base_estimator_: Object
    base from which the estimators were built
estimators_: list of objects
    list of the fitted sub-estimators
estimator_weights_: array of floats
    the weights given to each estimator
estimator_errors_: array of floats
    classification error for each estimator in the ensemble
feature_importances_: array of floats
    impurity-based feature importances
n_features_in_: int
    number of features seen during fit
feature_names_in_:
    names of the features seen during fit
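A short usage sketch with sklearn's AdaBoostClassifier (standard sklearn API, shown as an assumption of how this wrapper is typically used):

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))        # accuracy on held-out data
print(clf.feature_importances_)         # impurity-based feature importances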
- machine_learning_tools.sklearn_models.AdaptiveLasso(X, y, CV=True, cv_n_splits=10, fit_intercept=False, alpha=None, coef=None, verbose=False, n_lasso_iterations=5)[source]
Example of adaptive Lasso to produce even sparser solutions
Adaptive lasso consists in computing many Lasso with feature reweighting. It’s also known as iterated L1.
Help with the implementation:
https://gist.github.com/agramfort/1610922
--- Example 1: Using generated data ---
from sklearn.datasets import make_regression
X, y, coef = make_regression(n_samples=306, n_features=8000, n_informative=50,
                             noise=0.1, shuffle=True, coef=True, random_state=42)

X /= np.sum(X ** 2, axis=0)  # scale features
alpha = 0.1

model_al = sklm.AdaptiveLasso(
    X, y,
    alpha=alpha,
    coef=coef,
    verbose=True
)

--- Example 2: Using simpler data ---
X, y = pdml.X_y(df_scaled, target_name)
model_al = sklm.AdaptiveLasso(
    X, y,
    verbose=True
)
- machine_learning_tools.sklearn_models.BaggingClassifier(base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, random_state=True, verbose=True, max_depth=None, **kwargs)[source]
Purpose: 1) Fits classifiers on random subsets of the original data (using a bootstrapping method that samples with replacement) 2) Aggregates the individual predictions of the classifiers (voting or averaging)
Application: to reduce the variance of an estimator

BaggingClassifier parameters

--- Bagging-specific parameters ---
base_estimator: Object, default=DecisionTreeClassifier
    the model that will be trained many times
n_estimators: int
    number of estimators the ensemble will use
max_samples: int/float, default=1.0
    int: draw exactly that many samples; float: draw that proportion of the samples
max_features: int/float, default=1.0
    number of features drawn when creating the training set for each iteration
bootstrap: bool, default=True
    whether samples are drawn with replacement or not
bootstrap_features: bool, default=False
    whether to draw features with replacement
oob_score: bool, default=False
    whether to use out-of-bag samples to estimate the generalization error during training
random_state: int, default=None
verbose: int, default=None
- machine_learning_tools.sklearn_models.DecisionTreeClassifier(splitter='best', criterion='gini', max_depth=None, max_features=None, random_state=None, class_weight=None, **kwargs)[source]
DecisionTreeClassifier parameters:

--- DecisionTree-specific parameters ---
splitter: default="best", how the split will be chosen
    - "best": best split
    - "random": best random split

--- generic tree parameters ---
criterion: how the best split is determined ('gini' or 'entropy')
max_depth: int, max depth of the tree
max_features: number of features to look at when doing the best split
    - 'auto': sqrt of n_features
    - 'sqrt'
    - 'log2'
    - int: the exact number of features
    - float: the percentage of total n_features
    - None: always use the maximum number of features
class_weight: dict, list of dicts, or "balanced"; adds weighting to the class decision
    - "balanced" will automatically balance based on the composition of y
- machine_learning_tools.sklearn_models.ElasticNet(alpha=1, l1_ratio=0.5, fit_intercept=False, **kwargs)[source]
Purpose: Model that has a mix of L1 and L2 regularization and chooses the lambda (called alpha) based on cross validation later when it is fitted
- machine_learning_tools.sklearn_models.ElasticNetCV(l1_ratio=0.5, fit_intercept=False, cv_n_splits=10, **kwargs)[source]
Purpose: Model that has a mix of L1 and L2 regularization and chooses the lambda (called alpha) based on cross validation later when it is fitted
- machine_learning_tools.sklearn_models.GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, max_depth=3, random_state=None, max_features=None, verbose=0)[source]
GradientBoosting
Purpose: to optimize a certain loss function. The idea is to fit classifiers that go downhill in the gradient but do not fit all the way; this makes the learning slower and harder to overfit.
Procedure: For many stages 1) regression trees are fit on the negative gradient of the loss 2) the classifiers are only weighted by a learning rate so they do not learn too quickly 3) continue to the next stage and learn on the subsequent gradients
Application: Usually pretty good at resisting overfitting

--- GradientBoosting-specific parameters ---
loss: str, default="deviance"
    the loss function to optimize
    - "deviance": logistic regression loss (with probabilistic outputs)
    - "exponential": this is just the AdaBoost algorithm
learning_rate: float, default=0.1
    how much each of the estimators contributes
n_estimators: int, default=100
subsample: float, default=1.0
    if less than 1, does stochastic gradient boosting where not all of the samples are used

--- generic tree parameters ---
max_depth: int, max depth of the trees
max_features: number of features to look at when doing the best split
    - 'auto': sqrt of n_features
    - 'sqrt'
    - 'log2'
    - int: the exact number of features
    - float: the percentage of total n_features
    - None: always use the maximum number of features

Attributes it exposes:
n_estimators_: int
    number of estimators made
feature_importances_: array
    impurity-based feature importances
oob_improvement_: array
    the improvement in loss on the out-of-bag sample relative to the previous iteration (ONLY AVAILABLE IF SUBSAMPLE < 1.0)
train_score_: array
    the ith train score is the deviance of the model at iteration i on the in-bag sample (if subsample == 1, this is the deviance on the training data)
estimators_: array of DecisionTreeRegressor
- machine_learning_tools.sklearn_models.LassoCV(fit_intercept=False, cv_n_splits=10, **kwargs)[source]
- machine_learning_tools.sklearn_models.LogisticRegression(**kwargs)[source]
With this one you can set the coefficients of the linear classifier
- machine_learning_tools.sklearn_models.RandomForestClassifier(n_estimators=30, criterion='gini', max_depth=None, max_features='auto', bootstrap=True, oob_score=True, random_state=None, verbose=False, max_samples=None, class_weight=None, **kwargs)[source]
Purpose: An ensemble where the number of features trained on is a subset of the overall features and the samples trained on are bootstrapped samples

--- RandomForest-specific parameters ---
n_estimators: number of trees to use in the forest
bootstrap: bool (True): whether bootstrap sampling is used to build the trees (if not, the whole dataset is used)
oob_score: bool (False): whether to use out-of-bag samples to estimate the generalization score
max_samples: number of samples to draw if doing bootstrapping
verbose

--- generic tree parameters ---
criterion: how the best split is determined ('gini' or 'entropy')
max_depth: int, max depth of the trees
max_features: number of features to look at when doing the best split
    - 'auto': sqrt of n_features
    - 'sqrt'
    - 'log2'
    - int: the exact number of features
    - float: the percentage of total n_features
    - None: always use the maximum number of features
class_weight: dict, list of dicts, or "balanced"; adds weighting to the class decision
    - "balanced" will automatically balance based on the composition of y
Example:
clf = sklm.RandomForestClassifier(max_depth=5)
clf.fit(X_train, y_train)
print(sklu.accuracy(clf, X_test, y_test), clf.oob_score_)
_ = sklm.feature_importance(clf_for, return_std=True, plot=True)
- machine_learning_tools.sklearn_models.RidgeCV(fit_intercept=False, cv_n_splits=10, **kwargs)[source]
- machine_learning_tools.sklearn_models.SVC(kernel='rbf', C=1, gamma='scale', **kwargs)[source]
SVM with the possibility of adding kernels
- machine_learning_tools.sklearn_models.coef_summary(feature_names, model=None, coef_=None, intercept=True)[source]
- machine_learning_tools.sklearn_models.feature_importances(clf, verbose=True, plot=False, feature_names=None, return_std=False, method='impurity_decrease', X_permuation=None, y_permutation=None, n_repeats=10, random_state=None, **kwargs)[source]
Purpose: Will return the feature importance of a tree based classifier
Ex:
sklm.feature_importances(clf,
    #method=,
    verbose=True,
    plot=True,
    X_permuation=X_test,
    y_permutation=y_test,
    n_repeats=1)
- machine_learning_tools.sklearn_models.oob_score(clf)[source]
Purpose: Returns the out-of-bag error for ensembles that use the bootstrapping method
- machine_learning_tools.sklearn_models.plot_regularization_paths(model_func, df=None, target_name=None, X=None, y=None, n_alphas=200, alph_log_min=-1, alpha_log_max=5, reverse_axis=True, model_func_kwargs=None)[source]
Purpose: Will plot the regularization paths for a certain model
# Author: Fabian Pedregosa – <fabian.pedregosa@inria.fr> # License: BSD 3 clause
Example from online:

# X is the 10x10 Hilbert matrix
X = 1. / (np.arange(1, 11) + np.arange(0, 10)[:, np.newaxis])
y = np.ones(10)

plot_regularization_paths(
    sklm.Ridge,
    X=X, y=y,
    alph_log_min=-10,
    alpha_log_max=-2,
)
machine_learning_tools.sklearn_utils module
Important notes:
sklearn.utils.Bunch: just an extended dictionary that allows attributes to referenced by key, bunch[“value_key”], or by an attribute, bunch.value_key
Notes: R^2 number: lm2.score(X, y)
- machine_learning_tools.sklearn_utils.CV_optimal_param_1D(parameter_options, clf_function, loss_function, n_splits=5, data_splits=None, X=None, y=None, test_size=0.2, val_size=0.2, clf_parameters={}, standard_error_buffer=True, plot_loss=False, return_data_splits=False, verbose=False)[source]
Purpose: To run cross validation by hand with a specific
- dataset
- model type
- 1D parameter grid to search over
- loss function to measure
- method of evaluating the best loss

Pseudocode:
0) Define the parameter space to iterate over
1) Split the data into test, training and validation sets
2) Combine the validation and training datasets in order to do cross validation
3) Compute the datasets for each cross-validation fold
4) For every parameter option:
   - For every K-fold dataset: train the model on the dataset, measure the MSE (or another loss) for that model, and store the loss
   - Find the average loss and the standard error on the loss
5) Pick the optimal parameter by one of the options:
   a) picking the parameter with the lowest average loss
   b) picking the parameter value of the least complex model that is within one standard error of the parameter with the minimum average loss
Example:
clf, data_splits = sklu.CV_optimal_param_1D(
    parameter_options=dict(C=np.array([10.**(k) for k in np.linspace(-4, 3, 25)])),
    X=X, y=y,

    # parameters for the type of classifier
    clf_function=linear_model.LogisticRegression,
    clf_parameters=dict(
        penalty="l1",
        solver="saga",
        max_iter=10000,
    ),

    # arguments for the loss function
    loss_function=sklu.logistic_log_loss,

    # arguments for the determination of the optimal parameter
    standard_error_buffer=True,
    plot_loss=True,

    # arguments for the return
    return_data_splits=True,
    verbose=True,
)
- machine_learning_tools.sklearn_utils.MSE(y_true, y_pred=None, model=None, X=None, clf=None)[source]
Purpose: Will calculate the MSE of a model
- machine_learning_tools.sklearn_utils.accuracy(clf, X, y)[source]
Returns the accuracy of a classifier
- machine_learning_tools.sklearn_utils.dataset_df(dataset_name, verbose=False, target_name='target', dropna=True)[source]
- machine_learning_tools.sklearn_utils.k_fold_df_split(X, y, target_name=None, n_splits=5, random_state=None)[source]
Purpose: To divide a test and training dataframe into multiple test/train dataframes to use for k fold cross validation
Ex:
n_splits = 5
fold_dfs = sklu.k_fold_df_split(
    X_train_val, y_train_val,
    n_splits=n_splits)

fold_dfs[1]["X_train"]
- machine_learning_tools.sklearn_utils.logistic_log_loss(clf, X, y_true)[source]
Computes the log loss (aka logistic loss or cross-entropy loss) of a model
- machine_learning_tools.sklearn_utils.optimal_parameter_from_kfold_df(df, parameter_name='k', columns_prefix='mse_fold', higher_param_higher_complexity=True, d=True, verbose=False, return_df=False, standard_error_buffer=False, plot_loss=True, **kwargs)[source]
Purpose: Will find the optimal parameter based on a dataframe of the mse scores for different parameters
Ex:
opt_k, ret_df = sklu.optimal_parameter_from_kfold_df(
    best_subset_df,
    parameter_name="k",
    columns_prefix="mse_fold",
    higher_param_higher_complexity=True,
    standard_error_buffer=True,
    verbose=True,
    return_df=True
)

ret_df
- machine_learning_tools.sklearn_utils.random_regression_with_informative_features(n_samples=306, n_features=8000, n_informative=50, random_state=42, noise=0.1, return_true_coef=True)[source]
Purpose: will create a random regression with a certain number of informative features
- machine_learning_tools.sklearn_utils.train_val_test_split(X, y, test_size=0.2, val_size=None, verbose=False, random_state=None, return_dict=False)[source]
Purpose: To split the data into 1) train 2) validation (if requested) 3) test
Note: All percentages are specified as numbers between 0 and 1.
Process:
1) Split the data into test and train percentages
2) If validation is requested, split train into train/val using the adjusted fraction val_perc_adjusted = val_perc / (1 - test_perc)
3) Return the different splits
Example:
(X_train,
 X_val,
 X_test,
 y_train,
 y_val,
 y_test) = sklu.train_val_test_split(
    X, y,
    test_size=0.2,
    val_size=0.2,
    verbose=True)
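A minimal sketch of the same two-stage split using sklearn's train_test_split (standard API; the adjusted validation fraction follows the formula above):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
test_size, val_size = 0.2, 0.2

# 1) carve off the test set
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=test_size)
# 2) carve the validation set out of what remains, using the adjusted fraction
val_adjusted = val_size / (1 - test_size)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval,
                                                  test_size=val_adjusted)
print(len(X_train), len(X_val), len(X_test))  # roughly 60% / 20% / 20%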
machine_learning_tools.statsmodels_utils module
Purpose: to expose some functionality of the statsmodels library for things like:
- linear regression

Application: Behaves a lot like the R functions

Notes: lm1.rsquared gives the r-squared value
- machine_learning_tools.statsmodels_utils.linear_regression(df, target_name, add_intercept=True, print_summary=True)[source]
Purpose: To run a linear regression and then print out the summary of the coefficients
Ex:
from machine_learning_tools import statsmodels_utils as smu
smu.linear_regression(df_raw[[target_name, "LSTAT"]],
    target_name=target_name,
    add_intercept=True,
)
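Under the hood this is presumably the standard statsmodels OLS pattern; a sketch with the documented statsmodels API (the toy dataframe is made up for illustration):

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({"target": [1.0, 2.0, 2.9, 4.2], "LSTAT": [1.0, 2.0, 3.0, 4.0]})
X = sm.add_constant(df[["LSTAT"]])   # add_intercept=True corresponds to adding a constant column
y = df["target"]
lm1 = sm.OLS(y, X).fit()
print(lm1.summary())                 # coefficient table, like R's summary(lm(...))
print(lm1.rsquared)                  # the r-squared value noted above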
machine_learning_tools.visualizations_ml module
- machine_learning_tools.visualizations_ml.contour_map_for_2D_classifier(clf, axes_min_default=-10, axes_max_default=10, axes_step_default=0.01, axes_min_max_step_dict=None, axes_limit_buffers=0, figure_width=10, figure_height=10, color_type='classes', color_prob_axis=0, contour_color_map='RdBu', map_fill_colors=None, training_df=None, training_df_class_name='class', training_df_feature_names=None, feature_names=('feature_1', 'feature_2'), X=None, y=None, scatter_colors=['darkorange', 'c'])[source]
Purpose: To plot the decision boundary of a classifier that depends on only 2 features
Tried extending this to classifiers of more than 2 features but ran into confusion on how to collapse across the other dimensions of the feature space
Ex:
%matplotlib inline
vml.contour_map_for_2D_classifier(ctu.e_i_model)

# plotting the probability
%matplotlib inline
vml.contour_map_for_2D_classifier(
    ctu.e_i_model,
    color_type="probability")
- machine_learning_tools.visualizations_ml.meshgrid_for_plot(axes_min_default=-10, axes_max_default=10, axes_step_default=1, axes_min_max_step_dict=None, n_axis=None, return_combined_coordinates=True, clf=None)[source]
Purpose: To generate a meshgrid for plotting that is configured as a mixture of custom and default values
axes_min_max_step_dict must be a dictionary mapping the axis label or axis index to its (min, max, step) specification (see the example)
Ex:
vml.meshgrid_for_plot(
    axes_min_default=-20,
    axes_max_default=10,
    axes_step_default=1,
    #axes_min_max_step_dict={1: [-2, 2, 0.5]},
    axes_min_max_step_dict={1: dict(axis_min=-2, axis_max=3, axis_step=1)},
    n_axis=2,
    clf=None,
)
- machine_learning_tools.visualizations_ml.plot_binary_classifier_map(clf, X=None, xmin=None, xmax=None, ymin=None, ymax=None, buffer=0.01, class_idx_to_plot=0, figure_width=10, figure_height=10, axes_fontsize=25, class_0_color=None, class_1_color=None, mid_color='white', alpha=0.5, plot_colorbar=True, colorbar_label=None, colorbar_labelpad=30, colorbar_label_fontsize=20, colorbar_tick_fontsize=25, ax=None, **kwargs)[source]
Purpose: To plot the prediction map of a binary classifier
Arguments: 1) Model 2) Definition of the input space you want to test over (xmin, xmax, ymin, ymax)
Pseudocode: 1) Create a meshgrid of the input space 2) Send the meshgrid points through the classifier's prediction and plot the result (a generic sketch of this pattern follows the example)
====== Example ======
from machine_learning_tools import visualizations_ml as vml

figure_width = 10
figure_height = 10
fig, ax = plt.subplots(1, 1, figsize=(figure_width, figure_height))

X = df_plot[trans_cols].to_numpy().astype("float")

vml.plot_binary_classifier_map(
    clf=ctu.e_i_model,
    X=X,
    xmin=0, xmax=4.5,
    ymin=-0.1, ymax=1.2,
    alpha=0.5,
    colorbar_label="Excitatory Probability",
    ax=ax,
)
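The meshgrid-then-predict idea looks roughly like this in generic sklearn/matplotlib terms (an illustrative sketch, not this function's internals):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_features=2, n_redundant=0, random_state=0)
clf = LogisticRegression().fit(X, y)

# 1) meshgrid over the input space
xx, yy = np.meshgrid(np.linspace(X[:, 0].min(), X[:, 0].max(), 200),
                     np.linspace(X[:, 1].min(), X[:, 1].max(), 200))
# 2) predicted probability of class 1 at every grid point
zz = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1].reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.5)       # probability map
plt.scatter(X[:, 0], X[:, 1], c=y, s=15)  # training points on top
plt.colorbar(label="class-1 probability")
plt.show()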
- machine_learning_tools.visualizations_ml.plot_decision_tree(clf, feature_names, class_names=None, max_depth=None)[source]
Purpose: Will show the fitted decision tree
- machine_learning_tools.visualizations_ml.plot_df_scatter_2d_classification(df=None, target_name=None, feature_names=None, figure_width=10, figure_height=10, alpha=0.5, axis_append='', verbose=False, X=None, y=None, title=None)[source]
Purpose: To plot features in 2D
Ex:
%matplotlib notebook
sys.path.append("/machine_learning_tools/machine_learning_tools/")
from machine_learning_tools import visualizations_ml as vml
vml.plot_df_scatter_3d_classification(df, target_name="group_label",
    feature_names=[
        #"ipr_eig_xz_to_width_50",
        "center_to_width_50",
        "n_limbs",
        "ipr_eig_xz_max_95"
    ])

Ex:
ax = vml.plot_df_scatter_2d_classification(
    X=X_trans[y != "Unknown"],
    y=y[y != "Unknown"],
)

from datasci_tools import matplotlib_utils as mu
mu.set_legend_outside_plot(ax)
- machine_learning_tools.visualizations_ml.plot_df_scatter_3d_classification(df=None, target_name=None, feature_names=None, figure_width=10, figure_height=10, alpha=0.5, axis_append='', verbose=False, X=None, y=None, title=None)[source]
Purpose: To plot features in 3D
Ex:
%matplotlib notebook
sys.path.append("/machine_learning_tools/machine_learning_tools/")
from machine_learning_tools import visualizations_ml as vml
vml.plot_df_scatter_3d_classification(df, target_name="group_label",
    feature_names=[
        #"ipr_eig_xz_to_width_50",
        "center_to_width_50",
        "n_limbs",
        "ipr_eig_xz_max_95"
    ])
- machine_learning_tools.visualizations_ml.plot_df_scatter_classification(df=None, target_name=None, feature_names=None, ndim=3, figure_width=14, figure_height=14, alpha=0.5, axis_append='', verbose=False, X=None, y=None, title=None, target_to_color=None, default_color='yellow', plot_legend=True, scale_down_legend=0.75, bbox_to_anchor=(1.02, 0.5), ax=None, text_to_plot_dict=None, use_labels_as_text_to_plot=False, text_to_plot_individual=None, replace_None_with_str_None=False, **kwargs)[source]
Purpose: To plot features in 2D or 3D (controlled by ndim)
Ex:
%matplotlib notebook
sys.path.append("/machine_learning_tools/machine_learning_tools/")
from machine_learning_tools import visualizations_ml as vml
vml.plot_df_scatter_3d_classification(df, target_name="group_label",
    feature_names=[
        #"ipr_eig_xz_to_width_50",
        "center_to_width_50",
        "n_limbs",
        "ipr_eig_xz_max_95"
    ])
- machine_learning_tools.visualizations_ml.plot_dim_red_analysis(X, method, y=None, n_components=[2, 3], alpha=0.5, color_mapppings=None, plot_kwargs=None, verbose=False, **kwargs)[source]