Summary table of models + methods

Introduction

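Throughout the course, we will go over several supervised and unsupervised machine learning models. This page summarizes them, together with the imputation methods and evaluation techniques covered alongside them. Each table lists strengths, limitations, example use cases, and the implementation language.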

Imputation Method	Strengths	Limitations	Example Use Cases	Implementation
Simple Fill	Simple and fast; Works well with small datasets	May not handle complex data relationships; Sensitive to outliers	Basic data analysis; Quick data cleaning	Python
KNN Imputation	Can capture the relationships between features; Works well with moderately missing data	Computationally intensive for large datasets; Sensitive to the choice of k	Medical data analysis; Market research	Python
Soft Impute	Effective for matrix completion in large datasets; Works well with low-rank data	Assumes low-rank data structure; Can be sensitive to hyperparameters	Recommender systems; Large-scale data projects	Python
Iterative Imputer	Can model complex relationships; Suitable for multiple imputation	Computationally expensive; Depends on the choice of model	Complex datasets with multiple types of missing data	Python
Iterative SVD	Good for matrix completion with low-rank assumption; Handles larger datasets	Sensitive to rank selection; Computationally demanding	Image and video data processing; Large datasets with structure	Python
Matrix Factorization	Useful for recommendation systems; Can handle large-scale problems	Requires careful tuning; Not suitable for all types of data	Recommendation engines; User preference analysis	Python
Nuclear Norm Minimization	Theoretically strong for matrix completion; Finds the lowest rank solution	Very computationally intensive; Impractical for very large datasets	Research in theoretical data completion; Small to medium datasets	Python
BiScaler	Normalizes data effectively; Often used as a preprocessing step	Not an imputation method itself; Doesn’t always converge	Preprocessing for other imputation methods; Data normalization	Python
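
To make the contrast between the first two rows concrete, here is a minimal standard-library sketch of a simple (mean) fill and a KNN-style fill. The function names `mean_fill` and `knn_fill`, the distance rule, and the toy data are assumptions made for illustration. This is not the course's implementation, and in practice a library imputer would normally be used.

```python
import math
from statistics import mean

def mean_fill(rows):
    """Simple fill: replace each None with the mean of its column's observed values."""
    n_cols = len(rows[0])
    col_means = [mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    return [[col_means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]

def _distance(a, b):
    """Root-mean-square distance over the features observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

def knn_fill(rows, k=2):
    """KNN-style fill: replace each None with the mean of that column over
    the k nearest rows that have the column observed.
    Assumes at least one other row has the column observed."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                ((_distance(row, other), other[j])
                 for m, other in enumerate(rows)
                 if m != i and other[j] is not None),
                key=lambda pair: pair[0],
            )
            filled[i][j] = mean(v for _, v in donors[:k])
    return filled

# Toy data (made up): column 1 is 10x column 0, and the last row is an outlier.
data = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, None],
    [10.0, 100.0],
]

print(mean_fill(data)[2][1])      # column mean, pulled upward by the outlier row
print(knn_fill(data, k=2)[2][1])  # mean over the two nearest rows: 15.0
```

On this toy data the mean fill is dragged toward the outlier, while the KNN fill only looks at the two rows closest to the incomplete one. The price is a distance computation against every other row for each missing value.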

Model Type	Strengths	Limitations	Example Use Cases	Implementation
Logistic Regression	Simple and interpretable; Fast to train	Assumes linear boundaries; Not suitable for complex relationships	Credit approval; Medical diagnosis	Python
Decision Trees	Intuitive; Can model non-linear relationships	Prone to overfitting; Sensitive to small changes in data	Customer segmentation; Loan default prediction	Python
Random Forest	Reduces overfitting relative to a single tree; Can model complex relationships	Slower to train and predict; Black box model	Fraud detection; Stock price movement prediction	Python
Support Vector Machines (SVM)	Effective in high dimensional spaces; Works well with clear margin of separation	Sensitive to kernel choice; Slow on large datasets	Image classification; Handwriting recognition	Python
K-Nearest Neighbors (KNN)	Simple and intuitive; No training phase	Slow during query phase; Sensitive to irrelevant features and scale	Product recommendation; Document classification	Python
Neural Networks	Capable of approximating complex functions; Flexible architecture; Trainable with backpropagation	Can require a large number of parameters; Prone to overfitting on small data; Training can be slow	Pattern recognition; Basic image classification; Function approximation	Python
Deep Learning	Can model highly complex relationships; Excels with vast amounts of data; State-of-the-art results in many domains	Requires a lot of data; Computationally intensive; Interpretability challenges	Advanced image and speech recognition; Machine translation; Game playing (like AlphaGo)	Python
Naive Bayes	Fast; Works well with large feature sets	Assumes feature independence; Numerical features need a distributional assumption (e.g., Gaussian)	Spam detection; Sentiment analysis	Python
Gradient Boosting Machines (GBM)	High performance; Handles non-linear relationships	Prone to overfitting if not tuned; Slow to train	Web search ranking; Ecology predictions	Python
Rule-Based Classification	Transparent and explainable; Easily updated and modified	Manual rule creation can be tedious; May not capture complex relationships	Expert systems; Business rule enforcement	Python
Bagging	Reduces variance; Parallelizable	Does not reduce bias	Random Forest is a popular example	Python
Boosting	Reduces bias; Combines weak learners	Sensitive to noisy data and outliers	AdaBoost; Gradient Boosting	Python
XGBoost	Scalable and efficient; Built-in regularization	Requires careful tuning; Can overfit if not used correctly	Competitions on Kaggle; Retail prediction	Python
Linear Discriminant Analysis (LDA)	Dimensionality reduction; Simple and interpretable	Assumes Gaussian distributed data and equal class covariances	Face recognition; Marketing segmentation	Python
Regularized Models (Shrinkage)	Prevents overfitting; Handles collinearity	Requires parameter tuning; May result in loss of interpretability	Ridge and Lasso regression	Python
Stacking	Combines multiple models; Can improve accuracy	Increases model complexity; Risk of overfitting if base models are correlated	Meta-modeling; Kaggle competitions	Python
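
As an illustration of the logistic regression row, here is a minimal sketch that fits a one-feature model by batch gradient descent using only the standard library. The names `fit_logistic` and `predict`, the learning rate, the epoch count, and the toy data are all assumptions chosen for illustration rather than a reference implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            error = p - yi  # gradient of the log loss w.r.t. the linear score
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Predict class 1 when the estimated probability is at least 0.5."""
    return 1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5 else 0

# Toy, linearly separable data (made up): negative x is class 0, positive x is class 1.
X = [[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]]
y = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(X, y)
print(w, b)                           # a positive weight: larger x raises the odds of class 1
print([predict(w, b, x) for x in X])  # matches y on this toy data
```

The fitted weight is directly readable as the direction and strength of the feature's effect, which is the interpretability the table refers to. The decision boundary is a single threshold on a linear score, which is the linearity limitation.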

Regression Model	Strengths	Limitations	Example Use Cases	Implementation
Linear Regression	Simple and interpretable	Assumes linear relationship; Sensitive to outliers	Sales forecasting; Risk assessment	Python
Polynomial Regression	Can model non-linear relationships	Can overfit with high degrees	Growth prediction; Non-linear trend modeling	Python
Ridge Regression	Prevents overfitting; Regularizes the model	Does not perform feature selection	High-dimensional data; Preventing overfitting	Python
Lasso Regression	Feature selection; Regularizes the model	May exclude useful variables	Feature selection; High-dimensional datasets	Python
Elastic Net Regression	Balance between Ridge and Lasso	Requires tuning for mixing parameter	High-dimensional datasets with correlated features	Python
Quantile Regression	Models the median or other quantiles	Less interpretable than ordinary regression	Median house price prediction; Financial quantiles modeling	Python
Support Vector Regression (SVR)	Flexible; Can handle non-linear relationships	Sensitive to kernel and hyperparameters	Stock price prediction; Non-linear trend modeling	Python
Decision Tree Regression	Handles non-linear data; Interpretable	Can overfit on noisy data	Price prediction; Quality assessment	Python
Random Forest Regression	Handles large datasets; Reduces overfitting	Requires more computational resources	Large datasets; Environmental modeling	Python
Gradient Boosting Regression	High performance; Can handle non-linear relationships	Prone to overfitting if not tuned	Web search ranking; Price prediction	Python
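
The difference between the Ridge and Lasso rows shows up even with a single feature, where both have closed-form solutions. The sketch below assumes an unpenalized intercept and the objective scalings stated in the docstrings (libraries may scale the penalty differently). The names `ridge_1d` and `lasso_1d` and the toy data are made up for illustration.

```python
import math

def _centered_sums(x, y):
    """Means plus the centered sums S_xx and S_xy."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return x_bar, y_bar, sxx, sxy

def ridge_1d(x, y, alpha):
    """Minimize sum((y - a - b*x)^2) + alpha * b^2; returns (slope, intercept)."""
    x_bar, y_bar, sxx, sxy = _centered_sums(x, y)
    slope = sxy / (sxx + alpha)  # the penalty inflates the denominator
    return slope, y_bar - slope * x_bar

def lasso_1d(x, y, alpha):
    """Minimize 0.5 * sum((y - a - b*x)^2) + alpha * |b|; returns (slope, intercept)."""
    x_bar, y_bar, sxx, sxy = _centered_sums(x, y)
    shrunk = math.copysign(max(abs(sxy) - alpha, 0.0), sxy)  # soft-thresholding
    slope = shrunk / sxx
    return slope, y_bar - slope * x_bar

# Toy data (made up): y = 2x exactly, so S_xx = 10 and S_xy = 20.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

print(ridge_1d(x, y, alpha=0.0))   # (2.0, 0.0): alpha = 0 is ordinary least squares
print(ridge_1d(x, y, alpha=10.0))  # (1.0, 3.0): slope shrunk, but never exactly zero
print(lasso_1d(x, y, alpha=25.0))  # (0.0, 6.0): slope set exactly to zero
```

Ridge shrinks the slope smoothly toward zero without ever reaching it, which is why it does not perform feature selection. Lasso subtracts a fixed amount and clips at zero, so a large enough penalty removes the feature entirely.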

Clustering Model	Strengths	Limitations	Example Use Cases	Implementation
K-Means Clustering	Simple and widely used; Fast for large datasets	Sensitive to initial conditions; Requires specifying the number of clusters	Market segmentation; Image compression	Python
Hierarchical Clustering	Doesn’t require specifying the number of clusters; Produces a dendrogram	May be computationally expensive for large datasets	Taxonomies; Determining evolutionary relationships	Python
DBSCAN (Density-Based Clustering)	Can find arbitrarily shaped clusters; Doesn’t require specifying the number of clusters	Sensitive to scale; Requires density parameters to be set	Noise detection and anomaly detection	Python
Agglomerative Clustering	Variety of linkage criteria; Produces a hierarchy of clusters	Not scalable for very large datasets	Sociological hierarchies; Taxonomies	Python
Mean Shift Clustering	No need to specify number of clusters; Can find arbitrarily shaped clusters	Computationally expensive; Bandwidth parameter selection is crucial	Image analysis; Computer vision tasks	Python
Affinity Propagation	Automatically determines the number of clusters; Good for data with lots of exemplars	High computational complexity; Preference parameter can be difficult to choose	Image recognition; Data with many similar exemplars	Python
Spectral Clustering	Can capture complex cluster structures; Can be used with various affinity matrices	Choice of affinity matrix is crucial; Can be computationally expensive	Image and speech processing; Graph-based clustering	Python
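
The two K-Means limitations in the table (the number of clusters must be given, and the result depends on the starting centroids) are visible in the signature of even a minimal implementation. The sketch below is a plain standard-library version of Lloyd's algorithm. The function names, the fixed iteration count, and the toy points are assumptions for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))

def kmeans(points, initial_centroids, iterations=10):
    """Lloyd's algorithm with caller-supplied starting centroids.
    The number of clusters is fixed by len(initial_centroids)."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[i]  # an empty cluster keeps its old centroid
            for i, members in enumerate(clusters)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Toy data (made up): two well-separated groups of three points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

centroids, labels = kmeans(points, initial_centroids=[points[0], points[-1]])
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)  # one centroid near (1.33, 1.33), the other near (8.33, 8.33)
```

With well-separated groups the starting centroids hardly matter. On overlapping data, different starting centroids can converge to different partitions, which is why implementations usually try several initializations.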

Method	Strengths	Limitations	Example Use Cases	Implementation
PCA	Dimensionality reduction; Preserves variance	Linear method; Not for categorical data	Feature extraction; Data compression	Python
t-SNE	Captures non-linear structures; Good for visualization	Computationally expensive; Often needs prior reduction (e.g., PCA) when the input has very many features	Data visualization; Exploratory analysis	Python
Autoencoders	Dimensionality reduction; Captures non-linear relationships	Requires neural network knowledge; Computationally intensive	Feature learning; Noise reduction	Python
Isolation Forest	Effective for high-dimensional data; Fast and scalable	Randomized, so results can vary between runs; May miss some anomalies	Fraud detection; Network security	Python
SVD	Matrix factorization; Efficient for large datasets	Assumes linear relationships; Sensitive to scaling	Recommender systems; Latent semantic analysis	Python
ICA	Identifies independent components; Signal separation	Assumes non-Gaussian components; Sensitive to noise	Blind signal separation; Feature extraction	Python
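
To show what "preserves variance" and "linear method" mean for PCA, the sketch below finds the first principal direction by power iteration on the covariance matrix. Power iteration is a stand-in chosen here because it needs only the standard library. Library implementations typically use an SVD instead. The function names and toy data are assumptions for illustration.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix and the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return cov, centered

def first_component(rows, iterations=100):
    """Leading principal direction via power iteration.
    Assumes the data are not constant and the start vector is not
    orthogonal to the leading direction."""
    cov, centered = covariance_matrix(rows)
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # arbitrary unit-length start vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    total = sum(cov[a][a] for a in range(d))
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, variance / total, scores

# Toy data (made up): every point lies on the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

direction, explained, scores = first_component(rows)
print(direction)  # approximately (0.447, 0.894), i.e. (1, 2) / sqrt(5)
print(explained)  # approximately 1.0: one component carries all the variance
print(scores)     # the 1-D coordinates of each point along that direction
```

Because the points lie on a straight line, a single linear component captures all of the variance. Data lying on a curve would not compress this way, which is where the non-linear methods in the table come in.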

Method	Strengths	Limitations	Example Use Cases	Implementation
Apriori Algorithm	Well-known and widely used; Easy to understand and implement	Can be slow on large datasets; Generates a large number of candidate sets	Market basket analysis; Cross-marketing strategies	Python
FP-Growth Algorithm	Faster than Apriori; Efficient for large datasets	Memory intensive; Can be complex to implement	Frequent itemset mining in large databases; Customer purchase patterns	Python
Eclat Algorithm	Faster than Apriori; Scalable and easy to parallelize	Limited to binary attributes; Generates many candidate itemsets	Market basket analysis; Binary (presence/absence) transaction data	Python
GSP (Generalized Sequential Pattern)	Identifies sequential patterns; Flexible for various datasets	Can be computationally expensive; Not as efficient for very large databases	Customer purchase sequence analysis; Event sequence analysis	Python
RuleGrowth Algorithm	Efficient for mining sequential rules; Works well with sparse datasets	Requires careful parameter setting; Less known and used than Apriori or FP-Growth	Analyzing customer shopping sequences; Detecting patterns in web browsing data	Python
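
The Apriori row's "generates a large number of candidate sets" refers to its level-wise join-and-prune loop, sketched below with the standard library only. The function name `apriori`, the relative-support threshold, and the toy baskets are assumptions for illustration. The sketch stops at frequent itemsets and does not derive association rules.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support.
    Support is the fraction of transactions containing the itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    current = {frozenset([item]) for t in transactions for item in t}
    current = {c for c in current if support(c) >= min_support}
    frequent = {}
    k = 1
    while current:
        frequent.update((c, support(c)) for c in current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Counting step: one pass over the transactions per surviving candidate.
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Toy baskets (made up).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

result = apriori(baskets, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

On these baskets the three single items and the pairs {bread, butter} and {bread, milk} are frequent. The triple is pruned without being counted because {milk, butter} is not frequent. The repeated candidate generation and counting passes are what make the approach slow on large datasets.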

Evaluation Technique	Strengths	Limitations	Example Use Cases	Implementation
Accuracy	Simple and intuitive; Effective for balanced datasets	Misleading for imbalanced datasets; Doesn’t distinguish false positives from false negatives	General classification problems; Comparing baseline models	Python
AUC-ROC	Effective for binary classification; Summarizes performance across all thresholds	Can be overly optimistic on heavily imbalanced data; Says nothing about performance at a chosen threshold	Medical diagnosis classification; Fraud detection models	Python
Precision	Focuses on positive class; Penalizes false positives	Ignores false negatives; Not useful alone in imbalanced datasets	Spam detection; Content moderation systems	Python
Recall	Measures how many actual positives are found; Penalizes false negatives	Ignores false positives; Can be misleading if positives are rare	Disease outbreak detection; Recall-focused tasks	Python
F1-Score	Balances precision and recall; Useful for imbalanced datasets	Ignores true negatives; Weights precision and recall equally	Customer churn prediction; Sentiment analysis	Python
Cross-Validation	Helps detect overfitting; Provides robust model evaluation	Computationally expensive; May not be ideal for very large datasets	General model evaluation; Comparing multiple models	Python
The Validation Set Approach	Simple and easy to implement; Good for initial model assessment	Can overfit to the single validation set; Estimate depends on the split	Quick model prototyping; Small datasets	Python
Leave-One-Out Cross-Validation	Very detailed; Each observation used for validation exactly once	Computationally intensive; Not suitable for large datasets	Small but rich datasets; Highly sensitive models	Python
k-Fold Cross-Validation	Balances computational cost and validation accuracy; Suitable for various data sizes	Variability in results depending on how data is divided; Choice of ‘k’ can impact results	Medium-sized datasets; Model selection	Python
The Bootstrap Method	Good for estimating model accuracy; Effective for small datasets	Results can be sensitive to outliers; May overestimate accuracy for small datasets	Small or medium-sized datasets; Uncertainty estimation	Python
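
The metric rows and the cross-validation rows can both be illustrated in a few lines of standard-library Python. The sketch below computes accuracy, precision, recall, and F1 from confusion counts on a small imbalanced example, then generates k-fold train/test index splits. The function names, the convention of returning 0.0 for an undefined ratio, and the toy labels are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # 0.0 when undefined (assumed convention)
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds.
    No shuffling is done here; shuffle the data first if its order is meaningful."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n) if i < start or i >= stop]
        yield train, test
        start = stop

# Imbalanced toy labels (made up): 2 positives out of 10, one of them missed.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)  # accuracy 0.9 and precision 1.0, but recall is only 0.5

folds = list(k_fold_indices(n=10, k=3))
print([test for _, test in folds])  # every index appears in exactly one test fold
```

Accuracy looks strong here even though half of the positives are missed, which is the imbalance problem noted in the table. Calling `k_fold_indices(n, n)` gives leave-one-out cross-validation, with one fit per observation and the cost that implies.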