Summary table of models + methods

Introduction

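Throughout the course, we will cover several supervised and unsupervised machine learning models. This page summarizes those models, together with the related imputation, dimensionality reduction, pattern mining, and evaluation methods, by strengths, limitations, example use cases, and implementation.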

Method: Simple Fill
Strengths:
  • Simple and fast
  • Works well with small datasets
Limitations:
  • May not handle complex data relationships
  • Sensitive to outliers
Example Use Cases:
  • Basic data analysis
  • Quick data cleaning
Implementation: Python

Method: KNN Imputation
Strengths:
  • Can capture the relationships between features
  • Works well with moderately missing data
Limitations:
  • Computationally intensive for large datasets
  • Sensitive to the choice of k
Example Use Cases:
  • Medical data analysis
  • Market research
Implementation: Python

Method: Soft Impute
Strengths:
  • Effective for matrix completion in large datasets
  • Works well with low-rank data
Limitations:
  • Assumes low-rank data structure
  • Can be sensitive to hyperparameters
Example Use Cases:
  • Recommender systems
  • Large-scale data projects
Implementation: Python

Method: Iterative Imputer
Strengths:
  • Can model complex relationships
  • Suitable for multiple imputation
Limitations:
  • Computationally expensive
  • Depends on the choice of model
Example Use Cases:
  • Complex datasets with multiple types of missing data
Implementation: Python

Method: Iterative SVD
Strengths:
  • Good for matrix completion with low-rank assumption
  • Handles larger datasets
Limitations:
  • Sensitive to rank selection
  • Computationally demanding
Example Use Cases:
  • Image and video data processing
  • Large datasets with structure
Implementation: Python

Method: Matrix Factorization
Strengths:
  • Useful for recommendation systems
  • Can handle large-scale problems
Limitations:
  • Requires careful tuning
  • Not suitable for all types of data
Example Use Cases:
  • Recommendation engines
  • User preference analysis
Implementation: Python

Method: Nuclear Norm Minimization
Strengths:
  • Theoretically strong for matrix completion
  • Finds the lowest rank solution
Limitations:
  • Very computationally intensive
  • Impractical for very large datasets
Example Use Cases:
  • Research in theoretical data completion
  • Small to medium datasets
Implementation: Python

Method: BiScaler
Strengths:
  • Normalizes data effectively
  • Often used as a preprocessing step
Limitations:
  • Not an imputation method itself
  • Doesn’t always converge
Example Use Cases:
  • Preprocessing for other imputation methods
  • Data normalization
Implementation: Python
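
To make the contrast between simple filling and KNN imputation concrete, here is a minimal from-scratch sketch in plain Python. It is not the library implementation the table points to. The function names `mean_fill` and `knn_fill`, the use of `None` to mark missing entries, and the toy data are all illustrative assumptions.

```python
import math


def mean_fill(rows):
    """Replace each missing value (None) with the mean of its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def knn_fill(rows, k=2):
    """Replace each missing value with the mean of that column over the
    k nearest rows. Distance uses only columns observed in both rows."""

    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf  # nothing to compare on, so sort last
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other rows where column j is observed.
            candidates = [(distance(row, other), other[j])
                          for m, other in enumerate(rows)
                          if m != i and other[j] is not None]
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:  # leave the gap if no neighbour has this column
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Toy data (assumed): the second column is 10x the first, with one outlying row.
data = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, None],
    [10.0, 100.0],
]
mean_result = mean_fill(data)
knn_result = knn_fill(data, k=2)
```

On this toy data the column mean (about 43.3) is pulled upward by the outlying row, while the two nearest rows give 15.0. This matches the "sensitive to outliers" and "can capture the relationships between features" entries above, and changing `k` changes the KNN result.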
Model Type: Logistic Regression
Strengths:
  • Simple and interpretable
  • Fast to train
Limitations:
  • Assumes a linear decision boundary
  • Not suitable for complex relationships
Example Use Cases:
  • Credit approval
  • Medical diagnosis
Implementation: Python

Model Type: Decision Trees
Strengths:
  • Intuitive
  • Can model non-linear relationships
Limitations:
  • Prone to overfitting
  • Sensitive to small changes in data
Example Use Cases:
  • Customer segmentation
  • Loan default prediction
Implementation: Python

Model Type: Random Forest
Strengths:
  • Reduces overfitting compared with a single decision tree
  • Can model complex relationships
Limitations:
  • Slower to train and predict
  • Black box model
Example Use Cases:
  • Fraud detection
  • Stock price movement prediction
Implementation: Python

Model Type: Support Vector Machines (SVM)
Strengths:
  • Effective in high-dimensional spaces
  • Works well with a clear margin of separation
Limitations:
  • Sensitive to kernel choice
  • Slow on large datasets
Example Use Cases:
  • Image classification
  • Handwriting recognition
Implementation: Python

Model Type: K-Nearest Neighbors (KNN)
Strengths:
  • Simple and intuitive
  • No training phase
Limitations:
  • Slow during query phase
  • Sensitive to irrelevant features and scale
Example Use Cases:
  • Product recommendation
  • Document classification
Implementation: Python

Model Type: Neural Networks
Strengths:
  • Capable of approximating complex functions
  • Flexible architecture
  • Trainable with backpropagation
Limitations:
  • Can require a large number of parameters
  • Prone to overfitting on small data
  • Training can be slow
Example Use Cases:
  • Pattern recognition
  • Basic image classification
  • Function approximation
Implementation: Python

Model Type: Deep Learning
Strengths:
  • Can model highly complex relationships
  • Excels with vast amounts of data
  • State-of-the-art results in many domains
Limitations:
  • Requires a lot of data
  • Computationally intensive
  • Interpretability challenges
Example Use Cases:
  • Advanced image and speech recognition
  • Machine translation
  • Game playing (like AlphaGo)
Implementation: Python

Model Type: Naive Bayes
Strengths:
  • Fast
  • Works well with large feature sets
Limitations:
  • Assumes feature independence
  • Numerical input features need a distributional assumption (e.g. Gaussian) or binning
Example Use Cases:
  • Spam detection
  • Sentiment analysis
Implementation: Python

Model Type: Gradient Boosting Machines (GBM)
Strengths:
  • High performance
  • Handles non-linear relationships
Limitations:
  • Prone to overfitting if not tuned
  • Slow to train
Example Use Cases:
  • Web search ranking
  • Ecology predictions
Implementation: Python

Model Type: Rule-Based Classification
Strengths:
  • Transparent and explainable
  • Easily updated and modified
Limitations:
  • Manual rule creation can be tedious
  • May not capture complex relationships
Example Use Cases:
  • Expert systems
  • Business rule enforcement
Implementation: Python

Model Type: Bagging
Strengths:
  • Reduces variance
  • Parallelizable
Limitations:
  • Does little to reduce bias
Example Use Cases:
  • Random Forest is a popular example
Implementation: Python

Model Type: Boosting
Strengths:
  • Reduces bias
  • Combines weak learners
Limitations:
  • Sensitive to noisy data and outliers
Example Use Cases:
  • AdaBoost
  • Gradient Boosting
Implementation: Python

Model Type: XGBoost
Strengths:
  • Scalable and efficient
  • Built-in regularization
Limitations:
  • Requires careful tuning
  • Can overfit if not used correctly
Example Use Cases:
  • Competitions on Kaggle
  • Retail prediction
Implementation: Python

Model Type: Linear Discriminant Analysis (LDA)
Strengths:
  • Dimensionality reduction
  • Simple and interpretable
Limitations:
  • Assumes Gaussian distributed data and equal class covariances
Example Use Cases:
  • Face recognition
  • Marketing segmentation
Implementation: Python

Model Type: Regularized Models (Shrinkage)
Strengths:
  • Helps prevent overfitting
  • Handles collinearity
Limitations:
  • Requires parameter tuning
  • May result in loss of interpretability
Example Use Cases:
  • Ridge and Lasso regression
Implementation: Python

Model Type: Stacking
Strengths:
  • Combines multiple models
  • Can improve accuracy
Limitations:
  • Increases model complexity
  • Risk of overfitting if base models are correlated
Example Use Cases:
  • Meta-modeling
  • Kaggle competitions
Implementation: Python
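
The KNN entry above notes that the method has no training phase and is sensitive to feature scale. The sketch below illustrates both points with a tiny from-scratch classifier in plain Python. The helper names `knn_predict` and `make_min_max_scaler` and the toy data are assumptions made for illustration, and min-max scaling is only one of several possible rescaling choices.

```python
import math


def knn_predict(train, query, k=1):
    """Majority vote among the k training points closest to `query`.

    There is no fitting step: all the work happens at query time."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = [label for _, label in by_distance[:k]]
    return max(sorted(set(votes)), key=votes.count)


def make_min_max_scaler(points):
    """Return a function that rescales each feature to [0, 1].

    Assumes every feature takes at least two distinct values."""
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]

    def scale(point):
        return tuple((x - lo) / (hi - lo)
                     for x, lo, hi in zip(point, lows, highs))

    return scale


# Toy data (assumed): the first feature separates the classes,
# the second is uninformative but measured on a much larger scale.
train = [
    ((0.1, 1000.0), "a"),
    ((0.2, 5000.0), "a"),
    ((0.9, 1200.0), "b"),
    ((0.8, 5200.0), "b"),
]
query = (0.15, 5150.0)

# Raw features: the large-scale second feature dominates the distance.
raw_prediction = knn_predict(train, query, k=1)

# Rescaled features: both features contribute comparably.
scale = make_min_max_scaler([features for features, _ in train])
scaled_train = [(scale(features), label) for features, label in train]
scaled_prediction = knn_predict(scaled_train, scale(query), k=1)
```

With the raw features the query is assigned to class "b" because of the second feature alone. After rescaling it is assigned to class "a", in line with its value on the informative feature.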
Model Type: Linear Regression
Strengths:
  • Simple and interpretable
Limitations:
  • Assumes linear relationship
  • Sensitive to outliers
Example Use Cases:
  • Sales forecasting
  • Risk assessment
Implementation: Python

Model Type: Polynomial Regression
Strengths:
  • Can model non-linear relationships
Limitations:
  • Can overfit with high degrees
Example Use Cases:
  • Growth prediction
  • Non-linear trend modeling
Implementation: Python

Model Type: Ridge Regression
Strengths:
  • Helps prevent overfitting
  • Regularizes the model
Limitations:
  • Does not perform feature selection
Example Use Cases:
  • High-dimensional data
  • Preventing overfitting
Implementation: Python

Model Type: Lasso Regression
Strengths:
  • Feature selection
  • Regularizes the model
Limitations:
  • May exclude useful variables
Example Use Cases:
  • Feature selection
  • High-dimensional datasets
Implementation: Python

Model Type: Elastic Net Regression
Strengths:
  • Balance between Ridge and Lasso
Limitations:
  • Requires tuning the mixing parameter
Example Use Cases:
  • High-dimensional datasets with correlated features
Implementation: Python

Model Type: Quantile Regression
Strengths:
  • Models the median or other quantiles
Limitations:
  • Less interpretable than ordinary regression
Example Use Cases:
  • Median house price prediction
  • Financial quantile modeling
Implementation: Python

Model Type: Support Vector Regression (SVR)
Strengths:
  • Flexible
  • Can handle non-linear relationships
Limitations:
  • Sensitive to kernel and hyperparameters
Example Use Cases:
  • Stock price prediction
  • Non-linear trend modeling
Implementation: Python

Model Type: Decision Tree Regression
Strengths:
  • Handles non-linear data
  • Interpretable
Limitations:
  • Can overfit on noisy data
Example Use Cases:
  • Price prediction
  • Quality assessment
Implementation: Python

Model Type: Random Forest Regression
Strengths:
  • Handles large datasets
  • Reduces overfitting
Limitations:
  • Requires more computational resources
Example Use Cases:
  • Large datasets
  • Environmental modeling
Implementation: Python

Model Type: Gradient Boosting Regression
Strengths:
  • High performance
  • Can handle non-linear relationships
Limitations:
  • Prone to overfitting if not tuned
Example Use Cases:
  • Web search ranking
  • Price prediction
Implementation: Python
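
The difference between Ridge ("does not perform feature selection") and Lasso ("feature selection") can be seen in the simplest possible setting: one feature and no intercept, where both estimators have a closed form. The sketch below assumes the objectives written in the docstrings. The function names and toy data are illustrative, and library implementations typically scale the penalty differently.

```python
import math


def ols_slope(x, y):
    """Least-squares slope b minimizing sum((y - b*x)^2), no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / sxx


def ridge_slope(x, y, lam):
    """Minimizer of sum((y - b*x)^2) + lam * b^2.

    The slope shrinks towards zero but only reaches it as lam -> infinity."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)


def lasso_slope(x, y, lam):
    """Minimizer of sum((y - b*x)^2) + lam * |b| (soft thresholding).

    The slope is exactly zero once lam / 2 >= |sum(x*y)|."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    return math.copysign(shrunk, sxy) / sxx


# Toy data (assumed): y = 2x exactly, so the unpenalized slope is 2.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

unpenalized = ols_slope(x, y)
ridge_moderate = ridge_slope(x, y, lam=14.0)
ridge_heavy = ridge_slope(x, y, lam=1000.0)
lasso_moderate = lasso_slope(x, y, lam=14.0)
lasso_heavy = lasso_slope(x, y, lam=56.0)
```

Here the ridge slope stays positive even for a very large penalty, whereas the lasso slope becomes exactly zero at `lam=56.0`, which is how Lasso ends up dropping variables and may exclude useful ones.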
Model Type: K-Means Clustering
Strengths:
  • Simple and widely used
  • Fast for large datasets
Limitations:
  • Sensitive to initial conditions
  • Requires specifying the number of clusters
Example Use Cases:
  • Market segmentation
  • Image compression
Implementation: Python

Model Type: Hierarchical Clustering
Strengths:
  • Doesn’t require specifying the number of clusters
  • Produces a dendrogram
Limitations:
  • May be computationally expensive for large datasets
Example Use Cases:
  • Taxonomies
  • Determining evolutionary relationships
Implementation: Python

Model Type: DBSCAN (Density-Based Clustering)
Strengths:
  • Can find arbitrarily shaped clusters
  • Doesn’t require specifying the number of clusters
Limitations:
  • Sensitive to scale
  • Requires density parameters to be set
Example Use Cases:
  • Noise detection and anomaly detection
Implementation: Python

Model Type: Agglomerative Clustering
Strengths:
  • Variety of linkage criteria
  • Produces a hierarchy of clusters
Limitations:
  • Not scalable for very large datasets
Example Use Cases:
  • Sociological hierarchies
  • Taxonomies
Implementation: Python

Model Type: Mean Shift Clustering
Strengths:
  • No need to specify number of clusters
  • Can find arbitrarily shaped clusters
Limitations:
  • Computationally expensive
  • Bandwidth parameter selection is crucial
Example Use Cases:
  • Image analysis
  • Computer vision tasks
Implementation: Python

Model Type: Affinity Propagation
Strengths:
  • Automatically determines the number of clusters
  • Good for data with lots of exemplars
Limitations:
  • High computational complexity
  • Preference parameter can be difficult to choose
Example Use Cases:
  • Image recognition
  • Data with many similar exemplars
Implementation: Python

Model Type: Spectral Clustering
Strengths:
  • Can capture complex cluster structures
  • Can be used with various affinity matrices
Limitations:
  • Choice of affinity matrix is crucial
  • Can be computationally expensive
Example Use Cases:
  • Image and speech processing
  • Graph-based clustering
Implementation: Python
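
The K-Means limitations listed above (sensitivity to initial conditions, and the need to fix the number of clusters in advance) can be seen in a few lines. The following is a minimal sketch of the standard assign-then-update iteration (Lloyd's algorithm) in plain Python. The function name `k_means`, its parameters, and the toy points are illustrative assumptions, not a library API.

```python
import math


def k_means(points, initial_centroids, max_iter=100):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its points, until nothing changes.

    The number of clusters is fixed by len(initial_centroids)."""
    centroids = [tuple(map(float, c)) for c in initial_centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Toy data (assumed): two tight vertical pairs, far apart horizontally.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

good_start = k_means(points, initial_centroids=[(0, 0), (4, 0)])
poor_start = k_means(points, initial_centroids=[(0, 0), (0, 1)])
```

Starting from one centroid in each pair, the algorithm finds the left and right groups. Starting from two centroids in the same pair, it converges to a top and bottom split, which is stable but has a much larger within-cluster spread. Running from several initializations and keeping the best result is a common mitigation.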
Method: PCA
Strengths:
  • Dimensionality reduction
  • Preserves variance
Limitations:
  • Linear method
  • Not for categorical data
Example Use Cases:
  • Feature extraction
  • Data compression
Implementation: Python

Method: t-SNE
Strengths:
  • Captures non-linear structures
  • Good for visualization
Limitations:
  • Computationally expensive
  • Does not scale well to very high-dimensional inputs without prior reduction (e.g. PCA)
Example Use Cases:
  • Data visualization
  • Exploratory analysis
Implementation: Python

Method: Autoencoders
Strengths:
  • Dimensionality reduction
  • Captures non-linear relationships
Limitations:
  • Requires neural network knowledge
  • Computationally intensive
Example Use Cases:
  • Feature learning
  • Noise reduction
Implementation: Python

Method: Isolation Forest
Strengths:
  • Effective for high-dimensional data
  • Fast and scalable
Limitations:
  • Randomized, so results can vary between runs
  • May miss some anomalies
Example Use Cases:
  • Fraud detection
  • Network security
Implementation: Python

Method: SVD
Strengths:
  • Matrix factorization
  • Efficient for large datasets
Limitations:
  • Assumes linear relationships
  • Sensitive to scaling
Example Use Cases:
  • Recommender systems
  • Latent semantic analysis
Implementation: Python

Method: ICA
Strengths:
  • Identifies independent components
  • Signal separation
Limitations:
  • Assumes non-Gaussian components
  • Sensitive to noise
Example Use Cases:
  • Blind signal separation
  • Feature extraction
Implementation: Python
Method: Apriori Algorithm
Strengths:
  • Well-known and widely used
  • Easy to understand and implement
Limitations:
  • Can be slow on large datasets
  • Generates a large number of candidate sets
Example Use Cases:
  • Market basket analysis
  • Cross-marketing strategies
Implementation: Python

Method: FP-Growth Algorithm
Strengths:
  • Faster than Apriori
  • Efficient for large datasets
Limitations:
  • Memory intensive
  • Can be complex to implement
Example Use Cases:
  • Frequent itemset mining in large databases
  • Customer purchase patterns
Implementation: Python

Method: Eclat Algorithm
Strengths:
  • Faster than Apriori
  • Scalable and easy to parallelize
Limitations:
  • Limited to binary attributes
  • Generates many candidate itemsets
Example Use Cases:
  • Market basket analysis
  • Binary classification tasks
Implementation: Python

Method: GSP (Generalized Sequential Pattern)
Strengths:
  • Identifies sequential patterns
  • Flexible for various datasets
Limitations:
  • Can be computationally expensive
  • Not as efficient for very large databases
Example Use Cases:
  • Customer purchase sequence analysis
  • Event sequence analysis
Implementation: Python

Method: RuleGrowth Algorithm
Efficient for mining sequential rules and suited to sparse datasets.
Strengths:
  • Efficient for mining sequential rules
  • Works well with sparse datasets
Limitations:
  • Requires careful parameter setting
  • Less known and used than Apriori or FP-Growth
Example Use Cases:
  • Analyzing customer shopping sequences
  • Detecting patterns in web browsing data
Implementation: Python
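
As an illustration of why Apriori is described as easy to implement but candidate-heavy, here is a minimal sketch of its frequent-itemset stage in plain Python. The function name `apriori`, the `min_support` parameter (a fraction of transactions), and the toy baskets are assumptions for illustration. Rule generation and the optimizations used by real libraries are left out.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset contained in at least
    a `min_support` fraction of the transactions."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        frequent.update({s: support(s) for s in current})
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        # Count step: one pass over the data per level.
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Toy market baskets (assumed).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent_itemsets = apriori(baskets, min_support=0.5)
```

On these four baskets the sketch keeps the three single items plus {bread, milk} and {bread, butter}. The three-item candidate is pruned without counting because {milk, butter} is not frequent. The repeated join and count passes are where the cost on large datasets comes from.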
Technique: Accuracy
Strengths:
  • Simple and intuitive
  • Effective for balanced datasets
Limitations:
  • Misleading for imbalanced datasets
  • Doesn’t distinguish false positives from false negatives
Example Use Cases:
  • General classification problems
  • Comparing baseline models
Implementation: Python

Technique: AUC-ROC
Strengths:
  • Effective for binary classification
  • More informative than accuracy on imbalanced datasets
Limitations:
  • Can be overly optimistic on heavily imbalanced data
  • Not threshold-specific
Example Use Cases:
  • Medical diagnosis classification
  • Fraud detection models
Implementation: Python

Technique: Precision
Strengths:
  • Focuses on positive class
  • Penalizes false positives
Limitations:
  • Ignores false negatives
  • Not useful alone in imbalanced datasets
Example Use Cases:
  • Spam detection
  • Content moderation systems
Implementation: Python

Technique: Recall
Strengths:
  • Identifies actual positives well
  • Penalizes false negatives
Limitations:
  • Ignores false positives
  • Can be misleading if positives are rare
Example Use Cases:
  • Disease outbreak detection
  • Recall-focused tasks
Implementation: Python

Technique: F1-Score
Strengths:
  • Balances precision and recall
  • Useful for imbalanced datasets
Limitations:
  • May not reflect true model performance
  • Depends on balance of precision and recall
Example Use Cases:
  • Customer churn prediction
  • Sentiment analysis
Implementation: Python

Technique: Cross-Validation
Strengths:
  • Helps detect overfitting
  • Provides robust model evaluation
Limitations:
  • Computationally expensive
  • May not be ideal for very large datasets
Example Use Cases:
  • General model evaluation
  • Comparing multiple models
Implementation: Python

Technique: The Validation Set Approach
Strengths:
  • Simple and easy to implement
  • Good for initial model assessment
Limitations:
  • Can lead to overfitting
  • Dependent on the split
Example Use Cases:
  • Quick model prototyping
  • Small datasets
Implementation: Python

Technique: Leave-One-Out Cross-Validation
Strengths:
  • Very thorough, since each model is trained on all but one observation
  • Each observation used for validation exactly once
Limitations:
  • Computationally intensive
  • Not suitable for large datasets
Example Use Cases:
  • Small but rich datasets
  • Highly sensitive models
Implementation: Python

Technique: k-Fold Cross-Validation
Strengths:
  • Balances computational cost and validation accuracy
  • Suitable for various data sizes
Limitations:
  • Variability in results depending on how data is divided
  • Choice of ‘k’ can impact results
Example Use Cases:
  • Medium-sized datasets
  • Model selection
Implementation: Python

Technique: The Bootstrap Method
Strengths:
  • Good for estimating model accuracy
  • Effective for small datasets
Limitations:
  • Results can be sensitive to outliers
  • May overestimate accuracy for small datasets
Example Use Cases:
  • Small or medium-sized datasets
  • Uncertainty estimation
Implementation: Python
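
To show why accuracy is described above as misleading for imbalanced datasets, the sketch below computes accuracy, precision, recall, and F1 from scratch for two hypothetical classifiers on a dataset with 5 positives out of 100. The function name `classification_metrics`, the convention of returning 0.0 when a ratio is undefined, and the toy predictions are all assumptions for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from two label lists.

    Ratios with a zero denominator are reported as 0.0 (a convention
    chosen for this sketch)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (assumed): 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Classifier A never predicts the positive class.
always_negative = [0] * 100

# Classifier B finds 4 of the 5 positives but raises 6 false alarms.
detector = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89

baseline_scores = classification_metrics(y_true, always_negative)
detector_scores = classification_metrics(y_true, detector)
```

The classifier that never predicts a positive reaches 0.95 accuracy with zero recall. The detector scores slightly lower on accuracy (0.93) but has precision 0.4 and recall 0.8, which is the kind of difference the precision, recall, and F1 rows are meant to surface.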