Method | Strengths | Limitations | Example Use Cases | Implementation |
---|---|---|---|---|
Simple Fill | Simple and fast; works well with small datasets | May not handle complex data relationships; sensitive to outliers | Basic data analysis; quick data cleaning | Python |
KNN Imputation | Can capture the relationships between features; works well with moderately missing data | Computationally intensive for large datasets; sensitive to the choice of k | Medical data analysis; market research | Python |
Soft Impute | Effective for matrix completion in large datasets; works well with low-rank data | Assumes low-rank data structure; can be sensitive to hyperparameters | Recommender systems; large-scale data projects | Python |
Iterative Imputer | Can model complex relationships; suitable for multiple imputation | Computationally expensive; depends on the choice of model | Complex datasets with multiple types of missing data | Python |
Iterative SVD | Good for matrix completion with low-rank assumption; handles larger datasets | Sensitive to rank selection; computationally demanding | Image and video data processing; large datasets with structure | Python |
Matrix Factorization | Useful for recommendation systems; can handle large-scale problems | Requires careful tuning; not suitable for all types of data | Recommendation engines; user preference analysis | Python |
Nuclear Norm Minimization | Theoretically strong for matrix completion; finds the lowest rank solution | Very computationally intensive; impractical for very large datasets | Research in theoretical data completion; small to medium datasets | Python |
BiScaler | Normalizes data effectively; often used as a preprocessing step | Not an imputation method itself; doesn’t always converge | Preprocessing for other imputation methods; data normalization | Python |
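
As a rough illustration of the simplest row in the table above, the sketch below fills each missing entry with the mean of the observed values in its column. It is a from-scratch example in plain Python, not the API of any imputation library; the function name `mean_fill`, the use of `None` to mark missing values, and the toy data are assumptions made for illustration.

```python
from statistics import mean


def mean_fill(rows):
    """Replace None entries with the mean of the observed values in the same column."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Assumption: every column has at least one observed value.
        col_means.append(mean(observed))
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical toy data: two columns, one missing value in each.
data = [
    [1.0, 10.0],
    [None, 20.0],
    [3.0, None],
]
filled = mean_fill(data)
print(filled)  # [[1.0, 10.0], [2.0, 20.0], [3.0, 15.0]]
```

Because the fill value is a column mean, a single extreme value shifts every imputed entry in that column, which is the outlier sensitivity listed under Limitations.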
Summary table of models + methods
Introduction
Throughout the course, we will go over several supervised and unsupervised machine learning models. This page summarizes the models.
Model Type | Strengths | Limitations | Example Use Cases | Implementation |
---|---|---|---|---|
Logistic Regression | Simple and interpretable; fast to train | Assumes linear decision boundaries; not suitable for complex relationships | Credit approval; medical diagnosis | Python |
Decision Trees | Intuitive; can model non-linear relationships | Prone to overfitting; sensitive to small changes in data | Customer segmentation; loan default prediction | Python |
Random Forest | Reduces overfitting; can model complex relationships | Slower to train and predict; black box model | Fraud detection; stock price movement prediction | Python |
Support Vector Machines (SVM) | Effective in high-dimensional spaces; works well with a clear margin of separation | Sensitive to kernel choice; slow on large datasets | Image classification; handwriting recognition | Python |
K-Nearest Neighbors (KNN) | Simple and intuitive; no training phase | Slow during query phase; sensitive to irrelevant features and scale | Product recommendation; document classification | Python |
Neural Networks | Capable of approximating complex functions; flexible architecture; trainable with backpropagation | Can require a large number of parameters; prone to overfitting on small data; training can be slow | Pattern recognition; basic image classification; function approximation | Python |
Deep Learning | Can model highly complex relationships; excels with vast amounts of data; state-of-the-art results in many domains | Requires a lot of data; computationally intensive; interpretability challenges | Advanced image and speech recognition; machine translation; game playing (like AlphaGo) | Python |
Naive Bayes | Fast; works well with large feature sets | Assumes feature independence; numerical input features need a distributional assumption (e.g. Gaussian) | Spam detection; sentiment analysis | Python |
Gradient Boosting Machines (GBM) | High performance; handles non-linear relationships | Prone to overfitting if not tuned; slow to train | Web search ranking; ecology predictions | Python |
Rule-Based Classification | Transparent and explainable; easily updated and modified | Manual rule creation can be tedious; may not capture complex relationships | Expert systems; business rule enforcement | Python |
Bagging | Reduces variance; parallelizable | May not handle bias well | Random Forest is a popular example | Python |
Boosting | Reduces bias; combines weak learners | Sensitive to noisy data and outliers | AdaBoost; Gradient Boosting | Python |
XGBoost | Scalable and efficient; built-in regularization | Requires careful tuning; can overfit if not used correctly | Competitions on Kaggle; retail prediction | Python |
Linear Discriminant Analysis (LDA) | Dimensionality reduction; simple and interpretable | Assumes Gaussian distributed data and equal class covariances | Face recognition; marketing segmentation | Python |
Regularized Models (Shrinking) | Prevents overfitting; handles collinearity | Requires parameter tuning; may result in loss of interpretability | Ridge and Lasso regression | Python |
Stacking | Combines multiple models; can improve accuracy | Increases model complexity; risk of overfitting if base models are correlated | Meta-modeling; Kaggle competitions | Python |
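
To make one row of the table above concrete, here is a minimal sketch of K-Nearest Neighbors classification written from scratch with the standard library. It is not a library implementation; the function name `knn_predict`, the default `k=3`, and the toy points are assumptions chosen for illustration.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # There is no training phase: all the work happens here, at query time.
    order = sorted(
        range(len(train_points)),
        key=lambda i: dist(train_points[i], query),  # Euclidean distance
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.2, 1.4)))  # a
print(knn_predict(points, labels, (8.4, 8.2)))  # b
```

Every prediction sorts the whole training set by distance, which is why the table lists a slow query phase, and the raw Euclidean distance is why unscaled or irrelevant features can dominate the vote.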
Model Type | Strengths | Limitations | Example Use Cases | Implementation |
---|---|---|---|---|
Linear Regression | Simple and interpretable | Assumes linear relationship; sensitive to outliers | Sales forecasting; risk assessment | Python |
Polynomial Regression | Can model non-linear relationships | Can overfit with high degrees | Growth prediction; non-linear trend modeling | Python |
Ridge Regression | Prevents overfitting; regularizes the model | Does not perform feature selection | High-dimensional data; preventing overfitting | Python |
Lasso Regression | Feature selection; regularizes the model | May exclude useful variables | Feature selection; high-dimensional datasets | Python |
Elastic Net Regression | Balance between Ridge and Lasso | Requires tuning for mixing parameter | High-dimensional datasets with correlated features | Python |
Quantile Regression | Models the median or other quantiles | Less interpretable than ordinary regression | Median house price prediction; financial quantiles modeling | Python |
Support Vector Regression (SVR) | Flexible; can handle non-linear relationships | Sensitive to kernel and hyperparameters | Stock price prediction; non-linear trend modeling | Python |
Decision Tree Regression | Handles non-linear data; interpretable | Can overfit on noisy data | Price prediction; quality assessment | Python |
Random Forest Regression | Handles large datasets; reduces overfitting | Requires more computational resources | Large datasets; environmental modeling | Python |
Gradient Boosting Regression | High performance; can handle non-linear relationships | Prone to overfitting if not tuned | Web search ranking; price prediction | Python |
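
The difference between the Linear Regression and Ridge Regression rows can be sketched with a single feature, where both fits have a closed form. The example below is a from-scratch illustration rather than any library's API; the function name `fit_line`, the parameter name `ridge_penalty`, and the toy data are assumptions. It penalizes only the slope and leaves the intercept unpenalized.

```python
def fit_line(xs, ys, ridge_penalty=0.0):
    """Fit y = intercept + slope * x by least squares, optionally with a ridge penalty.

    With ridge_penalty = 0 this is ordinary least squares. A positive penalty
    adds ridge_penalty * slope**2 to the squared-error loss and shrinks the slope.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / (sxx + ridge_penalty)
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical toy data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]

ols = fit_line(xs, ys)
ridge = fit_line(xs, ys, ridge_penalty=10.0)
print(ols)    # (2.0, 0.0)
print(ridge)  # (1.0, 3.0)
```

The penalty pulls the slope from 2.0 toward zero but never sets it exactly to zero, which matches the table's note that Ridge does not perform feature selection.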
Model Type | Strengths | Limitations | Example Use Cases | Implementation |
---|---|---|---|---|
K-Means Clustering | Simple and widely used; fast for large datasets | Sensitive to initial conditions; requires specifying the number of clusters | Market segmentation; image compression | Python |
Hierarchical Clustering | Doesn’t require specifying the number of clusters; produces a dendrogram | May be computationally expensive for large datasets | Taxonomies; determining evolutionary relationships | Python |
DBSCAN (Density-Based Clustering) | Can find arbitrarily shaped clusters; doesn’t require specifying the number of clusters | Sensitive to scale; requires density parameters to be set | Noise detection and anomaly detection | Python |
Agglomerative Clustering | Variety of linkage criteria; produces a hierarchy of clusters | Not scalable for very large datasets | Sociological hierarchies; taxonomies | Python |
Mean Shift Clustering | No need to specify number of clusters; can find arbitrarily shaped clusters | Computationally expensive; bandwidth parameter selection is crucial | Image analysis; computer vision tasks | Python |
Affinity Propagation | Automatically determines the number of clusters; good for data with lots of exemplars | High computational complexity; preference parameter can be difficult to choose | Image recognition; data with many similar exemplars | Python |
Spectral Clustering | Can capture complex cluster structures; can be used with various affinity matrices | Choice of affinity matrix is crucial; can be computationally expensive | Image and speech processing; graph-based clustering | Python |
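
The K-Means row can be illustrated with a short from-scratch sketch of Lloyd's algorithm, which alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its cluster. The function name `k_means`, the fixed iteration count, and the toy points are assumptions for illustration, and starting centroids are passed in explicitly rather than chosen at random.

```python
from math import dist


def k_means(points, centroids, n_iter=10):
    """Run a fixed number of Lloyd iterations from the given starting centroids."""
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical toy data: two compact groups in the plane.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)  # roughly (1.33, 1.33) and (8.33, 8.33)
```

Both limitations from the table show up in the signature: the number of clusters is fixed by how many starting centroids are supplied, and a different starting list can lead to a different final partition.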
Method | Strengths | Limitations | Example Use Cases | Implementation |
---|---|---|---|---|
PCA | Dimensionality reduction; preserves variance | Linear method; not for categorical data | Feature extraction; data compression | Python |
t-SNE | Captures non-linear structures; good for visualization | Computationally expensive; slow on very high-dimensional input unless dimensionality is reduced first | Data visualization; exploratory analysis | Python |
Autoencoders | Dimensionality reduction; captures non-linear relationships | Requires neural network knowledge; computationally intensive | Feature learning; noise reduction | Python |
Isolation Forest | Effective for high-dimensional data; fast and scalable | Randomized, so results can vary between runs; may miss some anomalies | Fraud detection; network security | Python |
SVD | Matrix factorization; efficient for large datasets | Assumes linear relationships; sensitive to scaling | Recommender systems; latent semantic analysis | Python |
ICA | Identifies independent components; signal separation | Assumes non-Gaussian components; sensitive to noise | Blind signal separation; feature extraction | Python |
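
As a hedged illustration of the PCA row, the sketch below finds the first principal component of a small dataset by power iteration on the sample covariance matrix. This is one simple way to obtain the leading component, not how any particular library computes PCA; the function name `first_principal_component`, the iteration count, and the toy data are assumptions. It also assumes the covariance matrix has a single dominant eigenvalue.

```python
from math import sqrt


def first_principal_component(rows, n_iter=100):
    """Return (unit direction, variance along it) for the leading principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    total_variance = sum(cov[a][a] for a in range(d))
    return v, variance, total_variance


# Hypothetical toy data lying close to the line y = x.
data = [(2.0, 1.9), (4.0, 4.2), (6.0, 5.8), (8.0, 8.1)]
direction, variance, total_variance = first_principal_component(data)
explained_ratio = variance / total_variance
print(direction)        # both entries near 0.7, i.e. roughly the y = x direction
print(explained_ratio)  # close to 1: one component captures almost all the variance
```

Projecting the data onto `direction` would reduce it from two dimensions to one while keeping most of the variance, which is the sense in which the table says PCA preserves variance. The component is a straight line, which is the "linear method" limitation.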
Method | Strengths | Limitations | Example Use Cases | Implementation |
---|---|---|---|---|
Apriori Algorithm | Well-known and widely used; easy to understand and implement | Can be slow on large datasets; generates a large number of candidate sets | Market basket analysis; cross-marketing strategies | Python |
FP-Growth Algorithm | Faster than Apriori; efficient for large datasets | Memory intensive; can be complex to implement | Frequent itemset mining in large databases; customer purchase patterns | Python |
Eclat Algorithm | Faster than Apriori; scalable and easy to parallelize | Limited to binary attributes; generates many candidate itemsets | Market basket analysis; binary classification tasks | Python |
GSP (Generalized Sequential Pattern) | Identifies sequential patterns; flexible for various datasets | Can be computationally expensive; not as efficient for very large databases | Customer purchase sequence analysis; event sequence analysis | Python |
RuleGrowth Algorithm | Efficient for mining sequential rules; works well with sparse datasets | Requires careful parameter setting; less known and used than Apriori or FP-Growth | Analyzing customer shopping sequences; detecting patterns in web browsing data | Python |
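
The Apriori row can be sketched as a level-by-level search: count the support of candidate itemsets, keep those above a threshold, and build larger candidates only from itemsets that survived. The code below is a simplified from-scratch illustration, not a library implementation; the function name `frequent_itemsets`, the `min_support` threshold of 0.5, and the toy transactions are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset whose support is at least min_support."""
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    result = {}
    size = 1
    while current:
        kept = {s: support(s) for s in current}
        kept = {s: sup for s, sup in kept.items() if sup >= min_support}
        result.update(kept)
        size += 1
        # Candidate generation: join frequent itemsets into sets of the next size,
        # then prune any candidate with an infrequent subset (the Apriori property).
        candidates = {a | b for a in kept for b in kept if len(a | b) == size}
        current = [
            c
            for c in candidates
            if all(frozenset(sub) in kept for sub in combinations(c, size - 1))
        ]
    return result


# Hypothetical toy market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
supports = frequent_itemsets(baskets, min_support=0.5)

# Confidence of the rule {butter} -> {bread}: support(both) / support(butter).
confidence = supports[frozenset({"bread", "butter"})] / supports[frozenset({"butter"})]
print(confidence)  # 1.0
```

Each level rescans all transactions for every candidate, which is the source of the slowness and the large candidate sets that the table lists as limitations.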
Technique | Strengths | Limitations | Example Use Cases | Implementation |
---|---|---|---|---|
Accuracy | Simple and intuitive; effective for balanced datasets | Misleading for imbalanced datasets; doesn’t distinguish false positives from false negatives | General classification problems; comparing baseline models | Python |
AUC-ROC | Effective for binary classification; more informative than accuracy on imbalanced datasets | Can be overly optimistic under severe class imbalance; not threshold-specific | Medical diagnosis classification; fraud detection models | Python |
Precision | Focuses on positive class; reduces false positives | Ignores false negatives; not useful alone in imbalanced datasets | Spam detection; content moderation systems | Python |
Recall | Identifies actual positives well; minimizes false negatives | Ignores false positives; can be misleading if positives are rare | Disease outbreak detection; recall-focused tasks | Python |
F1-Score | Balances precision and recall; useful for imbalanced datasets | May not reflect true model performance; depends on balance of precision and recall | Customer churn prediction; sentiment analysis | Python |
Cross-Validation | Helps detect overfitting; provides robust model evaluation | Computationally expensive; may not be ideal for very large datasets | General model evaluation; comparing multiple models | Python |
The Validation Set Approach | Simple and easy to implement; good for initial model assessment | Can lead to overfitting; dependent on the split | Quick model prototyping; small datasets | Python |
Leave-One-Out Cross-Validation | Very detailed; each observation used for validation exactly once | Computationally intensive; not suitable for large datasets | Small but rich datasets; highly sensitive models | Python |
k-Fold Cross-Validation | Balances computational cost and validation accuracy; suitable for various data sizes | Variability in results depending on how data is divided; choice of ‘k’ can impact results | Medium-sized datasets; model selection | Python |
The Bootstrap Method | Good for estimating model accuracy; effective for small datasets | Results can be sensitive to outliers; may overestimate accuracy for small datasets | Small or medium-sized datasets; uncertainty estimation | Python |
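
Several rows of the table above can be made concrete in a few lines. The sketch below computes accuracy, precision, recall, and F1 from predicted and true labels, and generates k-fold train/validation index splits. It is a from-scratch illustration, not a library API; the function names `classification_metrics` and `k_fold_indices`, the convention of returning 0.0 when a ratio is undefined, and the toy labels are assumptions. The folds are contiguous and unshuffled for simplicity.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention: 0.0 if undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices); each index is validated exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


# Hypothetical imbalanced example: 2 positives out of 10, one of them missed.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
# {'accuracy': 0.9, 'precision': 1.0, 'recall': 0.5, 'f1': 0.667}

folds = list(k_fold_indices(5, 2))
print(folds)  # [([3, 4], [0, 1, 2]), ([0, 1, 2], [3, 4])]
```

The example shows why accuracy is flagged as misleading on imbalanced data: it reports 0.9 even though half of the positives were missed, which recall and F1 expose. Setting `k` equal to the number of samples turns `k_fold_indices` into leave-one-out cross-validation.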