| Method | Strengths | Limitations | Example Use Cases | Implementation |
|---|---|---|---|---|
| Simple Fill | Simple and fast; Works well with small datasets | May not handle complex data relationships; Sensitive to outliers | Basic data analysis; Quick data cleaning | Python |
| KNN Imputation | Can capture the relationships between features; Works well with moderately missing data | Computationally intensive for large datasets; Sensitive to the choice of k | Medical data analysis; Market research | Python |
| Soft Impute | Effective for matrix completion in large datasets; Works well with low-rank data | Assumes low-rank data structure; Can be sensitive to hyperparameters | Recommender systems; Large-scale data projects | Python |
| Iterative Imputer | Can model complex relationships; Suitable for multiple imputation | Computationally expensive; Depends on the choice of model | Complex datasets with multiple types of missing data | Python |
| Iterative SVD | Good for matrix completion with low-rank assumption; Handles larger datasets | Sensitive to rank selection; Computationally demanding | Image and video data processing; Large datasets with structure | Python |
| Matrix Factorization | Useful for recommendation systems; Can handle large-scale problems | Requires careful tuning; Not suitable for all types of data | Recommendation engines; User preference analysis | Python |
| Nuclear Norm Minimization | Theoretically strong for matrix completion; Finds the lowest rank solution | Very computationally intensive; Impractical for very large datasets | Research in theoretical data completion; Small to medium datasets | Python |
| BiScaler | Normalizes data effectively; Often used as a preprocessing step | Not an imputation method itself; Doesn’t always converge | Preprocessing for other imputation methods; Data normalization | Python |
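
As a rough illustration of the Simple Fill row, the sketch below fills missing entries with either the mean or the median of the observed values. The function name `simple_fill` and the sample column are hypothetical and exist only for this example; the point is that a single outlier pulls the mean-based fill value far more than the median-based one.

```python
from statistics import mean, median


def simple_fill(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical column: one missing value and one outlier (100).
column = [1, 2, None, 3, 100]
mean_filled = simple_fill(column, "mean")      # fill value is dragged up by the outlier
median_filled = simple_fill(column, "median")  # fill value stays near the bulk of the data
print(mean_filled)
print(median_filled)
```

In practice a library imputer would normally be used; this version only makes the arithmetic visible.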
Summary table of models + methods
Introduction
Throughout the course, we will go over several supervised and unsupervised machine learning models. This page summarizes the models.
| Model Type | Strengths | Limitations | Example Use Cases | Implementation |
|---|---|---|---|---|
| Logistic Regression | Simple and interpretable; Fast to train | Assumes linear boundaries; Not suitable for complex relationships | Credit approval; Medical diagnosis | Python |
| Decision Trees | Intuitive; Can model non-linear relationships | Prone to overfitting; Sensitive to small changes in data | Customer segmentation; Loan default prediction | Python |
| Random Forest | Handles overfitting; Can model complex relationships | Slower to train and predict; Black box model | Fraud detection; Stock price movement prediction | Python |
| Support Vector Machines (SVM) | Effective in high dimensional spaces; Works well with clear margin of separation | Sensitive to kernel choice; Slow on large datasets | Image classification; Handwriting recognition | Python |
| K-Nearest Neighbors (KNN) | Simple and intuitive; No training phase | Slow during query phase; Sensitive to irrelevant features and scale | Product recommendation; Document classification | Python |
| Neural Networks | Capable of approximating complex functions; Flexible architecture; Trainable with backpropagation | Can require a large number of parameters; Prone to overfitting on small data; Training can be slow | Pattern recognition; Basic image classification; Function approximation | Python |
| Deep Learning | Can model highly complex relationships; Excels with vast amounts of data; State-of-the-art results in many domains | Requires a lot of data; Computationally intensive; Interpretability challenges | Advanced image and speech recognition; Machine translation; Game playing (like AlphaGo) | Python |
| Naive Bayes | Fast; Works well with large feature sets | Assumes feature independence; Not suitable for numerical input features | Spam detection; Sentiment analysis | Python |
| Gradient Boosting Machines (GBM) | High performance; Handles non-linear relationships | Prone to overfitting if not tuned; Slow to train | Web search ranking; Ecology predictions | Python |
| Rule-Based Classification | Transparent and explainable; Easily updated and modified | Manual rule creation can be tedious; May not capture complex relationships | Expert systems; Business rule enforcement | Python |
| Bagging | Reduces variance; Parallelizable | May not handle bias well | Random Forest is a popular example | Python |
| Boosting | Reduces bias; Combines weak learners | Sensitive to noisy data and outliers | AdaBoost; Gradient Boosting | Python |
| XGBoost | Scalable and efficient; Regularization | Requires careful tuning; Can overfit if not used correctly | Competitions on Kaggle; Retail prediction | Python |
| Linear Discriminant Analysis (LDA) | Dimensionality reduction; Simple and interpretable | Assumes Gaussian distributed data and equal class covariances | Face recognition; Marketing segmentation | Python |
| Regularized Models (Shrinking) | Prevents overfitting; Handles collinearity | Requires parameter tuning; May result in loss of interpretability | Ridge and Lasso regression | Python |
| Stacking | Combines multiple models; Can improve accuracy | Increases model complexity; Risk of overfitting if base models are correlated | Meta-modeling; Kaggle competitions | Python |
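
The KNN row lists sensitivity to irrelevant features and scale as a limitation. The following is a minimal sketch of that effect, using a from-scratch nearest-neighbor vote. The helper names (`knn_predict`, `standardize`) and the toy data are assumptions made up for this example: the first feature separates the two classes, while the second is irrelevant but measured on a much larger scale.

```python
from collections import Counter
from math import dist
from statistics import mean, pstdev


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points closest to `query`."""
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


def standardize(rows, reference):
    """Scale each column of `rows` with the column means and stds of `reference`."""
    columns = list(zip(*reference))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sigmas)) for row in rows]


# Hypothetical data: feature 0 carries the class signal, feature 1 is large-scale noise.
train_X = [(1.0, 1000), (1.2, 5000), (0.9, 9000), (3.0, 1200), (3.1, 5200), (2.9, 8800)]
train_y = ["a", "a", "a", "b", "b", "b"]
query = (1.0, 5100)

# On raw features the distance is dominated by the large-scale column.
raw_prediction = knn_predict(train_X, train_y, query)

# After standardization both columns contribute on a comparable scale.
scaled_train = standardize(train_X, train_X)
scaled_query = standardize([query], train_X)[0]
scaled_prediction = knn_predict(scaled_train, train_y, scaled_query)

print(raw_prediction, scaled_prediction)
```

There is no fitting step here, which matches the "no training phase" entry: all the work happens at query time, when every stored point is compared with the query.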
| Model Type | Strengths | Limitations | Example Use Cases | Implementation |
|---|---|---|---|---|
| Linear Regression | Simple and interpretable | Assumes linear relationship; Sensitive to outliers | Sales forecasting; Risk assessment | Python |
| Polynomial Regression | Can model non-linear relationships | Can overfit with high degrees | Growth prediction; Non-linear trend modeling | Python |
| Ridge Regression | Prevents overfitting; Regularizes the model | Does not perform feature selection | High-dimensional data; Preventing overfitting | Python |
| Lasso Regression | Feature selection; Regularizes the model | May exclude useful variables | Feature selection; High-dimensional datasets | Python |
| Elastic Net Regression | Balance between Ridge and Lasso | Requires tuning for mixing parameter | High-dimensional datasets with correlated features | Python |
| Quantile Regression | Models the median or other quantiles | Less interpretable than ordinary regression | Median house price prediction; Financial quantiles modeling | Python |
| Support Vector Regression (SVR) | Flexible; Can handle non-linear relationships | Sensitive to kernel and hyperparameters | Stock price prediction; Non-linear trend modeling | Python |
| Decision Tree Regression | Handles non-linear data; Interpretable | Can overfit on noisy data | Price prediction; Quality assessment | Python |
| Random Forest Regression | Handles large datasets; Reduces overfitting | Requires more computational resources | Large datasets; Environmental modeling | Python |
| Gradient Boosting Regression | High performance; Can handle non-linear relationships | Prone to overfitting if not tuned | Web search ranking; Price prediction | Python |
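
To make "regularizes the model" in the Ridge row concrete, here is a small sketch for the one-feature case with an unpenalized intercept, where the ridge slope has the closed form $S_{xy} / (S_{xx} + \alpha)$. The function name `ridge_slope` and the data are hypothetical; with `alpha=0` the formula reduces to ordinary least squares.

```python
from statistics import mean


def ridge_slope(x, y, alpha=0.0):
    """Slope minimizing sum((y - a - b*x)^2) + alpha * b^2 for a single feature."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / (sxx + alpha)


# Hypothetical noise-free data generated from y = 2x + 1.
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]

ols = ridge_slope(x, y)                # no penalty: recovers the true slope
shrunk = ridge_slope(x, y, alpha=10)   # penalty pulls the slope toward zero
print(ols, shrunk)
```

The slope shrinks as `alpha` grows but never becomes exactly zero, which is consistent with the table's note that Ridge does not perform feature selection.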
| Model Type | Strengths | Limitations | Example Use Cases | Implementation |
|---|---|---|---|---|
| K-Means Clustering | Simple and widely used; Fast for large datasets | Sensitive to initial conditions; Requires specifying the number of clusters | Market segmentation; Image compression | Python |
| Hierarchical Clustering | Doesn’t require specifying the number of clusters; Produces a dendrogram | May be computationally expensive for large datasets | Taxonomies; Determining evolutionary relationships | Python |
| DBSCAN (Density-Based Clustering) | Can find arbitrarily shaped clusters; Doesn’t require specifying the number of clusters | Sensitive to scale; Requires density parameters to be set | Noise detection and anomaly detection | Python |
| Agglomerative Clustering | Variety of linkage criteria; Produces a hierarchy of clusters | Not scalable for very large datasets | Sociological hierarchies; Taxonomies | Python |
| Mean Shift Clustering | No need to specify number of clusters; Can find arbitrarily shaped clusters | Computationally expensive; Bandwidth parameter selection is crucial | Image analysis; Computer vision tasks | Python |
| Affinity Propagation | Automatically determines the number of clusters; Good for data with lots of exemplars | High computational complexity; Preference parameter can be difficult to choose | Image recognition; Data with many similar exemplars | Python |
| Spectral Clustering | Can capture complex cluster structures; Can be used with various affinity matrices | Choice of affinity matrix is crucial; Can be computationally expensive | Image and speech processing; Graph-based clustering | Python |
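
The K-Means row can be sketched with a plain implementation of Lloyd's algorithm, which alternates an assignment step and an update step. This is an illustrative version only: the function name `k_means` is hypothetical, the starting centroids are passed in by the caller (which is where both the number of clusters and the initial conditions enter), and the points are invented.

```python
from math import dist
from statistics import mean


def k_means(points, centroids, max_iter=100):
    """Lloyd's algorithm with caller-supplied starting centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(mean(axis) for axis in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return centroids, clusters


# Hypothetical 2-D points forming two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

Because the result depends on the starting centroids, library implementations usually rerun the algorithm from several initializations and keep the best run.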
| Method | Strengths | Limitations | Example Use Cases | Implementation |
|---|---|---|---|---|
| PCA | Dimensionality reduction; Preserves variance | Linear method; Not for categorical data | Feature extraction; Data compression | Python |
| t-SNE | Captures non-linear structures; Good for visualization | Computationally expensive; Often needs a prior reduction step (e.g. PCA) for very high-dimensional data | Data visualization; Exploratory analysis | Python |
| Autoencoders | Dimensionality reduction; Captures non-linear relationships | Requires neural network knowledge; Computationally intensive | Feature learning; Noise reduction | Python |
| Isolation Forest | Effective for high-dimensional data; Fast and scalable | Randomized; May miss some anomalies | Fraud detection; Network security | Python |
| SVD | Matrix factorization; Efficient for large datasets | Assumes linear relationships; Sensitive to scaling | Recommender systems; Latent semantic analysis | Python |
| ICA | Identifies independent components; Signal separation | Assumes non-Gaussian components; Sensitive to noise | Blind signal separation; Feature extraction | Python |
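
The "preserves variance" entry for PCA can be illustrated in two dimensions, where the eigenvalues of the covariance matrix have a closed form. The sketch below is restricted to 2-D data for that reason; general PCA relies on an eigendecomposition or SVD from a numerical library. The function name `pca_2d_explained_variance` and the sample points are hypothetical.

```python
from math import sqrt
from statistics import mean


def pca_2d_explained_variance(points):
    """Explained-variance ratios of the two principal components of 2-D data."""
    xs, ys = zip(*points)
    x_bar, y_bar = mean(xs), mean(ys)
    n = len(points)
    # Sample covariance matrix [[a, b], [b, c]] of the centered data.
    a = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    c = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
    b = sum((x - x_bar) * (y - y_bar) for x, y in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    half_trace = (a + c) / 2
    radius = sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = (half_trace + radius, half_trace - radius)
    total = a + c  # total variance equals the sum of the eigenvalues
    return tuple(ev / total for ev in eigenvalues)


# Points lying exactly on the line y = 2x: one direction holds all the variance.
on_a_line = pca_2d_explained_variance([(0, 0), (1, 2), (2, 4), (3, 6)])

# Points spread equally along both axes: no direction is preferred.
isotropic = pca_2d_explained_variance([(1, 0), (-1, 0), (0, 1), (0, -1)])

print(on_a_line, isotropic)
```

When the first ratio is close to 1, dropping the second component loses almost nothing, which is the situation in which PCA works well for compression.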
| Method | Strengths | Limitations | Example Use Cases | Implementation |
|---|---|---|---|---|
| Apriori Algorithm | Well-known and widely used; Easy to understand and implement | Can be slow on large datasets; Generates a large number of candidate sets | Market basket analysis; Cross-marketing strategies | Python |
| FP-Growth Algorithm | Faster than Apriori; Efficient for large datasets | Memory intensive; Can be complex to implement | Frequent itemset mining in large databases; Customer purchase patterns | Python |
| Eclat Algorithm | Faster than Apriori; Scalable and easy to parallelize | Limited to binary attributes; Generates many candidate itemsets | Market basket analysis; Binary classification tasks | Python |
| GSP (Generalized Sequential Pattern) | Identifies sequential patterns; Flexible for various datasets | Can be computationally expensive; Not as efficient for very large databases | Customer purchase sequence analysis; Event sequence analysis | Python |
| RuleGrowth Algorithm | Efficient for mining sequential rules; Works well with sparse datasets | Requires careful parameter setting; Less known and used than Apriori or FP-Growth | Analyzing customer shopping sequences; Detecting patterns in web browsing data | Python |
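
The level-wise search behind the Apriori row can be sketched in a few lines. This is a minimal frequent-itemset miner, not a full rule generator: the function name `apriori`, the baskets, and the support threshold are all hypothetical. It shows the candidate-generation step the table names as a cost, together with the pruning rule that every subset of a frequent itemset must itself be frequent.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in `itemset`.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Pruning: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = apriori(baskets, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Each level rescans all transactions to count support, which is the main reason the method slows down on large datasets.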
| Technique | Strengths | Limitations | Example Use Cases | Implementation |
|---|---|---|---|---|
| Accuracy | Simple and intuitive; Effective for balanced datasets | Misleading for imbalanced datasets; Doesn’t reflect true positives/negatives | General classification problems; Comparing baseline models | Python |
| AUC-ROC | Effective for binary classification; Good for imbalanced datasets | Can be overly optimistic in imbalanced data; Not threshold-specific | Medical diagnosis classification; Fraud detection models | Python |
| Precision | Focuses on positive class; Reduces false positives | Ignores false negatives; Not useful alone in imbalanced datasets | Spam detection; Content moderation systems | Python |
| Recall | Identifies actual positives well; Minimizes false negatives | Ignores false positives; Can be misleading if positives are rare | Disease outbreak detection; Recall-focused tasks | Python |
| F1-Score | Balances precision and recall; Useful for imbalanced datasets | May not reflect true model performance; Depends on balance of precision and recall | Customer churn prediction; Sentiment analysis | Python |
| Cross-Validation | Reduces overfitting; Provides robust model evaluation | Computationally expensive; May not be ideal for very large datasets | General model evaluation; Comparing multiple models | Python |
| The Validation Set Approach | Simple and easy to implement; Good for initial model assessment | Can lead to overfitting; Dependent on the split | Quick model prototyping; Small datasets | Python |
| Leave-One-Out Cross-Validation | Very detailed; Each observation used for validation exactly once | Computationally intensive; Not suitable for large datasets | Small but rich datasets; Highly sensitive models | Python |
| k-Fold Cross-Validation | Balances computational cost and validation accuracy; Suitable for various data sizes | Variability in results depending on how data is divided; Choice of ‘k’ can impact results | Medium-sized datasets; Model selection | Python |
| The Bootstrap Method | Good for estimating model accuracy; Effective for small datasets | Results can be sensitive to outliers; May overestimate accuracy for small datasets | Small or medium-sized datasets; Uncertainty estimation | Python |
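
The table's warning that accuracy is misleading for imbalanced datasets can be checked with a small from-scratch computation of accuracy, precision, recall, and F1. The function names and both sets of predictions are hypothetical. Precision, recall, and F1 are reported as 0.0 when their denominator is zero, which is a convention chosen for this sketch rather than a universal rule.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 (0.0 when a ratio is undefined)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical imbalanced labels: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95

# A model that always predicts the majority class.
always_negative = [0] * 100

# A model that finds 4 of the 5 positives at the cost of 6 false alarms.
detector = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89

baseline = classification_metrics(y_true, always_negative)
useful = classification_metrics(y_true, detector)
print(baseline)
print(useful)
```

The always-negative model scores higher on accuracy than the detector even though it never finds a positive case, while recall and F1 rank the two models the other way around.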