A Practical Guide to Data Science Modeling: Lessons from the Book ‘Models Demystified’
Written by
Brock Ferguson, Managing Director
Published
February 10, 2025
In the book Models Demystified, OneSix Sr. ML Scientist Michael Clark delves into the fundamentals of modeling in data science. Designed for practical application, the book provides a clear understanding of modeling basics, an actionable toolkit of models and techniques, and a balanced perspective on statistical and machine learning approaches.
In this blog post, we highlight the key insights from his work, diving into various modeling techniques and emphasizing the importance of feature engineering and uncertainty estimation in building reliable, interpretable models.
By mastering these fundamentals, you’ll not only unlock the full potential of predictive analytics but also equip yourself to make smarter, data-driven decisions. So, let’s demystify the science behind the models and turn complexity into clarity!
What is Data Science Modeling?
At its core, a model is a mathematical, statistical, or computational construct designed to understand and predict patterns in data. It simplifies real-world systems or processes into manageable abstractions that data scientists can utilize to derive meaningful insights and actionable recommendations. Models facilitate:
- Pattern Recognition: Identifying trends or relationships within data.
- Hypothesis Testing: Validating assumptions through structured analysis.
- Forecasting: Predicting future events or outcomes based on historical data.
- Recommendation Systems: Suggesting optimal actions to achieve desired outcomes.
What are the Main Types of Data Science Models?
Data science encompasses various modeling techniques, each serving distinct purposes. Here’s an overview of the primary categories.
Linear Models and More
Linear Regression and Extensions
This category starts with linear regression and extends to generalized linear models (GLMs), including logistic regression for binary targets and Poisson regression for counts. Further extensions include generalized additive models (GAMs) for non-linear relationships and generalized linear mixed models (GLMMs) for hierarchical data.
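To make this concrete, here is a minimal sketch of fitting a logistic and a Poisson GLM with Python's statsmodels library; the data and variable names are synthetic and purely illustrative.

```python
# Minimal GLM sketch with statsmodels; the data below is synthetic.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(size=500), "x2": rng.normal(size=500)})

# Simulate a binary target and a count target from the features
p = 1 / (1 + np.exp(-(0.5 + 1.2 * df["x1"] - 0.8 * df["x2"])))
df["purchased"] = rng.binomial(1, p)
df["visits"] = rng.poisson(np.exp(0.2 + 0.4 * df["x1"]))

# Logistic regression for the binary target
logit_fit = smf.glm("purchased ~ x1 + x2", data=df,
                    family=sm.families.Binomial()).fit()

# Poisson regression for the count target
pois_fit = smf.glm("visits ~ x1 + x2", data=df,
                   family=sm.families.Poisson()).fit()

print(logit_fit.params)
print(pois_fit.params)
```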
Special Considerations
A variety of modeling approaches are necessary when working with specific data types, such as time-series, spatial, censored, or ordinal data. These data types often exhibit unique characteristics that may necessitate specialized modeling techniques or adaptations to existing models.
Ease and Interpretability
Linear models are prized for their ease of implementation and their ability to provide clear, interpretable results. They also serve as useful baselines for more complex models and are often difficult to outperform on simple tasks.
Machine Learning Models
Modeling Framework
Machine Learning (ML) provides a framework for systematically evaluating and improving models. It involves training models on historical data with a primary goal of making predictions or decisions on new, unseen data. The choice of model depends on the problem type, data characteristics, and desired performance metrics.
Penalized Regression
Least Absolute Shrinkage and Selection Operator (LASSO) and Ridge Regression are penalized versions of linear models commonly used for both regression and classification in a machine learning context; both are illustrated in the sketch following this list.
- Lasso applies a penalty proportional to the absolute values of the coefficients, shrinking some to zero. This promotes sparsity, effectively performing feature selection.
- Ridge applies a penalty based on the squared values of the coefficients, which tends to shrink the coefficients towards zero but does not eliminate any features.
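As a rough illustration, here is how LASSO and Ridge might be fit with scikit-learn; the feature matrix, target, and penalty strengths below are synthetic stand-ins rather than recommended settings.

```python
# Penalized regression sketch with scikit-learn on synthetic data.
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Lasso (L1 penalty): shrinks some coefficients exactly to zero
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)
print("lasso:", lasso.named_steps["lasso"].coef_.round(2))

# Ridge (L2 penalty): shrinks coefficients toward zero but keeps all features
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print("ridge:", ridge.named_steps["ridge"].coef_.round(2))
```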
Tree-Based
Tree-based models represent predictions as a series of decision rules learned from the data, splitting observations into progressively more homogeneous groups. The following are some common tree-based machine learning algorithms (compared in the sketch after this list):
- Random Forests: An ensemble method that constructs multiple decision trees during training. It outputs the mode of the classes (classification) or mean prediction (regression) of the individual trees, improving accuracy and controlling overfitting.
- Boosting Algorithms: Combine weak learners sequentially to form a strong learner; each new model attempts to correct the errors of its predecessors, enhancing predictive performance.
- Bagging (Bootstrap Aggregating): Improves the stability and accuracy of machine learning algorithms by training multiple models on different subsets of the data and aggregating their predictions.
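For comparison, here is a brief sketch of these three ensemble styles in scikit-learn on a synthetic classification problem; the model settings are illustrative defaults, not tuned choices.

```python
# Comparing tree-based ensembles on synthetic data with cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
    # BaggingClassifier bags decision trees by default
    "bagging": BaggingClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```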
Neural Networks/Deep Learning
Neural networks are loosely inspired by the architecture of the human brain: layers of interconnected units learn complex, non-linear relationships in data, and stacking many such layers is what enables deep learning applications.
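As a small illustration, a feed-forward network can be fit with scikit-learn's MLPClassifier; the two hidden layers and the synthetic data here are arbitrary choices for demonstration.

```python
# A small feed-forward neural network sketch using scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers learn non-linear combinations of the inputs
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
)
mlp.fit(X_train, y_train)
print("test accuracy:", round(mlp.score(X_test, y_test), 3))
```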
Causal Models
Identifying Effects
Causal models shift most of the modeling focus to identifying the effects of a treatment or intervention on an outcome, as opposed to predictive performance on new data. In determining causal effects, even small effects can be significant if they are consistent and replicable. For example, a small effect size in a clinical trial could still be meaningful if it leads to a reduction in patient mortality.
Random Assignment and A/B Testing
Random assignment of treatments to subjects is the gold standard for estimating causal effects. A/B testing is a common technique used in online experiments to determine the effectiveness of a treatment.
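For example, a simple A/B test on conversion rates can be analyzed with a two-proportion z-test; the counts below are made up for illustration.

```python
# Two-proportion z-test for a hypothetical A/B test.
from statsmodels.stats.proportion import proportions_ztest

conversions = [430, 495]   # conversions in variants A and B
visitors = [5000, 5000]    # visitors randomly assigned to each variant

stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {stat:.2f}, p-value = {p_value:.4f}")
```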
Directed Acyclic Graphs (DAGs)
DAGs are graphical representations that depict assumptions about the causal structure among variables, aiding in the understanding and identification of causal relationships. They pave the way for different modeling approaches that help discern causal effects.
Meta Learners
Meta-learners provide a framework for estimating treatment effects and determining the causal relationship between a treatment and an outcome. You can use the following types of meta-learners to assess causal effects (a T-Learner sketch follows the list):
- S-Learner: Uses a single model for both treated and control groups, estimating the difference in outcomes if all observations were treated versus not treated.
- T-Learner: Leverages two separate models—one for the treatment group and one for the control group—then calculates the difference in predictions.
- X-Learner: A more advanced version of the T-Learner with multiple steps to improve causal effect estimation.
- R-Learner (Double Debiased ML): Residualizes both the outcome and the treatment with respect to the covariates, then estimates the treatment effect from those residuals.
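Here is a rough sketch of the T-Learner idea: fit one outcome model per treatment arm and take the difference in predictions as the estimated effect. The synthetic data, random assignment, and choice of gradient boosting as the base learner are all assumptions for illustration.

```python
# T-Learner sketch: separate models for treated and control groups.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 5))
treated = rng.integers(0, 2, size=n)            # random assignment
true_effect = 0.5 + 0.3 * X[:, 0]               # heterogeneous effect
y = (X @ np.array([1.0, -0.5, 0.2, 0.0, 0.0])
     + treated * true_effect
     + rng.normal(scale=0.5, size=n))

# Fit one outcome model per treatment arm
model_t = GradientBoostingRegressor().fit(X[treated == 1], y[treated == 1])
model_c = GradientBoostingRegressor().fit(X[treated == 0], y[treated == 0])

# Estimated individual treatment effects and their average
tau_hat = model_t.predict(X) - model_c.predict(X)
print("estimated average treatment effect:", round(tau_hat.mean(), 3))
```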
Why is Data Preparation and Feature Engineering an Important Part of the Modeling Process?
Effective modeling hinges on thorough data preparation and feature engineering. The following steps ensure data quality and compatibility with algorithms, directly influencing model performance (a combined pipeline sketch follows the list).
- Data Cleaning: Handles missing values, corrects errors, and addresses inconsistencies to build reliable models and ensure data quality.
- Feature Selection: Identifies and selects the most relevant features to help improve model performance and interpretability.
- Feature Transformation: Applies transformations, such as scaling numeric variables or encoding categorical variables, to put the data in a format suitable for modeling algorithms.
- Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) help retain essential information while reducing the number of features.
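These steps are often combined into a single pipeline. The sketch below strings together imputation, scaling, one-hot encoding, and PCA with scikit-learn; the column names and toy data are hypothetical.

```python
# Illustrative preprocessing pipeline: cleaning, transformation, and PCA.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "income"]      # hypothetical numeric features
categorical_cols = ["region"]         # hypothetical categorical feature

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # handle missing values
    ("scale", StandardScaler()),                    # feature transformation
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = Pipeline([
    ("columns", ColumnTransformer(
        [("num", numeric_pipe, numeric_cols),
         ("cat", categorical_pipe, categorical_cols)],
        sparse_threshold=0.0,                       # keep output dense for PCA
    )),
    ("pca", PCA(n_components=2)),                   # dimensionality reduction
])

toy = pd.DataFrame({
    "age": [25, 40, None, 31],
    "income": [50_000, 80_000, 62_000, None],
    "region": ["west", "east", None, "west"],
})
print(preprocess.fit_transform(toy))
```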
What is the Role of Statistical Rigor in Uncertainty Estimation?
Addressing uncertainty is integral to robust data science modeling. Statistical rigor ensures reliable predictions and enhances trust in model outputs. It involves:
1. Quantification Through Confidence Intervals
Confidence intervals offer a clear, quantifiable range within which model parameters are likely to fall. This approach ensures that we account for variability in estimates, highlighting the degree of precision in model predictions.
2. Prediction Intervals for Future Observations
Unlike confidence intervals, prediction intervals extend uncertainty quantification to individual predictions. These intervals provide a realistic range for where future data points are expected, accounting for the inherent variability in outcomes.
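Here is a compact illustration of both interval types, using an ordinary least squares fit in statsmodels on synthetic data: confidence intervals for the coefficients, and prediction intervals for new observations.

```python
# Confidence intervals for coefficients vs. prediction intervals for new points.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=300)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=300)

fit = sm.OLS(y, sm.add_constant(x)).fit()

# 95% confidence intervals for the intercept and slope
print(fit.conf_int(alpha=0.05))

# 95% prediction intervals for three new observations
new_X = sm.add_constant(np.array([-1.0, 0.0, 1.0]), has_constant="add")
pred = fit.get_prediction(new_X).summary_frame(alpha=0.05)
print(pred[["mean", "obs_ci_lower", "obs_ci_upper"]])
```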
3. Bootstrapping for Distribution Estimation
Bootstrapping is a statistically rigorous, non-parametric technique that involves resampling the data to estimate the uncertainty of parameters and predictions. It is particularly useful when traditional analytical solutions are infeasible, providing robust insights into variability.
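A minimal bootstrap sketch: resample the data with replacement many times, re-compute the statistic of interest, and read an interval off the resulting distribution. The skewed synthetic sample below is just for illustration.

```python
# Non-parametric bootstrap for the uncertainty of a sample mean.
import numpy as np

rng = np.random.default_rng(7)
data = rng.lognormal(mean=0.0, sigma=1.0, size=500)   # skewed synthetic sample

n_boot = 2000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means[i] = resample.mean()

# Percentile interval based on the bootstrap distribution
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap 95% interval for the mean: [{lower:.3f}, {upper:.3f}]")
```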
4. Bayesian Methods for Comprehensive Uncertainty Estimation
Bayesian approaches allow for a more comprehensive treatment of uncertainty by incorporating prior information and deriving posterior distributions. This method propagates uncertainty through the entire modeling process, offering a more nuanced understanding of variability in predictions.
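As a tiny taste of the Bayesian approach, here is a conjugate Beta-Binomial model for a conversion rate; the uniform prior and the counts are illustrative choices, and richer models are typically fit with tools such as PyMC or Stan.

```python
# Conjugate Beta-Binomial sketch: prior + data -> posterior for a rate.
from scipy import stats

prior_a, prior_b = 1, 1          # uniform Beta(1, 1) prior on the rate
conversions, trials = 47, 500    # hypothetical observed data

posterior = stats.beta(prior_a + conversions, prior_b + trials - conversions)
print("posterior mean:", round(posterior.mean(), 4))
print("95% credible interval:", posterior.interval(0.95))
```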
5. Model Validation and Testing
Employing techniques such as cross-validation ensures that model predictions generalize well to unseen data. Rigorous testing methods reveal the extent of overfitting and provide an honest assessment of model reliability.
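A short cross-validation sketch with scikit-learn; the ridge model and synthetic data are placeholders for whatever you are actually validating.

```python
# 5-fold cross-validation to estimate out-of-sample performance.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv, scoring="r2")
print("per-fold R^2:", scores.round(3))
print("mean R^2:", round(scores.mean(), 3))
```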
6. Assumption Checking and Diagnostics
Statistical rigor requires a careful evaluation of the assumptions underlying a model. When these assumptions are violated, it can lead to substantial uncertainty in the results, making thorough diagnostics and model refinement critical to minimizing risks and ensuring reliable outcomes.
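For example, two common checks on a linear model's residuals, normality and constant variance, can be run with scipy and statsmodels; the model and data below are synthetic.

```python
# Residual diagnostics for an OLS fit: normality and heteroscedasticity checks.
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(11)
x = rng.normal(size=400)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=400)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# Shapiro-Wilk test for normality of the residuals
print("normality p-value:", round(stats.shapiro(fit.resid).pvalue, 4))

# Breusch-Pagan test for heteroscedasticity (non-constant error variance)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print("heteroscedasticity p-value:", round(lm_pvalue, 4))
```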
Key Takeaway: Building a Strong Foundation in Data Science Modeling
Data science modeling is already solving real business challenges, enabling demand forecasting, inventory management, and logistical optimization, and leading to cost savings and improved efficiency in supply chains.
Incorporating processes like feature engineering, uncertainty estimation, and robust validation ensures that your models are not only reliable but also interpretable and adaptable to real-world complexities.
As artificial intelligence and machine learning continue to advance, we can expect models to become increasingly automated, adaptive, and precise. The future will likely emphasize real-time predictive analytics, empowering industries to anticipate trends, streamline operations, and make informed decisions with enhanced accuracy.
The journey doesn’t end with building models; it’s about using them to transform challenges into opportunities. Ready to dive deeper? You can access expert insights and further your understanding of data science modeling in the book Models Demystified, written by OneSix ML Scientist Michael Clark. The print version of the book will be out this year from CRC Press as part of the Data Science Series.
Navigate Future Developments With Data Science Modeling
We can help you get started with expert insights and practical guidance to build and optimize data-driven models for your needs.
Contact Us