About the book

A detailed explanation of the book's four parts.

I. Data Exploration

Before starting to work with data, data analysts need to gain a good understanding of how their data was born, what's in their data, and how good its quality is. Actual work typically starts with re-organizing, cleaning, and storing the data in an appropriate way. Such work, also called data wrangling, takes a lot of time, typically more than the analysis itself. Moreover, decisions made during data wrangling may have far-reaching consequences for the results of the analysis. The chapters in this part discuss how to assess data quality by understanding how the data was born and what issues real-life data frequently have, and they introduce tools and good practices of data wrangling to deal with those issues. Next, we discuss the methods of exploratory data analysis and give advice on data visualization. The part closes with chapters on concepts and methods that help generalize findings from the data at hand to other situations: inference and hypothesis testing.
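To make the generalization idea concrete, here is a minimal sketch of estimating the standard error of a sample mean both via bootstrap and via the textbook formula. The data are synthetic and every number is illustrative; this is not code from the book's case studies.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "observed" sample; in a real project this would be the data at hand.
sample = rng.normal(loc=3.0, scale=0.8, size=500)

# Bootstrap: draw many resamples with replacement and re-estimate
# the statistic (here: the mean) on each of them.
n_boot = 5000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_boot)
])

se_bootstrap = boot_means.std(ddof=1)                   # SE via bootstrap
se_formula = sample.std(ddof=1) / np.sqrt(sample.size)  # SE via the formula

mean = sample.mean()
print(f"sample mean:  {mean:.3f}")
print(f"bootstrap SE: {se_bootstrap:.4f}   formula SE: {se_formula:.4f}")
print(f"95% CI:       [{mean - 2 * se_bootstrap:.3f}, {mean + 2 * se_bootstrap:.3f}]")
```

The two standard errors should be very close here; the bootstrap's advantage is that it works the same way for statistics that have no simple formula.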

Chapters of this part:

  1. Origins of data (data table, data quality, survey, scraping, sampling, ethics)
  2. Preparing data for analysis (tidy data, source of variation, variable types, missing data, data cleaning)
  3. Exploratory data analysis (probability, distributions, extreme values, summary stats, layers of data visualization)
  4. Comparison and correlation (conditional probability, conditional distribution, conditional expectation, visual comparisons, correlation, good graphs and tables)
  5. Generalizing from a dataset (repeated samples, confidence interval, standard error estimation via bootstrap and formula, external validity)
  6. Testing hypotheses (null and alternative hypotheses, t-test, false positives / false negatives, p-value, testing multiple hypotheses)

II. Patterns: regression analysis

Uncovering patterns of association in the data can be an important goal in itself, and it is a prerequisite to prediction and to establishing the effect of an intervention. The chapters in this part give a thorough introduction to regression analysis, whose goal is to uncover average differences in y that correspond to differences in x. We start with simple regression analysis, introducing nonparametric regressions and focusing on linear regression. Subsequent chapters discuss how linear regression can accommodate complicated functional forms, how to generalize regression results from the data to other settings, and how to estimate and interpret linear regressions with multiple x variables. The last two chapters of this part cover probability models (linear regression, logit, and probit with binary y variables) and regressions using time series data. Throughout the chapters, the emphasis is on understanding the intuition behind the methods, their applicability in various situations, and the correct interpretation of their results.
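As a taste of what "average differences in y that correspond to differences in x" means in practice, here is a minimal simple-regression sketch on synthetic data; the variables and numbers are made up for illustration and do not come from the book's case studies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y could be a price, x a product characteristic.
n = 200
x = rng.uniform(0, 10, size=n)
y = 120 - 6 * x + rng.normal(0, 15, size=n)

# OLS coefficients for the simple linear regression y = alpha + beta * x + e:
# the slope is the sample covariance of x and y over the variance of x.
beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha = y.mean() - beta * x.mean()

y_hat = alpha + beta * x   # predicted values
resid = y - y_hat          # residuals

print(f"alpha = {alpha:.1f}, beta = {beta:.2f}")
print(f"interpretation: y is, on average, {abs(beta):.2f} units lower "
      f"for observations with one unit higher x")
print(f"R-squared = {1 - resid.var(ddof=1) / y.var(ddof=1):.3f}")
```

The slope is an average difference, not automatically a causal effect; that distinction is the subject of the first chapter of this part and of Part IV.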

Chapters of this part:

  1. Simple regression analysis (non-parametric regression, linear regression, OLS, predicted values and residuals, regression and causality)
  2. Complicated patterns and messy data (taking log and other transformations of variables, piecewise linear splines and polynomials, measurement error in variables, influential observations, using weights)
  3. Generalizing results of regression analysis (standard error, confidence interval, prediction interval, testing, external validity)
  4. Multiple linear regression (linear regression mechanics, binary and other qualitative right-hand-side variables, interactions, ceteris paribus vs. conditioning in multiple regression)
  5. Probability models (linear probability, logit and probit, marginal differences, goodness of fit)
  6. Time series regressions (series frequency, trends, seasonality, stationarity, lags, serial correlation)

III. Prediction

Data analysis in business and policy applications is often aimed at predicting the value of a variable y with the help of x variables. Examples include predicting the price of differentiated products from their characteristics, predicting the probability that loan applicants will repay their loans based on their features, or forecasting daily sales from seasonal patterns. This part starts by introducing the general framework for prediction, including the concepts of the loss function, overfitting, training and test samples, and cross-validation. Separate chapters introduce prediction methods with linear regression (variable selection with lasso), tree-based methods, and random forest. We discuss the prediction of probabilities and classification, as well as forecasting y in time series data. Throughout the chapters, we introduce the language used in predictive analytics (machine learning), emphasize the intuition behind the various predictive models and how to evaluate them, discuss conceptual issues related to in-sample fit, cross-validation, and out-of-sample prediction, and show how to choose a method that is likely to give good predictions for the problem at hand.
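The core of this framework fits in a few lines: fit models of increasing flexibility on a training sample only, then evaluate the loss on a held-out test sample. The sketch below uses synthetic data and polynomial fits purely for illustration; it is not one of the book's exercises.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: a mildly nonlinear true pattern plus noise.
x = rng.uniform(-3, 3, size=120)
y = np.sin(x) + rng.normal(0, 0.4, size=x.size)

# Split into a training sample (first 80 observations) and a test sample.
train = np.arange(x.size) < 80
test = ~train

def rmse(y_true, y_pred):
    """Root mean squared error (RMSE), a standard loss function for prediction."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Fit polynomials of increasing flexibility on the training sample only,
# then evaluate the loss on both samples.
for degree in (1, 3, 10):
    coefs = np.polyfit(x[train], y[train], deg=degree)
    print(f"degree {degree:2d}: "
          f"train RMSE = {rmse(y[train], np.polyval(coefs, x[train])):.3f}, "
          f"test RMSE = {rmse(y[test], np.polyval(coefs, x[test])):.3f}")
```

The most flexible model fits the training sample best by construction, but its test RMSE is typically worse once it starts fitting noise; cross-validation automates exactly this comparison.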

Chapters of this part:

  1. Framework for prediction (prediction error, loss function, RMSE, prediction with regression, overfitting, external validity, NP-hard problems, machine learning)
  2. Model selection (adjusted measures of in-sample fit, cross-validation, shrinkage/LASSO)
  3. Regression trees (idea of decision trees, CART, stopping rules, pruning, search algorithms)
  4. Random forest (boosting, decorrelating trees, regression vs random forest)
  5. Classification (calibration and measures of fit, ROC/AUC, classification with logit vs. random forest)
  6. Forecasting from time series data (cross-validation in time series, ARIMA, vector autoregression, daily series and seasonality, new tools: auto-arima and Prophet)

IV. Causality: learning the effects of interventions

Decisions in business and policy are often centered on specific interventions, such as changing monetary policy, modifying health care financing, changing the price or other attributes of products, or changing the media mix in marketing. Learning the effects of such interventions is the subject of causal analysis, an important purpose of data analysis. This part starts with the conceptual framework of causal analysis, introducing potential outcomes, counterfactuals, average effects, treated and control groups, and causal maps (graphs that show our assumptions about the causal relationships between variables). These concepts help guide the subsequent analysis and the interpretation of its results. The next chapters introduce potential issues with controlled experiments and tools to deal with them; confounders that bias causal estimates in observational data and how regression analysis and matching can overcome, or mitigate, their effects; and when and how longitudinal data can help further mitigate those issues. We introduce and discuss difference-in-differences analysis, panel regression analysis with first differences and fixed effects, and methods that can help define control groups in longitudinal data, such as synthetic controls or event studies. All chapters in this part emphasize thinking in terms of potential outcomes and counterfactuals, finding and using methods that are most likely to give good estimates of average effects, and assessing the internal and external validity of such estimates.
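To illustrate the counterfactual logic, here is a minimal difference-in-differences sketch on synthetic two-period data; the group means, the trend, and the true effect of +5 are all made-up assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic two-period data: a treated and a control group observed
# before and after an intervention whose true average effect is +5.
n = 1000
trend, effect = 2.0, 5.0
control_before = rng.normal(50, 10, size=n)
control_after = control_before + trend + rng.normal(0, 2, size=n)
treated_before = rng.normal(55, 10, size=n)   # treated group starts higher
treated_after = treated_before + trend + effect + rng.normal(0, 2, size=n)

# Difference-in-differences: the change in the treated group minus the
# change in the control group. Under the parallel-trends assumption this
# removes both the pre-existing level difference and the common time trend.
did = ((treated_after.mean() - treated_before.mean())
       - (control_after.mean() - control_before.mean()))
print(f"diff-in-diffs estimate: {did:.2f}   (true effect: {effect})")
```

Subtracting each group's own "before" level strips out pre-existing differences, and subtracting the control group's change strips out the common trend; what remains is an estimate of the average effect, provided the parallel-trends assumption holds.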

Chapters of this part:

  1. A framework for causal analysis (potential outcomes, average treatment effect, selection and other confounders, use of causal graphs)
  2. Experiments (field experiments, A/B testing, randomization and balance, attrition, power, internal and external validity)
  3. Methods for observational cross-section data (exact and propensity score matching vs. regression, the role of common support)
  4. Difference-in-differences (parallel trends, panel vs. repeated cross-section)
  5. Methods for observational panel data (fixed effects, first differences, clustering standard errors)
  6. Appropriate control group in observational panel data (synthetic control and event study, panel balance and selection)