Data Analysis: Patterns, Prediction and Causality
A graduate textbook for Business, Economics and Policy


Short description

Data Analysis: Exploration, Patterns, Prediction and Causality is a textbook aimed primarily at business, applied economics and public policy students. It may be taught at MBA, MA Economics (non-PhD track), MSc in Business Economics/Management, MA in Public Policy, PhD in Management and comparable programs. It is also a natural fit in Business Analytics graduate programs. This textbook provides integrated knowledge of methods traditionally scattered across various fields such as econometrics, machine learning and practical business statistics. It covers data organization, data description, regression analysis, predictive analytics using regression and machine learning tools, causal analysis of the effects of interventions by doing experiments or using observational data, and practical skills for working with real-life data and collecting data. The textbook covers relatively few methods but helps students gain a lot of practice and a deep intuitive understanding of those methods. We put a lot of emphasis on the interpretation and visualization of results.

Authors

Gábor Békés (CEU)
Gábor Kézdi (U. Michigan)

A bit more info

Data Analysis: Patterns, Prediction and Causality by Gábor Békés (Central European University and CEPR) and Gábor Kézdi (University of Michigan and ISR) is forthcoming with Cambridge University Press (in 2020). The textbook material may be fully covered in a year-long course (for example, in the first year of a two-year Master's or PhD program). It covers material for a series of courses or modules, and chapters may be used to assemble programs of various lengths.

Our textbook covers integrated knowledge of methods and tools traditionally scattered across various fields such as econometrics, machine learning and practical business statistics. The book has four sections:

  1. Data Exploration
  2. Patterns: regression analysis
  3. Prediction
  4. Causality: the effects of interventions

State-of-the-art knowledge in data analysis includes traditional regression analysis, causal analysis of the effects of interventions, predictive analytics using regression and machine learning tools, and practical skills for working with real-life data and collecting data. We cover relatively few methods but help students gain a deep intuitive understanding. The upside of covering fewer methods is that visualization and interpretation of results can become the focus of analysis.

Applied knowledge can be acquired only by working through many applications. Students will use real-life data and learn how to manage analytical projects from scratch, as we provide data and code as part of an online ancillary platform. The textbook supports both R and Stata. It is complemented with extensive online material including data, code, additional case studies, practice questions, sample exams and data exercises.

Why is this textbook different?

The most important features of this textbook that we think make it attractive - and different from other textbooks - are as follows.

What's more

Case studies

We will provide additional case studies that allow for studying the entire process of data analysis from the substantive business or policy question through collecting or accessing data, managing and cleaning data, carrying out the analysis, presenting and interpreting its results, and addressing the original substantive questions. Case studies aim at answering a question rather than simply illustrating a method. We selected case studies with a potential appeal to a wide range of students. The topics cover management, consumer choice, labor markets, health, energy, macroeconomic and social policy.

Focus on data visualization

Understanding patterns in data is greatly helped by data visualization. We present a comprehensive take on how to build graphs using a few layers. For many types of graphs, we offer shorter sections on how best to show a relationship, illustrated by graphs used in case studies.

Big data

Big Data presents opportunities to better answer old questions and ask new questions. It offers great advantages when applying many traditional statistical methods and allows for developing new methods. At the same time, analyzing Big Data presents new challenges. Within each section of the book, we include explicit discussion of these opportunities and challenges in relation to uncovering and generalizing patterns, learning the effects of interventions and carrying out predictions.

Key topics

I. Data Exploration

Real data needs cleaning and restructuring before it can be analyzed. The decisions during that process may have far-reaching consequences for the results of the analysis. Yet they are rarely discussed in standard statistics, econometrics and machine learning texts. Even after extensive cleaning the data used in the analysis is typically different from the ideal dataset that would serve the analysis best. Analysts need to have a thorough understanding of those differences to interpret their results in appropriate ways. The examples used in our course help students acquire the tools of data management and data cleaning and track the consequences of data cleaning on the results of their analysis. Furthermore, similar issues are addressed when analysts collect their own data or influence data collection in some ways.
Chapters of this section:

  1. Origins of data (data table, data quality, survey, scraping, sampling, ethics)
  2. Preparing data for analysis (tidy data, source of variation, variable types, missing data, data cleaning)
  3. Exploratory data analysis (probability, distributions, extreme values, summary stats, layers of data visualization)
  4. Comparison and correlation (conditional probability, conditional distribution, conditional expectation, visual comparisons, correlation, good graphs and tables)
  5. Generalizing from a dataset (repeated samples, confidence interval, standard error estimation via bootstrap and formula, external validity)
  6. Testing hypotheses (null and alternative hypotheses, t-test, false positives / false negatives, p-value, testing multiple hypotheses)
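
The chapter on generalizing from a dataset mentions standard error estimation via bootstrap and formula. As a rough illustration of that idea, the sketch below compares the two for a sample mean. It is written in Python purely for illustration (the book's own code is in R and Stata); the function names and the price data are made up for this example and do not come from the textbook.

```python
import random
import statistics


def bootstrap_se_mean(values, n_draws=2000, seed=0):
    """Standard error of the mean, estimated by resampling with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    means = []
    for _ in range(n_draws):
        # Each bootstrap sample has the same size as the original data.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(means)


def formula_se_mean(values):
    """Standard error of the mean from the textbook formula s / sqrt(n)."""
    return statistics.stdev(values) / len(values) ** 0.5


# Hypothetical sample of 20 prices (illustrative numbers only).
prices = [52, 61, 48, 75, 90, 66, 58, 83, 71, 49,
          95, 64, 57, 68, 77, 62, 88, 54, 73, 60]

print("bootstrap SE:", round(bootstrap_se_mean(prices), 2))
print("formula SE:  ", round(formula_se_mean(prices), 2))
```

With a sample like this the two estimates should be close to each other; the bootstrap becomes more useful for statistics that have no simple standard error formula.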

II. Patterns: regression analysis

Uncovering patterns in the data can be an important goal in itself, and it is the prerequisite to establishing cause and effect and carrying out predictions. The course starts with simple regression analysis, the method that compares expected y for different values of x to learn the patterns of association between the two variables. It discusses nonparametric regressions and focuses on linear regression. It builds on simple linear regression and goes on to enrich it with nonlinear functional forms, generalizing from a particular dataset to other data it represents, adding more explanatory variables, etc. The course also covers regression analysis for time series data, panel data and binary dependent variables, including nonlinear models such as logit and probit. Understanding the intuition behind the methods, their applicability in various situations, and the correct interpretation of their results are the constant focus of the course.
Chapters of this section:

  1. Simple regression analysis (non-parametric regression, linear regression, OLS, predicted values and residuals, regression and causality)
  2. Complicated patterns and messy data (taking log and other transformations of variables, piecewise linear splines and polynomials, measurement error in variables, influential observations, using weights)
  3. Generalizing results of regression analysis (standard error, confidence interval, prediction interval, testing, external validity)
  4. Multiple linear regression (linear regression mechanics, binary and other qualitative right-hand-side variables, interactions, ceteris paribus vs. conditioning in multiple regression)
  5. Probability models (linear probability, logit and probit, marginal differences, goodness of fit)
  6. Time series regressions (series frequency, trends, seasonality, stationarity, lags, serial correlation)
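
Simple linear regression, as described above, compares expected y across values of x. A minimal sketch of the OLS mechanics (slope, intercept, predicted values and residuals) might look as follows. Python is used only for illustration, since the book's code is in R and Stata; the function name `ols_simple` and the distance/price numbers are hypothetical.

```python
def ols_simple(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: covariance of x and y divided by the variance of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    # The fitted line passes through the point of means.
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data: distance from the city center (x) and price (y).
distance = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0]
price = [130, 118, 112, 101, 95, 80, 74]

intercept, slope = ols_simple(distance, price)
predicted = [intercept + slope * d for d in distance]
residuals = [p - p_hat for p, p_hat in zip(price, predicted)]

# The slope is the average difference in y between observations
# that differ in x by one unit; it is not by itself a causal effect.
print("intercept:", round(intercept, 2), "slope:", round(slope, 2))
```

By construction, the OLS residuals sum to zero (up to rounding), which is a quick sanity check on any hand-rolled implementation.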

III. Prediction

Data analysis in business and policy applications is often aimed at prediction. The course introduces tools to evaluate predictions, such as loss functions or the Brier score. It emphasizes the importance of out-of-sample prediction, the role of stationarity, the dangers of overfitting and the use of training and testing samples and cross-validation. It presents and compares the most important predictive models that may be useful in various situations such as time series regressions, classification tools and tree-based machine learning methods.
Chapters of this section:

  1. Framework for prediction (prediction error, loss function, RMSE, prediction with regression, overfitting, external validity, NP-hard problems, machine learning)
  2. Model selection (adjusted measures of in-sample fit, cross-validation, shrinkage/LASSO)
  3. Regression trees (idea of decision trees, CART, stopping rules, pruning, search algorithms)
  4. Random forest (boosting, decorrelating trees, regression vs. random forest)
  5. Classification (calibration and measures of fit, ROC/AUC, classification with logit vs. random forest)
  6. Forecasting from time series data (cross-validation in time series, ARIMA, vector autoregression, daily series and seasonality, new tools: auto-arima and Prophet)
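
To make the evaluation tools mentioned in this section more concrete, here is a small sketch of RMSE, the Brier score and k-fold cross-validation. It is an illustration in Python under simplifying assumptions, not the textbook's code (which is in R and Stata): all function names are hypothetical, the data are made up, and folds are assigned by position rather than by random shuffling.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error of a set of predictions."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def brier_score(outcomes, probabilities):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(outcomes, probabilities)) / len(outcomes)


def fit_mean(x_train, y_train):
    """Benchmark model: always predict the training-sample mean of y."""
    y_bar = sum(y_train) / len(y_train)
    return lambda x: y_bar


def fit_line(x_train, y_train):
    """Simple linear regression estimated by OLS."""
    n = len(x_train)
    x_bar = sum(x_train) / n
    y_bar = sum(y_train) / n
    s_xx = sum((x - x_bar) ** 2 for x in x_train)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_train, y_train))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def cross_validated_rmse(x, y, fit, k=5):
    """Average out-of-sample RMSE over k folds (fold = observation index mod k)."""
    fold_rmses = []
    for fold in range(k):
        train = [i for i in range(len(x)) if i % k != fold]
        test = [i for i in range(len(x)) if i % k == fold]
        # Estimate on the training folds only, then predict the held-out fold.
        predict = fit([x[i] for i in train], [y[i] for i in train])
        fold_rmses.append(rmse([y[i] for i in test], [predict(x[i]) for i in test]))
    return sum(fold_rmses) / k


# Made-up data: a linear pattern plus a small deterministic disturbance.
x = list(range(1, 21))
y = [3.0 + 1.5 * xi + (1.0 if xi % 2 else -1.0) for xi in x]

print("CV RMSE, mean only:", round(cross_validated_rmse(x, y, fit_mean), 2))
print("CV RMSE, linear:   ", round(cross_validated_rmse(x, y, fit_line), 2))
```

The model with the lower cross-validated RMSE is the one expected to predict better on data not used for estimation, which is the comparison that guards against overfitting.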

IV. Causality: learning the effects of interventions

Decisions in business and policy are often centered on specific interventions, such as changing monetary policy, modifying health care financing, changing the price or other attributes of products, or changing the media mix in marketing. Learning the effects of such interventions is an important purpose of data analysis. The course incorporates the basic concepts and methods used by program evaluation (the framework of potential outcomes, the benefits of randomized assignment, etc.). It also covers related methods used in business, such as A/B testing.
Chapters of this section:

  1. A framework for causal analysis (potential outcomes, average treatment effect, selection and other confounders, use of causal graphs)
  2. Experiments (field experiments, A/B testing, randomization and balance, attrition, power, internal and external validity)
  3. Methods for observational cross-section data (exact matching, propensity score matching vs. regression, the role of common support)
  4. Difference in differences (parallel trends, panel vs. repeated cross-section)
  5. Methods for observational panel data (fixed effects, first differences, clustering standard errors)
  6. Appropriate control group in observational panel data (synthetic control and event study, panel balance and selection)

Errata

If you find any errors or discrepancies, please contact us at data.analyis.textbook@gmail.com. Comments are warmly welcome.
