R in Action, Third Edition

Book description

R is the most powerful tool you can use for statistical analysis. This definitive guide smooths R’s steep learning curve with practical solutions and real-world applications for commercial environments.

In R in Action, Third Edition you will learn how to:

  • Set up and install R and RStudio
  • Clean, manage, and analyze data with R
  • Use the ggplot2 package for graphs and visualizations
  • Solve data management problems using R functions
  • Fit and interpret regression models
  • Test hypotheses and estimate confidence intervals
  • Simplify complex multivariate data with principal components and exploratory factor analysis
  • Make predictions using time series forecasting
  • Create dynamic reports and stunning visualizations
  • Debug programs and create packages

R in Action, Third Edition makes learning R quick and easy. That’s why thousands of data scientists have chosen this guide to help them master the language. Far from being a dry academic tome, this book makes every example relevant to scientific and business developers and useful for solving common data challenges. R expert Rob Kabacoff takes you on a crash course in statistics, from dealing with messy and incomplete data to creating stunning visualizations. This revised and expanded third edition contains fresh coverage of the new tidyverse approach to data analysis and R’s state-of-the-art graphing capabilities with the ggplot2 package.

About the Technology
Used daily by data scientists, researchers, and quants of all types, R is the gold standard for statistical data analysis. This free and open source language includes packages for everything from advanced data visualization to deep learning. Instantly comfortable for mathematically minded users, R easily handles practical problems without forcing you to think like a software engineer.

About the Book
R in Action, Third Edition teaches you how to do statistical analysis and data visualization using R and its popular tidyverse packages. In it, you’ll investigate real-world data challenges, including forecasting, data mining, and dynamic report writing. This revised third edition adds new coverage of graphing with ggplot2, along with examples for machine learning topics like clustering, classification, and time series analysis.

What's Inside
  • Clean, manage, and analyze data
  • Use the ggplot2 package for graphs and visualizations
  • Debug programs and create packages
  • A complete learning resource for R and tidyverse


About the Reader
Requires basic math and statistics. No prior experience with R needed.

About the Author
Dr. Robert I. Kabacoff is a professor of quantitative analytics at Wesleyan University and a seasoned data scientist with more than 20 years of experience.

Quotes
Kabacoff has outdone himself by significantly improving on the already excellent previous edition.
- Alain Lompo, ISO-Gruppe

R in Action has been my go-to reference on R for years. The third edition contains timely updates on the tidyverse and other new tools. I would recommend this book without hesitation.
- Daniel Kenney-Jung, MD, Department of Pediatrics, Duke University

Outstandingly well-written. The best book on R programming that I have ever read.
- Kelvin Meeks, International Technology Ventures

Takes the reader through a series of essential methods from basic to complex. The only R book you will ever need.
- Martin Perry, Microsoft

Table of contents

  1. R in Action
  2. Copyright
  3. Praise for the previous edition of R in Action
  4. brief contents
  5. contents
  6. Front matter
    1. preface
    2. acknowledgments
    3. about this book
      1. What's new in the third edition
      2. Who should read this book
      3. How this book is organized: A road map
      4. Advice for data miners
      5. About the code
      6. liveBook discussion forum
    4. about the author
    5. about the cover illustration
  7. Part 1. Getting started
  8. 1 Introduction to R
    1. 1.1 Why use R?
    2. 1.2 Obtaining and installing R
    3. 1.3 Working with R
      1. 1.3.1 Getting started
      2. 1.3.2 Using RStudio
      3. 1.3.3 Getting help
      4. 1.3.4 The workspace
      5. 1.3.5 Projects
    4. 1.4 Packages
      1. 1.4.1 What are packages?
      2. 1.4.2 Installing a package
      3. 1.4.3 Loading a package
      4. 1.4.4 Learning about a package
    5. 1.5 Using output as input: Reusing results
    6. 1.6 Working with large datasets
    7. 1.7 Working through an example
    8. Summary
  9. 2 Creating a dataset
    1. 2.1 Understanding datasets
    2. 2.2 Data structures
      1. 2.2.1 Vectors
      2. 2.2.2 Matrices
      3. 2.2.3 Arrays
      4. 2.2.4 Data frames
      5. 2.2.5 Factors
      6. 2.2.6 Lists
      7. 2.2.7 Tibbles
    3. 2.3 Data input
      1. 2.3.1 Entering data from the keyboard
      2. 2.3.2 Importing data from a delimited text file
      3. 2.3.3 Importing data from Excel
      4. 2.3.4 Importing data from JSON
      5. 2.3.5 Importing data from the web
      6. 2.3.6 Importing data from SPSS
      7. 2.3.7 Importing data from SAS
      8. 2.3.8 Importing data from Stata
      9. 2.3.9 Accessing database management systems
      10. 2.3.10 Importing data via Stat/Transfer
    4. 2.4 Annotating datasets
      1. 2.4.1 Variable labels
      2. 2.4.2 Value labels
    5. 2.5 Useful functions for working with data objects
    6. Summary
  10. 3 Basic data management
    1. 3.1 A working example
    2. 3.2 Creating new variables
    3. 3.3 Recoding variables
    4. 3.4 Renaming variables
    5. 3.5 Missing values
      1. 3.5.1 Recoding values to missing
      2. 3.5.2 Excluding missing values from analyses
    6. 3.6 Date values
      1. 3.6.1 Converting dates to character variables
      2. 3.6.2 Going further
    7. 3.7 Type conversions
    8. 3.8 Sorting data
    9. 3.9 Merging datasets
      1. 3.9.1 Adding columns to a data frame
      2. 3.9.2 Adding rows to a data frame
    10. 3.10 Subsetting datasets
      1. 3.10.1 Selecting variables
      2. 3.10.2 Dropping variables
      3. 3.10.3 Selecting observations
      4. 3.10.4 The subset() function
      5. 3.10.5 Random samples
    11. 3.11 Using dplyr to manipulate data frames
      1. 3.11.1 Basic dplyr functions
      2. 3.11.2 Using pipe operators to chain statements
    12. 3.12 Using SQL statements to manipulate data frames
    13. Summary
  11. 4 Getting started with graphs
    1. 4.1 Creating a graph with ggplot2
      1. 4.1.1 ggplot
      2. 4.1.2 Geoms
      3. 4.1.3 Grouping
      4. 4.1.4 Scales
      5. 4.1.5 Facets
      6. 4.1.6 Labels
      7. 4.1.7 Themes
    2. 4.2 ggplot2 details
      1. 4.2.1 Placing the data and mapping options
      2. 4.2.2 Graphs as objects
      3. 4.2.3 Saving graphs
      4. 4.2.4 Common mistakes
    3. Summary
  12. 5 Advanced data management
    1. 5.1 A data management challenge
    2. 5.2 Numerical and character functions
      1. 5.2.1 Mathematical functions
      2. 5.2.2 Statistical functions
      3. 5.2.3 Probability functions
      4. 5.2.4 Character functions
      5. 5.2.5 Other useful functions
      6. 5.2.6 Applying functions to matrices and data frames
      7. 5.2.7 A solution for the data management challenge
    3. 5.3 Control flow
      1. 5.3.1 Repetition and looping
      2. 5.3.2 Conditional execution
    4. 5.4 User-written functions
    5. 5.5 Reshaping data
      1. 5.5.1 Transposing
      2. 5.5.2 Converting from wide to long dataset formats
    6. 5.6 Aggregating data
    7. Summary
  13. Part 2. Basic methods
  14. 6 Basic graphs
    1. 6.1 Bar charts
      1. 6.1.1 Simple bar charts
      2. 6.1.2 Stacked, grouped, and filled bar charts
      3. 6.1.3 Mean bar charts
      4. 6.1.4 Tweaking bar charts
    2. 6.2 Pie charts
    3. 6.3 Tree maps
    4. 6.4 Histograms
    5. 6.5 Kernel density plots
    6. 6.6 Box plots
      1. 6.6.1 Using parallel box plots to compare groups
      2. 6.6.2 Violin plots
    7. 6.7 Dot plots
    8. Summary
  15. 7 Basic statistics
    1. 7.1 Descriptive statistics
      1. 7.1.1 A menagerie of methods
      2. 7.1.2 Even more methods
      3. 7.1.3 Descriptive statistics by group
      4. 7.1.4 Summarizing data interactively with dplyr
      5. 7.1.5 Visualizing results
    2. 7.2 Frequency and contingency tables
      1. 7.2.1 Generating frequency tables
      2. 7.2.2 Tests of independence
      3. 7.2.3 Measures of association
      4. 7.2.4 Visualizing results
    3. 7.3 Correlations
      1. 7.3.1 Types of correlations
      2. 7.3.2 Testing correlations for significance
      3. 7.3.3 Visualizing correlations
    4. 7.4 T-tests
      1. 7.4.1 Independent t-test
      2. 7.4.2 Dependent t-test
      3. 7.4.3 When there are more than two groups
    5. 7.5 Nonparametric tests of group differences
      1. 7.5.1 Comparing two groups
      2. 7.5.2 Comparing more than two groups
    6. 7.6 Visualizing group differences
    7. Summary
  16. Part 3. Intermediate methods
  17. 8 Regression
    1. 8.1 The many faces of regression
      1. 8.1.1 Scenarios for using OLS regression
      2. 8.1.2 What you need to know
    2. 8.2 OLS regression
      1. 8.2.1 Fitting regression models with lm()
      2. 8.2.2 Simple linear regression
      3. 8.2.3 Polynomial regression
      4. 8.2.4 Multiple linear regression
      5. 8.2.5 Multiple linear regression with interactions
    3. 8.3 Regression diagnostics
      1. 8.3.1 A typical approach
      2. 8.3.2 An enhanced approach
      3. 8.3.3 Multicollinearity
    4. 8.4 Unusual observations
      1. 8.4.1 Outliers
      2. 8.4.2 High-leverage points
      3. 8.4.3 Influential observations
    5. 8.5 Corrective measures
      1. 8.5.1 Deleting observations
      2. 8.5.2 Transforming variables
      3. 8.5.3 Adding or deleting variables
      4. 8.5.4 Trying a different approach
    6. 8.6 Selecting the “best” regression model
      1. 8.6.1 Comparing models
      2. 8.6.2 Variable selection
    7. 8.7 Taking the analysis further
      1. 8.7.1 Cross-validation
      2. 8.7.2 Relative importance
    8. Summary
  18. 9 Analysis of variance
    1. 9.1 A crash course on terminology
    2. 9.2 Fitting ANOVA models
      1. 9.2.1 The aov() function
      2. 9.2.2 The order of formula terms
    3. 9.3 One-way ANOVA
      1. 9.3.1 Multiple comparisons
      2. 9.3.2 Assessing test assumptions
    4. 9.4 One-way ANCOVA
      1. 9.4.1 Assessing test assumptions
      2. 9.4.2 Visualizing the results
    5. 9.5 Two-way factorial ANOVA
    6. 9.6 Repeated measures ANOVA
    7. 9.7 Multivariate analysis of variance (MANOVA)
      1. 9.7.1 Assessing test assumptions
      2. 9.7.2 Robust MANOVA
    8. 9.8 ANOVA as regression
    9. Summary
  19. 10 Power analysis
    1. 10.1 A quick review of hypothesis testing
    2. 10.2 Implementing power analysis with the pwr package
      1. 10.2.1 T-tests
      2. 10.2.2 ANOVA
      3. 10.2.3 Correlations
      4. 10.2.4 Linear models
      5. 10.2.5 Tests of proportions
      6. 10.2.6 Chi-square tests
      7. 10.2.7 Choosing an appropriate effect size in novel situations
    3. 10.3 Creating power analysis plots
    4. 10.4 Other packages
    5. Summary
  20. 11 Intermediate graphs
    1. 11.1 Scatter plots
      1. 11.1.1 Scatter plot matrices
      2. 11.1.2 High-density scatter plots
      3. 11.1.3 3D scatter plots
      4. 11.1.4 Spinning 3D scatter plots
      5. 11.1.5 Bubble plots
    2. 11.2 Line charts
    3. 11.3 Corrgrams
    4. 11.4 Mosaic plots
    5. Summary
  21. 12 Resampling statistics and bootstrapping
    1. 12.1 Permutation tests
    2. 12.2 Permutation tests with the coin package
      1. 12.2.1 Independent two-sample and k-sample tests
      2. 12.2.2 Independence in contingency tables
      3. 12.2.3 Independence between numeric variables
      4. 12.2.4 Dependent two-sample and k-sample tests
      5. 12.2.5 Going further
    3. 12.3 Permutation tests with the lmPerm package
      1. 12.3.1 Simple and polynomial regression
      2. 12.3.2 Multiple regression
      3. 12.3.3 One-way ANOVA and ANCOVA
      4. 12.3.4 Two-way ANOVA
    4. 12.4 Additional comments on permutation tests
    5. 12.5 Bootstrapping
    6. 12.6 Bootstrapping with the boot package
      1. 12.6.1 Bootstrapping a single statistic
      2. 12.6.2 Bootstrapping several statistics
    7. Summary
  22. Part 4. Advanced methods
  23. 13 Generalized linear models
    1. 13.1 Generalized linear models and the glm() function
      1. 13.1.1 The glm() function
      2. 13.1.2 Supporting functions
      3. 13.1.3 Model fit and regression diagnostics
    2. 13.2 Logistic regression
      1. 13.2.1 Interpreting the model parameters
      2. 13.2.2 Assessing the impact of predictors on the probability of an outcome
      3. 13.2.3 Overdispersion
      4. 13.2.4 Extensions
    3. 13.3 Poisson regression
      1. 13.3.1 Interpreting the model parameters
      2. 13.3.2 Overdispersion
      3. 13.3.3 Extensions
    4. Summary
  24. 14 Principal components and factor analysis
    1. 14.1 Principal components and factor analysis in R
    2. 14.2 Principal components
      1. 14.2.1 Selecting the number of components to extract
      2. 14.2.2 Extracting principal components
      3. 14.2.3 Rotating principal components
      4. 14.2.4 Obtaining principal component scores
    3. 14.3 Exploratory factor analysis
      1. 14.3.1 Deciding how many common factors to extract
      2. 14.3.2 Extracting common factors
      3. 14.3.3 Rotating factors
      4. 14.3.4 Factor scores
      5. 14.3.5 Other EFA-related packages
    4. 14.4 Other latent variable models
    5. Summary
  25. 15 Time series
    1. 15.1 Creating a time-series object in R
    2. 15.2 Smoothing and seasonal decomposition
      1. 15.2.1 Smoothing with simple moving averages
      2. 15.2.2 Seasonal decomposition
    3. 15.3 Exponential forecasting models
      1. 15.3.1 Simple exponential smoothing
      2. 15.3.2 Holt and Holt–Winters exponential smoothing
      3. 15.3.3 The ets() function and automated forecasting
    4. 15.4 ARIMA forecasting models
      1. 15.4.1 Prerequisite concepts
      2. 15.4.2 ARMA and ARIMA models
      3. 15.4.3 Automated ARIMA forecasting
    5. 15.5 Going further
    6. Summary
  26. 16 Cluster analysis
    1. 16.1 Common steps in cluster analysis
    2. 16.2 Calculating distances
    3. 16.3 Hierarchical cluster analysis
    4. 16.4 Partitioning-cluster analysis
      1. 16.4.1 K-means clustering
      2. 16.4.2 Partitioning around medoids
    5. 16.5 Avoiding nonexistent clusters
    6. 16.6 Going further
    7. Summary
  27. 17 Classification
    1. 17.1 Preparing the data
    2. 17.2 Logistic regression
    3. 17.3 Decision trees
      1. 17.3.1 Classical decision trees
      2. 17.3.2 Conditional inference trees
    4. 17.4 Random forests
    5. 17.5 Support vector machines
      1. 17.5.1 Tuning an SVM
    6. 17.6 Choosing a best predictive solution
    7. 17.7 Understanding black box predictions
      1. 17.7.1 Break-down plots
      2. 17.7.2 Plotting Shapley values
    8. 17.8 Going further
    9. Summary
  28. 18 Advanced methods for missing data
    1. 18.1 Steps in dealing with missing data
    2. 18.2 Identifying missing values
    3. 18.3 Exploring missing-values patterns
      1. 18.3.1 Visualizing missing values
      2. 18.3.2 Using correlations to explore missing values
    4. 18.4 Understanding the sources and impact of missing data
    5. 18.5 Rational approaches for dealing with incomplete data
    6. 18.6 Deleting missing data
      1. 18.6.1 Complete-case analysis (listwise deletion)
      2. 18.6.2 Available case analysis (pairwise deletion)
    7. 18.7 Single imputation
      1. 18.7.1 Simple imputation
      2. 18.7.2 K-nearest neighbor imputation
      3. 18.7.3 missForest
    8. 18.8 Multiple imputation
    9. 18.9 Other approaches to missing data
    10. Summary
  29. Part 5. Expanding your skills
  30. 19 Advanced graphs
    1. 19.1 Modifying scales
      1. 19.1.1 Customizing axes
      2. 19.1.2 Customizing colors
    2. 19.2 Modifying themes
      1. 19.2.1 Prepackaged themes
      2. 19.2.2 Customizing fonts
      3. 19.2.3 Customizing legends
      4. 19.2.4 Customizing the plot area
    3. 19.3 Adding annotations
    4. 19.4 Combining graphs
    5. 19.5 Making graphs interactive
    6. Summary
  31. 20 Advanced programming
    1. 20.1 A review of the language
      1. 20.1.1 Data types
      2. 20.1.2 Control structures
      3. 20.1.3 Creating functions
    2. 20.2 Working with environments
    3. 20.3 Non-standard evaluation
    4. 20.4 Object-oriented programming
      1. 20.4.1 Generic functions
      2. 20.4.2 Limitations of the S3 model
    5. 20.5 Writing efficient code
      1. 20.5.1 Efficient data input
      2. 20.5.2 Vectorization
      3. 20.5.3 Correctly sizing objects
      4. 20.5.4 Parallelization
    6. 20.6 Debugging
      1. 20.6.1 Common sources of errors
      2. 20.6.2 Debugging tools
      3. 20.6.3 Session options that support debugging
      4. 20.6.4 Using RStudio’s visual debugger
    7. 20.7 Going further
    8. Summary
  32. 21 Creating dynamic reports
    1. 21.1 A template approach to reports
    2. 21.2 Creating a report with R and R Markdown
    3. 21.3 Creating a report with R and LaTeX
      1. 21.3.1 Creating a parameterized report
    4. 21.4 Avoiding common R Markdown problems
    5. 21.5 Going further
    6. Summary
  33. 22 Creating a package
    1. 22.1 The edatools package
    2. 22.2 Creating a package
      1. 22.2.1 Installing development tools
      2. 22.2.2 Creating a package project
      3. 22.2.3 Writing the package functions
      4. 22.2.4 Adding function documentation
      5. 22.2.5 Adding a general help file (optional)
      6. 22.2.6 Adding sample data to the package (optional)
      7. 22.2.7 Adding a vignette (optional)
      8. 22.2.8 Editing the DESCRIPTION file
      9. 22.2.9 Building and installing the package
    3. 22.3 Sharing your package
      1. 22.3.1 Distributing a source package file
      2. 22.3.2 Submitting to CRAN
      3. 22.3.3 Hosting on GitHub
      4. 22.3.4 Creating a package website
    4. 22.4 Going further
    5. Summary
  34. Afterword. Into the rabbit hole
  35. Appendix A. Graphical user interfaces
  36. Appendix B. Customizing the startup environment
  37. Appendix C. Exporting data from R
    1. C.1 Delimited text file
    2. C.2 Excel spreadsheet
    3. C.3 Statistical applications
  38. Appendix D. Matrix algebra in R
  39. Appendix E. Packages used in this book
  40. Appendix F. Working with large datasets
    1. F.1 Efficient programming
    2. F.2 Storing data outside of RAM
    3. F.3 Analytic packages for out-of-memory data
    4. F.4 Comprehensive solutions for working with enormous datasets
  41. Appendix G. Updating an R installation
    1. G.1 Automated installation (Windows only)
    2. G.2 Manual installation (Windows and macOS)
    3. G.3 Updating an R installation (Linux)
  42. References
  43. index

Product information

  • Title: R in Action, Third Edition
  • Author(s): Robert I. Kabacoff
  • Release date: May 2022
  • Publisher(s): Manning Publications
  • ISBN: 9781617296055