Build You a Library

Posted at — 2 Jan 2020

When I started to build out the data science / actuarial team at my current job, I was keen to make sure we had a small library of technical reference books at our immediate disposal. It’s not unusual for companies to have such a resource ¹, but in my opinion a good reference library is underrated. Our line of work is technical, creative, and experimental, so it’s wise to have inspiration and a good guide. ²

The list is far from canonical or exhaustive, and is formed from my technical education & professional experience, with a few crowd-sourced titles for good measure. ³ Maybe it’ll help similar practitioners looking to build out a technical in-house library. The common thread is modern, leading, technical references under the general themes of Machine Learning, Bayesian Statistics, Insurance and General Data Science.

Each entry has the format:

[Title](link-to-publisher) by [Author](social-handle) [Amazon](link-to-buy) <— Commentary with useful links to online materials etc

General Machine Learning

A solid grounding in Un/Supervised Learning from Data (using & developing algorithms that adapt to observations to provide useful representations, infer behaviours or make predictions):

Pattern Recognition and Machine Learning by Christopher M. Bishop (Amazon) <— PRML is one of the more revered reference books for modern ML. Dense, robustly mathematical, takes no prisoners: if you want to know the source of a technique, it’s likely here. Note that Microsoft made a pdf edition freely available for download
Machine Learning: A Probabilistic Perspective by Kevin P. Murphy (Amazon) <— The spiritual successor to PRML, published only a few years later but after a period of rapid change in ML, so it covers more of the previously unfashionable neural nets. Supporting MATLAB code is available online
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2E by Trevor Hastie et al. (Amazon) <— Another of the classic references: deep, coherent and consistent explanations of modern statistics. Valuable. Note a pdf version is available online
Introduction to Machine Learning with Python: A Guide for Data Scientists by Andreas C. Mueller & Sarah Guido (Amazon) <— A modern, practical guide to many ML techniques implemented in scikit-learn from one of the core developers. This is a great, hands-on way to learn the basics of ML and practice reproducible research with version control, notebooks, documentation etc.

You might also consider:

Information Theory, Inference and Learning Algorithms by David MacKay (Amazon) <— A real classic that covers all the basics in a highly readable manner, grounded in signal processing and physical phenomena. The material is also available online but is getting a little long in the tooth, and sadly David’s no longer with us to provide an update. Consider alongside PRML for a comprehensive guide to ML in the 2000s
An Introduction to Statistical Learning: with Applications in R by Gareth James et al. (Amazon) <— A broad and perhaps gentler introduction to modern statistical techniques with practical examples in R, from the authors of ESL (Hastie et al)
Introduction to Natural Language Processing by Jacob Eisenstein (Amazon) <— Very recently released overview of modern NLP. Accompany with anything written by Matthew Honnibal & the spaCy team
Deep Learning by Ian Goodfellow et al. (Amazon) <— A broad introduction to Deep Learning research and modern applications from the inventors of GANs and leading DNN approaches. Just beware typical business problems probably don’t require this level of complexity, and please don’t call it AI
Reinforcement Learning: An Introduction, 2E by Richard S. Sutton & Andrew G. Barto (Amazon) <— The original bible for RL from two fathers of the field, recently updated to include modern breakthroughs inc RL-DNNs like AlphaZero etc. Also not AI.

Bayesian Statistics & Probabilistic Programming

Specialised statistical modelling, esp. where we care about parsimony, parametric and/or functional form, handling uncertainty in a principled way, and learning from what the data generating and observational processes tell us:

Bayesian Data Analysis, 3E by Andrew Gelman et al. (Amazon) <— BDA is a tour-de-force of modern statistical theory and great reading. Gelman is the nominal figurehead of Stan, a state-of-the-art probabilistic programming platform/language (the core contributors are frontline stats researchers, and so new research generally appears here before e.g. PyMC3)
Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman & Jennifer Hill (Amazon) <— Gelman & Hill provide a valuable and readable reference on hierarchical models, illustrating their power to handle in a principled way imbalanced datasets, latent variables, and model generalisation to new problems
Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan, 2E by John Kruschke (Amazon) <— A.k.a. “The Puppy Book”, this is a very readable resource with useful examples and Kruschke’s own model diagrams, which can vastly aid human understanding of model architectures, more so even than plate diagrams. Also see the comprehensive port to Python in the PyMC3 docs
Statistical Rethinking: A Bayesian Course with Examples in R and Stan (& PyMC3 & brms & Julia too) by Richard McElreath (Amazon) <— Also see the 10 weeks of accompanying videoed lectures. Also see the comprehensive port to Python / PyMC3 in the PyMC3 docs freely available for all to access. Note the Second Edition is slated for Spring 2020
Bayesian Analysis with Python: Introduction to statistical modeling and probabilistic programming using PyMC3 and ArviZ, 2E by Osvaldo Martin (Amazon) <— Very recently written and up to date with great examples and intuition. Also see the accompanying resources in Python / PyMC3.

You might also consider:

Gaussian Processes for Machine Learning by Carl Edward Rasmussen & Christopher K. I. Williams (Amazon) <— GPs are awesomely powerful tools, and I have to give a nod to Chris who taught me probability theory a long time ago. That said, GPs are an advanced method, can be slow, and there are simpler models to try first. The material is available online for free, and see BDA for an intro
Probabilistic Graphical Models: Principles and Techniques by Daphne Koller & Nir Friedman (Amazon) <— A somewhat dry, slightly dated, mathematically heavy but complete reference on PGMs. For easier material see PRML and BDA
Bayesian Methods for Hackers by Cameron Davidson-Pilon (Amazon) <— Entertaining guide to practical Bayesian inference with supporting code in PyMC3 (and TFP!) available online
Bayesian Cognitive Modeling - A Practical Course by Eric-Jan Wagenmakers & Michael D. Lee (Amazon) <— Practical modelling and solid grounding from the cog-science viewpoint. The code & packages are a little dated, but see the comprehensive port to Python in the PyMC3 docs.

Further Statistics Reading for the Insurance Domain

As a generalist ML / stats practitioner I’m biased towards solving technical problems with modern reproducible research, and in particular Bayesian inference. I believe I’m not alone in finding standard actuarial techniques somewhat archaic and suffering from ‘professionalisation’, where over-simplified models are taught and learnt by rote, unnecessarily implemented by hand for a written exam, rarely questioned and quickly forgotten ⁴. I want to overcome my bias because “Statistics is applied statistics” ⁵ and it’s vital to understand one’s domain in detail: the business processes, the nuances of the data-generating processes, and the hard-won lessons of the domain experts. The following texts appear to lead in very much the right direction:

Loss Data Analytics by Jed Frees et al. (Open Actuarial Textbooks Project) <— I stumbled on this when looking for background on Edward (Jed) Frees for RMAFA below, and what a welcome surprise! LDA appears to be the first of a planned series of open source, freely available, modern, implementable texts, published natively online with supporting R code, and a list of contributors including names familiar to me from the Insurance Data Science Conferences inc Katrien Antonio and Arthur Charpentier (also see CAS book below)
Regression Modeling with Actuarial and Financial Applications by Edward W. Frees (Amazon) <— A slightly older, more focussed publication by Jed Frees recommended to me as “This book is very good: I’ve not seen a better introduction to linear models anywhere. Lots of practical applications, and it did a good job making many of the more obscure elements a bit more intuitive”
Predictive Modeling Applications in Actuarial Science. Vols 1 & 2 edited by Edward W. Frees et al. (Amazon) <— This book draws together a broad reach of chapters from many domain experts including contributors to the Open Actuarial Textbooks Project. Recommended to me as “Variable but covers the fundamentals of GI rating reasonably well: GLMs, aggregate models, credibility, geospatial. Incomplete / immature on modern data science but the case studies are valuable”. Also note it has online supporting materials
Pricing in General Insurance by Pietro Parodi (Amazon) <— The go-to reference for standard approaches to pricing, well-explained with practical, real-world examples with accompanying online materials
Computational Actuarial Science with R by Arthur Charpentier (Amazon) <— Comprehensive practical guide to understanding and using standard actuarial analyses and models, though the R code could apparently use a refresh. Charpentier is a prolific communicator and releases a lot of modern statistical actuarial research.

General Data Analysis / Python / R / Software Dev

It’s impossible to cover all ground here, but these are good references for day-to-day “data science” work:

Python for Data Analysis, 2E by Wes McKinney (Amazon) <— Solid practical grounding in the Scientific Python stack for Exploratory Data Analysis (EDA) from the creator of the essential Python library pandas. Wes made supporting materials and Notebooks available online
Python Data Science Handbook: Tools and Techniques for Developers by Jake VanderPlas (Amazon) <— Another solid practical guide to essential Python libraries inc numpy, matplotlib, pandas, and a little scikit-learn and scipy. Freely available online with supporting code (it even has a Twitter handle)
R for Data Science by Hadley Wickham & Garrett Grolemund (Amazon) <— A solid practical grounding in using R for DS. Available for free online from the authors
Data Science at the Command Line by Jeroen Janssens (Amazon) <— Relearn the power of UNIX terminal tools for data manipulation and exploration. Materials and scripts also available online. Also see sed & awk, 2E if, like me, you love a good bash script
Python Cookbook, 3E by David Beazley & Brian K. Jones (Amazon) <— Modern and concise recipes for ‘getting stuff done’ in Python
Flask Web Development, 2E by Miguel Grinberg (Amazon) <— Accessible, practical guide to implementing simple webapps in Flask, a small Pythonic web framework that is very useful for creating MVP / demo APIs (for prediction engines etc). Miguel published a lot of this material in a mega-tutorial blogpost series.

General Reading and Data Viz

Inspiration and casual interest - loan these out across the company to help spark ideas and bridge gaps:

The Art of Statistics: Learning from Data by David Spiegelhalter (Amazon) <— Written for a general audience with reference to real-world examples and building intuition and statistical literacy
Information is Beautiful (New Edition) by David McCandless (Amazon) <— Lots of chart eye-candy, possibly a little dated these days (the hype around infographics is thankfully long-gone), but there are some great ideas here in the tradition of Tufte and D3.js
The Truthful Art: Data, Charts, and Maps for Communication by Alberto Cairo (Amazon) <— Another useful reference with attractive examples for visual storytelling underpinned with theory of visualisation - useful when building dashboards
Storytelling with Data: A Data Visualization Guide for Business Professionals by Cole Nussbaumer Knaflic (Amazon) <— Fundamentals of data viz / dashboarding and examples of how to communicate numeric information to a general business audience. Some visuals available online
Bad Data Handbook by Q Ethan McCallum (Amazon) <— An entertaining collection of real-world stories and pain & hardship from data practitioners around the world dealing with erroneous / incomplete / missing / pathological datasets and ill-formed business problems.

Do shout if you have recommendations worth adding!

¹ These purchases were well-supported internally as part of our wider T&D program, and represent a powerful investment for relatively little money.
² Proper references can also help to justify the use of non-traditional techniques if you can show that other people (usually smarter than you) also think in the same way. It’s dangerous to go alone!
³ Thanks in particular to Mick Crawford and the folks on the Pandas Arms Slack channel.
⁴ Thanks also to Kenny Holms and the folks on the Actuaries Anonymous Slack channel for opinions and recommendations on the actuarial collection.
⁵ Gelman usually has an apposite quote.

Jonathan Sedar
