Editor’s note: this post is from The Sampler archives in 2015. A lot has changed during the last 5 years, not least that most companies in most sectors have now contracted / employed data scientists, and built / bought / run systems and projects. Most professionals will now at least have an opinion on data science, occasionally earned.
I was prompted to dig this post up following an offline chat about a DS Project Checklist by fast.ai: it seems we all eventually stumble on thinking about the meta issues.
Let’s see how well this post has aged - feel free to comment below - and I might write a follow-up at some point.
The heart of data science is to use statistical and programmatic techniques to realise value from data. This applies to non-profit organisations, conventional businesses and new startups alike. That realisation can take different forms:
… all three viewpoints are valid and demonstrate that we all want to follow a methodical process from initial speculation, through justified strategies and hypotheses, through data analysis, to improved models, systems and processes.
To define such a process is nothing new, and data analysis methodologies such as the Cross Industry Standard Process for Data Mining (CRISP-DM) - developed as far back as the mid-nineties - adequately capture the workflows, tasks and responsibilities, and explain the benefits to various parts of the business.
However, many aspects of data science are still maturing: the CRISP-DM process appears to have been abandoned, and a handful of new processes are in discussion, the most notable being OSEMN (Obtain, Scrub, Explore, Model, iNterpret).
Applied AI are often called upon to deliver analytical insights and systems where before there were none. Naturally our projects are sold upon these final deliverables, but there’s tremendous business value to be found throughout the analytical process, and I think that value is worth trying to define.
Every new process needs a convoluted acronym, so I’ll invent SPEACS 1. This is currently just a rough sketch, but you can see the general flow from ideas to implementation and the iterative nature.
Let’s work through each stage in more detail and discuss the value to be found.
Everything starts here: defining the business case, the potential benefits and risks. There are many questions we ought to try to answer before going any further, including:
On the technical side we might expect to support these questions by variously creating synthetic data, running efficient data explorations & visualisations, and building small simulations to deliver “what-if” analyses, make approximate estimates and help engage wider parts of the business.
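To give a flavour of what this can look like in practice (a minimal sketch only - the scenario, figures and variable names below are entirely hypothetical placeholders), a handful of lines of Python is often all that’s needed to turn a speculative question into a rough estimate:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "what-if": how much annual revenue might a churn-reduction
# initiative protect? Every figure below is a made-up placeholder that a
# real project would replace with stakeholder estimates.
n_trials = 10_000
customers = 50_000
monthly_churn = rng.normal(loc=0.020, scale=0.004, size=n_trials)      # uncertain baseline churn rate
churn_reduction = rng.uniform(0.05, 0.20, size=n_trials)               # plausible effect of the initiative
revenue_per_customer = rng.normal(loc=30.0, scale=5.0, size=n_trials)  # monthly revenue per customer

# Customers retained over a year, and the revenue they represent
customers_saved = customers * monthly_churn * 12 * churn_reduction
protected_revenue = customers_saved * revenue_per_customer

print(f"Median protected revenue: {np.median(protected_revenue):,.0f}")
print(f"80% interval: {np.percentile(protected_revenue, 10):,.0f} "
      f"to {np.percentile(protected_revenue, 90):,.0f}")
```

Even a crude simulation like this gives stakeholders a range to react to, rather than a single point guess.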
At this stage though, we are unlikely to conduct any heavy analysis, and certainly wouldn’t implement any systems. We aim to gain valuable insights into the current & future states of the business, and build a justified case for performing those analyses or building those systems.
Possibly the project will stop right here, or the goals will be sought through other means, and that’s okay. Change is always hard, and every minute spent on these questions saves hours down the road.
This stage dives into the detail of what and how the project will run. I’ve placed this at the start of an iterative sub-cycle of Core Technical Activities, since planning is always tightly influenced by what’s actually possible and the various outcomes of the full process. Some considerations include:
At this stage we are still unlikely to have accessed any data, created any descriptive analyses or built any predictive models. What we seek here is to set up the right environment (both business and technical) for the project to succeed - whilst accounting for all sorts of considerations and compromises.
The valuable outputs / artifacts / deliverables are likely to include project schedules, technical architectures, legal frameworks, privacy statements, ethical statements and priced business plans.
“An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.” - John Tukey 2
Most analysts will tell you that data preparation consumes a disproportionate amount of project time, and they’re totally right. What you don’t hear so often is the huge amount of value created during this stage:
In the initial stages of most projects we inevitably use some data that was created for another purpose - actuarial extracts, customer transaction logs, social media and marketing messages, equity price movements etc. In fact it’s quite likely that your chosen data sources have never before been properly documented, combined and stored.
High quality data is documented, clean, complete and representative; it makes downstream analysis and insight generation repeatable and reliable.
Well maintained data sources are scalable, backed-up, always available, proximate to the data analysis, and have a data dictionary, schemas and permissions-based access.
Whilst ‘big data’ offers the opportunity to investigate and model a lot of information, more often than not the actual analyses can and should be performed on smaller, well-chosen sample sets. In fact, most general business issues do not involve datasets with more than a few million samples, which is still small data.
Feature engineering (the selection, creation and transformation of data) lets us consider how to cleanly and intelligently select features for use in modelling, create derivative features, and modify those that already exist.
Item engineering (value imputation, item exclusion, datatype modification) lets us consider how to cope with missing values and ensure that datatypes are optimal; both are sketched in code a little further below.
If the data itself is well-engineered - tidy data - it will often yield new insights by itself, without any statistical modelling. These are the age-old summaries, distributions and visualisations as delivered by business intelligence solutions.
We can summarise most of the above points by thinking about the data ‘pipeline’. Proper planning and engineering can make the process of data acquisition more robust, extensible, repeatable, and auditable.
Quality control, security, privacy and sampling can and should become more automated as a data science business function matures.
Many other projects throughout software development, marketing, product design, and management will all benefit from a well-maintained data pipeline.
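To make the above concrete, here’s a minimal pandas sketch of the kind of sampling, item engineering and feature engineering described above; the file and column names are hypothetical placeholders, not a prescription:

```python
import numpy as np
import pandas as pd

# Hypothetical transaction extract; in practice this would be the documented,
# permissioned data source described above.
df = pd.read_csv("transactions.csv", parse_dates=["transaction_date"])

# Sampling: a well-chosen subset is usually enough for the analysis at hand
sample = df.sample(n=min(len(df), 100_000), random_state=42)

# Item engineering: impute missing values and tighten datatypes
sample["basket_value"] = sample["basket_value"].fillna(sample["basket_value"].median())
sample["store_id"] = sample["store_id"].astype("category")

# Feature engineering: derive new features from those that already exist
sample["weekday"] = sample["transaction_date"].dt.day_name()
sample["log_basket_value"] = np.log1p(sample["basket_value"])

# Persist the cleaned sample so downstream analysis is repeatable
sample.to_csv("prepared_sample.csv", index=False)
```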
Finally, the bit that most people think about when they hear ‘data science project’. You’ll hopefully agree, though, that by this stage we should have already realised a huge amount of value for the business. Simply getting here is often more than half of the project.
The now-standard toolkits in Python, R et al. allow us to investigate high-dimensional datasets very quickly, discover trends, visualise patterns and share insights. JavaScript and proprietary plotting tools let us share rich, interactive visualisations with non-technical users.
Descriptive analysis is never really ‘done’, and we are likely to come back to it throughout the project to investigate new data, test our ideas, and make estimates for project planning.
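As a flavour of how little code is needed to get started, here’s a short sketch that reuses the hypothetical prepared sample from the data-preparation sketch above:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical prepared sample carried over from the data-preparation sketch
sample = pd.read_csv("prepared_sample.csv")

# Tabular summaries: the age-old descriptive statistics
print(sample.describe(include="all"))
print(sample.groupby("weekday")["basket_value"].agg(["count", "mean", "median"]))

# A quick look at a distribution often suggests the next question to ask
sample["basket_value"].hist(bins=50)
plt.title("Basket value distribution (hypothetical data)")
plt.tight_layout()
plt.show()
```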
This stage really is the heart of the project cycle and you can expect to iterate through it several times. We might also stop here in the case that we answer the key business questions posed in the Strategy stage.
It’s important to start with at least a basic hypothesis of how you expect the data to behave and what it will explain and predict.
This hypothesis is of the data-generating process somewhere out there in the real world that resulted in the data you have acquired: that process is noisy, full of errors and assumptions, and your understanding of it may admit multiple explanations, resulting in nearly-equivalent hypotheses (model configurations) to be argued and tested.
Seek to prove or disprove your assumptions, starting with parsimonious, explainable models of the data-generating process, and always evaluate appropriately: use reasoned summary statistics, prior and posterior predictions, cross-validation and hold-out sets, and state any asymptotic assumptions.
Prediction is often highly problem-specific, and various tools & techniques are not always suitable, nor transferable between domains. It’s important to always have a solid understanding of the actual problem to be solved, rather than blindly throwing algorithms at the data and hoping for success.
That said, for most practical purposes it’s reasonable to start with conventional ‘off the shelf’ algorithms that you understand and believe to approximate the data-generating process, rather than, say, crafting a bespoke Bayesian inferential model.
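For example, evaluating a parsimonious, off-the-shelf scikit-learn model with cross-validation and a hold-out set might look something like the minimal sketch below; the dataset, feature names and target are hypothetical:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical prepared dataset with a binary target column "churned"
df = pd.read_csv("customer_features.csv")
X = df[["basket_value", "visits_last_90d", "tenure_months"]]  # hypothetical features
y = df["churned"]

# Keep a hold-out set untouched until the very end
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Start with a parsimonious, explainable, off-the-shelf model
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
print(f"Cross-validated AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Only once we're satisfied do we look at the hold-out set
model.fit(X_train, y_train)
holdout_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Hold-out AUC: {holdout_auc:.3f}")
```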
A well-tested predictive model can be immensely valuable, and both the source code and trained model should be maintained, versioned and backed up like any other software project.
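Continuing the hypothetical sketch above, persisting the trained model so it can be versioned, backed up and restored is straightforward with joblib; the directory and naming convention here are illustrative only:

```python
import os
from datetime import date

import joblib

# Persist the trained model from the sketch above alongside its source code;
# the directory and naming convention are illustrative placeholders.
os.makedirs("models", exist_ok=True)
model_path = f"models/churn_model_{date.today().isoformat()}.joblib"
joblib.dump(model, model_path)

# Later, any script or system can restore exactly the same model
restored_model = joblib.load(model_path)
```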
“Data are not taken for museum purposes; they are taken as a basis for doing something. If nothing is to be done with the data, then there is no use in collecting any. The ultimate purpose of taking data is to provide a basis for action or a recommendation for action. The step intermediate between the collection of data and the action is prediction.” - W. Edwards Deming
The final stage of this inner cycle of Core Technical Activities is communication & change: putting to use the observations and insights gained by making a change in the business or more widely. Without this stage, all the best analysis and modelling in the world is useless.
Ensure that the models are answering the right question, and ensure the results are communicated in a way that is relevant, timely and accessible to the audience.
Good reporting is relevant reporting. Think scientifically, summarise findings to the appropriate technical or business level, ensure that results can be verified and code audited.
The right audience will be able to draw upon relevant observations and arrive at reasoned insights.
It’s hard to argue with well-sourced, well-supported insights, but the greatest value comes when someone makes a change - and change is hard to achieve.
Good stakeholder management starting from the Strategy and Planning stages ought to ensure that there is a willing audience for the above work; there’s immense value in simply bringing them along with the project.
Quite often, a data science project need go no further than this stage; if the analysis is sufficient to make a reasoned change, then it’s time to revisit the Strategy stage, and extend, modify or simply complete the project and move onto the next.
If the aim is to have a system or ‘product’, then we can move onto the next stage…
Congratulations, you’ve reached a level where it’s worthwhile to embed your insights into systems. This systemisation might be:
well documented scripts manually run by analysts on a regular basis (one such script is sketched just after this list)
simple applications - potentially integrated into larger systems - providing business intelligence or predictive modelling to internal teams
highly technical applications more akin to a full software product, with well specified processes, user types, security, backup & failover etc.
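As a flavour of the first option, here’s a minimal, hypothetical sketch of a documented batch-scoring script; the file names, columns and model artifact are placeholders carried over from the earlier sketches:

```python
"""score_batch.py - a hypothetical, well-documented script an analyst might run weekly."""
import sys

import joblib
import pandas as pd

FEATURES = ["basket_value", "visits_last_90d", "tenure_months"]  # hypothetical columns


def main(input_csv: str, output_csv: str) -> None:
    # Load the versioned model produced in the modelling stage (placeholder path)
    model = joblib.load("models/churn_model_latest.joblib")

    batch = pd.read_csv(input_csv)
    batch["churn_score"] = model.predict_proba(batch[FEATURES])[:, 1]
    batch.to_csv(output_csv, index=False)
    print(f"Scored {len(batch)} rows -> {output_csv}")


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```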
Think carefully about how systems management & execution fit into regular business processes and vice versa. Implementing new technical systems is a great opportunity to redesign old business processes and make sure teams are well-configured and delivering business value.
Try to keep the model complexity as low as is reasonable given the dataset and the desired results. Smaller, simpler models scale more easily and can be repeated more frequently on cheaper hardware. Generally speed of execution beats complexity.
It’s essential that any solutions put in place are subject to regular monitoring and re-evaluation to ensure their continued relevance. When necessary, the solution’s assumptions and structure may need to be updated.
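A monitoring check need not be elaborate to be useful. As a minimal, hypothetical sketch, a scheduled comparison of a key input feature’s recent distribution against the training snapshot can flag when re-evaluation is due; the files, column and threshold below are illustrative assumptions:

```python
import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical monitoring check: has the distribution of a key input feature
# drifted since the model was trained? Files, column and threshold are
# illustrative assumptions only.
training_snapshot = pd.read_csv("training_snapshot.csv")
recent_batch = pd.read_csv("last_30_days.csv")

stat, p_value = ks_2samp(training_snapshot["basket_value"].dropna(),
                         recent_batch["basket_value"].dropna())

if p_value < 0.01:
    print("WARNING: 'basket_value' distribution has shifted - consider re-evaluating the model")
else:
    print("No significant drift detected in 'basket_value'")
```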
Build things iteratively and keep the development process lean. Data science often looks like software development, and nowhere more so than at the systems-implementation stage.
Follow an iterative process, test everything, allow for both experimentation and failure, factor in 50% more time than your most pessimistic estimates, and bring stakeholders along closely with regular demos.
This last bit should be the easiest stage of all, since system development is very well understood by now, but it’s where we should all be very careful not to fall into the twin traps of (1) academically hand-waving away the dirty complexities & compromises of software development, and (2) the hacker’s dismissal of mathematical rigour.
A well-functioning data-science implementation team is of tremendous business value.
My intention in this post was to demonstrate that a ‘data science’ project yields value all the way throughout the process, not just the analysis and modelling stages.
I’ve explained this in a newly-drafted process called SPEACS, which I’ve detailed with examples and opinions gained from several years working in the arenas of data science, consulting and systems development. I’d love to know your thoughts, and it’s likely that I’ll elaborate upon this in future.
Pronounced “speaks”. ↩︎
Tukey is always a good read. See for example his paper ‘Sunset Salvo’, in which he condenses 40+ years of observations into only a few pages of incredibly quotable and still highly-relevant insight. ↩︎