Editor's note: this post is from The Sampler archives in 2015 - indeed it was the first post there. During the last 4 years a lot has changed, not least that now most companies in most sectors have contracted / employed data scientists, and built / bought / run systems and projects. Most professionals will at least have an opinion now on data science, occasionally earned.
Let's see how well this post has aged - feel free to comment below - and I might write a followup at some point.
The term ‘data science’ has been around for several years with many explanations, discussions and breathless over-excitement in the technology and business press. What is it, where did it come from, and who's using it today?
The term ‘data science’ is a useful shorthand for the recent confluence and evolution of several previously distinct disciplines [1], made possible by an increasing availability of data and sophistication of high quality open source software, decreasing costs of hardware and data processing, intense academic research and massive commercial and industrial interests.
Data science as its own discipline is wide ranging and rapidly evolving, with a general theme of letting humans understand more about a situation, predict real-world actions and identify patterns in data.
It involves descriptive analysis, statistical modelling, iterative experimentation, agile systems development and high quality communication.
I would suggest the 4 main aspects are:
Computer science (machine learning, software development)
Designing and creating novel software and hardware systems to create, store and efficiently analyse large volumes of data. Running advanced algorithmic processing to discover and disseminate patterns in data and predict new events.
Statistical science (experimentation, analysis, predictive modelling)
Researching new mathematical theory, designing algorithms and employing a scientific method to test them experimentally. Applying statistical theory to analyse past events or artifacts and predict future actions.
Graphic design and story telling (visualisation, exploration, interaction)
Distilling hypotheses, insights and predictions into cogent arguments and communicating them effectively to a variety of casual, technical or managerial audiences. Bringing the data to life and making the science accessible.
Leadership and domain expertise (industry experience, project management)
We often find that a data science task, project or programme will be the first of its kind within an organisation and it takes strong will, management buy-in, clear vision and solid communication to take on what's often a tough technical challenge with potentially huge benefits to the business.
"The ability to take data - to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it - that's going to be a hugely important skill in the next decades." - Hal Varian, Chief Economist, Google.
The data scientist is best brought into the core of the business, to work closely with technical, operational and leadership teams to help improve decision making and critical business functions.
As an individual, or more likely a small team, they will have considerable skill in rapid and powerful software engineering, know advanced statistics, communicate effectively and have deep subject-matter expertise.
It's potentially a very varied, highly skilled role, and the remit of a data scientist may cover, for example:
The famous ‘data science Venn diagram’ by Drew Conway of DataKind, IA Ventures, Alluvium and more, is a lighthearted but surprisingly accurate summary of the skills required and regularly employed by a data scientist during the course of their work.
In discussion several years later, Drew reflected that the diagram is still relevant and highlighted the additional importance of strong communication. [2]
"Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician." - Josh Wills, Director of Data Science, Cloudera.
The first industry to really make use of (and thus help define) data science has been the internet-oriented technology sector. They made key progress in:
These companies and others have improved the state of the art in recommendation engines, natural language processing, data compression, game-theoretic auctions, massive-scale psychological experiments, human-computer interaction, campaign analysis, user profiling and more.
Naturally, statistical modelling and data analysis are critical, core capabilities for the typically more conservative pharmaceutical, telecoms and financial sectors too. These companies have conducted drug discovery, network optimization and predictive modelling for a long time, and to assume they aren't familiar with statistical data analysis would be foolish.
However, the sheer abundance of new technologies, tools and techniques available today should not be underestimated. It makes possible all sorts of high-value analysis and modelling that simply wasn't practical in years past.
"Today’s advanced analytics in insurance pushes far beyond the boundaries of traditional actuarial science… While the impetus to invest in analytics has never been greater for insurance companies, the challenges of capturing business value should not be underestimated." - McKinsey & Co., Unleashing the Value of Advanced Analytics in Insurance.
The insurance sector in particular is all about risk modelling and data analysis, and is a natural fit for a data science approach.
Now is the time for insurance companies to take advantage of the past five years of rapid development in the data science discipline. It's time to make wide-ranging improvements throughout their businesses, to:
We'll write specifically about the opportunities for the insurance industry to make best use of data science in future posts.
[1] The field of data science has grown to such an extent that dedicated books are starting to appear, with first-generation data scientists passing down their knowledge and experiences, for example the Data Science Handbook.