Editor’s note: this post is from The Sampler archives in 2015. A lot has changed in the four years since, not least that most companies in most sectors have now contracted or employed data scientists, and built, bought or run data science systems and projects. Most professionals will now at least have an opinion on data science, occasionally an earned one.
Let’s see how well this post has aged - feel free to comment below - and I might write a follow-up at some point.
The practicing data scientist will be familiar with a wide range of software for scientific programming, data acquisition, storage & carpentry, lightweight application development, and visualisation. Above all, agile iteration, proper source control and good communication are vital.
As I outlined in the previous post, the discipline of data science exists at a confluence of several fields.
A lot of one’s time will be spent at a computer and a lot of that time will be spent writing code, so it’s critical to use the best tools available: powerful hardware, modern software technologies and established engineering methodologies.
Let’s take a quick, high-level tour of some of the technical considerations:
Over the past ten years, R and Python have become two of the most important core technologies in the data science toolbox1. Both are open-source programming languages with a huge ecosystem of supporting code libraries and packages for data processing and statistical modelling.
R has grown organically from the statistics community and is widely used and praised for a rich set of industry-standard algorithms, publication-quality plotting, rapid prototyping and native functional programming. Whilst powerful, it’s perhaps best thought of as an interactive environment for statistics with a high-level programming language attached. There seems to be almost a tradition within academia of releasing an implementation of one’s ‘novel’ algorithm as a new R package, which, coupled with R’s inherently muddled syntax and culture of poor documentation, can make for a daunting initiation for newcomers and regular frustration for software engineers.
Python is a very popular general-purpose high-level programming language with syntax that’s considered intuitive and aesthetic, but a runtime that can be slow compared to compiled languages like C++ or Rust. The creation in 2005 of NumPy, a library for very fast matrix computation, spurred the use of Python within the computer science and machine learning communities, who might traditionally have used MATLAB or C. In the years since, a wealth of best-in-class open-source libraries have been developed for data manipulation, efficient computation and statistical modelling. Coupled with Python’s tradition of excellent documentation, well-maintained open-source code, strong developer communities, consistent syntax and an ethos of ‘batteries included’, this has made Python the default choice for data scientists.
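To make the NumPy point concrete, here’s a minimal sketch (illustrative only, not from the original post) comparing a pure-Python loop with the equivalent vectorised NumPy call; the latter executes in optimised compiled code and is typically orders of magnitude faster at this scale:

```python
# Minimal sketch: the same dot product as a pure-Python loop and as a
# vectorised NumPy call. Array sizes are arbitrary and illustrative.
import numpy as np

x = np.random.rand(1000000)
y = np.random.rand(1000000)

# Pure Python: interpreted element by element, comparatively slow
dot_loop = sum(a * b for a, b in zip(x, y))

# NumPy: the loop runs in optimised, compiled code under the hood
dot_vec = np.dot(x, y)

assert np.isclose(dot_loop, dot_vec)
```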
Data visualisation in both R and Python can be made accurate and beautiful, but it’s also worth noting D3.js: a comprehensive and powerful JavaScript library that makes it quite simple to develop rich, interactive, web- and production-ready visualisations. Tools for web-based data visualisation tend to evolve and specialise extremely rapidly, but D3 has become a standard and is frequently used alongside and within other newer libraries.
Acquiring, cleaning and storing data will often involve using a whole host of additional tools and languages, including Unix command-line tools (awk, sed, cut, grep, curl, etc.), Perl, SQL, JavaScript and many more. There’s quite a good Quora conversation on this with more details.
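Much of this glue work can equally be done from Python itself. Here’s a hedged sketch of a typical acquire → clean → store step; the source URL, column names and cleaning rules are hypothetical:

```python
# Hedged sketch of an acquire -> clean -> store step; the source URL,
# column names and cleaning rules below are hypothetical.
import sqlite3
import pandas as pd

url = "https://example.com/sales.csv"              # hypothetical data source
df = pd.read_csv(url)                              # acquire

df = df.drop_duplicates()                          # basic data carpentry
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df = df.dropna(subset=["order_date", "amount"])

with sqlite3.connect("sales.db") as conn:          # store for later analysis
    df.to_sql("sales", conn, if_exists="replace", index=False)
```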
There’s a whole suite of legacy environments and languages available including MATLAB, SPSS, Stata and SAS. These closed-source tools commonly have expensive licensing, surprisingly conservative development cycles2 and reduced functionality when compared to open source. The high economic barriers to entry limit the size of the user base, leading to fewer contributors, a smaller community and reduced sense of ownership for practitioners.
There are a handful of large companies further undermining the case for closed-source software by packaging, customising and selling enterprise-ready distributions of the above open-source tools, bundled with their own technical support, consulting and library extensions. Two such companies are Revolution Analytics (makers of Revolution R), recently acquired by Microsoft, and Continuum Analytics, who continue to make major open-source developments in the Python community.
We’ve all been through the pain of trying to use spreadsheets for something too complex. It’s initially very tempting to ‘use what you know’, and many businesses rely on Excel files ‘in production’ for critical parts of their processes and products: they are misused as primary datastores, pricing calculators, interactive analysis tools and report generators. To put it simply, spreadsheets are the wrong tool for data analysis & predictive modelling and should be avoided wherever possible.
“I just don’t care about proprietary software. It’s not ‘evil’ or ‘immoral’, it just doesn’t matter. I think that Open Source can do better … it’s just a superior way of working together and generating code.” - Linus Torvalds, interview on GPLv2, 2007
While there are certainly technical crossovers and shared goals between data science and ‘big data’, it’s important to treat the two separately. The ability to store and efficiently manipulate a huge amount of data is incredibly useful, but ‘big data analytics’ often concerns itself simply with providing basic summary statistics rather than statistical modelling, e.g. counting sales volume of an item by location by date, or counting page requests on a website.
NoSQL storage and map-reduce data processing have been around for a long time now; there are many ways to do it and many tools available. Hadoop, HBase, Cassandra, MongoDB, Redis, Riak, Redshift, BigQuery, Mahout, Spark and others all have worthwhile use cases depending on the nature and volume of the data to be stored and processed, and I won’t go into them here.
In a recent talk, Wes McKinney3 observed that the Python data science ecosystem still doesn’t have a great story to tell about ‘big data’, and that there’s very much a need to interface well with high-performance big data systems. I agree, but it’s worth remembering that one can gain deep insight and develop highly predictive models with only small to medium-sized datasets. Intelligent surveying, balanced subsampling, advanced modelling and even simple human communication can often help solve the business issue without requiring us to process terabytes of averages.
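As an illustration of the subsampling point, here’s a minimal pandas sketch (synthetic data, hypothetical column names) that draws a class-balanced sample from a much larger table so that modelling can happen comfortably in memory:

```python
# Balanced subsampling sketch: synthetic data, hypothetical column names.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
big_df = pd.DataFrame({
    "churned": rng.random(1000000) < 0.05,           # rare positive class
    "tenure_months": rng.integers(1, 120, size=1000000),
})

# Draw an equal number of rows from each class for prototyping
sample = big_df.groupby("churned").sample(n=10000, random_state=42)
print(sample["churned"].value_counts())              # 10000 per class
```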
“One only needs two tools in life: WD-40 to make things go, and duct tape to make them stop.” - G. Weilacher
Beyond software, data science also requires a great deal of thought and human-computer interaction. Fortunately we live in a world where memory and storage are fast and cheap, processors are multi-core and large high-resolution displays are available. Getting the right tools for the job is essential.
The data analytics function within any company needs to have excellent desktop hardware4, and capital expenditure here will be rewarded many times over in the improved speed and sophistication of computation, the breadth of analysis possible and the depth of knowledge gained.
When dealing with larger datasets5 or complex models, it’s wise to consider separate server hardware. As noted above, RAM and processing are cheap these days, so building a powerful in-house server is reasonable. External cloud-based servers are worth considering for their ability to scale on demand, reducing capital outlay for short-term projects. That said, holding certain data outside the corporate firewall often requires a layer of legal arrangements and regulations that may make it unviable altogether. We’ll write about our approach to massive and efficient data anonymisation in a future blog post.
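To make the footnoted rule of thumb concrete, here’s a small back-of-the-envelope sketch (the figures are illustrative only) for estimating whether a dense numeric dataset will sit comfortably on the desktop or warrants a server:

```python
# Back-of-the-envelope memory estimate; figures are illustrative only.
def dataset_gb(n_rows, n_cols, bytes_per_value=8):
    """Approximate in-memory size (GB) of a dense array of 8-byte floats."""
    return n_rows * n_cols * bytes_per_value / 1e9

free_ram_gb = 12.0                    # e.g. a 16GB laptop with ~12GB free after the OS
threshold_gb = free_ram_gb / 3        # rule-of-thumb ceiling before considering a server

size_gb = dataset_gb(40_000_000, 12)  # ~3.8 GB, right at the threshold
print(f"dataset ~{size_gb:.1f} GB vs threshold {threshold_gb:.1f} GB")
```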
Naturally, sound software engineering is also of vital importance in data science. As teams grow and models are increasingly implemented in production systems rather than one-off analyses, proper source control is critical to provide code versioning, code review, auditability, continuous integration, testing, documentation and training. Even for one-off analyses undertaken by just one person, these standard working methodologies will preserve the code and may help greatly on the next project5.
Distributed version control tools like Git and Mercurial are the way to go; they’re powerful, widely supported and easy to integrate into the development process.
“The key word in ‘data science’ is not data, it’s science” - Jeff Leek, SimplyStatistics.org, 2013
Good tools for data science provide a framework for discovering new insights and solving problems not previously possible.
I’ll no doubt elaborate on the above technical considerations in future posts about running a data science department, and certainly when discussing particular examples of our work. For now, thanks for reading!
Honourable mentions to Julia and Clojure as alternative languages/environments for general statistical programming. Julia is a relatively new language with a MATLAB-like syntax and many optimisations, including just-in-time compilation, that make it incredibly fast for numeric computation. Clojure is a lisp-like functional language that can run closely alongside Java on the JVM and thus scale easily to massive data processing tasks. Both have found niches in the mathematical and computer science communities respectively, but both are still young and have nowhere near the breadth of packages or community uptake of R and Python. ↩︎
Essentially, many hands make light work: Linus' Law according to Eric Raymond ↩︎
At Applied AI we’re set up with hardware and software to allow us to easily work from anywhere with an internet connection. As an aside, we almost exclusively use software-as-a-service (SaaS) tools for internal business operations and, where suitable, use cloud-based virtual servers for on-demand data processing and modelling. This helps to keep the company lean and flexible, and we’ll write more about all that in future. ↩︎
As a rule of thumb, it’s reasonable to consider using a server when the dataset to be processed grows larger than 1/3 of the available memory on your machine. For example, a laptop with 16GB RAM might have 12GB free after the OS, so the rule-of-thumb threshold of 4GB would roughly correspond to a dataset of 40M rows × 12 numeric features, with each array element represented as a double-precision 8-byte float. This is not a particularly large dataset, and it’s quite easy for the user or an algorithm to create copies or transformations of the data during processing, thereby consuming further memory. ↩︎