What is Data Science? – Evolution of Data Science

Excerpts from the Global Data Science Forum article “What is Data Science?”
By Paco Nathan posted Mon March 04, 2019

What is Data Science?

A popular 2012 tweet by Josh Wills:

Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.

Data science gained traction in industry circa 2008, just as tooling for big data was on the rise and business use cases for machine learning (ML) were becoming popular. Those three grew together, in contrast to an earlier era of business intelligence (BI), which was initially popularized by Gartner analyst Howard Dresner. Most of BI was defined atop data warehouse (DW) practices, based on work by Barry Devlin and Paul Murphy, Ralph Kimball, Bill Inmon, et al. BI and DW were both introduced in the late 1980s, then became widespread practices throughout the 1990s.

Data science emerged in response to demand for more advanced techniques and larger scale-out than the best practices from the prior decade could provide. Cloud resources were becoming popular, and crucial insights could be obtained more quickly and more cost-effectively thanks to popular open source tools such as Hadoop and Spark, plus a whole range of Python libraries.

In 1962, a Bell Labs mathematician named John Tukey wrote a paper called “The Future of Data Analysis”. Tukey urged a provocative new stance for applied mathematics, which he called data analysis. Its most interesting section headings are:

“We should seek out wholly new questions to be answered.”
“We need to tackle old problems in more realistic frameworks.”
“We should seek out unfamiliar summaries of observational material, and establish their useful properties.”
“And still more novelty can come from finding, and evading, still deeper lying constraints.”

In Ed Tufte’s books on visualizing data, references to Tukey show up throughout nearly all of them.

A generation later, another Bell Labs researcher named William Cleveland coined the term data science in a 2001 paper that cited Tukey, among others: “Data science: An action plan for expanding the technical areas of the field of statistics”. Cleveland proposed an outline for a multi-disciplinary curriculum:

(25%) Multidisciplinary Investigations: data analysis collaborations in a collection of subject matter areas.
(20%) Models and Methods for Data: statistical models; methods of model building; methods of estimation and distribution based on probabilistic inference.
(15%) Computing with Data: hardware systems; software systems; computational algorithms.
(15%) Pedagogy: curriculum planning and approaches to teaching for elementary school, secondary school, college, graduate school, continuing education, and corporate training.
(5%) Tool Evaluation: surveys of tools in use in practice, surveys of perceived needs for new tools, and studies of the processes for developing new tools.
(20%) Theory: foundations of data science; general approaches to models and methods, to computing with data, to teaching, and to tool evaluation; mathematical investigations of models and methods, of computing with data, of teaching, and of evaluation.

This curriculum indicates what Cleveland thought the field required, namely that data science is a space in which statistics and computing need to interact to provide the necessary resources and scale.

That same year, a UC Berkeley professor named Leo Breiman wrote “Statistical Modeling: The Two Cultures”. He contrasted two cultures: one from the previous era, which he called data modeling, and an emerging trend, which he called algorithmic modeling. That culture of data modeling was what Tukey had argued against. The newer culture embraced much larger data rates and more computation, and also leveraged machine learning algorithms to help automate decisions at scale.

The current heyday of data science began when some of these applications which required more data started to become tractable, reliable, and cost-effective (in that order).

Check out these histories by lead architects at four firms from that period, roughly centered on Q3 1997, which turned out to be a key inflection point for the Dot Com Boom:

“Early Amazon: Splitting the website”, Greg Linden, Amazon
“eBay Architecture”, Randy Shoup, eBay
“Inktomi’s Wild Ride”, Eric Brewer, Yahoo! Search (0:05:31 ff)
“Underneath the Covers at Google”, Jeff Dean, Google (0:06:54 ff)

The timing for those projects was during the peak of data warehouses and business intelligence adoption. However, a common theme among those four architects’ reflections is that they recognized how they’d need to scale ecommerce applications but could not do so with available tooling. Instead they turned to open source tools (such as Linux) for early data science work on proto clouds, leveraging ML at scale for ecommerce. Their timing was impeccable, particularly for Amazon: just in time to monetize the first big wave of ecommerce in the holiday season of Q4 1997. The rest is history.

The gist is that ecommerce firms split their web apps using a principle of horizontal scale-out, i.e., proto cloud work on server farms. Those many servers generated lots of log files (proto Big Data), which in turn were analyzed using machine learning algorithms, which in turn provided predictive analytics that improved customer experience in the web apps. A virtuous cycle emerged, with data as a product.

After Q4 1997 the world of data changed, and predictive analytics loomed large. Breiman described that sea change quite succinctly:

A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.

Plenty of other people also helped further the cause of “data science” and deserve credit, such as Jeff Wu, who likely coined the phrase (in its contemporary usage) during his University of Michigan appointment lecture, “Statistics = Data Science?”.

The main takeaway from this article:

Looking at decades of history, data science found its place by applying increasingly advanced mathematics for novel business cases, in response to surges in data rates and compute resources.

In the latest wave of AI applications in industry, we have the term ABC emerging to describe a winning combo of “AI”, “Big Data”, and “Cloud Computing” – as the latest embodiment of that takeaway described above.

Beyond the well-known roles of data scientist and data engineer, there’s another important role emerging which has not yet been named. We found that 23% of the enterprise organizations attempting to leverage data science, machine learning, artificial intelligence, etc., cite “recognizing business use cases” as a critically missing skill within their teams. What would you call that role? Where and how does a person learn to perform it?

Data Science – More explanations

What is Data Science?

In a 2009 McKinsey & Company article, Hal Varian, Google’s chief economist and UC Berkeley professor of information sciences, business, and economics, predicted how important it would be to adapt to technology’s influence on, and reconfiguration of, different industries.

“The ability to take data — to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it — that’s going to be a hugely important skill in the next decades.”

Chapter 1. Introduction: What Is Data Science?
Doing Data Science by Rachel Schutt, Cathy O’Neil

Data Science vs. Big Data vs. Data Analytics
By Shivam Arora
Jan 4, 2019

Oracle Artificial Intelligence (AI)—What Is Data Science?

An elaborate article giving many details of data science and related issues.

Earlier Articles

What is Data Science? – An Introduction to Data Science

Data Science – Online Study Notes and Video Courses – Free Also

Data Analytics and Data Mining – Difference Explained

Updated on 31 May 2019, 26 May 2019