(Skiena, 2017) - Book - The Data Science Design Manual - 3
• Robustness: Real scientists are comfortable with the idea that data has
errors. In general, computer scientists are not. Scientists think a lot about
possible sources of bias or error in their data, and how these possible problems
can affect the conclusions derived from them. Good programmers use
strong data-typing and parsing methodologies to guard against formatting
errors, but the concerns here are different.
Becoming aware that data can have errors is empowering. Computer
scientists chant “garbage in, garbage out” as a defensive mantra to ward
off criticism, a way of saying “that’s not my job.” Real scientists get close
enough to their data to smell it, giving it the sniff test to decide whether
it is likely to be garbage.
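As one illustration of what a sniff test might look like in practice, the sketch below runs a few basic checks on a single numeric column: how many values are missing, which fall outside a plausible range, and which value repeats most often. The function name `sniff_test`, the sample `heights_cm` data, and the plausibility bounds are all hypothetical, chosen only for this example; real checks depend on knowing where the data came from.

```python
import math
from collections import Counter


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def sniff_test(values, low, high):
    """Report basic warning signs in a numeric column.

    `low` and `high` are plausibility bounds supplied by the analyst
    (an assumption for this sketch; the data cannot provide them).
    """
    present = [v for v in values if not is_missing(v)]
    counts = Counter(present)
    top_value, top_count = counts.most_common(1)[0] if counts else (None, 0)
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "out_of_range": [v for v in present if not low <= v <= high],
        "most_repeated": (top_value, top_count),
    }


# Hypothetical heights in centimeters: one blank, one value that looks like
# it was entered in meters, and a repeated 999.0 that may be a placeholder.
heights_cm = [172.0, 165.5, None, 1.80, 999.0, 168.0, 999.0, 181.2]
report = sniff_test(heights_cm, low=50, high=250)
for name, result in report.items():
    print(f"{name}: {result}")
```

None of these checks proves the data is wrong. They only point to records worth a closer look, such as a sentinel code standing in for a missing measurement or a unit mix-up.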
Aspiring data scientists must learn to think like real scientists. Your job is
going to be to turn numbers into insight. It is important to understand the why
as much as the how.
To be fair, it benefits real scientists to think like data scientists as well. New
experimental technologies enable measuring systems on a vastly greater scale than
ever possible before, through technologies like full-genome sequencing in biology
and full-sky telescope surveys in astronomy. With new breadth of view comes
new levels of vision.
Traditional hypothesis-driven science was based on asking specific questions
of the world and then generating the specific data needed to confirm or deny
the hypothesis behind them. This is now augmented by data-driven science, which instead focuses on
generating data on a previously unheard-of scale or resolution, in the belief that
new discoveries will come as soon as one is able to look at it. Both ways of
thinking will be important to us.
There is another way to capture this basic distinction between software
engineering and data science. It is that software developers are hired to build
systems, while data scientists are hired to produce insights.
This may be a point of contention for some developers. There exists an
important class of engineers who wrangle the massive distributed infrastructures
necessary to store and analyze, say, financial transaction or social media data