so you want to be a data scientist?
or you believe you already are one? well, i’ve had a bit of science training in my background, and i’m here to say: really? ok then, let’s study what this really means. i feel that, in order to truly understand what a data scientist is, we must first study the meaning of the term scientist. let’s see what webster has to say about this:
scientist: a person who is trained in a science and whose job involves doing scientific research or solving scientific problems (merriam-webster)
which, of course, leads me to ask: what then is scientific research and/or a scientific problem? i would argue that scientific research needs to follow the scientific method, and scientific problems are those that need to be definitively solved using the scientific method…
scientific method: principles and procedures for the systematic pursuit of knowledge involving the recognition and formulation of a problem, the collection of data through observation and experiment, and the formulation and testing of hypotheses (merriam-webster)
exploring what the market thinks…
now before we go any further (because you know i can, and i will – as it is in my nature), let’s see what the market has to say about the title: “data scientist”
- ibm says: while not tied exclusively to big data projects, the data scientist role does complement them because of the increased breadth and depth of data being examined, as compared to traditional roles
- techtarget says: a data scientist is a job title for an employee or business intelligence (bi) consultant who excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge.
- wikipedia says: in general terms, data science is the extraction of knowledge from data, which is a continuation of the field data mining and predictive analytics, also known as knowledge discovery and data mining (kdd). it employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information theory and information technology, including signal processing, probability models, machine learning, statistical learning, data mining, database, data engineering, pattern recognition and learning, visualization, predictive analytics, uncertainty modeling, data warehousing, data compression, computer programming, and high performance computing.
- blogger on udacity says: “data scientist” is often used as a blanket title to describe jobs that are drastically different.
- kdnuggets offers this insight: a veteran statistician argues that 3 different areas usually included in “data science” require dramatically different skills, education, and training with very little overlap.
ok, enough is enough, where does that leave us?
my perspective is this: none (with the exception of the veteran statistician) seem to have a good idea about what “data science” is or should be, and none (again, except the statistician) seem to relate the job or its responsibilities back to the core discipline of the scientist. that is to say: setting up an experiment or a scientific test.
a final point: a scientific hypothesis must be falsifiable, meaning that one can identify a possible outcome of an experiment that conflicts with predictions deduced from the hypothesis; otherwise, it cannot be meaningfully tested. (wikipedia, scientific method)
i believe the title is just “hot air” unless the individual wishing to carry it is actually performing scientific experiments in accordance with the scientific method.
which leads me to the following question…
if you call yourself a data scientist, then are you actively practicing true science? can you prove your hypothesis? do you have indisputable scientific proof that your theory is true? have you formulated, documented, and produced a hypothesis before simply “diving in to explore the data”?
do i really have to say it?
yes, i suppose i do… i have yet to meet a real “data scientist” – one who has controlled experiments with hypotheses, test results, and scientific proofs of their research. the oxford english dictionary describes the scientific method as follows: “a method or procedure that has characterized natural science since the 17th century, consisting in systematic observation, measurement, and experiment, and the formulation, testing, and modification of hypotheses.”
there may be a few “data scientists” out there, but they have yet to reveal themselves. i truly think that if these people exist, they can and should be publishing in professional journals about their experiments, patterns, tests, measurements, and so on, especially when it comes to turning data into information in accordance with a hypothesis that has been properly tested and proven. but just because i don’t see it doesn’t mean it doesn’t exist. perhaps we need an international journal of data science?
what does this all mean?
there is no real “data scientist” title to speak of today that defines the proper bounds and conditions for formulating data as an asset on the books of the corporation. you see, if you can prove a hypothesis for data, then you can directly correlate data with a value on the books (either intrinsic, extrinsic, or both). it would truly mean you can prove the creation of knowledge / information based on data correlation, mining or otherwise. it would mean you can test for false positives, establish a control set, set up a proper neural network and learning algorithm, and so on.
it would mean you have standards, patterns, and repeatability within your business rules, data associations, and collaborations along the way. in the end, it has nothing to do with “big data”, nosql, structured, or unstructured, or machine feeds, or anything else regarding the nature, size, or velocity of the data. it has everything to do with how you are treating the data, testing the data, turning the data into information – and how you are proving your results.
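to make the idea of testing for false positives against a control set a little more concrete, here is a minimal python sketch. it assumes a small labelled control set where the true answer is already known; the function names and the sample labels are hypothetical, invented purely for illustration, and do not come from any particular library or real data set.

```python
from typing import Sequence, Tuple


def confusion_counts(actual: Sequence[int], predicted: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count (tp, fp, tn, fn) for binary labels, where 1 = positive and 0 = negative."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1  # flagged, and really positive
        elif p == 1 and a == 0:
            fp += 1  # flagged, but actually negative: a false positive
        elif p == 0 and a == 0:
            tn += 1  # not flagged, and really negative
        else:
            fn += 1  # not flagged, but actually positive: a false negative
    return tp, fp, tn, fn


def false_positive_rate(actual: Sequence[int], predicted: Sequence[int]) -> float:
    """Share of the truly negative cases that were wrongly flagged as positive."""
    _, fp, tn, _ = confusion_counts(actual, predicted)
    negatives = fp + tn
    if negatives == 0:
        raise ValueError("control set contains no actual negatives")
    return fp / negatives


if __name__ == "__main__":
    # hypothetical control set: known truth vs. what the model or rule claimed
    known_truth = [1, 0, 0, 1, 0, 0]
    model_says = [1, 1, 0, 0, 0, 0]
    print(confusion_counts(known_truth, model_says))
    print(false_positive_rate(known_truth, model_says))
```

the point of the sketch is only that a false positive rate cannot be reported at all without a control set whose true labels are known in advance.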
ok, i’ve provided enough rhetoric and opinion about the title itself. what, then, should you focus on if you really want to be a data scientist (i mean a real data scientist)?
so you really want to be a data scientist?
here (in my opinion) are some things you should consider, before you decide to attach the label “data scientist” to your job title…
- do scientific experiments that follow the scientific method – on any size data
- publish the hypothesis, experiments, control cases, and results in a professional & respected journal
- explain the patterns discovered, their ties to the business, and the relationship between the raw data and the value of the results to the business (i.e., turning data into information)
- show the margin of error, be able to discuss the practicality of false positives in your results, and be able to discuss how the conclusion might be faulty
- add references to your work, show what influences have been leveraged and applied
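the experiment and margin-of-error items in the list above can be sketched in a few lines of python. this is only an illustration under stated assumptions: the hypothesis is written down before the data is examined, the two samples (control and treatment) are invented numbers, and a permutation test plus a normal-approximation margin of error stand in for whatever test actually fits your problem. all names below are hypothetical.

```python
import random
import statistics
from typing import Sequence, Tuple

# written down BEFORE looking at the data, and falsifiable:
# if the treatment mean is not higher than the control mean, the hypothesis fails.
HYPOTHESIS = "the treatment group has a higher mean than the control group"


def permutation_test(control: Sequence[float], treatment: Sequence[float],
                     n_permutations: int = 10_000, seed: int = 0) -> Tuple[float, float]:
    """Return (observed difference in means, two-sided permutation p-value)."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    observed = statistics.mean(treatment) - statistics.mean(control)
    pooled = list(control) + list(treatment)
    n_control = len(control)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = statistics.mean(pooled[n_control:]) - statistics.mean(pooled[:n_control])
        if abs(diff) >= abs(observed):
            extreme += 1
    # add-one correction keeps the estimated p-value strictly above zero
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


def margin_of_error(control: Sequence[float], treatment: Sequence[float],
                    z: float = 1.96) -> float:
    """Approximate 95% margin of error for the difference in means (normal approximation)."""
    standard_error = (statistics.variance(control) / len(control)
                      + statistics.variance(treatment) / len(treatment)) ** 0.5
    return z * standard_error


if __name__ == "__main__":
    # invented sample data, for illustration only
    control = [10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 9.8, 10.1]
    treatment = [12.0, 13.0, 11.5, 12.5, 12.2, 11.8, 12.7, 12.1]
    diff, p = permutation_test(control, treatment)
    print(HYPOTHESIS)
    print("difference:", diff, "+/-", margin_of_error(control, treatment))
    print("p-value:", p)
```

a small p-value here only says the observed difference would be unusual if the group labels did not matter; whether the conclusion might still be faulty (sample size, bias in how the groups were formed) remains something to discuss alongside the numbers.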
in my humble opinion, when you have begun this journey, you can and should call yourself a data scientist. i don’t think that “data science” has anything to do with “big or small data”, or “unstructured vs structured” – if you go that route, and believe the marketplace’s general definitions (as stated at the start of this entry), i think you are doing the entire line of work a disservice.
in the end, it’s about reaching a conclusion that adds value or knowledge to the business. turning data into business assets is the end-game, along with proving the path to that value: again, with the margin of error, the control, and the false positives shown. the rigors of testing and proving are part of the job, nay, part of the responsibility of assuming the role of data scientist.
let me just add the following: i do not consider myself a data scientist (yet) but rather a data explorer – not as sexy sounding, but nonetheless, i am guilty of not following the full scientific method (at least not yet). for the data vault models, etl / elt patterns, and core data sets, i’ve followed most (but not all) of the scientific method. i’ve not published hypotheses, or controlled experiments, or test results.
what i will say is this: it’s not easy, fast, or simple to go down this road; however, the results of executing the scientific method against data, metadata, architecture, methodology, and implementation patterns will yield far more gains than can be imagined. i believe that if data science is taken seriously, we can and will ultimately reach its true nature: information science and value-added data assets for the business.
anyhow, i hope you enjoyed this rant – i’d love to hear your thoughts and feedback; even contradictory opinions are welcome.
(c) dan linstedt, 2015 all rights reserved.