Data Scientists – Bringing Order to Big Data Chaos
April 1, 2016
Data scientist is a hot job in Silicon Valley. Because data scientists work with systems capable of storing and providing access to exabyte-scale data, they are often the best people to ask about designing the technological tools they need to do their best work. Peaxy has designed Aureum with them in mind because we have top data scientists already working on our team.
Data Scientist Defined
In simple words, a data scientist takes data and finds structure and patterns in it. Then he/she transforms those patterns into information, which is formulated as a mathematical model. In a second step, the model is used to transform the information into knowledge by making predictions. In the final step, a data scientist must be able to build a story around the model and the predictions so that ordinary people and C-level executives can take action and make commitments—here on the West Coast we say we must “put a bird on it.”
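The steps described above (data, then a model, then a prediction) can be sketched in a few lines of Python. This is only a minimal illustration: the data, the variable names, and the choice of a straight-line least-squares model are all assumptions made for the example, not something prescribed by this article.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Use the fitted model to turn information into a prediction."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical data: machine hours versus units produced.
hours = [1, 2, 3, 4, 5]
units = [12, 22, 32, 42, 52]

model = fit_line(hours, units)   # pattern -> mathematical model
forecast = predict(model, 6)     # model -> prediction for a new input
print(f"slope={model[0]:.1f}, intercept={model[1]:.1f}, forecast={forecast:.1f}")
```

The last step, building the story around the forecast, is the part that no snippet can show.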
Data scientists are often jokingly called data analysts who live in Palo Alto. This does not do justice to the range of skills and knowledge that data science requires. First of all, a data scientist needs mathematical knowledge in topics including multivariable calculus, linear and nonlinear programming, and machine learning, as well as knowledge of mathematical statistics, including distributions, design of experiments, statistical tests, and linear and logistic regression.
Second, data scientists need solid computer science skills, including programming with R and SQL, logging, regular expressions, and parsing. Third, the data scientist needs domain expertise, which in practice includes being able to work in an interdisciplinary team and having business sense. Implicit knowledge is key to being able to perform quick back-of-the-envelope calculations.
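To give a flavor of the logging, regular expression, and parsing skills mentioned above, here is a small Python sketch that turns raw log lines into structured records. The log format, the sample lines, and the names `LOG_PATTERN` and `parse_lines` are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Assumed log format: "<timestamp> <LEVEL> <free-text message>"
LOG_PATTERN = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")


def parse_lines(lines):
    """Return one dict per line that matches the assumed format; skip the rest."""
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match:
            records.append(match.groupdict())
    return records


# Hypothetical sample input, including one line that does not parse.
sample = [
    "2016-04-01T09:00:00 INFO pump started",
    "2016-04-01T09:05:12 ERROR pressure out of range",
    "garbled line",
    "2016-04-01T09:06:40 INFO pump stopped",
]

records = parse_lines(sample)
level_counts = Counter(record["level"] for record in records)
print(level_counts)
```

Real munging work is mostly about the lines that do not fit the pattern, which is why the sketch drops them explicitly instead of failing.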
Mathematical and statistical skills are used to accomplish 5 percent of a project, while computer skills are necessary to accomplish the 45 percent that is data munging. The remaining half goes into formulating the story, creating visualizations, and selling the result. The exact mix depends on the kind of company the data scientist works for.
Data Scientists for Different Industries
I am an analyst who lives in Palo Alto, and my skills center on basic tools including relational database design and SQL, Java, R and, most of all, data visualization and communication tools, which in addition to R include heavy lifting in LaTeX and Illustrator. I do not have to do much data munging, machine learning, or heavy-duty programming, for that matter.
Some companies, such as those growing at very high speed, are just trying to get their data under control, or “wrangling it.” Usually there is plenty of low-hanging fruit, and the main activities include logging and data munging. Formulating the story, creating visualizations, and selling the result to the CEO are particularly difficult in these kinds of companies.
The opposite model is the company that gives away a service in exchange for customer data that can be aggregated and sold. Here, mathematical and statistical skills are used for a much larger percentage of the work, for example to analyze click streams, reconstruct true networks of friends, and discover secret desires and habits. These companies have excellent data hygiene, so there is not much need for data munging.
The last type of company is the enterprise that is data-driven but does not sell data. Manufacturing companies fall into this category, because they must constantly analyze their production machines and feed back the data, both locally and globally, in what is now called the Internet of Things (IoT). Other relevant sectors include transportation, communications, oil and gas, healthcare, government, and finance. The scale of their data can reach petabytes, and they have seasoned advanced analytics teams. They do a lot of data munging, but they also deploy a very broad range of tools, which requires broad mathematical and statistical skills.
Peaxy recognizes that there are many different data scientists across a broad spectrum of industries, and we know their pain points. Each month we work to provide new features that help ease that pain and open up new horizons for business data access strategy.