Optimise Your Workflow – Tips for Future Data Scientists
June 7, 2017
Data science is essentially an iterative process. To complete a project, a data scientist has to make many changes along the way while trying out new ideas.
First, let us make one thing clear, especially for those who would like to pursue a career in data science: do not confuse the job of a data scientist with that of a software engineer. Software engineering methodologies cannot simply be applied to data science, which is more science and less engineering. Some software tools do help optimise the workflow; however, it is the clarity, experience and intuition of the data scientist and the team that set the preliminary analysis on track.
If a data science project has taken longer than planned, iterations may be the cause. Iterations will happen during the course of any project; however, an iteration made for any reason other than the arrival of new information is unnecessary and could have been avoided.
An unnecessary iteration can occur because the business pain point was not identified correctly, the data scientist was not aligned with the company objective, the data scientist did not initially see the need to collect certain variables, or assumptions and biases in the data were not accounted for. These are just a few scenarios, all of which can be easily avoided.
If some variables are not accounted for, the analysis has to be done again, which is time-consuming and counterproductive for the project and the team working on it.
Some tips to avoid such scenarios:
- Identify and choose the right issue, one that makes good use of a data scientist's skills and the advantages of data analytics. Do not try to solve every resolvable issue with this technique. Apply data science only if the problem is large enough, and state clearly which objective or hypothesis you are working with. Check that the hypothesis is aligned with the desired business outcome. Break a large issue down into all its possible outcomes and then ask at each step which variables would be required. Defining each factor and the applicability of each outcome is a great starting point. Make use of pipeline and data sharing tools.
- Identify the data requirement. This is simple: define the time period you need the data from, collect all information and data points even if they do not look important now, and lastly put a structure to your data requirement by designing tables. This also adds clarity about which variables will be captured.
- Always ensure that the analysis you create is reproducible. This step is simple, yet it is the one most often neglected.
- Writing code is a daunting task; now imagine writing the same code over and over again. To avoid syntax errors, build a directory of the most commonly used code snippets and make sure everyone on the team has it. This improves efficiency and takes care of simple errors.
- Be flexible and adaptive with technology; no process is perfect. Adapt to the limitations of technologies and processes; finding an alternative will help you reach the goal faster.
- Understand the business. You might be a pro at programming and numbers, and data analysis may come naturally to you; however, if you fail to understand how your business works, there will always be a gap in interpreting the output.
- Speak the language of your stakeholders. They might not understand algorithms, and you should not assume they understand technical language. Help them visualise your findings and your approach, using visuals and examples to illustrate your plan. Connecting with the audience is half the battle won.
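Two of the tips above, structuring the data requirement and keeping the analysis reproducible, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the `DataRequirement` class, the `orders` table, its column names and the dates are all hypothetical and would be replaced by whatever your project actually needs.

```python
import random
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class DataRequirement:
    """Hypothetical description of the data a project needs."""
    table: str                   # illustrative table name
    start: date                  # first day of the period to collect
    end: date                    # last day of the period to collect
    variables: Tuple[str, ...]   # variables the analysis expects

    def missing(self, available_columns):
        """Return the required variables that the source does not provide."""
        available = set(available_columns)
        return [v for v in self.variables if v not in available]


def reproducible_sample(rows, k, seed=42):
    """Draw the same k rows on every run by seeding a local generator."""
    rng = random.Random(seed)
    return rng.sample(rows, k)


# Made-up requirement, purely for illustration.
requirement = DataRequirement(
    table="orders",
    start=date(2016, 1, 1),
    end=date(2016, 12, 31),
    variables=("order_id", "customer_id", "order_date", "amount"),
)

# Check a source against the requirement before any analysis starts.
gaps = requirement.missing(["order_id", "order_date", "amount"])
print(gaps)

# Two runs with the same seed give the same sample.
first = reproducible_sample(list(range(100)), 3)
second = reproducible_sample(list(range(100)), 3)
print(first == second)
```

Writing the requirement down as a structure makes a missing variable visible before the analysis begins, and passing the seed explicitly means a teammate can rerun the sampling step and get the same rows.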
Divide the data science project into four phases:
The first phase is Preliminary Analysis – this is where an overview of the data points is done.
The second phase is the Exploratory Data Phase – it is about asking the right questions and cleaning the data to answer those questions.
The third phase is Data Visualisation – here the focus shifts to how to present the analysis.
The fourth phase is Knowledge Discovery – the last stage, where models are built to explain the data and algorithms are tested to arrive at the best possible outcome.
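The four phases can be laid out as a small pipeline, one function per phase. The sketch below is a toy example and not a prescribed implementation: the function names, the `week` and `sales` columns and the four data rows are invented for illustration, the "chart" is plain text, and the model is an ordinary least squares line chosen only because it is short.

```python
from statistics import mean


def preliminary_analysis(rows):
    """Phase 1: overview of the data points (row count, missing values)."""
    columns = rows[0].keys()
    missing = {c: sum(1 for r in rows if r[c] is None) for c in columns}
    return {"n_rows": len(rows), "missing": missing}


def explore_and_clean(rows, required):
    """Phase 2: keep only the rows that can answer the question asked."""
    return [r for r in rows if all(r[c] is not None for c in required)]


def visualise(rows, x, y):
    """Phase 3: crude text chart, one line per row."""
    return [f"{r[x]:>3} | " + "#" * int(r[y]) for r in rows]


def discover(rows, x, y):
    """Phase 4: fit y = slope * x + intercept by ordinary least squares."""
    xs = [r[x] for r in rows]
    ys = [r[y] for r in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    variance = sum((a - x_bar) ** 2 for a in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


# Invented data: one row has a missing value on purpose.
rows = [
    {"week": 1, "sales": 2.0},
    {"week": 2, "sales": 4.0},
    {"week": 3, "sales": None},
    {"week": 4, "sales": 8.0},
]

overview = preliminary_analysis(rows)
clean = explore_and_clean(rows, required=("week", "sales"))
chart = visualise(clean, "week", "sales")
slope, intercept = discover(clean, "week", "sales")

print(overview)
print("\n".join(chart))
print(slope, intercept)
```

Keeping each phase in its own function means that when new information arrives, only the affected phase has to be rerun or rewritten.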
This is not a definitive workflow, and one could make changes to further increase efficiency and productivity. Data science is exploratory in nature: the data scientist is constantly innovating and learning, preparing to overcome business and project challenges.