How DEEP is your Data Science project?
“I am very frustrated with my work… I get blamed for every project failure… I don’t want to pursue Data Science anymore… I want to quit my job…”
Of course, I would be taken aback by such an exclamation, especially when it comes from one of my brightest students.
I always believed that one of the perks of being a coach is being able to walk your students through almost every challenge they encounter. As I enquired further to find out what was wrong, something serious did come out of the discussion.
His management had provided him with some data and asked him to come up with a proposed solution. The issue he faced was that they gave him no problem statement, no goal statement, no project objectives, no understanding of the business domain, no process flow diagrams, no knowledge of the data source, no data dictionary, and absolutely nothing about the vision.
The management believed that if you owned the title of a Data Scientist, you had magic hands. You look into the data and are expected to give the management “solutions” immediately. If only it were true…
On discussing this further with him, I found that, due to time pressure, the team had been forced to jump into solution mode without using any methodology or structured data science framework.
Typically, data scientists are so engrossed in building algorithms that they tend to miss the bigger picture. More importantly, I have noticed that many training institutes and universities teach data science and machine learning, but very few give importance to data science methodology or frameworks.
The phone call with my student compelled me to write this piece on a Data Science framework/methodology.
What does D.E.E.P. stand for?
- Define Phase
- Explore Phase
- Exploit Phase
- Productionize Phase
Here is an insight into each of these phases.
The first phase we go through is the Define phase, one of the most crucial stages in a data science project, yet one that is more often than not neglected by the Data Science team.
- Problem Statement – Before undertaking the project, it is critical to have a short description of the problem the Data Science team is expected to address, and this should be presented to them before they attempt to tackle it.
- Objective / Goal – An Objective is a high-level statement that provides an overall context for what the Data Scientist is trying to achieve and should align with business goals.
- Business Process – A high-level process flow that captures the business activities, data capture and importantly customer interaction “moment of truth”.
- Data Source – Understanding the data sources helps the team identify possible sources of predictive patterns.
- Data Dictionary – Specifically for structured data, creating a data dictionary is one of the most important parts of the Define phase. It is a set of information describing the contents, format, and structure of the data and the relationships between its attributes.
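To make the last point concrete, here is a minimal sketch of what a data dictionary might look like in code; the dataset, attribute names, and types are entirely hypothetical, used only for illustration:

```python
# A minimal data-dictionary sketch for a hypothetical customer-churn dataset.
# Attribute names, types, and descriptions are illustrative assumptions,
# not a real schema.
data_dictionary = [
    {"attribute": "customer_id", "type": "string", "description": "Unique customer identifier (primary key)"},
    {"attribute": "tenure_months", "type": "integer", "description": "Months since account creation"},
    {"attribute": "monthly_charges", "type": "float", "description": "Current monthly bill in USD"},
    {"attribute": "contract_type", "type": "category", "description": "Month-to-month, one-year, or two-year"},
    {"attribute": "churned", "type": "boolean", "description": "Target: did the customer leave last quarter?"},
]

def describe(dd):
    """Render the dictionary as aligned text for a Define-phase document."""
    return "\n".join(
        f"{row['attribute']:<16} {row['type']:<9} {row['description']}"
        for row in dd
    )

print(describe(data_dictionary))
```

Even a plain table like this forces the team to agree on what each attribute means before any modeling starts.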
In the Explore phase, we tend to carry out most of the dirty work. This is also the phase in which data scientists most often try to cut corners.
- Data Cleansing – where we check the data for completeness and cleanliness.
- Data Imputation – Here we check for missing values and decide on a strategy: replace them with the mean, median, or most likely value, or, as a last resort, delete the record (which is usually not recommended).
- Label/One-Hot Encoding – In label encoding, each category of a categorical attribute is mapped to an integer. One-hot encoding instead transforms a categorical feature into a set of binary indicator columns, which works better with most classification and regression algorithms.
- Data Transformation – Specific clustering, classification, and regression algorithms need the data to be normalized, standardized, or min-max scaled. These are some of the activities associated with data transformation.
- Statistical Exploration – In Statistical Exploration we try to understand individual attribute patterns (mean, max, range, and min), relationships between attributes, any Outliers or any possible errors.
- Inferential Analytics – Inferential analytics draws valid inferences about a population based on an analysis of a representative sample of that population. One of my favorites is the Chi-Square test, which is widely used to test for associations between categorical variables.
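Several of the Explore-phase activities above can be sketched in a few lines, assuming pandas and SciPy are available; the toy data and column names are illustrative assumptions:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy frame with a missing value and a categorical column (illustrative data).
df = pd.DataFrame({
    "age": [25, 32, None, 41, 38, 29],
    "plan": ["basic", "pro", "basic", "pro", "pro", "basic"],
    "churned": ["no", "yes", "no", "yes", "no", "no"],
})

# Data imputation: fill the missing age with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# One-hot encoding: expand the categorical 'plan' into indicator columns.
df = pd.get_dummies(df, columns=["plan"])

# Data transformation: min-max scale 'age' into [0, 1].
df["age_scaled"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())

# Inferential analytics: chi-square test of independence between
# plan type and churn, via a contingency table.
table = pd.crosstab(df["plan_pro"], df["churned"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}")
```

On real data each of these steps deserves its own scrutiny (for instance, choosing between mean and median imputation based on skew), but the mechanics look much like this.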
The Exploit phase is where the Data Scientist plays with the data and builds the model. It is often treated as the only step that adds value to the project, which is a general misconception: each stage is crucial for the progress of the project.
Some of the key activities of Exploit phase are:
- Data Stratification – Stratification is the process of dividing members of the population into homogeneous subgroups before sampling. Based on Statistical Exploration, Stratification will be carried out.
- Feature Engineering – Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of model building.
- Model Building & Machine Learning – We build multiple models based on different machine learning algorithms and different datasets. More importantly, we test the models for accuracy, resource usage, response time, processing time, etc. We then identify the best-performing models by fine-tuning the parameters of the machine learning algorithms.
- Model Prediction – Once models are built, we now predict outcomes for new data. We keep validating model accuracy to make sure accuracy levels are consistent with different variations in data.
- Cross-Validation – Cross Validation is a very useful technique for assessing the performance of machine learning models. It helps in knowing how the machine learning model would generalize to an independent data set.
- Visualization – Data visualization communicates information by encoding it as visual objects contained in graphics (e.g., ROC curves, AUC, scatter plots, regression plots).
- Reporting – At the end of the Exploit phase, reports need to be provided to the project owners on the algorithms used, the prediction results, and the expected benefits. This part of the phase is crucial, but unfortunately most data science teams fail to prepare any kind of report.
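The core Exploit-phase loop of stratified splitting, model building, prediction, and cross-validation can be sketched with scikit-learn; the Iris dataset and the logistic-regression model here are placeholders for whatever your project actually uses:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# Data stratification: keep class proportions identical in train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Model building: fit a simple classifier on the training split.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Model prediction: score held-out data the model has never seen.
test_accuracy = model.score(X_test, y_test)

# Cross-validation: 5-fold estimate of how the model generalizes.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"test accuracy={test_accuracy:.2f}, cv mean={cv_scores.mean():.2f}")
```

The point of the cross-validation line is exactly the one made above: a single train/test split can flatter a model, while the fold-by-fold scores show how stable its accuracy really is.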
And now we reach the final stage, the Productionize phase: everything looks good, the model has been built and reported to management, and it is time to move it into a production/live environment with a live data feed.
Below are the key activities within the Productionize phase:
- Data Product Development Plan – As in a typical software engineering project, once the machine learning models are developed and tested, we move them into a staging environment with a live or test data feed. We often convert the model into a web service that makes predictions on live data and sends back the results.
- Testing the Solution against the Data Feed – Once the web service has been developed, it needs to be tested: this activity performs unit and system testing against the data feed. Once we get predictions against the data feed, we often validate them with a domain expert.
- Deployment – On successful testing, we deploy the solution/model into a production environment.
- Model Maintenance – The final task is model maintenance. As new data arrives, we validate the model and keep it updated (attribute reselection, different parameter tuning, etc.).
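As a sketch of the Productionize idea, here is a minimal, standard-library-only prediction endpoint written as a WSGI callable. The stub model, the field name, and the decision rule are purely illustrative assumptions, standing in for a real trained model served behind a real web framework:

```python
import io
import json
import pickle

# A stub with the same predict() interface as a trained model stands in for
# the real model from the Exploit phase (the threshold rule is illustrative).
class StubModel:
    def predict(self, rows):
        return ["yes" if r["monthly_charges"] > 70 else "no" for r in rows]

# Persist and reload the model, as we would before deployment.
blob = pickle.dumps(StubModel())
model = pickle.loads(blob)

def prediction_app(environ, start_response):
    """A minimal WSGI endpoint: POST a JSON list of records, get predictions."""
    size = int(environ.get("CONTENT_LENGTH") or 0)
    rows = json.loads(environ["wsgi.input"].read(size))
    body = json.dumps({"predictions": model.predict(rows)}).encode()
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]

# Exercise the endpoint in-process, as a unit test would.
payload = json.dumps([{"monthly_charges": 90.0}, {"monthly_charges": 40.0}]).encode()
environ = {"CONTENT_LENGTH": str(len(payload)), "wsgi.input": io.BytesIO(payload)}
result = prediction_app(environ, lambda status, headers: None)
print(json.loads(result[0]))
```

Calling the app directly with a fake `environ`, as done at the bottom, is exactly the kind of unit test the "Testing the Solution against the Data Feed" activity describes, before any live traffic touches the service.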
With this, I would like to conclude my article in the hope that the above methodology/framework brings some discipline to planning and executing a Data Science project.