An average day in the life of a data scientist consists of preparing data (identifying, collecting, cleaning, aggregating and so on), modelling the prepared data, and operationalizing data models so the business can consume the insights. That data preparation, also called data wrangling (or data integration), is a huge challenge for most data scientists. There is even a joke about it: data scientists spend 80 percent of their time dealing with data preparation problems and the other 20 percent of their time complaining about how long it takes to deal with data preparation problems.
There have been multiple studies on the time spent on data wrangling. Figure Eight, an AI and machine learning services company, runs a yearly survey of data scientists. Based on the results of the 2016 survey, it has become an axiom that data wrangling costs data scientists as much as 80% of their time, leaving only 20% for data modeling and machine learning. A more recent survey of CIOs by IDG’s CIO Research Services revealed that 98% believe preparing and aggregating large datasets in a timely fashion is a major challenge.
The most common approach to data wrangling involves manually writing code, and if your data is structured, that probably means writing T-SQL or some variant. But because how you “wrangle” the data affects whether you can put a data science project into production, writing code isn’t enough. A data scientist also needs to document data sources, record how data is modified, enriched or recalculated, and take security, privacy and compliance into account.
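The hand-coded approach typically looks something like the following sketch. SQLite (via Python's standard library) stands in for a T-SQL database here, and the table, column names and data are purely illustrative: the wrangling steps are the usual ones of trimming and normalizing text, casting string amounts to numbers, dropping blanks and de-duplicating rows.

```python
import sqlite3

# In-memory database standing in for a SQL Server source; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_orders (
    order_id INTEGER,
    customer TEXT,
    amount   TEXT,      -- amounts arrive as text, some blank
    region   TEXT
);
INSERT INTO raw_orders VALUES
    (1, 'Acme ', '100.50', 'West'),
    (2, 'acme',  '',       'West'),   -- missing amount
    (3, 'Beta',  '75.00',  'East'),
    (1, 'Acme ', '100.50', 'West');   -- duplicate row
""")

# Typical wrangling steps expressed as SQL: trim and normalize text,
# cast amounts to numbers, drop blank amounts, and de-duplicate.
cleaned = conn.execute("""
    SELECT DISTINCT
        order_id,
        TRIM(LOWER(customer)) AS customer,
        CAST(amount AS REAL)  AS amount,
        region
    FROM raw_orders
    WHERE amount <> ''
    ORDER BY order_id
""").fetchall()

for row in cleaned:
    print(row)
```

Even in this toy case, the query alone doesn't capture where the data came from or why each transformation was applied, which is exactly the documentation burden described above.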
Wouldn’t it be great if there were a tool that supported data wrangling without lots of coding, could automate the data pipeline, and automatically documented the entire process, all while helping to eliminate the data silos typically created by AI projects? We thought so too, so TimeXtender built Discovery Hub. Here is a short video about how we help build data pipelines and wrangle data for data science and artificial intelligence.
Discovery Hub dramatically accelerates time to value for data scientists by reducing the work needed to build and maintain the data estate that powers analytics, AI and machine learning.
Using Discovery Hub to build and document your data estate allows data scientists to perform data discovery on all necessary data through a single connection, without direct access to source systems. Schedule automatic incremental refreshes of your data lake through our 100+ native connectors and adapters. Then combine and model data from all data sources on a single platform to filter, group, join and aggregate data for easy access.
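The combine-and-model step above amounts to the classic join, filter, group and aggregate operations. A minimal sketch of those operations in pandas, using two hypothetical source tables (the frames and column names are illustrative, not Discovery Hub output):

```python
import pandas as pd

# Illustrative frames standing in for two source systems exposed
# through a single connection; names and values are hypothetical.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [100.0, 50.0, 75.0, 20.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["West", "East", "West"],
})

# Join the two sources, filter out small orders, then group and aggregate.
combined = orders.merge(customers, on="customer_id")
large = combined[combined["amount"] >= 50.0]
by_region = large.groupby("region")["amount"].sum().reset_index()
print(by_region)
```

The point of a platform like Discovery Hub is that these same steps are defined, scheduled and documented in one place instead of scattered across ad-hoc scripts like this one.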
Data Wrangling and Automated Machine Learning
While Discovery Hub can dramatically reduce the time spent on data wrangling for machine learning, traditional machine learning model development remains resource-intensive, requiring significant domain knowledge and time to produce and compare dozens of models. Given that complexity, many mid-sized and enterprise organizations find it difficult to capture the benefits of machine learning.
Automated Machine Learning (AutoML) simplifies machine learning development and makes it easier to build models in organizations of any size. By automatically selecting and training machine learning models, and eliminating repetitive experimentation tasks, AutoML helps organizations take advantage of the benefits machine learning has to offer, much faster.
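The core idea AutoML automates can be sketched in a few lines: fit several candidate models on the same data, score each one the same way, and keep the best. This deliberately tiny illustration (plain Python, made-up data) compares a constant-mean baseline against a through-the-origin linear fit; real AutoML services search far larger model and hyperparameter spaces.

```python
# Toy illustration of automated model selection: fit each candidate,
# score with a common metric, keep the winner. Data is hypothetical.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]  # roughly y = 2x

def fit_mean(xs, ys):
    # Baseline: always predict the mean of the training targets.
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_linear(xs, ys):
    # Least squares for a line through the origin: y = slope * x.
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return lambda x: slope * x

def mse(model, xs, ys):
    # Mean squared error, the shared scoring metric.
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

candidates = {"mean": fit_mean, "linear": fit_linear}
scores = {name: mse(fit(xs, ys), xs, ys) for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores[best])
```

In practice the repetitive loop — train, score, compare, repeat — is exactly the part AutoML takes off the data scientist's plate.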
TimeXtender and Microsoft have teamed up to provide the benefits of automated data wrangling and automated machine learning. Using one of TimeXtender’s Azure Marketplace templates, you can quickly build and deploy a prebuilt end-to-end machine learning solution using Discovery Hub and Azure Automated Machine Learning service. Read more about how Discovery Hub and Azure Machine Learning Services work together here.
Ultimately, our goal is to reverse the 80-20 rule of data wrangling: instead of consuming 80% of the data scientist's time and effort, wrangling should take 20% (or less), freeing them for the more strategic work of building machine learning models, optimizing them and deploying them into production.