Critical Data Science Mistakes (And How to Avoid Them)

In this special guest feature, Dave McCarthy, VP of IoT Solutions at Bsquare, discusses a few of the biggest data science errors organizations make and tips on how to avoid them. Dave is a leading authority on industrial IoT. He advises Fortune 1000 customers on how to integrate device and sensor data with their enterprise systems to improve business outcomes. Dave regularly speaks at technology conferences around the globe and recently delivered the keynote presentation at Internet of Things North America. Dave earned an MBA with honors from Northeastern University.

In Gartner’s recent estimation, around 85 percent of big data projects fail. Despite key advances such as the growing pool of data scientists, advances in collective intelligence, new tools coming to market, and consulting teams emerging to help reduce errors, some critical data science mistakes still persist. To help turn the tide in your current project – or for those considering starting a new project – here are a few of the biggest data science errors organizations make and tips on how to avoid them.

Poor Data Quality

Regardless of a business’s type or size, its data scientists are bound to find messy data, and organizing it can take a significant amount of time and effort. That’s why it’s imperative to avoid manual data entry wherever possible. One alternative to entering data by hand is utilizing application integration tools that reduce the proliferation of typographical errors, alternate spellings, and individual idiosyncrasies. Another key to good data quality is careful data preparation. This involves clear communication and documentation of placeholder values, calculation and association logic, and cross-dataset keys. It also should include using well-defined industry standards, continuous anomaly detection, and statistical validation techniques (such as tracking frequency and distribution characteristics on incoming and historical data).

Pro Tip: Data scientists should make it clear to program stakeholders exactly what ensuring data quality entails and why thorough implementation matters; skipping it can jeopardize the quality of your results. Time spent readying data also helps prevent rework down the line. Once data is uniform and consistent, it’s time to weed out the data you don’t need – an essential step to ensuring data quality.
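To make the statistical validation idea concrete, here is a minimal sketch of tracking distribution characteristics on incoming versus historical data. The `drift_score` helper and its threshold of 3 are illustrative assumptions, not a prescribed method; real pipelines would use proper hypothesis tests and per-field baselines.

```python
import random
from statistics import mean, stdev

def drift_score(historical, incoming):
    """Illustrative check: how far the incoming batch mean sits from the
    historical mean, measured in historical standard deviations."""
    mu, sigma = mean(historical), stdev(historical)
    return abs(mean(incoming) - mu) / (sigma or 1.0)

random.seed(0)
historical = [random.gauss(50.0, 5.0) for _ in range(1000)]
clean_batch = [random.gauss(50.0, 5.0) for _ in range(100)]
# A shifted batch, e.g. a miscalibrated sensor or a units change
shifted_batch = [random.gauss(80.0, 5.0) for _ in range(100)]

print(drift_score(historical, clean_batch))    # small: batch looks like history
print(drift_score(historical, shifted_batch))  # large: flag for review
```

A score well above a chosen threshold (here, a few standard deviations) is a prompt to investigate before the data ever reaches a model.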

Too Much Data

Despite the current hype around ‘big data,’ an overabundance of data can actually cause a host of problems that prevent meaningful progress. In these instances, dimensionality reduction and feature selection techniques (such as PCA and penalization methods) can help eliminate the noise to cut through to what matters most. One common misstep when performing predictive analytics, for example, is collecting too much data that is unrelated to reaching the goal. If the data becomes too large, you may fall into the trap of developing excellent predictive models that don’t deliver results due to a combination of high-variance fields and an inability to generalize well. Conversely, if you track too many occurrences without robust validation procedures and statistical tests in place, rare events may seem more frequent than they actually are. In either circumstance, validation and testing routines are paramount.
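As a rough sketch of what PCA buys you here: the example below builds a dataset where only two underlying signals drive ten noisy features, then shows that a couple of principal components capture nearly all the variance. The data is synthetic and the SVD-based PCA is a bare-bones illustration, not a full modeling workflow.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: 200 samples where 2 latent signals drive 10 noisy features
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

# PCA via SVD: center the data, decompose, compute explained variance ratios
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = (S ** 2) / (S ** 2).sum()

# The first two components should account for nearly all the variance,
# so the remaining eight features are mostly noise you can discard
print(explained[:3])
```

In practice you would inspect the explained-variance curve to decide how many components to keep, rather than assuming the answer in advance.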

Pro Tip: Filtering out the noise might create the illusion that anything is possible, including the coveted ability to predict critical business events. But the truth is far more complex, as predictive capabilities require an array of variables, not just data.

Assuming Event Prediction is a Slam Dunk

Predictive analytics is an exciting capability made possible by IoT. Because of its perceived value to business, it can quickly become the priority of company stakeholders. But predictive analytics is not possible or valuable in all instances. It’s essential to first establish a clear objective for your analytics program and follow that with research to ensure its viability and value upon completion. For example, an oil and gas business might want to predict failure of oil pumps. The next step is syncing with subject matter experts to determine which predictions will aid in achieving that goal. Next, you’ll need to make sure you have all of the data required to make the prediction. If you don’t, it may be possible to create a plan to obtain the missing data. However, this is not always possible, in which case you may need to reset goals.
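The "make sure you have the data" step can be as simple as an explicit inventory check before any modeling begins. The signal names below are hypothetical stand-ins for whatever your subject matter experts identify as required inputs.

```python
# Hypothetical inputs an SME might require for a pump-failure model
required = {"vibration", "temperature", "flow_rate", "run_hours"}

# Signals actually present in the existing telemetry feed (illustrative)
available = {"temperature", "flow_rate", "run_hours", "pressure"}

missing = sorted(required - available)
if missing:
    # Plan collection for the gap, or reset the program's goals
    print(f"Cannot build this prediction yet; missing signals: {missing}")
else:
    print("All required signals available; proceed to modeling.")
```

Running the gap analysis up front turns "reset goals" from a painful late surprise into a planned decision point.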

Pro Tip: In advance of any IoT program, in-house data scientists can benefit greatly from enlisting the services of complementary outside data science resources. Doing so helps avoid mistakes, conserve time and energy, allow for better allocation of internal resources, and reduce time to value.

Overpromising What Data Science Can Deliver

There’s often an educational gap data scientists face when applying their skills to a particular vertical market, such as manufacturing or transportation. The reality is, data scientists require subject matter experts in order to correctly interpret the data and map it to the chosen business use case(s). Many large industrial organizations have veterans with deep institutional knowledge. Pairing these experts with data scientists will go a long way toward setting appropriate expectations and meeting objectives. Fundamentally, data science projects are different from software projects. A typical data science project is iterative, exploratory (there is a reason why it is called science), and constantly evolving with additional data, data sources, and business use cases.

Pro Tip: Data science is not a silver bullet. Instead it’s the highly advanced (and ongoing) mathematical analysis of extremely large data sets in search of unique and actionable insights. Often the data needs to be refined, cleansed, restructured, and even combined with other data sources (collectively known as data exploration) before it can truly add value. Failure to understand this is the principal reason expectations often go unmet.

Data scientists provide expertise that is essential to increasing ROI, especially around IoT initiatives. Avoiding common mistakes can help accelerate these projects and allow organizations to get the most out of their data.
