**By Thomas Joseph, Aspire Systems.**

Over the past few months, many people have been asking me to write about what it entails to do a data science project. The data science literature is replete with articles on specific algorithms or definitive methods, with code, on how to deal with a problem. However, an end-to-end view of what it takes to do a data science project for a specific business use case is a little hard to find. From this week onward, we will be starting a new series called the Applied Data Science Series. In this series I will give an end-to-end perspective on tackling business use cases or societal problems within the framework of data science. In this first article of the Applied Data Science Series we will deal with a predictive maintenance business use case. The use case is to predict the end of life of large industrial batteries, which falls under the genre of use cases called preventive maintenance use cases.

**The big picture**

Before we delve deep into the business problem and how to solve it from a data science perspective, let us look at the big picture of the life cycle of a data science project.

The above figure is a depiction of the big picture of what it entails to solve a business problem from a data science perspective. Let us deal with each of the components end to end.

**In the Beginning …… : Business Discovery**

Any data science project starts with a business problem. The problem we have at hand is to predict the end of life of large industrial batteries. When we encounter such a business problem, the first thing that should come to mind is the set of key variables that will come into play. For batteries, some of the key variables that determine the state of health are conductance, discharge, voltage, current and temperature.

The next question we need to ask is about the lead indicators or trends within these variables that will help in solving the business problem. This is where we also have to take inputs from the domain team. For batteries, it turns out that a key trend which can indicate a propensity for failure is a drop in conductance values. The conductance of a battery will drop over time; however, the rate at which it drops accelerates before points of failure. This is a vital clue which we will have to be cognizant of when we go for detailed exploratory analysis of the variables.
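As a rough illustration of this clue, the sketch below computes the per-day change in conductance between consecutive readings and flags readings where the drop is much steeper than the typical drop. The readings, the function names and the `factor=3.0` threshold are all assumptions made for illustration; they are not taken from the actual battery data or from any domain standard.

```python
from statistics import median

def drop_rates(readings):
    """Per-day conductance change between consecutive readings.

    `readings` is a list of (day, conductance) pairs sorted by day.
    Negative values mean the conductance is falling.
    """
    return [
        (c1 - c0) / (d1 - d0)
        for (d0, c0), (d1, c1) in zip(readings, readings[1:])
    ]

def flag_accelerated_drops(readings, factor=3.0):
    """Days on which conductance fell more than `factor` times faster than usual.

    `factor` is a hypothetical threshold chosen only for this example.
    """
    rates = drop_rates(readings)
    if not rates:
        return []
    typical = median(rates)  # baseline rate of decline for this battery
    if typical >= 0:
        return []  # no overall decline, so there is nothing to compare against
    return [
        day
        for (day, _), rate in zip(readings[1:], rates)
        if rate < factor * typical
    ]

# Made-up readings: a slow decline followed by a sharp fall.
readings = [(0, 1000), (3, 997), (6, 994), (9, 991), (12, 988), (15, 960), (18, 925)]
print(flag_accelerated_drops(readings))  # [15, 18]
```

Using the median as the baseline keeps a few steep drops from distorting the notion of a "typical" rate, which is one reasonable choice among several.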

The other key variable which can come into play is the discharge. When a battery is allowed to discharge, the voltage will initially drop to a minimum level and then recover. This is called the “Coup de Fouet” effect. Every battery manufacturer prescribes standards and control charts for how much the voltage can drop and how the recovery should proceed. Any deviation from these standards and control charts would mean anomalous behavior. This is another set of indicators which we will have to look out for when we explore the data.
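To make the effect concrete, here is a minimal sketch that locates the initial trough in a discharge curve and measures how far the voltage recovers afterwards. The sample voltages and the two limits (`min_trough`, `min_recovery`) are hypothetical; a real check would use the manufacturer's own standards and control charts, which are not reproduced here.

```python
def coup_de_fouet(voltages):
    """Find the initial trough and the recovery peak in a discharge curve.

    `voltages` holds terminal voltage samples from the start of a discharge.
    Returns (trough_voltage, peak_voltage_after_trough), or None when the
    voltage never stops falling.
    """
    for i in range(1, len(voltages) - 1):
        if voltages[i] < voltages[i - 1] and voltages[i] <= voltages[i + 1]:
            # First local minimum: the voltage stopped falling here.
            return voltages[i], max(voltages[i:])
    return None

def is_within_limits(voltages, min_trough, min_recovery):
    """Check a discharge curve against hypothetical limits.

    min_trough:   lowest acceptable trough voltage (assumed limit)
    min_recovery: smallest acceptable rise after the trough (assumed limit)
    """
    result = coup_de_fouet(voltages)
    if result is None:
        return False  # no recovery observed at all
    trough, peak = result
    return trough >= min_trough and (peak - trough) >= min_recovery

# Made-up discharge samples: dip to 1.94 V, then recovery to 2.00 V.
samples = [2.10, 2.02, 1.96, 1.94, 1.97, 1.99, 2.00, 1.99, 1.98]
print(coup_de_fouet(samples))
print(is_within_limits(samples, min_trough=1.90, min_recovery=0.03))
```

A curve that dips below the assumed trough limit, or that never turns upward, is reported as outside the limits and would be a candidate for closer inspection.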

In addition to the above two indicators, there are many other factors indicating failure which one would have to be aware of. During the business exploration phase we have to identify all such factors related to the business problem we are to solve and formulate hypotheses about them. Once we formulate our hypotheses, we have to look for evidence and trends within the data that bear on them. With respect to the two variables discussed above, some hypotheses we can formulate are the following:

- A gradual drop in conductance over time would mean normal behavior, and a sudden drop would mean anomalous behavior
- Deviation from the manufacturer-prescribed “Coup de Fouet” effect would indicate anomalous behavior

When we go about exploring the data, hypotheses like the above will be our points of reference for the trends we have to look for in the variables involved. The more hypotheses we formulate based on domain expertise, the better placed we are at the exploratory stage. Now that we have seen what the business discovery phase entails, let us summarize the key considerations within it:

- Understand the business problem which we have set out to solve
- Identify all key variables related to the business problem
- Identify the lead indicators within these variables which will help in solving the business problem
- Formulate hypotheses about the lead indicators

Once we are equipped with sufficient knowledge about the problem from a business and domain perspective, it is time to look at the data we have at hand.

**And then came Data ……. : Data Discovery**

In the data discovery phase we have to understand some critical aspects of how the data is captured and how the variables are represented within the data sets. Some of the key considerations during the data discovery phase are the following:

- Do we have data pertaining to all the variables and lead indicators which we defined during the business discovery phase?
- What is the mechanism of data capture? Does the data capture mechanism differ according to the variables?
- What is the frequency of data capture? Does it vary across the variables?
- Does the volume of data captured vary according to the frequency and variables involved?
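The frequency and volume questions above can be answered with a quick profile of each data set's time stamps. The sketch below is one way to do that, assuming the time stamps are available as ISO-8601 strings for a single battery; the feed names and dates are made up for illustration.

```python
from datetime import datetime
from statistics import median

def capture_profile(timestamps):
    """Summarise how often a data set is captured.

    `timestamps` is a list of ISO-8601 strings for a single battery.
    Returns the row count and the median gap between readings, in days.
    """
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    gaps = [(b - a).total_seconds() / 86400 for a, b in zip(times, times[1:])]
    return {
        "rows": len(times),
        "median_gap_days": median(gaps) if gaps else None,
    }

# Hypothetical feeds with different capture frequencies.
feeds = {
    "feed_a": ["2016-01-01", "2016-01-03", "2016-01-06", "2016-01-08"],
    "feed_b": ["2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04"],
}
for name, stamps in feeds.items():
    print(name, capture_profile(stamps))
```

Comparing the row counts and median gaps side by side shows at a glance whether the data sets arrive at different rates and in different volumes.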

In the case of the battery prediction problem, there are three different data sets. These data sets pertain to different sets of variables. The frequency of data collection and the volume of data captured also vary. The key data sets involved are the following:

- Conductance data set: data pertaining to the conductance of the batteries. This is collected every 2-3 days. Some of the key data points collected along with the conductance data include
  - Time stamp when the conductance reading was taken
  - Unique identifier for each battery
  - Other related information like manufacturer, installation location, model, the string it was connected to, etc.

- Terminal voltage data: data pertaining to the voltage and temperature of the battery. This is collected every day. Key data points include
  - Voltage of the battery
  - Temperature
  - Other related information like battery identifier, manufacturer, installation location, model, string data, etc.

- Discharge data: discharge data is collected once every 3 months. Key variables include
  - Discharge voltage
  - Current at which the voltage discharges
  - Other related information like battery identifier, manufacturer, installation location, model, string data, etc.

As seen, we have to work with three very distinct data sets, with different sets of variables, different frequencies at which the data points arrive and different volumes of data for each of the variables involved. One of the key challenges is connecting all these variables into a coherent data set which will help in the predictive task. This becomes easier if we formulate the predictive problem by connecting the available data sets to the business problem we are trying to solve. Let us first attempt to formulate the predictive problem.
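As a rough picture of what connecting the data sets could look like, the sketch below attaches to each conductance reading the most recent voltage and discharge values recorded on or before that day. This is only one possible alignment, not necessarily the approach this series takes. Using the conductance readings as the base timeline, integer day numbers as time stamps and the sample values shown are all assumptions made for illustration.

```python
from bisect import bisect_right

def latest_before(records, day):
    """Value of the most recent (day, value) record at or before `day`.

    `records` must be sorted by day. Returns None when nothing precedes `day`.
    """
    days = [d for d, _ in records]
    i = bisect_right(days, day)
    return records[i - 1][1] if i else None

def align(conductance, voltage, discharge):
    """Build one row per conductance reading for a single battery.

    Each row carries the latest voltage and discharge voltage known at the
    time of the conductance reading.
    """
    return [
        {
            "day": day,
            "conductance": value,
            "voltage": latest_before(voltage, day),
            "discharge_voltage": latest_before(discharge, day),
        }
        for day, value in conductance
    ]

# Made-up readings for one battery, keyed by day number.
conductance = [(0, 1000), (3, 997), (6, 994)]
voltage = [(0, 2.25), (1, 2.25), (2, 2.24), (3, 2.24), (4, 2.24), (5, 2.23), (6, 2.23)]
discharge = [(2, 1.95)]

for row in align(conductance, voltage, discharge):
    print(row)
```

Looking only backwards in time when attaching values avoids leaking future information into a row, which matters once the combined data set is used for prediction.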