# Statistically Speaking – When is data “significant?”

With the era of Big Data and the Internet of Things coming on stream, it’s important for decision makers to understand how to use the deluge of data headed for analysis. For many executives, measurement is normally made within a financial context. Indeed, the KPIs are normally Return on Investment (ROI) or Internal Rate of Return (IRR). These terms provide important benchmarks and form an integral part of decision making.

However, the advantage of Big Data will be more accurate statistical measurements, made possible by larger populations or larger sample sizes. While KPIs will still focus on financial ratios, terminology such as “significance level (alpha),” “confidence interval,” and “coefficient of correlation” will become more familiar in the decision making process.

What does that mean (no pun intended)?

Most high level execs may have taken a statistics course or two way back when. For most, statistics was one of those courses that ranked right up there with calculus and chemistry: challenging to learn and quickly forgotten. Of course, there are functions in the organization, such as Quality Control and Marketing, that probably use some statistics. But how often do C-levelers hear terms like “statistical significance,” “p-value” or “coefficient of correlation”? And if these or similar terms are bandied about, do we remember exactly what they mean and how they can help us make decisions?

This short article will hopefully refresh some of those lost memories in a simple way.

Here is a typical example of where using statistical analysis can help add support to managerial decisions.

Suppose last year you contracted an expensive consulting company to come in and train your sales executives. When the decision was being made, the consulting firm no doubt talked about a typical ROI on such an “investment.” Don’t get me wrong, ROI is a good thing and a good measurement of success. However, ROI does not necessarily tell you whether the improvement came from the training or was the result of other factors. Besides the question of ROI, another question also needs to be asked: “Did the training make a significant change in performance?” “OK, what does significant mean?” you may ask.

What if we could test how well our sales team really performed after the training as compared to before the training? At what point would you say the training had a real impact: when sale closings changed by 10%, 20% or 50%? A 10% improvement could have been the result of smaller sales or multiple sales made to the same buyer. Or it could have been due to a lucky string of buyers. Could the same thing be said if we had a 20% improvement? Intuitively, we would call a 50% improvement a significant change. By using basic statistics, you can find out when a change is statistically significant instead of guessing. Now, I am not implying that you gather the data and run the analysis yourself. Others can do that. However, it is important for you to understand what statistical testing can tell you and what it means.

## Normal distribution and Probability

Variables (test scores, sales and production figures, QC measurements, etc.) are all independent numbers in that each one represents a unique observation. We can take observations from an entire population, or we can take a sample and infer something about the population. The larger the sample size, the more accurate the inference. For example, we can analyze the entire company’s sales figures, or we could take just a few samples and infer what the total population looks like. Big Data will help with that aspect by providing larger samples and populations, and therefore more accurate outcomes.

The important thing is that no matter what size the sample or population, if the variables are normally distributed we can tell a lot about the probability of a number being observed. The normal distribution has its mean (average) at the center of the curve, and the curve can be divided into standard deviations (σ) above and below the mean. Between +/- 2 standard deviations (2σ) we will find about 95.4% of all the data points. That means the probability of finding a data point beyond 2σ on one side of the curve is about 2.3% (2.14% + 0.13%), or roughly 4.6% on both sides combined.

As about 95% of the data lies within +/- 2σ, we could say that any point beyond that range is statistically significant at a confidence level of .95, which corresponds to a significance level (alpha) of .05. Translated into English, we would expect to see a value outside the central 95% of the data only about 5 times out of 100. We could also set the confidence level at 99% (an alpha of .01), so that a significant data point would occur less than 1% of the time, or one out of 100 times. By “times” we mean this: if you took 100 different samples and found the mean of each one, a significant sample would have a mean beyond (that is, less probable than) the 95% or 99% cutoff. That cutoff is also called a critical point or an alpha point.
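For readers who want to check the bell-curve percentages themselves, here is a minimal sketch using only Python’s standard library. It assumes a standard normal curve (mean 0, standard deviation 1); the variable names are mine, chosen for illustration.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1 (an assumption for illustration).
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Share of observations that fall within 2 standard deviations of the mean.
within_2_sigma = std_normal.cdf(2) - std_normal.cdf(-2)

# Probability of landing beyond +2 sigma (one side of the curve only).
upper_tail = 1 - std_normal.cdf(2)

# Probability of landing beyond 2 sigma on either side.
both_tails = 2 * upper_tail

# Critical points (in standard deviations) for two-sided alphas of .05 and .01.
z_crit_05 = std_normal.inv_cdf(1 - 0.05 / 2)
z_crit_01 = std_normal.inv_cdf(1 - 0.01 / 2)

print(f"within +/- 2 sigma: {within_2_sigma:.4f}")
print(f"beyond +2 sigma:    {upper_tail:.4f}")
print(f"beyond either side: {both_tails:.4f}")
print(f"critical point, alpha .05: {z_crit_05:.3f}")
print(f"critical point, alpha .01: {z_crit_01:.3f}")
```

Note that the exact two-sided 95% cutoff sits at about 1.96 standard deviations, not exactly 2; “2σ” is the usual rule-of-thumb rounding.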

So, in our example, we can find out whether the difference between the pre- and post-training sales results had less than a 5% probability of being caused by chance.

## The Null Hypothesis

The way we show something is significant is to assume it is not significant and then test whether the data contradict that assumption. We back into our conclusion. For example, we would state the testing questions like this:

Ho (the Null Hypothesis): The training did not have a significant impact on sales production.

Ha (the Alternative Hypothesis): The training did have a significant impact on sales production.

If the findings showed a probability greater than the .05 significance level (1.0 − .95, that is, a result inside the central 95% of the curve), we would fail to reject the Null Hypothesis. If the findings came up with a probability smaller than .05 (a result beyond the critical value), we would reject the Null Hypothesis and accept the Alternative Hypothesis. If we know there had been an increase in the average sales production, we would want to see whether the average increase was beyond the critical value (in the shaded area). Then we could state, at the 95% confidence level, that the training had made a statistically significant impact. Sometimes this can be confusing, so I think the graphic below will help make it clear.
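To make the decision rule concrete, here is a rough sketch of how an analyst might test the training example. Everything in it is an assumption for illustration: the closing counts are made-up numbers, `paired_z_test` is a hypothetical helper, and the test uses the normal curve as an approximation. With only ten reps a real analysis would more likely use a paired t-test, which gives a somewhat larger probability.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical monthly closings per sales rep (illustrative numbers, not real data).
before = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]
after = [13, 15, 13, 14, 15, 17, 11, 17, 15, 14]


def paired_z_test(before, after, alpha=0.05):
    """One-sided test of Ho: no improvement, using a normal approximation."""
    # Work with each rep's own change so rep-to-rep differences cancel out.
    diffs = [a - b for a, b in zip(after, before)]
    std_error = stdev(diffs) / sqrt(len(diffs))
    # How many standard errors the average change sits above zero.
    z = mean(diffs) / std_error
    # Probability of a change at least this large if Ho were true.
    p_value = 1 - NormalDist().cdf(z)
    # Reject Ho when that probability is smaller than alpha.
    return z, p_value, p_value < alpha


z, p_value, reject = paired_z_test(before, after)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("Reject Ho" if reject else "Fail to reject Ho")
```

Calling `paired_z_test(before, after, alpha=0.01)` applies the stricter 99% standard with the same data. The sketch works on each rep’s before-and-after difference because the same people are measured twice; comparing two unrelated groups would call for a different test.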

The blue curve represents the distribution of sales production before the training. The mean divides the data set in half. The red curve represents the distribution of sales production after the training. As we can see, if the mean of the after-training sales production is beyond the significance level (or critical point) of .05, we can say that the two sets are statistically different. Indeed, there would be less than a 5 out of 100 probability that data from the blue distribution would be observed beyond the blue significance level alpha. We could also say we are at least 95% confident that the null hypothesis is not correct in assuming the two data sets are the same in a statistical sense.

In conclusion, big data can collect vastly more data, but the data still has to fit into some sort of hypothesis and be matched with the proper testing. The results from big data are there to help support decision making and lead to better choices on courses of action. But as the legendary baseball announcer for the Los Angeles Dodgers, Vin Scully, once said about the importance of statistics in the game: “Statistics are like a light post for a drunk. It gives support but no Enlightenment.”

If you want to know if data is really significant, tell your staff you like a significance level of .01 and watch them look at you in wonder and awe.