How confident are you really about your modelled data? If you answered honestly, your method is likely to be akin to sticking your finger in the air and seeing which way the wind is blowing.
The problem with modelled data is that, by its very nature, it is modelled. Errors and inaccuracies can therefore creep in, making it at best useless and at worst a dangerous tool in business decision making. This is why confidence scores are crucial to today’s modelled data attributes.
For many businesses, to be able to trust and use data science and modelled data, transparency and explainability are key. If brands are to make important decisions around pricing, qualification, risk and more using data science, they have to be able to understand how models came to the scores they have and how accurate the models themselves are. It is vital for communicating with customers and regulators alike.
Let’s take the insurance industry as an example. Confidence-scored data gives insurers the autonomy to set their own thresholds when making nuanced judgements around pricing or the customer journey. Companies can decide for themselves between a more disruptive but thorough customer journey and automated form fill when creating policies. Speciality services can tailor models to these variables with full transparency into the quality of the data and the risk they are facing.
However, there are two main problems in creating accurate confidence scores on modelled insurance data. The first arises when there isn’t very much training data available. The second arises when there is an abundance of training data available, but it is skewed or not representative of the data to be predicted. In this second case there is a significant risk that the model will produce high confidence scores for inaccurate predictions, because the scoring population is inconsistent with the training population. It’s like creating a model to identify oranges and using it to predict apples. The model’s confidence in its ability to identify oranges simply doesn’t carry over.
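One way to check whether a scoring population is consistent with the training population is to compare how a single attribute is distributed in each. The sketch below uses the population stability index (PSI), a measure chosen here purely for illustration and not one named in this article; the function name, the bin count and the sample values are all assumptions.

```python
import math

def population_stability_index(train, score, n_bins=10):
    """PSI between training and scoring samples of one numeric attribute.

    Illustrative helper (hypothetical name). Bins are equal-width and
    are built from the training sample's range.
    """
    lo, hi = min(train), max(train)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the training range are clamped into the end bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = bin_shares(train), bin_shares(score)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

# Made-up data: the "apples" sample is shifted away from the "oranges" sample.
oranges = [float(x) for x in range(100)]
apples = [x + 50.0 for x in oranges]

psi_same = population_stability_index(oranges, oranges)
psi_shifted = population_stability_index(oranges, apples)
print(psi_same, psi_shifted)
```

A PSI near zero suggests the two samples are distributed alike on that attribute, while a large value suggests the model is being asked to score a population it was not trained on. Any cut-off for "large" is a judgement call for the business rather than a fixed rule.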
To mitigate the risk of small training data, the key is to use appropriate statistical methods and tests (along with sensible distribution assumptions) to select upper and lower confidence bounds that reflect how volatile the data is. The solution to the second issue, however, is more complex than it might seem at first. To combat it, it is crucial to create a process that ensures the test data is representative of the training data and vice versa. In recent times the flood of data has removed much of the need to be strict with confidence scores and boundaries, but when modelling on skewed data this discipline is still imperative. Training and test data must be reconciled with each other to remove bias.
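To make the small-sample point concrete, here is a minimal sketch of one such statistical method: a Wilson score interval for a model's accuracy on a hold-out set. The choice of the Wilson interval, the binomial assumption behind it, the function name and the counts are all assumptions made for illustration, not the approach described in this article.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%).

    Illustrative helper (hypothetical name). Assumes each prediction is an
    independent trial that is either correct or incorrect.
    """
    if n == 0:
        return (0.0, 1.0)  # no data: the proportion could be anything
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

# Same observed accuracy (80%), very different amounts of evidence.
small = wilson_interval(8, 10)
large = wilson_interval(800, 1000)
print(small, large)
```

Both samples show the same 80% accuracy, but the interval from ten records is far wider than the one from a thousand. Reporting the lower and upper bounds rather than the single figure lets a user see how much weight the score can bear.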
Given these challenges, creating confidence scores can often be just as complex as creating your predictive model. It requires judgement, statistics and experience. Moreover, accurate confidence scores are vital when providing data that will underpin business processes, and they are an important part of building trust with both consumers and regulators.
Emma Duckworth is lead data scientist at Outra