Making data predictions in a sea of uncertainty

    By Mercia Silva, Data Scientist at BJSS




    First an electrical engineer, then a geophysicist and now a data scientist. To some people these may seem very different careers, but for Mercia they were natural moves: all focused on mathematics, data, coding and solving practical problems. She is originally from Brazil and likes running outdoors, travelling to discover new places and cultures, and trying different food.

    Developing a robust model that meets the specific demands of an organisation, gaining trust in it, and iterating it into a usable product: these are the real challenges that data scientists face when implementing machine learning within enterprise organisations. Discover how one of our lead Data Scientists solved these challenges for a major oil and gas firm. Creating a top-notch model is not necessarily among the hardest of them; in fact, it may even be the easy bit.

    This story isn't going to spin a tale of neural networks, complex predictive problems and arrays of graphics cards in the cloud (that's one for another time). Instead it's about the reality of applying data science in big business. It's about winning the trust of hundreds of highly skilled experts who will use the predictions in their line of work. It's about using technical solutions and a deep understanding of the domain and users to guarantee success.

    Part 1 - Predicting Hydrocarbon Recovery


    Mercia, one of BJSS' founding data scientists, had been selected to work with a major oil and gas firm to help them further develop their ability to predict hydrocarbon recovery from oil and gas fields. The decisions that these predictions impact are of incredible value. The predictions must therefore be well understood and trusted, which is admittedly not one of the core value propositions of machine learning!


    An existing regression model was in place, used by a number of engineers across the organisation. This model, however, faced a number of problems, including mistrust, data errors and a less than ideal fit. The initial ask of Mercia was seemingly straightforward: "We already have a regression model, can you improve the fit?" As can be expected of any keen data scientist, Mercia asked whether considering alternative solutions might be on the table. It was.


    Part 2 - random forests and reservoirs


    The data used for the existing regression contained plentiful extra information: around 200 further features that were not used for prediction. Mercia and the team set about understanding the predictive ability of these features and created a new predictor, a random forest regressor that used many of the previously unused features. It was at this early stage that concerns first surfaced about the level of trust that could be placed in a black-box prediction of such magnitude.

    Reservoir engineers use applied mathematics, geophysics, petrophysics and other fields to understand the flow of fluids through porous media within reservoirs. Even with the blessing of the senior engineers involved in building the model, how would other engineers trust the solution? Engineers with years of experience in such an expert field would have to trust an opaque bundle of software and mathematics, something that claims to solve a problem they have spent many years understanding.

    There are hundreds of these engineers in a large oil and gas company. To complicate things further, there is a flip-side to this problem. You must ensure that new engineers don't learn to trust such a solution too far. They must not stop questioning it. Errors occur, predictions can be poor. If unquestioned, such issues could lead to significant financial impact.

    Mercia worked with engineers across the firm to understand how they predict recovery factor and how the new solution could be adjusted to fit their existing workflow, providing the same type of information that they already see. Essentially, she wanted the predictor to fit as naturally as possible into their current approach. The tool should support, not supplant, their process.

    Part 3 - building the model


    A significant concern of the engineers was how any algorithm would deal with uncertainty in the resulting prediction or in the input variables. Is the prediction likely to be very accurate, or could it be way off the mark? If the input variables are incorrect, what does this mean for the prediction? This uncertainty needs to be quantified and explained in a way that allows it to be weighed into any conclusions formed. Mercia and the team proposed solutions to accommodate these concerns while allowing the team to take advantage of new technologies.

    To output the uncertainty ranges that engineers expect, the team implemented a methodology called Quantile Regression Forest on top of the random forest regressor from the popular Python library scikit-learn.

    As shown in the figure below, a normal random forest regression averages the prediction outputs of multiple decision trees to output a single prediction.
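    The averaging step can be checked directly. The following is a minimal sketch on synthetic data (the project's reservoir data is not available here, so `make_regression` and all parameter values are stand-in assumptions): it collects the prediction of each individual tree in a scikit-learn `RandomForestRegressor` and compares their mean with the forest's own output.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): not the reservoir dataset from the article.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# One prediction per tree for the first three rows: shape (n_trees, 3).
per_tree = np.stack([tree.predict(X[:3]) for tree in forest.estimators_])

# The forest's single prediction per row is the mean across its trees.
forest_prediction = forest.predict(X[:3])
manual_average = per_tree.mean(axis=0)

print(np.allclose(forest_prediction, manual_average))
```

    The `per_tree` array is the raw material for the distribution-based view described next: the same values, kept separate rather than averaged.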

    A quantile regression forest treats the result of each tree as a separate value in the population of a predicted distribution. The random forest prediction (the average) is then treated as the most probable value in the distribution. This is shown in the following figure, illustrating the prediction using a histogram.

    Each decision tree sees a sample of the same input data during training. If, when making a prediction, the inputs are values that were commonly seen during training, then all trees will return similar responses, resulting in a tighter distribution and low uncertainty. Conversely, if the set of inputs is uncommon, many of the trees will not have seen such inputs during training. The results of the individual trees will be more diverse, resulting in a wider distribution and hence higher uncertainty (illustrated by the histogram above).

    Using this approach, the output can be presented to the user in the format they are used to, surfacing the P10, P50 and P90.
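    As a rough illustration of the idea described above, the sketch below treats the per-tree predictions of a scikit-learn random forest as a distribution and reads percentiles from it. It is not the team's implementation: the helper name `predict_percentiles`, the synthetic data and the parameter values are all assumptions. It also assumes P10, P50 and P90 map to the 10th, 50th and 90th percentiles; some oil and gas conventions define P10 and P90 the other way round, so the mapping should be checked with the users.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor


def predict_percentiles(forest, X, percentiles=(10, 50, 90)):
    """Return per-row percentiles of the individual tree predictions.

    Hypothetical helper: the result has shape (len(percentiles), n_rows).
    """
    # Keep each tree's prediction separate instead of averaging them.
    per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
    return np.percentile(per_tree, percentiles, axis=0)


# Synthetic stand-in data (assumption): not the reservoir dataset from the article.
X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=1)
forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)

p10, p50, p90 = predict_percentiles(forest, X[:5])
spread = p90 - p10  # a wider spread means the trees disagree more

for low, mid, high in zip(p10, p50, p90):
    print(f"P10={low:.1f}  P50={mid:.1f}  P90={high:.1f}")
```

    The `spread` value gives a single number per prediction that could be used to flag cases where the trees disagree strongly and the result deserves extra scrutiny.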


    This approach was piloted with engineers and was widely accepted. They could see the machine functioning in a manner similar to their own, which brought greater confidence. However, one further improvement was required: how to address uncertain inputs?

    Ongoing conversations with users had highlighted the need to help answer the question "What if I'm not certain that input n is correct?" To address this need, for each input variable the model makes a prediction not only for the value entered, but also for a range of other values of that variable. These form "sensitivities" for each of the input variables, which can be visualised for engineers to aid sensitivity analysis.

    An example sensitivity plot is shown below. It shows the variation of the model prediction should the input, n, be of a different value. These plots are used to understand how uncertainty of the model can be attributed to certain input variables.
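    A sensitivity curve of this kind can be sketched by holding every input fixed except one and re-predicting across a range of values for that one input. The helper name `sensitivity_curve`, the choice of feature, the value range and the synthetic data below are illustrative assumptions, not details from the project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor


def sensitivity_curve(forest, row, feature_index, values):
    """Predict for copies of `row` in which one feature takes each of `values`.

    Hypothetical helper: returns one prediction per trial value.
    """
    grid = np.tile(row, (len(values), 1))  # one copy of the row per trial value
    grid[:, feature_index] = values        # overwrite only the feature under study
    return forest.predict(grid)


# Synthetic stand-in data (assumption): not the reservoir dataset from the article.
X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=2)
forest = RandomForestRegressor(n_estimators=100, random_state=2).fit(X, y)

row = X[0]
feature_index = 2  # arbitrary choice for illustration
trial_values = np.linspace(X[:, feature_index].min(), X[:, feature_index].max(), 11)
curve = sensitivity_curve(forest, row, feature_index, trial_values)

for value, prediction in zip(trial_values, curve):
    print(f"input={value:+.2f} -> prediction={prediction:.1f}")
```

    Plotting `curve` against `trial_values` gives the kind of plot described above. A flat curve suggests the prediction is insensitive to that input; a steep one suggests that an error in the input matters.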

    Part 4 - moving from model to product


    To surface the model, results, and sensitivities to users effectively, the team built and deployed a small web application. Through this application, inputs can be entered, and a prediction generated in seconds. The textual prediction is displayed alongside plots for sensitivity analysis and an additional plot displaying the overall prediction against a backdrop of known values in the database.

    At this point the model was complete and ready for the limelight. Initially opened to 15 pilot users, it was very well received, with excellent feedback from real projects. The team adjusted aspects of the tool based on this feedback and opened the floodgates. The tool is now live throughout the business, where 400 engineers use the model, which has since undergone 3 production iterations. It has been a huge success, accurately predicting recovery rates and delivering value both in time savings and in projects moving forward based upon conclusions corroborated by the tool. In addition, engineers are already finding use cases for the tool beyond those it was designed for.



    "Winning the trust of hundreds of highly skilled experts" was a critical success factor for this project. Successful machine learning in production is often as much about understanding the human factors and business processes as it is about the algorithm and technology. Mercia and the client team who worked with her succeeded because of their laser focus on implementation within their userbase, a unique approach that the BJSS data science team bring to all of their deliveries.