Combining datasets without compromising on trust

The rapid adoption of technology to optimise every aspect of our lives both generates large volumes of data about our activities and relies heavily on that data. The potential benefits of using this data serve both individuals and organisations.

In terms of everyday convenience, for example, my precise location details enable taxi-hailing apps to find the nearest car to me and help food delivery apps to deliver my order quickly. By knowing my purchase history, brands are able to deliver personalised recommendations and tailor experiences to match my preferences. Sensors in my smartwatch tracking my vital statistics can even instruct my phone to call emergency services if an anomaly, such as a sudden fall, is detected in the real-time data.

On the flip side of consumer benefits, the ability to derive insights from behavioural, transactional, financial, and operational data can open up a wide range of opportunities for commercial growth and innovation-led strategies.

However, all too often, organisations find themselves limited by the scope of the data they have access to. The data a company collects in the normal course of its activities is naturally limited to the scope of its particular business, and is further constrained by regulation and by the need for transparency and fairness towards consumers. A wider view of its customers’ behaviours and preferences would increase the opportunities to deliver innovative products and services; some such business scenarios are listed below.

Summing the parts
Many organisations would derive huge benefits from enhancing the analysis of their customer data with data from other companies, thereby widening their view of their customers and gaining a deeper understanding of how they can better serve their needs.

But while joining these datasets together by matching individuals in each dataset may be technically possible, it is extremely unlikely to be permitted under data protection legislation – unless individuals have freely given both organisations specific, informed consent for their data to be combined in this way, or it is used for limited permitted purposes like fraud prevention.

It is also important to note that as datasets are blended together, the data becomes increasingly sensitive and granular; as such, the risk of intrusive profiling or undesired effects grows significantly, which presents inherent privacy, security and ethical risks.

Combined analytics scenarios
Having the ability to combine insights generated from the data of different organisations would not only assist business growth but would also enable consumer services to be improved and further convenience to be delivered. The commercial examples below illustrate just a handful of real-world opportunities combined analytics presents:

– While a financial institution can analyse the amount of money spent with specific merchants, it does not know what was purchased. If it could combine its data with the till spend information from the actual merchants, a much richer view of collective consumer spending habits could be built up, which could lead to improved customer offerings.

– A mobile network may have rich data that tracks mobility across locations; however, it does not have access to information that would explain why groups of people are going to different places. As such, it is missing critical contextual data that would enable more accurate forecasting and capacity planning.

– While a loyalty programme has data on the offers and redemptions of its members, it does not have a view of their other spending habits. If it were possible to compliantly and efficiently incorporate information about the overall spending habits of cohorts of its members, the loyalty programme provider would be far better placed to deliver more relevant offers to them.

– Smart city projects use data from sensors that are positioned in multiple places, including traffic lights, roads, CCTV, public service vehicles, etc. Being able to leverage additional data from sources such as privately-owned vehicles, especially in terms of their journeys and trajectories, would greatly enhance the data-driven improvements that could be delivered to the community, such as traffic forecasting, capacity planning, etc.

– While an airline can produce aggregated statistics on routes, frequencies, spend per seat, cost of flights, etc., it does not know what duty-free products were purchased in the airport or where its passengers stay and eat at their chosen destination. With access to aggregate statistics on the behaviour of passengers who fly on particular routes, dates or times, airlines could increase their revenue per seat with more relevant in-flight product offerings.

The limitless value in combined analytics
The value associated with combining insights across datasets is virtually limitless. An organisation can dramatically improve its existing analytics programmes while also opening up new opportunities to monetise its data by enabling other data controllers to benefit from its insights.

For many use cases, it is possible to enhance datasets based on cohorts or segments so that the data for individuals in one dataset is enriched with aggregated insights derived from segments built over another dataset. In this way, data enrichment can occur without directly matching individuals across the datasets.
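As a minimal sketch of this segment-based enrichment, the Python below builds average spend per segment from one dataset and attaches it to the records of another, without joining on any individual identifier. The field names (`age_band`, `region`, `spend`) and the toy records are assumptions made purely for illustration; they are not drawn from any real dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical dataset B (e.g. a merchant): one row per purchase, no shared customer ID.
merchant_rows = [
    {"age_band": "25-34", "region": "North", "spend": 40.0},
    {"age_band": "25-34", "region": "North", "spend": 60.0},
    {"age_band": "35-44", "region": "South", "spend": 20.0},
    {"age_band": "35-44", "region": "South", "spend": 30.0},
]

# Hypothetical dataset A (e.g. a loyalty programme): one row per member.
members = [
    {"member": "m1", "age_band": "25-34", "region": "North"},
    {"member": "m2", "age_band": "35-44", "region": "South"},
]


def segment_key(row):
    """The segment a row belongs to: a coarse group, not an individual."""
    return (row["age_band"], row["region"])


def segment_averages(rows):
    """Average spend per segment, computed over dataset B only."""
    groups = defaultdict(list)
    for row in rows:
        groups[segment_key(row)].append(row["spend"])
    return {key: mean(values) for key, values in groups.items()}


def enrich(member_rows, averages):
    """Attach the segment-level average to each member; None if the segment is unknown."""
    return [
        {**m, "segment_avg_spend": averages.get(segment_key(m))}
        for m in member_rows
    ]


averages = segment_averages(merchant_rows)
enriched = enrich(members, averages)
```

The only thing that crosses between the two datasets here is the per-segment average, so no member is ever paired with a specific row from the other source.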

However, even this segment-based combined analytics on identifiable data may require consent from the data subjects from at least one of the datasets; therefore, to ensure consumer trust is maintained and rights to privacy upheld, it is unlikely to be a viable option for any of the cases described above.

With that said, how then can companies unlock the massive potential in combining their data with other sources, while remaining compliant with data protection legislation, acting ethically and retaining the trust of their customers?

Anonymisation can open the doors for combined analytics
In today’s privacy-centric world, which is dominated by increasingly strict regulations and rising consumer concern and tech-savviness, there is only one way to lawfully conduct the aforementioned combined analytics. That is to first anonymise both datasets separately so that the combined analysis can be performed while ensuring no re-identification of individuals can occur via commingling.

Achieving the very high threshold for true anonymisation, set by the GDPR, can “switch off” privacy regulations. When this happens, the data involved is not considered to be personal data any longer. This means that issues surrounding consent, data minimisation and data retention, which apply to the collection and use of personal data under the GDPR, will then no longer apply. In addition to this, the truly anonymised data can also be held for as long as it is required, and it can be used for all types of analytic use cases.

Even when datasets have been anonymised to the appropriate level, there are limitations to how combined analytics may be performed and what outputs can be extracted. Combining even anonymised datasets in a 1-to-1 fashion must be avoided for the reasons set out above. Trying to join the data of anonymised individuals across datasets in a 1-to-1 fashion means looking for exact matches of overlapping fields across all events or rows in the dataset so that an individual in one dataset is matched with an individual in another.

Although the data of both individuals is anonymised and cannot be associated with their named identity, this type of 1-to-1 matching creates an enriched dataset for the individual that could potentially lead to re-identification.

It is possible to associate an anonymised individual in one dataset with a group of similar individuals in the other, so that the individual’s data may be enhanced with aggregate statistics and analytics. In fact, this is typically the goal of most analytics processes – to enrich the information of an individual or group with insights or learnings gleaned from the activities of groups of similar users so that more effective decisions or recommendations may be made.

There are certain guardrails that still need to be applied to ensure the resulting enriched information is not too granular or does not increase the likelihood of a re-identification occurring.
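One guardrail of this kind, offered here as an illustrative assumption rather than as a description of any particular product or method, is to suppress any aggregate that is computed over too few individuals. In the sketch below, the threshold `MIN_COHORT_SIZE`, the function name and the field names are all hypothetical, and each row is assumed to represent one distinct individual.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical threshold; a real value would come from a privacy risk assessment.
MIN_COHORT_SIZE = 3


def guarded_cohort_averages(rows, key_fields, value_field, min_size=MIN_COHORT_SIZE):
    """Average value_field per cohort, dropping cohorts smaller than min_size.

    Assumes each row represents one distinct individual.
    """
    cohorts = defaultdict(list)
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        cohorts[key].append(row[value_field])
    # Small cohorts are suppressed entirely rather than reported.
    return {
        key: mean(values)
        for key, values in cohorts.items()
        if len(values) >= min_size
    }


# Toy passenger records, invented for illustration.
rows = [
    {"route": "DUB-LHR", "duty_free_spend": 10.0},
    {"route": "DUB-LHR", "duty_free_spend": 20.0},
    {"route": "DUB-LHR", "duty_free_spend": 30.0},
    {"route": "DUB-JFK", "duty_free_spend": 95.0},
]

result = guarded_cohort_averages(rows, ["route"], "duty_free_spend")
```

With these toy rows, the single-passenger cohort is omitted from the output, since an "average" over one person would simply reveal that person's value.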

Another important point is that when datasets have been anonymised to the required level, the insights that can be extracted cannot be at the level of individuals as this would potentially enable re-identification of the individual.

Because an individual’s data has been enriched with data from another source, additional information could become associated with their data from the original source, which could result in privacy harm. Thus, combined analytics over multiple properly anonymised datasets should produce either aggregate statistics or trained model code that can then be applied to identifiable datasets to produce a desired outcome, such as relevant recommendations, propensity scores or segmentation.

However, achieving true anonymisation is not a simple task. True anonymisation requires a unique blend of data science and data privacy expertise, but the value accruing from data combination is such that it more than justifies the effort required.

When done right, the whole can be greater than the sum of the parts. Companies can realise immense value without compromising on the trust and respect of their customers and while acting in an ethical fashion.

Dr Maurice Coyle is chief data scientist at Truata