The data now available to businesses – should they choose to engage with it all – is staggering! Like drinking from a fire hydrant … the danger is that one is quickly swamped by the sheer volume of data, concepts, tools and ideas floating around this wealth of information and the potential it holds. With all this data available, why on earth is synthetic data – artificially generated data to be used instead of, or in conjunction with, ‘real’ data – of interest to anyone? The amount of real data being generated is already larger than we can comprehend, so why would we want to add to the confusion by creating synthetic data?
It’s difficult to pin down a precise definition, since its exact nature varies according to the specific function it is generated for, but broadly speaking synthetic data can be thought of as any data not derived from direct observation or measurement. A synthetic dataset may be fully synthetic, meaning it does not contain any original data, or partially synthetic, meaning that only sensitive or missing data has been replaced. It is typically generated through machine learning models, such as decision trees or deep learning, that use the statistical relationships between the variables in the data of interest to generate an entirely new, artificial dataset. These relationships may be derived directly from observing an existing dataset, or built from the ground up by a data scientist with prior knowledge of them.
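To make that a little more concrete, here is a minimal sketch of fully synthetic data generation, assuming a small numeric dataset with made-up column names and figures. It fits a very simple model – a multivariate normal distribution capturing the means and covariances of the variables – and samples entirely new rows from it; real projects typically reach for richer models such as decision trees or deep generative networks, but the principle is the same.

```python
# A minimal sketch of fully synthetic data generation: learn the statistical
# relationships in a (hypothetical) real dataset, then sample brand-new rows
# from them. The column names and numbers here are purely illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Stand-in for a real dataset of numeric customer measurements.
real = pd.DataFrame({
    "age": rng.normal(45, 12, 500),
    "income": rng.normal(38_000, 9_000, 500),
    "monthly_spend": rng.normal(450, 120, 500),
})

# Learn the relationships: the mean of each variable and the covariance
# between variables.
means = real.mean().to_numpy()
cov = real.cov().to_numpy()

# Generate an entirely new, artificial dataset from those relationships.
synthetic = pd.DataFrame(
    rng.multivariate_normal(means, cov, size=1_000),
    columns=real.columns,
)

print(synthetic.describe())  # similar means and spread to the original
print(synthetic.corr())      # relationships between variables preserved
```

None of the generated rows corresponds to a real individual, yet the summary statistics and correlations closely mirror those of the original data.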
The potential utility of synthetic data is likely familiar to anyone who has been involved in a data analysis project. This might range from simply facilitating more reliable data analysis by filling gaps, to tackling security and privacy concerns, but synthetic data can also play a vital role in the field of machine learning.
Better Data Analysis
Some of the key applications of synthetic data are based around overcoming problems associated with data quality and quantity. Looking at data quality issues first – when working with real data, various skews and biases in the collection process can lead to difficulties in maximising the value of the data further down the line. These may be down to sampling issues – it’s notoriously difficult to obtain a truly random sample – but there may also be incomplete entries, incorrect formatting, nonsensical outliers and various other quirks that anyone who has worked with real-life data is no doubt familiar with.
Being able to generate a fresh dataset with all of the insight of the original, but without the inherent messiness that comes with real data – outliers and missing entries being just a couple of examples – makes the data much easier to work with, and can even provide a more navigable source for those less familiar with the idiosyncrasies of real-world data.
Building upon your freshly created, easy-to-use synthetic dataset, you can also expand it to deal with issues arising from the lack of a truly random sample. Imagine you have a dataset used for a segmentation project, where on the back end you want to develop a classification tool that can assign current and potential customers to one of the segments you have created. Whilst there are many powerful classification algorithms that can be used for this purpose, they will all be subject to some level of bias based on the respective sizes of the segments in the sample. By creating a synthetic version of the dataset that brings your sample up to, say, 1,000 entries for each segment, however, you can develop a classification tool safe in the knowledge that it will be free from any sizing bias in the original sample.
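As an illustration of that balancing step, the sketch below uses SMOTE from the open-source imbalanced-learn package – one common technique for this, not the only one – to top each segment up to 1,000 entries by synthesising new rows that interpolate between neighbouring real examples. The features, segment labels and sizes are invented for the example.

```python
# An illustrative way to even out segment sizes before training a classifier:
# SMOTE (from the imbalanced-learn package) creates new synthetic rows for
# the smaller segments by interpolating between neighbouring real examples.
# The features, segment labels and 1,000-per-segment target are made up.
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(seed=0)

# Stand-in for segmentation output: two features plus an assigned segment,
# with deliberately unequal segment sizes (700 / 250 / 50).
X = pd.DataFrame({
    "age": rng.normal(45, 12, 1_000),
    "monthly_spend": rng.normal(450, 120, 1_000),
})
y = pd.Series(["A"] * 700 + ["B"] * 250 + ["C"] * 50, name="segment")

# Ask for 1,000 entries per segment; SMOTE synthesises the shortfall.
smote = SMOTE(
    sampling_strategy={"A": 1_000, "B": 1_000, "C": 1_000},
    random_state=0,
)
X_balanced, y_balanced = smote.fit_resample(X, y)

print(pd.Series(y_balanced).value_counts())  # 1,000 rows in every segment
```

A classifier trained on `X_balanced` and `y_balanced` then sees every segment equally often, rather than learning to favour whichever segment happened to dominate the original sample.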
Security & Privacy
No matter how much data you have, it’s likely that some of it contains sensitive information – be that internal information or personal details from your client base. This data likely holds a lot of potential with regards to developing business intelligence, but, due to its sensitive nature, actually digging through it to find and develop insight can often be something of a security nightmare, particularly in a post-GDPR world. Whilst any organisation with a CRM database can benefit from additional security for its data, the advantages are particularly pronounced in fields such as financial services and healthcare, where customer or patient data can be especially sensitive.
Methods to circumvent this traditionally involved either removing or anonymising the sensitive data, but both of these approaches have their pitfalls. Removing the data altogether risks essentially deleting a source of potential insight, undermining the process of analysing the data in the first place. Anonymising is generally effective for small scale analysis, but often papers over the cracks from a security perspective as sensitive information can often be reverse engineered in the case of a hack or data breach.
Using synthetic data, however, offers an excellent compromise: it preserves the key insights contained within a dataset without any risk of exposing sensitive information. This is particularly useful for those in the financial services industry, who tend to hold vast swathes of highly sensitive information about many of their customers, but it can also be useful to any company looking to embark on a large-scale data analysis project with sales or research data that may contain sensitive information.
Consider, for example, a scenario in which you want to share data with a third party. This may be a third-party service such as cloud storage or computing, or an agency or freelancer you are collaborating with on a data analytics project. These scenarios can often, at the very least, leave you with several hoops to jump through before you can safely share your data, which can ultimately limit its potential. Generating a synthetic version of your data – which would retain all of the important statistical properties of the original dataset – allows you to make the most of what you have without having to worry about the exposure of sensitive information.
Machine Learning
As well as streamlining the regular data analysis process, synthetic data has been found to have several powerful benefits in the field of machine learning. This is usually about overcoming data scarcity – machine learning algorithms typically take in massive amounts of data, which can be expensive and time-consuming to collect in the traditional sense, so feeding in synthetic data can be a cost-effective way of improving an algorithm’s performance.
Consider self-driving cars… the AI software that drives these cars is fed data from real cars being driven so it can learn and improve how to drive, but the volume of data required to ensure that a self-driving car is 100% safe far outweighs what is realistic to collect. To circumvent this, Google have their cars drive around 3 million simulated miles per day, which provides their algorithms with the data they need to train their cars to drive themselves safely.
The growing importance of synthetic data is reflected in the wave of new organisations focusing on generating synthetic data for their customers. DataGen, for example, is an Israeli start-up that uses machine learning to produce artificially generated still images and videos that their customers can use to train their own AI. Last year, a synthetic text generator co-authored an article for The Guardian. Synthetic data has even been leveraged as part of COVID-19 research collaboration.
So, as we can see, even in our data-saturated world, there is very much a place for synthetic data. Next time you feel that you have reached an impasse with your data, consider whether synthetic data might help. From helping you circumvent security and privacy concerns, to boosting the quantity and quality of your data, and enhancing the development of more complex machine learning products, the applications of synthetic data are both plentiful and diverse.