Khaled El Emam is the SVP & General Manager at Replica Analytics. He is also the Canada Research Chair in Medical AI, Director of the Electronic Health Information Laboratory at the CHEO Research Institute and a Professor at the University of Ottawa. CityAge sat down with Khaled after our conference in Ottawa on Dec. 8, The Data Effect: Data for Canada’s Health. We talked about what synthetic data is and how it can help solve the health data problem in Canada (read more on this problem through the CBC and the Globe and Mail). The conversation below has been condensed and edited for clarity.
CityAge: Can you tell me a bit about who you are and what you do?
Khaled El Emam: I'm a computer scientist/engineer by training, and I'm a professor at the University of Ottawa. I run a research lab at the Children's Hospital of Eastern Ontario focused on AI and machine learning. Our main research agenda is on developing technologies for synthetic data generation. I’m also the co-founder of Replica Analytics, recently acquired by Aetion. Replica develops synthetic data generation technology.
CA: What is synthetic data?
KE: You start with real information, and then you train AI models using that real information. The AI models learn the patterns in the data. Then you create fake data from those models. And the idea is that these look the same, behave the same, and have the same patterns as the real data, so you maintain utility, but they’re not the real data.
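The workflow described above (learn patterns from real data, then sample new records from the learned model) can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the two columns (age and systolic blood pressure), the invented "real" data, and the simple two-variable Gaussian model. It is not the method used by Replica Analytics, and production generators rely on much richer models.

```python
import math
import random
import statistics


def make_toy_real_data(n, rng):
    """Invented stand-in for real records: (age, systolic blood pressure)."""
    rows = []
    for _ in range(n):
        age = rng.uniform(20, 80)
        bp = 100 + 0.5 * age + rng.gauss(0, 10)
        rows.append((age, bp))
    return rows


def fit_model(rows):
    """Learn the 'patterns': means, spreads, and correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    return {
        "mean_x": mean_x, "mean_y": mean_y,
        "sd_x": sd_x, "sd_y": sd_y,
        "corr": cov / (sd_x * sd_y),
    }


def sample_synthetic(model, n, rng):
    """Draw new records from the fitted model; no real record is copied."""
    corr = model["corr"]
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - corr * corr))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mean_x"] + model["sd_x"] * z1
        y = model["mean_y"] + model["sd_y"] * (corr * z1 + residual * z2)
        rows.append((x, y))
    return rows


rng = random.Random(0)
real = make_toy_real_data(2000, rng)
model = fit_model(real)
synthetic = sample_synthetic(model, 2000, rng)
```

Refitting the same summary statistics on `synthetic` should give values close to those in `model`, which is the toy version of "same patterns, but not the real data."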
CA: Why do we need synthetic data?
KE: One of the key advantages is privacy, as synthetic data cannot be mapped back to real people. Synthetic data is easier to re-use and share, both within and outside organizations.
CA: What are some of the other advantages?
KE: One of them is that you can de-bias a biased data set. When you have certain groups that are under-represented in a dataset, you can use synthetic data generation technologies or techniques to compensate for that bias, or to increase diversity in your data by simulating additional patients. So you take the real data, you try to learn the patterns in these under-represented groups, and then simulate additional patients. You can simulate patients for clinical studies, for clinical trials, and more.
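As a rough illustration of simulating additional patients for an under-represented group, the sketch below learns a simple per-group pattern and then generates extra records until every group matches the size of the largest one. The record layout, the `augment_minority` helper, and the single-variable Gaussian model are all hypothetical choices made for this example, not a description of any specific product.

```python
import random
import statistics


def augment_minority(records, group_key, value_key, rng):
    """Simulate extra records so each group reaches the size of the largest group.

    Assumes every group has at least two real records, so a spread can be estimated.
    """
    by_group = {}
    for rec in records:
        by_group.setdefault(rec[group_key], []).append(rec[value_key])

    target = max(len(values) for values in by_group.values())
    augmented = list(records)  # keep every original record unchanged
    for group, values in by_group.items():
        mu = statistics.fmean(values)
        sd = statistics.stdev(values)
        for _ in range(target - len(values)):
            augmented.append({
                group_key: group,
                value_key: rng.gauss(mu, sd),  # drawn from the learned pattern
                "synthetic": True,             # flag simulated patients
            })
    return augmented


# Invented toy data: group "B" is under-represented (5 records versus 50).
rng = random.Random(1)
records = [{"group": "A", "lab_value": rng.gauss(5.0, 1.0)} for _ in range(50)]
records += [{"group": "B", "lab_value": rng.gauss(7.0, 1.5)} for _ in range(5)]
balanced = augment_minority(records, "group", "lab_value", rng)
```

Flagging the simulated records keeps them distinguishable from real ones in later analysis.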
CA: Why do we need this in Canadian health care today?
KE: We have a serious data access challenge that is impacting important research, innovation and healthcare. This is an ongoing problem, and I live it every day. I run a research lab and it takes a very long time to get access to Canadian patient data, and sometimes it's just not possible. We are competing against other countries and other jurisdictions, so our inability to use our own data puts us at a disadvantage from a research perspective, but also from a company perspective. I mentioned that I run a company that builds AI and machine learning tools. We need data, and we end up purchasing data or getting access to data from other jurisdictions, because it's a lot more accessible. This data may not be representative of the Canadian health care system, or the standard of care here. So it puts Canadian companies at a potential disadvantage, but also patients at a potential disadvantage.
CA: Can you give me an example that proves synthetic data works in real-world situations?
KE: A Canadian example is a project we did in Alberta. This was quite a complex dataset covering multiple domains: labs, drugs, hospital discharges, emergency department visits, and claims for 300,000 patients over 7 years. The objective was to look at the impact of opioid use, so it was an opioid-using subset of the population. We got the dataset, and we worked with partners at the University of Alberta. The epidemiologists at the University of Alberta built the models, compared real and synthetic data, and found that they drew the same conclusions for quite a complex model. They were predicting mortality, emergency department visits, and some specific diagnoses, and the conclusions were the same using synthetic data. We evaluated the privacy risks as well. The province’s Privacy Commissioner was consulted throughout, both to demonstrate that the risks were small and to address their concerns or questions about synthetic data.
CA: What needs to happen for synthetic data to be adopted more broadly?
KE: Greater awareness of synthetic data generation as a high-utility, privacy-enhancing technology is a factor, although that is changing – analysts are making predictions about its growth and adoption that are coming to fruition. I think some of the proposed privacy law changes need to better enable the use of the technology. They cannot set standards so high that they are too challenging to meet. So we have to ensure that privacy laws are realistic and protective. It's not one or the other. Both can be met – we know how to do that. Nobody wants to take shortcuts. But we need to be able to move forward. If you want zero risk, then the easiest answer is doing nothing.
CA: Canada is facing a serious health data problem, one that’s impacting many of our daily lives. What do you want readers to take away from this conversation?
KE: We know how to solve this problem. We need the leadership of the entities that hold the data and we need to be willing to take some responsible risks to move forward.
I've been having this conversation for 20 years. This could be 2002 and we'd probably be having exactly the same conversation. We wouldn’t be talking about synthetic data. We'd be talking about some other approach, but we would be essentially having the same conversation. COVID-19 has taught us a lot of things about the consequences of not having access to good data. There is momentum on this topic – we just need to take advantage of it. However, there are folks who are happy with the status quo, who are afraid of change. We have to understand those concerns and address them so we can move forward to leverage data responsibly and in a privacy-protective way.