Response to “The perpetual motion machine of AI-generated data and the distraction of ChatGPT as a ‘scientist’”

Many of Jennifer Listgarten’s arguments are compelling: in particular, that the protein folding problem is an outlier relative to other grand challenges in science, both in how precisely the problem can be stated and performance measured and in the amount of available, high-quality data1. However, although existing biological databases tend to be small relative to the compendia used to train large language models, it seems plausible that one type of biological data — whole-genome sequencing — will soon be generated at massive scale, contrary to what was argued1. As the cost of genome sequencing falls and the potential for clinical use of genomic data grows, it will make economic sense to sequence everyone fully. Each 3-billion-base-pair individual genome can be represented by roughly 30 million unique bases, so fully sequencing the US population of 300 million individuals yields a total of 9 × 10¹⁵ bases, which is comparable in size to the 400-terabyte Common Crawl dataset used to train large language models. Using such data to train large-scale machine learning models will be challenging because of privacy considerations. Nonetheless, I see at least four paths by which such models could be built on massive genomic data.
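The snippet below is a back-of-the-envelope reproduction of the scale estimate above. The figures (30 million unique bases per genome, 300 million individuals) are the ones stated in this letter, not measured values, and the variable names are illustrative only.

```python
# Reproduce the 9 x 10^15 base estimate from the letter's stated figures.
unique_bases_per_genome = 30e6   # ~30 million unique bases per individual genome
us_population = 300e6            # ~300 million individuals

total_bases = unique_bases_per_genome * us_population
print(f"total bases: {total_bases:.1e}")   # prints 9.0e+15
```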

The first path involves federated data access. A federated approach uses software to enable multiple databases to function as one, facilitating interoperability while maintaining autonomy and decentralization2. Federation capabilities are supported by existing genomic biobanks, such as the UK Biobank, NIH All of Us and Finland’s FinnGen initiative3, and are further facilitated by commercial entities such as lifebit.ai. In a federated approach, a deep learning model can be trained on data drawn from multiple biobanks while maintaining privacy guarantees.
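To make the federated training idea concrete, the sketch below simulates a FedAvg-style protocol in which each site computes a model update on its own data and only the updates, never the genotypes, are shared with a coordinator. The site sizes, the simulated genotype matrices, the linear model and the functions make_site and local_gradient are hypothetical illustrations, assumed for this sketch; real biobanks such as UK Biobank, All of Us or FinnGen expose federation through dedicated platforms rather than raw arrays.

```python
# Minimal sketch of federated averaging across hypothetical biobank sites.
import numpy as np

rng = np.random.default_rng(0)

def make_site(n_samples, n_variants=1000):
    """Simulate one biobank: a genotype matrix (0/1/2 allele counts) and a trait."""
    X = rng.integers(0, 3, size=(n_samples, n_variants)).astype(float)
    true_w = rng.normal(size=n_variants) * 0.01
    y = X @ true_w + rng.normal(scale=0.1, size=n_samples)
    return X, y

def local_gradient(w, X, y):
    """Least-squares gradient computed entirely inside a single site."""
    residual = X @ w - y
    return X.T @ residual / len(y)

# Three hypothetical sites, each holding its data locally.
sites = [make_site(n) for n in (500, 800, 300)]

n_variants = sites[0][0].shape[1]
w = np.zeros(n_variants)
lr = 0.001

for _ in range(50):
    # Each site computes an update on its own data; the coordinator receives
    # only these updates and averages them, weighted by sample size.
    grads = [local_gradient(w, X, y) for X, y in sites]
    sizes = np.array([len(y) for _, y in sites], dtype=float)
    avg_grad = np.average(grads, axis=0, weights=sizes)
    w -= lr * avg_grad

print("trained coefficient norm:", np.linalg.norm(w))
```

This is only a toy linear model; in practice the same update-and-average loop would wrap a deep learning model, and additional safeguards such as secure aggregation or differential privacy would typically be layered on top of the exchanged updates.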
