Population genomics is not one size fits all – PacBio

Learn more about the importance of capturing diversity with population-specific reference pangenomes.

The movement towards the development of pangenomes is gaining momentum, bringing the reality of truly personalized medicine closer. The first human pangenome was developed using the genetic data of 47 individuals. But knowing what we do now about genomic diversity, researchers recognize that one human pangenome won’t fit all.

Earlier this month, we published a piece describing the impact of a recent study led by Professors Kai Ye of Xi’an Jiaotong University, Xi’an, China, and Shu-hua Xu of Fudan University, Shanghai, China. The study, appearing in Nature, notes that while nearly 60 percent of the global human population lives in Asia, people from the world’s largest continent have been underrepresented in the pangenomes created so far.

Starting a few years ago, researchers set out to do something about it through the Chinese Pangenome Consortium (CPC) by initiating the first of three phases of a project aimed at better representing the genetic diversity of Chinese populations.

Why diversity matters

“Unlike previous population genetic studies in China, which were mainly aimed at revealing the genetic relationships and genetic history of populations, we attempted to uncover missing sequences and hidden variations that had not been identified before in Chinese ethnic groups,” said Prof. Shu-hua Xu. For example, approximately 18.4% of small variants and 17.1% of structural variants (SVs) identified were specific to the CPC assemblies when compared with a recently released pangenome reference from the Human Pangenome Reference Consortium (HPRC). These newly identified genomic variations are more informative and may therefore facilitate the discovery of finer-scale population relationships, since most of the new variations are population specific.

The CPC Reference Pangenome provides one of the most comprehensive views of genomic variation within an East Asian population constructed to date. As Prof. Shu-hua Xu noted, “our results suggest that the use of population-specific references in sequence alignment improved the quality of the alignment.” Compared to the HPRC reference, the use of the CPC reference improved the perfect alignment rate of short reads in East Asian samples. It would also help improve the accuracy of profiling parts of the genome enriched with complex sequence variations, such as genes that regulate the immune system.

As shared in the paper, Phase I of this study captured a large number of missing sequences and hidden variations from a collection of 116 high-quality, haplotype-phased de novo assemblies representing 36 underrepresented Chinese ethnic minorities. “Notably, our efforts have added 189 million base pairs of euchromatic polymorphic sequences and 1,367 protein-coding gene duplications to the current state-of-the-art reference, GRCh38. We identified 15.9 million small variants and 78,000 structural variants, of which 5.9 million small variants and 34,000 structural variants have not been reported elsewhere,” said Prof. Shu-hua Xu. These previously missing sequences have the potential to help researchers trace missing links in human evolution, recover missing heritability in disease mapping, and aid medical research today and in the future.

Advantages of HiFi pangenome references

Prof. Kai Ye, Prof. Shu-hua Xu, and their team used highly accurate long-read HiFi data as the primary data type for their study, and strategically combined assembly and alignment to resolve sequences and structures. “To avoid artifacts introduced during cell line immortalization processes, we directly extracted DNA from blood samples,” Prof. Kai Ye explained. “Additionally, to ensure that samples represent various ethnic characteristics, we required samples to be from three generations of the same ethnic group.”

Not only is creating a population-specific pangenome a good practice, but with the right sequencing technology it is achievable. For population genomics efforts like CPC, a platform like the Revio system allows researchers to reveal more with accurate long-read sequencing at scale, with increased throughput for larger cohorts. With HiFi reads, researchers are able to obtain phased genomes with high accuracy, higher completeness, and higher resolution for all classes of variants, robust coverage in complex and repeat-rich regions, and methylation status alongside genomic variants for multiomics studies. And as the CPC project has discovered, HiFi sequencing is ideal for creating differentiated population sequencing datasets that go far beyond the limitations of traditional short-read technology.

Compared to traditional linear genomes, pangenomics characterizes the genetic diversity of multiple populations in a graph representation of the genome. “Its integrated information on previously identified variants provides a potential set of functional variants for disease research, addressing the missing heritability bottleneck faced by researchers,” said Prof. Kai Ye. Current applications, such as disease identification by resequencing strategies, are severely limited by the sequences contained in linear genomes. However, a pangenome graph representation reveals previously unrepresented variants from different populations, including those with lower population frequencies, allowing for individualized disease identification.

We look forward to seeing what Phase II of the study brings. The team will sequence 1,000 samples spanning 56 ethnic groups, and the possibilities for discovery are endless.

To learn more about PacBio solutions for your population genomics study, please contact one of our scientists or check out our recent webinar, Shining a Light on Dark Genes in Population Sequencing.
