The ambitious challenge of finishing the human genome

The current human reference genome assembly, known as Genome Reference Consortium Human Build 38 (GRCh38 or hg38), does not contain a complete sequence of the human genome. The missing portions mostly lie in heterochromatic regions and near the centromeres and telomeres. Their sequences were never determined because these highly repetitive regions are difficult to clone, map, and assemble. Though absent from the reference, these sequences are known to contain genes and other functional elements that may be relevant to human health and disease.
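To make the idea of "missing portions" concrete: in reference FASTA files, unsequenced stretches are conventionally written as runs of the letter N. The following is a minimal Python sketch, not part of any official tooling, that locates such runs in a FASTA record. The function names `parse_fasta` and `find_gaps` and the toy record are illustrative assumptions, and the example does not use real genome data.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_gaps(sequence: str) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) intervals for each run of N."""
    return [m.span() for m in re.finditer(r"[Nn]+", sequence)]


# Toy record for illustration only (hypothetical, not real genome data).
toy = [">chrToy example", "ACGTNNNNAC", "GTNNACGT"]
for name, seq in parse_fasta(toy):
    gaps = find_gaps(seq)
    gap_bases = sum(end - start for start, end in gaps)
    print(name, gaps, gap_bases)
```

For a real assembly, the same functions could be applied to the lines of an opened FASTA file. Holding a whole chromosome in memory as a single string is acceptable for a sketch but may not suit every setting.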

The first complete telomere-to-telomere sequence of a human chromosome, the X chromosome, was published in July 2020. Using long-read sequencing technologies from PacBio and Oxford Nanopore, researchers were able to generate high-coverage, ultra-long reads spanning hundreds of thousands of base pairs, which helped bypass some of the challenges of assembling the chromosome sequence.

The new X chromosome sequence comes from the CHM13 (complete hydatidiform mole) cell line, which is uniformly homozygous and has a 46,XX karyotype. This effectively haploid genome was used to avoid having to assemble both haplotypes of a "normal" diploid genome.

The X chromosome is linked to a number of diseases such as haemophilia, chronic granulomatous disease, and Duchenne muscular dystrophy. Closing the gaps in the X chromosome sequence assembly marks an important milestone in genomics and medical genetics.


Made by Anton Vasetenkov.
