Alternate gene annotations for rat, macaque, and marmoset for single cell RNA and ATAC analyses

dataset

Posted on 2022-09-21, 14:34; authored by BaDoi Phan, Andreas Pfenning

Custom genome and gene annotations for single cell ATAC and RNA-seq analyses

by BaDoi Phan (badoi dot phan at pitt dot edu)

This KiltHub upload is a clone of the GitHub repository, where this project may be updated or corrected in the future: https://github.com/pfenninglab/custom_ArchR_genomes_and_annotations

Premise:

Not all single-cell ATAC-seq work in biomedical molecular epigenetics is done in human and mouse, where high-quality genomes and gene annotations exist. For other species that are still highly relevant to the study of health and disease, here are some ArchR annotations to make it less frustrating to analyze snATAC-seq data with [ArchR](https://www.archrproject.com).

Strategy for better gene annotations:

We can use the property that related mammalian species tend to have orthologous gene elements (TSS, exons, genes). For example, according to [TimeTree](http://www.timetree.org), the house mouse (Mus musculus) diverged from the Norway rat (Rattus norvegicus) a median of 15.4 MY ago, and humans diverged from rhesus macaques a median of 28.9 MY ago. To borrow the higher-quality and more complete gene annotations, we can use a gene-aware method of lifting gene annotations from one genome to another, [liftoff, Shumate and Salzberg, 2021](https://academic.oup.com/bioinformatics/article/37/12/1639/6035128). As the source of "high quality" gene annotations, we use the NCBI RefSeq annotations for hg38/GRCh38 and mm10/GRCm38 downloaded from the UCSC Genome Browser.
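As a rough illustration of the lifting step, the sketch below only assembles a liftoff command line in Python; it does not run anything. The flags shown (`-g` for the reference annotation, `-o` for the output, `-u` for unmapped features, followed by the target and reference FASTA files) follow the liftoff command-line usage, but every file name here is a hypothetical placeholder, and the exact options used for the files in this upload are not stated here.

```python
import shlex

def build_liftoff_command(reference_fasta, target_fasta, reference_gff,
                          output_gff, unmapped_path="unmapped_features.txt"):
    """Assemble (but do not run) a liftoff call as an argument list.

    liftoff expects the target genome first and the reference genome second
    as positional arguments, after the options.
    """
    return [
        "liftoff",
        "-g", reference_gff,   # annotation to lift (e.g. human or mouse RefSeq)
        "-o", output_gff,      # lifted annotation on the target genome
        "-u", unmapped_path,   # features that could not be lifted
        target_fasta,
        reference_fasta,
    ]

# Hypothetical file names, for illustration only.
cmd = build_liftoff_command(
    reference_fasta="hg38.fa",
    target_fasta="rheMac10.fa",
    reference_gff="hg38.ncbiRefSeq.gff3",
    output_gff="rheMac10_liftoff_hg38.gff3",
)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where liftoff is installed.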

For single-cell RNA-seq, He, Kleyman et al. 2021 Current Biology (https://pubmed.ncbi.nlm.nih.gov/34727523/) found that a regular liftOver of the human NCBI RefSeq annotation to rheMac10 recovered a higher number of UMI counts assigned to genes. This is likely due to incomplete annotation, in either the rheMac8 or rheMac10 genome, of the 3' UTRs that are usually targeted by common single-cell/nucleus RNA-seq technologies. Reads that would otherwise fall "outside" a gene, because of an incomplete 3' UTR in the target species, can then be attributed to that gene using its ortholog from a more complete annotation in a related species. Furthermore, complex splicing is better characterized in humans, so more regions labeled "intergenic" under the rheMac10 annotation became "intronic" and could be assigned to a gene using the annotation lifted over from human. For this reason, we create alternate annotations for the rhesus macaque, marmoset, and rat genomes, borrowing orthology, as identified with the newer liftoff method, from the more complete human or mouse annotations.
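The effect of an incomplete 3' UTR can be seen in a toy example. The sketch below counts reads inside a gene interval under two annotations of the same gene: a draft one that stops early and a lifted one with a longer 3' end. All coordinates and read positions are invented for illustration and do not come from any real dataset.

```python
def count_reads_in_gene(read_positions, gene_start, gene_end):
    """Count reads whose position falls inside [gene_start, gene_end]."""
    return sum(gene_start <= pos <= gene_end for pos in read_positions)

# Made-up read positions clustered near the 3' end of a plus-strand gene,
# as expected for 3'-biased single-cell RNA-seq chemistries.
read_positions = [4200, 4900, 5100, 5350, 5600, 5790, 6500]

# Hypothetical draft annotation: the gene is annotated as ending at 5000.
draft_count = count_reads_in_gene(read_positions, 1000, 5000)

# Hypothetical lifted annotation: the 3' UTR extends to 5800.
lifted_count = count_reads_in_gene(read_positions, 1000, 5800)

print(f"draft annotation:  {draft_count} reads assigned")
print(f"lifted annotation: {lifted_count} reads assigned")
```

With these invented numbers, the draft annotation assigns 2 reads to the gene and the lifted annotation assigns 6; the read at 6500 stays unassigned under both.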

Similarly, for single-cell ATAC-seq, a more complete map of genes and transcription start sites (TSS) enables aggregate metrics like a "gene score" to better calculate gene-based measures for co-clustering with single-cell RNA-seq datasets. A more complete annotation would also more accurately classify single-cell open chromatin regions and avoid misreporting exonic regions or alternate promoters that were missed in primary transcriptomic data from macaque, marmoset, or rat but can be bioinformatically inferred.

Lastly, work by the ENCODE Consortium on large human and mouse epigenomic datasets has found that certain regions of the genome in these species produce artifactual signals and need to be excluded from epigenomic analyses, [Amemiya et al., 2019](https://www.nature.com/articles/s41598-019-45839-z). These regions for human and mouse were pulled from [here](https://github.com/Boyle-Lab/Blacklist/) and, for simplicity, mapped to the target genomes below with liftOver.
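To make the exclusion step concrete, here is a minimal sketch of dropping peaks that overlap excluded regions. It assumes BED-style half-open intervals held as `(chrom, start, end)` tuples; the function names and the toy coordinates are hypothetical, and this is not how ArchR itself applies the blacklist internally.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end

def remove_blacklisted(peaks, blacklist):
    """Keep only peaks that overlap no blacklist region on the same chromosome."""
    kept = []
    for chrom, start, end in peaks:
        hit = any(
            chrom == b_chrom and overlaps(start, end, b_start, b_end)
            for b_chrom, b_start, b_end in blacklist
        )
        if not hit:
            kept.append((chrom, start, end))
    return kept

# Toy intervals, invented for illustration.
blacklist = [("chr1", 1000, 2000), ("chr2", 500, 900)]
peaks = [
    ("chr1", 100, 600),    # kept: ends before the chr1 region
    ("chr1", 1900, 2400),  # dropped: overlaps chr1:1000-2000
    ("chr2", 900, 1200),   # kept: starts exactly where the chr2 region ends
    ("chr3", 500, 900),    # kept: same coordinates but a different chromosome
]

clean_peaks = remove_blacklisted(peaks, blacklist)
print(clean_peaks)
```

The linear scan is fine for a handful of regions; for genome-scale peak sets an interval tree or a dedicated tool would be the more usual choice.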

list of resources by file name

Surprisingly, for a couple of custom genomes, all these files are small enough to put on GitHub. The files are organized as follows:

- *.gtf.gz and *.gff3.gz: the gzipped annotations lifted from the higher-quality annotations to the target genome using [liftoff](https://github.com/agshumate/Liftoff)

- *liftOver*blacklist.v2.bed: the ENCODE regions to exclude from epigenomic analyses mapped to the target genome using [liftOver](https://genome-store.ucsc.edu)

- *ArchRGenome.R: the Rscript used to make the custom ArchR annotations

- *ArchR_annotations.rda: the R Data object that contains the geneAnnotation and objects to use with [ArchR::createArrowFiles()](https://www.archrproject.com/reference/createArrowFiles.html)

list of species/genomes/source files

For most of these files, the genome fasta sequences were grabbed from the UCSC Genome Browser at https://hgdownload.soe.ucsc.edu/goldenPath/${GENOME_VERSION}/, where ${GENOME_VERSION} is any of the versions below except **mCalJac1**. Some of these genomes were updated by the Vertebrate Genomes Project, which seeks to create complete rather than draft genome assemblies of all vertebrates on the planet, [Rhie et al. 2021](https://www.nature.com/articles/s41586-021-03451-0). These genomes are marked **VGP**, along with the VGP assembly name where an alternate naming scheme exists. The VGP is [pretty cool](https://vertebrategenomesproject.org) and they make good [genome assemblies](https://vgp.github.io/genomeark/).

- rn6: [rat genome v6, BCM-Baylor version](https://www.nature.com/articles/nature02426)

- rn7: [rat genome also called VGP mRatBN7.2](https://journals.physiology.org/doi/abs/10.1152/physiolgenomics.00017.2022)

- rheMac8: [rhesus macaque v8](https://hgdownload.soe.ucsc.edu/goldenPath/rheMac8/bigZips/)

- rheMac10: [rhesus macaque v10](https://www.science.org/doi/10.1126/science.abc6617)

- mCalJac1: marmoset VGP genome, [fasta from the maternal assembly here](https://www.ncbi.nlm.nih.gov/assembly/GCA_011078405.1/)
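The `${GENOME_VERSION}` URL pattern described above can be expanded for the listed genomes as in the sketch below. The function name is hypothetical, and the `bigZips/` subdirectory is an assumption carried over from the rheMac8 link above to the other genomes. mCalJac1 is deliberately rejected because its fasta comes from NCBI rather than UCSC.

```python
UCSC_BASE = "https://hgdownload.soe.ucsc.edu/goldenPath"

# Genome versions from the list above that are hosted at UCSC.
UCSC_GENOMES = ("rn6", "rn7", "rheMac8", "rheMac10")

def ucsc_genome_url(genome_version, subdir=""):
    """Return the UCSC goldenPath directory URL for one genome version.

    `subdir` is optional; "bigZips" matches the rheMac8 link above and is
    assumed, not confirmed here, to apply to the other genomes as well.
    """
    if genome_version not in UCSC_GENOMES:
        raise ValueError(f"{genome_version} is not one of the UCSC-hosted genomes")
    url = f"{UCSC_BASE}/{genome_version}/"
    if subdir:
        url += f"{subdir}/"
    return url

for version in UCSC_GENOMES:
    print(version, ucsc_genome_url(version, subdir="bigZips"))
```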