Or, one scientist’s descent into (further) madness
About the author
Miranda Wade received her B.S. in Biological Science from Colorado State University and her dual PhD in Integrative Biology and Ecology, Evolutionary Biology, and Behavior from Michigan State University. During her time in the Meek Lab at MSU, her work consisted of using ‘omics to address various conservation questions in both a rare desert place facing land-use change and the molecular consequences of microplastics exposure in a model fish species. She is currently the Social Media Editor for the American Genetic Association and a PostDoc in the Sin Lab at the University of Hong Kong. For her postdoctoral work, she is exploring the genomic basis of coloration in birds.
Hello, lovely AGA Blog readers. It is I, your somewhat negligent social media/blog editor. Oops. I am currently writing to you from a semi-dark corner in my current lab while it is early enough in the morning that no one else is here. Recently, I decided it would be a grand idea to learn how to annotate a reference genome. I needed to do it anyways to process some data (aka for my JOB) but as someone who used either a) a non-model species or b) a model species with an already annotated genome for my PhD work, I was in uncharted territory. And because I am me, I thought learning how to do this would be ~f u n~ (okay, it kind of was, genomics is just SO COOL). Since I am sure I am not the only highly-educated dummy out there who needs to learn new skills to do new things (especially in the era of everyone just sequencing entire genomes for funsies), I thought I would somewhat humorously share some of the programs I used, things I learned, and misadventures I stumbled into.
First off, you need a decent genome to annotate. I was using the Northern Cardinal (Cardinalis cardinalis) genome. Now, I did not build the genome that I annotated; my boss actually did that during his PostDoc in the Edwards Lab. Fun fact: Dr. Edwards is a past President of the AGA and the Distinguished Lecturer for the 2024 AGA Presidential Symposium. Isn’t academia fun?!
Once you have a genome, you can choose among a few different ways to get things going. Since an annotation is essentially a map of where all the genes are in the genome of interest, you can predict these genes through a systematic search of the genome (aka ab initio), through the similarities in genes between species (aka homology), or through a combination of these methods, which is what I did. Having some sort of transcriptome or RNAseq evidence improves gene prediction, so I thought to use some blood RNAseq data from the Sequence Read Archive along with the existing genome to build a transcriptome. I figured using RNAseq from the blood would be better than nothing while we were waiting on our RNAseq data for other tissues (spoiler alert, it kind of wasn’t, since a single tissue is not going to give you the most complete representation of the genes being expressed). I initially made a genome-assisted transcriptome using StringTie and off I went.
The program suggested to me for ab initio prediction was MAKER. I was lucky enough to stumble across this amazing tutorial by Daren Card, which nicely explained things in a way that my smooth brain could understand and ultimately use. I know some people prefer BRAKER, but I think MAKER served me well. It’s also nice that MAKER includes lots of things I could use ‘out of the box’ such as the RepBase library for RepeatMasker for Aves and AUGUSTUS gene prediction data for the chicken. I think the biggest issue with running MAKER is that she is a memory-hungry program that can easily gobble up 100+GB of RAM while running. MAKER does best with gene prediction after at least two rounds, so I waited (days) for the high-performance computing cluster to clear enough for my job to start and then three days for it to run (can you tell I like to be dramatic?) to see the fruits of my effort. The first run was okay and the second run had about 17,000 genes predicted. I then took those predictions and used GeMoMa for some fun homology gene searching among other birds. I then performed the customary functional analysis and annotation as well as a BUSCO assessment of the completeness of my annotation (96% in case anyone wanted to know) and felt very proud of myself.
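For anyone who wants to sanity-check a number like "about 17,000 genes" on their own annotation, here is a minimal Python sketch. It assumes the annotation is a standard nine-column, tab-separated GFF3 file (the format MAKER writes), and it simply tallies the feature types in the third column. The function name `count_features` and the toy records in `TOY_GFF3` are made up for illustration and are not taken from the cardinal annotation.

```python
import io
from collections import Counter


def count_features(handle):
    """Tally GFF3 feature types (column 3) from an open text handle."""
    counts = Counter()
    for line in handle:
        line = line.rstrip("\n")
        # GFF3 files may end with an embedded FASTA section; stop there.
        if line.startswith("##FASTA"):
            break
        # Skip blank lines, directives, and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A valid GFF3 record has exactly nine tab-separated columns.
        if len(fields) != 9:
            continue
        counts[fields[2]] += 1
    return counts


# Toy records (hypothetical IDs and coordinates) standing in for a real file.
TOY_GFF3 = "\n".join([
    "##gff-version 3",
    "scaffold_1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "scaffold_1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "scaffold_1\tmaker\texon\t100\t400\t.\t+\t.\tID=exon1;Parent=mrna1",
    "scaffold_1\tmaker\texon\t600\t900\t.\t+\t.\tID=exon2;Parent=mrna1",
    "scaffold_2\tmaker\tgene\t50\t300\t.\t-\t.\tID=gene2",
    "scaffold_2\tmaker\tmRNA\t50\t300\t.\t-\t.\tID=mrna2;Parent=gene2",
    "scaffold_2\tmaker\texon\t50\t300\t.\t-\t.\tID=exon3;Parent=mrna2",
    "##FASTA",
    ">scaffold_1",
    "ACGTACGT",
])

counts = count_features(io.StringIO(TOY_GFF3))
print(counts["gene"], "genes,", counts["mRNA"], "transcripts")
```

On a real file you would pass an open file handle instead of the toy string, for example `with open("my_annotation.gff") as fh: count_features(fh)`, where the filename is a placeholder for wherever your annotation lives.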
THEN EVERYTHING CHANGED WHEN THE FIRE NATION ATTACKED
And by ‘fire nation’ what I mean is an indexing error in my directory that made all my files go POOF (of course I did not have them backed up). I think some of my brain cells/mental well-being also went poof, but that’s neither here nor there. Anyway, I got to do it all again (yippee) but at this point, I had RNAseq data from other tissues to throw together with my blood sample, so I used Trinity (can we fund them again?? I am concerned) for a de novo transcriptome assembly to improve my annotation, and I ended up using eggNOG-mapper for functional annotation and to identify some unknown genes from MAKER/GeMoMa. Finally, I have an annotated (and backed-up!) Northern Cardinal genome to use for some fun differential expression analyses. Who am I working with (outside my home lab)? What am I looking at? When will I finish (spoiler alert, I am a PostDoc now so the answer must be SOON)? Where will it be published? How am I doing (obviously I am unhinged, but that’s par for the course at this point)?
Moral of the story:
- Back up your data sooner rather than later
- There are always a dozen ways to achieve the same (or highly similar) result
- When I think ‘That sounds fun’ in the context of science, maybe I should take more than a moment to think before agreeing to something (or maybe I shouldn’t, we must do things FOR THE PLOT).