XY sex determination in a cnidarian

Breeding and animal maintenance

Colonies were maintained at the University of Pittsburgh as described in [47]. Briefly, they were grown on 75 mm × 25 mm glass slides in 38-l aquaria filled with artificial seawater (Instant Ocean Reef Crystals) and maintained at 22–23 °C. Adult colonies were either fed 3-day-old Artemia nauplii (3 × per week) or a suspension of pureed oysters (× per week). Breeding colonies were kept on a 8-h/16-h light/dark cycle. After their first exposure to light, males and females were placed in separate 3-l bins, where they released gametes 1–1.5 h later. Within 20 min of spawning, eggs were harvested by filtering water from the female bin through a 20-μm cell strainer, and sperm was harvested by collecting 10–15 ml from the male container. Eggs were fertilized by mixing them with the sperm and 15 ml of additional artificial seawater in a 100-mm polystyrene petri dish. Embryos developed into larvae and were settled 72–96 h post-fertilization (hpf). Metamorphosis was induced by incubating the larvae in 56 mM CsCl in filtered seawater for 4–6 h, then transferring them with glass pipettes onto glass microscope slides. Three days later, larvae that successfully metamorphosed began feeding as described above, but without the oyster supplement.

Mapping population and sample preparation

A mapping population was created by crossing a male colony (291–10) to a female half-sibling (295–8). The resulting F1 offspring were observed weekly until male or female gonophores could be discerned, at which point they were scored appropriately and moved to male-only or female-only tanks. Thereafter, colonies were observed and cleaned biweekly.

DNA was extracted from colonies when they had grown to cover approximately 2 cm2 of their slide. In most cases, this was performed before the animals had been classified as male or female. Animals were starved for at least two days, then a portion of the colony measuring ~ 1 cm2 was removed by scraping it from the slide with a razor blade. Harvested tissue was placed in a 1.7-mL microfuge tube, briefly spun in a benchtop microcentrifuge, and residual seawater removed by aspirating with a pipette. Tissue was lysed by the addition of 200 μL UEB buffer (7 M Urea, 0.3125 M NaCl, 0.05 M Tris–HCl pH 8.0, 0.02 M EDTA pH 8.0, 1% N-Lauroylsarcosine sodium salt) [48], followed by grinding with a plastic pestle until all tissue was dissolved. Next, one volume of equilibrated phenol:chloroform:isoamyl alcohol (25:24:1) was added and the mixture homogenized by inverting vigorously. The mixture was centrifuged for 10 min at > 3000 g, and the aqueous phase transferred to a new tube. Total nucleic acid was precipitated with 0.7 volumes of isopropanol, then centrifuged for 30 min at top speed (at least 13,000 rpm) at room temperature. The resulting DNA pellet was washed twice with 70% ethanol, then transferred to a new tube, centrifuged for 1 min, and excess 70% ethanol aspirated with a pipette. To remove RNA, the DNA pellet was resuspended in 1X TE, and 1 μL Ambion Rnase cocktail (ThermoFisher, Cat. AM2286) per 100 μL suspension was added, followed by incubation at 37 ℃ for 15 min. DNA was then re-extracted with phenol:chloroform:isoamyl alcohol (25:24:1), precipitated with isopropyl alcohol, washed with ethanol, and resuspended with 1XTE as described above. DNA samples were stored at − 20 °C prior to being sent to the NIH Intramural Sequencing Center for sequencing.

Karyotype analysis

Embryos from a cross between colony 291–10 and colony 295–8 were used to generate multiple embryos for karyotyping. Embryos at the 64–128 cell stage (approximately 8 h post-fertilization) were submitted to the University of Pittsburgh Cell Culture and Cytogenetics Facility. The embryos were split into three tubes with 3 ml of seawater each. Two tubes were treated with 60 µl ColcemidTM (0.1 µg/mL) and one tube was treated with 60 µl vinblastine (0.01 mg/ml) (f.c. 0.2 ng/µl) for 90 min with gentle shaking. Following mitotic arrest, one Colcemid tube was treated with 5 ml 1:1 (sterile water:Mg + Ca + free seawater) hypotonic solution with 50 µl Colcemid for 30 min with constant agitation; the vinblastine tube was treated with 5 ml 1:1 (sterile water:Mg + Ca + -free seawater) hypotonic solution with 50 µl vinblastine for 30 min with constant agitation, and the third tube with 90 min Colcemid was treated with 0.075 M KCl with 50 µl Colcemid for 30 min. Tubes were fixed with Carnoy’s fixative and stored at − 20 °C overnight. Slides were prepared the next day following a few washes in the fixative. Both Colcemid-treated samples produced a few metaphase cells, but the vinblastine tube produced no visible metaphase cells. Slides were observed on an Olympus BX61 microscope and imaged and analyzed using the Genus software platform on the Cytovision System (Leica Microsystems, San Jose, CA).

Whole genome sequencing and SNP discovery

A detailed description of the complete analysis pipeline, including all scripts, can be downloaded from https://github.com/nicotralab/chen-et-al-sex-determination [49]. Several data sets were too large to be placed in the GitHub repository but can be downloaded directly from https://zenodo.org/record/6368105 [50]. The reference assembly of the Hydractinia genome used in this study was that of the male parent, colony 291–10 [51, 52]. It was assembled from a combination of PacBio long-read and Illumina-short read sequencing and consisted of 4,840 scaffolds, with an N50 of 2.2 Mb. Raw reads are available via BioProject PRJNA807936. The version of the assembly used in this project can be downloaded from https://zenodo.org/record/6368105 [50].

To generate sequence data for variant calling for the female parent and all F1 progeny, PCR-free libraries were generated from 1 µg genomic DNA using the TruSeq® DNA PCR-Free HT Sample Preparation Kit (Illumina). The median insert sizes were approximately 400 bp. Libraries were tagged with unique dual index DNA barcodes to allow pooling of libraries and minimize the impact of barcode hopping. Libraries were pooled for sequencing on the NovaSeq 6000 (Illumina) to obtain at least 500 million 151-base read pairs per individual library. We obtained a mean coverage of 47 ± 11 × per sample, with a mean depth of 162 ± 38 million reads per sample (for per-sample statistics, see the “Methods” section and Additional File 9).

All raw sequence data for the female parent and offspring are available via BioProject PRJNA816479 at NCBI. For the male parent, we used Illumina sequencing data from the genome project. The original Illumina dataset, which consisted of 240 million reads, was downsampled with seqtk [53], resulting in 66 × 106 251 bp paired end reads. The downsampled files can be downloaded from https://zenodo.org/record/6368105 [50].

For each sample, raw reads were mapped to an assembly of the paternal genome. Reads were mapped to the assembly using BWA-MEM [54] with mapping parameters “-M -t 8.” The resulting.sam files were converted to.bam format and then sorted with Samtools [55]. Duplicates were then marked with Picard [56]. Genotypes were called with GATK HaplotypeCaller [25]. The resulting file, rawvariants.90f1.vcf.gz, can be downloaded from https://zenodo.org/record/6368105 [50]. A total of 9.74 million single nucleotide polymorphisms (SNPs) and 1.83 million insertion/deletion variants (indels) were identified.

Raw variant calls were filtered to generate datasets of high-quality variants suitable for genetic mapping. Briefly, raw variant calls were filtered with a custom python script (qualityfilter.py) to retain only those variants for which (1) no samples were missing data; (2) all samples genotyped as homozygous reference (0/0) had no more than two mapped reads corresponding to the alternative allele and also had more than ten mapped reads corresponding to the reference; (3) all samples genotyped as homozygous alternative (1/1) had no more than two mapped reads corresponding to the reference allele and also had more than ten mapped reads corresponding to the alternate allele; and (4) all samples genotyped as heterozygous (0/1) had an alternate allele read count percentage of greater than 0.3 or less than 0.7. This dataset was further filtered according to GATK best practices [57]. Specifically, SNPs were flagged as low quality if they met any of the following criteria: quality by depth (QD) < 2; Fisher’s exact test of strand bias (FS) > 60; RMS mapping quality (MQ) < 40; rank sum of alt versus reference mapping quality (MQRankSum) < 12.5; read position rank sum (ReadPosRankSum) < 8; and read depth (DP) < 10. Indels were flagged as low quality if they met any of the following criteria: QD < 2.0, FS > 200, or ReadPosRankSum <  − 20.0. Flagged variants were then removed with bcftools [55]. The resulting file, GATK-passed.vcf, can be downloaded from https://zenodo.org/record/6368105 [50]. After this filtering, 1,312,632 variants (1,083,937 SNPs and 228,695 indels) remained.

From this filtered dataset, we generated two sets of markers suitable for mapping via a pseudo-testcross strategy [26]. In a pseudo-testcross, two parents from an outcrossing population are bred to create an F1 population. A genetic map for each parental genome is then constructed with markers that are heterozygous in that parent and homozygous in the other. For example, a genetic map of the maternal genome can be constructed from variants that are homozygous in the male parent and heterozygous in the female parent (i.e., “0/0” × “0/1” in the notation of a.vcf file). Likewise, a genetic map of the paternal genome can be constructed from variants heterozygous in the male parent and homozygous in the female parent (e.g., “0/1” × “0/0”). To create a set of variants for mapping the maternal genome we used the Linux command-line tool “awk” to extract variants with paternal genotype “0/0” and maternal genotype “0/1” (hereafter, the “female PT dataset”). Note that variants with paternal genotype “1/1” were not included because the paternal genome was the reference genome, thus no paternal genotype should be homozygous for the alternative allele. To create a dataset of markers to map the paternal genome we used “awk” to extract variants with paternal genotype “0/1” and maternal genotype “0/0” or “1/1” from the filtered dataset (hereafter, the “male PT dataset”). In this study, we identified 303,020 variants (255,004 SNPs and 48,016 indels) suitable for mapping the maternal genome and 251,912 variants (217,242 SNPs and 34,670 indels) suitable for mapping the paternal genome. SNPs and indels were treated equivalently for mapping purposes.

Next, we used a custom bash script (getPTvariants.sh) to identify probable genotyping errors according to the segregation pattern of offspring genotypes. In both resulting datasets, the segregation of F1 genotypes is expected to be 1:1 homozygous:heterozygous. Homozygous offspring should have the same genotype as the homozygous parent (e.g., if one parent is “0/0” and the other parent “0/1”, homozygotes should be “0/0”). An F1 offspring with a genotype of the alternative homozygote class (“1/1” in the preceding example) would probably be the result of a genotyping error. We determined the frequencies of such errors for each F1 offspring, and found they had an average of 1.18% ± 0.21% (mean ± standard deviation) in the female PT dataset and 1.24% ± 0.19% in the male PT dataset.

Unexpected homozygous genotypes could also arise in F1 offspring if one of the two parents were misgenotyped. At such a variant, the unexpected homozygote class should segregate with other genotypes in a Mendelian pattern. To search for this type of genotyping error, we determined the frequency of unexpected homozygous genotypes at each variant in both datasets. We found 5.12% of variants in the female PT dataset and 4.9% of variants in the male PT dataset had unexpected homozygotes. At each of these variants, the number of unexpected homozygotes averaged 24.59% ± 0.68% in the female PT dataset and 23.69% ± 1.13% in the male PT dataset. The occurrence of unexpected homozygotes at a frequency of ~ 0.25 is consistent with both parents being heterozygous at these variants. To exclude these variants, as well as the genotyping errors described above, we changed all unexpected homozygotes to missing data, then excluded variants with more than 10% missing data from each PT dataset. The resulting files, GATKBP-passed.femaleHet.abxabRemoved.vcf and GATKBP-passed.maleHet.abxabRemoved.vcf can be downloaded from https://zenodo.org/record/6368105 [50].

Genetic map construction

Genetic maps were constructed in R (version 3.6.1) [58] with package OneMap (Version 2.1.1) [59]. Prior to importing the datasets into R/Onemap, variants were tested for Mendelian segregation (χ2 goodness of fit) and those where p < 0.00001 were removed with a custom Perl script (removeDistorted.pl) The resulting files, femalePT.vcf.gz and malePT.vcf.gz, are available at https://github.com/nicotralab/chen-et-al-sex-determination [49]. Each dataset was then thinned with vcftools [60] to ensure that the distance between adjacent variants was no less than 5000 bp and converted from vcf format into OneMap’s “.raw” format with a bash shell script (thin-for-onemap.v2.sh). After this step, we were left with 23,462 variants (20,058 SNPs and 3404 indels) for the maternal genome and 22,359 variants (19,771 SNPs and 2863 indels) for the paternal genome (Additional Files 10 and 11).

We constructed linkage maps in R (version 3.6.1; [58] with the package OneMap (Version 2.1.1) [59]. Variants with identical genotypes in the F1 animals were binned to create single markers for linkage mapping with the function find_bins(). After binning, we recalculated segregation distortion using a Bonferroni-corrected p-value of 0.05 and removed any remaining distorted markers. This resulted in a set of 977 markers for the maternal genome and 487 for the paternal genome. Two-point tests were used to calculate recombination fractions and LOD scores for each pair of markers, and linkage groups between non-distorted markers identified with a maximum recombination fraction (rf) of 0.4 and minimum LOD score determined by the OneMap function suggest_lod() (6.14 for the female dataset and 5.56 for the male dataset). In both cases, 15 linkage groups were obtained.

To order markers within each linkage group, the function order_seq() was used. This function selects an initial set of five markers and applies an exhaustive search to determine the order with the lowest LOD score. To this framework map, the remaining markers are added one-by-one to optimize the total LOD score of the growing map. Recombination fractions were converted to gastrozooid (cM) units using the Kosambi map function.

Most of the initial maps contained pairs of markers that were placed within 0.0001 cM of each other by the OneMap software. Upon closer inspection, we discovered that these markers were simply markers that were located on the same contig in the reference genome but had their alternative alleles in opposite phase of one another. Since these markers were essentially redundant to one another we decided to remove them from the maps. To do this we identified them in the initial maps with a custom perl script (identify_redundant_markers.pl), then removed them using the drop_marker function in OneMap.

A recombination fraction plot was then generated with the function rf_graph_table and visually inspected to identify misplaced markers. These were removed from the map and re-inserted with the try_seq() function. Markers that could not be confidently placed were removed entirely from the final maps. Summary statistics for each map were calculated using the Genetic Map Comparator [61]. The “unbinned” maps were created by using the custom perl script unbin_markers_in_map.pl.

Comparison of recombination rates

To compare recombination rates in the female and male genomes, we identified pairs of markers in the final maps that were located within 5 kb of each other in the reference genome assembly. Linkage groups having two or more such markers were then reconstructed as described for the initial maps. Linkage maps were then compared and summary statistics calculated with the Genetic Map Comparator [61].

QTL mapping

Loci linked to sexual phenotype were identified using in R with package qtl (version 1.44–9) [28]. Prior to QTL mapping, the data were prepared for R/qtl with a custom perl script (onemapRaw_to_Rqtl.pl). Briefly, for each PT dataset, the marker data in OneMap’s.raw format and the corresponding linkage map were combined and converted to R/qtl’s.csvr format. Header information including the phenotypes (sex) of each colony was added manually. The two resulting files (femaledata.rqtl.with.phenotypes.csvr and maledata.rqtl.with.phenotypes.csvr available in from https://github.com/nicotralab/chen-et-al-sex-determination) were imported into R using the read.cross function. For each dataset, QTL genotype probabilities were calculated with calc.genoprob with default settings and the kosambi map function. A single-QTL genome scan was performed using the function scanone for a binary phenotype under the Haley-Knott regression model, with 1000 permutations and perm.Xsp = 0. Significance thresholds were calculated with the summary function. Figures were created with the plot function, then imported into Adobe Illustrator for further annotation.

Depth of coverage analysis with SATC

For each sample, raw illumine reads were mapped to the reference genome using BWA-MEM2 [62], and the resulting sam files were sorted and compressed with Samtools [55]. For this mapping, contig HyS0057 was split into three segments to reflect the misassembly from 1,104,702 to 1,185,628 (Additional File 1: Table S1). Duplicates were removed with Picard [56]. Secondary and unmapped reads were removed with Samtools fixmate and the resulting bam file was indexed with Samtools index. Index stats were calculated with Samtools idxstats. SATC was run online via a Shiny app that is available at http://popgen.dk:3838/genis/satc/. The idxstats files for each sample were converted into an input file with the bash script make_shiny_input.sh, which was obtained from http://popgen.dk:3838/genis/satc/. The resulting datafile (Additional file 12) was uploaded and SATC was run with the following options: scaffolds weighted by length, gaussian clustering, and a minimum scaffold length of 2500 bp. Normalization was based on the mean coverage of the five longest contigs in the genome assembly.

Identification of candidate Y-linked genes

Coordinates from each annotated gene from the reference assembly were used to cut assembled contigs into individual gene contigs. A FASTA file of all gene contigs was then used as a new reference assembly for mapping. Raw Illumina reads from each 339 library were individually mapped to this reference using HISAT2 under default parameters. The resulting sam file was converted to a sorted bam for further analysis. Coverage of each gene was summarized using the samtools coverage function. Coverage statistics were then assessed using a simple python script to identify genes that met our criteria.

Gene annotation

Gene models were downloaded from the Hydractinia genome portal [52]. Gene functions were predicted using the standalone version of PANNZER2 [63] with default parameters, as well as with a DIAMOND [64] similarity search against nr with the option “blastp.” Homologous sequences in the NCBI Model Organisms (landmark) database were identified using BLASTP as implemented through the BLAST website (https://blast.ncbi.nlm.nih.gov/Blast.cgi). Conserved protein domains were identified with the Pfam database using hmmscan as implemented by the HMMER Webserver [65].

Gene expression analysis

Sex-specific expression of genes in the putative sex determination locus was estimated using previously published RNA-seq libraries from H. symbiolongicarpus gastrozooids and gonozooids [32,33,34,35]. Here, we reanalyzed the raw sequencing data to identify genes that were expressed only in one type of sexual polyp and in no other polyp type. The paired-end RNA-seq reads from each library were mapped to the entire genome assembly using HISAT2 [66] under default settings. The resulting mapping files were processed and sorted using Samtools [55] before proceeding to quantitation. Using the reference annotations for the primary haplotype of the genome, FPKMs of each gene model were estimated for each library and normalized by library size using the cuffnorm function of Cufflinks [67].

留言 (0)

沒有登入
gif