Sequencing by avidity enables high accuracy with low reagent consumption

Solution measurements of nucleotide incorporation

Solution measurements of nucleotide kinetics were performed using commercially available dATP-Cy5 (Jena Bioscience, catalog no. NU-1611-CY5-S). DNA substrates for solution kinetic assays were prepared by annealing a 5′-FAM-labeled, high-performance liquid chromatography-purified primer oligo (5′-CGAGCCGTCCAACCTACTCA-3′; purchased from IDT) with a template oligo (5′-ACGACCATGTTGAGTAGGTTGGACGGCTCG-3′). Annealing was performed with 10% excess template oligo in the annealing buffer using a PCR machine to heat oligos to 95 °C, followed by slow cooling to room temperature over 60 min. Solution kinetic measurements were performed by mixing a preformed enzyme–DNA complex with fluorescent nucleotide and MgSO4 using a RQF3 Rapid Quench Flow (KinTek Corp.). The enzyme used was an engineered variant of an enzyme from Candidatus altiarchaeales archaeon. The final reaction was conducted in 25 mM Tris pH 8.5, 40 mM NaCl and 10 mM ammonium chloride at 37 °C. Extension products were separated from unextended primer oligos by capillary electrophoresis using a 3500 Series Genetic Analyzer (ThermoFisher) to achieve single-base resolution. Products were quantified and the resulting time courses were fit to a single-exponential equation. The observed rates as a function of nucleotide concentration were then fit to a hyperbolic equation to derive the apparent Kd (Kd,app) and the rate of polymerization (kpol).
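The two-stage fit described above can be illustrated with a minimal sketch. It assumes the usual forms, product(t) = A(1 − exp(−k_obs·t)) and k_obs = kpol·[N]/(Kd,app + [N]). The text does not state which fitting software or method was used, so the Hanes-Woolf linearization below is a simple stand-in for nonlinear least squares. All function names and numerical values are hypothetical.

```python
import math

def single_exponential(t, amplitude, k_obs):
    """Product formed at time t for a single-exponential time course."""
    return amplitude * (1.0 - math.exp(-k_obs * t))

def hyperbola(conc, k_pol, kd_app):
    """Observed rate as a function of nucleotide concentration."""
    return k_pol * conc / (kd_app + conc)

def fit_hyperbola(concs, rates):
    """Estimate (k_pol, kd_app) with the Hanes-Woolf linearization:
    conc / rate = conc / k_pol + kd_app / k_pol.
    Assumption: used here only as a dependency-free stand-in for a
    nonlinear least-squares fit."""
    ys = [c / r for c, r in zip(concs, rates)]
    n = len(concs)
    mean_x = sum(concs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in concs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(concs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    k_pol = 1.0 / slope
    kd_app = intercept * k_pol
    return k_pol, kd_app

# Hypothetical, noise-free example values (not data from this study).
concs = [0.5, 1.0, 2.0, 5.0, 10.0, 20.0]  # nucleotide concentration, µM
rates = [hyperbola(c, k_pol=10.0, kd_app=2.0) for c in concs]  # s^-1
k_pol_fit, kd_fit = fit_hyperbola(concs, rates)
```

With real, noisy data a weighted nonlinear fit would normally be preferred, because linearization distorts the error structure.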

Avidite synthesis and construction

Initial research-scale avidites were constructed by dissolving 5 mg of 10 kD 4-arm-PEG-SG (Laysan Bio, catalog no. 4arm-PEG-SG-10K-5g) in 100 µl of 95% organic solvent (for example, ethanol) and 5 mM MOPS pH 8.0 to make a 50 mg ml–1 solution (5 mM), 19 µl of which was combined with 1.5 µl of 10 mM dATP-NH2 (7-deaza-7-propargylamino-2′-deoxyadenosine-5′-triphosphate; Trilink, catalog no. N-2068) and 8.0 µl of 3.75 mM 2 kD Biotin-PEG-NH2 (Laysan Bio, catalog no. Biotin-PEG-NH2-2K-1g) in 95% organic solvent (for example, ethanol) and 5 mM MOPS pH 8.0. After mixing, 5 mM 10 kD 4-arm-PEG-SG was added. The final composition was 0.50 mM dA-NH2, 1.0 mM biotin-PEG-NH2 (2 kD), 0.25 mM 4-arm-PEG-NHS, 85.5% organic solvent (for example, ethanol) and 4.5 mM MOPS pH 8.0. Following incubation at 25 °C for 90 min with shaking at 1,000 rpm, the reaction volume was adjusted to 100 µl by the addition of MOPS pH 8.0. Purification was performed using a Bio-Rad Bio-Spin P6 column pre-equilibrated in 10 mM MOPS pH 8.0. The purified dATP-PEG–biotin complex was mixed with Zymax Cy5 Streptavidin (Fisher Scientific, catalog no. 438316) in a 2.5:1 volumetric ratio and allowed to equilibrate for 30 min at room temperature.
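As a consistency check on the stated composition, the short sketch below back-calculates the total reaction volume implied by the stock and final concentrations using C1·V1 = C2·V2. The 30 µl total and the volume of 5 mM 4-arm-PEG stock are inferences from the stated concentrations, not values given in the text, and the helper name is hypothetical.

```python
def implied_total_volume_ul(stock_mM, volume_ul, final_mM):
    """Total volume (µl) implied by C1 * V1 = C2 * V2."""
    return stock_mM * volume_ul / final_mM

# dATP-NH2: 1.5 µl of 10 mM stock, 0.50 mM final.
total_from_datp = implied_total_volume_ul(10.0, 1.5, 0.50)

# Biotin-PEG-NH2: 8.0 µl of 3.75 mM stock, 1.0 mM final.
total_from_biotin = implied_total_volume_ul(3.75, 8.0, 1.0)

# Inferred (not stated in the text): volume of 5 mM 4-arm-PEG stock
# needed to reach 0.25 mM in that total volume.
peg_volume_ul = 0.25 * total_from_datp / 5.0

print(total_from_datp, total_from_biotin, peg_volume_ul)
```

Both nucleotide and biotin-PEG figures point to the same total volume, which suggests the stated final concentrations are internally consistent.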

Real-time measurement of avidite association and dissociation

Real-time measurement of avidite binding kinetics was performed using an Olympus IX83 microscope at 545 and 635 nm excitation (Lumencor Light Engine) set to a power density of approximately 1 W cm–2, with an Olympus objective (catalog no. UCPLFLN20XPH) and a Semrock BrightLine multiband laser filter set (catalog no. LF405/488/532/635) containing a matching quad-band exciter, emitter and dichroic. Flow rates of 60 µl s–1 were used for reagent exchanges. Circular PhiX libraries were introduced to AVITI flow cells, hybridized in 3× SSC buffer for 5 min at 50 °C and cooled to room temperature. Amplification reagents were introduced into the flow cell to perform rolling circle amplification and amplify genomic DNA. The instrument was paused following polony generation and priming, and the flow cell was moved to the microscope. Custom control software was written to control all peripheral hardware and synchronize data collection with flow of materials into the sample. Data collection (4 fps) was triggered by flow of the avidity mix and continued for 55 s. Polonies in the field were localized by a spot-finding algorithm, and background-corrected intensities were extracted versus time. Experiments were performed at 0.5 pM, 1 nM, 7.5 nM and 10 nM avidite or monovalent dye-labeled nucleotide concentrations. Substrates at the respective concentrations were combined with 100 nM engineered enzyme variant of C. altiarchaeales archaeon in the avidity on-rate assay buffer formulation (25 mM HEPES pH 8.8, 25 mM NaCl, 0.5 mM EDTA, 5 mM strontium acetate, 25 mM ascorbic acid and 0.2% Tween-20). Avidites and nucleotides were labeled with Alexa Fluor 647. Data collection at higher concentrations was limited by the ability to distinguish polony intensity from free avidite intensity. Off-rate measurements were performed by binding avidites to flow cell polonies, followed by washing with avidity on-rate assay buffer and triggering of data collection.

Genomic DNA and next-generation sequencing library preparation

Human DNA from cell line sample HG002 was obtained from the Coriell Institute. Linear next-generation sequencing library construction was performed using a KAPA HyperPrep library kit (Roche, catalog no. 07962363001) according to published protocols. Finished linear libraries were circularized using the Element Adept Compatibility kit (catalog no. 830-00003). Final circular libraries were quantified by quantitative PCR with the standard and primer set provided in the kit. Circular library DNA was denatured using sodium hydroxide and neutralized with excess Tris pH 7.0 before dilution. Denatured libraries were diluted to 8 pM in hybridization buffer before loading onto the sequencing cartridge.

Single-cell 3′ gene expression library circularization

Single-cell RNA-seq libraries were prepared from two lots of peripheral blood mononuclear cell suspension (10,000 and 1,000 cells) using the Chromium Next GEM Single Cell 3′ Kit v.3.1 (catalog no. 1000268). Each library was quantified and individually processed for sequencing using the Adept Library Compatibility Kit (catalog no. 830-00003). Processed libraries were pooled and sequenced with 28 cycles for read 1 and 90 cycles for read 2, plus index reads.

Sequencing instrument and workflow

Sequencing results were obtained with commercialized formulations of avidites, enzymes and buffers. Element Biosciences’ AVITI commercial system (catalog no. 88-00001) was used for all sequencing data. AVITI 2 × 150 kits (catalog no. 86-00001) were loaded on the instrument. Primary analysis was performed onboard the AVITI sequencing instrument, and FASTQ files were subsequently analyzed using a secondary analysis pipeline from Sentieon.

Sequencing primary analysis

Four images were generated per field of view during each sequencing cycle, corresponding to the dyes used to label each avidite. An analysis pipeline was developed that uses the images as input to identify the polonies present on the flow cell and to assign to each polony a base call and a quality score for each cycle, the latter representing the accuracy of the underlying call. The analysis approach has steps similar to those described in ref. 25. Briefly, intensity is extracted for each polony in each color channel; intensities are then corrected for color cross-talk and phasing and normalized to allow cross-channel comparisons. The highest normalized intensity value for each polony in each cycle determines the base call. In addition to a base call, a quality score corresponding to call confidence is assigned. The standard Q-score definition is used, where Q = −10 × log10(p) and p is the probability that the base call is an error. Q-score generation follows the approach of Ewing et al., with modified predictors21, and is encoded using the phred+33 ASCII scheme. The predictors used for quality score training are (1) maximum intensity per polony across color channels; (2) clarity of each polony (defined as (A + 1)/(B + 1), where A is the highest intensity across color channels and B is the second highest); (3) the sum of phasing and prephasing estimates; and (4) the median clarity value taken across the 10% of the lowest-intensity polonies. The sequence of base call assignments and quality scores across the cycles constitutes the output of the run. These data are represented in standard FASTQ format for compatibility with downstream tools.
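The base-calling rule, the clarity predictor and the Q-score encoding described above can be sketched as follows. This is an illustration only: the function names and the channel-to-base ordering are hypothetical, and the trained model that maps predictors to error probabilities is not reproduced here.

```python
import math

def call_base(normalized, bases="ACGT"):
    """Return the base whose channel has the highest normalized intensity.
    Assumption: the channel order in `bases` is hypothetical."""
    idx = max(range(len(normalized)), key=lambda i: normalized[i])
    return bases[idx]

def clarity(intensities):
    """Clarity = (A + 1) / (B + 1), where A is the highest intensity
    across color channels and B is the second highest."""
    ordered = sorted(intensities, reverse=True)
    return (ordered[0] + 1.0) / (ordered[1] + 1.0)

def phred_q(p_error):
    """Q = -10 * log10(p), with p the probability of an erroneous call."""
    return -10.0 * math.log10(p_error)

def phred33_char(q):
    """Encode a quality score as a phred+33 ASCII character."""
    return chr(int(round(q)) + 33)

# Hypothetical intensities for one polony in one cycle.
channels = [100.0, 9.0, 4.0, 0.0]
base = call_base(channels)
polony_clarity = clarity(channels)
quality_char = phred33_char(phred_q(0.001))
```

Here an error probability of 0.001 corresponds to Q30, which phred+33 encodes as the character with ASCII code 63.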

Quality score assessment

To assess the accuracy of quality scores (Fig. 3), the FASTQ files were aligned with BWA to generate BAM files. GATK BaseRecalibrator was then applied to the BAM files, specifying files of publicly available known sites to exclude human variant positions.

K-mer error analysis

The same run used to generate recalibrated quality scores was analyzed via custom script for all k-mers of size 1, 2 and 3. The computation is based on 1% of a 35X genome to ensure adequate sampling of each k-mer. For example, each 3-mer is sampled at least 850,000 times (average 6.7 million). This figure is based on a publicly available run from each platform. For the instances of each k-mer, the percentage mismatching a variant-masked reference was computed. The same script was applied to a publicly available NovaSeq dataset for HG002 and a publicly available NextSeq 2000 dataset for HG001 (Demo Data for HG002 were not available). We tabulated the number of k-mers in which the percentage incorrect was lowest for AVITI among the three platforms compared.
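The per-k-mer tabulation can be sketched as below. The custom script itself is not described in detail, so this sketch makes several assumptions: alignments are ungapped, a k-mer instance is defined by the reference sequence, an instance counts as incorrect if any read base within it differs, and windows touching a masked (variant) position or an N are skipped. All names are hypothetical.

```python
from collections import defaultdict

def kmer_mismatch_percent(alignments, k, masked=frozenset()):
    """alignments: iterable of (ref_start, ref_seq, read_seq) tuples for
    ungapped alignments, with ref_seq and read_seq of equal length.
    masked: set of reference positions to exclude (variant mask).
    Returns {reference k-mer: percent of instances with a mismatch}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for ref_start, ref_seq, read_seq in alignments:
        for i in range(len(ref_seq) - k + 1):
            window = range(ref_start + i, ref_start + i + k)
            if any(pos in masked for pos in window):
                continue  # skip windows overlapping masked variant sites
            ref_kmer = ref_seq[i:i + k]
            read_kmer = read_seq[i:i + k]
            if "N" in ref_kmer or "N" in read_kmer:
                continue
            totals[ref_kmer] += 1
            if read_kmer != ref_kmer:
                errors[ref_kmer] += 1
    return {kmer: 100.0 * errors[kmer] / totals[kmer] for kmer in totals}

# Hypothetical toy alignments (not data from this study).
toy = [(0, "ACGTAC", "ACGTAC"), (0, "ACGTAC", "ACCTAC")]
rates_k2 = kmer_mismatch_percent(toy, k=2)
rates_k2_masked = kmer_mismatch_percent(toy, k=2, masked={2})
```

Running the same function on each platform's data and comparing the resulting dictionaries k-mer by k-mer gives the kind of tabulation described above.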

Homopolymer analysis

A BED file provided by the National Institute of Standards and Technology (NIST) genome-stratifications v.3.0, containing 673,650 homopolymers of length >11, was used to define regions of interest for homopolymer analysis (GRCh38_SimpleRepeat_homopolymer_gt11_slop5). Reads overlapping these BED intervals (using samtools view -L and adjusting for slop5) were selected for accuracy analysis. Reads flagged as secondary, supplementary or unmapped, and reads with a mapping quality of 0, were discarded. Reads were oriented in the 5′→3′ direction and split into three segments: preceding the homopolymer, overlapping it and following it. The mismatch rate for each read segment was computed, excluding N-calls, softclipped bases and indels. For example, if a 150-bp read (aligned on the forward strand) contained a homopolymer in positions 100–120, the first 99 cycles were used to compute the error rate before the homopolymer and the last 30 to compute the error rate following the homopolymer. Reads were discarded if the sequence either preceding or following the homopolymer was <5 bp in length. All reads were then stacked into a matrix according to their positional offset relative to the homopolymer, and the error rate at each offset was computed.
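The read-splitting step and the worked 150-bp example can be sketched as follows. The sketch assumes an ungapped, forward-oriented read whose softclipped bases have already been removed, with homopolymer coordinates given as 1-based inclusive positions within the read. The function names are hypothetical.

```python
def split_read(read_len, hp_start, hp_end, min_flank=5):
    """Split read positions (1-based, 5'->3') into the segments before,
    within and after a homopolymer at hp_start..hp_end (inclusive).
    Returns None if either flank is shorter than min_flank."""
    before = range(1, hp_start)
    within = range(hp_start, hp_end + 1)
    after = range(hp_end + 1, read_len + 1)
    if len(before) < min_flank or len(after) < min_flank:
        return None
    return before, within, after

def mismatch_rate(read_seq, ref_seq, positions):
    """Fraction of compared positions where read and reference differ,
    skipping N-calls. Assumes an ungapped alignment of equal length."""
    compared = 0
    mismatches = 0
    for pos in positions:
        read_base = read_seq[pos - 1]
        ref_base = ref_seq[pos - 1]
        if read_base == "N":
            continue
        compared += 1
        if read_base != ref_base:
            mismatches += 1
    return mismatches / compared if compared else None

# Worked example from the text: 150-bp read, homopolymer at 100-120.
segments = split_read(150, 100, 120)
before_len, within_len, after_len = (len(s) for s in segments)
```

For this example the segment lengths come out as 99, 21 and 30 positions, matching the cycle counts given above.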

Average error rate was computed for avidity sequencing runs and, for comparison, for publicly available data from multiple SBS instruments. Differences in mismatch percentage between AVITI and NovaSeq, across all BED intervals, were plotted in a histogram, and examples representing various percentiles of the distribution were chosen for display in the Integrative Genomics Viewer.

Publicly available datasets for NovaSeq were obtained from the Google Brain Public Data repository on Google Cloud (https://console.cloud.google.com/storage/browser/brain-genomics-public/research/sequencing/fastq). Publicly available NextSeq 2000 data were obtained from Illumina Demo Data on BaseSpace (https://basespace.illumina.com/datacentral).

Single-cell gene expression data analysis

Following sequencing, Bases2Fastq software was used to generate FASTQ files compatible with upload to 10X Cloud and subsequent analysis with the 10X Genomics Cell Ranger analysis package. Visualizations of single-cell gene expression profiles were generated using 10X Genomics Loupe Browser.

Whole-genome sequencing analysis

A FASTQ file with base calls and quality scores was downsampled to 35× raw coverage (360,320,126 input reads) and used as input to Sentieon BWA followed by Sentieon DNAscope40. Following alignment and variant calling, variant calls were compared with the NIST Genome in a Bottle truth set v.4.2.1 via the hap.py comparison framework to derive total error counts and F1 scores41. The results were computed from the 3,848,590 SNV and 982,234 indel passing variant calls made by DNAscope.
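For reference, the relationship between true-positive, false-positive and false-negative counts and the reported summary metrics can be sketched as below. This is the standard definition of precision, recall and F1, not a reproduction of hap.py internals, and the counts used are hypothetical.

```python
def variant_metrics(tp, fp, fn):
    """Return (precision, recall, f1, total_errors) from variant counts.
    total_errors is taken here as false positives plus false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1, fp + fn

# Hypothetical counts (not results from this study).
precision, recall, f1, total_errors = variant_metrics(tp=90, fp=10, fn=10)
```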

1 × 300 Data generation

An E. coli library was prepared using enzymatic shearing and PCR amplification. The library was then sequenced for 300 cycles using new enzymes for stepping along the DNA template and for avidite binding. A reagent formulation with increased enzyme and nucleotide concentrations during the stepping process was used to improve stepping performance. The contact times for avidite binding and exposure were both reduced without performance losses, to decrease cycle time over the 300 cycles of sequencing. The displays show only 299 cycles of data, because cycle 300 was used only for prephasing correction. To minimize soft clipping during alignment, the following inputs were used in the call to BWA-MEM: -E 6,6 -L 1000000 -S.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
