How to Remove Contamination Reads Rad Seq

March 02, 2022

ipyrad: Interactive assembly and analysis of RADseq datasets

Deren A R Eaton, Department of Ecology, Evolution, and Environmental Biology, Columbia University, New York, NY 10027, USA

Abstract

Summary

ipyrad is a free and open source tool for assembling and analyzing restriction site-associated DNA sequence datasets using de novo and/or reference-based approaches. It is designed to be massively scalable to hundreds of taxa and thousands of samples, and can be efficiently parallelized on high performance computing clusters. It is available both as a command line interface and as a Python package with an application programming interface, the latter of which can be used interactively to write complex, reproducible scripts and implement a suite of downstream analysis tools.

Availability and implementation

ipyrad is a free and open source program written in Python. Source code is available from the GitHub repository (https://github.com/dereneaton/ipyrad/), and Linux and MacOS installs are distributed through the conda package manager. Complete documentation, including numerous tutorials, and Jupyter notebooks demonstrating example assemblies and applications of downstream analysis tools are available online: https://ipyrad.readthedocs.io/.

1 Introduction

Over the last decade molecular systematics has increasingly transitioned from investigating phylogenetic and phylogeographic patterns using datasets composed of one or only a handful of markers to massive datasets containing thousands or tens of thousands of loci. Among several methods that have been developed for subsampling loci from the genome (McKain et al., 2018), restriction site-associated DNA sequencing (RADseq) and related methods have become a popular choice for their flexibility and affordability (Andrews et al., 2016; Baird et al., 2008; Davey et al., 2011; Elshire et al., 2011; Miller et al., 2007; Peterson et al., 2012). RADseq (and RADseq-like) protocols utilize restriction enzymes to digest (fragment) a genome such that regions proximal to restriction enzyme recognition sequences can be consistently selected for short-read sequencing. In contrast to whole genome sequencing or re-sequencing (Stratton, 2008), RADseq provides a more efficient way to gather high depth comparative sequence data shared across large numbers of samples, especially when genome sizes are large (Clugston et al., 2019). For this reason, RADseq methods have been employed for diverse questions ranging from population genetics (García‐Olivares et al., 2019) and phylogenetics (Eaton and Ree, 2013; Hipp et al., 2014; Wagner et al., 2013), to constructing linkage maps (Amores et al., 2011; Rubin and Moreau, 2016), QTL-mapping (Palaiokostas et al., 2013) and investigating DNA methylation (Schield et al., 2016; Trucchi et al., 2016). Even as future technological improvements reduce the per read cost of sequencing, reduced-representation methods will continue to offer advantages to studies that benefit from sequencing many populations or individuals (e.g. phylogeography), or that do not require sampling the entire genome (e.g. linkage mapping). Similarly, RADseq methods are likely to continue to improve in ways that promote these benefits, as with recent advances that reduce the cost of indexing and allow for PCR duplicate removal (Glenn et al., 2019), and methods for enriching libraries to reduce missing data and increase multiplexing efficiency (Hoffberg et al., 2016).
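To make the idea of sampling regions next to recognition sequences concrete, the following is a minimal in-silico sketch, not part of ipyrad. It assumes a single-stranded toy genome, uses GAATTC (the EcoRI recognition sequence) purely as an example site, and keeps only a fixed-length window starting at each site; the function name `rad_tags` and the toy sequences are hypothetical.

```python
def rad_tags(genome: str, site: str = "GAATTC", read_len: int = 10) -> list[str]:
    """Return the fixed-length windows that begin at each recognition site.

    Simplification: only the forward strand and only the downstream side of
    each site are considered, and windows shorter than read_len are dropped.
    """
    tags = []
    start = genome.find(site)
    while start != -1:
        tag = genome[start:start + read_len]
        if len(tag) == read_len:
            tags.append(tag)
        start = genome.find(site, start + 1)
    return tags


# Two toy "genomes" that differ by one substitution near the first site.
genome_a = "TTGACCGAATTCAGGCTTAACCGTGAATTCTTGGCAAC"
genome_b = "TTGACCGAATTCAGGATTAACCGTGAATTCTTGGCAAC"

tags_a = rad_tags(genome_a)
tags_b = rad_tags(genome_b)
print(tags_a)
print(tags_b)
```

Because both toy genomes are cut at the same sites, the same two regions are recovered from each, which is what makes the resulting reads comparable across samples.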

The process of organizing and making sense of the vast quantities of data that come off a modern sequencing instrument is non-trivial, and of great consequence. Simple parameter misspecification during the assembly process can have considerable impact on downstream analysis, potentially influencing the interpretation of the genetic patterns in the data (Linck and Battey, 2019; Shafer et al., 2017). Prior to the availability of unified assembly tools (Catchen et al., 2013; Eaton, 2014; Rochette et al., 2019), these datasets were typically assembled in an ad hoc way using scripts developed in-house, leading to wide variability in the quality of assemblies being performed by the community. Additionally, downstream analyses typically involve writing complicated scripts to manage running multiple iterations of statistical inference software, organizing and post-processing the output and generating publication-ready plots. This proliferation of methods and lack of community standards has two significant consequences: (i) unnecessary complexity in assembly and analysis workflows, which increases the potential for errors, and (ii) a lack of reproducibility or oversight when ad hoc scripts are rarely reused or evaluated. What is needed is a user friendly, computationally robust and scalable method for both assembling and analyzing large-scale genomic datasets.

ipyrad was developed to address this need, and provides a simple, reproducible and well-documented RADseq assembly and analysis framework that is computationally efficient, massively scalable across large computing clusters, flexible enough to accommodate all variants of RADseq data types and suitable for population genetic scale as well as phylogenetic scale datasets. The ipyrad application programming interface (API) enables and encourages the creation of reproducible scientific workflows by providing a uniform, well-documented interface to several popular downstream phylogenetic and population genetics programs. ipyrad is a ground-up reimplementation of the RADseq assembly workflow implemented in pyRAD (Eaton, 2014), and includes numerous new capabilities which greatly extend the power, speed and utility of the original program.

2 ipyrad assembly process

The ipyrad assembly workflow is fully self-contained, capable of taking raw Illumina data from a sequencing facility and producing assembled output files without the need for pre- or post-processing by other software. The general workflow consists of seven steps: (i) demultiplexing raw reads to samples (based on single or combinatorial inline barcodes or indexed adapters) or alternatively importing data which has already been demultiplexed; (ii) quality control, filtering and trimming for adapter contamination; (iii) identifying read copies from the same locus within samples using de novo clustering or reference mapping. For paired-end data, the de novo method first merges read pairs with VSEARCH (Rognes et al., 2016) before clustering, and indels are then imputed during a gapped alignment process which is performed by MUSCLE (Edgar, 2004). For reference assemblies, paired-end reads are mapped to the reference to produce gapped alignments, and mate pairs that map with incorrect orientation or to multiple locations (i.e. as paralogs) are discarded; (iv) joint estimation of sequencing error rate and sample heterozygosity; (v) making consensus basecalls and haplotype calls within samples; (vi) identifying orthology across samples by de novo clustering or reference mapped positions; and (vii) applying a final round of filtering and trimming to assembled loci, generating informative assembly statistics and writing output files in numerous useful formats for downstream analysis.
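As an illustration of step (i) only, here is a minimal sketch of demultiplexing by inline barcode. It is not ipyrad's implementation: it assumes reads are plain strings, matches barcodes exactly at the start of the read, and uses made-up sample names and barcodes; the function `demultiplex` is hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Assign each read to the sample whose barcode it starts with.

    The barcode is trimmed from the read. Reads matching no barcode are
    collected under the key "unmatched". Matching is exact (an assumption
    of this sketch; real data usually need some tolerance for errors).
    """
    by_sample = defaultdict(list)
    for read in reads:
        for sample, barcode in barcodes.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched this read.
            by_sample["unmatched"].append(read)
    return dict(by_sample)


# Hypothetical barcodes and raw reads (barcode + restriction site + insert).
barcodes = {"sample_A": "ACGT", "sample_B": "TTAG"}
reads = [
    "ACGTGAATTCAGGC",
    "TTAGGAATTCTTGG",
    "ACGTGAATTCAGGA",
    "CCCCGAATTCAGGC",
]

demuxed = demultiplex(reads, barcodes)
for sample, sample_reads in demuxed.items():
    print(sample, sample_reads)
```

The per-sample read lists produced here correspond to the input that the later within-sample steps (filtering, clustering or mapping, and consensus calling) would operate on.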

While ipyrad retains the general workflow of the original pyRAD method, the codebase has been completely refactored and rewritten with emphasis on performance and scalability. Even on a comparatively modest dataset the performance gains are substantial. For example, using the original pyRAD (v.1.0) on a computer with 12 cores and 48 GB of RAM, the 13 Pedicularis samples from Eaton and Ree (2013) assembled in ∼20 h. Using the same hardware, ipyrad assembles the same data in <30 min. The Pedicularis dataset has few samples, and is high-quality, single-end RAD data (Baird et al., 2008). Paired-end data, very large datasets, low quality data and reference assembly methods obtain even greater performance improvements in the new implementation.

3 New capabilities implemented in ipyrad

3.1 Massive parallelization

Multi-processing and multi-node computing (MPI; Gropp et al., 1996) allows for efficient distribution of work across massive-scale computing clusters. ipyrad utilizes the ipyparallel Python library to distribute jobs across cores of a single computer, and can leverage MPI to distribute jobs across compute nodes on high performance computing clusters, even while working interactively. By default, ipyrad uses a load-balanced scheduler to distribute jobs among cores (including across different host nodes), and efficiently distributes threaded functions (e.g. VSEARCH clustering; Rognes et al., 2016) across physical cores within compute nodes. Built on top of ipyparallel and MPI, ipyrad parallelization can easily and efficiently scale to hundreds of cores. Although the codebase of ipyrad is written in Python, it retains high performance through the use of just-in-time compilation (Lam et al., 2015), and incorporation of industry standard compiled software into the assembly pipeline (Edgar, 2004; Li et al., 2009; Li and Durbin, 2009; Martin, 2011; Quinlan, 2014; Rognes et al., 2016).
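The general pattern of handing independent per-sample jobs to whichever worker is free can be sketched with the Python standard library. This is a stand-in, not ipyrad's code: it swaps ipyparallel and MPI for `concurrent.futures.ThreadPoolExecutor`, which runs on a single machine and does not give true CPU parallelism for pure-Python work. The job function `gc_content` and the sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def gc_content(seq: str) -> float:
    """Toy per-sample job: fraction of G and C bases in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def run_per_sample(samples: dict, func, workers: int = 4) -> dict:
    """Submit one job per sample and collect results as they finish.

    The pool keeps a queue of pending jobs and gives the next one to any
    idle worker, so fast and slow jobs are balanced across workers.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(func, seq): name for name, seq in samples.items()}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


# Hypothetical samples, each reduced to a single short sequence.
samples = {"s1": "GGCC", "s2": "ATAT", "s3": "GATC"}
results = run_per_sample(samples, gc_content)
print(sorted(results.items()))
```

Results are keyed by sample name rather than by completion order, so the output does not depend on which job happens to finish first.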

3.2 Application programming interface

ipyrad provides a command line interface that is easy to use and which inherits interaction logic from its predecessor (Eaton, 2014). Additionally, ipyrad provides an API mode, which can be accessed programmatically to run interactive assemblies in Jupyter notebooks (Kluyver et al., 2016). The API mode allows researchers to document, share and publish their assembly workflows, promoting reproducibility in science. The API mode also serves as a starting point for analyses using the downstream tools available through the ipyrad-analysis toolkit.

3.3 De novo and reference-based assemblies

To assess orthology of sequenced reads ipyrad implements two core assembly methods: de novo, in which a sequence similarity threshold is applied during a seed and extend clustering algorithm; and reference, in which reads are mapped to a reference genome. In addition, ipyrad can implement aspects of these methods in conjunction. For example, if a reference genome is quite distant from sample taxa then the de novo+reference method can recover more data by applying the reference workflow to mapped reads, and the de novo workflow to non-mapping reads, with the final dataset compiled from the two datasets together. Two alternative methods make use of a reference genome in a contrasting way, as a filter. In de novo−reference and reference−reference any reads that map to a 'filter-reference' file are removed from the dataset, prior to de novo or reference assembly of the remaining reads, which provides a useful means for removing sequences from contaminants or symbionts.
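The logic of using a reference as a filter can be sketched as a partition of reads into those that match the reference and those that do not. The sketch below is an assumption-laden toy, not ipyrad's method: shared k-mer fraction stands in for real read mapping, and the function names, `k`, `min_shared` and all sequences are hypothetical.

```python
def kmers(seq: str, k: int) -> set[str]:
    """All overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def split_by_reference(reads, reference, k=8, min_shared=0.5):
    """Partition reads into (mapped, unmapped) against a reference.

    A read counts as "mapped" when at least min_shared of its k-mers occur
    in the reference. This is a crude stand-in for an actual read mapper.
    """
    ref_kmers = kmers(reference, k)
    mapped, unmapped = [], []
    for read in reads:
        read_kmers = kmers(read, k)
        shared = len(read_kmers & ref_kmers) / len(read_kmers) if read_kmers else 0.0
        (mapped if shared >= min_shared else unmapped).append(read)
    return mapped, unmapped


# Hypothetical 'filter-reference' (e.g. a contaminant) and two reads:
# one copied from the contaminant, one from the organism of interest.
contaminant = "ATGCCGTCTAGACTGGTACCATCGCATGCAAGTC"
contaminant_read = contaminant[4:24]
target_read = "GAATTCAGGCTTAACCGTGA"

removed, kept = split_by_reference([contaminant_read, target_read], contaminant)
print("removed:", removed)
print("kept for assembly:", kept)
```

Used as a filter, only the unmapped reads continue to assembly. The same partition read the other way describes the de novo+reference idea: mapped reads would go to the reference workflow and unmapped reads to the de novo workflow.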

3.4 Branching architecture

Methods for assembling RADseq data are sensitive to the parameter settings used during filtering, mapping/clustering and base calling. For instance, Linck and Battey (2019) showed that different minor allele frequency thresholds can produce significantly different inferences of population structure. It is therefore critical to generate multiple datasets under a range of parameter settings for comparison (Crotti et al., 2019; Paris et al., 2017). ipyrad implements an iterative branching design that reduces redundancy and facilitates the generation of multiple datasets exploring a range of parameter settings, without the need to re-run the entire assembly. By saving intermediate files, different named assemblies can be restarted from intermediate steps of the assembly workflow. This vastly reduces redundancy in computation; enforces a reproducible workflow in which new branches do not overwrite earlier results; and provides a convenient step at which to remove individuals from assemblies (failed samples, outgroups, etc.), or to merge samples from different libraries into a shared assembly.
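The branching idea can be illustrated with a small toy class. This is a sketch of the concept only and is not ipyrad's Assembly object: `ToyAssembly`, its attributes and the parameter names and values are illustrative assumptions, and running a step here merely records it rather than doing any work.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyAssembly:
    """Toy model of a named assembly with parameters and finished steps."""
    name: str
    params: dict = field(default_factory=dict)
    completed_steps: list = field(default_factory=list)

    def run(self, steps):
        """Record each requested step once; already finished steps are skipped."""
        for step in steps:
            if step not in self.completed_steps:
                self.completed_steps.append(step)

    def branch(self, new_name, **overrides):
        """Copy parameters and progress under a new name, leaving self untouched."""
        child = ToyAssembly(
            new_name,
            copy.deepcopy(self.params),
            list(self.completed_steps),
        )
        child.params.update(overrides)
        return child


# Run the shared early steps once, then branch with a changed parameter.
base = ToyAssembly("base", {"clust_threshold": 0.85, "min_samples_locus": 4})
base.run([1, 2])

strict = base.branch("strict", clust_threshold=0.90)
strict.run([1, 2, 3])  # steps 1 and 2 are inherited, only step 3 is new

print(base.name, base.params, base.completed_steps)
print(strict.name, strict.params, strict.completed_steps)
```

The design point is that the branch starts from the parent's saved progress and writes under its own name, so exploring a new parameter value neither repeats the shared steps nor overwrites the earlier result.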

3.5 Analysis tools

ipyrad includes an 'analysis' module which provides a powerful, simple and reproducible interface to several widely used methods for inferring phylogenetic relationships (RAxML; Stamatakis, 2014), population structure (STRUCTURE; Pritchard et al., 2000) and admixture (TreeMix; Pickrell and Pritchard, 2012), among many others. In typical usage the analysis API will use an internal data structure generated by the ipyrad assembly process, but it is also flexible enough to import genotypes (e.g. VCF files) generated by other RADseq assembly programs. Various population genetic and phylogenetic methods can be differentially impacted by missing data, therefore the analysis API provides simple options for filtering, imputing, consensus sampling and/or running replicate analyses to effectively quantify uncertainty around missing data. The analysis API leverages the massive parallelization provided by the ipyrad backend, manages organization of intermediate files and provides a simple interface for generating publication-ready plots of results (Eaton, 2019), contributing benefits of both usability and reproducibility.
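One of the simplest missing-data options mentioned above, filtering sites by sample coverage, can be sketched as follows. This is not the analysis API: it assumes genotypes are stored as a list of sites, each a list of per-sample alternate-allele counts (0, 1 or 2) with `None` for missing calls, and the function `filter_sites` and the threshold values are hypothetical.

```python
def filter_sites(genotypes, min_coverage):
    """Keep sites genotyped in at least min_coverage (a fraction) of samples."""
    kept = []
    for site in genotypes:
        called = sum(g is not None for g in site)
        if called / len(site) >= min_coverage:
            kept.append(site)
    return kept


# Four sites scored in four samples; None marks a missing genotype.
genotypes = [
    [0, 1, 2, 0],
    [0, None, None, 1],
    [None, None, None, 2],
    [1, 1, None, 0],
]

strict_sites = filter_sites(genotypes, 0.75)
relaxed_sites = filter_sites(genotypes, 0.5)
print(len(strict_sites), "sites at 75% coverage")
print(len(relaxed_sites), "sites at 50% coverage")
```

Running the same downstream analysis on both filtered sets is one simple way to check how sensitive a result is to the amount of missing data retained.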

4 Conclusion

ipyrad is a user friendly, robust, efficient, scalable and flexible program for assembling and analyzing RADseq datasets. The parallelization backend allows ipyrad to scale up to the limit of computational resources it is provided, facilitating the assembly of very large-scale datasets encompassing hundreds of taxa and thousands of samples. The API mode facilitates the creation of documented and shareable assembly workflows, promoting reproducibility. The analysis module provides a unified and coherent interface to many common downstream phylogenetic and population genetic inference methods, reducing the friction and overhead generated by file format conversion, configuration file creation and execution which are typically associated with implementation of these methods. This combination of API mode, parallelized backend and analysis tools allows researchers to efficiently perform, document and publish their full RADseq assembly and analysis workflows within a single computational framework, thus greatly reducing cognitive overhead and promoting reproducibility.

Acknowledgement

We thank Laura Bertola and Ed Myers for useful comments on an early draft of the manuscript.

Funding

This work was supported by grants from the National Science Foundation [DEB-1253710; DEB 1745562; DEB-1557059], the São Paulo Research Foundation [BIOTA, 2013/50297-0] and the National Aeronautics and Space Administration through the Dimensions of Biodiversity Program [DOB 1343578]. I.O. was supported by the Mina Rees Dissertation Fellowship in the Sciences provided by the Graduate Center of the City University of New York.

Conflict of Interest: none declared.

References

Amores et al. (2011) Genome evolution and meiotic maps by massively parallel DNA sequencing: spotted gar, an outgroup for the teleost genome duplication. Genetics, 188, 799–808.

Andrews K.R. et al. (2016) Harnessing the power of RADseq for ecological and evolutionary genomics. Nat. Rev. Genet.

Baird N.A. et al. (2008) Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One, e3376.

Catchen et al. (2013) Stacks: an analysis tool set for population genomics. Mol. Ecol., 3124–3140.

Clugston J.A.R. et al. (2019) RADseq as a valuable tool for plants with large genomes – a case study in cycads. Mol. Ecol. Resour., 19, 1610–1622.

Crotti M. et al. (2019) Causes and analytical impacts of missing data in RADseq phylogenetics: insights from an African frog (Afrixalus). Zool. Scr., 157–167.

Davey J.W. et al. (2011) Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat. Rev. Genet., 499–510.

Eaton D.A.R. (2014) PyRAD: assembly of de novo RADseq loci for phylogenetic analyses. Bioinformatics, 1844–1849.

Eaton D.A.R. (2019) Toytree; a minimalist tree visualization and manipulation library for Python. Methods Ecol. Evol.

Eaton D.A.R., Ree R.H. (2013) Inferring phylogeny and introgression using RADseq data: an example from flowering plants (Pedicularis: Orobanchaceae). Syst. Biol., 689–706.

Edgar R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 1792–1797.

Elshire R.J. et al. (2011) A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One, e19379.

García‐Olivares et al. (2019) A topoclimate model for Quaternary insular speciation. J. Biogeogr., 2769–2786.

Glenn T.C. et al. (2019) Adapterama I: universal stubs and primers for 384 unique dual-indexed or 147,456 combinatorially-indexed Illumina libraries (iTru & iNext). PeerJ, e7755.

Gropp W. et al. (1996) A high-performance, portable implementation of the MPI message passing interface standard. Parallel Comput., 789–828.

Hipp A.L. et al. (2014) A framework phylogeny of the American oak clade based on sequenced RAD data. PLoS One, e93975.

Hoffberg S.L. et al. (2016) RADcap: sequence capture of dual-digest RADseq libraries with identifiable duplicates and reduced missing data. Mol. Ecol. Resour., 16, 1264–1278.

Kluyver et al. (2016) Jupyter Notebooks – a publishing format for reproducible computational workflows. ELPUB, pp. –90.

Lam S.K. et al. (2015) Numba: a LLVM-based Python JIT compiler. In: Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC. ACM, New York, NY, USA.

Li et al.; 1000 Genome Project Data Processing Subgroup. (2009) The sequence alignment/map format and SAMtools. Bioinformatics, 2078–2079.

Li, Durbin (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 1754–1760.

Linck E., Battey C.J. (2019) Minor allele frequency thresholds strongly affect population structure inference with genomic data sets. Mol. Ecol. Resour., 639–647.

Martin (2011) Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J., 10–.

McKain M.R. et al. (2018) Practical considerations for plant phylogenomics. Appl. Plant Sci., 6, e1038.

Miller M.R. et al. (2007) Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers. Genome Res., 240–248.

Palaiokostas et al. (2013) Mapping the sex determination locus in the Atlantic halibut (Hippoglossus hippoglossus) using RAD sequencing. BMC Genomics, 14, 566.

Paris et al. (2017) Lost in parameter space: a road map for stacks. Methods Ecol. Evol., 1360–1373.

Peterson B.K. et al. (2012) Double digest RADseq: an inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS One, 7, e37135.

Pickrell J.K., Pritchard J.K. (2012) Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet., e1002967.

Pritchard J.K. et al. (2000) Inference of population structure using multilocus genotype data. Genetics, 155, 945–959.

Quinlan A.R. (2014) BEDTools: the Swiss-army tool for genome feature analysis. Curr. Protoc. Bioinformatics.

Rochette N.C. et al. (2019) Stacks 2: analytical methods for paired-end sequencing improve RADseq-based population genomics. Mol. Ecol. Resour., 4737–4754.

Rognes et al. (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ, e2584.

Rubin B.E.R., Moreau C.S. (2016) Comparative genomics reveals convergent rates of evolution in ant–plant mutualisms. Nat. Commun., 12679.

Schield D.R. et al. (2016) EpiRADseq: scalable analysis of genomewide patterns of methylation using next-generation sequencing. Methods Ecol. Evol.

Shafer A.B.A. et al. (2017) Bioinformatic processing of RAD-seq data dramatically impacts downstream population genetic inference. Methods Ecol. Evol., 8, 907–917.

Stamatakis (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 1312–1313.

Stratton (2008) Genome resequencing and genetic variation. Nat. Biotechnol.

Trucchi et al. (2016) BsRADseq: screening DNA methylation in natural populations of non-model species. Mol. Ecol., 1697–1713.

Wagner C.E. et al. (2013) Genome-wide RAD sequence data provide unprecedented resolution of species boundaries and relationships in the Lake Victoria cichlid adaptive radiation. Mol. Ecol., 787–798.

Associate Editor: Russell Schwartz


tayloraters2000.blogspot.com

Source: https://academic.oup.com/bioinformatics/article/36/8/2592/5697088
