How to Remove Contamination Reads Rad Seq
ipyrad: Interactive assembly and analysis of RADseq datasets
Deren A R Eaton, Department of Ecology, Evolution, and Environmental Biology, Columbia University, New York, NY 10027, USA
Isaac Overcast, Department of Biology, Graduate School and University Center of the City University of New York, New York, NY 10016, USA
Received: 17 September 2019
Revision received: 09 December 2019
Accepted: 31 December 2019
Published: 06 January 2020
Abstract
Summary
ipyrad is a free and open source tool for assembling and analyzing restriction site-associated DNA sequence datasets using de novo and/or reference-based approaches. It is designed to be massively scalable to hundreds of taxa and thousands of samples, and can be efficiently parallelized on high performance computing clusters. It is available both as a command line interface and as a Python package with an application programming interface, the latter of which can be used interactively to write complex, reproducible scripts and implement a suite of downstream analysis tools.
Availability and implementation
ipyrad is a free and open source program written in Python. Source code is available from the GitHub repository (https://github.com/dereneaton/ipyrad/), and Linux and MacOS installs are distributed through the conda package manager. Complete documentation, including numerous tutorials, and Jupyter notebooks demonstrating example assemblies and applications of downstream analysis tools are available online: https://ipyrad.readthedocs.io/.
1 Introduction
Over the last decade molecular systematics has increasingly transitioned from investigating phylogenetic and phylogeographic patterns using datasets composed of one or only a handful of markers to massive datasets containing thousands or tens of thousands of loci. Among several methods that have been developed for subsampling loci from the genome (McKain et al., 2018), restriction site-associated DNA sequencing (RADseq) and related methods have become a popular choice for their flexibility and affordability (Andrews et al., 2016; Baird et al., 2008; Davey et al., 2011; Elshire et al., 2011; Miller et al., 2007; Peterson et al., 2012). RADseq (and RADseq-like) protocols utilize restriction enzymes to digest (fragment) a genome such that regions proximal to restriction enzyme recognition sequences can be consistently selected for short-read sequencing. In contrast to whole genome sequencing or re-sequencing (Stratton, 2008), RADseq provides a more efficient way to gather high depth comparative sequence data shared across large numbers of samples, especially when genome sizes are large (Clugston et al., 2019). For this reason, RADseq methods have been employed for diverse questions ranging from population genetics (García-Olivares et al., 2019) and phylogenetics (Eaton and Ree, 2013; Hipp et al., 2014; Wagner et al., 2013), to constructing linkage maps (Amores et al., 2011; Rubin and Moreau, 2016), QTL-mapping (Palaiokostas et al., 2013) and investigating DNA methylation (Schield et al., 2016; Trucchi et al., 2016). Even as future technological improvements reduce the per read cost of sequencing, reduced-representation methods will continue to offer advantages to studies that benefit from sequencing many populations or individuals (e.g. phylogeography), or that do not require sampling the entire genome (e.g. linkage mapping).
Similarly, RADseq methods are likely to continue to improve in ways that promote these benefits, as with recent advances that reduce the cost of indexing and allow for PCR duplicate removal (Glenn et al., 2019), and methods for enriching libraries to reduce missing data and increase multiplexing efficiency (Hoffberg et al., 2016).
The process of organizing and making sense of the vast quantities of data that come off a modern sequencing instrument is non-trivial, and of great consequence. Simple parameter misspecification during the assembly process can have considerable impact on downstream analysis, potentially influencing the interpretation of the genetic patterns in the data (Linck and Battey, 2019; Shafer et al., 2017). Prior to the availability of unified assembly tools (Catchen et al., 2013; Eaton, 2014; Rochette et al., 2019), these datasets were typically assembled in an ad hoc way using scripts developed in-house, leading to wide variability in the quality of assemblies being performed by the community. Additionally, downstream analyses typically involve writing complicated scripts to manage running multiple iterations of statistical inference software, organizing and post-processing the output and generating publication-ready plots. This proliferation of methods and lack of community standards has two significant consequences: (i) unnecessary complexity in assembly and analysis workflows, which increases the potential for errors, and (ii) a lack of reproducibility or oversight when ad hoc scripts are rarely reused or evaluated. What is needed is a user friendly, computationally robust and scalable method for both assembling and analyzing large-scale genomic datasets.
ipyrad was developed to address this need, and provides a simple, reproducible and well-documented RADseq assembly and analysis framework that is computationally efficient, massively scalable across large computing clusters, flexible to accommodate all variants of RADseq data types and suitable for population genetic scale as well as phylogenetic scale datasets. The ipyrad application programming interface (API) enables and encourages the creation of reproducible scientific workflows by providing a uniform, well-documented interface to several popular downstream phylogenetic and population genetics programs. ipyrad is a ground-up reimplementation of the RADseq assembly workflow implemented in pyRAD (Eaton, 2014), and includes numerous new capabilities which greatly extend the power, speed and utility of the original program.
2 ipyrad assembly process
The ipyrad assembly workflow is fully self-contained, capable of taking raw Illumina data from a sequencing facility and producing assembled output files without the need for pre- or post-processing by other software. The general workflow consists of seven steps: (i) demultiplexing raw reads to samples (based on single or combinatorial inline barcodes or indexed adapters) or alternatively importing data which has already been demultiplexed; (ii) quality control, filtering and trimming for adapter contamination; (iii) identifying read copies from the same locus within samples using de novo clustering or reference mapping. For paired-end data, the de novo method first merges read pairs with VSEARCH (Rognes et al., 2016) before clustering, and indels are then imputed during a gapped alignment process which is performed by MUSCLE (Edgar, 2004). For reference assemblies, paired-end reads are mapped to the reference to produce gapped alignments, and mate pairs that map with wrong orientation or to multiple locations (i.e. as paralogs) are discarded; (iv) joint estimation of sequencing error rate and sample heterozygosity; (v) making consensus basecalls and haplotype calls within samples; (vi) identifying orthology across samples by de novo clustering or reference mapped positions; and (vii) applying a final round of filtering and trimming to assembled loci, generating informative assembly statistics and writing output files in numerous useful formats for downstream analysis.
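As a rough illustration of step (i), the sketch below assigns reads to samples by an inline barcode at the start of each read. It is a simplified stand-in written in plain Python, not ipyrad's own code: the function name `demultiplex`, the `max_mismatch` argument and the example barcodes are assumptions made for this example, and real data would also involve quality scores, restriction overhangs and paired reads.

```python
from collections import defaultdict


def demultiplex(reads, barcodes, max_mismatch=0):
    """Assign reads to samples by inline barcode (hypothetical helper).

    reads: iterable of read sequences (strings).
    barcodes: dict mapping sample name -> barcode sequence.
    Returns (dict of sample -> barcode-trimmed reads, list of unassigned reads).
    """
    by_sample = defaultdict(list)
    unassigned = []
    for read in reads:
        hits = []
        for sample, barcode in barcodes.items():
            prefix = read[:len(barcode)]
            if len(prefix) < len(barcode):
                continue  # read is shorter than the barcode
            mismatches = sum(a != b for a, b in zip(prefix, barcode))
            if mismatches <= max_mismatch:
                hits.append((sample, len(barcode)))
        if len(hits) == 1:
            # Unambiguous match: strip the barcode and keep the read.
            sample, barcode_length = hits[0]
            by_sample[sample].append(read[barcode_length:])
        else:
            # No match, or more than one sample matches equally well.
            unassigned.append(read)
    return dict(by_sample), unassigned


example_reads = ["AACCTGCAGGTT", "GGTTTGCAGACA", "TTTTTGCAGACA"]
example_barcodes = {"s1": "AACC", "s2": "GGTT"}
assigned, leftover = demultiplex(example_reads, example_barcodes)
```

With the example input, the first two reads are assigned to `s1` and `s2` with their barcodes removed, and the third read is left unassigned unless `max_mismatch` is raised.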
While ipyrad retains the general workflow of the original pyRAD method, the codebase has been completely refactored and rewritten with emphasis on performance and scalability. Even on a comparatively modest dataset the performance gains are substantial. For example, using the original pyRAD (v.1.0) on a computer with 12 cores and 48 GB of RAM, the 13 Pedicularis samples from Eaton and Ree (2013) assembled in ∼20 h. Using the same hardware, ipyrad assembles the same data in <30 min. The Pedicularis dataset has few samples, and is high-quality, single-end RAD data (Baird et al., 2008). Paired-end data, very large datasets, low quality data and reference assembly methods obtain even greater performance improvements in the new implementation.
3 New capabilities implemented in ipyrad
3.1 Massive parallelization
Multi-processing and multi-node computing (MPI; Gropp et al., 1996) allows for efficient distribution of work across massive-scale computing clusters. ipyrad utilizes the ipyparallel Python library to distribute jobs across cores of a single computer, and can leverage MPI to distribute jobs across compute nodes on high performance computing clusters, even while working interactively. By default, ipyrad uses a load-balanced scheduler to distribute jobs among cores (including across different host nodes), and efficiently distributes threaded functions (e.g. VSEARCH clustering; Rognes et al., 2016) across physical cores within compute nodes. Built on top of ipyparallel and MPI, ipyrad parallelization can easily and efficiently scale to hundreds of cores. Although the codebase of ipyrad is written in Python, it retains high performance through the use of just-in-time compilation (Lam et al., 2015), and incorporation of industry standard compiled software into the assembly pipeline (Edgar, 2004; Li et al., 2009; Li and Durbin, 2009; Martin, 2011; Quinlan, 2014; Rognes et al., 2016).
3.2 Application programming interface
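The general pattern of handing independent per-sample jobs to a pool of workers, with each free worker picking up the next pending job, can be sketched with Python's standard library. This is only an analogy under stated assumptions: it uses a `concurrent.futures` thread pool on one machine rather than ipyparallel or MPI, and the helper names `gc_fraction` and `run_per_sample` are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def gc_fraction(item):
    """Toy per-sample job: fraction of G/C bases across a sample's reads."""
    name, reads = item
    total = sum(len(read) for read in reads)
    gc = sum(read.count("G") + read.count("C") for read in reads)
    return name, (gc / total if total else 0.0)


def run_per_sample(samples, func, workers=4):
    """Submit one job per sample and collect results as they finish."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each sample becomes its own job; idle workers take the next one.
        futures = [pool.submit(func, item) for item in samples.items()]
        for future in as_completed(futures):
            name, value = future.result()
            results[name] = value
    return results


toy_samples = {"a": ["GGCC", "AATT"], "b": ["ACGT"]}
gc_by_sample = run_per_sample(toy_samples, gc_fraction)
```

Because jobs are collected in completion order rather than submission order, a slow sample does not block results from faster ones, which is the same idea that motivates load-balanced scheduling.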
ipyrad provides a command line interface that is easy to use and which inherits interaction logic from its predecessor (Eaton, 2014). Additionally, ipyrad provides an API mode, which can be accessed programmatically to run interactive assemblies in Jupyter notebooks (Kluyver et al., 2016). The API mode allows researchers to document, share and publish their assembly workflows, promoting reproducibility in science. The API mode also serves as a starting point for analyses using the downstream tools available through the ipyrad-analysis toolkit.
3.3 De novo and reference-based assemblies
To assess orthology of sequenced reads ipyrad implements two core assembly methods: de novo, in which a sequence similarity threshold is applied during a seed and extend clustering algorithm; and reference, in which reads are mapped to a reference genome. In addition, ipyrad can implement aspects of these methods in conjunction. For example, if a reference genome is quite distant from sample taxa then the de novo+reference method can recover more data by applying the reference workflow to mapped reads, and the de novo workflow to non-mapping reads, with the final dataset compiled of the two datasets together. Two alternative methods make use of a reference genome in a contrasting way, as a filter. In de novo – reference and reference – reference any reads that map to a 'filter-reference' file are removed from the dataset, prior to de novo or reference assembly of the remaining reads, which provides a useful means for removing sequences from contaminants or symbionts.
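The filtering idea can be illustrated with a deliberately crude sketch: reads that share enough exact k-mers with a contaminant sequence are set aside, and the rest are passed on for assembly. This is not how ipyrad performs the step (the text above describes read mapping to a filter-reference); exact k-mer matching is swapped in here only to keep the example self-contained. The function names, `k` and `min_shared` are assumptions for illustration, and reverse-complement matches are ignored.

```python
def kmer_set(sequences, k):
    """Collect every k-mer that occurs in the given sequences."""
    kmers = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers


def split_by_filter_reference(reads, filter_refs, k=8, min_shared=2):
    """Split reads into (kept, removed) by k-mer sharing with filter_refs.

    A read is removed when at least min_shared of its k-mers occur in the
    filter reference; this is a toy proxy for "the read maps to it".
    """
    ref_kmers = kmer_set(filter_refs, k)
    kept, removed = [], []
    for read in reads:
        shared = sum(
            1 for i in range(len(read) - k + 1) if read[i:i + k] in ref_kmers
        )
        if shared >= min_shared:
            removed.append(read)
        else:
            kept.append(read)
    return kept, removed


contaminant = "ACGTACGTTAGCCGATAGGCTA"  # made-up contaminant sequence
toy_reads = [contaminant[3:15], "GGGGGGGGGGGG"]
kept_reads, removed_reads = split_by_filter_reference(toy_reads, [contaminant])
```

Only `kept_reads` would go on to de novo or reference assembly; inspecting `removed_reads` is a quick way to see how much of a library the contaminant accounts for.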
3.4 Branching architecture
Methods for assembling RADseq data are sensitive to the parameter settings used during filtering, mapping/clustering and base calling. For instance, Linck and Battey (2019) showed that different minor allele frequency thresholds can produce significantly different inferences of population structure. It is therefore critical to generate multiple datasets under a range of parameter settings for comparison (Crotti et al., 2019; Paris et al., 2017). ipyrad implements an iterative branching design that reduces redundancy and facilitates the generation of multiple datasets exploring a range of parameter settings, without the need to re-run the entire assembly. By saving intermediate files, different named assemblies can be restarted from intermediate steps of the assembly workflow. This vastly reduces redundancy in computation; enforces a reproducible workflow in which new branches do not overwrite earlier results; and provides a convenient step in which to remove individuals from assemblies (failed samples, outgroups, etc.), or to merge samples from different libraries into a shared assembly.
3.5 Analysis tools
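The branching idea can be mimicked with a small toy class: a branch copies the parent's parameters and its record of completed steps under a new name, so only the remaining steps need to run and the parent is left untouched. `ToyAssembly` is a hypothetical stand-in written for this example and does not reproduce ipyrad's internals; the parameter names are used purely as illustrative dictionary keys.

```python
import copy


class ToyAssembly:
    """Toy model of a named assembly with parameters and finished steps."""

    def __init__(self, name, params=None):
        self.name = name
        self.params = dict(params or {})
        self.completed = []  # step numbers already run, in order

    def run(self, steps):
        """Run steps in order, skipping any that are already finished."""
        for step in sorted(steps):
            expected = len(self.completed) + 1
            if step < expected:
                continue  # result already saved, nothing to recompute
            if step > expected:
                raise ValueError(
                    f"step {expected} must run before step {step}"
                )
            self.completed.append(step)

    def branch(self, new_name, **overrides):
        """Copy params and progress to a new name, then apply overrides."""
        child = ToyAssembly(new_name, copy.deepcopy(self.params))
        child.params.update(overrides)
        child.completed = list(self.completed)
        return child


base = ToyAssembly("base", {"clust_threshold": 0.85, "min_samples_locus": 4})
base.run(range(1, 7))  # steps 1-6 are computed once
strict = base.branch("min10", min_samples_locus=10)
strict.run([7])  # only the final filtering step runs for the branch
```

In this toy the override only affects the final step, so branching after step 6 is safe; a parameter that influences an earlier step would require branching before that step instead.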
ipyrad includes an 'analysis' module which provides a powerful, simple and reproducible interface to several widely used methods for inferring phylogenetic relationships (RAxML; Stamatakis, 2014), population structure (STRUCTURE; Pritchard et al., 2000) and admixture (TreeMix; Pickrell and Pritchard, 2012), among many others. In typical usage the analysis API will use an internal data structure generated by the ipyrad assembly process, but it is also flexible enough to import genotypes (e.g. VCF files) generated by other RADseq assembly programs. Diverse population genetic and phylogenetic methods can be differentially impacted by missing data, therefore the analysis API provides simple options for filtering, imputing, consensus sampling and/or running replicate analyses to effectively quantify uncertainty around missing data. The analysis API leverages the massive parallelization provided by the ipyrad backend, manages organization of intermediate files and provides a simple interface for generating publication-ready plots of results (Eaton, 2019), contributing benefits of both usability and reproducibility.
4 Conclusion
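One of the simplest missing-data filters mentioned above, keeping only sites genotyped in a minimum number of samples, can be sketched as follows. The function `filter_sites_by_coverage`, the use of `N` for a missing call and the tiny example matrix are assumptions for illustration and are not taken from the ipyrad analysis API.

```python
def filter_sites_by_coverage(genotypes, min_coverage):
    """Keep alignment columns called in at least min_coverage samples.

    genotypes: dict mapping sample name -> string of base calls of equal
    length, where 'N' marks a missing call.
    Returns (filtered dict, list of kept column indices).
    """
    lengths = {len(calls) for calls in genotypes.values()}
    if len(lengths) > 1:
        raise ValueError("all samples must have the same number of sites")
    nsites = lengths.pop() if lengths else 0
    keep = [
        i for i in range(nsites)
        if sum(calls[i] != "N" for calls in genotypes.values()) >= min_coverage
    ]
    filtered = {
        sample: "".join(calls[i] for i in keep)
        for sample, calls in genotypes.items()
    }
    return filtered, keep


toy_genotypes = {"A": "ACNT", "B": "ANNT", "C": "GCNT"}
complete_sites, complete_index = filter_sites_by_coverage(toy_genotypes, 3)
relaxed_sites, relaxed_index = filter_sites_by_coverage(toy_genotypes, 2)
```

Running the same downstream analysis on both the strict and the relaxed matrix is one simple way to check how sensitive a result is to the missing-data threshold.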
ipyrad is a user friendly, robust, efficient, scalable and flexible program for assembling and analyzing RADseq datasets. The parallelization backend allows ipyrad to scale up to the limit of computational resources it is provided, facilitating the assembly of very large-scale datasets encompassing hundreds of taxa and thousands of samples. The API mode facilitates the creation of documented and shareable assembly workflows, promoting reproducibility. The analysis module provides a unified and coherent interface to many common downstream phylogenetic and population genetic inference methods, reducing the friction and overhead generated by file format conversion, configuration file creation and execution which are typically associated with implementation of these methods. This combination of API mode, parallelized backend and analysis tools allows researchers to efficiently perform, document and publish their full RADseq assembly and analysis workflows within a single computational framework, thus greatly reducing cognitive overhead and promoting reproducibility.
Acknowledgement
We thank Laura Bertola and Ed Myers for useful comments on an early draft of the manuscript.
Funding
This work was supported by grants from the National Science Foundation [DEB-1253710; DEB 1745562; DEB-1557059], the São Paulo Research Foundation [BIOTA, 2013/50297-0] and the National Aeronautics and Space Administration through the Dimensions of Biodiversity Program [DOB 1343578]. I.O. was supported by the Mina Rees Dissertation Fellowship in the Sciences provided by the Graduate Center of the City University of New York.
Conflict of Interest: none declared.
References
Amores A. et al. (2011) Genome evolution and meiotic maps by massively parallel DNA sequencing: spotted gar, an outgroup for the teleost genome duplication. Genetics, 188, 799–808.
Andrews K.R. et al. (2016) Harnessing the power of RADseq for ecological and evolutionary genomics. Nat. Rev. Genet., 17, 81–92.
Baird N.A. et al. (2008) Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One, 3, e3376.
Catchen J. et al. (2013) Stacks: an analysis tool set for population genomics. Mol. Ecol., 22, 3124–3140.
Clugston J.A.R. et al. (2019) RADseq as a valuable tool for plants with large genomes: a case study in cycads. Mol. Ecol. Resour., 19, 1610–1622.
Crotti M. et al. (2019) Causes and analytical impacts of missing data in RADseq phylogenetics: insights from an African frog (Afrixalus). Zool. Scr., 48, 157–167.
Davey J.W. et al. (2011) Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat. Rev. Genet., 12, 499–510.
Eaton D.A.R. (2014) PyRAD: assembly of de novo RADseq loci for phylogenetic analyses. Bioinformatics, 30, 1844–1849.
Eaton D.A.R. (2019) Toytree: a minimalist tree visualization and manipulation library for Python. Methods Ecol. Evol.
Eaton D.A.R., Ree R.H. (2013) Inferring phylogeny and introgression using RADseq data: an example from flowering plants (Pedicularis: Orobanchaceae). Syst. Biol., 62, 689–706.
Edgar R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.
Elshire R.J. et al. (2011) A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One, 6, e19379.
García-Olivares V. et al. (2019) A topoclimate model for Quaternary insular speciation. J. Biogeogr., 46, 2769–2786.
Glenn T.C. et al. (2019) Adapterama I: universal stubs and primers for 384 unique dual-indexed or 147,456 combinatorially-indexed Illumina libraries (iTru & iNext). PeerJ, 7, e7755.
Gropp W. et al. (1996) A high-performance, portable implementation of the MPI message passing interface standard. Parallel Comput., 22, 789–828.
Hipp A.L. et al. (2014) A framework phylogeny of the American oak clade based on sequenced RAD data. PLoS One, 9, e93975.
Hoffberg S.L. et al. (2016) RADcap: sequence capture of dual-digest RADseq libraries with identifiable duplicates and reduced missing data. Mol. Ecol. Resour., 16, 1264–1278.
Kluyver T. et al. (2016) Jupyter Notebooks: a publishing format for reproducible computational workflows. ELPUB, pp. 87–90.
Lam S.K. et al. (2015) Numba: a LLVM-based Python JIT compiler. In: Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC. ACM, New York, NY, USA.
Li H. et al.; 1000 Genome Project Data Processing Subgroup. (2009) The sequence alignment/map format and SAMtools. Bioinformatics, 25, 2078–2079.
Li H., Durbin R. (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25, 1754–1760.
Linck E., Battey C.J. (2019) Minor allele frequency thresholds strongly affect population structure inference with genomic data sets. Mol. Ecol. Resour., 19, 639–647.
Martin M. (2011) Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J., 17, 10–12.
McKain M.R. et al. (2018) Practical considerations for plant phylogenomics. Appl. Plant Sci., 6, e1038.
Miller M.R. et al. (2007) Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers. Genome Res., 17, 240–248.
Palaiokostas C. et al. (2013) Mapping the sex determination locus in the Atlantic halibut (Hippoglossus hippoglossus) using RAD sequencing. BMC Genomics, 14, 566.
Paris J. et al. (2017) Lost in parameter space: a road map for stacks. Methods Ecol. Evol., 8, 1360–1373.
Peterson B.K. et al. (2012) Double digest RADseq: an inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS One, 7, e37135.
Pickrell J.K., Pritchard J.K. (2012) Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet., 8, e1002967.
Pritchard J.K. et al. (2000) Inference of population structure using multilocus genotype data. Genetics, 155, 945–959.
Quinlan A.R. (2014) BEDTools: the Swiss-army tool for genome feature analysis. Curr. Protoc. Bioinformatics, 47, 11–12.
Rochette N.C. et al. (2019) Stacks 2: analytical methods for paired-end sequencing improve RADseq-based population genomics. Mol. Ecol., 28, 4737–4754.
Rognes T. et al. (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, e2584.
Rubin B.E.R., Moreau C.S. (2016) Comparative genomics reveals convergent rates of evolution in ant–plant mutualisms. Nat. Commun., 7, 12679.
Schield D.R. et al. (2016) EpiRADseq: scalable analysis of genomewide patterns of methylation using next-generation sequencing. Methods Ecol. Evol., 7, 60–69.
Shafer A.B.A. et al. (2017) Bioinformatic processing of RAD-seq data dramatically impacts downstream population genetic inference. Methods Ecol. Evol., 8, 907–917.
Stamatakis A. (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30, 1312–1313.
Stratton M. (2008) Genome resequencing and genetic variation. Nat. Biotechnol., 26, 65–66.
Trucchi E. et al. (2016) BsRADseq: screening DNA methylation in natural populations of non-model species. Mol. Ecol., 25, 1697–1713.
Wagner C.E. et al. (2013) Genome-wide RAD sequence data provide unprecedented resolution of species boundaries and relationships in the Lake Victoria cichlid adaptive radiation. Mol. Ecol., 22, 787–798.
© The Author(s) 2020. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
Associate Editor: Russell Schwartz
Source: https://academic.oup.com/bioinformatics/article/36/8/2592/5697088