author_facet Alvarez, Roberto Vera
Mariño-Ramírez, Leonardo
Landsman, David
Alvarez, Roberto Vera
Mariño-Ramírez, Leonardo
Landsman, David
author Alvarez, Roberto Vera
Mariño-Ramírez, Leonardo
Landsman, David
spellingShingle Alvarez, Roberto Vera
Mariño-Ramírez, Leonardo
Landsman, David
GigaScience
Transcriptome annotation in the cloud: complexity, best practices, and cost
Computer Science Applications
Health Informatics
author_sort alvarez, roberto vera
spelling Alvarez, Roberto Vera Mariño-Ramírez, Leonardo Landsman, David 2047-217X Oxford University Press (OUP) Computer Science Applications Health Informatics http://dx.doi.org/10.1093/gigascience/giaa163 <jats:title>Abstract</jats:title> <jats:sec> <jats:title>Background</jats:title> <jats:p>The NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) initiative provides NIH-funded researchers cost-effective access to commercial cloud providers, such as Amazon Web Services (AWS) and Google Cloud Platform (GCP). These cloud providers represent an alternative for the execution of large computational biology experiments like transcriptome annotation, which is a complex analytical process that requires the interrogation of multiple biological databases with several advanced computational tools. The core components of annotation pipelines published since 2012 are BLAST sequence alignments using annotated databases of both nucleotide or protein sequences almost exclusively with networked on-premises compute systems.</jats:p> </jats:sec> <jats:sec> <jats:title>Findings</jats:title> <jats:p>We compare multiple BLAST sequence alignments using AWS and GCP. We prepared several Jupyter Notebooks with all the code required to submit computing jobs to the batch system on each cloud provider. We consider the consequence of the number of query transcripts in input files and the effect on cost and processing time. We tested compute instances with 16, 32, and 64 vCPUs on each cloud provider. Four classes of timing results were collected: the total run time, the time for transferring the BLAST databases to the instance local solid-state disk drive, the time to execute the CWL script, and the time for the creation, set-up, and release of an instance. 
This study aims to establish an estimate of the cost and compute time needed for the execution of multiple BLAST runs in a cloud environment.</jats:p> </jats:sec> <jats:sec> <jats:title>Conclusions</jats:title> <jats:p>We demonstrate that public cloud providers are a practical alternative for the execution of advanced computational biology experiments at low cost. Using our cloud recipes, the BLAST alignments required to annotate a transcriptome with ∼500,000 transcripts can be processed in &lt;2 hours with a compute cost of ∼$200–$250. In our opinion, for BLAST-based workflows, the choice of cloud platform is not dependent on the workflow but, rather, on the specific details and requirements of the cloud provider. These choices include the accessibility for institutional use, the technical knowledge required for effective use of the platform services, and the availability of open source frameworks such as APIs to deploy the workflow.</jats:p> </jats:sec> Transcriptome annotation in the cloud: complexity, best practices, and cost GigaScience
doi_str_mv 10.1093/gigascience/giaa163
facet_avail Online
Free
finc_class_facet Informatik
Medizin
format ElectronicArticle
fullrecord blob:ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTA5My9naWdhc2NpZW5jZS9naWFhMTYz
id ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTA5My9naWdhc2NpZW5jZS9naWFhMTYz
institution DE-L229
DE-D275
DE-Bn3
DE-Brt1
DE-Zwi2
DE-D161
DE-Gla1
DE-Zi4
DE-15
DE-Rs1
DE-Pl11
DE-105
DE-14
DE-Ch1
imprint Oxford University Press (OUP), 2021
imprint_str_mv Oxford University Press (OUP), 2021
issn 2047-217X
issn_str_mv 2047-217X
language English
mega_collection Oxford University Press (OUP) (CrossRef)
match_str alvarez2021transcriptomeannotationinthecloudcomplexitybestpracticesandcost
publishDateSort 2021
publisher Oxford University Press (OUP)
recordtype ai
record_format ai
series GigaScience
source_id 49
title Transcriptome annotation in the cloud: complexity, best practices, and cost
title_unstemmed Transcriptome annotation in the cloud: complexity, best practices, and cost
title_full Transcriptome annotation in the cloud: complexity, best practices, and cost
title_fullStr Transcriptome annotation in the cloud: complexity, best practices, and cost
title_full_unstemmed Transcriptome annotation in the cloud: complexity, best practices, and cost
title_short Transcriptome annotation in the cloud: complexity, best practices, and cost
title_sort transcriptome annotation in the cloud: complexity, best practices, and cost
topic Computer Science Applications
Health Informatics
url http://dx.doi.org/10.1093/gigascience/giaa163
publishDate 2021
physical
description <jats:title>Abstract</jats:title> <jats:sec> <jats:title>Background</jats:title> <jats:p>The NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) initiative provides NIH-funded researchers cost-effective access to commercial cloud providers, such as Amazon Web Services (AWS) and Google Cloud Platform (GCP). These cloud providers represent an alternative for the execution of large computational biology experiments like transcriptome annotation, which is a complex analytical process that requires the interrogation of multiple biological databases with several advanced computational tools. The core components of annotation pipelines published since 2012 are BLAST sequence alignments using annotated databases of both nucleotide or protein sequences almost exclusively with networked on-premises compute systems.</jats:p> </jats:sec> <jats:sec> <jats:title>Findings</jats:title> <jats:p>We compare multiple BLAST sequence alignments using AWS and GCP. We prepared several Jupyter Notebooks with all the code required to submit computing jobs to the batch system on each cloud provider. We consider the consequence of the number of query transcripts in input files and the effect on cost and processing time. We tested compute instances with 16, 32, and 64 vCPUs on each cloud provider. Four classes of timing results were collected: the total run time, the time for transferring the BLAST databases to the instance local solid-state disk drive, the time to execute the CWL script, and the time for the creation, set-up, and release of an instance. This study aims to establish an estimate of the cost and compute time needed for the execution of multiple BLAST runs in a cloud environment.</jats:p> </jats:sec> <jats:sec> <jats:title>Conclusions</jats:title> <jats:p>We demonstrate that public cloud providers are a practical alternative for the execution of advanced computational biology experiments at low cost. 
Using our cloud recipes, the BLAST alignments required to annotate a transcriptome with ∼500,000 transcripts can be processed in &lt;2 hours with a compute cost of ∼$200–$250. In our opinion, for BLAST-based workflows, the choice of cloud platform is not dependent on the workflow but, rather, on the specific details and requirements of the cloud provider. These choices include the accessibility for institutional use, the technical knowledge required for effective use of the platform services, and the availability of open source frameworks such as APIs to deploy the workflow.</jats:p> </jats:sec>
container_issue 2
container_start_page 0
container_title GigaScience
container_volume 10
format_de105 Article, E-Article
format_de14 Article, E-Article
format_de15 Article, E-Article
format_de520 Article, E-Article
format_de540 Article, E-Article
format_dech1 Article, E-Article
format_ded117 Article, E-Article
format_degla1 E-Article
format_del152 Buch
format_del189 Article, E-Article
format_dezi4 Article
format_dezwi2 Article, E-Article
format_finc Article, E-Article
format_nrw Article, E-Article
_version_ 1792342218575446027
geogr_code not assigned
last_indexed 2024-03-01T16:32:19.942Z
geogr_code_person not assigned
openURL url_ver=Z39.88-2004&ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fvufind.svn.sourceforge.net%3Agenerator&rft.title=Transcriptome+annotation+in+the+cloud%3A+complexity%2C+best+practices%2C+and+cost&rft.date=2021-01-29&genre=article&issn=2047-217X&volume=10&issue=2&jtitle=GigaScience&atitle=Transcriptome+annotation+in+the+cloud%3A+complexity%2C+best+practices%2C+and+cost&aulast=Landsman&aufirst=David&rft_id=info%3Adoi%2F10.1093%2Fgigascience%2Fgiaa163&rft.language%5B0%5D=eng