GeneCodeq: quality score compression and improved genotyping using a Bayesian framework
Journal title: | Bioinformatics |
---|---|
Persons and corporate bodies: | Greenfield, Daniel L., Stegle, Oliver, Rrustemi, Alban |
In: | Bioinformatics, 32, 2016, 20, pp. 3124-3132 |
Format: | E-Article |
Language: | English |
Published: | Oxford University Press (OUP) |
Keywords: |
author_facet |
Greenfield, Daniel L., Stegle, Oliver, Rrustemi, Alban, Greenfield, Daniel L., Stegle, Oliver, Rrustemi, Alban |
---|---|
author |
Greenfield, Daniel L., Stegle, Oliver, Rrustemi, Alban |
spellingShingle |
Greenfield, Daniel L., Stegle, Oliver, Rrustemi, Alban, Bioinformatics, GeneCodeq: quality score compression and improved genotyping using a Bayesian framework, Computational Mathematics, Computational Theory and Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Statistics and Probability |
author_sort |
greenfield, daniel l. |
doi_str_mv |
10.1093/bioinformatics/btw385 |
facet_avail |
Online, Free |
finc_class_facet |
Mathematics, Computer Science, Biology, Chemistry and Pharmacy |
format |
ElectronicArticle |
fullrecord |
blob:ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHczODU |
id |
ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHczODU |
institution |
DE-105, DE-14, DE-Ch1, DE-L229, DE-D275, DE-Bn3, DE-Brt1, DE-Zwi2, DE-D161, DE-Gla1, DE-Zi4, DE-15, DE-Pl11, DE-Rs1 |
imprint |
Oxford University Press (OUP), 2016 |
imprint_str_mv |
Oxford University Press (OUP), 2016 |
issn |
1367-4811, 1367-4803 |
issn_str_mv |
1367-4811, 1367-4803 |
language |
English |
mega_collection |
Oxford University Press (OUP) (CrossRef) |
match_str |
greenfield2016genecodeqqualityscorecompressionandimprovedgenotypingusingabayesianframework |
publishDateSort |
2016 |
publisher |
Oxford University Press (OUP) |
recordtype |
ai |
record_format |
ai |
series |
Bioinformatics |
source_id |
49 |
title |
GeneCodeq: quality score compression and improved genotyping using a Bayesian framework |
title_unstemmed |
GeneCodeq: quality score compression and improved genotyping using a Bayesian framework |
title_full |
GeneCodeq: quality score compression and improved genotyping using a Bayesian framework |
title_fullStr |
GeneCodeq: quality score compression and improved genotyping using a Bayesian framework |
title_full_unstemmed |
GeneCodeq: quality score compression and improved genotyping using a Bayesian framework |
title_short |
GeneCodeq: quality score compression and improved genotyping using a Bayesian framework |
title_sort |
genecodeq: quality score compression and improved genotyping using a bayesian framework |
topic |
Computational Mathematics, Computational Theory and Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Statistics and Probability |
url |
http://dx.doi.org/10.1093/bioinformatics/btw385 |
publishDate |
2016 |
physical |
3124-3132 |
description |
Abstract
Motivation: The exponential reduction in the cost of genome sequencing has resulted in rapid growth of genomic data. Most of the entropy of short read data lies not in the sequence of read bases themselves but in their Quality Scores, the confidence measurement that each base has been sequenced correctly. Lossless compression methods are now close to their theoretical limits, and hence there is a need for lossy methods that further reduce the complexity of these data without impacting downstream analyses.
Results: We propose GeneCodeq, a Bayesian method inspired by coding theory for adjusting quality scores to improve their compressibility without adversely impacting genotyping accuracy. Our model leverages a corpus of k-mers to reduce the entropy of the quality scores and thereby improve the compressibility of these data (in FASTQ or SAM/BAM/CRAM files), resulting in compression ratios that significantly exceed those of other methods. Our approach can also be combined with existing lossy compression schemes to further reduce entropy, and it allows the user to specify a reference panel of expected sequence variations to improve the model accuracy. In addition to extensive empirical evaluation, we also derive novel theoretical insights that explain the empirical performance and pitfalls of corpus-based quality score compression schemes in general. Finally, we show that as a positive side effect of compression, the model can lead to improved genotyping accuracy.
Availability and implementation: GeneCodeq is available at: github.com/genecodeq/eval
Contact: dan@petagene.com
Supplementary information: Supplementary data are available at Bioinformatics online. |
container_issue |
20 |
container_start_page |
3124 |
container_title |
Bioinformatics |
container_volume |
32 |
format_de105 |
Article, E-Article |
format_de14 |
Article, E-Article |
format_de15 |
Article, E-Article |
format_de520 |
Article, E-Article |
format_de540 |
Article, E-Article |
format_dech1 |
Article, E-Article |
format_ded117 |
Article, E-Article |
format_degla1 |
E-Article |
format_del152 |
Book |
format_del189 |
Article, E-Article |
format_dezi4 |
Article |
format_dezwi2 |
Article, E-Article |
format_finc |
Article, E-Article |
format_nrw |
Article, E-Article |
_version_ |
1792334349438287878 |
geogr_code |
not assigned |
last_indexed |
2024-03-01T14:27:12.293Z |
geogr_code_person |
not assigned |
openURL |
url_ver=Z39.88-2004&ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fvufind.svn.sourceforge.net%3Agenerator&rft.title=GeneCodeq%3A+quality+score+compression+and+improved+genotyping+using+a+Bayesian+framework&rft.date=2016-10-15&genre=article&issn=1367-4803&volume=32&issue=20&spage=3124&epage=3132&pages=3124-3132&jtitle=Bioinformatics&atitle=GeneCodeq%3A+quality+score+compression+and+improved+genotyping+using+a+Bayesian+framework&aulast=Rrustemi&aufirst=Alban&rft_id=info%3Adoi%2F10.1093%2Fbioinformatics%2Fbtw385&rft.language%5B0%5D=eng |
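The abstract in the description field argues that most of the entropy of short read data sits in the quality scores, and that lossy adjustment of those scores can lower it. The following is a minimal sketch of that idea only, not GeneCodeq's Bayesian model, which this record does not describe. It assumes the common Phred+33 FASTQ encoding; the quality string is hypothetical, and the binning step is a deliberately simple stand-in for a lossy adjustment.

```python
import math
from collections import Counter
from typing import Iterable, List


def decode_fastq_quality(qual: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in qual]


def phred_to_error_prob(q: float) -> float:
    """Phred definition: Q = -10 * log10(p), so p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def empirical_entropy(symbols: Iterable[int]) -> float:
    """Shannon entropy (bits per symbol) of the observed symbol frequencies."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def quantize(scores: Iterable[int], bin_width: int = 10) -> List[int]:
    """Illustrative lossy step: snap each score down to the start of its bin.

    This is a stand-in for a lossy adjustment, NOT the method from the paper.
    """
    return [(q // bin_width) * bin_width for q in scores]


# Hypothetical quality string, for illustration only.
qual = "IIIIHHGGFFEEDDCCBBAA@@??>>"
scores = decode_fastq_quality(qual)
entropy_before = empirical_entropy(scores)
entropy_after = empirical_entropy(quantize(scores))
print(f"entropy before: {entropy_before:.3f} bits/score")
print(f"entropy after:  {entropy_after:.3f} bits/score")
```

Because binning maps several distinct scores onto one value, the per-symbol entropy cannot increase, which is why a downstream lossless compressor has less to encode. Whether genotyping accuracy survives such a step is exactly the question the paper addresses and is not shown here.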