author_facet Greenfield, Daniel L.
Stegle, Oliver
Rrustemi, Alban
author Greenfield, Daniel L.
Stegle, Oliver
Rrustemi, Alban
spellingShingle Greenfield, Daniel L.
Stegle, Oliver
Rrustemi, Alban
Bioinformatics
GeneCodeq: quality score compression and improved genotyping using a Bayesian framework
Computational Mathematics
Computational Theory and Mathematics
Computer Science Applications
Molecular Biology
Biochemistry
Statistics and Probability
author_sort greenfield, daniel l.
spelling Greenfield, Daniel L. Stegle, Oliver Rrustemi, Alban 1367-4811 1367-4803 Oxford University Press (OUP) Computational Mathematics Computational Theory and Mathematics Computer Science Applications Molecular Biology Biochemistry Statistics and Probability http://dx.doi.org/10.1093/bioinformatics/btw385 <jats:title>Abstract</jats:title> <jats:p>Motivation: The exponential reduction in cost of genome sequencing has resulted in a rapid growth of genomic data. Most of the entropy of short read data lies not in the sequence of read bases themselves but in their Quality Scores, the confidence measurement that each base has been sequenced correctly. Lossless compression methods are now close to their theoretical limits and hence there is a need for lossy methods that further reduce the complexity of these data without impacting downstream analyses.</jats:p> <jats:p>Results: We here propose GeneCodeq, a Bayesian method inspired by coding theory for adjusting quality scores to improve their compressibility without adversely impacting genotyping accuracy. Our model leverages a corpus of k-mers to reduce the entropy of the quality scores and thereby improve the compressibility of these data (in FASTQ or SAM/BAM/CRAM files), resulting in compression ratios that significantly exceed those of other methods. Our approach can also be combined with existing lossy compression schemes to further reduce entropy and allows the user to specify a reference panel of expected sequence variations to improve the model accuracy. In addition to extensive empirical evaluation, we also derive novel theoretical insights that explain the empirical performance and pitfalls of corpus-based quality score compression schemes in general. Finally, we show that as a positive side effect of compression, the model can lead to improved genotyping accuracy.</jats:p> <jats:p>Availability and implementation: GeneCodeq is available at: github.com/genecodeq/eval</jats:p> <jats:p>Contact: dan@petagene.com</jats:p> <jats:p>Supplementary information: Supplementary data are available at Bioinformatics online.</jats:p> GeneCodeq: quality score compression and improved genotyping using a Bayesian framework Bioinformatics
doi_str_mv 10.1093/bioinformatics/btw385
facet_avail Online
Free
finc_class_facet Mathematics
Computer Science
Biology
Chemistry and Pharmacy
format ElectronicArticle
fullrecord blob:ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHczODU
id ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHczODU
institution DE-105
DE-14
DE-Ch1
DE-L229
DE-D275
DE-Bn3
DE-Brt1
DE-Zwi2
DE-D161
DE-Gla1
DE-Zi4
DE-15
DE-Pl11
DE-Rs1
imprint Oxford University Press (OUP), 2016
imprint_str_mv Oxford University Press (OUP), 2016
issn 1367-4811
1367-4803
issn_str_mv 1367-4811
1367-4803
language English
mega_collection Oxford University Press (OUP) (CrossRef)
match_str greenfield2016genecodeqqualityscorecompressionandimprovedgenotypingusingabayesianframework
publishDateSort 2016
publisher Oxford University Press (OUP)
recordtype ai
record_format ai
series Bioinformatics
source_id 49
title GeneCodeq: quality score compression and improved genotyping using a Bayesian framework
title_unstemmed GeneCodeq: quality score compression and improved genotyping using a Bayesian framework
title_full GeneCodeq: quality score compression and improved genotyping using a Bayesian framework
title_fullStr GeneCodeq: quality score compression and improved genotyping using a Bayesian framework
title_full_unstemmed GeneCodeq: quality score compression and improved genotyping using a Bayesian framework
title_short GeneCodeq: quality score compression and improved genotyping using a Bayesian framework
title_sort genecodeq: quality score compression and improved genotyping using a bayesian framework
topic Computational Mathematics
Computational Theory and Mathematics
Computer Science Applications
Molecular Biology
Biochemistry
Statistics and Probability
url http://dx.doi.org/10.1093/bioinformatics/btw385
publishDate 2016
physical 3124-3132
description <jats:title>Abstract</jats:title> <jats:p>Motivation: The exponential reduction in cost of genome sequencing has resulted in a rapid growth of genomic data. Most of the entropy of short read data lies not in the sequence of read bases themselves but in their Quality Scores, the confidence measurement that each base has been sequenced correctly. Lossless compression methods are now close to their theoretical limits and hence there is a need for lossy methods that further reduce the complexity of these data without impacting downstream analyses.</jats:p> <jats:p>Results: We here propose GeneCodeq, a Bayesian method inspired by coding theory for adjusting quality scores to improve their compressibility without adversely impacting genotyping accuracy. Our model leverages a corpus of k-mers to reduce the entropy of the quality scores and thereby improve the compressibility of these data (in FASTQ or SAM/BAM/CRAM files), resulting in compression ratios that significantly exceed those of other methods. Our approach can also be combined with existing lossy compression schemes to further reduce entropy and allows the user to specify a reference panel of expected sequence variations to improve the model accuracy. In addition to extensive empirical evaluation, we also derive novel theoretical insights that explain the empirical performance and pitfalls of corpus-based quality score compression schemes in general. Finally, we show that as a positive side effect of compression, the model can lead to improved genotyping accuracy.</jats:p> <jats:p>Availability and implementation: GeneCodeq is available at: github.com/genecodeq/eval</jats:p> <jats:p>Contact: dan@petagene.com</jats:p> <jats:p>Supplementary information: Supplementary data are available at Bioinformatics online.</jats:p>
container_issue 20
container_start_page 3124
container_title Bioinformatics
container_volume 32
format_de105 Article, E-Article
format_de14 Article, E-Article
format_de15 Article, E-Article
format_de520 Article, E-Article
format_de540 Article, E-Article
format_dech1 Article, E-Article
format_ded117 Article, E-Article
format_degla1 E-Article
format_del152 Book
format_del189 Article, E-Article
format_dezi4 Article
format_dezwi2 Article, E-Article
format_finc Article, E-Article
format_nrw Article, E-Article
_version_ 1792334349438287878
geogr_code not assigned
last_indexed 2024-03-01T14:27:12.293Z
geogr_code_person not assigned
openURL url_ver=Z39.88-2004&ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fvufind.svn.sourceforge.net%3Agenerator&rft.title=GeneCodeq%3A+quality+score+compression+and+improved+genotyping+using+a+Bayesian+framework&rft.date=2016-10-15&genre=article&issn=1367-4803&volume=32&issue=20&spage=3124&epage=3132&pages=3124-3132&jtitle=Bioinformatics&atitle=GeneCodeq%3A+quality+score+compression+and+improved+genotyping+using+a+Bayesian+framework&aulast=Rrustemi&aufirst=Alban&rft_id=info%3Adoi%2F10.1093%2Fbioinformatics%2Fbtw385&rft.language%5B0%5D=eng