Bit Reduction based Compression Algorithm for DNA Sequences

Authors

  • Rosario Gilmary  Department of Computer Science and Engineering, Pondicherry Engineering College, India
  • Murugesan G  Department of Computer Science and Engineering, St. Joseph’s College of Engineering, India

DOI:

https://doi.org/10.32628/IJSRSET218529

Keywords:

Data compression, lossy and lossless compression, DNA, bases, bit reduction, hexadecimal format, variable-length code, Huffman codes

Abstract

Deoxyribonucleic acid (DNA) is the smallest fundamental unit that bears the genetic instructions of a living organism. It is used in the development and functioning of all known living organisms. Current DNA sequencing equipment produces vast amounts of genomic data. Nucleotide databases such as GenBank grow to 2 to 3 times their size annually. The increase in genomic data outstrips the increase in storage capacity. This massive amount of genomic data requires efficient storage, fast transfer, and good performance. Compression algorithms are used to reduce the storage this data requires and the associated expense. General-purpose compression approaches perform poorly on these sequences. However, novel compression algorithms have been introduced to achieve better compression ratios. Performance is evaluated in terms of compression ratio, which relates the size of the compressed file to that of the original, and compression/decompression time, the time taken to compress or decompress the sequence. In the proposed work, the input DNA sequence is compressed by transforming it through several formats. The input DNA sequence is first subjected to bit reduction. The binary output is then converted to hexadecimal format and encoded. Thus, the compression ratio of the biological sequence is improved.
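
The pipeline described in the abstract (bit reduction, conversion to hexadecimal, then encoding) can be illustrated with a minimal sketch. It rests on assumptions the abstract does not confirm: the 2-bit mapping A=00, C=01, G=10, T=11, right-padding with zeros to a whole hex digit, and Huffman coding as the final encoding step (suggested only by the keywords). All function names are hypothetical, and this is not the authors' implementation.

```python
import heapq
from collections import Counter

# Assumed 2-bit mapping; the abstract does not specify one.
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}


def bit_reduce(seq):
    """Replace each base with its assumed 2-bit code and return a bit string."""
    return "".join(BASE_TO_BITS[base] for base in seq.upper())


def bits_to_hex(bits):
    """Group bits into 4-bit nibbles (zero-padded on the right) as hex digits."""
    padded = bits + "0" * (-len(bits) % 4)
    return "".join(
        format(int(padded[i:i + 4], 2), "X") for i in range(0, len(padded), 4)
    )


def huffman_codes(symbols):
    """Build a Huffman code table {symbol: bit string} for a symbol string."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a non-empty code.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, partial code table).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def compress(seq):
    """Return (encoded bit string, code table, number of bases)."""
    hex_text = bits_to_hex(bit_reduce(seq))
    table = huffman_codes(hex_text)
    encoded = "".join(table[h] for h in hex_text)
    return encoded, table, len(seq)


def decompress(encoded, table, n_bases):
    """Invert compress(): Huffman-decode to hex, expand to bits, map to bases."""
    inverse = {code: sym for sym, code in table.items()}
    hex_digits, current = [], ""
    for bit in encoded:
        current += bit
        if current in inverse:
            hex_digits.append(inverse[current])
            current = ""
    bits = "".join(format(int(h, 16), "04b") for h in hex_digits)
    bits_to_base = {v: k for k, v in BASE_TO_BITS.items()}
    # Use only the first 2 * n_bases bits so the padding is discarded.
    return "".join(bits_to_base[bits[i:i + 2]] for i in range(0, 2 * n_bases, 2))
```

In this sketch the bit reduction step alone stores each base in 2 bits instead of the 8 bits of an ASCII character. Whether the subsequent encoding of the hexadecimal digits shortens the output further depends on how skewed the digit frequencies are in a given sequence.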

Published

2021-10-30

Section

Research Articles

How to Cite

[1]
Rosario Gilmary, Murugesan G, "Bit Reduction based Compression Algorithm for DNA Sequences," International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN: 2395-1990, Online ISSN: 2394-4099, Volume 8, Issue 5, pp. 270-277, September-October 2021. Available at doi: https://doi.org/10.32628/IJSRSET218529