Data Mining Model for Big Data Analysis

Authors: Syeda Meraj Bilfaqih, Sabahat Khatoon

Big Data concerns large-volume, complex, growing data sets drawn from multiple autonomous sources. With the rapid development of networking, data storage, and data collection capacity, Big Data is now expanding quickly in all science and engineering domains, including the physical, biological, and biomedical sciences. This paper presents a HACE theorem that characterizes the features of the Big Data revolution and proposes a Big Data processing model from the data mining perspective. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven model and in the Big Data revolution.

Authors and Affiliations

Syeda Meraj Bilfaqih
Department of Computer Science, King Khalid University, Saudi Arabia
Sabahat Khatoon
Department of Computer Science, King Khalid University, Saudi Arabia

Keywords: Big Data, Data Mining, Heterogeneity, Autonomous Sources, Complex and Evolving Associations

Publication Details

Published in: Volume 2 | Issue 3 | May-June 2016
Date of Publication: 2016-06-30
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Page(s): 12-25
Manuscript Number: IJSRSET1621117
Publisher: Technoscience Academy

Print ISSN: 2395-1990, Online ISSN: 2394-4099

Cite This Article:

Syeda Meraj Bilfaqih, Sabahat Khatoon, "Data Mining Model for Big Data Analysis," International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN: 2395-1990, Online ISSN: 2394-4099, Volume 2, Issue 3, pp. 12-25, May-June 2016.
