
An Approximation Method of Regression Analysis in Concurrent Big Data Stream

Chanintorn Jittawiriyanukoon, Vilasinee Srisarkun


Cite this article as:

Jittawiriyanukoon, C., Srisarkun, V., 2018. An Approximation Method of Regression Analysis in Concurrent Big Data Stream. International Journal of Technology. Volume 9(1), pp. 192-200

Chanintorn Jittawiriyanukoon Assumption University
Vilasinee Srisarkun Assumption University

Abstract

Time series big data changes in size dynamically, and curating such enormous volumes can be difficult because of limits on processing capacity and storage. Big data also allows researchers to iterate on a model millions of times. To run a regression over several billion rows of data on a distributed network, the resource capacity needed for large data volumes and the distributed environment itself must be considered, and algorithms must be aware of the data in real time. Moreover, analyzing big data sources requires the data to be pre-processed rather than analyzed immediately after collection. Pre-processing reduces the amount of collected data by extracting insights, so the data can be analyzed more quickly and stored more cost-effectively. Hence, this research proposes an approximation method for analyzing regression problems in a big data stream with parallelism. Partitioning the huge data stream reduces the computing time and the required space, and the resulting speed-up improves the processing time. The performance of the concurrent regression model is first evaluated by massive online analysis (MOA) simulation. Then, to validate the approximation method, its results are compared with those collected from the simulation. The comparison shows that the two methods agree closely.

Approximation method; Big data curation; MOA; Parallel processing; Regression analysis
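
The partition-then-combine idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' approximation method: it assumes a simple one-variable least-squares fit, and the names `Stats`, `summarize`, `partition`, and `parallel_fit` are hypothetical. Each partition is reduced to a few running sums, and the partial sums are merged before the coefficients are computed, so only the small summaries need to be kept in memory.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Stats:
    """Running sums that are sufficient for a one-variable least-squares fit."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def merge(self, other):
        # Reassembly step: summaries of disjoint partitions simply add up.
        return Stats(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                     self.sxx + other.sxx, self.sxy + other.sxy)


def summarize(chunk):
    """Reduce one partition of (x, y) rows to its running sums."""
    s = Stats()
    for x, y in chunk:
        s.n += 1
        s.sx += x
        s.sy += y
        s.sxx += x * x
        s.sxy += x * y
    return s


def partition(rows, k):
    """Splitting step: deal the rows into k roughly equal partitions."""
    return [rows[i::k] for i in range(k)]


def fit(stats):
    """Closed-form slope and intercept from the merged sums."""
    denom = stats.n * stats.sxx - stats.sx ** 2
    slope = (stats.n * stats.sxy - stats.sx * stats.sy) / denom
    intercept = (stats.sy - slope * stats.sx) / stats.n
    return slope, intercept


def parallel_fit(rows, k=4):
    """Summarize k partitions concurrently, then merge and fit."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        partials = list(pool.map(summarize, partition(rows, k)))
    total = Stats()
    for p in partials:
        total = total.merge(p)
    return fit(total)


if __name__ == "__main__":
    rows = [(x, 2 * x + 1) for x in range(1000)]
    print(parallel_fit(rows))
```

Threads are used here only to keep the sketch dependency-free; in CPython they do not speed up CPU-bound work, so a real deployment would more likely use processes or a distributed framework.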

Conclusion

This paper proposes an approach to handling the computing cost of big data curation. The main contribution of this research is to partition a huge dataset into subtasks and then execute the sibling subtasks on parallel processing systems. Both the splitting time and the reassembling time are taken into account in our calculation. An approximation technique based on queuing theory has been introduced, and to validate it, the approximation results were compared with simulation results; the two were comparable. In addition, a cost-effectiveness analysis was used to measure the speed-up factor. The research shows how parallel computation can accommodate the cost of massive data. For big data analytics in business intelligence, the capability to execute a large number of concurrent queries is a critical problem for applications such as Google BigQuery. According to a benchmark recently released by Google's cloud, a new mechanism can simplify data preparation for both machine learning and deep learning. For instance, the cloud serverless model allows as many as 25 concurrent queries on datasets. Hence, future research will study 30 concurrent queries on applications such as Google's cloud to determine whether they can handle high concurrency. Another investigation may include a cost-effectiveness analysis for the case of multiple processing units. The approximation method will then be applied to reduce the complexity of simulation execution in future research.
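
To make the role of splitting and reassembling time in the speed-up factor concrete, the following sketch uses a deliberately simple cost model. It is an assumption for illustration, not the paper's queuing-based approximation: the work is taken to divide evenly across workers, and the splitting and merging overheads are treated as fixed serial costs. The function names and parameters are hypothetical.

```python
def parallel_time(t_serial, workers, t_split, t_merge):
    """Estimated wall time: serial split, evenly divided work, serial merge.

    Assumed model (not from the paper): equal partitions and fixed overheads.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return t_split + t_serial / workers + t_merge


def speedup(t_serial, workers, t_split, t_merge):
    """Speed-up factor: serial time divided by estimated parallel time."""
    return t_serial / parallel_time(t_serial, workers, t_split, t_merge)


if __name__ == "__main__":
    # Hypothetical timings in seconds, chosen only for illustration.
    for workers in (1, 2, 4, 8):
        print(workers, round(speedup(100.0, workers, 2.0, 3.0), 2))
```

Under this model the overheads cap the achievable speed-up: as the number of workers grows, the factor approaches `t_serial / (t_split + t_merge)` rather than growing without bound.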

References

Bifet, A., Kirkby, R., Holmes, G., Pfahringer, B., 2010. MOA: Massive Online Analysis. Journal of Machine Learning Research, Volume 11, pp. 1601–1604

Fox, J., Weisberg, S., 2011. An R Companion to Applied Regression. California: SAGE Publications

G.D.G. Software SARL., 2016. Software SARL. Available online at http://www.gdgsoft.com, Accessed on February 15, 2017

Heinis, T., 2014. Data Analysis: Approximation Aids Handling of Big Data. Nature, pp. 198–198

Hodge, V.J., 2014. Outlier Detection in Big Data. IGI Global, pp. 1762–1771

Hu, W., Kaabouch, N., 2014. Big Data Management, Technologies, and Applications. Information Science Reference, IGI Global

James, G., Witten, D., Hastie, T., Tibshirani, R., 2013. An Introduction to Statistical Learning with Applications in R. New York: Springer

Jittawiriyanukoon, C., 2014. Performance Evaluation of Reliable Data Scheduling for Erlang Multimedia in Cloud Computing. IEEE Proceedings of the Ninth International Conference on Digital Information Management (ICDIM 2014), pp. 39–44

Khan, A., Ahirwar, K.K., 2011. Mobile Cloud Computing as a Future of Mobile Multimedia Database. International Journal of Computer Science and Communication, Volume 2(1), pp. 219–231

Malik, A.W., Park, A.J., Fujimoto, R.M., 2010. An Optimistic Parallel Simulation Protocol for Cloud Computing Environments. SCS M&S Magazine, Volume IV, pp. 1–9

Srimani, P.K., Patil, M.M., 2016. Mining Data Streams with Concept Drift in Massive Online Analysis Frame Work. WSEAS Transactions on Computers, Volume 15, pp. 133–142

Sunghae, J., Seung-Joo, L., Jea-Bok, R., 2015. A Divided Regression Analysis for Big Data. International Journal of Software Engineering and Its Applications, Volume 9(5), pp. 21–32

Tsai, C.-F., Lin, W.-C., Ke, S.-W., 2016. Big Data Mining with Parallel Computing: A Comparison of Distributed and MapReduce Methodologies. Journal of Systems and Software, Volume 122, pp. 83–92