Published at : 27 Jan 2018
Volume : IJtech
Vol 9, No 1 (2018)
DOI : https://doi.org/10.14716/ijtech.v9i1.1509
Jittawiriyanukoon, C., Srisarkun, V., 2018. An Approximation Method of Regression Analysis in Concurrent Big Data Stream. International Journal of Technology. Volume 9(1), pp. 192-200
Chanintorn Jittawiriyanukoon | Assumption University
Vilasinee Srisarkun | Assumption University
Time-series big data changes in size dynamically, and curating such an enormous amount of data can be difficult because of limits on processing capacity and storage. Such data allows researchers to iterate on a model millions of times. Executing a regression on several billion rows across a distributed network requires accounting for the resource capacity demanded by both the data volume and the distributed environment. Algorithms must be aware of the data in real time. Moreover, big data sources need to be pre-processed rather than analyzed immediately after collection. Pre-processing reduces the amount of collected data by extracting insights from it, which makes the analysis quicker and saves storage space. Hence, this research proposes an approximation method for analyzing regression problems in a big data stream with parallelism. Partitioning the huge data stream reduces both computing time and required space, and the resulting speed-up shortens the processing time. The performance of the concurrent regression model is first evaluated by massive online analysis (MOA) simulation. Then, to validate the approximation method, the results of the proposed method are compared with those collected from the simulation. The comparison shows that the two methods produce comparable results.
Approximation method; Big data curation; MOA; Parallel processing; Regression analysis
This paper proposes an approach to handle the computing cost of big data curation. The main contribution of this research is to partition a huge dataset into subtasks and then execute the sibling tasks on parallel processing systems. Both the splitting time and the reassembling time are included in the calculation. An approximation technique based on queuing theory has been introduced. To validate it, the approximation results were compared with the simulation results, and the two were found to be comparable. In addition, a cost-effectiveness analysis was used to measure the speed-up factor. The research shows how parallel computation can accommodate the cost of massive data. For big data analytics in business intelligence, the capability to execute a large number of concurrent queries is a critical problem for an application such as Google BigQuery. According to a benchmark recently released for Google's cloud, a new mechanism can simplify data preparation for both machine learning and deep learning. For instance, the cloud serverless model allows as many as 25 concurrent queries on datasets. Hence, future research will study 30 concurrent queries on applications such as Google's cloud to determine whether they can handle high concurrency. Another investigation may include a cost-effectiveness analysis for the case of multiple processing units. The approximation method will then be applied to reduce the complexity of simulation execution in future research.
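The partition, parallel execution, and reassembly steps summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation or their queuing-theory approximation. It assumes a simple one-predictor linear regression, where per-partition sums can be merged exactly, and a hypothetical speed-up model in which the parallel run costs the splitting time, plus the serial time divided evenly across `k` workers, plus the reassembling time. The names `chunk_stats`, `split`, `parallel_fit`, and `speed_up` are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_stats(chunk):
    """Sufficient statistics of one partition for simple linear regression."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return n, sx, sy, sxx, sxy


def split(rows, k):
    """Split rows into at most k contiguous partitions of near-equal size."""
    size = -(-len(rows) // k)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def parallel_fit(rows, k):
    """Fit y = intercept + slope * x by merging per-partition statistics."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        parts = list(pool.map(chunk_stats, split(rows, k)))
    # Reassembly step: column-wise sum of the per-partition statistics.
    n, sx, sy, sxx, sxy = (sum(col) for col in zip(*parts))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


def speed_up(t_serial, k, t_split, t_merge):
    """Assumed model: serial time over (split + evenly divided work + merge)."""
    return t_serial / (t_split + t_serial / k + t_merge)


# Synthetic data on the exact line y = 3x + 2 (illustrative, not from the paper).
rows = [(float(x), 3.0 * x + 2.0) for x in range(1000)]
slope, intercept = parallel_fit(rows, k=4)
print(slope, intercept)
print(speed_up(t_serial=100.0, k=4, t_split=5.0, t_merge=5.0))
```

Threads are used here only to keep the sketch short and portable. In CPython they do not run CPU-bound work in parallel, so a process pool or a distributed framework would be needed to observe a real speed-up. Under the assumed model, the splitting and reassembling overheads keep the speed-up below `k`.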