
Preventing and Rectifying Cloud Quality of Service Violation Through Adaptive Resource Scaling and Replication

Tong-Sheng Wong, Gaik-Yee Chan, Fang-Fang Chua

Corresponding email: gychan58@yahoo.com


Cite this article as:
Wong, T-S., Chan, G-Y., Chua, F-F., 2019. Preventing and Rectifying Cloud Quality of Service Violation Through Adaptive Resource Scaling and Replication. International Journal of Technology. Volume 10(7), pp. 1395-1406
Tong-Sheng Wong Faculty of Computing and Informatics, Multimedia University, Persiaran Multimedia, 63100 Cyberjaya, Selangor, Malaysia
Gaik-Yee Chan Faculty of Computing and Informatics, Multimedia University, Persiaran Multimedia, 63100 Cyberjaya, Selangor, Malaysia
Fang-Fang Chua Faculty of Computing and Informatics, Multimedia University, Persiaran Multimedia, 63100 Cyberjaya, Selangor, Malaysia

Abstract

One major challenge in delivering and accessing cloud applications is the management of Quality of Service (QoS). Cloud service providers are obliged to maintain service performance and fulfil the QoS defined in the Service Level Agreement (SLA). In this paper, we propose a Scaling and Fault Tolerance (SFT) algorithm that deploys preventive or remedial measures based on 16 decision rules for QoS violation detection and prediction. We simulate the SFT algorithm in a cloud simulator with four scenarios to measure its effectiveness in handling events such as faulty virtual machines (VMs) and the over- or under-provisioning of resources. Our experimental results show that the proposed SFT algorithm performs effectively (close to a 90%-100% effective rate) in providing preventive or remedial measures and in reducing the number of VMs when they are not needed.

Keywords: Cloud computing; Fault tolerance; Quality of service violation; Replication; Scalability

Introduction

According to the National Institute of Standards and Technology (NIST) (Mell & Grance, 2011), cloud computing has emerged as a compelling paradigm for providing convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction. This has made possible the hosting of cloud services offered by cloud service providers, such as Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS). An increasing number of organizations are adopting cloud computing platforms for their daily business functions. These organizations are showing determination in embracing Industry 4.0, as cloud computing helps to pool and centralize information for making better business decisions. The focus of this paper is therefore on the Quality of Service (QoS) pertaining to SaaS.

Cloud service providers are mandated to enforce service performance and the quality of their services, as defined in the Service Level Agreement (SLA). From their perspective, maintaining the conditions defined in the SLA and maximizing the QoS metrics are important tasks. For evaluating cloud services, we take CPU load, response time and throughput as the QoS metrics, all of which belong to the performance aspect (Bardsiri & Hashemi, 2014).

Cloud monitoring tools measure and collect cloud QoS data, and this information is used for making decisions on scaling cloud resources horizontally, and also for providing fault tolerance if necessary. Modelling the performance of cloud services with QoS metrics could therefore enable better decisions on preventing or rectifying cloud QoS violations, by correlating the decision rules with QoS metrics such as response time and throughput. A detailed description of the formulation of these decision rules can be found in the study by Wong et al. (2018).

The work by Wong et al. (2019) has shown the feasibility of using horizontal scaling as a preventive measure, and the fault tolerance mechanism of replication as rectification, for QoS violations. In this paper, we propose a scaling and fault tolerance (SFT) algorithm and evaluate its effectiveness using CloudSim Plus (Filho et al., 2017), a toolkit with libraries for the simulation of cloud computing scenarios. A total of four scenarios were used in the simulation to evaluate the effectiveness of the proposed SFT algorithm. One scenario provided a preventive measure for probable cloud QoS violation; another provided a remedial measure for certain cloud QoS violation; and the other two monitored the over- or under-provisioning of virtual machines (VMs). One example of monitoring VM provisioning is reducing the number of idle VMs based on real-world QoS measurements of cloud services, as discussed by Zheng et al. (2014).

Our experimental results show that the proposed SFT algorithm performs effectively (close to a 90%-100% effective rate) in providing preventive or remedial measures and in reducing the number of VMs when they are not needed, consequently guaranteeing QoS performance as defined in the cloud services SLA. Additionally, the 16 decision rules determine QoS violation at four levels, namely no violation, normal, probable violation and certain violation. Unlike many other works, which classify an event only as normal or as a violation, the four levels make it possible to detect and predict a violation before it actually happens. The SFT algorithm takes appropriate action to prevent an actual violation from occurring, or rectifies it if a violation has occurred. Together with the 16 decision rules, it thus contributes towards another aspect of detection, prediction, prevention and rectification measures with regard to response time and throughput for cloud QoS violations.
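The relationship between the four violation levels and the type of measure taken can be illustrated with a minimal sketch. This is not the SFT algorithm itself: the 16 decision rules are not reproduced here, and the function name, the action names and the choice to scale in only when idle VMs exist at the two non-violating levels are assumptions made for illustration.

```python
from enum import Enum


class ViolationLevel(Enum):
    # The four levels named in the text.
    NO_VIOLATION = "no violation"
    NORMAL = "normal"
    PROBABLE_VIOLATION = "probable violation"
    CERTAIN_VIOLATION = "certain violation"


def choose_action(level: ViolationLevel, idle_vms: int = 0) -> str:
    """Return a hypothetical action name for a given violation level.

    Assumed mapping (illustrative only):
      certain violation  -> remedial measure (replication, task resubmission)
      probable violation -> preventive measure (horizontal scale-out)
      otherwise          -> remove VMs only if some are idle
    """
    if level is ViolationLevel.CERTAIN_VIOLATION:
        return "replicate_and_resubmit"
    if level is ViolationLevel.PROBABLE_VIOLATION:
        return "scale_out"
    if idle_vms > 0:
        return "scale_in"
    return "no_action"


if __name__ == "__main__":
    for level in ViolationLevel:
        print(level.value, "->", choose_action(level, idle_vms=1))
```

The point of the sketch is that a "probable violation" level gives the system a chance to act before the SLA is actually breached, which a two-level normal/violation classification cannot do.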

Unlike the work of Aruna and Aramudhan (2016), which includes cost in its proposed method of using fuzzy sets to shortlist providers based on the QoS agreed in the SLA, the SFT algorithm does not include the cost factor in resolving QoS violations. This will be left for future work.

Scalability is defined as the handling of increasing workloads by allocating more resources to the system (Lehrig et al., 2015). There are two general scalability approaches, namely horizontal and vertical scaling. Horizontal scaling involves adding or removing VMs to spread the load across multiple distributed VMs, while vertical scaling involves increasing or decreasing the power of an existing VM by means of more or less memory (RAM), storage (HDD/SSD), or processors (CPUs). In this paper, the focus is on application scalability, which is defined as the maintenance of cloud services application performance goals by avoiding QoS violation events when the workload submitted by users increases (Kuperberg et al., 2011).
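The difference between the two scaling approaches can be made concrete with a small sketch. The `Vm` record and the helper functions below are hypothetical and do not correspond to any simulator API; they only model a pool of VMs as a list.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Vm:
    # Hypothetical VM description: processor count and memory size.
    cpus: int
    ram_mb: int


def scale_out(pool: List[Vm], template: Vm) -> List[Vm]:
    """Horizontal scaling: add one more VM to the pool."""
    return pool + [template]


def scale_in(pool: List[Vm]) -> List[Vm]:
    """Horizontal scaling: remove one VM, keeping at least one."""
    return pool[:-1] if len(pool) > 1 else pool


def scale_up(vm: Vm, extra_cpus: int = 0, extra_ram_mb: int = 0) -> Vm:
    """Vertical scaling: give an existing VM more CPUs or RAM."""
    return replace(vm, cpus=vm.cpus + extra_cpus, ram_mb=vm.ram_mb + extra_ram_mb)


def total_cpus(pool: List[Vm]) -> int:
    return sum(vm.cpus for vm in pool)


if __name__ == "__main__":
    small = Vm(cpus=2, ram_mb=4096)
    pool = [small]
    print(total_cpus(scale_out(pool, small)))   # two VMs share the load
    print(scale_up(small, extra_cpus=2))        # one VM becomes more powerful
```

Both calls raise the total capacity to four CPUs, but horizontal scaling does so by changing the number of VMs, whereas vertical scaling changes the size of a single VM.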

Fault tolerance, as defined by Ganesh et al. (2014), is the ability of the cloud environment to handle unanticipated changes, such as hardware failure, software defects or network congestion. Two standard policies, namely proactive and reactive fault tolerance, can be used for real-time cloud applications. Proactive fault tolerance can predict faults, errors and failures, and once a suspicious component has been detected, it will be replaced proactively. Proactive fault tolerance techniques include pre-emptive migration, software rejuvenation and self-healing.

Reactive fault tolerance reduces the effect of a failure on the applications being executed once the failure has actually occurred. Examples of reactive fault tolerance techniques are checkpointing or restart, replication, job migration and task resubmission. In this paper, the focus is on implementing a reactive fault-tolerance policy for computation failure, which involves hardware or infrastructure failure.
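Two of the reactive techniques named above, replication and task resubmission, can be sketched together as follows. This is an illustrative sketch under stated assumptions rather than the implementation evaluated in the paper: each VM replica is modelled as a plain callable, and `VmFailure`, `run_with_resubmission`, `faulty_vm` and `healthy_vm` are hypothetical names.

```python
from typing import Callable, Sequence, Tuple


class VmFailure(Exception):
    """Raised by a (simulated) VM that cannot complete a task."""


def run_with_resubmission(
    task: str, replicas: Sequence[Callable[[str], str]]
) -> Tuple[str, int]:
    """Submit the task to each replica in turn until one succeeds.

    Returns the result together with the number of attempts made.
    Raises VmFailure if every replica fails.
    """
    attempts = 0
    for vm in replicas:
        attempts += 1
        try:
            return vm(task), attempts
        except VmFailure:
            # Reactive step: the failure has already happened, so the
            # task is resubmitted to the next replica.
            continue
    raise VmFailure(f"all {attempts} replicas failed for task {task!r}")


def faulty_vm(task: str) -> str:
    raise VmFailure("simulated hardware failure")


def healthy_vm(task: str) -> str:
    return f"done:{task}"


if __name__ == "__main__":
    result, attempts = run_with_resubmission("job-1", [faulty_vm, healthy_vm])
    print(result, attempts)
```

The sketch reacts only after a failure is observed, which is what distinguishes it from proactive techniques that replace a suspicious component before it fails.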


Conclusion

In this paper, we have presented the design and implementation of a system that can perform VM scaling, replication and task retry. We developed a scaling and fault tolerance (SFT) algorithm to deploy preventive measures or take remedial action based on QoS decision outcomes with regard to response time and throughput. We conducted experiments based on four scenarios to measure the effectiveness of the algorithm in handling events such as faulty VMs and over- and under-provisioning. Our experimental results show that the algorithm was effective 90% to 100% of the time when handling probable violation events using a scaling technique as the preventive measure; when taking remedial action using replication and task resubmission as fault tolerance techniques; and in resolving over-provisioning. The SFT algorithm, together with the 16 decision rules, thus contributes to an additional aspect of detection, prediction, prevention and rectification measures of response time and throughput for cloud QoS violations.

Supplementary Material
Filename: R3-EECE-3252-20190913171809.docx
Description: list of tables & figures
References

Aruna, L., Aramudhan, M., 2016. Framework for Ranking Service Providers of Federated Cloud Architecture using Fuzzy Sets. International Journal of Technology, Volume 7(4), pp. 643–653

AWS (Amazon Web Services), 2018. Amazon EC2 T2 Instances – Amazon Web Services, Inc. Available Online at https://aws.amazon.com/ec2/instance-types/t2/, Accessed on July 20, 2018

Bardsiri, A.K., Hashemi, S.M., 2014. QoS Metrics for Cloud Computing Services Evaluation. International Journal of Intelligent Systems and Applications, Volume 6(12), pp. 27–33

Calheiros, R.N., Ranjan, R., Beloglazov, A., De Rose, C.A.F., Buyya, R., 2010. CloudSim: A Toolkit for Modeling and Simulation of Cloud Computing Environments and Evaluation of Resource Provisioning Algorithms. Software Practice and Experience, Volume 41(1), pp. 23–50

Filho, M.C.S., Oliveira, R.L., Monteiro, C.C., Inacio, P.R.M., Freire, M.M., 2017. CloudSim Plus: A Cloud Computing Simulation Framework Pursuing Software Engineering Principles for Improved Modularity, Extensibility and Correctness. In: Proceedings of IFIP/IEEE International Symposium on Integrated Network Management, 8-12 May, 2017, pp. 400–406

Ganesh, A., Sandhya, M., Shankar, S., 2014. A Study on Fault Tolerance Methods in Cloud Computing. In: Proceedings of IEEE International Advance Computing Conference (IACC), pp. 844–849

Kuperberg, M., Herbst, N., Kistowski, J.V., Reussner, R., 2011. Defining and Quantifying Elasticity of Resources in Cloud Computing and Scalable Platforms. In: Karlsruhe Reports in Informatics, Volume 16, pp. 1–17

Lehrig, S., Eikerling, H., Becker, S., 2015. Scalability, Elasticity, and Efficiency in Cloud Computing: A Systematic Literature Review of Definitions and Metrics. In: Proceedings of the 11th International ACM SIGSOFT Conference on Quality of Software Architectures - QoSA '15, pp. 83–92

Mell, P., Grance, T., 2011. The NIST Definition of Cloud Computing. NIST Special Publication 800-145, pp. 1–3

Wong, T.S., Chan, G.Y., Chua, F.F., 2018. A Machine Learning Model for Detection and Prediction of Cloud Quality of Service Violation. In: International Conference on Computational Science and Its Applications (ICCSA), LNCS, pp. 498–513

Wong, T.S., Chan, G.Y., Chua, F.F., 2019. Adaptive Preventive and Remedial Measures in Resolving Cloud Quality of Service Violation. In: Proceedings of IEEE 33rd International Conference on Information Networking (ICOIN 2019), 9-11 January 2019, Kuala Lumpur, Malaysia, pp. 473–479

Zheng, Z., Zhang, Y., Lyu, M.R., 2014. Investigating QoS of Real-World Web Services. In: IEEE Transactions on Services Computing, Volume 7(1), pp. 32–39