For over 35 years, NHLS has been a robust source for enterprise technology and software training solutions offering industry-leading learning content. They provide computer courses and certifications to more than 30 million students through in-person and online learning experiences.
Understanding the challenges
NHLS turned to Galaxy to determine how much load the platform could withstand under specific user scenarios across different web pages. The target was for the system to support 10,000 concurrent users. They had also raised concerns about the performance of their learning platform during user interaction.
They wanted us to run performance tests at higher load volumes and implement the measures needed to optimize website load times and ensure zero downtime on the busiest days.
Test planning and implementation
We developed an in-depth understanding of the client’s system architecture and platform. We used JMeter to simulate heavy loads on virtual servers and networks in order to check their strength, test their ability to handle heavy loads, and determine system performance under a variety of loads.
We started with 1000 users. Reports from the regression and stress tests made it clear that the web app was not optimized: even after the FMP (first meaningful paint), load times were far from what we expected. Servers were running out of capacity after only a few requests, which was not acceptable for the server architecture NHLS already had.
The application's concurrency target was 10,000 users, but it was initially crashing at 100 users. To identify the bottlenecks that caused performance to degrade, we defined a few performance test objectives:
- Response Time: The time between a specific request and its corresponding response. A user search should not take more than 2 seconds.
- Throughput: How much bandwidth is used during performance testing. Application servers should be able to serve the maximum number of requests per second.
- Resource Utilization: All resources, such as processor and memory utilization and network input/output, should stay below 70% of their maximum capacity.
- Maximum User Load: The system should handle a load of 10,000 concurrent users without breaking the database, while meeting all of the objectives defined above.
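As an illustration of how objectives like these can be checked against test output, here is a minimal Python sketch. The `LoadTestResult` fields, the `evaluate` function, and the choice of the 95th percentile as the response-time measure are assumptions made for this example; they are not taken from the project's actual JMeter reports.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LoadTestResult:
    """Hypothetical summary of one load test run."""
    response_times_s: List[float]  # one entry per completed request
    duration_s: float              # wall-clock length of the run
    peak_cpu_pct: float            # highest CPU utilization observed
    peak_memory_pct: float         # highest memory utilization observed


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def evaluate(result: LoadTestResult,
             max_response_s: float = 2.0,
             max_utilization_pct: float = 70.0) -> Dict[str, object]:
    """Compare a run against the response-time and utilization objectives."""
    p95 = percentile(result.response_times_s, 95)
    throughput = len(result.response_times_s) / result.duration_s
    peak = max(result.peak_cpu_pct, result.peak_memory_pct)
    return {
        "p95_response_s": p95,
        "throughput_rps": throughput,
        "response_time_ok": p95 <= max_response_s,
        "utilization_ok": peak < max_utilization_pct,
    }


# Made-up sample numbers, for illustration only.
sample = LoadTestResult([0.5, 0.8, 1.2, 1.9, 2.5], 10.0, 65.0, 60.0)
report = evaluate(sample)
print(report)
```

With these made-up numbers the run stays within the 70% utilization ceiling but misses the 2-second response-time objective, because the slowest request pulls the 95th percentile above the limit.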
Bottlenecks we encountered and the Solutions we provided
We used JMeter to start testing with 100 users and then ramped up progressively to heavier loads. We performed real-time analysis and followed up with more thorough analysis using a variety of tests, including load, smoke, spike, and soak tests.
To get to grips first with inaccurate page loads and slow page speed, we decided to test load on a per-page basis. We brought in our team of developers and network/server engineers to look into the bottlenecks and resolve the issues until we got the expected results.
Bottleneck #1: Obsolete code
Adding new features on top of an old code architecture had accumulated unnecessary JS and CSS files, controllers, and models on every page. The website was carrying cumbersome, resource-heavy elements and code throughout, which made page loads worse.
Solution:
We minified static assets (JavaScript, CSS, and images), that is, we optimized scripts and removed unnecessary characters, comments, and white space from the code to shrink file sizes. To further improve page speed, the server team set up caching of static code, which reduced the website's bandwidth usage.
This significantly reduced the size of the requested assets and improved page speed: the home page now took only 2 seconds to load.
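To make the idea of minification concrete, the following is a deliberately naive Python sketch that strips comments and white space from a CSS string. It is not the tool used on the project; `minify_css` is a hypothetical helper, and it does not handle edge cases such as quoted strings or selectors where white space is significant, which dedicated minifiers do.

```python
import re


def minify_css(css: str) -> str:
    """Naive CSS minifier: removes comments and redundant white space."""
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.DOTALL)  # drop /* comments */
    css = re.sub(r"\s+", " ", css)                        # collapse white space
    css = re.sub(r"\s*([{}:;,])\s*", r"\1", css)          # trim around punctuation
    css = css.replace(";}", "}")                          # drop last semicolon in a block
    return css.strip()


original = """
/* header */
body {
    margin: 0;
    color: #333;
}
"""

minified = minify_css(original)
print(minified)
print(len(original), "->", len(minified), "characters")
```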
Bottleneck #2: Memory
A single query was processing more data than needed, mainly by accessing too many rows and columns from many parts of the database. With large tables, this meant that a large number of rows were being read from disk and handled in memory, causing a heavier I/O workload.
Solution:
We used RDS Performance Insights to quickly assess the load on the database, determine when and where to take action, and filter the load by waits, SQL statements, hosts, or users.
We added indexes and removed redundant indexes and unnecessary data from the tables, so that data could be located quickly without scanning every row each time a table is accessed. The server team used the InnoDB storage engine for MySQL, which organizes data on disk to optimize common queries based on primary keys and so minimizes I/O time (the number of reads required to retrieve the desired data).
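The effect of an index on a lookup can be seen in a small, self-contained sketch. It uses Python's built-in SQLite module as a stand-in for the MySQL/InnoDB database described above, and the `enrollments` table, its columns, and the index name are hypothetical. The point is only that the query plan changes from a full scan to an index search once the index exists.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params=()) -> str:
    """Return SQLite's plan for a query as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the plan text


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enrollments ("
    "id INTEGER PRIMARY KEY, student_id INTEGER, course TEXT)"
)
conn.executemany(
    "INSERT INTO enrollments (student_id, course) VALUES (?, ?)",
    [(i % 1000, f"course-{i % 50}") for i in range(10_000)],
)

sql = "SELECT course FROM enrollments WHERE student_id = ?"

# Without an index, the engine has to scan the whole table.
plan_before = query_plan(conn, sql, (42,))

# With an index on the filtered column, it can jump to matching rows.
conn.execute("CREATE INDEX idx_enrollments_student ON enrollments (student_id)")
plan_after = query_plan(conn, sql, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

On MySQL, the equivalent check would be to run `EXPLAIN` on the query before and after adding the index.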
Bottleneck #3: CPU
Nested loops used to process large data sets made the flow of the code difficult to trace and fired a large number of requests (1-10k) at the database for a single user. The same code executed many times in the same execution context, driving up CPU usage until it hit the limit.
Solution:
We optimized query performance by removing unnecessary code from loops (by turning queries into sub queries) and eliminating multiple loops. This cut the time spent rendering content from looped code, so that a single user now sent only 100 requests. It reduced page size and response time, and brought CPU resources and memory on the application server down from 8GB to 4GB.
Ridding the code of redundancies and optimizing the database helped us reach the 5000-user traffic mark. It also took extra work off the MySQL server, reducing server cost to 10-20%.
We launched a single server on AWS and configured all the required packages, such as Apache, PHP and PHP-FPM, a load balancer, and others, to run our application.
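The general pattern of replacing a query inside a loop with one set-based query can be sketched as follows. This is an illustration only: it uses SQLite instead of MySQL, a JOIN with GROUP BY rather than the sub queries mentioned above, and hypothetical `courses` and `lessons` tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE courses (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE lessons (id INTEGER PRIMARY KEY, course_id INTEGER, title TEXT);
""")
conn.executemany(
    "INSERT INTO courses (id, title) VALUES (?, ?)",
    [(i, f"Course {i}") for i in range(1, 101)],
)
conn.executemany(
    "INSERT INTO lessons (course_id, title) VALUES (?, ?)",
    [(c, f"Lesson {n}") for c in range(1, 101) for n in range(1, 6)],
)


def lesson_counts_looped(conn):
    """One query for the course list, then one more query per course."""
    queries = 1
    course_ids = [row[0] for row in conn.execute("SELECT id FROM courses")]
    counts = {}
    for course_id in course_ids:
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM lessons WHERE course_id = ?", (course_id,)
        ).fetchone()
        counts[course_id] = count
        queries += 1
    return counts, queries


def lesson_counts_joined(conn):
    """The same result from a single set-based query."""
    rows = conn.execute(
        "SELECT c.id, COUNT(l.id) FROM courses c "
        "LEFT JOIN lessons l ON l.course_id = c.id GROUP BY c.id"
    ).fetchall()
    return dict(rows), 1


looped_counts, looped_queries = lesson_counts_looped(conn)
joined_counts, joined_queries = lesson_counts_joined(conn)
print(looped_queries, "queries vs", joined_queries)
```

Both functions return the same counts, but the looped version issues one query per course, so its database round trips grow with the size of the data set.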
Bottleneck #4: Network Utilization
The older HTTP/1 protocol was using more than one TCP connection to send and receive each request/response pair. It consumed many resources because the web page made a separate request for each file. As the overload continued, the server had to process more and more concurrent requests, which further increased latency.
Solution:
We used HTTP/2 to reduce latency by processing browser requests over a single TCP connection. Enabling Keep-Alive avoided the need to repeatedly open and close connections, which reduced server latency by minimizing the number of round trips between sender and receiver. Parallelized transfers let more requests complete more quickly, improving load time.
- To identify slow-log queries and requests with long execution times in the code, we set up a proxy connection between the Apache web server and PHP-FPM (which had previously communicated through modules), so that each could function independently and its bottlenecks could be isolated. We then configured PHP-FPM according to RAM capacity, by calculating the maximum number of parallel connections the RAM could handle while still leaving system memory free for other processing.
- We found inadequate server capacity when inserting data in the logged-in and not-logged-in scenarios used to create a real-life testing environment.
We proposed a distributed server system in which more than one server could be generated automatically. We added auto scaling and four servers, but the system was still burning out at a load of 8k users, and server cost increased. With Round Robin load balancing, we distributed incoming network traffic (client requests) across the group of backend servers. This helped us identify that the load was increasing because sessions stored in the database were not being processed properly.
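The PHP-FPM sizing step mentioned above comes down to simple arithmetic: the memory left after reserving some for the operating system and other services, divided by the average memory of one PHP-FPM worker, gives an upper bound for the `pm.max_children` setting. The sketch below assumes a 4GB server with 1GB reserved and 60MB per worker; these figures are illustrative and were not measured on the project.

```python
def max_children(total_ram_mb: int, reserved_mb: int, avg_process_mb: int) -> int:
    """Upper bound on parallel PHP-FPM workers that fit in RAM."""
    if avg_process_mb <= 0:
        raise ValueError("avg_process_mb must be positive")
    available_mb = total_ram_mb - reserved_mb
    return max(0, available_mb // avg_process_mb)


# Assumed figures: 4096 MB total, 1024 MB kept free, 60 MB per worker.
workers = max_children(total_ram_mb=4096, reserved_mb=1024, avg_process_mb=60)
print(f"pm.max_children = {workers}")
```

In practice the per-worker figure should be measured under realistic load, since it depends on the application and the PHP extensions in use.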
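Round Robin distribution itself is simple: each incoming request goes to the next server in a fixed rotation. A minimal Python sketch, using a hypothetical `RoundRobinBalancer` class and made-up backend names, looks like this:

```python
from collections import Counter
from itertools import cycle
from typing import Iterable


class RoundRobinBalancer:
    """Hands out backends in a fixed, repeating order."""

    def __init__(self, backends: Iterable[str]):
        backends = list(backends)
        if not backends:
            raise ValueError("at least one backend is required")
        self._rotation = cycle(backends)

    def next_backend(self) -> str:
        return next(self._rotation)


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3", "app-4"])
assigned = [balancer.next_backend() for _ in range(8)]
print(assigned)
print(Counter(assigned))
```

Because the rotation ignores how busy each server is, Round Robin spreads requests evenly but cannot compensate for a shared bottleneck behind the servers, such as the session storage described next.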
Bottleneck #5: Session queues
The server was getting overloaded by accumulating too many sessions when we ran a load of 10k users logging in concurrently. Because the sessions were stored in a database, the increase in wait activity decreased transaction throughput and pushed session time up to 100s, further increasing the load on the system.
Solution:
We switched session storage from the database to a Memcache server. It stored sessions and queries in memory instead of in files, reducing the number of times the database or API had to be read while performing operations. It cached the data in the RAM of the different nodes in the cluster, reducing the load on the web server.
Building this scalable, cost-efficient server infrastructure helped the client application reach a load of 10k users in less than 5 mins using the capacity of only 2 servers.
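The idea behind moving sessions out of the database can be sketched with a small in-memory store that expires entries after a time-to-live. This is a single-process stand-in written for illustration, not a Memcache client; the `InMemorySessionStore` class and the 30-minute default TTL are assumptions.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class InMemorySessionStore:
    """Keeps sessions in RAM and forgets them once their TTL has passed."""

    def __init__(self, ttl_seconds: float = 1800.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[float, Any]] = {}

    def save(self, session_id: str, data: Any) -> None:
        self._items[session_id] = (self._clock() + self._ttl, data)

    def load(self, session_id: str) -> Optional[Any]:
        entry = self._items.get(session_id)
        if entry is None:
            return None
        expires_at, data = entry
        if self._clock() >= expires_at:
            del self._items[session_id]  # expired: drop it and report a miss
            return None
        return data


# A fake clock makes expiry easy to demonstrate without waiting.
now = [0.0]
store = InMemorySessionStore(ttl_seconds=60.0, clock=lambda: now[0])
store.save("abc123", {"user_id": 7})
fresh = store.load("abc123")
now[0] = 61.0
expired = store.load("abc123")
print(fresh, expired)
```

Reads and writes here never touch the database, which is the property that removed session lookups from the database's wait queue.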
The testing process was able to ensure a smooth customer experience and save significant capital expense by maximizing server capacity already in place.