We have a dedicated server for student WordPress sites, which serve as presentations of learning, or portfolios. The goal is to give each student a space where they can publish reflections on their learning and create a digital artifact to show their mastery of course material. We used to do this with AFS home mounts, where students would create a series of HTML pages. However, once I left the classroom there was no one to pick up teaching Dreamweaver, and the Apple AFS server would max out at around 90 connections, so students who started working later would get unusable timeouts. So we created a locally hosted WordPress multisite to serve instead.
The local hardware was not able to handle the load of 800+ simultaneous connections, so we decided to move the setup to AWS. Thus began a painful journey to understand Elastic Block Store (EBS) volume queue lengths. Moving the server was pretty simple, and with security groups we were able to lock the site down so it only served traffic from the school’s IP address. The initial setup used a Relational Database Service (RDS) t2.small MySQL instance, an Elastic Compute Cloud (EC2) t2.small instance for the server, and a 10 GB General Purpose SSD (gp2) burstable volume. Once it was all set up, we used Apache JMeter to simulate a load of 120 users per grade level per school site, or around 800 simultaneous users.
As you might imagine, the server became non-responsive within 10 minutes. The EC2 instance ran out of memory and had to be hard rebooted. In my mind’s eye, I have this vision of the server in flames. Sigh.
So, time to start pulling threads.
The first step was to tune the Apache server. This is not my main area of expertise, so after googling around, I found some tuning lines to help it stay responsive under load:
KeepAlive On

<IfModule prefork.c>
StartServers 8
MinSpareServers 5
MaxSpareServers 20
MaxClients 512
MaxRequestsPerChild 4000
</IfModule>

<IfModule prefork.c>
StartServers 4
MinSpareServers 2
MaxSpareServers 5
ServerLimit 100
MaxClients 100
MaxRequestsPerChild 1000
</IfModule>
I still need to understand what each of these lines does, but under pressure, this seemed to help.
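For future reference, here is my current reading of the first prefork block above, annotated from the Apache docs; the comments are my understanding of each directive, not gospel:

```apache
<IfModule prefork.c>
    StartServers          8     # child processes launched at startup
    MinSpareServers       5     # keep at least this many idle children ready
    MaxSpareServers      20     # kill off idle children above this count
    MaxClients          512     # cap on simultaneous requests; extras queue
    MaxRequestsPerChild 4000    # recycle a child after this many requests
                                # (limits the damage from memory leaks)
</IfModule>
```

One thing I've learned since: MaxClients times the per-process memory footprint has to fit in RAM, or the box starts swapping and eventually OOMs, which is exactly what happened to us.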
I also pushed up the EC2 instance size so it wouldn’t run out of memory. We use an m4.large, which appears to keep about 6 GB of memory free under load, so the instance remains responsive. This was before I learned about scaling out rather than up, but at $998 for 3 years, this is a pretty beefy server for our normal operational needs.
The RDS instance was, of course, way too small to handle the load of 200+ simultaneous connections, so we reserved an r3.large instance. It handles the connections, and the freeable memory never seems to move, so it has become the primary database. We had been scaling it up in the morning and down after school, which saved us some money, but reserving it for 3 years for about $2,600 made more sense. Our connection times remain about 3 seconds under normal load and around 10 seconds under full load. This is not ideal, as students tend to get impatient and retry multiple times, which means instead of 800 connections we would get around 3,000, making a bad situation worse.
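For my own budgeting sanity, here's the back-of-the-envelope math on what those two 3-year reservations work out to per month (rough integer math; real bills add storage, bandwidth, and so on):

```shell
# Effective monthly cost of the 3-year reservations quoted above
ec2_total=998    # m4.large, 3-year reservation
rds_total=2600   # r3.large, 3-year reservation
months=36
echo "EC2 m4.large: ~\$$(( ec2_total / months ))/month"
echo "RDS r3.large: ~\$$(( rds_total / months ))/month"
```

That comes out to roughly $27/month for the web server and $72/month for the database, which is hard to beat against scaling the RDS instance up and down by hand every school day.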
The big issue was the EBS volume. It took a ticket to the help desk to understand, but our I/O burst credits were being exhausted in less than 5 minutes and would take the rest of the day to recover. For a while we invested in a 500 GB Provisioned IOPS volume (1,500 IOPS), and that seemed to handle the load, but maintaining that volume was a few-hundred-dollars-a-month problem for a service that only got hit on Monday mornings, during the advisory period when students work on their portfolios.
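To make sense of the credit exhaustion, it helps to run the numbers the AWS EBS docs give for gp2: baseline performance is 3 IOPS per GB with a 100 IOPS floor, bursts go up to 3,000 IOPS, and each volume starts with a bucket of 5.4 million I/O credits:

```shell
# gp2 burst-bucket math (figures from the AWS EBS documentation)
size_gb=10
baseline=$(( size_gb * 3 ))            # 3 IOPS per GB...
[ "$baseline" -lt 100 ] && baseline=100  # ...with a 100 IOPS floor
bucket=5400000                         # starting I/O credit balance
burst=3000                             # max burst IOPS
secs=$(( bucket / (burst - baseline) ))  # net drain rate at full burst
echo "${size_gb} GB gp2 sustains full burst for ~$(( secs / 60 )) minutes"
```

That idealized ~31 minutes is a lot longer than the sub-5-minute exhaustion we actually saw, which is part of why the queue-length ticket was so confusing; presumably the credit balance was already partly drained before the Monday spike hit.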
Luckily we are hosted in a region that supports Elastic File System (EFS), so I pushed the data from the EBS volume to an EFS volume and updated /etc/fstab to auto-mount it at /home/www. This gave us the ability to create an autoscaling group and experiment with load balancing, though the m4.large instance is strong enough to handle the load on its own. However, there is still a noticeable lag when connecting to the server, even though it doesn’t crumple to dust anymore.
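For reference, the fstab entry looks something like the following; the filesystem ID and region are placeholders for your own, and the mount options are the NFSv4.1 settings AWS recommends for EFS:

```
# /etc/fstab -- auto-mount the EFS share at /home/www (fs ID is a placeholder)
fs-12345678.efs.us-east-1.amazonaws.com:/  /home/www  nfs4  nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,_netdev  0  0
```

The _netdev option matters here: it tells the boot process to wait for networking before attempting the mount, so the instance doesn't hang trying to reach EFS too early.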
Today’s experiment is to figure out where that lag is: whether I am exhausting the root volume’s credits (a 20 GB gp2 burstable volume), hitting a limit in EFS, or whether I could get better performance from a RAID 0 configuration of two 10 GB gp2 volumes. Once that is identified, I need to figure out which approach is more cost-effective: spinning up RAID volumes or relying on the EFS volume for scaling out.
The RAID was very simple to set up, thanks to these instructions. I initialized the RAID, copied over the data from the EFS volume, unmounted the EFS volume, and mounted the RAID volume at /home/www. It worked flawlessly, and the data transferred very quickly. It would not be too difficult to script the initialization of more RAID volumes and attach them to a new instance built from my server’s AMI, so that will be interesting to try later.
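As a first pass at that script, something like this should work. The device names /dev/xvdf and /dev/xvdg are assumptions (check lsblk after attaching the volumes), and it deliberately bails out if the volumes aren't attached:

```shell
#!/bin/bash
# Sketch: build a RAID 0 array from two attached EBS volumes and mount
# it at /home/www. Device names are assumptions -- verify with lsblk.
ready=1
for dev in /dev/xvdf /dev/xvdg; do
  if [ ! -b "$dev" ]; then
    echo "$dev not attached; skipping RAID build"
    ready=0
  fi
done
if [ "$ready" -eq 1 ]; then
  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdf /dev/xvdg
  mkfs -t ext4 /dev/md0      # new filesystem on the fresh array
  mkdir -p /home/www
  mount /dev/md0 /home/www
fi
```

Persisting the array across reboots would also need an entry in mdadm.conf and a matching fstab line, which I've left out of this sketch.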
I’m using JMeter to run a sustained load against four of the WordPress installations, just hitting the front page.
Test 1: EFS
The test ran for 1 hour:
16:24 – CPU 0%, free memory 6866, 7587
17:27 – CPU spiked to 92%, free memory 5648, 6333.
Pages loaded with a 10 second lag under the test load at the end of the hour.
Observations: the EBS root volume’s burst credit balance never went down, so the load isn’t being felt there. The EFS metrics didn’t budge either, and the database worked fine. The CPU averaged about 50%, with spikes as the test load went from one page to another. No clear evidence as to where the bottleneck might be.
Test 2: RAID 0
The test ran for an hour:
18:00 – CPU 0%, free memory 6849, 7464
19:03 – CPU 89%, free memory 5934, 6619, 10 second lag
Observations: the RAID didn’t even notice the load, and there was no depletion of the credit balances of the RAID volumes or the root volume. The EC2 instance and database performed in similar fashion, so wherever the lag is, it isn’t in the EFS or EBS volumes.
So, while the experiment didn’t answer my question, it was interesting to see how the EFS and RAID volumes held up. I’m going to spend more time figuring out the Apache tuning, and see if I can script the automatic creation of a RAID volume so it works with an AMI-generated instance. Time to shut all this down 🙂