RAID vs EFS experiments with WordPress Multisites

We have a dedicated server for student WordPress sites, which are used as presentations of learning, or portfolios. The goal is to give each student a space where they can publish reflections on their learning and create a digital artifact to show their mastery of course material. We used to do this using AFS home mounts where the students would create a series of HTML pages. However, once I left the classroom there was no one to pick up teaching Dreamweaver, and the Apple AFS server would max out around 90 connections, so the students who started working later would get unusable timeouts. So, we created a locally hosted WordPress multisite to serve the portfolios instead.

The local hardware was not able to handle the load of 800+ simultaneous connections, so we decided to move the setup to AWS. Thus began a painful journey to understand Elastic Block Store (EBS) volume queue lengths. Moving the server was pretty simple, and with security groups we were able to lock the site down so it only served traffic from the school’s IP address. The initial setup used a Relational Database Service (RDS) t2.small MySQL instance, an Elastic Compute Cloud (EC2) t2.small instance for the server and a 10 GB General Purpose SSD (gp2) burstable volume. Once it was all set up, we used Apache JMeter to simulate a load of 120 users per grade level per school site, or around 800 simultaneous users.

As you might imagine, the server became non-responsive within 10 minutes.  The EC2 instance ran out of memory and had to be hard rebooted. In my mind’s eye, I have this vision of the server in flames. Sigh.

So, time to start pulling threads.

The first step was to tune the Apache server. This is not my main area of expertise, so after googling about, I found some tuning lines to help it be more responsive under load:

KeepAlive On

<IfModule prefork.c>
 StartServers 8
 MinSpareServers 5
 MaxSpareServers 20
 MaxClients 512
 MaxRequestsPerChild 4000
</IfModule>

# Same prefork test as the block above, so these later values are the ones that take effect.
<IfModule prefork.c>
 StartServers 4
 MinSpareServers 2
 MaxSpareServers 5
 ServerLimit 100
 MaxClients 100
 MaxRequestsPerChild 1000
</IfModule>

I still need to understand what each of these lines does, but under pressure, this seemed to help.
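For what it’s worth, a common rule of thumb for the prefork MaxClients value is to divide the memory you can spare for Apache by the size of one Apache child process. Here is a minimal Python sketch of that arithmetic. The function name and all three numbers are my own assumptions for illustration: 2048 MB is the memory of a t2.small, while the 512 MB reserve and 60 MB per child are placeholders, not measurements from this server.

```python
def estimate_max_clients(total_mb, reserved_mb, per_process_mb):
    """Rule-of-thumb MaxClients: spare memory divided by one child's size."""
    if per_process_mb <= 0:
        raise ValueError("per_process_mb must be positive")
    # Memory left for Apache after the OS and other services take their share.
    spare_mb = max(total_mb - reserved_mb, 0)
    return spare_mb // per_process_mb


# Assumed inputs: 2048 MB total, 512 MB kept back, 60 MB per Apache child.
print(estimate_max_clients(2048, 512, 60))
```

To replace the placeholder per-process figure with a real one, look at the resident memory of the running Apache children on the server (for example with ps or top) while it is under load.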

I also pushed up the EC2 instance size so it wouldn’t run out of memory. We use an m4.large, which appears to keep about 6 GB of memory free under load, so the instance remains responsive. This was before I learned about scaling out rather than up, but for $998 for 3 years, this is a pretty beefy server for our normal operational needs.

The RDS was, of course, way too small to handle the load of 200+ simultaneous connections, so we reserved an r3.large instance. It handles the connections and the freeable memory never seems to move, so it has become the primary database. We were scaling it up in the morning and down after school, which saved us some money, but reserving it for 3 years for about $2600 made more sense. Our connection times remain about 3 seconds under normal load and around 10 seconds under full load. This is not ideal, as students tend to get impatient and try again multiple times, which means instead of 800 connections we would get around 3000, so it made a bad situation worse.

The big issue was the EBS volume. It took a ticket to the help desk to understand, but our I/O burst credits were being exhausted in less than 5 minutes and would take the rest of the day to recover. For a while we invested in a 500 GB provisioned IOPS volume (1,500 IOPS), and that seemed to handle the load, but it cost a few hundred dollars a month for a service that only got hit on Monday mornings, during the advisory period when students would work on their portfolios.
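The credit behaviour on a small gp2 volume can be sketched with a little arithmetic. This is a rough Python model, assuming the published gp2 figures as I understand them (3 IOPS per GiB with a 100 IOPS floor, a 3,000 IOPS burst ceiling, and a 5.4 million credit bucket). The function names are mine, and the refill estimate assumes the volume sits completely idle.

```python
GP2_BUCKET_CREDITS = 5_400_000   # I/O credits in a full gp2 burst bucket
GP2_BURST_IOPS = 3_000           # burst ceiling for small gp2 volumes
GP2_MIN_BASELINE_IOPS = 100      # floor on the baseline rate
GP2_IOPS_PER_GIB = 3             # baseline IOPS earned per GiB of volume size


def gp2_baseline_iops(size_gib):
    """Baseline IOPS for a gp2 volume of the given size."""
    return max(GP2_MIN_BASELINE_IOPS, GP2_IOPS_PER_GIB * size_gib)


def seconds_to_drain(size_gib, demand_iops, starting_credits=GP2_BUCKET_CREDITS):
    """Seconds until the bucket is empty under a steady demand."""
    baseline = gp2_baseline_iops(size_gib)
    demand = min(demand_iops, GP2_BURST_IOPS)
    if demand <= baseline:
        return float("inf")  # demand fits within baseline, credits never drain
    # Credits are spent at the demand rate and earned back at the baseline rate.
    return starting_credits / (demand - baseline)


def seconds_to_refill(size_gib):
    """Seconds to refill an empty bucket, assuming the volume is idle."""
    return GP2_BUCKET_CREDITS / gp2_baseline_iops(size_gib)


print(seconds_to_drain(10, 3_000) / 60)   # minutes of full burst on 10 GiB
print(seconds_to_refill(10) / 3600)       # hours to refill on 10 GiB
```

Under these assumptions a full bucket on a 10 GiB volume lasts about 31 minutes at full burst and takes about 15 hours to refill, which lines up with credits needing the rest of the day to recover. Passing a smaller starting_credits value shows how a partly drained bucket empties much sooner.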

Luckily we are hosted in a region that supports Elastic File System (EFS), so I pushed the data from the EBS volume to an EFS file system and updated /etc/fstab to mount it automatically at /home/www. This gave us the ability to create an autoscaling group and experiment with load balancing, but the m4.large instance is strong enough to handle the load. However, there is still a noticeable lag when connecting to the server, even though it doesn’t crumple to dust anymore.

Today’s experiment is to figure out where that lag is: whether I am exhausting the root volume’s credits (a 20 GB gp2 burstable volume), hitting a limit in EFS, or could get better performance from a RAID 0 configuration of two 10 GB gp2 volumes. Once that is identified, I need to figure out which approach is more cost effective: spinning up RAID volumes or relying on the EFS volume for scaling out.

The RAID was very simple to set up, thanks to these instructions. I initialized the RAID, copied over the data from the EFS volume, unmounted the EFS volume and mounted the RAID volume at /home/www. It worked flawlessly, and the data transferred very quickly. It would not be too difficult to script initializing more RAID volumes and attaching them to a new instance built from my server’s AMI, so that will be interesting to try later.
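As a starting point for scripting this, here is a small Python sketch that builds the usual mdadm command sequence for a RAID 0 array without running anything. It is not the set of instructions I followed. The device names /dev/xvdf and /dev/xvdg, the array name /dev/md0 and the ext4 filesystem are all assumptions that would need to match the real instance, and the function name is made up for this example.

```python
def raid0_commands(devices, md_device="/dev/md0", mount_point="/home/www"):
    """Return the commands to create, format and mount a RAID 0 array.

    Nothing is executed here; the caller decides whether to print or run them.
    """
    if len(devices) < 2:
        raise ValueError("RAID 0 needs at least two devices")
    return [
        # Stripe the given block devices into a single md array.
        ["mdadm", "--create", md_device, "--level=0",
         f"--raid-devices={len(devices)}", *devices],
        # Put a filesystem on the new array (ext4 is an assumption).
        ["mkfs.ext4", md_device],
        # Make sure the mount point exists, then mount the array there.
        ["mkdir", "-p", mount_point],
        ["mount", md_device, mount_point],
    ]


# Hypothetical device names for two attached EBS volumes.
for cmd in raid0_commands(["/dev/xvdf", "/dev/xvdg"]):
    print(" ".join(cmd))
```

Printing the commands first makes it easy to check them by eye before handing each list to subprocess.run as root on a fresh instance.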

I’m using JMeter to run a sustained load on four of the WordPress installations, just hitting the front page.

Test 1: EFS

The test ran for 1 hour:
16:24 – cpu 0%, free memory 6866, 7587
17:27 – cpu spike to 92%, free memory 5648, 6333.
Pages loaded with a 10 second lag under the test load at the end of the hour.

Observations: the EBS root volume’s burst credit level never went down, so the load isn’t being felt there. The EFS metrics didn’t budge either, and the database worked fine. The CPU averaged about 50%, with spikes as the test load went from one page to another. No clear evidence as to where the bottleneck might be.

Test 2: RAID 0

The test ran for an hour:
18:00 – cpu 0%, free memory 6849, 7464
19:03 – cpu 89%, free memory 5934, 6619.
Pages loaded with a 10 second lag under the test load at the end of the hour.

Observations: the RAID didn’t even notice the load, and there was no depletion of the credit balances of the RAID volumes or root volume. The EC2 instance and database performed in similar fashion, so whatever the lag is, it isn’t in the EFS or EBS volumes.

So, while the experiment didn’t answer my question, it was interesting to see how the EFS and RAID volumes held up. I’m going to spend more time figuring out the Apache tuning, and see if I can script the automatic creation of a RAID so it works with an AMI-generated instance. Time to shut all this down 🙂

LPIC1-101 Harder than I thought

This was one of the more difficult exams for me, nearly as tough as the SA Pro, mainly because of the nitnoid detail that you have to know off the top of your head instead of looking up the options in the man pages. On a scale of 0-800, with 500 being a pass, I got a 570, which is a humbling experience. It is always good to be reminded that you may not be as smart as you think, and need to dive deep to learn all the details. Here are my scores:

System Architecture: 87%
Linux Installation and Package Management: 45%
GNU/Linux Commands: 80%
Devices, Filesystems, FHS: 66%

I’m a bit surprised about the last score, but the 45% in package management sounds about right, as I couldn’t remember the right flags or specific commands to do the alternative install activities. Outside of yum update -y and rpm -ivvh, I need to look everything up in the man pages.

The other challenge of this test is the fill-in-the-blank answers, which don’t allow for autocomplete or spelling errors. You know it or you don’t.

So, could have been better, but I’ll take this pass and focus even closer on the 102 exam, which I expect to take in 2-3 weeks.

My primary review course was, of course, from Linux Academy, and I also created a set of flash cards from their study guide.

I did also use Exam-Labs for practice, which gave me a chance to understand the question format. Wasn’t a great test taking tool, but worth the $10. I’ll probably buy a practice exam for the second one as well.

Preparing for the LPIC-1 Part 1

I’m taking a bit of a break from finishing the remaining AWS certifications. In an effort to validate my Linux skills, not all hard won through diligent googling, I am pursuing the Linux Professional Institute Certification plan. The first stop on the tour is the LPIC-1 Part 1 exam, which I am taking this Saturday. I’ve spent the last two weeks reviewing the Linux Academy videos, alternately trying to stay awake through the instructor’s accelerated chipmunk voice and listening hard for those pieces of common admin task gold that I haven’t had to do yet. It has been a grind, but I feel confident that I can beat this test easily, as I have had to do these tasks fairly regularly over the last twenty years. It is good to review all the switches and options, but I think I may go mad trying to memorize them all.

I wish there weren’t two tests for each level at $200 apiece, but they are good for 5 years, as opposed to AWS certs only lasting two. Just some last minute cram and jam through the study guide, and then take the test. I’m turning into quite the cert queen… If I am lucky, I can review for the second half and take it before the end of the year, maybe clear the LPIC-1 entirely and start working on LPIC-2, which should be a lot more interesting and include tasks I haven’t had to do yet, or only have done once or twice after googling for a solution.