Ah, the joys of cloud computing! One moment, you’re spinning up an EC2 instance, feeling like an all-powerful architect of the digital realm. The next, you’re furiously typing ssh ec2-user@your-instance-ip and getting stonewalled by a cold, heartless timeout.
This is the story of how an EC2 instance drove me to the brink of madness, how I recreated it from scratch (more than once!), and how I finally found the culprit—a sneaky little blocking command in my user data script. Buckle up.
The Mystery: EC2 Says “Nope” to SSH
I launched a new EC2 instance, configured the security group rules correctly (port 22 open, check!), verified the key pair, and made sure the instance was actually running. At first, I could SSH in without any problems. I installed some software (.NET 8, in this case) and added a script to the instance’s user data. Then I hit a wall. Suddenly, every attempt to SSH into the instance ended in despair. No response. Just an endless, infuriating timeout.
I did what any sane person would do—I destroyed the instance and created a new one. And yet, the problem persisted. At this point, I started questioning my entire existence.
The Clue: Works Fine on GCP?
As any good troubleshooter does, I started comparing notes. The exact same setup—same app, same script—worked fine on Google Cloud Platform (GCP). The instance booted up, I could SSH in, and everything was smooth. So why was AWS treating me like that?
And that was when the lightbulb went off. If the same script worked on GCP but not AWS, something specific to EC2 was causing the issue.
The Culprit: A Blocking Command in User Data
And then, after much trial and error, I found it. The problem was hiding inside my user data script:
#cloud-boothook
#!/bin/bash
export ASPNETCORE_ENVIRONMENT=Production
cd /home/ec2-user/catalog
dotnet catalog.dll
Seems harmless, right? But there’s a big problem here. The dotnet catalog.dll command starts my application and never returns, so it blocks the boot process. This means the SSH service never starts, and the instance just sits there, running my app but refusing to let me in.
AWS user data scripts execute at boot, and mine was marked #cloud-boothook, which cloud-init runs very early in the boot sequence, and on every boot rather than just the first. A command that never returns at that point holds up the rest of startup, including essential services like SSH. On GCP, this isn’t an issue because its startup script mechanism works differently: the script runs after the instance’s other startup processes are already up.
The Fix: Run Your App as a Service
The solution? Instead of running my app directly in the user data script, I needed to set it up as a service so that it runs after the instance has fully booted. Here’s how I did it:
- Remove the blocking command from user data.
- Create a systemd service for my app:
sudo nano /etc/systemd/system/catalog.service
And add the following:
[Unit]
Description=Catalog Service
After=network.target

[Service]
ExecStart=/usr/bin/dotnet /home/ec2-user/catalog/catalog.dll
Restart=always
User=ec2-user
WorkingDirectory=/home/ec2-user/catalog

[Install]
WantedBy=multi-user.target
- Enable and start the service:
sudo systemctl enable catalog.service
sudo systemctl start catalog.service
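For anyone who would rather not do these steps by hand over SSH, they can be sketched as a user data script that installs the service and then returns right away. This is a rough sketch, not what I actually ran. It assumes an ordinary shell-script user data (no #cloud-boothook header), which cloud-init runs once, late in the first boot. The UNIT_DIR override and the systemd guard are my own additions so the script can be dry-run on a machine that is not an EC2 instance. The Environment= line carries over the variable from my original script, since the unit above does not set it.

```shell
#!/bin/bash
# Sketch of a non-blocking user data script (plain shell script, NOT a
# #cloud-boothook). UNIT_DIR is a hypothetical override for dry runs.
UNIT_DIR="${UNIT_DIR:-/etc/systemd/system}"

# Write the catalog.service unit file into UNIT_DIR.
write_unit() {
  cat > "$UNIT_DIR/catalog.service" <<'EOF'
[Unit]
Description=Catalog Service
After=network.target

[Service]
# Assumption: carried over from the original user data script.
Environment=ASPNETCORE_ENVIRONMENT=Production
WorkingDirectory=/home/ec2-user/catalog
ExecStart=/usr/bin/dotnet /home/ec2-user/catalog/catalog.dll
Restart=always
User=ec2-user

[Install]
WantedBy=multi-user.target
EOF
}

# Only touch systemd when running as root on a host booted with systemd.
if [ "$(id -u)" -eq 0 ] && [ -d /run/systemd/system ]; then
  write_unit
  systemctl daemon-reload
  systemctl enable catalog.service
  # --no-block queues the start job and returns without waiting for it.
  systemctl --no-block start catalog.service
fi
```

The point of the design is that nothing in the script waits on the app: systemd owns the process, and the script itself finishes in moments.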
And just like that, my SSH woes were over. The instance booted properly, SSH was accessible, and my app still ran without interfering with the startup process.
The Takeaway: User Data Scripts Are Not for Long-Running Processes
If there’s one thing to learn from this ordeal, it’s that user data scripts should not be used to start long-running applications. Instead, use a proper service manager like systemd.
So, next time your EC2 instance refuses to SSH, and you’ve checked everything else, take a look at your user data script. It might just be the villain in your cloud horror story.
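If SSH is already locked out, the user data can still be inspected from outside the instance. Here is a minimal sketch, assuming the AWS CLI is installed and configured with permission to describe the instance. The function names and the instance ID in the usage comment are placeholders of mine.

```shell
#!/bin/bash
# Sketch: inspect an instance from the outside when SSH is unreachable.
# Assumes the AWS CLI is installed and configured.

# Print the user data attached to an instance (the API returns it base64-encoded).
show_user_data() {
  aws ec2 describe-instance-attribute \
    --instance-id "$1" \
    --attribute userData \
    --query 'UserData.Value' \
    --output text | base64 --decode
}

# Print the boot console log, which can hint at how far startup got.
show_console_output() {
  aws ec2 get-console-output --instance-id "$1" --output text
}

# Usage (placeholder instance ID):
#   show_user_data i-0123456789abcdef0
#   show_console_output i-0123456789abcdef0
```

A command near the end of the script that never exits, like my dotnet catalog.dll, is the thing to look for.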