Eucalyptus Troubleshooting (1.5.1)

Eucalyptus cloud admins are encouraged to consult the Known Bugs page before diving into the investigation of unexpected behavior.

1. Restarting

If an administrator ever needs to stop/start a Eucalyptus front-end because of a configuration change, or if the machine on which the front-end is running reboots unexpectedly, the administrator must terminate all running instances in the system before bringing Eucalyptus back online. (It is possible to restart the cloud controller using /etc/init.d/eucalyptus restart on the head-node without affecting the rest of the system, but then some of the configuration is not reloaded. Doing stop followed by start on the head-node will reload the configuration, but will also destroy the virtual network setup among the running VMs, making them inaccessible.)

If the restart is planned, the administrator can use the client tools to terminate all users' instances before stopping, reconfiguring, and starting Eucalyptus. If the restart was unplanned (for example, the front-end machine crashed), the administrator can either start Eucalyptus and immediately terminate all running instances, or manually stop all Eucalyptus components, destroy all running Xen instances using 'xm shutdown' or 'xm destroy' on the nodes, and then start all Eucalyptus components again.
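For a planned restart, the "terminate all instances" step could be scripted roughly as follows. This is only a sketch: it assumes the EC2 API tools are on the PATH, that admin credentials are loaded in the environment, and that ec2-describe-instances prints one row per instance starting with the word INSTANCE followed by the instance ID. The function names are hypothetical and not part of Eucalyptus.

```shell
#!/bin/sh
# Print the instance IDs found in ec2-describe-instances output read from stdin.
# Assumption: instance rows begin with INSTANCE and carry the ID in field 2.
list_instance_ids() {
    awk '$1 == "INSTANCE" { print $2 }'
}

# Terminate every instance that the cloud currently reports.
# Does nothing when no instances are listed.
terminate_all_instances() {
    ids=$(ec2-describe-instances | list_instance_ids)
    if [ -n "$ids" ]; then
        # Word splitting is intended here: one argument per instance ID.
        ec2-terminate-instances $ids
    fi
}
```

With these functions sourced into an admin shell, running terminate_all_instances before stopping the front-end would cover the planned case. It is worth confirming with ec2-describe-instances afterwards that nothing is left running.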

2. Diagnostics

Installation/Discovering resources

If something is not working right with your Eucalyptus installation, the best first step (after making sure that you have followed the installation/configuration/networking documents faithfully) is to make sure that your cloud is up and running, that all of the components are communicating properly, and that there are resources available to run instances. After you have set up and configured Eucalyptus, set up your environment properly with your admin credentials, and use the following command to see the 'status' of your cloud:

ec2-describe-availability-zones verbose

You should see output similar to the following:

AVAILABILITYZONE        cluster <hostname of your front-end>
AVAILABILITYZONE        |- vm types     free / max   cpu   ram  disk
AVAILABILITYZONE        |- m1.small     0128 / 0128   1    128    10
AVAILABILITYZONE        |- c1.medium    0128 / 0128   1    256    10
AVAILABILITYZONE        |- m1.large     0064 / 0064   2    512    10
AVAILABILITYZONE        |- m1.xlarge    0064 / 0064   2   1024    20
AVAILABILITYZONE        |- c1.xlarge    0032 / 0032   4   2048    20
AVAILABILITYZONE        |- <node-hostname-a>        certs[cc=true,nc=true] @ Sun Jan 04 15:13:30 PST 2009
AVAILABILITYZONE        |- <node-hostname-b>        certs[cc=true,nc=true] @ Sun Jan 04 15:13:30 PST 2009
AVAILABILITYZONE        |- <node-hostname-c>        certs[cc=true,nc=true] @ Sun Jan 04 15:13:30 PST 2009
AVAILABILITYZONE        |- <node-hostname-d>        certs[cc=true,nc=true] @ Sun Jan 04 15:13:30 PST 2009
AVAILABILITYZONE        |- <node-hostname-e>        certs[cc=true,nc=true] @ Sun Jan 04 15:13:30 PST 2009
AVAILABILITYZONE        |- <node-hostname-f>        certs[cc=true,nc=true] @ Sun Jan 04 15:13:30 PST 2009
...
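When the listing is long, a small filter can point out VM types that have no free slots left. The sketch below assumes the output layout shown above (free and max counts separated by " / "); the function name exhausted_vm_types is hypothetical.

```shell
#!/bin/sh
# Read "ec2-describe-availability-zones verbose" output on stdin and print
# every VM type whose free count is zero.
# Assumption: VM type rows look like
#   AVAILABILITYZONE  |- m1.small  0128 / 0128  1  128  10
exhausted_vm_types() {
    awk '$2 == "|-" && $5 == "/" && $4 ~ /^[0-9]+$/ && $4 + 0 == 0 { print $3 }'
}
```

It could be used as "ec2-describe-availability-zones verbose | exhausted_vm_types". An empty result means every VM type still has at least one free slot.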

Next, the administrator should consult the Eucalyptus logfiles. On each machine running a Eucalyptus component, the logfiles are located in:

$EUCALYPTUS/var/log/eucalyptus/

On the front-end, the Cloud Controller (CLC) logs primarily to 'cloud-output.log' and 'cloud-debug.log'. Consult these files if the output of your client tools (ec2 API tools) contains exception messages, or if you suspect that none of your operations are ever being executed (for example, you never see Xen activity on the nodes or network configuration activity on the front-end).

The Cluster Controller (CC) also resides on the front-end, and logs to 'cc.log' and 'httpd-cc_error_log'. Consult these logfiles in general, but especially if you suspect there is a problem with networking. 'cc.log' will contain log entries from the CC itself, and 'httpd-cc_error_log' will contain the STDERR/STDOUT from any external commands that the CC executes at runtime.

A Node Controller (NC) will run on every machine in the system that you have configured to run VM instances. The NC logs to 'nc.log' and 'httpd-nc_error_log'. Consult these files in general, but especially if you believe that there is a problem with VM instances actually running (i.e., it appears as if instances are trying to run - get submitted, go into 'pending' state, then go into 'terminated' directly - but fail to stay running).
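A quick first pass over the logfiles described above is to search them for suspicious lines. The following is a minimal sketch, assuming that the words "error", "exception", and "fail" are useful search terms; that choice of terms, and the function name scan_eucalyptus_logs, are assumptions rather than anything Eucalyptus defines.

```shell
#!/bin/sh
# Search all files in a Eucalyptus log directory for likely problem lines.
# The directory defaults to $EUCALYPTUS/var/log/eucalyptus when no argument
# is given. Output format: file:line-number:matching line.
scan_eucalyptus_logs() {
    log_dir=${1:-"$EUCALYPTUS/var/log/eucalyptus"}
    # /dev/null is added so grep always prints the file name,
    # even when the directory holds a single file.
    grep -i -n -E 'error|exception|fail' "$log_dir"/* /dev/null
}
```

Run it on the front-end to cover the CLC and CC logs, and on each node to cover the NC logs.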

Node Controller troubleshooting

  • If nc.log reports "Failed to connect to hypervisor", the hypervisor (Xen or KVM) or libvirt is not functioning correctly on that node.
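One way to check the libvirt side independently of Eucalyptus is to ask virsh for the list of domains on the node. This is a sketch under assumptions: it presumes virsh is installed and that the check is run as a user with the same libvirt access as the NC. The VIRSH override variable and the function name are hypothetical.

```shell
#!/bin/sh
# Report whether libvirt answers a basic "list domains" request.
# VIRSH may be set to an alternative command; it defaults to virsh.
check_hypervisor() {
    virsh_cmd=${VIRSH:-virsh}
    if "$virsh_cmd" list >/dev/null 2>&1; then
        echo "libvirt reachable"
    else
        echo "libvirt NOT reachable"
        return 1
    fi
}
```

If this reports a failure, the problem most likely lies in the hypervisor or libvirt setup, not in the NC itself.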

Walrus troubleshooting

  • "ec2-upload-bundle" will report a "409" error when uploading to a bucket that already exists. This is a known compatibility issue when using ec2 tools with Eucalyptus. The workaround is to use ec2-delete-bundle with the "--clear" option to delete the bundle and the bucket, before uploading to a bucket with the same name, or to use a different bucket name.
  • When using "ec2-upload-bundle," make sure that there is no "/" at the end of the bucket name.
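To guard against the trailing "/" problem in scripts, the bucket name can be cleaned up before it is handed to the upload tool. A minimal sketch follows; normalize_bucket is a hypothetical helper name.

```shell
#!/bin/sh
# Print the given bucket name with any trailing "/" characters removed.
normalize_bucket() {
    bucket=$1
    while [ "${bucket%/}" != "$bucket" ]; do
        bucket=${bucket%/}
    done
    printf '%s\n' "$bucket"
}
```

The result can then be passed as the bucket argument to ec2-upload-bundle.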

Block storage troubleshooting

  • Unable to attach volumes when the front end and the NC are running on the same machine. This is a known issue with ATA over Ethernet (AoE). AoE will not export to the same machine that the server is running on. The workaround is to run the front end and the node controller on different hosts.
  • Volume ends up in "deleted" state when created, instead of showing up as "available." Look for error messages in $EUCALYPTUS/var/log/eucalyptus/cloud-error.log. A common problem is that ATA-over-Ethernet may not be able to export the created volume (this will appear as a "Could not export..." message in cloud-error.log). Make sure that "VNET_INTERFACE" in eucalyptus.conf on the front end is correct.
  • Failure to create volume/snapshot. Make sure you have enough loopback devices. If you are installing from packages, you will get a warning. On most distributions, the loopback driver is installed as a module. The following will increase the number of loopback devices available:
    rmmod loop ; modprobe loop max_loop=256
    
  • If block devices do not automatically appear in your VMs, make sure that you have the "udev" package installed.
  • On Gentoo, if you get the error "which: no vblade in ((null))", try compiling "su" without PAM support.
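To see how many loopback devices a machine currently exposes, for instance before and after reloading the loop module as described above, the device nodes can be counted. This sketch assumes the devices appear as /dev/loop0, /dev/loop1, and so on; count_loop_devices is a hypothetical name.

```shell
#!/bin/sh
# Count loopback device nodes (loop0, loop1, ...) in a directory.
# The directory defaults to /dev.
count_loop_devices() {
    dev_dir=${1:-/dev}
    count=0
    for dev in "$dev_dir"/loop[0-9]*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$dev" ] || continue
        count=$((count + 1))
    done
    echo "$count"
}
```

Calling count_loop_devices with no arguments prints the number of loop device nodes under /dev.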