Eucalyptus cloud admins are encouraged to consult the Known Bugs page before diving into the investigation of unexpected behavior.
If an administrator needs to stop and start a Eucalyptus front-end because of a configuration change, or if the machine on which the front-end is running reboots unexpectedly, the administrator must terminate all running instances in the system before bringing Eucalyptus back online. (It is possible to restart the cloud controller using '/etc/init.d/eucalyptus restart' on the head-node without affecting the rest of the system, but then some of the configuration is not reloaded. Running stop followed by start on the head-node reloads the configuration, but also destroys the virtual network setup among the running VMs, making them inaccessible.)
If the restart is planned, the administrator can use the client tools to terminate all users' instances before stopping, reconfiguring, and starting Eucalyptus. If the restart was unplanned (the front-end machine crashed), the admin can either start Eucalyptus and immediately terminate all running instances, or manually stop all Eucalyptus components, destroy all running Xen instances using 'xm shutdown' or 'xm destroy' on the nodes, and then start all Eucalyptus components again.
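The two cleanup paths above can be sketched as small shell helpers. This is only an illustration under stated assumptions: the function names are hypothetical, the first helper assumes that ec2-describe-instances, run with admin credentials, prints one INSTANCE record per line with the instance ID in the second field, and the second assumes the usual 'xm list' layout (a header row, then one domain per row with the name in the first column). Note that the node-side helper destroys every guest domain on the node, not only those started by Eucalyptus.

```shell
#!/bin/sh
# list_instance_ids (hypothetical helper): reads ec2-describe-instances
# output on stdin and prints the ID from each INSTANCE record.
list_instance_ids() {
    awk '$1 == "INSTANCE" { print $2 }'
}

# terminate_all_instances (hypothetical helper): planned restart, run from
# a shell that has the admin credentials loaded.
terminate_all_instances() {
    ids=$(ec2-describe-instances | list_instance_ids)
    if [ -n "$ids" ]; then
        # Left unquoted on purpose so each ID becomes its own argument.
        ec2-terminate-instances $ids
    fi
}

# list_guest_domains (hypothetical helper): reads `xm list` output on stdin
# and prints every domain name except the header row and Domain-0.
list_guest_domains() {
    awk 'NR > 1 && $1 != "Domain-0" { print $1 }'
}

# destroy_guest_domains (hypothetical helper): unplanned restart, run as
# root on each node after the Eucalyptus components have been stopped.
destroy_guest_domains() {
    xm list | list_guest_domains | while read -r dom; do
        echo "destroying $dom"
        xm destroy "$dom"
    done
}
```

For a planned restart, one would call terminate_all_instances before stopping the front-end. For an unplanned one, destroy_guest_domains would run on each node between stopping and starting the Eucalyptus components. The init script path mentioned above may differ between installations.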
If something is not working right with your Eucalyptus installation, the best first step (after making sure that you have followed the installation/configuration/networking documents faithfully) is to make sure that your cloud is up and running, that all of the components are communicating properly, and that there are resources available to run instances. After you have set up and configured Eucalyptus, set up your environment properly with your admin credentials, and use the following command to see the 'status' of your cloud:
ec2-describe-availability-zones verbose
You should see output similar to the following:
AVAILABILITYZONE cluster <hostname of your front-end>
AVAILABILITYZONE |- vm types free / max cpu ram disk
AVAILABILITYZONE |- m1.small 0128 / 0128 1 128 10
AVAILABILITYZONE |- c1.medium 0128 / 0128 1 256 10
AVAILABILITYZONE |- m1.large 0064 / 0064 2 512 10
AVAILABILITYZONE |- m1.xlarge 0064 / 0064 2 1024 20
AVAILABILITYZONE |- c1.xlarge 0032 / 0032 4 2048 20
AVAILABILITYZONE |- <node-hostname-a> certs[cc=true,nc=true] @ Sun Jan 04 15:13:30 PST 2009
AVAILABILITYZONE |- <node-hostname-b> certs[cc=true,nc=true] @ Sun Jan 04 15:13:30 PST 2009
AVAILABILITYZONE |- <node-hostname-c> certs[cc=true,nc=true] @ Sun Jan 04 15:13:30 PST 2009
AVAILABILITYZONE |- <node-hostname-d> certs[cc=true,nc=true] @ Sun Jan 04 15:13:30 PST 2009
AVAILABILITYZONE |- <node-hostname-e> certs[cc=true,nc=true] @ Sun Jan 04 15:13:30 PST 2009
AVAILABILITYZONE |- <node-hostname-f> certs[cc=true,nc=true] @ Sun Jan 04 15:13:30 PST 2009
...
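To check quickly whether any resources are available to run instances, the free/max columns of this listing can be summarized with a short filter. The sketch below is an assumption-laden illustration: summarize_capacity is a hypothetical helper name, and it assumes the tool prints one AVAILABILITYZONE record per line with VM type rows shaped like 'AVAILABILITYZONE |- m1.small 0128 / 0128 1 128 10'.

```shell
#!/bin/sh
# summarize_capacity (hypothetical helper): reads the verbose zone listing
# on stdin and prints free and max instance slots for each VM type.
# Only rows whose fifth field is "/" are VM type rows; the column header
# row and the per-node rows do not match and are skipped.
summarize_capacity() {
    awk '
        $1 == "AVAILABILITYZONE" && $2 == "|-" && $5 == "/" {
            free = $4 + 0   # "0128" is read as the decimal number 128
            max  = $6 + 0
            status = (free > 0) ? "ok" : "FULL"
            printf "%s free=%d max=%d %s\n", $3, free, max, status
        }
    '
}

# Usage (requires admin credentials in the environment):
#   ec2-describe-availability-zones verbose | summarize_capacity
```

A VM type reported as FULL has no free slots, so a request for an instance of that type may not be able to start until capacity is freed.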
Next, the administrator should consult the Eucalyptus logfiles. On each machine running a Eucalyptus component, the logfiles are located in:
$EUCALYPTUS/var/log/eucalyptus/
On the front-end, the Cloud Controller (CLC) logs primarily to 'cloud-output.log' and 'cloud-debug.log'. Consult these files if your client tool (ec2 API tools) output contains exception messages, or if you suspect that none of your operations are ever being executed (never see Xen activity on the nodes, network configuration activity on the front-end, etc.).
The Cluster Controller (CC) also resides on the front-end, and logs to 'cc.log' and 'httpd-cc_error_log'. Consult these logfiles in general, but especially if you suspect there is a problem with networking. 'cc.log' will contain log entries from the CC itself, and 'httpd-cc_error_log' will contain the STDERR/STDOUT from any external commands that the CC executes at runtime.
A Node Controller (NC) will run on every machine in the system that you have configured to run VM instances. The NC logs to 'nc.log' and 'httpd-nc_error_log'. Consult these files in general, but especially if you believe that there is a problem with VM instances actually running (i.e., it appears as if instances are trying to run - get submitted, go into 'pending' state, then go into 'terminated' directly - but fail to stay running).
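As a first pass over the logfiles named above, a small helper can print the most recent suspicious lines from whichever of them exist on the current machine. This is a rough sketch, not a Eucalyptus tool: scan_logs is a hypothetical name, and the search pattern (a case-insensitive match on "error", "exception" or "fatal") is an assumption about what problem entries look like in these logs.

```shell
#!/bin/sh
# scan_logs (hypothetical helper): for each known Eucalyptus logfile that
# exists in the log directory, print the last N lines that match a
# case-insensitive "error", "exception" or "fatal".
#   $1: log directory (default: $EUCALYPTUS/var/log/eucalyptus)
#   $2: number of matching lines to show per file (default: 5)
scan_logs() {
    logdir=${1:-${EUCALYPTUS:-}/var/log/eucalyptus}
    count=${2:-5}
    for f in cloud-output.log cloud-debug.log cc.log httpd-cc_error_log \
             nc.log httpd-nc_error_log; do
        # A front-end has no nc.log and a node has no cloud-*.log files.
        [ -f "$logdir/$f" ] || continue
        echo "== $f"
        grep -i -E 'error|exception|fatal' "$logdir/$f" | tail -n "$count"
    done
}

# Usage:
#   scan_logs                    # default directory, 5 lines per file
#   scan_logs /path/to/logs 20   # explicit directory, 20 lines per file
```

A file header followed by no lines means the file exists but nothing in it matched the pattern. That does not prove the component is healthy, so the full logfile is still worth reading when a problem persists.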
rmmod loop ; modprobe loop max_loop=256