To troubleshoot Eucalyptus, the administrator must know the location of the Eucalyptus components, that is, on which machine each component is installed. The administrator must have root access to each machine hosting the components and must know the network configuration connecting the components.
When an issue arises in Eucalyptus, the Eucalyptus log files or the system log files usually contain a record that suggests the nature of the problem. Assuming Eucalyptus is installed in root (/), the Eucalyptus logs are located in /var/log/eucalyptus/ on each machine hosting a component.
Here are the relevant logs for each component:
Cloud Controller (CLC), Walrus, and Storage Controller (SC):
Cluster Controller (CC):
Node Controller (NC):
You can control the amount of information written to the logs by modifying variables in eucalyptus.conf. For the CLC, SC, and Walrus, add the parameter --log-level=LEVEL to the CLOUD_OPTS variable. For the CC and NC, set the variable LOGLEVEL=LEVEL. The possible values for LEVEL are DEBUG, INFO, WARN, ERROR, and FATAL. After changing these values, you must restart the components for the changes to take effect.
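As an illustration, the two settings described above might look like this in eucalyptus.conf. DEBUG is only an example level, and the sketch assumes CLOUD_OPTS carries no other options; if it already does, append the parameter to the existing value instead of replacing it.

```shell
# eucalyptus.conf fragment (illustrative values only)

# CLC, Walrus, and SC: pass the log level as a startup option.
CLOUD_OPTS="--log-level=DEBUG"

# CC and NC: set the log level directly.
LOGLEVEL="DEBUG"
```

Remember to restart the affected components afterwards, and to lower the level again once you have finished debugging, since DEBUG output can grow quickly.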
In addition, information regarding the nature of an issue may appear in the system’s logs. In particular, you might want to search for clues in /var/log/xen/.
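A quick way to look for clues is to search a whole log directory for high-severity entries. The following is a minimal sketch: scan_logs is a hypothetical helper name, and it assumes the level names (ERROR, FATAL) appear verbatim in the log lines, which may not hold for every log file.

```shell
# Print every line tagged ERROR or FATAL from the files in a log directory.
# Usage: scan_logs [DIR]   (DIR defaults to /var/log/eucalyptus)
scan_logs() {
    local log_dir="${1:-/var/log/eucalyptus}"
    # -H prefixes each match with its file name, -n with its line number.
    # Messages about unreadable entries (e.g. subdirectories) are discarded.
    grep -HnE 'ERROR|FATAL' "$log_dir"/* 2>/dev/null
}
```

For example, `scan_logs` searches the Eucalyptus logs and `scan_logs /var/log/xen` searches the Xen logs mentioned above.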
It is also important to understand the elements of the network on your system. For example, to see which devices are enslaved to each bridge, use the brctl command. To list network devices and evaluate their existing configuration, use the ip, ifconfig, and route commands. To evaluate VLAN configuration (MANAGED mode only), use vconfig.
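When collecting this information from several machines, it can help to gather it in one pass. The sketch below wraps three of the commands named above; show_network_state is a hypothetical helper name, and a tool that is not installed is reported instead of causing a failure.

```shell
# Print bridge, address, and routing information in one pass.
show_network_state() {
    local cmd tool
    for cmd in "brctl show" "ip addr show" "ip route show"; do
        tool="${cmd%% *}"                 # first word is the program name
        echo "== $cmd =="
        if command -v "$tool" >/dev/null 2>&1; then
            # $cmd is intentionally unquoted so it splits into program and arguments.
            $cmd 2>&1 || echo "($cmd failed)"
        else
            echo "($tool is not installed)"
        fi
    done
}
```

Running `show_network_state > network-state.txt` on the CC and on each NC gives you snapshots you can compare side by side.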
Administrator credentials allow access to more information than user credentials. For example, with administrator credentials, euca-describe-instances returns additional information, including the instances run by all users on the system. Make sure you have Euca2ools installed with proper administrator credentials.
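Before running Euca2ools commands, you may want to confirm that credentials are loaded in the current shell. The sketch below assumes the credentials are supplied through the environment variables EC2_URL, EC2_ACCESS_KEY, and EC2_SECRET_KEY; check_euca_env is a hypothetical helper name, and your installation may use additional variables.

```shell
# Report which of the assumed credential variables are unset or empty.
# Returns 0 when all are present, 1 otherwise.
check_euca_env() {
    local name value missing=0
    for name in EC2_URL EC2_ACCESS_KEY EC2_SECRET_KEY; do
        eval "value=\${$name:-}"          # indirect lookup of the variable named in $name
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Note that this only checks that credentials are present; it cannot tell whether they belong to an administrator.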
Here we provide troubleshooting strategies and solutions for these commonly occurring issues in Eucalyptus:
|
You can use euca_conf to check that all components are registered correctly. To do so, run these commands on the CLC machine as the root user:
Check that the IP addresses returned are consistent with your network configuration. For example, Walrus should be registered with a public IP, not localhost (127.0.0.1). |
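The loopback check can be automated by filtering the registration listing. This is a minimal sketch: flag_loopback_registrations is a hypothetical helper name, and it simply reads whatever text you pipe into it, so it makes no assumption about the exact output format.

```shell
# Read a registration listing on stdin and print every line that refers to
# a loopback address (127.x.x.x) or to "localhost".
# Exit status is 0 if at least one such line was found, 1 if none.
flag_loopback_registrations() {
    grep -E '(^|[^0-9.])127\.[0-9]+\.[0-9]+\.[0-9]+|localhost'
}
```

Any line this prints identifies a component that was probably registered with an address the other machines cannot reach.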
|
You can quickly confirm that the CLC is running by accessing the Web UI (https://<IPAddress>:8443). Once you’ve confirmed the CLC is running, check that the components are correctly registered (see above). A very useful high-level check is euca-describe-availability-zones verbose (run with admin credentials), which indicates whether your cloud resources are available. The output shows the maximum capacity of your cloud installation for each VM type (e.g., m1.small, c1.medium, m1.large) and the current availability of each type. The following example shows a cloud that is unloaded, with all resources available:

AVAILABILITYZONE  cluster       <hostname of your front-end>
AVAILABILITYZONE  |- vm types   free / max   cpu  ram   disk
AVAILABILITYZONE  |- m1.small   0128 / 0128  1    128   10
AVAILABILITYZONE  |- c1.medium  0128 / 0128  1    256   10
AVAILABILITYZONE  |- m1.large   0064 / 0064  2    512   10
AVAILABILITYZONE  |- m1.xlarge  0064 / 0064  2    1024  20
AVAILABILITYZONE  |- c1.xlarge  0032 / 0032  4    2048  20
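Output in the format of the example above can be condensed with a small filter. The following is a sketch that assumes the column layout shown in the example (type, free, "/", max); summarize_availability is a hypothetical helper name.

```shell
# Read `euca-describe-availability-zones verbose` output on stdin and print
# "TYPE FREE MAX" for each VM type, marking types that have no free slots.
summarize_availability() {
    awk '$2 == "|-" && $5 == "/" && $4 ~ /^[0-9]+$/ {
        free = $4 + 0                     # "0128" becomes 128
        max  = $6 + 0
        printf "%s %d %d%s\n", $3, free, max, (free == 0 ? " FULL" : "")
    }'
}
```

Usage: `euca-describe-availability-zones verbose | summarize_availability`. The header row and the cluster row are skipped because they do not match the expected layout.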
|
First, check that the CC has been started and registered (as described above). Next, on the CC machine, confirm that cc.log is growing (i.e., the CLC is polling the CC). If it is not, the registration was unsuccessful; possible reasons include an incorrect key, a wrong IP address, or a firewall impediment. You may also want to inspect the other Eucalyptus log files on the CC. |
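One simple way to tell whether a log is growing is to compare its size at two points in time. The sketch below does that; log_is_growing is a hypothetical helper name, and the default 10-second wait is an assumption that you may need to lengthen if polling is infrequent on your system.

```shell
# Succeed (exit 0) if FILE gets larger within WAIT seconds.
# Usage: log_is_growing FILE [WAIT]
log_is_growing() {
    local file="$1" wait_seconds="${2:-10}" before after
    before=$(wc -c < "$file") || return 2     # 2 = file could not be read
    sleep "$wait_seconds"
    after=$(wc -c < "$file") || return 2
    [ "$after" -gt "$before" ]
}
```

For example: `log_is_growing /var/log/eucalyptus/cc.log 30 && echo "cc.log is being written"`.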
|
First, check that the CC is running correctly (see above). Next, check that the NC has been started and registered with the correct CC (in the event you have more than one CC). Then check the cc.log on the CC to confirm the CC is polling the NC. If it is not, the node may not be registered correctly. Finally, check the nc.log to confirm the NC is being polled by the CC. If it is not, check the Eucalyptus log files on the NC machines for errors (e.g., incorrect keys, cannot talk to hypervisor, libvirt misconfigured). |
|
For information on proper configuration of libvirt, see Hypervisor Configuration. |
|
Follow the steps in the troubleshooting solutions above: check that the CC, NC, and CLC are running correctly. Next, check that enough resources (for example, disk space) are available on the NC machines and that they are accessible to the user “eucalyptus”. |
|
First, use the euca-describe-addresses command to see whether any IPs are available. If none are, examine your configuration, in particular the value of VNET_PUBLICIPS (see Section 8: Eucalyptus EE Networking Configuration). If all IPs are taken, you may need to allocate more IPs to Eucalyptus. If IPs are available but you still get errors, you may need to perform a clean restart of the CC. |
|
Use the euca-describe-availability-zones verbose command to confirm that you have available resources. If you do, check that you also have available public IP addresses (try allocating and de-allocating an IP address). Next, check that the root file system of the image you want to run fits within the disk size of the instance type you are using. |
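The size comparison in the last step can be scripted. The sketch below assumes that the "disk" column of the availability output is expressed in gigabytes and that you have the root file system image available as a local file; image_fits is a hypothetical helper name.

```shell
# Succeed (exit 0) if IMAGE is no larger than DISK_GB gigabytes.
# Usage: image_fits IMAGE DISK_GB
image_fits() {
    local image="$1" disk_gb="$2" image_bytes limit_bytes
    image_bytes=$(wc -c < "$image") || return 2   # 2 = image could not be read
    limit_bytes=$((disk_gb * 1024 * 1024 * 1024))
    [ "$image_bytes" -le "$limit_bytes" ]
}
```

For example, `image_fits root.img 10 || echo "image too large for this VM type"`, where root.img is a placeholder file name.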
|
If you use KVM, use euca-get-console-output to get the console output of the instance. If you use Xen and you get an error, log into the NC machine as root and use the xm console command to get the console output. Then check the console output to confirm that the instance has booted (that is, it shows the kernel messages and reports no errors mounting the root file system). |
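Console output can be long, so it may help to filter it for typical boot failures. The patterns below are common Linux kernel messages and are only a starting point, not a complete list; find_boot_errors is a hypothetical helper name.

```shell
# Read console output on stdin and print lines that indicate a failed boot.
# Exit status is 0 if a suspicious line was found, 1 if none.
find_boot_errors() {
    grep -iE 'kernel panic|unable to mount root|cannot open root device'
}
```

Usage: `euca-get-console-output <instance ID> | find_boot_errors`. No output means none of these patterns matched, which does not by itself prove that the instance booted.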
|
If your image is very big, it may take a very long time to boot. To check for errors in the preparation of the instance, log into the NC as root and check the nc.log for information about your instance. Possible reasons for the failure include difficulty communicating with Walrus (check in $INSTANCE_PATH/<user>/<instance ID> to determine whether the kernel/initrd and root are correct); errors in preparing the image (check the nc.log); and errors talking to libvirt or the hypervisor (again, check the nc.log, libvirt logs, etc.). |
|
Make sure that the security group the instance is using allows ssh (port 22) connections from the client you are using. Check that the instance is fully booted (as explained above). Check that the network configuration for your mode is correct (in particular the VNET_*INTERFACE values). |
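To test reachability from the client before suspecting the instance itself, you can attempt a plain TCP connection to the ssh port. The sketch below relies on bash's /dev/tcp redirection and on the coreutils timeout command, both of which are assumptions about the client machine; port_open is a hypothetical helper name.

```shell
# Succeed (exit 0) if a TCP connection to HOST:PORT opens within WAIT seconds.
# Usage: port_open HOST [PORT] [WAIT]   (defaults: port 22, 5 seconds)
port_open() {
    local host="$1" port="${2:-22}" wait_seconds="${3:-5}"
    # Inside the inner shell, $0 is the host and $1 is the port.
    timeout "$wait_seconds" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example, `port_open 192.168.7.23 || echo "port 22 unreachable"`. A failure here points to the security group, a firewall, or the network mode configuration; a success means the connection problem lies with ssh itself (for example, the key).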
|
When attempting to log into a VM via ssh, you may receive a warning message stating that your "Remote Host Identification Has Changed", as shown in the following example:

$ ssh -i mykey root@192.168.7.23
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed
The fingerprint for the RSA key sent by the remote host is
17:91:22:94:7b:13:5c:dd:80:ee:eb:cd:25:73:dc:48
Add correct host key in /home/bob/.ssh/known_hosts:11
RSA host key for 192.168.7.23 has changed and you have requested strict checking.
Host key verification failed.

This type of message appears when a new instance assumes a known IP address (that is, an IP address previously used by a now-terminated instance). While in general this could be indicative of a "man-in-the-middle attack," in the cloud setting this is harmless because public IPs are frequently reused. You can work around this warning by deleting the line containing the offending key. In the above example, the key is located at line 11 of the file /home/bob/.ssh/known_hosts. You can delete this line in place using sed (the stream editor) as shown: $ sed -i '11d' /home/bob/.ssh/known_hosts |
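If you prefer to keep a copy of the file before changing it, the deletion can be wrapped as follows. This is a sketch: remove_known_hosts_line is a hypothetical helper name, and the backup suffix .bak is an arbitrary choice.

```shell
# Delete line LINE from a known_hosts file, keeping a backup as FILE.bak.
# Usage: remove_known_hosts_line LINE [FILE]
remove_known_hosts_line() {
    local line="$1" file="${2:-$HOME/.ssh/known_hosts}"
    case "$line" in
        ''|*[!0-9]*) echo "usage: remove_known_hosts_line LINE [FILE]" >&2; return 1 ;;
    esac
    sed -i.bak "${line}d" "$file"
}
```

For the example above you would run `remove_known_hosts_line 11 /home/bob/.ssh/known_hosts`. OpenSSH also provides `ssh-keygen -R 192.168.7.23`, which removes all keys recorded for that host without requiring the line number.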
|
Check that there is no firewall between them. Check that the IP address used during configuration is correct. Check that there is connectivity between each of the machines hosting the components using the IP specified during configuration. Check that the components are running (as described above). Check also that each machine hosting components is running NTP and that the machines’ internal clocks are synchronized. |
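As a rough check of clock synchronization, you can compare the epoch time reported by two machines. The sketch below only computes the difference; clock_skew is a hypothetical helper name, and fetching the remote time over ssh (as in the usage line) adds a small delay, so treat differences of a second or two as noise.

```shell
# Print the absolute difference, in seconds, between two epoch timestamps.
# Usage: clock_skew SECONDS_A SECONDS_B
clock_skew() {
    local a="$1" b="$2" diff
    diff=$((a - b))
    if [ "$diff" -lt 0 ]; then
        diff=$((0 - diff))
    fi
    echo "$diff"
}
```

Usage: `clock_skew "$(date +%s)" "$(ssh root@<hostname> date +%s)"`, where <hostname> is the machine hosting another component.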
|
Walrus may have to handle very large images. The available disk space should be at least three times the size of the image you wish to upload. The image must be uploaded and then decrypted before it is sent to the NC, which in itself requires approximately twice the size of the image. In addition, temporary files are created, so three times the image size is a safe amount to reserve. |
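The three-times rule can be checked before you upload. In the sketch below, enough_space_for_image is a hypothetical helper name and the directory argument stands for wherever Walrus stores its data on your system, which this section does not specify.

```shell
# Succeed (exit 0) if DIR has free space of at least three times the size of IMAGE.
# Usage: enough_space_for_image IMAGE DIR
enough_space_for_image() {
    local image="$1" target_dir="$2" image_bytes needed_kb free_kb
    image_bytes=$(wc -c < "$image") || return 2          # 2 = cannot read image
    needed_kb=$(( (image_bytes * 3 + 1023) / 1024 ))     # round up to whole KB
    # df -P prints one header line, then one line whose 4th column is free KB.
    free_kb=$(df -Pk "$target_dir" | awk 'NR == 2 { print $4 }')
    [ -n "$free_kb" ] || return 2                        # 2 = cannot read free space
    [ "$free_kb" -ge "$needed_kb" ]
}
```

Run it on the machine hosting Walrus, for example `enough_space_for_image my-image.img <Walrus storage directory> || echo "not enough space"`, where my-image.img is a placeholder name.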
|
By default, NCs allocate 1 real core/CPU per virtual core/CPU. That is, if an instance requires 2 cores/CPUs and the NC has only 2 cores/CPUs, then no more instances will be allowed on that NC. The NC’s CPUs can be overcommitted using the MAX_CORES option in eucalyptus.conf. You must restart the NC after modifying the value. Note that performance may suffer when cores are overcommitted. |
|
No. Unlike the CPUs/cores, memory cannot be overcommitted. The total amount of memory that the hypervisor allocates to VMs cannot exceed the total amount of physical memory on the node. |
|
To debug an image as used by Eucalyptus, set MANUAL_INSTANCES_CLEANUP to 1. With this setting, when an instance fails, the temporary files (i.e., root file system, kernel, etc.) are not deleted. You can find these files at $INSTANCE_PATH/<user>/<instanceId>, along with the libvirt.xml configuration file used to start the instance. You can then modify the libvirt.xml (the network part will need to be modified) and start the instance manually using virsh create. |
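The manual start can be wrapped in a small helper that first verifies the configuration file is where you expect it. This is a sketch: start_instance_manually is a hypothetical helper name, and it assumes you have already edited the network part of libvirt.xml as described above.

```shell
# Start a preserved instance by hand from its saved libvirt.xml.
# Usage: start_instance_manually INSTANCE_PATH USER INSTANCE_ID
start_instance_manually() {
    local xml="$1/$2/$3/libvirt.xml"
    if [ ! -f "$xml" ]; then
        echo "no libvirt.xml found at $xml" >&2
        return 1
    fi
    virsh create "$xml"
}
```

For example, `start_instance_manually "$INSTANCE_PATH" <user> <instanceId>`, run as root on the NC.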
|
On the “Configuration” page of the Eucalyptus Web UI, under “Walrus configuration,” confirm that the “space reserved for unbundling images” is large enough to contain your image. If not, increase the reserved space in the field provided. (Note that very large images can take a long time to boot.) |
|
If you are trying to upload to an already existing bucket, Eucalyptus will return a “409” error. This is a known compatibility issue when using ec2 tools with Eucalyptus. The workaround is to use ec2-delete-bundle with the --clear option to delete the bundle and the bucket, before uploading to a bucket with the same name, or to use a different bucket name. Note: If you are using Euca2ools, this is not necessary. In addition, when using ec2-upload-bundle, make sure that there is no "/" at the end of the bucket name. |
|
Make sure you have enough loopback devices. (Note that you should have received a warning when starting Eucalyptus components). On most distributions, the loopback driver is installed as a module. The following will increase the number of loopback devices available: [root@clc]# rmmod loop ; modprobe loop max_loop=256 |
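To see how many loopback devices currently exist, you can count the device nodes. The sketch below assumes the devices appear as /dev/loop0, /dev/loop1, and so on; count_loop_devices is a hypothetical helper name.

```shell
# Print the number of loop device nodes (loop0, loop1, ...) in a directory.
# Usage: count_loop_devices [DIR]   (DIR defaults to /dev)
count_loop_devices() {
    local dev_dir="${1:-/dev}" count=0 dev
    for dev in "$dev_dir"/loop[0-9]*; do
        # If nothing matches, the pattern is left as-is and fails this test.
        if [ -e "$dev" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

Running `count_loop_devices` before and after reloading the module shows whether the new max_loop value took effect. `losetup -a` lists the devices that are currently in use.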
|
AoE requires the SC and NCs to be on the same physical subnet. You can check and change the Ethernet device used by the SC to export the AoE volumes by modifying the “Storage Interface” field found in the “Storage Controller” section (on the Configuration page of the Eucalyptus Web UI). (Note that this problem will arise only when the machine hosting the SC has multiple Ethernet devices). AoE will not export to the same machine that the server is running on, which means that the SC and NC must be hosted on separate physical host machines. |
|
All networking modes except SYSTEM start a DHCP server when instances are running. The CC log may report a failure to start the DHCP server, or you may notice upon starting an instance that the DHCP server is missing on the CC machine (you can use the ps command to check for the presence of the DHCP server). Also, make sure that your DHCP binary is compatible with ISC DHCP daemon 3.x and that the binary specified in VNET_DHCPDAEMON is correct. You may see errors in the |
|
To check that your Eucalyptus installation is properly configured, we recommend first running a Eucalyptus-prepared image (downloadable via the “image” tab on the Eucalyptus Web interface). Check that your instance is fully booted (as described above). Check that the security group used by the instance allows connectivity from the client; for example, if using ssh, port 22 should be open. In the eucalyptus.conf file on both the CC and NC machine(s), check the values of VNET_PRIVINTERFACE and VNET_BRIDGE (when applicable), and confirm that the Ethernet devices specified are on the same physical subnet. Check that the DHCP server has started (as described above). If you have a DHCP server on your LAN, the cloud controller’s DHCP server may fail to provide an IP address to your instances. Since all cloud instances have MAC addresses beginning with d0:0d, you may want to tell your main DHCP server to ignore requests sent from these MAC addresses. |
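To compare the interface settings on the CC and NC machines, you can extract just the relevant variables from each configuration file. In the sketch below, show_vnet_settings is a hypothetical helper name, and the path to eucalyptus.conf is passed as an argument because it depends on where Eucalyptus is installed.

```shell
# Print the active (uncommented) VNET_PRIVINTERFACE and VNET_BRIDGE settings.
# Usage: show_vnet_settings /path/to/eucalyptus.conf
show_vnet_settings() {
    local conf="$1"
    grep -E '^[[:space:]]*VNET_(PRIVINTERFACE|BRIDGE)=' "$conf"
}
```

Run it on the CC and on each NC, then confirm that the devices it reports are attached to the same physical subnet.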
|
The solution to this problem is to have your VM ping the CC. This will exercise the networking layer in your VM, and it will then acquire a valid IP address. |
|
You are probably using the ifconfig command to see the Ethernet device configuration, which only shows one address per interface. Please use the ip addr show command to see all addresses associated with the interface. |
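For a compact view, the one-line output mode of ip (the -o option) can be reduced to interface and address pairs. This is a minimal sketch; list_addresses is a hypothetical helper name.

```shell
# Read `ip -o addr show` output on stdin and print "INTERFACE ADDRESS/PREFIX"
# for every IPv4 and IPv6 address.
list_addresses() {
    awk '$3 == "inet" || $3 == "inet6" { print $2, $4 }'
}
```

Usage: `ip -o addr show | list_addresses`. An interface that carries several addresses appears once per address.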