This is a document I wrote for Ubuntu on installing UEC 2.x. I am posting it here for easy access to people looking for guidance on the Eucalyptus site.
Everything You Always Wanted to Know About UEC
Step by Step Installation, Running, and Debugging
Reinstallation from scratch
Environment
1. Everything but Node Controller – Dell XPS 9100, 750GB RAID1, 6GB RAM, eth0, built-in 1Gb NIC (RTL 8100/81688 PCI Express), eth1, 1Gb NIC (Broadcom NetXtreme BCM5705)
2. Switch NetGear 5 port Gigabit router
3. Ubuntu Server 10.10 installed as cluster1 with 192.168.0.150-192.168.0.200 public ip range
Background
This server has already had the 10.10 UEC cluster installed on it several times. I've gotten pretty good at that. Wiping the LVM and/or removing the Vgroup and Vdrives to get it to actually install has been problematic. Also, I discovered long after the initial installation that even when nothing was happening, simply booting the machine and starting the Eucalyptus services emitted both errors and warnings. Hopefully these are all benign, but if they are benign they shouldn't exist at all, certainly not as errors. So I decided, with encouragement from kim0, to start from scratch and chronicle everything. Here goes...
1) I saved off the definition in /etc/network/interfaces of eth1. It saves time and since I am not a RHCE I don't know the params by heart.
2) Boot off Ubuntu 10.10 server disk.
3) Select English (I am in U.S.)
4) Select Install Ubuntu Enterprise Cloud
5) Select English again...maybe this is their way of asking, “Are you sure?”
6) Select United States
7) Select default keyboard
8) Select USA...twice (again...are you really sure?)
9) Select eth0 as primary network interface.
10) Select hostname of cor9100
11) Leave cloud controller address blank, as this will be the cloud controller
12) Accept default of everything but Node controller
13) Select eth1 as interface for communicating with nodes. This must be the first full installation since adding eth1, as I don't recall it ever asking this before. I hope this means it will configure things so that the node controller, attached via eth1, is routed to the outside world. That is an existing problem/feature: the node controller is orphaned from the outside world.
14) Accept timezone
15) Activate serial ATA RAID
16) Select use entire disk.
17) Accept disk to partition default
18) Select Yes for write changes to disk
19) Full name of new user...me (Walt Corey)
20) Accept userid choice, password stuff, no encryption of home, no proxy, install security updates automatically
21) Mail Postfix configuration. This defaults to cor9100. I believe I want cor9100.corey.org. Somewhere along the line I recall the warning, “cannot determine fully qualified host name”. Should I have, in step 10, specified cor9100.corey.org instead of cor9100? It probably really doesn't matter as I am not going to set up a mail server on this machine. It is, after all, the cluster, cloud, storage, Walrus controller.
22) Accept cluster1 as cluster name
23) Add the range 192.168.0.150-192.168.0.200 as public range
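A quick way to sanity-check the range before typing it in is a minimal Python sketch like the one below. It counts the addresses in the range, which are the ones that can be handed out to instances, and confirms both ends fall inside the LAN. The /24 netmask for the eth0 side is an assumption, since the installer prompt above does not state it.

```python
import ipaddress

# Public range entered at the installer prompt (from the step above).
first = ipaddress.IPv4Address("192.168.0.150")
last = ipaddress.IPv4Address("192.168.0.200")

# Assumption: eth0 sits on a /24; the original does not state the netmask.
lan = ipaddress.ip_network("192.168.0.0/24")

# Inclusive count of addresses in the range.
count = int(last) - int(first) + 1
inside_lan = first in lan and last in lan

print(f"{count} public addresses, all inside {lan}: {inside_lan}")
```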
24) Reboot and ssh in to check whether eth1 was configured and to check the initial startup log files. It did configure eth1, but as dhcp. I think this needs to be static, so I will copy in the eth1 entry I saved off from the old configuration.
# The secondary network interface
auto eth1
iface eth1 inet static
address 192.168.1.1
netmask 255.255.255.0
broadcast 192.168.1.255
network 192.168.1.0
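Since I don't know these params by heart, here is a small Python sketch that derives the network and broadcast values from just the address and netmask, so the stanza can be checked for consistency. It only does arithmetic on the values shown above; it does not read or write /etc/network/interfaces.

```python
import ipaddress

# Address and netmask copied from the eth1 stanza above.
iface = ipaddress.ip_interface("192.168.1.1/255.255.255.0")
network = iface.network

# These should match the "network" and "broadcast" lines in the stanza.
print("network  ", network.network_address)
print("broadcast", network.broadcast_address)
print("prefix   ", network.prefixlen)
```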
25) Yes, there were errors. I copied them to a separate subdir under /tmp and will include them in the appendix.
26) Remove logs and reboot. Bring in that set of logs for comparison to first set
27) Short of a detailed analysis, in cc.log I see a warning that eth1 MUST be a bridge for managed-novlan and that tunneling is disabled. OK, is this a bug, a feature, or a docup? It needs an explanation.
28) Shut down the cluster1 server until right before the new node is reinstalled, then restart it right before I reboot the node, as everything should connect. Also delete all logs again, as this will eliminate lines saying it is trying to discover a node. I will leave Eucalyptus configured for debug-level logging, as this may provide more information surrounding errors.
29) Installing cor9000 (node controller)...
30) Network autoconfiguration failed...hmmm...need to have cluster controller running perhaps? Restarting cluster and will retry autoconfiguration of network. This made no difference but I think later on in the installation the postfix or other stuff will expect to find the cluster controller so I will leave it running and manually configure eth0 here. Made ipaddress 192.168.1.2, gateway 192.168.1.1, and there was one other question which I answered 1.1 as well.
31) Hostname cor9000.corey.org and chose node controller.
32) Activating ATA RAID, selected use entire disk.
33) I believe the node controller is trying to update/upgrade but cannot, as it can't reach the outside world. If I put it on a different subnet so it can, then it won't see the cluster controller. It just gave up (I believe) on the update/upgrade and is continuing the installation. I did not think this would work but, according to the new UEC 2.0 document from CSS, I added the same 192.168.0.1 nameserver to the node's resolv.conf. Isn't the 192.168 subnet unroutable, much like 10.x? If this is on a private network with only a single eth connected to eth1 on the clc, how is it going to get to the outside world? Further on, that same CSS doc says to update/upgrade the software. This appears to be impossible. UEC clarification on this would be really welcome. This would argue for System mode, would it not?
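On the routability question: 192.168.0.0/16 and 10.0.0.0/8 are both RFC 1918 private ranges, so neither is routed on the public Internet, but private subnets can still be routed to each other locally. The Python sketch below shows that the nameserver is private and sits on a different subnet from the node, so the node can reach it only if something forwards the traffic. Whether UEC sets up that forwarding on the clc is the open question, and the sketch does not answer it. The 8.8.8.8 address is only an illustrative public address, not something from this setup.

```python
import ipaddress

# Network behind eth1 on the cluster controller (from the eth1 stanza).
node_net = ipaddress.ip_network("192.168.1.0/24")

node = ipaddress.ip_address("192.168.1.2")        # node's eth0 address
nameserver = ipaddress.ip_address("192.168.0.1")  # entry added to resolv.conf
public_example = ipaddress.ip_address("8.8.8.8")  # illustrative, not from this setup

for ip in (node, nameserver, public_example):
    on_link = ip in node_net
    print(f"{ip}: private={ip.is_private}, on node's subnet={on_link}")
```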
34) Looking at the logs on clc I see that the node did successfully register, all seems happy so far.
35) Time to get credentials. I did not get the expected change-password screen. I did, however, change the admin password and next added myself as a user. Per several forum messages, one shouldn't be downloading and registering images as admin. Is this a bug?
36) I logged off, then back on as myself (not admin), got my credentials, saved them to a unique subdir off my home, and sourced them. Strange though, I could still get a verbose listing from availability zone. I published, as me, that previous maverick image and it showed up as owned by admin. How is that possible? It is consistent with getting the verbose AZ listing. Eucalyptus support told me the reason I could not see the AZ detail from ECC is that I wasn't admin, and only admin sees the verbose listing. Now I cannot get rid of the bucket. euca-delete-bucket -b mybucket is supposed to remove everything in the bucket including the bucket itself; it does not. Directly specifying the manifest for that maverick image says invalid manifest. I went into the web interface and disabled the two images under the bucket. It then allowed me to delete and clear the bucket.
37) I can confirm by publishing the maverick image as me, not as admin, and starting it, it DOES transition to a running state and I can ssh into it. Yahoo!
38) I now have a rudimentary cloud-init config file that works great and does:
#cloud-config
apt_upgrade: true
packages:
- tomcat6
- tomcat6-admin
- mysql-server
39) I decided to raise the number of CPUs from the default to 16. This was not a good thing to do. I did, however, turn on virtio_net and virtio_root. Currently all three are on and working.