Eucalyptus

arwin.tugade
Offline
Joined: 01/20/2011

So I ran into a really weird scenario today with one of my clusters, which has 3 node controllers. It looks like this:

Cluster Controller 5
|
Node Controller 14
Node Controller 15
Node Controller 16

* This morning NC16 falls over for whatever reason; it just becomes totally unresponsive
* I restart NC16 and it comes back up fine
* I try to launch a new instance in this cluster and it doesn't pick up an IP (RunInstances(): could not find/initialize any free network address, failing doRunInstances())
* I find out the filesystem on NC14 is read-only; since the instance was supposed to be launched on this NC, that is probably why it couldn't pick up an IP
* So I deregister it from cluster5 because it's having more serious problems
* I do a stop/start on the CC (I shouldn't have to do this, but according to cc.log the CC stopped talking to NC14 after I did)
* I manage to launch one instance in cluster5 and it picks up the next available IP
* I try to launch more but they don't pick up IPs
* I terminate all instances in this cluster because I'm thoroughly confused at this point (I really shouldn't have to do this)
* I do a /etc/init.d/eucalyptus-cc cleanstop
* I do a /etc/init.d/eucalyptus-cc cleanstart
* I can now relaunch all the instances that I originally had

Any suggestions on how to fix this without terminating all instances in this cluster and then doing the cleanstop/cleanstart?

Arwin

jeevanullas
Offline
Joined: 02/12/2010

Hi Arwin,

I have one question,

* I find out the filesystem on NC14 is read-only; since the instance was supposed to be launched on this NC, that is probably why it couldn't pick up an IP

That sounds plausible. If the node controller's filesystem is read-only, not much can be done. Did you try:

mount -o remount,rw /
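Before remounting, it can help to confirm that the kernel really has the filesystem mounted read-only. The following is a minimal sketch, assuming a Linux node where /proc/mounts is available; the function name is_readonly_mount is made up for illustration and is not part of Eucalyptus.

```shell
#!/bin/sh
# is_readonly_mount MOUNTPOINT [MOUNTS_FILE]   (hypothetical helper, not a Eucalyptus tool)
# Exit status: 0 if MOUNTPOINT is mounted read-only,
#              1 if it is mounted read-write,
#              2 if MOUNTPOINT is not listed at all.
is_readonly_mount() {
    mountpoint=$1
    mounts_file=${2:-/proc/mounts}   # default to the kernel's mount table
    awk -v mp="$mountpoint" '
        # Field 2 is the mount point, field 4 the comma-separated options.
        # A later entry for the same mount point overrides an earlier one.
        $2 == mp { opts = $4; found = 1 }
        END {
            if (!found) exit 2
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++) if (o[i] == "ro") exit 0
            exit 1
        }
    ' "$mounts_file"
}
```

For example, running `is_readonly_mount / && echo "root filesystem is read-only"` on the NC would print the message only when / is currently mounted read-only.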

* So I deregister it from cluster5 because it's having more serious problems

I assume this was done from the CC using euca_conf --deregister-nodes "nc 14 ip"?

* I do a stop/start on the CC (I shouldn't have to do this, but according to cc.log the CC stopped talking to NC14 after I did)

Why was the stop and start of the CC required? I believe CC 5 is the one you are talking about.

Cheers,
Deependra

arwin.tugade
Offline
Joined: 01/20/2011
Same thing happened again, IPs not being handed out

I just had an NC die abruptly. After bringing it back up, I tried to relaunch an instance, but its public and private IPs are 0.0.0.0. euca-describe-addresses shows that IPs are available, so I don't understand why this is happening. I am almost certain that doing a cleanstop/cleanstart on the CC will fix the issue, but that isn't a viable solution when other instances in that cluster are in use and working without issue. Is there a workaround for this?
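To see at a glance which instances came up without an address, the output of euca-describe-instances can be filtered. This is only a rough sketch: it assumes the usual tabular output in which instance lines begin with INSTANCE and carry the instance ID in the second field, and the function name unaddressed_instances is made up for illustration.

```shell
#!/bin/sh
# unaddressed_instances   (hypothetical helper, not a euca2ools command)
# Reads euca-describe-instances style output on stdin and prints the ID
# (second field) of every INSTANCE line that contains the address 0.0.0.0.
# Assumption: whitespace-separated fields, "INSTANCE" in field 1, ID in field 2.
unaddressed_instances() {
    awk '$1 == "INSTANCE" {
        for (i = 3; i <= NF; i++) {
            if ($i == "0.0.0.0") { print $2; next }   # report each instance once
        }
    }'
}
```

Typical use would be `euca-describe-instances | unaddressed_instances`, which lists only the affected instance IDs and does not touch the instances that are running normally.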

I'm on CentOS 5.6 with the Eucalyptus 2.0.3 RPM install from your repository.

Thanks,
Arwin