  1. Sep 02, 2010
  2. Aug 18, 2010
  3. Jul 26, 2010
    • watcher: smarter handling of instance records · f5116c87
      Iustin Pop authored
      
      This patch implements a few changes to the instance handling. First, old
      instances that no longer exist on the cluster are removed from the
      state file, to keep things clean.
      
      Second, the instance restart counters are reset every 8 hours, since
      some error cases might be transient (e.g. networking issues, or a
      machine temporarily down); if a problem takes more than 5 restarts but
      is not permanent, the watcher would otherwise never restart the
      instance again. The value of 8 hours is, I think, both conservative (so
      as not to hammer the cluster too often with restarts) and fast enough
      to clear semi-transient problems.
      
      And last, if an instance is not restarted due to exhausted retries, a
      warning should be logged, otherwise it's hard to understand why the
      watcher doesn't want to restart an ERROR_down instance.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: René Nussbaumer <rn@google.com>
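The counter-reset and stale-record cleanup described above could look roughly like the sketch below. This is only an illustration of the stated policy; the function and constant names (MaintainInstanceState, RETRY_EXPIRATION, MAX_RESTARTS) are assumptions, not the watcher's actual code.

```python
# Illustrative only: assumed names, not the real ganeti-watcher implementation.
import logging
import time

RETRY_EXPIRATION = 8 * 3600   # reset restart counters after 8 hours
MAX_RESTARTS = 5              # give up after this many failed restarts


def MaintainInstanceState(state, current_instances):
  """Prune stale records and expire old restart counters.

  state: dict mapping instance name to {"restarts": int, "when": float}
  current_instances: names of instances still defined on the cluster

  """
  # drop records for instances that no longer exist on the cluster
  for name in list(state):
    if name not in current_instances:
      logging.info("Removing record for vanished instance %s", name)
      del state[name]

  now = time.time()
  for name, record in state.items():
    if record["restarts"] and now - record["when"] > RETRY_EXPIRATION:
      # transient failures (network, node briefly down) should not keep an
      # instance blacklisted forever, so the counter expires after 8 hours
      record["restarts"] = 0
    elif record["restarts"] >= MAX_RESTARTS:
      # make the "won't restart" decision visible in the logs
      logging.warning("Not restarting instance %s: %d failed attempts",
                      name, record["restarts"])
```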
  4. Jul 09, 2010
  5. Jul 01, 2010
    • RAPI client: Switch to pycURL · 2a7c3583
      Michael Hanselmann authored
      
      Currently the RAPI client uses the urllib2 and httplib modules from
      Python's standard library. They're used with pyOpenSSL in a very fragile
      way, and there are known issues when receiving large responses from a RAPI
      server.
      
      By switching to pycURL we leverage the power and stability of the
      widely-used curl library (libcurl). This brings us much more flexibility
      than before, and timeouts were easy to implement (something that would
      have involved a lot of work with the built-in modules).
      
      There's one small drawback: programs using libcurl have to call
      curl_global_init(3) (available as pycurl.global_init) while exactly one
      thread is running (i.e. before any other threads are started) and are
      supposed to call curl_global_cleanup(3) (available as
      pycurl.global_cleanup) upon exiting. See the manpages for details. A
      decorator is provided to simplify this.
      
      Unittests for the new code are provided, increasing the test coverage of
      the RAPI client from 74% to 89%.
      
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
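A minimal sketch of such a decorator follows. The pycurl.global_init and pycurl.global_cleanup bindings are the real libcurl wrappers; the decorator name and structure here are assumptions rather than the exact helper added by this patch.

```python
# Sketch only: the decorator name and shape are assumptions, but
# pycurl.global_init/global_cleanup are the real libcurl bindings.
import functools
import pycurl


def uses_curl(fn):
  """Run fn between curl_global_init() and curl_global_cleanup().

  curl_global_init() must be called while exactly one thread is running,
  so the decorated function should be the program's entry point.

  """
  @functools.wraps(fn)
  def wrapper(*args, **kwargs):
    pycurl.global_init(pycurl.GLOBAL_ALL)
    try:
      return fn(*args, **kwargs)
    finally:
      pycurl.global_cleanup()
  return wrapper


@uses_curl
def main():
  # create RAPI client objects, start worker threads, etc.
  pass
```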
  6. Jun 30, 2010
  7. Jun 03, 2010
  8. Apr 09, 2010
  9. Apr 08, 2010
  10. Mar 23, 2010
  11. Mar 08, 2010
  12. Feb 26, 2010
  13. Feb 23, 2010
  14. Jan 28, 2010
  15. Jan 04, 2010
  16. Nov 25, 2009
  17. Nov 05, 2009
    • Add new “daemon-util” script to start/stop Ganeti daemons · f154a7a3
      Michael Hanselmann authored
      
      Until now, Ganeti started and stopped its own daemons using custom
      functions: to start a daemon it was simply executed, and to stop it
      again the appropriate signals were sent. Init scripts had to pay
      attention to the PID file and other details.
      
      With this patch, a new script is added (“daemon-util”, installed in
      $prefix/lib/ganeti/), centralizing the starting and stopping of daemons. The
      provided example init script is adjusted to use this new script. Ganeti's code
      no longer calls its own init script.
      
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
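For illustration, calling the new script from Python instead of sending signals directly might look like this; the installation path and the start/stop sub-commands are inferred from the commit message and should be treated as assumptions.

```python
# Illustration only: path and sub-commands are assumptions, not a spec.
import subprocess

DAEMON_UTIL = "/usr/lib/ganeti/daemon-util"   # i.e. $prefix/lib/ganeti/daemon-util


def start_daemon(name):
  """Start a Ganeti daemon, e.g. start_daemon("ganeti-noded")."""
  subprocess.check_call([DAEMON_UTIL, "start", name])


def stop_daemon(name):
  """Stop a previously started daemon via the same central script."""
  subprocess.check_call([DAEMON_UTIL, "stop", name])
```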
  18. Sep 18, 2009
  19. Aug 26, 2009
  20. Jul 24, 2009
  21. May 25, 2009
    • watcher: automatically restart noded/rapi · c4f0219c
      Iustin Pop authored
      
      This patch makes the watcher automatically restart the node and rapi
      daemons if they are not running (as judged by the PID file).
      
      This is not an exhaustive test; a better one would be a TCP connect to
      the port, and an even better one a simple protocol ping (e.g. GET / for
      rapi and an rpc_call_alive for noded), but since we don't know how the
      daemons have been started, we can't implement that today. rapi would
      need to write its SSL settings and port to a file, and noded something
      similar, so that we know how to connect.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
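A rough sketch of an "alive according to the PID file" check is shown below; the PID-file paths and the restart hook are placeholders, not Ganeti's actual ones.

```python
# Sketch with placeholder paths and a placeholder restart hook.
import errno
import logging
import os


def is_daemon_alive(pidfile):
  """Return True if the PID recorded in pidfile points to a live process."""
  try:
    with open(pidfile) as fh:
      pid = int(fh.read().strip())
  except (IOError, OSError, ValueError):
    return False
  try:
    os.kill(pid, 0)          # signal 0: existence/permission check only
  except OSError as err:
    return err.errno == errno.EPERM   # alive, but owned by another user
  return True


def ensure_daemons(restart_daemon):
  """Check noded/rapi and invoke the given restart callback when needed."""
  checks = [("ganeti-noded", "/var/run/ganeti/ganeti-noded.pid"),
            ("ganeti-rapi", "/var/run/ganeti/ganeti-rapi.pid")]
  for daemon, pidfile in checks:
    if not is_daemon_alive(pidfile):
      logging.warning("%s is not running, restarting it", daemon)
      restart_daemon(daemon)
```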
    • watcher: handle full and drained queue cases · 24edc6d4
      Iustin Pop authored
      
      Currently the watcher is broken when the queue is full, thus not
      fulfilling its job as a queue cleaner. It also doesn't handle the
      queue-drained status nicely.
      
      This patch makes a few changes:
        - archive jobs first, and only afterwards submit new jobs; this fixes
          the case where the queue is already full and there are jobs suited
          for archiving (but not the case where all the jobs are too young to
          be archived)
        - handle the job-queue-full and drained cases gracefully: instead of
          tracebacks, log such cases cleanly
        - reverse the initial value and special cases for update_file; we now
          whitelist instead of blacklist cases, since we have many more
          blacklist cases than whitelist ones, and we set the flag to True
          only after the run is successful
      
      The last change, especially, is a significant one: errors during the
      watcher run will no longer update the status file, and thus they won't
      be silently lost in the logs.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
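The reordered run could be structured like the sketch below; the helper callbacks and the exception names are stand-ins for the watcher's real symbols, and only the archive-first ordering and the whitelist flag are the point.

```python
# Stand-in names throughout; only the ordering and the whitelist idea matter.
import logging


class QueueFullError(Exception):
  pass


class QueueDrainedError(Exception):
  pass


def run_watcher_cycle(archive_jobs, do_checks, write_state_file):
  """One watcher cycle: archive first, then submit, then save state."""
  update_file = False
  try:
    archive_jobs()        # archive old jobs first, so a full queue can drain
    do_checks()           # only then submit verify-disks / restart jobs
    update_file = True    # whitelist: set only after a fully successful run
  except QueueFullError:
    logging.error("Job queue is full, skipping this watcher run")
  except QueueDrainedError:
    logging.error("Job queue is drained, skipping this watcher run")

  if update_file:
    write_state_file()    # a failed run keeps the old state-file timestamp
```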
  22. May 20, 2009
  23. May 19, 2009
    • watcher: try to restart the master if down · 7dfb83c2
      Iustin Pop authored
      
      Bugs in either our code or in associated libraries can bring the master daemon
      down, and this (due to the 2.0 architecture) stops all work on the cluster.
      
      Since the watcher already does periodic checks on the cluster, we modify
      it to try to start the master automatically when it fails to connect.
      This is attempted only once per cycle.
      
      Also, in this case, the watcher status file is not updated; its
      timestamp will thus reflect the time of the last successful connection
      to the master.
      
      Side note: the except errors.ConfigurationError part could be cleaned
      up, since in 2.0 we don't usually get that directly, and if we do it's
      an error and we shouldn't touch the file anyway; but that is not an rc5
      change.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
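A simplified sketch of the "start the master once per cycle" logic; both helper callables here are hypothetical, not part of the watcher's API.

```python
# Hypothetical helpers; only the retry-once-per-cycle shape is the point.
import logging


def connect_or_start_master(get_luxi_client, start_master):
  """Connect to the master daemon, trying to start it once if that fails."""
  try:
    return get_luxi_client()
  except Exception:
    logging.warning("Cannot reach the master daemon, trying to start it")
    start_master()          # attempted at most once per watcher cycle
  # second and final attempt; if this also fails, the caller should leave the
  # status file untouched so its timestamp marks the last good connection
  return get_luxi_client()
```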
  24. Apr 06, 2009
    • Fix the output of watcher on non-master nodes · 2c404217
      Iustin Pop authored
      Currently the watcher spews error messages on non-master nodes. This
      patch cleans that up.
      
      Reviewed-by: imsnah
    • Change the watcher to use jobs instead of queries · 6dfcc47b
      Iustin Pop authored
      As per the mailing list discussion, this patch changes the watcher to
      use a single job (two opcodes) for getting the cluster state (node list
      and instance list); it will then compute the needed actions based on
      this data.
      
      The patch also archives this job and the verify-disks job.
      
      Reviewed-by: imsnah
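In rough terms, the watcher now submits one job carrying both query opcodes and waits for its result. The field lists and the polling helper below approximate the 2.0-era API and may not match the actual watcher code.

```python
# Approximation of the 2.0-era API; field names and helpers may differ.
from ganeti import cli, luxi, opcodes

client = luxi.Client()
job_id = client.SubmitJob([
  opcodes.OpQueryInstances(names=[], output_fields=["name", "status"]),
  opcodes.OpQueryNodes(names=[], output_fields=["name", "bootid", "offline"]),
  ])
# one round-trip: both opcode results arrive together when the job finishes
instance_data, node_data = cli.PollJob(job_id, cl=client)
```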
  25. Mar 09, 2009
    • watcher: fix startup sequence locking the master · cc962d58
      Iustin Pop authored
      Currently, the watcher startup sequence does:
        - open a luxi client
        - get the instance list
        - get the node boot ids
        - open and lock the status file, and:
          - archive jobs
          - restart the down instances
          - check disks
      
      This, of course, can lead to problems when a node is (genuinely or not)
      locked for more than (watcher interval * maximum query clients) time. At
      that point the master is completely unresponsive until the node is
      unlocked, and all the queued watchers then exit with an error because
      the state file is locked by the first instance.
      
      This patch reworks the startup sequence to first open/lock the status
      file, and only then open a luxi client. This should prevent the above
      case.
      
      Reviewed-by: ultrotter
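A simplified sketch of the new ordering, using plain fcntl locking; the state-file path and the two callbacks are placeholders rather than the watcher's real names.

```python
# Placeholder path/callbacks; the point is the lock-before-connect ordering.
import fcntl


def open_statefile_locked(path):
  """Open and lock the watcher state file, failing fast if already locked."""
  statefile = open(path, "a+")
  fcntl.flock(statefile.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
  return statefile


def run_watcher(connect_to_master, run_checks,
                state_path="/var/lib/ganeti/watcher.data"):
  # 1. take the state-file lock first: a concurrent watcher fails right here
  #    instead of piling up as yet another blocked query client on the master
  statefile = open_statefile_locked(state_path)
  # 2. only now open a luxi client and do the actual work
  client = connect_to_master()
  run_checks(client, statefile)
```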
  26. Feb 24, 2009
    • Remove the extra_args parameter in instance start · 07813a9e
      Iustin Pop authored
      This patch removes the extra_args parameter and instead switches
      instance start to the HV_KERNEL_ARGS hypervisor option.
      
      This is a big change, but it's a needed cleanup: this extra parameter on
      all RPC calls is not generic, and we also need to have a persistent
      value here.
      
      Reviewed-by: imsnah
  27. Feb 16, 2009
    • watcher: fix checking of boot IDs · 3448aa22
      Iustin Pop authored
      The recent change to the watcher (commit 2151) that made it handle
      offline nodes also saves the offline attribute to the state file, but
      this is not needed and it breaks the checking of the boot ID. This patch
      simply removes it, restoring the correct behaviour.
      
      Reviewed-by: imsnah
    • watcher: autoarchive old jobs · f07521e5
      Iustin Pop authored
      This patch adds auto-archiving of jobs older than 6 hours to the
      watcher.
      
      Reviewed-by: imsnah
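The 6-hour policy amounts to something like the following; the job-listing and archive calls are placeholder callbacks rather than the luxi client's real methods.

```python
# Placeholder callbacks; only the age-based cutoff is the point.
import time

ARCHIVE_AGE = 6 * 3600   # seconds


def archive_old_jobs(list_finished_jobs, archive_job):
  """Archive finished jobs whose end timestamp is older than six hours."""
  cutoff = time.time() - ARCHIVE_AGE
  archived = 0
  for job_id, end_timestamp in list_finished_jobs():
    if end_timestamp < cutoff:
      archive_job(job_id)
      archived += 1
  return archived
```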
  28. Feb 04, 2009
    • Implement lockless query operations · ec79568d
      Iustin Pop authored
      This patch adds the framework for, and enables, lockless
      OpQueryInstances. This means that instances will be shown in ERROR_up or
      ERROR_down state, even though this is not an error (but just an
      in-progress job).
      
      The framework is implemented as follows:
        - the OpQueryInstances, OpQueryNodes and OpQueryExports opcodes take
          an additional “use_locking” flag which will denote whether to lock
          or not; this patch only implements this for LUQueryInstances
        - the luxi query functions take an additional argument use_locking
          which is passed to the master daemon, and then passed to the above
          opcodes
        - cli.py exports a new SYNC_OPT command line option which implements
          setting this flag to true
        - except for gnt-instance list, which uses this option, and for
          name-only queries (e.g. QueryNodes(fields=["names"])), all other
          callers set this flag to True
        - RAPI also sets the flag to True
      
      The patch was tested with a continuous (0.2s sleep in-between)
      gnt-instance list during a burnin, and no problems were observed.
      
      Reviewed-by: ultrotter
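At the luxi level the change boils down to an extra use_locking argument on the query calls, roughly as below; the exact argument names and order may differ from the final client API.

```python
# Rough illustration; argument order/names may not match the client exactly.
from ganeti import luxi

client = luxi.Client()

# the conservative, lock-based query most internal callers keep using
locked = client.QueryInstances([], ["name", "status"], True)

# the new lockless variant used by gnt-instance list: instances touched by an
# in-progress job may show up as ERROR_up/ERROR_down, which is not an error
unlocked = client.QueryInstances([], ["name", "status"], False)
```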