  1. Jul 19, 2009
    • job queue: fix loss of finalized opcode result · 34327f51
      Iustin Pop authored
      
      Currently, an unclean master daemon shutdown overwrites the status
      and result of all of a job's opcodes with error/None. This is
      incorrect: any already-finished opcodes should keep their status
      and result, and only not-yet-processed opcodes should be marked as
      ‘error’. Cancelling a job between opcodes has the same effect (but
      the code does not currently allow this, so it matters less than
      unclean shutdown).
      
      This patch adds a new _QueuedJob function that only overwrites the
      status and result of non-finalized opcodes, and uses it in job
      queue initialization and in the cancel-job path. The patch also
      adds some comments and a new set of constants in constants.py
      highlighting the finalized vs. non-finalized opcode statuses.
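
      A minimal sketch of the idea in plain Python; the constant and
      helper names below are illustrative and need not match the ones
      introduced by the patch:

        # Statuses from which an opcode never moves again; their
        # status/result must be preserved on restart or cancel.
        OP_STATUS_CANCELED = "canceled"
        OP_STATUS_SUCCESS = "success"
        OP_STATUS_ERROR = "error"
        OPS_FINALIZED = frozenset([OP_STATUS_CANCELED,
                                   OP_STATUS_SUCCESS,
                                   OP_STATUS_ERROR])

        def mark_unfinished_ops(ops, status, result):
            """Overwrite status/result only for not-yet-finalized opcodes."""
            for op in ops:
                if op.status not in OPS_FINALIZED:
                    op.status = status
                    op.result = result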
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
    • Switch gnt-debug submit-job to JobExecutor · b59252fe
      Iustin Pop authored
      
      Currently, gnt-debug submits jobs individually, but in 2.1 the
      JobExecutor class uses the optimized SubmitManyJobs luxi call and
      as such should be used whenever multiple jobs need to be
      submitted.
      
      This patch converts gnt-debug submit-job to use it and also removes an
      extra empty line in the JobExecutor class.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
    • Convert instance reinstall to multi instance model · 3d2ca95d
      Iustin Pop authored
      
      This patch converts ‘gnt-instance reinstall’ from the
      single-instance to the multi-instance model; since this is
      dangerous, passing “--force --force-multiple” is required to skip
      the confirmation.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
      (cherry picked from commit 55efe6da)
    • gnt-instance batch-create: use the job executor · dd7dcca7
      Iustin Pop authored
      
      This small patch changes the batch-create functionality to use the
      job executor instead of single-job submits.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
      (cherry picked from commit d4dd4b74)
    • Modify cli.JobExecutor to use SubmitManyJobs · f2921752
      Iustin Pop authored
      
      This patch changes the generic "multiple job executor" to use the
      many-jobs submit model, which automatically moves all its users to
      the new model.

      This makes, for example, startup/shutdown of a full cluster much
      smoother (all the submitted job IDs are visible quickly, and
      waiting for them then proceeds normally).
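
      A rough sketch of the executor pattern; the class below is a
      simplified stand-in for cli.JobExecutor, not its actual code:

        class JobExecutor(object):
            """Queue jobs locally, then submit them in a single call."""

            def __init__(self, client):
                self._client = client
                self._queued = []

            def QueueJob(self, name, *ops):
                self._queued.append((name, list(ops)))

            def SubmitPending(self):
                # One SubmitManyJobs round-trip instead of N SubmitJob
                # calls: every job ID is visible as soon as it returns.
                ids = self._client.SubmitManyJobs(
                    [ops for (_, ops) in self._queued])
                return list(zip([name for (name, _) in self._queued], ids))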
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
      (cherry picked from commit 23b4b983)
    • Add a luxi call for multi-job submit · 56d8ff91
      Iustin Pop authored
      
      As a workaround for the job submit timeouts that we have, this
      patch adds a new luxi call for multi-job submit; the advantage is
      that all the jobs are added to the queue first, and only then can
      the workers start processing them.

      This is definitely faster than per-job submit, where the
      submission of new jobs competes with the workers already
      processing jobs.
      
      On a pure no-op OpDelay opcode (not on master, not on nodes), we have:
        - 100 jobs:
          - individual: submit time ~21s, processing time ~21s
          - multiple:   submit time 7-9s, processing time ~22s
        - 250 jobs:
          - individual: submit time ~56s, processing time ~57s
                        run 2:      ~54s                  ~55s
          - multiple:   submit time ~20s, processing time ~51s
                        run 2:      ~17s                  ~52s
      
      which shows that we indeed gain on the client side, and maybe even on
      the total processing time for a high number of jobs. For just 10 or so I
      expect the difference to be just noise.
      
      This will probably require increasing the timeout a little when
      submitting very many jobs - 250 jobs at ~20 seconds is close to
      the current rw timeout of 60s.
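
      The gain comes from queueing the whole batch before any worker
      starts. A toy illustration of that ordering, not Ganeti's actual
      queue code:

        import threading

        class ToyQueue(object):
            def __init__(self):
                self._cond = threading.Condition()
                self._jobs = []

            def submit_many(self, job_defs):
                with self._cond:
                    first = len(self._jobs)
                    self._jobs.extend(job_defs)
                    # Workers are woken only once the whole batch is
                    # queued, so submission does not compete with them.
                    self._cond.notify_all()
                    return list(range(first, len(self._jobs)))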
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
      (cherry picked from commit 2971c913)
    • job queue: fix interrupted job processing · f6424741
      Iustin Pop authored
      
      If a job with more than one opcode is being processed and the
      master daemon crashes between two opcodes, we end up with the
      first N opcodes marked successful and the rest marked as queued.
      This means the overall job status is queued, and thus on master
      daemon restart the job will be picked up again for completion.
      
      However, the RunTask() function in jqueue.py doesn't deal with
      partially-completed jobs. This patch makes it simply skip the
      already-finished opcodes.
      
      An alternative option would be to mark partially-completed jobs
      not as QUEUED but as RUNNING, which would result in the job being
      aborted at restart time.
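
      A sketch of the skip logic (simplified; the real code lives in the
      job queue worker's RunTask, and the attribute names here are
      assumptions):

        OP_STATUS_SUCCESS = "success"

        def run_job(job, execute_op):
            for op in job.ops:
                if op.status == OP_STATUS_SUCCESS:
                    # Finished before the crash: keep its result
                    # instead of executing it a second time.
                    continue
                op.result = execute_op(op.input)
                op.status = OP_STATUS_SUCCESS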
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
    • Fix an error path in job queue worker's RunTask · ed21712b
      Iustin Pop authored
      
      In case the job fails, we try to set the job's run_op_idx to -1.
      However, this is the wrong attribute name, which went undetected
      until the __slots__ addition. The correct attribute is
      run_op_index.
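
      Plain Python shows why adding __slots__ catches this class of bug:
      assigning to a misspelled attribute raises AttributeError instead
      of silently creating a new one:

        class QueuedJob(object):
            __slots__ = ["run_op_index"]

            def __init__(self):
                self.run_op_index = None

        job = QueuedJob()
        job.run_op_index = -1  # correct attribute, works
        job.run_op_idx = -1    # raises AttributeError at runtime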
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
  2. Jun 30, 2009
    • Cleanup config data when draining nodes · dec0d9da
      Iustin Pop authored
      
      Currently, when draining nodes we reset their master candidate flag, but
      we don't instruct them to demote themselves. This leads to “ERROR: file
      '/var/lib/ganeti/config.data' should not exist on non master candidates
      (and the file is outdated)”.
      
      This patch simply adds a call to node_demote_from_mc in this case.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
    • Fix node readd issues · a8ae3eb5
      Iustin Pop authored
      
      This patch fixes a few node readd issues.
      
      Currently, the node readd consists of two opcodes:
        - OpSetNodeParms, which resets the offline/drained flags
        - OpAddNode (with readd=True), which reconfigures the node
      
      The problem is that between these two, the configuration is
      inconsistent for certain cluster configurations. Thus, this patch
      removes the first opcode and modifies LUAddNode to deal with this
      case too.
      
      The patch also modifies the computation of the intended master_candidate
      status, and actually sets the readded node to master candidate if
      needed. Previously, we didn't modify the existing node at all.
      
      Finally, the patch modifies the bottom of the Exec() function for this
      LU to:
        - trigger a node update, which in turn redistributes the ssconf files
          to all nodes (and thus the new node too)
        - if the new node is not a master candidate, then call the
          node_demote_from_mc RPC so that old master files are cleared
      
      My testing shows this behaves correctly for various cases.
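
      A compressed sketch of that Exec() tail; the helper names below
      stand in for Ganeti's config and RPC interfaces and are
      assumptions:

        def finish_readd(cfg, rpc, new_node):
            # Trigger a node update, which redistributes the ssconf
            # files to all nodes, including the re-added one.
            cfg.Update(new_node)
            if not new_node.master_candidate:
                # Clear stale master files from the node's old role.
                rpc.call_node_demote_from_mc(new_node.name)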
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
    • backend.DemoteFromMC: don't fail for missing files · 9a5cb537
      Iustin Pop authored
      
      If the config file is missing when the DemoteFromMC() function is
      called, it will raise a ProgrammerError. Instead of changing
      utils.CreateBackup(), which is called from multiple places, for
      now we only change DemoteFromMC() to skip the call when the file
      does not exist (we rely on the master to prevent race conditions
      here).
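
      The guard itself is tiny; a sketch with the backup helper passed
      in as a callback (the path matches the error message quoted in the
      draining-nodes commit above):

        import os

        CONFIG_FILE = "/var/lib/ganeti/config.data"

        def demote_from_mc(create_backup):
            # Only back the file up if it is actually there; a missing
            # file just means there is nothing to clean up.
            if os.path.isfile(CONFIG_FILE):
                create_backup(CONFIG_FILE)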
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Olivier Tharan <olive@google.com>
    • Allow GetMasterCandidateStats to ignore some nodes · 23f06b2b
      Iustin Pop authored
      
      This patch modifies ConfigWriter.GetMasterCandidateStats to allow
      it to ignore some nodes in the calculation, so that we can use it
      to predict the cluster state without certain nodes (which we know
      we will modify, and whose current state we therefore should not
      rely on).
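
      The shape of the change, as a standalone function rather than the
      real ConfigWriter method (the exact statistics it returns are
      simplified here):

        def get_master_candidate_stats(nodes, exceptions=()):
            """Count master candidates, ignoring the given node names."""
            mc_now = mc_total = 0
            for node in nodes:
                if node.name in exceptions:
                    continue
                mc_total += 1
                if node.master_candidate:
                    mc_now += 1
            return mc_now, mc_total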
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Olivier Tharan <olive@google.com>