The openstack live_snapshot implementation has a race condition

2014-07-09  sak0
Category: Virtualization

OpenStack's Havana release added live_snapshot, i.e. a snapshot that does not interrupt the workload running inside the guest. The live_snapshot implementation in the code:

nova/virt/libvirt/driver.py

    def _live_snapshot():   # excerpt: signature and surrounding setup omitted
        try:
            # NOTE (rmk): blockRebase cannot be executed on persistent
            # domains, so we need to temporarily undefine it.
            # If any part of this block fails, the domain is
            # re-defined regardless.
            if domain.isPersistent():
                domain.undefine()

            # NOTE (rmk): Establish a temporary mirror of our root disk and
            # issue an abort once we have a complete copy.
            domain.blockRebase(disk_path, disk_delta, 0,
                               libvirt.VIR_DOMAIN_BLOCK_REBASE_COPY |
                               libvirt.VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT |
                               libvirt.VIR_DOMAIN_BLOCK_REBASE_SHALLOW)

            while self._wait_for_block_job(domain, disk_path):
                time.sleep(0.5)

            domain.blockJobAbort(disk_path, 0)
            libvirt_utils.chown(disk_delta, os.getuid())
        finally:
            self._conn.defineXML(xml)

    def _wait_for_block_job(domain, disk_path, abort_on_error=False):
        status = domain.blockJobInfo(disk_path, 0)
        if status == -1 and abort_on_error:
            msg = _('libvirt error while requesting blockjob info.')
            raise exception.NovaException(msg)
        try:
            cur = status.get('cur', 0)
            end = status.get('end', 0)
        except Exception:
            return False

        if cur == end and cur != 0 and end != 0:
            return False
        else:
            return True

Flow analysis:

At the OpenStack level: nova first calls the libvirt API domain.blockRebase, which makes qemu start a "mirror job" on the disk; it then repeatedly calls domain.blockJobInfo to poll the copy job, and once the job's cur counter has caught up with end it calls domain.blockJobAbort to end the job.

At the libvirt level: domain.blockRebase maps to the qemu drive-mirror command, and domain.blockJobInfo maps to the qemu monitor command info blockjob. domain.blockJobAbort is a synchronous API: it first issues qemu's block-job-cancel to stop the job, and then keeps polling until the job has actually gone away before returning.
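To make this three-step flow concrete, here is a minimal standalone sketch using the libvirt Python bindings; the domain name and the two file paths are made up for illustration, and the copy target must already exist because of the REUSE_EXT flag. It mimics what nova does above, including the 0.5-second polling:

    import time
    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('instance-00000001')            # hypothetical domain

    disk_path = '/var/lib/nova/instances/uuid/disk'         # hypothetical source image
    disk_delta = '/var/lib/nova/instances/uuid/disk.delta'  # hypothetical copy target

    # Ask qemu to start a drive-mirror of the top image layer into disk_delta.
    dom.blockRebase(disk_path, disk_delta, 0,
                    libvirt.VIR_DOMAIN_BLOCK_REBASE_COPY |
                    libvirt.VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT |
                    libvirt.VIR_DOMAIN_BLOCK_REBASE_SHALLOW)

    # Poll the block job until cur has caught up with end.
    while True:
        info = dom.blockJobInfo(disk_path, 0)
        if info and info['cur'] == info['end'] != 0:
            break
        time.sleep(0.5)

    # Synchronous cancel: this is the call that can hang when the guest
    # keeps dirtying the disk, as analysed below.
    dom.blockJobAbort(disk_path, 0)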

At the qemu level: the mirror job is described in the source as "Start mirroring a block device's writes to a new destination, using the specified target." The important loop is:

block/mirror.c


    static void coroutine_fn mirror_run(void *opaque)
    {
        /* ... setup elided in this excerpt ... */

        for (;;) {
            uint64_t delay_ns;
            int64_t cnt;
            bool should_complete;

            if (s->ret < 0) {
                ret = s->ret;
                goto immediate_exit;
            }

            cnt = bdrv_get_dirty_count(bs, s->dirty_bitmap);

            /* Note that even when no rate limit is applied we need to yield
             * periodically with no pending I/O so that qemu_aio_flush() returns.
             * We do so every SLICE_TIME nanoseconds, or when there is an error,
             * or when the source is clean, whichever comes first.
             */
            if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - last_pause_ns < SLICE_TIME &&
                s->common.iostatus == BLOCK_DEVICE_IO_STATUS_OK) {
                if (s->in_flight == MAX_IN_FLIGHT || s->buf_free_count == 0 ||
                    (cnt == 0 && s->in_flight > 0)) {
                    trace_mirror_yield(s, s->in_flight, s->buf_free_count, cnt);
                    qemu_coroutine_yield();
                    continue;
                } else if (cnt != 0) {
                    mirror_iteration(s);
                    continue;
                }
            }

            should_complete = false;
            if (s->in_flight == 0 && cnt == 0) {
                trace_mirror_before_flush(s);
                ret = bdrv_flush(s->target);
                if (ret < 0) {
                    if (mirror_error_action(s, false, -ret) == BDRV_ACTION_REPORT) {
                        goto immediate_exit;
                    }
                } else {
                    /* We're out of the streaming phase. From now on, if the job
                     * is cancelled we will actually complete all pending I/O and
                     * report completion. This way, block-job-cancel will leave
                     * the target in a consistent state.
                     */
                    s->common.offset = end * BDRV_SECTOR_SIZE;
                    if (!s->synced) {
                        block_job_ready(&s->common);
                        s->synced = true;
                    }

                    should_complete = s->should_complete ||
                        block_job_is_cancelled(&s->common);
                    cnt = bdrv_get_dirty_count(bs, s->dirty_bitmap);
                }
            }

            if (cnt == 0 && should_complete) {
                /* The dirty bitmap is not updated while operations are pending.
                 * If we're about to exit, wait for pending operations before
                 * calling bdrv_get_dirty_count(bs), or we may exit while the
                 * source has dirty data to copy!
                 *
                 * Note that I/O can be submitted by the guest while
                 * mirror_populate runs.
                 */
                trace_mirror_before_drain(s, cnt);
                bdrv_drain_all();
                cnt = bdrv_get_dirty_count(bs, s->dirty_bitmap);
            }

            ret = 0;
            trace_mirror_before_sleep(s, cnt, s->synced);
            if (!s->synced) {
                /* Publish progress */
                s->common.offset = (end - cnt) * BDRV_SECTOR_SIZE;

                if (s->common.speed) {
                    delay_ns = ratelimit_calculate_delay(&s->limit, sectors_per_chunk);
                } else {
                    delay_ns = 0;
                }

                block_job_sleep_ns(&s->common, QEMU_CLOCK_REALTIME, delay_ns);
                if (block_job_is_cancelled(&s->common)) {
                    break;
                }
            } else if (!should_complete) {
                delay_ns = (s->in_flight == 0 && cnt == 0 ? SLICE_TIME : 0);
                block_job_sleep_ns(&s->common, QEMU_CLOCK_REALTIME, delay_ns);
            } else if (cnt == 0) {
                /* The two disks are in sync. Exit and report successful
                 * completion.
                 */
                assert(QLIST_EMPTY(&bs->tracked_requests));
                s->common.cancelled = false;
                break;
            }
            last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
        }

The mirror coroutine loops forever checking for dirty data, and there are only two ways out of the loop: 1. the job is cancelled while source and target are not yet in sync (s->synced is false); 2. after source and target are in sync, the loop exits when should_complete is true and the dirty count is 0 in the same iteration, and should_complete itself only becomes true in an iteration where the dirty count is already 0 and the job has been asked to stop. So when the guest keeps issuing I/O, there is only a very narrow window in which setting the job's cancelled state actually terminates it, and the OpenStack layer above tries to hit that window with its sleep(0.5) polling, heh.
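A toy model of these exit conditions (plain Python, not qemu code; it simplifies away the flush and drain steps) may make the narrowness of that window clearer:

    def mirror_can_exit(synced, cancelled, in_flight, dirty_cnt):
        """Simplified model of mirror_run's exit logic described above."""
        if not synced:
            # Exit path 1: cancelled before source and target ever reached sync;
            # the target is left in an inconsistent state.
            return cancelled
        # Exit path 2: the cancel is only honoured in an iteration where there is
        # no in-flight I/O and the dirty bitmap is empty at the same time.
        should_complete = cancelled and in_flight == 0 and dirty_cnt == 0
        return should_complete

    # With a guest that writes continuously, iterations where dirty_cnt == 0
    # are rare, so a cancel issued from the 0.5-second poll can easily arrive
    # "too late" and the job simply keeps copying.
    assert mirror_can_exit(synced=True, cancelled=True, in_flight=0, dirty_cnt=0)
    assert not mirror_can_exit(synced=True, cancelled=True, in_flight=0, dirty_cnt=8)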

Consequences:

1. The mirror job keeps running until I/O inside the guest OS stops, so host resources stay occupied the whole time.

2. libvirt's blockJobAbort call never returns; since nova calls libvirt in blocking mode here, nova gets stuck as well (see the sketch below for one way to avoid this particular hang).
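For consequence 2 specifically, libvirt itself offers an escape hatch: the VIR_DOMAIN_BLOCK_JOB_ABORT_ASYNC flag makes blockJobAbort return immediately, so the caller can poll for the job to disappear with a timeout of its own choosing. A minimal sketch (same hypothetical domain and disk path as before; this only avoids the hang in the caller, it does not fix the underlying race in the job):

    import time
    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('instance-00000001')     # hypothetical domain
    disk_path = '/var/lib/nova/instances/uuid/disk'  # hypothetical disk

    # Ask qemu to cancel the block job, but return immediately instead of
    # waiting inside libvirt for the job to actually go away.
    dom.blockJobAbort(disk_path, libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_ASYNC)

    # Poll with a bounded timeout so the caller can never hang forever the
    # way the default synchronous abort can.
    deadline = time.time() + 300                     # arbitrary timeout for illustration
    while time.time() < deadline:
        if not dom.blockJobInfo(disk_path, 0):       # empty result: no job left
            break
        time.sleep(0.5)
    else:
        raise RuntimeError('block job did not terminate within the timeout')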

How to reproduce:

Increasing the value of sleep(x) makes this race very easy to reproduce; even the default 0.5 leaves a window for it to occur. Keeping the guest busy with writes, as in the sketch below, makes it more likely still.
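To keep the dirty bitmap from ever draining while the snapshot runs, it is enough to run a continuous write load inside the guest; a trivial sketch (the file path is arbitrary, pick one that lives on the mirrored disk):

    import os

    # Keep dirtying the guest disk so the mirror job rarely, if ever, sees a
    # zero dirty count; run this inside the guest while the snapshot is taken.
    buf = b'x' * (4 * 1024 * 1024)            # 4 MiB per write, arbitrary size
    with open('/var/tmp/io_load.dat', 'wb') as f:
        while True:
            f.seek(0)
            f.write(buf)
            f.flush()
            os.fsync(f.fileno())              # push the writes down to the virtual disk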

 

Possible fixes:

1. Add a code path in qemu that can force the job to exit.

2. Use the mirror interface with caution, and consider other approaches for live backup (one alternative is sketched below).
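As an example of point 2, here is a hedged sketch of one alternative that avoids drive-mirror entirely: a disk-only external snapshot freezes the current image as a read-only backing file underneath a new overlay, and the frozen file can then be copied out at leisure. The domain name, the disk target 'vda' and the overlay path are assumptions, and the later step of merging the overlay back into the base image is not shown:

    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('instance-00000001')   # hypothetical domain

    # Disk-only external snapshot: qemu starts writing to a new overlay file
    # and the old image becomes its read-only backing file.
    snapshot_xml = """
    <domainsnapshot>
      <disks>
        <disk name='vda' snapshot='external'>
          <source file='/var/lib/nova/instances/uuid/disk.snap'/>
        </disk>
      </disks>
    </domainsnapshot>
    """
    dom.snapshotCreateXML(snapshot_xml,
                          libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY |
                          libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_NO_METADATA)

    # From this point on the original image file is no longer written to and
    # can be backed up with a plain file copy; merging the overlay back
    # afterwards (blockCommit / blockRebase) is a separate step not covered here.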

