Migrate meminfo-writer=False service setting to maxmem=0 as a method to
disable dynamic memory management. Remove the service from vm.features
dict in the process.
Additionally, translate any attempt to set the service.meminfo-writer
feature to either setting maxmem=0 or resetting it to the default (which
is memory balancing enabled if supported by given domain). This is to at
least partially not break existing tools using service.meminfo-writer as
a way to control dynamic memory management. This code does _not_ support
reading service.meminfo-writer feature state to get the current state of
dynamic memory management, as it would require synchronizing with all
the factors affecting its value. One of main reasons for migrating to
maxmem=0 approach is to avoid the need of such synchronization.
QubesOS/qubes-issues#4480
Use maxmem=0 for disabling dynamic memory balance, instead of cryptic
service.meminfo-writer feature. Under the hood, meminfo-writer service
is also set based on maxmem property (directly in qubesdb, not
vm.features dict).
Having this as a property (not "feature"), allow to have sensible
handling of default value. Specifically, disable it automatically if
otherwise it would crash a VM. This is the case for:
- domain with PCI devices (PoD is not supported by Xen then)
- domain without balloon driver and/or meminfo-writer service
The check for the latter is heuristic (assume presence of 'qrexec' also
can indicate balloon driver support), but it is true for currently
supported systems.
This also allows more reliable control of libvirt config: do not set
memory != maxmem, unless qmemman is enabled.
memory != maxmem only makes sense if qmemman for given domain is
enabled. Besides wasting some domain resources for extra page tables
etc, for HVM domains this is harmful, because maxmem-memory difference
is made of Popupate-on-Demand pool, which - when depleted - will kill
the domain. This means domain without balloon driver will die as soon
as will try to use more than initial memory - but without balloon driver
it sees maxmem memory and doesn't know about the lower limit.
FixesQubesOS/qubes-issues#4135
It makes a lot of sense to call long-running operations in that event
handler, including calling back into the VM. Allow that by using
fire_event_async, not just fire_event.
Also, document the event.
Commit 15cf593bc5 "tests/lvm: fix checking
lvm pool existence" attempted to fix handling '-' in pool name by using
/dev/VG/LV symlink. But those are not created for thin pools. Change
back to /dev/mapper, but include '-' mangling.
Related QubesOS/qubes-issues#4332
Restore old code for calculating subdir within the archive. The new one
had two problems:
- set '/' for empty input subdir - which caused qubes.xml.000 to be
named '/qubes.xml.000' (and then converted to '../../qubes.xml.000');
among other things, this results in the wrong path used for encryption
passphrase
- resolved symlinks, which breaks calculating path for any symlinks
within VM's directory (symlinks there should be treated as normal files
to be sure that actual content is included in the backup)
This partially reverts 4e49b951ce.
FixesQubesOS/qubes-issues#4493
vm.kill() will try to get vm.startup_lock, so it can't be called while
holding it already.
Fix this by extracting vm._kill_locked(), which expect the lock to be
already taken by the caller.
* origin/pr/239:
storage: fix NotImplementedError message for import_data()
storage/reflink: make resize()/import_volume() more readable
storage/reflink: unblock import_data() and import_data_end()
Try to collect more details about why the test failed. This will help
only if qvm-open-in-dvm exist early. On the other hand, if it hang, or
remote side fails to find the right editor (which results in GUI error
message), this change will not provide any more details.
First boot of whonix-ws based VM take extended period of time, because
a lot of files needs to be copied to private volume. This takes even
more time, when verbose logging through console is enabled. Extend the
timeout for that.
If domain is set to autostart, qubes-vm@ systemd service is used to
start it at boot. Cleanup the service when domain is removed, and
similarly enable the service when domain is created and already have
autostart=True.
FixesQubesOS/qubes-issues#4014
Cleanup VMs in template reverse topological order, not network one.
Network can be set to None to break dependency, but template can't. For
netvm to be changed, kill VMs first (kill doesn't check network
dependency), so netvm change will not trigger side effects (runtime
change, which could fail).
This fixes cleanup for tests creating custom templates - previously
order was undefined and if template was tried removed before its child
VMs, it fails. All the relevant files were removed later anyway, but it
lead to python objects leaks.
Cleaning up after domain shutdown (domain-stopped and domain-shutdown
events) relies on libvirt events which may be unreliable in some cases
(events may be processed with some delay, of if libvirt was restarted in
the meantime, may not happen at all). So, instead of ensuring only
proper ordering between shutdown cleanup and next startup, also trigger
the cleanup when we know for sure domain isn't running:
- at vm.kill() - after libvirt confirms domain was destroyed
- at vm.shutdown(wait=True) - after successful shutdown
- at vm.remove_from_disk() - after ensuring it isn't running but just
before actually removing it
This fixes various race conditions:
- qvm-kill && qvm-remove: remove could happen before shutdown cleanup
was done and storage driver would be confused about that
- qvm-shutdown --wait && qvm-clone: clone could happen before new content was
commited to the original volume, making the copy of previous VM state
(and probably more)
Previously it wasn't such a big issue on default configuration, because
LVM driver was fully synchronous, effectively blocking the whole qubesd
for the time the cleanup happened.
To avoid code duplication, factor out _ensure_shutdown_handled function
calling actual cleanup (and possibly canceling one called with libvirt
event). Note that now, "Duplicated stopped event from libvirt received!"
warning may happen in normal circumstances, not only because of some
bug.
It is very important that post-shutdown cleanup happen when domain is
not running. To ensure that, take startup_lock and under it 1) ensure
its halted and only then 2) execute the cleanup. This isn't necessary
when removing it from disk, because its already removed from the
collection at that time, which also avoids other calls to it (see also
"vm/dispvm: fix DispVM cleanup" commit).
Actually, taking the startup_lock in remove_from_disk function would
cause a deadlock in DispVM auto cleanup code:
- vm.kill (or other trigger for the cleanup)
- vm.startup_lock acquire <====
- vm._ensure_shutdown_handled
- domain-shutdown event
- vm._auto_cleanup (in DispVM class)
- vm.remove_from_disk
- cannot take vm.startup_lock again
First unregister the domain from collection, and only then call
remove_from_disk(). Removing it from collection prevent further calls
being made to it. Or if anything else keep a reference to it (for
example as a netvm), then abort the operation.
Additionally this makes it unnecessary to take startup lock when
cleaning it up in tests.
vm.shutdown(wait=True) waited indefinitely for the shutdown, which makes
useless without some boilerplate handling the timeout. Since the timeout
may depend on the operating system inside, add a per-VM property for it,
with value inheritance from template and then from global
default_shutdown_timeout property.
When timeout is reached, the method raises exception - whether to kill
it or not is left to the caller.
FixesQubesOS/qubes-issues#1696
LVM operations can take significant amount of time. This is especially
visible when stopping a VM (`vm.storage.stop()`) - in that time the
whole qubesd freeze for about 2 seconds.
Fix this by making all the ThinVolume methods a coroutines (where
supported). Each public coroutine is also wrapped with locking on
volume._lock to avoid concurrency-related problems.
This all also require changing internal helper functions to
coroutines. There are two functions that still needs to be called from
non-coroutine call sites:
- init_cache/reset_cache (initial cache fill, ThinPool.setup())
- qubes_lvm (ThinVolume.export()
So, those two functions need to live in two variants. Extract its common
code to separate functions to reduce code duplications.
FixesQubesOS/qubes-issues#4283
Both vm.create_on_disk() and vm.start() are coroutines. Tests in this
class didn't run them, so basically didn't test anything.
Wrap couroutine calls with self.loop.run_until_complete().
Additionally, don't fail if LVM pool is named differently.
In that case, the test is rather sily, as it probably use the same pool
for source and destination (operation already tested elsewhere). But it
isn't a reason for failing the test.
On some storage pools this operation can also be time consuming - for
example require creating temporary volume, and volume.create() already
can be a coroutine.
This is also requirement for making common code used by start()/create()
etc be a coroutine, otherwise neither of them can be and will block
other operations.
Related to QubesOS/qubes-issues#4283
Support 'supported-service.*' features requests coming from VMs. Set
such features directly (allow only value '1') and remove any not
reported in given call. This way uninstalling package providing given
service will automatically remove related 'supported-service...'
feature.
FixesQubesOS/qubes-issues#4402
Make it clear that volume creation fails because it needs to be in the
same pool as its parent. This message is shown in context of `qvm-create
-p root=MyPool` for example and the previous message didn't make sense
at all.
FixesQubesOS/qubes-issues#3438
R3 format had limitation of ~40 rules per VM. Do not generate compat
rules (possibly hitting that limitation) if new format, free of that
limitation is supported.
FixesQubesOS/qubes-issues#1570FixesQubesOS/qubes-issues#4228
Make sure events are sent to specific window found with xdotool search,
not the one having the focus. In case of Whonix, it can be first
connection wizard or whonixcheck report.
Searching based on class is used in many tests, searching by class, not
only by name in wait_for_window will allow to reduce code duplication.
While at it, improve it additionally:
- avoid active waiting for window and use `xdotool search --sync` instead
- return found window id
- add wait_for_window_coro() for use where coroutine is needed
- when waiting for window to disappear, check window id once and wait
for that particular window to disappear (avoid xdotool race
conditions on window enumeration)
Besides reducing code duplication, this also move various xdotool
imperfections handling into one place.
Allow removing VMs based on multiple prefixes at once. Removing them
separately doesn't handle all the dependencies (default_netvm, netvm)
correctly. This is needed for backup compatibility tests, where VMs are
created with `test-` prefix and `disp-tests-`. Additionally backup code
will create `disp-no-netvm`, which also may need to be removed.
If QUBES_TEST_TEMPLATES or QUBES_TEST_LOAD_ALL is set, create testcases
on modules import, instead of waiting until `load_tests` is called.
The `QUBES_TEST_TEMPLATES` doesn't require `qubes.xml` access, so it
should be safe to do regardless of the environment. The
`QUBES_TEST_LOAD_ALL` force loading tests (and reading `qubes.xml`)
regardless.
This is useful for test runners not supporting load_tests protocol. Or
with limited support - for example both default `unittest` runner and
`nose2` can either use load_tests protocol _or_ select individual tests.
Setting any of those variable allow to run a single test with those
runners.
With this feature used together load_tests protocol, tests could be
registered twice. Avoid this by not listing already defined test classes
in create_testcases_for_templates (according to load_tests protocol,
those should already be registered).
Allow easily list templates to be tested, without enumerating all the
test classes. This is especially useful with nose2 runner which can't
use load tests protocol _and_ select subset of tests.
If any object is leaked, QubesTestCase.cleanup_gc() raises an exception,
which have leaked objects list referenced in its traceback. This happens
after cleanup_traceback(), so isn't cleaned, causing cleanup_gc() fail
for all the further tests in the same test run.
Avoid this, by dropping list just before checking if any object is
leaked.
When a VM (or its template) does not explicitly set a qrexec_timeout,
fall back to a global default_qrexec_timeout (with default value 60),
instead of hardcoding the fallback value to 60.
This makes it easy to set a higher timeout for the whole system, which
helps users who habitually launch applications from several (not yet
started) VMs at the same time. 60 seconds can be too short for that.
qvm-sync-clock no longer fetches time from the network, by design.
So, lets not break clockvm's time and check only if everything else
correctly synchronize with it.
It isn't enough to wait for window to disappear, the service may still
be running. And if it is, test cleanup logic will complain about FD
leak.
To avoid deadlock on some test failure, do it with a timeout.
volume.path and volume.export() refer to the same thing in lvm_thin and
'file', but not in file-reflink (where volume.path is the -dirty.img,
which doesn't exist if the volume is not started).
Use the file-reflink storage driver if /var/lib/qubes is on a filesystem
that supports reflinks, e.g. when the btrfs layout was selected in
Anaconda. If it doesn't support reflinks (or if detection fails, e.g. in
an unprivileged test environment), use 'file' as before.
_wait_and_reraise() is similar to asyncio.gather(), but it preserves the
current behavior of waiting for all futures and only _then_ reraising
the first exception (if there is any) in line.
Also switch Storage.create() and Storage.clone() to _wait_and_reraise().
Previously, they called asyncio.wait() and implicitly swallowed all
exceptions.
With that syntax, the default timestamp would have been from the time of
the function's definition (not invocation). But all callers are passing
an explicit timestamp anyway.