[nova][placement] Placement requests and caching in the resource tracker


[nova][placement] Placement requests and caching in the resource tracker

Eric Fried
All-

Based on a (long) discussion yesterday [1] I have put up a patch [2]
whereby you can set [compute]resource_provider_association_refresh to
zero and the resource tracker will never* refresh the report client's
provider cache. Philosophically, we're removing the "healing" aspect of
the resource tracker's periodic and trusting that placement won't
diverge from whatever's in our cache. (If it does, it's because the op
hit the CLI, in which case they should SIGHUP - see below.)

*except:
- When we initially create the compute node record and bootstrap its
resource provider.
- When the virt driver's update_provider_tree makes a change,
update_from_provider_tree reflects them in the cache as well as pushing
them back to placement.
- If update_from_provider_tree fails, the cache is cleared and gets
rebuilt on the next periodic.
- If you send SIGHUP to the compute process, the cache is cleared.
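For illustration, the gating check might look something like this (a minimal sketch with invented names; the real report client code differs):

```python
import time


class ProviderCache:
    """Toy sketch of a refresh gate; names are illustrative, not Nova's."""

    def __init__(self, refresh_interval):
        # Under the proposal, an interval of 0 means "never refresh": the
        # cache is trusted until explicitly invalidated (SIGHUP,
        # update_from_provider_tree failure, etc.).
        self.refresh_interval = refresh_interval
        self._last_refreshed = {}  # rp_uuid -> monotonic timestamp

    def stale(self, rp_uuid):
        if self.refresh_interval <= 0:
            # Still fetch once to bootstrap an unknown provider.
            return rp_uuid not in self._last_refreshed
        last = self._last_refreshed.get(rp_uuid)
        if last is None:
            return True
        return time.monotonic() - last > self.refresh_interval

    def mark_refreshed(self, rp_uuid):
        self._last_refreshed[rp_uuid] = time.monotonic()
```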

This should dramatically reduce the number of calls to placement from
the compute service. Like, to nearly zero, unless something is actually
changing.

Can I get some initial feedback as to whether this is worth polishing up
into something real? (It will probably need a bp/spec if so.)

[1]
http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-11-01.log.html#t2018-11-01T17:32:03
[2] https://review.openstack.org/#/c/614886/

==========
Background
==========
In the Queens release, our friends at CERN noticed a serious spike in
the number of requests to placement from compute nodes, even in a
stable-state cloud. Given that we were in the process of adding a ton of
infrastructure to support sharing and nested providers, this was not
unexpected. Roughly, what was previously:

 @periodic_task:
     GET /resource_providers/$compute_uuid
     GET /resource_providers/$compute_uuid/inventories

became more like:

 @periodic_task:
     # In Queens/Rocky, this would still just return the compute RP
     GET /resource_providers?in_tree=$compute_uuid
     # In Queens/Rocky, this would return nothing
     GET /resource_providers?member_of=...&required=MISC_SHARES...
     for each provider returned above:  # i.e. just one in Q/R
         GET /resource_providers/$provider_uuid/inventories
         GET /resource_providers/$provider_uuid/traits
         GET /resource_providers/$provider_uuid/aggregates

In a cloud the size of CERN's, the load wasn't acceptable. But at the
time, CERN worked around the problem by disabling refreshing entirely.
(The fact that this seems to have worked for them is an encouraging sign
for the proposed code change.)
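To put rough numbers on it: the per-periodic call count grows from 2 to 2 + 3 per provider for every compute node. A back-of-envelope sketch (the node count and 60-second period are illustrative assumptions, not CERN's actual figures):

```python
def placement_requests_per_hour(num_computes, providers_per_compute=1,
                                period_seconds=60):
    """Rough request volume for the old and new periodic patterns."""
    runs_per_hour = 3600 // period_seconds
    # Old: GET the provider + GET its inventories = 2 calls per compute.
    old = 2 * num_computes * runs_per_hour
    # New: in_tree + member_of discovery, then inventories/traits/aggregates
    # for each provider in the tree.
    new = (2 + 3 * providers_per_compute) * num_computes * runs_per_hour
    return old, new


old, new = placement_requests_per_hour(10_000)
# 10k single-provider computes: 1.2M -> 3M requests/hour, a 2.5x increase,
# before any nested or sharing providers appear in the tree.
```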

We're not actually making use of most of that information, but it sets
the stage for things that we're working on in Stein and beyond, like
multiple VGPU types, bandwidth resource providers, accelerators, NUMA,
etc., so removing/reducing the amount of information we look at isn't
really an option strategically.

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [hidden email]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [nova][placement] Placement requests and caching in the resource tracker

Matt Riedemann-3
On 11/2/2018 2:22 PM, Eric Fried wrote:

> [snip]

A few random points from the long discussion that should probably be
re-posed here for wider thought:

* There was probably a lot of discussion about why we needed to do this
caching and stuff in the compute in the first place. What has changed
that we no longer need to aggressively refresh the cache on every
periodic? I thought initially it came up because people really wanted
the compute to be fully self-healing to any external changes, including
hot plugging resources like disk on the host to automatically reflect
those changes in inventory. Similarly, external user/service
interactions with the placement API which would then be automatically
picked up by the next periodic run - is that no longer a desire, and/or
how was the decision made previously that simply requiring a SIGHUP in
that case wasn't sufficient/desirable?

* I believe I made the point yesterday that we should probably not
refresh by default, and let operators opt-in to that behavior if they
really need it, i.e. they are frequently making changes to the
environment, potentially by some external service (I could think of
vCenter doing this to reflect changes from vCenter back into
nova/placement), but I don't think that should be the assumed behavior
by most and our defaults should reflect the "normal" use case.

* I think I've noted a few times now that we don't actually use the
provider aggregates information (yet) in the compute service. Nova host
aggregate membership has been mirrored to placement since Rocky [1], but
that happens in the API, not in the compute. The only thing I can think of
that relied on resource provider aggregate information in the compute is
the shared storage providers concept, but that's not supported (yet)
[2]. So do we need to keep retrieving aggregate information when nothing
in compute uses it yet?

* Similarly, why do we need to get traits on each periodic? The only
in-tree virt driver I'm aware of that *reports* traits is the libvirt
driver for CPU features [3]. Otherwise I think the idea behind getting
the latest traits is so the virt driver doesn't overwrite any traits set
externally on the compute node root resource provider. I think that
still stands and is probably OK, even though we have generations now
which should keep us from overwriting if we don't have the latest
traits, but I wanted to bring it up since it's related to the "why do we
need provider aggregates in the compute?" question.

* Regardless of what we do, I think we should probably *at least* make
the refresh-associations config option allow 0 to disable it, so CERN (and
others) can avoid the need to continually forward-port code to
disable it.

[1]
https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/placement-mirror-host-aggregates.html
[2] https://bugs.launchpad.net/nova/+bug/1784020
[3]
https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/report-cpu-features-as-traits.html
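On the generations point: the protection works roughly like optimistic concurrency control. A toy model (not the placement API itself; class and trait names are made up) of how a stale cache gets rejected instead of overwriting externally set traits:

```python
class FakePlacement:
    """Toy model of placement's provider-generation conflict check."""

    def __init__(self):
        self.traits = set()
        self.generation = 0

    def put_traits(self, traits, generation):
        # A writer must present the generation it last saw; a mismatch
        # means someone else changed the provider, so reject with 409.
        if generation != self.generation:
            return 409, self.generation
        self.traits = set(traits)
        self.generation += 1
        return 200, self.generation


placement = FakePlacement()
# An operator sets a trait externally...
status, gen = placement.put_traits({'CUSTOM_GOLD'}, 0)
# ...then a compute with a stale cache (still at generation 0) writes:
status2, _ = placement.put_traits({'HW_CPU_X86_AVX2'}, 0)
# status2 is 409: the stale write is rejected rather than clobbering
# CUSTOM_GOLD, so the compute knows to refresh that one provider.
```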

--

Thanks,

Matt


Re: [nova][placement] Placement requests and caching in the resource tracker

Mohammed Naser
On Fri, Nov 2, 2018 at 9:32 PM Matt Riedemann <[hidden email]> wrote:

> [snip]



--
Mohammed Naser — vexxhost
-----------------------------------------------------
D. 514-316-8872
D. 800-910-1726 ext. 200
E. [hidden email]
W. http://vexxhost.com


Re: [nova][placement] Placement requests and caching in the resource tracker

Mohammed Naser
In reply to this post by Matt Riedemann-3
Ugh, hit send accidentally.  Please take my comments lightly, as I have not
been as involved with the development; I'm just chiming in as an operator
with some ideas.

On Fri, Nov 2, 2018 at 9:32 PM Matt Riedemann <[hidden email]> wrote:

>
> On 11/2/2018 2:22 PM, Eric Fried wrote:
> > [snip]
>
> A few random points from the long discussion that should probably be
> re-posed here for wider thought:
>
> * There was probably a lot of discussion about why we needed to do this
> caching and stuff in the compute in the first place. What has changed
> that we no longer need to aggressively refresh the cache on every
> periodic? I thought initially it came up because people really wanted
> the compute to be fully self-healing to any external changes, including
> hot plugging resources like disk on the host to automatically reflect
> those changes in inventory. Similarly, external user/service
> interactions with the placement API which would then be automatically
> picked up by the next periodic run - is that no longer a desire, and/or
> how was the decision made previously that simply requiring a SIGHUP in
> that case wasn't sufficient/desirable.

I think that would be nice to have. However, at the moment, from an
operator's perspective, it looks like the placement service can really
get out of sync pretty easily, so I think it'd be good to commit to either
really making it self-heal (delete stale allocations, create ones that should
be there) or removing all the self-healing stuff.

Also, if we take the self-healing route and implement it fully, it might
make the placement split much easier, because we could just switch over
and wait for the computes to automagically populate everything - though
that's the type of operation that happens once in the lifetime of a cloud.

Just for information's sake: a clean-state cloud which had no reported issues
over a period of maybe 2-3 months already has 4 allocations which are
incorrect and 12 allocations pointing to the wrong resource provider, so I
think this comes down to committing to "self-healing" to fix those
issues, or not.
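A healing pass, if we committed to one, could start as simply as diffing nova's view of instances against placement's consumers (a hypothetical sketch; the function and argument names are made up):

```python
def audit_allocations(instance_uuids, allocation_consumers):
    """Return (stale consumers to delete, instances missing allocations).

    instance_uuids: set of instance UUIDs nova believes exist.
    allocation_consumers: set of consumer UUIDs placement holds
    allocations for.
    """
    stale = allocation_consumers - instance_uuids    # delete these
    missing = instance_uuids - allocation_consumers  # recreate these
    return stale, missing
```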

> * I believe I made the point yesterday that we should probably not
> refresh by default, and let operators opt-in to that behavior if they
> really need it, i.e. they are frequently making changes to the
> environment, potentially by some external service (I could think of
> vCenter doing this to reflect changes from vCenter back into
> nova/placement), but I don't think that should be the assumed behavior
> by most and our defaults should reflect the "normal" use case.

I agree.  For 99% of the deployments out there, the placement service will
likely not be touched by anyone except the services, and at this stage
probably just Nova talking to placement directly.

I really do agree that a user playing around with placement out-of-band
is not the "normal" use case at all.

> * I think I've noted a few times now that we don't actually use the
> provider aggregates information (yet) in the compute service. Nova host
> aggregate membership has been mirrored to placement since Rocky [1] but that
> happens in the API, not in the compute. The only thing I can think of
> that relied on resource provider aggregate information in the compute is
> the shared storage providers concept, but that's not supported (yet)
> [2]. So do we need to keep retrieving aggregate information when nothing
> in compute uses it yet?

Is there anything stopping us from polling that information only while a
VM is spawning? It doesn't seem like something that the compute node
always needs to check.

> * Similarly, why do we need to get traits on each periodic? The only
> in-tree virt driver I'm aware of that *reports* traits is the libvirt
> driver for CPU features [3]. Otherwise I think the idea behind getting
> the latest traits is so the virt driver doesn't overwrite any traits set
> externally on the compute node root resource provider. I think that
> still stands and is probably OK, even though we have generations now
> which should keep us from overwriting if we don't have the latest
> traits, but I wanted to bring it up since it's related to the "why do we
> need provider aggregates in the compute?" question.

Forgive my ignorance on this subject, but would traits really only be set
when the service is first started (so that the check can happen only once
on startup), with the compute nodes never really consuming that
information afterwards (though the scheduler does)? Also, AFAIK virt
drivers don't actually report much change in traits (CPU flags changing
at runtime?).

> * Regardless of what we do, I think we should probably *at least* make
> that refresh associations config allow 0 to disable it so CERN (and
> others) can avoid the need to continually forward-port code to
> disable it.

+1




--
Mohammed Naser — vexxhost
-----------------------------------------------------
D. 514-316-8872
D. 800-910-1726 ext. 200
E. [hidden email]
W. http://vexxhost.com


Re: [nova][placement] Placement requests and caching in the resource tracker

Jay Pipes
In reply to this post by Eric Fried
On 11/02/2018 03:22 PM, Eric Fried wrote:

> [snip]

I support your idea of getting rid of the periodic refresh of the cache
in the scheduler report client. Much of that was added in order to
emulate the original way the resource tracker worked.

Most of the behaviour in the original resource tracker (and some of the
code still in there for dealing with (surprise!) PCI passthrough devices
and NUMA topology) was due to doing allocations on the compute node (the
whole claims stuff). We needed to always be syncing the state of the
compute_nodes and pci_devices table in the cell database with whatever
usage information was being created/modified on the compute nodes [0].

All of the "healing" code that's in the resource tracker was basically
to deal with "soft delete", migrations that didn't complete or work
properly, and, again, to handle allocations becoming out-of-sync because
the compute nodes were responsible for allocating (as opposed to the
current situation we have where the placement service -- via the
scheduler's call to claim_resources() -- is responsible for allocating
resources [1]).

Now that we have generation markers protecting both providers and
consumers, we can rely on those generations to signal to the scheduler
report client that it needs to pull fresh information about a provider
or consumer. So, there's really no need to automatically and blindly
refresh any more.
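In other words, the refresh can become purely reactive: only when a write bounces with a conflict does the client re-fetch that one provider. A hedged sketch of the pattern (the `client` interface is invented for illustration, not the actual report client):

```python
def write_with_conflict_refresh(client, cache, rp_uuid, payload):
    """Optimistic write: on a 409 conflict, refresh just this provider
    and retry once, instead of periodically refreshing everything.

    `client` is assumed to expose put(rp_uuid, payload, generation)
    returning an HTTP status, and get_provider(rp_uuid) returning fresh
    provider data including its current generation.
    """
    status = client.put(rp_uuid, payload, cache[rp_uuid]['generation'])
    if status == 409:
        # Our cached view is stale; re-fetch only this provider.
        cache[rp_uuid] = client.get_provider(rp_uuid)
        status = client.put(rp_uuid, payload, cache[rp_uuid]['generation'])
    return status
```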

Best,
-jay

[0] We always need to be syncing those tables because those tables,
unlike the placement database's data modeling, couple both inventory AND
usage in the same table structure...

[1] again, except for PCI devices and NUMA topology, because of the
tight coupling in place with the different resource trackers those types
of resources use...



Re: [nova][placement] Placement requests and caching in the resource tracker

Belmiro Moreira
Thanks, Eric, for the patch.
This will help keep placement calls under control.

Belmiro


On Sun, Nov 4, 2018 at 1:01 PM Jay Pipes <[hidden email]> wrote:
I support your idea of getting rid of the periodic refresh of the cache
in the scheduler report client. Much of that was added in order to
emulate the original way the resource tracker worked.

Most of the behaviour in the original resource tracker (and some of the
code still in there for dealing with (surprise!) PCI passthrough devices
and NUMA topology) was due to doing allocations on the compute node (the
whole claims stuff). We needed to always be syncing the state of the
compute_nodes and pci_devices table in the cell database with whatever
usage information was being created/modified on the compute nodes [0].

All of the "healing" code that's in the resource tracker was basically
to deal with "soft delete", migrations that didn't complete or work
properly, and, again, to handle allocations becoming out-of-sync because
the compute nodes were responsible for allocating (as opposed to the
current situation we have where the placement service -- via the
scheduler's call to claim_resources() -- is responsible for allocating
resources [1]).

Now that we have generation markers protecting both providers and
consumers, we can rely on those generations to signal to the scheduler
report client that it needs to pull fresh information about a provider
or consumer. So, there's really no need to automatically and blindly
refresh any more.

Best,
-jay

[0] We always need to be syncing those tables because those tables,
unlike the placement database's data modeling, couple both inventory AND
usage in the same table structure...

[1] again, except for PCI devices and NUMA topology, because of the
tight coupling in place with the different resource trackers those types
of resources use...



Re: [nova][placement] Placement requests and caching in the resource tracker

Chris Dent-2
In reply to this post by Jay Pipes
On Sun, 4 Nov 2018, Jay Pipes wrote:

> Now that we have generation markers protecting both providers and consumers,
> we can rely on those generations to signal to the scheduler report client
> that it needs to pull fresh information about a provider or consumer. So,
> there's really no need to automatically and blindly refresh any more.

I agree with this ^.

I've been trying to tease out the issues in this thread and on the
associated review [1] and I've decided that much of my confusion
comes from the fact that we refer to the thing in the resource
tracker as a "cache" and talk about either trusting it more or not
having it at all, and I think that's misleading. To me a "cache" has
multiple clients and there's some need for reconciliation and
invalidation amongst them. The thing that's in the resource tracker
lives in one process and changes to it are synchronized; it's merely
a data structure.

Some words follow where I try to tease things out a bit more (mostly
for my own sake, but if it helps other people, great). At the very
end there's a bit of list of suggested todos for us to consider.

What we have is a data structure which represents the resource
tracker and virt driver's current view of what providers and
associates it is aware of. We maintain a boundary between the RT and
the virtdriver that means there's "updating" going on that sometimes
is a bit fussy to resolve (cf. recent adjustments to allocation
ratio handling).

In the old way, every now and again we get a bunch of info from
placement to confirm that our view is right and try to reconcile
things.

What we're considering moving towards is only doing that "get a
bunch of info from placement" when we fail to write to placement
because of a generation conflict.

Thus we should only read from placement:

* at compute node startup
* when a write fails

And we should only write to placement:

* at compute node startup
* when the virt driver tells us something has changed

Is that right? If it is not right, can we do that? If not, why not?

Because generations change, often, they guard against us making
changes in ignorance and allow us to write blindly and only GET when
we fail. We've got this everywhere now, let's use it. So, for
example, even if something else besides the compute is adding
traits, it's cool. We'll fail when we (the compute) try to clobber.
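That trait scenario can be sketched the same way (a toy model, all names invented): the compute writes its cached trait set blindly, and an out-of-band change surfaces as a conflict instead of being silently clobbered.

```python
class Conflict(Exception):
    """Stands in for placement's 409 on a stale provider generation."""

class ProviderTraits:
    """Toy placement-side trait store guarded by a provider generation."""
    def __init__(self):
        self.generation = 0
        self.traits = set()

    def get(self):
        return self.generation, set(self.traits)

    def put(self, generation, traits):
        if generation != self.generation:
            raise Conflict()   # the caller's cached view is stale
        self.traits = set(traits)
        self.generation += 1
        return self.generation

server = ProviderTraits()

# The compute caches its view at startup...
gen, cached = server.get()

# ...an operator adds a trait out of band (e.g. via the CLI)...
server.put(gen, cached | {"CUSTOM_FAST_DISK"})

# ...and the compute's next blind write fails instead of clobbering it.
try:
    server.put(gen, cached | {"COMPUTE_TRUSTED_CERTS"})
except Conflict:
    gen, cached = server.get()                        # refresh only now
    server.put(gen, cached | {"COMPUTE_TRUSTED_CERTS"})
```

The operator's trait survives the compute's write precisely because the write was generation-guarded.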

Elsewhere in the thread several other topics were raised. A lot of
them boil back down to "what are we actually trying to do in the
periodics?". As is often the case (and appropriately so) what we're
trying to do has evolved and accreted in an organic fashion and it
is probably time for us to re-evaluate and make sure we're doing the
right stuff. The first step is writing that down. That aspect has
always been pretty obscure or tribal to me, I presume so for others.
So doing a legit audit of that code and the goals is something we
should do.

Mohammed's comments about allocations getting out of sync are
important. I agree with him that it would be excellent if we could
go back to self-healing those, especially because of the "wait for
the computes to automagically populate everything" part he mentions.
However, that aspect, while related to this, is not quite the same
thing. The management of allocations and the management of
inventories (and "associates") is happening from different angles.

And finally, even if we turn off these refreshes to lighten the
load, placement still needs to be capable of dealing with frequent
requests, so we have something to fix there. We need to do the
analysis to find out where the cost is and implement some solutions.
At the moment we don't know where it is. It could be:

* In the database server
* In the python code that marshals the data around those calls to
   the database
* In the python code that handles the WSGI interactions
* In the web server that is talking to the python code
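One low-tech way to start splitting those layers apart is a WSGI timing wrapper, so app-side time can be compared against what the web server and clients observe (a sketch under assumptions; this is not OSProfiler or any existing nova/placement middleware):

```python
import time

def timing_middleware(app, log):
    """Wrap a WSGI app and record wall-clock time spent inside it.

    If the time recorded here is small while client-observed latency is
    large, the cost is in the web server layer; if it is large, profile
    the Python handlers and database calls underneath.
    """
    def wrapped(environ, start_response):
        start = time.monotonic()
        try:
            return app(environ, start_response)
        finally:
            log.append((environ.get("PATH_INFO", ""),
                        time.monotonic() - start))
    return wrapped
```

For deeper attribution you would reach for cProfile or a sampling profiler, but even this split narrows the search space.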

belmoreira's document [2] suggests some avenues of investigation
(most CPU time is in user space and not waiting) but we'd need a bit
more information to plan any concrete next steps:

* what's the web server and which wsgi configuration?
* where's the database, if it's different what's the load there?

I suspect there's a lot we can do to make our code more correct and
efficient. And beyond that there is a great deal of standard
run-of-the-mill server-side caching and etag handling that we could
implement if necessary. That is: treat placement like a web app that
needs to be optimized in the usual ways.
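The standard ETag dance might look roughly like this (a sketch, not anything placement implements today): the server still does the DB and serialization work, but a 304 avoids resending the body and tells the client its cached copy is still good.

```python
import hashlib

def etag_for(body: bytes) -> str:
    # A strong ETag derived from the representation itself.
    return '"%s"' % hashlib.sha1(body).hexdigest()

def conditional_get(request_headers: dict, body: bytes):
    """Return (status, headers, payload), honoring If-None-Match."""
    etag = etag_for(body)
    if request_headers.get("If-None-Match") == etag:
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body
```

A step further would be caching the serialized body keyed on the provider generation, so unchanged providers skip the DB entirely.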

As Eric suggested at the start of the thread, this kind of
investigation is expected and normal. We've not done something
wrong. "Make it work, make it correct, make it fast" is the process.
We're oscillating somewhere between 2 and 3.

So in terms of actions:

* I'm pretty well situated to do some deeper profiling and
   benchmarking of placement to find the elbows in that.

* It seems like Eric and Jay are probably best situated to define
   and refine what should really be going on with the resource
   tracker and other actions on the compute-node.

* We need to have further discussion and investigation on
   allocations getting out of sync. Volunteers?

What else?

[1] https://review.openstack.org/#/c/614886/
[2] https://docs.google.com/document/d/1d5k1hA3DbGmMyJbXdVcekR12gyrFTaj_tJdFwdQy-8E/edit

--
Chris Dent                       ٩◔̯◔۶           https://anticdent.org/
freenode: cdent                                         tw: @anticdent

Re: [nova][placement] Placement requests and caching in the resource tracker

Tetsuro Nakamura-2
Thus we should only read from placement:
* at compute node startup
* when a write fails
And we should only write to placement:
* at compute node startup
* when the virt driver tells us something has changed  

I agree with this. 

We could also provide an interface for operators/other projects to force nova to pull fresh information from placement into its cache, in order to avoid predictable conflicts.
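A minimal version of such an interface already exists as the SIGHUP handling Eric described: flush the cache and let the next periodic repopulate it. A toy sketch of that pattern (the class is invented and POSIX-only; nova-compute's real cache lives inside the report client):

```python
import signal

class ProviderCache:
    """Toy provider-info cache that an operator can flush with SIGHUP."""
    def __init__(self):
        self.data = {}
        # `kill -HUP <pid>` lands here; the next periodic task then
        # repopulates the cache with fresh data from placement.
        signal.signal(signal.SIGHUP, self._clear)

    def _clear(self, signum, frame):
        self.data.clear()
```

An HTTP or RPC "refresh now" endpoint would be friendlier to other projects than a signal, but the clearing logic is the same.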

Is that right? If it is not right, can we do that? If not, why not?  

The same question from me.
Might the periodic-refresh strategy now be just an optional optimization for smaller clouds?



Re: [nova][placement] Placement requests and caching in the resource tracker

Matt Riedemann-3
In reply to this post by Mohammed Naser
On 11/4/2018 4:22 AM, Mohammed Naser wrote:
> Just for information sake, a clean state cloud which had no reported issues
> over maybe a period of 2-3 months already has 4 allocations which are
> incorrect and 12 allocations pointing to the wrong resource provider, so I
> think this comes down to committing to either "self-healing" to fix those
> issues or not.

Is this running Rocky or an older release?

Have you dug into any of the operations around these instances to
determine what might have gone wrong? For example, was a live migration
performed recently on these instances and if so, did it fail? How about
evacuations (rebuild from a down host)?

By "4 allocations which are incorrect" I assume that means they are
pointing at the correct compute node resource provider but the values
for allocated VCPU, MEMORY_MB and DISK_GB are wrong? If so, how do the
allocations align with old/new flavors used to resize the instance? Did
the resize fail?

Are there mixed compute versions at all, i.e. are you moving instances
around during a rolling upgrade?

--

Thanks,

Matt


Re: [nova][placement] Placement requests and caching in the resource tracker

Matt Riedemann-3
In reply to this post by Chris Dent-2
On 11/5/2018 5:52 AM, Chris Dent wrote:
> * We need to have further discussion and investigation on
>    allocations getting out of sync. Volunteers?

This is something I've already spent a lot of time on with the
heal_allocations CLI, and have already started asking mnaser questions
about this elsewhere in the thread.

--

Thanks,

Matt


Re: [nova][placement] Placement requests and caching in the resource tracker

Mohammed Naser
In reply to this post by Matt Riedemann-3
On Mon, Nov 5, 2018 at 4:17 PM Matt Riedemann <[hidden email]> wrote:
>
> On 11/4/2018 4:22 AM, Mohammed Naser wrote:
> > Just for information sake, a clean state cloud which had no reported issues
> > over maybe a period of 2-3 months already has 4 allocations which are
> > incorrect and 12 allocations pointing to the wrong resource provider, so I
> > think this comes down to committing to either "self-healing" to fix those
> > issues or not.
>
> Is this running Rocky or an older release?

In this case, this is inside a Queens cloud, I can run the same script
on a Rocky
cloud too.

> Have you dug into any of the operations around these instances to
> determine what might have gone wrong? For example, was a live migration
> performed recently on these instances and if so, did it fail? How about
> evacuations (rebuild from a down host).

To be honest, I have not. However, I suspect a lot of those happen because
it is possible that the service which makes the claim is not the
same one that deletes it.

I'm not sure if this is something that's possible, but say compute2 makes
a claim for migrating to compute1 and something fails there; the revert happens
on compute1, but compute1 is already borked so it doesn't work.

This isn't necessarily the exact case that's happening but it's a summary
of what I believe happens.

> By "4 allocations which are incorrect" I assume that means they are
> pointing at the correct compute node resource provider but the values
> for allocated VCPU, MEMORY_MB and DISK_GB are wrong? If so, how do the
> allocations align with old/new flavors used to resize the instance? Did
> the resize fail?

The allocated flavours usually are not wrong; they are simply associated
with the wrong resource provider (so it feels like a failed migration or resize).

> Are there mixed compute versions at all, i.e. are you moving instances
> around during a rolling upgrade?

Nope

> --
>
> Thanks,
>
> Matt
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: [hidden email]?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



--
Mohammed Naser — vexxhost
-----------------------------------------------------
D. 514-316-8872
D. 800-910-1726 ext. 200
E. [hidden email]
W. http://vexxhost.com


Re: [nova][placement] Placement requests and caching in the resource tracker

Matt Riedemann-3
On 11/5/2018 12:28 PM, Mohammed Naser wrote:

>> Have you dug into any of the operations around these instances to
>> determine what might have gone wrong? For example, was a live migration
>> performed recently on these instances and if so, did it fail? How about
>> evacuations (rebuild from a down host).
> To be honest, I have not, however, I suspect a lot of those happen from the
> fact that it is possible that the service which makes the claim is not the
> same one that deletes it
>
> I'm not sure if this is something that's possible but say the compute2 makes
> a claim for migrating to compute1 but something fails there, the revert happens
> in compute1 but compute1 is already borked so it doesn't work
>
> This isn't necessarily the exact case that's happening but it's a summary
> of what I believe happens.
>

The computes don't create the resource allocations in placement though,
the scheduler does, unless this deployment still has at least one
compute that is <Pike. You should probably check that to make sure.

The compute service should only be removing allocations for things like
server delete, failed move operation (cleanup the allocations created by
the scheduler), or a successful move operation (cleanup the allocations
for the source node held by the migration record).

I wonder if you have migration records (from the cell DB migrations
table) holding allocations in placement for some reason, even though the
migration is complete. I know you have an audit script to look for
allocations that are not held by instances, assuming those instances
have been deleted and the allocations were leaked, but they could have
also been held by the migration record and maybe leaked that way?
Although if you delete the instance, the related migration records are
also removed (but maybe not their allocations?). I'm thinking of a case
like: resize an instance but, rather than confirm/revert it, the user
deletes the instance. That would clean up the allocations from the target
node but potentially not from the source node.
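The kind of audit described above amounts to a set comparison over consumer UUIDs; roughly (a toy sketch: real code would page through placement's allocations and the cell DB's instances and migrations tables, and the function and its inputs are invented):

```python
def find_leaked_consumers(allocations, instances, migrations):
    """Return consumer UUIDs whose allocations nothing accounts for.

    allocations: consumer uuids holding allocations in placement
    instances:   non-deleted instance uuids from the cell DB
    migrations:  uuids of in-progress migrations (these legitimately
                 hold the source node's allocations during a move)
    """
    return set(allocations) - (set(instances) | set(migrations))
```

Anything this returns is either a leak from a failed/interrupted move or a completed migration whose allocations were never dropped.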

--

Thanks,

Matt


Re: [nova][placement] Placement requests and caching in the resource tracker

Matt Riedemann-3
On 11/5/2018 1:17 PM, Matt Riedemann wrote:
> I'm thinking of a case like: resize an instance but rather than
> confirm/revert it, the user deletes the instance. That would clean up the
> allocations from the target node but potentially not from the source node.

Well this case is at least not an issue:

https://review.openstack.org/#/c/615644/

It took me a bit to sort out how that worked but it does and I've added
a test to confirm it.

--

Thanks,

Matt


Re: [nova][placement] Placement requests and caching in the resource tracker

Eric Fried
In reply to this post by Belmiro Moreira
I do intend to respond to all the excellent discussion on this thread,
but right now I just want to offer an update on the code:

I've split the effort apart into multiple changes starting at [1]. A few
of these are ready for review.

One opinion was that a specless blueprint would be appropriate. If
there's consensus on this, I'll spin one up.

[1] https://review.openstack.org/#/c/615606/

