[nova][scheduler][placement] Allocating Complex Resources


Edward Leafe
We had a very lively discussion this morning during the Scheduler subteam meeting, which was continued in a Google hangout. The subject was how to handle claiming resources when the Resource Provider is not "simple". By "simple", I mean a compute node that provides all of the resources itself, as contrasted with a compute node that uses shared storage for disk space, or which has complex nested relationships with things such as PCI devices or NUMA nodes. The current situation is as follows:

a) scheduler gets a request with certain resource requirements (RAM, disk, CPU, etc.)
b) scheduler passes these resource requirements to placement, which returns a list of hosts (compute nodes) that can satisfy the request.
c) scheduler runs these through some filters and weighers to get a list ordered by best "fit"
d) it then tries to claim the resources, by posting allocations for these resources against the selected host to placement (a rough sketch of such a claim follows this list)
e) once the allocation succeeds, scheduler returns that host to conductor to then have the VM built

(some details for edge cases left out for clarity of the overall process)
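
To make step d) concrete, here is a rough, hypothetical sketch of the kind of claim the scheduler posts to placement today. The UUIDs and resource amounts are invented, and the payload shape is simplified; the point is that every resource is counted against the single compute node the scheduler selected.

    import json
    import uuid

    consumer_uuid = str(uuid.uuid4())      # the instance being built (made up)
    compute_node_uuid = str(uuid.uuid4())  # the host the scheduler selected (made up)

    # CPU, RAM and disk are all claimed against the compute node, because that
    # is the only resource provider UUID the scheduler knows about.
    claim = {
        "allocations": [
            {
                "resource_provider": {"uuid": compute_node_uuid},
                "resources": {"VCPU": 2, "MEMORY_MB": 4096, "DISK_GB": 40},
            }
        ]
    }
    # e.g. posted to /allocations/{consumer_uuid} in placement
    print(consumer_uuid, json.dumps(claim, indent=2))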

The problem we discussed comes into play when the compute node isn't the actual provider of the resources. The easiest example to consider is when the computes are associated with a shared storage provider. The placement query is smart enough to know that even if the compute node doesn't have enough local disk, it will get it from the shared storage, so it will return that host in step b) above. If the scheduler then chooses that host, when it tries to claim it, it will pass the resources and the compute node UUID back to placement to make the allocations. This is the point where the current code would fall short: somehow, placement needs to know to allocate the disk requested against the shared storage provider, and not the compute node.
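
Purely for illustration (provider UUIDs invented), the allocation that placement would need to end up with in the shared-storage case splits the claim across two providers, roughly like the sketch below. Today the scheduler only knows the compute node UUID, so it has no way to build this split itself:

    import uuid

    compute_node_uuid = str(uuid.uuid4())    # hypothetical compute node
    shared_storage_uuid = str(uuid.uuid4())  # hypothetical shared storage pool provider

    desired_claim = {
        "allocations": [
            # CPU and RAM still come from the compute node itself...
            {
                "resource_provider": {"uuid": compute_node_uuid},
                "resources": {"VCPU": 2, "MEMORY_MB": 4096},
            },
            # ...but the disk has to be allocated against the sharing provider,
            # which is associated with the compute node via an aggregate.
            {
                "resource_provider": {"uuid": shared_storage_uuid},
                "resources": {"DISK_GB": 40},
            },
        ]
    }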

One proposal is to essentially use the same logic in placement that was used to include that host in those matching the requirements. In other words, when it tries to allocate the amount of disk, it would determine that that host is in a shared storage aggregate, and be smart enough to allocate against that provider. This was referred to in our discussion as "Plan A".

Another proposal involved a change to how placement responds to the scheduler. Instead of just returning the UUIDs of the compute nodes that satisfy the required resources, it would include a whole bunch of additional information in a structured response. A straw man example of such a response is here: https://etherpad.openstack.org/p/placement-allocations-straw-man. This was referred to as "Plan B". The main feature of this approach is that part of that response would be the JSON dict for the allocation call, containing the specific resource provider UUID for each resource. This way, when the scheduler selects a host, it would simply pass that dict back to the /allocations call, and placement would be able to do the allocations directly against that information.
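
As a loose sketch of the idea (this is not the etherpad content; the field names are invented for illustration), each candidate in such a response would carry both the ready-made allocation dict and enough provider detail for the scheduler's filters and weighers:

    # Hypothetical "Plan B"-style entry, one per candidate host.
    candidate = {
        # The dict the scheduler would pass straight back to /allocations
        # if it picks this candidate.
        "allocation_request": {
            "allocations": [
                {"resource_provider": {"uuid": "COMPUTE-NODE-UUID"},
                 "resources": {"VCPU": 2, "MEMORY_MB": 4096}},
                {"resource_provider": {"uuid": "SHARED-STORAGE-UUID"},
                 "resources": {"DISK_GB": 40}},
            ]
        },
        # Extra per-provider information the filters/weighers could use.
        "provider_summaries": {
            "COMPUTE-NODE-UUID": {
                "resources": {"VCPU": {"capacity": 64, "used": 10},
                              "MEMORY_MB": {"capacity": 262144, "used": 65536}},
            },
            "SHARED-STORAGE-UUID": {
                "resources": {"DISK_GB": {"capacity": 10000, "used": 2000}},
            },
        },
    }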

There was another issue raised: simply providing the host UUIDs didn't give the scheduler enough information in order to run its filters and weighers. Since the scheduler uses those UUIDs to construct HostState objects, the specific missing information was never completely clarified, so I'm just including this aspect of the conversation for completeness. It is orthogonal to the question of how to allocate when the resource provider is not "simple".

My current feeling is that we got ourselves into our existing mess of ugly, convoluted code when we tried to add these complex relationships into the resource tracker and the scheduler. We set out to create the placement engine to bring some sanity back to how we think about things we need to virtualize. I would really hate to see us make the same mistake again, by adding a good deal of complexity to handle a few non-simple cases. What I would like to avoid, no matter what the eventual solution chosen, is representing this complexity in multiple places. Currently the only two candidates for this logic are the placement engine, which knows about these relationships already, or the compute service itself, which has to handle the management of these complex virtualized resources.

I don't know the answer. I'm hoping that we can have a discussion that might uncover a clear approach, or, at the very least, one that is less murky than the others.


-- Ed Leafe







Re: [nova][scheduler][placement] Allocating Complex Resources

Sylvain Bauza



On 05/06/2017 23:22, Ed Leafe wrote:

> [Ed's original message quoted in full; snipped]

I wasn't part of either the scheduler meeting or the hangout (a French
holiday got in the way), so I don't have all the details in mind and I
could well be making wrong assumptions; I apologize in advance if I say
anything silly.

That said, I still have some opinions and I'll put them here. Thanks
for bringing that problem up here, Ed.

The intent of the scheduler is to pick a destination where an instance
can land (the old 'can_host' concept). Getting back a list of "sharing
RPs" (a shared volume, say) doesn't really help that decision-making, I
feel. What a user could want is to have the scheduler pick a
destination that is *close* to a given "sharing RP", or one that is not
"shared with" that "sharing RP", but I don't feel the need for us to
return the list of "things-that-cannot-host" (i.e. "sharing RPs") when
the scheduler asks Placement for the list of potential targets.

That said, in order to make scheduling decisions based on filters
(like a dummy OnlyLocalDiskFilter or a NetworkSegmentedOnlyFilter) or
weighers (a PreferMeLocalDisksWeigher), we could potentially imagine
the construct returned by Placement for the resource classes it
handles containing more than just RP UUIDs: a list of extended
dictionaries (one per Resource Provider) of "inventories minus
allocations" (i.e. what's left in the cloud), keyed by resource class.

Of course, the size of the result could be a problem. Couldn't we
imagine limited paging for that? The ordering would also be
non-deterministic, since the construction of that list depends on
what is available at the time.

The Plan A option you mention hides the complexity of the
shared/non-shared logic, but at the price of making scheduling
decisions on those criteria impossible unless you put
filtering/weighting logic into Placement, which AFAIK we strongly
disagree with.

-Sylvain



Re: [nova][scheduler][placement] Allocating Complex Resources

Edward Leafe
On Jun 6, 2017, at 4:14 AM, Sylvain Bauza <[hidden email]> wrote:

> The Plan A option you mention hides the complexity of the
> shared/non-shared logic, but at the price of making scheduling
> decisions on those criteria impossible unless you put
> filtering/weighting logic into Placement, which AFAIK we strongly
> disagree with.

Not necessarily. Well, not now, for sure, but that’s why we need Traits to be integrated into Flavors as soon as possible so that we can make requests with qualitative requirements, not just quantitative. When that work is done, we can add traits to differentiate local from shared storage, just like we have traits to distinguish HDD from SSD. So if a VM with only local disk is needed, that will be in the request, and placement will never return hosts with shared storage. 
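
As a purely hypothetical illustration (the trait-in-flavor syntax had not been settled at this point, and the trait name below is invented), the idea is that a flavor could carry a qualitative requirement alongside its quantities, so placement would filter out shared-storage hosts before the scheduler ever sees them:

    # Sketch only: extra specs a "local disk only" flavor might carry once
    # traits can be expressed in flavors. Key syntax and trait name are invented.
    local_disk_flavor_extra_specs = {
        "trait:CUSTOM_LOCAL_DISK": "required",
    }
    # A flavor that is happy with shared storage would simply omit the trait
    # (or require a different one), and placement would return both kinds of host.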

-- Ed Leafe







Re: [nova][scheduler][placement] Allocating Complex Resources

Sylvain Bauza


On 06/06/2017 15:03, Edward Leafe wrote:

> On Jun 6, 2017, at 4:14 AM, Sylvain Bauza <[hidden email]> wrote:
>>
>> The Plan A option you mention hides the complexity of the
>> shared/non-shared logic, but at the price of making scheduling
>> decisions on those criteria impossible unless you put
>> filtering/weighting logic into Placement, which AFAIK we strongly
>> disagree with.
>
> Not necessarily. Well, not now, for sure, but that’s why we need Traits
> to be integrated into Flavors as soon as possible so that we can make
> requests with qualitative requirements, not just quantitative. When that
> work is done, we can add traits to differentiate local from shared
> storage, just like we have traits to distinguish HDD from SSD. So if a
> VM with only local disk is needed, that will be in the request, and
> placement will never return hosts with shared storage.
>

Well, there is a big difference between defining constraints in
flavors and making a general constraint on a per-filter basis that can
be opted into via config.

Operators could argue that they would need to update all their N
flavors in order to achieve a strict separation of shared vs.
non-shared resource providers, and that distinction would leak out to
users, who would see flavors that differ only in that aspect.

I'm not saying it's bad to put traits into flavor extra specs;
sometimes they're exactly right. But I do worry about a flavor-count
explosion if we begin putting all the filtering logic into extra specs
(plus the fact that it can't be managed through config the way filters
are at the moment).

-Sylvain



Re: [nova][scheduler][placement] Allocating Complex Resources

Chris Dent
In reply to this post by Edward Leafe
On Mon, 5 Jun 2017, Ed Leafe wrote:

> One proposal is to essentially use the same logic in placement
> that was used to include that host in those matching the
> requirements. In other words, when it tries to allocate the amount
> of disk, it would determine that that host is in a shared storage
> aggregate, and be smart enough to allocate against that provider.
> This was referred to in our discussion as "Plan A".

What would help me is a greater explanation of whether, and if so how
and why, "Plan A" doesn't work for nested resource providers.

We can declare that allocating for shared disk is fairly deterministic
if we assume that any given compute node is only associated with one
shared disk provider.

My understanding is this determinism is not the case with nested
resource providers because there's some fairly late in the game
choosing of which pci device or which numa cell is getting used.
The existing resource tracking doesn't have this problem because the
claim of those resources is made very late in the game. < Is this
correct?

The problem comes into play when we want to claim from the scheduler
(or conductor). Additional information is required to choose which
child providers to use. <- Is this correct?

Plan B overcomes the information deficit by including more
information in the response from placement (as straw-manned in the
etherpad [1]) allowing code in the filter scheduler to make accurate
claims. <- Is this correct?

For clarity and completeness in the discussion some questions for
which we have explicit answers would be useful. Some of these may
appear ignorant or obtuse and are mostly things we've been over
before. The goal is to draw out some clear statements in the present
day to be sure we are all talking about the same thing (or get us
there if not) modified for what we know now, compared to what we
knew a week or month ago.

* We already have the information the filter scheduler needs now by
   some other means, right?  What are the reasons we don't want to
   use that anymore?

* Part of the reason for having nested resource providers is because
   it can allow affinity/anti-affinity below the compute node (e.g.,
   workloads on the same host but different numa cells). If I
   remember correctly, the modelling and tracking of this kind of
   information in this way comes out of the time when we imagined the
   placement service would be doing considerably more filtering than
   is planned now. Plan B appears to be an acknowledgement of "on
   some of this stuff, we can't actually do anything but provide you
   some info, you need to decide". If that's the case, is the
   topological modelling on the placement DB side of things solely a
   convenient place to store information? If there were some other
   way to model that topology could things currently being considered
   for modelling as nested providers be instead simply modelled as
   inventories of a particular class of resource?
   (I'm not suggesting we do this, rather that the answer that says
   why we don't want to do this is useful for understanding the
   picture.)

* Does a claim made in the scheduler need to be complete? Is there
   value in making a partial claim from the scheduler that consumes a
   vcpu and some ram, and then in the resource tracker is corrected
   to consume a specific pci device, numa cell, gpu and/or fpga?
   Would this be better or worse than what we have now? Why?

* What is lacking in placement's representation of resource providers
   that makes it difficult or impossible for an allocation against a
   parent provider to be able to determine the correct child
   providers to which to cascade some of the allocation? (And by
   extension make the earlier scheduling decision.)

That's a start. With answers to at least some of these questions I
think the straw man in the etherpad can be more effectively
evaluated. As things stand right now it is a proposed solution
without a clear problem statement. I feel like we could do with a
more clear problem statement.

Thanks.

[1] https://etherpad.openstack.org/p/placement-allocations-straw-man

--
Chris Dent                  ┬──┬◡ノ(° -°ノ)       https://anticdent.org/
freenode: cdent                                         tw: @anticdent

Re: [nova][scheduler][placement] Allocating Complex Resources

Edward Leafe
On Jun 6, 2017, at 9:56 AM, Chris Dent <[hidden email]> wrote:

> For clarity and completeness in the discussion some questions for
> which we have explicit answers would be useful. Some of these may
> appear ignorant or obtuse and are mostly things we've been over
> before. The goal is to draw out some clear statements in the present
> day to be sure we are all talking about the same thing (or get us
> there if not) modified for what we know now, compared to what we
> knew a week or month ago.

One other question that came up: do we have any examples of any service (such as Neutron or Cinder) that would require the modeling for nested providers? Or is this confined to Nova?


-- Ed Leafe







Re: [nova][scheduler][placement] Allocating Complex Resources

Jay Pipes
On 06/07/2017 01:00 PM, Edward Leafe wrote:

> On Jun 6, 2017, at 9:56 AM, Chris Dent <[hidden email]> wrote:
>>
>> For clarity and completeness in the discussion some questions for
>> which we have explicit answers would be useful. Some of these may
>> appear ignorant or obtuse and are mostly things we've been over
>> before. The goal is to draw out some clear statements in the present
>> day to be sure we are all talking about the same thing (or get us
>> there if not) modified for what we know now, compared to what we
>> knew a week or month ago.
>
> One other question that came up: do we have any examples of any service
> (such as Neutron or Cinder) that would require the modeling for nested
> providers? Or is this confined to Nova?

The Cyborg project (accelerators like FPGAs and some vGPUs) needs
nested resource providers to model the relationship between a virtual
resource consumed from an accelerator and the compute node itself.

Best,
-jay


Re: [nova][scheduler][placement] Allocating Complex Resources

Mooney, Sean K


> -----Original Message-----
> From: Jay Pipes [mailto:[hidden email]]
> Sent: Wednesday, June 7, 2017 6:47 PM
> To: [hidden email]
> Subject: Re: [openstack-dev] [nova][scheduler][placement] Allocating
> Complex Resources
>
> On 06/07/2017 01:00 PM, Edward Leafe wrote:
> > On Jun 6, 2017, at 9:56 AM, Chris Dent <[hidden email]
> > <mailto:[hidden email]>> wrote:
> >>
> >> For clarity and completeness in the discussion some questions for
> >> which we have explicit answers would be useful. Some of these may
> >> appear ignorant or obtuse and are mostly things we've been over
> >> before. The goal is to draw out some clear statements in the present
> >> day to be sure we are all talking about the same thing (or get us
> >> there if not) modified for what we know now, compared to what we
> knew
> >> a week or month ago.
> >
> > One other question that came up: do we have any examples of any
> > service (such as Neutron or Cinder) that would require the modeling
> > for nested providers? Or is this confined to Nova?
>
> The Cyborg project (accelerators like FPGAs and some vGPUs) need nested
> resource providers to model the relationship between a virtual resource
> context against an accelerator and the compute node itself.
[Mooney, Sean K] Neutron will also need to use nested resource providers to track
network-backend-specific consumable resources in the future. One example is
hardware-offloaded virtual (e.g. virtio/vhost-user) interfaces, which due to
their hardware-based implementation are both a finite consumable resource and
NUMA-affine, and therefore need to be tracked as nested providers.

Another example for Neutron would be bandwidth-based scheduling / SLA enforcement,
where we want to guarantee that a specific amount of bandwidth is available on the
selected host for a VM to consume. From an OVS/VPP/Linux bridge perspective this would
likely be tracked at the physnet level, so when selecting a host we would want to
ensure that the physnet is both available from the host and has enough bandwidth
left to reserve for the instance.

Today Nova and Neutron do not track either of the above, but at least the latter has
been started in the SR-IOV context without placement, and it should be extended to
other non-SR-IOV backends. Snabb switch actually supports this already with vendor
extensions via the Neutron binding:profile:
https://github.com/snabbco/snabb/blob/b7d6d77ba5fd6a6b9306f92466c1779bba2caa31/src/program/snabbnfv/doc/neutron-api-extensions.md#bandwidth-reservation
but Nova is not aware of the capacity or availability info when placing the instance,
so if the host cannot fulfill the request they degrade to the least oversubscribed port:
https://github.com/snabbco/snabb-neutron/blob/master/snabb_neutron/mechanism_snabb.py#L194-L200

With nested resource providers they could harden this request from best effort to a
guaranteed bandwidth reservation, by informing the placement API of the bandwidth
available on the physical interfaces and also of their NUMA affinity, by creating
nested resource providers.
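
To illustrate the above (names, trait strings, and numbers are all invented; this is not an existing Neutron or placement model), a compute node could be modelled with one child provider per physical interface, each carrying VF and bandwidth inventory and tagged with its physnet and NUMA affinity:

    # Hypothetical nested-provider tree for a host with two NICs on physnet0.
    compute_node = {
        "name": "compute-1",
        "inventories": {"VCPU": 32, "MEMORY_MB": 131072},
        "children": [
            {
                "name": "compute-1:eth0",
                "traits": ["CUSTOM_PHYSNET0", "CUSTOM_NUMA_NODE_0"],  # invented traits
                "inventories": {"SRIOV_NET_VF": 8, "NET_BW_KBPS": 10000000},
            },
            {
                "name": "compute-1:eth1",
                "traits": ["CUSTOM_PHYSNET0", "CUSTOM_NUMA_NODE_1"],
                "inventories": {"SRIOV_NET_VF": 8, "NET_BW_KBPS": 10000000},
            },
        ],
    }
    # A request for one VF plus a guaranteed bandwidth reservation on physnet0
    # would then be allocated against one specific child, not the host as a whole.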


Re: [nova][scheduler][placement] Allocating Complex Resources

Edward Leafe
On Jun 7, 2017, at 1:44 PM, Mooney, Sean K <[hidden email]> wrote:

> [Mooney, Sean K] Neutron will also need to use nested resource providers to track
> network-backend-specific consumable resources in the future. One example is
> hardware-offloaded virtual (e.g. virtio/vhost-user) interfaces, which due to
> their hardware-based implementation are both a finite consumable resource and
> NUMA-affine, and therefore need to be tracked as nested providers.
>
> Another example for Neutron would be bandwidth-based scheduling / SLA enforcement,
> where we want to guarantee that a specific amount of bandwidth is available on the
> selected host for a VM to consume. From an OVS/VPP/Linux bridge perspective this
> would likely be tracked at the physnet level, so when selecting a host we would
> want to ensure that the physnet is both available from the host and has enough
> bandwidth left to reserve for the instance.

OK, thanks, this is excellent information.

New question: will the placement service always be able to pick an acceptable provider, given that the request needs X amount of bandwidth? IOW, are there other considerations besides quantitative amount (and possibly traits for qualitative concerns) that placement simply doesn't know about? The example I have in mind is the case of stack vs. spread, where there are a few available providers that can meet the request. The logic for which one to pick can't be in placement, though, as it's a detail of the calling service. In the case of Nova, the assignment of VFs on vNICs usually should be spread, but that is not what placement knows; it's handled by filters/weighers in Nova's scheduler.

OK, that was a really long way of asking: will Neutron ever need to be able to determine the “best” choice from a selection of resource providers? Or will the fact that a resource provider has enough of a given resource be all that is needed?


-- Ed Leafe









Re: [nova][scheduler][placement] Allocating Complex Resources

Edward Leafe
In reply to this post by Chris Dent
Sorry for the top-post, but it seems that nobody has responded to this, and there are a lot of important questions that need answers. So I'm simply re-posting this so that we don't get ahead of ourselves by planning implementations before we fully understand the problem and the implications of any proposed solution.


-- Ed Leafe


> On Jun 6, 2017, at 9:56 AM, Chris Dent <[hidden email]> wrote:
>
> [Chris's message and questions, quoted in full earlier in the thread; snipped]



Re: [nova][scheduler][placement] Allocating Complex Resources

Jay Pipes
In reply to this post by Edward Leafe
On 06/05/2017 05:22 PM, Ed Leafe wrote:
> Another proposal involved a change to how placement responds to the
> scheduler. Instead of just returning the UUIDs of the compute nodes
> that satisfy the required resources, it would include a whole bunch
> of additional information in a structured response. A straw man
> example of such a response is here:
> https://etherpad.openstack.org/p/placement-allocations-straw-man.
> This was referred to as "Plan B".

Actually, this was Plan "C". Plan "B" was to modify the return of the
GET /resource_providers Placement REST API endpoint.

 > The main feature of this approach
> is that part of that response would be the JSON dict for the
> allocation call, containing the specific resource provider UUID for
> each resource. This way, when the scheduler selects a host

Important clarification is needed here. The proposal is to have the
scheduler actually select *more than just the compute host*. The
scheduler would select the host, any sharing providers and any child
providers within a host that actually contained the resources/traits
that the request demanded.

 >, it would

> simply pass that dict back to the /allocations call, and placement
> would be able to do the allocations directly against that
> information.
>
> There was another issue raised: simply providing the host UUIDs
> didn't give the scheduler enough information in order to run its
> filters and weighers. Since the scheduler uses those UUIDs to
> construct HostState objects, the specific missing information was
> never completely clarified, so I'm just including this aspect of the
> conversation for completeness. It is orthogonal to the question of
> how to allocate when the resource provider is not "simple".

The specific missing information is the following, but not limited to:

* Whether or not a resource can be provided by a sharing provider or a
"local provider" or either. For example, assume a compute node that is
associated with a shared storage pool via an aggregate but that also has
local disk for instances. The Placement API currently returns just the
compute host UUID but no indication of whether the compute host has
local disk to consume from, has shared disk to consume from, or both.
The scheduler is the thing that must weigh these choices and make a
choice. The placement API gives the scheduler the choices and the
scheduler makes a decision based on sorting/weighing algorithms.

It is imperative to remember the reason *why* we decided (way back in
Portland at the Nova mid-cycle last year) to keep sorting/weighing in
the Nova scheduler. The reason is because operators (and some
developers) insisted on being able to weigh the possible choices in ways
that "could not be pre-determined". In other words, folks wanted to keep
the existing uber-flexibility and customizability that the scheduler
weighers (and home-grown weigher plugins) currently allow, including
being able to sort possible compute hosts by such things as the average
thermal temperature of the power supply the hardware was connected to
over the last five minutes (I kid you friggin not.)

* Which SR-IOV physical function should provide an SRIOV_NET_VF
resource to an instance. Imagine a situation where a compute host has 4
SR-IOV physical functions, each having some traits representing hardware
offload support and each having an inventory of 8 SRIOV_NET_VF.
Currently the scheduler absolutely has the information to pick one of
these SRIOV physical functions to assign to a workload. What the
scheduler does *not* have, however, is a way to tell the Placement API
to consume an SRIOV_NET_VF from that particular physical function. Why?
Because the scheduler doesn't know that a particular physical function
even *is* a resource provider in the placement API. *Something* needs to
inform the scheduler that the physical function is a resource provider
and has a particular UUID to identify it. This is precisely what the
proposed GET /allocation_requests HTTP response data provides to the
scheduler.
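
A rough illustration of that (provider UUIDs and field names invented; see the strawman review linked at the end of this mail for the real proposal): for a request asking for one SRIOV_NET_VF, placement would return one candidate allocation per physical function that could supply it, each naming the PF's own provider UUID so the scheduler can claim against the right one.

    # Hypothetical response fragment for a host with two qualifying SR-IOV PFs;
    # the scheduler weighs the candidates and claims exactly one of them.
    allocation_requests = [
        {"allocations": [
            {"resource_provider": {"uuid": "COMPUTE-NODE-UUID"},
             "resources": {"VCPU": 2, "MEMORY_MB": 4096}},
            {"resource_provider": {"uuid": "PF-0-UUID"},
             "resources": {"SRIOV_NET_VF": 1}},
        ]},
        {"allocations": [
            {"resource_provider": {"uuid": "COMPUTE-NODE-UUID"},
             "resources": {"VCPU": 2, "MEMORY_MB": 4096}},
            {"resource_provider": {"uuid": "PF-1-UUID"},
             "resources": {"SRIOV_NET_VF": 1}},
        ]},
    ]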

> My current feeling is that we got ourselves into our existing mess of
> ugly, convoluted code when we tried to add these complex
> relationships into the resource tracker and the scheduler. We set out
> to create the placement engine to bring some sanity back to how we
> think about things we need to virtualize.

Sorry, I completely disagree with your assessment of why the placement
engine exists. We didn't create it to bring some sanity back to how we
think about things we need to virtualize. We created it to add
consistency and structure to the representation of resources in the system.

I don't believe that exposing this structured representation of
resources is a bad thing or that it is leaking "implementation details"
out of the placement API. It's not an implementation detail that a
resource provider is a child of another or that a different resource
provider is supplying some resource to a group of other providers.
That's simply an accurate representation of the underlying data structures.

 > I would really hate to see
> us make the same mistake again, by adding a good deal of complexity
> to handle a few non-simple cases. What I would like to avoid, no
> matter what the eventual solution chosen, is representing this
> complexity in multiple places. Currently the only two candidates for
> this logic are the placement engine, which knows about these
> relationships already, or the compute service itself, which has to
> handle the management of these complex virtualized resources.

The compute service will need to know about the hierarchies of providers
on a particular compute node. That isn't complexity. It's simply
accurate representation of the underlying data structures. Instead of
random dicts of key/value pairs and different serialized JSON blobs for
each particular class of resources, we now have a single, consistent way
of describing the providers of those resources.

> I don't know the answer. I'm hoping that we can have a discussion
> that might uncover a clear approach, or, at the very least, one that
> is less murky than the others.

I really like Dan's idea of returning a list of HTTP request bodies for
POST /allocations/{consumer_uuid} calls along with a list of provider
information that the scheduler can use in its sorting/weighing algorithms.

We've put this straw-man proposal here:

https://review.openstack.org/#/c/471927/

I'm hoping to keep the conversation going there.
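
In very rough terms (the client object and endpoint handling here are placeholders, and the HTTP method is as described in the proposal), the scheduler-side flow would then look something like:

    # Sketch: weigh the candidates placement returned, then send the chosen
    # request body back verbatim as the claim. Error handling and retries omitted.
    def pick_and_claim(placement_client, consumer_uuid, allocation_requests, weigh):
        # 'weigh' is whatever deployer-defined sorting the scheduler applies;
        # placement only guarantees that each candidate is a valid fit.
        best = max(allocation_requests, key=weigh)
        # POST /allocations/{consumer_uuid} with the body exactly as returned.
        placement_client.post('/allocations/%s' % consumer_uuid, json=best)
        return best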

Best,
-jay


Re: [nova][scheduler][placement] Allocating Complex Resources

Dan Smith
>> My current feeling is that we got ourselves into our existing mess
>> of ugly, convoluted code when we tried to add these complex
>> relationships into the resource tracker and the scheduler. We set
>> out to create the placement engine to bring some sanity back to how
>> we think about things we need to virtualize.
>
> Sorry, I completely disagree with your assessment of why the
> placement engine exists. We didn't create it to bring some sanity
> back to how we think about things we need to virtualize. We created
> it to add consistency and structure to the representation of
> resources in the system.
>
> I don't believe that exposing this structured representation of
> resources is a bad thing or that it is leaking "implementation
> details" out of the placement API. It's not an implementation detail
> that a resource provider is a child of another or that a different
> resource provider is supplying some resource to a group of other
> providers. That's simply an accurate representation of the underlying
> data structures.

This ^.

With the proposal Jay has up, placement is merely exposing some of its
own data structures to a client that has declared what it wants. The
client has made a request for resources, and placement is returning some
allocations that would be valid. None of them are nova-specific at all
-- they're all data structures that you would pass to and/or retrieve
from placement already.

>> I don't know the answer. I'm hoping that we can have a discussion
>> that might uncover a clear approach, or, at the very least, one
>> that is less murky than the others.
>
> I really like Dan's idea of returning a list of HTTP request bodies
> for POST /allocations/{consumer_uuid} calls along with a list of
> provider information that the scheduler can use in its
> sorting/weighing algorithms.
>
> We've put this straw-man proposal here:
>
> https://review.openstack.org/#/c/471927/
>
> I'm hoping to keep the conversation going there.

This is the most clear option that we have, in my opinion. It simplifies
what the scheduler has to do, it simplifies what conductor has to do
during a retry, and it minimizes the amount of work that something else
like cinder would have to do to use placement to schedule resources.
Without this, cinder/neutron/whatever has to know about things like
aggregates and hierarchical relationships between providers in order to
make *any* sane decision about selecting resources. If placement returns
valid options with that stuff figured out, then those services can look
at the bits they care about and make a decision.

I'd really like us to use the existing strawman spec as a place to
iterate on what that API would look like, assuming we're going to go
that route, and work on actual code in both placement and the scheduler
to use it. I'm hoping that doing so will help clarify whether this is
the right approach or not, and whether there are other gotchas that we
don't yet have on our radar. We're rapidly running out of runway for
pike here and I feel like we've got to get moving on this or we're going
to have to punt. Since several other things depend on this work, we need
to consider the impact to a lot of our pike commitments if we're not
able to get something merged.

--Dan


Re: [nova][scheduler][placement] Allocating Complex Resources

Jay Pipes
In reply to this post by Chris Dent
Sorry, been in a three-hour meeting. Comments inline...

On 06/06/2017 10:56 AM, Chris Dent wrote:

> On Mon, 5 Jun 2017, Ed Leafe wrote:
>
>> One proposal is to essentially use the same logic in placement
>> that was used to include that host in those matching the
>> requirements. In other words, when it tries to allocate the amount
>> of disk, it would determine that that host is in a shared storage
>> aggregate, and be smart enough to allocate against that provider.
>> This was referred to in our discussion as "Plan A".
>
> What would help me is a greater explanation of whether, and if so how
> and why, "Plan A" doesn't work for nested resource providers.

We'd have to add all the sorting/weighing logic from the existing
scheduler into the Placement API. Otherwise, the Placement API won't
understand which child provider to pick out of many providers that meet
resource/trait requirements.

> We can declare that allocating for shared disk is fairly deterministic
> if we assume that any given compute node is only associated with one
> shared disk provider.

a) we can't assume that
b) a compute node could very well have both local disk and shared disk.
How would the placement API know which one to pick? This is a
sorting/weighing decision and thus is something the scheduler is
responsible for.

> My understanding is this determinism is not the case with nested
> resource providers because there's some fairly late in the game
> choosing of which pci device or which numa cell is getting used.
> The existing resource tracking doesn't have this problem because the
> claim of those resources is made very late in the game. < Is this
> correct?

No, it's not about determinism or how late in the game a claim decision
is made. It's really just that the scheduler is the thing that does
sorting/weighing, not the placement API. We made this decision due to
the operator feedback that they were not willing to give up their
ability to add custom weighers and be able to have scheduling policies
that rely on transient data like thermal metrics collection.

> The problem comes into play when we want to claim from the scheduler
> (or conductor). Additional information is required to choose which
> child providers to use. <- Is this correct?

Correct.

> Plan B overcomes the information deficit by including more
> information in the response from placement (as straw-manned in the
> etherpad [1]) allowing code in the filter scheduler to make accurate
> claims. <- Is this correct?

Partly, yes. But, more than anything it's about the placement API
returning resource provider UUIDs for child providers and sharing
providers so that the scheduler, when it picks one of those SRIOV
physical functions, or NUMA cells, or shared storage pools, has the
identifier with which to tell the placement API "ok, claim *this*
resource against *this* provider".

> * We already have the information the filter scheduler needs now by
>   some other means, right?  What are the reasons we don't want to
>   use that anymore?

The filter scheduler has most of the information, yes. What it doesn't
have is the *identifier* (UUID) for things like SRIOV PFs or NUMA cells
that the Placement API will use to distinguish between things. In other
words, the filter scheduler currently does things like unpack a
NUMATopology object into memory and determine a NUMA cell to place an
instance to. However, it has no concept that that NUMA cell is (or will
soon be once nested-resource-providers is done) a resource provider in
the placement API. Same for SRIOV PFs. Same for VGPUs. Same for FPGAs,
etc. That's why we need to return information to the scheduler from the
placement API that will allow the scheduler to understand "hey, this
NUMA cell on compute node X is resource provider $UUID".

> * Part of the reason for having nested resource providers is because
>   it can allow affinity/anti-affinity below the compute node (e.g.,
>   workloads on the same host but different numa cells).

Mmm, kinda, yeah.

 >  If I
>   remember correctly, the modelling and tracking of this kind of
>   information in this way comes out of the time when we imagined the
>   placement service would be doing considerably more filtering than
>   is planned now. Plan B appears to be an acknowledgement of "on
>   some of this stuff, we can't actually do anything but provide you
>   some info, you need to decide".

Not really. Filtering is still going to be done in the placement API.
It's the thing that says "hey, these providers (or trees of providers)
meet these resource and trait requirements". The scheduler however is
what takes that set of filtered providers and does its sorting/weighing
magic and selects one.

 > If that's the case, is the
>   topological modelling on the placement DB side of things solely a
>   convenient place to store information? If there were some other
>   way to model that topology could things currently being considered
>   for modelling as nested providers be instead simply modelled as
>   inventories of a particular class of resource?
>   (I'm not suggesting we do this, rather that the answer that says
>   why we don't want to do this is useful for understanding the
>   picture.)

The modeling of the topologies of providers in the placement API/DB is
strictly to ensure consistency and correctness of representation. We're
modeling the actual relationship between resource providers in a generic
way and not embedding that topology information in a variety of JSON
blobs and other structs in the cell database.

> * Does a claim made in the scheduler need to be complete? Is there
>   value in making a partial claim from the scheduler that consumes a
>   vcpu and some ram, and then in the resource tracker is corrected
>   to consume a specific pci device, numa cell, gpu and/or fpga?
>   Would this be better or worse than what we have now? Why?

Good question. I think the answer to this is probably pretty theoretical
at this point. My gut instinct is that we should treat the consumption
of resources in an atomic fashion, and that transactional nature of
allocation will result in fewer race conditions and cleaner code. But,
admittedly, this is just my gut reaction.

> * What is lacking in placement's representation of resource providers
>   that makes it difficult or impossible for an allocation against a
>   parent provider to be able to determine the correct child
>   providers to which to cascade some of the allocation? (And by
>   extension make the earlier scheduling decision.)

See above. The sorting/weighing logic, which is very much
deployer-defined and reeks of customization, is what would need to be
added to the placement API.

best,
-jay

> That's a start. With answers to at least some of these questions I
> think the straw man in the etherpad can be more effectively
> evaluated. As things stand right now it is a proposed solution
> without a clear problem statement. I feel like we could do with a
> more clear problem statement.
>
> Thanks.
>
> [1] https://etherpad.openstack.org/p/placement-allocations-straw-man
>
>
>



Re: [nova][scheduler][placement] Allocating Complex Resources

Edward Leafe
On Jun 9, 2017, at 4:35 PM, Jay Pipes <[hidden email]> wrote:

>> We can declare that allocating for shared disk is fairly deterministic
>> if we assume that any given compute node is only associated with one
>> shared disk provider.
>
> a) we can't assume that
> b) a compute node could very well have both local disk and shared disk. how would the placement API know which one to pick? This is a sorting/weighing decision and thus is something the scheduler is responsible for.

I remember having this discussion, and we concluded that a compute node could either have local or shared resources, but not both. There would be a trait to indicate shared disk. Has this changed?

>> * We already have the information the filter scheduler needs now by
>>  some other means, right?  What are the reasons we don't want to
>>  use that anymore?
>
> The filter scheduler has most of the information, yes. What it doesn't have is the *identifier* (UUID) for things like SRIOV PFs or NUMA cells that the Placement API will use to distinguish between things. In other words, the filter scheduler currently does things like unpack a NUMATopology object into memory and determine a NUMA cell to place an instance to. However, it has no concept that that NUMA cell is (or will soon be once nested-resource-providers is done) a resource provider in the placement API. Same for SRIOV PFs. Same for VGPUs. Same for FPGAs, etc. That's why we need to return information to the scheduler from the placement API that will allow the scheduler to understand "hey, this NUMA cell on compute node X is resource provider $UUID".
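
(For readers following along, a rough illustration of the kind of
structured response described in the quoted paragraph above -- the field
names are assumptions, loosely following the straw-man etherpad, not a
settled API.)

    # Sketch only: one candidate from placement, carrying a ready-made
    # allocation body so the scheduler never has to re-derive which child
    # or sharing provider owns which resource.
    candidate = {
        "allocations": [
            {
                "resource_provider": {"uuid": "<compute-node-rp-uuid>"},
                "resources": {"VCPU": 2, "MEMORY_MB": 4096},
            },
            {
                # placement has already resolved that this NUMA cell /
                # SRIOV PF / shared pool provides the remaining resources
                "resource_provider": {"uuid": "<sriov-pf-rp-uuid>"},
                "resources": {"SRIOV_NET_VF": 1},
            },
        ],
    }
    # On selection, the scheduler passes this dict straight back to the
    # allocations call unchanged.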

I guess that this was the point that confused me. The RP uuid is part of the provider: the compute node's uuid, and (after https://review.openstack.org/#/c/469147/ merges) the PCI device's uuid. So in the code that passes the PCI device information to the scheduler, we could add that new uuid field, and then the scheduler would have the information to a) select the best fit and then b) claim it with the specific uuid. Same for all the other nested/shared devices.

I don't mean to belabor this, but to my mind this seems a lot less disruptive to the existing code.


-- Ed Leafe

Re: [nova][scheduler][placement] Allocating Complex Resources

Dan Smith-2
>> b) a compute node could very well have both local disk and shared
>> disk. how would the placement API know which one to pick? This is a
>> sorting/weighing decision and thus is something the scheduler is
>> responsible for.

> I remember having this discussion, and we concluded that a
> compute node could either have local or shared resources, but not
> both. There would be a trait to indicate shared disk. Has this
> changed?

I've always thought we discussed that one of the benefits of this
approach was that it _could_ have both. Maybe we said "initially we
won't implement stuff so it can have both" but I think the plan has been
that we'd be able to support it.

>>> * We already have the information the filter scheduler needs now
>>>  by some other means, right?  What are the reasons we don't want
>>>  to use that anymore?
>>
>> The filter scheduler has most of the information, yes. What it
>> doesn't have is the *identifier* (UUID) for things like SRIOV PFs
>> or NUMA cells that the Placement API will use to distinguish
>> between things. In other words, the filter scheduler currently does
>> things like unpack a NUMATopology object into memory and determine
>> a NUMA cell to place an instance to. However, it has no concept
>> that that NUMA cell is (or will soon be once
>> nested-resource-providers is done) a resource provider in the
>> placement API. Same for SRIOV PFs. Same for VGPUs. Same for FPGAs,
>>  etc. That's why we need to return information to the scheduler
>> from the placement API that will allow the scheduler to understand
>> "hey, this NUMA cell on compute node X is resource provider
>> $UUID".

Why shouldn't the scheduler know those relationships? You were the one
(well, one of them :P) who specifically wanted to teach the nova
scheduler to be in the business of arranging and making claims
(allocations) against placement before returning. Why should some parts
of the scheduler know about resource providers, but not others? And how
would the scheduler be able to make the proper decisions (which require
knowledge of hierarchical relationships) without that knowledge? I'm
sure I'm missing something obvious, so please correct me.

IMHO, the scheduler should eventually evolve into a thing that mostly
deals in the currency of placement, translating those into nova concepts
where needed to avoid placement having to know anything about them.
In other words, I would expect to be able to explain the purpose of the
scheduler as "applies nova-specific logic to the generic resources that
placement says are _valid_, with the goal of determining which one is
_best_".

--Dan


Re: [nova][scheduler][placement] Allocating Complex Resources

Chris Dent-2
In reply to this post by Jay Pipes
On Fri, 9 Jun 2017, Jay Pipes wrote:

> Sorry, been in a three-hour meeting. Comments inline...

Thanks for getting to this, it's very helpful to me.

>> * Part of the reason for having nested resource providers is because
>>   it can allow affinity/anti-affinity below the compute node (e.g.,
>>   workloads on the same host but different numa cells).
>
> Mmm, kinda, yeah.

What I meant by this was that if it didn't matter which of several
nested rps was used, then it would be easier to simply treat the group
of them as members of a single inventory (that came out a bit more in
one of the later questions).
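
(Roughly, and with made-up structures: the difference is between one
provider exposing an inventory of N interchangeable things versus N
child providers each exposing one, which only pays off when it matters
which one you land on.)

    # Sketch only: interchangeable resources as a single inventory ...
    flat = {
        "uuid": "<compute-node-rp-uuid>",
        "inventories": {"VGPU": {"total": 4}},
    }

    # ... versus one child provider per device, needed once affinity or
    # anti-affinity between specific children starts to matter.
    nested = [
        {"uuid": "<vgpu-rp-uuid-%d>" % i,
         "parent_provider_uuid": "<compute-node-rp-uuid>",
         "inventories": {"VGPU": {"total": 1}}}
        for i in range(4)
    ]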

>> * Does a claim made in the scheduler need to be complete? Is there
>>   value in making a partial claim from the scheduler that consumes a
>>   vcpu and some ram, and then in the resource tracker is corrected
>>   to consume a specific pci device, numa cell, gpu and/or fpga?
>>   Would this be better or worse than what we have now? Why?
>
> Good question. I think the answer to this is probably pretty theoretical at
> this point. My gut instinct is that we should treat the consumption of
> resources in an atomic fashion, and that the transactional nature of allocation
> will result in fewer race conditions and cleaner code. But, admittedly, this
> is just my gut reaction.

I suppose if we were more spread-oriented than pack-oriented, an
allocation of vcpu and ram would almost operate as a proxy for a
lock, allowing the later correcting allocation proposed above to be
somewhat safe, because other near-concurrent emplacements would be
happening on some other machine. But we don't have that reality.

I've always been in favor of making the allocation as early as
possible. I remember those halcyon days when we even thought it
might be possible to make a request and claim of resources in one
HTTP request.

>>   that makes it difficult or impossible for an allocation against a
>>   parent provider to be able to determine the correct child
>>   providers to which to cascade some of the allocation? (And by
>>   extension make the earlier scheduling decision.)
>
> See above. The sorting/weighing logic, which is very much deployer-defined
> and reeks of customization, is what would need to be added to the placement
> API.

And enough of that sorting/weighing logic likely has to do with child
or shared providers that it's not possible to constrain the weighing
and sorting solely to compute nodes? Not just whether the host is on
fire, but the shared disk farm too?

Okay, thank you, that helps set the stage more clearly and leads
straight to my remaining big question, which is asked on the spec
you've proposed:

     https://review.openstack.org/#/c/471927/

What are the broad-strokes mechanisms for connecting the non-allocation
data in the response to GET /allocation_requests to the sorting/weighing
logic? Answering on the spec works fine for me; I'm just repeating it
here in case people following along want to make the transition over to
the spec.

Thanks again.

--
Chris Dent                  ┬──┬◡ノ(° -°ノ)       https://anticdent.org/
freenode: cdent                                         tw: @anticdent

Re: [nova][scheduler][placement] Allocating Complex Resources

Chris Dent-2
In reply to this post by Dan Smith-2
On Fri, 9 Jun 2017, Dan Smith wrote:

> In other words, I would expect to be able to explain the purpose of the
> scheduler as "applies nova-specific logic to the generic resources that
> placement says are _valid_, with the goal of determining which one is
> _best_".

This sounds great as an explanation. If we can reach this we done good.

--
Chris Dent                  ┬──┬◡ノ(° -°ノ)       https://anticdent.org/
freenode: cdent                                         tw: @anticdent

Re: [nova][scheduler][placement] Allocating Complex Resources

Jay Pipes
In reply to this post by Edward Leafe
On 06/09/2017 06:31 PM, Ed Leafe wrote:

> On Jun 9, 2017, at 4:35 PM, Jay Pipes <[hidden email]> wrote:
>
>>> We can declare that allocating for shared disk is fairly deterministic
>>> if we assume that any given compute node is only associated with one
>>> shared disk provider.
>>
>> a) we can't assume that
>> b) a compute node could very well have both local disk and shared disk. how would the placement API know which one to pick? This is a sorting/weighing decision and thus is something the scheduler is responsible for.
>
> I remember having this discussion, and we concluded that a compute node could either have local or shared resources, but not both. There would be a trait to indicate shared disk. Has this changed?

I'm not sure it's changed per se :) It's just that there's nothing
preventing this from happening. A compute node can theoretically have
local disk and also be associated with a shared storage pool.

>>> * We already have the information the filter scheduler needs now by
>>>   some other means, right?  What are the reasons we don't want to
>>>   use that anymore?
>>
>> The filter scheduler has most of the information, yes. What it doesn't have is the *identifier* (UUID) for things like SRIOV PFs or NUMA cells that the Placement API will use to distinguish between things. In other words, the filter scheduler currently does things like unpack a NUMATopology object into memory and determine a NUMA cell to place an instance to. However, it has no concept that that NUMA cell is (or will soon be once nested-resource-providers is done) a resource provider in the placement API. Same for SRIOV PFs. Same for VGPUs. Same for FPGAs, etc. That's why we need to return information to the scheduler from the placement API that will allow the scheduler to understand "hey, this NUMA cell on compute node X is resource provider $UUID".
>
> I guess that this was the point that confused me. The RP uuid is part of the provider: the compute node's uuid, and (after https://review.openstack.org/#/c/469147/ merges) the PCI device's uuid. So in the code that passes the PCI device information to the scheduler, we could add that new uuid field, and then the scheduler would have the information to a) select the best fit and then b) claim it with the specific uuid. Same for all the other nested/shared devices.

How would the scheduler know that a particular SRIOV PF resource
provider UUID is on a particular compute node unless the placement API
returns information indicating that SRIOV PF is a child of a particular
compute node resource provider?
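
(Illustrative only: the kind of parent/child information that would have
to come back for the scheduler to tie a PF provider to its compute node.
The field name parent_provider_uuid is assumed from the
nested-resource-providers work.)

    # Sketch: without the parent link, a bare SRIOV PF provider UUID
    # cannot be mapped back to the compute node that contains it.
    providers = {
        "<compute-node-rp-uuid>": {"parent_provider_uuid": None},
        "<sriov-pf-rp-uuid>": {
            "parent_provider_uuid": "<compute-node-rp-uuid>",
        },
    }

    def root_of(rp_uuid):
        # walk up the tree to the compute node at the root
        parent = providers[rp_uuid]["parent_provider_uuid"]
        return rp_uuid if parent is None else root_of(parent)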

> I don't mean to belabor this, but to my mind this seems a lot less disruptive to the existing code.

Belabor away :) I don't mind talking through the details. It's important
to do.

Best,
-jay


Re: [nova][scheduler][placement] Allocating Complex Resources

Edward Leafe
On Jun 12, 2017, at 10:20 AM, Jay Pipes <[hidden email]> wrote:

>> The RP uuid is part of the provider: the compute node's uuid, and (after https://review.openstack.org/#/c/469147/ merges) the PCI device's uuid. So in the code that passes the PCI device information to the scheduler, we could add that new uuid field, and then the scheduler would have the information to a) select the best fit and then b) claim it with the specific uuid. Same for all the other nested/shared devices.

> How would the scheduler know that a particular SRIOV PF resource provider UUID is on a particular compute node unless the placement API returns information indicating that SRIOV PF is a child of a particular compute node resource provider?

Because PCI devices are per compute node. The HostState object populates itself from the compute node here:


If we add the UUID information to the PCI device, as the above-mentioned patch proposes, when the scheduler selects a particular compute node that has the device, it uses the PCI device’s UUID. I thought that having that information in the scheduler was what that patch was all about.
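
(A hypothetical sketch of the alternative Ed describes; the HostState
and PCI device fields shown here are illustrative, not the actual
objects.)

    # Sketch: if each PCI device record carried its resource provider
    # UUID, the scheduler could pick a device while weighing the host
    # and then claim against that UUID directly.
    host_state = {
        "uuid": "<compute-node-rp-uuid>",
        "pci_devices": [
            {"address": "0000:04:00.1", "uuid": "<pf-rp-uuid-1>", "free": True},
            {"address": "0000:04:00.2", "uuid": "<pf-rp-uuid-2>", "free": False},
        ],
    }

    device = next(d for d in host_state["pci_devices"] if d["free"])
    allocations = [
        {"resource_provider": {"uuid": host_state["uuid"]},
         "resources": {"VCPU": 2, "MEMORY_MB": 4096}},
        {"resource_provider": {"uuid": device["uuid"]},
         "resources": {"SRIOV_NET_VF": 1}},
    ]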

-- Ed Leafe