[nova][scheduler][placement] Trying to understand the proposed direction

[nova][scheduler][placement] Trying to understand the proposed direction

Ed Leafe-2
There is a lot going on lately in placement-land, and some of the changes being proposed are complex enough that it is difficult to understand what the final result is supposed to look like. I have documented my understanding of the current way that the placement/scheduler interaction works, and also my understanding of how it will work when the proposed changes are all implemented. I don’t know how close that understanding is to what the design is, so I’m hoping that this will serve as a starting point for clarifying things, so that everyone involved in these efforts has a clear view of the target we are aiming for. So please reply to this thread with any corrections or additions, so that all can see.

I do realize that some of this is to be done in Pike, and the rest in Queens, but that timetable is not relevant to the overall understanding of the design.

-- Ed Leafe

Current flow:
* Scheduler gets a req spec from conductor, containing resource requirements
* Scheduler sends those requirements to placement
* Placement runs a query to determine the root RPs that can satisfy those requirements
* Placement returns a list of the UUIDs for those root providers to scheduler
* Scheduler uses those UUIDs to create HostState objects for each
* Scheduler runs those HostState objects through filters to remove those that don't meet requirements not selected for by placement
* Scheduler runs the remaining HostState objects through weighers to order them in terms of best fit.
* Scheduler takes the host at the top of that ranked list, and tries to claim the resources in placement. If that fails, there is a race, so that HostState is discarded, and the next is selected. This is repeated until the claim succeeds.
* Scheduler then creates a list of N UUIDs, with the first being the selected host, and the rest being alternates consisting of the next hosts in the ranked list that are in the same cell as the selected host.
* Scheduler returns that list to conductor.
* Conductor determines the cell of the selected host, and sends that list to the target cell.
* Target cell tries to build the instance on the selected host. If it fails, it unclaims the resources for the selected host, and tries to claim the resources for the next host in the list. It then tries to build the instance on the next host in the list of alternates. Only when all alternates fail does the build request fail.

Proposed flow:
* Scheduler gets a req spec from conductor, containing resource requirements
* Scheduler sends those requirements to placement
* Placement runs a query to determine the root RPs that can satisfy those requirements
* Placement then constructs a data structure for each root provider as documented in the spec. [0]
* Placement returns a number of these data structures as JSON blobs. Due to the size of the data, a page size will have to be determined, and placement will have to either maintain that list of structured data for subsequent requests, or re-run the query and only calculate the data structures for the hosts that fit in the requested page. (An illustrative sketch of this structure appears just after this list.)
* Scheduler continues to request the paged results until it has them all.
* Scheduler then runs this data through the filters and weighers. No HostState objects are required, as the data structures will contain all the information that scheduler will need.
* Scheduler then selects the data structure at the top of the ranked list. Inside that structure is a dict of the allocation data that scheduler will need to claim the resources on the selected host. If the claim fails, the next data structure in the list is chosen, and repeated until a claim succeeds.
* Scheduler then creates a list of N of these data structures, with the first being the data for the selected host, and the rest being data structures representing alternates consisting of the next hosts in the ranked list that are in the same cell as the selected host.
* Scheduler returns that list to conductor.
* Conductor determines the cell of the selected host, and sends that list to the target cell.
* Target cell tries to build the instance on the selected host. If it fails, it uses the allocation data in the data structure to unclaim the resources for the selected host, and tries to claim the resources for the next host in the list using its allocation data. It then tries to build the instance on the next host in the list of alternates. Only when all alternates fail does the build request fail.
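
For concreteness, here is a rough, illustrative sketch (in Python) of the sort of structure I understand placement would return. The key names follow the straw man spec [0] and the "allocation_requests"/"provider_summaries" terms discussed later in this thread; the exact layout is still under review, so treat every name here as an assumption:

    # Illustrative only: key names follow the straw man spec [0] and may change.
    candidates = {
        "allocation_requests": [
            {
                "allocations": [
                    {"resource_provider": {"uuid": "<compute-node-uuid>"},
                     "resources": {"VCPU": 2, "MEMORY_MB": 4096}},
                    {"resource_provider": {"uuid": "<shared-storage-uuid>"},
                     "resources": {"DISK_GB": 100}},
                ],
            },
        ],
        "provider_summaries": {
            "<compute-node-uuid>": {
                "resources": {
                    "VCPU": {"capacity": 64, "used": 10},
                    "MEMORY_MB": {"capacity": 131072, "used": 8192},
                },
            },
        },
    }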


[0] https://review.openstack.org/#/c/471927/






Re: [nova][scheduler][placement] Trying to understand the proposed direction

Jay Pipes
On 06/19/2017 09:04 AM, Edward Leafe wrote:
> Current flow:
> * Scheduler gets a req spec from conductor, containing resource requirements
> * Scheduler sends those requirements to placement
> * Placement runs a query to determine the root RPs that can satisfy those requirements

Not root RPs. Non-sharing resource providers, which currently
effectively means compute node providers. Nested resource providers
isn't yet merged, so there is currently no concept of a hierarchy of
providers.

> * Placement returns a list of the UUIDs for those root providers to scheduler

It returns the provider names and UUIDs, yes.

> * Scheduler uses those UUIDs to create HostState objects for each

Kind of. The scheduler calls ComputeNodeList.get_all_by_uuid(), passing
in a list of the provider UUIDs it got back from the placement service.
The scheduler then builds a set of HostState objects from the results of
ComputeNodeList.get_all_by_uuid().

The scheduler also keeps a set of AggregateMetadata objects in memory,
including the association of aggregate to host (note: this is the
compute node's *service*, not the compute node object itself, thus the
reason aggregates don't work properly for Ironic nodes).

> * Scheduler runs those HostState objects through filters to remove those that don't meet requirements not selected for by placement

Yep.

> * Scheduler runs the remaining HostState objects through weighers to order them in terms of best fit.

Yep.

> * Scheduler takes the host at the top of that ranked list, and tries to claim the resources in placement. If that fails, there is a race, so that HostState is discarded, and the next is selected. This is repeated until the claim succeeds.

No, this is not how things work currently. The scheduler does not claim
resources. It selects the top (or random host depending on the selection
strategy) and sends the launch request to the target compute node. The
target compute node then attempts to claim the resources and in doing so
writes records to the compute_nodes table in the Nova cell database as
well as the Placement API for the compute node resource provider.

> * Scheduler then creates a list of N UUIDs, with the first being the selected host, and the rest being alternates consisting of the next hosts in the ranked list that are in the same cell as the selected host.

This isn't currently how things work, no. This has been discussed, however.

> * Scheduler returns that list to conductor.
> * Conductor determines the cell of the selected host, and sends that list to the target cell.
> * Target cell tries to build the instance on the selected host. If it fails, it unclaims the resources for the selected host, and tries to claim the resources for the next host in the list. It then tries to build the instance on the next host in the list of alternates. Only when all alternates fail does the build request fail.

This isn't currently how things work, no. There has been discussion of
having the compute node retry alternatives locally, but nothing more
than discussion.

> Proposed flow:
> * Scheduler gets a req spec from conductor, containing resource requirements
> * Scheduler sends those requirements to placement
> * Placement runs a query to determine the root RPs that can satisfy those requirements

Yes.

> * Placement then constructs a data structure for each root provider as documented in the spec. [0]

Yes.

> * Placement returns a number of these data structures as JSON blobs. Due to the size of the data, a page size will have to be determined, and placement will have to either maintain that list of structured data for subsequent requests, or re-run the query and only calculate the data structures for the hosts that fit in the requested page.

"of these data structures as JSON blobs" is kind of redundant... all our
REST APIs return data structures as JSON blobs.

While we discussed the fact that there may be a lot of entries, we did
not say we'd immediately support a paging mechanism.

> * Scheduler continues to request the paged results until it has them all.

See above. Was discussed briefly as a concern but not work to do for
first patches.

> * Scheduler then runs this data through the filters and weighers. No HostState objects are required, as the data structures will contain all the information that scheduler will need.

No, this isn't correct. The scheduler will have *some* of the
information it requires for weighing from the returned data from the GET
/allocation_candidates call, but not all of it.

Again, operators have insisted on keeping the flexibility currently in
the Nova scheduler to weigh/sort compute nodes by things like thermal
metrics and kinds of data that the Placement API will never be
responsible for.

The scheduler will need to merge information from the
"provider_summaries" part of the HTTP response with information it has
already in its HostState objects (gotten from
ComputeNodeList.get_all_by_uuid() and AggregateMetadataList).
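
As a rough illustration of that merge (this is not Nova's actual code; the attribute names on the host state side are assumptions for the sketch):

    def merge_provider_summaries(host_states, provider_summaries):
        """Overlay placement's capacity/usage data onto HostState-like objects.

        host_states: dict mapping compute node UUID -> host state object
        provider_summaries: the "provider_summaries" dict from the placement
        response, keyed by resource provider UUID.
        """
        for rp_uuid, summary in provider_summaries.items():
            host = host_states.get(rp_uuid)
            if host is None:
                continue
            resources = summary.get("resources", {})
            vcpu = resources.get("VCPU")
            if vcpu:
                host.vcpus_total = vcpu["capacity"]
                host.vcpus_used = vcpu["used"]
            ram = resources.get("MEMORY_MB")
            if ram:
                host.total_usable_ram_mb = ram["capacity"]
                host.free_ram_mb = ram["capacity"] - ram["used"]
        return host_states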

> * Scheduler then selects the data structure at the top of the ranked list. Inside that structure is a dict of the allocation data that scheduler will need to claim the resources on the selected host. If the claim fails, the next data structure in the list is chosen, and repeated until a claim succeeds.

Kind of, yes. The scheduler will select a *host* that meets its needs.

There may be more than one allocation request that includes that host
resource provider, because of shared providers and (soon) nested
providers. The scheduler will choose one of these allocation requests
and attempt a claim of resources by simply PUT
/allocations/{instance_uuid} with the serialized body of that allocation
request. If 202 returned, cool. If not, repeat for the next allocation
request.
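
A minimal sketch of that claim loop, assuming the allocation requests come back as serializable bodies and using the requests library directly (the real code goes through Nova's placement client and Keystone auth, so the names and headers here are placeholders):

    import requests

    def try_claim(placement_url, headers, instance_uuid, allocation_requests):
        """Attempt to claim resources by PUTting each allocation request in turn.

        Returns the allocation request that was accepted, or None if every
        attempt failed (e.g. all candidates were raced away).
        """
        for alloc_req in allocation_requests:
            resp = requests.put(
                "%s/allocations/%s" % (placement_url, instance_uuid),
                json=alloc_req,
                headers=headers,
            )
            if resp.status_code == 202:
                # Claim succeeded; this candidate is ours.
                return alloc_req
            # Otherwise another claim likely won the race; try the next one.
        return None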

> * Scheduler then creates a list of N of these data structures, with the first being the data for the selected host, and the rest being data structures representing alternates consisting of the next hosts in the ranked list that are in the same cell as the selected host.

Yes, this is the proposed solution for allowing retries within a cell.

> * Scheduler returns that list to conductor.
> * Conductor determines the cell of the selected host, and sends that list to the target cell.
> * Target cell tries to build the instance on the selected host. If it fails, it uses the allocation data in the data structure to unclaim the resources for the selected host, and tries to claim the resources for the next host in the list using its allocation data. It then tries to build the instance on the next host in the list of alternates. Only when all alternates fail does the build request fail.

I'll let Dan discuss this last part.

Best,
-jay

> [0] https://review.openstack.org/#/c/471927/

Re: [nova][scheduler][placement] Trying to understand the proposed direction

Matt Riedemann-3
On 6/19/2017 9:17 AM, Jay Pipes wrote:
> On 06/19/2017 09:04 AM, Edward Leafe wrote:
>> Current flow:

As noted in the nova-scheduler meeting this morning, this should have
been called "original plan" rather than "current flow", as Jay pointed
out inline.

>> * Scheduler gets a req spec from conductor, containing resource
>> requirements
>> * Scheduler sends those requirements to placement
>> * Placement runs a query to determine the root RPs that can satisfy
>> those requirements
>
> Not root RPs. Non-sharing resource providers, which currently
> effectively means compute node providers. Nested resource providers
> isn't yet merged, so there is currently no concept of a hierarchy of
> providers.
>
>> * Placement returns a list of the UUIDs for those root providers to
>> scheduler
>
> It returns the provider names and UUIDs, yes.
>
>> * Scheduler uses those UUIDs to create HostState objects for each
>
> Kind of. The scheduler calls ComputeNodeList.get_all_by_uuid(), passing
> in a list of the provider UUIDs it got back from the placement service.
> The scheduler then builds a set of HostState objects from the results of
> ComputeNodeList.get_all_by_uuid().
>
> The scheduler also keeps a set of AggregateMetadata objects in memory,
> including the association of aggregate to host (note: this is the
> compute node's *service*, not the compute node object itself, thus the
> reason aggregates don't work properly for Ironic nodes).
>
>> * Scheduler runs those HostState objects through filters to remove
>> those that don't meet requirements not selected for by placement
>
> Yep.
>
>> * Scheduler runs the remaining HostState objects through weighers to
>> order them in terms of best fit.
>
> Yep.
>
>> * Scheduler takes the host at the top of that ranked list, and tries
>> to claim the resources in placement. If that fails, there is a race,
>> so that HostState is discarded, and the next is selected. This is
>> repeated until the claim succeeds.
>
> No, this is not how things work currently. The scheduler does not claim
> resources. It selects the top (or random host depending on the selection
> strategy) and sends the launch request to the target compute node. The
> target compute node then attempts to claim the resources and in doing so
> writes records to the compute_nodes table in the Nova cell database as
> well as the Placement API for the compute node resource provider.

Not to nit pick, but today the scheduler sends the selected destinations
to the conductor. Conductor looks up the cell that a selected host is
in, creates the instance record and friends (bdms) in that cell and then
sends the build request to the compute host in that cell.

>
>> * Scheduler then creates a list of N UUIDs, with the first being the
>> selected host, and the rest being alternates consisting of the
>> next hosts in the ranked list that are in the same cell as the
>> selected host.
>
> This isn't currently how things work, no. This has been discussed, however.
>
>> * Scheduler returns that list to conductor.
>> * Conductor determines the cell of the selected host, and sends that
>> list to the target cell.
>> * Target cell tries to build the instance on the selected host. If it
>> fails, it unclaims the resources for the selected host, and tries to
>> claim the resources for the next host in the list. It then tries to
>> build the instance on the next host in the list of alternates. Only
>> when all alternates fail does the build request fail.
>
> This isn't currently how things work, no. There has been discussion of
> having the compute node retry alternatives locally, but nothing more
> than discussion.

Correct that this isn't how things currently work, but it was/is the
original plan. And the retry happens within the cell conductor, not on
the compute node itself. The top-level conductor is what's getting
selected hosts from the scheduler. The cell-level conductor is what's
getting a retry request from the compute. The cell-level conductor would
deallocate from placement for the currently claimed providers, and then
pick one of the alternatives passed down from the top and then make
allocations (a claim) against those, then send to an alternative compute
host for another build attempt.

So with this plan, there are two places to make allocations - the
scheduler first, and then the cell conductors for retries. This
duplication is why some people were originally pushing to move all
allocation-related work happen in the conductor service.

>
>> Proposed flow:
>> * Scheduler gets a req spec from conductor, containing resource
>> requirements
>> * Scheduler sends those requirements to placement
>> * Placement runs a query to determine the root RPs that can satisfy
>> those requirements
>
> Yes.
>
>> * Placement then constructs a data structure for each root provider as
>> documented in the spec. [0]
>
> Yes.
>
>> * Placement returns a number of these data structures as JSON blobs.
>> Due to the size of the data, a page size will have to be determined,
>> and placement will have to either maintain that list of structured
>> data for subsequent requests, or re-run the query and only calculate
>> the data structures for the hosts that fit in the requested page.
>
> "of these data structures as JSON blobs" is kind of redundant... all our
> REST APIs return data structures as JSON blobs.
>
> While we discussed the fact that there may be a lot of entries, we did
> not say we'd immediately support a paging mechanism.

I believe we said in the initial version we'd have the configurable
limit in the DB API queries, like we have today - the default limit is
1000. There was agreement to eventually build paging support into the API.

This does make me wonder though what happens when you have 100K or more
compute nodes reporting into placement and we limit on the first 1000.
Aren't we going to be imposing a packing strategy then just because of
how we pull things out of the database for Placement? Although I don't
see how that would be any different from before we had Placement and the
nova-scheduler service just did a ComputeNode.get_all() to the nova DB
and then filtered/weighed those objects.

>
>> * Scheduler continues to request the paged results until it has them all.
>
> See above. Was discussed briefly as a concern but not work to do for
> first patches.
>
>> * Scheduler then runs this data through the filters and weighers. No
>> HostState objects are required, as the data structures will contain
>> all the information that scheduler will need.
>
> No, this isn't correct. The scheduler will have *some* of the
> information it requires for weighing from the returned data from the GET
> /allocation_candidates call, but not all of it.
>
> Again, operators have insisted on keeping the flexibility currently in
> the Nova scheduler to weigh/sort compute nodes by things like thermal
> metrics and kinds of data that the Placement API will never be
> responsible for.
>
> The scheduler will need to merge information from the
> "provider_summaries" part of the HTTP response with information it has
> already in its HostState objects (gotten from
> ComputeNodeList.get_all_by_uuid() and AggregateMetadataList).
>
>> * Scheduler then selects the data structure at the top of the ranked
>> list. Inside that structure is a dict of the allocation data that
>> scheduler will need to claim the resources on the selected host. If
>> the claim fails, the next data structure in the list is chosen, and
>> repeated until a claim succeeds.
>
> Kind of, yes. The scheduler will select a *host* that meets its needs.
>
> There may be more than one allocation request that includes that host
> resource provider, because of shared providers and (soon) nested
> providers. The scheduler will choose one of these allocation requests
> and attempt a claim of resources by simply PUT
> /allocations/{instance_uuid} with the serialized body of that allocation
> request. If 202 returned, cool. If not, repeat for the next allocation
> request.
>
>> * Scheduler then creates a list of N of these data structures, with
>> the first being the data for the selected host, and the rest being
>> data structures representing alternates consisting of the next hosts
>> in the ranked list that are in the same cell as the selected host.
>
> Yes, this is the proposed solution for allowing retries within a cell.
>
>> * Scheduler returns that list to conductor.
>> * Conductor determines the cell of the selected host, and sends that
>> list to the target cell.
>> * Target cell tries to build the instance on the selected host. If it
>> fails, it uses the allocation data in the data structure to unclaim
>> the resources for the selected host, and tries to claim the resources
>> for the next host in the list using its allocation data. It then tries
>> to build the instance on the next host in the list of alternates. Only
>> when all alternates fail does the build request fail.
>
> I'll let Dan discuss this last part.
>
> Best,
> -jay
>
>> [0] https://review.openstack.org/#/c/471927/


--

Thanks,

Matt


Re: [nova][scheduler][placement] Trying to understand the proposed direction

Ed Leafe-2
In reply to this post by Jay Pipes
On Jun 19, 2017, at 9:17 AM, Jay Pipes <[hidden email]> wrote:

As Matt pointed out, I mis-wrote when I said “current flow”. I meant “current agreed-to design flow”. So no need to rehash that.

>> * Placement returns a number of these data structures as JSON blobs. Due to the size of the data, a page size will have to be determined, and placement will have to either maintain that list of structured data for subsequent requests, or re-run the query and only calculate the data structures for the hosts that fit in the requested page.

> "of these data structures as JSON blobs" is kind of redundant... all our REST APIs return data structures as JSON blobs.

Well, I was trying to be specific. I didn't mean to imply that this was a radical departure or anything.

> While we discussed the fact that there may be a lot of entries, we did not say we'd immediately support a paging mechanism.

OK, thanks for clarifying that. When we discussed returning 1.5K per compute host instead of a couple of hundred bytes, there was discussion that paging would be necessary.

>> * Scheduler continues to request the paged results until it has them all.

> See above. Was discussed briefly as a concern but not work to do for first patches.

>> * Scheduler then runs this data through the filters and weighers. No HostState objects are required, as the data structures will contain all the information that scheduler will need.

> No, this isn't correct. The scheduler will have *some* of the information it requires for weighing from the returned data from the GET /allocation_candidates call, but not all of it.

> Again, operators have insisted on keeping the flexibility currently in the Nova scheduler to weigh/sort compute nodes by things like thermal metrics and kinds of data that the Placement API will never be responsible for.

> The scheduler will need to merge information from the "provider_summaries" part of the HTTP response with information it has already in its HostState objects (gotten from ComputeNodeList.get_all_by_uuid() and AggregateMetadataList).

OK, that's informative, too. Is there anything decided on how much host info will be in the response from placement, and how much will be in HostState? Or how the reporting of resources by the compute nodes will have to change to feed this information to placement? Or how the two sources of information will be combined so that the filters and weighers can process it? Or is that still to be worked out?

>> * Scheduler then selects the data structure at the top of the ranked list. Inside that structure is a dict of the allocation data that scheduler will need to claim the resources on the selected host. If the claim fails, the next data structure in the list is chosen, and repeated until a claim succeeds.

> Kind of, yes. The scheduler will select a *host* that meets its needs.

> There may be more than one allocation request that includes that host resource provider, because of shared providers and (soon) nested providers. The scheduler will choose one of these allocation requests and attempt a claim of resources by simply PUT /allocations/{instance_uuid} with the serialized body of that allocation request. If 202 returned, cool. If not, repeat for the next allocation request.

Ah, yes, good point. A host with multiple nested providers, or with shared and local storage, will have to have multiple copies of the data structure returned to reflect those permutations.

>> * Scheduler then creates a list of N of these data structures, with the first being the data for the selected host, and the rest being data structures representing alternates consisting of the next hosts in the ranked list that are in the same cell as the selected host.

> Yes, this is the proposed solution for allowing retries within a cell.

OK.

>> * Scheduler returns that list to conductor.
>> * Conductor determines the cell of the selected host, and sends that list to the target cell.
>> * Target cell tries to build the instance on the selected host. If it fails, it uses the allocation data in the data structure to unclaim the resources for the selected host, and tries to claim the resources for the next host in the list using its allocation data. It then tries to build the instance on the next host in the list of alternates. Only when all alternates fail does the build request fail.

> I'll let Dan discuss this last part.

Well, that's not substantially different than the original plan, so no additional explanation is required.

One other thing: since this new functionality is exposed via a new API call, is the existing method of filtering RPs by passing in resources going to be deprecated? And the code for adding filtering by traits to that also no longer useful?


-- Ed Leafe







Re: [nova][scheduler][placement] Trying to understand the proposed direction

Jay Pipes
On 06/19/2017 01:59 PM, Edward Leafe wrote:
>> While we discussed the fact that there may be a lot of entries, we did
>> not say we'd immediately support a paging mechanism.
>
> OK, thanks for clarifying that. When we discussed returning 1.5K per
> compute host instead of a couple of hundred bytes, there was discussion
> that paging would be necessary.

Not sure where you're getting the whole 1.5K per compute host thing from.

Here's a paste with the before and after of what we're talking about:

http://paste.openstack.org/show/613129/

Note that I'm using a situation with shared storage and two compute
nodes providing VCPU and MEMORY. In the current situation, the shared
storage provider isn't returned, as you know.

The before is 231 bytes. The after (again, with three providers, not 1)
is 1651 bytes.

gzipping the after contents results in 358 bytes.

So, honestly I'm not concerned.
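
(If anyone wants to reproduce that kind of check, here is a quick sketch; the payload below is a stand-in, not the actual paste contents:)

    import gzip
    import json

    # Stand-in payload; substitute the real response body from the paste.
    payload = json.dumps({"allocation_requests": [], "provider_summaries": {}})
    raw = payload.encode("utf-8")
    print(len(raw), len(gzip.compress(raw)))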

>> Again, operators have insisted on keeping the flexibility currently in
>> the Nova scheduler to weigh/sort compute nodes by things like thermal
>> metrics and kinds of data that the Placement API will never be
>> responsible for.
>>
>> The scheduler will need to merge information from the
>> "provider_summaries" part of the HTTP response with information it has
>> already in its HostState objects (gotten from
>> ComputeNodeList.get_all_by_uuid() and AggregateMetadataList).
>
> OK, that’s informative, too. Is there anything decided on how much host
> info will be in the response from placement, and how much will be in
> HostState? Or how the reporting of resources by the compute nodes will
> have to change to feed this information to placement? Or how the two
> sources of information will be combined so that the filters and weighers
> can process it? Or is that still to be worked out?

I'm currently working on a patch that integrates the REST API into the
scheduler.

The merging of data will essentially start with the resource amounts
that the host state objects contain (stuff like total_usable_ram etc)
with the accurate data from the provider_summaries section.

Best,
-jay


Re: [nova][scheduler][placement] Trying to understand the proposed direction

Ed Leafe-2
On Jun 19, 2017, at 1:34 PM, Jay Pipes <[hidden email]> wrote:

>> OK, thanks for clarifying that. When we discussed returning 1.5K per compute host instead of a couple of hundred bytes, there was discussion that paging would be necessary.

> Not sure where you're getting the whole 1.5K per compute host thing from.

It was from the straw man example. Replacing the $FOO_UUID with UUIDs, and then stripping out all whitespace resulted in about 1500 bytes. Your example, with whitespace included, is 1600 bytes.

> Here's a paste with the before and after of what we're talking about:

> http://paste.openstack.org/show/613129/

> Note that I'm using a situation with shared storage and two compute nodes providing VCPU and MEMORY. In the current situation, the shared storage provider isn't returned, as you know.

> The before is 231 bytes. The after (again, with three providers, not 1) is 1651 bytes.

So in the basic non-shared, non-nested case, if there are, let's say, 200 compute nodes that can satisfy the request, will there be 1 "allocation_requests" key returned, with 200 "allocations" sub-keys? And one "provider_summaries" key, with 200 sub-keys on the compute node UUID?

> gzipping the after contents results in 358 bytes.

> So, honestly I'm not concerned.

Ok, just wanted to be clear.

>> OK, that's informative, too. Is there anything decided on how much host info will be in the response from placement, and how much will be in HostState? Or how the reporting of resources by the compute nodes will have to change to feed this information to placement? Or how the two sources of information will be combined so that the filters and weighers can process it? Or is that still to be worked out?

> I'm currently working on a patch that integrates the REST API into the scheduler.

> The merging of data will essentially start with the resource amounts that the host state objects contain (stuff like total_usable_ram etc) with the accurate data from the provider_summaries section.

So in the near-term, we will be using provider_summaries to update the corresponding HostState objects with those values. Is the long-term plan to have most of the HostState information moved to placement?


-- Ed Leafe







Re: [nova][scheduler][placement] Trying to understand the proposed direction

Jay Pipes
On 06/19/2017 05:24 PM, Edward Leafe wrote:

> On Jun 19, 2017, at 1:34 PM, Jay Pipes <[hidden email]
> <mailto:[hidden email]>> wrote:
>>
>>> OK, thanks for clarifying that. When we discussed returning 1.5K per
>>> compute host instead of a couple of hundred bytes, there was
>>> discussion that paging would be necessary.
>>
>> Not sure where you're getting the whole 1.5K per compute host thing from.
>
> It was from the straw man example. Replacing the $FOO_UUID with UUIDs,
> and then stripping out all whitespace resulted in about 1500 bytes. Your
> example, with whitespace included, is 1600 bytes.

It was the "per compute host" that I objected to.

>>> OK, that’s informative, too. Is there anything decided on how much
>>> host info will be in the response from placement, and how much will
>>> be in HostState? Or how the reporting of resources by the compute
>>> nodes will have to change to feed this information to placement? Or
>>> how the two sources of information will be combined so that the
>>> filters and weighers can process it? Or is that still to be worked out?
>>
>> I'm currently working on a patch that integrates the REST API into
>> the scheduler.
>>
>> The merging of data will essentially start with the resource amounts
>> that the host state objects contain (stuff like total_usable_ram etc)
>> with the accurate data from the provider_summaries section.
>
> So in the near-term, we will be using provider_summaries to update the
> corresponding HostState objects with those values. Is the long-term plan
> to have most of the HostState information moved to placement?

Some things will move to placement sooner rather than later:

* Quantitative things that can be consumed
* Simple traits

Later rather than sooner:

* Distances between aggregates (affinity/anti-affinity)

Never:

* Filtering hosts based on how many instances use a particular image
* Filtering hosts based on something that is hypervisor-dependent
* Sorting hosts based on the number of instances in a particular state
(e.g. how many instances are live-migrating or shelving at any given time)
* Weighing hosts based on the current temperature of a power supply in a
rack
* Sorting hosts based on the current weather conditions in Zimbabwe

Best,
-jay


Re: [nova][scheduler][placement] Trying to understand the proposed direction

Ed Leafe-2
On Jun 19, 2017, at 5:27 PM, Jay Pipes <[hidden email]> wrote:

>> It was from the straw man example. Replacing the $FOO_UUID with UUIDs, and then stripping out all whitespace resulted in about 1500 bytes. Your example, with whitespace included, is 1600 bytes.

> It was the "per compute host" that I objected to.

I guess it would have helped to see an example of the data returned for multiple compute nodes. The straw man example was for a single compute node with SR-IOV, NUMA and shared storage. There was no indication how multiple hosts meeting the requested resources would be returned.

-- Ed Leafe







Re: [nova][scheduler][placement] Trying to understand the proposed direction

Boris Pavlovic
Hi, 

Does this look too complicated and a bit over designed.

For example, why can't we store all the data in memory of a single Python application with a simple REST API, and have a simple plugin mechanism for filtering? Basically, there is no problem with storing it all on a single host.

Even if we have 100k hosts and every host takes about 10KB, that's roughly 1GB of RAM (I could just use a phone).

There are easy ways to copy the state across different instances (sharing updates).

And I thought that the Placement project was going to be such a centralized, small, simple app for collecting all the resource information and doing this very, very simple and easy placement selection...


Best regards,
Boris Pavlovic 

On Mon, Jun 19, 2017 at 5:05 PM, Edward Leafe <[hidden email]> wrote:
On Jun 19, 2017, at 5:27 PM, Jay Pipes <[hidden email]> wrote:

It was from the straw man example. Replacing the $FOO_UUID with UUIDs, and then stripping out all whitespace resulted in about 1500 bytes. Your example, with whitespace included, is 1600 bytes.

It was the "per compute host" that I objected to.

I guess it would have helped to see an example of the data returned for multiple compute nodes. The straw man example was for a single compute node with SR-IOV, NUMA and shared storage. There was no indication how multiple hosts meeting the requested resources would be returned.

-- Ed Leafe







Re: [nova][scheduler][placement] Trying to understand the proposed direction

Jay Pipes
In reply to this post by Ed Leafe-2
On 06/19/2017 08:05 PM, Edward Leafe wrote:

> On Jun 19, 2017, at 5:27 PM, Jay Pipes <[hidden email]
> <mailto:[hidden email]>> wrote:
>>
>>> It was from the straw man example. Replacing the $FOO_UUID with
>>> UUIDs, and then stripping out all whitespace resulted in about 1500
>>> bytes. Your example, with whitespace included, is 1600 bytes.
>>
>> It was the "per compute host" that I objected to.
>
> I guess it would have helped to see an example of the data returned for
> multiple compute nodes. The straw man example was for a single compute
> node with SR-IOV, NUMA and shared storage. There was no indication how
> multiple hosts meeting the requested resources would be returned.

The example I posted used 3 resource providers. 2 compute nodes with no
local disk and a shared storage pool.

Best,
-jay


Re: [nova][scheduler][placement] Trying to understand the proposed direction

Jay Pipes
In reply to this post by Boris Pavlovic
On 06/19/2017 09:26 PM, Boris Pavlovic wrote:
> Hi,
>
> Does this look too complicated and a bit over designed.

Is that a question?

> For example, why we can't store all data in memory of single python
> application with simple REST API and have
> simple mechanism for plugins that are filtering. Basically there is no
> any kind of problems with storing it on single host.

You mean how things currently work minus the REST API?

> If we even have 100k hosts and every host has about 10KB -> 1GB of RAM
> (I can just use phone)
>
> There are easy ways to copy the state across different instance (sharing
> updates)

We already do this. It isn't as easy as you think. It's introduced a
number of race conditions that we're attempting to address by doing
claims in the scheduler.

> And I thought that Placement project is going to be such centralized
> small simple APP for collecting all
> resource information and doing this very very simple and easy placement
> selection...

1) Placement doesn't collect anything.
2) Placement is indeed a simple small app with a global view of resources
3) Placement doesn't do the sorting/weighing of destinations. The
scheduler does that. See this thread for reasons why this is the case
(operators didn't want to give up their complexity/flexibility in how
they tweak selection decisions)
4) Placement simply tells the scheduler which providers have enough
capacity for a requested set of resource amounts and required
qualitative traits. It actually is pretty simple. (See the sketch below.)
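
To make point 4 concrete, here is a rough sketch of what such a request might look like from the scheduler's side. The endpoint and "resources" query syntax follow the allocation candidates spec under review; the URL, token, and microversion header are placeholders, and trait filtering on this call is not settled:

    import requests

    resp = requests.get(
        "http://placement.example.com/allocation_candidates",
        params={"resources": "VCPU:2,MEMORY_MB:4096,DISK_GB:100"},
        headers={"X-Auth-Token": "<token>",
                 "OpenStack-API-Version": "placement latest"},
    )
    data = resp.json()
    # data["allocation_requests"]: claimable sets of resources per candidate
    # data["provider_summaries"]: capacity and usage for each involved provider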

Best,
-jay

> Best regards,
> Boris Pavlovic
>
> On Mon, Jun 19, 2017 at 5:05 PM, Edward Leafe <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     On Jun 19, 2017, at 5:27 PM, Jay Pipes <[hidden email]
>     <mailto:[hidden email]>> wrote:
>>
>>>     It was from the straw man example. Replacing the $FOO_UUID with
>>>     UUIDs, and then stripping out all whitespace resulted in about
>>>     1500 bytes. Your example, with whitespace included, is 1600 bytes.
>>
>>     It was the "per compute host" that I objected to.
>
>     I guess it would have helped to see an example of the data returned
>     for multiple compute nodes. The straw man example was for a single
>     compute node with SR-IOV, NUMA and shared storage. There was no
>     indication how multiple hosts meeting the requested resources would
>     be returned.
>
>     -- Ed Leafe
>
>
>
>
>
>

Re: [nova][scheduler][placement] Trying to understand the proposed direction

Ed Leafe-2
In reply to this post by Jay Pipes
On Jun 20, 2017, at 6:54 AM, Jay Pipes <[hidden email]> wrote:

>>> It was the "per compute host" that I objected to.
>> I guess it would have helped to see an example of the data returned for multiple compute nodes. The straw man example was for a single compute node with SR-IOV, NUMA and shared storage. There was no indication how multiple hosts meeting the requested resources would be returned.

> The example I posted used 3 resource providers. 2 compute nodes with no local disk and a shared storage pool.

Now I’m even more confused. In the straw man example (https://review.openstack.org/#/c/471927/) I see only one variable ($COMPUTE_NODE_UUID) referencing a compute node in the response.

-- Ed Leafe







Re: [nova][scheduler][placement] Trying to understand the proposed direction

Jay Pipes
On 06/20/2017 08:43 AM, Edward Leafe wrote:

> On Jun 20, 2017, at 6:54 AM, Jay Pipes <[hidden email]
> <mailto:[hidden email]>> wrote:
>>
>>>> It was the "per compute host" that I objected to.
>>> I guess it would have helped to see an example of the data returned
>>> for multiple compute nodes. The straw man example was for a single
>>> compute node with SR-IOV, NUMA and shared storage. There was no
>>> indication how multiple hosts meeting the requested resources would
>>> be returned.
>>
>> The example I posted used 3 resource providers. 2 compute nodes with
>> no local disk and a shared storage pool.
>
> Now I’m even more confused. In the straw man example
> (https://review.openstack.org/#/c/471927/)
> <https://review.openstack.org/#/c/471927/3/specs/pike/approved/placement-allocation-requests.rst> I see
> only one variable ($COMPUTE_NODE_UUID) referencing a compute node in the
> response.

I'm referring to the example I put in this email thread on
paste.openstack.org with numbers showing 1600 bytes for 3 resource
providers:

http://lists.openstack.org/pipermail/openstack-dev/2017-June/118593.html

Best,
-jay


Re: [nova][scheduler][placement] Trying to understand the proposed direction

Alex Xu-2
In reply to this post by Jay Pipes


2017-06-19 22:17 GMT+08:00 Jay Pipes <[hidden email]>:
On 06/19/2017 09:04 AM, Edward Leafe wrote:
Current flow:
* Scheduler gets a req spec from conductor, containing resource requirements
* Scheduler sends those requirements to placement
* Placement runs a query to determine the root RPs that can satisfy those requirements

Not root RPs. Non-sharing resource providers, which currently effectively means compute node providers. Nested resource providers isn't yet merged, so there is currently no concept of a hierarchy of providers.

* Placement returns a list of the UUIDs for those root providers to scheduler

It returns the provider names and UUIDs, yes.

* Scheduler uses those UUIDs to create HostState objects for each

Kind of. The scheduler calls ComputeNodeList.get_all_by_uuid(), passing in a list of the provider UUIDs it got back from the placement service. The scheduler then builds a set of HostState objects from the results of ComputeNodeList.get_all_by_uuid().

The scheduler also keeps a set of AggregateMetadata objects in memory, including the association of aggregate to host (note: this is the compute node's *service*, not the compute node object itself, thus the reason aggregates don't work properly for Ironic nodes).

* Scheduler runs those HostState objects through filters to remove those that don't meet requirements not selected for by placement

Yep.

* Scheduler runs the remaining HostState objects through weighers to order them in terms of best fit.

Yep.

* Scheduler takes the host at the top of that ranked list, and tries to claim the resources in placement. If that fails, there is a race, so that HostState is discarded, and the next is selected. This is repeated until the claim succeeds.

No, this is not how things work currently. The scheduler does not claim resources. It selects the top (or random host depending on the selection strategy) and sends the launch request to the target compute node. The target compute node then attempts to claim the resources and in doing so writes records to the compute_nodes table in the Nova cell database as well as the Placement API for the compute node resource provider.

* Scheduler then creates a list of N UUIDs, with the first being the selected host, and the rest being alternates consisting of the next hosts in the ranked list that are in the same cell as the selected host.

This isn't currently how things work, no. This has been discussed, however.

* Scheduler returns that list to conductor.
* Conductor determines the cell of the selected host, and sends that list to the target cell.
* Target cell tries to build the instance on the selected host. If it fails, it unclaims the resources for the selected host, and tries to claim the resources for the next host in the list. It then tries to build the instance on the next host in the list of alternates. Only when all alternates fail does the build request fail.

This isn't currently how things work, no. There has been discussion of having the compute node retry alternatives locally, but nothing more than discussion.

Proposed flow:
* Scheduler gets a req spec from conductor, containing resource requirements
* Scheduler sends those requirements to placement
* Placement runs a query to determine the root RPs that can satisfy those requirements

Yes.

* Placement then constructs a data structure for each root provider as documented in the spec. [0]

Yes.

* Placement returns a number of these data structures as JSON blobs. Due to the size of the data, a page size will have to be determined, and placement will have to either maintain that list of structured data for subsequent requests, or re-run the query and only calculate the data structures for the hosts that fit in the requested page.

"of these data structures as JSON blobs" is kind of redundant... all our REST APIs return data structures as JSON blobs.

While we discussed the fact that there may be a lot of entries, we did not say we'd immediately support a paging mechanism.

* Scheduler continues to request the paged results until it has them all.

See above. Was discussed briefly as a concern but not work to do for first patches.

* Scheduler then runs this data through the filters and weighers. No HostState objects are required, as the data structures will contain all the information that scheduler will need.

No, this isn't correct. The scheduler will have *some* of the information it requires for weighing from the returned data from the GET /allocation_candidates call, but not all of it.

Again, operators have insisted on keeping the flexibility currently in the Nova scheduler to weigh/sort compute nodes by things like thermal metrics and kinds of data that the Placement API will never be responsible for.

The scheduler will need to merge information from the "provider_summaries" part of the HTTP response with information it has already in its HostState objects (gotten from ComputeNodeList.get_all_by_uuid() and AggregateMetadataList).

* Scheduler then selects the data structure at the top of the ranked list. Inside that structure is a dict of the allocation data that scheduler will need to claim the resources on the selected host. If the claim fails, the next data structure in the list is chosen, and repeated until a claim succeeds.

Kind of, yes. The scheduler will select a *host* that meets its needs.

There may be more than one allocation request that includes that host resource provider, because of shared providers and (soon) nested providers. The scheduler will choose one of these allocation requests and attempt a claim of resources by simply PUT /allocations/{instance_uuid} with the serialized body of that allocation request. If 202 returned, cool. If not, repeat for the next allocation request.

* Scheduler then creates a list of N of these data structures, with the first being the data for the selected host, and the rest being data structures representing alternates consisting of the next hosts in the ranked list that are in the same cell as the selected host.

Yes, this is the proposed solution for allowing retries within a cell.

Is it possible to use a trait to distinguish different cells? Then the retry could be done in the cell by querying placement directly with a trait which indicates the specific cell.

Those traits would be custom traits, generated from the cell name.
 


* Scheduler returns that list to conductor.
* Conductor determines the cell of the selected host, and sends that list to the target cell.
* Target cell tries to build the instance on the selected host. If it fails, it uses the allocation data in the data structure to unclaim the resources for the selected host, and tries to claim the resources for the next host in the list using its allocation data. It then tries to build the instance on the next host in the list of alternates. Only when all alternates fail does the build request fail.

On the compute node, will we get rid of the allocation update in the periodic task "update_available_resource"? Otherwise, we will have a race between the claim in nova-scheduler and that periodic task.


I'll let Dan discuss this last part.

Best,
-jay


[0] https://review.openstack.org/#/c/471927/






Re: [nova][scheduler][placement] Trying to understand the proposed direction

Ed Leafe-2
In reply to this post by Jay Pipes
On Jun 20, 2017, at 8:38 AM, Jay Pipes <[hidden email]> wrote:

>>> The example I posted used 3 resource providers. 2 compute nodes with no local disk and a shared storage pool.
>> Now I'm even more confused. In the straw man example (https://review.openstack.org/#/c/471927/) <https://review.openstack.org/#/c/471927/3/specs/pike/approved/placement-allocation-requests.rst> I see only one variable ($COMPUTE_NODE_UUID) referencing a compute node in the response.

> I'm referring to the example I put in this email thread on paste.openstack.org with numbers showing 1600 bytes for 3 resource providers:

> http://lists.openstack.org/pipermail/openstack-dev/2017-June/118593.html

And I’m referring to the comment I made on the spec back on June 13 that was never corrected/clarified. I’m glad you gave an example yesterday after I expressed my confusion; that was the whole purpose of starting this thread. Things may be clear to you, but they have confused me and others. We can’t help if we don’t understand.


-- Ed Leafe







Re: [nova][scheduler][placement] Trying to understand the proposed direction

Stephen Finucane
In reply to this post by Matt Riedemann-3
On Mon, 2017-06-19 at 09:36 -0500, Matt Riedemann wrote:

> On 6/19/2017 9:17 AM, Jay Pipes wrote:
> > On 06/19/2017 09:04 AM, Edward Leafe wrote:
> > > Current flow:
>
> As noted in the nova-scheduler meeting this morning, this should have 
> been called "original plan" rather than "current flow", as Jay pointed 
> out inline.
>
> > > * Scheduler gets a req spec from conductor, containing resource 
> > > requirements
> > > * Scheduler sends those requirements to placement
> > > * Placement runs a query to determine the root RPs that can satisfy 
> > > those requirements
> >
> > Not root RPs. Non-sharing resource providers, which currently 
> > effectively means compute node providers. Nested resource providers 
> > isn't yet merged, so there is currently no concept of a hierarchy of 
> > providers.
> >
> > > * Placement returns a list of the UUIDs for those root providers to 
> > > scheduler
> >
> > It returns the provider names and UUIDs, yes.
> >
> > > * Scheduler uses those UUIDs to create HostState objects for each
> >
> > Kind of. The scheduler calls ComputeNodeList.get_all_by_uuid(), passing 
> > in a list of the provider UUIDs it got back from the placement service. 
> > The scheduler then builds a set of HostState objects from the results of 
> > ComputeNodeList.get_all_by_uuid().
> >
> > The scheduler also keeps a set of AggregateMetadata objects in memory, 
> > including the association of aggregate to host (note: this is the 
> > compute node's *service*, not the compute node object itself, thus the 
> > reason aggregates don't work properly for Ironic nodes).
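(Illustration only: a simplified sketch of the lookup described in the two paragraphs above. The class and function below are stand-ins, not nova's actual objects -- the real HostState and ComputeNodeList signatures differ.)

    # Stand-in objects; `compute_nodes` plays the role of the result of
    # ComputeNodeList.get_all_by_uuid(ctxt, provider_uuids).
    class HostState(object):
        def __init__(self, host, nodename, uuid):
            self.host = host
            self.nodename = nodename
            self.uuid = uuid
            self.aggregates = []

    def build_host_states(compute_nodes, aggregates_by_service_host):
        host_states = []
        for cn in compute_nodes:
            hs = HostState(cn['host'], cn['hypervisor_hostname'], cn['uuid'])
            # Aggregates are associated via the compute node's *service* host,
            # which is why the association breaks down for Ironic nodes.
            hs.aggregates = aggregates_by_service_host.get(cn['host'], [])
            host_states.append(hs)
        return host_states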
> >
> > > * Scheduler runs those HostState objects through filters to remove 
> > > those that don't meet requirements not selected for by placement
> >
> > Yep.
> >
> > > * Scheduler runs the remaining HostState objects through weighers to 
> > > order them in terms of best fit.
> >
> > Yep.
> >
> > > * Scheduler takes the host at the top of that ranked list, and tries 
> > > to claim the resources in placement. If that fails, there is a race, 
> > > so that HostState is discarded, and the next is selected. This is 
> > > repeated until the claim succeeds.
> >
> > No, this is not how things work currently. The scheduler does not claim 
> > resources. It selects the top (or random host depending on the selection 
> > strategy) and sends the launch request to the target compute node. The 
> > target compute node then attempts to claim the resources and in doing so 
> > writes records to the compute_nodes table in the Nova cell database as 
> > well as the Placement API for the compute node resource provider.
>
> Not to nit pick, but today the scheduler sends the selected destinations 
> to the conductor. Conductor looks up the cell that a selected host is 
> in, creates the instance record and friends (bdms) in that cell and then 
> sends the build request to the compute host in that cell.
>
> >
> > > * Scheduler then creates a list of N UUIDs, with the first being the 
> > > selected host, and the rest being alternates consisting of the 
> > > next hosts in the ranked list that are in the same cell as the 
> > > selected host.
> >
> > This isn't currently how things work, no. This has been discussed, however.
> >
> > > * Scheduler returns that list to conductor.
> > > * Conductor determines the cell of the selected host, and sends that 
> > > list to the target cell.
> > > * Target cell tries to build the instance on the selected host. If it 
> > > fails, it unclaims the resources for the selected host, and tries to 
> > > claim the resources for the next host in the list. It then tries to 
> > > build the instance on the next host in the list of alternates. Only 
> > > when all alternates fail does the build request fail.
> >
> > This isn't currently how things work, no. There has been discussion of 
> > having the compute node retry alternatives locally, but nothing more 
> > than discussion.
>
> Correct that this isn't how things currently work, but it was/is the 
> original plan. And the retry happens within the cell conductor, not on 
> the compute node itself. The top-level conductor is what's getting 
> selected hosts from the scheduler. The cell-level conductor is what's 
> getting a retry request from the compute. The cell-level conductor would 
> deallocate from placement for the currently claimed providers, and then 
> pick one of the alternatives passed down from the top and then make 
> allocations (a claim) against those, then send to an alternative compute 
> host for another build attempt.
>
> So with this plan, there are two places to make allocations - the 
> scheduler first, and then the cell conductors for retries. This 
> duplication is why some people were originally pushing to move all 
> allocation-related work happen in the conductor service.
>
> > > Proposed flow:
> > > * Scheduler gets a req spec from conductor, containing resource 
> > > requirements
> > > * Scheduler sends those requirements to placement
> > > * Placement runs a query to determine the root RPs that can satisfy 
> > > those requirements
> >
> > Yes.
> >
> > > * Placement then constructs a data structure for each root provider as 
> > > documented in the spec. [0]
> >
> > Yes.
> >
> > > * Placement returns a number of these data structures as JSON blobs. 
> > > Due to the size of the data, a page size will have to be determined, 
> > > and placement will have to either maintain that list of structured 
> > > data for subsequent requests, or re-run the query and only calculate 
> > > the data structures for the hosts that fit in the requested page.
> >
> > "of these data structures as JSON blobs" is kind of redundant... all our 
> > REST APIs return data structures as JSON blobs.
> >
> > While we discussed the fact that there may be a lot of entries, we did 
> > not say we'd immediately support a paging mechanism.
>
> I believe we said in the initial version we'd have the configurable 
> limit in the DB API queries, like we have today - the default limit is 
> 1000. There was agreement to eventually build paging support into the API.
>
> This does make me wonder though what happens when you have 100K or more 
> compute nodes reporting into placement and we limit on the first 1000. 
> Aren't we going to be imposing a packing strategy then just because of 
> how we pull things out of the database for Placement? Although I don't 
> see how that would be any different from before we had Placement and the 
> nova-scheduler service just did a ComputeNode.get_all() to the nova DB 
> and then filtered/weighed those objects.
>
> > > * Scheduler continues to request the paged results until it has them all.
> >
> > See above. Was discussed briefly as a concern but not work to do for 
> > first patches.
> >
> > > * Scheduler then runs this data through the filters and weighers. No 
> > > HostState objects are required, as the data structures will contain 
> > > all the information that scheduler will need.
> >
> > No, this isn't correct. The scheduler will have *some* of the 
> > information it requires for weighing from the returned data from the GET 
> > /allocation_candidates call, but not all of it.
> >
> > Again, operators have insisted on keeping the flexibility currently in 
> > the Nova scheduler to weigh/sort compute nodes by things like thermal 
> > metrics and kinds of data that the Placement API will never be 
> > responsible for.
> >
> > The scheduler will need to merge information from the 
> > "provider_summaries" part of the HTTP response with information it has 
> > already in its HostState objects (gotten from 
> > ComputeNodeList.get_all_by_uuid() and AggregateMetadataList).
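(Illustration only: a small, dict-based sketch of that merge step -- combining placement's provider_summaries with host-local data the scheduler already holds for weighing. The field names are assumptions.)

    def merge_for_weighing(provider_summaries, host_states_by_uuid):
        """Pair each summarized provider with the scheduler's own host data
        (metrics, aggregates, etc.) so the weighers can see both."""
        merged = []
        for rp_uuid, summary in provider_summaries.items():
            host_state = host_states_by_uuid.get(rp_uuid)
            if host_state is None:
                continue  # provider with no matching local host record
            merged.append({
                'host_state': host_state,                # local-only data
                'usages': summary.get('resources', {}),  # capacity/used per class
            })
        return merged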
> >
> > > * Scheduler then selects the data structure at the top of the ranked 
> > > list. Inside that structure is a dict of the allocation data that 
> > > scheduler will need to claim the resources on the selected host. If 
> > > the claim fails, the next data structure in the list is chosen, and 
> > > repeated until a claim succeeds.
> >
> > Kind of, yes. The scheduler will select a *host* that meets its needs.
> >
> > There may be more than one allocation request that includes that host 
> > resource provider, because of shared providers and (soon) nested 
> > providers. The scheduler will choose one of these allocation requests 
> > and attempt a claim of resources by simply PUT 
> > /allocations/{instance_uuid} with the serialized body of that allocation 
> > request. If 202 returned, cool. If not, repeat for the next allocation 
> > request.
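(Illustration only: that claim loop in a few lines. `put_allocations` stands in for the HTTP call just described -- PUT /allocations/{instance_uuid} with the serialized allocation request as the body.)

    def claim_for_host(instance_uuid, allocation_requests, put_allocations):
        # Try each candidate allocation request for the chosen host until
        # placement accepts one; give up if they all fail.
        for alloc_req in allocation_requests:
            resp = put_allocations(instance_uuid, alloc_req)
            if resp.ok:
                return alloc_req  # remember exactly what was claimed
        return None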
> >
> > > * Scheduler then creates a list of N of these data structures, with 
> > > the first being the data for the selected host, and the rest being 
> > > data structures representing alternates consisting of the next hosts 
> > > in the ranked list that are in the same cell as the selected host.
> >
> > Yes, this is the proposed solution for allowing retries within a cell.
> >
> > > * Scheduler returns that list to conductor.
> > > * Conductor determines the cell of the selected host, and sends that 
> > > list to the target cell.
> > > * Target cell tries to build the instance on the selected host. If it 
> > > fails, it uses the allocation data in the data structure to unclaim 
> > > the resources for the selected host, and tries to claim the resources 
> > > for the next host in the list using its allocation data. It then tries 
> > > to build the instance on the next host in the list of alternates. Only 
> > > when all alternates fail does the build request fail.
> >
> > I'll let Dan discuss this last part.
> >
> > Best,
> > -jay
> >
> > > [0] https://review.openstack.org/#/c/471927/

I have a document (with a nifty activity diagram in tow) for all the above
available here:

  https://review.openstack.org/475810 

Should be more Google'able than mailing list posts for future us :)

Stephen


Re: [nova][scheduler][placement] Trying to understand the proposed direction

Jay Pipes
In reply to this post by Alex Xu-2
On 06/20/2017 09:51 AM, Alex Xu wrote:

> 2017-06-19 22:17 GMT+08:00 Jay Pipes <[hidden email]>:
>         * Scheduler then creates a list of N of these data structures,
>         with the first being the data for the selected host, and the
>         rest being data structures representing alternates consisting of
>         the next hosts in the ranked list that are in the same cell as
>         the selected host.
>
>     Yes, this is the proposed solution for allowing retries within a cell.
>
> Is it possible to use a trait to distinguish different cells? Then the
> retry could be done in the cell by querying placement directly with a trait
> that indicates the specific cell.
>
> Those would be custom traits, generated from the cell name.

No, we're not going to use traits in this way, for a couple reasons:

1) Placement doesn't and shouldn't know about Nova's internals. Cells
are internal structures of Nova. Users don't know about them, neither
should placement.

2) Traits describe a resource provider. A cell ID doesn't describe a
resource provider, just like an aggregate ID doesn't describe a resource
provider.

>         * Scheduler returns that list to conductor.
>         * Conductor determines the cell of the selected host, and sends
>         that list to the target cell.
>         * Target cell tries to build the instance on the selected host.
>         If it fails, it uses the allocation data in the data structure
>         to unclaim the resources for the selected host, and tries to
>         claim the resources for the next host in the list using its
>         allocation data. It then tries to build the instance on the next
>         host in the list of alternates. Only when all alternates fail
>         does the build request fail.
>
> In the compute node, will we get rid of the allocation update in the
> periodic task "update_available_resource"? Otherwise, we will have a race
> between the claim in the nova-scheduler and that periodic task.

Yup, good point, and yes, we will be removing the call to PUT
/allocations in the compute node resource tracker. Only DELETE
/allocations/{instance_uuid} will be called if something goes terribly
wrong on instance launch.

Best,
-jay
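(Illustration only: roughly what that compute-side change could look like. The report client method names here are placeholders, not the actual resource tracker code.)

    # After the change, the periodic task keeps inventory in sync but no
    # longer writes allocations -- the scheduler owns the claim.
    def update_available_resource(report_client, compute_node):
        report_client.update_inventory_for_provider(compute_node)
        # NOTE: no PUT /allocations here any more.

    def abort_failed_launch(report_client, instance_uuid):
        # Something went terribly wrong on instance launch: release the claim.
        report_client.delete_allocations_for_instance(instance_uuid)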


Re: [nova][scheduler][placement] Trying to understand the proposed direction

Eric Fried
In reply to this post by Stephen Finucane
Nice Stephen!

For those who aren't aware, the rendered version (pretty, so pretty) can
be accessed via the gate-nova-docs-ubuntu-xenial jenkins job:

http://docs-draft.openstack.org/10/475810/1/check/gate-nova-docs-ubuntu-xenial/25e5173//doc/build/html/scheduling.html?highlight=scheduling

On 06/20/2017 09:09 AM, [hidden email] wrote:
<efried: snip/>

>
> I have a document (with a nifty activity diagram in tow) for all the above
> available here:
>
>   https://review.openstack.org/475810 
>
> Should be more Google'able than mailing list posts for future us :)
>
> Stephen
>
>


Re: [nova][scheduler][placement] Trying to understand the proposed direction

Chris Friesen
On 06/20/2017 09:51 AM, Eric Fried wrote:
> Nice Stephen!
>
> For those who aren't aware, the rendered version (pretty, so pretty) can
> be accessed via the gate-nova-docs-ubuntu-xenial jenkins job:
>
> http://docs-draft.openstack.org/10/475810/1/check/gate-nova-docs-ubuntu-xenial/25e5173//doc/build/html/scheduling.html?highlight=scheduling

Can we teach it to not put line breaks in the middle of words in the text boxes?

Chris


Re: [nova][scheduler][placement] Trying to understand the proposed direction

Stephen Finucane
On Tue, 2017-06-20 at 16:48 -0600, Chris Friesen wrote:

> On 06/20/2017 09:51 AM, Eric Fried wrote:
> > Nice Stephen!
> >
> > For those who aren't aware, the rendered version (pretty, so pretty) can
> > be accessed via the gate-nova-docs-ubuntu-xenial jenkins job:
> >
> > http://docs-draft.openstack.org/10/475810/1/check/gate-nova-docs-ubuntu-xenial/25e5173//doc/build/html/scheduling.html?highlight=scheduling
>
> Can we teach it to not put line breaks in the middle of words in the text
> boxes?

Doesn't seem configurable in its current form :( This, and the defaulting to
PNG output instead of SVG (which makes things ungreppable), are my biggest
bugbears.

I'll go have a look at the sauce and see what can be done about it. If not,
still better than nothing?

Stephen
