[all][qa][glance] some recent tempest problems

[all][qa][glance] some recent tempest problems

Brian Rosmaita-2
This isn't a glance-specific problem, though we've encountered it
quite a few times recently.

Briefly, we're gating on Tempest jobs that Tempest itself does not
gate on.  This leads to a situation where new tests can be merged in
Tempest but wind up breaking our gate.  We aren't claiming that the
added tests are bad or don't provide value; the problem is that we
have to drop everything and fix the gate.  This interrupts our current
work and forces us to prioritize bug fixes based not on what makes the
most sense for the project given current priorities and resources, but
on whatever will get the gates unblocked.

As we said earlier, this situation seems to be impacting multiple projects.

One solution for this is to change our gating so that we do not run
any Tempest jobs against Glance repositories that are not also gated
by Tempest.  That would in theory open a regression path, which is why
we haven't put up a patch yet.  Another way this could be addressed is
by the Tempest team changing the non-voting jobs causing this
situation into voting jobs, which would prevent such changes from
being merged in the first place.  The key issue here is that we need
to be able to prioritize bugs based on what's most important to each
project.

We want to be clear that we appreciate the work the Tempest team does.
We abhor bugs and want to squash them too.  The problem is just that
we're stretched pretty thin with resources right now, and being forced
to prioritize bug fixes that will get our gate unblocked is
interfering with our ability to work on issues that may have a higher
impact on end users.

The point of this email is to find out whether anyone has a better
suggestion for how to handle this situation.

Thanks!

Erno Kuvaja
Glance Release Czar

Brian Rosmaita
Glance PTL


Re: [all][qa][glance] some recent tempest problems

Doug Hellmann-2
Excerpts from Brian Rosmaita's message of 2017-06-15 13:04:39 -0400:

> This isn't a glance-specific problem though we've encountered it quite
> a few times recently.
>
> Briefly, we're gating on Tempest jobs that Tempest itself does not
> gate on.  This leads to a situation where new tests can be merged in
> Tempest but wind up breaking our gate.  We aren't claiming that the
> added tests are bad or don't provide value; the problem is that we
> have to drop everything and fix the gate.  This interrupts our current
> work and forces us to prioritize bug fixes based not on what makes the
> most sense for the project given current priorities and resources, but
> on whatever will get the gates unblocked.

Asymmetric gating definitely has a way of introducing these problems.

Which jobs are involved?

Doug


Re: [all][qa][glance] some recent tempest problems

Sean Dague-2
In reply to this post by Brian Rosmaita-2
On 06/15/2017 01:04 PM, Brian Rosmaita wrote:

> The point of this email is to find out whether anyone has a better
> suggestion for how to handle this situation.

It would be useful to provide detailed examples. Everything is trade
offs, and having the conversation in the abstract is very difficult to
understand those trade offs.

        -Sean

--
Sean Dague
http://dague.net


Re: [all][qa][glance] some recent tempest problems

zhu.fanglei
In reply to this post by Brian Rosmaita-2
https://review.openstack.org/#/c/471352/   may be an example

Re: [all][qa][glance] some recent tempest problems

GHANSHYAM MANN
On Fri, Jun 16, 2017 at 9:43 AM,  <[hidden email]> wrote:
> https://review.openstack.org/#/c/471352/   may be an example

If this is the Ceph-related case, I think we already discussed this
kind of case, where functionality depends on the backend storage, and
how to handle the corresponding test failures [1].

The solution there was that the Ceph job should exclude, by regex, the
test cases whose functionality is not implemented/supported in Ceph.
Jon Bernard is working on this test blacklist [2].
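
To make the mechanism concrete: the blacklist is just a file of test
regexes handed to the test runner, roughly like the sketch below. The
file name and test names are only examples (not taken from Jon's
patch), and the exact wiring inside the job may differ:

    # ceph-blacklist.txt: regexes of tests to skip on the Ceph job
    # (example entries only)
    ^tempest\.api\.compute\.servers\.test_server_actions\..*shelve.*
    ^tempest\.api\.volume\.admin\.test_volumes_backup\..*

    # the job would then invoke something like:
    tempest run --regex '^tempest\.' --blacklist-file=ceph-blacklist.txt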

If there is any other job or case, then we can discuss having that
job run in the Tempest gate as well, which I think we do in most cases.

And about making the Ceph job voting: I remember we did not do that
because of the stability of the job.  The Ceph job fails frequently;
once Jon's patches merge and the job is consistently stable, we can
make it voting.



[1] http://lists.openstack.org/pipermail/openstack-dev/2017-May/116172.html

[2] https://review.openstack.org/#/c/459774/
    https://review.openstack.org/#/c/459445/


-gmann


Re: [all][qa][glance] some recent tempest problems

Sean McGinnis
In reply to this post by Sean Dague-2
>
> It would be useful to provide detailed examples. Everything is trade
> offs, and having the conversation in the abstract is very difficult to
> understand those trade offs.
>
> -Sean
>

We've had this issue in Cinder and os-brick. Usually around Ceph, but if
you follow the user survey, that's the most popular backend.

The problem we see is the tempest test that covers this is non-voting.
And there have been several cases so far where this non-voting job does
not pass, due to a legitimate failure, but the tempest patch merges anyway.


To be fair, these failures usually do point out actual problems that need
to be fixed. Not always, but at least in a few cases. But instead of it
being addressed first to make sure there is no disruption, it's suddenly
a blocking issue that holds up everything until it's either reverted, skipped,
or the problem is resolved.

Here's one recent instance: https://review.openstack.org/#/c/471352/

Sean


Re: [all][qa][glance] some recent tempest problems

Sean Dague-2
On 06/16/2017 09:51 AM, Sean McGinnis wrote:

>>
>> It would be useful to provide detailed examples. Everything is trade
>> offs, and having the conversation in the abstract is very difficult to
>> understand those trade offs.
>>
>> -Sean
>>
>
> We've had this issue in Cinder and os-brick. Usually around Ceph, but if
> you follow the user survey, that's the most popular backend.
>
> The problem we see is the tempest test that covers this is non-voting.
> And there have been several cases so far where this non-voting job does
> not pass, due to a legitimate failure, but the tempest patch merges anyway.
>
>
> To be fair, these failures usually do point out actual problems that need
> to be fixed. Not always, but at least in a few cases. But instead of it
> being addressed first to make sure there is no disruption, it's suddenly
> a blocking issue that holds up everything until it's either reverted, skipped,
> or the problem is resolved.
>
> Here's one recent instance: https://review.openstack.org/#/c/471352/

Sure, if ceph is the primary concern, that feels like it should be a
reasonable specific thing to fix. It's not a grand issue, it's a
specific mismatch on what configs should be common.

        -Sean

--
Sean Dague
http://dague.net


Re: [all][qa][glance] some recent tempest problems

Eric Harney
In reply to this post by GHANSHYAM MANN
On 06/15/2017 10:51 PM, Ghanshyam Mann wrote:

> On Fri, Jun 16, 2017 at 9:43 AM,  <[hidden email]> wrote:
>> https://review.openstack.org/#/c/471352/   may be an example
>
> If this is the Ceph-related case, I think we already discussed this
> kind of case, where functionality depends on the backend storage, and
> how to handle the corresponding test failures [1].
>
> The solution there was that the Ceph job should exclude, by regex, the
> test cases whose functionality is not implemented/supported in Ceph.
> Jon Bernard is working on this test blacklist [2].
>
> If there is any other job or case, then we can discuss having that
> job run in the Tempest gate as well, which I think we do in most cases.
>
> And about making the Ceph job voting: I remember we did not do that
> because of the stability of the job.  The Ceph job fails frequently;
> once Jon's patches merge and the job is consistently stable, we can
> make it voting.
>

I'm not convinced yet that this failure is purely Ceph-specific, at a
quick look.

I think what happens here is, unshelve performs an asynchronous delete
of a glance image, and returns as successful before the delete has
necessarily completed.  The check in tempest then sees that the image
still exists, and fails -- but this isn't valid, because the unshelve
API doesn't guarantee that this image is no longer there at the time it
returns.  This would fail on any image delete that isn't instantaneous.

Is there a guarantee anywhere that the unshelve API behaves how this
tempest test expects it to?



Re: [all][qa][glance] some recent tempest problems

GHANSHYAM MANN
In reply to this post by Sean Dague-2
On Fri, Jun 16, 2017 at 10:57 PM, Sean Dague <[hidden email]> wrote:

> On 06/16/2017 09:51 AM, Sean McGinnis wrote:
>>>
>>> It would be useful to provide detailed examples. Everything is trade
>>> offs, and having the conversation in the abstract is very difficult to
>>> understand those trade offs.
>>>
>>>      -Sean
>>>
>>
>> We've had this issue in Cinder and os-brick. Usually around Ceph, but if
>> you follow the user survey, that's the most popular backend.
>>
>> The problem we see is the tempest test that covers this is non-voting.
>> And there have been several cases so far where this non-voting job does
>> not pass, due to a legitimate failure, but the tempest patch merges anyway.
>>
>>
>> To be fair, these failures usually do point out actual problems that need
>> to be fixed. Not always, but at least in a few cases. But instead of it
>> being addressed first to make sure there is no disruption, it's suddenly
>> a blocking issue that holds up everything until it's either reverted, skipped,
>> or the problem is resolved.
>>
>> Here's one recent instance: https://review.openstack.org/#/c/471352/
>
> Sure, if ceph is the primary concern, that feels like it should be a
> reasonable specific thing to fix. It's not a grand issue, it's a
> specific mismatch on what configs should be common.

Yes, we had such cases and decided to have a blacklist of tests not
suitable for Ceph; the Ceph job will exclude the tests failing on Ceph.
Jon is working on this: https://review.openstack.org/#/c/459774/

This approach solves the problem without limiting the test scope. [1]

[1] http://lists.openstack.org/pipermail/openstack-dev/2017-May/116172.html

-gmann

>
>         -Sean
>
> --
> Sean Dague
> http://dague.net
>

Re: [all][qa][glance] some recent tempest problems

Doug Hellmann-2
Excerpts from Ghanshyam Mann's message of 2017-06-16 23:05:08 +0900:

> On Fri, Jun 16, 2017 at 10:57 PM, Sean Dague <[hidden email]> wrote:
> > On 06/16/2017 09:51 AM, Sean McGinnis wrote:
> >>>
> >>> It would be useful to provide detailed examples. Everything is trade
> >>> offs, and having the conversation in the abstract is very difficult to
> >>> understand those trade offs.
> >>>
> >>>      -Sean
> >>>
> >>
> >> We've had this issue in Cinder and os-brick. Usually around Ceph, but if
> >> you follow the user survey, that's the most popular backend.
> >>
> >> The problem we see is the tempest test that covers this is non-voting.
> >> And there have been several cases so far where this non-voting job does
> >> not pass, due to a legitimate failure, but the tempest patch merges anyway.
> >>
> >>
> >> To be fair, these failures usually do point out actual problems that need
> >> to be fixed. Not always, but at least in a few cases. But instead of it
> >> being addressed first to make sure there is no disruption, it's suddenly
> >> a blocking issue that holds up everything until it's either reverted, skipped,
> >> or the problem is resolved.
> >>
> >> Here's one recent instance: https://review.openstack.org/#/c/471352/
> >
> > Sure, if ceph is the primary concern, that feels like it should be a
> > reasonable specific thing to fix. It's not a grand issue, it's a
> > specific mismatch on what configs should be common.
>
> Yes, we had such cases and decided to have a blacklist of tests not
> suitable for Ceph; the Ceph job will exclude the tests failing on Ceph.
> Jon is working on this: https://review.openstack.org/#/c/459774/
>
> This approach solves the problem without limiting the test scope. [1]
>
> [1] http://lists.openstack.org/pipermail/openstack-dev/2017-May/116172.html
>
> -gmann

Is ceph behaving in an unexpected way, or are the tests making
implicit assumptions that might also cause trouble for other backends
if these tests ever make it into the suite used by the interop team?

Doug


Re: [all][qa][glance] some recent tempest problems

Sean McGinnis
In reply to this post by GHANSHYAM MANN
>
> Yes, we had such cases and decided to have a blacklist of tests not
> suitable for Ceph; the Ceph job will exclude the tests failing on Ceph.
> Jon is working on this: https://review.openstack.org/#/c/459774/
>

I don't think merging tests that are showing failures, then blacklisting
them, is the right approach. And as Eric points out, this isn't
necessarily just a failure with Ceph. There is a legitimate logical
issue with what this particular test is doing.

But in general, to get back to some of the earlier points, I don't think
we should be merging tests with known breakages until those breakages
can be first addressed.


Re: [all][qa][glance] some recent tempest problems

Eric Harney
On 06/16/2017 10:21 AM, Sean McGinnis wrote:

>
> I don't think merging tests that are showing failures, then blacklisting
> them, is the right approach. And as Eric points out, this isn't
> necessarily just a failure with Ceph. There is a legitimate logical
> issue with what this particular test is doing.
>
> But in general, to get back to some of the earlier points, I don't think
> we should be merging tests with known breakages until those breakages
> can be first addressed.
>

As another example, this was the last round of this, in May:

https://review.openstack.org/#/c/332670/

which added a new Tempest test for a Cinder API that is not supported
by all drivers.  The Ceph job correctly failed on the Tempest patch,
the test was merged anyway, and then the Ceph jobs broke:

https://bugs.launchpad.net/glance/+bug/1687538
https://review.openstack.org/#/c/461625/

This is really not a sustainable model.

And this is the _easy_ case, since Ceph jobs run in OpenStack infra and
are easily visible and trackable.  I'm not sure what the impact is on
Cinder third-party CI for other drivers.


Re: [all][qa][glance] some recent tempest problems

Sean Dague-2
In reply to this post by Sean McGinnis
On 06/16/2017 09:51 AM, Sean McGinnis wrote:

>>
>> It would be useful to provide detailed examples. Everything is trade
>> offs, and having the conversation in the abstract is very difficult to
>> understand those trade offs.
>>
>> -Sean
>>
>
> We've had this issue in Cinder and os-brick. Usually around Ceph, but if
> you follow the user survey, that's the most popular backend.
>
> The problem we see is the tempest test that covers this is non-voting.
> And there have been several cases so far where this non-voting job does
> not pass, due to a legitimate failure, but the tempest patch merges anyway.
>
>
> To be fair, these failures usually do point out actual problems that need
> to be fixed. Not always, but at least in a few cases. But instead of it
> being addressed first to make sure there is no disruption, it's suddenly
> a blocking issue that holds up everything until it's either reverted, skipped,
> or the problem is resolved.
>
> Here's one recent instance: https://review.openstack.org/#/c/471352/

So, before we go further, ceph seems to be -nv on all projects right
now, right? So I get there is some debate on that patch, but is it
blocking anything?

Again, we seem to be missing specifics and a set of events here;
lacking those, everyone is trying to guess what the problems are, which
I don't think is effective.

        -Sean

--
Sean Dague
http://dague.net


Re: [all][qa][glance] some recent tempest problems

Sean Dague-2
In reply to this post by Eric Harney
On 06/16/2017 10:46 AM, Eric Harney wrote:

> On 06/16/2017 10:21 AM, Sean McGinnis wrote:
>>
>> I don't think merging tests that are showing failures, then blacklisting
>> them, is the right approach. And as Eric points out, this isn't
>> necessarily just a failure with Ceph. There is a legitimate logical
>> issue with what this particular test is doing.
>>
>> But in general, to get back to some of the earlier points, I don't think
>> we should be merging tests with known breakages until those breakages
>> can be first addressed.
>>
>
> As another example, this was the last round of this, in May:
>
> https://review.openstack.org/#/c/332670/
>
> which added a new Tempest test for a Cinder API that is not supported
> by all drivers.  The Ceph job correctly failed on the Tempest patch,
> the test was merged anyway, and then the Ceph jobs broke:
>
> https://bugs.launchpad.net/glance/+bug/1687538
> https://review.openstack.org/#/c/461625/
>
> This is really not a sustainable model.
>
> And this is the _easy_ case, since Ceph jobs run in OpenStack infra and
> are easily visible and trackable.  I'm not sure what the impact is on
> Cinder third-party CI for other drivers.

Ah, so the issue is that
gate-tempest-dsvm-full-ceph-plugin-src-glance_store-ubuntu-xenial is
Voting, because when the regex was made to stop ceph jobs from voting
(which they aren't on Nova, Tempest, Glance, or Cinder), it wasn't
applied there.

It's also a question of why a library is testing different backends
through full-stack testing instead of something more targeted and
controlled, which I think is probably also less than ideal.

Both would be good things to fix.

        -Sean

--
Sean Dague
http://dague.net


Re: [all][qa][glance] some recent tempest problems

Sean McGinnis
In reply to this post by Sean Dague-2
>
> So, before we go further, ceph seems to be -nv on all projects right
> now, right? So I get there is some debate on that patch, but is it
> blocking anything?
>

Ceph is voting on os-brick patches. So it does block some things when
we run into this situation.

But again, we should avoid getting into this situation in the first
place, voting or no.



Re: [all][qa][glance] some recent tempest problems

Matt Riedemann-3
On 6/16/2017 3:32 PM, Sean McGinnis wrote:

>>
>> So, before we go further, ceph seems to be -nv on all projects right
>> now, right? So I get there is some debate on that patch, but is it
>> blocking anything?
>>
>
> Ceph is voting on os-brick patches. So it does block some things when
> we run into this situation.
>
> But again, we should avoid getting into this situation in the first
> place, voting or no.

Yeah there is a distinction between the ceph nv job that runs on
nova/cinder/glance changes and the ceph job that runs on os-brick and
glance_store changes. When we made the tempest dsvm ceph job non-voting
we failed to mirror that in the os-brick/glance-store jobs. We should do
that.
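
For reference, making that job non-voting should just be a small
project-config change, roughly along these lines in the Zuul layout
(a sketch only, not necessarily how the actual patch will look):

    # openstack-infra/project-config, zuul/layout.yaml (illustrative)
    jobs:
      - name: gate-tempest-dsvm-full-ceph-plugin-src-glance_store-ubuntu-xenial
        voting: false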

--

Thanks,

Matt


Re: [all][qa][glance] some recent tempest problems

Matt Riedemann-3
On 6/16/2017 8:13 PM, Matt Riedemann wrote:
> Yeah there is a distinction between the ceph nv job that runs on
> nova/cinder/glance changes and the ceph job that runs on os-brick and
> glance_store changes. When we made the tempest dsvm ceph job non-voting
> we failed to mirror that in the os-brick/glance-store jobs. We should do
> that.

Here you go:

https://review.openstack.org/#/c/475095/

--

Thanks,

Matt


Re: [all][qa][glance] some recent tempest problems

Matt Riedemann-3
In reply to this post by Eric Harney
On 6/16/2017 9:46 AM, Eric Harney wrote:

> On 06/16/2017 10:21 AM, Sean McGinnis wrote:
>>
>> I don't think merging tests that are showing failures, then blacklisting
>> them, is the right approach. And as Eric points out, this isn't
>> necessarily just a failure with Ceph. There is a legitimate logical
>> issue with what this particular test is doing.
>>
>> But in general, to get back to some of the earlier points, I don't think
>> we should be merging tests with known breakages until those breakages
>> can be first addressed.
>>
>
> As another example, this was the last round of this, in May:
>
> https://review.openstack.org/#/c/332670/
>
> which added a new Tempest test for a Cinder API that is not supported
> by all drivers.  The Ceph job correctly failed on the Tempest patch,
> the test was merged anyway, and then the Ceph jobs broke:
>
> https://bugs.launchpad.net/glance/+bug/1687538
> https://review.openstack.org/#/c/461625/
>
> This is really not a sustainable model.
>
> And this is the _easy_ case, since Ceph jobs run in OpenStack infra and
> are easily visible and trackable.  I'm not sure what the impact is on
> Cinder third-party CI for other drivers.

This is generally why we have config options in Tempest to not run tests
that certain backends don't implement, like all of the backup/snapshot
volume tests that the NFS job was failing on forever.

I think it's perfectly valid to have tests in Tempest for things that
not all backends implement as long as they are configurable. It's up to
the various CI jobs to configure Tempest properly for what they support
and then work on reducing the number of things they don't support. We've
been doing that for ages now.
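
For example, a job whose backend doesn't support backups or snapshots
can turn the corresponding tests off in its tempest.conf with something
along these lines (from memory, so double-check the exact option names):

    [volume-feature-enabled]
    # skip the volume backup tests on this backend
    backup = False
    # skip the volume snapshot tests on this backend
    snapshot = False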

--

Thanks,

Matt


Re: [all][qa][glance] some recent tempest problems

Matt Riedemann-3
In reply to this post by Eric Harney
On 6/16/2017 8:58 AM, Eric Harney wrote:

> I'm not convinced yet that this failure is purely Ceph-specific, at a
> quick look.
>
> I think what happens here is, unshelve performs an asynchronous delete
> of a glance image, and returns as successful before the delete has
> necessarily completed.  The check in tempest then sees that the image
> still exists, and fails -- but this isn't valid, because the unshelve
> API doesn't guarantee that this image is no longer there at the time it
> returns.  This would fail on any image delete that isn't instantaneous.
>
> Is there a guarantee anywhere that the unshelve API behaves how this
> tempest test expects it to?

There are no guarantees, no. The unshelve API reference is here [1]. The
asynchronous postconditions section just says:

"After you successfully shelve a server, its status changes to ACTIVE.
The server appears on the compute node.

The shelved image is deleted from the list of images returned by an API
call."

It doesn't say the image is deleted immediately, or that it waits for
the image to be gone before changing the instance status to ACTIVE.

I see there is also a typo in there, that should say after you
successfully *unshelve* a server.

 From an API user point of view, this is all asynchronous because it's
an RPC cast from the nova-api service to the nova-conductor and finally
nova-compute service when unshelving the instance.

So I think the test is making some wrong assumptions on how fast the
image is going to be deleted when the instance is active.

As Ken'ichi pointed out in the Tempest change, Glance returns a 204 when
deleting an image in the v2 API [2]. If the image delete is asynchronous
then that should probably be a 202.

Either way the Tempest test should probably be in a wait loop for the
image to be gone if it's really going to assert this.

[1]
https://developer.openstack.org/api-ref/compute/?expanded=unshelve-restore-shelved-server-unshelve-action-detail#unshelve-restore-shelved-server-unshelve-action
[2]
https://developer.openstack.org/api-ref/image/v2/index.html?expanded=delete-an-image-detail#delete-an-image
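
To make the wait-loop suggestion concrete, something like the sketch
below is what I mean. This is illustrative only, not the actual test
code, and the client/method names are approximations of what
tempest.lib provides:

    import time

    from tempest.lib import exceptions as lib_exc


    def wait_for_image_deleted(images_client, image_id,
                               timeout=60, interval=2):
        """Poll until the shelved image is really gone, rather than
        asserting immediately after the unshelve call returns.
        """
        start = time.time()
        while time.time() - start < timeout:
            try:
                images_client.show_image(image_id)
            except lib_exc.NotFound:
                return  # image is gone, the test can assert safely
            time.sleep(interval)
        raise lib_exc.TimeoutException(
            'image %s still exists after unshelve' % image_id)

The point is just that the check needs to tolerate an asynchronous
delete rather than asserting exactly once.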

--

Thanks,

Matt
