[OCTAVIA][QUEENS][KOLLA] - Amphora to Health-manager invalid UDP heartbeat.


[OCTAVIA][QUEENS][KOLLA] - Amphora to Health-manager invalid UDP heartbeat.

Gaël THEROND
Hi guys,

I'm finishing up work on my POC for Octavia, and after solving a few configuration issues I'm close to a properly working setup.
However, I'm facing a small but annoying bug: the health-manager considers the amphora heartbeat UDP packets invalid and drops them.

Here are the messages that can be found in logs:

2018-10-23 13:53:21.844 25 WARNING octavia.amphorae.backends.health_daemon.status_message [-] calculated hmac: faf73e41a0f843b826ee581c3995b7f7e56b5e5a294fca0b84eda426766f8415 not equal to msg hmac: 6137613337316432636365393832376431343337306537353066626130653261 dropping packet

Which comes from this part of the HM code:

https://docs.openstack.org/octavia/pike/_modules/octavia/amphorae/backends/health_daemon/status_message.html#get_payload


The annoying thing is that I don't understand why the UDP packet is considered stale, nor how I can reproduce the payload that is sent to the HealthManager.
I'm willing to write a simple Python program to simulate the heartbeat payload, but I don't know exactly what the message contains, so I think I'm missing some information.

Both the HealthManager and the amphora use the same heartbeat_key, and they can reach each other on the network, as the initial HealthManager-to-amphora connection on port 9443 is validated.

As a result, my load balancer is stuck in the PENDING_UPDATE state.

Do you have any idea how I can handle this, or has anyone else already seen this issue?

Kind regards,
G.

_______________________________________________
OpenStack-operators mailing list
[hidden email]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
Re: [OCTAVIA][QUEENS][KOLLA] - Amphora to Health-manager invalid UDP heartbeat.

Michael Johnson
Are the controller and the amphora using the same version of Octavia?

We had a Python 3 issue where we had to change the HMAC digest used. If
your controller is running an older version of Octavia than your
amphora images, it may not have the compatibility code to support the
new format. The compatibility code is here:
https://github.com/openstack/octavia/blob/master/octavia/amphorae/backends/health_daemon/status_message.py#L56

There is also a release note about the issue here:
https://docs.openstack.org/releasenotes/octavia/rocky.html#upgrade-notes
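
As a quick sanity check, the "msg hmac" in your warning decodes to printable ASCII hex, which is the fingerprint of that digest change. A short sketch (the key, payload, and JSON fields below are placeholders, not the real heartbeat schema):

```python
import binascii
import hashlib
import hmac
import json
import zlib

# The "msg hmac" from the warning above, exactly as the controller logged it
# (the receiver hex-encodes the trailing bytes it reads from the packet):
logged = "6137613337316432636365393832376431343337306537353066626130653261"

# Hex-decoding it yields printable ASCII hex characters, not random binary.
# That suggests the amphora appended hexdigest() text (the new format) while
# the older controller compared against the raw digest() bytes.
print(binascii.unhexlify(logged))  # b'a7a371d2cce9827d14370e750fba0e2a'

# Rough shape of the heartbeat envelope (illustrative, not the exact schema):
# a JSON status message, zlib-compressed, with an HMAC-SHA256 appended.
key = b"example-heartbeat-key"  # placeholder for your heartbeat_key
body = zlib.compress(json.dumps({"id": "amphora-uuid"}).encode("utf-8"))

old_format = body + hmac.new(key, body, hashlib.sha256).digest()              # +32 raw bytes
new_format = body + hmac.new(key, body, hashlib.sha256).hexdigest().encode()  # +64 ASCII bytes
print(len(new_format) - len(old_format))  # 32
```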

If that is not the issue, I would double-check the heartbeat_key in
the health manager configuration files and inside one of the amphorae.
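
For reference, the key lives under the [health_manager] section; a minimal fragment (the key value here is only a placeholder) looks like:

```ini
[health_manager]
# Must be identical in the controller's octavia.conf and in the
# amphora's agent configuration; any mismatch drops every heartbeat.
heartbeat_key = insecure-example-key
```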

Note that this key is only used for health heartbeats and stats; it is
not used for controller-to-amphora communication on port 9443.

Also, load balancers cannot get "stuck" in PENDING_* states unless
someone has killed the controller process that was actively working on
that load balancer. By "killed" I mean a non-graceful shutdown of the
process while it was in the middle of working on the load balancer.
Otherwise, all code paths lead back to ACTIVE or ERROR status once the
controller finishes the work or gives up retrying the requested
action. Check your controller logs to make sure this load balancer is
not still being worked on by one of the controllers. The default retry
timeouts are very long (some are up to 25 minutes) so that the
controller keeps trying to accomplish the request on very slow hosts
(e.g. VirtualBox) and in the test gates. You will want to tune those
down for a production deployment.
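
For example, the amphora connection retries that drive that 25-minute worst case live under [haproxy_amphora]; a production deployment might lower them along these lines (the values below are illustrative):

```ini
[haproxy_amphora]
# The defaults (300 retries x 5 seconds = 25 minutes) are sized for
# slow test hosts; something much shorter is usually fine in production.
connection_max_retries = 120
connection_retry_interval = 5
```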

Michael


Re: [OCTAVIA][QUEENS][KOLLA] - Amphora to Health-manager invalid UDP heartbeat.

Gaël THEROND
Hi Michael,

Thanks a lot for all those details about the transitions between states. Indeed, as you said, my LB moved from PENDING_UPDATE to ACTIVE, but it still showed an OFFLINE status this morning, as the HM was still receiving and dropping the UDP packets.

When I mentioned the HealthManager reaching the amphora on port 9443, I of course didn't mean that this connection uses the heartbeat key.


I just had a look at my amphora and Octavia control plane (CP) versions, and they were a little out of sync: my amphora agent was %prog 3.0.0.0b4.dev6, while my Octavia CP services were %prog 2.0.1.

I updated the CP to stable/rocky this morning, which brought it to %prog 3.0.1.
I'll check whether I still encounter the issue, but for now it seems to have vanished, as I get the following messages:

2018-10-24 11:58:54.620 24 DEBUG futurist.periodics [-] Submitting periodic callback 'octavia.cmd.health_manager.periodic_health_check' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:639
2018-10-24 11:58:57.620 24 DEBUG futurist.periodics [-] Submitting periodic callback 'octavia.cmd.health_manager.periodic_health_check' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:639
2018-10-24 11:59:00.620 24 DEBUG futurist.periodics [-] Submitting periodic callback 'octavia.cmd.health_manager.periodic_health_check' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:639
2018-10-24 11:59:03.620 24 DEBUG futurist.periodics [-] Submitting periodic callback 'octavia.cmd.health_manager.periodic_health_check' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:639
2018-10-24 11:59:04.557 23 DEBUG octavia.amphorae.drivers.health.heartbeat_udp [-] Received packet from ('172.27.201.105', 48342) dorecv /usr/lib/python2.7/site-packages/octavia/amphorae/drivers/health/heartbeat_udp.py:187
2018-10-24 11:59:04.619 45 DEBUG octavia.controller.healthmanager.health_drivers.update_db [-] Health Update finished in: 0.0600640773773 seconds update_health /usr/lib/python2.7/site-packages/octavia/controller/healthmanager/health_drivers/update_db.py:93

I'll keep you posted as my investigation continues, but so far the issue seems to be resolved. I'll also tune down the timeouts a bit, as my LB takes a very long time to create listeners/pools and come to an online status.

Thanks a lot!
