mitaka/xenial libvirt issues


mitaka/xenial libvirt issues

Joe Topjian-2
Hi all,

We're seeing some strange libvirt issues in an Ubuntu 16.04 environment. It's running Mitaka, but I don't think this is a problem with OpenStack itself.

We're in the process of upgrading this environment from Ubuntu 14.04 with the Mitaka cloud archive to 16.04. Instances are being live migrated (NFS share) to a new 16.04 compute node (fresh install), so there's a change between libvirt versions (1.2.2 to 1.3.1). The problem we're seeing is only happening on the 16.04/1.3.1 nodes.

We're getting occasional reports of instances not able to be snapshotted. Upon investigation, the snapshot process quits early with a libvirt/qemu lock timeout error. We then see that the instance's xml file has disappeared from /etc/libvirt/qemu and must restart libvirt and hard-reboot the instance to get things back to a normal state. Trying to live-migrate the instance to another node causes the same thing to happen.

However, at some random time, either the snapshot or the migration will work without error. I haven't been able to reproduce this issue on my own and haven't been able to figure out the root cause by inspecting instances reported to me.

One thing that has stood out is the length of time it takes for libvirt to start. If I run "/etc/init.d/libvirt-bin start", it takes at least 5 minutes before a simple "virsh list" will work. The command will hang otherwise. If I increase libvirt's logging level, I can see that during this period of time, libvirt is working on iptables and ebtables (looks like it's shelling out commands).

But if I run "libvirtd -l" straight on the command line, all of this completes within 5 seconds (including all of the shelling out).

My initial thought is that systemd is doing some type of throttling between the system and user slice, but I've tried comparing slice attributes and, probably due to my lack of understanding of systemd, can't find anything to prove this.
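In case anyone wants to repeat the comparison, the per-unit settings can be dumped and diffed; this is only a sketch (libvirt-bin.service matches Xenial's packaging, and the property list is an arbitrary sample, not exhaustive):

```shell
# Dump resource-control and rlimit settings for the service versus the
# slices it could inherit from; redirect each to a file and diff by hand.
for unit in libvirt-bin.service system.slice user.slice; do
  echo "== $unit =="
  # Fall back gracefully on hosts where systemd is not running
  systemctl show "$unit" \
    -p CPUShares,CPUQuotaPerSecUSec,TasksMax,LimitNPROC,LimitNOFILE \
    2>/dev/null || echo "(systemctl unavailable)"
done
```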

Is anyone else running into this problem? Does anyone know what might be the cause?

Thanks,
Joe

_______________________________________________
OpenStack-operators mailing list
[hidden email]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

Re: mitaka/xenial libvirt issues

Chris Sarginson
We hit the same issue a while back (I suspect), which we seemed to resolve by pinning QEMU and related packages at the following version (you might need to hunt down the debs manually):

1:2.5+dfsg-5ubuntu10.5
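If it helps, one way to hold packages at that version is an apt preferences pin; this is a sketch (the package list below is illustrative — pin whichever qemu binary packages you actually have installed):

```
# /etc/apt/preferences.d/qemu-pin
Package: qemu-kvm qemu-utils qemu-system-x86 qemu-system-common qemu-block-extra
Pin: version 1:2.5+dfsg-5ubuntu10.5
Pin-Priority: 1001
```

A priority above 1000 lets apt downgrade to the pinned version; "apt-cache policy qemu-kvm" will confirm the pin took effect.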

I'm certain there's a launchpad bug for Ubuntu qemu regarding this, but don't have it to hand.

Hope this helps,
Chris


Re: mitaka/xenial libvirt issues

Joe Topjian-2
Hi Chris,

Thanks - we will definitely look into this. To confirm: did you downgrade libvirt as well, or was it all qemu?

Thanks,
Joe


Re: mitaka/xenial libvirt issues

Chris Sarginson
I think we may have pinned libvirt-bin as well (1.3.1), but I can't guarantee that, sorry - I would suggest it's worth trying pinning both initially.

Chris


Re: mitaka/xenial libvirt issues

Joe Topjian-2
OK, thanks. We'll definitely look at downgrading in a test environment.

To add some further info to this problem, here are some log entries. When an instance fails to snapshot or fails to migrate, we see:

libvirtd[27939]: Cannot start job (modify, none) for domain instance-00004fe4; current job is (modify, none) owned by (27942 remoteDispatchDomainBlockJobAbort, 0 <null>) for (69116s, 0s)

libvirtd[27939]: Cannot start job (none, migration out) for domain instance-00004fe4; current job is (modify, none) owned by (27942 remoteDispatchDomainBlockJobAbort, 0 <null>) for (69361s, 0s)
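When an instance is wedged like this, virsh can sometimes show the job libvirt believes is still running; a sketch using the domain from the logs above (it may or may not clear the state without the libvirt restart):

```shell
# Show the block/migration job libvirt thinks is still active on the domain
virsh domjobinfo instance-00004fe4 || true
# Try aborting the stale job before resorting to a libvirtd restart + hard reboot
virsh domjobabort instance-00004fe4 || true
```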


The one piece of this that I'm currently fixated on is the length of time it takes libvirt to start. I'm not sure if it's causing the above, though. When starting libvirt through systemd, it takes much longer to process the iptables and ebtables rules than if we start libvirtd on the command-line directly.

virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -L libvirt-J-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -L libvirt-P-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -F libvirt-J-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -X libvirt-J-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -F libvirt-P-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -X libvirt-P-vnet5'

We're talking about a difference between 5 minutes and 5 seconds depending on where libvirt was started. This doesn't seem normal to me. 

In general, is anyone aware of systemd performing restrictions of some kind on processes which create subprocesses? Or something like that? I've tried comparing cgroups and the various limits within systemd between my shell session and the libvirt-bin.service session and can't find anything immediately noticeable. Maybe it's apparmor?

Thanks,
Joe


Re: mitaka/xenial libvirt issues

Sean Redmond
Hi,

I think it may be related to this:

https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1647389
Thanks


Re: mitaka/xenial libvirt issues

Tobias Urdin

Hello,

That bug seems to assume tunnelled migrations; the live_migration_flag option is removed in later releases but is still present in Mitaka.

Do you have the VIR_MIGRATE_TUNNELLED flag set for [libvirt]live_migration_flag in nova.conf?


It might be a long shot, but I've removed VIR_MIGRATE_TUNNELLED in our clouds.
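For anyone checking their own deployment, the option lives in nova.conf on the compute nodes; a sketch with the tunnelled flag removed (the remaining flags shown are the common Mitaka-era defaults — keep whatever your config already has):

```
[libvirt]
# Drop VIR_MIGRATE_TUNNELLED from the list to use native (non-tunnelled)
# migration; restart nova-compute afterwards.
live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE
```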

Best regards



Re: mitaka/xenial libvirt issues

Joe Topjian-2
Hi all,

To my knowledge, we don't use tunneled migrations. This issue is also happening with snapshots, so it's not restricted to just migrations.

I haven't yet tried the apparmor patches that George mentioned. I plan on applying them once I get another report of a problematic instance.

Thank you for the suggestions, though :)
Joe

On Mon, Nov 27, 2017 at 2:10 AM, Tobias Urdin <[hidden email]> wrote:

Hello,

The seems to assume tunnelled migrations, the live_migration_flag is removed in later version but is there in Mitaka.

Do you have the VIR_MIGRATE_TUNNELLED flag set for [libvirt]live_migration_flag in nova.conf?


Might be a long shot, but I've removed VIR_MIGRATE_TUNNELLED in our clouds

Best regards


On 11/26/2017 01:01 PM, Sean Redmond wrote:
Hi,

I think it maybe related to this:


Thanks

On Thu, Nov 23, 2017 at 6:20 PM, Joe Topjian <[hidden email]> wrote:
OK, thanks. We'll definitely look at downgrading in a test environment.

To add some further info to this problem, here are some log entries. When an instance fails to snapshot or fails to migrate, we see:

libvirtd[27939]: Cannot start job (modify, none) for domain instance-00004fe4; current job is (modify, none) owned by (27942 remoteDispatchDomainBlockJobAbort, 0 <null>) for (69116s, 0s)

libvirtd[27939]: Cannot start job (none, migration out) for domain instance-00004fe4; current job is (modify, none) owned by (27942 remoteDispatchDomainBlockJobAbort, 0 <null>) for (69361s, 0s)


The one piece of this that I'm currently fixated on is the length of time it takes libvirt to start. I'm not sure if it's causing the above, though. When starting libvirt through systemd, it takes much longer to process the iptables and ebtables rules than if we start libvirtd on the command-line directly.

virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -L libvirt-J-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -L libvirt-P-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -F libvirt-J-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -X libvirt-J-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -F libvirt-P-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -X libvirt-P-vnet5'

We're talking about a difference between 5 minutes and 5 seconds depending on where libvirt was started. This doesn't seem normal to me. 

In general, is anyone aware of systemd performing restrictions of some kind on processes which create subprocesses? Or something like that? I've tried comparing cgroups and the various limits within systemd between my shell session and the libvirt-bin.service session and can't find anything immediately noticeable. Maybe it's apparmor?

Thanks,
Joe

On Thu, Nov 23, 2017 at 11:03 AM, Chris Sarginson <[hidden email]> wrote:
I think we may have pinned libvirt-bin as well, (1.3.1), but I can't guarantee that, sorry - I would suggest its worth trying pinning both initially.

Chris

On Thu, 23 Nov 2017 at 17:42 Joe Topjian <[hidden email]> wrote:
Hi Chris,

Thanks - we will definitely look into this. To confirm: did you downgrade libvirt as well, or was it only qemu?

Thanks,
Joe

On Thu, Nov 23, 2017 at 9:16 AM, Chris Sarginson <[hidden email]> wrote:
We hit the same issue a while back (I suspect), which we seemed to resolve by pinning QEMU and related packages at the following version (you might need to hunt down the debs manually):

1:2.5+dfsg-5ubuntu10.5
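Something along these lines in an apt preferences file should hold them there (the file name and package list here are illustrative guesses, not copied from our hosts - double-check against your archive):

```
# /etc/apt/preferences.d/pin-qemu  (example path)
Package: qemu-kvm qemu-system-x86 qemu-utils qemu-block-extra
Pin: version 1:2.5+dfsg-5ubuntu10.5
Pin-Priority: 1001
```

A Pin-Priority above 1000 also allows apt to downgrade already-upgraded packages back to the pinned version.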

I'm certain there's a launchpad bug for Ubuntu qemu regarding this, but don't have it to hand.

Hope this helps,
Chris

On Thu, 23 Nov 2017 at 15:33 Joe Topjian <[hidden email]> wrote:
Hi all,

We're seeing some strange libvirt issues in an Ubuntu 16.04 environment. It's running Mitaka, but I don't think this is a problem with OpenStack itself.

We're in the process of upgrading this environment from Ubuntu 14.04 with the Mitaka cloud archive to 16.04. Instances are being live migrated (NFS share) to a new 16.04 compute node (fresh install), so there's a change between libvirt versions (1.2.2 to 1.3.1). The problem we're seeing is only happening on the 16.04/1.3.1 nodes.

We're getting occasional reports of instances not able to be snapshotted. Upon investigation, the snapshot process quits early with a libvirt/qemu lock timeout error. We then see that the instance's xml file has disappeared from /etc/libvirt/qemu and must restart libvirt and hard-reboot the instance to get things back to a normal state. Trying to live-migrate the instance to another node causes the same thing to happen.

However, at some random time, either the snapshot or the migration will work without error. I haven't been able to reproduce this issue on my own and haven't been able to figure out the root cause by inspecting instances reported to me.

One thing that has stood out is the length of time it takes for libvirt to start. If I run "/etc/init.d/libvirt-bin start", it takes at least 5 minutes before a simple "virsh list" will work. The command will hang otherwise. If I increase libvirt's logging level, I can see that during this period of time, libvirt is working on iptables and ebtables (looks like it's shelling out commands).

But if I run "libvirtd -l" straight on the command line, all of this completes within 5 seconds (including all of the shelling out).

My initial thought is that systemd is doing some type of throttling between the system and user slice, but I've tried comparing slice attributes and, probably due to my lack of understanding of systemd, can't find anything to prove this.

Is anyone else running into this problem? Does anyone know what might be the cause?

Thanks,
Joe
_______________________________________________
OpenStack-operators mailing list
[hidden email]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

Re: mitaka/xenial libvirt issues

Joe Topjian-2
We think we've pinned the qemu errors down to a mismatched group ID on a handful of compute nodes.

The slow systemd/libvirt startup is still unsolved, but at the moment it does not actually appear to be the cause of the qemu errors.
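A quick way to compare those IDs across nodes is a one-off check like this (the group names are assumptions for an Ubuntu 16.04 host; adjust them to match your deployment, then diff the output between nodes):

```python
# Hypothetical check: print the GID of each libvirt/qemu-related group so
# the output can be diffed across compute nodes. Group names are guesses
# for an Ubuntu 16.04 host; adjust to match your deployment.
import grp


def group_gids(names):
    gids = {}
    for name in names:
        try:
            gids[name] = grp.getgrnam(name).gr_gid
        except KeyError:
            gids[name] = None  # group not present on this node
    return gids


if __name__ == "__main__":
    for name, gid in sorted(group_gids(["kvm", "libvirtd", "libvirt-qemu"]).items()):
        print("%s: %s" % (name, gid if gid is not None else "missing"))
```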

On Mon, Nov 27, 2017 at 8:04 AM, Joe Topjian <[hidden email]> wrote:
Hi all,

To my knowledge, we don't use tunneled migrations. This issue is also happening with snapshots, so it's not restricted to just migrations.

I haven't yet tried the apparmor patches that George mentioned. I plan on applying them once I get another report of a problematic instance.

Thank you for the suggestions, though :)
Joe

On Mon, Nov 27, 2017 at 2:10 AM, Tobias Urdin <[hidden email]> wrote:

Hello,

This seems to assume tunnelled migrations; the live_migration_flag option is removed in later versions but is present in Mitaka.

Do you have the VIR_MIGRATE_TUNNELLED flag set for [libvirt]live_migration_flag in nova.conf?


It might be a long shot, but I've removed VIR_MIGRATE_TUNNELLED in our clouds.

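For reference, the Mitaka-era setting without the tunnelled flag looks something like this (the exact flag list is deployment-specific; this is a sketch, not our literal config):

```
[libvirt]
# VIR_MIGRATE_TUNNELLED deliberately omitted from the flag list
live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_PERSIST_DEST
```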
Best regards


On 11/26/2017 01:01 PM, Sean Redmond wrote:
Hi,

I think it may be related to this:


Thanks

On Thu, Nov 23, 2017 at 6:20 PM, Joe Topjian <[hidden email]> wrote:
OK, thanks. We'll definitely look at downgrading in a test environment.

To add some further info to this problem, here are some log entries. When an instance fails to snapshot or fails to migrate, we see:

libvirtd[27939]: Cannot start job (modify, none) for domain instance-00004fe4; current job is (modify, none) owned by (27942 remoteDispatchDomainBlockJobAbort, 0 <null>) for (69116s, 0s)

libvirtd[27939]: Cannot start job (none, migration out) for domain instance-00004fe4; current job is (modify, none) owned by (27942 remoteDispatchDomainBlockJobAbort, 0 <null>) for (69361s, 0s)


The one piece of this that I'm currently fixated on is the length of time it takes libvirt to start. I'm not sure if it's causing the above, though. When starting libvirt through systemd, it takes much longer to process the iptables and ebtables rules than if we start libvirtd on the command-line directly.

virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -L libvirt-J-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -L libvirt-P-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -F libvirt-J-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -X libvirt-J-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -F libvirt-P-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -X libvirt-P-vnet5'

We're talking about a difference between 5 minutes and 5 seconds depending on where libvirt was started. This doesn't seem normal to me. 

In general, is anyone aware of systemd performing restrictions of some kind on processes which create subprocesses? Or something like that? I've tried comparing cgroups and the various limits within systemd between my shell session and the libvirt-bin.service session and can't find anything immediately noticeable. Maybe it's apparmor?

Thanks,
Joe
