Finally: systemd-networkd to the rescue

People like me, who work with different Linux distributions and automation, are always looking for ways to bridge the different styles of system configuration into one unified approach, to the point where it no longer matters whether you prefer Ubuntu, Debian, RedHat, CentOS or whatever your Linux OS of choice is. Finally systemd comes to the rescue and solves the network configuration issue with the systemd-networkd manager.

So how can you manage network configuration using systemd-networkd?

First, check if you actually have it installed and running with

systemctl status systemd-networkd

If the service is not enabled, just enable it after you have added your interfaces.

To configure interfaces, or more precisely networks, in systemd you only need to add a config file with a .network suffix. In my case /etc/systemd/network/ens33.network:

[Match]
Name=ens33

[Network]
DHCP=ipv4

[Address]
Address=10.0.2.15/24

[Address]
Address=10.0.3.15/24

The example above enables DHCP (IPv4) on the network interface ens33, a VMware interface (yes, I run VMware on my MacBook), and additionally adds secondary IP addresses for haproxy testing purposes.

Once the configuration is completed, enable and restart the systemd-networkd service:

systemctl enable systemd-networkd
systemctl restart systemd-networkd

The networkctl command can now be used to monitor the lifecycle of a network:
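Here is a minimal sketch of how that looks on my test box; the interface names and states will of course differ on your system:

# list all links known to systemd-networkd and their state
networkctl list

# detailed status of a single interface, including the addresses, gateway and DNS picked up via DHCP
networkctl status ens33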

Pretty cool, right? Finally one network manager to rule them all.

More information can be found on the systemd.network man page, which documents many more available options: http://man7.org/linux/man-pages/man5/systemd.network.5.html

Tales from the crypt: Neutron metadata issues

I have been operating OpenStack since 2014 and have come across a significant number of issues, mainly around Neutron. Which makes sense, knowing how important Neutron is inside OpenStack: without it functioning properly, none of your workload has access to the network.

This particular situation we are looking at was reported as a performance issue for the Neutron metadata service, in a Neutron Linux bridge ML2 managed environment.

The Neutron metadata service implements a proxy between the OpenStack instance and the Nova and Neutron services to provide Amazon AWS EC2 style metadata.
This Neutron service is important for user instances for various reasons, including:
• Cloud placement decisions (what is my public IP, etc.)
• User scripts and SSH key injection into the boot process (typically via cloud-init)

Performance issues resulting in client timeouts or unavailability of this service directly impacted cloud user workloads, which led to application unavailability. The issue was further compounded by operating over 1000 instances inside one Neutron layer 2 network.

The way Neutron provides this service is by wrapping an HTTP proxy server, the neutron-ns-metadata-proxy, inside a Linux network namespace. Network namespaces are common practice for separating routing domains in Linux, allowing custom firewall (iptables) and routing processing independent of the host OS. Additionally, the service scales per Neutron L2 network, a crucial piece of information moving forward.
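To get a feel for this setup, you can peek at the namespaces and the proxy processes on a network node. A rough sketch; the namespace names and UUIDs are of course specific to your environment:

# list the network namespaces Neutron created on this node
ip netns list

# check what is listening inside one of them and look for the metadata proxy
ip netns exec qdhcp-<network-uuid> netstat -ntlp
ps aux | grep neutron-ns-metadata-proxy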

What happened to this service?

A Rackspace Private Cloud OpenStack customer was reporting response times larger than 30 seconds for any request to the Neutron metadata service. Initial debugging on the user instances revealed that metadata requests got intercepted by a security appliance, but excluding the standard metadata IP, 169.254.169.254, from the proxy configuration via

export no_proxy="localhost,127.0.0.1,localaddress,.localdomain.com,169.254.169.254"

did not solve the issue. At this point I knew the issue was related to the Neutron service or the backend services it uses, mainly Nova API (compute) and RabbitMQ (the OpenStack message bus).
Looking at the requests the Neutron service handles, I identified an unusual pattern in the frequency and realized that the configuration management tool Chef was requesting the metadata far beyond the behavior expected when OpenStack instances boot or reboot.
From previous issues I knew that the Chef plugin ohai played a major role and that its HTTP connection handling was known to be inefficient, mainly due to the lack of support for persistent HTTP connections.
Continuing the research on the Neutron service and looking for ways to improve response times, I identified that the neutron-ns-metadata-proxy service was only capable of opening 100 Unix sockets to the neutron-metadata-agent. These sockets are used to talk to the Neutron metadata agent across the Linux network namespace boundary without opening additional TCP connections internally, mainly as a performance optimization.
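A crude but quick way to see how close a proxy gets to that ceiling is to count the open file descriptors of its process, since under load most of them are exactly these Unix sockets. A sketch along these lines (the process name match is an assumption and may need adjusting):

# grab the PID of one neutron-ns-metadata-proxy instance
PID=$(pgrep -f neutron-ns-metadata-proxy | head -n1)

# count its open file descriptors
ls /proc/$PID/fd | wc -l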

Unable to explain the 100 connection limit at first, especially in the absence of Neutron backend problems (Neutron server) or Nova API issues, I began looking at the Neutron source code and found a related change in the upstream code.
The Neutron commit added an option to parameterize the number of WSGI threads (WSGI is used as the web server gateway interface for Python), but it also lowered the default limit from 1000 to 100. This crucial information was absent from any Neutron release notes.

More importantly, we just found our 100 Unix sockets limit.

This also explained the second observation: connections to the Neutron metadata service got queued, which caused the large delay in response times. The queueing was a result of using the eventlet and greenlet network event libraries, a typical way of addressing non-blocking I/O in the Python world.
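If you hit the same ceiling before an upstream fix reaches you, the pool size can be raised again in neutron.conf. A hedged sketch, assuming the option is called wsgi_default_pool_size (verify the name against your Neutron release) and that crudini is available; the agent service name may also differ per distribution:

# raise the WSGI greenthread pool back to the old default of 1000
crudini --set /etc/neutron/neutron.conf DEFAULT wsgi_default_pool_size 1000

# restart the metadata agent so the namespace proxies pick it up
systemctl restart neutron-metadata-agent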

So what comes next?

Currently I am looking to solve the problem in multiple ways.
The immediate problem should be solved with a Chef ohai plugin fix, as proposed in Chef pull request #995, which finally introduces persistent HTTP connections and drastically reduces the need for parallel connections. First results are encouraging.

More importantly, the Neutron community has re-implemented the neutron-ns-metadata-proxy with HAProxy (LP #1524916) to address performance issues. The OpenStack community needs to verify whether the issue still occurs.

Alternatively, there are Neutron network design decisions that can assist with these problems. For example, one approach is to reduce the size of a Neutron L2 network to smaller than 23 bits, which allows Neutron to scale out the metadata service.

This approach makes it possible to create multiple Neutron routers, scaling the Neutron metadata service out onto other Neutron agents, where one router is solely responsible for serving the Neutron metadata requests. This is especially the case when the configuration option enable_isolated_metadata is set to True and project/tenant networks are attached to Neutron routers.

So as usual, Neutron keeps it interesting for us. Can’t wait to dissect the Neutron metadata service in a DVR environment. More to come…

What’s up with OpenStack Swift metadata

The other day I got interested in which attributes OpenStack Swift actually stores along with the data.

First I had to determine the actual partition where Swift is storing the data. In my case I had a Swift ring prepared with a replica count of 2, so the data can only exist in two partitions.

The easy way to look up this information is by using swift-get-nodes:

swift-get-nodes <ring file> <URL containing account+container+path>
# swift-get-nodes /etc/swift/object-1.ring.gz /AUTH_e1496568b6864cb1b52cdfe7436c213f/test/root/hummingbird |grep lah

ssh 172.29.244.100 "ls -lah ${DEVICE:-/srv/node*}/swift4.img/objects-1/39/83a/27ea485e7f147e5e47f9c38dd0feb83a"

ssh 172.29.244.100 "ls -lah ${DEVICE:-/srv/node*}/swift5.img/objects-1/39/83a/27ea485e7f147e5e47f9c38dd0feb83a"

Let’s change into the partition directory and retrieve all xattr keys and values:

# cd /srv/swift4.img/objects-1/39/83a/27ea485e7f147e5e47f9c38dd0feb83a/
# getfattr -dm '.*' 1469228674.06028.data

# file: 1469228674.06028.data
 user.swift.metadata="<binary value: a pickled Python dict, decoded below>"

Using the key name user.swift.metadata, I found out that the value for this key is a Python pickle object: https://github.com/openstack/swift/blob/master/swift/obj/diskfile.py#L133

Now let’s uncover the data of the pickle object:

Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.


>>> import xattr
>>> import pickle

>>> fh = open('1469228674.06028.data')

>>> xattr.listxattr(fh)
 (u'user.swift.metadata',)

>>> xattr.getxattr(fh,'user.swift.metadata')
 '\x80\x02}q\x01(U\x0eContent-Lengthq\x02U\x0812321041U\x04nameq\x03U>> 

>>> p = pickle.loads(xattr.getxattr(fh,'user.swift.metadata'))

>>> print p
 {'Content-Length': '12321041', 'name': '/AUTH_e1496568b6864cb1b52cdfe7436c213f/test/root/hummingbird', 'Content-Type': 'application/octet-stream', 'ETag': 'd645ab07a4a452abeeb7f3ad0ec0f7db', 'X-Timestamp': '1469228674.06028', 'X-Object-Meta-Mtime': '1466738039.627804'}

Here it is, the usual suspects are stored. This metadata is actually returned with each stat request. Quite clever: that way Swift does not need to rehash or read additional attributes for each file it serves.

Multi Homing Debian/Ubuntu instances

As some CentOS/RedHat folks might know, you can use the GATEWAYDEV option inside the network configuration file to accept the default gateway only from that interface (GATEWAYDEV=eth0, for example). This is particularly useful when connecting instances to multiple networks, like a public and an internal network, to eliminate unnecessary routing while using DHCP to assign the addresses to the interfaces. The need arises primarily when both networks, public and internal, push a DHCP default router (gateway) option, to allow multi-homed and single-homed instances in the same networks.

One way of implementing a feature similar to what RedHat and its derivatives offer is to utilize DHCP client hooks to alter the DHCP options pushed from the server to the client. The DHCP client does support enter and exit hooks, which allow for alterations before and after the interface configuration.

For this use case I implemented the following enter hook to ignore the default router information on all interfaces other than the elected one. Personally I would recommend that the public facing interface is always the first one and that the Internet default gateway uses this interface. All dhclient enter hooks are stored inside the /etc/dhcp/dhclient-enter-hooks.d directory and are executed in alphanumeric order, which is also the reason why I prefixed the script with the number 1.

# strip the default gateway pushed via DHCP on all interfaces except eth0
RUN='yes'
 
if [ "$RUN" = 'yes' ]; then
  # only act when dhclient binds or renews a lease
  if [ "$reason" = "BOUND" -o "$reason" = "REBOOT" ]; then
    if [ "$interface" != 'eth0' ]; then
      test -f /tmp/dhclient-script.debug && echo "Stripping default GW off $interface" |tee -a /tmp/dhclient-script.debug
      # clearing these variables prevents dhclient-script from installing a default route
      new_routers=""
      old_routers=""
    fi
  fi
fi
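To put the hook in place, drop it into the hooks directory and renew the lease on the secondary interface to test it. A sketch, using the hypothetical file name 1-strip-default-gw and eth1 as the internal interface:

# install the hook so dhclient-script sources it (hooks run in alphanumeric order)
install -m 0644 1-strip-default-gw /etc/dhcp/dhclient-enter-hooks.d/1-strip-default-gw

# release and renew the lease on the internal interface
dhclient -r eth1 && dhclient eth1

# the default route should now only point out of eth0
ip route | grep ^default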

Windows drive letters and cinder volumes?

How does Windows persist drive letters when attaching/detaching cinder volumes?

Whenever filesystems are mounted inside Windows instances, the administrator will usually assign a drive letter to the device. This information is made persistent inside the HKLM\SYSTEM\MountedDevices registry subkey. Therefore the drive letter will always persist as long as the volume is attached to the same instance; the device order does not matter in this case. My information is based on a quote I found at windowsitpro.com stating:

According to a Microsoft Customer Service and Support (CSS) representative, Windows uses the disk ID as an index to store and retrieve information about the disk. For example, in the HKLM\SYSTEM\Mounted Devices registry subkey, the disk ID appears as REG_BINARY data in the \DosDevices\ and \\??\Volume{} entries because Windows uses the disk ID to store and retrieve information about persistent drive letter mappings and mount points.

And what happens if you detach and attach cinder volumes in the wrong order?

In short, nothing, as long as the volumes are attached to the same system retaining the same copy of the registry. If the registry or the system ever changes, the drive letters will be reordered. That’s also a good reason to choose your device descriptions wisely, in case you ever have to recover from an instance/OS issue.

Spice console issues with RHEL/CentOS 7 instances?

After I deployed OpenStack Icehouse, we noticed Spice HTML5 proxy console issues, in particular with CentOS 7 and RHEL 7 guests. Those guest consoles showed issues with character echoing; you were not able to see what you had typed inside the terminal. I tracked this issue down to a Spice HTML5 proxy issue whenever the guest is using a frame buffer enabled console. After I disabled the frame buffer mode and switched the console to text mode, the guest console was finally usable. Here are the instructions:

Please add the options “nofb nomodeset” to the GRUB_CMDLINE_LINUX variable inside the /etc/default/grub config file and regenerate the grub2 config.

  • Tested configuration /etc/default/grub:
GRUB_TIMEOUT=1
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL="serial console"
GRUB_SERIAL_COMMAND="serial"
GRUB_CMDLINE_LINUX="console=ttyS0 console=tty0 crashkernel=auto vconsole.keymap=us nofb nomodeset"
GRUB_DISABLE_RECOVERY="true"
  • Rebuild the grub2 config

grub2-mkconfig -o /boot/grub2/grub.cfg

After the mandatory instance reboot, the console will boot in text mode only and will not use any frame buffer graphics device. The console should work as desired at this point.
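To double-check after the reboot that the new parameters actually made it onto the kernel command line, a quick look at /proc/cmdline is enough:

# verify the frame buffer options are active on the running kernel
grep -E 'nofb|nomodeset' /proc/cmdline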

Why are the Nova hypervisor statistics not updating after renaming a host while instances are running?

The nova-compute service periodically updates the hardware (VCPU, RAM, DISK) statistics for a host and uses the host name (check with hostname -f in Linux) to update the database with the available resources.

In cases where the host name has been changed while instances are running, all existing instances still reference the old host name inside the node column of the nova.instances table. All those entries need to be updated inside the nova MySQL database in order to get the correct amount of available resources for nova:

UPDATE nova.instances SET node = '<new host name>' WHERE node = '<old host name>';

Other columns such as host and launched_on should be included in a subsequent SQL statement.
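A sketch of what such a follow-up statement could look like when run through the mysql CLI; double-check the values against your environment before touching the database:

# run on the database host (or with the appropriate credentials)
mysql nova -e "UPDATE instances SET host = '<new host name>', launched_on = '<new host name>' WHERE host = '<old host name>';"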

Ever wondered why Windows guests come up with the wrong time when running inside OpenStack?

When using KVM, OpenStack starts the instances with the simulated guest hardware clock set to UTC. This is independent of the host clock setting.

Windows, however, assumes the hardware clock is set to local time, so it boots up in UTC time until the time synchronization against time.microsoft.com finishes and corrects it to the desired time zone.

To change the hardware clock behavior in Windows, you can add this registry entry:

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\TimeZoneInformation]
"RealTimeIsUniversal"=dword:00000001

That will boot the instance with the correct time, since the hardware clock inside the guest and the host now use the same time zone convention.

Additionally, please note that Microsoft Windows Server 2008/7 had a high CPU issue when changing to DST.

That has been fixed with the hotfix:

2800213 High CPU usage during DST changeover in Windows Server 2008, Windows 7, or Windows Server 2008 R2

https://support.microsoft.com/en-us/kb/2687252

Update per 6/25/2015: This hotfix is only applicable to older OpenStack releases (Havana and lower). Newer releases of OpenStack start the guest in the local time zone of the host. Additionally, OpenStack needs to be aware that the image you’re using is a Windows guest, so you have to set the os_type image property to windows. But beware of errors around those glance properties: there are known issues where the image properties are not retained if you create new images from existing nova instances.
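Setting the property is a one-liner with the glance client; a sketch, with the image ID obviously being a placeholder:

# mark the image as a Windows guest so nova starts it with a localtime clock
glance image-update <image-id> --property os_type=windows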

Want to schedule AWS snapshots?

For all of you who have a need to schedule AWS snapshots and are not so familiar with Linux shell scripting, here is how I schedule EC2 snapshots.

I recommend setting up a dedicated script host in your AWS region where you can execute all your scripts. Usually a t1.micro AWS Linux instance will suffice.

Environment variable you need to access the AWS EC2 API:

export AWS_CREDENTIAL_FILE=$HOME/.awssecret

The file .awssecret has a simple format:

AWSAccessKeyId=xxxxxxx 
AWSSecretKey=yyyyyy

which you’ll get once you create your user and generate an AWS key in the IAM user management console.

How do you call the script:

source $HOME/.bash_profile ; $HOME/bin/ec2-create-backup.sh

Following is an example of how I am using the script in a cron job:

MAILTO=""
#Backup of XXX
00 00 * * * ( source $HOME/.bash_profile ; $HOME/bin/ec2-create-backup.sh us-west-1 vol-f233464 10 )

Before you run the script from cron, I would test whether your environment is set up correctly:

 ec2-describe-snapshots --hide-tags --region us-west-1

Now follows the code for my auxiliary script:

#!/bin/bash
# ec2-describe-snapshots --hide-tags --region us-west-1 -F volume-id=vol-xxxxxx
# output:
# SNAPSHOT snap-xxxxxx vol-xxxxxx completed 2013-09-19T23:24:26+0000 100% 519544898336 25 mysql 5.6
export PATH=$PATH:/opt/aws/bin
RET=0
 
usage() {
 echo -e "$0\t<region> <volume-id> <backlog>";
 echo -e "$0\tus-west-1 vol-12345 31";
 exit 1;
}
 
makeSnapshot() {
 # $1 = region, $2 = volume-id
 local r=$1 vol=$2
 echo "Creating new snapshot for volume $vol"
 ec2-create-snapshot --region $r $vol -d "Backup $(date +'%Y%m%d%H%M%S') of $vol"
 RET=$?
}
 
deleteSnapshot() {
 # $1 = region, $2 = snapshot-id
 local r=$1 snap=$2
 echo "Deleting oldest snapshot $snap"
 ec2-delete-snapshot --region $r $snap
 RET=$?
}
 
region=$1
volume=$2
test -z "$1" && usage
test -z "$2" && usage
test -z "$3" && backlog=5 || backlog=$3
test -z "$AWS_CREDENTIAL_FILE" && echo "AWS_CREDENTIAL_FILE not set" && exit 1
snaps=( $( ec2-describe-snapshots --hide-tags --region $region -F volume-id=$volume | egrep -o 'snap-[0-9A-Za-z]+' ) )
nosnaps=${#snaps[@]}
 
if [ $nosnaps -lt $backlog ]; then
 makeSnapshot $region $volume
 test $RET -gt 0 && exit 1 || exit 0
else
 # find the timestamp of the oldest snapshot and remove it before creating a new one
 oldestTS=$( ec2-describe-snapshots --hide-tags --region $region -F "volume-id=$volume" | egrep -o "Backup [0-9]+ of" | egrep -o '[0-9]+' | sort | head -n1 )
 snap=$( ec2-describe-snapshots --hide-tags --region $region -F "volume-id=$volume" -F "description=*${oldestTS}*" | egrep -o 'snap-[0-9A-Za-z]+' );
 deleteSnapshot $region ${snap}
 test $RET -gt 0 && exit 1
 makeSnapshot $region $volume
 test $RET -gt 0 && exit 1 || exit 0
fi;

KVM Live Migration (RedHat)

Live Migration using shared storage

I really love the ability to migrate running VMs from one Linux hypervisor to another without the burden of pooling or the necessity of having some sort of shared storage attached, although migrations using shared storage, e.g. NFS, are faster and easier to accomplish. The migration can be initiated using the virt-manager GUI tool or, even simpler, at the virsh CLI. As a requirement I installed libvirtd and opened up the network communication for it (listen_tcp = 1 in /etc/libvirt/libvirtd.conf). Also check the firewall settings on the hypervisor to confirm the libvirtd port is open (as root: netstat -ntlp | grep libvirtd).

The following example shows how to migrate a VM over the network using libvirtd. You should always enable TLS for libvirtd, but encryption is not always supported by 3rd party products like CloudStack:

sudo virsh migrate --live --persistent --p2p --tunnelled <VM> qemu+tcp://<hypervisor>/system

Important is the --persistent option, which ensures the new VM on the target hypervisor stays persistent. If you don’t use the option, the VM configuration will automatically be removed from the target hypervisor and you will have to start the VM on the old machine again.

I usually use a temporary live migration during hardware maintenance or overload situations, with the intention to run the VM on the old metal afterwards.

 

Live Migration using local storage

KVM allows you to live migrate a VM from local to local storage. The only requirement is that you have enough RAM and a destination disk image of the same size available. This image needs to reside at the same path and with the same file name.

  • Create a new disk on the destination KVM host
sudo qemu-img create -f qcow2 /var/lib/libvirt/images/<VM>.img 2G
  • Start migration on the source KVM host
sudo virsh migrate --live --p2p --tunnelled --persistent --copy-storage-all <VM> qemu+tcp://<hypervisor>/system
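While the storage copy is running, the progress can be watched from the source host with stock virsh commands; a brief sketch:

# show memory and disk progress of the running migration
sudo virsh domjobinfo <VM>

# or keep an eye on the domain list until the VM disappears from the source
watch sudo virsh list --all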