Tales from the crypt: InfluxDB and retention policies

Folks,

don’t be this guy:

CREATE RETENTION POLICY <name> ON <database> DURATION 30d REPLICATION 1

because what will happen is that the custom policy will not actually be used within your database.

All your shards will still use the standard autogen policy, which will expire data according to that policy's settings.

To enable a custom retention policy for new shards, simply execute:

USE <database to modify>
ALTER RETENTION POLICY <name> ON <database> DEFAULT

and new shards automatically pick up the configured retention policy.

And to prevent this issue from happening again, just add the keyword DEFAULT as part of the CREATE RETENTION POLICY statement:

CREATE RETENTION POLICY <name> ON <database> DURATION 30d REPLICATION 1 DEFAULT

indicating its preferred use over autogen.
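
To double-check which policy is the default, a quick look with the influx CLI helps. A minimal sketch, assuming the 1.x influx client and using <database> as a placeholder:

# list all retention policies of the database; the "default" column
# should show true for the custom policy, not for autogen
influx -database <database> -execute 'SHOW RETENTION POLICIES'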

Tales from the crypt: Neutron metadata issues

I have been operating OpenStack since 2014 and have come across a significant number of issues, mainly around Neutron. This makes sense given the importance of Neutron inside OpenStack: without it functioning properly, your workloads have no access to the network.

This particular situation we are looking at was reported as a performance issue for the Neutron metadata service, in a Neutron Linux bridge ML2 managed environment.

The Neutron metadata service implements a proxy between the OpenStack instance and the Nova and Neutron services to provide Amazon AWS EC2 style metadata.
This Neutron service is important for user instances for various reasons, including (a quick example follows this list):
• Cloud placement decisions (what is my public IP, etc.)
• User scripts and SSH key injection into the boot process (typically via cloud-init)
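
For illustration, this is roughly what such a request looks like from inside an instance (the exact metadata items available depend on the deployment):

# EC2-style metadata, answered by the Neutron metadata proxy
curl http://169.254.169.254/latest/meta-data/public-ipv4
# OpenStack-native flavor of the same service
curl http://169.254.169.254/openstack/latest/meta_data.json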

Performance issues resulting in client timeouts or unavailability of this service directly impacted cloud user workloads, which led to application unavailability.

The issue was further compounded by operating over 1000 instances inside one Neutron layer 2 network.
The way Neutron provides this service is by wrapping it into a Linux network namespace and running an HTTP proxy server, the neutron-ns-metadata-proxy. Network namespaces are common practice for separating routing domains in Linux, allowing custom firewall (iptables) and routing processing independent of the host OS. Additionally, the service scales per Neutron L2 network, a crucial piece of information moving forward.
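
To see this in practice, the per-network proxy can be inspected from the network node. A rough sketch; the namespace names depend on whether the network is served by a router (qrouter-...) or a DHCP namespace (qdhcp-...):

# list the network namespaces Neutron created on this node
ip netns
# confirm a metadata proxy is running for a given network/router
ps aux | grep neutron-ns-metadata-proxy
# and check what it listens on inside the namespace
ip netns exec <namespace> netstat -ntlp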

What happened to this service?

A Rackspace Private Cloud OpenStack customer was reporting response times of more than 30 seconds for any request to the Neutron metadata service. Initial debugging on the user instances revealed that metadata requests got intercepted by a security appliance, but excluding the standard metadata IP, 169.254.169.254, from the proxy configuration via

export no_proxy="localhost,127.0.0.1,localaddress,.localdomain.com,169.254.169.254"

did not solve the issue. At this point I knew the issue was related to the Neutron service or the backend services it uses, mainly Nova API (compute) and RabbitMQ (the OpenStack message bus).
Looking at the requests the Neutron service handles, I identified an unusual pattern in the frequency and realized that the configuration management tool Chef was requesting the metadata, beyond the standard behavior expected when OpenStack instances boot/reboot.
From previous issues I knew that the Chef plugin ohai played a major role, and inefficiencies were known with regard to its HTTP connection handling, mainly the lack of support for HTTP persistence.
Continuing the research on the Neutron service and looking for ways to improve response times, I identified that the neutron-ns-metadata-proxy service was only capable of opening 100 Unix sockets to the neutron-metadata-agent. These sockets are used to talk to the Neutron metadata agent across the Linux network namespace boundary without opening additional TCP connections internally, mainly as a performance optimization.
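
A rough way to observe this on an affected node is to count the established Unix socket connections towards the metadata agent. A sketch, assuming the default socket path (the metadata_proxy_socket option, $state_path/metadata_proxy):

# count established Unix socket connections to the metadata agent socket
ss -x state established | grep -c metadata_proxy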

Unable to explain the 100-connection limit at first, especially in the absence of Neutron backend problems (Neutron server) or Nova API issues, I began looking at the Neutron source code and found a related change in the upstream code.
The Neutron commit added an option to parameterize the number of WSGI threads (WSGI is the web server gateway interface used for Python services), but also lowered the default limit from 1000 to 100. This crucial information was absent from the Neutron release notes.

More importantly, we had just found our 100 Unix socket limit.

This also explained the second observation: connections to the Neutron metadata service got queued, causing the large delay in response times. This queueing was a result of the eventlet and greenlet network event libraries, a typical combination for addressing non-blocking I/O in the Python environment.
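
For operators hitting this limit, the option introduced by that commit can be raised again. A sketch, assuming the option is named wsgi_default_pool_size as in the upstream change; verify the name and behaviour against the Neutron release you are running:

# raise the greenthread pool used by Neutron's WSGI servers back to 1000
crudini --set /etc/neutron/neutron.conf DEFAULT wsgi_default_pool_size 1000
# the proxies are spawned by the L3/DHCP agents, so restart those for new
# neutron-ns-metadata-proxy processes to pick up the change
service neutron-l3-agent restart
service neutron-dhcp-agent restart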

So what comes next?

Currently I am looking to solve the problem in multiple ways.
The immediate problem should be solved with a Chef ohai plugin fix, as proposed in Chef pull request #995, which finally introduces persistent HTTP connections and drastically reduces the need for parallel connections. First results are encouraging.

More importantly, the Neutron community has re-implemented the neutron-ns-metadata-proxy with HAProxy (LP #1524916) to address performance issues. The OpenStack community needs to verify whether the issue still occurs there.

Alternatively, there are Neutron network design decisions that can assist with these problems. For example, one approach is to reduce the size of a Neutron L2 network to smaller than a /23, which allows Neutron to scale out the metadata service.

This approach allows the option to create multiple Neutron routers, scaling out the Neutron metadata service onto other Neutron agents, where one router is solely responsible for serving the Neutron metadata requests. This is especially the case when the configuration option enable_isolated_metadata is set to True and project/tenant networks are attached to Neutron routers.
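
A minimal sketch of the relevant knob and of carving out smaller networks; names and CIDRs are made up and the exact layout depends on your design:

# serve metadata from the DHCP namespace of each isolated network
crudini --set /etc/neutron/dhcp_agent.ini DEFAULT enable_isolated_metadata True
# prefer several small project networks over one large L2 segment
neutron net-create app-net-1
neutron subnet-create --name app-subnet-1 app-net-1 10.10.0.0/24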

So as usual, Neutron keeps it interesting for us. Can’t wait to dissect the Neutron metadata service in a DVR environment. More to come…

What’s up with OpenStack Swift metadata

The other day I got interested in what attributes OpenStack Swift is actually storing along with the data.

First I had to determine the actual partition where Swift is storing the data. In my case I had a Swift ring prepared with a replication count of 2, so the data can only exist in two partitions.

The easy way to look up this information is by using the swift-get-nodes tool:

swift-get-nodes <ring file> <URL containing account+container+path>
# swift-get-nodes /etc/swift/object-1.ring.gz /AUTH_e1496568b6864cb1b52cdfe7436c213f/test/root/hummingbird |grep lah

ssh 172.29.244.100 "ls -lah ${DEVICE:-/srv/node*}/swift4.img/objects-1/39/83a/27ea485e7f147e5e47f9c38dd0feb83a"

ssh 172.29.244.100 "ls -lah ${DEVICE:-/srv/node*}/swift5.img/objects-1/39/83a/27ea485e7f147e5e47f9c38dd0feb83a"

Let’s change into the partition directory and retrieve all xattr keys and values:

# cd /srv/swift4.img/objects-1/39/83a/27ea485e7f147e5e47f9c38dd0feb83a/
# getfattr -dm '.*' 1469228674.06028.data

# file: 1469228674.06028.data
 user.swift.metadata="�}q(UC⎺┼├e┼├↑Le┼±├▒─12321▮41U┼▒└e─U>>"

Using the key name user.swift.metadata, I found out that the value for this key is a Python pickle object: https://github.com/openstack/swift/blob/master/swift/obj/diskfile.py#L133

Now let’s uncover the data of the pickle object:

Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.


>>> import xattr
>>> import pickle

>>> fh = open('1469228674.06028.data')

>>> xattr.listxattr(fh)
 (u'user.swift.metadata',)

>>> xattr.getxattr(fh,'user.swift.metadata')
 '\x80\x02}q\x01(U\x0eContent-Lengthq\x02U\x0812321041U\x04nameq\x03U>> 

>>> p = pickle.loads(xattr.getxattr(fh,'user.swift.metadata'))

>>> print p
 {'Content-Length': '12321041', 'name': '/AUTH_e1496568b6864cb1b52cdfe7436c213f/test/root/hummingbird', 'Content-Type': 'application/octet-stream', 'ETag': 'd645ab07a4a452abeeb7f3ad0ec0f7db', 'X-Timestamp': '1469228674.06028', 'X-Object-Meta-Mtime': '1466738039.627804'}

Here it is, the usual suspects are stored. This metadata is actually returned with each stat request. Quite clever; that way Swift does not need to rehash or read additional attributes for each file it serves.
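
The same inspection also works non-interactively as a one-liner (assuming the xattr Python module is installed and reusing the object file from above):

python -c "import pickle, xattr; print(pickle.loads(xattr.getxattr('1469228674.06028.data', 'user.swift.metadata')))"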

Small excursion into the undocumented OpenStack LBaaSv2 world and HAProxy

Some people, including me, like to play with new stuff. And recently I set my mind to exploring LBaaSv2 with the HAProxy namespace driver under RDO, the Red Hat open source distribution for OpenStack.

Here is what I did to get the Neutron LBaaSv2 agent, including the HAProxy driver, working.

The configuration

  • Install necessary packages
yum upgrade
yum -y install openstack-neutron-lbaas haproxy
  • Enabling the LoadBalancerPluginv2 inside the /etc/neutron/neutron.conf
crudini --set /etc/neutron/neutron.conf DEFAULT service_plugins router,neutron_lbaas.services.loadbalancer.plugin.LoadBalancerPluginv2
  • Enabling the HAProxy namespace driver inside the /etc/neutron/neutron_lbaas.conf file
crudini --set /etc/neutron/neutron_lbaas.conf service_providers service_provider LOADBALANCERV2:Haproxy:neutron_lbaas.drivers.haproxy.plugin_driver.HaproxyOnHostPluginDriver:default
  • Configure OVS as the interface driver inside the /etc/neutron/lbaas_agent.ini file

Interestingly, Red Hat did not preconfigure the interface driver to OVS, even though RDO ships with OVS enabled as the default Neutron plugin.

crudini --set /etc/neutron/lbaas_agent.ini DEFAULT interface_driver neutron.agent.linux.interface.OVSInterfaceDriver
  • Add necessary database tables to the neutron database
neutron-db-manage --service lbaas --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini upgrade head
  • Restart services
service neutron-server restart
service neutron-lbaasv2-agent restart
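
Before moving on, the settings from the steps above can be double-checked, for example with crudini again:

# confirm the LBaaSv2 service plugin, provider and interface driver are in place
crudini --get /etc/neutron/neutron.conf DEFAULT service_plugins
crudini --get /etc/neutron/neutron_lbaas.conf service_providers service_provider
crudini --get /etc/neutron/lbaas_agent.ini DEFAULT interface_driver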

Testing & Creating a Neutron load balancer

If all goes well, you will end up with a loaded Loadbalancerv2 agent:

# source ~/keystonerc_admin ; neutron agent-list --fields agent_type --fields alive
+----------------------+-------+
| agent_type           | alive |
+----------------------+-------+
| Open vSwitch agent   | :-)   |
| Metadata agent       | :-)   |
| DHCP agent           | :-)   |
| Loadbalancerv2 agent | :-)   |
| L3 agent             | :-)   |
+----------------------+-------+

Now let’s create a load balancer using the existing private (sub)network:

neutron lbaas-loadbalancer-create private_subnet
Created a new loadbalancer:
+---------------------+----------------------+
| Field               | Value                |
+---------------------+----------------------+
| admin_state_up      | True                 |
| description         |                      |
| id                  | **id omitted**       |
| listeners           |                      |
| name                |                      |
| operating_status    | ONLINE               |
| provider            | haproxy              |
| provisioning_status | ACTIVE               |
| tenant_id           | **id omitted**       |
| vip_address         | 10.0.0.3             |
| vip_port_id         | **id omitted**       |
| vip_subnet_id       | **id omitted**       |
+---------------------+----------------------+

I did not assign a name to the load balancer, so all subsequent commands will reference the ID c92fb015-c766-4a26-a9f2-39f03aad20e8.

neutron lbaas-listener-create --loadbalancer <lb id> --protocol HTTP --protocol-port 80
Created a new listener:
+---------------------------+----------------+
| Field                     | Value          |
+---------------------------+----------------+
| admin_state_up            | True           |
| connection_limit          | -1             |
| default_pool_id           |                |
| default_tls_container_ref |                |
| description               |                |
| id                        | **id omitted** |
| loadbalancers             |                |
| name                      |                |
| protocol                  |                |
| protocol_port             |                |
| sni_container_refs        |                |
| tenant_id                 | **id omitted** |
+---------------------------+----------------+

It’s alive

neutron lbaas-loadbalancer-show <lb id>
+---------------------+---------------------+
| Field               | Value               |
+---------------------+---------------------+
| admin_state_up      | True                |
| description         |                     |
| id                  | **id omitted**      |
| listeners           |                     |
| name                |                     |
| operating_status    | ONLINE              |
| provider            | haproxy             |
| provisioning_status | ACTIVE              |
| tenant_id           | abc                 |
| vip_address         | 10.0.0.3            |
| vip_port_id         | ID                  |
| vip_subnet_id       | **id omitted**      |
+---------------------+---------------------+

Let’s just have a look inside the qlbaas namespace and see if the haproxy process is actually running

# ip netns |grep lbaas
qlbaas-c92fb015-c766-4a26-a9f2-39f03aad20e8
 
# ip netns exec qlbaas-c92fb015-c766-4a26-a9f2-39f03aad20e8 netstat -ntlp
 
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 10.0.0.3:80             0.0.0.0:*               LISTEN      14017/haproxy

 

For those who are curious how HAProxy has been configured: the configuration is stored in the /var/lib/neutron/lbaas/v2/c92fb015-c766-4a26-a9f2-39f03aad20e8/haproxy.conf file, where c92fb015-c766-4a26-a9f2-39f03aad20e8 is the load balancer ID.
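
To make the load balancer actually forward traffic, a pool and at least one member still need to be attached to the listener. A rough sketch of the remaining commands; IDs, names and the member address are placeholders:

# attach a pool to the listener created earlier
neutron lbaas-pool-create --lb-algorithm ROUND_ROBIN --listener <listener id> --protocol HTTP --name test_pool
# add a backend member from the same subnet
neutron lbaas-member-create --subnet private_subnet --address 10.0.0.5 --protocol-port 80 test_pool
# and peek at the generated configuration
cat /var/lib/neutron/lbaas/v2/<lb id>/haproxy.conf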