Tales from the crypt: Neutron metadata issues

I’m operating OpenStack since 2014 and have come across a significant number of issues, mainly around Neutron; Which make sense, knowing the importance of Neutron inside OpenStack and without proper function all your workload has no access to the network.

This particular situation we are looking at was reported as a performance issue for the Neutron metadata service, in a Neutron Linux bridge ML2 managed environment.

The Neutron metadata service implements a proxy in between the OpenStack instance and the Nova and Neutron services to provide Amazon AWS EC2 style metadata.
This Neutron service is important for user instances for various reasons including:
• Cloud Placement Decisions (What is my public IP etc)
• User Scripts and SSH Key injection into the boot process (typically via cloud-init)

Performance issues, resulting in client timeouts or service unavailability of this service directly impacted cloud user workload, which led to application unavailability. The issue was compounded by operating over 1000 instances inside one layer 2 network.

The issue was further more compounded by operating over 1000 instances inside one Neutron layer 2 network.
The way Neutron provides this service is by wrapping into a Linux network namespaces and running a HTTP proxy server, the neutron-ns-metadata-proxy. Network namespace are common practice to separate routing domains in Linux, allowing custom firewall (iptables) and routing processing compared to the host OS. Additionally, the service scales per Neutron L2 network, a crucial information moving forward.

What happened to this service?

A Rackspace Private Cloud OpenStack customer was reporting response times larger than 30 seconds for any request to the Neutron metadata service. While initial debugging on the user instances revealed that metadata requests got intercepted by a security appliance, excluding the standard metadata IP, 169.254.169.254 from the proxy configuration via

export no_proxy="localhost,127.0.0.1,localaddress,.localdomain.com,169.254.169.254"

did not solve the issue. At this point I knew the issue was related to the Neutron service or the background service it uses, mainly Nova API (compute) and RabbitMQ (the OpenStack message bus).
Looking at the request the Neutron service handles, I identified an unusual pattern in the frequency and realized that the configuration management Chef was requesting the metadata, beyond the standard expected behavior if OpenStack instances boot/reboot.
From previous issues I knew that the Chef plugin ohai played a major role and inefficiencies were known in regards to HTTP connection handling, mainly the lack of supporting HTTP persistence.
Continuing the research on the Neutron service and looking for ways to improve response times, I identified that the neutron-ns-metadata-proxy service was only capable of opening 100 Unix sockets to the neutron-metadata-agent. These sockets are used to talk to the Neutron metadata-agent across the Linux Network namespace, without opening additional TCP connections internally, mainly as performance optimization.

Unable to explain the 100 connections limit at first, especially in absence of Neutron backend problems (Neutron server) or Nova API issues, I began looking at the neutron source code and found a related change in the upstream code.
The Neutron commit was adding an option to parameterize the WSGI threads, WSGI is used as web server gateway for Python, but also lowering the default limit from 1000 to 100. This crucial information was absent in any Neutron release notes.

More importantly, we just found our 100 Unix sockets limit

This also explained the second observation that the connections to the Neutron metadata service got queued and caused the large delay in response times. This queueing was a result of using a network event library eventlet and greenlet combination, a typical way of addressing non-blocking I/O in the Python environment.

So what comes next?

Currently I am looking to solve the problem in multiple ways.
The imminent problem should be solved with a Chef-ohai plugin fix as proposed per Chef pull request #995 which finally introduces persistent HTTP connections and drastically reducing the need for parallel connection. First results are encouraging.

More importantly the Neutron community has re-implemented the neutron-ns-metadata-proxy with HAProxy (LP #1524916) to address performance issues. The OpenStack community needs verify if the issue is still occurring.

Alternatively, there are Neutron network design decisions that can assist with these problems. For example, one approach is to reduce the size of a Neutron L2 network to smaller than 23 bits, which allows Neutron to scale out the metadata service.

This approach allows the option to create multiple Neutron routers, scaling out the Neutron metadata service onto other Neutron agents, where one router is only responsible for serving the Neutron Metadata requests. This is especially the situation when the configuration option enable_isolated_metadata is set to True and project/tenant networks are attached to Neutron routers.

So as usual, Neutron keeps it interesting for us. Can’t wait to dissect Neutron Metadata service in a DVR environment. More to come …..