John Likes OpenStack

OpenInfra Live Episode 24: OpenStack and Ceph (2021-09-27)
<p>This Thursday at 14:00 UTC <a href="https://fmount.me">Francesco</a> and I will be in a panel on OpenInfra Live Episode 24: OpenStack and Ceph.</p>
<iframe class="BLOG_video_class" allowfullscreen="" youtube-src-id="zJVoleSpSOk" width="400" height="322" src="https://www.youtube.com/embed/zJVoleSpSOk"></iframe>

My tox cheat sheet (2020-09-04)
Install tox on a centos8 undercloud deployed by <a href="https://github.com/cjeanner/tripleo-lab">tripleo-lab</a>:
<pre>
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3 get-pip.py
pip install tox
</pre>
Render changes to tripleo docs:
<pre>
cd /home/stack/tripleo-docs
tox -e deploy-guide
</pre>
Check syntax errors before wasting CI time
<pre>
tox -e linters
tox -e pep8
</pre>
Run a specific unit test
<pre>
cd /home/stack/tripleo-common
tox -e py36 -- tripleo_common.tests.test_inventory.TestInventory.test_get_roles_by_service
cd /home/stack/tripleo-ansible
tox -e py36 -- tripleo_ansible.tests.modules.test_derive_hci_parameters.TestTripleoDeriveHciParameters
</pre>

Running tripleo-ansible molecule locally for dummies (2020-06-26)
<p>
I've had to re-teach myself how to do this so I'm writing my own notes.
<p>
Prerequisites:
<ol>
<li>Get a working undercloud (perhaps from <a href="https://github.com/cjeanner/tripleo-lab">tripleo-lab</a>)</li>
<li>git clone https://git.openstack.org/openstack/tripleo-ansible.git ; cd tripleo-ansible</li>
<li>Determine the test name: ls roles</li>
</ol>
<p>
Once you have your environment ready run a test with the name from step 3.
<pre>
./scripts/run-local-test tripleo_derived_parameters
</pre>
Some tests in CI are configured to use `--skip-tags`. You can do this for your local tests too by setting the appropriate environment variables. For example:
<pre>
export TRIPLEO_JOB_ANSIBLE_ARGS="--skip-tags run_ceph_ansible,run_uuid_ansible,ceph_client_rsync,clean_fetch_dir"
./scripts/run-local-test tripleo_ceph_run_ansible
</pre>
<p>This last tip should get <a href="https://review.opendev.org/738259">added</a> to <a href="https://docs.openstack.org/tripleo-ansible/latest/contributing_roles.html#local-testing-of-new-roles">the docs</a>.

Building a Ceph-powered Cloud: Deploying a containerized Red Hat Ceph Storage 4 cluster for Red Hat OpenStack Platform 16 (2020-06-03)
<a href="https://www.redhat.com/en/blog/building-ceph-powered-cloud-deploying-containerized-red-hat-ceph-storage-4-cluster-red-hat-open-stack-platform-16">https://www.redhat.com/en/blog/building-ceph-powered-cloud-deploying-containerized-red-hat-ceph-storage-4-cluster-red-hat-open-stack-platform-16</a>

Notes on testing a tripleo-common mistral patch (2019-07-03)
<p>
I recently ran into
<a href="https://bugs.launchpad.net/tripleo/+bug/1834094">
bug 1834094</a> and wanted to test the
<a href="https://review.opendev.org/#/c/668560/">
proposed fix</a>. These are my notes if I have to do this again.
</p>
<h3>Get a patched container</h3>
<p>
Because the mistral-executor is running as a container
on the undercloud I needed to build a new container and
TripleO's
<a href="https://docs.openstack.org/tripleo-docs/latest/install/advanced_deployment/container_image_prepare.html">Container Image Preparation</a>
helped me do this without too much trouble.
</p>
<p>
As described in the Container Image Preparation docs, I had already
downloaded a local copy of the containers to my undercloud by
running the following:
</p>
<pre>
time sudo openstack tripleo container image prepare \
-e ~/train/containers.yaml \
--output-env-file ~/containers-env-file.yaml
</pre>
where ~/train/containers.yaml has the following:
<pre>
---
parameter_defaults:
NeutronMechanismDrivers: ovn
ContainerImagePrepare:
- push_destination: 192.168.24.1:8787
set:
ceph_image: daemon
ceph_namespace: docker.io/ceph
ceph_tag: v4.0.0-stable-4.0-nautilus-centos-7-x86_64
name_prefix: centos-binary
namespace: docker.io/tripleomaster
tag: current-tripleo
</pre>
<p>
I now want to download the same set of containers to my undercloud
but I want the mistral-executor container to have the
<a href="https://review.opendev.org/#/c/668560/">
proposed fix</a>. If I visit the review and click download
I can see the patch is at refs/changes/60/668560/3
and I can pass this information to TripleO's
<a href="https://docs.openstack.org/tripleo-docs/latest/install/advanced_deployment/container_image_prepare.html">Container Image Preparation</a>
so that it builds me a container with that patch applied.
</p>
<p>
To do this I update my containers.yaml to exclude the mistral-executor
container from the usual tags with the excludes list directive and then
create a separate section with the includes directive specific to the
mistral-executor container.
</p>
<p>
Within this new section I ask that the tripleo-modify-image ansible
role pull that patch and apply it to that source image.
</p>
<pre>
---
parameter_defaults:
NeutronMechanismDrivers: ovn
ContainerImagePrepare:
- push_destination: 192.168.24.1:8787
set:
ceph_image: daemon
ceph_namespace: docker.io/ceph
ceph_tag: v4.0.0-stable-4.0-nautilus-centos-7-x86_64
name_prefix: centos-binary
namespace: docker.io/tripleomaster
tag: current-tripleo
excludes: [mistral-executor]
- push_destination: 192.168.24.1:8787
set:
name_prefix: centos-binary
namespace: docker.io/tripleomaster
tag: current-tripleo
modify_role: tripleo-modify-image
modify_append_tag: "-devel-ps3"
modify_vars:
tasks_from: dev_install.yml
source_image: docker.io/tripleomaster/centos-binary-mistral-executor:current-tripleo
refspecs:
-
project: tripleo-common
refspec: refs/changes/60/668560/3
includes: [mistral-executor]
</pre>
<p>
When I then run the `sudo openstack tripleo container image prepare` command
I see that it took a few extra steps to create my new container image.
</p>
<pre>
Writing manifest to image destination
Storing signatures
INFO[0005] created - from /var/lib/containers/storage/overlay/10c5e9ec709991e7eb6cbbf99c08d87f9f728c1644d64e3b070bc3c81adcbc03/diff
and /var/lib/containers/storage/overlay-layers/10c5e9ec709991e7eb6cbbf99c08d87f9f728c1644d64e3b070bc3c81adcbc03.tar-split.gz (wrote 150320640 bytes)
Completed modify and upload for image docker.io/tripleomaster/centos-binary-mistral-executor:current-tripleo
Removing local copy of 192.168.24.1:8787/tripleomaster/centos-binary-mistral-executor:current-tripleo
Removing local copy of 192.168.24.1:8787/tripleomaster/centos-binary-mistral-executor:current-tripleo-devel-ps3
Output env file exists, moving it to backup.
</pre>
<p>
If I were deploying the mistral container in the overcloud I could just
use 'openstack overcloud deploy ... -e ~/containers-env-file.yaml' and
be done, but because I need to replace my mistral-executor container on
my undercloud I have to do a few manual steps.
</p>
<h3>Run the patched container on the undercloud</h3>
<p>
My undercloud is ready to serve the patched mistral-executor container but it doesn't yet have its own copy of it to run; i.e. I only see the original container:
</p>
<pre>
(undercloud) [stack@undercloud train]$ sudo podman images | grep exec
docker.io/tripleomaster/centos-binary-mistral-executor current-tripleo 1f0ed5edc023 9 days ago 1.78 GB
(undercloud) [stack@undercloud train]$
</pre>
However, the same undercloud will serve it from the following URL:
<pre>
(undercloud) [stack@undercloud train]$ grep executor ~/containers-env-file.yaml
ContainerMistralExecutorImage: 192.168.24.1:8787/tripleomaster/centos-binary-mistral-executor:current-tripleo-devel-ps3
(undercloud) [stack@undercloud train]$
</pre>
So I pull it down in order to run it on the undercloud:
<pre>
sudo podman pull 192.168.24.1:8787/tripleomaster/centos-binary-mistral-executor:current-tripleo-devel-ps3
</pre>
I now want to stop the running mistral-executor container and start my new one in its place.
As per <a href="https://docs.openstack.org/tripleo-docs/latest/install/containers_deployment/tips_tricks.html#debugging-with-paunch">
Debugging with Paunch</a> I can use the print-cmd action to extract the command which is used
to start the mistral-executor container and save it to a shell script:
<pre>
sudo paunch debug --file /var/lib/tripleo-config/container-startup-config-step_4.json --container mistral_executor --action print-cmd > start_executor.sh
</pre>
I'll also append the exact container image name to the shell script:
<pre>
sudo podman images | grep ps3 >> start_executor.sh
</pre>
Next I'll edit the script so that it uses the new image and so that the container is named mistral_executor:
<pre>
vim start_executor.sh
</pre>
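<p>The exact edit depends on what print-cmd produced; the goal is to point the run command at the :current-tripleo-devel-ps3 image and to drop any generated suffix from the container name. A rough sketch of the same edit with sed (illustrative only; check the resulting script before running it):</p>
<pre>
# use the patched image tag in the run command
sed -i 's|mistral-executor:current-tripleo|mistral-executor:current-tripleo-devel-ps3|' start_executor.sh
# pin the container name to mistral_executor (no generated suffix)
sed -i 's|--name mistral_executor[^ ]*|--name mistral_executor|' start_executor.sh
</pre>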
Before I restart the container I'll prove that the current container isn't running the patch (the same command later will prove that it is).
<pre>
(undercloud) [stack@undercloud train]$ sudo podman exec mistral_executor grep render /usr/lib/python2.7/site-packages/tripleo_common/utils/config.py
# string so it's rendered in a readable format.
template_data = deployment_template.render(
template_data = host_var_server_template.render(
(undercloud) [stack@undercloud train]$
</pre>
Stop the mistral-executor container with systemd (otherwise it will automatically restart).
<pre>
sudo systemctl stop tripleo_mistral_executor.service
</pre>
Remove the container with podman to ensure the name is not in use:
<pre>
sudo podman rm mistral_executor
</pre>
Start the new container:
<pre>
sudo bash start_executor.sh
</pre>
and now I'll verify that my new container does have the patch:
<pre>
(undercloud) [stack@undercloud train]$ sudo podman exec mistral_executor grep render /usr/lib/python2.7/site-packages/tripleo_common/utils/config.py
def render_network_config(self, stack, config_dir, server_roles):
# string so it's rendered in a readable format.
template_data = deployment_template.render(
template_data = host_var_server_template.render(
self.render_network_config(stack, config_dir, server_roles)
(undercloud) [stack@undercloud train]$
</pre>
As a bonus, I can also see that it fixed the bug:
<pre>
(undercloud) [stack@undercloud tripleo-heat-templates]$ openstack overcloud config download --config-dir config-download
Starting config-download export...
config-download export successful
Finished config-download export.
Extracting config-download...
The TripleO configuration has been successfully generated into: config-download
(undercloud) [stack@undercloud tripleo-heat-templates]$
</pre>

How do I re-run only ceph-ansible when using tripleo config-download? (2019-01-29)
<p>After config-download runs the first time, you may do the following:</p>
<pre>
cd /var/lib/mistral/config-download/
bash ansible-playbook-command.sh --tags external_deploy_steps
</pre>
<p>The above runs only the external deploy steps, which for the ceph-ansible integration, means run the ansible which generates the inventory and then execute ceph-ansible.</p>
<p>More on this in <a href="https://docs.openstack.org/tripleo-docs/latest/install/advanced_deployment/ansible_config_download.html#manual-config-download">TripleO config-download User’s Guide: Deploying with Ansible</a>.</p>
<p>
If you're using the <a href="https://docs.openstack.org/tripleo-docs/latest/install/containers_deployment/standalone.html">standalone deployer</a>, then config-download does not provide the ansible-playbook-command.sh. You can workaround this by doing the following:</p>
<pre>
cd /root/undercloud-ansible-su_6px97
ansible -i inventory.yaml -m ping all
ansible-playbook -i inventory.yaml -b deploy_steps_playbook.yaml --tags external_deploy_steps
</pre>
<p>The above makes the following assumptions:</p>
<ul>
<li>You ran standalone with `--output-dir=$HOME` as root and that undercloud-ansible-su_6px97 was created by config download and contains the downloaded playbooks. Use `ls -ltr` to find the latest version.</li>
<li>If you're using the newer python3-only versions you ran something like `ln -s $(which ansible-3) /usr/local/bin/ansible`</li>
<li>That config-download already generated the overcloud inventory.yaml (the second command above is just to test that the inventory is working)</li>
</ul>

ceph-ansible podman with vagrant (2019-01-07)
<p>
These are just my notes on how I got vagrant with libvirt working on
CentOS7 and then used ceph-ansible's fedora29 podman tests to deploy a
containerized ceph nautilus preview cluster without docker. I'm doing
this in hopes of hooking Ceph into the
<a href="http://my1.fr/blog/openstack-containerization-with-podman-part-1-undercloud/">
new podman TripleO deploys</a>.
<h3>Configure Vagrant with libvirt on CentOS7</h3>
<p>
I already have a
<a href="http://blog.johnlikesopenstack.com/2018/08/pc-for-tripleo-quickstart.html">
CentOS7 machine I used for tripleo quickstart</a>.
I did the following to get vagrant working on it with libvirt.
</p>
<p>1. Create a vagrant user</p>
<pre>
sudo useradd vagrant
sudo usermod -aG wheel vagrant
sudo usermod --append --groups libvirt vagrant
sudo su - vagrant
mkdir .ssh
chmod 700 .ssh/
cd .ssh/
curl https://github.com/fultonj.keys > authorized_keys
chmod 600 authorized_keys
</pre>
<p>Continue as the vagrant user.</p>
<p>2. Install the Vagrant and other RPMs</p>
Download the CentOS Vagrant RPM
from <a href="https://www.vagrantup.com/downloads.html">https://www.vagrantup.com/downloads.html</a> and install other RPMs needed for it to work with libvirt.
<pre>
sudo yum install vagrant_2.2.2_x86_64.rpm
sudo yum install qemu libvirt libvirt-devel ruby-devel gcc qemu-kvm
vagrant plugin install vagrant-libvirt
</pre>
<p>Note that I already had many of the libvirt deps above from quickstart.</p>
<p>3. Get a CentOS7 box for verification</p>
Download the <a href="https://app.vagrantup.com/centos/boxes/7">centos/7 box</a>.
<pre>
[vagrant@hamfast ~]$ vagrant box add centos/7
==> box: Loading metadata for box 'centos/7'
box: URL: https://vagrantcloud.com/centos/7
This box can work with multiple providers! The providers that it
can work with are listed below. Please review the list and choose
the provider you will be working with.
1) hyperv
2) libvirt
3) virtualbox
4) vmware_desktop
Enter your choice: 2
==> box: Adding box 'centos/7' (v1811.02) for provider: libvirt
box: Downloading: https://vagrantcloud.com/centos/boxes/7/versions/1811.02/providers/libvirt.box
box: Download redirected to host: cloud.centos.org
==> box: Successfully added box 'centos/7' (v1811.02) for 'libvirt'!
[vagrant@hamfast ~]$
</pre>
Create a Vagrant file for it with `vagrant init centos/7`.
<p>4. Configure Vagrant to use a custom storage pool (Optional)</p>
<p>
Because I was already using libvirt directly with an images pool,
vagrant was unable to download the centos/7 system. I like this as I
want to keep my images pool separate for when I use libvirt directly.
To make Vagrant happy I created my own pool for it and added the
following to my Vagrantfile:</p>
<pre>
Vagrant.configure("2") do |config|
config.vm.provider :libvirt do |libvirt|
libvirt.storage_pool_name = "vagrant_images"
end
end
</pre>
<p>After doing the above `vagrant up` worked for me.</p>
<!--
<h3>Details on creating my own storage pool</h3>
<p>The error:</p>
<pre>
[vagrant@hamfast ~]$ vagrant up
Bringing machine 'default' up with 'libvirt' provider...
==> default: Checking if box 'centos/7' is up to date...
There was error while creating libvirt storage pool: Call to virStoragePoolDefineXML failed: operation failed: Storage source conflict with pool: 'images'
[vagrant@hamfast ~]$ sudo virsh pool-list
Name State Autostart
-------------------------------------------
images active yes
oooq_pool active yes
[vagrant@hamfast ~]$
</pre>
<p>Create a new pool directory</p>
<pre>
mkdir /var/lib/libvirt/vagrant_images
</pre>
<p>Create a new pool definition file</p>
<pre>
[root@hamfast ~]# uuidgen
bad53f21-1e4f-4b5a-83c6-5dbe9ff335eb
[root@hamfast ~]# cd /etc/libvirt/storage
[root@hamfast storage]# vi vagrant_images.xml
[root@hamfast storage]# cat vagrant_images.xml
<pool type='dir'>
<name>vagrant_images</name>
<uuid>bad53f21-1e4f-4b5a-83c6-5dbe9ff335eb</uuid>
<capacity unit='bytes'>0</capacity>
<allocation unit='bytes'>0</allocation>
<available unit='bytes'>0</available>
<source>
</source>
<target>
<path>/var/lib/libvirt/vagrant_images</path>
</target>
</pool>
[root@hamfast storage]#
</pre>
<p>Configure the pool to start</p>
<pre>
sudo virsh pool-define /etc/libvirt/storage/vagrant_images.xml
sudo virsh pool-start vagrant_images
sudo virsh pool-autostart vagrant_images
</pre>
<p>`vagrant up` should now work after modifying the Vagrantfile to
use the new pool.</p>
-->
<h3>Run ceph-ansible's Fedora 29 podman tests</h3>
<p>1. Clone ceph-ansible master</p>
<pre>git clone git@github.com:ceph/ceph-ansible.git; cd ceph-ansible</pre>
<p>2. Satisfy dependencies</p>
<pre>
sudo pip install -r requirements.txt
sudo pip install tox
cp vagrant_variables.yml.sample vagrant_variables.yml
cp site.yml.sample site.yml
</pre>
<p>Optionally: modify Vagrantfile for vagrant_images storage pool</p>
<p>3. Deploy with the container_podman</p>
<pre>
tox -e dev-container_podman -- --provider=libvirt
</pre>
<p>The above will result in tox triggering vagrant to create the 12
virtual machines listed below and then ceph-ansible will install Ceph on
them.</p>
<p>4. Inspect Deployment</p>
<p>Verify the virtual machines are running:</p>
<pre>
[vagrant@hamfast ~]$ cd ~/ceph-ansible/tests/functional/fedora/29/container-podman
[vagrant@hamfast container-podman]$ cp vagrant_variables.yml.sample vagrant_variables.yml
[vagrant@hamfast container-podman]$ vagrant status
Current machine states:
mgr0 running (libvirt)
client0 running (libvirt)
client1 running (libvirt)
rgw0 running (libvirt)
mds0 running (libvirt)
rbd-mirror0 running (libvirt)
iscsi-gw0 running (libvirt)
mon0 running (libvirt)
mon1 running (libvirt)
mon2 running (libvirt)
osd0 running (libvirt)
osd1 running (libvirt)
This environment represents multiple VMs. The VMs are all listed
above with their current state. For more information about a specific
VM, run `vagrant status NAME`.
[vagrant@hamfast container-podman]$
</pre>
<p>Connect to a monitor and see that it's running Ceph containers</p>
<pre>
[vagrant@hamfast container-podman]$ vagrant ssh mon0
Last login: Mon Jan 7 17:11:28 2019 from 192.168.121.1
[vagrant@mon0 ~]$
[vagrant@mon0 ~]$ sudo podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c494695eb0c2 docker.io/ceph/daemon:latest-master /opt/ceph-container... 4 hours ago Up 4 hours ago ceph-mgr-mon0
dbabf02df984 docker.io/ceph/daemon:latest-master /opt/ceph-container... 4 hours ago Up 4 hours ago ceph-mon-mon0
[vagrant@mon0 ~]$
[vagrant@mon0 ~]$ sudo podman images
REPOSITORY TAG IMAGE ID CREATED SIZE
docker.io/ceph/daemon latest-master 24fdc8c3cb3f 4 weeks ago 726MB
[vagrant@mon0 ~]$
[vagrant@mon0 ~]$ which docker
/usr/bin/which: no docker in (/home/vagrant/.local/bin:/home/vagrant/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin)
[vagrant@mon0 ~]$
</pre>
Observe the status of the Ceph cluster:
<pre>
[vagrant@mon0 ~]$ sudo podman exec dbabf02df984 ceph -s
cluster:
id: 9d2599f2-aec7-4c7c-a88e-7a8d39ebb557
health: HEALTH_WARN
application not enabled on 1 pool(s)
services:
mon: 3 daemons, quorum mon0,mon1,mon2 (age 71m)
mgr: mon1(active, since 70m), standbys: mon2, mon0
mds: cephfs-1/1/1 up {0=mds0=up:active}
osd: 4 osds: 4 up (since 68m), 4 in (since 68m)
rbd-mirror: 1 daemon active
rgw: 1 daemon active
data:
pools: 13 pools, 124 pgs
objects: 194 objects, 3.5 KiB
usage: 54 GiB used, 71 GiB / 125 GiB avail
pgs: 124 active+clean
[vagrant@mon0 ~]$
</pre>
<p>Observe the installed versions:</p>
<pre>
[vagrant@mon0 ~]$ sudo podman exec -ti dbabf02df984 /bin/bash
[root@mon0 /]# cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)
[root@mon0 /]#
[root@mon0 /]# ceph --version
ceph version 14.0.1-1496-gaf96e16 (af96e16271b620ab87570b1190585fffc06daeac) nautilus (dev)
[root@mon0 /]#
</pre>
<p>Observe the OSDs</p>
<pre>
[vagrant@hamfast container-podman]$ vagrant ssh osd0
[vagrant@osd0 ~]$ sudo su -
[root@osd0 ~]# podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4fe23502592c docker.io/ceph/daemon:latest-master /opt/ceph-container... About an hour ago Up About an hour ago ceph-osd-2
f582b4311076 docker.io/ceph/daemon:latest-master /opt/ceph-container... About an hour ago Up About an hour ago ceph-osd-0
[root@osd0 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 50G 0 disk
sdb 8:16 0 50G 0 disk
├─test_group-data--lv1 253:1 0 25G 0 lvm
└─test_group-data--lv2 253:2 0 12.5G 0 lvm
sdc 8:32 0 50G 0 disk
├─sdc1 8:33 0 25G 0 part
└─sdc2 8:34 0 25G 0 part
└─journals-journal1 253:3 0 25G 0 lvm
vda 252:0 0 41G 0 disk
├─vda1 252:1 0 1G 0 part /boot
└─vda2 252:2 0 40G 0 part
└─atomicos-root 253:0 0 40G 0 lvm /sysroot
[root@osd0 ~]#
[root@osd0 ~]# podman exec 4fe23502592c cat var/lib/ceph/osd/ceph-2/type
bluestore
[root@osd0 ~]#
</pre>

Simulate edge deployments using TripleO Standalone (2019-01-04)
<p>My colleagues presented at OpenStack Summit Berlin on <a href="https://www.youtube.com/watch?v=A4l3vPMaJew">Distributed Hyperconvergence</a>. This includes using TripleO to deploy a central controller
node, extracting information from that central node, and then passing that information as input to a second TripleO deployment at a remote location ("on the edge of the network"). This edge deployment could host its own Ceph cluster which is collocated with compute nodes in its own availability zone. A third TripleO deployment could be added for a second remote edge deployment and users could then use the central deployment to schedule workloads per availability zone closer to where the workloads are needed.</p>
<p>You can simulate this type of deployment today with a single hypervisor and TripleO's standalone installer as per the <a href="https://docs.openstack.org/tripleo-docs/latest/install/containers_deployment/standalone.html#example-2-nodes-2-nic-using-remote-compute-with-tenant-and-provider-networks">newly merged upstream docs</a>.</p>

PC for tripleo quickstart (2018-08-28)
<p>I built a machine for running <a href="https://docs.openstack.org/tripleo-quickstart/latest/">TripleO Quickstart</a> at home.</p>
<p>My complete part list is on <a href="https://pcpartpicker.com/user/fultonj/saved/v9KLD3">pcpart picker</a> with the exception of the extra <a href="http://a.co/d/80QD1lp">Noctua NM-AM4 Mounting Kit</a> and <a href="http://a.co/d/8vO0rJ8">video card</a> (which I only used to install the OS).</p>
<p>I also have <a href="https://photos.app.goo.gl/Qh8yE8zrTsj355zT6">photos</a> from when I built it.</p>
<p>My <a href="https://github.com/fultonj/oooq/blob/b55063591208f10d3eacbf9c1cbbac8d4984b22e/under/nodes.yaml">nodes.yaml</a> gives me:
<ul>
<li>Three 9GB 2CPU controller nodes</li>
<li>Three 6GB 2CPU ceph storage nodes</li>
<li>One 3GB 2CPU compute node (that's enough to spawn one nested VM for a quick test)</li>
<li>One 13GB 8CPU undercloud node</li>
</ul>
That leaves less than 2GB of RAM for the hypervisor, and all 16 vCPUs (8 cores * 2 threads) are allocated to VMs, so I'm pushing it a little.
<p>
When using this system with the same nodes.yaml my run times are as follows for <a href="http://lists.openstack.org/pipermail/openstack-dev/2018-August/133792.html">Rocky RC1</a>:
<ul>
<li>undercloud install of rocky: 43m44.118s</li>
<li>overcloud install of rocky: 49m51.369s</li>
</ul>

Updating ceph-ansible in a containerized undercloud (2018-08-22)
<h3>Update</h3>
<p>What's below won't be the case for much longer because <a href="https://review.rdoproject.org/r/#/c/16362">ceph-ansible will become a dependency of TripleO</a> and <a href="https://review.openstack.org/#/c/604357">the mistral-executor container will bind mount the ceph-ansible source directory on the container host</a>. What's in this post could still be used as an example of updating a package in a TripleO container, but don't be misled into thinking it is still the way to update ceph-ansible.
<h3>Original Content</h3>
<p>
In Rocky the TripleO undercloud will run containers. If you're using TripleO to deploy Ceph in Rocky, this means that ceph-ansible shouldn't be installed on your undercloud server directly because your undercloud server is a container host. Instead ceph-ansible should be installed in the mistral-executor container because, as per <a href="https://docs.openstack.org/tripleo-docs/latest/install/advanced_deployment/ansible_config_download.html">config-download</a>, that is the container which runs Ansible to configure the overcloud.
<p>
If you install ceph-ansible on your undercloud host it will lead to confusion about what version of ceph-ansible is being used when you try to debug it. Instead install it on the mistral-executor container.
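<p>A quick way to confirm which version the deployment will actually use is to query the package inside that container; a one-liner sketch, assuming the container is named mistral_executor as in the listing below:</p>
<pre>
sudo docker exec $(sudo docker ps -qf name=mistral_executor) rpm -q ceph-ansible
</pre>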
<p>
So this is the new normal in Rocky on an undercloud that can deploy Ceph:
<pre>
[root@undercloud-0 ~]# rpm -q ceph-ansible
package ceph-ansible is not installed
[root@undercloud-0 ~]#
[root@undercloud-0 ~]# docker ps | grep mistral
0a77642d8d10 192.168.24.1:8787/tripleomaster/openstack-mistral-api:2018-08-20.1 "kolla_start" 4 hours ago Up 4 hours (healthy) mistral_api
c32898628b4b 192.168.24.1:8787/tripleomaster/openstack-mistral-engine:2018-08-20.1 "kolla_start" 4 hours ago Up 4 hours (healthy) mistral_engine
c972b3e74cab 192.168.24.1:8787/tripleomaster/openstack-mistral-event-engine:2018-08-20.1 "kolla_start" 4 hours ago Up 4 hours (healthy) mistral_event_engine
d52708e0bab0 192.168.24.1:8787/tripleomaster/openstack-mistral-executor:2018-08-20.1 "kolla_start" 4 hours ago Up 4 hours (healthy) mistral_executor
[root@undercloud-0 ~]#
[root@undercloud-0 ~]# docker exec -ti d52708e0bab0 rpm -q ceph-ansible
ceph-ansible-3.1.0-0.1.rc18.el7cp.noarch
[root@undercloud-0 ~]#
</pre>
<p>
So what happens if you're in a situation where you want to try a
different ceph-ansible version on your undercloud?
<p>
In the next example I'll update my mistral-executor container from
ceph-ansible rc18 to rc21. These commands are just variations of the
upstream <a href="https://docs.openstack.org/tripleo-docs/latest/install/containers_deployment/tips_tricks.html#testing-a-code-fix-in-a-container">documentation</a>
but with a focus on updating the undercloud, not overcloud, container.
Here's the image I want to update:
<pre>
[root@undercloud-0 ~]# docker images | grep mistral-executor
192.168.24.1:8787/tripleomaster/openstack-mistral-executor 2018-08-20.1 740bb6f24755 2 days ago 1.05 GB
[root@undercloud-0 ~]#
</pre>
I have a copy of ceph-ansible-3.1.0-0.1.rc21.el7cp.noarch.rpm in my current working directory
<pre>
[root@undercloud-0 ~]# mkdir -p rc21
[root@undercloud-0 ~]# cat > rc21/Dockerfile <<EOF
> FROM 192.168.24.1:8787/tripleomaster/openstack-mistral-executor:2018-08-20.1
> USER root
> COPY ceph-ansible-3.1.0-0.1.rc21.el7cp.noarch.rpm .
> RUN yum install -y ceph-ansible-3.1.0-0.1.rc21.el7cp.noarch.rpm
> USER mistral
> EOF
[root@undercloud-0 ~]#
</pre>
So again that file is (for copy/paste later):
<pre>
[root@undercloud-0 ~]# cat rc21/Dockerfile
FROM 192.168.24.1:8787/tripleomaster/openstack-mistral-executor:2018-08-20.1
USER root
COPY ceph-ansible-3.1.0-0.1.rc21.el7cp.noarch.rpm .
RUN yum install -y ceph-ansible-3.1.0-0.1.rc21.el7cp.noarch.rpm
USER mistral
[root@undercloud-0 ~]#
</pre>
Build the new container
<pre>
[root@undercloud-0 ~]# docker build --rm -t 192.168.24.1:8787/tripleomaster/openstack-mistral-executor:2018-08-20.1 ~/rc21
Sending build context to Docker daemon 221.2 kB
Step 1/5 : FROM 192.168.24.1:8787/tripleomaster/openstack-mistral-executor:2018-08-20.1
---> 740bb6f24755
Step 2/5 : USER root
---> Using cache
---> 8d7f2e7f9993
Step 3/5 : COPY ceph-ansible-3.1.0-0.1.rc21.el7cp.noarch.rpm .
---> 54fbf7185eec
Removing intermediate container 9afe4b16ba95
Step 4/5 : RUN yum install -y ceph-ansible-3.1.0-0.1.rc21.el7cp.noarch.rpm
---> Running in e80fce669471
Examining ceph-ansible-3.1.0-0.1.rc21.el7cp.noarch.rpm: ceph-ansible-3.1.0-0.1.rc21.el7cp.noarch
Marking ceph-ansible-3.1.0-0.1.rc21.el7cp.noarch.rpm as an update to ceph-ansible-3.1.0-0.1.rc18.el7cp.noarch
Resolving Dependencies
--> Running transaction check
---> Package ceph-ansible.noarch 0:3.1.0-0.1.rc18.el7cp will be updated
---> Package ceph-ansible.noarch 0:3.1.0-0.1.rc21.el7cp will be an update
--> Finished Dependency Resolution
Dependencies Resolved
================================================================================
Package
Arch Version Repository Size
================================================================================
Updating:
ceph-ansible
noarch 3.1.0-0.1.rc21.el7cp /ceph-ansible-3.1.0-0.1.rc21.el7cp.noarch 1.0 M
Transaction Summary
================================================================================
Upgrade 1 Package
Total size: 1.0 M
Downloading packages:
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Updating : ceph-ansible-3.1.0-0.1.rc21.el7cp.noarch 1/2
Cleanup : ceph-ansible-3.1.0-0.1.rc18.el7cp.noarch 2/2
Verifying : ceph-ansible-3.1.0-0.1.rc21.el7cp.noarch 1/2
Verifying : ceph-ansible-3.1.0-0.1.rc18.el7cp.noarch 2/2
Updated:
ceph-ansible.noarch 0:3.1.0-0.1.rc21.el7cp
Complete!
---> 41a804e032f5
Removing intermediate container e80fce669471
Step 5/5 : USER mistral
---> Running in bc0db608c299
---> f5ad6b3ed630
Removing intermediate container bc0db608c299
Successfully built f5ad6b3ed630
[root@undercloud-0 ~]#
</pre>
Upload the new container to the registry:
<pre>
[root@undercloud-0 ~]# docker push 192.168.24.1:8787/tripleomaster/openstack-mistral-executor:2018-08-20.1
The push refers to a repository [192.168.24.1:8787/tripleomaster/openstack-mistral-executor]
606ffb827a1b: Pushed
fc3710ffba43: Pushed
4e770d9096db: Layer already exists
4d7e8476e5cd: Layer already exists
9eef3d74eb8b: Layer already exists
977c2f6f6121: Layer already exists
00860a9b126f: Layer already exists
366de6e5861a: Layer already exists
2018-08-20.1: digest: sha256:50aae064d930e8d498702673c6703b70e331d09e966c6f436b683bb152e80337 size: 2007
[root@undercloud-0 ~]#
</pre>
Now we see the new f5ad6b3ed630 image in addition to the old one:
<pre>
[root@undercloud-0 ~]# docker images | grep mistral-executor
192.168.24.1:8787/tripleomaster/openstack-mistral-executor 2018-08-20.1 f5ad6b3ed630 4 minutes ago 1.09 GB
192.168.24.1:8787/tripleomaster/openstack-mistral-executor <none> 740bb6f24755 2 days ago 1.05 GB
[root@undercloud-0 ~]#
</pre>
The old container is still running though:
<pre>
[root@undercloud-0 ~]# docker ps | grep mistral
373f8c17ce74 192.168.24.1:8787/tripleomaster/openstack-mistral-api:2018-08-20.1 "kolla_start" 6 hours ago Up 6 hours (healthy) mistral_api
4f171deef184 192.168.24.1:8787/tripleomaster/openstack-mistral-engine:2018-08-20.1 "kolla_start" 6 hours ago Up 6 hours (healthy) mistral_engine
8f25657237cd 192.168.24.1:8787/tripleomaster/openstack-mistral-event-engine:2018-08-20.1 "kolla_start" 6 hours ago Up 6 hours (healthy) mistral_event_engine
a7fb6df4e7cf 740bb6f24755 "kolla_start" 6 hours ago Up 6 hours (healthy) mistral_executor
[root@undercloud-0 ~]#
</pre>
Merely updating the image doesn't restart the container, and neither does `docker restart a7fb6df4e7cf`. Instead I need to stop it and start it again, but there's a lot that goes into starting these containers with the correct parameters.
<p>
The upstream docs section on <a href="https://docs.openstack.org/tripleo-docs/latest/install/containers_deployment/tips_tricks.html#debugging-with-paunch">Debugging with Paunch</a> shows me a command to get the exact command that was used to start my container. I just needed to use `paunch list | grep mistral` first to know I need to look at the tripleo_step4.
<pre>
[root@undercloud-0 ~]# paunch debug --file /var/lib/tripleo-config/docker-container-startup-config-step_4.json --container mistral_executor --action print-cmd
docker run --name mistral_executor-glzxsrmw --detach=true --env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS --net=host --health-cmd=/openstack/healthcheck --privileged=false --restart=always --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro --volume=/etc/puppet:/etc/puppet:ro --volume=/var/lib/kolla/config_files/mistral_executor.json:/var/lib/kolla/config_files/config.json:ro --volume=/var/lib/config-data/puppet-generated/mistral/:/var/lib/kolla/config_files/src:ro --volume=/run:/run --volume=/var/run/docker.sock:/var/run/docker.sock:rw --volume=/var/log/containers/mistral:/var/log/mistral --volume=/var/lib/mistral:/var/lib/mistral --volume=/usr/share/ansible/:/usr/share/ansible/:ro --volume=/var/lib/config-data/nova/etc/nova:/etc/nova:ro 192.168.24.1:8787/tripleomaster/openstack-mistral-executor:2018-08-20.1
[root@undercloud-0 ~]#
</pre>
Now that I know the command I can see my six-hour-old container:
<pre>
[root@undercloud-0 ~]# docker ps | grep mistral_executor
a7fb6df4e7cf 740bb6f24755 "kolla_start" 6 hours ago Up 12 minutes (healthy) mistral_executor
[root@undercloud-0 ~]#
</pre>
stop it
<pre>
[root@undercloud-0 ~]# docker stop a7fb6df4e7cf
a7fb6df4e7cf
[root@undercloud-0 ~]#
</pre>
ensure it's gone
<pre>
[root@undercloud-0 ~]# docker rm a7fb6df4e7cf
Error response from daemon: No such container: a7fb6df4e7cf
[root@undercloud-0 ~]#
</pre>
and then run the command I got from above to start the container and finally see my new container
<pre>
[root@undercloud-0 ~]# docker ps | grep mistral-executor
d8e4073441c0 192.168.24.1:8787/tripleomaster/openstack-mistral-executor:2018-08-20.1 "kolla_start" 14 seconds ago Up 13 seconds (health: starting) mistral_executor-glzxsrmw
[root@undercloud-0 ~]#
</pre>
Finally I confirm that my container has the new ceph-ansible package:
<pre>
(undercloud) [stack@undercloud-0 ~]$ docker exec -ti d8e4073441c0 rpm -q ceph-ansible
ceph-ansible-3.1.0-0.1.rc21.el7cp.noarch
(undercloud) [stack@undercloud-0 ~]$
</pre>
I was then able to deploy my overcloud and see that the rc21 version fixed a bug.

Tips on searching ceph-install-workflow.log on TripleO (2018-06-21)
<p>1. Only look at the logs relevant to the last run
<p>
/var/log/mistral/ceph-install-workflow.log will contain a concatenation of the ceph-ansible runs. The last N lines of the file will have what you're looking for, so what is N?
<p>
Determine how long the file is:
<pre>
[root@undercloud mistral]# wc -l ceph-install-workflow.log
20287 ceph-install-workflow.log
[root@undercloud mistral]#
</pre>
<p>Find the lines where previous ansible runs finished.
<pre>
[root@undercloud mistral]# grep -n failed=0 ceph-install-workflow.log
5425:2018-06-18 23:06:58,901 p=22256 u=mistral | 172.16.0.21 : ok=118 changed=19 unreachable=0 failed=0
5426:2018-06-18 23:06:58,901 p=22256 u=mistral | 172.16.0.23 : ok=81 changed=13 unreachable=0 failed=0
5427:2018-06-18 23:06:58,901 p=22256 u=mistral | 172.16.0.25 : ok=113 changed=18 unreachable=0 failed=0
5428:2018-06-18 23:06:58,901 p=22256 u=mistral | 172.16.0.27 : ok=38 changed=3 unreachable=0 failed=0
5429:2018-06-18 23:06:58,901 p=22256 u=mistral | 172.16.0.28 : ok=77 changed=13 unreachable=0 failed=0
5430:2018-06-18 23:06:58,901 p=22256 u=mistral | 172.16.0.29 : ok=58 changed=7 unreachable=0 failed=0
5431:2018-06-18 23:06:58,901 p=22256 u=mistral | 172.16.0.30 : ok=83 changed=18 unreachable=0 failed=0
5432:2018-06-18 23:06:58,902 p=22256 u=mistral | 172.16.0.31 : ok=110 changed=17 unreachable=0 failed=0
9948:2018-06-20 12:06:38,325 p=11460 u=mistral | 172.16.0.21 : ok=107 changed=12 unreachable=0 failed=0
9949:2018-06-20 12:06:38,326 p=11460 u=mistral | 172.16.0.23 : ok=69 changed=4 unreachable=0 failed=0
9950:2018-06-20 12:06:38,326 p=11460 u=mistral | 172.16.0.25 : ok=102 changed=11 unreachable=0 failed=0
9951:2018-06-20 12:06:38,326 p=11460 u=mistral | 172.16.0.27 : ok=26 changed=0 unreachable=0 failed=0
9952:2018-06-20 12:06:38,326 p=11460 u=mistral | 172.16.0.29 : ok=46 changed=5 unreachable=0 failed=0
9953:2018-06-20 12:06:38,326 p=11460 u=mistral | 172.16.0.30 : ok=70 changed=8 unreachable=0 failed=0
9954:2018-06-20 12:06:38,326 p=11460 u=mistral | 172.16.0.31 : ok=99 changed=10 unreachable=0 failed=0
14927:2018-06-20 23:14:57,881 p=7702 u=mistral | 172.16.0.23 : ok=118 changed=19 unreachable=0 failed=0
14928:2018-06-20 23:14:57,881 p=7702 u=mistral | 172.16.0.27 : ok=110 changed=17 unreachable=0 failed=0
14932:2018-06-20 23:14:57,881 p=7702 u=mistral | 172.16.0.34 : ok=113 changed=18 unreachable=0 failed=0
20255:2018-06-21 09:46:40,571 p=17564 u=mistral | 172.16.0.22 : ok=118 changed=19 unreachable=0 failed=0
20256:2018-06-21 09:46:40,571 p=17564 u=mistral | 172.16.0.26 : ok=134 changed=18 unreachable=0 failed=0
20257:2018-06-21 09:46:40,571 p=17564 u=mistral | 172.16.0.27 : ok=102 changed=14 unreachable=0 failed=0
20258:2018-06-21 09:46:40,571 p=17564 u=mistral | 172.16.0.28 : ok=113 changed=18 unreachable=0 failed=0
20260:2018-06-21 09:46:40,571 p=17564 u=mistral | 172.16.0.34 : ok=110 changed=17 unreachable=0 failed=0
[root@undercloud mistral]#
</pre>
<p>Subtract the last run's line number from the total file lines:
<pre>
[root@undercloud mistral]# echo $(( 20260 - 14932))
5328
[root@undercloud mistral]#
</pre>
<p>
Tail that many lines from the end of the file to see only the last run.
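<p>For example, with the numbers above (recompute the 5328 for your own log):</p>
<pre>
tail -n 5328 ceph-install-workflow.log | less
</pre>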
<p>
2. Identify the node(s) where the playbook run failed:
<p>
I know the PLAY RECAP in the last 100 lines of the relevant run will show failed=1 if there was a failure. Doing a grep for that will also show me the host:
<pre>
[root@undercloud mistral]# tail -5328 ceph-install-workflow.log | tail -100 | grep failed=1
2018-06-21 09:46:40,571 p=17564 u=mistral | 172.16.0.32 : ok=66 changed=14 unreachable=0 failed=1
[root@undercloud mistral]#
</pre>
<p>
Now that I know the host, I want to see which task it failed on, so I grep for 'failed:'.
Just grepping for failed won't help, as the log will be full of '"failed": false'.
<p>
In this case I extract out the failure:
<pre>
[root@undercloud mistral]# tail -5328 ceph-install-workflow.log | grep 172.16.0.32 | grep failed:
2018-06-21 09:46:06,093 p=17564 u=mistral | failed: [172.16.0.32 -> 172.16.0.22] (item=[{u'rule_name': u'', u'pg_num': 128, u'name': u'metrics'},
{'_ansible_parsed': True, 'stderr_lines': [u"Error ENOENT: unrecognized pool 'metrics'"], u'cmd': [u'docker', u'exec', u'ceph-mon-controller02',
u'ceph', u'--cluster', u'ceph', u'osd', u'pool', u'get', u'metrics', u'size'], u'end': u'2018-06-21 13:46:01.070270', '_ansible_no_log': False,
'_ansible_delegated_vars': {'ansible_delegated_host': u'172.16.0.22', 'ansible_host': u'172.16.0.22'}, '_ansible_item_result': True, u'changed':
True, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': False, u'_raw_params': u'docker exec ceph-mon-controller02
ceph --cluster ceph osd pool get metrics size', u'removes': None, u'creates': None, u'chdir': None, u'stdin': None}}, u'stdout': u'', u'start':
u'2018-06-21 13:46:00.729965', u'delta': u'0:00:00.340305', 'item': {u'rule_name': u'', u'pg_num': 128, u'name': u'metrics'}, u'rc': 2, u'msg':
u'non-zero return code', 'stdout_lines': [], 'failed_when_result': False, u'stderr': u"Error ENOENT: unrecognized pool 'metrics'",
'_ansible_ignore_errors': None, u'failed': False}]) => {"changed": false, "cmd": ["docker", "exec", "ceph-mon-controller02", "ceph",
"--cluster", "ceph", "osd", "pool", "create", "metrics", "128", "128", "replicated_rule", "1"], "delta": "0:00:01.421755", "end":
"2018-06-21 13:46:06.390381", "item": [{"name": "metrics", "pg_num": 128, "rule_name": ""}, {"_ansible_delegated_vars":
{"ansible_delegated_host": "172.16.0.22", "ansible_host": "172.16.0.22"}, "_ansible_ignore_errors": null, "_ansible_item_result":
true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": true, "cmd": ["docker", "exec", "ceph-mon-controller02",
"ceph", "--cluster", "ceph", "osd", "pool", "get", "metrics", "size"], "delta": "0:00:00.340305", "end": "2018-06-21 13:46:01.070270",
"failed": false, "failed_when_result": false, "invocation": {"module_args": {"_raw_params": "docker exec ceph-mon-controller02
ceph --cluster ceph osd pool get metrics size", "_uses_shell": false, "chdir": null, "creates": null, "executable": null,
"removes": null, "stdin": null, "warn": true}}, "item": {"name": "metrics", "pg_num": 128, "rule_name": ""}, "msg":
"non-zero return code", "rc": 2, "start": "2018-06-21 13:46:00.729965", "stderr": "Error ENOENT: unrecognized pool
'metrics'", "stderr_lines": ["Error ENOENT: unrecognized pool 'metrics'"], "stdout": "", "stdout_lines": []}],
"msg": "non-zero return code", "rc": 34, "start": "2018-06-21 13:46:04.968626", "stderr": "Error ERANGE:
pg_num 128 size 3 would mean 768 total pgs, which exceeds max 600 (mon_max_pg_per_osd 200 * num_in_osds 3)",
"stderr_lines": ["Error ERANGE: pg_num 128 size 3 would mean 768 total pgs, which exceeds max 600
(mon_max_pg_per_osd 200 * num_in_osds 3)"], "stdout": "", "stdout_lines": []}
...
[root@undercloud mistral]#
</pre>
<p>
So that's how I quickly find what went wrong in a ceph-ansible run when debugging a TripleO deployment.
<p>
3. Extra
<p>
You may be wondering what that error is.
<p>
There was a ceph-ansible issue where creating pools before the OSDs were running made the deployment fail because of the <a href="https://ceph.com/community/new-luminous-pg-overdose-protection">overdose protection check</a>. In the error above, creating the metrics pool with pg_num 128 at size 3 would have brought the cluster to 768 total PGs, which exceeds the limit of 600 (mon_max_pg_per_osd 200 * num_in_osds 3). This is something you can still hit if your PG numbers and OSDs are not aligned correctly (use <a href="https://ceph.com/pgcalc/">pgcalc</a>), but it is better to fail a deployment than to put production data on a misconfigured cluster. You could also hit it because of <a href="https://github.com/ceph/ceph-ansible/commit/9d5265fe11fb5c1d0058525e8508aba80a396a6b">this issue that ceph-ansible rc9 fixed</a> (technically it was fixed in an earlier version, but that version had other bugs so I recommend rc9).

TripleO Ceph Integration on the Road in June (2018-06-21)
<p>The first week of June I went to an upstream TripleO workshop in Brno. The labs we used are at <a href="https://github.com/redhat-openstack/tripleo-workshop">https://github.com/redhat-openstack/tripleo-workshop</a>.</p>
<p>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjk7y4QRS20hLzvV3lesMDkUMBkWRCCpEUa4mEIsIivlB6L6O6NkEqNwjLSer8JlYoDN22OC3-Yt9Z8ZbNCA0ssa5jEFKxrv5W_hBtz9t-7iKj0TfoUGaaM3udeFyXW0rPMY2v7K0jpdGGl/s1600/workshop-brno-june-2018.jpg" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjk7y4QRS20hLzvV3lesMDkUMBkWRCCpEUa4mEIsIivlB6L6O6NkEqNwjLSer8JlYoDN22OC3-Yt9Z8ZbNCA0ssa5jEFKxrv5W_hBtz9t-7iKj0TfoUGaaM3udeFyXW0rPMY2v7K0jpdGGl/s320/workshop-brno-june-2018.jpg" width="320" height="213" data-original-width="1600" data-original-height="1067" /></a></p>
<p>The third week of June I went to a downstream Red Hat OpenStack Platform event in Montreal for those deploying the upcoming version 13 in the field. I covered similar topics with respect to Ceph deployment via TripleO.</p>

Red Hat Summit 2018: HCI Lab (2018-04-25)
I will be at Red Hat Summit in SFO on May 8th jointly hosting the lab <a href="https://agenda.summit.redhat.com/SessionDetail.aspx?id=153599">Deploy a containerized HCI IaaS with OpenStack and Ceph</a>.

Debugging TripleO Ceph-Ansible Deployments (2017-09-08)
<p>
Starting in Pike it is possible to <a href="https://docs.openstack.org/tripleo-docs/latest/install/advanced_deployment/ceph_config.html">use TripleO to deploy Ceph in containers using ceph-ansible</a>. This is a guide to help you if there is a problem. It
asks questions, somewhat rhetorically, to help you track down the problem.
</p>
<h3>What does this error from openstack overcloud deploy... mean?</h3>
<p>
If TripleO's new Ceph deployment fails, then you'll see an error like the following:
<pre>
Stack overcloud CREATE_FAILED
overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
resource_type: OS::Mistral::ExternalResource
physical_resource_id: bb9e685c-fbe9-4573-8d74-2c053bc5de0d
status: CREATE_FAILED
status_reason: |
resources.WorkflowTasks_Step2_Execution: ERROR
Heat Stack create failed.
</pre>
<p>
TripleO installs the OS and configures networking and other base
services for OpenStack for the nodes during step 1 of its
five-step deployment. During step 2, a new type of Heat
<a href="https://github.com/openstack/heat/commit/725b404468bdd2c1bdbaf16e594515475da7bace">OS::Mistral::ExternalResource</a> is created which calls a new
<a href="https://github.com/openstack/tripleo-common/commit/fa0b9f52080580b7408dc6f5f2da6fc1dc07d500">Mistral workflow</a> which uses
<a href="https://github.com/openstack/tripleo-common/commit/e6c8a46f00436edfa5de92e97c3a390d90c3ce54">a new Mistral action to call an Ansible playook</a>.
The playbook that is called is
<a href="https://github.com/ceph/ceph-ansible/blob/master/site-docker.yml.sample">
site-docker.yml.sample</a> from ceph-ansible.
Giulio covers this in more detail in
<a href="http://giuliofidente.com/2017/07/understanding-ceph-ansible-in-tripleo.html">
Understanding ceph-ansible in TripleO</a>.
The above error message indicates that Heat was able to call Mistral,
but that the Mistral workflow failed. So, the next place to look is
the Mistral logs on the undercloud to see if the ceph-ansible site-docker.yml
playbook ran.
</p>
<h3>Did the ceph-ansible playbook run?</h3>
<p>The most helpful file for debugging TripleO ceph-ansible deployments is:
<pre>
/var/log/mistral/ceph-install-workflow.log
</pre>
If it doesn't exist or is empty, then the ceph-ansible playbook run did not happen.
</p>
<p>
If it does exist, then it's the key to solving the
problem! Read it as it will contain the output of the ceph-ansible run
which you can use to debug ceph-ansible as you normally
would. The <a href="http://docs.ceph.com/ceph-ansible/master">
ceph-ansible docs</a> should help. Once you think the environment
has been changed so that you won't have the problem (details
on that below), then re-run the `openstack overcloud deploy ...`
command, and after TripleO does its normal checks, it will
re-run the playbook. Because ceph-ansible and TripleO are
idempotent, this process may be repeated as necessary.
</p>
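<p>For reference, "re-run the deploy command" just means repeating the exact same command and environment files used originally; a minimal sketch (the environment file paths are illustrative, use whatever your deployment already passes):</p>
<pre>
openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
  -e ~/ceph.yaml
</pre>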
<h3>Why didn't the ceph-ansible playbook run?</h3>
<p>The following will show the playbook call to ceph-ansible:</p>
<pre>
cd /var/log/mistral/
grep site-docker.yml.sample executor.log | grep ansible-playbook
</pre>
<p>If there's an error during the playbook run, then it should look something like this...</p>
<pre>
2017-09-06 12:13:22.181 20608 ERROR mistral.executors.default_executor Command:
ansible-playbook -v /usr/share/ceph-ansible/site-docker.yml.sample --user tripleo-admin --become ...
</pre>
<p>
If you don't see a playbook call like the above, then the Mistral
tasks that set up the environment for a ceph-ansible run failed.
</p>
<h3>What does Mistral do to prepare the environment to run ceph-ansible?</h3>
<p>
A copy of the Mistral workbook which prepares the overcloud and undercloud
to run ceph-ansible, and then runs it, is in:</p>
<pre>
/usr/share/tripleo-common/workbooks/ceph-ansible.yaml
</pre>
<p>The Mistral tasks do the following:</p>
<ul>
<li>Configure the SSH key-pairs so the undercloud can run Ansible
tasks on the overcloud nodes as the tripleo-admin user</li>
<li>Create a temporary fetch directory for ceph-ansible to use to copy
configs between overcloud nodes</li>
<li>Build a temporary Ansible inventory in a file like
/tmp/ansible-mistral-actionSYRh6Q/inventory.yaml</li>
<li>Set the <a href="http://docs.ansible.com/ansible/latest/intro_configuration.html#forks">Ansible fork count</a> to the number of nodes (but not >100)</li>
<li>Run the ceph-ansible site-docker.yml.sample playbook</li>
<li>Clean up temporary files</li>
</ul>
<p>
To check the details of the Mistral tasks used by ceph-ansible,
extract the workflow's UUID with the following:
</p>
<pre>
WORKFLOW='tripleo.storage.v1.ceph-install'
UUID=$(mistral execution-list | grep $WORKFLOW | awk {'print $2'} | tail -1)
</pre>
<p>Then use the ID to examine each task:</p>
<pre>
for TASK_ID in $(mistral task-list $UUID | awk {'print $2'} | egrep -v 'ID|^$'); do
mistral task-get $TASK_ID
mistral task-get-result $TASK_ID | jq . | sed -e 's/\\n/\n/g' -e 's/\\"/"/g'
done
</pre>
<p>
If you really need to update the workbook itself, you can modify a copy
and upload it with the following, but please see if your problem can instead
be solved by simply overriding the default values in a Heat environment file
as per the <a href="https://docs.openstack.org/tripleo-docs/latest/install/advanced_deployment/ceph_config.html">documentation</a>.</p>
<pre>
source ~/stackrc
cp /usr/share/tripleo-common/workbooks/ceph-ansible.yaml .
vi ceph-ansible.yaml
mistral workbook-update ceph-ansible.yaml
</pre>
<h3>I already know ceph-ansible; how do I edit the files in group_vars?</h3>
<p>
Please don't. It will break the TripleO integration. Instead please use
TripleO as usual, and override the default values in a Heat environment file
like ceph.yaml, which you then add with -e to your openstack overcloud deploy
command as described in the <a href="https://docs.openstack.org/tripleo-docs/latest/install/advanced_deployment/ceph_config.html">documentation</a>.
</p>
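<p>For example, a minimal ceph.yaml which sets the pool defaults that show up in the playbook call below (the CephPoolDefaultSize and CephPoolDefaultPgNum parameter names come from the ceph-ansible templates in tripleo-heat-templates; pick values appropriate for your cluster):</p>
<pre>
---
parameter_defaults:
  CephPoolDefaultSize: 1
  CephPoolDefaultPgNum: 32
</pre>
<p>Then add it to the deploy command with `-e ~/ceph.yaml` as usual.</p>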
<h3>What changes does the TripleO ceph-ansible integration make to the files in ceph-ansible's group_vars?</h3>
<p>
None. Instead YAQL within tripleo-heat-templates builds a Mistral environment which
the ceph-ansible.yaml Mistral workbook may access when it calls ceph-ansible.
The workbook then passes those parameters as JSON with the ansible-playbook command's
<a href="http://docs.ansible.com/ansible/latest/playbooks_variables.html#passing-variables-on-the-command-line">--extra-vars</a> option. To see what parameters were passed using this method, grep the executor.log as above to see the ceph-ansible playbook call.
The sample file, site-docker.yml.sample, is called because that file is
shipped by ceph-ansible. This allows TripleO to avoid maintaining
its own fork of ceph-ansible.
</p>
<h3>What does a usual ceph-ansible playbook call look like when run by TripleO?</h3>
<pre>
ansible-playbook -v /usr/share/ceph-ansible/site-docker.yml.sample
--user tripleo-admin
--become
--become-user root
--extra-vars
{"monitor_secret": "***",
"ceph_conf_overrides":
{"global": {"osd_pool_default_pg_num": 32,
"osd_pool_default_size": 1}},
"osd_scenario": "non-collocated",
"fetch_directory": "/tmp/file-mistral-action3_a1Cb",
"user_config": true,
"ceph_docker_image_tag": "tag-build-master-jewel-centos-7",
"ceph_release": "jewel",
"containerized_deployment": true,
"public_network": "192.168.24.0/24",
"copy_admin_key": false,
"journal_collocation": false,
"monitor_interface": "eth0",
"admin_secret": "***",
"raw_journal_devices": ["/dev/vdd", "/dev/vdd"],
"keys": [{"mon_cap": "allow r",
"osd_cap": "allow class-read object_prefix rbd_children, allow rwx pool=volumes, ... ],
"openstack_keys": [{"mon_cap": "allow r", ... ],
"generate_fsid": false,
"osd_objectstore": "filestore",
"monitor_address_block": "192.168.24.0/24",
"ntp_service_enabled": false,
"ceph_docker_image": "ceph/daemon",
"docker": true,
"fsid": "2d87a5e8-8e72-11e7-a223-003da9b9b610",
"journal_size": 256,
"cephfs_metadata": "manila_metadata",
"openstack_config": true,
"ceph_docker_registry": "docker.io",
"pools": [],
"cephfs_data": "manila_data",
"ceph_stable": true,
"devices": ["/dev/vdb", "/dev/vdc"],
"ceph_origin": "distro",
"openstack_pools": [
{"rule_name": "", "pg_num": 32, "name": "volumes"},
{"rule_name": "", "pg_num": 32, "name": "backups"},
{"rule_name": "", "pg_num": 32, "name": "vms"},
{"rule_name": "", "pg_num": 32, "name": "images"},
{"rule_name": "", "pg_num": 32, "name": "metrics"}],
"ip_version": "ipv4",
"ireallymeanit": "yes",
"cluster_network": "192.168.24.0/24",
"cephfs": "cephfs",
"raw_multi_journal": true
}
--forks 6
--ssh-common-args "-o StrictHostKeyChecking=no"
--ssh-extra-args "-o UserKnownHostsFile=/dev/null"
--inventory-file /tmp/ansible-mistral-actiontrguE1/inventory.yaml
--private-key /tmp/ansible-mistral-actiontrguE1/ssh_private_key
--skip-tags package-install,with_pkg
</pre>
<p>You can get an unformatted version of the above
from a grep of /var/log/mistral/executor.log as described above.</p>
<h3>How can I re-run only the ceph-ansible playbook?</h3>
<p>Careful. This should not be done on a production deployment because
if you re-run the Mistral deployment directly after getting the error
posted under the first question, then the Heat Stack will not be
updated. Thus, Heat will believe the OS::Mistral::ExternalResource
resource has status CREATE_FAILED. If you are doing a practice
deployment or development, then you can use
<a href="https://specs.openstack.org/openstack/mistral-specs/specs/mitaka/approved/mistral-rerun-update-env.html">Mistral's task-rerun</a>.
But this only works if the task has failed.</p>
<p>First get the Task ID</p>
<pre>
WORKFLOW='tripleo.storage.v1.ceph-install'
UUID=$(mistral execution-list | grep $WORKFLOW | awk {'print $2'} | tail -1)
mistral task-list $UUID | grep ERROR
</pre>
For example:
<pre>
(undercloud) [stack@undercloud workbooks]$ mistral task-list $UUID | grep ERROR
| 31257437-c877-40f8-872f-2576da89a8ea | ceph_install | tripleo.storage.v1.ceph-install | a5287f5c-f781-40cf-8fce-c56c21c52918 | ERROR | Failed to run action [act... | 2017-09-07 15:31:43 | 2017-09-07 15:31:46 |
(undercloud) [stack@undercloud workbooks]$
</pre>
Then re-run the task
<pre>
(undercloud) [stack@undercloud workbooks]$ mistral task-rerun 31257437-c877-40f8-872f-2576da89a8ea
+---------------+--------------------------------------+
| Field | Value |
+---------------+--------------------------------------+
| ID | 31257437-c877-40f8-872f-2576da89a8ea |
| Name | ceph_install |
| Workflow name | tripleo.storage.v1.ceph-install |
| Execution ID | a5287f5c-f781-40cf-8fce-c56c21c52918 |
| State | RUNNING |
| State info | None |
| Created at | 2017-09-07 15:31:43 |
| Updated at | 2017-09-08 16:24:04 |
+---------------+--------------------------------------+
(undercloud) [stack@undercloud workbooks]$
</pre>
<p>
If you run the above and keep the following in another window:
<pre>
tail -f /var/log/mistral/ceph-install-workflow.log
</pre>
Then it's just like running `ansible-playbook site-docker.yml.sample ...`
but you don't need to pass all of the --extra-vars because the
same Mistral environment built by Heat is available.
</p>

Make a NUMA-aware VM with virsh (2017-09-07)
<p><a href="https://sysnet-adventures.blogspot.fr">Grégory</a> showed me how he uses `virsh edit` on a VM to add something like the following:</p>
<pre>
<cpu mode='custom' match='exact' check='partial'>
<model fallback='allow'>SandyBridge</model>
<feature policy='force' name='vmx'/>
<numa>
<cell id='0' cpus='0-1' memory='4096000' unit='KiB'/>
<cell id='1' cpus='2-3' memory='4096000' unit='KiB'/>
</numa>
</cpu>
</pre>
<p>After that `lstopo` will show the NUMA nodes you can use, e.g. if you want to start a process on your VM with `numactl`.</p>
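<p>For example, something like this (a rough sketch; `stress` is just a stand-in for whatever process you want to pin) binds a process and its memory to the second cell defined above:</p>
<pre>
# run on NUMA node 1 only and allocate memory there too
numactl --cpunodebind=1 --membind=1 stress --cpu 2 --timeout 60
</pre>
<p>The `lstopo-no-graphics` output below shows the two cells created by the XML above.</p>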
<pre>
# lstopo-no-graphics
Machine (7999MB total)
NUMANode L#0 (P#0 3999MB)
Package L#0 + L3 L#0 (16MB) + L2 L#0 (4096KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
Package L#1 + L3 L#1 (16MB) + L2 L#1 (4096KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
NUMANode L#1 (P#1 4000MB)
Package L#2 + L3 L#2 (16MB) + L2 L#2 (4096KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
Package L#3 + L3 L#3 (16MB) + L2 L#3 (4096KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
Misc(MemoryModule)
HostBridge L#0
PCI 8086:7010
PCI 1013:00b8
GPU L#0 "card0"
GPU L#1 "controlD64"
3 x { PCI 1af4:1000 }
2 x { PCI 1af4:1001 }
</pre>Unknownnoreply@blogger.comtag:blogger.com,1999:blog-5384046811194880144.post-41962856676464988372017-09-05T09:58:00.000-04:002017-09-05T10:44:56.563-04:00Trick to test external ceph clusters using only tripleo-quickstart<p>
TripleO can stand up a Ceph cluster as part of an overcloud. However, if all you have
is a <a href="https://docs.openstack.org/developer/tripleo-quickstart">tripleo-quickstart</a> env and want to test an overcloud feature which uses an external Ceph cluster, then
you can have quickstart stand up two Heat stacks: one to make a separate ceph
cluster and the other to stand up an overcloud which uses that ceph cluster.
</p>
<h3>Deploy stand alone ceph cluster</h3>
<p>
I use <a href="https://github.com/fultonj/oooq/blob/8e04565ad9d21d47f23d650a6f7361bb766e7314/deploy-ceph-only.sh">deploy-ceph-only.sh</a> with
<a href="https://github.com/fultonj/oooq/blob/master/tht/ceph-only.yaml">ceph-only.yaml</a>,
based on <a href="http://giuliofidente.com/2016/12/tripleo-to-deploy-ceph-standlone.html">Giulio's example</a>. I add `--stack ceph` to `openstack overcloud deploy ...` so that the Heat stack is
not called "overcloud". You cannot rename a Heat stack.</p>
<p>After deploying the ceph cluster, get the monitor node's IP (CephExternalMonHost), use `ceph auth list` to get the secret key for the client.openstack keyring (CephClientKey), and look at the ceph.conf to get the FSID (CephClusterFSID), so that <a href="https://github.com/fultonj/tripleo-ceph-ansible/blob/master/tht/overcloud-ceph-ansible-external.yaml">overcloud-ceph-ansible-external.yaml</a> may be updated accordingly.</p>
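<p>Gathering those three values on the ceph monitor looks roughly like this (a sketch; it assumes the default ceph.conf path and the client.openstack keyring created by the deployment):</p>
<pre>
grep fsid /etc/ceph/ceph.conf        # -> CephClusterFSID
ceph auth get-key client.openstack   # -> CephClientKey
ceph mon dump                        # monitor IPs -> CephExternalMonHost
</pre>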
<h3>Deploy an overcloud to use external ceph</h3>
<p>
I use <a href="https://github.com/fultonj/tripleo-ceph-ansible/blob/master/deploy-ext-ceph.sh">deploy-ext-ceph.sh</a> with <a href="https://github.com/fultonj/tripleo-ceph-ansible/blob/master/tht/overcloud-ceph-ansible-external.yaml">overcloud-ceph-ansible-external.yaml</a>.
This uses changes in
<a href="https://review.openstack.org/#/q/topic:bug/1714271">tripleo</a> and
<a href="https://github.com/ceph/ceph-ansible/pull/1850">ceph-ansible</a>
which are unmerged (at this time of writing).
</p>
<h3>Results</h3>
<a href="https://github.com/fultonj/oooq/blob/master/myconfigfile.yml"></a>
<pre>
(undercloud) [stack@undercloud ceph-ansible]$ openstack server list
+--------------------------------------+-------------------------+--------+------------------------+----------------+--------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+-------------------------+--------+------------------------+----------------+--------------+
| 28d57de8-8354-43e0-8d4e-46de33ea4672 | overcloud-controller-0 | BUILD | ctlplane=192.168.24.8 | overcloud-full | control |
| 298943dd-b3d2-4302-93fd-c45d8375ff16 | overcloud-novacompute-0 | BUILD | ctlplane=192.168.24.21 | overcloud-full | compute |
| f4d15186-775c-4cab-ae5d-c3fd48ecfccf | ceph-cephstorage-2 | ACTIVE | ctlplane=192.168.24.18 | overcloud-full | ceph-storage |
| 24da4c0f-f945-4489-bdeb-eb9b2cf70bc0 | ceph-cephstorage-0 | ACTIVE | ctlplane=192.168.24.9 | overcloud-full | ceph-storage |
| 248eacd5-e0ae-47b2-a3a9-2b4f3d0dfa6c | ceph-cephstorage-1 | ACTIVE | ctlplane=192.168.24.15 | overcloud-full | ceph-storage |
| 5af9a2ae-3492-4874-b8ab-2de2f8530b60 | ceph-controller-0 | ACTIVE | ctlplane=192.168.24.6 | overcloud-full | control |
+--------------------------------------+-------------------------+--------+------------------------+----------------+--------------+
(undercloud) [stack@undercloud ceph-ansible]$ openstack stack list
+--------------------------------------+------------+----------------------------------+--------------------+----------------------+--------------+
| ID | Stack Name | Project | Stack Status | Creation Time | Updated Time |
+--------------------------------------+------------+----------------------------------+--------------------+----------------------+--------------+
| c016b71d-0c73-468d-bed5-baf26d88ea23 | overcloud | d8e1f76b116f467cbe9e60b6c91c80b3 | CREATE_IN_PROGRESS | 2017-09-05T14:30:02Z | None |
| 91370b74-41bd-4923-bacb-c24d98ca148f | ceph | d8e1f76b116f467cbe9e60b6c91c80b3 | CREATE_COMPLETE | 2017-09-05T14:11:04Z | None |
+--------------------------------------+------------+----------------------------------+--------------------+----------------------+--------------+
(undercloud) [stack@undercloud ceph-ansible]$
</pre>
<p>I had set up my virtual hardware by running `quickstart.sh -e @myconfigfile.yml` with <a href="https://github.com/fultonj/oooq/blob/master/myconfigfile.yml">myconfigfile.yml</a>.</p>
<p>
In this scenario I used puppet-ceph to deploy the ceph cluster
and ceph-ansible to deploy the ceph-client, which is the reverse
of a more popular scenario. All four combinations are possible,
though the puppet-ceph method will be deprecated.
</p>Unknownnoreply@blogger.comtag:blogger.com,1999:blog-5384046811194880144.post-37131852643995503492017-06-06T14:34:00.001-04:002017-06-06T14:42:47.678-04:00Accessing a Mistral Environment in a CLI workflowRecently, with some help of the Mistral devs in freenode #openstack-mistral, I was able to create a simple environment and then write a workflow to access it. I will share my example below.
<p>
You can define a mistral environment file in YAML:
<pre>
(undercloud) [stack@undercloud 101]$ cat env.yaml
---
name: "my_env"
variables:
foo: bar
service_ips:
ceph_mon_ctlplane_node_ips:
- "192.168.24.13"
- "192.168.24.15"
(undercloud) [stack@undercloud 101]$
</pre>
You can then ask Mistral to store that environment:
<pre>
(undercloud) [stack@undercloud 101]$ mistral environment-create -f yaml env.yaml
Name: my_env
Description: null
Variables: "{\n \"foo\": \"bar\", \n \"service_ips\": {\n \"ceph_mon_ctlplane_node_ips\"\
: [\n \"192.168.24.13\", \n \"192.168.24.15\"\n ]\n\
\ }\n}"
Scope: private
Created at: '2017-06-06 16:31:01'
Updated at: null
(undercloud) [stack@undercloud 101]$
</pre>
Observe it in the environment list:
<pre>
(undercloud) [stack@undercloud 101]$ mistral environment-list
+-------------------+-------------------+---------+-------------------+---------------------+
| Name | Description | Scope | Created at | Updated at |
+-------------------+-------------------+---------+-------------------+---------------------+
| tripleo | None | private | 2017-06-02 | <none> |
| .undercloud- | | | 21:24:12 | |
| config | | | | |
| overcloud | None | private | 2017-06-02 | 2017-06-02 23:32:53 |
| | | | 21:24:21 | |
| ssh_keys | SSH keys for | private | 2017-06-02 | <none> |
| | TripleO | | 21:24:40 | |
| | validations | | | |
| my_env | None | private | 2017-06-06 | <none> |
| | | | 16:32:41 | |
+-------------------+-------------------+---------+-------------------+---------------------+
(undercloud) [stack@undercloud 101]$
</pre>
Look at it directly:
<pre>
(undercloud) [stack@undercloud 101]$ mistral environment-get my_env
+-------------+-----------------------------------------+
| Field | Value |
+-------------+-----------------------------------------+
| Name | my_env |
| Description | <none> |
| Variables | { |
| | "foo": "bar", |
| | "service_ips": { |
| | "ceph_mon_ctlplane_node_ips": [ |
| | "192.168.24.13", |
| | "192.168.24.15" |
| | ] |
| | } |
| | } |
| Scope | private |
| Created at | 2017-06-06 16:32:41 |
| Updated at | <none> |
+-------------+-----------------------------------------+
(undercloud) [stack@undercloud 101]$
</pre>
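<p>If you later edit env.yaml, the stored environment can be updated in place rather than deleted and recreated; a sketch using the matching update command:</p>
<pre>
mistral environment-update env.yaml
</pre>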
You can define a workflow which can access the variables in the Mistral environment:
<pre>
---
version: "2.0"
wf:
tasks:
show_env_synax1:
action: std.echo output=<% $.get('__env') %>
on-complete: show_env_synax2
show_env_synax2:
action: std.echo output=<% env() %>
on-complete: show_ips
show_ips:
action: std.echo output=<% env().get('service_ips', {}).get('ceph_mon_ctlplane_node_ips', []) %>
</pre>
You can then have a Mistral workflow use it by specifying it
as a param as per the
<a href="https://docs.openstack.org/cli-reference/mistral.html#mistral-execution-create">documentation</a>.
<pre>
mistral execution-create workflow_identifier [workflow_input] [params]
</pre>
In [params] we specify the environment name. If your workflow has no [workflow_input],
then pass '' to make it clear you are specifying the environment name with params as
the second argument.
<p>
First we create (or update) our workflow:
<pre>
(undercloud) [stack@undercloud 101]$ mistral workflow-update mistral-env.yaml
+----------------+------+----------------+--------+-------+----------------+----------------+
| ID | Name | Project ID | Tags | Input | Created at | Updated at |
+----------------+------+----------------+--------+-------+----------------+----------------+
| 18e9daee-06db- | wf | f282a331978146 | <none> | | 2017-06-05 | 2017-06-06 |
| 42bc-b0bf- | | ce988911bc5643 | | | 17:04:31 | 19:04:06 |
| 228c19bf2c99 | | 5db4 | | | | |
+----------------+------+----------------+--------+-------+----------------+----------------+
(undercloud) [stack@undercloud 101]$
</pre>
Next we execute our workflow and indicate that the [workflow_input] is empty by passing ''
and after that we pass some JSON specifying that the "env" key should be "my_env" as
defined above:
<pre>
(undercloud) [stack@undercloud 101]$ mistral execution-create wf '' '{"env": "my_env"}'
+-------------------+--------------------------------------+
| Field | Value |
+-------------------+--------------------------------------+
| ID | f2c62c11-d5b6-4698-88af-3ef91240b837 |
| Workflow ID | 18e9daee-06db-42bc-b0bf-228c19bf2c99 |
| Workflow name | wf |
| Description | |
| Task Execution ID | <none> |
| State | RUNNING |
| State info | None |
| Created at | 2017-06-06 19:05:17 |
| Updated at | 2017-06-06 19:05:17 |
+-------------------+--------------------------------------+
(undercloud) [stack@undercloud 101]$
</pre>
As a shortcut we save the UUID of the execution, and use it to get the IDs of the list of tasks:
<pre>
(undercloud) [stack@undercloud 101]$ UUID=f2c62c11-d5b6-4698-88af-3ef91240b837
(undercloud) [stack@undercloud 101]$ mistral task-list $UUID | awk {'print $2'} | egrep -v 'ID|^$'
edf9576b-e4b7-41c9-9d0d-2486e886ce96
5e6559d0-d875-4f30-8567-dfd1dbf7ac32
6a7f2793-41a4-4ef9-8366-4d59f936044d
(undercloud) [stack@undercloud 101]$
</pre>
Next we make sure our ID maps to the task we want to see the output for:
<pre>
(undercloud) [stack@undercloud 101]$ mistral task-get edf9576b-e4b7-41c9-9d0d-2486e886ce96
+---------------+--------------------------------------+
| Field | Value |
+---------------+--------------------------------------+
| ID | edf9576b-e4b7-41c9-9d0d-2486e886ce96 |
| Name | show_env_synax1 |
| Workflow name | wf |
| Execution ID | f2c62c11-d5b6-4698-88af-3ef91240b837 |
| State | SUCCESS |
| State info | None |
| Created at | 2017-06-06 19:05:17 |
| Updated at | 2017-06-06 19:05:18 |
+---------------+--------------------------------------+
(undercloud) [stack@undercloud 101]$
</pre>
So what was the result of using syntax1?
<pre>
(undercloud) [stack@undercloud 101]$ mistral task-get-result edf9576b-e4b7-41c9-9d0d-2486e886ce96
{
"foo": "bar",
"service_ips": {
"ceph_mon_ctlplane_node_ips": [
"192.168.24.13",
"192.168.24.15"
]
}
}
(undercloud) [stack@undercloud 101]$
</pre>
The result is the environment we passed. Note that the more compact syntax2 does the same thing:
<pre>
(undercloud) [stack@undercloud 101]$ mistral task-get-result 6a7f2793-41a4-4ef9-8366-4d59f936044d
{
"foo": "bar",
"service_ips": {
"ceph_mon_ctlplane_node_ips": [
"192.168.24.13",
"192.168.24.15"
]
}
}
(undercloud) [stack@undercloud 101]$
</pre>
What's nice is that we can specifically pick items out with the env() dictionary as shown in the show_ips task.
<pre>
(undercloud) [stack@undercloud 101]$ mistral task-get-result 5e6559d0-d875-4f30-8567-dfd1dbf7ac32
[
"192.168.24.13",
"192.168.24.15"
]
(undercloud) [stack@undercloud 101]$
</pre>
As a refresher, the output of the task above came from the following task:
<pre>
show_ips:
action: std.echo output=<% env().get('service_ips', {}).get('ceph_mon_ctlplane_node_ips', []) %>
</pre>
Unknownnoreply@blogger.comtag:blogger.com,1999:blog-5384046811194880144.post-51129909977063713392017-05-08T15:37:00.000-04:002017-05-09T14:21:46.240-04:00Red Hat Summit 2017: DPDK and HCI<ul>
<li>I am back from Red Hat Summit 2017</li>
<li><a href="https://www.linkedin.com/in/andrew-theurer-4b70385/">Andrew Theurer</a> and I did a <a href="https://rh2017.smarteventscloud.com/connect/sessionDetail.ww?SESSION_ID=104845&tclass=popup">presentation</a> on Hyper-converged OpenStack/Ceph and DPDK workloads</li>
<li>We achieved our goal of proving that you can run VMs, OSDs, and a DPDK workload on the same server.</li>
<li>Andrew was able to run a workload to maintain 5.5 million packets per second per interface, for a total of 11 Mpps for 11 hours, even with some Ceph storage activity in the middle of that time period.</li>
<li><a href="https://rh2017.smarteventscloud.com/connect/fileDownload/session/FFC04AC230870565C83E18304087CE3D/hci-nfv-rh-summit-2017.pdf">Slides</a> from this session</li>
</ul>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixy67y5bp8sNDfSGYsHWjaRP1BSY2w3wiA3P9UbIzhSzolzb5_65CzaPHwd5rbrlLvizhZB_38qDxqqZ3hPoR1Hy6rbzbjRAvrKPpn9dlB3oME8tt0lZSCKpMddfJqgRdqRPUFsapFiM_Y/s1600/dpdk-hci.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixy67y5bp8sNDfSGYsHWjaRP1BSY2w3wiA3P9UbIzhSzolzb5_65CzaPHwd5rbrlLvizhZB_38qDxqqZ3hPoR1Hy6rbzbjRAvrKPpn9dlB3oME8tt0lZSCKpMddfJqgRdqRPUFsapFiM_Y/s320/dpdk-hci.png" width="320" height="181" /></a></div>Unknownnoreply@blogger.comtag:blogger.com,1999:blog-5384046811194880144.post-25203450339834007342017-04-18T09:26:00.000-04:002017-04-18T09:26:17.191-04:00openstack baremetal introspection data save<p>
I am happy with python-ironic-inspector-client 1.4.0 (Pike and newer) as I can more easily access my introspection data with:
<pre>
openstack baremetal introspection data save $UUID
</pre>
<p>In the past I used to use a <a href="https://github.com/fultonj/derived-tht-poc/blob/4eb77d4bcf080959ff5c63019d7be1357ab7216b/ironic_download.sh">script</a> to do the above.</p>
<p>
For example, I quickly use it after using <a href="https://docs.openstack.org/developer/tripleo-quickstart/node-configuration.html">quickstart</a> to make sure that my ceph flavor got its extra disks. When I run quickstart with `-e @myconfigfile.yml` where myconfigfile.yml contains a control flavor and a ceph flavor like so:</p>
<pre>
overcloud_nodes:
- name: control_0
flavor: control
virtualbmc_port: 6230
- name: ceph_0
flavor: ceph
virtualbmc_port: 6231
</pre>
<p>
Then the
<a href="https://github.com/openstack/tripleo-quickstart/blob/ecb109d647b0cf9a5640abf5d44ff9993318ffdc/roles/common/defaults/main.yml#L73-L77">ceph flavor gets the extradisks boolean set to true</a>. So when I first SSH into my deployed undercloud and simply run the following commands, then I can verify that my introspection data does contain the extra disks.
</p>
<pre>
[stack@undercloud ~]$ openstack baremetal node list
+-----------------------+-----------+---------------+-------------+--------------------+
| UUID | Name | Instance UUID | Power State | Provisioning State |
+-----------------------+-----------+---------------+-------------+--------------------+
| 4bbe35d4-9c79-4b80 | control-0 | None | power off | available |
| -816c-fca8a9f8a895 | | | | |
| bd9123b8-01a2-48ea-a2 | ceph-0 | None | power off | available |
| 47-2d38cfaa1102 | | | | |
+-----------------------+-----------+---------------+-------------+--------------------+
[stack@undercloud ~]$
openstack baremetal introspection data save control-0 > control-0
openstack baremetal introspection data save ceph-0 > ceph-0
[stack@undercloud ironic]$ cat ceph-0 | jq "." | grep dev
"name": "/dev/vda",
"name": "/dev/vdb",
"name": "/dev/vdc",
"name": "/dev/vdd",
"name": "/dev/vda",
[stack@undercloud ironic]$
</pre>
<p>More details on `openstack baremetal introspection data save` in the document <a href="https://docs.openstack.org/developer/tripleo-docs/advanced_deployment/introspection_data.html">Accessing Introspection Data</a>.</p>
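<p>If you have more than a couple of nodes, a small loop like the following (a sketch) dumps everything in one shot:</p>
<pre>
for NODE in $(openstack baremetal node list -f value -c Name); do
    openstack baremetal introspection data save $NODE > $NODE.json
done
</pre>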
Unknownnoreply@blogger.comtag:blogger.com,1999:blog-5384046811194880144.post-36379046524405474892017-04-13T15:35:00.001-04:002017-04-13T15:35:47.306-04:00Ceph OSDs and Systemd Basics<p>
As of Infernalis and then into Jewel/RHCS2, Ceph uses
<a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a>
to start services when installed on a Red Hat based system. Prior to
that, e.g. Hammer/RHCS1.3, it used SysV init.
<p>
When <a href="https://github.com/openstack/puppet-ceph">puppet-ceph</a>
or <a href="https://github.com/ceph/ceph-ansible">ceph-ansible</a>
configure Ceph OSD services, they do not need to run commands like:
<pre>
systemctl enable ceph-osd@0
</pre>
because those tools call `ceph-disk` (implemented in Python) directly
to prepare and activate the OSDs and then `ceph-disk` enables the
service in systemd. Thus, after puppet-ceph runs on your system you
can see evidence of the service being systemd enabled even though you
won't see anything like `systemctl enable ceph-osd@$i` in the module
itself:
<pre>
$ journalctl | grep "Created symlink from /run/systemd/system/ceph-osd.target.wants/ceph-osd"
Apr 12 19:16:01 compute-1.localdomain os-collect-config[1921]: Notice:
/Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-activate-/srv/data]/returns:
Created symlink from /run/systemd/system/ceph-osd.target.wants/ceph-osd@0.service
to /usr/lib/systemd/system/ceph-osd@.service.
</pre>
Note that the symlink is in /run/systemd/ and not /etc/systemd/ as per a
somewhat recent
<a href="https://github.com/ceph/ceph/commit/539385b143feee3905dceaf7a8faaced42f2d3c6">
commit</a> which adds the `--runtime` option.
<p>
To see if your OSDs are running with the --runtime option use
something like the following. In the example below the --runtime
was not used:
<pre>
[stack@hci-director ~]$ ansible osds -b -m shell -a "systemctl list-unit-files | grep ceph | grep osd"
192.168.1.26 | SUCCESS | rc=0 >>
ceph-osd@.service enabled
ceph-osd.target enabled
192.168.1.28 | SUCCESS | rc=0 >>
ceph-osd@.service enabled
ceph-osd.target enabled
192.168.1.31 | SUCCESS | rc=0 >>
ceph-osd@.service enabled
ceph-osd.target enabled
[stack@hci-director ~]$
</pre>
<p>
In this example the --runtime option was used:
<pre>
[stack@hci-director ~]$ ansible osds -b -m shell -a "systemctl list-unit-files | grep ceph | grep osd"
192.168.1.25 | SUCCESS | rc=0 >>
ceph-osd@.service enabled-runtime
ceph-osd.target enabled
192.168.1.23 | SUCCESS | rc=0 >>
ceph-osd@.service enabled-runtime
ceph-osd.target enabled
192.168.1.27 | SUCCESS | rc=0 >>
ceph-osd@.service enabled-runtime
ceph-osd.target enabled
[stack@hci-director ~]$
</pre>
If you have directory-based OSDs, then I recommend they be enabled
without --runtime.
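<p>A minimal sketch of doing that, reusing the same OSD ID discovery used further down:</p>
<pre>
# enable each OSD unit persistently (no --runtime) so the symlink lands
# in /etc/systemd/ and survives a reboot
for OSD_ID in $(ls /var/lib/ceph/osd | awk 'BEGIN { FS = "-" } ; { print $2 }'); do
    systemctl enable ceph-osd@$OSD_ID
done
</pre>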
<p>
If you want to restart your OSDs by sequentially running the
<a href="https://github.com/ceph/ceph/blob/master/systemd/ceph-osd%40.service">
ceph-osd@.service</a> for each OSD ID, then you may do so like this:
<pre>
OSD_IDS=$(ls /var/lib/ceph/osd | awk 'BEGIN { FS = "-" } ; { print $2 }')
for OSD_ID in $OSD_IDS; do
systemctl status ceph-osd@$OSD_ID
systemctl restart ceph-osd@$OSD_ID
systemctl status ceph-osd@$OSD_ID
done
</pre>
<p>
It is not necessary to start them sequentially however as
<a href="https://github.com/ceph/ceph/blob/master/systemd/ceph-osd.target">
ceph-osd.target</a> will start them all. You can verify this is working by
stopping your OSD directly and then restarting only the target. You
can then see that the individual OSD was started:
<pre>
ls /var/lib/ceph/osd/
i=1
systemctl stop ceph-osd@$i
systemctl status ceph-osd@$i
systemctl status ceph-osd.target
systemctl restart ceph-osd.target
systemctl status ceph-osd@$i
</pre>
Unknownnoreply@blogger.comtag:blogger.com,1999:blog-5384046811194880144.post-74344185198223438152017-04-06T10:56:00.000-04:002017-04-06T10:56:02.285-04:00TripleO Ceph Ansible Spec MergedThe <a href="https://specs.openstack.org/openstack/tripleo-specs/specs/pike/tripleo-ceph-ansible-integration.html">TripleO Ceph Ansible Spec</a> merged today. It's going to be a busy cycle :) Unknownnoreply@blogger.comtag:blogger.com,1999:blog-5384046811194880144.post-59481543826073938342017-04-05T17:11:00.001-04:002017-04-05T17:11:23.690-04:00openstack server image create.. the hard way<p>If you need to rescue a nova instance when <a href="https://docs.openstack.org/developer/python-openstackclient/command-objects/server-image.html">openstack server image create</a> isn't working and its backend is ceph, then here's how I did it for a pet called demo3 (all commands run on compute node except those starting with "openstack", which was run on the undercloud).</p>
<pre>
openstack server show demo3
# instance is running on overcloud-osd-compute-3 as instance-0000002b
virsh dumpxml instance-0000002b | grep rbd
# I see it is in rbd:vms/e6674b4d-40f4-4af3-b16d-c1ee37a3e1a6_disk
openstack server suspend demo3
# quiesce your instance
qemu-img info rbd:vms/e6674b4d-40f4-4af3-b16d-c1ee37a3e1a6_disk
# verify you have an image qemu-img can read
qemu-img snapshot -c demo3-snap1 rbd:vms/e6674b4d-40f4-4af3-b16d-c1ee37a3e1a6_disk
# snapshot your image
qemu-img info rbd:vms/e6674b4d-40f4-4af3-b16d-c1ee37a3e1a6_disk
# verify snapshot exists on rbd
rbd -p vms ls -l
# observe the e6674b4d-40f4-4af3-b16d-c1ee37a3e1a6_disk@demo3-snap1
qemu-img convert rbd:vms/e6674b4d-40f4-4af3-b16d-c1ee37a3e1a6_disk@demo3-snap1 demo3-snap1.raw
# pickle your snapshot to a local image file (took 35 seconds)
qemu-img info demo3-snap1.raw
# verify the export is a readable by qemu-img
# now we have demo3-snap1.raw to import or even save offline
openstack server resume demo3
# resume your instance (confirmed it answered `ping 10.1.1.9`)
</pre>
We were then able to use `openstack image create demo3-image1 --disk-format=raw --container-format=bare < demo3-snap1.raw` to import that image of the instance into glance so that it may live again. Unknownnoreply@blogger.comtag:blogger.com,1999:blog-5384046811194880144.post-91246855261977659242017-04-05T06:56:00.002-04:002017-04-05T07:14:09.488-04:00TripleO backports and dealing with unclean cherry picks<p>
Sometimes using a <a href="http://think-like-a-git.net/sections/rebase-from-the-ground-up/cherry-picking-explained.html">git cherry-pick</a> to do a backport is easy because you simply use the "Cherry Pick" button in Gerrit's web UI. Other times you get a merge conflict that's resolvable on the CLI. The Contributor Guide's <a href="https://docs.openstack.org/contributor-guide/additional-git-workflow/cherry-pick.html">Cherry pick a change</a> is the thing to read in that case, but it assumes a clean cherry pick in the end. Sometimes it's practical to use a cherry pick to set up a change but then abort the cherry pick and manually submit an unclean cherry pick of a <em>clean change</em> for review. I recently had to do this and learned a few things from TripleO cores in IRC so I want to share what I learned here in case it helps others. I'm no expert at this but I have a process I can follow to do this again without any issues.</p>
<p>
The following changes were made in TripleO/Ocata to the following repositories:
<ul>
<li><a href="https://review.openstack.org/#/c/411987">THT gerrit 411987</a></li>
<li><a href="https://review.openstack.org/#/c/411984">puppet-tripleo gerrit 411984</a></li>
<li><a href="https://review.openstack.org/#/c/411983">puppet-nova gerrit 411983</a></li>
</ul>
<p>
These changes are important for running OpenStack on Ceph with more
than 700 block-device backed OSDs and should be backported for those
running TripleO/Newton.
</p>
<p>
Stable backport policy requires a bug in Launchpad. I opened
<a href="https://bugs.launchpad.net/tripleo/+bug/1673995">1673995</a>.
I then viewed the original changes in gerrit and was able to click
cherry-pick and enter "stable/newton". The ones
for <a href="https://review.openstack.org/#/c/442970">puppet-triple</a>
and <a href="https://review.openstack.org/#/c/442969">puppet-nova</a>
went cleanly and I got two new gerrit IDs. I changed their topic to
bug/1673995.
</p>
<p>
The attempted GUI backport
for <a href="https://review.openstack.org/#/c/411987">THT</a> had a
conflict, so I had to resolve it on the command line with the following
process.
</p>
<ol>
<li>Get a clean copy of the repo you need to backport to (I normally do this with a <a href="https://github.com/fultonj/oooq/blob/b313c8d340135fbc2a8312ca692c7400788006d7/setup-deploy-artifacts.sh">script</a>)
<pre>
git config --global gitreview.username fultonj
git clone https://git.openstack.org/openstack/tripleo-heat-templates.git
cd tripleo-heat-templates
git remote add gerrit ssh://fultonj@review.openstack.org:29418/openstack/tripleo-heat-templates.git
git review -s
git fetch origin
</pre>
</li>
<li>Create a topic branch for the bug from the stable branch
<pre>
git checkout -b bug/1673995 remotes/origin/stable/newton
</pre>
</li>
<li>
On <a href="https://review.openstack.org/#/c/41198"7>the review page</a>
for the THT change, click the Download pull-down menu and copy the
cherry pick command. Running the cherry pick command for me looked
like the following:
<pre>
[jfulton@skagra tripleo-heat-templates{bug/1673995}]$ git fetch \
https://review.openstack.org/openstack/openstack-manuals refs/changes/34/235734/1 \
&& git cherry-pick -x FETCH_HEAD
warning: no common commits
remote: Counting objects: 107419, done
remote: Finding sources: 100% (107419/107419)
remote: Total 107419 (delta 66297), reused 91954 (delta 66297)
Receiving objects: 100% (107419/107419), 372.15 MiB | 2.92 MiB/s, done.
Resolving deltas: 100% (66297/66297), done.
From https://review.openstack.org/openstack/openstack-manuals
* branch refs/changes/34/235734/1 -> FETCH_HEAD
error: could not apply c95d624... Spelling miss in Networking Guide
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add <paths>' or 'git rm <paths>'
hint: and commit the result with 'git commit'
[jfulton@skagra tripleo-heat-templates{bug/1673995}]$
</pre>
</li>
<li>
I knew there would be a conflict but was advised to start with
`cherry-pick -x` anyway, even when manual changes will be required,
because it records a reference to the commit the changes came from.
From there it's normal to drop the conflicting hunks that won't be
committed and commit only the changes that need to be made. Also, note
in the commit message, or otherwise comment, that the proposed commit
wasn't a clean cherry pick so that reviewers know to pay extra
attention.
</li>
<li>
At this point I abort the cherry pick to get ready for manual clean up.
<pre>
[jfulton@skagra tripleo-heat-templates{bug/1673995}]$ git status
On branch bug/1673995
Your branch is up-to-date with 'origin/stable/newton'.
You are currently cherry-picking commit c95d624.
(fix conflicts and run "git cherry-pick --continue")
(use "git cherry-pick --abort" to cancel the cherry-pick operation)
Unmerged paths:
(use "git add/rm <file>..." as appropriate to mark resolution)
deleted by us: doc/networking-guide/source/adv_config_sriov.rst
no changes added to commit (use "git add" and/or "git commit -a")
[jfulton@skagra tripleo-heat-templates{bug/1673995}]$ git cherry-pick --abort
[jfulton@skagra tripleo-heat-templates{bug/1673995}]$
</pre>
</li>
<li>
I don't need to `git rm doc/networking-guide/source/adv_config_sriov.rst`
as it's not staged for commit.
</li>
<li>
I then manually edit `puppet/services/nova-libvirt.yaml` to add the
three required lines from the upstream change.
<pre>
nova::compute::libvirt::qemu::configure_qemu: true
nova::compute::libvirt::qemu::max_files: 32768
nova::compute::libvirt::qemu::max_processes: 131072
</pre>
</li>
<li>
From there I `git add puppet/services/nova-libvirt.yaml` and `git
commit` which brings me to writing the commit message for this type of
change.
</li>
<li>
In my case I copied/pasted the commit message from the original patch
but added "Unclean cherry-pick from I1e79675f6aac1b0fe6cc7269550fa6bc8586e1fb".
Be sure the commit message keeps the same Change-Id as the change being
cherry-picked, even if there were conflicts. When there are conflicts,
add a Conflicts: section to the commit message but leave the Change-Id
the same. (I had <a href="https://review.openstack.org/#/c/448122/2//COMMIT_MSG">originally
overlooked this</a> and let a new Change-Id be generated, but the commit
message can simply be edited to restore the original Change-Id; see the
sanity-check sketch after this list.)
</li>
<li>
Like the original patch, I set the dependencies so that all three of
them would be tested together in CI.
</li>
<li>After the
<a href="https://review.openstack.org/#/q/topic:bug/1673995">changes for the bug</a>
merged, the status for the launch pad bug could be set to "Fix
Committed".</li>
</ol>
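<p>As referenced in step 9, a quick sanity check before pushing an unclean cherry pick for review (assuming git-review is set up as in step 1):</p>
<pre>
# the Change-Id should match the original change being backported
git log -1 | grep Change-Id
# expected here: Change-Id: I1e79675f6aac1b0fe6cc7269550fa6bc8586e1fb
git review stable/newton
</pre>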
Unknownnoreply@blogger.comtag:blogger.com,1999:blog-5384046811194880144.post-46307866784184591372017-04-03T11:42:00.000-04:002017-04-03T11:42:06.297-04:00Finding the right NUMA node for HCI with DPDK<p>
Q: Which NUMA node do I use so that a process I want to run doesn't have to jump NUMA boundaries?
</p>
<p>
A: Find the NUMA node (e.g. 0 or 1) using `lstopo-no-graphics`. In the example below, I see that em1 hangs off the same NUMA node as my ceph disks, and I deployed my overcloud using em1 to host my ceph storage networks. Thus, I'm going to use `numactl --preferred` to start my OSD processes on NUMA node 0 (e.g. see this <a href="https://github.com/RHsyseng/hci/blob/master/custom-templates/post-deploy-template.yaml#L23">post deploy template</a>).</p>
<p>
Similarly, I see that my p4p1 and p4p2 NICs are on the other NUMA node. I'm doing HCI and want to run DPDK so I am going to tell the DPDK processes to use the second NUMA node.
</p>
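<p>Before wading through the full lstopo output below, a quicker check of which node a given NIC sits on is its sysfs numa_node attribute (a sketch; device names are from this box):</p>
<pre>
cat /sys/class/net/em1/device/numa_node    # 0 on this box
cat /sys/class/net/p4p1/device/numa_node   # 1 on this box
</pre>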
<pre>
[stack@c10-h01-r730xd ~]$ lstopo-no-graphics
Machine (128GB total)
NUMANode L#0 (P#0 64GB)
Package L#0 + L3 L#0 (25MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#20)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#2)
PU L#3 (P#22)
L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#4)
PU L#5 (P#24)
L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#6)
PU L#7 (P#26)
L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#8)
PU L#9 (P#28)
L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#10)
PU L#11 (P#30)
L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#12)
PU L#13 (P#32)
L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#14)
PU L#15 (P#34)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
PU L#16 (P#16)
PU L#17 (P#36)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
PU L#18 (P#18)
PU L#19 (P#38)
HostBridge L#0
PCIBridge
PCI 1000:005d
Block(Disk) L#0 "sda"
Block(Disk) L#1 "sdb"
Block(Disk) L#2 "sdc"
Block(Disk) L#3 "sdd"
Block(Disk) L#4 "sde"
Block(Disk) L#5 "sdf"
Block(Disk) L#6 "sdg"
Block(Disk) L#7 "sdh"
Block(Disk) L#8 "sdi"
Block(Disk) L#9 "sdj"
Block(Disk) L#10 "sdk"
Block(Disk) L#11 "sdl"
Block(Disk) L#12 "sdm"
Block(Disk) L#13 "sdn"
Block(Disk) L#14 "sdo"
Block(Disk) L#15 "sdp"
Block(Disk) L#16 "sdq"
PCIBridge
PCI 8086:1572
Net L#17 "em1"
PCI 8086:1572
Net L#18 "em2"
PCIBridge
PCIBridge
PCIBridge
PCI 144d:a820
PCIBridge
PCI 8086:1521
Net L#19 "em3"
PCI 8086:1521
Net L#20 "em4"
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCI 102b:0534
GPU L#21 "card0"
GPU L#22 "controlD64"
NUMANode L#1 (P#1 64GB)
Package L#1 + L3 L#1 (25MB)
L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
PU L#20 (P#1)
PU L#21 (P#21)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
PU L#22 (P#3)
PU L#23 (P#23)
L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
PU L#24 (P#5)
PU L#25 (P#25)
L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
PU L#26 (P#7)
PU L#27 (P#27)
L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
PU L#28 (P#9)
PU L#29 (P#29)
L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
PU L#30 (P#11)
PU L#31 (P#31)
L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
PU L#32 (P#13)
PU L#33 (P#33)
L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
PU L#34 (P#15)
PU L#35 (P#35)
L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
PU L#36 (P#17)
PU L#37 (P#37)
L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
PU L#38 (P#19)
PU L#39 (P#39)
HostBridge L#11
PCIBridge
PCI 8086:1572
Net L#23 "p4p1"
PCI 8086:1572
Net L#24 "p4p2"
[stack@c10-h01-r730xd ~]$
</pre>
<p>
Q: Now that I know I want numa node 1, how do I know which CPUs to pass to the HostCpusList in <a href="https://github.com/openstack/tripleo-heat-templates/blob/12fbad1345c34a7c6f7da7490dbb0601b8f90fe9/puppet/services/neutron-ovs-dpdk-agent.yaml#L21-L25">THT</a>?
</p>
<p>
A: Use `lscpu`
</p>
<pre>
[stack@c10-h01-r730xd ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Stepping: 2
CPU MHz: 2574.113
BogoMIPS: 4603.71
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39
[stack@c10-h01-r730xd ~]$
</pre>
<p>
For example, I can see that NUMA node1's CPUs are the odd numbered ones. Thus, the settings look like this:
<pre>
# Add a list or range of physical CPU cores to be reserved for virtual machine processes:
NovaVcpuPinSet: ['9,11,13,15']
# Set a list or range of physical CPU cores to be tuned:
HostCpusList: '1,3,5,7,9,11,13,15'
</pre>
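<p>Rather than eyeballing the lscpu output, the CPU list for a given node can also be read straight from sysfs (a sketch):</p>
<pre>
# CPUs belonging to NUMA node 1
cat /sys/devices/system/node/node1/cpulist
</pre>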
Unknownnoreply@blogger.comtag:blogger.com,1999:blog-5384046811194880144.post-69261335069800017932017-03-29T11:38:00.000-04:002017-03-29T13:34:29.415-04:00Ironic Metadata Disk Cleaning instead of a first-boot zap disk <p>I verified that Ironic Metadata Disk Cleaning works well with 10 Dell RX730s and OSPd/Tripleo on OSP10/Newton.</p>
<p>This was not via TripleO's "clean_nodes=True" param but purely a change in Ironic. I used the following steps to turn it on after deploying the undercloud.</p>
<p>Identify the neutron UUID of the TripleO ctlplane:</p>
<pre>
[stack@c10-h01-r730xd ~]$ neutron net-list
+--------------------------------------+----------+----------------------------------------+
| id | name | subnets |
+--------------------------------------+----------+----------------------------------------+
| 40a26da2-bcc6-47c9-b308-49c8d6911f8d | ctlplane | 5541d13e-3d44-442b-b2c3-1c99bc959861 |
| | | 192.0.2.0/24 |
+--------------------------------------+----------+----------------------------------------+
[stack@c10-h01-r730xd ~]$
</pre>
Modify ironic.conf:
<pre>
[conductor]
automated_clean = True
[deploy]
erase_devices_priority = 0
erase_devices_metadata_priority = 10
[neutron]
cleaning_network_uuid = $UUID
</pre>
For example:
<pre>
[root@c10-h01-r730xd ironic]# egrep "clean|erase" /etc/ironic/ironic.conf | egrep -v \#
automated_clean = True
erase_devices_priority = 0
erase_devices_metadata_priority = 10
cleaning_network_uuid = 40a26da2-bcc6-47c9-b308-49c8d6911f8d
[root@c10-h01-r730xd ironic]#
</pre>
Bounce the ironic conductor service.
<pre>
systemctl restart openstack-ironic-conductor.service
</pre>
<p>
Once that's done, merely trying to put the nodes into the ironic state "available" will put them in the "cleaning" state first and then clean the disks before they finally get set to the "available" state. I cycled my nodes out of the available state and back with the following:</p>
<pre>
for ironic_id in $(ironic node-list | awk {'print $2'} | grep -v UUID | egrep -v '^$'); do
ironic node-set-provision-state $ironic_id manage;
done
</pre>
<pre>
for ironic_id in $(ironic node-list | awk {'print $2'} | grep -v UUID | egrep -v '^$'); do
ironic node-set-provision-state $ironic_id provide;
done
</pre>
<p>Simply by running the above two bash loops, the nodes were booted on a ram disk and every disk, including the root disk (e.g. /dev/sda), had its metadata removed. After that `ceph-disk prepare` was able to make the server's disks into OSDs without any problems, even though I did not use my <a href="https://github.com/RHsyseng/hci/commit/9962912333d44ef43d6c67d7f3dae8771fc6523e#diff-90d12d5e31f7b7b34a42faeee2d32323">first-boot Heat template</a> which I used to use to wipe the disks.</p>
<p>After running `openstack stack delete overcloud --yes --wait` I see that the nodes are automatically turned back on with the following status while Ironic cleans the nodes. After that the nodes go back to "power off" and "available". The cleaning process takes about 3 minutes, so I'm pretty happy with it.</p>
<pre>
[stack@c10-h01-r730xd ~]$ ironic node-list
+--------------------------------------+-----------------+---------------+-------------+--------------------+-------------+
| UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+-----------------+---------------+-------------+--------------------+-------------+
| 014479db-e90a-4837-8834-7edea44a91fc | h03-control-mon | None | power on | clean wait | False |
| 64ba2e30-ba46-4ac7-93ce-126c6da0da65 | h07-control-mon | None | power on | clean wait | False |
| 6a1f895a-dd1f-42db-b0f8-b11303168561 | h09-control-mon | None | power on | clean wait | False |
| c1b93d79-a92a-49af-9b66-79aab861b395 | h11-compute-osd | None | power on | clean wait | False |
| 48c47e63-6bc2-4f0c-937f-c0f2397cd194 | h13-compute-osd | None | power on | clean wait | False |
| 8254d5fd-600a-4607-a0b0-38b4c95b22df | h15-compute-osd | None | power on | clean wait | False |
| b142b1b6-63c8-44af-9c0b-23788219e318 | h17-compute-osd | None | power on | clean wait | False |
| deef869f-519b-4b6f-9fad-999f376f5b98 | h19-compute-osd | None | power on | clean wait | False |
| 43fc22fa-49c4-4a99-b799-7aece8e359f6 | h21-compute-osd | None | power on | clean wait | False |
| 29a41539-5b12-4545-8680-595d6b4dceb2 | h23-compute-osd | None | power on | clean wait | False |
+--------------------------------------+-----------------+---------------+-------------+--------------------+-------------+
[stack@c10-h01-r730xd ~]$
</pre>
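<p>Since the cleaning happens after the stack delete returns, a crude sketch of waiting for every node to come back to "available" before redeploying:</p>
<pre>
# poll until no node is still in a cleaning state
while ironic node-list | grep -qE 'clean wait|cleaning'; do
    sleep 30
done
ironic node-list
</pre>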
Unknownnoreply@blogger.com