VMware Cloud Foundation (VCF) Troubleshooting

Overview

Most of the time, when I am troubleshooting a VMware Cloud Foundation (VCF) bring-up, networking comes into play. Whether it is NTP, DNS, or the various required ports, networking is probably the most important thing to get right for a successful bring-up: subnets, network borders, Maximum Transmission Unit (MTU), VLANs, trunks, tags, etc.

General Troubleshooting

Release Notes

There are a few easy things to check that might be enough to get on the right track. First, read the release notes for the version of VMware Cloud Foundation (VCF) that is being installed. One of the most important pieces of information in the release notes is the Bill of Materials (BOM).

Is the right version of each Software Component installed, down to the Build Number? Using OEM Customized Installer CDs should not matter as long as the build number is the same.


Cyclical Redundancy Check (CRC)

VMware by Broadcom publishes various hash values that allow you to check the integrity of the downloaded software. These include MD5, SHA-1, and SHA-256.

Comparing the hash value of the downloaded file to the published value can sometimes be all that’s needed to discover that a download was corrupted. Even if the software appears to install correctly, there might be just enough corruption to cause bigger issues. It is a good idea to check the hash value of the VMware Cloud Builder appliance as well as the applicable VMware vSphere Hypervisor (ESXi) ISO image (including OEM Customized Installer CDs).

On a Mac, use Terminal to check a file: shasum --algorithm <[1 (default), 256]> <file name> or md5 <file name>.

On Windows, use PowerShell to check a file: Get-FileHash -Algorithm <[MD5 | SHA1 | SHA256]> <file name>.

On Linux, use Terminal to check a file: sha256sum <file name>, sha1sum <file name>, or md5sum <file name>.
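The same comparison can be scripted so it behaves identically on every platform. What follows is a minimal sketch in Python using only the standard library; the function names file_digest and matches_published are my own, and the published value is whatever you copy from the download page.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks so large ISOs fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published, algorithm="sha256"):
    """Compare against a published hash, ignoring case and surrounding whitespace."""
    return file_digest(path, algorithm) == published.strip().lower()
```

Call matches_published with the path to the Cloud Builder OVA or ESXi ISO and the published hash string; a False result means the file should be downloaded again before going any further.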


Deployment Parameters

When completing the Management Domain ESXi Hosts portion of the Deployment Parameters Workbook, ensure that only the short name of each host is entered. I worked on a deployment where the fully qualified domain names were entered. There were no apparent issues, and it even passed validation. It was not until we got to the “Download SSH Keys using Guest Program for vCenter” step that it failed. You may see a message in the log that it cannot connect. When you ping the IP address and host name of the ESXi hosts yourself, they resolve, because you are using the correct IP address or host name. I think that when connectivity is tested from the Cloud Builder appliance, it builds a string consisting of the hostname provided in the Management Domain ESXi Hosts section along with the DNS Zone Name, essentially doubling the domain name.

So if hostname.domain is entered in the Deployment Parameters Workbook, it is essentially trying to connect to hostname.domain.domain. This was a tricky one to detect since the hostname was correct; it just didn’t need to be the fully qualified domain name. We only found it when transcribing to a new Deployment Parameters Workbook during troubleshooting.
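To make the doubling concrete, here is a small Python sketch that flags workbook entries likely to trigger it. It models only my guess about how the name is assembled (hostname plus DNS Zone Name), not Cloud Builder's actual code, and the function names and example values are hypothetical.

```python
def effective_fqdn(hostname, dns_zone):
    """Join hostname and zone the way I suspect the connectivity test does (an assumption)."""
    return f"{hostname}.{dns_zone}"


def find_suspect_hostnames(hostnames, dns_zone):
    """Return workbook entries that already contain a dot, mapped to the doubled name."""
    zone = dns_zone.strip(".").lower()
    suspects = {}
    for name in hostnames:
        cleaned = name.strip().strip(".").lower()
        if "." in cleaned:
            # A dotted entry is probably an FQDN, so appending the zone doubles the domain.
            suspects[name] = effective_fqdn(cleaned, zone)
    return suspects


# Hypothetical workbook values, for illustration only.
print(find_suspect_hostnames(["esxi-101", "esxi-102.example.lab"], "example.lab"))
```

Any entry that comes back in the result should be shortened to the bare host name before the workbook is uploaded.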


Network Mapper (NMAP)

This next section is a lot more technical and should be considered an advanced topic. But if you are already down the path to install VMware Cloud Foundation (VCF), this should be a breeze.

Identify common services and note the IP addresses and port numbers (Domain Name System (DNS), Network Time Protocol (NTP), DNS zone name).

Pay attention to network borders and be consistent with Gateways. More information to follow.

Subnetted networks, not using /24, can get very confusing very fast. 

Decide whether the first address or the last address is the gateway and stick with it. For example, in a /27 (255.255.255.224) network, each network is 32 addresses wide, but only 30 addresses are usable. The lowest address is the network address and the highest address is the broadcast address. The usable addresses, including any address used as a gateway, must fall between those two. See below for an example.

Network              First             Last              Broadcast
192.168.16.0/27      192.168.16.1      192.168.16.30     192.168.16.31
192.168.16.32/27     192.168.16.33     192.168.16.62     192.168.16.63
192.168.16.64/27     192.168.16.65     192.168.16.94     192.168.16.95
192.168.16.96/27     192.168.16.97     192.168.16.126    192.168.16.127
...
192.168.16.224/27    192.168.16.225    192.168.16.254    192.168.16.255
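Rather than working the borders out by hand, the table can be generated. This is a minimal sketch using Python's standard ipaddress module; subnet_rows is a name I made up, and the 192.168.16.0/24 parent network is simply the example range used here.

```python
import ipaddress


def subnet_rows(supernet, new_prefix):
    """Return (network, first usable, last usable, broadcast) for each subnet."""
    rows = []
    for net in ipaddress.ip_network(supernet).subnets(new_prefix=new_prefix):
        rows.append((
            str(net),
            str(net.network_address + 1),    # lowest usable address
            str(net.broadcast_address - 1),  # highest usable address
            str(net.broadcast_address),
        ))
    return rows


# Print every /27 inside the example /24.
for row in subnet_rows("192.168.16.0/24", 27):
    print("{:<20}{:<18}{:<18}{}".format(*row))
```

The first-plus-one and broadcast-minus-one arithmetic holds for ordinary subnets such as /27 or /29; it is not meant for /31 or /32 networks.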

I tend to favor the lowest usable address in my networks as the gateway addresses, but any address in the range of usable addresses can be a candidate. 

But classless networks tend to have confusing borders. Using the example above, an address in 192.168.16.32/27 cannot communicate with an address in 192.168.16.64/27 without being routed through a Layer 3 device (router). At a quick glance, though, the two may look like part of the same network. It can be even more confusing when the host addresses are close together, such as 192.168.16.62/27 and 192.168.16.65/27.
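When two addresses look suspiciously close, a quick check removes the guesswork. The sketch below uses Python's standard ipaddress module; same_network is a hypothetical helper name.

```python
import ipaddress


def same_network(a, b):
    """True when both address/prefix pairs fall inside the same network."""
    return ipaddress.ip_interface(a).network == ipaddress.ip_interface(b).network


# .62 sits in 192.168.16.32/27 and .65 sits in 192.168.16.64/27, so traffic must be routed.
print(same_network("192.168.16.62/27", "192.168.16.65/27"))
# Both of these sit in 192.168.16.32/27.
print(same_network("192.168.16.33/27", "192.168.16.62/27"))
```

A False result means the two endpoints need a gateway between them, no matter how close the last octets are.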

In summary, pay attention to the network borders. 

For VMware Cloud Foundation, it is expected that certain devices remain in a single network, such as all ESXi hosts or Tunnel Endpoints (TEP). Again, pay attention to the size of the networks. A Management Domain requires four (4) physical hosts, so at least four (4) addresses are required for host management (vmk0). However, there are usually two (2) physical network interfaces (NICs) per physical ESXi host, and each NIC gets a TEP address, so four hosts need eight (8) TEP addresses. As you can see below, a /29 (255.255.255.248) does not contain enough addresses: only six (6) of its eight addresses are usable, so the last host is left without TEP addresses because the next consecutive address is the broadcast address for the network.

Network            First           Last             Broadcast
192.168.16.0/29    192.168.16.1    192.168.16.6     192.168.16.7
192.168.16.8/29    192.168.16.9    192.168.16.14    192.168.16.15

Tunnel Endpoint (TEP) Table

Host        TEP – 1         TEP – 2
ESXi-101    192.168.16.1    192.168.16.2
ESXi-102    192.168.16.3    192.168.16.4
ESXi-103    192.168.16.5    192.168.16.6
ESXi-104    ???             ???
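The sizing question can be checked before any addresses are handed out. Here is a rough Python sketch using the standard ipaddress module; tep_capacity is a name I chose, the host and NIC counts default to the four-host, two-NIC case described above, and a gateway address, if one is needed in the TEP network, is not accounted for.

```python
import ipaddress


def tep_capacity(network, hosts=4, teps_per_host=2):
    """Compare usable addresses in a network with the TEP addresses required."""
    usable = list(ipaddress.ip_network(network).hosts())  # excludes network and broadcast
    needed = hosts * teps_per_host
    return {"usable": len(usable), "needed": needed, "fits": len(usable) >= needed}


print(tep_capacity("192.168.16.0/29"))
print(tep_capacity("192.168.16.0/28"))
```

If fits comes back False, move to a larger network before filling in the workbook rather than discovering the shortfall during bring-up.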

vSAN Disk Availability Validation

Note: The following is in draft and is not complete at this time. For reference use only!

If you start a bring-up operation, something fails, and you try again, you may run into an issue where data was already written to the vSAN datastore, or the disks were already consumed for vSAN and no eligible disks remain. Either way, there are a few things that can be done to help resolve this before resorting to re-imaging the ESXi host.

On the ESXi console or through SSH, type the following to leave the vSAN Cluster.

esxcli vsan cluster leave

From the vSphere Client, go to the Storage tab and select Devices. Clear the partition table of the vSAN disks (cache and data) to ensure no files can be accessed.

Also remove the service account: navigate to Users and look for the account with vsan in the name.

