
Reasons for instability of NSX-T Cluster

 

Some time back, my NSX-T lab environment started showing an unstable status. The environment consists of three NSX-T manager nodes behind a cluster VIP address.



I could not access the NSX-T console through the VIP address or through the individual manager nodes. The issue was intermittent: at times I could reach the UI console on one of the manager nodes with the admin account, but I could not log in to any manager node over SSH with either the admin or the root account.

Figure:1 below shows 1-2 manager nodes as unavailable.

Figure:1

Clicking "VIEW DETAILS" showed that the /var/log partition was 100% full.
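The same check can be scripted so that a filling partition is noticed before it reaches 100%. The following is a minimal Python sketch, assuming it runs on a Linux host where /var/log is its own mount. The function names and the 90% threshold are illustrative and are not part of NSX-T.

```python
import shutil

def usage_percent(path: str) -> float:
    """Return the used share of the filesystem holding `path`, in percent."""
    total, used, _free = shutil.disk_usage(path)
    return 100.0 * used / total if total else 0.0

def is_nearly_full(path: str, threshold: float = 90.0) -> bool:
    """True when usage is at or above the (hypothetical) alert threshold."""
    return usage_percent(path) >= threshold
```

For example, is_nearly_full("/var/log") could be called from a scheduled job that sends an alert. The figure may differ slightly from what df reports, because df excludes blocks reserved for root.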

Figure:2

The main objective now was to compress or delete old logs in the /var/log partition to bring the manager nodes back.

To do this, I booted the NSX-T manager VMs one at a time from an Ubuntu image in rescue mode and freed up space under /var/log.

Checking the /var/log partition on the manager nodes, I found that syslog.1 was taking up a large share of the space.
Figure:3

As the figure above shows, syslog was occupying a huge amount of space under /var/log.
It also shows that syslog.1 had not been rotated for a long time.
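Finding the files that take up the most space can also be done with a short script. This is a sketch in Python, assuming a Python 3.9+ interpreter is available on the machine where the partition is mounted. The helper name largest_files is made up for illustration.

```python
import os

def largest_files(root: str, top: int = 10) -> list[tuple[int, str]]:
    """Return up to `top` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                # lstat reports the link itself, so symlinks are not followed
                sizes.append((os.lstat(full).st_size, full))
            except OSError:
                # The file vanished or is unreadable; skip it
                continue
    return sorted(sizes, reverse=True)[:top]
```

Calling largest_files("/var/log") on an affected node would be expected to list the oversized syslog file first.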

Old log files can be either compressed or deleted to free space in /var/log. In my case, I deleted the old syslog files and other old logs to recover enough space on the partition.
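For the compress option, a rotated log can be gzipped and the original removed. The sketch below is one way to do that in Python. The function name is hypothetical, and it should only be pointed at rotated files, not at the log that is still being written.

```python
import gzip
import os
import shutil

def compress_log(path: str) -> str:
    """Gzip `path` to `path + '.gz'`, then remove the original. Returns the new path."""
    target = path + ".gz"
    with open(path, "rb") as src, gzip.open(target, "wb") as dst:
        # Stream in chunks so a multi-gigabyte log is never held in memory
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return target
```

Note that the compressed copy is written before the original is removed, so this needs some free space. On a partition that is already 100% full, deleting a few old files first (as I did) is the more practical route.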

After that, I was able to log in over SSH with the root password. However, the system asked me to change it because the root password had expired.

I used the command below to reset the root password, and then validated the expiry status.

Reset root password of NSX-T manager

set user <username> password <new-password> old-password <old-password>
e.g.:
set user root password VMware1!VMware1!! old-password VMware1!VMware1!


After these steps, I validated the NSX-T environment and found all NSX-T managers healthy and showing a stable status. The /var/log partition also had sufficient free space.


Figure:5

Next, I wanted to identify the root cause: why had syslog not been rotated for such a long time?

To find out more, I looked at the logrotate configuration in /etc/logrotate.conf.

logrotate.conf
# use the syslog group by default, since this is the owning group
# of /var/log/syslog.
su root syslog

As the logrotate.conf snippet above shows, syslog rotation is performed as the root user (with the syslog group).


Log rotation runs as a daily cron task executed by the root user. Because the root password had expired, cron could not pass the account check for root, so the daily log rotation job did not run.

logrotate.conf




Auth.log

<87>1 0000-00-00T10:34:01.345432_00+00 nsxt000010.virtualvmx.com CRON 5324 - - pam_unix(cron:account): expired password for user root (password aged)
<87>1 0000-00-00T10:34:01.494949_00+00 nsxt000010.virtualvmx.com CRON 3423 - - pam_unix(cron:account): expired password for user root (password aged)
<87>1 0000-00-00T10:34:01.928345_00+00 nsxt000010.virtualvmx.com CRON 8765 - - pam_unix(cron:account): expired password for user root (password aged)
<87>1 0000-00-00T10:34:01.492823_00+00 nsxt000010.virtualvmx.com CRON 4323 - - pam_unix(cron:account): expired password for user root (password aged)
<87>1 0000-00-00T10:34:01.492384_00+00 nsxt000010.virtualvmx.com CRON 7665 - - pam_unix(cron:account): expired password for user root (password aged)
<87>1 0000-00-00T10:34:01.492838_00+00 nsxt000010.virtualvmx.com CRON 4827 - - pam_unix(cron:account): expired password for user root (password aged)



The logs above confirm the root cause: the NSX-T cluster became unstable because /var/log was 100% full, and it filled up because syslog could not be rotated. Log rotation in logrotate.conf runs as the root user, and in this case the root password had expired.
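Entries like these are easy to miss by eye, so here is a small Python sketch that scans auth.log lines for accounts rejected because of an expired password. It assumes the usual pam_unix wording, "pam_unix(<service>:account): expired password for user <name>". The regular expression and function name are my own, so adjust them if your log format differs.

```python
import re

# Assumed pam_unix message format for an aged-out password
EXPIRED_RE = re.compile(
    r"pam_unix\((?P<service>[\w-]+):account\): "
    r"expired password for user (?P<user>\S+)"
)

def expired_accounts(lines):
    """Return the set of (service, user) pairs rejected for an expired password."""
    hits = set()
    for line in lines:
        match = EXPIRED_RE.search(line)
        if match:
            hits.add((match.group("service"), match.group("user")))
    return hits
```

Passing an open file handle for /var/log/auth.log works, since the function only iterates over lines. A result containing ("cron", "root") would point to the situation described in this post.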



You can check the root password expiration in NSX-T with the command below.


get user <username> password-expiration
e.g.:
get user root password-expiration


So it is important to check the root password expiration regularly to avoid this kind of scenario in your environment.
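On a generic Linux system, the same information can be derived from the account's entry in /etc/shadow, where the third field is the day of the last password change (in days since 1970-01-01) and the fifth field is the maximum password age in days. The Python sketch below computes the days remaining from such an entry. Treat it as an illustration of the arithmetic only: I have not verified that the NSX-T CLI reads the same fields, and the function name is hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

EPOCH = date(1970, 1, 1)

def days_until_expiry(shadow_line: str, today: date) -> Optional[int]:
    """Days left before the password in a shadow(5) entry expires.

    Returns None when the last-change or max-age field is empty
    (no aging information). A negative result means already expired.
    """
    fields = shadow_line.strip().split(":")
    last_change, max_age = fields[2], fields[4]
    if not last_change or not max_age:
        return None
    expires_on = EPOCH + timedelta(days=int(last_change) + int(max_age))
    return (expires_on - today).days
```

Reading /etc/shadow requires root privileges. A check like this could warn when the remaining days drop below a chosen margin, well before cron starts rejecting the root account.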




















