Copyright (c) 2015: The Nutanix Bible and StevenPoitras.com, 2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Steven Poitras and StevenPoitras.com with appropriate and specific direction to the original content.
I am honored to write a foreword for this book that we've come to call "The Nutanix Bible." First and foremost, let me address the name of the book, which to some would seem not fully inclusive vis-à-vis their own faiths, or to others who are agnostic or atheist. There is a Merriam-Webster meaning of the word "bible" that is not literally about scriptures: "a publication that is preeminent especially in authoritativeness or wide readership". And that is how you should interpret its roots. It started being written by one of the most humble yet knowledgeable employees at Nutanix, Steven Poitras, our first Solution Architect who continues to be authoritative on the subject without wielding his "early employee" primogeniture. Knowledge to him was not power -- the act of sharing that knowledge is what makes him eminently powerful in this company. Steve epitomizes culture in this company -- by helping everyone else out with his authority on the subject, by helping them automate their chores in PowerShell or Python, by building insightful reference architectures (that are beautifully balanced in both content and form), by being a real-time buddy to anyone needing help on Yammer or Twitter, by being transparent with engineers on the need to self-reflect and self-improve, and by being ambitious.
When he came forward to write a blog, his big dream was to lead with transparency, and to build advocates in the field who would be empowered to make design trade-offs based on this transparency. It is rare for companies to open up on design and architecture as much as Steve has with his blog. Most open source companies -- who at the surface might seem transparent because their code is open source -- never talk in-depth about design, and "how it works" under the hood. When our competitors know about our product or design weaknesses, it makes us stronger -- because there is very little to hide, and everything to gain when something gets critiqued under a crosshair. A public admonition of a feature trade-off or a design decision drives the entire company on Yammer in quick time, and before long, we have a conclusion on whether it is a genuine weakness or a true strength that someone is fear-mongering on. The Nutanix Bible, in essence, protects us from drinking our own Kool-Aid. That is the power of an honest discourse with our customers and partners.
This ever-improving artifact, beyond being authoritative, is also enjoying wide readership across the world. Architects, managers, and CIOs alike have stopped me in conference hallways to talk about how refreshingly lucid the writing style is, with some painfully detailed illustrations, Visio diagrams, and pictorials. Steve has taken time to tell the web-scale story, without taking shortcuts. Democratizing our distributed architecture was not going to be easy in a world where most IT practitioners have been buried in dealing with the "urgent". The Bible bridges the gap between IT and DevOps, because it attempts to explain computer science and software engineering trade-offs in very simple terms. We hope that in the coming 3-5 years, IT will speak a language that helps them get closer to the DevOps' web-scale jargon.
With this first edition, we are converting Steve's blog into a book. The day we stop adding to this book is the beginning of the end of this company. I expect each and every one of you to keep reminding us of what brought us this far: truth, the whole truth, and nothing but the truth, will set you free (from complacency and hubris).
Keep us honest.
--Dheeraj Pandey, CEO, Nutanix
Users today are constantly barraged by new technologies. There is no shortage of new opportunities for IT to change to a "new and better way", but the adoption of new technology and, more importantly, the change of operations and processes is difficult. Even the huge growth of open source technologies has been hampered by lack of adequate documentation. Wikibon was founded on the principle that the community can help with this problem and in that spirit, The Nutanix Bible, which started as a blog post by Steve Poitras, has become a valuable reference point for IT practitioners who want to learn about hyperconvergence and web-scale principles or to dig deep into Nutanix and hypervisor architectures. The concepts that Steve has written about are advanced software engineering problems that some of the smartest engineers in the industry have designed a solution for. The book explains these technologies in a way that is understandable to IT generalists without compromising the technical veracity.
The concepts of distributed systems and software-led infrastructure are critical for IT practitioners to understand. I encourage both Nutanix customers and everyone who wants to understand these trends to read the book. The technologies discussed here power some of the largest datacenters in the world.
--Stuart Miniman, Principal Research Contributor, Wikibon
Welcome to The Nutanix Bible! I work with the Nutanix platform on a daily basis – trying to find issues, push its limits as well as administer it for my production benchmarking lab. This item is being produced to serve as a living document outlining tips and tricks used every day by myself and a variety of engineers here at Nutanix.
NOTE: What you see here is an under-the-covers look at how things work. With that said, all topics discussed are abstracted by Nutanix and this knowledge isn't required to successfully operate a Nutanix environment!
Enjoy!
--Steven Poitras, Principal Solutions Architect, Nutanix
A brief look at the history of infrastructure and what has led us to where we are today.
The datacenter has evolved significantly over the last several decades. The following sections will examine each era in detail.
The mainframe ruled for many years and laid the core foundation of where we are today. It allowed companies to leverage the following key characteristics:
But the mainframe also introduced the following issues:
With mainframes, it was very difficult for organizations within a business to leverage these capabilities which partly led to the entrance of pizza boxes or stand-alone servers. Key characteristics of stand-alone servers included:
These stand-alone servers introduced more issues:
Businesses always need to make money and data is a key piece of that puzzle. With direct-attached storage (DAS), organizations either needed more space than was locally available, or data high availability (HA) where a server failure wouldn’t cause data unavailability.
Centralized storage replaced both the mainframe and the stand-alone server with sharable, larger pools of storage that also provided data protection. Key characteristics of centralized storage included:
Issues with centralized storage included:
At this point in time, compute utilization was low and resource efficiency was impacting the bottom line. Virtualization was then introduced and enabled multiple workloads and operating systems (OSs) to run as virtual machines (VMs) on a single piece of hardware. Virtualization enabled businesses to increase utilization of their pizza boxes, but also increased the number of silos and the impacts of an outage. Key characteristics of virtualization included:
Issues with virtualization included:
The hypervisor became a very efficient and feature-filled solution. With the advent of tools, including VMware vMotion, HA, and DRS, users obtained the ability to provide VM high availability and migrate compute workloads dynamically. The only caveat was the continued reliance on centralized storage, causing the compute and storage paths to converge. The downside was the increased load on the storage array: as compute scaled and VM sprawl grew, so did contention for storage I/O. Key characteristics included:
Issues included:
SSDs helped alleviate this I/O bottleneck by providing much higher I/O performance without the need for tons of disk enclosures. However, given the extreme advances in performance, the controllers and network had not yet evolved to handle the vast I/O available. Key characteristics of SSDs included:
SSD issues included:
The table below characterizes the various latencies for specific types of I/O:
Item | Latency | Comments |
---|---|---|
L1 cache reference | 0.5 ns | |
Branch Mispredict | 5 ns | |
L2 cache reference | 7 ns | 14x L1 cache |
Mutex lock/unlock | 25 ns | |
Main memory reference | 100 ns | 20x L2 cache, 200x L1 cache |
Compress 1KB with Zippy | 3,000 ns | |
Send 1KB over 1Gbps network | 10,000 ns | 0.01 ms
Read 4K randomly from SSD | 150,000 ns | 0.15 ms |
Read 1MB sequentially from memory | 250,000 ns | 0.25 ms |
Round trip within datacenter | 500,000 ns | 0.5 ms |
Read 1MB sequentially from SSD | 1,000,000 ns | 1 ms, 4x memory |
Disk seek | 10,000,000 ns | 10 ms, 20x datacenter round trip |
Read 1MB sequentially from disk | 20,000,000 ns | 20 ms, 80x memory, 20x SSD |
Send packet CA -> Netherlands -> CA | 150,000,000 ns | 150 ms |
(credit: Jeff Dean, https://gist.github.com/jboner/2841832)
The table above shows that the CPU can access its caches at anywhere from ~0.5-7ns (L1 vs. L2). For main memory, these accesses occur at ~100ns, whereas a local 4K SSD read is ~150,000ns or 0.15ms.
If we take a typical enterprise-class SSD (in this case the Intel S3700 - SPEC), this device is capable of the following:
For traditional storage, there are a few main types of media for I/O:
For the calculation below, we are using the 500MB/s Read and 460MB/s Write BW available from the Intel S3700.
The calculation is done as follows:
numSSD = ROUNDUP((numConnections * connBW (in GB/s))/ ssdBW (R or W))
NOTE: Numbers were rounded up as a partial SSD isn’t possible. This also does not account for the necessary CPU required to handle all of the I/O and assumes unlimited controller CPU power.
Controller Connectivity | Available Network BW | SSDs to Saturate Network BW (Read I/O) | SSDs to Saturate Network BW (Write I/O)
---|---|---|---
Dual 4Gb FC | 8Gb == 1GB | 2 | 3 |
Dual 8Gb FC | 16Gb == 2GB | 4 | 5 |
Dual 16Gb FC | 32Gb == 4GB | 8 | 9 |
Dual 1Gb ETH | 2Gb == 0.25GB | 1 | 1 |
Dual 10Gb ETH | 20Gb == 2.5GB | 5 | 6 |
As the table shows, if you wanted to leverage the theoretical maximum performance an SSD could offer, the network can become a bottleneck with anywhere from 1 to 9 SSDs, depending on the type of networking leveraged.
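As a quick sanity check, here is a minimal Python sketch of the numSSD calculation above, using the 500MB/s read and 460MB/s write figures from the Intel S3700; it reproduces the values in the table:

import math

# Per-SSD bandwidth from the Intel S3700 figures referenced above (GB/s)
SSD_READ_BW = 0.5    # 500MB/s
SSD_WRITE_BW = 0.46  # 460MB/s

# Available network bandwidth per controller in GB/s (from the table above)
controllers = {
    "Dual 4Gb FC": 1.0,
    "Dual 8Gb FC": 2.0,
    "Dual 16Gb FC": 4.0,
    "Dual 1Gb ETH": 0.25,
    "Dual 10Gb ETH": 2.5,
}

# numSSD = ROUNDUP(connBW / ssdBW) -- rounded up since a partial SSD isn't possible
for name, net_bw in controllers.items():
    reads = math.ceil(net_bw / SSD_READ_BW)
    writes = math.ceil(net_bw / SSD_WRITE_BW)
    print(f"{name}: {reads} SSD(s) for reads, {writes} SSD(s) for writes to saturate the link")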
Given that typical main memory latency is ~100ns (it will vary), we can perform the following calculations:
If we assume a typical network RTT of ~0.5ms (will vary by switch vendor), which is ~500,000ns, the combined memory reference plus network RTT comes to ~500,100ns.
If we theoretically assume a very fast network with a 10,000ns RTT, the combined latency drops to ~10,100ns.
What that means is that even with a theoretically fast network, there is a ~10,000% overhead when compared to a non-network memory access. With a slow network this can be upwards of a ~500,000% latency overhead.
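The percentages above follow directly from the ratio of network RTT to local memory latency; a quick sketch of the arithmetic:

# Relative overhead of crossing the network vs. a local main memory reference
MEM_LATENCY_NS = 100

for rtt_ns in (500_000, 10_000):  # typical vs. theoretically fast network RTT
    overhead_pct = rtt_ns / MEM_LATENCY_NS * 100
    print(f"RTT {rtt_ns:,} ns -> ~{overhead_pct:,.0f}% overhead vs. a memory access")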
In order to alleviate this overhead, server side caching technologies are introduced.
web·scale - /web ' skãl/ - noun - computing architecture
a new architectural approach to infrastructure and computing.
This section will present some of the core concepts behind "Web-scale" infrastructure and why we leverage them. Before I get started, I just wanted to clearly state that Web-scale doesn't mean you need to be "web-scale" (e.g. Google, Facebook, or Microsoft). These constructs are applicable and beneficial at any scale (3 nodes or thousands of nodes).
Historical challenges included:
There are a few key constructs used when talking about “Web-scale” infrastructure:
Other related items:
The following sections will provide a technical perspective on what they actually mean.
There are differing opinions on what hyper-convergence actually is. It also varies based on the scope of components (e.g. virtualization, networking, etc.). However, the core concept comes down to the following: natively combining two or more components into a single unit. ‘Natively’ is the key word here. In order to be the most effective, the components must be natively integrated and not just bundled together. In the case of Nutanix, we natively converge compute + storage to form a single node used in our appliance. For others, this might be converging storage with the network, etc. What it really means:
Benefits include:
Software-defined intelligence is taking the core logic from normally proprietary or specialized hardware (e.g. ASIC / FPGA) and doing it in software on commodity hardware. For Nutanix, we take the traditional storage logic (e.g. RAID, deduplication, compression, etc.) and put that into software that runs in each of the Nutanix CVMs on standard x86 hardware. What it really means:
Benefits include:
Distributed autonomous systems involve moving away from the traditional concept of having a single unit responsible for doing something and distributing that role among all nodes within the cluster. You can think of this as creating a purely distributed system. Traditionally, vendors have assumed that hardware will be reliable, which, in most cases can be true. However, core to distributed systems is the idea that hardware will eventually fail and handling that fault in an elegant and non-disruptive way is key.
These distributed systems are designed to accommodate and remediate failure, to form something that is self-healing and autonomous. In the event of a component failure, the system will transparently handle and remediate the failure, continuing to operate as expected. Alerting will make the user aware, but rather than being a critical time-sensitive item, any remediation (e.g. replace a failed node) can be done on the admin's schedule. Another way to put it is fail in-place (rebuild without replace). For items where a "master" is needed, an election process is utilized; in the event this master fails, a new master is elected. To distribute the processing of tasks, MapReduce concepts are leveraged. What it really means:
Benefits include:
Incremental and linear scale out relates to the ability to start with a certain set of resources and as needed scale them out while linearly increasing the performance of the system. All of the constructs mentioned above are critical enablers in making this a reality. For example, traditionally you’d have 3-layers of components for running virtual workloads: servers, storage, and network – all of which are scaled independently. As an example, when you scale out the number of servers you’re not scaling out your storage performance. With a hyper-converged platform like Nutanix, when you scale out with new node(s) you’re scaling out:
What it really means:
Benefits include:
In summary:
prism - /'prizɘm/ - noun - control plane
one-click management and interface for datacenter operations.
Building a beautiful, empathetic and intuitive product is core to the Nutanix platform and something we take very seriously. This section will cover our design methodology and how we iterate on it. More coming here soon!
In the meantime feel free to check out this great post on our design methodology and iterations by our Product Design Lead, Jeremy Sallee (who also designed this) - http://salleedesign.com/stuff/sdwip/blog/nutanix-case-study/
You can download the Nutanix Visio stencils here: http://www.visiocafe.com/nutanix.htm
Prism is a distributed resource management platform which allows users to manage and monitor objects and services across their Nutanix environment.
These capabilities are broken down into two key categories:
The figure highlights an image illustrating the conceptual nature of Prism as part of the Nutanix platform:
Prism is broken down into two main components:
The figure shows an image illustrating the conceptual relationship between Prism Central and Prism Element:
For larger or distributed deployments (e.g. more than one cluster or multiple sites) it is recommended to use Prism Central to simplify operations and provide a single management UI for all clusters / sites.
A Prism service runs on every CVM with an elected Prism Leader which is responsible for handling HTTP requests. Similar to other components which have a Master, if the Prism Leader fails, a new one will be elected. When a CVM which is not the Prism Leader gets an HTTP request it will permanently redirect the request to the current Prism Leader using HTTP response status code 301.
Here we show a conceptual view of the Prism services and how HTTP request(s) are handled:
Prism listens on ports 80 and 9440; if HTTP traffic comes in on port 80, it is redirected to HTTPS on port 9440.
When using the cluster external IP (recommended), it will always be hosted by the current Prism Leader. In the event of a Prism Leader failure the cluster IP will be assumed by the newly elected Prism Leader and a gratuitous ARP (gARP) will be used to clean any stale ARP cache entries. In this scenario any time the cluster IP is used to access Prism, no redirection is necessary as that will already be the Prism Leader.
You can determine the current Prism leader by running 'curl localhost:2019/prism/leader' on any CVM.
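For example, a small Python sketch that queries the same endpoint the curl command above uses (assuming it is run on a CVM; the exact response format may vary by release):

# Query the local Prism service for the current leader -- equivalent to
# `curl localhost:2019/prism/leader`. Response format may vary by release.
from urllib.request import urlopen

with urlopen("http://localhost:2019/prism/leader", timeout=5) as resp:
    print(resp.read().decode())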
In the following sections we'll cover some of the typical Prism uses as well as some common troubleshooting scenarios.
Performing a Nutanix software upgrade is a very simple and non-disruptive process.
To begin, start by logging into Prism and clicking on the gear icon on the top right (settings) or by pressing 'S' and selecting 'Upgrade Software':
This will launch the 'Upgrade Software' dialog box and will show your current software version and if there are any upgrade versions available. It is also possible to manually upload a NOS binary file.
You can then download the upgrade version from the cloud or upload the version manually:
It will then upload the upgrade software onto the Nutanix CVMs:
After the software is loaded click on 'Upgrade' to start the upgrade process:
You'll then be prompted with a confirmation box:
The upgrade will start with pre-upgrade checks then start upgrading the software in a rolling manner:
Once the upgrade is complete you'll see an updated status and have access to all of the new features:
Your Prism session will briefly disconnect during the upgrade when the current Prism Leader is upgraded. All VMs and services running remain unaffected.
Similar to Nutanix software upgrades, hypervisor upgrades can be fully automated in a rolling manner via Prism.
To begin, follow the steps above to launch the 'Upgrade Software' dialog box and select 'Hypervisor'.
You can then download the hypervisor upgrade version from the cloud or upload the version manually:
It will then load the upgrade software onto the Hypervisors. After the software is loaded click on 'Upgrade' to start the upgrade process:
You'll then be prompted with a confirmation box:
The system will then go through host pre-upgrade checks and upload the hypervisor upgrade to the cluster:
Once the pre-upgrade checks are complete the rolling hypervisor upgrade will then proceed:
Similar to the rolling nature of the Nutanix software upgrades, each host will be upgraded in a rolling manner with zero impact to running VMs. VMs will be live-migrated off the current host, the host will be upgraded, and then rebooted. This process will iterate through each host until all hosts in the cluster are upgraded.
You can also get cluster wide upgrade status from any Nutanix CVM by running 'host_upgrade --status'. The detailed per host status is logged to ~/data/logs/host_upgrade.out on each CVM.
Once the upgrade is complete you'll see an updated status and have access to all of the new features:
The ability to dynamically scale the Acropolis cluster is core to its functionality. To scale an Acropolis cluster, rack / stack / cable the nodes and power them on. Once the nodes are powered up they will be discoverable by the current cluster using mDNS.
The figure shows an example 7 node cluster with 1 node which has been discovered:
Multiple nodes can be discovered and added to the cluster concurrently.
Once the nodes have been discovered you can begin the expansion by clicking 'Expand Cluster' on the upper right hand corner of the 'Hardware' page:
You can also begin the cluster expansion process from any page by clicking on the gear icon:
This launches the expand cluster menu where you can select the node(s) to add and specify IP addresses for the components:
After the hosts have been selected you'll be prompted to upload a hypervisor image which will be used to image the nodes being added:
After the upload is completed you can click on 'Expand Cluster' to begin the imaging and expansion process:
The job will then be submitted and the corresponding task item will appear:
Detailed task status can be viewed by expanding the task(s):
After the imaging and add node process has been completed you'll see the updated cluster size and resources:
To get detailed capacity planning details you can click on a specific cluster under the 'cluster runway' section in Prism Central to get more details:
This view provides detailed information on cluster runway and identifies the most constrained resource (limiting resource). You can also get detailed information on what the top consumers are as well as some potential options to clean up additional capacity or ideal node types for cluster expansion.
The HTML5 UI is a key part of Prism, providing a simple, easy to use management interface. However, another core capability is the set of APIs which are available for automation. All functionality exposed through the Prism UI is also exposed through a full set of REST APIs to allow for the ability to programmatically interface with the Nutanix platform. This allows customers and partners to enable automation, 3rd-party tools, or even create their own UI.
The following section covers these interfaces and provides some example usage.
Core to any dynamic or "software-defined" environment, Nutanix provides a vast array of interfaces allowing for simple programmability and interfacing. Here are the main interfaces:
Core to this is the REST API which exposes every capability and data point of the Prism UI and allows for orchestration or automation tools to easily drive Nutanix action. This enables tools like Saltstack, Puppet, vRealize Operations, System Center Orchestrator, Ansible, etc. to easily create custom workflows for Nutanix. Also, this means that any third-party developer could create their own custom UI and pull in Nutanix data via REST.
The following figure shows a small snippet of the Nutanix REST API explorer which allows developers to interact with the API and see expected data formats:
Operations can be expanded to display details and examples of the REST call:
As of 4.5.x basic authentication over HTTPS is leveraged for client and HTTP call authentication.
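As a hedged example, the Python sketch below calls the REST API using basic authentication over HTTPS on port 9440. The endpoint path and response fields shown here are illustrative assumptions; consult the REST API explorer above for the exact resources and schemas in your release.

# Illustrative sketch only: list VM names via the Prism REST API using basic auth.
# The endpoint path and field names below are assumptions -- verify them against
# the REST API explorer for your release.
import requests
from requests.auth import HTTPBasicAuth

PRISM = "https://<cluster-or-cvm-ip>:9440"     # cluster external IP recommended
AUTH = HTTPBasicAuth("admin", "<password>")    # basic authentication over HTTPS

resp = requests.get(f"{PRISM}/PrismGateway/services/rest/v1/vms",
                    auth=AUTH, verify=False, timeout=30)
resp.raise_for_status()
for vm in resp.json().get("entities", []):
    print(vm.get("vmName"))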
The Acropolis CLI (ACLI) is the CLI for managing the Acropolis portion of the Nutanix product. These capabilities were enabled in releases after 4.1.2.
NOTE: All of these actions can be performed via the HTML5 GUI and REST API. I just use these commands as part of my scripting to automate tasks.
Description: Enter ACLI shell (run from any CVM)
acli
OR
Description: Execute ACLI command via Linux shell
acli <command>
Description: Output ACLI response in JSON format
acli -o json
Description: Lists Acropolis nodes in the cluster.
host.list
Description: Create network based on VLAN
net.create
Example: net.create vlan.133 ip_config=10.1.1.1/24
Description: List networks
net.list
Description: Create dhcp scope
net.add_dhcp_pool
Note: .254 is reserved and used by the Acropolis DHCP server if an address for the Acropolis DHCP server wasn’t set during network creation
Example: net.add_dhcp_pool vlan.100 start=10.1.1.100 end=10.1.1.200
Description: Get a network's properties
net.get
Example: net.get vlan.133
Description: Get a network's VMs and details including VM name / UUID, MAC address and IP
net.list_vms
Example: net.list_vms vlan.133
Description: Set DHCP DNS
net.update_dhcp_dns
Example: net.update_dhcp_dns vlan.100 servers=10.1.1.1,10.1.1.2 domains=splab.com
Description: Create VM
vm.create
Example: vm.create testVM memory=2G num_vcpus=2
Description: Create bulk VM
vm.create
Example: vm.create testVM[000..999] memory=2G num_vcpus=2
Description: Create clone of existing VM
vm.clone
Example: vm.clone testClone clone_from_vm=MYBASEVM
Description: Create bulk clones of existing VM
vm.clone
Example: vm.clone testClone[001..999] clone_from_vm=MYBASEVM
Description: Create disk for OS
vm.disk_create
Example: vm.disk_create testVM create_size=500G container=default
Description: Create and add NIC
vm.nic_create
Example: vm.nic_create testVM network=vlan.100
Description: Set a VM boot device
Set to boot from specific disk id
vm.update_boot_device
Example: vm.update_boot_device testVM disk_addr=scsi.0
Set to boot from CDrom
vm.update_boot_device
Example: vm.update_boot_device testVM disk_addr=ide.0
Description: Mount ISO to VM cdrom
Steps:
1. Upload ISOs to container
2. Enable whitelist for client IPs
3. Upload ISOs to share
Create CDrom with ISO
vm.disk_create
Example: vm.disk_create testVM clone_nfs_file=/default/ISOs/myfile.iso cdrom=true
If a CDrom is already created just mount it
vm.disk_update
Example: vm.disk_update testVM ide.0 clone_nfs_file=/default/ISOs/myfile.iso
Description: Remove ISO from CDrom
vm.disk_update
Description: Power on VM(s)
vm.on
Example: vm.on testVM
Power on all VMs
Example: vm.on *
Power on range of VMs
Example: vm.on testVM[01..99]
NOTE: All of these actions can be performed via the HTML5 GUI and REST API. I just use these commands as part of my scripting to automate tasks.
Description: Adds a particular subnet to the NFS whitelist
ncli cluster add-to-nfs-whitelist ip-subnet-masks=10.2.0.0/255.255.0.0
Description: Displays the current version of the Nutanix software
ncli cluster version
Description: Displays the hidden ncli commands/options
ncli helpsys listall hidden=true [detailed=false|true]
Description: Displays the existing storage pools
ncli sp ls
Description: Displays the existing containers
ncli ctr ls
Description: Creates a new container
ncli ctr create name=
Description: Displays the existing VMs
ncli vm ls
Description: Displays the existing public keys
ncli cluster list-public-keys
Description: Adds a public key for cluster access
SCP public key to CVM
Add public key to cluster
ncli cluster add-public-key name=myPK file-path=~/mykey.pub
Description: Removes a public key for cluster access
ncli cluster remove-public-keys name=myPK
Description: Creates a protection domain
ncli pd create name=
Description: Create a remote site for replication
ncli remote-site create name=
Description: Protect all VMs in the specified container
ncli pd protect name=
Description: Protect the VMs specified
ncli pd protect name=
Description: Protect the DSF Files specified
ncli pd protect name=
Description: Create a one-time snapshot of the protection domain
ncli pd add-one-time-snapshot name=
Description: Create a recurring snapshot schedule and replication to n remote sites
ncli pd set-schedule name=
Description: Monitor replication status
ncli pd list-replication-status
Description: Fail-over a protection domain to a remote site
ncli pd migrate name=
Description: Activate a protection domain at a remote site
ncli pd activate name=
Description: Enables the DSF Shadow Clone feature
ncli cluster edit-params enable-shadow-clones=true
Description: Enables fingerprinting and/or on disk dedup for a specific vDisk
ncli vdisk edit name=
The following sections will cover the Nutanix PowerShell CMDlets, how to use them, and some general background on Windows PowerShell.
Windows PowerShell is a powerful shell (hence the name ;P) and scripting language built on the .NET framework. It is a very simple to use language and is built to be intuitive and interactive. Within PowerShell there are a few key constructs/Items:
CMDlets are commands or .NET classes which perform a particular operation. They usually conform to the Getter/Setter methodology and typically use a Verb-Noun based structure (e.g., Get-Process, Set-Location).
Piping is an important construct in PowerShell (similar to its use in Linux) and can greatly simplify things when used correctly. With piping you’re essentially taking the output of one section of the pipeline and using that as input to the next section of the pipeline. The pipeline can be as long as required (assuming there remains output which is being fed to the next section of the pipe). A very simple example could be getting the current processes, finding those that match a particular trait or filter and then sorting them:
Get-Service | where {$_.Status -eq "Running"} | Sort-Object Name
Piping can also be used in place of for-each, for example:
# For each item in my array
$myArray | %{
# Do something
}
Below are a few of the key object types in PowerShell. You can easily get the object type by using the .GetType() method, for example: $someVariable.GetType() will return the object's type.
$myVariable = "foo"
Note: You can also set a variable to the output of a series or pipeline of commands:
$myVar2 = (Get-Service | where {$_.Status -eq "Running"})
In this example the commands inside the parentheses will be evaluated first, then the variable will be set to the outcome of that.
$myArray = @("Value","Value")
Note: You can also have an array of arrays, hash tables or custom objects
$myHash = @{"Key1" = "Value1";"Key2" = "Value2"}
Get the help content for a particular CMDlet (similar to a man page in Linux)
Get-Help
Example: Get-Help Get-Process
List properties and methods of a command or object
Example: $someObject | Get-Member
Download Nutanix CMDlets Installer: The Nutanix CMDlets can be downloaded directly from the Prism UI (post 4.0.1) and can be found in the drop down in the upper right hand corner:
Check if snappin is loaded and if not, load
if ( (Get-PSSnapin -Name NutanixCmdletsPSSnapin -ErrorAction SilentlyContinue) -eq $null )
{
Add-PsSnapin NutanixCmdletsPSSnapin
}
Get-Command | Where-Object{$_.PSSnapin.Name -eq "NutanixCmdletsPSSnapin"}
Connect-NutanixCluster -Server $server -UserName "myuser" -Password "myuser" -AcceptInvalidSSLCerts
Or, the more secure way, prompting the user for the password:
Connect-NutanixCluster -Server $server -UserName "myuser" -Password (Read-Host "Password: ") -AcceptInvalidSSLCerts
Set to variable
$searchString = "myVM"
$vms = Get-NTNXVM | where {$_.vmName -match $searchString}
Interactive
Get-NTNXVM | where {$_.vmName -match "myString"}
Interactive and formatted
Get-NTNXVM | where {$_.vmName -match "myString"} | ft
Set to variable
$vdisks = Get-NTNXVDisk
Interactive
Get-NTNXVDisk
Interactive and formatted
Get-NTNXVDisk | ft
Set to variable
$containers = Get-NTNXContainer
Interactive
Get-NTNXContainer
Interactive and formatted
Get-NTNXContainer | ft
Set to variable
$pds = Get-NTNXProtectionDomain
Interactive
Get-NTNXProtectionDomain
Interactive and formatted
Get-NTNXProtectionDomain | ft
Set to variable
$cgs = Get-NTNXProtectionDomainConsistencyGroup
Interactive
Get-NTNXProtectionDomainConsistencyGroup
Interactive and formatted
Get-NTNXProtectionDomainConsistencyGroup | ft
You can find more scripts on the Nutanix Github located at https://github.com/nutanix
OpenStack is an open source platform for managing and building clouds. It is primarily broken into the front-end (dashboard and API) and infrastructure services (compute, storage, etc.).
The OpenStack and Nutanix solution is composed of two main components
The OpenStack Controller can be an existing VM / host, or deployed as part of the OpenStack on Nutanix solution. The Acropolis OVM is a helper VM which is deployed as part of the Nutanix OpenStack solution.
The client communicates with the OpenStack Controller using their expected methods (Web UI / HTTP, SDK, CLI or API) and the OpenStack controller communicates with the Acropolis OVM which translates the requests into native Acropolis REST API calls using the OpenStack Driver.
The figure shows a high-level overview of the communication:
The current solution (as of 4.5.1) requires an OpenStack Controller on version Kilo or later.
The table shows a high-level conceptual role mapping:
Item | Role | OpenStack Controller | Acropolis OVM | Acropolis Cluster
---|---|---|---|---
Tenant Dashboard | User interface and API | X | |
Orchestration | Object CRUD and lifecycle management | X | |
Quotas | Resource controls and limits | X | |
Users, Groups and Roles | Role based access control (RBAC) | X | |
SSO | Single-sign on | X | |
Platform Integration | OpenStack to Nutanix integration | | X |
Infrastructure Services | Target infrastructure (compute, storage, network) | | | X
OpenStack is composed of a set of components which are responsible for serving various infrastructure functions. Some of these functions will be hosted by the OpenStack Controller and some will be hosted by the Acropolis OVM.
The table shows the core OpenStack components and role mapping:
Component | Role | OpenStack Controller | Acropolis OVM |
---|---|---|---|
Keystone | Identity service | X | |
Horizon | Dashboard and UI | X | |
Nova | Compute | X | |
Swift | Object storage | X | X |
Cinder | Block storage | X | |
Glance | Image service | X | X |
Neutron | Networking | X | |
Heat | Orchestration | X | |
Others | All other components | X |
The figure shows a more detailed view of the OpenStack components and communication:
In the following sections we will go through some of the main OpenStack components and how they are integrated into the Nutanix platform.
Nova is the compute engine and scheduler for the OpenStack platform. In the Nutanix OpenStack solution each Acropolis OVM acts as a compute host and every Acropolis Cluster will act as a single hypervisor host eligible for scheduling OpenStack instances. The Acropolis OVM runs the Nova-compute service.
You can view the Nova services using the OpenStack portal under 'Admin'->'System'->'System Information'->'Compute Services'.
The figure shows the Nova services, host and state:
The Nova scheduler decides which compute host (i.e. Acropolis OVM) to place the instances based upon the selected availability zone. These requests will be sent to the selected Acropolis OVM which will forward the request to the target host's (i.e. Acropolis cluster) Acropolis scheduler. The Acropolis scheduler will then determine optimal node placement within the cluster. Individual nodes within a cluster are not exposed to OpenStack.
You can view the compute and hypervisor hosts using the OpenStack portal under 'Admin'->'System'->'Hypervisors'.
The figure shows the Acropolis OVM as the compute host:
The figure shows the Acropolis cluster as the hypervisor host:
As you can see from the previous image the full cluster resources are seen in a single hypervisor host.
Swift is an object store used to store and retrieve files. This is currently only leveraged for backup / restore of snapshots and images.
Cinder is OpenStack's volume component for exposing iSCSI targets. Cinder leverages the Acropolis Volumes API in the Nutanix solution. These volumes are attached to the instance(s) directly as block devices (as compared to in-guest).
You can view the Cinder services using the OpenStack portal under 'Admin'->'System'->'System Information'->'Block Storage Services'.
The figure shows the Cinder services, host and state:
Glance is the image store for OpenStack and shows the available images for provisioning. Images can include ISOs, disks, and snapshots.
The Image Repo is the repository storing available images published by Glance. These can be located within the Nutanix environment or on an external source. When the images are hosted on the Nutanix platform, they will be published to the OpenStack controller via Glance on the OVM. In cases where the Image Repo exists only on an external source, Glance will be hosted by the OpenStack Controller and the Image Cache will be leveraged on the Acropolis Cluster(s).
Glance is enabled on a per-cluster basis and will always exist with the Image Repo. When Glance is enabled on multiple clusters the Image Repo will span those clusters and images created via the OpenStack Portal will be propagated to all clusters running Glance. Those clusters not hosting Glance will cache the images locally using the Image Cache.
For larger deployments Glance should run on at least two Acropolis Clusters per site. This will provide Image Repo HA in the case of a cluster outage and ensure the images will always be available when not in the Image Cache.
When external sources host the Image Repo / Glance, Nova will be responsible for handling data movement from the external source to the target Acropolis Cluster(s). In this case the Image Cache will be leveraged on the target Acropolis Cluster(s) to cache the image locally for any subsequent provisioning requests for the image.
Neutron is the networking component of OpenStack and responsible for network configuration. The Acropolis OVM allows network CRUD operations to be performed by the OpenStack portal and will then make the required changes in Acropolis.
You can view the Neutron services using the OpenStack portal under 'Admin'->'System'->'System Information'->'Network Agents'.
The figure shows the Neutron services, host and state:
Neutron will assign IP addresses to instances when they are booted. In this case Acropolis will receive a desired IP address for the VM which will be allocated. When the VM performs a DHCP request the Acropolis Master will respond to the DHCP request on a private VXLAN as usual with Acropolis Hypervisor.
Currently only Local and VLAN network types are supported.
The Keystone and Horizon components run in an OpenStack Controller which interfaces with the Acropolis OVM. The OVM(s) have an OpenStack Driver which is responsible for translating the OpenStack API calls into native Acropolis API calls.
For large scale cloud deployments it is important to leverage a delivery topology that will be distributed and meet the requirements of the end-users while providing flexibility and locality.
OpenStack leverages the following high-level constructs which are defined below:
The figure shows the high-level relationship of the constructs:
The figure shows an example application of the constructs:
You can view and manage hosts, host aggregates and availability zones using the OpenStack portal under 'Admin'->'System'->'Host Aggregates'.
The figure shows the host aggregates, availability zones and hosts:
For larger deployments it is recommended to have multiple Acropolis OVMs connected to the OpenStack Controller abstracted by a load balancer. This allows for HA of the OVMs as well as distribution of transactions. The OVM(s) don't contain any state information, allowing them to be scaled.
The figure shows an example of scaling OVMs for a single site:
One method to achieve this for the OVM(s) is using Keepalived and HAproxy.
For environments spanning multiple sites the OpenStack Controller will talk to multiple Acropolis OVMs across sites.
The figure shows an example of the deployment across multiple sites:
The OVM can be deployed as a standalone RPM on a CentOS / Redhat distro or as a full VM. The Acropolis OVM can be deployed on any platform (Nutanix or non-Nutanix) as long as it has network connectivity to the OpenStack Controller and Nutanix Cluster(s).
The VM(s) for the Acropolis OVM can be deployed on a Nutanix AHV cluster using the following steps. If the OVM is already deployed you can skip past the VM creation steps. You can use the full OVM image or use an existing CentOS / Redhat VM image.
First we will import the provided Acropolis OVM disk image to the Acropolis cluster. This can be done by copying the disk image over using SCP or by specifying a URL to copy the file from. Note: It is possible to deploy this VM anywhere, not necessarily on an Acropolis cluster.
To copy the file, SCP the image to any CVM IP on port 2222, then run the following command to create the image from the disk:
image.create
To import the disk image using Images API, run the following command:
image.create
Next create the Acropolis VM for the OVM by running the following ACLI commands on any CVM:
vm.create <VM name> num_vcpus=<num vCPUs> memory=<memory>
vm.disk_create <VM name> clone_from_image=<image name>
vm.nic_create <VM name> network=<network name>
vm.on <VM name>
Once the VM(s) have been created and powered on, SSH to the OVM(s) using the provided credentials.
Next we'll download the RPM to the OVM(s) using SCP or wget or any other file transfer protocol to begin the installation.
After the RPM has been downloaded we'll install it on the OVM(s):
# Install OVM RPM
# If CentOS
yum install
# If Redhat
rpm -i
Next we'll configure the OVM(s) by running the following commands (must be run on every OVM):
# Enter OVM Shell (if not already)
ovmctl
# Register OpenStack Driver service
ovmctl --add=service --name=
# Register OpenStack Controller
# Items in '[]' are optional
ovmctl --add=controller --name=
The following values are used as defaults:
Authentication: auth_strategy = keystone, auth_region = RegionOne
auth_tenant = services, auth_password = admin
Database: db_{nova,cinder,glance,neutron} = mysql, db_{nova,cinder,glance,neutron}_password = admin
RPC: rpc_backend = rabbit, rpc_username = guest, rpc_password = guest
# Register Acropolis Cluster(s) (run for each cluster to add)
# Items in '[]' are optional
ovmctl --add=cluster --name=
The following values are used as defaults:
Number of VCPUs per core = 4
Container name = default
Image cache = disabled, Image cache URL = None
Now that the OVM has been configured, we'll configure the OpenStack Controller to know about the Glance and Neutron endpoints.
First we will get the Keystone service ids for Glance and Neutron by running the following commands on the OpenStack Controller:
source keystonerc_admin
keystone service-list
The output should look similar to the following:
+----------------------------------+------------+-----------------+
| id | name | type |
+----------------------------------+------------+-----------------+
| e95f5c6a56dc4d93b016dbad0b72351e | ceilometer | metering |
| 09f82d2cacc64e6082755fe15f35dbcc | cinder | volume |
| f169a9e6a5744b4f8f88897a4bd2b16a | cinderv2 | volumev2 |
| 9e539e8dee264dd9a086677427434982 | glance | image |
| e0d6cc81400642e092c3290e82a3b607 | heat | orchestration |
| 9082586f93eb4ac3be59b880a163c2b8 | keystone | identity |
| f4c4266142c742a78b330f8bafe5e49e | neutron | network |
| df0cc41a9de2490a8ae403f4b026adab | nova | compute |
| 829d9667fd194102898e489503f6bbad | nova_ec2 | ec2 |
| b00d961b5b0c4ebbb9bdc742ff6570bf | novav3 | computev3 |
| 717c7602de9d44eba95e821a9a4aaf26 | sahara | data-processing |
| 23b80e6d3fd84c62943d3602e3f6cdc7 | swift | object-store |
| 6284a3da6d9243e98d3e923e46122109 | swift_s3 | s3 |
| f3f5eae89c1e4e57be222505637dfb36 | trove | database |
+----------------------------------+------------+-----------------+
We will then create the two endpoints for Glance and Neutron using the service ids gathered previously:
# Add Keystone endpoint for Glance
keystone endpoint-create \
--service-id=<GLANCE SERVICE ID> \
--publicurl=http://<OVM IP>:9292 \
--internalurl=http://<OVM IP>:9292 \
--region=<REGION NAME> \
--adminurl=http://<OVM IP>:9292
# Add Keystone endpoint for Neutron
keystone endpoint-create \
--service-id=<NEUTRON SERVICE ID> \
--publicurl=http://<OVM IP>:9696 \
--internalurl=http://<OVM IP>:9696 \
--region=<REGION NAME> \
--adminurl=http://<OVM IP>:9696
After the endpoints have been created we will update the Nova and Cinder configuration files with the Acropolis OVM IP as the Glance host.
First we will edit nova.conf, which is located at /etc/nova/nova.conf, and update the following lines:
[glance]
...
# Default glance hostname or IP address (string value)
host=<OVM IP>
# Default glance port (integer value)
port=9292
...
# A list of the glance api servers available to nova. Prefix
# with https:// for ssl-based glance api servers.
# ([hostname|ip]:port) (list value)
api_servers=<OVM IP>:9292
Next we will edit cinder.conf, which is located at /etc/cinder/cinder.conf, and update the following items:
# Default glance host name or IP (string value)
glance_host=<OVM IP>
# Default glance port (integer value)
glance_port=9292
# A list of the glance API servers available to cinder
# ([hostname|ip]:port) (list value)
glance_api_servers=$glance_host:$glance_port
After the files have been edited we will restart the Nova and Cinder services to take the new configuration settings. The services can be restarted with the following commands below or by running the scripts which are available for download.
# Restart Nova services
service openstack-nova-api restart
service openstack-nova-consoleauth restart
service openstack-nova-scheduler restart
service openstack-nova-conductor restart
service openstack-nova-cert restart
service openstack-nova-novncproxy restart
# OR you can also use the script which can be downloaded as part of the helper tools:
~/openstack/commands/nova-restart
# Restart Cinder
service openstack-cinder-api restart
service openstack-cinder-scheduler restart
service openstack-cinder-backup restart
# OR you can also use the script which can be downloaded as part of the helper tools:
~/openstack/commands/cinder-restart
Component | Key Log Location(s) |
---|---|
Keystone | /var/log/keystone/keystone.log |
Horizon | /var/log/horizon/horizon.log |
Nova | /var/log/nova/nova-api.log /var/log/nova/nova-scheduler.log /var/log/nova/nova-compute.log* |
Swift | /var/log/swift/swift.log |
Cinder | /var/log/cinder/api.log /var/log/cinder/scheduler.log /var/log/cinder/volume.log |
Glance | /var/log/glance/api.log /var/log/glance/registry.log |
Neutron | /var/log/neutron/server.log /var/log/neutron/dhcp-agent.log* /var/log/neutron/l3-agent.log* /var/log/neutron/metadata-agent.log* /var/log/neutron/openvswitch-agent.log* |
Logs marked with * are on the Acropolis OVM only.
Check NTP if a service is seen as state 'down' in the OpenStack Manager (Admin UI or CLI) even though the service is running in the OVM. Many services have a requirement for time to be in sync between the OpenStack Controller and Acropolis OVM.
Load Keystone source (perform before running other commands)
source keystonerc_admin
List Keystone services
keystone service-list
List Keystone endpoints
keystone endpoint-list
Create Keystone endpoint
keystone endpoint-create \
--service-id=<SERVICE ID> \
--publicurl=http://<IP:PORT> \
--internalurl=http://<IP:PORT> \
--region=<REGION NAME> \
--adminurl=http://<IP:PORT>
List Nova instances
nova list
Show instance details
nova show
List Nova hypervisor hosts
nova hypervisor-list
Show hypervisor host details
nova hypervisor-show
List Glance images
glance image-list
Show Glance image details
glance image-show
a·crop·o·lis - /ɘ ' kräpɘlis/ - noun - data plane
storage, compute and virtualization platform.
Acropolis is a distributed multi-resource manager, orchestration platform and data plane.
It is broken down into three main components:
Building upon the distributed nature of everything Nutanix does, we’re expanding this into the virtualization and resource management space. Acropolis is a back-end service that allows for workload and resource management, provisioning, and operations. Its goal is to abstract the facilitating resource (e.g., hypervisor, on-premise, cloud, etc.) from the workloads running, while providing a single “platform” to operate.
This gives workloads the ability to seamlessly move between hypervisors, cloud providers, and platforms.
The figure highlights an image illustrating the conceptual nature of Acropolis at various layers:
Currently, the only fully supported hypervisor for VM management is Acropolis Hypervisor, however this may expand in the future. The Volumes API and read-only operations are still supported on all.
An Acropolis Slave runs on every CVM with an elected Acropolis Master which is responsible for task scheduling, execution, IPAM, etc. Similar to other components which have a Master, if the Acropolis Master fails, a new one will be elected.
The role breakdown for each can be seen below:
Here we show a conceptual view of the Acropolis Master / Slave relationship:
For a video explanation you can watch the following video: LINK
The Nutanix solution is a converged storage + compute solution which leverages local components and creates a distributed platform for virtualization, also known as a virtual computing platform. The solution is a bundled hardware + software appliance which houses 2 (6000/7000 series) or 4 nodes (1000/2000/3000/3050 series) in a 2U footprint.
Each node runs an industry-standard hypervisor (ESXi, KVM, Hyper-V currently) and the Nutanix Controller VM (CVM). The Nutanix CVM is what runs the Nutanix software and serves all of the I/O operations for the hypervisor and all VMs running on that host. For the Nutanix units running VMware vSphere, the SCSI controller, which manages the SSD and HDD devices, is directly passed to the CVM leveraging VM-Direct Path (Intel VT-d). In the case of Hyper-V, the storage devices are passed through to the CVM.
The following figure provides an example of what a typical node logically looks like:
As mentioned above (likely numerous times), the Nutanix platform is a software-based solution which ships as a bundled software + hardware appliance. The controller VM is where the vast majority of the Nutanix software and logic sits and was designed from the beginning to be an extensible and pluggable architecture. A key benefit to being software-defined and not relying upon any hardware offloads or constructs is around extensibility. As with any product life cycle, advancements and new features will always be introduced.
By not relying on any custom ASIC/FPGA or hardware capabilities, Nutanix can develop and deploy these new features through a simple software update. This means that the deployment of a new feature (e.g., deduplication) can be deployed by upgrading the current version of the Nutanix software. This also allows newer generation features to be deployed on legacy hardware models. For example, say you’re running a workload running an older version of Nutanix software on a prior generation hardware platform (e.g., 2400). The running software version doesn’t provide deduplication capabilities which your workload could benefit greatly from. To get these features, you perform a rolling upgrade of the Nutanix software version while the workload is running, and you now have deduplication. It’s really that easy.
Similar to features, the ability to create new “adapters” or interfaces into DSF is another key capability. When the product first shipped, it solely supported iSCSI for I/O from the hypervisor, this has now grown to include NFS and SMB. In the future, there is the ability to create new adapters for various workloads and hypervisors (HDFS, etc.). And again, all of this can be deployed via a software update. This is contrary to most legacy infrastructures, where a hardware upgrade or software purchase is normally required to get the “latest and greatest” features. With Nutanix, it’s different. Since all features are deployed in software, they can run on any hardware platform, any hypervisor, and be deployed through simple software upgrades.
The following figure shows a logical representation of what this software-defined controller framework looks like:
For a visual explanation you can watch the following video: LINK
The Nutanix platform is composed of the following high-level components:
In this section, I’ll cover how the various storage devices (SSD / HDD) are broken down, partitioned, and utilized by the Nutanix platform. NOTE: All of the capacities used are in Base2 Gibibyte (GiB) instead of the Base10 Gigabyte (GB). Formatting of the drives with a filesystem and associated overheads has also been taken into account.
SSD devices store a few key items which are explained in greater detail above:
The following figure shows an example of the storage breakdown for a Nutanix node’s SSD(s):
NOTE: The sizing for OpLog is done dynamically as of release 4.0.1 which will allow the extent store portion to grow dynamically. The values used are assuming a completely utilized OpLog. Graphics and proportions aren’t drawn to scale. When evaluating the Remaining GiB capacities, do so from the top down. For example, the Remaining GiB to be used for the OpLog calculation would be after Nutanix Home and Cassandra have been subtracted from the formatted SSD capacity.
Most models ship with 1 or 2 SSDs, however the same construct applies for models shipping with more SSD devices. For example, if we apply this to an example 3060 or 6060 node which has 2 x 400GB SSDs, this would give us 100GiB of OpLog, 40GiB of Content Cache, and ~440GiB of Extent Store SSD capacity per node.
Since HDD devices are primarily used for bulk storage, their breakdown is much simpler:
For example, if we apply this to an example 3060 node which has 4 x 1TB HDDs, this would give us 80GiB reserved for Curator and ~3.4TiB of Extent Store HDD capacity per node.
NOTE: the above values are accurate as of 4.0.1 and may vary by release.
Together, a group of Nutanix nodes forms a distributed platform called the Acropolis Distributed Storage Fabric (DSF). DSF appears to the hypervisor like any centralized storage array, however all of the I/Os are handled locally to provide the highest performance. More detail on how these nodes form a distributed system can be found in the next section.
The following figure shows an example of how these Nutanix nodes form DSF:
The Acropolis Distributed Storage Fabric is composed of the following high-level structs:
The following figure shows how these map between DSF and the hypervisor:
The following figure shows how these structs relate between the various file systems:
Here is another graphical representation of how these units are related:
For a visual explanation, you can watch the following video: LINK
The Nutanix I/O path is composed of the following high-level components:
The following figure shows a high-level overview of the Content Cache:
Data is brought into the cache at a 4K granularity and all caching is done in real time (e.g. there is no delay or batch process to pull data into the cache).
For a visual explanation, you can watch the following video: LINK
The Nutanix platform currently uses a resiliency factor, also known as a replication factor (RF), and checksums to ensure data redundancy and availability in the case of a node or disk failure or corruption. As explained above, the OpLog acts as a staging area to absorb incoming writes onto a low-latency SSD tier. Upon being written to the local OpLog, the data is synchronously replicated to the OpLogs of one or two other Nutanix CVMs (dependent on RF) before being acknowledged (Ack) as a successful write to the host. This ensures that the data exists in at least two or three independent locations and is fault tolerant. NOTE: For RF3, a minimum of 5 nodes is required since metadata will be RF5.
Data RF is configured via Prism and is done at the container level. All nodes participate in OpLog replication to eliminate any “hot nodes”, ensuring linear performance at scale. While the data is being written, a checksum is computed and stored as part of its metadata. Data is then asynchronously drained to the extent store where the RF is implicitly maintained. In the case of a node or disk failure, the data is then re-replicated among all nodes in the cluster to maintain the RF. Any time the data is read, the checksum is computed to ensure the data is valid. In the event where the checksum and data don’t match, the replica of the data will be read and will replace the non-valid copy.
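To make the write path concrete, here is a purely conceptual Python sketch, not the Nutanix implementation (and the checksum algorithm used here is an assumption), of the sequence described above: absorb the write in the local OpLog, synchronously replicate it to RF-1 peer CVMs, keep a checksum with the metadata, and only then acknowledge the write:

# Conceptual illustration only -- not the Nutanix implementation.
import hashlib

RF = 2  # replication factor configured on the container

def write_block(data: bytes, local_oplog: list, peer_oplogs: list) -> str:
    checksum = hashlib.sha1(data).hexdigest()   # checksum stored with metadata and
                                                # verified on every subsequent read
    local_oplog.append(data)                    # 1. absorb the write on the local SSD OpLog
    for peer in peer_oplogs[:RF - 1]:           # 2. synchronously replicate to RF-1 peer CVMs
        peer.append(data)
    return checksum                             # 3. only now is the write ack'd to the host

# RF2 example: the write lands on the local OpLog and one peer before the ack
local, peer_a, peer_b = [], [], []
cs = write_block(b"some 4K block", local, [peer_a, peer_b])
print(len(local), len(peer_a), len(peer_b), cs[:8])   # -> 1 1 0 <checksum prefix>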
The following figure shows an example of what this logically looks like:
For a visual explanation, you can watch the following video: LINK
Metadata is at the core of any intelligent system and is even more critical for any filesystem or storage array. In terms of DSF, there are a few key structs that are critical for its success: it has to be right 100% of the time (known as "strictly consistent"), it has to be scalable, and it has to perform at massive scale. As mentioned in the architecture section above, DSF utilizes a "ring-like" structure as a key-value store which stores essential metadata as well as other platform data (e.g., stats, etc.). In order to ensure metadata availability and redundancy an RF is utilized among an odd number of nodes (e.g., 3, 5, etc.). Upon a metadata write or update, the row is written to a node in the ring and then replicated to n number of peers (where n is dependent on cluster size). A majority of nodes must agree before anything is committed, which is enforced using the Paxos algorithm. This ensures strict consistency for all data and metadata stored as part of the platform.
The following figure shows an example of a metadata insert/update for a 4 node cluster:
Performance at scale is another important attribute of DSF metadata. Contrary to traditional dual-controller or “master” models, each Nutanix node is responsible for a subset of the overall platform’s metadata. This eliminates the traditional bottlenecks by allowing metadata to be served and manipulated by all nodes in the cluster. A consistent hashing scheme is utilized to minimize the redistribution of keys during cluster size modifications (also known as “add/remove node”). When the cluster scales (e.g., from 4 to 8 nodes), the nodes are inserted throughout the ring between nodes for “block awareness” and reliability.
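A minimal consistent-hashing sketch (not the actual ring implementation) shows why growing the cluster only remaps a small slice of keys; the node names and key format are made up for illustration:

```python
import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Each node owns the arc of hash space up to its position on the ring."""

    def __init__(self, nodes):
        self.ring = sorted((_hash(n), n) for n in nodes)
        self.points = [p for p, _ in self.ring]

    def owner(self, key):
        idx = bisect.bisect(self.points, _hash(key)) % len(self.ring)
        return self.ring[idx][1]

    def add_node(self, node):
        # Only keys on the new node's arc move; everything else stays put
        bisect.insort(self.ring, (_hash(node), node))
        self.points = [p for p, _ in self.ring]

ring = ConsistentHashRing(["node1", "node2", "node3", "node4"])
print(ring.owner("vdisk-42:extent-7"))   # which node serves this key (hash dependent)
ring.add_node("node5")                   # most keys keep their existing owner
```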
The following figure shows an example of the metadata “ring” and how it scales:
For a visual explanation, you can watch the following video: LINK
Reliability and resiliency are key, if not the most important, concepts within DSF or any primary storage platform.
Contrary to traditional architectures which are built around the idea that hardware will be reliable, Nutanix takes a different approach: it expects hardware will eventually fail. By doing so, the system is designed to handle these failures in an elegant and non-disruptive manner.
NOTE: That doesn’t mean the hardware quality isn’t there, just that there is a shift in mindset. The Nutanix hardware and QA teams undergo an exhaustive qualification and vetting process.
Potential levels of failure
Being a distributed system, DSF is built to handle component, service, and CVM failures, which can be characterized on a few levels:
A disk failure can be characterized as just that: a disk which has either been removed, has had a die failure, or is experiencing I/O errors and has been proactively removed.
VM impact:
In the event of a disk failure, a Curator scan (MapReduce Framework) will occur immediately. It will scan the metadata (Cassandra) to find the data previously hosted on the failed disk and the nodes / disks hosting the replicas.
Once it has found the data that needs to be “re-replicated”, it will distribute the replication tasks to the nodes throughout the cluster.
An important thing to highlight here is that, because Nutanix distributes data and replicas across all nodes / CVMs / disks, all nodes / CVMs / disks will participate in the re-replication.
This substantially reduces the time required for re-protection, as the power of the full cluster can be utilized; the larger the cluster, the faster the re-protection.
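Conceptually, the re-replication fan-out looks something like the following sketch (Curator is a MapReduce framework over the Cassandra metadata; this simplification and its field names are assumptions for illustration):

```python
from itertools import cycle

def rebuild_plan(metadata_rows, failed_disk, healthy_nodes):
    """Find extent groups that had a replica on the failed disk and spread the
    re-replication work across every healthy node (no single rebuild target)."""
    targets = cycle(healthy_nodes)              # round-robin task distribution
    tasks = []
    for egroup, replicas in metadata_rows.items():
        if failed_disk in replicas:
            survivors = [r for r in replicas if r != failed_disk]
            tasks.append({
                "egroup": egroup,
                "read_from": survivors[0],      # any surviving replica
                "write_to": next(targets),      # work fans out cluster wide
            })
    return tasks
```

Because the tasks are spread across every healthy node, rebuild throughput grows with cluster size, which is the point made above about larger clusters re-protecting faster.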
A CVM “failure” can be characterized as a CVM power action causing the CVM to be temporarily unavailable. The system is designed to handle these failures transparently and gracefully. In the event of a failure, I/Os will be re-directed to other CVMs within the cluster. The mechanism for this will vary by hypervisor.
The rolling upgrade process actually leverages this capability as it will upgrade one CVM at a time, iterating through the cluster.
VM impact:
In the event of a CVM “failure”, the I/O which was previously being served by the down CVM will be forwarded to other CVMs throughout the cluster. ESXi and Hyper-V handle this via a process called CVM Autopathing, which leverages HA.py (like “happy”); it will modify the routes to forward traffic going to the internal address (192.168.5.2) to the external IP of other CVMs throughout the cluster. This enables the datastore to remain intact; just the CVM responsible for serving the I/Os is remote.
Once the local CVM comes back up and is stable, the route would be removed and the local CVM would take over all new I/Os.
In the case of KVM, iSCSI multi-pathing is leveraged, where the primary path is the local CVM and the two other paths are remote. In the event where the primary path fails, one of the other paths will become active.
Similar to Autopathing with ESXi and Hyper-V, when the local CVM comes back online, it’ll take over as the primary path.
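Conceptually, the ESXi / Hyper-V autopathing step boils down to a host route change like the sketch below (illustrative only; HA.py’s actual logic and commands differ and are specific to each hypervisor, and the remote CVM IP is hypothetical):

```python
import subprocess

CVM_INTERNAL_IP = "192.168.5.2"   # internal CVM address referenced above

def failover_route(local_cvm_healthy, remote_cvm_external_ip):
    """Sketch of the autopathing decision: while the local CVM is down, detour
    traffic destined to 192.168.5.2 to a healthy remote CVM; once it returns,
    remove the detour so the local CVM serves I/O again."""
    if not local_cvm_healthy:
        subprocess.run(["ip", "route", "replace",
                        f"{CVM_INTERNAL_IP}/32", "via", remote_cvm_external_ip],
                       check=True)
    else:
        subprocess.run(["ip", "route", "del", f"{CVM_INTERNAL_IP}/32"],
                       check=False)   # ignore if the detour is already gone

# e.g., failover_route(local_cvm_healthy=False, remote_cvm_external_ip="10.0.0.52")
```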
VM Impact:
In the event of a node failure, a VM HA event will occur restarting the VMs on other nodes throughout the virtualization cluster. Once restarted, the VMs will continue to perform I/Os as usual which will be handled by their local CVMs.
Similar to the case of a disk failure above, a Curator scan will find the data previously hosted on the node and its respective replicas.
Similar to the disk failure scenario above, the same process will take place to re-protect the data, just for the full node (all associated disks).
In the event where the node remains down for a prolonged period of time, the down CVM will be removed from the metadata ring. It will be joined back into the ring after it has been up and stable for a duration of time.
For a visual explanation, you can watch the following video: LINK
The Nutanix Capacity Optimization Engine (COE) is responsible for performing data transformations to increase data efficiency on disk. Currently, compression is one of the key features the COE uses to perform data optimization. DSF provides both in-line and post-process flavors of compression to best suit the customer’s needs and type of data.
In-line compression will compress sequential streams of data or large I/O sizes in memory before they are written to disk, while post-process compression will initially write the data as normal (in an un-compressed state) and then leverage the Curator framework to compress the data cluster wide. When in-line compression is enabled but the I/Os are random in nature, the data will be written un-compressed in the OpLog, coalesced, and then compressed in memory before being written to the Extent Store. The Google Snappy compression library is leveraged, which provides good compression ratios with minimal computational overhead and extremely fast compression / decompression rates.
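For context on the compression primitive itself, the open-source python-snappy bindings can be used to see the ratio / speed trade-off described above (this is not how the CVM invokes Snappy internally; the sample payload is made up):

```python
import snappy   # pip install python-snappy

block = b"sequential write data " * 1024           # repetitive, compresses well
compressed = snappy.compress(block)
print(len(block), "->", len(compressed), "bytes")  # sizeable reduction on this data

assert snappy.decompress(compressed) == block      # decompression is fast and exact
```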
The following figure shows an example of how in-line compression interacts with the DSF write I/O path:
Almost always use inline compression (compression delay = 0) as it will only compress larger / sequential writes and not impact random write performance.
Inline compression also pairs perfectly with erasure coding.
For post-process compression, all new write I/O is written in an un-compressed state and follows the normal DSF I/O path. After the compression delay (configurable) is met and the data has become cold (down-migrated to the HDD tier via ILM), the data is eligible to become compressed. Post-process compression uses the Curator MapReduce framework and all nodes will perform compression tasks. Compression tasks will be throttled by Chronos.
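The eligibility check described above amounts to something like this sketch; the extent group fields shown are hypothetical:

```python
import time

def eligible_for_post_process_compression(egroup, compression_delay_secs):
    """An extent group qualifies once the configured compression delay has
    elapsed and the data has gone cold (down-migrated to the HDD tier by ILM)."""
    old_enough = (time.time() - egroup["last_write_ts"]) >= compression_delay_secs
    is_cold = egroup["tier"] == "HDD"
    return old_enough and is_cold and not egroup["compressed"]
```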
The following figure shows an example of how post-process compression interacts with the DSF write I/O path:
For read I/O, the data is first decompressed in memory and then the I/O is served. Data that is heavily accessed will be decompressed in the HDD tier and can then leverage ILM to move up to the SSD tier as well as be stored in the cache.
The following figure shows an example of how decompression interacts with the DSF I/O path during read:
You can view the current compression rates via Prism on the Storage > Dashboard page.
The Nutanix platform leverages a replication factor (RF) for data protection and availability. This method provides the highest degree of availability because it does not require reading from more than one storage location or data re-computation on failure. However, this does come at the cost of storage resources, as full copies are required.
To provide a balance between availability while reducing the amount of storage required, DSF provides the ability to encode data using erasure codes (EC).
Similar to the concept of RAID (levels 4, 5, 6, etc.) where parity is calculated, EC encodes a strip of data blocks on different nodes and calculates parity. In the event of a host and/or disk failure, the parity can be leveraged to calculate any missing data blocks (decoding). In the case of DSF, the data block is an extent group and each data block must be on a different node and belong to a different vDisk.
The number of data and parity blocks in a strip is configurable based upon the desired number of failures to tolerate. The configuration is commonly referred to as <number of data blocks>/<number of parity blocks>.
For example, “RF2 like” availability (e.g., N+1) could consist of 3 or 4 data blocks and 1 parity block in a strip (e.g., 3/1 or 4/1). “RF3 like” availability (e.g. N+2) could consist of 3 or 4 data blocks and 2 parity blocks in a strip (e.g. 3/2 or 4/2).
You can override the default strip size (4/1 for “RF2 like” or 4/2 for “RF3 like”) via NCLI: ‘ctr [create / edit] … erasure-code=<data blocks>/<parity blocks>’.
The expected overhead can be calculated as <# parity blocks> / <# data blocks>. For example, a 4/1 strip has a 25% overhead or 1.25X compared to the 2X of RF2. A 4/2 strip has a 50% overhead or 1.5X compared to the 3X of RF3.
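The overhead formula, and single-parity (N+1) recovery, can be checked with a short sketch; note that multi-parity strips require a more general erasure code than simple XOR, so the XOR example below only illustrates the “RF2 like” case:

```python
from functools import reduce

def ec_overhead(data_blocks, parity_blocks):
    """Storage multiplier of an EC strip: 1 + <# parity blocks> / <# data blocks>."""
    return 1 + parity_blocks / data_blocks

print(ec_overhead(4, 1))   # 1.25 (vs. 2X for RF2)
print(ec_overhead(4, 2))   # 1.5  (vs. 3X for RF3)

def xor_parity(blocks):
    """Single parity block for an N+1 strip: byte-wise XOR of all data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]        # a 4/1 strip's data blocks
parity = xor_parity(data)
# Recover a lost block by XOR-ing the surviving blocks with the parity block
recovered = xor_parity([data[0], data[2], data[3], parity])
assert recovered == data[1]
```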
The following table characterizes the encoded strip sizes and example overheads:
Cluster Size (nodes) | EC Strip Size (data/parity blocks) | EC Overhead (vs. 2X of RF2) | EC Strip Size (data/parity) | EC Overhead (vs. 3X of RF3) |
4 | 2/1 | 1.5X | N/A | N/A |
5 | 3/1 | 1.33X | N/A | N/A |
6 | 4/1 | 1.25X | N/A | N/A |