SSUG::Digital: 006 – Persistent Storage for Kubernetes and OpenShift environments

This episode will discuss the Spectrum Scale Container Storage Interface (CSI). CSI is a standard for exposing arbitrary block and file storage systems to containerized workloads on container orchestration systems like Kubernetes and OpenShift. Spectrum Scale CSI provides your containers fast access to files stored in Spectrum Scale with capabilities such as dynamic provisioning of volumes and read-write-many access.

 

Download slides here


Q&A

Q: This slide (titled "Spectrum Scale CSI Driver – Architecture") shows the CPU architecture is x86.
A: Yes, with Spectrum Scale CSI Driver 2.0.0 only x86 is supported. Support for other architectures (IBM Power and IBM Z) will be provided in upcoming releases (IBM's usual roadmap disclaimers apply).

Q: Is the management of storage classes available via Ansible?
A: Setting up a storage class is a one-time operation. While it could be done using Ansible (and its Kubernetes integration modules), clients usually manage storage classes using the Kubernetes or OpenShift CLI or GUI.
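For illustration, a storage class for the Spectrum Scale CSI driver can be created straight from the CLI. The provisioner name and parameters below (file system name, cluster ID) are examples only and should be verified against the CSI driver documentation for your release:

cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ibm-spectrum-scale-csi-fileset    # example name
provisioner: spectrumscale.csi.ibm.com
parameters:
  volBackendFs: "fs1"         # example Spectrum Scale file system
  clusterId: "1234567890"     # example Spectrum Scale cluster ID
reclaimPolicy: Delete
EOF
kubectl get storageclass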

Q: Will the slides be provided after this presentation?
A: Yes. You will find the chart decks, recordings, Q&A and related information for all past talks, including this one, at https://www.spectrumscaleug.org/experttalks/.

Q: Once you have CSI driver support for non-x86_64 platforms, will the Spectrum Scale cluster be able to be heterogeneous (AIX, Linux, x86_64 and ppc64le)? Will this cluster support AIX NSD-only nodes?
A: In the first release for non-x86_64 platforms, all worker nodes that have the Spectrum Scale client installed need to be of the same CPU architecture and the same operating system. If there are AIX NSD nodes, those must be outside of the Kubernetes cluster. AIX NSD-only nodes might be integrated by remote mounting the storage cluster to a client Spectrum Scale cluster that runs the Kubernetes workload.

Q: Is a network load balancer a prerequisite (must-have) for the CSI deployment?
A: No, it isn't.

Q: Is there a possibility to have Spectrum Scale clients installed within containers?
A: We are working on a capability called Container Native Spectrum Scale (CNSS), where Spectrum Scale will run inside a container. The initial release is planned for December 2020. (Disclaimer: all dates are subject to change; IBM's usual roadmap disclaimers apply.)

Q: Do we need to have an x86 "only" Spectrum Scale/OpenShift cluster and a ppc64le "only" Spectrum Scale/OpenShift cluster?
A: The requirement of the same CPU architecture and the same operating system applies only to the Spectrum Scale client nodes that are part of the Kubernetes/OpenShift cluster. The NSD servers can be on a different platform (as per the Spectrum Scale support matrix at https://www.ibm.com/support/knowledgecenter/STXKQY/gpfsclustersfaq.html).

Q: Any plans for the ability to self-provision Spectrum Scale clusters with containers?
A: Container Native Spectrum Scale (CNSS) will have an Operator that deploys and configures a Spectrum Scale cluster automatically. It will also remote mount the file system from a Spectrum Scale storage cluster.

Q: One of the issues we are trying to solve is isolating Spectrum Scale I/O per tenant/application/user on a single server, just as we can isolate CPU/network with cgroups. Would Spectrum Scale on containers help us with isolation or QoS for storage I/O?
A: Running Spectrum Scale in a container, or CSI by itself, will not address QoS. A new fileset-based QoS capability, together with CSI, will be able to handle this in a future release. (Disclaimer: all dates are subject to change; IBM's usual roadmap disclaimers apply.)

Q: Does OpenStack have to be managed via the web GUI, or can it be controlled via a CLI?
A: You are free to use either the CLI or the GUI.

User group host: Bill Anderson

Speakers:

Smita Raut: Smita Raut is a Senior Software Engineer with IBM Storage Labs in Pune, India. She works with the IBM Spectrum Scale development team as the architect for persistent storage for containers. In her nine years with IBM, she has led development on various projects, including the Object protocol for IBM Spectrum Scale and the enablement of IBM Spectrum Scale on public cloud. She is an active technical blogger and has published several blogs on the object protocol and the container storage interface driver.
Harald Seipp: Harald Seipp is a Senior Technical Staff Member with IBM Systems in Germany. He is the founder and Technical Leader of the Center of Excellence for Cloud Storage as part of the EMEA Storage Competence Center. He provides guidance to worldwide IBM teams across organizations and works with customers and IBM Business Partners across EMEA to create and implement complex storage cloud architectures. His more than 25 years of technology experience include previous roles as Software Developer, Software Development Leader, Lead Developer and Architect for successful software products, and co-inventor of an IBM storage product. He holds various patents on storage and networking technology.
Renar Grunenberg: Renar Grunenberg has been with HuK-Coburg for 27 years. He leads the backup and storage team and is responsible for all storage and backup in his department and company. Renar has 15 years of experience with Spectrum Scale, including CES, CSI, ESS and the core functions. In this episode Renar will discuss a use case for Kafka self-service with K8s and Spectrum Scale CSI.
Simon Thompson: Simon Thompson is the Research Computing Infrastructure Architect within Advanced Research Computing at the University of Birmingham. He oversees the infrastructure and systems team, running the University's HPC and research data systems. This involves experimenting with (and breaking) new technology. Simon is also chair of the Spectrum Scale user group in the UK.

SSUG::Digital: 005 – Update on functional enhancements in Spectrum Scale (inode management, vCPU scaling, considerations for NUMA)

Spectrum Scale is a highly scalable, high-performance storage solution for file and object storage. It started more than 20 years ago as a research project and is now used by thousands of customers. IBM continues to enhance Spectrum Scale in response to recent hardware advancements and evolving workloads.
This presentation will discuss selected improvements in Spectrum Scale V5, focusing on improvements to inode management, vCPU scaling and considerations for NUMA.

Part 1 (inode management):

Part 2 (vCPU scaling and NUMA considerations):

Download slides here

Q&A – inode management

Q: If I make the block size 8K, can an inode hold a file of that size?
A: No, the maximum inode size is 4K. As discussed on the call, changing the block size (including the metadata block size) doesn't affect the size of an inode, which is currently limited to 4K.

Q: When files are created, an inode number is assigned from 1 to the maximum. 32-bit applications can only address inodes up to 4 billion. With millions of temp files created during application runs, inodes get used up very quickly. Once a job is finished the files are deleted, but the inodes are not recycled. This results in running out of inodes while the file system isn't full. Can the inodes of files that have been deleted be recycled for future use?
A: Inodes do get reused after deletion. How many independent filesets do you have? If you only had the root fileset and you set the maximum number of inodes to less than 4 billion, then you can never have an inode number greater than 4 billion.
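For illustration, per-fileset inode usage and limits can be inspected and adjusted with commands along these lines (the file system and fileset names and the limits are examples; check the command reference for exact syntax):

mmlsfileset fs1 -L -i                         # show inode usage and limits per fileset
mmchfileset fs1 scratch --inode-limit 2M:1M   # example: raise the maximum to 2M, preallocate 1M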

Q: Do we have any idea how long it takes to re-layout the inode allocation map? And can that be done while the file system is mounted?
A: The run time of this operation depends on the size of the existing inode allocation map file, since we migrate data from the existing map to the new map. In one customer engagement the migration completed in an hour; in another case it took 18 hours.
While this operation could theoretically be done while the file system is mounted, we have currently restricted it to be done with the file system offline for safety reasons. We are evaluating making this operation online in a future release. The re-layout parameters can, however, be tested with the file system mounted.

Q: ­Are there counters that report the lock collisions/waiters for lock contention that would indicate whether a re-layout is desirable?
A: ‘mmfsadm dump ialloc’ provides counters on segment searches. Grep for ‘inodes allocated from’. Ideally, we expect allocations to happen from ‘inodes allocated from list of prefetched/deleted inodes’ or ‘inodes allocated from current ialloc segment’. Also, long waiters during file creation are an indication of inode space pressure.
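As a quick illustration (and bearing in mind the note elsewhere in this Q&A that mmfsadm dump should be avoided on production systems), the counters and waiters can be checked like this:

mmfsadm dump ialloc | grep -i 'inodes allocated from'
mmdiag --waiters          # look for long waiters during file creation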

Q: Why does a large NumNodes value influence the mmdf run time? (We have seen it take several minutes.)
A: mmdf fetches cached data. This should not be impacted by cluster size.

Q: How does NumNodes relate to the number of segments?
A: The number of inode allocation map segments is chosen such that every node can find a segment with free inodes even if 75% of all segments are full. This has to do with the inode expansion getting triggered only when the inode space is 75% full. We want inode allocation to continue while the inode expansion is taking place. This means that the number of segments is roughly 4 times NumNodes.
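As a worked example of that rule of thumb: a file system created with NumNodes set to 32 would end up with on the order of 4 × 32 = 128 inode allocation map segments.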

Q: Are there any general recommendations for the initial inode allocation? I know this depends on the file system's expected use. We typically just base it roughly on existing systems.
A: Use the default number of preallocated inodes (by omitting the NumInodesToPreallocate argument of the --inode-limit option of mmcrfs/mmcrfileset) when creating a file system or independent fileset, and let the inodes expand on demand.
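For example, the following sketch creates and links an independent fileset with only a maximum inode limit, so inodes are allocated on demand (the names, path and limit are illustrative):

mmcrfileset fs1 projects --inode-space new --inode-limit 10M
mmlinkfileset fs1 projects -J /gpfs/fs1/projects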

Q:  How is the inode allocation map, and its segmentation, affected if metadata NSDs are added or deleted?
A: The inode allocation map is not affected by newly added NSDs, as it only tracks inode state. The block allocation map is the one that tracks free/used disk blocks, and it is updated when disks are added or deleted.

Q: Can we shrink the inode space if, by mistake, we allocated a large inode space using --inode-limit?
A: No.

Q: When files are deleted, does the recovery of free inodes happen in a lazy way? One customer has just reported that after deleting data from a 5 TB file system, the free space is not reflected on the file system.
A: Yes. The files are deleted asynchronously in the background. You can run ‘mmfsadm dump deferreddeletions’ to see the number of inodes that are queued for deletion in mounted file systems.
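A minimal way to watch this in practice (the mount path is an example):

df -h /gpfs/fs1                   # cached free-space view
mmfsadm dump deferreddeletions    # inodes still queued for background deletion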

Q: At what version is automatic inode expansion available?
A: It has been available since the earliest Spectrum Scale versions.

Q: How do you identify the metanode?
A: ­Here is an example:
ls -i testfile
68608 testfile

Then find this inode number in ‘mmfsadm dump files’. (Note that the mmfsadm dump command should be avoided in production.)
===== dump files =====
[… search on inode]
  inode 68608 snap 0 USERFILE nlink 1 genNum 0x49DE6F0F mode ...

The above is an example of how you might look up the metanode for a file.

You can map the cluster name by looking at the ‘tscomm’ section of a dump, e.g.:
===== dump tscomm =====
[…]
Domain , myAddr <c1n2> (was: <c1p0>)[…]
UID domain 0x1800DF65038 (0xFFFFB6768DF65038) Name "c202f06fs04a-ib1.gpfs.net"

Q: Is metanode transient?
A: A metanode is a per-file assignment. It lasts for as long as there are open instances of the file. The assignment is dynamic, and the metanode role may automatically migrate to other nodes for better performance.

Q: If some nodes go down and the metanode is unable to get updates from those failed nodes, how are the updates maintained by the metanode?
A: A non-metanode sends its updates to the metanode before it writes any dependent blocks to disk. If the non-metanode went down before it could send its updates, then log recovery ensures that there are no inconsistent modifications to disk data by the non-metanode. Spectrum Scale only guarantees persistence of data/metadata from the last sync window.

Q: Can we prevent the metanode from migrating to a remote node? Also, would limiting the metanode to the storage cluster help improve performance?
A: Metanode performance depends on how many nodes are sending metanode updates and how expensive the network send is. The file system uses such heuristics to determine the optimal metanode placement. In most cases it is best to let the file system make this decision. The only known use case for preventing metanode migration to a remote node is when the remote node is in a compute cluster that cannot afford the overhead of metanode operations. For this rare case we have an undocumented configuration parameter to force the metanode to stay in the storage cluster.

Q: Sometimes when we delete a large amount of data, it takes significant time for the free space to show up in ‘df -h’ output. Do we need to run mmrestripefs to reclaim the deleted space faster?
A: ‘df -h’ returns cached information on free space. It is likely that the large files that were deleted have not yet freed up their space, as file deletes happen in the background. You can use ‘mmfsadm dump deferreddeletions’ to get a count of the number of inodes that are queued for background deletion. If the node is not overloaded with I/O and you find that the number of to-be-deleted inodes is not decreasing at a reasonable rate (depending on the file sizes and the I/O throughput of the node), then we would need to investigate further by collecting dumps and traces. Please open a ticket with IBM support in such a case. The mmrestripefs command is for restoring/rebalancing data and metadata replicas; it would not have any impact on speeding up background file deletion.

Q&A – vCPU scaling and NUMA considerations

Q: We now see the following message in the mmfs log. What does it mean? What is missing?
[W] NUMA BIOS/platform support for NUMA is disabled or not available. NUMA results are approximated and GPFS NUMA awareness may suffer.
A: That means libnuma was found but numa_available() returned false. This is a platform firmware functionality shortcoming. Spectrum Scale can still get a lot of information, as some of it is derivable from /proc. File a ticket with your server vendor reporting that libnuma's numa_available() returns false.
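To cross-check what the operating system itself reports about NUMA, commands along these lines can help (output and log location may vary by platform; the log path below is the usual default):

numactl --hardware                           # NUMA nodes, CPUs and memory as seen by libnuma
lscpu | grep -i numa                         # NUMA node count and CPU assignment
grep -i numa /var/adm/ras/mmfs.log.latest    # check whether GPFS logged the NUMA warning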

Q: ­So, any recommendations on POWER9 for SMT settings?  AIX versus Linux on Power?  We used to suggest smaller SMT modes in the past.
A: We are running SMT-4 on some large POWER9 systems. Evaluate based on I/O vs workload needs as discussed verbally.

Q: Are there any special NUMA considerations for AMD systems which are different to NUMA considerations for Intel systems?
A: This is highly dependent on the specific processor and chipset, independent of brand, and on what that processor and chipset offer for tuning. We do not have any prescriptive guidance.

User group host: Simon Thompson

Speakers:

Michael Harris: Mike is a Senior Software Engineer on the Spectrum Scale Core Team. Mike has a deep background in OS kernels, device drivers, virtualization and systems software, with a focus on NUMA, atomics and concurrency, and high-CPU-count concurrency. On GPFS he focuses on NUMA and scaling, as well as DMAPI, host file system integration and system calls.
Karthik Iyer: Karthik Iyer is a Senior Software Engineer in Spectrum Scale Core. Karthik has 18 years of design and development experience in distributed system software, specifically in the areas of file system core and database management. Karthik also specialises in troubleshooting Spectrum Scale corruption-related issues.

SSUG::Digital: 004 – Update on Performance Enhancements in Spectrum Scale

An update on file create and MMAP performance, and optimised code for small direct I/O (DIO).

Spectrum Scale is a highly scalable, high-performance storage solution for file and object storage. IBM continues to enhance Spectrum Scale performance, in response to recent hardware advancements and evolving workloads.
This presentation will discuss performance-related improvements in Spectrum Scale V5, focusing on enhancements made in support of AI and HPC use cases, including improvements to MMAP reads, file create performance, and small direct IO. In addition, we will review some performance numbers measured on the IBM ESS 5000.

Download slides here

Q&A

Q: I assume a copy of these charts will be posted to the Spectrum Scale User Group "Presentations" web page?
A: Yes, for all episodes the slides and video should be posted afterwards.

Q: Please expand on other areas of performance improvements within GPFS that IBM is working on now.
A: Which areas would you like to see improved?

Q: Will prefetch still happen after the slow second IO?
A: Regarding "will prefetch still happen after the slow second IO": I know that Ulf said we should handle any more prefetch questions in another talk, but let me just comment on one case. We make decisions to prefetch after the associated I/Os are complete, so prior to prefetch kicking in, a slow I/O might delay the decision to start prefetching.

User group host: Simon Thompson

Speakers:

John Lewars (IBM): John Lewars is a Senior Technical Staff Member leading performance engineering work in the IBM Spectrum Scale development team. He has been with IBM for over 20 years, working first on several aspects of IBM's largest high performance computing systems, and later on the IBM Spectrum Scale (formerly GPFS) development team. John's work on the Spectrum Scale team includes working with large customer deployments and improving network resiliency, along with co-leading development of the team's first public cloud and container support deliverables.
Jürgen Hannappel (DESY): Jürgen Hannappel works in the scientific computing group of the DESY IT department on data management for EuXFEL and Petra III. With a background in particle physics, his interests shifted towards computing over time as his place of work moved from CERN and Bonn University to DESY.
Olaf Weiser (IBM): Olaf has worked with GPFS for over 15 years. He started his GPFS career as a technical administrator at one of the world's biggest telecommunications companies. For more than 10 years, Olaf has been with IBM as a storage consultant and performance specialist. Recently, he joined IBM Research and Development and works on enhancements to Spectrum Scale to address client and customer needs in the product.

SSUG::Digital: 003 – Strategy Update

Spectrum Scale Strategy Update
Today is the AI era and we are going through a huge explosion of data. Besides the AI revolution, we have clouds and hybrid clouds, and data is moving from on-prem to various clouds, multi-clouds and back. Coupled with this data growth, hardware is evolving by factors of ten. The IBM Spectrum Scale team continues to invest heavily in adding exciting new features and technology to maintain its leadership as a premier file system. In this session, Wayne Sawdon (CTO) and Ted Hoover (Program Director) of the Spectrum Scale development team will give an overview of recent and upcoming features and the strategy for Spectrum Scale.

Download slides here

Q&A

None

User group host: Bob Oesterlin

Speakers:

Wayne Sawdon: Wayne joined IBM in 1982 and worked on a variety of research projects, including the QuickSilver transactional operating system. He spent most of the 90s on educational leave at Carnegie Mellon University, working on Distributed Shared Memory and Software Defined Computer Architecture. Upon returning he joined the TigerShark research project, which became IBM's General Parallel File System. Although Wayne has worked on most of the file system, he only admits to working on its data management. These days, Wayne serves as the CTO for Spectrum Scale and ESS.
Ted Hoover: Ted Hoover is a Program Director within IBM's Spectrum Scale product development organisation. Ted is responsible for the worldwide development of the Spectrum Scale cloud, container and performance engineering teams.

SSUG::Digital: 002 – Best Practices for building a stretched cluster

Talk 2 in the SSUG::Digital series looks at how to build a stretched cluster. What are the best practices? What pitfalls are there? Why would you consider a stretched cluster built with Spectrum Scale, as opposed to one of the alternative approaches to high availability? How do stretched clusters work, and what considerations go into planning a successful stretched cluster? We will examine the theory behind Spectrum Scale stretched clusters, review some best practices for designing stretched clusters, and talk about a few cases where stretched clusters have been successfully deployed.

Download slides here

 

Q&A

Q: For the DR use case where ClusterA and ClusterB are the two separate data centres (DC A and DC B), do I need my tiebreaker quorum node installed in a data centre C?
A: (This is covered in the presentation.) It is recommended to have the tiebreaker quorum node at a third site, but it could be in one of the two sites, with the caveat that if that site goes down, the second site will not be able to stay up.
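As a sketch, an existing small node at the third site can be given the quorum role with something like the following (the node name is an example):

mmchnode --quorum -N tiebreaker-node01
mmlscluster          # verify the quorum node designations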

Q: The documentation shows that high-speed shared storage is needed. Does this mean that the SAN fabric should be merged over ISL for volume allocation across sites?
A: When using Spectrum Scale replication for stretched clusters, there is no need for the SAN to be extended across the sites. The stretched cluster architecture described in the presentation works even when the underlying storage does not replicate the data across sites.
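A minimal sketch of how that replication is typically set up: NSDs at each site are placed in different failure groups, and the file system is created with two data and two metadata replicas (all names, devices and servers below are examples):

# nsd.stanza
%nsd: nsd=siteA_nsd01 device=/dev/dm-1 servers=nsdA1,nsdA2 usage=dataAndMetadata failureGroup=1
%nsd: nsd=siteB_nsd01 device=/dev/dm-2 servers=nsdB1,nsdB2 usage=dataAndMetadata failureGroup=2

mmcrnsd -F nsd.stanza
mmcrfs fs1 -F nsd.stanza -m 2 -M 2 -r 2 -R 2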

Q: Will there be any performance difference between an extended SAN and accessing NSDs over the network via their NSD servers?
A: Aside from the protocol difference (block vs. file), it depends on the type of connectivity you have to the SAN versus the network. Spectrum Scale has been adding more resiliency in recent releases for handling network behaviour (for example, the proactiveReconnect feature was recently added to Spectrum Scale).

Q: Is 10 ms latency required between Site A, Site B and the tiebreaker quorum node? Can my tiebreaker quorum node have higher latency?
A: Yes, the third site can have a higher latency, but it should still be within reason, so maybe double that number, i.e. 20 ms. It is recommended to keep it under a second.

Q: Is a tiebreaker node hosted on AWS or any other cloud provider a supported configuration?
A: Yes, we have customers who use a public cloud for their third site.

Q: What are the RPO and RTO?
A: Remember that this is synchronous replication, so as long as you don't run out of space on your storage, the RPO is zero. The RTO depends on your workload and infrastructure: the rate of data change, your storage and the WAN.

Q: How do we check/measure the rate of data change?
A: This really depends on the application and the rate at which it changes data. If you have already implemented Spectrum Scale, you can use the historical data from performance monitoring within Spectrum Scale to estimate the rate of data change.
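For example, if the performance monitoring (ZIMon) collector is running, historical per-file-system write throughput can be queried roughly as follows (the metric name is illustrative; see the mmperfmon documentation for time-range and bucket-size options):

mmperfmon query gpfs_fs_bytes_written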

Q: Do you have any general tips/recommendations regarding CES in a stretched cluster?
A: The CES nodes in your cluster need to be split between the two sites, as they are still part of a single cluster. SMB performs its own locking with the CTDB component, so the latency between the CES nodes needs to be fairly small. Also be aware that if you have different address spaces at the two sites, there may not be automated failover of services and you may need to perform the failover manually.
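To see how the CES nodes and protocol addresses are currently distributed (useful when planning the split across sites), for example:

mmces node list       # which nodes are CES protocol nodes
mmces address list    # CES IP addresses and the node currently hosting each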
