Spectrum Scale Users Group at Supercomputing – November 2016

First off, your feedback, please! If you were at the event, please give us feedback here: https://www.surveymonkey.com/r/SSUGSC16 We’ve only heard from 9 attendees so far, and there were a lot more of you there :-).

The slides should be available soon here: http://www.spectrumscale.org/presentations/

Kicking Things Off

This was the second year for a Spectrum Scale Users Group meeting at SC. To start things off, Doris Conti welcomed the group and reminded us that the IBMers are here to present, but also to listen to users’ experiences (the good, the bad, the ugly? 🙂): how Spectrum Scale is being used in the field and what should come next in the product from the users’ perspective. Doris complimented what she called the heroes of the data centers, the people managing PBs of data short-handed, if not single-handedly. The developers then introduced themselves so people could talk to them at the break or during the week.

Bob Oesterlin, Users Group Co-principal, gave a quick overview of the Users Group structure and purpose, and mentioned that we need a site for the next in-person meeting —we have asked for hosting sites on the mailing list, so please, chime in if you are willing to host! As usual, we welcome anyone interested in being a Sector Representative, that is, someone who can represent Spectrum Scale (SS) usage for a particular group of users —genetic analysis, financials, etc.

Ulf Troppens, of IBM Germany, spoke about the unstructured data explosion and how (tongue-in-cheek) the cell phone is made by the devil 🙂 because it is producing mass amounts of this unstructured data. Ulf spoke to how SS is changing to adapt and respond to this data management need.

Doug O’Flaherty provided a very brief licensing update, most importantly noting the capacity pricing announcement from a few weeks ago and emphasizing that it is an addition to, not a replacement for, socket-based licensing. IBM is withdrawing Express Edition and trying to be flexible with migration options to capacity licensing; questions are welcome. Doug strongly encouraged us to talk to sales, and not engineers, about licensing!

The latest in IBM Spectrum Scale (Scott Fadden, IBM)

Scott Fadden provided a forward-looking view of Spectrum Scale; that is, how the way the product is being used is shaping decision making. One thing SS users are running into is the “cloud directive”: management is telling sysadmins to “run things in the cloud”. The people responsible for implementing it are in turn trying to figure out what that means, because there are lots of issues with it: cost, I/O performance to the “cloud”, data movement, etc. So what many have done is start with cloud for archive, because it’s the best “cloud” fit today. There is also a difference between running “on the cloud” (off prem) and being the cloud (on prem). One approach is to use an on-premises private cloud to cover what external, off-premises cloud offerings don’t meet today. Then there can be a transition path later when/if the external cloud catches up. How would this work? Leverage SS using protocol nodes to allow NFS/SMB access, and GNR/SS RAID to allow use of larger disks without long rebuild times. Scott also mentioned some new features, like ILM support for getting snapshots off flash, and some new features based on feedback, like mmnetverify (a network test tool) and installer improvements. See Scott’s slides for all the details.

ESS Updates (Steve Duersch, IBM)

As part of our hardware updates, Steve Duersch talked to us about some new ESS models. These are meant to let those who need more throughput augment with flash: ESS GF1 (read 13.6 GB/s, write 9.3 GB/s) and GF2 (read 26.6 GB/s, write 16.6 GB/s). [see also the slides] As you would expect, MTBF goes up with fewer moving parts (flash). Other ESS improvements include an improved flow for L2 support that avoids bouncing customers between the Power team, the RHEL team, etc.; ESS L2 is to own a case from cradle to grave. Some customers chimed in that they have experienced this kind of bouncing with other vendors providing SS (e.g. between hardware, software, and OS support) and would love to see improvement there in general.

NASA NCCS Goddard: Big Data in HPC (Carrie Spear, NASA)

Kicking off the user talks was Carrie Spear, who talked about NASA’s Data Analytics and Storage System (DASS). Her group is working to support both traditional MPI jobs and Hadoop workflows without transforming the data to do both. They used a storage building block approach and had 3 test models: Cloudera + HDFS, Cloudera + GPFS, and Cloudera + Lustre. Her results showed the connector outperforming HDFS. Carrie appreciated being able to work closely with SS developers on this project. Per Carrie, the connector installation is straightforward. It is worth noting that it’s available for free download (https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/IBM%20Spectrum%20Scale%20support%20for%20Hadoop). This was an interesting talk and generated several questions.

Nuance: Tiering to Object Storage (Bob Oesterlin, Nuance)

Next, Co-principal Bob Oesterlin from Nuance talked about his experience with 5 PB of Cloud Tiering ranging from flash to SATA. In Bob’s experience 80% of data is “cold” within a few weeks, and there are billions of files, many small files (we all love those, right?). Nuance has compute nodes with local storage (SSD) and then data flows between local storage and warm tier using policy-based migration.

Bob noted that using Transparent Cloud Tiering (TCT) requires an advanced license and an interaction with IBM to enable it. Doris responded that this was intentional, not to annoy the customer, but to give IBM developers a chance to interact with the user, understand how TCT will be used, and perhaps make recommendations about that. The tiering is done with policies (mmapplypolicy). Implementing policies for file movement implies overhead, and Bob made sure his management knew that object storage is not meant to be performant. There are tools to determine the mapping from an object name to a GPFS filename. Bob requested a better way to determine cloud usage. Bob ran into a bunch of gotchas; see Bob’s slides if you’re implementing/testing this. There was a question about whether pre-staging made sense in Bob’s environment. The short answer was: not now, since it might allow people to game the system by saying they need everything pre-staged, which would defeat the purpose of the tiering.
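To make the idea of policy-based tiering concrete, here is a minimal Python sketch of the kind of selection a "cold data" rule performs. It is not Spectrum Scale policy syntax and not what Nuance runs; real rules are written in the policy language and evaluated by mmapplypolicy. The `FileInfo` type, the sample paths, and the 21-day threshold (standing in for "a few weeks") are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileInfo:
    """Hypothetical stand-in for the per-file attributes a policy scan sees."""
    path: str
    last_access: datetime
    size_bytes: int


def select_cold(files, now, cold_after=timedelta(days=21)):
    """Return the files whose last access is older than the threshold."""
    return [f for f in files if now - f.last_access > cold_after]


# Made-up sample data; the paths and dates are illustrative only.
now = datetime(2016, 11, 14)
files = [
    FileInfo("/gpfs/proj/a.wav", datetime(2016, 11, 10), 40_000),
    FileInfo("/gpfs/proj/b.wav", datetime(2016, 9, 1), 52_000),
    FileInfo("/gpfs/proj/c.wav", datetime(2016, 8, 15), 18_000),
]

cold = select_cold(files, now)
cold_fraction = len(cold) / len(files)
print([f.path for f in cold], round(cold_fraction, 2))
```

In this toy data set two of the three files have not been touched for more than three weeks, so they would be the migration candidates; the recently accessed file stays on the faster tier.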

Sponsor Talk (NetApp): IBM DCS Storage: The Ideal Building Block for Flexible Spectrum Scale (Matt Forney (Ennovar), Kums Rajaram (IBM))

Our sponsor talk was related to the IBM DCS storage and work going on at Ennovar (a very interesting concept at Wichita State University providing a sort of incubator for IT pros: http://webs.wichita.edu/?u=ennovar&p=/index/). The presentation was about a DCS recipe with building blocks. Their first pass focused on throughput rather than on IOPS/small files. It uses a shared metadata/data pool and has a 16MB block size. Future work will look at smaller block sizes. There is a portal to test for yourself: ennovar-hpc.herokuapp.com

Virginia Tech ARC: SanDisk IF150 (Brian Marshall, Virginia Tech ARC)

This talk covered use of the SanDisk IF150 (JBOF) with Omni-Path (100Gbps, low latency, and competes with Mellanox IB). There are 4 options for connection; ARC chose the OPA card.

OPA + EDR NSD Server, some notes:

Use the Intel Storage Router Design
GPFS does not support RDMA over 2 fabrics
Kept daemon and admin network on 10Gb since OPA is new
Use subnets to specify the IP over Fabric network to use OPA for data

NASA NCCS Goddard: Monitoring with Grafana and InfluxDB (Aaron Knister, NASA Goddard)

NASA Goddard presented on their monitoring work. Their goal for metric collection was 10s resolution, which turned out to be 173K measurements every 10s! They looked at a few different TSDBs (time series databases) to implement this. Currently they are using a Perl script to poll for data (mmpmon) and then pushing it to InfluxDB. To get data out of the system they use a CLI, and for graphing/GUI they use Grafana. There have already been several times where being able to see the data in this way helped fix problems. One was seeing the latency on the IB fabric, which led to a GPFS tuning oversight being discovered. Another was determining where a restripe was needed, discovered through high loads on NSD servers where newly added disks were taking the brunt of the I/O as data was being allocated to new pools. They will try to open source some of this code if their organization will permit it.
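As a rough illustration of the poll-and-push pipeline described above, the Python sketch below turns one sample record into an InfluxDB line-protocol string and works out the ingest rate implied by 173K measurements every 10s. This is not NASA's Perl script. The sample record is made up in the style of mmpmon's parseable `fs_io_s` output (alternating `_key_ value` tokens), and the measurement, tag, and field names are assumptions chosen for this example; check real mmpmon output on your own cluster before relying on the field meanings.

```python
def parse_mmpmon_line(line):
    """Split a record of '_key_ value' token pairs into (record_type, dict)."""
    tokens = line.split()
    record_type, pairs = tokens[0], tokens[1:]
    fields = {}
    for key, value in zip(pairs[0::2], pairs[1::2]):
        fields[key.strip("_")] = value
    return record_type.strip("_"), fields


def to_line_protocol(measurement, tags, fields, ts_seconds):
    """Format integer fields as InfluxDB line protocol with a nanosecond timestamp.

    Escaping of spaces and commas in tag values is skipped to keep the sketch short.
    """
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={int(v)}i" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_seconds * 10**9}"


# Made-up sample record (node name, cluster name, and counters are illustrative).
sample = ("_fs_io_s_ _n_ 10.0.0.1 _nn_ nsd01 _rc_ 0 _t_ 1479168000 _tu_ 0 "
          "_cl_ demo.cluster _fs_ gpfs0 _d_ 4 _br_ 6291456 _bw_ 314572800")

record_type, rec = parse_mmpmon_line(sample)
line = to_line_protocol(
    "gpfs_" + record_type,
    tags={"node": rec["nn"], "fs": rec["fs"]},
    fields={"bytes_read": rec["br"], "bytes_written": rec["bw"]},
    ts_seconds=int(rec["t"]),
)
print(line)

# Ingest rate implied by 173K measurements per 10-second interval.
points_per_second = 173_000 // 10
points_per_day = points_per_second * 86_400
print(points_per_second, points_per_day)
```

At 17,300 points per second, the database has to absorb roughly 1.5 billion points per day, which helps explain why the choice of TSDB mattered.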

Spectrum Scale Enhancements for CORAL – (Sarp Oral (ORNL), Gautam Shah (IBM))

Gautam covered many of the considerations and development efforts needed to meet the CORAL requirements. See the slides for full details, but for example, CORAL needs a burst buffer/performance tier, local NVMe SSD managed by Burst Buffer Software (explicit software) to reduce spinning disk/power requirements; and large capacity requirements met through more traditional ESS Storage.

The burst buffer software allows for storing checkpoints, with stage in/out to SSD managed via the scheduler (LSF). Spider 3 is a center-wide, single-namespace POSIX file system for all OLCF resources to enable seamless data sharing between resources. It has a usable capacity of 250PB.

There are some ways that researchers interact with their data that aren’t ideal, but nevertheless need to be served. For example, running “ls” in directories with lots of small files to check the progress of a job. On file sizes: files under 32K make up 85-90% of all files.

See the slides for more details about the rates and other objectives being achieved for this system. The take-home for most of us is that, with CORAL pushing the envelope, there will be changes that benefit the whole SS community. Some examples are allowing for more concurrency, improved restripes, token management improvements, RPC communication enhancements, performance counters to help determine where bottlenecks lie, etc.

News from IBM Research (Sven Oehme, Dean Hildebrand, IBM)

Lastly, Sven and Dean spoke, which made for a lively end to the session. Sven talked about how varied disk drive speed reports can be depending upon who you ask: the disk vendor, the block storage vendor, the HPC admin… There are many factors that impact this, including caching, filesystem choices, RAID modes, access patterns, etc.

Of note for Spectrum Scale 4.2.1+ is a fundamental network communication overhaul, driven by the need for almost lock-free communication in all performance-critical code. Also added is always-on instrumentation for performance-critical data, rather than adding it later or collecting data only occasionally when needed. To balance a desire to use larger block sizes, there will be more, smaller sub-blocks; today you have to choose between bandwidth and wasting space on small files.
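The trade-off between large blocks and small files can be made concrete with a little arithmetic. The Python sketch below assumes that space is allocated in whole sub-blocks and that current releases split each block into 32 sub-blocks; the 1024 sub-block figure is only a hypothetical "more, smaller sub-blocks" setting chosen for illustration, not a number from the talk. It also ignores very small files that can be stored in the inode.

```python
import math

KIB = 1024
MIB = 1024 * KIB


def allocated_bytes(file_size, block_size, subblocks_per_block):
    """Space consumed when a file is rounded up to whole sub-blocks."""
    subblock = block_size // subblocks_per_block
    return math.ceil(file_size / subblock) * subblock


file_size = 32 * KIB    # small files like this were reported as 85-90% of files
block_size = 16 * MIB   # block size used in the DCS recipe described earlier

for subblocks in (32, 1024):  # 1024 is a hypothetical value for illustration
    used = allocated_bytes(file_size, block_size, subblocks)
    wasted = used - file_size
    print(f"{subblocks} sub-blocks: allocate {used // KIB} KiB, waste {wasted // KIB} KiB")
```

Under these assumptions, a 32 KiB file on a 16 MiB block size occupies a 512 KiB sub-block with 32 sub-blocks per block (480 KiB wasted), while with 1024 sub-blocks per block it occupies exactly 32 KiB.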

First results on ESS GF1 and GF2 were reported, and there was a lively discussion around the value of IOPS measurements. You really need to come see Sven speak if you haven’t; if nothing else, check out his slides.

Lastly, Dean quickly presented on containers and asked how many people are getting requests for this —result was 20-30%.

So, that was a packed session, and we’ll look forward to the next event. Look for a poll that includes a question as to whether the next US event should be 2 days. Keep those cards and letters coming; we need to know what the User Group wants!