CHESS is a software architecture designed specifically by ClusterTech to integrate HPCC components, providing a complete environment for integration, update, configuration, maintenance, upgrade and use of software components. CHESS has the following advantages:
Provide a browser/server (B/S) graphical interface with one-click switching between Chinese and English; Achieve centralised management of users and groups, resource management, configuration of task scheduler parameters, and real-time monitoring of task status, including CPU load, memory utilisation and network traffic, through the web GUI; Export reports in PDF, Excel or other formats through a powerful file management function; Support DCV application integration to accelerate 3D remote visualisation; Support Docker and Singularity containers; Support GPU card monitoring, with shared or exclusive GPU use
Easy To Use
Achieve rapid and automatic deployment of the entire cluster system; Strictly control administrator and user function permissions through modular installation; Support diskless node groups
Support high availability to avoid the time and economic losses caused by a single point of failure; Enable system backup and restoration of nodes to default settings; Provide comprehensive error alerts and logs
Support direct SSH and VNC access through CHESS; Achieve intelligent task scheduling, resource reservation, task backfilling, dynamic priority and cluster partitioning; Support customised development
CHESS System Deployment
In CHESS, the fast integrated deployment system helps administrators quickly and easily deploy the operating system (OS) and software to cluster nodes. In a standard setting, once MAC address collection and IP planning are done, CHESS completes system installation and configuration for 64 nodes in 60 minutes.
Combining software and hardware, the CHESS deployment system works from a console node with a pre-installed OS. From there it can install the operating system and software on the entire cluster or on a single node, and apply unified network and service configuration across the entire cluster.
The CHESS deployment system can be flexibly configured according to user requirements, for example:
- OS version selection, installation content, and installation sequence
- Hard disk partitions, their sizes, and file system types
- Network IP address settings, and NIC bonding
- Flexible host name resolution methods. A host name can be any combination of letters, digits and permitted symbols that meets the host name specification
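The planning step described above (collected MAC addresses, an IP plan and host names) can be pictured with a minimal sketch. This is not the CHESS deployment format: the `NodePlan` record, the `plan_nodes` function, the `node` prefix and the starting host offset are all hypothetical, and the host name check assumes the RFC 1123 label rules (letters, digits and hyphens).

```python
import ipaddress
import re
from dataclasses import dataclass

# Assumed host name rule (RFC 1123 label): letters, digits and hyphens,
# 1-63 characters, not starting or ending with a hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


@dataclass(frozen=True)
class NodePlan:
    """One row of a hypothetical deployment plan."""
    mac: str
    hostname: str
    ip: str


def plan_nodes(macs, subnet, prefix="node", first_host=11):
    """Assign a host name and an IP address to each collected MAC address."""
    network = ipaddress.ip_network(subnet)
    plans = []
    for index, mac in enumerate(macs):
        hostname = f"{prefix}{index + 1:03d}"
        if not _LABEL.match(hostname):
            raise ValueError(f"invalid host name: {hostname}")
        # Addresses are handed out sequentially from a fixed offset.
        address = network.network_address + first_host + index
        if address not in network or address == network.broadcast_address:
            raise ValueError("subnet too small for the node list")
        plans.append(NodePlan(mac.lower(), hostname, str(address)))
    return plans


if __name__ == "__main__":
    for plan in plan_nodes(["AA:BB:CC:00:00:01", "AA:BB:CC:00:00:02"], "10.0.0.0/24"):
        print(plan)
```

Keeping the plan as plain records makes it easy to review the MAC-to-IP mapping before any node is installed.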
CUI Web Portal
CUI (ClusterTech User Interface) is the user interface of CHESS. It is accessed through a browser (IE 10 or above, Google Chrome, Firefox, etc.) and presents the functional modules, such as cluster management, cluster monitoring, job scheduling, job scheduling management and cluster reporting, in one unified interface. It also provides unified login for ClusterTech's independently developed software, and administrators can set user access permissions per module.
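Per-module access control of this kind is commonly modelled as roles that each grant a set of modules. The sketch below illustrates that idea only; the `Role` class and `can_access` function are hypothetical and do not describe how CUI stores permissions. The module names are the ones listed above.

```python
from dataclasses import dataclass, field

# Module names taken from the portal description above.
MODULES = {
    "cluster management",
    "cluster monitoring",
    "job scheduling",
    "job scheduling management",
    "cluster reporting",
}


@dataclass
class Role:
    """A hypothetical role that grants access to a set of modules."""
    name: str
    modules: set = field(default_factory=set)


def can_access(user_roles, module):
    """Return True if any of the user's roles grants the given module."""
    if module not in MODULES:
        raise ValueError(f"unknown module: {module}")
    return any(module in role.modules for role in user_roles)


if __name__ == "__main__":
    admin = Role("administrator", set(MODULES))
    normal = Role("normal user", {"job scheduling", "cluster monitoring"})
    print(can_access([normal], "job scheduling"))             # granted
    print(can_access([normal], "job scheduling management"))  # not granted
    print(can_access([admin], "job scheduling management"))   # granted
```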
CHESS Cluster Management
CHESS cluster management functions include: user management, node management, project management, message management and log management.
- User management:
Administrators can add, delete, modify and view users through the interface. Users can be displayed according to the organisational structure. You can also customise role settings and assign different permissions to different roles.
- Node management:
The CHESS node list interface displays the node status, host name, service status, resource usage and job distribution. From this page you can shut down, restart, hard shut down or hard boot a node, and open its VNC session, dashboard or console.
- Project management:
Through the interface, administrators or users can customise the project name, assign time and user rights to the project, and produce per-project reports.
- Message management:
The message management function lets administrators send home page notifications and messages to users, so that information reaches users more efficiently and accurately. Click “System Setting” and open the “Message Management” navigation menu to add, delete, refresh and search for messages.
- Log management:
The CHESS log management function lets administrators track and view the various operation logs of users and the system.
CHESS Job Scheduling
- Scheduling management:
The CHESS resource management and job scheduling system manages all software and hardware resources in the system, and the jobs submitted by users, reasonably and efficiently, to maximise the throughput and utilisation of the cluster. The job scheduling management function is available only to system administrators, who can actively schedule resources to optimise resource use and reduce job response time. They can view the CPU usage of each node and tune the cluster by configuring resource managers and scheduling strategies. They can also set queues, nodes and user/group priorities, and manage resources through the CHESS cluster management system, making the complex task of cluster resource scheduling simple, unified and efficient.
- CPU job submission:
The CHESS job submission page provides file management functions for normal users, who can manage files in the system directly, for example create, edit, upload, download, copy, cut, paste, compress and decompress files.
- GPU job submission:
When submitting a GPU job, you can set additional GPU-related submission parameters in the application template, including the GPU type and the number of GPUs per node. GPUs can be requested in one of two modes: shared or exclusive.
- Scheduling strategy:
The CHESS job scheduling system supports inter-task dependencies, automatic file transfer (file staging), multiple task queues, multiple system groups, multiple configurable task priority strategies, management of multiple resource types with task priorities, QoS (Quality of Service, covering service objects and resources, and function access control), configurable node allocation strategies, multiple configurable backfill policies, detailed system diagnostics, and tracking and statistics of resource usage.
- Application template:
Administrators can add and edit templates. Using the basic components provided by CHESS, they drag and drop the interaction parameters that users commonly need into the application template, which users then use to submit jobs. In the template list, you can add, delete, publish, disable, search for and edit application templates, and edit their users and groups.
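Backfilling, one of the scheduling strategies listed above, can be illustrated with a minimal sketch of the widely used EASY backfill rule. The CHESS backfill policies themselves are not documented here, so this is a textbook illustration under stated assumptions: the `Job` record and `pick_startable` function are hypothetical, and each job is assumed to declare a core count and a wall-time limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    cores: int
    walltime: int  # requested run time, in minutes


def pick_startable(queue, free_cores, running, now=0):
    """Return the names of jobs to start now.

    queue   -- waiting jobs in priority order
    running -- list of (end_time, cores) for jobs already running
    """
    queue = list(queue)
    running = list(running)
    started = []
    # 1. Start jobs in priority order for as long as they fit.
    while queue and queue[0].cores <= free_cores:
        job = queue.pop(0)
        free_cores -= job.cores
        running.append((now + job.walltime, job.cores))
        started.append(job.name)
    if not queue:
        return started
    # 2. The head job is blocked: find the time at which it can start.
    head = queue[0]
    available, reserved_at = free_cores, None
    for end_time, cores in sorted(running):
        available += cores
        if available >= head.cores:
            reserved_at = end_time
            break
    if reserved_at is None:
        return started  # the head job can never fit; do not backfill
    spare = available - head.cores  # cores the head job leaves unused
    # 3. Backfill lower-priority jobs that cannot delay the head job.
    for job in queue[1:]:
        if job.cores > free_cores:
            continue
        if now + job.walltime <= reserved_at:
            free_cores -= job.cores  # finishes before the reservation
            started.append(job.name)
        elif job.cores <= spare:
            free_cores -= job.cores  # only uses cores the head job leaves over
            spare -= job.cores
            started.append(job.name)
    return started


if __name__ == "__main__":
    waiting = [Job("A", 8, 120), Job("B", 2, 30), Job("C", 2, 90)]
    # 8-core cluster: 6 cores are busy until t=60, 2 cores are free now.
    print(pick_startable(waiting, free_cores=2, running=[(60, 6)]))
```

In the usage example, job A needs the whole cluster and must wait until the running job ends at minute 60. Job B fits in the two idle cores and finishes by then, so it is started early. Job C would still be running at minute 60 and would delay A, so it stays queued.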
CHESS Cluster Monitoring
The CHESS cluster monitoring function monitors the running status and resource usage of the entire cluster, of individual machines and of GPUs. Administrators can check the running status of the cluster at any time and troubleshoot promptly.
- Cluster overview:
By default, the cluster overview interface displays the cluster summary, including the 30-minute CPU/memory status, storage status, swap partition status, load status and network status of the cluster.
- Cabinet diagram display:
CHESS supports an intuitive physical view in the form of a cabinet map, which can be customised to match the physical placement of the user's on-site servers. You can view the basic information of each node, start, shut down or restart any node, and log in to it through VNC or a shell.
- Single-machine monitoring:
From a node in the cabinet diagram, you can open the dashboard of that single node. The interface displays detailed information about the node, such as CPU/memory, storage, swap partition, load and network. These panels are similar to those in the cluster summary.
- GPU monitoring:
If a server has a GPU card, you can view its information in the GPU list. The interface displays the information of all GPU cards on each node, including the host name, GPU name, utilisation, temperature, used video memory, video memory frequency, processor frequency and PCIe read/write bandwidth.
Artificial Intelligence Module
The artificial intelligence module integrates the two major application areas of HPC and AI through the CHESS web interface, so that various tasks can be scheduled in a unified way on a general-purpose CPU or GPU cluster. This removes the need to build a dedicated heterogeneous GPU platform for AI applications. Efficient distributed training and inference address the AI computing performance bottleneck on CPUs, achieving a unified and efficient HPC/AI fusion with essentially no change to the working habits of HPC users.
CHESS supports two ways of running AI frameworks: Docker-based single-machine training and inference, usually run on single-machine multi-GPU servers, and Singularity-based multi-machine training and inference.
- Single-machine training and inference:
CHESS supports Docker container-based training and inference on a single machine with multiple GPU cards. It provides a container repository in which users can work with Docker images, and parameters are set and submitted through a CHESS application template.
- Multi-machine training and inference:
Multi-machine training and inference uses the CHESS scheduler to dispatch the Singularity container-based AI framework to multiple servers, and uses MPI to run training and inference in parallel across machines.
- Viewing of training results:
You can find the TensorFlow running directory through the data management interface, and right-click to open the TensorBoard visualisation tool to check the ongoing training and inference results of TensorFlow.
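For the multi-machine case, a common pattern outside CHESS is to let `mpirun` start one containerised process per rank, with `singularity exec --nv` exposing the NVIDIA GPUs to the container. The sketch below only assembles such a command line and does not run it. How CHESS actually launches jobs is not described here, so the `build_launch_command` function, the image name, the script name and the host names are all hypothetical, and the `host:slots` form assumes Open MPI-style host lists.

```python
import shlex


def build_launch_command(image, script, hosts, ranks_per_host=1, use_gpu=True):
    """Build an mpirun command that runs a script inside a Singularity image."""
    total_ranks = len(hosts) * ranks_per_host
    # Assumed Open MPI-style host list: "host:slots,host:slots".
    host_spec = ",".join(f"{host}:{ranks_per_host}" for host in hosts)
    command = ["mpirun", "-np", str(total_ranks), "--host", host_spec,
               "singularity", "exec"]
    if use_gpu:
        command.append("--nv")  # make the host's NVIDIA GPUs visible in the container
    command += [image, "python", script]
    return command


if __name__ == "__main__":
    cmd = build_launch_command("tensorflow.sif", "train.py", ["node001", "node002"], 2)
    print(shlex.join(cmd))
```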
CHESS Cluster Reporting
The reporting system provides users with five types of reports, which can be filtered by a variety of conditions and display dimensions: Done Job, Project Job, User Job, CPU/GPU Billing Summary and CPU/GPU Billing Detail. After filtering, administrators can display the report or download it in PDF, HTML or Excel format.
CHESS Cluster Billing
The CHESS cluster billing function provides queue-based rate setting, and viewing and downloading of the charging overview and charging details. It also supports pre-payment: once the pre-set amount is exceeded, the user can no longer submit jobs.
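The pre-payment rule can be sketched as follows. The queue names, the rates and the charging unit (per core-hour or GPU-hour) are assumptions made for illustration, as are the `Account` class and its methods; only the queue-based rates and the rule that submission stops once the pre-set amount is exceeded come from the description above.

```python
from decimal import Decimal

# Hypothetical queue rates, per core-hour (cpu) or GPU-hour (gpu).
QUEUE_RATES = {"cpu": Decimal("0.10"), "gpu": Decimal("1.50")}


class Account:
    """A prepaid account: submission is blocked once spending exceeds the prepaid amount."""

    def __init__(self, prepaid):
        self.prepaid = Decimal(prepaid)
        self.spent = Decimal("0")

    def can_submit(self):
        # Jobs are refused only after the pre-set amount has been exceeded.
        return self.spent <= self.prepaid

    def charge(self, queue, units, hours):
        """Charge a finished job at its queue's rate and return the cost."""
        cost = QUEUE_RATES[queue] * units * Decimal(hours)
        self.spent += cost
        return cost


if __name__ == "__main__":
    account = Account("10")
    print(account.can_submit())
    print(account.charge("gpu", 4, "2"))
    print(account.can_submit())
```

`Decimal` is used instead of floating point so that monetary amounts add up exactly.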
CHESS provides CAE users with a complete set of high performance computing environment software packages. It combines CAE application software with the scheduling system and provides application templates, which makes it easier to submit jobs and view results.
CHESS extends its support for existing HPC applications with artificial intelligence frameworks. It can schedule AI frameworks based on Docker containers and distributed AI frameworks based on Singularity containers. The integrated TensorBoard tool can be used to view the training process at any time while TensorFlow is running.
CHESS can be widely used in the oil industry to monitor all computing resources and manage clusters remotely, while providing operational and usage reports for software such as CGG and Omega.
CHESS provides life science HPC users with a complete parallel software development and running environment, with the CHESS Monitor and CHESS Schedule modules configured for cluster monitoring and job scheduling. It includes templates and settings specific to different applications, which improve user efficiency and make high-performance computers easier to use.
CHESS provides a complete set of software solutions for universities and research institutes. It supports a variety of complex parallel environments and applications, and a variety of job scheduling strategies, giving users a software support platform. CHESS lets users focus on their own scientific work and helps them build an efficient and stable cluster system.
ClusterTech has rich experience in the meteorological industry. CHESS provides users with a complete parallel computing development and running environment, and different application templates can be used to submit jobs for different numerical models. For the massive numbers of small files involved in meteorological simulation, it provides an in-memory file system solution with high bandwidth and high IOPS, which removes the I/O bottleneck for users. We also provide installation, debugging and training services for common numerical models, and work with users to install and debug numerical forecast systems.