Benefits
Support for Hybrid Use of In-House HPC and Cloud Resources
The local cluster can be linked with the public cloud to quickly build a hybrid cloud. Public cloud computing resources scale flexibly with demand, providing virtually unlimited capacity.
Private Cluster Management Tool
Provides rapid deployment, centralised management, and unified scheduling for private clusters.
One-Stop Service
Provides users with optimised and customised application software and cluster management software; one-click customisation of the user's cluster; online use of data files, application software, computing resources and storage resources; online monitoring and other functions; and 24/7 online service.
Multi-Architecture Support
Supports x86 and ARM64 servers.
Support for Various Open-Source AI Frameworks
TensorFlow, Caffe and others.
GPU Monitoring and Scheduling
Supports single-card and multi-card GPU sharing.
Functions
With its modular design, CHESS lets users freely combine modules according to their needs. Modules include deployment, cluster management, cluster arrangement, monitoring, job scheduling, hybrid cloud, statistics and billing, and Web portal.
Deployment Module
The deployment module helps system administrators deploy the operating system and software applications efficiently and conveniently.
- Batch installation, rapid deployment;
- Elastic extension and dynamic scaling of nodes;
- System backup and restore functions;
- System imaging and customised software packages for different nodes;
- Unified deployment of operating system, management software and application environments.
Cluster Management Module
This module provides node management, parallel commands, remote power on/off and other functions. NFS shared directory management, operation logs and power on/off records are available via the Web-based interface.
- Node role management
The role of a node can be switched by ticking the corresponding letter (M/I/E/T) in the role column.
- Node status and node operation
View node information, including online status and whether job submission is allowed, and perform single or batch node operations (delete, power on/off, restart, new image, restore node, SSH, VNC, etc.).
- Shared directory management
Shared directories can be created via the web interface, and mount points can be edited without complex NFS shared file system configuration.
- Operating system imaging
Node system image management provides one-click recovery of the operating system.
- Cluster operation log queries
Cluster operation logs can be queried, including the content, time, result and user of each operation.
Monitoring Module
The system administrator can monitor the physical cabinet view, the system cluster, node operation and resource usage. The module also supports web page and email alarms and configurable alarm thresholds.
- Intuitive cluster monitoring
The physical cabinet view shows node position and node status information, including server load, online status, CPU temperature, etc.
- Cluster/node performance status monitoring
Real-time monitoring of cluster/node CPU, memory, swap partition, network, disk, load and other performance indicators.
- File system usage
Lists the cluster shared directories, the mount points under each shared directory, and run status details.
- Fault notification
When a node fails or the load on CPU, memory or other indicators is too high, a notification is sent via short message service (SMS) or email. Notification history is kept for later reference.
- Alarm threshold setting
Alarm thresholds can be customised for different scenarios.
- Performance analysis
Node performance parameters can be set and displayed in real time, based on specific time ranges.
- GPU card monitoring
The performance of each GPU card can be monitored.
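As an illustration of how customisable alarm thresholds might be evaluated against monitoring samples, here is a minimal Python sketch. It is not CHESS code: the `Threshold` class, the `check_alarms` function and the metric names are hypothetical, chosen only to show the idea of comparing each sampled indicator with a configured limit.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str   # hypothetical metric name, e.g. "cpu_percent"
    limit: float  # an alarm is raised when the sampled value exceeds this


def check_alarms(sample, thresholds):
    """Return one message per metric whose sampled value exceeds its limit."""
    alarms = []
    for t in thresholds:
        value = sample.get(t.metric)
        # Metrics missing from the sample are skipped rather than alarmed.
        if value is not None and value > t.limit:
            alarms.append(f"{t.metric}={value} exceeds threshold {t.limit}")
    return alarms


thresholds = [Threshold("cpu_percent", 90.0), Threshold("mem_percent", 85.0)]
sample = {"cpu_percent": 97.5, "mem_percent": 40.0}
print(check_alarms(sample, thresholds))
# ['cpu_percent=97.5 exceeds threshold 90.0']
```

In a real deployment the returned messages would be handed to whatever notification channel is configured, such as email or SMS.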
Job Scheduling Module
This module optimises the cluster system hardware and software resources, reducing job response time and supporting multiple job submission templates. It simplifies cluster resource management, providing a clear view of the node CPU usage and configuration of the resource manager. One can also edit/delete/compress scripts directly via the web interface.
- Unified job management interface
View the status, queue and owner of submitted jobs from the job management list, and delete/stop jobs.
- Compute node configuration
View the number of cores and CPU utilisation of each node in the cluster, monitor node job submission, modify node properties, and control node resources.
- Scheduling policy
Provides resource reservation, the backfill algorithm, dynamic priority, fair sharing, quota management, system diagnosis, system monitoring, statistics and other functions. It supports QoS/preemption strategies and policy-based scheduling; jobs gain access to cluster resources based on their priority.
- User group policies
User group policies include maximum number of jobs, maximum number of processors, maximum memory, maximum hard disk, maximum wall time, priority, etc.
- Resource reservation
Computing resources can be reserved for users, ensuring that a job has computing resources available at a specific time.
- Multiple application templates, flexible job submission
Job templates can be used to simplify the submission of similar jobs; new templates can be created for any application.
Job submission options: command line, web interface, application integration interface, job script and executable file submission. Common applications can also be set as templates.
- Comprehensive file management
The intuitive web interface allows users to create, edit, upload, download, copy, cut, paste, compress, and decompress files.
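The backfill algorithm mentioned under scheduling policy can be sketched as follows. This is a simplified, generic illustration in Python and not the CHESS scheduler: the `Job` class and `backfill` function are hypothetical, and the sketch implements only the conservative rule that a lower-priority job may start early if it fits in the idle cores and finishes by the time the blocked head-of-queue job could start.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int
    walltime: int  # requested run time in minutes


def backfill(queue, free_cores, running, now=0):
    """Return the names of the jobs to start at time `now`.

    queue   -- waiting jobs, highest priority first
    running -- list of (end_time, cores) for jobs already running
    """
    queue = list(queue)
    running = list(running)
    started = []

    # 1. Start jobs in priority order for as long as they fit.
    while queue and queue[0].cores <= free_cores:
        job = queue.pop(0)
        free_cores -= job.cores
        running.append((now + job.walltime, job.cores))
        started.append(job.name)
    if not queue:
        return started

    # 2. The head job is blocked. Find its "shadow time": the earliest
    #    moment at which enough cores will have been released for it.
    #    It stays None if the job can never fit.
    head = queue[0]
    available, shadow = free_cores, None
    for end_time, cores in sorted(running):
        available += cores
        if available >= head.cores:
            shadow = end_time
            break

    # 3. Backfill: a lower-priority job may start now only if it fits in
    #    the idle cores and ends by the shadow time, so the head job is
    #    never delayed.
    for job in queue[1:]:
        if job.cores <= free_cores and (shadow is None or now + job.walltime <= shadow):
            free_cores -= job.cores
            started.append(job.name)
    return started


# 8-core cluster: 6 cores busy until t=60, 2 cores idle.
queue = [Job("A", 8, 120), Job("C", 2, 90), Job("B", 2, 30)]
print(backfill(queue, free_cores=2, running=[(60, 6)]))  # ['B']
```

Job A needs the whole cluster and must wait until t=60. Job C would still be running at that point, so only job B is started in the idle cores.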
Hybrid Cloud Module
The hybrid cloud module combines local servers and public cloud resources into a unified HPC cluster system and application environment. It can be expanded as needed and deployed flexibly, greatly improving computing power and speeding up application processing.
- Hybrid cloud node management
Through the web-based interface, one can manage basic node information, including hostname, MAC address, IP address, role, specification, status, creation time, etc.
- Cloud node provisioning
Public cloud resources can be provisioned as required; flexible pricing includes monthly or yearly subscription and pay-per-use. After the provisioning request is completed, the relevant information can be viewed for the hybrid cloud node.
- Cloud node operation
Node management operations include startup, shutdown, forced shutdown, restart, forced restart and release.
- Cloud storage management
Job data can be read from and written to shared NAS storage, which can be created and mounted directly on public cloud nodes. Such storage can also be removed as needed.
Statistics and Billing
Reporting options include rich resource statistics, a report overview, individual report details, etc.; PDF, HTML and Excel formats are supported.
- Cluster computing resource usage statistics
Generate reports on cluster CPU/memory/swap partition/storage usage and on completed, running and waiting jobs.
- Resource consumption statistics and flexible rate setting
Based on user (group) CPU usage time and run time, charge rates can be set flexibly to generate bills.
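To make the idea of rate-based billing concrete, the following Python sketch computes per-user bills from usage records and per-group rates. The record layout, the `compute_bills` function and the rate values are assumptions made for illustration; the actual CHESS billing model is not described in this document.

```python
from collections import defaultdict


def compute_bills(records, rates, default_rate=0.0):
    """Sum core-hours per user and price them at the user's group rate.

    records -- iterable of (user, group, cores, hours) tuples (assumed layout)
    rates   -- mapping of group name to price per core-hour (assumed values)
    """
    bills = defaultdict(float)
    for user, group, cores, hours in records:
        rate = rates.get(group, default_rate)
        bills[user] += cores * hours * rate
    return dict(bills)


records = [
    ("alice", "research", 16, 2.0),
    ("bob", "teaching", 4, 10.0),
    ("alice", "research", 8, 1.0),
]
rates = {"research": 0.5, "teaching": 0.25}
print(compute_bills(records, rates))  # {'alice': 20.0, 'bob': 10.0}
```

Keeping the rates in a separate mapping means a charge rate can be changed per group without touching the usage records.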
Web Portal
The Web portal provides comprehensive facilities and various management functions, including monitoring, job scheduling, reporting, and the hybrid cloud module.
- Access control
For the cluster management, job scheduling, cluster monitoring and report statistics modules, the administrator can set user access rights and assign function modules to users via the Web interface.
- Service management
The interface also provides service monitoring: view service status/time/CPU utilisation/memory usage, and start/terminate/monitor services.
- Users and groups
The Web interface provides create/edit/delete functions for users and groups; one can view the groups of each user, change the group password, etc.
Industrial Applications
The CHESS Hybrid Cloud Platform can be widely used in aerospace, automotive, electronics, education, scientific research, petroleum, meteorology, life sciences, manufacturing, artificial intelligence and other domains with high computational demands.
- Manufacturing: Ansys, Fluent, Abaqus, CFX, Numeca, etc.
- Computational chemistry: VASP, GROMACS, LAMMPS, NAMD, Gaussian, Materials Studio, etc.
- Meteorological applications: MM5, GRAPES, CESM, WRF, WRF-Chem, etc.
- Biomedical applications: Anaconda, Bioconda, BWA, FastQC, etc.
- Scientific computing: MATLAB, R, Mathematica, etc.
- Artificial intelligence: TensorFlow, Caffe, etc.