###########################################
# LANL Trinity trace FAQ version 1.0 Beta #
###########################################

1. Trace description
====================

As of 2018, Trinity is the largest supercomputer at the Los Alamos National
Lab (LANL) and it is used for capability computing. Capability clusters are
large-scale, high-demand resources that introduce novel hardware technologies
to help achieve crucial computing milestones, such as higher-resolution
climate and astrophysics models. Trinity's hardware was stood up in two
pre-production phases before being put into full production use, and our
trace was collected before the second phase completed. At the time of data
collection, Trinity consisted of 9408 identical compute nodes, for a total of
301056 Intel Xeon E5-2698v3 2.3GHz cores and 1.2PB RAM, making this the
largest cluster with a publicly available trace by number of CPU cores.

This Trinity dataset covers 3 months of the machine's operation, from
February to April 2016. During that time, Trinity was operating in
OpenScience mode, i.e., the machine was undergoing beta testing and was
available to a larger number of users than it is expected to have after it
receives its final security classification. We note that OpenScience
workloads are representative of a capability supercomputer's workload, as
they occur roughly every 18 months when a new machine is introduced, or
before an older one is decommissioned.

The dataset consists of 25237 multi-node jobs issued by 88 users and
collected by MOAB. Collected data include: timestamps for job stages from
submission to termination, job properties such as size and owner, the job's
return status, and a time budget per job that, if exceeded, causes the job
to be killed.

2. Scheduling policy
====================

The general scheduling policy is strongly fair-share dominated; however,
backfill is used even if the fair-share allocation is negative (i.e., the
user has exceeded their quota). Because MOAB knows the earliest time that
the highest-priority job can start, and which resources it will need, it can
also determine which jobs can be started without delaying it. The backfill
feature allows the scheduler to start other, lower-priority jobs so long as
they do not delay the highest-priority job.

The scheduling policy for Trinity was the same in periods of ~6 months at a
time, corresponding to different "campaigns". It remained fair-share
dominated, but the individual project weightings changed for each "campaign"
period. Each campaign is expected to consist of the following phases:

- Input and dataset validation: jobs that use a small number of nodes
  compared to the dataset size
- Problem partitioning: grid sizing
- Parametric studies: sweet-spot determination of job sizes, nodes,
  processes per node, memory per process, etc.
- Computation
- Analysis and visualization: jobs that are primarily I/O-bound, implying
  different node use patterns

Jobs that belong to an individual user can get cancelled in batches, where
the timestamp and duration match across all the cancelled jobs. This is
usually because job scripts allow users to specify multiple jobs linked
together by explicitly stating their dependencies in the script. If one job
fails or gets cancelled, the rest are automatically cancelled.
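For illustration, here is a minimal sketch of how such batch cancellations
can be surfaced, assuming the trace has been exported to a CSV file (the
file name "trinity_trace.csv" is hypothetical) whose columns are named after
the fields described in section 3:

    # Minimal sketch (Python/pandas): find batches of jobs cancelled
    # together, i.e. JOBCANCEL events from the same user that share an
    # end timestamp. The file name and CSV layout are assumptions; the
    # column names follow section 3 of this FAQ.
    import pandas as pd

    trace = pd.read_csv("trinity_trace.csv",
                        parse_dates=["submission_time", "start_time",
                                     "end_time"])

    cancels = trace[trace["object_event"] == "JOBCANCEL"]

    # Groups with more than one job are candidate batch cancellations.
    batches = (cancels.groupby(["user_name", "end_time"])["object_id"]
                      .count()
                      .reset_index(name="jobs_cancelled"))
    print(batches[batches["jobs_cancelled"] > 1].head())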
3. Fields
=========

The dataset comprises numerous fields that follow the Workload Event Format
described at
http://docs.adaptivecomputing.com/9-0-3/MWM/Content/topics/moabWorkloadManager/topics/analyzing/workloadtrace.html.
Some fields that are not populated in the original trace have been removed.
Classes and Event Types have been renamed to uppercase to represent
categorical types. MOAB uses hyphens ("-") to indicate fields with no
values; these have been replaced with empty strings (""). Temporal data were
originally in epoch form, but have since been converted to a datetime
format.

3.1 event_time
The time the event was registered by the scheduler.

3.2 object_id
A unique identifier assigned to the job.

3.3 object_event
The type of scheduling event. The following types are possible:

- JOBSTART: the first event of a job recorded by the scheduler
- JOBEND: the last event of a job that has terminated successfully, as
  recorded by the scheduler
- JOBFAIL: the last event of a job that has terminated unsuccessfully, as
  recorded by the scheduler
- JOBCANCEL: the last event of a job that a user has terminated forcefully,
  as recorded by the scheduler

3.4 nodes_requested
The number of nodes requested by the job from the scheduler. If a job is
started, then it can be assumed that this request has been fulfilled. Since
this is a homogeneous cluster, and each node consists of 32 identical CPU
cores, this field can be used to derive the total number of cores assigned
to the job by the scheduler (see the sketch following section 3.16). A zero
value indicates that the requested number of nodes was not specified by the
user.

3.5 tasks_requested
The number of tasks requested, where each task corresponds to a physical CPU
core. This may not be equal to the number of actual cores assigned to a job,
as jobs at LANL are assigned entire physical nodes. To calculate the total
number of physical CPU cores assigned to the job, see section 3.4.

3.6 user_name
An identifier of the user submitting the job. There are 88 users in the
trace, and user IDs have been anonymized using random numbers.

3.7 group_name
An identifier of the primary group of the user submitting the job. There are
91 groups in the trace, with each user belonging to a distinct group, and
some switching their group partway through the trace.

3.8 wallclock_limit
The maximum allowed job duration in seconds. Typically ranges from 1 to 2160
minutes (i.e., 36 hours, which is the maximum wall clock time for Trinity).
There can be outliers of jobs with longer wallclock limits, however.

3.9 job_event_state
The state of the job at the time of the event. Possible values include:

- Completed: the job completed successfully
- Running: the job is still running
- Vacated: the job has been removed from the scheduler queue
- Hold: the job has been suspended

3.10 required_class
The type of queue required by the job, specified as a square-bracketed list
of [queue:count] requirements. For example: [batch:1]. Possible values in
this trace include: ccm_queue, dat, standard.

3.11 submission_time
The time when the job was submitted to the scheduler's queue. The Trinity
trace covers the period from February 2, 2016 until April 21, 2016.

3.12 dispatch_time
The time when the scheduler requested that the job begin executing. The
Trinity trace covers the period from February 2, 2016 until April 21, 2016.

3.13 start_time
The time when the job began executing. The Trinity trace covers the period
from February 2, 2016 until April 21, 2016.

3.14 end_time
The time when the job completed executing. The Trinity trace covers the
period from February 2, 2016 until April 21, 2016.

3.15 queue_time
The time when the job met all fairness policies.

3.16 tasks_allocated
The number of tasks allocated to the job. In most cases, this field is
identical to the tasks_requested field.
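As a companion to fields 3.4, 3.5, and 3.11-3.14, here is a minimal sketch
of deriving allocated cores and job durations from the trace, again assuming
a CSV export (the file name "trinity_trace.csv" is hypothetical) with
columns named after the fields above:

    # Minimal sketch (Python/pandas). Each Trinity node has 32 identical
    # cores (section 3.4), so allocated cores = nodes_requested * 32.
    # A nodes_requested value of 0 means the user did not specify a node
    # count, so the derived core count is also 0 for those jobs.
    import pandas as pd

    CORES_PER_NODE = 32

    trace = pd.read_csv("trinity_trace.csv",
                        parse_dates=["submission_time", "start_time",
                                     "end_time"])

    trace["allocated_cores"] = trace["nodes_requested"] * CORES_PER_NODE

    # Queue wait and runtime in seconds, from the timestamps in 3.11-3.14.
    trace["queue_wait_s"] = (trace["start_time"]
                             - trace["submission_time"]).dt.total_seconds()
    trace["runtime_s"] = (trace["end_time"]
                          - trace["start_time"]).dt.total_seconds()

    print(trace[["object_id", "nodes_requested", "allocated_cores",
                 "tasks_requested", "queue_wait_s", "runtime_s"]].head())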
3.17 required_tasks_per_node
The number of tasks per node, as required by the job, or '-1' if no
requirement was specified. In this trace the values found fall in [0, 32],
but are mostly either 0 or 32.

3.18 QOS
The quality of service requested and assigned, in the format
requested:assigned. For example: "hipriority:bottomfeeder". Possible values
in this trace include: -:-, -:hpcdev, -:support, -:tos1, -:tos2, -:tos3,
-:tos4, -:tos5, -:tos6, -:tos7, -:tos8, -:tos-exec.

3.19 job_flags
Square bracket delimited list of job attributes. For example:
[BACKFILL][PREEMPTEE]. The following flags show up in our trace:

- PROCSPECIFIED: Job requested processors on the command line
- WIDERSVSEARCHALGO: Aims for earlier start times by relaxing the default
  guarantee of finding the longest possible ranges of machines to satisfy
  the job request
- RESTARTABLE: Job can be requeued/restarted if preempted
- INTERACTIVE: Job needs interactive input from the user to run
- TEMPLATESAPPLIED: Job had all applicable templates applied ("this is
  this" type of definition, found in the Moab docs)
- ADVRES: Job may only utilize accessible, reserved resources
- GRESONLY: Job uses no compute resources (processors, memory, etc.), only
  generic ones
- NORMSTART: Job is an internal system job and will not be started via the
  RM
- BACKFILL: Job uses backfill to run
- GLOBALQUEUE: Job directly submitted without any authentication
- FSVIOLATION: Job started with a fairshare violation
- PREEMPTOR: Job may preempt other PREEMPTEE jobs

3.20 account_name
The name of the account associated with the job, if specified. Possible
values in this trace include: ccm_queue, hpcdev, llnl-exec, NOACCT, support,
tos1, tos2, tos3, tos4, tos5, tos6, tos7, tos8, tos-exec.

3.21 executable
The full path of the job executable, if specified. 73% of JOBSTART events
specify a path. Typically (for our trace), these paths fall under $HOME or
/var/spool/moab/spool/.

3.22 resource_manager_extension_string
Resource manager specific list of job attributes, if specified. See the
relevant MOAB documentation at
http://docs.adaptivecomputing.com/9-0-3/MWM/Content/topics/moabWorkloadManager/topics/resourceManagers/rmextensions.html
for more information.

3.23 bypass_count
The number of times the job was bypassed by lower-priority jobs via
backfill, or '-1' if not specified.

3.24 proc_seconds_utilized
The number of processor seconds used by the job. Mostly zero in this trace.

3.25 partition_name
The name of the partition on which the job ran. Possible values in this
trace include: ALL, SHARED, trinity.

3.26 dedicated_processors_per_task
The number of processors required per task. Mostly 1, with a few exceptions
using 0 or 32 in this trace.

3.27 allocated_host_list
A list of hosts allocated to the job. The list is colon (':') delimited, and
each entry may represent a range. For example, "12:14-31:36-63" implies that
the hosts allocated to the job were 12, 14 through 31, and 36 through 63.
All ranges are inclusive.
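Here is a minimal sketch of expanding such a value into individual host
numbers, assuming host identifiers are numeric as in the example above:

    # Minimal sketch (Python): expand an allocated_host_list value such as
    # "12:14-31:36-63" into the individual host numbers it denotes.
    # Entries are colon-delimited; each entry is a single host or an
    # inclusive range (section 3.27).
    def expand_host_list(host_list):
        hosts = []
        for entry in host_list.split(":"):
            if not entry:
                continue
            if "-" in entry:
                lo, hi = entry.split("-")
                hosts.extend(range(int(lo), int(hi) + 1))
            else:
                hosts.append(int(entry))
        return hosts

    # "12:14-31:36-63" -> 12, 14..31, 36..63, i.e. 1 + 18 + 28 = 47 hosts
    print(len(expand_host_list("12:14-31:36-63")))  # 47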
3.28 resource_manager_name
The name of the resource manager, if specified. In this trace, 23% of
entries reference "internal", and 77% reference "trinity".

3.29 reservation
The name of the reservation required by the job, if specified. Most entries
leave this empty, but for the rest, values fall in: DAT-marti, eotos,
eotos.61, langer-tos8, ldms.33, ldms-DAT, ldms-DAT.41, LDMS-DAT.41,
ldms-dat-prep.51, ldms_prep.59, nalu.64, nalu.65, nalu.66, rt113630,
rt113630.17, rt113630.28, slownodes.

3.30 job_message
Incorporates messages from many systems, including the resource manager,
scheduler, and administrative systems, for that job. In this trace, 8% of
events carry a job message.

3.31 completion_code
The job exit status/completion code. The most popular ones are 0, 1, 2, -3,
and 12. Assuming these are UNIX error codes (warning: they may not be!),
they could be interpreted as follows:

- 1, or EPERM: Operation not permitted
- 2, or ENOENT: No such file or directory
- 3, or ESRCH: No such process
- 12, or ENOMEM: Out of memory

3.32 effective_queue_duration
The amount of time, in seconds, that the job was eligible for scheduling.

4. Contact info
===============

Please direct inquiries, feedback, or suggestions regarding this trace to
the ATLAS mailing list at info@project-atlas.org

5. Usage
========

Please cite the following paper when you publish work that uses this trace:

George Amvrosiadis, Jun Woo Park, Gregory R. Ganger, Garth A. Gibson,
Elisabeth Baseman, Nathan DeBardeleben. "On the diversity of cluster
workloads and its impact on research results." In Proceedings of the 2018
USENIX Annual Technical Conference, Boston, MA, July 11-13, 2018.