###########################################
# LANL Trinity trace FAQ version 1.0 Beta #
###########################################

1. Trace description
====================

As of 2018, Trinity is the largest supercomputer at the Los Alamos National
Lab (LANL) and it is used for capability computing. Capability clusters are
large-scale, high-demand resources that introduce novel hardware technologies
to help achieve crucial computing milestones, such as higher-resolution
climate and astrophysics models. Trinity's hardware was stood up in two
pre-production phases before being put into full production use, and our
trace was collected before the second phase completed. At the time of data
collection, Trinity consisted of 9408 identical compute nodes, for a total of
301056 Intel Xeon E5-2698v3 2.3GHz cores and 1.2PB RAM, making this the
largest cluster with a publicly available trace by number of CPU cores.

This Trinity dataset covers 3 months of the machine's operation, from
February to April 2016. During that time, Trinity was operating in
OpenScience mode, i.e., the machine was undergoing beta testing and was
available to a larger number of users than it is expected to have after it
receives its final security classification. We note that OpenScience
workloads are representative of a capability supercomputer's workload, as
they occur roughly every 18 months when a new machine is introduced, or
before an older one is decommissioned.

The dataset consists of 25237 multi-node jobs issued by 88 users and
collected by MOAB. Collected data include: timestamps for job stages from
submission to termination, job properties such as size and owner, the job's
return status, and a time budget per job that, if exceeded, causes the job
to be killed.

2. Scheduling policy
====================

The general scheduling policy is strongly fair-share dominated; however,
backfill is used even if the fair-share allocation is negative (i.e., the
user has exceeded their quota). Because MOAB knows the earliest time that
the highest-priority job can start, and which resources it will need, it can
also determine which jobs can be started without delaying it. The backfill
feature allows the scheduler to start other, lower-priority jobs so long as
they do not delay the highest-priority job.

The scheduling policy for Trinity was the same in periods of ~6 months at a
time, corresponding to different "campaigns". It remained fair-share
dominated, but the individual project weightings changed for each "campaign"
period. Each campaign is expected to consist of the following phases:

- Input and dataset validation: jobs that use a small number of nodes
  compared to the dataset size
- Problem partitioning: grid sizing
- Parametric studies: sweet-spot determination of job sizes, nodes,
  processes per node, memory per process, etc.
- Computation
- Analysis and visualization: jobs that are primarily I/O-bound, implying
  different node use patterns

Jobs that belong to an individual user can get cancelled in batches, where
the timestamp and duration match across all the cancelled jobs. This is
usually because job scripts allow users to specify multiple jobs linked
together by explicitly stating their dependencies in the script. If one job
fails or gets cancelled, the rest are automatically cancelled.
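For illustration, here is a minimal sketch of how such batch cancellations
can be surfaced, assuming the trace has been exported to a CSV file (the
file name "trinity_trace.csv" is hypothetical) whose columns are named after
the fields described in section 3:

    # Minimal sketch (Python/pandas): find batches of jobs cancelled
    # together, i.e. JOBCANCEL events from the same user that share an
    # end timestamp. The file name and CSV layout are assumptions; the
    # column names follow section 3 of this FAQ.
    import pandas as pd

    trace = pd.read_csv("trinity_trace.csv",
                        parse_dates=["submission_time", "start_time",
                                     "end_time"])

    cancels = trace[trace["object_event"] == "JOBCANCEL"]

    # Groups with more than one job are candidate batch cancellations.
    batches = (cancels.groupby(["user_name", "end_time"])["object_id"]
                      .count()
                      .reset_index(name="jobs_cancelled"))
    print(batches[batches["jobs_cancelled"] > 1].head())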
3. Fields
=========

The dataset comprises numerous fields that follow the Workload Event Format
described at
http://docs.adaptivecomputing.com/9-0-3/MWM/Content/topics/moabWorkloadManager/topics/analyzing/workloadtrace.html.
Some fields that are not populated in the original trace have been removed.
Classes and Event Types have been renamed to uppercase to represent
categorical types. MOAB uses hyphens ("-") to indicate fields with no
values; these have been replaced with empty strings (""). Temporal data were
originally in epoch form, but have since been converted to a datetime
format.

3.1 event_time
The time the event was registered by the scheduler.

3.2 object_id
A unique identifier assigned to the job.

3.3 object_event
The type of scheduling event. The following types are possible:

- JOBSTART: the first event of a job recorded by the scheduler
- JOBEND: the last event of a job that has terminated successfully, as
  recorded by the scheduler
- JOBFAIL: the last event of a job that has terminated unsuccessfully, as
  recorded by the scheduler
- JOBCANCEL: the last event of a job that a user has terminated forcefully,
  as recorded by the scheduler

3.4 nodes_requested
The number of nodes requested by the job from the scheduler. If a job is
started, then it can be assumed that this request has been fulfilled. Since
this is a homogeneous cluster, and each node consists of 32 identical CPU
cores, this field can be used to derive the total number of cores assigned
to the job by the scheduler (see the sketch following section 3.16). A zero
value indicates that the requested number of nodes was not specified by the
user.

3.5 tasks_requested
The number of tasks requested, where each task corresponds to a physical CPU
core. This may not be equal to the number of actual cores assigned to a job,
as jobs at LANL are assigned entire physical nodes. To calculate the total
number of physical CPU cores assigned to the job, see section 3.4.

3.6 user_name
An identifier of the user submitting the job. There are 88 users in the
trace, and user IDs have been anonymized using random numbers.

3.7 group_name
An identifier of the primary group of the user submitting the job. There are
91 groups in the trace, with each user belonging to a distinct group, and
some switching their group partway through the trace.

3.8 wallclock_limit
The maximum allowed job duration in seconds. Typically ranges from 1 to 2160
minutes (i.e., 36 hours, which is the maximum wall clock time for Trinity).
There can be outliers of jobs with longer wallclock limits, however.

3.9 job_event_state
The state of the job at the time of the event. Possible values include:

- Completed: the job completed successfully
- Running: the job is still running
- Vacated: the job has been removed from the scheduler queue
- Hold: the job has been suspended

3.10 required_class
The type of queue required by the job, specified as a square-bracketed list
of [queue:count] requirements. For example: [batch:1]. Possible values in
this trace include: ccm_queue, dat, standard.

3.11 submission_time
The time when the job was submitted to the scheduler's queue. The Trinity
trace covers the period from February 2, 2016 until April 21, 2016.

3.12 dispatch_time
The time when the scheduler requested that the job begin executing. The
Trinity trace covers the period from February 2, 2016 until April 21, 2016.

3.13 start_time
The time when the job began executing. The Trinity trace covers the period
from February 2, 2016 until April 21, 2016.

3.14 end_time
The time when the job completed executing. The Trinity trace covers the
period from February 2, 2016 until April 21, 2016.

3.15 queue_time
The time when the job met all fairness policies.

3.16 tasks_allocated
The number of tasks allocated to the job. In most cases, this field is
identical to the tasks_requested field.
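As a companion to fields 3.4, 3.5, and 3.11-3.14, here is a minimal sketch
of deriving allocated cores and job durations from the trace, again assuming
a CSV export (the file name "trinity_trace.csv" is hypothetical) with
columns named after the fields above:

    # Minimal sketch (Python/pandas). Each Trinity node has 32 identical
    # cores (section 3.4), so allocated cores = nodes_requested * 32.
    # A nodes_requested value of 0 means the user did not specify a node
    # count, so the derived core count is also 0 for those jobs.
    import pandas as pd

    CORES_PER_NODE = 32

    trace = pd.read_csv("trinity_trace.csv",
                        parse_dates=["submission_time", "start_time",
                                     "end_time"])

    trace["allocated_cores"] = trace["nodes_requested"] * CORES_PER_NODE

    # Queue wait and runtime in seconds, from the timestamps in 3.11-3.14.
    trace["queue_wait_s"] = (trace["start_time"]
                             - trace["submission_time"]).dt.total_seconds()
    trace["runtime_s"] = (trace["end_time"]
                          - trace["start_time"]).dt.total_seconds()

    print(trace[["object_id", "nodes_requested", "allocated_cores",
                 "tasks_requested", "queue_wait_s", "runtime_s"]].head())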
3.17 required_tasks_per_node
The number of tasks per node, as required by the job, or '-1' if no
requirement was specified. In this trace the values found fall in [0, 32],
but are mostly either 0 or 32.

3.18 QOS
The quality of service requested and assigned, in the format
requested:assigned. For example: "hipriority:bottomfeeder". Possible values
in this trace include: -:-, -:hpcdev, -:support, -:tos1, -:tos2, -:tos3,
-:tos4, -:tos5, -:tos6, -:tos7, -:tos8, -:tos-exec.

3.19 job_flags
Square bracket delimited list of job attributes. For example:
[BACKFILL][PREEMPTEE]. The following flags show up in our trace:

- PROCSPECIFIED: Job requested processors on the command line
- WIDERSVSEARCHALGO: Aims for earlier start times by relaxing the default
  guarantee of finding the longest possible ranges of machines to satisfy
  the job request
- RESTARTABLE: Job can be requeued/restarted if preempted
- INTERACTIVE: Job needs interactive input from the user to run
- TEMPLATESAPPLIED: Job had all applicable templates applied ("this is
  this" type of definition, found in the Moab docs)
- ADVRES: Job may only utilize accessible, reserved resources
- GRESONLY: Job uses no compute resources (processors, memory, etc.), only
  generic ones
- NORMSTART: Job is an internal system job and will not be started via the
  RM
- BACKFILL: Job uses backfill to run
- GLOBALQUEUE: Job directly submitted without any authentication
- FSVIOLATION: Job started with a fairshare violation
- PREEMPTOR: Job may preempt other PREEMPTEE jobs

3.20 account_name
The name of the account associated with the job, if specified. Possible
values in this trace include: ccm_queue, hpcdev, llnl-exec, NOACCT, support,
tos1, tos2, tos3, tos4, tos5, tos6, tos7, tos8, tos-exec.

3.21 executable
The full path of the job executable, if specified. 73% of JOBSTART events
specify a path. Typically (for our trace), these paths fall under $HOME or
/var/spool/moab/spool/.

3.22 resource_manager_extension_string
Resource manager specific list of job attributes, if specified. See the
relevant MOAB documentation at
http://docs.adaptivecomputing.com/9-0-3/MWM/Content/topics/moabWorkloadManager/topics/resourceManagers/rmextensions.html
for more information.

3.23 bypass_count
The number of times the job was bypassed by lower-priority jobs via
backfill, or '-1' if not specified.

3.24 proc_seconds_utilized
The number of processor seconds used by the job. Mostly zero in this trace.

3.25 partition_name
The name of the partition on which the job ran. Possible values in this
trace include: ALL, SHARED, trinity.

3.26 dedicated_processors_per_task
The number of processors required per task. Mostly 1, with a few exceptions
using 0 or 32 in this trace.

3.27 allocated_host_list
A list of hosts allocated to the job. The list is colon (':') delimited, and
each entry may represent a range. For example, "12:14-31:36-63" implies that
the hosts allocated to the job were 12, 14 through 31, and 36 through 63.
All ranges are inclusive.
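Here is a minimal sketch of expanding such a value into individual host
numbers, assuming host identifiers are numeric as in the example above:

    # Minimal sketch (Python): expand an allocated_host_list value such as
    # "12:14-31:36-63" into the individual host numbers it denotes.
    # Entries are colon-delimited; each entry is a single host or an
    # inclusive range (section 3.27).
    def expand_host_list(host_list):
        hosts = []
        for entry in host_list.split(":"):
            if not entry:
                continue
            if "-" in entry:
                lo, hi = entry.split("-")
                hosts.extend(range(int(lo), int(hi) + 1))
            else:
                hosts.append(int(entry))
        return hosts

    # "12:14-31:36-63" -> 12, 14..31, 36..63, i.e. 1 + 18 + 28 = 47 hosts
    print(len(expand_host_list("12:14-31:36-63")))  # 47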
3.28 resource_manager_name
The name of the resource manager, if specified. In this trace, 23% of
entries reference "internal", and 77% reference "trinity".

3.29 reservation
The name of the reservation required by the job, if specified. Most entries
leave this empty, but for the rest, values fall in: DAT-marti, eotos,
eotos.61, langer-tos8, ldms.33, ldms-DAT, ldms-DAT.41, LDMS-DAT.41,
ldms-dat-prep.51, ldms_prep.59, nalu.64, nalu.65, nalu.66, rt113630,
rt113630.17, rt113630.28, slownodes.

3.30 job_message
Incorporates messages from many systems, including the resource manager,
scheduler, and administrative systems, for that job. In this trace, 8% of
events carry a job message.

3.31 completion_code
The job exit status/completion code. The most popular ones are 0, 1, 2, -3,
and 12. Assuming these are UNIX error codes (warning: they may not be!),
they could be interpreted as follows:

- 1, or EPERM: Operation not permitted
- 2, or ENOENT: No such file or directory
- 3, or ESRCH: No such process
- 12, or ENOMEM: Out of memory

3.32 effective_queue_duration
The amount of time, in seconds, that the job was eligible for scheduling.

4. Contact info
===============

Please direct inquiries, feedback, or suggestions regarding this trace to
the ATLAS mailing list at info@project-atlas.org

5. Usage
========

Please cite the following paper when you publish work that uses this trace:

George Amvrosiadis, Jun Woo Park, Gregory R. Ganger, Garth A. Gibson,
Elisabeth Baseman, Nathan DeBardeleben. "On the diversity of cluster
workloads and its impact on research results." In Proceedings of the 2018
USENIX Annual Technical Conference, Boston, MA, July 11-13, 2018.