###########################################
# LANL Trinity (formatted) trace FAQ      #
# version 1.0 Beta                        #
###########################################

1. Trace description
====================

As of 2018, Trinity is the largest supercomputer at the Los Alamos National
Laboratory (LANL) and is used for capability computing. Capability clusters
are large-scale, high-demand resources that introduce novel hardware
technologies in order to achieve crucial computing milestones, such as
higher-resolution climate and astrophysics models. Trinity's hardware was
stood up in two pre-production phases before being put into full production
use, and our trace was collected before the second phase completed. At the
time of data collection, Trinity consisted of 9408 identical compute nodes,
for a total of 301056 Intel Xeon E5-2698v3 2.3GHz cores and 1.2PB of RAM,
making this the largest cluster with a publicly available trace in terms of
CPU cores.

This Trinity dataset covers 3 months of the machine's operation, from
February to April 2016. During that time, Trinity was operating in
OpenScience mode, i.e., the machine was undergoing beta testing and was
available to a larger number of users than it is expected to have after it
receives its final security classification. We note that OpenScience
workloads are representative of a capability supercomputer's workload, as
they occur roughly every 18 months, when a new machine is introduced or
before an older one is decommissioned.

The dataset consists of 25237 multi-node jobs issued by 88 users and
collected by MOAB. Collected data include: timestamps for job stages from
submission to termination, job properties such as size and owner, the job's
return status, and a time budget per job that, if exceeded, causes the job
to be killed. This dataset is a processed variant of the raw Trinity
dataset: job events have been processed to produce a job-level view of the
trace rather than an event-level view.

2. Scheduling policy
====================

The general scheduling policy is strongly fair-share dominated; however,
backfill is used even if the fair-share allocation is negative (i.e., the
user has exceeded their quota). Because MOAB knows the earliest time that
the highest-priority job can start, and which resources it will need, it
can also determine which jobs can be started without delaying it. The
backfill feature allows the scheduler to start other, lower-priority jobs
so long as they do not delay the highest-priority job (a short sketch of
this rule appears at the end of this section).

The scheduler policy for Trinity was the same in periods of ~6 months at a
time, corresponding to different "campaigns". It remained fair-share
dominated, but the individual project weightings changed for each
"campaign" period. Each campaign is expected to consist of the following
phases:

- Input and dataset validation: jobs that use a small number of nodes
  relative to the dataset size
- Problem partitioning: grid sizing
- Parametric studies: determining the sweet spot for job sizes, node
  counts, processes per node, memory per process, etc.
- Computation
- Analysis and visualization: jobs that are primarily I/O-bound, implying
  different node use patterns

Jobs that belong to an individual user can get cancelled in batches, where
the timestamp and duration match across all the cancelled jobs. This is
usually because job scripts allow users to specify multiple jobs linked
together by explicitly stating their dependencies in the script. If one job
fails or gets cancelled, the rest are automatically cancelled.
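Below is a minimal sketch of that backfill rule, assuming a simplified
model of MOAB's behavior: the highest-priority queued job reserves the
earliest point at which enough nodes become free, and a lower-priority job
may start immediately only if it cannot delay that reservation. The Job
fields mirror the trace fields described in section 3; everything else
(function and variable names, the in-memory job model) is hypothetical and
not part of MOAB's actual interface.

    from dataclasses import dataclass

    @dataclass
    class Job:
        node_count: int       # nodes requested (section 3.10)
        wallclock_limit: int  # maximum runtime in seconds (section 3.8)
        start_time: int = 0   # set once the job is running

    def backfill_candidates(queue, running, free_nodes, now):
        """Return queued jobs that can start now without delaying the
        highest-priority job, queue[0]."""
        head = queue[0]
        if head.node_count <= free_nodes:
            return [head]  # the highest-priority job can start right away

        # Find the "shadow time": the point at which enough running jobs
        # will have ended (at worst, at start_time + wallclock_limit) for
        # the head job to fit.
        avail = free_nodes
        shadow_time, spare_nodes = None, 0
        for job in sorted(running, key=lambda j: j.start_time + j.wallclock_limit):
            avail += job.node_count
            if avail >= head.node_count:
                shadow_time = job.start_time + job.wallclock_limit
                spare_nodes = avail - head.node_count  # nodes the head will not need
                break
        if shadow_time is None:
            return []  # the head job can never fit; do not backfill around it

        started = []
        for job in queue[1:]:
            if job.node_count > free_nodes:
                continue  # not enough idle nodes right now
            finishes_in_time = now + job.wallclock_limit <= shadow_time
            fits_beside_head = job.node_count <= spare_nodes
            if finishes_in_time or fits_beside_head:
                started.append(job)  # backfilled without delaying the head job
                free_nodes -= job.node_count
                if not finishes_in_time:
                    spare_nodes -= job.node_count  # still held at the shadow time
        return started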
3. Fields
=========

The dataset comprises numerous fields that follow the Workload Event
Format, documented at
http://docs.adaptivecomputing.com/9-0-3/MWM/Content/topics/moabWorkloadManager/topics/analyzing/workloadtrace.html.
Some fields that are not populated in the original trace have been
removed. Classes and Event Types have been renamed to uppercase format to
represent categorical types. MOAB uses hyphens ("-") to indicate fields
with no values; these have been replaced with empty strings (""). Temporal
data was originally in epoch form, but has since been converted to a
datetime format.

3.1 user_ID

An identifier of the user submitting the job. There are 88 users in the
trace, and user IDs have been anonymized using random numbers.

3.2 group_ID

An identifier of the primary group of the user submitting the job. There
are 89 groups in the trace, with each user belonging to a distinct group
and some users switching their group partway through the trace.

3.3 submit_time

The time when the job was submitted to the scheduler's queue. The Trinity
trace covers the period from February 2, 2016 until April 21, 2016.

3.4 start_time

The time when the job began executing. The Trinity trace covers the period
from February 2, 2016 until April 21, 2016.

3.5 dispatch_time

The time when the scheduler requested that the job begin executing. The
Trinity trace covers the period from February 2, 2016 until April 21, 2016.

3.6 queue_time

The time when the job met all fairness policies.

3.7 end_time

The time when the job completed executing. The Trinity trace covers the
period from February 2, 2016 until April 21, 2016.

3.8 wallclock_limit

The maximum allowed job duration in seconds. It typically ranges from 1 to
2160 minutes (i.e., 36 hours, which is the maximum wall clock time for
Trinity). There can be outliers of jobs with longer wallclock limits,
however.

3.9 job_status

The type of scheduling event. The following types are possible:

- JOBSTART: the first event of a job recorded by the scheduler
- JOBEND: the last event of a job that has terminated successfully, as
  recorded by the scheduler
- JOBFAIL: the last event of a job that has terminated unsuccessfully, as
  recorded by the scheduler
- JOBCANCEL: the last event of a job that a user has terminated
  forcefully, as recorded by the scheduler

3.10 node_count

The number of nodes requested by the job from the scheduler. If a job is
started, it can be assumed that this request has been fulfilled. Since
this is a homogeneous cluster, and each node consists of 32 identical CPU
cores, this field can be used to derive the total number of cores assigned
to the job by the scheduler (see the usage sketch at the end of this
section). A zero value indicates that the requested number of nodes was
not specified by the user.

3.11 tasks_requested

The number of tasks requested, where each task corresponds to a physical
CPU core. This may not be equal to the number of actual cores assigned to
a job, as jobs at LANL are assigned entire physical nodes. To calculate
the total number of physical CPU cores assigned to the job, see section
3.10.
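As a brief usage sketch, the snippet below loads the formatted trace and
derives the per-job core count and queueing delay from the fields above.
It assumes the trace is available as a CSV file with the documented column
names; the file name "trinity_formatted.csv" is a placeholder, and pandas
is simply one convenient way to read the data.

    import pandas as pd

    CORES_PER_NODE = 32  # Trinity nodes have 32 identical cores (section 3.10)

    df = pd.read_csv("trinity_formatted.csv",
                     parse_dates=["submit_time", "start_time", "dispatch_time",
                                  "queue_time", "end_time"])

    # Total physical cores assigned to each job (whole nodes are allocated).
    df["cores_assigned"] = df["node_count"] * CORES_PER_NODE

    # Time spent waiting between submission and start, in seconds.
    df["wait_seconds"] = (df["start_time"] - df["submit_time"]).dt.total_seconds()

    # Keep only jobs that terminated successfully.
    completed = df[df["job_status"] == "JOBEND"]
    print(completed[["node_count", "cores_assigned", "wait_seconds"]].describe())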
4. Contact info
===============

Please direct inquiries, feedback, or suggestions regarding this trace to
the ATLAS mailing list at info@project-atlas.org

5. Usage
========

Please cite the following paper when you publish work that uses this
trace:

George Amvrosiadis, Jun Woo Park, Gregory R. Ganger, Garth A. Gibson,
Elisabeth Baseman, Nathan DeBardeleben. "On the diversity of cluster
workloads and its impact on research results." In Proceedings of the 2018
USENIX Annual Technical Conference, Boston, MA, July 11-13, 2018.