OpenCloud Hadoop cluster trace: format and schema

Overview

This document describes the semantics, data format, and data sources of logs from a Hadoop cluster called OpenCloud. OpenCloud is a research cluster at Carnegie Mellon University (CMU) managed by the CMU Parallel Data Lab. It is open to CMU researchers (including faculty, post-docs, and graduate students) from all departments on campus. In the logs that we collected, the cluster was used by groups in areas that include computational astrophysics, computational biology, computational neurolinguistics, information retrieval and information classification, machine learning from the contents of the web, natural language processing, image and video analysis, malware analysis, social networking analysis, cloud computing systems development, cluster failure diagnosis, and several class projects related to information retrieval, data mining and distributed systems.

Each of the 64 nodes in this cluster has two quad-core 2.8 GHz CPUs (8 cores in total), 16 GB of RAM, a 10 Gbps Ethernet NIC, and four Seagate 7200 RPM SATA disk drives. The cluster ran Hadoop 0.20.1 throughout the data collection period.

We collected 20 months of logs, from May 2010 to December 2011. The logs contain 51,975 successful jobs, 4,614 failed jobs, and 1,762 killed jobs, submitted by 78 distinct users.

Data Source

We do not provide raw log files in our released data. Instead, we anonymized some data fields and stored the transformed logs as several CSV files. In this section, we explain the raw log files we used and how we anonymized some data fields. The parser that we used to transform the logs is based on the Hadoop 0.20.2 code base, which can be downloaded from the following link: http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/

Log Files

The raw logs we collected are mainly two types of logs gathered by the unmodified open source Hadoop framework: per-job configuration files, which record the Hadoop parameters each job ran with, and job history files, which record job-, task-, and attempt-level events and counters.

Anonymization

For job configuration files, we removed all user-defined configuration parameters (key-value pairs not defined by Hadoop, typically set by user code) to ensure that no user-specific information is released. We also anonymized the following Hadoop-defined data fields present in the final logs:

There are 78 users in total. Each username is mapped to a unique number in [0..77].

If a configuration value needs to be anonymized, the value is processed as follows.

When anonymizing a file system path, we transform each path component while maintaining the tree depth of the directory, as illustrated in the sketch below.
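
For illustration, here is a minimal Python sketch of the two anonymization steps above (sequential username numbering and a depth-preserving path transform). The per-component hashing is an assumed scheme for illustration only; it is not the exact transform our parser applied.

    import hashlib

    # Hypothetical sketch, not the actual anonymization code.
    user_ids = {}

    def anon_user(username):
        # Map each distinct username to the next unused integer in [0..77].
        return user_ids.setdefault(username, len(user_ids))

    def anon_path(path):
        # Replace every path component with an opaque token (here a short
        # hash, an assumed scheme) so the directory tree depth is preserved.
        parts = path.strip("/").split("/")
        return "/" + "/".join(hashlib.md5(p.encode()).hexdigest()[:8]
                              for p in parts)

    print(anon_user("alice"), anon_user("bob"), anon_user("alice"))  # 0 1 0
    print(anon_path("/user/alice/input/part-00000"))  # depth 4 is kept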

Schema

The logs are partitioned by month, so there are 20 folders, each containing information about all jobs submitted in one month. Each folder contains 6 CSV files, whose schemas are described below.

Note: jtid is the JobTracker ID generated at each boot of the JobTracker, and jobid is an ID assigned by the JobTracker to each job. Together, jtid and jobid uniquely identify a MapReduce job.
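
For example, when loading the trace, the (jtid, jobid) pair can serve directly as a job key. A minimal Python sketch; the top-level folder name "trace" is an assumption, and the CSV files are assumed to carry header rows matching the field names below:

    import csv
    import glob
    import os

    jobs = {}
    for path in glob.glob(os.path.join("trace", "*", "job.csv")):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                key = (int(row["jtid"]), int(row["jobid"]))  # globally unique
                jobs[key] = row

    print(len(jobs), "distinct jobs loaded")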

Job Configuration

Table Job configuration (conf.csv)
Field name Data type Description
jtid int8
jobid int4
keyname varchar name of the configuration parameter
value varchar value of the configuration parameter
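
For example, the key-value rows can be folded back into one configuration dictionary per job. A sketch, under the same header-row assumption as above:

    import csv
    from collections import defaultdict

    conf = defaultdict(dict)
    with open("conf.csv", newline="") as f:
        for row in csv.DictReader(f):
            key = (int(row["jtid"]), int(row["jobid"]))
            conf[key][row["keyname"]] = row["value"]

    # e.g. how many reduce tasks each job requested (a standard 0.20-era key):
    for key, params in conf.items():
        print(key, params.get("mapred.reduce.tasks"))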

Job History

Table Job history (job.csv)
Field name Data type Description
jtid int8
jobid int4
submitTime int8 UTC timestamp in milliseconds
launchTime int8 UTC timestamp in milliseconds
finishTime int8 UTC timestamp in milliseconds
status tinyint SUCCESS: 0, FAILED: 1, KILLED: 2
numMaps int4 total number of map tasks in the job
numReduces int4 total number of reduce tasks in the job
finMaps int4 number of completed map tasks in the job
finReduces int4 number of completed reduce tasks in the job
failMaps int4 number of failed map tasks
failReduces int4 number of failed reduce tasks
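
From these fields one can derive, for example, a job's queueing delay (launchTime - submitTime) and running time (finishTime - launchTime). A minimal sketch, assuming a header row and the millisecond UTC timestamps documented above:

    import csv

    with open("job.csv", newline="") as f:
        for row in csv.DictReader(f):
            wait_ms = int(row["launchTime"]) - int(row["submitTime"])
            run_ms = int(row["finishTime"]) - int(row["launchTime"])
            ok = int(row["status"]) == 0  # 0 = SUCCESS
            print(row["jtid"], row["jobid"], wait_ms, run_ms, ok)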

Table Task history (task.csv)
Field name Data type Description
jtid int8
jobid int4
tasktype char either 'm' (map) or 'r' (reduce)
taskid int4
startTime int8 UTC timestamp in milliseconds
finishTime int8 UTC timestamp in milliseconds
status tinyint SUCCESS: 0, FAILED: 1, KILLED: 2
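
A sketch of a typical use of this table: median runtime of successful tasks, split by task type (same header-row assumption as above):

    import csv
    from collections import defaultdict
    from statistics import median

    runtimes = defaultdict(list)
    with open("task.csv", newline="") as f:
        for row in csv.DictReader(f):
            if int(row["status"]) == 0:  # successful tasks only
                runtimes[row["tasktype"]].append(
                    int(row["finishTime"]) - int(row["startTime"]))

    for tasktype, values in runtimes.items():
        print(tasktype, median(values) / 1000.0, "s (median)")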

Table Task Attempt history (attempt.csv)
Field name Data type Description
jtid int8
jobid int4
tasktype char either 'm' (map) or 'r' (reduce)
taskid int4
attempt smallint
startTime int8 UTC timestamp in milliseconds
shuffleTime int8 UTC timestamp in milliseconds at which the shuffle phase finished. NULL for map tasks
sortTime int8 UTC timestamp in milliseconds at which the sort phase finished. NULL for map tasks
finishTime int8 UTC timestamp in milliseconds
status tinyint SUCCESS: 0, FAILED: 1, KILLED: 2
rack varchar identifier of the rack that the host running the task attempt belongs to
hostname varchar name of the host that the attempt ran on
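
For successful reduce attempts, the four timestamps partition the attempt into its shuffle, sort, and reduce phases (startTime <= shuffleTime <= sortTime <= finishTime). A sketch, under the same header-row assumption:

    import csv

    with open("attempt.csv", newline="") as f:
        for row in csv.DictReader(f):
            if row["tasktype"] != "r" or int(row["status"]) != 0:
                continue
            shuffle_ms = int(row["shuffleTime"]) - int(row["startTime"])
            sort_ms = int(row["sortTime"]) - int(row["shuffleTime"])
            reduce_ms = int(row["finishTime"]) - int(row["sortTime"])
            print(row["taskid"], row["attempt"],
                  shuffle_ms, sort_ms, reduce_ms)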

Table Counter History (counter.csv)
Field name Data type Description
jtid int8
jobid int4
tasktype char 'm' or 'r'; '?' for job-wide counters
taskid int4 -1 for job-wide counters
attempt smallint -1 for task-wide counters
countergroup varchar Example: Task, JobInProgress, FileSystemCounters
counter varchar Example: REDUCE_INPUT_GROUPS, FILE_BYTES_READ
value int8
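
Because the same counter can appear at the attempt, task, and job level, an analysis should read a single level to avoid double counting. A sketch that reads per-job FILE_BYTES_READ from the job-wide rows only (taskid = -1), under the same header-row assumption:

    import csv

    totals = {}
    with open("counter.csv", newline="") as f:
        for row in csv.DictReader(f):
            if row["counter"] == "FILE_BYTES_READ" and int(row["taskid"]) == -1:
                totals[(row["jtid"], row["jobid"])] = int(row["value"])

    for key, total in totals.items():
        print(key, total, "bytes")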

Table Split History (split.csv)
Field name Data type Description
jtid int8
jobid int4
tasktype char always 'm', since input splits exist only for map tasks
taskid int4
splitid smallint
splitHost varchar name of a host that stores the input data for the (map) task
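
One natural use of this table is measuring data locality: joining split.csv with attempt.csv on (jtid, jobid, taskid) shows how often a map attempt ran on a host holding its input. A sketch, assuming host names are recorded identically in both files:

    import csv
    from collections import defaultdict

    split_hosts = defaultdict(set)
    with open("split.csv", newline="") as f:
        for row in csv.DictReader(f):
            key = (row["jtid"], row["jobid"], row["taskid"])
            split_hosts[key].add(row["splitHost"])

    local = total = 0
    with open("attempt.csv", newline="") as f:
        for row in csv.DictReader(f):
            if row["tasktype"] != "m" or int(row["status"]) != 0:
                continue
            total += 1
            key = (row["jtid"], row["jobid"], row["taskid"])
            if row["hostname"] in split_hosts[key]:
                local += 1

    print("node-local map attempts: %.1f%%" % (100.0 * local / max(total, 1)))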