OpenCloud Hadoop cluster trace: format and schema

Overview

This document describes the semantics, data format, and data sources of logs from a Hadoop cluster called OpenCloud. OpenCloud is a research cluster at Carnegie Mellon University (CMU) managed by the CMU Parallel Data Lab. It is open to CMU researchers (including faculty, post-docs, and graduate students) from all departments on campus. In the logs that we collected, the cluster was used by groups in areas that include computational astrophysics, computational biology, computational neurolinguistics, information retrieval and information classification, machine learning from the contents of the web, natural language processing, image and video analysis, malware analysis, social networking analysis, cloud computing systems development, cluster failure diagnosis, and several class projects related to information retrieval, data mining and distributed systems.

Each of the 64 nodes in this cluster has two quad-core 2.8 GHz CPUs (8 cores in total), 16 GB of RAM, a 10 Gbps Ethernet NIC, and four Seagate 7200 RPM SATA disk drives. The cluster ran Hadoop 0.20.1 throughout the data collection period.

We collected 20 months of logs, from May 2010 to December 2011. The logs contain 51,975 successful jobs, 4,614 failed jobs, and 1,762 killed jobs, submitted by 78 distinct users.

Data Source

We do not provide raw log files in our released data. Instead, we anonymized some data fields and stored the transformed logs as several CSV files. In this section, we explain the raw log files we used and how we anonymized some data fields. The parser that we used to transform the logs is based on the Hadoop 0.20.2 code base, which can be downloaded from the following link: http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/

Log Files

The raw logs we collected are mainly two types of logs gathered by the unmodified open source Hadoop framework: per-job configuration files, which record the Hadoop parameters each job ran with, and job history files, which record job-, task-, and attempt-level events and counters.

Anonymization

For job configuration files, we removed all user-defined configuration parameters (key-value pairs not defined by Hadoop, typically set by user code) to ensure that no user-specific information is released. We also anonymized the following Hadoop-defined data fields present in the final logs:

There are 78 users in total. Each username is mapped to a unique number in [0..77].

If a configuration value needs to be anonymized, the value is processed as follows.

When anonymizing a file system path, we transform each path component while maintaining the tree depth of the directory, as illustrated in the sketch below.
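
For illustration, here is a minimal Python sketch of the two anonymization steps above (sequential username numbering and a depth-preserving path transform). The per-component hashing is an assumed scheme for illustration only; it is not the exact transform our parser applied.

    import hashlib

    # Hypothetical sketch, not the actual anonymization code.
    user_ids = {}

    def anon_user(username):
        # Map each distinct username to the next unused integer in [0..77].
        return user_ids.setdefault(username, len(user_ids))

    def anon_path(path):
        # Replace every path component with an opaque token (here a short
        # hash, an assumed scheme) so the directory tree depth is preserved.
        parts = path.strip("/").split("/")
        return "/" + "/".join(hashlib.md5(p.encode()).hexdigest()[:8]
                              for p in parts)

    print(anon_user("alice"), anon_user("bob"), anon_user("alice"))  # 0 1 0
    print(anon_path("/user/alice/input/part-00000"))  # depth 4 is kept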

Schema

The logs are partitioned by month, so there are 20 folders, each containing information about all jobs submitted in one month. Each folder contains 6 CSV files, whose schemas are described below.

Note: jtid is the JobTracker ID generated at each boot of the JobTracker, and jobid is an ID assigned by the JobTracker to each job. Together, jtid and jobid uniquely identify a MapReduce job.
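
For example, when loading the trace, the (jtid, jobid) pair can serve directly as a job key. A minimal Python sketch; the top-level folder name "trace" is an assumption, and the CSV files are assumed to carry header rows matching the field names below:

    import csv
    import glob
    import os

    jobs = {}
    for path in glob.glob(os.path.join("trace", "*", "job.csv")):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                key = (int(row["jtid"]), int(row["jobid"]))  # globally unique
                jobs[key] = row

    print(len(jobs), "distinct jobs loaded")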

Job Configuration

Table Job configuration (conf.csv)
Field name Data type Description
jtid int8
jobid int4
keyname varchar name of the configuration parameter
value varchar value of the configuration parameter
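
For example, the key-value rows can be folded back into one configuration dictionary per job. A sketch, under the same header-row assumption as above:

    import csv
    from collections import defaultdict

    conf = defaultdict(dict)
    with open("conf.csv", newline="") as f:
        for row in csv.DictReader(f):
            key = (int(row["jtid"]), int(row["jobid"]))
            conf[key][row["keyname"]] = row["value"]

    # e.g. how many reduce tasks each job requested (a standard 0.20-era key):
    for key, params in conf.items():
        print(key, params.get("mapred.reduce.tasks"))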

Job History

Table Job history (job.csv)
Field name Data type Description
jtid int8
jobid int4
submitTime int8 UTC timestamp in milliseconds
launchTime int8 UTC timestamp in milliseconds
finishTime int8 UTC timestamp in milliseconds
status tinyint SUCCESS: 0, FAILED: 1, KILLED: 2
numMaps int4 total number of map tasks in the job
numReduces int4 total number of reduce tasks in the job
finMaps int4 number of completed map tasks in the job
finReduces int4 number of completed reduce tasks in the job
failMaps int4 number of failed map tasks
failReduces int4 number of failed reduce tasks
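
From these fields one can derive, for example, a job's queueing delay (launchTime - submitTime) and running time (finishTime - launchTime). A minimal sketch, assuming a header row and the millisecond UTC timestamps documented above:

    import csv

    with open("job.csv", newline="") as f:
        for row in csv.DictReader(f):
            wait_ms = int(row["launchTime"]) - int(row["submitTime"])
            run_ms = int(row["finishTime"]) - int(row["launchTime"])
            ok = int(row["status"]) == 0  # 0 = SUCCESS
            print(row["jtid"], row["jobid"], wait_ms, run_ms, ok)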

Table Task history (task.csv)
Field name Data type Description
jtid int8
jobid int4
tasktype char either 'm' (map) or 'r' (reduce)
taskid int4
startTime int8 UTC timestamp in milliseconds
finishTime int8 UTC timestamp in milliseconds
status tinyint SUCCESS: 0, FAILED: 1, KILLED: 2
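
A sketch of a typical use of this table: median runtime of successful tasks, split by task type (same header-row assumption as above):

    import csv
    from collections import defaultdict
    from statistics import median

    runtimes = defaultdict(list)
    with open("task.csv", newline="") as f:
        for row in csv.DictReader(f):
            if int(row["status"]) == 0:  # successful tasks only
                runtimes[row["tasktype"]].append(
                    int(row["finishTime"]) - int(row["startTime"]))

    for tasktype, values in runtimes.items():
        print(tasktype, median(values) / 1000.0, "s (median)")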

Table Task Attempt history (attempt.csv)
Field name Data type Description
jtid int8
jobid int4
tasktype char either 'm' (map) or 'r' (reduce)
taskid int4
attempt smallint
startTime int8 UTC timestamp in milliseconds
shuffleTime int8 UTC timestamp in milliseconds at which the shuffle phase finished. NULL for map tasks
sortTime int8 UTC timestamp in milliseconds at which the sort phase finished. NULL for map tasks
finishTime int8 UTC timestamp in milliseconds
status tinyint SUCCESS: 0, FAILED: 1, KILLED: 2
rack varchar identifier of the rack that the host running the task attempt belongs to
hostname varchar name of the host that the attempt ran on
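
For successful reduce attempts, the four timestamps partition the attempt into its shuffle, sort, and reduce phases (startTime <= shuffleTime <= sortTime <= finishTime). A sketch, under the same header-row assumption:

    import csv

    with open("attempt.csv", newline="") as f:
        for row in csv.DictReader(f):
            if row["tasktype"] != "r" or int(row["status"]) != 0:
                continue
            shuffle_ms = int(row["shuffleTime"]) - int(row["startTime"])
            sort_ms = int(row["sortTime"]) - int(row["shuffleTime"])
            reduce_ms = int(row["finishTime"]) - int(row["sortTime"])
            print(row["taskid"], row["attempt"],
                  shuffle_ms, sort_ms, reduce_ms)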

Table Counter History (counter.csv)
Field name Data type Description
jtid int8
jobid int4
tasktype char 'm' or 'r'; '?' for job-wide counters
taskid int4 -1 for job-wide counters
attempt smallint -1 for task-wide counters
countergroup varchar Example: Task, JobInProgress, FileSystemCounters
counter varchar Example: REDUCE_INPUT_GROUPS, FILE_BYTES_READ
value int8
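
Because the same counter can appear at the attempt, task, and job level, an analysis should read a single level to avoid double counting. A sketch that reads per-job FILE_BYTES_READ from the job-wide rows only (taskid = -1), under the same header-row assumption:

    import csv

    totals = {}
    with open("counter.csv", newline="") as f:
        for row in csv.DictReader(f):
            if row["counter"] == "FILE_BYTES_READ" and int(row["taskid"]) == -1:
                totals[(row["jtid"], row["jobid"])] = int(row["value"])

    for key, total in totals.items():
        print(key, total, "bytes")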

Table Split History (split.csv)
Field name Data type Description
jtid int8
jobid int4
tasktype char always 'm', since input splits exist only for map tasks
taskid int4
splitid smallint
splitHost varchar name of a host that stores the input data for the (map) task
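
One natural use of this table is measuring data locality: joining split.csv with attempt.csv on (jtid, jobid, taskid) shows how often a map attempt ran on a host holding its input. A sketch, assuming host names are recorded identically in both files:

    import csv
    from collections import defaultdict

    split_hosts = defaultdict(set)
    with open("split.csv", newline="") as f:
        for row in csv.DictReader(f):
            key = (row["jtid"], row["jobid"], row["taskid"])
            split_hosts[key].add(row["splitHost"])

    local = total = 0
    with open("attempt.csv", newline="") as f:
        for row in csv.DictReader(f):
            if row["tasktype"] != "m" or int(row["status"]) != 0:
                continue
            total += 1
            key = (row["jtid"], row["jobid"], row["taskid"])
            if row["hostname"] in split_hosts[key]:
                local += 1

    print("node-local map attempts: %.1f%%" % (100.0 * local / max(total, 1)))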