This document describes the semantics, data format, and data sources of logs from a Hadoop cluster called OpenCloud. OpenCloud is a research cluster at Carnegie Mellon University (CMU) managed by the CMU Parallel Data Lab. It is open to CMU researchers (including faculty, post-docs, and graduate students) from all departments on campus. In the logs that we collected, the cluster was used by groups in areas that include computational astrophysics, computational biology, computational neurolinguistics, information retrieval and information classification, machine learning from the contents of the web, natural language processing, image and video analysis, malware analysis, social networking analysis, cloud computing systems development, cluster failure diagnosis, and several class projects related to information retrieval, data mining and distributed systems.
Each of the 64 nodes in this cluster has two quad-core 2.8 GHz CPUs (8 cores total), 16 GB of RAM, a 10 Gbps Ethernet NIC, and four Seagate 7200 RPM SATA disk drives. The cluster ran Hadoop 0.20.1 during the entire data collection period.
We collected 20 months of logs, from May 2010 to December 2011. The logs contain 51,975 successful jobs, 4,614 failed jobs, and 1,762 killed jobs. In total, 78 users submitted jobs during this period.
We do not provide raw log files in our released data. Instead, we anonymized some data fields and stored the transformed logs as several CSV files. In this section, we explain the raw log files we used and how we anonymized some data fields. The parser that we used to transform the logs is based on the Hadoop 0.20.2 code base, which can be downloaded from the following link: http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/
The raw logs we collected are mainly two types of logs gathered by the unmodified open-source Hadoop framework: job configuration files and job history files.
For job configuration files, we removed all user-defined configuration parameters (key-value pairs not defined by Hadoop, presumably used by user code) to ensure that no user-specific information is released. We also anonymized the following Hadoop-defined data fields present in the final logs:
There are 78 users in total; each username is mapped to a unique number in [0..77].
If a configuration value needs to be anonymized, it is processed as follows.
When anonymizing a file system path, we do the following, preserving the directory tree depth of the original path.
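The exact replacement rules are implemented in our parser and are omitted here. As a rough illustration only, the Python sketch below (not the parser code) assumes that usernames are numbered in first-seen order and that each path component is replaced by an opaque token, so that only the directory tree depth is preserved; all names in the sketch are hypothetical.

```python
# Illustrative sketch only; not the actual parser code. Assumptions:
# usernames are numbered in first-seen order, and each path component is
# replaced by an opaque token so that only the directory tree depth of the
# original path is preserved.

user_ids = {}      # username -> integer in [0..77]
path_tokens = {}   # path component -> opaque token index

def anonymize_user(username):
    """Map a username to a stable integer identifier (first-seen order)."""
    return user_ids.setdefault(username, len(user_ids))

def anonymize_path(path):
    """Replace every path component with a token, keeping the tree depth."""
    parts = [p for p in path.split("/") if p]
    return "/" + "/".join(
        f"d{path_tokens.setdefault(p, len(path_tokens))}" for p in parts
    )

print(anonymize_user("alice"))                          # 0
print(anonymize_path("/user/alice/input/part-00000"))   # /d0/d1/d2/d3
```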
The logs are partitioned by month, so there are 20 folders, each containing the information for all jobs submitted in one month. Each folder contains six CSV files, whose schemas are described below.
Note: jtid is the JobTracker ID, generated at each boot of the JobTracker, and jobid is an ID assigned by the JobTracker to each job. The combination of jtid and jobid uniquely identifies a MapReduce job.
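For example, job records can be joined with their configuration parameters on this composite key. The sketch below uses pandas; the folder name "2010-05" is an assumed example, and it is assumed that the CSV files include header rows with the field names listed in the tables below (otherwise, pass the field names via the names= argument of read_csv).

```python
import pandas as pd

# "2010-05" is an assumed example folder name.
jobs = pd.read_csv("2010-05/job.csv")
conf = pd.read_csv("2010-05/conf.csv")

# (jtid, jobid) uniquely identifies a MapReduce job, so it is the join key.
job_conf = conf.merge(jobs, on=["jtid", "jobid"], how="inner")

# e.g. number of recorded configuration key/value pairs per job
print(job_conf.groupby(["jtid", "jobid"]).size().head())
```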
Table: Job configuration (conf.csv)

Field name | Data type | Description
---|---|---
jtid | int8 |
jobid | int4 |
keyname | varchar | name of the configuration parameter
value | varchar | value of the configuration parameter
Table: Job history (job.csv)

Field name | Data type | Description
---|---|---
jtid | int8 |
jobid | int4 |
submitTime | int8 | UTC timestamp in milliseconds
launchTime | int8 | UTC timestamp in milliseconds
finishTime | int8 | UTC timestamp in milliseconds
status | tinyint | SUCCESS: 0, FAILED: 1, KILLED: 2
numMaps | int4 | total number of map tasks in the job
numReduces | int4 | total number of reduce tasks in the job
finMaps | int4 | number of completed map tasks in the job
finReduces | int4 | number of completed reduce tasks in the job
failMaps | int4 | number of failed map tasks
failReduces | int4 | number of failed reduce tasks
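As a usage sketch (under the same folder-name and header-row assumptions as above), job-level queueing and running times can be derived from the three timestamps, and the status codes can be mapped back to their names:

```python
import pandas as pd

jobs = pd.read_csv("2010-05/job.csv")   # "2010-05" is an assumed folder name

status_names = {0: "SUCCESS", 1: "FAILED", 2: "KILLED"}
jobs["status_name"] = jobs["status"].map(status_names)

# Timestamps are UTC milliseconds, so differences are durations in ms.
jobs["queue_seconds"] = (jobs["launchTime"] - jobs["submitTime"]) / 1000.0
jobs["run_seconds"] = (jobs["finishTime"] - jobs["launchTime"]) / 1000.0

print(jobs["status_name"].value_counts())
print(jobs.groupby("status_name")["run_seconds"].median())
```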
Table: Task history (task.csv)

Field name | Data type | Description
---|---|---
jtid | int8 |
jobid | int4 |
tasktype | char | either 'm' or 'r'
taskid | int4 |
startTime | int8 | UTC timestamp in milliseconds
finishTime | int8 | UTC timestamp in milliseconds
status | tinyint | SUCCESS: 0, FAILED: 1, KILLED: 2
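Similarly, per-task durations can be computed from startTime and finishTime; the sketch below (same assumptions as above) compares map and reduce task durations:

```python
import pandas as pd

tasks = pd.read_csv("2010-05/task.csv")   # assumed folder name

tasks["duration_s"] = (tasks["finishTime"] - tasks["startTime"]) / 1000.0
successful = tasks[tasks["status"] == 0]   # keep SUCCESS tasks only

# tasktype is 'm' for map tasks and 'r' for reduce tasks.
print(successful.groupby("tasktype")["duration_s"].describe())
```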
Table: Task attempt history (attempt.csv)

Field name | Data type | Description
---|---|---
jtid | int8 |
jobid | int4 |
tasktype | char | either 'm' or 'r'
taskid | int4 |
attempt | smallint |
startTime | int8 | UTC timestamp in milliseconds
shuffleTime | int8 | UTC timestamp in milliseconds; NULL for map tasks
sortTime | int8 | UTC timestamp in milliseconds; NULL for map tasks
finishTime | int8 | UTC timestamp in milliseconds
status | tinyint | SUCCESS: 0, FAILED: 1, KILLED: 2
rack | varchar | identifier of the rack to which the host running the task attempt belongs
hostname | varchar | name of the host on which the attempt ran
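For reduce attempts, the phase timestamps allow a rough shuffle/sort/reduce breakdown. The sketch below assumes that shuffleTime and sortTime mark the ends of the shuffle and sort phases (as suggested by the field names); folder naming and header rows are assumed as above:

```python
import pandas as pd

att = pd.read_csv("2010-05/attempt.csv")   # assumed folder name

# Keep successful reduce attempts; shuffleTime/sortTime are NULL for maps.
red = att[(att["tasktype"] == "r") & (att["status"] == 0)].copy()

# Coerce to numbers in case NULLs were written as literal strings.
for col in ["startTime", "shuffleTime", "sortTime", "finishTime"]:
    red[col] = pd.to_numeric(red[col], errors="coerce")

red["shuffle_s"] = (red["shuffleTime"] - red["startTime"]) / 1000.0
red["sort_s"] = (red["sortTime"] - red["shuffleTime"]) / 1000.0
red["reduce_s"] = (red["finishTime"] - red["sortTime"]) / 1000.0

print(red[["shuffle_s", "sort_s", "reduce_s"]].median())
```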
Table: Counter history (counter.csv)

Field name | Data type | Description
---|---|---
jtid | int8 |
jobid | int4 |
tasktype | char | 'm' or 'r'; '?' for job-wide counters
taskid | int4 | -1 for job-wide counters
attempt | smallint | -1 for task-wide counters
countergroup | varchar | Example: Task, JobInProgress, FileSystemCounters
counter | varchar | Example: REDUCE_INPUT_GROUPS, FILE_BYTES_READ
value | int8 | counter value
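Job-wide counters can be selected with taskid == -1 and pivoted into one row per job. In the sketch below, HDFS_BYTES_READ is a standard Hadoop counter used only as an example; folder naming and header rows are assumed as above:

```python
import pandas as pd

counters = pd.read_csv("2010-05/counter.csv")   # assumed folder name

# Job-wide counters are marked with taskid == -1 (and tasktype == '?').
jobwide = counters[counters["taskid"] == -1]

per_job = jobwide.pivot_table(index=["jtid", "jobid"],
                              columns="counter",
                              values="value",
                              aggfunc="sum")

if "HDFS_BYTES_READ" in per_job.columns:
    print(per_job["HDFS_BYTES_READ"].head())
```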
Table: Split history (split.csv)

Field name | Data type | Description
---|---|---
jtid | int8 |
jobid | int4 |
tasktype | char | always 'm' (input splits apply only to map tasks)
taskid | int4 |
splitid | smallint |
splitHost | varchar | name of a host that holds the input data for the (map) task
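Combining split.csv with attempt.csv gives a rough measure of map data locality: an attempt is node-local if it ran on one of the hosts holding its input split. The sketch below makes the same folder-name and header-row assumptions as above:

```python
import pandas as pd

splits = pd.read_csv("2010-05/split.csv")     # assumed folder name
att = pd.read_csv("2010-05/attempt.csv")
maps = att[att["tasktype"] == "m"]

# One split.csv row per (task, replica host); join on the task identity.
merged = maps.merge(splits, on=["jtid", "jobid", "tasktype", "taskid"],
                    how="inner")
merged["node_local"] = merged["hostname"] == merged["splitHost"]

# A map attempt is node-local if it ran on any host holding its input split.
local = merged.groupby(["jtid", "jobid", "taskid", "attempt"])["node_local"].any()
print("fraction of node-local map attempts:", local.mean())
```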