Meta Tectonic Bulk Storage Traces (Baleen)

These traces are used in the paper Baleen: ML Admission & Prefetching for Flash Caches. We ask that academic works using these traces cite Baleen [1] and, if appropriate, CacheLib [2] and Tectonic [3].

These traces are anonymized Meta production workloads from a flash cache (CacheLib) in front of a bulk storage system (Tectonic). These traces are licensed under the same terms as CacheLib (Apache 2.0).

Quick start

Want to look at a trace right away? Look at Region1 or download storage_0.1.tar.gz or storage.tar.gz.

Contents

Filename | Description | Size (compressed) | Size (uncompressed)
storage/ | 0.1% and 1% traces (10 samples each) | - | 8.1 GB
storage_0.1.tar.gz | 0.1% traces (1 sample each) | 15 MB | 76 MB
storage.tar.gz | 0.1% and 1% traces (1 sample each) | 0.15 GB | 0.8 GB
storage_0.1_10.tar.gz | 0.1% traces (10 samples each) | 0.15 GB | 0.8 GB
storage_10.tar.gz | 0.1% and 1% traces (10 samples each) | 1.6 GB | 8.1 GB
storage_all/ | Everything (full traces, 0.1%, 1%, keys) | 30 GB | 150 GB
storage_key.tar.gz | List of keys with #GETs, #ops | 1.4 GB | 8 GB

Format

Each request consists of the following details:

Newer traces include additional fields:

Each trace directory has a file full.header containing column names for that trace.

For example, the trace line below represents a request of type GET_TEMP (op 1) for block 310, starting at byte 262144 with size 4325376, made at time 1572074461.57806.

310 262144 4325376 1572074461.57806 1 24 3 5

Feature values are anonymized IDs and do not carry the same meaning across traces.
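
A minimal sketch of parsing a trace line such as the example above. It assumes fields are whitespace-separated and that the first five columns are the block ID, offset, size, timestamp, and op, in the order the example describes. The `Request` type, its field names, and `parse_line` are hypothetical names chosen for illustration; the authoritative column names for each trace are in its full.header file.

```python
from typing import NamedTuple, Tuple


class Request(NamedTuple):
    # Field names are illustrative assumptions; check full.header for the real ones.
    block_id: int
    io_offset: int
    io_size: int
    op_time: float
    op: int
    features: Tuple[int, ...]  # remaining columns, treated as anonymized integer IDs


def parse_line(line: str) -> Request:
    """Parse one whitespace-separated trace line into a Request."""
    parts = line.split()
    if len(parts) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(parts)}")
    return Request(
        block_id=int(parts[0]),
        io_offset=int(parts[1]),
        io_size=int(parts[2]),
        op_time=float(parts[3]),
        op=int(parts[4]),
        features=tuple(int(p) for p in parts[5:]),
    )


req = parse_line("310 262144 4325376 1572074461.57806 1 24 3 5")
print(req.block_id, req.io_offset, req.io_size, req.op)
```

Keeping the trailing columns in a generic tuple avoids attaching meaning to anonymized values that differ between traces.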

For the op field, please use this mapping (also in the code) to differentiate GETs & PUTs.

GET_TEMP = 1
GET_PERM = 2
PUT_TEMP = 3
PUT_PERM = 4
GET_NOT_INIT = 5
PUT_NOT_INIT = 6
UNKNOWN = 100
 
PUT_OPS = [4, 3, 6]
GET_OPS = [2, 1, 5]
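
As an illustration of how the mapping above might be applied, the sketch below classifies op codes as GETs or PUTs and counts them. The helper `op_kind` and the sample list of op codes are hypothetical; only the numeric values come from the mapping.

```python
from collections import Counter

# Op codes taken from the mapping above, stored as sets for fast membership tests.
GET_OPS = {1, 2, 5}  # GET_TEMP, GET_PERM, GET_NOT_INIT
PUT_OPS = {3, 4, 6}  # PUT_TEMP, PUT_PERM, PUT_NOT_INIT


def op_kind(op: int) -> str:
    """Return "GET", "PUT", or "OTHER" for a numeric op code."""
    if op in GET_OPS:
        return "GET"
    if op in PUT_OPS:
        return "PUT"
    return "OTHER"  # e.g. UNKNOWN = 100


# Hypothetical sequence of op codes, as might be read from the op column of a trace.
ops = [1, 1, 3, 2, 4, 100]
counts = Counter(op_kind(op) for op in ops)
print(counts)
```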

The keys file is primarily used to make sure keys are sampled weighted by the number of GETs. It can also be used to calculate the distribution of GETs and/or PUTs across keys. The format of the keys file is:

<key> <number of GETs+PUTs> <number of GETs>
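
A minimal sketch of reading a keys file in the format above and deriving per-key PUT counts and each key's share of all GETs. It assumes whitespace-separated fields; the function `read_keys` and the three sample rows are made up for illustration and are not taken from the traces.

```python
import io
from typing import Iterable, List, Tuple


def read_keys(lines: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Return (key, gets, puts) per line of '<key> <GETs+PUTs> <GETs>'."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        key, total, gets = parts[0], int(parts[1]), int(parts[2])
        rows.append((key, gets, total - gets))  # PUTs = (GETs+PUTs) - GETs
    return rows


# Hypothetical keys-file contents, for illustration only.
sample = io.StringIO("310 10 7\n311 4 1\n312 6 2\n")
rows = read_keys(sample)

total_gets = sum(gets for _, gets, _ in rows)
get_share = {key: gets / total_gets for key, gets, _ in rows}
print(rows)
print(get_share)
```

The same per-key GET counts could serve as weights when sampling keys in proportion to their GETs.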

Directory Organization

Seven traces are provided in this dump. Each trace is sampled (by block ID) from a different Tectonic cluster. See the Baleen paper for more details on the traces.

Example Directory:

For traces collected in 2021 & 2023:

As some full traces contain more than one machine's worth of data, further sampling is always required so that you can use them with the right cache sizes.

In the paper, multiple samples were used, e.g., full_0_1.trace, full_1_1.trace, full_2_1.trace, ...

See scripts/common/sample.py in BCacheSim for more details.
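
To illustrate the general idea of sampling by block ID, the sketch below keeps or drops every request for a block together, using a hash of the block ID. This is an assumption-laden illustration and not the method in scripts/common/sample.py: the functions `keep_block` and `sample_trace`, the use of SHA-256, and the `seed` parameter are all hypothetical, and the sampling here is uniform over block IDs rather than weighted by GETs.

```python
import hashlib
from typing import Iterable, Iterator


def keep_block(block_id: int, rate: float, seed: int = 0) -> bool:
    """Deterministically decide whether a block falls into the sample."""
    digest = hashlib.sha256(f"{seed}:{block_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def sample_trace(lines: Iterable[str], rate: float, seed: int = 0) -> Iterator[str]:
    """Yield only the trace lines whose block ID (first column) is sampled."""
    for line in lines:
        parts = line.split()
        if parts and keep_block(int(parts[0]), rate, seed):
            yield line


# Hypothetical trace lines: two requests for block 1, one for block 2.
lines = [
    "1 0 10 0.0 1",
    "1 10 10 1.0 1",
    "2 0 10 2.0 3",
]
kept = list(sample_trace(lines, rate=0.5, seed=1))
print(kept)
```

Because the decision depends only on the block ID and the seed, changing the seed yields a different sample while keeping each block's requests intact.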

Contact

For further questions, please contact Daniel Lin-Kit Wong.

References


  1. Baleen: ML Admission & Prefetching for Flash Caches
    Daniel Lin-Kit Wong, Hao Wu, Carson Molder, Sathya Gunasekar, Jimmy Lu, Snehal Khandkar, Abhinav Sharma, Daniel S. Berger, Nathan Beckmann, Gregory R. Ganger
    USENIX FAST 2024

  2. The CacheLib Caching Engine: Design and Experiences at Scale
    Benjamin Berg, Daniel S. Berger, Sara McAllister, Isaac Grosof, Sathya Gunasekar, Jimmy Lu, Michael Uhlar, Jim Carrig, Nathan Beckmann, Mor Harchol-Balter, and Gregory R. Ganger
    USENIX OSDI 2020

  3. Facebook's Tectonic Filesystem: Efficiency from Exascale
    Satadru Pan, Theano Stavrinos, Yunqiao Zhang, Atul Sikaria, Pavel Zakharov, Abhinav Sharma, Mike Shuey, Richard Wareing, Monika Gangapuram, Guanglei Cao, Christian Preseau, Pratap Singh, Kestutis Patiejunas, JR Tipton, Ethan Katz-Bassett, and Wyatt Lloyd
    USENIX FAST 2021