These traces are used in the paper Baleen: ML Admission & Prefetching for Flash Caches. We ask that academic works using these traces cite Baleen [1] and, if appropriate, CacheLib [2] and Tectonic [3].
These traces are anonymized Meta production workloads from a flash cache (CacheLib) in front of a bulk storage system (Tectonic). These traces are licensed under the same terms as CacheLib (Apache 2.0).
Want to look at a trace right away? Look at Region1 or download storage_0.1.tar.gz or storage.tar.gz.
Filename | Description | Size (compressed) | Size (uncompressed) |
---|---|---|---|
storage/ | 0.1% and 1% traces (10 samples each) | - | 8.1 GB |
storage_0.1.tar.gz | 0.1% traces (1 sample each) | 15 MB | 76 MB |
storage.tar.gz | 0.1% and 1% traces (1 sample each) | 0.15 GB | 0.8 GB |
storage_0.1_10.tar.gz | 0.1% traces (10 samples each) | 0.15 GB | 0.8 GB |
storage_10.tar.gz | 0.1% and 1% traces (10 samples each) | 1.6 GB | 8.1 GB |
storage_all/ | Everything (full traces, 0.1%, 1%, keys) | 30 GB | 150 GB |
storage_key.tar.gz | List of keys with #GETs, #ops | 1.4 GB | 8 GB |
Each request consists of the following details:
Namespace (user_namespace) & User (user_name) (which application originated the request)
Newer traces include additional fields:
Each trace directory has a file full.header containing column names for that trace.
For example, the trace line below represents a request for block 310, starting at byte 262144, of size 4325376, at time 1572074461.57806, of type GET_TEMP (op value 1).
310 262144 4325376 1572074461.57806 1 24 3 5
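As a rough illustration, the example line above could be parsed as follows. This is a minimal sketch that assumes fields are separated by whitespace. The column names in `columns` are placeholders invented for this example; the real names should be read from the trace directory's full.header file. `parse_trace_line` is a hypothetical helper, not part of any released code.

```python
from typing import Dict, List


def parse_trace_line(line: str, columns: List[str]) -> Dict[str, str]:
    """Pair each whitespace-separated value in a trace line with its column name."""
    values = line.split()
    if len(values) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(values)}")
    return dict(zip(columns, values))


# Placeholder column names (assumptions for illustration only). In practice,
# read the real names from the trace directory's full.header file.
columns = ["block", "offset", "size", "time", "op", "col6", "col7", "col8"]

record = parse_trace_line("310 262144 4325376 1572074461.57806 1 24 3 5", columns)

block = record["block"]            # anonymized ID, kept as a string
offset = int(record["offset"])     # starting byte within the block
size = int(record["size"])         # request size
timestamp = float(record["time"])  # request time
op = int(record["op"])             # operation type (1 is GET_TEMP)

print(block, offset, size, timestamp, op)
```

Keeping the values as strings until they are needed avoids assuming a numeric type for the anonymized feature columns.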
Feature values are anonymized IDs, and do not carry the same meaning for different traces.
For the op field, please use this mapping (also in the code) to differentiate GETs & PUTs.
GET_TEMP = 1
GET_PERM = 2
PUT_TEMP = 3
PUT_PERM = 4
GET_NOT_INIT = 5
PUT_NOT_INIT = 6
UNKNOWN = 100
PUT_OPS = [4, 3, 6]
GET_OPS = [2, 1, 5]
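The mapping above can be used to label each request as a GET or a PUT. The sketch below is one possible way to do that; `op_kind` is a hypothetical helper and the list of op values is made up for illustration.

```python
from collections import Counter

# Op groups copied from the mapping above.
GET_OPS = [2, 1, 5]
PUT_OPS = [4, 3, 6]


def op_kind(op: int) -> str:
    """Return "GET", "PUT", or "OTHER" for a numeric op value."""
    if op in GET_OPS:
        return "GET"
    if op in PUT_OPS:
        return "PUT"
    return "OTHER"  # e.g. UNKNOWN = 100


# Made-up op values standing in for the op column of a trace.
ops = [1, 2, 3, 4, 5, 6, 100, 1]
counts = Counter(op_kind(op) for op in ops)
print(counts["GET"], counts["PUT"], counts["OTHER"])
```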
The keys file is primarily used to make sure keys are sampled weighted by the number of GETs. It can also be used to calculate the distribution of GETs and/or PUTs across keys. The format of the keys file is:
<key> <number of GETs+PUTs> <number of GETs>
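To show how the keys file format might be used, the sketch below computes per-key GET and PUT counts and each key's share of all GETs. It assumes one whitespace-separated entry per line in the format above. `read_key_counts` is a hypothetical helper, and the sample lines are made up rather than taken from a real keys file.

```python
from typing import Dict, Iterable, Tuple


def read_key_counts(lines: Iterable[str]) -> Dict[str, Tuple[int, int]]:
    """Map each key to (number of GETs, number of PUTs)."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        key, n_ops, n_gets = parts
        # The second column is GETs+PUTs, so PUTs = ops - GETs.
        counts[key] = (int(n_gets), int(n_ops) - int(n_gets))
    return counts


# Made-up entries: <key> <number of GETs+PUTs> <number of GETs>
sample_lines = ["310 10 7", "311 4 3", "312 2 0"]
counts = read_key_counts(sample_lines)

total_gets = sum(gets for gets, _ in counts.values())
get_share = {key: gets / total_gets for key, (gets, _) in counts.items()}
print(counts)
print(get_share)
```

With a real keys file, the same function could be given an open file object instead of a list of strings.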
Seven traces are provided in this dump. Each trace is sampled (by block ID) from a different Tectonic cluster. See the Baleen paper for more details on the traces.
Example Directory:
For traces collected in 2021 & 2023:
Because some full traces contain more than one machine's worth of data, further sampling is required so that you can use them with the right cache sizes.
In the paper, multiple samples were used, e.g., full_0_1.trace, full_1_1.trace, full_2_1.trace, ...
See scripts/common/sample.py in BCacheSim for more details.
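For readers who want a feel for what sampling by block ID can look like, here is a minimal sketch. It is not the method in scripts/common/sample.py, which should be treated as the reference. It assumes a hash-based scheme in which a block is kept when a hash of its ID falls below the sample rate, so that all requests to a kept block stay together. `keep_block`, `sample_rate`, and `seed` are hypothetical names.

```python
import hashlib


def keep_block(block_id: str, sample_rate: float, seed: int = 0) -> bool:
    """Deterministically decide whether a block belongs to the sample."""
    digest = hashlib.sha256(f"{seed}:{block_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


# Made-up trace lines; the first field is assumed to be the block ID.
lines = [
    "310 262144 4325376 1572074461.57806 1 24 3 5",
    "310 0 131072 1572074462.10000 1 24 3 5",
    "311 0 131072 1572074463.20000 3 24 3 5",
]

# Keep every request whose block falls in a 10% sample.
sampled = [line for line in lines if keep_block(line.split()[0], 0.10)]
print(len(sampled))
```

Changing `seed` selects a different set of blocks at the same rate, which is one way to draw several independent samples from a single full trace.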
For further questions, please contact Daniel Lin-Kit Wong.
[1] Baleen: ML Admission & Prefetching for Flash Caches. Daniel Lin-Kit Wong, Hao Wu, Carson Molder, Sathya Gunasekar, Jimmy Lu, Snehal Khandkar, Abhinav Sharma, Daniel S. Berger, Nathan Beckmann, and Gregory R. Ganger. USENIX FAST 2024.
[2] The CacheLib Caching Engine: Design and Experiences at Scale. Benjamin Berg, Daniel S. Berger, Sara McAllister, Isaac Grosof, Sathya Gunasekar, Jimmy Lu, Michael Uhlar, Jim Carrig, Nathan Beckmann, Mor Harchol-Balter, and Gregory R. Ganger. USENIX OSDI 2020.
[3] Facebook's Tectonic Filesystem: Efficiency from Exascale. Satadru Pan, Theano Stavrinos, Yunqiao Zhang, Atul Sikaria, Pavel Zakharov, Abhinav Sharma, Mike Shuey, Richard Wareing, Monika Gangapuram, Guanglei Cao, Christian Preseau, Pratap Singh, Kestutis Patiejunas, JR Tipton, Ethan Katz-Bassett, and Wyatt Lloyd. USENIX FAST 2021.