Meta Tectonic Bulk Storage Traces (Baleen)

These traces are used in the paper Baleen: ML Admission & Prefetching for Flash Caches. We ask that academic works using these traces cite Baleen ¹ and, if appropriate, CacheLib ² and Tectonic ³.

These traces are anonymized Meta production workloads from a flash cache (CacheLib) in front of a bulk storage system (Tectonic). These traces are licensed under the same terms as CacheLib (Apache 2.0).

Quick start

Want to look at a trace right away? Look at Region1 or download storage_0.1.tar.gz or storage.tar.gz.

Filename	Description	Size (compressed)	Size (uncompressed)
storage/	0.1% and 1% traces (10 samples each)	-	8.1 GB
storage_0.1.tar.gz	0.1% traces (1 sample each)	15 MB	76 MB
storage.tar.gz	0.1% and 1% traces (1 sample each)	0.15 GB	0.8 GB
storage_0.1_10.tar.gz	0.1% traces (10 samples each)	0.15 GB	0.8 GB
storage_10.tar.gz	0.1% and 1% traces (10 samples each)	1.6 GB	8.1 GB
storage_all/	Everything (full traces, 0.1%, 1%, keys)	30 GB	150 GB
storage_key.tar.gz	List of keys with #GETs, #ops	1.4 GB	8 GB

Format

Each request consists of the following details:

Block ID
IO byte offset
IO size
Time
Op name
Features: Namespace (user_namespace) & User (user_name) (which application originated the request)

Newer traces include additional fields:

rs_shard_id: the Reed-Solomon shard (see Tectonic paper). In traces containing this field, the key should be (block_id, rs_shard_id).
op_count: number of repeated operations within the same minute
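As a minimal sketch of how these two fields might be handled, the snippet below builds the cache key from (block_id, rs_shard_id) when the shard field is present and falls back to block_id otherwise. Treating op_count as a per-row weight is an assumption made for illustration, and the dict-based record layout and function names are hypothetical rather than taken from BCacheSim.

```python
from collections import Counter


def cache_key(record):
    """Return (block_id, rs_shard_id) if the trace has shard IDs, else block_id."""
    if "rs_shard_id" in record:
        return (record["block_id"], record["rs_shard_id"])
    return record["block_id"]


def count_ops(records):
    """Count operations per key, weighting each row by op_count (assumed default: 1)."""
    counts = Counter()
    for rec in records:
        counts[cache_key(rec)] += rec.get("op_count", 1)
    return counts


# Made-up records for illustration only; these are not real trace rows.
records = [
    {"block_id": 310, "rs_shard_id": 0, "op_count": 3},
    {"block_id": 310, "rs_shard_id": 1},
    {"block_id": 310, "rs_shard_id": 0, "op_count": 2},
]
counts = count_ops(records)
print(counts)
```

Keying on the tuple keeps different shards of the same block separate, which is what the field description above asks for.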

Each trace directory has a file full.header containing column names for that trace.

For example, the trace line below represents a GET_TEMP request (op 1) for block 310, reading 4325376 bytes starting at byte offset 262144, at time 1572074461.57806.

310 262144 4325376 1572074461.57806 1 24 3 5

Feature values are anonymized IDs, and do not carry the same meaning for different traces.
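The example line above can be split into named fields with a few lines of Python. This is a sketch under stated assumptions: the column names in COLUMNS are hypothetical placeholders for the first five fields (the authoritative names are in each trace's full.header), and any fields beyond them are kept unparsed instead of being guessed at.

```python
# Hypothetical column names; the real ones are listed in each trace's full.header.
COLUMNS = ["block_id", "io_offset", "io_size", "time", "op"]


def parse_trace_line(line, columns=COLUMNS):
    """Split one whitespace-separated trace line into a dict keyed by column name."""
    values = line.split()
    record = {}
    for name, raw in zip(columns, values):
        # In the example line only the timestamp has a fractional part.
        record[name] = float(raw) if "." in raw else int(raw)
    # Fields beyond the named columns are returned as raw strings.
    record["extra"] = values[len(columns):]
    return record


rec = parse_trace_line("310 262144 4325376 1572074461.57806 1 24 3 5")
# First byte past the requested range: offset + size.
end_offset = rec["io_offset"] + rec["io_size"]
print(rec["block_id"], rec["op"], end_offset)
```

Passing the column list read from full.header instead of the placeholder list would name every field, including the feature columns.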

For the op field, please use this mapping (also in the code) to differentiate GETs & PUTs.

GET_TEMP = 1
GET_PERM = 2
PUT_TEMP = 3
PUT_PERM = 4
GET_NOT_INIT = 5
PUT_NOT_INIT = 6
UNKNOWN = 100
 
PUT_OPS = [4, 3, 6]
GET_OPS = [2, 1, 5]
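A small helper built on the mapping above might look like the following. The function name op_kind and the "OTHER" label are illustrative choices, not part of the released code.

```python
# Op codes copied from the mapping above.
GET_OPS = {1, 2, 5}  # GET_TEMP, GET_PERM, GET_NOT_INIT
PUT_OPS = {3, 4, 6}  # PUT_TEMP, PUT_PERM, PUT_NOT_INIT


def op_kind(op):
    """Classify a numeric op code as 'GET', 'PUT', or 'OTHER' (e.g. UNKNOWN = 100)."""
    if op in GET_OPS:
        return "GET"
    if op in PUT_OPS:
        return "PUT"
    return "OTHER"


kinds = [op_kind(op) for op in (1, 4, 100)]
print(kinds)
```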

The keys file is primarily used to sample keys weighted by their number of GETs. It can also be used to calculate the distribution of GETs and/or PUTs across keys. The format of the keys file is:

<key> <number of GETs+PUTs> <number of GETs>
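To illustrate the format, the sketch below parses keys-file lines, derives the PUT count as (GETs+PUTs) minus GETs, and draws keys with probability proportional to their GET count. It is not the sampling procedure used for the released traces (see scripts/common/sample.py in BCacheSim for that). The function names and sample lines are made up, and random.choices samples with replacement.

```python
import random


def parse_keys_line(line):
    """Parse '<key> <GETs+PUTs> <GETs>' into (key, gets, puts)."""
    # Split from the right so a key containing spaces stays in one piece.
    key, total, gets = line.rsplit(None, 2)
    total, gets = int(total), int(gets)
    return key, gets, total - gets


def sample_keys_by_gets(entries, k, seed=0):
    """Draw k keys (with replacement), weighted by each key's GET count."""
    rng = random.Random(seed)
    keys = [key for key, _gets, _puts in entries]
    weights = [gets for _key, gets, _puts in entries]
    return rng.choices(keys, weights=weights, k=k)


# Made-up lines for illustration only.
lines = ["310 12 9", "311 4 0", "312 7 7"]
entries = [parse_keys_line(line) for line in lines]
picks = sample_keys_by_gets(entries, k=20)
print(entries)
```

A key with zero GETs, such as 311 here, has zero weight and is never drawn.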

Directory Organization

Seven traces are provided in this dump. Each trace is sampled (by block ID) from a different Tectonic cluster. See the Baleen paper for more details on the traces.

Example Directory:

full_0_1.trace (1% trace - Use with cache size at 1%)
full_0_0.1.trace (0.1% trace - Use with cache size at 0.1%)
full.trace (full trace - do not use without sampling)
full.keys (List of keys with GETs & GET+PUTs, used for sampling)

For traces collected in 2021 & 2023:

Each trace is from a different datacenter.
Each machine in the cluster (1,000s to 10,000s of nodes) samples traffic at a fixed rate, which is then aggregated across all machines to get the full trace.

Because some full traces contain more than one machine's worth of data, always sample a full trace further before use so that it matches the cache size being simulated.

In the paper, multiple samples were used, e.g., full_0_1.trace, full_1_1.trace, full_2_1.trace, ...

See scripts/common/sample.py in BCacheSim for more details.
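The sample files listed above appear to follow the pattern full_<sample index>_<rate in percent>.trace. Assuming that pattern holds, which is inferred only from the file names shown here, a hypothetical helper for enumerating them could be:

```python
def sample_trace_name(sample_idx, rate_percent):
    """Build a sampled-trace file name, e.g. (0, "0.1") -> "full_0_0.1.trace".

    The naming pattern is inferred from the file names in this README.
    rate_percent is passed as a string to avoid float formatting surprises.
    """
    return f"full_{sample_idx}_{rate_percent}.trace"


# File names for the first three 1% samples.
names = [sample_trace_name(i, "1") for i in range(3)]
print(names)
```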

Contact

For further questions, please contact Daniel Lin-Kit Wong.

References

¹ Baleen: ML Admission & Prefetching for Flash Caches
Daniel Lin-Kit Wong, Hao Wu, Carson Molder, Sathya Gunasekar, Jimmy Lu, Snehal Khandkar, Abhinav Sharma, Daniel S. Berger, Nathan Beckmann, and Gregory R. Ganger
USENIX FAST 2024
² The CacheLib Caching Engine: Design and Experiences at Scale
Benjamin Berg, Daniel S. Berger, Sara McAllister, Isaac Grosof, Sathya Gunasekar, Jimmy Lu, Michael Uhlar, Jim Carrig, Nathan Beckmann, Mor Harchol-Balter, and Gregory R. Ganger
USENIX OSDI 2020
³ Facebook's Tectonic Filesystem: Efficiency from Exascale
Satadru Pan, Theano Stavrinos, Yunqiao Zhang, Atul Sikaria, Pavel Zakharov, Abhinav Sharma, Mike Shuey, Richard Wareing, Monika Gangapuram, Guanglei Cao, Christian Preseau, Pratap Singh, Kestutis Patiejunas, JR Tipton, Ethan Katz-Bassett, and Wyatt Lloyd
USENIX FAST 2021
