• In 2010, Facebook had 260 billion images
  • Users upload one billion photos (60 TB) in one week
  • Haystack (new and improved approach)
    → Better than the traditional approach that used NFS
    → Reduces disk accesses
    → Minimizes per-photo metadata [size, status, starting position]
  • Serves one million images per second (peak)
  • Haystack is a photo-object store

Overview

  • Facebook saves each photo in four formats
    → Large, Medium, Small, Thumbnail
  • Pattern: Written once, never modified, rarely deleted
  • Disadvantages of POSIX file systems
    → Directories, per-file metadata
    → Permissions and similar metadata are stored even though photos do not need them
    → Problems with traditional NFS
    → → Several accesses are required to read the file
    → → Filename => inode number => read the data

Requirements

  • High throughput and low latency
  • Requests that exceed processing capacity are either
    → Ignored, or
    → Handed off to a CDN (expensive, and ineffective for long-tail requests)
  • Haystack: High throughput with low latency
    → Requires only one disk operation per read
    → Caches all meta-data in main memory
  • Fault tolerance → replication across data centres
  • Throughput: ~4x more reads per second than the NFS-based approach, at 28% lower cost per terabyte

Flow with CDN [figure showing the request path through the CDN omitted]

NFS Based Approach

  • Using CDNs is not an effective solution
    → Requests have a long tail
    → CDNs cache only the most popular photos
    → Most requests are still sent to the backing photo store
  • NFS-based design: each photo is saved as a file on commercial NAS appliances
  • URL => volume and file path on the NAS => read the file to get the data
  • Even with only hundreds of files per directory
    → Reading a photo requires 3 disk accesses: read the directory metadata, load the inode, read the file contents
  • Optimization: cache the file handles returned by NFS, skipping the directory and inode lookups (see the sketch below)
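
A minimal sketch of this read path, assuming a toy in-memory model of the NAS: a cold read costs the three disk accesses listed above, while a cached file handle skips the directory and inode lookups. Class and path names are made up.

    # Illustrative model of the NFS-based read path (not Facebook's code).
    # Each photo is a file; reading it cold costs ~3 disk accesses:
    #   directory metadata -> inode -> file contents.
    # Caching the file handle removes the first two.

    class ToyNAS:
        def __init__(self):
            self.disk_accesses = 0
            self.files = {}                 # path -> bytes ("on disk")
            self.handle_cache = {}          # path -> cached "file handle"

        def put(self, path, data):
            self.files[path] = data

        def read(self, path):
            handle = self.handle_cache.get(path)
            if handle is None:
                self.disk_accesses += 1     # read directory metadata
                self.disk_accesses += 1     # load the inode
                handle = ("inode-for", path)
                self.handle_cache[path] = handle
            self.disk_accesses += 1         # read the file contents
            return self.files[path]

    nas = ToyNAS()
    nas.put("/vol7/albums/123/photo_456_large.jpg", b"...jpeg bytes...")
    nas.read("/vol7/albums/123/photo_456_large.jpg")   # cold: 3 accesses
    nas.read("/vol7/albums/123/photo_456_large.jpg")   # warm handle: 1 access
    print(nas.disk_accesses)                            # -> 4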

Not suitable for heavy-tailed requests

Architecture — has 3 core components:

  1. Haystack Store
  2. Haystack Directory
  3. Haystack Cache

Haystack Store

  • Store machines hold physical volumes, which are grouped into logical volumes
  • Each logical volume has multiple physical volumes (replicas)

Haystack Directory

  • Logical to physical mapping [Given logical Volume Id gives us Physical volume ID → Machine Id]
  • Photo to logical volume mapping [ Photo identifier → LV → PV → Machine]
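
A minimal sketch of the two Directory mappings above using plain dictionaries; all identifiers and machine names are hypothetical.

    # Hypothetical data only: logical volume -> its physical volumes (replicas),
    # and photo id -> the logical volume that stores it.
    logical_to_physical = {
        "LV-42": [("machine-101", "PV-7"), ("machine-233", "PV-3"), ("machine-590", "PV-9")],
    }
    photo_to_logical = {
        987654321: "LV-42",
    }

    def locate(photo_id):
        """Resolve photo id -> logical volume -> replicas (machine, physical volume)."""
        lv = photo_to_logical[photo_id]
        return lv, logical_to_physical[lv]

    print(locate(987654321))
    # ('LV-42', [('machine-101', 'PV-7'), ('machine-233', 'PV-3'), ('machine-590', 'PV-9')])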

Cache (DHT): an internal CDN

Search and Upload Process

  • The web servers use the directory to create a URL for each photo
  • Form:

http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>
  • Upload Process
    → User contacts the web server
    → The web server contacts the directory
    → The directory assigns a writeable logical volume
    → The web server sends a request to the store
    → The store writes to all the physical volumes
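
A sketch of the URL form and upload flow above; the function names, the random choice of a writeable logical volume, and the identifiers are illustrative assumptions, not the production design.

    import random

    # Hypothetical Directory state (see the mappings sketched earlier).
    writeable_logical_volumes = ["LV-42", "LV-43"]
    logical_to_physical = {
        "LV-42": [("machine-101", "PV-7"), ("machine-233", "PV-3")],
        "LV-43": [("machine-102", "PV-1"), ("machine-234", "PV-5")],
    }

    def build_url(cdn, cache, machine_id, logical_volume, photo_id):
        # Mirrors the form http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>
        return f"http://{cdn}/{cache}/{machine_id}/{logical_volume},{photo_id}"

    def upload(photo_id, data):
        # Web server asks the Directory for a writeable logical volume,
        # then the write is applied to every physical replica.
        lv = random.choice(writeable_logical_volumes)
        for machine, pv in logical_to_physical[lv]:
            print(f"write photo {photo_id} ({len(data)} bytes) to {machine}:{pv}")
        return lv

    lv = upload(987654321, b"...jpeg bytes...")
    print(build_url("cdn.example.com", "cache-12", "machine-101", lv, 987654321))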

Haystack Directory

  • Provides a mapping from logical volumes to physical volumes
  • Load balances writes across logical volumes and reads across physical volumes
  • Determines whether a request should be handled by the cache or CDN
  • Marks volumes as read-only once they reach capacity; new machines must be provisioned when writeable volumes run out (see the sketch below)
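
A sketch of these Directory responsibilities, assuming a simple round-robin write policy, a random read replica, and a made-up capacity threshold; none of this is the paper's actual code.

    import itertools, random

    class ToyDirectory:
        """Illustrative Directory: only the responsibilities listed above."""

        def __init__(self, logical_to_physical, capacity_bytes):
            self.logical_to_physical = logical_to_physical   # LV -> [(machine, PV), ...]
            self.capacity_bytes = capacity_bytes
            self.used_bytes = {lv: 0 for lv in logical_to_physical}
            self.read_only = set()
            self._rr = itertools.cycle(sorted(logical_to_physical))

        def pick_write_volume(self, size):
            # Round-robin over volumes that are still writeable (balances writes).
            for _ in range(len(self.logical_to_physical)):
                lv = next(self._rr)
                if lv in self.read_only:
                    continue
                self.used_bytes[lv] += size
                if self.used_bytes[lv] >= self.capacity_bytes:
                    self.read_only.add(lv)       # volume reached capacity: reads only
                return lv
            raise RuntimeError("no writeable volumes left: provision more machines")

        def pick_read_replica(self, lv):
            # Spread reads across the physical replicas of the logical volume.
            return random.choice(self.logical_to_physical[lv])

    d = ToyDirectory({"LV-1": [("m1", "PV-1"), ("m2", "PV-4")],
                      "LV-2": [("m3", "PV-2"), ("m4", "PV-8")]},
                     capacity_bytes=50)
    print(d.pick_write_volume(60))   # LV-1, which is now marked read-only
    print(d.pick_write_volume(60))   # LV-2, which is now marked read-only
    print(d.read_only)
    print(d.pick_read_replica("LV-1"))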

Haystack Cache

  • It is organized as a DHT (distributed hash table): the key is the photo’s id and the value is the photo’s data.
  • If an item is not there, it is fetched from the store.
  • Caches a photo only when
    → Request comes from a user (not a CDN because CDN will anyways do caching at its end)
    → Photo is fetched from a write-enabled store machine
    * Recently uploaded photos are the most likely to be read again. Write-enabled store machines hold these recent photos, while read-only machines hold older photos (often hundreds of days old) whose chance of retrieval is low, so there is little point in caching reads from read-only machines (a sketch of this admission policy follows).
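
A sketch of the Cache's admission policy described above, with the photo id as key; fetch_from_store is a placeholder for a read from the Store.

    # Illustrative Cache node: key = photo id, value = photo bytes.
    # A photo is inserted only if (a) the request came directly from a user
    # (the CDN caches on its own) and (b) it was read from a write-enabled
    # store machine (recent photos are the ones likely to be read again).

    cache = {}

    def fetch_from_store(photo_id):
        # Placeholder for a read from the Haystack Store.
        return b"...jpeg bytes...", True    # (data, store machine is write-enabled)

    def get_photo(photo_id, request_from_user):
        if photo_id in cache:
            return cache[photo_id]          # cache hit
        data, from_write_enabled_store = fetch_from_store(photo_id)
        if request_from_user and from_write_enabled_store:
            cache[photo_id] = data          # admit only in this case
        return data

    get_photo(987654321, request_from_user=True)    # cached
    get_photo(123456789, request_from_user=False)   # served, but not cached (CDN request)
    print(sorted(cache))                            # -> [987654321]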

Haystack Store

  • Each store machine manages multiple physical volumes.
  • A physical volume is a very large file containing millions of
    photos.
  • To access a photo on a machine, we need the following metadata:
    → Logical volume id [will be mapped to PV id]
    → File offset
    → Size of the photo
  • The store machine keeps an in-memory mapping of the photo
    ids to metadata

File Structure

  • One physical volume is a large file, with a superblock, and a sequence of needles
  • Each needle contains the following fields
    → Header, Cookie, Key (64 bits), Alternate key (32 bits), Flags, Size, Data, Checksum (see the sketch after this list)
  • The mapping between photo id and the needle’s fields (offset, size) is kept in memory
  • We additionally use a cookie with each photo id, such that it is hard to guess the URL of a photo
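
A sketch of serializing a needle with Python's struct module; only the 64-bit key and 32-bit alternate key widths come from the notes, while the field order, magic value, and checksum choice (CRC32) are assumptions.

    import struct
    import zlib

    # Illustrative needle layout (not the exact on-disk format).
    HEADER_MAGIC = 0xFACEB00C
    HEADER_FMT = ">I I Q I B I"   # magic, cookie, key (64), alt key (32), flags, size

    def pack_needle(cookie, key, alt_key, data, flags=0):
        header = struct.pack(HEADER_FMT, HEADER_MAGIC, cookie, key, alt_key, flags, len(data))
        checksum = struct.pack(">I", zlib.crc32(data))
        return header + data + checksum

    def unpack_needle(buf):
        hdr_size = struct.calcsize(HEADER_FMT)
        magic, cookie, key, alt_key, flags, size = struct.unpack(HEADER_FMT, buf[:hdr_size])
        assert magic == HEADER_MAGIC
        data = buf[hdr_size:hdr_size + size]
        (checksum,) = struct.unpack(">I", buf[hdr_size + size:hdr_size + size + 4])
        assert checksum == zlib.crc32(data), "corrupted needle"
        return {"cookie": cookie, "key": key, "alt_key": alt_key, "flags": flags, "data": data}

    needle = pack_needle(cookie=0x1234ABCD, key=987654321, alt_key=2, data=b"...jpeg bytes...")
    print(unpack_needle(needle)["key"])   # -> 987654321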

Photo Write and Delete

  • Photo write:
    → We provide the logical volume id, key, alternate key, cookie and data
    → Each (physical) store machine creates a needle, appends it to its volume file, and updates its in-memory metadata.
    → A photo is never modified. If we remove red eyes or rotate the image, a new image is created and is saved with the same key and alternate key. We now point to the new offset.
  • Photo Delete:
    → We set a deleted flag in the needle in the volume file and in the in-memory data structure (see the sketch below)
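
A sketch of the append-only write, the flag-based delete, and the in-memory photo-id → (offset, size) map; the flag bit value and file handling are assumptions.

    import os

    FLAG_DELETED = 0x1   # assumed bit value; the notes only say a deleted flag is set

    class ToyVolume:
        """Illustrative append-only physical volume file plus its in-memory index."""

        def __init__(self, path):
            self.path = path
            self.index = {}                        # key -> (offset, size, flags)
            open(path, "ab").close()               # make sure the file exists

        def write(self, key, needle_bytes):
            # Append-only: even a "modified" photo is written as a new needle,
            # and the in-memory index is pointed at the new offset.
            with open(self.path, "ab") as f:
                offset = f.tell()
                f.write(needle_bytes)
            self.index[key] = (offset, len(needle_bytes), 0)

        def delete(self, key):
            # A real store also sets the flag inside the needle in the volume file;
            # this toy only updates the in-memory index (compaction reclaims space later).
            offset, size, flags = self.index[key]
            self.index[key] = (offset, size, flags | FLAG_DELETED)

        def read(self, key):
            offset, size, flags = self.index[key]
            if flags & FLAG_DELETED:
                raise KeyError(f"photo {key} was deleted")
            with open(self.path, "rb") as f:
                f.seek(offset)                     # one seek, one read
                return f.read(size)

    vol = ToyVolume("pv_toy.dat")
    vol.write(987654321, b"needle bytes for photo 987654321")
    print(len(vol.read(987654321)))
    vol.delete(987654321)
    os.remove("pv_toy.dat")                        # clean up the toy file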

The Index File

  • Index files can be used to create the in-memory data structure while rebooting
  • It is a checkpoint of the in-memory data structure
  • Contains a superblock, and a sequence of needles
  • This file is updated asynchronously. May not be in sync with the volume file [the data file is primary and the index file is secondary]
  • After rebooting, the store machine runs a job to bring the index file back in sync with the data file (see the sketch below)
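
A sketch of checkpointing the in-memory map to an index file and the catch-up job after a reboot; the JSON format here is only for illustration, not Haystack's binary index layout.

    import json
    import os

    def write_index_file(index, path):
        # Asynchronous checkpoint of the in-memory map: key -> (offset, size, flags).
        with open(path, "w") as f:
            json.dump(index, f)

    def load_index_file(path):
        with open(path) as f:
            return {int(k): tuple(v) for k, v in json.load(f).items()}

    def catch_up(index_from_file, needles_in_volume_file):
        # The data file is primary: any needle appended after the last checkpoint
        # is missing from the index file and must be re-added by scanning the
        # tail of the volume file.
        for key, (offset, size, flags) in needles_in_volume_file.items():
            if key not in index_from_file:
                index_from_file[key] = (offset, size, flags)
        return index_from_file

    checkpoint = {987654321: (0, 4096, 0)}
    write_index_file(checkpoint, "pv_toy.idx")
    stale = load_index_file("pv_toy.idx")
    fresh = catch_up(stale, {987654321: (0, 4096, 0), 555555555: (4096, 2048, 0)})
    print(sorted(fresh))   # -> [555555555, 987654321]
    os.remove("pv_toy.idx")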

FileSystem

  • Store machines should use a file system that allows them to perform quick random seeks in a large file.
  • Each store machine uses XFS, an extent-based file system (see the sketch after this list)
    → The block maps are very small (can be cached in the main memory)
    → Efficient file pre-allocation, low fragmentation
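
A small sketch of the two properties above: pre-allocating a large volume file so its extents stay contiguous, then reading at an arbitrary offset with one seek. os.posix_fallocate is a standard POSIX call (available on Linux); the sizes are made up.

    import os

    # Pre-allocate a large, mostly contiguous volume file (extent-based file
    # systems such as XFS keep the block map small), then read at an arbitrary
    # offset with a single seek.
    path = "pv_prealloc.dat"
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.posix_fallocate(fd, 0, 64 * 1024 * 1024)     # reserve 64 MiB up front
        os.lseek(fd, 32 * 1024 * 1024, os.SEEK_SET)     # jump straight to a needle offset
        chunk = os.read(fd, 4096)                       # one random read
        print(len(chunk))
    finally:
        os.close(fd)
        os.remove(path)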

Recovery from Failures

Background task: pitchfork

  • Periodically checks the health of each store machine
  • Attempts to read data from the store machine
  • If it finds a problem, it marks the machine as read-only [no new writes]
  • If the problem cannot be fixed online, the machine's data is restored from a replica with a bulk sync operation (see the sketch below)
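
A sketch of a pitchfork-style pass: probe each store machine with a test read and mark failing machines read-only; the probe function and machine names are placeholders.

    read_only_machines = set()

    def probe(machine):
        # Placeholder: in reality, attempt to read a known photo from the machine.
        return machine != "machine-233"      # pretend this one is failing

    def pitchfork_pass(machines):
        # One periodic pass: mark any machine that fails the read test as read-only.
        for m in machines:
            if not probe(m):
                read_only_machines.add(m)    # no new writes; bulk sync from a replica
                                             # if the problem cannot be fixed online

    pitchfork_pass(["machine-101", "machine-233", "machine-590"])
    print(read_only_machines)                # -> {'machine-233'}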

Optimizations

  • Compaction: reclaim space of deleted and duplicate needles
  • Dynamically copies unique (valid) entries to a new volume file (see the sketch after this list)
  • Over a year, 25% of photos get deleted
  • Memory saving: in the in-memory mapping, the offset is set to 0 for deleted photos
  • Haystack uses an average of 10 bytes of main memory per photo; with the four scaled versions (l, m, s, t) that is about 40 bytes of main-memory metadata per uploaded photo
  • Writes are made sequential by grouping the photos of an album into one large write
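
A sketch of compaction reclaiming deleted and duplicate needles, plus the main-memory arithmetic from the notes; the data layout reuses the (offset, size, flags) convention from the write sketch and is purely illustrative.

    FLAG_DELETED = 0x1

    def compact(old_needles):
        """old_needles: list of (key, flags, data) in append order.
        Returns the new volume contents and the rebuilt index."""
        latest = {}
        for key, flags, data in old_needles:       # later entries supersede earlier duplicates
            latest[key] = (flags, data)

        new_volume = bytearray()
        new_index = {}                              # key -> (offset, size)
        for key, (flags, data) in latest.items():
            if flags & FLAG_DELETED:
                continue                            # reclaim space of deleted photos
            offset = len(new_volume)
            new_volume.extend(data)
            new_index[key] = (offset, len(data))
        return bytes(new_volume), new_index

    old = [(1, 0, b"v1 of photo 1"), (2, 0, b"photo 2"),
           (1, 0, b"v2 of photo 1"),                 # duplicate key: keep only the latest
           (2, FLAG_DELETED, b"photo 2")]            # deleted: dropped during compaction
    volume, index = compact(old)
    print(sorted(index))                             # -> [1]

    # Main-memory cost (from the notes): ~10 bytes of metadata per photo version,
    # times 4 scaled sizes (l, m, s, t) ~= 40 bytes per uploaded photo.
    print(4 * 10)                                    # -> 40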

Photo’s Age

  • Plot the cumulative percentage of accesses (y-axis) with the
    age of the photo (x-axis)
  • The shape of the curve is approximately A(1 - e^(-Bx)) (see the sketch below)
  • 90% of cumulative accesses are less than 600 days old.
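
A small numeric illustration of the fitted shape A(1 - e^(-Bx)): assuming A ≈ 1, the age covering 90% of accesses is ln(10)/B. The value of B below is invented just to show the calculation, not taken from the paper.

    import math

    def cumulative_access_fraction(age_days, A=1.0, B=0.004):
        # Fitted shape of the CDF of accesses vs. photo age: A * (1 - exp(-B * x)).
        return A * (1.0 - math.exp(-B * age_days))

    # Age x at which the curve reaches 90% (with A = 1): solve 0.9 = 1 - exp(-B x).
    B = 0.004                                    # illustrative rate only
    x90 = math.log(10) / B
    print(round(x90))                            # ~576 days, consistent with "under ~600 days"
    print(round(cumulative_access_fraction(600, B=B), 3))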

Traffic

  • Some statistics (date of publication of the paper, 2010)
  • 120 million photos uploaded per day => about 1.44 billion Haystack photos written per day [0.12 billion × 4 (formats) × 3 (physical-volume replicas)]
  • 80–100 billion photos viewed
  • View statistics: ~85% of views are small images, ~10% are thumbnails.
  • Large photos account for only 5% of the views

Read and Write Operations

  • The majority of the operations are reads: about 5000 operations per minute
  • Writes are limited to 500–1000 ops per minute
  • Almost no deletes
  • Reads are much slower than writes.
    → Average read latency: 10 ms
    → Average write latency: 1.5 ms

Paper

  • D. Beaver, S. Kumar, H. C. Li, J. Sobel, P. Vajgel. "Finding a Needle in Haystack: Facebook's Photo Storage." OSDI 2010.