• In 2010, Facebook had 260 billion images
  • Users upload one billion photos (60 TB) in one week
  • Haystack (new and improved approach)
    → Better than the traditional approach that used NFS
    → Reduces disk accesses
    → Minimizes per-photo metadata [size, status, starting position]
  • Serves one million images per second (peak)
  • Haystack is a photo-object store

Overview

  • Facebook saves each photo in four formats
    → Large, Medium, Small, Thumbnail
  • Pattern: Written once, never modified, rarely deleted
  • Disadvantages of POSIX file systems
    → Directories, per-file metadata
    → Permissions and similar metadata are stored even though photos do not need them
    → Problems with traditional NFS
    → → Several accesses are required to read the file
    → → Filename => inode number => read the data

Requirements

  • High throughput and low latency
  • Requests that exceed processing capacity are either
    → Ignored, or
    → Handed off to a CDN (expensive, and ineffective for long-tail requests)
  • Haystack: High throughput with low latency
    → Requires only one disk operation per read
    → Caches all meta-data in main memory
  • Fault tolerance → replication across data centres
  • Throughput: ~4x more reads per second than the NFS-based approach, at 28% lower cost per terabyte

Flow with CDN [figure showing the request path through the CDN omitted]

NFS Based Approach

  • Using CDNs is not an effective solution
    → Requests have a long tail
    → CDNs cache only the most popular photos
    → Most requests are still sent to the backing photo store
  • NFS-based design: each photo is saved as a file on commercial NAS appliances
  • URL => volume and file path on the NAS => read the file to get the data
  • Even with only hundreds of files per directory
    → Reading a photo requires 3 disk accesses: read the directory metadata, load the inode, read the file contents
  • Optimization: cache the file handles returned by NFS, skipping the directory and inode lookups (see the sketch below)
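
A minimal sketch of this read path, assuming a toy in-memory model of the NAS: a cold read costs the three disk accesses listed above, while a cached file handle skips the directory and inode lookups. Class and path names are made up.

    # Illustrative model of the NFS-based read path (not Facebook's code).
    # Each photo is a file; reading it cold costs ~3 disk accesses:
    #   directory metadata -> inode -> file contents.
    # Caching the file handle removes the first two.

    class ToyNAS:
        def __init__(self):
            self.disk_accesses = 0
            self.files = {}                 # path -> bytes ("on disk")
            self.handle_cache = {}          # path -> cached "file handle"

        def put(self, path, data):
            self.files[path] = data

        def read(self, path):
            handle = self.handle_cache.get(path)
            if handle is None:
                self.disk_accesses += 1     # read directory metadata
                self.disk_accesses += 1     # load the inode
                handle = ("inode-for", path)
                self.handle_cache[path] = handle
            self.disk_accesses += 1         # read the file contents
            return self.files[path]

    nas = ToyNAS()
    nas.put("/vol7/albums/123/photo_456_large.jpg", b"...jpeg bytes...")
    nas.read("/vol7/albums/123/photo_456_large.jpg")   # cold: 3 accesses
    nas.read("/vol7/albums/123/photo_456_large.jpg")   # warm handle: 1 access
    print(nas.disk_accesses)                            # -> 4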

Not suitable for heavy-tailed requests

Architecture — has 3 core components:

  1. Haystack Store
  2. Haystack Directory
  3. Haystack Cache

Haystack Store

  • Store machines hold physical volumes, which are grouped into logical volumes
  • Each logical volume has multiple physical volumes (replicas)

Haystack Directory

  • Logical to physical mapping [Given logical Volume Id gives us Physical volume ID → Machine Id]
  • Photo to logical volume mapping [ Photo identifier → LV → PV → Machine]
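
A minimal sketch of the two Directory mappings above using plain dictionaries; all identifiers and machine names are hypothetical.

    # Hypothetical data only: logical volume -> its physical volumes (replicas),
    # and photo id -> the logical volume that stores it.
    logical_to_physical = {
        "LV-42": [("machine-101", "PV-7"), ("machine-233", "PV-3"), ("machine-590", "PV-9")],
    }
    photo_to_logical = {
        987654321: "LV-42",
    }

    def locate(photo_id):
        """Resolve photo id -> logical volume -> replicas (machine, physical volume)."""
        lv = photo_to_logical[photo_id]
        return lv, logical_to_physical[lv]

    print(locate(987654321))
    # ('LV-42', [('machine-101', 'PV-7'), ('machine-233', 'PV-3'), ('machine-590', 'PV-9')])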

Cache (DHT): an internal CDN

Search and Upload Process

  • The web servers use the directory to create a URL for each photo
  • Form:

http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>
  • Upload Process
    → User contacts the web server
    → The web server contacts the directory
    → The directory assigns a writeable logical volume
    → The web server sends a request to the store
    → The store writes to all the physical volumes
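
A sketch of the URL form and upload flow above; the function names, the random choice of a writeable logical volume, and the identifiers are illustrative assumptions, not the production design.

    import random

    # Hypothetical Directory state (see the mappings sketched earlier).
    writeable_logical_volumes = ["LV-42", "LV-43"]
    logical_to_physical = {
        "LV-42": [("machine-101", "PV-7"), ("machine-233", "PV-3")],
        "LV-43": [("machine-102", "PV-1"), ("machine-234", "PV-5")],
    }

    def build_url(cdn, cache, machine_id, logical_volume, photo_id):
        # Mirrors the form http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>
        return f"http://{cdn}/{cache}/{machine_id}/{logical_volume},{photo_id}"

    def upload(photo_id, data):
        # Web server asks the Directory for a writeable logical volume,
        # then the write is applied to every physical replica.
        lv = random.choice(writeable_logical_volumes)
        for machine, pv in logical_to_physical[lv]:
            print(f"write photo {photo_id} ({len(data)} bytes) to {machine}:{pv}")
        return lv

    lv = upload(987654321, b"...jpeg bytes...")
    print(build_url("cdn.example.com", "cache-12", "machine-101", lv, 987654321))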

Haystack Directory

  • Provides a mapping from logical volumes to physical volumes
  • Load balances writes across logical volumes and reads across physical volumes
  • Determines whether a request should be handled by the cache or CDN
  • Marks volumes as read-only once they reach capacity; new machines must be provisioned when writeable volumes run out (see the sketch below)
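
A sketch of these Directory responsibilities, assuming a simple round-robin write policy, a random read replica, and a made-up capacity threshold; none of this is the paper's actual code.

    import itertools, random

    class ToyDirectory:
        """Illustrative Directory: only the responsibilities listed above."""

        def __init__(self, logical_to_physical, capacity_bytes):
            self.logical_to_physical = logical_to_physical   # LV -> [(machine, PV), ...]
            self.capacity_bytes = capacity_bytes
            self.used_bytes = {lv: 0 for lv in logical_to_physical}
            self.read_only = set()
            self._rr = itertools.cycle(sorted(logical_to_physical))

        def pick_write_volume(self, size):
            # Round-robin over volumes that are still writeable (balances writes).
            for _ in range(len(self.logical_to_physical)):
                lv = next(self._rr)
                if lv in self.read_only:
                    continue
                self.used_bytes[lv] += size
                if self.used_bytes[lv] >= self.capacity_bytes:
                    self.read_only.add(lv)       # volume reached capacity: reads only
                return lv
            raise RuntimeError("no writeable volumes left: provision more machines")

        def pick_read_replica(self, lv):
            # Spread reads across the physical replicas of the logical volume.
            return random.choice(self.logical_to_physical[lv])

    d = ToyDirectory({"LV-1": [("m1", "PV-1"), ("m2", "PV-4")],
                      "LV-2": [("m3", "PV-2"), ("m4", "PV-8")]},
                     capacity_bytes=50)
    print(d.pick_write_volume(60))   # LV-1, which is now marked read-only
    print(d.pick_write_volume(60))   # LV-2, which is now marked read-only
    print(d.read_only)
    print(d.pick_read_replica("LV-1"))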

Haystack Cache

  • It is organized as a DHT (distributed hash table): the key is the photo’s id and the value is the photo’s data.
  • If an item is not there, it is fetched from the store.
  • Caches a photo only when
    → Request comes from a user (not a CDN because CDN will anyways do caching at its end)
    → Photo is fetched from a write-enabled store machine
    * Recently uploaded photos are the most likely to be read again. Write-enabled store machines hold these recent photos, while read-only machines hold older photos (often hundreds of days old) whose chance of retrieval is low, so there is little point in caching reads from read-only machines (a sketch of this admission policy follows).
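
A sketch of the Cache's admission policy described above, with the photo id as key; fetch_from_store is a placeholder for a read from the Store.

    # Illustrative Cache node: key = photo id, value = photo bytes.
    # A photo is inserted only if (a) the request came directly from a user
    # (the CDN caches on its own) and (b) it was read from a write-enabled
    # store machine (recent photos are the ones likely to be read again).

    cache = {}

    def fetch_from_store(photo_id):
        # Placeholder for a read from the Haystack Store.
        return b"...jpeg bytes...", True    # (data, store machine is write-enabled)

    def get_photo(photo_id, request_from_user):
        if photo_id in cache:
            return cache[photo_id]          # cache hit
        data, from_write_enabled_store = fetch_from_store(photo_id)
        if request_from_user and from_write_enabled_store:
            cache[photo_id] = data          # admit only in this case
        return data

    get_photo(987654321, request_from_user=True)    # cached
    get_photo(123456789, request_from_user=False)   # served, but not cached (CDN request)
    print(sorted(cache))                            # -> [987654321]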

Haystack Store

  • Each store machine manages multiple physical volumes.
  • A physical volume is a very large file containing millions of
    photos.
  • To access a photo on a machine, we need the following metadata:
    → Logical volume id [will be mapped to PV id]
    → File offset
    → Size of the photo
  • The store machine keeps an in-memory mapping of the photo
    ids to metadata

File Structure

  • One physical volume is a large file, with a superblock, and a sequence of needles
  • Each needle contains the following fields
    → Header, Cookie, Key (64 bits), Alternate key (32 bits), Flags, Size, Data, Checksum (see the sketch after this list)
  • The mapping between photo id and the needle’s fields (offset, size) is kept in memory
  • We additionally use a cookie with each photo id, such that it is hard to guess the URL of a photo
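
A sketch of serializing a needle with Python's struct module; only the 64-bit key and 32-bit alternate key widths come from the notes, while the field order, magic value, and checksum choice (CRC32) are assumptions.

    import struct
    import zlib

    # Illustrative needle layout (not the exact on-disk format).
    HEADER_MAGIC = 0xFACEB00C
    HEADER_FMT = ">I I Q I B I"   # magic, cookie, key (64), alt key (32), flags, size

    def pack_needle(cookie, key, alt_key, data, flags=0):
        header = struct.pack(HEADER_FMT, HEADER_MAGIC, cookie, key, alt_key, flags, len(data))
        checksum = struct.pack(">I", zlib.crc32(data))
        return header + data + checksum

    def unpack_needle(buf):
        hdr_size = struct.calcsize(HEADER_FMT)
        magic, cookie, key, alt_key, flags, size = struct.unpack(HEADER_FMT, buf[:hdr_size])
        assert magic == HEADER_MAGIC
        data = buf[hdr_size:hdr_size + size]
        (checksum,) = struct.unpack(">I", buf[hdr_size + size:hdr_size + size + 4])
        assert checksum == zlib.crc32(data), "corrupted needle"
        return {"cookie": cookie, "key": key, "alt_key": alt_key, "flags": flags, "data": data}

    needle = pack_needle(cookie=0x1234ABCD, key=987654321, alt_key=2, data=b"...jpeg bytes...")
    print(unpack_needle(needle)["key"])   # -> 987654321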

Photo Write and Delete

  • Photo write:
    → We provide the logical volume id, key, alternate key, cookie and data
    → Each (physical) store machine creates a needle, appends it to its volume file, and updates its in-memory metadata.
    → A photo is never modified. If we remove red eyes or rotate the image, a new image is created and is saved with the same key and alternate key. We now point to the new offset.
  • Photo Delete:
    → We set a deleted flag in the needle in the volume file and in the in-memory data structure (see the sketch below)
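
A sketch of the append-only write, the flag-based delete, and the in-memory photo-id → (offset, size) map; the flag bit value and file handling are assumptions.

    import os

    FLAG_DELETED = 0x1   # assumed bit value; the notes only say a deleted flag is set

    class ToyVolume:
        """Illustrative append-only physical volume file plus its in-memory index."""

        def __init__(self, path):
            self.path = path
            self.index = {}                        # key -> (offset, size, flags)
            open(path, "ab").close()               # make sure the file exists

        def write(self, key, needle_bytes):
            # Append-only: even a "modified" photo is written as a new needle,
            # and the in-memory index is pointed at the new offset.
            with open(self.path, "ab") as f:
                offset = f.tell()
                f.write(needle_bytes)
            self.index[key] = (offset, len(needle_bytes), 0)

        def delete(self, key):
            # A real store also sets the flag inside the needle in the volume file;
            # this toy only updates the in-memory index (compaction reclaims space later).
            offset, size, flags = self.index[key]
            self.index[key] = (offset, size, flags | FLAG_DELETED)

        def read(self, key):
            offset, size, flags = self.index[key]
            if flags & FLAG_DELETED:
                raise KeyError(f"photo {key} was deleted")
            with open(self.path, "rb") as f:
                f.seek(offset)                     # one seek, one read
                return f.read(size)

    vol = ToyVolume("pv_toy.dat")
    vol.write(987654321, b"needle bytes for photo 987654321")
    print(len(vol.read(987654321)))
    vol.delete(987654321)
    os.remove("pv_toy.dat")                        # clean up the toy file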

The Index File

  • Index files can be used to create the in-memory data structure while rebooting
  • It is a checkpoint of the in-memory data structure
  • Contains a superblock, and a sequence of needles
  • This file is updated asynchronously. May not be in sync with the volume file [the data file is primary and the index file is secondary]
  • After rebooting, the store machine runs a job to bring the index file back in sync with the data file (see the sketch below)
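
A sketch of checkpointing the in-memory map to an index file and the catch-up job after a reboot; the JSON format here is only for illustration, not Haystack's binary index layout.

    import json
    import os

    def write_index_file(index, path):
        # Asynchronous checkpoint of the in-memory map: key -> (offset, size, flags).
        with open(path, "w") as f:
            json.dump(index, f)

    def load_index_file(path):
        with open(path) as f:
            return {int(k): tuple(v) for k, v in json.load(f).items()}

    def catch_up(index_from_file, needles_in_volume_file):
        # The data file is primary: any needle appended after the last checkpoint
        # is missing from the index file and must be re-added by scanning the
        # tail of the volume file.
        for key, (offset, size, flags) in needles_in_volume_file.items():
            if key not in index_from_file:
                index_from_file[key] = (offset, size, flags)
        return index_from_file

    checkpoint = {987654321: (0, 4096, 0)}
    write_index_file(checkpoint, "pv_toy.idx")
    stale = load_index_file("pv_toy.idx")
    fresh = catch_up(stale, {987654321: (0, 4096, 0), 555555555: (4096, 2048, 0)})
    print(sorted(fresh))   # -> [555555555, 987654321]
    os.remove("pv_toy.idx")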

FileSystem

  • Store machines should use a file system that allows them to perform quick random seeks in a large file.
  • Each store machine uses XFS, an extent-based file system (see the sketch after this list)
    → The block maps are very small (can be cached in the main memory)
    → Efficient file pre-allocation, low fragmentation
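
A small sketch of the two properties above: pre-allocating a large volume file so its extents stay contiguous, then reading at an arbitrary offset with one seek. os.posix_fallocate is a standard POSIX call (available on Linux); the sizes are made up.

    import os

    # Pre-allocate a large, mostly contiguous volume file (extent-based file
    # systems such as XFS keep the block map small), then read at an arbitrary
    # offset with a single seek.
    path = "pv_prealloc.dat"
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.posix_fallocate(fd, 0, 64 * 1024 * 1024)     # reserve 64 MiB up front
        os.lseek(fd, 32 * 1024 * 1024, os.SEEK_SET)     # jump straight to a needle offset
        chunk = os.read(fd, 4096)                       # one random read
        print(len(chunk))
    finally:
        os.close(fd)
        os.remove(path)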

Recovery from Failures

Background task: pitchfork

  • Periodically checks the health of each store machine
  • Attempts to read data from the store machine
  • If it finds a problem, it marks the machine as read-only [no new writes]
  • If the problem cannot be fixed online, the machine's data is restored from a replica with a bulk sync operation (see the sketch below)
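
A sketch of a pitchfork-style pass: probe each store machine with a test read and mark failing machines read-only; the probe function and machine names are placeholders.

    read_only_machines = set()

    def probe(machine):
        # Placeholder: in reality, attempt to read a known photo from the machine.
        return machine != "machine-233"      # pretend this one is failing

    def pitchfork_pass(machines):
        # One periodic pass: mark any machine that fails the read test as read-only.
        for m in machines:
            if not probe(m):
                read_only_machines.add(m)    # no new writes; bulk sync from a replica
                                             # if the problem cannot be fixed online

    pitchfork_pass(["machine-101", "machine-233", "machine-590"])
    print(read_only_machines)                # -> {'machine-233'}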

Optimizations

  • Compaction: reclaim space of deleted and duplicate needles
  • Dynamically copies unique (valid) entries to a new volume file (see the sketch after this list)
  • Over a year, 25% of photos get deleted
  • Memory saving: in the in-memory mapping, the offset is set to 0 for deleted photos
  • Haystack uses an average of 10 bytes of main memory per photo; with the four scaled versions (l, m, s, t) that is about 40 bytes of main-memory metadata per uploaded photo
  • Writes are made sequential by grouping the photos of an album into one large write
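
A sketch of compaction reclaiming deleted and duplicate needles, plus the main-memory arithmetic from the notes; the data layout reuses the (offset, size, flags) convention from the write sketch and is purely illustrative.

    FLAG_DELETED = 0x1

    def compact(old_needles):
        """old_needles: list of (key, flags, data) in append order.
        Returns the new volume contents and the rebuilt index."""
        latest = {}
        for key, flags, data in old_needles:       # later entries supersede earlier duplicates
            latest[key] = (flags, data)

        new_volume = bytearray()
        new_index = {}                              # key -> (offset, size)
        for key, (flags, data) in latest.items():
            if flags & FLAG_DELETED:
                continue                            # reclaim space of deleted photos
            offset = len(new_volume)
            new_volume.extend(data)
            new_index[key] = (offset, len(data))
        return bytes(new_volume), new_index

    old = [(1, 0, b"v1 of photo 1"), (2, 0, b"photo 2"),
           (1, 0, b"v2 of photo 1"),                 # duplicate key: keep only the latest
           (2, FLAG_DELETED, b"photo 2")]            # deleted: dropped during compaction
    volume, index = compact(old)
    print(sorted(index))                             # -> [1]

    # Main-memory cost (from the notes): ~10 bytes of metadata per photo version,
    # times 4 scaled sizes (l, m, s, t) ~= 40 bytes per uploaded photo.
    print(4 * 10)                                    # -> 40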

Photo’s Age

  • Plot the cumulative percentage of accesses (y-axis) with the
    age of the photo (x-axis)
  • The shape of the curve is approximately A(1 - e^(-Bx)) (see the sketch below)
  • 90% of cumulative accesses are less than 600 days old.
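
A small numeric illustration of the fitted shape A(1 - e^(-Bx)): assuming A ≈ 1, the age covering 90% of accesses is ln(10)/B. The value of B below is invented just to show the calculation, not taken from the paper.

    import math

    def cumulative_access_fraction(age_days, A=1.0, B=0.004):
        # Fitted shape of the CDF of accesses vs. photo age: A * (1 - exp(-B * x)).
        return A * (1.0 - math.exp(-B * age_days))

    # Age x at which the curve reaches 90% (with A = 1): solve 0.9 = 1 - exp(-B x).
    B = 0.004                                    # illustrative rate only
    x90 = math.log(10) / B
    print(round(x90))                            # ~576 days, consistent with "under ~600 days"
    print(round(cumulative_access_fraction(600, B=B), 3))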

Traffic

  • Some statistics (date of publication of the paper, 2010)
  • 120 million photos uploaded per day => about 1.44 billion Haystack photos written per day [0.12 billion × 4 (formats) × 3 (physical-volume replicas)]
  • 80–100 billion photos viewed
  • View statistics: ~85% of views are small images, ~10% are thumbnails.
  • Large photos account for only 5% of the views

Read and Write Operations

  • The majority of the operations are reads: about 5000 operations per minute
  • Writes are limited to 500–1000 ops per minute
  • Almost no deletes
  • Reads are much slower than writes.
    → Average read latency: 10 ms
    → Average write latency: 1.5 ms

Paper

  • D. Beaver, S. Kumar, H. C. Li, J. Sobel, P. Vajgel. "Finding a Needle in Haystack: Facebook's Photo Storage." OSDI 2010.