Over the past decades I have accumulated a few terabytes of personal data.
Mostly things like:

  • Photos
  • Music
  • Documents
  • Videos
  • Random project backups

For a long time everything lived in a single central storage location on my NAS.
That worked fine for years… until two problems became increasingly annoying:

  • Backup times
  • Storage requirements

Especially when I needed to reinstall or replace the underlying operating system of my NAS,
restoring the full dataset could take many hours.

At some point I realized: my data strategy needed a rethink 🤓


Separating Active Data from Archive Data

My current setup separates data into two categories.

Active Data

These are files I use daily or weekly.
They remain on my NAS:

  • Password vault (KeePass)
  • Current documents
  • Recent photos
  • Invoices for upcoming tax returns
  • Active projects

Because the total dataset became much smaller, I was also able to change the storage hardware: the previously used high-capacity HDD was replaced with a much faster, lower-capacity SSD.

This gives me much better responsiveness for everyday files.


Archive Data

Everything I rarely need has been moved to an external archive drive.

Examples:

  • Old photos
  • Old videos
  • Completed projects
  • Historical documents

This approach has several advantages:

  • Backups are much faster
  • The NAS contains far fewer files
  • Hardware requirements are lower
  • Offline storage reduces ransomware risk

Even though I personally consider the risk fairly small, having an offline dataset
never hurts 🙂.


Why Many Small Files Are Slow on ext4

While optimizing my backups I noticed something interesting:

Copying many small files is dramatically slower than copying a few large ones.

This behavior is easy to explain once you look at how ext4 works.

Each File Requires an Inode

ext4 is an inode-based filesystem.

For every single file the filesystem must:

  • find a free inode
  • write the inode
  • create a directory entry
  • allocate data blocks
  • update metadata

Example:


      1 file  of  1 GB  →       1 inode
100,000 files of 10 KB  → 100,000 inodes

That produces a massive amount of metadata I/O.

The result:

  • more disk seeks
  • more journal writes
  • more metadata operations

In other words: lots of tiny files slow everything down.
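The effect is easy to reproduce. The following sketch (paths, sizes, and file counts are mine, purely for illustration) writes the same 10 MB payload once as a single file and once as 1,000 small files, then times copying both trees:

```shell
#!/bin/bash
# Demo: identical payload, very different metadata load.
set -e
WORK=$(mktemp -d)

# One 10 MB file → 1 inode
mkdir "$WORK/large"
dd if=/dev/zero of="$WORK/large/big.bin" bs=1M count=10 status=none

# 1,000 files of 10 KB each → 1,000 inodes (same total size)
mkdir "$WORK/small"
for i in $(seq 1 1000); do
    dd if=/dev/zero of="$WORK/small/file_$i.bin" bs=10K count=1 status=none
done

# Copying the small tree forces ~1,000 inode allocations and
# directory-entry writes instead of one.
time cp -r "$WORK/large" "$WORK/large_copy"
time cp -r "$WORK/small" "$WORK/small_copy"
```

On a spinning disk with a cold cache, the second copy is noticeably slower, even though both trees hold exactly the same amount of data.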


Solution: Bundle Small Files into Large Archives

To address this problem I decided to bundle many small image files into large archive files.

I asked an AI to help me write a small Bash script that:

  • recursively scans my photo directories
  • collects images inside each folder
  • creates a large archive file

Important detail:

No compression.

JPEG images are already compressed, so additional compression is usually pointless and only wastes CPU cycles.
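You can verify this even without rar: already-compressed, high-entropy data simply doesn't shrink. In this sketch, random bytes stand in for JPEG data and gzip stands in for any compressor (the filename is made up):

```shell
#!/bin/bash
# Compressing high-entropy data (like JPEGs) gains nothing.
set -e
WORK=$(mktemp -d)

# 1 MB of random bytes as a stand-in for already-compressed image data
head -c 1000000 /dev/urandom > "$WORK/photo_standin.bin"

# gzip -k keeps the original next to the .gz file
gzip -k "$WORK/photo_standin.bin"

ORIG=$(wc -c < "$WORK/photo_standin.bin")
COMP=$(wc -c < "$WORK/photo_standin.bin.gz")
echo "original: $ORIG bytes, 'compressed': $COMP bytes"
```

The "compressed" file typically comes out a few bytes *larger*, because gzip can only add its container overhead. That is why the script further below passes -m0 (store, no compression) to rar.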


Why I Chose RAR

Some people will ask: why RAR?

Simple answer: long-term experience.

RAR offers a few features that are extremely useful for archives:

Split Archives

If an archive exceeds a configured size limit, RAR automatically splits it into volumes:


jpg_archive.part1.rar
jpg_archive.part2.rar
jpg_archive.part3.rar

These volumes belong together and can be extracted seamlessly.

Recovery Records

RAR can store recovery information inside the archive.

This can help recover data if:

  • a disk sector goes bad
  • bitrot occurs
  • an archive file becomes partially corrupted

For long-term archives, that feature is extremely valuable.


The Trade-Off

Of course this approach has one downside.

If I want to access a single photo, I must first:

  1. extract the image from the archive
  2. open the image

But realistically:

How often do I randomly browse a photo from 2007?

Exactly 🙂.


Backup Strategy

Naturally the external archive disk is not the only copy.

I maintain multiple copies:

  • a local backup disk
  • an additional copy
  • one off-site backup

That protects against:

  • hardware failure
  • accidental deletion
  • fire or theft
  • bitrot

The Bash Script

Below is the script I used to recursively bundle image files into RAR archives.

#!/bin/bash

# Abort immediately on any error
set -e

# Base directory: first argument, falling back to my default path
BASE_DIR="${1:-/media/raphael/MASTER/daten/bilder/}"
LOG_FILE="$BASE_DIR/jpg_rar_log.txt"

if [[ ! -d "$BASE_DIR" ]]; then
    echo "Base directory '$BASE_DIR' does not exist."
    exit 1
fi

# Check if RAR is installed
if ! command -v rar &> /dev/null; then
    echo "RAR is not installed. Please install it and try again."
    exit 1
fi

# Initialize logfile
echo "=== JPG RAR Archiving Log $(date) ===" > "$LOG_FILE"

# Recursive directory traversal
find "$BASE_DIR" -type d -print0 | while IFS= read -r -d '' DIR; do

    # Skip directory if RAR files already exist
    if find "$DIR" -maxdepth 1 -type f -iname "*.rar" -print -quit | grep -q .; then
        echo "[SKIP] '$DIR' already contains RAR files."
        echo "$(date) | $DIR | SKIPPED | Reason: RAR files present" >> "$LOG_FILE"
        continue
    fi

    # Count JPG files
    JPG_COUNT=$(find "$DIR" -maxdepth 1 -type f \( -iname "*.jpg" -o -iname "*.jpeg" \) | wc -l)

    if [[ "$JPG_COUNT" -le 5 ]]; then
        echo "[SKIP] '$DIR' has only $JPG_COUNT JPGs."
        echo "$(date) | $DIR | SKIPPED | Reason: too few JPGs ($JPG_COUNT)" >> "$LOG_FILE"
        continue
    fi

    ARCHIVE_NAME="$DIR/jpg_archive.rar"
    echo "[PROCESS] '$DIR' → '$ARCHIVE_NAME' ($JPG_COUNT JPGs)"

    # Create archive (null-terminated to support special characters)
    if find "$DIR" -maxdepth 1 -type f \( -iname "*.jpg" -o -iname "*.jpeg" \) -print0 | \
       xargs -0 rar a -m0 -rr10p -v5000m -ep1 "$ARCHIVE_NAME"; then

        # Determine total archive size in MB (also correct for split volumes)
        ARCHIVE_SIZE=$(du -cm "$DIR"/jpg_archive*.rar | tail -n1 | cut -f1)

        # Delete original files
        find "$DIR" -maxdepth 1 -type f \( -iname "*.jpg" -o -iname "*.jpeg" \) -exec rm -f {} +

        echo "[DONE] Archive successfully created and originals removed."
        echo "$(date) | $DIR | SUCCESS | Archive: $ARCHIVE_NAME | Size: ${ARCHIVE_SIZE}MB | JPGs: $JPG_COUNT" >> "$LOG_FILE"
    else
        echo "[ERROR] Archiving failed, original files remain."
        echo "$(date) | $DIR | ERROR | Archive: $ARCHIVE_NAME | JPGs: $JPG_COUNT" >> "$LOG_FILE"
        exit 1
    fi

done

echo "Finished! Logfile: $LOG_FILE"

Important Warning

⚠️ I take absolutely no responsibility for anyone using this script.

Automated archiving can potentially lead to data loss if used incorrectly.

Always:

  • test with copies first
  • keep multiple backups
  • verify your archives

Only then should you consider running something like this on real data.


Next Article

After archiving most of my data, I also created another script that
automatically scans all RAR files and verifies their integrity.

I’ll describe that in the next article 🙂.

By raphael
