Over the past decades I have accumulated a few terabytes of personal data.
Mostly things like:
- Photos
- Music
- Documents
- Videos
- Random project backups
For a long time everything lived in a single central storage location on my NAS.
That worked fine for years… until two problems became increasingly annoying:
- Backup times
- Storage requirements
Especially when I needed to reinstall or replace the underlying operating system of my NAS,
restoring the full dataset could take many hours.
At some point I realized: my data strategy needed a rethink 🤓
Separating Active Data from Archive Data
My current setup separates data into two categories.
Active Data
These are files I use daily or weekly.
They remain on my NAS:
- Password vault (KeePass)
- Current documents
- Recent photos
- Invoices for upcoming tax returns
- Active projects
Because the active dataset is now so much smaller, I was able to change the storage hardware: the previously used high-capacity HDD in the NAS got replaced with a much faster, lower-capacity SSD.
This gives me much better responsiveness for everyday files.
Archive Data
Everything I rarely need has been moved to an external archive drive.
Examples:
- Old photos
- Old videos
- Completed projects
- Historical documents
This approach has several advantages:
- Backups are much faster
- The NAS contains far fewer files
- Hardware requirements are lower
- Offline storage reduces ransomware risk
Even though I personally consider the risk fairly small, having an offline dataset
never hurts 🙂.
Why Many Small Files Are Slow on ext4
While optimizing my backups I noticed something interesting:
Copying many small files is dramatically slower than copying a few large ones.
This behavior is easy to explain once you look at how ext4 works.
Each File Requires an Inode
ext4 is an inode-based filesystem.
For every single file the filesystem must:
- find a free inode
- write the inode
- create a directory entry
- allocate data blocks
- update metadata
Example:
1 file of 1 GB → 1 inode
100,000 files of 10 KB → 100,000 inodes
That produces a massive amount of metadata I/O.
The result:
- more disk seeks
- more journal writes
- more metadata operations
In other words: lots of tiny files slow everything down.
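The one-inode-per-file point is easy to see for yourself. A minimal sketch (using a throwaway temp directory, so the paths are illustrative):

```shell
#!/bin/bash
# Sketch: every file consumes its own inode, so 100 small files = 100 inodes
TMP=$(mktemp -d)
for i in $(seq 1 100); do
    echo "tiny" > "$TMP/file_$i"
done

# Count the distinct inode numbers behind the 100 files
INODES=$(stat -c '%i' "$TMP"/file_* | sort -un | wc -l)
echo "files: 100, distinct inodes: $INODES"

rm -rf "$TMP"
```

Each of those 100 tiny writes also triggers a directory-entry update and a journal write, which is exactly where the slowdown comes from.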
Solution: Bundle Small Files into Large Archives
To address this problem I decided to bundle many small image files into large archive files.
I asked an AI to help me write a small Bash script that:
- recursively scans my photo directories
- collects images inside each folder
- creates a large archive file
Important detail:
No compression.
JPEG images are already compressed, so additional compression is usually pointless and only wastes CPU cycles.
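You can demonstrate this without any photos at hand: random bytes have the same high entropy as already-compressed JPEG data, so gzip cannot shrink them. A small sketch (the `photo.bin` stand-in is made up):

```shell
#!/bin/bash
# Sketch: compressing high-entropy data (a stand-in for a JPEG) gains nothing
TMP=$(mktemp -d)
head -c 100000 /dev/urandom > "$TMP/photo.bin"

# -k keeps the original so we can compare sizes
gzip -k "$TMP/photo.bin"
ORIG=$(stat -c '%s' "$TMP/photo.bin")
GZ=$(stat -c '%s' "$TMP/photo.bin.gz")
echo "original: $ORIG bytes, gzipped: $GZ bytes"

rm -rf "$TMP"
```

The gzipped file actually comes out slightly *larger* than the input, because the container overhead is added while nothing compresses. Hence `-m0` (store only) in the archiving script.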
Why I Chose RAR
Some people will ask: why RAR?
Simple answer: long-term experience.
RAR offers a few features that are extremely useful for archives:
Split Archives
If an archive exceeds a configured size limit, RAR automatically creates numbered volumes:
jpg_archive.part1.rar
jpg_archive.part2.rar
jpg_archive.part3.rar
These volumes belong together and can be extracted seamlessly.
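A quick sketch of how the `-v` switch produces those volumes, assuming the proprietary `rar` CLI is installed (the block skips itself otherwise, and the file names are made up):

```shell
#!/bin/bash
# Sketch: -v caps each volume's size; rar then emits .part1.rar, .part2.rar, ...
TMP=$(mktemp -d)
head -c 300000 /dev/urandom > "$TMP/photos.bin"   # stand-in for a folder of JPEGs

if command -v rar >/dev/null 2>&1; then
    # -m0 = store, -v100k = 100 KB volumes -> at least three volumes here
    (cd "$TMP" && rar a -m0 -v100k jpg_archive.rar photos.bin >/dev/null)
    VOLUMES=$(ls "$TMP"/jpg_archive.part*.rar 2>/dev/null | wc -l)
else
    VOLUMES=-1   # rar not installed; demo skipped
fi
echo "volumes created: $VOLUMES"

rm -rf "$TMP"
```

Extraction only needs to be pointed at the first volume; `rar x jpg_archive.part1.rar` pulls in the rest automatically.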
Recovery Records
RAR can store recovery information inside the archive.
This can help recover data if:
- a disk sector goes bad
- bitrot occurs
- an archive file becomes partially corrupted
For long-term archives, that feature is extremely valuable.
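Adding a recovery record is a single switch. A hedged sketch, again guarded on the `rar` CLI being present (`protected.rar` and `data.bin` are made-up names):

```shell
#!/bin/bash
# Sketch: -rr10p embeds a 10% recovery record in the archive
TMP=$(mktemp -d)
head -c 50000 /dev/urandom > "$TMP/data.bin"

if command -v rar >/dev/null 2>&1; then
    (cd "$TMP" && rar a -m0 -rr10p protected.rar data.bin >/dev/null)
    # "rar t" verifies integrity; "rar r" would attempt a repair
    # using the embedded recovery record after partial corruption
    (cd "$TMP" && rar t protected.rar >/dev/null) && STATUS=ok || STATUS=corrupt
else
    STATUS=skipped   # rar not installed
fi
echo "recovery-record archive: $STATUS"

rm -rf "$TMP"
```

Roughly speaking, a 10% recovery record lets RAR reconstruct up to about 10% of damaged data, which covers the occasional bad sector comfortably.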
The Trade-Off
Of course this approach has one downside.
If I want to access a single photo, I must first:
- extract the archive
- open the image
But realistically:
How often do I randomly browse a photo from 2007?
Exactly 🙂.
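And even then, the trade-off is milder than it sounds: a single file can be pulled out of a RAR archive without unpacking the rest. A guarded sketch (the photo name is made up):

```shell
#!/bin/bash
# Sketch: extract a single file from an archive without unpacking everything
TMP=$(mktemp -d)
echo "holiday 2007" > "$TMP/IMG_0001.jpg"   # stand-in file, not a real photo

if command -v rar >/dev/null 2>&1; then
    # Archive the file, delete the original, then pull just that one file back
    (cd "$TMP" && rar a -m0 jpg_archive.rar IMG_0001.jpg >/dev/null && rm IMG_0001.jpg)
    # "e" extracts without recreating directory structure; "l" would list contents
    (cd "$TMP" && rar e jpg_archive.rar IMG_0001.jpg >/dev/null)
    RESULT=$(cat "$TMP/IMG_0001.jpg")
else
    RESULT="holiday 2007"   # rar not installed; demo skipped
fi
echo "$RESULT"

rm -rf "$TMP"
```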
Backup Strategy
Naturally the external archive disk is not the only copy.
I maintain multiple copies:
- a local backup disk
- a second local copy
- one off-site backup
That protects against:
- hardware failure
- accidental deletion
- fire or theft
- bitrot
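Keeping those copies in sync doesn't need anything fancy. A minimal sketch with `rsync` (the paths are made up; temp directories stand in for the real disks):

```shell
#!/bin/bash
# Sketch: mirror the archive disk onto a backup disk
SRC=$(mktemp -d)   # stands in for the archive disk, e.g. /media/archive
DST=$(mktemp -d)   # stands in for a backup disk, e.g. /media/backup1
echo "2007 photos" > "$SRC/jpg_archive.rar"

if command -v rsync >/dev/null 2>&1; then
    # -a preserves metadata; --delete makes DST an exact mirror of SRC
    rsync -a --delete "$SRC"/ "$DST"/
else
    cp -a "$SRC"/. "$DST"/   # crude fallback if rsync is unavailable
fi
```

Because the archives barely ever change, these sync runs are fast: rsync only has to compare a handful of large files instead of hundreds of thousands of small ones.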
The Bash Script
Below is the script I used to recursively bundle image files into RAR archives.
#!/bin/bash
# Abort immediately on any error
set -e

# Base directory: first argument, falling back to my photo folder
BASE_DIR="${1:-/media/raphael/MASTER/daten/bilder/}"

if [[ ! -d "$BASE_DIR" ]]; then
    echo "Base directory '$BASE_DIR' does not exist."
    exit 1
fi

LOG_FILE="$BASE_DIR/jpg_rar_log.txt"

# Check if RAR is installed
if ! command -v rar &> /dev/null; then
    echo "RAR is not installed. Please install it and try again."
    exit 1
fi

# Initialize logfile
echo "=== JPG RAR Archiving Log $(date) ===" > "$LOG_FILE"

# Recursive directory traversal (null-terminated to survive odd directory names)
find "$BASE_DIR" -type d -print0 | while IFS= read -r -d '' DIR; do
    # Skip directory if RAR files already exist
    if find "$DIR" -maxdepth 1 -type f -iname "*.rar" -print -quit | grep -q .; then
        echo "[SKIP] '$DIR' already contains RAR files."
        echo "$(date) | $DIR | SKIPPED | Reason: RAR files present" >> "$LOG_FILE"
        continue
    fi

    # Count JPG files
    JPG_COUNT=$(find "$DIR" -maxdepth 1 -type f \( -iname "*.jpg" -o -iname "*.jpeg" \) | wc -l)
    if [[ "$JPG_COUNT" -le 5 ]]; then
        echo "[SKIP] '$DIR' has only $JPG_COUNT JPGs."
        echo "$(date) | $DIR | SKIPPED | Reason: too few JPGs ($JPG_COUNT)" >> "$LOG_FILE"
        continue
    fi

    ARCHIVE_NAME="$DIR/jpg_archive.rar"
    echo "[PROCESS] '$DIR' → '$ARCHIVE_NAME' ($JPG_COUNT JPGs)"

    # Create archive (null-terminated file list to support special characters)
    if find "$DIR" -maxdepth 1 -type f \( -iname "*.jpg" -o -iname "*.jpeg" \) -print0 | \
        xargs -0 rar a -m0 -rr10p -v5000m -ep1 "$ARCHIVE_NAME"; then
        # Determine total archive size in MB (the glob also covers split volumes,
        # since -v5000m produces jpg_archive.partN.rar instead of jpg_archive.rar)
        ARCHIVE_SIZE=$(du -cm "$DIR"/jpg_archive*.rar | tail -n1 | cut -f1)
        # Delete original files
        find "$DIR" -maxdepth 1 -type f \( -iname "*.jpg" -o -iname "*.jpeg" \) -exec rm -f {} +
        echo "[DONE] Archive successfully created and originals removed."
        echo "$(date) | $DIR | SUCCESS | Archive: $ARCHIVE_NAME | Size: ${ARCHIVE_SIZE}MB | JPGs: $JPG_COUNT" >> "$LOG_FILE"
    else
        echo "[ERROR] Archiving failed, original files remain."
        echo "$(date) | $DIR | ERROR | Archive: $ARCHIVE_NAME | JPGs: $JPG_COUNT" >> "$LOG_FILE"
        exit 1
    fi
done

echo "Finished! Logfile: $LOG_FILE"
Important Warning
⚠️ I take absolutely no responsibility for anyone using this script.
Automated archiving can lead to data loss if used incorrectly.
Always:
- test with copies first
- keep multiple backups
- verify your archives
Only then should you consider running something like this on real data.
Next Article
After archiving most of my data, I also created another script that
automatically scans all RAR files and verifies their integrity.
I’ll describe that in the next article 🙂.
