Over the past decades I have accumulated a few terabytes of personal data.
Mostly things like:
- Photos
- Music
- Documents
- Videos
- Random project backups
For a long time everything lived in a single central storage location on my NAS.
That worked fine for years… until two problems became increasingly annoying:
- Backup times
- Storage requirements
Especially when I needed to reinstall or replace the underlying operating system of my NAS,
restoring the full dataset could take many hours.
At some point I realized: my data strategy needed a rethink 🤓
Separating Active Data from Archive Data
My current setup separates data into two categories.
Active Data
These are files I use daily or weekly.
They remain on my NAS:
- Password vault (KeePass)
- Current documents
- Recent photos
- Invoices for upcoming tax returns
- Active projects
Because the active dataset is now so much smaller, I was able to change the storage hardware: the previously used high-capacity HDD in the NAS got replaced with a much faster, lower-capacity SSD.
This gives me much better responsiveness for everyday files.
Archive Data
Everything I rarely need has been moved to an external archive drive.
Examples:
- Old photos
- Old videos
- Completed projects
- Historical documents
This approach has several advantages:
- Backups are much faster
- The NAS contains far fewer files
- Hardware requirements are lower
- Offline storage reduces ransomware risk
Even though I personally consider the risk fairly small, having an offline dataset
never hurts 🙂.
Why Many Small Files Are Slow on ext4
While optimizing my backups I noticed something interesting:
Copying many small files is dramatically slower than copying a few large ones.
This behavior is easy to explain once you look at how ext4 works.
Each File Requires an Inode
ext4 is an inode-based filesystem.
For every single file the filesystem must:
- find a free inode
- write the inode
- create a directory entry
- allocate data blocks
- update metadata
Example:
1 file of 1 GB → 1 inode
100,000 files of 10 KB → 100,000 inodes
That produces a massive amount of metadata I/O.
The result:
- more disk seeks
- more journal writes
- more metadata operations
In other words: lots of tiny files slow everything down.
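The one-inode-per-file point is easy to see for yourself. A minimal sketch (using a throwaway temp directory, so the paths are illustrative):

```shell
#!/bin/bash
# Sketch: every file consumes its own inode, so 100 small files = 100 inodes
TMP=$(mktemp -d)
for i in $(seq 1 100); do
    echo "tiny" > "$TMP/file_$i"
done

# Count the distinct inode numbers behind the 100 files
INODES=$(stat -c '%i' "$TMP"/file_* | sort -un | wc -l)
echo "files: 100, distinct inodes: $INODES"

rm -rf "$TMP"
```

Each of those 100 tiny writes also triggers a directory-entry update and a journal write, which is exactly where the slowdown comes from.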
Solution: Bundle Small Files into Large Archives
To address this problem I decided to bundle many small image files into large archive files.
I asked an AI to help me write a small Bash script that:
- recursively scans my photo directories
- collects images inside each folder
- creates a large archive file
Important detail:
No compression.
JPEG images are already compressed, so additional compression is usually pointless and only wastes CPU cycles.
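You can demonstrate this without any photos at hand: random bytes have the same high entropy as already-compressed JPEG data, so gzip cannot shrink them. A small sketch (the `photo.bin` stand-in is made up):

```shell
#!/bin/bash
# Sketch: compressing high-entropy data (a stand-in for a JPEG) gains nothing
TMP=$(mktemp -d)
head -c 100000 /dev/urandom > "$TMP/photo.bin"

# -k keeps the original so we can compare sizes
gzip -k "$TMP/photo.bin"
ORIG=$(stat -c '%s' "$TMP/photo.bin")
GZ=$(stat -c '%s' "$TMP/photo.bin.gz")
echo "original: $ORIG bytes, gzipped: $GZ bytes"

rm -rf "$TMP"
```

The gzipped file actually comes out slightly *larger* than the input, because the container overhead is added while nothing compresses. Hence `-m0` (store only) in the archiving script.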
Why I Chose RAR
Some people will ask: why RAR?
Simple answer: long-term experience.
RAR offers a few features that are extremely useful for archives:
Split Archives
If an archive exceeds a configured size limit, RAR automatically creates numbered volumes:
jpg_archive.part1.rar
jpg_archive.part2.rar
jpg_archive.part3.rar
These volumes belong together and can be extracted seamlessly.
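A quick sketch of how the `-v` switch produces those volumes, assuming the proprietary `rar` CLI is installed (the block skips itself otherwise, and the file names are made up):

```shell
#!/bin/bash
# Sketch: -v caps each volume's size; rar then emits .part1.rar, .part2.rar, ...
TMP=$(mktemp -d)
head -c 300000 /dev/urandom > "$TMP/photos.bin"   # stand-in for a folder of JPEGs

if command -v rar >/dev/null 2>&1; then
    # -m0 = store, -v100k = 100 KB volumes -> at least three volumes here
    (cd "$TMP" && rar a -m0 -v100k jpg_archive.rar photos.bin >/dev/null)
    VOLUMES=$(ls "$TMP"/jpg_archive.part*.rar 2>/dev/null | wc -l)
else
    VOLUMES=-1   # rar not installed; demo skipped
fi
echo "volumes created: $VOLUMES"

rm -rf "$TMP"
```

Extraction only needs to be pointed at the first volume; `rar x jpg_archive.part1.rar` pulls in the rest automatically.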
Recovery Records
RAR can store recovery information inside the archive.
This can help recover data if:
- a disk sector goes bad
- bitrot occurs
- an archive file becomes partially corrupted
For long-term archives, that feature is extremely valuable.
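Adding a recovery record is a single switch. A hedged sketch, again guarded on the `rar` CLI being present (`protected.rar` and `data.bin` are made-up names):

```shell
#!/bin/bash
# Sketch: -rr10p embeds a 10% recovery record in the archive
TMP=$(mktemp -d)
head -c 50000 /dev/urandom > "$TMP/data.bin"

if command -v rar >/dev/null 2>&1; then
    (cd "$TMP" && rar a -m0 -rr10p protected.rar data.bin >/dev/null)
    # "rar t" verifies integrity; "rar r" would attempt a repair
    # using the embedded recovery record after partial corruption
    (cd "$TMP" && rar t protected.rar >/dev/null) && STATUS=ok || STATUS=corrupt
else
    STATUS=skipped   # rar not installed
fi
echo "recovery-record archive: $STATUS"

rm -rf "$TMP"
```

Roughly speaking, a 10% recovery record lets RAR reconstruct up to about 10% of damaged data, which covers the occasional bad sector comfortably.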
The Trade-Off
Of course this approach has one downside.
If I want to access a single photo, I must first:
- extract the archive
- open the image
But realistically:
How often do I randomly browse a photo from 2007?
Exactly 🙂.
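And even then, the trade-off is milder than it sounds: a single file can be pulled out of a RAR archive without unpacking the rest. A guarded sketch (the photo name is made up):

```shell
#!/bin/bash
# Sketch: extract a single file from an archive without unpacking everything
TMP=$(mktemp -d)
echo "holiday 2007" > "$TMP/IMG_0001.jpg"   # stand-in file, not a real photo

if command -v rar >/dev/null 2>&1; then
    # Archive the file, delete the original, then pull just that one file back
    (cd "$TMP" && rar a -m0 jpg_archive.rar IMG_0001.jpg >/dev/null && rm IMG_0001.jpg)
    # "e" extracts without recreating directory structure; "l" would list contents
    (cd "$TMP" && rar e jpg_archive.rar IMG_0001.jpg >/dev/null)
    RESULT=$(cat "$TMP/IMG_0001.jpg")
else
    RESULT="holiday 2007"   # rar not installed; demo skipped
fi
echo "$RESULT"

rm -rf "$TMP"
```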
Backup Strategy
Naturally the external archive disk is not the only copy.
I maintain multiple copies:
- a local backup disk
- a second local copy
- one off-site backup
That protects against:
- hardware failure
- accidental deletion
- fire or theft
- bitrot
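Keeping those copies in sync doesn't need anything fancy. A minimal sketch with `rsync` (the paths are made up; temp directories stand in for the real disks):

```shell
#!/bin/bash
# Sketch: mirror the archive disk onto a backup disk
SRC=$(mktemp -d)   # stands in for the archive disk, e.g. /media/archive
DST=$(mktemp -d)   # stands in for a backup disk, e.g. /media/backup1
echo "2007 photos" > "$SRC/jpg_archive.rar"

if command -v rsync >/dev/null 2>&1; then
    # -a preserves metadata; --delete makes DST an exact mirror of SRC
    rsync -a --delete "$SRC"/ "$DST"/
else
    cp -a "$SRC"/. "$DST"/   # crude fallback if rsync is unavailable
fi
```

Because the archives barely ever change, these sync runs are fast: rsync only has to compare a handful of large files instead of hundreds of thousands of small ones.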
The Bash Script
Below is the script I used to recursively bundle image files into RAR archives.
#!/bin/bash
# Abort immediately on any error
set -e

# Base directory: first argument, falling back to my photo folder
BASE_DIR="${1:-/media/raphael/MASTER/daten/bilder/}"

if [[ ! -d "$BASE_DIR" ]]; then
    echo "Base directory '$BASE_DIR' does not exist."
    exit 1
fi

LOG_FILE="$BASE_DIR/jpg_rar_log.txt"

# Check if RAR is installed
if ! command -v rar &> /dev/null; then
    echo "RAR is not installed. Please install it and try again."
    exit 1
fi

# Initialize logfile
echo "=== JPG RAR Archiving Log $(date) ===" > "$LOG_FILE"

# Recursive directory traversal (null-terminated to survive odd directory names)
find "$BASE_DIR" -type d -print0 | while IFS= read -r -d '' DIR; do
    # Skip directory if RAR files already exist
    if find "$DIR" -maxdepth 1 -type f -iname "*.rar" -print -quit | grep -q .; then
        echo "[SKIP] '$DIR' already contains RAR files."
        echo "$(date) | $DIR | SKIPPED | Reason: RAR files present" >> "$LOG_FILE"
        continue
    fi

    # Count JPG files
    JPG_COUNT=$(find "$DIR" -maxdepth 1 -type f \( -iname "*.jpg" -o -iname "*.jpeg" \) | wc -l)
    if [[ "$JPG_COUNT" -le 5 ]]; then
        echo "[SKIP] '$DIR' has only $JPG_COUNT JPGs."
        echo "$(date) | $DIR | SKIPPED | Reason: too few JPGs ($JPG_COUNT)" >> "$LOG_FILE"
        continue
    fi

    ARCHIVE_NAME="$DIR/jpg_archive.rar"
    echo "[PROCESS] '$DIR' → '$ARCHIVE_NAME' ($JPG_COUNT JPGs)"

    # Create archive (null-terminated file list to support special characters)
    if find "$DIR" -maxdepth 1 -type f \( -iname "*.jpg" -o -iname "*.jpeg" \) -print0 | \
        xargs -0 rar a -m0 -rr10p -v5000m -ep1 "$ARCHIVE_NAME"; then
        # Determine total archive size in MB (the glob also covers split volumes,
        # since -v5000m produces jpg_archive.partN.rar instead of jpg_archive.rar)
        ARCHIVE_SIZE=$(du -cm "$DIR"/jpg_archive*.rar | tail -n1 | cut -f1)
        # Delete original files
        find "$DIR" -maxdepth 1 -type f \( -iname "*.jpg" -o -iname "*.jpeg" \) -exec rm -f {} +
        echo "[DONE] Archive successfully created and originals removed."
        echo "$(date) | $DIR | SUCCESS | Archive: $ARCHIVE_NAME | Size: ${ARCHIVE_SIZE}MB | JPGs: $JPG_COUNT" >> "$LOG_FILE"
    else
        echo "[ERROR] Archiving failed, original files remain."
        echo "$(date) | $DIR | ERROR | Archive: $ARCHIVE_NAME | JPGs: $JPG_COUNT" >> "$LOG_FILE"
        exit 1
    fi
done

echo "Finished! Logfile: $LOG_FILE"
Important Warning
⚠️ I take absolutely no responsibility for anyone using this script.
Automated archiving can lead to data loss if used incorrectly.
Always:
- test with copies first
- keep multiple backups
- verify your archives
Only then should you consider running something like this on real data.
Next Article
After archiving most of my data, I also created another script that
automatically scans all RAR files and verifies their integrity.
I’ll describe that in the next article 🙂.
