# ChunkingManager Bug Analysis Report

## Key Discoveries

### The Bug is Actually MULTIPLE Bugs

1. **Critical Race Condition** (`ChunkingManager:122`): Position deleted BEFORE verifying seek succeeded
2. **Silent Failures**: `gSeek()` can fail but return value is never checked
3. **State Loss**: `UploadUuid` stored in `copyExtraData` gets cleared when transitioning files, breaking multipart uploads
4. **Memory Exhaustion**: Once an upload has progressed to ~1.6GB, package serialization fails because PHP reaches its memory limit
5. **No Transaction Safety**: Position updates can fail silently

### Why It's Intermittent

The perfect storm requires:
- Large file (1.6GB+) pushing memory limits
- File transition callback clearing critical state
- Database serialization failure at the exact wrong moment

This combination is rare in testing but happens in production.

### Good News

There's already a partial fix in branch `20251127-ChunkingManager-bug-fix` that removes the early position deletion, but it is incomplete because it lacks the critical validation checks.

---

## The Plan

I've created a **4-phase layered fix strategy** organized by priority:

### Phase 1 (Critical - Ship First)
- Add `gSeek()` return value validation
- Add retry logic to database updates
- Enhanced error reporting
- Position validation on load

### Phase 2 (Root Cause Fix)
- Move `UploadUuid` to `generalExtraData` (survives file transitions)
- Add state transition logging

### Phase 3 (Memory Management)
- Memory headroom checks before serialization
- Diagnostic logging improvements

### Phase 4 (Optional Defensive)
- Position rollback on save failure
- Additional validation layers

---

## Implementation Details

The plan includes:
- Specific code changes
- File paths
- Testing strategy
- Implementation order

**Total**: ~200-250 lines across 6 files

# Fix Plan: Chunk Position Reset Bug

## Executive Summary

**Bug**: During backup storage uploads, the chunk position intermittently resets from ~1.6GB back to position `[]` or `[0,0]`, causing re-upload attempts that fail with "File already exists" errors.

**Root Causes Identified**:
1. **Critical Race Condition** (`ChunkingManager:122`): Position deleted BEFORE validating that the seek succeeded
2. **Silent Failures**: `gSeek()` can return false, but the return value is never checked
3. **State Corruption**: `copyExtraData` (containing `UploadUuid`) cleared prematurely when transitioning files
4. **Memory Exhaustion**: Package serialization fails at ~1.6GB into the upload due to PHP memory limits, causing the database update to fail
5. **No Transaction Safety**: Position can be partially updated without atomicity guarantees

**Impact**: Production-only intermittent failures during large backup uploads to DupCloud storage.

**Existing Work**: Branch `20251127-ChunkingManager-bug-fix` (commit 813cce895) contains a partial fix that removes early position deletion but lacks critical validation checks.

## Recommended Approach: Layered Fix Strategy

### Phase 1: Critical Fixes (HIGH PRIORITY - Ship First)

These are the minimum changes needed to prevent position resets. Low risk, high impact.

#### Fix 1.1: Add gSeek() Return Value Validation
**File**: `src/Libs/Chunking/ChunkingManager.php` (lines 121-125)

**Current Issue**:
```php
} elseif (($last_position = $this->persistance->getPersistanceData()) !== null) {
    $this->persistance->deletePersistanceData();  // ❌ Deletes before validation
    $this->it->gSeek($last_position);             // ❌ Return value ignored
    $this->it->next();
}
```

**Fix**:
```php
} elseif (($last_position = $this->persistance->getPersistanceData()) !== null) {
    // Validate seek succeeds BEFORE clearing position
    if ($this->it->gSeek($last_position) === false) {
        DupLog::error('[CHUNK] Failed to restore position: ' . json_encode($last_position));
        $this->lastErrorMessage = 'Failed to restore chunk position';
        return self::CHUNK_ERROR;
    }

    // Only delete after successful seek
    $this->persistance->deletePersistanceData();
    $this->it->next();

    DupLog::trace('[CHUNK] Resumed from position: ' . json_encode($last_position));
}
```

**Rationale**: Never delete position until we verify it can be restored. This prevents the most obvious failure path.
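
**Sketch**: The ordering rule can be exercised in isolation. The following is a minimal, self-contained illustration and not the real `ChunkingManager` code; `InMemoryPositionStore` and `resumeFromStore()` are hypothetical names introduced only for this example, and the seek is passed in as a plain callable.

```php
<?php
// Hypothetical in-memory stand-in for the persistence adapter (illustration only).
final class InMemoryPositionStore
{
    /** @var array|null Saved position as [fileIndex, offset], or null when empty. */
    private $position;

    public function __construct(?array $position)
    {
        $this->position = $position;
    }

    public function get(): ?array
    {
        return $this->position;
    }

    public function delete(): void
    {
        $this->position = null;
    }
}

/**
 * Restore a saved position, clearing it only after the seek succeeds.
 *
 * @param callable $seek Receives the position array; returns false when it cannot be restored.
 * @return bool True when there was nothing to restore or the seek succeeded.
 */
function resumeFromStore(InMemoryPositionStore $store, callable $seek): bool
{
    $position = $store->get();
    if ($position === null) {
        return true; // Fresh start, nothing to restore.
    }
    if ($seek($position) === false) {
        return false; // The saved position stays in the store for diagnosis.
    }
    $store->delete();
    return true;
}

// A failing seek leaves the stored position untouched.
$store  = new InMemoryPositionStore([1, 1600000000]);
$result = resumeFromStore($store, function (array $p): bool {
    return false;
});
```

With the original ordering, the same failing seek would have left the store empty, which is the reset described in the Executive Summary.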

#### Fix 1.2: Add Retry Logic to package->update()
**File**: `src/Package/Storage/UploadPackageFilePersistanceAdapter.php` (lines 48-54)

**Current Issue**: Database update failures are not retried, causing silent position loss.

**Fix**:
```php
public function savePersistanceData($data, GenericSeekableIteratorInterface $it): bool
{
    $this->uploadInfo->progress      = $it->getProgressPerc();
    $this->uploadInfo->chunkPosition = $data;

    // Retry transient failures
    $maxRetries = 3;
    for ($attempt = 1; $attempt <= $maxRetries; $attempt++) {
        if ($this->package->update(false)) {
            return true;
        }

        DupLog::error('[CHUNK] Package update failed (attempt ' . $attempt . '/' . $maxRetries . ')');

        if ($attempt < $maxRetries) {
            usleep(100000); // 100ms delay before retry
        }
    }

    DupLog::error('[CHUNK] CRITICAL: Failed to persist position after ' . $maxRetries . ' attempts');
    return false; // Triggers CHUNK_ERROR in ChunkingManager
}
```

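**Rationale**: Database operations can fail transiently. Retrying prevents position loss from temporary issues. Retrying does not help when the failure is deterministic, such as serialization running out of memory; Phase 3 targets that case.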
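
**Sketch**: To unit test the retry behaviour without a database or real delays, the loop can be expressed as a small helper that takes the operation and the sleep function as callables. `retryUntilTrue()` is a hypothetical name used only for this illustration; it is not part of the codebase.

```php
<?php
/**
 * Run $operation until it returns true or $maxAttempts is reached.
 *
 * @param callable $operation   Takes no arguments; returns true on success.
 * @param int      $maxAttempts Total number of attempts, including the first.
 * @param int      $delayUs     Delay between attempts, in microseconds.
 * @param callable $sleep       Receives the delay; injectable so tests do not sleep.
 */
function retryUntilTrue(callable $operation, int $maxAttempts, int $delayUs, callable $sleep): bool
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        if ($operation() === true) {
            return true;
        }
        if ($attempt < $maxAttempts) {
            $sleep($delayUs); // No delay after the final attempt.
        }
    }
    return false;
}

// Example: an update that fails twice, then succeeds on the third attempt.
$calls       = 0;
$flakyUpdate = function () use (&$calls): bool {
    $calls++;
    return $calls >= 3;
};
$ok = retryUntilTrue($flakyUpdate, 3, 100000, 'usleep');
```

Passing `'usleep'` gives the production behaviour; a test can pass a closure that records the requested delays instead.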

#### Fix 1.3: Enhanced gSeek() with Error Details
**File**: `src/Libs/Chunking/Iterators/TimeoutFileCopyIterator.php` (lines 115-133)

**Current Issue**: gSeek() returns false silently without explaining why.

**Add error tracking**:
```php
/** @var string Last error message */
protected $lastError = '';

/**
 * Get last error message
 */
public function getLastError(): string
{
    return $this->lastError;
}

public function gSeek($position): bool
{
    $this->lastError = '';

    // Validate structure
    if (!is_array($position)) {
        $this->lastError = 'Position is not an array: ' . gettype($position);
        return false;
    }

    if (count($position) !== 2) {
        $this->lastError = 'Position array has wrong size: ' . count($position);
        return false;
    }

    // Validate data types
    if (!is_int($position[0]) || !is_int($position[1])) {
        $this->lastError = 'Position contains non-integer values';
        return false;
    }

    $fileIndex = $position[0];
    $offset = $position[1];

    // Validate ranges
    if ($fileIndex < 0 || $fileIndex >= count($this->from)) {
        $this->lastError = 'File index out of range: ' . $fileIndex . ' (max: ' . (count($this->from) - 1) . ')';
        return false;
    }

    if ($offset < 0) {
        $this->lastError = 'Offset is negative: ' . $offset;
        return false;
    }

    // Validate file exists
    $targetFile = $this->from[$fileIndex];
    if (!$this->adapter->isFile($targetFile)) {
        $this->lastError = 'Target file does not exist: ' . $targetFile;
        return false;
    }

    // Validate offset within file size
    $fileSize = $this->adapter->fileSize($targetFile);
    if ($offset > $fileSize) {
        $this->lastError = 'Offset (' . $offset . ') exceeds file size (' . $fileSize . ')';
        return false;
    }

    // Existing seek logic continues...
    $this->setCurrentItem($fileIndex, $offset);

    // Recalculate bytes parsed
    $this->bytesParsed = 0;
    for ($i = 0; $i < $this->position[0]; $i++) {
        $file = $this->from[$i];
        if (!$this->adapter->isFile($file)) {
            continue;
        }
        $this->bytesParsed += $this->adapter->fileSize($file);
    }
    $this->bytesParsed += $this->position[1];

    return true;
}
```

**Then in ChunkingManager, log the error**:
```php
if ($this->it->gSeek($last_position) === false) {
    $errorDetails = method_exists($this->it, 'getLastError') ? $this->it->getLastError() : 'Unknown error';
    DupLog::error('[CHUNK] Seek failed: ' . $errorDetails);
    // ... rest of error handling
}
```

**Rationale**: Diagnostic information is critical for debugging production issues.
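
**Sketch**: The byte recalculation at the end of `gSeek()` is easy to get wrong when a file is missing, so it may be worth checking on its own. The helper below is a hypothetical, standalone version that works on a list of file sizes instead of the storage adapter; a `null` size stands for a path that is not a file.

```php
<?php
/**
 * Total bytes that precede a position [fileIndex, offset].
 *
 * @param array $fileSizes Size in bytes per file, in iteration order; null marks a missing file.
 * @param array $position  Two-element list [fileIndex, offset].
 */
function bytesParsedAt(array $fileSizes, array $position): int
{
    list($fileIndex, $offset) = $position;

    $total = 0;
    for ($i = 0; $i < $fileIndex; $i++) {
        if ($fileSizes[$i] === null) {
            continue; // Missing files contribute nothing.
        }
        $total += $fileSizes[$i];
    }

    return $total + $offset;
}

// Installer (500 bytes), one missing file, then 1.6GB into the archive.
$parsed = bytesParsedAt([500, null, 2000000000], [2, 1600000000]);
```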

#### Fix 1.4: Validate Position on Load
**File**: `src/Package/Storage/UploadPackageFilePersistanceAdapter.php` (lines 35-38)

**Add validation**:
```php
public function getPersistanceData()
{
    $position = empty($this->uploadInfo->chunkPosition) ? null : $this->uploadInfo->chunkPosition;

    // Validate position structure before returning
    if ($position !== null) {
        if (!is_array($position) || count($position) !== 2) {
            DupLog::error('[CHUNK] Corrupted position detected: ' . json_encode($position));
            return null; // Force restart from beginning
        }

        if (!is_int($position[0]) || !is_int($position[1])) {
            DupLog::error('[CHUNK] Invalid position data types: [' .
                gettype($position[0]) . ', ' . gettype($position[1]) . ']');
            return null;
        }

        if ($position[0] < 0 || $position[1] < 0) {
            DupLog::error('[CHUNK] Negative position values: ' . json_encode($position));
            return null;
        }

        DupLog::trace('[CHUNK] Loaded valid position: ' . json_encode($position));
    }

    return $position;
}
```

**Rationale**: Detect corrupted position data early. Better to restart cleanly than crash or corrupt further.
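
**Sketch**: Fix 1.3 and Fix 1.4 repeat the same structural checks. One option, shown here as an assumption and not as existing code, is a single helper that returns a description of the problem or `null` when the position is well-formed. `describePositionError()` is a hypothetical name; it checks structure only, not file indexes or file sizes.

```php
<?php
/**
 * Describe what is wrong with a chunk position, or return null if it is well-formed.
 *
 * @param mixed $position Expected shape: [fileIndex, offset], both non-negative integers.
 */
function describePositionError($position): ?string
{
    if (!is_array($position)) {
        return 'Position is not an array: ' . gettype($position);
    }
    // Require keys 0 and 1 explicitly so associative arrays are rejected as well.
    if (count($position) !== 2 || !array_key_exists(0, $position) || !array_key_exists(1, $position)) {
        return 'Position must be a two-element list [fileIndex, offset]';
    }
    if (!is_int($position[0]) || !is_int($position[1])) {
        return 'Position contains non-integer values';
    }
    if ($position[0] < 0 || $position[1] < 0) {
        return 'Position contains negative values';
    }
    return null;
}

$validResult   = describePositionError([1, 1600000000]);
$invalidResult = describePositionError([1, '5']);
```

The explicit key check covers a case the inline versions do not: a two-element associative array would pass `count()` and then fail on `$position[0]`.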

### Phase 2: Root Cause Fix - UploadUuid Protection (HIGH PRIORITY)

This addresses the DupCloud multipart upload state loss issue.

#### Fix 2.1: Move Multipart State to generalExtraData
**File**: `addons/dupcloudaddon/src/Utils/DupCloudStorageAdapter.php` (lines 534-596)

**Problem**: `copyExtraData` is cleared when transitioning between files (installer → archive), losing the UploadUuid needed for multipart uploads.

**Solution**: Store multipart upload state in `generalExtraData` which persists across file transitions.

**Change at line 534**:
```php
if ($offset === 0) {
    // ... existing backup_details validation ...

    // Initialize multipart state in generalExtraData (survives file changes).
    // Also start fresh when the stored state belongs to a different file, so a
    // new file never reuses state left over from the previous one.
    if (
        !isset($generalExtraData['multipartState']) ||
        $generalExtraData['multipartState']['fileName'] !== $storageFile
    ) {
        $partCount = $this->getRequestPartCount($offset, $length, $fileSize);
        $result    = $this->client->startMultipart($partCount, $backupDetails, $this->backupType);

        $generalExtraData['multipartState'] = [
            'UploadUuid' => $result['uuid'],
            'UploadUrls' => $result['urls'],
            'Parts'      => [],
            'fileName'   => $storageFile, // Track which file this is for
        ];

        DupLog::trace('[DUPCLOUD] Started multipart upload: UUID=' .
            substr($result['uuid'], 0, 8) . '...');
    }
} elseif (!isset($generalExtraData['multipartState']['UploadUuid'])) {
    // Resuming without UUID is a critical error
    DupLog::error('[DUPCLOUD] CRITICAL: Resuming at offset ' . $offset .
        ' but UploadUuid missing from generalExtraData');
    throw new Exception('Upload UUID has to be set to continue multipart upload');
}

// Bind by reference once, after the state is guaranteed to exist
$multipartState = &$generalExtraData['multipartState'];

**Continue using `$multipartState` throughout the rest of the function** instead of `$extraData`.

**Critical lines to update**:
- Line 551: `$partIndex = count($multipartState['Parts']) + 1;`
- Line 554: `$partUploadUrl = $multipartState['UploadUrls'][$partIndex - 1];`
- Line 570: `$multipartState['Parts'][] = [ ... ];`
- Line 583: `'parts' => $multipartState['Parts'],`

**Rationale**: `generalExtraData` survives file transitions, preventing UploadUuid loss.
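
**Sketch**: The difference between the two state buckets can be modelled without the DupCloud client. Everything below is hypothetical and exists only to illustrate the scoping: `UploadState` stands in for the upload info object, and `$startUpload` stands in for the remote call that returns a new upload UUID.

```php
<?php
/** Hypothetical model of the per-file and general state buckets. */
final class UploadState
{
    /** @var array Per-file data, cleared on every file transition. */
    public $copyExtraData = [];

    /** @var array Data that persists across file transitions. */
    public $generalExtraData = [];

    public function onFileTransition(): void
    {
        $this->copyExtraData = [];
    }
}

/**
 * Return the multipart state for $fileName, starting a new upload when no state
 * exists or when the stored state belongs to a different file.
 *
 * @param callable $startUpload Takes no arguments; returns a new upload UUID string.
 */
function multipartStateFor(UploadState $state, string $fileName, callable $startUpload): array
{
    $current = $state->generalExtraData['multipartState'] ?? null;
    if ($current === null || $current['fileName'] !== $fileName) {
        $current = ['UploadUuid' => $startUpload(), 'Parts' => [], 'fileName' => $fileName];
        $state->generalExtraData['multipartState'] = $current;
    }
    return $current;
}

$state = new UploadState();
$uuids = ['uuid-installer', 'uuid-archive'];
$start = function () use (&$uuids): string {
    return array_shift($uuids);
};

$first = multipartStateFor($state, 'installer.php', $start);
$state->onFileTransition();                                  // Clears copyExtraData only.
$again = multipartStateFor($state, 'installer.php', $start); // Same UUID as $first.
$next  = multipartStateFor($state, 'archive.zip', $start);   // New UUID for the new file.
```

Keying the stored state by file name is what prevents the longer-lived bucket from leaking one file's upload into the next.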

#### Fix 2.2: Log State Transitions
**File**: `src/Package/Storage/StorageTransferChunkFiles.php` (lines 168-171)

**Add logging to understand when clearing happens**:
```php
$it = new TimeoutFileCopyIterator($extraData['replacements'], $adapter, function (): void {
    DupLog::trace('[CHUNK] File transition - clearing per-file data');
    DupLog::trace('[CHUNK] generalExtraData preserved: ' .
        json_encode(array_keys($this->uploadInfo->generalExtraData)));

    // Reset per-file extra data when file changes
    $this->uploadInfo->copyExtraData = [];
});
```

**Rationale**: Visibility into when state clearing happens helps debug similar issues.

### Phase 3: Memory Management (MEDIUM PRIORITY)

Prevents serialization failures at large file sizes.

#### Fix 3.1: Memory Headroom Check
**File**: `src/Package/AbstractPackage.php` (line 812)

**Add before serialization**:
```php
// Check memory headroom before expensive serialization
$memoryLimit = ini_get('memory_limit');
if ($memoryLimit !== '-1') {
    $memoryLimitBytes = SnapUtil::convertToBytes($memoryLimit);
    $currentMemory = memory_get_usage(true);
    $headroom = $memoryLimitBytes - $currentMemory;

    if ($headroom < 50 * MB_IN_BYTES) {
        DupLog::trace('[PACKAGE] Low memory before serialization. Limit: ' . $memoryLimit .
            ', Used: ' . size_format($currentMemory) . ', Headroom: ' . size_format($headroom));

        // Aggressively clean before serialization
        gc_collect_cycles();
    }
}

$packageObj = JsonSerialize::serialize($this, JSON_PRETTY_PRINT | JsonSerialize::JSON_SKIP_CLASS_NAME);

if (!$packageObj) {
    DupLog::error('[PACKAGE] Serialization failed! Memory: ' .
        size_format(memory_get_usage(true)) . '/' . ini_get('memory_limit'));

    // ... existing error handling ...
}
```

**Rationale**: Early warning and cleanup can prevent serialization failures.
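
**Sketch**: The headroom check depends on converting the `memory_limit` shorthand to bytes. The snippet above relies on `SnapUtil::convertToBytes` for that; the standalone version below is a hypothetical stand-in written for illustration, assuming the usual `K`, `M`, and `G` suffixes and `-1` for no limit.

```php
<?php
/**
 * Convert a php.ini shorthand size ("256M", "1G", "512K", or a plain byte count) to bytes.
 * Returns null for "-1", which means no limit.
 */
function iniSizeToBytes(string $value): ?int
{
    $value = trim($value);
    if ($value === '-1') {
        return null;
    }

    $number = (int) $value; // An explicit cast reads the leading digits: "256M" gives 256.
    switch (strtoupper(substr($value, -1))) {
        case 'G':
            return $number * 1024 * 1024 * 1024;
        case 'M':
            return $number * 1024 * 1024;
        case 'K':
            return $number * 1024;
        default:
            return $number;
    }
}

/** Bytes left before the limit is reached, or null when there is no limit. */
function memoryHeadroom(string $limit, int $usedBytes): ?int
{
    $limitBytes = iniSizeToBytes($limit);
    return $limitBytes === null ? null : $limitBytes - $usedBytes;
}

$headroom = memoryHeadroom((string) ini_get('memory_limit'), memory_get_usage(true));
```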

#### Fix 3.2: Add Diagnostic Logging to package->update()
**File**: `src/Package/AbstractPackage.php` (line 818)

**Add after serialization**:
```php
$packageSize = strlen($packageObj);
if ($packageSize > 1 * MB_IN_BYTES) {
    DupLog::trace('[PACKAGE] Large object: ' . size_format($packageSize));
}

// ... continue with database update ...

if ($updateResult === false) {
    DupLog::error('[PACKAGE] Database update failed: ' . $wpdb->last_error);
    DupLog::error('[PACKAGE] Package size: ' . size_format($packageSize));
    // ... existing error handling ...
}
```

**Rationale**: Understanding when and why updates fail helps diagnose production issues.

### Phase 4: Defensive Validation (OPTIONAL - Nice to Have)

Additional safety checks for long-term robustness.

#### Fix 4.1: Position Rollback on Save Failure
**File**: `src/Package/Storage/UploadPackageFilePersistanceAdapter.php`

**Enhance savePersistanceData**:
```php
public function savePersistanceData($data, GenericSeekableIteratorInterface $it): bool
{
    // Capture old state for rollback
    $oldPosition = $this->uploadInfo->chunkPosition;
    $oldProgress = $this->uploadInfo->progress;

    // Update state
    $this->uploadInfo->progress      = $it->getProgressPerc();
    $this->uploadInfo->chunkPosition = $data;

    // Retry with exponential backoff
    $maxRetries = 3;
    $delay = 100000; // 100ms

    for ($attempt = 1; $attempt <= $maxRetries; $attempt++) {
        if ($this->package->update(false)) {
            return true;
        }

        DupLog::error('[CHUNK] Package update failed (attempt ' . $attempt . '/' . $maxRetries . ')');

        if ($attempt < $maxRetries) {
            usleep($delay);
            $delay *= 2; // Exponential backoff
        }
    }

    // All retries failed - rollback in-memory state
    DupLog::error('[CHUNK] CRITICAL: Rolling back position due to persistent save failure');
    $this->uploadInfo->chunkPosition = $oldPosition;
    $this->uploadInfo->progress = $oldProgress;

    return false;
}
```

**Rationale**: If we can't persist, memory should match last known good database state.
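
**Sketch**: The rollback path only runs when every retry fails, which is hard to trigger against a real database. The version below is a self-contained illustration with injected callables so that path can be tested directly; `UploadInfoStub` and `savePositionOrRollback()` are hypothetical names, not existing code.

```php
<?php
/** Hypothetical minimal stand-in for the upload info object. */
final class UploadInfoStub
{
    public $chunkPosition = [];
    public $progress      = 0.0;
}

/**
 * Apply a new position, persist it with exponential backoff, and restore the
 * previous in-memory values if every attempt fails.
 *
 * @param callable $persist Takes no arguments; returns true when the state was saved.
 * @param callable $sleep   Receives the delay in microseconds.
 */
function savePositionOrRollback(
    UploadInfoStub $info,
    array $newPosition,
    float $newProgress,
    callable $persist,
    callable $sleep,
    int $maxRetries = 3,
    int $initialDelayUs = 100000
): bool {
    $oldPosition = $info->chunkPosition;
    $oldProgress = $info->progress;

    $info->chunkPosition = $newPosition;
    $info->progress      = $newProgress;

    $delay = $initialDelayUs;
    for ($attempt = 1; $attempt <= $maxRetries; $attempt++) {
        if ($persist() === true) {
            return true;
        }
        if ($attempt < $maxRetries) {
            $sleep($delay);
            $delay *= 2; // 100ms, then 200ms with the defaults.
        }
    }

    // Every attempt failed: restore the last values known to be persisted.
    $info->chunkPosition = $oldPosition;
    $info->progress      = $oldProgress;
    return false;
}

$info                = new UploadInfoStub();
$info->chunkPosition = [1, 100];
$info->progress      = 10.0;

$delays = [];
$saved  = savePositionOrRollback(
    $info,
    [1, 200],
    20.0,
    function (): bool {
        return false; // Simulate a persistent save failure.
    },
    function (int $us) use (&$delays): void {
        $delays[] = $us; // Record the delay without sleeping.
    }
);
```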

### Testing Strategy

#### Unit Tests
1. **ChunkingManager**: Test gSeek() validation with invalid positions
2. **TimeoutFileCopyIterator**: Test gSeek() error reporting for various failure modes
3. **UploadPackageFilePersistanceAdapter**: Test retry logic, validate position loading

#### Integration Tests
1. **Large File Upload**: Upload 2GB+ file, interrupt at various points, verify resume
2. **Memory Pressure**: Simulate memory exhaustion during package serialization
3. **Database Failure**: Simulate wpdb update failures, verify retry and rollback
4. **State Corruption**: Corrupt position data, verify detection and safe restart

#### Manual Testing
1. Real DupCloud upload with 2GB+ backup
2. Monitor logs for position tracking
3. Kill PHP process mid-upload, verify resume
4. Verify no "File already exists" errors on resume

## Implementation Order

### Sprint 1: Critical Fixes (Ship ASAP)
- ✅ Fix 1.1: gSeek() validation (ChunkingManager.php)
- ✅ Fix 1.2: Retry logic (UploadPackageFilePersistanceAdapter.php)
- ✅ Fix 1.3: Enhanced gSeek() errors (TimeoutFileCopyIterator.php)
- ✅ Fix 1.4: Position validation on load (UploadPackageFilePersistanceAdapter.php)

- **Estimated effort**: 4-6 hours
- **Risk**: LOW
- **Impact**: HIGH
- **Testing**: Unit + basic integration

### Sprint 2: Root Cause Fix (Ship Next)
- ✅ Fix 2.1: Move UploadUuid to generalExtraData (DupCloudStorageAdapter.php)
- ✅ Fix 2.2: State transition logging (StorageTransferChunkFiles.php)

- **Estimated effort**: 3-4 hours
- **Risk**: MEDIUM (changes DupCloud adapter)
- **Impact**: HIGH (prevents multipart upload failures)
- **Testing**: Full DupCloud integration test

### Sprint 3: Memory & Diagnostics (Ship Later)
- ✅ Fix 3.1: Memory headroom check (AbstractPackage.php)
- ✅ Fix 3.2: Diagnostic logging (AbstractPackage.php)

- **Estimated effort**: 2-3 hours
- **Risk**: LOW
- **Impact**: MEDIUM (improves diagnostics)
- **Testing**: Manual testing under memory pressure

### Sprint 4: Defensive Measures (Optional)
- ✅ Fix 4.1: Position rollback (UploadPackageFilePersistanceAdapter.php)

- **Estimated effort**: 2 hours
- **Risk**: LOW
- **Impact**: LOW (incremental safety)
- **Testing**: Chaos testing with simulated failures

## Files to Modify

### Critical Path (Sprint 1 & 2):
1. `src/Libs/Chunking/ChunkingManager.php` - gSeek validation, error handling
2. `src/Package/Storage/UploadPackageFilePersistanceAdapter.php` - Retry logic, validation
3. `src/Libs/Chunking/Iterators/TimeoutFileCopyIterator.php` - Enhanced error reporting
4. `addons/dupcloudaddon/src/Utils/DupCloudStorageAdapter.php` - UploadUuid in generalExtraData
5. `src/Package/Storage/StorageTransferChunkFiles.php` - State transition logging

### Secondary (Sprint 3):
6. `src/Package/AbstractPackage.php` - Memory management, diagnostics

**Total LOC**: ~200-250 lines across 6 files

## Success Criteria

- ✅ **No position resets**: Uploads resume from correct position after interruption
- ✅ **Clear diagnostics**: Errors include details about what failed and why
- ✅ **Transient failure handling**: Database failures are retried, not silently ignored
- ✅ **Multipart state preserved**: UploadUuid survives file transitions
- ✅ **Production ready**: Works under memory pressure on production sites

## Rollback Plan

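If issues arise, revert to branch `PR_4.5.24.1` (current stable). Factors that keep a rollback simple:
1. Most changes are additive (new validations, logging); the exception is Fix 2.1, which moves multipart state from `copyExtraData` to `generalExtraData`
2. No API changes, backward compatible
3. Can deploy incrementally (Sprint 1 alone is beneficial)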

## Monitoring

Add to production monitoring:
- Count of chunk position validation failures
- Count of package update retry attempts
- Count of serialization failures
- Average package object size over time
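
**Sketch**: The document does not say how these numbers would be collected. As one possible starting point, a small in-process counter object could be incremented at the log sites and flushed to whatever monitoring backend is in use. `ChunkMetrics` and the counter names below are hypothetical.

```php
<?php
/** Hypothetical in-process counters for the metrics listed above. */
final class ChunkMetrics
{
    /** @var array Counter name => count. */
    private $counters = [];

    private $sizeTotal   = 0;
    private $sizeSamples = 0;

    public function increment(string $name): void
    {
        $this->counters[$name] = ($this->counters[$name] ?? 0) + 1;
    }

    public function recordPackageSize(int $bytes): void
    {
        $this->sizeTotal += $bytes;
        $this->sizeSamples++;
    }

    public function count(string $name): int
    {
        return $this->counters[$name] ?? 0;
    }

    public function averagePackageSize(): float
    {
        return $this->sizeSamples === 0 ? 0.0 : $this->sizeTotal / $this->sizeSamples;
    }
}

$metrics = new ChunkMetrics();
$metrics->increment('position_validation_failure');
$metrics->increment('package_update_retry');
$metrics->increment('package_update_retry');
$metrics->recordPackageSize(1000);
$metrics->recordPackageSize(2000);
```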

## Risk Assessment

**Overall Risk**: LOW-MEDIUM

- ✅ Most changes are defensive (add validation and logging without changing upload logic)
- ✅ Backward compatible (old positions still work)
- ✅ Fail-safe (corruption detected → restart, not undefined behavior)
- ⚠️ DupCloud adapter changes require thorough testing
- ✅ Can ship incrementally (Sprint 1 is low-risk, high-value)