Source code for GUIBRUSHR.Retrieval.ExofastMCMC.exofast_demc

"""
EXOFAST DEMC (Differential Evolution Markov Chain) Implementation.

This module implements the DEMC posterior sampling algorithm for parameter estimation
in exoplanet fitting. The code maintains the exact logic and sequence of operations
from the original implementation while improving readability and documentation.

PARALLEL VERSION: A multiprocessing.Pool variant of first-step chain initialization
(_initialize_first_chain_step_parallel) is kept below as commented-out code; the
uncommented implementation in this module is _initialize_first_chain_step_serial.
"""

_GLOSSARY = """
===========================================================================================
VARIABLE GLOSSARY - exofast_demc.py
===========================================================================================

This module implements the ExoFast Differential Evolution MCMC (DEMC) algorithm, ported
from IDL (Eastman et al. 2013, PASP 125, 83). Many variable names differ from the IDL
source or carry historical misnomers; this glossary documents the mapping and semantics.

The algorithm follows ter Braak (2006) "A Markov Chain Monte Carlo version of the genetic
algorithm Differential Evolution" and uses Gelman-Rubin / Ford (2006) convergence
diagnostics.

-------------------------------------------------------------------------------------------
--- ALGORITHM PARAMETERS ---
-------------------------------------------------------------------------------------------

nthin
    Thinning factor. Each chain executes `nthin` DE proposals per saved step, discarding
    all intermediate states except the last. Used to reduce autocorrelation between saved
    samples without ballooning memory. IDL equivalent: `for k=0,nthin-1 do begin`.
    Example: nthin=10 -> 10 proposals per stored step; only the final state is written to
    the chain array.
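
    A minimal sketch of the thinning bookkeeping, assuming a hypothetical `advance`
    callable that stands in for one DE proposal/accept step (it is not part of this
    module):

```python
import numpy as np


def run_thinned(state, advance, nsteps, nthin):
    # One stored value per saved step; the nthin - 1 intermediate states
    # visited in between are never written to the output array.
    saved = np.empty(nsteps)
    for i in range(nsteps):
        for _ in range(nthin):
            state = advance(state)
        saved[i] = state
    return saved


# Hypothetical deterministic update so the bookkeeping is easy to follow.
chain = run_thinned(0.0, lambda s: s + 1.0, nsteps=3, nthin=10)
```

    With this counter-like update the three stored values are 10, 20 and 30: thirty
    updates were executed, but only every tenth state was kept.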

nchains
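    Total number of MCMC chains. The chains interact: each DE proposal for chain j is
    built from two other chains. Must be divisible by `ncores` (so that every core gets
    the same number of chains), and must be > 2 so that two distinct reference chains
    (r1, r2 != j) exist for each proposal; the Gelman-Rubin diagnostic itself needs at
    least 2 chains. Typical values: 8, 16, 32.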

ncores
    Number of parallel CPU cores used to advance chains. The chains are partitioned into
    `chain_per_core = nchains // ncores` chains per worker process. Each worker advances
    its slice independently inside `parallel_chain` / `run_multiple_processes`.

nfit
    Number of free parameters being fitted (dimensionality of the posterior). Determines
    the length of the first axis of `pars` and the DE scale factor `gamma_coeff`.

maxsteps
    Total number of MCMC steps the chain array is sized for. Includes any previously
    completed steps when `moresteps=True` (resume mode). Controls the second axis of
    `pars`, `lhood`, and `log_prior_chain`.

max_steps_run
    Number of steps to actually run in THIS invocation. Differs from `maxsteps` only when
    `moresteps=True`: in that case `max_steps_run = maxsteps - already_done`, while
    `maxsteps` still refers to the full concatenated chain length.

nextrecalc
    Step index at which the next Gelman-Rubin convergence check will be performed. IDL
    heuristic: initialized to 100. If convergence is NOT yet reached at step i, it is
    pushed forward to `int(i / 0.9)` (i.e., check again after another ~11% growth in
    chain length). If convergence IS reached, it may be jumped ahead further to stabilize.
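
    The failed-check schedule can be sketched as follows. This only illustrates the
    `int(i / 0.9)` rule, assuming every check fails; `check_schedule` is a hypothetical
    helper, not a function of this module:

```python
def check_schedule(first_check, nchecks):
    # Step indices of successive convergence checks, assuming each one fails.
    steps = [first_check]
    for _ in range(nchecks - 1):
        steps.append(int(steps[-1] / 0.9))
    return steps


schedule = check_schedule(100, 5)
```

    Starting from 100, the next checks fall at steps 111, 123, 136 and 151, so the
    spacing between checks grows with the chain length.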

npass
    Count of CONSECUTIVE convergence checks that have passed. 0-based: initialized
    to 0 and reset to 0 on any failed check, so it is the true number of passes
    accumulated. Convergence is declared only once `npass` reaches `maxnpass`
    (default 6, configurable from the GUI), i.e. after `maxnpass` consecutive
    passing checks, preventing premature termination on a lucky draw. (IDL used a
    1-based counter that stopped at 6 = only 5 real passes; we require the
    documented count.)

nstop
    Step index where convergence was first declared. Used at the end to trim the chain
    arrays to their effective length (and to distinguish burn-in from converged samples).

burnndx
    Burn-in index. Computed as the first step `i` at which the median log-likelihood
    across chains `median(lhood[i, :])` exceeds the global median of log-likelihoods.
    All steps with index < burnndx are considered burn-in and discarded from posterior
    summaries. This is the ExoFast convention; higher `lhood` means better fit.
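
    A minimal sketch of this rule with NumPy. The function name and the fallback to 0
    when no step exceeds the global median are assumptions of the sketch, not taken
    from this module:

```python
import numpy as np


def burn_in_index(lhood):
    # lhood has shape (nsteps, nchains); higher values mean a better fit.
    per_step_median = np.median(lhood, axis=1)
    global_median = np.median(lhood)
    above = np.nonzero(per_step_median > global_median)[0]
    # Assumed fallback: keep everything if no step beats the global median.
    return int(above[0]) if above.size else 0


lhood_demo = np.array([
    [-50.0, -40.0],
    [-20.0, -30.0],
    [-5.0, -6.0],
    [-5.0, -4.0],
    [-4.0, -5.0],
    [-5.0, -5.0],
])
burnndx_demo = burn_in_index(lhood_demo)
```

    Here the global median is -5 and step 3 is the first whose across-chain median
    (-4.5) exceeds it, so steps 0 to 2 would be treated as burn-in.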

-------------------------------------------------------------------------------------------
--- CHAIN ARRAYS (shape conventions) ---
-------------------------------------------------------------------------------------------

pars : ndarray, shape (nfit, maxsteps, nchains)
    Parameter values at each step for each chain. Axis 0 indexes free parameters,
    axis 1 indexes MCMC step, axis 2 indexes chain. This is the primary posterior sample
    container.

lhood : ndarray, shape (maxsteps, nchains)
    Log-likelihood (actually log-posterior contribution from the data term) at each step
    and chain. IDL name: `chi2`. WARNING: despite the IDL name this is NOT a chi-squared
    value. It is a log-likelihood returned by `lh_function_gib`: HIGHER means BETTER fit.
    The Metropolis-Hastings acceptance test uses `exp(lhood_new - lhood_old)`, consistent
    with log-likelihoods (not chi-squared, which would use `exp(-0.5*(chi2_new-chi2_old))`).

log_prior_chain : ndarray, shape (maxsteps, nchains)
    Logarithm of the Gaussian prior product at each step and chain:
        log_prior = sum_k  -0.5 * ((theta_k - mu_k) / sigma_k)**2
    for every parameter with a Gaussian prior (uniform-prior parameters contribute 0).
    IDL names: `det_chain`, `determinant`. WARNING: the IDL name is severely misleading;
    this is NOT a matrix determinant. The quantity is computed in log space because, for
    the typical N~150-dimensional problem, the linear product underflows double precision.
    See `calculate_log_prior` for the full formula.
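
    The underflow argument can be checked numerically. The sketch below is not
    `calculate_log_prior`; the function name and the use of sigma = inf to mark a
    uniform-prior parameter are assumptions made for illustration:

```python
import numpy as np


def log_gaussian_prior(theta, mu, sigma):
    # sigma = inf is this sketch's way of marking a uniform-prior parameter:
    # its standardized residual is 0, so it contributes nothing to the sum.
    z = (theta - mu) / sigma
    return float(np.sum(-0.5 * z**2))


nfit = 150
mu = np.zeros(nfit)
sigma = np.ones(nfit)
theta = mu + 4.0 * sigma  # every parameter sits 4 sigma from its prior mean

log_prior = log_gaussian_prior(theta, mu, sigma)
linear_product = float(np.prod(np.exp(-0.5 * ((theta - mu) / sigma) ** 2)))
```

    The log-space value is a finite -1200, while the linear product exp(-1200)
    underflows to exactly 0.0 in double precision, which would make any ratio of
    priors undefined.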

-------------------------------------------------------------------------------------------
--- DE-MCMC ALGORITHM VARIABLES (inside parallel_chain / run_multiple_processes) ---
-------------------------------------------------------------------------------------------

gamma_coeff
    DE scale factor (ter Braak 2006). Defined as:
        gamma_coeff = 2.38 / sqrt(2 * nfit)
    This value is the theoretically optimal scaling for proposals of the form
    `theta_new = theta_j + gamma * (theta_r1 - theta_r2)` under a Gaussian target.

r1, r2
    Indices of two randomly chosen reference chains used to build the differential vector
    `pars[:, r1] - pars[:, r2]`. Constraints: r1 != j (the chain being updated),
    r2 != j, and r2 != r1. Drawn uniformly without replacement from the remaining chains.

epsilon
    Small uniform perturbation added to each DE proposal to guarantee ergodicity:
        epsilon = (U(0,1) - 0.5) * 2 * scale / epsilon_scale_divisor
    IDL: `(randomu-0.5d0)*2d0*scale/100d0` (divisor hardcoded to 100 in the original
    EXOFAST IDL code). Here the divisor is configurable at runtime via
    `retrieval_data.epsilon_scale_divisor` (default 100.0 = IDL behaviour). NOTE: this
    perturbation is UNIFORM, NOT Gaussian, despite occasional literature recommendations
    to use N(0, sigma).
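
    Putting `gamma_coeff`, `r1`/`r2` and `epsilon` together, one proposal can be
    sketched as below. `de_proposal` and its `current` argument (a 2-D snapshot of all
    chains at the present step) are hypothetical; the real update lives in
    `parallel_chain` and may be organized differently:

```python
import numpy as np


def de_proposal(current, j, scale, rng, epsilon_scale_divisor=100.0):
    # current: (nfit, nchains) array with the present state of every chain.
    nfit, nchains = current.shape
    gamma_coeff = 2.38 / np.sqrt(2 * nfit)
    # Two distinct reference chains, neither of them the chain being updated.
    others = np.delete(np.arange(nchains), j)
    r1, r2 = (int(r) for r in rng.choice(others, size=2, replace=False))
    # Uniform jitter in [-scale/divisor, +scale/divisor) for each parameter.
    epsilon = (rng.random(nfit) - 0.5) * 2.0 * scale / epsilon_scale_divisor
    diff = current[:, r1] - current[:, r2]
    proposal = current[:, j] + gamma_coeff * diff + epsilon
    return proposal, r1, r2


rng = np.random.default_rng(0)
current = rng.normal(size=(4, 8))  # nfit = 4, nchains = 8
scale = np.full(4, 0.1)
proposal, r1, r2 = de_proposal(current, 3, scale, rng)
```

    Sampling from `others` without replacement enforces all three index constraints
    at once, and it requires at least three chains.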

log_C
    Logarithm of the Metropolis-Hastings acceptance ratio:
        log_C = (log_prior_new - log_prior_old) + (lhood_new - lhood_old)
    IDL multiplicative form: `C = (newdet/olddet) * exp(newchi2 - oldchi2)`. The proposal
    is accepted iff `log(U(0,1)) < log_C` (equivalent to `U(0,1) < C` in linear space).
    Working in log space avoids overflow/underflow when priors span many decades.
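
    A minimal sketch of the log-space acceptance test. `mh_accept` is a hypothetical
    helper and the explicit handling of u == 0 is an assumption of the sketch:

```python
import math


def mh_accept(log_prior_new, lhood_new, log_prior_old, lhood_old, u):
    log_C = (log_prior_new - log_prior_old) + (lhood_new - lhood_old)
    # u is a U(0, 1) draw. log(0) would be -inf, which accepts any proposal
    # with log_C > -inf, so that case is handled without calling math.log(0).
    if u <= 0.0:
        return log_C > -math.inf
    return math.log(u) < log_C


better = mh_accept(0.0, -10.0, 0.0, -12.0, u=0.5)      # log_C = +2
worse_high_u = mh_accept(0.0, -12.0, 0.0, -10.0, u=0.5)  # log_C = -2
worse_low_u = mh_accept(0.0, -12.0, 0.0, -10.0, u=0.1)   # log_C = -2
huge_gain = mh_accept(0.0, 5000.0, 0.0, 0.0, u=0.5)    # exp(5000) would overflow
```

    An improvement is always accepted, a worse proposal is accepted only when the
    uniform draw is small enough, and a proposal with `lhood_new = -inf` is always
    rejected because no finite `log(u)` is below -inf.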

lhood (local)
    Log-likelihood returned by `lh_function_gib` for the proposed parameter vector.
    Includes the spectral data-fit term. IDL name: `chi2` (misnomer, see above).

log_prior (local)
    Log of the Gaussian prior product for the proposed parameter vector:
        log_prior = sum_k -0.5 * ((theta_k - mu_k) / sigma_k)**2
    IDL names: `det`, `determinant` (severe misnomer - this is not a matrix determinant
    of any kind). Computed in log space to avoid underflow on high-dimensional problems
    (~150 parameters).

-------------------------------------------------------------------------------------------
--- CONVERGENCE TRACKING ---
-------------------------------------------------------------------------------------------

gelmanrubin : ndarray, shape (nfit,)
    Gelman-Rubin R-hat statistic per parameter, computed by `exofast_gelmanrubin`
    following Ford (2006) eq. 25. Chains are deemed converged when all entries
    satisfy R-hat < 1.01.

tz : ndarray, shape (nfit,)
    Effective sample size factor per parameter (Ford 2006 eq. 26). Chains are deemed
    converged when all entries satisfy tz > 1000 (i.e., at least ~1000 independent
    samples per parameter).
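
    For orientation, the textbook form of both statistics for a single parameter is
    sketched below. This is not `exofast_gelmanrubin`; the function name is
    hypothetical and the module's implementation may differ in detail (for example in
    how it handles burn-in or degenerate chains):

```python
import numpy as np


def gelman_rubin(chains):
    # chains: (nsteps, nchains) samples of ONE parameter after burn-in.
    nsteps, nchains = chains.shape
    within = np.mean(np.var(chains, axis=0, ddof=1))             # W
    between = nsteps * np.var(np.mean(chains, axis=0), ddof=1)   # B
    var_est = (nsteps - 1) / nsteps * within + between / nsteps
    rhat = np.sqrt(var_est / within)
    # Independent-draws estimate, capped at the total number of samples.
    tz = nchains * nsteps * min(var_est / between, 1.0)
    return float(rhat), float(tz)


rng = np.random.default_rng(1)
mixed = rng.normal(size=(2000, 8))   # chains sampling the same distribution
stuck = mixed + np.arange(8)         # chains offset from one another
rhat_mixed, tz_mixed = gelman_rubin(mixed)
rhat_stuck, tz_stuck = gelman_rubin(stuck)
```

    Well-mixed chains give R-hat close to 1 and a large tz, while chains that sit in
    different places inflate the between-chain variance, pushing R-hat well above 1.01
    and tz far below 1000.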

mintz : list of float
    Running record of `min(tz)` at each convergence check step. Used to extrapolate
    how many additional steps would be needed to reach the tz > 1000 threshold, and
    to emit a warning if extrapolation suggests convergence cannot be reached within
    `maxsteps`.

tzsteps : list of int
    Step indices at which `min(tz)` was recorded (parallel to `mintz`). Together they
    form the data used for the linear-extrapolation convergence forecast.
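
    A minimal sketch of such a forecast, assuming an ordinary least-squares straight
    line through the recorded points; the helper name and the choice of `np.polyfit`
    are assumptions, and the fit actually used by this module is not shown here:

```python
import numpy as np


def steps_to_reach(tzsteps, mintz, target=1000.0):
    # Straight-line fit of min(tz) against step index.
    slope, intercept = np.polyfit(tzsteps, mintz, 1)
    if slope <= 0:
        return np.inf  # no growth, so the target is never reached
    return float((target - intercept) / slope)


predicted = steps_to_reach([100, 111, 123], [200.0, 222.0, 246.0])
```

    In this example min(tz) grows by 2 per step, so the line crosses 1000 near step
    500; comparing that figure with `maxsteps` is what decides whether a warning is
    worth printing.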

alreadywarned : bool
    Flag tracking whether the "convergence may not be reached within maxsteps" warning
    has already been printed, so that it is emitted only once per run.

naccept
    Running total of accepted DE proposals across ALL chains and ALL `nthin` sub-steps.
    Divided by `(maxsteps * nchains * nthin)` at the end to report the overall
    acceptance rate (target: ~23% for high-dimensional Gaussian targets).

-------------------------------------------------------------------------------------------
--- PICKLE FILE KEYS (moresteps / crash recovery) ---
-------------------------------------------------------------------------------------------

"parameters"
    The `pars` array (nfit, nsteps_done, nchains) saved for resume.

"lhood"
    The `lhood` array (nsteps_done, nchains). For backward compatibility with older
    pickles saved under the IDL-inherited name, the loader also accepts the key "chi2"
    and treats it as `lhood` (same numerical content, just the old name).

"log_prior_chain"
    The `log_prior_chain` array (nsteps_done, nchains). For backward compatibility the
    loader also accepts the old IDL-inherited key "det_chain" and treats it as
    `log_prior_chain`.

===========================================================================================
"""

import datetime
import traceback
from re import sub as _resub_digits
# import multiprocessing as mp

import numpy as np
import pickle
from GUIBRUSHR.Retrieval.ExofastMCMC.exofast_gelmanrubin import exofast_gelmanrubin
from GUIBRUSHR.Retrieval.debug_log import (
    emit_event,
    init_log,
    get_log_path,
    BufferedLogger,
)


# Parent-side buffer used to batch MCMC-driver events so the JSONL debug log
# is flushed at distribution-save checkpoints (~every 10 outer_steps),
# instead of once per iteration. ``_pemit`` routes records to the buffer when
# one is active (inside ``run_mcmc_sampling``) and falls back to direct append
# otherwise, so helpers invoked before the buffer is set up (init, finalize)
# still log immediately.
_PARENT_BUFFER: "BufferedLogger | None" = None


def _pemit(path, record):
    """Parent-side emit: buffered when the MCMC loop owns a BufferedLogger,
    direct atomic append otherwise. Never raises.
    """
    if _PARENT_BUFFER is not None:
        _PARENT_BUFFER.emit(record)
    else:
        emit_event(path, record)



def time_left_units(timeleft):
    """
    Convert time in seconds to the most convenient time units.

    This function takes a time duration in seconds and converts it to the most
    appropriate unit (seconds, minutes, hours, or days) for display purposes.

    Parameters
    ----------
    timeleft : float
        Remaining time in seconds.

    Returns
    -------
    tuple
        A tuple containing (converted_time, units_string).
    """
    units = "seconds"

    # Convert to minutes if more than 60 seconds
    if timeleft > 60:
        timeleft /= 60
        units = "minutes"

        # Convert to hours if more than 60 minutes
        if timeleft > 60:
            timeleft /= 60
            units = "hours"

            # Convert to days if more than 24 hours
            if timeleft > 24:
                timeleft /= 24
                units = "days"

    return timeleft, units



def _initialize_chain_files(model_obj):
    """
    Initialize the file path for the resumable chain checkpoint.

    Parameters
    ----------
    model_obj :
        Model object containing retrieval data and configuration.

    Returns
    -------
    str
        Path to the partial_prob_and_chain_burnin.pkl checkpoint file. This is
        the only chain artifact that is read back (by resume and by the GUI).
    """
    partial_prob_and_chain = "partial_prob_and_chain_burnin.pkl"

    # Construct full file path using results directory
    partial_prob_and_chain = (
        model_obj.retrieval_data.path_results + partial_prob_and_chain
    )

    return partial_prob_and_chain


def _build_multi_expansion(model_obj):
    """Derive ``{base_name: [expanded_chain_names]}`` for the current run.

    Mirrors the chain-name routing used by ``ParameterHandler.create_param_full``:
    chain entries whose names contain a digit suffix are mapped back to the
    base parameter name (which must appear in ``list_multiple_param``).
    Returned dict is keyed by base name with chain-order expansion lists.

    Returns an empty dict if no multi-parameters are active. Returns ``None``
    if ``model_obj`` does not yet expose the required attributes (e.g. test
    fixtures); callers treat ``None`` as "skip optional metadata".
    """
    try:
        chain_names = list(model_obj.bestpars_data.list_bestpars)
        list_multi = list(model_obj.param_handler.list_multiple_param)
        params_list = list(model_obj.param_handler.params_list)
    except AttributeError:
        return None

    multi_exp: dict[str, list[str]] = {}
    for cn in chain_names:
        if cn in params_list:
            continue
        base = _resub_digits(r"\d+", "", cn)
        if base in list_multi:
            multi_exp.setdefault(base, []).append(cn)
    return multi_exp


def _validate_resume_schema(loaded, model_obj, debug_log_path=None):
    """Best-effort sanity check that a resumed pickle's chain ordering
    still matches the current ``bestpars_data.list_bestpars``.

    The on-disk pickle stores ``chain_names`` so that a mismatch between
    the parameters table the user just configured and the parameters the
    prior run was sampling is caught loudly instead of silently
    misassigning every saved value.

    Older pickles without ``chain_names`` (and non-dict legacy payloads)
    skip the check, and the resume proceeds. When ``chain_names`` is
    present but differs from the current list, a ``resume_error`` event
    is logged and ``RuntimeError`` is raised, aborting the resume.
    """
    if not isinstance(loaded, dict):
        return
    saved_names = loaded.get("chain_names")
    if saved_names is None:
        return
    current_names = list(model_obj.bestpars_data.list_bestpars)
    if list(saved_names) != current_names:
        msg = (
            "Resume aborted: chain_names in checkpoint do not match the "
            "current df_parameters list_bestpars.\n"
            f"  saved   ({len(saved_names)}): {list(saved_names)!r}\n"
            f"  current ({len(current_names)}): {current_names!r}\n"
            "Either restore the original df_parameters.yaml or start a "
            "fresh retrieval. (This guard exists because positional chain "
            "row index N silently changes meaning if rows are reordered.)"
        )
        _pemit(debug_log_path, {
            "event": "resume_error",
            "kind": "chain_names_mismatch",
            "saved": list(saved_names),
            "current": current_names,
        })
        raise RuntimeError(msg)


def _load_previous_chains(partial_prob_and_chain, pars, lhood, log_prior_chain, debug_log_path=None, model_obj=None):
    """
    Load and concatenate previous chain data when resuming sampling.

    Parameters
    ----------
    partial_prob_and_chain : str
        Path to the pickle file containing previous chain data.
    pars : ndarray
        Current parameter array to concatenate with.
    lhood : ndarray
        Current log-likelihood array to concatenate with.  # IDL: chi2
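    log_prior_chain : ndarray
        Current log-prior chain array to concatenate with.
        IDL: 'det_chain' (renamed); log of Gaussian prior product, see calculate_log_prior.
    debug_log_path : str, optional
        JSONL debug-log path used for resume warnings and errors (may be None).
    model_obj : optional
        Model object. When given, the checkpoint's ``chain_names`` are validated
        against it by ``_validate_resume_schema`` before anything is concatenated.

    Returns
    -------
    tuple
        Updated (pars, lhood, log_prior_chain, last_index_data_resumed, naccept).

    Raises
    ------
    RuntimeError
        If ``model_obj`` is given and the saved ``chain_names`` do not match it.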
    """
    with open(partial_prob_and_chain, "rb") as f:
        pars_and_chain = pickle.load(f)

    if model_obj is not None:
        _validate_resume_schema(pars_and_chain, model_obj, debug_log_path)

    # Handle both dictionary and legacy list formats
    if isinstance(pars_and_chain, dict):
        pars_old = pars_and_chain["parameters"]
        # Backward compat: read both new key 'lhood' and old key 'chi2'
        lhood_old = pars_and_chain.get("lhood", pars_and_chain.get("chi2"))  # IDL: chi2
        # IDL: 'det_chain' key — now stored as 'log_prior_chain'; fall back to legacy key
        log_prior_chain_old = pars_and_chain.get(
            "log_prior_chain", pars_and_chain.get("det_chain", None)
        )
    else:
        # Legacy format
        pars_old = pars_and_chain[0]
        lhood_old = pars_and_chain[1]  # IDL: chi2
        log_prior_chain_old = pars_and_chain[2] if len(pars_and_chain) > 2 else None

    # Concatenate old and new arrays
    pars = np.concatenate((pars_old, pars), axis=1)
    lhood = np.concatenate((lhood_old, lhood), axis=0)  # IDL: chi2

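    # Index of the last step restored from the checkpoint.
    last_index_data_resumed = len(lhood_old[:, 0]) - 1
    # Approximate the number of previously accepted proposals: a rejected
    # proposal repeats the previous state, so the distinct values of parameter 0
    # along each chain track its accepted moves. The count runs over the
    # concatenated array, so placeholder values in the not-yet-sampled tail of
    # `pars` are included as well.
    naccept = int(np.sum([len(np.unique(pars[0, :, i]))
                         for i in range(len(pars[0, 0, :]))]))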

    # Handle log-prior chain data with error handling
    try:
        if log_prior_chain_old is not None:
            log_prior_chain = np.concatenate((log_prior_chain_old, log_prior_chain), axis=0)
        else:
            raise ValueError("No log_prior_chain in loaded data")
    except Exception as e:
        _pemit(debug_log_path, {
            "event": "resume_warning",
            "kind": "no_prior_log_prior_chain",
            "error": str(e),
            "action": "set_log_prior_chain_to_zero",
        })
        # IDL fallback was np.ones (det=1 → no prior). Log-space equivalent: 0 (log(1)=0)
        log_prior_chain = np.concatenate((np.zeros(np.shape(lhood_old)), log_prior_chain), axis=0)

    return pars, lhood, log_prior_chain, last_index_data_resumed, naccept


def _extract_model_data_for_parallel(model_obj):
    """
    Extract necessary data from model_obj for parallel workers.

    This function extracts all data needed by the parallel workers to recreate
    param_full objects and calculate likelihoods without access to the full model_obj.

    Parameters
    ----------
    model_obj : 
        The model object containing all necessary methods and data

    Returns
    -------
    dict
        Dictionary containing serializable model data needed by workers
    """
    # Extract the essential data for parameter creation and likelihood calculation
    model_data = {
        # Core parameter data
        'bestpars_initial': model_obj.bestpars_data.list_bestpars_initial_value.copy(),
        'scale_vector': model_obj.retrieval_data.scale_vector_params.copy(),

        # Add any other data your model_obj needs for create_param_full and lh_function_gib
        # You may need to customize this based on your specific model structure
        # Examples (uncomment and adapt as needed):
        # 'observations': getattr(model_obj, 'observations', None),
        # 'covariance_matrix': getattr(model_obj, 'covariance_matrix', None),
        # 'parameter_bounds': getattr(model_obj, 'parameter_bounds', None),
        # 'instrument_data': getattr(model_obj, 'instrument_data', None),

        # Add a reference to the model object itself if it's serializable
        # Otherwise, extract specific methods or data structures needed
        'model_obj_ref': model_obj  # This works if model_obj is pickleable
    }

    return model_data


def _create_param_full_parallel(candidate_pars, model_data):
    """
    Parallel-safe version of model_obj.create_param_full().

    This function recreates the param_full object using the extracted model data,
    enabling parallel workers to check parameter boundaries without the full model_obj.

    Parameters
    ----------
    candidate_pars : ndarray
        Candidate parameter values
    model_data : dict
        Dictionary containing extracted model data

    Returns
    -------
    list
        List of parameter objects with boundary checking capability
    """
    # If the full model_obj is available and serializable, use it directly
    if 'model_obj_ref' in model_data and model_data['model_obj_ref'] is not None:
        return model_data['model_obj_ref'].create_param_full(candidate_pars)

    # Otherwise, implement a standalone version based on your parameter structure
    # This is where you would implement the logic from your create_param_full method
    # without requiring the full model_obj

    # Example implementation (customize based on your actual parameter structure):
    # param_full = []
    # for i, par_value in enumerate(candidate_pars):
    #     # Create parameter objects based on your specific parameter classes
    #     param_obj = YourParameterClass(
    #         value=par_value,
    #         bounds=model_data.get('parameter_bounds', {}).get(i, None),
    #         # Add other necessary parameter attributes
    #     )
    #     param_full.append(param_obj)
    # return param_full

    raise NotImplementedError(
        "You need to implement _create_param_full_parallel() based on your "
        "model_obj.create_param_full() method. Either ensure model_obj is "
        "serializable or implement a standalone version of parameter creation."
    )


def _calculate_likelihood_parallel(param_full, model_data):
    """
    Parallel-safe version of model_obj.lh_function_gib().

    This function calculates the likelihood using the extracted model data,
    enabling parallel workers to compute the log-likelihood and log-prior values
    (IDL names: chi2 and determinant; see the glossary for why those are misnomers).

    Parameters
    ----------
    param_full : list
        List of parameter objects created by _create_param_full_parallel
    model_data : dict
        Dictionary containing extracted model data

    Returns
    -------
    tuple
        (lhood, log_prior, additional_info) - same format as lh_function_gib
    """
    # If the full model_obj is available and serializable, use it directly
    if 'model_obj_ref' in model_data and model_data['model_obj_ref'] is not None:
        return model_data['model_obj_ref'].lh_function_gib(param_full)

    # Otherwise, implement a standalone version based on your likelihood calculation
    # This is where you would implement the logic from your lh_function_gib method
    # without requiring the full model_obj

    # Example implementation (customize based on your actual likelihood calculation):
    # param_values = np.array([p.value for p in param_full if p is not None])
    #
    # # Implement your specific likelihood calculation here
    # chi2_val = your_chi2_calculation(param_values, model_data)
    # det_val = your_determinant_calculation(param_values, model_data)
    # additional_info = None  # or whatever your lh_function_gib returns as third element
    #
    # return chi2_val, det_val, additional_info

    raise NotImplementedError(
        "You need to implement _calculate_likelihood_parallel() based on your "
        "model_obj.lh_function_gib() method. Either ensure model_obj is "
        "serializable or implement a standalone version of likelihood calculation."
    )


# def _initialize_single_chain_worker(kr, bestpars_initial, scale_vector, nfit, seed_base, model_data):
#     """
#     Worker function to initialize a single chain with valid parameter values.
#
#     This function replicates the logic of the original for loop iteration,
#     generating initial parameter values for one chain and ensuring they
#     satisfy boundary conditions through iterative sampling.
#
#     Parameters
#     ----------
#     kr : int
#         Chain index
#     bestpars_initial : ndarray
#         Initial best parameter values
#     scale_vector : ndarray
#         Parameter scaling vector
#     nfit : int
#         Number of fitted parameters
#     seed_base : int
#         Base seed for random number generation
#     model_data : dict
#         Dictionary containing model data needed for param_full creation and likelihood
#
#     Returns
#     -------
#     tuple
#         (chain_index, parameters, chi2_value, determinant_value)
#     """
#     # Create independent RNG for this worker to avoid correlation between chains
#     rng = np.random.default_rng(seed=seed_base + kr)
#
#     param_full = None
#     cond_params = False
#
#     # Keep generating parameters until boundary conditions are satisfied
#     while not cond_params:
#         cond_params = True
#
#         # Generate random normal deviates
#         a = rng.standard_normal(nfit)
#
#         # Create parameter values using initial values and scaling
#         candidate_pars = bestpars_initial + scale_vector * a
#
#         # Create full parameter object and check boundaries
#         param_full = _create_param_full_parallel(candidate_pars, model_data)
#         for elem in param_full:
#             if elem is not None:
#                 cond_params = cond_params and elem.boundaries_check()
#
#     # Calculate likelihood for initial parameters
#     chi2_val, det_val, _ = _calculate_likelihood_parallel(param_full, model_data)
#
#     return kr, candidate_pars, chi2_val, det_val


# def _initialize_first_chain_step_parallel(model_obj, nchains, nfit, pars, chi2, det_chain, n_cores=None):
#     """
#     Initialize the first step of each chain with valid parameter values.
#
#     This function generates initial parameter values for each chain, ensuring
#     they satisfy boundary conditions through iterative sampling.
#
#     Parameters
#     ----------
#     model_obj :
#         Model object containing parameter generation methods.
#     nchains : int
#         Number of chains to initialize.
#     nfit : int
#         Number of fitted parameters.
#     pars : ndarray
#         Parameter array to populate.
#     chi2 : ndarray
#         Chi-squared array to populate.
#     det_chain : ndarray
#         Determinant chain array to populate.
#     n_cores : int,
#         Number of worker processes.
#     """
#
#     # Extract necessary data from model_obj for parallel workers
#     bestpars_initial = model_obj.bestpars_data.list_bestpars_initial_value
#     scale_vector = model_obj.retrieval_data.scale_vector_params
#     seed_base = model_obj.random_obj.seed
#
#     # Prepare model data for workers
#     model_data = _extract_model_data_for_parallel(model_obj)
#
#     # Prepare arguments for each worker
#     worker_args = [
#         (kr, bestpars_initial, scale_vector, nfit, seed_base, model_data)
#         for kr in range(nchains)
#     ]
#
#     try:
#         # Execute in parallel using Pool
#         with mp.Pool(processes=n_cores) as pool:
#             results = pool.starmap(_initialize_single_chain_worker, worker_args)
#
#         # Populate output arrays with results
#         for kr, candidate_pars, chi2_val, det_val in results:
#             pars[:, 0, kr] = candidate_pars
#             chi2[0, kr] = chi2_val
#             det_chain[0, kr] = det_val
#
#     except Exception as e:
#         print(f"Parallel initialization failed: {e}")
#         print("Falling back to serial implementation...")
#         # Fallback to serial implementation if parallel fails
#         _initialize_first_chain_step_serial(model_obj, nchains, nfit, pars, chi2, det_chain)


def _initialize_first_chain_step_serial(
        model_obj, nchains, nfit, pars, lhood, log_prior_chain, debug_log_path=None,
):
    """
    Original serial implementation of chain initialization.

    This function maintains the original logic as a fallback when parallel
    processing fails or is not beneficial.

    Parameters
    ----------
    model_obj :
        Model object containing parameter generation methods.
    nchains : int
        Number of chains to initialize.
    nfit : int
        Number of fitted parameters.
    pars : ndarray
        Parameter array to populate.
    lhood : ndarray
        Log-likelihood array to populate.  # IDL: chi2
    log_prior_chain : ndarray
        Log-prior chain array to populate.
        IDL: 'det_chain' — renamed; log of Gaussian prior product, see calculate_log_prior.
    debug_log_path :
        JSONL log path for per-chain initialization events (may be None).
    """
    param_full = None

    # Upper bound on the number of init redraws per chain. The starting point
    # is redrawn until BOTH the prior boundaries are satisfied AND the
    # resulting likelihood is finite. The cap is a safety net so that a chain
    # that can never reach a finite-likelihood region cannot loop forever; if
    # it is hit, the chain falls back to its last (non-finite) draw and a
    # "chain_init_cap_reached" event is logged.
    max_init_tentative = int(
        getattr(model_obj.retrieval_data, "max_init_tentative", 1000)
    )

    # Initialize each chain
    for kr in range(nchains):
        _pemit(debug_log_path, {
            "event": "chain_init_begin",
            "chain": int(kr),
        })
        cond_params = False

        # Keep redrawing until the boundary conditions are satisfied AND the
        # likelihood is finite. A -inf init point would poison the chain: the
        # Metropolis ratio against -inf is +inf, so the chain accepts its first
        # finite proposal unconditionally, lands in an arbitrary low-quality
        # region and becomes a convergence straggler. Requiring a finite init
        # likelihood here avoids that.
        n_tentative = 0
        init_lhood = -np.inf
        init_log_prior = 0.0
        while not cond_params:
            n_tentative += 1
            cond_params = True

            # Generate random normal deviates
            a = model_obj.random_obj.rng.standard_normal(nfit)

            # IDL: pars[0:nfit-1,0,j] = bestpars[tofit] + 2d0*scale*(randomu_normal(nfit))
            # Three-branch dispatch keyed on retrieval_data.init_mode (set by
            # the user via the GUI dropdown, defaults to "isotropic").
            #   - "correlated": multivariate Gaussian using L = chol(C_MAP).
            #   - "diagonal":   independent Gaussian with σ_p = sqrt(diag(C_MAP)).
            #   - "isotropic":  legacy path, σ from YAML scale (byte-identical).
            # Runtime fallback to isotropic only triggers as a safety net when
            # Hessian extraction silently failed (Bestpars attrs are None);
            # we emit a JSONL ``init_mode_fallback`` event so post-mortems
            # surface what happened.
            mu_init = model_obj.bestpars_data.list_bestpars_initial_value
            f_init = model_obj.retrieval_data.init_scatter_factor
            mode_init = getattr(model_obj.retrieval_data, "init_mode", "isotropic")
            if mode_init == "correlated":
                L_init = getattr(model_obj.bestpars_data, "init_cholesky", None)
                if L_init is not None:
                    pars[:, 0, kr] = np.asarray(mu_init) + f_init * (np.asarray(L_init) @ a)
                else:
                    _pemit(debug_log_path, {
                        "event": "init_mode_fallback",
                        "chain": int(kr),
                        "tentative": int(n_tentative),
                        "requested": "correlated",
                        "used": "isotropic",
                        "reason": "cholesky unavailable",
                    })
                    pars[:, 0, kr] = (
                        np.asarray(mu_init)
                        + f_init * model_obj.retrieval_data.scale_vector_params * a
                    )
            elif mode_init == "diagonal":
                sig_init = getattr(model_obj.bestpars_data, "init_scale_per_param", None)
                if sig_init is not None:
                    pars[:, 0, kr] = np.asarray(mu_init) + f_init * np.asarray(sig_init) * a
                else:
                    _pemit(debug_log_path, {
                        "event": "init_mode_fallback",
                        "chain": int(kr),
                        "tentative": int(n_tentative),
                        "requested": "diagonal",
                        "used": "isotropic",
                        "reason": "hessian diagonal unavailable",
                    })
                    pars[:, 0, kr] = (
                        np.asarray(mu_init)
                        + f_init * model_obj.retrieval_data.scale_vector_params * a
                    )
            else:  # "isotropic" — legacy path, byte-identical
                pars[:, 0, kr] = (
                    np.asarray(mu_init)
                    + f_init * model_obj.retrieval_data.scale_vector_params * a
                )

            # Create full parameter object and check boundaries
            param_full = model_obj.create_param_full(pars[:, 0, kr])
            for elem in param_full:
                if elem is not None:
                    # Evaluate the bounds check once and reuse the result.
                    in_bounds = elem.boundaries_check()
                    cond_params = cond_params and in_bounds
                    if n_tentative % 100 == 0 and not in_bounds:
                        _pemit(debug_log_path, {
                            "event": "chain_init_boundary_miss",
                            "chain": int(kr),
                            "tentative": int(n_tentative),
                            "param_name": getattr(elem, "name", None),
                            "value": float(getattr(elem, "value_in_retrieval", float("nan"))),
                            "range_min": float(getattr(elem, "range_min", float("nan"))),
                            "range_max": float(getattr(elem, "range_max", float("nan"))),
                        })

            # Boundaries violated: redraw (unless the tentative cap is hit).
            if not cond_params:
                if n_tentative >= max_init_tentative:
                    break
                continue

            # Boundaries OK: evaluate the likelihood and require it to be finite.
            # IDL: chi2[0,kr]=lhood, det_chain[0,kr]=det  — det is now log_prior
            # chain=kr, step=0 lets downstream diagnostic prints in LikelihoodHR
            # identify the chain that produced an init-time warning.
            init_lhood, init_log_prior, _, _ = model_obj.lh_function_gib(
                param_full, chain=int(kr), step=0,
            )  # IDL: chi2[0,kr]
            if not np.isfinite(init_lhood):
                # Non-finite init likelihood (e.g. model NaN/Inf): reject this
                # draw and try again so the chain does not start at -inf.
                cond_params = False
                _pemit(debug_log_path, {
                    "event": "chain_init_nonfinite_lhood",
                    "chain": int(kr),
                    "tentative": int(n_tentative),
                    "init_lhood": float(init_lhood),
                })
                if n_tentative >= max_init_tentative:
                    _pemit(debug_log_path, {
                        "event": "chain_init_cap_reached",
                        "chain": int(kr),
                        "tentative": int(n_tentative),
                        "init_lhood": float(init_lhood),
                    })
                    break

        # Store the (possibly re-drawn) initial likelihood and log-prior.
        lhood[0, kr] = init_lhood
        log_prior_chain[0, kr] = init_log_prior

        _pemit(debug_log_path, {
            "event": "chain_init_done",
            "chain": int(kr),
            "n_tentative": int(n_tentative),
            "init_lhood": float(lhood[0, kr]),
            "init_log_prior": float(log_prior_chain[0, kr]),
            "pars_min": float(np.min(pars[:, 0, kr])),
            "pars_max": float(np.max(pars[:, 0, kr])),
        })
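

# Illustrative sketch only (not called by the sampler). It restates the three
# initial-scatter modes of the redraw loop above as one pure function, which
# may help when reasoning about the fallback rules. The name _sketch_init_draw
# and all of its arguments are hypothetical; the real code reads these values
# from model_obj.
def _sketch_init_draw(mu, a, f, mode, scale, chol=None, sigma=None):
    """Return mu + f * (scatter applied to the standard-normal vector a)."""
    import numpy as np  # local import keeps the sketch self-contained

    mu = np.asarray(mu, dtype=float)
    a = np.asarray(a, dtype=float)
    if mode == "correlated" and chol is not None:
        # With a ~ N(0, I), chol @ a has covariance chol @ chol.T.
        return mu + f * (np.asarray(chol, dtype=float) @ a)
    if mode == "diagonal" and sigma is not None:
        # Independent Gaussian with one sigma per parameter.
        return mu + f * np.asarray(sigma, dtype=float) * a
    # "isotropic", or the fallback when chol / sigma are unavailable.
    return mu + f * np.asarray(scale, dtype=float) * a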


def _save_intermediate_results(partial_prob_and_chain,
                              pars, lhood, log_prior_chain, index_position_python,
                              chain_names=None, multi_expansion=None):
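    """
    Save the intermediate (burn-in) chain checkpoint to disk.

    Parameters
    ----------
    partial_prob_and_chain : str
        Path to pickle file for partial results.
    pars : ndarray
        Parameter array.
    lhood : ndarray
        Log-likelihood array.  # IDL: chi2
    log_prior_chain : ndarray
        Log-prior chain array. IDL: 'det_chain' (renamed; see calculate_log_prior).
    index_position_python : int
        Number of steps to save (exclusive end index, Python indexing).
    chain_names : sequence, optional
        Canonical chain ordering at save time. Stored so that a later resume
        can detect a reordered or modified df_parameters.yaml.
    multi_expansion : dict, optional
        Mapping whose values are sequences; each value is stored as a list
        alongside chain_names for the same resume-time check.
    """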
    # Save partial results as pickle: saving partial burnin chains
    # Key 'lhood' replaces old key 'chi2'; _load_previous_chains reads both for backward compat.
    # IDL key was 'det_chain'; now 'log_prior_chain'. _load_previous_chains reads both.
    # Optional 'chain_names' / 'multi_expansion' record the canonical chain
    # ordering at save time, so a later resume can detect that the user has
    # reordered or modified df_parameters.yaml between runs (which would
    # otherwise silently misassign every saved value).
    payload = {
        "parameters": pars[:, :index_position_python, :],
        "lhood": lhood[:index_position_python, :],  # IDL: "chi2"
        "log_prior_chain": log_prior_chain[:index_position_python, :],
    }
    if chain_names is not None:
        payload["chain_names"] = list(chain_names)
    if multi_expansion is not None:
        payload["multi_expansion"] = {
            k: list(v) for k, v in multi_expansion.items()
        }
    with open(partial_prob_and_chain, "wb") as f:
        pickle.dump(payload, f)


def _check_convergence_and_burnin(
        pars, lhood, nfit, nchains, index_position_python,
        rhat_threshold=1.01,
):
    """
    Check convergence and determine burn-in index.

    Parameters
    ----------
    pars : ndarray
        Parameter array.
    lhood : ndarray
        Log-likelihood array.  # IDL: chi2
    nfit : int
        Number of fitted parameters.
    nchains : int
        Number of chains.
    index_position_python : int
        Current position index.
    rhat_threshold : float, optional
        Upper bound on R-hat for declaring convergence (default 1.01).

    Returns
    -------
    tuple
        Tuple containing (converged, gelmanrubin, tz, burnndx).
    """
    # IDL: medchi2 = median(chi2[0:i,*])
    # Calculate median log-likelihood for burn-in determination
    med_lhood = np.median(lhood[:index_position_python, :])  # IDL: medchi2
    burnndx = 0
    for jj in range(nchains):
        # IDL: tmpndx = (where(chi2[0:i,j] gt medchi2))(0)
        tmpndx = np.where(lhood[:index_position_python, jj] > med_lhood)[0]
        if len(tmpndx) > 0:
            if tmpndx[0] > burnndx:
                burnndx = tmpndx[0]
    # IDL: burnndx = burnndx < (i-3)  — allows G-R calculation if one chain is problematic
    burnndx = min(burnndx, index_position_python - 3)

    # IDL: converged = exofast_gelmanrubin(pars[0:nfit-1,burnndx:i,*], gelmanrubin, tz)
    # NOTE: IDL uses pars[0:nfit-1,...] which in IDL inclusive indexing = nfit params.
    # Python 0:nfit-1 is exclusive → excludes last parameter. Fixed to 0:nfit.
    converged, gelmanrubin, tz = exofast_gelmanrubin(
        pars[0:nfit, burnndx:index_position_python, :],
        rhat_threshold=rhat_threshold,
    )

    return converged, gelmanrubin, tz, burnndx
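

# Illustrative sketch only (not called by the sampler). It isolates the
# burn-in rule used above: for each chain, find the first step whose
# log-likelihood exceeds the global median, take the latest such step over all
# chains, and clamp it so at least three steps remain. The name
# _sketch_burn_index is hypothetical, and lhood is assumed to have shape
# (nsteps, nchains).
def _sketch_burn_index(lhood):
    """Return the burn-in index for a (nsteps, nchains) log-likelihood array."""
    import numpy as np  # local import keeps the sketch self-contained

    lhood = np.asarray(lhood, dtype=float)
    nsteps, nchains = lhood.shape
    med = np.median(lhood)
    burn = 0
    for j in range(nchains):
        above = np.flatnonzero(lhood[:, j] > med)
        # A chain that never rises above the median does not move the index.
        if above.size:
            burn = max(burn, int(above[0]))
    return min(burn, nsteps - 3)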


def _update_convergence_tracking(converged, nstop, i, npass, dontstop,
                               maxsteps, output_file, gelmanrubin, tz,
                               maxnpass=6):
    """
    Update convergence tracking variables and determine next recalculation step.

    Parameters
    ----------
    converged : int or None
        Convergence flag: 1 if converged, 0 if not, None if G-R could not run
        (too few chains or steps — in that case state is left unchanged).
    nstop : int
        Step at which convergence was first achieved.
    i : int
        Current step index.
    npass : int
        Number of consecutive convergence passes.
    dontstop : bool
        Flag to continue even after convergence.
    maxsteps : int
        Maximum number of steps.
    output_file : str
        Path to output file for status messages.
    gelmanrubin : ndarray or None
        Gelman-Rubin statistics.
    tz : ndarray or None
        Independent draws statistics.
    maxnpass : int, optional
        Number of consecutive passes required to declare convergence (default 6).

    Returns
    -------
    tuple
        Updated (nextrecalc, npass, nstop, should_break).
    """
    should_break = False

    # IDL: exofast_gelmanrubin always returns a value; Python returns None when
    # nchains <= 2 or nsteps <= 1. In that case, defer the next check without
    # resetting npass/nstop — we simply have insufficient data yet.
    if converged is None:
        nextrecalc = int(i / 0.9)
        return nextrecalc, npass, nstop, should_break

    # IDL: if converged then begin ... endif else begin ... endelse
    if converged == 1:
        if nstop == 0:
            nstop = i
        # IDL: nextrecalc = long(nstop/(1.d0-npass/100.d0))
        # npass is now 0-based (see init in likelihood()); (npass + 1) keeps the
        # check-spacing cadence byte-identical to the original 1-based scheme, so
        # only the required pass count changes, not WHEN checks happen.
        nextrecalc = int(nstop / (1 - (npass + 1) / 100))
        npass += 1

        # Stop once we have maxnpass consecutive passes (0-based npass reaches
        # maxnpass after exactly maxnpass passing checks). IDL stopped at 6 from
        # a 1-based counter (= 5 passes); we require the documented maxnpass.
        if npass == maxnpass:
            if not dontstop:
                temp_str = f"Has converged: Rhat={gelmanrubin} Tz={tz}"
                with open(output_file, "a") as f:
                    f.write(temp_str.rstrip("\n") + "\n")
                should_break = True
            nextrecalc = maxsteps
    else:
        # IDL: nextrecalc = long(i/0.9d0) / nstop=0L / npass=1L
        # A failed check resets the consecutive-pass counter to 0 (was 1 in IDL).
        nextrecalc = int(i / 0.9)
        nstop = 0
        npass = 0

    return nextrecalc, npass, nstop, should_break
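

# Illustrative sketch only (not called by the sampler). Assuming every
# convergence check passes, it lists the steps at which the follow-up checks
# land after the first pass, using the same spacing formula as above:
# int(nstop / (1 - (npass + 1) / 100)). The name _sketch_check_schedule and
# its arguments are hypothetical.
def _sketch_check_schedule(first_pass_step, maxnpass=6):
    """Return the steps of the maxnpass - 1 checks that follow the first pass."""
    steps = []
    for npass in range(maxnpass - 1):
        steps.append(int(first_pass_step / (1 - (npass + 1) / 100)))
    return steps


# Usage: _sketch_check_schedule(1000) gives [1010, 1020, 1030, 1041, 1052], so
# a first pass at step 1000 needs about 5% more steps for six passes in a row.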


def _finalize_chains(pars, lhood, removeburn, burnndx, nstop):
    """
    Finalize chains by removing the burn-in period.

    The final chains are not written to disk: the resumable checkpoint
    partial_prob_and_chain_burnin.pkl (saved during sampling) is the only chain
    artifact read back by the resume logic and the GUI. The trimmed (pars, lhood)
    are returned for in-process downstream use.

    Parameters
    ----------
    pars : ndarray
        Parameter array.
    lhood : ndarray
        Log-likelihood array.  # IDL: chi2
    removeburn : bool
        Whether to remove burn-in period from final results.
    burnndx : int
        Burn-in index.
    nstop : int
        Final step index.

    Returns
    -------
    tuple
        Final (pars, lhood) arrays.
    """
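    # IDL: if keyword_set(removeburn) then pars=pars[*,burnndx:nstop,*] else pars=pars[*,0:nstop,*]
    # NOTE: IDL ranges are inclusive, so burnndx:nstop includes step nstop.
    # Python slices exclude the end index, hence nstop + 1.
    if removeburn:
        pars = pars[:, burnndx:nstop + 1, :]
        lhood = lhood[burnndx:nstop + 1, :]  # IDL: chi2
    else:
        pars = pars[:, 0:nstop + 1, :]
        lhood = lhood[0:nstop + 1, :]  # IDL: chi2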

    return pars, lhood



def likelihood(model_obj, dontstop=False, removeburn=True, moresteps=False):
    """
    Run DEMC (Differential Evolution Markov Chain) posterior sampling.

    This function implements the main DEMC algorithm for Bayesian parameter
    estimation. It runs multiple chains in parallel, monitors convergence
    using Gelman-Rubin statistics, and saves intermediate results.

    Parameters
    ----------
    model_obj :
        Model object containing all necessary data and methods for sampling.
    dontstop : bool, optional
        If True, continue sampling even after convergence. Default is False.
    removeburn : bool, optional
        If True, remove burn-in period from final results. Default is True.
    moresteps : bool, optional
        If True, resume from previous sampling run. Default is False.

    Returns
    -------
    tuple
        Final parameter chains and log-likelihood values as (pars, lhood).  # IDL: (pars, chi2)
    """
    nthin = model_obj.retrieval_data.nthin
    rhat_threshold = float(
        getattr(model_obj.retrieval_data, "gelman_rubin_threshold", 1.01)
    )
    maxnpass = int(getattr(model_obj.retrieval_data, "maxnpass", 6))
    # Initialize the resumable chain checkpoint path
    partial_prob_and_chain = _initialize_chain_files(model_obj)

    # Install a parent-side buffered logger so driver events (demc_setup,
    # step_done, convergence_check, …) are flushed to disk only at
    # distribution-save checkpoints (every 10 outer_steps), not on every
    # iteration. Workers in ModelData.parallel_chain continue to flush their
    # own short-lived buffer once per outer_step.
    global _PARENT_BUFFER
    _PARENT_BUFFER = BufferedLogger()

    # Extract configuration parameters from model object
    nfit = model_obj.bestpars_data.nfit
    max_steps_run = model_obj.retrieval_data.maxsteps
    nchains = int(model_obj.bestpars_data.nchains)
    ncores = int(model_obj.bestpars_data.ncores)
    # MCMC sampler choice. "DE-MC" (default) leaves the worker path byte-identical;
    # "Snooker" enables DE-MCzs, for which each worker also receives the full
    # pooled history archive built fresh from ``pars`` each step (see below).
    sampler = getattr(model_obj.retrieval_data, "mcmc_sampler", "DE-MC")

    # IDL does not check this; silently losing chains is a hard-to-diagnose bug
    if nchains % ncores != 0:
        raise ValueError(
            f"nchains ({nchains}) must be exactly divisible by ncores ({ncores}). "
            f"Adjust multiplier_chains or multiplier_cores in configuration."
        )

    # Initialize arrays for parameters, log-likelihood, and log-prior chain
    # IDL: chi2 = dblarr(maxsteps, nchains)  — renamed lhood; log-likelihood, higher = better
    # IDL: det_chain = dblarr(maxsteps, nchains)  — renamed log_prior_chain; see calculate_log_prior
    pars = np.zeros([nfit, max_steps_run, nchains])
    lhood = np.zeros([max_steps_run, nchains])  # IDL: chi2
    log_prior_chain = np.zeros([max_steps_run, nchains])

    # Record start time for progress tracking
    t0 = datetime.datetime.now()

    # Initialize the shared JSONL debug log (truncate + header). Writers
    # (this function, worker processes inside parallel_chain, and init helpers)
    # all share the same file so events stay chronologically ordered.
    # When retrieval_data.save_chain_debug is False the path is set to None;
    # debug_log.init_log/emit_event/BufferedLogger.flush all no-op on None,
    # so the entire debug/ folder is skipped and no JSONL is produced.
    debug_log_path = (
        get_log_path(model_obj)
        if getattr(model_obj.retrieval_data, "save_chain_debug", True)
        else None
    )
    init_log(debug_log_path, {
        "event": "session_start",
        "component": "exofast_demc",
        "id_process": getattr(model_obj.retrieval_data, "id_process", None),
        "moresteps": bool(moresteps),
        "dontstop": bool(dontstop),
        "removeburn": bool(removeburn),
    })
    model_obj._debug_log_path = debug_log_path

    # Handle resuming from previous run or starting fresh
    if moresteps:
        # Load and concatenate previous chain data
        pars, lhood, log_prior_chain, last_index_data_resumed, naccept = _load_previous_chains(
            partial_prob_and_chain, pars, lhood, log_prior_chain, debug_log_path,
            model_obj=model_obj,
        )
    else:
        # IDL: naccept = 1d0
        naccept = 1
        last_index_data_resumed = 0

    # Set up output file and display initial status
    output_file = model_obj.retrieval_data.table_output_file
    maxsteps = max_steps_run + last_index_data_resumed

    print(f"\nStart at step {last_index_data_resumed} of {maxsteps}")
    print(f"{naccept} accepted")
    print(f"{ncores} cores")
    print(f"{nchains} chains")
    print(f"Init scatter factor: {model_obj.retrieval_data.init_scatter_factor}")
    print(f"Mode jump threshold: {model_obj.bestpars_data.mode_jump_threshold}\n")

    _pemit(debug_log_path, {
        "event": "demc_setup",
        "nchains": nchains,
        "ncores": ncores,
        "nfit": int(nfit),
        "chain_per_core": nchains // ncores,
        "max_steps_run": int(max_steps_run),
        "maxsteps_total": int(maxsteps),
        "nthin": int(nthin),
        "last_index_data_resumed": int(last_index_data_resumed),
        "naccept_resumed": int(naccept),
        "init_scatter_factor": float(model_obj.retrieval_data.init_scatter_factor),
        "mode_jump_threshold": float(model_obj.bestpars_data.mode_jump_threshold),
        "gamma_coeff": float(model_obj.bestpars_data.gamma_coeff),
        "epsilon_scale_divisor": float(model_obj.retrieval_data.epsilon_scale_divisor),
    })

    # One-shot registry of fitted-parameter bounds. Emitted once so every
    # boundary-reject event can carry just the proposed value (the analyser
    # joins on name to recover range_min / range_max).
    param_registry = []
    for slot in model_obj.param_handler.initial_param_array:
        if slot is None or not getattr(slot, "status", False):
            continue
        param_registry.append({
            "name": str(getattr(slot, "name", "?")),
            "range_min": float(slot.range_min),
            "range_max": float(slot.range_max),
        })
    _pemit(debug_log_path, {
        "event": "param_registry",
        "params": param_registry,
    })

    with open(output_file, "a") as f:
        f.write("First newpars creation\n")

    # Initialize first step of chains if not resuming
    if not moresteps:
        _initialize_first_chain_step_serial(
            model_obj, nchains, nfit, pars, lhood, log_prior_chain, debug_log_path,
        )

    # IDL: nextrecalc = 100L  — first convergence check at step 100
    # (was incorrectly 1000 in Python, delaying first burn-in detection 10x)
    nextrecalc = 100
    # npass = number of CONSECUTIVE passing checks accumulated so far. 0-based:
    # starts at 0 (zero passes), and convergence stops the run at npass==maxnpass.
    # NOTE: this intentionally diverges from IDL (npass=1L init, stop at 6 → only
    # 5 real passes). Here maxnpass is the true pass count, so the GUI shows 0/6
    # before any pass and maxnpass==6 means 6 consecutive passes as documented.
    npass = 0
    nstop = 0

    # Convergence warning tracking: IDL fits a line to min(tz) vs step number
    # and warns if convergence is not expected within maxsteps.
    # IDL: mintz=[0d0], tzsteps=[0L], alreadywarned=0L
    mintz = [0.0]
    tzsteps = [0]
    alreadywarned = False

    with open(output_file, "a") as f:
        f.write("Starting chains\n")

    # DE-MCzs archive Z: an append-only (M, nfit) buffer holding the full pooled
    # history (all past+present states, all chains). Built incrementally - one
    # nchains-row append per step - instead of rebuilding the whole history each
    # step (which would be O(steps^2) memory traffic). Only allocated for the
    # snooker path; DE-MC leaves it None and pays nothing. Capacity bounds the
    # maximum reachable length (every stored step across the whole run, plus a
    # one-step margin).
    if sampler == "Snooker":
        archive_capacity = nchains * (int(max_steps_run) + last_index_data_resumed + 1)
        archive_buf = np.empty((archive_capacity, nfit))
        # Seed with every state already present: steps 0 .. last_index_data_resumed.
        archive_seed = pars[:, : last_index_data_resumed + 1, :].reshape(nfit, -1).T
        archive_buf[: archive_seed.shape[0]] = archive_seed
        archive_len = archive_seed.shape[0]
    else:
        archive_buf = None
        archive_len = 0

    # Main sampling loop
    # IDL: for i=initfor, maxsteps-1L do begin
    counter = 0
    index_total = last_index_data_resumed
    for i in range(1, int(max_steps_run)):
        index_total = i + last_index_data_resumed
        counter += 1

        # Run parallel processes for current step with retry logic.
        # nthin is passed so each worker executes nthin proposals internally,
        # saving only the final state — matching IDL's inner for k=0,nthin-1 loop.
        max_step_retries = 5
        step_retry_count = 0
        step_success = False

        while step_retry_count < max_step_retries and not step_success:
            try:
                # DE-MCzs draws from the full pooled history Z = all states from
                # step 0 to index_total-1, across all chains. archive_buf is the
                # append-only buffer; archive_buf[:archive_len] is a contiguous
                # view of exactly those states (no per-step copy). Under the fork
                # start method all workers share it copy-on-write. archive_len is
                # stable across retries of the same step (the append happens only
                # once, after a successful step below).
                archive = archive_buf[:archive_len] if sampler == "Snooker" else None
                return_dict = model_obj.run_multiple_processes(
                    pars[:, index_total-1, :],
                    lhood[index_total-1, :],  # IDL: chi2[index_total-1, :]
                    log_prior_chain[index_total-1, :],
                    nthin=nthin,
                    outer_step=index_total,
                    debug_log_path=debug_log_path,
                    sampler=sampler,
                    archive=archive,
                )

                # Collect results from parallel processes. Acceptances are
                # tallied in step_naccept and added to naccept only after every
                # chain has been read, so a KeyError part-way through (which
                # triggers a retry of this step) cannot double-count them.
                counter_chain = 0
                chain_per_core = nchains // ncores
                step_naccept = 0
                for j in range(ncores):
                    for k in range(chain_per_core):
                        temp_dict = return_dict[j][k]
                        step_naccept += temp_dict.naccept
                        pars[:, index_total, counter_chain] = np.squeeze(temp_dict.pars)
                        lhood[index_total, counter_chain] = temp_dict.lhood  # IDL: temp_dict.chi2
                        # IDL: det_chain[i,k] = temp_dict.det; now log_prior_chain
                        log_prior_chain[index_total, counter_chain] = temp_dict.log_prior
                        counter_chain += 1
                naccept += step_naccept

                # Append this step's whole population to the DE-MCzs archive (K=1
                # dense history). Runs exactly once per step, on the success path
                # only, so a retried step never double-appends.
                if sampler == "Snooker":
                    archive_buf[archive_len:archive_len + nchains] = (
                        pars[:, index_total, :].T
                    )
                    archive_len += nchains

                step_success = True

            except KeyError as e:
                step_retry_count += 1
                _pemit(debug_log_path, {
                    "event": "step_retry",
                    "outer_step": index_total,
                    "attempt": step_retry_count,
                    "max_attempts": max_step_retries,
                    "error_type": "KeyError",
                    "error": str(e),
                })
                if step_retry_count < max_step_retries:
                    import time
                    time.sleep(2)
                else:
                    _pemit(debug_log_path, {
                        "event": "fatal_error",
                        "outer_step": index_total,
                        "id_process": getattr(model_obj.retrieval_data, "id_process", None),
                        "error_type": "KeyError",
                        "error": str(e),
                        "traceback": traceback.format_exc(),
                        "reason": "max_retries_reached",
                    })
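                    # exit() comes from the site module and returns status 0;
                    # raise SystemExit with a non-zero status instead.
                    raise SystemExit(1)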
            except Exception as e:
                _pemit(debug_log_path, {
                    "event": "fatal_error",
                    "outer_step": index_total,
                    "id_process": getattr(model_obj.retrieval_data, "id_process", None),
                    "error_type": type(e).__name__,
                    "error": str(e),
                    "traceback": traceback.format_exc(),
                })
                # Non-zero status so callers can detect the failure.
                raise SystemExit(1)

        index_position_python = index_total + 1

        # IDL: acceptancerate = naccept/double(i*nchains*nthin)*100
        # Calculate acceptance rate as a float (was stored as str — bug)
        acceptancerate = float(naccept / (index_position_python * nchains * nthin)) * 100
        timeleft = (datetime.datetime.now() - t0) * (max_steps_run / (i + 1) - 1)
        timeleft, units = time_left_units(timeleft.total_seconds())

        step_lhood = lhood[index_total, :]
        _pemit(debug_log_path, {
            "event": "step_done",
            "outer_step": index_total,
            "step_naccept": int(step_naccept),
            "step_accept_rate": float(step_naccept) / (nchains * nthin),
            "naccept_total": int(naccept),
            "acceptance_rate_cum": float(acceptancerate),
            "lhood_min": float(np.min(step_lhood)),
            "lhood_max": float(np.max(step_lhood)),
            "lhood_mean": float(np.mean(step_lhood)),
            "lhood_std": float(np.std(step_lhood)),
            "time_left_sec": (datetime.datetime.now() - t0).total_seconds()
                             * (max_steps_run / (i + 1) - 1),
            "progress_pct": float(100 * (i + 1) / max_steps_run),
        })

        # Save intermediate results every 10 steps
        # IDL saves every 100; Python saves more frequently for finer crash recovery
        if counter == 10:
            # Build optional schema metadata once and pass it through; the
            # save helper records it inside the checkpoint pickle so that
            # a future _load_previous_chains can detect chain reordering.
            try:
                _chain_names = list(model_obj.bestpars_data.list_bestpars)
            except AttributeError:
                _chain_names = None
            _multi_exp = _build_multi_expansion(model_obj)
            _save_intermediate_results(
                partial_prob_and_chain,
                pars, lhood, log_prior_chain, index_position_python,
                chain_names=_chain_names, multi_expansion=_multi_exp,
            )
            # Flush the parent event buffer in lockstep with the chain-sample
            # save so the JSONL log is always consistent with the on-disk
            # distribution snapshot (never ahead, never behind).
            if _PARENT_BUFFER is not None:
                _PARENT_BUFFER.flush(debug_log_path)
            counter = 0

        # IDL: if i eq nextrecalc then begin
        # Check convergence and burn-in at specified intervals
        if i >= nextrecalc:
            converged, gelmanrubin, tz, burnndx = _check_convergence_and_burnin(
                pars, lhood, nfit, nchains, index_position_python,
                rhat_threshold=rhat_threshold,
            )

            conv_record = {
                "event": "convergence_check",
                "outer_step": index_total,
                "converged": None if converged is None else int(converged),
                "burnndx": int(burnndx) if burnndx is not None else None,
                "npass_before": int(npass),
                "nstop_before": int(nstop),
            }
            if gelmanrubin is not None:
                gr_arr = np.asarray(gelmanrubin, dtype=float)
                conv_record["gelmanrubin_min"] = float(np.min(gr_arr))
                conv_record["gelmanrubin_max"] = float(np.max(gr_arr))
                conv_record["gelmanrubin_median"] = float(np.median(gr_arr))
            if tz is not None:
                tz_arr = np.asarray(tz, dtype=float)
                conv_record["tz_min"] = float(np.min(tz_arr))
                conv_record["tz_max"] = float(np.max(tz_arr))
                conv_record["tz_median"] = float(np.median(tz_arr))
            _pemit(debug_log_path, conv_record)

            if converged is not None:
                # IDL: mintz=[mintz,min(tz)] / tzsteps=[tzsteps,i]
                # Track minimum independent draws over time for convergence warning
                mintz.append(float(np.min(tz)))
                tzsteps.append(i)

                # IDL: if not alreadywarned and i gt maxsteps/20 and ntz gt 3 then begin
                #        coeffs = poly_fit(tzsteps,mintz,1)
                #        stepstoconvergence = (2000-coeffs[0])/coeffs[1]
                if not alreadywarned and i > maxsteps / 20 and len(mintz) > 3:
                    # Fit line to min(tz) growth — extrapolate to tz=2000 (2× threshold)
                    slope, intercept = np.polyfit(tzsteps[1:], mintz[1:], 1)
                    if slope > 0:
                        stepstoconvergence = (2000 - intercept) / slope
                        if stepstoconvergence > maxsteps:
                            bestnthin = round(stepstoconvergence * nthin / maxsteps)
                            _pemit(debug_log_path, {
                                "event": "convergence_warning",
                                "outer_step": index_total,
                                "kind": "not_expected_to_mix",
                                "recommended_nthin": int(bestnthin),
                                "slope": float(slope),
                                "intercept": float(intercept),
                            })
                        else:
                            pct = int(stepstoconvergence * 100 / maxsteps)
                            _pemit(debug_log_path, {
                                "event": "convergence_warning",
                                "outer_step": index_total,
                                "kind": "expected_to_mix",
                                "expected_pct_complete": int(pct),
                                "slope": float(slope),
                                "intercept": float(intercept),
                            })
                    alreadywarned = True

            # Update convergence tracking
            nextrecalc, npass, nstop, should_break = _update_convergence_tracking(
                converged, nstop, i, npass, dontstop, maxsteps,
                output_file, gelmanrubin, tz, maxnpass=maxnpass
            )

            _pemit(debug_log_path, {
                "event": "convergence_update",
                "outer_step": index_total,
                "nextrecalc": int(nextrecalc),
                "npass": int(npass),
                "nstop": int(nstop),
                "should_break": bool(should_break),
            })

            if should_break:
                break

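        # Update progress display periodically. The interval is clamped to at
        # least 1: round(max_steps_run / 1000) is 0 for short runs, which would
        # raise ZeroDivisionError in the modulo.
        if i % max(1, round(max_steps_run / 1000)) == 0: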
            progress = float(100 * (i + 1) / max_steps_run)
            temp_str = (
                f"EXOFAST: {progress:.3f}%, "
                f"acceptance rate = {acceptancerate:.2f}%, "
                f"npass = {npass}/{maxnpass}, "
                f"Time left = {timeleft:.2f} {units}"
            )
            with open(output_file, "a") as f:
                f.write(temp_str.rstrip("\n") + "\n")

    # IDL: if npass ne 6 or keyword_set(dontstop) then nstop = maxsteps-1
    # Use maxsteps (total including resumed steps), not max_steps_run (current run only)
    if npass != maxnpass or dontstop == 1:
        nstop = maxsteps - 1

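    # Final burn-in determination: burn-in ends at the latest "first step above
    # the global median likelihood" among the chains. Chains that never exceed
    # the median do not contribute.
    # IDL: medchi2 = median(chi2[0:nstop,*]) / burnndx = burnndx < (maxsteps-3)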
    med_lhood = np.median(lhood[:nstop, :])  # IDL: medchi2
    burnndx = 0
    for j in range(nchains):
        tmpndx = np.where(lhood[:nstop, j] > med_lhood)[0]  # IDL: chi2
        if len(tmpndx) > 0:
            if tmpndx[0] > burnndx:
                burnndx = tmpndx[0]
    burnndx = min(burnndx, maxsteps - 3)

    # Final convergence check and warnings
    # IDL: if npass ne 6 then begin / bad=where(tz lt 1000 or gelmanrubin gt 1.01)
    # If the run ended without maxnpass consecutive passes, recompute the
    # diagnostics on the post-burn-in samples. Parameters failing either
    # threshold are reported as not well-mixed; if none fail, the chain is
    # flagged as only marginally well-mixed.
    if npass != maxnpass:
        converged, gelmanrubin, tz = exofast_gelmanrubin(
            pars[0:nfit, burnndx:nstop, :],
            rhat_threshold=rhat_threshold,
        )
        if gelmanrubin is not None:
            bad = np.where(np.logical_or(tz < 1000, gelmanrubin > rhat_threshold))
            if len(bad[0]) > 0:
                temp_str = (
                    # bad is the tuple returned by np.where; report the index array itself
                    f"WARNING: following parameters are not well-mixed: {bad[0]} "
                    f"GELMANRUBIN: {gelmanrubin[bad]} "
                    f"TZ: {tz[bad]}"
                )
            else:
                # IDL: message, 'WARNING: chain did not pass 6 tests, may be marginally well-mixed'
                temp_str = (f"WARNING: The chain did not pass {maxnpass} consecutive "
                            f"tests and may be marginally well-mixed.")
        else:
            temp_str = "WARNING: Gelman-Rubin returned None — too few steps or chains."
        with open(output_file, "a") as f:
            f.write(temp_str.rstrip("\n") + "\n")
    else:
        # IDL has no message here — chain converged normally
        temp_str = f"Chain converged after {npass} consecutive Gelman-Rubin passes."
        with open(output_file, "a") as f:
            f.write(temp_str.rstrip("\n") + "\n")

    # Calculate and display final runtime statistics
    runtime = datetime.datetime.now() - t0
    runtime_sec = runtime.total_seconds()
    runtime, units = time_left_units(runtime_sec)

    temp_str = (
        f"EXOFAST_DEMC: done in {runtime:.2f} {units}. "
        f"Took {(index_total / maxsteps) * 100:.1f}% of the steps"
    )
    with open(output_file, "a") as f:
        f.write(temp_str.rstrip("\n") + "\n")

    _pemit(debug_log_path, {
        "event": "demc_end",
        "runtime_sec": float(runtime_sec),
        "runtime_display": f"{runtime:.2f} {units}",
        "index_total": int(index_total),
        "maxsteps_total": int(maxsteps),
        "pct_steps_used": float((index_total / maxsteps) * 100),
        "naccept_total": int(naccept),
        "acceptance_rate_final": float(naccept / ((index_total + 1) * nchains * nthin) * 100),
        "npass": int(npass),
        "nstop": int(nstop),
        "burnndx": int(burnndx),
        "converged_by_npass": bool(npass == maxnpass),
    })

    # Flush any remaining parent events (incl. demc_end) and release the
    # buffer so subsequent retrievals / re-entries start clean.
    if _PARENT_BUFFER is not None:
        _PARENT_BUFFER.flush(debug_log_path)
    _PARENT_BUFFER = None

    # Finalize chains (remove burn-in)
    pars, lhood = _finalize_chains(
        pars, lhood, removeburn, burnndx, nstop
    )

    return pars, lhood
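

# -------------------------------------------------------------------------------------------
# Illustrative sketch only; nothing in this module calls it.
# It isolates the burn-in rule from the "Final burn-in determination" block of
# the function above so the rule can be read and tested on its own.
# ASSUMPTIONS: the helper name `_sketch_burn_in_index` and its signature are
# hypothetical. `lhood` is taken to be an (nsteps, nchains) array in which larger
# values are better, matching the `lhood[:nstop, j] > med_lhood` comparison above.
# -------------------------------------------------------------------------------------------
import numpy as np


def _sketch_burn_in_index(lhood, nstop, maxsteps):
    """Return the burn-in index implied by the median-likelihood rule (sketch)."""
    # Median over all chains and all steps before nstop.
    med_lhood = np.median(lhood[:nstop, :])
    burnndx = 0
    for j in range(lhood.shape[1]):
        # Steps at which chain j is above the global median.
        above = np.flatnonzero(lhood[:nstop, j] > med_lhood)
        if above.size > 0:
            # Keep the latest "first crossing" seen across the chains.
            burnndx = max(burnndx, int(above[0]))
    # Same cap as the main routine: leave at least a few post-burn-in steps.
    return min(burnndx, maxsteps - 3)


# Usage (hypothetical values): with two chains whose first above-median steps
# are 2 and 3, `_sketch_burn_in_index(lhood, nstop=6, maxsteps=100)` picks the
# later of the two; a small `maxsteps` caps the result at `maxsteps - 3`.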