streamlit_healthcheck.healthcheck

View Source

   1import streamlit as st
   2import psutil
   3import pandas as pd
   4import requests
   5import time
   6import threading
   7import json
   8import os
   9from datetime import datetime
  10from typing import Dict, List, Any, Optional, Callable
  11import functools
  12import traceback
  13import logging
  14import sqlite3
  15
  16# Set up logging
  17logging.basicConfig(
  18    level=logging.INFO,
  19    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
  20    handlers=[
  21        logging.StreamHandler()
  22    ]
  23)
  24logger = logging.getLogger(__name__)
  25
  26class StreamlitPageMonitor:
  27    """
  28    Singleton class that monitors and records errors occurring within Streamlit pages.
  29    It captures both explicit Streamlit error messages (monkey-patching st.error) and
  30    uncaught exceptions raised during the execution of monitored page functions, and
  31    persists error details to a local SQLite database.
  32    
  33    Key responsibilities
  34    
  35    - Intercept Streamlit error calls by monkey-patching st.error and record them with
  36        a stack trace, timestamp, status, and type.
  37    - Provide a decorator `monitor_page(page_name)` to set a page context, capture
  38        exceptions raised while rendering/executing a page, and record those exceptions.
  39    - Store errors in an in-memory structure grouped by page and persist them to
  40        an SQLite database for later inspection.
  41    - Provide utilities to load, deduplicate, clear, and query stored errors.
  42    
  43    Behavior and side effects
  44    
  45    - Implements the Singleton pattern: only one instance exists per Python process.
  46    - On first instantiation, optionally accepts a custom db_path and initializes
  47        the SQLite database and its parent directory (creating it if necessary).
  48    - Monkey-patches `streamlit.error` (st.error) to capture calls and still forward
  49        them to the original st.error implementation.
  50    - Records the following fields for each error: page, error, traceback, timestamp,
  51        status, type. The SQLite table `errors` mirrors these fields and includes an
  52        auto-incrementing `id`.
  53    - Persists errors immediately to SQLite when captured; database IO errors are
  54        logged but do not suppress the original exception (for monitored exceptions,
  55        the exception is re-raised after recording).
  56        
  57    Public API (methods)
  58    
  59    - __new__(cls, db_path=None)
  60            Create or return the singleton StreamlitPageMonitor instance.
  61        
  62            Parameters
  63            ----------
  64            db_path : Optional[str]
  65                If provided on the first instantiation, overrides the class-level
  66                database path used to persist captured Streamlit error information.
  67                
  68            Returns
  69            -------
  70            StreamlitPageMonitor
  71                The singleton instance of the class.
  72                
  73            Behavior
  74            --------
  75            - On first instantiation (when cls._instance is None):
  76            - Allocates the singleton via super().__new__.
  77            - Optionally sets cls._db_path from the provided db_path.
  78            - Logs the configured DB path.
  79            - Monkey-patches streamlit.error (st.error) with a wrapper that:
  80                - Builds an error record containing the error text, a formatted stack trace,
  81                ISO timestamp, severity/status, an error type marker, and the current page.
  82                - Normalizes a missing current page to "unknown_page".
  83                - Stores the record in the in-memory cls._errors dictionary keyed by page.
  84                - Attempts to persist the record to the SQLite DB using cls().save_errors_to_db,
  85                logging any persistence errors without interrupting Streamlit's normal error display.
  86                - Calls the original st.error to preserve expected UI behavior.
  87            - Initializes the SQLite DB via cls._init_db().
  88            - On subsequent calls:
  89            - Returns the existing singleton instance.
  90            - If db_path is provided, updates cls._db_path for future use.
  91            
  92            Side effects
  93            ------------
  94            - Replaces st.error globally for the running process.
  95            - Writes error records to both an in-memory structure (cls._errors) and to the
  96            configured SQLite database (if persistence succeeds).
  97            - Logs informational and error messages.
  98            
  99            Notes
 100            -----
 101            - The method assumes the class defines/has: _instance, _db_path, _current_page,
 102            _errors, _st_error (original st.error), save_errors_to_db, and _init_db.
 103            - Exceptions raised during saving of individual errors are caught and logged;
 104            exceptions from instance creation or DB initialization may propagate.
 105            - The implementation is not explicitly thread-safe; concurrent instantiation
 106            attempts may require external synchronization if used in multi-threaded contexts.
 107    - set_page_context(cls, page_name: str)
 108            Set the current page name used when recording subsequent errors.
 109    - monitor_page(cls, page_name: str) -> Callable
 110            Decorator for page rendering/execution functions. Sets the page context,
 111            clears previously recorded non-Streamlit errors for that page, runs the
 112            function, records and persists any raised exception, and re-raises it.
 113    - _handle_st_error(cls, error_message: str)
 114    
 115            Handles Streamlit-specific errors by recording error details for the current page.
 116        
 117            Args:
 118                error_message (str): The error message to be logged.
 119                
 120            Side Effects:
 121                Updates the class-level _errors dictionary with error information for the current Streamlit page.
 122                
 123            Error Information Stored:
 124                - error: Formatted error message.
 125                - traceback: Stack trace at the point of error.
 126                - timestamp: Time when the error occurred (ISO format).
 127                - status: Error severity ('critical').
 128                - type: Error type ('streamlit_error').
 129    - get_page_errors(cls) -> dict
 130            Load errors from the database and return a dictionary mapping page names to
 131            lists of error dicts. Performs basic deduplication by error message.
 132    - save_errors_to_db(cls, errors: Iterable[dict])
 133            Persist a list of error dictionaries to the configured SQLite database.
 134            Ensures traceback is stored as a string (JSON if originally a list).
 135    - clear_errors(cls, page_name: Optional[str] = None)
 136            Clear in-memory errors for a specific page or all pages and delete matching
 137            rows from the database.
 138    - _init_db(cls)
 139            Ensure the database directory exists and create the `errors` table if it
 140            does not exist.
 141    - load_errors_from_db(cls, page=None, status=None, limit=None) -> List[dict]
 142            Query the database for errors, optionally filtering by page and/or status,
 143            returning a list of error dictionaries ordered by timestamp (descending)
 144            and limited if requested.
 145            
 146    Storage and format
 147    
 148    - Default DB path: ~/local/share/streamlit-healthcheck/streamlit_page_errors.db (overridable).
 149    - SQLite table `errors` columns: id, page, error, traceback, timestamp, status, type.
 150    - Tracebacks may be stored as JSON strings (if originally lists) or plain strings.
 151    Concurrency and robustness
 152    - Designed for single-process usage typical of Streamlit apps. The singleton and
 153        monkey-patching are process-global.
 154    - Database interactions use short-lived connections; callers should handle any
 155        exceptions arising from DB access (errors are logged internally).
 156    - Decorator preserves original function metadata via functools.wraps.
 157    
 158    Examples
 159    
 160    - Use as a decorator on page render function:
 161    >>> @StreamlitPageMonitor.monitor_page("home")
 162    >>> def render_home():
 163
 164    - Set page context manually:
 165    >>> StreamlitPageMonitor.set_page_context("settings")
 166    
 167    - Set custom DB path on first instantiation:
 168    >>> # Place this at the top of your Streamlit app once, before any error monitoring or decorator usage to ensure the sqlite
 169    >>> # database is created properly at the specified path; otherwise it will default to a temp directory. The temp directory
 170    >>> # will be `~/local/share/streamlit-healthcheck/streamlit_page_errors.db`.
 171    >>> StreamlitPageMonitor(db_path="/home/saradindu/dev/streamlit_page_errors.db")
 172    ...
 173
 174    SQLite Database Schema
 175    ---------------------
 176    The following schema is used for persisting errors:
 177
 178    ```sql
 179    CREATE TABLE IF NOT EXISTS errors (
 180        id INTEGER PRIMARY KEY AUTOINCREMENT,
 181        page TEXT,
 182        error TEXT,
 183        traceback TEXT,
 184        timestamp TEXT,
 185        status TEXT,
 186        type TEXT
 187    );
 188    ```
 189
 190    Field Descriptions:
 191
 192    | Column     | Type    | Description                                 |
 193    |------------|---------|---------------------------------------------|
 194    | id         | INTEGER | Auto-incrementing primary key               |
 195    | page       | TEXT    | Name of the Streamlit page                  |
 196    | error      | TEXT    | Error message                               |
 197    | traceback  | TEXT    | Stack trace or traceback (as string/JSON)   |
 198    | timestamp  | TEXT    | ISO8601 timestamp of error occurrence       |
 199    | status     | TEXT    | Severity/status (e.g., 'critical')          |
 200    | type       | TEXT    | Error type ('streamlit_error', 'exception') |
 201
 202    Example:
 203    
 204    >>> @StreamlitPageMonitor.monitor_page("home")
 205    >>> def render_home():
 206    
 207    Notes
 208    
 209    - The class monkey-patches st.error globally when first instantiated; ensure
 210        this side effect is acceptable in your environment.
 211    - Errors captured by st.error that occur outside any known page are recorded
 212        under the page name "unknown_page".
 213    - The schema is created/ensured in `_init_db()`.
 214    - Tracebacks may be stored as JSON strings or plain text.
 215    - Errors are persisted immediately upon capture.
 216    
 217    """
 218    _instance = None
 219    _errors: Dict[str, List[Dict[str, Any]]] = {}
 220    _st_error = st.error
 221    _current_page = None
 222
 223    # --- SQLite schema for error persistence ---
 224    # Table: errors
 225    # Fields:
 226    #   id INTEGER PRIMARY KEY AUTOINCREMENT
 227    #   page TEXT
 228    #   error TEXT
 229    #   traceback TEXT
 230    #   timestamp TEXT
 231    #   status TEXT
 232    #   type TEXT
 233    
 234    # Local development DB path
 235    _db_path = os.path.join(os.path.expanduser("~"), "dev", "streamlit-healthcheck", "streamlit_page_errors.db")
 236    # Final build DB path
 237    #_db_path = os.path.join(os.path.expanduser("~"), ".local", "share", "streamlit-healthcheck", "streamlit_page_errors.db")
 238
 239    def __new__(cls, db_path=None):
 240        """
 241        Create or return the singleton StreamlitPageMonitor instance.
 242        """
 243        
 244        if cls._instance is None:
 245            cls._instance = super(StreamlitPageMonitor, cls).__new__(cls)
 246            # Allow db_path override at first instantiation
 247            if db_path is not None:
 248                cls._db_path = db_path
 249            logger.info(f"StreamlitPageMonitor DB path set to: {cls._db_path}")
 250            # Monkey patch st.error to capture error messages
 251            def patched_error(*args, **kwargs):
 252                error_message = " ".join(str(arg) for arg in args)
 253                current_page = cls._current_page
 254                error_info = {
 255                    'error': error_message,
 256                    'traceback': traceback.format_stack(),
 257                    'timestamp': datetime.now().isoformat(),
 258                    'status': 'critical',
 259                    'type': 'streamlit_error',
 260                    'page': current_page
 261                }
 262                # Ensure current_page is a string, not None
 263                if current_page is None:
 264                    current_page = "unknown_page"
 265                if current_page not in cls._errors:
 266                    cls._errors[current_page] = []
 267                cls._errors[current_page].append(error_info)
 268                # Persist to DB
 269                try:
 270                    cls().save_errors_to_db([error_info])
 271                except Exception as e:
 272                    logger.error(f"Failed to save Streamlit error to DB: {e}")
 273                # Call original st.error
 274                return cls._st_error(*args, **kwargs)
 275
 276            st.error = patched_error
 277
 278            # Initialize SQLite database
 279            cls._init_db()
 280        else:
 281            # If already instantiated, allow updating db_path if provided
 282            if db_path is not None:
 283                cls._db_path = db_path
 284        return cls._instance
 285
 286    @classmethod
 287    def _handle_st_error(cls, error_message: str):
 288        """
 289        Handles Streamlit-specific errors by recording error details for the current page.
 290        """
 291        
 292        # Get current page name from Streamlit context
 293        current_page = getattr(st, '_current_page', 'unknown_page')
 294        error_info = {
 295            'error': f"Streamlit Error: {error_message}",
 296            'traceback': traceback.format_stack(),
 297            'timestamp': datetime.now().isoformat(),
 298            'status': 'critical',
 299            'type': 'streamlit_error',
 300            'page': current_page
 301        }
 302        # Initialize list for page if not exists
 303        if current_page not in cls._errors:
 304            cls._errors[current_page] = []
 305        # Add new error
 306        cls._errors[current_page].append(error_info)
 307        # Persist to DB
 308        try:
 309            cls().save_errors_to_db([error_info])
 310        except Exception as e:
 311            logger.error(f"Failed to save Streamlit error to DB: {e}")
 312
 313    @classmethod
 314    def set_page_context(cls, page_name: str):
 315        """Set the current page context"""
 316        cls._current_page = page_name
 317
 318    @classmethod
 319    def monitor_page(cls, page_name: str):
 320        """
 321        Decorator to monitor and log exceptions for a specific Streamlit page.
 322        
 323        Args:
 324            page_name (str): The name of the page to monitor.
 325            
 326        Returns:
 327            Callable: A decorator that wraps the target function, sets the page context,
 328            clears previous non-Streamlit errors, and logs any exceptions that occur during execution.
 329            
 330        The decorator performs the following actions:
 331        
 332            - Sets the current page context using `cls.set_page_context`.
 333            - Clears previous exception errors for the page, retaining only those marked as 'streamlit_error'.
 334            - Executes the wrapped function.
 335            - If an exception occurs, logs detailed error information (error message, traceback, timestamp, status, type, and page)
 336              to `cls._errors` under the given page name, then re-raises the exception.
 337        """
 338        
 339        def decorator(func):
 340            """
 341            Decorator to manage page-specific error handling and context setting.
 342            This decorator sets the current page context before executing the decorated function.
 343            It clears previous exception errors for the page, retaining only Streamlit error calls.
 344            If an exception occurs during function execution, it captures error details including
 345            the error message, traceback, timestamp, status, type, and page name, and appends them
 346            to the page's error log. The exception is then re-raised.
 347            
 348            Args:
 349                func (Callable): The function to be decorated.
 350                
 351            Returns:
 352                Callable: The wrapped function with error handling and context management.
 353            """
 354            
 355            @functools.wraps(func)
 356            def wrapper(*args, **kwargs):
 357                # Set the current page context
 358                cls.set_page_context(page_name)
 359                try:
 360                    # Clear previous exception errors but keep st.error calls
 361                    if page_name in cls._errors:
 362                        cls._errors[page_name] = [
 363                            e for e in cls._errors[page_name]
 364                            if e.get('type') == 'streamlit_error'
 365                        ]
 366                    result = func(*args, **kwargs)
 367                    return result
 368                except Exception as e:
 369                    error_info = {
 370                        'error': str(e),
 371                        'traceback': traceback.format_exc(),
 372                        'timestamp': datetime.now().isoformat(),
 373                        'status': 'critical',
 374                        'type': 'exception',
 375                        'page': page_name
 376                    }
 377                    if page_name not in cls._errors:
 378                        cls._errors[page_name] = []
 379                    cls._errors[page_name].append(error_info)
 380                    # Persist to DB
 381                    try:
 382                        cls().save_errors_to_db([error_info])
 383                    except Exception as db_exc:
 384                        logger.error(f"Failed to save exception error to DB: {db_exc}")
 385                    raise
 386            return wrapper
 387        return decorator
 388
 389    @classmethod
 390    def get_page_errors(cls):
 391        """
 392        Load error records from storage and return them grouped by page.
 393        This class method calls cls().load_errors_from_db() to retrieve a sequence of error records
 394        (each expected to be a mapping). It normalizes each record to a dictionary with the keys:
 395        
 396            - 'error' (str): error message, default "Unknown error"
 397            - 'traceback' (list): traceback frames or lines, default []
 398            - 'timestamp' (str): timestamp string, default ""
 399            - 'type' (str): error type/category, default "unknown"
 400            
 401        Grouping and uniqueness:
 402        
 403            - Records are grouped by the 'page' key; if a record has no 'page' key, the page name
 404                "unknown" is used.
 405            - For each page, only unique errors are kept using the 'error' string as the deduplication
 406                key. When multiple records for the same page have the same 'error' value, the last
 407                occurrence in the loaded sequence will be retained.
 408                
 409        Return value:
 410        
 411            - dict[str, list[dict]]: mapping from page name to a list of normalized error dicts.
 412            
 413        Error handling:
 414        
 415            - Any exception raised while loading or processing records will be logged via logger.error.
 416                The method will return the result accumulated so far (or an empty dict if nothing was
 417                accumulated).
 418                
 419        Notes:
 420        
 421            - The class is expected to be instantiable (cls()) and to provide a load_errors_from_db()
 422                method that yields or returns an iterable of mappings.
 423        """
 424        
 425        result = {}
 426        try:
 427            db_errors = cls().load_errors_from_db()
 428            for err in db_errors:
 429                page = err.get('page', 'unknown')
 430                if page not in result:
 431                    result[page] = []
 432                result[page].append({
 433                    'error': err.get('error', 'Unknown error'),
 434                    'traceback': err.get('traceback', []),
 435                    'timestamp': err.get('timestamp', ''),
 436                    'type': err.get('type', 'unknown')
 437                })
 438            # Return only unique page errors using the 'page' column for filtering
 439            return {page: list({e['error']: e for e in errors}.values()) for page, errors in result.items()}
 440        except Exception as e:
 441            logger.error(f"Failed to load errors from DB: {e}")
 442            return result
 443
 444    @classmethod
 445    def save_errors_to_db(cls, errors):
 446        """
 447        Save a sequence of error records into the SQLite database configured at cls._db_path.
 448        
 449        Parameters
 450        ----------
 451        
 452        errors : Iterable[Mapping] | list[dict]
 453        
 454            Sequence of error records to persist. Each record is expected to be a mapping with the
 455            following keys (values are stored as provided, except for traceback which is normalized):
 456            
 457              - "page": identifier or name of the page where the error occurred (str)
 458              - "error": human-readable error message (str)
 459              - "traceback": traceback information; may be a str, list, or None. If a list, it will be
 460                JSON-encoded before storage. If None, an empty string is stored.
 461              - "timestamp": timestamp for the error (stored as provided)
 462              - "status": status associated with the error (str)
 463              - "type": classification/type of the error (str)
 464              
 465        Behavior
 466        --------
 467        
 468        - If `errors` is falsy (None or empty), the method returns immediately without touching the DB.
 469        - Opens a SQLite connection to the path stored in `cls._db_path`.
 470        - Iterates over the provided records and inserts each into the `errors` table with columns
 471          (page, error, traceback, timestamp, status, type).
 472        - Ensures that the `traceback` value is always written as a string (list -> JSON string,
 473          other values -> str(), None -> "").
 474        - Commits the transaction if all inserts succeed and always closes the connection in a finally block.
 475        
 476        Exceptions
 477        ----------
 478        
 479        - Underlying sqlite3 exceptions (e.g., sqlite3.Error) are not swallowed and will propagate to the caller
 480          if connection/execution fails.
 481          
 482        Returns
 483        -------
 484        
 485        None
 486        """
 487        if not errors:
 488            return
 489        conn = sqlite3.connect(cls._db_path)
 490        try:
 491            cursor = conn.cursor()
 492            for err in errors:
 493                # Ensure traceback is always a string for SQLite
 494                tb = err.get("traceback")
 495                if isinstance(tb, list):
 496                    import json
 497                    tb_str = json.dumps(tb)
 498                else:
 499                    tb_str = str(tb) if tb is not None else ""
 500                cursor.execute(
 501                    """
 502                    INSERT INTO errors (page, error, traceback, timestamp, status, type)
 503                    VALUES (?, ?, ?, ?, ?, ?)
 504                    """,
 505                    (
 506                        err.get("page"),
 507                        err.get("error"),
 508                        tb_str,
 509                        err.get("timestamp"),
 510                        err.get("status"),
 511                        err.get("type"),
 512                    ),
 513                )
 514            conn.commit()
 515        finally:
 516            conn.close()
 517
 518    @classmethod
 519    def clear_errors(cls, page_name: Optional[str] = None):
 520        """Clear stored health-check errors for a specific page or for all pages.
 521        This classmethod updates both the in-memory error cache and the persistent
 522        SQLite-backed store.
 523        
 524        If `page_name` is provided:
 525        
 526        - Remove the entry for that page from the class-level in-memory dictionary
 527            of errors (if present).
 528        - Delete all rows in the SQLite `errors` table where `page` equals `page_name`.
 529        
 530        If `page_name` is None:
 531        
 532        - Clear the entire in-memory errors dictionary.
 533        - Delete all rows from the SQLite `errors` table.
 534        
 535        Args:
 536                page_name (Optional[str]): Name of the page whose errors should be cleared.
 537                        If None, all errors are cleared.
 538                        
 539        Returns:
 540                None
 541                
 542        Side effects:
 543        
 544                - Mutates class-level state (clears entries in `cls._errors`).
 545                - Opens a SQLite connection to `cls._db_path` and executes DELETE statements
 546                    against the `errors` table. Commits the transaction and closes the connection.
 547                    
 548        Error handling:
 549        
 550                - Database-related exceptions are caught and logged via the module logger;
 551                    they are not re-raised by this method. As a result, callers should not
 552                    rely on exceptions to detect DB failures.
 553                    
 554        Notes:
 555        
 556                - The method assumes `cls._db_path` points to a valid SQLite database file
 557                    and that an `errors` table exists with a `page` column.
 558                - This method does not provide synchronization; callers should take care of
 559                    concurrent access to class state and the database if used from multiple
 560                    threads or processes.
 561        """
 562        
 563        if page_name:
 564            if page_name in cls._errors:
 565                del cls._errors[page_name]
 566            # Remove from DB
 567            try:
 568                conn = sqlite3.connect(cls._db_path)
 569                cursor = conn.cursor()
 570                cursor.execute("DELETE FROM errors WHERE page = ?", (page_name,))
 571                conn.commit()
 572                conn.close()
 573            except Exception as e:
 574                logger.error(f"Failed to clear errors from DB for page {page_name}: {e}")
 575        else:
 576            cls._errors = {}
 577            # Remove all from DB
 578            try:
 579                conn = sqlite3.connect(cls._db_path)
 580                cursor = conn.cursor()
 581                cursor.execute("DELETE FROM errors")
 582                conn.commit()
 583                conn.close()
 584            except Exception as e:
 585                logger.error(f"Failed to clear all errors from DB: {e}")
 586
 587    @classmethod
 588    def _init_db(cls):
 589        """
 590        Initialize the SQLite database file and ensure the required schema exists.
 591        This class-level initializer performs the following steps:
 592        
 593        - Ensures the parent directory of cls._db_path exists; creates it if necessary.
 594            - If cls._db_path has no parent directory (e.g., a bare filename), no directory is created.
 595        - Connects to the SQLite database at cls._db_path (creating the file if it does not exist).
 596        - Creates an "errors" table if it does not already exist with the following columns:
 597            - id (INTEGER PRIMARY KEY AUTOINCREMENT)
 598            - page (TEXT)
 599            - error (TEXT)
 600            - traceback (TEXT)
 601            - timestamp (TEXT)
 602            - status (TEXT)
 603            - type (TEXT)
 604        - Commits the schema change and closes the database connection.
 605        - Logs informational and error messages using the module logger.
 606        
 607        Parameters
 608        ----------
 609        
 610        cls : type
 611        
 612                The class on which this method is invoked. Must provide a valid string attribute
 613                `_db_path` indicating the target SQLite database file path.
 614                
 615        Raises
 616        ------
 617        
 618        Exception
 619        
 620                Re-raises exceptions encountered when creating the parent directory (os.makedirs).
 621                
 622        sqlite3.Error
 623        
 624                May be raised by sqlite3.connect or subsequent SQLite operations when the database
 625                cannot be opened or initialized.
 626                
 627        Side effects
 628        ------------
 629        
 630        - May create directories on the filesystem.
 631        - May create or modify the SQLite database file at cls._db_path.
 632        - Writes log messages via the module logger.
 633        
 634        Returns
 635        -------
 636        
 637        None
 638        """
 639        
 640        # Ensure the parent directory for the DB exists
 641        db_dir = os.path.dirname(cls._db_path)
 642        if db_dir and not os.path.exists(db_dir):
 643            try:
 644                os.makedirs(db_dir, exist_ok=False)
 645                logger.info(f"Created directory for DB: {db_dir}")
 646            except Exception as e:
 647                logger.error(f"Failed to create DB directory {db_dir}: {e}")
 648                raise
 649        # Now create/connect to the DB and table
 650        logger.info(f"Initializing SQLite DB at: {cls._db_path}")
 651        conn = sqlite3.connect(cls._db_path)
 652        c = conn.cursor()
 653        c.execute('''CREATE TABLE IF NOT EXISTS errors (
 654            id INTEGER PRIMARY KEY AUTOINCREMENT,
 655            page TEXT,
 656            error TEXT,
 657            traceback TEXT,
 658            timestamp TEXT,
 659            status TEXT,
 660            type TEXT
 661        )''')
 662        conn.commit()
 663        conn.close()
 664    @classmethod
 665    def load_errors_from_db(cls, page=None, status=None, limit=None):
 666        """
 667        Load errors from the class SQLite database.
 668        This classmethod connects to the SQLite database at cls._db_path, queries the
 669        'errors' table, and returns matching error records as a list of dictionaries.
 670        
 671        Parameters:
 672        
 673            page (Optional[str]): If provided, filter results to rows where the 'page'
 674                column equals this value.
 675            status (Optional[str]): If provided, filter results to rows where the 'status'
 676                column equals this value.
 677            limit (Optional[int|str]): If provided, limits the number of returned rows.
 678                The value is cast to int internally; a non-convertible value will raise
 679                ValueError.
 680                
 681        Returns:
 682        
 683            List[dict]: A list of dictionaries representing rows from the 'errors' table.
 684            Each dict contains the following keys:
 685                - id: primary key (int)
 686                - page: page identifier (str)
 687                - error: short error message (str)
 688                - traceback: full traceback or diagnostic text (str)
 689                - timestamp: stored timestamp value as retrieved from the DB (type depends on schema)
 690                - status: error status (str)
 691                - type: error type/category (str)
 692                
 693        Raises:
 694        
 695            ValueError: If `limit` cannot be converted to int.
 696            sqlite3.Error: If an SQLite error occurs while executing the query.
 697            
 698        Notes:
 699        
 700            - Uses parameterized queries for the 'page' and 'status' filters to avoid SQL
 701              injection. The `limit` is applied after casting to int.
 702            - Results are ordered by `timestamp` in descending order.
 703            - The database connection is always closed in a finally block to ensure cleanup.
 704        """
 705        
 706        conn = sqlite3.connect(cls._db_path)
 707        try:
 708            cursor = conn.cursor()
 709            query = "SELECT id, page, error, traceback, timestamp, status, type FROM errors"
 710            params = []
 711            filters = []
 712            if page:
 713                filters.append("page = ?")
 714                params.append(page)
 715            if status:
 716                filters.append("status = ?")
 717                params.append(status)
 718            if filters:
 719                query += " WHERE " + " AND ".join(filters)
 720            query += " ORDER BY timestamp DESC"
 721            if limit:
 722                query += f" LIMIT {int(limit)}"
 723            cursor.execute(query, params)
 724            rows = cursor.fetchall()
 725            errors = []
 726            for row in rows:
 727                errors.append({
 728                    "id": row[0],
 729                    "page": row[1],
 730                    "error": row[2],
 731                    "traceback": row[3],
 732                    "timestamp": row[4],
 733                    "status": row[5],
 734                    "type": row[6],
 735                })
 736            return errors
 737        finally:
 738            conn.close()
 739
 740class HealthCheckService:
 741    """
 742    A background-capable health monitoring service for a Streamlit-based application.
 743    This class periodically executes a configurable set of checks (system metrics,
 744    external dependencies, Streamlit server and pages, and user-registered custom checks),
 745    aggregates their results, and exposes a sanitized health snapshot suitable for UI
 746    display or remote monitoring.
 747    
 748    Primary responsibilities
 749    
 750    - Load and persist a JSON configuration that defines check intervals, thresholds,
 751        dependencies to probe, and Streamlit connection settings.
 752    - Run periodic checks in a dedicated background thread (start/stop semantics).
 753    - Collect system metrics (CPU, memory, disk) using psutil and apply configurable
 754        warning/critical thresholds.
 755    - Probe configured HTTP API endpoints and (placeholder) database checks.
 756    - Verify Streamlit server liveness by calling a /healthz endpoint and inspect
 757        Streamlit page errors via StreamlitPageMonitor.
 758    - Allow callers to register synchronous custom checks (functions returning dicts).
 759    - Compute an aggregated overall status (critical > warning > unknown > healthy).
 760    - Provide a sanitized snapshot of health data with function references removed for safe
 761        serialization/display.
 762        
 763    Usage (high level)
 764    
 765    - Instantiate: svc = HealthCheckService(config_path="path/to/config.json")
 766    - Optionally register custom checks: svc.register_custom_check("my_check", my_check_func)
 767        where my_check_func() -> Dict[str, Any]
 768    - Start background monitoring: svc.start()
 769    - Stop monitoring: svc.stop()
 770    - Retrieve current health snapshot for display or API responses: svc.get_health_data()
 771    - Persist any changes to configuration: svc.save_config()
 772    
 773    Configuration (JSON)
 774    
 775    - check_interval: int (seconds) — how often to run the checks (default 60)
 776    - streamlit_url: str — base host (default "http://localhost")
 777    - streamlit_port: int — port for Streamlit server (default 8501)
 778    - system_checks: { "cpu": bool, "memory": bool, "disk": bool }
 779    - dependencies:
 780            - api_endpoints: list of { "name": str, "url": str, "timeout": int }
 781            - databases: list of { "name": str, "type": str, "connection_string": str }
 782    - thresholds:
 783            - cpu_warning, cpu_critical, memory_warning, memory_critical, disk_warning, disk_critical
 784            
 785    Health data structure (conceptual)
 786    
 787    - last_updated: ISO timestamp
 788    - system: { "cpu": {...}, "memory": {...}, "disk": {...} }
 789    - dependencies: { "<name>": {...}, ... }
 790    - custom_checks: { "<name>": {...} }  (get_health_data() strips callable references)
 791    - streamlit_server: {status, response_code/latency/error, message, url}
 792    - streamlit_pages: {status, error_count, errors, details}
 793    - overall_status: "healthy" | "warning" | "critical" | "unknown"
 794    
 795    Threading and safety
 796    
 797    - The service runs checks in a daemon thread started by start(). stop() signals the
 798        thread to terminate and joins with a short timeout. Clients should avoid modifying
 799        internal structures concurrently; get_health_data() returns a sanitized snapshot
 800        appropriate for concurrent reads.
 801        
 802    Custom checks
 803    
 804    - register_custom_check(name, func): registers a synchronous function that returns a
 805        dict describing the check result (must include a "status" key with one of the
 806        recognized values). The service stores the function reference internally but returns
 807        sanitized results via get_health_data().
 808        
 809    Error handling and logging
 810    
 811    - Individual checks catch exceptions and surface errors in the corresponding
 812        health_data entry with status "critical" where appropriate.
 813    - The Streamlit UI integration (st.* calls) is used for user-visible error messages
 814        when loading/saving configuration; the service also logs events to its configured
 815        logger.
 816        
 817    Extensibility notes
 818    
 819    - Database checks are left as placeholders; implement _check_database for specific DB
 820        drivers/connections.
 821    - Custom checks are synchronous; if long-running checks are required, adapt the
 822        registration/run pattern to use async or worker pools.
 823    """
 824    def __init__(self, config_path: str = "health_check_config.json"):
 825        """
 826        Initializes the HealthCheckService instance.
 827        
 828        Args:
 829            config_path (str): Path to the health check configuration file. Defaults to "health_check_config.json".
 830            
 831        Attributes:
 832        
 833        - logger (logging.Logger): Logger for the HealthCheckService.
 834        - config_path (str): Path to the configuration file.
 835        - health_data (Dict[str, Any]): Dictionary storing health check data.
 836        - config (dict): Loaded configuration from the config file.
 837        - check_interval (int): Interval in seconds between health checks. Defaults to 60.
 838        - _running (bool): Indicates if the health check service is running.
 839        - _thread (threading.Thread or None): Thread running the health check loop.
 840        - streamlit_url (str): URL of the Streamlit service. Defaults to "http://localhost".
 841        - streamlit_port (int): Port of the Streamlit service. Defaults to 8501.
 842        """
 843        self.logger = logging.getLogger(f"{__name__}.HealthCheckService")
 844        self.logger.info("Initializing HealthCheckService")
 845        self.config_path = config_path
 846        self.health_data: Dict[str, Any] = {
 847            "last_updated": None,
 848            "system": {},
 849            "dependencies": {},
 850            "custom_checks": {},
 851            "overall_status": "unknown"
 852        }
 853        self.config = self._load_config()
 854        self.check_interval = self.config.get("check_interval", 60)  # Default: 60 seconds
 855        self._running = False
 856        self._thread = None
 857        self.streamlit_url = self.config.get("streamlit_url", "http://localhost")
 858        self.streamlit_port = self.config.get("streamlit_port", 8501)  # Default: 8501
 859    def _load_config(self) -> Dict:
 860        """Load health check configuration from file."""
 861        if os.path.exists(self.config_path):
 862            try:
 863                with open(self.config_path, "r") as f:
 864                    return json.load(f)
 865            except Exception as e:
 866                st.error(f"Error loading health check config: {str(e)}")
 867                return self._get_default_config()
 868        else:
 869            return self._get_default_config()
 870            
 871    def _get_default_config(self) -> Dict:
 872        """Return default health check configuration."""
 873        return {
 874            "check_interval": 60,
 875            "streamlit_url": "http://localhost",
 876            "streamlit_port": 8501,
 877            "system_checks": {
 878                "cpu": True,
 879                "memory": True,
 880                "disk": True
 881            },
 882            "dependencies": {
 883                "api_endpoints": [
 884                    # Example API endpoint to check
 885                    {"name": "example_api", "url": "https://httpbin.org/get", "timeout": 5}
 886                ],
 887                "databases": [
 888                    # Example database connection to check
 889                    {"name": "main_db", "type": "postgres", "connection_string": "..."}
 890                ]
 891            },
 892            "thresholds": {
 893                "cpu_warning": 70,
 894                "cpu_critical": 90,
 895                "memory_warning": 70,
 896                "memory_critical": 90,
 897                "disk_warning": 70,
 898                "disk_critical": 90
 899            }
 900        }
 901    
 902    def start(self):
 903        """
 904        Start the periodic health-check background thread.
 905        If the `healthcheck` runner is already active, this method is a no-op and returns
 906        immediately. Otherwise, it marks the runner as running, creates a daemon thread
 907        targeting self._run_checks_periodically, stores the thread on self._thread, and
 908        starts it.
 909        
 910        Behavior and side effects:
 911        
 912        - Idempotent while running: repeated calls will not create additional threads.
 913        - Sets self._running to True.
 914        - Assigns a daemon threading.Thread to self._thread and starts it.
 915        - Non-blocking: returns after starting the background thread.
 916        - The daemon thread will not prevent the process from exiting.
 917        
 918        Thread-safety:
 919        
 920        - If start() may be called concurrently from multiple threads, callers should
 921            ensure proper synchronization (e.g., external locking) to avoid race conditions.
 922            
 923        Returns:
 924        
 925                None
 926        """
 927        
 928        if self._running:
 929            return
 930            
 931        self._running = True
 932        self._thread = threading.Thread(target=self._run_checks_periodically, daemon=True)
 933        self._thread.start()
 934        
 935    def stop(self):
 936        """Stop the health check service."""
 937        self._running = False
 938        if self._thread:
 939            self._thread.join(timeout=1)
 940            
 941    def _run_checks_periodically(self):
 942        """Run health checks periodically based on check interval."""
 943        while self._running:
 944            self.run_all_checks()
 945            time.sleep(self.check_interval)
 946            
 947    def run_all_checks(self):
 948        """Run all configured health checks and update health data."""
 949        # Update timestamp
 950        self.health_data["last_updated"] = datetime.now().isoformat()
 951        
 952        # Check Streamlit server
 953        self.health_data["streamlit_server"] = self.check_streamlit_server()
 954        
 955        # System checks
 956        if self.config["system_checks"].get("cpu", True):
 957            self.check_cpu()
 958        if self.config["system_checks"].get("memory", True):
 959            self.check_memory()
 960        if self.config["system_checks"].get("disk", True):
 961            self.check_disk()
 962            
 963        # Rest of the existing checks...
 964        self.check_dependencies()
 965        self.run_custom_checks()
 966        self.check_streamlit_pages()
 967        self._update_overall_status()
 968        
 969    def check_cpu(self):
 970        """
 971        Checks the current CPU usage and updates the health status based on configured thresholds.
 972        Measures the CPU usage percentage over a 1-second interval using psutil. Compares the result
 973        against warning and critical thresholds defined in the configuration. Sets the status to
 974        'healthy', 'warning', or 'critical' accordingly, and updates the health data dictionary.
 975        
 976        Returns:
 977        
 978            None
 979        """
 980        
 981        cpu_percent = psutil.cpu_percent(interval=1)
 982        warning_threshold = self.config["thresholds"].get("cpu_warning", 70)
 983        critical_threshold = self.config["thresholds"].get("cpu_critical", 90)
 984        
 985        status = "healthy"
 986        if cpu_percent >= critical_threshold:
 987            status = "critical"
 988        elif cpu_percent >= warning_threshold:
 989            status = "warning"
 990            
 991        self.health_data["system"]["cpu"] = {
 992            "usage_percent": cpu_percent,
 993            "status": status
 994        }
 995        
 996    def check_memory(self):
 997        """
 998        Checks the system's memory usage and updates the health status accordingly.
 999        Retrieves the current memory usage statistics using psutil, compares the usage percentage
1000        against configured warning and critical thresholds, and sets the memory status to 'healthy',
1001        'warning', or 'critical'. Updates the health_data dictionary with total memory, available memory,
1002        usage percentage, and status.
1003        
1004        Returns:
1005        
1006            None
1007        """
1008        
1009        memory = psutil.virtual_memory()
1010        memory_percent = memory.percent
1011        warning_threshold = self.config["thresholds"].get("memory_warning", 70)
1012        critical_threshold = self.config["thresholds"].get("memory_critical", 90)
1013        
1014        status = "healthy"
1015        if memory_percent >= critical_threshold:
1016            status = "critical"
1017        elif memory_percent >= warning_threshold:
1018            status = "warning"
1019            
1020        self.health_data["system"]["memory"] = {
1021            "total_gb": round(memory.total / (1024**3), 2),
1022            "available_gb": round(memory.available / (1024**3), 2),
1023            "usage_percent": memory_percent,
1024            "status": status
1025        }
1026        
1027    def check_disk(self):
1028        """
1029        Checks the disk usage of the root filesystem and updates the health status.
1030        Retrieves disk usage statistics using psutil, compares the usage percentage
1031        against configured warning and critical thresholds, and sets the disk status
1032        accordingly (`healthy`, `warning`, or `critical`). Updates the health_data
1033        dictionary with total disk size, free space, usage percentage, and status.
1034        
1035        Returns:
1036        
1037            None
1038        """
1039        
1040        disk = psutil.disk_usage('/')
1041        disk_percent = disk.percent
1042        warning_threshold = self.config["thresholds"].get("disk_warning", 70)
1043        critical_threshold = self.config["thresholds"].get("disk_critical", 90)
1044        
1045        status = "healthy"
1046        if disk_percent >= critical_threshold:
1047            status = "critical"
1048        elif disk_percent >= warning_threshold:
1049            status = "warning"
1050            
1051        self.health_data["system"]["disk"] = {
1052            "total_gb": round(disk.total / (1024**3), 2),
1053            "free_gb": round(disk.free / (1024**3), 2),
1054            "usage_percent": disk_percent,
1055            "status": status
1056        }
1057        
1058    def check_dependencies(self):
1059        """
1060        Checks the health of configured dependencies, including API endpoints and databases.
1061        Iterates through the list of API endpoints and databases specified in the configuration,
1062        and performs health checks on each by invoking the corresponding internal methods.
1063        
1064        Raises:
1065        
1066            Exception: If any dependency check fails.
1067        """
1068        
1069        # Check API endpoints
1070        for endpoint in self.config["dependencies"].get("api_endpoints", []):
1071            self._check_api_endpoint(endpoint)
1072            
1073        # Check database connections
1074        for db in self.config["dependencies"].get("databases", []):
1075            self._check_database(db)
1076            
1077    def _check_api_endpoint(self, endpoint: Dict):
1078        """
1079        Check if an API endpoint is accessible.
1080        
1081        Args:
1082        
1083            endpoint: Dictionary with endpoint configuration
1084        """
1085        name = endpoint.get("name", "unknown_api")
1086        url = endpoint.get("url", "")
1087        timeout = endpoint.get("timeout", 5)
1088        
1089        if not url:
1090            return
1091            
1092        try:
1093            start_time = time.time()
1094            response = requests.get(url, timeout=timeout)
1095            response_time = time.time() - start_time
1096            
1097            status = "healthy" if response.status_code < 400 else "critical"
1098            
1099            self.health_data["dependencies"][name] = {
1100                "type": "api",
1101                "url": url,
1102                "status": status,
1103                "response_time_ms": round(response_time * 1000, 2),
1104                "status_code": response.status_code
1105            }
1106        except Exception as e:
1107            self.health_data["dependencies"][name] = {
1108                "type": "api",
1109                "url": url,
1110                "status": "critical",
1111                "error": str(e)
1112            }
1113            
1114    def _check_database(self, db_config: Dict):
1115        """
1116        Check database connection.
1117        Note: This is a placeholder. You'll need to implement specific database checks
1118        based on your application's needs.
1119        
1120        Args:
1121        
1122            db_config: Dictionary with database configuration
1123        """
1124        name = db_config.get("name", "unknown_db")
1125        db_type = db_config.get("type", "")
1126        
1127        # Placeholder for database connection check
1128        # In a real implementation, you would check the specific database connection
1129        self.health_data["dependencies"][name] = {
1130            "type": "database",
1131            "db_type": db_type,
1132            "status": "unknown",
1133            "message": "Database check not implemented"
1134        }
1135        
1136    def register_custom_check(self, name: str, check_func: Callable[[], Dict[str, Any]]):
1137        """
1138        Register a custom health check function.
1139        
1140        Args:
1141        
1142            name: Name of the custom check
1143            check_func: Function that performs the check and returns a dictionary with results
1144        """
1145        if "custom_checks" not in self.health_data:
1146            self.health_data["custom_checks"] = {}
1147            
1148        self.health_data["custom_checks"][name] = {
1149            "status": "unknown",
1150            "check_func": check_func
1151        }
1152        
1153    def run_custom_checks(self):
1154        """Run all registered custom health checks."""
1155        if "custom_checks" not in self.health_data:
1156            return
1157            
1158        for name, check_info in list(self.health_data["custom_checks"].items()):
1159            if "check_func" in check_info and callable(check_info["check_func"]):
1160                try:
1161                    result = check_info["check_func"]()
1162                    # Remove the function reference from the result
1163                    func = check_info["check_func"]
1164                    self.health_data["custom_checks"][name] = result
1165                    # Add the function back
1166                    self.health_data["custom_checks"][name]["check_func"] = func
1167                except Exception as e:
1168                    self.health_data["custom_checks"][name] = {
1169                        "status": "critical",
1170                        "error": str(e),
1171                        "check_func": check_info["check_func"]
1172                    }
1173                    
1174    def _update_overall_status(self):
1175        """
1176        Updates the overall health status of the application based on the statuses of various components.
1177        
1178        The method checks the health status of the following components:
1179            - Streamlit server
1180            - System checks
1181            - Dependencies
1182            - Custom checks (excluding those with a 'check_func' key)
1183            - Streamlit pages
1184            
1185        The overall status is determined using the following priority order:
1186            1. "critical" if any component is critical
1187            2. "warning" if any component is warning and none are critical
1188            3. "unknown" if any component is unknown and none are critical or warning, and no healthy components exist
1189            4. "healthy" if any component is healthy and none are critical, warning, or unknown
1190            5. "unknown" if no statuses are found
1191            
1192        The result is stored in `self.health_data["overall_status"]`.
1193        """
1194        
1195        has_critical = False
1196        has_warning = False
1197        has_healthy = False
1198        has_unknown = False
1199        
1200        # Helper function to check status
1201        def check_component_status(status):
1202            nonlocal has_critical, has_warning, has_healthy, has_unknown
1203            if status == "critical":
1204                has_critical = True
1205            elif status == "warning":
1206                has_warning = True
1207            elif status == "healthy":
1208                has_healthy = True
1209            elif status == "unknown":
1210                has_unknown = True
1211
1212        # Check Streamlit server status
1213        server_status = self.health_data.get("streamlit_server", {}).get("status")
1214        check_component_status(server_status)
1215        
1216        # Check system status
1217        for system_check in self.health_data.get("system", {}).values():
1218            check_component_status(system_check.get("status"))
1219                    
1220        # Check dependencies status
1221        for dep_check in self.health_data.get("dependencies", {}).values():
1222            check_component_status(dep_check.get("status"))
1223                    
1224        # Check custom checks status
1225        for custom_check in self.health_data.get("custom_checks", {}).values():
1226            if isinstance(custom_check, dict) and "check_func" not in custom_check:
1227                check_component_status(custom_check.get("status"))
1228        
1229        # Check Streamlit pages status
1230        pages_status = self.health_data.get("streamlit_pages", {}).get("status")
1231        check_component_status(pages_status)
1232                        
1233        # Determine overall status with priority:
1234        # critical > warning > unknown > healthy
1235        if has_critical:
1236            self.health_data["overall_status"] = "critical"
1237        elif has_warning:
1238            self.health_data["overall_status"] = "warning"
1239        elif has_unknown and not has_healthy:
1240            self.health_data["overall_status"] = "unknown"
1241        elif has_healthy:
1242            self.health_data["overall_status"] = "healthy"
1243        else:
1244            self.health_data["overall_status"] = "unknown"
1245                
1246    def get_health_data(self) -> Dict:
1247        """Get the latest health check data."""
1248        # Create a copy without the function references
1249        result: Dict[str, Any] = {}
1250        for key, value in self.health_data.items():
1251            if key == "custom_checks":
1252                result[key] = {}
1253                for check_name, check_data in value.items():
1254                    if isinstance(check_data, dict):
1255                        check_copy = check_data.copy()
1256                        if "check_func" in check_copy:
1257                            del check_copy["check_func"]
1258                        result[key][check_name] = check_copy
1259            else:
1260                result[key] = value
1261        return result
1262        
1263    def save_config(self):
1264        """
1265        Saves the current health check configuration to a JSON file.
1266        Attempts to write the configuration stored in `self.config` to the file specified by `self.config_path`.
1267        Displays a success message in the Streamlit app upon successful save.
1268        Handles and displays appropriate error messages for file not found, permission issues, JSON decoding errors, and other exceptions.
1269        
1270        Raises:
1271        
1272            FileNotFoundError: If the configuration file path does not exist.
1273            PermissionError: If there are insufficient permissions to write to the file.
1274            json.JSONDecodeError: If there is an error decoding the JSON data.
1275            Exception: For any other exceptions encountered during the save process.
1276        """
1277        
1278        try:
1279            with open(self.config_path, "w") as f:
1280                json.dump(self.config, f, indent=2)
1281                st.success(f"Health check config saved successfully to {self.config_path}")
1282        except FileNotFoundError:
1283            st.error(f"Configuration file not found: {self.config_path}")
1284        except PermissionError:
1285            st.error(f"Permission denied: Unable to write to {self.config_path}")
1286        except json.JSONDecodeError:
1287            st.error(f"Error decoding JSON in config file: {self.config_path}")
1288        except Exception as e:
1289            st.error(f"Error saving health check config: {str(e)}")
1290    def check_streamlit_pages(self):
1291        """
1292        Checks for errors in Streamlit pages and updates the health data accordingly.
1293        This method retrieves page errors using StreamlitPageMonitor.get_page_errors().
1294        If errors are found, it sets the 'streamlit_pages' status to 'critical' and updates
1295        the overall health status to 'critical'. If no errors are found, it marks the
1296        'streamlit_pages' status as 'healthy'.
1297        
1298        Updates:
1299        
1300            self.health_data["streamlit_pages"]: Dict containing status, error count, errors, and details.
1301            self.health_data["overall_status"]: Set to 'critical' if errors are detected.
1302            self.health_data["streamlit_pages"]["details"]: A summary of the errors found.
1303            
1304        Returns:
1305        
1306            None
1307        """
1308        
1309        page_errors = StreamlitPageMonitor.get_page_errors()
1310        
1311        if "streamlit_pages" not in self.health_data:
1312            self.health_data["streamlit_pages"] = {}
1313        
1314        if page_errors:
1315            total_errors = sum(len(errors) for errors in page_errors.values())
1316            self.health_data["streamlit_pages"] = {
1317                "status": "critical",
1318                "error_count": total_errors,
1319                "errors": page_errors,
1320                "details": "Errors detected in Streamlit pages"
1321            }
1322            # This affects overall status
1323            self.health_data["overall_status"] = "critical"
1324        else:
1325            self.health_data["streamlit_pages"] = {
1326                "status": "healthy",
1327                "error_count": 0,
1328                "errors": {},
1329                "details": "All pages functioning normally"
1330            }
1331    
1332    def check_streamlit_server(self) -> Dict[str, Any]:
1333        """
1334        Checks the health status of the Streamlit server by sending a GET request to the /healthz endpoint.
1335        
1336        Returns:
1337        
1338            Dict[str, Any]: A dictionary containing the health status, response code, latency in milliseconds,
1339                            message, and the URL checked. If the server is healthy (HTTP 200), status is "healthy".
1340                            Otherwise, status is "critical" with error details.
1341                            
1342        Handles:
1343        
1344            - Connection errors: Returns critical status with connection error details.
1345            - Timeout errors: Returns critical status with timeout error details.
1346            - Other exceptions: Returns critical status with unknown error details.
1347            
1348        Logs:
1349        
1350            - The URL being checked.
1351            - The response status code and text.
1352            - Health status and response time if healthy.
1353            - Warnings and errors for unhealthy or failed checks.
1354        """
1355        
1356        try:
1357            host = self.streamlit_url.rstrip('/')
1358            if not host.startswith(('http://', 'https://')):
1359                host = f"http://{host}"
1360            
1361            url = f"{host}:{self.streamlit_port}/healthz"
1362            self.logger.info(f"Checking Streamlit server health at: {url}")
1363            
1364            start_time = time.time()
1365            response = requests.get(url, timeout=3)
1366            total_time = (time.time() - start_time) * 1000
1367            self.logger.info(f"{response.status_code} - {response.text}")
1368            # Check if the response is healthy
1369            if response.status_code == 200:
1370                self.logger.info(f"Streamlit server healthy - Response time: {round(total_time, 2)}ms")
1371                return {
1372                    "status": "healthy",
1373                    "response_code": response.status_code,
1374                    "latency_ms": round(total_time, 2),
1375                    "message": "Streamlit server is running",
1376                    "url": url
1377                }
1378            else:
1379                self.logger.warning(f"Unhealthy response from server: {response.status_code}")
1380                return {
1381                    "status": "critical",
1382                    "response_code": response.status_code,
1383                    "error": f"Unhealthy response from server: {response.status_code}",
1384                    "message": "Streamlit server is not healthy",
1385                    "url": url
1386                }
1387
1388        except requests.exceptions.ConnectionError as e:
1389            self.logger.error(f"Connection error while checking Streamlit server: {str(e)}")
1390            return {
1391                "status": "critical",
1392                "error": f"Connection error: {str(e)}",
1393                "message": "Cannot connect to Streamlit server",
1394                "url": url
1395            }
1396        except requests.exceptions.Timeout as e:
1397            self.logger.error(f"Timeout while checking Streamlit server: {str(e)}")
1398            return {
1399                "status": "critical",
1400                "error": f"Timeout error: {str(e)}",
1401                "message": "Streamlit server is not responding",
1402                "url": url
1403            }
1404        except Exception as e:
1405            self.logger.error(f"Unexpected error while checking Streamlit server: {str(e)}")
1406            return {
1407                "status": "critical",
1408                "error": f"Unknown error: {str(e)}",
1409                "message": "Failed to check Streamlit server",
1410                "url": url
1411            }
1412    
1413def health_check(config_path:str = "health_check_config.json"):
1414    """
1415    Displays an interactive Streamlit dashboard for monitoring application health.
1416    This function initializes and manages a health check service, presenting real-time system metrics,
1417    dependency statuses, custom checks, and Streamlit page health in a user-friendly dashboard.
1418    Users can manually refresh health checks, view detailed error information, and adjust configuration
1419    thresholds and intervals directly from the UI.
1420    
1421    Args:
1422    
1423        config_path (str, optional): Path to the health check configuration JSON file.
1424            Defaults to "health_check_config.json".
1425            
1426    Features:
1427    
1428        - Displays overall health status with color-coded indicators.
1429        - Shows last updated timestamp for health data.
1430        - Monitors Streamlit server status, latency, and errors.
1431        - Provides tabs for:
1432            * System Resources (CPU, Memory, Disk usage and status)
1433            * Dependencies (external services and their health)
1434            * Custom Checks (user-defined health checks)
1435            * Streamlit Pages (page-specific errors and status)
1436        - Allows configuration of system thresholds, check intervals, and Streamlit server settings.
1437        - Supports manual refresh and saving configuration changes.
1438        
1439    Raises:
1440    
1441        Displays error messages in the UI for any exceptions encountered during health data retrieval or processing.
1442        
1443    Returns:
1444    
1445        None. The dashboard is rendered in the Streamlit app.
1446    """
1447    
1448    logger = logging.getLogger(f"{__name__}.health_check")
1449    logger.info("Starting health check dashboard")
1450    st.title("Application Health Dashboard")
1451    
1452    # Initialize or get the health check service
1453    if "health_service" not in st.session_state:
1454        logger.info("Initializing new health check service")
1455        st.session_state.health_service = HealthCheckService(config_path = config_path)
1456        st.session_state.health_service.start()
1457    
1458    health_service = st.session_state.health_service
1459    health_service.run_all_checks()
1460    
1461    # Add controls for manual refresh and configuration
1462    col1, col2 = st.columns([3, 1])
1463    with col1:
1464        st.subheader("System Health Status")
1465    with col2:
1466        if st.button("Refresh Now"):
1467            health_service.run_all_checks()
1468    
1469    # Get the latest health data
1470    health_data = health_service.get_health_data()
1471    
1472    # Display overall status with appropriate color
1473    overall_status = health_data.get("overall_status", "unknown")
1474    status_color = {
1475        "healthy": "green",
1476        "warning": "orange",
1477        "critical": "red",
1478        "unknown": "gray"
1479    }.get(overall_status, "gray")
1480    
1481    st.markdown(
1482        f"<h3 style='color: {status_color};'>Overall Status: {overall_status.upper()}</h3>",
1483        unsafe_allow_html=True
1484    )
1485    
1486    # Display last updated time
1487    if health_data.get("last_updated"):
1488        try:
1489            last_updated = datetime.fromisoformat(health_data["last_updated"])
1490            st.text(f"Last updated: {last_updated.strftime('%Y-%m-%d %H:%M:%S')}")
1491        except Exception as e:
1492            st.error(f"Last updated: {health_data['last_updated']}")
1493            st.exception(e)
1494    
1495    server_health = health_data.get("streamlit_server", {})
1496    server_status = server_health.get("status", "unknown")
1497    server_color = {
1498        "healthy": "green",
1499        "critical": "red",
1500        "unknown": "gray"
1501    }.get(server_status, "gray")
1502
1503    st.markdown(
1504        f"### Streamlit Server Status: <span style='color: {server_color}'>{server_status.upper()}</span>",
1505        unsafe_allow_html=True
1506    )
1507
1508    if server_status != "healthy":
1509        st.error(server_health.get("message", "Server status unknown"))
1510        if "error" in server_health:
1511            st.code(server_health["error"])
1512    else:
1513        st.success(server_health.get("message", "Server is running"))
1514        if "latency_ms" in server_health:
1515            latency = server_health["latency_ms"]
1516            # Define color based on latency thresholds
1517            if latency <= 50:
1518                latency_color = "green"
1519                performance = "Excellent"
1520            elif latency <= 100:
1521                latency_color = "blue"
1522                performance = "Good"
1523            elif latency <= 200:
1524                latency_color = "orange"
1525                performance = "Fair"
1526            else:
1527                latency_color = "red"
1528                performance = "Poor"
1529                
1530            st.markdown(
1531                f"""
1532                <div style='display: flex; align-items: center; gap: 10px;'>
1533                    <div>Server Response Time:</div>
1534                    <div style='color: {latency_color}; font-weight: bold;'>
1535                        {latency} ms
1536                    </div>
1537                    <div style='color: {latency_color};'>
1538                        ({performance})
1539                    </div>
1540                </div>
1541                """,
1542                unsafe_allow_html=True
1543            )
1544    
1545    # Create tabs for different categories of health checks
1546    tab1, tab2, tab3, tab4 = st.tabs(["System Resources", "Dependencies", "Custom Checks", "Streamlit Pages"])
1547    
1548    with tab1:
1549        # Display system health checks
1550        system_data = health_data.get("system", {})
1551        
1552        # CPU
1553        if "cpu" in system_data:
1554            cpu_data = system_data["cpu"]
1555            cpu_status = cpu_data.get("status", "unknown")
1556            cpu_color = {"healthy": "green", "warning": "orange", "critical": "red"}.get(cpu_status, "gray")
1557            
1558            st.markdown(f"### CPU Status: <span style='color:{cpu_color}'>{cpu_status.upper()}</span>", unsafe_allow_html=True)
1559            st.progress(cpu_data.get("usage_percent", 0) / 100)
1560            st.text(f"CPU Usage: {cpu_data.get('usage_percent', 0)}%")
1561        
1562        # Memory
1563        if "memory" in system_data:
1564            memory_data = system_data["memory"]
1565            memory_status = memory_data.get("status", "unknown")
1566            memory_color = {"healthy": "green", "warning": "orange", "critical": "red"}.get(memory_status, "gray")
1567            
1568            st.markdown(f"### Memory Status: <span style='color:{memory_color}'>{memory_status.upper()}</span>", unsafe_allow_html=True)
1569            st.progress(memory_data.get("usage_percent", 0) / 100)
1570            st.text(f"Memory Usage: {memory_data.get('usage_percent', 0)}%")
1571            st.text(f"Total Memory: {memory_data.get('total_gb', 0)} GB")
1572            st.text(f"Available Memory: {memory_data.get('available_gb', 0)} GB")
1573        
1574        # Disk
1575        if "disk" in system_data:
1576            disk_data = system_data["disk"]
1577            disk_status = disk_data.get("status", "unknown")
1578            disk_color = {"healthy": "green", "warning": "orange", "critical": "red"}.get(disk_status, "gray")
1579            
1580            st.markdown(f"### Disk Status: <span style='color:{disk_color}'>{disk_status.upper()}</span>", unsafe_allow_html=True)
1581            st.progress(disk_data.get("usage_percent", 0) / 100)
1582            st.text(f"Disk Usage: {disk_data.get('usage_percent', 0)}%")
1583            st.text(f"Total Disk Space: {disk_data.get('total_gb', 0)} GB")
1584            st.text(f"Free Disk Space: {disk_data.get('free_gb', 0)} GB")
1585    
1586    with tab2:
1587        # Display dependency health checks
1588        dependencies = health_data.get("dependencies", {})
1589        if dependencies:
1590            # Create a dataframe for all dependencies
1591            dep_data = []
1592            for name, dep_info in dependencies.items():
1593                dep_data.append({
1594                    "Name": name,
1595                    "Type": dep_info.get("type", "unknown"),
1596                    "Status": dep_info.get("status", "unknown"),
1597                    "Details": ", ".join([f"{k}: {v}" for k, v in dep_info.items() 
1598                               if k not in ["name", "type", "status", "error"] and not isinstance(v, dict)])
1599                })
1600            
1601            # Show dependencies table
1602            if dep_data:
1603                df_deps = pd.DataFrame(dep_data)
1604                st.dataframe(df_deps)
1605            else:
1606                st.info("No dependencies configured")
1607
1608            # Create a dataframe for all custom checks from health_data
1609            custom_checks = health_data.get("custom_checks", {})
1610            check_data = []
1611            for name, check_info in custom_checks.items():
1612                if isinstance(check_info, dict) and "check_func" not in check_info:
1613                    check_data.append({
1614                        "Name": name,
1615                        "Status": check_info.get("status", "unknown"),
1616                        "Details": ", ".join([f"{k}: {v}" for k, v in check_info.items()
1617                                             if k not in ["name", "status", "check_func", "error"] and not isinstance(v, dict)]),
1618                        "Error": check_info.get("error", "")
1619                    })
1620
1621            if check_data:
1622                df_checks = pd.DataFrame(check_data)
1623
1624                # Apply color formatting to status column
1625                def color_status(val):
1626                    colors = {
1627                        "healthy": "background-color: #c6efce; color: #006100",
1628                        "warning": "background-color: #ffeb9c; color: #9c5700",
1629                        "critical": "background-color: #ffc7ce; color: #9c0006",
1630                        "unknown": "background-color: #eeeeee; color: #7f7f7f"
1631                    }
1632                    return colors.get(str(val).lower(), "")
1633
1634                # Use styled dataframe to color the Status column
1635                try:
1636                    # apply expects a function that returns a sequence of styles for the column;
1637                    # map color_status across the 'Status' column to produce the CSS strings.
1638                    st.dataframe(
1639                        df_checks.style.apply(
1640                            lambda col: col.map(color_status),
1641                            subset=["Status"]
1642                        )
1643                    )
1644                except Exception:
1645                    # Fallback if styling isn't supported in the environment
1646                    st.dataframe(df_checks)
1647            else:
1648                st.info("No custom checks configured")
1649        else:
1650            st.info("No custom checks configured")
1651    with tab4:
1652        # Always read page errors from SQLite DB for latest state
1653        page_errors = StreamlitPageMonitor.get_page_errors()
1654        error_count = sum(len(errors) for errors in page_errors.values())
1655        status = "critical" if error_count > 0 else "healthy"
1656        status_color = {
1657            "healthy": "green",
1658            "critical": "red",
1659            "unknown": "gray"
1660        }.get(status, "gray")
1661        st.markdown(f"### Page Status: <span style='color:{status_color}'>{status.upper()}</span>", unsafe_allow_html=True)
1662        st.metric("Error Count", error_count)
1663        if error_count > 0:
1664            st.markdown("<div style='background-color:#ffe6e6; color:#b30000; padding:10px; border-radius:5px; border:1px solid #b30000; font-weight:bold;'>Pages with errors:</div>",
1665            unsafe_allow_html=True)
1666            for page_name, page_errors_list in page_errors.items():
1667                display_name = page_name.split("/")[-1] if "/" in page_name else page_name
1668                for error_info in page_errors_list:
1669                    if isinstance(error_info, dict):
1670                        with st.expander(f"Error in {display_name}"):
1671                            st.info(error_info.get('error', 'Unknown error'))
1672                            if error_info.get('type') == 'streamlit_error':
1673                                st.text("Type: Streamlit Error")
1674                            else:
1675                                st.text("Type: Exception")
1676                            st.text("Traceback:")
1677                            st.code("".join(error_info.get('traceback', ['No traceback available'])))
1678                            st.text(f"Timestamp: {error_info.get('timestamp', 'No timestamp')}")
1679    
1680    # Configuration section
1681    with st.expander("Health Check Configuration"):
1682        st.subheader("System Check Thresholds")
1683        
1684        col1, col2 = st.columns(2)
1685        with col1:
1686            cpu_warning = st.slider("CPU Warning Threshold (%)", 
1687                                min_value=10, max_value=90, 
1688                                value=health_service.config["thresholds"].get("cpu_warning", 70),
1689                                step=5)
1690            memory_warning = st.slider("Memory Warning Threshold (%)", 
1691                                   min_value=10, max_value=90, 
1692                                   value=health_service.config["thresholds"].get("memory_warning", 70),
1693                                   step=5)
1694            disk_warning = st.slider("Disk Warning Threshold (%)", 
1695                                 min_value=10, max_value=90, 
1696                                 value=health_service.config["thresholds"].get("disk_warning", 70),
1697                                 step=5)
1698            streamlit_url_update = st.text_input(
1699                "Streamlit Server URL",
1700                value=health_service.config.get("streamlit_url", "http://localhost")
1701            )
1702        
1703        with col2:
1704            cpu_critical = st.slider("CPU Critical Threshold (%)", 
1705                                 min_value=20, max_value=95, 
1706                                 value=health_service.config["thresholds"].get("cpu_critical", 90),
1707                                 step=5)
1708            memory_critical = st.slider("Memory Critical Threshold (%)", 
1709                                    min_value=20, max_value=95, 
1710                                    value=health_service.config["thresholds"].get("memory_critical", 90),
1711                                    step=5)
1712            disk_critical = st.slider("Disk Critical Threshold (%)", 
1713                                  min_value=20, max_value=95, 
1714                                  value=health_service.config["thresholds"].get("disk_critical", 90),
1715                                  step=5)
1716        
1717            check_interval = st.slider("Check Interval (seconds)", 
1718                                min_value=10, max_value=300, 
1719                                value=health_service.config.get("check_interval", 60),
1720                                step=10)
1721            streamlit_port_update = st.number_input(
1722                "Streamlit Server Port",
1723                value=health_service.config.get("streamlit_port", 8501),
1724                step=1
1725            )
1726        
1727        if st.button("Save Configuration"):
1728            # Update configuration
1729            health_service.config["thresholds"]["cpu_warning"] = cpu_warning
1730            health_service.config["thresholds"]["cpu_critical"] = cpu_critical
1731            health_service.config["thresholds"]["memory_warning"] = memory_warning
1732            health_service.config["thresholds"]["memory_critical"] = memory_critical
1733            health_service.config["thresholds"]["disk_warning"] = disk_warning
1734            health_service.config["thresholds"]["disk_critical"] = disk_critical
1735            health_service.config["check_interval"] = check_interval
1736            health_service.config["streamlit_url"] = streamlit_url_update
1737            health_service.config["streamlit_port"] = streamlit_port_update
1738            
1739            # Save to file
1740            health_service.save_config()
1741            st.success("Configuration saved successfully")
1742            
1743            # Restart the service if interval changed
1744            health_service.stop()
1745            health_service.start()

logger = <Logger streamlit_healthcheck.healthcheck (INFO)>

class StreamlitPageMonitor: View Source

 27class StreamlitPageMonitor:
 28    """
 29    Singleton class that monitors and records errors occurring within Streamlit pages.
 30    It captures both explicit Streamlit error messages (monkey-patching st.error) and
 31    uncaught exceptions raised during the execution of monitored page functions, and
 32    persists error details to a local SQLite database.
 33    
 34    Key responsibilities
 35    
 36    - Intercept Streamlit error calls by monkey-patching st.error and record them with
 37        a stack trace, timestamp, status, and type.
 38    - Provide a decorator `monitor_page(page_name)` to set a page context, capture
 39        exceptions raised while rendering/executing a page, and record those exceptions.
 40    - Store errors in an in-memory structure grouped by page and persist them to
 41        an SQLite database for later inspection.
 42    - Provide utilities to load, deduplicate, clear, and query stored errors.
 43    
 44    Behavior and side effects
 45    
 46    - Implements the Singleton pattern: only one instance exists per Python process.
 47    - On first instantiation, optionally accepts a custom db_path and initializes
 48        the SQLite database and its parent directory (creating it if necessary).
 49    - Monkey-patches `streamlit.error` (st.error) to capture calls and still forward
 50        them to the original st.error implementation.
 51    - Records the following fields for each error: page, error, traceback, timestamp,
 52        status, type. The SQLite table `errors` mirrors these fields and includes an
 53        auto-incrementing `id`.
 54    - Persists errors immediately to SQLite when captured; database IO errors are
 55        logged but do not suppress the original exception (for monitored exceptions,
 56        the exception is re-raised after recording).
 57        
 58    Public API (methods)
 59    
 60    - __new__(cls, db_path=None)
 61            Create or return the singleton StreamlitPageMonitor instance.
 62        
 63            Parameters
 64            ----------
 65            db_path : Optional[str]
 66                If provided on the first instantiation, overrides the class-level
 67                database path used to persist captured Streamlit error information.
 68                
 69            Returns
 70            -------
 71            StreamlitPageMonitor
 72                The singleton instance of the class.
 73                
 74            Behavior
 75            --------
 76            - On first instantiation (when cls._instance is None):
 77            - Allocates the singleton via super().__new__.
 78            - Optionally sets cls._db_path from the provided db_path.
 79            - Logs the configured DB path.
 80            - Monkey-patches streamlit.error (st.error) with a wrapper that:
 81                - Builds an error record containing the error text, a formatted stack trace,
 82                ISO timestamp, severity/status, an error type marker, and the current page.
 83                - Normalizes a missing current page to "unknown_page".
 84                - Stores the record in the in-memory cls._errors dictionary keyed by page.
 85                - Attempts to persist the record to the SQLite DB using cls().save_errors_to_db,
 86                logging any persistence errors without interrupting Streamlit's normal error display.
 87                - Calls the original st.error to preserve expected UI behavior.
 88            - Initializes the SQLite DB via cls._init_db().
 89            - On subsequent calls:
 90            - Returns the existing singleton instance.
 91            - If db_path is provided, updates cls._db_path for future use.
 92            
 93            Side effects
 94            ------------
 95            - Replaces st.error globally for the running process.
 96            - Writes error records to both an in-memory structure (cls._errors) and to the
 97            configured SQLite database (if persistence succeeds).
 98            - Logs informational and error messages.
 99            
100            Notes
101            -----
102            - The method assumes the class defines/has: _instance, _db_path, _current_page,
103            _errors, _st_error (original st.error), save_errors_to_db, and _init_db.
104            - Exceptions raised during saving of individual errors are caught and logged;
105            exceptions from instance creation or DB initialization may propagate.
106            - The implementation is not explicitly thread-safe; concurrent instantiation
107            attempts may require external synchronization if used in multi-threaded contexts.
108    - set_page_context(cls, page_name: str)
109            Set the current page name used when recording subsequent errors.
110    - monitor_page(cls, page_name: str) -> Callable
111            Decorator for page rendering/execution functions. Sets the page context,
112            clears previously recorded non-Streamlit errors for that page, runs the
113            function, records and persists any raised exception, and re-raises it.
114    - _handle_st_error(cls, error_message: str)
115    
116            Handles Streamlit-specific errors by recording error details for the current page.
117        
118            Args:
119                error_message (str): The error message to be logged.
120                
121            Side Effects:
122                Updates the class-level _errors dictionary with error information for the current Streamlit page.
123                
124            Error Information Stored:
125                - error: Formatted error message.
126                - traceback: Stack trace at the point of error.
127                - timestamp: Time when the error occurred (ISO format).
128                - status: Error severity ('critical').
129                - type: Error type ('streamlit_error').
130    - get_page_errors(cls) -> dict
131            Load errors from the database and return a dictionary mapping page names to
132            lists of error dicts. Performs basic deduplication by error message.
133    - save_errors_to_db(cls, errors: Iterable[dict])
134            Persist a list of error dictionaries to the configured SQLite database.
135            Ensures traceback is stored as a string (JSON if originally a list).
136    - clear_errors(cls, page_name: Optional[str] = None)
137            Clear in-memory errors for a specific page or all pages and delete matching
138            rows from the database.
139    - _init_db(cls)
140            Ensure the database directory exists and create the `errors` table if it
141            does not exist.
142    - load_errors_from_db(cls, page=None, status=None, limit=None) -> List[dict]
143            Query the database for errors, optionally filtering by page and/or status,
144            returning a list of error dictionaries ordered by timestamp (descending)
145            and limited if requested.
146            
147    Storage and format
148    
149    - Default DB path: ~/local/share/streamlit-healthcheck/streamlit_page_errors.db (overridable).
150    - SQLite table `errors` columns: id, page, error, traceback, timestamp, status, type.
151    - Tracebacks may be stored as JSON strings (if originally lists) or plain strings.
152    Concurrency and robustness
153    - Designed for single-process usage typical of Streamlit apps. The singleton and
154        monkey-patching are process-global.
155    - Database interactions use short-lived connections; callers should handle any
156        exceptions arising from DB access (errors are logged internally).
157    - Decorator preserves original function metadata via functools.wraps.
158    
159    Examples
160    
161    - Use as a decorator on page render function:
162    >>> @StreamlitPageMonitor.monitor_page("home")
163    >>> def render_home():
164
165    - Set page context manually:
166    >>> StreamlitPageMonitor.set_page_context("settings")
167    
168    - Set custom DB path on first instantiation:
169    >>> # Place this at the top of your Streamlit app once, before any error monitoring or decorator usage to ensure the sqlite
170    >>> # database is created properly at the specified path; otherwise it will default to a temp directory. The temp directory
171    >>> # will be `~/local/share/streamlit-healthcheck/streamlit_page_errors.db`.
172    >>> StreamlitPageMonitor(db_path="/home/saradindu/dev/streamlit_page_errors.db")
173    ...
174
175    SQLite Database Schema
176    ---------------------
177    The following schema is used for persisting errors:
178
179    ```sql
180    CREATE TABLE IF NOT EXISTS errors (
181        id INTEGER PRIMARY KEY AUTOINCREMENT,
182        page TEXT,
183        error TEXT,
184        traceback TEXT,
185        timestamp TEXT,
186        status TEXT,
187        type TEXT
188    );
189    ```
190
191    Field Descriptions:
192
193    | Column     | Type    | Description                                 |
194    |------------|---------|---------------------------------------------|
195    | id         | INTEGER | Auto-incrementing primary key               |
196    | page       | TEXT    | Name of the Streamlit page                  |
197    | error      | TEXT    | Error message                               |
198    | traceback  | TEXT    | Stack trace or traceback (as string/JSON)   |
199    | timestamp  | TEXT    | ISO8601 timestamp of error occurrence       |
200    | status     | TEXT    | Severity/status (e.g., 'critical')          |
201    | type       | TEXT    | Error type ('streamlit_error', 'exception') |
202
203    Example:
204    
205    >>> @StreamlitPageMonitor.monitor_page("home")
206    >>> def render_home():
207    
208    Notes
209    
210    - The class monkey-patches st.error globally when first instantiated; ensure
211        this side effect is acceptable in your environment.
212    - Errors captured by st.error that occur outside any known page are recorded
213        under the page name "unknown_page".
214    - The schema is created/ensured in `_init_db()`.
215    - Tracebacks may be stored as JSON strings or plain text.
216    - Errors are persisted immediately upon capture.
217    
218    """
219    _instance = None
220    _errors: Dict[str, List[Dict[str, Any]]] = {}
221    _st_error = st.error
222    _current_page = None
223
224    # --- SQLite schema for error persistence ---
225    # Table: errors
226    # Fields:
227    #   id INTEGER PRIMARY KEY AUTOINCREMENT
228    #   page TEXT
229    #   error TEXT
230    #   traceback TEXT
231    #   timestamp TEXT
232    #   status TEXT
233    #   type TEXT
234    
235    # Local development DB path
236    _db_path = os.path.join(os.path.expanduser("~"), "dev", "streamlit-healthcheck", "streamlit_page_errors.db")
237    # Final build DB path
238    #_db_path = os.path.join(os.path.expanduser("~"), ".local", "share", "streamlit-healthcheck", "streamlit_page_errors.db")
239
240    def __new__(cls, db_path=None):
241        """
242        Create or return the singleton StreamlitPageMonitor instance.
243        """
244        
245        if cls._instance is None:
246            cls._instance = super(StreamlitPageMonitor, cls).__new__(cls)
247            # Allow db_path override at first instantiation
248            if db_path is not None:
249                cls._db_path = db_path
250            logger.info(f"StreamlitPageMonitor DB path set to: {cls._db_path}")
251            # Monkey patch st.error to capture error messages
252            def patched_error(*args, **kwargs):
253                error_message = " ".join(str(arg) for arg in args)
254                current_page = cls._current_page
255                error_info = {
256                    'error': error_message,
257                    'traceback': traceback.format_stack(),
258                    'timestamp': datetime.now().isoformat(),
259                    'status': 'critical',
260                    'type': 'streamlit_error',
261                    'page': current_page
262                }
263                # Ensure current_page is a string, not None
264                if current_page is None:
265                    current_page = "unknown_page"
266                if current_page not in cls._errors:
267                    cls._errors[current_page] = []
268                cls._errors[current_page].append(error_info)
269                # Persist to DB
270                try:
271                    cls().save_errors_to_db([error_info])
272                except Exception as e:
273                    logger.error(f"Failed to save Streamlit error to DB: {e}")
274                # Call original st.error
275                return cls._st_error(*args, **kwargs)
276
277            st.error = patched_error
278
279            # Initialize SQLite database
280            cls._init_db()
281        else:
282            # If already instantiated, allow updating db_path if provided
283            if db_path is not None:
284                cls._db_path = db_path
285        return cls._instance
286
287    @classmethod
288    def _handle_st_error(cls, error_message: str):
289        """
290        Handles Streamlit-specific errors by recording error details for the current page.
291        """
292        
293        # Get current page name from Streamlit context
294        current_page = getattr(st, '_current_page', 'unknown_page')
295        error_info = {
296            'error': f"Streamlit Error: {error_message}",
297            'traceback': traceback.format_stack(),
298            'timestamp': datetime.now().isoformat(),
299            'status': 'critical',
300            'type': 'streamlit_error',
301            'page': current_page
302        }
303        # Initialize list for page if not exists
304        if current_page not in cls._errors:
305            cls._errors[current_page] = []
306        # Add new error
307        cls._errors[current_page].append(error_info)
308        # Persist to DB
309        try:
310            cls().save_errors_to_db([error_info])
311        except Exception as e:
312            logger.error(f"Failed to save Streamlit error to DB: {e}")
313
314    @classmethod
315    def set_page_context(cls, page_name: str):
316        """Set the current page context"""
317        cls._current_page = page_name
318
319    @classmethod
320    def monitor_page(cls, page_name: str):
321        """
322        Decorator to monitor and log exceptions for a specific Streamlit page.
323        
324        Args:
325            page_name (str): The name of the page to monitor.
326            
327        Returns:
328            Callable: A decorator that wraps the target function, sets the page context,
329            clears previous non-Streamlit errors, and logs any exceptions that occur during execution.
330            
331        The decorator performs the following actions:
332        
333            - Sets the current page context using `cls.set_page_context`.
334            - Clears previous exception errors for the page, retaining only those marked as 'streamlit_error'.
335            - Executes the wrapped function.
336            - If an exception occurs, logs detailed error information (error message, traceback, timestamp, status, type, and page)
337              to `cls._errors` under the given page name, then re-raises the exception.
338        """
339        
340        def decorator(func):
341            """
342            Decorator to manage page-specific error handling and context setting.
343            This decorator sets the current page context before executing the decorated function.
344            It clears previous exception errors for the page, retaining only Streamlit error calls.
345            If an exception occurs during function execution, it captures error details including
346            the error message, traceback, timestamp, status, type, and page name, and appends them
347            to the page's error log. The exception is then re-raised.
348            
349            Args:
350                func (Callable): The function to be decorated.
351                
352            Returns:
353                Callable: The wrapped function with error handling and context management.
354            """
355            
356            @functools.wraps(func)
357            def wrapper(*args, **kwargs):
358                # Set the current page context
359                cls.set_page_context(page_name)
360                try:
361                    # Clear previous exception errors but keep st.error calls
362                    if page_name in cls._errors:
363                        cls._errors[page_name] = [
364                            e for e in cls._errors[page_name]
365                            if e.get('type') == 'streamlit_error'
366                        ]
367                    result = func(*args, **kwargs)
368                    return result
369                except Exception as e:
370                    error_info = {
371                        'error': str(e),
372                        'traceback': traceback.format_exc(),
373                        'timestamp': datetime.now().isoformat(),
374                        'status': 'critical',
375                        'type': 'exception',
376                        'page': page_name
377                    }
378                    if page_name not in cls._errors:
379                        cls._errors[page_name] = []
380                    cls._errors[page_name].append(error_info)
381                    # Persist to DB
382                    try:
383                        cls().save_errors_to_db([error_info])
384                    except Exception as db_exc:
385                        logger.error(f"Failed to save exception error to DB: {db_exc}")
386                    raise
387            return wrapper
388        return decorator
389
390    @classmethod
391    def get_page_errors(cls):
392        """
393        Load error records from storage and return them grouped by page.
394        This class method calls cls().load_errors_from_db() to retrieve a sequence of error records
395        (each expected to be a mapping). It normalizes each record to a dictionary with the keys:
396        
397            - 'error' (str): error message, default "Unknown error"
398            - 'traceback' (list): traceback frames or lines, default []
399            - 'timestamp' (str): timestamp string, default ""
400            - 'type' (str): error type/category, default "unknown"
401            
402        Grouping and uniqueness:
403        
404            - Records are grouped by the 'page' key; if a record has no 'page' key, the page name
405                "unknown" is used.
406            - For each page, only unique errors are kept using the 'error' string as the deduplication
407                key. When multiple records for the same page have the same 'error' value, the last
408                occurrence in the loaded sequence will be retained.
409                
410        Return value:
411        
412            - dict[str, list[dict]]: mapping from page name to a list of normalized error dicts.
413            
414        Error handling:
415        
416            - Any exception raised while loading or processing records will be logged via logger.error.
417                The method will return the result accumulated so far (or an empty dict if nothing was
418                accumulated).
419                
420        Notes:
421        
422            - The class is expected to be instantiable (cls()) and to provide a load_errors_from_db()
423                method that yields or returns an iterable of mappings.
424        """
425        
426        result = {}
427        try:
428            db_errors = cls().load_errors_from_db()
429            for err in db_errors:
430                page = err.get('page', 'unknown')
431                if page not in result:
432                    result[page] = []
433                result[page].append({
434                    'error': err.get('error', 'Unknown error'),
435                    'traceback': err.get('traceback', []),
436                    'timestamp': err.get('timestamp', ''),
437                    'type': err.get('type', 'unknown')
438                })
439            # Return only unique page errors using the 'page' column for filtering
440            return {page: list({e['error']: e for e in errors}.values()) for page, errors in result.items()}
441        except Exception as e:
442            logger.error(f"Failed to load errors from DB: {e}")
443            return result
444
445    @classmethod
446    def save_errors_to_db(cls, errors):
447        """
448        Save a sequence of error records into the SQLite database configured at cls._db_path.
449        
450        Parameters
451        ----------
452        
453        errors : Iterable[Mapping] | list[dict]
454        
455            Sequence of error records to persist. Each record is expected to be a mapping with the
456            following keys (values are stored as provided, except for traceback which is normalized):
457            
458              - "page": identifier or name of the page where the error occurred (str)
459              - "error": human-readable error message (str)
460              - "traceback": traceback information; may be a str, list, or None. If a list, it will be
461                JSON-encoded before storage. If None, an empty string is stored.
462              - "timestamp": timestamp for the error (stored as provided)
463              - "status": status associated with the error (str)
464              - "type": classification/type of the error (str)
465              
466        Behavior
467        --------
468        
469        - If `errors` is falsy (None or empty), the method returns immediately without touching the DB.
470        - Opens a SQLite connection to the path stored in `cls._db_path`.
471        - Iterates over the provided records and inserts each into the `errors` table with columns
472          (page, error, traceback, timestamp, status, type).
473        - Ensures that the `traceback` value is always written as a string (list -> JSON string,
474          other values -> str(), None -> "").
475        - Commits the transaction if all inserts succeed and always closes the connection in a finally block.
476        
477        Exceptions
478        ----------
479        
480        - Underlying sqlite3 exceptions (e.g., sqlite3.Error) are not swallowed and will propagate to the caller
481          if connection/execution fails.
482          
483        Returns
484        -------
485        
486        None
487        """
488        if not errors:
489            return
490        conn = sqlite3.connect(cls._db_path)
491        try:
492            cursor = conn.cursor()
493            for err in errors:
494                # Ensure traceback is always a string for SQLite
495                tb = err.get("traceback")
496                if isinstance(tb, list):
497                    import json
498                    tb_str = json.dumps(tb)
499                else:
500                    tb_str = str(tb) if tb is not None else ""
501                cursor.execute(
502                    """
503                    INSERT INTO errors (page, error, traceback, timestamp, status, type)
504                    VALUES (?, ?, ?, ?, ?, ?)
505                    """,
506                    (
507                        err.get("page"),
508                        err.get("error"),
509                        tb_str,
510                        err.get("timestamp"),
511                        err.get("status"),
512                        err.get("type"),
513                    ),
514                )
515            conn.commit()
516        finally:
517            conn.close()
518
519    @classmethod
520    def clear_errors(cls, page_name: Optional[str] = None):
521        """Clear stored health-check errors for a specific page or for all pages.
522        This classmethod updates both the in-memory error cache and the persistent
523        SQLite-backed store.
524        
525        If `page_name` is provided:
526        
527        - Remove the entry for that page from the class-level in-memory dictionary
528            of errors (if present).
529        - Delete all rows in the SQLite `errors` table where `page` equals `page_name`.
530        
531        If `page_name` is None:
532        
533        - Clear the entire in-memory errors dictionary.
534        - Delete all rows from the SQLite `errors` table.
535        
536        Args:
537                page_name (Optional[str]): Name of the page whose errors should be cleared.
538                        If None, all errors are cleared.
539                        
540        Returns:
541                None
542                
543        Side effects:
544        
545                - Mutates class-level state (clears entries in `cls._errors`).
546                - Opens a SQLite connection to `cls._db_path` and executes DELETE statements
547                    against the `errors` table. Commits the transaction and closes the connection.
548                    
549        Error handling:
550        
551                - Database-related exceptions are caught and logged via the module logger;
552                    they are not re-raised by this method. As a result, callers should not
553                    rely on exceptions to detect DB failures.
554                    
555        Notes:
556        
557                - The method assumes `cls._db_path` points to a valid SQLite database file
558                    and that an `errors` table exists with a `page` column.
559                - This method does not provide synchronization; callers should take care of
560                    concurrent access to class state and the database if used from multiple
561                    threads or processes.
562        """
563        
564        if page_name:
565            if page_name in cls._errors:
566                del cls._errors[page_name]
567            # Remove from DB
568            try:
569                conn = sqlite3.connect(cls._db_path)
570                cursor = conn.cursor()
571                cursor.execute("DELETE FROM errors WHERE page = ?", (page_name,))
572                conn.commit()
573                conn.close()
574            except Exception as e:
575                logger.error(f"Failed to clear errors from DB for page {page_name}: {e}")
576        else:
577            cls._errors = {}
578            # Remove all from DB
579            try:
580                conn = sqlite3.connect(cls._db_path)
581                cursor = conn.cursor()
582                cursor.execute("DELETE FROM errors")
583                conn.commit()
584                conn.close()
585            except Exception as e:
586                logger.error(f"Failed to clear all errors from DB: {e}")
587
588    @classmethod
589    def _init_db(cls):
590        """
591        Initialize the SQLite database file and ensure the required schema exists.
592        This class-level initializer performs the following steps:
593        
594        - Ensures the parent directory of cls._db_path exists; creates it if necessary.
595            - If cls._db_path has no parent directory (e.g., a bare filename), no directory is created.
596        - Connects to the SQLite database at cls._db_path (creating the file if it does not exist).
597        - Creates an "errors" table if it does not already exist with the following columns:
598            - id (INTEGER PRIMARY KEY AUTOINCREMENT)
599            - page (TEXT)
600            - error (TEXT)
601            - traceback (TEXT)
602            - timestamp (TEXT)
603            - status (TEXT)
604            - type (TEXT)
605        - Commits the schema change and closes the database connection.
606        - Logs informational and error messages using the module logger.
607        
608        Parameters
609        ----------
610        
611        cls : type
612        
613                The class on which this method is invoked. Must provide a valid string attribute
614                `_db_path` indicating the target SQLite database file path.
615                
616        Raises
617        ------
618        
619        Exception
620        
621                Re-raises exceptions encountered when creating the parent directory (os.makedirs).
622                
623        sqlite3.Error
624        
625                May be raised by sqlite3.connect or subsequent SQLite operations when the database
626                cannot be opened or initialized.
627                
628        Side effects
629        ------------
630        
631        - May create directories on the filesystem.
632        - May create or modify the SQLite database file at cls._db_path.
633        - Writes log messages via the module logger.
634        
635        Returns
636        -------
637        
638        None
639        """
640        
641        # Ensure the parent directory for the DB exists
642        db_dir = os.path.dirname(cls._db_path)
643        if db_dir and not os.path.exists(db_dir):
644            try:
645                os.makedirs(db_dir, exist_ok=False)
646                logger.info(f"Created directory for DB: {db_dir}")
647            except Exception as e:
648                logger.error(f"Failed to create DB directory {db_dir}: {e}")
649                raise
650        # Now create/connect to the DB and table
651        logger.info(f"Initializing SQLite DB at: {cls._db_path}")
652        conn = sqlite3.connect(cls._db_path)
653        c = conn.cursor()
654        c.execute('''CREATE TABLE IF NOT EXISTS errors (
655            id INTEGER PRIMARY KEY AUTOINCREMENT,
656            page TEXT,
657            error TEXT,
658            traceback TEXT,
659            timestamp TEXT,
660            status TEXT,
661            type TEXT
662        )''')
663        conn.commit()
664        conn.close()
665    @classmethod
666    def load_errors_from_db(cls, page=None, status=None, limit=None):
667        """
668        Load errors from the class SQLite database.
669        This classmethod connects to the SQLite database at cls._db_path, queries the
670        'errors' table, and returns matching error records as a list of dictionaries.
671        
672        Parameters:
673        
674            page (Optional[str]): If provided, filter results to rows where the 'page'
675                column equals this value.
676            status (Optional[str]): If provided, filter results to rows where the 'status'
677                column equals this value.
678            limit (Optional[int|str]): If provided, limits the number of returned rows.
679                The value is cast to int internally; a non-convertible value will raise
680                ValueError.
681                
682        Returns:
683        
684            List[dict]: A list of dictionaries representing rows from the 'errors' table.
685            Each dict contains the following keys:
686                - id: primary key (int)
687                - page: page identifier (str)
688                - error: short error message (str)
689                - traceback: full traceback or diagnostic text (str)
690                - timestamp: stored timestamp value as retrieved from the DB (type depends on schema)
691                - status: error status (str)
692                - type: error type/category (str)
693                
694        Raises:
695        
696            ValueError: If `limit` cannot be converted to int.
697            sqlite3.Error: If an SQLite error occurs while executing the query.
698            
699        Notes:
700        
701            - Uses parameterized queries for the 'page' and 'status' filters to avoid SQL
702              injection. The `limit` is applied after casting to int.
703            - Results are ordered by `timestamp` in descending order.
704            - The database connection is always closed in a finally block to ensure cleanup.
705        """
706        
707        conn = sqlite3.connect(cls._db_path)
708        try:
709            cursor = conn.cursor()
710            query = "SELECT id, page, error, traceback, timestamp, status, type FROM errors"
711            params = []
712            filters = []
713            if page:
714                filters.append("page = ?")
715                params.append(page)
716            if status:
717                filters.append("status = ?")
718                params.append(status)
719            if filters:
720                query += " WHERE " + " AND ".join(filters)
721            query += " ORDER BY timestamp DESC"
722            if limit:
723                query += f" LIMIT {int(limit)}"
724            cursor.execute(query, params)
725            rows = cursor.fetchall()
726            errors = []
727            for row in rows:
728                errors.append({
729                    "id": row[0],
730                    "page": row[1],
731                    "error": row[2],
732                    "traceback": row[3],
733                    "timestamp": row[4],
734                    "status": row[5],
735                    "type": row[6],
736                })
737            return errors
738        finally:
739            conn.close()

Singleton class that monitors and records errors occurring within Streamlit pages. It captures both explicit Streamlit error messages (monkey-patching st.error) and uncaught exceptions raised during the execution of monitored page functions, and persists error details to a local SQLite database.

Key responsibilities

Intercept Streamlit error calls by monkey-patching st.error and record them with a stack trace, timestamp, status, and type.
Provide a decorator monitor_page(page_name) to set a page context, capture exceptions raised while rendering/executing a page, and record those exceptions.
Store errors in an in-memory structure grouped by page and persist them to an SQLite database for later inspection.
Provide utilities to load, deduplicate, clear, and query stored errors.

Behavior and side effects

Implements the Singleton pattern: only one instance exists per Python process.
On first instantiation, optionally accepts a custom db_path and initializes the SQLite database and its parent directory (creating it if necessary).
Monkey-patches streamlit.error (st.error) to capture calls and still forward them to the original st.error implementation.
Records the following fields for each error: page, error, traceback, timestamp, status, type. The SQLite table errors mirrors these fields and includes an auto-incrementing id.
Persists errors immediately to SQLite when captured; database IO errors are logged but do not suppress the original exception (for monitored exceptions, the exception is re-raised after recording).

Public API (methods)

__new__(cls, db_path=None) Create or return the singleton StreamlitPageMonitor instance.

Parameters
----------
db_path : Optional[str]
    If provided on the first instantiation, overrides the class-level
    database path used to persist captured Streamlit error information.

Returns
-------
StreamlitPageMonitor
    The singleton instance of the class.

Behavior
--------
- On first instantiation (when cls._instance is None):
- Allocates the singleton via super().__new__.
- Optionally sets cls._db_path from the provided db_path.
- Logs the configured DB path.
- Monkey-patches streamlit.error (st.error) with a wrapper that:
    - Builds an error record containing the error text, a formatted stack trace,
    ISO timestamp, severity/status, an error type marker, and the current page.
    - Normalizes a missing current page to "unknown_page".
    - Stores the record in the in-memory cls._errors dictionary keyed by page.
    - Attempts to persist the record to the SQLite DB using cls().save_errors_to_db,
    logging any persistence errors without interrupting Streamlit's normal error display.
    - Calls the original st.error to preserve expected UI behavior.
- Initializes the SQLite DB via cls._init_db().
- On subsequent calls:
- Returns the existing singleton instance.
- If db_path is provided, updates cls._db_path for future use.

Side effects
------------
- Replaces st.error globally for the running process.
- Writes error records to both an in-memory structure (cls._errors) and to the
configured SQLite database (if persistence succeeds).
- Logs informational and error messages.

Notes
-----
- The method assumes the class defines/has: _instance, _db_path, _current_page,
_errors, _st_error (original st.error), save_errors_to_db, and _init_db.
- Exceptions raised during saving of individual errors are caught and logged;
exceptions from instance creation or DB initialization may propagate.
- The implementation is not explicitly thread-safe; concurrent instantiation
attempts may require external synchronization if used in multi-threaded contexts.

set_page_context(cls, page_name: str) Set the current page name used when recording subsequent errors.
monitor_page(cls, page_name: str) -> Callable Decorator for page rendering/execution functions. Sets the page context, clears previously recorded non-Streamlit errors for that page, runs the function, records and persists any raised exception, and re-raises it.

_handle_st_error(cls, error_message: str)

Handles Streamlit-specific errors by recording error details for the current page.

Args:
    error_message (str): The error message to be logged.

Side Effects:
    Updates the class-level _errors dictionary with error information for the current Streamlit page.

Error Information Stored:
    - error: Formatted error message.
    - traceback: Stack trace at the point of error.
    - timestamp: Time when the error occurred (ISO format).
    - status: Error severity ('critical').
    - type: Error type ('streamlit_error').

get_page_errors(cls) -> dict Load errors from the database and return a dictionary mapping page names to lists of error dicts. Performs basic deduplication by error message.
save_errors_to_db(cls, errors: Iterable[dict]) Persist a list of error dictionaries to the configured SQLite database. Ensures traceback is stored as a string (JSON if originally a list).
clear_errors(cls, page_name: Optional[str] = None) Clear in-memory errors for a specific page or all pages and delete matching rows from the database.
_init_db(cls) Ensure the database directory exists and create the errors table if it does not exist.
load_errors_from_db(cls, page=None, status=None, limit=None) -> List[dict] Query the database for errors, optionally filtering by page and/or status, returning a list of error dictionaries ordered by timestamp (descending) and limited if requested.

Storage and format

Default DB path: ~/local/share/streamlit-healthcheck/streamlit_page_errors.db (overridable).
SQLite table errors columns: id, page, error, traceback, timestamp, status, type.
Tracebacks may be stored as JSON strings (if originally lists) or plain strings. Concurrency and robustness
Designed for single-process usage typical of Streamlit apps. The singleton and monkey-patching are process-global.
Database interactions use short-lived connections; callers should handle any exceptions arising from DB access (errors are logged internally).
Decorator preserves original function metadata via functools.wraps.

Examples

Use as a decorator on page render function:

>>> @StreamlitPageMonitor.monitor_page("home")
>>> def render_home():

Set page context manually:

>>> StreamlitPageMonitor.set_page_context("settings")

Set custom DB path on first instantiation:

>>> # Place this at the top of your Streamlit app once, before any error monitoring or decorator usage to ensure the sqlite
>>> # database is created properly at the specified path; otherwise it will default to a temp directory. The temp directory
>>> # will be `~/local/share/streamlit-healthcheck/streamlit_page_errors.db`.
>>> StreamlitPageMonitor(db_path="/home/saradindu/dev/streamlit_page_errors.db")
...

SQLite Database Schema

The following schema is used for persisting errors:

CREATE TABLE IF NOT EXISTS errors (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    page TEXT,
    error TEXT,
    traceback TEXT,
    timestamp TEXT,
    status TEXT,
    type TEXT
);

Field Descriptions:

Column	Type	Description
id	INTEGER	Auto-incrementing primary key
page	TEXT	Name of the Streamlit page
error	TEXT	Error message
traceback	TEXT	Stack trace or traceback (as string/JSON)
timestamp	TEXT	ISO8601 timestamp of error occurrence
status	TEXT	Severity/status (e.g., 'critical')
type	TEXT	Error type ('streamlit_error', 'exception')

Example:

>>> @StreamlitPageMonitor.monitor_page("home")
>>> def render_home():

Notes

The class monkey-patches st.error globally when first instantiated; ensure this side effect is acceptable in your environment.
Errors captured by st.error that occur outside any known page are recorded under the page name "unknown_page".
The schema is created/ensured in _init_db().
Tracebacks may be stored as JSON strings or plain text.
Errors are persisted immediately upon capture.

StreamlitPageMonitor(db_path=None) View Source

240    def __new__(cls, db_path=None):
241        """
242        Create or return the singleton StreamlitPageMonitor instance.
243        """
244        
245        if cls._instance is None:
246            cls._instance = super(StreamlitPageMonitor, cls).__new__(cls)
247            # Allow db_path override at first instantiation
248            if db_path is not None:
249                cls._db_path = db_path
250            logger.info(f"StreamlitPageMonitor DB path set to: {cls._db_path}")
251            # Monkey patch st.error to capture error messages
252            def patched_error(*args, **kwargs):
253                error_message = " ".join(str(arg) for arg in args)
254                current_page = cls._current_page
255                error_info = {
256                    'error': error_message,
257                    'traceback': traceback.format_stack(),
258                    'timestamp': datetime.now().isoformat(),
259                    'status': 'critical',
260                    'type': 'streamlit_error',
261                    'page': current_page
262                }
263                # Ensure current_page is a string, not None
264                if current_page is None:
265                    current_page = "unknown_page"
266                if current_page not in cls._errors:
267                    cls._errors[current_page] = []
268                cls._errors[current_page].append(error_info)
269                # Persist to DB
270                try:
271                    cls().save_errors_to_db([error_info])
272                except Exception as e:
273                    logger.error(f"Failed to save Streamlit error to DB: {e}")
274                # Call original st.error
275                return cls._st_error(*args, **kwargs)
276
277            st.error = patched_error
278
279            # Initialize SQLite database
280            cls._init_db()
281        else:
282            # If already instantiated, allow updating db_path if provided
283            if db_path is not None:
284                cls._db_path = db_path
285        return cls._instance

Create or return the singleton StreamlitPageMonitor instance.

@classmethod

def set_page_context(cls, page_name: str): View Source

314    @classmethod
315    def set_page_context(cls, page_name: str):
316        """Set the current page context"""
317        cls._current_page = page_name

Set the current page context

@classmethod

def monitor_page(cls, page_name: str): View Source

319    @classmethod
320    def monitor_page(cls, page_name: str):
321        """
322        Decorator to monitor and log exceptions for a specific Streamlit page.
323        
324        Args:
325            page_name (str): The name of the page to monitor.
326            
327        Returns:
328            Callable: A decorator that wraps the target function, sets the page context,
329            clears previous non-Streamlit errors, and logs any exceptions that occur during execution.
330            
331        The decorator performs the following actions:
332        
333            - Sets the current page context using `cls.set_page_context`.
334            - Clears previous exception errors for the page, retaining only those marked as 'streamlit_error'.
335            - Executes the wrapped function.
336            - If an exception occurs, logs detailed error information (error message, traceback, timestamp, status, type, and page)
337              to `cls._errors` under the given page name, then re-raises the exception.
338        """
339        
340        def decorator(func):
341            """
342            Decorator to manage page-specific error handling and context setting.
343            This decorator sets the current page context before executing the decorated function.
344            It clears previous exception errors for the page, retaining only Streamlit error calls.
345            If an exception occurs during function execution, it captures error details including
346            the error message, traceback, timestamp, status, type, and page name, and appends them
347            to the page's error log. The exception is then re-raised.
348            
349            Args:
350                func (Callable): The function to be decorated.
351                
352            Returns:
353                Callable: The wrapped function with error handling and context management.
354            """
355            
356            @functools.wraps(func)
357            def wrapper(*args, **kwargs):
358                # Set the current page context
359                cls.set_page_context(page_name)
360                try:
361                    # Clear previous exception errors but keep st.error calls
362                    if page_name in cls._errors:
363                        cls._errors[page_name] = [
364                            e for e in cls._errors[page_name]
365                            if e.get('type') == 'streamlit_error'
366                        ]
367                    result = func(*args, **kwargs)
368                    return result
369                except Exception as e:
370                    error_info = {
371                        'error': str(e),
372                        'traceback': traceback.format_exc(),
373                        'timestamp': datetime.now().isoformat(),
374                        'status': 'critical',
375                        'type': 'exception',
376                        'page': page_name
377                    }
378                    if page_name not in cls._errors:
379                        cls._errors[page_name] = []
380                    cls._errors[page_name].append(error_info)
381                    # Persist to DB
382                    try:
383                        cls().save_errors_to_db([error_info])
384                    except Exception as db_exc:
385                        logger.error(f"Failed to save exception error to DB: {db_exc}")
386                    raise
387            return wrapper
388        return decorator

Decorator to monitor and log exceptions for a specific Streamlit page.

Args: page_name (str): The name of the page to monitor.

Returns: Callable: A decorator that wraps the target function, sets the page context, clears previous non-Streamlit errors, and logs any exceptions that occur during execution.

The decorator performs the following actions:

- Sets the current page context using `cls.set_page_context`.
- Clears previous exception errors for the page, retaining only those marked as 'streamlit_error'.
- Executes the wrapped function.
- If an exception occurs, logs detailed error information (error message, traceback, timestamp, status, type, and page)
  to `cls._errors` under the given page name, then re-raises the exception.

@classmethod

def get_page_errors(cls): View Source

390    @classmethod
391    def get_page_errors(cls):
392        """
393        Load error records from storage and return them grouped by page.
394        This class method calls cls().load_errors_from_db() to retrieve a sequence of error records
395        (each expected to be a mapping). It normalizes each record to a dictionary with the keys:
396        
397            - 'error' (str): error message, default "Unknown error"
398            - 'traceback' (list): traceback frames or lines, default []
399            - 'timestamp' (str): timestamp string, default ""
400            - 'type' (str): error type/category, default "unknown"
401            
402        Grouping and uniqueness:
403        
404            - Records are grouped by the 'page' key; if a record has no 'page' key, the page name
405                "unknown" is used.
406            - For each page, only unique errors are kept using the 'error' string as the deduplication
407                key. When multiple records for the same page have the same 'error' value, the last
408                occurrence in the loaded sequence will be retained.
409                
410        Return value:
411        
412            - dict[str, list[dict]]: mapping from page name to a list of normalized error dicts.
413            
414        Error handling:
415        
416            - Any exception raised while loading or processing records will be logged via logger.error.
417                The method will return the result accumulated so far (or an empty dict if nothing was
418                accumulated).
419                
420        Notes:
421        
422            - The class is expected to be instantiable (cls()) and to provide a load_errors_from_db()
423                method that yields or returns an iterable of mappings.
424        """
425        
426        result = {}
427        try:
428            db_errors = cls().load_errors_from_db()
429            for err in db_errors:
430                page = err.get('page', 'unknown')
431                if page not in result:
432                    result[page] = []
433                result[page].append({
434                    'error': err.get('error', 'Unknown error'),
435                    'traceback': err.get('traceback', []),
436                    'timestamp': err.get('timestamp', ''),
437                    'type': err.get('type', 'unknown')
438                })
439            # Return only unique page errors using the 'page' column for filtering
440            return {page: list({e['error']: e for e in errors}.values()) for page, errors in result.items()}
441        except Exception as e:
442            logger.error(f"Failed to load errors from DB: {e}")
443            return result

Load error records from storage and return them grouped by page. This class method calls cls().load_errors_from_db() to retrieve a sequence of error records (each expected to be a mapping). It normalizes each record to a dictionary with the keys:

- 'error' (str): error message, default "Unknown error"
- 'traceback' (list): traceback frames or lines, default []
- 'timestamp' (str): timestamp string, default ""
- 'type' (str): error type/category, default "unknown"

Grouping and uniqueness:

- Records are grouped by the 'page' key; if a record has no 'page' key, the page name
    "unknown" is used.
- For each page, only unique errors are kept using the 'error' string as the deduplication
    key. When multiple records for the same page have the same 'error' value, the last
    occurrence in the loaded sequence will be retained.

Return value:

- dict[str, list[dict]]: mapping from page name to a list of normalized error dicts.

Error handling:

- Any exception raised while loading or processing records will be logged via logger.error.
    The method will return the result accumulated so far (or an empty dict if nothing was
    accumulated).

Notes:

- The class is expected to be instantiable (cls()) and to provide a load_errors_from_db()
    method that yields or returns an iterable of mappings.

@classmethod

def save_errors_to_db(cls, errors): View Source

445    @classmethod
446    def save_errors_to_db(cls, errors):
447        """
448        Save a sequence of error records into the SQLite database configured at cls._db_path.
449        
450        Parameters
451        ----------
452        
453        errors : Iterable[Mapping] | list[dict]
454        
455            Sequence of error records to persist. Each record is expected to be a mapping with the
456            following keys (values are stored as provided, except for traceback which is normalized):
457            
458              - "page": identifier or name of the page where the error occurred (str)
459              - "error": human-readable error message (str)
460              - "traceback": traceback information; may be a str, list, or None. If a list, it will be
461                JSON-encoded before storage. If None, an empty string is stored.
462              - "timestamp": timestamp for the error (stored as provided)
463              - "status": status associated with the error (str)
464              - "type": classification/type of the error (str)
465              
466        Behavior
467        --------
468        
469        - If `errors` is falsy (None or empty), the method returns immediately without touching the DB.
470        - Opens a SQLite connection to the path stored in `cls._db_path`.
471        - Iterates over the provided records and inserts each into the `errors` table with columns
472          (page, error, traceback, timestamp, status, type).
473        - Ensures that the `traceback` value is always written as a string (list -> JSON string,
474          other values -> str(), None -> "").
475        - Commits the transaction if all inserts succeed and always closes the connection in a finally block.
476        
477        Exceptions
478        ----------
479        
480        - Underlying sqlite3 exceptions (e.g., sqlite3.Error) are not swallowed and will propagate to the caller
481          if connection/execution fails.
482          
483        Returns
484        -------
485        
486        None
487        """
488        if not errors:
489            return
490        conn = sqlite3.connect(cls._db_path)
491        try:
492            cursor = conn.cursor()
493            for err in errors:
494                # Ensure traceback is always a string for SQLite
495                tb = err.get("traceback")
496                if isinstance(tb, list):
497                    import json
498                    tb_str = json.dumps(tb)
499                else:
500                    tb_str = str(tb) if tb is not None else ""
501                cursor.execute(
502                    """
503                    INSERT INTO errors (page, error, traceback, timestamp, status, type)
504                    VALUES (?, ?, ?, ?, ?, ?)
505                    """,
506                    (
507                        err.get("page"),
508                        err.get("error"),
509                        tb_str,
510                        err.get("timestamp"),
511                        err.get("status"),
512                        err.get("type"),
513                    ),
514                )
515            conn.commit()
516        finally:
517            conn.close()

Save a sequence of error records into the SQLite database configured at cls._db_path.

Parameters

errors : Iterable[Mapping] | list[dict]

Sequence of error records to persist. Each record is expected to be a mapping with the
following keys (values are stored as provided, except for traceback which is normalized):

  - "page": identifier or name of the page where the error occurred (str)
  - "error": human-readable error message (str)
  - "traceback": traceback information; may be a str, list, or None. If a list, it will be
    JSON-encoded before storage. If None, an empty string is stored.
  - "timestamp": timestamp for the error (stored as provided)
  - "status": status associated with the error (str)
  - "type": classification/type of the error (str)

Behavior

If errors is falsy (None or empty), the method returns immediately without touching the DB.
Opens a SQLite connection to the path stored in cls._db_path.
Iterates over the provided records and inserts each into the errors table with columns (page, error, traceback, timestamp, status, type).
Ensures that the traceback value is always written as a string (list -> JSON string, other values -> str(), None -> "").
Commits the transaction if all inserts succeed and always closes the connection in a finally block.

Exceptions

Underlying sqlite3 exceptions (e.g., sqlite3.Error) are not swallowed and will propagate to the caller if connection/execution fails.

Returns

None

@classmethod

def clear_errors(cls, page_name: Optional[str] = None): View Source

519    @classmethod
520    def clear_errors(cls, page_name: Optional[str] = None):
521        """Clear stored health-check errors for a specific page or for all pages.
522        This classmethod updates both the in-memory error cache and the persistent
523        SQLite-backed store.
524        
525        If `page_name` is provided:
526        
527        - Remove the entry for that page from the class-level in-memory dictionary
528            of errors (if present).
529        - Delete all rows in the SQLite `errors` table where `page` equals `page_name`.
530        
531        If `page_name` is None:
532        
533        - Clear the entire in-memory errors dictionary.
534        - Delete all rows from the SQLite `errors` table.
535        
536        Args:
537                page_name (Optional[str]): Name of the page whose errors should be cleared.
538                        If None, all errors are cleared.
539                        
540        Returns:
541                None
542                
543        Side effects:
544        
545                - Mutates class-level state (clears entries in `cls._errors`).
546                - Opens a SQLite connection to `cls._db_path` and executes DELETE statements
547                    against the `errors` table. Commits the transaction and closes the connection.
548                    
549        Error handling:
550        
551                - Database-related exceptions are caught and logged via the module logger;
552                    they are not re-raised by this method. As a result, callers should not
553                    rely on exceptions to detect DB failures.
554                    
555        Notes:
556        
557                - The method assumes `cls._db_path` points to a valid SQLite database file
558                    and that an `errors` table exists with a `page` column.
559                - This method does not provide synchronization; callers should take care of
560                    concurrent access to class state and the database if used from multiple
561                    threads or processes.
562        """
563        
564        if page_name:
565            if page_name in cls._errors:
566                del cls._errors[page_name]
567            # Remove from DB
568            try:
569                conn = sqlite3.connect(cls._db_path)
570                cursor = conn.cursor()
571                cursor.execute("DELETE FROM errors WHERE page = ?", (page_name,))
572                conn.commit()
573                conn.close()
574            except Exception as e:
575                logger.error(f"Failed to clear errors from DB for page {page_name}: {e}")
576        else:
577            cls._errors = {}
578            # Remove all from DB
579            try:
580                conn = sqlite3.connect(cls._db_path)
581                cursor = conn.cursor()
582                cursor.execute("DELETE FROM errors")
583                conn.commit()
584                conn.close()
585            except Exception as e:
586                logger.error(f"Failed to clear all errors from DB: {e}")

Clear stored health-check errors for a specific page or for all pages. This classmethod updates both the in-memory error cache and the persistent SQLite-backed store.

If page_name is provided:

Remove the entry for that page from the class-level in-memory dictionary of errors (if present).
Delete all rows in the SQLite errors table where page equals page_name.

If page_name is None:

Clear the entire in-memory errors dictionary.
Delete all rows from the SQLite errors table.

Args: page_name (Optional[str]): Name of the page whose errors should be cleared. If None, all errors are cleared.

Returns: None

Side effects:

    - Mutates class-level state (clears entries in `cls._errors`).
    - Opens a SQLite connection to `cls._db_path` and executes DELETE statements
        against the `errors` table. Commits the transaction and closes the connection.

Error handling:

    - Database-related exceptions are caught and logged via the module logger;
        they are not re-raised by this method. As a result, callers should not
        rely on exceptions to detect DB failures.

Notes:

    - The method assumes `cls._db_path` points to a valid SQLite database file
        and that an `errors` table exists with a `page` column.
    - This method does not provide synchronization; callers should take care of
        concurrent access to class state and the database if used from multiple
        threads or processes.

@classmethod

def load_errors_from_db(cls, page=None, status=None, limit=None): View Source

665    @classmethod
666    def load_errors_from_db(cls, page=None, status=None, limit=None):
667        """
668        Load errors from the class SQLite database.
669        This classmethod connects to the SQLite database at cls._db_path, queries the
670        'errors' table, and returns matching error records as a list of dictionaries.
671        
672        Parameters:
673        
674            page (Optional[str]): If provided, filter results to rows where the 'page'
675                column equals this value.
676            status (Optional[str]): If provided, filter results to rows where the 'status'
677                column equals this value.
678            limit (Optional[int|str]): If provided, limits the number of returned rows.
679                The value is cast to int internally; a non-convertible value will raise
680                ValueError.
681                
682        Returns:
683        
684            List[dict]: A list of dictionaries representing rows from the 'errors' table.
685            Each dict contains the following keys:
686                - id: primary key (int)
687                - page: page identifier (str)
688                - error: short error message (str)
689                - traceback: full traceback or diagnostic text (str)
690                - timestamp: stored timestamp value as retrieved from the DB (type depends on schema)
691                - status: error status (str)
692                - type: error type/category (str)
693                
694        Raises:
695        
696            ValueError: If `limit` cannot be converted to int.
697            sqlite3.Error: If an SQLite error occurs while executing the query.
698            
699        Notes:
700        
701            - Uses parameterized queries for the 'page' and 'status' filters to avoid SQL
702              injection. The `limit` is applied after casting to int.
703            - Results are ordered by `timestamp` in descending order.
704            - The database connection is always closed in a finally block to ensure cleanup.
705        """
706        
707        conn = sqlite3.connect(cls._db_path)
708        try:
709            cursor = conn.cursor()
710            query = "SELECT id, page, error, traceback, timestamp, status, type FROM errors"
711            params = []
712            filters = []
713            if page:
714                filters.append("page = ?")
715                params.append(page)
716            if status:
717                filters.append("status = ?")
718                params.append(status)
719            if filters:
720                query += " WHERE " + " AND ".join(filters)
721            query += " ORDER BY timestamp DESC"
722            if limit:
723                query += f" LIMIT {int(limit)}"
724            cursor.execute(query, params)
725            rows = cursor.fetchall()
726            errors = []
727            for row in rows:
728                errors.append({
729                    "id": row[0],
730                    "page": row[1],
731                    "error": row[2],
732                    "traceback": row[3],
733                    "timestamp": row[4],
734                    "status": row[5],
735                    "type": row[6],
736                })
737            return errors
738        finally:
739            conn.close()

Load errors from the class SQLite database. This classmethod connects to the SQLite database at cls._db_path, queries the 'errors' table, and returns matching error records as a list of dictionaries.

Parameters:

page (Optional[str]): If provided, filter results to rows where the 'page'
    column equals this value.
status (Optional[str]): If provided, filter results to rows where the 'status'
    column equals this value.
limit (Optional[int|str]): If provided, limits the number of returned rows.
    The value is cast to int internally; a non-convertible value will raise
    ValueError.

Returns:

List[dict]: A list of dictionaries representing rows from the 'errors' table.
Each dict contains the following keys:
    - id: primary key (int)
    - page: page identifier (str)
    - error: short error message (str)
    - traceback: full traceback or diagnostic text (str)
    - timestamp: stored timestamp value as retrieved from the DB (type depends on schema)
    - status: error status (str)
    - type: error type/category (str)

Raises:

ValueError: If `limit` cannot be converted to int.
sqlite3.Error: If an SQLite error occurs while executing the query.

Notes:

- Uses parameterized queries for the 'page' and 'status' filters to avoid SQL
  injection. The `limit` is applied after casting to int.
- Results are ordered by `timestamp` in descending order.
- The database connection is always closed in a finally block to ensure cleanup.

class HealthCheckService: View Source

 741class HealthCheckService:
 742    """
 743    A background-capable health monitoring service for a Streamlit-based application.
 744    This class periodically executes a configurable set of checks (system metrics,
 745    external dependencies, Streamlit server and pages, and user-registered custom checks),
 746    aggregates their results, and exposes a sanitized health snapshot suitable for UI
 747    display or remote monitoring.
 748    
 749    Primary responsibilities
 750    
 751    - Load and persist a JSON configuration that defines check intervals, thresholds,
 752        dependencies to probe, and Streamlit connection settings.
 753    - Run periodic checks in a dedicated background thread (start/stop semantics).
 754    - Collect system metrics (CPU, memory, disk) using psutil and apply configurable
 755        warning/critical thresholds.
 756    - Probe configured HTTP API endpoints and (placeholder) database checks.
 757    - Verify Streamlit server liveness by calling a /healthz endpoint and inspect
 758        Streamlit page errors via StreamlitPageMonitor.
 759    - Allow callers to register synchronous custom checks (functions returning dicts).
 760    - Compute an aggregated overall status (critical > warning > unknown > healthy).
 761    - Provide a sanitized snapshot of health data with function references removed for safe
 762        serialization/display.
 763        
 764    Usage (high level)
 765    
 766    - Instantiate: svc = HealthCheckService(config_path="path/to/config.json")
 767    - Optionally register custom checks: svc.register_custom_check("my_check", my_check_func)
 768        where my_check_func() -> Dict[str, Any]
 769    - Start background monitoring: svc.start()
 770    - Stop monitoring: svc.stop()
 771    - Retrieve current health snapshot for display or API responses: svc.get_health_data()
 772    - Persist any changes to configuration: svc.save_config()
 773    
 774    Configuration (JSON)
 775    
 776    - check_interval: int (seconds) — how often to run the checks (default 60)
 777    - streamlit_url: str — base host (default "http://localhost")
 778    - streamlit_port: int — port for Streamlit server (default 8501)
 779    - system_checks: { "cpu": bool, "memory": bool, "disk": bool }
 780    - dependencies:
 781            - api_endpoints: list of { "name": str, "url": str, "timeout": int }
 782            - databases: list of { "name": str, "type": str, "connection_string": str }
 783    - thresholds:
 784            - cpu_warning, cpu_critical, memory_warning, memory_critical, disk_warning, disk_critical
 785            
 786    Health data structure (conceptual)
 787    
 788    - last_updated: ISO timestamp
 789    - system: { "cpu": {...}, "memory": {...}, "disk": {...} }
 790    - dependencies: { "<name>": {...}, ... }
 791    - custom_checks: { "<name>": {...} }  (get_health_data() strips callable references)
 792    - streamlit_server: {status, response_code/latency/error, message, url}
 793    - streamlit_pages: {status, error_count, errors, details}
 794    - overall_status: "healthy" | "warning" | "critical" | "unknown"
 795    
 796    Threading and safety
 797    
 798    - The service runs checks in a daemon thread started by start(). stop() signals the
 799        thread to terminate and joins with a short timeout. Clients should avoid modifying
 800        internal structures concurrently; get_health_data() returns a sanitized snapshot
 801        appropriate for concurrent reads.
 802        
 803    Custom checks
 804    
 805    - register_custom_check(name, func): registers a synchronous function that returns a
 806        dict describing the check result (must include a "status" key with one of the
 807        recognized values). The service stores the function reference internally but returns
 808        sanitized results via get_health_data().
 809        
 810    Error handling and logging
 811    
 812    - Individual checks catch exceptions and surface errors in the corresponding
 813        health_data entry with status "critical" where appropriate.
 814    - The Streamlit UI integration (st.* calls) is used for user-visible error messages
 815        when loading/saving configuration; the service also logs events to its configured
 816        logger.
 817        
 818    Extensibility notes
 819    
 820    - Database checks are left as placeholders; implement _check_database for specific DB
 821        drivers/connections.
 822    - Custom checks are synchronous; if long-running checks are required, adapt the
 823        registration/run pattern to use async or worker pools.
 824    """
 825    def __init__(self, config_path: str = "health_check_config.json"):
 826        """
 827        Initializes the HealthCheckService instance.
 828        
 829        Args:
 830            config_path (str): Path to the health check configuration file. Defaults to "health_check_config.json".
 831            
 832        Attributes:
 833        
 834        - logger (logging.Logger): Logger for the HealthCheckService.
 835        - config_path (str): Path to the configuration file.
 836        - health_data (Dict[str, Any]): Dictionary storing health check data.
 837        - config (dict): Loaded configuration from the config file.
 838        - check_interval (int): Interval in seconds between health checks. Defaults to 60.
 839        - _running (bool): Indicates if the health check service is running.
 840        - _thread (threading.Thread or None): Thread running the health check loop.
 841        - streamlit_url (str): URL of the Streamlit service. Defaults to "http://localhost".
 842        - streamlit_port (int): Port of the Streamlit service. Defaults to 8501.
 843        """
 844        self.logger = logging.getLogger(f"{__name__}.HealthCheckService")
 845        self.logger.info("Initializing HealthCheckService")
 846        self.config_path = config_path
 847        self.health_data: Dict[str, Any] = {
 848            "last_updated": None,
 849            "system": {},
 850            "dependencies": {},
 851            "custom_checks": {},
 852            "overall_status": "unknown"
 853        }
 854        self.config = self._load_config()
 855        self.check_interval = self.config.get("check_interval", 60)  # Default: 60 seconds
 856        self._running = False
 857        self._thread = None
 858        self.streamlit_url = self.config.get("streamlit_url", "http://localhost")
 859        self.streamlit_port = self.config.get("streamlit_port", 8501)  # Default: 8501
 860    def _load_config(self) -> Dict:
 861        """Load health check configuration from file."""
 862        if os.path.exists(self.config_path):
 863            try:
 864                with open(self.config_path, "r") as f:
 865                    return json.load(f)
 866            except Exception as e:
 867                st.error(f"Error loading health check config: {str(e)}")
 868                return self._get_default_config()
 869        else:
 870            return self._get_default_config()
 871            
 872    def _get_default_config(self) -> Dict:
 873        """Return default health check configuration."""
 874        return {
 875            "check_interval": 60,
 876            "streamlit_url": "http://localhost",
 877            "streamlit_port": 8501,
 878            "system_checks": {
 879                "cpu": True,
 880                "memory": True,
 881                "disk": True
 882            },
 883            "dependencies": {
 884                "api_endpoints": [
 885                    # Example API endpoint to check
 886                    {"name": "example_api", "url": "https://httpbin.org/get", "timeout": 5}
 887                ],
 888                "databases": [
 889                    # Example database connection to check
 890                    {"name": "main_db", "type": "postgres", "connection_string": "..."}
 891                ]
 892            },
 893            "thresholds": {
 894                "cpu_warning": 70,
 895                "cpu_critical": 90,
 896                "memory_warning": 70,
 897                "memory_critical": 90,
 898                "disk_warning": 70,
 899                "disk_critical": 90
 900            }
 901        }
 902    
 903    def start(self):
 904        """
 905        Start the periodic health-check background thread.
 906        If the `healthcheck` runner is already active, this method is a no-op and returns
 907        immediately. Otherwise, it marks the runner as running, creates a daemon thread
 908        targeting self._run_checks_periodically, stores the thread on self._thread, and
 909        starts it.
 910        
 911        Behavior and side effects:
 912        
 913        - Idempotent while running: repeated calls will not create additional threads.
 914        - Sets self._running to True.
 915        - Assigns a daemon threading.Thread to self._thread and starts it.
 916        - Non-blocking: returns after starting the background thread.
 917        - The daemon thread will not prevent the process from exiting.
 918        
 919        Thread-safety:
 920        
 921        - If start() may be called concurrently from multiple threads, callers should
 922            ensure proper synchronization (e.g., external locking) to avoid race conditions.
 923            
 924        Returns:
 925        
 926                None
 927        """
 928        
 929        if self._running:
 930            return
 931            
 932        self._running = True
 933        self._thread = threading.Thread(target=self._run_checks_periodically, daemon=True)
 934        self._thread.start()
 935        
 936    def stop(self):
 937        """Stop the health check service."""
 938        self._running = False
 939        if self._thread:
 940            self._thread.join(timeout=1)
 941            
 942    def _run_checks_periodically(self):
 943        """Run health checks periodically based on check interval."""
 944        while self._running:
 945            self.run_all_checks()
 946            time.sleep(self.check_interval)
 947            
 948    def run_all_checks(self):
 949        """Run all configured health checks and update health data."""
 950        # Update timestamp
 951        self.health_data["last_updated"] = datetime.now().isoformat()
 952        
 953        # Check Streamlit server
 954        self.health_data["streamlit_server"] = self.check_streamlit_server()
 955        
 956        # System checks
 957        if self.config["system_checks"].get("cpu", True):
 958            self.check_cpu()
 959        if self.config["system_checks"].get("memory", True):
 960            self.check_memory()
 961        if self.config["system_checks"].get("disk", True):
 962            self.check_disk()
 963            
 964        # Rest of the existing checks...
 965        self.check_dependencies()
 966        self.run_custom_checks()
 967        self.check_streamlit_pages()
 968        self._update_overall_status()
 969        
 970    def check_cpu(self):
 971        """
 972        Checks the current CPU usage and updates the health status based on configured thresholds.
 973        Measures the CPU usage percentage over a 1-second interval using psutil. Compares the result
 974        against warning and critical thresholds defined in the configuration. Sets the status to
 975        'healthy', 'warning', or 'critical' accordingly, and updates the health data dictionary.
 976        
 977        Returns:
 978        
 979            None
 980        """
 981        
 982        cpu_percent = psutil.cpu_percent(interval=1)
 983        warning_threshold = self.config["thresholds"].get("cpu_warning", 70)
 984        critical_threshold = self.config["thresholds"].get("cpu_critical", 90)
 985        
 986        status = "healthy"
 987        if cpu_percent >= critical_threshold:
 988            status = "critical"
 989        elif cpu_percent >= warning_threshold:
 990            status = "warning"
 991            
 992        self.health_data["system"]["cpu"] = {
 993            "usage_percent": cpu_percent,
 994            "status": status
 995        }
 996        
 997    def check_memory(self):
 998        """
 999        Checks the system's memory usage and updates the health status accordingly.
1000        Retrieves the current memory usage statistics using psutil, compares the usage percentage
1001        against configured warning and critical thresholds, and sets the memory status to 'healthy',
1002        'warning', or 'critical'. Updates the health_data dictionary with total memory, available memory,
1003        usage percentage, and status.
1004        
1005        Returns:
1006        
1007            None
1008        """
1009        
1010        memory = psutil.virtual_memory()
1011        memory_percent = memory.percent
1012        warning_threshold = self.config["thresholds"].get("memory_warning", 70)
1013        critical_threshold = self.config["thresholds"].get("memory_critical", 90)
1014        
1015        status = "healthy"
1016        if memory_percent >= critical_threshold:
1017            status = "critical"
1018        elif memory_percent >= warning_threshold:
1019            status = "warning"
1020            
1021        self.health_data["system"]["memory"] = {
1022            "total_gb": round(memory.total / (1024**3), 2),
1023            "available_gb": round(memory.available / (1024**3), 2),
1024            "usage_percent": memory_percent,
1025            "status": status
1026        }
1027        
1028    def check_disk(self):
1029        """
1030        Checks the disk usage of the root filesystem and updates the health status.
1031        Retrieves disk usage statistics using psutil, compares the usage percentage
1032        against configured warning and critical thresholds, and sets the disk status
1033        accordingly (`healthy`, `warning`, or `critical`). Updates the health_data
1034        dictionary with total disk size, free space, usage percentage, and status.
1035        
1036        Returns:
1037        
1038            None
1039        """
1040        
1041        disk = psutil.disk_usage('/')
1042        disk_percent = disk.percent
1043        warning_threshold = self.config["thresholds"].get("disk_warning", 70)
1044        critical_threshold = self.config["thresholds"].get("disk_critical", 90)
1045        
1046        status = "healthy"
1047        if disk_percent >= critical_threshold:
1048            status = "critical"
1049        elif disk_percent >= warning_threshold:
1050            status = "warning"
1051            
1052        self.health_data["system"]["disk"] = {
1053            "total_gb": round(disk.total / (1024**3), 2),
1054            "free_gb": round(disk.free / (1024**3), 2),
1055            "usage_percent": disk_percent,
1056            "status": status
1057        }
1058        
1059    def check_dependencies(self):
1060        """
1061        Checks the health of configured dependencies, including API endpoints and databases.
1062        Iterates through the list of API endpoints and databases specified in the configuration,
1063        and performs health checks on each by invoking the corresponding internal methods.
1064        
1065        Raises:
1066        
1067            Exception: If any dependency check fails.
1068        """
1069        
1070        # Check API endpoints
1071        for endpoint in self.config["dependencies"].get("api_endpoints", []):
1072            self._check_api_endpoint(endpoint)
1073            
1074        # Check database connections
1075        for db in self.config["dependencies"].get("databases", []):
1076            self._check_database(db)
1077            
1078    def _check_api_endpoint(self, endpoint: Dict):
1079        """
1080        Check if an API endpoint is accessible.
1081        
1082        Args:
1083        
1084            endpoint: Dictionary with endpoint configuration
1085        """
1086        name = endpoint.get("name", "unknown_api")
1087        url = endpoint.get("url", "")
1088        timeout = endpoint.get("timeout", 5)
1089        
1090        if not url:
1091            return
1092            
1093        try:
1094            start_time = time.time()
1095            response = requests.get(url, timeout=timeout)
1096            response_time = time.time() - start_time
1097            
1098            status = "healthy" if response.status_code < 400 else "critical"
1099            
1100            self.health_data["dependencies"][name] = {
1101                "type": "api",
1102                "url": url,
1103                "status": status,
1104                "response_time_ms": round(response_time * 1000, 2),
1105                "status_code": response.status_code
1106            }
1107        except Exception as e:
1108            self.health_data["dependencies"][name] = {
1109                "type": "api",
1110                "url": url,
1111                "status": "critical",
1112                "error": str(e)
1113            }
1114            
1115    def _check_database(self, db_config: Dict):
1116        """
1117        Check database connection.
1118        Note: This is a placeholder. You'll need to implement specific database checks
1119        based on your application's needs.
1120        
1121        Args:
1122        
1123            db_config: Dictionary with database configuration
1124        """
1125        name = db_config.get("name", "unknown_db")
1126        db_type = db_config.get("type", "")
1127        
1128        # Placeholder for database connection check
1129        # In a real implementation, you would check the specific database connection
1130        self.health_data["dependencies"][name] = {
1131            "type": "database",
1132            "db_type": db_type,
1133            "status": "unknown",
1134            "message": "Database check not implemented"
1135        }
1136        
1137    def register_custom_check(self, name: str, check_func: Callable[[], Dict[str, Any]]):
1138        """
1139        Register a custom health check function.
1140        
1141        Args:
1142        
1143            name: Name of the custom check
1144            check_func: Function that performs the check and returns a dictionary with results
1145        """
1146        if "custom_checks" not in self.health_data:
1147            self.health_data["custom_checks"] = {}
1148            
1149        self.health_data["custom_checks"][name] = {
1150            "status": "unknown",
1151            "check_func": check_func
1152        }
1153        
1154    def run_custom_checks(self):
1155        """Run all registered custom health checks."""
1156        if "custom_checks" not in self.health_data:
1157            return
1158            
1159        for name, check_info in list(self.health_data["custom_checks"].items()):
1160            if "check_func" in check_info and callable(check_info["check_func"]):
1161                try:
1162                    result = check_info["check_func"]()
1163                    # Remove the function reference from the result
1164                    func = check_info["check_func"]
1165                    self.health_data["custom_checks"][name] = result
1166                    # Add the function back
1167                    self.health_data["custom_checks"][name]["check_func"] = func
1168                except Exception as e:
1169                    self.health_data["custom_checks"][name] = {
1170                        "status": "critical",
1171                        "error": str(e),
1172                        "check_func": check_info["check_func"]
1173                    }
1174                    
1175    def _update_overall_status(self):
1176        """
1177        Updates the overall health status of the application based on the statuses of various components.
1178        
1179        The method checks the health status of the following components:
1180            - Streamlit server
1181            - System checks
1182            - Dependencies
1183            - Custom checks (excluding those with a 'check_func' key)
1184            - Streamlit pages
1185            
1186        The overall status is determined using the following priority order:
1187            1. "critical" if any component is critical
1188            2. "warning" if any component is warning and none are critical
1189            3. "unknown" if any component is unknown and none are critical or warning, and no healthy components exist
1190            4. "healthy" if any component is healthy and none are critical, warning, or unknown
1191            5. "unknown" if no statuses are found
1192            
1193        The result is stored in `self.health_data["overall_status"]`.
1194        """
1195        
1196        has_critical = False
1197        has_warning = False
1198        has_healthy = False
1199        has_unknown = False
1200        
1201        # Helper function to check status
1202        def check_component_status(status):
1203            nonlocal has_critical, has_warning, has_healthy, has_unknown
1204            if status == "critical":
1205                has_critical = True
1206            elif status == "warning":
1207                has_warning = True
1208            elif status == "healthy":
1209                has_healthy = True
1210            elif status == "unknown":
1211                has_unknown = True
1212
1213        # Check Streamlit server status
1214        server_status = self.health_data.get("streamlit_server", {}).get("status")
1215        check_component_status(server_status)
1216        
1217        # Check system status
1218        for system_check in self.health_data.get("system", {}).values():
1219            check_component_status(system_check.get("status"))
1220                    
1221        # Check dependencies status
1222        for dep_check in self.health_data.get("dependencies", {}).values():
1223            check_component_status(dep_check.get("status"))
1224                    
1225        # Check custom checks status
1226        for custom_check in self.health_data.get("custom_checks", {}).values():
1227            if isinstance(custom_check, dict) and "check_func" not in custom_check:
1228                check_component_status(custom_check.get("status"))
1229        
1230        # Check Streamlit pages status
1231        pages_status = self.health_data.get("streamlit_pages", {}).get("status")
1232        check_component_status(pages_status)
1233                        
1234        # Determine overall status with priority:
1235        # critical > warning > unknown > healthy
1236        if has_critical:
1237            self.health_data["overall_status"] = "critical"
1238        elif has_warning:
1239            self.health_data["overall_status"] = "warning"
1240        elif has_unknown and not has_healthy:
1241            self.health_data["overall_status"] = "unknown"
1242        elif has_healthy:
1243            self.health_data["overall_status"] = "healthy"
1244        else:
1245            self.health_data["overall_status"] = "unknown"
1246                
1247    def get_health_data(self) -> Dict:
1248        """Get the latest health check data."""
1249        # Create a copy without the function references
1250        result: Dict[str, Any] = {}
1251        for key, value in self.health_data.items():
1252            if key == "custom_checks":
1253                result[key] = {}
1254                for check_name, check_data in value.items():
1255                    if isinstance(check_data, dict):
1256                        check_copy = check_data.copy()
1257                        if "check_func" in check_copy:
1258                            del check_copy["check_func"]
1259                        result[key][check_name] = check_copy
1260            else:
1261                result[key] = value
1262        return result
1263        
1264    def save_config(self):
1265        """
1266        Saves the current health check configuration to a JSON file.
1267        Attempts to write the configuration stored in `self.config` to the file specified by `self.config_path`.
1268        Displays a success message in the Streamlit app upon successful save.
1269        Handles and displays appropriate error messages for file not found, permission issues, JSON decoding errors, and other exceptions.
1270        
1271        Raises:
1272        
1273            FileNotFoundError: If the configuration file path does not exist.
1274            PermissionError: If there are insufficient permissions to write to the file.
1275            json.JSONDecodeError: If there is an error decoding the JSON data.
1276            Exception: For any other exceptions encountered during the save process.
1277        """
1278        
1279        try:
1280            with open(self.config_path, "w") as f:
1281                json.dump(self.config, f, indent=2)
1282                st.success(f"Health check config saved successfully to {self.config_path}")
1283        except FileNotFoundError:
1284            st.error(f"Configuration file not found: {self.config_path}")
1285        except PermissionError:
1286            st.error(f"Permission denied: Unable to write to {self.config_path}")
1287        except json.JSONDecodeError:
1288            st.error(f"Error decoding JSON in config file: {self.config_path}")
1289        except Exception as e:
1290            st.error(f"Error saving health check config: {str(e)}")
1291    def check_streamlit_pages(self):
1292        """
1293        Checks for errors in Streamlit pages and updates the health data accordingly.
1294        This method retrieves page errors using StreamlitPageMonitor.get_page_errors().
1295        If errors are found, it sets the 'streamlit_pages' status to 'critical' and updates
1296        the overall health status to 'critical'. If no errors are found, it marks the
1297        'streamlit_pages' status as 'healthy'.
1298        
1299        Updates:
1300        
1301            self.health_data["streamlit_pages"]: Dict containing status, error count, errors, and details.
1302            self.health_data["overall_status"]: Set to 'critical' if errors are detected.
1303            self.health_data["streamlit_pages"]["details"]: A summary of the errors found.
1304            
1305        Returns:
1306        
1307            None
1308        """
1309        
1310        page_errors = StreamlitPageMonitor.get_page_errors()
1311        
1312        if "streamlit_pages" not in self.health_data:
1313            self.health_data["streamlit_pages"] = {}
1314        
1315        if page_errors:
1316            total_errors = sum(len(errors) for errors in page_errors.values())
1317            self.health_data["streamlit_pages"] = {
1318                "status": "critical",
1319                "error_count": total_errors,
1320                "errors": page_errors,
1321                "details": "Errors detected in Streamlit pages"
1322            }
1323            # This affects overall status
1324            self.health_data["overall_status"] = "critical"
1325        else:
1326            self.health_data["streamlit_pages"] = {
1327                "status": "healthy",
1328                "error_count": 0,
1329                "errors": {},
1330                "details": "All pages functioning normally"
1331            }
1332    
1333    def check_streamlit_server(self) -> Dict[str, Any]:
1334        """
1335        Checks the health status of the Streamlit server by sending a GET request to the /healthz endpoint.
1336        
1337        Returns:
1338        
1339            Dict[str, Any]: A dictionary containing the health status, response code, latency in milliseconds,
1340                            message, and the URL checked. If the server is healthy (HTTP 200), status is "healthy".
1341                            Otherwise, status is "critical" with error details.
1342                            
1343        Handles:
1344        
1345            - Connection errors: Returns critical status with connection error details.
1346            - Timeout errors: Returns critical status with timeout error details.
1347            - Other exceptions: Returns critical status with unknown error details.
1348            
1349        Logs:
1350        
1351            - The URL being checked.
1352            - The response status code and text.
1353            - Health status and response time if healthy.
1354            - Warnings and errors for unhealthy or failed checks.
1355        """
1356        
1357        try:
1358            host = self.streamlit_url.rstrip('/')
1359            if not host.startswith(('http://', 'https://')):
1360                host = f"http://{host}"
1361            
1362            url = f"{host}:{self.streamlit_port}/healthz"
1363            self.logger.info(f"Checking Streamlit server health at: {url}")
1364            
1365            start_time = time.time()
1366            response = requests.get(url, timeout=3)
1367            total_time = (time.time() - start_time) * 1000
1368            self.logger.info(f"{response.status_code} - {response.text}")
1369            # Check if the response is healthy
1370            if response.status_code == 200:
1371                self.logger.info(f"Streamlit server healthy - Response time: {round(total_time, 2)}ms")
1372                return {
1373                    "status": "healthy",
1374                    "response_code": response.status_code,
1375                    "latency_ms": round(total_time, 2),
1376                    "message": "Streamlit server is running",
1377                    "url": url
1378                }
1379            else:
1380                self.logger.warning(f"Unhealthy response from server: {response.status_code}")
1381                return {
1382                    "status": "critical",
1383                    "response_code": response.status_code,
1384                    "error": f"Unhealthy response from server: {response.status_code}",
1385                    "message": "Streamlit server is not healthy",
1386                    "url": url
1387                }
1388
1389        except requests.exceptions.ConnectionError as e:
1390            self.logger.error(f"Connection error while checking Streamlit server: {str(e)}")
1391            return {
1392                "status": "critical",
1393                "error": f"Connection error: {str(e)}",
1394                "message": "Cannot connect to Streamlit server",
1395                "url": url
1396            }
1397        except requests.exceptions.Timeout as e:
1398            self.logger.error(f"Timeout while checking Streamlit server: {str(e)}")
1399            return {
1400                "status": "critical",
1401                "error": f"Timeout error: {str(e)}",
1402                "message": "Streamlit server is not responding",
1403                "url": url
1404            }
1405        except Exception as e:
1406            self.logger.error(f"Unexpected error while checking Streamlit server: {str(e)}")
1407            return {
1408                "status": "critical",
1409                "error": f"Unknown error: {str(e)}",
1410                "message": "Failed to check Streamlit server",
1411                "url": url
1412            }

A background-capable health monitoring service for a Streamlit-based application. This class periodically executes a configurable set of checks (system metrics, external dependencies, Streamlit server and pages, and user-registered custom checks), aggregates their results, and exposes a sanitized health snapshot suitable for UI display or remote monitoring.

Primary responsibilities

Load and persist a JSON configuration that defines check intervals, thresholds, dependencies to probe, and Streamlit connection settings.
Run periodic checks in a dedicated background thread (start/stop semantics).
Collect system metrics (CPU, memory, disk) using psutil and apply configurable warning/critical thresholds.
Probe configured HTTP API endpoints and (placeholder) database checks.
Verify Streamlit server liveness by calling a /healthz endpoint and inspect Streamlit page errors via StreamlitPageMonitor.
Allow callers to register synchronous custom checks (functions returning dicts).
Compute an aggregated overall status (critical > warning > unknown > healthy).
Provide a sanitized snapshot of health data with function references removed for safe serialization/display.

Usage (high level)

Instantiate: svc = HealthCheckService(config_path="path/to/config.json")
Optionally register custom checks: svc.register_custom_check("my_check", my_check_func) where my_check_func() -> Dict[str, Any]
Start background monitoring: svc.start()
Stop monitoring: svc.stop()
Retrieve current health snapshot for display or API responses: svc.get_health_data()
Persist any changes to configuration: svc.save_config()

Configuration (JSON)

check_interval: int (seconds) — how often to run the checks (default 60)
streamlit_url: str — base host (default "http://localhost")
streamlit_port: int — port for Streamlit server (default 8501)
system_checks: { "cpu": bool, "memory": bool, "disk": bool }
dependencies:
- api_endpoints: list of { "name": str, "url": str, "timeout": int }
- databases: list of { "name": str, "type": str, "connection_string": str }
thresholds:
- cpu_warning, cpu_critical, memory_warning, memory_critical, disk_warning, disk_critical

Health data structure (conceptual)

last_updated: ISO timestamp
system: { "cpu": {...}, "memory": {...}, "disk": {...} }
dependencies: { "": {...}, ... }
custom_checks: { "": {...} } (get_health_data() strips callable references)
streamlit_server: {status, response_code/latency/error, message, url}
streamlit_pages: {status, error_count, errors, details}
overall_status: "healthy" | "warning" | "critical" | "unknown"

Threading and safety

The service runs checks in a daemon thread started by start(). stop() signals the thread to terminate and joins with a short timeout. Clients should avoid modifying internal structures concurrently; get_health_data() returns a sanitized snapshot appropriate for concurrent reads.

Custom checks

register_custom_check(name, func): registers a synchronous function that returns a dict describing the check result (must include a "status" key with one of the recognized values). The service stores the function reference internally but returns sanitized results via get_health_data().

Error handling and logging

Individual checks catch exceptions and surface errors in the corresponding health_data entry with status "critical" where appropriate.
The Streamlit UI integration (st.* calls) is used for user-visible error messages when loading/saving configuration; the service also logs events to its configured logger.

Extensibility notes

Database checks are left as placeholders; implement _check_database for specific DB drivers/connections.
Custom checks are synchronous; if long-running checks are required, adapt the registration/run pattern to use async or worker pools.

HealthCheckService(config_path: str = 'health_check_config.json') View Source

825    def __init__(self, config_path: str = "health_check_config.json"):
826        """
827        Initializes the HealthCheckService instance.
828        
829        Args:
830            config_path (str): Path to the health check configuration file. Defaults to "health_check_config.json".
831            
832        Attributes:
833        
834        - logger (logging.Logger): Logger for the HealthCheckService.
835        - config_path (str): Path to the configuration file.
836        - health_data (Dict[str, Any]): Dictionary storing health check data.
837        - config (dict): Loaded configuration from the config file.
838        - check_interval (int): Interval in seconds between health checks. Defaults to 60.
839        - _running (bool): Indicates if the health check service is running.
840        - _thread (threading.Thread or None): Thread running the health check loop.
841        - streamlit_url (str): URL of the Streamlit service. Defaults to "http://localhost".
842        - streamlit_port (int): Port of the Streamlit service. Defaults to 8501.
843        """
844        self.logger = logging.getLogger(f"{__name__}.HealthCheckService")
845        self.logger.info("Initializing HealthCheckService")
846        self.config_path = config_path
847        self.health_data: Dict[str, Any] = {
848            "last_updated": None,
849            "system": {},
850            "dependencies": {},
851            "custom_checks": {},
852            "overall_status": "unknown"
853        }
854        self.config = self._load_config()
855        self.check_interval = self.config.get("check_interval", 60)  # Default: 60 seconds
856        self._running = False
857        self._thread = None
858        self.streamlit_url = self.config.get("streamlit_url", "http://localhost")
859        self.streamlit_port = self.config.get("streamlit_port", 8501)  # Default: 8501

Initializes the HealthCheckService instance.

Args: config_path (str): Path to the health check configuration file. Defaults to "health_check_config.json".

Attributes:

logger (logging.Logger): Logger for the HealthCheckService.
config_path (str): Path to the configuration file.
health_data (Dict[str, Any]): Dictionary storing health check data.
config (dict): Loaded configuration from the config file.
check_interval (int): Interval in seconds between health checks. Defaults to 60.
_running (bool): Indicates if the health check service is running.
_thread (threading.Thread or None): Thread running the health check loop.
streamlit_url (str): URL of the Streamlit service. Defaults to "http://localhost".
streamlit_port (int): Port of the Streamlit service. Defaults to 8501.

logger

config_path

health_data: Dict[str, Any]

config

check_interval

streamlit_url

streamlit_port

def start(self): View Source

903    def start(self):
904        """
905        Start the periodic health-check background thread.
906        If the `healthcheck` runner is already active, this method is a no-op and returns
907        immediately. Otherwise, it marks the runner as running, creates a daemon thread
908        targeting self._run_checks_periodically, stores the thread on self._thread, and
909        starts it.
910        
911        Behavior and side effects:
912        
913        - Idempotent while running: repeated calls will not create additional threads.
914        - Sets self._running to True.
915        - Assigns a daemon threading.Thread to self._thread and starts it.
916        - Non-blocking: returns after starting the background thread.
917        - The daemon thread will not prevent the process from exiting.
918        
919        Thread-safety:
920        
921        - If start() may be called concurrently from multiple threads, callers should
922            ensure proper synchronization (e.g., external locking) to avoid race conditions.
923            
924        Returns:
925        
926                None
927        """
928        
929        if self._running:
930            return
931            
932        self._running = True
933        self._thread = threading.Thread(target=self._run_checks_periodically, daemon=True)
934        self._thread.start()

Start the periodic health-check background thread. If the healthcheck runner is already active, this method is a no-op and returns immediately. Otherwise, it marks the runner as running, creates a daemon thread targeting self._run_checks_periodically, stores the thread on self._thread, and starts it.

Behavior and side effects:

Idempotent while running: repeated calls will not create additional threads.
Sets self._running to True.
Assigns a daemon threading.Thread to self._thread and starts it.
Non-blocking: returns after starting the background thread.
The daemon thread will not prevent the process from exiting.

Thread-safety:

If start() may be called concurrently from multiple threads, callers should ensure proper synchronization (e.g., external locking) to avoid race conditions.

Returns:

    None

def stop(self): View Source

936    def stop(self):
937        """Stop the health check service."""
938        self._running = False
939        if self._thread:
940            self._thread.join(timeout=1)

Stop the health check service.

def run_all_checks(self): View Source

948    def run_all_checks(self):
949        """Run all configured health checks and update health data."""
950        # Update timestamp
951        self.health_data["last_updated"] = datetime.now().isoformat()
952        
953        # Check Streamlit server
954        self.health_data["streamlit_server"] = self.check_streamlit_server()
955        
956        # System checks
957        if self.config["system_checks"].get("cpu", True):
958            self.check_cpu()
959        if self.config["system_checks"].get("memory", True):
960            self.check_memory()
961        if self.config["system_checks"].get("disk", True):
962            self.check_disk()
963            
964        # Rest of the existing checks...
965        self.check_dependencies()
966        self.run_custom_checks()
967        self.check_streamlit_pages()
968        self._update_overall_status()

Run all configured health checks and update health data.

def check_cpu(self): View Source

970    def check_cpu(self):
971        """
972        Checks the current CPU usage and updates the health status based on configured thresholds.
973        Measures the CPU usage percentage over a 1-second interval using psutil. Compares the result
974        against warning and critical thresholds defined in the configuration. Sets the status to
975        'healthy', 'warning', or 'critical' accordingly, and updates the health data dictionary.
976        
977        Returns:
978        
979            None
980        """
981        
982        cpu_percent = psutil.cpu_percent(interval=1)
983        warning_threshold = self.config["thresholds"].get("cpu_warning", 70)
984        critical_threshold = self.config["thresholds"].get("cpu_critical", 90)
985        
986        status = "healthy"
987        if cpu_percent >= critical_threshold:
988            status = "critical"
989        elif cpu_percent >= warning_threshold:
990            status = "warning"
991            
992        self.health_data["system"]["cpu"] = {
993            "usage_percent": cpu_percent,
994            "status": status
995        }

Checks the current CPU usage and updates the health status based on configured thresholds. Measures the CPU usage percentage over a 1-second interval using psutil. Compares the result against warning and critical thresholds defined in the configuration. Sets the status to 'healthy', 'warning', or 'critical' accordingly, and updates the health data dictionary.

Returns:

None

def check_memory(self): View Source

 997    def check_memory(self):
 998        """
 999        Checks the system's memory usage and updates the health status accordingly.
1000        Retrieves the current memory usage statistics using psutil, compares the usage percentage
1001        against configured warning and critical thresholds, and sets the memory status to 'healthy',
1002        'warning', or 'critical'. Updates the health_data dictionary with total memory, available memory,
1003        usage percentage, and status.
1004        
1005        Returns:
1006        
1007            None
1008        """
1009        
1010        memory = psutil.virtual_memory()
1011        memory_percent = memory.percent
1012        warning_threshold = self.config["thresholds"].get("memory_warning", 70)
1013        critical_threshold = self.config["thresholds"].get("memory_critical", 90)
1014        
1015        status = "healthy"
1016        if memory_percent >= critical_threshold:
1017            status = "critical"
1018        elif memory_percent >= warning_threshold:
1019            status = "warning"
1020            
1021        self.health_data["system"]["memory"] = {
1022            "total_gb": round(memory.total / (1024**3), 2),
1023            "available_gb": round(memory.available / (1024**3), 2),
1024            "usage_percent": memory_percent,
1025            "status": status
1026        }

Checks the system's memory usage and updates the health status accordingly. Retrieves the current memory usage statistics using psutil, compares the usage percentage against configured warning and critical thresholds, and sets the memory status to 'healthy', 'warning', or 'critical'. Updates the health_data dictionary with total memory, available memory, usage percentage, and status.

Returns:

None

def check_disk(self): View Source

1028    def check_disk(self):
1029        """
1030        Checks the disk usage of the root filesystem and updates the health status.
1031        Retrieves disk usage statistics using psutil, compares the usage percentage
1032        against configured warning and critical thresholds, and sets the disk status
1033        accordingly (`healthy`, `warning`, or `critical`). Updates the health_data
1034        dictionary with total disk size, free space, usage percentage, and status.
1035        
1036        Returns:
1037        
1038            None
1039        """
1040        
1041        disk = psutil.disk_usage('/')
1042        disk_percent = disk.percent
1043        warning_threshold = self.config["thresholds"].get("disk_warning", 70)
1044        critical_threshold = self.config["thresholds"].get("disk_critical", 90)
1045        
1046        status = "healthy"
1047        if disk_percent >= critical_threshold:
1048            status = "critical"
1049        elif disk_percent >= warning_threshold:
1050            status = "warning"
1051            
1052        self.health_data["system"]["disk"] = {
1053            "total_gb": round(disk.total / (1024**3), 2),
1054            "free_gb": round(disk.free / (1024**3), 2),
1055            "usage_percent": disk_percent,
1056            "status": status
1057        }

Checks the disk usage of the root filesystem and updates the health status. Retrieves disk usage statistics using psutil, compares the usage percentage against configured warning and critical thresholds, and sets the disk status accordingly (healthy, warning, or critical). Updates the health_data dictionary with total disk size, free space, usage percentage, and status.

Returns:

None

def check_dependencies(self): View Source

1059    def check_dependencies(self):
1060        """
1061        Checks the health of configured dependencies, including API endpoints and databases.
1062        Iterates through the list of API endpoints and databases specified in the configuration,
1063        and performs health checks on each by invoking the corresponding internal methods.
1064        
1065        Raises:
1066        
1067            Exception: If any dependency check fails.
1068        """
1069        
1070        # Check API endpoints
1071        for endpoint in self.config["dependencies"].get("api_endpoints", []):
1072            self._check_api_endpoint(endpoint)
1073            
1074        # Check database connections
1075        for db in self.config["dependencies"].get("databases", []):
1076            self._check_database(db)

Checks the health of configured dependencies, including API endpoints and databases. Iterates through the list of API endpoints and databases specified in the configuration, and performs health checks on each by invoking the corresponding internal methods.

Raises:

Exception: If any dependency check fails.

def register_custom_check(self, name: str, check_func: Callable[[], Dict[str, Any]]): View Source

1137    def register_custom_check(self, name: str, check_func: Callable[[], Dict[str, Any]]):
1138        """
1139        Register a custom health check function.
1140        
1141        Args:
1142        
1143            name: Name of the custom check
1144            check_func: Function that performs the check and returns a dictionary with results
1145        """
1146        if "custom_checks" not in self.health_data:
1147            self.health_data["custom_checks"] = {}
1148            
1149        self.health_data["custom_checks"][name] = {
1150            "status": "unknown",
1151            "check_func": check_func
1152        }

Args:

name: Name of the custom check
check_func: Function that performs the check and returns a dictionary with results

def run_custom_checks(self): View Source

1154    def run_custom_checks(self):
1155        """Run all registered custom health checks."""
1156        if "custom_checks" not in self.health_data:
1157            return
1158            
1159        for name, check_info in list(self.health_data["custom_checks"].items()):
1160            if "check_func" in check_info and callable(check_info["check_func"]):
1161                try:
1162                    result = check_info["check_func"]()
1163                    # Remove the function reference from the result
1164                    func = check_info["check_func"]
1165                    self.health_data["custom_checks"][name] = result
1166                    # Add the function back
1167                    self.health_data["custom_checks"][name]["check_func"] = func
1168                except Exception as e:
1169                    self.health_data["custom_checks"][name] = {
1170                        "status": "critical",
1171                        "error": str(e),
1172                        "check_func": check_info["check_func"]
1173                    }

Run all registered custom health checks.

def get_health_data(self) -> Dict: View Source

1247    def get_health_data(self) -> Dict:
1248        """Get the latest health check data."""
1249        # Create a copy without the function references
1250        result: Dict[str, Any] = {}
1251        for key, value in self.health_data.items():
1252            if key == "custom_checks":
1253                result[key] = {}
1254                for check_name, check_data in value.items():
1255                    if isinstance(check_data, dict):
1256                        check_copy = check_data.copy()
1257                        if "check_func" in check_copy:
1258                            del check_copy["check_func"]
1259                        result[key][check_name] = check_copy
1260            else:
1261                result[key] = value
1262        return result

Get the latest health check data.

def save_config(self): View Source

1264    def save_config(self):
1265        """
1266        Saves the current health check configuration to a JSON file.
1267        Attempts to write the configuration stored in `self.config` to the file specified by `self.config_path`.
1268        Displays a success message in the Streamlit app upon successful save.
1269        Handles and displays appropriate error messages for file not found, permission issues, JSON decoding errors, and other exceptions.
1270        
1271        Raises:
1272        
1273            FileNotFoundError: If the configuration file path does not exist.
1274            PermissionError: If there are insufficient permissions to write to the file.
1275            json.JSONDecodeError: If there is an error decoding the JSON data.
1276            Exception: For any other exceptions encountered during the save process.
1277        """
1278        
1279        try:
1280            with open(self.config_path, "w") as f:
1281                json.dump(self.config, f, indent=2)
1282                st.success(f"Health check config saved successfully to {self.config_path}")
1283        except FileNotFoundError:
1284            st.error(f"Configuration file not found: {self.config_path}")
1285        except PermissionError:
1286            st.error(f"Permission denied: Unable to write to {self.config_path}")
1287        except json.JSONDecodeError:
1288            st.error(f"Error decoding JSON in config file: {self.config_path}")
1289        except Exception as e:
1290            st.error(f"Error saving health check config: {str(e)}")

Saves the current health check configuration to a JSON file. Attempts to write the configuration stored in self.config to the file specified by self.config_path. Displays a success message in the Streamlit app upon successful save. Handles and displays appropriate error messages for file not found, permission issues, JSON decoding errors, and other exceptions.

Raises:

FileNotFoundError: If the configuration file path does not exist.
PermissionError: If there are insufficient permissions to write to the file.
json.JSONDecodeError: If there is an error decoding the JSON data.
Exception: For any other exceptions encountered during the save process.

def check_streamlit_pages(self): View Source

1291    def check_streamlit_pages(self):
1292        """
1293        Checks for errors in Streamlit pages and updates the health data accordingly.
1294        This method retrieves page errors using StreamlitPageMonitor.get_page_errors().
1295        If errors are found, it sets the 'streamlit_pages' status to 'critical' and updates
1296        the overall health status to 'critical'. If no errors are found, it marks the
1297        'streamlit_pages' status as 'healthy'.
1298        
1299        Updates:
1300        
1301            self.health_data["streamlit_pages"]: Dict containing status, error count, errors, and details.
1302            self.health_data["overall_status"]: Set to 'critical' if errors are detected.
1303            self.health_data["streamlit_pages"]["details"]: A summary of the errors found.
1304            
1305        Returns:
1306        
1307            None
1308        """
1309        
1310        page_errors = StreamlitPageMonitor.get_page_errors()
1311        
1312        if "streamlit_pages" not in self.health_data:
1313            self.health_data["streamlit_pages"] = {}
1314        
1315        if page_errors:
1316            total_errors = sum(len(errors) for errors in page_errors.values())
1317            self.health_data["streamlit_pages"] = {
1318                "status": "critical",
1319                "error_count": total_errors,
1320                "errors": page_errors,
1321                "details": "Errors detected in Streamlit pages"
1322            }
1323            # This affects overall status
1324            self.health_data["overall_status"] = "critical"
1325        else:
1326            self.health_data["streamlit_pages"] = {
1327                "status": "healthy",
1328                "error_count": 0,
1329                "errors": {},
1330                "details": "All pages functioning normally"
1331            }

Checks for errors in Streamlit pages and updates the health data accordingly. This method retrieves page errors using StreamlitPageMonitor.get_page_errors(). If errors are found, it sets the 'streamlit_pages' status to 'critical' and updates the overall health status to 'critical'. If no errors are found, it marks the 'streamlit_pages' status as 'healthy'.

Updates:

self.health_data["streamlit_pages"]: Dict containing status, error count, errors, and details.
self.health_data["overall_status"]: Set to 'critical' if errors are detected.
self.health_data["streamlit_pages"]["details"]: A summary of the errors found.

Returns:

None

def check_streamlit_server(self) -> Dict[str, Any]: View Source

1333    def check_streamlit_server(self) -> Dict[str, Any]:
1334        """
1335        Checks the health status of the Streamlit server by sending a GET request to the /healthz endpoint.
1336        
1337        Returns:
1338        
1339            Dict[str, Any]: A dictionary containing the health status, response code, latency in milliseconds,
1340                            message, and the URL checked. If the server is healthy (HTTP 200), status is "healthy".
1341                            Otherwise, status is "critical" with error details.
1342                            
1343        Handles:
1344        
1345            - Connection errors: Returns critical status with connection error details.
1346            - Timeout errors: Returns critical status with timeout error details.
1347            - Other exceptions: Returns critical status with unknown error details.
1348            
1349        Logs:
1350        
1351            - The URL being checked.
1352            - The response status code and text.
1353            - Health status and response time if healthy.
1354            - Warnings and errors for unhealthy or failed checks.
1355        """
1356        
1357        try:
1358            host = self.streamlit_url.rstrip('/')
1359            if not host.startswith(('http://', 'https://')):
1360                host = f"http://{host}"
1361            
1362            url = f"{host}:{self.streamlit_port}/healthz"
1363            self.logger.info(f"Checking Streamlit server health at: {url}")
1364            
1365            start_time = time.time()
1366            response = requests.get(url, timeout=3)
1367            total_time = (time.time() - start_time) * 1000
1368            self.logger.info(f"{response.status_code} - {response.text}")
1369            # Check if the response is healthy
1370            if response.status_code == 200:
1371                self.logger.info(f"Streamlit server healthy - Response time: {round(total_time, 2)}ms")
1372                return {
1373                    "status": "healthy",
1374                    "response_code": response.status_code,
1375                    "latency_ms": round(total_time, 2),
1376                    "message": "Streamlit server is running",
1377                    "url": url
1378                }
1379            else:
1380                self.logger.warning(f"Unhealthy response from server: {response.status_code}")
1381                return {
1382                    "status": "critical",
1383                    "response_code": response.status_code,
1384                    "error": f"Unhealthy response from server: {response.status_code}",
1385                    "message": "Streamlit server is not healthy",
1386                    "url": url
1387                }
1388
1389        except requests.exceptions.ConnectionError as e:
1390            self.logger.error(f"Connection error while checking Streamlit server: {str(e)}")
1391            return {
1392                "status": "critical",
1393                "error": f"Connection error: {str(e)}",
1394                "message": "Cannot connect to Streamlit server",
1395                "url": url
1396            }
1397        except requests.exceptions.Timeout as e:
1398            self.logger.error(f"Timeout while checking Streamlit server: {str(e)}")
1399            return {
1400                "status": "critical",
1401                "error": f"Timeout error: {str(e)}",
1402                "message": "Streamlit server is not responding",
1403                "url": url
1404            }
1405        except Exception as e:
1406            self.logger.error(f"Unexpected error while checking Streamlit server: {str(e)}")
1407            return {
1408                "status": "critical",
1409                "error": f"Unknown error: {str(e)}",
1410                "message": "Failed to check Streamlit server",
1411                "url": url
1412            }

Checks the health status of the Streamlit server by sending a GET request to the /healthz endpoint.

Returns:

Dict[str, Any]: A dictionary containing the health status, response code, latency in milliseconds,
                message, and the URL checked. If the server is healthy (HTTP 200), status is "healthy".
                Otherwise, status is "critical" with error details.

Handles:

- Connection errors: Returns critical status with connection error details.
- Timeout errors: Returns critical status with timeout error details.
- Other exceptions: Returns critical status with unknown error details.

Logs:

- The URL being checked.
- The response status code and text.
- Health status and response time if healthy.
- Warnings and errors for unhealthy or failed checks.

def health_check(config_path: str = 'health_check_config.json'): View Source

1414def health_check(config_path:str = "health_check_config.json"):
1415    """
1416    Displays an interactive Streamlit dashboard for monitoring application health.
1417    This function initializes and manages a health check service, presenting real-time system metrics,
1418    dependency statuses, custom checks, and Streamlit page health in a user-friendly dashboard.
1419    Users can manually refresh health checks, view detailed error information, and adjust configuration
1420    thresholds and intervals directly from the UI.
1421    
1422    Args:
1423    
1424        config_path (str, optional): Path to the health check configuration JSON file.
1425            Defaults to "health_check_config.json".
1426            
1427    Features:
1428    
1429        - Displays overall health status with color-coded indicators.
1430        - Shows last updated timestamp for health data.
1431        - Monitors Streamlit server status, latency, and errors.
1432        - Provides tabs for:
1433            * System Resources (CPU, Memory, Disk usage and status)
1434            * Dependencies (external services and their health)
1435            * Custom Checks (user-defined health checks)
1436            * Streamlit Pages (page-specific errors and status)
1437        - Allows configuration of system thresholds, check intervals, and Streamlit server settings.
1438        - Supports manual refresh and saving configuration changes.
1439        
1440    Raises:
1441    
1442        Displays error messages in the UI for any exceptions encountered during health data retrieval or processing.
1443        
1444    Returns:
1445    
1446        None. The dashboard is rendered in the Streamlit app.
1447    """
1448    
1449    logger = logging.getLogger(f"{__name__}.health_check")
1450    logger.info("Starting health check dashboard")
1451    st.title("Application Health Dashboard")
1452    
1453    # Initialize or get the health check service
1454    if "health_service" not in st.session_state:
1455        logger.info("Initializing new health check service")
1456        st.session_state.health_service = HealthCheckService(config_path = config_path)
1457        st.session_state.health_service.start()
1458    
1459    health_service = st.session_state.health_service
1460    health_service.run_all_checks()
1461    
1462    # Add controls for manual refresh and configuration
1463    col1, col2 = st.columns([3, 1])
1464    with col1:
1465        st.subheader("System Health Status")
1466    with col2:
1467        if st.button("Refresh Now"):
1468            health_service.run_all_checks()
1469    
1470    # Get the latest health data
1471    health_data = health_service.get_health_data()
1472    
1473    # Display overall status with appropriate color
1474    overall_status = health_data.get("overall_status", "unknown")
1475    status_color = {
1476        "healthy": "green",
1477        "warning": "orange",
1478        "critical": "red",
1479        "unknown": "gray"
1480    }.get(overall_status, "gray")
1481    
1482    st.markdown(
1483        f"<h3 style='color: {status_color};'>Overall Status: {overall_status.upper()}</h3>",
1484        unsafe_allow_html=True
1485    )
1486    
1487    # Display last updated time
1488    if health_data.get("last_updated"):
1489        try:
1490            last_updated = datetime.fromisoformat(health_data["last_updated"])
1491            st.text(f"Last updated: {last_updated.strftime('%Y-%m-%d %H:%M:%S')}")
1492        except Exception as e:
1493            st.error(f"Last updated: {health_data['last_updated']}")
1494            st.exception(e)
1495    
1496    server_health = health_data.get("streamlit_server", {})
1497    server_status = server_health.get("status", "unknown")
1498    server_color = {
1499        "healthy": "green",
1500        "critical": "red",
1501        "unknown": "gray"
1502    }.get(server_status, "gray")
1503
1504    st.markdown(
1505        f"### Streamlit Server Status: <span style='color: {server_color}'>{server_status.upper()}</span>",
1506        unsafe_allow_html=True
1507    )
1508
1509    if server_status != "healthy":
1510        st.error(server_health.get("message", "Server status unknown"))
1511        if "error" in server_health:
1512            st.code(server_health["error"])
1513    else:
1514        st.success(server_health.get("message", "Server is running"))
1515        if "latency_ms" in server_health:
1516            latency = server_health["latency_ms"]
1517            # Define color based on latency thresholds
1518            if latency <= 50:
1519                latency_color = "green"
1520                performance = "Excellent"
1521            elif latency <= 100:
1522                latency_color = "blue"
1523                performance = "Good"
1524            elif latency <= 200:
1525                latency_color = "orange"
1526                performance = "Fair"
1527            else:
1528                latency_color = "red"
1529                performance = "Poor"
1530                
1531            st.markdown(
1532                f"""
1533                <div style='display: flex; align-items: center; gap: 10px;'>
1534                    <div>Server Response Time:</div>
1535                    <div style='color: {latency_color}; font-weight: bold;'>
1536                        {latency} ms
1537                    </div>
1538                    <div style='color: {latency_color};'>
1539                        ({performance})
1540                    </div>
1541                </div>
1542                """,
1543                unsafe_allow_html=True
1544            )
1545    
1546    # Create tabs for different categories of health checks
1547    tab1, tab2, tab3, tab4 = st.tabs(["System Resources", "Dependencies", "Custom Checks", "Streamlit Pages"])
1548    
1549    with tab1:
1550        # Display system health checks
1551        system_data = health_data.get("system", {})
1552        
1553        # CPU
1554        if "cpu" in system_data:
1555            cpu_data = system_data["cpu"]
1556            cpu_status = cpu_data.get("status", "unknown")
1557            cpu_color = {"healthy": "green", "warning": "orange", "critical": "red"}.get(cpu_status, "gray")
1558            
1559            st.markdown(f"### CPU Status: <span style='color:{cpu_color}'>{cpu_status.upper()}</span>", unsafe_allow_html=True)
1560            st.progress(cpu_data.get("usage_percent", 0) / 100)
1561            st.text(f"CPU Usage: {cpu_data.get('usage_percent', 0)}%")
1562        
1563        # Memory
1564        if "memory" in system_data:
1565            memory_data = system_data["memory"]
1566            memory_status = memory_data.get("status", "unknown")
1567            memory_color = {"healthy": "green", "warning": "orange", "critical": "red"}.get(memory_status, "gray")
1568            
1569            st.markdown(f"### Memory Status: <span style='color:{memory_color}'>{memory_status.upper()}</span>", unsafe_allow_html=True)
1570            st.progress(memory_data.get("usage_percent", 0) / 100)
1571            st.text(f"Memory Usage: {memory_data.get('usage_percent', 0)}%")
1572            st.text(f"Total Memory: {memory_data.get('total_gb', 0)} GB")
1573            st.text(f"Available Memory: {memory_data.get('available_gb', 0)} GB")
1574        
1575        # Disk
1576        if "disk" in system_data:
1577            disk_data = system_data["disk"]
1578            disk_status = disk_data.get("status", "unknown")
1579            disk_color = {"healthy": "green", "warning": "orange", "critical": "red"}.get(disk_status, "gray")
1580            
1581            st.markdown(f"### Disk Status: <span style='color:{disk_color}'>{disk_status.upper()}</span>", unsafe_allow_html=True)
1582            st.progress(disk_data.get("usage_percent", 0) / 100)
1583            st.text(f"Disk Usage: {disk_data.get('usage_percent', 0)}%")
1584            st.text(f"Total Disk Space: {disk_data.get('total_gb', 0)} GB")
1585            st.text(f"Free Disk Space: {disk_data.get('free_gb', 0)} GB")
1586    
1587    with tab2:
1588        # Display dependency health checks
1589        dependencies = health_data.get("dependencies", {})
1590        if dependencies:
1591            # Create a dataframe for all dependencies
1592            dep_data = []
1593            for name, dep_info in dependencies.items():
1594                dep_data.append({
1595                    "Name": name,
1596                    "Type": dep_info.get("type", "unknown"),
1597                    "Status": dep_info.get("status", "unknown"),
1598                    "Details": ", ".join([f"{k}: {v}" for k, v in dep_info.items() 
1599                               if k not in ["name", "type", "status", "error"] and not isinstance(v, dict)])
1600                })
1601            
1602            # Show dependencies table
1603            if dep_data:
1604                df_deps = pd.DataFrame(dep_data)
1605                st.dataframe(df_deps)
1606            else:
1607                st.info("No dependencies configured")
1608
1609            # Create a dataframe for all custom checks from health_data
1610            custom_checks = health_data.get("custom_checks", {})
1611            check_data = []
1612            for name, check_info in custom_checks.items():
1613                if isinstance(check_info, dict) and "check_func" not in check_info:
1614                    check_data.append({
1615                        "Name": name,
1616                        "Status": check_info.get("status", "unknown"),
1617                        "Details": ", ".join([f"{k}: {v}" for k, v in check_info.items()
1618                                             if k not in ["name", "status", "check_func", "error"] and not isinstance(v, dict)]),
1619                        "Error": check_info.get("error", "")
1620                    })
1621
1622            if check_data:
1623                df_checks = pd.DataFrame(check_data)
1624
1625                # Apply color formatting to status column
1626                def color_status(val):
1627                    colors = {
1628                        "healthy": "background-color: #c6efce; color: #006100",
1629                        "warning": "background-color: #ffeb9c; color: #9c5700",
1630                        "critical": "background-color: #ffc7ce; color: #9c0006",
1631                        "unknown": "background-color: #eeeeee; color: #7f7f7f"
1632                    }
1633                    return colors.get(str(val).lower(), "")
1634
1635                # Use styled dataframe to color the Status column
1636                try:
1637                    # apply expects a function that returns a sequence of styles for the column;
1638                    # map color_status across the 'Status' column to produce the CSS strings.
1639                    st.dataframe(
1640                        df_checks.style.apply(
1641                            lambda col: col.map(color_status),
1642                            subset=["Status"]
1643                        )
1644                    )
1645                except Exception:
1646                    # Fallback if styling isn't supported in the environment
1647                    st.dataframe(df_checks)
1648            else:
1649                st.info("No custom checks configured")
1650        else:
1651            st.info("No custom checks configured")
1652    with tab4:
1653        # Always read page errors from SQLite DB for latest state
1654        page_errors = StreamlitPageMonitor.get_page_errors()
1655        error_count = sum(len(errors) for errors in page_errors.values())
1656        status = "critical" if error_count > 0 else "healthy"
1657        status_color = {
1658            "healthy": "green",
1659            "critical": "red",
1660            "unknown": "gray"
1661        }.get(status, "gray")
1662        st.markdown(f"### Page Status: <span style='color:{status_color}'>{status.upper()}</span>", unsafe_allow_html=True)
1663        st.metric("Error Count", error_count)
1664        if error_count > 0:
1665            st.markdown("<div style='background-color:#ffe6e6; color:#b30000; padding:10px; border-radius:5px; border:1px solid #b30000; font-weight:bold;'>Pages with errors:</div>",
1666            unsafe_allow_html=True)
1667            for page_name, page_errors_list in page_errors.items():
1668                display_name = page_name.split("/")[-1] if "/" in page_name else page_name
1669                for error_info in page_errors_list:
1670                    if isinstance(error_info, dict):
1671                        with st.expander(f"Error in {display_name}"):
1672                            st.info(error_info.get('error', 'Unknown error'))
1673                            if error_info.get('type') == 'streamlit_error':
1674                                st.text("Type: Streamlit Error")
1675                            else:
1676                                st.text("Type: Exception")
1677                            st.text("Traceback:")
1678                            st.code("".join(error_info.get('traceback', ['No traceback available'])))
1679                            st.text(f"Timestamp: {error_info.get('timestamp', 'No timestamp')}")
1680    
1681    # Configuration section
1682    with st.expander("Health Check Configuration"):
1683        st.subheader("System Check Thresholds")
1684        
1685        col1, col2 = st.columns(2)
1686        with col1:
1687            cpu_warning = st.slider("CPU Warning Threshold (%)", 
1688                                min_value=10, max_value=90, 
1689                                value=health_service.config["thresholds"].get("cpu_warning", 70),
1690                                step=5)
1691            memory_warning = st.slider("Memory Warning Threshold (%)", 
1692                                   min_value=10, max_value=90, 
1693                                   value=health_service.config["thresholds"].get("memory_warning", 70),
1694                                   step=5)
1695            disk_warning = st.slider("Disk Warning Threshold (%)", 
1696                                 min_value=10, max_value=90, 
1697                                 value=health_service.config["thresholds"].get("disk_warning", 70),
1698                                 step=5)
1699            streamlit_url_update = st.text_input(
1700                "Streamlit Server URL",
1701                value=health_service.config.get("streamlit_url", "http://localhost")
1702            )
1703        
1704        with col2:
1705            cpu_critical = st.slider("CPU Critical Threshold (%)", 
1706                                 min_value=20, max_value=95, 
1707                                 value=health_service.config["thresholds"].get("cpu_critical", 90),
1708                                 step=5)
1709            memory_critical = st.slider("Memory Critical Threshold (%)", 
1710                                    min_value=20, max_value=95, 
1711                                    value=health_service.config["thresholds"].get("memory_critical", 90),
1712                                    step=5)
1713            disk_critical = st.slider("Disk Critical Threshold (%)", 
1714                                  min_value=20, max_value=95, 
1715                                  value=health_service.config["thresholds"].get("disk_critical", 90),
1716                                  step=5)
1717        
1718            check_interval = st.slider("Check Interval (seconds)", 
1719                                min_value=10, max_value=300, 
1720                                value=health_service.config.get("check_interval", 60),
1721                                step=10)
1722            streamlit_port_update = st.number_input(
1723                "Streamlit Server Port",
1724                value=health_service.config.get("streamlit_port", 8501),
1725                step=1
1726            )
1727        
1728        if st.button("Save Configuration"):
1729            # Update configuration
1730            health_service.config["thresholds"]["cpu_warning"] = cpu_warning
1731            health_service.config["thresholds"]["cpu_critical"] = cpu_critical
1732            health_service.config["thresholds"]["memory_warning"] = memory_warning
1733            health_service.config["thresholds"]["memory_critical"] = memory_critical
1734            health_service.config["thresholds"]["disk_warning"] = disk_warning
1735            health_service.config["thresholds"]["disk_critical"] = disk_critical
1736            health_service.config["check_interval"] = check_interval
1737            health_service.config["streamlit_url"] = streamlit_url_update
1738            health_service.config["streamlit_port"] = streamlit_port_update
1739            
1740            # Save to file
1741            health_service.save_config()
1742            st.success("Configuration saved successfully")
1743            
1744            # Restart the service if interval changed
1745            health_service.stop()
1746            health_service.start()

Displays an interactive Streamlit dashboard for monitoring application health. This function initializes and manages a health check service, presenting real-time system metrics, dependency statuses, custom checks, and Streamlit page health in a user-friendly dashboard. Users can manually refresh health checks, view detailed error information, and adjust configuration thresholds and intervals directly from the UI.

Args:

config_path (str, optional): Path to the health check configuration JSON file.
    Defaults to "health_check_config.json".

Features:

- Displays overall health status with color-coded indicators.
- Shows last updated timestamp for health data.
- Monitors Streamlit server status, latency, and errors.
- Provides tabs for:
    * System Resources (CPU, Memory, Disk usage and status)
    * Dependencies (external services and their health)
    * Custom Checks (user-defined health checks)
    * Streamlit Pages (page-specific errors and status)
- Allows configuration of system thresholds, check intervals, and Streamlit server settings.
- Supports manual refresh and saving configuration changes.

Raises:

Displays error messages in the UI for any exceptions encountered during health data retrieval or processing.

Returns:

None. The dashboard is rendered in the Streamlit app.