streamlit_healthcheck.healthcheck
import streamlit as st
import psutil
import pandas as pd
import requests
import time
import threading
import json
import os
from datetime import datetime
from typing import Dict, List, Any, Optional, Callable
import functools
import traceback
import logging
import sqlite3

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

class StreamlitPageMonitor:
    """
    Singleton class that monitors and records errors occurring within Streamlit pages.
    It captures both explicit Streamlit error messages (monkey-patching st.error) and
    uncaught exceptions raised during the execution of monitored page functions, and
    persists error details to a local SQLite database.

    Key responsibilities

    - Intercept Streamlit error calls by monkey-patching st.error and record them with
      a stack trace, timestamp, status, and type.
    - Provide a decorator `monitor_page(page_name)` to set a page context, capture
      exceptions raised while rendering/executing a page, and record those exceptions.
    - Store errors in an in-memory structure grouped by page and persist them to
      an SQLite database for later inspection.
    - Provide utilities to load, deduplicate, clear, and query stored errors.

    Behavior and side effects

    - Implements the Singleton pattern: only one instance exists per Python process.
    - On first instantiation, optionally accepts a custom db_path and initializes
      the SQLite database and its parent directory (creating it if necessary).
    - Monkey-patches `streamlit.error` (st.error) to capture calls and still forward
      them to the original st.error implementation.
    - Records the following fields for each error: page, error, traceback, timestamp,
      status, type. The SQLite table `errors` mirrors these fields and includes an
      auto-incrementing `id`.
    - Persists errors immediately to SQLite when captured; database IO errors are
      logged but do not suppress the original exception (for monitored exceptions,
      the exception is re-raised after recording).

    Public API (methods)

    - __new__(cls, db_path=None)
        Create or return the singleton StreamlitPageMonitor instance.

        Parameters
        ----------
        db_path : Optional[str]
            If provided on the first instantiation, overrides the class-level
            database path used to persist captured Streamlit error information.

        Returns
        -------
        StreamlitPageMonitor
            The singleton instance of the class.

        Behavior
        --------
        - On first instantiation (when cls._instance is None):
            - Allocates the singleton via super().__new__.
            - Optionally sets cls._db_path from the provided db_path.
            - Logs the configured DB path.
            - Monkey-patches streamlit.error (st.error) with a wrapper that:
                - Builds an error record containing the error text, a formatted stack trace,
                  ISO timestamp, severity/status, an error type marker, and the current page.
                - Normalizes a missing current page to "unknown_page".
                - Stores the record in the in-memory cls._errors dictionary keyed by page.
                - Attempts to persist the record to the SQLite DB using cls().save_errors_to_db,
                  logging any persistence errors without interrupting Streamlit's normal error display.
                - Calls the original st.error to preserve expected UI behavior.
            - Initializes the SQLite DB via cls._init_db().
        - On subsequent calls:
            - Returns the existing singleton instance.
            - If db_path is provided, updates cls._db_path for future use.

        Side effects
        ------------
        - Replaces st.error globally for the running process.
        - Writes error records to both an in-memory structure (cls._errors) and to the
          configured SQLite database (if persistence succeeds).
        - Logs informational and error messages.
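
The wrap-and-forward behavior described above can be sketched without Streamlit. This is an illustrative sketch only: `ui` stands in for the `streamlit` module, and `captured`, `shown`, `current_page`, and `install_patch` are hypothetical names that do not exist in this module.

```python
import traceback
from datetime import datetime
from types import SimpleNamespace

# Stand-in for the `streamlit` module (assumption, for illustration only).
shown = []
ui = SimpleNamespace(error=lambda *args, **kwargs: shown.append(args))

captured = {}        # page name -> list of error records
current_page = None  # no page context has been set yet


def install_patch():
    # Keep a reference to the original callable so it can still be invoked.
    original_error = ui.error

    def patched_error(*args, **kwargs):
        # Normalize a missing page context before building the record.
        page = current_page if current_page is not None else "unknown_page"
        record = {
            "error": " ".join(str(a) for a in args),
            "traceback": traceback.format_stack(),
            "timestamp": datetime.now().isoformat(),
            "status": "critical",
            "type": "streamlit_error",
            "page": page,
        }
        captured.setdefault(page, []).append(record)
        # Forward to the original so the visible behavior is unchanged.
        return original_error(*args, **kwargs)

    ui.error = patched_error


install_patch()
ui.error("disk", "full")
```

After the call, the record is stored under "unknown_page" and the original callable has still received the arguments, which is the property the wrapper is meant to preserve.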

        Notes
        -----
        - The method assumes the class defines/has: _instance, _db_path, _current_page,
          _errors, _st_error (original st.error), save_errors_to_db, and _init_db.
        - Exceptions raised during saving of individual errors are caught and logged;
          exceptions from instance creation or DB initialization may propagate.
        - The implementation is not explicitly thread-safe; concurrent instantiation
          attempts may require external synchronization if used in multi-threaded contexts.

    - set_page_context(cls, page_name: str)
        Set the current page name used when recording subsequent errors.
    - monitor_page(cls, page_name: str) -> Callable
        Decorator for page rendering/execution functions. Sets the page context,
        clears previously recorded non-Streamlit errors for that page, runs the
        function, records and persists any raised exception, and re-raises it.
    - _handle_st_error(cls, error_message: str)

        Handles Streamlit-specific errors by recording error details for the current page.

        Args:
            error_message (str): The error message to be logged.

        Side Effects:
            Updates the class-level _errors dictionary with error information for the current Streamlit page.

        Error Information Stored:
            - error: Formatted error message.
            - traceback: Stack trace at the point of error.
            - timestamp: Time when the error occurred (ISO format).
            - status: Error severity ('critical').
            - type: Error type ('streamlit_error').
    - get_page_errors(cls) -> dict
        Load errors from the database and return a dictionary mapping page names to
        lists of error dicts. Performs basic deduplication by error message.
    - save_errors_to_db(cls, errors: Iterable[dict])
        Persist a list of error dictionaries to the configured SQLite database.
        Ensures traceback is stored as a string (JSON if originally a list).
    - clear_errors(cls, page_name: Optional[str] = None)
        Clear in-memory errors for a specific page or all pages and delete matching
        rows from the database.
    - _init_db(cls)
        Ensure the database directory exists and create the `errors` table if it
        does not exist.
    - load_errors_from_db(cls, page=None, status=None, limit=None) -> List[dict]
        Query the database for errors, optionally filtering by page and/or status,
        returning a list of error dictionaries ordered by timestamp (descending)
        and limited if requested.

    Storage and format

    - Default DB path: ~/.local/share/streamlit-healthcheck/streamlit_page_errors.db (overridable).
    - SQLite table `errors` columns: id, page, error, traceback, timestamp, status, type.
    - Tracebacks may be stored as JSON strings (if originally lists) or plain strings.

    Concurrency and robustness

    - Designed for single-process usage typical of Streamlit apps. The singleton and
      monkey-patching are process-global.
    - Database interactions use short-lived connections; callers should handle any
      exceptions arising from DB access (errors are logged internally).
    - Decorator preserves original function metadata via functools.wraps.

    Examples

    - Use as a decorator on page render function:
        >>> @StreamlitPageMonitor.monitor_page("home")
        ... def render_home():
        ...     ...  # page body goes here

    - Set page context manually:
        >>> StreamlitPageMonitor.set_page_context("settings")

    - Set custom DB path on first instantiation:
        >>> # Place this once at the top of your Streamlit app, before any error monitoring or
        >>> # decorator usage, so that the SQLite database is created at the specified path;
        >>> # otherwise the default path
        >>> # `~/.local/share/streamlit-healthcheck/streamlit_page_errors.db` is used.
        >>> StreamlitPageMonitor(db_path="/home/saradindu/dev/streamlit_page_errors.db")
        ...

    SQLite Database Schema
    ---------------------
    The following schema is used for persisting errors:

    ```sql
    CREATE TABLE IF NOT EXISTS errors (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        page TEXT,
        error TEXT,
        traceback TEXT,
        timestamp TEXT,
        status TEXT,
        type TEXT
    );
    ```

    Field Descriptions:

    | Column     | Type    | Description                                 |
    |------------|---------|---------------------------------------------|
    | id         | INTEGER | Auto-incrementing primary key               |
    | page       | TEXT    | Name of the Streamlit page                  |
    | error      | TEXT    | Error message                               |
    | traceback  | TEXT    | Stack trace or traceback (as string/JSON)   |
    | timestamp  | TEXT    | ISO8601 timestamp of error occurrence       |
    | status     | TEXT    | Severity/status (e.g., 'critical')          |
    | type       | TEXT    | Error type ('streamlit_error', 'exception') |

    Example:

        >>> @StreamlitPageMonitor.monitor_page("home")
        ... def render_home():
        ...     ...  # page body goes here

    Notes

    - The class monkey-patches st.error globally when first instantiated; ensure
      this side effect is acceptable in your environment.
    - Errors captured by st.error that occur outside any known page are recorded
      under the page name "unknown_page".
    - The schema is created/ensured in `_init_db()`.
    - Tracebacks may be stored as JSON strings or plain text.
    - Errors are persisted immediately upon capture.
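
The schema and the JSON-or-text traceback convention noted above can be exercised directly with the standard `sqlite3` module. The following is a minimal sketch that uses an in-memory database rather than the configured path; the row values are made up for illustration.

```python
import json
import sqlite3

# In-memory database so the sketch leaves no files behind (assumption:
# the real table lives at the path configured on the monitor class).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS errors ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, page TEXT, error TEXT, "
    "traceback TEXT, timestamp TEXT, status TEXT, type TEXT)"
)

# A list-shaped traceback is JSON-encoded before storage.
tb_frames = ["frame 1", "frame 2"]
conn.execute(
    "INSERT INTO errors (page, error, traceback, timestamp, status, type) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("home", "boom", json.dumps(tb_frames), "2024-01-01T00:00:00", "critical", "exception"),
)
conn.commit()

# Parameterized filter on page, newest first.
rows = conn.execute(
    "SELECT page, error, traceback FROM errors WHERE page = ? ORDER BY timestamp DESC",
    ("home",),
).fetchall()
page, error, tb_text = rows[0]
decoded = json.loads(tb_text)  # only valid when the traceback was stored as JSON
conn.close()
```

A reader of the table cannot tell from the column alone whether a traceback is JSON or plain text, so decoding should be attempted defensively in real use.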

    """
    _instance = None
    _errors: Dict[str, List[Dict[str, Any]]] = {}
    _st_error = st.error
    _current_page = None

    # --- SQLite schema for error persistence ---
    # Table: errors
    # Fields:
    #   id INTEGER PRIMARY KEY AUTOINCREMENT
    #   page TEXT
    #   error TEXT
    #   traceback TEXT
    #   timestamp TEXT
    #   status TEXT
    #   type TEXT

    # Local development DB path
    _db_path = os.path.join(os.path.expanduser("~"), "dev", "streamlit-healthcheck", "streamlit_page_errors.db")
    # Final build DB path
    #_db_path = os.path.join(os.path.expanduser("~"), ".local", "share", "streamlit-healthcheck", "streamlit_page_errors.db")

    def __new__(cls, db_path=None):
        """
        Create or return the singleton StreamlitPageMonitor instance.
        """

        if cls._instance is None:
            cls._instance = super(StreamlitPageMonitor, cls).__new__(cls)
            # Allow db_path override at first instantiation
            if db_path is not None:
                cls._db_path = db_path
            logger.info(f"StreamlitPageMonitor DB path set to: {cls._db_path}")
            # Monkey patch st.error to capture error messages
            def patched_error(*args, **kwargs):
                error_message = " ".join(str(arg) for arg in args)
                current_page = cls._current_page
                # Ensure current_page is a string, not None, before building the
                # record so the persisted 'page' matches the in-memory key
                if current_page is None:
                    current_page = "unknown_page"
                error_info = {
                    'error': error_message,
                    'traceback': traceback.format_stack(),
                    'timestamp': datetime.now().isoformat(),
                    'status': 'critical',
                    'type': 'streamlit_error',
                    'page': current_page
                }
                if current_page not in cls._errors:
                    cls._errors[current_page] = []
                cls._errors[current_page].append(error_info)
                # Persist to DB
                try:
                    cls().save_errors_to_db([error_info])
                except Exception as e:
                    logger.error(f"Failed to save Streamlit error to DB: {e}")
                # Call original st.error
                return cls._st_error(*args, **kwargs)

            st.error = patched_error

            # Initialize SQLite database
            cls._init_db()
        else:
            # If already instantiated, allow updating db_path if provided
            if db_path is not None:
                cls._db_path = db_path
        return cls._instance

    @classmethod
    def _handle_st_error(cls, error_message: str):
        """
        Handles Streamlit-specific errors by recording error details for the current page.
        """

        # Get current page name from Streamlit context
        current_page = getattr(st, '_current_page', 'unknown_page')
        error_info = {
            'error': f"Streamlit Error: {error_message}",
            'traceback': traceback.format_stack(),
            'timestamp': datetime.now().isoformat(),
            'status': 'critical',
            'type': 'streamlit_error',
            'page': current_page
        }
        # Initialize list for page if not exists
        if current_page not in cls._errors:
            cls._errors[current_page] = []
        # Add new error
        cls._errors[current_page].append(error_info)
        # Persist to DB
        try:
            cls().save_errors_to_db([error_info])
        except Exception as e:
            logger.error(f"Failed to save Streamlit error to DB: {e}")

    @classmethod
    def set_page_context(cls, page_name: str):
        """Set the current page context"""
        cls._current_page = page_name

    @classmethod
    def monitor_page(cls, page_name: str):
        """
        Decorator to monitor and log exceptions for a specific Streamlit page.

        Args:
            page_name (str): The name of the page to monitor.

        Returns:
            Callable: A decorator that wraps the target function, sets the page context,
            clears previous non-Streamlit errors, and logs any exceptions that occur during execution.

        The decorator performs the following actions:

        - Sets the current page context using `cls.set_page_context`.
        - Clears previous exception errors for the page, retaining only those marked as 'streamlit_error'.
        - Executes the wrapped function.
        - If an exception occurs, logs detailed error information (error message, traceback, timestamp, status, type, and page)
          to `cls._errors` under the given page name, then re-raises the exception.
        """

        def decorator(func):
            """
            Decorator to manage page-specific error handling and context setting.
            This decorator sets the current page context before executing the decorated function.
            It clears previous exception errors for the page, retaining only Streamlit error calls.
            If an exception occurs during function execution, it captures error details including
            the error message, traceback, timestamp, status, type, and page name, and appends them
            to the page's error log. The exception is then re-raised.

            Args:
                func (Callable): The function to be decorated.

            Returns:
                Callable: The wrapped function with error handling and context management.
            """

            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                # Set the current page context
                cls.set_page_context(page_name)
                try:
                    # Clear previous exception errors but keep st.error calls
                    if page_name in cls._errors:
                        cls._errors[page_name] = [
                            e for e in cls._errors[page_name]
                            if e.get('type') == 'streamlit_error'
                        ]
                    result = func(*args, **kwargs)
                    return result
                except Exception as e:
                    error_info = {
                        'error': str(e),
                        'traceback': traceback.format_exc(),
                        'timestamp': datetime.now().isoformat(),
                        'status': 'critical',
                        'type': 'exception',
                        'page': page_name
                    }
                    if page_name not in cls._errors:
                        cls._errors[page_name] = []
                    cls._errors[page_name].append(error_info)
                    # Persist to DB
                    try:
                        cls().save_errors_to_db([error_info])
                    except Exception as db_exc:
                        logger.error(f"Failed to save exception error to DB: {db_exc}")
                    raise
            return wrapper
        return decorator

    @classmethod
    def get_page_errors(cls):
        """
        Load error records from storage and return
        them grouped by page.
        This class method calls cls().load_errors_from_db() to retrieve a sequence of error records
        (each expected to be a mapping). It normalizes each record to a dictionary with the keys:

        - 'error' (str): error message, default "Unknown error"
        - 'traceback' (list): traceback frames or lines, default []
        - 'timestamp' (str): timestamp string, default ""
        - 'type' (str): error type/category, default "unknown"

        Grouping and uniqueness:

        - Records are grouped by the 'page' key; if a record has no 'page' key, the page name
          "unknown" is used.
        - For each page, only unique errors are kept using the 'error' string as the deduplication
          key. When multiple records for the same page have the same 'error' value, the last
          occurrence in the loaded sequence will be retained.

        Return value:

        - dict[str, list[dict]]: mapping from page name to a list of normalized error dicts.

        Error handling:

        - Any exception raised while loading or processing records will be logged via logger.error.
          The method will return the result accumulated so far (or an empty dict if nothing was
          accumulated).

        Notes:

        - The class is expected to be instantiable (cls()) and to provide a load_errors_from_db()
          method that yields or returns an iterable of mappings.
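
The "last occurrence is retained" rule follows from building a dict keyed by the error message. A small sketch with made-up records shows the effect; note that because records are loaded in descending timestamp order, the last occurrence in the sequence is the oldest duplicate.

```python
# Made-up records for one page, in the order they were loaded.
errors = [
    {"error": "boom", "timestamp": "t1"},
    {"error": "other", "timestamp": "t2"},
    {"error": "boom", "timestamp": "t3"},
]

# Later records overwrite earlier ones with the same message, while the
# key keeps the position of its first appearance (dicts preserve insertion order).
unique = list({e["error"]: e for e in errors}.values())
```

Here `unique` holds two records: the "boom" entry carries the values of the third input record but stays in first position.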
        """

        result = {}
        try:
            db_errors = cls().load_errors_from_db()
            for err in db_errors:
                page = err.get('page', 'unknown')
                if page not in result:
                    result[page] = []
                result[page].append({
                    'error': err.get('error', 'Unknown error'),
                    'traceback': err.get('traceback', []),
                    'timestamp': err.get('timestamp', ''),
                    'type': err.get('type', 'unknown')
                })
            # Deduplicate each page's errors by message; for repeated messages
            # the last record in the loaded sequence is kept
            return {page: list({e['error']: e for e in errors}.values()) for page, errors in result.items()}
        except Exception as e:
            logger.error(f"Failed to load errors from DB: {e}")
            return result

    @classmethod
    def save_errors_to_db(cls, errors):
        """
        Save a sequence of error records into the SQLite database configured at cls._db_path.

        Parameters
        ----------

        errors : Iterable[Mapping] | list[dict]

            Sequence of error records to persist. Each record is expected to be a mapping with the
            following keys (values are stored as provided, except for traceback which is normalized):

            - "page": identifier or name of the page where the error occurred (str)
            - "error": human-readable error message (str)
            - "traceback": traceback information; may be a str, list, or None. If a list, it will be
              JSON-encoded before storage. If None, an empty string is stored.
            - "timestamp": timestamp for the error (stored as provided)
            - "status": status associated with the error (str)
            - "type": classification/type of the error (str)

        Behavior
        --------

        - If `errors` is falsy (None or empty), the method returns immediately without touching the DB.
        - Opens a SQLite connection to the path stored in `cls._db_path`.
        - Iterates over the provided records and inserts each into the `errors` table with columns
          (page, error, traceback, timestamp, status, type).
        - Ensures that the `traceback` value is always written as a string (list -> JSON string,
          other values -> str(), None -> "").
        - Commits the transaction if all inserts succeed and always closes the connection in a finally block.

        Exceptions
        ----------

        - Underlying sqlite3 exceptions (e.g., sqlite3.Error) are not swallowed and will propagate to the caller
          if connection/execution fails.

        Returns
        -------

        None
        """
        if not errors:
            return
        conn = sqlite3.connect(cls._db_path)
        try:
            cursor = conn.cursor()
            for err in errors:
                # Ensure traceback is always a string for SQLite
                # (json is already imported at module level)
                tb = err.get("traceback")
                if isinstance(tb, list):
                    tb_str = json.dumps(tb)
                else:
                    tb_str = str(tb) if tb is not None else ""
                cursor.execute(
                    """
                    INSERT INTO errors (page, error, traceback, timestamp, status, type)
                    VALUES (?, ?, ?, ?, ?, ?)
                    """,
                    (
                        err.get("page"),
                        err.get("error"),
                        tb_str,
                        err.get("timestamp"),
                        err.get("status"),
                        err.get("type"),
                    ),
                )
            conn.commit()
        finally:
            conn.close()

    @classmethod
    def clear_errors(cls, page_name: Optional[str] = None):
        """Clear stored health-check errors for a specific page or for all pages.
        This classmethod updates both the in-memory error cache and the persistent
        SQLite-backed store.

        If `page_name` is provided:

        - Remove the entry for that page from the class-level in-memory dictionary
          of errors (if present).
        - Delete all rows in the SQLite `errors` table where `page` equals `page_name`.

        If `page_name` is None:

        - Clear the entire in-memory errors dictionary.
        - Delete all rows from the SQLite `errors` table.

        Args:
            page_name (Optional[str]): Name of the page whose errors should be cleared.
                If None, all errors are cleared.

        Returns:
            None

        Side effects:

        - Mutates class-level state (clears entries in `cls._errors`).
        - Opens a SQLite connection to `cls._db_path` and executes DELETE statements
          against the `errors` table. Commits the transaction and closes the connection.

        Error handling:

        - Database-related exceptions are caught and logged via the module logger;
          they are not re-raised by this method. As a result, callers should not
          rely on exceptions to detect DB failures.

        Notes:

        - The method assumes `cls._db_path` points to a valid SQLite database file
          and that an `errors` table exists with a `page` column.
        - This method does not provide synchronization; callers should take care of
          concurrent access to class state and the database if used from multiple
          threads or processes.
        """

        if page_name:
            if page_name in cls._errors:
                del cls._errors[page_name]
            # Remove from DB
            try:
                conn = sqlite3.connect(cls._db_path)
                cursor = conn.cursor()
                cursor.execute("DELETE FROM errors WHERE page = ?", (page_name,))
                conn.commit()
                conn.close()
            except Exception as e:
                logger.error(f"Failed to clear errors from DB for page {page_name}: {e}")
        else:
            cls._errors = {}
            # Remove all from DB
            try:
                conn = sqlite3.connect(cls._db_path)
                cursor = conn.cursor()
                cursor.execute("DELETE FROM errors")
                conn.commit()
                conn.close()
            except Exception as e:
                logger.error(f"Failed to clear all errors from DB: {e}")

    @classmethod
    def _init_db(cls):
        """
        Initialize the SQLite database file and ensure the required schema exists.
        This class-level initializer performs the following steps:

        - Ensures the parent directory of cls._db_path exists; creates it if necessary.
        - If cls._db_path has no parent directory (e.g., a bare filename), no directory is created.
        - Connects to the SQLite database at cls._db_path (creating the file if it does not exist).
        - Creates an "errors" table if it does not already exist with the following columns:
            - id (INTEGER PRIMARY KEY AUTOINCREMENT)
            - page (TEXT)
            - error (TEXT)
            - traceback (TEXT)
            - timestamp (TEXT)
            - status (TEXT)
            - type (TEXT)
        - Commits the schema change and closes the database connection.
        - Logs informational and error messages using the module logger.

        Parameters
        ----------

        cls : type

            The class on which this method is invoked. Must provide a valid string attribute
            `_db_path` indicating the target SQLite database file path.

        Raises
        ------

        Exception

            Re-raises exceptions encountered when creating the parent directory (os.makedirs).

        sqlite3.Error

            May be raised by sqlite3.connect or subsequent SQLite operations when the database
            cannot be opened or initialized.

        Side effects
        ------------

        - May create directories on the filesystem.
        - May create or modify the SQLite database file at cls._db_path.
        - Writes log messages via the module logger.

        Returns
        -------

        None
        """

        # Ensure the parent directory for the DB exists
        db_dir = os.path.dirname(cls._db_path)
        if db_dir and not os.path.exists(db_dir):
            try:
                # exist_ok=True tolerates the directory being created by another
                # process between the existence check and this call
                os.makedirs(db_dir, exist_ok=True)
                logger.info(f"Created directory for DB: {db_dir}")
            except Exception as e:
                logger.error(f"Failed to create DB directory {db_dir}: {e}")
                raise
        # Now create/connect to the DB and table
        logger.info(f"Initializing SQLite DB at: {cls._db_path}")
        conn = sqlite3.connect(cls._db_path)
        try:
            c = conn.cursor()
            c.execute('''CREATE TABLE IF NOT EXISTS errors (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                page TEXT,
                error TEXT,
                traceback TEXT,
                timestamp TEXT,
                status TEXT,
                type TEXT
            )''')
            conn.commit()
        finally:
            # Close the connection even if table creation fails
            conn.close()

    @classmethod
    def load_errors_from_db(cls, page=None, status=None, limit=None):
        """
        Load errors from the class SQLite database.
        This classmethod connects to the SQLite database at cls._db_path, queries the
        'errors' table, and returns matching error records as a list of dictionaries.

        Parameters:

            page (Optional[str]): If provided, filter results to rows where the 'page'
                column equals this value.
            status (Optional[str]): If provided, filter results to rows where the 'status'
                column equals this value.
            limit (Optional[int|str]): If provided, limits the number of returned rows.
                The value is cast to int internally; a non-convertible value will raise
                ValueError.

        Returns:

            List[dict]: A list of dictionaries representing rows from the 'errors' table.
                Each dict contains the following keys:
                - id: primary key (int)
                - page: page identifier (str)
                - error: short error message (str)
                - traceback: full traceback or diagnostic text (str)
                - timestamp: stored timestamp value as retrieved from the DB (type depends on schema)
                - status: error status (str)
                - type: error type/category (str)

        Raises:

            ValueError: If `limit` cannot be converted to int.
            sqlite3.Error: If an SQLite error occurs while executing the query.

        Notes:

        - Uses parameterized queries for the 'page' and 'status' filters to avoid SQL
          injection. The `limit` is applied after casting to int.
        - Results are ordered by `timestamp` in descending order.
        - The database connection is always closed in a finally block to ensure cleanup.
        """

        conn = sqlite3.connect(cls._db_path)
        try:
            cursor = conn.cursor()
            query = "SELECT id, page, error, traceback, timestamp, status, type FROM errors"
            params = []
            filters = []
            if page:
                filters.append("page = ?")
                params.append(page)
            if status:
                filters.append("status = ?")
                params.append(status)
            if filters:
                query += " WHERE " + " AND ".join(filters)
            query += " ORDER BY timestamp DESC"
            if limit:
                query += f" LIMIT {int(limit)}"
            cursor.execute(query, params)
            rows = cursor.fetchall()
            errors = []
            for row in rows:
                errors.append({
                    "id": row[0],
                    "page": row[1],
                    "error": row[2],
                    "traceback": row[3],
                    "timestamp": row[4],
                    "status": row[5],
                    "type": row[6],
                })
            return errors
        finally:
            conn.close()

class HealthCheckService:
    """
    A background-capable health monitoring service for a Streamlit-based application.
    This class periodically executes a configurable set of checks (system metrics,
    external dependencies, Streamlit server and pages, and user-registered custom checks),
    aggregates their results, and exposes a sanitized health snapshot suitable for UI
    display or remote monitoring.

    Primary responsibilities

    - Load and persist a JSON configuration that defines check intervals, thresholds,
      dependencies to probe, and Streamlit connection settings.
    - Run periodic checks in a dedicated background thread (start/stop semantics).
    - Collect system metrics (CPU, memory, disk) using psutil and apply configurable
      warning/critical thresholds.
    - Probe configured HTTP API endpoints and (placeholder) database checks.
    - Verify Streamlit server liveness by calling a /healthz endpoint and inspect
      Streamlit page errors via StreamlitPageMonitor.
    - Allow callers to register synchronous custom checks (functions returning dicts).
    - Compute an aggregated overall status (critical > warning > unknown > healthy).
    - Provide a sanitized snapshot of health data with function references removed for safe
      serialization/display.
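
The precedence rule for the overall status (critical > warning > unknown > healthy) can be sketched as a ranking lookup. This is an illustration, not the service's own aggregation code: `SEVERITY` and `aggregate_status` are hypothetical names, and treating unrecognized values and an empty input as "unknown" is an assumption.

```python
# Higher rank wins; the order is taken from the precedence stated above.
SEVERITY = {"healthy": 0, "unknown": 1, "warning": 2, "critical": 3}


def aggregate_status(statuses):
    # Assumption: any value outside the recognized set counts as "unknown".
    seen = [s if s in SEVERITY else "unknown" for s in statuses]
    # Assumption: with no check results at all, the status is "unknown".
    if not seen:
        return "unknown"
    return max(seen, key=SEVERITY.__getitem__)


overall = aggregate_status(["healthy", "warning", "unknown"])
```

With this ranking a single "critical" result dominates everything else, and "unknown" outranks "healthy" so that a missing result is never reported as healthy.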

    Usage (high level)

    - Instantiate: svc = HealthCheckService(config_path="path/to/config.json")
    - Optionally register custom checks: svc.register_custom_check("my_check", my_check_func)
      where my_check_func() -> Dict[str, Any]
    - Start background monitoring: svc.start()
    - Stop monitoring: svc.stop()
    - Retrieve current health snapshot for display or API responses: svc.get_health_data()
    - Persist any changes to configuration: svc.save_config()

    Configuration (JSON)

    - check_interval: int (seconds), how often to run the checks (default 60)
    - streamlit_url: str, base host (default "http://localhost")
    - streamlit_port: int, port for Streamlit server (default 8501)
    - system_checks: { "cpu": bool, "memory": bool, "disk": bool }
    - dependencies:
        - api_endpoints: list of { "name": str, "url": str, "timeout": int }
        - databases: list of { "name": str, "type": str, "connection_string": str }
    - thresholds:
        - cpu_warning, cpu_critical, memory_warning, memory_critical, disk_warning, disk_critical

    Health data structure (conceptual)

    - last_updated: ISO timestamp
    - system: { "cpu": {...}, "memory": {...}, "disk": {...} }
    - dependencies: { "<name>": {...}, ... }
    - custom_checks: { "<name>": {...} } (get_health_data() strips callable references)
    - streamlit_server: {status, response_code/latency/error, message, url}
    - streamlit_pages: {status, error_count, errors, details}
    - overall_status: "healthy" | "warning" | "critical" | "unknown"

    Threading and safety

    - The service runs checks in a daemon thread started by start(). stop() signals the
      thread to terminate and joins with a short timeout. Clients should avoid modifying
      internal structures concurrently; get_health_data() returns a sanitized snapshot
      appropriate for concurrent reads.
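
The start/stop semantics can be sketched with a small stand-alone runner. Note that this sketch swaps in a `threading.Event` for the stop signal, which is a different technique from the boolean flag plus `time.sleep` used by this service; the event lets a sleeping thread wake as soon as `stop()` is called. `PeriodicRunner`, `task`, and `first_run` are hypothetical names.

```python
import threading


class PeriodicRunner:
    def __init__(self, interval, task):
        self.interval = interval
        self.task = task
        self._stop = threading.Event()
        self._thread = None

    def start(self):
        # No-op while a worker thread is already alive.
        if self._thread and self._thread.is_alive():
            return
        self._stop.clear()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread:
            self._thread.join(timeout=1)

    def _loop(self):
        while not self._stop.is_set():
            self.task()
            # wait() returns early as soon as stop() sets the event.
            self._stop.wait(self.interval)


first_run = threading.Event()
calls = []


def task():
    calls.append(1)
    first_run.set()


runner = PeriodicRunner(interval=60, task=task)
runner.start()
first_run.wait(timeout=5)  # make sure one check has completed before stopping
runner.stop()
```

Even with a 60 second interval, `stop()` returns promptly here because the worker is blocked in `Event.wait` rather than in an uninterruptible sleep.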

    Custom checks

    - register_custom_check(name, func): registers a synchronous function that returns a
      dict describing the check result (must include a "status" key with one of the
      recognized values). The service stores the function reference internally but returns
      sanitized results via get_health_data().

    Error handling and logging

    - Individual checks catch exceptions and surface errors in the corresponding
      health_data entry with status "critical" where appropriate.
    - The Streamlit UI integration (st.* calls) is used for user-visible error messages
      when loading/saving configuration; the service also logs events to its configured
      logger.

    Extensibility notes

    - Database checks are left as placeholders; implement _check_database for specific DB
      drivers/connections.
    - Custom checks are synchronous; if long-running checks are required, adapt the
      registration/run pattern to use async or worker pools.
    """
    def __init__(self, config_path: str = "health_check_config.json"):
        """
        Initializes the HealthCheckService instance.

        Args:
            config_path (str): Path to the health check configuration file. Defaults to "health_check_config.json".

        Attributes:

        - logger (logging.Logger): Logger for the HealthCheckService.
        - config_path (str): Path to the configuration file.
        - health_data (Dict[str, Any]): Dictionary storing health check data.
        - config (dict): Loaded configuration from the config file.
        - check_interval (int): Interval in seconds between health checks. Defaults to 60.
        - _running (bool): Indicates if the health check service is running.
        - _thread (threading.Thread or None): Thread running the health check loop.
        - streamlit_url (str): URL of the Streamlit service. Defaults to "http://localhost".
        - streamlit_port (int): Port of the Streamlit service. Defaults to 8501.
        """
        self.logger = logging.getLogger(f"{__name__}.HealthCheckService")
        self.logger.info("Initializing HealthCheckService")
        self.config_path = config_path
        self.health_data: Dict[str, Any] = {
            "last_updated": None,
            "system": {},
            "dependencies": {},
            "custom_checks": {},
            "overall_status": "unknown"
        }
        self.config = self._load_config()
        self.check_interval = self.config.get("check_interval", 60)  # Default: 60 seconds
        self._running = False
        self._thread = None
        self.streamlit_url = self.config.get("streamlit_url", "http://localhost")
        self.streamlit_port = self.config.get("streamlit_port", 8501)  # Default: 8501

    def _load_config(self) -> Dict:
        """Load health check configuration from file."""
        if os.path.exists(self.config_path):
            try:
                with open(self.config_path, "r") as f:
                    return json.load(f)
            except Exception as e:
                st.error(f"Error loading health check config: {str(e)}")
                return self._get_default_config()
        else:
            return self._get_default_config()

    def _get_default_config(self) -> Dict:
        """Return default health check configuration."""
        return {
            "check_interval": 60,
            "streamlit_url": "http://localhost",
            "streamlit_port": 8501,
            "system_checks": {
                "cpu": True,
                "memory": True,
                "disk": True
            },
            "dependencies": {
                "api_endpoints": [
                    # Example API endpoint to check
                    {"name": "example_api", "url": "https://httpbin.org/get", "timeout": 5}
                ],
                "databases": [
                    # Example database connection to check
                    {"name": "main_db", "type": "postgres", "connection_string": "..."}
                ]
            },
            "thresholds": {
                "cpu_warning": 70,
                "cpu_critical": 90,
                "memory_warning": 70,
                "memory_critical": 90,
                "disk_warning": 70,
                "disk_critical": 90
            }
        }

    def start(self):
        """
        Start the periodic health-check background thread.
        If the `healthcheck` runner is already active, this method is a no-op and returns
        immediately. Otherwise, it marks the runner as running, creates a daemon thread
        targeting self._run_checks_periodically, stores the thread on self._thread, and
        starts it.

        Behavior and side effects:

        - Idempotent while running: repeated calls will not create additional threads.
        - Sets self._running to True.
        - Assigns a daemon threading.Thread to self._thread and starts it.
        - Non-blocking: returns after starting the background thread.
        - The daemon thread will not prevent the process from exiting.

        Thread-safety:

        - If start() may be called concurrently from multiple threads, callers should
          ensure proper synchronization (e.g., external locking) to avoid race conditions.

        Returns:

            None
        """

        if self._running:
            return

        self._running = True
        self._thread = threading.Thread(target=self._run_checks_periodically, daemon=True)
        self._thread.start()

    def stop(self):
        """
        Stop the health check service.
        Clears the running flag and waits up to one second for the background thread.
        The thread re-checks the flag only after its current sleep of `check_interval`
        seconds, so it may still be alive when this method returns.
        """
        self._running = False
        if self._thread:
            self._thread.join(timeout=1)

    def _run_checks_periodically(self):
        """Run health checks periodically based on check interval."""
        while self._running:
            self.run_all_checks()
            time.sleep(self.check_interval)

    def run_all_checks(self):
        """Run all configured health checks and update health data."""
        # Update timestamp
        self.health_data["last_updated"] = datetime.now().isoformat()

        # Check Streamlit server
        self.health_data["streamlit_server"] = self.check_streamlit_server()

        # System checks
        if self.config["system_checks"].get("cpu", True):
            self.check_cpu()
        if self.config["system_checks"].get("memory", True):
            self.check_memory()
        if self.config["system_checks"].get("disk", True):
            self.check_disk()

        # Dependency, custom, and page checks, then the aggregate status
        self.check_dependencies()
        self.run_custom_checks()
        self.check_streamlit_pages()
        self._update_overall_status()

    def check_cpu(self):
        """
        Checks the current CPU usage and updates the health status based on configured thresholds.
        Measures the CPU usage percentage over a 1-second interval using psutil. Compares the result
        against warning and critical thresholds defined in the configuration. Sets the status to
        'healthy', 'warning', or 'critical' accordingly, and updates the health data dictionary.

        Returns:

            None
        """

        cpu_percent = psutil.cpu_percent(interval=1)
        warning_threshold = self.config["thresholds"].get("cpu_warning", 70)
        critical_threshold = self.config["thresholds"].get("cpu_critical", 90)

        status = "healthy"
        if cpu_percent >= critical_threshold:
            status = "critical"
        elif cpu_percent >= warning_threshold:
            status = "warning"

        self.health_data["system"]["cpu"] = {
            "usage_percent": cpu_percent,
            "status": status
        }

    def check_memory(self):
        """
        Checks the system's memory usage and updates the health status accordingly.
        Retrieves the current memory usage statistics using psutil, compares the usage percentage
        against configured warning and critical thresholds, and sets the memory status to 'healthy',
        'warning', or 'critical'. Updates the health_data dictionary with total memory, available memory,
        usage percentage, and status.

        Returns:

            None
        """

        memory = psutil.virtual_memory()
        memory_percent = memory.percent
        warning_threshold = self.config["thresholds"].get("memory_warning", 70)
        critical_threshold = self.config["thresholds"].get("memory_critical", 90)

        status = "healthy"
        if memory_percent >= critical_threshold:
            status = "critical"
        elif memory_percent >= warning_threshold:
            status = "warning"

        self.health_data["system"]["memory"] = {
            "total_gb": round(memory.total / (1024**3), 2),
            "available_gb": round(memory.available / (1024**3), 2),
            "usage_percent": memory_percent,
            "status": status
        }

    def check_disk(self):
        """
        Checks the disk usage of the root filesystem and updates the health status.
        Retrieves disk usage statistics using psutil, compares the usage percentage
        against configured warning and critical thresholds, and sets the disk status
        accordingly (`healthy`, `warning`, or `critical`). Updates the health_data
        dictionary with total disk size, free space, usage percentage, and status.

        Returns:

            None
        """

        disk = psutil.disk_usage('/')
        disk_percent = disk.percent
        warning_threshold = self.config["thresholds"].get("disk_warning", 70)
        critical_threshold = self.config["thresholds"].get("disk_critical", 90)

        status = "healthy"
        if disk_percent >= critical_threshold:
            status = "critical"
        elif disk_percent >= warning_threshold:
            status = "warning"

        self.health_data["system"]["disk"] = {
            "total_gb": round(disk.total / (1024**3), 2),
            "free_gb": round(disk.free / (1024**3), 2),
            "usage_percent": disk_percent,
            "status": status
        }

    def check_dependencies(self):
        """
        Checks the health of configured dependencies, including API endpoints and databases.
        Iterates through the list of API endpoints and databases specified in the configuration,
        and performs health checks on each by invoking the corresponding internal methods.

        Request failures are not raised: _check_api_endpoint catches them and records
        the error in self.health_data["dependencies"] with status "critical".
        """

        # Check API endpoints
        for endpoint in self.config["dependencies"].get("api_endpoints", []):
            self._check_api_endpoint(endpoint)

        # Check database connections
        for db in self.config["dependencies"].get("databases", []):
            self._check_database(db)

    def _check_api_endpoint(self, endpoint: Dict):
        """
        Check if an API endpoint is accessible.

        Args:

            endpoint: Dictionary with endpoint configuration
        """
        name = endpoint.get("name", "unknown_api")
        url = endpoint.get("url", "")
        timeout = endpoint.get("timeout", 5)

        if not url:
            return

        try:
            start_time = time.time()
            response = requests.get(url, timeout=timeout)
            response_time = time.time() - start_time

            status = "healthy" if response.status_code < 400 else "critical"

            self.health_data["dependencies"][name] = {
                "type": "api",
                "url": url,
                "status": status,
                "response_time_ms": round(response_time * 1000, 2),
                "status_code": response.status_code
            }
        except Exception as e:
            self.health_data["dependencies"][name] = {
                "type": "api",
                "url": url,
                "status": "critical",
                "error": str(e)
            }

    def _check_database(self, db_config: Dict):
        """
        Check database connection.
        Note: This is a placeholder. You'll need to implement specific database checks
        based on your application's needs.

        Args:

            db_config: Dictionary with database configuration
        """
        name = db_config.get("name", "unknown_db")
        db_type = db_config.get("type", "")

        # Placeholder for database connection check
        # In a real implementation, you would check the specific database connection
        self.health_data["dependencies"][name] = {
            "type": "database",
            "db_type": db_type,
            "status": "unknown",
            "message": "Database check not implemented"
        }

    def register_custom_check(self, name: str, check_func: Callable[[], Dict[str, Any]]):
        """
        Register a custom health check function.

        Args:

            name: Name of the custom check
            check_func: Function that performs the check and returns a dictionary with results
        """
        if "custom_checks" not in self.health_data:
            self.health_data["custom_checks"] = {}

        self.health_data["custom_checks"][name] = {
            "status": "unknown",
            "check_func": check_func
        }

    def run_custom_checks(self):
        """Run all registered custom health checks."""
        if "custom_checks" not in self.health_data:
            return

        for name, check_info in list(self.health_data["custom_checks"].items()):
            if "check_func" in check_info and callable(check_info["check_func"]):
                try:
                    result = check_info["check_func"]()
                    # Store the new result, then re-attach the function reference so the
                    # check can run again next cycle; get_health_data() strips it on read.
                    func = check_info["check_func"]
                    self.health_data["custom_checks"][name] = result
                    self.health_data["custom_checks"][name]["check_func"] = func
                except Exception as e:
                    self.health_data["custom_checks"][name] = {
                        "status": "critical",
                        "error": str(e),
                        "check_func": check_info["check_func"]
                    }

    def _update_overall_status(self):
        """
        Updates the overall health status of the application based on the statuses of various components.

        The method checks the health status of the following components:
        - Streamlit server
        - System checks
        - Dependencies
        - Custom checks
        - Streamlit pages

        The overall status is determined using the following priority order:
        1. "critical" if any component is critical
        2. "warning" if any component is warning and none are critical
        3. "healthy" if any component is healthy and none are critical or warning
           (an unknown component does not override a healthy one)
        4. "unknown" if only unknown statuses, or no statuses at all, are found

        The result is stored in `self.health_data["overall_status"]`.
        """

        has_critical = False
        has_warning = False
        has_healthy = False
        has_unknown = False

        # Helper function to check status
        def check_component_status(status):
            nonlocal has_critical, has_warning, has_healthy, has_unknown
            if status == "critical":
                has_critical = True
            elif status == "warning":
                has_warning = True
            elif status == "healthy":
                has_healthy = True
            elif status == "unknown":
                has_unknown = True

        # Check Streamlit server status
        server_status = self.health_data.get("streamlit_server", {}).get("status")
        check_component_status(server_status)

        # Check system status
        for system_check in self.health_data.get("system", {}).values():
            check_component_status(system_check.get("status"))

        # Check dependencies status
        for dep_check in self.health_data.get("dependencies", {}).values():
            check_component_status(dep_check.get("status"))

        # Check custom checks status. Stored entries always carry their "check_func"
        # reference, so filtering on that key would skip every custom check.
        for custom_check in self.health_data.get("custom_checks", {}).values():
            if isinstance(custom_check, dict):
                check_component_status(custom_check.get("status"))

        # Check Streamlit pages status
        pages_status = self.health_data.get("streamlit_pages", {}).get("status")
        check_component_status(pages_status)

        # Determine overall status with priority:
        # critical > warning > healthy > unknown
        if has_critical:
            self.health_data["overall_status"] = "critical"
        elif has_warning:
            self.health_data["overall_status"] = "warning"
        elif has_unknown and not has_healthy:
            self.health_data["overall_status"] = "unknown"
        elif has_healthy:
            self.health_data["overall_status"] = "healthy"
        else:
            self.health_data["overall_status"] = "unknown"

    def get_health_data(self) -> Dict:
        """Get the latest health check data."""
        # Create a copy without the function references
        result: Dict[str, Any] = {}
        for key, value in self.health_data.items():
            if key == "custom_checks":
                result[key] = {}
                for check_name, check_data in value.items():
                    if isinstance(check_data, dict):
                        check_copy = check_data.copy()
                        if "check_func" in check_copy:
                            del check_copy["check_func"]
                        result[key][check_name] = check_copy
            else:
                result[key] = value
        return result

    def save_config(self):
        """
        Saves the current health check configuration to a JSON file.
        Attempts to write the configuration stored in `self.config` to the file specified by `self.config_path`.
        Displays a success message in the Streamlit app upon successful save.
        Exceptions are caught and reported with st.error; they are not propagated to the caller.

        Handles:

            FileNotFoundError: If the directory in the configuration file path does not exist.
            PermissionError: If there are insufficient permissions to write to the file.
            TypeError, ValueError: If the configuration contains values that cannot be serialized to JSON.
            Exception: For any other exceptions encountered during the save process.
        """

        try:
            with open(self.config_path, "w") as f:
                json.dump(self.config, f, indent=2)
            st.success(f"Health check config saved successfully to {self.config_path}")
        except FileNotFoundError:
            st.error(f"Configuration file not found: {self.config_path}")
        except PermissionError:
            st.error(f"Permission denied: Unable to write to {self.config_path}")
        except (TypeError, ValueError) as e:
            # json.dump raises these for unserializable or circular values
            st.error(f"Config cannot be serialized to JSON: {str(e)}")
        except Exception as e:
            st.error(f"Error saving health check config: {str(e)}")

    def check_streamlit_pages(self):
        """
        Checks for errors in Streamlit pages and updates the health data accordingly.
        This method retrieves page errors using StreamlitPageMonitor.get_page_errors().
        If errors are found, it sets the 'streamlit_pages' status to 'critical' and updates
        the overall health status to 'critical'. If no errors are found, it marks the
        'streamlit_pages' status as 'healthy'.

        Updates:

            self.health_data["streamlit_pages"]: Dict containing status, error count, errors, and details.
            self.health_data["overall_status"]: Set to 'critical' if errors are detected.
            self.health_data["streamlit_pages"]["details"]: A summary of the errors found.

        Returns:

            None
        """

        page_errors = StreamlitPageMonitor.get_page_errors()

        if "streamlit_pages" not in self.health_data:
            self.health_data["streamlit_pages"] = {}

        if page_errors:
            total_errors = sum(len(errors) for errors in page_errors.values())
            self.health_data["streamlit_pages"] = {
                "status": "critical",
                "error_count": total_errors,
                "errors": page_errors,
                "details": "Errors detected in Streamlit pages"
            }
            # This affects overall status
            self.health_data["overall_status"] = "critical"
        else:
            self.health_data["streamlit_pages"] = {
                "status": "healthy",
                "error_count": 0,
                "errors": {},
                "details": "All pages functioning normally"
            }

    def check_streamlit_server(self) -> Dict[str, Any]:
        """
        Checks the health status of the Streamlit server by sending a GET request to the /healthz endpoint.

        Returns:

            Dict[str, Any]: A dictionary containing the health status, response code, latency in milliseconds,
            message, and the URL checked. If the server is healthy (HTTP 200), status is "healthy".
            Otherwise, status is "critical" with error details.

        Handles:

            - Connection errors: Returns critical status with connection error details.
            - Timeout errors: Returns critical status with timeout error details.
            - Other exceptions: Returns critical status with unknown error details.

        Logs:

            - The URL being checked.
            - The response status code and text.
            - Health status and response time if healthy.
            - Warnings and errors for unhealthy or failed checks.
        """

        # Defined before the try block so every except branch can report it,
        # even if building the URL itself fails.
        url = None
        try:
            host = self.streamlit_url.rstrip('/')
            if not host.startswith(('http://', 'https://')):
                host = f"http://{host}"

            url = f"{host}:{self.streamlit_port}/healthz"
            self.logger.info(f"Checking Streamlit server health at: {url}")

            start_time = time.time()
            response = requests.get(url, timeout=3)
            total_time = (time.time() - start_time) * 1000
            self.logger.info(f"{response.status_code} - {response.text}")
            # Check if the response is healthy
            if response.status_code == 200:
                self.logger.info(f"Streamlit server healthy - Response time: {round(total_time, 2)}ms")
                return {
                    "status": "healthy",
                    "response_code": response.status_code,
                    "latency_ms": round(total_time, 2),
                    "message": "Streamlit server is running",
                    "url": url
                }
            else:
                self.logger.warning(f"Unhealthy response from server: {response.status_code}")
                return {
                    "status": "critical",
                    "response_code": response.status_code,
                    "error": f"Unhealthy response from server: {response.status_code}",
                    "message": "Streamlit server is not healthy",
                    "url": url
                }

        except requests.exceptions.ConnectionError as e:
            self.logger.error(f"Connection error while checking Streamlit server: {str(e)}")
            return {
                "status": "critical",
                "error": f"Connection error: {str(e)}",
                "message": "Cannot connect to Streamlit server",
                "url": url
            }
        except requests.exceptions.Timeout as e:
            self.logger.error(f"Timeout while checking Streamlit server: {str(e)}")
            return {
                "status": "critical",
                "error": f"Timeout error: {str(e)}",
                "message": "Streamlit server is not responding",
                "url": url
            }
        except Exception as e:
            self.logger.error(f"Unexpected error while checking Streamlit server: {str(e)}")
            return {
                "status": "critical",
                "error": f"Unknown error: {str(e)}",
                "message": "Failed to check Streamlit server",
                "url": url
            }

def health_check(config_path:str = "health_check_config.json"):
    """
    Displays an interactive Streamlit dashboard for monitoring application health.
    This function initializes and manages a health check service, presenting real-time system metrics,
    dependency statuses, custom checks, and Streamlit page health in a user-friendly dashboard.
    Users can manually refresh health checks, view detailed error information, and adjust configuration
    thresholds and intervals directly from the UI.

    Args:

        config_path (str, optional): Path to the health check configuration JSON file.
            Defaults to "health_check_config.json".

    Features:

        - Displays overall health status with color-coded indicators.
        - Shows last updated timestamp for health data.
        - Monitors Streamlit server status, latency, and errors.
        - Provides tabs for:
            * System Resources (CPU, Memory, Disk usage and status)
            * Dependencies (external services and their health)
            * Custom Checks (user-defined health checks)
            * Streamlit Pages (page-specific errors and status)
        - Allows configuration of system thresholds, check intervals, and Streamlit server settings.
        - Supports manual refresh and saving configuration changes.

    Raises:

        Displays error messages in the UI for any exceptions encountered during health data retrieval or processing.

    Returns:

        None. The dashboard is rendered in the Streamlit app.
    """

    logger = logging.getLogger(f"{__name__}.health_check")
    logger.info("Starting health check dashboard")
    st.title("Application Health Dashboard")

    # Initialize or get the health check service
    if "health_service" not in st.session_state:
        logger.info("Initializing new health check service")
        st.session_state.health_service = HealthCheckService(config_path = config_path)
        st.session_state.health_service.start()

    health_service = st.session_state.health_service
    health_service.run_all_checks()

    # Add controls for manual refresh and configuration
    col1, col2 = st.columns([3, 1])
    with col1:
        st.subheader("System Health Status")
    with col2:
        if st.button("Refresh Now"):
            health_service.run_all_checks()

    # Get the latest health data
    health_data = health_service.get_health_data()

    # Display overall status with appropriate color
    overall_status = health_data.get("overall_status", "unknown")
    status_color = {
        "healthy": "green",
        "warning": "orange",
        "critical": "red",
        "unknown": "gray"
    }.get(overall_status, "gray")

    st.markdown(
        f"<h3 style='color: {status_color};'>Overall Status: {overall_status.upper()}</h3>",
        unsafe_allow_html=True
    )

    # Display last updated time
    if health_data.get("last_updated"):
        try:
            last_updated = datetime.fromisoformat(health_data["last_updated"])
            st.text(f"Last updated: {last_updated.strftime('%Y-%m-%d %H:%M:%S')}")
        except Exception as e:
            st.error(f"Last updated: {health_data['last_updated']}")
            st.exception(e)

    server_health = health_data.get("streamlit_server", {})
    server_status = server_health.get("status", "unknown")
    server_color = {
        "healthy": "green",
        "critical": "red",
        "unknown": "gray"
    }.get(server_status, "gray")

    st.markdown(
        f"### Streamlit Server Status: <span style='color: {server_color}'>{server_status.upper()}</span>",
        unsafe_allow_html=True
    )

    if server_status != "healthy":
        st.error(server_health.get("message", "Server status unknown"))
        if "error" in server_health:
            st.code(server_health["error"])
    else:
        st.success(server_health.get("message", "Server is running"))
        if "latency_ms" in server_health:
            latency = server_health["latency_ms"]
            # Define color based on latency thresholds
            if latency <= 50:
                latency_color = "green"
                performance = "Excellent"
            elif latency <= 100:
                latency_color = "blue"
                performance = "Good"
            elif latency <= 200:
                latency_color = "orange"
                performance = "Fair"
            else:
                latency_color = "red"
                performance = "Poor"

            st.markdown(
                f"""
                <div style='display: flex; align-items: center; gap: 10px;'>
                    <div>Server Response Time:</div>
                    <div style='color: {latency_color}; font-weight: bold;'>
                        {latency} ms
                    </div>
                    <div style='color: {latency_color};'>
                        ({performance})
                    </div>
                </div>
                """,
                unsafe_allow_html=True
            )

    # Create tabs for different categories of health checks
    tab1, tab2, tab3, tab4 = st.tabs(["System Resources", "Dependencies", "Custom Checks", "Streamlit Pages"])

    with tab1:
        # Display system health checks
        system_data = health_data.get("system", {})

        # CPU
        if "cpu" in system_data:
            cpu_data = system_data["cpu"]
            cpu_status = cpu_data.get("status", "unknown")
            cpu_color = {"healthy": "green", "warning": "orange", "critical": "red"}.get(cpu_status, "gray")

            st.markdown(f"### CPU Status: <span style='color:{cpu_color}'>{cpu_status.upper()}</span>", unsafe_allow_html=True)
            st.progress(cpu_data.get("usage_percent", 0) / 100)
            st.text(f"CPU Usage: {cpu_data.get('usage_percent', 0)}%")

        # Memory
        if "memory" in system_data:
            memory_data = system_data["memory"]
            memory_status = memory_data.get("status", "unknown")
            memory_color = {"healthy": "green", "warning": "orange", "critical": "red"}.get(memory_status, "gray")

            st.markdown(f"### Memory Status: <span style='color:{memory_color}'>{memory_status.upper()}</span>", unsafe_allow_html=True)
            st.progress(memory_data.get("usage_percent", 0) / 100)
            st.text(f"Memory Usage: {memory_data.get('usage_percent', 0)}%")
            st.text(f"Total Memory: {memory_data.get('total_gb', 0)} GB")
            st.text(f"Available Memory: {memory_data.get('available_gb', 0)} GB")

        # Disk
        if "disk" in system_data:
            disk_data = system_data["disk"]
            disk_status = disk_data.get("status", "unknown")
            disk_color = {"healthy": "green", "warning": "orange", "critical": "red"}.get(disk_status, "gray")

            st.markdown(f"### Disk Status: <span style='color:{disk_color}'>{disk_status.upper()}</span>", unsafe_allow_html=True)
            st.progress(disk_data.get("usage_percent", 0) / 100)
            st.text(f"Disk Usage: {disk_data.get('usage_percent', 0)}%")
            st.text(f"Total Disk Space: {disk_data.get('total_gb', 0)} GB")
            st.text(f"Free Disk Space: {disk_data.get('free_gb', 0)} GB")

    with tab2:
        # Display dependency health checks
        dependencies = health_data.get("dependencies", {})
        if dependencies:
            # Create a dataframe for all dependencies
            dep_data = []
            for name, dep_info in dependencies.items():
                dep_data.append({
                    "Name": name,
                    "Type": dep_info.get("type", "unknown"),
                    "Status": dep_info.get("status", "unknown"),
                    "Details": ", ".join([f"{k}: {v}" for k, v in dep_info.items()
                                          if k not in ["name", "type", "status", "error"] and not isinstance(v, dict)])
                })

            # Show dependencies table
            if dep_data:
                df_deps = pd.DataFrame(dep_data)
                st.dataframe(df_deps)
        else:
            st.info("No dependencies configured")

    # The original listing never enters tab3, leaving the "Custom Checks" tab empty;
    # the custom-check table is rendered there in this version.
    with tab3:
        # Create a dataframe for all custom checks from health_data
        custom_checks = health_data.get("custom_checks", {})
        if custom_checks:
            check_data = []
            for name, check_info in custom_checks.items():
                if isinstance(check_info, dict) and "check_func" not in check_info:
                    check_data.append({
                        "Name": name,
                        "Status": check_info.get("status", "unknown"),
                        "Details": ", ".join([f"{k}: {v}" for k, v in check_info.items()
                                              if k not in ["name", "status", "check_func", "error"] and not isinstance(v, dict)]),
                        "Error": check_info.get("error", "")
                    })

            if check_data:
                df_checks = pd.DataFrame(check_data)

                # Apply color formatting to status column
                def color_status(val):
                    colors = {
                        "healthy": "background-color: #c6efce; color: #006100",
                        "warning": "background-color: #ffeb9c; color: #9c5700",
                        "critical": "background-color: #ffc7ce; color: #9c0006",
                        "unknown": "background-color: #eeeeee; color: #7f7f7f"
                    }
                    return colors.get(str(val).lower(), "")

                # Use styled dataframe to color the Status column
                try:
                    # apply expects a function that returns a sequence of styles for the column;
                    # map color_status across the 'Status' column to produce the CSS strings.
                    st.dataframe(
                        df_checks.style.apply(
                            lambda col: col.map(color_status),
                            subset=["Status"]
                        )
                    )
                except Exception:
                    # Fallback if styling isn't supported in the environment
                    st.dataframe(df_checks)
            else:
                st.info("No custom checks configured")
        else:
            st.info("No custom checks configured")

    with tab4:
        # Always read page errors from SQLite DB for latest state
        page_errors = StreamlitPageMonitor.get_page_errors()
        error_count = sum(len(errors) for errors in page_errors.values())
        status = "critical" if error_count > 0 else "healthy"
        status_color = {
            "healthy": "green",
            "critical": "red",
            "unknown": "gray"
        }.get(status, "gray")
        st.markdown(f"### Page Status: <span style='color:{status_color}'>{status.upper()}</span>", unsafe_allow_html=True)
        st.metric("Error Count", error_count)
        if error_count > 0:
            st.markdown("<div style='background-color:#ffe6e6; color:#b30000; padding:10px; border-radius:5px; border:1px solid #b30000; font-weight:bold;'>Pages with errors:</div>",
                        unsafe_allow_html=True)
            for page_name, page_errors_list in page_errors.items():
                display_name = page_name.split("/")[-1] if "/" in page_name else page_name
                for error_info in page_errors_list:
                    if isinstance(error_info, dict):
                        with st.expander(f"Error in {display_name}"):
                            st.info(error_info.get('error', 'Unknown error'))
                            if error_info.get('type') == 'streamlit_error':
                                st.text("Type: Streamlit Error")
                            else:
                                st.text("Type: Exception")
                                st.text("Traceback:")
                                st.code("".join(error_info.get('traceback', ['No traceback available'])))
                            st.text(f"Timestamp: {error_info.get('timestamp', 'No timestamp')}")

    # Configuration section
    with st.expander("Health Check Configuration"):
        st.subheader("System Check Thresholds")

        col1, col2 = st.columns(2)
        with col1:
            cpu_warning = st.slider("CPU Warning Threshold (%)",
                                    min_value=10, max_value=90,
                                    value=health_service.config["thresholds"].get("cpu_warning", 70),
                                    step=5)
            memory_warning = st.slider("Memory Warning Threshold (%)",
                                       min_value=10, max_value=90,
                                       value=health_service.config["thresholds"].get("memory_warning", 70),
                                       step=5)
            disk_warning = st.slider("Disk Warning Threshold (%)",
                                     min_value=10, max_value=90,
                                     value=health_service.config["thresholds"].get("disk_warning", 70),
                                     step=5)
            streamlit_url_update = st.text_input(
                "Streamlit Server URL",
                value=health_service.config.get("streamlit_url", "http://localhost")
            )

        with col2:
            cpu_critical = st.slider("CPU Critical Threshold (%)",
                                     min_value=20, max_value=95,
                                     value=health_service.config["thresholds"].get("cpu_critical", 90),
                                     step=5)
            memory_critical = st.slider("Memory Critical Threshold (%)",
                                        min_value=20, max_value=95,
                                        value=health_service.config["thresholds"].get("memory_critical", 90),
                                        step=5)
            disk_critical = st.slider("Disk Critical Threshold (%)",
                                      min_value=20, max_value=95,
                                      value=health_service.config["thresholds"].get("disk_critical", 90),
                                      step=5)

            check_interval = st.slider("Check Interval (seconds)",
                                       min_value=10, max_value=300,
                                       value=health_service.config.get("check_interval", 60),
                                       step=10)
            streamlit_port_update = st.number_input(
                "Streamlit Server Port",
                value=health_service.config.get("streamlit_port", 8501),
                step=1
            )

        if st.button("Save Configuration"):
            # Update configuration
            health_service.config["thresholds"]["cpu_warning"] = cpu_warning
            health_service.config["thresholds"]["cpu_critical"] = cpu_critical
            health_service.config["thresholds"]["memory_warning"] = memory_warning
            health_service.config["thresholds"]["memory_critical"] = memory_critical
            health_service.config["thresholds"]["disk_warning"] = disk_warning
            health_service.config["thresholds"]["disk_critical"] = disk_critical
            health_service.config["check_interval"] = check_interval
            health_service.config["streamlit_url"] = streamlit_url_update
            health_service.config["streamlit_port"] = streamlit_port_update

            # Apply the new values to the running service; these attributes are
            # otherwise read from the config only once, in __init__
            health_service.check_interval = check_interval
            health_service.streamlit_url = streamlit_url_update
            health_service.streamlit_port = streamlit_port_update

            # Save to file; save_config() reports success or failure in the UI
            health_service.save_config()

            # Restart the background thread so the new interval takes effect
            health_service.stop()
            health_service.start()
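The stop-then-start restart at the end of the listing depends on a boolean flag and `time.sleep`, so a worker that is in the middle of its sleep can outlive `stop()`. A minimal sketch of one alternative follows, assuming a hypothetical `PeriodicRunner` helper that is not part of streamlit_healthcheck; it uses `threading.Event` so that stopping interrupts the wait immediately.

```python
import threading
from typing import Callable, Optional


class PeriodicRunner:
    """Hypothetical helper: calls `task` every `interval` seconds until stopped."""

    def __init__(self, task: Callable[[], None], interval: float):
        self._task = task
        self.interval = interval
        self._stop_event = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def start(self) -> None:
        # No-op while a worker is alive and has not been asked to stop.
        if (self._thread is not None and self._thread.is_alive()
                and not self._stop_event.is_set()):
            return
        # Each worker gets its own event, so a restart cannot revive an old worker.
        self._stop_event = threading.Event()
        self._thread = threading.Thread(
            target=self._loop, args=(self._stop_event,), daemon=True
        )
        self._thread.start()

    def stop(self, timeout: float = 5.0) -> None:
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join(timeout=timeout)

    def _loop(self, stop_event: threading.Event) -> None:
        while not stop_event.is_set():
            self._task()
            # wait() returns as soon as stop() sets the event,
            # instead of sleeping out the whole interval.
            stop_event.wait(self.interval)


if __name__ == "__main__":
    runner = PeriodicRunner(lambda: print("running checks"), interval=60)
    runner.start()
    runner.stop()  # returns promptly even though the interval is 60 seconds
```

Because every worker thread holds its own event, a worker that is slow to exit still terminates on its own event after a restart, rather than continuing alongside the new thread.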
class StreamlitPageMonitor:
    """
    Singleton class that monitors and records errors occurring within Streamlit pages.
    It captures both explicit Streamlit error messages (monkey-patching st.error) and
    uncaught exceptions raised during the execution of monitored page functions, and
    persists error details to a local SQLite database.

    Key responsibilities

    - Intercept Streamlit error calls by monkey-patching st.error and record them with
      a stack trace, timestamp, status, and type.
    - Provide a decorator `monitor_page(page_name)` to set a page context, capture
      exceptions raised while rendering/executing a page, and record those exceptions.
    - Store errors in an in-memory structure grouped by page and persist them to
      an SQLite database for later inspection.
    - Provide utilities to load, deduplicate, clear, and query stored errors.

    Behavior and side effects

    - Implements the Singleton pattern: only one instance exists per Python process.
    - On first instantiation, optionally accepts a custom db_path and initializes
      the SQLite database and its parent directory (creating it if necessary).
    - Monkey-patches `streamlit.error` (st.error) to capture calls and still forward
      them to the original st.error implementation.
    - Records the following fields for each error: page, error, traceback, timestamp,
      status, type. The SQLite table `errors` mirrors these fields and includes an
      auto-incrementing `id`.
    - Persists errors immediately to SQLite when captured; database IO errors are
      logged but do not suppress the original exception (for monitored exceptions,
      the exception is re-raised after recording).

    Public API (methods)

    - __new__(cls, db_path=None)
        Create or return the singleton StreamlitPageMonitor instance.

        Parameters
        ----------
        db_path : Optional[str]
            If provided on the first instantiation, overrides the class-level
            database path used to persist captured Streamlit error information.

        Returns
        -------
        StreamlitPageMonitor
            The singleton instance of the class.

        Behavior
        --------
        - On first instantiation (when cls._instance is None):
            - Allocates the singleton via super().__new__.
            - Optionally sets cls._db_path from the provided db_path.
            - Logs the configured DB path.
            - Monkey-patches streamlit.error (st.error) with a wrapper that:
                - Builds an error record containing the error text, a formatted stack trace,
                  ISO timestamp, severity/status, an error type marker, and the current page.
                - Normalizes a missing current page to "unknown_page".
                - Stores the record in the in-memory cls._errors dictionary keyed by page.
                - Attempts to persist the record to the SQLite DB using cls().save_errors_to_db,
                  logging any persistence errors without interrupting Streamlit's normal error display.
                - Calls the original st.error to preserve expected UI behavior.
            - Initializes the SQLite DB via cls._init_db().
        - On subsequent calls:
            - Returns the existing singleton instance.
            - If db_path is provided, updates cls._db_path for future use.

        Side effects
        ------------
        - Replaces st.error globally for the running process.
        - Writes error records to both an in-memory structure (cls._errors) and to the
          configured SQLite database (if persistence succeeds).
        - Logs informational and error messages.

        Notes
        -----
        - The method assumes the class defines/has: _instance, _db_path, _current_page,
          _errors, _st_error (original st.error), save_errors_to_db, and _init_db.
        - Exceptions raised during saving of individual errors are caught and logged;
          exceptions from instance creation or DB initialization may propagate.
        - The implementation is not explicitly thread-safe; concurrent instantiation
          attempts may require external synchronization if used in multi-threaded contexts.
    - set_page_context(cls, page_name: str)
        Set the current page name used when recording subsequent errors.
    - monitor_page(cls, page_name: str) -> Callable
        Decorator for page rendering/execution functions. Sets the page context,
        clears previously recorded non-Streamlit errors for that page, runs the
        function, records and persists any raised exception, and re-raises it.
    - _handle_st_error(cls, error_message: str)

        Handles Streamlit-specific errors by recording error details for the current page.

        Args:
            error_message (str): The error message to be logged.

        Side Effects:
            Updates the class-level _errors dictionary with error information for the current Streamlit page.

        Error Information Stored:
            - error: Formatted error message.
            - traceback: Stack trace at the point of error.
            - timestamp: Time when the error occurred (ISO format).
            - status: Error severity ('critical').
            - type: Error type ('streamlit_error').
    - get_page_errors(cls) -> dict
        Load errors from the database and return a dictionary mapping page names to
        lists of error dicts. Performs basic deduplication by error message.
    - save_errors_to_db(cls, errors: Iterable[dict])
        Persist a list of error dictionaries to the configured SQLite database.
        Ensures traceback is stored as a string (JSON if originally a list).
    - clear_errors(cls, page_name: Optional[str] = None)
        Clear in-memory errors for a specific page or all pages and delete matching
        rows from the database.
    - _init_db(cls)
        Ensure the database directory exists and create the `errors` table if it
        does not exist.
142 - load_errors_from_db(cls, page=None, status=None, limit=None) -> List[dict] 143 Query the database for errors, optionally filtering by page and/or status, 144 returning a list of error dictionaries ordered by timestamp (descending) 145 and limited if requested. 146 147 Storage and format 148 149 - Default DB path: ~/local/share/streamlit-healthcheck/streamlit_page_errors.db (overridable). 150 - SQLite table `errors` columns: id, page, error, traceback, timestamp, status, type. 151 - Tracebacks may be stored as JSON strings (if originally lists) or plain strings. 152 Concurrency and robustness 153 - Designed for single-process usage typical of Streamlit apps. The singleton and 154 monkey-patching are process-global. 155 - Database interactions use short-lived connections; callers should handle any 156 exceptions arising from DB access (errors are logged internally). 157 - Decorator preserves original function metadata via functools.wraps. 158 159 Examples 160 161 - Use as a decorator on page render function: 162 >>> @StreamlitPageMonitor.monitor_page("home") 163 >>> def render_home(): 164 165 - Set page context manually: 166 >>> StreamlitPageMonitor.set_page_context("settings") 167 168 - Set custom DB path on first instantiation: 169 >>> # Place this at the top of your Streamlit app once, before any error monitoring or decorator usage to ensure the sqlite 170 >>> # database is created properly at the specified path; otherwise it will default to a temp directory. The temp directory 171 >>> # will be `~/local/share/streamlit-healthcheck/streamlit_page_errors.db`. 172 >>> StreamlitPageMonitor(db_path="/home/saradindu/dev/streamlit_page_errors.db") 173 ... 
174 175 SQLite Database Schema 176 --------------------- 177 The following schema is used for persisting errors: 178 179 ```sql 180 CREATE TABLE IF NOT EXISTS errors ( 181 id INTEGER PRIMARY KEY AUTOINCREMENT, 182 page TEXT, 183 error TEXT, 184 traceback TEXT, 185 timestamp TEXT, 186 status TEXT, 187 type TEXT 188 ); 189 ``` 190 191 Field Descriptions: 192 193 | Column | Type | Description | 194 |------------|---------|---------------------------------------------| 195 | id | INTEGER | Auto-incrementing primary key | 196 | page | TEXT | Name of the Streamlit page | 197 | error | TEXT | Error message | 198 | traceback | TEXT | Stack trace or traceback (as string/JSON) | 199 | timestamp | TEXT | ISO8601 timestamp of error occurrence | 200 | status | TEXT | Severity/status (e.g., 'critical') | 201 | type | TEXT | Error type ('streamlit_error', 'exception') | 202 203 Example: 204 205 >>> @StreamlitPageMonitor.monitor_page("home") 206 >>> def render_home(): 207 208 Notes 209 210 - The class monkey-patches st.error globally when first instantiated; ensure 211 this side effect is acceptable in your environment. 212 - Errors captured by st.error that occur outside any known page are recorded 213 under the page name "unknown_page". 214 - The schema is created/ensured in `_init_db()`. 215 - Tracebacks may be stored as JSON strings or plain text. 216 - Errors are persisted immediately upon capture. 
217 218 """ 219 _instance = None 220 _errors: Dict[str, List[Dict[str, Any]]] = {} 221 _st_error = st.error 222 _current_page = None 223 224 # --- SQLite schema for error persistence --- 225 # Table: errors 226 # Fields: 227 # id INTEGER PRIMARY KEY AUTOINCREMENT 228 # page TEXT 229 # error TEXT 230 # traceback TEXT 231 # timestamp TEXT 232 # status TEXT 233 # type TEXT 234 235 # Local development DB path 236 _db_path = os.path.join(os.path.expanduser("~"), "dev", "streamlit-healthcheck", "streamlit_page_errors.db") 237 # Final build DB path 238 #_db_path = os.path.join(os.path.expanduser("~"), ".local", "share", "streamlit-healthcheck", "streamlit_page_errors.db") 239 240 def __new__(cls, db_path=None): 241 """ 242 Create or return the singleton StreamlitPageMonitor instance. 243 """ 244 245 if cls._instance is None: 246 cls._instance = super(StreamlitPageMonitor, cls).__new__(cls) 247 # Allow db_path override at first instantiation 248 if db_path is not None: 249 cls._db_path = db_path 250 logger.info(f"StreamlitPageMonitor DB path set to: {cls._db_path}") 251 # Monkey patch st.error to capture error messages 252 def patched_error(*args, **kwargs): 253 error_message = " ".join(str(arg) for arg in args) 254 current_page = cls._current_page 255 error_info = { 256 'error': error_message, 257 'traceback': traceback.format_stack(), 258 'timestamp': datetime.now().isoformat(), 259 'status': 'critical', 260 'type': 'streamlit_error', 261 'page': current_page 262 } 263 # Ensure current_page is a string, not None 264 if current_page is None: 265 current_page = "unknown_page" 266 if current_page not in cls._errors: 267 cls._errors[current_page] = [] 268 cls._errors[current_page].append(error_info) 269 # Persist to DB 270 try: 271 cls().save_errors_to_db([error_info]) 272 except Exception as e: 273 logger.error(f"Failed to save Streamlit error to DB: {e}") 274 # Call original st.error 275 return cls._st_error(*args, **kwargs) 276 277 st.error = patched_error 278 279 # 
Initialize SQLite database 280 cls._init_db() 281 else: 282 # If already instantiated, allow updating db_path if provided 283 if db_path is not None: 284 cls._db_path = db_path 285 return cls._instance 286 287 @classmethod 288 def _handle_st_error(cls, error_message: str): 289 """ 290 Handles Streamlit-specific errors by recording error details for the current page. 291 """ 292 293 # Get current page name from Streamlit context 294 current_page = getattr(st, '_current_page', 'unknown_page') 295 error_info = { 296 'error': f"Streamlit Error: {error_message}", 297 'traceback': traceback.format_stack(), 298 'timestamp': datetime.now().isoformat(), 299 'status': 'critical', 300 'type': 'streamlit_error', 301 'page': current_page 302 } 303 # Initialize list for page if not exists 304 if current_page not in cls._errors: 305 cls._errors[current_page] = [] 306 # Add new error 307 cls._errors[current_page].append(error_info) 308 # Persist to DB 309 try: 310 cls().save_errors_to_db([error_info]) 311 except Exception as e: 312 logger.error(f"Failed to save Streamlit error to DB: {e}") 313 314 @classmethod 315 def set_page_context(cls, page_name: str): 316 """Set the current page context""" 317 cls._current_page = page_name 318 319 @classmethod 320 def monitor_page(cls, page_name: str): 321 """ 322 Decorator to monitor and log exceptions for a specific Streamlit page. 323 324 Args: 325 page_name (str): The name of the page to monitor. 326 327 Returns: 328 Callable: A decorator that wraps the target function, sets the page context, 329 clears previous non-Streamlit errors, and logs any exceptions that occur during execution. 330 331 The decorator performs the following actions: 332 333 - Sets the current page context using `cls.set_page_context`. 334 - Clears previous exception errors for the page, retaining only those marked as 'streamlit_error'. 335 - Executes the wrapped function. 
336 - If an exception occurs, logs detailed error information (error message, traceback, timestamp, status, type, and page) 337 to `cls._errors` under the given page name, then re-raises the exception. 338 """ 339 340 def decorator(func): 341 """ 342 Decorator to manage page-specific error handling and context setting. 343 This decorator sets the current page context before executing the decorated function. 344 It clears previous exception errors for the page, retaining only Streamlit error calls. 345 If an exception occurs during function execution, it captures error details including 346 the error message, traceback, timestamp, status, type, and page name, and appends them 347 to the page's error log. The exception is then re-raised. 348 349 Args: 350 func (Callable): The function to be decorated. 351 352 Returns: 353 Callable: The wrapped function with error handling and context management. 354 """ 355 356 @functools.wraps(func) 357 def wrapper(*args, **kwargs): 358 # Set the current page context 359 cls.set_page_context(page_name) 360 try: 361 # Clear previous exception errors but keep st.error calls 362 if page_name in cls._errors: 363 cls._errors[page_name] = [ 364 e for e in cls._errors[page_name] 365 if e.get('type') == 'streamlit_error' 366 ] 367 result = func(*args, **kwargs) 368 return result 369 except Exception as e: 370 error_info = { 371 'error': str(e), 372 'traceback': traceback.format_exc(), 373 'timestamp': datetime.now().isoformat(), 374 'status': 'critical', 375 'type': 'exception', 376 'page': page_name 377 } 378 if page_name not in cls._errors: 379 cls._errors[page_name] = [] 380 cls._errors[page_name].append(error_info) 381 # Persist to DB 382 try: 383 cls().save_errors_to_db([error_info]) 384 except Exception as db_exc: 385 logger.error(f"Failed to save exception error to DB: {db_exc}") 386 raise 387 return wrapper 388 return decorator 389 390 @classmethod 391 def get_page_errors(cls): 392 """ 393 Load error records from storage and return 
them grouped by page. 394 This class method calls cls().load_errors_from_db() to retrieve a sequence of error records 395 (each expected to be a mapping). It normalizes each record to a dictionary with the keys: 396 397 - 'error' (str): error message, default "Unknown error" 398 - 'traceback' (list): traceback frames or lines, default [] 399 - 'timestamp' (str): timestamp string, default "" 400 - 'type' (str): error type/category, default "unknown" 401 402 Grouping and uniqueness: 403 404 - Records are grouped by the 'page' key; if a record has no 'page' key, the page name 405 "unknown" is used. 406 - For each page, only unique errors are kept using the 'error' string as the deduplication 407 key. When multiple records for the same page have the same 'error' value, the last 408 occurrence in the loaded sequence will be retained. 409 410 Return value: 411 412 - dict[str, list[dict]]: mapping from page name to a list of normalized error dicts. 413 414 Error handling: 415 416 - Any exception raised while loading or processing records will be logged via logger.error. 417 The method will return the result accumulated so far (or an empty dict if nothing was 418 accumulated). 419 420 Notes: 421 422 - The class is expected to be instantiable (cls()) and to provide a load_errors_from_db() 423 method that yields or returns an iterable of mappings. 
424 """ 425 426 result = {} 427 try: 428 db_errors = cls().load_errors_from_db() 429 for err in db_errors: 430 page = err.get('page', 'unknown') 431 if page not in result: 432 result[page] = [] 433 result[page].append({ 434 'error': err.get('error', 'Unknown error'), 435 'traceback': err.get('traceback', []), 436 'timestamp': err.get('timestamp', ''), 437 'type': err.get('type', 'unknown') 438 }) 439 # Return only unique page errors using the 'page' column for filtering 440 return {page: list({e['error']: e for e in errors}.values()) for page, errors in result.items()} 441 except Exception as e: 442 logger.error(f"Failed to load errors from DB: {e}") 443 return result 444 445 @classmethod 446 def save_errors_to_db(cls, errors): 447 """ 448 Save a sequence of error records into the SQLite database configured at cls._db_path. 449 450 Parameters 451 ---------- 452 453 errors : Iterable[Mapping] | list[dict] 454 455 Sequence of error records to persist. Each record is expected to be a mapping with the 456 following keys (values are stored as provided, except for traceback which is normalized): 457 458 - "page": identifier or name of the page where the error occurred (str) 459 - "error": human-readable error message (str) 460 - "traceback": traceback information; may be a str, list, or None. If a list, it will be 461 JSON-encoded before storage. If None, an empty string is stored. 462 - "timestamp": timestamp for the error (stored as provided) 463 - "status": status associated with the error (str) 464 - "type": classification/type of the error (str) 465 466 Behavior 467 -------- 468 469 - If `errors` is falsy (None or empty), the method returns immediately without touching the DB. 470 - Opens a SQLite connection to the path stored in `cls._db_path`. 471 - Iterates over the provided records and inserts each into the `errors` table with columns 472 (page, error, traceback, timestamp, status, type). 
473 - Ensures that the `traceback` value is always written as a string (list -> JSON string, 474 other values -> str(), None -> ""). 475 - Commits the transaction if all inserts succeed and always closes the connection in a finally block. 476 477 Exceptions 478 ---------- 479 480 - Underlying sqlite3 exceptions (e.g., sqlite3.Error) are not swallowed and will propagate to the caller 481 if connection/execution fails. 482 483 Returns 484 ------- 485 486 None 487 """ 488 if not errors: 489 return 490 conn = sqlite3.connect(cls._db_path) 491 try: 492 cursor = conn.cursor() 493 for err in errors: 494 # Ensure traceback is always a string for SQLite 495 tb = err.get("traceback") 496 if isinstance(tb, list): 497 import json 498 tb_str = json.dumps(tb) 499 else: 500 tb_str = str(tb) if tb is not None else "" 501 cursor.execute( 502 """ 503 INSERT INTO errors (page, error, traceback, timestamp, status, type) 504 VALUES (?, ?, ?, ?, ?, ?) 505 """, 506 ( 507 err.get("page"), 508 err.get("error"), 509 tb_str, 510 err.get("timestamp"), 511 err.get("status"), 512 err.get("type"), 513 ), 514 ) 515 conn.commit() 516 finally: 517 conn.close() 518 519 @classmethod 520 def clear_errors(cls, page_name: Optional[str] = None): 521 """Clear stored health-check errors for a specific page or for all pages. 522 This classmethod updates both the in-memory error cache and the persistent 523 SQLite-backed store. 524 525 If `page_name` is provided: 526 527 - Remove the entry for that page from the class-level in-memory dictionary 528 of errors (if present). 529 - Delete all rows in the SQLite `errors` table where `page` equals `page_name`. 530 531 If `page_name` is None: 532 533 - Clear the entire in-memory errors dictionary. 534 - Delete all rows from the SQLite `errors` table. 535 536 Args: 537 page_name (Optional[str]): Name of the page whose errors should be cleared. 538 If None, all errors are cleared. 
539 540 Returns: 541 None 542 543 Side effects: 544 545 - Mutates class-level state (clears entries in `cls._errors`). 546 - Opens a SQLite connection to `cls._db_path` and executes DELETE statements 547 against the `errors` table. Commits the transaction and closes the connection. 548 549 Error handling: 550 551 - Database-related exceptions are caught and logged via the module logger; 552 they are not re-raised by this method. As a result, callers should not 553 rely on exceptions to detect DB failures. 554 555 Notes: 556 557 - The method assumes `cls._db_path` points to a valid SQLite database file 558 and that an `errors` table exists with a `page` column. 559 - This method does not provide synchronization; callers should take care of 560 concurrent access to class state and the database if used from multiple 561 threads or processes. 562 """ 563 564 if page_name: 565 if page_name in cls._errors: 566 del cls._errors[page_name] 567 # Remove from DB 568 try: 569 conn = sqlite3.connect(cls._db_path) 570 cursor = conn.cursor() 571 cursor.execute("DELETE FROM errors WHERE page = ?", (page_name,)) 572 conn.commit() 573 conn.close() 574 except Exception as e: 575 logger.error(f"Failed to clear errors from DB for page {page_name}: {e}") 576 else: 577 cls._errors = {} 578 # Remove all from DB 579 try: 580 conn = sqlite3.connect(cls._db_path) 581 cursor = conn.cursor() 582 cursor.execute("DELETE FROM errors") 583 conn.commit() 584 conn.close() 585 except Exception as e: 586 logger.error(f"Failed to clear all errors from DB: {e}") 587 588 @classmethod 589 def _init_db(cls): 590 """ 591 Initialize the SQLite database file and ensure the required schema exists. 592 This class-level initializer performs the following steps: 593 594 - Ensures the parent directory of cls._db_path exists; creates it if necessary. 595 - If cls._db_path has no parent directory (e.g., a bare filename), no directory is created. 
596 - Connects to the SQLite database at cls._db_path (creating the file if it does not exist). 597 - Creates an "errors" table if it does not already exist with the following columns: 598 - id (INTEGER PRIMARY KEY AUTOINCREMENT) 599 - page (TEXT) 600 - error (TEXT) 601 - traceback (TEXT) 602 - timestamp (TEXT) 603 - status (TEXT) 604 - type (TEXT) 605 - Commits the schema change and closes the database connection. 606 - Logs informational and error messages using the module logger. 607 608 Parameters 609 ---------- 610 611 cls : type 612 613 The class on which this method is invoked. Must provide a valid string attribute 614 `_db_path` indicating the target SQLite database file path. 615 616 Raises 617 ------ 618 619 Exception 620 621 Re-raises exceptions encountered when creating the parent directory (os.makedirs). 622 623 sqlite3.Error 624 625 May be raised by sqlite3.connect or subsequent SQLite operations when the database 626 cannot be opened or initialized. 627 628 Side effects 629 ------------ 630 631 - May create directories on the filesystem. 632 - May create or modify the SQLite database file at cls._db_path. 633 - Writes log messages via the module logger. 
634 635 Returns 636 ------- 637 638 None 639 """ 640 641 # Ensure the parent directory for the DB exists 642 db_dir = os.path.dirname(cls._db_path) 643 if db_dir and not os.path.exists(db_dir): 644 try: 645 os.makedirs(db_dir, exist_ok=False) 646 logger.info(f"Created directory for DB: {db_dir}") 647 except Exception as e: 648 logger.error(f"Failed to create DB directory {db_dir}: {e}") 649 raise 650 # Now create/connect to the DB and table 651 logger.info(f"Initializing SQLite DB at: {cls._db_path}") 652 conn = sqlite3.connect(cls._db_path) 653 c = conn.cursor() 654 c.execute('''CREATE TABLE IF NOT EXISTS errors ( 655 id INTEGER PRIMARY KEY AUTOINCREMENT, 656 page TEXT, 657 error TEXT, 658 traceback TEXT, 659 timestamp TEXT, 660 status TEXT, 661 type TEXT 662 )''') 663 conn.commit() 664 conn.close() 665 @classmethod 666 def load_errors_from_db(cls, page=None, status=None, limit=None): 667 """ 668 Load errors from the class SQLite database. 669 This classmethod connects to the SQLite database at cls._db_path, queries the 670 'errors' table, and returns matching error records as a list of dictionaries. 671 672 Parameters: 673 674 page (Optional[str]): If provided, filter results to rows where the 'page' 675 column equals this value. 676 status (Optional[str]): If provided, filter results to rows where the 'status' 677 column equals this value. 678 limit (Optional[int|str]): If provided, limits the number of returned rows. 679 The value is cast to int internally; a non-convertible value will raise 680 ValueError. 681 682 Returns: 683 684 List[dict]: A list of dictionaries representing rows from the 'errors' table. 
685 Each dict contains the following keys: 686 - id: primary key (int) 687 - page: page identifier (str) 688 - error: short error message (str) 689 - traceback: full traceback or diagnostic text (str) 690 - timestamp: stored timestamp value as retrieved from the DB (type depends on schema) 691 - status: error status (str) 692 - type: error type/category (str) 693 694 Raises: 695 696 ValueError: If `limit` cannot be converted to int. 697 sqlite3.Error: If an SQLite error occurs while executing the query. 698 699 Notes: 700 701 - Uses parameterized queries for the 'page' and 'status' filters to avoid SQL 702 injection. The `limit` is applied after casting to int. 703 - Results are ordered by `timestamp` in descending order. 704 - The database connection is always closed in a finally block to ensure cleanup. 705 """ 706 707 conn = sqlite3.connect(cls._db_path) 708 try: 709 cursor = conn.cursor() 710 query = "SELECT id, page, error, traceback, timestamp, status, type FROM errors" 711 params = [] 712 filters = [] 713 if page: 714 filters.append("page = ?") 715 params.append(page) 716 if status: 717 filters.append("status = ?") 718 params.append(status) 719 if filters: 720 query += " WHERE " + " AND ".join(filters) 721 query += " ORDER BY timestamp DESC" 722 if limit: 723 query += f" LIMIT {int(limit)}" 724 cursor.execute(query, params) 725 rows = cursor.fetchall() 726 errors = [] 727 for row in rows: 728 errors.append({ 729 "id": row[0], 730 "page": row[1], 731 "error": row[2], 732 "traceback": row[3], 733 "timestamp": row[4], 734 "status": row[5], 735 "type": row[6], 736 }) 737 return errors 738 finally: 739 conn.close()
Singleton class that monitors and records errors occurring within Streamlit pages. It captures both explicit Streamlit error messages (monkey-patching st.error) and uncaught exceptions raised during the execution of monitored page functions, and persists error details to a local SQLite database.
Key responsibilities
- Intercept Streamlit error calls by monkey-patching st.error and record them with a stack trace, timestamp, status, and type.
- Provide a decorator `monitor_page(page_name)` to set a page context, capture exceptions raised while rendering/executing a page, and record those exceptions.
- Store errors in an in-memory structure grouped by page and persist them to an SQLite database for later inspection.
- Provide utilities to load, deduplicate, clear, and query stored errors.
Behavior and side effects
- Implements the Singleton pattern: only one instance exists per Python process.
- On first instantiation, optionally accepts a custom db_path and initializes the SQLite database and its parent directory (creating it if necessary).
- Monkey-patches `streamlit.error` (st.error) to capture calls and still forward them to the original st.error implementation.
- Records the following fields for each error: page, error, traceback, timestamp, status, type. The SQLite table `errors` mirrors these fields and includes an auto-incrementing `id`.
- Persists errors immediately to SQLite when captured; database IO errors are logged but do not suppress the original exception (for monitored exceptions, the exception is re-raised after recording).
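The capture-and-forward behavior listed above can be sketched without Streamlit installed. This is a minimal illustration under stated assumptions, not the library's own code: `fake_st` is a hypothetical stand-in for the `streamlit` module, and `captured` and `displayed` are hypothetical stand-ins for the monitor's in-memory store and the UI.

```python
import traceback
from datetime import datetime
from types import SimpleNamespace

# Hypothetical stand-in for the `streamlit` module (assumption for illustration).
displayed = []
fake_st = SimpleNamespace(error=lambda *args, **kwargs: displayed.append(args))

captured = {}        # stands in for the in-memory error store, keyed by page
current_page = None  # no page context has been set yet

original_error = fake_st.error  # keep a reference so calls can be forwarded


def patched_error(*args, **kwargs):
    """Record the call, then forward it to the original implementation."""
    page = current_page or "unknown_page"
    record = {
        "page": page,
        "error": " ".join(str(a) for a in args),
        "traceback": traceback.format_stack(),
        "timestamp": datetime.now().isoformat(),
        "status": "critical",
        "type": "streamlit_error",
    }
    captured.setdefault(page, []).append(record)
    return original_error(*args, **kwargs)


fake_st.error = patched_error          # process-global replacement
fake_st.error("Could not load data")   # recorded and still "displayed"
```

Keeping a reference to the original callable before replacing it is what lets the wrapper preserve the normal UI behavior while adding the recording step.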
Public API (methods)
- __new__(cls, db_path=None) Create or return the singleton StreamlitPageMonitor instance.

  Parameters
  ----------
  db_path : Optional[str]
      If provided on the first instantiation, overrides the class-level
      database path used to persist captured Streamlit error information.

  Returns
  -------
  StreamlitPageMonitor
      The singleton instance of the class.

  Behavior
  --------
  - On first instantiation (when cls._instance is None):
      - Allocates the singleton via super().__new__.
      - Optionally sets cls._db_path from the provided db_path.
      - Logs the configured DB path.
      - Monkey-patches streamlit.error (st.error) with a wrapper that:
          - Builds an error record containing the error text, a formatted stack trace,
            ISO timestamp, severity/status, an error type marker, and the current page.
          - Normalizes a missing current page to "unknown_page".
          - Stores the record in the in-memory cls._errors dictionary keyed by page.
          - Attempts to persist the record to the SQLite DB using cls().save_errors_to_db,
            logging any persistence errors without interrupting Streamlit's normal error display.
          - Calls the original st.error to preserve expected UI behavior.
      - Initializes the SQLite DB via cls._init_db().
  - On subsequent calls:
      - Returns the existing singleton instance.
      - If db_path is provided, updates cls._db_path for future use.

  Side effects
  ------------
  - Replaces st.error globally for the running process.
  - Writes error records to both an in-memory structure (cls._errors) and to the
    configured SQLite database (if persistence succeeds).
  - Logs informational and error messages.

  Notes
  -----
  - The method assumes the class defines/has: _instance, _db_path, _current_page,
    _errors, _st_error (original st.error), save_errors_to_db, and _init_db.
  - Exceptions raised during saving of individual errors are caught and logged;
    exceptions from instance creation or DB initialization may propagate.
  - The implementation is not explicitly thread-safe; concurrent instantiation
    attempts may require external synchronization if used in multi-threaded contexts.
- set_page_context(cls, page_name: str) Set the current page name used when recording subsequent errors.
- monitor_page(cls, page_name: str) -> Callable Decorator for page rendering/execution functions. Sets the page context, clears previously recorded non-Streamlit errors for that page, runs the function, records and persists any raised exception, and re-raises it.
- _handle_st_error(cls, error_message: str) Handles Streamlit-specific errors by recording error details for the current page.

  Args:
      error_message (str): The error message to be logged.

  Side Effects:
      Updates the class-level _errors dictionary with error information for the current Streamlit page.

  Error Information Stored:
      - error: Formatted error message.
      - traceback: Stack trace at the point of error.
      - timestamp: Time when the error occurred (ISO format).
      - status: Error severity ('critical').
      - type: Error type ('streamlit_error').
- get_page_errors(cls) -> dict Load errors from the database and return a dictionary mapping page names to lists of error dicts. Performs basic deduplication by error message.
- save_errors_to_db(cls, errors: Iterable[dict]) Persist a list of error dictionaries to the configured SQLite database. Ensures traceback is stored as a string (JSON if originally a list).
- clear_errors(cls, page_name: Optional[str] = None) Clear in-memory errors for a specific page or all pages and delete matching rows from the database.
- _init_db(cls) Ensure the database directory exists and create the `errors` table if it does not exist.
- load_errors_from_db(cls, page=None, status=None, limit=None) -> List[dict] Query the database for errors, optionally filtering by page and/or status, returning a list of error dictionaries ordered by timestamp (descending) and limited if requested.
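The grouping and deduplication that `get_page_errors` is described as performing can be sketched on plain dictionaries. The sample rows and the helper name `group_and_dedupe` are made up for illustration; only the dict-keyed-by-message idea mirrors the source listing.

```python
from collections import defaultdict

# Made-up rows shaped like the dicts returned by load_errors_from_db.
rows = [
    {"page": "home", "error": "boom", "timestamp": "2024-01-01T10:00:00"},
    {"page": "home", "error": "boom", "timestamp": "2024-01-01T11:00:00"},
    {"page": "home", "error": "bad input", "timestamp": "2024-01-01T12:00:00"},
    {"page": "settings", "error": "boom", "timestamp": "2024-01-01T13:00:00"},
]


def group_and_dedupe(records):
    """Group records by page, keeping one record per distinct error message."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec.get("page", "unknown")].append(rec)
    # Keying a dict by the message keeps the last record seen for each message.
    return {
        page: list({e["error"]: e for e in errs}.values())
        for page, errs in grouped.items()
    }


result = group_and_dedupe(rows)
```

Because deduplication is keyed on the message text alone, two distinct failures that happen to share a message on the same page collapse into one entry.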
Storage and format
- Default DB path: ~/.local/share/streamlit-healthcheck/streamlit_page_errors.db (overridable).
- SQLite table `errors` columns: id, page, error, traceback, timestamp, status, type.
- Tracebacks may be stored as JSON strings (if originally lists) or plain strings.
Concurrency and robustness
- Designed for single-process usage typical of Streamlit apps. The singleton and monkey-patching are process-global.
- Database interactions use short-lived connections; callers should handle any exceptions arising from DB access (errors are logged internally).
- Decorator preserves original function metadata via functools.wraps.
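The record-then-re-raise decorator and the role of `functools.wraps` can be illustrated with a standalone sketch. It assumes nothing about Streamlit: `errors_by_page` is a hypothetical stand-in for the monitor's in-memory store, and this `monitor_page` is a simplified free function, not the class method itself.

```python
import functools
import traceback
from datetime import datetime

errors_by_page = {}  # hypothetical stand-in for the monitor's in-memory store


def monitor_page(page_name):
    """Decorator factory: record any exception from the page function, then re-raise."""
    def decorator(func):
        @functools.wraps(func)  # keeps func.__name__ and func.__doc__ intact
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                errors_by_page.setdefault(page_name, []).append({
                    "page": page_name,
                    "error": str(exc),
                    "traceback": traceback.format_exc(),
                    "timestamp": datetime.now().isoformat(),
                    "status": "critical",
                    "type": "exception",
                })
                raise  # the caller still sees the original exception
        return wrapper
    return decorator


@monitor_page("home")
def render_home():
    """Render the home page."""
    raise ValueError("missing dataset")


try:
    render_home()
except ValueError:
    pass  # recorded above, then propagated to this handler
```

Without `functools.wraps`, `render_home.__name__` would report `wrapper`, which makes logs and debugging output harder to read.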
Examples
Use as a decorator on a page render function:
>>> @StreamlitPageMonitor.monitor_page("home")
... def render_home():
...     ...

Set page context manually:
>>> StreamlitPageMonitor.set_page_context("settings")

Set custom DB path on first instantiation:
>>> # Place this once at the top of your Streamlit app, before any error monitoring or decorator usage, so that the SQLite
>>> # database is created at the specified path; otherwise the default path is used, which is
>>> # `~/.local/share/streamlit-healthcheck/streamlit_page_errors.db`.
>>> StreamlitPageMonitor(db_path="/home/saradindu/dev/streamlit_page_errors.db")
...
SQLite Database Schema
The following schema is used for persisting errors:
CREATE TABLE IF NOT EXISTS errors (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    page TEXT,
    error TEXT,
    traceback TEXT,
    timestamp TEXT,
    status TEXT,
    type TEXT
);
Field Descriptions:
| Column | Type | Description |
|---|---|---|
| id | INTEGER | Auto-incrementing primary key |
| page | TEXT | Name of the Streamlit page |
| error | TEXT | Error message |
| traceback | TEXT | Stack trace or traceback (as string/JSON) |
| timestamp | TEXT | ISO8601 timestamp of error occurrence |
| status | TEXT | Severity/status (e.g., 'critical') |
| type | TEXT | Error type ('streamlit_error', 'exception') |
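As a rough illustration of how records fit this schema, the sketch below writes two made-up rows to an in-memory SQLite database and reads them back newest first. The helper `to_text` is a hypothetical name for the traceback normalization described for `save_errors_to_db`; binding `LIMIT` as a query parameter is a choice made for this sketch, whereas the source listing casts the limit to `int` and interpolates it.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory DB so the sketch leaves no file behind
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS errors (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        page TEXT, error TEXT, traceback TEXT,
        timestamp TEXT, status TEXT, type TEXT
    )
    """
)


def to_text(tb):
    """Normalize a traceback to a string: list -> JSON, None -> ''."""
    if isinstance(tb, list):
        return json.dumps(tb)
    return str(tb) if tb is not None else ""


sample = [  # made-up records for illustration
    {"page": "home", "error": "boom", "traceback": ["frame 1", "frame 2"],
     "timestamp": "2024-01-01T10:00:00", "status": "critical", "type": "streamlit_error"},
    {"page": "home", "error": "bad input", "traceback": None,
     "timestamp": "2024-01-02T09:30:00", "status": "critical", "type": "exception"},
]
conn.executemany(
    "INSERT INTO errors (page, error, traceback, timestamp, status, type) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [(r["page"], r["error"], to_text(r["traceback"]),
      r["timestamp"], r["status"], r["type"]) for r in sample],
)
conn.commit()

# ISO 8601 strings sort chronologically, so ORDER BY on the TEXT column works.
rows = conn.execute(
    "SELECT error, traceback FROM errors WHERE page = ? ORDER BY timestamp DESC LIMIT ?",
    ("home", 10),
).fetchall()
conn.close()
```

A reader of the table needs to handle both traceback encodings: list-style tracebacks come back as JSON text and must be decoded with `json.loads`, while exception tracebacks are already plain strings.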
Example:
>>> @StreamlitPageMonitor.monitor_page("home")
... def render_home():
...     ...
Notes
- The class monkey-patches st.error globally when first instantiated; ensure this side effect is acceptable in your environment.
- Errors captured by st.error that occur outside any known page are recorded under the page name "unknown_page".
- The schema is created/ensured in `_init_db()`.
- Tracebacks may be stored as JSON strings or plain text.
- Errors are persisted immediately upon capture.
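The monkey-patching note above can be pictured with a small sketch. It uses a stand-in object (`fake_st`) in place of the real `streamlit` module so that it runs without Streamlit installed; the names `fake_st`, `displayed`, and `captured` are assumptions made for the illustration only.

```python
import types

# Stand-in for the `streamlit` module (assumption: not the real library).
fake_st = types.SimpleNamespace()
displayed = []  # what the "original" error function received
fake_st.error = lambda *args, **kwargs: displayed.append(args)

captured = []  # what the monitor recorded
original_error = fake_st.error  # keep a reference before patching


def patched_error(*args, **kwargs):
    # Record the message first, then forward to the original implementation.
    captured.append(" ".join(str(arg) for arg in args))
    return original_error(*args, **kwargs)


fake_st.error = patched_error  # process-wide replacement
fake_st.error("Could not load", "data")
```

Because the replacement is assigned on the module-like object itself, every caller in the process sees the patched function, which is why the side effect is described as global.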
    def __new__(cls, db_path=None):
        """
        Create or return the singleton StreamlitPageMonitor instance.
        """

        if cls._instance is None:
            cls._instance = super(StreamlitPageMonitor, cls).__new__(cls)
            # Allow db_path override at first instantiation
            if db_path is not None:
                cls._db_path = db_path
            logger.info(f"StreamlitPageMonitor DB path set to: {cls._db_path}")
            # Monkey patch st.error to capture error messages
            def patched_error(*args, **kwargs):
                error_message = " ".join(str(arg) for arg in args)
                current_page = cls._current_page
                # Ensure current_page is a string, not None, before it is
                # stored in the record, so memory and DB agree on the page name
                if current_page is None:
                    current_page = "unknown_page"
                error_info = {
                    'error': error_message,
                    'traceback': traceback.format_stack(),
                    'timestamp': datetime.now().isoformat(),
                    'status': 'critical',
                    'type': 'streamlit_error',
                    'page': current_page
                }
                if current_page not in cls._errors:
                    cls._errors[current_page] = []
                cls._errors[current_page].append(error_info)
                # Persist to DB
                try:
                    cls().save_errors_to_db([error_info])
                except Exception as e:
                    logger.error(f"Failed to save Streamlit error to DB: {e}")
                # Call original st.error
                return cls._st_error(*args, **kwargs)

            st.error = patched_error

            # Initialize SQLite database
            cls._init_db()
        else:
            # If already instantiated, allow updating db_path if provided
            if db_path is not None:
                cls._db_path = db_path
        return cls._instance
Create or return the singleton StreamlitPageMonitor instance.
    @classmethod
    def set_page_context(cls, page_name: str):
        """Set the current page context"""
        cls._current_page = page_name
Set the current page context
    @classmethod
    def monitor_page(cls, page_name: str):
        """
        Decorator to monitor and log exceptions for a specific Streamlit page.

        Args:
            page_name (str): The name of the page to monitor.

        Returns:
            Callable: A decorator that wraps the target function, sets the page context,
            clears previous non-Streamlit errors, and logs any exceptions that occur during execution.

        The decorator performs the following actions:

        - Sets the current page context using `cls.set_page_context`.
        - Clears previous exception errors for the page, retaining only those marked as 'streamlit_error'.
        - Executes the wrapped function.
        - If an exception occurs, logs detailed error information (error message, traceback, timestamp, status, type, and page)
            to `cls._errors` under the given page name, then re-raises the exception.
        """

        def decorator(func):
            """
            Decorator to manage page-specific error handling and context setting.
            This decorator sets the current page context before executing the decorated function.
            It clears previous exception errors for the page, retaining only Streamlit error calls.
            If an exception occurs during function execution, it captures error details including
            the error message, traceback, timestamp, status, type, and page name, and appends them
            to the page's error log. The exception is then re-raised.

            Args:
                func (Callable): The function to be decorated.

            Returns:
                Callable: The wrapped function with error handling and context management.
            """

            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                # Set the current page context
                cls.set_page_context(page_name)
                try:
                    # Clear previous exception errors but keep st.error calls
                    if page_name in cls._errors:
                        cls._errors[page_name] = [
                            e for e in cls._errors[page_name]
                            if e.get('type') == 'streamlit_error'
                        ]
                    result = func(*args, **kwargs)
                    return result
                except Exception as e:
                    error_info = {
                        'error': str(e),
                        'traceback': traceback.format_exc(),
                        'timestamp': datetime.now().isoformat(),
                        'status': 'critical',
                        'type': 'exception',
                        'page': page_name
                    }
                    if page_name not in cls._errors:
                        cls._errors[page_name] = []
                    cls._errors[page_name].append(error_info)
                    # Persist to DB
                    try:
                        cls().save_errors_to_db([error_info])
                    except Exception as db_exc:
                        logger.error(f"Failed to save exception error to DB: {db_exc}")
                    raise
            return wrapper
        return decorator
Decorator to monitor and log exceptions for a specific Streamlit page.
Args: page_name (str): The name of the page to monitor.
Returns: Callable: A decorator that wraps the target function, sets the page context, clears previous non-Streamlit errors, and logs any exceptions that occur during execution.
The decorator performs the following actions:
- Sets the current page context using `cls.set_page_context`.
- Clears previous exception errors for the page, retaining only those marked as 'streamlit_error'.
- Executes the wrapped function.
- If an exception occurs, logs detailed error information (error message, traceback, timestamp, status, type, and page)
to `cls._errors` under the given page name, then re-raises the exception.
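The actions listed above can be sketched without Streamlit or a database. This is a simplified stand-in, not the library's implementation: `errors_by_page` plays the role of `cls._errors`, persistence is omitted, and `render_home` is a made-up page function that always fails.

```python
import functools
import traceback
from datetime import datetime

errors_by_page = {}  # hypothetical stand-in for the class-level error store


def monitor_page(page_name):
    def decorator(func):
        @functools.wraps(func)  # preserve the wrapped function's metadata
        def wrapper(*args, **kwargs):
            # Drop earlier exception records, keep 'streamlit_error' ones.
            errors_by_page[page_name] = [
                e for e in errors_by_page.get(page_name, [])
                if e.get("type") == "streamlit_error"
            ]
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                errors_by_page[page_name].append({
                    "error": str(exc),
                    "traceback": traceback.format_exc(),
                    "timestamp": datetime.now().isoformat(),
                    "status": "critical",
                    "type": "exception",
                    "page": page_name,
                })
                raise  # the caller still sees the original exception
        return wrapper
    return decorator


@monitor_page("home")
def render_home():
    raise ValueError("boom")


try:
    render_home()
except ValueError:
    reraised = True
else:
    reraised = False
```

Re-raising after recording keeps the page's normal failure behavior intact; the decorator only observes.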
    @classmethod
    def get_page_errors(cls):
        """
        Load error records from storage and return them grouped by page.
        This class method calls cls().load_errors_from_db() to retrieve a sequence of error records
        (each expected to be a mapping). It normalizes each record to a dictionary with the keys:

        - 'error' (str): error message, default "Unknown error"
        - 'traceback' (list): traceback frames or lines, default []
        - 'timestamp' (str): timestamp string, default ""
        - 'type' (str): error type/category, default "unknown"

        Grouping and uniqueness:

        - Records are grouped by the 'page' key; if a record has no 'page' key, the page name
            "unknown" is used.
        - For each page, only unique errors are kept using the 'error' string as the deduplication
            key. When multiple records for the same page have the same 'error' value, the last
            occurrence in the loaded sequence will be retained.

        Return value:

        - dict[str, list[dict]]: mapping from page name to a list of normalized error dicts.

        Error handling:

        - Any exception raised while loading or processing records will be logged via logger.error.
            The method will return the result accumulated so far (or an empty dict if nothing was
            accumulated).

        Notes:

        - The class is expected to be instantiable (cls()) and to provide a load_errors_from_db()
            method that yields or returns an iterable of mappings.
        """

        result = {}
        try:
            db_errors = cls().load_errors_from_db()
            for err in db_errors:
                page = err.get('page', 'unknown')
                if page not in result:
                    result[page] = []
                result[page].append({
                    'error': err.get('error', 'Unknown error'),
                    'traceback': err.get('traceback', []),
                    'timestamp': err.get('timestamp', ''),
                    'type': err.get('type', 'unknown')
                })
            # Within each page, keep one record per distinct 'error' message
            # (the last occurrence in the loaded sequence wins)
            return {page: list({e['error']: e for e in errors}.values()) for page, errors in result.items()}
        except Exception as e:
            logger.error(f"Failed to load errors from DB: {e}")
            return result
Load error records from storage and return them grouped by page. This class method calls cls().load_errors_from_db() to retrieve a sequence of error records (each expected to be a mapping). It normalizes each record to a dictionary with the keys:
- 'error' (str): error message, default "Unknown error"
- 'traceback' (list): traceback frames or lines, default []
- 'timestamp' (str): timestamp string, default ""
- 'type' (str): error type/category, default "unknown"
Grouping and uniqueness:
- Records are grouped by the 'page' key; if a record has no 'page' key, the page name
"unknown" is used.
- For each page, only unique errors are kept using the 'error' string as the deduplication
key. When multiple records for the same page have the same 'error' value, the last
occurrence in the loaded sequence will be retained.
Return value:
- dict[str, list[dict]]: mapping from page name to a list of normalized error dicts.
Error handling:
- Any exception raised while loading or processing records will be logged via logger.error.
The method will return the result accumulated so far (or an empty dict if nothing was
accumulated).
Notes:
- The class is expected to be instantiable (cls()) and to provide a load_errors_from_db()
method that yields or returns an iterable of mappings.
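The grouping and deduplication rule described above can be tried on plain dictionaries. The function name `dedupe_by_error` and the sample records are assumptions for this sketch; only the `'error'` and `'timestamp'` keys are carried over to keep it short.

```python
def dedupe_by_error(records):
    """Group records by page, then keep one record per distinct error message."""
    grouped = {}
    for rec in records:
        page = rec.get("page", "unknown")
        grouped.setdefault(page, []).append({
            "error": rec.get("error", "Unknown error"),
            "timestamp": rec.get("timestamp", ""),
        })
    # A dict keyed by the error message overwrites earlier duplicates,
    # so the last occurrence in the input sequence is the one retained.
    return {
        page: list({e["error"]: e for e in errs}.values())
        for page, errs in grouped.items()
    }


sample = [
    {"page": "home", "error": "boom", "timestamp": "t1"},
    {"page": "home", "error": "other", "timestamp": "t2"},
    {"page": "home", "error": "boom", "timestamp": "t3"},
    {"error": "no page", "timestamp": "t4"},
]
result = dedupe_by_error(sample)
```

Since `load_errors_from_db` orders rows by timestamp in descending order, the "last occurrence" in the loaded sequence would appear to be the oldest record with that message, which is worth keeping in mind when reading the timestamps.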
    @classmethod
    def save_errors_to_db(cls, errors):
        """
        Save a sequence of error records into the SQLite database configured at cls._db_path.

        Parameters
        ----------

        errors : Iterable[Mapping] | list[dict]

            Sequence of error records to persist. Each record is expected to be a mapping with the
            following keys (values are stored as provided, except for traceback which is normalized):

            - "page": identifier or name of the page where the error occurred (str)
            - "error": human-readable error message (str)
            - "traceback": traceback information; may be a str, list, or None. If a list, it will be
                JSON-encoded before storage. If None, an empty string is stored.
            - "timestamp": timestamp for the error (stored as provided)
            - "status": status associated with the error (str)
            - "type": classification/type of the error (str)

        Behavior
        --------

        - If `errors` is falsy (None or empty), the method returns immediately without touching the DB.
        - Opens a SQLite connection to the path stored in `cls._db_path`.
        - Iterates over the provided records and inserts each into the `errors` table with columns
            (page, error, traceback, timestamp, status, type).
        - Ensures that the `traceback` value is always written as a string (list -> JSON string,
            other values -> str(), None -> "").
        - Commits the transaction if all inserts succeed and always closes the connection in a finally block.

        Exceptions
        ----------

        - Underlying sqlite3 exceptions (e.g., sqlite3.Error) are not swallowed and will propagate to the caller
            if connection/execution fails.

        Returns
        -------

        None
        """
        if not errors:
            return
        conn = sqlite3.connect(cls._db_path)
        try:
            cursor = conn.cursor()
            for err in errors:
                # Ensure traceback is always a string for SQLite
                # (json is already imported at module level)
                tb = err.get("traceback")
                if isinstance(tb, list):
                    tb_str = json.dumps(tb)
                else:
                    tb_str = str(tb) if tb is not None else ""
                cursor.execute(
                    """
                    INSERT INTO errors (page, error, traceback, timestamp, status, type)
                    VALUES (?, ?, ?, ?, ?, ?)
                    """,
                    (
                        err.get("page"),
                        err.get("error"),
                        tb_str,
                        err.get("timestamp"),
                        err.get("status"),
                        err.get("type"),
                    ),
                )
            conn.commit()
        finally:
            conn.close()
Save a sequence of error records into the SQLite database configured at cls._db_path.
Parameters
errors : Iterable[Mapping] | list[dict]
Sequence of error records to persist. Each record is expected to be a mapping with the
following keys (values are stored as provided, except for traceback which is normalized):
- "page": identifier or name of the page where the error occurred (str)
- "error": human-readable error message (str)
- "traceback": traceback information; may be a str, list, or None. If a list, it will be
JSON-encoded before storage. If None, an empty string is stored.
- "timestamp": timestamp for the error (stored as provided)
- "status": status associated with the error (str)
- "type": classification/type of the error (str)
Behavior
- If `errors` is falsy (None or empty), the method returns immediately without touching the DB.
- Opens a SQLite connection to the path stored in `cls._db_path`.
- Iterates over the provided records and inserts each into the `errors` table with columns (page, error, traceback, timestamp, status, type).
- Ensures that the `traceback` value is always written as a string (list -> JSON string, other values -> str(), None -> "").
- Commits the transaction if all inserts succeed and always closes the connection in a finally block.
Exceptions
- Underlying sqlite3 exceptions (e.g., sqlite3.Error) are not swallowed and will propagate to the caller if connection/execution fails.
Returns
None
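The traceback normalization rule described above (list to JSON string, None to empty string, anything else through `str()`) is small enough to isolate. The helper name `normalize_traceback` is an assumption for this sketch; the library performs the same steps inline.

```python
import json


def normalize_traceback(tb):
    """Return a string suitable for a TEXT column, following the rule above."""
    if isinstance(tb, list):
        return json.dumps(tb)  # lists become a JSON array string
    return str(tb) if tb is not None else ""


as_json = normalize_traceback(["frame 1", "frame 2"])
as_text = normalize_traceback("Traceback (most recent call last): ...")
as_empty = normalize_traceback(None)
restored = json.loads(as_json)  # a reader can recover the original list
```

A consumer that wants the original frames back may try `json.loads` on the stored value and fall back to treating it as plain text if decoding fails.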
    @classmethod
    def clear_errors(cls, page_name: Optional[str] = None):
        """Clear stored health-check errors for a specific page or for all pages.
        This classmethod updates both the in-memory error cache and the persistent
        SQLite-backed store.

        If `page_name` is provided:

        - Remove the entry for that page from the class-level in-memory dictionary
            of errors (if present).
        - Delete all rows in the SQLite `errors` table where `page` equals `page_name`.

        If `page_name` is None:

        - Clear the entire in-memory errors dictionary.
        - Delete all rows from the SQLite `errors` table.

        Args:
            page_name (Optional[str]): Name of the page whose errors should be cleared.
                If None, all errors are cleared.

        Returns:
            None

        Side effects:

        - Mutates class-level state (clears entries in `cls._errors`).
        - Opens a SQLite connection to `cls._db_path` and executes DELETE statements
            against the `errors` table. Commits the transaction and closes the connection.

        Error handling:

        - Database-related exceptions are caught and logged via the module logger;
            they are not re-raised by this method. As a result, callers should not
            rely on exceptions to detect DB failures.

        Notes:

        - The method assumes `cls._db_path` points to a valid SQLite database file
            and that an `errors` table exists with a `page` column.
        - This method does not provide synchronization; callers should take care of
            concurrent access to class state and the database if used from multiple
            threads or processes.
        """

        if page_name:
            if page_name in cls._errors:
                del cls._errors[page_name]
            # Remove from DB
            try:
                conn = sqlite3.connect(cls._db_path)
                cursor = conn.cursor()
                cursor.execute("DELETE FROM errors WHERE page = ?", (page_name,))
                conn.commit()
                conn.close()
            except Exception as e:
                logger.error(f"Failed to clear errors from DB for page {page_name}: {e}")
        else:
            cls._errors = {}
            # Remove all from DB
            try:
                conn = sqlite3.connect(cls._db_path)
                cursor = conn.cursor()
                cursor.execute("DELETE FROM errors")
                conn.commit()
                conn.close()
            except Exception as e:
                logger.error(f"Failed to clear all errors from DB: {e}")
Clear stored health-check errors for a specific page or for all pages. This classmethod updates both the in-memory error cache and the persistent SQLite-backed store.
If page_name is provided:
- Remove the entry for that page from the class-level in-memory dictionary of errors (if present).
- Delete all rows in the SQLite `errors` table where `page` equals `page_name`.
If page_name is None:
- Clear the entire in-memory errors dictionary.
- Delete all rows from the SQLite `errors` table.
Args: page_name (Optional[str]): Name of the page whose errors should be cleared. If None, all errors are cleared.
Returns: None
Side effects:
- Mutates class-level state (clears entries in `cls._errors`).
- Opens a SQLite connection to `cls._db_path` and executes DELETE statements
against the `errors` table. Commits the transaction and closes the connection.
Error handling:
- Database-related exceptions are caught and logged via the module logger;
they are not re-raised by this method. As a result, callers should not
rely on exceptions to detect DB failures.
Notes:
- The method assumes `cls._db_path` points to a valid SQLite database file
and that an `errors` table exists with a `page` column.
- This method does not provide synchronization; callers should take care of
concurrent access to class state and the database if used from multiple
threads or processes.
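A possible variation on the delete logic described above, shown as a sketch rather than the library's code: wrapping the connection in `contextlib.closing` guarantees it is closed even if the DELETE raises. The function name `delete_errors`, the trimmed-down table, and the temporary database file are all assumptions made for the example.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def delete_errors(db_path, page_name=None):
    """Delete one page's rows, or all rows if page_name is None.

    Returns the number of rows left in the table afterwards.
    """
    with closing(sqlite3.connect(db_path)) as conn:
        if page_name:
            conn.execute("DELETE FROM errors WHERE page = ?", (page_name,))
        else:
            conn.execute("DELETE FROM errors")
        conn.commit()
        return conn.execute("SELECT COUNT(*) FROM errors").fetchone()[0]


# Build a throwaway database with a reduced version of the table.
db_path = os.path.join(tempfile.mkdtemp(), "errors.db")
with closing(sqlite3.connect(db_path)) as conn:
    conn.execute(
        "CREATE TABLE errors (id INTEGER PRIMARY KEY AUTOINCREMENT, page TEXT, error TEXT)"
    )
    conn.executemany(
        "INSERT INTO errors (page, error) VALUES (?, ?)",
        [("home", "a"), ("home", "b"), ("settings", "c")],
    )
    conn.commit()

remaining_after_page = delete_errors(db_path, "home")
remaining_after_all = delete_errors(db_path)
```

Unlike `clear_errors`, this sketch lets database exceptions propagate; a caller that prefers the log-and-continue behavior described above would wrap the call in its own `try`/`except`.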
    @classmethod
    def load_errors_from_db(cls, page=None, status=None, limit=None):
        """
        Load errors from the class SQLite database.
        This classmethod connects to the SQLite database at cls._db_path, queries the
        'errors' table, and returns matching error records as a list of dictionaries.

        Parameters:

            page (Optional[str]): If provided, filter results to rows where the 'page'
                column equals this value.
            status (Optional[str]): If provided, filter results to rows where the 'status'
                column equals this value.
            limit (Optional[int|str]): If provided, limits the number of returned rows.
                The value is cast to int internally; a non-convertible value will raise
                ValueError.

        Returns:

            List[dict]: A list of dictionaries representing rows from the 'errors' table.
            Each dict contains the following keys:
                - id: primary key (int)
                - page: page identifier (str)
                - error: short error message (str)
                - traceback: full traceback or diagnostic text (str)
                - timestamp: stored timestamp value as retrieved from the DB (type depends on schema)
                - status: error status (str)
                - type: error type/category (str)

        Raises:

            ValueError: If `limit` cannot be converted to int.
            sqlite3.Error: If an SQLite error occurs while executing the query.

        Notes:

        - Uses parameterized queries for the 'page' and 'status' filters to avoid SQL
            injection. The `limit` is applied after casting to int.
        - Results are ordered by `timestamp` in descending order.
        - The database connection is always closed in a finally block to ensure cleanup.
        """

        conn = sqlite3.connect(cls._db_path)
        try:
            cursor = conn.cursor()
            query = "SELECT id, page, error, traceback, timestamp, status, type FROM errors"
            params = []
            filters = []
            if page:
                filters.append("page = ?")
                params.append(page)
            if status:
                filters.append("status = ?")
                params.append(status)
            if filters:
                query += " WHERE " + " AND ".join(filters)
            query += " ORDER BY timestamp DESC"
            if limit:
                query += f" LIMIT {int(limit)}"
            cursor.execute(query, params)
            rows = cursor.fetchall()
            errors = []
            for row in rows:
                errors.append({
                    "id": row[0],
                    "page": row[1],
                    "error": row[2],
                    "traceback": row[3],
                    "timestamp": row[4],
                    "status": row[5],
                    "type": row[6],
                })
            return errors
        finally:
            conn.close()
Load errors from the class SQLite database. This classmethod connects to the SQLite database at cls._db_path, queries the 'errors' table, and returns matching error records as a list of dictionaries.
Parameters:
page (Optional[str]): If provided, filter results to rows where the 'page'
column equals this value.
status (Optional[str]): If provided, filter results to rows where the 'status'
column equals this value.
limit (Optional[int|str]): If provided, limits the number of returned rows.
The value is cast to int internally; a non-convertible value will raise
ValueError.
Returns:
List[dict]: A list of dictionaries representing rows from the 'errors' table.
Each dict contains the following keys:
- id: primary key (int)
- page: page identifier (str)
- error: short error message (str)
- traceback: full traceback or diagnostic text (str)
- timestamp: stored timestamp value as retrieved from the DB (type depends on schema)
- status: error status (str)
- type: error type/category (str)
Raises:
ValueError: If `limit` cannot be converted to int.
sqlite3.Error: If an SQLite error occurs while executing the query.
Notes:
- Uses parameterized queries for the 'page' and 'status' filters to avoid SQL
injection. The `limit` is applied after casting to int.
- Results are ordered by `timestamp` in descending order.
- The database connection is always closed in a finally block to ensure cleanup.
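The query construction described in the notes above can be examined on its own. The sketch below differs from the library in one respect that should be stated plainly: the library interpolates `int(limit)` into the SQL text, whereas this version binds the limit as a parameter as well. The function name `build_query` is an assumption for the example.

```python
def build_query(page=None, status=None, limit=None):
    """Return (sql, params) for the optional page/status/limit filters."""
    query = "SELECT id, page, error, traceback, timestamp, status, type FROM errors"
    filters, params = [], []
    if page:
        filters.append("page = ?")
        params.append(page)
    if status:
        filters.append("status = ?")
        params.append(status)
    if filters:
        query += " WHERE " + " AND ".join(filters)
    query += " ORDER BY timestamp DESC"
    if limit:
        query += " LIMIT ?"
        params.append(int(limit))  # raises ValueError for non-numeric input
    return query, params


sql, params = build_query(page="home", status="critical", limit="5")

try:
    build_query(limit="many")
except ValueError:
    bad_limit_rejected = True
else:
    bad_limit_rejected = False
```

The returned pair can be passed straight to `cursor.execute(sql, params)`.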
741class HealthCheckService: 742 """ 743 A background-capable health monitoring service for a Streamlit-based application. 744 This class periodically executes a configurable set of checks (system metrics, 745 external dependencies, Streamlit server and pages, and user-registered custom checks), 746 aggregates their results, and exposes a sanitized health snapshot suitable for UI 747 display or remote monitoring. 748 749 Primary responsibilities 750 751 - Load and persist a JSON configuration that defines check intervals, thresholds, 752 dependencies to probe, and Streamlit connection settings. 753 - Run periodic checks in a dedicated background thread (start/stop semantics). 754 - Collect system metrics (CPU, memory, disk) using psutil and apply configurable 755 warning/critical thresholds. 756 - Probe configured HTTP API endpoints and (placeholder) database checks. 757 - Verify Streamlit server liveness by calling a /healthz endpoint and inspect 758 Streamlit page errors via StreamlitPageMonitor. 759 - Allow callers to register synchronous custom checks (functions returning dicts). 760 - Compute an aggregated overall status (critical > warning > unknown > healthy). 761 - Provide a sanitized snapshot of health data with function references removed for safe 762 serialization/display. 
763 764 Usage (high level) 765 766 - Instantiate: svc = HealthCheckService(config_path="path/to/config.json") 767 - Optionally register custom checks: svc.register_custom_check("my_check", my_check_func) 768 where my_check_func() -> Dict[str, Any] 769 - Start background monitoring: svc.start() 770 - Stop monitoring: svc.stop() 771 - Retrieve current health snapshot for display or API responses: svc.get_health_data() 772 - Persist any changes to configuration: svc.save_config() 773 774 Configuration (JSON) 775 776 - check_interval: int (seconds) — how often to run the checks (default 60) 777 - streamlit_url: str — base host (default "http://localhost") 778 - streamlit_port: int — port for Streamlit server (default 8501) 779 - system_checks: { "cpu": bool, "memory": bool, "disk": bool } 780 - dependencies: 781 - api_endpoints: list of { "name": str, "url": str, "timeout": int } 782 - databases: list of { "name": str, "type": str, "connection_string": str } 783 - thresholds: 784 - cpu_warning, cpu_critical, memory_warning, memory_critical, disk_warning, disk_critical 785 786 Health data structure (conceptual) 787 788 - last_updated: ISO timestamp 789 - system: { "cpu": {...}, "memory": {...}, "disk": {...} } 790 - dependencies: { "<name>": {...}, ... } 791 - custom_checks: { "<name>": {...} } (get_health_data() strips callable references) 792 - streamlit_server: {status, response_code/latency/error, message, url} 793 - streamlit_pages: {status, error_count, errors, details} 794 - overall_status: "healthy" | "warning" | "critical" | "unknown" 795 796 Threading and safety 797 798 - The service runs checks in a daemon thread started by start(). stop() signals the 799 thread to terminate and joins with a short timeout. Clients should avoid modifying 800 internal structures concurrently; get_health_data() returns a sanitized snapshot 801 appropriate for concurrent reads. 
802 803 Custom checks 804 805 - register_custom_check(name, func): registers a synchronous function that returns a 806 dict describing the check result (must include a "status" key with one of the 807 recognized values). The service stores the function reference internally but returns 808 sanitized results via get_health_data(). 809 810 Error handling and logging 811 812 - Individual checks catch exceptions and surface errors in the corresponding 813 health_data entry with status "critical" where appropriate. 814 - The Streamlit UI integration (st.* calls) is used for user-visible error messages 815 when loading/saving configuration; the service also logs events to its configured 816 logger. 817 818 Extensibility notes 819 820 - Database checks are left as placeholders; implement _check_database for specific DB 821 drivers/connections. 822 - Custom checks are synchronous; if long-running checks are required, adapt the 823 registration/run pattern to use async or worker pools. 824 """ 825 def __init__(self, config_path: str = "health_check_config.json"): 826 """ 827 Initializes the HealthCheckService instance. 828 829 Args: 830 config_path (str): Path to the health check configuration file. Defaults to "health_check_config.json". 831 832 Attributes: 833 834 - logger (logging.Logger): Logger for the HealthCheckService. 835 - config_path (str): Path to the configuration file. 836 - health_data (Dict[str, Any]): Dictionary storing health check data. 837 - config (dict): Loaded configuration from the config file. 838 - check_interval (int): Interval in seconds between health checks. Defaults to 60. 839 - _running (bool): Indicates if the health check service is running. 840 - _thread (threading.Thread or None): Thread running the health check loop. 841 - streamlit_url (str): URL of the Streamlit service. Defaults to "http://localhost". 842 - streamlit_port (int): Port of the Streamlit service. Defaults to 8501. 
843 """ 844 self.logger = logging.getLogger(f"{__name__}.HealthCheckService") 845 self.logger.info("Initializing HealthCheckService") 846 self.config_path = config_path 847 self.health_data: Dict[str, Any] = { 848 "last_updated": None, 849 "system": {}, 850 "dependencies": {}, 851 "custom_checks": {}, 852 "overall_status": "unknown" 853 } 854 self.config = self._load_config() 855 self.check_interval = self.config.get("check_interval", 60) # Default: 60 seconds 856 self._running = False 857 self._thread = None 858 self.streamlit_url = self.config.get("streamlit_url", "http://localhost") 859 self.streamlit_port = self.config.get("streamlit_port", 8501) # Default: 8501 860 def _load_config(self) -> Dict: 861 """Load health check configuration from file.""" 862 if os.path.exists(self.config_path): 863 try: 864 with open(self.config_path, "r") as f: 865 return json.load(f) 866 except Exception as e: 867 st.error(f"Error loading health check config: {str(e)}") 868 return self._get_default_config() 869 else: 870 return self._get_default_config() 871 872 def _get_default_config(self) -> Dict: 873 """Return default health check configuration.""" 874 return { 875 "check_interval": 60, 876 "streamlit_url": "http://localhost", 877 "streamlit_port": 8501, 878 "system_checks": { 879 "cpu": True, 880 "memory": True, 881 "disk": True 882 }, 883 "dependencies": { 884 "api_endpoints": [ 885 # Example API endpoint to check 886 {"name": "example_api", "url": "https://httpbin.org/get", "timeout": 5} 887 ], 888 "databases": [ 889 # Example database connection to check 890 {"name": "main_db", "type": "postgres", "connection_string": "..."} 891 ] 892 }, 893 "thresholds": { 894 "cpu_warning": 70, 895 "cpu_critical": 90, 896 "memory_warning": 70, 897 "memory_critical": 90, 898 "disk_warning": 70, 899 "disk_critical": 90 900 } 901 } 902 903 def start(self): 904 """ 905 Start the periodic health-check background thread. 
906 If the `healthcheck` runner is already active, this method is a no-op and returns 907 immediately. Otherwise, it marks the runner as running, creates a daemon thread 908 targeting self._run_checks_periodically, stores the thread on self._thread, and 909 starts it. 910 911 Behavior and side effects: 912 913 - Idempotent while running: repeated calls will not create additional threads. 914 - Sets self._running to True. 915 - Assigns a daemon threading.Thread to self._thread and starts it. 916 - Non-blocking: returns after starting the background thread. 917 - The daemon thread will not prevent the process from exiting. 918 919 Thread-safety: 920 921 - If start() may be called concurrently from multiple threads, callers should 922 ensure proper synchronization (e.g., external locking) to avoid race conditions. 923 924 Returns: 925 926 None 927 """ 928 929 if self._running: 930 return 931 932 self._running = True 933 self._thread = threading.Thread(target=self._run_checks_periodically, daemon=True) 934 self._thread.start() 935 936 def stop(self): 937 """Stop the health check service.""" 938 self._running = False 939 if self._thread: 940 self._thread.join(timeout=1) 941 942 def _run_checks_periodically(self): 943 """Run health checks periodically based on check interval.""" 944 while self._running: 945 self.run_all_checks() 946 time.sleep(self.check_interval) 947 948 def run_all_checks(self): 949 """Run all configured health checks and update health data.""" 950 # Update timestamp 951 self.health_data["last_updated"] = datetime.now().isoformat() 952 953 # Check Streamlit server 954 self.health_data["streamlit_server"] = self.check_streamlit_server() 955 956 # System checks 957 if self.config["system_checks"].get("cpu", True): 958 self.check_cpu() 959 if self.config["system_checks"].get("memory", True): 960 self.check_memory() 961 if self.config["system_checks"].get("disk", True): 962 self.check_disk() 963 964 # Rest of the existing checks... 
        self.check_dependencies()
        self.run_custom_checks()
        self.check_streamlit_pages()
        self._update_overall_status()

    def check_cpu(self):
        """
        Checks the current CPU usage and updates the health status based on configured thresholds.
        Measures the CPU usage percentage over a 1-second interval using psutil. Compares the result
        against warning and critical thresholds defined in the configuration. Sets the status to
        'healthy', 'warning', or 'critical' accordingly, and updates the health data dictionary.

        Returns:

            None
        """

        cpu_percent = psutil.cpu_percent(interval=1)
        warning_threshold = self.config["thresholds"].get("cpu_warning", 70)
        critical_threshold = self.config["thresholds"].get("cpu_critical", 90)

        status = "healthy"
        if cpu_percent >= critical_threshold:
            status = "critical"
        elif cpu_percent >= warning_threshold:
            status = "warning"

        self.health_data["system"]["cpu"] = {
            "usage_percent": cpu_percent,
            "status": status
        }

    def check_memory(self):
        """
        Checks the system's memory usage and updates the health status accordingly.
        Retrieves the current memory usage statistics using psutil, compares the usage percentage
        against configured warning and critical thresholds, and sets the memory status to 'healthy',
        'warning', or 'critical'. Updates the health_data dictionary with total memory, available memory,
        usage percentage, and status.

        Returns:

            None
        """

        memory = psutil.virtual_memory()
        memory_percent = memory.percent
        warning_threshold = self.config["thresholds"].get("memory_warning", 70)
        critical_threshold = self.config["thresholds"].get("memory_critical", 90)

        status = "healthy"
        if memory_percent >= critical_threshold:
            status = "critical"
        elif memory_percent >= warning_threshold:
            status = "warning"

        self.health_data["system"]["memory"] = {
            "total_gb": round(memory.total / (1024**3), 2),
            "available_gb": round(memory.available / (1024**3), 2),
            "usage_percent": memory_percent,
            "status": status
        }

    def check_disk(self):
        """
        Checks the disk usage of the root filesystem and updates the health status.
        Retrieves disk usage statistics using psutil, compares the usage percentage
        against configured warning and critical thresholds, and sets the disk status
        accordingly (`healthy`, `warning`, or `critical`). Updates the health_data
        dictionary with total disk size, free space, usage percentage, and status.

        Returns:

            None
        """

        disk = psutil.disk_usage('/')
        disk_percent = disk.percent
        warning_threshold = self.config["thresholds"].get("disk_warning", 70)
        critical_threshold = self.config["thresholds"].get("disk_critical", 90)

        status = "healthy"
        if disk_percent >= critical_threshold:
            status = "critical"
        elif disk_percent >= warning_threshold:
            status = "warning"

        self.health_data["system"]["disk"] = {
            "total_gb": round(disk.total / (1024**3), 2),
            "free_gb": round(disk.free / (1024**3), 2),
            "usage_percent": disk_percent,
            "status": status
        }

    def check_dependencies(self):
        """
        Checks the health of configured dependencies, including API endpoints and databases.
        Iterates through the list of API endpoints and databases specified in the configuration,
        and performs health checks on each by invoking the corresponding internal methods.

        Failed probes are not raised to the caller: _check_api_endpoint catches request
        errors and records a "critical" entry in self.health_data["dependencies"], and
        _check_database currently records an "unknown" placeholder entry.
        """

        # Check API endpoints
        for endpoint in self.config["dependencies"].get("api_endpoints", []):
            self._check_api_endpoint(endpoint)

        # Check database connections
        for db in self.config["dependencies"].get("databases", []):
            self._check_database(db)

    def _check_api_endpoint(self, endpoint: Dict):
        """
        Check if an API endpoint is accessible.

        Args:

            endpoint: Dictionary with endpoint configuration
        """
        name = endpoint.get("name", "unknown_api")
        url = endpoint.get("url", "")
        timeout = endpoint.get("timeout", 5)

        if not url:
            return

        try:
            start_time = time.time()
            response = requests.get(url, timeout=timeout)
            response_time = time.time() - start_time

            status = "healthy" if response.status_code < 400 else "critical"

            self.health_data["dependencies"][name] = {
                "type": "api",
                "url": url,
                "status": status,
                "response_time_ms": round(response_time * 1000, 2),
                "status_code": response.status_code
            }
        except Exception as e:
            self.health_data["dependencies"][name] = {
                "type": "api",
                "url": url,
                "status": "critical",
                "error": str(e)
            }

    def _check_database(self, db_config: Dict):
        """
        Check database connection.
        Note: This is a placeholder. You'll need to implement specific database checks
        based on your application's needs.

        Args:

            db_config: Dictionary with database configuration
        """
        name = db_config.get("name", "unknown_db")
        db_type = db_config.get("type", "")

        # Placeholder for database connection check
        # In a real implementation, you would check the specific database connection
        self.health_data["dependencies"][name] = {
            "type": "database",
            "db_type": db_type,
            "status": "unknown",
            "message": "Database check not implemented"
        }

    def register_custom_check(self, name: str, check_func: Callable[[], Dict[str, Any]]):
        """
        Register a custom health check function.

        Args:

            name: Name of the custom check
            check_func: Function that performs the check and returns a dictionary with results
        """
        if "custom_checks" not in self.health_data:
            self.health_data["custom_checks"] = {}

        self.health_data["custom_checks"][name] = {
            "status": "unknown",
            "check_func": check_func
        }

    def run_custom_checks(self):
        """Run all registered custom health checks."""
        if "custom_checks" not in self.health_data:
            return

        for name, check_info in list(self.health_data["custom_checks"].items()):
            if "check_func" in check_info and callable(check_info["check_func"]):
                try:
                    result = check_info["check_func"]()
                    # Remove the function reference from the result
                    func = check_info["check_func"]
                    self.health_data["custom_checks"][name] = result
                    # Add the function back
                    self.health_data["custom_checks"][name]["check_func"] = func
                except Exception as e:
                    self.health_data["custom_checks"][name] = {
                        "status": "critical",
                        "error": str(e),
                        "check_func": check_info["check_func"]
                    }

    def _update_overall_status(self):
        """
        Updates the overall health status of the application based on the statuses of various components.

        The method checks the health status of the following components:
        - Streamlit server
        - System checks
        - Dependencies
        - Custom checks (excluding those with a 'check_func' key)
        - Streamlit pages

        The overall status is determined using the following priority order:
        1. "critical" if any component is critical
        2. "warning" if any component is warning and none are critical
        3. "unknown" if any component is unknown, none are critical or warning, and no healthy components exist
        4. "healthy" if any component is healthy and none are critical or warning (unknown components do not override healthy ones)
        5. "unknown" if no statuses are found

        The result is stored in `self.health_data["overall_status"]`.
        """

        has_critical = False
        has_warning = False
        has_healthy = False
        has_unknown = False

        # Helper function to check status
        def check_component_status(status):
            nonlocal has_critical, has_warning, has_healthy, has_unknown
            if status == "critical":
                has_critical = True
            elif status == "warning":
                has_warning = True
            elif status == "healthy":
                has_healthy = True
            elif status == "unknown":
                has_unknown = True

        # Check Streamlit server status
        server_status = self.health_data.get("streamlit_server", {}).get("status")
        check_component_status(server_status)

        # Check system status
        for system_check in self.health_data.get("system", {}).values():
            check_component_status(system_check.get("status"))

        # Check dependencies status
        for dep_check in self.health_data.get("dependencies", {}).values():
            check_component_status(dep_check.get("status"))

        # Check custom checks status
        for custom_check in self.health_data.get("custom_checks", {}).values():
            if isinstance(custom_check, dict) and "check_func" not in custom_check:
                check_component_status(custom_check.get("status"))

        # Check Streamlit pages status
        pages_status = self.health_data.get("streamlit_pages", {}).get("status")
        check_component_status(pages_status)

        # Determine overall status with priority:
        # critical > warning > healthy; unknown only when nothing is healthy
        if has_critical:
            self.health_data["overall_status"] = "critical"
        elif has_warning:
            self.health_data["overall_status"] = "warning"
        elif has_unknown and not has_healthy:
            self.health_data["overall_status"] = "unknown"
        elif has_healthy:
            self.health_data["overall_status"] = "healthy"
        else:
            self.health_data["overall_status"] = "unknown"

    def get_health_data(self) -> Dict:
        """Get the latest health check data."""
        # Create a copy without the function references
        result: Dict[str, Any] = {}
        for key, value in self.health_data.items():
            if key == "custom_checks":
                result[key] = {}
                for check_name, check_data in value.items():
                    if isinstance(check_data, dict):
                        check_copy = check_data.copy()
                        if "check_func" in check_copy:
                            del check_copy["check_func"]
                        result[key][check_name] = check_copy
            else:
                result[key] = value
        return result

    def save_config(self):
        """
        Saves the current health check configuration to a JSON file.
        Attempts to write the configuration stored in `self.config` to the file specified by `self.config_path`.
        Displays a success message in the Streamlit app upon successful save.
        Handles and displays appropriate error messages for file not found, permission issues, JSON decoding errors, and other exceptions.

        Errors are caught and reported through st.error rather than raised to the caller:

            FileNotFoundError: For example, when the parent directory of the path does not exist.
            PermissionError: If there are insufficient permissions to write to the file.
            json.JSONDecodeError: Handled for completeness; json.dump encodes rather than decodes, so this branch is not expected to trigger.
            Exception: Any other exception encountered during the save process.
        """

        try:
            with open(self.config_path, "w") as f:
                json.dump(self.config, f, indent=2)
            st.success(f"Health check config saved successfully to {self.config_path}")
        except FileNotFoundError:
            st.error(f"Configuration file not found: {self.config_path}")
        except PermissionError:
            st.error(f"Permission denied: Unable to write to {self.config_path}")
        except json.JSONDecodeError:
            st.error(f"Error decoding JSON in config file: {self.config_path}")
        except Exception as e:
            st.error(f"Error saving health check config: {str(e)}")

    def check_streamlit_pages(self):
        """
        Checks for errors in Streamlit pages and updates the health data accordingly.
        This method retrieves page errors using StreamlitPageMonitor.get_page_errors().
        If errors are found, it sets the 'streamlit_pages' status to 'critical' and updates
        the overall health status to 'critical'. If no errors are found, it marks the
        'streamlit_pages' status as 'healthy'.

        Updates:

            self.health_data["streamlit_pages"]: Dict containing status, error count, errors, and details.
            self.health_data["overall_status"]: Set to 'critical' if errors are detected.
            self.health_data["streamlit_pages"]["details"]: A summary of the errors found.

        Returns:

            None
        """

        page_errors = StreamlitPageMonitor.get_page_errors()

        if "streamlit_pages" not in self.health_data:
            self.health_data["streamlit_pages"] = {}

        if page_errors:
            total_errors = sum(len(errors) for errors in page_errors.values())
            self.health_data["streamlit_pages"] = {
                "status": "critical",
                "error_count": total_errors,
                "errors": page_errors,
                "details": "Errors detected in Streamlit pages"
            }
            # This affects overall status
            self.health_data["overall_status"] = "critical"
        else:
            self.health_data["streamlit_pages"] = {
                "status": "healthy",
                "error_count": 0,
                "errors": {},
                "details": "All pages functioning normally"
            }

    def check_streamlit_server(self) -> Dict[str, Any]:
        """
        Checks the health status of the Streamlit server by sending a GET request to the /healthz endpoint.

        Returns:

            Dict[str, Any]: A dictionary containing the health status, response code, latency in milliseconds,
            message, and the URL checked. If the server is healthy (HTTP 200), status is "healthy".
            Otherwise, status is "critical" with error details.

        Handles:

            - Connection errors: Returns critical status with connection error details.
            - Timeout errors: Returns critical status with timeout error details.
            - Other exceptions: Returns critical status with unknown error details.

        Logs:

            - The URL being checked.
            - The response status code and text.
            - Health status and response time if healthy.
            - Warnings and errors for unhealthy or failed checks.
        """

        try:
            host = self.streamlit_url.rstrip('/')
            if not host.startswith(('http://', 'https://')):
                host = f"http://{host}"

            url = f"{host}:{self.streamlit_port}/healthz"
            self.logger.info(f"Checking Streamlit server health at: {url}")

            start_time = time.time()
            response = requests.get(url, timeout=3)
            total_time = (time.time() - start_time) * 1000
            self.logger.info(f"{response.status_code} - {response.text}")
            # Check if the response is healthy
            if response.status_code == 200:
                self.logger.info(f"Streamlit server healthy - Response time: {round(total_time, 2)}ms")
                return {
                    "status": "healthy",
                    "response_code": response.status_code,
                    "latency_ms": round(total_time, 2),
                    "message": "Streamlit server is running",
                    "url": url
                }
            else:
                self.logger.warning(f"Unhealthy response from server: {response.status_code}")
                return {
                    "status": "critical",
                    "response_code": response.status_code,
                    "error": f"Unhealthy response from server: {response.status_code}",
                    "message": "Streamlit server is not healthy",
                    "url": url
                }

        except requests.exceptions.ConnectionError as e:
            self.logger.error(f"Connection error while checking Streamlit server: {str(e)}")
            return {
                "status": "critical",
                "error": f"Connection error: {str(e)}",
                "message": "Cannot connect to Streamlit server",
                "url": url
            }
        except requests.exceptions.Timeout as e:
            self.logger.error(f"Timeout while checking Streamlit server: {str(e)}")
            return {
                "status": "critical",
                "error": f"Timeout error: {str(e)}",
                "message": "Streamlit server is not responding",
                "url": url
            }
        except Exception as e:
            self.logger.error(f"Unexpected error while checking Streamlit server: {str(e)}")
            return {
                "status": "critical",
                "error": f"Unknown error: {str(e)}",
                "message": "Failed to check Streamlit server",
                "url": url
            }
A background-capable health monitoring service for a Streamlit-based application. This class periodically executes a configurable set of checks (system metrics, external dependencies, Streamlit server and pages, and user-registered custom checks), aggregates their results, and exposes a sanitized health snapshot suitable for UI display or remote monitoring.
Primary responsibilities
- Load and persist a JSON configuration that defines check intervals, thresholds, dependencies to probe, and Streamlit connection settings.
- Run periodic checks in a dedicated background thread (start/stop semantics).
- Collect system metrics (CPU, memory, disk) using psutil and apply configurable warning/critical thresholds.
- Probe configured HTTP API endpoints and (placeholder) database checks.
- Verify Streamlit server liveness by calling a /healthz endpoint and inspect Streamlit page errors via StreamlitPageMonitor.
- Allow callers to register synchronous custom checks (functions returning dicts).
- Compute an aggregated overall status (critical takes precedence over warning, and warning over healthy; unknown is reported only when no component is healthy).
- Provide a sanitized snapshot of health data with function references removed for safe serialization/display.
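The aggregation rule can be sketched as a small pure function. This is a minimal illustration that mirrors the branch order of `_update_overall_status` in the source listing; the name `aggregate_status` is hypothetical and not part of the module.

```python
from typing import Iterable, Optional


def aggregate_status(statuses: Iterable[Optional[str]]) -> str:
    """Collapse component statuses into a single overall status.

    Hypothetical helper: same precedence as the listing's
    _update_overall_status, written without the nonlocal flags.
    """
    seen = set(statuses)
    if "critical" in seen:
        return "critical"
    if "warning" in seen:
        return "warning"
    if "healthy" in seen:
        # An "unknown" component does not override a "healthy" one.
        return "healthy"
    # Either only "unknown" was seen or no recognized status at all.
    return "unknown"
```

One consequence worth noting: a database entry that always reports "unknown" (the current placeholder) does not prevent an overall "healthy" result as long as some other component is healthy.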
Usage (high level)
- Instantiate: svc = HealthCheckService(config_path="path/to/config.json")
- Optionally register custom checks: svc.register_custom_check("my_check", my_check_func) where my_check_func() -> Dict[str, Any]
- Start background monitoring: svc.start()
- Stop monitoring: svc.stop()
- Retrieve current health snapshot for display or API responses: svc.get_health_data()
- Persist any changes to configuration: svc.save_config()
Configuration (JSON)
- check_interval: int (seconds) — how often to run the checks (default 60)
- streamlit_url: str — base host (default "http://localhost")
- streamlit_port: int — port for Streamlit server (default 8501)
- system_checks: { "cpu": bool, "memory": bool, "disk": bool }
- dependencies:
- api_endpoints: list of { "name": str, "url": str, "timeout": int }
- databases: list of { "name": str, "type": str, "connection_string": str }
- thresholds:
- cpu_warning, cpu_critical, memory_warning, memory_critical, disk_warning, disk_critical
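As a rough sketch of what such a file might contain, the snippet below builds a configuration dictionary with the documented keys and writes it as JSON. The helper names `build_default_config` and `write_config` and the example endpoint are assumptions for illustration; the threshold values 70 and 90 are the fallbacks used in the source listing.

```python
import json
from typing import Any, Dict


def build_default_config() -> Dict[str, Any]:
    """Return a config dict using the documented keys and defaults."""
    return {
        "check_interval": 60,
        "streamlit_url": "http://localhost",
        "streamlit_port": 8501,
        "system_checks": {"cpu": True, "memory": True, "disk": True},
        "dependencies": {
            # Hypothetical endpoint, shown only to illustrate the shape.
            "api_endpoints": [
                {"name": "example_api", "url": "https://example.com/health", "timeout": 5}
            ],
            "databases": [],
        },
        "thresholds": {
            "cpu_warning": 70, "cpu_critical": 90,
            "memory_warning": 70, "memory_critical": 90,
            "disk_warning": 70, "disk_critical": 90,
        },
    }


def write_config(config: Dict[str, Any], path: str) -> None:
    """Write the config as indented JSON, matching save_config's json.dump call."""
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
```

Calling `write_config(build_default_config(), "health_check_config.json")` would produce a file at the default path that the service constructor reads.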
Health data structure (conceptual)
- last_updated: ISO timestamp
- system: { "cpu": {...}, "memory": {...}, "disk": {...} }
- dependencies: { "<name>": {...}, ... }
- custom_checks: { "<name>": {...} } (get_health_data() strips callable references)
- streamlit_server: {status, response_code/latency/error, message, url}
- streamlit_pages: {status, error_count, errors, details}
- overall_status: "healthy" | "warning" | "critical" | "unknown"
Threading and safety
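- The service runs checks in a daemon thread started by start(). stop() clears the running flag and joins the thread with a 1-second timeout, so a check that is still in progress may outlive the call. No locking is used internally, so clients should avoid modifying internal structures concurrently. get_health_data() returns a shallow copy with callable references removed; nested dictionaries are still shared with the service, so treat the snapshot as read-only.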
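For comparison, here is a minimal sketch of a start/stop pattern that wakes the worker promptly on shutdown by waiting on a `threading.Event` instead of a boolean flag. The class `PeriodicRunner` is hypothetical and is not how the module is implemented; the body of `_run_checks_periodically` is not shown in this section.

```python
import threading
from typing import Callable, Optional


class PeriodicRunner:
    """Hypothetical runner: calls `task` every `interval` seconds until stopped."""

    def __init__(self, task: Callable[[], None], interval: float):
        self._task = task
        self._interval = interval
        self._stop_event = threading.Event()
        self._lock = threading.Lock()
        self._thread: Optional[threading.Thread] = None

    def start(self) -> None:
        # The lock makes concurrent start() calls safe without external locking.
        with self._lock:
            if self._thread is not None and self._thread.is_alive():
                return
            self._stop_event.clear()
            self._thread = threading.Thread(target=self._loop, daemon=True)
            self._thread.start()

    def _loop(self) -> None:
        while not self._stop_event.is_set():
            self._task()
            # wait() returns as soon as stop() sets the event,
            # so shutdown does not have to sit out a full interval.
            self._stop_event.wait(self._interval)

    def stop(self, timeout: float = 1.0) -> None:
        self._stop_event.set()
        thread = self._thread
        if thread is not None:
            thread.join(timeout=timeout)

    def is_running(self) -> bool:
        return self._thread is not None and self._thread.is_alive()
```

The design choice here is that the sleep between runs doubles as the shutdown signal, which keeps `stop()` fast even with a long interval. A task that is already executing still has to finish before the thread exits.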
Custom checks
- register_custom_check(name, func): registers a synchronous function that returns a dict describing the check result (must include a "status" key with one of the recognized values). The service stores the function reference internally but returns sanitized results via get_health_data().
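A custom check is just a zero-argument callable that returns a dictionary. The sketch below shows one possible shape; the function name, its parameters, and the extra keys are illustrative assumptions, and only the "status" key with a recognized value comes from the description above.

```python
import shutil
from typing import Any, Dict


def disk_headroom_check(path: str = ".", min_free_gb: float = 1.0) -> Dict[str, Any]:
    """Hypothetical custom check: warn when free space at `path` drops below a floor."""
    usage = shutil.disk_usage(path)
    free_gb = round(usage.free / (1024 ** 3), 2)
    status = "healthy" if free_gb >= min_free_gb else "warning"
    # "status" must be one of: healthy, warning, critical, unknown.
    return {"status": status, "free_gb": free_gb, "min_free_gb": min_free_gb}
```

With a service instance `svc`, registration would then look like `svc.register_custom_check("disk_headroom", disk_headroom_check)`, since the defaults make the function callable without arguments.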
Error handling and logging
- Individual checks catch exceptions and surface errors in the corresponding health_data entry with status "critical" where appropriate.
- The Streamlit UI integration (st.* calls) is used for user-visible error messages when loading/saving configuration; the service also logs events to its configured logger.
Extensibility notes
- Database checks are left as placeholders; implement _check_database for specific DB drivers/connections.
- Custom checks are synchronous; if long-running checks are required, adapt the registration/run pattern to use async or worker pools.
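As one possible way to fill in the database placeholder, the sketch below probes a SQLite database with a trivial query and returns an entry shaped like the other dependency results. It is an assumption-laden illustration: the function name is hypothetical, and it assumes that for SQLite the `connection_string` field holds a file path (or `:memory:`).

```python
import sqlite3
import time
from typing import Any, Dict


def check_sqlite_database(db_config: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical SQLite probe returning a dependency-style result dict."""
    path = db_config.get("connection_string", "")
    entry: Dict[str, Any] = {"type": "database", "db_type": "sqlite"}
    if not path:
        # Nothing to probe; report unknown rather than guessing.
        entry.update(status="unknown", message="No connection_string configured")
        return entry
    try:
        start = time.time()
        conn = sqlite3.connect(path, timeout=db_config.get("timeout", 5))
        try:
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
        elapsed_ms = round((time.time() - start) * 1000, 2)
        entry.update(status="healthy", response_time_ms=elapsed_ms)
    except sqlite3.Error as exc:
        entry.update(status="critical", error=str(exc))
    return entry
```

Note that `sqlite3.connect` creates a missing database file when the parent directory exists, so this probe confirms that the file can be opened and queried, not that it already held data.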
    def __init__(self, config_path: str = "health_check_config.json"):
        """
        Initializes the HealthCheckService instance.

        Args:
            config_path (str): Path to the health check configuration file. Defaults to "health_check_config.json".

        Attributes:

            - logger (logging.Logger): Logger for the HealthCheckService.
            - config_path (str): Path to the configuration file.
            - health_data (Dict[str, Any]): Dictionary storing health check data.
            - config (dict): Loaded configuration from the config file.
            - check_interval (int): Interval in seconds between health checks. Defaults to 60.
            - _running (bool): Indicates if the health check service is running.
            - _thread (threading.Thread or None): Thread running the health check loop.
            - streamlit_url (str): URL of the Streamlit service. Defaults to "http://localhost".
            - streamlit_port (int): Port of the Streamlit service. Defaults to 8501.
        """
        self.logger = logging.getLogger(f"{__name__}.HealthCheckService")
        self.logger.info("Initializing HealthCheckService")
        self.config_path = config_path
        self.health_data: Dict[str, Any] = {
            "last_updated": None,
            "system": {},
            "dependencies": {},
            "custom_checks": {},
            "overall_status": "unknown"
        }
        self.config = self._load_config()
        self.check_interval = self.config.get("check_interval", 60)  # Default: 60 seconds
        self._running = False
        self._thread = None
        self.streamlit_url = self.config.get("streamlit_url", "http://localhost")
        self.streamlit_port = self.config.get("streamlit_port", 8501)  # Default: 8501
Initializes the HealthCheckService instance.
Args: config_path (str): Path to the health check configuration file. Defaults to "health_check_config.json".
Attributes:
- logger (logging.Logger): Logger for the HealthCheckService.
- config_path (str): Path to the configuration file.
- health_data (Dict[str, Any]): Dictionary storing health check data.
- config (dict): Loaded configuration from the config file.
- check_interval (int): Interval in seconds between health checks. Defaults to 60.
- _running (bool): Indicates if the health check service is running.
- _thread (threading.Thread or None): Thread running the health check loop.
- streamlit_url (str): URL of the Streamlit service. Defaults to "http://localhost".
- streamlit_port (int): Port of the Streamlit service. Defaults to 8501.
    def start(self):
        """
        Start the periodic health-check background thread.
        If the `healthcheck` runner is already active, this method is a no-op and returns
        immediately. Otherwise, it marks the runner as running, creates a daemon thread
        targeting self._run_checks_periodically, stores the thread on self._thread, and
        starts it.

        Behavior and side effects:

        - Idempotent while running: repeated calls will not create additional threads.
        - Sets self._running to True.
        - Assigns a daemon threading.Thread to self._thread and starts it.
        - Non-blocking: returns after starting the background thread.
        - The daemon thread will not prevent the process from exiting.

        Thread-safety:

        - If start() may be called concurrently from multiple threads, callers should
          ensure proper synchronization (e.g., external locking) to avoid race conditions.

        Returns:

            None
        """

        if self._running:
            return

        self._running = True
        self._thread = threading.Thread(target=self._run_checks_periodically, daemon=True)
        self._thread.start()
Start the periodic health-check background thread.
If the healthcheck runner is already active, this method is a no-op and returns
immediately. Otherwise, it marks the runner as running, creates a daemon thread
targeting self._run_checks_periodically, stores the thread on self._thread, and
starts it.
Behavior and side effects:
- Idempotent while running: repeated calls will not create additional threads.
- Sets self._running to True.
- Assigns a daemon threading.Thread to self._thread and starts it.
- Non-blocking: returns after starting the background thread.
- The daemon thread will not prevent the process from exiting.
Thread-safety:
- If start() may be called concurrently from multiple threads, callers should ensure proper synchronization (e.g., external locking) to avoid race conditions.
Returns:
None
    def stop(self):
        """Stop the health check service."""
        self._running = False
        if self._thread:
            self._thread.join(timeout=1)
Stop the health check service. Clears the running flag and joins the background thread with a 1-second timeout.
    def run_all_checks(self):
        """Run all configured health checks and update health data."""
        # Update timestamp
        self.health_data["last_updated"] = datetime.now().isoformat()

        # Check Streamlit server
        self.health_data["streamlit_server"] = self.check_streamlit_server()

        # System checks
        if self.config["system_checks"].get("cpu", True):
            self.check_cpu()
        if self.config["system_checks"].get("memory", True):
            self.check_memory()
        if self.config["system_checks"].get("disk", True):
            self.check_disk()

        # Rest of the existing checks...
        self.check_dependencies()
        self.run_custom_checks()
        self.check_streamlit_pages()
        self._update_overall_status()
Run all configured health checks and update health data.
    def check_cpu(self):
        """
        Checks the current CPU usage and updates the health status based on configured thresholds.
        Measures the CPU usage percentage over a 1-second interval using psutil. Compares the result
        against warning and critical thresholds defined in the configuration. Sets the status to
        'healthy', 'warning', or 'critical' accordingly, and updates the health data dictionary.

        Returns:

            None
        """

        cpu_percent = psutil.cpu_percent(interval=1)
        warning_threshold = self.config["thresholds"].get("cpu_warning", 70)
        critical_threshold = self.config["thresholds"].get("cpu_critical", 90)

        status = "healthy"
        if cpu_percent >= critical_threshold:
            status = "critical"
        elif cpu_percent >= warning_threshold:
            status = "warning"

        self.health_data["system"]["cpu"] = {
            "usage_percent": cpu_percent,
            "status": status
        }
Checks the current CPU usage and updates the health status based on configured thresholds. Measures the CPU usage percentage over a 1-second interval using psutil. Compares the result against warning and critical thresholds defined in the configuration. Sets the status to 'healthy', 'warning', or 'critical' accordingly, and updates the health data dictionary.
Returns:
None
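The threshold comparison shared by the CPU, memory, and disk checks can be sketched as a single helper. The function `classify_usage` is hypothetical; the default values 70 and 90 are the fallbacks the source listing uses when a threshold is absent from the configuration.

```python
def classify_usage(percent: float, warning: float = 70, critical: float = 90) -> str:
    """Map a usage percentage to a status using inclusive thresholds."""
    # The critical threshold is tested first so it wins when both are exceeded.
    if percent >= critical:
        return "critical"
    if percent >= warning:
        return "warning"
    return "healthy"
```

Both comparisons are inclusive, so a reading exactly equal to a threshold already counts as having reached it.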
    def check_memory(self):
        """
        Checks the system's memory usage and updates the health status accordingly.
        Retrieves the current memory usage statistics using psutil, compares the usage percentage
        against configured warning and critical thresholds, and sets the memory status to 'healthy',
        'warning', or 'critical'. Updates the health_data dictionary with total memory, available memory,
        usage percentage, and status.

        Returns:

            None
        """

        memory = psutil.virtual_memory()
        memory_percent = memory.percent
        warning_threshold = self.config["thresholds"].get("memory_warning", 70)
        critical_threshold = self.config["thresholds"].get("memory_critical", 90)

        status = "healthy"
        if memory_percent >= critical_threshold:
            status = "critical"
        elif memory_percent >= warning_threshold:
            status = "warning"

        self.health_data["system"]["memory"] = {
            "total_gb": round(memory.total / (1024**3), 2),
            "available_gb": round(memory.available / (1024**3), 2),
            "usage_percent": memory_percent,
            "status": status
        }
Checks the system's memory usage and updates the health status accordingly. Retrieves the current memory usage statistics using psutil, compares the usage percentage against configured warning and critical thresholds, and sets the memory status to 'healthy', 'warning', or 'critical'. Updates the health_data dictionary with total memory, available memory, usage percentage, and status.
Returns:
None
    def check_disk(self):
        """
        Checks the disk usage of the root filesystem and updates the health status.
        Retrieves disk usage statistics using psutil, compares the usage percentage
        against configured warning and critical thresholds, and sets the disk status
        accordingly (`healthy`, `warning`, or `critical`). Updates the health_data
        dictionary with total disk size, free space, usage percentage, and status.

        Returns:

            None
        """

        disk = psutil.disk_usage('/')
        disk_percent = disk.percent
        warning_threshold = self.config["thresholds"].get("disk_warning", 70)
        critical_threshold = self.config["thresholds"].get("disk_critical", 90)

        status = "healthy"
        if disk_percent >= critical_threshold:
            status = "critical"
        elif disk_percent >= warning_threshold:
            status = "warning"

        self.health_data["system"]["disk"] = {
            "total_gb": round(disk.total / (1024**3), 2),
            "free_gb": round(disk.free / (1024**3), 2),
            "usage_percent": disk_percent,
            "status": status
        }
Checks the disk usage of the root filesystem and updates the health status.
Retrieves disk usage statistics using psutil, compares the usage percentage
against configured warning and critical thresholds, and sets the disk status
accordingly (healthy, warning, or critical). Updates the health_data
dictionary with total disk size, free space, usage percentage, and status.
Returns:
None
    def check_dependencies(self):
        """
        Checks the health of configured dependencies, including API endpoints and databases.
        Iterates through the list of API endpoints and databases specified in the configuration,
        and performs health checks on each by invoking the corresponding internal methods.

        Failed probes are not raised to the caller: _check_api_endpoint catches request
        errors and records a "critical" entry in self.health_data["dependencies"], and
        _check_database currently records an "unknown" placeholder entry.
        """

        # Check API endpoints
        for endpoint in self.config["dependencies"].get("api_endpoints", []):
            self._check_api_endpoint(endpoint)

        # Check database connections
        for db in self.config["dependencies"].get("databases", []):
            self._check_database(db)
Checks the health of configured dependencies, including API endpoints and databases. Iterates through the list of API endpoints and databases specified in the configuration, and performs health checks on each by invoking the corresponding internal methods.
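Failed probes are not raised to the caller: _check_api_endpoint catches request errors and records a "critical" entry in self.health_data["dependencies"], and _check_database currently records an "unknown" placeholder entry.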
    def register_custom_check(self, name: str, check_func: Callable[[], Dict[str, Any]]):
        """
        Register a custom health check function.

        Args:

            name: Name of the custom check
            check_func: Function that performs the check and returns a dictionary with results
        """
        if "custom_checks" not in self.health_data:
            self.health_data["custom_checks"] = {}

        self.health_data["custom_checks"][name] = {
            "status": "unknown",
            "check_func": check_func
        }
Register a custom health check function.
Args:
name: Name of the custom check
check_func: Function that performs the check and returns a dictionary with results
    def run_custom_checks(self):
        """Run all registered custom health checks."""
        if "custom_checks" not in self.health_data:
            return

        for name, check_info in list(self.health_data["custom_checks"].items()):
            if "check_func" in check_info and callable(check_info["check_func"]):
                try:
                    result = check_info["check_func"]()
                    # Remove the function reference from the result
                    func = check_info["check_func"]
                    self.health_data["custom_checks"][name] = result
                    # Add the function back
                    self.health_data["custom_checks"][name]["check_func"] = func
                except Exception as e:
                    self.health_data["custom_checks"][name] = {
                        "status": "critical",
                        "error": str(e),
                        "check_func": check_info["check_func"]
                    }
Run all registered custom health checks.
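In the source listing, run_custom_checks stores check_func back into every result entry, while _update_overall_status skips custom-check entries that contain a check_func key. As far as the listing shows, that combination means custom check results would not influence overall_status. One way to avoid the coupling is to keep callables and results in separate dictionaries; the class below is a hypothetical sketch of that idea, not the module's implementation.

```python
from typing import Any, Callable, Dict

CheckFunc = Callable[[], Dict[str, Any]]


class CustomCheckRegistry:
    """Hypothetical registry that stores callables apart from their results."""

    def __init__(self) -> None:
        self._funcs: Dict[str, CheckFunc] = {}
        self.results: Dict[str, Dict[str, Any]] = {}

    def register(self, name: str, func: CheckFunc) -> None:
        self._funcs[name] = func
        self.results[name] = {"status": "unknown"}

    def run_all(self) -> None:
        for name, func in list(self._funcs.items()):
            try:
                result = func()
                if not isinstance(result, dict):
                    raise TypeError(f"check {name!r} must return a dict")
                # Copy so later mutation by the check cannot alter stored results.
                self.results[name] = dict(result)
            except Exception as exc:
                # Same convention as the listing: a failing check is critical.
                self.results[name] = {"status": "critical", "error": str(exc)}
```

Because `results` never contains callables, it can be serialized directly and every entry can be counted when computing an overall status.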
    def get_health_data(self) -> Dict:
        """Get the latest health check data."""
        # Create a copy without the function references
        result: Dict[str, Any] = {}
        for key, value in self.health_data.items():
            if key == "custom_checks":
                result[key] = {}
                for check_name, check_data in value.items():
                    if isinstance(check_data, dict):
                        check_copy = check_data.copy()
                        if "check_func" in check_copy:
                            del check_copy["check_func"]
                        result[key][check_name] = check_copy
            else:
                result[key] = value
        return result
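Get the latest health check data. Returns a new dictionary in which each custom_checks entry is copied with its check_func reference removed; all other values are returned as stored.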
    def save_config(self):
        """
        Saves the current health check configuration to a JSON file.
        Attempts to write the configuration stored in `self.config` to the file specified by `self.config_path`.
        Displays a success message in the Streamlit app upon successful save.
        Handles and displays appropriate error messages for file not found, permission issues, JSON decoding errors, and other exceptions.

        Errors are caught and reported through st.error rather than raised to the caller:

            FileNotFoundError: For example, when the parent directory of the path does not exist.
            PermissionError: If there are insufficient permissions to write to the file.
            json.JSONDecodeError: Handled for completeness; json.dump encodes rather than decodes, so this branch is not expected to trigger.
            Exception: Any other exception encountered during the save process.
        """

        try:
            with open(self.config_path, "w") as f:
                json.dump(self.config, f, indent=2)
            st.success(f"Health check config saved successfully to {self.config_path}")
        except FileNotFoundError:
            st.error(f"Configuration file not found: {self.config_path}")
        except PermissionError:
            st.error(f"Permission denied: Unable to write to {self.config_path}")
        except json.JSONDecodeError:
            st.error(f"Error decoding JSON in config file: {self.config_path}")
        except Exception as e:
            st.error(f"Error saving health check config: {str(e)}")
Saves the current health check configuration to a JSON file.
Attempts to write the configuration stored in self.config to the file specified by self.config_path.
Displays a success message in the Streamlit app upon successful save.
Handles and displays appropriate error messages for file not found, permission issues, JSON decoding errors, and other exceptions.
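Errors are caught and reported through st.error rather than raised to the caller:
FileNotFoundError: For example, when the parent directory of the path does not exist.
PermissionError: If there are insufficient permissions to write to the file.
json.JSONDecodeError: Handled for completeness; json.dump encodes rather than decodes, so this branch is not expected to trigger.
Exception: Any other exception encountered during the save process.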
    def check_streamlit_pages(self):
        """
        Checks for errors in Streamlit pages and updates the health data accordingly.
        This method retrieves page errors using StreamlitPageMonitor.get_page_errors().
        If errors are found, it sets the 'streamlit_pages' status to 'critical' and updates
        the overall health status to 'critical'. If no errors are found, it marks the
        'streamlit_pages' status as 'healthy'.

        Updates:

            self.health_data["streamlit_pages"]: Dict containing status, error count, errors, and details.
            self.health_data["overall_status"]: Set to 'critical' if errors are detected.
            self.health_data["streamlit_pages"]["details"]: A summary of the errors found.

        Returns:

            None
        """

        page_errors = StreamlitPageMonitor.get_page_errors()

        if "streamlit_pages" not in self.health_data:
            self.health_data["streamlit_pages"] = {}

        if page_errors:
            total_errors = sum(len(errors) for errors in page_errors.values())
            self.health_data["streamlit_pages"] = {
                "status": "critical",
                "error_count": total_errors,
                "errors": page_errors,
                "details": "Errors detected in Streamlit pages"
            }
            # This affects overall status
            self.health_data["overall_status"] = "critical"
        else:
            self.health_data["streamlit_pages"] = {
                "status": "healthy",
                "error_count": 0,
                "errors": {},
                "details": "All pages functioning normally"
            }
Checks for errors in Streamlit pages and updates the health data accordingly. This method retrieves page errors using StreamlitPageMonitor.get_page_errors(). If errors are found, it sets the 'streamlit_pages' status to 'critical' and updates the overall health status to 'critical'. If no errors are found, it marks the 'streamlit_pages' status as 'healthy'.
Updates:
self.health_data["streamlit_pages"]: Dict containing status, error count, errors, and details.
self.health_data["overall_status"]: Set to 'critical' if errors are detected.
self.health_data["streamlit_pages"]["details"]: A summary of the errors found.
Returns:
None
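The aggregation step can be shown in isolation. The sketch below assumes that page errors arrive as a dictionary mapping a page name to a list of error records; the function name `summarize_page_errors` and the sample data are hypothetical and used only for illustration.

```python
from typing import Any, Dict, List


def summarize_page_errors(page_errors: Dict[str, List[Any]]) -> Dict[str, Any]:
    """Build a 'streamlit_pages' style entry from errors grouped by page name."""
    if page_errors:
        # Count every recorded error across all pages.
        total_errors = sum(len(errors) for errors in page_errors.values())
        return {
            "status": "critical",
            "error_count": total_errors,
            "errors": page_errors,
            "details": "Errors detected in Streamlit pages",
        }
    return {
        "status": "healthy",
        "error_count": 0,
        "errors": {},
        "details": "All pages functioning normally",
    }


# Hypothetical sample: two pages with three errors in total.
sample_errors = {
    "pages/home.py": [{"error": "boom"}, {"error": "bang"}],
    "pages/about.py": [{"error": "oops"}],
}
critical_summary = summarize_page_errors(sample_errors)
healthy_summary = summarize_page_errors({})
```

Because the branch tests the dictionary itself, any non-empty mapping counts as "errors found", even if every list in it is empty.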
    def check_streamlit_server(self) -> Dict[str, Any]:
        """
        Checks the health status of the Streamlit server by sending a GET request to the /healthz endpoint.

        Returns:

            Dict[str, Any]: A dictionary containing the health status, response code, latency in milliseconds,
            message, and the URL checked. If the server is healthy (HTTP 200), status is "healthy".
            Otherwise, status is "critical" with error details.

        Handles:

            - Connection errors: Returns critical status with connection error details.
            - Timeout errors: Returns critical status with timeout error details.
            - Other exceptions: Returns critical status with unknown error details.

        Logs:

            - The URL being checked.
            - The response status code and text.
            - Health status and response time if healthy.
            - Warnings and errors for unhealthy or failed checks.
        """

        try:
            host = self.streamlit_url.rstrip('/')
            if not host.startswith(('http://', 'https://')):
                host = f"http://{host}"

            url = f"{host}:{self.streamlit_port}/healthz"
            self.logger.info(f"Checking Streamlit server health at: {url}")

            start_time = time.time()
            response = requests.get(url, timeout=3)
            total_time = (time.time() - start_time) * 1000
            self.logger.info(f"{response.status_code} - {response.text}")
            # Check if the response is healthy
            if response.status_code == 200:
                self.logger.info(f"Streamlit server healthy - Response time: {round(total_time, 2)}ms")
                return {
                    "status": "healthy",
                    "response_code": response.status_code,
                    "latency_ms": round(total_time, 2),
                    "message": "Streamlit server is running",
                    "url": url
                }
            else:
                self.logger.warning(f"Unhealthy response from server: {response.status_code}")
                return {
                    "status": "critical",
                    "response_code": response.status_code,
                    "error": f"Unhealthy response from server: {response.status_code}",
                    "message": "Streamlit server is not healthy",
                    "url": url
                }

        except requests.exceptions.ConnectionError as e:
            self.logger.error(f"Connection error while checking Streamlit server: {str(e)}")
            return {
                "status": "critical",
                "error": f"Connection error: {str(e)}",
                "message": "Cannot connect to Streamlit server",
                "url": url
            }
        except requests.exceptions.Timeout as e:
            self.logger.error(f"Timeout while checking Streamlit server: {str(e)}")
            return {
                "status": "critical",
                "error": f"Timeout error: {str(e)}",
                "message": "Streamlit server is not responding",
                "url": url
            }
        except Exception as e:
            self.logger.error(f"Unexpected error while checking Streamlit server: {str(e)}")
            return {
                "status": "critical",
                "error": f"Unknown error: {str(e)}",
                "message": "Failed to check Streamlit server",
                "url": url
            }
Checks the health status of the Streamlit server by sending a GET request to the /healthz endpoint.
Returns:
Dict[str, Any]: A dictionary containing the health status, response code, latency in milliseconds,
message, and the URL checked. If the server is healthy (HTTP 200), status is "healthy".
Otherwise, status is "critical" with error details.
Handles:
- Connection errors: Returns critical status with connection error details.
- Timeout errors: Returns critical status with timeout error details.
- Other exceptions: Returns critical status with unknown error details.
Logs:
- The URL being checked.
- The response status code and text.
- Health status and response time if healthy.
- Warnings and errors for unhealthy or failed checks.
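Two pieces of this check can be exercised without a running server: how the /healthz URL is assembled, and how a response is turned into a result dictionary. The sketch below isolates both under hypothetical function names (`build_health_url`, `classify_response`); it performs no network request and is not the library's API.

```python
from typing import Any, Dict


def build_health_url(streamlit_url: str, streamlit_port: int) -> str:
    """Normalize the host and append the port and the /healthz path."""
    host = streamlit_url.rstrip("/")
    if not host.startswith(("http://", "https://")):
        # Default to plain HTTP when no scheme is given.
        host = f"http://{host}"
    return f"{host}:{streamlit_port}/healthz"


def classify_response(status_code: int, latency_ms: float, url: str) -> Dict[str, Any]:
    """Map an HTTP status code to a 'healthy' or 'critical' result."""
    if status_code == 200:
        return {
            "status": "healthy",
            "response_code": status_code,
            "latency_ms": round(latency_ms, 2),
            "message": "Streamlit server is running",
            "url": url,
        }
    return {
        "status": "critical",
        "response_code": status_code,
        "error": f"Unhealthy response from server: {status_code}",
        "message": "Streamlit server is not healthy",
        "url": url,
    }


local_url = build_health_url("localhost/", 8501)
healthy_result = classify_response(200, 12.5, local_url)
unhealthy_result = classify_response(503, 40.0, local_url)
```

Note that only the healthy result carries a `latency_ms` key, so callers should check for its presence before displaying a response time.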
def health_check(config_path:str = "health_check_config.json"):
    """
    Displays an interactive Streamlit dashboard for monitoring application health.
    This function initializes and manages a health check service, presenting real-time system metrics,
    dependency statuses, custom checks, and Streamlit page health in a user-friendly dashboard.
    Users can manually refresh health checks, view detailed error information, and adjust configuration
    thresholds and intervals directly from the UI.

    Args:

        config_path (str, optional): Path to the health check configuration JSON file.
            Defaults to "health_check_config.json".

    Features:

        - Displays overall health status with color-coded indicators.
        - Shows last updated timestamp for health data.
        - Monitors Streamlit server status, latency, and errors.
        - Provides tabs for:
            * System Resources (CPU, Memory, Disk usage and status)
            * Dependencies (external services and their health)
            * Custom Checks (user-defined health checks)
            * Streamlit Pages (page-specific errors and status)
        - Allows configuration of system thresholds, check intervals, and Streamlit server settings.
        - Supports manual refresh and saving configuration changes.

    Raises:

        Displays error messages in the UI for any exceptions encountered during health data retrieval or processing.

    Returns:

        None. The dashboard is rendered in the Streamlit app.
    """

    logger = logging.getLogger(f"{__name__}.health_check")
    logger.info("Starting health check dashboard")
    st.title("Application Health Dashboard")

    # Initialize or get the health check service
    if "health_service" not in st.session_state:
        logger.info("Initializing new health check service")
        st.session_state.health_service = HealthCheckService(config_path = config_path)
        st.session_state.health_service.start()

    health_service = st.session_state.health_service
    health_service.run_all_checks()

    # Add controls for manual refresh and configuration
    col1, col2 = st.columns([3, 1])
    with col1:
        st.subheader("System Health Status")
    with col2:
        if st.button("Refresh Now"):
            health_service.run_all_checks()

    # Get the latest health data
    health_data = health_service.get_health_data()

    # Display overall status with appropriate color
    overall_status = health_data.get("overall_status", "unknown")
    status_color = {
        "healthy": "green",
        "warning": "orange",
        "critical": "red",
        "unknown": "gray"
    }.get(overall_status, "gray")

    st.markdown(
        f"<h3 style='color: {status_color};'>Overall Status: {overall_status.upper()}</h3>",
        unsafe_allow_html=True
    )

    # Display last updated time
    if health_data.get("last_updated"):
        try:
            last_updated = datetime.fromisoformat(health_data["last_updated"])
            st.text(f"Last updated: {last_updated.strftime('%Y-%m-%d %H:%M:%S')}")
        except Exception as e:
            st.error(f"Last updated: {health_data['last_updated']}")
            st.exception(e)

    server_health = health_data.get("streamlit_server", {})
    server_status = server_health.get("status", "unknown")
    server_color = {
        "healthy": "green",
        "critical": "red",
        "unknown": "gray"
    }.get(server_status, "gray")

    st.markdown(
        f"### Streamlit Server Status: <span style='color: {server_color}'>{server_status.upper()}</span>",
        unsafe_allow_html=True
    )

    if server_status != "healthy":
        st.error(server_health.get("message", "Server status unknown"))
        if "error" in server_health:
            st.code(server_health["error"])
    else:
        st.success(server_health.get("message", "Server is running"))
        if "latency_ms" in server_health:
            latency = server_health["latency_ms"]
            # Define color based on latency thresholds
            if latency <= 50:
                latency_color = "green"
                performance = "Excellent"
            elif latency <= 100:
                latency_color = "blue"
                performance = "Good"
            elif latency <= 200:
                latency_color = "orange"
                performance = "Fair"
            else:
                latency_color = "red"
                performance = "Poor"

            st.markdown(
                f"""
                <div style='display: flex; align-items: center; gap: 10px;'>
                    <div>Server Response Time:</div>
                    <div style='color: {latency_color}; font-weight: bold;'>
                        {latency} ms
                    </div>
                    <div style='color: {latency_color};'>
                        ({performance})
                    </div>
                </div>
                """,
                unsafe_allow_html=True
            )

    # Create tabs for different categories of health checks
    tab1, tab2, tab3, tab4 = st.tabs(["System Resources", "Dependencies", "Custom Checks", "Streamlit Pages"])

    with tab1:
        # Display system health checks
        system_data = health_data.get("system", {})

        # CPU
        if "cpu" in system_data:
            cpu_data = system_data["cpu"]
            cpu_status = cpu_data.get("status", "unknown")
            cpu_color = {"healthy": "green", "warning": "orange", "critical": "red"}.get(cpu_status, "gray")

            st.markdown(f"### CPU Status: <span style='color:{cpu_color}'>{cpu_status.upper()}</span>", unsafe_allow_html=True)
            st.progress(cpu_data.get("usage_percent", 0) / 100)
            st.text(f"CPU Usage: {cpu_data.get('usage_percent', 0)}%")

        # Memory
        if "memory" in system_data:
            memory_data = system_data["memory"]
            memory_status = memory_data.get("status", "unknown")
            memory_color = {"healthy": "green", "warning": "orange", "critical": "red"}.get(memory_status, "gray")

            st.markdown(f"### Memory Status: <span style='color:{memory_color}'>{memory_status.upper()}</span>", unsafe_allow_html=True)
            st.progress(memory_data.get("usage_percent", 0) / 100)
            st.text(f"Memory Usage: {memory_data.get('usage_percent', 0)}%")
            st.text(f"Total Memory: {memory_data.get('total_gb', 0)} GB")
            st.text(f"Available Memory: {memory_data.get('available_gb', 0)} GB")

        # Disk
        if "disk" in system_data:
            disk_data = system_data["disk"]
            disk_status = disk_data.get("status", "unknown")
            disk_color = {"healthy": "green", "warning": "orange", "critical": "red"}.get(disk_status, "gray")

            st.markdown(f"### Disk Status: <span style='color:{disk_color}'>{disk_status.upper()}</span>", unsafe_allow_html=True)
            st.progress(disk_data.get("usage_percent", 0) / 100)
            st.text(f"Disk Usage: {disk_data.get('usage_percent', 0)}%")
            st.text(f"Total Disk Space: {disk_data.get('total_gb', 0)} GB")
            st.text(f"Free Disk Space: {disk_data.get('free_gb', 0)} GB")

    with tab2:
        # Display dependency health checks
        dependencies = health_data.get("dependencies", {})
        if dependencies:
            # Create a dataframe for all dependencies
            dep_data = []
            for name, dep_info in dependencies.items():
                dep_data.append({
                    "Name": name,
                    "Type": dep_info.get("type", "unknown"),
                    "Status": dep_info.get("status", "unknown"),
                    "Details": ", ".join([f"{k}: {v}" for k, v in dep_info.items()
                                          if k not in ["name", "type", "status", "error"] and not isinstance(v, dict)])
                })

            # Show dependencies table
            if dep_data:
                df_deps = pd.DataFrame(dep_data)
                st.dataframe(df_deps)
            else:
                st.info("No dependencies configured")

            # Create a dataframe for all custom checks from health_data
            custom_checks = health_data.get("custom_checks", {})
            check_data = []
            for name, check_info in custom_checks.items():
                if isinstance(check_info, dict) and "check_func" not in check_info:
                    check_data.append({
                        "Name": name,
                        "Status": check_info.get("status", "unknown"),
                        "Details": ", ".join([f"{k}: {v}" for k, v in check_info.items()
                                              if k not in ["name", "status", "check_func", "error"] and not isinstance(v, dict)]),
                        "Error": check_info.get("error", "")
                    })

            if check_data:
                df_checks = pd.DataFrame(check_data)

                # Apply color formatting to status column
                def color_status(val):
                    colors = {
                        "healthy": "background-color: #c6efce; color: #006100",
                        "warning": "background-color: #ffeb9c; color: #9c5700",
                        "critical": "background-color: #ffc7ce; color: #9c0006",
                        "unknown": "background-color: #eeeeee; color: #7f7f7f"
                    }
                    return colors.get(str(val).lower(), "")

                # Use styled dataframe to color the Status column
                try:
                    # apply expects a function that returns a sequence of styles for the column;
                    # map color_status across the 'Status' column to produce the CSS strings.
                    st.dataframe(
                        df_checks.style.apply(
                            lambda col: col.map(color_status),
                            subset=["Status"]
                        )
                    )
                except Exception:
                    # Fallback if styling isn't supported in the environment
                    st.dataframe(df_checks)
            else:
                st.info("No custom checks configured")
        else:
            st.info("No custom checks configured")
    with tab4:
        # Always read page errors from SQLite DB for latest state
        page_errors = StreamlitPageMonitor.get_page_errors()
        error_count = sum(len(errors) for errors in page_errors.values())
        status = "critical" if error_count > 0 else "healthy"
        status_color = {
            "healthy": "green",
            "critical": "red",
            "unknown": "gray"
        }.get(status, "gray")
        st.markdown(f"### Page Status: <span style='color:{status_color}'>{status.upper()}</span>", unsafe_allow_html=True)
        st.metric("Error Count", error_count)
        if error_count > 0:
            st.markdown("<div style='background-color:#ffe6e6; color:#b30000; padding:10px; border-radius:5px; border:1px solid #b30000; font-weight:bold;'>Pages with errors:</div>",
                        unsafe_allow_html=True)
            for page_name, page_errors_list in page_errors.items():
                display_name = page_name.split("/")[-1] if "/" in page_name else page_name
                for error_info in page_errors_list:
                    if isinstance(error_info, dict):
                        with st.expander(f"Error in {display_name}"):
                            st.info(error_info.get('error', 'Unknown error'))
                            if error_info.get('type') == 'streamlit_error':
                                st.text("Type: Streamlit Error")
                            else:
                                st.text("Type: Exception")
                            st.text("Traceback:")
                            st.code("".join(error_info.get('traceback', ['No traceback available'])))
                            st.text(f"Timestamp: {error_info.get('timestamp', 'No timestamp')}")

    # Configuration section
    with st.expander("Health Check Configuration"):
        st.subheader("System Check Thresholds")

        col1, col2 = st.columns(2)
        with col1:
            cpu_warning = st.slider("CPU Warning Threshold (%)",
                                    min_value=10, max_value=90,
                                    value=health_service.config["thresholds"].get("cpu_warning", 70),
                                    step=5)
            memory_warning = st.slider("Memory Warning Threshold (%)",
                                       min_value=10, max_value=90,
                                       value=health_service.config["thresholds"].get("memory_warning", 70),
                                       step=5)
            disk_warning = st.slider("Disk Warning Threshold (%)",
                                     min_value=10, max_value=90,
                                     value=health_service.config["thresholds"].get("disk_warning", 70),
                                     step=5)
            streamlit_url_update = st.text_input(
                "Streamlit Server URL",
                value=health_service.config.get("streamlit_url", "http://localhost")
            )

        with col2:
            cpu_critical = st.slider("CPU Critical Threshold (%)",
                                     min_value=20, max_value=95,
                                     value=health_service.config["thresholds"].get("cpu_critical", 90),
                                     step=5)
            memory_critical = st.slider("Memory Critical Threshold (%)",
                                        min_value=20, max_value=95,
                                        value=health_service.config["thresholds"].get("memory_critical", 90),
                                        step=5)
            disk_critical = st.slider("Disk Critical Threshold (%)",
                                      min_value=20, max_value=95,
                                      value=health_service.config["thresholds"].get("disk_critical", 90),
                                      step=5)

        check_interval = st.slider("Check Interval (seconds)",
                                   min_value=10, max_value=300,
                                   value=health_service.config.get("check_interval", 60),
                                   step=10)
        streamlit_port_update = st.number_input(
            "Streamlit Server Port",
            value=health_service.config.get("streamlit_port", 8501),
            step=1
        )

        if st.button("Save Configuration"):
            # Update configuration
            health_service.config["thresholds"]["cpu_warning"] = cpu_warning
            health_service.config["thresholds"]["cpu_critical"] = cpu_critical
            health_service.config["thresholds"]["memory_warning"] = memory_warning
            health_service.config["thresholds"]["memory_critical"] = memory_critical
            health_service.config["thresholds"]["disk_warning"] = disk_warning
            health_service.config["thresholds"]["disk_critical"] = disk_critical
            health_service.config["check_interval"] = check_interval
            health_service.config["streamlit_url"] = streamlit_url_update
            health_service.config["streamlit_port"] = streamlit_port_update

            # Save to file
            health_service.save_config()
            st.success("Configuration saved successfully")

            # Restart the service if interval changed
            health_service.stop()
            health_service.start()
Displays an interactive Streamlit dashboard for monitoring application health. This function initializes and manages a health check service, presenting real-time system metrics, dependency statuses, custom checks, and Streamlit page health in a user-friendly dashboard. Users can manually refresh health checks, view detailed error information, and adjust configuration thresholds and intervals directly from the UI.
Args:
config_path (str, optional): Path to the health check configuration JSON file.
Defaults to "health_check_config.json".
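For orientation, the dashboard's configuration controls read the keys shown below and fall back to these values when a key is absent. The sketch assumes the JSON file mirrors the in-memory configuration dictionary; the actual file may contain further keys (for example dependency or custom check definitions) that are not covered here.

```python
import json

# Keys and fallback values used by the dashboard's configuration controls.
example_config = {
    "thresholds": {
        "cpu_warning": 70,
        "cpu_critical": 90,
        "memory_warning": 70,
        "memory_critical": 90,
        "disk_warning": 70,
        "disk_critical": 90,
    },
    "check_interval": 60,  # seconds between checks
    "streamlit_url": "http://localhost",
    "streamlit_port": 8501,
}

# Serialize to the text that would be stored in the configuration file.
config_text = json.dumps(example_config, indent=2)
round_tripped = json.loads(config_text)
```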
Features:
- Displays overall health status with color-coded indicators.
- Shows last updated timestamp for health data.
- Monitors Streamlit server status, latency, and errors.
- Provides tabs for:
* System Resources (CPU, Memory, Disk usage and status)
* Dependencies (external services and their health)
* Custom Checks (user-defined health checks)
* Streamlit Pages (page-specific errors and status)
- Allows configuration of system thresholds, check intervals, and Streamlit server settings.
- Supports manual refresh and saving configuration changes.
Raises:
Displays error messages in the UI for any exceptions encountered during health data retrieval or processing.
Returns:
None. The dashboard is rendered in the Streamlit app.
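The latency rating shown next to the server status follows fixed thresholds of 50, 100, and 200 ms. A standalone sketch of that mapping follows; the function name `classify_latency` is hypothetical, since the dashboard computes the rating inline.

```python
from typing import Tuple


def classify_latency(latency_ms: float) -> Tuple[str, str]:
    """Return the (color, rating) pair for a server response time in milliseconds."""
    if latency_ms <= 50:
        return ("green", "Excellent")
    if latency_ms <= 100:
        return ("blue", "Good")
    if latency_ms <= 200:
        return ("orange", "Fair")
    return ("red", "Poor")


# Boundary values fall into the lower (better) bucket because the comparisons use <=.
ratings = {ms: classify_latency(ms) for ms in (50, 75, 200, 201)}
```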