Combining Discrete Event Simulation with Reinforcement Learning
X-Lab
3 min read
September 9, 2024



There are three main types of system simulation methods: discrete event simulation (DES), agent-based modeling (ABM), and system dynamics (SD). Here we focus on DES.
What is Discrete Event Simulation (DES)?
Discrete event simulation (DES) is a method for modeling the behavior and performance of a system. It represents the operation of the system as a sequence of discrete events in time; each event occurs at a particular instant and marks a change in the system's state.
DES involves scheduling and executing the events that change the system's state. The simulation advances by processing these events one at a time, and it may include random elements to capture variability in system behavior. A further DES concept is the simulation clock, which records the current time: when the clock reaches an event's scheduled moment, the event fires.
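As a minimal illustration of this clock-and-event idea (a hypothetical sketch, separate from the hospital example later in this article), the SimPy snippet below schedules a future event with each timeout, and the run loop processes events in time order while the clock jumps from one event to the next:

import simpy

def arrival_process(env):
    # Each timeout schedules the next arrival event; env.now is the simulation clock
    while True:
        print(f"Patient arrives at time {env.now}")
        yield env.timeout(10)  # the next arrival fires 10 time units later

env = simpy.Environment()
env.process(arrival_process(env))
env.run(until=30)  # process events until the clock reaches 30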

An "event" is anything that happens that changes the state of the system. The term "discrete" means the focus is only on the specific moments at which these events occur, ignoring the irrelevant stretches of time in between. Examples of such events include customer arrivals and departures, the allocation and release of resources, and sudden, unexpected occurrences that affect operations, such as earthquakes.

Given limited resources, a common scenario in discrete event simulation involves "waiting" for a resource. This can be observed, for example, when patients wait in a reception area for a medical consultation, or when people queue at a service counter to buy something.
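A short sketch of this "waiting for a resource" pattern in SimPy (the single doctor and the timings here are made up purely for illustration): patients request a shared Resource, queue while it is busy, and proceed once it becomes free.

import simpy

def patient(env, name, doctor):
    arrival = env.now
    with doctor.request() as req:   # join the queue for the doctor
        yield req                   # wait until the doctor is free
        print(f"{name} waited {env.now - arrival} time units")
        yield env.timeout(5)        # consultation takes 5 time units

env = simpy.Environment()
doctor = simpy.Resource(env, capacity=1)  # a single doctor, so patients must queue
for i in range(3):
    env.process(patient(env, f"patient-{i}", doctor))
env.run()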
Combining Reinforcement Learning with Discrete Event Simulation
With the definition of discrete event simulation in hand, another term enters the picture: reinforcement learning.
Reinforcement learning is a machine learning approach in which an agent learns to make decisions by interacting with an environment. The agent develops a policy that maximizes the total reward it receives for the actions it takes.

In reinforcement learning (RL), the agent learns to make good decisions by trying different actions and observing how they affect rewards over time. Unlike supervised learning, which provides the correct answers, RL uses rewards to indicate how well or poorly the agent's actions worked, and the agent must figure this out on its own from those rewards. In RL, an action affects both the immediate reward and future rewards. Because the environment provides only limited information, the agent learns from its own experience, gradually improving its actions to fit the environment better.
Training an RL agent can be time-consuming, often requiring hundreds of thousands of steps to find a near-optimal policy. This becomes even harder in a model-free Markov decision process (MDP), where the agent must estimate the probabilities of different outcomes from its interactions with the environment rather than having them predefined.
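The full example later in this article uses tabular Q-learning with an epsilon-greedy policy. As a standalone sketch of that core idea (the helper names and parameters here are illustrative, not from any particular library), the update is Q(s, a) ← Q(s, a) + α · (r + γ · max_a' Q(s', a') − Q(s, a)):

import random

def epsilon_greedy(q_table, state, actions, epsilon=0.1):
    # With probability epsilon explore a random action, otherwise exploit the best known one
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

def q_update(q_table, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    # Move the estimate for (state, action) toward the observed reward plus discounted future value
    best_next = max((q_table.get((next_state, a), 0.0) for a in actions), default=0.0)
    old_value = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old_value + alpha * (reward + gamma * best_next - old_value)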
Simulation is therefore an important tool for reinforcement learning: it gives the RL agent a controllable and reproducible environment to learn in. This reduces the risk and cost of having the agent interact directly with complex or potentially dangerous real-world scenarios. Discrete event simulation (DES) is particularly valuable for modeling environments where events occur at specific, discrete points in time, providing a structured yet adaptable framework for RL to operate in.
Scenario Setup
Consider a hospital: different departments each have their own patient queues, with varying service times and levels of urgency. The goal is to minimize overall waiting time while ensuring that emergency cases receive attention first. In this setting, a DES model lets an RL agent experiment with different patient-flow management strategies, optimize the order of service, and prioritize patients effectively, all in a risk-free virtual environment.
The DES Hospital Queue Model
Patient arrivals: Simulate patient arrivals, with key attributes including urgency, department, and estimated service time.
Queue management: Each department has its own queue. Traditional queue management might be first-come, first-served or based on fixed priority rules.
Service: Simulate the service process, in which patients are treated by department staff.
RL Integration
State: Define the system state, including the number of patients in each queue, the patients currently being served, and the urgency of the waiting patients.
Action: At each decision point (for example, after serving a patient, when the next one must be chosen), the action is selecting the next patient to serve from any of the queues.
Reward: Design a reward function that penalizes long waits, especially for urgent cases, and may reward short waits for non-urgent cases (see the sketch after this list).
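One possible reward shape along these lines (a minimal sketch; the full script below uses a similar formula in its calculate_reward method) weights the waiting-time penalty by urgency, so urgent patients who wait a long time are penalized hardest:

def reward(wait_time, urgency, base_reward=100):
    # urgency 1 = most urgent, 3 = least urgent, so the weight is largest for urgent cases
    urgency_weight = 4 - urgency
    return base_reward - urgency_weight * wait_time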
Implementation Steps
Simulate the environment with DES: Use DES to model patient flow and the service mechanism. The DES handles the dynamics of patient arrivals, waiting, and service.
Apply RL for decision-making: Use an RL agent to learn the best policy for selecting patients from the queues. The agent observes the state of the queues and receives rewards based on the outcomes of its actions (its patient selections).
Training: The RL agent learns from every interaction (each completed service and the choice of the next patient). Over time, it identifies the patterns and policies that best optimize queue management.
Integration: Embed the RL decision process into the DES so that, at each decision point, the RL agent is consulted to choose the next patient to serve.
Unlike static rules, an RL agent can adapt to changing conditions, such as a sudden surge of patients or a change in department availability. RL can also balance multiple objectives, for example maximizing care for emergency cases while minimizing waiting time. And as data keeps accumulating, the system can continue to improve, adapting to new patient arrival patterns or changes in hospital operations.
Experiment Design
To simulate a discrete event system such as the hospital queueing problem in Python, you can use the SimPy library. SimPy is the standard process-based discrete-event simulation framework for Python.
pip install simpy
Create an environment with Anaconda, or use your local environment:
conda create -n RL-DES python=3.10
Let's start with this simple example:
import simpy
import random
import numpy as np
from collections import deque, defaultdict

# Global counter for patient IDs
patient_counter = 0

# Define the Patient class, representing a patient in the system
class Patient:
    def __init__(self, env, urgency, department, service_time):
        global patient_counter
        self.id = patient_counter  # Assign a unique ID to each patient
        patient_counter += 1
        self.env = env
        self.urgency = urgency  # Urgency level of the patient (1 is most urgent, 3 is least)
        self.department = department  # Department the patient is assigned to
        self.service_time = service_time  # Time required to serve the patient
        self.arrival_time = env.now  # Time when the patient arrives
        self.wait_time = None  # Time the patient spends waiting before being served
# Define the Department class, representing a department in the hospital
class Department:
    def __init__(self, env, name, rl_agent):
        self.env = env
        self.name = name  # Name of the department
        self.queue = deque()  # Queue to hold patients waiting in the department
        self.rl_agent = rl_agent  # Reference to the RL agent for decision-making
        self.action = env.process(self.run())  # Start the department's main process
        self.served_patients = []  # List to store wait times of served patients
        self.patient_ids = []  # List to store IDs of patients waiting in the department

    def run(self):
        # Main loop for serving patients in the department
        while True:
            if self.queue:
                # Get the current state and select a patient based on urgency and arrival time (FCFS)
                state = self.rl_agent.get_state(self)
                patient = self.select_patient(state)
                if patient:
                    # Serve the selected patient and calculate wait time
                    yield self.env.timeout(patient.service_time)
                    patient.wait_time = self.env.now - patient.arrival_time
                    self.served_patients.append(patient.wait_time)
                    # Update the RL agent's Q-table with the observed reward
                    next_state = self.rl_agent.get_state(self)
                    reward = self.calculate_reward(patient)
                    self.rl_agent.update_q_table(state, self.queue.index(patient), reward, next_state)
                    # Log patient service information
                    print(
                        f"Patient {patient.id} with urgency {patient.urgency} served in {self.name} at time {self.env.now} after waiting {patient.wait_time}")
                    # Remove the patient from the queue and the patient ID list
                    self.queue.remove(patient)
                    self.patient_ids.remove(patient.id)
            else:
                # If the queue is empty, wait for a short time before checking again
                yield self.env.timeout(1)

    def add_patient(self, patient):
        # Add a new patient to the department's queue and track their ID
        self.queue.append(patient)
        self.patient_ids.append(patient.id)
        print(f"Patient {patient.id} with urgency {patient.urgency} arrives at {self.name} at time {self.env.now}")

    def calculate_reward(self, patient):
        # Calculate the reward based on the patient's urgency and wait time
        # The reward is higher for shorter wait times and more urgent patients
        base_reward = 100  # A base reward for serving the patient
        wait_time_penalty = -patient.wait_time  # Negative impact of waiting time
        urgency_multiplier = 4 - patient.urgency  # Urgency multiplier (higher for more urgent patients)
        # Final reward calculation
        reward = base_reward + (urgency_multiplier * wait_time_penalty)
        return reward

    def get_average_wait_time(self):
        # Calculate the average wait time for all served patients
        if self.served_patients:
            return np.mean(self.served_patients)
        return 0.0

    def select_patient(self, state):
        # Select the patient to be served next based on urgency and FCFS within the same urgency
        sorted_queue = sorted(self.queue, key=lambda p: (p.urgency, p.arrival_time))
        return sorted_queue[0] if sorted_queue else None
# Define the RL Agent class, responsible for decision-making and learning
class RLAgent:
    def __init__(self, departments):
        self.q_table = {}  # Q-table for storing state-action values
        self.departments = departments  # List of departments in the system

    def get_state(self, department):
        # Get the current state, represented by the length of the queue in each department
        return tuple(len(d.queue) for d in self.departments)

    def select_action(self, state, queue_length):
        # Select an action (which patient to serve) using an epsilon-greedy strategy
        if random.random() < 0.1:
            # Exploration: Choose a random action
            return random.choice(range(queue_length))
        # Exploitation: Choose the best action based on the Q-values
        q_values = [self.q_table.get((state, a), 0) for a in range(queue_length)]
        return np.argmax(q_values)

    def update_q_table(self, state, action, reward, next_state):
        # Update the Q-value for the given state-action pair using the reward and future rewards
        old_value = self.q_table.get((state, action), 0)
        future_rewards = max([self.q_table.get((next_state, a), 0) for a in range(len(self.departments))], default=0)
        self.q_table[(state, action)] = old_value + 0.1 * (reward + 0.9 * future_rewards - old_value)
# Function to generate patients and assign them to departments
def patient_generator(env, departments, arrival_rate):
    while True:
        # Stop generating new patients after time 500
        if env.now > 500:
            break
        # Generate a new patient with weighted urgency levels
        urgency_distribution = [0.1, 0.3, 0.6]  # Probability distribution for urgency levels [1, 2, 3]
        urgency = random.choices([1, 2, 3], weights=urgency_distribution, k=1)[0]
        department = random.choice(departments)
        # Determine service time based on the department
        if department.name == "Department 1":
            service_time = max(1, np.random.normal(10, 3))  # Mean 10, SD 3
        elif department.name == "Department 2":
            service_time = max(1, np.random.normal(8, 3))  # Mean 8, SD 3
        elif department.name == "Department 3":
            service_time = 11  # Fixed service time of 11 minutes
        patient = Patient(env, urgency, department, service_time)
        department.add_patient(patient)
        # Wait for the next patient to arrive based on the arrival rate
        yield env.timeout(random.expovariate(arrival_rate))
# Function to reset the departments' state between simulation episodes
def reset_departments(departments):
    for department in departments:
        department.queue.clear()
        department.patient_ids.clear()
        department.served_patients.clear()

# Function to run one episode of the simulation and return the average wait time
def run_episode(env, rl_agent, departments, simulation_time):
    # Start the patient generator process
    env.process(patient_generator(env, departments, arrival_rate=0.3))  # Increase the patient arrival rate
    # Run the simulation until time 540 or until all patients are served
    env.run(until=simulation_time)
    # After 500, ensure the simulation continues until all patients are served
    while any(len(dept.queue) > 0 for dept in departments):
        env.run(until=env.now + 1)  # Continue running until all queues are empty
    # Calculate the average wait time for all departments
    average_wait_times = [dept.get_average_wait_time() for dept in departments]
    return np.mean(average_wait_times)
# Function to set up the simulation environment and run multiple episodes
def run_simulation(episodes=100, simulation_time=540):
    first_round_avg = 0
    last_round_avg = 0
    for episode in range(episodes):
        print(f"\nStarting episode {episode + 1}")
        # Reset the patient counter at the start of each episode
        global patient_counter
        patient_counter = 0
        env = simpy.Environment()
        rl_agent = RLAgent(departments=[])
        departments = [Department(env, f"Department {i + 1}", rl_agent) for i in range(3)]
        rl_agent.departments = departments
        # Run one episode and record the average wait time
        avg_wait_time = run_episode(env, rl_agent, departments, simulation_time)
        print(f"Average wait time for episode {episode + 1}: {avg_wait_time}")
        # Store the average wait time for the first and last episodes
        if episode == 0:
            first_round_avg = avg_wait_time
        if episode == episodes - 1:
            last_round_avg = avg_wait_time
        # Reset the departments for the next episode
        reset_departments(departments)
    # Print the average wait times for the first and last episodes
    print(f"\nAverage wait time in the first episode: {first_round_avg}")
    print(f"Average wait time in the last episode: {last_round_avg}")

# Run the simulation
if __name__ == "__main__":
    run_simulation(episodes=100, simulation_time=540)
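To try the example locally, save the script to a file (for instance hospital_des_rl.py; the filename is arbitrary) and run it with Python after installing SimPy and NumPy:

python hospital_des_rl.py

Each episode prints arrival and service logs along with its average wait time, and the averages for the first and last episodes are printed at the end.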
Similar ideas that combine simulation with reinforcement learning are gradually being tested in many places. This article only briefly introduces the two concepts; more details will follow in later articles.
Join the X-Lab Discord!
We have launched a new discussion channel called "X-Lab", and everyone interested in related topics is welcome to join. This session focuses on combining discrete event simulation with reinforcement learning.
Click the link to join our Discord and take part in the discussion: https://discord.gg/EDdmCKuPkb
We look forward to seeing you there for some exciting, in-depth discussions!
References:
S. Belsare, E. D. Badilla, and M. Dehghanimohammadabadi, "Reinforcement Learning with Discrete Event Simulation: The Premise, Reality, and Promise," 2022 Winter Simulation Conference (WSC), Singapore, 2022, pp. 2724-2735, doi: 10.1109/WSC57314.2022.10015503.
Nian, R., J. Liu, and B. Huang. 2020. "A Review on Reinforcement Learning: Introduction and Applications in Industrial Process Control." Computers & Chemical Engineering 139:106886.