Deep reinforcement learning
in Electrical Engineering journals
-
Deep reinforcement learning is widely used in machine learning problems, and methods that improve its performance are therefore important. The balance between exploration and exploitation is one of the central issues in reinforcement learning; for this purpose, action selection methods that involve exploration, such as ɛ-greedy and softmax, are used. In these methods, generated random numbers and action-value estimates are combined to select an action that can maintain this balance. Over time, with appropriate exploration, the environment can be expected to become better understood and more valuable actions to be identified. Chaos, with features such as high sensitivity to initial conditions, non-periodicity, unpredictability, visits to all states of the search space, and pseudo-random behavior, has many applications. In this article, numbers generated by chaotic systems are used in the ɛ-greedy action selection method in deep reinforcement learning to improve the balance between exploration and exploitation; in addition, the impact of using chaos in the experience replay memory is also investigated. Experiments conducted in the Lunar Lander environment demonstrate a significant increase in learning speed and higher rewards in this environment.
Keywords: Action Selection, Chaos Theory, Deep Reinforcement Learning, Exploration, Exploitation
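The abstract does not name the authors' exact chaotic system, so the following is a minimal illustrative sketch: a logistic map in its fully chaotic regime (r = 4) supplies the draws that a standard ɛ-greedy rule would take from a uniform random generator. The class name, initial condition, and choice of map are assumptions.

```python
import numpy as np

class ChaoticEpsilonGreedy:
    """Epsilon-greedy action selection driven by a logistic map instead
    of a uniform random number generator (illustrative sketch; the
    paper's actual chaotic system may differ)."""

    def __init__(self, n_actions, epsilon=0.1, x0=0.3141):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.x = x0  # initial condition of the chaotic map (assumed)

    def _chaotic_draw(self):
        # Logistic map x_{n+1} = 4 x_n (1 - x_n): chaotic for r = 4,
        # producing a pseudo-random sequence in (0, 1).
        self.x = 4.0 * self.x * (1.0 - self.x)
        return self.x

    def select(self, q_values):
        if self._chaotic_draw() < self.epsilon:
            # Explore: pick an action index from another chaotic draw.
            return int(self._chaotic_draw() * self.n_actions) % self.n_actions
        # Exploit: greedy action with respect to current value estimates.
        return int(np.argmax(q_values))

# Usage: policy = ChaoticEpsilonGreedy(n_actions=4)
#        action = policy.select(q_network_output)
```
-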
Journal of Iranian Association of Electrical and Electronics Engineers, Volume 22, Issue 1, Spring 2025, pp. 121-131
Recommender systems are one of the most important topics in academia and industry. With the increase in the volume of information and data, it has become confusing, and sometimes impossible, for users to find the services they need without recommender systems. Various techniques have been proposed for this purpose, such as collaborative filtering, matrix factorization, logistic regression, and neural networks. However, most of these methods suffer from two limitations: (1) they treat recommendation as a static procedure and ignore the dynamic, interactive nature of the relationship between users and the recommender system; (2) they focus on the immediate feedback of recommended items and neglect long-term rewards. In this research, the interactions between users and items are modeled using an improved deep reinforcement learning method that accounts for both dynamic adaptation and long-term rewards. Experimental results show that the proposed algorithm performs better than other methods.
Keywords: Recommender Systems, Deep Reinforcement Learning, Artificial Intelligence, User-Item Interactions
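The abstract frames recommendation as a sequential decision process rather than a static ranking; the sketch below shows one minimal way that framing is usually written down, with the state as the user's k most recent items and a Q-network scoring candidate items. The architecture, window size, and target formula are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class RecommenderQNet(nn.Module):
    """Toy Q-network for recommendation as an MDP (illustrative sketch).
    State: embedding of the last k items the user interacted with.
    Action: which of n_items to recommend next."""

    def __init__(self, n_items, k=5, emb_dim=32):
        super().__init__()
        self.embed = nn.Embedding(n_items, emb_dim)
        self.q_head = nn.Sequential(
            nn.Linear(k * emb_dim, 128), nn.ReLU(),
            nn.Linear(128, n_items),  # one Q-value per candidate item
        )

    def forward(self, recent_items):        # (batch, k) item ids
        e = self.embed(recent_items)        # (batch, k, emb_dim)
        return self.q_head(e.flatten(1))    # (batch, n_items)

# The learning target couples immediate feedback with long-term value:
#   y = r_immediate + gamma * max_a' Q_target(s', a')
# so a recommendation is credited not only for the click it earns now
# but for the future engagement it leads to.
```
-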
The highest level in Endsley's situation awareness model is called projection, in which the status of elements in the environment is predicted for the near future. In cybersecurity situation awareness, projection for an Advanced Persistent Threat (APT) requires predicting the APT's next step. Threats are constantly changing and becoming more complex. Since supervised and unsupervised learning methods require APT datasets to project the next step of APTs, they cannot identify unknown APT threats. In reinforcement learning methods, the agent interacts with the environment, which makes it possible to project the next step of both known and unknown APTs. So far, reinforcement learning has not been used to project the next step of APTs. In reinforcement learning, the agent uses previous states and actions to approximate the best action for the current state. When the number of states and actions is large, the agent employs a neural network to approximate the best action for each state. This paper presents a deep reinforcement learning system to project the next step of APTs. Since attack steps are related to one another, we employ the Long Short-Term Memory (LSTM) method to approximate the best action for each state. In our proposed system, the next steps of APT threats are projected based on the current situation. We have evaluated our proposed system on the DAPT2020 dataset. On this dataset, the six criteria F1, accuracy, precision, recall, loss, and average time were 0.9533, 0.9736, 0.9352, 0.97, 0.0143, and 0.05749 seconds, respectively.
Keywords: Situation Awareness, Advanced Persistent Threats, Projection, Deep Reinforcement Learning, LSTM, DAPT2020, SCVIC-APT-202
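Because attack steps are sequentially related, the abstract approximates the action-value function with an LSTM. A minimal sketch of such a network follows; the embedding, layer sizes, and step encoding are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LSTMQNetwork(nn.Module):
    """Illustrative LSTM-based Q-network for APT next-step projection:
    the observed sequence of attack steps is the state, and each Q-value
    scores one candidate next step (sizes are assumed, not the paper's)."""

    def __init__(self, n_step_types, emb_dim=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_step_types, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, n_step_types)

    def forward(self, step_sequence):          # (batch, seq_len) step ids
        e = self.embed(step_sequence)          # (batch, seq_len, emb_dim)
        out, _ = self.lstm(e)
        return self.q_head(out[:, -1, :])      # Q-value per candidate step

# q = LSTMQNetwork(n_step_types=10)
# next_step = q(torch.tensor([[0, 3, 7]])).argmax(dim=-1)
```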
-
Journal of Electrical and Computer Engineering Innovations, Volume 13, Issue 1, Winter-Spring 2025, pp. 225-240
Background and Objectives: Stock recommender systems (SRS) based on deep reinforcement learning (DRL) have garnered significant attention within the financial research community. A robust DRL agent aims to consistently allocate cash across a combination of high-risk and low-risk stocks, with the ultimate objective of maximizing returns while balancing risk. However, existing DRL-based SRSs rely on one or, at most, two sequential trading agents that operate within the same or a shared environment, and they often make mistakes in volatile or variable market conditions. In this paper, a robust Concurrent Multiagent Deep Reinforcement Learning-based Stock Recommender System (CMSRS) is proposed.
Methods: The proposed system introduces a multi-layered architecture that performs feature extraction at the data layer to construct multiple trading environments, so that differently fed DRL agents can robustly recommend assets to the trading layer. The proposed CMSRS uses a variety of data sources, including Google stock trends, fundamental data, and technical indicators along with historical price data, for the selection and recommendation of suitable stocks to buy or sell concurrently by multiple agents. To optimize hyperparameters during the validation phase, we employ the Sharpe ratio as a risk-adjusted return measure. Additionally, we address liquidity requirements by defining a precise reward function that dynamically manages cash reserves and penalizes the model for failing to maintain a reserve of cash.
Results: Empirical results on real U.S. stock market data show the superiority of our CMSRS, especially in volatile markets and on out-of-sample data.
Conclusion: The proposed CMSRS demonstrates significant advancements in stock recommendation by effectively leveraging multiple trading agents and diverse data sources. The empirical results underscore its robustness and superior performance, particularly in volatile market conditions. This multi-layered approach not only optimizes returns but also efficiently manages risks and liquidity, offering a compelling solution for dynamic and uncertain financial environments. Future work could further refine the model's adaptability to other market conditions and explore its applicability across different asset classes.
Keywords: Multi-Agent, Concurrent Learning, Deep Reinforcement Learning, Stock Recommender System
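The abstract names two concrete ingredients: the Sharpe ratio as the validation measure and a reward that penalizes letting cash reserves fall below a floor. A minimal sketch of both follows; the penalty weight, cash floor, and reward shape are assumed placeholders, not the paper's definitions.

```python
import numpy as np

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio used to rank hyperparameter settings
    on the validation window (risk-adjusted return)."""
    excess = np.asarray(returns) - risk_free / periods_per_year
    return np.sqrt(periods_per_year) * excess.mean() / (excess.std() + 1e-12)

def step_reward(portfolio_value, prev_value, cash, min_cash_ratio=0.05,
                penalty_weight=1.0):
    """Illustrative reward: log portfolio growth, minus a penalty that
    grows as cash drops below an assumed liquidity floor."""
    growth = np.log(portfolio_value / prev_value)
    cash_floor = min_cash_ratio * portfolio_value
    shortfall = max(0.0, cash_floor - cash) / cash_floor
    return growth - penalty_weight * shortfall
```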
-
Biological evidence indicates that the actuation system in humans and legged animals is characterized by impulsiveness rather than continuity; i.e., control actions are concentrated within a specific phase of the motion cycle (the stance phase), while the rest of the cycle is passive. Based on this observation, we propose a simple event-based impulsive controller to generate walking cycles for legged robots. To improve optimization speed, we parametrize the controller-applied forces as a Gaussian function of time and employ a deep reinforcement learning method to optimize the controller parameters. To further enhance learning speed, an autoencoder is utilized to address the high dimensionality of the state space. Additionally, we employ a three-phase reward shaping approach to further improve learning speed and achieve better results. In phase one, the reward function focuses on stability and forward motion to learn stable locomotion. In phase two, the reward function is modified to achieve stable locomotion with lower control effort and the desired forward velocity. In phase three, the reward function remains the same as in phase two but places more emphasis on forward velocity regulation. The proposed controller, state encoder, and learning process can be implemented on a group of legged robots with actuation at the leg's contact point with the ground. In this paper, the proposed approach is tested on a simulated single-legged robot. Moreover, the controller's robustness is analyzed under different types of external disturbances. The simulation results indicate the efficacy of the proposed method as a bio-inspired control approach for legged locomotion.
Keywords: Deep Reinforcement Learning, Event-Based Control, Impulsive Control, Legged Robot
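The Gaussian force parametrization the abstract describes can be written down directly: the RL policy outputs the amplitude, center, and width of a force pulse that is applied only during the stance phase. The function and parameter names below are assumptions for illustration.

```python
import numpy as np

def impulsive_force(t, amplitude, t_center, width):
    """Gaussian force profile applied during the stance phase
    (illustrative; the learned parameters are amplitude, t_center, width).
    F(t) = A * exp(-(t - t_c)^2 / (2 * w^2))"""
    return amplitude * np.exp(-((t - t_center) ** 2) / (2.0 * width ** 2))

# Event-based use: the controller is active only while the foot is in
# contact with the ground; the rest of the cycle is passive.
def control_step(t, in_stance, policy_params):
    if not in_stance:
        return 0.0  # passive phase: no actuation
    a, tc, w = policy_params  # output of the DRL policy (assumed layout)
    return impulsive_force(t, a, tc, w)
```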
-
Web applications (apps) are integral to our daily lives. Before users can rely on web apps, testing must be conducted to ensure their reliability. Various approaches exist for testing web apps, but they still need improvement: they struggle to achieve high coverage of web app functionality. On the one hand, web apps typically have an extensive state space, which makes testing all states inefficient and time-consuming. On the other hand, specific sequences of actions are required to reach certain functionalities. The optimal testing strategy therefore depends heavily on the app's features. Reinforcement Learning (RL) is a machine learning technique that learns the optimal strategy for a task through trial and error rather than explicit supervision, guided by positive or negative rewards. Deep RL extends RL and exploits the learning capabilities of neural networks. These features make Deep RL suitable for testing complex state spaces, such as those found in web apps; however, existing approaches support only fundamental RL. We propose WeDeep, a Deep RL testing approach for web apps. We evaluated our method on seven open-source web apps. Experimental results show that it achieves higher code coverage and fault detection than existing methods.
Keywords: Deep Reinforcement Learning, Automated Testing, Test Generation, Web Application
-
Web application (app) exploration is a crucial part of various analysis and testing techniques. However, current methods are not able to properly explore the state space of web apps. As a result, techniques must be developed to guide the exploration in order to achieve acceptable functionality coverage. Reinforcement Learning (RL) is a machine learning method in which the best way to perform a task is learned through trial and error, with the help of positive or negative rewards, instead of direct supervision. Deep RL is a recent extension of RL that makes use of neural networks' learning capabilities. This makes Deep RL suitable for exploring the complex state space of web apps; however, current methods provide only fundamental RL. In this research, we offer DeepEx, a Deep RL-based strategy for systematically exploring web apps. Empirically evaluated on seven open-source web apps, DeepEx demonstrated a 17% improvement in code coverage and a 16% enhancement in navigational diversity over the state-of-the-art RL-based method. Additionally, it showed a 19% increase in structural diversity. These results confirm the superiority of Deep RL over traditional RL methods in web app exploration.
Keywords: Deep Reinforcement Learning, Exploration, Model Generation, Web Application
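Neither of the two web-exploration abstracts spells out its reward design, so the sketch below shows one common way a Deep RL explorer can be rewarded for discovering new web app states: a positive reward for a previously unseen state, a small negative reward otherwise. The state fingerprint and reward values are assumptions, not DeepEx's or WeDeep's actual scheme.

```python
class NoveltyReward:
    """Illustrative curiosity-style reward for web app exploration:
    encourage actions (clicks, form fills) that reach unseen states."""

    def __init__(self, new_state_bonus=1.0, step_cost=-0.05):
        self.seen = set()
        self.new_state_bonus = new_state_bonus
        self.step_cost = step_cost

    def reward(self, state_fingerprint):
        # state_fingerprint: e.g., a hash of the page's DOM structure
        # (an assumed abstraction; real tools differ).
        if state_fingerprint not in self.seen:
            self.seen.add(state_fingerprint)
            return self.new_state_bonus  # new page/state discovered
        return self.step_cost            # revisit: mild discouragement
```
-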
Computer games have played an important role in the development of artificial intelligence in recent years. Throughout the history of artificial intelligence, computer games have served as a suitable test environment for evaluating new approaches and algorithms. Different methods, including rule-based methods, tree search methods, and machine learning methods (supervised learning and reinforcement learning), have been developed to create intelligent agents for different games. Games have been used as an environment for trial and error and for testing different artificial intelligence ideas and algorithms; notable examples include Deep Blue in chess and AlphaGo in Go. AlphaGo was the first computer program to defeat an expert human Go player, and Deep Blue, a chess-playing expert system, was the first computer program to win a match against a world champion. In this paper, we focus on the match-3 game. Match-3 is a popular genre on cell phones, with a very large random state space that makes learning difficult; it also has a random reward function, which makes learning unstable. Much research has been done in the past on different games, including match-3. The aim of this research has generally been to play optimally or to predict the difficulty of stages designed for human players. Predicting the difficulty of stages helps game developers improve the quality of their games and provide a better experience for users. Based on the approach used, past works can be divided into three main categories: search-based methods, machine learning methods, and heuristic methods. In this paper, an intelligent agent based on deep reinforcement learning is presented, whose goal is to maximize the score in the match-3 game. Reinforcement learning is a branch of machine learning in which the agent learns the optimal policy for choosing actions through its experience of interacting with the environment; in deep reinforcement learning, reinforcement learning algorithms are combined with deep neural networks. In the proposed method, different mapping mechanisms for the action space and state space are used, and a novel neural network structure for the match-3 environment is proposed to handle the large state space. The contributions of this article can be summarized as follows. An approach for mapping the action space to a two-dimensional matrix is presented, which makes it easy to separate valid and invalid actions. An approach is designed to map the state space to the input of the deep neural network, which reduces the input space by reducing the depth of the convolutional filters and thus improves the learning process. The reward function makes the learning process stable by separating random rewards from deterministic rewards. Comparison of the proposed method with other existing methods, including PPO, DQN, A3C, a greedy method, and human players, shows the superior performance of the proposed method in the match-3 game.
Keywords: Deep Reinforcement Learning, Random Game, Match-3, Large State Space
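The abstract's first contribution, mapping the action space to a 2D matrix that separates valid from invalid swaps, can be illustrated as follows. Here each candidate action is a swap between horizontally or vertically adjacent cells, and a swap is marked valid only if it creates a line of three; the board encoding and validity test are simplified assumptions, not the paper's exact mapping.

```python
import numpy as np

def makes_match(board, r1, c1, r2, c2):
    """Check whether swapping two adjacent cells creates a 3-in-a-row
    (simplified validity test; real match-3 rules are richer)."""
    b = board.copy()
    b[r1, c1], b[r2, c2] = b[r2, c2], b[r1, c1]
    h, w = b.shape
    for r in range(h):
        for c in range(w - 2):
            if b[r, c] == b[r, c + 1] == b[r, c + 2]:
                return True
    for c in range(w):
        for r in range(h - 2):
            if b[r, c] == b[r + 1, c] == b[r + 2, c]:
                return True
    return False

def action_mask(board):
    """2D action matrix: one row per cell, one column per swap direction
    (right, down); 1 marks a valid swap, 0 an invalid one."""
    h, w = board.shape
    mask = np.zeros((h * w, 2), dtype=np.int8)
    for r in range(h):
        for c in range(w):
            if c + 1 < w:
                mask[r * w + c, 0] = makes_match(board, r, c, r, c + 1)
            if r + 1 < h:
                mask[r * w + c, 1] = makes_match(board, r, c, r + 1, c)
    return mask

# board = np.random.randint(0, 5, size=(9, 9)); mask = action_mask(board)
```
-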
The hybrid electric train, which operates without overhead wires or traditional power sources, relies on hydrogen fuel cells and batteries for power. These fuel cell-based hybrid electric trains (FCHETs) are more efficient than those powered by diesel or electricity because they produce no tailpipe emissions, making them an eco-friendly mode of transport. The aim of this paper is to propose a low-budget FCHET that prioritizes energy efficiency to reduce operating costs and minimize environmental impact. To this end, an energy management strategy (EMS) has been developed that optimizes the distribution of energy to reduce the amount of hydrogen required to power the train. The EMS achieves this by balancing battery charging and discharging. To enhance the performance of the EMS, this paper proposes a deep reinforcement learning (DRL) algorithm, specifically the deep deterministic policy gradient (DDPG) combined with transfer learning (TL), which can improve the system's efficiency when driving cycles change. DRL-based strategies are commonly used in energy management, but they suffer from unstable convergence, slow learning speed, and insufficient constraint-handling capability. To address these limitations, an action masking technique is proposed that prevents the DDPG-based approach from producing actions that violate the system's physical limits. The DDPG+TL agent consumes up to 3.9% less energy than a conventional rule-based EMS while maintaining the battery's state of charge within a predetermined range. The results show that DDPG+TL can sustain battery charge with minimal hydrogen consumption and minimal training time for the agent.
Keywords: Fuel Cell, State of Charge, Energy Management Strategy, Deep Reinforcement Learning, Deep Deterministic Policy Gradient, Transfer Learning
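The action masking step the abstract mentions can be sketched as a projection of the policy's raw output onto the physically feasible set: here, a battery power command is clipped so that the state of charge (SOC) stays inside an assumed [0.4, 0.8] band. The limits, time step, and sign convention are placeholders, not the paper's values.

```python
def mask_action(raw_power_kw, soc, capacity_kwh, dt_h=1/3600,
                soc_min=0.4, soc_max=0.8, p_limit_kw=200.0):
    """Clamp the DDPG battery-power action to physical limits
    (illustrative bounds; positive = discharge, negative = charge)."""
    # Hardware power limit first.
    p = max(-p_limit_kw, min(p_limit_kw, raw_power_kw))
    # Maximum discharge before hitting the SOC floor this step,
    # and maximum charge before hitting the SOC ceiling.
    max_discharge = (soc - soc_min) * capacity_kwh / dt_h
    max_charge = (soc_max - soc) * capacity_kwh / dt_h
    return max(-max_charge, min(max_discharge, p))

# soc' = soc - p * dt_h / capacity_kwh stays inside [soc_min, soc_max]
# for any raw action, so the agent can never propose an infeasible move.
```
-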
An encoder-decoder framework-based model with deep reinforcement learning for learning stock trading strategies
The quality of the extracted features from a long-term sequence of raw prices of the instruments greatly affects the performance of the trading rules learned by machine learning models. Employing a neural encoder-decoder structure to extract informative features from complex input time-series has proved very effective in other popular tasks like neural machine translation and video captioning. In this paper, a novel end-to-end model based on the neural encoder-decoder framework combined with deep reinforcement learning is proposed to learn single instrument trading strategies from a long sequence of raw prices of the instrument. In addition, the effects of different structures for the encoder and various forms of the input sequences on the performance of the learned strategies are investigated. Experimental results showed that the proposed model outperforms other state-of-the-art models in highly dynamic environments.
Keywords: Deep Reinforcement Learning, Deep Q-Learning, Single Stock Trading, Portfolio Management, Encoder-Decoder Framework
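A minimal sketch of the encoder-decoder idea applied to trading follows: a recurrent encoder compresses a long window of raw prices into a feature vector, and a Q-head (the keywords mention deep Q-learning) scores the trade actions. The GRU encoder, layer sizes, and the three-action space {sell, hold, buy} are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class EncoderQTrader(nn.Module):
    """Illustrative encoder + deep Q-learning head for single-instrument
    trading: raw price window in, Q-values for {sell, hold, buy} out."""

    def __init__(self, n_features=1, hidden=64, n_actions=3):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        self.q_head = nn.Sequential(
            nn.Linear(hidden, 32), nn.ReLU(),
            nn.Linear(32, n_actions),
        )

    def forward(self, price_window):       # (batch, seq_len, n_features)
        _, h = self.encoder(price_window)  # final hidden state as features
        return self.q_head(h[-1])          # (batch, n_actions)

# model = EncoderQTrader()
# q_values = model(torch.randn(1, 200, 1))   # 200-step raw price window
```
-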
Journal of Command and Control Communications Computer Intelligence, Volume 6, Issue 2, 2022, pp. 31-45
In this research we develop a deep reinforcement learning-based method for autonomous robot navigation. Our approach is based on DDPG and one of its improved versions, named SD3. We modified this algorithm to make it suitable for autonomous navigation problems and optimized it for this application. The modified algorithm can work with high-dimensional state spaces because it uses convolutional layers. We also propose two reward terms, a linear-velocity reward and an angular-velocity penalty, to encourage the robot to move faster with smoother motion. To improve generalization, we used an algorithm that randomly changes the shape, layout, and number of obstacles in the environment, and to speed up the learning process and improve the robot's performance, we normalized all input data. Finally, the proposed algorithm was implemented with ROS and Gazebo, and the results show improvements over the original SD3 and DDPG algorithms.
Keywords: Autonomous Navigation, Deep Reinforcement Learning, SD3, DDPG
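The two reward terms the abstract describes can be sketched directly: a bonus proportional to forward (linear) speed and a penalty proportional to turning (angular) speed, added to the usual sparse goal and collision terms. The weights and the terminal values are assumptions, not the paper's numbers.

```python
def navigation_reward(linear_vel, angular_vel, reached_goal, collided,
                      w_lin=1.0, w_ang=0.5):
    """Illustrative shaped reward for DRL navigation: reward fast,
    smooth forward motion; penalize spinning; keep sparse terminal terms."""
    if reached_goal:
        return 100.0              # assumed terminal bonus
    if collided:
        return -100.0             # assumed collision penalty
    return w_lin * linear_vel - w_ang * abs(angular_vel)
```
-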
In recent years, the exponential growth of communication devices has made the Internet of Things (IoT) an emerging technology that allows heterogeneous devices to connect with each other across heterogeneous networks. This communication requires different levels of Quality of Service (QoS) and different policies depending on the device type and location. To provide a specific level of QoS, we can combine emerging technological concepts in the IoT infrastructure: software-defined networking (SDN) and machine learning algorithms. We use deep reinforcement learning for resource management and allocation in the control plane and present an algorithm that aims to optimize resource allocation. Simulation results show that the proposed algorithm improves network performance in terms of QoS parameters, including delay and throughput, compared to Random and Round Robin methods, while performing on par with fuzzy and predictive methods.
Keywords: Internet of Things, Software-Defined Networking (SDN), Deep Reinforcement Learning, QoS
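The abstract evaluates the allocation policy on delay and throughput, so a natural (assumed) reward for the DRL agent in the SDN control plane combines the two. The sketch below normalizes each against target values; all weights, targets, and caps are placeholders, not the paper's reward.

```python
def qos_reward(delay_ms, throughput_mbps, target_delay_ms=10.0,
               target_throughput_mbps=100.0, w_d=0.5, w_t=0.5):
    """Illustrative QoS reward for a DRL resource allocator:
    high throughput and low delay relative to targets score well."""
    delay_score = target_delay_ms / max(delay_ms, 1e-6)       # >1 is good
    tput_score = throughput_mbps / target_throughput_mbps     # >1 is good
    return w_d * min(delay_score, 2.0) + w_t * min(tput_score, 2.0)
```
-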
With the advent and development of IoT applications in recent years, the number of smart devices, and consequently the volume of data collected by them, is rapidly increasing. On the other hand, most IoT applications require real-time data analysis and low latency in service delivery. Under these circumstances, sending the huge volume of varied data to cloud data centers for processing and analysis is impractical, and the fog computing paradigm is a better choice. Because the computational resources of fog nodes are limited, using them efficiently is of great importance. In this paper, the scheduling of IoT application tasks in the fog computing paradigm is considered. The main goal of this study is to reduce the latency of service delivery, for which we use a deep reinforcement learning approach. The proposed method combines the Q-learning algorithm with deep learning and the experience replay and target network techniques. According to the experimental results, the DQLTS algorithm improves the ASD metric by 76% compared to QLTS and by 6.5% compared to the RS algorithm, and it converges faster than QLTS.
Keywords: Internet of Things, Fog Computing, Task Scheduling, Deep Reinforcement Learning
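The abstract lists the exact ingredients of the DQLTS learner: Q-learning, a deep network, experience replay, and a target network. A compact sketch of how those pieces fit together in one update step follows; network sizes, buffer capacity, and all hyperparameters are assumptions.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, s):
        return self.net(s)

# Illustrative DQN-style setup: replay buffer + periodically synced target.
state_dim, n_actions, gamma = 8, 4, 0.99
q_net, target_net = QNet(state_dim, n_actions), QNet(state_dim, n_actions)
target_net.load_state_dict(q_net.state_dict())  # frozen copy, synced rarely
buffer = deque(maxlen=10_000)                   # experience replay memory
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train_step(batch_size=32):
    if len(buffer) < batch_size:
        return  # not enough stored transitions (s, a, r, s', done) yet
    s, a, r, s2, done = map(torch.stack, zip(*random.sample(buffer, batch_size)))
    with torch.no_grad():  # bootstrap from the frozen target network
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    pred = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()
```
-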
To accelerate the learning process in high-dimensional reinforcement learning problems, TD techniques such as Q-learning or SARSA are usually combined with the eligibility traces mechanism. The recently introduced Deep Q-Network (DQN) algorithm uses deep neural networks in Q-learning to enable reinforcement learning algorithms to reach a greater understanding of the visual world and to extend to problems that were previously considered intractable. DQN, a deep reinforcement learning algorithm, has a low learning speed. In this paper, we use the eligibility traces mechanism, one of the basic methods in reinforcement learning, in combination with deep neural networks to improve the speed of the learning process. To compare efficiency with the DQN algorithm, experiments were carried out on a number of Atari 2600 games; the experimental results show that the proposed method significantly reduces learning time compared to the DQN algorithm and converges faster to the optimal model.
Keywords: Deep Neural Networks, Deep Q Networks (DQN), Eligibility Traces, Deep Reinforcement Learning
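The eligibility-trace mechanism the abstract builds on is easiest to state with linear function approximation; the paper's contribution is carrying the same idea into deep networks, which this sketch does not attempt. Here z is the accumulating trace, updated as z ← γλz + ∇Q(s, a), so one TD error δ updates all recently visited state-action features at once.

```python
import numpy as np

def td_lambda_update(w, z, phi_sa, delta, gamma=0.99, lam=0.9, alpha=0.01):
    """One TD(lambda) step with linear Q(s, a) = w . phi(s, a)
    (classical eligibility traces; the paper extends the idea to deep nets).

    z      : eligibility trace vector, same shape as w
    phi_sa : feature vector of the current state-action pair
    delta  : TD error, r + gamma * max_a' Q(s', a') - Q(s, a)
    """
    z = gamma * lam * z + phi_sa      # decay old credit, add current gradient
    w = w + alpha * delta * z         # one error updates many past steps
    return w, z

# d = 16; w, z = np.zeros(d), np.zeros(d)
# Each episode: reset z to zeros, then call td_lambda_update per step.
```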