Abstract

Investment activity in the domestic financial market has gradually shifted from discretionary trading based on traditional technical analysis and similar methods toward programmatic, quantitative strategies. A large body of research addresses quantitative strategies for the stock market, but research on quantitative trading strategies for the cryptocurrency market remains insufficient, and the investment returns and risk control of existing strategies in intraday high-frequency trading have yet to be optimized. To improve the profitability and risk control of high-frequency quantitative strategies for cryptocurrency, this paper designs a cryptocurrency trading environment that uses Ethereum's historical price data at 1-day granularity and derived technical indicators as the environment state, and constructs a corresponding action space and algorithm around position status and trading operations. A deep reinforcement learning model based on PPO is adopted, making it well suited to processing sequential state inputs and significantly improving the model's learning speed, and practical trading strategies are discovered through the model. Under a specific set of model weights, and trading only a single cryptocurrency, Ethereum, the trading strategy constructed in this paper achieves an average return of 600% and an average volatility of 293% over 3,000 tests, for a return-to-volatility ratio of 2.09.
I. Introduction
(I) Research Background
The current cryptocurrency market’s high volatility and its potential for substantial
returns in a short period have attracted an increasing number of investors to participate
in cryptocurrency trading. On the Binance platform alone, daily transaction volumes
have already exceeded the million level. For these assets, which lack fundamental
analysis, technical analysis remains the only active method of price prediction in the
cryptocurrency market. Technical analysis is based on summaries of historical
experience; however, experiences can sometimes be misleading. For instance, common
strategies like harmonic trading and the Fibonacci sequence can fail under certain
circumstances. Moreover, the cryptocurrency market includes many retail investors
whose investment behavior is often irrational. As market conditions change rapidly,
relying on manual trading is impractical. Instead, developing trading strategies that
automatically handle timing, opening, and closing positions through quantitative
methods is the main direction for current cryptocurrency market strategy development.
(II) Research Significance
With the development of deep learning in time series analysis, neural networks are
increasingly being used for asset price forecasting, with the hope of obtaining trading
signals from real-time price data to construct trading strategies. These methods often
suffer from overfitting and overlook the long-term returns of strategies, requiring
manual adjustments to parameters to achieve optimal test results.
Reinforcement learning, on the other hand, builds agents that interact continuously with
the environment, gaining rewards and ultimately progressing towards preset goals.
Therefore, by combining the strengths of these two approaches, deep learning networks
can automatically extract features from experience and use these features to optimize
reinforcement learning strategies. This allows agents to directly learn strategies from
raw perceptual inputs (such as video frames or sensor data) without any manual feature
engineering. Further developing an integrated algorithm for trading robots can better
unearth effective trading strategies and capture alpha (excess returns) in the market. This
paper selects the Proximal Policy Optimization (PPO) algorithm, a widely used method
in the field of reinforcement learning. Introduced by OpenAI in 2017, PPO aims to
improve the sample efficiency and training stability of policy gradient methods. The
core of the PPO algorithm is to limit the degree of policy change during policy update
steps, thus avoiding taking too large steps in the policy space, ensuring stability of
learning. PPO achieves this with a clipped surrogate objective that confines the probability ratio (the ratio between the new and old policies) to a fixed range, typically between 1−ϵ and 1+ϵ, where ϵ is a small positive number such as 0.1 or 0.2. Due to its stability and
relatively high efficiency, PPO has demonstrated good performance in various complex
environments, especially in tasks that involve handling large amounts of data and
complex decision-making. It is widely applied in robotics control, game AI, natural
language processing, and other fields. PPO’s advantages include simplifying the
training process and enhancing sample utilization efficiency, making it one of the most
popular algorithms in reinforcement learning today. Therefore, the PPO algorithm
should also be explored for application in the field of quantitative trading to construct
algorithmic trading strategies.
II. Definition of Related Concepts
(I) Deep Reinforcement Learning
1. Definition of Deep Reinforcement Learning
Deep Reinforcement Learning (DRL) combines deep learning (DL) and reinforcement
learning (RL) techniques to address the limitations of traditional reinforcement learning
in handling high-dimensional input data. In DRL, deep learning models, especially deep
neural networks, are used as the agent’s decision-making function to learn optimal
strategies directly from raw input data.
2. Deep Reinforcement Learning Algorithms
– PPO (Proximal Policy Optimization): PPO is efficient in handling samples and
more effective than other algorithms like TRPO (Trust Region Policy Optimization)
because it can reuse data in a single batch, accelerating the learning process. PPO is
also easier to implement and tune, simplifying the complexity of TRPO by using a
clipped objective function to avoid complex computations, thus making
implementation and hyperparameter tuning easier. Although PPO usually maintains a
stable training process, it may experience performance fluctuations or tuning
difficulties in specific environments or tasks.
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,A^{\pi_{\theta_{\mathrm{old}}}}(s_t, a_t),\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A^{\pi_{\theta_{\mathrm{old}}}}(s_t, a_t)\right)\right]
– Figure 2-1-2: PPO Objective Function (a PyTorch sketch of this clipped objective is given after this list)
– DQN (Deep Q-Network): Known for its stability and robustness, DQN
maintains high stability during training by using experience replay and a fixed target
network. DQN is suitable for discrete action spaces and is particularly effective in tasks
with discrete actions, making it very effective in areas like video gaming. However, the
experience replay mechanism requires significant memory, and the training process is
typically computation-intensive.
– A3C (Asynchronous Advantage Actor-Critic): Unlike DQN, A3C does not
require experience replay. It updates policies and value functions in real-time,
enhancing learning efficiency and adaptability. A3C can also perform parallel
computations, effectively running on multicore CPUs with multiple parallel agents,
significantly accelerating the learning process. However, A3C’s performance heavily
relies on hyperparameter settings, and improper settings can lead to unstable learning
processes. In some complex environments, A3C’s convergence rate may be slower than
methods based on experience replay.
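To make the clipped objective in Figure 2-1-2 concrete, the following is a minimal PyTorch sketch of the PPO surrogate loss; the tensor names and the value of clip_epsilon are illustrative rather than taken from the paper's implementation.

```python
import torch

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, clip_epsilon=0.2):
    """Clipped surrogate objective of PPO (returned as a negative, so it can be minimised).

    log_probs_new -- log pi_theta(a_t | s_t) under the current policy
    log_probs_old -- log pi_theta_old(a_t | s_t), detached from the graph
    advantages    -- advantage estimates A(s_t, a_t)
    """
    ratio = torch.exp(log_probs_new - log_probs_old)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()                            # ascend L^CLIP
```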
III. Market Analysis of Enrich Your Wealth
(I) Market Analysis of Quantitative Trading Robots
1. Popularity of Quantitative Trading Robots
Currently, quantitative trading robots are very active in secondary markets. Robots like
DeFiQuant have gained attention as they offer robust solutions in rapidly changing
market environments, such as the recent fluctuations in Bitcoin prices. The popularity
of these robots is evident from their widespread adoption on platforms like TradeSanta
and Pionex, which provide tools suitable for both beginners and advanced traders. For
example, TradeSanta offers various subscription modes and copy trading features,
which are attractive in volatile market conditions.
2. Market Acceptance of Quantitative Trading Robots
While public acceptance of these robots varies, it is generally growing. Tools like
Dash2Trade and Learn2Trade have made significant progress in simplifying the trading
process, which helps attract retail investors. On the other hand, concerns about their
complexity and the need for technical understanding may pose barriers to some users.
3. Market Usage of DRL Technology
– J.P. Morgan: This major bank is applying machine learning techniques to
optimize trade execution in the forex market. Their approach involves using deep neural
networks to integrate and enhance algorithmic trading strategies, helping to minimize
market impact and improve execution quality.
– Trumid: This fixed-income trading platform uses advanced machine learning
models to optimize the credit trading experience. Their proprietary models enhance
real-time pricing intelligence for various U.S. dollar-denominated corporate bonds.
– University of Toronto’s Rotman School of Management: Scholars here have
explored the application of reinforcement learning in derivatives hedging, proving these
methods can outperform traditional strategies, especially in scenarios involving
transaction costs.
4. Market Prospects for DRL Quantitative Trading Technology
With the continued surge in participation and wealth accumulation by Chinese retail
investors, the demand for complex financial services, including wealth management, is
rapidly expanding. The growth of wealth among the Chinese population is driving an
increasing demand for advanced investment strategies that maximize returns and
effectively manage risks. China’s vast retail investors provide a unique opportunity for
the adoption of DRL-driven quantitative trading products, offering individual investors
access to advanced trading strategies previously available only to institutional investors.
(II) Product Pricing Analysis of Enrich Your Wealth
1. Subscription Model (Membership)
– Basic Subscription: Provides simple trading algorithms and limited monthly trades,
suitable for new traders.
– Advanced Subscription: Includes advanced features, such as higher trading
limits, access to more complex strategies, and priority customer support, generally open
only to experienced traders.
2. Trading Fee System
By inviting investors to follow trades on specific platforms, the platform offers a
differential in transaction fees. Each time the trading robot executes a trade, the
cryptocurrency trading platform charges a portion of the fee, and another portion is
allocated to the followers as income. Therefore, the more frequent the transactions, the
higher the followers’ earnings. The 24/7 operation of DRL-based quantitative trading
robots fully capitalizes on this advantage, bringing substantial transaction fee revenues
to the followers.
3. Fund Management Model
Referring to the charging model of hedge fund managers, DRL trading robots are
packaged as fund products to attract investors to buy and earn management fees and
incentive fees for funds.
IV. Construction of DRL Quantitative Trading Robots
(I) Data Collection and Processing
1. Data Acquisition and Preprocessing
Data is collected from a website specializing in cryptocurrency data, CryptoCompare,
via its public API to retrieve daily historical price data of Ethereum, one of the
mainstream cryptocurrencies. These data serve as the foundational input for deep
reinforcement learning algorithms. The data collected include the daily opening, highest, lowest, and closing prices and the trading volume of Ethereum priced in USD over the past 2,000 days. After sorting by date, the records are checked for invalid values and the cleaned data are stored in a data frame for subsequent processing.
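As an illustration of this data pull, the sketch below requests roughly 2,000 days of daily ETH/USD candles from CryptoCompare's public histoday endpoint; the endpoint path, parameter names, and response fields follow CryptoCompare's public API documentation and should be verified before use.

```python
import pandas as pd
import requests

# Assumption: endpoint and field names follow CryptoCompare's public "histoday" API.
URL = "https://min-api.cryptocompare.com/data/v2/histoday"

def fetch_eth_daily(limit: int = 2000) -> pd.DataFrame:
    resp = requests.get(URL, params={"fsym": "ETH", "tsym": "USD", "limit": limit}, timeout=30)
    resp.raise_for_status()
    rows = resp.json()["Data"]["Data"]                       # list of daily OHLCV records
    df = pd.DataFrame(rows)
    df["date"] = pd.to_datetime(df["time"], unit="s")        # UNIX timestamp -> calendar date
    df = df.sort_values("date").set_index("date")
    df = df.rename(columns={"volumeto": "volume"})[["open", "high", "low", "close", "volume"]]
    return df[(df[["open", "high", "low", "close"]] > 0).all(axis=1)]   # screen out invalid rows

eth = fetch_eth_daily()
```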
2. Advanced Data Processing
To assess whether the market is fully efficient at a technical level, this paper generates
various common technical indicators based on historical price data. These indicators
help analyze both short-term and long-term price trends. The following technical indicators are generated with TA-Lib (a sketch of the computation follows this list):
– RSI (Relative Strength Index): A momentum oscillator that measures the speed
and change of price movements. The RSI values range from 0 to 100. Typically, an RSI
above 70 is considered an overbought region, while below 30 is deemed oversold. RSI
helps investors identify potential reversal points in the market, making it an important
tool for deciding when to buy or sell.
– STOCH (Stochastic Oscillator): A momentum indicator that compares a closing
price to its price range over a given time period. It includes two lines: the fast line (%K)
and the slow line (%D), commonly used to identify overbought or oversold conditions
and potential price reversals. This is very useful for determining entry and exit points.
– MACD (Moving Average Convergence Divergence): An oscillator based on
moving averages that tracks trends. It operates by calculating the difference between
two exponential moving averages (EMAs). The MACD Line, Signal Line, and
Histogram provide information about price momentum and trend changes. MACD is a
key tool for traders to identify trend directions and durations.
– CCI (Commodity Channel Index): Measures how far the typical price deviates from its moving average, relative to the mean deviation. It is a versatile indicator used to identify new trends or warn of
extreme conditions. A CCI above +100 is considered overbought, and below -100 is
considered oversold.
– ATR (Average True Range): Measures market volatility. It doesn’t indicate price
trends but shows the degree of market volatility. ATR is an extremely useful tool in
setting stop-loss and take-profit levels, as it helps traders understand how asset volatility
changes.
– ROC (Rate of Change): Measures the percentage change in price to identify
market momentum. Positive values indicate rising momentum, while negative values
indicate falling momentum. ROC is a useful tool for assessing whether the market is overheating or cooling off.
– UO (Ultimate Oscillator): Combines short-term, medium-term, and long-term
market trends to form a composite momentum indicator. This synthetic approach is
designed to reduce the false signals common with typical oscillators, providing more
stable market entry and exit signals.
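A minimal sketch of how these indicators can be generated with TA-Lib, assuming an OHLCV DataFrame such as the one built earlier; the look-back periods shown are TA-Lib defaults rather than values reported in the paper, and BP (Balance of Power) and TR (True Range), which appear later in the state space, are included for completeness.

```python
import talib

def add_indicators(df):
    """Append the indicators described above to an OHLCV DataFrame (column names are assumptions)."""
    o, h, l, c = (df[k].to_numpy(float) for k in ("open", "high", "low", "close"))
    df["RSI"] = talib.RSI(c, timeperiod=14)
    df["slowk"], df["slowd"] = talib.STOCH(h, l, c)                      # %K and %D lines
    df["MACD"], df["MACD_signal"], df["MACD_hist"] = talib.MACD(c, fastperiod=12, slowperiod=26, signalperiod=9)
    df["CCI"] = talib.CCI(h, l, c, timeperiod=14)
    df["ATR"] = talib.ATR(h, l, c, timeperiod=14)
    df["ROC"] = talib.ROC(c, timeperiod=10)
    df["BP"] = talib.BOP(o, h, l, c)                                     # Balance of Power
    df["TR"] = talib.TRANGE(h, l, c)                                     # True Range
    df["UO"] = talib.ULTOSC(h, l, c, timeperiod1=7, timeperiod2=14, timeperiod3=28)
    return df.dropna()

features = add_indicators(eth.copy())
```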
Choosing these indicators is crucial for analyzing market dynamics from different
perspectives and time frames, essential for developing quantitative trading strategies
that can adapt to volatile market conditions. The combination of these indicators
provides a comprehensive market perspective for the model, helping to identify and
leverage complex market patterns. Through such technical analysis, algorithms can
enhance understanding of market behavior, optimize trading decisions, and thus
improve investment returns.
Figure 4-1-2: Historical Price Data Including Derived Indicators
3. Feature Selection
When building predictive models, selecting the right features is crucial for model
performance. Too many irrelevant features can not only reduce the model’s operational
efficiency but also introduce noise, thereby decreasing accuracy. Feature selection helps
to simplify the model, accelerate the training process, and improve the model’s
generalization ability. Recursive Feature Elimination (RFE) is used to enhance the model's predictive performance and explanatory power: a Random Forest-based RFE repeatedly fits the model, discards the least important features, and retains the most important ones as the final data inputs for the model.
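The sketch below shows one way to set up this Random Forest-based RFE step, under the assumptions that the prediction target is the next day's closing price and that ten features are retained; neither choice is specified in the paper.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

X = features.iloc[:-1]                                # all candidate features on day t
y = features["close"].shift(-1).iloc[:-1]             # next-day closing price as the target

selector = RFE(estimator=RandomForestRegressor(n_estimators=200, random_state=0),
               n_features_to_select=10, step=1)
selector.fit(X, y)
selected_columns = X.columns[selector.support_]       # features kept as model inputs
```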
4. Creating Rolling Windows
The rolling window data structure is particularly important for time series prediction
and financial market analysis. Rolling windows provide a rich historical data context
for model training, helping to capture time dependencies and dynamic changes.
Specifically, the method considers not only single-moment data but also combines data
over several days to predict future market trends. This approach mimics how investors
evaluate markets based on past performance, aiding the model in better understanding
and predicting future price movements. In this way, the model learns to predict the next day's price movement from the past 5 days of market behavior. Time series analysis of Ethereum prices using the ACF and PACF shows that prices lagged by 1 to 3 days have the most significant influence on future prices, so the rolling window is generally kept to no more than 5 days to keep the model from treating noise as a driver of price.
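A short sketch of the rolling-window construction, using the 5-day window discussed above and the feature matrix assumed in the earlier sketches:

```python
import numpy as np

def make_windows(values: np.ndarray, window: int = 5) -> np.ndarray:
    """Stack consecutive daily rows into overlapping windows.

    values : array of shape (n_days, n_features)
    return : array of shape (n_days - window + 1, window, n_features),
             where each sample holds the most recent `window` days.
    """
    return np.stack([values[i:i + window] for i in range(len(values) - window + 1)])

windows = make_windows(features[selected_columns].to_numpy(), window=5)
```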
5. Cross-Validation and Data Plotting
In handling the Ethereum trading data, the dataset is divided into training and test sets so that model performance can be verified on data the model has never seen. This hold-out evaluation assesses the model in a controlled setting and helps ensure that its results are reliable and representative. A conventional chronological split is used, with 80% of the data serving as the training set; this portion is used to train the model, teaching it to recognize market trends and price movements. The remaining 20% serves as the test set, used to evaluate the model's performance on previously unseen data.
To display this division more visually and show how the training and test sets are distributed across the time series, Figure 4-1-5 plots the closing prices in both datasets. Through this visualization, investors can clearly see how the data are partitioned and observe the difference between the data used for model training (training set) and the data used to validate model performance (test set). The training set (blue line) shows the price data used during the model training phase, where the model learns and establishes its prediction patterns. The test set (red line) shows the price data used to test and validate the model's predictive capability; these data are new to the model and were not used during the training phase.
Figure 4-1-5: Training and Test Set Split Based on Closing Prices
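A sketch of the chronological 80/20 split and the plot behind Figure 4-1-5, reusing the eth DataFrame assumed in the earlier sketches:

```python
import matplotlib.pyplot as plt

split = int(len(eth) * 0.8)          # no shuffling: the test set stays strictly after the training set
train, test = eth.iloc[:split], eth.iloc[split:]

plt.plot(train.index, train["close"], color="blue", label="training set")
plt.plot(test.index, test["close"], color="red", label="test set")
plt.ylabel("ETH closing price (USD)")
plt.legend()
plt.show()
```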
(II) Model Construction
1. Trading Setup
This paper establishes a trading environment specifically for the cryptocurrency market,
utilizing a reinforcement learning framework to simulate trading behavior. This
environment assumes that the trader’s (i.e., the Agent’s) actions do not influence market
prices or the behavior of other investors, thereby studying the Agent’s performance in
isolation. The trading setup starts with an initial capital of 3,000 USD, and each opening
trade incurs a transaction fee rate of 0.0005. This setup aims to simulate the costs
involved in real trading, ensuring the practical feasibility and efficiency of the trading
strategy. The trading environment includes various parameters such as maximum
account balance, maximum net worth, and maximum holding amount, all set to
2,147,000,000. This hypothetical limit value is used to test the performance of the
strategy without specific numerical restrictions.
2. State Space
In this paper, the state space in the environment is represented by a fixed-length rolling
window of K-line charts. Each K-line contains 18 features: Close, Open, High, Low,
Vol (Volume), Change, RSI (Relative Strength Index), slowk (Stochastic
Oscillator %K), slowd (Stochastic Oscillator %D), MACD (Moving Average
Convergence Divergence), MACD_signal (MACD Signal Line), MACD_hist (MACD
Histogram), CCI (Commodity Channel Index), ATR (Average True Range), ROC (Rate
of Change), BP (Balance of Power), TR (True Range), and UO (Ultimate Oscillator). A
rolling window mechanism, with a window size of 30, is introduced to capture the most
recent states of the time series data, assisting the Agent in making more accurate trading
decisions. Each state’s dimension is [10,23].
3. Action Space
The action space allows the agent to make decisions within the trading environment,
based on the current market state and predicted market changes. The action space
consists of two main elements: `action_type` and `action_percentage`. These two
parameters jointly determine the specific trading behavior executed at each time step.
`action_type` determines the type of trade, designed as a continuous value. Depending
on its range, it can be interpreted as the following types of trading operations:
Figure 4-2-3: Action Type
`action_percentage` defines the proportion of the transaction relative to the current
holdings, affecting the number of shares bought or sold. For instance, in a buying
operation, this parameter determines the proportion of funds used to purchase new
shares; in a selling operation, it decides the proportion of shares to be sold.
The logic for action execution is divided into buying, selling, and holding. If the
decision is to buy, a specified percentage of the available balance is used to purchase
shares. The number of shares bought is determined by the current market price, and the
balance and holdings are adjusted accordingly. If the decision is to hold, the agent
maintains the current state and does not engage in any transactions. Selling decisions
consider stop-loss and take-profit scenarios. If the current price results in a loss
exceeding a certain percentage (e.g., 15%) or profits reach a preset threshold (e.g., 30%),
a full sale is triggered. Otherwise, the proportion of the holdings to be sold is determined
based on `action_percentage`.
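Continuing the environment skeleton, the sketch below illustrates this execution logic; the thresholds that map `action_type` to buy, sell, or hold are assumptions standing in for Figure 4-2-3, the 15% stop-loss and 30% take-profit follow the text, and attributes such as `balance`, `shares_held`, and `avg_entry_price` are assumed to be initialized in `reset`.

```python
def execute_action(self, action_type: float, action_percentage: float, price: float):
    """Illustrative buy / sell / hold logic for CryptoTradingEnv."""
    if action_type > 1 / 3:                                   # buy with part of the available balance
        spend = self.balance * action_percentage
        bought = spend * (1 - self.fee_rate) / price          # opening fee charged on the buy
        self.balance -= spend
        self.shares_held += bought
    elif action_type < -1 / 3 and self.shares_held > 0:       # sell part or all of the position
        pnl = (price - self.avg_entry_price) / self.avg_entry_price
        if pnl <= -0.15 or pnl >= 0.30:                       # stop-loss / take-profit: close everything
            sell_fraction = 1.0
        else:
            sell_fraction = action_percentage
        sold = self.shares_held * sell_fraction
        self.balance += sold * price
        self.shares_held -= sold
    # otherwise: hold, no transaction
    self.net_worth = self.balance + self.shares_held * price
```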
4. Agent Construction
In the construction of the presented agent, the `step` function is crucial for navigating
the trading environment. Each action taken by the agent is followed by an update that
advances the step in the dataset. This function meticulously calculates the immediate
returns and volatility of the last 100 data points in the net asset history to assess the
agent’s short-term financial performance. It calculates key risk-adjusted metrics, such
as the Sharpe ratio and Sortino ratio, to fully understand the risk conditions associated
with the agent's actions. The function also tracks the cumulative return to account for long-term growth. This reward mechanism incorporates traditional financial metrics such as the Sortino and Sharpe ratios and adjusts for loss aversion by imposing harsher penalties on negative returns, aligning the agent's actions with risk-sensitive trading strategies. The iteration cycle concludes by checking termination
conditions; if met, it signals the end of a phase, returning the latest observations,
computed rewards, and termination state. This sophisticated agent framework is
designed to optimize trading decisions based on financial returns and risk
considerations, enhancing the agent’s performance under dynamic market conditions.
Figure 4-2-4.1: Relative Ratio Calculation
Figure 4-2-4.2: Reward Function
Figure 4-2-4.3: Risk Ratio Calculation
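A hedged sketch of the reward computation inside `step`: the equal weighting of the Sharpe, Sortino, and cumulative-return terms and the loss-aversion factor of 2 are assumptions, since the exact formulas appear only in Figures 4-2-4.1 to 4-2-4.3.

```python
import numpy as np

def compute_reward(net_worth_history, risk_free: float = 0.0) -> float:
    """Risk-adjusted reward over the last 100 net-worth points (weights are illustrative)."""
    recent = np.asarray(net_worth_history[-100:], dtype=float)
    if recent.size < 3:
        return 0.0
    returns = np.diff(recent) / recent[:-1]
    mean_r = returns.mean()
    vol = returns.std() + 1e-9
    downside = returns[returns < 0]
    downside_vol = downside.std() + 1e-9 if downside.size else 1e-9
    sharpe = (mean_r - risk_free) / vol
    sortino = (mean_r - risk_free) / downside_vol
    cumulative = recent[-1] / recent[0] - 1.0            # long-term growth term
    reward = sharpe + sortino + cumulative
    if mean_r < 0:                                       # loss aversion: penalise negative returns harder
        reward -= 2.0 * abs(mean_r)
    return reward
```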
(III) Model Training
1. Memory Class Definition
This paper defines a stock trading agent based on deep reinforcement learning
(StockAgent) that uses the PPO algorithm to make decisions in the cryptocurrency
trading market. Upon initialization, this agent class receives an algorithm object and
sets the computing device to GPU (if available) to leverage its high computational
power for accelerating the decision-making process. The agent’s main methods include
predicting (predict), sampling (sample), and learning (learn), each tailored to operate
with the observations from the trading environment and handle potential exceptions.
The agent is built using the PyTorch framework, which supports flexible tensor
operations and automatic gradient calculation, making the implementation of complex
neural network models more straightforward. In the predict method, the agent converts
the input observations into tensors and predicts the next action; in case of anomalies, a
preset default action is returned. The sample method is similar but used for generating
possible action samples for environment exploration. The learn method adjusts the
strategy based on trading results (including current observations, actions, rewards, next
state, and whether it has ended) to optimize future trading performance.
Moreover, the embedded StockModel class within the agent defines its decision
framework, including action generation (Actor model) and value assessment (Critic
model). These models use deep neural networks to estimate the best trading actions and
predict the value of future states, helping the agent make more accurate trading
decisions in complex and variable market environments.
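A compact sketch of the kind of Actor-Critic network the embedded StockModel class describes; the layer sizes and the Gaussian policy head are assumptions, and `obs_dim` is the flattened window (window length × number of features).

```python
import torch
import torch.nn as nn

class StockModel(nn.Module):
    """Illustrative Actor (policy) and Critic (value) heads sharing a small backbone."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(obs_dim, hidden), nn.Tanh(),
                                      nn.Linear(hidden, hidden), nn.Tanh())
        self.actor_mean = nn.Linear(hidden, act_dim)       # mean of a Gaussian policy
        self.actor_logstd = nn.Parameter(torch.zeros(act_dim))
        self.critic = nn.Linear(hidden, 1)                 # state-value estimate

    def forward(self, obs: torch.Tensor):
        z = self.backbone(obs)
        dist = torch.distributions.Normal(self.actor_mean(z), self.actor_logstd.exp())
        return dist, self.critic(z)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = StockModel(obs_dim=30 * 18, act_dim=2).to(device)
```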
2. Training Loop
The research phase set a total of 1000 training cycles (episodes), systematically
evaluating and optimizing the agent’s trading strategies through these cycles. In each
cycle, the environment is reset to start a new sequence of trades, the agent selects
actions using the PPO algorithm, and learns from the immediate feedback (rewards).
After each action, the environment state is updated until the cycle ends. At the end of
each cycle, the total rewards and net value changes of the agent are recorded, helping
to monitor model performance and adjust strategies. Every 100 cycles, the system
outputs current net value, return rates, and volatility, providing data support for further
analysis. After training, visualizing the net value changes of each cycle helps to deeply
understand the model’s long-term performance stability.
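A sketch of this training loop, assuming the `CryptoTradingEnv` skeleton above and an `agent` exposing the `sample` and `learn` methods described earlier:

```python
n_episodes = 1000
episode_net_worths = []

for episode in range(n_episodes):
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = agent.sample(obs)                         # exploratory action from the PPO policy
        next_obs, reward, done, info = env.step(action)
        agent.learn(obs, action, reward, next_obs, done)   # update the policy from immediate feedback
        obs = next_obs
        total_reward += reward
    episode_net_worths.append(env.net_worth)
    if (episode + 1) % 100 == 0:                           # periodic progress report
        print(f"episode {episode + 1}: net worth {env.net_worth:.2f}, total reward {total_reward:.2f}")
```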
3. Testing Loop
The testing phase uses data independent of the training set to evaluate the agent’s
generalization capability. In the testing loop, steps similar to those in training are
repeated, but without further model training. Over 3000 tests, total rewards, net value
changes, and volatility are recorded to assess the agent’s performance in unseen market
conditions. Additionally, the average return rates and volatility across all tests, as well
as the return-to-volatility ratio (Re_Vola_ratio), an important metric for measuring risk-adjusted returns, are calculated. After every 100 tests, the system reports key financial
metrics and saves the states of the best-performing models for potential reuse or further
analysis.
4. Algorithm Weight Storage
Throughout the training and testing processes, managing algorithm weights is key to
model optimization and application. Once a better return-to-volatility ratio is
discovered, the corresponding model weights are saved to a specified file. This ensures
that the model is updated when better performances are observed, facilitating the
repeatability of experiments and subsequent model replication. Additionally, the system
provides an option for users to decide whether to load pre-trained optimal model
weights before starting experiments, offering convenience for rapid model deployment
and performance evaluation.
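A minimal sketch of this weight-management step, assuming PyTorch model weights and the metric names used above; the file name is illustrative.

```python
import torch

re_vola_ratio = mean_return / (mean_volatility + 1e-9)        # return-to-volatility ratio of this run
if re_vola_ratio > best_re_vola_ratio:
    best_re_vola_ratio = re_vola_ratio
    torch.save(model.state_dict(), "best_ppo_weights.pt")     # keep only the best-performing weights

# Optional: reload the saved optimum before a new experiment
# model.load_state_dict(torch.load("best_ppo_weights.pt"))
```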
5. Results Visualization and Analysis
This study, through 1000 training experiments with a deep reinforcement learning
model, aims to explore the model’s performance in a cryptocurrency trading
environment. The cumulative return rates of the model across different training rounds
showed significant volatility, ranging from a low of 945.40% to a high of 2013.57%.
The volatility of these return rates (measured by the standard deviation of returns) was
relatively stable at about 0.0485, indicating consistent uncertainty in achieving returns.
Analyzing the total reward values of each training round, it is evident that the rewards
obtained in different rounds vary significantly, from a low of 245.41 to a high of 308.63,
suggesting that the strategies adopted in some training rounds were more effective at
capitalizing on market opportunities to enhance overall net value. During the training
experiments, the model demonstrated its capability to pursue high returns but also came
with corresponding risks. Despite this, the model was able to maintain positive return
rates in the vast majority of cases, a positive sign of its effectiveness.
Figure 4-3-5.1 Statistics of every 100 training times
Figure 4-3-5.2: Final Net Value Chart from 1000 Training Sessions
Figure 4-3-5.3: Net Value Changes during 3000 Out-of-Sample Tests. This chart
shows the net value changes over a testing period of 400 days during which the model’s
average final net value reached approximately 5500, with an average return rate of
97.47%. Additionally, the model’s daily average volatility during the testing period was
0.03. Under the assumption of independent and identically distributed returns, the adjusted Sharpe ratio was
1.928. In comparison, the Sharpe ratio for holding Ethereum directly was 1.5213.
Figure 4-3-5.5: Net Value Trajectories for a Random Selection of 10 Test Batches.
These trajectories follow almost the same trend: there are slight losses between days 150 and 240, but after day 240 they all show high returns.
Considering that the testing interval does not include a bear market, the performance of
this strategy in a bear market remains to be examined.
Figure 4-3-5.6: Distribution of Net Values in Each Testing Cycle. The vast majority
of tests have final net values concentrated between 5,000 and 6,000, forming a clearly bell-shaped, approximately normal distribution. This distribution characteristic reflects that the model can
achieve robust returns in most cases. In the final net value distribution chart, all final
net values are above 3000, meaning that all tests ended without any losses recorded.
Figure 4-3-5.3 Statistics of every 100 tests
Figure 4-3-5.4 Final net value graph of 3000 tests
Figure 4-3-5.5: Return Curves for 10 Randomly Selected Tests out of 3000
Figure 4-3-5.6 Final net value distribution chart of 3000 tests
V. Future Plans for the Enrich Your Wealth Product
(I) DRL Trading Robot Based on Integrated Algorithms
1. Incorporating Multiple DRL Algorithms
In addition to the PPO algorithm, incorporating algorithms such as A2C, SAC,
DDPG, and TD3 is planned, recognizing the extreme instability and unpredictability of
financial market information. Building a single deep neural network model often cannot
maintain effectiveness over extended trading periods. By constructing an integrated
model, it will be possible to autonomously select the base model used in actual trading
depending on different market conditions, thereby maximizing adaptation to market
trends.
2. Increasing Diversified Investment Portfolios
Adding more cryptocurrencies and diversifying investments by sectors and market
capitalization will spread investment risks.
(II) Expanding into More Trading Markets
1. Stock Market
The application of Deep Reinforcement Learning (DRL) in the stock market
primarily leverages its ability to model complex non-linear relationships using deep
neural networks, learning optimal trading strategies through interaction with market
environments. The advantage of DRL models lies in their flexibility and adaptability,
allowing them to adjust trading strategies based on real-time market data to promptly
respond to market changes, thus maximizing investment returns and reducing risks.
However, DRL models also face challenges such as overfitting and sample selection
bias, which need comprehensive consideration and resolution in practical applications.
2. Forex Market
The advantage of DRL in the forex market is its effectiveness in handling high-frequency fluctuations and variability. DRL models can capture complex non-linear
relationships in the forex market, such as correlations between currency pairs, market
sentiments, and fundamental factors, and can build dynamic environmental models to
simulate market evolution.
3. Futures Market
DRL models can grasp the complex non-linear relationships in the futures market,
including the correlations between futures price fluctuations, fundamental factors, and
technical indicators. Their real-time decision-making capabilities allow them to make
timely trading decisions under rapidly fluctuating market conditions, such as using
high-frequency trading strategies to respond to rapid market changes. DRL models can
also continuously interact with market environments and learn optimal trading
strategies to maximize profits and reduce risks, for example, using policy gradient
methods in reinforcement learning to dynamically adjust and optimize strategies.
Overall, the application of DRL in the futures market has broad development prospects,
providing investors with more intelligent and efficient trading decision support.
VI. Conclusion
This paper first emphasizes the potential application of Deep Reinforcement
Learning (DRL) technology in the field of quantitative trading. The trading strategies
constructed in this study, under specific model weights and solely trading the
cryptocurrency Ethereum, have shown impressive performance, demonstrating the
strong profit-making and risk-control capabilities of the trading agent built using the
PPO algorithm in complex and dynamic market environments. Through 1000 model
training sessions and 3000 testing experiments, the trading strategy achieved an average
yield of 97.47% and an average volatility of 0.03, with a yield volatility ratio of 1.928,
highlighting the effectiveness and practical value of the constructed trading strategy.
The experimental results show that the model maintained positive return rates in most
tests, demonstrating its capability to pursue high returns. However, as the yield
volatility ratio indicates, high returns are accompanied by corresponding risks. The
strategy overall showed excellent performance, but variations were significant in
specific operations, particularly observed in the total reward values across different
training cycles, indicating that the strategies adopted in certain cycles were more
effective at utilizing market opportunities. The application of the DRL model,
especially the PPO algorithm, showed advantages in solving high-dimensional
decision-making problems, limiting the extent of changes during policy updates to
ensure learning stability and effectiveness. Additionally, the paper proposes plans for
future research directions, including constructing DRL trading robots based on
integrated algorithms and exploring more trading markets such as stocks, forex, and
futures. These plans suggest the potential and trend for broader applications of DRL
technology in the financial field. Based on this, the paper recommends that future
research should focus on the performance of strategies in different market environments,
especially in bear markets, to verify the resilience of strategies in the face of market
fluctuations. Moreover, researchers are encouraged to explore more deep learning
models and algorithms to further enhance the accuracy and robustness of trading
strategies, providing investors with efficient, intelligent trading decision support.
Through continuous technological innovation and application expansion, the
development of quantitative trading robots in future financial markets will become
more diversified and intelligent.