Predicting Bitcoin Price Direction Using Machine Learning

A Data-Driven Approach to Forecasting Cryptocurrency Trends


Introduction

Bitcoin, since its inception in 2009, has evolved from a niche digital currency to a significant player in global financial markets. Its volatility and unique nature have attracted not only traders but also researchers who aim to predict its price movements. However, most of the research in this area has been conducted post-2018, leaving room for exploration and innovation. In this study, I ventured into the complex world of Bitcoin price prediction, employing various machine learning methods to determine the direction of its price movements. This post outlines my approach, the unique aspects of my study, and the results that set it apart from existing literature.

This study stands out in two respects: the data and variables used, and the comparative analysis of different machine learning methods. Unlike previous studies, which often relied on a limited set of variables, I incorporated a wide range of predictors, including macroeconomic and political factors, cryptocurrency-specific metrics, investor attention, day-of-the-week anomalies, and parallel market indicators. This comprehensive dataset, sourced from platforms such as Quandl, Wikipedia, Yahoo Finance, Investing.com, and CoinMarketCap, is detailed in Table 1 below.

Table 1. Input Variables and Their Sources

| Variable | Source |
| --- | --- |
| Daily economic policy uncertainty index (US) | EPU indices |
| Daily economic policy uncertainty index (UK) | EPU indices |
| Hash rate | Quandl |
| Difficulty | Quandl |
| Estimated transaction value | Quandl |
| Total transaction fees | Quandl |
| My Wallet number of transactions per day | Quandl |
| My Wallet transaction volume | Quandl |
| Average block size | Quandl |
| API blockchain size | Quandl |
| Cost per transaction | Quandl |
| Cost as % of transaction volume | Quandl |
| Total output volume | Quandl |
| Number of transactions per block | Quandl |
| Number of unique Bitcoin addresses used | Quandl |
| Number of transactions excluding popular addresses | Quandl |
| Total transaction fees (USD) | Quandl |
| Number of transactions | Quandl |
| Total Bitcoin | Quandl |
| Wikipedia trend | Wikipedia |
| Day | Calculated |
| Type of day (weekday/weekend) | Calculated |
| Lag 1 | Calculated |
| Lag 2 | Calculated |
| Bitcoin price | CoinMarketCap |
| Market capitalization | CoinMarketCap |
| S&P 500 | Yahoo Finance |
| VIX | Yahoo Finance |
| Gold price | Investing.com |

Methodology

Given the complexity of predicting Bitcoin's price direction, I employed several machine learning models, including Lasso, Ridge, Elastic Net, Random Forest, and Support Vector Machine (SVM). Each model was chosen for its ability to handle the high-dimensional dataset effectively. For instance, Lasso, Ridge, and Elastic Net are well-suited for models with many predictors, while Random Forest offers flexibility in capturing both additive and interaction effects. The data preparation and analysis were conducted using Python, with all variables (except binary ones) normalized to have a mean of zero and a variance of one. Additionally, the variables were shifted by one day to measure their impact on the next day's return.
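As a concrete illustration of this set-up, the sketch below standardizes the continuous predictors, builds the next-day direction target, and declares a few candidate models with scikit-learn. The DataFrame layout, column names, and hyper-parameters are assumptions for illustration, and the penalized linear models are shown as logistic-regression counterparts rather than the exact specifications used in the study.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Hypothetical layout: daily rows with the Table 1 predictors and a Bitcoin price column
df = pd.read_csv("bitcoin_dataset.csv", index_col="date", parse_dates=True)

# Target: direction of the NEXT day's return (1 = up, 0 = down).
# Shifting the target forward by one day is equivalent to lagging all predictors by one day.
ret = df["btc_price"].pct_change()
df["direction"] = (ret.shift(-1) > 0).astype(int)
df = df.iloc[1:-1]                      # drop the first row (no return) and the last (no next day)

binary_cols = ["weekend"]               # binary variables are left untouched
num_cols = [c for c in df.columns if c not in binary_cols + ["direction"]]

# Standardize the continuous predictors to mean zero and unit variance.
# (In a careful implementation the scaler is fit on the training window only.)
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# Candidate models: penalized linear classifiers, a tree ensemble, and an SVM
models = {
    "lasso_logit": LogisticRegression(penalty="l1", solver="liblinear"),
    "ridge_logit": LogisticRegression(penalty="l2", max_iter=1000),
    "elastic_net_logit": LogisticRegression(penalty="elasticnet", solver="saga",
                                            l1_ratio=0.5, max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=500, random_state=0),
    "svm": SVC(kernel="rbf", probability=True, random_state=0),
}
```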

Data Splitting Strategy

When working with time series data, it's crucial to maintain the temporal order of observations. According to Machine Learning Mastery, there are three primary methods for splitting time series data into training and test sets:

  1. Train-Test Split: This method respects the temporal order and is ideal when a large dataset is available.
  2. Multiple Train-Test Splits: This approach also respects temporal order and allows for multiple evaluations.
  3. Walk-Forward Validation: Here, the model is updated with each new time step.

Given that my dataset was sizable, I opted for the Train-Test Split method, using the first three years of data for training and the last year for testing. This approach aligns with the method used in Section 4.6.3 of An Introduction to Statistical Learning (ISLR) when analyzing stock market data.
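In code, this amounts to a chronological split with no shuffling; the boundary date below is a placeholder standing in for the end of the third year of data.

```python
# Chronological train-test split: the test year strictly follows the training period
split_date = "2017-12-31"          # assumed boundary: first three years vs. last year

train = df[df.index <= split_date]
test = df[df.index > split_date]

X_train, y_train = train.drop(columns="direction"), train["direction"]
X_test, y_test = test.drop(columns="direction"), test["direction"]

# The multiple train-test splits alternative is available via
# sklearn.model_selection.TimeSeriesSplit, which also preserves temporal order.
```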

Results and Analysis

Given the 28 predictors in the dataset, models capable of handling high-dimensional data, such as Lasso, Ridge, Elastic Net, and Random Forest, were considered strong candidates for this analysis. However, I also explored and compared the performance of other methods to ensure a comprehensive evaluation.

As described in the methodology, all variables except the binary ones were standardized to have a mean of zero and a standard deviation of one, and each predictor was lagged by one day so that it is used to predict the following day's return.
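Fitting each candidate on the training window and scoring it on the test year then produces the accuracy rates and confusion matrices reported below. A minimal sketch, reusing the hypothetical `models`, `X_train`, `y_train`, `X_test`, and `y_test` objects from the earlier snippets:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Transpose so that rows are predictions and columns are actual outcomes, as in Table 2
    cm = confusion_matrix(y_test, pred, labels=[0, 1]).T
    results[name] = {"accuracy": accuracy_score(y_test, pred), "confusion_matrix": cm}

# Rank the models by out-of-sample accuracy
for name, res in sorted(results.items(), key=lambda kv: kv[1]["accuracy"], reverse=True):
    print(f"{name:>20}  accuracy = {res['accuracy']:.4f}")
```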

Table 2 presents the confusion matrices and accuracy rates of the models, ranked by their accuracy.

Table 2. Performance of the models
| Model | Pred. Down / Actual Down | Pred. Down / Actual Up | Pred. Up / Actual Down | Pred. Up / Actual Up | Accuracy |
| --- | --- | --- | --- | --- | --- |
| Ridge | 19 | 9 | 135 | 203 | 0.6066 |
| Lasso | 14 | 6 | 140 | 206 | 0.6011 |
| Elastic Net | 13 | 5 | 141 | 207 | 0.6011 |
| OLS post-Lasso | 10 | 6 | 144 | 206 | 0.5902 |
| Logit post-Lasso | 14 | 12 | 140 | 200 | 0.5847 |
| OLS | 16 | 14 | 138 | 198 | 0.5847 |
| LDA | 44 | 43 | 110 | 169 | 0.5820 |
| Logit | 45 | 45 | 109 | 167 | 0.5792 |
| Ridge with cross-validation | 0 | 0 | 154 | 212 | 0.5792 |
| OLS (backward selection) | 51 | 55 | 103 | 157 | 0.5683 |
| QDA | 12 | 17 | 142 | 195 | 0.5656 |
| SVM | 25 | 32 | 129 | 180 | 0.5601 |
| GAM 2 (smoothing splines, 10 degrees of freedom) | 22 | 33 | 132 | 179 | 0.5418 |
| GAM 3 (logit with nonlinear terms, lasso-regularized) | 74 | 90 | 80 | 122 | 0.5350 |
| Random Forest | 49 | 71 | 105 | 141 | 0.5191 |
| Bagging | 95 | 139 | 59 | 73 | 0.4590 |
| GAM 1 (smoothing splines, 4 degrees of freedom) | 102 | 148 | 52 | 64 | 0.4536 |
| Tree | 154 | 212 | 0 | 0 | 0.4208 |

In each confusion matrix, rows give the predicted direction and columns the actual direction; the test sample contains 366 days, of which 154 were down days and 212 were up days.
[Figure: Comparison of model accuracies]

Table 2 highlights that the penalized linear models (Lasso, Ridge, Elastic Net) and their post-selection variants (OLS post-Lasso, Logit post-Lasso) perform best. Despite various adjustments to the Random Forest and SVM models, the simple linear specifications emerged as the top performers.

Although the best model, Ridge, achieves an accuracy rate of roughly 61%, this figure is only marginally better than the naïve benchmarks. Specifically, a naïve model predicting a down movement every day in the test sample achieves an accuracy of 42.08%, while one predicting an up movement every day achieves 57.92%. Consequently, the roughly 61% accuracy does not offer a significant advantage over the naïve model, a result consistent with the efficient market hypothesis: past information is already reflected in prices, which follow a martingale process.
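The naïve benchmarks can be read directly off the test labels. A short sketch, assuming the `y_test` series from the split above (in the study's test sample these accuracies came out to 42.08% and 57.92%):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Accuracy of the two naïve strategies: always predict "down" (0) and always predict "up" (1)
always_down = accuracy_score(y_test, np.zeros_like(y_test))
always_up = accuracy_score(y_test, np.ones_like(y_test))
print(f"always down: {always_down:.4f}  always up: {always_up:.4f}")
```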

Figure 1 illustrates the ROC curves for the top two models. The curves for the Ridge model (red) and the Lasso model (blue) both lie close to the 45-degree diagonal, confirming that even the best models barely outperform chance. The other models exhibit comparable ROC curves and, if included, would largely overlap with those shown.
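A figure of this kind can be drawn with scikit-learn and matplotlib, assuming the two penalized models expose a continuous score for the "up" class; here that score is the predicted probability from the logistic counterparts used in the sketches above.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

plt.figure(figsize=(6, 6))
for name, color in [("ridge_logit", "red"), ("lasso_logit", "blue")]:
    score = models[name].predict_proba(X_test)[:, 1]      # probability of an "up" day
    fpr, tpr, _ = roc_curve(y_test, score)
    plt.plot(fpr, tpr, color=color,
             label=f"{name} (AUC = {roc_auc_score(y_test, score):.2f})")

plt.plot([0, 1], [0, 1], linestyle="--", color="grey")    # 45-degree reference line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```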

Figure 1. ROC curve
