**By Mattia Brusamento**

**Summary**

In machine learning, a convolutional neural network (CNN, or ConvNet) is a class of neural networks that has been applied successfully to image recognition and analysis. In this project I tried to apply this class of models to stock market prediction, combining stock prices with sentiment analysis. The network was implemented in TensorFlow, starting from the online tutorial. In this article, I describe the following steps: dataset creation, CNN training, and evaluation of the model.

**Dataset**

This section briefly describes the procedure used to build the dataset. Stock and news data were retrieved using the Google Finance and Intrinio APIs, respectively.

**Stocks Data**

As mentioned above, stock data was retrieved from the Google Finance historical API ("https://finance.google.com/finance/historical?q={tick}&startdate={startdate}&output=csv", for each tick in the list).

The time unit is the day, and the value I kept is the Close price. For training purposes, missing days were filled using linear interpolation (pandas.DataFrame.interpolate).
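As a minimal sketch of this gap-filling step, the snippet below reindexes a short price series onto every calendar day and interpolates linearly. The prices and dates are hypothetical, and the reindexing step is an assumption about how the missing days were exposed before interpolation.

```python
import pandas as pd

# Hypothetical close prices for one tick; the weekend (Jan 7-8) is missing.
close = pd.Series(
    [100.0, 102.0, 108.0],
    index=pd.to_datetime(["2017-01-05", "2017-01-06", "2017-01-09"]),
    name="Close",
)

# Reindex onto every calendar day, which introduces NaN for the missing days.
all_days = pd.date_range(close.index.min(), close.index.max(), freq="D")
daily = close.reindex(all_days)

# Fill the gaps linearly between the surrounding known prices.
filled = daily.interpolate(method="linear")
print(filled)
```

With evenly spaced days, the two weekend values fall on the straight line between Friday's and Monday's close.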

**News data and Sentiment Analysis**

To retrieve news data, I used the Intrinio API. For each tick, I downloaded the related news from "https://api.intrinio.com/news.csv?ticker={tick}". The data is in CSV format with the following columns:

TICKER,FIGI_TICKER,FIGI,TITLE,PUBLICATION_DATE,URL,SUMMARY

Here is an example:

"AAAP,AAAP:UW,BBG007K5CV53,"3 Stocks to Watch on Thursday: Advanced Accelerator Application SA(ADR) (AAAP), Jabil Inc (JBL) and Medtronic Plc. (MDT)",2017-09-28 15:45:56 +0000,https://articlefeeds.nasdaq.com/~r/nasdaq/symbols/~3/ywZ6I5j5mIE/3-stocks-to-watch-on-thursday-advanced-accelerator-application-saadr-aaap-jabil-inc-jbl-and-medtronic-plc-mdt-cm852684,InvestorPlace Stock Market News Stock Advice amp Trading Tips Most major U S indices rose Wednesday with financial stocks leading the way popping 1 3 The 160 S amp P 500 Index gained 0 4 the 160 Dow Jones Industrial Average surged 0 3 and the 160".

News items were de-duplicated based on the title. Finally, only the TICKER, PUBLICATION_DATE and SUMMARY columns were kept.

Sentiment analysis was performed on the SUMMARY column using the Loughran and McDonald Financial Sentiment Dictionary, implemented in the pysentiment Python library.

This library offers both a tokenizer, which also performs stemming and stop-word removal, and a method to score a tokenized text. The value chosen from the get_score method as a proxy for the sentiment is the Polarity, computed as:

*(#Positives - #Negatives)/(#Positives + #Negatives)*

    import pysentiment as ps

    lm = ps.LM()
    df_news['SUMMARY_SCORES'] = df_news.SUMMARY.map(lambda x: lm.get_score(lm.tokenize(str(x))))
    df_news['POLARITY'] = df_news['SUMMARY_SCORES'].map(lambda x: x['Polarity'])
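To make the polarity formula concrete, here is a minimal sketch that computes it from raw word counts. The function name `polarity` and the toy word lists are hypothetical, and the small `eps` term is an assumption added to avoid dividing by zero when a text contains no sentiment words. It illustrates the formula only and is not the pysentiment implementation.

```python
def polarity(tokens, positive_words, negative_words, eps=1e-6):
    """Return (#pos - #neg) / (#pos + #neg), guarded against a zero denominator."""
    n_pos = sum(1 for t in tokens if t in positive_words)
    n_neg = sum(1 for t in tokens if t in negative_words)
    return (n_pos - n_neg) / (n_pos + n_neg + eps)


# Hypothetical toy word lists, standing in for the Loughran-McDonald dictionary.
POSITIVE = {"gain", "surge", "strong"}
NEGATIVE = {"loss", "decline", "weak"}

tokens = ["indices", "gain", "surge", "despite", "weak", "outlook"]
score = polarity(tokens, POSITIVE, NEGATIVE)  # 2 positives, 1 negative
print(round(score, 3))
```

The score lies between -1 (only negative words) and 1 (only positive words), and a text without sentiment words scores 0.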

Days with no news are filled with 0 for Polarity.

Finally, the data was grouped by tick and date, summing the Polarity scores for days on which a tick has more than one news item.
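The grouping step could look roughly like the sketch below. The rows are hypothetical, and the `DATE` helper column and the explicit timestamp format are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical scored news: two items for AAAP on the same day.
df_news = pd.DataFrame({
    "TICKER": ["AAAP", "AAAP", "JBL"],
    "PUBLICATION_DATE": [
        "2017-09-28 15:45:56 +0000",
        "2017-09-28 09:10:00 +0000",
        "2017-09-28 12:00:00 +0000",
    ],
    "POLARITY": [0.5, -1.0, 1.0],
})

# Reduce the timestamp to a calendar date so items from the same day share a key.
timestamps = pd.to_datetime(
    df_news["PUBLICATION_DATE"], format="%Y-%m-%d %H:%M:%S %z", utc=True
)
df_news["DATE"] = timestamps.dt.date

# Sum the polarity of all news items for each (tick, day) pair.
daily_polarity = (
    df_news.groupby(["TICKER", "DATE"], as_index=False)["POLARITY"].sum()
)
print(daily_polarity)
```

Summing rather than averaging means that a day with many news items of the same sign gets a larger absolute score, which is why values outside the [-1, 1] range can appear in the final dataset.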

**Full Dataset**

By merging stock and news data, we get the following dataset, covering every day from 2016-01-04 to 2017-09-30 for 154 ticks, with the close value of each stock and the corresponding polarity value:

Date | Tick | Close | Polarity |
---|---|---|---|
2017-09-26 | ALXN | 139.700000 | 2.333332 |
2017-09-27 | ALXN | 139.450000 | 3.599997 |
2017-09-28 | ALXN | 138.340000 | 1.000000 |
2017-09-29 | ALXN | 140.290000 | -0.999999 |
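A merge of this kind can be sketched as a left join on date and tick, followed by filling the days without news with 0. The tick `XYZ` and all values below are hypothetical, and the frame names `df_stocks` and `df_polarity` are assumptions.

```python
import pandas as pd

# Hypothetical daily closes for one tick.
df_stocks = pd.DataFrame({
    "Date": pd.to_datetime(["2017-09-26", "2017-09-27", "2017-09-28"]),
    "Tick": ["XYZ"] * 3,
    "Close": [50.0, 51.0, 49.5],
})

# Hypothetical daily polarity; there is no news on 2017-09-27.
df_polarity = pd.DataFrame({
    "Date": pd.to_datetime(["2017-09-26", "2017-09-28"]),
    "Tick": ["XYZ", "XYZ"],
    "Polarity": [1.5, -1.0],
})

# A left join keeps every stock day; days without news get NaN, then 0.
dataset = df_stocks.merge(df_polarity, on=["Date", "Tick"], how="left")
dataset["Polarity"] = dataset["Polarity"].fillna(0.0)
print(dataset)
```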

**CNN with TensorFlow**

To get started with convolutional neural networks in TensorFlow, I used the official tutorial as a reference. It shows how to use layers to build a convolutional neural network model that recognizes the handwritten digits in the MNIST data set. To make this work for our purpose, we need to adapt both our input data and the network.

**Data Model**

The input data is modelled so that a single feature element is a 154x100x2 tensor:

- 154 ticks
- 100 consecutive days
- 2 channels, one for the stock price and one for the polarity value

Labels are instead modelled as a vector of length 154, where each element is 1 if the corresponding stock rose on the next day, and 0 otherwise.

This creates a sliding time window of 100 days, so the first 100 days can't be used as labels. The training set contains 435 entries, while the evaluation set contains 100.
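The sliding-window construction described above can be sketched with NumPy as follows. The function name `make_windows` and the random data are hypothetical, and the exact alignment of window and label day is an assumption based on the description.

```python
import numpy as np


def make_windows(close, polarity, window=100):
    """Build (features, labels) from two arrays of shape (n_ticks, n_days)."""
    n_ticks, n_days = close.shape
    features, labels = [], []
    # Each sample uses days [start, start + window) and is labelled by the next day.
    for start in range(n_days - window):
        end = start + window
        # Stack price and polarity as the two channels: (n_ticks, window, 2).
        x = np.stack([close[:, start:end], polarity[:, start:end]], axis=-1)
        # Label is 1 where the close on the next day is above the last window day.
        y = (close[:, end] > close[:, end - 1]).astype(np.int32)
        features.append(x)
        labels.append(y)
    return np.array(features), np.array(labels)


# Hypothetical random data: 154 ticks over 130 days.
rng = np.random.default_rng(0)
close = rng.uniform(10, 200, size=(154, 130))
polarity = rng.normal(size=(154, 130))

X, y = make_windows(close, polarity)
print(X.shape, y.shape)  # (30, 154, 100, 2) (30, 154)
```

Each additional day in the series yields one more sample, which is why the number of usable entries is roughly the number of days minus the window length.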

**Convolutional Neural Network**

The CNN was built starting from the example in TensorFlow's tutorial and then adapted to this use case. The first two convolutional and pooling layers both have height equal to 1, so they perform convolutions and poolings on single stocks. The last convolutional layer has height equal to 154, to learn correlations between stocks. Finally, there are the dense layers, with the last one of length 154, one output for each stock.

The network was sized so that it could be trained in a couple of hours on this dataset using a laptop. Part of the code is reported here:

    import tensorflow as tf  # written against the TensorFlow 1.x tf.layers / tf.estimator API


    def cnn_model_fn(features, labels, mode):
        """Model function for CNN."""
        # Input Layer: [batch, 154 ticks, 100 days, 2 channels]
        input_layer = tf.reshape(tf.cast(features["x"], tf.float32), [-1, 154, 100, 2])

        # Convolutional Layer #1 (height 1: convolves along time, one stock at a time)
        conv1 = tf.layers.conv2d(
            inputs=input_layer,
            filters=32,
            kernel_size=[1, 5],
            padding="same",
            activation=tf.nn.relu)

        # Pooling Layer #1
        pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[1, 2], strides=[1, 2])

        # Convolutional Layer #2
        conv2 = tf.layers.conv2d(
            inputs=pool1,
            filters=8,
            kernel_size=[1, 5],
            padding="same",
            activation=tf.nn.relu)

        # Pooling Layer #2
        pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[1, 5], strides=[1, 5])

        # Convolutional Layer #3 (height 154: spans all stocks)
        conv3 = tf.layers.conv2d(
            inputs=pool2,
            filters=2,
            kernel_size=[154, 5],
            padding="same",
            activation=tf.nn.relu)

        # Pooling Layer #3
        pool3 = tf.layers.max_pooling2d(inputs=conv3, pool_size=[1, 2], strides=[1, 2])

        # Dense Layer
        pool3_flat = tf.reshape(pool3, [-1, 154 * 5 * 2])
        dense = tf.layers.dense(inputs=pool3_flat, units=512, activation=tf.nn.relu)
        dropout = tf.layers.dropout(
            inputs=dense, rate=0.4, training=mode == tf.estimator.ModeKeys.TRAIN)

        # Logits Layer: one output per stock
        logits = tf.layers.dense(inputs=dropout, units=154)

        predictions = {
            # Generate predictions (for PREDICT and EVAL mode)
            "classes": tf.argmax(input=logits, axis=1),
            "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
        }

        if mode == tf.estimator.ModeKeys.PREDICT:
            return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

        # Calculate Loss (for both TRAIN and EVAL modes)
        multiclass_labels = tf.reshape(tf.cast(labels, tf.int32), [-1, 154])
        loss = tf.losses.sigmoid_cross_entropy(
            multi_class_labels=multiclass_labels, logits=logits)

        # Configure the Training Op (for TRAIN mode)
        if mode == tf.estimator.ModeKeys.TRAIN:
            optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
            train_op = optimizer.minimize(
                loss=loss,
                global_step=tf.train.get_global_step())
            return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

        # EVAL mode: this branch is not part of the excerpt above; it reports
        # only the loss so that the function returns a spec in every mode.
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss)
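The layer sizes in the listing can be checked by tracing the tensor shape through the network. The sketch below does this in plain Python, without TensorFlow. The helper names `conv_same` and `max_pool` are hypothetical, and the pooling formula assumes the default "valid" padding of `max_pooling2d`.

```python
def conv_same(shape, filters):
    """'same' padding with stride 1 keeps height and width; only channels change."""
    height, width, _ = shape
    return (height, width, filters)


def max_pool(shape, pool, strides):
    """'valid' max pooling: output size = floor((size - pool) / stride) + 1."""
    height, width, channels = shape
    out_h = (height - pool[0]) // strides[0] + 1
    out_w = (width - pool[1]) // strides[1] + 1
    return (out_h, out_w, channels)


shape = (154, 100, 2)                      # ticks x days x channels
shape = conv_same(shape, filters=32)       # conv1 -> (154, 100, 32)
shape = max_pool(shape, (1, 2), (1, 2))    # pool1 -> (154, 50, 32)
shape = conv_same(shape, filters=8)        # conv2 -> (154, 50, 8)
shape = max_pool(shape, (1, 5), (1, 5))    # pool2 -> (154, 10, 8)
shape = conv_same(shape, filters=2)        # conv3 -> (154, 10, 2)
shape = max_pool(shape, (1, 2), (1, 2))    # pool3 -> (154, 5, 2)

flat_size = shape[0] * shape[1] * shape[2]
print(shape, flat_size)  # (154, 5, 2) 1540
```

The result matches the `154 * 5 * 2` used when flattening `pool3` before the dense layer.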

**Evaluation**

To evaluate the performance of the model, no standard metrics were used. Instead, I built a simulation closer to a practical use of the model.

Assume we start with an initial capital (*C*) equal to 1. For each day of the evaluation set, we divide the capital into *N* equal parts, where *N* goes from 1 to 154. We put *C/N* on each of the top *N* stocks that our model predicts with the highest probabilities, and 0 on the others. At this point we have a vector *A* that represents our daily allocation, and we can compute the daily gain/loss (*delta*) as *A* multiplied by the percentage variation of each stock for that day. We end up with a new capital, *C = C + delta*, that we can re-invest on the next day. At the end, we will have a capital greater or smaller than 1, depending on the quality of our choices.
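The simulation described above can be sketched in a few lines of NumPy. The function name `simulate` and the toy inputs are hypothetical. The sketch assumes that `returns` holds the fractional price change of each stock on the day the prediction refers to.

```python
import numpy as np


def simulate(probabilities, returns, n_top, capital=1.0):
    """Reinvest the capital daily on the n_top stocks ranked by probability.

    probabilities, returns: arrays of shape (n_days, n_stocks).
    """
    for probs, day_returns in zip(probabilities, returns):
        top = np.argsort(probs)[-n_top:]          # indices of the N highest scores
        allocation = np.zeros_like(day_returns)   # vector A
        allocation[top] = capital / n_top         # C/N on each chosen stock
        delta = float(allocation @ day_returns)   # daily gain or loss
        capital += delta                          # C = C + delta
    return capital


# Hypothetical toy data: 2 days, 3 stocks.
probabilities = np.array([[0.9, 0.2, 0.6], [0.1, 0.8, 0.7]])
returns = np.array([[0.10, -0.05, 0.00], [0.02, 0.04, -0.02]])

best_pick = simulate(probabilities, returns, n_top=1)   # follow the top stock only
all_stocks = simulate(probabilities, returns, n_top=3)  # equal split on everything
print(best_pick, all_stocks)
```

Setting `n_top` to the total number of stocks ignores the model's ranking entirely, which is what makes it usable as a market baseline.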

A good baseline for the model is *N=154*: this represents the overall performance of all the stocks and models the scenario in which we divide the capital equally among all of them. This produces a gain of around **4.27%**.

For evaluation purposes, the data was corrected by removing the days on which the market was closed.

The performance of the model, for different values of N, is reported in the picture below.

The red dotted line is the 0 baseline, while the orange line is the baseline with *N=154*.

The best performance is obtained with *N=12*, with a gain of around **8.41%**, almost twice the market baseline.

For almost every N greater than 10 we have a decent performance, better than the baseline, while values of N that are too small degrade the performance.

**Conclusion**

It has been very interesting to try TensorFlow and CNNs for the first time and to apply them to financial data.

This is a toy example, using a fairly small dataset and network, but it shows the potential of these models.

Please feel free to provide feedback and advice, or simply to get in touch with me on LinkedIn.

**Bio:** Mattia received his MS degree cum laude in Computer Engineering from Politecnico di Milano, after a period at TU Delft working on a thesis about Recommender Systems. Mattia is now working as a Data Scientist in the Cyber Security area for an Italian company.
