Note: This post was the basis of a talk I gave at the Milwaukee Machine Learning Meetup. The slides for it are here.

Text Classification with Deep Learning

For the past couple of months, I’ve been working through the fast.ai course on deep learning at course.fast.ai. I have experience with programming, working with data, and some math, but had never made the jump into deep learning (also known as neural networks). In this post, I’ll go through my process of classifying text using a deep learning framework.

Deep learning is not a shallow topic (pun intended), but I found neural networks more intuitive than I expected. There are increasingly more resources and tools in this space, and the academic papers are more approachable than in most fields. If this area interests you and you have some background in programming, I highly encourage you to take a look at what’s available. Deep learning can generate very good models when it comes to recognizing patterns; images and speech are areas where neural networks have seen a lot of success. For a good introduction to deep learning, Michael Nielsen has a great writeup here. As I worked through the course, I tried to find something I could build a model for. I chose text classification, as the format of the data is easier to work with than images or sound.

For this post, I have some data from Urban Milwaukee, a local organization that covers issues, events and people affecting Milwaukee’s most urban neighborhoods. I’m especially a fan of their “Data Wonk”. They have a large number of articles as well as a variety of writers, which makes for a good candidate dataset to learn on. The plan is to predict which author wrote an article based on its first paragraph. I chose the three authors who seemed to have the most articles: Jeramey Jannene, Dave Reid, and Bruce Murphy, and will be using everything they’ve written.

I have a CSV that contains the author, title, URL, and first paragraph of each article. The fast.ai course uses Python and Keras, so that’s what I’ll be using as well. Just to confirm things are as they should be, I’ll read in the CSV using pandas and check the first few rows.

import pandas as pd
data = pd.read_csv('data/urbanmilwaukee/data.csv')
print(data.head())
            author                                                url  
0     Bruce Murphy  http://urbanmilwaukee.com/2013/11/19/murphys-l...   
1  Jeramey Jannene  http://urbanmilwaukee.com/2014/08/04/photo-gal...   
2        Dave Reid  http://urbanmilwaukee.com/2009/10/04/community...   
3  Jeramey Jannene  http://urbanmilwaukee.com/2016/08/11/eyes-on-m...   
4        Dave Reid  http://urbanmilwaukee.com/2011/03/01/msoe-stud...   
                                               title  
0                New Probe is Big Trouble for Walker   
1                          2014 Wisconsin State Fair   
2  Community & Economic Development Committee Mee...   
3                   The Couture is Finally Happening   
4  MSOE Students Present Design Concepts for Lake...   
                                                  p1  
0  Back in March, retired Appeals Court Judge Nea...  
1  With a little something for everyone, the Wisc...  
2  The Community and Economic Development Committ...  
3  A 44-story apartment tower planned for a prime...  
4  The original plans for Lakeshore State Park ca...  

Thankfully, everything looks to be correct. It would be good to get an idea of the size and distribution of the attribute we’re trying to predict as well.

print(len(data))
print(data.groupby('author').size())
print(max(data.groupby('author').size())/len(data)*100)
2575
author
Bruce Murphy        612
Dave Reid          1172
Jeramey Jannene     791
dtype: int64
45.5145631068

There are 2575 rows, with 45% being from one author (Dave Reid). There is not much to benchmark against, but the simplest case is to just guess the most common category. Minimally, this means the goal is to do better than 45%. I suspect the final model will be able to do much better than that, but there’s no way to know without trying!

There’s a common idea that machine learning and neural networks require huge amounts of data, so 2575 samples may seem inadequate, but that idea comes with some nuance. More data is always better and provides more options, but simpler models don’t need anything approaching “big” data. For this model, 2575 samples is enough to work with, but care will have to be taken not to overfit.

Preprocessing Text and Building Training and Validation Sets

The data for the first paragraph is currently one long string per article, which is not ideal. The text needs to be in a consistent format that can easily be given as input to a model. Keras provides some utilities for text preprocessing. We’ll be using the Tokenizer, which turns a list of strings into a list of lists of integer identifiers, where each ID corresponds to a specific word. If we had the short sentence "This is great", it would be turned into something like [34, 20, 10], where 34 corresponds to the word this, 20 to is, and 10 to great.
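
To make that mapping concrete, here is a tiny, self-contained illustration on made-up sentences rather than the article data. The exact integer IDs depend on word frequencies in the corpus, so the values in the comments are only indicative.

from keras.preprocessing.text import Tokenizer
# a toy tokenizer fit on two short, made-up sentences
toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(['This is great', 'This is fine'])
print(toy_tokenizer.word_index)                             # e.g. {'this': 1, 'is': 2, 'great': 3, 'fine': 4}
print(toy_tokenizer.texts_to_sequences(['This is great']))  # e.g. [[1, 2, 3]]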

from keras.preprocessing.text import Tokenizer
MAX_NB_WORDS = 25000
# initialize Tokenizer
tokenizer = Tokenizer(nb_words=MAX_NB_WORDS)
# turn the unique words into corresponding integers
tokenizer.fit_on_texts(data.p1)
# turn the first paragraph string into sequence of numbers
sequences = tokenizer.texts_to_sequences(data.p1)
longest_sequence_length = max([len(x) for x in sequences])
print('Longest sequence is %s' % longest_sequence_length)
# word_index is a dictionary where the key is the token and the value is its ID
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
Longest sequence is 233
Found 12855 unique tokens.

Our eventual model will expect a fixed-length sequence of tokens for every input, so there’s a slight problem with the data as it stands: the paragraphs don’t all have the same number of tokens. Fortunately, Keras provides a straightforward way to solve this. We can specify a number of tokens per paragraph and have Keras fill any remaining positions with zeroes. Since the longest paragraph isn’t very long (233 tokens), we’ll pad every sequence to that length.

We’ll also have to put the labels for our data in a proper format, since author names as strings won’t work directly in the model. We have three labels, one per author, and they need to be one-hot encoded. In short, this means that instead of the output being a string like Bruce Murphy, Jeramey Jannene, or Dave Reid, it will be represented as a list of three elements where each position corresponds to one of the labels. Bruce Murphy would become [1, 0, 0], Jeramey Jannene [0, 1, 0], and Dave Reid [0, 0, 1]. The model will then output three numbers, each representing its predicted likelihood that the input belongs to that label. As an example, an output of [0.05, 0.15, 0.8] would mean the model is 80% certain the article was written by Dave Reid, with Jeramey Jannene next most likely at 15% and Bruce Murphy least likely at 5%.
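
Before applying these transforms to the real data below, here is a toy illustration of both steps, using made-up sequences and labels rather than the article data.

from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
# padding makes every sequence the same length (zeros are prepended by default)
print(pad_sequences([[34, 20, 10], [7, 3]], maxlen=5))
# [[ 0  0 34 20 10]
#  [ 0  0  0  7  3]]
# to_categorical turns integer labels into one-hot rows
print(to_categorical([0, 1, 2]))
# [[ 1.  0.  0.]
#  [ 0.  1.  0.]
#  [ 0.  0.  1.]]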

import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
SEQUENCE_LENGTH = 233
# pad our token sequences with zeroes
padded_data = pad_sequences(sequences, maxlen=SEQUENCE_LENGTH)
# build a dictionary of our labels and turn them into one-hot encoded arrays
labels_index = dict(((val, idx) for idx, val in enumerate(set(data.author))))
reverse_labels_index = dict(((idx, val) for idx, val in enumerate(set(data.author))))
labels = [labels_index[x] for x in data.author]
labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', padded_data.shape)
print('Shape of label tensor:', labels.shape)
print(labels[:4])
Shape of data tensor: (2575, 233)
Shape of label tensor: (2575, 3)
[[ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]]

With the data in the necessary structure, we should also split it into a training set and a validation set. The validation data is not trained on, and lets us see how well the model generalizes. Here the data will be split 70/30. K-fold cross-validation would likely be a better approach (a sketch of what that could look like follows the split below), but a simple holdout split will be sufficient.

VALIDATION_SPLIT = .3
# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
padded_data = padded_data[indices]
labels = labels[indices]
nb_validation_samples = int(VALIDATION_SPLIT * padded_data.shape[0])
print('Number of validation samples: ', nb_validation_samples)
x_train = padded_data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = padded_data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]
Number of validation samples:  772
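
As an aside, here is a rough sketch of the k-fold cross-validation mentioned above. It assumes scikit-learn is available, which isn’t used anywhere else in this post, and it only prints the fold sizes; each fold would train and evaluate its own model.

import numpy as np
from sklearn.model_selection import StratifiedKFold
# recover integer class labels from the one-hot rows for stratification
author_ids = np.argmax(labels, axis=1)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(padded_data, author_ids)):
    x_tr, y_tr = padded_data[train_idx], labels[train_idx]
    x_va, y_va = padded_data[val_idx], labels[val_idx]
    print('Fold %d: %d train / %d validation samples' % (fold, len(train_idx), len(val_idx)))
    # a fresh model would be built and fit on (x_tr, y_tr) here, validating on (x_va, y_va)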

With the data prepared, the model can start to be built. To turn the words in the paragraphs into values that can be adjusted and trained, a popular and effective method is word embeddings; Keras has good documentation and an example covering them. The idea is to express each word as a list of real-valued factors, usually somewhere in the tens or hundreds. GloVe and Word2vec are two popular projects with pre-trained vectors for a huge number of words. The code below is mostly adapted from that Keras example, and we’ll be using GloVe in our model.

import os
EMBEDDING_DIM = 100
# Turn the text file of words and their coefficient vectors into a dictionary
embeddings_index = {}
f = open(os.path.join('data/', 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Found %s word vectors.' % len(embeddings_index))
# Combine all of the unique words in our paragraphs with the pre-trained embedding
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    # words not found in the embedding index remain all-zeros
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
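
As a quick sanity check on the loaded vectors (an aside, not part of the original pipeline), related words should have noticeably higher cosine similarity than unrelated ones. The words below are assumed to be in the GloVe 6B vocabulary.

import numpy as np
def cosine_similarity(a, b):
    # cosine of the angle between two embedding vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine_similarity(embeddings_index['milwaukee'], embeddings_index['wisconsin']))
print(cosine_similarity(embeddings_index['milwaukee'], embeddings_index['banana']))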

Building Deep Learning Models

Keras models are built up as layers, and though more complex arrangements are possible, this model will use the Sequential API, where each layer feeds its output directly into the following layer. Since the model uses word embeddings to transform the input, the first layer is an embedding layer that replaces each word ID with its GloVe weights. With the input taken care of, I decided to start with a single hidden dense layer of 10 units to see how it performed. The model won’t have a huge capacity for learning, and may not pick up on more complex relationships.

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, Dropout
from keras.optimizers import Adam
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=SEQUENCE_LENGTH,
                            trainable=True))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(len(labels_index), activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(),
              metrics=['acc'])
model.fit(x_train, y_train, validation_data=(x_val, y_val), nb_epoch=10, batch_size=128)
Train on 1803 samples, validate on 772 samples
Epoch 1/10
1803/1803 [==============================] - 0s - loss: 1.0545 - acc: 0.4393 - val_loss: 1.0259 - val_acc: 0.4611
Epoch 2/10
1803/1803 [==============================] - 0s - loss: 0.9729 - acc: 0.4803 - val_loss: 0.9751 - val_acc: 0.4611
Epoch 3/10
1803/1803 [==============================] - 0s - loss: 0.8573 - acc: 0.5652 - val_loss: 0.9268 - val_acc: 0.4663
Epoch 4/10
1803/1803 [==============================] - 0s - loss: 0.7254 - acc: 0.7038 - val_loss: 0.8740 - val_acc: 0.5052
Epoch 5/10
1803/1803 [==============================] - 0s - loss: 0.5925 - acc: 0.8480 - val_loss: 0.8520 - val_acc: 0.5142
Epoch 6/10
1803/1803 [==============================] - 0s - loss: 0.4645 - acc: 0.9229 - val_loss: 0.8048 - val_acc: 0.5596
Epoch 7/10
1803/1803 [==============================] - 0s - loss: 0.3357 - acc: 0.9689 - val_loss: 0.7620 - val_acc: 0.6127
Epoch 8/10
1803/1803 [==============================] - 0s - loss: 0.2136 - acc: 0.9856 - val_loss: 0.6925 - val_acc: 0.6826
Epoch 9/10
1803/1803 [==============================] - 0s - loss: 0.1214 - acc: 0.9922 - val_loss: 0.6341 - val_acc: 0.7073
Epoch 10/10
1803/1803 [==============================] - 0s - loss: 0.0702 - acc: 0.9994 - val_loss: 0.6101 - val_acc: 0.7176

The results are actually pretty good. It’s classifying around 70% of the validation data correctly, which means we’re already above our admittedly very naive benchmark. Unfortunately, the training accuracy quickly climbs to nearly 100%, so the model is learning too much about the training data. I want to try adding another dense layer that could potentially allow for higher-order learning, though the overfitting problem will only get worse.
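
Before building the next model, a quick way to see where that overfitting capacity comes from is to print the per-layer parameter counts. This is a diagnostic aside rather than part of the original walkthrough: with 12,856 words and 100 embedding dimensions, the trainable embedding alone holds roughly 1.3 million weights, and the Flatten output of 233 × 100 values feeding a 10-unit dense layer adds about 233,000 more, which is far more parameters than we have articles.

# print per-layer output shapes and parameter counts for the model just trained
model.summary()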

model = Sequential()
model.add(Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=SEQUENCE_LENGTH,
                            trainable=True))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(len(labels_index), activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(),
              metrics=['acc'])
model.fit(x_train, y_train, validation_data=(x_val, y_val), nb_epoch=12, batch_size=128)
Train on 1803 samples, validate on 772 samples
Epoch 1/12
1803/1803 [==============================] - 0s - loss: 0.9887 - acc: 0.4881 - val_loss: 0.8920 - val_acc: 0.5298
Epoch 2/12
1803/1803 [==============================] - 0s - loss: 0.7181 - acc: 0.7033 - val_loss: 0.7955 - val_acc: 0.6567
Epoch 3/12
1803/1803 [==============================] - 0s - loss: 0.5389 - acc: 0.8458 - val_loss: 0.7490 - val_acc: 0.6438
Epoch 4/12
1803/1803 [==============================] - 0s - loss: 0.3842 - acc: 0.9063 - val_loss: 0.7219 - val_acc: 0.6839
Epoch 5/12
1803/1803 [==============================] - 0s - loss: 0.2611 - acc: 0.9523 - val_loss: 0.7002 - val_acc: 0.6995
Epoch 6/12
1803/1803 [==============================] - 0s - loss: 0.1834 - acc: 0.9695 - val_loss: 0.7059 - val_acc: 0.6969
Epoch 7/12
1803/1803 [==============================] - 0s - loss: 0.1324 - acc: 0.9850 - val_loss: 0.6989 - val_acc: 0.6995
Epoch 8/12
1803/1803 [==============================] - 0s - loss: 0.0925 - acc: 0.9884 - val_loss: 0.7282 - val_acc: 0.7008
Epoch 9/12
1803/1803 [==============================] - 0s - loss: 0.0627 - acc: 0.9933 - val_loss: 0.7414 - val_acc: 0.7111
Epoch 10/12
1803/1803 [==============================] - 0s - loss: 0.0449 - acc: 0.9961 - val_loss: 0.7486 - val_acc: 0.7060
Epoch 11/12
1803/1803 [==============================] - 0s - loss: 0.0365 - acc: 0.9983 - val_loss: 0.7597 - val_acc: 0.7163
Epoch 12/12
1803/1803 [==============================] - 0s - loss: 0.0246 - acc: 0.9994 - val_loss: 0.7714 - val_acc: 0.7163

Not surprisingly, the training loss drops even faster, with training accuracy again reaching nearly 100%. Validation performance was about the same, though. The overfitting did indeed get worse, meaning the model is learning details specific to the training set at the expense of features that might generalize better.

To deal with overfitting, there are a few options. Ideally, we could find more data, but outside of augmenting the data or waiting for more articles to be written, we’re probably out of luck. We could also simplify the architecture, as it’s possible to get similar performance with fewer units and fewer layers; this restricts the capacity the model has for memorizing. The other option is adding some form of regularization. The most common are L1 or L2 regularization and dropout. L1 and L2 regularization penalize weights as they grow larger, and dropout randomly drops a specified fraction of activations during training so that individual weights don’t become too specialized. Both are viable, but I’ve found dropout a bit easier to understand and apply effectively. Adding dropout and increasing the network size accordingly can usually yield some improvement. A good starting point for dropout is 20%-50%, so I’ll start there.
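
For reference, this is roughly what the L1/L2 alternative would look like with the W_regularizer argument used by the version of Keras this post is written against (newer Keras versions call it kernel_regularizer). It is only a sketch of the option not taken; the models below stick with dropout, and 0.01 is just a common starting value rather than something tuned for this data.

from keras.layers import Dense
from keras.regularizers import l2
# a dense layer whose weights are penalized as they grow large
regularized_layer = Dense(16, activation='relu', W_regularizer=l2(0.01))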

model = Sequential()
model.add(Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=SEQUENCE_LENGTH,
                            trainable=True))
model.add(Flatten())
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(len(labels_index), activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(),
              metrics=['acc'])
model.fit(x_train, y_train, validation_data=(x_val, y_val), nb_epoch=20, batch_size=128)
Train on 1803 samples, validate on 772 samples
Epoch 1/20
1803/1803 [==============================] - 0s - loss: 1.0893 - acc: 0.3871 - val_loss: 1.0008 - val_acc: 0.5091
Epoch 2/20
1803/1803 [==============================] - 0s - loss: 0.9542 - acc: 0.5103 - val_loss: 0.8988 - val_acc: 0.5078
Epoch 3/20
1803/1803 [==============================] - 0s - loss: 0.8546 - acc: 0.5191 - val_loss: 0.8636 - val_acc: 0.5298
Epoch 4/20
1803/1803 [==============================] - 0s - loss: 0.7598 - acc: 0.5912 - val_loss: 0.8679 - val_acc: 0.4974
Epoch 5/20
1803/1803 [==============================] - 0s - loss: 0.6657 - acc: 0.7049 - val_loss: 0.7912 - val_acc: 0.6736
Epoch 6/20
1803/1803 [==============================] - 0s - loss: 0.5687 - acc: 0.7743 - val_loss: 0.8000 - val_acc: 0.6852
Epoch 7/20
1803/1803 [==============================] - 0s - loss: 0.4894 - acc: 0.8053 - val_loss: 0.7942 - val_acc: 0.6813
Epoch 8/20
1803/1803 [==============================] - 0s - loss: 0.3833 - acc: 0.8530 - val_loss: 0.8333 - val_acc: 0.6891
Epoch 9/20
1803/1803 [==============================] - 0s - loss: 0.3188 - acc: 0.8763 - val_loss: 0.8112 - val_acc: 0.7008
Epoch 10/20
1803/1803 [==============================] - 0s - loss: 0.2691 - acc: 0.8968 - val_loss: 0.8721 - val_acc: 0.6969
Epoch 11/20
1803/1803 [==============================] - 0s - loss: 0.2316 - acc: 0.9101 - val_loss: 0.8996 - val_acc: 0.7008
Epoch 12/20
1803/1803 [==============================] - 0s - loss: 0.2110 - acc: 0.9179 - val_loss: 0.9526 - val_acc: 0.6969
Epoch 13/20
1803/1803 [==============================] - 0s - loss: 0.1992 - acc: 0.9174 - val_loss: 0.9197 - val_acc: 0.6982
Epoch 14/20
1803/1803 [==============================] - 0s - loss: 0.1900 - acc: 0.9240 - val_loss: 1.0568 - val_acc: 0.7124
Epoch 15/20
1803/1803 [==============================] - 0s - loss: 0.1503 - acc: 0.9379 - val_loss: 1.0368 - val_acc: 0.6995
Epoch 16/20
1803/1803 [==============================] - 0s - loss: 0.1630 - acc: 0.9312 - val_loss: 1.0080 - val_acc: 0.7163
Epoch 17/20
1803/1803 [==============================] - 0s - loss: 0.1506 - acc: 0.9318 - val_loss: 0.9633 - val_acc: 0.7163
Epoch 18/20
1803/1803 [==============================] - 0s - loss: 0.1183 - acc: 0.9529 - val_loss: 1.1271 - val_acc: 0.7176
Epoch 19/20
1803/1803 [==============================] - 0s - loss: 0.1205 - acc: 0.9551 - val_loss: 1.2435 - val_acc: 0.7047
Epoch 20/20
1803/1803 [==============================] - 0s - loss: 0.1215 - acc: 0.9545 - val_loss: 1.1012 - val_acc: 0.7150

Unfortunately, performance doesn’t really improve, though the training accuracy diverges from the validation accuracy a bit less, which is a good sign. I’m not sure how to meaningfully improve performance with this kind of architecture.

All of the past models had some capacity to recognize patterns, but for data with a spatial aspect to it, a specific kind of neural network has seen a lot more success at recognizing complex patterns across dimensions: the convolutional neural network. They’ve been studied and applied extensively, and aren’t overly complex. For a good introduction to convolutions, see here and here.

Convolutional Neural Networks

In Keras, building a convolutional model is pretty straightforward, as it’s just another layer that can be added. There are complexities around understanding and choosing the specifics, but Keras makes it easy to play around with different architectures and parameters and see what happens.

Like before, I’ll start with a very simple architecture, using only a single convolutional layer with 30 filters and a window length of 3.
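
As a quick aside on what that layer produces: with 'valid' (no) padding, each filter slides over every window of 3 consecutive word vectors, so a 233-step input yields 233 - 3 + 1 = 231 output positions, each with 30 filter activations. The arithmetic below is just an illustration, not part of the model code.

sequence_steps, window_length, n_filters = 233, 3, 30
# output positions for a 'valid' convolution, and filters per position
print((sequence_steps - window_length + 1, n_filters))  # (231, 30)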

model = Sequential()
model.add(Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=SEQUENCE_LENGTH,
                            trainable=True))
model.add(Conv1D(30, 3, activation='relu'))
model.add(Flatten())
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(len(labels_index), activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(),
              metrics=['acc'])
model.fit(x_train, y_train, validation_data=(x_val, y_val), nb_epoch=11, batch_size=128)
Train on 1803 samples, validate on 772 samples
Epoch 1/11
1803/1803 [==============================] - 0s - loss: 1.1061 - acc: 0.3483 - val_loss: 1.0921 - val_acc: 0.4896
Epoch 2/11
1803/1803 [==============================] - 0s - loss: 1.0887 - acc: 0.5042 - val_loss: 1.0827 - val_acc: 0.5298
Epoch 3/11
1803/1803 [==============================] - 0s - loss: 1.0682 - acc: 0.5319 - val_loss: 1.0388 - val_acc: 0.5855
Epoch 4/11
1803/1803 [==============================] - 0s - loss: 0.9973 - acc: 0.5602 - val_loss: 0.9529 - val_acc: 0.5298
Epoch 5/11
1803/1803 [==============================] - 0s - loss: 0.8742 - acc: 0.6195 - val_loss: 0.7846 - val_acc: 0.6567
Epoch 6/11
1803/1803 [==============================] - 0s - loss: 0.7182 - acc: 0.6927 - val_loss: 0.7245 - val_acc: 0.6541
Epoch 7/11
1803/1803 [==============================] - 0s - loss: 0.5979 - acc: 0.7510 - val_loss: 0.6540 - val_acc: 0.7163
Epoch 8/11
1803/1803 [==============================] - 0s - loss: 0.4760 - acc: 0.8109 - val_loss: 0.6489 - val_acc: 0.7098
Epoch 9/11
1803/1803 [==============================] - 0s - loss: 0.4212 - acc: 0.8386 - val_loss: 0.6960 - val_acc: 0.6904
Epoch 10/11
1803/1803 [==============================] - 0s - loss: 0.3516 - acc: 0.8791 - val_loss: 0.6561 - val_acc: 0.7370
Epoch 11/11
1803/1803 [==============================] - 0s - loss: 0.2701 - acc: 0.9096 - val_loss: 0.6590 - val_acc: 0.7409

By adding a single convolutional layer, the accuracy and loss of the network are significantly improved! Typically, there are multiple convolutional layers with some form of pooling afterwards. It’s also worth checking whether any papers have been published on problems similar to the one you’re trying to solve. There is a great writeup here of building a more complex convolutional network to classify sentences, and it references a paper that aimed to do the same. That architecture takes a pre-trained word embedding layer as input and feeds it into a single block containing several parallel convolutional layers. The paper classified into between 3 and 23 categories depending on the dataset, using window lengths of 3, 4, and 5 with 100 filters each. The Quid writeup classified into 2 categories, using window lengths of 1, 2, and 3 with 300 filters each. Interestingly, Quid’s dataset was only in the hundreds of samples.

I’ll borrow heavily from the model Quid used, but make some tweaks since our dataset is larger and uses paragraphs rather than single sentences. There are a lot more hyperparameters to play with here, so it’s important to be careful about how often and how much they are changed. This model will use Keras’ Functional API to build the convolutional layers and the Sequential API for the rest. The Sequential API was enough to create the last model’s single layer of 30 filters with a window length of 3, but it can’t express a layer with multiple window lengths; the Functional API makes that possible. The input feeds into a block that contains convolutional layers with window lengths 1 through 4. I drew a rough sketch of the two architectures, with some of the repeated parts removed since I can only draw so many arrows.

Previous CNN with 30 filters of window length 3


New CNN with 20 filters of each length 1, 2, 3, and 4

The code for the new network is as follows:

from keras.layers import Input, Convolution1D, MaxPooling1D, Merge
from keras.models import Model
graph_in = Input(shape=(SEQUENCE_LENGTH, EMBEDDING_DIM))
convs = []
for fsz in range(1, 5):
    conv = Convolution1D(nb_filter=20, filter_length=fsz,
                         border_mode='valid', activation='relu')(graph_in)
    conv = MaxPooling1D()(conv)
    flatten = Flatten()(conv)
    convs.append(flatten)
out = Merge(mode='concat')(convs)
graph = Model(input=graph_in, output=out)
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=SEQUENCE_LENGTH,
                            trainable=True))
model.add(graph)
model.add(Dense(60, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])
model.fit(x_train, y_train, validation_data=(x_val, y_val), nb_epoch=10, batch_size=128)
Train on 1803 samples, validate on 772 samples
Epoch 1/10
1803/1803 [==============================] - 0s - loss: 1.0223 - acc: 0.4753 - val_loss: 0.9180 - val_acc: 0.4715
Epoch 2/10
1803/1803 [==============================] - 0s - loss: 0.7751 - acc: 0.6689 - val_loss: 0.7195 - val_acc: 0.6697
Epoch 3/10
1803/1803 [==============================] - 0s - loss: 0.6146 - acc: 0.7415 - val_loss: 0.6876 - val_acc: 0.6684
Epoch 4/10
1803/1803 [==============================] - 0s - loss: 0.4746 - acc: 0.8186 - val_loss: 0.6363 - val_acc: 0.7202
Epoch 5/10
1803/1803 [==============================] - 0s - loss: 0.3511 - acc: 0.8652 - val_loss: 0.7761 - val_acc: 0.6567
Epoch 6/10
1803/1803 [==============================] - 0s - loss: 0.2914 - acc: 0.8907 - val_loss: 0.5967 - val_acc: 0.7565
Epoch 7/10
1803/1803 [==============================] - 0s - loss: 0.1999 - acc: 0.9456 - val_loss: 0.6193 - val_acc: 0.7552
Epoch 8/10
1803/1803 [==============================] - 0s - loss: 0.1399 - acc: 0.9678 - val_loss: 0.7198 - val_acc: 0.7254
Epoch 9/10
1803/1803 [==============================] - 0s - loss: 0.1073 - acc: 0.9773 - val_loss: 0.7006 - val_acc: 0.7409
Epoch 10/10
1803/1803 [==============================] - 0s - loss: 0.0703 - acc: 0.9889 - val_loss: 0.7101 - val_acc: 0.7604

Again, accuracy has improved, and we’re at around 75%-76%. This is about as complex a model as I’d feel comfortable with. It’s very possible that adjusting the learning rate in later epochs would yield further improvement. As mentioned previously, k-fold cross-validation might be a better fit for a dataset of this size, and we could also play with techniques like pseudo-labeling.
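
For what it’s worth, here is a sketch of one way to act on the learning-rate idea, using Keras’ LearningRateScheduler callback. The schedule values are guesses rather than anything tuned for this post, and you would normally rebuild the model before retraining with it.

from keras.callbacks import LearningRateScheduler
def lr_schedule(epoch):
    # keep Adam's default rate early on, then drop it for the later epochs
    return 0.001 if epoch < 5 else 0.0001
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          nb_epoch=10, batch_size=128,
          callbacks=[LearningRateScheduler(lr_schedule)])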

With that said, I’m pretty happy with the model, and am going to save the learned weights so they can be reused later without having to train again:

# model_path is the directory where saved weights are kept
model.save_weights(model_path+"conv_pseudo_1.hdf5")
# model.load_weights(model_path+"conv_pseudo_1.hdf5")

In the time it has taken to write this, the authors have published some new articles on Urban Milwaukee, which provides an opportunity to see how well the model does on data it has never seen. I wrote a function that takes a paragraph string and returns the classification probabilities as well as the label of the most likely class:

def predict_paragraph(string):
    sequence = tokenizer.texts_to_sequences([string])
    sequence = pad_sequences(sequence, maxlen=SEQUENCE_LENGTH)
    prediction = model.predict(sequence)
    return (prediction, reverse_labels_index[np.argmax(prediction)])

I have 9 new articles. Two are by Bruce Murphy, seven by Jeramey Jannene, and none by Dave Reid.

# First two are Bruce Murphy and the remainder are Jeramey Jannene.
for paragraph in np.concatenate((bruce_paragraphs, jeramey_paragraphs), axis=0):
    print(predict_paragraph(paragraph))
(array([[  9.9902e-01,   3.9079e-04,   5.8865e-04]], dtype=float32), 'Bruce Murphy')
(array([[ 0.688 ,  0.3103,  0.0016]], dtype=float32), 'Bruce Murphy')
(array([[  4.8264e-06,   1.4238e-03,   9.9857e-01]], dtype=float32), 'Jeramey Jannene')
(array([[  9.0372e-05,   3.5314e-02,   9.6460e-01]], dtype=float32), 'Jeramey Jannene')
(array([[ 0.0026,  0.3294,  0.668 ]], dtype=float32), 'Jeramey Jannene')
(array([[  2.9068e-04,   1.3361e-02,   9.8635e-01]], dtype=float32), 'Jeramey Jannene')
(array([[  5.7478e-04,   5.0256e-02,   9.4917e-01]], dtype=float32), 'Jeramey Jannene')
(array([[  8.2228e-07,   9.8508e-01,   1.4918e-02]], dtype=float32), 'Dave Reid')
(array([[  8.3666e-05,   1.9333e-02,   9.8058e-01]], dtype=float32), 'Jeramey Jannene')

The model got 8 of 9 correct! We still have the previously mentioned avenues to explore if we want to make further improvements, and could also try applying the model to things the authors have written outside of these articles (maybe tweets?). If you have any questions or comments, don’t hesitate to tweet me! If you think machine learning can be applied to your project, I’m also available to help, so please contact us!