Computer Vision News - August 2022
CNN+LSTM Neural Networks

The transfer values saved after the VGG16 pass are read back from HDF5 and grouped into per-video sequences:

```python
import h5py
import numpy as np

frames_num = 20  # frames sampled per video

def process_alldata_test():  # function name inferred from the call below
    # Load all frame-level transfer values and labels at once.
    with h5py.File('pruebavalidation.h5', 'r') as f:
        X_batch = f['data'][:]
        y_batch = f['labels'][:]

    # Group the flat frame-level arrays into one entry per video.
    joint_transfer = []
    count = 0
    for i in range(int(len(X_batch) / frames_num)):
        inc = count + frames_num
        joint_transfer.append([X_batch[count:inc], y_batch[count]])
        count = inc

    data = []
    target = []
    for i in joint_transfer:
        data.append(i[0])
        target.append(np.array(i[1]))
    return data, target

# process_alldata_training() is analogous, reading the training file.
data, target = process_alldata_training()
data_test, target_test = process_alldata_test()
```

The basic building block in a Recurrent Neural Network (RNN) is a Recurrent Unit (RU). There are many variants of recurrent units, such as the LSTM (Long Short-Term Memory), which we will use in this tutorial, and the somewhat simpler GRU (Gated Recurrent Unit). Experiments in the literature suggest that the LSTM and GRU have roughly similar performance. Even simpler variants also exist, and the literature suggests they may perform even better than both the LSTM and GRU, but they are not implemented in Keras, which we will use in this tutorial.

A recurrent unit has an internal state that is updated every time the unit receives a new input. This internal state serves as a kind of memory. However, it is not a traditional kind of computer memory that stores bits that are either on or off. Instead, the recurrent unit stores floating-point values in its memory state, which are read and written using matrix operations, so all the operations are differentiable. This means the memory state can store arbitrary floating-point values (although typically limited to the range -1.0 to 1.0) and the network can be trained like a normal neural network using Gradient Descent.

Define LSTM architecture

When defining the LSTM architecture we have to take into account the dimensions of the transfer values. For each frame, the VGG16 network outputs a vector of 4096 transfer values. Since we process 20 frames per video, we have 20 x 4096 values per video.
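As a quick sanity check on these dimensions, the per-video lists returned by the loading helpers can be stacked into the (samples, timesteps, features) layout that Keras expects. The sketch below uses hypothetical stand-in arrays rather than the real transfer values:

```python
import numpy as np

# Hypothetical stand-ins: 3 videos, 20 frames each,
# 4096 VGG16 transfer values per frame.
n_videos, frames_num, n_features = 3, 20, 4096
data = [np.random.rand(frames_num, n_features) for _ in range(n_videos)]
target = [np.array(1.0) for _ in range(n_videos)]

# Stack into the (samples, timesteps, features) layout the LSTM expects.
X = np.stack(data)    # shape: (3, 20, 4096)
y = np.stack(target)  # shape: (3,)
print(X.shape, y.shape)
```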
The classification must take all 20 frames of the video into account: if violence is detected in any of them, the video is classified as violent. The first input dimension of the LSTM layer is the temporal dimension, in our case 20; the second is the size of the feature vector (the transfer values).

```python
chunk_size = 4096  # transfer values per frame
n_chunks = 20      # frames per video
rnn_size = 512     # LSTM units

model = Sequential()
model.add(LSTM(rnn_size, input_shape=(n_chunks, chunk_size)))
model.add(Dense(1024))
model.add(Activation('relu'))
```
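The listing above stops after the first dense layer. One plausible way to complete and compile the network for the binary violent / non-violent decision is sketched below; the single sigmoid output unit, the loss, and the optimizer are assumptions for illustration, since the rest of the architecture is not shown in this excerpt:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Activation

chunk_size = 4096  # transfer values per frame
n_chunks = 20      # frames per video
rnn_size = 512     # LSTM units

model = Sequential()
model.add(LSTM(rnn_size, input_shape=(n_chunks, chunk_size)))
model.add(Dense(1024))
model.add(Activation('relu'))
# Assumed completion (not shown in the excerpt): one sigmoid unit
# scoring the probability that the video is violent.
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```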