Neural Network (GCNN) [7] first introduced the gating mechanism into CNNs for language modeling, which reduces the vanishing-gradient problem in deep architectures. GCNN [7] utilized half of the abstract features as gating weights to control the other half. However, since the weights and the abstract features are convolved at the same level, the information carried by the control weights is very homogeneous. In this paper, we also introduce a gating mechanism into CNNs, but the control weights, i.e., the attention weights, are generated by a variety of specialized convolution kernels. The contextual information of a particular context window is therefore integrated into the control weights.
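To make the contrast concrete, the GCNN-style gating can be sketched in a few lines of PyTorch; the class and argument names below are ours for illustration and do not come from [7]:

```python
import torch
import torch.nn as nn

class GLUConv(nn.Module):
    """GCNN-style gating [7]: one convolution yields 2*C channels;
    half serve as features, the sigmoid of the other half gates them."""
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, 2 * out_channels,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, x):                 # x: (batch, in_channels, seq_len)
        a, b = self.conv(x).chunk(2, dim=1)
        return a * torch.sigmoid(b)       # same-level features act as gates
```

Because the gate b is convolved from the same input at the same level as the features a, it carries no extra contextual information, which is precisely the limitation our attention weights address.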
Attention mechanisms attempt to mimic human perception, which selectively focuses on parts of a target area to obtain more detail about the target while suppressing other, useless information. Mnih et al. [33] first applied an attention mechanism in an RNN for image classification. Extensions of the attention-based RNN model have since been applied to various NLP tasks [2,28]. The attention mechanism has attracted much interest and has been applied in a variety of neural network architectures, including the encoder-decoder [55]. In these architectures, the process of focusing attention is mainly reflected in the calculation of the weight coefficients: the larger the weight, the more attention is focused on its corresponding value. That is, the weight represents the importance of a piece of information, and the value is the information itself. Recently, how to use attention mechanisms in CNNs has become a research hotspot [51].
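As a point of reference, the generic weight-value computation described above amounts to a softmax-weighted sum; the following minimal PyTorch sketch uses our own function name and tensor shapes, not those of any cited model:

```python
import torch
import torch.nn.functional as F

def attend(scores, values):
    """scores: (batch, seq_len); values: (batch, seq_len, dim).
    Larger weights focus more attention on the corresponding values."""
    weights = F.softmax(scores, dim=-1)                         # importance coefficients
    return torch.bmm(weights.unsqueeze(1), values).squeeze(1)   # weighted sum
```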
Activation functions have a crucial impact on a neural network's performance. Sigmoid [10], Rectified Linear Unit (ReLU) [38], Softplus [38], Leaky ReLU (LReLU) [34], Parametric ReLU (PReLU) [17], Exponential Linear Unit (ELU) [5], and Scaled Exponential Linear Unit (SELU) [26] are all well-known and widely used activation units. Activation functions make it possible to apply a non-linear transformation to the input in order to solve complex problems. However, they may also bring disadvantages, e.g., vanishing gradients and neuron death. It is therefore essential to choose an appropriate activation function for the neural network.
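For concreteness, all of the units listed above can be evaluated directly with PyTorch's standard API; this brief sketch is purely illustrative and not part of the proposed model:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, steps=7)       # sample inputs around zero
print(torch.sigmoid(x))                      # Sigmoid [10]
print(F.relu(x))                             # ReLU [38]
print(F.softplus(x))                         # Softplus [38]
print(F.leaky_relu(x, negative_slope=0.01))  # LReLU [34]
print(F.elu(x, alpha=1.0))                   # ELU [5]
print(F.selu(x))                             # SELU [26], fixed scale and alpha
print(torch.nn.PReLU()(x))                   # PReLU [17], learnable slope
```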
3. The proposed model
CNNs are well suited to natural language processing: a CNN not only allows precise control over the length of dependencies, but also enables nearby input elements to interact at lower layers while distant elements interact at higher layers, and it can produce hierarchical abstract representations of the input text by stacking multiple convolution layers. Most current CNN-based methods for sentence classification rely on the pooling layer to find the most significant features. In this paper, we construct an attention-gated layer before the pooling layer to identify critical features, suppress the impact of unimportant features, and help the pooling layer find the genuinely crucial features, as sketched below.
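The following is a minimal sketch of this attention-gated-before-pooling idea under our own assumptions (sigmoid gating, length-preserving padding, a single attention convolution); the module and parameter names are hypothetical, and the actual configuration follows Fig. 1 as described in the next paragraph:

```python
import torch
import torch.nn as nn

class AttentionGatedPooling(nn.Module):
    """Attention weights from a separate convolution gate the feature
    maps before max-over-time pooling (shapes and names illustrative)."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.attn_conv = nn.Conv1d(channels, channels,
                                   kernel_size, padding=kernel_size // 2)

    def forward(self, feats):                        # feats: (batch, C, seq_len)
        gate = torch.sigmoid(self.attn_conv(feats))  # contextual attention weights
        gated = feats * gate                         # suppress unimportant features
        return gated.max(dim=2).values               # max-over-time pooling
```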
In this section, we describe our model in detail. As depicted in Fig. 1, our model consists of a convolutional layer operating on the input sentence matrix, an attention-gated layer, a max-over-time pooling layer, and a fully connected layer with dropout and softmax output. As an example, we choose a sentence of length n = 7 and word vectors of dimensionality d = 4. In the demo model shown in Fig. 1, the first convolutional layer uses convolution kernels with a window size h of 2 or 3 words, and the convolution layer in the attention-gated layer uses convolution kernels with the