we should combine the two parts. In standard word embed-
ding models, words are usually associated by leveraging
the sliding context window-based strategy [11]. For exam-
ple, in the Skip-gram model, the vector representation of
the central word is learned for predicting the other words in
a context window. Similarly, the CBOW model uses the
composition of the vectors of the surrounding words in a
context window to predict the central word. Hence, a rea-
sonable method for associating natural language words and
design patterns is to locate them in a context window.
To this end, an intuitive way is to regard design pattern
names appearing in natural language text as special “words”.
Concretely, given a document doc in the corpus C, for the
design patterns in doc.DPs, we detect all the occurrences of
design pattern names (including aliases) in doc.Tokens and
replace them with predefined tokens. These predefined
tokens serve as the “words” of design patterns and are mixed
with the natural language words, so that design patterns can be
handled together with natural language words by the sliding
context window-based strategies. However, this approach has a
major issue: design pattern names tend to appear infrequently in
the text. For instance, Fig. 2 presents a paragraph from a post
(#131766) on Stack Overflow. The whole paragraph describes
the Dependency Injection design pattern, but the design pattern
name appears only once, at the beginning of the paragraph.
When the sliding context window-based strategies are applied
to this paragraph, Dependency Injection is associated only with
the few words near the beginning, and the rest are ignored.
To resolve this issue, we redefine the concept of context
window by considering both natural language words and
design patterns. Under the new definition, the size of a context
window is no longer fixed, although a window-size parameter for
words is retained, as in the standard models. For clarity, we
denote this parameter by c.
There are two types of context windows:
Context Window for Word. For a word in a document, the
context window for this word contains other words around
the word with radius c and all the design patterns the document
describes. Formally, for a document doc in C, let
doc.Tokens(i) denote the ith word of the text and
doc.Tokens.len denote the length of the text. The Context
Window of doc.Tokens(i) is defined as
\[
\mathit{Context}^{\mathit{Word}}_{\mathit{doc}}\bigl(i,\ \mathit{doc.Tokens}(i)\bigr)
= \{\, \mathit{doc.Tokens}(j) \mid \max\{1,\ i-c\} \le j \le \min\{\mathit{doc.Tokens.len},\ i+c\},\ j \ne i \,\} \cup \mathit{doc.DPs}. \tag{1}
\]
Take the document in Table 1 as an example. Assuming c = 2,
the Context Window for the sixth word “interface” contains
the two words ahead of it (i.e., “facade” and
“provide”), the two words behind it (i.e., “create” and
“subsystem”), as well as the two design patterns mentioned
in the document (i.e., “[abstract-factory]” and “[facade]”).
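As a minimal sketch of Eq. (1), the function below builds the Context Window for Word using 1-based positions, as in the text. The function name `context_window_for_word` is hypothetical, and the token list is a shortened stand-in for the document of Table 1: only the words quoted in the text are taken from the source, while the third word ("pattern") and the overall length are assumptions.

```python
def context_window_for_word(tokens, dps, i, c):
    """Context window of the i-th word (1-based), following Eq. (1).

    Contains the words within radius c of position i (excluding position i
    itself) plus every design pattern the document describes.
    """
    n = len(tokens)
    lo = max(1, i - c)
    hi = min(n, i + c)
    words = [tokens[j - 1] for j in range(lo, hi + 1) if j != i]
    return words + list(dps)


# Shortened, partly assumed stand-in for the document in Table 1.
tokens = ["abstract", "factory", "pattern", "facade", "provide",
          "interface", "create", "subsystem", "class"]
dps = ["[abstract-factory]", "[facade]"]

window = context_window_for_word(tokens, dps, i=6, c=2)
print(window)
```

For the sixth word "interface" with c = 2, the window holds the two words on each side plus both design pattern tokens, matching the example in the text. Near the document boundaries the max/min clamps simply shorten the word part of the window.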
Context Window for Design Pattern. Given a design pattern
described by a document, the context window for the
design pattern consists of all the words in the text and the
other described design patterns. Formally, for a document
doc and a design pattern dp ∈ doc.DPs, the Context Window
of dp is
\[
\mathit{Context}^{\mathit{DP}}_{\mathit{doc}}(dp)
= \{\, \mathit{doc.Tokens}(j) \mid 1 \le j \le \mathit{doc.Tokens.len} \,\} \cup \bigl(\mathit{doc.DPs} \setminus \{dp\}\bigr). \tag{2}
\]
For example, in Table 1, the Context Window for the design
pattern “[abstract-factory]” contains all the words (i.e.,
“abstract”, “factory”, ..., “class”) and the other design pat-
tern “[facade]”.
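Eq. (2) can be sketched in the same way. The function name `context_window_for_design_pattern` is hypothetical, and the token list is again a shortened, partly assumed stand-in for the document of Table 1 rather than its actual content.

```python
def context_window_for_design_pattern(tokens, dps, dp):
    """Context window of design pattern dp, following Eq. (2).

    Contains every word of the document plus all the other design patterns
    the document describes.
    """
    if dp not in dps:
        raise ValueError(f"{dp!r} is not described by this document")
    return list(tokens) + [other for other in dps if other != dp]


# Shortened, partly assumed stand-in for the document in Table 1.
tokens = ["abstract", "factory", "pattern", "facade", "provide",
          "interface", "create", "subsystem", "class"]
dps = ["[abstract-factory]", "[facade]"]

window = context_window_for_design_pattern(tokens, dps, "[abstract-factory]")
print(window)
```

Unlike the window for a word, this window does not depend on c: every word of the document is included regardless of where (or whether) the design pattern name appears in the text.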
According to the definitions of the two context windows,
a design pattern is associated with every word in the
document that describes it, which strengthens the tie between
words and design patterns. To show the effectiveness of the
new definitions, we compare them in Section 5.3 against the
method that relies on occurrences of design pattern names
(described above).
With the definitions, for any document doc in C, the context
window of each word in doc.Tokens and the context
window of each design pattern in doc.DPs are constructed.
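The construction step for a whole corpus might look like the following sketch. The `Doc` container, the function `build_context_windows`, and the two toy documents are hypothetical; the sketch only enumerates the windows of both types and makes no claim about how the authors store or weight them afterwards.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical container mirroring doc.Tokens and doc.DPs."""
    tokens: list
    dps: list


def build_context_windows(doc, c):
    """Return (word_windows, dp_windows) for one document.

    Each entry is a (target, context) pair: word windows use radius c plus
    all design patterns; design pattern windows use all words plus the
    other design patterns.
    """
    n = len(doc.tokens)
    word_windows = []
    for i in range(1, n + 1):  # 1-based positions, as in the definitions
        lo, hi = max(1, i - c), min(n, i + c)
        neighbours = [doc.tokens[j - 1] for j in range(lo, hi + 1) if j != i]
        word_windows.append((doc.tokens[i - 1], neighbours + list(doc.dps)))

    dp_windows = []
    for dp in doc.dps:
        others = [d for d in doc.dps if d != dp]
        dp_windows.append((dp, list(doc.tokens) + others))
    return word_windows, dp_windows


# Toy corpus with made-up documents (assumption, for illustration only).
corpus = [
    Doc(["facade", "provide", "interface", "subsystem"], ["[facade]"]),
    Doc(["inject", "dependency", "object"], ["[dependency-injection]"]),
]
all_windows = [build_context_windows(doc, c=1) for doc in corpus]
print(all_windows[0][0][0])
```

Each document is processed independently, so the windows never cross document boundaries, and a document describing k design patterns contributes k design pattern windows in addition to one window per word.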
3.4 Vectors Training
Once the context windows are defined, the word and
design pattern vectors can be generated by any sliding
context window-based model. In DPWord2Vec, we choose
GloVe [13] for vector generation, for the following
reasons:
Fig. 2. A paragraph that describes the Dependency Injection design pat-
tern. The design pattern name is in red bold font and the words in the
context window (of size five) of the name are in blue italic font.
TABLE 1
An Example for Two Types of Context Windows (c = 2)
a. As declared above, the stop words are eliminated from the text of the document (in strikeout fonts) and the rest of the words are stemmed to their root forms.
1232 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 48, NO. 4, APRIL 2022