mwptoolkit.module.Attention.multi_head_attention

class mwptoolkit.module.Attention.multi_head_attention.EPTMultiHeadAttention(**config)[source]

Bases: Module

Class for computing multi-head attention (follows the paper 'Attention Is All You Need')

This class computes attention over key-value pairs (K, V) with query Q, i.e. Attn(Q, K, V) = softmax(Q K^T / sqrt(H/N)) V.
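
For reference, the formula above corresponds to the following single-head scaled dot-product computation in plain torch (an illustrative sketch only, not this class's internals; the class additionally splits the hidden dimension H into N heads and applies learned projections and dropout):

    import torch

    def scaled_dot_product_attention(Q, K, V):
        # Q: [batch, S, d], K: [batch, T, d], V: [batch, T, d]
        scores = Q @ K.transpose(-2, -1) / (Q.size(-1) ** 0.5)  # [batch, S, T]
        weights = torch.softmax(scores, dim=-1)                 # normalize over key positions
        return weights @ V                                      # [batch, S, d]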

Initialize MultiHeadAttention class

Keyword Arguments
  • hidden_dim (int) – Vector dimension of hidden states (H). 768 by default

  • num_heads (int) – Number of attention heads (N). 12 by default

  • dropout_p (float) – Probability of dropout. 0 by default

forward(query: Tensor, key_value: Optional[Tensor] = None, key_ignorance_mask: Optional[Tensor] = None, attention_mask: Optional[Tensor] = None, return_weights: bool = False, **kwargs)[source]

Compute multi-head attention

Parameters
  • query (torch.Tensor) – FloatTensor representing the query matrix with shape [batch_size, query_sequence_length, hidden_size].

  • key_value (torch.Tensor) – FloatTensor representing the key/value matrix with shape [batch_size, key_sequence_length, hidden_size] or [1, key_sequence_length, hidden_size]. By default, this is None (the query matrix is used as the key matrix).

  • key_ignorance_mask (torch.Tensor) – BoolTensor representing the mask for ignoring column vectors in the key matrix, with shape [batch_size, key_sequence_length]. If the element at (b, t) is True, then all return elements at batch_size=b, key_sequence_length=t will be set to -Infinity. By default, this is None (there is no mask to apply).

  • attention_mask (torch.Tensor) – BoolTensor representing the attention mask for ignoring a key for each query item, with shape [query_sequence_length, key_sequence_length]. If the element at (s, t) is True, then all return elements at query_sequence_length=s, key_sequence_length=t will be set to -Infinity. By default, this is None (there is no mask to apply).

  • return_weights (bool) – Use True to return attention weights. By default, this is False.

Returns

If return_weights is True, return (Attention Output, Attention Weights). Otherwise, return only the Attention Output. Attention Output: shape [batch_size, query_sequence_length, hidden_size]. Attention Weights: shape [batch_size, query_sequence_length, key_sequence_length, head_nums].

Return type

Union[torch.FloatTensor, Tuple[torch.FloatTensor, torch.FloatTensor]]

training: bool
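
The following is a minimal usage sketch of EPTMultiHeadAttention based on the keyword arguments and forward signature documented above; the tensor contents and mask values are placeholders, and the shapes in the comments follow the Returns description:

    import torch
    from mwptoolkit.module.Attention.multi_head_attention import EPTMultiHeadAttention

    attn = EPTMultiHeadAttention(hidden_dim=768, num_heads=12, dropout_p=0.1)

    query = torch.randn(4, 10, 768)      # [batch_size, query_sequence_length, hidden_size]
    key_value = torch.randn(4, 20, 768)  # [batch_size, key_sequence_length, hidden_size]

    # True marks key positions to ignore for the whole batch item (e.g. padding).
    key_ignorance_mask = torch.zeros(4, 20, dtype=torch.bool)
    key_ignorance_mask[:, 15:] = True

    output, weights = attn(query, key_value=key_value,
                           key_ignorance_mask=key_ignorance_mask,
                           return_weights=True)
    # output:  [4, 10, 768]
    # weights: [4, 10, 20, 12]  (attention heads in the last dimension)
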
class mwptoolkit.module.Attention.multi_head_attention.EPTMultiHeadAttentionWeights(**config)[source]

Bases: Module

Class for computing multi-head attention weights (follows the paper 'Attention Is All You Need')

This class computes the scaled dot product between query Q and key K, i.e. Q K^T / sqrt(H/N).

Initialize MultiHeadAttentionWeights class

Keyword Arguments
  • hidden_dim (int) – Vector dimension of hidden states (H). 768 by default.

  • num_heads (int) – Number of attention heads (N). 12 by default.

forward(query: Tensor, key: Optional[Tensor] = None, key_ignorance_mask: Optional[Tensor] = None, attention_mask: Optional[Tensor] = None, head_at_last: bool = True) → Tensor[source]

Compute multi-head attention weights

Parameters
  • query (torch.Tensor) – FloatTensor representing the query matrix with shape [batch_size, query_sequence_length, hidden_size].

  • key (torch.Tensor) – FloatTensor representing the key matrix with shape [batch_size, key_sequence_length, hidden_size] or [1, key_sequence_length, hidden_size]. By default, this is None (the query matrix is used as the key matrix).

  • key_ignorance_mask (torch.Tensor) – BoolTensor representing the mask for ignoring column vectors in the key matrix, with shape [batch_size, key_sequence_length]. If the element at (b, t) is True, then all return elements at batch_size=b, key_sequence_length=t will be set to -Infinity. By default, this is None (there is no mask to apply).

  • attention_mask (torch.Tensor) – BoolTensor representing the attention mask for ignoring a key for each query item, with shape [query_sequence_length, key_sequence_length]. If the element at (s, t) is True, then all return elements at query_sequence_length=s, key_sequence_length=t will be set to -Infinity. By default, this is None (there is no mask to apply).

  • head_at_last (bool) – Use True to make the shape of the return value [batch_size, query_sequence_length, key_sequence_length, head_nums]. If False, this method returns [batch_size, head_nums, query_sequence_length, key_sequence_length]. By default, this is True.

Returns

FloatTensor of Multi-head Attention weights.

Return type

torch.FloatTensor

property hidden_dim: int

Returns

Vector dimension of hidden states (H)

Return type

int

property num_heads: int

Returns

Number of attention heads (N)

Return type

int

training: bool
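
A minimal sketch of the weights-only variant, using the documented keyword arguments of EPTMultiHeadAttentionWeights; the shapes in the comments follow the head_at_last description above:

    import torch
    from mwptoolkit.module.Attention.multi_head_attention import EPTMultiHeadAttentionWeights

    attn_weights = EPTMultiHeadAttentionWeights(hidden_dim=768, num_heads=12)

    query = torch.randn(2, 5, 768)  # [batch_size, query_sequence_length, hidden_size]
    key = torch.randn(2, 7, 768)    # [batch_size, key_sequence_length, hidden_size]

    w_head_last = attn_weights(query, key=key, head_at_last=True)    # [2, 5, 7, 12]
    w_head_first = attn_weights(query, key=key, head_at_last=False)  # [2, 12, 5, 7]
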
class mwptoolkit.module.Attention.multi_head_attention.MultiHeadAttention(embedding_size, num_heads, dropout_ratio=0.0)[source]

Bases: Module

Multi-head Attention is proposed in the following paper: Attention Is All You Need.

Initialize MultiHeadAttention class

Parameters
  • embedding_size (int) – Vector dimension of the input embeddings.

  • num_heads (int) – Number of attention heads.

  • dropout_ratio (float) – Probability of dropout. 0.0 by default.

forward(query, key, value, key_padding_mask=None, attn_mask=None)[source]

Multi-head attention

Parameters
  • query (torch.Tensor) – shape [batch_size, tgt_len, embedding_size].

  • key (torch.Tensor) – shape [batch_size, src_len, embedding_size].

  • value (torch.Tensor) – shape [batch_size, src_len, embedding_size].

  • key_padding_mask (torch.Tensor) – shape [batch_size, src_len].

  • attn_mask (torch.BoolTensor) – shape [batch_size, tgt_len, src_len].

Returns

A tuple (attn_repre, attn_weights), where attn_repre has shape [batch_size, tgt_len, embedding_size] and attn_weights has shape [batch_size, tgt_len, src_len].

Return type

tuple(torch.Tensor, torch.Tensor)

training: bool
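
A minimal usage sketch of MultiHeadAttention based on the constructor and forward signatures above; the tensors are placeholders, and the optional key_padding_mask ([batch_size, src_len]) and attn_mask ([batch_size, tgt_len, src_len]) are omitted here because their True/False convention is not spelled out in this reference:

    import torch
    from mwptoolkit.module.Attention.multi_head_attention import MultiHeadAttention

    attn = MultiHeadAttention(embedding_size=512, num_heads=8, dropout_ratio=0.1)

    query = torch.randn(4, 6, 512)  # [batch_size, tgt_len, embedding_size]
    key = torch.randn(4, 9, 512)    # [batch_size, src_len, embedding_size]
    value = torch.randn(4, 9, 512)  # [batch_size, src_len, embedding_size]

    attn_repre, attn_weights = attn(query, key, value)
    # attn_repre:   [4, 6, 512]
    # attn_weights: [4, 6, 9]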