Concept information
Preferred term
attention distribution
Definition
- The distribution of attention weights across the input sequence in an attention-based model.
Broader concept
Example
- As a result the ideal attention distribution should put all of the probability mass on the antecedent noun phrase for reflexive anaphora or on the subject noun phrase for agreement and zero on the distractor noun phrases. (Lin, Tan & Frank, 2019)
- Further analysis indicates that WID can also learn the attention patterns from the teacher model without any alignment loss on attention distributions. (Wu, Hou, Lao, Li, Wong, Zhao & Yang, 2024)
- Recent research indicates that complementary attention distributions can lead to the same model prediction (Jain and Wallace 2019; Wiegreffe and Pinter 2019) and that the removal of input tokens with large attention weights often does not lead to a change in the model's prediction (Serrano and Smith 2019). (Hollenstein & Beinborn, 2021)
- Recent research in language processing finds that attention weights are not a good proxy for relative importance because different attention distributions can lead to the same predictions (Jain and Wallace 2019). (Hollenstein & Beinborn, 2021)
- The proposed method aims to unravel the attention distribution at each layer within a multi-layer model. (Jang, Byun & Shin, 2024)
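Per the definition above, an attention distribution is the result of normalizing query–key scores into a probability distribution over the input sequence, conventionally with a softmax. A minimal sketch (the function name and toy vectors are illustrative, not from any cited work):

```python
import math

def attention_distribution(query, keys):
    """Normalize query-key dot-product scores into an attention distribution."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]  # weights are non-negative and sum to 1

# One query attended over three key vectors of the input sequence
weights = attention_distribution([1.0, 0.0], [[2.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
```

Here the first key aligns best with the query, so it receives the largest share of the probability mass; the weights always sum to 1 regardless of the scores.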
In other languages
- French
URI
http://data.loterre.fr/ark:/67375/8LP-C01T3CNT-T