Vocabulary of natural language processing

Concept information

Preferred term

tokenization  

Definition

  • The task/process of recognizing and tagging tokens (words, punctuation marks, digits etc.) in a text. (Loterre)
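A minimal sketch of the definition above, assuming a simple regular-expression approach: each token is recognized and tagged with a coarse class. The class names (WORD, DIGIT, PUNCT) and the function name `tokenize` are illustrative assumptions, not part of the Loterre vocabulary.

```python
import re

# Hypothetical token classes, chosen for illustration only.
TOKEN_PATTERN = re.compile(
    r"(?P<DIGIT>\d+)"          # runs of digits
    r"|(?P<WORD>[^\W\d_]+)"    # runs of letters (Unicode-aware)
    r"|(?P<PUNCT>[^\w\s]|_)"   # any single punctuation mark or symbol
)

def tokenize(text):
    """Return (token, tag) pairs; whitespace is skipped."""
    return [(m.group(), m.lastgroup) for m in TOKEN_PATTERN.finditer(text)]

print(tokenize("Call 42 now!"))
```

Real tokenizers usually treat contractions, URLs, numbers with separators and similar cases with dedicated rules; this sketch only covers the three classes named in the definition.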

Broader concept

Synonym(s)

  • text segmentation
  • tokenisation

Definitional context(s)

  • Text segmentation aims to uncover latent structure by dividing text from a document into coherent sections. (Barrow, Jain, Morariu, Manjunatha, Oard & Resnik, 2020)

Example

  • Other steps during tokenization included proper handling of special text emoticons such as "o.O". (Chapman, Bernhard & Klakow, 2020)
  • To more thoroughly evaluate our tokenization we train multilingual T5 models using SentencePiece and CompoundPiece. (Minixhofer, Pfeiffer & Vulic, 2023)
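The emoticon example above can be sketched as follows, assuming a small hand-written emoticon list that is matched before the generic word and punctuation rules. The list `EMOTICONS` and the function `tokenize_with_emoticons` are hypothetical; this is not the method used in the cited paper.

```python
import re

# Hypothetical emoticon inventory; a real system would use a larger list.
EMOTICONS = ["o.O", ":-)", ":("]

# Longest emoticons first so that longer entries win over shorter ones.
EMOTICON_ALT = "|".join(
    re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True)
)

# Emoticons are tried first and must not touch a word character on either
# side; otherwise fall back to word runs and single punctuation marks.
PATTERN = re.compile(rf"(?<!\w)(?:{EMOTICON_ALT})(?!\w)|\w+|[^\w\s]")

def tokenize_with_emoticons(text):
    """Split text into tokens, keeping listed emoticons intact."""
    return PATTERN.findall(text)

print(tokenize_with_emoticons("wait o.O what?"))
```

Without the emoticon rule, "o.O" would be split into three tokens ("o", ".", "O"), which is the kind of case the example sentence refers to.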

In other languages

  • French: découpage de texte, segmentation de texte

URI

http://data.loterre.fr/ark:/67375/8LP-T7Q0JFBM-5

Last modified 5/27/24