Each word is first mapped to an embedding vector, and words with similar meanings end up with vectors close to each other. A positional encoding vector is then computed for each position in the sequence. Finally, the embedding and the positional encoding are added together element-wise, producing a single input vector per token that carries both semantic and positional information.
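As a concrete illustration, here is a minimal NumPy sketch of this step, assuming the sinusoidal positional-encoding scheme from the original Transformer paper; the embedding table below is random stand-in data rather than learned word vectors.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position vectors."""
    positions = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]        # (1, d_model)
    # Each dimension pair shares a frequency: 1 / 10000^(2i / d_model).
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])     # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])     # odd dimensions: cosine
    return encoding

# Toy setup: a random table stands in for a learned embedding matrix.
vocab_size, d_model = 100, 16
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([5, 42, 7])                    # a 3-token sentence
word_vectors = embedding_table[token_ids]           # semantic vectors, (3, d_model)
inputs = word_vectors + sinusoidal_positional_encoding(len(token_ids), d_model)
print(inputs.shape)                                 # (3, 16): one vector per token
```

Because the positional encoding is simply added to the embedding, the model sees one vector per token; the same word at different positions yields different input vectors, which is what lets later attention layers distinguish word order.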