basically, if you had to retrieve things similar to an entire document from embeddings, sending the whole document as the search query would work badly, since there’s unlikely to be anything similar to the whole thing

but there might be things which are closer to paragraphs in it

or certain lines

the solution to this is to chunk your source material. chunking is the process of breaking a source material down into smaller pieces. there are a lot of ways people implement chunking, and it’s worth experimenting since it’s heavily use-case dependent

a common way of chunking is to split by sentences
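
here’s a rough sketch of what sentence chunking could look like in python. it assumes a naive rule (a sentence ends at `.`, `!` or `?` followed by whitespace), so it will trip on things like abbreviations. the function names and the `sentences_per_chunk` parameter are made up for illustration, not taken from any library

```python
import re


def split_into_sentences(text: str) -> list[str]:
    """Split text into sentences using a naive punctuation rule (an assumption)."""
    # Split on whitespace that directly follows ., ! or ?
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings, which appear when the input is empty.
    return [part for part in parts if part]


def chunk_by_sentences(text: str, sentences_per_chunk: int = 1) -> list[str]:
    """Group consecutive sentences into chunks of a fixed sentence count."""
    if sentences_per_chunk < 1:
        raise ValueError("sentences_per_chunk must be at least 1")
    sentences = split_into_sentences(text)
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


if __name__ == "__main__":
    document = "Chunking splits text. It helps retrieval! Does it work? Yes."
    for chunk in chunk_by_sentences(document, sentences_per_chunk=2):
        print(chunk)
```

each chunk would then be embedded and searched on its own instead of the whole document. how many sentences go in a chunk is one of the things worth experimenting with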