update
README.md
CHANGED
|
@@ -31,25 +31,36 @@ FlagEmbedding can map any text to a low-dimensional dense vector which can be us
|
|
| 31 |
And it also can be used in vector databases for LLMs.
|
| 32 |
|
| 33 |
************* 🌟**Updates**🌟 *************
|
| 34 |
-
-
|
| 35 |
-
- 09/
|
| 36 |
- **New reranker model**: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than the embedding models. We recommend using or fine-tuning them to re-rank the top-k documents returned by embedding models.
|
| 37 |
- **Updated embedding model**: release the `bge-*-v1.5` embedding models to alleviate the similarity distribution issue and enhance retrieval ability without instruction.
|
| 38 |
-
|
| 39 |
- 08/09/2023: BGE models are integrated into **Langchain**; you can use them like [this](#using-langchain). The C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
|
| 40 |
-
- 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
|
| 41 |
-
- 08/02/2023: Release `bge-large-*` (short for BAAI General Embedding) models, **rank 1st on MTEB and C-MTEB benchmark!** :tada: :tada:
|
| 42 |
-
- 08/01/2023: We release the [Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), consisting of 31 test datasets.
|
| 43 |
|
| 44 |
|
| 45 |
## Model List
|
| 46 |
|
| 47 |
`bge` is short for `BAAI general embedding`.
|
| 48 |
|
| 49 |
-
| Model | Language | | Description | query instruction for retrieval
|
| 50 |
|:-------------------------------|:--------:| :--------:| :--------:|:--------:|
|
| 51 |
-
| [BAAI/
|
| 52 |
-
| [BAAI/bge-reranker-
|
|
|
|
| 53 |
| [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
|
| 54 |
| [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
|
| 55 |
| [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
|
|
@@ -64,11 +75,15 @@ And it also can be used in vector databases for LLMs.
|
|
| 64 |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
|
| 65 |
|
| 66 |
|
| 67 |
-
|
| 68 |
|
| 69 |
-
|
| 70 |
For example, use a bge embedding model to retrieve the top 100 relevant documents, and then use a bge reranker to re-rank those 100 documents to get the final top-3 results.
|
| 71 |
|
| 72 |
## Frequently asked questions
|
| 73 |
|
| 74 |
<details>
|
|
@@ -105,7 +120,11 @@ please select an appropriate similarity threshold based on the similarity distri
|
|
| 105 |
<summary>3. When does the query instruction need to be used</summary>
|
| 106 |
|
| 107 |
<!-- ### When does the query instruction need to be used -->
|
| 108 |
-
|
| 109 |
For a retrieval task that uses short queries to find long related documents,
|
| 110 |
it is recommended to add instructions for these short queries.
|
| 111 |
**The best method to decide whether to add instructions for queries is choosing the setting that achieves better performance on your task.**
|
|
@@ -365,7 +384,7 @@ which is more accurate than embedding model (i.e., bi-encoder) but more time-con
|
|
| 365 |
Therefore, it can be used to re-rank the top-k documents returned by embedding model.
|
| 366 |
We train the cross-encoder on multilingual pair data.
|
| 367 |
The data format is the same as embedding model, so you can fine-tune it easily following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker).
|
| 368 |
-
More details
|
| 369 |
|
| 370 |
|
| 371 |
## Contact
|
|
@@ -375,7 +394,8 @@ You also can email Shitao Xiao(stxiao@baai.ac.cn) and Zheng Liu(liuzheng@baai.ac
|
|
| 375 |
|
| 376 |
## Citation
|
| 377 |
|
| 378 |
-
If you find
|
|
|
|
| 379 |
```
|
| 380 |
@misc{bge_embedding,
|
| 381 |
title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
|
|
@@ -388,4 +408,5 @@ If you find our work helpful, please cite us:
|
|
| 388 |
```
|
| 389 |
|
| 390 |
## License
|
| 391 |
-
FlagEmbedding is licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
|
|
|
|
|
|
| 31 |
And it also can be used in vector databases for LLMs.
|
| 32 |
|
| 33 |
************* 🌟**Updates**🌟 *************
|
| 34 |
+
- 10/12/2023: Release [LLM-Embedder](./FlagEmbedding/llm_embedder/README.md), a unified embedding model to support diverse retrieval augmentation needs for LLMs. [Paper](https://arxiv.org/pdf/2310.07554.pdf) :fire:
|
| 35 |
+
- 09/15/2023: The [technical report](https://arxiv.org/pdf/2309.07597.pdf) of BGE has been released
|
| 36 |
+
- 09/15/2023: The [massive training data](https://data.baai.ac.cn/details/BAAI-MTP) of BGE has been released
|
| 37 |
+
- 09/12/2023: New models:
|
| 38 |
- **New reranker model**: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than the embedding models. We recommend using or fine-tuning them to re-rank the top-k documents returned by embedding models.
|
| 39 |
- **Updated embedding model**: release the `bge-*-v1.5` embedding models to alleviate the similarity distribution issue and enhance retrieval ability without instruction.
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
<details>
|
| 43 |
+
<summary>More</summary>
|
| 44 |
+
<!-- ### More -->
|
| 45 |
+
|
| 46 |
+
- 09/07/2023: Update [fine-tune code](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md): Add script to mine hard negatives and support adding instruction during fine-tuning.
|
| 47 |
- 08/09/2023: BGE models are integrated into **Langchain**; you can use them like [this](#using-langchain). The C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
|
| 48 |
+
- 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
|
| 49 |
+
- 08/02/2023: Release `bge-large-*` (short for BAAI General Embedding) models, **rank 1st on MTEB and C-MTEB benchmark!** :tada: :tada:
|
| 50 |
+
- 08/01/2023: We release the [Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), consisting of 31 test datasets.
|
| 51 |
+
|
| 52 |
+
</details>
|
| 53 |
|
| 54 |
|
| 55 |
## Model List
|
| 56 |
|
| 57 |
`bge` is short for `BAAI general embedding`.
|
| 58 |
|
| 59 |
+
| Model | Language | | Description | query instruction for retrieval [1] |
|
| 60 |
|:-------------------------------|:--------:| :--------:| :--------:|:--------:|
|
| 61 |
+
| [BAAI/llm-embedder](https://huggingface.co/BAAI/llm-embedder) | English | [Inference](./FlagEmbedding/llm_embedder/README.md) [Fine-tune](./FlagEmbedding/llm_embedder/README.md) | a unified embedding model to support diverse retrieval augmentation needs for LLMs | See [README](./FlagEmbedding/llm_embedder/README.md) |
|
| 62 |
+
| [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) | Chinese and English | [Inference](#usage-for-reranker) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker) | a cross-encoder model which is more accurate but less efficient [2] | |
|
| 63 |
+
| [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) | Chinese and English | [Inference](#usage-for-reranker) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker) | a cross-encoder model which is more accurate but less efficient [2] | |
|
| 64 |
| [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
|
| 65 |
| [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
|
| 66 |
| [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
|
|
|
|
| 75 |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
|
| 76 |
|
| 77 |
|
| 78 |
+
[1\]: If you need to search for passages relevant to a query, we suggest adding the instruction to the query; in other cases, no instruction is needed and the original query can be used directly. In all cases, **no instruction** needs to be added to passages.
|
| 79 |
|
| 80 |
+
[2\]: Unlike an embedding model, a reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding. To balance accuracy and time cost, cross-encoders are widely used to re-rank the top-k documents retrieved by simpler models.
|
| 81 |
For example, use a bge embedding model to retrieve the top 100 relevant documents, and then use a bge reranker to re-rank those 100 documents to get the final top-3 results.
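
The two-stage flow described here (retrieve a candidate pool with a fast scorer, then re-rank only that pool with a slower one) can be sketched in plain Python. This is a minimal illustration, not the FlagEmbedding API: `retrieve_then_rerank`, `toy_embed_score`, and `toy_rerank_score` are hypothetical names, and the toy scorers are simple word-overlap stand-ins for an embedding model and a cross-encoder.

```python
from typing import Callable, List, Sequence, Tuple


def retrieve_then_rerank(
    query: str,
    docs: Sequence[str],
    embed_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    k_retrieve: int = 100,
    k_final: int = 3,
) -> List[Tuple[str, float]]:
    """Two-stage search: a cheap scorer narrows the pool, a costly scorer orders it."""
    # Stage 1: score every document with the fast scorer and keep the best k_retrieve.
    candidates = sorted(docs, key=lambda d: embed_score(query, d), reverse=True)[:k_retrieve]
    # Stage 2: score only the surviving candidates with the slow scorer.
    reranked = sorted(
        ((d, rerank_score(query, d)) for d in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked[:k_final]


# Hypothetical stand-ins, NOT the bge models:
# the fast scorer counts shared words, the slow scorer normalises by document length.
def toy_embed_score(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def toy_rerank_score(query: str, doc: str) -> float:
    return toy_embed_score(query, doc) / max(len(doc.split()), 1)


docs = [
    "pandas are bears native to china",
    "panda eats bamboo",
    "python is a programming language used for many tasks",
    "a panda is a bear",
]
top = retrieve_then_rerank(
    "what is a panda", docs, toy_embed_score, toy_rerank_score, k_retrieve=3, k_final=2
)
for doc, score in top:
    print(f"{score:.3f}  {doc}")
```

With real models, `embed_score` would typically be a dot product of precomputed embeddings and `rerank_score` a cross-encoder call on the (query, document) pair; the defaults `k_retrieve=100` and `k_final=3` mirror the numbers in the text.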
|
| 82 |
|
| 83 |
+
All models have been uploaded to Huggingface Hub, and you can see them at https://huggingface.co/BAAI.
|
| 84 |
+
If you cannot open the Huggingface Hub, you can also download the models from https://model.baai.ac.cn/models .
|
| 85 |
+
|
| 86 |
+
|
| 87 |
## Frequently asked questions
|
| 88 |
|
| 89 |
<details>
|
|
|
|
| 120 |
<summary>3. When does the query instruction need to be used</summary>
|
| 121 |
|
| 122 |
<!-- ### When does the query instruction need to be used -->
|
| 123 |
+
|
| 124 |
+
For `bge-*-v1.5`, we improved retrieval ability when no instruction is used.
|
| 125 |
+
Omitting the instruction causes only a slight degradation in retrieval performance compared with using it.
|
| 126 |
+
So, for convenience, you can generate embeddings without an instruction in all cases.
|
| 127 |
+
|
| 128 |
For a retrieval task that uses short queries to find long related documents,
|
| 129 |
it is recommended to add instructions for these short queries.
|
| 130 |
**The best method to decide whether to add instructions for queries is choosing the setting that achieves better performance on your task.**
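
As a rough sketch of the rule above, the instruction is prepended to queries only and never to passages. The helper names `prepare_queries` and `prepare_passages` are hypothetical and not part of FlagEmbedding; the instruction string is the one listed in the model table for the English v1.5 models.

```python
from typing import List

# Instruction string copied from the model table (English v1.5 models).
QUERY_INSTRUCTION_EN = "Represent this sentence for searching relevant passages: "


def prepare_queries(queries: List[str], instruction: str, use_instruction: bool) -> List[str]:
    """Prefix each query with the instruction when use_instruction is True."""
    if not use_instruction:
        return list(queries)
    return [instruction + q for q in queries]


def prepare_passages(passages: List[str]) -> List[str]:
    """Passages are returned unchanged: no instruction is added to them."""
    return list(passages)


queries = ["what is a panda"]
passages = ["The giant panda is a bear species endemic to China."]

with_instruction = prepare_queries(queries, QUERY_INSTRUCTION_EN, use_instruction=True)
without_instruction = prepare_queries(queries, QUERY_INSTRUCTION_EN, use_instruction=False)
encoded_passages = prepare_passages(passages)
```

Encoding both `with_instruction` and `without_instruction` and comparing retrieval quality on your own evaluation set is one way to choose between the two settings.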
|
|
|
|
| 384 |
Therefore, it can be used to re-rank the top-k documents returned by embedding model.
|
| 385 |
We train the cross-encoder on multilingual pair data.
|
| 386 |
The data format is the same as embedding model, so you can fine-tune it easily following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker).
|
| 387 |
+
For more details, please refer to [./FlagEmbedding/reranker/README.md](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker).
|
| 388 |
|
| 389 |
|
| 390 |
## Contact
|
|
|
|
| 394 |
|
| 395 |
## Citation
|
| 396 |
|
| 397 |
+
If you find this repository useful, please consider giving it a star :star: and citing it.
|
| 398 |
+
|
| 399 |
```
|
| 400 |
@misc{bge_embedding,
|
| 401 |
title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
|
|
|
|
| 408 |
```
|
| 409 |
|
| 410 |
## License
|
| 411 |
+
FlagEmbedding is licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
|
| 412 |
+
|